
An Efficient Direct Approach to Visual SLAM

The majority of visual simultaneous localization and mapping (SLAM) approaches consider feature correspondences as an input to the joint process of estimating the camera pose and the scene structure. In this paper, we propose a new approach for simultaneously obtaining the correspondences, the camera pose, the scene structure, and the illumination changes, all directly using image intensities as observations. Exploitation of all possible image information leads to more accurate estimates and avoids the inherent difficulties of reliably associating features. We also show here that, in this case, structural constraints can be enforced within the procedure as well (instead of a posteriori), namely the cheirality, the rigidity, and those related to the lighting variations. We formulate the visual SLAM problem as a nonlinear image alignment task. The proposed parameters to perform this task are optimally computed by an efficient second-order approximation method for fast processing and avoidance of irrelevant minima. Furthermore, a new solution to the visual SLAM initialization problem is described whereby no assumptions are made about either the scene or the camera motion. Experimental results are provided for a variety of scenes, including urban and outdoor ones, under general camera motion and different types of perturbations.



IN ORDER TO autonomously navigate in an unknown environment, a robot must be able to build a representation of the surrounding map and self-localize with respect to it. Even though it is possible to perform the latter without the former by computer vision using an appropriate tensor (e.g., the essential matrix), precision may be lost rapidly. This happens because important structural constraints, e.g., the scene rigidity, are not effectively exploited in the long run. Having understood that both estimation processes are intimately tied together, an appealing strategy is then to perform them simultaneously. This is generally referred to as simultaneous localization and mapping (SLAM) in the robotics community. This class of methods focuses on computationally tractable algorithms that incrementally (i.e., causally) integrate information. At the expense of usually accumulating drift earlier, they are suited to the real-time operation required by robotic platforms. A slightly different class of methods, mainly developed by the computer vision community, refers to structure from motion (SFM) techniques. Noncausal schemes fall into this latter class. These algorithms, mostly aimed at high levels of accuracy, are allowed to run in a time-consuming batch process. This paper focuses on the former class. The reader may refer to, e.g., [1] and [2] for some well-established SFM methods.

A. Related Work

The techniques that simultaneously and causally reconstruct the camera pose and the scene structure can be divided into two classes, which are briefly discussed next.

1) Feature-Based Methods to Visual SLAM

A standard scheme for visual SLAM consists of first extracting a sufficiently large set of features (e.g., points, lines), and robustly matching them between successive images. These corresponding features are the input to the joint process of estimating the camera pose and scene structure. The majority of visual SLAM methods fall into this class, e.g., [3], [4], [5], independently of the applied filtering technique, e.g., extended Kalman filter (EKF)-SLAM [6] and FastSLAM 2.0 [7]. This represents the discrete case. Another possibility consists of computing the needed correspondences in the form of optical flow (the velocity). This has been exploited in, e.g., [8] and [9]. In both cases, since the prior step of data association is error-prone, care must be taken to avoid propagating its errors to subsequent steps. On the other hand, these methods may handle large interframe displacements of the objects.

2) Direct Methods to Visual SLAM

In this class of methods, the intensity value of the pixels is directly exploited to obtain the required parameters. That is, there is no prior step of data association: this is simultaneously solved. An important strength of these methods concerns the level of accuracy that they can attain. This characteristic is mainly due to the exploitation of all possible image information, even from areas where gradient information is weak. The reader may refer to, e.g., [10] for a more profound discussion about this subject.

In this spirit, the technique proposed in [11] can be assigned to this class. However, it does not consider the strong coupling between motion and structure in its separate estimation processes from pixel intensities. Furthermore, it is sensitive to variable illumination. In that method, new information is initialized with a "best guess." The technique proposed in [12], though using a unified framework, relies on the linearity of the image gradient. This limits the system to working under very small interframe displacements of the objects. This approach is relatively robust to lighting variations, but its model of illumination changes is overparameterized (which may lead, for example, to convergence problems). New information is initialized in a separate filter, and is inserted into the main filter after a probation period. Also, in a unified framework, central catadioptric cameras are adequately dealt with in [13]. The latter uses the same approximation method we use in this paper for obtaining the related optimal parameters. Nevertheless, its set of parameters is different from ours not only because illumination changes are handled here, but also due to the structural constraints we explicitly enforce. Moreover, initialization is not a concern in that work.

B. Overview of the Method

We formulate the visual SLAM problem as a nonlinear image registration task. In other words, we consider visual SLAM as the problem of estimating the appropriate parameters that optimally align a reference image with successive frames of a video sequence. A subset of the proposed parameters is naturally the camera pose and scene structure. Since the result of direct image alignments is such that each pixel intensity is matched as closely as possible across images, the technique in fact also returns a dense correspondence (see Fig. 1).

Fig. 1. Hangar sequence: 751-frame example of visual SLAM by aligning reference regions to successive images. All pixels within both regions are exploited, leading to a precise result. The recovered angle between walls is 89.7°. The regions are defined relative to where they were first viewed and transferred to a common reference frame only for visualization purposes. (a) (Top) Reference region is selected. (Bottom) Using appropriate parameters, this region is automatically aligned (registered) to a different image. The image on the right is the warped region that is used to compute a residual. Other reference regions may be continuously selected and aligned if computing resources are available. (b) A subset of the parameters recovered by the proposed alignment algorithm is naturally the camera pose and the scene structure. Since monocular images are used, the scale factor is set arbitrarily.

Roughly speaking, the optimal parameters are obtained as follows. Consider a parametric generative model that deforms (warps) an image. Using an estimate of the parameters, an image can be warped toward another one. The residual between the warped image and the second one is then used to iteratively refine the parameters of the model. In this paper, we focus on a deterministic optimal formulation of visual SLAM. As for the uncertainty calculations, one can either directly cast the image registration as a stochastic optimization problem, or couple the approach with a standard filtering technique (e.g., EKF). The latter alternative is considered here, but the former is believed to represent a promising research direction.

Despite the impressive computing power available to date, in a real-time setting, the entire image cannot, in general, be considered for processing. Therefore, an adequate selection of image regions is performed in this paper. Given that the selected regions may either leave the field of view or simply not fit the used models, the technique is able both to reject such regions and to automatically insert new ones. Also, to improve computational efficiency [14], the scene is geometrically modeled as a collection of planar surfaces. This modeling is considered by all direct methods mentioned in Section I-A.2 as well.

C. Contributions

In this paper, a new approach to visual SLAM is proposed. We formulate it as a direct image registration problem. In order to solve it efficiently, consistently, and robustly, a new photogeometric generative model is presented, i.e., besides the global and local geometric parameters, global and local photometric ones are considered as optimization variables as well. This enables the system to work under generic illumination changes and achieve more accurate alignments. In turn, the global variables related to motion directly enforce the rigidity constraint of the scene within the minimization process. We remark that the proposed framework still preserves the advantages of motion parameterization using the Lie algebra. With regard to the last, but not least, structural constraint of the scene, the positive depth constraint (i.e., cheirality), a new structure parameterization is proposed to enforce it during the optimization as well. Surprisingly, none of the existing direct approaches have exploited this constraint. The simultaneous enforcement within the optimization (instead of a posteriori) of all these structural constraints significantly contributes to improving robustness, stability, and accuracy.

Another contribution of this paper concerns the initialization of the visual SLAM. This is not a trivial issue, since the scene structure becomes observable only when the amount of translation is sufficiently large with respect to its depths [15], [16]. Given this ill-conditioning, some systems, e.g., [11], rely on a simple solution: one installs a known target in the environment and uses it in the initial frame. Other systems recover and decompose the essential matrix. However, if the scene is planar, then such a matrix is degenerate, which leads to an erroneous translation vector. In this paper, a new solution for initializing the system is proposed whereby the environment is neither altered nor assumed to be nonplanar.

This paper is a revised and extended version of the visual SLAM approach that we proposed in [17]. In addition, more thorough experiments are carried out, and a technique to automatically insert new regions is described.



Besides the standard notations, in the sequel we adopt $\widetilde{\bf v}$, $\overline{\bf v}$, $\widehat{\bf v}$, and $\|{\bf v}\|$ to, respectively, represent an increment to be found, an augmented version, a modified version, and the Euclidean norm of a variable ${\bf v}$. Here, a superscripted asterisk, e.g., ${\bf v}^*$, is used to represent a variable defined with respect to the reference frame, whereas a superscripted circle, e.g., ${\bf v}^\circ$, denotes its optimal value relative to a given cost function. Also, braces represent a set, e.g., $\{{\bf v}_i\}_{i=1}^n = \{{\bf v}_1, {\bf v}_2, \ldots, {\bf v}_n\}$, and ${\bf 0}$ (respectively, ${\bf 1}$) is a matrix of zeros (respectively, ones) of appropriate dimensions. Moreover, let ${\bf p} = [u, v, 1]^\top$ be the homogeneous vector containing the image coordinates of a pixel. Then, we denote by ${\cal I}({\bf p})$ the image intensity of the pixel ${\bf p}$. Bilinear interpolation is used for subpixel coordinates. Consider an image ${\cal I}^*$ of a rigid scene. After displacing the camera by a rotation ${\bf R} \in \mathbb{SO}(3)$ and a translation ${\bf t} \in \mathbb{R}^3$, another image ${\cal I}$ of the same scene is acquired. This motion can be represented by a homogeneous transformation matrix ${\bf T} \in \mathbb{SE}(3)$.

A. Lie Algebra $\mathfrak{se}(3)$ and the Lie Group $\mathbb{SE}(3)$

Let ${\bf A}_i$, $i = 1, 2, \ldots, 6$, be the canonical basis of the Lie algebra $\mathfrak{se}(3)$ [18]. Any ${\bf A}({\bf v}) \in \mathfrak{se}(3)$ can thus be written as a linear combination of the ${\bf A}_i$:
$${\bf A}({\bf v}) = \sum_{i=1}^6 \nu_i {\bf A}_i \in \mathfrak{se}(3) \eqno{(1)}$$
where ${\bf v} = [\nu_1, \nu_2, \ldots, \nu_6]^\top \in \mathbb{R}^6$ represents its coordinates.

The Lie algebra $\mathfrak{se}(3)$ is related to its Lie group $\mathbb{SE}(3)$ via the exponential map
$$\exp\colon \mathfrak{se}(3) \to \mathbb{SE}(3); \qquad {\bf A}({\bf v}) \mapsto \exp\bigl({\bf A}({\bf v})\bigr). \eqno{(2)}$$
The mapping (2) is smooth and one-to-one onto, with a smooth inverse, within a very large neighborhood around the origin of $\mathfrak{se}(3)$ and the identity element of $\mathbb{SE}(3)$. The most important benefits of using this parameterization in our problem will be made clear when applying it in Sections II-C and III-D.
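The mapping (2) can be sketched numerically as follows. This is a minimal numpy/scipy illustration, not the paper's implementation; the twist ordering (translation first, then rotation) and the basis construction are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import expm

def A(v):
    """Linear combination A(v) = sum_i v_i A_i of the canonical se(3) basis, eq. (1)."""
    t, w = v[:3], v[3:]
    S = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])   # skew-symmetric rotational part
    M = np.zeros((4, 4))
    M[:3, :3] = S
    M[:3, 3] = t
    return M

def exp_se3(v):
    """Mapping (2): se(3) -> SE(3) via the matrix exponential."""
    return expm(A(v))

# A sample increment: 0.1 m translation along x and 0.3 rad rotation about z.
T = exp_se3(np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.3]))
```

By construction, the rotational block of `T` is always a valid rotation, so no a posteriori projection onto the group manifold is needed.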

B. Plane-Based Two-View Epipolar Geometry

As previously stated, for efficiency reasons, we model the scene as a collection of planar regions. In this case, the coordinates of a pixel ${\bf p}^*$ in such a region of ${\cal I}^*$ are linked to its corresponding ${\bf p}$ in ${\cal I}$ by a projective homography ${\bf G}$ [15]:
$${\bf p} \,\propto\, {\bf G}\,{\bf p}^*. \eqno{(3)}$$
The symbol "$\propto$" indicates proportionality up to a nonzero scale factor. A warping operator ${\bf w}$ can then be defined as
$${\bf p} = {\bf w}({\bf G}, {\bf p}^*) \eqno{(4)}$$
$$\phantom{{\bf p}} = \left[\frac{g_{11} u^* + g_{12} v^* + g_{13}}{g_{31} u^* + g_{32} v^* + g_{33}},\ \frac{g_{21} u^* + g_{22} v^* + g_{23}}{g_{31} u^* + g_{32} v^* + g_{33}},\ 1\right]^\top \eqno{(5)}$$
where $\{g_{ij}\}$ denotes the elements of the matrix ${\bf G}$.
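The warp (4)-(5) is a direct transcription of (3) followed by normalization of the homogeneous coordinate, as the following sketch shows; `G` and the pixel are illustrative values.

```python
import numpy as np

def warp(G, p_star):
    """Warping operator w(G, p*) of eqs. (4)-(5): apply p ∝ G p*, then
    divide by the third coordinate so the result is homogeneous."""
    q = G @ p_star
    return q / q[2]

p_star = np.array([10.0, 20.0, 1.0])
p_id = warp(np.eye(3), p_star)              # identity homography: no motion
p_sc = warp(np.diag([2.0, 2.0, 1.0]), p_star)  # a pure image scaling
```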

Consider the calibrated setting, where ${\bf K}$ denotes the upper triangular $(3 \times 3)$ matrix containing the camera's intrinsic parameters. Using the equation of the plane together with that of the rigid-body motion, ${\bf G}$ can be written as a function of the camera displacement and the scene structure:
$${\bf G}({\bf T}, {\bf n}_d^*) \,\propto\, {\bf K}\,({\bf R} + {\bf t}\,{\bf n}_d^{*\top})\,{\bf K}^{-1} \eqno{(6)}$$
where ${\bf n}_d^*$ denotes the normal vector of the plane scaled by its distance to the reference camera frame.
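Equation (6) can be sketched as a one-line function; the intrinsics below are illustrative, and a zero displacement should reduce the homography to the identity.

```python
import numpy as np

def homography_from_pose(K, R, t, n_d):
    """Plane-induced homography of eq. (6): G(T, n_d*) ∝ K (R + t n_d*^T) K^{-1}."""
    return K @ (R + np.outer(t, n_d)) @ np.linalg.inv(K)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])      # illustrative intrinsic parameters
# Zero rotation and translation: the plane induces the identity mapping.
G = homography_from_pose(K, np.eye(3), np.zeros(3), np.array([0.0, 0.0, 1.0]))
```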

C. Model-Based Image Alignment Parameterized in $\mathbb{SE}(3)$

Consider a textured planar surface, or one that can be locally approximated by a plane. For simplicity, let us suppose for the moment that the scaled normal vector ${\bf n}_d^*$ (i.e., the metric model) of this planar target is known. We will show in Section III-C how the image alignment (registration) problem can be adequately solved when this metric model is unknown.

The problem of "metric model"-based direct image alignment can be formulated as a search for the optimal matrix ${\bf T}^\circ \in \mathbb{SE}(3)$ to warp all the pixels in a region ${\cal R}^*$ so that their intensity values match as closely as possible their corresponding ones in the current image ${\cal I}$ [19]. Since one seeks an optimal pose given a scene model, this technique can also be referred to as model-based visual odometry, or simply localization. To this end, a nonlinear minimization procedure has to be derived, since the pixel intensity ${\cal I}({\bf p})$ is, in general, nonlinear in ${\bf p}$. More formally, given an estimate $\widehat{\bf T}$ of ${\bf T}$, the problem is to find the optimal increment $\widetilde{\bf v}^\circ$ through an iterative method, e.g., [19], that solves
$$\min_{\widetilde{\bf v} \in \mathbb{R}^6} \ \frac{1}{2}\sum_{{\bf p}_i^* \in {\cal R}^*} \Bigl[{\cal I}\Bigl({\bf w}\bigl({\bf G}\bigl({\bf T}(\widetilde{\bf v})\,\widehat{\bf T}\bigr), {\bf p}_i^*\bigr)\Bigr) - {\cal I}^*({\bf p}_i^*)\Bigr]^2 \eqno{(7)}$$
with an update of the transformation matrix as
$$\widehat{\bf T} \longleftarrow {\bf T}(\widetilde{\bf v})\,\widehat{\bf T} = \exp\bigl({\bf A}(\widetilde{\bf v})\bigr)\,\widehat{\bf T} \eqno{(8)}$$
by using the mapping (2). The arrow "$\longleftarrow$" denotes the update assignment within the iterations. Convergence may then be established when the increments become arbitrarily small, i.e., $\|\widetilde{\bf v}^\circ\| \approx 0$. Due to the properties of this mapping, the resulting matrix $\widehat{\bf T}$ in (8) is always in the group, and hence, no approximation is performed. If this parameterization is not applied, the resulting $\widehat{\bf T}$ has to be projected onto its group manifold, clearly reducing the rate and domain of convergence. Therefore, the local parameterization (1) improves stability and accuracy, and thus is highly suitable for expressing incremental displacements.
Another important property will be exploited in Section III-D to solve optimization problems such as (7) efficiently and with nice convergence properties.


Proposed Direct Visual SLAM Approach

This section presents a unified approach where geometric and photometric models are appropriately included in a direct visual SLAM. Furthermore, it is also shown how to consistently and efficiently obtain the optimal global and local parameters related to all these models.

A. Selection of Image Regions

In order to satisfy the real-time requirements, we select a set of nonoverlapping image patches according to an appropriate score. For direct methods, high scores should reflect strong image gradient along different directions.

Let the image region ${\cal R}^*$ be a $(w \times w)$ matrix containing pixel intensities. Then, obtain a suitable gradient-based image ${\cal G}^*$ from ${\cal I}^*$. Given ${\cal G}^*$, a score image ${\cal S}^*$ can be defined as the sum of all values of ${\cal G}^*$ within a $(w \times w)$ block centered at every pixel. A second criterion to be considered, possibly with a different weight, is based on the quantity of local extrema of ${\cal G}^*$ (denoted ${\cal E}^*$) within each block. This may prevent the system from assigning high scores to single peaks, which would define patches with the same drawbacks as regions defined around standard interest points (e.g., Harris corners). The neighborhood of an isolated point may not contain enough information to constrain all degrees of freedom. Other criteria are also possible, e.g., the degree of spread of the regions, but these first two have proved to be sufficient.

All needed block operations are efficiently performed by a convolution (denoted by "$\otimes$") with the $(w \times w)$ kernel ${\cal K}_w = {\bf 1}$:
$${\cal S}^* = \lambda\,{\cal G}^* \otimes {\cal K}_w + \eta\,{\cal E}^* \otimes {\cal K}_w \eqno{(9)}$$
$$\phantom{{\cal S}^*} = (\lambda\,{\cal G}^* + \eta\,{\cal E}^*) \otimes {\cal K}_w. \eqno{(10)}$$
Typical weights are $\lambda = \|{\cal G}^* \otimes {\cal K}_w\|^{-1}$ and $\eta = 1$. The resulting ${\cal S}^*$ contains the scores, which are sorted without any absolute thresholds on the strengths to be tuned. The number of regions (defined around each score) considered for further processing depends only on the available computing resources.
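The block sums (9)-(10) can be sketched with a box convolution. The gradient image ${\cal G}^*$ and extrema image ${\cal E}^*$ below are random stand-ins, and the weight $\lambda$ follows the normalization stated above; both are assumptions of this sketch, not values from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def region_scores(G_star, E_star, w=4, eta=1.0):
    """Score image S* of eqs. (9)-(10), computed with the (w x w) kernel K_w = 1.
    By linearity, convolving the weighted sum once equals two convolutions."""
    K_w = np.ones((w, w))
    lam = 1.0 / np.linalg.norm(convolve2d(G_star, K_w, mode="same"))
    return convolve2d(lam * G_star + eta * E_star, K_w, mode="same")

rng = np.random.default_rng(42)
G_star = rng.random((32, 32))   # stand-in gradient-based image
E_star = rng.random((32, 32))   # stand-in local-extrema count image
S_star = region_scores(G_star, E_star, w=4)
```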

B. Handling Generic Illumination Changes

An important issue to all vision-based methods is their robustness to variable lighting. A widely used technique to increase this robustness is to model the change in illumination as an affine transformation [20]. Despite the fact that improved results are obtained, only global changes are modeled.

Recently, we proposed in [21] a new model of illumination changes to cope with generic lighting variations. Illumination changes are viewed as a surface that can evolve with time. In that paper, we successfully applied it to the direct visual tracking problem parameterized in the projective space. Here, we show that this model can be straightforwardly applied to the efficient direct visual tracking problem parameterized in the Euclidean space. Indeed, for efficiency reasons, we use here the discretized realization of that generic model (see Fig. 2). Let the region have a sufficiently small size. Lighting variations are then explained by a local term $\alpha$ and a global term $\beta$, respectively:
$${\cal I}'(\alpha, \beta, {\bf p}_i) = \alpha\,{\cal I}({\bf p}_i) + \beta. \eqno{(11)}$$
This piecewise affine model (there is an $\alpha$ per region) can be interpreted as a photometric generative model for regulating the contrast of a particular region and the brightness of the entire image. This discretized model has been shown to be a good compromise between modeling error and computational complexity (it has few parameters and leads to a sparse Jacobian, as shown in Section IV). Nevertheless, it still does not require any prior knowledge about either the reflectance properties of the surface, which can be non-Lambertian, or the characteristics of the light sources, such as their power, number, and pose in space.
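A minimal sketch of the piecewise affine model (11): one contrast gain $\alpha$ per region and a single brightness offset $\beta$ shared by the whole image. The region contents and parameter values are illustrative.

```python
import numpy as np

def apply_illumination(regions, alphas, beta):
    """Photometric model (11): I'(α, β, p) = α I(p) + β, with one α per
    region and a global β."""
    return [a * r + beta for a, r in zip(alphas, regions)]

# Two small regions of constant intensity, transformed with different gains.
regions = [np.full((2, 2), 100.0), np.full((2, 2), 50.0)]
out = apply_illumination(regions, alphas=[1.2, 0.8], beta=5.0)
```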

Fig. 2. Discretized surface (boxes) for approximating the lighting changes (colored).

We remark that the model (11) differs from existing ones when applied to different parts of the same image. For example, the method proposed in [12] uses an affine model consisting of two local parameters per region. That is, it does not consider the global variations explicitly, which represent, e.g., a shift in the camera gain. In this latter overparameterized formulation, the estimation of many more parameters is required. This may degrade frame-rate performance and, even worse, may lead to convergence problems. Another important difference regards how the related parameters are obtained. The global and local parameters of our model are simultaneously obtained by an efficient second-order approximation method, yielding nicer convergence properties.

In fact, given that an iterative procedure is used and that the update rule for the illumination parameters can simply be
$$\widehat{\alpha} \longleftarrow \widetilde{\alpha} + \widehat{\alpha}, \qquad \widehat{\beta} \longleftarrow \widetilde{\beta} + \widehat{\beta} \eqno{(12)}$$
we can define the transformed pixel intensity as
$${\cal I}'\bigl(\widetilde{\bf v}, \widetilde{\alpha}, \widetilde{\beta}, {\bf p}_i^*\bigr) = (\widetilde{\alpha} + \widehat{\alpha})\,{\cal I}\Bigl({\bf w}\bigl({\bf G}\bigl({\bf T}(\widetilde{\bf v})\,\widehat{\bf T}\bigr), {\bf p}_i^*\bigr)\Bigr) + \widetilde{\beta} + \widehat{\beta}. \eqno{(13)}$$
This can then be viewed as a photogeometric generative model. Therefore, by incorporating (13), the model-based visual tracking problem (7) becomes
$$\min_{\widetilde{\bf v} \in \mathbb{R}^6,\ \widetilde{\alpha}, \widetilde{\beta} \in \mathbb{R}} \ \frac{1}{2}\sum_{{\bf p}_i^* \in {\cal R}^*} \bigl[{\cal I}'\bigl(\widetilde{\bf v}, \widetilde{\alpha}, \widetilde{\beta}, {\bf p}_i^*\bigr) - {\cal I}^*({\bf p}_i^*)\bigr]^2. \eqno{(14)}$$

C. Full System

Since the metric model of the scene is unknown a priori, its structure parameters must be included in (14) as optimization variables as well. Indeed, the depth of some image points (not necessarily image features) together with a regularization function can be used as these variables. The latter function is needed in two-image direct reconstructions in order to avoid obtaining an underconstrained system (more unknowns than equations). As stated previously, we represent the scene here as a collection of planar regions. This, in fact, acts as our regularization function. This choice leads to a versatile and computationally efficient description of the scene (it has few parameters and leads to a sparse Jacobian, as will be shown).

We include the structure parameters as follows. First, we parameterize the scaled normal vector ${\bf n}_d^*$ by using the depths $z_i^* > 0$ of any three (noncollinear) image points ${\bf p}_i^*$, $i = 1, 2, 3$, within the region ${\cal R}^*$ (e.g., its corners). For a 3-D point that lies on the plane ${\bf n}_d^*$, using the equation of perspective projection, we have
$${\bf n}_d^{*\top}\,{\bf K}^{-1}{\bf p}_i^* = \frac{1}{z_i^*}. \eqno{(15)}$$
Using these three points, define the vector of inverse depths
$${\bf z}^* = \left[\frac{1}{z_1^*}, \frac{1}{z_2^*}, \frac{1}{z_3^*}\right]^\top \eqno{(16)}$$
which is the natural value to be computed. The relation between both representations is then
$${\bf n}_d^* = {\bf M}\,{\bf z}^* \qquad \hbox{with}\ \ {\bf M} = {\bf K}^\top\bigl[{\bf p}_1^*, {\bf p}_2^*, {\bf p}_3^*\bigr]^{-\top} \in \mathbb{R}^{3 \times 3}. \eqno{(17)}$$
Next, given that the depths must be strictly positive scalars and that an iterative procedure has to be devised, we propose to parameterize them as
$${\bf z}^* = {\bf z}^*({\bf y}) = \exp({\bf y}) > 0, \qquad {\bf y} \in \mathbb{R}^3. \eqno{(18)}$$
This provides the update rule
$$\widehat{\bf z}^{\,*} \longleftarrow {\bf z}^*(\widetilde{\bf y}) \cdot \widehat{\bf z}^{\,*} = \exp(\widetilde{\bf y}) \cdot \widehat{\bf z}^{\,*} \eqno{(19)}$$
where "$\cdot$" denotes element-wise multiplication.
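The chain (15)-(19) can be checked numerically: the scaled normal is linear in the inverse depths, and the exponential keeps every depth strictly positive. The intrinsics and pixel coordinates below are illustrative values, not from the paper.

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                 # illustrative intrinsics
# Three noncollinear reference pixels p1*, p2*, p3* as columns.
P = np.column_stack([[300.0, 200.0, 1.0],
                     [360.0, 210.0, 1.0],
                     [320.0, 270.0, 1.0]])

M = K.T @ np.linalg.inv(P).T                    # M = K^T [p1*, p2*, p3*]^{-T}, eq. (17)

y = np.array([0.2, -0.1, 0.4])                  # unconstrained parameters
z = np.exp(y)                                   # inverse depths z* = exp(y) > 0, eq. (18)
n_d = M @ z                                     # scaled normal n_d* = M z*, eq. (17)
```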

Remark III.1 (Cheirality Constraint). By using the proposed efficient parameterization of the structure (18), we enforce, within the optimization procedure, that the scene is always in front of the camera, i.e., $z_i^* > 0\ \forall i$.

Accordingly, the photogeometric generative model expressed in (13) has to be changed into
$${\cal I}''\bigl(\widetilde{\bf v}, \widetilde{\alpha}, \widetilde{\beta}, \widetilde{\bf y}, {\bf p}_i^*\bigr) = (\widehat{\alpha} + \widetilde{\alpha})\,{\cal I}\Bigl({\bf w}\bigl({\bf G}\bigl({\bf T}(\widetilde{\bf v})\,\widehat{\bf T},\ {\bf n}_d^*({\bf z}^*(\widetilde{\bf y}) \cdot \widehat{\bf z}^{\,*})\bigr), {\bf p}_i^*\bigr)\Bigr) + \widehat{\beta} + \widetilde{\beta}. \eqno{(20)}$$
Incorporating this modification into all regions ${\cal R}_j^*$, $j = 1, 2, \ldots, n$, our problem becomes
$$\min_{{\bf x} \in \mathbb{R}^{6+4n}} \ \frac{1}{2}\sum_j \sum_{{\bf p}_{ij}^* \in {\cal R}_j^*} \bigl[\underbrace{{\cal I}''({\bf x}, {\bf p}_{ij}^*) - {\cal I}^*({\bf p}_{ij}^*)}_{d_{ij}({\bf x})}\bigr]^2 \eqno{(21)}$$
where ${\bf x}$ has $7 + 4n - 1 = 6 + 4n$ parameters, since the scale factor cannot be recovered from monocular images only. Thus, one has to fix it (to a strictly positive value) to obtain a consistent solution to the problem. It can be noted that the set ${\bf x}$ comprises both global geometric and photometric parameters $\{\widetilde{\bf v}, \widetilde{\beta}\}$, as well as local geometric and photometric ones $\{\widetilde{\bf y}_j, \widetilde{\alpha}_j\}_{j=1}^n$.

Remark III.2 (Rigidity Constraint). Observe that in formulation (21), the regions are not independently tracked. In fact, the rigidity constraint of the scene is explicitly enforced, within the optimization procedure also, since all regions share the same incremental motion parameters.

D. Optimization Procedure

Concisely, our system (21) can then be interpreted as seeking the optimal value
$${\bf x}^\circ = \mathop{\arg\min}_{{\bf x} \in \mathbb{R}^{6+4n}} \ \frac{1}{2}\,\bigl\|{\bf d}({\bf x})\bigr\|^2 \eqno{(22)}$$
such that the norm of the vector of intensity discrepancies ${\bf d}({\bf x}) = \{d_{ij}({\bf x})\}$ is minimized. In order to iteratively solve this nonlinear optimization problem, an expansion in Taylor series is first performed. To this end, another key technique to achieve nice convergence properties is to perform an efficient second-order approximation of ${\bf d}({\bf x})$ [22]. Indeed, it can be shown that, neglecting the third-order remainder, a second-order approximation of ${\bf d}({\bf x})$ around ${\bf x} = {\bf 0}$ is
$${\bf d}({\bf x}) = {\bf d}({\bf 0}) + \frac{1}{2}\bigl({\bf J}({\bf 0}) + {\bf J}({\bf x})\bigr)\,{\bf x}. \eqno{(23)}$$
In our case, the current Jacobian ${\bf J}({\bf 0})$ is divided into the Jacobians relative to the motion parameters, the illumination parameters, and the structure parameters
$${\bf J}({\bf 0}) = \bigl[{\bf J}_{\bf v}({\bf 0}),\ {\bf J}_{\alpha\beta}({\bf 0}),\ {\bf J}_{{\bf z}^*}({\bf 0})\bigr] \eqno{(24)}$$
where
$$\cases{{\bf J}_{\bf v}({\bf 0}) = \widehat{\alpha}\,{\bf J}_{\cal I}\,{\bf J}_{\bf w}\,{\bf J}_{\widehat{\bf T}}\,{\bf J}_{\bf V}({\bf 0})\cr {\bf J}_{\alpha\beta}({\bf 0}) = \bigl[\nabla_{\widehat{\beta}}\,{\cal I}''({\bf 0}),\ \nabla_{\widehat{\alpha}}\,{\cal I}''({\bf 0})\bigr] = \bigl[1,\ {\cal I}\bigr]\cr {\bf J}_{{\bf z}^*}({\bf 0}) = \widehat{\alpha}\,{\bf J}_{\cal I}\,{\bf J}_{\bf w}\,{\bf J}_{\widehat{\bf n}^*}\,{\bf M}\,{\bf z}^*({\bf 0})}$$
by applying the chain rule.
Correspondingly, the reference Jacobian ${\bf J}({\bf x})$ is divided into
$${\bf J}({\bf x}) = \bigl[{\bf J}_{\bf v}({\bf x}),\ {\bf J}_{\alpha\beta}({\bf x}),\ {\bf J}_{{\bf z}^*}({\bf x})\bigr] \eqno{(25)}$$
where
$$\cases{{\bf J}_{\bf v}({\bf x}) = \alpha\,{\bf J}_{{\cal I}^*}\,{\bf J}_{\bf w}\,{\bf J}_{\bf T}\,{\bf J}_{\bf V}({\bf x})\cr {\bf J}_{\alpha\beta}({\bf x}) = \bigl[1,\ {\cal I}^*\bigr]\cr {\bf J}_{{\bf z}^*}({\bf x}) = \alpha\,{\bf J}_{{\cal I}^*}\,{\bf J}_{\bf w}\,{\bf J}_{{\bf n}^*}\,{\bf M}\,{\bf z}^*({\bf x}).}$$

Applying a necessary condition for ${\bf x} = {\bf x}^\circ$ to be an extremum of our cost function in (22) gives
$$\nabla_{\bf x}\Bigl(\frac{1}{2}\,{\bf d}({\bf x})^\top{\bf d}({\bf x})\Bigr)\Big|_{{\bf x}={\bf x}^\circ} = \nabla_{\bf x}\bigl({\bf d}({\bf x})\bigr)^\top\Big|_{{\bf x}={\bf x}^\circ}\,{\bf d}({\bf x}^\circ) = {\bf 0}. \eqno{(26)}$$
Provided that ${\bf J}({\bf x})|_{{\bf x}={\bf x}^\circ}$ is full rank (see Section IV) and using (23) around ${\bf x} = {\bf x}^\circ$, one has from (26)
$$\frac{1}{2}\bigl({\bf J}({\bf 0}) + {\bf J}({\bf x})\bigr)\,{\bf x}^\circ = -{\bf d}({\bf 0}). \eqno{(27)}$$
This is not a linear system in ${\bf x}^\circ$ because of ${\bf J}({\bf x})$. However, due to the suitable parameterization of the alignment (see Section II-C), we exploit the left-invariance property of the vector fields on Lie groups [18]. In fact, given that the space of the parameters ${\bf x}$ is homeomorphic to a Lie group defined over $\mathbb{R}^{6+4n}$, this property means that ${\bf J}_{\bf V}({\bf x})\,{\bf x}^\circ = {\bf J}_{\bf V}({\bf 0})\,{\bf x}^\circ$. Then, provided that ${\bf J}_{\widehat{\bf T}} = {\bf J}_{\bf T}$ and ${\bf J}_{\widehat{\bf n}^*} = {\bf J}_{{\bf n}^*}$, the left-hand side of (27) can be written as
$$\frac{1}{2}\bigl({\bf J}({\bf 0}) + {\bf J}({\bf x})\bigr)\,{\bf x}^\circ = {\bf J}'\,{\bf x}^\circ = \bigl[{\bf J}'_{\bf v},\ {\bf J}'_{\alpha\beta},\ {\bf J}'_{{\bf z}^*}\bigr]\,{\bf x}^\circ$$
$$= \frac{1}{2}\Bigl[\widehat{\alpha}\,({\bf J}_{\cal I} + {\bf J}_{{\cal I}^*})\,{\bf J}_{\bf w}\,{\bf J}''_{\bf v},\ \bigl[2,\ ({\cal I} + {\cal I}^*)\bigr],\ \widehat{\alpha}\,({\bf J}_{\cal I} + {\bf J}_{{\cal I}^*})\,{\bf J}_{\bf w}\,{\bf J}''_{{\bf z}^*}\Bigr]\,{\bf x}^\circ \eqno{(28)}$$
with ${\bf J}''_{\bf v} = {\bf J}_{\bf T}\,{\bf J}_{\bf V}({\bf 0})$ and ${\bf J}''_{{\bf z}^*} = {\bf J}_{{\bf n}^*}\,{\bf M}\,{\bf z}^*({\bf 0})$.

By appropriately stacking each ${\bf J}'$ above to take into consideration all regions $j = 1, 2, \ldots, n$, i.e.,
$$\overline{{\bf J}'} = \left[\matrix{{\bf J}'_{1{\bf v}} & {\bf 1} & {\bf J}'_{1\alpha} & {\bf 0} & {\bf 0} & {\bf 0} & {\bf J}'_{1{\bf z}^*} & {\bf 0} & {\bf 0} & {\bf 0}\cr {\bf J}'_{2{\bf v}} & {\bf 1} & {\bf 0} & {\bf J}'_{2\alpha} & {\bf 0} & {\bf 0} & {\bf 0} & {\bf J}'_{2{\bf z}^*} & {\bf 0} & {\bf 0}\cr \vdots & \vdots & {\bf 0} & {\bf 0} & \ddots & {\bf 0} & {\bf 0} & {\bf 0} & \ddots & {\bf 0}\cr {\bf J}'_{n{\bf v}} & {\bf 1} & {\bf 0} & {\bf 0} & {\bf 0} & {\bf J}'_{n\alpha} & {\bf 0} & {\bf 0} & {\bf 0} & {\bf J}'_{n{\bf z}^*}}\right] = \bigl[\overline{{\bf J}'}_{\bf v},\ \overline{{\bf J}'}_{\alpha\beta},\ \overline{{\bf J}'}_{{\bf z}^*}\bigr] \eqno{(29)}$$
a rectangular linear system is hence finally achieved
$$\overline{{\bf J}'}\,{\bf x}^\circ = -{\bf d}({\bf 0}) \eqno{(30)}$$
whose solution ${\bf x}^\circ$ is obtained in the least-squares sense by solving its normal equations. The optimal solution is found by iteratively updating the parameters according to (8), (12), and (19) until the displacements become arbitrarily small.
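The linear step (30) can be sketched as follows; the stacked Jacobian and the discrepancy vector are random stand-ins with the dimensions of $n = 4$ regions, i.e., $6 + 4n$ parameters, not quantities computed from images.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                            # number of regions
J_bar = rng.standard_normal((200, 6 + 4 * n))    # stand-in for the stacked J', eq. (29)
d0 = rng.standard_normal(200)                    # stand-in for the discrepancies d(0)

# Least-squares solution of J' x = -d(0) through the normal equations
# (J'^T J') x = -J'^T d(0).
x = np.linalg.solve(J_bar.T @ J_bar, -(J_bar.T @ d0))
```

In practice the sparsity of the illumination and structure blocks visible in (29) would be exploited instead of forming a dense system.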

Therefore, we provide a second-order approximation method that leads to a computationally efficient optimization procedure, because only first-order derivatives are involved. In other words, differently from second-order minimization techniques (e.g., Newton), the Hessians are never computed explicitly. This also contributes to obtaining nicer convergence properties. Furthermore, the proposed model of illumination changes together with the used representation of the scene yield sparse (block-diagonal) Jacobians, respectively, $\overline{{\bf J}'}_{\alpha\beta}$ and $\overline{{\bf J}'}_{{\bf z}^*}$, as shown in (29). Efficiency is then further improved.


Initialization of the System

In this section, a method to initialize the proposed visual SLAM formulation is described. Essentially, the technique consists of a framework that is hierarchical in the number of parameters used to explain the image motion.

A. Hierarchical Formulation

At the beginning of the task, the amount of translation may be small relative to the distance to the scene. If this occurs, the augmented Jacobian of the structure Formula [see (29)] is ill-conditioned, which means that the structure parameters are not yet observable. In this situation, the motion parameters together with the illumination ones can explain most of the image differences. The same reasoning also applies once the optimal structure parameters (i.e., the map) have already been obtained. In this case, there is no reason to maintain them as optimization variables. Besides the fact that their values may be perturbed, e.g., when the image resolution decreases, fewer parameters in the minimization mean more available computing resources. Once again, the motion and illumination parameters can explain most of the image discrepancies. As a matter of fact, in this case, the proposed visual SLAM approach effectively runs in a robust localization mode.

Therefore, for every new image, we initially attempt to align the regions by using only a subset of the parameters from (30)

$$\bigl[\,\overline{{\bf J}'}_{{\bf v}},\ \overline{{\bf J}'}_{\alpha\beta}\,\bigr]\ \bigl[\,\widetilde{{\bf v}}^{\circ\top},\ \widetilde{\beta}^\circ,\ \bigl\{\widetilde{\alpha}_j^\circ\bigr\}_{j=1}^n\,\bigr]^\top = -{\bf d}({\bf 0}) \eqno{(31)}$$

whose solution is also obtained in the least-squares sense, and then iteratively update (8) and (12). The structure parameters are used as simultaneous optimization variables, i.e., by solving (30), only when the difference between the cost value resulting from (31) and the one from the previous (image) optimization exceeds the image noise. We remark that, in any case, the structure (plus motion and illumination) parameters are required to compute the discrepancies d(0). These parameters can either be the optimal ones from preceding image registrations or an initial value. In fact, this shows how all past observations contribute to incrementally building and maintaining a coherent description of the map (and locations).
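The switching rule can be sketched as follows; the residual-versus-noise criterion is a simplified stand-in for the paper's cost-difference test:

```python
import numpy as np

def solve_subset(J_v, J_ab, J_z, d, noise_rms):
    """Hierarchical alignment sketch: first solve the reduced system of
    motion + illumination parameters as in (31); include the structure
    columns (full system (30)) only when the remaining residual exceeds
    what image noise alone can explain."""
    J_small = np.hstack([J_v, J_ab])
    x_small, *_ = np.linalg.lstsq(J_small, -d, rcond=None)
    residual = J_small @ x_small + d
    cost_rms = np.sqrt(np.mean(residual ** 2))
    if cost_rms <= noise_rms:           # structure not observable or not needed
        return x_small, False
    J_full = np.hstack([J_v, J_ab, J_z])
    x_full, *_ = np.linalg.lstsq(J_full, -d, rcond=None)
    return x_full, True
```

When the reduced system already explains the image differences, the structure parameters are simply kept fixed, matching the robust localization mode described above.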

B. Augmenting the Domain and the Rate of Convergence

A limitation of the visual SLAM approach proposed in Section III concerns its domain of convergence. Although the parameters are obtained by a second-order approximation method with nice convergence properties, this does not ensure that the global minimum will be reached. Global optimization methods such as simulated annealing are too time-consuming to be considered in a real-time setting.

However, a possible solution to avoid getting trapped in local minima consists of using, e.g., feature-based techniques as a bootstrap to our method. We remark that even though a recovered set of parameters may represent a local minimum, it may be close to the global one. Hence, the regions may still have been effectively aligned in the image. A standard pose recovery technique can then be used with all these registered (i.e., corresponding) pixels. Afterward, the scene can be reconstructed by triangulating them [15]. In addition to augmenting the domain of convergence, this approach may also augment the rate of convergence. If the estimated motion and/or structure is closer to the true one than that obtained by the proposed approach, it will act in this case as a prediction for aligning a new image.
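The triangulation step mentioned above can be sketched with the standard linear (DLT) construction; `P1` and `P2` are the two projection matrices recovered by the pose algorithm, and the routine follows the textbook geometry rather than any specific implementation in [15]:

```python
import numpy as np

def triangulate(P1, P2, p1, p2):
    """Linear triangulation of one pixel correspondence (p1, p2) from two
    3x4 projection matrices. Each image point contributes two rows of the
    homogeneous system A X = 0, solved by SVD."""
    A = np.vstack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                      # null vector = homogeneous 3-D point
    return X[:3] / X[3]
```

Every registered pixel of an aligned region can be fed through this routine to bootstrap the scene structure.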

Other predictors can additionally be tested to improve convergence properties. In fact, the coupling between the deterministic image registration proposed in Section III and a probabilistic filtering technique can be performed at this stage. Here, we use a variable-order Kalman filter to provide both another estimate of the optimization variables and the covariances. The inputs (i.e., observations) to the filter are the recovered parameters from the optimization process. In order to initialize the system (i.e., when a new image is available), the best set of parameters among all predictors is simply chosen by comparing their resulting cost values.
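The predictor-selection step can be sketched as follows; the constant-velocity predictor is a deliberately simplified stand-in for the variable-order Kalman filter:

```python
import numpy as np

def predict_constant_velocity(x_prev, x_prev2):
    """A simple constant-velocity prediction over the parameter vector,
    standing in for the filter's state prediction."""
    return x_prev + (x_prev - x_prev2)

def best_predictor(cost, candidates):
    """Initialize the optimization with whichever candidate (previous
    optimum, filter prediction, feature-based estimate, ...) has the
    lowest alignment cost, as described in the text."""
    return min(candidates, key=cost)
```

Selecting by cost rather than trusting a single predictor keeps a bad prediction from dragging the optimization toward the wrong basin.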


Region Rejection and Insertion

A. Outliers Rejection

Within direct methods, outliers correspond to regions that do not fit the models, e.g., regions related to independently moving objects. Surface discontinuities and occluding boundaries can also be viewed as outliers. Hence, they must be detected and discarded by the algorithm. For this, two meaningful metrics are used to evaluate the j th template: a photometric measure and a geometric one. The photometric measure is defined directly from our cost function in (21) as

$$\varepsilon_j^2({\bf x}^\circ) = {1 \over {\rm card}({\cal R}_j^*)} \sum_{{\bf p}_{ij}^* \in {\cal R}_j^*} d_{ij}^2({\bf x}^\circ) \eqno{(32)}$$

where card(·) denotes the cardinality of the set. Notice that illumination variations have already been compensated for here. The geometric measure is the side ratio between the current and the previously warped region. That is, if a template significantly shrinks or elongates in at least one direction, this may signify insufficient content for constraining all parameters (and it can thus be discarded). We remark that while (32) is evaluated after obtaining the optimal solution, the geometric measure can be evaluated within the iterations, provided that the region has been adequately initialized (see next section). This may prevent such regions from perturbing the solution.
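Both rejection tests can be combined in a short routine; the thresholds follow the values quoted later in the experiments (20 gray levels and 50%), and the single side-ratio argument is a simplification of the per-direction test:

```python
import numpy as np

def is_outlier(residuals, side_ratio, photo_thresh=20.0, geom_thresh=0.5):
    """Region rejection sketch: the photometric RMS of (32) over the
    region's (illumination-compensated) residuals, plus the geometric
    side-ratio change of the warped template."""
    rms = np.sqrt(np.mean(np.asarray(residuals, float) ** 2))   # (32)
    geom_err = abs(1.0 - side_ratio)     # deviation from unit side ratio
    return rms > photo_thresh or geom_err > geom_thresh
```

The geometric test can be run inside the iterations, while the photometric one is only meaningful at convergence, as noted above.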

B. Insertion of New Regions

Given that regions may leave the field of view or eventually be rejected from the optimization, the system must be able to insert new regions whenever computing resources are available. The initialization of new regions follows the natural order of specialization: from the most generic stratum to the most specialized one. In other words, we first characterize each new region in the projective space. Using this knowledge and the recovered interframe displacement, we can then obtain its best possible Euclidean structure up to that moment.

This algorithm is detailed as follows. Let the current image be indexed by “τ.” New regions can be selected in this image according to the procedure described in Section III-A. Denote this image by Formula since it contains the reference template of these particular regions. Then, we have the following steps.

  1. When a new image is available, obtain the projective homography that best aligns each j th selected region

$$\{{\bf G}_j^\circ, \alpha_j^\circ, \beta_j^\circ\} = \mathop{\arg\min}_{{\bf G}_j \in {\bb SL}(3),\ \alpha_j, \beta_j \in {\bb R}}\ {1 \over 2} \sum_{{\bf p}_{ij}^* \in {\cal R}_j^*} \bigl[\alpha_j\,{\cal I}\bigl({\bf w}({\bf G}_j, {\bf p}_{ij}^*)\bigr) + \beta_j - {\cal I}_\tau^*({\bf p}_{ij}^*)\bigr]^2 \eqno{(33)}$$

as described in [21]. Since each region is treated independently, we have 8 + 2 parameters to be recovered per region. Optionally, this procedure may be initialized by, e.g., a correlation measure.

  2. Determine the scaled normal vector relative to the frame where the region was first viewed (i.e., corresponding to Formula) using the closed-form solution described in [23]

$$\widehat{{\bf n}}_{d\,j}^* = {\bigl(\mu\,{\bf K}^{-1}\,{\bf G}_j^\circ\,{\bf K} - {\bf R}_\tau^\circ\bigr)^\top {\bf t}_\tau^\circ \over \Vert {\bf t}_\tau^\circ \Vert^2} \eqno{(34)}$$

with the Gj° obtained in step 1 and the local displacement from the visual SLAM result (30) or (31). The factor μ is given by the median singular value of K−1 Gj° K. Of course, one must have Formula.

    Figure 3
    Fig. 3. (Top) Excerpts from the 81-frame Pyramid sequence superimposed with the regions aligned (in red) by using the proposed approach. Observe the successful rejection of regions that do not fit the models (notably in the junctions of planes). (Bottom) Reconstructed structure and motion (represented by three-color frames) seen from different viewpoints. Final pose drift is of less than 0.001% of the total amount of translation and 0.091° for the rotation.
  3. An iterative refinement may then be conducted using the same procedure as described in Section III-D, but using only the structure as optimization variable, i.e., with only three parameters to be recovered per region.

If the j th new region is not declared an outlier, it is ready to be exploited from the next image on. To this end, the photogenerative model (20) can adequately incorporate each new relative reference frame by multiplying the global Formula by the inverse of the relative τT0.
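Step 2 of the insertion algorithm can be sketched as follows; this assumes the calibrated homography K⁻¹G°K equals a scaled Euclidean homography whose median singular value fixes μ (taken here as its reciprocal so that normalization yields unit median singular value), and sign conventions may differ from [23]:

```python
import numpy as np

def scaled_normal(K, G, R, t):
    """Closed-form scaled normal of (34):
    n*_d = (mu K^-1 G K - R)^T t / ||t||^2,
    with mu computed from the median singular value of K^-1 G K."""
    H = np.linalg.inv(K) @ G @ K                      # calibrated homography
    mu = 1.0 / np.median(np.linalg.svd(H, compute_uv=False))
    return (mu * H - R).T @ t / (np.linalg.norm(t) ** 2)
```

For a consistent homography G = K (R + t n*ᵀ) K⁻¹, the routine recovers n* exactly, since the Euclidean homography R + t n*ᵀ has unit median singular value.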

This insertion algorithm is intrinsically different from existing direct ones. For example, besides being sensitive to variable lighting, the method in [11] does not take all available knowledge into account to initialize Formula (it uses a "best guess"). This may lead to convergence problems. Furthermore, differently from [12], where new regions are backprojected to the global reference frame, we avoid altering the original information by adequately incorporating them in (20). This possibility is also an attractive characteristic of the proposed SLAM formulation.


Experimental Results

In order to validate the algorithm and assess its performance, we have tested it with both synthetic and real-world images. All results can be found as multimedia material published in IEEE Xplore with this paper. In all cases, trivial initial conditions are used: Formula. The photometric error is measured here by its RMS (32). The j th region is declared an outlier if either εj > 20 or its geometric error is over 50%. The RMS of the image noise is considered to be 0.6 gray levels. Moreover, we emphasize that no sensory device other than a single camera is used.

A. Pyramid Sequence

A synthetic scene was constructed so that ground truth is available. It is composed of four planes disposed in pyramidal form and cut by another plane at its top. In order to simulate realistic situations as closely as possible, textured images were mapped onto the planes. Then, a sequence of images was generated by displacing the camera while varying the illumination conditions. With respect to the trajectory, the camera performs a circular motion. The objective is twofold. First, returning the camera to the starting pose offers an important benchmark for SLAM algorithms. Second, it aims to show that past observations de facto contribute, within the proposed incremental technique, to building and maintaining a coherent description of the structure and motion. With respect to the lighting variations, they are created by applying an α(k) that linearly changes the image intensities by up to 50% of their original value, and a β(k) that varies sinusoidally with an amplitude of 50 gray levels.
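Such a synthetic illumination schedule can be sketched as follows; the exact ramp and phase used to generate the sequence are not specified in the text, so this is only an illustrative reconstruction:

```python
import numpy as np

def illuminate(I, k, n_frames, alpha_max=1.5, beta_amp=50.0):
    """Apply an affine illumination change alpha(k) * I + beta(k) to a
    grayscale image I at frame k: a gain ramping linearly up to 50% above
    its original value and a bias varying sinusoidally with a 50-gray-level
    amplitude, as described for the Pyramid sequence."""
    alpha = 1.0 + (alpha_max - 1.0) * k / (n_frames - 1)
    beta = beta_amp * np.sin(2 * np.pi * k / (n_frames - 1))
    return np.clip(alpha * I + beta, 0, 255)
```

These are exactly the per-region gain/bias parameters (α, β) that the alignment of Section III estimates jointly with motion and structure.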

We have then compared our approach (see some SLAM results in Fig. 3), which started with 50 regions of size 21 × 21 pixels, with traditional methods as well as with a direct method. With regard to standard methods, we used SIFT keypoints (1025 matches were initially found), and the subpixel Harris detector along with a zero-mean normalized cross-correlation with mutual consistency check for matching the latter points (235 were initially matched). Other than the initial ones, no features or regions are initialized here. Moreover, there is a relevant difference in how feature correspondences are established along the sequence: while SIFT keypoints are matched between the first (reference) and the current images, the Harris points had to be matched between successive images (i.e., tracked). In all cases, corresponding features were fed into a random sample consensus (RANSAC) procedure (typically 300 trials) with the state-of-the-art five-point algorithm [24] for robustly recovering the pose. This corresponds to a standard feature-based framework where a two-image reconstruction is considered and a nonplanar scene is assumed (because of the five-point algorithm). The comparisons are depicted in Fig. 4, where these strategies are referred to as S + R + 5P and H + ZNCC + R + 5P, respectively. Since the scale factor is supposed to be unknown, the translation error is measured by the angle between the actual and the recovered translation directions, i.e., Formula. Notice that, despite exploiting many more features, the standard techniques obtain relatively larger errors, especially for large displacements (i.e., the middle of the loop) and significant lighting changes. In addition, the results show an increasing percentage of outliers and a rapidly decreasing number of corresponding features. Therefore, to avoid an early failure, these methods certainly require a more frequent replacement of features.
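The robust-recovery step of the baseline follows the generic RANSAC hypothesize-and-verify loop sketched below; `fit` and `error` are placeholders for the model estimator (the five-point algorithm [24] in the paper) and its residual, and the line model in the usage example merely stands in for pose estimation:

```python
import random

def ransac(data, fit, error, min_samples, thresh, trials=300):
    """Generic RANSAC: repeatedly fit a model to a minimal sample, count
    inliers within `thresh`, keep the largest consensus set, and refit on
    it. The trial count matches the ~300 used in the comparisons."""
    best_model, best_inliers = None, []
    for _ in range(trials):
        sample = random.sample(data, min_samples)
        model = fit(sample)
        inliers = [d for d in data if error(model, d) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    if best_inliers:
        best_model = fit(best_inliers)      # refit on the consensus set
    return best_model, best_inliers
```

In the experiments, the `data` are the matched feature pairs and the model is the relative pose; everything else about the loop is unchanged.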
As a remark, despite their relatively inferior accuracy, feature-based methods can have a larger domain of convergence and, thus, may be used as a bootstrap to our technique (as discussed in Section IV-B). For the requested accuracy, the proposed approach performed a median of seven iterations along the sequence, returned a median photometric error of 9.84 gray levels, and used a median of 10.4% of each (500 × 500) image. For this sequence, where perfect camera intrinsic parameters are available, the proposed method achieved a drift between the original and final pose (since a closed loop is performed) of less than 0.001% of the total amount of translation and 0.091° for the rotation. This shows that precise results with minimal drift are obtained.

Figure 4
Fig. 4. Results obtained from the proposed approach and traditional methods for the Pyramid sequence. (Top) Errors in the recovered motion. Relatively larger errors were obtained from traditional methods for large displacements and illumination changes. (Bottom) Percentage concerning the exploited regions and features. The notion of an outlier is made uniform here by using the same threshold for both features and any pixel of a region.

With respect to existing direct methods, we have made a comparison with [12]. Given that the displacements (motion and illumination) were not very small, which violates their assumptions, that algorithm failed at the beginning of the sequence. Our solution is able to deal with larger interframe displacements. The method proposed in [11] could not be applied since the scene is supposed to be unknown, and it is not possible to alter the environment (that method needs a known target for initialization).

B. Hangar Sequence

The application of the proposed technique to this outdoor sequence (see Fig. 1) also has a twofold objective. First, it aims at offering a didactic overview of the method, especially concerning the insertion of new information (the second region). Second, it shows the method's degree of robustness to different kinds of noise, e.g., shaking motion, image blur, etc. Very importantly, although we model the scene as a collection of planar regions, some occluding nonplanar objects appear throughout the sequence, e.g., the tree in Fig. 1(a). These disturbances have not significantly perturbed the estimation process, since they carry substantially less information compared to other parts of the patches. For the requested accuracy, the approach performed a median of five iterations along the sequence and returned a median photometric error of 13.37 gray levels. The recovered angle between the two walls is 89.7°, using a median of 22.59% of each (320 × 240) image. This geometric measure is also an important benchmark for evaluating the technique (considering that these walls are truly perpendicular), since pose and structure are intimately tied together. The total displacement of the camera is approximately 50 m, and the images were captured by a hand-held camcorder at 25 Hz.
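The wall-perpendicularity benchmark reduces to the angle between the two recovered plane normals, e.g.:

```python
import numpy as np

def angle_between_planes(n1, n2):
    """Angle (degrees) between two plane normals, used as the geometric
    benchmark for the Hangar sequence (truly perpendicular walls should
    give 90 degrees)."""
    c = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```

Applied to the normals recovered by (34) for the two wall regions, this yields the 89.7° figure quoted above.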

Figure 5
Fig. 5. (Top) Excerpts from the 81-frame Canyon sequence superimposed with the regions registered (in red) by using the proposed approach. Observe the significant change in scale between first and last image. (Bottom) Reconstructed structure and motion seen from different viewpoints. Recovered poses are represented by three-color frames, and only the most stable regions are shown. See the parallelism and/or perpendicularity between most of them.

C. Canyon Sequence

We also ran the proposed algorithm on a representative urban sequence, captured at approximately 12 Hz. It is a challenging sequence in the sense that large interframe displacements are carried out, the objects are disposed at very different distances from the camera, and there exists a significant change in scale. Furthermore, it corresponds to a typical urban scenario where cameras can be of particular importance for localization: narrow streets, where positions from GPS may not be available or sufficiently reliable. The obtained results are shown in Fig. 5, where the visual SLAM is successfully performed. The starting image was chosen such that the dominant plane is farther away from the initial camera pose, compared to [17]. This choice aims to show the limitation of the optimization approach, which is local by nature. Notice that in the beginning of the task, despite the fact that the regions are effectively aligned in the images, the recovered motion and structure are not coherent with the true ones (see the first camera poses in Fig. 5). This means that the algorithm got trapped in a local minimum. Thanks to the solution proposed in Section IV-B, this minimum is adequately treated and the correct parameters are subsequently obtained. For the requested accuracy, the approach performed a median of 12 iterations along the sequence, returned a median photometric error of 10.77 gray levels, used a median of 34 image regions of size 31 × 31 pixels (at the time they are selected), and exploited a median of 17.01% of each (760 × 578) image. The total displacement of the camera is approximately 60 m.

Figure 6
Fig. 6. (Top) Excerpts from the 230-frame Round-about sequence superimposed with the regions aligned (in red) by using the proposed approach. Observe the presence of a pedestrian in the first image and a moving car in the third image. (Bottom left) Reconstructed structure and motion. Recovered poses are represented by very small frames. (Bottom right) Satellite image of the scenario. The path length is of approximately 150 m.

D. Round-about Sequence

This sequence is also illustrative since other types of noise are present, e.g., pedestrians and moving vehicles. Nevertheless, the technique automatically coped with such outliers. Excerpts from this sequence and the obtained SLAM results can be seen in Fig. 6. We can observe that coherent motion and structure are recovered. For the requested accuracy, the approach performed a median of ten iterations along the sequence, returned a median photometric error of 11.37 gray levels, used a median of 37 image regions of size 31 × 31 pixels (at the time they are selected), and exploited a median of 10.84% of each (760 × 578) image. This sequence was captured at approximately 12 Hz by a car-mounted camera, where the path length measured by Google Earth is approximately 150 m.


Conclusion And Perspectives

In this paper, we have proposed a different formulation of the vision-based SLAM problem. The technique is based on image alignment (i.e., image registration) using appropriate motion, structure, and illumination parameters, without first having to find feature correspondences. The major advantages and limitations of this approach have been described. Namely, its strengths concern its high accuracy and the absence of a feature extraction process. Additionally, we have shown that standard methods need to add new features to track more frequently, especially under significant lighting variations or lengthy camera displacements. Hence, the proposed method reduces the drift by maintaining the estimation of the displacement with respect to the same reference frame for longer. On the other hand, in order to be tractable in real time, we use a local optimization procedure to obtain the related parameters. Alternatives to avoid getting trapped in local minima are discussed in the paper. Another important research topic regards loop closure, which was not the objective of this paper. Nevertheless, we believe the proposed direct technique is promising, since existing ones (which have a smaller convergence domain) have already performed this task. Other future work may focus on merging/growing regions with similar structure, which may lead to more stable and faster estimates.


Manuscript received December 15, 2007; revised July 04, 2008. First published September 26, 2008. This work was supported in part by the Brazilian Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) Foundation under Grant 1886/03-7 and in part by the International Agreement Fundação de Amparo à Pesquisa do Estado de São Paulo-Institut National de Recherche en Informatique et en Automatique (FAPESP-INRIA) under Grant 04/13467-5. This paper was recommended for publication by Associate Editor J. Neira and Editor L. Parker upon evaluation of the reviewers' comments.

G. Silveira is with the Institut National de Recherche en Informatique et en Automatique (INRIA), Project Advanced Robotics and Autonomous System (ARobAS), Sophia-Antipolis 06902, France, and also with the Division of Robotics and Computational Vision (DRVC), Centro de Pesquisa Renato Archer (CenPRA) Research Center, Campinas 13069-901, Brazil (e-mail:

E. Malis and P. Rives are with the Institut National de Recherche en Informatique et en Automatique (INRIA), Project Advanced Robotics and Autonomous System (ARobAS), Sophia-Antipolis 06902, France (e-mail:;

This paper has supplementary downloadable multimedia material available at provided by the author. Contents: The multimedia material is composed of four videos. Each one corresponds to a different image sequence, as described in the accompanying article. This material is 10.8 MB in size. Description: In all videos, the left frame shows the input images superimposed with the aligned regions, i.e., the exploited regions. The right frame shows both the 3-D camera pose and the scene structure being incrementally and causally recovered. Only the regions that are currently exploited by the technique are displayed in both the frames. Player information: The demos were encoded with MSMPEG4V2 codec (Microsoft MPEG-4 v2). They were tested both under Linux with MPlayer as well as under Windows with Microsoft Windows Media Player version 10.

Color versions of one or more of the figures in this paper are available online at


References

1. C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: A factorization method," Int. J. Comput. Vis., vol. 9, no. 2, pp. 137–154, 1992.

2. P. H. S. Torr and A. Zisserman, "Feature based methods for structure and motion estimation," in Proc. Workshop Vis. Algorithms: Theory Pract., 1999, pp. 278–294.

3. T. J. Broida, S. Chandrashekhar, and R. Chellappa, "Recursive 3-D motion estimation from a monocular image sequence," IEEE Trans. Aerosp. Electron. Syst., vol. 26, no. 4, pp. 639–656, Jul. 1990.

4. A. Davison, "Real-time simultaneous localization and mapping with a single camera," in Proc. Int. Conf. Comput. Vis., 2003, pp. 1403–1410.

5. E. Eade and T. Drummond, "Scalable monocular SLAM," in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2006, vol. 1, pp. 469–476.

6. R. C. Smith and P. Cheeseman, "On the representation and estimation of spatial uncertainty," Int. J. Robot. Res., vol. 5, no. 4, pp. 56–68, 1986.

7. M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges," in Proc. Int. Joint Conf. Artif. Intell., Acapulco, Mexico, Aug. 2003, pp. 1151–1156.

8. A. R. Bruss and B. K. P. Horn, "Passive navigation," Comput. Vis., Graph., Image Process., vol. 21, no. 1, pp. 3–20, 1983.

9. R. Hummel and V. Sundareswaran, "Motion parameter estimation from global flow field data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 5, pp. 459–476, May 1993.

10. M. Irani and P. Anandan, "About direct methods," in Proc. Workshop Vis. Algorithms: Theory Pract., Corfu, Greece, Sep. 1999, pp. 267–277.

11. N. D. Molton, A. J. Davison, and I. D. Reid, "Locally planar patch features for real-time structure from motion," presented at the Br. Mach. Vis. Conf. (BMVC), Kingston, U.K., Sep. 2004.

12. H. Jin, P. Favaro, and S. Soatto, "A semidirect approach to structure from motion," Vis. Comput., vol. 19, no. 6, pp. 377–394, 2003.

13. C. Mei, S. Benhimane, E. Malis, and P. Rives, "Constrained multiple planar template tracking for central catadioptric cameras," in Proc. Br. Mach. Vis. Conf., Edinburgh, U.K., Sep. 4–7, 2006.

14. R. Szeliski and P. H. S. Torr, "Geometrically constrained structure from motion: Points on planes," in Proc. Eur. Workshop 3-D Struct. Mult. Images Large-Scale Environ., 1998, pp. 171–186.

15. O. Faugeras, Three-Dimensional Computer Vision—A Geometric Viewpoint. Cambridge, MA: MIT Press, 1993.

16. G. Silveira, E. Malis, and P. Rives, "Real-time robust detection of planar regions in a pair of images," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Beijing, China, Oct. 2006, pp. 49–54.

17. G. Silveira, E. Malis, and P. Rives, "An efficient direct method for improving visual SLAM," in Proc. IEEE Int. Conf. Robot. Autom., Rome, Italy, Apr. 2007, pp. 4090–4095.

18. F. W. Warner, Foundations of Differentiable Manifolds and Lie Groups. New York: Springer-Verlag, 1987.

19. S. Benhimane and E. Malis, "Integration of Euclidean constraints in template based visual tracking of piecewise-planar scenes," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Beijing, China, Oct. 2006, pp. 1218–1223.

20. S. Baker, R. Gross, and I. Matthews, "Lucas–Kanade 20 years on: A unifying framework: Part 3," Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-RI-TR-03-35, 2003.

21. G. Silveira and E. Malis, "Real-time visual tracking under arbitrary illumination changes," in Proc. IEEE Comput. Vis. Pattern Recognit., Minneapolis, MN, Jun. 2007, pp. 1–6.

22. E. Malis, "Improving vision-based control using efficient second-order minimization techniques," in Proc. IEEE Int. Conf. Robot. Autom., Apr. 2004, vol. 2, pp. 1843–1848.

23. G. Silveira, E. Malis, and P. Rives, "The efficient E-3D visual servoing," Int. J. Optomechatron., vol. 2, no. 3, pp. 166–184, Jul. 2008.

24. D. Nistér, "An efficient solution to the five-point relative pose problem," in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2003, vol. 2, pp. II-195–II-202.


Geraldo Silveira

Geraldo Silveira received the B.Sc. (Hons.) degree from the State University of Campinas (UNICAMP), Sao Paulo, Brazil, and the M.Sc. degree from the Federal University of Rio Grande do Norte (UFRN), Rio Grande do Norte, Brazil, in 2000 and 2002, respectively, both in electrical engineering. He is currently working toward the Ph.D. degree at the Ecole des Mines de Paris (ENSMP) and the Institut National de Recherche en Informatique et en Automatique (INRIA), Sophia-Antipolis, France.

Geraldo Silveira In 2002, he joined the Centro de Pesquisas Renato Archer (CenPRA), Sao Paulo, as a Research Engineer. His current research interests include computer vision, vision-based control, and robotics.

Mr. Silveira was ranked in Top 10 of 2002 by the Brazilian Computer Society for the M.Sc. thesis. In 2004, he received the Best Master's Thesis of 2001–2003 Award endowed by SIEMENS.

Ezio Malis

Ezio Malis (A'03) received the Graduate degrees in electronics and automatics from the University Politecnico di Milano, Milan, Italy, and the Ecole Supérieure d'Electricité (Supélec), Gif-Sur-Yvette, Paris, France, both in 1995 and the Ph.D. degree in computer vision and robot control from the University of Rennes, Rennes, France, in 1998.

Ezio Malis In 2000, he joined the Institut National de Recherche en Informatique et en Automatique (INRIA), Sophia-Antipolis, France, as a Research Scientist. Prior to this, he spent two years as a Research Associate at the University of Cambridge, Cambridge, U.K. His current research interests include automatics, robotics, computer vision, and particular vision-based control.

Dr. Malis was the recipient of the IEEE King-Sun Fu Memorial Best Transactions Paper Award and the IEEE Wegbreit Best Vision Paper Award in 2002.

Patrick Rives

Patrick Rives (M'04) received the Doctorat de 3ième cycle degree in robotics from the Université des Sciences et Techniques du Languedoc, Montpellier, France, in 1981 and the Habilitation à diriger les recherches degree from the Université de Nice, Nice, France, in 1991.

Patrick Rives He was a Research Fellow with the Institut National de la Recherche Scientifique (INRS) Laboratory, Montreal, QC, Canada, for one year. In 1982, he joined the Institut National de Recherche en Informatique et en Automatique (INRIA), Rennes, France. He is a Research Director at INRIA Sophia Antipolis-Méditerranée and the Head of the project team Advanced Robotics and Autonomous Systems (ARobAS). His main research interests include sensor-based control applied to the navigation and control of mobile robots. He has also addressed the problems of autonomous navigation and SLAM for aerial, underwater, and urban vehicles.




INSPEC: Controlled Indexing

SLAM (robots), mobile robots, pose estimation, robot vision







