**I**N ORDER TO autonomously navigate in an unknown environment, a robot must be able to build a representation of the surrounding map and self-localize with respect to it. Even though it is possible to perform the latter without the former by computer vision using an appropriate tensor (e.g., the essential matrix), precision may be lost rapidly. This happens because important structural constraints, e.g., the scene rigidity, are not effectively exploited in the long run. Since both estimation processes are intimately tied together, an appealing strategy is to perform them simultaneously. This is generally referred to as simultaneous localization and mapping (SLAM) in the robotics community. This class of methods focuses on computationally tractable algorithms that incrementally (i.e., causally) integrate information. At the expense of usually accumulating drift earlier, they are suitable for the real-time operation required by robotic platforms. A slightly different class of methods, mainly developed by the computer vision community, refers to structure-from-motion (SFM) techniques. Noncausal schemes fall into this latter class. These algorithms, mostly aimed at high levels of accuracy, are allowed to run in a time-consuming batch process. This paper focuses on the former class. The reader may refer to, e.g., [1] and [2] for some well-established SFM methods.

### A. Related Work

The techniques that simultaneously and causally reconstruct the camera pose and the scene structure can be divided into two classes, which are briefly discussed next.

*1) Feature-Based Methods to Visual SLAM*

A standard scheme to visual SLAM consists of first extracting a sufficiently large set of features (e.g., points, lines), and robustly matching them between successive images. These corresponding features are the input to the joint process of estimating the camera pose and scene structure. The majority of visual SLAM methods fall into this class, e.g., [3], [4], [5], independently of the applied filtering technique, e.g., extended Kalman filter (EKF-SLAM) [6] or FastSLAM 2.0 [7]. This represents the discrete case. Another possibility consists of computing the needed correspondences in the form of optical flow (the velocity). This has been exploited in, e.g., [8] and [9]. In both cases, since the prior step of data association is error-prone, care must be taken to avoid propagating its errors to subsequent steps. On the other hand, these methods may handle large interframe displacements of the objects.

*2) Direct Methods to Visual SLAM*

In this class of methods, the intensity values of the pixels are directly exploited to obtain the required parameters. That is, there is no prior step of data association: it is solved simultaneously. An important strength of these methods concerns the level of accuracy that they can attain. This characteristic is mainly due to the exploitation of all possible image information, even from areas where gradient information is weak. The reader may refer to, e.g., [10] for a deeper discussion of this subject.

In this spirit, the technique proposed in [11] can be assigned to this class. However, it does not consider the strong coupling between motion and structure, which are estimated separately from pixel intensities. Furthermore, it is sensitive to variable illumination. In that method, new information is initialized with a “best guess.” The technique proposed in [12], though using a unified framework, relies on the linearity of the image gradient. This limits the system to working under very small interframe displacements of the objects. This approach is relatively robust to lighting variations, but its model of illumination changes is overparameterized (which may lead, for example, to convergence problems). New information is initialized in a separate filter, and is inserted into the main filter after a probation period. Also, in a unified framework, central catadioptric cameras are adequately dealt with in [13]. The latter uses the same approximation method we use in this paper for obtaining the related optimal parameters. Nevertheless, its set of parameters is different from ours, not only because illumination changes are handled here, but also due to the structural constraints we explicitly enforce. Moreover, initialization is not a concern in that work.

### B. Overview of the Method

We formulate the visual SLAM problem as a nonlinear image registration task. In other words, we consider visual SLAM as the problem of estimating the appropriate parameters that optimally align a reference image with successive frames of a video sequence. A subset of the proposed parameters is naturally the camera pose and scene structure. Since the result of direct image alignments is such that each pixel intensity is matched as closely as possible across images, the technique in fact also returns a dense correspondence (see Fig. 1).

Roughly speaking, the optimal parameters are obtained as follows. Consider a parametric generative model that deforms (warps) an image. Using an estimate of the parameters, an image can be warped toward another one. The residual between the warped image and the second one is then used to iteratively refine the parameters of the model. In this paper, we focus on a deterministic optimal formulation of visual SLAM. As for the uncertainty calculations, one can either directly cast the image registration as a stochastic optimization problem, or couple the approach with a standard filtering technique (e.g., EKF). The latter alternative is considered here, but the former is believed to represent a promising research direction.

Despite the impressive computing power available to date, in a real-time setting, the entire image cannot, in general, be considered for processing. Therefore, an adequate selection of image regions is performed in this paper. Given that the selected regions may either leave the field of view or simply not fit the used models, the technique is able both to reject such regions and to automatically insert new ones. Also, to improve computational efficiency [14], the scene is geometrically modeled as a collection of planar surfaces. This modeling is also adopted by all direct methods mentioned in Section I-A.2.

### C. Contributions

In this paper, a new approach to visual SLAM is proposed. We formulate it as a direct image registration problem. In order to solve it efficiently, consistently, and robustly, a new photogeometric generative model is presented, i.e., besides the global and local geometric parameters, global and local photometric ones are considered as optimization variables as well. This enables the system to work under generic illumination changes and to achieve more accurate alignments. In turn, the global variables related to motion directly enforce the rigidity constraint of the scene within the minimization process. We remark that the proposed framework still preserves the advantages of motion parameterization using the Lie algebra. Regarding another key structural constraint of the scene, the positive depth constraint (i.e., cheirality), a new structure parameterization is proposed that enforces it during the optimization as well. Surprisingly, none of the existing direct approaches have exploited this constraint. The simultaneous enforcement within the optimization (instead of *a posteriori*) of all these structural constraints significantly contributes to improving robustness, stability, and accuracy.

Another contribution of this paper concerns the initialization of the visual SLAM. This is not a trivial issue, since the scene structure becomes observable only when the amount of translation is sufficiently large with respect to its depths [15], [16]. Given this ill-conditioning, some systems, e.g., [11], rely on a simple solution: one installs a known target in the environment and uses it in the initial frame. Other systems recover and decompose the essential matrix. However, if the scene is planar, then such a matrix is degenerate, which leads to an erroneous translation vector. In this paper, a new solution for initializing the system is proposed, whereby the environment is neither altered nor assumed to be nonplanar.

This paper is a revised and extended version of the visual SLAM approach that we proposed in [17]. In addition, more thorough experiments are carried out, and a technique to automatically insert new regions is described.

Besides the standard notations, in the sequel we adopt $\widetilde{\bf v}$, $\overline{\bf v}$, ${\bf v}'$, and $\Vert {\bf v} \Vert$ to, respectively, represent an increment to be found, an augmented version, a modified version, and the Euclidean norm of a variable **v**. Here, a superscripted asterisk, e.g., **v**^*, is used to represent a variable defined with respect to the reference frame, whereas a superscripted circle, e.g., **v**°, denotes its optimal value relative to a given cost function. Also, braces represent a set, e.g., {*v*_{i}}_{i = 1}^{n} = {*v*_{1}, *v*_{2}, …, *v*_{n}}, and **0** (respectively, **1**) is a matrix of zeros (respectively, ones) of appropriate dimensions. Moreover, let **p** = [*u*, *v*, 1]^⊤ be the homogeneous vector containing the image coordinates of a pixel. Then, we denote by *I*(**p**) the image intensity of the pixel **p**; bilinear interpolation is used for subpixel coordinates. Consider an image *I*^{∗} of a rigid scene. After displacing the camera by a rotation **R** and a translation **t**, another image *I* of the same scene is acquired. This motion can be represented by a homogeneous transformation matrix **T** ∈ $\mathbb{SE}(3)$.

### B. Plane-Based Two-View Epipolar Geometry

As previously stated, for efficiency reasons, we model the scene as a collection of planar regions. In this case, the coordinates of a pixel **p**^* in such a region of *I*∗ are linked to its corresponding **p** in *I* by a projective homography **G** [15]
$${\bf p} \, \propto \, {\bf G} \, {\bf p}^*.\eqno{\hbox{(3)}}$$ The symbol “∝” indicates proportionality up to a nonzero-scale factor. A warping operator **w** can then be defined as
$$\eqalignno{{\bf p} \, &= {\bf w}({\bf G},{\bf p}^*)&\hbox{(4)}\cr&= \left[{g_{11} u^* + g_{12} v^* + g_{13}\over g_{31} u^* + g_{32} v^* + g_{33}},{g_{21} u^* + g_{22} v^* + g_{23}\over g_{31} u^* + g_{32} v^* + g_{33}},1\right]^\top&\hbox{(5)}}$$ where {*g*_{ij}} denotes the elements of the matrix **G**.
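For concreteness, the warping operator of (4) and (5) can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's implementation; the function name `warp` is ours):

```python
import numpy as np

def warp(G, p_star):
    """Warping operator w of (4)-(5): apply the (3x3) homography G to a
    homogeneous pixel p* = [u*, v*, 1]^T and renormalize so that the
    last coordinate equals 1."""
    p = G @ p_star
    return p / p[2]
```

Since (3) holds only up to a nonzero scale factor, `warp(s * G, p_star)` returns the same pixel for any s ≠ 0.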

Consider the calibrated setting, where **K** denotes the upper triangular (3 × 3) matrix containing the camera's intrinsic parameters. Using the equation of the plane together with that of the rigid-body motion, **G** can be written as a function of the camera displacement and the scene structure
$${\bf G}({\bf T}, {\bf n}_d^*) \, \propto \, {\bf K} \, (\, {\bf R} + {\bf t}\, {\bf n}_d^{*\top}) \,{\bf K}^{-1}\eqno{\hbox{(6)}}$$ where ${\bf n}_d^*$ denotes the normal vector of the plane scaled by its distance to the reference camera frame.
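Equation (6) composes directly in NumPy (an illustrative sketch; the function name and the scale normalization are our choices):

```python
import numpy as np

def homography_from_motion(K, R, t, n_d):
    """Plane-induced homography of (6): G ~ K (R + t n_d*^T) K^{-1}.
    K is the (3x3) intrinsic matrix, (R, t) the camera displacement, and
    n_d the plane normal scaled by the inverse plane distance, expressed
    in the reference camera frame."""
    G = K @ (R + np.outer(t, n_d)) @ np.linalg.inv(K)
    return G / G[2, 2]  # fix the free scale of the proportionality
```

With zero displacement (R = I, t = 0), the induced homography reduces to the identity, as expected.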

### C. Model-Based Image Alignment Parameterized in $\mathbb{SE}(3)$

Consider a textured planar surface, or one that can be locally approximated by a plane. For simplicity, let us suppose for the moment that the scaled normal vector **n**_{d}^* (i.e., the metric model) of this planar target is known. We will show in Section III-C how the image alignment (registration) problem can be adequately solved when this metric model is unknown.

The problem of “metric model”-based direct image alignment can be formulated as a search for the optimal matrix to warp all the pixels in a region so that their intensity values match as closely as possible their corresponding ones in the current image *I* [19]. Since one seeks an optimal pose given a scene model, this technique can also be referred to as model-based visual odometry, or simply *localization*. To this end, a nonlinear minimization procedure has to be derived, since the pixel intensity *I*(**p**) is, in general, nonlinear in **p**. More formally, given an estimate of **T**, the problem is to find the optimal increment $\widetilde{{\bf v}}$ through an iterative method, e.g., [19], that solves
$$\min_{\tilde{{\bf v}} \in {\bb R}^6} \ {1\over 2}\sum_{{\bf p}_i^* \in {\cal R}^*} \left[{\cal I} \Bigl({\bf w}\bigl({\bf G} \bigl({\bf T}(\widetilde{{\bf v}}) \, \widehat{{\bf T}} \bigr),{\bf p}_i^* \bigr) \Bigr) - {\cal I}^*({\bf p}_i^*) \right]^2\eqno{\hbox{(7)}}$$ with an update of the transformation matrix as
$$\widehat{{\bf T}} \longleftarrow{\bf T}(\widetilde{{\bf v}}) \, \widehat{{\bf T}} \, = \,\exp \bigl({\bf A}(\widetilde{{\bf v}}) \bigr) \,\widehat{{\bf T}}\eqno{\hbox{(8)}}$$ by using the mapping (2). The arrow “←” denotes the update assignment within the iterations. The convergence may then be established when the increments become arbitrarily small. Due to the properties of this mapping, the resulting matrix in (8) always lies in the group, and hence, no approximation is performed. If this parameterization is not applied, the resulting matrix has to be projected onto its group manifold, clearly reducing the rate and domain of convergence. Therefore, the local parameterization (1) improves stability and accuracy, and thus is highly suitable to express incremental displacements. Another important property will be exploited in Section III-D to solve optimization problems such as (7) efficiently and with nice convergence properties.
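The update (8) can be sketched as follows (a NumPy/SciPy illustration; the ordering of the twist vector as translation first, then rotation, is our assumption, and the paper's mappings (1)-(2) are not reproduced here):

```python
import numpy as np
from scipy.linalg import expm

def twist_matrix(v):
    """A(v) in the Lie algebra se(3), with v = [nu; omega] (ordering is our
    assumption): omega fills a skew-symmetric block, nu the last column."""
    nu, om = v[:3], v[3:]
    A = np.zeros((4, 4))
    A[:3, :3] = np.array([[0.0, -om[2], om[1]],
                          [om[2], 0.0, -om[0]],
                          [-om[1], om[0], 0.0]])
    A[:3, 3] = nu
    return A

def update_pose(T_hat, v_tilde):
    """Update rule (8): T_hat <- exp(A(v_tilde)) T_hat; the result always
    remains a valid homogeneous transformation, with no projection step."""
    return expm(twist_matrix(v_tilde)) @ T_hat
```

A zero increment leaves the estimate unchanged, and the rotation block of the result is orthonormal by construction, which is precisely the point of parameterizing the update in the Lie algebra.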

## III. Proposed Direct Visual SLAM Approach

This section presents a unified approach where geometric and photometric models are appropriately included in a direct visual SLAM. Furthermore, it is also shown how to consistently and efficiently obtain the optimal global and local parameters related to all these models.

### A. Selection of Image Regions

In order to satisfy the real-time requirements, we select a set of nonoverlapping image patches according to an appropriate score. For direct methods, high scores should reflect strong image gradient along different directions.

Let the image region be a (*w* × *w*) matrix containing pixel intensities. Then, obtain a suitable gradient-based image *G*∗ from *I*∗. Given *G*∗, a score image *S*∗ can be defined as the sum of all values of *G*∗ within a (*w* × *w*) block centered at every pixel. A second criterion, possibly with a different weight, is based on the number of local extrema of *G*∗ (denoted *E*∗) within each block. This may prevent the system from assigning high scores to single peaks, which would define patches with the same drawbacks as regions defined around standard interest points (e.g., Harris corners): the neighborhood of an isolated point may not contain enough information to constrain all degrees of freedom. Other criteria are also possible, e.g., the degree of spread of the regions, but these two have proven sufficient.

All needed block operations are efficiently performed by a convolution (denoted by “⊗”) with the (*w* × *w*) kernel *K*_{w} = **1**
$$\eqalignno{{\cal S}^* \, &= \, \lambda \, {\cal G}^* \otimes{\cal K}_w + \eta \, {\cal E}^* \otimes {\cal K}_w&\hbox{(9)}\cr&=(\lambda \, {\cal G}^* + \eta \, {\cal E}^*) \otimes{\cal K}_w.&\hbox{(10)}}$$ Typical weights are λ = ‖*G*∗ ⊗ *K*_{w}‖^{−1} and η = 1. The resulting *S*∗ contains the scores, which are then sorted; no absolute threshold on the strengths needs to be tuned. The number of regions (defined around each top score) considered for further processing depends only on the available computing resources.
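The scoring of (9) and (10) amounts to one box convolution; a plain NumPy sketch follows (the direct double loop is for clarity only, and the function names are ours):

```python
import numpy as np

def block_sum(img, w):
    """'Same' convolution with the (w x w) all-ones kernel K_w: sum of the
    values inside a w x w block centered at each pixel (zero padding)."""
    r = w // 2
    pad = np.pad(img, r)
    out = np.empty(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = pad[i:i + w, j:j + w].sum()
    return out

def region_scores(G_img, E_img, w, eta=1.0):
    """Score image (10): S* = (lambda G* + eta E*) (x) K_w, using the typical
    weights quoted in the text: lambda = ||G* (x) K_w||^{-1}, eta = 1."""
    lam = 1.0 / np.linalg.norm(block_sum(G_img, w))
    return block_sum(lam * G_img + eta * E_img, w)
```

A real-time implementation would replace the double loop with a separable box filter or an integral image, which computes the same block sums in constant time per pixel.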

### B. Handling Generic Illumination Changes

An important issue to all vision-based methods is their robustness to variable lighting. A widely used technique to increase this robustness is to model the change in illumination as an affine transformation [20]. Despite the fact that improved results are obtained, only global changes are modeled.

Recently, we proposed in [21] a new model of illumination changes to cope with generic lighting variations. Illumination changes are viewed as a surface that can evolve with time. In that paper, we have successfully applied it to the direct visual tracking problem parameterized in the projective space. Here, we will show that this model can be straightforwardly applied to the efficient direct visual tracking problem parameterized in the Euclidean space. Indeed, for efficiency reasons, we use here the discretized realization of that generic model (see Fig. 2). Let the region have a sufficiently small size. Lighting variations are then explained by a *local* term α and a *global* term β, respectively:
$${\cal I}' (\alpha, \beta, {\bf p}_i) = \alpha \, {\cal I}({\bf p}_i) + \beta.\eqno{\hbox{(11)}}$$ This piecewise affine model (there is an α per region) can be interpreted as a photometric generative model for regulating the contrast of a particular region and the brightness of the entire image. This discretized model has been shown to be a good compromise between modeling error and computational complexity (it has few parameters and leads to a sparse Jacobian, as shown in Section IV). Nevertheless, it still does not require any prior knowledge about either the reflectance properties of the surface, which can be non-Lambertian, or the characteristics of the light sources, such as power, number, and their pose in space.
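As a sketch, applying (11) to a set of regions takes one contrast gain per region and a single global bias (NumPy; the function name is ours):

```python
import numpy as np

def photometric_model(patches, alphas, beta):
    """Piecewise-affine illumination model (11): one contrast gain alpha_j
    per region and one global brightness bias beta for the whole image."""
    return [alpha * patch + beta for alpha, patch in zip(alphas, patches)]
```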

We remark that the model (11) differs from existing ones when applied to different parts of the same image. For example, the method proposed in [12] uses an affine model consisting of two local parameters per region. That is, it does not explicitly consider the global variations, which represent, e.g., a shift in the camera gain. In this latter overparameterized formulation, the estimation of many more parameters is required. This may degrade frame-rate performance and, even worse, may lead to convergence problems. Another important difference regards how the related parameters are obtained. The global and local parameters of our model are simultaneously obtained by an efficient second-order approximation method, yielding nicer convergence properties.

In fact, given that an iterative procedure is used and that the update rule for the illumination parameters can simply be
$$\cases{\widehat{\alpha} \longleftarrow \widetilde{\alpha} + \widehat{\alpha} \cr \cr\widehat{\beta} \longleftarrow \widetilde{\beta} + \widehat{\beta}}\eqno{\hbox{(12)}}$$ we can define the transformed pixel intensity as
$${\cal I}' \bigl(\widetilde{{\bf v}}, \widetilde{\alpha},\widetilde{\beta}, {\bf p}_i^* \, \bigr) = (\widetilde{\alpha} +\widehat{\alpha}) \; {\cal I} \Bigl({\bf w}\bigl({\bf G}\bigl({\bf T}(\widetilde{{\bf v}}) \, \widehat{{\bf T}}\bigr), {\bf p}_i^* \bigr) \Bigr) + \widetilde{\beta} +\widehat{\beta}.\eqno{\hbox{(13)}}$$ This can then be viewed as a photogeometric generative model. Therefore, by incorporating (13), the model-based visual tracking problem (7) becomes
$$\min_{ {\matrix{\tilde{{\bf v}} \in {\bb R}^6\cr\tilde{\alpha}, \tilde{\beta} \in {\bb R}}}} \ {1\over 2}\sum_{{\bf p}_i^* \in {\cal R}^*} \bigl[\, {\cal I}'\bigl(\widetilde{{\bf v}}, \widetilde{\alpha}, \widetilde{\beta},{\bf p}_i^* \, \bigr) - {\cal I}^*({\bf p}_i^*) \, \bigr]^2.\eqno{\hbox{(14)}}$$

### C. Full System

Since the metric model of the scene is unknown *a priori*, its structure parameters must be included in (14) as optimization variables as well. Indeed, the depth of some image points (not necessarily image features) together with a regularization function can be used as these variables. The latter function is needed in two-image direct reconstructions in order to avoid obtaining an underconstrained system (more unknowns than equations). As stated previously, we represent the scene here as a collection of planar regions. This, in fact, acts as our regularization function. This choice leads to a versatile and computationally efficient description of the scene (it has few parameters and leads to a sparse Jacobian, as will be shown).

We include the structure parameters as follows. First, we parameterize the scaled normal vector by using the depths *z*_{i}∗ > 0 of any three noncollinear image points **p**_{i}∗, *i* = 1, 2, 3, within the region *R*∗ (e.g., its corners). For a 3-D point lying on the plane **n**_{d}^*, the equation of perspective projection gives
$${\bf n}_d^{*\top} \, {\bf K}^{-1} {\bf p}_i^* = {1\over z_i^*}.\eqno{\hbox{(15)}}$$ Using these three points, define the vector of inverse depths
$${\bf z}^* = \left[{1\over z_1^*}, {1\over z_2^*}, {1\over z_3^*} \right]^\top\eqno{\hbox{(16)}}$$ which is the natural value to be computed. The relation between both representations is then
$${\bf n}_d^* = {\bf M} \, {\bf z}^*\qquad \hbox{with} \;{\bf M} = {\bf K}^\top \! \left[\, {\bf p}_1^*,{\bf p}_2^*, {\bf p}_3^* \, \right]^{-\top} \in{\bb R}^{3 \times 3}.\eqno{\hbox{(17)}}$$ Next, given that the depths must be strictly positive scalars and that an iterative procedure has to be devised, we propose to parameterize them as
$${\bf z}^* \, = \, {\bf z}^*({\bf y}) = \exp({\bf y})>0,\qquad {\bf y} \in {\bb R}^3.\eqno{\hbox{(18)}}$$ This provides the update rule
$$\widehat{{\bf z}}^{\,*} \longleftarrow{\bf z}^*(\widetilde{{\bf y}}) \cdot\widehat{{\bf z}}^{\,*} \, = \, \exp(\widetilde{{\bf y}})\cdot \widehat{{\bf z}}^{\,*}\eqno{\hbox{(19)}}$$ where “·” denotes element-wise multiplication.
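Equations (15)-(19) can be sketched as follows (NumPy; the function names are ours). The first function builds **n**_{d}^* from the inverse depths via (17); the second applies the multiplicative update (19), which keeps every depth strictly positive:

```python
import numpy as np

def normal_from_inverse_depths(K, p1, p2, p3, z_inv):
    """Eq. (17): n_d* = M z*, with M = K^T [p1*, p2*, p3*]^{-T} and z* the
    vector of inverse depths (16) of three noncollinear pixels."""
    P = np.column_stack([p1, p2, p3])          # homogeneous pixels, 3x3
    M = K.T @ np.linalg.inv(P.T)
    return M @ z_inv

def update_inverse_depths(z_hat, y_tilde):
    """Eqs. (18)-(19): z* = exp(y) > 0, updated element-wise as
    z_hat <- exp(y_tilde) . z_hat, so cheirality holds at every iteration."""
    return np.exp(y_tilde) * z_hat
```

One can check (15) numerically with this sketch: the recovered normal satisfies **n**_{d}^{*⊤} **K**^{−1} **p**_{i}^* = 1/*z*_{i}^* for each of the three points, and the update never produces a nonpositive inverse depth.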

*Remark III.1 (Cheirality Constraint).* By using the proposed efficient parameterization of the structure (18), we enforce, within the optimization procedure, that the scene is always in front of the camera. That is, *z*_{i}∗ > 0 ∀ *i.*

Accordingly, the photogeometric generative model expressed in (13) has to be changed into
$$\eqalignno{& {\cal I}'' \bigl(\widetilde{{\bf v}},\widetilde{\alpha}, \widetilde{\beta}, \widetilde{{\bf y}},{\bf p}_i^* \, \bigr) \cr& = (\widehat{\alpha} \! + \!\widetilde{\alpha}) \; {\cal I} \Bigl({\bf w} \bigl({\bf G} \bigl({\bf T}(\widetilde{{\bf v}}) \,\widehat{{\bf T}}, \, {\bf n}_d^*({\bf z}^*(\widetilde{{\bf y}}) \cdot\widehat{{\bf z}}^{\,*}) \bigr), {\bf p}_i^* \bigr) \Bigr) \! + \!\widehat{\beta} \! + \! \widetilde{\beta}.\cr&&\hbox{(20)}}$$ Incorporating this modification into all regions *R*_{j}∗, *j* = 1, 2, …, *n*, our problem becomes
$$\min_{{\bf x} \in {\bb R}^{6+4n}} \ {1\over 2} \ \sum_j\sum_{{\bf p}_{ij}^* \in {\cal R}_j^*} \bigl[\;\underbrace{{\cal I}''({\bf x}, {\bf p}_{ij}^*) -{\cal I}^*({\bf p}_{ij}^*)}_{d_{ij}({\bf x})} \; \bigr]^2\eqno{\hbox{(21)}}$$ where **x** gathers 7 + 4*n* − 1 parameters, since the scale factor cannot be recovered from monocular images only. Thus, one has to fix it (to a strictly positive value) to obtain a consistent solution to the problem. It can be noted that the set **x** comprises both global geometric and photometric parameters (the motion increment $\widetilde{{\bf v}}$ and the bias $\widetilde{\beta}$), as well as local geometric and photometric ones (the structure increments $\widetilde{{\bf y}}_j$ and the gains $\widetilde{\alpha}_j$).

*Remark III.2 (Rigidity Constraint).* Observe that in formulation (21), the regions are not independently tracked. In fact, the rigidity constraint of the scene is explicitly enforced, within the optimization procedure also, since all regions share the same incremental motion parameters.

### D. Optimization Procedure

Concisely, our system (21) can then be interpreted as seeking the optimal value
$${\bf x}^\circ = \mathop{\arg\min}_{{\bf x} \in {\bb R}^{6+4n}} \ {1\over 2} \, \big \Vert \,{\bf d}({\bf x}) \, \big \Vert^2\eqno{\hbox{(22)}}$$ such that the norm of the vector of intensity discrepancies **d**(**x**) = {*d*_{ij}(**x**)} is minimized. In order to iteratively solve this nonlinear optimization problem, an expansion in Taylor series is first performed. To this end, another key technique to achieve nice convergence properties is to perform an efficient second-order approximation of **d**(**x**) [22]. Indeed, it can be shown that, neglecting the third-order remainder, a second-order approximation of **d**(**x**) around **x** = **0** is
$${\bf d}({\bf x}) = {\bf d}({\bf 0}) +{1\over 2} \bigl({\bf J}({\bf 0}) +{\bf J}({\bf x}) \bigr) \, {\bf x}.\eqno{\hbox{(23)}}$$ In our case, the current Jacobian **J**(**0**) is divided into the Jacobian relative to the motion parameters, the illumination parameters, and the structure parameters
$${\bf J}({\bf 0}) = \bigl[\, {\bf J}_{{\bf v}}({\bf 0}), \ {\bf J}_{\alpha\beta}({\bf 0}), \ {\bf J}_{{\bf z}^*}({\bf 0}) \, \bigr]\eqno{\hbox{(24)}}$$ where
$$\cases{{\bf J}_{{\bf v}}({\bf 0}) = \widehat{\alpha} \,{\bf J}_{{\cal I}} {\bf J}_{{\bf w}}{\bf J}_{\hat{{\bf T}}} {\bf J}_{{\bf V}}({\bf 0}) \cr{\bf J}_{\alpha\beta}({\bf 0}) =\bigl[\, \nabla_{\hat{\beta}} \, {\cal I}''({\bf 0}), \ \nabla_{\hat{\alpha}} \, {\cal I}''({\bf 0}) \, \bigr] =\bigl[\, 1, \ {\cal I} \, \bigr] \cr{\bf J}_{{\bf z}^*}({\bf 0}) = \widehat{\alpha} \,{\bf J}_{{\cal I}} {\bf J}_{{\bf w}}{\bf J}_{\hat{{\bf n}}^{\!*}} {\bf M} \, {\bf z}^*({\bf 0})}$$ by applying the chain rule. Correspondingly, the reference Jacobian **J**(**x**) is divided into
$${\bf J}({\bf x}) = \bigl[\, {\bf J}_{{\bf v}}({\bf x}), \ {\bf J}_{\alpha\beta}({\bf x}), \ {\bf J}_{{\bf z}^*}({\bf x}) \, \bigr]\eqno{\hbox{(25)}}$$ where
$$\cases{{\bf J}_{{\bf v}}({\bf x}) = \alpha \,{\bf J}_{{\cal I}^*} {\bf J}_{{\bf w}}{\bf J}_{{\bf T}} {\bf J}_{{\bf V}}({\bf x}) \cr{\bf J}_{\alpha\beta}({\bf x}) =\bigl[\ 1, \ {\cal I}^* \, \bigr] \cr{\bf J}_{{\bf z}^*}({\bf x}) = \alpha \,{\bf J}_{{\cal I}^*} {\bf J}_{{\bf w}}{\bf J}_{{\bf n}^{\!*}} {\bf M} \, {\bf z}^*({\bf x}).}$$

Applying a necessary condition for **x** = **x**° to be an extremum of our cost function in (22) gives
$$\nabla_{{\bf x}} \biggl({1\over 2} \,{\bf d}({\bf x})^\top{\bf d}({\bf x}) \biggr)\bigg\vert _{{\bf x} = {\bf x}^\circ} =\nabla_{{\bf x}} \bigl({\bf d}({\bf x}) \bigr)^\top\bigg\vert _{{\bf x} = {\bf x}^\circ}{\bf d}({\bf x}^\circ) = {\bf 0}.\eqno{\hbox{(26)}}$$ Provided that **J**(**x**)|_{x = x°} is full rank (see Section IV) and using (23) around **x** = **x**°, one has from (26)
$${1\over 2} \, \bigl({\bf J}({\bf 0}) +{\bf J}({\bf x}) \bigr) \, {\bf x}^\circ =-{\bf d}({\bf 0}).\eqno{\hbox{(27)}}$$ This is not a linear system in **x**° because of **J**(**x**). However, due to the suitable parameterization of the alignment (see Section II-C), we exploit the left-invariance property of the vector fields on Lie groups [18]. In fact, given that the space of the parameters **x** is homeomorphic to a Lie group, this property means that **J**_{V}(**x**) **x**° = **J**_{V}(**0**) **x**°. Then, under analogous conditions on the remaining Jacobians, the left-hand side of (27) can be written as
$$\eqalignno{& \! {1\over 2} \,\bigl({\bf J}({\bf 0}) +{\bf J}({\bf x}) \bigr) \,{\bf x}^\circ = {\bf J}' \, {\bf x}^\circ = \bigl[\,{\bf J}_{{\bf v}}', \, {\bf J}_{\alpha \beta}', \,{\bf J}_{{\bf z}^*}' \bigr] \, {\bf x}^\circ \cr&\!\! = \! {1\over 2} \,\Bigl[\widehat{\alpha} \, ({\bf J}_{{\cal I}} \! + \!{\bf J}_{{\cal I}^*}) {\bf J}_{{\bf w}}{\bf J}_{{\bf v}}'', \bigl[2,({\cal I} \! + \! {\cal I}^*) \bigr], \widehat{\alpha} \,({\bf J}_{{\cal I}} \! + \! {\bf J}_{{\cal I}^*}){\bf J}_{{\bf w}}{\bf J}_{{\bf z}^*}'' \! \Bigr] \, {\bf x}^\circ\cr&&\hbox{(28)}}$$ where ${\bf J}_{{\bf v}}''$ and ${\bf J}_{{\bf z}^*}''$ gather the remaining factors of the corresponding chain rules.

By appropriately stacking each **J**′ above to take into consideration all regions *j* = 1, 2, …, *n*, i.e.,
$$\eqalignno{\overline{{\bf J}'} &= \left[\matrix{{{\bf J}_{1{\bf v}}'} &\; {\bf 1} & {{\bf J}_{1\alpha}'} & {\bf 0} & {\bf 0} & {\bf 0}& {{\bf J}_{1{\bf z}^*}'} & {\bf 0} & {\bf 0} &{\bf 0} \cr{{\bf J}_{2{\bf v}}'} &\; {\bf 1} & {\bf 0} & {{\bf J}_{2\alpha}'} & {\bf 0} &{\bf 0} & {\bf 0} & {{\bf J}_{2{\bf z}^*}'} & {\bf 0} &{\bf 0} \cr\vdots & \; \vdots & {\bf 0} & {\bf 0} & \ddots &{\bf 0} & {\bf 0} & {\bf 0} & \ddots & {\bf 0} \cr{{\bf J}_{n{\bf v}}'} & \; {\bf 1} & {\bf 0} &{\bf 0} & {\bf 0} & {{\bf J}_{n\alpha}'} & {\bf 0} & {\bf 0} &{\bf 0} & {{\bf J}_{n{\bf z}^*}'}} \right]\cr&= \bigl[\ \overline{{\bf J}'}_{\!\!{\bf v}}, \ \overline{{\bf J}'}_{\!\! \alpha \beta}, \ \overline{{\bf J}'}_{\!\! {\bf z}^*} \, \bigr]&\hbox{(29)}}$$ a rectangular linear system is hence finally achieved
$$\overline{{\bf J}'} \; {\bf x}^\circ =-{\bf d}({\bf 0})\eqno{\hbox{(30)}}$$ whose solution **x**° is obtained in the least-squares sense by solving its normal equations. The optimal solution is found by iteratively updating the parameters according to (8), (12), and (19) until the displacements become arbitrarily small.
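The core linear-algebra step of one iteration reduces to solving (27)/(30); a minimal NumPy sketch, assuming the Jacobians have already been stacked as in (29) (illustrative, not the authors' implementation):

```python
import numpy as np

def second_order_step(J0, Jx, d0):
    """Solve (1/2)(J(0) + J(x)) x = -d(0), cf. (27) and (30), in the
    least-squares sense; only first-order derivatives are involved and
    no Hessian is ever formed explicitly."""
    J_avg = 0.5 * (J0 + Jx)
    x, *_ = np.linalg.lstsq(J_avg, -d0, rcond=None)
    return x
```

For a residual that happens to be linear, d(**x**) = **A x** + **d**(**0**), both Jacobians coincide and a single step already reaches the minimizer.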

Therefore, we provide a second-order approximation method that leads to a computationally efficient optimization procedure, because only first-order derivatives are involved. In other words, differently from second-order minimization techniques (e.g., Newton), the Hessians are never computed explicitly. This also contributes to obtaining nicer convergence properties. Furthermore, the proposed model of illumination changes together with the adopted representation of the scene yield sparse (block-diagonal) Jacobians $\overline{{\bf J}'}_{\!\!\alpha\beta}$ and $\overline{{\bf J}'}_{\!\!{\bf z}^*}$, respectively, as shown in (29). Efficiency is then further improved.

## IV. Initialization of the System

In this section, a method to initialize the proposed visual SLAM formulation is described. Essentially, the technique consists of a hierarchical framework in the sense of the number of parameters used to explain the image motion.

### A. Hierarchical Formulation

At the beginning of the task, the amount of translation may be small relative to the distance to the scene. If this occurs, the augmented Jacobian of the structure [see (29)] is ill-conditioned, which means that the structure parameters are not yet observable. In this situation, the motion parameters together with the illumination ones can explain most of the image differences. The same reasoning also applies once the optimal structure parameters (i.e., the map) have already been obtained. In this case, there is no reason to maintain them as optimization variables: besides the fact that their values may be perturbed (e.g., when the image resolution decreases), fewer parameters in the minimization mean more available computing resources. Once again, the motion and illumination parameters can explain most of the image discrepancies. As a matter of fact, in this case, the proposed visual SLAM approach effectively runs in a robust localization mode.

Therefore, for every new image, we initially attempt to align the regions by using only a subset of parameters from (30)
$$\bigl[\, \overline{{\bf J}'}_{\!\!{\bf v}}, \ \overline{{\bf J}'}_{\!\!\alpha\beta} \bigr] \ \bigl[\,\widetilde{{\bf v}}^{\circ\top} \!, \, \widetilde{\beta}^\circ,\, \bigl\{ \widetilde{\alpha}_j^\circ \bigr\}_{j=1}^n \bigr]^\top= -{\bf d}({\bf 0})\eqno{\hbox{(31)}}$$ whose solution is also obtained in the least-squares sense, and the estimates are then iteratively updated via (8) and (12). The structure parameters are only simultaneously used as optimization variables, i.e., by solving (30), whenever the difference between the cost value resulting from (31) and the one from the previous (image) optimization exceeds the image noise. We remark that, in any case, the structure (plus motion and illumination) parameters are required to compute the discrepancies **d**(**0**). These parameters can either be the optimal ones from preceding image registrations or an initial value. In fact, this shows how all past observations contribute to incrementally building and maintaining a coherent description of the map (and locations).

### B. Augmenting the Domain and the Rate of Convergence

A limitation of the visual SLAM approach proposed in Section III regards its domain of convergence. Although the parameters are obtained by a second-order approximation method with nice convergence properties, it does not ensure that the global minimum will be reached. Global optimization methods such as simulated annealing are too time-consuming to be considered in a real-time setting.

However, a possible solution to avoid getting trapped in local minima consists of using, e.g., feature-based techniques as a bootstrap to our method. We remark that even though a recovered set of parameters can represent a local minimum, it may be close to the global one. Hence, the regions may still have been effectively aligned in the image. A standard pose recovery technique can then be used with all these registered (i.e., corresponding) pixels. Afterward, the scene can be reconstructed by triangulating them [15]. In addition to augmenting the domain of convergence, this approach may also augment the rate of convergence. If the estimated motion and/or structure is closer to the true one than the estimate obtained by the proposed approach, it then acts as a prediction for aligning a new image.

Other predictors can additionally be tested to improve the convergence properties. In fact, the coupling between the deterministic image registration proposed in Section III and a probabilistic filtering technique can be performed at this stage. Here, we use a variable-order Kalman filter to provide both another estimate of the optimization variables and their covariances. The inputs (i.e., observations) to the filter are the parameters recovered by the optimization process. In order to initialize the system (i.e., when a new image is available), the best set of parameters among all predictors is simply chosen by comparing their resulting cost values.
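The predictor-selection rule just described can be sketched in a few lines; the dictionary of candidate predictors and the cost callable are illustrative assumptions, not the paper's interface:

```python
def select_initial_guess(predictors, cost):
    """Initialize the optimization for a new image: evaluate every
    available predictor (e.g., identity, feature-based bootstrap,
    Kalman-filter forecast) and keep the one with the smallest cost.
    predictors: dict name -> parameter vector; cost: callable."""
    best = min(predictors, key=lambda name: cost(predictors[name]))
    return best, predictors[best]

# Illustrative usage with a toy quadratic cost:
toy_cost = lambda p: sum(x * x for x in p)
candidates = {"identity": [1.0, 1.0], "kalman": [0.1, 0.0]}
name, params = select_initial_guess(candidates, toy_cost)
```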

## VI. Experimental Results

In order to validate the algorithm and assess its performance, we have tested it with both synthetic and real-world images. All results can be found as multimedia material published in IEEE Xplore with this paper. In all cases, trivial initial conditions are used. The photometric error is measured here by its rms (32). The *j*th region is declared an outlier if either ε_{j} > 20 or its geometric error exceeds 50%. The rms of the image noise is considered to be 0.6 levels of grayscale. Moreover, we emphasize that no sensory device other than a single camera is used.
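The outlier-rejection rule just stated can be written as a small predicate; the function name and the convention of expressing the geometric error as a ratio are assumptions for illustration:

```python
def is_outlier(rms_photometric_error, geometric_error_ratio,
               photo_thresh=20.0, geo_thresh=0.5):
    """Region rejection rule used in the experiments: the j-th region is
    declared an outlier if its rms photometric error exceeds 20 levels
    of grayscale or its geometric error exceeds 50%."""
    return (rms_photometric_error > photo_thresh
            or geometric_error_ratio > geo_thresh)
```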

### A. Pyramid Sequence

A synthetic scene was constructed so that a ground truth is available. It is composed of four planes arranged in a pyramidal form and cut by another plane at its top. In order to simulate realistic situations as closely as possible, textured images were mapped onto the planes. Then, a sequence of images was generated by displacing the camera while varying the illumination conditions. With respect to the trajectory, the camera performs a circular motion. The objective is twofold. First, returning the camera to the starting pose offers an important benchmark for SLAM algorithms. Second, this aims to show that past observations *de facto* contribute, within the proposed incremental technique, to building and maintaining a coherent description of the structure and motion. With respect to the lighting variations, they are created by applying an α^{(k)} that linearly changes the image intensities up to 50% of their original value, and a β^{(k)} that varies sinusoidally with an amplitude of 50 levels of grayscale.
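The synthetic lighting variation can be sketched as follows, assuming the affine illumination model I′ = αI + β; the exact ramp and sinusoid parameterizations below are illustrative reconstructions of the stated 50% intensity change and 50-grey-level amplitude:

```python
import numpy as np

def apply_illumination(images, alpha_max_change=0.5, beta_amplitude=50.0):
    """Apply the synthetic lighting variation of the pyramid sequence:
    alpha^(k) ramps linearly up to a 50% change of the original
    intensities, and beta^(k) varies sinusoidally with an amplitude of
    50 levels of grayscale."""
    n = len(images)
    out = []
    for k, I in enumerate(images):
        alpha = 1.0 + alpha_max_change * k / max(n - 1, 1)  # linear ramp
        beta = beta_amplitude * np.sin(2 * np.pi * k / n)   # sinusoid
        out.append(np.clip(alpha * I + beta, 0.0, 255.0))   # stay in range
    return out
```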

We have then compared our approach (see some SLAM results in Fig. 3), which started with 50 regions of size 21 × 21 pixels, with traditional methods as well as with a direct method. With regard to standard methods, we used SIFT keypoints (1025 matches were initially found), and the subpixel Harris detector along with a zero-mean normalized cross-correlation with mutual consistency check for matching the latter points (235 were initially matched). Other than the initial ones, no features or regions are initialized here. Moreover, there is a relevant difference in how feature correspondences are established along the sequence: while the keypoints are matched between the first (reference) and the current images, the latter have to be tracked between successive images. In all cases, corresponding features were fed into a random sample consensus (RANSAC) procedure (typically 300 trials) with the state-of-the-art five-point algorithm [24] for robustly recovering the pose. This corresponds to a standard feature-based framework where a two-image reconstruction is considered and a nonplanar scene is assumed (because of the five-point algorithm). The comparisons are depicted in Fig. 4, where those strategies are referred to as S + R + 5P and H + ZNCC + R + 5P, respectively. Since the scale factor is supposed to be unknown, the translation error is measured by the angle between the actual and the recovered translation directions. Notice that, despite exploiting many more features, the standard techniques obtain relatively larger errors, especially for large displacements (i.e., in the middle of the loop) and significant lighting changes. In addition, the results show an increasing percentage of outliers and a rapidly decreasing number of corresponding features. Therefore, to avoid an early failure, these methods certainly require a more frequent replacement of features.
As a remark, despite their relatively inferior accuracy, feature-based methods can have a larger domain of convergence and thus may be used as a bootstrap to our technique (as discussed in Section IV-B). For the requested accuracy, the proposed approach performed a median of seven iterations per image along the sequence, returned a median photometric error of 9.84 levels of grayscale, and used a median of 10.4% of each (500 × 500) image. For this sequence, where exact camera intrinsic parameters are available, the proposed method realized a drift between the initial and final poses (since a closed loop is performed) of less than 0.001% of the total amount of translation and of 0.091° for the rotation. This shows that precise results with minimal drift are obtained.
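The scale-invariant translation error used in these comparisons is the angle between the true and recovered translation directions, which can be computed as follows (a small sketch; the function name is illustrative):

```python
import numpy as np

def translation_direction_error(t_true, t_est):
    """Angle (in degrees) between the true and the recovered translation
    directions. Scale-invariant: only the directions are compared, since
    monocular reconstruction is up to an unknown scale factor."""
    u = np.asarray(t_true, dtype=float)
    v = np.asarray(t_est, dtype=float)
    u /= np.linalg.norm(u)
    v /= np.linalg.norm(v)
    # clip guards against values marginally outside [-1, 1]
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))
```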

With respect to existing direct methods, we have made a comparison with [12]. Given that the displacements (motion and illumination) were not very small, which violates its assumptions, that algorithm failed at the beginning of the sequence. Our solution is able to deal with larger interframe displacements. The method proposed in [11] could not be applied since the scene is supposed to be unknown and it is not possible to alter the environment (that method needs a known target for initialization).

### B. Hangar Sequence

The application of the proposed technique to this outdoor sequence (see Fig. 1) also has a twofold objective. First, it aims at offering a didactic overview of the method, especially concerning the insertion of new information (the second region). Second, it shows its degree of robustness to different kinds of noise, e.g., shaking motion, image blur, etc. Very importantly, although we model the scene as a collection of planar regions, some occluding nonplanar objects appear throughout the sequence, e.g., see the tree in Fig. 1(a). These disturbances have not significantly perturbed the estimation process since they carry substantially less information compared to other parts of the patches. For the requested accuracy, the approach performed a median of five iterations per image along the sequence, and returned a median photometric error of 13.37 levels of grayscale. The recovered angle between the two walls is 89.7°, using a median of 22.59% of each (320 × 240) image. This geometric measure is also an important benchmark for evaluating the technique (considering that these walls are truly perpendicular), since pose and structure are intimately tied together. The total displacement of the camera is approximately 50 m, and the images were captured by a hand-held camcorder at 25 Hz.

### C. Canyon Sequence

We also ran the proposed algorithm on a representative urban sequence, captured at approximately 12 Hz. It is also a challenging sequence in the sense that large interframe displacements are carried out, the objects are disposed at very different distances from the camera, and there is a significant change in scale. Furthermore, it corresponds to a typical urban scenario where cameras can be of particular importance for localization: narrow streets. In this case, positions from GPS may not be available or not sufficiently reliable. The obtained results are shown in Fig. 5, where the visual SLAM is successfully performed. The starting image was chosen such that the dominant plane is farther away from the initial camera pose, compared to [17]. This choice aims to show the limitation of the optimization approach, which is local by nature. Notice that in the beginning of the task, despite the fact that the regions are effectively aligned in the images, the recovered motion and structure are not coherent with the true ones (see the first camera poses in Fig. 5). This means that the algorithm got trapped in a local minimum. Thanks to the solution proposed in Section IV-B, this minimum is adequately treated and the correct parameters are subsequently obtained. For the requested accuracy, the approach performed a median of 12 iterations per image along the sequence, returned a median photometric error of 10.77 levels of grayscale, used a median of 34 image regions of size 31 × 31 pixels (at the time they are selected), and exploited a median of 17.01% of each (760 × 578) image. The total displacement of the camera is approximately 60 m.

### D. Round-about Sequence

This sequence is also illustrative since other different types of noise are present, e.g., pedestrians and moving vehicles. Nevertheless, the technique automatically coped with such outliers. Excerpts from this sequence and the obtained SLAM results can be seen in Fig. 6. We can observe that coherent motion and structure are recovered. For the requested accuracy, the approach performed a median of ten iterations per image along the sequence, returned a median photometric error of 11.37 levels of grayscale, used a median of 37 image regions of size 31 × 31 pixels (at the time they are selected), and exploited a median of 10.84% of each (760 × 578) image. This sequence was captured at approximately 12 Hz by a camera mounted on a car, and the path length measured with Google Earth is approximately 150 m.