**A**CCURATE estimation of the ego-motion of a vehicle relative to the road is a key component for autonomous driving and computer-vision-based driving assistance. Using cameras instead of other sensors for computing ego-motion allows for a simple integration of ego-motion data into other vision-based algorithms, such as obstacle, pedestrian, and lane detection, without the need for calibration between sensors. This reduces maintenance and cost. In the robotics community as well, effective use of video sensors for obstacle detection and outdoor navigation has been a goal for many years.

Most of the work in estimating robot motion has been produced using stereo cameras, and can be traced back to Moravec's work [1]. Similar work has been reported elsewhere also (see [2], [3], [4]). Furthermore, stereo visual odometry has also been successfully used on Mars by the National Aeronautics and Space Administration (NASA) rovers since early 2004 [5]. Nevertheless, visual odometry methods for outdoor applications have also been produced, which use a single camera alone.

The problem of recovering relative camera poses and 3-D structure from a set of 2-D camera images has been largely studied for many years and is known in the computer vision community as “structure from motion” [6]. Very successful results have been obtained over long distances using either perspective or omnidirectional cameras (see [4] and [7]). The authors in [4] deal with the case of a stereo camera, and they also provide a monocular solution implementing a fully structure from motion algorithm that takes advantage of the five-point algorithm and random sample consensus (RANSAC) robust estimation. The authors in [7] provide two approaches for monocular visual odometry based on omnidirectional imagery. In the first approach, they use optical flow computation, while in the second one, full structure from motion.

Closely related to structure from motion is what is known in the robotics community as simultaneous localization and mapping (SLAM), which aims at estimating the motion of the robot while simultaneously building and updating the environment map. SLAM has most often been performed with other sensors than regular cameras; however, in the past years successful results have been obtained using single cameras alone (see [8], [9], [10], [11]). Recently, the authors in [10] presented a method for mapping large loops with a single hand-held camera. There, the authors extend Davison's work on visual 3-D-SLAM [9] and build outdoor, closed-loop maps much larger than previously achieved with visual input alone.

In this paper, we do not deal with visual SLAM, rather we concentrate on the development of a vision-based real-time method to estimate the motion of outdoor ground vehicles over long distances. In our approach, we used a single calibrated omnidirectional camera mounted on the roof of the car. We assume that the vehicle undergoes a purely 2-D motion over a predominant flat ground. Furthermore, because we wanted to perform visual odometry in city streets, flat terrains, as well as in motorways where buildings or 3-D structure are not always present, we chose to estimate the motion of the vehicle by tracking the ground plane.

Ground plane tracking has already been exploited by the robotics community for indoor visual navigation and most works have been produced using standard perspective cameras ([12]–[15]). In these works, the motion of the vehicle is estimated by using the property that the projection of the ground plane into two different camera views is related by a homography.

In this paper, we propose a similar approach for central omnidirectional cameras, but our goal is to estimate the ego-motion of the vehicle in outdoor environments and over long distances. Due to the large field of view (FOV) of the panoramic camera, interesting points from all around the car are extracted and matched from pairs of consecutive frames. Our key points are scale-invariant feature transform (SIFT) features [16], as they proved to work well with omnidirectional pictures also.

Furthermore, we want to extract those key points that belong to the ground plane only. To retain only these points and discard all the rest, we use a RANSAC-based outlier removal, which uses the constraint that coplanar points seen from different views are related by a homographic transformation. The remaining inliers are then used to compute the rotation and translation matrices. To update the motion, we use only the magnitude of the translation because the rotation estimated from features alone gives rise to large drift errors after several hundreds of meters. Conversely, to estimate the rotation angle of the vehicle, we use an appearance-based tracker. We show that by using this second tracker, the drift error stays very low over several hundreds of meters.

The performance of our approach has been evaluated on a real platform. We will show an example of a camera trajectory estimated purely from omnidirectional images over a distance of 400 m. For performance evaluation, the estimated path is superimposed onto a satellite image of the same test environment. Furthermore, we use image mosaicing to obtain a textured 2-D reconstruction of the estimated path.

This paper is organized as follows. Section II describes our homography-based ground plane navigation. Section III evaluates the performance of two different algorithms for tracking the ground plane. Section IV describes the appearance-based tracker. Section V summarizes the steps of the whole visual odometry algorithm. Finally, Section VI presents our experimental results.

SECTION II

## Homography-Based Ground Plane Navigation

The motion information that can be extracted by tracking 2-D features is central to our vehicle navigation system. Therefore, we briefly review a method that uses planar constraints and point tracking to compute the motion parameters.

### A. Homography and Planar Motion Parameters

Early work on exploiting coplanar relations has been presented by Tsai and Huang [17], Longuet-Higgins [18], and Faugeras and Lustman [19]. The coplanar relation between two different views of the same plane can be summarized as follows. Consider two camera-centered coordinate systems, frame 1 and frame 2, which are related by a rigid body transformation:
TeX Source
$${\bf X_{2}} = {\bf RX_{1}} + {\bf T}\eqno{\hbox{(1)}}$$ where **X**_{1},**X**_{2}∊ are the coordinates of a scene point relative to camera frames 1 and 2, respectively, and **R**∊**SO**(**3**), **T** ∊ are the rotation and the translation matrices encoding the relative position of the two coordinate systems. Now, assume that **X**_{1} lies on the plane defined by
TeX Source
$${\bf n^{T}X_{1}}=h\eqno{\hbox{(2)}}$$ where **n**∊ is the plane normal and *h* ∊ is the distance to the plane. Then, substituting (2) into (1) gives
TeX Source
$${\bf X_{2}}=\Big ({\bf R}+{ {\bf T}{\bf n^T}\over h} \Big){\bf X}_{1}.\eqno{\hbox{(3)}}$$ We call the matrix
TeX Source
$${\bf H} = {\bf R}+{ {\bf Tn}^{{\bf T}}\over h} \ \in {\bb R}^{3\times 3}\eqno{\hbox{(4)}}$$ the *(planar) homography matrix* since it denotes a linear transformation from **X**_{1} to **X**_{2} as
TeX Source
$${\bf X_2} = {\bf H} \cdot {\bf X_1}.\eqno{\hbox{(5)}}$$ Note that the matrix **H** depends on the motion parameters {**R, T**}as well as the structure parameters {**n**,*h*} of the plane.

Now, let *x*_{1},*x*_{2} ∊ be the normalized image coordinates of **X**_{1},**X**_{2} on the unit sphere, i.e.
TeX Source
$$\lambda _1 {\bf x_1} = {\bf X_1} \qquad \lambda _2 {\bf x_2} = {\bf X_2}\eqno{\hbox{(6)}}$$ where λ_{1}, λ_{2} are the depth factors.

Then, from (5), we have
TeX Source
$$\lambda _2 {\bf x_2} = {\bf H} \cdot \lambda _1 {\bf x_1}\eqno{\hbox{(7)}}$$ and then, we can write
TeX Source
$${\bf x_2} \sim {\bf H} \cdot {\bf x_1}\eqno{\hbox{(8)}}$$ where ∼ indicates the equality up to a scale factor.

Since (8) is defined up to a scale factor, **H** has only 8 DOF. This implies that four corresponding feature pairs (no three collinear) are required to linearly determine **H**. If more than four points are available, then a least-square solution can be searched. The algorithm we used in our implementation to recover **H** from a set of consistent point correspondences uses the normalized direct linear transformation (DLT) [6].

Observe that because the homography is defined up to a scale, we are able to recover **H** of the form
TeX Source
$${\bf H_L} = \lambda {\bf H} = \lambda \left({\bf R}+{ {\bf Tn^{T}}\over h} \right) \ \in {\bb R}^{3\times 3}\eqno{\hbox{(9)}}$$ for some (unknown) scale factor λ. However, as shown in [28], the scale factor can be computed as
TeX Source
$$\vert \lambda \vert = \sigma _2({\bf H_L})\eqno{\hbox{(10)}}$$ where σ_{2}(**H**_{L}) ∊ is the second largest singular value of **H**_{L}. Once λ is computed, the estimated homography can be normalized, i.e., **H** = **H**_{L}/λ.

Finally, observe that (8) also suggests a method to check whether a given set of points are coplanar. Namely, if we can select four coplanar corresponding point pairs that are in a sufficiently general configuration, then **H** can be computed and used to check whether the other points in the scene lie in the same plane. This is actually the principle of our RANSAC-based outlier removal and will be detailed in Section II-F.

### B. Omnidirectional Camera Model and Calibration

The model we used in our work is the generalized omnidirectional model proposed in [22]. This model describes the imaging function that maps world points into camera image points by means of an approximated Taylor series expansion whose coefficients are the calibration parameters. We used this model because the calibration is straightforward due to the availability of a Matlab toolbox available on the author's Web page and because it is also suitable to any central omnidirectional camera, both dioptric and catadioptric. However, it is important to remark that any other calibration method could have been adopted without restriction provided that the camera is central, i.e. , it possesses a single effective view point (SVP) [27]. When the SVP property is verified, any calibrated camera, both omnidirectional and perspective, can be represented as a set of optical rays emanating from the SVP to the viewing directions [see (6)]. The length of these optical rays can be chosen up to the user. Depending on the application, the rays can be projected onto the unit sphere [as we mentioned earlier before (6)] or onto a plane (perspective projection). The coordinates of these reprojected rays are usually referred to as normalized image coordinates. When the image coordinates are normalized, (8) is valid for both perspective and omnidirectional cameras.

### C. Homography or Euclidean Transformation?

In our experiments, we mounted the omnidirectional camera on the roof of the car (Fig. 1) with the *z*-axis of the mirror perpendicular to the ground plane (Fig. 2). By fixing the origin of our coordinate system in the center of projection of the omnidirectional camera (Fig. 2), we have that **n** = [0,0,−1]^{T}. The distance *h* of the origin to the ground plane can be measured manually.

According to the last considerations, the homography **H** has the form
TeX Source
$$\eqalignno{{\bf H} &= {\bf R}+{ {\bf Tn}^{{\bf T}}\over h} \cr &= \left[\matrix{ \cos \theta & -\sin \theta & 0 \cr \sin \theta & \cos \theta & 0\cr 0 & 0 & 1\cr } \right]+{ 1\over h} \left[\matrix{ t_1 \cr t_2 \cr 0 } \right] \cdot \left[\matrix{0 \cr 0 \cr -1} \right]^T \cr &= \left[\matrix{ \cos \theta & -\sin \theta & -t_1/h \cr \sin \theta & \cos \theta & -t_2/h\cr 0 & 0 & 1\cr } \right]&{\hbox{(11)}}}$$ where θ is the rotation angle of the camera about the *z*-axis, and *t*_{1} and *t*_{2} are the elements of **T**.

Equation (11) describes a Euclidean transformation on the image plane, which is a particular case of homography. The Euclidean transformation has only 3 DOF and allows us a more stable estimation of the motion with respect to the homography (i.e., 8 DOF) when the motion is constrained to be on the ground plane. However, because of the unavoidable vibrations the camera is subjected to during the motion of the vehicle as well as the nonperfect verticality of the camera, the form of **H** may appear slightly different from (11). Therefore, an 8-DOF homography is more appropriate than a Euclidean transformation to describe the relation between the two views.

In the next section, we will see two methods to decompose the homography to extract **R** and **T**: the “Triggs algorithm” and the “Euclidean method.” We will assume that the image coordinates **x**_{1} and **x**_{2} are matched correctly and satisfy the homography constraint (i.e., coplanarity). In Section II-F, we will explain how to extract these points.

In Section III, we will evaluate the performance of the two approaches when the planar motion constraint and the verticality of the camera are not perfectly satisfied.

### D. Decomposing **H**

If a camera is internally calibrated, it is possible to recover **R**, **T**, and **n** from **H** up to at most a twofold ambiguity. A linear method for decomposing **H** was originally developed by Wunderlich [25] and later reformulated by Triggs [26]. The algorithm of Triggs is based on the singular value decomposition of **H**. The description of this method as well as its Matlab implementation can be found in [26]. This algorithm outputs two possible solutions for **R**, **T**, and **n** that are all internally self-consistent. In the general case, some false solutions can be eliminated by sign (visibility) tests or geometric constraints, while in our case, we can disambiguate the solutions by choosing the one for which the computed plane normal **n** is closer to [0,0,−1]^{T}.

Once the two solutions are disambiguated, the rotation angle θ and the translation components *t*_{1}, *t*_{2} along the ground plane can be computed in the following manner. Indeed, because the Triggs algorithm also returns the normal **n** to the plane, and we are only interested in computing the motion along the plane, we can actually project the estimated **R** and **T** onto the plane using **n**. From these projections, the rotation angle θ and the translation components *t*_{1}, *t*_{2} along the ground plane are obtained.

In the remainder of this paper, we will refer to this method as the “Triggs algorithm.”

In our implementation, we used the Triggs algorithm but we sometimes interchanged it with another method also that we are now going to describe. Indeed, the Triggs algorithm works in general very well if the image points are spatially uniformly distributed on the camera image [see Fig. 3(a)]. If the image points are too close to a degenerate configuration or they are spatially distributed within one side of the whole omnidirectional image [Fig. 3(b)], then it is better to use the Euclidean approximation given in (11). The quantitative justification of this statement can be found in Section III where we evaluate the performance of the two methods.

Here, we describe how to use the Euclidean approximation to derive the rotation and translation parameters. By using **x**_{1} = [*x*_{1}, *y*_{1}, 1] and **x**_{2} = [*x*_{2}, *y*_{2}, 1], from (11), we have
TeX Source
$$\left\{\matrix{x_2 = cx_1-sy_1-a \cr y_2 = sx_1+cy_1-b} \right.\eqno{\hbox{(12)}}$$ with *c* = cos θ, *s* = sin θ, *a* = *t*_{1}/*h*, and *b* = *t*_{2}/*h.* Each point pair gives two equations, and hence given two point pairs we can linearly recover *c*, *s*, *a*, and *b.* When more point correspondences are given (say *n* corresponding pairs) a linear least-squares solution can be found with the pseudoinverse matrix method. To this end, observe that (12) can be rewritten as
TeX Source
$$\left[\matrix{ x_1 & -y_1 & -1 & 0 \cr y_1 & x_1 & 0 & -1 } \right] \cdot \left[\matrix{ c \cr s \cr a \cr b} \right] = \left[\matrix{ x_2 \cr y_2 \cr } \right]{\buildrel\triangle\over=}\, {\bf A} \cdot \left[\matrix{c \cr s \cr a \cr b} \right] = {\bf B}\eqno{\hbox{(13)}}$$ where **A**∊ ^{2n × 4} and **B**∊ ^{2n × 1}. The linear least-squares solution of (13) is [*c*, *s*, *a*, *b*]^{T} = **A**^{+}**B**, where **A**^{+} = (**A**^{T}**A**)^{−1}**A**^{T} is the pseudoinverse of **A**.

Observe that before solving (13), it is a good practice to normalize the data through the Hartley's normalization [6] (known as *preconditioning* in the numerical literature). This makes the linear solution more robust to image noise.

Finally, observe that the matrix **Q** = [*c*, −*s*; *s*, *c*] may not be orthonormal because of the method used to compute its coefficients *s* and *c.* However, we can compute an orthonormal matrix that better approximates **Q**.

The best rotation matrix **R**_{2D} to approximate **Q** in the Frobenius sense is **R**_{2D} = **UV**^{T}, where [**U, S, V**] = SVD(**Q**) and SVD(**Q**) is the singular value decomposition of **Q**.

Here, “best” is in the sense of the smallest Frobenius norm of the difference **R**_{2D}−**Q**, which solves the problem
TeX Source
$${\min _{\bf R_{2D}}} \Vert {\bf R_{2D}}-{\bf Q}{\Vert^2_{F}} \qquad \hbox{subject to} \, {\bf R_{2D}} \cdot {{\bf R_{2D}}^{\bf T}} = {\bf I}.\eqno{\hbox{(14)}}$$

Finally, from **R**_{2D}, the rotation angle θcan be computed easily. At the same time, *t*_{1}, *t*_{2} can be directly computed from *a* and *b* knowing *h.*

In the remainder of this paper, we will refer to this last method as the “Euclidean method.”

In the final implementation of our algorithm, we implemented both the Triggs algorithm and the Euclidean method. The trigger condition to use the one or the other is given by the spatial distributions of the image points. If the image points occupy both the left and the right half of the omnidirectional image [like in Fig. 3(a)], then the Triggs algorithm is used. If the image points are only in one half (i.e., either the left or the right one) of the image [like in Fig. 3(b)], then the Euclidean method is used.

### E. Maximum Likelihood Estimation

In Section II-D, we have described two different approaches to recover the translation parameters and the rotation angle of the vehicle, given a set of image correspondences that are assumed to lie on a plane. However, the solution given by both approaches is obtained by a linear method that minimizes an algebraic distance that is not meaningful physically. We can refine it through maximum likelihood inference. The maximum likelihood estimate can be obtained by minimizing the following functional:
TeX Source
$$\min _{\theta, t_1, t_2} \sum _{i=1}^n \Vert {\bf x}_1^i - {\bf \hat{x}_1}^i(\theta, t_1, t_2)\Vert ^2 + \Vert {\bf x_2}^{\!\!i} - {\bf \hat{x}_2}^{i}(\theta, t_1, t_2)\Vert ^2\eqno{\hbox{(15)}}$$ with . If the Triggs algorithm is used, **H** is defined as in [26], otherwise it is defined as in (11).

To minimize (15), we used the Levenberg–Marquadt algorithm. This algorithm requires an initial guess for θ, *t*_{1}, and *t*_{2}. As an initial guess, we used the linear solutions provided by (13) and (14).

### F. Coplanarity Check

The equations given in the previous sections assume that the corresponding image pairs **x**_{1} and **x**_{2} are correctly matched and that the points lie on the ground plane. Even though in omnidirectional images taken from the roof of the car, the ground plane is predominant, there are also many feature points that come from other objects than just the road, like cars, buildings, trees, guardrails, etc. Furthermore, there are also many unavoidable false matches that are more numerous than those usually output by the SIFT on standard perspective images (about 20%–30% according to [16]) because of the large distortion introduced by the mirror. To discard the outliers, we used the RANSAC paradigm [20]. The RANSAC steps in our case are formally the following.

Let **A** be the set of all image pairs output by SIFT from two consecutive frames. At each iteration, four putative corresponding pairs are randomly selected from **A** and a homography **H** is instantiated from these points (four is the minimum number of point pairs required to compute an 8-DOF homography).

The instantiated **H** is used to determine the subset **S**_{1} of point pairs in **A** that are within some error tolerance *d.* This subset **S**_{1} is called consensus set.

If the number of members in **S**_{1} is greater than some threshold *t*, which is a function of the expected number of outliers in **A**, then **S**_{1} is used to compute a new **H^***.

Otherwise, if the number of members in **S**_{1} is less than *t*, then a new subset **S**_{2} is randomly selected and the previous process is repeated.

If, after some predetermined number of trials, no consensus set with *t* or more members is found, then the homography **H^*** with the largest consensus set is used. How to the estimate *t* as a function of the number of outliers can be found in [20].

As an error measure to determine the subset **S**_{1} of pairs that are within the error tolerance *d*, we used the symmetric transfer error
TeX Source
$${\rm err}^i= \Vert {\bf x}_{\bf 2}^{i}-{\bf Hx}_{\bf 1}^{i}\Vert ^{\bf 2}+\Vert {\bf x}_{\bf 1}^{i}-{\bf H^{-1}}{\bf x}_{\bf 2}^{i}\Vert ^{\bf 2}. \eqno{\hbox{(16)}}$$ We reject every pair for which err^{i} > *d*, where *d* is computed statistically according to the Huber-type skipped means rule [21], i.e., *d* = 5.2MAD. MADstands for median absolute deviation and is defined as
TeX Source
$${\rm MAD}({\rm err})= {\rm median}_i\{{ \vert {\rm err}_i-{\rm median}_j({\rm err}_j)\vert }\}.\eqno{\hbox{(17)}}$$

Observe that if a prior estimate of the rotation angle is available (consider, for example, that coming from the visual compass of Section IV), then this can be used as a prestage in the first step of RANSAC to constrain the selection of putative matches. This has the effect of speeding up the search of inliers and reducing the percentage of outliers. In our implementation, we do this in the following manner: we assume a 3-DOF motion, then the rotation angle can be computed from at least two putative point pairs. If this rotation estimate is consistent (within a certain threshold) with that given by the visual compass, then the two point pairs are retained.

SECTION III

## Triggs Versus Euclid: A Performance Evaluation

In this section, we evaluate the performance of the Triggs and the Euclidean methods under the influence of image noise, point distribution, nonperfectly planar motion, and nonperfectly vertical camera. The analysis is done on simulated data.

In the simulated environment, we used the same omnidirectional camera model of our experiments. Also, the image size was assigned as in the real experiments, namely, 640 × 480 pixels. The self-occlusion of the camera was simulated by removing a circular region of radius *R*_{min} from the center of the omnidirectional image (like in Fig. 11). The length of the camera displacement was chosen equal to the maximum displacement of our real vehicle between two consecutive frames, namely 0.6 m. Also, the camera distance *h* to the ground plane was assigned as in the real experiments, namely 2 m.

In each simulation trial, we used ten feature points randomly selected from the ground plane. To randomly select these points, we used a uniform probability distribution over the interval [−5, 5] m along both the *x*- and *y*-directions. This point interval is consistent to that observed during the experiments. Once the features were selected, we projected them onto the image plane and considered only their pixel coordinates.

Our goal is to characterize the performance of the outputs of the Triggs and Euclidean methods when the planar motion constraint and the verticality of the camera are not perfectly satisfied. The outputs we are interested in are the motion parameters along the ground plane, i.e., the rotation angle θ (here called *yaw angle*) and the translation length along the ground plane , where *t*_{1},*t*_{2} are computed as given in Section II-D. For each simulation experiment, we will give the plots of the absolute errors of θ and **|T|**, respectively.

As a first study, we characterize the performance of the two algorithms with respect to Gaussian image noise. In the first stage, we consider the ideal case where the camera undergoes pure planar motion and the camera is perfectly vertical. In this experiment, the coordinates of all image points were corrupted by Gaussian noise with mean zero and variance σ. We varied σ from 0 to 2 pixels, and for each value, we ran 1000 independent trials where we computed the motion parameters θ and **|T|**. The results shown in Fig. 4 are the average of the absolute error. As observed, the error increases linearly with the noise variance. Furthermore, the error given by the Euclidean method is smaller than that given by the Triggs approach. This result was expected because the Euclidean method provides 3-DOF motion estimates that are more robust to image noise when the motion is purely planar and the camera perfectly vertical.

Conversely, when this assumption does not hold, the behavior of the two algorithms can be much different. As an example, consider Fig. 5 to see what happens if the camera does not appear perfectly vertical. To generate this plot, we kept the direction of motion still parallel to the ground plane but we slightly changed the orientation (roll and pitch angles) of the camera axis in the second position; in particular, we set roll = pitch = 1.0°. As observed, for small noise values, the Triggs algorithm performs better than the Euclidean one. Indeed, in the noise-free case, it can compute the exact orientation of the camera, while the Euclidean method cannot. Thus, for small noise values, it performs better than the Euclidean approach. Conversely, when the noise increases, the error on the six motion parameters becomes more relevant, while the error of the Euclidean method stays constant. In the next simulations, we will consider a noise variance of 0.3 pixels.

In Section II-D, we mentioned that we switch between the Triggs and the Euclidean method according to the distribution of the image points (see Fig. 3). The justification of this choice will now be given. First of all, consider that if the scene points are collinear, the homography cannot be computed because the points are in a degenerate configuration. In practice, however, the image points are always corrupted by noise or the scene points are nearly collinear. Thus, in both cases, the homography can be instantiated but, because the data are badly conditioned, the result of the homography decomposition by Triggs would diverge from the correct result. Conversely, the Euclidean transformation is not affected by the collinearity of the scene points. This brought us to the conclusion that if the scene points are nearly collinear (and so the image points lie nearly on a conic), the Euclidean transformation should be used; in all the other cases, the Triggs algorithm should be adopted. By performing several simulations, we actually found that it is better to use the Euclidean algorithm when the image points lie only on one half of the omnidirectional image (i.e., scene points potentially collinear).

To demonstrate this, we generated our random features along a line parallel to the *x*-axis and distant 1 m from the camera origin. Then, we perturbed the *y*-coordinate of the features with σ meters Gaussian noise. Finally, we projected the features onto the image plane and added 0.3 pixel Gaussian noise. We varied σ from 0 to 2 m and ran 1000 independent trials. The average values of the absolute error of the motion estimates are shown in Fig. 6. As observed, the Euclidean method performs better than the Triggs for every σ. However, this experiment was done by again considering pure planar motion and camera perfectly vertical; hence, the Euclidean method has to provide the best result.

Conversely, the effect of a nonperfectly vertical camera (namely, roll = pitch = 1°) can be seen in Fig. 7 As observed in these plots, the Euclidean method provides more accurate motion estimates than the Triggs method only when Σ < 0.2 m. This means that when Σ = 0.2, most features lie within 3 Σ = 0.6 m distance from the line. By increasing further on the inclination of the camera (roll, pitch > 1 °), we found that this threshold also increases. In practice, the inclination of the camera changes over the time due to jittery motion. This phenomenon motivated our decision of switching the two methods according to the distribution of the image points. Finally, observe that when σ approaches 0 (i.e., scene points are collinear), the motion estimates by Triggs diverges as expected.

Fig. 8 shows the sensitivity of the motion estimates to nonperfectly planar motion. We added a translation component *T*_{z} = 0.1 m along the *z*-axis and also varied the relative orientation of the camera in the second position. The sensitivity of the motion parameters is computed as a function of the roll angle of the camera in the second position. The roll was varied from 0° to 5°, and for each value, 1000 trials were run. For each run, 0.3 pixel noise was added to corrupt the image points. As observed, the Triggs algorithm always performs better than the Euclidean method, showing to be able to detect violations from the planar motion assumption. We also performed the same experiment by varying the pitch angle and we obtained similar plots.

Finally, Fig. 9 shows the influence of a nonperfectly vertical camera axis on the motion estimates. The motion is still assumed to be parallel to the ground plane. The plot is shown as a function of the angle between the plane normal and the camera axis. This angle was varied from 0° to 5° and for each value, 1000 trials were run. For each run, 0.3 pixel noise was added to corrupt the image points. As observed, the Triggs algorithm is able to detect deviations of the camera from the verticality.

SECTION IV

## Visual Compass

In Section II, we described how to use point features to compute the rotation and translation matrices. Unfortunately, when using features to estimate the motion, the resulting rotation is extremely sensitive to systematic errors due to the intrinsic calibration of the camera or the extrinsic calibration between the camera and the ground plane (the last case was studied in Section III). This effect is even more accentuated with omnidirectional cameras due to the large distortion introduced by the mirror. In addition to this, integrating rotational information over time has the major drawback of generally becoming less and less accurate as integration introduces additive errors at each step. An example of camera trajectory recovered only using the feature-based approach described in Section II is depicted in Fig. 13.

To improve the accuracy of the rotation estimation, we used an appearance-based approach. This approach was inspired by a recent work on the use of omnidirectional cameras as visual compass [23]. Directly using the appearance of the world as opposed to extracting features or structure of the world is attractive because methods can be devised that do not need precise calibration steps. Here, we describe how we implemented our visual compass.

For ease of processing, every omnidirectional image is unwrapped into cylindrical panoramas (see Fig. 10). The unwrapping considers only the white region of the omnidirectional image that is depicted in Fig 11. We call these unwrapped versions “appearances.” If the camera is perfectly vertical to the ground, then a pure rotation about its vertical axis will result in a simple column-wise shift of the appearance in the opposite direction. The exact rotation angle could then be retrieved by simply finding the best match between a reference image (before rotation) and a column-wise shift of the successive image (after rotation). The best shift is directly related to the rotation angle undertaken by the camera. In the general motion, translational information is also present. This general case will be discussed later.

The input to our rotation estimation scheme is thus made of appearances that need to be compared. To compare them, we used different similarity measures. In particular, we have tried cross-correlation, zero-mean normalized cross-correlation, *L*_{1}-norm (Manhattan distance), and an *L*_{2}-norm (Euclidean distance). The best results were obtained by using the Euclidean distance. A performance comparison with the other metrics is not given in this paper.

The Euclidean distance between two appearances *I*_{i} and *I*_{j}, with *I*_{j} being column-wise shifted (with column wrapping) by α pixels, is
TeX Source
$$d(I_i, I_j,\alpha)=\sqrt{\sum _{k=1}^h \sum _{h=1}^w \sum _{l=1}^c \vert I_i(k, h, l)-I_j(k, h-\alpha, l) \vert ^2}\eqno{\hbox{(18)}}$$ where *h* × *w* is the image size and *c* is the number of color components. In our experiments, we used the red, green, and blue (RGB) color space, thus having three color components per pixel.

Defining α_{m} the best shift that minimizes the distance *d*(*I*_{i},*I*_{j}, α_{m})≤ *d*(*I*_{i},*I*_{j},α) ∀ α ∊ , then the rotation angle Δ ϑ (in degrees) between *I*_{i} and *I*_{j} is
TeX Source
$$\Delta \vartheta = \alpha _m { 360\over w} .\eqno{\hbox{(19)}}$$

The width *w* of the appearance is the width of the omnidirectional image after unwrapping and can be chosen arbitrarily. In our experiments, we used *w* = 360, which means that the angular resolution was 1 pixel/°. To increase the resolution to 0.1°, we used cubic spline interpolation with 0.1 pixel precision. We also tried larger image widths, but we did not get any remarkable improvement in the final results. Thus, we used *w* = 360 as the unwrapping can be done in a negligible amount of time. The Euclidean distance between the two images in Fig. 10 as a function of the column-wise shift of the second image is shown in Fig. 12.

The distance minimization in (18) makes sense only when the camera undergoes a pure rotation about its vertical axis, as a rotation corresponds to a horizontal shift in the appearance. In the real case, the vehicle is moving and translational component is present. However, the “pure rotation” assumption still holds if the camera undergoes small displacements or the distance to the objects (buildings, tree, etc.) is large compared to the displacement. In the other cases, this assumption does not hold for the whole image, but an improvement that can be done over the theoretical method is to only consider parts of the images, namely the front and back part (see Fig. 11). Indeed, the contribution to the optic flow by the motion of the camera is not homogeneous in omnidirectional images; a forward/backward translation mostly contributes in the regions corresponding to the sides of the camera and very little in the parts corresponding to the front and back of the camera, while the rotation contributes equally everywhere.

Because we are interested in extracting the rotation information, only considering the regions of the images corresponding to the front and back of the camera allows us to reduce most of the problems introduced by the translation, in particular sudden changes in appearance (parallax).

According to the last considerations, in our experiments, we used a reduced FOV around the front and back of the camera (Fig. 11). The reduction of the FOV was done on both the horizontal and vertical planes. A reduced horizontal FOV of about 30° around the front part is shown by the white window in Fig. 10. Observe that we also reduced the FOV on the vertical plane, in particular under the horizon line. The reason was to reduce the influence of the changes in appearance of the road. The resulting vertical FOV was 50° above and 10° below the horizon line (the horizon line is indicated in red (gray) in Fig. 10).

The fact of reducing the FOV (especially on the horizontal plane) provided an important improvement over using the whole FOV in terms of stability and sensitivity to prominent features at the sides of the camera. The effect of the size of the horizontal FOV on the estimation of the camera trajectory is depicted in Fig. 14 and will be discussed in Section VI.

SECTION V

## Motion Estimation Algorithm

As already mentioned, the appearance-based approach provides rotation angle estimates that are more reliable and stable than those output by the pure feature-based approach. Here, we describe how we combined the rotation angle estimates of Section IV with the camera translation estimates of Section II.

In our experiments, the speed of the vehicle ranged between 10 and 20 Km/h, while the images were constantly captured at 10 Hz. This means that the distance covered between two consecutive frames ranged between 0.3 and 0.6 m. For this short distance, the camera configuration (*x*,*y*,θ), which contains its 2-D position (*x*, *y*) and orientation θ, can be approximated in this way
TeX Source
$$\left\{\matrix{\displaystyle x_{i+1} = x_i + \delta \rho _i \cos \left(\theta _i+{ \delta \theta _i\over 2}\right)\hfill \cr\displaystyle y_{i+1} = y_i + \delta \rho _i \sin \left(\theta _i+{ \delta \theta _i\over 2}\right)\hfill \cr\displaystyle \theta _{i+1} = \theta _i + \delta \theta _i\hfill}\right.\eqno{\hbox{(20)}}$$ where we use δ ρ = | **T** | *h* and δ θ = Δ ϑ.

Observe that **T** is the same translation vector used in Section II, and thus, , where *t*_{1} and *t*_{2} are computed as described in Section II-D and E. Parameter *h* is the scale factor (i.e., in our case, this is the height of the camera to the ground plane). The camera rotation angle Δ ϑ is computed as described in Section IV. Observe that we did not use at all the rotation estimates provided by the feature-based method of Section II.

Now, let us resume the steps of our motion estimation scheme, which have been detailed in Sections II and IV. Our omnidirectional visual odometry operates as follows.

Acquire two consecutive frames. Consider only the region of the omnidirectional image, which is between *R*_{min} and *R*_{max} (see Fig. 11).

Extract and match SIFT features between the two frames. Use the double consistency check to reduce the number of outliers. Then, use the calibrated camera model to normalize the feature coordinates by reprojecting them onto a plane perpendicular to the *z*-axis and distant 1 from the origin.

Unwrap the two images and compare them using the appearance method described in Section IV. In particular, minimize (18), with reduced FOV, to compute the column-wise shift between the appearances and use (19) to compute the rotation angle Δ ϑ.

Use RANSAC to reject points that are not coplanar (Section II-F); in particular, use the available rotation estimate Δ ϑ from the visual compass to speed up RANSAC, as explained at the end of Section II-F.

Apply the linear algorithm described in Section II-D to estimate **R** and **T** from the remaining inliers. In doing this, switch between the Triggs algorithm and the Euclidean method, as described in Section II-D. Then, refine **R** and **T** using maximum likelihood estimation (Section II-E).

Use δ ρ = | **T** | *h* and δ θ = Δ ϑ and integrate the motion using (20).

Repeat from step 1.

The approach proposed in this paper has been successfully tested on a real vehicle equipped with a central omnidirectional camera. A picture of our vehicle (a Smart) is shown in Fig. 1.

Our omnidirectional camera, composed of a hyperbolic mirror (KAIDAN 360 One VR) and a digital color camera (SONY XCD-SX910, image size 640 × 480 pixels), was installed on the front part of the roof of the vehicle. The frames were grabbed at 10 Hz, and the vehicle speed ranged between 10 and 20 km/h.

The resulting path estimated by our visual odometry algorithm using a 10° FOV is shown in Figs. 13, 16, and 17. Our ground truth is a satellite image of the same test environment provided by Google Earth (Fig. 16). The units used in the three figures are meters.

In this experiment, the vehicle was driven along a 400-m−long loop and returned to its starting position (pointed to by the yellow (gray) arrow in Fig. 16). The estimated path is indicated with red (gray) dots in Fig. 16 and is shown superimposed on a satellite image for comparison. The final error at the loop closure is about 6.5 m for the distance and 5° for the orientation. This error is due to the unavoidable visual odometry drift; however, observe that the trajectory is very well estimated until the third 90° turn. After this turn, the estimated path deviates smoothly from the expected path instead of continuing straight. After road inspection, we found that this deviation was due to three 0.3-m−tall road humps (indicated by the cyan (gray) circle in Fig. 16) that violate the planar motion assumption.

The content of Fig. 17 is very important as it allows us to evaluate the quality of motion estimation. In this figure, we show a textured top viewed 2-D reconstruction of the whole path. Observe that this image is not a satellite image but is an image mosaicing. Every input image of this mosaic was obtained by an inverse perspective mapping (IPM) of the original omnidirectional image onto an horizontal plane. This inverse mapping is always possible for central cameras, i.e., when a camera has a single effective viewpoint. After being undistorted through IPM, these images were merged together using the 2-D poses estimated by our visual odometry algorithm. The estimated trajectory of the camera is shown superimposed with red (gray) dots. If the reader visually and carefully compares the mosaic (see Fig. 17) with the corresponding satellite image (see Fig. 16), he will recognize in the mosaic the same elements that are present in the satellite image, i.e., trees, white footpaths, pedestrian crossings, roads' placement, etc. Furthermore, the reader can verify that the location of these elements in the mosaic fits well the location of the same elements in the satellite image.

As mentioned in Section IV, we also evaluated the effect of the reduced horizontal FOV on the final motion estimation. Fig. 14 shows the recovered estimated trajectory respectively using FOV = 10°, FOV = 20°, FOV = 30°, and FOV = 60°. Observe that the estimation of the trajectory improves as the FOV decreases. Indeed, as mentioned already in Section IV, the fact of reducing the FOV allows us to reduce most of the problems introduced by the translation, like sudden changes in parallax. The best performance in terms of closeness to the ground truth of Fig. 16 is obtained when FOV = 10°.

In Fig. 15, the effect of the FOV on the estimation of the heading direction is also shown. Also, the best performance is when FOV = 10°. In this case, in fact 90° turns are very well estimated. Furthermore, when FOV = 10° the heading direction stays quite constant after each turn, i.e., when the vehicle covers a straight path. Note that when the vehicle returns to its start position, the estimated heading direction is equal to 355°, which means that the orientation error at the loop closure is 5°.

Finally, a comparison of the proposed algorithm with the only feature based method of Section II is shown in Fig. 13.

In this paper, we described a real-time algorithm for computing the ego-motion of a vehicle relative to the road. The algorithm uses as input only those images provided by a single omnidirectional camera. The front ends of the system are two different trackers. The first one is a feature-based tracker that uses SIFT features and a RANSAC-based outlier removal to track the key points that most likely belong to the ground plane. The second one uses an appearance-based approach to give high-resolution estimates of the rotation angle of the vehicle. Using the first tracker to compute the vehicle displacement in the heading direction and the second tracker to compute the vehicle rotation proved to give very good visual odometry estimates under planar motion assumption. Furthermore, the performance of the motion estimation given by the proposed method is better than the pure feature-based approach (see Fig. 13).

The proposed algorithm was successfully applied to videos from an automotive platform. We gave an example of a camera trajectory estimated purely from omnidirectional images over a distance of 400 m. The accumulated error after 400 m (i.e., error at the loop closure) was about 6.5 m for the distance and 5° for the orientation. For performance evaluation, the estimated path was superimposed onto a satellite image of the same test environment and a textured 2-D reconstruction of the path was built.

### Acknowledgment

The authors would like to thank P. Lamon and L. Spinello for their useful help, suggestions, and comments.