Bound and Conquer: Improving Triangulation by Enforcing Consistency

We study the accuracy of triangulation in multi-camera systems with respect to the number of cameras. We show that, under certain conditions, the optimal achievable reconstruction error decays quadratically as more cameras are added to the system. Furthermore, we analyse the error decay rate of major state-of-the-art algorithms with respect to the number of cameras. To this end, we introduce the notion of consistency for triangulation, and show that consistent reconstruction algorithms achieve the optimal quadratic decay, which is asymptotically faster than that of some other methods. Finally, we present simulation results supporting our findings. Our simulations have been implemented in MATLAB and the resulting code is available in the supplementary material.


INTRODUCTION
Cameras are finite; yet, we commonly assume Gaussian noise models with infinite tails. Clearly, this disparity, which appears across all of science, is offset by the approximation accuracy and mathematical convenience of the Gaussian distribution. However, by replacing cost functions based on the ℓ2-norm with their ℓ∞ counterparts, the computer vision community has begun to investigate the feasibility of bounded noise models [3]-[6].
In this paper, we continue this trend by formalising some of the benefits of bounded noise models in triangulation problems; i.e., problems which aim to estimate the three-dimensional (3-D) positions of feature points from their two-dimensional (2-D) projections.
For example, consider triangulating from a set of calibrated cameras. In the noise-free, infinite-resolution case, the exact location of the world point can be reconstructed by intersecting rays originating from each camera. However, in practice, various sources of uncertainty mean that the rays do not necessarily intersect. Error minimisation techniques are thus employed with the goal of finding the most probable reconstructed world point [4], [7]. The degree to which we achieve this goal depends on two factors of the cost function: its correct modelling of the uncertainty and how accurately we can find its global minimum.
The simplest approach is to construct an overcomplete set of linear equations, each corresponding to one of the rays, and take the pseudo-inverse to find the least-squares solution. Although we find the global minimum, the cost function is not particularly meaningful and it is unlikely that it provides the best model of the underlying uncertainty. However, this technique performs reasonably well, particularly if the coordinates are correctly normalised [7], [8], and it is a good choice when time complexity is the principal concern.
Alternatively, we can attempt to minimise the ℓ2-norm of the reprojection error between the prospective 3-D points and the known image locations, which results in the maximum-likelihood estimator if we assume that the projected points are subjected to i.i.d. zero-mean Gaussian noise in the image plane. Although this assumption is very likely a much better approximation of the underlying uncertainty, the resulting cost function is non-convex and extremely difficult to solve.
Although for a small number of views the global minimum can be found by polynomial root finding [7], [9]-[14], the degree of the resulting polynomial grows quadratically with the number of views [11] and thus this is, in general, not practical. Therefore, often, we have to resort to iterative approaches that only converge to a local minimum, or more time-consuming branch and bound techniques [15]. On a positive note, it is possible to verify whether a solution is globally optimal in the ℓ2-sense [16].
Inspired by these difficulties, the ℓ∞-norm has recently been considered as a measure of the reprojection error [3], [4], [17]. Although this may not correspond to the best model of the underlying uncertainty, the resulting cost is a quasi-convex function and efficient algorithms can find the global optimum [5]. However, since the ℓ∞-norm implicitly assumes a bounded noise model, care must be taken to limit the potentially catastrophic effect of outliers [18], [19].
In this paper, we analyse the performance of multi-camera systems and reconstruction algorithms under the assumption of bounded noise and pixelisation. We provide two main contributions. First, we prove that, under certain conditions, the highest achievable point localisation accuracy of a multi-camera system is quadratically related to the number of cameras in the system. Second, we introduce the notion of consistency and show that consistent reconstruction algorithms achieve the optimal quadratic decay rate.
In the stereo case, there have been a number of excellent studies thoroughly analysing the point reconstruction error with relation to multiple parameters [20], [21]. Furthermore, the error of depth estimation in linear camera arrays has been analysed [22].
However, to the best of our knowledge, there is no work analysing arbitrary camera setups and certainly no work deriving fundamental scaling laws for the accuracy of point reconstruction with respect to the number of cameras. Given the rapid increase in popularity of multi-camera systems, we hope this analysis provides a significant contribution.
Our work is inspired by results derived from frame quantisation. In particular, similar error decay rates have been derived for signal reconstruction from overcomplete quantised projections [23]-[25]. In imaging terms, this corresponds to circular arrays of 2-D pixelised orthographic cameras. Furthermore, these results have recently been generalised to uniform bounded noise and any consistent estimate [26], [27]. We extend these results to more general arrays of 3-D cameras using central projections, which is more typical in the context of imaging.
In what follows, we give a very brief overview of the problem setup and formally define the triangulation problem. After introducing major state-of-the-art techniques, we present our main results on the error decay rate of consistent reconstruction algorithms and the best possible performance of multi-camera systems. Finally, simulations are provided to support our findings.

Pinhole camera model
As is typical, we assume the pinhole camera model, which we will now briefly summarise. For a more thorough introduction, we refer the reader to [28].
As depicted in Figure 1, a pinhole camera projects a point in 3-D space, called the world point, to a point on the camera's 2-D image plane via a central projection. In homogeneous coordinates, pinhole projection can be expressed as a linear matrix multiplication:

ǔ_h = P U_h,

where ǔ_h and U_h are the homogeneous representations of the projected point and world point, respectively. The 3 × 4 matrix, P, is the camera matrix, which can be decomposed as

P = K [R | −RC].

Here C denotes the camera centre, R the camera orientation, and | the column-wise concatenation operator. Together, the camera centre and the camera orientation form the extrinsic, or pose, parameters. Finally, the matrix K contains the intrinsic camera parameters: namely, the focal length f and the coordinates of the principal point p = (c_x, c_y).
In this paper, it will generally be more convenient to work with the Cartesian coordinates of Euclidean geometry. In this case, we will write the non-linear pinhole projection as

ǔ = ( ǔ_h[1] / ǔ_h[3], ǔ_h[2] / ǔ_h[3] ),

where ǔ and U are the Cartesian coordinate representations of the projected point and world point, respectively, and ǔ_h = P U_h is the homogeneous representation of the projected point.
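The projection pipeline above can be sketched in a few lines. This is an illustrative Python sketch (the paper's own simulations are in MATLAB), and the intrinsic values used in the example below are arbitrary:

```python
import numpy as np

def project(K, R, C, U):
    """Project a 3-D world point U with a pinhole camera.

    K: 3x3 intrinsic matrix, R: 3x3 rotation matrix,
    C: camera centre in world coordinates.
    Returns the 2-D image point in Cartesian coordinates.
    """
    P = K @ np.hstack([R, -R @ C.reshape(3, 1)])  # P = K [R | -RC]
    u_h = P @ np.append(U, 1.0)                   # homogeneous projection
    return u_h[:2] / u_h[2]                       # Cartesian conversion
```

For example, with f = 500, principal point (320, 240), identity orientation, and the camera at the origin, a world point on the optical axis projects exactly to the principal point.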

Sources of uncertainty
Due to various sources of uncertainty, the true image location ǔ is perturbed to yield the measurement u. The error term ǫ = u − ǔ incorporates both deterministic (e.g. pixelisation) and random (e.g. measurement noise) perturbations.

Pixelisation
Even when we are in a hypothetical noiseless scenario, we still have to deal with the uncertainty caused by the finite resolution of the camera sensors. This source of uncertainty, which we call pixelisation, is deterministic and, when projected back to the world space, leads to semi-infinite regions instead of rays. These regions, originating from the boundaries of the pixels, partition the world space into a finite number of regions. Each region consists of all world points whose projections map to the same pixel. Figure 2 depicts a simple example of such a partitioning for a camera with sixteen pixels.
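To make the pixelisation model concrete, here is a small Python sketch (not from the paper's code) that quantises a continuous image point to the centre of its pixel; the pixel size is a free parameter:

```python
import numpy as np

def pixelise(u, pixel_size):
    """Quantise a continuous image point to the centre of its pixel,
    modelling the deterministic uncertainty of a finite-resolution sensor."""
    u = np.asarray(u, dtype=float)
    return (np.floor(u / pixel_size) + 0.5) * pixel_size
```

All image points that fall within the same pixel are mapped to the same quantised value, which is exactly why each world-space region in Figure 2 collapses to a single measurement.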
When multiple cameras view the same region of interest, these regions intersect, producing a finite number of regions, each corresponding to a particular combination of pixels in the cameras. Clearly, smaller regions lead to smaller uncertainty.

Fig. 2. Pixelisation in a digital camera.

Non-deterministic sources of uncertainty
In reality, pixelisation is not the only source of uncertainty; additional perturbations arise from image sensor noise, as well as from the error of corner localisation algorithms.
These perturbation sources, combined with pixelisation, eventually lead to ambiguity in measuring the exact location of an image point. This can be modelled as an additive noise, yielding u = ǔ + ǫ.

Bounded noise models
In this paper, we will be interested in bounded noise models:

‖ǫ‖_q ≤ δ,

where q specifies the shape of the bounded noise and δ is referred to as the bandwidth of the noise.
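For simulation purposes, noise satisfying such a bound can be drawn by rejection sampling from the enclosing ℓ∞ ball. The following is a hypothetical Python helper (not part of the paper's material):

```python
import numpy as np

def bounded_noise(rng, delta, q=np.inf, dim=2):
    """Draw a noise vector uniformly from the ell_q ball of radius delta,
    by rejection sampling from the enclosing ell_infinity ball."""
    while True:
        eps = rng.uniform(-delta, delta, size=dim)  # uniform in the cube
        if np.linalg.norm(eps, ord=q) <= delta:     # keep if inside ell_q ball
            return eps
```

For q = ∞ every draw is accepted; for q = 2 roughly π/4 of the draws are accepted in two dimensions.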
The main advantage of bounded noise is that it allows us to completely dismiss regions of the solution space as impossible and, as we will show, the size of the remaining feasible region decays in such a way that the squared reconstruction error decays quadratically with the number of measurements.
One may question the applicability of bounded noise. Clearly, in the presence of outliers, additional work must be done to prevent these methods from breaking down. However, if outlier-removal techniques are applied, bounded approaches can be useful. It would be interesting to fully investigate this comparison for practical problems; however, this is beyond the scope of this theoretical study.

The triangulation problem
This paper focuses on triangulation: a fundamental problem in multiple-view geometry which, as well as being interesting in its own right, provides a basic building block for many higher-level computer vision tasks, such as visual metrology, Simultaneous Localisation And Mapping (SLAM), and Structure from Motion (SfM).
The aim is to recover the location of an unknown 3-D point U from its projections in M calibrated cameras; i.e., the camera matrices P_1 through P_M are known and we estimate U from the measured projections u_1, …, u_M. To be able to prove things about triangulation, we formally define it as follows:

Definition 1. A triangulation problem takes as input the set {(u_i, P_i) : 1 ≤ i ≤ M} and estimates the underlying unknown 3-D world point U as

Û = argmin_V ‖ ( ‖u_1 − π_{P_1}(V)‖_{p′}, …, ‖u_M − π_{P_M}(V)‖_{p′} ) ‖_p.

Here, π_{P_i} denotes the projection operator corresponding to the camera matrix P_i and the (u_i, P_i) pairs denote the camera matrices of the M cameras along with the projections of the unknown 3-D world point U on their image planes.
In the previous definition, p′ and p are known as the image-space and residual-space norms, respectively. When we talk about algorithms that minimise the (ℓ_{p′}, ℓ_p)-norm of the reprojection error, for some particular p′ and p, we are referring to the image-space and residual-space norms in this order.
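The cost in Definition 1 can be evaluated directly: apply the image-space norm p′ to each camera's 2-D residual, then the residual-space norm p across cameras. A minimal Python sketch of this evaluation (the camera matrices and points used below are arbitrary examples, not from the paper):

```python
import numpy as np

def reprojection_error(V, cameras, points, p_img=2, p_res=2):
    """(ell_p', ell_p)-norm of the reprojection error of a candidate point V.

    The image-space norm p' is applied to each camera's 2-D residual;
    the residual-space norm p is applied across the M cameras.
    """
    residuals = []
    for P, u in zip(cameras, points):
        u_h = P @ np.append(V, 1.0)      # project the candidate point
        reproj = u_h[:2] / u_h[2]        # back to Cartesian coordinates
        residuals.append(np.linalg.norm(reproj - u, ord=p_img))
    return np.linalg.norm(residuals, ord=p_res)
```

Triangulation then amounts to minimising this function over V, which is easy to state but, as discussed above, hard to solve globally for the (ℓ2, ℓ2) choice.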

Equivalence with camera localisation
As an aside, we briefly mention the connection between triangulation and a restricted version of camera localisation. In particular, by replacing the M cameras and single feature point of the triangulation problem with M feature points and a single camera, one can easily show that the triangulation problem, as just defined, is mathematically equivalent to localising a single camera from the image points of M feature points at known locations. The catch is that the equivalence is only valid if one assumes that the orientation of the camera is known. Of course, in most practical problems, the camera orientation is not known and heuristics must be applied if one wishes to adapt triangulation techniques to camera localisation. Since this paper focuses on a mathematical analysis, we restrict our analysis to triangulation but note that the scaling laws we derive also apply to this restricted version of camera localisation. Deriving the scaling laws for the full camera localisation problem is an interesting open research problem.

RECONSTRUCTION ALGORITHMS
In this section, we briefly review the main techniques used to solve triangulation and other geometric reconstruction problems.

Linear triangulation
The simplest approach to triangulation is to construct a linear system of equations. Let p_l^T be the l-th row of a camera matrix P. Then, since ǔ_h = P U_h,

ǔ[1] (p_3^T U_h) − p_1^T U_h = 0,

and similarly for ǔ[2]. Therefore, for a single camera, we have two linear equations in the entries of U_h. For M cameras, we can stack the 2M equations into a matrix A ∈ R^{2M×4}, with A U_h = 0. This equation can be solved efficiently with standard techniques, such as the Singular Value Decomposition (SVD). However, due to the conversion to homogeneous coordinates, this does not minimise the desired cost; i.e., a norm of the residual vector.
The advantages of linear triangulation are its speed and simplicity, and it performs particularly well when the cameras are at almost the same depth from the point of interest. Its robustness can also be further improved by normalising the focal length and other metric distances in the problem instance [7], [8].
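As a concrete illustration of linear triangulation, the following Python sketch stacks the two homogeneous equations per camera and recovers U_h as the right singular vector of A with the smallest singular value (an illustrative sketch, not the paper's MATLAB implementation):

```python
import numpy as np

def linear_triangulate(cameras, points):
    """Linear (DLT) triangulation: stack two homogeneous equations per
    camera into A, then take the right singular vector of A associated
    with the smallest singular value as U_h."""
    rows = []
    for P, u in zip(cameras, points):
        rows.append(u[0] * P[2] - P[0])  # u[1] (p3^T U_h) - p1^T U_h = 0
        rows.append(u[1] * P[2] - P[1])  # u[2] (p3^T U_h) - p2^T U_h = 0
    _, _, Vt = np.linalg.svd(np.array(rows))
    U_h = Vt[-1]                         # null-space direction of A
    return U_h[:3] / U_h[3]              # back to Cartesian coordinates
```

With exact, noise-free measurements the stacked system has an exact null vector and the true point is recovered; with noise, the smallest singular vector gives the algebraic least-squares solution.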

Reprojection error minimisation
If increased computational resources are available, one can directly attempt to minimise (7) for different values of p′ and p.
The most common choice is the (ℓ2, ℓ2)-norm, but, as explained in the introduction, the resulting cost function is non-convex and often difficult to solve exactly. Therefore, often, linear triangulation is used to initialise a gradient descent approach, thus accepting convergence to a local minimum [16], [28].
The ℓ∞-norm corresponds to the assumption of bounded noise. In the case of the (ℓ2, ℓ∞)-norm, the image points are bounded to circles on the image plane, which back-project to cones in the 3-D world space. Consequently, Second Order Cone Programming (SOCP) can be applied [4].
As with all approaches based on bounded noise, ℓ∞-based methods suffer from being extremely sensitive to even a single outlier. Therefore, it is critical that these techniques are either combined with standard outlier removal techniques or, as has recently been proposed, that relaxations are applied [18], [19], [30].

CONSISTENT RECONSTRUCTION AND THE ACCURACY OF MULTI-CAMERA SYSTEMS
In this section, we present two results concerning the accuracy of multi-camera systems as more cameras are added to the system.
In the first, we prove a lower bound for the average reconstruction error of any multi-camera system, over a region of interest.This lower bound decreases quadratically as more cameras are added to the system.
Next, we introduce the concepts of consistency and consistent reconstruction and then prove that the reconstruction error of a consistent reconstruction algorithm is upper bounded by a term that decreases quadratically as more cameras are added to the system. Therefore, consistent reconstruction algorithms achieve the optimal error decay rate. This is not necessarily the case for other algorithms: linear triangulation, for example, can be shown to yield a linear error decay rate with respect to the number of cameras.

Lower bound for the accuracy of a multi-camera system
We would like to lower-bound the best possible reconstruction error achievable by a multi-camera system, using any possible triangulation algorithm.To do this, we first formally define what we mean by a triangulation algorithm.

Definition 2. A triangulation algorithm is any mapping from a set of measurements {(u_i, P_i) : 1 ≤ i ≤ M} to an estimate Û ∈ R³.
Since we are seeking a lower bound, it makes sense to limit the uncertainties in the system. Therefore, in the following theorem, we assume that pixelisation is the only source of uncertainty; however, note that the size of the pixels is arbitrary and thus this is a mild assumption that is interesting even if an image point can be localised with subpixel precision.

Theorem 1. Consider a multi-camera system of M cameras, each with an N × N pixel image sensor, and define a fixed region of interest, R, with a finite non-zero volume. If we assume that the only source of uncertainty is pixelisation, the expected reconstruction error of any triangulation algorithm is lower-bounded by a term that is inverse-quadratically dependent on the number of cameras:

E[‖U − Û‖²] ≥ c / M²,

for some constant c > 0 depending on the region of interest and the sensor resolution,
where U ∈ R is any point in the region of interest, and Û is the result of reconstructing U, from its images in the multi-camera system, using any triangulation algorithm. Here, the expectation is taken over the location of the point U in the region of interest.
Proof. For the proof, please refer to the supplementary material.
The above theorem states that, under certain assumptions, no triangulation algorithm can do better than a quadratic decay with respect to the number of cameras, regardless of the camera setup. In what follows, we show that, if the camera array is properly constructed, the expected reconstruction error of certain triangulation algorithms can be upper-bounded by a term that decays quadratically with the number of cameras. Therefore, in doing so, we show that these triangulation algorithms reach the best possible decay rate. So which triangulation algorithms achieve this optimal decay? It turns out that the key property is consistency.

Consistency and Consistent Reconstruction
A key advantage of bounded noise models is that, given a noisy image point, we can restrict the true image point to a finite 2-D region, which we call an image-space consistency region:

B_{u_i,δ} = { v ∈ R² : ‖v − u_i‖_q ≤ δ }.

The shape of these regions depends on the type of bounded noise: a circle when q = 2, a diamond when q = 1 and a square when q = ∞. Note that we assume that q ≥ 1, so the norm is properly defined. In this case, the image-space consistency region is convex.
If we back-project an image-space consistency region into the world space, we obtain a convex 3-D region, which we call a world-space consistency region:

V_{u_i,δ,P_i} = { V ∈ R³ : ‖π_{P_i}(V) − u_i‖_q ≤ δ }.

Again, depending on the type of bounded noise (q), these 3-D regions can be either a cone, a diamond-based pyramid, or a square-based pyramid.
We know that the true 3-D point must lie in the intersection of all world-space consistency regions:

V_{u,δ,P} = ⋂_{i=1}^{M} V_{u_i,δ,P_i},

where P = {P_i : i ∈ [1, M]}. For a particular geometric reconstruction problem, we call this region the consistent region and any estimate that lies within it a consistent estimate. In addition, if a reconstruction algorithm always returns a consistent estimate, we call it a consistent reconstruction algorithm. This is stated more formally in the following definition.

Definition 3.
A triangulation algorithm is consistent, over the region of interest R, if its estimate Û satisfies Û ∈ V_{u,δ,P} for any U ∈ R and valid projection matrices P = {P_i : 1 ≤ i ≤ M}. Note that a consistent triangulation algorithm returns no estimate when V_{u,δ,P} = ∅.
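Membership in the consistent region is easy to test: an estimate is consistent exactly when its reprojection lies within ℓq distance δ of every measurement. A minimal Python sketch of this check (illustrative, not from the paper's code):

```python
import numpy as np

def is_consistent(U_hat, cameras, points, delta, q=np.inf):
    """Check that the reprojection of U_hat lies within ell_q distance
    delta of every measurement, i.e. U_hat is in the consistent region."""
    for P, u in zip(cameras, points):
        u_h = P @ np.append(U_hat, 1.0)
        reproj = u_h[:2] / u_h[2]
        if np.linalg.norm(reproj - u, ord=q) > delta:
            return False
    return True
```

By construction, the true world point always passes this check when the noise really is δ-bounded, since each of its reprojections sits inside the corresponding image-space consistency region.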

Finding a consistent estimate
As stated in the following proposition, ℓ∞-based triangulation is consistent.
Proposition 1. Consider a multi-camera system viewing a point and assume that the image points are subjected to ℓq-norm bounded noise: ‖u_i − ǔ_i‖_q ≤ δ. Then, any algorithm that minimises the (ℓ_q, ℓ∞)-norm of the reprojection error is a consistent triangulation algorithm.
Proof. For the proof, please refer to the supplementary material.
For ℓ2-bounded noise, the consistent regions are cones and the SOCP technique outlined by Kahl et al. [4] returns a consistent estimate.
For ℓ∞-bounded noise, the following simple linear program (LP) can be used to find a consistent estimate. Recall that, for the linear triangulation algorithm, we used p_l^T to denote the l-th row of a camera matrix P. To avoid homogeneous coordinates, let us separate the first three elements from the last, writing p_l^T = [p̄_l^T, p_l4], so that

ǔ[1] = (p̄_1^T U + p_14) / (p̄_3^T U + p_34),

and similarly for ǔ[2]. For bounded noise, with bandwidth δ,

|u[1] − ǔ[1]| ≤ δ.

Therefore, for a single camera, we have

(u[1] − δ)(p̄_3^T U + p_34) ≤ p̄_1^T U + p_14 ≤ (u[1] + δ)(p̄_3^T U + p_34),

and similarly for u[2]. Note that, here, we have assumed that the point is in front of the camera, so that p̄_3^T U + p_34 > 0. For M cameras, we can stack the 4M inequalities, producing a matrix A ∈ R^{4M×3} and vector b ∈ R^{4M}, such that AU ≤ b. Any point satisfying these constraints is consistent and thus the following LP will return a consistent estimate:

minimise c^T U subject to AU ≤ b.

Here, the vector c ∈ R³ dictates which point in the consistent region is the optimum. As we show in the next section, for many cameras, any consistent estimate is good and thus in our simulations we use a standard LP solver with c = 0.
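The construction of A and b and the resulting LP can be sketched as follows, here using `scipy.optimize.linprog` as one possible off-the-shelf solver (an assumption on our part; the paper's simulations use a MATLAB LP solver):

```python
import numpy as np
from scipy.optimize import linprog

def consistent_estimate(cameras, points, delta):
    """Return any point in the consistent region via an LP with c = 0.

    cameras: list of 3x4 camera matrices P_i
    points:  list of measured 2-D image points u_i
    delta:   noise bandwidth (per-coordinate ell_infinity bound)
    """
    A_rows, b_rows = [], []
    for P, u in zip(cameras, points):
        p3, q3 = P[2, :3], P[2, 3]
        for k in (0, 1):
            pk, qk = P[k, :3], P[k, 3]
            # upper bound: p_k.U + q_k <= (u_k + delta)(p_3.U + q_3)
            A_rows.append(pk - (u[k] + delta) * p3)
            b_rows.append((u[k] + delta) * q3 - qk)
            # lower bound: (u_k - delta)(p_3.U + q_3) <= p_k.U + q_k
            A_rows.append((u[k] - delta) * p3 - pk)
            b_rows.append(qk - (u[k] - delta) * q3)
    res = linprog(c=np.zeros(3), A_ub=np.array(A_rows), b_ub=np.array(b_rows),
                  bounds=[(None, None)] * 3, method="highs")
    return res.x if res.success else None
```

With c = 0 the solver simply returns some feasible point, which by construction reprojects within δ (in the ℓ∞ sense) of every measurement.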
As an aside, we note that more complex LPs can be used to select a more desirable point in the consistent region. As just stated, this will only be beneficial for small M, since asymptotically they will achieve the same performance.
For example, one can take the previous linear program and remove all redundant constraints. Assume this produces M̄ equivalent non-redundant constraints: ĀU ≤ b. Then, an LP that finds the point in the consistent region that minimises the average distance to the M̄ planes at the limit of the constraints can be designed as follows. Let ā_l^T be the l-th row of Ā. Then, the minimum distance from any point V ∈ R³ to the l-th plane is

d_l(V) = |b_l − ā_l^T V| / ‖ā_l‖_2.

We thus wish to solve

minimise over V: (1/M̄) Σ_{l=1}^{M̄} d_l(V), subject to ĀV ≤ b.

It is a standard exercise in linear programming to convert this problem into standard form using M̄ auxiliary variables, one for each distance.

Error Decay in Consistent Reconstruction
We will shortly present a theorem stating that, for circular camera arrays, the expected reconstruction error of consistent algorithms is upper bounded by a term that decays quadratically with the number of cameras. However, since simulations suggest that the result holds in many more cases, including linear camera arrays and even quite general random setups, we present the following more general conjecture, which is numerically tested in the following section.
Conjecture 1. Define the region of interest, R, to be a sphere of finite radius r and place a point anywhere in this region. Place M cameras inside a larger finite-radius sphere, with the same centre as R, i.i.d. uniformly at random such that they all see the whole region of interest. Furthermore, assume that the images of the world point in the cameras are perturbed with uniform bounded noise; i.e., for the world point U, the image u_i in the i-th camera is computed as

u_i = π_{P_i}(U) + ǫ_i,

where ǫ_i is zero-mean uniform bounded random noise satisfying ‖ǫ_i‖_q ≤ δ.
In this situation, the expected reconstruction error of any consistent triangulation algorithm is upper-bounded by a term which decreases quadratically with the number of cameras; i.e.,

E[‖U − Û‖²] ≤ C / M²,

for some constant C > 0, where U ∈ R is any point in the region of interest, and Û is the result of reconstructing U, from its images in the multi-camera system, using a consistent triangulation algorithm.
Here, the expectation is taken over both the noise vector ǫ and the camera locations.

Now, the theorem for circular camera arrays. Once again, simulation results are presented in the following section to support this theorem.

Theorem 2. Place M cameras in a plane, i.i.d. uniformly at random on a finite-radius circle, oriented towards the centre of the circle. Define the region of interest, R, to be the intersection of the fields of view of all cameras as M → ∞, and place a point anywhere in this region.
Furthermore, assume that the images of the world point in the cameras are perturbed with uniform bounded noise; i.e., for the world point U, the image u_i in the i-th camera is computed as

u_i = π_{P_i}(U) + ǫ_i,

where ǫ_i = [ǫ_{i,x}, ǫ_{i,y}]^T and ǫ_{i,x}, ǫ_{i,y} are zero-mean uniform bounded random variables with bandwidth δ.
In this situation, the expected reconstruction error of any consistent triangulation algorithm is upper-bounded by a term which decreases quadratically with the number of cameras; i.e.,

E[‖U − Û‖²] ≤ C / M²,

for some constant C > 0, where U ∈ R is any point in the region of interest, and Û is the result of reconstructing U, from its images in the multi-camera system, using a consistent triangulation algorithm.
Here, the expectation is taken over both the noise and the camera locations.
Proof. The proof makes use of [27, Corollary 6.2] and appears in the supplementary material.

SIMULATIONS
We now present simulations to verify our theoretical results. To approximate the expected reconstruction error, we take the average of many realisations. This means that any algorithm we use is run many times and thus has to be extremely robust. Unfortunately, this prevented us from using the branch and bound technique proposed in [29], since we were unable to prevent their implementation from crashing for certain realisations. Therefore, for the (ℓ2, ℓ2)-norm we used the non-linear approach from Hartley and Zisserman [28]. The technique is based on Newton iterations and we initialised it with the solution of linear triangulation. For the linear triangulation, we used the normalised implementation by the same authors [28]. For the (ℓ2, ℓ∞)-norm, we used the SOCP implementation provided by Kahl [4] and, finally, for the (ℓ∞, ℓ∞)-norm minimisation, we used the linear program (LP) defined in (17) with c = 0. Of course, this just returns a consistent estimate and does not fully minimise the norm.

Numerical simulation of Conjecture 1
We simulated the setup explained in Conjecture 1. In order to generate cameras uniformly at random that see the whole region of interest, we use rejection sampling; i.e., we repeatedly generate a camera centre and rotation matrix uniformly at random and reject those that do not see the whole region of interest. To generate rotation matrices uniformly at random, we use the technique outlined in [32].
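One standard way to draw rotation matrices uniformly at random (which may or may not coincide with the method of [32]) is to take the QR decomposition of a Gaussian matrix with a sign correction. A Python sketch:

```python
import numpy as np

def random_rotation(rng):
    """Draw a rotation matrix uniformly at random (Haar measure on SO(3))
    via QR decomposition of a Gaussian matrix with a sign correction."""
    M = rng.standard_normal((3, 3))
    Q, R = np.linalg.qr(M)
    Q = Q * np.sign(np.diag(R))   # sign fix makes the distribution uniform
    if np.linalg.det(Q) < 0:      # ensure a proper rotation (det = +1)
        Q[:, 2] *= -1
    return Q
```

Without the sign correction the QR factorisation is not unique and the resulting orthogonal matrices are not Haar-distributed; the final column flip restricts the draw from O(3) to SO(3).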
The results are shown in Fig. 3, for ℓ∞-norm and ℓ2-norm bounded noise. As expected, consistent algorithms matching the norm of the noise have a quadratic decay; i.e., the LP has a quadratic decay for ℓ∞-norm bounded noise and the minimum (ℓ2, ℓ∞)-norm has a quadratic decay for ℓ2-norm bounded noise.
Non-consistent techniques, such as (ℓ2, ℓ2)-norm minimisation, can perform well for small M but, as M increases, fail to reach the quadratic decay.

Numerical simulation of Theorem 2
Finally, we present a simulation to experimentally test Theorem 2. The algorithms are the same as for the simulation of Conjecture 1 and the results are shown in Fig. 4. As can be seen in the figure, all algorithms behave as expected.

CONCLUSION
We presented an analysis of the accuracy of multi-camera systems and their error decay rate with respect to the number of cameras. In doing so, we derived fundamental scaling laws stating that, under certain conditions, the accuracy of a multi-camera imaging system in reconstructing 3-D points increases quadratically with respect to the number of cameras. We also analysed the performance of state-of-the-art algorithms with respect to their error decay rate. To do this, we introduced the notion of consistency and showed that consistent reconstruction algorithms achieve the optimal quadratic error decay. Furthermore, we showed that (ℓ∞, ℓ∞)-norm based minimisation is consistent and, in addition, presented two simple linear programs that are also consistent.

Fig. 1. An example of central projection in a pinhole camera.

Fig. 3. Verification of Conjecture 1. Expected squared error (E) as a function of the number of cameras (M).