Accurate Line-Based Relative Pose Estimation With Camera Matrices



I. INTRODUCTION
Since the advent of sufficient computational resources in a compact format, visual localization and mapping have become driving technologies in many applications that require a mobile system to automatically map and localize itself within GPS-denied, previously unknown environments. Important examples are given by robotics and virtual, augmented, or mixed reality. The present paper looks at solving this problem with regular cameras that simply perceive photometric projections of the environment. While popular alternatives have emerged in the form of active depth cameras, the regular passive case remains attractive due to its small form factor, absence of range limitations, and low energy consumption, which is why such cameras already form part of existing products on the market. Despite the discovery of dense photometric registration methods [1], the state-of-the-art solutions for passive visual sensors currently remain sparse least-squares approaches that predominantly rely on feature extraction and bundle adjustment [2], [3]. The present paper explores the potential of passive multi-camera arrays, analyzes the differences between the various camera array configurations, and presents a new algorithm for relative camera array pose estimation. (The associate editor coordinating the review of this manuscript and approving it for publication was Rui-Jun Yan.)
Our work is concerned with the type of features that are extracted from the images in such systems and mapped into a 3D representation. It has long been recognized that the addition of edge or line features can greatly contribute to the overall accuracy of a method, particularly in man-made environments that present many homogeneously colored surfaces with reduced texture. However, exclusive reliance on line features remains a challenge, as it turns the computation of the relative pose between two monocular images into an ill-posed problem. The back-projection of a 2D line measurement is a plane that intersects with the camera center. For any 2D-2D line correspondence between two views, these planes will generally intersect in a 3D line, independently of the relative camera pose. Line-based solutions therefore require corresponding measurements between at least three views, the relative pose of which can then be recovered using trifocal tensor geometry [4]. Besides other challenges such as partial detections and robust line description, the difficulty of stably computing the trifocal tensor has thus far prevented the realization of a truly convincing, purely line-based, monocular visual odometry pipeline. An alternative is given by using a stereo camera, triangulating lines in each stereo view, and then aligning them using a 3D-3D registration paradigm. This is however fragile, as reduced baselines and vanishing disparities for lines that intersect with the epipole result in inaccurate 3D line locations.
We present the first purely line-based visual odometry solution that reliably solves common benchmark cases. This is achieved through the following three contributions:
• We set a new state of the art for line-based stereo relative pose. The solver does not rely on inaccurate triangulations, but solves the problem directly from the original, measured 2D-2D line correspondences. The derivation starts from classical trifocal tensor geometry and adapts it to the case of two stereo viewpoints.
• We demonstrate the applicability of this approach to the plenoptic case, and realize a powerful, purely line-based relative pose solver for camera matrices.¹ We carefully exploit the redundancy in the views in order to benefit from as many lines as possible, while taking the observability of depth given by their direction into account.
• In order to evaluate its potential for continuous tracking over real images, we embed our solver into a purely line-based visual odometry pipeline for camera matrices by adding line-based bundle adjustment in the back-end.

¹ We define a camera matrix to be a square grid arrangement of perspective cameras. It may also be described as a light-field camera or multi-camera array.
The result demonstrates highly reliable and accurate, purely line-based motion estimation. The effectiveness of our approach is supported by our experimental results. Our purely line-based 2D-2D relative pose solver consistently outperforms existing alternatives, and our camera-matrix visual odometry framework is able to compete with some of the most recent RGBD-SLAM solutions in terms of tracking accuracy. Section III starts with the theory behind our trifocal tensor based solver and its application to camera matrices. Section IV outlines its inclusion into a visual odometry framework. Section V then presents our experimental results, where we apply our novel relative pose solver to both simulated and real data.

II. RELATED WORK
Line-based structure from motion has a long-standing history in the geometric computer vision community. Even early works such as [5] already demonstrate the feasibility of optimizing poses and structure by pure reliance on line features. Reference [4] furthermore analyzes the minimal case (i.e. a triplet of views), and introduces the trifocal tensor to describe all scene-independent relations between the views. The theory has later been summarized in [6]. Nearly a decade later, [7] and [8] introduce further seminal contributions detailing the representation, initialization, and optimization of 3D lines, possibly taking geometric priors into account.
While certainly providing a fundamental understanding of the geometry of lines, these contributions do not go as far as to provide the necessary cornerstone for a successful realization of a complete, purely line-based online structure-from-motion pipeline: a stable relative pose solver. The challenge of solely relying on lines has therefore mostly been circumvented by relying simultaneously on points and lines. Important early examples are given by [9] and [10], which introduce edge landmarks and an extended Kalman filter that estimates both points and lines, respectively. Reference [11] introduces a least-squares optimization-based alternative. More recently, we have experienced a revival of hybrid feature-based methods, with [12]–[15] providing monocular solutions, [16] a solution for RGB-D cameras, and [17], [18] alternatives for stereo cameras. Reference [19] proposes an extension to a popular direct point-based method [20] by utilizing lines to guide keypoint selection rather than to directly act as features. Reference [21] shows another possible combination of both modalities, where an initial point-based reconstruction is augmented through the addition of lines in a final back-end optimization step.
Despite its difficulty, there have been a few efforts at realizing purely line-based online structure-from-motion pipelines for photometric sensors [22]–[24]. Their results are however not very encouraging, as [22] states that "From the [...] results, there is no incentive to believe that lines constrain the motion sufficiently to be used alone", and [23] and [24] only present small-scale experiments without evaluating motion accuracy at all. Further notable work has been presented in [25], which estimates Plücker line coordinates through a Kalman filter. However, they again show only very few encouraging results, and, while demonstrating good empirical convergence properties, their filter-based bootstrapping does not provide any convergence guarantees. Reference [26] presents another interesting filtering-based solution, but circumvents the initialization question by utilizing prior information about the environment.
Some works achieve a purely line-based solution by exploiting the structural regularity of man-made environments. Reference [27] adopts Manhattan World constraints by simultaneously grouping and estimating vanishing directions, and then obtains both intrinsic parameters as well as absolute orientation by extracting the normalized vanishing points and enforcing their orthogonality. Given a triplet of lines of which two are parallel and the third is perpendicular to the others, [28] estimates the relative rotation from vanishing point information and the relative translation using the potential intersection points between the lines. Reference [29] adopts both vertical and floor lines, and obtains camera poses using an EKF framework. Reference [30] extends the work of [29] into a stereo graph-SLAM system, in which Plücker line coordinates are used for initialization and an orthonormal representation during optimization. Reference [31] again estimates relative rotation from vanishing directions, and furthermore introduces a novel line parametrization and an extended Kalman filter for continuous camera pose estimation. Reference [32] does not strictly depend on the Manhattan world assumption, but still requires the presence of multiple parallel lines for several dominant directions in the scene. The method proceeds by clustering parallel lines and obtaining their 3D line directions, which are then aligned for relative rotation computation. The relative translation is again estimated by the method presented in [28]. While the assumptions of orthogonality, verticality, or planarity of lines are common, they constrain the derived algorithms to special environments only.
Another category of works that is naturally related to ours is given by [33] and [34], which look at the classical, point-based structure-from-motion problem with light-field cameras. Reference [35] furthermore exploits the ray-manifold structure of point and line correspondences inside a light-field frame in order to triangulate them in 3D. Unfortunately, the image-based geometric relations presented in [33]–[35] depend on ideal multi-camera arrays in which all sensors lie in a perfect plane and have the exact same orientation. Their application in the real world is therefore limited to small scenes captured by high-accuracy light-field cameras, such as Lytro or Raytrix, and does not generalize well to larger scenes. Finally, [36], [37], and [38] represent interesting attempts to realize online structure-from-motion (i.e. visual SLAM) pipelines. Despite promising results, they rely on points, and are only vaguely compared against the state of the art set by the visual SLAM community.
In conclusion, by employing a camera matrix and leveraging trifocal tensor geometry, we present a novel, line-based relative pose solver that enables the first convincing, purely line-based visual odometry solution. It rivals the state of the art set by point-based methods for regular cameras, and even by direct RGB-D methods.

III. RELATIVE POSE FOR CAMERA MATRICES WITH LINES
As stated in [6], the trifocal tensor encapsulates all the (projective) geometric relations between three views that are independent of scene structure. Since two multi-perspective views (stereo or plenoptic) naturally provide more than three perspective views of a scene, the trifocal tensor relations represent a good starting point for constraining the unknown relative pose from lines. After a short review, we show how they can be transferred first to stereo and then to plenoptic cameras. The section concludes with the automatic generation of the actual pose solver.

A. A SHORT REVIEW OF TRIFOCAL TENSOR GEOMETRY
Suppose a line L in 3D space is imaged as the corresponding triplet m ↔ m' ↔ m'' across three perspective views, which are indicated in Figure 2(a) by their centers C, C', and C''. Trifocal tensor geometry describes the existence of three 3 × 3 matrices S_i (or, alternatively, a cubic tensor of size 3) which, if correctly factored with the lines in the second and the third view, permit the reconstruction of the three coordinates of the line in the first view, respectively. Calling the result q, this line transfer is given by

q = [ m'^T S_1 m''   m'^T S_2 m''   m'^T S_3 m'' ]^T.   (1)

Since q ideally has to be proportional to the measured line in the first view, m, we obtain the incidence relation

[m]_× q = 0,   (2)

in which [m]_× denotes the skew-symmetric matrix such that [m]_× v = m × v. The trifocal tensor can be described as a function of the three camera matrices P, P', and P''. Before proceeding, we first move to normalized coordinates, given that we assume the calibrated case. We start by normalizing the line measurements m, which satisfy the equation x^T m = 0 for any image point x along the line. Inserting the matrix of intrinsic camera parameters K, we obtain

x^T m = (K K^{-1} x)^T m = (K^{-1} x)^T (K^T m) = f^T l = 0.   (3)

The normalized line coordinates are therefore given by l = K^T m, which is a 3-dimensional vector. f = K^{-1} x corresponds to a normalized line point and therefore a direction vector pointing towards the corresponding 3D point of L. We have f^T l = 0, and l therefore also corresponds to the normal vector of the 3D plane spanned by the 3D line L and the camera center C. Referring to the first view, P, P', and P'' in the calibrated case are

P = [ I | 0 ],   P' = [ R' | t' ],   P'' = [ R'' | t'' ],

where {t', R' = [r'_1 r'_2 r'_3]} and {t'', R'' = [r''_1 r''_2 r''_3]} enable the Euclidean transformation of points from the first to the second and the first to the third camera frame. In normalized coordinates, the incidence relation (2) turns into

[l]_× [ l'^T T_1 l''   l'^T T_2 l''   l'^T T_3 l'' ]^T = 0.   (4)

Given its general form taken from [6], the trifocal tensor T_1, T_2, T_3 that satisfies the incidence relation in the calibrated case is finally given by

T_i = r'_i t''^T − t' r''_i^T,   i = 1, 2, 3.   (5)

For further details on the trifocal tensor including its derivation, the reader is kindly referred to Chapter 15 of [6].
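To make the review concrete, the following sketch builds the calibrated trifocal tensor slices from two relative poses and verifies the line transfer on synthetic data. All numeric values and the Cayley rotation helper are illustrative choices, not values from the paper.

```python
import numpy as np

def cayley(x, y, z):
    """Closed-form rotation from Cayley parameters (illustrative helper)."""
    c = 1 + x*x + y*y + z*z
    return (1.0 / c) * np.array([
        [1 + x*x - y*y - z*z, 2*(x*y - z),         2*(x*z + y)],
        [2*(x*y + z),         1 - x*x + y*y - z*z, 2*(y*z - x)],
        [2*(x*z - y),         2*(y*z + x),         1 - x*x - y*y + z*z]])

def trifocal(Rp, tp, Rpp, tpp):
    """Calibrated trifocal tensor slices T_i = r'_i t''^T - t' r''_i^T
    for the view triplet ([I|0], [R'|t'], [R''|t''])."""
    return [np.outer(Rp[:, i], tpp) - np.outer(tp, Rpp[:, i]) for i in range(3)]

def line_in_view(X1, X2, Rv, tv):
    """Normalized line coordinates: normal of the plane spanned by the
    transformed 3D end-points and the camera center."""
    return np.cross(Rv @ X1 + tv, Rv @ X2 + tv)

# two example relative poses (first view is the reference frame)
R2, t2 = cayley(0.10, -0.05, 0.02), np.array([0.5, 0.0, 0.1])
R3, t3 = cayley(-0.03, 0.08, 0.05), np.array([0.1, 0.4, -0.2])

# a 3D line given by two end-points in front of the first camera
X1, X2 = np.array([0.7, -0.3, 4.0]), np.array([-0.4, 0.9, 4.5])
l1 = line_in_view(X1, X2, np.eye(3), np.zeros(3))
l2 = line_in_view(X1, X2, R2, t2)
l3 = line_in_view(X1, X2, R3, t3)

# transfer the line from views 2 and 3 back into view 1
T = trifocal(R2, t2, R3, t3)
q = np.array([l2 @ T[i] @ l3 for i in range(3)])
# q must be proportional to l1, i.e. their cross product vanishes
residual = np.linalg.norm(np.cross(l1, q)) / (np.linalg.norm(l1) * np.linalg.norm(q))
```

The residual is zero up to floating-point precision, confirming that the transferred line q coincides with the measured line in the first view up to scale.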

B. LINE-BASED STEREO RELATIVE POSE
Let us assume the existence of a calibrated stereo rig for which intra-camera rotations are known and compensated for (e.g. by applying the homography at infinity). The extrinsic stereo calibration parameters are therefore given by c_1 and c_2, which denote the translational displacements of the left and right camera centers from a common stereo reference frame. Let us furthermore describe the relative pose between two subsequent stereo frames by the parameters {t, R}, which transform points from the first to the second stereo frame. The geometry is illustrated in Figure 2(b). The application of trifocal tensor geometry to the line-based stereo relative pose problem is straightforward. Any three distinct views selected from the two stereo frames need to satisfy the trifocal tensor incidence relation (4). Picking the left view of the first stereo frame to be the first view of our trifocal set, the right view to be the second one, and the left view of the second stereo frame to be the third view of the trifocal set, we can easily see that

R' = I,   t' = c_1 − c_2,   R'' = R,   t'' = t + R c_1 − c_1.

Substituting these expressions in (5) and (4), we obtain three constraints on the relative pose parameters between the two stereo frames, with the trifocal tensor taking the special stereo form

T_i = e_i t''^T − t' r_i^T,   i = 1, 2, 3,   (6)

where e_i denotes the i-th canonical basis vector and r_i the i-th column of R. We can easily obtain three more equations by choosing the right view of the second stereo frame as the third view in our trifocal set, which changes the translation to t'' = t + R c_1 − c_2. It is furthermore intuitively clear that a single line correspondence across two stereo views is not sufficient to fully constrain the relative pose. A second line correspondence is needed, from which we obtain another 6 constraints on the relative pose. With a total of 12 equations that are linear in the parameters t and R, we could already obtain a first solver through a simple linear solution.² This has in fact been done in [39], which however concludes that solutions through linearization are unstable under noise and unable to resolve solution multiplicities.
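The stereo specialization above can be checked numerically: with intra-frame rotations compensated, the two trifocal sets built from the left and right views of the second frame share R' = I and t' = c1 − c2, and differ only in the third-view translation. The sketch below uses assumed example values for the stereo offsets, the relative pose, and the 3D line.

```python
import numpy as np

def cayley(x, y, z):
    # illustrative Cayley rotation helper
    c = 1 + x*x + y*y + z*z
    return (1.0 / c) * np.array([
        [1 + x*x - y*y - z*z, 2*(x*y - z),         2*(x*z + y)],
        [2*(x*y + z),         1 - x*x + y*y - z*z, 2*(y*z - x)],
        [2*(x*z - y),         2*(y*z + x),         1 - x*x - y*y + z*z]])

def trifocal(Rp, tp, Rpp, tpp):
    # calibrated trifocal tensor slices for views ([I|0], [Rp|tp], [Rpp|tpp])
    return [np.outer(Rp[:, i], tpp) - np.outer(tp, Rpp[:, i]) for i in range(3)]

def line(X1, X2, Rv, tv):
    # normalized line coordinates in a view with pose [Rv | tv]
    return np.cross(Rv @ X1 + tv, Rv @ X2 + tv)

def transfer_residual(T, l1, l2, l3):
    q = np.array([l2 @ T[i] @ l3 for i in range(3)])
    return np.linalg.norm(np.cross(l1, q)) / (np.linalg.norm(l1) * np.linalg.norm(q))

c1, c2 = np.array([-0.25, 0.0, 0.0]), np.array([0.25, 0.0, 0.0])  # stereo offsets
R, t = cayley(0.05, -0.02, 0.04), np.array([0.3, -0.1, 0.2])      # frame 1 -> 2

X1, X2 = np.array([1.0, -0.5, 4.0]), np.array([-0.8, 1.2, 5.0])   # 3D end-points
l_l1 = line(X1, X2, np.eye(3), -c1)   # left view, first frame
l_r1 = line(X1, X2, np.eye(3), -c2)   # right view, first frame
l_l2 = line(X1, X2, R, t - c1)        # left view, second frame
l_r2 = line(X1, X2, R, t - c2)        # right view, second frame

# trifocal set 1: (left1, right1, left2); set 2: (left1, right1, right2)
T_left = trifocal(np.eye(3), c1 - c2, R, t + R @ c1 - c1)
T_right = trifocal(np.eye(3), c1 - c2, R, t + R @ c1 - c2)
res_left = transfer_residual(T_left, l_l1, l_r1, l_l2)
res_right = transfer_residual(T_right, l_l1, l_r1, l_r2)
```

Both residuals vanish, confirming that the stated expressions for R', t', R'', and t'' are the correct relative poses for the two stereo trifocal sets.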
Reference [39] also presents a non-linear solver for the stereo line-based relative pose problem that implicitly enforces all orthonormality constraints on R, but resorts to a manually constructed, unbalanced variable elimination strategy. Besides an extension to camera matrices, the remainder of this section establishes a novel higher-order polynomial solver. It works for an arbitrary number of line correspondences and relies on the more modern strategy of solving the first-order optimality conditions with the Gröbner basis method.

C. EXTENSION TO CAMERA MATRICES
A camera-matrix frame can be regarded as the union of many perspective images that are captured with small baselines in between. The focus of the present paper lies on the standard configuration of square, planar camera matrices as illustrated in Figure 1. The availability of many views naturally raises the question of which trifocal sets to choose among all available ones. In the present paper, we use 3 × 3 matrices of cameras, which would lead to 9 · 8 · 9 = 648 possible trifocal triplets, and a large number of possible intra- and inter-camera-matrix matching procedures. The problem however becomes more manageable if considering the fact that not all line correspondences are equally useful. For example, any 3D line that is parallel to or intersects with the baseline between two views will receive the same coordinates upon reprojection into those two views, and would thus be impossible to reconstruct in 3D or even use for constraining the relative pose. Seen from the image plane, these are lines that intersect with the epipole. Furthermore, in analogy to points, a wider baseline generally causes larger line displacements and thus a better signal-to-noise ratio for deriving the relative pose.
These insights can be leveraged to perform a smart selection of trifocal sets. Starting with the two views taken from the first camera-matrix frame, we only consider pairs of views with opposite baselines, thus ensuring maximal displacements of the line observations. In order to furthermore ignore lines that are too close to the epipoles (which are known in advance), we remove any line for which the absolute value of the sine of the angle between its direction vector and the direction towards the epipole (at infinity) is too small. Lastly, we only consider the same two views in the second camera-matrix frame for forming trifocal sets, based on the assumption that the relative rotation is small and that, as a result, line directions do not undergo any significant changes. Our strategy therefore consists of forming line correspondences across 4 views (one stereo view in each camera-matrix frame), from which we obtain constraints on the relative pose via two trifocal sets. The restricted line directions, matching steps, and trifocal sets for two such cases are illustrated in Figure 3. For every new camera-matrix frame, we perform four intra-frame matching steps for the stereo pairs, and four inter-frame matching steps to match the first view of each virtual stereo pair to its corresponding sub-frame in the previous matrix frame.
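For a pure-translation view pair, the epipole-proximity test described above reduces to a simple angle check between the 2D segment direction and the direction towards the epipole at infinity. A minimal sketch, where the threshold value is our own illustrative choice:

```python
import numpy as np

def keep_line(p1, p2, epipole_dir, min_sine=0.2):
    """Reject segments whose image direction is nearly parallel to the
    direction towards the epipole at infinity: such lines cannot constrain
    the relative pose. min_sine = 0.2 is an illustrative threshold, not a
    value taken from the paper."""
    d = np.asarray(p2, float) - np.asarray(p1, float)
    e = np.asarray(epipole_dir, float)
    # |sin(angle)| via the 2D cross product of the two directions
    sine = abs(d[0] * e[1] - d[1] * e[0]) / (np.linalg.norm(d) * np.linalg.norm(e))
    return sine >= min_sine
```

For a horizontal baseline (epipole direction (1, 0)), a near-horizontal segment is discarded while a slanted one is kept.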

D. SOLVER GENERATION
The method explained in the previous section gives us many hypothetical stereo line correspondences across two camera-matrix frames, each one leading to two trifocal sets of views that constrain the relative pose between the camera-matrix frames through incidence relation (4). Given the special form (6) of the trifocal tensor in the stereo case, and defining \vec{r} = [r_1^T r_2^T r_3^T]^T, the constraints originating from a trifocal set i can easily be rewritten as

A_{1i} t + A_{2i} \vec{r} + a_{3i} = 0,   (7)

which are three linear constraints on the variables t and R. The exact form of the matrices A_{1i}, A_{2i} and the vector a_{3i} is presented in Appendix A. Two lines are sufficient to constrain the relative pose, and, given stereo observations inside each camera-matrix frame, they lead to 4 trifocal sets and thus 4 groups of 3 linear constraints. We start by choosing an alternative, minimal parametrization of the relative rotation, namely Cayley parameters [40]. While this renders the equations non-linear in the rotation, it has the advantage of explicitly enforcing the required orthonormality.³ The equations are solvable using polynomial elimination theory, and an application of the latter immediately reveals that the 12 equations formed from two stereo line correspondences are in fact already more than required. A subset of 8 equations would have to be chosen in order to form a minimal case, which leaves us with the difficult question of which ones to choose.
Since two distinct lines remain a requirement, we circumvent this question by changing our problem formulation. We notably construct the sum of squared algebraic errors resulting from each incidence relation, and generate a new minimal solver that seeks all stationary points of this energy in closed form. While this solver is slightly more expensive, the alternative underlying problem has the great advantage of making full use of 2 stereo line correspondences. More importantly, the energy can be integrated over any number of stereo line correspondences greater than or equal to 2, and the solver will generally work without modification.
We start by stacking the algebraic constraints (7) for each trifocal set, leading to

A_1 t + A_2 \vec{r} + a_3 = 0,   (8)

where A_1, A_2, and a_3 vertically stack A_{1i}, A_{2i}, and a_{3i}, and n indicates the total number of trifocal sets. We then eliminate the translation parameters, which appear only linearly in the equations. We have

t = −(A_1^T A_1)^{−1} A_1^T (A_2 \vec{r} + a_3).   (9)

Proposition: A_1^T A_1 has full rank.
Proof: We have

A_1^T A_1 = Σ_{i=1}^{n} s_i ℓ_i ℓ_i^T,   (10)

where s_i > 0, i = 1, 2, ..., n, n ≥ 4 (two stereo correspondences for two distinct lines, leading to 4 trifocal sets), and ℓ_i denotes the line observation in the third view of trifocal set i. It is easily verified that any nonzero vector v leads to

v^T A_1^T A_1 v = Σ_{i=1}^{n} s_i (ℓ_i^T v)² > 0,   (11)

which implies that A_1^T A_1 is positive definite and has full rank. In particular, it is easy to see that if the ℓ_i originate from two non-parallel lines that are constructed using different trifocal tensors, they will span a full-rank space, thus fulfilling the above condition. It is also clear that having non-parallel lines is a condition for our solver to work, as it prevents the unobservable parallax obtained by sliding the camera-matrix frames in the direction parallel to the lines. We do not compute solutions for any pair of lines that are too parallel. A_1^T A_1 therefore automatically stays invertible for any correspondences that we evaluate.
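The elimination of the translation can be sketched as follows. The stacked matrices below are random placeholders with the shapes implied by the text (n = 4 trifocal sets); the point is only to show that substituting the optimal t back into the stacked system reproduces the projected, rotation-only system.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical stacked system A1 t + A2 r + a3 = 0 for n = 4 trifocal sets
n = 4
A1 = rng.standard_normal((3 * n, 3))
A2 = rng.standard_normal((3 * n, 9))
a3 = rng.standard_normal(3 * n)

# for a fixed rotation vector r, the least-squares-optimal translation is
#   t(r) = -(A1^T A1)^{-1} A1^T (A2 r + a3),
# which leaves the reduced system P (A2 r + a3) = 0 with the projector
#   P = I - A1 (A1^T A1)^{-1} A1^T.
P = np.eye(3 * n) - A1 @ np.linalg.solve(A1.T @ A1, A1.T)
B, b = P @ A2, P @ a3

r = rng.standard_normal(9)  # any candidate rotation vector
t = -np.linalg.solve(A1.T @ A1, A1.T @ (A2 @ r + a3))

full = A1 @ t + A2 @ r + a3   # residual of the full system at the optimal t
reduced = B @ r + b           # residual of the translation-free system
```

The two residual vectors coincide, which is exactly why the rotation can be solved for in isolation and the translation recovered by back-substitution afterwards.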
Back-substituting (9) in (8), we obtain

(I − A_1 (A_1^T A_1)^{−1} A_1^T)(A_2 \vec{r} + a_3) = 0.   (12)

We now turn our attention to the minimal Cayley parametrization [40] of R, which is given by

R = (1/c) [ 1+x²−y²−z²   2(xy−z)   2(xz+y) ;
            2(xy+z)   1−x²+y²−z²   2(yz−x) ;
            2(xz−y)   2(yz+x)   1−x²−y²+z² ],   (13)

where c = 1 + x² + y² + z². We first substitute (13) into (12). Multiplying by c then renders each constraint linear in the rotation monomials,

c (I − A_1 (A_1^T A_1)^{−1} A_1^T)(A_2 \vec{r} + a_3) = M* s,   (14)

where s = [x², xy, y², xz, yz, z², x, y, z, 1]^T groups all monomials of the Cayley parameters {x, y, z}. The detailed form of M* is indicated in Appendix B. We can finally define the squared scalar residual over all measurements,

E = s^T M s,   M = M*^T M*,   (15)

and compute all solutions for {x, y, z} from the first-order optimality conditions

2 s^T M [ ∂s/∂x   ∂s/∂y   ∂s/∂z ] = 0.   (16)

We create a solver for this problem from random experiments. Since the norm of the line measurements does not matter for the validity of (4), they can be constructed from random 3D end-points and relative camera poses defined over a finite prime field. After deriving the optimality conditions in the finite field, we apply the automatic solver generator of [41]. The size of the produced elimination template is 92 × 119, and the corresponding action matrix has a size of 27 (which also equals the maximum number of stationary points of the problem).⁴ The action matrix is to be understood as a multi-variate extension of the companion matrix. It is composed of the coefficients of the Gröbner basis, and permits the extraction of the solutions by eigen-decomposition. For a detailed review of Gröbner basis theory and its application in automatic closed-form solver generation, we kindly refer the reader to [41], [43]. Once the rotation is retrieved, the translation can be found by back-substitution into (9). Our code is included in the supplemental material and will be published.
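As a sanity check on the Cayley parametrization, the closed-form matrix below (equivalent to the quaternion (1, x, y, z) up to normalization) is orthonormal with determinant one for any finite (x, y, z); the test values are arbitrary.

```python
import numpy as np

def cayley_rotation(x, y, z):
    """Minimal Cayley parametrization of a rotation; singular only at
    180-degree rotations, where (x, y, z) diverges."""
    c = 1 + x*x + y*y + z*z
    return (1.0 / c) * np.array([
        [1 + x*x - y*y - z*z, 2*(x*y - z),         2*(x*z + y)],
        [2*(x*y + z),         1 - x*x + y*y - z*z, 2*(y*z - x)],
        [2*(x*z - y),         2*(y*z + x),         1 - x*x - y*y + z*z]])

R = cayley_rotation(0.2, -0.1, 0.3)
```

Because orthonormality holds by construction, the solver can treat (x, y, z) as three free unknowns without any explicit side-constraints, which is exactly what makes the polynomial formulation above well-posed.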

IV. LINE-BASED VISUAL ODOMETRY FOR CAMERA MATRICES
We embed our novel line-based relative pose solver into an online visual odometry pipeline. We use state-of-the-art algorithms: EDLines [44] to detect line segments in each image, and LBD descriptors [45] to match them between subsequent frames (eight intra- and inter-frame stereo matching procedures, as outlined in Section III-C). Intra-frame matching is furthermore supported by geometric cues to address any remaining issues such as fragmented lines or mismatches. An overview of the entire framework is given in Figure 4. It consists of a front-end and a back-end module, which perform visual odometry and local mapping, respectively. Both modules are detailed in the following.

A. VISUAL ODOMETRY FRONT-END
After all potential correspondences over trifocal triplets of views are extracted, we apply our solver within RANSAC [46] to robustly identify inliers and initialize the relative pose of the new camera-matrix frame. In each iteration, we utilize the four trifocal triplets that can be extracted from two line measurements across two stereo sub-sets. The success of the relative pose solver obviously depends on the angle between the considered 3D lines. We therefore immediately discard random samples for which the directions of the line measurements in the image are too similar. We furthermore discard rotations that are too far from identity. Once RANSAC terminates, we run a final iteration of the solver over all inlier correspondences. In order to classify inliers, we evaluate a uni-directional transfer error: we use the hypothesized trifocal tensor to transfer the line measurement from the second and third views back to the first one, and then evaluate the orthogonal distances of the end-points of the original line measurement in the first view. However, in order to preserve longer lines, for which this error is much more sensitive w.r.t. pose errors, the transfer error is scaled down by the Euclidean distance between the end-points in the image.
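The length-normalized transfer error used for inlier classification can be sketched as follows. This is our own simplified form (orthogonal end-point distances to the transferred line, divided by the segment length); the paper's exact scaling may differ.

```python
import numpy as np

def transfer_error(l_transferred, p1, p2):
    """Orthogonal distances of the measured end-points p1, p2 (first view)
    to the transferred line (a, b, c), scaled down by the segment length
    so that long segments are not over-penalized."""
    a, b, c = l_transferred / np.linalg.norm(l_transferred[:2])
    d1 = abs(a * p1[0] + b * p1[1] + c)
    d2 = abs(a * p2[0] + b * p2[1] + c)
    length = np.linalg.norm(np.asarray(p2, float) - np.asarray(p1, float))
    return (d1 + d2) / length
```

For the line y = 0 (coefficients (0, 1, 0)) and a segment of length 4 whose two end-points both lie at distance 1 from it, the scaled error is (1 + 1) / 4 = 0.5.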

B. LOCAL MAPPING BACK-END
Our local mapping module performs optimization of all variables over a window of the k most recent keyframes. A new keyframe is added when its relative rotation with respect to the previous keyframe is above 5 degrees or the length of its relative translation is above a scene depth-related threshold. The number of keyframes is fixed to 4 in our experiments, which empirically optimizes the trade-off between computational efficiency and the density of the local view-graph. The optimization includes the poses of the camera-matrix frames as well as all parameters of 3D lines for which there exist at least two observations inside the window of considered frames. The procedure is commonly known as local or windowed bundle adjustment. After a first execution terminates, we run a map matching procedure in which we seek further correspondences between the current frame and the 3D lines directly. After all correspondences have been added, we conclude with another round of local bundle adjustment. With the better constraints provided by the map matching step, the estimation accuracy is further improved. An important question when applying bundle adjustment is how exactly to parametrize the optimized variables. Taking the camera pose in the calibrated case as an example, the difficulty lies in making sure that the orientation variable remains a point on the special orthogonal group SO(3), either through a minimal, implicit parametrization, or by explicitly adding side-constraints to the optimization. In the spirit of consistency, we simply choose the minimal Cayley parameters [40] to represent the orientation of each camera-matrix frame. Furthermore, in order to ensure that we remain far away from 180° rotations, for which [x y z]^T → ∞, we optimize pose changes from the initial absolute pose of each camera-matrix frame rather than the absolute pose directly.
We also need to decide how to parametrize the 3D lines, which are commonly represented by 6 Plücker line coordinates, but only have 4 degrees of freedom. The Plücker line coordinates are notably given by a line direction vector d with unit norm and a moment vector m that is orthogonal to d. As suggested in [6], a minimal parametrization could be given by two points on two planes, which would however leave us with the difficulty of defining these two planes, as well as the inability to attain lines that are parallel to one of the two planes without being contained in it. With strong ties to [7], we propose a less restrictive minimal parametrization of the 3D lines for back-end optimization. Since d and m are orthogonal, they can notably be parametrized as the first two columns of a rotation matrix that again is represented as a function of Cayley parameters [40]. The fact that the moment vector m does not have unit norm is furthermore reflected by multiplying the second column of this matrix by a scale parameter s. We obtain

[ d  m ] = R(x, y, z) [ e_1  s e_2 ],   (17)

where R(x, y, z) denotes the Cayley rotation (13) with c = 1 + x² + y² + z², and e_1, e_2 denote the first two canonical basis vectors. We furthermore switch to local optimization by adding a starting orientation R_0 and a starting line moment scale s_0 to the parametrization, thus resulting in

[ d  m ] = R_0 R(x, y, z) [ e_1  (s_0 + s) e_2 ].   (18)

As required, this parametrization has only 4 degrees of freedom, and it implicitly enforces all side-constraints on the Plücker line coordinates.
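Our reading of this parametrization can be sketched as follows; the composition order R0 @ Cayley(x, y, z) and the additive moment offset s0 + s are assumptions on the exact local form, but the invariants (unit direction, orthogonal moment) hold in any case.

```python
import numpy as np

def cayley_rotation(x, y, z):
    # illustrative Cayley rotation helper
    c = 1 + x*x + y*y + z*z
    return (1.0 / c) * np.array([
        [1 + x*x - y*y - z*z, 2*(x*y - z),         2*(x*z + y)],
        [2*(x*y + z),         1 - x*x + y*y - z*z, 2*(y*z - x)],
        [2*(x*z - y),         2*(y*z + x),         1 - x*x - y*y + z*z]])

def pluecker_from_params(x, y, z, s, R0=np.eye(3), s0=0.0):
    """4-DoF line parametrization: d is the first and m the scaled second
    column of R0 @ Cayley(x, y, z). R0 and s0 are the fixed starting values
    of the local parametrization (hypothetical defaults)."""
    Rl = R0 @ cayley_rotation(x, y, z)
    d = Rl[:, 0]                 # unit line direction
    m = (s0 + s) * Rl[:, 1]      # moment vector, orthogonal to d
    return d, m

d, m = pluecker_from_params(0.1, -0.2, 0.05, 0.3, s0=1.5)
```

Because d and m are (scaled) columns of a rotation matrix, the Plücker side-constraints ||d|| = 1 and d·m = 0 are satisfied by construction for any parameter values.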
We conclude the exposition of our line-based bundle adjustment back-end by explaining the reprojection error of every 3D line into every camera-matrix sub-frame where a measurement is available. Measurements are given by two end-points in each respective sub-frame, while our 3D lines are parametrized as infinitely long. The reprojection error is simply given by the sum of the two squared orthogonal distances between the end-points and the reprojected line. Note that, in contrast to [12], [17], we employ a genuine, virtually infinite line representation in 3D rather than two end-points. In turn, we utilize end-points where they naturally occur, namely for our measurements extracted in the image plane. This appears to be a more logical definition, as end-points defined in 3D could easily collapse and produce a zero residual.
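The reprojection of an infinite line and the end-point residual can be sketched as follows: in a calibrated view [R | t], the line's moment expressed in camera coordinates, l = R m + [t]× R d, is the normal of the plane through the camera center containing the line, and thus the normalized image line. The residual sums the squared orthogonal distances of the two measured end-points; the numeric values are illustrative.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def reproject_line(d, m, R, t):
    """Normalized image line of the Pluecker line (d, m) in view [R | t]."""
    return R @ m + skew(t) @ (R @ d)

def endpoint_residual(l, f1, f2):
    """Sum of squared orthogonal distances of two homogeneous, normalized
    end-point measurements (u, v, 1) to the reprojected line."""
    n2 = l[0] ** 2 + l[1] ** 2
    return (l @ f1) ** 2 / n2 + (l @ f2) ** 2 / n2

# synthetic check: a line built from two 3D points reprojects with zero residual
P1, P2 = np.array([1.0, -0.5, 4.0]), np.array([-0.5, 1.0, 5.0])
d = (P2 - P1) / np.linalg.norm(P2 - P1)
m = np.cross(P1, d)

th = 0.2  # example view: rotation about z plus a translation
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th),  np.cos(th), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.2, -0.1, 0.3])

l = reproject_line(d, m, R, t)
Q1, Q2 = R @ P1 + t, R @ P2 + t
f1, f2 = Q1 / Q1[2], Q2 / Q2[2]   # noise-free end-point measurements
residual = endpoint_residual(l, f1, f2)
```

With noise-free measurements the residual vanishes; with detected segments, the two end-points enter only through their distances to the reprojected infinite line, so fragmented or partially occluded detections still produce meaningful residuals.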

V. RESULTS
Our experimental evaluation splits into two parts. We first analyze our plenoptic, trifocal-tensor-based 2D-2D registration in various simulated cases, and compare it against the more intuitive 3D-3D registration alternative. We then analyze its performance for line-based visual odometry, and compare it against state-of-the-art visual tracking solutions on a common benchmark sequence. We conclude with a demonstration on a real system.

A. VALIDATION OF THE TRIFOCAL-TENSOR BASED SOLVER
We first validate our line-based 2D-2D registration method in simulated experiments, and compare it against a simple 3D-3D registration alternative that is preceded by triangulation in each frame. A basic simulation experiment is generated as follows. We hypothesize a random relative rotation and translation for the second camera-matrix frame, as well as random 3D end-points for two lines constrained to lie within the volume defined by x ∈ [−3, 3], y ∈ [−2, 2], and z ∈ [3, 5] (i.e. in front of the first camera-matrix frame), the size of which is chosen according to a realistic distribution of lines in regular indoor scenes. We then project the 3D points onto all image planes, assuming that all intrinsic parameters are equal and given by a focal length of 480 and a principal point of [320 240]^T, and that the 9 camera centers are arranged in a 3 × 3 grid of 0.5 m width and height. We finally add random uniform noise of at most 0.5 pixels to all image coordinates. For the 3D-3D registration method, we first use the state-of-the-art triangulation method outlined in [35], an accurate and robust algorithm that works in light-field/camera-matrix settings (i.e. the configuration we address), to recover noise-affected 3D line measurements in each camera-matrix frame. We then align them by first recovering the relative rotation from the line directions, and then the relative translation from a straightforward linear equation.
We evaluate the solvers from seven different aspects, namely precision for 1) varying rotation angles, 2) varying translation lengths, 3) varying noise levels for the line end-points, 4) different numbers of line correspondences, 5) the angle between the lines in the minimal case of 2 lines, 6) the length of the lines, and 7) the intra-camera-matrix baseline. For each experiment, we only change one of the parameters and leave all others unchanged with respect to the basic experiment configuration outlined above. Detailed parameter variations are indicated in Table 1. We perform 5000 independent runs for each experiment. The mean and median errors are reported in Figure 5. As can be observed, our direct 2D-2D registration method consistently outperforms the 3D-3D registration alternative. This confirms that the error in the triangulated lines has a negative impact on the 3D-3D registration alternative, and that the success of online visual motion estimation depends crucially on a triangulation-independent registration method.
We also include the linear solver from [39] using 3 lines into our experiments. We run 5000 independent runs for each method with the default setting of the six parameters. Histogram curves of the error distributions are shown in Figure 6, where the y-axis indicates the number of runs that produce a given level of error. The figure indicates that our method has more runs with small errors, and thus outperforms the linear solver in both robustness and accuracy.

B. COMPARISON AGAINST STATE-OF-THE-ART ON A COMMON BENCHMARK SEQUENCE
Comparing our camera-matrix visual odometry results on common benchmark datasets is difficult since none of them provides images captured by a plenoptic camera. Fortunately, the realistic synthetic benchmark given by the ICL-NUIM datasets [52] enables us to render novel views as captured by a camera matrix. The ICL-NUIM datasets include 8 sequences, but they are relatively similar, with only slight differences in the characteristics of the motion. We therefore analyze only two subsequences. Note that the rendering takes around 12 hours for a single sequence. The experiments on synthetic datasets are complemented by further experiments on real-world sequences. We define a virtual 3 × 3 camera matrix with 0.1 m baselines, while keeping the standard intrinsic parameters unchanged, and render new views for two living room sequences for which sufficient lines can be observed. We compare our framework against multiple state-of-the-art solutions from the literature [3], [47]–[51]. We evaluate the absolute trajectory error using the tools provided by the TUM RGBD-SLAM benchmark [53] (see reference for the detailed definition), and summarize the median values over 50 runs in Table 2. Example views of the environment with overlaid line reconstructions as well as the final trajectory are illustrated in Figure 7. Note that the sequences have no obvious loop, hence the question of whether or not loop closure is performed in the pipeline is irrelevant; results simply indicate tracking performance. Drift is hardly observable, and the high accuracy is underlined by the numbers. Although an RGB-D camera is a different type of sensor and thus not directly comparable, it is interesting to observe that we can achieve similar performance in terms of motion estimation accuracy despite only using passive photometric sensors. We also slightly outperform all comparison algorithms on sequence kt2.
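For readers unfamiliar with the TUM evaluation tools, the absolute trajectory error (ATE) first rigidly aligns the estimated trajectory to the ground truth and then reports the RMSE of the residuals. A minimal sketch of this computation (using the standard Kabsch SVD alignment; the function name is ours, and this simplified version ignores timestamp association and scale):

```python
import numpy as np

def absolute_trajectory_error(est, gt):
    """RMSE of position residuals after optimal rigid (SE(3)) alignment
    of `est` onto `gt`; both are (N, 3) arrays of camera positions."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    # Kabsch: SVD of the cross-covariance of the centred point sets.
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the recovered rotation.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

By construction, a trajectory that differs from the ground truth only by a rigid transformation has an ATE of zero.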
The sequence kt1 contains several frames in which only few lines can be detected, which limits the achievable ATE. Nonetheless, our algorithm maintains excellent performance with respect to state-of-the-art passive stereo alternatives. Further details about the results obtained by our method, including the mean, median, maximum, and minimum errors of both the absolute trajectory error and the relative pose error, can be found in Table 3. Even when changing the intra-camera-matrix baseline to a different value, sub-centimetre accuracy remains achievable. Note that when increasing the camera matrices' baseline, the accuracy first rises and then falls, which contradicts the observation in Figure 5. This issue is caused by the decreased number of available feature correspondences. In simulation, we can assume that the image plane is infinitely large, so the cameras always observe all the features we add in virtual 3D space. Real cameras, however, have a limited field of view, which causes the number of feature correspondences observed by the virtual stereo views to decrease as the baseline increases (see Table 4). Therefore, we confirm that there is only one optimal baseline for a given average scene depth.
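The field-of-view argument above is easy to verify numerically: as the baseline grows, the fraction of scene points (or lines) visible in both views shrinks. The following sketch, with assumed scene bounds and VGA intrinsics matching the simulation setup, estimates that co-visible fraction by Monte Carlo sampling:

```python
import numpy as np

def covisible_fraction(baseline, n=20000, f=480.0, w=640, h=480, seed=0):
    # Fraction of random scene points inside the image bounds of BOTH a
    # camera at the origin and one displaced by `baseline` along x.
    rng = np.random.default_rng(seed)
    P = rng.uniform([-3.0, -2.0, 3.0], [3.0, 2.0, 5.0], size=(n, 3))

    def visible(centre):
        Q = P - centre
        u = f * Q[:, 0] / Q[:, 2] + w / 2.0
        v = f * Q[:, 1] / Q[:, 2] + h / 2.0
        return (u >= 0) & (u < w) & (v >= 0) & (v < h)

    return float(np.mean(visible(np.zeros(3)) &
                         visible(np.array([baseline, 0.0, 0.0]))))
```

Running this for increasing baselines shows a monotonically shrinking co-visible fraction, which is why a wider baseline (better triangulation conditioning) must be traded off against correspondence count, yielding a single optimal baseline for a given average scene depth.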

C. COMPARISON AGAINST DIFFERENT SETUPS ON ICL-NUIM SEQUENCE kt2
We first run a stereo version of our own framework by disabling all but two of the camera-matrix sub-frames. We confirm that purely line-based stereo visual odometry with a horizontal intra-camera baseline is unable to produce reliable results, as horizontal lines become largely unusable. The remaining lines are often vertical, which still renders part of the motion unobservable. Since this setup hardly enables reliable motion estimation, we do not demonstrate its results. We instead run a second version of this experiment in which we choose the top-left and bottom-right views to form a stereo rig with a diagonal baseline. This arrangement successfully measures disparities in both horizontal and vertical directions, and hence is able to track the camera. However, as illustrated in Figure 7, the performance is still not satisfactory, as multiple tracking failures occur. Upon tracking failure, the algorithm is simply restarted. The trajectory for this stereo camera setup is therefore composed of several parts.
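The degeneracy of horizontal lines under a horizontal baseline has a simple geometric explanation that can be checked numerically: the back-projection plane of a 2D line passes through the camera centre and the 3D line, and when the 3D line is parallel to the baseline, both cameras' back-projection planes coincide, so the line cannot be triangulated. A small sketch (the function name and sample geometry are ours):

```python
import numpy as np

def backprojection_normal(c, P1, P2):
    # Unit normal of the plane through camera centre c and the 3D line
    # with endpoints P1, P2 (i.e. the line's back-projection plane).
    n = np.cross(P1 - c, P2 - c)
    return n / np.linalg.norm(n)

# A horizontal 3D line, parallel to a horizontal 0.5 m stereo baseline.
P1, P2 = np.array([-1.0, 0.5, 4.0]), np.array([1.0, 0.5, 4.0])
n_left = backprojection_normal(np.zeros(3), P1, P2)
n_right = backprojection_normal(np.array([0.5, 0.0, 0.0]), P1, P2)
# The two normals are parallel: the planes coincide and intersect in a
# plane rather than a line, so the line's depth is unobservable.
print(np.cross(n_left, n_right))  # ~[0 0 0]
```

Repeating the computation with a vertical line gives clearly non-parallel normals, which is why vertical lines remain usable under a horizontal baseline, and why a diagonal baseline recovers disparity for both orientations.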
We also evaluate our algorithm for multiple camera array configurations with different numbers of views and differently oriented baselines, which are shown in Figure 8. For some setups, tracking is prone to failure because of an insufficient number of views. We therefore evaluate 2 kinds of errors: ''Error combined'', which is the combined error of all separate parts in cases with tracking failure, and ''Error full'', which is the error in cases without tracking failure. We run 50 independent runs for each setup, and the detailed results are summarized in Table 5. We can observe that the errors and the ratio of failure cases decrease as more views, and therefore more virtual stereo baselines, are added to the matrix. Furthermore, it can be observed that the direction of the baseline also matters. As line features are often horizontal or vertical, diagonal baselines lead to better results than horizontal/vertical baselines. Best results are achieved with configurations that have both diagonal and horizontal/vertical baselines.
Though case 7 has better accuracy than case 5 in ''Error combined'', it seems surprising that it has worse accuracy in ''Error full''. A possible explanation is that, since there are fewer failure cases for case 7, ''Error full'' for the final 6-view case is evaluated over more runs, which may include some challenging cases that explain the slightly lower accuracy. What can however be concluded is that more views lead to better robustness.

D. APPLICATION TO REAL DATA
Most real-world stereo datasets are limited to regular stereo cameras and do not provide the additional views that we need for the evaluation of our camera-array-based framework. Furthermore, their baseline is mostly parallel to the ground, and-as shown in our experiments-this configuration is not advantageous for our line-based solver. We therefore underwent the effort of recording custom datasets.
In a final experiment, we demonstrate a successful application to real data collected with our own, custom-designed camera-matrix rig (cf. Figure 1). The rig is constructed from 9 GoPro HERO4 cameras with baselines of around 9 centimetres, which targets indoor scenarios. All videos are collected at 1280 × 720 resolution and a 60 Hz frame rate, but are downsampled to 960 × 540 and 30 Hz in our experiments. We pre-calibrate the camera system by taking images of a chessboard and using the method from [54], which calibrates both intrinsic and extrinsic parameters. The videos of the moving camera rig are furthermore time-synchronized by recording not only images, but also sound clicks throughout the dataset. Due to low ambient noise, these clicks are easily extracted from the soundtrack and aligned using simple cross-correlation in Matlab. Distortion and rotation between the different views in the matrix are finally compensated for by applying the calibration parameters. An example qualitative result is indicated in Figure 9, where we move the camera along an elliptic trajectory in the {x, y} plane. We also extract high-quality baselines by applying COLMAP [55], [56] and ORB-SLAM2 [3] (stereo), and observe little deviation by our method.
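The cross-correlation alignment of the audio tracks can be sketched as follows. This is a Python/NumPy equivalent of the Matlab `xcorr`-style procedure described above, not the authors' code; the function name and the single-spike test signal are our own:

```python
import numpy as np

def audio_offset_samples(track_a, track_b):
    """Estimate the sample offset between two audio tracks via full
    cross-correlation. A positive result means the shared click occurs
    later in track_a; divide by the sample rate to get seconds.
    (np.correlate is O(N^2); an FFT-based correlation is preferable
    for long recordings.)"""
    a = np.asarray(track_a, float) - np.mean(track_a)
    b = np.asarray(track_b, float) - np.mean(track_b)
    corr = np.correlate(a, b, mode="full")
    # In 'full' mode, index len(b)-1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(b) - 1)
```

With low ambient noise, the correlation peak produced by the shared clicks is sharp, so this recovers the offset to within one audio sample, far finer than one video frame.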
Using COLMAP's results as the ground truth, we further include quantitative results for eight representative setups in Table 6. 50 independent runs are executed for each setup. The results are generally consistent with the previous observations in the synthetic experiments: we obtain better accuracy and robustness when more views, and therefore more virtual stereo baselines, are added to the matrix.

E. PRACTICAL CONCERNS
Our framework is implemented in C++ and relies on OpenCV and the Ceres non-linear optimization toolbox. All experiments are conducted on an Intel Core i7 clocked at 3.60 GHz, and the implementation uses hyper-threading on all 4 cores for the OpenCV-based line extraction and matching as well as the Ceres-based backend optimization. For a 3 × 3 matrix of VGA cameras, our current implementation consumes an average of 86 ms for line extraction, 8 ms for intra-frame stereo matching, 3 ms for inter-frame stereo matching, and 59 ms for the robust relative pose computation, all on a standard CPU architecture. The mapping part consumes an average of 7 ms for line triangulation, 90 ms for map matching, and 75 ms for local bundle adjustment. Note however that mapping is only executed whenever a new keyframe is defined, and can furthermore be executed in a parallel thread. We therefore achieve an overall frame-rate of about 6 Hz. While there is certainly room for improvement (notably through parallelisation), our implementation is able to process slow motion in real-time.
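The quoted ~6 Hz frame-rate follows directly from the per-frame tracking costs (mapping is excluded since it runs only on keyframes and can be threaded). A quick sanity check of the arithmetic:

```python
# Per-frame tracking costs in milliseconds, as measured in the text.
tracking_ms = {
    "line extraction": 86,
    "intra-frame stereo matching": 8,
    "inter-frame stereo matching": 3,
    "robust relative pose": 59,
}
total_ms = sum(tracking_ms.values())   # 156 ms per frame
rate_hz = 1000.0 / total_ms            # ~6.4 Hz, matching the reported ~6 Hz
print(total_ms, round(rate_hz, 1))
```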

VI. DISCUSSION
We present a novel stereo trifocal tensor solver that achieves state-of-the-art accuracy, and successfully embed it into an algorithm to compute the motion of a camera matrix. The result is the first visual odometry framework for camera matrices that achieves state-of-the-art performance on common benchmark sequences. Comparison algorithms include both passive and active, feature-based and direct methods. What is perhaps most surprising is that this result has been achieved by pure reliance on line features. We successfully demonstrate that-in contrast to monocular or stereo setups-the multiple, differently oriented baselines of a camera matrix lead to the necessary level of robustness to handle the purely line-based scenario, eventually resulting in the outstanding tracking accuracy that one would expect from regressed line measurements. We consider this an exciting result that motivates further exploitation of plenoptic vision for an economic realization of next-generation visual localisation and mapping solutions.

APPENDIX A
Here we present the form of the matrices $A_{1i}$, $A_{2i}$, and $a_{3i}$. Applying the trifocal tensor incidence relation (4), we obtain

$$\mathbf{l}_{\mathrm{proj}} = \begin{bmatrix} \mathbf{a}_{11}\mathbf{t} + \mathbf{a}_{12}\vec{r} + a_{13} \\ \mathbf{a}_{21}\mathbf{t} + \mathbf{a}_{22}\vec{r} + a_{23} \\ \mathbf{a}_{31}\mathbf{t} + \mathbf{a}_{32}\vec{r} + a_{33} \end{bmatrix},$$

where $\mathbf{a}_{11} = l_x \mathbf{l}^T$, $\mathbf{a}_{21} = l_y \mathbf{l}^T$, $\mathbf{a}_{31} = l_z \mathbf{l}^T$, and $\mathbf{l} = [l_x, l_y, l_z]^T$. The camera centre is given by $c = c_1$, $c = c_2$, or $c = c_1$ or $c_2$, depending on which trifocal set from the stereo correspondence across the two camera-matrix frames is considered. $\mathbf{l}_{\mathrm{proj}}$ is a function of the line measurements in the second and third views, and its form is derived by simply applying the trifocal tensor equations. The equations are derived automatically using a symbolic computation toolbox, and the code is submitted as supplementary material.

APPENDIX B
Here we outline the form of the matrix $M^*$. With $q$, $s$, and $c$ provided, we substitute $A_*$ and $a_*$ into (14) from the paper, and obtain $M^*$ using a symbolic computation toolbox. The code is submitted as supplementary material.