BAX: A Bundle Adjustment Accelerator With Decoupled Access/Execute Architecture for Visual Odometry

As the demand for embedded-vision grows, solving large optimization problems in real-time with energy and cost budget is a challenge. We present BAX, a hardware accelerator of bundle adjustment (BA), which solves the least-squares problem of state estimation in visual odometry (VO). BAX consists of a frontend and a backend for control and computation, respectively. The frontend generates instructions on-the-fly executed at the backend to perform the BA algorithm. The backend adopts decoupled access/execute (DAE) architecture, which separates the memory access unit (MAU) from the pipeline. The MAU can prefetch vectors and matrices ahead of computations. To further reduce the latency of data reorganization, three transpose-free dataflows are proposed for matrix multiplication operations on the vector processing unit (VPU). Besides, a unified architecture for both forward and backward substitution is designed for matrix decomposition in the linear solver. All the data are stored in 442kB on-chip memory, and the local map is maintained efficiently by the hierarchical graph memory. Compared with the baseline architecture, the processing time is reduced by 53.9% through the above techniques. BAX is implemented in 32-bit floating-point precision with data normalization on FPGA. It completes a full BA in about 63.44ms at 200MHz, consuming 1.12W power. BAX is $1.73\times $ and $22.38\times $ faster than the desktop and embedded CPUs, respectively, and achieves 90% performance of the GPU at much less power consumption.


I. INTRODUCTION
Bundle adjustment (BA) is the problem of refining a visual reconstruction or navigation to produce jointly optimal 3D structure and viewing parameter (camera pose and calibration) estimates [1]. It refers to bundling up the light from each 3D point to the optical center and adjusting to minimize the projection errors of all points concerning both structure and viewing parameters. Compared to the filter-based optimization approach [2], BA trades more computational cost for higher accuracy [3]. It has become the last step of almost every feature-based visual tracking system and 3D reconstruction algorithm. Image features are the primitives such as points, lines, and regions with distinctive geometric and structural information. A general framework of visual navigation is visual odometry (VO) [4] or The associate editor coordinating the review of this manuscript and approving it for publication was Yasar Amin . visual simultaneous localization and mapping (V-SLAM). VO is the pose estimation process of an agent by matching the features extracted at a stream of images. The poses over multiple frames describe the agent's long-term trajectory, and the points denoted by the features constructs a local map of the surrounding. The trajectory and local map can be further refined by BA [5]. V-SLAM usually takes VO as the frontend and performs a large global BA for loop-closure at the backend. This framework is widely used in applications such as unmanned aerial vehicles (UAVs), automatic driving, and argument reality (AR). A typical method for 3D reconstruction is Structure-from-Motion (SfM) [6]. SfM has a similar pipeline to V-SLAM, but the purpose is to restore the 3D structure of observations rather than estimate the agent's poses. Compared to SfM and V-SLAM, VO performs local BA with fewer frames and a smaller map but has the requirement of real-time processing.  Large scale BA in offline visual reconstruction like SfM usually requires high-performance CPUs, GPUs, and even distributed computing [7]- [11]. These solutions are inappropriate for embedded systems because of the high power and hardware cost. For online visual navigation in embedded systems, BA is still a critical issue due to its high computational intensity and numerical precision. We profile a VO program with ORB feature [12] and local BA on an ARM CPU. As shown in Table 1, it achieves only less than three fps, and BA consumes over 25% of the total time. Many researchers have studied hardware acceleration of image feature extraction, which is beyond the scope of this work. This work focuses on the performance bottleneck of BA. Unlike many image processing algorithms, VO is not a streaming processing application. For a full-pipelined VO, the additional data memory is required as ping-pong buffers between pipeline stages. It will cost much on-chip memory and off-chip bandwidth, which should be avoided for embedded applications. Therefore, in a non-pipelined VO system, the latency of BA still drags the processing speed so that the optimized parameters may not be updated in real-time.
The profiling result of a BA program is also given in Table 2. It indicates that matrix operations, including multiplication, addition, inverse and decomposition, occupy near 80% of the total run time. Previous matrix processors can be classified into two categories. One uses 2D register array to access matrix by row and by column [13]. This architecture indexes each entry and each element in an entry. It brings additional hardware and power cost due to the complex addressing logics. Another one operates like SIMD or vector machines [14]. When data is loaded to the register file or feed to the processing unit, it has an initial latency that rearranging the data to fit the SIMD lanes. BA has many operations that a matrix is transposed and then multiplied to another. Thus, previous solutions are less optimal in hardware cost or performance for BA due to its particular computing pattern.
Besides, previous hardware acceleration of BA [15]- [19] still have shortcomings for feature-based embedded VO. In [15], a part of BA is accelerated by FPGA, and the intermediate data are stored in off-chip memory. The data movement results in more latency and power cost. The pose estimation engines in [16], [17] are lack of optimization process. The works in [18], [19] optimize only camera poses, and the errors caused by points are not reduced. For all these motivations, we propose BAX, a hardware accelerator running local BA in real-time for embedded VO. It addresses the issue of matrix operations, works without external memory access, and refines both camera poses and map points. The high performance of BAX is gained as a result of fourfold contribution: • Decouple access/execute architecture. Matrix operations, which dominate the BA algorithm, can be performed by a vector processing unit (VPU). We adopt a decouple access/execute architecture [20] for the VPU pipeline. The initial latency of traditional single instruction multiple data (SIMD) execution is hidden by accessing memory and executing computations asynchronously.
• Transpose-free matrix multiplication. In the BA algorithm, a matrix is often transposed before multiplied with another one. We regulate three dataflows to perform transpose-free matrix multiplication on the VPU. Thus, the intial latency of matrix transpose is avoided. This computing paradigm can be applied to other matrix processors without much effort.
• Unified linear system solver. The BA algorithm constructs a linear system with a positive definite matrix. Exploiting the parallelism in the variant Cholesky matrix decomposition, we design a unified architecture for both the forward and backward substitutions. Compared with instruction-based processing, the dedicated datapath removes time-consuming data reorganization.
• Hierarchical graph memory. The BA algorithm uses a graph to represent the local map, including the camera poses, the 3D map points, and the 2D projections at different poses. We propose a hierarchical graph memory to maintain the local map. The memory access can be pipelined to save the timing cost when computing the projections and updating the parameters. BAX is implemented on FPGA and operates at 200MHz. The evaluation results show that BAX improves processing speed by 1.73× and 22.38× compared with that of desktop and embedded CPUs.
The rest of this article is organized as follows. Section II reviews the related works of BA and pose estimation acceleration. Section III briefly introduces the matrix manipulations in BA. In Section IV, the architecture of BAX and its novelty is described in detail. Section V shows the experiment result about implementation, performance and accuracy. Finally, the conclusion is drawn in Section VI. VOLUME 8, 2020

II. RELATED WORKS A. BA ON GENERAL PURPOSE PROCESSORS
To improve computational efficiency, several researchers have comprehensively studied the optimization and implementation of BA on general-purpose processors (GPPs). Due to the lack of interaction among some subgroups of parameters, the sparsity in corresponding Jacobian matrices can be used to achieve considerable computation and memory savings. Based on this observation, Lourakis and Argyros et al. [7] implemented a software package to realize LM-based sparse BA with high efficiency and flexibility regarding parameterization on CPUs. To solve large-scale BA problems, Agarwal et al. [8] designed a truncated Newton style LM algorithm coupled with a simple preconditioner using the conjugate gradients (PCG). This method delivers high performance on advanced multicore CPUs at a fraction of the time and memory cost of methods based on factoring the Schur complement.
Given the constraints of the GPU programming model, it is not trivial to get bundle adjustment algorithms to run on the GPUs. Choudhary et al. [9] took a hybrid approach to run overlapping computations on GPU and CPU, where the Hessian matrices and Schur complements are constructed on GPU. Wu et al. [10] improved the BA performance on both CPU and GPU by inexact LM without storing block matrices in memory. Besides, single-precision arithmetic with suitable data normalization further saves memory access and timing cost yet still maintains high accuracy. Zheng et al. [11] also combined the PCG algorithm, the Schur complement, and GPU parallel computing technology to develop a fast and effective BA method for large-scale datasets.

B. BA HARDWARE ACCELERATORS
Although GPU solutions achieve significant performance improvement when executing the BA algorithm, they are not suitable for embedded platforms like micro UAVs with limited power budgets. This motivates researchers to design customized architecture for BA. Qin et al. [15] presented a hardware-software co-designed BA engine on an embedded FPGA-SoC with a novel co-observation optimization technique. The engine offloads Schur elimination to the FPGA side while runs the other steps on the CPU.
Besides, there have been several works on building hardware accelerators with the function of BA or pose estimation for complete visual tracking applications. Yoon et al. [16] designed a graphics and vision unified processor for AR with a pose estimation engine (PEE). The PEE calculates the device pose using an orthogonal iteration algorithm, which can be regarded as the simplicity of BA. Instead of the previous marker-based approach, Hong et al. [17] proposed a marker-less PEE for AR. The techniques of speculative execution and reconfigurable data-arrangement layer are used to reduce computing time and logarithmic computing to save power consumption. Li et al. [18] applied a convolution neural network (CNN) to extract features in SLAM and designed a CNN-SLAM processor, which executes pose-only BA in ASIC. Similarly, Suleiman et al. [19] presented a visual-inertial odometry (VIO) accelerator, which performs a factor graph optimization instead of BA. To the best of our knowledge, BAX is the first full hardware acceleration of BA for embedded VO that refines both camera poses and 3D map points simultaneously.

III. MATRIX OPERATIONS IN BA
The basic idea of BA is shown by Fig.1. Assume the camera moves relative to the world reference by a rotation R and a translation t (i.e., camera extrinsic T ) estimated using EPnP [21], then the 3D coordinates of a point P in the world reference can be transformed to the 2D coordinates of a point p on the image plane. In addition, the same 3D point is directly captured and printed as a pixel p of the image. Due to the inaccurate measurement of T and P, p and p have some error e in their 2D coordinates, which should have been identical in theory. Essentially, BA is a non-linear least squares problem to minimize the sum of the reprojection errors concerning all of the 3D points P i and camera extrinsic T j . The error is defined as the squared L 2 norm of the difference between the observed feature location and the projection of the corresponding 3D map points. The most popular method to solve BA is the Levenberg-Marquart (LM) algorithm. It expresses the non-linear optimization problem as an approximate linear system of equations and finds the optimal solution as a combination of steepest descent and the Gauss-Newton method. Using the Schur complement trick, the scale of the linear system can be reduced greatly. Since the reduced coefficient matrix is positive-define, Cholesky decomposition is suitable for solving the equations. The more details of BA and related numerical optimization methods can be found in [1]. The computing patterns and matrix operations dominated in BA is introduced as follows.

A. NORMAL EQUATION
Approximated by the LM method, the BA problem is actually to solve the normalization equations as follow: where J is the Jacobian, D is a diagonal matrix extracted from J T J, λ is a non-negtavie coefficient that controls the damping strength, vector x is the corrections of optimized variables, and vector e is a collection of reprojection errors. Assume that the number of optimized camera poses and 3D points is a and b, respectively. Then the size of Hessian matrix , and the vectors x, e and g have the same length of (6a + 3b). It is also observed that all the components in Eq.1 can be partitioned into two parts: where the index c and p denote the camera pose part and the point part respectively. Thus, Eq.1 can be expressed as follow:

B. SCHUR COMPLEMENT
Directly solving Eq.1 has a complexity of O((6a + 3b) 3 ), which is computational intensive for large-scale BA problem.
With block-wise Gaussian elimination, Eq.2 can be reduced as follow: where B − EC −1 E T is a positive definite matrix, called the Schur complement. Therefore, solving Eq.1 is converted to solve the correction of camera pose part x c first: and then substitute x c to figure out the point part x p : Since the number of camera poses is much less than that of points, it is efficient to solve Eq.4 using Cholesky decomposition instead of Eq.1. Besides, it is simple to calculate C −1 because the diagonal of C is composed of 3 × 3 matrices.

C. CHOLESKY DECOMPOSITION
A variant Cholesky decomposition, called LDL decomposition, is given by A = LDL T , where A is a n × n positive-definite matrix, L is a unit lower triangular matrix with unit elements on the diagonal, and D is a diagonal matrix. Thus, solving Ax = LDL T x = b is equivalent to solving three equations Lz = b,Dr = z and L T x = r in sequence: where the first two steps are forward substitution, and the third is backward substitution. Different from the standard Cholesky decomposition, the above computation avoids square root operation and division dependency at the cost of more iterations, which can be compensated by parallel processing.

IV. HARDWARE ARCHITECTURE OF BAX
Application-specific hardware accelerators usually adopt parallel processing, stream processing, and pipeline technique to boost the run-time performance. From a macroscopic view, the tasks of BA (see Table 2) have to be executed in serial due to the significant data dependency. Besides, the complex dataflow of conditional execution prevents stream processing. Although pipeline can improve throughput, it brings additional data memory between stages for non-stream processing, which is less appropriate for BA, since it has massive intermediate data. Therefore, we consider executing BA by the algorithm flow under the control of finite state machines (FSM).
The overall architecture of BAX is partitioned into the frontend and the backend (see Fig.2). At the frontend, a set of FSMs are switched by the central controller following the BA algorithm. The FSMs include 1) Constructor that builds the normalization equation by computing the Jacobian and Hessian matrices; 2) Marginalizer that reduces the scale of normalization equation through the Schur elimination method; 3) Solver that figures out the increment of camera states and 3D points coordinates by variant Cholesky decomposition and 4) Updater that calculates the value of optimized camera states and 3D points coordinates. The above steps are repeated until the termination conditions are satisfied. Due to the serial nature, all the FSMs at the frontend can share the computational resources of the backend without conflict.
Unlike the work in [19], which calls the shared computational units by the FSMs directly, BAX bridges the frontend and backend through an instruction queue (IQ). Specifically, the FSMs at the frontend generate 32-bit RISC instructions and write them into the IQ. Then the backend act as a GPP to execute these instructions on a scalar processing unit (SPU) and a vector processing unit (VPU), with no idea of the current step at the frontend. This loosely coupled architecture has two advantages. First, more advanced operations can be integrated to BAX conveniently by the simple modification of the frontend if necessary. Second, the instruction-based calling of the backend avoids the duplicated control logics in each FSM at the frontend. The SPU and VPU can access the data memory (DM) through separate register files. The DM includes a graph memory for the local map, an equation memory for the Jacobian and Hessian matrices, and a shared memory for VOLUME 8, 2020 the other data. The DM capacity is compressed by making use of the sparsity and symmetry of the Hessian matrix. Besides, BAX implements a single-precision float-point data path with the data normalization method in [10], which saves much hardware and power cost at minimum accuracy loss compared to the double-precision.

A. DECOUPLED ACCESS/EXECUTE ARCHITECTURE
High-performance GPPs are usually enhanced by the SIMD technique to exploit the data-level parallelism (DLP). However, the SIMD execution units still suffer from the latency of data preparation (cache missing, permutation, packed and unpacked, etc.). To address this issue, modern GPUs further extend SIMD to single instruction multiple threads (SIMT). The SIMT technique can feed the pipeline with the other threads when one thread is data-hungry. Instead of the complex SIMT implementation, we expect a simple and efficient method to reduce the data preparation latency in SIMD. Decoupled access/execute (DAE) architecture was proposed in [20]. A typical decoupled processor has two independent processors [22]. They can execute memory reference and computation instructions asynchronously. Therefore, memory latency can be hidden by fetching data ahead of demand. The purpose of DAE architecture was to address the issue of memory access latency [23] at first. The DAE architecture fails in the market of GPP because it is inefficient for irregular data structures and control flows. However, computations on structured data like the massive matrix operations in BA is conformable to the concept of DAE. Although BAX has no external memory access, the data preparation in SIMD execution can be considered as internal memory latency. Therefore, BAX can benefit from in-order execution DAE architecture by prefetching vectors or matrices.
The instruction set architecture (ISA) of BAX derives from the RISC-V ISA [24], where only the load/store and arithmetic instructions (float-point scalar and vector) are retained for simplicity. Besides, a few customed instructions for matrix manipulations and data reorganizations are defined based on the features of BA (see Section III). As shown in Fig.2, the backend of BAX is decoupled into four components: the fetch and decode unit (FDU), the memory access unit (MAU), the scalar processing unit (SPU), and the vector processing unit (VPU). The FDU fetches and decodes instructions from the IQ. Then it issues them to the memory instruction queue (MIQ), the scalar instruction queue (SIQ), and the vector instruction queue (VIQ) according to the type of instructions. The instructions, classified into memory access (load, store, and permute), scalar arithmetics, and vector arithmetics, are handled by the MAU, SPU, and VPU, respectively. The MAU calculates the reference address through the address generation unit (AGU) with the address register file (ARF) and moves data between the data memory (DM), the scalar register file (SRF), and the vector register file (VRF). It can also execute data permutation instructions by the data reorganization unit (DRU). Both VPU and SPU have a 4-stage pipeline (no memory access) with various latencies in the execution stage for different float-point operations. Inside VRF and SRF, a state table for each entry is used to resolve data hazards between the MAU, SPU and VPU.
The pipeline timing of a code snippet executed on the DAE architecture is compared with traditional in-order RISC architecture in Fig.3. Assume that loading (storing) a vector from (to) memory costs three cycles, multiplying two vectors by elements requires four cycles, and both architecture support data forwarding before the write-back stage. In the beginning, the DAE architecture has more latency to complete the store instruction due to the additional FDU pipeline. However, the DAE architecture makes an effect as the loops proceed. The SPU can run the address increment instructions ahead of the completion of the previous store instruction conducted by the MAU since they are decoupled and have no data dependency. Therefore, the DAE architecture will improve the performance significantly, especially for programs with large loop counts and high latency of both memory access and computation. Different from outof-order superscalar processors, the proposed DAE architecture exploits prefetching rather than dynamic scheduling.  Thus, it still maintains in-order execution and a simple assembly programming model at a little hardware cost.

B. TRANSPOSE-FREE MATRIX MULTIPLICATION
To save the execution time of completing a matrix operation, we define a set of matrix arithmetic instructions and execute them on a canonical vector processing unit (VPU) with an adder tree. As shown in Fig.4, the VPU architecture is similar to previous matrix processors [18], [19]. The first stage can multiply or add the elements from two vectors as a naive vector processor. In each lane, the multiplier is chained to the adder to perform multiplication and accumulation. Together with the adder tree at the second stage, the VPU can also dot product two vectors and multiply two matrices. As described in Section III, when solving the linear system, the coefficients matrix (Hessian matrix) of the equation is manipulated by a series of block matrices. Since the largest size of a block is 6 × 6, the VPU of BAX is implemented with six lanes as a trade-off between performance and utilization. With the data normalization method in [10], a 32-bit single-precision float-point datapath guarantees the BA accuracy.
It frequently occurs that a matrix is transposed and then multiplied to another when constructing the Hessian matrix in BA. Although previous works like [13], [14] also perform matrix multiplication using multipliers and adder trees, they are less optimal for matrix transpose in hardware cost and performance. Except for fetching data, the memory access unit (MAU) at the backend of BAX also runs the data permutation instructions using the data reorganization unit (DRU). Therefore, the load and transpose operations can be packed into a single instruction. Using the DAE architecture, the latency of matrix transpose is likely to be hidden.
Another case is that the matrix to be transposed is already in the register file. To address this issue, we regulate three computing patterns for transpose-free matrix multiplication on the VPU, as shown in Fig.5. Assume two 3 × 3 matrices A and B are stored in vector registers with two read ports in rowwise. Therefore, a row vector or a single element from two matrices can be accessed at one cycle. To calculate C = AB, A 1,x (x = 1, 2, 3) is multiplied with three rows of B respectively to produce the partial sum of the first row of C. The partial sums are stored in the temp registers and accumulated to the first row of C (Fig.5a). Repeat this step for all the rows of A until the calculation of C completes. Similarly, C = A T B can be calculated by accessing the element of A in column-wise (Fig.5b). Since a row of A and a column of B T (i.e., a row of B) can be accessed at the same cycle, each element of C can be calculated directly by the VPU (Fig.5c). Swapping the order to access A and B in Fig.5b, C = A T B T can be calculated by column. The result of C must be transposed before being written to the memory to unify the storage format. The above computing patterns have a time complexity of O(n 2 ) with n being the matrix size. If the matrix size is larger than the number of VPU lanes, the VPU can compute the partial sums of several partitioned blocks, and then accumulate them to the final result.

C. UNIFIED ARCHITECTURE OF LINEAR SOLVER
Although there is a VPU in BAX, instruction-based vector (matrix) operations are less efficient for the above calculations because the size of the coefficient matrix in Eq.4 is far beyond the number of the SIMD lanes in the VPU and the VRF capacity. If the matrix decomposition is conducted by the 6-width VPU, massive partial sums have to be moved in and out of the registers repeatedly. Despite the DAE architecture, a large amount of pipeline stall is still inevitable. Therefore, the Solver FSM at the frontend is allowed to use the functional units of the VPU and SPU, and access the data memory directly without instructions.
To solve the reduced normalization equation (Eq.4), we learn from [25] and apply the LDL decomposition on the coefficient matrix iteratively. The LDL decomposition of a symmetric matrix A = LDL T can be divided into four blocks as follow: where g = a nn and d n are scalars, t and w are vectors. By matching the block matrices of the two sides in Eq.7, it is easy to find: which indicates that given the LDL decomposition of its top-left n − 1 order cofactor A n−1 as Eq.8a, A can be decomposed by solving w and d n according to Eq.8b and Eq.8c. Therefore, the full decomposition can be iterated from A 1 = D 1 = a 11 and L 1 = 1.
Fortunately, solving Eq.8b exactly follows the forward substitution of LDL composition as Eq.6a and Eq.6b. It indicates that the iterative LDL decomposition can be performed on a common datapath using similar dataflow, saving the hardware overhead. Therefore, we design a unified architecture for both LDL decomposition (forward substitution) and equation solving (backward substitution). Fig.6a and Fig.6b shows the dataflow of the forward and backward substitution to calculate Eq.6 on the unified linear solver, respectively. In the forward substitution, b j is first assigned to z j one-to-one (j = 1 ∼ n). Then the multiple product terms z j_p = L j,i z i (partial sums) associated with the same z i are calculated in parallel and negatively accumulated to z j until z j is figured out (j = 2 ∼ n, i = 2 ∼ n − 1). Once z j is ready, r j = z j /d j is calculated immediately on the divider. After that, the backward substitution starts in the reverse order to calculate the final results x j . Besides, the dataflow to solve Eq.8c is similar to the matrix multiplication on the VPU. All the intermediate data are stored in the on-chip data memory. One triangular iteration above has a timing complexity of about O(n 2 /(2p)), where p is the parallelism degree. Thus, the total time cost for a complete decomposition after n iterations is about O((n 3 )/(6p)). The linear solver borrows six adders and one divider from the VPU and SPU, respectively. With the elaborate pipeline timing, it can solve a 96 × 96 linear system of equation in about 80 thousand cycles. This unified architecture is flexible to support large scale matrix decomposition by more iterations or adding more computing resources.

D. LOCAL MAP AND GRAPH MEMORY HIERARCHY
The camera pose of each frame, the map points observed per frame, and the keypoints (features) extracted in each frame are associated with each other to form a local map as a graph. When a new frame (pose) with map points and keypoints is inserted to the map, the earliest frame has to be deleted if beyond the memory capacity. In [19], a two-stage graph memory is proposed without the storage of 2D keypoints coordinates, which is not suitable for BAX. In [18], a hierarchical graph memory with a free-list FIFO is used to store a similar map, but the method to update the map is not explained in detail. In this work, we implement a graph memory including three buffers for frames, keypoints and map points, respectively. Each entry of the three buffers is partitioned into multiple fields with different attributes. The frame buffer has three fields for the frame index (FID), the camera pose ([R|t]), and the number of keypoints in this frame (nKP). The LSB of FID indicates whether the entry is used or not. The keypoint buffer is divided into multiple groups, and a group stores the 2D coordinates (2D) of keypoints in a frame. Each keypoint has a pointer (PTR) to  its corresponding 3D map point. The map buffer stores the indexes of map points (PID), their 3D coordinates (3D), and the index of frames (FID i ) that observe the map point.
We will explain how the graph memory handles the local map with an example in Fig.7. Assume that the frame buffer has only two entries, the map buffer has six entries, and the keypoint buffer has eight entries partitioned into two groups. KF i denotes the i-th keyframe, and its camera pose is expressed by [R|t] i . The 2D coordinates of the j-th keypoint in KF i are denoted by [u,v] ij . The m-th landmark (3D point) is denoted by P m , where the index is numbered independently from any frame, and its 3D coordinates are given by [x,y,z] m . In the beginning, KF 0 with three keypoints arrives. The first entry of the frame buffer is filled, and the three keypoints are stored in the first group of the keypoint buffer in order. These keypoints correspond to three landmarks P 2 , P 0 , and P 1 , respectively. The landmarks are stored in the map buffer in sequence. The write address is given by m mod d, where d = 8 is the map buffer depth, and this address is also the pointer PTR of the associated keypoint. Then KF 1 is inserted after KF 0 , and the 2D keypoints are stored in the second group of the keypoint buffer. Since P 0 and P 2 are co-observed in both KF 0 and KF 1 , they share the same entries in the map buffer so long as the related FID fields are set. The next frame KF 2 will replace KF 0 , but P 2 has to be retained because it is still observed in KF 1 .
When running BA, the accelerator first reads a camera pose and its nKP from the frame buffer. Then the 2D coordinates of all the keypoints in this frame are fetched from the keypoint buffer according to nKP. Meanwhile, the corresponding 3D coordinates and the observed FIDs are indexed by PTR. After that, the normalization equation to solve BA can be constructed with the above information. In our implementation, the 128KB graph memory can maintain a local map with 16 frames, 256 keypoints per frame, and 4096 landmarks with eight co-observations at most. Any fine-coarse  operation to an entry of the buffers is completed in one or two cycles.

A. IMPLEMENTATION
The proposed accelerator is implemented on the FPGA side (XCZU9EG) of a Xilinx Zynq Ultrascale+ MPSoC (ZCU102) at 200MHz. Except for a 32-bit integer adder in the address generation unit (AGU) of the MAU, all the computational units in the SPU and VPU are 32-bit floatpoint. These float-point units work in non-block mode with the pipeline depth from four to twelve cycles. Table 3 reports the resource utilization of LUTs, REGs, DSPs, and BRAMs by different modules. Since the divider is generated without DSP by Vivado, it contributes to the most LUTs and REGs in the SPU. In addition, a BRAM in the FPGA is at least 18Kb, which exceeds the actual requirement of the small instruction queues. Nevertheless, the accelerator still consumes very few FPGA resources. To evaluate the effect of DAE architecture with single-precision datapath (DS) on hardware cost, we also implement two other versions of BAX, including non-decouple single-precision (NDS) and decouple double-precision (DD). As shown in Fig.8, the singleprecision datapath saves about half of hardware resources and power. Compared with the non-decouple architecture, the proposed DAE architecture has only a few more hardware cost. It is mainly used for the FDU and MAU, and the multi-ported VRF and SRF.
We also analyze the size of necessary on-chip memory by a quantitative approach. Different from the work in [10], BAX constructs the Hessian matrix using the 2 × 9 Jacobian associated with each error term without the storage of the global Jacobian in Eq.1. In our implementation, BAX supports a local map with at most 16 frames (camera poses) and 4096 landmarks (3D points) in total. According to Eq.2, we can infer that B and C consists of 16 6 × 6 and 4096 3 × 3 symmetric block matrices on the diagonal with the rest being all zeros, and g has a length of 12384. Since a landmark has max eight co-observations at different poses and a frame has most 256 keypoints, the sub-matrix E contains 2048 6 × 3 non-zero blocks, which can be encoded using the compressed sparse row (CSR) format by block. By exploiting the sparsity and symmetry above, the Hessian matrix H with the vector g can be stored in the 295KB equation memory of the data memory (DM), including a 4.75KB space to index the CSR formant. Besides, the DM also contains 128KB graph memory (see Section IV.D) for the local map and 16KB shared memory for intermediate data. In addition, BAX has four 128×32-bit instruction queues (IQ/VIQ/MIQ/SIQ), 16×384bit VRF, 16 × 64-bit SRF and 8 × 32-bit ARF. As a result, the total on-chip memory required is about 442KB, and the breakdown of on-chip memory usage is shown in Fig.9.

B. PERFORMANCE
To evaluate the performance gain from the four techniques described in Section IV, we run RTL simulation of BA on different architectures. The baseline architecture is a canonical 5-stage RISC pipeline. It performs all the computations on an SPU and VPU with the same specs as BAX, and the local map is stored in a single-bank memory. Then one or more techniques are added to the baseline architecture, and BAX is the one with all techniques. Before the simulation, the test datasets from the BAL project [8] are pre-processed. One group from each of the five datasets is selected and trimmed to 16 frames, less than 256 keypoints per frame, and less than eight co-observations per landmark. All the data are normalized to single-precision, and the termination conditions are set. The time breakdown of different architecture  is shown in Fig.10. It is observed that: 1) The performance of all steps in BA is improved by the DAE architecture, which hides the memory access latency; 2) The transpose-free matrix multiplication method further saves the processing time of matrix multiplication in all steps; 3) The unified linear solver accelerates the matrix decomposition when solving the equation; 4) The hierarchical graph memory reduces the processing time slightly due to the overlapped accessing of multiple parameters. With the combination of above features, the performance of BAX is improved by 53.9% than that of the baseline architecture.
The performance of BAX is also compared with that of different GPPs. The BA algorithm is programmed using the g2o [26] and OpenOF [27] library for CPU and GPU, respectively, with the same specifications as BAX. The BA program runs on an Intel i7-8700 CPU at 3.2GHz with 16GB DRAM, an ARM A53 CPU at 1.2GHz with 4GB DRAM, and an NVIDIA GTX960 GPU at 1.1GHz with 3GB graph DRAM. The experiment results are listed in Table 4. The execution time of the above platforms is 110.48ms, 1.42s and 57.13ms, respectively, and the power consumption is 65W, 1.67W and 130W. Although the GPU implementation achieves the best performance, it is not suitable for embedded VO due to its high power. On the contrary, BAX can complete a full BA in 63.44ms on average while consumes only 1.12W power, which is much less than that of the GPU and the Intel CPU. It achieves a 1.73× and 22.38× speedup compared with the Intel and ARM CPU, respectively. Besides, BAX has no external memory access during the entire processing. To evaluate the effect of DRAM traffic on performance, we add a DDR controller and a DMA (IP cores in Vivado) to BAX. The DRAM interface is set to 16bit and 1200MHz by default. Then we run simulations on BAX and the GPPs with larger datasets, respectively. As shown in Fig.11, the processing time of BAX and the GPPs grows nonlinearly as the number of poses or points per frame increases (Y-axis is in log). The curve slope of BAX is larger than that of GPPs for the limited on-chip memory and lower DDR bandwidth. Besides, the timing cost of GPU grows relatively slowly because of the massive streaming cores and high-parallel computing patterns. Therefore, one advantage of BAX is to solve a small BA problem for embedded VO efficiently.
Previous BA-related hardware accelerators [15], [18], [19] target to different applications and have various design spaces. The specifications and features of these works and BAX are listed in Table 5. The works [18] and [19] are visual (inertial) odometry accelerators implemented in ASIC while [15] and BAX are FPGA-based BA accelerators. The work [19] completes one iteration in 30.8ms using the factor graph method. It adopts double-precision float-point data to maintain high accuracy and achieves extremely low power through the adaptation technique. The work [18] adopts 32-bit fixed-point numerical precision with data normalization. It solves a pose-only BA with 20 frames and requires much fewer operations compared with the others. According to the architecture and the breakdown of chip die, we estimate that it competes one iteration in less than 17us, and the BA part costs less than 30mW. However, both works in [18], [19] optimize only the camera poses, and the errors caused by the map points are not reduced. In [15], only the Schur elimination step is implemented on FPGA, while the others still run on CPU. The data movement between the off-chip and on-chip memory results in additional latency and power cost. It completes one iteration on larger datasets in 110ms at the expense of more hardware resources. To the best of our knowledge, BAX is the first full-BA accelerator for embedded VO applications. It completes one iteration in about  10.57ms at 200MHz and overcomes the shortcoming, i.e., the lack of refinement on points in [18], [19] to achieve high accuracy. The support for full-BA and the FPGA-based implementation brings BAX more power cost than that of [18], [19]. However, at the configuration of pose-only mode, the power cost and processing speed of BAX can also be reduced. Compared to [15], BAX solves small BA problems more efficiently with full hardware implementation. As a result, it avoids the latency of external memory access and costs less hardware and power than that of [15].

C. ACCURACY
In this work, a 32-bit floating-point datapath with data normalization is used to save hardware resources. To evaluate its effect on the accuracy, we use the same datasets above to simulate on three configurations of BAX, including double-precision, single-precision with data normalization (DN), and single-precision without DN. The BA results calculated by the double-precision datapath are taken as the ground truth. Then BAX with different numerical precision performs BA using the same datasets applied by Gaussian noise. The results include camera poses and coordinates of 3D map points, among which the camera poses are converted to 6-element vectors using the Rodriguez formula [28]. After that, the accuracy is measured by the mean square errors (MSE) of all the three sets of results relative to the ground truth, as shown in Table 6. For single-precision without DN, the MSE of camera poses and map points are 10 10 × and 10 6 × higher than that of double-precision, respectively.
Although it seems that the MSE is quite small for one BA, this error is still unacceptable because it will accumulate as the movement proceeds. However, the MSE of single-precision with DN is only increased by 100× for both poses and points compared to the double-precision. Besides, the result of pose-only BA is a little worse than that of single-precision without DN, indicating the effectiveness of full BA. From the accuracy gain, it is observed that the DN method improves the accuracy of poses and points in single-precision BA by 10 7 × and 10 4 ×.

VI. CONCLUSION
We propose BAX, an FPGA accelerator for feature-based BA that refines both camera poses and map points. It consists of a frontend and a backend, which are loosely coupled by an instruction queue. Instructions are generated by the FSMs at the frontend. Then BA is performed at the backend under the control of the instructions. The backend exploits a DAE architecture to reduce the latency of data preparation in continuous matrix operations. A 6-width VPU with transpose-free matrix multiplication is used to reduce this latency further. In the linear solver, a unified architecture for both forward and backward substitution is proposed. BAX maintains a local map, including 16 camera poses, 256 keypoints per frame, and eight co-observations per landmark, in a hierarchical graph memory. The accelerator completes a full BA in about 63.44ms at 200MHz while consuming 1.12W power without external memory access. It achieves a 1.73× and 22.38× speedup compared with the desktop CPU and the embedded CPU, respectively, and 90% performance of the GPU. Since the backend works as a standalone processing unit, the function of BAX can be easily modified by configuring the FSMs at the frontend. Although this work targets embedded VO, the architecture of BAX also supports large scale BA or global BA if more computational resources and external memory are used. The concepts, such as the DAE architecture and transpose-free matrix multiplication, are suitable for other applications that contain massive vector and matrix operations. RENDONG YING (Member, IEEE) received the Ph.D. degree in circuits and systems from Shanghai Jiao Tong University, China, in 2007. He is currently an Associate Professor with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University. He has an in-depth study of digital signal processing and its hardware implementation techniques. His research interests include navigation signal processing, SoC architecture for high-performance digital signal processing systems, 3D visual signal processing, and machine thinking. He has long-term close cooperation with domestic research institutions and well-known companies. Participating projects include: ''Modeling and Analysis of High-Reliability Software Systems'', ''Application-Oriented Instruction Configurable Processors'', ''Distributed 3D Video Acceleration Engine Technology'', and ''Compressed Sensing-Based Spatial Audio Object Coding''.