Exploring Sparse Visual Odometry Acceleration With High-Level Synthesis

Visual Odometry (VO) systems are widely used to determine the position and orientation of a robot or camera in an unknown environment. They are deployed on resource-constrained platforms, such as drones and virtual reality or augmented reality headsets. VO systems harnessing modern Systems-on-Chip (SoCs) with an integrated Field Programmable Gate Array (FPGA) have the potential to improve overall performance. This paper explores the FPGA acceleration of sparse semi-direct VO kernels using High-Level Synthesis (HLS). The selected sparse Semi-direct VO (SVO) system was, from its conception, developed to execute efficiently on low-power processors. We show that both the computational and the data transfer overheads between the processing cores and the accelerators on the reconfigurable fabric need to be optimized to obtain better end-to-end performance. The additional data movement incurred when using an FPGA accelerator is due to the sparse computational nature of the kernels together with their random memory access patterns. This paper shows that state-of-the-art HLS tools are not yet able to perform the required optimizations automatically, as they usually target application kernels with dense computational patterns and regular memory access. In this paper we propose three, potentially general, methods to reduce the data transfer between the processing cores and the customised hardware kernels on the FPGA: (a) approximation based on domain-specific knowledge, (b) lossless image compression, and (c) on-the-fly computation. We present a case study of the use of these methods on SVO, a state-of-the-art sparse VO system with a semi-direct front-end. We demonstrate that our proposed methods can reduce the data transfer overhead to achieve better end-to-end performance and that they can be applied not only when using standard Xilinx tools, but also with other state-of-the-art HLS tools, such as HeteroFlow. Compared to the baseline performance of the original SVO software on Arm processors, our proposed methods enable the Xilinx SDSoC and HeteroFlow designs to achieve speedups of 2.4× and 2.14×, respectively, without noticeable accuracy loss. The Xilinx SDSoC and HeteroFlow designs also achieve a 1.85× and 1.89× improvement in energy efficiency, respectively, on a Xilinx Zynq Ultrascale+ SoC with Arm A53 cores and an integrated FPGA. Compared to the SVO software baseline running on an Intel Xeon system, our proposed methods enable the Xilinx SDSoC and HeteroFlow designs to achieve an 8.2× and 8.3× improvement in energy efficiency, respectively.

I. INTRODUCTION
Visual Odometry (VO) is a core component of the Simultaneous Localization And Mapping (SLAM) task. VO systems are deployed on robots, drones (unmanned aerial vehicles) and virtual reality or augmented reality platforms.
Recent developments in High-Level Synthesis (HLS) [6] and integrated design environments for software developers [7] offer easier and faster integration of hardware-software co-designs on Systems-on-Chip with integrated FPGAs [8]. These FPGA SoCs (e.g., the Xilinx Zynq family) can deliver enhanced performance and energy savings.
FPGA acceleration of parts (kernels) of VO has been studied when the computationally intensive kernels exhibit dense computations and regular memory access patterns. As shown in Section III-E, when analyzing SVO [9] there are two main kernels that can benefit from acceleration. One of those kernels is sparse with predictable memory access patterns, whereas the second kernel is sparse with random memory access patterns. However, semi-direct front-end kernels with sparse computations and/or random memory access patterns have been largely overlooked by previous work.
Considering the point of view of software developers, this paper proposes and explores three methods that target FPGA acceleration of semi-direct VO kernels with sparse computations and/or random memory access patterns using HLS. Specifically, these three proposed methods are: (a) approximation with domain knowledge, (b) lossless image compression, and (c) on-the-fly computation. We found that the performance improvement is limited when reducing only the computational overhead of a semi-direct VO kernel accelerated on an FPGA using standard HLS. The data transfer overhead between the processing cores and the FPGA accelerators can also be optimized and contributes further to the overall performance improvement. This is due to the sparse computation, further compounded by the random memory access patterns of these kernels. State-of-the-art HLS tools focus heavily on minimizing the computational overhead, while assuming that the kernels to be accelerated have dense computations and predictable memory access patterns. Therefore, we explore approximation, lossless image compression, and on-the-fly computation for data transfer overhead reduction. We discover that lossless image compression benefits SVO kernels with sparse computations but predictable memory access patterns. On the other hand, for SVO kernels that have sparse computations and random memory access patterns, we discover that approximation can transform the random memory access patterns into predictable patterns, thus enabling lossless image compression. Also, on-the-fly computation, which trades data transfer for computation, can further contribute to the reduction of the data transfer overhead.
SVO, our use case in this paper, is a state-of-the-art and widely used sparse VO system; SVO stands for Semi-direct monocular Visual Odometry [9]. SVO is the first VO system to use a semi-direct front-end, and it has been demonstrated to be more accurate than VO/SLAM systems with a feature-based or direct front-end on embedded platforms such as drones [10]. Furthermore, FPGA acceleration of sparse VO is not widely studied, and there is no previous work addressing the acceleration of a semi-direct front-end as present in SVO. This is mainly because the main modules of the semi-direct front-end (i.e., sparse image alignment and sparse optical flow using the Lucas-Kanade method without an image pyramid) are challenging for FPGA acceleration. Sparse image alignment has a sparse computational pattern and predictable memory access, while sparse optical flow using the Lucas-Kanade method without an image pyramid has a sparse computational pattern and random memory access, which makes it I/O intensive after offloading to the FPGA.
We first perform an end-to-end profiling analysis to identify the bottlenecks of SVO. After that, we develop the FPGA accelerator to reduce the computational overhead by using state-of-the-art Xilinx tools and HLS tools such as Xilinx Software-Defined System-on-Chip (SDSoC) and HeteroFlow [11]. Then, we apply the proposed methods to the FPGA accelerator to reduce the data transfer overhead between processing cores and FPGA accelerators. We demonstrate that the proposed methods can effectively reduce data transfer overhead to obtain better end-to-end performance and they can be used not only on standard Xilinx tools such as SDSoC, but also on state-of-the-art HLS tools such as HeteroFlow.
The contributions of this paper can be summarised as follows: • We present a case study, based on SVO and FPGAs, of exploring the acceleration of semi-direct VO kernels which have sparse computations and/or random memory access patterns. We demonstrate that both the computational and the data transfer overheads need to be optimized to obtain better end-to-end performance. State-of-the-art HLS tools are currently not able to conduct satisfactory optimizations of such kernels automatically.
• We propose three (generic) methods to reduce data transfer overhead between processing cores and FPGA accelerators when using HLS and targeting the sparse semi-direct VO kernels.
• We demonstrate that the proposed methods can be used not only with standard Xilinx tools, such as SDSoC, but also with state-of-the-art HLS tools such as HeteroFlow.
• We provide an experimental evaluation that shows our SDSoC and HeteroFlow designs, accelerated using both computational and data transfer optimizations, achieve a performance improvement of 19.3× and 17.6×, respectively, compared to designs with only computational optimization. When compared to the software running only on the Arm A53 processors, the optimized SDSoC and HeteroFlow designs achieve 2.4× and 2.14× speedup, respectively, without noticeable accuracy loss. The SDSoC and HeteroFlow designs achieve 1.85× and 1.89× improvements in energy efficiency, respectively, when compared with the software baseline. When comparing with the software baseline running on an Intel Xeon W-2123 system, both designs achieve an 8× improvement in energy efficiency.

The rest of the paper is organized as follows: Section II presents the background on VO systems and related work on developing accelerators for VO/SLAM systems and standalone optical flow applications with FPGAs. Section III describes the specific VO use case (i.e., SVO) and its key modules. Section IV presents two candidate kernels for acceleration targeting the bottlenecks of SVO. For both accelerators, we present the computational optimization (Sections IV-A1 and IV-B1) and then discuss the challenges in obtaining speedup (Sections IV-A2 and IV-B2). After that, we demonstrate the application of the proposed methods for reducing the data transfer overhead between the processing cores and the FPGA accelerators in our use cases (Sections IV-A3, IV-A4 and IV-B3). Section V evaluates the designs generated by SDSoC and HeteroFlow on a Xilinx Zynq Ultrascale+ platform, and presents a direct comparison with the SVO software baseline running on the Arm A53 processors and on an Intel Xeon W-2123 system. We conduct performance analysis for both accelerators, including the use of roofline models. To provide a wider context for our contribution, we also present (in Table 12) a number of state-of-the-art FPGA-accelerated VO/SLAM examples, emphasizing the performance, power, estimated energy efficiency, area and accuracy characteristics obtained by previous works targeting VO/SLAM with different front-ends on different FPGA platforms. Finally, Section VI discusses the key lessons of this work.

II. BACKGROUND AND RELATED WORK
This section discusses the background of VO as well as related work on implementing FPGA accelerators for VO and SLAM systems. Additionally, related work on the implementation of standalone optical flow accelerators on FPGAs is presented. To the best of our knowledge, there is no prior work exploring the reduction of the data transfer overhead between the processing cores and the FPGA accelerator of a sparse VO/SLAM system when using HLS, nor the acceleration of visual odometry with a semi-direct approach on an FPGA.

A. VISUAL ODOMETRY
Visual odometry is a problem that is ubiquitous in areas such as the navigation of unmanned vehicles, virtual reality, augmented reality and extended reality. In general, VO can be thought of as the task of processing raw data from visual sensors (cameras) and conducting probabilistic state estimation. Typically, a VO system consists of two threads (i.e., a localization and a mapping thread), which are executed in parallel with synchronizations.
In a VO system, the processing of raw data from cameras typically includes feature detection and feature descriptor generation, which are two of the main kernels in the VO front-end. A feature is a locally distinct location in a camera image, while a descriptor summarizes the local structure around a feature into a vector. Probabilistic state estimation can be divided into the following tasks: (a) pose estimation, which is part of the localization thread, and (b) map estimation, which is executed in the mapping thread. Pose estimation calculates the translation and the rotation of the camera in an unknown environment by using matching algorithms. These algorithms compare the current camera image with the previous camera image or the map (the result of the mapping thread) and attempt to reduce either the reprojection error or the photometric error. The above tasks are explained in the next sections. There are three types of matching algorithms:
• Feature/Descriptor-based matching (i.e., the feature-based approach) compares the locations of the corresponding features, or feature descriptors, between two images or between one image and the map. This approach reduces the distance between two corresponding feature locations, which represents the re-projection error. Popular algorithms for such matching include sparse optical flow using the Lucas-Kanade method with an image pyramid [12], and dense optical flow using methods such as Farneback [13] and Horn-Schunck [14]. The above algorithms use differential methods for their calculations.
• Image Alignment (i.e., the direct approach) compares the intensities of pixels between two images and tries to reduce their differences, which reflect the photometric error. Image alignment includes dense image alignment, which conducts a pixel-by-pixel comparison, and sparse image alignment, which performs comparisons between pixel patches.
• Hybrid pose estimation (i.e., the semi-direct approach) is used by SVO [9], and it is a combination of feature/descriptor-based matching, using sparse optical flow with the Lucas-Kanade method, and sparse image alignment. Hybrid pose estimation seeks to reduce both the photometric error and the reprojection error, in order to further increase the estimation accuracy and robustness.
Map estimation is the main kernel of the VO back-end. Specifically, the 3D positions of the map points, or the depth of each pixel, are jointly estimated and optimized with the poses of the camera, by solving a non-linear least-squares system using algorithms such as Gauss-Newton [15] or Levenberg-Marquardt [16], to further reduce the reprojection or photometric error. A popular technique for map estimation is Bundle Adjustment [17]. Table 1 summarizes the related work and highlights the main differences compared with this paper. Our aim is not to develop a feature detector accelerator for SVO, since feature detection has been widely studied for FPGA acceleration, and some feature detectors are also part of the Xilinx Vision Library. Furthermore, feature detection is conducted at the back-end of SVO, and it is not a bottleneck.

1) VO/SLAM SYSTEMS ACCELERATION ON FPGAs
According to Table 1, previous works on accelerating VO/SLAM systems on FPGAs mainly focus on the following modules.
Note that the tracking algorithms in [31] and [32] are based on the Hamming distance since they use feature descriptors (Liu et al. [31] use BRIEF (Binary Robust Independent Elementary Features) [40], while Gu et al. [32] use SIFT (Scale-Invariant Feature Transform) [41]). Since SVO does not use feature descriptors, a direct comparison with our designs is not fair (see N/A under the Optical Flow based Feature Tracking column in Table 1). Papers in categories 1), 2) and 3) cover feature-based and direct VO/SLAM and thus are the closest related work. Papers in categories 4) and 5) address different sensors as input and thus are quite different from SVO. Using an image pyramid, as in [33], is a form of image compression, albeit one that loses information. To sum up, we are the only ones who study a semi-direct front-end with sparse image alignment and sparse optical flow without an image pyramid. Furthermore, no other work has studied approximation, lossless image compression and on-the-fly computation as a means of reducing the data transfer overhead in/out of the accelerators placed on the reconfigurable fabric.
A significant contribution from the vendors is the introduction of the Xilinx Vitis Vision Library [42], which provides a number of performance-optimized computer vision kernels. However, it does not provide kernels that are comparable to either sparse image alignment or sparse optical flow with the Lucas-Kanade method.
None of these previous works has explored accelerating VO/SLAM kernels with sparse computations and/or random memory access patterns (sparse image alignment and sparse optical flow without an image pyramid) on an FPGA using HLS, where not only the computational overhead but also the data transfer overhead needs to be reduced. Zhang et al. [33] have accelerated the sparse optical flow module with an image pyramid on an FPGA. However, using image pyramids instead of the whole image as input means a loss of information as well as less data needing to be transferred to the accelerator from the processing cores, making it a less challenging target.
Most previous work (see the summary in Table 1) utilizes either Direct Memory Access (DMA) or streaming to transfer data between the processing cores and the FPGA accelerators, without any special considerations for data transfer optimizations. Tang et al. [25] implemented direct I/O to allow the accelerator to access the image directly, bypassing the Arm cores. However, an off-chip camera kit is needed to provide a Digital Video Port (DVP) for such an implementation. Our work, on the other hand, exploits approximation with domain-specific knowledge, lossless image compression and on-the-fly computation for data transfer optimizations.
The reason why most previous works do not consider data transfer optimizations is that the modules that are accelerated all have at least one of the following characteristics:
• The module is computationally intensive after being offloaded to the FPGA.
• The computation is dense and has a regular and predictable memory access pattern.
One such example is the feature detector, which has been widely studied by previous works when trying to accelerate feature-based VO/SLAM on FPGAs, as Table 1 shows. In general, a feature detector iterates over every pixel of an image and determines which ones are feature points. Feature detection is a dense and computationally intensive process (iterating over all pixels), and its memory access pattern is regular and predictable (accessing pixels one after another). Modules with sparse computations and/or random memory access patterns are largely overlooked and have not been widely studied, since they are usually left to run on the processing cores.
SVO has been demonstrated to be more accurate than the state-of-the-art feature-based and direct VO/SLAM. However, it has not been widely studied for heterogeneous acceleration, especially on FPGAs. The bottleneck of SVO has sparse computations and/or random memory access patterns (sparse image alignment and sparse optical flow without image pyramid). As a result, the heterogeneous acceleration of SVO is a challenging and demanding process.
Note that the sparse optical flow accelerators developed in [22], [23], and [24] receive an image pyramid as input, which might lead to information loss due to compression.
The existing work on sparse optical flow has not explored FPGA acceleration of sparse optical flow with the Lucas-Kanade method without an image pyramid, which is what SVO uses. In addition, most of the proposed solutions have no special consideration for optimizing the data transfer between the processing cores and the FPGA accelerators. Chai et al. [21] employed on-the-fly computation to reduce the data transfer between the processing cores and the FPGA accelerator, albeit addressing dense optical flow. We, on the other hand, also exploit approximation and lossless image compression for reducing the data transfer overhead.

III. SEMI-DIRECT MONOCULAR VISUAL ODOMETRY
This section introduces our use case, the Semi-direct Monocular Visual Odometry (SVO) algorithm [9]. Additionally, a detailed profiling analysis that identifies the most computationally intensive parts of the algorithm is presented. The profiling analysis is conducted on the Arm Cortex-A53 cores of the MPSoC (Multi-Processor System-on-Chip) Zynq Ultrascale+ [43] platform, which is also used for the main evaluation (Section V).

A. OVERVIEW
Forster et al. [9] proposed Semi-direct Monocular Visual Odometry (SVO) as a means of targeting power-constrained platforms [10]. SVO has been the basis of several commercial applications, including the Parrot-SenseFly Albris drone and autonomous car navigation by ZurichEye (now Facebook-Oculus VR Zurich) [44]. The vanilla implementation of the SVO algorithm is mainly optimized for Arm processors. SVO uses two threads (i.e., a motion estimation and a mapping thread), which are executed in parallel with synchronizations. The motion estimation thread combines feature-based matching, using sparse optical flow, with sparse image alignment. Specifically, it consists of three main kernels: (a) sparse image alignment, (b) reprojection (based on sparse optical flow), and (c) pose and structure refinement. The three kernels are executed sequentially. In general, for the motion estimation thread of SVO, the current image is first compared with the previous image of the pipeline, using the difference in pixel intensity to reduce the photometric error (i.e., sparse image alignment). Then, the current image is compared with the map, by aligning the corresponding feature point locations to reduce the reprojection error (i.e., reprojection and feature alignment).
The motion estimation thread concludes with a local bundle adjustment to jointly optimize the pose of the camera and the map, to further reduce re-projection errors (i.e., pose and structure refinement). The pseudo-code of SVO and its motion estimation and mapping threads can be found in Algorithms 1, 2, and 3 respectively.

B. NOTATION
This section briefly explains the symbols and notations used in [9]. The symbols are summarized in Table 2. Any 3D point p ∈ S on the visible scene surface S can be projected onto the image using the camera projection model π, obtaining a 2D image point u ∈ Ω_k:

u = π(_k p)

where the prescript k denotes that the point coordinates are expressed in the reference camera frame k. Point u can be recovered through back-projection:

_k p = π^{-1}(u, d_u)

where d_u ∈ ℝ is the depth and R ⊆ Ω_k is the image region where d_u is known. The camera position and orientation at timestamp k is described by the rigid body transformation T_{k,w} ∈ SE(3). A 3D point _w p expressed in world coordinates can be mapped to the reference camera frame by:

_k p = T_{k,w} · _w p

The relative rigid body transformation T_{k,k−1} between timestamps k and k − 1 can be computed from the corresponding rigid body transformations T_{k,w} and T_{k−1,w}:

T_{k,k−1} = T_{k,w} · T_{k−1,w}^{-1}

A minimal representation of the rigid body transformation is needed when solving a non-linear least-squares system using the Gauss-Newton or Levenberg-Marquardt method. Therefore, the Lie algebra se(3), corresponding to the tangent space of SE(3) at the identity, is used. The algebra element, the twist coordinates ξ, is expressed as:

ξ = (ν, ω)^T ∈ ℝ^6

where ω is the angular velocity and ν is the linear velocity. ξ can be mapped to SE(3) using the exponential map:

T(ξ) = exp(ξ̂)

C. SPARSE IMAGE ALIGNMENT
Sparse image alignment calculates the relative rigid body transformation T_{k,k−1} between the current camera image and the previous image in the pipeline, by minimizing the negative log-likelihood of the intensity residual:

T_{k,k−1} = arg min_{T_{k,k−1}} Σ_{u ∈ R} −log p(δI(T_{k,k−1}, u))

The intensity residual δI is the photometric difference between pixels observing the same 3D point p. R is the image region where the image depth d_u is known at timestamp k − 1 and where the back-projected feature points are visible in the current image domain Ω_k. δI can be calculated by back-projecting a 2D point u from the previous image I_{k−1} and then projecting it into the current image I_k:

δI(T_{k,k−1}, u) = I_k(π(T_{k,k−1} · π^{-1}(u, d_u))) − I_{k−1}(u),  ∀u ∈ R

It is assumed that the intensity residuals δI are normally distributed with unit variance. Hence, the negative log-likelihood minimizer corresponds to a least-squares problem.
Sparse image alignment can be considered a variant of dense image alignment. Instead of a dense pixel-to-pixel comparison, the task conducts a sparse patch-to-patch comparison, using patches of 4 × 4 pixels I(u_i) around the detected feature points u_i. Sparse image alignment seeks to find the pose that minimizes the photometric error over all patches:

T_{k,k−1} = arg min_{T_{k,k−1}} (1/2) Σ_i ||δI(T_{k,k−1}, u_i)||²    (10)

Since (10) is nonlinear in T_{k,k−1}, it needs to be solved using an iterative Gauss-Newton method. Given an estimate T̂_{k,k−1} of the relative transformation, an incremental update T(ξ) of the estimate can be parameterized with a twist coordinate ξ ∈ se(3). The inverse compositional formulation of the intensity residual δI is used, in which the update step T(ξ) is applied to the reference image I_{k−1} at timestamp k − 1:

δI(ξ, u_i) = I_k(π(T̂_{k,k−1} · π^{-1}(u_i, d_{u_i}))) − I_{k−1}(π(T(ξ) · π^{-1}(u_i, d_{u_i})))    (11)

In order to find the optimal update T(ξ), the derivative of (10) is computed and set to 0:

Σ_i ∇δI(ξ, u_i)^T δI(ξ, u_i) = 0    (12)

This system is linearized around the current state (ξ = 0) in order to be solved:

δI(ξ, u_i) ≈ δI(0, u_i) + ∇δI(0, u_i) · ξ    (13)

The Jacobian J_i = ∇δI(0, u_i) has dimensions 16 × 6 (the patch size is 4 × 4) and can be calculated using the chain rule:

J_i = (∂I_{k−1}(a)/∂a)|_{a = u_i} · (∂π(b)/∂b)|_{b = _{k−1}p_i} · (∂(T(ξ) · _{k−1}p_i)/∂ξ)|_{ξ = 0}    (14)

By placing (13) into (12) and stacking the Jacobians J_i in the matrix J, the normal equations are obtained:

J^T J ξ = −J^T δI(0)    (15)

which can be solved to obtain an updated twist ξ. The pseudo-code of sparse image alignment and intensity residual computation can be found in Algorithms 4 and 5 respectively.

Algorithm 4 Overview of Sparse Image Alignment
1: for each image pyramid level do
2:   while not converged or not reach max iteration do
3:     Compute δI
4:     Solve the linearized system using δI and H to obtain an updated ξ
5:     Update T̂_{k,k−1} using T(ξ)
6:   end while
7: end for

D. REPROJECTION
After the sparse image alignment, the current camera image is compared with the map in the reprojection kernel. The 3D map points p that can be observed from the estimated pose are projected onto the current image I_k, to obtain an estimate of the corresponding 2D feature locations u_i′. Then, the u_i′ are optimized by minimizing the reprojection error with respect to the reference pixel patches I_r(u_i), which are around the feature points in the corresponding key image r:

u_i′ = arg min_{u_i′} (1/2) ||I_k(u_i′) − A_i · I_r(u_i)||²,  ∀i

where A_i is an affine warping matrix. The pseudo-code of reprojection and feature alignment can be found in Algorithms 6 and 7 respectively. Figure 1 presents a high-level overview of the SVO pipeline implemented in a software/hardware co-design environment. The modules on the red background are the most time-consuming parts of the SVO system, and they are implemented as hardware accelerators using HLS. The modules colored in green are executed on the Arm processors that are part of the Xilinx Zynq platform.

E. PROFILING RESULTS
The baseline software version of the SVO algorithm is profiled and evaluated on the Arm Cortex-A53 cores of the Xilinx ZCU102 Ultrascale+ platform [43]. The SVO software baseline is compiled using gcc-8.2 with the -O3 compiler optimization level and the Arm NEON extension, and is executed on the four available Arm Cortex-A53 cores (configured to performance mode), with Xilinx Petalinux 2019.1. We select the ''fast'' algorithmic parameter setting [9] of SVO, with local bundle adjustment disabled at the back-end.

Figure 2 illustrates the profiling results and shows a significant load imbalance between the motion estimation and mapping threads, as the execution time of motion estimation is about 24.2× larger than the execution time of mapping. Therefore, we aim to reduce the run-time of the motion estimation thread as much as possible. The next step of the profiling analysis shows that, within the motion estimation thread, the sparse image alignment kernel is the most time-consuming kernel, consuming around 55% of the run-time of motion estimation. Reprojection is the second-most time-consuming kernel, consuming around 39% of the run-time. Specifically, the Intensity Residual Computation (IRC) kernel takes around 99% of the run-time of sparse image alignment, while Feature Alignment (FA) consumes approximately 81% of the run-time of reprojection. Hence, the IRC and FA kernels are considered as our use case accelerators.

Algorithm 5 Overview of Intensity Residual δI Computation
…
8:   Compute W_{I_{k−1}} using p and u_i
9:   for each row of I_{k−1}(u_i) do
10:    for each column of I_{k−1}(u_i) do
11:      Compute I_{k−1}(u_i) using W_{I_{k−1}} and I_{k−1}
12:      Compute J_i using p_k, f_i, W_{I_{k−1}}, I_{k−1} and π, and store J_i into J
13:    end for
14:  end for
15: end for
…
21:     Do nothing
22:   else
23:     Skip current iteration
24:   end if
25:   Project u_i into I_k
26:   if the projected u_i is within I_k then
27:     Do nothing
28:   else
29:     Skip current iteration
30:   end if
31:   Compute W_{I_k} using p_k, f_i, T̂_{k,k−1}, π and p
32:   for each row of I_{k−1}(u_i) do
33:     for each column of I_{k−1}(u_i) do
34:       Compute H and δI using W_{I_k}, I_k, I_{k−1}(u_i) and the corresponding J_i stored in J
35:     end for
36:   end for
37: end for

Algorithm 6 Overview of Reprojection
1: Reproject all 3D map points that are visible from the estimated camera pose into the current image I_k.
2: Identify the key image r that can observe the map point with the closest angle.
3: while not finding enough matching u_i′ do
4:   for each grid block in I_k do
5:     for each u_i′ in I_k do
6:       Compute A_i using u_i on I_r, π and the estimated camera pose
7:       Compute I_r(u_i) using A_i, I_r and u_i
…

Algorithm 7 Overview of Feature Alignment
1: for each row of I(u_i) do
2:   for each column of I(u_i) do
3:     Compute H using I_r(u_i)
4:   end for
5: end for
6: while not converged or not reach max iteration do
7:   Compute W_{I_r} using updated u_i′
8:   for each row of I(u_i) do
9:     for each column of I(u_i) do
10:      Compute Jacobian residual J_res using I_k and W_{I_r}
11:    end for
12:  end for
13:  Update u_i′ using H and J_res
14: end while

IV. DESIGNING ACCELERATORS FOR SVO
This section presents the use case designs for the IRC and FA modules in SVO. The main goal is to decrease the data transfer overhead between the host processing cores and the programmable logic, using the proposed methods: approximation with domain-specific knowledge, lossless image compression and on-the-fly computation. We first optimize the computational overhead of both modules, followed by the application of the proposed methods to reduce the data transfer overhead of J, I_{k−1}, and I_k. Figure 3 shows the hardware-software co-design for SVO when it is mapped onto the ZU9EG SoC platform. The integrated FPGA in this SoC includes 4 High Performance (HP) ports and 2 High-Performance Coherent (HPC) ports that are responsible for high-performance data transfer between the Processing System (PS) and the Programmable Logic (PL) of the target platform. In our case, HP ports and HPC ports, along with AXI4 (Advanced eXtensible Interface)-Full interfaces, are used to transfer most of the inputs and outputs of the IRC and FA accelerators. A General Purpose (GP) port along with AXI4-Lite is used to stream I_{k,d} to the FA accelerator, and to control the IRC and FA accelerators. The following paragraphs present the designs of the Intensity Residual Computation (IRC) and Feature Alignment (FA) hardware accelerators.
We use the HLS and data motion reports generated by Xilinx SDSoC to decide which optimizations are most beneficial for the accelerator designs, since it would be quite time-consuming to synthesize the IRC and FA accelerators and obtain run-time performance data after each optimization. The data in Tables 3, 6, and 7 are also obtained from these reports.

A. INTENSITY RESIDUAL COMPUTATION ACCELERATOR
1) REDUCING COMPUTATIONAL OVERHEAD OF THE IRC ACCELERATOR
We first try to reduce the computational overhead of the IRC module using the HLS pragmas provided by Xilinx SDSoC and HeteroFlow. These HLS pragmas include loop pipelining, loop unrolling, caching frequently accessed data in the on-chip memory (BRAM, Block RAM), and BRAM partitioning to allow parallel access. We also allocate contiguous memory space for both the input and output of the IRC accelerator, as recommended by Xilinx HLS.
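The fragment below is a minimal sketch of how these computational optimizations are typically expressed in HLS C++. The function, array names and sizes (irc_core, patch_buf, MAX_FEATURES) are illustrative placeholders and do not reproduce the actual IRC accelerator interface.

// Sketch of the HLS computational optimizations described above: loop
// pipelining, loop unrolling, BRAM caching and array partitioning.
// All identifiers and sizes are illustrative assumptions.
#define MAX_FEATURES 128   // assumed upper bound on features per invocation
#define PATCH_PIXELS 16    // 4x4 patch around each feature

void irc_core(const unsigned char *img_dense, float b[6]) {
  // Cache the (compressed) image pixels in on-chip BRAM.
  unsigned char patch_buf[MAX_FEATURES * PATCH_PIXELS];
#pragma HLS ARRAY_PARTITION variable=patch_buf cyclic factor=4  // allow parallel reads

  COPY: for (int n = 0; n < MAX_FEATURES * PATCH_PIXELS; ++n) {
#pragma HLS PIPELINE II=1                 // one pixel copied per clock cycle
    patch_buf[n] = img_dense[n];
  }

  FEATURES: for (int i = 0; i < MAX_FEATURES; ++i) {
    PIXELS: for (int p = 0; p < PATCH_PIXELS; ++p) {
#pragma HLS PIPELINE II=1                 // pipeline the residual computation
      float r = (float)patch_buf[i * PATCH_PIXELS + p];  // placeholder residual
      ACC: for (int k = 0; k < 6; ++k) {
#pragma HLS UNROLL                        // update all six entries in parallel
        b[k] += r;                        // placeholder accumulation
      }
    }
  }
}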
However, as can be seen from the third and fourth row of Table 4, this leads to worse run-time performance. There is a 2× slowdown in the execution time of the IRC module, for both the SDSoC design and the HeteroFlow design. To understand the reason behind this, we break down the execution  time of the IRC accelerator into computation clock cycles and data transfer clock cycles, as shown in Table 3.
According to Table 3, the IRC accelerator indeed executes a computationally intensive task after offloading: computation consumes around 90% of the clock cycles. The HLS pragma directives yield a 62× speedup in the computation clock cycles of the SDSoC design and an 89× speedup in the HeteroFlow design. However, we noticed that, after the computational optimization, the bottleneck of the IRC accelerator has shifted from computation to data transfer, in particular the input.

2) CHALLENGES IN OBTAINING SPEEDUP FOR IRC ACCELERATOR
The main challenge in obtaining speedup for the IRC module lies in two areas.
The first is the additional input data transfer incurred when using an FPGA accelerator. Even though the off-chip memory is shared between the Arm cores and the programmable logic in an FPGA SoC, the input and output of the accelerator still need to be copied to a specific DRAM location in order to be accessed by the accelerator. Such data movement incurs extra data transfer overhead compared to the processing-cores-only solution, where the input and output of the intensity residual computation software function do not need to be copied to a specific location in DRAM.
The second is that the sparse computational pattern leads to a waste of memory bandwidth, as data that will not be used in the computation is transferred as well. An example of this is I_{k−1} and I_k, where up to 74% of the pixels are never read. Hence, the next step is to reduce the data transfer overhead of the IRC accelerator.

3) REDUCING DATA TRANSFER OVERHEAD OF J USING ON-THE-FLY COMPUTATION
We reduce the data transfer overhead of J by using on-the-fly computation, which trades data transfer for re-computation. To reduce the execution time, the authors in [9] propose pre-computing the Jacobian J_i. J_i is constant over all the iterations (Line 2 of Algorithm 4) on the same image pyramid level, and a J_i is needed each time the intensity residual computation software module is invoked. While this is valid from a computation and processing-cores-only point of view, for an FPGA implementation, where the data need to be transferred to the hardware, such an approach incurs extra data transfer overhead.
When optimizing the computational overhead, we declare the Jacobian matrix J as a global array that is stored in DRAM. The global array can be accessed by both the Arm cores and the IRC accelerator. This design decision allows J to be pre-computed on the Arm cores, but J needs to be transferred from the PS to the PL every time the IRC accelerator is invoked. Extra data transfer overhead is incurred here, since J is a matrix consisting of 11,610 64-bit floating point elements.
As mentioned before, J only needs to be computed once, per image pyramid level, since the J i stays constant across the iterations (Line 2 of Algorithm 4). Also, J is not needed in the computation on the software side.
The optimal solution would be to compute J using the IRC accelerator and store it in the on-chip BRAM as a local array instead of a global one, so that there is no need to transfer J between the processing cores and the PL. We decided to re-compute J because J is needed each time the IRC accelerator is invoked, and the value of a local array is lost every time the accelerator function returns. As a result, we need to compute J each time the IRC accelerator is launched, rather than computing it just once per image pyramid level. The main advantage of this approach is that the data transfer overhead incurred by J is completely eliminated.
However, there are still two more issues with this approach. The first issue is the BRAM usage, since J is an array that consists of 11,610 64-bit floating point elements. The second issue is the computation of δI (Line 19 to Line 37 of Algorithm 5): there is a dependency between the computation of δI and the computation of J (Line 2 to Line 15 of Algorithm 5).
Our proposed solution to the above issues is to compute J on the fly. This is achieved by merging the computation loop of δI (Line 19 to Line 37 of Algorithm 5) and the computation loop of J (Line 2 to Line 15 of Algorithm 5). By observing Algorithm 5, we discovered that the computation of J and of the reference patch I_{k−1}(u_i) (Line 2 to Line 15 of Algorithm 5) shares a similar iteration space with the computation of δI (Line 19 to Line 37 of Algorithm 5). Therefore, we merged these two loops, as can be seen in Algorithm 8 (Line 5 to Line 17).
This approach eliminates the need to store J and I_{k−1}(u_i), as they are both produced and consumed on the fly. The computation loop of J is also overlapped with the computation loop of δI.
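The sketch below illustrates this loop fusion in HLS C++: each Jacobian row is recomputed inside the δI loop and consumed immediately, so J never needs to be stored or transferred. The helper functions compute_jacobian_row() and warped_intensity() are hypothetical stand-ins for the corresponding steps of Algorithm 8, not the actual SVO implementation.

// Sketch of on-the-fly Jacobian computation via loop fusion. The helpers
// below are placeholders; the real steps follow Algorithm 8.
static float warped_intensity(const unsigned char *img, int idx) {
  return (float)img[idx];                      // placeholder: real code applies the warp W
}
static void compute_jacobian_row(int i, int r, int c, float J_i[6]) {
  for (int x = 0; x < 6; ++x) J_i[x] = 0.0f;   // placeholder: real code uses the chain rule (14)
  (void)i; (void)r; (void)c;
}

void irc_fused(const unsigned char *ref_dense,  // I_{k-1,d}
               const unsigned char *cur_dense,  // I_{k,d}
               int num_features, float H[6][6], float b[6]) {
  FEATURES: for (int i = 0; i < num_features; ++i) {
    ROWS: for (int r = 0; r < 4; ++r) {
      COLS: for (int c = 0; c < 4; ++c) {
#pragma HLS PIPELINE
        int p = (i * 4 + r) * 4 + c;            // position in the dense buffers
        float J_i[6];
        compute_jacobian_row(i, r, c, J_i);     // re-computed instead of transferred
        float dI = warped_intensity(cur_dense, p)
                 - warped_intensity(ref_dense, p);   // intensity residual for this pixel
        for (int x = 0; x < 6; ++x) {
#pragma HLS UNROLL
          b[x] -= J_i[x] * dI;                  // right-hand side of the normal equations
          for (int y = 0; y < 6; ++y)
            H[x][y] += J_i[x] * J_i[y];         // accumulate J^T J
        }
      }
    }
  }
}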

4) REDUCING DATA TRANSFER OVERHEAD OF I k−1 AND I k USING LOSSLESS IMAGE COMPRESSION
The second data transfer overhead that needs to be reduced is the one incurred by I_{k−1} and I_k. We reduce this overhead by compressing the large sparse images I_{k−1} and I_k into smaller dense images I_{k−1,d} and I_{k,d}, and storing them in consecutive memory. Images are usually stored as 2D matrices in memory. We consider an image to be equivalent to a sparse matrix if most of its pixels are not needed in the computation; in other words, the unneeded pixels are equivalent to zeros. Our compression method is based on CSR (Compressed Sparse Row), but differs from it in that I_{k−1,d} and I_{k,d} are vectors instead of 2D matrices. We are able to remove the arrays that hold the row and column indices of the stored pixels because we store the pixels in the same order as they are consumed in the computation. The pixels in I_{k−1,d} and I_{k,d} are read sequentially during computation, so there is no need to store the row and column indices of the pixels. This further reduces the data transfer overhead of I_{k−1} and I_k.
The image pyramids of I_{k−1} and I_k are stored as 2D OpenCV matrices. IRC operates on individual image pyramid levels, which means that we do not need to transfer the whole I_{k−1} and I_k. However, further analysis of Algorithm 5 shows that even on image pyramid level 4 (the highest level used in IRC; the higher the image pyramid level, the smaller the image), only up to 13% of the pixels in I_k and 26% of the pixels in I_{k−1} are used in the computation, due to the sparse computational pattern of IRC. The image pyramid levels of I_{k−1} and I_k are therefore similar to sparse matrices, where the unused pixels are equivalent to zeros. It would thus be optimal to transfer only the necessary pixels to the accelerator, instead of transferring the image pyramid levels of I_{k−1} and I_k.
We discovered that the indices of the necessary pixels in I_{k−1} depend on u_i, while the indices of the necessary pixels in I_k depend on T̂_{k,k−1}, π, and p. u_i, π and p remain constant during the execution of the sparse image alignment module, while T̂_{k,k−1} is updated only once per iteration (Line 2 of Algorithm 4). This gives us the opportunity to build two new dense images I_{k−1,d} and I_{k,d} (they are considered dense because there are no unused pixels, i.e., no zeros), as can be seen in Algorithm 8. This is done by compressing all necessary pixels of I_{k−1} and I_k into I_{k−1,d} and I_{k,d} using the method mentioned earlier and storing them in contiguous DRAM memory space. The size of the new dense images is reduced by up to 7.5× compared with the original sparse ones.
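As a concrete illustration, the compression step executed on the Arm side can be sketched as follows. The pixels are written to the dense buffer in exactly the order in which the accelerator consumes them, so no row or column indices need to be stored. The Feature struct and function names are illustrative assumptions, not the actual SVO data structures.

#include <vector>
#include <opencv2/core.hpp>   // the image pyramid levels are stored as OpenCV matrices

// Hypothetical feature type: integer origin of the 4x4 patch in I_{k-1}.
struct Feature { int row, col; };

// CSR-like compression: gather only the 4x4 patch pixels that the IRC
// accelerator will read, in consumption order, into a dense vector.
std::vector<unsigned char> compress_prev_image(const cv::Mat &I_prev,             // I_{k-1}, 8-bit single channel
                                               const std::vector<Feature> &feats) // patch origins derived from u_i
{
  std::vector<unsigned char> I_prev_dense;
  I_prev_dense.reserve(feats.size() * 16);
  for (const Feature &f : feats)
    for (int r = 0; r < 4; ++r)
      for (int c = 0; c < 4; ++c)
        // Same iteration order as the accelerator, so the indices stay implicit.
        I_prev_dense.push_back(I_prev.at<unsigned char>(f.row + r, f.col + c));
  return I_prev_dense;   // transferred to the PL instead of the full image level
}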
The last step in this direction is to further reduce the data transfer overhead of the IRC accelerator by using reduced precision (16-bit floating point numbers) to represent u_i, π, f_i, p_k, T̂_{k,k−1}, and p.

Algorithm 8 Overview of the Optimized Sparse Image Alignment
…
2:   Compress I_k into I_{k,d} using u_i
3:   while not converged or not reach max iteration do
4:     Compress I_{k−1} into I_{k−1,d} using T̂_{k,k−1}, π and p
5:     for each feature i detected in I_{k−1} do
6:       if I_{k−1}(u_i) is within I_{k−1} then
…
11:      Compute W_{I_{k−1}} and W_{I_k} using p_k, f_i, T̂_{k,k−1}, π, p and u_i
12:      for each row of I_{k−1}(u_i) do
13:        for each column of I_{k−1}(u_i) do
14:          Compute H and δI using p_k, f_i, W_{I_{k−1}}, W_{I_k}, I_{k,d} and I_{k−1,d}
15:        end for
16:      end for
17:    end for
18:    Solve the linearized system using δI and H to obtain an updated ξ
19:    Update T̂_{k,k−1} using T(ξ)
20:  end while
21: end for

B. FEATURE ALIGNMENT ACCELERATOR
1) REDUCING COMPUTATIONAL OVERHEAD OF THE FA ACCELERATOR
We first reduce the computational overhead of the FA module using the same HLS optimizations applied to the IRC accelerator (Section IV-A1). Although this yields a 7× speedup in the computational clock cycles of the SDSoC design, and a 6.2× speedup in the HeteroFlow design (Table 3), there is still a 20.2× slowdown in the run-time compared to the software baseline (Table 4), due to the sparse computations and random memory access patterns of FA, and the extra data transfer overhead between the Arm cores and the accelerator, as explained in Section IV-A2.

2) CHALLENGES IN OBTAINING SPEEDUP FOR FA ACCELERATOR
Apart from the additional input data transfer overhead when using an FPGA accelerator, the challenge of obtaining speedup for the FA module also lies in its sparse computation paired with a random memory access pattern. According to Table 3, after offloading, the performance bottleneck of the FA accelerator is in I/O, in particular the input, rather than computation. The input data transfer clock cycles take up around 99% of the clock cycles of the FA accelerator. The reason why FA is I/O intensive after offloading is its sparse computation and random memory access pattern.
The FA module, which is based on sparse optical flow, does not operate on an image pyramid as the original sparse optical flow does. That means we cannot transfer a smaller I_k. A closer look at Algorithm 7 shows that only up to 256 pixels are needed per iteration (Line 6 of Algorithm 7), due to the sparse computational pattern of FA. The default number of iterations is 10, which means only up to 2,560 pixels are needed during the computation of the FA accelerator, which is around 0.7% of the pixels in I_k. We also found that the memory access pattern on I_k is random, as the indices of the necessary pixels in I_k depend on u_i′, and u_i′ is updated at every iteration (Line 6 of Algorithm 7). That means it is not useful to transfer the part of I_k that contains all the necessary pixels, or to stream I_k to the accelerator, and it is also difficult to compress all the necessary pixels of I_k into a smaller dense image I_{k,d}, as those indices depend on intermediate results.
Note that I k,d in FA is different from I k,d in IRC in both size and the pixels it stores, so it needs to be built again in FA.

3) REDUCING DATA TRANSFER OVERHEAD OF I k USING APPROXIMATION AND LOSSLESS IMAGE COMPRESSION
We reduce the data transfer overhead of I_k by first using approximation with domain knowledge, which then enables lossless image compression. An interesting observation is that the values of most u_i′ remain the same during the execution of the FA module; in other words, most u_i′ are not updated. Even for the u_i′ that are updated, their new locations are typically within the neighbourhood of their previous locations, which means there is no significant change in their coordinates. The reason behind this finding is that the estimated pose from the sparse image alignment module is already very close to the final one, so feature alignment does not reduce the error significantly. Hence, only a small number of u_i′ is updated.
Based on this finding, it is possible to approximate the calculation of the indices of the necessary pixels in I_k. Instead of using the updated u_i′, we use its initial value, i.e. the input of the FA accelerator, for the calculation of the necessary pixel indices in I_k. By doing this, we can compress all the necessary pixels of I_k into a smaller dense image I_{k,d}, as can be seen in Algorithm 9. When the number of iterations of the u_i′ update loop (Line 13 of Algorithm 9) is 10, only 81 pixels are needed over the 10 iterations. Therefore, instead of transferring the whole I_k, where only 81 out of 360,960 pixels are needed, to the FA accelerator, we now compress those 81 pixels into I_{k,d} and transfer it. This optimization reduces the input data transfer clock cycles by about 3772× (see Table 3), without sacrificing accuracy significantly (see Table 4).
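The host-side sketch below illustrates the approximation: the pixel indices are derived only from the initial u_i′ (rather than from the values updated inside the iteration loop), which makes the access pattern predictable and lets the needed neighbourhood be gathered into I_{k,d} up front. The 9×9 window (81 pixels) and the names used are illustrative assumptions; boundary checks are omitted.

#include <vector>
#include <opencv2/core.hpp>

static const int WINDOW = 9;   // assumed 9x9 = 81-pixel neighbourhood, per the count above

// Gather the fixed neighbourhood around the *initial* feature estimate u_i'
// into a dense buffer, instead of following the per-iteration updated position.
std::vector<unsigned char> compress_cur_image_fa(const cv::Mat &I_cur,   // I_k
                                                 int u_row0, int u_col0) // initial u_i'
{
  std::vector<unsigned char> I_cur_dense;
  I_cur_dense.reserve(WINDOW * WINDOW);
  for (int r = 0; r < WINDOW; ++r)
    for (int c = 0; c < WINDOW; ++c)
      I_cur_dense.push_back(
          I_cur.at<unsigned char>(u_row0 - WINDOW / 2 + r,
                                  u_col0 - WINDOW / 2 + c));
  return I_cur_dense;   // 81 pixels transferred instead of the full 360,960-pixel image
}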

4) EXPLORING PARALLELISM WITH MULTIPLE FA ACCELERATORS
We then try to deploy multiple FA accelerators at the same time. In the mapping thread, I_k is divided into a grid with a number of u_i′ in each grid block. A depth filter is then initialized on the u_i′ with the highest FAST (Features from Accelerated Segment Test) score in each grid block, unless a 2D-to-3D correspondence already exists. This is to ensure that the features are distributed evenly in I_k. The same grid is also used in feature alignment. In Algorithm 6, the FA module is applied to each u_i′ in each grid block to find the first u_i′ whose error is small enough to meet the criteria. In the end, it is guaranteed that only one u_i′ is selected per grid block. The loop at Line 3 of Algorithm 6 breaks after finding 120 (an algorithmic parameter) feature points. By analyzing Algorithm 6, it appears that both loops at Lines 4 and 5 can be parallelized in a SIMD (Single Instruction Multiple Data) fashion, i.e. multiple grid blocks can be processed at the same time, and multiple u_i′ in each grid block can be processed at the same time as well.
However, we found that parallelizing the loops at Lines 4 and 5 of Algorithm 6 is not straightforward, due to inter-loop dependencies and irregularities. The inter-loop dependencies are two-fold: first, the loop at Line 4 of Algorithm 6 breaks as soon as 120 matched feature points are found; second, the loop at Line 5 of Algorithm 6 breaks as soon as the first feature point with an error that is small enough to meet the criteria is found.
The irregularities are also two-fold. First, some grid blocks are empty, meaning there is no need to compute them. We discovered that, on average, feature alignment needs to iterate over 216 grid blocks to find 120 feature points; since one feature point is guaranteed to be found in each non-empty grid block, this means that, on average, 44% of the grid blocks are empty. Second, the number of feature points in each grid block varies, from 0 to 13. The distribution of the empty grid blocks is unknown, and so is the number of feature points in each non-empty grid block. If the loop at Line 4 of Algorithm 6 is to be parallelized directly, then conditional branching might be needed to determine whether a grid block is empty or not. As for the loop at Line 5 of Algorithm 6, we also need conditional branching to determine the number of feature points in each grid block, so that we can determine how many FA accelerators need to be deployed at the same time.
The inter-loop dependency at Line 4 of Algorithm 6 can be easily resolved, since one feature point is guaranteed to be selected in each non-empty grid block and a total of 120 feature points are needed. Therefore, only the first 120 non-empty grid blocks need to be processed, so multiple grid blocks can be processed at the same time until 120 blocks have been processed. This also eliminates the irregularity at Line 4 of Algorithm 6. As for the dependency at Line 5 of Algorithm 6, multiple feature points can be processed in parallel and the one with the minimum error selected. However, the irregularity still exists, which might cause extra overhead due to the use of conditional branching.
Our solution for removing the irregularity, as well as the overhead brought by the conditional branching in the loop at Line 5 of Algorithm 6, is approximation exploiting domain knowledge. Instead of using conditional branching to determine how many FA accelerators need to be deployed at the same time, we deploy only one FA accelerator, on the first u_i′ in each non-empty grid block, and select it. We found that this approximation does not have a significant impact on the accuracy, because the sparse image alignment module ensures that none of the u_i′ are outliers; this means that, within the same grid block, the error of each u_i′ is not significantly different from the others, so even if the u_i′ that gets selected is not the one with the minimum error, the accuracy will not change significantly. Algorithm 9 presents the optimized FA accelerator along with the reprojection module.
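A sketch of the corresponding host-side control flow is shown below: non-empty grid blocks are processed until 120 feature points have been selected, and only the first candidate u_i′ in each non-empty block is handed to the FA accelerator. The containers and the fa_accel() call are hypothetical, and the parallel dispatch of multiple blocks is omitted for brevity.

#include <vector>

struct Candidate { int idx; };                       // hypothetical candidate u_i'
typedef std::vector<Candidate> GridBlock;

// Hypothetical accelerator invocation: refines one feature location and
// reports whether the resulting match meets the error criteria.
static bool fa_accel(const Candidate &c) { (void)c; return true; }

int align_features(const std::vector<GridBlock> &grid) {
  const int kRequiredMatches = 120;                  // algorithmic parameter
  int matches = 0;
  for (const GridBlock &block : grid) {
    if (block.empty()) continue;                     // skip empty blocks
    if (fa_accel(block.front()))                     // first u_i' only, per the approximation
      ++matches;
    if (matches >= kRequiredMatches) break;          // stop after 120 selected features
  }
  return matches;
}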

V. EVALUATION
This section presents the evaluation of the SVO use case, with IRC and FA accelerator designs. We start by introducing the experimental setup, followed by an analysis regarding performance, power, estimated energy efficiency, area, and accuracy.

A. EXPERIMENT SETUP
Our use case is implemented and evaluated on a Xilinx Zynq ZCU102 Ultrascale+ platform, which hosts the ZU9EG SoC. The SoC consists of quad Arm A53 cores and the FPGA fabric, connected via the Accelerator Coherency Port at the last-level cache. The IRC and FA hardware accelerators, as well as the host code, are developed using Xilinx SDSoC 2019.1 and HLS C/C++. We also utilize the state-of-the-art HeteroFlow [11]. The image data-set we used for the evaluation is provided with the open-source code, and its resolution is 752×480. We report the achieved performance (in terms of execution time) and speedup, power, estimated energy efficiency (in terms of energy per image), area, and accuracy (in terms of average Absolute Trajectory Error (ATE)). We use software provided by Xilinx to measure the power of the Xilinx ZCU102 board. The onboard Texas Instruments INA226 monitors read the voltage and current values of each power rail through an I2C bus.

Algorithm 9 Overview of the Optimized Reprojection and Feature Alignment
…
4:   for each non-empty grid block in I_k do in parallel
5:     Compute A_i using u_i on I_r, π and the estimated camera pose
6:     Compute I_r(u_i) using A_i, I_r and u_i
7:     Compress I_k into I_{k,d} using u_i′
8:     for each row of I(u_i) do
9:       for each column of I(u_i) do
10:        Compute H using I_r(u_i)
11:      end for
12:    end for
13:    while not converged or not reach max iteration do
14:      Compute W_{I_r} using the initial u_i′
15:      for each row of I(u_i) do
16:        for each column of I(u_i) do
…
In terms of frequency, the IRC accelerator can reach up to 150 MHz, whereas the FA accelerator can reach up to 100 MHz. We include two SVO software baselines. The first baseline is compiled using gcc-8.2 with the -O3 compiler optimization level and the Arm NEON extension. We use the PS section of the ZCU102 board, which has four Arm A53 cores (1.2 GHz), all configured to performance mode, and 4 GB of PS DRAM. The OS is Xilinx Petalinux 2019.1. The second baseline is compiled using gcc-9.4 with the -O3 compiler optimization level and the Intel AVX (Advanced Vector Extensions) and SSE (Streaming SIMD Extensions) extensions. We use a workstation with an Intel Xeon W-2123 chip (3.6 GHz, 8 cores), 64 GB of DRAM and Ubuntu 20.04. We select the ''fast'' algorithmic parameter setting [9] of SVO, which disables local bundle adjustment at the back-end. Table 4 shows that the SDSoC design optimized with our proposed methods, with an average run-time of 6.56 ms, achieves a speedup of 2.4× for the end-to-end SVO pipeline, without noticeable accuracy loss, when both the IRC and FA accelerators are enabled, compared to the software baseline on the Arm A53 processors (with an average run-time of 15.5 ms). According to Table 9, we also achieve an improvement in energy efficiency of 1.85× and 8.2× when comparing with the software baselines on the Arm A53 and the Intel Xeon, respectively.

B. PERFORMANCE, POWER, ENERGY EFFICIENCY, AREA AND ACCURACY ANALYSIS
One of the key insights from Table 4 is that, when accelerating SVO kernels using HLS on an FPGA, both the computation and the data transfer need to be optimized to obtain better end-to-end performance, which state-of-the-art Xilinx tools and HLS tools typically cannot achieve automatically, as they assume that the kernels to be accelerated have dense computations and regular memory access patterns. The reasons data transfer optimizations are required are two-fold. First, there is extra data transfer when using an FPGA accelerator on an SoC, since the input of the accelerator needs to be copied to a specific DRAM location to be accessed by the accelerator. Second, VO kernels with sparse computations and/or random memory access patterns lead to a significant waste of memory bandwidth, as most of the transferred image pixels are not used in the computation.
We use the roofline model to analyze the performance of the IRC and FA accelerators. We built two roofline models each for the FA accelerator and the IRC accelerator generated by SDSoC. Our models are based on the application-centric roofline model with latency constraints proposed in [45]. For each accelerator, the first roofline model assumes the accelerator occupies the entire PL, while the second roofline model introduces an area constraint. The introduction of the area constraint represents a more realistic scenario: for example, when multiple accelerators need to run at the same time, it is impossible for each accelerator to occupy the entire PL. The area constraint means that, when calculating the peak performance, we assume the accelerator occupies the same amount of area as our implementations, instead of the entire PL. In Figure 4 and Figure 5, the black solid lines represent roofline models without the area constraint, while the red dotted lines represent roofline models with the area constraint. We also add a peak memory bandwidth ceiling to all the models. Manev et al. [46] report that the peak achievable DRAM bandwidth for the ZCU102 board is 14.4 GB/s. However, according to [45], when the memory access is random, the peak achievable DRAM bandwidth for the ZCU102 board falls to 0.17 GB/s. We show three accelerator designs: the unoptimized design (yellow dot), the design with only computational optimizations (blue dot), and the design with both computational and data transfer optimizations (red dot).
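For reference, the attainable performance in these roofline plots follows the standard bound

P_attainable = min(P_peak, I × B_peak),  with arithmetic intensity I = operations performed / bytes transferred,

where B_peak is either the 14.4 GB/s peak DRAM bandwidth or the 0.17 GB/s bandwidth for random memory access quoted above; a design sits on the memory roof whenever I × B_peak < P_peak.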
We conducted a design space exploration for data transfer, targeting the following inputs: I_k, I_{k,d}, and u_i. We considered the following design spaces: data mover, system port, data precision, and the use of our proposed methods. The exploration spaces and their parameters are summarized in Table 5. As recommended by the Xilinx HLS tool, we allocate contiguous DRAM memory space for I_k, I_{k,d}, and u_i. When using HP ports, the input is allocated in non-cacheable DRAM space; when using HPC ports, the input is allocated in cacheable DRAM space.
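To make these options concrete, the following fragment sketches how such choices are typically expressed with SDSoC data-movement pragmas on an accelerator prototype. The argument names and the specific selections shown are illustrative, not our final design point; a sys_port pragma can additionally pin an argument to a specific HP or HPC port using platform-specific port names.

// Illustrative SDSoC data-movement pragmas attached to a hypothetical
// accelerator prototype (not the actual IRC/FA interface).
#include <stdint.h>

#pragma SDS data copy(img_dense[0:n_pixels])                  // sized copy, eligible for a DMA data mover
#pragma SDS data data_mover(img_dense:AXIDMA_SIMPLE)          // simple AXI DMA (size-limited)
#pragma SDS data mem_attribute(img_dense:PHYSICAL_CONTIGUOUS) // buffer allocated with sds_alloc()
#pragma SDS data zero_copy(H[0:36])                           // accelerator accesses this buffer in DRAM directly
void fa_accel_hw(const uint8_t *img_dense, int n_pixels, float H[36]);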

1) PERFORMANCE ANALYSIS OF IRC ACCELERATOR
According to Table 3, although the HLS pragmas did yield a 62× speedup in the computational clock cycles of the SDSoC IRC accelerator design, and an 89× speedup in the HeteroFlow design, there is still a 2× slowdown in the average run-time for both designs, as can be seen in the third and fourth rows of Table 4. This is due to the additional data transfer overhead incurred when using an FPGA accelerator and the fact that IRC has a sparse computational pattern.
We did not observe a significant change in average run-time performance after applying data transfer optimization with DMA and HPC ports. The bottleneck of the IRC accelerator after computational optimization is the input data transfer; however, according to Table 3, the speedup in the input clock cycles of the SDSoC design and the HeteroFlow design is only around 1.01×, while the speedup in the output clock cycles of both designs is around 1.75×, after using DMA and HPC ports. The speedup in the average run-time of both designs is around 1.03× (see Table 4). The reasons why using DMA and HPC ports do not benefit either design significantly are two-fold. First, the Xilinx software-hardware co-design tools only allow DMA to be used on data structures that have fewer than 16,384 elements, meaning that J, I_k, and I_{k−1} cannot be transferred using DMA, even though they are the three largest data structures in the input of the IRC accelerator. Secondly, HPC ports are only beneficial when the amount of data that needs to be transferred is small enough, such as T̂_{k,k−1} (a 3×4 floating point matrix), p_k (a vector of 3 floating point numbers), H (a 6 × 6 floating point matrix) and δI (a vector of 6 floating point numbers). We explore HPC ports further in Section V-B2. Large amounts of data such as J (11,610 floating point numbers), I_k (22,560 8-bit unsigned integers) and I_{k−1} (22,560 8-bit unsigned integers) only yield worse performance when transferred using HPC ports.
After applying the proposed methods of lossless image compression and on-the-fly computation to the IRC accelerator, we observed a significant decrease in the input data transfer clock cycles for both the SDSoC design and the HeteroFlow design. According to Table 3, the speedup in the input clock cycles of the SDSoC design is around 19.3×, while that of the HeteroFlow design is around 16.4×. We also observe a 2× speedup in the output clock cycles for both designs, a 2.1× speedup in the computational clock cycles for the SDSoC design, and a 1.3× speedup in the computational clock cycles for the HeteroFlow design. The SDSoC design has lower input clock cycles because we used 16-bit floating point numbers to represent u i , π, f i , p k , T̂ k,k−1 , and p, to further reduce the data transfer overhead of the IRC accelerator, as mentioned in Section IV-A4. HeteroFlow, however, does not support 16-bit floating point numbers: it only supports 32-bit and 64-bit floating point numbers as well as integer and fixed-point types. Since using fixed-point or integer types to represent u i , π, f i , p k , T̂ k,k−1 , and p would have a negative impact on accuracy, we used 32-bit floating point numbers for these inputs in the HeteroFlow design, leading to higher input data transfer clock cycles.
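The 16-bit narrowing on the SDSoC side can be expressed with the half-precision type provided by Vivado HLS in hls_half.h. The sketch below is only an illustration of the idea: the structure layout, field names and packing function are ours, not the actual SVO data structures.

#include "hls_half.h"   // Vivado HLS half-precision (FP16) type

// Narrowing the small per-feature inputs to FP16 halves the bytes moved
// over the PS-PL interface for those arguments.
struct FeatureFP16 {
  half u[2];    // pixel coordinate u_i
  half f[3];    // bearing vector f_i
  half p[3];    // 3-D point p
};

// Host-side conversion performed once per frame before the buffer is
// handed to the accelerator (hypothetical helper, illustrative layout).
void pack_features_fp16(const float* u, const float* f, const float* p,
                        FeatureFP16* out, int n) {
  for (int i = 0; i < n; ++i) {
    out[i].u[0] = (half)u[2 * i];
    out[i].u[1] = (half)u[2 * i + 1];
    for (int k = 0; k < 3; ++k) {
      out[i].f[k] = (half)f[3 * i + k];
      out[i].p[k] = (half)p[3 * i + k];
    }
  }
}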
According to Table 4, compared with the designs without data transfer optimization, the SDSoC design and the HeteroFlow design optimized using the proposed methods achieve a speedup of 4.76× and 4.14×, respectively. Compared with the software baseline, the SDSoC design and the HeteroFlow design achieve a speedup of 2.38× and 2.08×, respectively.
This demonstrates that the proposed methods can be used not only with standard Xilinx tools but also with other state-of-the-art HLS tools, such as HeteroFlow. We will now analyze the performance of the IRC accelerator using the roofline model. Figure 4 shows the two roofline models of the IRC accelerator designs under the different peak memory bandwidth ceilings. Without considering the area constraint, the unoptimized design is computation-bound under the peak memory bandwidth ceiling, but memory-bound under the peak memory bandwidth ceiling for random memory access. Computational optimization improves performance, but the IRC accelerator with only computational optimizations is still memory-bound under the peak memory bandwidth ceiling for random memory access. However, after applying our proposed methods for data transfer optimization, a further increase in performance and arithmetic intensity can be observed in Figure 4, and the IRC accelerator design is now computation-bound under both the peak memory bandwidth ceiling and the ceiling for random memory access.
Taking the area constraint into consideration, both the unoptimized design and the design with only computational optimizations are computation-bound under both the peak memory bandwidth ceiling and the ceiling for random memory access, while the design with both computational and data transfer optimizations is much closer to the peak performance.
Both Table 3 and Figure 4 suggest that the IRC accelerator with both computational and data transfer optimization is computation-bound. The nature of the IRC algorithm contributes to the gap between the peak performance and our achieved performance, in particular the computation of H and δI. H and δI are updated iteratively (Line 5 of Algorithm 8), which makes it difficult to pipeline the loop and reduce the initiation interval to 1. However, when building the roofline model, the peak performance is obtained under the assumption that there are no loop-carried dependencies.
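The sketch below illustrates the accumulation pattern behind this limitation; the names, bounds and exact arithmetic are illustrative placeholders, not the actual IRC kernel. Each iteration both reads and writes H and δI, so the floating-point additions form a loop-carried dependency and the achievable initiation interval is bounded by the adder latency rather than 1, which is what the roofline peak assumes.

#define NUM_FEATURES 120   // illustrative bound

void accumulate(const float J[NUM_FEATURES][6],
                const float res[NUM_FEATURES],
                float H[6][6], float dI[6]) {
  for (int i = 0; i < NUM_FEATURES; ++i) {
#pragma HLS PIPELINE
    for (int r = 0; r < 6; ++r) {
      dI[r] += J[i][r] * res[i];            // read-modify-write on dI
      for (int c = 0; c < 6; ++c)
        H[r][c] += J[i][r] * J[i][c];       // read-modify-write on H
    }
  }
}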
We will now analyze the design space exploration results for the input data transfer of the IRC accelerator, using u i and I k as examples. Details of I k−1 are omitted because it shows similar behaviour to I k . Details of J, p and f i are omitted because they show similar behaviour to u i . Details of π, p k and T̂ k,k−1 are omitted since they show similar behaviour to I k,d in the FA accelerator, which we explore later. Furthermore, there is no need to transfer J once we compute it on the fly. For the data transfer design space exploration, we use as baselines the configurations in which u i is transferred using zero-copy and HP ports in FP64 (64-bit floating point), and I k is transferred using zero-copy and HP ports. Zero-copy is the default data mover of the Xilinx software-hardware co-design tools.
u i is an array of 240 floating point numbers. According to Table 6, when u i is represented in FP64, using HPC ports instead of HP ports results in worse performance regardless of the data mover. Compared with zero-copy and HP ports, zero-copy and HPC ports show a 1.8× slowdown in overall clock cycles, and in particular a 19.6× slowdown in setting up the data mover. Compared with DMA and HP ports, DMA and HPC ports show a 1.9× slowdown in overall clock cycles, and in particular a 10.5× slowdown in setting up the data mover. DMA with HP ports provides a 1.05× overall performance improvement compared with zero-copy with HP ports. We did not explore HP or HPC ports for stream because it only uses GP ports. When using stream, there is an 8.7× slowdown in overall clock cycles compared with the baseline, and in particular a 144.6× slowdown in data mover setup. This is because the elements of u i are not stored in the same order as they are accessed in computation, so extra cycles need to be spent re-organizing the elements of u i . Furthermore, AXI4-Stream should not be used with data structures larger than 300 bytes.
When u i is represented using FP32 (32-bit floating point), there is a 2× speedup in the overall clock cycles spent transferring u i when using HPC ports or stream, compared with FP64. We also observed a 1.44× speedup in overall clock cycles when using zero-copy and HP ports, and a 1.5× speedup when using DMA and HP ports, compared with FP64. However, using HPC ports or stream still yields worse performance than using zero-copy or DMA with HP ports.
When u i is represented using FP16 (16-bit floating point), there is a further 2× speedup in the overall clock cycles spent transferring u i when using HPC ports or stream, compared with FP32. We also observed a further 1.25× speedup in overall clock cycles when using zero-copy and HP ports, and a 1.28× speedup when using DMA and HP ports, compared with FP32. Note that only when u i is represented in FP16 do the HPC ports yield better performance than the HP ports. In this case, zero-copy with HPC ports achieves an overall speedup of 1.23× compared with zero-copy with HP ports, and DMA with HPC ports achieves an overall speedup of 1.1× compared with DMA with HP ports, while stream still yields worse performance than the baseline.
I k is a 2D matrix with 22,560 8-bit unsigned integers. According to Table 7, I k is initially unable to use DMA as it has more than 16,384 elements. When using zero-copy with HPC ports, we observed a 2.8× slowdown in overall clock cycles, and in particular in the data mover setup time (229.8×). Note that one of the proposed methods, lossless image compression, enables the use of DMA because the number of elements in I k,d is reduced significantly. With lossless image compression, there is a 5.8× speedup in transfer time when using zero-copy and HP ports. The configuration using DMA and HP ports shows a 1.04× overall performance improvement over the one using zero-copy and HP ports. Using the HPC ports again shows worse performance than using the HP ports, because even after compression I k,d still contains 3,000 8-bit unsigned integers. Compared with zero-copy and HP ports, zero-copy and HPC ports show a 2.1× slowdown in overall clock cycles, with a 30.56× slowdown in data mover setup time. Compared with DMA and HP ports, DMA and HPC ports show a 2.18× slowdown in overall clock cycles, with a 16.4× slowdown in data mover setup time.
In conclusion, DMA with HP ports yields slightly better performance than zero-copy with HP ports in general. However, in the Xilinx software-hardware co-design tools, DMA cannot be used with a data structure that has more than 16,384 elements, and the performance improvement it provides is limited. HPC ports only yield better performance than HP ports when the amount of data to be transferred is small enough; otherwise they introduce significant overhead in setting up the data mover. We further analyze the usage of HPC ports in Section V-B2.
Lossless image compression and reduced-precision representations can deliver a significant performance improvement, and can enable the use of DMA and even HPC ports, by reducing the number of elements or the amount of data that needs to be transferred. When using stream, the elements within a data structure need to be stored in the same order as they are accessed in computation, and the amount of data to be transferred should be less than 300 bytes, to avoid a large overhead in setting up the data mover.
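The following host-side sketch illustrates the lossless-compression idea: only the pixels the kernel will actually read (for example, a patch around each projected feature) are gathered, in the order they will be accessed, into a small dense buffer I k,d plus per-patch offsets, so the accelerator indexes the buffer sequentially instead of performing random reads into the full image. The names, patch size and stride are illustrative, not the exact SVO implementation.

#include <cstdint>
#include <vector>

struct CompressedImage {
  std::vector<uint8_t>  pixels;   // dense I_k,d, stored in access order
  std::vector<uint32_t> offset;   // start of each patch inside I_k,d
};

// Gather a (2*half_patch+1)^2 patch around each feature position (px, py).
CompressedImage compress_patches(const uint8_t* image, int stride,
                                 const int* px, const int* py, int n,
                                 int half_patch) {
  CompressedImage out;
  for (int i = 0; i < n; ++i) {
    out.offset.push_back((uint32_t)out.pixels.size());
    for (int dy = -half_patch; dy <= half_patch; ++dy)
      for (int dx = -half_patch; dx <= half_patch; ++dx)
        out.pixels.push_back(image[(py[i] + dy) * stride + (px[i] + dx)]);
  }
  return out;
}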

2) PERFORMANCE ANALYSIS OF FA ACCELERATOR
According to Table 3, although HLS pragmas yield a 7× speedup in the computational clock cycles of the SDSoC FA accelerator design, and a 6.2× speedup in the HeteroFlow design, there is still a 20× slowdown in the average run-time of both designs, as can be seen in the third and fourth rows of Table 4. This is due to the additional data transfer overhead incurred when using an FPGA accelerator and the fact that FA has sparse computation and a random memory access pattern, which makes FA I/O-intensive after offloading.
We did not observe a significant change in run-time performance after applying data transfer optimization with DMA and HPC ports. According to Table 3, the input clock cycles of the SDSoC design and the HeteroFlow design remain unchanged, while the speedup in the output clock cycles of both designs is 4.84×. The reasons why using DMA and HPC ports does not benefit either design are similar to those in Section V-B1. I k in FA is a full image with a resolution of 752 × 480, which is far too large to be transferred using DMA, while using HPC ports only leads to worse performance. I r (u i ) and u i ′ can be transferred using DMA along with HPC ports, but they are not the bottleneck.
After applying the proposed methods of approximation and lossless image compression to the FA accelerator, we observed a significant decrease in the input data transfer clock cycles for both the SDSoC design and the HeteroFlow design. According to Table 3, the speedup in the input clock cycles of both designs is 3772×. We also observe a further 12.5× speedup in the output clock cycles for both designs, a 1.32× speedup in the computational clock cycles for the SDSoC design, and a 1.24× speedup in the computational clock cycles for the HeteroFlow design.
According to Table 4, compared with the designs without data transfer optimization, the SDSoC design and the HeteroFlow design optimized with the proposed methods achieve a speedup of 68.83× and 66.34×, respectively. Compared with the software baseline, the SDSoC design and the HeteroFlow design achieve a speedup of 3.42× and 3.28×, respectively. This again demonstrates that the proposed methods can be used not only with standard Xilinx tools but also with other state-of-the-art HLS tools such as HeteroFlow.
We will now analyze the performance of the FA accelerator using the roofline model. Figure 5 shows the two roofline models of the FA accelerator designs under the different peak memory bandwidth ceilings. Both the unoptimized design and the design with only computational optimizations are memory-bound under both peak memory bandwidth ceilings, with or without the area constraint, and computational optimizations provide only a limited improvement in performance. However, after applying our proposed methods for data transfer optimization, a significant increase in performance and arithmetic intensity can be observed in Figure 5, and the accelerator design is now computation-bound under both peak memory bandwidth ceilings, with or without the area constraint.
Both Table 3 and Figure 5 suggest that the FA accelerator is computation-bound. The gap between the peak performance and the performance we obtained is mainly due to the nature of the FA algorithm. When building the roofline model, it is assumed that there are no loop-carried dependencies or dependencies between loops. That is not the case for the computation of H, J res and u i ′ (Lines 10, 17 and 20 of Algorithm 9). H, J res and u i ′ are all updated iteratively (Lines 8 and 13 of Algorithm 9), and there is conditional branching (the loop at Line 13 of Algorithm 9 may break early if converged), which makes it difficult to pipeline the loops and reduce the initiation interval to 1. Furthermore, there are dependencies between the loop that computes H (Lines 8 to 12 of Algorithm 9) and the loop that updates u i ′ (Lines 13 to 21 of Algorithm 9).
We will now analyze the design space exploration results for the input data transfer of the FA accelerator, using I k as an example. Details of I r (u i ) and u i ′ are omitted because they show similar behaviour to I k,d . For the data transfer design space exploration, we use as a baseline the configuration in which I k is transferred using zero-copy and HP ports. Zero-copy is the default data mover of the Xilinx software-hardware co-design tools.
I k is a 2D matrix with 360,960 8-bit unsigned integers. According to Table 7, I k is initially unable to use DMA as it has more than 16,384 elements. When using zero-copy with HPC ports, we observed a 2.4× slowdown in overall clock cycles, and in particular in the data mover setup time (3555×). Note that our proposed methods, approximation together with lossless image compression, enable the use of DMA and HPC ports because the number of elements in I k,d is reduced significantly. With approximation and lossless image compression, a 332× speedup in transfer time can be observed when using zero-copy and HP ports. The configuration using DMA and HP ports shows a 1.13× performance improvement over the one using zero-copy and HP ports. Using HPC ports with I k,d shows better performance than using HP ports, because after compression I k,d only contains 81 8-bit unsigned integers. Compared with zero-copy and HP ports, zero-copy and HPC ports achieve a 6.33× overall performance improvement. Compared with DMA and HP ports, DMA and HPC ports achieve a 5.59× overall performance improvement. Using stream yields the best performance, because the amount of transferred data is 81 bytes (less than 300 bytes) and the elements within I k,d are stored in the same order as they are accessed in computation.
In conclusion, approximation and lossless image compression can deliver a significant performance improvement, and can enable the use of DMA and HPC ports, by reducing the number of elements or the amount of data that needs to be transferred. DMA again shows limited performance improvement over zero-copy. However, if the amount of data to be transferred is small enough, using HPC ports yields much better performance than using HP ports, regardless of the data mover. Using stream yields better performance than zero-copy and DMA when the amount of data transferred is less than 300 bytes and the elements in the data structure are stored in the same order as they are accessed in computation.
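On the PL side, consuming such a small access-ordered buffer over a streaming interface can be sketched with the Vivado HLS hls::stream class, as below. The function name, the fixed 81-byte payload taken from the FA example above, and the pragma placement are illustrative; the point is that no reordering buffer is needed when the producer writes the pixels in exactly the order the kernel reads them.

#include <cstdint>
#include "hls_stream.h"

// Consume the compressed patch I_k,d over an AXI4-Stream interface.
void fa_consume_stream(hls::stream<uint8_t>& I_k_d, uint8_t patch[81]) {
#pragma HLS INTERFACE axis port=I_k_d
  for (int i = 0; i < 81; ++i) {
#pragma HLS PIPELINE II=1
    patch[i] = I_k_d.read();   // sequential reads in access order
  }
}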
We conducted further experiments to explore the circumstances under which HPC ports should be used instead of HP ports; the results are reported in Figure 6. The PL is clocked at 100 MHz in this experiment. Figure 6 shows that when using zero-copy as the data mover, it is not beneficial to use HPC ports if more than 1,200 bytes need to be transferred; when using DMA as the data mover, it is not beneficial to use HPC ports if more than 900 bytes need to be transferred. The reason why HPC ports yield worse performance than HP ports as the amount of transferred data increases is the growing overhead of setting up the data mover. Figure 7 shows that when transferring 1,300 bytes with zero-copy, using HPC ports yields a 1.42× speedup in transfer time only, compared with HP ports. Similar behaviour can be seen with DMA: when transferring 1,000 bytes, DMA with HPC ports yields a 1.17× speedup in transfer time only, compared with DMA and HP ports. However, the benefit of HPC ports in transfer time is negated once the time spent setting up the data mover is taken into account. Figure 8 shows that when transferring 1,300 bytes with zero-copy, using HPC ports yields a 5.94× slowdown in setting up the data mover, compared with HP ports; when transferring 1,000 bytes, DMA with HPC ports yields a 4.43× slowdown in setting up the data mover, compared with DMA and HP ports.

3) POWER AND ENERGY EFFICIENCY ANALYSIS
Table 8 shows that the total power of the SoC is approximately 3.06 W when running the SVO software baseline on all four Arm Cortex-A53 cores set to performance mode. When both the IRC and FA accelerators are enabled in the SDSoC design, the total power of the SoC is 3.9 W. The increase mainly comes from the dynamic power of the PL, which is around 0.82 W when the PL is active. For the HeteroFlow design with both accelerators enabled, the total power of the SoC is 3.49 W, slightly lower than that of the SDSoC design. The main difference in power between the SDSoC design and the HeteroFlow design also comes from the PL dynamic power: the HeteroFlow design occupies a smaller area than the SDSoC design, which leads to lower PL dynamic power.
Table 9 shows that the SVO software baseline running on the Arm A53 processors has an estimated energy efficiency of 47.43 mJ/image, while the SDSoC design achieves 25.58 mJ/image, a 1.85× improvement in energy efficiency. The HeteroFlow design achieves 25.16 mJ/image, a 1.89× improvement in energy efficiency. Compared with the SVO software baseline running on an Intel Xeon W-2123 system, which has a Thermal Design Power (TDP) of 120 W and an estimated energy efficiency of 208.8 mJ/image, the SDSoC design achieves an 8.2× improvement in energy efficiency and the HeteroFlow design an 8.3× improvement. Table 9 also reports the average ATE of the SVO software baseline as well as the average translation error of the accelerated SVO designs.
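Assuming the mJ/image figures are obtained as the product of the measured SoC power and the average per-frame run-time, the energy and power numbers above are mutually consistent; the per-frame times below are derived from the reported values and agree with the roughly 2.4× end-to-end speedup.

\[
E_{\text{image}} = P_{\text{SoC}} \times t_{\text{image}}
\;\;\Rightarrow\;\;
t_{\text{image}}^{\text{Arm}} \approx \frac{47.43\,\text{mJ}}{3.06\,\text{W}} \approx 15.5\,\text{ms},
\qquad
t_{\text{image}}^{\text{SDSoC}} \approx \frac{25.58\,\text{mJ}}{3.9\,\text{W}} \approx 6.6\,\text{ms}.
\]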

4) PERFORMANCE ANALYSIS OF APPLYING PROPOSED METHODS TO SVO SOFTWARE BASELINE
On the software side, we applied optimization techniques to the SVO software executed on the Arm Cortex-A53 processors and on the Intel Xeon W-2123 system. Table 11 presents the performance of SVO on the Arm and Intel systems; the run-time is the average over all dataset frames. We first tried loop unrolling using the compiler flag -funroll-loops, but we did not observe a significant performance improvement on either platform (see Table 11).
Next, we applied the proposed methods to the FA software code. We observed slightly better performance on both platforms compared with the software compiled using optimization level -O3 and the flag -funroll-loops (see Table 11): a 1.2× speedup for the SVO pipeline running on the Arm A53 processors, and a 1.12× speedup for the SVO pipeline running on the Intel Xeon W-2123 system. The modifications of processing only the first u i ′ in each grid block, and of processing multiple grid blocks at the same time by manually unrolling the loop at Line 4 of Algorithm 6, are responsible for this improvement; a sketch of these two modifications is given below. Compressing the necessary pixels of the large sparse I k into a smaller dense vector I k,d and transferring only I k,d does not make a significant difference in a processing-cores-only architecture, as there is no need to copy the pixels to a specific DRAM location for access by the computational unit.
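The sketch below illustrates the two software-side FA modifications just described; the type names, the stub alignment function and the unroll factor of four are illustrative placeholders rather than the actual SVO code.

#include <vector>

// Each grid block holds the candidate features u_i' that project into it.
struct Candidate { float u, v; };
struct GridBlock { std::vector<Candidate> candidates; };

// Hypothetical stand-in for the per-feature alignment step of Algorithm 6.
static Candidate align_feature(const Candidate& c) { return c; }

// (a) Only the first u_i' of each block is processed; (b) the loop over
// blocks is manually unrolled by four so independent blocks are handled
// per iteration.
void align_blocks(const std::vector<GridBlock>& blocks,
                  std::vector<Candidate>& out) {
  out.clear();
  size_t b = 0;
  for (; b + 4 <= blocks.size(); b += 4) {
    for (size_t k = 0; k < 4; ++k)                       // manual unroll body
      if (!blocks[b + k].candidates.empty())
        out.push_back(align_feature(blocks[b + k].candidates.front()));
  }
  for (; b < blocks.size(); ++b)                          // remainder
    if (!blocks[b].candidates.empty())
      out.push_back(align_feature(blocks[b].candidates.front()));
}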
Lastly, we applied the proposed methods to the IRC software code. We observed slightly worse performance on both platforms compared with the software compiled using optimization level -O3 and the flag -funroll-loops, with the proposed methods applied to FA (see Table 11): a 1.32× slowdown for the SVO pipeline running on the Arm A53 processors, and a 1.23× slowdown for the SVO pipeline running on the Intel Xeon W-2123 system. This is because computing the cache of J and I k−1 (u i ) on the fly, by merging the loop that precomputes the cache of J and the reference patch I k−1 (u i ) (Lines 2 to 15 of Algorithm 5) with the loop that computes δI (Lines 19 to 37 of Algorithm 5), only increases the workload without reducing the data transfer overhead: the cache of J and I k−1 (u i ) used to be computed once per image pyramid level (Line 1 of Algorithm 4), and is now computed once per iteration (Line 2 of Algorithm 4). It is therefore not worth trading data movement for computation on the Arm and Intel systems, as there is no need to copy the input and output data to specific DRAM locations; it is better to simply precompute the cache of J and I k−1 (u i ) once per image pyramid level, as reported in [9]. The contrast between the two loop structures is sketched below. Similarly, compressing the necessary pixels of the large sparse I k and I k−1 into smaller dense vectors I k,d and I k−1,d and transferring only I k,d and I k−1,d does not improve performance on the Arm and Intel systems.
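In the sketch below, all names are placeholders standing in for the corresponding steps of Algorithms 4 and 5, and the function bodies are empty stubs; only the loop structure is of interest.

struct LevelData { /* features, pyramid level, Jacobian cache, ... */ };

void precompute_jacobian_cache(LevelData&) {}        // Lines 2-15 of Alg. 5
void compute_residuals_with_cache(LevelData&) {}     // Lines 19-37 of Alg. 5
void compute_residuals_and_jacobians(LevelData&) {}  // merged loop

// Original CPU-friendly structure: the cache is built once per pyramid
// level and reused by every iteration (Line 1 of Algorithm 4).
void run_level_precomputed(LevelData& lvl, int max_iter) {
  precompute_jacobian_cache(lvl);
  for (int it = 0; it < max_iter; ++it) compute_residuals_with_cache(lvl);
}

// On-the-fly variant used for the FPGA design: the cache is folded into the
// residual loop, so it never has to be transferred, but it is recomputed
// once per iteration (Line 2 of Algorithm 4) -- a win on the accelerator,
// a net loss on the Arm/Intel CPUs where no data movement is saved.
void run_level_on_the_fly(LevelData& lvl, int max_iter) {
  for (int it = 0; it < max_iter; ++it) compute_residuals_and_jacobians(lvl);
}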

5) OTHER FPGA-ACCELERATED VO/SLAM SYSTEMS
Finally, in order to place our work in the wider context of related research, Table 12 presents a number of examples of accelerating VO/SLAM on FPGAs, together with our own. The purpose of this table is not to rank the works, but to provide an overview of the performance, power, estimated energy efficiency, area and accuracy characteristics of previous works targeting VO/SLAM with different front-ends on different FPGA platforms.

VI. CONCLUSIONS
This paper has studied the acceleration of SVO kernels on SoCs with integrated FPGAs using HLS. We have shown that when accelerating these kernels on an FPGA, not only the computational overhead but also the data transfer overhead between the processing cores and the accelerators needs to be minimized to obtain better end-to-end performance. This is due to the sparse computations and/or random memory access patterns of these kernels. The extra data movement overhead is incurred when using FPGA accelerators because the input needs to be copied to a specific DRAM location for access by the accelerators.
State-of-the-art HLS tools, such as Xilinx SDSoC and HeteroFlow, mainly succeed in accelerating kernels that have dense computations and regular memory access patterns, and thus do not, on their own, provide effective acceleration for the SVO kernels. To that end, we have proposed and evaluated three methods to reduce the data transfer overheads: approximation with domain-specific knowledge, lossless image compression, and on-the-fly computation.
We have studied SVO due to its low-power emphasis [10] and because it illustrates a challenging scenario for HLS, given the sparse computations and/or random memory access patterns present in its two main kernels. FPGA acceleration for sparse VO is not widely studied, and this is the first paper addressing the acceleration of a semi-direct front-end as captured by SVO. We have shown that lossless image compression can reduce the data transfer overhead and better utilize the memory bandwidth for a kernel that has sparse computations and a predictable memory access pattern. For SVO kernels that have sparse computations and random memory access patterns, we have found that approximation can transform the random memory access patterns into predictable ones, which then enables lossless image compression to be used. On-the-fly computation can further reduce the data transfer overhead by trading data movement for computation. We have also observed that the proposed methods can enable further data transfer optimizations provided by HLS tools, such as the use of DMA and HPC ports.
We have demonstrated that the proposed methods can be used not only with standard Xilinx tools such as SDSoC, but also with other state-of-the-art HLS tools, such as HeteroFlow. With the help of the proposed methods, the designs (accelerators) generated by SDSoC and HeteroFlow have achieved performance improvements of 19.3× and 17.6×, respectively, when compared with the FPGA accelerators without data transfer optimization. When compared with the software baseline running on the Arm A53 processors, the accelerators generated by SDSoC and HeteroFlow and optimized with the proposed methods have achieved performance improvements of 2.4× and 2.14×, respectively, without noticeable accuracy loss. These accelerators have also achieved improvements in energy efficiency of 1.85× and 1.89×, respectively. When compared with the software baseline running on an Intel Xeon W-2123 system, these accelerators have achieved improvements in energy efficiency of 8.2× and 8.3×, respectively, without noticeable accuracy loss.

DATA AND CODE AVAILABILITY
The data and code that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request.