Real-Time Lens Distortion Algorithm on an Edge Device With GPU

The lens distortion process is essential for displaying VR contents on a head-mounted display (HMD) with a distorted display surface. This paper proposes a novel lens distortion algorithm to achieve real-time performance on edge devices with an embedded GPU. We employ unified memory space to reduce the data transfer overhead based on an architectural characteristic: an integrated CPU and GPU memory system. The lens distortion kernel is based on the lookup table-based mapping algorithm, whose performance is bounded by memory operations rather than computations. To improve the kernel's performance, we propose a compressed lookup table approach that reduces the memory transactions of the kernel while slightly increasing computation. We tested our method on three different edge devices and a desktop system while varying the image resolution from 720p (1,280 × 720) to 8K (7,680 × 4,320). Compared with prior GPU-based lookup table algorithms, our method achieved up to 1.72-times higher performance while consuming up to 28.93% less power. Also, our method demonstrates real-time performance for up to a 4K image with a low-end edge device (e.g., 56 FPS on Jetson Nano) and up to an 8K image with a mid-range device (e.g., 94 FPS on Jetson NX). These results demonstrate the benefits of our approach from the perspectives of both performance and energy.


I. INTRODUCTION
Virtual reality (VR) is a technique that provides a simulated experience to users and is widely used in various applications, including training, education, prototyping, and entertainment [1]. One of the simplest ways to make VR content is converting an existing 2D image (or video) to 3D. Stereoscopy is a technique for creating the illusion of depth in a flat (2D) image using binocular disparity [2]. A stereoscopic image consists of two images for the left and right eyes, and displaying them to each eye is the easiest way to enhance depth perception in the viewer's brain. A head-mounted display (HMD) is the most widely employed device to achieve this effect.
Since the screens in HMDs are placed only a few inches from the user's eyes, an optical lens is placed between them. The optical lens system locates the images at a comfortable distance for the user [3], [4]. It also magnifies the image to provide a reasonable field-of-view (FOV) [5]. Although the optical lens system has such benefits, it also causes non-linear distortion (e.g., radial distortion) of the image. To correct such distortion, a pre-distortion process on the image is required, based on an optical model for HMDs [6], [7]. Since stereoscopic rendering with the pre-distortion process requires a large amount of computation, most current VR systems adopt a separation strategy: most computations are performed on a PC (or laptop), and the HMD merely receives and displays the resulting images. This is one of the obstacles to the popularization of VR and HMDs [5].
Edge devices have recently improved their processing power significantly while embedding multi-core CPUs and a GPU (Graphics Processing Unit). Current commodity edge devices have up to six CPU cores and a powerful GPU (e.g., 384 CUDA cores) within a credit card size [8]. With the edge devices' computing power expected to grow continually, designing an algorithm that can efficiently use the heterogeneous architecture of edge devices will be critical to improving the convenience of using HMDs. It will also be a key to accelerating the popularization of VR systems.
In this work, we propose a novel lens distortion algorithm for edge devices with an embedded GPU. To reduce the data transfer overhead (i.e., for the input and output images), we designed our system to employ unified memory space while taking advantage of the integrated CPU and GPU memory architecture of edge devices (Sec. IV-A). Our distortion kernel is based on a lookup table algorithm that significantly reduces the computation of a naive lens distortion method by using pre-computed (i.e., lookup table) values (Sec. IV-B). Since the main idea of the lookup table approach is using a pre-computed value instead of computing the value on the fly, the distortion kernel's performance is bounded by memory operations while wasting the computational capability of the GPU. We propose a compressed lookup table method that decreases memory transactions and improves the efficiency of the distortion kernel (Sec. IV-C).
To demonstrate the benefits of our method, we tested its performance on three different edge devices having an embedded GPU and a desktop system having an external GPU, for various image resolutions. We also implemented four alternative methods, including prior lookup table algorithms on the CPU and GPU, and compared their performance with our method (Sec. V). Overall, our method shows the best performance on all the devices, independent of the image resolution. It achieved up to 1.72 times (1.35 times on average) higher performance than a prior lookup table algorithm on a GPU. Especially on edge devices, our method shows a meaningful performance improvement over prior GPU lookup table methods: about 43.24% on average. As a result, with our method, we can achieve real-time lens distortion performance on up to a 4K (3840 × 2160) image with a low-end edge device (e.g., 56 FPS with Jetson Nano) and up to an 8K (7680 × 4320) image with mid-range edge devices (e.g., 94 FPS with Jetson NX). We also compared the power consumption of our method with the prior GPU algorithm and found that ours consumes up to 28.93% less energy (24.86% on average). These results demonstrate the efficiency and usefulness of our approach in both computational performance and energy cost.

II. RELATED WORK
Robinett and Rolland [7] pointed out the optical distortion problem and described an optical model for HMDs. Based on their model, they proposed a method of correcting the optical distortion in the HMD by applying an inverse function of the lens distortion to the image on the screen. The lens then restores the predistorted image, and the users see the correct image.
To accelerate the predistortion process, various parallel computing hardware has been employed. Some approaches utilize multi-core CPUs [9], but they do not meet real-time performance. Others employ specialized hardware (i.e., FPGAs) and achieve real-time performance for VGA output resolution (640 × 480) [10], [11]. However, the most widely employed approach is using a GPU. Watson and Hodges [12] suggested using graphics hardware for predistorting images. They found that predistortion can be represented as a simple texture mapping onto a 3D polygon reflecting the distorted shape if the undistorted image is used as a texture. As a result, they achieved up to 10 frames per second (FPS) at a 640 × 480 resolution with high-end graphics hardware of that time: the Silicon Graphics Onyx Reality Engine II. This texture mapping approach has been widely employed for correcting lens distortion in various fields using wide-angle lenses, such as HMDs, the medical domain, and surveillance [13], [14]. Traditionally, texture mapping is realized with shader languages like GLSL (OpenGL Shading Language), and there are two methods: pixel-based and mesh-based [15]. The pixel-based implementation generates a high-quality result since it computes the distortion coordinates of every pixel in the image. On the other hand, the mesh-based method transforms vertices of a plane mesh according to the distortion equation and fills the other regions by interpolating the values of the vertices. Therefore, it can accelerate the distortion process by using a low-resolution mesh; however, its quality also decreases. Shuhua et al. [14] employed the pixel-based method and achieved up to 190 times higher performance than the CPU algorithm. We compared the performance of both pixel- and mesh-based texture mapping algorithms with our method in the supplementary report.
Since the lens and the distortion model in a system (e.g., an HMD) are fixed, the distortion parameters for each pixel can be reused. Therefore, the texture mapping concept can be extended to build a lookup table (or map) that improves distortion performance [16], [17]. Shehrzad Qureshi [18] introduced a lookup table-based lens correction implementation with OpenCL. The introduced method pre-computes two tables, one each for the x- and y-coordinates, and generates distortion results by looking up the tables. Although the two-table approach is simple to implement, it requires two memory transactions for processing each pixel. We found that frequent memory access causes performance degradation of GPU algorithms, especially for embedded GPUs on edge devices. To solve this issue, we propose a compressed lookup table approach (Sec. IV).
Lee et al. [13] implemented a distortion correction algorithm on a general-purpose GPU (GPGPU) platform with CUDA. They exploited the GPU's hardware-accelerated interpolation ability and achieved near-real-time performance for HD-resolution images (e.g., 48 FPS for 1920 × 1080 on an Nvidia GT555M). However, this method requires additional RGB splitting and merging steps to utilize hardware-accelerated interpolation. Van der Jeught et al. [19] also implemented a distortion correction algorithm using a GPU, and their method shows near-real-time performance (e.g., 30 FPS) for 1024 × 768 images. Our method also exploits the parallel computing power of a GPU. However, we propose a real-time lens distortion algorithm running on an edge device with an embedded GPU, not on an external GPU equipped in a desktop or laptop.

III. PRELIMINARIES
This section provides the preliminaries for understanding the proposed approach, including the lookup table-based distortion method and the characteristics of edge devices with an embedded GPU.
A. LOOKUP TABLE-BASED LENS DISTORTION
Eq. 1 is the radial distortion model for pre-distorting the input image, where (x_d, y_d) and (x_u, y_u) are the pixel coordinates in the distorted and undistorted images, (x_c, y_c) is the center coordinate of the image, and r is the distance between (x_d, y_d) and (x_c, y_c). k_1 and k_2 are the distortion coefficients, which depend on the target lens. This equation is used to find the matched pixel (x_u, y_u) on the input image for each pixel (x_d, y_d) on the distorted image. Since pixels in an image have discrete coordinates, we can pre-compute the (x_d, y_d) → (x_u, y_u) pairs for every pixel of the distorted image. A lookup table is the set of all these pairs. Once we have the lookup table for the target lens, we can simplify the distortion process to reading the lookup table and copying the pixel value from the input image to the target position in the distorted image. The work therefore becomes I/O-bounded, which means memory operations bound its performance. Our distortion algorithm is designed based on this lookup table approach; however, we optimize the lookup table's memory access pattern to improve the distortion performance on the embedded GPU.
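As an illustration of how such a table can be built, the sketch below pre-computes a backward-mapping table on the host, assuming the common two-coefficient radial model x_u = x_c + (x_d - x_c)(1 + k_1 r^2 + k_2 r^4) (and likewise for y_u); the exact form of Eq. 1 in our system may differ, so this should be read as a sketch rather than our exact preprocessing code.

```cpp
// Minimal sketch: build a backward lookup table (distorted -> undistorted)
// on the host, assuming the common two-coefficient radial model.
#include <cmath>
#include <vector>

struct LutEntry { int xu, yu; };   // source (undistorted) pixel for each distorted pixel

std::vector<LutEntry> buildLut(int width, int height, float k1, float k2) {
    std::vector<LutEntry> lut(static_cast<size_t>(width) * height);
    const float xc = width * 0.5f, yc = height * 0.5f;   // image center
    for (int yd = 0; yd < height; ++yd) {
        for (int xd = 0; xd < width; ++xd) {
            const float dx = xd - xc, dy = yd - yc;
            const float r2 = dx * dx + dy * dy;                  // r^2
            const float s  = 1.0f + k1 * r2 + k2 * r2 * r2;      // radial scale factor
            const int xu = static_cast<int>(std::lround(xc + dx * s));
            const int yu = static_cast<int>(std::lround(yc + dy * s));
            // Out-of-range (xu, yu) must be treated as a blank pixel at use time.
            lut[static_cast<size_t>(yd) * width + xd] = { xu, yu };
        }
    }
    return lut;
}
```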

B. UNIFIED MEMORY ACCESS IN EDGE DEVICE
In this work, we target edge systems with an embedded GPU. Compared with a general computing system (e.g., a PC or workstation) with an external GPU connected over a system bus (e.g., PCIe), an edge system usually integrates the CPU and GPU into a single chip while sharing the DRAM [20]. Although each processing unit (i.e., CPU and GPU) in an edge device has its own dedicated memory space on the shared DRAM, they can communicate more efficiently than with physically separated memories, such as in an external GPU system. This is realized by specialized memory mapping methods like Nvidia's unified memory [21] and AMD's heterogeneous Uniform Memory Access (hUMA) [22]. The common ground of those methods is mapping the memory regions of multiple processing units into a unified memory space. Then, every processing unit can access the unified memory space without any explicit memory copy operation. The unified memory space is cached on each processing unit's memory region; therefore, we can avoid the expensive overhead of copying data between the host memory (i.e., system memory) and the device memory (i.e., the GPU's DRAM). In our method, we employ unified memory to share input and output images between the CPU and GPU.

IV. REAL-TIME LENS DISTORTION ON AN EDGE DEVICE
A. SYSTEM OVERVIEW
Fig. 1 shows the overview of our lens distortion correction system. The system includes three components: CPU, GPU, and unified memory space. In the preprocessing step, the GPU gets the lens distortion parameters, including the image resolution and distortion coefficients (Eq. 1), from the CPU. Then, the GPU computes the lookup table and stores the result in its device memory. We represent the lookup table in a compressed form to optimize the memory access pattern while considering the characteristics of the distortion algorithm (Sec. IV-C). At runtime, the CPU acts as a manager that obtains input images from an external device (e.g., a camera or disk) and orders the distortion computation for the images to the GPU. The CPU puts an input image into the unified memory space and calls the GPU kernel (Sec. IV-B). Then, the GPU computes the distorted image of the input image based on the lookup table and returns the resulting image to the unified memory space. Finally, the CPU takes the distorted image and outputs it to the target display device (e.g., an HMD or a monitor). We repeat this runtime process until all the input images are processed.
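As a rough sketch of this runtime flow, the code below allocates the image buffers with cudaMallocManaged so the CPU and GPU share them without explicit copies; the kernel and the frame I/O helpers are placeholders, not our actual implementation.

```cuda
// Sketch of the runtime loop with unified memory on an integrated CPU-GPU device.
#include <cuda_runtime.h>
#include <cstdint>

// Placeholders: the distortion kernel (Sec. IV-B) and the frame I/O helpers.
__global__ void distortKernel(const uint8_t* in, uint8_t* out,
                              const uint32_t* lut, int width, int height);
bool readNextFrame(uint8_t* dst, size_t bytes);       // e.g., from a camera or disk
void presentFrame(const uint8_t* src, size_t bytes);  // e.g., to an HMD or monitor

void runDistortionLoop(const uint32_t* d_lut, int width, int height, int channels) {
    uint8_t *inImg = nullptr, *outImg = nullptr;
    const size_t bytes = static_cast<size_t>(width) * height * channels;
    cudaMallocManaged(reinterpret_cast<void**>(&inImg), bytes);   // visible to CPU and GPU
    cudaMallocManaged(reinterpret_cast<void**>(&outImg), bytes);

    const dim3 block(16, 16);
    const dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    while (readNextFrame(inImg, bytes)) {               // CPU fills the input buffer
        distortKernel<<<grid, block>>>(inImg, outImg, d_lut, width, height);
        cudaDeviceSynchronize();                        // wait before the CPU reads outImg
        presentFrame(outImg, bytes);                    // CPU sends the result to the display
    }
    cudaFree(inImg);
    cudaFree(outImg);
}
```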

B. DISTORTION KERNEL ON GPU
The work of the GPU kernel using the lookup table is simply reading the lookup table and copying a pixel from the input image to the output image. Since all pixels are independent of each other, the GPU kernel's simple but efficient thread layout is allocating a thread per pixel. The lookup table can be used for forward and backward mapping, which determines the thread layout of the GPU kernel. Forward mapping gets the target pixel index on a distorted image from the input (undistorted) pixel index (i.e., (x u , y u ) → (x d , y d )). In this case, each GPU thread handles a pixel on the input image. In the reverse direction, backward mapping obtains the pixel index on an input image from the pixel index of an output image (i.e., (x d , y d ) → (x u , y u )). For this backward direction, the GPU kernel allocates a thread per pixel on the output image.
We found that backward mapping achieves about 13.91% higher performance on average than forward mapping. The forward mapping kernel requires memory copies equivalent to the number of pixels on the input image. On the other hand, backward mapping requires fewer memory copies (e.g., 22.21% fewer at 1920 × 1080 resolution) than forward mapping because some regions of the distorted image are blank. Also, forward mapping leads to more serialized memory access than backward mapping because the pixel index is discrete, and multiple pixels on the input image can be mapped to the same pixel on the output image. Therefore, we use the backward mapping approach in the GPU distortion kernel.
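To make the thread layout concrete, a minimal backward-mapping kernel (one thread per output pixel, a single-channel image for brevity, and the uncompressed two-array lookup table layout discussed in Sec. IV-C) might look like the sketch below; this is illustrative rather than our exact kernel.

```cuda
// Backward mapping with an uncompressed (two-array) lookup table:
// one thread per pixel of the distorted (output) image.
__global__ void distortBackwardLut(const unsigned char* in, unsigned char* out,
                                   const int* lutX, const int* lutY,
                                   int width, int height)
{
    const int xd = blockIdx.x * blockDim.x + threadIdx.x;
    const int yd = blockIdx.y * blockDim.y + threadIdx.y;
    if (xd >= width || yd >= height) return;

    const int idx = yd * width + xd;
    const int xu = lutX[idx];      // first lookup-table read
    const int yu = lutY[idx];      // second lookup-table read
    out[idx] = (xu >= 0 && xu < width && yu >= 0 && yu < height)
                   ? in[yu * width + xu]
                   : 0;            // blank region outside the source image
}
```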

C. OPTIMIZING MEMORY ACCESS WITH A COMPRESSED LOOKUP TABLE
A straightforward form for the lookup table is a two-dimensional array of a structure consisting of two integer values for the x- and y-coordinates. However, such an array of structures (AoS) leads to uncoalesced memory access, which means threads in a warp read (or write to) a non-contiguous memory region [21]. Uncoalesced memory access requires more memory transactions than coalesced memory access. To improve lookup table access efficiency, Shehrzad Qureshi [18] employed an SoA (Structure of Arrays)-style lookup table that consists of two separate arrays for the x- and y-coordinates.
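In code, the difference between the two layouts amounts to how the table is declared; the following sketch (with an example resolution for concreteness) contrasts them.

```cpp
// Illustrative lookup-table layouts.
constexpr int WIDTH = 1920, HEIGHT = 1080;   // example resolution

// Array of structures (AoS): threads in a warp read interleaved x/y pairs,
// which leads to uncoalesced memory accesses.
struct Coord { int x, y; };
Coord lutAoS[WIDTH * HEIGHT];

// Structure of arrays (SoA): two separate arrays, so neighboring threads read
// contiguous elements of each array (coalesced accesses), at the cost of two reads.
int lutX[WIDTH * HEIGHT];
int lutY[WIDTH * HEIGHT];
```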
To improve memory access efficiency even further than Shehrzad Qureshi [18]'s approach, we propose a compressed lookup table that halves the memory transactions for reading the lookup table. Instead of using two separate arrays, we encode the x- and y-coordinates into a single integer value. Eq. 2 is our encoding equation, where (x, y) are the input coordinates, c_{x,y} is the compressed value, and (width, height) is the image resolution.
The key to our compression method is the image resolution. Since the coordinate for the shorter axis cannot exceed the resolution of the longer axis, we can decode (or decompress) c_{x,y} into (x, y) coordinates with one division and one modulo operation, as shown in Eq. 3.
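A plausible sketch of the encoding (Eq. 2) and decoding (Eq. 3) steps is shown below; it assumes a simple row-major packing into one 32-bit integer, which is sufficient even for an 8K image (7,680 × 4,320 pixels fits well within 32 bits), although the exact form of our equations may differ.

```cuda
// Assumed row-major packing for the compressed lookup table.
__host__ __device__ inline unsigned int encodeCoord(int x, int y, int width)
{
    return static_cast<unsigned int>(y) * width + x;      // Eq. 2 (assumed form)
}

__host__ __device__ inline void decodeCoord(unsigned int c, int width, int* x, int* y)
{
    *x = static_cast<int>(c % width);                     // one modulo
    *y = static_cast<int>(c / width);                     // one division (Eq. 3, assumed form)
}
```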
Our compressed lookup table can obtain the target coordinate with one read operation, and threads in a warp access a contiguous memory region. As a result, we can halve the memory transactions while accessing the memory efficiently through coalesced memory access. Even though it requires additional computation to decode the value (i.e., c_{x,y}) instead of a memory transaction, it is beneficial to the performance of the distortion process. GPU cores can handle other threads during the memory transaction latency of a set of threads (e.g., a warp) without penalty based on the GPU's zero context-switching overhead property [23]. Therefore, replacing some memory operations of the I/O-bounded kernel with computational tasks improves the utilization efficiency of resources in the GPU. With the two separate (uncompressed) lookup tables for the x- and y-coordinates, the main job of the distortion kernel is memory access. It is an I/O-bounded kernel, and the computing units (i.e., GPU cores) are idle most of the time; therefore, the compressed lookup table improves the balance between memory transactions and computation tasks, improving the distortion kernel's performance. By profiling the reasons for stalling threads during kernel processing with Nvidia Nsight Compute [24], we found that the ratio of stalls caused by memory latency is reduced by up to 31.12% (29.57% on average) with the compressed lookup table. It also improves the kernel performance by up to 1.72 times (1.35 times on average) compared with using an uncompressed lookup table.
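Combining the decoding with the backward-mapping kernel, the compressed-table kernel needs only a single lookup-table read per pixel, trading it for one modulo and one division; the sketch below (with an assumed sentinel value for blank pixels, which is not part of our description above) illustrates the idea.

```cuda
// Backward mapping with the compressed lookup table: a single table read per
// pixel, decoded with one modulo and one division (assumed row-major packing).
__global__ void distortBackwardCompressed(const unsigned char* in, unsigned char* out,
                                          const unsigned int* lut,
                                          int width, int height)
{
    const int xd = blockIdx.x * blockDim.x + threadIdx.x;
    const int yd = blockIdx.y * blockDim.y + threadIdx.y;
    if (xd >= width || yd >= height) return;

    const int idx = yd * width + xd;
    const unsigned int c = lut[idx];                   // single memory read instead of two
    if (c == 0xFFFFFFFFu) { out[idx] = 0; return; }    // assumed sentinel for blank pixels
    const int xu = static_cast<int>(c % width);        // decode x
    const int yu = static_cast<int>(c / width);        // decode y
    out[idx] = in[yu * width + xu];
}
```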

V. RESULTS AND ANALYSIS
We implemented our lens distortion algorithm on four systems, including three edge devices having different computing powers and a desktop system having an external GPU (Table 1). We used CUDA 11.0 for the desktop and 10.2 for edge devices. We implemented two versions of our method to discern the effects of the mapping direction (i.e., forward and backward).
• Ours B is an implementation of our algorithm with backward mapping that uses the compressed lookup table (Sec. IV-C). The distortion kernel allocates a thread per pixel on the output image.
• Ours F is the forward mapping version of our algorithm. In this algorithm, each GPU thread handles a pixel on the input image.
To analyze the benefits of our approach, we also implemented four alternative algorithms based on prior approaches. There are two categories of alternative methods. The first group includes two CPU-based algorithms.
• CPU naive is an implementation of the lens distortion process without pre-computation (i.e., a lookup table). This method computes the target pixel according to the distortion equation (Eq. 1) for each pixel. We used the same number of threads as the number of CPU cores in the system (e.g., four threads for Jetson Nano and six threads for Jetson NX).
• CPU LUT is a CPU implementation that employs the lookup table approach. This method uses a pre-computed lookup table for backward mapping. At runtime, it finds the target pixel by using the lookup table instead of calculating it on the fly. This algorithm also uses as many CPU threads as there are cores in the system.
The second group of alternative algorithms consists of prior GPGPU algorithms.
• GPU naive is a CUDA-based version of CPU naive . This method launches as many threads as the output image resolution. We tested various thread layouts and found that the (16 × 16) thread block generally performs better than other layouts. Therefore, we used this thread block size in our experiments.
• GPU LUT is an implementation of the lookup table distortion algorithm. Following Shehrzad Qureshi [18], we built two separate tables of x-and y-coordinates for backward mapping in the device memory. The thread layout and the block size are the same as GPU naive .
We employed the unified memory approach for implementing all the GPU algorithms, including GPU naive , GPU LUT , Ours F , and Ours B . However, for the external GPU system (i.e., RTX 2060s in Table 1), we used explicit memory copies between the host and device memories instead of unified memory. We found that the explicit version shows up to 79.01 times (70.47 times on average) better performance for the RTX 2060s system.
We used the Lenna (Fig. 2) image and varied the image resolution from 1280 × 720 (i.e., 720p) to 7680 × 4320 (i.e., 8K) and checked the performance of the six algorithms. We performed the distortion process 100 times for each resolution and reported the average processing time. It should be noted that the lens distortion performance is affected by the image resolution, not by the contents (e.g., colors) on the image.
A. RESULTS
Table 2 shows the processing time of the six different algorithms for lens distortion computation on different devices and various resolutions. Since the GPU algorithms on edge devices use unified memory space, there is no explicit memory copy between the host and device memories. Therefore, the kernel processing time includes the memory transaction time. Unlike the edge devices, the external GPU system (i.e., the RTX 2060s system) requires explicit data copies between the host and device memories. We measured all memory operation times and computation times. The parentheses for RTX 2060s in Table 2 show the time only for the kernel.
The lookup table algorithms on both CPU and GPU show better performance than the naive approach. CPU LUT and GPU LUT achieved up to 8.82 and 67.16 times higher performance than CPU naive and GPU naive , respectively. Also, GPU LUT shows up to 10.51 times (4.60 times on average) higher performance than CPU LUT . These results validate the benefit of the lookup table algorithm and show that the lookup table approach is well suited to the GPU architecture.
Both our mapping algorithms with a compressed lookup table achieved the best performance in all cases. Fig. 3 compares the performance of the three lookup table-based GPU algorithms; we analyze the benefit of the compressed lookup table in detail in Sec. V-B. An interesting result is that our methods achieved better performance on edge devices (except the low-end one, Jetson Nano) than on the external GPU (i.e., the RTX 2060s system) with a higher computational capability. We found that the kernel processing time is much faster on the external GPU than on edge devices. However, the data communication between host and device memories on the external GPU system takes more time than the kernel processing time on edge devices, which use unified memory (Sec. V-C). Overall, with our method, we could achieve real-time lens distortion performance on up to a 4K image with a low-end edge device (e.g., 56 FPS on Nano) and up to an 8K image on mid-range devices (e.g., 94 FPS on NX).

TABLE 3. The top seven reasons and related operations for stalling a warp. The last four columns show the number of stalled cycles for each reason for GPU LUT and Ours B . We used an external GPU system (i.e., RTX 2060s) for this experiment. The reported data is the average value measured over every resolution. A detailed explanation of each stall reason is available in Chapter 4 of the kernel profiling guide [25].

B. BENEFIT OF COMPRESSED LOOKUP TABLE
To ascertain how our compressed lookup table approach (Ours B ) improves the performance over GPU LUT , we profiled the kernels of the two methods using Nvidia Nsight Compute [24]. We investigated the reasons for stalling a warp in each kernel, and Table 3 summarizes the results. The math pipe throttle occurs when all the math pipes are busy, which means this stall is caused by the computational task. On the other hand, the other six reasons are related to memory operations. As shown in Table 3, memory operations dominate the performance of both kernels. This result is consistent with the characteristics of the mapping kernel, whose main work is reading the lookup table and copying pixels from an input image to an output image.
We found that our compressed lookup table approach (Ours B ) reduces the total stalls by about 30% compared with GPU LUT . This is because our method needs just one read operation on the compressed lookup table, unlike GPU LUT , which requires two memory reads for the x- and y-coordinate tables. Although the stall count for the math pipe throttle is slightly increased (about 1.2 times) due to the decompression process, it still accounts for less than 1% of total stalls. Therefore, the benefit of using a compressed lookup table (i.e., reducing stalls caused by memory operations) far outweighs the increase in computational overhead. These results demonstrate the efficiency of our compressed lookup table approach.

C. EFFECT OF UNIFIED MEMORY
To check the effect of using unified memory on edge devices, we implemented an alternative version of our method (OursExp B ) that explicitly transfers the input and output images between the host and device memories with memory copy APIs (e.g., cudaMemcpyAsync()). We optimized the data transfer with pinned memory (or page-locked memory) [21] for OursExp B . Table 4 shows the processing time of the two versions of our method. The kernel of OursExp B takes slightly less time than the kernel of Ours B because all the required data is in the dedicated device memory for OursExp B , while Ours B requires access to the unified memory space. However, OursExp B requires separate data copy time between the host and device memories (e.g., Copy(H→D) and Copy(D→H) in Table 4), while the kernel processing time of Ours B already includes all the memory operations.

Unlike on edge devices having a shared memory architecture, for a desktop environment with an external GPU (i.e., the RTX 2060s system), using unified memory (Ours B ) leads to greatly reduced performance compared with performing explicit memory copies (OursExp B ). Since the host memory and device memory are physically separated in an external GPU system, every access to the unified memory space goes over the PCIe bus. The bandwidth of the PCIe bus (e.g., 16 GB/s for PCIe 3.0 x16) is much lower than the bandwidth of the device memory (e.g., 448 GB/s), and it is hard to exploit the peak bandwidth with frequent small-size data transfers.
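For reference, the explicit-copy path can be sketched as follows; the buffer names and the kernel are placeholders, and the structure (pinned host buffers plus cudaMemcpyAsync on a stream around the kernel launch) only illustrates how an OursExp B -style variant is organized, not our exact code.

```cuda
// Sketch of the explicit-copy path (OursExp_B-style) for an external GPU:
// pinned host buffers and asynchronous copies around the kernel launch.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void distortKernel(const uint8_t* in, uint8_t* out,
                              const uint32_t* lut, int width, int height); // placeholder

void distortFrameExplicit(const uint32_t* d_lut, int width, int height, int channels)
{
    const size_t bytes = static_cast<size_t>(width) * height * channels;
    uint8_t *h_in, *h_out, *d_in, *d_out;
    cudaHostAlloc(reinterpret_cast<void**>(&h_in),  bytes, cudaHostAllocDefault); // pinned
    cudaHostAlloc(reinterpret_cast<void**>(&h_out), bytes, cudaHostAllocDefault);
    cudaMalloc(reinterpret_cast<void**>(&d_in),  bytes);
    cudaMalloc(reinterpret_cast<void**>(&d_out), bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy(H->D), kernel, Copy(D->H), all queued on one stream.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    const dim3 block(16, 16);
    const dim3 grid((width + 15) / 16, (height + 15) / 16);
    distortKernel<<<grid, block, 0, stream>>>(d_in, d_out, d_lut, width, height);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_in);  cudaFree(d_out);
    cudaFreeHost(h_in);  cudaFreeHost(h_out);
}
```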
We found that the RTX 2060s shows much higher performance (e.g., up to 54.22 times) than edge devices only for kernel processing time. However, the data communication time is much larger than the kernel processing time and the processing time of Ours B on edge devices except for the low-end Jetson Nano device. Therefore, employing unified memory is not suitable for external GPU systems.
As a result, with the Ours B algorithm, we achieved better lens distortion performance on edge devices (e.g., Jetson NX and AGX Xavier) than with a powerful external GPU. These results demonstrate the efficiency and suitability of our method for edge devices.

D. POWER CONSUMPTION
For a wireless system like a wireless HMD, power consumption is one of the critical factors since it determines the maximum usage time. To check the advantage of our approach in energy efficiency, we measured the power consumption of the three GPU algorithms on each edge device. Jetson boards include INA3221 power monitor module(s), and we can read the monitoring information via Linux sysfs [26]. We implemented power-consumption measurement software based on the NVIDIA Jetson board support package (BSP) and read the information from the INA3221 every 15 milliseconds. We ran the power-measuring thread concurrently with the thread handling the distortion process and synchronized them at every execution. We ran each algorithm 5,000 times to have sufficient time for measuring power consumption accurately, and Table 5 reports the average power consumption for handling an image (or frame). Fig. 4 compares the power consumption of the three GPU algorithms on the edge devices. Ours F consumes up to 16.32% (10.37% on average) less power than GPU LUT . Our compressed lookup table approach halves the read operations for accessing the lookup table compared with GPU LUT (Sec. IV-C), and it reduces not only the processing time but also the power consumption. Also, Ours B achieved up to 1.28 times better power efficiency than Ours F . As described in Sec. IV-B, the backward mapping (Ours B ) approach requires fewer memory copies (e.g., 22.21% fewer) than the forward mapping (Ours F ). Therefore, Ours B achieved much higher power efficiency than Ours F . Overall, Ours B reduces power usage by up to 28.93% (24.86% on average) compared with GPU LUT . These results demonstrate the advantage of our method over the prior method in power efficiency.
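A rough sketch of the sampling loop is given below; the sysfs node path for the INA3221 monitor differs across Jetson models and BSP versions, so the path in the code is purely illustrative and must be adapted to the target board.

```cpp
// Illustrative power-sampling loop for a Jetson board via sysfs.
// NOTE: the INA3221 sysfs node path is an assumption; it differs by board and BSP version.
#include <atomic>
#include <chrono>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

std::atomic<bool> g_running{true};

void samplePower(std::vector<long>& samplesMilliwatt)
{
    // Hypothetical rail node; check the board's actual sysfs layout.
    const std::string node =
        "/sys/bus/i2c/drivers/ina3221x/6-0040/iio:device0/in_power0_input";
    while (g_running.load()) {
        std::ifstream f(node);
        long mw = 0;
        if (f >> mw) samplesMilliwatt.push_back(mw);                 // value in milliwatts
        std::this_thread::sleep_for(std::chrono::milliseconds(15));  // 15 ms sampling interval
    }
}
// The sampling thread runs concurrently with the distortion thread and is
// stopped by setting g_running to false once the run finishes.
```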

VI. CONCLUSION AND FUTURE WORK
In this paper, we presented an efficient lens distortion algorithm for edge devices having an embedded GPU. We employed unified memory space to take advantage of the edge device's integrated memory architecture and optimized the data transfer between CPU and GPU. We designed our distortion kernel based on backward mapping with a pre-computed lookup table. We then discovered that the performance of the lookup table method is bounded by memory operations while wasting the computational capability of the GPU. To improve the efficiency of the distortion kernel, we proposed the compressed lookup table approach. This approach balances the workload between memory operations and arithmetic computation, and it improves the kernel's performance. As a result, our method shows up to 1.72-times higher performance compared with a prior GPU-based lookup table algorithm (i.e., GPU LUT ). Also, our method demonstrates real-time performance for high-resolution images using low-end and mid-range edge devices. Furthermore, we found that our approach consumes less power (e.g., up to 28.93%) than GPU LUT . These results demonstrate the benefits of our method from the perspectives of both processing performance and energy efficiency, and they validate its suitability for edge devices.
Although our method achieved real-time performance for up to 4K resolution with low-end edge devices, future HMDs will require a much higher resolution for a more immersive experience in the virtual world. As future work, we would like to make our algorithm meet real-time performance for ultra-high-resolution content (e.g., 8K) even with low-end edge devices. To achieve this, we plan to design a heterogeneous parallel lens distortion algorithm that fully utilizes both the multi-core CPU and the GPU.