Load Balancing Algorithm for Real-Time Ray Tracing of Dynamic Scenes

In this article, we propose a load balancing algorithm that accelerates ray tracing in an effective and simple manner. The algorithm was developed in a hybrid system consisting of a CPU and hardware dedicated to ray tracing. Tree-building is processed on the CPU, and rendering is executed by the ray tracing-dedicated hardware. Because these components operate independently of each other, the final performance in terms of frames per second (FPS) is determined by the time spent on tree-building or rendering, whichever takes longer. This characteristic of a hybrid system is reflected in the developed algorithm, which dynamically adjusts tree-build parameters at every frame, thereby minimizing the interval between tree-building and rendering times. This ultimately increases FPS performance. Experiments at a resolution of $1920\times1080$ involving various dynamic scenes indicated that FPS performance improved by an average of 75.3% when the proposed algorithm was used.


The associate editor coordinating the review of this manuscript and approving it for publication was Seok-Bum Ko.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
NVIDIA recently launched Turing GPUs, which support ray tracing, a rendering technique now also enabled in many games released for PCs and consoles [1]. These developments were followed by the release of other technologies that support ray tracing, such as AMD's RDNA2 GPUs [2], Microsoft's DirectX Raytracing API [3], and the Khronos Group's open ray-tracing API, Vulkan Ray Tracing, which is supported on AMD, NVIDIA, and Intel platforms [4]. These product roll-outs point to the extensive use of ray-tracing technology.
A diagram of a typical ray-tracing pipeline is shown in Figure 1. In the tree-building stage, a tree is constructed from the geometric data of the frame to be displayed. In the rendering stage, ray tracing is performed using the constructed tree. This chain of processes is repeated to render dynamic scenes.
Many studies have been conducted to improve ray tracing performance. Such an improvement can be achieved by accelerating individual stages in a system in which the stages are performed sequentially (e.g., Figure 1). The situation differs in a system wherein the stages run independently. For example, if rendering finishes quickly because it has been accelerated but the tree for the next frame is still being built, rendering is placed on standby (i.e., it incurs waiting time) until tree construction is finished. In such a system, an important requirement is to reduce waiting time by balancing the performance of the stages.
Waiting can also occur if the tree-building and rendering stages are run asynchronously. Although asynchronous operation does not in itself require waiting, slow tree-building for a given scene can diminish tree quality, and a change in scene topology requires a rebuild, which may result in waiting time [5]. The cost of tree-building and the resulting waiting time can be reduced through the construction of a shallow tree using the axis-aligned bounding boxes (AABBs) of each primitive [6]. However, this method uses fixed tree-building parameters, which can still cause waiting time depending on the scene being rendered.
To address this issue, we developed a load balancing algorithm that counters the performance degradation stemming from waiting time. The algorithm reduces waiting time by controlling the speed at which tree-building and rendering are performed. Such control is achieved through the dynamic adjustment of tree quality at each frame.
To implement the proposed algorithm, we built a hybrid system comprising a CPU and an FPGA, in which the tree-construction and rendering stages are processed independently by the CPU and ray tracing-dedicated hardware, respectively. Experiments featuring four dynamic scenes were carried out to evaluate the algorithm. The results showed that performance in terms of frames per second (FPS) improved by 48% to 100%. Because the algorithm is intuitive and simple, it can be easily applied to improve performance on various platforms.
The rest of the paper is organized as follows. Section II presents ray-tracing acceleration studies and tree-construction methods. Section III describes the proposed algorithm and system. Section IV discusses the experimental environment and the results of the algorithm evaluation. Section V concludes the paper.

A. RAY TRACING ACCELERATION
In some ray tracing studies, performance was improved by accelerating the rendering stage. In [7], the researchers proposed a highly efficient, scalable, and modular hardware architecture that achieves real-time ray tracing. The architecture includes a triangle intersection unit, a ray generation and shading unit, and a kd-tree traversal unit. The system created in [8] was derived from the hardware structure presented in [7] and features a fully programmable single-chip structure and implementation. Other researchers put forward a highly parallel multi-core processor architecture [9] and a tile-based parallel ray-tracing system that runs on 128 programmable cores [10]. In [11], the researchers designed a high-performance multi-chip architecture in which multiple chips share a single core memory. Finally, in [12], RayCore's RTU (ray-tracing unit) was developed in a way that enables the implementation of the entire ray-tracing stage in hardware.

FIGURE 2. Example of SAH.
The surface area heuristic (SAH) algorithm was presented in [13] to determine the cost of generating a kd-tree. The SAH, which is a sorting-based kd-tree algorithm, is defined in (1) as follows:

$$\mathrm{SAH}(p) = K_T + K_I \left( \frac{V_L}{V} C_L + \frac{V_R}{V} C_R \right) \qquad (1)$$

where $K_T$ represents the cost of tree traversal for the current node, and $K_I$ denotes the cost of an intersection test between rays and primitives. $\mathrm{SAH}(p)$ is determined based on the conditional probability that the current node can be divided into two sub-nodes and that the primitives can be distributed between the sub-nodes. This conditional probability is calculated by dividing the area of a sub-node space by that of the current node space ($V$ in (1)). In (1), $V_L$ and $V_R$ respectively refer to the left and right sub-node spaces into which the current node space is divided by $p$. $C_L$ and $C_R$ refer to the costs of the intersection test on primitives in the left and right sub-node spaces, respectively. The kd-tree constructed via the sorting method is optimized and of very high quality, but generating it takes a long time.
FIGURE 3. Example of the hybrid tree-construction method in RayCore [12] with the binning-based tree-build pipeline for high-level nodes and the sorting-based pipelines for low-level nodes.
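The cost in (1) can be evaluated for a single candidate split plane as in the following minimal sketch. It assumes the common form of the SAH in which $V$, $V_L$, and $V_R$ are surface areas of axis-aligned boxes and $C_L$ and $C_R$ are approximated by primitive counts; the function names and the default values of `k_t` and `k_i` are hypothetical and are not taken from [13].

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def surface_area(lo: Vec3, hi: Vec3) -> float:
    """Surface area of the axis-aligned box spanned by corners lo and hi."""
    dx, dy, dz = hi[0] - lo[0], hi[1] - lo[1], hi[2] - lo[2]
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def sah_cost(lo: Vec3, hi: Vec3, axis: int, p: float,
             n_left: int, n_right: int,
             k_t: float = 1.0, k_i: float = 1.5) -> float:
    """SAH(p) = K_T + K_I * (V_L / V * C_L + V_R / V * C_R).

    Assumption: C_L and C_R are the primitive counts on each side of the
    plane; k_t and k_i are illustrative constants, not values from the paper.
    """
    # Clip the node box at position p along the split axis.
    left_hi = tuple(p if i == axis else hi[i] for i in range(3))
    right_lo = tuple(p if i == axis else lo[i] for i in range(3))
    v = surface_area(lo, hi)
    v_l = surface_area(lo, left_hi)
    v_r = surface_area(right_lo, hi)
    return k_t + k_i * (v_l / v * n_left + v_r / v * n_right)


# Unit cube split in the middle of the x axis, 2 primitives left, 6 right.
cost = sah_cost((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), axis=0, p=0.5,
                n_left=2, n_right=6)
```

A sorting-based builder evaluates this cost at every primitive boundary and keeps the minimum, which is why it yields high-quality trees at the price of a long build time.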
Many studies improved performance by accelerating the construction of kd-trees. In [14], the researchers proposed a scanning algorithm that can identify approximated SAH costs. In [15], the authors developed a GPU kd-tree construction algorithm that uses spatial median and SAH algorithms to process upper- and lower-level nodes, respectively. In [16], a highly parallel binning method for fast kd-tree construction was put forward.
RayCore's TBU (tree-build unit) [12] is based on a hybrid method that combines binning and sorting approaches in kd-tree construction hardware. In the current work, the hybrid method was implemented as software. Figure 3 shows an example of a kd-tree generated via hybrid kd-tree construction. Scene splitting is performed using a binning method until the number of primitives for high-level nodes goes below a certain threshold, and low-level nodes are generated via sorting for the primitives of the split scenes. The binning method enables fast tree construction, and the sorting method produces high-quality trees. Compared to the exact SAH kd-tree construction [13], the quality degradation of trees constructed through the combined binning and sorting approach is insignificant, whereas kd-tree building is considerably faster.
An alternative to using a kd-tree is employing a BVH (bounding volume hierarchy). BVH construction time can be reduced through agglomerative clustering, in which pairs of clusters are merged repeatedly [17]. Researchers have also proposed a BVH-building hardware architecture that enables the fast construction of high-quality BVHs using parallel locally ordered clustering [17], [18].
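The switch between binning and sorting described above can be sketched as a choice of split candidates along one axis. This is a simplified, one-dimensional illustration and not RayCore's actual TBU pipeline; `threshold`, `n_bins`, and both function names are hypothetical, and the cost function (for example, an SAH evaluation) is passed in by the caller.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Extent (min, max) of one primitive along the split axis.
Interval = Tuple[float, float]


def split_candidates(prims: Sequence[Interval], lo: float, hi: float,
                     threshold: int = 64, n_bins: int = 8) -> List[float]:
    """Candidate split positions strictly inside the node extent (lo, hi)."""
    if len(prims) > threshold:
        # Binning: a fixed number of evenly spaced planes, so the work per
        # node does not grow with the number of candidate positions.
        step = (hi - lo) / n_bins
        return [lo + step * i for i in range(1, n_bins)]
    # Sorting: every distinct primitive boundary, in ascending order.
    bounds = {b for pmin, pmax in prims for b in (pmin, pmax)}
    return sorted(b for b in bounds if lo < b < hi)


def best_split(prims: Sequence[Interval], lo: float, hi: float,
               cost: Callable[[float], float],
               threshold: int = 64, n_bins: int = 8) -> Optional[float]:
    """Lowest-cost candidate, or None when the node has no valid candidate."""
    candidates = split_candidates(prims, lo, hi, threshold, n_bins)
    return min(candidates, key=cost, default=None)


prims = [(0.0, 1.0), (0.5, 2.0), (3.0, 4.0)]
binned = split_candidates(prims, 0.0, 4.0, threshold=2, n_bins=4)
exact = split_candidates(prims, 0.0, 4.0, threshold=64)
```

With three primitives and `threshold=2`, the node counts as a high-level node and receives the coarse binned planes; with the default threshold it is treated as a low-level node and every primitive boundary is considered.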

III. THE PROPOSED ALGORITHM AND SYSTEM
A. THE PROPOSED ALGORITHM
FIGURE 4. Example of a system in which tree-building and rendering are processed sequentially. TB, RD, and RD' refer to the kd-tree building, rendering, and accelerated rendering stages, respectively. L visually shows the improvement in performance.
Figure 4(a) shows a system in which the tree-building and rendering stages proceed sequentially. Figure 4(b) illustrates the improved performance of the system in (a) through the acceleration of the rendering stage. Accelerating the processing of each stage reduces the total amount of time it takes to generate frames, thereby enhancing performance.
Figure 5 depicts a system wherein tree-building and rendering operate independently. Figure 5(a) shows the performance improvement obtained when the tree-building of the system in Figure 4 is carried out on the CPU. In such a system, a critical requirement is to balance the processing performance of each stage. Under imbalanced processing, one of the stages in the chain is placed on standby. To illustrate, if the rendering of the current frame has finished but the tree for the next frame has yet to be built, the rendering stage must pause until tree-building is completed. Figure 5(b) shows a case wherein waiting time is prevented through adjustments to the performance of each stage. The approach is expected to improve performance by as much as L compared to Figure 5(a).
To solve the load imbalance problem described above, we developed a load balancing algorithm that reduces waiting time via adaptive adjustments to tree quality in each frame. This adaptive adjustment raises or lowers the value of a tree-build parameter based on the tree-building and rendering times measured for the previous frame. Figure 6 shows the flow diagram of the load balancing algorithm.
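The timing behavior contrasted in Figures 4 and 5 can be summarized with a simple per-frame model. The symbols $T_{TB}$ and $T_{RD}$ (tree-building and rendering time for one frame) and the numbers in the example are introduced here for illustration only and do not appear in the measurements reported later.

```latex
T_{\mathrm{seq}} = T_{TB} + T_{RD}, \qquad
T_{\mathrm{ind}} = \max\left(T_{TB},\, T_{RD}\right), \qquad
T_{\mathrm{wait}} = \left| T_{TB} - T_{RD} \right|
```

As a hypothetical example, with $T_{TB} = 30$ ms and $T_{RD} = 10$ ms the independent system delivers a frame every 30 ms (about 33.3 FPS), and the renderer idles for 20 ms per frame. If a coarser tree changes the times to $T_{TB} = 18$ ms and $T_{RD} = 19$ ms, the frame time drops to 19 ms (about 52.6 FPS) even though rendering itself became slower, which is the effect the load balancing algorithm aims for.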
Tree quality can be adjusted in many ways [19], including setting a tree's depth to a certain value, capping the number of primitives in each leaf node (the max primitive number), and changing the $K_T$ and $K_I$ values in (1) [6]. The present research adopted the adjustment of the max primitive number, which is a universal method. If this number is 1, the tree is at its deepest and build time is at its longest, but rendering completes in the shortest time because the resulting tree is of the best quality. The higher the max primitive number, the shorter the tree-building time and the longer the rendering time, because tree quality is diminished.

B. THE HYBRID RAY TRACING SYSTEM
Figure 7 shows the overall organization of the hybrid ray-tracing system designed in this work. It can be largely divided into tree-building and rendering stages. The tree-building stage is performed on the CPU by operating a TBU [12] as software. The rendering stage is carried out in hardware through the use of an RTU [12] on the FPGA board. An RTU has excellent performance per chip area because it implements the entire ray-tracing stage in hardware, including the setup processing, ray generation, traversal and intersection, hit-point calculation, and shading units.
In the system, trees are built in parallel on the CPU, and rendering is performed in parallel on the FPGA board. The CPU and FPGA board operate independently of each other. The specific operation is described as follows.
The CPU reads the geometry data of the dynamic scene to be rendered from system memory and builds a kd-tree. Our kd-tree structure includes static and dynamic acceleration structures (ASs) for static and dynamic primitives, respectively. The static AS is built once at the beginning of the process and reused at every frame, whereas the dynamic ASs are rebuilt at every frame. The constructed kd-tree is transferred to the local memory of the FPGA, which performs rendering based on the transmitted kd-tree. During rendering, the CPU constructs a kd-tree for the succeeding frame. In other words, while the FPGA renders frame $n$, the CPU builds the tree for frame $n+1$.
Figure 8 shows the system used in the experiments. This system is composed of a host computer and an FPGA board.
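The overlap between building the tree for frame $n+1$ and rendering frame $n$ can be illustrated with a small timeline simulation. It is a sketch under simplifying assumptions that are not stated in the text: the CPU starts the next tree at the moment the current frame begins rendering, and the transfer time over the bus is ignored. The function name is hypothetical.

```python
from typing import List, Sequence


def frame_finish_times(build: Sequence[float],
                       render: Sequence[float]) -> List[float]:
    """Finish time of each frame when build(n+1) overlaps render(n)."""
    if len(build) != len(render):
        raise ValueError("build and render need one entry per frame")
    finish: List[float] = []
    if not build:
        return finish
    build_done = build[0]   # the first tree has nothing to overlap with
    render_done = 0.0
    for n in range(len(build)):
        # The renderer waits if the tree for frame n is not ready yet.
        render_start = max(build_done, render_done)
        render_done = render_start + render[n]
        finish.append(render_done)
        if n + 1 < len(build):
            # Assumption: the CPU begins the next tree when frame n starts
            # rendering, so both stages run side by side.
            build_done = render_start + build[n + 1]
    return finish


# Build-bound case: the renderer idles 2 time units per frame.
unbalanced = frame_finish_times([4.0, 4.0, 4.0], [2.0, 2.0, 2.0])
# Balanced case: same total work per frame, but no idle time.
balanced = frame_finish_times([3.0, 3.0, 3.0], [3.0, 3.0, 3.0])
```

In both cases the interval between consecutive frames equals the longer of the two stage times, so the balanced configuration delivers frames every 3 time units instead of every 4.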

A. EXPERIMENTAL ENVIRONMENT
We used an Intel Core i9-9920X (3.50 GHz) CPU for the host computer and a VCU118 [20] (FPGA: Xilinx Virtex UltraScale+ VU9P) as the FPGA board. The host computer and the FPGA board were connected through the PCI Express interface.
Figure 9 shows the four dynamic benchmark scenes used in the experiments: Sword (37k primitives), Wood (30k primitives), Flight-sim (29k primitives), and Water-wave (12k primitives). The experiments were conducted at a resolution of $1920\times1080$. Table 2 shows detailed information about each dynamic benchmark scene, including the numbers of static and dynamic primitives, the number of textures and lights, and the total number of frames. We measured rendering time, tree-building time, and FPS per frame for each dynamic benchmark scene and averaged the measured values over all frames. To ensure accuracy, we repeated each experiment five times and averaged the results.
Table 3 shows the rendering time, tree-building time, the interval between them (RT-TT), and the FPS in accordance with changes in tree quality. Each value is the average over all frames. Leaf-num refers to the maximum number of primitives in a leaf node. Tree quality was adjusted by setting leaf-num to one of eight levels: 1, 2, 4, 8, 16, 32, 64, and 96. The best tree quality arises from a leaf-num value of 1; as this value grows to 96, tree quality decreases.

B. EXPERIMENTAL RESULTS
The optimal leaf-num values of the scenes are as follows: 32 for the Sword and Flight-sim scenes, 96 for the Wood scene, and 8 for the Water-wave scene (Table 3). In all scenes, the closer the interval between tree-building and rendering times is to 0 (i.e., the more balanced the processing performance of the two stages), the higher the FPS.
Table 4 shows the FPS results obtained with and without the load balancing algorithm. When the algorithm was applied, the leaf-num value was increased or decreased in steps of 8 to reduce the interval between tree-building and rendering times. The maximum leaf-num value is 96, and the range is divided into 13 levels. When the algorithm was not applied, we set the leaf-num value to 1 to ensure the best quality, as this is a commonly used setting for building a high-quality kd-tree.
As shown in Table 3, the tree quality with the highest FPS differed in each scene in accordance with scene characteristics. Thus, when tree quality was fixed at a single level, maintaining high FPS in all the benchmark scenes was impossible. In contrast, our load balancing algorithm produced high FPS performance in all the dynamic benchmark scenes because it adjusted tree quality adaptively by measuring the interval between rendering and tree-building times at the frame level.
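The per-frame adjustment of leaf-num can be sketched as a one-step controller. This is an illustration under stated assumptions rather than the exact flow of Figure 6: the 13 levels are assumed to be 1, 8, 16, ..., 96, and the `tolerance` parameter and the function name are hypothetical.

```python
# Assumed level set: 1 plus the multiples of 8 up to 96 gives 13 levels.
LEAF_LEVELS = [1] + list(range(8, 97, 8))


def next_leaf_num(leaf_num: int, build_time: float, render_time: float,
                  tolerance: float = 0.0) -> int:
    """Pick the leaf-num for the next frame from the last frame's timings."""
    idx = LEAF_LEVELS.index(leaf_num)
    gap = build_time - render_time
    if gap > tolerance and idx + 1 < len(LEAF_LEVELS):
        # Tree-building is the bottleneck: allow more primitives per leaf,
        # which shortens the build and lengthens rendering.
        idx += 1
    elif gap < -tolerance and idx > 0:
        # Rendering is the bottleneck: build a finer, higher-quality tree.
        idx -= 1
    return LEAF_LEVELS[idx]


# Starting from the best-quality tree with a build-bound frame.
leaf = next_leaf_num(1, build_time=40.0, render_time=12.0)
```

Because the decision uses only the two times measured for the previous frame, the controller adds negligible overhead and needs no scene-specific tuning beyond the level set itself.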
The experiments indicated that FPS performance improved from 5.2 to 7.7 for the Sword scene, from 8.3 to 16.6 for the Wood scene, from 14.9 to 25.9 for the Flight-sim scene, and from 20.4 to 36.6 for the Water-wave scene. These results amount to performance improvements ranging from 48% to 100%.
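The per-scene gains and their mean follow directly from the FPS values reported above, as the short check below shows. Only the FPS numbers come from the text; the variable and function names are illustrative.

```python
# (FPS without the algorithm, FPS with the algorithm) for each scene.
fps = {
    "Sword": (5.2, 7.7),
    "Wood": (8.3, 16.6),
    "Flight-sim": (14.9, 25.9),
    "Water-wave": (20.4, 36.6),
}


def improvement_percent(before: float, after: float) -> float:
    """Relative FPS gain in percent."""
    return (after - before) / before * 100.0


gains = {scene: improvement_percent(b, a) for scene, (b, a) in fps.items()}
average_gain = sum(gains.values()) / len(gains)

for scene, gain in gains.items():
    print(f"{scene}: {gain:.1f}%")
print(f"average: {average_gain:.1f}%")
```

The smallest gain is obtained for the Sword scene and the largest for the Wood scene, and the mean of the four gains matches the 75.3% average quoted for the algorithm.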

V. CONCLUSION, LIMITATIONS, AND FUTURE WORK
A. CONCLUSION
In this article, we proposed a load balancing algorithm that improves FPS performance by balancing tree-building and rendering performance. Our experiments showed an average performance improvement of 75.3%. Our approach is very effective for a system wherein tree-building and rendering are done independently, as is the case in a hybrid ray-tracing system. The findings showed that performance can be enhanced by identifying the tree quality at which tree-building and rendering performance are best traded off against each other.

B. LIMITATIONS AND FUTURE WORK
We focused on improving performance by controlling tree quality, which is one of the parameters of tree-building. Given that this method is highly scalable, it is not limited to environments where a CPU processes tree-building, as in our hybrid system. We believe our approach is applicable to tree construction on other platforms, such as GPUs or dedicated hardware [12].
The proposed method adjusts tree-building parameters only and does not consider rendering parameters. In future studies, we intend to investigate parameters that affect speed and image quality in rendering, such as variable rate shading [21] and foveated rendering [22].