## 1 Introduction

Visualization of volumetric data has seen much research over many years. The two prominent approaches typically employed are direct volume rendering [2], [4], [19], [23] and isosurface visualization. Isosurface rendering is a simple and effective approach for visualizing the different intensities present in the data. Many methods for isosurface visualization have been presented, ranging from direct volume rendering using isosurface transfer-functions [19], rasterization of isosurfaces extracted to polygonal data [17], to direct raytracing [13], [15], [24]. Raytracing is the process of projecting rays through a scene to find intersections with scene geometry. In order to find these intersections in a timely manner, an acceleration structure is typically warranted. Acceleration structures serve two purposes: to ensure that only valid data-elements intersecting the rays are examined, and to ensure that empty space is ignored.

Interactive isosurface raytracing [14], [24] presents a unique challenge, as the acceleration structure must facilitate the rendering of all possible isosurfaces held by the data. An implicit kd-tree [24] is a special variant of the general kd-tree, designed to facilitate fast traversal to voxels with possible ray-isosurface intersections. In this paper, we utilize a variant of the implicit kd-tree approach and specifically target our contribution at highly-parallel computing devices, such as Graphics Processing Units (GPUs).

GPUs have developed very quickly in recent years, to the point of being orders of magnitude more powerful than modern CPUs. GPUs are essentially massively parallel processing devices with many small processing units, and hence are typically suited to light-weight and highly-parallel computing problems. GPUs were initially made available to researchers through shader languages, although this required complicated programming. With the advent of technologies such as CUDA (Compute Unified Device Architecture), it has become more viable to implement sophisticated algorithms on GPU hardware with simpler programming methodologies. For example, CUDA kernels (i.e., functions that operate in parallel across multiple threads) are programmed in the C language and called much like a normal C/C++ function.

However, raytracing on GPUs introduces new challenges for researchers, not only from an implementation perspective, but also in choosing the best algorithm to fully utilize the device hardware. Many previous raytracing approaches were designed for CPU-based raytracing and therefore may not have the same benefits when implemented on GPUs. Raytracing algorithms can be divided into two categories: those that reduce the overall workload (e.g., packet traversal) and those that optimize the traversal (e.g., stackless traversal). In packet traversal [24], a group of rays is represented and traversed as a packet, which reduces the amount of work for the group by performing the common traversal steps once. Coherent traversal [14] attempts to exploit the fact that rays typically traverse the same nodes most of the time, by forcing convergence after ray divergence. Optimization algorithms target the specific strengths and drawbacks of the employed architecture, for example, stackless traversal [3], [5], [10], [21].

In general, raytracing requires a stack to record tree nodes which should, if required, be returned to. Due to shared-memory limitations of GPUs, such an approach typically requires the stack to be stored in the GPU's main memory (global memory). Since accessing global memory is the slowest aspect of most GPUs, a stack-based approach may induce a memory bottleneck.

Foley [5] highlighted that a stack-based traversal approach, on older-generation GPUs, induced such a performance bottleneck. Their solution was to completely remove the need for a stack and resort to extra computation for correct traversal. The two approaches that they introduced for general kd-trees were kd-restart and kd-backtrack. Kd-restart handled a return by restarting traversal from the root node, while moving the ray's segment-range ahead of previously visited areas. Kd-backtrack handled a return by backtracking up the tree one node at a time until a valid continuation point was found. Both approaches require a considerable amount of additional work to find valid nodes from which to continue traversal, as shown in Fig. 2.

### 1.1 Previous Work

Implicit kd-trees [24] have been presented for use with isosurface ray-tracing and have been shown to be useful for maximum-intensity projection [8].

Challenges using a stack have been addressed by employing semi-stackless or completely-stackless approaches. Foley [5] introduced kd-restart and kd-backtrack to find the correct point to continue traversal. This work was further optimized with the introduction of kd-shortstack [10]. With kd-shortstack, a small stack is used for the majority of cases and, if a stack-overflow occurs, the algorithm reverts to kd-restart. Another alternative in stackless traversal was *ropes* [21]. Ropes are additional memory pointers per node, which link a node to its neighbour nodes. Ropes allow traversal, when required to return, to simply travel along the ropes into the neighbour nodes. Unfortunately, ropes require additional computation to build and require a considerable amount of memory.

Volumetric rendering of scenes with billions of voxels is possible on modern GPUs by streaming data-bricks to the GPU while simultaneously rendering previously uploaded bricks [3], [6], [9]. This allows volumes far larger than the available memory of the device to be rendered in real time. The latest publication [3] employing the bricking method is also notable for partially using indices for traversing their acceleration structure, and using a variant of kd-restart.

### 1.2 Main Contribution

In this paper, we introduce a stackless approach, referred to herein as Kd-Jump, for the traversal of implicit kd-trees. Kd-Jump achieves an immediate return, like a stack, without redundant node testing, as illustrated in Fig. 2(a). Also introduced is Hybrid Kd-Jump, which utilizes a volume-stepper for leaf testing and a run-time depth threshold to define where kd-tree traversal stops and volume-stepping begins. By using both methods, we gain the benefits of empty-space removal and fast hardware-accelerated texture-caching.

In Section 2, we give the background on building and traversing an implicit kd-tree. In Section 3, we present our stackless approach, Kd-Jump. In Section 4, we introduce the Hybrid Kd-Jump traversal approach, which can dynamically alter the depth where volume stepping occurs. In Sections 5 and 6, we present our results, discuss possible bottlenecks and analyze CUDA.

## 2 Implicit Kd-Trees

We employ and build upon Wald's [24] implicit kd-tree. For completeness, we outline the basic concepts for building and traversing an implicit kd-tree; an unfamiliar reader should refer to Wald's original paper for a more comprehensive overview. We also highlight the aspects that we have altered for our implementation.

### 2.1 Building Implicit Kd-Tree

A kd-tree is a binary tree where, starting with a root, each node is divided into two children by an axis-aligned splitting-plane. When a node cannot be split, it represents one voxel and is referred to as a leaf. The split axis is typically chosen based upon which dimension is currently largest for the node, as illustrated by Fig. 3.

Implicit kd-trees are required to be *balanced-trees*, such that all leaves of the tree are on the same depth. To achieve this balance, the voxel dimensions must be a power-of-two, although each dimension need not be identical. Implicit kd-trees define *actual*-dimensions and *virtual*-dimensions; where the actual-dimensions are for voxels that actually exist while the virtual-dimensions are used purely to ensure that a balanced kd-tree is built, as illustrated by Fig. 3. Even though the kd-tree is built upon a larger virtual-volume, the non-existent nodes and voxels are never visited; nor are they stored.

There are two stages to building the implicit kd-tree, both of which are iterative processes. The first stage involves determining the number of levels for the tree, computing the level information from top-to-bottom and allocating the node memory per level. The second stage involves calculating the node information from bottom-to-top. The initial building process is required to be run on the CPU in order to allocate GPU memory, while the more labour-intensive job of computing the node information can be performed in parallel on the GPU itself.

#### 2.1.1 Initial Building

Given a volume and its actual voxel dimensions *R* = [*R*_{x}, *R*_{y}, *R*_{z}], we first compute the virtual dimensions *V* = [2^{m}, 2^{n}, 2^{p}], where 2^{m−1} < *R*_{x} ≤ 2^{m}, *etc*. The number of levels in the kd-tree is then defined as *k* = *m* + *n* + *p*.
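As a host-side illustration of this computation (C++ rather than CUDA; the function names are ours, not the paper's), the virtual extents and level count can be sketched as:

```cpp
#include <array>
#include <cstdint>

// Smallest power of two >= r (r >= 1): the virtual extent 2^m with 2^{m-1} < r <= 2^m.
inline uint32_t nextPow2(uint32_t r) {
    uint32_t v = 1;
    while (v < r) v <<= 1;
    return v;
}

// Number of kd-tree levels k = m + n + p for actual dimensions R.
inline int treeLevels(std::array<uint32_t, 3> R) {
    int k = 0;
    for (uint32_t r : R)
        for (uint32_t v = 1; v < nextPow2(r); v <<= 1) ++k;  // counts m, n, p doublings
    return k;
}
```

For example, a 256 × 256 × 128 volume yields *k* = 8 + 8 + 7 = 23 levels.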

Each level of the kd-tree has real dimensions *R*^{l} and virtual dimensions *V*^{l}. During tree building, we also maintain the current node extent *W*^{l}, which gives the virtual dimensions of the region represented by a single node on level *l*. We use *W*^{l} to determine the largest dimension, and the split takes place along the axis *a*^{l} ∊ {*x*, *y*, *z*} of the largest dimension on level *l*. Thus, starting with *W*^{0} = *V* and *V*^{0} = [1, 1, 1], the virtual dimensions for each level are defined by *V*_{a^l}^{l+1} = 2*V*_{a^l}^{l}, while *W*_{a^l}^{l+1} = *W*_{a^l}^{l}/2. The actual dimensions of each level can then be found by *R*^{l} = *ceil*(*V*^{l}(*R*/*V*)). Finally, the number of nodes required per level is *M*^{l} = *R*_{x}^{l} × *R*_{y}^{l} × *R*_{z}^{l}.
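The top-down pass can be sketched in host code as follows. This is our own naming (`LevelInfo`, `buildLevels`, and the extent vector `W` are not from the paper), under the assumption that the split axis is the one with the largest remaining node extent:

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

struct LevelInfo {
    int axis;                       // split axis a^l chosen at this level (0=x, 1=y, 2=z)
    std::array<uint32_t, 3> V;      // virtual dimensions V^l
    std::array<uint32_t, 3> R;      // actual dimensions R^l
    uint64_t M;                     // node count M^l
};

// R: actual voxel dimensions, Vfull: virtual dimensions (powers of two), k: level count.
std::vector<LevelInfo> buildLevels(std::array<uint32_t, 3> R,
                                   std::array<uint32_t, 3> Vfull, int k) {
    std::array<uint32_t, 3> V = {1, 1, 1};   // V^0
    std::array<uint32_t, 3> W = Vfull;       // remaining node extent, W^0 = V
    std::vector<LevelInfo> levels;
    for (int l = 0; l < k; ++l) {
        int a = 0;                           // a^l: axis with largest remaining extent
        if (W[1] > W[a]) a = 1;
        if (W[2] > W[a]) a = 2;
        V[a] *= 2;                           // V^{l+1}_{a^l} = 2 V^l_{a^l}
        W[a] /= 2;                           // W^{l+1}_{a^l} = W^l_{a^l} / 2
        LevelInfo info;
        info.axis = a;
        info.V = V;
        for (int d = 0; d < 3; ++d)          // R^l = ceil(V^l (R / Vfull))
            info.R[d] = (uint32_t)std::ceil(V[d] * (double)R[d] / Vfull[d]);
        info.M = (uint64_t)info.R[0] * info.R[1] * info.R[2];
        levels.push_back(info);
    }
    return levels;
}
```

For a 4 × 4 × 2 volume, the deepest level correctly ends up with *M* = 32 nodes, one per voxel.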

Unlike general kd-trees, which use memory pointers, nodes are addressed using indices, and a map is used to convert the indices to a location in memory. For three-dimensional data, a node has three indices *U* = [*x*, *y*, *z*], which are non-negative integers. Converting these indices, for a node on level *l*, to a memory location is achieved with *offset* + (*U*_{x} + (*U*_{y} × *R*_{x}^{l}) + (*U*_{z} × *R*_{x}^{l} × *R*_{y}^{l})) × *sizeof*(*node*), where the *offset* is the start of data for the level.
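A minimal sketch of this index-to-memory mapping (C++; the function and parameter names are ours):

```cpp
#include <cstddef>
#include <cstdint>

// Maps a node's indices U = [Ux, Uy, Uz] on a level with actual dimensions
// [Rx, Ry, ...] to a byte offset: offset + (Ux + Uy*Rx + Uz*Rx*Ry) * sizeof(node).
inline size_t nodeAddress(uint32_t Ux, uint32_t Uy, uint32_t Uz,
                          uint32_t Rx, uint32_t Ry,
                          size_t levelOffset, size_t nodeSize) {
    return levelOffset +
           ((size_t)Ux + (size_t)Uy * Rx + (size_t)Uz * Rx * Ry) * nodeSize;
}
```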

Implicit kd-trees do not store a split-plane within each node. For each level of the kd-tree, the number of shared split-locations is *R*_{a^l}^{l} rather than *M*^{l}; see Fig. 3. In fact, the split plane for any node can be computed on the fly during traversal, avoiding global-memory access.

#### 2.1.2 Computing Node Data

Once the memory for the kd-tree is allocated, the node data and dimension splits for each level can be computed. Wald [24] originally defined that each node contained the minimum and maximum (min/max) value for all data held within the node sub-tree. This min/max value is used to determine whether any child contains the current isovalue and whether traversal should continue.

In this paper, we define that each node contains two sets of min/max values, one for each child, rather than a single min/max value encompassing both. By storing the min/max of both children, we facilitate faster traversal. Specifically, storing both sets means that the node data can be referenced with a single index-to-memory mapping and loaded in one transaction. In addition, by checking whether both children are valid (contain the isosurface) before traversing into them, we can eliminate two redundant iterations (down-traversal and return) whenever a child is invalid.

Starting from the last node level, the min/max values are computed by evaluating the children. In the case of the last level, this requires checking the eight corner values of a voxel. For the remaining node levels, the min/max sets are computed from the min/max sets held by the child nodes. As nodes are only dependent on their own children, all nodes on a tree-level can be computed in parallel. Finally, if memory overhead is a concern, the final level of nodes can be omitted and computed on the fly [24]. Special care must be taken if the volume is accessed via CUDA textures, as data is typically offset.
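The bottom-up reduction is a simple union of child ranges; a host-side C++ sketch (struct and helper names are ours), with the validity test used during traversal:

```cpp
#include <algorithm>
#include <cstdint>

struct MinMax { uint16_t min, max; };

// A parent's range is the union of its children's ranges.
inline MinMax combine(MinMax a, MinMax b) {
    return { std::min(a.min, b.min), std::max(a.max, b.max) };
}

// A node's sub-tree can contain the isosurface iff the isovalue lies in its range.
inline bool containsIso(MinMax n, uint16_t iso) {
    return n.min <= iso && iso <= n.max;
}
```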

### 2.2 Traversing Implicit Kd-Trees

Determining whether a ray intersects the isosurface is achieved by traversing the nodes which both intersect the ray and contain the isosurface. Starting from an origin, a ray is projected along a direction, and a ray segment, defined by *t*_{near} and *t*_{far}, marks the valid portion along the ray where raytracing can occur. Each node has two children, denoted *first-child* and *second-child*. During traversal, the children are also tagged as *near-node* and *far-node*, although the two labellings are not synonymous. We also define a boolean *NearFirst* ≡ *r*_{a^l} > 0, where *r* is the ray direction vector. Traversing into a child node is performed by updating the index along the split axis: *U*_{a^l} = 2*U*_{a^l} + (1 − *NearFirst*) for the near-node and *U*_{a^l} = 2*U*_{a^l} + *NearFirst* for the far-node. By traversing the near-node first, we ensure that the first intersection along the ray is found, at which point traversal can terminate.

Testing a node first requires computing the intersection distance *t*_{d} from the ray origin to the split plane. Which children must be traversed depends on where *t*_{d} lies in relation to *t*_{near} and *t*_{far}, as shown in Fig. 4. If *t*_{near} > *t*_{d}, the ray segment lies entirely beyond the split plane, so the far-node is traversed and the near-node is culled. If *t*_{far} < *t*_{d}, only the near-node is traversed. A common case is *t*_{near} < *t*_{d} < *t*_{far}, where both child nodes are valid and (potentially) must be traversed. In this case, the near-node is traversed first and, if the ray does not find a valid intersection in that sub-tree, the far-node is returned to.
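The child-selection logic can be sketched as a small function (C++; names are ours), consistent with standard kd-tree segment clipping and near-first ordering:

```cpp
// Decide which children a ray segment [tNear, tFar] must visit,
// given the distance tD from the ray origin to the split plane.
enum Visit { NearOnly, FarOnly, Both };

inline Visit classify(float tNear, float tFar, float tD) {
    if (tNear > tD) return FarOnly;   // segment starts beyond the split plane
    if (tFar  < tD) return NearOnly;  // segment ends before the split plane
    return Both;                      // plane splits the segment: near first, far on return
}
```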

The typical solution for storing the *far-node* is to use a stack to record the indices and the *t*_{d} and *t*_{far} values. However, a stack is not ideal for use on a GPU, and we explore a stackless approach.

## 3 Stackless Traversal with Kd-Jump

The basic traversal of a kd-tree requires storing a return point when both children of a node have to be tested. Utilizing a stack is a simple method for storing the return information. However, a stack approach requires utilizing the (currently) slowest part of the GPU pipeline, the global memory. To avoid using the global memory, one must remove the stack. There are two main stackless methods currently available (without requiring additional node memory): kd-restart and kd-backtrack. Both are trivial to implement for implicit kd-trees, but lead to additional workload compared to a stack-based approach. The extra work comes in the form of redundant node testing to find a continuation point.

With kd-backtrack, the return mechanism is replaced by traversing back up the tree node-by-node until a valid node is found, at which point downward traversal continues. Once a valid parent node is found, the *far*-child is traversed. The approach was originally envisaged for use with arbitrary kd-trees and therefore required additional parent-pointers to work.

Traversal of implicit kd-trees does not involve memory pointers; all nodes on all levels are referenced entirely by indices. As a result, it is possible to forgo backtracking one node at a time and simply jump immediately to the next valid node. However, determining what constitutes a valid node, and how far to jump, requires additional information. We now explain a novel approach for this, which we refer to as Kd-Jump.

In order to understand how immediate returning to the next valid node is possible, we will re-examine how downward traversal is achieved with an implicit kd-tree.

### 3.1 Traversing to Child

Traversal of the kd-tree involves tracking and updating three indices, so as to allow addressing of nodes. The indices of a node at level *l* are defined as *U*^{l} = [*x*^{l}, *y*^{l}, *z*^{l}], and the indices of the next level are *U*^{l+1}. We first initialize the child indices with those of the parent: *U*^{l+1} = *U*^{l}. Traversal from parent to child is then achieved by altering the index component corresponding to the axis *a*^{l} that splits the *l*'th level
$$U_{a^l }^{l + 1} = 2U_{a^l }^l + c,\quad c = \left\{ \matrix{ 0\quad {\rm{first}}\;{\rm{child}} \hfill \cr 1\quad {\rm{second}}\;{\rm{child}} \hfill \cr} \right.\eqno(1)$$
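Eq.(1) translates directly into code; a minimal C++ sketch (the function name is ours):

```cpp
#include <array>
#include <cstdint>

// Eq.(1): descend into a child by doubling the index component along the
// split axis a and adding c (0 = first child, 1 = second child).
inline std::array<uint32_t, 3> toChild(std::array<uint32_t, 3> U, int a, uint32_t c) {
    U[a] = 2 * U[a] + c;
    return U;
}
```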

### 3.2 Returning to Immediate Parent

Like a stack-based approach, the best scenario is to return to the next immediate node to test. The current node and the node to return to will always share a common parent. As such, the first step is to arrive at that parent (see Fig. 2(a) or Fig. 5). The trivial case, given *U*^{l+1}, is to return to the immediate parent *U*^{l}. Again, for this simple case, we initialize the indices with *U*^{l} = *U*^{l+1} and then apply an operation equivalent to the inverse of Eq.(1).
$$U_{a^l }^l = floor\left({{{U_{a^l }^{l + 1} } \over 2}} \right)\eqno(2)$$

### 3.3 Returning to Arbitrary Parent

Returning from *U*^{l} to *U*^{j} (*j* < *l*) potentially requires different divisions of each element of the index set. To achieve an immediate jump, we must deduce the number of iterations of Eq.(1) that have been performed to each element of the index set.

We define a *k*-by-3 matrix **S**, which stores the accumulation of the number of axis-splits, for each level. The matrix is formed in a recursive manner. Each row is initialized with the previous row values; **S**_{l + 1} = **S**_{l}. This is then followed by altering the vector-component corresponding to the axis *a*^{l} that splits the *l*'th level
$${\bf{S}}_{l + 1,a^l } = {\bf{S}}_{l,a^l } + 1,\eqno(3)$$
where **S**_{m,n} represents the matrix element at the *m*'th row and *n*'th column, and
$${\bf{S}}_0 = [0,0,0]\eqno(4)$$
is the initialization vector for the root level. Note that **S** is formed only once during kd-tree construction. Thus, storing and accessing this matrix on GPUs can be made quite fast by using cached constant-memory.
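The construction of **S** can be sketched in host code as follows (C++; `buildS` and `splitAxes` are our names), given the per-level split axes determined during tree building:

```cpp
#include <array>
#include <vector>

// Builds the k-by-3 accumulation matrix S from the per-level split axes:
// row l+1 copies row l and increments the column of axis a^l; row 0 is [0,0,0].
std::vector<std::array<int, 3>> buildS(const std::vector<int>& splitAxes) {
    std::vector<std::array<int, 3>> S;
    S.push_back({0, 0, 0});                  // S_0, the root-level row (Eq.(4))
    for (int a : splitAxes) {
        std::array<int, 3> row = S.back();   // S_{l+1} = S_l ...
        row[a] += 1;                         // ... then S_{l+1,a^l} += 1 (Eq.(3))
        S.push_back(row);
    }
    return S;
}
```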

Given the accumulation matrix **S**, the current depth *l* and the depth *j* of the common-parent node, we can find the number of iterations, denoted *N*, of Eq.(1) applied to each index element between levels *j* and *l* as
$$N = {\bf{S}}_l - {\bf{S}}_j \eqno(5)$$
where *N* = [*N*_{x}, *N*_{y}, *N*_{z}].

Finally, we are able to restore the index-set for the parent node being returned to by altering Eq.(2) to account for the number of power-of-two multiplications that have been applied (Eq.(5)). Thus, returning to the index-set *U*^{j}, given *U*^{l}, is achieved using
$$U^j = floor\left({{{U^l } \over {2^N }}} \right),\eqno(6)$$
where 2^{N} ≡ [2^{N_x}, 2^{N_y}, 2^{N_z}]. So long as *c* ≤ 1, Eq.(6) correctly finds the integer indices without having to redetermine *c* for each level. Also note that all divisions in Eq.(6) are by powers of two, and can therefore be implemented using rightward bit-shifts.
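The jump of Eq.(6) thus reduces to a per-component bit-shift; a C++ sketch (names ours), assuming the relevant rows of **S** are available:

```cpp
#include <array>
#include <cstdint>

// Kd-Jump return from level l to level j: N = S_l - S_j (Eq.(5)),
// then U^j = floor(U^l / 2^N) componentwise (Eq.(6)), via right shifts.
inline std::array<uint32_t, 3> jumpToLevel(std::array<uint32_t, 3> Ul,
                                           std::array<int, 3> Sl,
                                           std::array<int, 3> Sj) {
    std::array<uint32_t, 3> Uj;
    for (int d = 0; d < 3; ++d)
        Uj[d] = Ul[d] >> (Sl[d] - Sj[d]);   // power-of-two division as bit-shift
    return Uj;
}
```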

### 3.4 Completing Jump

Once we have returned to the parent node, we can simply reapply Eq.(1) in order to traverse into the next child. However, the unknown element is the offset *c*, which we need in order to arrive at the *far-child*. Assuming a *nearest-first* traversal ordering, this can be deduced by redetermining whether the *first-child* is the *near-facing* node, which is quickly performed by examining the ray direction: if *r*_{a^l} ≥ 0 then *c* = 1, else *c* = 0. The complete Kd-Jump method is illustrated by Fig. 5 for a simple two-dimensional case.

After returning into the next node, the final step is to re-clip the ray to the bounds of the node and recompute *t*_{near} and *t*_{far}. The bounds can also be computed on the fly, which again avoids global-memory usage. See [25] for efficient ray-bound intersection methods.

### 3.5 Making Return Flags

Our Kd-Jump mechanism, as detailed, facilitates a return but does not yet specify how far to jump. To return immediately during ray traversal, we require a *flag* specifying whether a level along the traversal path has a node that requires testing. Note that, for the current traversal path, at most one node per level will ever need to be returned to. As such, we only require a single memory-bit per level to store a possible return. We define a 32-bit integer register *DepthFlags* to store these flags. As the typical size of volumes used on GPUs today (without out-of-core methods) is less than 1024^{3}, a 32-bit integer can hold the depth flags. However, 64-bit integers can be utilized to facilitate kd-trees of up to 64 levels in the future; indeed, CUDA devices already provide 64-bit functionality.

Given the *DepthFlags* register, we flag a level for return using *bitwise operators*: *DepthFlags* |= 1 << (31 − *l*). Note that we store the bits in *most-significant* ordering, such that the bit-index of the *l*'th level is 31 − *l*. We can determine whether there are return positions by checking if *DepthFlags* > 0. Assuming that *DepthFlags* is non-zero, finding the first-set depth flag is akin to counting the consecutive number of zero-bits starting from the *least-significant* bit; we denote this operation *CountBits*. Hence, the actual depth *j* to return to is 31 − *CountBits*(*DepthFlags*). Upon a successful return to a level, it is important to clear the *j*'th level flag bit to zero, again using bit manipulation. In CUDA, *CountBits* can be accomplished using the built-in function *ffs* (though its result is offset by plus one). For an alternative to CUDA's *ffs*, see [1].
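A host-side sketch of the *DepthFlags* operations (C++; function names are ours), substituting the GCC/Clang intrinsic `__builtin_ctz` for CUDA's *ffs* − 1:

```cpp
#include <cstdint>

// Bit (31 - l) of DepthFlags marks a pending far-node at level l.
inline void setReturn(uint32_t& flags, int l)   { flags |=  (1u << (31 - l)); }
inline void clearReturn(uint32_t& flags, int l) { flags &= ~(1u << (31 - l)); }

// Deepest pending level: count trailing zero bits (the text's CountBits).
// __builtin_ctz is a GCC/Clang intrinsic; flags must be non-zero.
inline int deepestReturn(uint32_t flags) {
    return 31 - __builtin_ctz(flags);
}
```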

## 4 Faster Traversal with Hybrid Kd-Jump

An acceleration method for raytracing serves one primary purpose: removing the extraneous memory accesses associated with empty or invalid space. However, it is entirely possible for an acceleration method to under-perform, or even perform worse than a brute-force raytracer, for example, when the acceleration method is complex or is utilized for too long. Thus, it is important to be able to determine when and where an acceleration method is useful. For this purpose, we present hybrid traversal and dynamic update.

### 4.1 Hybrid Traversal

Each node in a kd-tree represents a sub-region of the complete volume. With each level of the kd-tree, this region is made ever smaller, until a node represents a single voxel on the final level. We employ a simple method, whereby we introduce a run-time depth-threshold parameter to the traversal kernel. Once rays traverse past this threshold, we switch to the volume stepper and iteratively step along the ray from *t*_{near} to *t*_{far}. Volume stepping is performed until the isosurface is crossed, or until *t*_{far} is reached, after which a return is issued. See Fig. 6(a).
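The leaf-stage stepping loop can be sketched as follows (C++; the callable `sample` stands in for a hardware texture fetch, and the step size and names are our assumptions, not the paper's):

```cpp
#include <functional>

// March from tNear to tFar in steps of dt, returning the first parameter at
// which the sampled value crosses the isovalue, or -1 if no crossing occurs.
float stepSegment(float tNear, float tFar, float dt, float iso,
                  const std::function<float(float)>& sample) {
    float prev = sample(tNear);
    for (float t = tNear + dt; t <= tFar; t += dt) {
        float cur = sample(t);
        if ((prev - iso) * (cur - iso) <= 0.0f)  // sign change: isosurface crossed
            return t;                            // refine here (e.g. interpolation)
        prev = cur;
    }
    return -1.0f;                                // no crossing: issue a return
}
```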

The purpose of this hybrid system is two-fold: firstly, to gain the benefit of the fast texture-cache and, secondly, to allow adjustment of the threshold in order to maximize the usefulness of the kd-tree. Although combining an acceleration structure with volume-stepping methods is not entirely new, we present it here to show that a kd-tree can perform well and can be adjusted easily for dynamic situations.

By building a complete kd-tree and then introducing a run-time depth threshold, we can alter the threshold during raytracing in order to find the optimal setting. The optimal threshold is subject to several factors, primarily the volume size, the complexity of the data itself, the isosurface location and the view direction. Further, if the volume stepping distance is reduced, say to acquire better intersection results while zoomed in, then a larger threshold (traversing further down the kd-tree) is more efficient.

The threshold also depends on the complexity of the data-interpolation being performed. If *tri-linear* interpolation is used, it is more beneficial to switch to the volume stepper sooner. However, if *tri-cubic* interpolation is used, as is the case for discrete binary-volume rendering [12], then there is far more incentive to traverse the kd-tree for as long as possible, because of the lack of hardware acceleration. The same argument applies to complex intersection methods such as the correct root-finding method [18].

### 4.2 Dynamic Update

With Wald's original implicit kd-tree, each node contained a min/max pair: the minimum and maximum values within the region represented by the node. As described in the previous section, we load both child min/max pairs prior to traversing into children. Hence, during traversal, these two sets of values must be loaded from memory. For 8-bit data, this requires a 32-bit transaction, while for 16-bit data the node is 64 bits. The cost of loading this data, plus the cost of comparing the node's value ranges against the target isovalue, adds overhead to each traversal step.

A better alternative is to move the node-validity test out of the traversal stage and into the kd-tree update stage. Thus, instead of a node holding two min/max pairs, one per child, it simply holds two boolean bits specifying whether the children are currently valid for traversal. Upon a change in isovalue, this requires updating every node on every level, starting from the original volume itself. This is already quite fast (less than 0.25 seconds) for 512^{3} volumes, even with a naive implementation. Much of this efficiency can be attributed to CUDA's cached texture-access, which applies not only to accessing the three-dimensional volume, but also to accessing the node data. See Fig. 6(b).

With hybrid traversal, several deep levels of the kd-tree may be avoided altogether. We show, in the results, that the deeper levels are not particularly useful in our implementation. Therefore, the dynamic update can be made more efficient by only updating levels of the tree which may actually be traversed. This can be accomplished by introducing a separate sub-volume of min/max pairs. This sub-volume represents the node information for one kd-tree level and is considered the absolute cutoff depth. During traversal, the *cutoff* depth overrides the depth *threshold* if the latter is greater.

Choosing whether to employ node-conditions or traversal-conditions depends on several factors. If memory size is an issue, or the isovalue is changed irregularly, then node-conditions would be more suited. In contrast, if the isovalue is altered every frame, then traversal-conditions would be better suited.

## 5 Results

We performed several comparative tests and recorded the timing information for the kernels using CUDA's high-resolution timers. The results presented here were averaged over multiple passes. Table 1 gives the average frames-per-second (FPS) spanning a wide range of isosurfaces and multiple view directions for the test data.

Memory usage is an important factor for GPUs, due to limited resources. We show the typical memory usage in Table 2 for a 1024^{2} screen, as would be the case with a single raytracing kernel. The table shows that a stack requires a considerable amount of global memory to accommodate all rays, while Kd-Jump requires only a small matrix in fast constant-memory. Although kd-restart uses the fewest resources, its redundant node visitation severely reduces performance, as shown in Table 1.

To further compare the performance of Kd-Jump, we evaluate the theoretical performance, as shown in Fig. 7. In this evaluation, we only test the code relevant to storing and retrieving a return position. We set up a kernel with 1024^{2} threads organized into 128-thread blocks to achieve full occupancy. Both the stack and Kd-Jump kernels were tasked with storing and then retrieving *n* returns. The results clearly show that Kd-Jump potentially offers considerable speed gains. When cross-referenced with Table 1, however, it is evident that the gains of Kd-Jump over the stack, in a complete raytracer, are not as great. We believe this is because the memory accesses in the stack kernel are better hidden by the other computation (general loop, ray splitting and leaf testing).

### 5.1 Limiting Factors

We can test both the Kd-Jump and stack-based kernels with different settings for the core and memory frequencies, as shown in Fig. 8. This allows us to examine which factor (computation or memory access) is limiting performance for each kernel. The results clearly show that Kd-Jump is computationally limited while the stack-based approach is memory limited. This quick test can also be quite useful during development and implementation of raytracing kernels, as it can indicate which factors should be optimized.

### 5.2 Hybrid Kd-Jump

In order to gain as much performance as possible, and thus give merit to using a kd-tree in the first place, we present a comparison of a hybrid kd-tree kernel (using our Kd-Jump method) versus a pure brute-force raytracing kernel. While neither kernel is particularly well optimized, both share the same code for stepping through the volume and detecting an isosurface crossing.

For the Hybrid Kd-Jump kernel, we have incorporated a number of optimizations, specifically the hybrid traversal and dynamic update described in Section 4.1 and 4.2. In addition, the Hybrid Kd-Jump kernel accesses node information from the texture cache rather than directly from global memory.

Table 3 shows the results for Hybrid Kd-Jump versus a brute-force volume-stepper, across multiple isosurfaces, data sizes and screen sizes. Of note are the cases where brute-force outperforms Hybrid Kd-Jump. In these cases, two conditions are always present: firstly, the isosurface covers much of the screen and, secondly, the isosurface is close to the bounds of the data. Hence, a simple volume stepper operates only for a short time before detecting an isosurface. With more complex isosurfaces, longer distances from the bounds to the isosurface and larger screen resolutions, however, brute force is slower than Hybrid Kd-Jump. Fig. 9 shows the performance change for various threshold values and indicates the degree to which using a kd-tree is beneficial.

### 5.3 Multiple Rays Per Thread

A bottleneck affecting all methods can occur when only one thread of a warp is active, or only one warp of a block is. Since CUDA allocates a block's worth of resources (shared memory, registers) and operates that block until completion, it is logical to assume that, if only one warp is actually active, the three remaining inactive warps will limit computation throughput.

We can test this by altering the kernel to include an outer loop, whereby new rays are loaded and initialized once a warp has finished. Loading a new ray whenever a single thread terminates is unlikely to be effective; indeed, during our initial development, doing so induced much slower performance. We believe this results from increased code-branching during traversal, as well as the loss of the initial ray coherence. A warp, on the other hand, terminates only when all of its threads terminate. Thus, loading a new batch of rays across the 32 threads of the warp maintains the initial coherence of a group of rays, while ensuring that as many warps as possible remain active throughout the lifetime of the kernel. We implement the multi-ray kernel as an extension of the Kd-Jump kernel. As seen in Fig. 10, the benefits are positive.

### 5.4 Separating Kernels

The basic approach to parallel raytracing is to dedicate one thread per ray and to develop a single kernel containing the entire rendering pipeline: node traversal, leaf testing and pixel shading. However, as a single kernel, the pipeline will not fully exploit the GPU and may well create performance bottlenecks. For instance, shading is a branchless process and hence should perform very well in parallel. In a single-kernel raytracer, however, some threads may begin shading before others; this causes thread divergence and serialization events. Thus, separating the shading portion (as well as other portions) of the pipeline into a different kernel should result in better performance, at least in theory [20], [22]. That being said, launching multiple kernels carries an overhead. Fig. 11 shows that this theory has at least some merit. However, the performance improvement is small and only attained if a lot of shading occurs to begin with, i.e., when the rendering of lower isovalues occupies a large portion of the screen.

CUDA devices contain multiple processing units. Each processing unit is capable of operating many threads in parallel, although only a small number (a warp) actually work at any given moment. Current CUDA devices operate warps of 32 threads. With branchless code, all threads in a warp execute the same instruction and fully utilize the SIMD (Single Instruction, Multiple Data) functionality. If conditional branching occurs, the threads entering the branch are evaluated first, while any thread not following the branch is masked inactive and forced to wait; once the branch has been evaluated, the threads in the warp are automatically re-synchronized. Apart from the slowdown incurred by divergent branching and serialization, the SIMD functionality might not be used to the fullest. Note also that SIMD efficiency depends on limited code-branching and not necessarily on ray-coherence.
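The cost of such divergence can be illustrated with a small cost model. The sketch below (our own simplified model, not a hardware simulator) counts the instructions a 32-thread warp issues against the lane-slots that perform useful work:

```cpp
// Simplified cost model of warp divergence: when lanes of a 32-wide warp
// take different sides of an if/else, the hardware serializes both paths
// and masks off the non-participating lanes.
struct WarpCost {
    int issued; // instructions issued by the warp (both paths serialized)
    int useful; // lane-slots that performed useful work
};

WarpCost divergentBranchCost(int lanesTakingIf, int ifLen, int elseLen) {
    const int WARP = 32;
    WarpCost c{0, 0};
    if (lanesTakingIf > 0)    c.issued += ifLen;   // warp steps through 'if'
    if (lanesTakingIf < WARP) c.issued += elseLen; // ...then through 'else'
    c.useful = lanesTakingIf * ifLen + (WARP - lanesTakingIf) * elseLen;
    // SIMD efficiency = useful / (issued * WARP); it reaches 1.0 only when
    // every lane of the warp follows the same path.
    return c;
}
```

With a 16/16 split over two equal-length paths, twice as many instructions are issued for the same useful work, i.e., SIMD efficiency halves.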

Currently, a maximum of 1024 threads can be active on each multiprocessor. While only a single warp (group of 32 threads) ever works at any given moment on a multiprocessor, CUDA is able to switch between warps that are waiting for instructions to complete and thus effectively ensure maximum throughput. For example, if one warp requests global memory, it must wait for that request to complete and, during that time, other warps can operate. Thus, maximum occupancy ensures that costly instructions (such as memory accesses) are better hidden and do not pose a bottleneck; this observation can be made by comparing Fig. 7 and Table 1, where additional computation better masks the memory latency. In Fig. 10, we show that further performance improvements can be gained with load-balancing (multiple rays per thread).

Maximum occupancy should be the first concern for researchers. Device occupancy is determined by two factors: the number of registers and the amount of shared memory used by the kernel. Unfortunately, including the entire traversal pipeline in a single kernel makes full occupancy improbable (on current architectures). To achieve full occupancy without extensive and time-consuming optimization, the traversal mechanism must be separated into multiple kernels.

### 6.1 Multiple Kernels versus Single Kernel

Separating the pipeline into multiple kernels presents a new challenge: how should the work for them be organized? For example, let us assume that we have an Intersect-Kernel, which detects ray-geometry intersections, and a Shader-Kernel. Not all rays will have intersections and, therefore, not all will require shading. Filtering out the rays that do not require shading, so that only valid rays are passed to the Shader-Kernel, requires an intermediate step to organize the memory.

A simple solution is to have the Intersect-Kernel store a hit-flag specifying whether the ray has hit geometry. A separate kernel can then examine the rays to find those with valid hits and create a work-list for the Shader-Kernel. This step can also be performed in parallel (as a minimizing problem [7]), where each thread is responsible for checking the state of a set number of rays. Regardless of how it is implemented, however, this additional organization step requires extra computation and memory access, and is therefore only useful if the benefit outweighs the cost.
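As a sketch of such an organization step, a compact work-list can be built from the hit-flags via an exclusive prefix sum, the standard pattern behind parallel stream compaction. It is shown here sequentially for clarity; the function name and data layout are our own:

```cpp
#include <vector>

// Build a compact work-list of ray indices whose hit-flag is set, using an
// exclusive prefix sum over the flags. Each hit ray's scan value is exactly
// its output slot, which is what lets a parallel kernel scatter without
// conflicts; here the scan and scatter run sequentially as a sketch.
std::vector<int> buildShadeWorklist(const std::vector<int>& hitFlag) {
    int n = static_cast<int>(hitFlag.size());
    std::vector<int> scan(n, 0);
    for (int i = 1; i < n; ++i)             // exclusive prefix sum of flags
        scan[i] = scan[i - 1] + hitFlag[i - 1];
    int total = n ? scan[n - 1] + hitFlag[n - 1] : 0;
    std::vector<int> worklist(total);
    for (int i = 0; i < n; ++i)             // scatter: each hit knows its slot
        if (hitFlag[i]) worklist[scan[i]] = i;
    return worklist;
}
```

The Shader-Kernel would then be launched with one thread per work-list entry, so no thread wastes its slot on a ray that missed.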

Fig. 11 shows the effect of simply separating the kernels *without* organizing the input workload for the shading kernel (i.e., simply keeping one thread per pixel). What remains to be seen is whether the workload can be reorganized between kernel calls without the reorganization itself becoming a bottleneck.

### 6.2 Alternative To Accumulation Matrix

The accumulation matrix approach, while simple, still requires constant memory for storage, and different architectures may not provide fast caching features. An alternative to the accumulation matrix is to use additional registers, one per dimension, in which each bit represents a level of the kd-tree. We set the relevant bit to TRUE to specify that the corresponding dimension was split at a particular level. The registers can be populated with the correct split information during the kd-tree build stage, or during traversal itself.

Determining *N* of Eq. (5) using the accumulation registers involves counting the number of TRUE bits between the current depth and the return depth. This requires masking the accumulation registers to cover only the levels in question, followed by a bit-count. In CUDA, the counting can be achieved using the *__popc* function (see [1] for an alternative).
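The masking and counting can be sketched as follows, assuming bit *d* of a per-dimension register is set when that dimension was split at level *d* (this layout and the half-open depth-range convention are our illustrative assumptions); `__builtin_popcount` stands in for CUDA's `__popc`:

```cpp
// accumReg: accumulation register for a single dimension; bit d is TRUE if
// the kd-tree split on this dimension at level d (illustrative layout).
// Counts the TRUE bits in levels [returnDepth, currentDepth), i.e., the N
// of Eq. (5) contributed by this dimension.
unsigned countSplits(unsigned accumReg, int returnDepth, int currentDepth) {
    unsigned hi = (1u << currentDepth) - 1u; // keep bits below currentDepth
    unsigned lo = (1u << returnDepth) - 1u;  // drop bits below returnDepth
    // In a CUDA kernel, __builtin_popcount would be the __popc intrinsic.
    return static_cast<unsigned>(__builtin_popcount(accumReg & (hi & ~lo)));
}
```

For example, with splits recorded at levels 0, 2, 3 and 5 (register value `0b101101`), counting between return depth 1 and current depth 5 yields the two splits at levels 2 and 3.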

### 6.3 Limitations and Scope of Kd-Jump

While Kd-Jump exploits the indexing method of implicit kd-trees, general kd-trees use pointers. As such, Kd-Jump in its current form cannot readily be used for a general kd-tree. To apply Kd-Jump to a general tree, one would have to transpose the general kd-tree onto a "virtual" balanced kd-tree and build a suitable memory map to reference the node data. In practice, however, the additional computation for this map could lower performance.

When Kd-Jump is employed for isosurface ray-tracing or direct volume rendering, the traversal-orders are pre-defined and are quickly recomputed upon return. For MIP (Maximum Intensity Projection) rendering, however, redetermining the traversal-order would require additional memory look-ups and testing, which could lower the performance.

The Kd-Jump approach would be applicable to other binary trees, provided that nodes can be referenced with indices and the index-updates are invertible. In theory, the approach could be used with a BVH [16] if indices are employed and a sufficient method for mapping the indices to memory (without excess) is available. Since a BVH is not a spatial-splitting structure, tree-balancing is applicable.

### Conclusion

We have presented Kd-Jump, a stackless traversal of implicit kd-trees for faster isosurface raytracing. We have shown that Kd-Jump can outperform both stackless and stack-based approaches, while needing only a fraction of the memory of a stack-based approach. Further, Kd-Jump exploits the index-based referencing used for implicit kd-trees to achieve traversal paths equivalent to those of a stack-based method, without incurring the extra node visitations of kd-restart.

To further strengthen kd-tree traversal, we have introduced Hybrid Kd-Jump, which utilizes a volume stepper for leaf testing and a runtime depth threshold to define where kd-tree traversal stops and volume stepping begins. By using both methods, we gain the benefits of empty-space removal and hardware-based texture interpolation. We have shown that Hybrid Kd-Jump performs well at removing empty space and can outperform a brute-force raytracer.

Memory usage for an implicit kd-tree may become too large if min/max pairs are stored in each node. We have shown that, if the conditions for the current isosurface are moved out of traversal and into the tree nodes themselves, significantly less memory is required. In addition, even with a naive implementation, updating the implicit kd-tree for a large volume was shown to be quite fast.

We have shown the usefulness of loading new rays once a warp of threads completes, and report that this approach yields promising results for faster raytracing. We have also discussed and examined the separation of the ray-tracing pipeline into multiple kernels, and showed that this methodology has some promise for better efficiency.

Further information, including source code, is available at [11].

### Acknowledgments

We thank the anonymous reviewers for their helpful comments and constructive criticism, and Catrin Plumpton for helpful discussions. This work is supported in part by the Leverhulme Trust (grant no. F/00 174/L) and the HEFCW (Higher Education Funding Council for Wales).