
SECTION I

## INTRODUCTION

Due to improving sensor technologies, as well as increasing size and fidelity of numerical simulations, scientific datasets are growing dramatically in size. Often, the only viable way of interpreting such datasets is through visualization. Volumetric data based on unstructured grids and scattered point data are becoming increasingly important. Here, the locations of the available data points (hereafter: samples) from an underlying (spatial) distribution are irregularly distributed throughout the volume. This differs from the traditional voxel format, which requires samples on a rectilinear, uniform grid, as exemplified by a stack of MRI images.

Scattered point data arises in many areas, including sensor networks, which are used to measure physical and environmental conditions at various locations. Examples include sensors that measure precipitation at various geographical locations, or the temperature at various locations in a furnace. There is usually no predefined or implied connectivity between the sensor sites. The output of several numerical simulations, such as smoothed particle hydrodynamics (SPH) and $n$-body simulations, can also be regarded as scattered point data. Finally, even if the data is originally grid based, the structure can be lost if it is post-processed, or if the data is streamed over a network.

The result of X-ray diffraction experiments, which are used to study the structure, chemical composition and physical properties of materials, is another example of scattered point data, which motivated this work. In a generic diffraction experiment, measurements of how a material scatters, or changes the direction of, X-rays are made. The outcome of these experiments is samples of the diffracted X-ray intensity—a three-dimensional scalar field—at locations determined by details of the experimental setup. To extract information about the material being studied, it is often helpful to create high quality 3D visualizations of this data. This is just one example of the increasing importance of volume visualization in science.

Well established methods exist for visualizing volumetric data on structured as well as unstructured grids, but scattered point data remains a challenge. One approach is to resample the data on a uniform grid, or tetrahedralize it to create an unstructured grid, and then visualize the resulting grid. This can, however, be problematic. If resampling is done, a grid with sufficiently high resolution will require vast amounts of memory. On the other hand, using a low resolution grid is equivalent to low-pass filtering, and will cause details to be lost, resulting in poor image quality. With tetrahedralization, one can typically only perform interpolation at the cell faces, which may cause poor results with cells of widely varying sizes.

In this paper, we present a method for visualizing scattered point data, primarily developed for X-ray diffraction data, but applicable also in other contexts. Our method is based on the well established technique of volume ray casting [1], [2]. However, our method differs by not resampling the data on a uniform grid, but instead operating directly on the scattered point data. Due to the highly parallel and computationally intensive nature of the method, we have implemented it on single- and multi-GPU systems for increased performance.

The remainder of this article is structured as follows: Section II presents previous work on volume visualization of scattered data. Section III provides background information about GPU computing, scattered point data, and X-ray diffraction. Section IV describes our visualization method, as well as modifications and optimizations made for the GPU version. Results are presented in Section V, while Section VI concludes and describes possibilities for further research.

SECTION II

## RELATED WORK

An overview of direct volume visualization can be found in the book by Hansen and Johnson [1]. The volume ray casting algorithm was originally proposed by Levoy [2] for regular grids, and extended by Garrity [3] to irregular grids. In addition to these image order techniques, the object order technique of splatting was introduced by Westover [4]. With the advent of programmable GPUs, this platform was adopted for volume visualization. Early efforts emulated ray casting with texture mapping [5], [6]. As the flexibility of GPUs increased they were also used for full ray casting [7] [8] [9], as well as the related technique of ray tracing [10].

Most volume visualization techniques require the underlying function to be reconstructed based on the samples. This is fairly straightforward in the case of regular grids, where trilinear interpolation is used. In addition, several techniques for modeling and interpolating scattered point data also exist, for instance the works by Nielson [11] and Amidror [12].

Much work has also been done on visualizing scattered point data. Some approaches simply resample the data on a regular grid, and then render this grid using standard techniques [13] [14] [15] [16]. Other approaches operate directly on the scattered data. This includes techniques that are based on splatting, such as the one adopted by Hopf and Ertl [17].

Techniques based on ray casting have also been employed to directly render scattered point data; Chen [18] used an approach based on radial basis functions, with one radial basis function per point. At equidistant positions along the rays, neighbouring points were found using an octree, and their radial basis functions were evaluated to find the value of the underlying function. Jang et al. [19] also used radial basis functions, but with fewer radial basis functions than points. Their approach was based on texture mapping, but rather than using a 3D texture, the radial basis functions were evaluated when the volume slices were rasterized. Ledergerber et al. [20] proposed a unified framework for both structured and unstructured datasets, based on moving least squares. At equidistant positions along the rays, close samples were found and used to compute a weighted least squares approximation. The method also supports anisotropic weights. A GPU implementation was also provided.

While our approach resembles the last three mentioned, it differs in several ways: we use inverse distance weighting for interpolation; we have developed a novel empty-space skipping technique to improve performance; and we have adapted a filtering approach which can dramatically reduce the size of the input dataset without compromising image quality. We have implemented our method for both single- and multi-GPU systems, and have developed a load balancing scheme for the multi-GPU case. Finally, we provide detailed results for experimental X-ray diffraction data, covering both image quality and performance.

SECTION III

## BACKGROUND

In this section, we will provide the background information about GPU computing, scattered point data and X-ray diffraction necessary for the full appreciation of our work.

### A. GPU Computing

GPUs were originally developed as dedicated coprocessors for 3D graphics, but their increasing programmability, combined with high performance, low cost and low energy consumption, have made them highly popular for general purpose high performance computation [21], [22]. For instance, GPUs have been used to speed up SPH simulations [23], as well as applications in biology [24] and medicine [25] [26] [27]. In the following, we will give a brief introduction to the architecture of typical high-end GPUs.

A GPU consists of a number of multiprocessors, each of which consists of several streaming processors. Each of the streaming processors of a multiprocessor works in a SIMD fashion, executing the same instruction in lock-step. Hence, GPUs are well suited for data parallel problems, where each thread executes the same program, but with different input data. Furthermore, the high number of streaming processors, which range from several hundred to several thousand for current GPUs, requires a high degree of inherent parallelism in the problem for GPUs to achieve good efficiency.

However, even high-end GPUs typically lack advanced features such as branch prediction and out-of-order execution. In addition, more of the die is devoted to computational units and less to caches, compared to CPUs, and they thus typically run at lower clock frequencies.

### B. Scattered Point Data

A scattered point dataset can be defined as a set of samples $S=\{s_{1},s_{2}\ldots s_{N}\}=\{(p_{1},f_{1}),(p_{2},f_{2})\ldots (p_{N},f_{N})\}$ where $p_{i}\in{\mathbb R}^{m}$ are coordinate vectors in $m$-dimensional space with associated scalar values $f_{i}\in{\mathbb R}$. In the case of $m=3$, we deal with volumetric scattered point data.

In general, the sample locations are arbitrarily distributed, and there is no connectivity between samples. However, two special cases frequently arise in practice: the sample locations may be on a uniform, rectilinear grid, commonly referred to as voxel data, or the samples may be located on an $(m-1)$-dimensional manifold (e.g., on the surface of some object in 3D space), commonly referred to as point clouds.

The data can often be regarded as being samples of a continuous function $f$. Reconstructing this function from the samples is often necessary for visualization. Many methods exist for this problem, see e.g., [11]. Trilinear interpolation is frequently used with grid data, due to its conceptual and computational simplicity. In the case of scattered point datasets where the sample locations do not exhibit any particular structure, inverse distance weighting (IDW) [28] is an applicable method. In its simplest form, the estimated function value $f_{\mathrm{e}}(x)$ at position $x$ is

\begin{equation*}f_{{\mathrm{e}}}(x)=B^{-1}\sum_{i=1}^{N}d(p_{i},x)^{-u}f_{i},\tag{1}\end{equation*}

where $u>0$ is an adjustable parameter, $d(x,y)$ is the Euclidean distance between $x$ and $y$, and $B=\sum_{i=1}^{N}d(p_{i}, x)^{-u}$. Because the weights decay with distance, the samples located closest to $x$ have by far the strongest influence on the estimated function value $f_{\mathrm{e}}(x)$, so the sums in (1) can be truncated to include only those samples located close to $x$.
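As an illustration of (1), the following is a minimal sketch of truncated IDW in plain Python. The function name `idw_estimate`, the `radius` argument used for truncation, and the handling of a query point that coincides with a sample are assumptions made for this example, not details taken from the implementation described in this article.

```python
import math

def idw_estimate(samples, x, u=2.0, radius=None):
    """Estimate f_e(x) from (p_i, f_i) pairs following Eq. (1).

    samples: iterable of (position tuple, scalar value).
    radius:  optional truncation distance (hypothetical parameter);
             samples farther away than this are left out of both sums.
    Returns None when no sample contributes.
    """
    weighted_sum = 0.0   # sum of d(p_i, x)^-u * f_i
    weight_total = 0.0   # B in Eq. (1)
    for p, f in samples:
        d = math.dist(p, x)
        if radius is not None and d > radius:
            continue                 # truncated out of the sums
        if d == 0.0:
            return f                 # x coincides with a sample (assumed convention)
        w = d ** (-u)
        weighted_sum += w * f
        weight_total += w
    return weighted_sum / weight_total if weight_total > 0.0 else None
```

With two samples at equal distance from `x`, the estimate is simply their mean, regardless of `u`.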

If it is known that the underlying function varies more rapidly in one direction than others, it will generally make sense to weight samples along the direction of slow change more heavily compared to samples along the direction of rapid change [29]. When IDW is used, this can be achieved by using an anisotropic distance measure instead of the Euclidean distance [30]. The anisotropic distance $d_{{\mathrm{a}}}(x, y)$ between two points $x$ and $y$ can be defined as

\begin{equation*}d_{{\mathrm{a}}}(x,y)=\sqrt{(x-y)^{T}A\, (x-y)},\tag{2}\end{equation*}

where the symmetric $m\times m$ matrix $A$ describes the equidistance ellipsoid.
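The anisotropic distance (2) can be evaluated directly from its definition. The sketch below assumes that $A$ is given as a nested list and is positive semi-definite, so that the quadratic form is never negative; the function name is hypothetical.

```python
import math

def anisotropic_distance(x, y, A):
    """Evaluate d_a(x, y) = sqrt((x - y)^T A (x - y)) as in Eq. (2).

    A is a symmetric m-by-m matrix given as a list of rows; it is assumed
    to be positive semi-definite so the square root is well defined.
    """
    diff = [xi - yi for xi, yi in zip(x, y)]
    m = len(diff)
    # Matrix-vector product A * (x - y).
    a_diff = [sum(A[r][c] * diff[c] for c in range(m)) for r in range(m)]
    # Quadratic form (x - y)^T * (A * (x - y)).
    quad = sum(diff[r] * a_diff[r] for r in range(m))
    return math.sqrt(quad)
```

With $A$ equal to the identity matrix this reduces to the Euclidean distance. A diagonal entry smaller than 1 shortens distances along that axis, so samples along it receive more weight in IDW.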

### C. X-Ray Diffraction

To explain the nonuniform distribution of the X-ray data, used as our motivational test case, we will provide a brief introduction to X-ray diffraction [31].

The wavelength $\lambda$ of X-rays (0.01–10 nm) has the same order of magnitude as the distance between atoms and molecules in condensed matter. X-rays passing through a material will be scattered, that is, spread in new and different directions, and also interfere with each other. We emphasize that the dataset being scattered point data is not caused by the X-rays being scattered, but by the way the scattered X-rays are measured. Measurements of the resulting intensity distribution as a function of direction can be used to extract detailed information about the structural arrangements inside the material specimen [31].

The setup of a generic X-ray scattering experiment is shown in Fig. 1. A 2D sensor array is used to measure the intensity distribution of scattered X-rays from the beam incident on the material specimen. For each pixel of the detector, the corresponding scattering vector ${\mathbf Q}\equiv 2\pi({\hat{\mathbf k}}_{{\mathrm{f}}}-{\hat{\mathbf k}}_{{\mathrm{i}}})/\lambda$ can be computed to obtain one sample $({\mathbf Q},I)$. The relative orientation of the detector, material specimen and incoming beam can be varied to cover the region of ${\mathbf Q}$-space of interest.

Fig. 1. Generic X-ray scattering experimental setup. X-rays from the source (direction ${\hat{\mathbf k}}_{{\mathrm{i}}}$) hit the sample, and are scattered into direction ${\hat{\mathbf k}}_{{\mathrm{f}}}$. The scattered X-rays are measured with a pixelated area detector. The measured intensity of each pixel $I$, combined with the corresponding scattering vector ${\mathbf Q}$ results in a sample $({\mathbf Q}, I)$.
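For concreteness, the definition ${\mathbf Q}\equiv 2\pi({\hat{\mathbf k}}_{{\mathrm{f}}}-{\hat{\mathbf k}}_{{\mathrm{i}}})/\lambda$ can be sketched as below. The function takes the incident and scattered directions as inputs and normalizes them. How the scattered direction is obtained from a detector pixel depends on the experimental geometry and is not covered here. The function name is hypothetical.

```python
import math

def scattering_vector(k_i, k_f, wavelength):
    """Return Q = 2*pi*(k_f_hat - k_i_hat)/wavelength as a 3-tuple.

    k_i, k_f: incident and scattered beam directions (any nonzero length;
              they are normalized to unit vectors here).
    The unit of Q is the inverse of the unit used for the wavelength.
    """
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return tuple(c / norm for c in v)

    ki_hat, kf_hat = unit(k_i), unit(k_f)
    scale = 2.0 * math.pi / wavelength
    return tuple(scale * (f - i) for f, i in zip(kf_hat, ki_hat))
```

An unscattered beam gives ${\mathbf Q}=0$, while exact backscattering gives the largest possible magnitude, $4\pi/\lambda$.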

If the components of the scattering vector are interpreted as 3D space coordinates in ${\mathbf Q}$-space, all samples from a single detector frame will be placed on the same curved 2D surface. Measuring multiple frames with different sample-detector configurations will result in multiple surfaces, each with different curvature, orientation and position. Such a set of frames can be used to effectively map the diffracted intensity in a volume of ${\mathbf Q}$-space, resulting in a reciprocal space map [32]. The distance between pairs of surfaces in ${\mathbf Q}$-space is typically different from the distance between samples located on the same surface, and surfaces may intersect. Although the structure of the sensor array and the curvature of the ${\mathbf Q}$-space surface corresponding to a particular frame may be used to acquire some connectivity between the samples, shadowing effects from the experimental setup combined with masked or insensitive detector areas complicate strategies utilizing connectivity information, suggesting that regarding it as an unstructured scattered point data set is a viable option.

SECTION IV

## RAY CASTING FOR SCATTERED POINT DATA

Fig. 2 illustrates how our ray casting method is used to generate a single image. A ray is created for each pixel of the output image. The direction of the ray is defined by the camera position and the center of its pixel. For each ray, starting at the camera, we move along the ray and estimate the value of the underlying continuous distribution $f$ at positions separated by a user-defined distance $\delta$. How the estimation is performed is covered later in this section. Next, the estimated value is mapped to a color and opacity value using a user-defined transfer function. Finally, the color and opacity values of all the positions along a ray are used to evaluate the volume rendering integral, as described in [33], in order to find the final color of the pixel. Starting at the camera, for each point along the ray, we update the color $C_{i}$ and opacity $A_{i}$ using the formulae

\begin{align*}C_{i}=&\, C_{i-1}+(1-A_{i-1}) c_{i}\tag{3}\\ A_{i}=&\, A_{i-1}+(1-A_{i-1}) a_{i},\tag{4}\end{align*}

where $c_{i}=c_{i}(f_{{\mathrm{e}}}(x_{i}))$ and $a_{i}=a_{i}(f_{{\mathrm{e}}}(x_{i}))$ are the color and opacity contribution of the $i$'th point, which are functions of the estimated value $f_{{\mathrm{e}}}(x_{i})$ of the underlying function at the position of the $i$'th point $x_{i}$, and $C_{i}$ and $A_{i}$ are the color and opacity after processing the $i$'th point. The initial values of the color and opacity are $C_{0}=A_{0}=0$. The final color of the pixel in the rendered image is $C_{N}\cdot A_{N}$, where $N$ is the number of positions along the ray at which the underlying function is estimated.
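The recurrences (3) and (4) amount to a short front-to-back accumulation loop. The sketch below assumes that the transfer function has already been applied, so each step along the ray supplies a pair $(c_{i}, a_{i})$ with $c_{i}$ an RGB tuple. The function name and data layout are illustrative only.

```python
def composite_ray(contributions):
    """Accumulate color and opacity front to back using Eqs. (3) and (4).

    contributions: iterable of (c_i, a_i) ordered from the camera outwards,
                   where c_i is an (r, g, b) tuple and a_i a scalar opacity.
    Returns (C_N, A_N), starting from C_0 = (0, 0, 0) and A_0 = 0.
    """
    color = (0.0, 0.0, 0.0)
    alpha = 0.0
    for c, a in contributions:
        remaining = 1.0 - alpha      # transparency left in front of this point
        color = tuple(C + remaining * ci for C, ci in zip(color, c))  # Eq. (3)
        alpha = alpha + remaining * a                                 # Eq. (4)
    return color, alpha
```

Because each contribution is scaled by $1-A_{i-1}$, points behind a nearly opaque region add almost nothing to the result.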

Fig. 2. Overview of implementation. The input is a set of samples, here shown as dots. Rays are cast from the eye/camera, through each pixel, and into the volume. The value of the underlying function is estimated at points along the ray, by interpolating among neighbouring samples. Here, we show one such ray, and one point, where three samples are contributing to the local function value.

#### 1. Filtering

Before the ray casting starts, the data is filtered using a simplified version of the filtering proposed by Ljung et al. [34]. For the X-ray data, low intensity values are often noise (or indistinguishable from noise), making them less interesting from a physics perspective. Therefore, a transfer function that suppresses the samples with low values will typically be used. We can achieve the same effect by simply discarding all samples whose value lies below a user defined threshold when the data is loaded, while at the same time treating regions void of samples as transparent. This will not affect image quality, but can dramatically reduce the size of the input data set, and thereby increase performance. This filtering does not require any changes to our rendering algorithm, since it makes no assumptions about structure or connectivity of the input data points.

#### 2. Interpolation

To estimate the value of the underlying function $f(x)$ at point $x$, we use IDW interpolation. Rather than using all the samples of the entire dataset for each point, we only use those samples closer to the interpolation point than a user specified search radius $r_{{\mathrm{s}}}$. How these samples are found is described later in this section. The user-specified matrix $A$ specifies the equidistance ellipsoid for anisotropic distance calculation. Regions with no samples are treated as transparent, i.e., as if the intensity is zero.

#### 3. Neighbor Search

An acceleration data structure is used to greatly speed up the search for neighboring samples during interpolation. Many data structures have been proposed for this problem [35], [36]. We have chosen to use an octree [37] due to its conceptual simplicity, ease of implementation, intuitive and predictable structure, and good performance. In an octree, each node in the tree corresponds to a cube, and the children of a node are the eight octants of the cube. The root node of our octree is the bounding box of all the samples. Leaf nodes contain those samples that lie in their corresponding cube, and may be empty if no such samples exist. In practice, we store the samples in a separate array, and the leaf nodes contain pointers to the samples.

During search, we want to find all samples within a sphere of radius $r_{{\mathrm{s}}}$ centered at the search point. At each node, we find the child nodes intersecting the bounding box of the sphere, and search those nodes recursively. Thus the search returns all the leaf nodes of the tree where the intersection between the node and the bounding box of the search sphere is nonempty. Finally, all the samples of each of these leaf nodes must be checked to see if they fall within the search sphere.

The last step can be made faster by reducing the size of leaf nodes. There is, however, a trade-off, as reducing the size of the leaf nodes will increase the depth of the tree, making it more costly to find the leaf nodes in the first place, and also increasing memory overhead. In our implementation, we set the maximum tree height to the empirically chosen

\begin{equation*}h_{\max}=\lfloor\log_{2}(R/r_{\mathrm{s}})\rfloor+2,\tag{5}\end{equation*}

where $R$ is the smallest dimension of the bounding box of all the samples. For a cubic bounding box, the volume occupied by a leaf node will then be from $(r_{{\mathrm{s}}}/4)^{3}$ to $(r_{{\mathrm{s}}}/2)^{3}$. To avoid unnecessarily deep trees in sparse regions, we allow leaf nodes to contain up to 8 samples.
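A small numerical check of (5) may be helpful. The sketch below assumes a cubic bounding box of edge $R$ whose root sits at depth 0, so that a leaf at depth $h_{\max}$ has edge $R/2^{h_{\max}}$. This convention is an assumption chosen because it reproduces the leaf size range quoted above. Both function names are hypothetical.

```python
import math

def max_tree_height(R, r_s):
    """Maximum tree height from Eq. (5): floor(log2(R / r_s)) + 2."""
    return math.floor(math.log2(R / r_s)) + 2

def deepest_leaf_edge(R, r_s):
    """Edge length of a leaf at depth h_max for a cubic box of edge R.

    Assumes the root is at depth 0 and every level halves the edge.
    """
    return R / 2 ** max_tree_height(R, r_s)
```

For example, $R=1$ and $r_{\mathrm{s}}=0.01$ give $h_{\max}=8$ and a leaf edge of $1/256\approx 0.0039$, which lies between $r_{\mathrm{s}}/4=0.0025$ and $r_{\mathrm{s}}/2=0.005$.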

### A. Optimizations

To improve performance, we have adapted two common optimization techniques for volume ray casting [2].

First, we use early ray termination. Rather than estimating the value at all $N$ points along the ray, we stop when $A_{i}$ becomes sufficiently close to 1, which indicates that the volume between the camera and the current position is almost completely opaque. Hence, the value at further points will not contribute significantly to the color of the pixel.

Second, the filtering described above leaves large regions of the volume empty. Ideally, no search should be performed in such regions, since they are treated as transparent. This could be done by taking advantage of the acceleration data structure. If an empty leaf node was encountered, one could simply jump to the end of it. However, due to the way we store samples in the tree, this might lead to incorrect results, as illustrated in Fig. 3. While this problem can be resolved by organizing the tree differently, we have instead developed a novel empty-space skipping algorithm. If the result of a search is empty, we increase the step size, the distance between positions at which the value of the volume is sampled, by a factor. After each subsequent empty search the step size is increased, until a threshold is reached. When a non-empty search is encountered, we move back to the previous point, reset the step size to its default value, and proceed.

Fig. 3. Illustration of why skipping an empty node might cause incorrect results. When A is reached and the empty node detected, we could skip it by jumping to C. This would lead to incorrect results, as B would be treated as transparent.

The optimal values for the threshold and factor depend, in the same way as the step size itself, upon the dataset. Care must be taken to avoid setting the threshold too high, as the resulting coarseness of the sampling might miss small structures in the dataset. Currently, these parameter values must be determined through trial and error.

### B. Parallelized CPU Implementation

Since all the rays can be processed independently, ray casting is regarded as an embarrassingly parallel problem. We have implemented a parallel version of our method. Rather than distributing the rays evenly between the threads, which might cause problems with load balancing as some rays are more work-intensive than others, we have adopted a work queue and thread pool approach, which is a technique we have used successfully in other contexts [38].
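The work queue and thread pool idea can be sketched as follows. The granularity of a task (one image row here), the function names, and the use of Python threads are assumptions for illustration. The sketch shows only the scheduling pattern, in which idle threads pull the next task as soon as they finish their current one, and it makes no claim about performance.

```python
import queue
import threading

def render_with_work_queue(num_rows, render_row, num_threads=4):
    """Render rows via a shared work queue consumed by a pool of threads.

    render_row(row): hypothetical callback that casts all rays of one row
                     and returns the resulting pixel data.
    Expensive rows simply keep their thread busy longer, while the other
    threads continue to drain the queue.
    """
    tasks = queue.Queue()
    for row in range(num_rows):
        tasks.put(row)
    image = [None] * num_rows

    def worker():
        while True:
            try:
                row = tasks.get_nowait()
            except queue.Empty:
                return                      # no work left for this thread
            image[row] = render_row(row)    # each row is written by one thread only

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return image
```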

### C. GPU Implementation

Ray casting is a compute-intensive and highly parallel task, and is therefore ideally suited for GPUs. Furthermore, the task of moving the finalized image to the GPU for display is made superfluous when the entire computation is done on the GPU itself.

We have created an implementation of our visualization method where the ray casting is performed on the GPU, using Nvidia's CUDA platform [39]. We have implemented a kernel that processes a single ray. Multiple instances of this kernel are then run on the GPU in parallel, with one thread for each ray. The kernel implements the algorithm described above.

Since CUDA allows using a subset of the C programming language, we only needed to make a few modifications to our code in order to port it from the CPU to the GPU. Each of these modifications will be described in the following paragraphs.

#### 1. Removing Recursion

Older GPUs do not support recursion [39]. Since supporting a broad range of hardware was a priority for us, we replaced the recursive octree range search algorithm by iteration in combination with a manually managed stack. Finding all the samples within the search sphere is done by initially pushing the root node of the tree onto the stack. At each step of the iteration, one node is popped off the stack. If it is a leaf node, its samples are added to those returned. Otherwise, those child nodes of the popped node intersecting with the bounding box of the search sphere are pushed onto the stack. The loop runs until the stack is empty. The depth-first nature of the algorithm ensures that the stack size is bounded by $O(h_{\max})$ where $h_{\max}$ is the maximum depth of the tree.
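The iterative search can be sketched on the CPU as below. This is a plain Python illustration, not the CUDA kernel. The `Node` class, the `build` helper (which is recursive, since only the search needs to avoid recursion), and all parameter names are hypothetical. Following the text, leaves that intersect the bounding box of the search sphere are collected first, and each of their samples is then tested against the sphere itself.

```python
import math

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi   # opposite corners of the node's box
        self.children = []          # eight children, or empty for a leaf
        self.indices = []           # sample indices (leaf nodes only)

def build(points, indices, lo, hi, depth, max_depth, max_leaf=8):
    """Build an octree; a node becomes a leaf at max_depth or with few samples."""
    node = Node(lo, hi)
    if len(indices) <= max_leaf or depth == max_depth:
        node.indices = indices
        return node
    mid = tuple((l + h) / 2 for l, h in zip(lo, hi))
    buckets = [[] for _ in range(8)]
    for i in indices:
        # Bit a of the octant number is set when the point is in the upper half of axis a.
        octant = sum(1 << a for a in range(3) if points[i][a] >= mid[a])
        buckets[octant].append(i)
    for k in range(8):
        clo = tuple(mid[a] if (k >> a) & 1 else lo[a] for a in range(3))
        chi = tuple(hi[a] if (k >> a) & 1 else mid[a] for a in range(3))
        node.children.append(build(points, buckets[k], clo, chi,
                                   depth + 1, max_depth, max_leaf))
    return node

def range_search(root, points, x, r):
    """Return indices of all samples within distance r of x, without recursion."""
    box_lo = tuple(c - r for c in x)
    box_hi = tuple(c + r for c in x)
    found = []
    stack = [root]                  # manually managed stack
    while stack:
        node = stack.pop()
        if not node.children:       # leaf: test each sample against the sphere
            for i in node.indices:
                if math.dist(points[i], x) <= r:
                    found.append(i)
        else:                       # push children overlapping the search box
            for child in node.children:
                if all(child.lo[a] <= box_hi[a] and box_lo[a] <= child.hi[a]
                       for a in range(3)):
                    stack.append(child)
    return found
```

Since each sample is stored in exactly one leaf and each node is pushed at most once, no sample is reported twice.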

In an effort to reduce the memory footprint of the stack, we used the following compaction technique: When nodes are pushed on the stack, we always push all the children of a node that intersect the search box at the same time. Therefore, rather than pushing a pointer to each of these child nodes, we can push a pointer to the parent node, along with a description of which of its child nodes are pushed. This idea is illustrated in Fig. 4. The description can be encoded in one byte, where each bit indicates whether a child is pushed or not. By reducing the size of the node pointer to 24 bits, we can combine the parent node pointer and child descriptor in a single 4 byte integer.

Fig. 4. Stack optimization. On the top, a portion of an octree is shown. The shaded child nodes are to be pushed onto the stack. In the bottom left is the original stack, and in the bottom right is the optimized stack, after the shaded nodes have been pushed. The stack grows downwards.
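The compact stack entry can be illustrated with a few bit operations. The bit layout used below (parent index in the upper 24 bits, child mask in the lowest byte) and the function names are assumptions. The text only states that a 24-bit node pointer and a one-byte child descriptor share a 4 byte integer.

```python
def pack_entry(parent_index, child_mask):
    """Combine a 24-bit parent node index and an 8-bit child mask into 32 bits."""
    if not 0 <= parent_index < (1 << 24):
        raise ValueError("parent index does not fit in 24 bits")
    if not 0 <= child_mask < (1 << 8):
        raise ValueError("child mask does not fit in 8 bits")
    return (parent_index << 8) | child_mask

def unpack_entry(entry):
    """Split a packed entry back into (parent_index, child_mask)."""
    return entry >> 8, entry & 0xFF

def pushed_children(child_mask):
    """List the child numbers (0 to 7) whose bit is set in the mask."""
    return [k for k in range(8) if (child_mask >> k) & 1]
```

A 24-bit index limits the tree to about 16.8 million nodes, which is the price of fitting each entry into four bytes.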

#### 2. Memory Optimizations

To achieve optimal performance, memory operations must take the GPU's explicit memory hierarchy into account [39], [40]. Texture memory is a logical memory space which physically resides in the GPU's device memory (RAM). This memory is designed to store textures for graphics applications, and is therefore optimized for 2D spatial locality and streaming fetches. Furthermore, it is typically cached. We stored the input samples in this memory, which can increase performance for a number of reasons.

Firstly, all the samples of a leaf node will be accessed sequentially, to check if they actually lie within the search sphere, and if that is the case, be used in the interpolation. As mentioned, texture memory is optimized for such streaming access patterns. We therefore sort the samples, so that all samples belonging to the same leaf node are placed together, before transferring the data to the GPU. Secondly, special access patterns are required to get good performance for global memory. Since these access patterns cannot always be achieved in our case, texture memory might give better performance. Finally, as opposed to global memory, texture memory has its own cache, even on older GPUs without L1 and L2 caches. Since a thread might access the same samples at consecutive points along the ray, or neighbouring threads might access the same samples, this can also increase performance.

#### 3. Multi-GPU and Load Balancing

Multiple GPUs can be used together to increase performance [41]. The embarrassingly parallel nature of ray casting makes this easy, and allows us to further speed up the rendering. Each GPU can simply be assigned a fraction of the rays/pixels, and process them independently. Achieving good load balancing, so that all the GPUs finish at the same time, is challenging, however, for two reasons. Firstly, the amount of work per ray/pixel is not constant. Hence, distributing the number of rays/pixels evenly will not cause work to be distributed evenly. Secondly, different GPUs with different performance might be used together.

To address these issues, we have developed a load balancing scheme that uses two techniques. The first is based on the realization that our application typically will be used to generate a large number of frames, where each frame is quite similar to the previous, because the camera will often be moved smoothly around the object. The relative performance of each GPU for one frame can therefore be used to decide how to distribute work for the next frame. The second technique uses the length of a ray, that is, the length of the ray's intersection with the bounding box of all the samples, as a proxy for the amount of work required to process it. While clearly inaccurate, this is a better assumption than that of uniform ray/pixel work amount. The ray lengths can be computed fairly cheaply on the CPU prior to work distribution.
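One possible way to combine the two techniques is sketched below. The text does not give the exact rule, so two points here are assumptions. First, the share of each GPU is set proportional to its measured throughput (summed ray length divided by rendering time) in the previous frame. Second, rays are handed out as contiguous ranges. The function names are hypothetical.

```python
def update_shares(work_done, elapsed):
    """Work fractions for the next frame, proportional to measured throughput.

    work_done: summed ray length each GPU processed in the previous frame.
    elapsed:   time each GPU needed for that work.
    """
    rates = [w / t for w, t in zip(work_done, elapsed)]
    total = sum(rates)
    return [r / total for r in rates]

def split_rays(ray_lengths, shares):
    """Split rays into contiguous (start, end) ranges, one per GPU.

    Each range's summed ray length approximates its share of the total,
    using ray length as the proxy for the work per ray.
    """
    total = sum(ray_lengths)
    ranges, start, acc, goal = [], 0, 0.0, 0.0
    for share in shares[:-1]:
        goal += share * total           # cumulative work target for this GPU
        end = start
        while end < len(ray_lengths) and acc + ray_lengths[end] <= goal:
            acc += ray_lengths[end]
            end += 1
        ranges.append((start, end))
        start = end
    ranges.append((start, len(ray_lengths)))   # last GPU takes the remainder
    return ranges
```

If one GPU finished an equal amount of work in half the time of the other, `update_shares` assigns it two thirds of the work for the next frame.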

SECTION V

## RESULTS AND DISCUSSION

Two carefully chosen and qualitatively different experimental X-ray datasets were used to test our method.

Dataset A was obtained for a ${\mathrm{PbTiO}}_{3}$ thin film on a ${\mathrm{SrTiO}}_{3}$ substrate [42], measured at the Swiss-Norwegian Beamline (BM01A) of the European Synchrotron Radiation Facility using a six-circle $\kappa$ diffractometer, a 1024 by 1024 pixel CCD detector, and an X-ray wavelength of 0.097 nm. It consists of 5.6 million datapoints and was zoomed in on a single feature known as a crystal truncation rod [43]. A 3D rendering is provided in Fig. 5(a). The central, green appearing, crystal truncation rod displays clear intensity variations consistent with the crystalline structure and thickness of the film. The surrounding magenta torus of lower intensity arises from, and thus contains information about, ferroelectric domain structures in the ${\mathrm{PbTiO}}_{3}$ film [44].

Fig. 5. Visualizations of the two datasets generated with our method. (a, b) Renderings of dataset A showing in normalized dimensionless units the 10l crystal truncation rod of oscillating intensity and the weaker torus caused by the ferroelectric domain structure of the ${\mathrm{PbTiO}}_{3}$ film. The crossed out patches in (a) indicate regions without measurements. (c, d) Renderings of dataset B, in units of Å$^{-1}$, showing both sharp Bragg peaks and diffuse line features, oriented along high-symmetry directions in the material.

Dataset B is from a single crystal of diaquabis(salicylato) copper(II) [45], [46], consists of 243 million samples, and was measured over a much larger region of ${\mathbf Q}$-space than dataset A. The data was measured using a rotating anode X-ray source emitting Cu ${\mathrm{K}}\alpha$ radiation (wavelength 0.154 nm), a four-circle diffractometer and a Dectris Pilatus 1M pixelated detector [47]. A 3D rendering can be found in Fig. 5(c), clearly demonstrating the presence of both regularly spaced sharp Bragg peaks consistent with a single-crystalline structure, and the presence of modulated lines of diffuse intensity in certain directions coinciding with high-symmetry directions in the sample.

For both datasets, a filtering threshold was applied to remove samples from low-intensity regions where no measurable diffracted intensity was detected. This removed 36.45% and 99.98% of the samples for dataset A and B, respectively.

For the performance testing, we used a 4-core, hyperthreaded 3.5 GHz Intel i7 3770K CPU, and three different Nvidia Tesla GPUs. Hardware details for the GPUs are shown in Table I. GCC version 4.6.3 and the Nvidia CUDA Toolkit version 5.0 with all optimizations enabled were used for compilation. On the CPU, we used twice as many software threads as physical cores, to take advantage of hyperthreading. Our code can be compiled to use either single or double precision floating point numbers; unless otherwise noted, single precision was used. This is sufficient in our case due to the limited dynamic range of the X-ray data, but other applications might require double precision. Performance was measured by rendering four representative 1024×1024 images from the two datasets; these are shown in Fig. 5. On the GPU, the rendering time reported here includes the time required to transfer the final image back to the host (since neither the K20 nor the C2070 has video output), but not the time required to transfer the samples or tree data structure to the GPU, since this only has to be done once, and the cost typically will be amortized by rendering a high number of frames. For the same reason, the time required to load and filter the data and to build the tree is not included.

Table I. Hardware details of the Nvidia GPUs used

### A. Benefits of Anisotropic Interpolation

To demonstrate how anisotropic interpolation can improve image quality, we generated several renderings of dataset B with various settings for interpolation. The settings are given in Table II and the resulting figures shown in Fig. 6.

Fig. 6. Examples of the effects of varying search radius $r_{\mathrm{s}}$ and anisotropy matrix, with settings given in Table II. (a) $r_{\mathrm{s}}=0.005$; isotropic. Small $r_{\mathrm{s}}$ and isotropic interpolation, resulting in the diffuse lines appearing to be broken where there is a lack of samples. (b) $r_{\mathrm{s}}=0.01$; isotropic. Increased $r_{\mathrm{s}}$ compared to (a), making the diffuse lines appear continuous while slightly decreasing the resolution. (c) $r_{\mathrm{s}}=0.005$; anisotropic. $r_{\mathrm{s}}$ as in (a), but with anisotropic interpolation favoring interpolation along the horizontal direction in the figure, effectively hiding the effects of insufficient sampling while retaining good resolution.

Table II. Settings used in Fig. 6. $I_{3}$ is the 3×3 identity matrix.

To render Fig. 6(a), a small search radius $r_{\mathrm{s}}$ and isotropic interpolation were used. The resulting image has quite poor quality. In particular, the rightmost parts of the rods appear discontinuous, as the search radius is too small to span the gaps between adjacent samples. This problem can be mitigated by simply increasing the search radius, as was done to render Fig. 6(b). While the artifacts of Fig. 6(a) are removed, the increased radius increases blurring. In Fig. 6(c), a small radius was combined with anisotropic distance measurement. The anisotropy used compacts distances along the rods. The discontinuity artifacts are almost completely removed, without introducing the same amount of blurring, resulting in improved image quality.

In the case of dataset B, the underlying function $f(x)$ changes much more slowly along the rods than in directions orthogonal to them. It is therefore unsurprising that anisotropic distance can be used to improve the image quality, as explained in Section III-B. Anisotropic interpolation has, however, the disadvantage that it relies upon a priori knowledge about the underlying distribution. Furthermore, in regions where the underlying function changes uniformly in all directions, it will introduce artifacts. This can be seen in Fig. 6(c), where the single points in the top and bottom parts of the image appear stretched, relative to the other figures.
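The trade-off between search radius and anisotropy can be illustrated with a minimal sketch. It assumes a diagonal anisotropy matrix, so that the anisotropic distance reduces to scaling each coordinate difference before taking the Euclidean norm. The function names, the sample coordinates and the scale factor of 0.5 are hypothetical and chosen only for illustration; the two radii are those listed for Fig. 6.

```python
import math

def anisotropic_distance(p, q, scale):
    """Euclidean distance after scaling each coordinate difference.

    Assumption: the anisotropy is diagonal, so a factor below 1 along an
    axis compacts distances along that axis.
    """
    return math.sqrt(sum((s * (a - b)) ** 2 for a, b, s in zip(p, q, scale)))

def neighbors_within(query, samples, r_s, scale=(1.0, 1.0, 1.0)):
    """Return the samples whose (possibly anisotropic) distance to query is at most r_s."""
    return [s for s in samples if anisotropic_distance(query, s, scale) <= r_s]

# Hypothetical data: two samples on a rod along the x axis with a gap
# between them, plus one sample that lies off the rod.
on_rod = [(0.000, 0.0, 0.0), (0.016, 0.0, 0.0)]
off_rod = [(0.008, 0.009, 0.0)]
samples = on_rod + off_rod
query = (0.008, 0.0, 0.0)  # point in the middle of the gap

# (a) small radius, isotropic: the gap is not bridged
small_iso = neighbors_within(query, samples, r_s=0.005)

# (b) larger radius, isotropic: the gap is bridged, but the off-rod
# sample is also picked up, which corresponds to blurring
large_iso = neighbors_within(query, samples, r_s=0.01)

# (c) small radius, distances compacted along x: only the rod samples are found
small_aniso = neighbors_within(query, samples, r_s=0.005, scale=(0.5, 1.0, 1.0))

print(len(small_iso), len(large_iso), len(small_aniso))
```

In this toy configuration, compacting distances along the rod bridges the gap without enlarging the search region in the orthogonal directions, which mirrors the qualitative behavior seen in Fig. 6(c).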

### B. Performance

Rendering times for the different images on the CPU and GPUs varied considerably. While interactive frame rates were achieved for Figs. 5(c) and (d) (on the fastest GPU), rendering the other images was orders of magnitude slower, as shown in Fig. 7. To explain this variance, recall that in Fig. 5(b), and to a lesser extent in Fig. 5(a), all rays pass through large regions of high sample density. This is significantly more computationally demanding than if they had been passing through empty space, as in Fig. 5(c) and (d). In empty space, the range search will return faster, no interpolation is required, and the empty-space skipping optimization can be employed. It should also be noted that the settings needed to achieve satisfactory image quality depend heavily upon the properties of the dataset, as well as the viewing angle. These different parameter values have a great impact on the rendering time.

Fig. 7. Rendering time for the images in Fig. 5, for the parallel CPU version on a 4-core, hyperthreaded Intel i7 3770 K and the GPU versions on three different Nvidia Tesla GPUs. Note the logarithmic scale.

While real-time or interactive frame rates are always desirable, the primary goal of our method was high quality images, not speed. Adjusting the settings easily allows for a trade-off between quality and performance. This makes it possible to identify promising viewing angles at interactive or real-time rates, and then render high quality images offline.

The speedups on the GPUs compared to the CPU are shown in Fig. 8. For the K20 we saw speedups between 8 and 14, while the older GPUs achieved speedups between 2 and 9. The significant performance increase on the GPUs is expected, due to the highly parallel and computationally demanding nature of our ray casting algorithm. We saw different speedups for the different images depending upon how well suited their processing requirements were for the different GPU architectures.

Fig. 8. Speedups of the GPU version on three different Nvidia Tesla GPUs over the parallel CPU version on a 4-core, hyperthreaded Intel i7 3770 K for the different images in Fig. 5.

#### 1. Single and Double Precision

The results of comparing single and double precision performance for the different images and processing units can be found in Fig. 9. Here, the texture optimization described in Section IV-C-II was disabled, due to lack of support for texels consisting of four doubles. GPUs typically perform significantly fewer double precision operations per second compared to single precision. This difference was more pronounced on older GPUs, as evidenced by the poor results for the older C1060 on Fig. 5(a) and (b), where single precision was 8.3 and 4.6 times faster than double, respectively. However, the situation has improved: for the newer C2070 and K20, single precision was between 1.7 and 3 times faster than double.

Fig. 9. Speedup of single over double precision for the images in Fig. 5, for the parallel CPU version on a 4-core, hyperthreaded Intel i7 3770 K and the GPU versions on three different Nvidia Tesla GPUs. Lower speedup values imply better double precision performance.

To explain differences in rendering performance between the images, we must again consider their different processing requirements. In Fig. 5(a) and (b), more rays pass through high density, medium intensity regions compared to Fig. 5(c) and (d). Hence, more interpolation must be done to render these images, while, for Fig. 5(c) and (d), most of the time is spent searching. Interpolation is more computationally intensive, involving expensive floating point operations. Searching is comparatively simple; the only floating point operations performed are comparisons. Switching from single to double precision will therefore lead to a larger performance hit for the images from dataset A.

#### 2. Texture Memory

Fig. 10 shows the speedup achieved when the texture optimization described in Section IV-C-II was applied. The effect varied: for Fig. 5(a) and (b), the K20 and C1060 got speedups of between 1.2 and 1.6, while the performance for the C2070 degraded slightly. For Fig. 5(c) and (d), we saw no significant speedups.

Fig. 10. Speedups of texture optimization for the images in Fig. 5 for the GPU version on three different Nvidia Tesla GPUs.

The difference between the images can again be explained by the fact that for Fig. 5(a) and (b), the rays pass through extended regions of high sample density and medium diffracted intensity. Hence, significantly more samples are read, compared to Fig. 5(c) and (d). Since the texture optimization improves the performance of reading samples, higher speedups were obtained for these images.

To explain the lack of speedup on the C2070, we must take a closer look at the architecture of the different GPUs. The C1060 does not have caches, so manual caching, either by using shared memory or, as we do, texture memory, can increase performance. The newer C2070 has both L1 and L2 caches. Using these yields slightly better performance than using the texture cache manually, which explains the lack of speedup in this case. This can be verified by disabling the L1 cache at compile time. If this is done, using texture memory will improve performance. The newest of the GPUs, the K20, also has L1 and L2 caches, and hence it might seem surprising that we did not see the same results as for the C2070. However, on the K20, loads from global memory are not cached in the L1 cache, which is used for local memory accesses only [48]. Hence, using texture memory pays off. The GK110 architecture, on which the K20 is based, also makes it possible to use the cache of the texture pipeline without having to bind the memory to a texture beforehand.

#### 3. Empty-Space Skipping

In Fig. 11, the effect of the empty-space skipping optimization is shown. We saw a significant speedup of 3.2–3.8 for Fig. 5(c) and 6.1–7.1 for Fig. 5(d), but little or no effect on the images from dataset A. This is as expected, since dataset B is sparser, and has more empty space than dataset A. The difference between Fig. 5(c) and (d) is caused by the different settings that are used. In order to capture finer details, the original step size used to render Fig. 5(d) is smaller than that used to render Fig. 5(c). Hence, the potential for speedup is greater in this case.

Fig. 11. Speedups obtained by empty-space skipping for the images in Fig. 5, for the parallel CPU version on a 4-core, hyperthreaded Intel i7 3770 K and the GPU versions on three different Nvidia Tesla GPUs.

#### 4. Multi-GPU Systems and Load Balancing

Our multi-GPU load balancer assumes that multiple images will be rendered, since it uses the performance of one image to divide work for the next, as explained in Section IV-C-III. Therefore, to test its performance, we rendered four pairs of images. Each pair consisted of one of the images used thus far (as shown in Fig. 5) as well as a slightly zoomed out version of the same image. For each pair, we first rendered the zoomed out image, and then used the results to divide work for the next image. The reported results are for the last rendering only.
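The idea of using the timings from one image to divide work for the next can be sketched as follows. This is only a generic illustration of feedback-based load balancing, not the scheme of Section IV-C-III, which may differ in its details (for instance in how ray lengths are accounted for). The function name `rebalance` and the timings are hypothetical.

```python
def rebalance(shares, times):
    """Return new work fractions from the previous fractions and measured times.

    Assumption: each device's throughput is estimated as the fraction of
    work it was given divided by the time it needed, and the next frame
    is divided in proportion to these estimates.
    """
    rates = [share / time for share, time in zip(shares, times)]
    total = sum(rates)
    return [rate / total for rate in rates]

# Hypothetical first frame: work split evenly between two GPUs, with
# the second GPU taking twice as long as the first.
previous_shares = [0.5, 0.5]
measured_times = [1.0, 2.0]

new_shares = rebalance(previous_shares, measured_times)
print(new_shares)  # the faster GPU receives about two thirds of the work
```

Such a scheme only balances well when the next image has a similar distribution of work to the previous one, which is why a slightly zoomed out version of the same view is a natural test case.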

Fig. 12 shows the speedups obtained when a C2070 was used together with a K20, compared to just a single K20. The figure also shows a theoretical upper bound on the speedup. This upper bound was computed by assuming that if a fraction $\beta$ of the work is assigned to the K20, and $1-\beta$ to the C2070, the rendering time will be $\max(\beta T_{\mathrm{K20}}, (1-\beta)T_{\mathrm{C2070}})$, where $T_{\mathrm{K20}}$ and $T_{\mathrm{C2070}}$ are the rendering times measured individually on the K20 and C2070, respectively. The theoretical lower bound on the joint rendering time can then be found by solving the equation $\beta T_{\mathrm{K20}}=(1-\beta)T_{\mathrm{C2070}}$ for $\beta$, which gives $\beta = T_{\mathrm{C2070}}/(T_{\mathrm{K20}}+T_{\mathrm{C2070}})$. The theoretical upper bound on the speedup over the K20 alone is then $T_{\mathrm{K20}}/(\beta T_{\mathrm{K20}}) = 1 + T_{\mathrm{K20}}/T_{\mathrm{C2070}}$.
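The bound described above can be evaluated numerically with a few lines of code. The sketch below is illustrative only: the function name `optimal_split` is hypothetical, and the timings are made-up values rather than measurements from this work.

```python
def optimal_split(t_k20, t_c2070):
    """Ideal work split between two GPUs and the resulting speedup bound.

    t_k20 and t_c2070 are the times each GPU needs to render the whole
    image alone. The joint time max(beta * t_k20, (1 - beta) * t_c2070)
    is minimized when both terms are equal.
    """
    beta = t_c2070 / (t_k20 + t_c2070)  # fraction of work given to the K20
    joint_time = beta * t_k20           # equals (1 - beta) * t_c2070
    speedup = t_k20 / joint_time        # relative to the K20 alone
    return beta, joint_time, speedup

# Hypothetical timings: the C2070 needs twice as long as the K20.
beta, joint_time, speedup = optimal_split(t_k20=1.0, t_c2070=2.0)
print(beta, joint_time, speedup)
```

Under this model the bound depends only on the ratio of the two rendering times, and it approaches 2 as the two GPUs become equally fast.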

Fig. 12. Speedup of the multi-GPU version on the combination of Nvidia Tesla C2070 and K20 versus the single-GPU version on the K20 alone. Theoretical upper bounds are also indicated.

The combination of C2070 and K20 achieved an average speedup of 1.4, while the average theoretical upper bound is 1.58. The cause of this discrepancy is that the amount of work for a thread is not known in advance. Hence, using thread count as a proxy for amount of work, even when ray lengths are adjusted for, leads to inaccurate estimates. The poor speedup for Fig. 5(c) was caused by the rendering time being so short that the time to transfer the image back to the host became significant. The theoretical upper bound does not take this overhead into account, and was therefore too high in this case.

Fig. 13 shows how well the load balancing scheme was able to divide work evenly between the GPUs. The figure shows the fraction of compute time spent on each GPU. Ideally, the GPUs should spend the same amount of time, resulting in a 50/50 division. As we can see, for all cases, the division was fairly close to the ideal.

Fig. 13. Fraction of total compute time on each GPU for the multi-GPU version with two different Nvidia Tesla GPUs.

SECTION VI

## CONCLUSION AND FUTURE WORK

In this paper, we have developed a method based on volume ray casting for visualizing volumetric scattered point data. We have applied the method to two qualitatively different experimental X-ray diffraction datasets of highly crystalline materials containing disorder, showing that high-quality visualization of intensity distributions in ${\mathbf Q}$-space is highly useful for extracting information on complex nanostructures and disorder from X-ray diffraction experiments. This is one example of the growing importance of visualization.

The introduced method differs from traditional ray casting algorithms by not using voxels, but rather operating directly on the scattered point data. The method finds the value of the scalar field to be visualized at positions along the rays by interpolation using nearby samples. We use an octree to efficiently find these nearby samples. We implemented standard ray casting optimization techniques such as early ray termination and empty-space skipping. We have developed a novel algorithm for empty-space skipping that is agnostic to the acceleration data structure and is suitable for situations where voxels or similar representations are not used. We have shown that in situations where the average absolute value of the directional derivatives depends strongly on the direction, image quality can be improved by using anisotropic interpolation to find values at points along the rays. Versions for multicore CPUs, GPUs and multi-GPU systems have been implemented.

Our implementations were tested using actual X-ray diffraction data, consisting of up to 120 M data points. Our method is capable of producing images of good quality. The rendering time varies significantly, between 0.2 s and 12.7 s on an Nvidia Tesla K20, depending upon the dataset and the settings used. The GPU implementation (on the Nvidia Tesla K20) achieves a speedup between 8 and 14 for different images, compared to the multithreaded CPU version (Intel i7 3770 K).

In future research, one may investigate how performance can be improved with further optimizations. Devising methods to automatically determine optimal parameter settings is also a possible direction of future research. It would be interesting to look at how our method compares to a different algorithm such as splatting. As experimental and simulated datasets become larger, one idea would be to also look at the tradeoffs of data compression (Aqrawi et al. [49]). Without doubt, future work will aim at further developing real-time, multi-dimensional visualization tools for interactive analysis of vast datasets.

### Acknowledgment

The authors would like to thank Nvidia's CUDA Research Center Program and NTNU for hardware donations, Thomas Tybell for providing the ${\mathrm{PbTiO}}_{3}$ sample, Frode Mo for facilitating the experiments at the Swiss-Norwegian Beamlines, and Emil J. Samuelsen for providing the diaquabis(salicylato)copper(II) sample.

## Footnotes

Corresponding authors: T. L. Falch (thomafal@idi.ntnu.no), A. C. Elster (elster@ntnu.no)

1. Our source code is available under a BSD license at https://github.com/acelster/scatter-pt-viz

2. When anisotropic distances are used, coordinates can be transformed such that the region to search remains a sphere in the search space.

3. This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors. This includes two MP4 format movie clips showing 3D renderings of the two datasets described here, generated with our method. This material is 369 MB in size.
