Visual Hull Tree: A New Progressive Method to Represent Voxel Data

A visual hull is an approximation of a three-dimensional (3-D) object generated by the shape-from-silhouettes (SfS) technique. Because a visual hull is calculated from silhouettes, it can be represented by silhouette images, which can be encoded with a small number of bits. However, it is challenging to represent the concave regions of an object with a visual hull. In this paper, we model voxel data with a set of visual hulls, thereby handling the concave regions of the voxel data. To accomplish this, silhouettes are generated from the input voxel data using virtual cameras, and a visual hull is computed from the silhouettes by SfS. To handle the concave regions, we calculate the residuals of the visual hulls, and the residuals are again represented by visual hulls. This process is repeated until all concave regions are processed, which yields a hierarchical data structure, i.e., a visual hull tree. Because the visual hull tree is built from a set of visual hulls, it can represent the details of the voxel data even at the root node. Also, because a set of visual hulls can be represented by silhouettes, a visual hull tree can be encoded with a small number of bits. In the experiments, we compare our method to the octree-based representation, and our method demonstrates good encoding performance.


I. INTRODUCTION
To represent three-dimensional (3-D) geometry data efficiently, a variety of methods have been proposed. Among them, volumetric approaches use voxels as primitives and represent a 3-D geometry using transparent voxels and opaque voxels. Compared to other approaches such as polygonal meshes, volumetric approaches are intuitive and straightforward. Moreover, many signal processing algorithms can be applied to voxel data by considering a voxel as a 3-D extension of a pixel. For these reasons, recent research in computer graphics and artificial intelligence has widely utilized volumetric approaches.
The main disadvantage of volumetric approaches is their memory requirement: they generally need a large amount of memory to represent a 3-D geometry. Consequently, several encoding methods for voxel data have been studied, and the octree has become a popular means of representing voxel data [1]-[3]. An octree is the 3-D extension of a quad-tree, and it recursively subdivides a 3-D space into eight octants.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.
When the geometry of a 3-D object is represented by an octree, the shape becomes more apparent as we traverse down the tree.
In this paper, we propose a visual hull tree to represent voxel data. A visual hull is a geometric entity created by the shape-from-silhouettes (SfS) technique [4]-[6]. SfS is a 3-D geometry reconstruction method that uses multiple-view silhouettes of an object, and it computes a visual hull with fewer computations than other 3-D reconstruction methods. Based on this observation, we represent voxels as the set of silhouettes from which SfS computes a visual hull. In other words, while the majority of previous research uses SfS as a tool for generating 3-D content, we instead use it as a 3-D geometry decoder. However, a visual hull cannot represent the concave regions of an object well. To address this issue, we find the concave regions from the differences between the 3-D geometry and its visual hull. The concave regions are then again represented by visual hulls. This process is performed repeatedly, so that the voxel data are represented as a set of visual hulls. Because the visual hulls are generated hierarchically, we refer to the resulting structure as a visual hull tree; it is similar to a concavity tree [7]-[9]. With this method, voxel data are represented by the set of silhouettes used for generating the visual hulls, and the silhouettes can be encoded by various methods, e.g., 2-D run-lengths, a chain code, or a quad-tree. However, traditional methods encode only a single silhouette; thus, a new method for representing multiple silhouettes is necessary to reduce the amount of data. To address this, we represent a set of silhouettes using a bit plane approach, which merges the silhouettes into gray-scale images and thereby reduces the data needed to represent them.
Compared to an octree, the proposed method has the following advantages. First, the visual hull tree can represent an object's details well, even if it does not descend to the leaf node level. Second, we can decode the visual hull tree using a similarity transformation without any post-processing.
The remainder of this paper is organized as follows. Section II reviews studies related to our method, and we describe our proposed visual hull tree in Section III. Experimental results are presented in Section IV. Finally, we conclude the paper in Section V.

II. RELATED WORKS

A. OCTREE
An octree is the three-dimensional analog of the quad-tree representation and was introduced in [1]. To generate an octree, voxel data are repeatedly divided into octants, so that the voxel data are represented hierarchically. The octree reduces the memory cost of voxel data, and many algorithms are based on the octree representation. For example, a modeling scheme using the octree is proposed in [10]. Furthermore, many octree-based studies on rendering, encoding, and deep learning have been carried out [3], [11], [12]. To further reduce the memory space associated with an octree, a directed acyclic graph (DAG) is used; it is generated by merging nodes of the octree that share the same pattern [2].
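The recursive subdivision described above can be sketched as follows. This is a minimal illustration, assuming a cubic boolean NumPy grid whose side is a power of two; the names `build_octree` and `count_nodes` are hypothetical and are not taken from [1].

```python
import numpy as np

def build_octree(vox):
    """Return 0 (empty block), 1 (full block), or a list of eight child nodes."""
    if not vox.any():
        return 0
    if vox.all():
        return 1
    h = vox.shape[0] // 2  # assumes a cubic grid with a power-of-two side
    return [build_octree(vox[x:x + h, y:y + h, z:z + h])
            for x in (0, h) for y in (0, h) for z in (0, h)]

def count_nodes(node):
    """Count internal and leaf nodes of the nested-list octree."""
    if isinstance(node, list):
        return 1 + sum(count_nodes(child) for child in node)
    return 1

vox = np.zeros((4, 4, 4), dtype=bool)
vox[:2, :2, :2] = True   # one completely filled octant
vox[3, 3, 3] = True      # a single voxel in the opposite octant
tree = build_octree(vox)
print(count_nodes(tree))
```

Uniform blocks collapse into a single leaf, so only the octant that contains the isolated voxel is subdivided further.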

B. VISUAL HULL
The visual hull is a 3-D entity that is generated by the shape-from-silhouettes (SfS) technique. To generate a visual hull, silhouettes are back-projected into 3-D space, and the intersection of the back-projected volumes is calculated [6] (Figure 1a). It is a simple algorithm but requires many iterations to calculate the intersection of volumes. To improve the computation time, 3-D ray-based methods have been proposed [4], [5]. In [5], 3-D rays are projected into the silhouettes, and the intersections of rays and silhouettes are calculated. In this method, the intersection points in 2-D images are also back-projected into the 3-D space using 2-D homography. A second method [4] uses affine rectification so that the rays are projected in parallel onto 2-D images. Because the projected rays are parallel, it is easy to locate the intersections between rays and silhouettes.

C. CONCAVITY TREE
The concavity tree is a data structure that describes the concave regions in a 2-D binary image [9] (Figure 1b). To generate a concavity tree, the concave regions (C) of an object are estimated using a convex hull. Then, concave regions are estimated again from C, and a hierarchical data structure is formed from the concave regions. Because a concavity tree characterizes the convexity of an object, it has many applications. For example, shape retrieval and classification can be accomplished by observing the convexity of objects [8], [13]. Encoding objects in 2-D binary images is another application of the concavity tree [14].

III. VISUAL HULL TREE
The system for generating the visual hull tree is shown in Figure 2. First, we project the input voxels onto the image planes and generate silhouettes. From the silhouettes, a visual hull is generated, and residuals are found by calculating the differences between the visual hull and the input data. The residuals are then again represented by silhouettes. This process is repeated until all residuals are processed, which produces the visual hull tree. Through the visual hull tree, the input voxels are modeled by a set of silhouettes and run-lengths. Algorithm details are provided in the subsections that follow.

A. VISUAL HULL COMPUTATION
VOLUME 8, 2020
In this section, we describe how to generate silhouettes from input voxel data and how to predict the input voxel data using a visual hull. A visual hull is a 3-D entity that approximates a 3-D object, and the shape-from-silhouettes (SfS) technique is used to compute it. SfS back-projects the regions on the silhouettes into 3-D space and calculates their intersections, generating a visual hull from these intersections. To model voxels using visual hulls, we generate silhouettes from the voxel data: we define virtual cameras around the data and project the voxels onto 2-D image planes. A virtual camera is commonly modeled as a projective camera, the model used for real cameras. For our method, however, projective cameras not only increase the complexity of calculating a visual hull but also make it necessary to encode the camera parameters. To address this, we define the virtual cameras as affine cameras. An affine camera is defined by equation (1).
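For reference, the general affine camera in the standard multiple-view-geometry convention is a 3 × 4 matrix whose last row is fixed, as written below. The entry names $m_{ij}$ and $t_i$ are assumed notation and are not necessarily the symbols used in equation (1).

```latex
P_A =
\begin{bmatrix}
m_{11} & m_{12} & m_{13} & t_1 \\
m_{21} & m_{22} & m_{23} & t_2 \\
0      & 0      & 0      & 1
\end{bmatrix}
```

Because the third row does not depend on the 3-D point, the projection involves no perspective division, which is what makes the silhouette and visual hull computations simple.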
Because the last row of an affine camera is $(0, 0, 0, 1)^T$, an image is generated through the orthogonal projection of the 3-D data. As shown in Figure 3, with an affine camera the calculations for the silhouettes and the visual hull are simpler than with a projective camera. In our work, we set three affine cameras whose principal axes lie along the X-axis, Y-axis, and Z-axis, respectively; hence a visual hull is easier to calculate. As this particular case of the affine camera, we define the cameras $P_{AX}$, $P_{AY}$, and $P_{AZ}$ in equations (2), (3), and (4). With these cameras, the silhouettes and a visual hull are generated by analyzing the coordinates of the voxels. To generate silhouettes, voxels are projected into the cameras by multiplying the camera matrices by the homogeneous coordinates of the voxels; $S_X$, $S_Y$, and $S_Z$ are the homogeneous coordinates of the pixels on the silhouettes generated by $P_{AX}$, $P_{AY}$, and $P_{AZ}$, respectively. The projection is repeated until all voxels are projected onto the image planes of the affine cameras. However, because equations (2), (3), and (4) are special cases of affine cameras, we can compute the silhouettes simply by removing specific coordinates. For instance, for equation (2), the Y and Z coordinates of a voxel become the x and y coordinates on the image plane.
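Dropping one coordinate per camera amounts to a logical OR along the corresponding axis of the voxel grid. The sketch below assumes a boolean NumPy array indexed as `vox[x, y, z]`; the function name `silhouettes` and the image axis order chosen for the Y camera are assumptions made for illustration.

```python
import numpy as np

def silhouettes(vox):
    """Orthographic silhouettes of a boolean grid indexed as vox[x, y, z]."""
    s_x = vox.any(axis=0)  # camera along X: image axes are (y, z)
    s_y = vox.any(axis=1)  # camera along Y: image axes are (x, z), assumed order
    s_z = vox.any(axis=2)  # camera along Z: image axes are (x, y)
    return s_x, s_y, s_z

vox = np.zeros((4, 4, 4), dtype=bool)
vox[1, 2, 3] = True        # a single occupied voxel at (x, y, z) = (1, 2, 3)
s_x, s_y, s_z = silhouettes(vox)
print(np.argwhere(s_x), np.argwhere(s_y), np.argwhere(s_z))
```

Each silhouette pixel is set when at least one voxel along the viewing axis is occupied, so no matrix multiplication is needed for these axis-aligned cameras.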
To compute a visual hull from silhouettes, we could use a ray carving method [4], [5]. A ray carving method defines rays regularly in 3-D space and projects them onto the image planes. It then calculates the intersections between the projected rays and the contours of the silhouettes, and the intersection points are back-projected into 3-D space, from which a visual hull is calculated. This algorithm is fast and would be useful for our system. However, with affine cameras we can compute a visual hull more simply than with ray carving. Before computing a visual hull, the regions in the silhouettes are back-projected into 3-D space as follows:
$$X(\lambda) = P^{\dagger} x + \lambda C \quad (5)$$
where $P^{\dagger}$ is the pseudo-inverse of the camera matrix $P$, $x$ is the homogeneous coordinate of a pixel, $C$ is the null space of $P$, and $\lambda$ is a real value. All pixels in the silhouettes are back-projected by equation (5), and the 3-D space is filled with the back-projected voxels. Because we use simple affine cameras, the back-projected voxels are generated by filling the space along the x-axis, y-axis, and z-axis. For instance, the back-projection of a pixel (x, y) on $S_Z$ is simply the set of voxels (x, y, k), where 0 ≤ k ≤ N and N is the voxel resolution. After the pixels are back-projected, the visual hull is obtained as the intersection of the back-projected volumes in 3-D space (equation (6)).
To calculate equation (6), a bit-wise AND operation is utilized in 3-D space [15].
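The back-projection and intersection steps can be sketched with NumPy broadcasting. This is an illustration under the assumption of three axis-aligned silhouettes of a grid indexed as `[x, y, z]`; the function name `visual_hull` is hypothetical.

```python
import numpy as np

def visual_hull(s_x, s_y, s_z):
    """Intersect the three back-projected silhouette volumes with bit-wise AND."""
    # Broadcasting extrudes each silhouette along its own camera axis.
    return s_x[None, :, :] & s_y[:, None, :] & s_z[:, :, None]

# A 3x3x3 cube with its center voxel removed: a cavity no silhouette can see.
vox = np.ones((3, 3, 3), dtype=bool)
vox[1, 1, 1] = False
hull = visual_hull(vox.any(axis=0), vox.any(axis=1), vox.any(axis=2))
extra = hull & ~vox        # voxels in the hull that are not in the object
print(int(hull.sum()), int(extra.sum()))
```

The hull encloses the object and fills the hidden cavity, which is exactly the kind of difference the residual computation in the next subsection has to capture.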

B. RESIDUAL COMPUTATION AND MEASUREMENT
Because a visual hull is only an approximation of an object, the two differ. Thus, to represent voxels using a visual hull, it is necessary to find the residual of the visual hull, i.e., the differences between the visual hull and the voxels. To find a residual, we apply the exclusive-OR operation to the visual hull and the input voxels [15]. The residuals are composed of a set of 3-D voxel blobs, and each residual is then again represented by a visual hull. If a residual contains only a few voxels, representing it as a visual hull is inefficient, because representing the residual as silhouettes can take many bits. For this reason, we represent a residual as 3-D run-lengths if its number of voxels is less than a specified threshold. The above process is repeated until all residuals disappear, and a visual hull tree that consists of silhouettes and 3-D run-lengths is generated. However, the convergence of the visual hull tree is not guaranteed, because a visual hull usually encloses an object. Figure 4 illustrates the convergence problem in 2-D space. In Figure 4(c) and (e), the residuals repeatedly take similar shapes, and therefore the visual hull tree may not converge. To ensure convergence, each residual is partitioned by subdividing it into octants, which we refer to as sub-residuals. The subdivision takes place after the residuals are found, and each sub-residual is represented by a visual hull (Figure 5). Because the subdivision reduces the size of the residuals, the visual hull tree converges. Algorithms 1 and 2 describe the overall generation process for visual hull trees.
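The residual and subdivision loop can be sketched recursively as below. This is a simplified illustration and not a transcription of Algorithms 1 and 2: the names `encode`, `decode`, and `octants` are hypothetical, small residuals are stored as raw voxel blocks in place of 3-D run-lengths, and the decoder applies XOR at every level, which is equivalent to alternating subtraction and addition because each residual lies inside its hull.

```python
import numpy as np

def octants(shape):
    """Yield slice triples that split a block into up to eight sub-blocks."""
    cuts = [((0, n // 2), (n // 2, n)) if n > 1 else ((0, n),) for n in shape]
    for x0, x1 in cuts[0]:
        for y0, y1 in cuts[1]:
            for z0, z1 in cuts[2]:
                yield (slice(x0, x1), slice(y0, y1), slice(z0, z1))

def encode(vox, threshold):
    """Encode a boolean block as silhouettes plus encoded sub-residuals."""
    if vox.sum() < threshold or vox.size == 1:
        return {"raw": vox.copy()}          # stand-in for 3-D run-lengths
    sil = (vox.any(axis=0), vox.any(axis=1), vox.any(axis=2))
    hull = sil[0][None, :, :] & sil[1][:, None, :] & sil[2][:, :, None]
    residual = hull ^ vox                   # differences between hull and input
    children = [(sl, encode(residual[sl], threshold))
                for sl in octants(vox.shape) if residual[sl].any()]
    return {"sil": sil, "children": children}

def decode(node):
    """Rebuild the block: hull from silhouettes, then XOR in each sub-residual."""
    if "raw" in node:
        return node["raw"].copy()
    s_x, s_y, s_z = node["sil"]
    out = s_x[None, :, :] & s_y[:, None, :] & s_z[:, :, None]
    for sl, child in node["children"]:
        out[sl] ^= decode(child)
    return out

# A solid 8x8x8 cube with a hollow 4x4x4 interior that no silhouette reveals.
vox = np.ones((8, 8, 8), dtype=bool)
vox[2:6, 2:6, 2:6] = False
tree = encode(vox, threshold=4)
print(len(tree["children"]), np.array_equal(decode(tree), vox))
```

Each recursive call works on a strictly smaller block, so the recursion always terminates, which mirrors the role that octant subdivision plays in making the visual hull tree converge.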

C. BIT PLANE-BASED SILHOUETTES REPRESENTATION
The visual hull tree consists of silhouettes and 3-D run-lengths. Various methods can be utilized to represent silhouettes, such as 2-D run-lengths, a chain code, or a quad-tree. However, depending on the geometry of the 3-D object, many silhouettes may be required. Hence, a large amount of data may be required if traditional binary representations are utilized for encoding silhouettes. To handle this issue, we propose the bit plane-based representation, which merges silhouettes into a gray image. As we represent silhouettes using gray images, we can utilize lossless grayscale image encoders, which have been studied extensively.

Algorithm 1 Generation of a Visual Hull Tree
        VHT ← append_to(vht)
    end for
    V_P ← V_N
end while
A bit plane of a gray-scale image is a binary image composed of the bits at a given bit position of the intensity values. As shown in Figure 6a, bit plane 8 contains the most significant bits of the intensity values, and bit plane 1 contains the least significant bits. In this manner, we treat the silhouettes in a visual hull tree as bit planes. The simplest way of gathering silhouettes is to treat each silhouette as one bit plane. However, this approach generates many bit planes and thus requires much data. To reduce the number of bit planes, we gather several silhouettes into a single bit plane without losing the information that describes the silhouettes. To achieve this, silhouettes that do not overlap one another are merged into the same bit plane. To decode the visual hull represented by bit plane-based silhouettes, we establish the following rules for the bit plane-based representation.
1) Each bit plane contains only silhouettes from the same level of the visual hull tree. If the silhouettes at one level are represented by more than one bit plane, the number of bit planes is recorded in memory.
2) Each set of silhouettes on the XY, YZ, and ZX planes is represented by the bit plane-based method. As a consequence, there are three sets of images used to represent silhouettes.
3) If the number of bit planes is more than 8, an additional image is created.
The detailed process is described in Algorithm 3.

Algorithm 3 Bit Plane-Based Silhouettes Representation
Input: Visual Hull Tree VHT
Output: Image I
for k = 1 to max_Level do
    num_sil ← VHT(k).num_sil
    for j = 1 to num_sil do
        tmp_I ← VHT(k).sil(j)
        for i = 1 to num_sil do
            if !is_overlapped(tmp_I, VHT(k).sil(i)) then
                tmp_I ← Union(tmp_I, VHT(k).sil(i))
            end if
        end for
    end for
    Update_Bit_Plane(I, tmp_I)
end for
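The idea of merging silhouettes that do not overlap and stacking the resulting binary planes into a gray image can be sketched as follows. This is a greedy variant written for illustration, not a transcription of Algorithm 3; the function names are hypothetical, and the bookkeeping needed to tell apart silhouettes that share a plane is left out.

```python
import numpy as np

def merge_into_planes(sils):
    """Greedily merge silhouettes that do not overlap into shared bit planes."""
    planes = []
    for s in sils:
        for p in planes:
            if not (p & s).any():   # no common pixel: safe to share this plane
                p |= s              # in-place union with the existing plane
                break
        else:
            planes.append(s.copy())
    return planes

def pack_planes(planes):
    """Pack up to eight binary planes into one 8-bit gray image."""
    assert 1 <= len(planes) <= 8
    img = np.zeros(planes[0].shape, dtype=np.uint8)
    for bit, p in enumerate(planes):
        img |= p.astype(np.uint8) << np.uint8(bit)
    return img

def unpack_plane(img, bit):
    """Recover one binary plane from the gray image."""
    return ((img >> np.uint8(bit)) & 1).astype(bool)

a = np.zeros((4, 4), dtype=bool); a[0:2, 0:2] = True
b = np.zeros((4, 4), dtype=bool); b[2:4, 2:4] = True
c = np.zeros((4, 4), dtype=bool); c[1:3, 1:3] = True   # overlaps both a and b
planes = merge_into_planes([a, b, c])
img = pack_planes(planes)
print(len(planes))
print(img)
```

Here `a` and `b` share the least significant plane while `c` needs a plane of its own, so three silhouettes cost two bit planes of a single gray image.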

D. DECODING OF VISUAL HULL TREE
Figure 7 shows an example of the decoding process in 2-D space. The decoding of a visual hull tree is similar to the decoding of a concavity tree. Figure 7a shows an example of the decoding process for a concavity tree, which involves repeated subtraction and addition of concave regions. In a similar manner, the decoding process of a visual hull tree starts from the root node, and subtraction or addition is performed on the visual hulls (Figure 7b).
$$D_t = \begin{cases} D_{t-1} \cup VH_t, & \text{if } t \text{ is an odd number} \\ D_{t-1} \oplus VH_t, & \text{if } t \text{ is an even number} \end{cases} \quad (7)$$
Here, $D_t$ denotes the decoded voxels at level $t$, and $VH_t$ is the visual hull at level $t$. To compute $VH_t$, the silhouettes at level $t$ undergo SfS as introduced in subsection III-A. If level $t$ is odd, a union (OR operation) is used for addition; otherwise, an exclusive OR (XOR) operation is used for subtraction. Specifically, at the first level, a visual hull is generated by SfS. At the second level, visual hulls are also generated by SfS, and an XOR operation is performed on the decoded voxels of the first level and the visual hull of the second level. The residuals at the second level also undergo the XOR operation. At the third level, an OR operation is performed between the decoded voxels of the second level and the visual hull of the third level. Through the OR operation, the merged visual hulls become new visual hulls, because the OR operation outputs 1 except when the inputs are (0, 0). In summary, the visual hulls and 3-D run-lengths at the odd levels undergo the OR operation, and those at the even levels undergo the XOR operation. After the final nodes are reached, the visual hull tree is wholly decoded, and the voxel data are generated. Algorithm 4 presents the decoding process for a visual hull tree.
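The alternation between OR at odd levels and XOR at even levels can be sketched as below. The example works in 2-D for brevity, in the spirit of Figure 7, and assumes that the visual hulls and run-length voxels of each level have already been merged into one boolean array per level; the function name `decode_levels` is hypothetical.

```python
import numpy as np

def decode_levels(level_hulls):
    """Decode a list whose entry t-1 holds the merged visual hulls of level t."""
    decoded = np.zeros_like(level_hulls[0])
    for t, vh in enumerate(level_hulls, start=1):
        if t % 2 == 1:
            decoded = decoded | vh   # odd level: addition (OR)
        else:
            decoded = decoded ^ vh   # even level: subtraction (XOR)
    return decoded

full = np.ones((6, 6), dtype=bool)                           # level 1: outer hull
notch = np.zeros((6, 6), dtype=bool); notch[1:5, 1:5] = True  # level 2: cavity
core = np.zeros((6, 6), dtype=bool); core[2:4, 2:4] = True    # level 3: island
decoded = decode_levels([full, notch, core])
print(decoded.astype(int))
```

The result is a ring with a filled island in its cavity: the second level carves the cavity out of the first hull, and the third level adds the island back.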

E. ROTATION, TRANSLATION, AND SCALING OF VISUAL HULL TREE
In the decoding stage of a visual hull tree, the computation of a visual hull is essential. To compute a visual hull, we use the simple camera matrices introduced in subsection III-A. Specifically, we set each camera's extrinsic matrix to the 4 × 4 identity matrix under orthogonal projection. The extrinsic matrix defines the relationship between the camera coordinates and the world coordinates, and it consists of rotation, translation, and scaling (Figure 8). If the extrinsic matrix is an identity matrix, the camera and world coordinates are the same. Otherwise, the camera and world coordinates are related by a similarity transformation. From this analysis, we can directly decode a similarity-transformed object if we modify the extrinsic matrices with suitable parameters, such as rotation angle, translation, and scaling. Equation (8) gives the relationship between a camera matrix and the extrinsic matrix $R_Z R_Y R_X T$. In equation (8), $c_x$ and $s_x$ are the cosine and sine of $\theta_x$, respectively, while $S_X$, $S_Y$, and $S_Z$ are scaling factors along the X-axis, Y-axis, and Z-axis, respectively, and $t_X$, $t_Y$, and $t_Z$ are translation factors. Before a visual hull is computed, the camera matrices are modified using equation (8), and the visual hulls are then computed. For the residuals, only the start points and end points of the run-lengths are transformed using equation (8). Through this process, we can decode transformed voxels without post-processing.

Algorithm 4 Decoding of a Visual Hull Tree
Input: Visual Hull Tree VHT, Max Level of visual hull tree L
Output:
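Transforming run-length end points with a rotation, scaling, and translation can be sketched as follows. The composition order used here (scale first, then $R_Z R_Y R_X$, then translation) and the function names are assumptions made for illustration and may differ from the exact form of equation (8).

```python
import numpy as np

def similarity_matrix(angles, scale, translation):
    """4x4 matrix: scale, then R_Z R_Y R_X, then translation (assumed order)."""
    ax, ay, az = angles
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    r_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    r_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    r_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    m = np.eye(4)
    m[:3, :3] = r_z @ r_y @ r_x @ np.diag(scale)
    m[:3, 3] = translation
    return m

def transform_points(m, pts):
    """Apply m to an (n, 3) array, e.g. run-length start and end points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (homog @ m.T)[:, :3]

# Rotate 90 degrees about Z, scale by 2, and shift by one unit along X.
m = similarity_matrix((0.0, 0.0, np.pi / 2), (2.0, 2.0, 2.0), (1.0, 0.0, 0.0))
run = np.array([[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])  # start and end of one run
moved = transform_points(m, run)
print(np.round(moved, 6))
```

Only the two end points of each run are transformed, so the cost grows with the number of runs and not with the number of voxels they cover.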

IV. SIMULATION RESULTS
In the simulations, we experimented with six voxel data sets: 'Bust' [16], 'Chair' [16], 'Duck' [17], 'Mario' [17], 'VCL Man1' [4], and 'VCL Man2' [4]. 'Bust,' 'Chair,' 'Duck,' and 'Mario' are mesh data, which we voxelize in 3-D space (Figure 9). 'VCL Man1' and 'VCL Man2' are multiple-view image data, so their geometry is reconstructed using the ray carving method [4], [5] (Figure 9). We set the resolution of all voxel data to 256 × 256 × 256. To model voxel data as a visual hull tree, one parameter, the threshold, must be set, and the performance of our method varies depending on it. In the experiments, we set the threshold by multiplying the number of voxels by t, where t is a positive number less than 1.0. We varied t in the range of 0.001 to 0.01 and constructed visual hull trees. Additionally, we compared our method to the octree. To evaluate the validity of the proposed method, we use a geometric measure that is used for MPEG Point Cloud Compression [18]. The evaluation in [18] proposes assessment criteria for the quality of both the geometry and the colors of a point cloud; we use only the geometric criterion in the experiments. The geometric criterion is defined as follows:
$$d(V_o, V_{deg}) = \max\left(d_{rms}(V_o, V_{deg}),\ d_{rms}(V_{deg}, V_o)\right) \quad (9)$$
where $V_o$ is the original voxel or point cloud data and $V_{deg}$ is the degraded voxel or point cloud data. $d_{rms}$ is defined as
$$d_{rms}(V_o, V_{deg}) = \sqrt{\frac{1}{K} \sum_{v_o \in V_o} \lVert v_o - v_{dnn} \rVert^2} \quad (10)$$
Here, $K$ is the number of voxels in $V_o$, $v_o$ is an element of $V_o$, and $v_{dnn}$ is the nearest neighbor of $v_o$ in $V_{deg}$. In our experiments, to observe errors at various scales, we use the logarithm of equation (9). To handle $d(V_o, V_{deg}) = 0$, we add 1 to equation (9) before taking its logarithm:
$$\log\left(1 + d(V_o, V_{deg})\right) \quad (11)$$
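The geometric error can be computed along the following lines. This is a brute-force sketch, assuming the symmetric form in which the larger of the two directed RMS distances is taken and a base-10 logarithm; the function names are hypothetical, and a spatial index such as a k-d tree would be needed for grids as large as 256 × 256 × 256.

```python
import numpy as np

def d_rms(a, b):
    """RMS distance from each point of a to its nearest neighbor in b."""
    diff = a[:, None, :] - b[None, :, :]       # all pairs, O(len(a) * len(b))
    nearest_sq = (diff ** 2).sum(axis=2).min(axis=1)
    return float(np.sqrt(nearest_sq.mean()))

def log_geometric_error(v_o, v_deg):
    """Logarithm of one plus the symmetric RMS distance (base 10 is assumed)."""
    d = max(d_rms(v_o, v_deg), d_rms(v_deg, v_o))
    return float(np.log10(1.0 + d))

# Point sets would normally come from occupied voxels, e.g. np.argwhere(vox).
v_o = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
v_deg = np.array([[0.0, 0.0, 0.0]])
print(d_rms(v_o, v_deg), d_rms(v_deg, v_o), log_geometric_error(v_o, v_deg))
```

Adding 1 before taking the logarithm keeps the measure at exactly zero when the decoded voxels match the original.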
Equation (11) is calculated for each level of the visual hull tree and the octree, so that the geometric error is observed while traversing toward the leaf nodes of the trees. Figure 10 presents the geometric errors as a function of the number of bits. The silhouettes of a visual hull tree are merged into gray images by the bit plane-based representation. In this experiment, we store the gray images in the PNG file format, which uses lossless compression. To obtain the curves in Figure 10, we decode the visual hull trees level by level and calculate equation (11) at each level of detail. Furthermore, the numbers of bits are measured cumulatively by adding the bits for each level of the visual hull tree. To compare the proposed method with previous methods, we evaluate the octree, which is a popular representation method for point clouds, and the DAG, which is a modified version of the octree. For the previous methods, equation (11) is also calculated from the voxel data decoded at each level of detail, and the numbers of bits for each level are measured in the same way as for the visual hull tree. In Figure 10, the visual hull tree becomes shallower as t approaches 0.01. Furthermore, Figure 10 shows that the visual hull tree converges at a lower level than the octree. Because the test voxel data have 256 × 256 × 256 resolution, the octree always has eight levels. However, the depth of the visual hull tree varies depending on the threshold and the shape of the voxel data.
We can also observe the number of bits used to represent a visual hull tree, an octree, and a DAG in Figure 10. The figure shows that the visual hull tree uses fewer bits to represent the 3-D data than the octree and the DAG. We also present subjective quality images in Figures 11 and 12, which show that the visual hull tree not only converges quickly but also achieves similar geometry at all levels.

V. CONCLUSION
In this paper, we propose a new progressive representation method, the visual hull tree, for 3-D voxel data. A visual hull is a 3-D entity that is generated by the SfS technique. SfS computes a visual hull from silhouettes, which means that a visual hull can be represented by silhouettes that require only a small amount of data. To take advantage of this, we model voxel data as visual hulls. Because a visual hull does not describe concave regions, we estimate the concave regions by calculating the differences between the voxel data and the visual hulls. The concave regions are also represented by visual hulls, and this process is repeated until all concave regions are processed. Finally, the input voxel data are represented by hierarchical visual hulls, i.e., a visual hull tree. Even when the visual hull tree is not traversed down to its leaf nodes, it produces voxel data similar to the input data. Hence, the visual hull tree is useful for progressive coding. Also, the visual hull tree can be decoded under a similarity transform, because we can modify the cameras that are used for computing the visual hulls. Moreover, to reduce the data of the visual hull tree, we propose the bit plane-based silhouette representation. In the experimental results, the proposed method demonstrated superior performance to the octree algorithm.
As future work, we plan to develop lossy coding strategies for a visual hull tree. Furthermore, we expect to apply these strategies to visual hull trees, because the proposed method was motivated by concavity trees.