A Real-Time Super Multiview Rendering Pipeline for Wide Viewing-Angle and High-Resolution 3D Displays Based on a Hybrid Rendering Technique

I. INTRODUCTION
A. GEOMETRY-BASED RENDERING METHODS
The simplest geometry-based rendering method is each camera viewpoint-independent rendering (ECVIR) [6], which sets up one camera for each viewpoint and renders all the viewpoint images in turn. Consequently, the rendering efficiency of this method decreases quickly as the number of viewpoints increases. The multiview rendering (MVR) method [7] first renders scenes to epipolar plane images and then transforms them into viewpoint images. MVR can improve rendering efficiency in theory, but it is not supported by standard computer graphics rendering engines. The backward ray tracing (BRT) method [8] for the SMV display is difficult to render in real time because of the many ray-object intersection calculations. Although great progress has been made in real-time ray tracing, according to an NVIDIA report [9], real-time ultrahigh-resolution rendering of complex virtual scenes on a PC will remain out of reach within the next decade.

B. IMAGE-BASED RENDERING METHODS
The input data of image-based rendering methods is an array of images. The common image-based rendering methods for SMV displays are volume rendering and light field rendering. Volume rendering based on the ray-casting technique [10] suffers from the same problem as the BRT method. Light field rendering for SMV requires considerable graphics memory, so it is impractical for rendering a large virtual scene in real time. The depth image-based rendering (DIBR) technique [11]-[19] is a promising 3D rendering method, but only for traditional SMV 3D displays with viewing angles of less than 10 degrees. In addition, the DIBR method generates images of poor quality and has difficulty filling holes and handling lighting in complex scenes. The multiple view plus depth (MVD) 3D representation carries both multiview color information and partial geometry information of the scene (carried by the multiview depth maps), which can be combined to render image data for an SMV 3D display.
Recently, many works on MVD have focused on removing holes and promoting image quality. The Gaussian mixture modeling (GMM) method to realize virtual view synthesis for MVD [20] is a valid approach for addressing the issue of image quality degradation. Emerging holes in a target virtual view can be greatly alleviated by making good use of other neighboring complementary views in addition to the two closest neighboring primary views [21]. Reference [22] presents the multiview video plus depth retargeting (MVDRT) technique for stereoscopic 3D displays, which takes shape preservation, line bending and visual comfort constraints into account and simultaneously optimizes the horizontal, vertical and depth coordinates in the display space. Reference [23] uses color correction of reference views and combines depth-based image fusion with direct color image fusion to decrease ghost effects. Additionally, the cracks are filled using depth filtering and inverse warping. To accelerate the generation of new viewpoint images, many references [24], [25] adopt a parallel computing scheme.
Although the MVD methods effectively remove holes and improve image quality, they do not consider the lighting problem, so a newly generated viewpoint image contains color distortions and inaccurate lighting, which degrades image quality. There are three types of light sources in virtual scenes: directional lights, point lights and spotlights. A material in a virtual scene contains four components: ambient, diffuse, specular and shininess. Notably, the scenes used by the above MVD methods consider only the ambient light and the diffuse component of the material when demonstrating a higher PSNR than other algorithms. This may be adequate for certain special video sequences, but the resulting image quality is unacceptable for virtual scene rendering.
In computer graphics, lighting is essential for rendering [26] and is based on a simple model of the interaction of materials and light sources. There are three familiar lighting models for real-time rendering: Phong [27], Blinn-Phong [28] and Cook-Torrance [29]. Forward shading is a straightforward approach: we render an object, light it according to all the light sources in the scene, and repeat this sequence for every object. This process is computationally intensive, as each rendered object must iterate over every light source for every rendered fragment. Deferred shading [30] overcomes this issue and is widely used in games and interactive 3D programs. It consists of a geometry pass and a lighting pass. The geometry pass retrieves geometric information from the objects and stores it in a collection of textures. The lighting pass then calculates the lighting for each fragment using the geometric information stored in these textures; the lighting pass is an image-based rendering step. The great advantage of deferred shading over traditional rendering algorithms is its worst-case computational complexity of O(N_o + N_l), where N_o and N_l denote the number of objects and the number of light sources, respectively.
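The cost contrast between the two approaches can be sketched in a few lines of Python. The object and light lists are hypothetical stand-ins, and each unit of work represents one lighting evaluation rather than real shading:

```python
# Minimal sketch contrasting the cost structure of forward and deferred
# shading. "objects" and "lights" are hypothetical stand-ins; one unit of
# work represents one lighting evaluation, not real shading.

def forward_shading(objects, lights):
    # Every object is lit against every light source: O(N_o * N_l) work.
    work = 0
    for _obj in objects:
        for _light in lights:
            work += 1
    return work

def deferred_shading(objects, lights):
    # Geometry pass: each object is rasterized once into the G-buffer.
    geometry_pass = len(objects)            # O(N_o)
    # Lighting pass: each light is applied once to the stored G-buffer.
    lighting_pass = len(lights)             # O(N_l)
    return geometry_pass + lighting_pass    # O(N_o + N_l)

forward_cost = forward_shading(range(100), range(8))    # 800
deferred_cost = deferred_shading(range(100), range(8))  # 108
```

For 100 objects and 8 lights, the multiplicative cost of forward shading (800 evaluations) already dwarfs the additive cost of deferred shading (108 passes), and the gap widens with scene complexity.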
In computer graphics, hybrid rendering is a common and important technique for solving rendering problems such as rendering quality and efficiency. For instance, a hybrid rendering method that combines color-coded surface rendering and volume rendering exploits the advantages of both, provides an excellent overview of the tracheobronchial system and clearly depicts the complex spatial relationships of anatomical and pathological features [31]. A multiview rendering hardware architecture consisting of hybrid parallel DIBR and pipeline interlacing is proposed in [32] to improve performance; it achieves 60 frames per second when processing full HD (1920 × 1080) video in a real-time processing system. A hybrid algorithm [33] for accurately and efficiently rendering hard shadows combines the strengths of shadow maps and shadow volumes, simultaneously avoiding the edge aliasing artifacts of standard shadow maps and the high fillrate consumption of standard shadow volumes.
The hybrid rendering technique that combines rasterization and real-time ray tracing has made great progress since the revolutionary NVIDIA Turing architecture [34] was introduced in 2018. In 2019, Barré-Brisebois et al. [35] proposed a hybrid rendering pipeline in which rasterization, compute, and ray tracing shaders work together to enable real-time visuals approaching the quality of offline path tracing.
Here, a new SMV rendering pipeline based on a hybrid rendering technique (HRT) is presented to address the problems of the previous SMV rendering methods. The proposed method introduces additional normal, diffuseness and shininess information and combines the advantages of ECVIR (superior-quality 3D images without viewing-angle limitations), the high rendering efficiency of the MVD technique, and the accurate lighting effects of deferred shading; it can therefore be regarded as a hybrid rendering technique.
The HRT rendering pipeline contains four steps. First, images of sparse reference viewpoints are generated. Second, multiple-view reprojection and hole filling are applied to generate images of new viewpoints. Third, the deferred shading technique composites target color images with accurate lighting effects from the target-view images (depth images, normal images, diffuseness images and specular images). Finally, the reconstructed 3D image is generated according to the viewpoint arrangement of the SMV 3D display.
The remainder of the paper is organized as follows. In Section II, a new SMV rendering pipeline based on the HRT method is proposed, and the principles of generating SMV 3D images with large viewing angles are illustrated. In Section III, we carry out experiments to demonstrate the validity of the HRT method. Finally, we conclude our work in Section IV.

II. THE HRT SMV RENDERING PIPELINE AND PRINCIPLES OF GENERATING SMV 3D IMAGES
The HRT SMV rendering pipeline, shown in Figure 1, contains four stages for rendering one frame of an SMV 3D image: sparse reference viewpoint image generation, dense viewpoint image generation, deferred shading and image synthesis. The first two stages increase the viewing angle and improve rendering efficiency. Deferred shading generates an accurate lighting effect for every viewpoint image. The image synthesis stage generates the reconstructed 3D image according to the parameters of the SMV 3D display.

A. SPARSE REFERENCE VIEWPOINT IMAGE GENERATION
The first stage applies the render-to-texture technique [36] and programmable shaders to generate the sparse reference viewpoint images, including depth images, normal images, diffuseness images and specular images. There are two common render-to-texture techniques for creating multiview images: single pass stereo (SPS) and Turing multiview rendering (TMVR). TMVR is a new technique that can simultaneously generate four viewpoint images, and its rendering speed is 2-3 times that of the SPS technique [9]. Therefore, we choose the TMVR technique to generate the reference viewpoint images and store them in G-buffers, as shown in Figure 1. TMVR is traditionally used only to generate two viewpoint color images for VR devices, but we use it to generate four kinds of images for the sparse reference viewpoints.
There are two kinds of input data in this stage: a 3D virtual scene and an N-view virtual camera array. As shown in Figure 2, each viewpoint image can be generated from a translational-offset virtual camera, which corresponds to a different off-axis asymmetric sheared view frustum with parallel view directions. Every virtual camera is determined by its view matrix and projection matrix, and the view matrix M_vn and projection matrix M_pn of the n-th virtual camera in the camera array can be determined by Equation (1) [37]. The view matrix sets the position and direction of the virtual camera. The projection matrix projects 3D world objects in homogeneous coordinates onto an image. M_vc and M_pc are the view matrix and projection matrix of the center virtual camera in the array, respectively. Because the virtual cameras at different positions have the same direction, each view matrix can be obtained by multiplying the translation matrix M_T and the view matrix M_vc. The translation matrix is determined by the distance between adjacent virtual cameras and the index of the virtual camera. The projection matrix of each virtual camera can be obtained by multiplying M_pc and the shear matrix M_shear:

M_vn = M_T M_vc, M_pn = M_shear M_pc, (1)
where d represents the distance between adjacent cameras and d h is the distance between the virtual camera array and the zero parallax plane.
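A minimal NumPy sketch of this matrix construction is given below. The center matrices, camera index, spacing d and zero-parallax distance d_h in the example call are hypothetical inputs, and the translation and shear sign conventions may differ from the paper's actual implementation:

```python
import numpy as np

def camera_matrices(M_vc, M_pc, n, n_center, d, d_h):
    """View/projection matrices of the n-th camera in the array (Eq. 1).

    M_vc, M_pc : 4x4 view and projection matrices of the center camera.
    d          : distance between adjacent virtual cameras.
    d_h        : distance from the camera array to the zero-parallax plane.
    """
    offset = (n - n_center) * d
    # Translation matrix M_T: moving the camera by +offset along x is a
    # -offset translation of the world in camera space.
    M_T = np.eye(4)
    M_T[0, 3] = -offset
    M_vn = M_T @ M_vc
    # Shear matrix M_shear: shears the frustum so that the zero-parallax
    # plane at distance d_h projects to the same screen window.
    M_shear = np.eye(4)
    M_shear[0, 2] = offset / d_h
    M_pn = M_shear @ M_pc
    return M_vn, M_pn

# Camera 0 of a 5-camera array (center index 2), 5 cm spacing, 2 m plane:
M_vn, M_pn = camera_matrices(np.eye(4), np.eye(4), n=0, n_center=2,
                             d=0.05, d_h=2.0)
```

Because only the off-diagonal entries differ per camera, the whole array of matrices can be prepared once per frame at negligible cost.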
Assuming that the viewing angle of the SMV 3D display is θ, d can be calculated by the following equation:

d = 2 d_h tan(θ/2) / (N − 1). (2)

Four images are generated as one group by each virtual camera. Using these images as intermediate results, the creation of the viewpoint color image with the deferred shading technique is illustrated in the third stage.

B. DENSE VIEWPOINT GENERATION
In this stage, dense target viewpoint images are generated from the sparse reference viewpoint images with an improved MVD technique. The difference between the traditional MVD technique and this stage is that the latter processes depth, normal, diffuseness and shininess information, whereas the former processes only depth and color information. In addition, the depth precision of the traditional MVD technique is 8 bits, while the depth precision in this stage is 32 bits. The basic approach for generating new viewpoint images consists of view reprojection and hole filling. In the first step, the points in a reference view are projected into 3D space and then projected to a new view (the reprojection view). This step introduces holes, which decrease the image quality of the reprojection view. However, texture information from the other reference views is used to inpaint most of the holes and avoid degrading the 3D image quality. In the hole-filling step, the remaining holes are filled by linear interpolation.
Each pixel in the depth image corresponds to one virtual point, and its position p can be determined by the following equation:

p = (M_pn M_vn)^(-1) (u, v, d, 1)^T, (3)

where (u, v, d, 1) represents the homogeneous coordinates under the normalized device coordinate system, (u, v) represents the texture coordinates, and d is the depth value.
The reprojection view can be obtained by shifting the image values in the horizontal direction according to the depth values. As illustrated in Figure 3(c), two virtual points (p_1, p_2) share the same projection point p_j but have different depth values (d_1, d_2) in the reprojection view. The shift values s_1 and s_2 can be calculated with the principle of similar triangles.
where x is the distance from the reference camera to the target camera. Because the depth of p_1 is larger than that of p_2, the shift value s_1 is larger than s_2, and the value of the nearer point p_2 is ultimately saved at p_j; therefore, occlusions are handled correctly. There are N reference views in our display system, so each viewpoint image in the dense target views has N reprojection views. With increasing x, the holes caused by view reprojection become more obvious. However, the target view can be synthesized from the N reprojection views, and most of the holes generated in one view reprojection image can be inpainted from the other view reprojection images.
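The reprojection step with its depth test can be sketched for a single image row as follows. The per-pixel shift values are assumed to be precomputed (e.g., via the similar-triangles relation), and the arrays in the example call are hypothetical:

```python
import numpy as np

def reproject_row(values, depth, shifts):
    """Reproject one image row: shift each pixel horizontally by its
    disparity, keeping the nearest point (minimum depth) when several
    source pixels land on the same target pixel."""
    w = len(values)
    out = np.full(w, np.nan)         # NaN marks holes to be filled later
    out_depth = np.full(w, np.inf)
    for u in range(w):
        t = u + shifts[u]
        if 0 <= t < w and depth[u] < out_depth[t]:
            out_depth[t] = depth[u]  # depth test: the nearer point wins
            out[t] = values[u]
    return out

row = reproject_row(np.array([1., 2., 3., 4.]),
                    np.array([5., 1., 5., 5.]), shifts=[1, 1, 0, 0])
# row[2] keeps the value of the nearer point; row[0] is a hole (NaN).
```

The same minimum-depth rule resolves the write conflicts between the parallel threads described below.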
Linear interpolation is the simplest approach for hole filling. Assuming that pixel (x, y) in target view m is a hole pixel, the nearest left valid pixel p_l and the nearest right valid pixel p_r are shown in Figure 4. The filling value for pixel (x, y, m) can then be calculated from the following equation:

V(x, y, m) = (d_r V_l + d_l V_r) / (d_l + d_r), (5)

where d_r is the distance between p_r and pixel (x, y, m), d_l is the distance between p_l and pixel (x, y, m), and V_l and V_r represent the values of pixels p_l and p_r, respectively. The pixel values can include depth, normal, diffuseness and specular values. Parallel computing is implemented in this stage to provide real-time view reprojection and hole filling. Given that the resolution of a target view is W×H and the number of reference views is M, M×W×H threads are set up in a compute shader, as shown in Figure 5. Thread (x, y, m) first reads the image value of the pixel with index (x, y) from the m-th reference view. The thread then writes to the corresponding pixel in the target view according to Equation (4). If several threads simultaneously write the same pixel in the target view, the value with the minimum depth is ultimately saved. The values of the remaining hole pixels are calculated from Equation (5).
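The distance-weighted interpolation of Equation (5) for one row can be sketched as follows. NaN marks hole pixels; a real compute-shader implementation would process the pixels in parallel rather than in a loop:

```python
import numpy as np

def fill_holes_row(row):
    """Fill the NaN holes in one row by distance-weighted interpolation
    between the nearest valid pixels p_l and p_r (Eq. 5)."""
    out = row.copy()
    valid = np.where(~np.isnan(row))[0]
    for x in np.where(np.isnan(row))[0]:
        left, right = valid[valid < x], valid[valid > x]
        if len(left) and len(right):
            pl, pr = left[-1], right[0]
            dl, dr = x - pl, pr - x
            # The nearer neighbor receives the larger weight.
            out[x] = (dr * row[pl] + dl * row[pr]) / (dl + dr)
        elif len(left):
            out[x] = row[left[-1]]   # boundary: only a left neighbor exists
        elif len(right):
            out[x] = row[right[0]]   # boundary: only a right neighbor exists
    return out

filled = fill_holes_row(np.array([1.0, np.nan, np.nan, 4.0]))  # [1, 2, 3, 4]
```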

C. DEFERRED SHADING STAGE
In this stage, the color image of a viewpoint with correct lighting effects can be generated from its four images, as shown in Figure 6. The lighting model used in this stage is the Blinn-Phong model, which is the most popular model in interactive 3D programs. The following exposition, adapted from [38], illustrates the detailed process of deferred shading.
For convenience, we consider only point lights, as shown in Figure 7. The total lighting equation of the Blinn-Phong model is shown in Equation (6):

I_tot = I_amb + I_diff + I_spec. (6)

The diffuse term I_diff, for the diffuse components of both light sources and materials, can be computed as follows:

I_diff = max(n · l, 0) M_diff ⊗ S_diff, (7)

where the operator ⊗ performs component-wise multiplication, l represents the unit vector from the virtual point p to the light source position S_pos, n is the surface normal at p, M_diff denotes the diffuse material color, and S_diff is the light source color. The values of p, M_diff and n can be obtained from the depth images, diffuseness images and normal images, respectively. The specular term is a key component in determining the brightness of specular highlights, while the shininess determines the size of the highlights. The specular term I_spec, which represents the specular parts of both light sources and materials, can be computed as follows:

I_spec = max(n · h, 0)^(M_shi) M_spec ⊗ S_spec, (8)

h = (l + v) / ||l + v||, (9)

where h is the unit half vector between l and v, and v is the view vector from the point p to the viewer.
The ambient term I_amb is represented by the following equation:

I_amb = Factor_amb M_diff, (10)

where Factor_amb modulates the value of M_diff so that all the objects in the scene reflect a small part of the diffuse contribution. From Equations (6)-(10), the color images can be easily calculated in the compute shader.
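Equations (6)-(10) can be evaluated per G-buffer sample as in the following sketch. The function arguments correspond to the quantities read from the depth, normal, diffuseness and specular images; the default ambient factor is a hypothetical choice:

```python
import numpy as np

def blinn_phong(p, n, m_diff, m_spec, m_shi, s_pos, s_diff, s_spec,
                view_pos, amb=0.1):
    """Blinn-Phong lighting (Eqs. 6-10) for one G-buffer sample and one
    point light. p and n come from the depth and normal images; m_diff,
    m_spec and m_shi come from the diffuseness and specular images."""
    l = s_pos - p
    l = l / np.linalg.norm(l)                # unit vector toward the light
    v = view_pos - p
    v = v / np.linalg.norm(v)                # unit vector toward the viewer
    h = (l + v) / np.linalg.norm(l + v)      # half vector (Eq. 9)
    i_amb = amb * m_diff                                        # Eq. 10
    i_diff = max(np.dot(n, l), 0.0) * m_diff * s_diff           # Eq. 7 (⊗)
    i_spec = max(np.dot(n, h), 0.0) ** m_shi * m_spec * s_spec  # Eq. 8
    return i_amb + i_diff + i_spec                              # Eq. 6
```

In the deferred shading stage, this evaluation runs once per fragment per light over the G-buffer, independently of scene geometry.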

D. IMAGE SYNTHESIS STAGE
The viewpoint mask used as input data is a 2D buffer that records the viewpoint arrangement of the SMV 3D display. The relationship between the viewpoint number v_kl and the subpixel index (k, l) can be represented by the following equation [39], [40]:

v_kl = N_tot · ((k − 3l tan α) mod ((m+1)p_u / (m p_h cos α))) / ((m+1)p_u / (m p_h cos α)), (11)

where the microlens magnification m can be expressed in terms of the optimal viewing distance D and the lens focal length f as m + 1 = D/f, p_u is the lens pitch, p_h is the horizontal subpixel pitch, α is the slant angle of the lens, and N_tot is the total number of viewpoints. In graphics memory, the dense viewpoint color images from the deferred shading stage can be used to obtain all the subpixel values of the final 3D image. This stage works through parallel computing and involves only assignment operations; therefore, its time consumption is small enough to be ignored.
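Equation (11) can be sketched as follows. All display parameters in the example call (slant angle, magnification, pitches) are hypothetical values, not those of the experimental display:

```python
import numpy as np

def viewpoint_index(k, l, alpha, m, p_u, p_h, n_tot):
    """Map subpixel index (k, l) to a viewpoint number (Eq. 11) for a
    slanted-lens SMV display."""
    x = (m + 1) * p_u / (m * p_h * np.cos(alpha))  # lens pitch in subpixels
    return n_tot * ((k - 3 * l * np.tan(alpha)) % x) / x

# One subpixel of a hypothetical 100-viewpoint display with a 1/6 slant:
v = viewpoint_index(k=10, l=4, alpha=np.arctan(1 / 6), m=400.0,
                    p_u=0.3, p_h=0.1, n_tot=100)
```

Evaluating this mapping once per subpixel fills the viewpoint mask; the synthesis stage then only copies the corresponding subpixel from the indicated viewpoint image.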

III. EXPERIMENTS AND RESULTS
The HRT method is implemented in the compute shader and fragment shader. The PC hardware includes an Intel Core i9-9980XE (4.26 GHz) CPU with 16 GB of RAM and an NVIDIA GeForce GTX 1080 GPU with 8 GB of RAM. The GPU is the main factor that affects the rendering frame rate. Six 3D models, including monkey, car, heart, buildings, Manhattan and furniture, are used to test the performance of the HRT method. Monkey, car, heart and buildings are simple 3D models, while Manhattan and furniture are complex 3D models. The numbers of vertices, faces and triangles in the models are listed in Table 1. An 8K SMV 3D display with an 80-degree viewing angle and 100 viewpoints is used in the experiment; the other parameters of the display are shown in Table 2. In our experiment, the PSNR is applied to measure the squared intensity differences between the synthesized and ideal view image pixels. The ideal view image can be obtained with the ECVIR or BRT methods. Then, based on the average PSNR performance, we compare the results of the HRT method with those of state-of-the-art methods, namely, GMMDIBR [20] and MVDRT [22]. Different hole-filling methods are applied in GMMDIBR, MVDRT and HRT to refine the blended image. The input data of GMMDIBR and MVDRT are color images and depth images, while the input data of HRT are depth images, normal images, diffuseness images and specular images. Figure 10 shows that the proposed HRT method outperforms the existing state-of-the-art MVD methods. The number of reference views is four, the number of frames is 200, and the resolution of the reference views is 1024 × 768. The PSNR improvement ranges from 6.735 dB to 13.798 dB with an average of 9.54 dB over GMMDIBR, and from 6.34 dB to 11.312 dB with an average of 7.15 dB over MVDRT.
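The PSNR measure used for the comparison can be computed as in this short sketch, assuming 8-bit images with a peak value of 255:

```python
import numpy as np

def psnr(img, ref, peak=255.0):
    """Peak signal-to-noise ratio between a synthesized view image and
    the ideal view image rendered with ECVIR or BRT."""
    mse = np.mean((np.asarray(img, dtype=np.float64)
                   - np.asarray(ref, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```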
Because the same virtual point should have different colors in different reference views, the new viewpoint images generated by GMMDIBR and MVDRT contain many erroneous pixels, as shown in Figure 11. Color images and depth images cannot provide enough material information to generate new viewpoint images with accurate lighting, so the PSNR values of these methods are lower than that of the HRT method. Figure 12 illustrates that the proposed HRT pipeline performs better as the number of reference views increases. The improvements range from 2.169 dB to 3.303 dB with an average of 2.432 dB using three reference views, and from 3.585 dB to 4.630 dB with an average of 3.726 dB using four reference views.
Observers standing at different positions should perceive different colors for one virtual point because of the different unit half vectors in Equation (8). The specular term I_spec directly affects the image quality of the newly generated viewpoint, as shown in Figure 13. The values of M_spec and S_spec are (1.0, 1.0, 0.0) and (0.7, 0.7, 0.7), respectively. When M_shi is zero, the monkey material is an ideal diffuse material, and the PSNRs of GMMDIBR and MVDRT are higher than that of the HRT method because their hole-filling methods outperform that of HRT under these conditions. With increasing M_shi, the PSNR values of GMMDIBR and MVDRT decrease quickly, while the PSNR value of HRT increases slowly. Therefore, HRT is more suitable than GMMDIBR and MVDRT for rendering most virtual scenes.
The rendering times of the different methods and of every stage of HRT are shown in Table 3 and Table 4. As shown in Table 3, the rendering time is related to the complexity of the virtual scene. The results also illustrate that the HRT method has an obvious advantage in rendering efficiency over the ECVIR and BRT methods. Because HRT must process more input data and has more stages than GMMDIBR and MVDRT, it requires more time to render one frame of a 3D image, although the frame rate still exceeds 35 fps on average. As depicted in Table 4, the time consumption of the last three stages is independent of the 3D model. Figure 14 illustrates the main factors affecting rendering efficiency. The rendering frame rates of the six models decrease rapidly with an increasing number of target views in Figure 14(a); however, the frame rate remains above 20 fps even when the viewpoint number reaches 200. As shown in Figure 14(b), the frame rate decreases rapidly with increasing target view resolution because the number of computing units in the GPU is limited. In addition, as shown in Figure 14(c), increasing the number of reference viewpoints decreases the frame rate because the first stage consumes more time. The final 3D image displayed on the LCD panel of the SMV 3D display is generated by the image synthesis stage, as shown in Figure 15. The proposed HRT algorithm is implemented on an 8K SMV 3D display, and the real-time reconstructed 3D images from different perspectives are shown in Figure 16.

IV. CONCLUSION
In summary, a new SMV rendering pipeline based on the hybrid rendering technique (HRT) is constructed that can generate accurate lighting effects in real time when the number of viewpoints of the SMV 3D display is greater than 50, the viewing angle is greater than 100 degrees, the resolution of a single viewpoint image is more than 512 × 512 and the resolution of the LCD panel is 7680 × 4320; in particular, complex scenes can be rendered in real time. Real-time 3D optical reconstruction with accurate lighting effects is realized on an 8K SMV 3D display with an 80-degree viewing angle and 100 viewpoints. The main factors affecting the rendering efficiency are the number of target views, the number of reference views and the resolution of the target views. Experiments demonstrate that when the number of reference views is four and the resolution of the target view is 1024 × 768, the frame rate is more than 35 fps, the PSNR value of HRT is greater than 36 dB, and the rendering result has a good lighting effect. The HRT method has an obvious advantage in image quality and lighting effects over GMMDIBR and MVDRT for most virtual scenes.
VOLUME 8, 2020