Group-Based Adaptive Rendering System for 6DoF Immersive Video Streaming

The Moving Picture Experts Group (MPEG) has started an immersive media standardization project to enable multi-view video and depth representation of three-dimensional (3D) scenes. The MPEG Immersive Video (MIV) standard is intended to provide limited 6 degrees of freedom (6DoF) based on depth image-based rendering (DIBR). The 6DoF immersive video system is still challenging because multiple high-quality video streams require high bandwidth and computing resources. This paper proposes a group-based adaptive rendering method for 6DoF immersive video streaming. With group-based MIV, each group can be transmitted independently, which enables adaptive transmission depending on the user's viewport. The proposed method derives per-group weights for view synthesis and allocates high-quality bitstreams according to a given viewport. This paper also discusses the results of the group-based approach in MIV and details the advantages and drawbacks of this approach. In addition, a pixel rate constraint analysis is introduced to facilitate deployment with existing video codecs. On end-to-end evaluation against the TMIV anchor, the proposed method saves an average of 37.26% Bjontegaard-delta rate (BD-rate) on peak signal-to-noise ratio (PSNR).

The rest of the paper is organized as follows: Section 2 introduces the background and related work regarding MIV standardization and group-based MIV. Section 3 presents the proposed adaptive rendering system and the view weighting calculation method. Section 4 describes the experimental setup and the metrics used in our evaluation, and discusses the results of group-based MIV and the proposed adaptive rendering system. Our comprehensive set of evaluations includes empirical evaluations based on the common test conditions (CTCs) from the MPEG-I community.

This section describes the MIV standard's immersive video technology in general. In addition, this paper briefly discusses group-based TMIV and multiview streaming research. The MPEG-I project (ISO/IEC 23090) launched its first-phase standardization in 2018 for 3DoF technology, which provides three rotational degrees of freedom, including Part 2, the omnidirectional media format (OMAF) [7]. Subsequently, discussions were held on 6DoF technology, which provides full movement of the user in three-dimensional space, followed by a call for proposals (CfP) at the 125th meeting to define and standardize Part 12, MPEG Immersive Video [2], [8], [9], [10], [11]. The concept of the 6DoF system architecture contains pre-processing and post-processing modules for removing inter-view redundancy. Several studies implemented 6DoF systems by down-sampling multiple videos and eliminating the correlation among them [4], [12]. Based on the proposed responses, MPEG-I adopted TMIV as the reference software for 6DoF video compression. TMIV supports pre-processing and post-processing for transmitting multiview videos to compress 6DoF videos more efficiently.

TMIV supports group-based encoding to produce better rendering results with locally coherent projections [19]. Group-based encoding has shown improvement in the subjective and objective results of TMIV, particularly at high bitrate levels. The main feature of group-based encoding is dividing all input source views into groups to be processed separately, and this feature enables TMIV to preserve important regions (for example, foreground objects and occluded scenes) in each group. Moreover, this feature can also lead to sub-bitstream accessibility across groups. This paper focuses on the fact that this method facilitates sub-bitstream separation, which enables adaptive rendering in MIV. Such a region of interest (ROI)-based approach can support viewport-dependent streaming of multiview content, depending on the field of view (FoV) of the viewer.

This method simply separates all input source views into groups using camera parameters, and each group encodes its views independently. First, the view pool includes all the input source views, and the camera parameters are listed. Then, the dominant axis is assigned as the axis with the largest valid range among the X, Y, and Z coordinates. The dominant axis is used to set the key position, and cameras are selected in ascending order of distance from the key position. Each selected camera is labeled with the current group and removed from the pool of views. The second key position is then assigned, and the process is repeated until all source views are covered across the chosen number of groups. Using group-based coding, MIV produces better rendering results, especially for natural content sequences [20].
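As an illustration of this grouping procedure, the following Python sketch partitions cameras by the dominant-axis heuristic described above. It assumes cameras are reduced to their 3D positions and that groups are filled to roughly equal size; the actual TMIV group splitter works on the full camera parameter list, so this is a simplified sketch rather than the standard's exact procedure.

```python
import numpy as np

def split_views_into_groups(camera_positions, num_groups):
    """Partition source views into spatially coherent groups (sketch)."""
    positions = np.asarray(camera_positions, dtype=float)  # (N, 3): X, Y, Z
    pool = list(range(len(positions)))

    # Dominant axis: the coordinate axis with the largest valid range.
    ranges = positions.max(axis=0) - positions.min(axis=0)
    dominant = int(np.argmax(ranges))

    views_per_group = -(-len(pool) // num_groups)  # ceiling division
    groups = []
    # First key position: low end of the bounding box (an assumption).
    key = positions.min(axis=0)
    for _ in range(num_groups):
        # Pick the cameras closest to the key position, in ascending
        # order of distance, and label them with the current group.
        order = sorted(pool, key=lambda v: np.linalg.norm(positions[v] - key))
        take = order[:views_per_group]
        groups.append(take)
        pool = [v for v in pool if v not in take]
        if pool:
            # Next key position: the remaining camera farthest along the
            # dominant axis, so groups spread across the capture rig.
            key = positions[max(pool, key=lambda v: positions[v][dominant])]
    return groups
```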
Figure 2 illustrates an example of a group cluster produced by group-based encoding.

Figure 3 shows the proposed system for adaptive rendering. The system we propose includes the following considerations. The MIV technology enables services to be provided on current-generation or near-next-generation hardware platforms. The pixel rate and simultaneous decoder instantiation restrictions become key considerations for actual implementation on real hardware. As the number of video decoder instantiations increases, parallel processing performance decreases significantly. Therefore, minimizing the number of instantiations should be considered in multiview streaming. From these results, a practical maximum of two video decoders is assumed. If the lower limit of the frame rate is stretched down to 30 fps, then up to four decoders can be instantiated. Based on this constraint, TMIV tries to keep the spatial resolution of the input videos below 4096 × 2048, even though the process of inter-view redundancy removal causes information loss. Furthermore, beyond the HEVC decoder, the view must be rendered by a view synthesizer using the decoded streams. The larger the number of streams, the larger the memory and buffer capacity required.

Table 1 presents the pixel rate constraints for immersive videos. The maximum luma sample rate is the luma sample rate per second across all decoders. The maximum luma picture size is the picture size of each decoder instantiation ('MP' denotes megapixels).

In the MIV standard, the view weighting synthesizer (VWS) is adopted as the renderer software for synthesizing and rendering virtual viewpoints in TMIV [30]. The main feature of VWS is that when synthesizing a virtual viewpoint, a blending weight is calculated and reflected in the rendering process [31]. VWS generates a warped geometry map for each input view by unprojecting and reprojecting pixels from this view towards the target view. When several warped images are blended, a color value for each pixel inside the virtual view is computed using the weighted average of the warped pixels' color information. This calculation can be expressed as:

c_v(p) = \frac{\sum_{i} w_i(p)\, c_i(p)}{\sum_{i} w_i(p)}
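To make the blending step concrete, the following sketch applies this weighted average to a stack of warped views. The array shapes and the epsilon guard against zero total weight are assumptions of this sketch, not part of the VWS specification.

```python
import numpy as np

def blend_warped_views(warped_colors, weights, eps=1e-6):
    """Weighted-average blending of warped views into a viewport (sketch).

    warped_colors: (n_views, H, W, 3) color images warped to the target view.
    weights:       (n_views, H, W) per-view blending weights w_i(p),
                   e.g. taken from the visibility maps.
    """
    w = np.asarray(weights, dtype=float)[..., np.newaxis]  # (n, H, W, 1)
    weighted_sum = (np.asarray(warped_colors, dtype=float) * w).sum(axis=0)
    total_weight = w.sum(axis=0)
    # Pixels seen by no view keep a zero color instead of dividing by zero.
    return weighted_sum / np.maximum(total_weight, eps)
```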

c_v(p) is the color value of a pixel p in the viewport image, c_i(p) is the color value of pixel p in the i-th warped image, and w_i(p) is the blending weight of pixel p in the i-th warped image. Specifically, this paper uses the blending weight factor w_i(p) of a warped image, namely the visibility map. The rendering process with VWS is represented in Figure 4. Using this visibility map, the contribution of the input views is calculated, and this information is used in the rendering process to synthesize more accurate and higher-quality virtual viewpoints for view blending. To achieve efficient 6DoF streaming, this paper proposes an adaptive rendering method for immersive video streaming and implements the group weighting module with the visibility map. When transmitting 6DoF video to render a scene, we have to choose the quality of representation of the bitstreams. Considering the rendering method of VWS, this paper proposes a method of calculating the contribution of each group and reflecting the contribution information when transmitting the bitstream of the priority group. A group containing many views with a high contribution is defined as a priority viewpoint group in the adaptive rendering framework.

The notation used in the allocation algorithm is as follows:
- R_t: total target bitrate, chosen from {…, 9, 16, 28, 50} Mbps;
- L: quality level, with QP = {22, 27, 32, 37, 42, 47} representations;
- g_i^L: the i-th group atlas bitstream with a bitrate corresponding to level L;
- g_i^{L,bit}: the bitrate of g_i^L at representation L;
- G: the list of groups sorted by view weight.

When a viewpoint is given, a priority viewpoint group is determined by calculating the viewpoint contribution of each view. Using the weight information, a high-quality bitstream is selected first for the priority viewpoint group to allocate a bitrate, and the groups other than the viewpoint group select and transmit quality bitstreams corresponding to the remaining bitrate. Atlases are pre-encoded with different quality representations that can be transmitted independently using group-based encoding in the proposed system. When the user's gaze information is given, the renderer calculates the contribution of each view to the virtual viewpoint. Then, the view weight module derives the bitrate allocation from these contributions. Once the target bitrates have been determined, a basic bitrate allocation strategy is applied, in which the target bitrate is provided at high quality to the priority group first. All groups are assigned the lowest-quality bitstream, and then higher-quality bitstreams are selected so that the priority group occupies as much view weight as possible. As a result, the bitstream of a group with a high contribution is assigned a high bitrate, enabling adaptive rendering at a given target bitrate. The total target bitrate is denoted R_t, and the bitrate of the i-th group bitstream is denoted g_i^bit. Algorithm 2 details the operation of the proposed method of allocating bitrates to each group based on the computed view weighting information.
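A minimal sketch of this allocation strategy is shown below. It is not the paper's exact Algorithm 2; the ladder data structure, the greedy upgrade order, and all names are assumptions made for illustration.

```python
def allocate_bitrates(group_weights, ladder, target_bitrate):
    """Greedy quality allocation across group atlas bitstreams (sketch).

    group_weights: dict mapping group id -> view weight (contribution).
    ladder[g]:     list of (bitrate, qp) representations for group g,
                   sorted from lowest to highest bitrate.
    """
    # Start every group at its lowest-quality representation.
    choice = {g: 0 for g in group_weights}
    spent = sum(ladder[g][0][0] for g in group_weights)

    # Upgrade groups in descending view-weight order (priority group first)
    # while the remaining budget allows, so high-contribution groups end up
    # with the high-bitrate representations.
    for g in sorted(group_weights, key=group_weights.get, reverse=True):
        while choice[g] + 1 < len(ladder[g]):
            step = ladder[g][choice[g] + 1][0] - ladder[g][choice[g]][0]
            if spent + step > target_bitrate:
                break
            choice[g] += 1
            spent += step
    return {g: ladder[g][i] for g, i in choice.items()}

# Illustrative numbers only: three groups, the second dominating the viewport.
ladder = {g: [(4, 47), (9, 37), (16, 27)] for g in range(3)}  # (Mbps, QP)
print(allocate_bitrates({0: 0.1, 1: 0.7, 2: 0.2}, ladder, target_bitrate=28))
```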

In this section, the efficiency of the proposed method is verified through video quality measurements on the six MIV test sequences, including Street, under the MPEG-I CTCs [33]. According to the CTCs, the target bitrates for the test sequences are {…, 9, 16, 28, 50} Mbps, and in the MIV anchor the texture QPs are sequence-dependent.

Through the proposed adaptive rendering framework, priority groups with a high contribution to rendering virtual viewpoints are transmitted with high quality, enabling more efficient adaptive streaming than transmitting each group's quality uniformly. Although group-based encoding has the advantage of preserving more complete views, the pixel rate of each atlas increases, consuming more resources to process them. Group-based encoding also has the disadvantage of increasing redundancy between views compared to TMIV. The pixel rate results are summarized in Table 5. Group-based encoding has a negative impact on the pixel rate because inter-view redundancy removal is constrained to operate within each group.

To ensure compatibility with existing video codecs, TMIV considers the pixel rate as well as the coding efficiency, and it is important to understand both aspects.
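As a rough illustration of such a constraint check, the sketch below validates a set of decoded atlas streams against an aggregate luma sample rate, a per-decoder picture size, and a decoder instantiation limit. The function and its parameters are assumptions of this sketch; the example numbers are HEVC Level 5.2-style figures used for illustration only.

```python
def check_pixel_rate(streams, max_luma_sr, max_luma_ps, max_decoders):
    """Check atlas streams against decoder pixel-rate limits (sketch).

    streams: list of (width, height, fps) tuples, one per decoder instance.
    """
    if len(streams) > max_decoders:
        return False  # too many simultaneous decoder instantiations
    total_sr = sum(w * h * fps for w, h, fps in streams)
    if total_sr > max_luma_sr:
        return False  # aggregate luma sample rate exceeded
    return all(w * h <= max_luma_ps for w, h, _ in streams)

# Example: two 4096x2048 atlases at 30 fps against Level 5.2-style limits.
ok = check_pixel_rate([(4096, 2048, 30)] * 2,
                      max_luma_sr=1_069_547_520,  # illustrative max luma samples/s
                      max_luma_ps=8_912_896,      # illustrative max luma picture size
                      max_decoders=2)
```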

With the proposed bit allocation algorithm, the group with a high contribution to view rendering is transmitted with a high-quality representation. This approach is related to prior tile-based 360° video streaming research, but the proposed method is more easily adaptable to group-based multiview streaming.

The performance of the proposed method is evaluated for all sequences [36]. The results are measured in PSNR and IV-PSNR for the proposed method compared to TMIV. For the low pixel rate condition (N = 2, 3), the proposed method shows an average BD-rate reduction of 36.24% in PSNR compared to the TMIV anchor for the six sequences, and 30.85% in IV-PSNR. The proposed method also shows an average BD-rate reduction of 11.56% in PSNR and 10.63% in IV-PSNR compared to group-based TMIV. Table 8 shows the experimental conditions for the decoding time measurements. Figure 7 presents the TMIV decoding time for each number of groups. As N increases, the overall decoding time also tends to increase. Even though parallel processing is currently possible in TMIV, the decoding time increases significantly with the number of groups. The proposed method is capable of efficient transmission even when N is low; therefore, it has the advantage of enabling more realistic deployment. The subjective quality comparison results for the proposed method with the p03 view are shown in Figure 8. For all sequences, significant rendering artifacts are observed in the TMIV anchor, but the proposed method achieves lower distortion in the synthesized results at the same total bitrate. A comparison result is shown when each sequence is rendered at a target bitrate. The group-based encoding results show that viewpoint synthesis is improved during rendering, as reported in the proposal document [20]. Furthermore, the red region represents the part where quality improvement is identified in the result of applying the proposed adaptive quality allocation method. The proposed method allocates high quality to the priority group, so it can be observed that patches for texture parts are mainly replaced with high quality within the same bitrate, and the subjective quality is also improved.
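The BD-rate figures above follow the Bjontegaard metric. For reference, a common way to compute it (a sketch, not necessarily the exact tooling used in these experiments) fits cubic polynomials to the rate-distortion curves in the log-rate domain and integrates their difference over the overlapping quality interval:

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Bjontegaard-delta rate between two RD curves, in percent (sketch).

    Each curve is given as matching lists of bitrates and PSNR values
    (at least four rate points for the cubic fit). Negative results mean
    the test configuration saves bitrate at equal quality.
    """
    # Fit log-rate as a cubic function of PSNR for both curves.
    p_a = np.polyfit(psnr_anchor, np.log(rates_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rates_test), 3)
    # Integrate over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```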