Viewport History as a Heuristic for Quality Enhancement and Quality Variation Control in Viewport-Aware Tile-Based 360-Degree Video Streaming

Despite the growing popularity of Virtual Reality (VR), 360-degree videos are often regarded as challenging to stream due to their large bandwidth requirement. As a solution, the 360-degree video content is spatially divided into tiles, and the quality level for each tile is selected based on the user’s network environment and viewport information. To determine the high quality tiles, viewport prediction and viewport history methods are used to estimate the user’s viewport. However, due to the unpredictability of user head movements, generating accurate viewport estimates is difficult, which can severely degrade the Quality of Experience (QoE) for the user. In this paper, to sustain high user QoE, we detail a novel tile quality selection algorithm that employs viewport prediction, viewport history, viewport extensions, and a viewport tile count limit. In addition, we include a comparative analysis on six 360-degree videos that vary in content pace. Based on simulations, we report two findings. First, using viewport history as a heuristic for tile quality selection significantly increased the perceived quality while suppressing quality variation inside the viewport and across segments compared to eight reference methods. Second, 360-degree videos slow in content pace tended to result in lower viewport prediction accuracy, lower QoE performance, and weaker viewport history trends compared to 360-degree videos fast in content pace.

virtual environments and can freely adjust their Field of View (FoV)/viewport of the virtual scene by adjusting their head orientation. However, there are challenges in delivering 360-degree videos to users. In particular, these challenges include the bandwidth requirement, the motion-to-photon (MTP) latency, and the user Quality of Experience (QoE) [3]. Moreover, in consideration of the growing popularity of VR and related technology, such challenges are imperative for research communities and industries to resolve in order to support 360-degree video streaming at large scales [3]. The three challenges are further described in the following.

The associate editor coordinating the review of this manuscript and approving it for publication was Tai-Hoon Kim.

The first challenge is satisfying the bandwidth requirement to stream 360-degree video content. Unlike 2D video content, the bandwidth requirement is magnitudes larger for 360-degree video content. For example, to stream 360-degree and 2D video content in 4K, the data rate is 400Mbps and 25Mbps, respectively [3]. However, the user's viewport covers less than 20% of the entire 360-degree virtual scene [4], meaning that over 320Mbps from the above 400Mbps of data is needlessly expended. This data wastage is particularly

misprediction. This is because the tiles comprising the user's viewport will be displayed entirely or partially in low quality. Based on this tradeoff between viewport prediction accuracy and user QoE, tiled segments are typically prefetched in advance by 1∼2 seconds [10]. Larger segments are generally preferred over shorter segments by incurring less network overhead, but larger segments provide less adaptability to changes in the user's network environment and viewing experience [15]. Therefore, reducing the severity of viewport mispredictions is imperative for maintaining high user QoE with respect to the network environment.

To improve user QoE due to viewport mispredictions, we propose a viewport prediction-based and viewport history-based tile quality selection algorithm for on demand tile-based 360-degree video streaming. The proposed algorithm uses the viewport information of past users as a heuristic for quality control inside the viewport and for performing viewport extensions.
Moreover, unlike our conventional algorithm in [18], we expand this work by improving the algorithmic logic during the viewport enhancement and viewport extension phase to further increase the perceived quality and reduce quality switching across segments. In addition, we also include four additional 360-degree video contents to the evaluation environment. This is for highlighting tendencies concerning the viewport prediction accuracy, viewport history, and user QoE in relation to the video content pace.

The main contributions of this paper are as follows.
• We propose, to our knowledge, the first formally combined viewport prediction-based and viewport history-based tile quality selection algorithm.
• We propose a viewport tile count limit as a control means to increase the perceived quality and suppress quality variation across segments.
• We demonstrate the difference in viewport prediction accuracy, viewport history, and user QoE performance in relation to the content pace of the 360-degree video.
• We demonstrate the difference in user QoE performance in relation to viewport prediction-based and viewport history-based tile quality selection methods.
VOLUME 10, 2022

The rest of this paper is organized as follows.

predict the user's viewport for the next segment as in [11]. The linear regression method uses the roll, pitch, and yaw head orientation information contained in the time interval of t_0 − 1 to t_0 to predict the user's future head orientation at a time one prediction window after t_0. We adopt a simple viewport prediction method because it is not within the scope of this paper to design such a method. In addition, the proposed method is not limited to linear regression for viewport prediction.
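The linear regression step described above can be illustrated with a minimal sketch. It is not the implementation used in this paper or in [11]. The function names, the sample layout `(t, roll, pitch, yaw)`, and the `horizon` parameter (standing in for the prediction window size) are assumptions made for illustration. Angle wrap-around at ±180 degrees is ignored for simplicity.

```python
from typing import List, Sequence, Tuple


def fit_line(ts: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    if var_t == 0:
        # All samples share one timestamp: fall back to a constant prediction.
        return 0.0, mean_y
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def predict_orientation(
    samples: List[Tuple[float, float, float, float]], horizon: float
) -> Tuple[float, float, float]:
    """Extrapolate (roll, pitch, yaw) to `horizon` seconds after the last sample.

    `samples` holds (t, roll, pitch, yaw) tuples from the last second of
    head-tracking data (hypothetical layout). Each axis is fitted independently.
    """
    ts = [s[0] for s in samples]
    t_target = ts[-1] + horizon
    predicted = []
    for axis in (1, 2, 3):
        slope, intercept = fit_line(ts, [s[axis] for s in samples])
        predicted.append(slope * t_target + intercept)
    return predicted[0], predicted[1], predicted[2]
```

For example, three samples whose yaw grows by 10 degrees per second extrapolate linearly, while a constant pitch stays constant.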
Figure 1

However, there is a limit to how much the viewport can be extended and to what quality. This means that it is just as important as viewport predictions to select effective viewport extensions. Accordingly, to realize effective viewport extensions, we showed in our previous work of [18] that users' viewport history can be used, even if the viewport was mispredicted. In addition, as a viewport extension technique, we do not suggest using strict tile group classifications like in [23] because severe quality level gaps between tiles can occur when there is a misprediction. In detail, this happens because in [23] all tiles located in the north and south poles of the equirectangular projection, the most common two-dimensional plane projection method for 360-degree content [10], are classified as outside tiles when not a viewport tile, and the quality level difference between the two tile groups of viewport and outside, especially if large, can further decrease user QoE by inducing higher quality variation inside the viewport when mispredicted [8], [9]. Therefore, to avoid inducing such quality variation, all tiles should be initially considered as a candidate for quality enhancement instead of predetermining which tiles are outside/background tiles.

Viewport history is simply the aggregated viewport information of multiple users for a 360-degree video. To illustrate, Figure 2 presents the viewport history of 50 users for 60s of video content in a 5 × 5 equirectangular tile map, in which the popularity of a tile is defined as the affinity score from Equation 1. As illustrated, viewport history can characterize popular tiles within the 360-degree space and can also assist with viewport prediction, viewport extensions, bandwidth allocation, and caching

proposes an adaption method for tile-based streaming that transitions from viewport prediction-based to tile popularity-based according to the network environment; and reference [24] prefetches the enhancement layers of select tiles in advance from a descending order tile popularity list for scalable video coding. However, when only non-user dependent features such as tile popularity are used as the basis for tile quality selection, the user is discouraged from looking around the virtual space. This is because only such tiles will be predominantly encoded in higher quality, irrespective of the user's viewport information. Therefore, to promote the user's individualized viewing experience, non-user dependent features should be used as a supplementary tool for viewport-aware systems

file. In the proposed system, however, the server also holds information on the popularity of each tile, and this information is also stored inside the MPD file.

To start the streaming session, the client downloads the server's MPD file, and the client's MPD Parser confirms the quality and number of tiles that are available. From reading the MPD file, the user's Adaptive Bit Rate controller selects the appropriate quality for each tile according to the client's network environment and viewport information. Afterwards, the client sends a request for the tiled segment to the server using an HTTP GET request, and when the requested segment arrives, the client places the segment inside a buffer and plays the video content sequentially from the buffer.

To determine the popularity of a tile, the tile's affinity score is calculated. The affinity score of a tile for a given segment is derived from the viewport history of all past users. In particular, to calculate the affinity score a_{i,j} for tile j in segment i, the mean number of occurrences tile j overlapped with all users' viewports for all frames contained in segment i is taken, where a_{i,j} in Equation 1 represents the normalized tile affinity score. The affinity score of a tile ranges from 0∼1. Accordingly, an affinity score closer to 1 indicates a tile greater in popularity. This affinity score calculation is performed in advance by the server, and the server can perform this calculation as needed.

To address viewport mispredictions and quality variation, we adopt tile affinity scores as a heuristic for both tile quality level selection and control. Algorithm 1 presents the pseudocode of the proposed tile quality selection
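As an illustration of the affinity score described above, the following sketch computes, for one segment, the fraction of (user, frame) samples in which each tile overlaps a viewport. It is one plausible reading of the verbal definition and not a reproduction of Equation 1. The function name and the input layout `views[u][f]` are assumptions, and normalizing by the number of samples (rather than, say, by the most popular tile) is also an assumption.

```python
from typing import List, Set


def affinity_scores(views: List[List[Set[int]]], num_tiles: int) -> List[float]:
    """Affinity score in [0, 1] for every tile of one segment.

    views[u][f] is the set of tile indices that overlap the viewport of
    past user u at frame f of the segment (hypothetical layout).
    """
    counts = [0] * num_tiles
    samples = 0
    for user_frames in views:
        for tiles in user_frames:
            samples += 1
            for j in tiles:
                counts[j] += 1
    if samples == 0:
        # No viewport history yet: every tile is equally unpopular.
        return [0.0] * num_tiles
    return [count / samples for count in counts]
```

With this normalization, a tile seen by every past user in every frame scores 1, and a tile never seen scores 0, matching the stated 0∼1 range.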

Algorithm 1 Tile Quality and Viewport Extension Selection
    ← number of tiles in tile group
    cutoff ← viewport tile count limit
    N_cutoff ← number of demoted viewport tiles
Output: Q(j) ← assigned quality level l to tile j
Exceptional Initial:
1: if Seg_M = Seg_1 then
2:     Viewport ← History-based
3: else
4:     Viewport ← Prediction-based
Quality Initialization:

Third, in the Viewport Enhancement phase at L8-L26, the viewport is enhanced. This phase can be divided into two parts, with the first part being L8-L11 and the other L12-L26. In the first part, all viewport tiles are enhanced to the highest possible quality level. At L8, the bandwidth used in the Quality Initialization phase is returned to the bandwidth budget for the viewport tiles. At L9-L10, all viewport tiles are uniformly assigned the highest possible quality level, and these tiles are called viewport tiles in Figure 4.

ranging from 7.24 to 8.11 tiles, for all six videos and 50 users.

As a result, quality switching between segments can occur if a limit on the number of viewport tiles to enhance is absent. Therefore, in the second part, a viewport tile count limit called cutoff is used to rearrange the viewport quality level based on affinity scores. At L12, if the current quality level of the viewport is less than the highest quality level available and the number of viewport tiles exceeds cutoff, then the number of viewport tiles exceeding cutoff will be demoted in quality and the remaining viewport tiles will be promoted in quality.

To determine which viewport tiles will be either demoted or

The value for cutoff is derived from empirical analysis concerning various factors. These factors include the tiling scheme, the average number of tiles comprising the viewport of users, the bitrates of the available quality levels, the segment size, and the perceived bandwidth available. Among these, the tiling scheme and the bitrates of the available quality levels hold the greatest influence. This is because these two factors are, in general, fixed and influence the outcome of the other factors. For example, a larger tiling scheme (i.e., more tiles) will result in a larger value in the average number of tiles comprising users' viewport, and larger bitrates for the available quality levels will result in fewer high quality tiles for a given network environment. A ''good'' cutoff value will cover the average number of tiles perceived by a user at a given frame, which is 7.24 to 8.11 tiles here; however, the bitrates of the available quality levels and the network environment can limit the value for cutoff. In this case, the cutoff value should be lowered from its ''good'' value until the best quality level can be achieved but still be within the perceived tile count range, which is 6 to 11 tiles here. Based on the above, in this paper for a 5 × 5 equirectangular tile map and the network environment described in section IV, we select a cutoff value of 8 viewport tiles.
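The role of cutoff can be illustrated with a small sketch. It does not reproduce L12-L26 of Algorithm 1, whose details are not shown here. It assumes that the viewport tiles beyond cutoff with the lowest affinity scores are demoted by one quality level and that the remaining tiles are promoted uniformly to the highest level the budget allows. The function name, the per-tile bitrate table `level_kbps`, and the fallback to the uniform assignment are all assumptions.

```python
from typing import Dict, List


def apply_cutoff(
    viewport: List[int],
    affinity: Dict[int, float],
    level: int,
    max_level: int,
    cutoff: int,
    level_kbps: List[float],
    budget_kbps: float,
) -> Dict[int, int]:
    """Return a quality level per viewport tile after applying cutoff.

    `level` is the uniform level assigned to all viewport tiles beforehand;
    `level_kbps[q]` is the assumed per-tile bitrate of quality level q.
    """
    quality = {j: level for j in viewport}
    # Cutoff only applies when the viewport is below the top level
    # and contains more tiles than the limit.
    if level >= max_level or len(viewport) <= cutoff:
        return quality

    ranked = sorted(viewport, key=lambda j: affinity[j], reverse=True)
    keep, demote = ranked[:cutoff], ranked[cutoff:]
    demoted_level = max(level - 1, 0)  # assumption: demote by one level

    # Try the highest promotion first and step down until it fits the budget.
    for target in range(max_level, level, -1):
        trial = {j: target for j in keep}
        trial.update({j: demoted_level for j in demote})
        if sum(level_kbps[q] for q in trial.values()) <= budget_kbps:
            return trial
    return quality  # assumption: keep the uniform assignment if nothing fits
```

In this sketch, a 10-tile viewport with cutoff = 8 gives up two low-affinity tiles so that the eight most popular tiles can reach a higher level within the same budget.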

Fourth and last, in the Extension Enhancement phase at L27-L37, the viewport is extended. Viewport extensions are performed by enhancing the tiles surrounding the viewport by creating direction groups. We consider the four direction groups of left, right, top, and bottom, and these tiles are called sub-adjacent tiles in Figure 4. We ignore corner tiles because of their peripheral location and to conserve network resources. We also do not set a fixed ratio of bandwidth between the Viewport Enhancement and Extension Enhancement phase. This is because viewport prediction takes precedence over viewport history. After each direction group is defined, the rank of each group is determined by taking the average of the tile affinity scores comprising each group. For example, in Figure 4b, the group rank order is red, green, pink, and blue, respectively. Like this, at L27-L37,

Intuitively, fast-paced content refers to highly dynamic content, and slow-paced content refers to less dynamic content.

Q3:12000Kbps. For our assumptions, we first assume that a segment is not re-requested. Secondly, we assume that all tiles in a segment are at least encoded in the lowest quality level. Thirdly, we assume that all prediction-based methods use the prediction method described in section II. Lastly, we assume that the data size and distortion of all tiles in a segment are fixed. The simplicity of the above network environment is to provide ease in performing direct comparison on the performance of all videos in each user QoE metric and tile quality selection method.
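Returning to the Extension Enhancement phase, the direction-group ranking can be sketched as follows. This is an illustrative reading of the description and not L27-L37 of Algorithm 1. It assumes tiles are addressed as (row, column) pairs on a 5 × 5 grid, that columns wrap around horizontally as in an equirectangular projection, and that rows do not wrap at the poles. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Tile = Tuple[int, int]  # (row, column) on the tile map


def direction_groups(
    viewport: Set[Tile], rows: int = 5, cols: int = 5
) -> Dict[str, Set[Tile]]:
    """Collect the non-viewport tiles directly left, right, above and below.

    Diagonal (corner) neighbours are never visited, so they are ignored.
    """
    groups: Dict[str, Set[Tile]] = {
        "left": set(), "right": set(), "top": set(), "bottom": set()
    }
    for r, c in viewport:
        neighbours = {
            "left": (r, (c - 1) % cols),   # assumption: horizontal wrap-around
            "right": (r, (c + 1) % cols),
            "top": (r - 1, c),             # assumption: no wrap at the poles
            "bottom": (r + 1, c),
        }
        for name, (nr, nc) in neighbours.items():
            if 0 <= nr < rows and (nr, nc) not in viewport:
                groups[name].add((nr, nc))
    return groups


def rank_groups(
    groups: Dict[str, Set[Tile]], affinity: Dict[Tile, float]
) -> List[str]:
    """Order non-empty groups by mean tile affinity, most popular first."""
    means = {
        name: sum(affinity[t] for t in tiles) / len(tiles)
        for name, tiles in groups.items()
        if tiles
    }
    return sorted(means, key=lambda name: means[name], reverse=True)
```

A client following this sketch would spend any bandwidth left after the Viewport Enhancement phase on the groups in the returned order.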

For comparison against the proposed tile quality selection algorithm, we adopt the following eight reference tile quality selection methods. Among these eight, reference methods 2), 3), 4), and 5) are in [25].
1) The Conventional method, from [18], selects the quality of tiles by groups and is the former algorithm to Algorithm 1. Bandwidth is allocated to the groups of viewport, direction, and background, respectively. The conventional method follows Algorithm 1, except L12-L26 and L32-L36. This means that cutoff and individual tile quality enhancement are disregarded.
2) The Baseline method, from [32], presents the user's current viewport in high quality, with all other tiles in low quality. This method is also called Full Delivery Basic in [32]. Viewport extensions are not used.
3) The State Art 1 method, from [23], selects the quality of tiles by groups. Bandwidth is allocated to the three groups of viewport, adjacent, and outside, respectively. In particular, bandwidth is allocated for the highest possible quality level to the viewport tiles, the remaining bandwidth after enhancing the viewport is allocated for the highest possible quality level to the adjacent tiles, and bandwidth is allocated for the lowest quality level to the outside tiles.
4) The State Art 2 method, from [13], selects the quality of tiles individually based on weights. Bandwidth is allocated to tiles inside the viewport based on their distortion and to tiles outside the viewport based on their distance to the viewport center.

First, the segment perceived quality bitrate R_i is defined as the sum of the quality level bitrate for all tiles, including partial tiles, that overlapped with the user's viewport V_i during segment i. Accordingly, the perceived quality bitrate R is defined as the above segment perceived quality bitrate R_i, averaged across all segments.

Second, internal quality switching I_1 is defined as the standard deviation of all tile quality levels comprising V_i, averaged across all segments.

Third and last, external quality switching I_2 is defined as the absolute difference between the average tile quality level of V_i and V_{i−1}, averaged across all segment transitions.

FIGURE 6. User QoE performance in internal quality switching, averaged across all 50 users.
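The three QoE metrics defined above can be computed as in the following sketch, which works from the verbal definitions only. It assumes each segment is given as the list of quality levels of the tiles overlapping the viewport and that `level_kbps` maps a quality level to a per-tile bitrate. The use of the population standard deviation for I_1 is also an assumption, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def perceived_bitrate(segments: List[List[int]], level_kbps: List[float]) -> float:
    """R: per-segment sum of viewport tile bitrates, averaged over segments."""
    return mean(sum(level_kbps[q] for q in seg) for seg in segments)


def internal_switching(segments: List[List[int]]) -> float:
    """I_1: standard deviation of viewport tile levels, averaged over segments."""
    return mean(pstdev(seg) for seg in segments)


def external_switching(segments: List[List[int]]) -> float:
    """I_2: absolute change in mean viewport level between consecutive segments,
    averaged over all segment transitions (needs at least two segments)."""
    averages = [mean(seg) for seg in segments]
    return mean(abs(cur - prev) for prev, cur in zip(averages, averages[1:]))
```

Lower values of I_1 and I_2 indicate more uniform quality inside the viewport and across segments, respectively, while a higher R indicates higher perceived quality.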

In this section, we introduce the evaluation results on each

all 50 users is shown in Figure 5. As shown, the highest perceived quality bitrate possible is unique for each video. This is because the number of tiles overlapping a user's viewport throughout each video is also unique. Regarding the performance, the proposed method generally performed the best compared to the other reference methods in all network environments, except under the network environment of 8Mbps on 2s segments. This is because the bandwidth requirement for requesting the highest quality level could not be achieved for all viewport tiles, even by introducing cutoff. Although lowering the value for cutoff enough here to support the highest quality level will result in a higher perceived quality bitrate, the performance in internal quality switching and external quality switching will both be negatively affected. The Popular 1 method, on the other hand, performed the best under this network environment by enhancing tiles irrespective of quality uniformity.
environments of 10Mbps on 1s segments and 8Mbps on 2s segments and Chariot Race under the network environment of 8Mbps on 2s segments. This suggests that individual tile quality enhancement based on tile popularity is better for videos centralized in popular content.

Regarding the videos, the performance behavior is consistent. However, a notable difference in performance behavior is observed between fast- and slow-paced content under the network environment of 10Mbps on 2s segments. Compared to fast-paced content, the State Art 3 method and Popular 1 method both performed worse than the FoV method, except in Kangaroo Island, and this is because Kangaroo Island is less dispersed in tile popularity compared to Perils Panel and Shark Shipwreck. From this performance behavior, it can be seen that tile quality selection methods using primarily only viewport history are not suitable for all videos, particularly slow-paced content videos. Therefore, based on the above, the perceived quality performance for history-based methods will be affected according to the tile popularity dispersion of a video. Figures 1 and 2

extensions by using strict tile groups. This verifies our assertion in section II that using strict tile groups as a viewport extension technique is not advisable. Lastly, the State Art 3 method and Popular 1 method both yielded the worst performance compared to the prediction-based methods by not considering the actual user's viewport information.

Regarding the videos, a notable difference in performance behavior is observed between fast- and slow-paced content, and it is observed under all network environments. Compared to fast-paced content, the degree of internal quality switching tended to be higher with slow-paced content. This suggests that it is more uncertain where the user will look in slow-paced content and that the viewport estimation methods used could not capture this as accurately. As reflected in Figures 1 and 2, the lower viewport prediction accuracy further degraded the performance of all prediction-based methods, and the higher tile popularity dispersion further

the best only in Roller Coaster and Pac-Man. The above suggests that viewport history can provide consistency in quality across segments, especially for videos condensed in tile popularity, and that there were more severe instances of viewport mispredictions in slow-paced content. Viewport mispredictions can cause large quality level changes between segments, but strictly history-based methods can provide stability in performance.

Overall, the proposed method tended to perform the best on 1s segments. In addition, the proposed method notably outperformed the conventional method. The most notable difference between the proposed method and conventional method is under the network environment of 10Mbps on 2s segments. This significant difference in performance is primarily due to introducing cutoff to address viewport tile count fluctuations. Under the other network environments, the proposed method

decrease the performance of prediction-based methods and lower the tile-likelihood for history-based methods. Conversely, when the popular content is not as dispersed, such as in Roller Coaster, the performance is generally better for both prediction-based and history-based methods. The above relation between content pace and user head movements should be further investigated.

Second and last, the proposed method is in a tradeoff with the conventional method for both internal and external quality switching. As exhibited on 2s segments, the two methods perform differently in internal and external quality switching because of the introduction of cutoff in the proposed method. Though introducing cutoff can suppress external quality switching and increase the perceived quality bitrate, internal quality switching is higher because the quality level difference between tiles that were not as enhanced is also higher. Depending on the user's preferred viewing experience, the user may or may not want to use cutoff as a control means for quality control, but we would like to reserve this for our future work.

In this paper, to resolve the challenges concerning the band-

external quality switching on 1s segments. In addition, owing to the newly added algorithmic logic, the proposed method performed better than the conventional algorithm in all QoE metrics on 1s segments.

For our future work, we have a list of items that we would like to tackle. First, we would like to investigate a tile quality selection algorithm that can realize different viewing experiences, such as prioritizing either internal or external quality switching. Second, we would like to investigate how to remove the empirical analysis for determining cutoff by formalizing an equation. Third, we would like to investigate more deeply the relation between content pace and user head movements. Fourth, we would like to analyze the use of advanced viewport estimation methods. This is because we used a simple linear regression model as our viewport estimation method, but advanced viewport estimation methods, such as the neural network-based method in [19], could have different impacts on the performance of the tile quality selection methods shown in section IV. Fifth and last, we would like to analyze the use of different equirectangular tiling schemes. This is because the proposed algorithm is specifically tailored for a 5 × 5 equirectangular tile map, and therefore modifications to the present algorithm will likely be needed for optimal performance according to the

Tables 2-19.

TABLE 18. Perils Panel QoE performance in external quality switching, averaged across all 50 users.