I2MB: Intelligent Immersive Multimedia Broadcast in Next-Generation Cellular Networks

Immersive multimedia content is increasingly popular, and the consumption of 360° videos is growing rapidly across varied domains. Broadcasting such content in cellular networks is challenging in terms of dynamic content adaptation and efficient resource allocation to serve heterogeneous consumers. In this work, we propose an intelligent immersive new radio multimedia broadcast multicast system (NR-MBMS), I2MB, for next-generation cellular networks. I2MB forecasts the users' viewing angle, and hence the 360° video tiles to be broadcast, beforehand using a long short-term memory (LSTM) network. We define broadcast areas using modified K-means clustering. The complex multivariable optimization problem that integrates efficient adaptive 360° video encoding and tiled broadcast using optimized transmission parameters is defined as a Markov decision process (MDP). In a dense urban scenario with a large MBSFN (multimedia broadcast multicast service single frequency network) synchronization area, the dimensionality of the state and action spaces is very high, so the solution is obtained using the deep deterministic policy gradient (DDPG) algorithm. I2MB incorporates deep reinforcement learning based radio resource allocation (modulation-coding scheme and frequency-time resource blocks) and tiled video encoding to maximize the viewport video quality experienced by the broadcast mobile users. I2MB provides improved immersive video broadcast streaming quality while serving a higher number of mobile users. Adaptive encoding of 360° video tiles and radio resource allocation are performed based on the users' forecasted viewing angle, spatial distribution, channel conditions, and service request. The performance evaluation of our proposed scheme, I2MB, shows considerable gains in viewport quality (46.83%) and number of users served (30.52%) over a recent state-of-the-art method, VRCAST.


I. INTRODUCTION
Immersive 360° video streaming is increasingly used in diverse applications such as virtual reality, gaming, and entertainment [1]. In the immersive environment, when a viewer changes their viewing direction, the content is rendered accordingly. However, streaming such content requires very high bandwidth and is challenging [1]. A 360° immersive video can be divided spatially into small portions known as 'tiles' that can be encoded at different quality levels. This has enabled tiling-based viewport-adaptive 360° video streaming, where tiles are delivered to clients based on their viewing direction and network conditions. Concretely, the tiles within the user's viewport can be transmitted at a higher quality, while the rest of the tiles can be delivered at a lower quality [2].

The associate editor coordinating the review of this manuscript and approving it for publication was Jon Montalban.

Digital television (TV) broadcast is a popular service in wireless networks comprising on-demand content streaming and multimedia broadcast to heterogeneous customers on their smart devices like TVs, phones, and car-infotainment systems [3]. Streaming on-demand multimedia data to mobile users using unicast transmission requires considerably higher

MBSFN comprises several gNBs that broadcast the same set of programs in a synchronized manner using the same set of radio resources. A gNB can be a part of more than one MBSFN area. In the given example, a cell is shown to be a member of two MBSFN areas: {1, 2}. The users $u_1$ to $u_6$ are accessing the 360° multimedia program at different viewing angles and are receiving the corresponding tiles in their viewport.

The trend in viewing and popularity of TV content (i.e., programs and TV channels) is found to depend on demographic, social, economic, age-related, and region-specific factors of the viewers [6], [7], [8].
Ratings of TV channels, programs, and audience can help in deciding content production and schedules [9]. Multi-channel TV service has an alternative in over-the-top (OTT) streaming, which is monetized through behavioral advertising [10]. Overall, this motivates us to group users using multi-criteria clustering to form MBSFN areas. We perform NR-MBMS resource allocation based on multiple parameters: user content interest (request), viewing direction angles, gNB association, and program popularity, in a given MBSFN synchronization area. In our proposed scheme, I2MB, we form MBSFN areas by grouping cells while considering user content request, location, experienced channel conditions, and user head navigation direction.

Given a set of MBSFNs, we aim to maximize the immersive quality delivered to the users by adaptively encoding 360° video tiles and efficiently allocating radio resources. We have formulated an algorithm based on deep reinforcement learning (DRL) that executes at the broadcast transmitter (BTx) to efficiently allocate radio resources and adaptively encode the 360° video tiles that have to be broadcast. The aim is to minimize the sum-distortion and churn rate experienced by the users in the system. The users' program requests, viewing directions, and channel conditions are considered to be unknown to the users and the BTx beforehand. We demonstrate, using performance evaluation results, considerable gains in viewport peak signal-to-noise ratio (PSNR) and number of served users over a recent state-of-the-art method, VRCAST.

The rest of the paper is organized as follows. Section II discusses related works. Section III presents the I2MB system architecture and components.
Section IV describes the I2MB framework consisting of user head navigation direction forecasting, MBSFN formation using multi-criteria clustering, and deep reinforcement learning based tile quality adaptation and resource allocation. Section V provides details on the simulation scenario and presents the key performance results. Finally, Section VI draws our conclusions.

Adaptive 360° video streaming based on users' viewport has been studied in [1] and [2] via the design of efficient 360° video representations and resource allocation methods. Live scalable 360° video network multicast has been investigated in [11] via rate-distortion optimization and user viewport prediction. The reference method we consider in our experiments is known as VRCAST and has been studied in [12] for streaming of live 360° videos to mobile users. It considers grouping of users, adaptive resource allocation, and tile-quality selection. However, it focuses on live multi-

VOLUME 10, 2022

[21]. It can also be used for optimal path planning of mobile robots [20]. Live streaming services for vehicular infotainment systems in the Internet of Vehicles (IoV) require high quality, low latency, and low bitrate variance.

actor-critic DRL DDPG algorithm [23].

The following are a few key contributions of this work:

• User head navigation direction prediction using a deep learning LSTM model.

• Efficient multi-criteria clustering based MBSFN area formation with an optimal number of clusters.

• DDPG algorithm based optimal radio resource allocation and adaptive 360° tile encoding that minimizes the users' sum-distortion and the system churn rate.

• Extensive simulation based evaluation showing the effectiveness of the proposed I2MB technique, which outperforms the state-of-the-art VRCAST algorithm in terms of churn rate and video quality.

The architecture of our proposed I2MB system is illustrated in Figure 2. Heterogeneous user equipments (UEs) send their head navigation information to their serving gNB (RAN broadcast transmitter). This is then used to refine the LSTM forecast of the users' viewport. The gNB forwards this information to the broadcast transmitter core element consisting of the multicast (multi-cell) coordination entity (MCE) and the NR-MBMS gateway. These elements define the MBSFN area (based on multi-criteria clustering) in order to efficiently broadcast 360° immersive digital TV content to heterogeneous UEs. Thereafter, they also adaptively allocate radio resources (resource blocks, modulation and coding scheme). The content server adaptively encodes 360° video tiles using quantization level selection based on user requests, viewport (based on head movement navigation data), rate-distortion (R-D) characteristics of the immersive media content, and radio resource constraints.

The user head-movement data corresponding to user navigation of a 360° video over time is monitored by the UE. At the UE, the immersive extended/virtual reality (XR/VR) device records the viewpoint direction, $V_i$, of user $i$ on the 360° viewing sphere. The user is considered to be positioned at the center of this sphere. This is shown in Fig. 3(b). In particular, the spherical coordinates azimuth and polar angles,

an LSTM network to forecast the head navigation direction (rotation angle) of each user given the previous time-step direction information. The predictor is trained by the ADAM optimizer [24], [25] for the non-stationary head navigation data of immersive multimedia users.

Since the monitored head navigation information is sent to the broadcast transmitter (gNB), the actual values of the time steps are accessible between predictions. Hence, the observed values are used to update the network state instead of the predicted values. We begin by initializing the network state and thereafter reset it to prevent previous predictions from affecting the predictions in subsequent time steps. The prediction at each time step uses the observed value (at the users' head navigation monitoring module) of the previous time step. The prediction accuracy is enhanced when the network state is updated with the observed values instead of the predicted values [16]. The prediction model performance is evaluated using the root mean squared error (RMSE), defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{\nu} \sum_{i=1}^{\nu} (A_i - F_i)^2}$$

where $A_i$ is the actual value and $F_i$ is the predicted value during forecasting, and $\nu$ is the total number of time steps over which the prediction has been performed.
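The effect of updating the predictor state with observed rather than predicted values can be illustrated with a minimal sketch. It does not implement an LSTM: `TrendPredictor` is a hypothetical stand-in that simply extrapolates the last step, and the azimuth samples are made up. Only the RMSE expression and the closed-loop update idea come from the text.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error over nu = len(actual) time steps."""
    nu = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / nu)


class TrendPredictor:
    """Hypothetical stand-in for the LSTM: next value = last value + last step."""

    def __init__(self, first_value):
        self.last = first_value
        self.step = 0.0

    def predict(self):
        return self.last + self.step

    def update(self, value):
        # Fold a new value (observed or predicted) into the internal state.
        self.step = value - self.last
        self.last = value


def forecast(angles, use_observed):
    """One-step-ahead forecasts for angles[1:], starting from angles[0]."""
    model = TrendPredictor(angles[0])
    predictions = []
    for observed in angles[1:]:
        pred = model.predict()
        predictions.append(pred)
        # Closed loop: update with the observed value reported by the UE.
        # Open loop: feed the model its own prediction instead.
        model.update(observed if use_observed else pred)
    return predictions


azimuth = [0.0, 2.0, 5.0, 9.0, 14.0, 20.0]  # made-up head azimuth samples (degrees)
closed_loop = forecast(azimuth, use_observed=True)
open_loop = forecast(azimuth, use_observed=False)
rmse_closed = rmse(azimuth[1:], closed_loop)
rmse_open = rmse(azimuth[1:], open_loop)
```

On this toy trace the closed-loop forecast tracks the head movement, whereas the open-loop forecast never leaves its initial value, so `rmse_closed` is much smaller than `rmse_open`.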
We evaluate the efficacy of the multi-criteria clustering methods in our cellular 360° multimedia broadcast framework using the following metrics:

1) Euclidean distance ($D_e$): User $i$ belongs to cluster $k$. There are a total of $N_k$ users in cluster $k$. We can find the nearest neighbors using $D_e$.

2) Mahalanobis distance ($D_m$):

3) Silhouette coefficient ($S$): It measures how close a user is to its own cluster (cohesion) compared to other clusters (separation). It is used to evaluate the distance of separation between the clusters resulting from the used method [28]. The silhouette plot visually shows whether points in a cluster are closer to each other than to those in neighboring clusters.
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $s_i$ ($-1 \le s_i \le 1$) is the Silhouette width, a confidence indicator on the membership of $i$ in cluster $k$. When $s_i$ is close to 1, it indicates that $i$ is well clustered (i.e., assigned to an appropriate cluster). When $s_i$ is close to zero, it indicates that $i$ can also be assigned to the closest neighboring cluster [27]. The average distance between $i$ and all other users included in $k$ is denoted as $a_i$, and $b_i$ is the minimum of the average distance between $i$ and all of the samples clustered in any other cluster. The partition that results in the maximum value of $S$ is the optimal one, corresponding to the most appropriate number of clusters, i.e., the optimal $K$ [29].
4) Dunn's index (DI): $\Delta(c_k)$ represents the complete-diameter intracluster distance of cluster $k$. This measure maximizes the intercluster distances while minimizing the intracluster distances. A large value of DI corresponds to good clusters [27]. The number of clusters that maximizes DI can be taken as the optimal number of clusters, and a higher value represents better cluster quality.
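The two validity indices above can be sketched in a few lines. The code below is an illustrative assumption, not the paper's implementation: it uses the standard Silhouette width, and for Dunn's index it takes the smallest point-to-point distance between clusters as the intercluster distance and the largest point-to-point distance within a cluster as the diameter. The 2-D points and labels are made up.

```python
import math
from itertools import combinations


def group_by_cluster(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette(points, labels):
    """Average Silhouette width over all points (needs at least two clusters)."""
    clusters = group_by_cluster(points, labels)
    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        # a: average distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest average distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


def dunn_index(points, labels):
    """Minimum intercluster distance divided by maximum cluster diameter."""
    clusters = group_by_cluster(points, labels)
    separation = min(
        math.dist(p, q)
        for (_, first), (_, second) in combinations(clusters.items(), 2)
        for p in first
        for q in second
    )
    diameter = max(
        math.dist(p, q) for members in clusters.values() for p in members for q in members
    )
    return separation / diameter


users = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # made-up positions
good_labels = [0, 0, 0, 1, 1, 1]   # two compact, well separated groups
bad_labels = [0, 1, 0, 1, 0, 1]    # groups mixed across the two regions
```

Both indices are higher for `good_labels` than for `bad_labels`, which is the behaviour the text relies on when choosing the number of clusters.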

The Dunn [30] and Silhouette [28] coefficients result from a nonlinear combination of compactness and separation. We begin by setting $K$, i.e., the number of clusters, to 1. Thereafter, in each iteration, we increase $K$ by 1. As long as the performance metric obtained is better than in the previous iteration, the algorithm continues until a drop in performance is observed. This gives us the optimal number of clusters, $K^*$. According to the 3GPP standard, a gNB can belong to at most 8 MBSFN areas; we ensure that this limitation is enforced, i.e., a gNB can be associated with a maximum of 8 clusters at a time. These clusters would have the highest proportion of UEs associated with the gNB. The proposed I2MB MBSFN area formation multi-criteria clustering framework is given in Algorithm 1, Function I. This function gives the $K$ MBSFN areas, the set of 360° programs to be broadcast in $K$ (i.e., $P_k$), and the video tiles to be broadcast.
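The search for $K^*$ and the cap of 8 MBSFN areas per gNB described above can be sketched as follows. This is a minimal illustration under assumptions: `score_for_k` is a hypothetical callable that clusters the users into `k` groups and returns a validity score (for example the Silhouette coefficient or Dunn's index), and the scores and UE labels below are made up.

```python
from collections import Counter

MAX_MBSFN_PER_GNB = 8  # limit stated in the text


def find_optimal_k(score_for_k, k_max):
    """Increase K from 1 and stop at the first drop in the validity score."""
    best_k, best_score = 1, score_for_k(1)
    for k in range(2, k_max + 1):
        score = score_for_k(k)
        if score < best_score:
            break  # performance dropped, keep the previous K
        best_k, best_score = k, score
    return best_k


def mbsfn_areas_for_gnb(ue_cluster_labels, limit=MAX_MBSFN_PER_GNB):
    """Clusters a gNB joins: those holding the largest share of its UEs."""
    counts = Counter(ue_cluster_labels)
    return [cluster for cluster, _ in counts.most_common(limit)]


# Made-up validity scores per K; the search stops at the drop between K=4 and K=5.
toy_scores = {1: 0.10, 2: 0.35, 3: 0.52, 4: 0.61, 5: 0.58, 6: 0.70}
k_star = find_optimal_k(toy_scores.get, k_max=6)

# Made-up cluster labels of the UEs attached to one gNB: cluster c has c + 1 UEs.
gnb_ue_labels = [c for c in range(10) for _ in range(c + 1)]
gnb_areas = mbsfn_areas_for_gnb(gnb_ue_labels)
```

Because the loop stops at the first drop, it returns a local optimum: in the toy scores a larger value at K = 6 is never examined, which mirrors the stopping rule as stated in the text.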
The broadcasting decisions include the following: (i) program and tile set to be broadcast given the radio resource con-

In the following discussions, we consider that each UE at any given point of time is requesting at most one 360° program. We denote the maximum number of tiles of a 360° program $p$ as $T_p$. We define the 360° immersive multimedia broadcast service distortion for heterogeneous UEs as follows:

Definition 1: Immersive video distortion is governed by the rate-distortion (R-D) characteristics of the video. The video bitrate varies with the quantization parameter (QP) at the encoder. In particular, QP is a video encoding parameter that regulates the extent of spatial detail in the encoded video. As QP is increased, the bit rate drops in exchange for increased distortion. It is related to the quantization step size $q$ as $q = 2^{(QP-4)/6}$. Effectively, the user $i$ viewport distortion is given as:

Correspondingly, the video quality, Y-PSNR, is given as $Q_i = 10 \cdot \log_{10}(I_{max}/D_i)$, where $I_{max}$ is the peak luminance intensity and $D_i$ is the luminance mean square error (Y-MSE). We define the system (network) churn rate for I2MB as follows.

Definition 2: System (network) churn rate is the ratio of the unserved users to the total users in the system. A user $i$ is served if it successfully receives the tiles in its viewport $V_i$ and is given as:

We assume that channel characteristics remain stationary, i.e., the SINR experienced by users remains constant, during the broadcast of all tiles ($T_p$) for a group of pictures (GOP) of the requested 360° program. We also assume that the number of users in the system remains constant for the duration of the GOP broadcast of the requested program.
Therefore, a UE can successfully receive immersive video tile $\tau_i$ (in its viewport) if its experienced SINR is greater than the threshold corresponding to the MCS $m^k_{p,\tau_i}$ that is being used to broadcast tile $\tau_i$ of $p$ in MBSFN $k$. Only when the 360° multimedia service quality experienced by user $i$ is above the MCS allocated for its transmission, i.e., $m^k_{p,\tau_i}$, do we consider the corresponding quality. It is assumed to be zero otherwise.
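The definitions above can be tied together in a small sketch: the QP to step-size relation, the quality expression as written in the text, the SINR-versus-MCS-threshold reception rule, and the churn rate. The MCS thresholds, tile assignments, and user SINR values are made-up illustrative numbers, and the function names are hypothetical.

```python
import math


def quant_step(qp):
    """Quantization step size q = 2^((QP - 4) / 6), as stated in the text."""
    return 2 ** ((qp - 4) / 6)


def quality_db(distortion, i_max):
    """Q = 10 * log10(I_max / D), following the expression in the text."""
    return 10 * math.log10(i_max / distortion)


def is_served(sinr_db, viewport_tiles, tile_mcs, mcs_threshold_db):
    """A user is served if its SINR exceeds the MCS threshold of every viewport tile."""
    return all(sinr_db > mcs_threshold_db[tile_mcs[tile]] for tile in viewport_tiles)


def churn_rate(served_flags):
    """Ratio of unserved users to total users."""
    return served_flags.count(False) / len(served_flags)


# Made-up SINR thresholds (dB) per MCS level and MCS level used for each tile.
mcs_threshold_db = {0: -2.0, 1: 4.0, 2: 10.0}
tile_mcs = {1: 2, 2: 1, 3: 0}

# Made-up users: (experienced SINR in dB, tiles in the viewport).
users = [(12.0, [1, 2]), (5.0, [2, 3]), (5.0, [1]), (-5.0, [3])]
served = [is_served(sinr, tiles, tile_mcs, mcs_threshold_db) for sinr, tiles in users]
churn = churn_rate(served)
```

In this toy setup the third user misses tile 1 because its SINR is below the threshold of the higher MCS level, and the fourth user is below every threshold, so half of the users are unserved.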

The SINR pertaining to the worst channel condition user

Lindley recursion, where, a t p (R t , t ) denotes the amount of data of program p for transmission based on the resource allocation action. Given the arrival distribution P l p and the resource allocation action a p , the probability that program p stream transitions from t + 1 is defined as:

where the term in square brackets is equal to $R^{t+1}_p$ from (9).

Minimizing the long-term average of (11) minimizes the sys-

where $s^t = \{s^t_i \ldots s_{P_k}\}$ and $a^t = \{a^t_i \ldots a_{P_k}\}$ are the joint state and action, respectively.

The value of each state when following the policy $\pi$ is defined using the value function, $V(s)$, given as:

$$V(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \omega^t \, c(s^t, \pi(s^t)) \,\middle|\, s^0 = s\right]$$

where $\omega \in [0, 1]$ and $\omega^t$ is the $t$-th power of the discount factor. We take the expectation over a sequence of states that is governed by the controlled Markov chain with transition probabilities $P(s'|s, a)$. We can represent the expected future cost using the recursive expression of the value function based on the transition probability as:

$$V(s) = c(s, \pi(s)) + \omega \sum_{s' \in S} P(s'|s, \pi(s)) \, V(s').$$
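The recursive expression for the value function can be evaluated numerically by repeated substitution. The sketch below is a generic iterative policy evaluation on a made-up two-state chain, assuming a discount factor strictly below 1 so that the iteration converges; the state names, costs, and transition probabilities are hypothetical and are not taken from the I2MB model.

```python
def evaluate_policy(states, policy, cost, transition, omega, sweeps=500):
    """Iterate V(s) = c(s, policy(s)) + omega * sum_s' P(s'|s, policy(s)) * V(s')."""
    value = {s: 0.0 for s in states}
    for _ in range(sweeps):
        value = {
            s: cost[(s, policy[s])]
            + omega * sum(
                prob * value[nxt] for nxt, prob in transition[(s, policy[s])].items()
            )
            for s in states
        }
    return value


# Made-up example: a program buffer that is either "backlogged" or "drained".
states = ["backlogged", "drained"]
policy = {"backlogged": "allocate", "drained": "allocate"}
cost = {("backlogged", "allocate"): 2.0, ("drained", "allocate"): 1.0}
transition = {
    ("backlogged", "allocate"): {"backlogged": 0.5, "drained": 0.5},
    ("drained", "allocate"): {"drained": 1.0},
}
values = evaluate_policy(states, policy, cost, transition, omega=0.9)
```

For this chain the fixed point can be checked by hand: the drained state costs 1 per step forever, giving 1 / (1 - 0.9) = 10, and the backlogged state solves V = 2 + 0.9 (0.5 V + 5), giving V = 6.5 / 0.55.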

Then, the objective of the resource allocation strategy is to determine the resource allocation and tile encoding policy that solves the following optimization:

where $V^*(s)$ is the optimal value function and $\psi^*(s, a)$ is the optimal action-value function that evaluates the value of taking an action $a$ in state $s$ and thereafter following the optimal policy. The optimal policy $\pi^*(s)$ can be determined by taking the action that minimizes the right-hand side of (16), which gives us the optimal action to take in each state.

Since the possibilities for quantization parameter selection and resource allocation are nearly infinite, there is a large number of discrete states and actions. Furthermore, the dynamics of the underlying system (user channel quality, user requests, video data rate adaptation, gNB radio resource allocation) are predominant, and the complexity would be very high if the broadcast resource and encoding parameter allocation problem had to be solved entirely from scratch for each video GOP. For a scenario with $P_k$ programs and $M$ RBs, where each 360° program stream has $T_p$ tiles, there are $|q| = q_{max} - q_{min}$ possible program stream data values, and $M$ possible MCS levels can be allocated to the tiles of each program, there are a total of $P_k \times M \times T_p \times |q| \times M$ possible states and $M \times M \times |q|$ possible resource allocation actions. Hence, we use DRL to solve this problem.

We use a deep neural network based deterministic policy gradient method, the DDPG algorithm. It is suitable for high dimensional, continuous or discrete, large action and state space problems. The underlying principle is the actor-critic framework consisting of an actor and a critic function. The former chooses the actions and the latter evaluates the corresponding selection. We employ the DRL-based DDPG method, which reduces the time complexity by maintaining a cache (i.e., replay buffer) of state-action transitions and by performing an iterative update of the networks (critic, actor, and target) on the go instead of exploring the state-action mapping from the beginning each time. The current policy is specified by mapping states to an action in the DNN (parameters $\eta^\mu$) by means of an actor function, $\mu(s|\eta^\mu)$. The critic function, $\psi(s, a|\eta^\psi)$, is implemented using a DNN (parameters $\eta^\psi$) that learns using the Bellman equation and provides feedback based on the selected action.
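The replay buffer and target networks mentioned above can be sketched without any deep learning library. The following is a minimal sketch under assumptions: it follows the standard DDPG recipe (bootstrapped critic target and soft target update), the mixing rate `rate` and all names are hypothetical, and the actor and critic are stand-in Python functions rather than neural networks.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size cache of (state, action, cost, next_state) transitions."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest transitions are dropped first

    def store(self, state, action, cost, next_state):
        self.items.append((state, action, cost, next_state))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.items), min(batch_size, len(self.items)))


def critic_targets(batch, target_actor, target_critic, omega):
    """y_i = c_i + omega * critic'(s_{i+1}, actor'(s_{i+1})) for each transition."""
    return [
        cost + omega * target_critic(next_state, target_actor(next_state))
        for _, _, cost, next_state in batch
    ]


def soft_update(target_params, online_params, rate):
    """Move the target parameters a small step toward the online parameters."""
    return [(1 - rate) * t + rate * o for t, o in zip(target_params, online_params)]


buffer = ReplayBuffer(capacity=3)
for step in range(4):  # the first transition is evicted once capacity is exceeded
    buffer.store(state=float(step), action=0.0, cost=1.0, next_state=float(step + 1))

# Stand-in target actor and critic (hypothetical closed-form functions).
toy_targets = critic_targets(
    [(0.0, 0.0, 1.0, 2.0)],
    target_actor=lambda s: -s,
    target_critic=lambda s, a: s * s + a * a,
    omega=0.5,
)
new_target_params = soft_update([0.0, 1.0], [1.0, 1.0], rate=0.1)
```

The critic would then be trained to regress onto these targets, and the actor updated along the policy gradient; both steps need a differentiable model and are omitted here.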
We update the actor DNN using the gradient of the expected return $J$ with respect to $\eta^\mu$, similar to (13).
We update the critic network by minimizing the MSE:

can belong to the set of allowed MCS levels in accordance with the 3GPP standard. We formulate the respective optimization problems below, where the constraints (20a)-(20d) capture the above conditions.
If gNB $j$ is a part of MBSFN $k$ and $p \in P_k$, then it broadcasts $p$, provided at least one user experiences acceptable program quality $Q_i > 0$. Thus, constraints (20c) and (20d) are subject to the broadcast of program $p$ by gNB $j$. Additionally, since a gNB can belong to more than one MBSFN area, (20b)-(20c) apply to each gNB instead of an MBSFN area.

Given program p and its tile τ, q p τ, σ
for iteration = 1, I do
    Initialize random process N for action exploration
    for t = 1, T do
        Select action $a_t = \mu(s_t|\eta^\mu) + N_t$
        Execute $a_t$, observe $c_t$ and $s_{t+1}$
        Store transition $(s_t, a_t, c_t, s_{t+1})$ in replay buffer
        Sample transitions (replay buffer): $(s_i, a_i, c_i, s_{i+1})$
        Set $y_i$ using $c_i$, $s_{i+1}$, $\psi$, $\mu$ in (19)
        Update critic by minimizing the loss in (18)
        Update actor policy: sample-policy-gradient in (17)
        Update target network:

Proof: It is evident from Fig. 5 that both quality and rate are strictly decreasing functions of the quantization parameter $q$. The distortion is a strictly increasing function of $q$. Analytically, we model $R(q) = a \cdot e^{b \cdot q}$, $D(q) = c \cdot e^{d \cdot q}$, and $Q(q) = 10 \cdot \log_{10}(I_{max}/D(q))$. It is shown in Fig. 5(a) that this video rate model for Tile 40 of a 360-degree video corresponds to $a = 57500$ and $b = -0.2$ with RMSE = 0.00097. The video distortion (Y-MSE) model is shown in Fig. 5(b) for Tile 40 and it corresponds to $c = 11.2$ and $d = -0.12$ with RMSE = 0.00852. The video quality (V-PSNR) model for Tile 40 is shown in Fig. 5(c) and it has RMSE = 0.00361.

The non-negative weighted linear sum of strictly increasing functions is increasing [32], [33]. Hence, we prove that our objective function is strictly increasing with the quantization level value by proving it for a generic $D_i$. The first derivative of $D_i$ with respect to $q^k_p$ (i.e., the quantization level of the program requested by $u_i$ in the MBSFN(s) to which the UE belongs) is of the form $c \cdot q$ ($c > 0$ and a constant). This is positive, thus proving the assertion.
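The exponential rate model $R(q) = a \cdot e^{b \cdot q}$ used in the proof can be fitted with ordinary least squares on the logarithm of the rate. The sketch below is an illustration only: it generates synthetic samples from the Tile 40 parameters quoted in the text ($a = 57500$, $b = -0.2$) and recovers them, so it demonstrates the fitting technique rather than reproducing the paper's measurements. The sample range of $q$ is an assumption.

```python
import math


def rate(q, a=57500.0, b=-0.2):
    """Exponential rate model R(q) = a * exp(b * q) with the Tile 40 parameters."""
    return a * math.exp(b * q)


def fit_exponential(qs, values):
    """Least-squares fit of ln(value) = ln(a) + b * q; returns (a, b)."""
    n = len(qs)
    logs = [math.log(v) for v in values]
    mean_q = sum(qs) / n
    mean_log = sum(logs) / n
    slope = sum((q - mean_q) * (lg - mean_log) for q, lg in zip(qs, logs)) / sum(
        (q - mean_q) ** 2 for q in qs
    )
    intercept = mean_log - slope * mean_q
    return math.exp(intercept), slope


sample_qs = list(range(20, 45, 4))             # assumed range of quantization values
sample_rates = [rate(q) for q in sample_qs]    # synthetic, noise-free samples
fitted_a, fitted_b = fit_exponential(sample_qs, sample_rates)
```

With $b < 0$ the modelled rate is strictly decreasing in $q$, which is the monotonicity property the proof uses for the rate; the same log-linear fit applies to any quantity modelled as a single exponential.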

The objective function (20) with constraints (20a)-(20d) selects the highest possible quantization parameter level for the group of users requesting tiles of a program such that the resource constraints in the network are met.

The optimization problem for I2MB($Cr_{min}$) is solved by selecting the lowest possible $m^p_\tau$ and the highest possible $q_{p,\tau}$ $\forall \tau, p$ such that $Q(q_{p,\tau}) \ge Q_{min}$. This ensures that the maximum number of users in the system have $\phi = 1$, which is in accordance with objective (21).

To assess the performance of our scheme, we have used 360° videos with diverse content types (for example: Office, City, Sports, Jungle, and Sunrise). Sample snapshots of three videos from the set that has been used are shown in Fig. 6.

We have assessed K-means, K-medoids, and fuzzy c-means multi-criteria clustering algorithms to choose the most effective method to efficiently form the MBSFN areas. These clustering methods have been evaluated in terms of the metrics listed in Section IV-B, i.e., Euclidean distance, Mahalanobis distance, Silhouette coefficient, and Dunn's index. The clustering performance of these methods in terms of the mentioned metrics is shown in Fig. 9(a)-(d), respectively. It is seen that the K-means and K-medoids multi-criteria clustering methods have comparable performance, which is significantly better than that of fuzzy c-means, in terms of the Euclidean and Mahalanobis distances. Furthermore, the efficacy of K-means as compared to the K-medoids and fuzzy c-means methods is evident from its higher Dunn's index value, shown in Fig. 9(d).

The Silhouette value for the eight clusters formed in a scenario consisting of 70 users per cell and 10 programs using the three methods is shown in Fig. 10. Even though Fuzzy

Fig. 11(a) and 11(b) that K-means clustering can effectively use these methods to find the optimal number of clusters in a given scenario.

K-means multi-criteria clustering is a simple and effective method to classify data [43]. Additionally, it is fast, requires few computations, and has linear complexity $O(n)$. We therefore apply Lloyd's K-means heuristic [44] for our multi-criteria clustering of heterogeneous users into

K-means++ approach [45]. The optimal number of clusters is system scenario dependent and can be assessed through Fig. 11(c). It depends on the number of users and the number of programs. Fewer program options result in fewer clusters (i.e., fewer MBSFN areas). As can be seen from Fig. 11(c), a higher number of users sometimes provides more competent centroid options, resulting in a smaller number of optimum clusters.

Table 1. For each user in an MBSFN area with a given number of interfering cells, the SINR is computed according to [5]. The performance of our system is obtained by averaging the results over several iterations (>150 iterations with 95% confidence interval) with a uniformly random distribution of users. We also examine the impact of the number of users.

Given the above scenario, our approach leads to the formation of an optimum number of MBSFN areas using the approach discussed in Section V-B. Eight clusters (MBSFN areas) are formed in the scenario shown in Fig. 12, indicated by different color markers. The efficacy of our proposed LSTM based viewport angle prediction scheme in I2MB is evident from Fig. 13(a), which shows the viewport

The significance of tile based immersive video broadcast in I2MB is evident from Fig. 14. Fig. 14(a) shows the total

Fig. 14(b). Fig. 14(c) shows the corresponding efficient QP level. The corresponding quality in terms of viewport luminance PSNR (Y-PSNR) is shown in Fig. 14(d). The tile specific rates of the tiles of this program are shown in Fig. 14(e). I2MB (both $D_{min}$ and $Cr_{min}$ schemes) selects efficient QP and MCS levels as compared to the existing scheme (VRCAST [12]) in a dense network scenario while ensuring higher tile quality delivered to users.

We also examine the performance of I2MB (both $D_{min}$ and $Cr_{min}$ schemes) in terms of the churn rate (i.e., proportion of unserved users), immersive video quality (in terms of viewport PSNR), and distortion (MSE). It is evident from Fig. 15(a) that the churn rate increases as the number of users per cell increases. The churn rate of I2MB($D_{min}$) and I2MB($Cr_{min}$) is 65.63% and 71.88% (on average) lower than that of VRCAST, respectively.

The viewport PSNR (V-PSNR) reduces with an increase in the number of users per cell but is maintained above 27 dB for the two I2MB methods, unlike VRCAST, which has 23 dB V-PSNR, as shown in Fig. 15(b). The MSE increases with an increase in the number of users, and the performance of I2MB($D_{min}$) and I2MB($Cr_{min}$) is 76.14% and 42.28% better (i.e., lower distortion, MSE), respectively, than the existing scheme [12], as shown in Fig. 15(c).

The quality per user for a scenario with fifty users in each cell is shown in Fig. 16(a) and the corresponding cumulative distribution function (CDF) is shown in Fig. 16(b). It is evident from Fig. 16 that I2MB($D_{min}$) and I2MB($Cr_{min}$) provide higher quality (greater than 30 dB) for all users as compared to VRCAST. Specifically, as can be seen from Fig. 16(a)