Deep Learning and Bidirectional Optical Flow Based Viewport Predictions for 360° Video Coding

The rapid development of virtual reality applications continues to urge better compression of 360° videos owing to the large volume of content. These videos are typically converted to 2-D formats using various projection techniques in order to benefit from ad-hoc coding tools designed to support conventional 2-D video compression. Although the recently emerged video coding standard, Versatile Video Coding (VVC), introduces 360° video specific coding tools, it fails to prioritize the user observed regions in 360° videos, which are represented by rectilinear images called viewports. This leads to the encoding of redundant regions in the video frames, escalating the bitrate cost of the videos. In response to this issue, this paper proposes a novel 360° video coding framework for VVC which exploits user observed viewport information to alleviate pixel redundancy in 360° videos. In this regard, bidirectional optical flow, Gaussian filtering and Spherical Convolutional Neural Networks (Spherical CNN) are deployed to extract perceptual features and predict user observed viewports. By appropriately fusing the predicted viewports on the 2-D projected 360° video frames, a novel Region of Interest (ROI) aware weightmap is developed, which can be used to mask the source video and introduce adaptive changes to the Lagrange and quantization parameters in VVC. Comprehensive experiments conducted in the context of VVC Test Model (VTM) 7.0 show that the proposed framework achieves an average bitrate saving of 5.85%, and up to 17.15%, at the same perceptual quality, which is measured using Viewport Peak Signal-to-Noise Ratio (VPSNR).


I. INTRODUCTION
In recent years, virtual reality (VR) technology has rapidly grown in public markets, providing solutions to immersive in-home experiences and elevating the standards of media consumption [1], [2], [3]. In addition to the entertainment sector, VR technology also supports other business endeavours such as travel, education, and real estate. Consequently, the proliferation of high-resolution 360° videos required to boost VR-based multimedia applications demands higher bandwidth. Therefore, there is a great need for efficient compression of such video content.
The associate editor coordinating the review of this manuscript and approving it for publication was Khoa Luu.
A 360° video can be regarded as a sequence of surface information that encloses a point source. The viewpoints on the virtual 360° surface change as a user varies the head position. If these viewpoints are uniformly distributed and placed at a constant distance from the point source, the 360° surface becomes isotropic and hence can be defined as a spherical surface. A 360° surface can be represented by the spherical coordinate system with the parameters latitude $\theta \in [-\pi/2, \pi/2]$, longitude $\phi \in [-\pi, \pi]$ and unit radius $r$. Moreover, 360° videos are converted to 2-D representations, mostly the EquiRectangular Projection (ERP) format, in order to use 2-D video encoders to compress the video contents [4]. However, in 360° videos, the Field of View (FOV) of the user encloses only a portion of the spherical information, known as the viewport, which is a rectilinear image generated from the user's head position. Because the remaining surface information is redundant to the user, video compression technologies are expected to effectively remove such information and save bitrate.
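The relationship between the spherical coordinates and the ERP plane can be illustrated with a minimal sketch. It assumes the common equirectangular convention in which longitude maps linearly to the horizontal axis and latitude to the vertical axis, with the frame centre at $(\phi, \theta) = (0, 0)$; the function names are hypothetical and half-pixel sampling offsets are ignored, so this is not necessarily the exact mapping of [44].

```python
import math

def sphere_to_erp(lon, lat, width, height):
    """Map longitude/latitude (radians) to continuous ERP coordinates (x, y).

    Assumed convention: lon in [-pi, pi] spans the frame width from left to
    right, lat in [-pi/2, pi/2] spans the frame height from bottom to top.
    """
    x = (lon / (2.0 * math.pi) + 0.5) * width
    y = (0.5 - lat / math.pi) * height
    return x, y

def erp_to_sphere(x, y, width, height):
    """Inverse of sphere_to_erp under the same assumed convention."""
    lon = (x / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - y / height) * math.pi
    return lon, lat

# The sphere point (0, 0) lands at the centre of a 4432x2216 ERP frame.
centre = sphere_to_erp(0.0, 0.0, 4432, 2216)
```

Because every latitude row is stretched to the full frame width, rows near the poles carry far more samples than their spherical area warrants, which is one source of the redundancy discussed above.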
Video compression technologies continue to evolve and facilitate new video communication trends. The Versatile Video Coding (VVC) standard [5] is the latest initiative of the Joint Video Experts Team (JVET), which introduces numerous coding tools to improve video compression efficiency [5], [6], [7], [8], [9], [10], [11]. VVC targets not only high coding gains but also versatility, and as such it supports the encoding of various types of videos, such as natural videos, screen content, Standard Dynamic Range (SDR), High Dynamic Range (HDR) and 360° videos, and video formats such as 4:2:0, 4:4:4 and lossless formats. In order to support 360° videos, VVC introduces specific coding tools including several re-projection, packaging and padding tools [12]. Particularly for ERP videos, VVC introduces Motion Vector (MV) wrap-around and vertical edge padding features to provide continuity in the 2-D plane similar to that on a spherical surface [12]. The VVC Test Model (VTM) 7.0 is capable of improving the compression gains of 360° videos by 28.91% compared to its predecessor, High Efficiency Video Coding (HEVC). However, VVC does not account for user perception when encoding an ERP video, which could otherwise further improve the perceptual compression gains.
Perceptual video coding has been a widely researched area in the video coding domain, which focuses on enhancing the user perceived visual quality by improving the fidelity of the regions that attract human interest, also called Regions of Interest (ROI) [13], [14], [15]. In the context of 2-D videos, perceptual characteristics can be directly exploited using video compression techniques. However, ROI based coding can be a challenging problem for 360° videos because the FOV of the user is limited to the viewports, as opposed to the conventional ERP representation of 360° videos, as illustrated in FIGURE 1. Therefore, failure to utilize viewport information during the encoding of an ERP video can result in an abundance of non-observed video information being coded, incurring an additional transmission cost. Also, individually encoding several viewports of a given ERP frame cannot address this problem because different users can opt to view different viewports. In response, this paper proposes a novel viewport dependent ERP coding framework that exploits viewport information in VVC encoding processes. To this end, in light of the success of image processing tools and data-driven technologies, this research employs a hybrid technique incorporating Gaussian filtering and bidirectional optical flow estimation [16], and a deep learning network constructed from the components of the Spherical Convolutional Neural Network (Spherical CNN) [17] and Salient Resnet [18], in order to predict the user observed viewports. Furthermore, this paper also illustrates the generation of a VVC compliant, ROI aware weightmap by non-linearly fusing the predicted viewports. Subsequently, the developed weightmap is used in the removal of spatial redundancy in the ERP videos and the optimization of the video coding parameters.
The novel contributions provided in the proposed ERP coding framework are summarized as follows.
• Hybrid viewport prediction technology for video coding that fuses bidirectional optical flow estimation and Gaussian filtering techniques.
• Deep learning based viewport prediction technique for video coding that incorporates Spherical CNN and Salient Resnet components.
• VVC compliant weightmap derivation from non-linear fusion of predicted viewports to mask the source video.
• Application of the weightmap in Lagrange optimization and adaptive Quantization Parameter (QP) derivation for VVC.
The remainder of this paper is organized as follows. Section II describes the related work in the area, Section III describes the proposed encoding framework, and Section IV illustrates the experiments and results, followed by concluding remarks in Section V.

II. RELATED WORK
360° video coding is a growing area of research, and numerous solutions have been proposed in the literature to boost the compression efficiency of 360° videos. In general, these research works on 360° video coding can be classified into three categories, namely pre and post coding, context adaptive coding and perceptual coding. Pre and post coding primarily discusses the re-projection techniques of existing 2-D projected 360° video contents as well as the packaging mechanisms of the video frames. In contrast, in context adaptive coding, the spherical properties of the 360° videos are exploited and used by video compression tools such as quantization and motion compensation in order to improve the coding efficiency. Finally, perceptual coding addresses the deployment of user perceptual models, specifically viewport dependent encoding approaches, during the video coding procedures.
VOLUME 10, 2022
The pre and post coding tools for 360° videos play a major role in the 360° video coding domain. The main functionalities of the pre-coding tools are to convert the 360° videos into 2-D space, and rearrange them in rectilinear formats which are suitable for subsequent encoding. Their respective inverse operations at the decoder are performed by the post-coding tools. In this regard, the projection and packaging techniques such as rhombus dodecahedron projection [19], CubeMap (CMP) [20], octahedron projection [21], Truncated Square Pyramid (TSP) [22], icosahedron projection [23], and rotated sphere projection [24] have been studied. In contrast to the pre-coding techniques used in the literature, the proposed framework applies a viewport based weightmap on the input ERP video frame, to provide user perception to the codec.
Context based 360° video coding incorporates spherical characteristics during the video coding processes. Moreover, it is also vital to understand that the locations and magnitudes of the reference pixels pointed to in the spatial and temporal domains by the 2-D codecs may not be accurate in a spherical projected 360° video, as opposed to a conventional 2-D video. In general, spherical projected 360° videos introduce artifacts such as discontinuity at the boundaries, shape distortion and redundancy in pixel samples. These can result in the encoding of invalid pixels, spatial prediction issues, and inefficient motion estimation and motion compensation, which prevent the codec from achieving the potential maximum compression efficiency. In this regard, the application of spherical objective quality metrics and related algorithms in adaptive quantization techniques, Lagrange optimization, quantization parameter optimization, residual weighting, adaptive resolution techniques and Rate-Distortion Optimization (RDO) has been studied in numerous research works [4], [25], [26], [27], [28], [29], [30]. Furthermore, a few studies report that the use of spherical characteristics in motion vector candidate selection, motion compensation and pixel padding can boost coding performance in the temporal domain [12], [31], [32]. As opposed to the aforementioned studies, which incorporate spherical properties in video compression processes, the proposed research incorporates perceptual characteristics into the video coding tools.
Perceptual coding in 360° videos includes ROI detection, which can be very challenging because the viewports are instantly constructed based on the user's head movements. Therefore, in 360° videos, ROI can be approximated by predicting viewport information. In the context of leveraging user observed viewports, many state-of-the-art studies primarily focus on video streaming applications and place less emphasis on the encoding of video content. The literature categorizes the viewport centric 360° video streaming/coding techniques as tiled and non-tiled approaches. Benefiting from parallel processing features, tiled approaches are mainly applicable to the streaming of 360° video contents. As such, tiled approaches either follow a scalable coding solution [33], [34], [35], [36] or assign high bitrates to the tiles that represent primary viewports [37]. Moreover, there are research works that combine both solutions by encoding the viewport dependent tiles at a higher bitrate and providing scalable support to further enhance the Quality of Experience (QoE) of users [38], [39]. Furthermore, multi-layer streaming systems with base and enhancement layers have also been studied [5], [35], [40]. Although tile-based coding systems are useful for streaming purposes, reducing transmission delays, providing higher flexibility and improving the QoE, the associated coding losses remain an issue. Furthermore, viewport driven RDO strategies for 360° video streaming have also been studied [35], [36]. Coding losses can be reduced using non-tiled approaches. However, viewport dependent, non-scalable, non-tiled coding approaches have not been a popular research topic owing to problems related to the generalization of viewport prediction, viewport mapping onto ERP video frames and the associated coding delays.
Among the JVET approved projection schemes for VVC, TSP [22] is the only viewport dependent projection technique that specifically prioritizes the front viewport during the packaging of 360° video in a 2-D format. Furthermore, Sreedhar et al. [41] proposed a multiple viewport resolution centric, rectilinear packaging technique for ERP in which the front viewport is given a higher resolution than the other viewports. These two studies assume that users tend to view the front viewport more often than the other viewports. In contrast, Facebook developed the Barrel layout based on AI-driven saliency maps to identify the regions of user interest prior to encoding [42]. Furthermore, Hu et al. [43] report learned weights driven, viewport dependent Lagrange optimization and adaptive quantization techniques at Coding Tree Unit (CTU) level, which improve the perceptual compression gains of HEVC by 26%.
Building on the non-tiled approach for 360° video coding, this paper proposes several novel approaches that differ from the state-of-the-art techniques [22], [41], [42], [43] in a number of ways. Firstly, the proposed research applies two different techniques for viewport prediction: a hybrid approach that combines bidirectional optical flow estimation and a Gaussian filtering technique to extract saliency features in both the spatial and temporal domains, and a Spherical CNN based deep learning approach to obtain spherical features in the spatial domain only. Secondly, the proposed research employs a Gaussian based non-linear viewport fusion technique in order to detect the ROI on the ERP video frames and generate a weightmap. Thirdly, the proposed research also adopts the generated ROI based weightmap in 1) masking the input ERP frames, 2) adaptive quantization and 3) Lagrange optimization.

A. OVERVIEW OF THE PROPOSED 360° VIDEO CODING FRAMEWORK
This paper proposes a novel ERP coding framework that exploits viewport information in 360° videos to perceptually improve the compression efficiency. The sequential flow of the tasks in the proposed framework is shown in FIGURE 2. The proposed framework first predicts a set of user observed viewports. Once a set of viewports is obtained, they are mapped onto an ERP frame to generate Quality Emphasis Regions (QERs) on the ERP frame. Subsequently, the QERs are fused using a Gaussian operation in order to identify the ROI on the ERP frame. Thereafter, an ROI aware weightmap is generated for each frame and is used to mask the corresponding video frame prior to encoding using VVC. Furthermore, the proposed framework also employs adaptive Lagrange and quantization parameter optimization techniques based on the generated weightmap in order to further improve the compression efficiency of the encoded bitstream.

B. OVERVIEW OF VIEWPORT PREDICTION TECHNIQUES
The proposed framework experiments with two different approaches for predicting the viewports that are used to detect the ROI on an ERP frame. The first approach, Viewport Prediction Hybrid (VPredHYB), applies a hybrid of bidirectional optical flow estimation and Gaussian filtering to obtain a set of viewports from a predefined viewport set. The magnitudes of the optical flow vectors and of the filtered pixels are weighted, and the resulting magnitude is used to select the required number of viewports. The second approach, Viewport Prediction Spherical CNN (VPredSCNN), deploys a deep learning technique incorporating Spherical CNN and Salient Resnet components to directly predict a set of viewports, as opposed to VPredHYB, where viewports are selected from a predefined viewport set. In VPredSCNN, a k-means clustering technique is used after the viewport prediction process in order to obtain the required number of viewports.
Viewport Generation: A user observed viewport in a 360° video frame is defined by the viewport centre $(\phi_o, \theta_o)$, its width $W_{vp}$, its height $H_{vp}$, the horizontal FOV angle $F_h$ and the vertical FOV angle $F_v$ [44]. Here, the viewport centre $(\phi_o, \theta_o)$ is derived from a given pair of spherical coordinates $(\phi, \theta)$ of the 360° video frame. During viewport generation, the pixel values at location $(x, y)$ in the ERP frame that correspond to the sampling location $(m, n)$ on the viewport are determined using interpolation techniques. In this context, the mapping relationship provided by [44] is used in this research for the generation of viewports.
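The geometry behind viewport generation can be sketched as a rectilinear (gnomonic) back-projection from a viewport sample to the sphere. This is only an illustration under assumed axis conventions (x to the right, y up, z along the viewing direction, pitch applied before yaw); the function name is hypothetical and the exact relationship used in this work is the one provided by [44].

```python
import math

def viewport_pixel_to_sphere(m, n, w_vp, h_vp, f_h, f_v, lon_o, lat_o):
    """Return the (longitude, latitude) seen through viewport sample (m, n).

    (lon_o, lat_o) is the viewport centre, f_h and f_v are the horizontal
    and vertical FOV angles in radians. Axis conventions are assumptions.
    """
    # Sample position on an image plane placed at unit distance from the eye.
    u = (2.0 * (m + 0.5) / w_vp - 1.0) * math.tan(f_h / 2.0)
    v = (1.0 - 2.0 * (n + 0.5) / h_vp) * math.tan(f_v / 2.0)
    x, y, z = u, v, 1.0
    # Pitch: tilt the viewing direction up by lat_o (rotation about x).
    y, z = (y * math.cos(lat_o) + z * math.sin(lat_o),
            -y * math.sin(lat_o) + z * math.cos(lat_o))
    # Yaw: turn the viewing direction by lon_o (rotation about y).
    x, z = (x * math.cos(lon_o) + z * math.sin(lon_o),
            -x * math.sin(lon_o) + z * math.cos(lon_o))
    norm = math.sqrt(x * x + y * y + z * z)
    return math.atan2(x, z), math.asin(y / norm)
```

The returned angles can then be converted to a fractional ERP position, at which the pixel value is interpolated (for example bilinearly) from the neighbouring ERP samples.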

C. VIEWPORT PREDICTION HYBRID
VPredHYB is one of the proposed viewport prediction techniques; it fuses viewports predicted using bidirectional optical flow vectors and Gaussian filtered pixels to extract both spatial and temporal features. Optical flow provides a better estimate of the motion trajectory, leading to a better approximation of the regions of human interest. As opposed to many available optical flow estimation methods, bidirectional optical flow utilizes the information from past frames as well as future frames during flow vector estimation, and is hence used in the proposed VPredHYB. Furthermore, a Gaussian filter is included to capture the low frequencies in the signal, to which the human visual system is more sensitive. FIGURE 3 illustrates the overview of the proposed VPredHYB technique. Although an arbitrary number of viewports can be constructed, doing so is impractical owing to the high computational complexity and can also result in the generation of redundant viewports. Therefore, in VPredHYB, a set of viewports $V_T$ with length $N_T$ is constructed first. Here, the value of $N_T$ is chosen such that the majority of the regions on the ERP can be enclosed while limiting the impact on the encoding complexity. Table 1 lists the percentage of the area that a given number of viewports can enclose when projected onto an ERP at the standardized coding resolution of 4432×2216. In the proposed VPredHYB, $N_T$ is set to 20, as it provides a good trade-off between the area enclosed and the complexity of the search. Furthermore, the initial viewport centres are defined such that $\theta_o$ and $\phi_o$ take values in the ranges $[-\pi/2, \pi/2]$ and $[-\pi, \pi)$ at $\pi/4$ and $\pi/2$ intervals respectively.
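The sampling of the initial viewport centres described above yields five latitudes and four longitudes, which is consistent with $N_T = 20$. The short sketch below enumerates them; the function name and the ordering of the resulting list are assumptions made for illustration.

```python
import math

def initial_viewport_centres():
    """Enumerate candidate viewport centres (lon, lat) in radians.

    Latitude covers [-pi/2, pi/2] in steps of pi/4 (5 values) and longitude
    covers [-pi, pi) in steps of pi/2 (4 values). Ordering is an assumption.
    """
    lats = [-math.pi / 2 + k * math.pi / 4 for k in range(5)]
    lons = [-math.pi + k * math.pi / 2 for k in range(4)]
    return [(lon, lat) for lat in lats for lon in lons]

centres = initial_viewport_centres()  # 20 candidate centres
```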
Each of the prediction techniques in VPredHYB is sequentially applied on all of the viewports in $V_T$. In the case of the Gaussian filtering technique, only a particular viewport from a given video frame is generated. In contrast, the viewports corresponding to the previous and subsequent frames are constructed during the estimation of optical flow vectors; hence, this technique cannot be applied to the first and the last frames. Moreover, a weighting factor $\xi$ is applied to the magnitudes of the optical flow vectors and the filtered pixels during the fusion of the two techniques. The resulting fused magnitude $E_v^T$ for a given viewport $v$ ($\in V_T$) can be obtained as shown in Eq. (1), where $E_v^f$, $E_v^o$, $f$ and $N_f$ are the total magnitudes of the Gaussian filtered pixels and the optical flow vectors, the frame number and the number of frames respectively. Furthermore, a viewport set $V_P$ ($\subset V_T$) with length $N_P$ for the frame $f$ is constructed using the $N_P$ viewports that exhibit the greatest $E_v^T$ values, to be used in the generation of the ROI aware weightmap for the ERP frame.
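The selection step can be sketched as follows. The exact form of Eq. (1) is not assumed here; the sketch uses a convex combination weighted by $\xi$ and falls back to the Gaussian term alone on the first and last frames, both of which are illustrative assumptions, as are the function names and the zero-based frame index.

```python
def fuse_magnitudes(e_filter, e_flow, xi, frame, n_frames):
    """Fuse the two per-viewport magnitudes (assumed convex combination)."""
    # Optical flow is unavailable on the first and last frames, so only the
    # Gaussian-filter magnitude is used there (assumption).
    if frame == 0 or frame == n_frames - 1:
        return e_filter
    return xi * e_filter + (1.0 - xi) * e_flow

def select_viewports(fused, n_p):
    """Return the indices of the n_p viewports with the largest fused value."""
    order = sorted(range(len(fused)), key=lambda v: fused[v], reverse=True)
    return order[:n_p]

# Four candidate viewports, keep the best two.
chosen = select_viewports([0.2, 0.9, 0.5, 0.7], 2)
```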

D. VIEWPORT PREDICTION BIDIRECTIONAL OPTICAL FLOW ESTIMATION
Viewport Prediction Bidirectional Optical Flow Estimation (VPredBOFE) is one of the techniques used in VPredHYB; it applies flow vector estimation to predict the user observed viewports. An example of bidirectional optical flow estimation for a given viewport is shown in FIGURE 4. Let the pixel intensities of a viewport at times $t$, $t_0$ and $t_1$ ($t_0 < t < t_1$) be $I_t$, $I_{t_0}$ and $I_{t_1}$ respectively. Then, using the bidirectional optical flow concept [16], $I_t$ can be written in terms of $I_{t_0}$ and $I_{t_1}$ as in Eq. (2) and Eq. (3), where $Gx_{t_m}$, $Gy_{t_m}$, $Vx_{t_m}$ and $Vy_{t_m}$ are the horizontal image gradient, the vertical image gradient, and the horizontal and vertical optical flow vector components of a given viewport at time $t_m$ ($m \in \{0, 1\}$) respectively. Furthermore, because only the previous and successive viewports are used for estimation in the proposed approach, the temporal difference becomes one, thus $t - t_0 = t_1 - t = 1$. Furthermore, assuming that the motion follows the trajectory and that the flow information does not vary greatly between two successive viewports, the optical flow vector components can be written as $Vx_{t_1} = Vx_{t_0} = Vx$ and $Vy_{t_1} = Vy_{t_0} = Vy$. Then, from Eq. (2) and Eq. (3), the error at the pixel location $(i, j)$ in the viewport can be derived. In order to approximate $Vx_{i,j}$ and $Vy_{i,j}$ at pixel location $(i, j)$, the spatial gradients $Gx_{i,j}$ and $Gy_{i,j}$ and the temporal gradient $Gt_{i,j}$ need to be obtained. The temporal gradient is estimated as the pixel intensity difference between the viewports at $t_0$ and $t_1$. Moreover, by convolving horizontal and vertical Sobel filters of size 3×3 over the entire viewport in the respective direction using a local window of size 9×9, the individual spatial gradients of the viewports at $t_0$ and $t_1$ can be obtained. These can be used to extract the combined spatial gradients $Gx_{i,j}$ and $Gy_{i,j}$.
Furthermore, in order to obtain a closer approximation of the optical flow vectors, the error can be minimized inside the window using least squares estimation. However, given that there are many viewports to be processed, the least squares estimation process can be complex. Therefore, the optical flow vectors can be estimated by setting the partial derivatives of the squared error with respect to $Vx_{i,j}$ and $Vy_{i,j}$ to zero. Then the optical flow vectors at pixel location $(i, j)$ (denoted as a in Eq. (5)), $Vx_{i,j}$ and $Vy_{i,j}$, can be estimated from the spatial and temporal gradients as given in Eq. (5).
In Eq. (5), $\delta$ is the denominator correction factor with a value of 0.0001. The total magnitude of the optical flow vectors $E_v^o$ for each viewport $v$ is then determined using Eq. (6).
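The idea of estimating the flow from windowed gradients can be illustrated with a generic least squares sketch. It assumes the brightness constancy form $Gt + Vx \cdot Gx + Vy \cdot Gy = 0$ and solves the resulting 2×2 normal equations jointly, with $\delta$ added to the denominator; this is a stand-in under stated assumptions and may differ from Eq. (5), and summing vector norms for the total magnitude is likewise an assumption about Eq. (6).

```python
import math

def window_flow(gx, gy, gt, delta=1e-4):
    """Least squares flow (vx, vy) from the gradients inside one window.

    gx, gy, gt are flat lists of horizontal, vertical and temporal gradients.
    Minimises sum((gt + vx*gx + vy*gy)**2); delta guards the denominator.
    """
    sxx = sum(a * a for a in gx)
    syy = sum(b * b for b in gy)
    sxy = sum(a * b for a, b in zip(gx, gy))
    sxt = sum(a * c for a, c in zip(gx, gt))
    syt = sum(b * c for b, c in zip(gy, gt))
    det = sxx * syy - sxy * sxy + delta
    vx = (sxy * syt - syy * sxt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy

def total_flow_magnitude(flows):
    """Sum of the Euclidean norms of per-pixel flow vectors (assumed form)."""
    return sum(math.hypot(vx, vy) for vx, vy in flows)
```

With gradients synthesised from a known motion, the estimate recovers that motion up to the small bias introduced by $\delta$.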

E. VIEWPORT PREDICTION GAUSSIAN FILTERING
Viewport Prediction Gaussian Filtering (VPredGF) is the second technique used in VPredHYB; it uses Gaussian filtered pixels to predict the viewports. A visual example of the Gaussian filtered pixels obtained using this technique is shown in FIGURE 5. The Gaussian filter in the spatial domain, $h_{x,y}$, is defined in Eq. (7), where $(x, y)$ are the spatial coordinates and $\sigma$ is the standard deviation. In predicting the viewports, a Gaussian filter $h_{x,y}$ of size $N \times N$ ($N = 7$) with $\sigma = 1$ is convolved over the entire viewport image. Furthermore, a local window with central pixel coordinates $(i, j)$ and a size equal to that of the Gaussian filter, spanning $N^2$ pixels in the viewport image, is used to identify the mean intensity. The mean intensity $\mu_{i,j}$ of a particular window is computed as shown in Eq. (8).
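The filter and the window mean can be sketched as below, assuming the standard 2-D Gaussian $\frac{1}{2\pi\sigma^2}\exp\left(-\frac{x^2+y^2}{2\sigma^2}\right)$ sampled on an $N \times N$ grid and a plain arithmetic mean over the $N^2$ window samples. Normalising the kernel to unit sum and indexing the image as rows then columns are assumptions of this sketch, as are the function names.

```python
import math

def gaussian_kernel(n=7, sigma=1.0):
    """Sample an n x n Gaussian kernel and normalise it to sum to one."""
    half = n // 2
    k = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
          / (2.0 * math.pi * sigma * sigma)
          for x in range(-half, half + 1)]
         for y in range(-half, half + 1)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]

def window_mean(img, i, j, n=7):
    """Mean of the n x n window of img (list of rows) centred at row i, col j."""
    half = n // 2
    vals = [img[i + y][j + x]
            for y in range(-half, half + 1)
            for x in range(-half, half + 1)]
    return sum(vals) / (n * n)
```

The window must lie fully inside the image, so border pixels would need padding or exclusion in practice.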
In Eq. (8), $Ig_{i+x,j+y}$ is the Gaussian filtered pixel intensity at location $(x, y)$ in the window with central pixel coordinates $(i, j)$. However, the mean intensity must be removed prior to the computation of the magnitude of the filtered pixels, as it can bias the low frequency signal components within the particular window [45]. Subsequently, the magnitude of the filtered pixels $E_v^f$ of viewport $v$ can be determined using Eq. (9).

F. VIEWPORT PREDICTION SPHERICAL CNN
Viewport Prediction Spherical CNN (VPredSCNN) is an alternative approach to VPredHYB that can precisely predict a set of viewports. Although an arbitrary number of viewports can be generated from a single 360° video frame, user interest can be limited to only a few viewports with their centres located close to one another. In such cases, the use of VPredHYB becomes a disadvantage, as the selected viewports always have at least a predefined distance between them, which may prevent them from enclosing the regions of user interest. Moreover, the dual process of VPredBOFE and VPredGF in VPredHYB consumes a large amount of computational time in evaluating each viewport separately. In order to address these issues, as opposed to selecting a viewport subset $V_P$ from the viewport set $V_T$, in VPredSCNN the viewport set $V_P$ is directly predicted using trained deep learning models.
In this regard, a deep learning architecture composed of Spherical CNN [17] and Salient-Resnet [18] components is employed.

1) SPHERICAL CNN
Spherical CNN [17] is a special type of convolutional neural network that defines rotation-equivariant spherical cross-correlation for spherical signals. As opposed to the planar CNN, which applies cross-correlation on a 2-D image, Spherical CNN is developed to support signals on the sphere $S^2$. Here, $S^2$ can be defined as a set of points in 3-D space $\mathbb{R}^3$ and can be parameterized by the spherical coordinates $\alpha \in [-\pi, \pi]$ and $\beta \in [-\pi/2, \pi/2]$. Moreover, a set of rotations called SO(3) is parameterized using $\alpha \in [-\pi, \pi]$, $\beta \in [-\pi/2, \pi/2]$ and $\gamma \in [0, 2\pi]$, where $\gamma$ performs rotation around the axis for a particular $(\alpha, \beta)$. In order to represent $S^2$ and SO(3), Spherical CNN defines two types of cross-correlation called S2Conv and SO3Conv respectively. Both S2Conv and SO3Conv output feature maps on SO(3) space. However, S2Conv receives its input as spherical signals with $k$ channels, as opposed to SO3Conv, which takes in feature maps with $k$ rotations from SO(3) space itself.

2) SALIENT-RESNET
Salient-Resnet [18] used in the proposed VPredSCNN is a CNN based architecture with Resnet [46] as its backbone, which uses skip connections to address vanishing gradient problems. In the context of Salient-Resnet, a Salient block featuring global pooling, batch normalization and a ReLU unit has been used as the residual block of the Resnet.

3) PROPOSED VPredSCNN
The key idea of VPredSCNN is to extract spherical and salient features from the luma component of 360° video. To do so, the proposed VPredSCNN is made up of a two-branch deep learning pipeline consisting of Spherical CNN based cross-correlation blocks, Salient-Resnet and Fully Connected Layers (FCN), as shown in FIGURE 6(a). In the first branch, a single layer of S2Conv and three layers of SO3Conv are used with equatorial grids, as visual information is densely concentrated along the equator. In this regard, for S2Conv, $\alpha$ is set from $-\pi$ to $\pi$ at 1024 intervals while $\beta = 0$. In the case of SO3Conv, $\alpha$ is set from $-\pi$ to $\pi$ at 32 intervals, while $\beta = 0$ and $\gamma$ takes the values $-\pi/8$ and $\pi/8$. Moreover, the input to the network is the single channel luma component of an ERP frame. Across the four layers, the number of features is increased to 32, 32, 64 and 128, while the bandwidth is reduced from 512 to 64, 32, 16 and 10. The bandwidth is initially set at 512 to maintain a higher resolution of the spatial grid, as ERP frames are high resolution images. Furthermore, after the final layer, the signal is integrated over SO(3) to obtain a tensor of size 1 × 128, which is concatenated with the outputs from the Salient-Resnet. Here, Salient-Resnet is defined as a 34 layer network in order to reduce the computational time.
In Eq. (10), $s$, $n_s$, $V_s$ and $V_{pred,v}$ are the subject, the number of subjects, the ground truth viewport coordinate obtained from subject $s$ and the predicted viewport coordinate of viewport $v$ ($\in V_T$) respectively. Furthermore, during the training of the network, a validation set is used to tune the hyper-parameters of the model. In this regard, the network was trained using the Adam optimizer with the following hyper-parameters: initial learning rate $1 \times 10^{-3}$, weight decay $1 \times 10^{-5}$, coefficient of great circle distance $\lambda_g = 100$, coefficient of Euclidean distance $\lambda_e = 1$, batch size 1 and 30 epochs. The batch size and the number of epochs were limited by the computational complexity of the training process and the availability of resources.
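A loss that combines the two distances named above can be sketched as follows. The exact form of Eq. (10) is not assumed; this illustration simply averages, over the subjects, a weighted sum of the great circle distance and the Euclidean distance between the predicted and ground truth viewport centres, using the stated coefficients as defaults. The function names and the (longitude, latitude) tuple layout are hypothetical.

```python
import math

def great_circle(p, q):
    """Great circle distance (radians) between two (lon, lat) points."""
    (lon1, lat1), (lon2, lat2) = p, q
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * math.asin(min(1.0, math.sqrt(a)))

def euclidean(p, q):
    """Euclidean distance between two coordinate pairs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def viewport_loss(pred, truths, lam_g=100.0, lam_e=1.0):
    """Average weighted distance between one prediction and all subjects."""
    total = sum(lam_g * great_circle(pred, t) + lam_e * euclidean(pred, t)
                for t in truths)
    return total / len(truths)
```

The great circle term respects the wrap-around of longitude, whereas the Euclidean term on raw coordinates does not, which may be one reason for weighting the former far more heavily.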
Furthermore, the predicted viewports are clustered using k-means clustering to obtain a viewport set $V_P$. Here, $k$ is the same as the required number of viewports $N_P$. Furthermore, the centre coordinates of each cluster are obtained as pairs of $(\phi_o, \theta_o)$, which are then used for QER generation on the ERP.
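The clustering step can be illustrated with a small, dependency-free k-means sketch. It treats $(\phi_o, \theta_o)$ as planar coordinates and initialises the centres from the first $k$ points so that the result is deterministic; both choices are simplifying assumptions (longitude wrap-around is ignored), and the function name is hypothetical.

```python
def kmeans(points, k, iters=20):
    """Cluster 2-D points into k groups and return the cluster centres."""
    centres = [points[i] for i in range(k)]  # deterministic init (assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centre (squared distance).
            idx = min(range(k),
                      key=lambda c: (p[0] - centres[c][0]) ** 2
                                    + (p[1] - centres[c][1]) ** 2)
            clusters[idx].append(p)
        new_centres = []
        for c, members in enumerate(clusters):
            if members:
                new_centres.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centres.append(centres[c])  # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return centres
```

Each returned centre would serve as one viewport centre for the subsequent QER generation.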

G. VIEWPORT FUSION
The viewports predicted using VPredHYB and VPredSCNN need to be mapped onto an ERP frame for encoding. The mapped viewports can then be fused to generate a pooled region of QERs that represents the generalized ROI of users. The generalized ROI represents the user interest on the ERP and is used to emphasize the regions of user interest during encoding.
A QER is a rectilinear region on the ERP, with its centre known as the Quality Emphasis Centre (QEC), that represents an area corresponding to a viewport. In order to identify a QER on the ERP frame, the centre coordinates of the viewport $(W_{vp}/2, H_{vp}/2)$ and the four vertices $(0, 0)$, $(W_{vp}, 0)$, $(0, H_{vp})$ and $(W_{vp}, H_{vp})$ of the viewport are projected onto the ERP frame using the mapping relationship provided in [44]. Furthermore, the identified QERs cannot be directly used to construct the mask, as this would cause undesirable edges in the ERP frame, disrupting the prediction schemes in VVC. Therefore, a Gaussian operation is performed to fuse all of the QERs, as it does not cause a sudden decrease in pixel intensities along the spatial direction. By fusing all of the QERs, an ROI aware weightmap $w_{x,y}$ can be generated using Eq. (11).
In Eq. (11), $dx_{q,x,y}$ and $dy_{q,x,y}$ are the shortest horizontal and vertical distances between the centre of QER $q$, $QEC_q$, and the pixel coordinates $(x, y)$ in the ERP, and $\sigma_q^2$ is the variance. Unlike the Gaussian filter operation, where a filter is convolved over the entire viewport image, here the entire ERP image is considered as a single unit when performing the Gaussian operation, such that $\sigma_q^2$ is computed as shown in Eq. (12).
Furthermore, when computing $dx_{q,x,y}$, $dy_{q,x,y}$ and $\sigma_q^2$, the continuity of a 360° image along the vertical edges of the ERP frame is also considered. For example, consider a $QEC_q$ positioned near the left vertical edge of an ERP frame; the shortest distance between it and a pixel located near the right vertical edge is then measured across the vertical edges, since an ERP represents a spherical image and is continuous along the normal to the longitude. Furthermore, in fusing the QERs, a max operation is performed in order to give priority to the nearest QER in enclosing all possible pixels in the ROI. An example of the fusion of three QERs, the corresponding ROI aware weightmap and a mesh diagram of the weightmap are shown in FIGURE 7(a), FIGURE 7(b) and FIGURE 7(c) respectively.
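The fusion described above can be sketched as a per-pixel maximum over Gaussian falloffs centred on the QECs, with the horizontal distance wrapped across the vertical edges. The Gaussian form $\exp\left(-\frac{dx^2 + dy^2}{2\sigma_q^2}\right)$ is an assumption about Eq. (11), and the variances are passed in directly rather than derived as in Eq. (12); the function name is hypothetical.

```python
import math

def roi_weightmap(width, height, qecs, variances):
    """Build a weightmap (list of rows) from QEC positions and variances.

    qecs is a list of (x, y) centres in ERP pixels, variances the matching
    sigma^2 values. Each pixel takes the largest Gaussian response.
    """
    weights = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best = 0.0
            for (cx, cy), var in zip(qecs, variances):
                dx = abs(x - cx)
                dx = min(dx, width - dx)  # shortest path may cross the edge
                dy = abs(y - cy)
                best = max(best, math.exp(-(dx * dx + dy * dy) / (2.0 * var)))
            weights[y][x] = best
    return weights

# A QEC on the left edge influences pixels on the right edge equally.
w = roi_weightmap(8, 4, [(0, 2)], [4.0])
```

Using the maximum rather than a sum keeps the weights within [0, 1] where QERs overlap.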
Since VVC encoding includes spatial prediction, the error can be propagated from the quality degraded pixels to the neighbouring pixels, including those enclosed in the ROI. Therefore, a quality factor $\rho$ is introduced to the weightmap in order to compensate for the quality degradation that may occur in the pixels neighbouring the ROI during the encoding processes. The modified weightmap $\hat{w}_{x,y}$ can hence be defined as in Eq. (13).
The min operation used here replicates a clipping mechanism that restricts the weights to a maximum of one. Furthermore, the visualization of the weightmap after introducing the quality factor is shown in FIGURE 8.

H. LAGRANGE OPTIMIZATION AND ADAPTIVE QUANTIZATION
The proposed ERP coding framework further incorporates Lagrange Optimization and Adaptive Quantization (LOAQ) techniques to enhance the RDO process in VVC. Let $D$ and $R$ denote the distortion and the bitrate required for a given coding process; RDO in VVC then minimizes the cost $J = D + \lambda R$, where $J$ is the cost of the coding process and $\lambda$ is the Lagrange multiplier. The optimization problem can be solved for different Coding Unit (CU) selections in VVC. In this context, the proposed weightmap is introduced to the cost function in order to account for the perceptual characteristics. In this regard, the cost function for a given CU selection can be modified as in Eq. (15), where $D_{cu}$, $R_{cu}$, $\lambda_{base}$ and $w_{cu}$ are the distortion, the bitrate, the original Lagrange parameter and the proposed weight at CU level respectively. This can also be represented by Eq. (16) [29].
where λ cu is the new Lagrange parameter of the CU which is given by λ cu = λ base w cu Furthermore, the quantizer design in VTM is based on scalar quantization [10]. Here, quantization step Q is defined as, Furthermore, since Q 2 is proportional to λ base [47], λ base and λ cu can also be written in terms of original QP QP base and the adaptive QP QP cu as given in Eq. (18) and Eq. (19).
where c is a constant. Using equations Eq. (16), Eq. (18) and Eq. (19), the QP cu can be derived as, Furthermore, λ cu and QP cu can be deployed at CU level for the optimization of coding process. Although the weight w cu adopts the weightmap derived for masking, it cannot be used in its normative form as greater pixel intensity variation may be seen inside a CU. w cu is therefore defined as, where,w if ∀ q (x, y) ∈ QER q ψŵ x,y , otherwise and W cu , H cu , ψ are the width, height of CU and a constant CU weighting factor respectively. Furthermore, an offset value of 10 pixels to QERs have been used in this process in order to compensate for any projection error that occurred during the generation of the QERs.
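The chain from CU weight to adaptive QP can be sketched as follows, under several stated assumptions: the CU weight is taken as the mean of the per-pixel weights, pixels inside a QER are given weight 1.0, the Lagrange parameter scales as λ_cu = λ_base / w_cu, and the step size follows the commonly used relation Q = 2^((QP − 4)/6) with λ proportional to Q^2. Together these give QP_cu = QP_base − 3 log2(w_cu). The function names and the clamp to the range 0 to 63 are illustrative choices, not part of the framework's description.

```python
import math

def cu_weight(w_hat, in_qer, psi):
    """Mean per-pixel weight over one CU.

    w_hat  : 2-D list of modified weights for the CU's pixels
    in_qer : 2-D list of booleans, True where the pixel lies inside a QER
    psi    : constant CU weighting factor
    Assumption: pixels inside a QER count as 1.0, all others as psi * w_hat.
    """
    height, width = len(w_hat), len(w_hat[0])
    total = sum(
        1.0 if in_qer[y][x] else psi * w_hat[y][x]
        for y in range(height)
        for x in range(width)
    )
    return total / (width * height)

def adaptive_lambda_qp(lambda_base, qp_base, w_cu, qp_min=0, qp_max=63):
    """Derive (lambda_cu, qp_cu) from the CU weight under the stated assumptions."""
    if w_cu <= 0.0:
        raise ValueError("w_cu must be positive")
    lambda_cu = lambda_base / w_cu
    # lambda = c * 2^((QP - 4) / 3)  =>  QP_cu = QP_base - 3 * log2(w_cu)
    qp_cu = int(round(qp_base - 3.0 * math.log2(w_cu)))
    return lambda_cu, max(qp_min, min(qp_max, qp_cu))
```

In this sketch a CU with w_cu = 1 keeps the base parameters, while halving the weight doubles λ and raises the QP by three, so CUs away from the ROI are quantized more coarsely.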
Signalling: Signalling is important for the reconstruction of the encoded video sequence at the decoder. Because the proposed masking is applied as preprocessing, it does not by itself require any information to be sent to the decoder. However, the weights contributing to the adaptive quantization must be known at the decoder in order to derive the correct QP value. Since N_p, W_vp, H_vp, F_h and F_v can be made available at the decoder, only the viewport centres (φ_o, θ_o) need to be signalled to the decoder for each encoded frame. In the case of VPredHYB, the index of the viewport set V_T is signalled to the decoder. However, the viewport coordinates predicted using VPredSCNN cannot be signalled in their original form, as the fractional values would increase the cost of transmission. Therefore, (φ_o, θ_o) are rounded to the nearest integer and the QERs are constructed from the resulting values for use in the derivation of w_cu.

IV. EXPERIMENTS AND RESULTS
360° video sequences from the VQA-ODV dataset [48] are used in the experiments. The dataset has 60 reference sequences with HM and Eye Movement (EM) data collected from more than 200 subjects. Since VPredSCNN requires prior training, the dataset is split into train and test sets. The test set is determined first, using Spatial Information (SI) and Temporal Information (TI), such that the test sequences cover every quadrant of the SI vs TI graph. As shown in the SI vs TI graph in FIGURE 9, three sequences from each quadrant, (high SI, low TI), (low SI, high TI), (low SI, low TI) and (high SI, high TI), are selected as the test sequences in order to represent all four quadrants in the experiments. Subsequently, the remaining 48 sequences are used for the training of VPredSCNN. Moreover, the selected test sequences vary in resolution between 4K and 8K.
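The quadrant-based split can be sketched as below. The use of the median SI and TI as quadrant boundaries and the choice of the first three names (in sorted order) per quadrant are assumptions made purely for illustration; the text does not state how the boundaries or the three sequences per quadrant were chosen.

```python
from statistics import median

def split_by_quadrant(si_ti, per_quadrant=3):
    """si_ti maps sequence name -> (SI, TI). Returns (test_names, train_names)."""
    si_thr = median(v[0] for v in si_ti.values())  # assumed boundary
    ti_thr = median(v[1] for v in si_ti.values())  # assumed boundary
    quadrants = {}
    for name, (si, ti) in sorted(si_ti.items()):
        key = (si >= si_thr, ti >= ti_thr)  # (high SI?, high TI?)
        quadrants.setdefault(key, []).append(name)
    test = [n for names in quadrants.values() for n in names[:per_quadrant]]
    train = [n for n in si_ti if n not in test]
    return test, train
```

Applied to 60 sequences with at least three in every quadrant, such a split yields 12 test sequences and 48 training sequences, matching the counts above.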

A. TESTING
The proposed framework is developed using 360Lib version 10.0 [49] and incorporated in VTM 7.0. Both VPredSCNN and VPredHYB have been developed for VVC independently of each other and tested accordingly, using the same test sequences. The test sequences are coded at 4432 × 2216 resolution as per the Common Testing Conditions (CTC) recommended by JVET. Furthermore, the viewports are constructed as per the following parameters from the CTC: W_vp = 1920, H_vp = 1080, F_h = 78.1° and F_v = 49.1°. The experiments are conducted under the All-Intra (AI) configuration since the ROI is determined only from the spatial distance between the QEC and the pixel coordinates. Moreover, based on the proposed viewport prediction techniques and Lagrange optimization techniques, eight different variants of the proposed framework are derived and evaluated. The derived variants and corresponding coding parameters are listed in Table 2.
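As a quick consistency check on these viewport parameters, the sketch below assumes an ideal rectilinear (pinhole) viewport with square pixels, derives the focal length in pixels from W_vp and F_h, and recovers the vertical field of view from H_vp. The pinhole model, the pixel-centre convention and the function names are assumptions for illustration and are not taken from 360Lib.

```python
import math

def focal_length_px(width_px, h_fov_deg):
    """Focal length in pixels of a rectilinear viewport with the given horizontal FoV."""
    return (width_px / 2.0) / math.tan(math.radians(h_fov_deg) / 2.0)

def vertical_fov_deg(height_px, focal_px):
    """Vertical FoV implied by the viewport height and focal length (square pixels)."""
    return 2.0 * math.degrees(math.atan((height_px / 2.0) / focal_px))

def pixel_to_direction(i, j, width_px, height_px, focal_px):
    """Unit viewing direction (x right, y down, z forward) through pixel centre (i, j)."""
    x = (i + 0.5) - width_px / 2.0
    y = (j + 0.5) - height_px / 2.0
    norm = math.sqrt(x * x + y * y + focal_px * focal_px)
    return (x / norm, y / norm, focal_px / norm)
```

For W_vp = 1920 and F_h = 78.1°, the vertical field of view recovered for H_vp = 1080 lies within a tenth of a degree of the stated F_v = 49.1°, so the two angles are consistent with square pixels.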

B. CODING PERFORMANCE
The coding performance is evaluated using the Bjøntegaard Delta Rate for the luma component (BDR-Y) [50], Encoder Time (ET) and Decoder Time (DT) with respect to the VTM 7.0 reference software. In the BDR-Y calculations, Viewport Peak Signal-To-Noise Ratio (VPSNR) [35], [36], [51] is used to measure the video quality, as opposed to the conventional spherical objective quality metrics, which cannot assess the objective quality of 360° videos at viewport level. VPSNR constructs viewports from the HM data of the subjects and applies PSNR calculations between the reference viewport and the tested viewport, both constructed from the same HM coordinates. In this research, the HM data provided in the VQA-ODV dataset have been used in the video quality assessment. Formally, let Ori and Imp denote the pixel values of the original and impaired sequences at coordinates (i, j) of viewport v of frame f, and let s represent a subject. Then the Viewport Mean Square Error (VMSE_f) and VPSNR are given by, where n_v and N_f are the number of viewports sampled per frame and the number of frames, respectively. Furthermore, the change in quality between the anchor and the proposed algorithms (ΔVPSNR) has also been used in the assessment of the coding performance. Here, a positive value for ΔVPSNR indicates quality loss with respect to the anchor, while a negative value for BDR-Y indicates an overall compression gain.
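A minimal sketch of a viewport-level PSNR computation is given below. The averaging order (mean squared error averaged over the viewports of a frame, converted to PSNR per frame, then averaged over frames) and the peak value of 255 are assumptions; the exact VMSE_f and VPSNR definitions should be taken from [35], [36], [51]. Viewports are represented as plain 2-D lists of luma samples, and the function names are hypothetical.

```python
import math

def viewport_mse(ori, imp):
    """Mean squared error between two equally sized 2-D viewports."""
    height, width = len(ori), len(ori[0])
    total = sum(
        (ori[j][i] - imp[j][i]) ** 2
        for j in range(height)
        for i in range(width)
    )
    return total / (width * height)

def vpsnr(ref_frames, test_frames, peak=255.0):
    """ref_frames[f][v] and test_frames[f][v] hold viewport v of frame f.

    Assumed aggregation: average MSE over viewports, PSNR per frame, mean over frames.
    """
    per_frame = []
    for ref_vps, test_vps in zip(ref_frames, test_frames):
        vmse = sum(viewport_mse(r, t) for r, t in zip(ref_vps, test_vps)) / len(ref_vps)
        per_frame.append(float("inf") if vmse == 0 else 10.0 * math.log10(peak * peak / vmse))
    return sum(per_frame) / len(per_frame)
```

The quality difference between the anchor and a proposed variant would then be obtained by evaluating `vpsnr` for each against the same reference viewports.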
The coding performances of the proposed variants are shown in Table 3 and Table 4. Table 3 presents the performance of the proposed methods without the LOAQ component, whereas Table 4 shows the results with the proposed LOAQ included. It is evident that all the proposed variants outperform the anchor implementation of the reference software. VPredSCNN outperforms the other variants, achieving an overall bitrate saving of 4.99% (and up to 15.96% for the CougourTreats sequence) with no adverse effect on the decoding time. This is further improved to 5.85% by the addition of LOAQ. However, this comes at the cost of increased computational complexity at both the encoder and the decoder. In particular, the decoding time rises by a factor of 2.65 owing to the calculations needed to reconstruct the QPs. Moreover, in the case of VPredSCNN, the sequences with higher SI produce higher bitrate savings, whereas those with lower SI, such as ConcertLive, AuroraQuartzLake and BuddhaCave, produce low to no gain. This is because the VPredSCNN model is trained to extract features from the spatial domain only.
Furthermore, VPredBOFE and VPredGF produce consistent gains across all the sequences and achieve gains of up to 3.97% and 3.84% (for the PandaBaseChengdu sequence), respectively. When they are combined to form VPredHYB, the average compression gain increases to 2.10%, with up to 9.43% for the CougourTreats sequence. Moreover, LOAQHYB could not improve the coding performance either in isolation or in combination with VPredHYB. This is mainly because the sparsely predicted coordinates can disrupt the intra prediction process owing to the biased pixel intensities of the non-neighbouring CUs. Furthermore, it is observed that VPredBOFE, VPredGF and VPredHYB suffer from heavy encoding complexity, in excess of 200%, resulting from the exhaustive search over all the viewports in V_T.
Table 5 presents the performance comparison between the state-of-the-art works [22], [41], [43] and VPredSCNN + LOAQSCNN. It is evident that the proposed variant outperforms the state-of-the-art methods in terms of perceptual compression efficiency; however, it remains computationally complex. The aforementioned state-of-the-art works suffer perceptual losses resulting from the inaccurate prediction of QERs in the ERP. Furthermore, unlike Hu et al. [43], both Sreedhar et al. [41] and TSP [22] show inconsistent gains and heavy losses (up to 10.15 dB and 12.52 dB, respectively) in viewport quality due to the re-sampling and re-packaging of poorly predicted QERs in their methodologies. Moreover, the 35% compression performance achieved for the CougourTreats sequence by TSP [22] comes at a perceptual quality loss of 5.47 dB. Furthermore, the HM data for the test sequences comprise both fine and coarse distributions of user observed viewport coordinates. The re-sampling strategies in Sreedhar et al. [41] and TSP [22] are capable of exploiting the fine distributions of viewport coordinates but not the coarse distributions, which explains the variations in their respective perceptual compression performances and quality losses. Also, the viewport dependent coding parameter adaptation at CTU level followed by Hu et al. [43] is not an efficient strategy for VVC due to its finer QTBT + multi-type tree partitioning structure. Moreover, for a better understanding, a visual comparison of a given viewport generated from the reference, the anchor, the state-of-the-art works and VPredSCNN + LOAQSCNN for the ManHattan sequence is shown in FIGURE 10. In the figure, it is noticeable that the anchor, Hu et al. [43] and the proposed VPredSCNN + LOAQSCNN display only the common compression artefacts. As presented in Table 5, they do not exhibit large variations in the viewport quality assessments. Moreover, Sreedhar et al. [41] and TSP [22] additionally exhibit seam artefacts, which are enclosed by yellow rectangles in the figure.
FIGURE 11 illustrates the Rate-Distortion curves for a selected sequence. It is observed that the proposed variants show improved performance compared with the anchor. An important observation from these plots is the behaviour of Sreedhar et al. [41] and TSP [22]. Both approaches tend to saturate in quality as the bitrate increases. They cross over the anchor at low bitrates, suggesting that the re-sampling and re-packaging strategies can bring benefits for low bitrate sequences. However, the proposed variants can produce gains in the high bitrate range, which can be a real benefit for 360° videos that need to be encoded at higher resolutions.
Furthermore, the variation of the bitrate savings of VPredSCNN for various values of the quality factor ρ and N_p is presented in Table 6. When considering the BDR-Y values for V_P with N_p ≥ 6, a parabolic pattern with the minimum value at (N_p = 8, ρ = 3.0) can be observed in both the horizontal and vertical directions. This indicates that increasing the number of viewports and ρ beyond this point raises the bitrate cost. Conversely, a smaller number of viewports and smaller ρ values can have a negative impact on the objective viewport quality, resulting in minor coding gains. Furthermore, Table 6 also reports significant coding gains with the use of fewer viewports (i.e., N_p < 6). However, these results are inconsistent across the sequences, similar to the results obtained for Sreedhar et al. [41] and TSP [22]. Here, for certain sequences, the fusion of the QERs has accounted for all the user observed viewports, which resulted in massive gains. However, the smaller the number of viewports, the more widely spaced the predicted viewport centres would be. This would hinder the intra prediction, as some regions in between two QERs can be adversely affected by the proposed masking process.
Table 7(a) and Table 7(b) present the variation of the bitrate saving of VPredHYB for various values of the number of viewports N_p and the quality factor ξ at ρ = 3.5, respectively. Unlike in the case of VPredSCNN, the increase of viewports does not result in an increased gain. Since the initial coordinates are predefined at constant intervals, it is likely that additional redundant information is coded when a greater number of viewports is used. Conversely, visually degraded results are obtained when a lower number of viewports is deployed. Furthermore, the impact of ξ can be great when both VPredGF and VPredBOFE are combined, but the gain is likely to saturate between ξ = 0.4 and ξ = 0.75.

C. VIEWPORT PREDICTION ACCURACY
The accuracy of viewport prediction techniques can be measured using metrics such as Intersection over Union (IoU) and Intersection over Ground truth (IoG). IoU measures the ratio between the intersection area and the union area. Here, the intersection and union areas are measured with respect to the QERs generated from the predicted viewport coordinates and from the ground truth data. In this context, for a given set of predicted viewports V_P, IoU is measured by constructing N_p QERs from the predicted coordinates and n_s QERs from the ground truth data. Hence, the number of elements in V_P is increased to obtain a better intersection. In doing so, the IoU score of a given sequence is affected by the additional viewports, as the distribution of the viewport centres may vary from one sequence to another. Hence, for a fixed number of viewports, IoU cannot isolate the measure of intersection across different sequences, which is what accounts for the information loss in video coding approaches. As opposed to IoU, IoG measures the ratio between the intersection area and the area of the QERs generated from the ground truth data, negating the effect of the union area. This provides a better estimate of whether the user observed viewports are actually covered by the predicted viewports when the number of viewports is fixed for all sequences.
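The difference between the two measures can be made concrete with a short sketch operating on binary masks, where a truthy entry marks a pixel covered by at least one QER. The mask representation and the function name are illustrative assumptions.

```python
def iou_and_iog(pred_mask, gt_mask):
    """Return (IoU, IoG) for two equally sized 2-D binary masks."""
    inter = union = gt_area = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += bool(p and g)   # covered by both prediction and ground truth
            union += bool(p or g)    # covered by either
            gt_area += bool(g)       # covered by ground truth
    iou = inter / union if union else 0.0
    iog = inter / gt_area if gt_area else 0.0
    return iou, iog
```

A prediction that covers the whole ground truth region but also extends beyond it reaches an IoG of one while its IoU stays below one, which is why IoG is the more relevant indicator of whether user observed viewports are protected from masking.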
The scores of the viewport prediction accuracy measures, IoU and IoG, for VPredHYB and VPredSCNN are illustrated in FIGURE 12. In this figure, the variations of IoU and IoG with respect to the number of viewports N_p are shown in FIGURE 12(a), FIGURE 12(b), FIGURE 12(d) and FIGURE 12(e) for VPredSCNN and VPredHYB, respectively. Furthermore, FIGURE 12(c) and FIGURE 12(f) show the variation of these measures for various values of ξ in VPredHYB for N_p = 6. Since the prediction techniques produce fewer viewports than there are subjects, the IoU scores are low in all scenarios. However, in the context of video coding, it is sufficient if the predicted QERs can account for the subjects' viewports and avoid disruption to the intra prediction of the video codec. In support of this, the IoG scores increase for both VPredSCNN and VPredHYB as the number of viewports increases. This is because an increased number of viewports can improve the overlap of the generated QERs with the subjects' HM coordinates. Moreover, for VPredHYB, there is no great difference in the IoU and IoG scores obtained for the different values of ξ.
The distributions of the subjects' HM coordinates (Subject data), the 20 viewports predicted using VPredSCNN and the 20 predefined viewports used in VPredHYB are illustrated in FIGURE 13 for the 1st, 9th and 17th frames of the CougourTreats and AuroraQuartzLake sequences. The two sequences are chosen because one produces very high coding gains for the proposed variants, while the other does not. It is evident that the distribution of the viewport coordinates is much closer to the ground-truth HM data for the CougourTreats sequence than for the AuroraQuartzLake sequence. Hence, it can be concluded that the subsets of viewports selected from the 20 viewports in both VPredSCNN and VPredHYB perform better for the CougourTreats sequence than for the AuroraQuartzLake sequence. Furthermore, the distribution of the predicted viewport coordinates from the proposed variants without LOAQ is shown in FIGURE 14. In this figure, it can be observed that the coordinates predicted using VPredSCNN (N_p = 8) are closer to the subject data than those of the other proposed variants, which substantiates the bitrate improvements of VPredSCNN in comparison with the other variants.

V. CONCLUSION
Existing perceptual video coding algorithms cannot be applied directly to 360° videos, which are not represented in their visually observed format when encoded. In response, a novel 360° video coding framework has been developed to leverage the user observed viewport information in the VVC coding pipeline in order to reduce the bitrate at the same perceptual quality. To this end, the proposed framework first applies a deep learning architecture incorporating Spherical CNN components, and a fusion of bi-directional optical flow estimation with a Gaussian filtering technique, in order to develop two viewport prediction techniques, namely VPredSCNN and VPredHYB, respectively. Furthermore, based on the predicted viewports, the proposed framework generates QERs to identify the ROI on the ERP. Subsequently, by fusing the QERs, an ROI aware weightmap is developed and applied as a mask to the source video. Furthermore, the proposed framework also employs the weightmap to support the Lagrange optimization and adaptive quantization procedures in VVC.
In the context of VTM 7.0 integrated with 360Lib 10.0, the experiments conducted for the different variants of the proposed framework show that it outperforms the state-of-the-art techniques and reports significant coding gains. VPredSCNN, when combined with LOAQ, yields the highest compression gains, with average bitrate savings of 5.85% (and up to 17.15%), with an increase of 249% and 265% in encoder and decoder complexity, respectively. Additionally, it has been shown that the removal of LOAQ from the coding framework eliminates the decoder complexity overhead and reduces the encoding time with a slight drop in coding gain. Moreover, temporal domain support for VPredSCNN, integration of the proposed VPredSCNN and VPredHYB techniques, several other viewport prediction techniques and inter coding compatibility for the proposed 360° video framework can be explored in the future.