Multi-Model Motion Prediction for 360-Degree Video Compression

Efficient video compression is fundamental for enabling today’s highly interactive multimedia landscape. With the trend towards virtual reality, efficient storage and transmission of 360-degree video content becomes increasingly important. Recent works in this area addressed the design and investigation of 360-degree-specific motion models for improved compression efficiency. We investigate the strengths and limitations of these motion models and show that a single model cannot sufficiently cover all motion scenarios. A video codec that combines the strengths of multiple models in the motion compensation procedure is expected to achieve notable gains in compression efficiency. However, the additional motion model signaling and switching costs quickly outweigh the gains achieved by improved motion compensation. In this paper, we address this challenge by proposing advanced motion prediction and coding schemes that significantly reduce the resulting side information overhead. A novel multi-model motion vector prediction technique ensures seamless cooperation between different motion models by generalizing the motion modeling concept through forward and backward passes, and yields significant improvements in the overall compression efficiency. Our proposed hierarchical coding scheme is broadly applicable and is shown to be the most effective among comparable coding schemes. Experimental results demonstrate the performance of the proposed multi-model coding framework with average Bjøntegaard Delta rate savings of 3.20% with a peak of 4.49% based on PSNR and 2.76% with a peak of 4.04% based on WS-PSNR compared to the state-of-the-art H.266/VVC video coding standard.


I. INTRODUCTION
Many of today's most popular multimedia applications for entertaining, educating and connecting people build on the exchange of video. These applications are only possible thanks to highly effective video compression techniques that are able to considerably reduce the amount of data associated with video. Effective video compression becomes even more important for 360-degree video, as it entails even larger amounts of data owing to its all-around field of view and the high requirements on resolution and frame rate in order to deliver immersive virtual reality experiences.
As a 360-degree video provides data for all viewing directions from a given viewpoint, it is typically interpreted to lie on the surface of a sphere in 3D space, as shown in Fig. 1(a). To store this data in a conventional 2D planar representation, different projection formats can be applied to map the data from the spherical domain to the 2D image plane. One of the most common projection formats is the equirectangular projection (ERP) shown in Fig. 1(b). It is similar to unfolding the surface of the globe onto a world map [1], and is also the format in which most 360-degree videos are provided. With the obtained 2D image plane representation, existing video coding standards such as H.264/AVC [2], [3], H.265/HEVC [4], [5] and H.266/VVC [6], [7] can be used to compress the 360-degree video. However, while the application of these coding standards to 360-degree video is possible, the incorporation of knowledge regarding the special properties of 360-degree video can further improve the overall compression efficiency [8]. Different approaches for incorporating this knowledge have been followed, ranging from the design of improved projection formats [9], through the introduction of novel 360-degree compression tools [10], [11], [12], [13], [14], [15], to the implementation of adaptive streaming techniques [16], [17], [18], [19], [20].
One of the key technologies enabling the high compression efficiency of modern video coding standards is inter prediction. For perspective videos, inter prediction with a translational motion model proved to be effective. H.266/VVC has extended inter prediction by an additional affine motion modeling procedure yielding further rate savings [21], [22]. However, the inevitable distortions originating from the mapping of spherical 360-degree videos to the 2D image plane [1] impair the performance of these motion models for 360-degree video [7].
Consequently, the design of improved motion models for 360-degree video has attracted broad interest. Numerous models have been developed and have shown notable improvements in compression efficiency [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. In this paper, we investigate the strengths and limitations of these motion models and show that a single model cannot sufficiently cover the large range of possible motion scenarios.
In a realistic scenario, different regions of a video are dominated by different kinds of 3D object and camera motion. By selecting the motion model that is able to most closely replicate the specific kind of motion in each region, notable gains in compression efficiency can be expected. In [33], it has already been proposed to allow the block-wise selection of one of multiple motion models during inter prediction in 360-degree video compression. However, the study was limited to a custom and lightweight video codec implementation that cannot reflect the capabilities of modern video compression techniques. For state-of-the-art video coding standards, multi-model motion approaches have not been able to demonstrate gains. The main reason for this is the considerable side information overhead that stems from the additional motion model signaling costs and the costs associated with motion model switching between different blocks. In particular, the highly effective motion vector prediction [34], [35], [36] suffers from the varying motion characteristics among different motion models, yielding increased motion vector differences and thus larger signaling costs. Thereby, the term motion characteristic refers to how a motion vector is interpreted by a specific motion model. If the described challenges are not considered, the resulting signaling overhead quickly outweighs the gains achieved by improved motion modeling.
This paper aims to address these shortcomings by designing efficient prediction and coding schemes that significantly reduce the overhead signaling costs. To allow efficient motion vector prediction between blocks using different motion models, we propose a novel and broadly applicable multi-model motion vector prediction technique (MM-MVP). MM-MVP is realized by generalizing the motion modeling concept with a backward pass, which allows inferring the motion vector required to yield a desired pixel shift. This is used to map motion vectors between different motion models and thus improve the accuracy of motion vector prediction. Furthermore, to realize efficient motion model signaling, we propose a hierarchical coding scheme that is applicable to an arbitrary number of motion models. As prior work used only a single motion model during inter prediction, motion model signaling was not required before. By integrating our proposed multi-model coding framework into the H.266/VVC video coding standard, we demonstrate considerable Bjøntegaard Delta rate savings of up to 4.04%, especially in complex motion scenarios. To summarize, the primary contributions of this paper are 1) an investigation of the strengths and limitations of the individual motion models showing that a single model cannot sufficiently cover all motion scenarios, 2) a broadly applicable multi-model coding framework based on a novel motion prediction technique and a novel hierarchical motion model coding scheme, and 3) a performance evaluation demonstrating the substantial gains of our multi-model coding framework using the example of the H.266/VVC video coding standard. The remainder of this paper is organized as follows. Section II investigates the strengths and limitations of the existing 360-degree motion models. Section III describes our proposed tools for efficient multi-model 360-degree video compression. Section IV presents our experimental results and evaluates the performance of our approach. Finally, Section V concludes the paper and discusses future research directions.

II. MOTION MODELS FOR 360-DEGREE VIDEO
This section investigates the strengths and limitations of the individual motion models that have been developed for 360-degree video compression in recent years, motivating why a single model cannot sufficiently cover all motion scenarios. To provide context, the basic process of inter prediction is recapitulated and a general understanding of 360-degree projection functions is introduced first. The individual 360-degree motion models are investigated in Sections II-A to II-E.
Mathematically, the construction of a predicted frame I_pred ∈ R^{U×V} of width U and height V from a reference frame I_ref ∈ R^{U×V} can be represented as

I_pred(p) = I_ref(m(p, t_i)) for all p ∈ B_i

for all blocks i in the image. Thereby, B_i denotes the set of pixel coordinates p = (u, v)^T ∈ R^2 within block i, and t_i ∈ R^2 denotes the 2D motion vector for block i that is transmitted to the decoder as side information. The motion model m yields the resulting moved pixel position p_m = m(p, t) from the original pixel position p and a motion vector t. I(p) yields the pixel value of image I at pixel coordinate p, possibly incorporating interpolation. While all pixels in a block share the same motion vector, the shifts resulting after the application of the specific motion model can vary for each pixel.
The classical translational motion model does not consider the spherical characteristics of the 360-degree video and relies only on the pixel coordinate p and the motion vector t in the planar domain:

m_trans(p, t) = p + t.

Consequently, all pixels in a block share the same shift. This rigid shift of the block in the projection domain leads to a warping of the block in the spherical domain. Although block warping in the spherical domain is inevitable for the representation of most 3D object and camera motion, the specific warps caused by the translational motion model do not represent any real-world motion.
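To make the block-wise prediction concrete, the following minimal Python sketch builds I_pred(p) = I_ref(m(p, t)) for a single block using the translational model. It is an illustrative sketch only: nearest-neighbor sampling replaces the codec's interpolation filters, and all function and variable names are hypothetical.

```python
import numpy as np

def translational_model(p, t, p_c=None):
    """m_trans(p, t) = p + t; the block center p_c is unused by this model."""
    return p + t

def predict_block(I_ref, block_coords, t, model=translational_model, p_c=None):
    """Builds I_pred(p) = I_ref(m(p, t)) for all pixel coordinates p of one block.

    Nearest-neighbor sampling with clamping at the frame border stands in for
    the interpolation filters of an actual codec."""
    H, W = I_ref.shape
    pred = {}
    for p in block_coords:                                   # p = (u, v)
        p_m = model(np.asarray(p, dtype=float), np.asarray(t, dtype=float), p_c)
        u, v = np.rint(p_m).astype(int)
        u, v = np.clip(u, 0, W - 1), np.clip(v, 0, H - 1)
        pred[p] = I_ref[v, u]                                # image indexed as (row, column)
    return pred

# Example: predict a 2x2 block shifted by t = (3, -1) from a toy reference frame
I_ref = np.arange(64, dtype=float).reshape(8, 8)
block = [(2, 2), (3, 2), (2, 3), (3, 3)]
print(predict_block(I_ref, block, t=(3, -1)))
```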
To improve upon this, 360-degree motion models designed to represent specific kinds of 3D object and camera motion have been investigated. All these models require knowledge of the applied projection format to understand the spherical geometry of the video. As a prerequisite, we thus briefly introduce the concept of projection functions.
Any valid projection function ξ : S → R^2 is invertible and describes the relation between a 3D space coordinate s = (x, y, z)^T ∈ S on the unit sphere and the corresponding pixel coordinate p = (u, v)^T ∈ R^2 on the 2D image plane, where S = {s ∈ R^3 | ||s||_2 = 1} describes the set of all coordinates on the unit sphere. The inverse projection function ξ^{-1} : R^2 → S maps the 2D image plane coordinate back to the unit sphere in 3D space. Fig. 2 shows the relation between the spherical domain and the 2D image plane for the equirectangular projection (ERP).
In case of ERP, the mapping between a coordinate s on the unit sphere and the corresponding pixel coordinate p on the image plane is defined as

p = ξ_erp(s) = (u, v)^T with u = (ϕ / 2π) · U and v = (θ / π) · V,

where the spherical coordinates ϕ and θ denote the longitude and latitude of s, respectively. The pixel coordinate p on the image plane is directly proportional to the longitude (u-direction) and latitude (v-direction) of the coordinate s on the unit sphere. It is scaled according to the width U and height V of the ERP image plane. Usually, U = 2V, as the range of longitudes ϕ (0 ≤ ϕ ≤ 2π) is double the range of latitudes θ (0 ≤ θ ≤ π). The inverse ERP function ξ_erp^{-1} recovers ϕ = 2πu/U and θ = πv/V from the pixel coordinate p and converts the resulting longitude and latitude back to the corresponding 3D space coordinate s on the unit sphere. Similarly, the forward and inverse projection functions for other 360-degree projection formats, such as cubemap projections, segmented and rotated sphere projections, and octa- and icosahedron projections [9], can be formulated. Please note that the motion models described in the following, as well as the proposed prediction and coding tools, work independently of the specific projection format of the video. We thus use ξ_o as a placeholder for the specific projection function, where the suffix ''o'' is shorthand for omnidirectional.
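As a concrete illustration, a possible implementation of ξ_erp and its inverse is sketched below. The Cartesian axis convention (z pointing towards the north pole, longitude measured in the x-y plane) is an assumption made for this sketch and may differ from the convention used in 360Lib or the cited works.

```python
import numpy as np

def erp_to_sphere(p, U, V):
    """xi_erp^{-1}: pixel coordinate p = (u, v) -> unit-sphere point s = (x, y, z)."""
    u, v = p
    phi = 2.0 * np.pi * u / U            # longitude, 0 <= phi < 2*pi
    theta = np.pi * v / V                # latitude measured as polar angle, 0 <= theta <= pi
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def sphere_to_erp(s, U, V):
    """xi_erp: unit-sphere point s = (x, y, z) -> pixel coordinate p = (u, v)."""
    x, y, z = s
    phi = np.arctan2(y, x) % (2.0 * np.pi)
    theta = np.arccos(np.clip(z, -1.0, 1.0))
    return np.array([U * phi / (2.0 * np.pi), V * theta / np.pi])

# Round trip for an ERP frame of size U x V = 2048 x 1024
p = np.array([512.0, 256.0])
print(sphere_to_erp(erp_to_sphere(p, 2048, 1024), 2048, 1024))   # ~ [512. 256.]
```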

A. 3D TRANSLATIONAL MOTION MODEL
The 3D translational motion model (3DT) proposed by Li et al. [23], [24] performs block motion in 3D space by deriving a 3D motion vector from the original 2D motion vector. In addition to the pixel coordinate p and the motion vector t, the model depends on the center p_c of the regarded block for the derivation of the 3D motion vector t_3D = (t_x, t_y, t_z)^T ∈ R^3. Following the categorization from [33], the regarded block is represented using the radial object shape model in 3D space, that is, it lies on the surface of the unit sphere. Perfect reconstruction can be expected when the true 3D object motion matches the motion scenario visualized in Fig. 3. A portion of the surface of the unit sphere, corresponding to the area of the regarded block, is shifted in 3D space according to t_3D, which is constrained to retain the original depth d of the block center. Because the motion vector t is interpreted as a shift of the block center in the nonlinear projection domain when deriving the 3D motion vector t_3D, the magnitude of the resulting 3D motion differs for different blocks, such that the motion characteristics of this motion model differ between different blocks.

B. TANGENTIAL MOTION MODEL
The tangential motion model (TAN) proposed by De Simone et al. [27] performs block motion on a plane tangential to the regarded block's center in the spherical domain representation using a gnomonic projection [1]. The block is projected onto the tangent plane using the gnomonic projection ξ_gnm,p_c with tangent point p_c, the 2D shift t is applied on this plane, and the result is mapped back to the image plane. For details on the gnomonic projection, please consult the original work by De Simone et al. [27]. Following the categorization from [33], the block is represented using the planar object shape model in 3D space, that is, it is projected onto a plane in 3D space. Perfect reconstruction can be expected in scenarios where the underlying 3D object motion is aligned with the plane tangential to the center of the regarded block, as visualized in Fig. 4. As the plane on which motion is performed is tangential to the center of each regarded block, the motion characteristics of this motion model differ between different blocks.

C. ROTATIONAL MOTION MODEL
The rotational motion model (ROT) proposed by Vishwanath et al. [28], [29] performs block motion in the spherical domain by rotating all pixels in a regarded block along the sphere surface, according to rotation angles derived from the original 2D motion vector. It is defined as

m_rot(p, t) = ξ_o(R(p_c, t) · ξ_o^{-1}(p)),

where the rotation matrix R(p_c, t) describes the rotation of the regarded block based on its center p_c and the 2D motion vector t. It is composed of an equatorial alignment rotation and a motion vector rotation matrix. With ϕ_c and θ_c denoting the longitude and latitude of s_c = ξ_o^{-1}(p_c), the equatorial alignment rotation ensures equal rotation magnitude irrespective of the regarded block's position on the sphere [29]. The motion vector rotation matrix describes the actual block rotation derived from the 2D motion vector t. This can be interpreted as changing the block's pitch and yaw with the central viewing direction p_c, where roll is explicitly not modeled to remain compatible with the existing 2-parameter motion vector signaling syntax.
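To make the equatorial alignment idea tangible, the following sketch rotates a single unit-sphere point accordingly. The axis convention (z as the north pole), the signs of the yaw and pitch angles, and the composition order of the two rotations are assumptions chosen for illustration, not details taken from [28], [29]; the angular stepsize delta corresponds to the stepsize discussed below.

```python
import numpy as np

def rot_z(a):   # rotation around the z-axis (yaw)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):   # rotation around the y-axis (pitch)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rotational_motion(s, s_c, t, delta):
    """Moves the unit-sphere point s of a block with center s_c by rotating it
    with yaw/pitch angles t * delta after aligning the block center with the equator."""
    phi_c = np.arctan2(s_c[1], s_c[0])                 # longitude of the block center
    theta_c = np.arccos(np.clip(s_c[2], -1.0, 1.0))    # polar angle of the block center
    # equatorial alignment: rotate the block center to longitude 0 on the equator
    R_align = rot_y(np.pi / 2.0 - theta_c) @ rot_z(-phi_c)
    # motion vector rotation: yaw from t_u and pitch from t_v, roll is not modeled
    R_mv = rot_z(t[0] * delta) @ rot_y(t[1] * delta)
    # apply the rotation in the aligned frame and transform back
    return R_align.T @ R_mv @ R_align @ s

# Example: block centered close to the pole, one-sample shift in u-direction, delta = pi / V
s_c = np.array([0.1, 0.0, 0.995]); s_c /= np.linalg.norm(s_c)
print(rotational_motion(s_c, s_c, t=np.array([1.0, 0.0]), delta=np.pi / 1024))
```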
The angular stepsize describes the angular shift per unit of motion vector. For ERP, it is typically chosen as π/V. Perfect reconstruction can be expected in scenarios where the underlying 3D object motion is a rotation around the origin of the 360-degree camera's viewpoint, as visualized in Fig. 5. While the equatorial alignment rotation ensures equal rotation magnitude for different blocks, it also leads to varying yaw and pitch directions for different blocks, such that the motion characteristics vary between different blocks.

FIGURE 6. Visualization of the geodesic motion model. The block is represented using the radial object shape model. The camera motion vector (epipole) q defines the orientation of the geodesics. A subset of the resulting geodesics is shown in red. Motion is modeled by shifting all block pixels along their individual geodesics. Additional motion perpendicular to the geodesics is allowed to account for potentially unrelated object motion.

D. GEODESIC MOTION MODEL
The geodesic motion model (GED) proposed by Vishwanath et al. [30], [31] performs block motion by moving all pixels in a regarded block along their individual geodesics that are oriented along a known global camera motion vector. In general, geodesics refer to the shortest paths between two points on a surface, and in the context of 360-degree camera motion, represent the paths on the sphere along which points in the scene appear to move as the camera moves. The red circles in Fig. 6 show a subset of the resulting geodesics for translational camera motion along an exemplary camera motion vector q shown in blue. The geodesic motion model is defined as

m_ged(p, t) = ξ_o(R(q)^{-1} · T_sph^{-1}(θ + Δθ, ϕ + Δϕ)) with (θ, ϕ) = T_sph(R(q) · ξ_o^{-1}(p)),

where the epipole q = (x, y, z)^T ∈ R^3 denotes the camera motion vector in 3D space, R(q) and T_sph are introduced below, and the angular shifts Δθ and Δϕ are calculated based on the given motion vector t. During coding, the epipole q is specified as prior information and is transmitted to the decoder separately.
The so-called epipole alignment rotation matrix R(q) performs a rotation of the coordinate system to realize the desired geodesic motion. It aligns the default north facing epipole q_north = (0, 0, 1)^T with the desired epipole q and is calculated using Rodrigues' rotation formula [37] as

R(q) = I + [v]_× + [v]_×^2 · 1/(1 + c).

Thereby, I denotes the identity matrix, [v]_× denotes the skew-symmetric cross-product matrix of v = q_north × q, and c = ⟨q_north, q⟩ denotes the inner product of q_north and q.
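A direct implementation of this alignment rotation is straightforward; the sketch below assumes q is a unit vector and ignores the degenerate case of q being antiparallel to q_north (c = -1), which would require special handling.

```python
import numpy as np

def epipole_alignment_rotation(q):
    """R(q) = I + [v]_x + [v]_x^2 / (1 + c) with v = q_north x q and c = <q_north, q>."""
    q_north = np.array([0.0, 0.0, 1.0])
    v = np.cross(q_north, q)
    c = np.dot(q_north, q)
    v_x = np.array([[0.0, -v[2], v[1]],
                    [v[2], 0.0, -v[0]],
                    [-v[1], v[0], 0.0]])          # skew-symmetric cross-product matrix
    return np.eye(3) + v_x + v_x @ v_x / (1.0 + c)

# Sanity check: R(q) maps the default north-facing epipole onto q
q = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
print(epipole_alignment_rotation(q) @ np.array([0.0, 0.0, 1.0]))   # ~ q
```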
The longitudes of the epipole-oriented coordinate system represent the desired geodesics. T_sph : S → R^2 transforms the epipole-oriented pixel coordinate s = R(q) · ξ_o^{-1}(p) into its spherical domain representation (θ, ϕ). As s ∈ S, the spherical radius r = 1 is omitted. The motion vector t is interpreted as shifts Δθ and Δϕ along the longitudes and latitudes of the epipole-oriented coordinate system, where θ is the epipole-oriented polar angle of the current pixel coordinate and k is a scaling parameter of the regarded block. The calculation of k must be performed only once per block and is based on θ_c, the epipole-oriented polar angle of the center p_c of the regarded block, which can be obtained as (θ_c, ϕ_c) = T_sph(R(q) · ξ_o^{-1}(p_c)). Similar to the rotational motion model, the angular stepsize describes an angular shift per unit of motion vector and is typically chosen as π/V in case of ERP.
The nonlinear modeling of the geodesic shift Δθ along the longitudes of the epipole-oriented coordinate system is necessary as pixels closer to the epipolar equator move relatively faster compared to pixels further away from the equator. For details on the derivation of Δθ, please refer to the original work by Vishwanath et al. [31].
Perfect reconstruction can be expected when the true 3D camera motion matches the assumed camera motion vector q, such that the individual block pixels shift along their respective geodesics, as visualized in Fig. 6. As the magnitude of the geodesic shift Δθ depends on the block center p_c through the parameter k, the motion characteristics vary between different blocks.

E. MOTION PLANE ADAPTIVE MOTION MODEL
In previous work [32], we proposed a motion plane adaptive motion model (MPA) that performs block motion on differently oriented planes in 3D space, where the rotation matrix R_mp describes the orientation of the desired motion plane and the perspective projection ξ_p is used to map the pixels from the unit sphere to the motion plane. Unlike the tangential motion model, where the regarded plane is always tangential to the center of the current block, the motion plane adaptive motion model includes the selection of the best-matching motion plane in the process of rate distortion optimization [38]. We proposed three rotation matrices yielding the motion planes front/back (parallel to the y-z-plane), left/right (parallel to the x-z-plane), and top/bottom (parallel to the x-y-plane, R_mp = R_y(π/2)) that can be implemented efficiently through fast transpositions of the 3D space coordinates. The corresponding motion plane index i_mpa is transmitted alongside the motion vector t in the motion information. Perfect reconstruction is possible in scenarios where the underlying 3D object motion is aligned to the selected motion plane, as visualized in Fig. 7. While motion characteristics vary between blocks using different motion planes, motion characteristics remain constant between blocks using the same motion plane.
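For illustration, the sketch below moves a unit-sphere point on the front/back motion plane. It shows only one plausible realization: the plane is placed at unit distance, the scaling from motion vector units to plane units is a free parameter, and points that cannot be projected onto the plane are simply rejected; none of these choices are prescribed by [32]. Conversion between the image plane and the unit sphere is handled by ξ_o and ξ_o^{-1} as before.

```python
import numpy as np

def mpa_plane_motion(s, t, R_mp=np.eye(3), scale=1.0e-3):
    """Moves the unit-sphere point s on the motion plane oriented by R_mp.

    The plane is assumed at unit distance (z = 1 in the rotated frame), and
    scale converts motion vector units into plane units (illustrative value)."""
    s_mp = R_mp @ s                                   # rotate into the motion plane frame
    if s_mp[2] <= 0.0:
        raise ValueError("point cannot be projected onto this motion plane")
    x, y = s_mp[0] / s_mp[2], s_mp[1] / s_mp[2]       # perspective projection onto z = 1
    x, y = x + t[0] * scale, y + t[1] * scale         # horizontal/vertical shift on the plane
    s_new = np.array([x, y, 1.0])
    s_new /= np.linalg.norm(s_new)                    # back-projection onto the unit sphere
    return R_mp.T @ s_new                             # undo the plane orientation

# Example: small shift on the front/back plane (R_mp = identity)
s = np.array([0.1, 0.2, 0.9747])
print(mpa_plane_motion(s / np.linalg.norm(s), t=np.array([8.0, -4.0])))
```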
A summary of the described motion models and a list of the side information that must be signaled for each model are presented in Table 1. As each model is tailored to represent a specific kind of 3D object or camera motion, its motion modeling capabilities are limited in deviating motion scenarios. To resolve this shortcoming, multiple motion models can be combined to represent a broader set of motion scenarios.
III. EFFICIENT MULTI-MODEL 360-DEGREE VIDEO COMPRESSION
This section presents our contributions to realize efficient multi-model 360-degree video coding. The main challenges are the varying motion characteristics between blocks using different motion models, which impair motion prediction performance, and the additional side information stemming from the necessity to encode the selected motion model for each block. To address the problem of impaired motion prediction performance, we propose a novel multi-model motion vector prediction (MM-MVP) technique that restores the statistical correlation of motion vectors between neighboring blocks despite varying motion characteristics. For the additional motion model side information, we propose an efficient hierarchical coding scheme capable of coding an arbitrary number of motion models. Finally, we introduce a framework for integrating the described concepts into hybrid video codecs using the example of the H.266/VVC video coding standard.

FIGURE 8. Spatial motion vector predictor candidates in the H.266/VVC video coding standard [6] with a schematic illustration of MM-MVP for candidate position B1 at pixel coordinate p_s.

A. MULTI-MODEL MOTION VECTOR PREDICTION
In the context of video coding, motion information from spatially or temporally neighboring blocks is commonly reused in subsequently coded blocks to improve the overall compression efficiency. As an example, video codecs often predict the motion vector for the current block from already available motion information of previously coded blocks by forming motion vector predictors (MVPs) at distinct candidate positions in spatially or temporally neighboring blocks [2], [3], [4], [5], [6], [7]. The selected MVP is further refined at the encoder, and only the index of the selected MVP as well as the motion vector difference (MVD) between the final estimated motion vector and the MVP needs to be signaled. This significantly improves the compression efficiency of modern video codecs [34], [35], [36].
Fig. 8 shows the spatial MVP candidates defined for the H.266/VVC standard [6]. The general concept of MVP is similar for most hybrid video codecs, including H.265/HEVC [4], VP9 [39], and AV1 [40]. The following derivations are hence broadly applicable.
Naturally, the motion characteristics of different motion models differ fundamentally, which is a desired effect as they model different 3D motion scenarios. In addition, as described in Sections II-A to II-E, many motion models exhibit varying motion characteristics even between different blocks of the same model. This is caused by interpreting the given motion vector t differently for different positions of the regarded block in the overall image. Differences in motion characteristics between neighboring blocks mean that a given motion vector t yields fundamentally different warps for one block than for a neighboring block. The resulting discontinuities impair the traditional motion vector prediction procedure, which uses the motion vector at a neighboring block candidate position as a direct predictor for the motion vector of the currently regarded block. The proposed MM-MVP solves this issue by translating motion information between blocks with different motion characteristics to restore the spatial and temporal correlation of motion information between neighboring blocks.

TABLE 2. Forward and backward passes for motion modeling and MM-MVP of all introduced 360-degree motion models. For details on the definition of the motion models or the derivation of the backward pass, please see Sections II and III-A, respectively. *Geodesic motion modeling is abbreviated to the forward and backward passes up to the angular shifts Δθ and Δϕ for readability. For the further relation to the motion vector t, please see the text.
In the first step, the motion information at a given neighboring candidate position p_s is obtained, which serves as an anchor for the motion vector translation in MM-MVP. The obtained motion information includes the candidate block's motion model m_s, motion vector t_s, and block center p_c,s, as visualized in Fig. 8. The motion model m_s is then used to apply the given motion vector t_s to the anchor position p_s, yielding the shifted position

p_m = m_s(p_s, t_s).

In the second step, the goal is to find an equivalent motion vector t_t that shifts the anchor position p_s to the shifted position p_m using the current block's motion model m_t and block center p_c,t:

m_t(p_s, t_t) = p_m.

By solving this equation for t_t for each of the described motion models, the final MVP for the current block is obtained.
Ultimately, solving a motion model for t_t can be viewed as a backward pass through the motion model, whereas the conventional application of the motion model can be viewed as a forward pass. We derived the backward passes for all described motion models and summarized them in Table 2.
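The following sketch illustrates the generic two-step procedure with two toy 2D motion models given as forward and backward passes. The "scaled" model is purely hypothetical and only mimics position-dependent motion characteristics; it does not correspond to any of the 360-degree models above.

```python
import numpy as np

# Each model provides a forward pass m(p, t, p_c) and a backward pass solving
# m(p, t, p_c) = p_m for t. The "scaled" model is hypothetical and only serves
# to mimic motion characteristics that depend on the block center p_c.
def trans_fwd(p, t, p_c):    return p + t
def trans_bwd(p, p_m, p_c):  return p_m - p
def scaled_fwd(p, t, p_c):   return p + t * (1.0 + 0.01 * p_c[1])
def scaled_bwd(p, p_m, p_c): return (p_m - p) / (1.0 + 0.01 * p_c[1])

MODELS = {"trans": (trans_fwd, trans_bwd), "scaled": (scaled_fwd, scaled_bwd)}

def mm_mvp(p_s, cand_model, t_s, p_c_s, cur_model, p_c_t):
    """Translates the candidate's motion vector t_s into the current block's model.

    Step 1 (forward pass): shift the anchor p_s with the candidate's model m_s.
    Step 2 (backward pass): solve the current model m_t for the motion vector t_t
    that reproduces the same shifted position."""
    fwd, _ = MODELS[cand_model]
    _, bwd = MODELS[cur_model]
    p_m = fwd(p_s, t_s, p_c_s)          # p_m = m_s(p_s, t_s)
    return bwd(p_s, p_m, p_c_t)         # t_t such that m_t(p_s, t_t) = p_m

# Example: adopt a neighbor's "scaled" motion vector in a block using the "trans" model
p_s = np.array([100.0, 40.0])
t_mvp = mm_mvp(p_s, "scaled", np.array([4.0, -2.0]), np.array([96.0, 40.0]),
               "trans", np.array([108.0, 40.0]))
print(t_mvp)   # the predictor expressed in the current block's motion characteristics
```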
When solving for t_t, the rotational motion model is a special case because the mapping between a rotation matrix R and the corresponding motion vector t is a nonlinear operation. Solving for t_t exactly would hence require a nonlinear optimization. However, in the context of video coding, this additional complexity is impractical. We opt for an alternative solution where t_t is estimated from the spherical coordinate shifts of the anchor position in the equatorially aligned coordinate system. This avoids complex nonlinear optimizations during the encoding and decoding procedure while delivering sufficiently accurate estimates.
For the geodesic motion model, the angular shifts Δθ and Δϕ obtained from the calculations listed in Table 2 must be further processed to arrive at the final motion vector. This processing involves θ and θ_c, the epipole-oriented polar angles of the anchor position p_s and the regarded block's center p_c, respectively, which can be calculated as described in Section II-D. Without the described MM-MVP procedure, the conventional motion vector prediction scheme suffers from the incompatible meaning of motion vectors among different motion models. By enabling MM-MVP, the more meaningful adoption of motion information from spatially and temporally neighboring blocks yields significantly improved compression efficiency.

B. MOTION MODEL CODING
In the targeted multi-model coding scenario, each block can apply a different motion model, which is selected during rate distortion optimization at the encoder. Hence, the motion model must be transmitted as side information from the encoder to the decoder. We propose a hierarchical coding syntax to ensure efficient signaling.
The general coding syntax for the decoder is outlined in Fig. 9. The encoding process works analogously. The syntax is based on a list of motion model candidates (motionModelCandidates) that contains all motion models used for compressing a given video. The order of the motion models is common to both encoder and decoder and is explained later. The set of activated motion models is signaled in the sequence parameter set (SPS). The decoder first initializes the candidate index i = 0 to describe the currently regarded motion model in the candidate list before stepping into an iterative procedure.
(a) At the beginning of each iteration, it is checked whether the regarded candidate model refers to the last model in the candidate list. If this is true, the motion model decoding process is completed by assigning the motion model at index i in the candidate list to the current block. This also covers the special case of only one motion model being enabled, where motion model signaling is effectively skipped.
(b) If the regarded candidate is not the last in the list, the decoder checks whether it is coded hierarchically using the CABAC entropy coder [41] or using an equal probability model. This decision is based on a pre-configurable integer variable codingDepth that specifies how many motion models are coded hierarchically before the applicable model from the set of remaining motion model candidates is coded using the equal probability model. This allows evaluating the efficiency of different coding depth scenarios.
(c) If the currently regarded candidate index has reached the coding depth, equal probability coding is used: the encoder signals an index into the set of remaining motion model candidates using an entropy model in which all remaining motion models share equal probability.
(d) Otherwise, if the currently regarded candidate index is below the coding depth, a flag is coded that defines whether the motion model candidate at index i is the selected motion model for the current block or not. As described before, this flag is coded using the CABAC entropy coder, where a dedicated context model is assigned to each motion model to track the motion model statistics for effective entropy coding. If the flag is 0, the next motion model is checked for coding by incrementing the candidate index i and rejoining the iterative procedure at step (a).
This coding syntax is applicable to an arbitrary number of active motion models. Furthermore, with the configurable hierarchical coding depth, the syntax can easily be adapted to coding all motion models using the equal probability model (codingDepth = 0) or coding all motion models hierarchically using CABAC (codingDepth → ∞). The efficacy of different coding depth scenarios is evaluated in Section IV.
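A minimal decoder-side sketch of this syntax is given below. The callbacks read_cabac_flag and read_ep_index stand in for the entropy decoder's context-coded and equal-probability bin readers; they are hypothetical interfaces and not actual VTM API calls.

```python
def decode_motion_model(candidates, coding_depth, read_cabac_flag, read_ep_index):
    """Decodes the selected motion model for one block following the syntax of Fig. 9."""
    i = 0
    while True:
        # (a) last candidate in the list: nothing left to signal, assign it directly
        if i == len(candidates) - 1:
            return candidates[i]
        # (c) coding depth reached: equal-probability index into the remaining candidates
        if i >= coding_depth:
            return candidates[i + read_ep_index(len(candidates) - i)]
        # (b)/(d) hierarchical coding: one context-coded flag per candidate
        if read_cabac_flag(i):
            return candidates[i]
        i += 1

# Example with a fixed bit source in place of a real CABAC engine
cands = ["affine", "translational", "MPA f/b", "MPA l/r", "MPA t/b", "GED", "ROT"]
bits = iter([0, 0, 1])   # reject affine and translational, select MPA front/back
print(decode_motion_model(cands, coding_depth=6,
                          read_cabac_flag=lambda i: next(bits),
                          read_ep_index=lambda n: 0))
```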

C. CODEC INTEGRATION
To evaluate the proposed motion prediction and coding schemes in an actual video codec, we integrated them into the state-of-the-art H.266/VVC video coding standard [6], [7]. Fig. 10 outlines the updated decoder-side inter prediction pipeline. The encoder operates in an analogous manner.
The signaling step is extended by our proposed motion model prediction and coding scheme, which is only applied if the current coding unit (CU) does not specify the usage of merge mode. In case of bi-prediction, both directions share the same motion model, such that it needs to be signaled only once per CU. If a merge mode is applied, the signaling step remains unchanged compared to H.266/VVC.
In the motion derivation step, the incomplete motion information (MI) obtained in the signaling step is completed, which is necessary for one of two reasons. First, if the current CU is coded in merge mode, the MI, including the motion model, is taken over from the selected merge candidate. Second, if the current CU is coded explicitly, only a motion vector difference (MVD) between the actual motion vector and the selected MVP candidate is signaled. In this case, the proposed MM-MVP is applied to derive the MVP for the current CU before summing it with the signaled MVD to obtain the final motion vector for the current CU.
In the motion compensation step, the prediction signal for the current CU is generated based on the completed MI and the known reference frame. Thereby, the inherent motion modeling step is extended by the described 360-degree motion models. After motion modeling, the reference frame sampling step is responsible for extracting the motion modeled pixel positions from the reference frame, possibly performing interpolation using the existing interpolation filters defined in the H.266/VVC video coding standard [6].
The corresponding encoder performs an isolated motion estimation for each motion model and finally decides for the motion vector and motion model combination that yields the minimum rate distortion cost. This rate distortion cost is then further used to decide between explicit motion modeling or one of the other tools available in H.266/VVC. Please note that the described 360-degree motion modeling procedure is applied to all stages of motion modeling, including usage in tools such as geometric partitioning mode (GPM) or decoder-side motion vector refinement (DMVR).

FIGURE 10. Schematic representation of the decoder-side inter prediction pipeline in the H.266/VVC video coding standard including our extensions and additions for the support of the proposed multi-model coding concept. Components that need to be extended or added with respect to the original H.266/VVC inter prediction pipeline are labeled explicitly. For details on the performed extensions and additions, please refer to the text.

IV. PERFORMANCE EVALUATION

A. EXPERIMENTAL SETUP
The performance of the proposed multi-model coding tools is evaluated based on the introduced integration into the state-of-the-art H.266/VVC video coding standard. Our implementation is based on the VVC reference software version 17.2 [42], [43], which also serves as the baseline for our evaluations. In the following, the extended software including the proposed tools as described in Section III is termed MM-VTM, and the baseline is termed VTM-17.2. The source code of our multi-model VVC implementation is publicly available at https://github.com/fau-lms/vvc-extension-mm.
To evaluate the influence of the individual coding tools we proposed, they can be configured independently. As such, MM-MVP can be switched on and off, and the hierarchical coding depth can be set to any integer i ≥ 0. Furthermore, all 360-degree motion models described in Section II can be switched on and off independently. The standard translational and affine motion models of VVC are always activated. As an example, with two hypothetical motion models A and B being activated, the encoder can decide between motion model A, B, translational motion modeling, and affine motion modeling for each CU.
The employed 360-degree test sequences [44] are in ERP format. In accordance with the common test conditions for 360-degree video (360-CTC) [44], all tests are performed on 32 frames of each sequence using the four quantization parameters QP ∈ {22, 27, 32, 37} with the random access (RA) configuration [46]. Rate savings are calculated according to the Bjøntegaard Delta (BD) model with piecewise cubic interpolation [47], [48]. Complexity is reported as the encoding and decoding runtimes relative to the baseline, where all tests are performed on one core of an AMD EPYC™ 7543 with a base frequency of 2.80 GHz [49].
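For reference, the sketch below shows a common way to compute such BD rate values with piecewise cubic (PCHIP) interpolation. It is a generic realization under the usual assumptions (log-rate interpolated over the overlapping quality interval) and is not guaranteed to be numerically identical to the evaluation scripts used for the reported numbers.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjøntegaard Delta rate in percent; negative values indicate rate savings."""
    def fit(rate, psnr):
        order = np.argsort(psnr)                       # PCHIP requires increasing abscissae
        return PchipInterpolator(np.asarray(psnr, float)[order],
                                 np.log10(np.asarray(rate, float))[order])
    f_ref, f_test = fit(rate_ref, psnr_ref), fit(rate_test, psnr_test)
    lo = max(min(psnr_ref), min(psnr_test))            # overlapping quality interval
    hi = min(max(psnr_ref), max(psnr_test))
    q = np.linspace(lo, hi, 1000)
    avg_log_diff = np.trapz(f_test(q) - f_ref(q), q) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0

# Example with made-up rate (kbit/s) and PSNR (dB) points for the four QPs
print(bd_rate(rate_ref=[6000, 3000, 1500, 800], psnr_ref=[41.5, 39.0, 36.5, 34.0],
              rate_test=[5800, 2900, 1450, 780], psnr_test=[41.6, 39.1, 36.6, 34.1]))
```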
The 360-CTC specify encoding the sequences at a lower resolution (potentially performing a projection format conversion while down-sampling) and upsampling the resulting sequences to the original resolution (and projection format) after decoding. As all test sequences are given in ERP format, this approach allows a less biased comparison of different projection formats. Quality metrics can then be calculated either at the original resolution (end-to-end) or at the coding resolution (codec). Although we limit our investigations to the ERP format, we report the results for both configurations (end-to-end and codec), where the coding resolution is set to 2048 × 1024 pixels. Table 3 gives an overview of the quality metrics for 360-degree video that we evaluate using the 360Lib software version 13.1 [9], [50].

B. MM-MVP
In the first evaluation, we investigate the efficacy of the proposed MM-MVP by applying it to a diverse set of 360-degree motion models. Table 4 shows the rate savings that are achieved for each motion model described in Section II with MM-MVP enabled (ON) and disabled (OFF). If MM-MVP is disabled, the conventional motion vector prediction scheme is applied. Thereby, WS-PSNR serves as the quality metric, and the reported rate savings are evaluated with respect to the baseline VTM-17.2. It is clearly visible that MM-MVP leads to considerable additional rate savings for each motion model. Without the application of MM-MVP, average rate savings of 0.19% for 3DT, 0.28% for TAN, 0.56% for ROT, 1.20% for GED, and 1.53% for MPA are achieved. By enabling MM-MVP, the average rate savings increase by 0.22% for 3DT, 0.20% for TAN, 0.24% for ROT, 0.75% for GED, and 0.87% for MPA. MM-MVP proves especially beneficial for sequences with more complex motion characteristics such as BranCastle2, where rate savings roughly double for MM-MVP=ON compared to MM-MVP=OFF. In addition, MM-MVP eliminates some of the rate losses that occur for lower-performing motion models.
Because large parts of the bitstream consist of motion information in the case of inter predicted frames, the efficient coding of motion information is crucial. As the MVD coding scheme in H.266/VVC is a combination of CABAC coded flags (signaling whether the MVD is larger than 0 or 1) and Rice coded difference values, the lower the MVD, the lower the signaling cost. With MM-MVP, more accurate MVPs are produced that lead to significantly reduced MVDs and ultimately yield robust overall rate savings.

C. MULTI-MODEL
Using the proposed motion prediction and coding concepts, we evaluate the compression efficiency in a multi-model coding scenario. As the extended set of available motion models allows the coder to cover a broader set of motion characteristics, the multi-model scheme is expected to outperform the compression efficiency achieved by all single-model approaches shown in Table 4.
In our experiments, we activate ROT, GED, and MPA, as they showed the highest rate savings and as 3DT and TAN represent motion characteristics similar to those of ROT. For each CU, the encoder can thus select between ROT, GED, MPA, translational, and affine motion modeling. The hierarchical coding depth for motion model coding is set to 6 such that all motion models and the motion planes of MPA are signaled hierarchically using CABAC. The order of the motion models in the candidate list is defined as affine, translational, MPA front/back, MPA left/right, MPA top/bottom, GED, and ROT. Thereby, affine and translational are coded first to remain close to the signaling order in H.266/VVC, while MPA, GED, and ROT are sorted according to their achieved average rate savings.
Table 5 displays the achieved rate savings of the described multi-model configuration ROT+GED+MPA. For each quality metric, the left column shows the results if MM-MVP is disabled (OFF), while the right column shows the results if MM-MVP is enabled (ON).
With an increased number of possible motion models, the positive effect of MM-MVP becomes even more evident. Based on WS-PSNR, MM-MVP improves the average rate savings from 1.49% to 2.76%, an increase of 1.27 percentage points. Without MM-MVP, the multi-model approach drops below the performance of the single-model approach from Table 4 on several occasions (SkateboardInLot, Balboa, Broadway, Landing2). With MM-MVP, however, the multi-model approach consistently outperforms the single-model approach.
Overall, the multi-model approach with MM-MVP=ON achieves average rate savings of 3.20% based on PSNR and 2.76% based on WS-PSNR and S-PSNR-NN. With respect to end-to-end quality metrics, average rate savings of 3.67% based on E2E-WS-PSNR and 3.62% based on E2E-S-PSNR-NN are obtained. Viewport-based average rate savings amount to 2.87% for PSNR-DYN-VP0 and 3.54% for PSNR-DYN-VP1. The multi-model approach proves most effective for sequences with complex motion characteristics such as Landing2, where rate savings of 4.49% based on PSNR, 4.04% based on WS-PSNR, and 4.03% based on S-PSNR-NN are obtained. With respect to end-to-end quality metrics, rate savings of more than 5% are achieved.
These results demonstrate the broad applicability of the proposed multi-model coding framework. It succeeds in combining the strengths of multiple 360-degree motion models and achieves considerable rate savings, particularly for sequences with complex motion. The application of MM-MVP proves crucial for achieving consistent rate savings over the single-model approach.

D. CODING DEPTH
To validate our hierarchical motion model coding scheme, we evaluate the multi-model approach with ROT, GED, and MPA using the hierarchical coding depths 2, 4, and 6. Thereby, a coding depth of 2 means that only the two flags signaling whether the existing affine or translational motion models are used are CABAC coded. If both flags are false, the selected 360-degree motion model is equal probability coded. A coding depth of 4 means that two of the activated 360-degree motion models are additionally CABAC coded hierarchically before the remaining ones are equal probability coded. A coding depth of 6 then means that all motion models are CABAC coded hierarchically.
Table 6 shows the rate savings obtained by ROT+GED+MPA for the different coding depths. Bold entries mark the best performing coding depth for each sequence. The entirely hierarchical motion model coding scheme is clearly favorable. Any reduction in coding depth results in a notable decrease in rate savings.

E. VISUAL COMPARISON
Fig. 11 presents crops of the inter predicted frames for VTM-17.2 and the proposed MM-VTM with ROT+GED+MPA, coding depth 6, and MM-MVP=ON. For reference, the corresponding crops from the original frames are shown as well. The upper row shows the predictions for the high-bitrate QPs 22 and 27. In Fig. 11(a)-(c), MM-VTM retains more details on the street surface, such as the dark spot in the lower center, than VTM-17.2. In Fig. 11(d)-(f), the road marking shows severe distortions and the blue object at the bottom right is missing for VTM-17.2, while MM-VTM retains the correct structures. The lower row shows the predictions for the low-bitrate QPs 32 and 37. In Fig. 11(g)-(i), MM-VTM shows a higher level of detail by retaining the brick textures on the midlane in considerably higher quality than VTM-17.2. In Fig. 11(j)-(l), MM-VTM is able to better replicate the high-frequency tree structures than VTM-17.2. These observations explain the origin of the obtained rate savings: MM-VTM is able to retain a higher level of detail without having to spend more bits on residual coding, but instead by adaptively selecting well-suited motion models for the respective areas in the video frames.

F. COMPLEXITY
The complexity analysis is split into two parts. First, the complexity of the proposed MM-MVP is investigated, before the overall complexity of MM-VTM is analyzed.
To evaluate the complexity of the proposed MM-MVP, the time elapsed during encoding and decoding is measured for each motion model. The complexity is then calculated as the runtime of the MM-VTM configuration with MM-MVP=ON as a percentage of the configuration with MM-MVP=OFF. Thus, a complexity of 100% means that both configurations require the same processing time, a complexity of 200% means that MM-MVP=ON requires twice the processing time of MM-MVP=OFF, and a complexity of 50% means that MM-MVP=ON requires half the processing time of MM-MVP=OFF.
Table 7 shows the encoding and decoding complexities of MM-MVP=ON compared to MM-MVP=OFF for each motion model and the multi-model approach. It can be observed that the application of MM-MVP only mildly affects the overall processing time. On the decoder side, MM-MVP shows an average complexity of 101% compared to MM-VTM without the improved motion vector prediction scheme. On the encoder side, the effects are slightly higher, especially for MPA and the multi-model approach, which can be explained by the higher number of motion models (or motion planes) that need to be checked. This increases the cumulative frequency at which MM-MVP is executed.
Table 8 shows the overall complexity of MM-VTM. In contrast to the complexity analysis of MM-MVP, VTM-17.2 now serves as the baseline instead of MM-MVP=OFF. For all motion models and the multi-model approach, MM-MVP is enabled and the coding depth is 6. The complexity of MM-VTM is highly asymmetric between the encoder and the decoder. While the decoder-side complexity ranges from 121% for TAN to 315% for ROT+GED+MPA, the encoder-side complexity ranges from 535% for 3DT to 5361% for ROT+GED+MPA.
The high complexity observed for the currently available 360-degree motion models can be attributed mainly to the considerable number of trigonometric function evaluations required when projecting from the 360-degree image plane to the sphere and the desired motion surfaces, and vice versa. The development of more efficient motion modeling schemes is thus of high interest. Our proposed techniques provide the basis for future developments in this area and offer a ready-to-use framework for extending state-of-the-art video codecs such as H.266/VVC with more efficient motion models. Initial tests incorporating lookup tables into the motion modeling procedure of MPA show a decrease in decoder-side complexity from 298% (cf. Table 8) to 143% while retaining average rate savings of 1.63%. The encoder-side complexity was reduced from 3581% to 569%.

V. CONCLUSION
In this paper, we present a solution to realize efficient multi-model 360-degree video compression, employing multiple motion models to cover a variety of underlying 3D object and camera motion scenarios. First, we investigate the strengths and limitations of the individual motion models and show that each model can only represent a limited range of motion scenarios. We solve this shortcoming by combining the strengths of multiple 360-degree motion models in a multi-model coding framework. To avoid the additional signaling overhead outweighing the gains achieved by improved motion modeling, we propose a novel and broadly applicable motion prediction technique and an efficient hierarchical coding scheme.
Our proposed motion vector prediction technique ensures seamless cooperation between the different motion models by generalizing the motion modeling concept to include forward and backward passes. It is applicable to arbitrary motion models that allow the definition of a backward pass. Our proposed hierarchical coding scheme proved to be the most efficient among comparable coding schemes and is applicable to arbitrary sets of motion models. Extensive experiments demonstrate the superior performance and broad applicability of our multi-model coding framework and highlight the importance of efficient motion vector prediction.
Further possible research directions include the investigation of possibilities for improved multi-model motion estimation (for example, a neural network based one-shot motion model selection) that could significantly reduce the encoder-side complexity of our approach. Furthermore, the application of our method to other domains, such as fisheye or classical perspective video, poses an interesting direction to evaluate its generalization capabilities outside the domain of 360-degree video.

FIGURE 1.
FIGURE 1. 360-degree image in spherical domain representation (a) and its projection to the 2D image plane applying the equirectangular projection format (b).

FIGURE 2.
FIGURE 2. Relation between the spherical domain representation and the projected 2D image plane domain representation of a 360-degree image using the example of the equirectangular projection ξ_erp.

FIGURE 3.
FIGURE 3. Visualization of the 3D translational motion model. The block is represented using the radial object shape model. Motion is modeled by shifting the block in 3D space. The 3D motion is constrained to retain the original depth d of the block center.

FIGURE 4.
FIGURE 4. Visualization of the tangential motion model. The block is represented using the planar object shape model. The plane to which the block is projected lies tangential to the block center. Motion is modeled as horizontal and vertical shifts of the block on the tangential plane.

FIGURE 5.
FIGURE 5. Visualization of the rotational motion model. The block is represented using the radial object shape model. Motion is modeled by rotating the block around the origin in 3D space.

FIGURE 7.
FIGURE 7. Visualization of the motion plane adaptive motion model. The block is represented using the planar object shape model. The plane to which the block is projected can be oriented freely in 3D space. The blue line indicates the normal orientation of the shown plane. Motion is modeled as horizontal and vertical shifts of the block on the plane.

FIGURE 9.
FIGURE 9. Syntax structure for motion model coding.

FIGURE 11.
FIGURE 11. Crops from original and predicted frames for VTM-17.2 and MM-VTM (ROT+GED+MPA) with MM-MVP=ON. Best viewed enlarged on a monitor.

TABLE 6.
Bjøntegaard Delta rate savings obtained by ROT+GED+MPA for the different hierarchical motion model coding depths. Bold entries mark the best performing coding depth for each sequence.


TABLE 1.
Summary of side information that needs to be transmitted for each motion model.

TABLE 4.
Bjøntegaard Delta rate savings in % based on WS-PSNR for MM-VTM with MM-MVP=OFF and MM-MVP=ON with respect to the baseline VTM-17.2 for different 360-degree motion models. Negative values (black) represent actual rate savings, positive values (red) represent increases in rate.

TABLE 5.
Bjøntegaard Delta rate savings in % based on different quality metrics for MM-VTM with MM-MVP=OFF and MM-MVP=ON with respect to the baseline VTM-17.2 for the multi-model approach (ROT+GED+MPA). Negative values (black) represent actual rate savings, positive values (red) represent increases in rate.

TABLE 7.
Complexity of MM-MVP=ON in % with respect to MM-MVP=OFF for each motion model and the multi-model approach.

TABLE 8.
Complexity of MM-VTM in % with respect to VTM-17.2 for each motion model and the multi-model approach.