User Gaze-Driven Adaptation of Omnidirectional Video Delivery Using Spatial Tiling and Scalable Video Encoding

Omnidirectional video is becoming increasingly popular among viewers, but its delivery requires a considerable amount of network bandwidth. Today's streaming services transmit the full spatial angle of omnidirectional videos, although most of the transmitted content is not utilized. Due to both limited bandwidth availability and its dynamic fluctuations, adaptive delivery solutions play a key role in supporting high-quality streaming of omnidirectional videos. This paper describes research which extends the MPEG-DASH Spatial Relationship Description by adding scalable video encoding to spatial tiling. It proposes a novel tile-layering based gaze adaptation algorithm for omnidirectional video delivery and employs it in conjunction with multiple tiling schemes. The benefits of the proposed algorithm with diverse tiling schemes are evaluated objectively in terms of bandwidth savings and adaptation latency. The results show a reduction of network bandwidth requirements to about 30% of the original bandwidth value with a low processing latency of 70.87 ms.


I. INTRODUCTION
THE POPULARITY of 360-degree videos is growing, fueled especially by the immersive experience offered by virtual reality (VR) display systems such as head-mounted displays (HMD). One of the commonly available HMDs is the HTC Vive, with a resolution of 1080 × 1200 pixels per eye, a nominal field of view of about 110° and a refresh rate of 90 Hz [1]. Another commonly used HMD is the Oculus Rift, with a resolution of 1080 × 1200 pixels per eye, a horizontal field of view of about 94° and the same refresh rate as the HTC Vive [3]. To prevent simulator sickness and support good user Quality of Experience (QoE), the overall display system should react within around 10 ms [2]. Stereo omnidirectional videos are usually encoded with resolutions from 3840 × 3840 pixels to 7680 × 7680 pixels and frame rates of 30-60 fps. These video characteristics are associated with large amounts of video data, in the order of 150-360 Mbit/s [4]. Current VR video delivery systems transmit full 360-degree videos, but only a portion of each frame is utilized within the user's viewport. With the increasing demand for 360-degree stereoscopic videos, the bandwidth requirements for the delivery of such content increase, putting pressure on existing limited-capacity infrastructure and potentially affecting user QoE. There are several strategies which try to address this problem and mitigate the negative impact on QoE.
Employing scalable video encoding (SVC) is a compression-driven approach based on the SVC principle introduced in various forms by international video standards, e.g., MPEG-4 Visual [5], H.264 SVC [6], H.265 HEVC [7]. A conventional video bitstream is divided into a base layer and multiple enhancement layers which improve the base layer. In general, sender-based adaptive algorithms use SVC to adjust some video encoding characteristics during video delivery to match the available network bandwidth and reduce loss, increasing user QoE [8], [9]. Such approaches allow for effective video transmission via networks with limited and/or highly variable bandwidth, but rely on client feedback, which sometimes introduces delays in the adaptation process.
Client-based approaches rely on MPEG Dynamic Adaptive Streaming over HTTP (DASH), standardized by ISO/IEC [10]. DASH divides the bitstream into temporal segments containing encoded media data and describes them with metadata in the Media Presentation Description (MPD) format. The video has multiple representations, encoded at different bit rates (e.g., using different resolution, encoding quality and/or frame rate). Representations are sliced into temporal segments of a given duration. Each segment has a unique URL accessible via HTTP GET requests [11]. The MPD contains resource identifiers (URLs) to access the video segments. DASH-based adaptive schemes enable clients to select the most appropriate segments for delivery during video playout, based on existing network conditions, in order to improve video quality [12]. Adaptive video streaming can be improved further to achieve specific goals, for instance by exploiting the statistical equivalence of QoE to save bandwidth [13]. The DASH Spatial Relationship Description (SRD) [14] was introduced to enable streaming of sub-parts of a video in combination with the multiple bitrates supported by DASH. This extension enables the DASH client to download and display only relevant parts of the video with appropriate representations and bitrates to achieve high QoE. SRD enables a video to be spatially partitioned into tiles, which are independently decodable videos. Tile-based adaptive solutions such as zoomable and navigable video [15] are effective, but challenging. SRD-based approaches have some drawbacks. Temporal synchronization of multiple tiles must be ensured, as more than one tile might be needed to reconstruct the user's viewport, and encoding efficiency deteriorates, as similarities between tiles cannot be utilized during compression [16].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This paper presents an innovative method for omnidirectional video delivery adaptation using spatial tiling and scalable video encoding. This method extends the idea of MPEG-DASH SRD and improves the adaptability of omnidirectional video delivery by layering spatial video tiles. The paper proposes a novel Tile-layering based gaze adaptation algorithm (TLGA) which adjusts the buffering of video tiles at appropriate scalable video encoding layers based on the user's gaze position. It prioritizes the base layer of the segments currently being played, giving more importance to the tiles close to the user's gaze. The paper presents an evaluation of the proposed algorithm for multiple spatial tiling schemes for on-demand client-server transmission of omnidirectional video. Testing results demonstrate high network bandwidth savings and low adaptation latency.
The paper makes the following contributions:
• An innovative method for dividing an omnidirectional video into tiles, i.e., square (based on longitude and latitude) or progressive (with comparable covered area), and organizing these tiles into layers of quality.
• The Tile Layering-based Gaze Adaptation algorithm (TLGA), which prioritizes the delivered tile quality based on the tiles' distance from the user's gaze.
• A comparative evaluation of tiling schemes in terms of adaptation latency and bandwidth saving when subjected to changes in user gaze.
An overview of the proposed algorithm is provided in Section III and is detailed in Section IV. Experimental results are presented and discussed in Section V. In Section VI conclusions are drawn.

II. RELATED WORKS
The research community has put a lot of effort over the years into enabling high-quality multimedia delivery under various delivery network constraints, mostly in terms of bandwidth and latency. Adaptive solutions which adjust the multimedia content to accommodate current network delivery conditions have proven to be a successful approach. Adaptive multimedia transmissions help support the increasing user expectation for improved quality of experience by dealing with content, device and network-related aspects. Diverse adaptive solutions were proposed to consider visual quality [8], navigation [17], Region-of-Interest (RoI) [18], energy consumption [19], load balancing [20], etc. on mobile and fixed networks. These solutions use a wide range of approaches, from classic optimization techniques [21] to innovative methods such as machine learning [22]. More recently, many efforts were made to design adaptive streaming solutions which target non-traditional content, including 4K/8K, 360° [23] and multisensorial [9] content, and innovative settings, including multi-device [12] and multistream [24]. More adaptation research proposals, including some based on the MPEG-DASH standard, were discussed in terms of their benefits and limitations in [25] and [26].
The authors of [27] have proposed modifications of MPEG-DASH to create multiple representations of the same omnidirectional video. In addition to creating representations with multiple bitrates, representations will offer enhanced quality in some quality emphasis regions (QER). The quality deteriorates the further the area is from the center of this region and similarly to MPEG-DASH, videos are split into temporal segments. When a user moves suddenly, the video playback is not stopped, but the video is presented with a lower quality in the user's viewport.
One of the implementations of MPEG-DASH SRD used a tiling scheme dividing an omnidirectional video encoded with equirectangular projection into 6 tiles (top and bottom cap of a sphere and four sides) [28]. The authors of [29] extended the idea of MPEG-DASH SRD and proposed using a divide and conquer algorithm to prioritize the download of tiles in the user's field of view. The authors of [30] have also used a cubic projection; however, they have divided each of the 6 tiles into 4 tiles, creating 24 HEVC encoded tiles with original and low resolution. Another modification of traditional DASH was presented in [31]. This contribution makes use of a software-defined networking (SDN) architecture for streaming VR multimedia and optimizing the bitrate for the viewer's region of interest.
Another approach is represented by foveated streaming solutions, which make use of the uneven distribution of photoreceptor cells in human eyes. Higher video quality is projected onto the center of the retina, called the fovea centralis, and lower quality is presented in the eccentric parts of the retina. In foveated rendering it was shown that there is no observable difference between full rendering and foveated rendering if the adaptation to the user's gaze is prompt [32], [33]. The authors of [34] have proposed a system for foveated video streaming. The system generates regions with different sizes of enhanced area and merges them into one frame. The frames are then encoded with H.264 and sent to the user. A blurring mask is applied on transitions between merged layers to fade the edges. The authors have shown a 5-8 times decrease in bandwidth consumption and a double performance increase. A different approach to foveated streaming, applied in cloud gaming, sets different quantization parameters for macroblocks depending on the real-time gaze fixation. The authors have shown a total end-to-end latency of 110 ms for their system [35]. On the same principle as MPEG-DASH, the authors of [36] have proposed a method where the client sends requests to the server with the spherical coordinates and size of the region of interest. According to this request, the server crops a high-resolution segment with a delay of 700 ms. The client also downloads low-resolution segments and merges high- and low-resolution segments together, where part of the low-resolution segment is omitted in favor of the high-resolution segment.
The techniques described above have limited ability to adapt to changes in the user's gaze and only work well with little to no gaze movement, which is often not the case in real-world conditions. When the gaze suddenly changes, the limitations of the above-mentioned adaptation methods become apparent. They either do not fully utilize the already downloaded content and have to download new content, or even have to re-encode the content on the server side, which can take a considerable amount of computing power when multiple users share one server.
The proposed TLGA algorithm for omnidirectional video delivery is fully utilizing all the transmitted data, and also offers fast adaptation to user's gaze downloading only differences between already buffered content and required quality. TLGA also allows for scalability in number of streaming users, because the content is available encoded on the server and does not need to be re-encoded for each user.
The authors of [37] proposed a method of streaming layered tiles with a shared coded picture, which is in fact a base layer covering the full video. This approach helps exploit spatial similarities between tiles and saves 11% to 14% in comparison to regular tile-based streaming. Similarly, the authors of [38] considered streaming a full 360-degree base layer and downloading enhancement layers, and the researchers in [39] proposed an approach based on layered video coding for 360-degree video which considers a spatial dimension in addition to the classic temporal one in the adaptation process. Unfortunately, none of these approaches is compatible with MPEG-DASH, which makes their adoption difficult. Additionally, with fast enough adaptation to changes in user gaze, it might not be necessary to transmit the full spatial angle even at the base layer. Unlike these solutions, the proposed TLGA algorithm prioritizes the transmission of tiles and their respective layers according to the user gaze, which provides the user with the most suitable content.
Other works proposed solutions which have diverse optimization targets, but considered specific wireless delivery settings, with many simultaneous users. For instance the authors of [40] focused on achieving optimal 360-degree video transmissions to multiple users and employ a multicast-based approach, whereas the same researchers in [41] targeted power optimization in a MIMO setup. The authors of [42] concentrated on a multi-antenna setup and achieved positive results in both single and multi-user scenarios. Unlike these solutions TLGA is a generic solution and can be deployed in all scenarios (including in conjunction with other solutions), as it adjusts the amount of data to be delivered and thus reduces the pressure on the network.

III. TILE-LAYERING BASED GAZE ADAPTATION (TLGA) OMNIDIRECTIONAL VIDEO DELIVERY SOLUTION - PRINCIPLE AND ARCHITECTURE
The main goal of the proposed TLGA is to deliver the best possible quality of experience to the user while minimizing bandwidth consumption. The idea is to deliver only those parts of the omnidirectional video that are in the gaze of an observer or are likely to be within the user's gaze in a short time. The proposed method relies on a client-server architecture which employs MPEG-DASH and enables the client to acquire content through HTTP GET requests. The functional scheme of this method is illustrated in Fig. 1, where the video encoded with conventional equirectangular projection is split into tiles, which are then split into scalable layers that can be transmitted to the user in varying quantity, depending on the position of the user's gaze.
TLGA relies on a client-server architecture. The server side deploys an HTTP server which stores omnidirectional videos. Each video stored on the server is divided into tiles with a fixed duration (usually 1 to 10 seconds). Tiles are also divided into SVC layers, i.e., a base layer with low quality and bitrate and multiple enhancement layers with increasing quality and bitrates. The server also provides metadata with a structured description of the video representation in XML format, containing information about tile duration, the area covered by a given tile and the layering of the tiles.
The client side consists of a playback device (e.g., a virtual reality headset, desktop player or smartphone) deploying the proposed TLGA algorithm. The client first accesses and reads the XML metadata file from the server with the information about tiling. The user's gaze coordinates are continuously monitored by built-in eye tracking or head tracking in the VR headset, by the mouse in the case of desktop playback, or by the accelerometer and gyroscope in the case of playback on a smartphone. The TLGA algorithm is responsible for downloading the base layer and enhancement layers according to their distance from the gaze. In practice, the number of enhancement layers downloaded by the TLGA algorithm decreases with the increasing distance of a given tile from the user's gaze. When the user suddenly changes their gaze, a tile with an earlier starting time is downloaded and its playback is synchronized onto the same frame as the content currently being played. This also applies to enhancement layers, which are used as an overlay over the base layer. The described process allows for selective video delivery only in the regions of interest and thus helps reduce bandwidth utilization. More technical details are provided in Section IV-D and in a patent application [43].

IV. TLGA PROOF OF CONCEPT
In order to test the proposed method, a reference implementation was targeted, which involved effort to create the TLGA-enhanced adaptable content at the server and a client application to play adaptable omnidirectional videos in a VR headset.

A. Creating TLGA-Adaptable Content
FFmpeg was employed to generate the TLGA-adaptable video content. First, the keyframes in the video are rearranged so that each segment starts with a keyframe. The video is then split into temporal segments of fixed duration, so that only short parts of the video need to be delivered at a time. For each segment, multiple spatial tiles are created, so that only the portion of the video which is in the user's viewport needs to be delivered. We have used multiple tiling schemes applicable to omnidirectional videos with equirectangular projection, as illustrated in Figs. 2, 3, 4 and 5: Cubic projection, Squares and poles, Progressive tiling and Vertical tiling.
Spatial tiles are then layered using the SVC principle and a simple implementation of signal-to-noise ratio (SNR) scalability based on a Constant Rate Factor (CRF). The Constant Rate Factor is a rate control mode in which the encoder attempts to achieve a certain output quality for the whole video file. The CRF scale ranges from 0 to 51, where 0 is objectively lossless, 18 is considered to be visually lossless and 51 is the worst possible quality [44]. The base layer is encoded with a CRF of 30 and enhancement layers are encoded with lower CRFs (e.g., the highest-level enhancement layer is encoded with a CRF of 18). The enhancement layer is created by calculating a difference video between the original video tile and its base layer, where the output video file is shifted to the 50% gray level and limited at the black and white levels [45]. The described SVC layering is pictured in Fig. 6, along with the result of overlaying one enhancement layer over the base layer and subtracting the 50% gray level. We have used a publicly available omnidirectional video source for demonstration [46].
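The difference-video step can be illustrated per sample. The following is only a minimal sketch, not the FFmpeg pipeline used in this work: it assumes 8-bit samples (so 50% gray is taken as 128), ignores the lossy encoding of both layers, and uses the hypothetical helper names `make_enhancement` and `apply_enhancement`.

```python
def clip(value, low=0, high=255):
    """Limit a sample to the black (0) and white (255) levels."""
    return max(low, min(high, value))


def make_enhancement(original, base, gray=128):
    """Difference between original and base samples, shifted to 50% gray.

    ASSUMPTION: 8-bit samples; gray=128 stands in for the 50% gray level.
    """
    return [clip(o - b + gray) for o, b in zip(original, base)]


def apply_enhancement(base, enhancement, gray=128):
    """Overlay the enhancement on the base layer and subtract the gray offset."""
    return [clip(b + e - gray) for b, e in zip(base, enhancement)]


# Hypothetical sample values for one row of a tile.
original = [10, 120, 200, 255]
base = [14, 117, 190, 250]

enhancement = make_enhancement(original, base)
restored = apply_enhancement(base, enhancement)
```

In this idealized case `restored` equals `original`. With real encoded layers the reconstruction is only approximate, and differences larger than half the sample range are lost to clipping.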
Along with the video files, we have also generated the XML media description file. This file includes the video file extensions, names, information about the full video duration, the temporal segment duration, spatial information about tiles and information about layering. An example of such an XML file is illustrated in Fig. 7. At the beginning, the naming convention of the video files is defined. For example, the enhancement layer of the fourth tile in the sixth one-second segment has the name "17_UnderwaterPark_segment6_tile4_layer2.mp4". The video consists of multiple tiles (sectors), where each sector has a defined covered scope (values [x, y] from 0 to 1) and offset (values [x, y] from 0 to 1). Each sector consists of one or more layers. The first layer is the base layer and the other layers are enhancement layers which should be added to the base layer. Each layer has a defined quality (CRF) and a video resolution (width and height).
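The naming convention can be expressed as a small helper. This is a sketch that only reproduces the pattern of the quoted example file name. The function names and the server address are hypothetical, and the numbering origin of segments, tiles and layers is not assumed beyond what the example shows.

```python
def segment_file_name(video, segment, tile, layer, extension="mp4"):
    """Build a file name following the pattern of the example in the text."""
    return f"{video}_segment{segment}_tile{tile}_layer{layer}.{extension}"


def segment_url(base_url, video, segment, tile, layer):
    """Join a (hypothetical) server base URL with the file name for HTTP GET."""
    name = segment_file_name(video, segment, tile, layer)
    return f"{base_url.rstrip('/')}/{name}"


# "http://server.example/videos/" is a placeholder, not the address used in the tests.
url = segment_url("http://server.example/videos/", "17_UnderwaterPark", 6, 4, 2)
```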

B. HTTP Server Providing Adaptable Omnidirectional Video Content
The created video files with multiple layers are available at the HTTP server along with their associated media description files. For the testing described in this paper, we have used Microsoft Internet Information Services (IIS) on a server on the same local area network (LAN) as the client. The server provided content via HTTP GET requests.

C. Playback of Adaptable Omnidirectional Video Content
The client VideoPlayer deploys the proposed adaptable omnidirectional video playout solution, which was implemented in Unity as a proof of concept. VideoPlayer parses the XML media description file and, according to the tiling scheme, creates the components which render each video tile to a separate texture. Each texture has an assigned tiling (scope of the omnidirectional scene) and offset. Textures are rendered on a skybox (celestial sphere, an imaginary sphere with infinite radius). An example of client playback and a comparison between the playback of the base layer alone and of the base layer with an enhancement layer is shown in Fig. 8.
Tiles are buffered into a video sequence buffer from where the proposed TLGA algorithm, described in Section IV-D, enables adaptable playout. The video is navigable by mouse drag movements or by headset movements in the case of playback with a VR headset. There are two different ways to play back a video tile. On-time playback means that the video tile of a given segment was already loaded when the playback of that temporal segment is due to start. In this case all loaded video tiles start their playback at the same time. Late playback refers to the case when the video tile was loaded while other tiles of the same segment are already being played. In this case, the tile's video sequence playback starts on the same frame number as the video tiles which are already being played. This means the playback of this tile is temporally synchronized with the other video tiles.
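The frame on which a late tile has to start can be derived from the playback clock. The sketch below assumes zero-based segment indices and a constant frame rate. The function `late_start_frame` is a hypothetical illustration, not the Unity implementation.

```python
def late_start_frame(playback_time_s, segment_index, segment_duration_s, fps):
    """Frame index inside the segment at which a late tile joins playback.

    ASSUMPTIONS: segments are numbered from 0 and all tiles share one frame rate.
    """
    segment_start = segment_index * segment_duration_s
    elapsed = playback_time_s - segment_start
    if not 0 <= elapsed < segment_duration_s:
        raise ValueError("playback time lies outside the given segment")
    # The small epsilon guards against floating-point results such as 14.999999.
    return int(elapsed * fps + 1e-9)


# A tile of segment 6 finishes loading 0.5 s into that segment (1 s segments, 30 fps).
start_frame = late_start_frame(6.5, 6, 1.0, 30)
```

A tile that is ready exactly at the segment boundary yields frame 0, which corresponds to on-time playback.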

D. Tile-Layering-Based Gaze Adaptation Algorithm (TLGA)
The proposed TLGA algorithm, listed in Algorithm 1, considers that the adaptable omnidirectional video content is already generated and therefore multiple omnidirectional video tiles are available. TLGA is responsible for filling the video sequence buffer with appropriate data and delivering this data to be played out. A concentric circular approach is considered for TLGA selection of the quality of the omnidirectional video tiles based on the distance from current user's gaze and possible future user's gaze.
An important parameter for TLGA adaptation is the distance from the center of each tile to the user gaze. The shortest distance between two points on the surface of a sphere, the orthodromic distance d, is calculated using equation (1), where r is the radius of the sphere and $[\phi_1, \lambda_1]$, $[\phi_2, \lambda_2]$ are the spherical coordinates of the user gaze and the tile center, respectively. A tile is considered for buffering by TLGA when it is in the region of the user gaze, and it can have different quality levels. This region is defined based on the distance from the user gaze: tiles closer than $D_{base}$ (1.8 · r in our case) are associated with basic-quality content (base layer), and tiles closer than $D_{enhancement}$ (0.9 · r in our case) have improved quality (base and enhancement layers). The algorithm can easily be extended to include a more complex gaze distance-based adaptation involving multiple enhancement layers. The values of $D_{base}$ and $D_{enhancement}$ were chosen so that TLGA adaptation covers the user's field of view with enhanced quality when the user is not moving or moves slowly, and with the base layer (until the enhancement layer is loaded) when the user moves fast or far.
The TLGA algorithm uses a priority queue of candidate video sequences, which belong to certain video tiles, are split into temporal video segments and have certain quality levels. These segments need to be buffered and eventually delivered for playout. A video sequence V associated with a given segment number s, tile i and layer l at time t is denoted as $V^{i,l}_s(t)$.
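The concentric selection can be sketched as follows. The distance is computed with the haversine form of the great-circle distance, which is one standard way to evaluate the orthodromic distance and is not necessarily the form used in equation (1). The sketch assumes that ϕ is latitude and λ is longitude, both in radians, and `layers_for_tile` is a hypothetical helper name.

```python
import math

D_BASE = 1.8         # threshold for the base layer, in units of r
D_ENHANCEMENT = 0.9  # threshold for the enhancement layer, in units of r


def orthodromic_distance(gaze, tile_center, r=1.0):
    """Great-circle distance between two (latitude, longitude) points in radians.

    ASSUMPTION: haversine form; any equivalent great-circle formula gives the same d.
    """
    phi1, lam1 = gaze
    phi2, lam2 = tile_center
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * r * math.asin(min(1.0, math.sqrt(h)))


def layers_for_tile(gaze, tile_center, r=1.0):
    """Return which layers of a tile are candidates for buffering."""
    d = orthodromic_distance(gaze, tile_center, r)
    if d < D_ENHANCEMENT * r:
        return ["base", "enhancement"]
    if d < D_BASE * r:
        return ["base"]
    return []


gaze = (0.0, 0.0)
near = layers_for_tile(gaze, (0.0, 0.5))            # d = 0.5
side = layers_for_tile(gaze, (0.0, math.pi / 2))    # d is about 1.57
behind = layers_for_tile(gaze, (0.0, math.pi))      # d is about 3.14
```

With these thresholds a tile at a quarter turn from the gaze still receives its base layer, while a tile directly behind the user is not buffered at all.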

1) Enqueueing Video Sequence Candidates for Buffering:
The TLGA algorithm maintains a buffer with length BL. In order for a video sequence to be enqueued, its segment number s must lie between the current playback segment number $s_{current}$ and $s_{current}$ plus the buffer length, inclusive, i.e., $s_{current} \le s \le s_{current} + BL$.
In addition to this condition, if the distance d between the tile center of a given base-layer video sequence and the user gaze is lower than $D_{base}$, the video sequence is enqueued into the priority queue, as given by equation (3). Similarly, an enhancement-layer video sequence is enqueued if d is lower than $D_{enhancement}$. The priority P of a video sequence is computed from $P_{max}$, the maximum priority (we have chosen $P_{max}$ = 1000), and the prioritization constants α, β and γ. Constant α is the weight of the temporal distance between the current segment and the segment of the video sequence in question. Constant β is the weight of the spatial distance d between the video sequence tile center and the user gaze, and γ is the weight of the quality index given by layer l. In the tests reported in this paper, we have used α = 100, β = 10 and γ = 1. The units of the constants are chosen such that the priority P is dimensionless. The values of the constants were chosen so that the tiles of the segment currently being played are considered first, then the tiles closest to the user's gaze and lastly their enhancement layers. This prioritises base layer data before enhancement layer data for a respective tile.
Fig. 9. TLGA was deployed in an adaptive client-server video delivery system for evaluation.
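The enqueueing step can be sketched with a binary heap. The exact priority expression is not reproduced here, so the sketch assumes a linear penalty of the form P = P_max − α·(s − s_current) − β·d − γ·l, which is only one form consistent with the description of the weights. It further assumes that layer index 0 denotes the base layer and that r = 1. `build_queue` is a hypothetical helper name.

```python
import heapq

P_MAX, ALPHA, BETA, GAMMA = 1000, 100, 10, 1
D_BASE, D_ENHANCEMENT = 1.8, 0.9  # thresholds for r = 1


def priority(segment, current_segment, distance, layer):
    # ASSUMPTION: linear combination of the three weighted terms.
    return (P_MAX
            - ALPHA * (segment - current_segment)
            - BETA * distance
            - GAMMA * layer)


def build_queue(candidates, current_segment, buffer_length):
    """Enqueue candidates (segment, tile, layer, distance) that pass both checks."""
    heap = []
    for segment, tile, layer, distance in candidates:
        # Temporal condition: segment inside the buffer window, inclusive.
        if not current_segment <= segment <= current_segment + buffer_length:
            continue
        # Spatial condition: layer 0 is assumed to be the base layer.
        limit = D_BASE if layer == 0 else D_ENHANCEMENT
        if distance >= limit:
            continue
        p = priority(segment, current_segment, distance, layer)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(heap, (-p, segment, tile, layer))
    return heap


candidates = [
    (5, "A", 0, 0.2), (5, "A", 1, 0.2),  # tile near the gaze, both layers
    (5, "B", 0, 1.2), (5, "B", 1, 1.2),  # farther tile: enhancement is rejected
    (6, "A", 0, 0.2),                    # next segment
    (5, "C", 0, 2.5),                    # outside D_BASE
    (9, "A", 0, 0.2),                    # outside the buffer window
]
heap = build_queue(candidates, current_segment=5, buffer_length=2)
order = [heapq.heappop(heap)[1:] for _ in range(len(heap))]
```

Under these assumptions the current segment is served before the next one, and within a tile the base layer always precedes its enhancement layer.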
2) Dequeueing Video Sequence Candidates: In parallel to the enqueueing process, the TLGA algorithm also continuously dequeues video sequences from the queue for remote delivery or local playout. In order to limit the load on the CPU and the network and to keep buffering times short, only a maximum number of video sequences $BV_{max}$ can be buffered at once (in our tests $BV_{max}$ = 2). As the user's gaze can change at any given time, the dequeue priority of a video sequence is decided at the moment the TLGA algorithm requests a dequeue from the priority queue. As the queue is prioritized, the data is sent for delivery or local playout in order of video sequence importance.
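Deciding the priority at dequeue time can be sketched as below. This is a simplified illustration under stated assumptions: pending candidates are kept in a plain list, the scoring function is passed in by the caller, and `pick_next` is a hypothetical name.

```python
BV_MAX = 2  # maximum number of video sequences buffered at once


def pick_next(pending, in_flight, score):
    """Return the best pending candidate, or None if BV_MAX transfers are active.

    `score` is evaluated at the moment of the call, so it reflects the
    latest gaze position rather than the one seen at enqueue time.
    """
    if len(in_flight) >= BV_MAX or not pending:
        return None
    best = max(pending, key=score)
    pending.remove(best)
    in_flight.append(best)
    return best


pending = ["A", "B", "C"]
in_flight = []

# Distances of the tile centers from the gaze; hypothetical values.
distance = {"A": 0.1, "B": 0.8, "C": 1.5}
first = pick_next(pending, in_flight, lambda tile: -distance[tile])

# The gaze moves before the next dequeue, so tile C is now the closest one.
distance = {"B": 1.4, "C": 0.3}
second = pick_next(pending, in_flight, lambda tile: -distance[tile])

# Two transfers are active, so nothing more is started.
third = pick_next(pending, in_flight, lambda tile: -distance[tile])
```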
3) Removing Obsolete Video Sequences From the Queue: During the enqueueing process, several candidate video sequences are added to the priority queue. Some video sequences may not be buffered due to having low priority and become obsolete after some time. Any video sequence that has a lower segment number s than the current playback segment $s_{current}$, i.e., $s < s_{current}$, is removed from the queue. Also, video sequences where $s = s_{current}$ are considered obsolete when approaching the end of the current segment playback time, as there is no time left for delivery and/or decoding. This is done by the condition in equation (7).
In equation (7), $T_s$ is the duration of one segment, t is the current time within the playback segment and $\mu_N$ is the average preparation time of a video sequence $V^{i,l}_s(t)$, computed as a running average, where N is the number of downloaded tiles and $t_{prepare}$ is the preparation time of the latest tile.
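The obsolescence test can be sketched as follows. The precise form of the condition is not reproduced here, so the sketch assumes that a sequence of the current segment is obsolete once the remaining segment time falls below the average preparation time, and it uses the standard incremental update for the running average. Both class and function names are hypothetical.

```python
class PrepareTimeTracker:
    """Running average of tile preparation (download plus decode) times."""

    def __init__(self):
        self.count = 0   # N, number of prepared tiles
        self.mean = 0.0  # running average in seconds

    def update(self, t_prepare):
        # Standard incremental mean: no history of samples is stored.
        self.count += 1
        self.mean += (t_prepare - self.mean) / self.count
        return self.mean


def is_obsolete(segment, current_segment, t_in_segment, segment_duration, mean_prepare):
    """Decide whether a queued video sequence should be dropped."""
    if segment < current_segment:
        return True
    if segment == current_segment:
        # ASSUMPTION: obsolete when the remaining time is shorter than the
        # average time needed to prepare one video sequence.
        return segment_duration - t_in_segment < mean_prepare
    return False


tracker = PrepareTimeTracker()
for sample in (0.070, 0.090, 0.080):
    tracker.update(sample)
```

With an average preparation time of about 80 ms and 1 s segments, a current-segment sequence requested 950 ms into the segment would be dropped, while one requested at 500 ms would still be fetched.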

A. Testing Setup and Scenarios
To evaluate the proposed method in real world conditions we have used a dataset of head and eye movements in Virtual Reality [47]. We have used one omnidirectional video with the highest quality from this dataset with underwater content (17_UnderwaterPark.mp4). The video has a duration of 20 seconds and a framerate of 30 fps covering full sphere with an equirectangular projection and resolution of 3840 × 1920. This video comes with head movement trajectories of 57 observers (mean age: 25.7 years, 25 women and 32 men) sampled at 200 ms intervals (labeled fixation).
A client-server system which deploys TLGA as described in the previous section was employed during testing. The system is illustrated in Fig. 9, which indicates all TLGA phases from video capture to display, i.e., video segmentation, tile generation, multiple layer creation, adaptive server transmission and client content reception and display. The four tiling schemes discussed were used in turn: cubic projection, squares and poles, progressive and vertical. The omnidirectional video was prepared for adaptation as described in Section IV-A, with a segment duration of 1000 ms. The playback client was modified to consume headset fixation data and to simulate head movements with a simple linear interpolation between two fixations. This means that for each rendered frame (note: not a video frame, but a frame rendered to the display by the GPU) a new gaze position is computed and the camera view is rotated accordingly. The technical details of the server-client experimental setup employed during testing are listed in Table I. The server and client were both connected to an isolated local area network (LAN) through Ethernet cable. The maximum theoretical throughput of the network was 1000 Mbps with a latency < 1 ms, so no negative influence of network delivery on the omnidirectional video distribution was recorded. The evaluation is performed in terms of network bandwidth saving and adaptation latency.
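The simulated head movement can be sketched as a linear interpolation between consecutive fixations. The sketch assumes that fixations are given as (time, latitude, longitude) tuples in seconds and radians and, for brevity, ignores longitude wrap-around at ±π. `interpolate_gaze` is a hypothetical name.

```python
def interpolate_gaze(fix_a, fix_b, t):
    """Linearly interpolate the gaze between two fixations at render time t.

    Each fixation is (time_s, latitude, longitude).
    ASSUMPTION: no wrap-around handling for the longitude.
    """
    t0, lat0, lon0 = fix_a
    t1, lat1, lon1 = fix_b
    if t1 <= t0:
        raise ValueError("fixations must be ordered in time")
    # Clamp the weight so that times outside the interval hold the end values.
    w = min(1.0, max(0.0, (t - t0) / (t1 - t0)))
    return (lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0))


# Two fixations 200 ms apart, evaluated for a frame rendered halfway between them.
gaze = interpolate_gaze((0.0, 0.0, 0.0), (0.2, 0.1, 0.4), 0.1)
```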

B. Network Bandwidth Savings
Network bandwidth utilisation was monitored for each user, and the results for one user are presented as an example in Fig. 10 and Fig. 11. These charts compare the user's head movement speed against the current network bandwidth utilisation and show higher utilisation when the user's gaze changes very fast, because the TLGA algorithm needs to cover the changing gaze with additional video tiles. The head movement speed $\omega_t$ at time t was determined as the angle $\sigma_{t,t-\Delta t}$ between the user's gaze at time t and at time t − Δt, divided by Δt, where Δt is 200 ms. The angle is obtained from the orthodromic distance of equation (1) between the gaze positions at these two times.
Average network bandwidth utilisation for all tiling schemes and for traditional streaming with the full viewport is compared in Table II, together with its distribution into on-time playback and late playback. Playback of tiles from the buffer (on-time playback) represents 84% of the average network bandwidth utilisation, and tiles that were loaded while their segment was already being played (late playback) represent 16%. Late playback is caused exclusively by the user changing gaze and the TLGA algorithm adapting to the new gaze by downloading tiles that should already have been playing. This could be partially mitigated by employing a user gaze prediction algorithm such as [48]. The proportion of late playback could also be reduced by using a shorter segment duration; however, this would negatively impact the encoding efficiency, as each segment needs to have a key frame encoded.
Fig. 12 compares the average network bandwidth utilisation of the four proposed tiling schemes, delimited by the upper and lower quartile of the average network bandwidth consumption for each user. Network bandwidth savings of the proposed tiling schemes in comparison to traditional streaming of the full spatial angle are listed in Table III. The results in Fig. 12 show similar bandwidth utilisation for all proposed tiling schemes. These results are determined by the selected values of $D_{base}$ and $D_{enhancement}$ and by the overlap between the concentric regions and the tiles. By employing TLGA, the network bandwidth consumption was decreased to 30-35% of the original bandwidth consumption without adaptation, for all tiling schemes.
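The head movement speed can be computed from two consecutive gaze samples as sketched below. The angle is evaluated with the haversine form of the great-circle distance on a unit sphere, assuming gaze positions are (latitude, longitude) pairs in radians. The function names are hypothetical.

```python
import math


def gaze_angle(gaze_a, gaze_b):
    """Angle in radians between two gaze directions given as (latitude, longitude).

    ASSUMPTION: haversine form of the great-circle distance with r = 1.
    """
    phi1, lam1 = gaze_a
    phi2, lam2 = gaze_b
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * math.asin(min(1.0, math.sqrt(h)))


def head_speed(gaze_now, gaze_previous, dt=0.2):
    """Angular head movement speed in rad/s over a sampling interval dt (seconds)."""
    return gaze_angle(gaze_now, gaze_previous) / dt


# The gaze moved by 0.1 rad along the equator within one 200 ms sample.
speed = head_speed((0.0, 0.1), (0.0, 0.0))
```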

C. Adaptation Latency
The adaptive approach employed by TLGA streams only a portion of the scene, and one of the most important parameters to evaluate is the time it takes to adapt to changes in user's viewport. Fig. 13 depicts time to download and decode one tile for each tiling scheme delimited by upper and lower quartile of tile preparation time. Tile preparation time of Cubic projection was 125.94 ± 5.66 ms, highest of all tiling schemes. This is because the cubic projection is splitting the omnidirectional video into only 6 tiles, unlike progressive tiles (18 tiles), vertical tiles (18 tiles) and squares and poles (20 tiles). The tiles of other three tiling schemes cover a smaller region and are also smaller in file size and faster to decode. The lowest tile preparation time of 70.87 ± 2.37 ms was achieved when using the squares and poles tiling scheme, which has the smallest tiles among the tested tiling options.
According to [49], a latency of 50-70 ms can be tolerated in foveated rendering; however, that research used lower values of peripheral eccentricity (up to 20°, which is 0.349066 rad, whereas the proposed method had full quality within a fovea size of 0.9 rad). Big tiles, as used in the cubic projection tiling scheme, could prove to be effective in the case of rapid changes in user gaze, where smaller tiles would need to be downloaded in high numbers. Tiling schemes with smaller tiles seem to be more suitable for ordinary movements, where the user gaze changes slowly enough so that only one new tile is needed. Also, to keep the adaptation latency comparable between all tiles, they should be of comparable area, so that both download and decoding have similar complexity. The original video could be split into more tiles than proposed in this paper to achieve better latency. However, the latency is also limited by other factors of the end-to-end latency, and the encoding efficiency will be negatively affected, since the similarities between the tiles cannot be utilized when encoding.

VI. CONCLUSION
This paper has introduced the novel Tile-layering based gaze adaptation algorithm (TLGA), which adjusts the omnidirectional video delivery to the user's head movement such that the highest quality content is in the area where the user is looking and the content quality decreases with distance from that area. TLGA involves an omnidirectional video split into multiple temporal segments, which are then further spatially split into tiles that can be encoded at multiple quality layers. A proof of concept implementation of the proposed method was realized in a client-server video delivery system, and testing was performed with head movement and gaze-based data from real users. The efficiency of four different tiling schemes was compared. The results show that when employing the proposed TLGA method, the network bandwidth consumption decreased to 30% of the original bandwidth consumption with an adaptation latency of 70.87 ms, while showing the best quality of content around the viewer's area of interest. The proposed method is not limited to head tracking and can be extended by considering eye tracking to further decrease the network bandwidth consumption and adaptation latency and improve the overall perceived QoE.