Advanced Scalability for Light Field Image Coding

Light field imaging, which captures both spatial and angular information, improves user immersion by enabling post-capture actions, such as refocusing and changing the view perspective. However, light fields represent very large volumes of data with significant redundancy that coding methods try to remove. State-of-the-art coding methods usually focus on improving compression efficiency and overlook other important features of light field compression, such as scalability. In this paper, we propose a novel light field image compression method that enables (i) viewport scalability, (ii) quality scalability, (iii) spatial scalability, (iv) random access, and (v) uniform quality distribution among viewports, while keeping compression efficiency high. To this end, light fields in each spatial resolution are divided into sequential viewport layers, and viewports in each layer are encoded using the previously encoded viewports. In each viewport layer, the available viewports are used to synthesize intermediate viewports using a video interpolation deep learning network. The synthesized views are used as virtual reference images to enhance the quality of intermediate views. An image super-resolution method is applied to improve the quality of the lower spatial resolution layer. The super-resolved images are also used as virtual reference images to improve the quality of the higher spatial resolution layer. The proposed structure also improves the flexibility of light field streaming, provides random access to the viewports, and increases error resiliency. The experimental results demonstrate that the proposed method achieves high compression efficiency and can adapt to the display type, transmission channel, network condition, processing power, and user needs.


I. INTRODUCTION
Light field imaging is a promising technology for providing an immersive experience to the users [1]. Unlike traditional photography, which integrates angular information into a 2D image, light field imaging collects both spatial and angular information, resulting in a grid of 2D views and enabling functionalities such as changing the viewport, synthesizing new views, and immersive navigation within the captured scene. However, light fields come with a huge amount of data, making their compression, transmission, and storage a challenging task and calling for highly efficient light field compression methods. Light field compression methods are mainly categorized into two groups [2]: (i) transform-based and (ii) predictive-based coding methods. The Discrete Cosine Transform (DCT) [3], Discrete Wavelet Transform (DWT) [4], Karhunen-Loève Transform (KLT) [5], and Graph Fourier Transform (GFT) [6] are among the transformations that have been applied to light fields to reduce their redundancy in the transform domain. Such a transform-based solution has been adopted in the 4D transform mode, also known as the Multidimensional Light field Encoder (MuLE) [3], of JPEG Pleno. The 4D redundancy of light fields is exploited by applying a 4D-DCT transform to 4D spatio-angular blocks. Rizkallah et al. [7] propose a graph-transform-based light field compression method using a rate-distortion optimized graph coarsening and partitioning algorithm.
Predictive-based coding approaches are typically based on (i) non-local spatial prediction, (ii) inter-view prediction, and (iii) view synthesis methods. Non-local spatial prediction approaches have been used to reduce the redundancy within a lenslet image [8], [9]. The High Efficiency Video Coding (HEVC) [10] and Versatile Video Coding (VVC) [11] standards have also been used to reduce the redundancy between light field views thanks to inter-view prediction. Light field views are reordered as a pseudo video sequence (PVS) and the generated PVS is fed into the video codec. A predefined scan order, such as raster or spiral [12], [13], is typically used to generate a PVS. Fig. 2 depicts the conversion of the multiview images to a PVS using the serpentine scan order. Wang et al. [14] analyze the relationship of the inter-view prediction structure with the coding performance and propose an efficient prediction structure for light field coding.
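As a minimal sketch of this reordering step, the snippet below converts a grid of views into a PVS using a serpentine (back-and-forth) scan; the grid shape and view contents are placeholders, and the resulting sequence would then be fed to a standard video encoder.

```python
import numpy as np

def serpentine_order(rows: int, cols: int):
    """Return (row, col) indices of a serpentine (boustrophedon) scan."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        for c in cs:
            order.append((r, c))
    return order

def views_to_pvs(view_grid):
    """Stack a grid of views (rows x cols x H x W x C) into a pseudo video sequence."""
    rows, cols = view_grid.shape[:2]
    return np.stack([view_grid[r, c] for r, c in serpentine_order(rows, cols)])

# Example: a synthetic 5x5 light field of 32x32 RGB views.
lf = np.random.rand(5, 5, 32, 32, 3).astype(np.float32)
pvs = views_to_pvs(lf)          # shape: (25, 32, 32, 3), ready for a video codec
print(pvs.shape)
```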
In synthesis-based approaches, a sparse set of light field views is first encoded and used to synthesize (predict) the remaining views using view synthesis methods, including (i) Depth Image Based Rendering (DIBR), as in the Warping and Sparse Prediction (WaSP) encoder [15], which has been adopted in the JPEG Pleno coding standard, or in [16], (ii) transform-assisted [17], and (iii) learning-based view synthesis [18], [19] approaches.
Dib et al. [20] use a transform-assisted view synthesis method to compress light fields. A subset of views is first inter-coded and then used to synthesize the next subset of views using the Fourier Disparity Layer (FDL) representation. The prediction residuals are then inter-coded and used to enhance the quality of synthesized views and refine the FDL representation. Ahmad et al. [21] divide light field views into two groups, namely, key views and decimated views. Key views are encoded using MV-HEVC. They are then used to synthesize the decimated views using the shearlet transform. The residuals of synthesized views are then encoded as a single PVS.
Hou et al. [18] propose a bi-level compensation approach which uses the learning-based view synthesis Deep Neural Network (DNN) proposed in [22] for light field compression. The four corner views are inter-coded first and, after decoding, they are fed to the DNN to synthesize the remaining views. The residuals between the synthesized views and their corresponding target views are reordered as a PVS and inter-coded. Jia et al. [23] propose a light field compression method based on a Generative Adversarial Network (GAN). They first generate a PVS by sparsely sampling light field views following a chessboard pattern. The intermediate views are then synthesized from the decoded PVS views using the GAN. The residuals between the synthesized views and their corresponding target views are then inter-coded to enhance the quality of the synthesized views. Hu et al. [19] propose an adaptive two-layer light field compression method based on Graph Neural Network (GNN) reconstruction. Low- and high-frequency components are encoded using different approaches. The high-frequency view components are converted into a PVS and encoded using HEVC. The low-frequency components of the views are resampled in the angular dimension and the selected views are inter-coded. The discarded views are synthesized using the GNN. Bakir et al. [24] use VVC's temporal scalability structure to encode key views which are then fed to a GAN to synthesize the remaining views.
Some approaches provide a form of scalability when coding light fields. Conti et al. [25] propose a viewport-scalable coding solution for 3D light fields based on an inter-layer prediction scheme that exploits the redundancy between multiview and lenslet representations. Li et al. [26] propose a three-layer disparity-compensated scheme for scalable coding of lenslet images. Garrote et al. [27] propose a scalable scheme based on the wavelet transform for lenslet image coding. Conti et al. [28], [29] propose a light field coding solution with field-of-view scalability, which supports region-of-interest enhancement. Komatsu et al. [30] propose a light field coding method using weighted binary images with support for quality scalability. Rüefenacht et al. [31] propose a scalable light field coding approach based on the base-anchored representation, including scalable compression of the disparity information itself.
In this paper, we propose a flexible light field compression method that can be adapted to the user's needs by supporting the following functionalities: (a) viewport scalability, (b) spatial scalability, (c) quality scalability, (d) random access, and (e) uniform quality distribution. The proposed framework extends the method described in [32] in several ways. It first adds spatial scalability based on a single image super-resolution approach, which is shown to yield a very high rate-distortion performance for each target spatial resolution. The flexibility of the encoding structure has been increased by adding spatial scalability in addition to the viewport and quality scalabilities. This increased flexibility allows us to better address the various trade-offs between encoding efficiency, random access, and the different forms of scalability. A comprehensive analysis is carried out using both a large-parallax light field dataset, which is more challenging in terms of encoding efficiency, and low-parallax light fields.
In a nutshell, we first downscale the light field views to a lower resolution to form two spatial layers: (i) Spatial Layer 1 (SL1) and (ii) Spatial Layer 2 (SL2). Views in each spatial layer are divided into Viewport Layers (VLs). Fig. 3 depicts the structuring of 5 × 5 light field views into spatial and viewport layers. In each VL, the available views are used to synthesize intermediate views and the synthesized views are used as virtual reference images to predict their corresponding views. To encode views in SL2, super-resolution is applied to their corresponding encoded viewports in SL1 and the super-resolved views are also added to the reference image list. The remainder of the paper is organized as follows. The theoretical background for light field imaging is introduced in Section II. The functionalities supported by our proposed method are introduced in Section III. Section IV presents the proposed light field encoding method. Experimental results are provided in Section V and Section VI presents the concluding remarks.

II. LIGHT FIELDS
A light field is a quantized representation of the 7D plenoptic function [33], i.e., where all light rays at every possible location (x, y, z), at every possible direction (θ, φ), at any time (t), over any range of wavelengths (λ) are recorded. The light field representation can be simplified based on some assumptions. First, light rays are considered time-invariant and monochromatic, which removes the time (t) and wavelength (λ) dimensions. Second, the light rays are assumed to travel in free space, which removes another dimension. Therefore, a light field is represented by a 4D function L(u, v, x, y), where (u, v) represents the view location and (x, y) denotes the pixel location in each view. A two-plane parameterization can be used to model light fields, and they are represented as multiview images as shown in Fig. 1. To acquire light fields, multi-array or lenslet cameras are used. For lenslet cameras, the spatial and angular domains are multiplexed into a single 2D image, known as a lenslet image. The lenslet image can be converted into a multiview representation [34].
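As a toy illustration of this representation, the sketch below demultiplexes an idealized lenslet image into a multiview array indexed as L(u, v, x, y); it assumes an axis-aligned grid of square microlenses and ignores the calibration and resampling steps of real conversion pipelines such as [34].

```python
import numpy as np

def lenslet_to_multiview(lenslet, u_res, v_res):
    """Demultiplex a lenslet image into a (u_res, v_res, X, Y) multiview array.
    Assumes an idealized, axis-aligned microlens grid of u_res x v_res pixels."""
    H, W = lenslet.shape[:2]
    X, Y = H // u_res, W // v_res
    views = np.empty((u_res, v_res, X, Y) + lenslet.shape[2:], dtype=lenslet.dtype)
    for u in range(u_res):
        for v in range(v_res):
            # view (u, v) gathers pixel (u, v) under every microlens
            views[u, v] = lenslet[u::u_res, v::v_res][:X, :Y]
    return views

# Synthetic example: 5x5 views of 64x64 pixels multiplexed into one lenslet image.
U, V = 5, 5
lenslet = np.random.rand(U * 64, V * 64).astype(np.float32)
L = lenslet_to_multiview(lenslet, U, V)     # shape (5, 5, 64, 64)
central_view = L[U // 2, V // 2]            # sub-aperture image at (u, v) = (2, 2)
```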

III. FUNCTIONALITIES IN LIGHT FIELD COMPRESSION
In this section, we highlight the functionalities supported by our proposed light field coding method.

A. Viewport Scalability
Viewport scalability for light fields is provided by grouping light field views into different layers. In this way, adaptation to the (i) capturing device, (ii) display, (iii) network conditions, (iv) processing power, and (v) storage capacity is enhanced. For example, 2D displays might require only the central view, while 3D/stereo displays need only the central view and two of its side views. For light field displays, layers can be transmitted, decoded, and displayed one after another. PVS-based methods make all the views dependent on each other to highly exploit the redundancy among the views and increase the compression efficiency. However, to access an arbitrary view, e.g., the central view on a 2D display, all light field views should be encoded, transmitted, and decoded. This leads to both bandwidth and processing power wastage as well as decoding delay [35]. Monteiro et al. [36] divide the light field views into multiple viewport layers and encode the views in each layer by using the previously encoded/decoded views in the same layer or in prior layers as references.

B. Quality Scalability
Through quality scalability, adaptation to the network conditions is provided. In this way, light fields are encoded in two (or more) quality layers and the quality of light fields can be improved by transmitting enhancement layers when enough bandwidth or processing power is available. Some of the view synthesis approaches introduced in the previous section, e.g., [18], [21], encode the prediction residuals as a quality enhancement layer to improve the quality of the synthesized views.

C. Spatial Scalability
To address various devices and display resolutions, it is important to provide spatial scalability. In this regard, images are encoded at two (or more) spatial resolutions. The lower resolution is encoded as the base layer and used as a reference to encode the higher resolution(s), i.e., the enhancement layer(s).

D. Viewport Random Access
Navigation between various viewports is another important factor to be considered in light field encoding solutions. Since light field views in an inter-view prediction scheme are highly dependent on each other, navigation between different views may require a large number of views to be decoded, which incurs a high cost in decoding delay, bandwidth, and processing power. To avoid these problems, random access to the image views should be considered in light field coding [35], [37]. JPEG Pleno therefore defines various metrics within its light field coding common test conditions [38]. The random access metric (RA) is defined as:

RA = total amount of encoded bits required to access a view. (3)

The random access penalty metric (RA_p) is the maximum RA among all views:

RA_p = max over all views of RA. (4)

The relative random access penalty metric (RRA_p) is defined as:

RRA_p = RA_p / total amount of encoded bits to decode the full light field. (5)

In PVS-based light field coding solutions, RRA_p is equal to 1, which means that to access a view, the whole encoded light field should be transmitted and the whole bitstream should be decoded (to access, e.g., the last view). Some light field compression methods focus on improving random access to arbitrary views [36], [37], [39], [40], [41], [42], [43].
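A small sketch of how these metrics could be computed from a per-view bit count and a (transitively closed) dependency graph is given below; the toy numbers are illustrative only and not taken from the paper.

```python
def random_access_metrics(bits_per_view, dependencies, total_bits):
    """RA per view, RA_p and RRA_p as defined in (3)-(5).

    bits_per_view: dict view -> encoded bits for that view (including any residue)
    dependencies : dict view -> set of views that must be decoded before it
                   (assumed transitively closed in this sketch)
    total_bits   : total encoded bits of the full light field
    """
    ra = {v: sum(bits_per_view[w] for w in ({v} | dependencies.get(v, set())))
          for v in bits_per_view}
    ra_p = max(ra.values())        # random access penalty
    rra_p = ra_p / total_bits      # relative random access penalty
    return ra, ra_p, rra_p

# Toy example: an intra-coded central view and two views that depend on it.
bits = {"center": 1000, "left": 400, "right": 450}
deps = {"left": {"center"}, "right": {"center"}}
_, ra_p, rra_p = random_access_metrics(bits, deps, total_bits=sum(bits.values()))
print(ra_p, rra_p)   # 1450, ~0.78; a single PVS would instead give RRA_p = 1
```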

E. Uniform Quality Distribution
For a given number of encoded bits, the light field should exhibit similar quality at any view. It is undesirable to deliver light field views in a way that users face different quality levels when navigating between viewports. Fig. 4 illustrates the quality variation when a user navigates between the top-left and top-right views in case there is a significant difference between the quality of those image views.

IV. SCALABLE LIGHT FIELD CODING
To address the above-mentioned functionalities, a flexible light field compression method is proposed in this paper. To provide spatial scalability, a light field LF is spatially downscaled to a lower resolution (by a factor of 1/2 in each direction). Therefore, the light field views are provided in two spatial layers: (i) SL1 (low resolution) and (ii) SL2 (original resolution).
To support viewport scalability, both spatial layers are divided into multiple viewport layers, each containing a subset of views. The maximum number of viewport layers (n) is determined by the angular resolution of the light field. For example, a light field of 5 × 5 views is decomposed into four viewport layers (n = 4), a light field of 9 × 9 views into five viewport layers (n = 5), and a light field with a 17 × 17 angular resolution into six viewport layers (n = 6) for each spatial resolution. Fig. 3 shows the way a light field with an angular resolution of 5 × 5 views is structured into two spatial resolution layers, SL1 and SL2, and four viewport layers per spatial layer.
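The layer counts quoted above follow a dyadic pattern; the short helper below reproduces them under the assumption (ours, for illustration) that the angular resolution is of the form 2^k + 1, as in the examples. The assignment of individual views to layers follows the refinement shown in Fig. 3 and is not reproduced here.

```python
import math

def num_viewport_layers(angular_res: int) -> int:
    """Number of viewport layers n for an (angular_res x angular_res) light field,
    assuming angular_res = 2^k + 1 (5, 9, 17, ...)."""
    return int(math.log2(angular_res - 1)) + 2

for a in (5, 9, 17):
    print(a, num_viewport_layers(a))   # -> 4, 5 and 6 layers per spatial layer
```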

A. Compression of S L 1
We encode views in different viewport layers in different ways. (i) SL1 VL1: the central view is intra-coded, hence it can be accessed independently. (ii) SL1 VL2: views in the second layer are encoded independently of each other, but using inter-coding with the central view as a reference image. (iii) SL1 VLm (3 ≤ m ≤ n): the remaining views are encoded using a predictor based on a view interpolation method, as described in the following.
In video frame interpolation methods, the optical flow between two input frames, i.e., a per-pixel translational displacement, is estimated, and the intermediate frame is then synthesized guided by the estimated motion. DNNs are promising techniques to generate intermediate frames or, in our case, views. Many video frame interpolation methods using DNNs have been introduced [44], [45]. In this paper, we use RIFE [46] for view interpolation as it allows real-time flow estimation without any limit on the maximum number of interpolated views, which makes it flexible enough to support a varying number of viewport layers (cf. Section V-F). Fig. 6.a illustrates the use of RIFE to synthesize the top view in the third viewport layer (SL1 VL3) from two input images, i.e., the top-left and top-right views of the second viewport layer (SL1 VL2). The residual images between the ground-truth top view in SL1 VL3 and these three images are also shown in Fig. 6.b. It can be seen that the synthesized view is more correlated with the target view and, thus, can serve as a better reference for predicting the top view in the third viewport layer (SL1 VL3). We therefore use these three views, i.e., the top-left, top-right, and synthesized views, as reference images in the reference lists of the standard video codec VVC [11] to inter-code the top view in the third viewport layer (SL1 VL3). The rate-distortion (RD) performance (see Fig. 6.c) shows a significant improvement when the synthesized view is used as a reference.
When a synthesized view is used for prediction, it is added as a virtual reference frame to the Decoded Picture Buffer (DPB), which stores pictures for future use as references, and to the two Reference Picture Lists (RPLs), i.e., RPL0 and RPL1 [11]. To encode such an "intermediate" view, four references are thus needed for inter-coding: (i) the central view, (ii, iii) the two views used for interpolation, and (iv) the synthesized view. It should be noted that the synthesized view corresponds to a first quality level in all the viewport layers, a second quality level being obtained by transmitting a prediction residue.
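The reference handling for one intermediate view can be summarized as in the sketch below. Here rife_interpolate and vvc_intercode are hypothetical wrappers around RIFE and the VVC reference software (the actual integration modifies the DPB/RPLs inside the codec), so this is a schematic outline under those assumptions rather than the implementation.

```python
def encode_intermediate_view(target, left_ref, right_ref, central_ref,
                             rife_interpolate, vvc_intercode):
    """Encode one view of SL1 VLm (m >= 3) with a RIFE-synthesized virtual reference.

    rife_interpolate(a, b)      -> synthesized mid view (first quality level)
    vvc_intercode(target, refs) -> residue bitstream, with `refs` placed in the
                                   reference picture lists (hypothetical wrapper)
    """
    synthesized = rife_interpolate(left_ref, right_ref)
    references = [central_ref, left_ref, right_ref, synthesized]   # four references
    residue_bitstream = vvc_intercode(target, references)          # second quality level
    return synthesized, residue_bitstream
```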

B. Compression of S L 2
An upscaled view of SL1 can be used as an additional reference to inter-code its corresponding view in SL2. The views of the second spatial layer are encoded in a different way depending on the viewport layer to which they belong. (i) SL2 VL1: the central view of the second spatial layer is inter-coded using the upscaled central view in SL1 VL1 as the reference image. (ii) SL2 VL2: the views of the second viewport layer of the second spatial layer are encoded independently of each other but using inter-coding, taking (a) the central view in SL2 and (b) the upscaled version of the co-located view in SL1 VL2 as reference images. The views in SL2 VLm (3 ≤ m ≤ n) are synthesized similarly to the views in SL1 VLm (3 ≤ m ≤ n). That is, views in SL2 VL1 to SL2 VLm−1 are used to synthesize the views in SL2 VLm that are equidistant from them using RIFE.
To upscale images, DNN-based super-resolution methods have shown a significant gain over traditional methods. Some methods have been proposed specifically for light field super-resolution [47], [48], [49]. However, they typically use all or a set of low-resolution light field views for the super-resolution task, which impairs the random access functionality (cf. Section V-F). To avoid this problem, we use a conventional single image super-resolution method in this paper, i.e., DASR [50]. It should be noted that in SL2, for the first quality level, intermediate views can be either (i) synthesized using a view interpolation method or (ii) reconstructed by applying a super-resolution approach to the co-located view in SL1. To produce the second quality level, they are enhanced by adding the prediction residue to the above-mentioned reference images. Fig. 7 shows the encoding workflow for the top view in SL2 VL3. The co-located view in SL1, i.e., the top view located in SL1 VL3, is upscaled using DASR and added to the reference list. The central view in SL2, i.e., the view located in SL2 VL1, is also added to the reference list. Finally, the two views from which the top view is equidistant, i.e., the top-left and top-right views of the second viewport layer (SL2 VL2), are used as inputs to RIFE, and the output of RIFE, i.e., the synthesized view, is also added to the reference list. The top view is inter-coded and the prediction residue is added to the bitstream as the quality enhancement layer.
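Analogously to the SL1 case, the reference list for a view such as the top view in SL2 VL3 can be sketched as below; dasr_upscale, rife_interpolate and vvc_intercode are again hypothetical wrappers around DASR, RIFE and the VVC reference software, not actual APIs.

```python
def encode_sl2_intermediate_view(target_hi, colocated_lo, central_hi,
                                 left_hi, right_hi,
                                 dasr_upscale, rife_interpolate, vvc_intercode):
    """Encode one view of SL2 VLm (m >= 3) with super-resolved and synthesized references."""
    sr_ref = dasr_upscale(colocated_lo)               # super-resolved co-located SL1 view
    synth_ref = rife_interpolate(left_hi, right_hi)   # RIFE-synthesized view
    references = [central_hi, sr_ref, synth_ref]
    return vvc_intercode(target_hi, references)       # residue = quality enhancement layer
```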

C. Bit Allocation and Quality Distribution
The bit allocation to different layers and views is flexible, allowing users to allocate bits in a way that meets their needs. In this paper, we allocate bits to provide a uniform quality distribution among the views. To this end, we encode SL1 VL1 with a base QP and consider its quality as the reference quality (q_c1). We then empirically determine QPs for the views in SL1 VL2 such that a quality similar to the reference quality is achieved, i.e., |q_view − q_c1| ≤ ϵ, where ϵ is a threshold. When views in SL1 VLm (3 ≤ m ≤ n) are synthesized, the prediction residue is encoded only if the quality of the synthesized (i.e., interpolated) view does not meet the uniform quality distribution criterion, i.e., if |q_view − q_c1| > ϵ. The QP for the prediction residue is empirically determined to achieve |q_view − q_c1| ≤ ϵ. For SL2 VL1, we consider the super-resolved image of SL1 VL1 as the first quality level and encode the prediction residue with the base QP to provide quality scalability for SL2 VL1; its final quality is referred to as q_c2. For the views in SL2 VL2, we consider the super-resolved images of the co-located views in SL1 VL2 as the first quality level and encode the prediction residue if |q_view − q_c2| > ϵ, with the QP empirically determined to meet |q_view − q_c2| ≤ ϵ.
For views in SL2 VLm (3 ≤ m ≤ n), the reconstruction quality of the interpolated (synthesized) view and of the image upscaled by super-resolution are measured and their maximum (q_view) is taken for each view. q_view is then compared with the reconstruction quality of the central view (q_c2). If the difference between q_view and q_c2 exceeds the threshold (ϵ), i.e., |q_view − q_c2| > ϵ, the prediction residue is added to ensure |q_view − q_c2| ≤ ϵ and, consequently, a uniform quality distribution is guaranteed. Adding an enhancement layer is equivalent to providing quality scalability. Note that in this paper, the quality enhancement layer is not provided for views in SL1 VL1 and SL1 VL2, although it can be provided depending on the user's needs. Additionally, in this paper, for the views for which the uniform quality distribution criterion is already satisfied by the interpolated or super-resolved images, the quality enhancement layer is not provided. However, the flexibility of the proposed method allows a quality enhancement layer for all views according to the user's needs.
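The empirical QP selection described above can be viewed as a simple search loop. The sketch below illustrates it with a hypothetical encode_and_measure(qp) hook that returns the PSNR of the view reconstructed with that QP; the QP range follows VVC conventions and ϵ = 1 dB matches the value used later in the experiments.

```python
def find_qp_for_uniform_quality(encode_and_measure, base_qp, q_ref, eps=1.0,
                                qp_min=0, qp_max=51):
    """Adjust the residue QP until |q_view - q_ref| <= eps (qualities in dB PSNR).

    encode_and_measure(qp) -> PSNR of the view reconstructed with QP `qp`
    (hypothetical hook into the encoder).
    """
    qp = base_qp
    q_view = encode_and_measure(qp)
    while abs(q_view - q_ref) > eps and qp_min < qp < qp_max:
        qp += -1 if q_view < q_ref else 1   # lower QP -> higher quality, and vice versa
        q_view = encode_and_measure(qp)
    return qp, q_view
```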

V. EXPERIMENTAL RESULTS
In this section, we first introduce the test conditions used in this paper. We then provide experimental results for the compression efficiency and the other functionalities discussed in the previous sections.

A. Test Condition
To evaluate the performance of the proposed method, we have selected six light fields from the Stanford dataset [39] and three light fields from the JPEG Pleno dataset [38] to cover light fields from large to narrow parallaxes. The characteristics of these images are summarized in Table I. The Stanford light field views were converted to 8-bit YUV 4:2:0 format and the JPEG Pleno light field views were converted to 10-bit YUV 4:4:4 format to match the coding conditions of the baseline codecs selected for comparison. The VTM encoder version 10.2 was used as the standard encoding software for VVC. We encode light fields at four quality levels. The base QPs used to encode each light field test image at the four quality levels are also summarized in Table I. QP offsets for each viewport layer are selected in a way that the quality of the encoded views remains similar across views. In this paper, ϵ was set to 1 dB, which means that the quality difference between each view and the central view at each quality level is less than 1 dB. For view interpolation, RIFE, and for super-resolution, DASR were used without fine-tuning.

B. Compression Efficiency and Quality Distribution
To evaluate the compression efficiency of the proposed method, we consider three points in its workflow: (i) SL1: the compression efficiency of the first spatial resolution after applying bicubic upsampling, (ii) SL1 + SR: the compression efficiency of the first spatial resolution after applying super-resolution, and (iii) SL2: the compression efficiency of the overall proposed method. We compare the encoding efficiency at these three points with the JPEG Pleno anchor (x265) [38], MV-HEVC [51], and the Shearlet Transform Based Prediction (STBP) approach [21] for the Stanford light fields, and with the JPEG Pleno Verification Model 2.1 (4D Prediction) (VM2.1) [38] for the JPEG Pleno light fields. Note that different baseline codecs have been selected for each dataset since they perform differently on each of them. VM2.1 performs well on the JPEG Pleno dataset, which mainly includes light fields with a narrow disparity. However, it does not perform well for large-disparity light fields such as those of the Stanford dataset. On the other hand, STBP, which is based on MV-HEVC, provides limited compression efficiency for narrow-disparity light fields [21]. Fig. 8 shows the RD curves using the mean PSNR of the Y component of all the views as the objective metric.
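For reference, a minimal sketch of the objective metric used for the RD curves (mean PSNR of the luma component over all views) is given below; the peak value is assumed to be 255 for 8-bit content and would be 1023 for 10-bit content.

```python
import numpy as np

def mean_y_psnr(decoded_views, original_views, peak=255.0):
    """Mean PSNR of the Y component over all views; each view is an (H, W[, C])
    array whose first channel (or single plane) is luma."""
    psnrs = []
    for dec, org in zip(decoded_views, original_views):
        y_dec = dec[..., 0] if dec.ndim == 3 else dec
        y_org = org[..., 0] if org.ndim == 3 else org
        mse = np.mean((y_dec.astype(np.float64) - y_org.astype(np.float64)) ** 2)
        psnrs.append(10.0 * np.log10(peak ** 2 / mse))
    return float(np.mean(psnrs))
```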
For the Eucalyptus Flower light field, which contains a lot of fine geometry, the proposed method fails to outperform the state-of-the-art schemes. This might be due to the inefficiency of the video frame interpolation or super-resolution DNNs for these images, or to the lack of this type of content in their training datasets. For the other light fields, the proposed method (SL2) shows superior performance compared to its competitors, particularly at lower bitrates. This is more significant for light fields with simple geometry such as Jelly Beans. The superiority of SL1 + SR over SL1 shows the importance of super-resolution in improving the compression efficiency.
Note that the compression efficiency of SL1 and SL1 + SR is low for some light fields such as Sideboard and Tarot, while it is high for others such as Jelly Beans. We have calculated the spatial complexity (E) of each light field view using the Video Complexity Analyzer (VCA) [52] and computed their average value (E_mean). The E_mean values for all test light fields are shown in Fig. 9. It is observed that, as the spatial complexity increases, the compression efficiency is reduced.

C. Scalability
In this paper, to support spatial scalability, the light fields are compressed at two spatial resolutions. Therefore, the final bitstream consists of two parts: (i) b_SL1, the bits allocated to compress the lowest resolution, and (ii) b_SL2, the bits allocated to compress the highest resolution. The bits allocated to each spatial layer are further divided among multiple viewport layers (i.e., {b_VL1, . . . , b_VLn}) to support viewport scalability and uniform quality distribution. Finally, the bits allocated to each viewport layer are used to improve the quality of the viewports in that layer, in other words, to support quality scalability. Fig. 10 shows the bits allocated to the spatial and viewport layers for the encoded Bunny light field. It is observed that, as the number of encoding bits increases, a larger portion of the whole bitstream is allocated to SL2. It is also observed that, at higher numbers of encoding bits, a smaller portion of each spatial layer is allocated to its first viewport layer, i.e., SL1 VL1 and SL2 VL1, which are differentiated from the other viewport layers in Fig. 10. To subjectively analyze the scalability of the proposed method, Fig. 11 shows the Eucalyptus Flower light field when the whole light field is encoded at 0.04 bits per pixel (bpp). The central view of SL1, before and after applying super-resolution, as well as the central view of SL2, are compared with the original central view. It is shown how applying super-resolution and adding an enhancement layer improve the quality of the decoded central view.

D. Random Access
Random access to an arbitrary view decreases the memory footprint and bandwidth requirements. The bitrates required to access views and their maximum (RA_p) are shown in Fig. 12. RRA_p is also shown in Fig. 12 as embedded plots. It can be seen that at higher bitrates, where random access is crucial, only a small portion of the whole bitstream is required to access an arbitrary view. Note that the flexibility of the proposed method allows addressing the trade-off between compression efficiency and random access. For instance, if the synthesized views are removed from the reference list and only the super-resolved images are used as virtual reference images to encode views in SL2, random access is improved while the compression efficiency is reduced. It should be mentioned that the baseline codecs, i.e., the JPEG Pleno anchor (x265), MV-HEVC, STBP, and VM2.1 (4D Prediction), show low random access performance since they are highly dependent on the inter-view prediction between the different views. The JPEG Pleno anchor (x265) converts all views into a single PVS and encodes them sequentially; thus, it does not provide random access to views. Similarly, in STBP, the prediction residuals of all views are converted to a PVS and compressed with a video encoder, which makes all views dependent on each other and significantly impairs the random access performance. VM2.1 (4D Prediction), which is based on WaSP, is also highly dependent on the number of reference views that are warped and merged using one optimal least-squares merger. Fig. 13 compares the RRA_p performance of the proposed method with that of MV-HEVC for the Bunny light field. It is shown that the proposed method achieves a better random access performance than MV-HEVC. The superiority is more significant at higher bitrates, where random access is more crucial.

E. Error Resiliency
Compressed data is always vulnerable to channel errors and bandwidth constraints. However, our proposed method can synthesize all views even with a small portion of the whole bitstream, i.e., b_SL1VL1 and b_SL1VL2. When the corner views are available in the first spatial layer, all other views can be synthesized and super-resolved to generate the whole set of image views, albeit at a lower quality. For example, as shown in Fig. 10, at bpp3, with only b_SL1VL1 + b_SL1VL2 = 1.7% + 2.3% = 4% of the whole bitstream, all other views can be synthesized. To show how much quality improvement can be achieved by additionally downloading each layer (and losing the subsequent layers), we plot quality vs. downloaded bits for the Bunny light field in Fig. 14. It can be seen that the proposed method is resilient to channel errors and can retrieve the image views even when a significant portion of the bitstream is lost.
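The recovery behaviour described above can be illustrated with a one-row toy example: given only the outermost views of a row (from SL1 VL2), the missing positions are filled by recursive midpoint interpolation and the result would then be super-resolved; here, numbers stand in for views and averaging stands in for RIFE, so this is purely a sketch of the refinement order.

```python
def fill_row_by_midpoint_interpolation(available, interpolate):
    """Fill positions 0..4 of one view row given positions 0 and 4, by dyadic
    midpoint interpolation (the same refinement applies along columns/diagonals)."""
    views = dict(available)
    for a, b in [(0, 4), (0, 2), (2, 4)]:        # coarse-to-fine refinement order
        mid = (a + b) // 2
        if mid not in views:
            views[mid] = interpolate(views[a], views[b])
    return views

# Stand-in usage: numbers instead of views, averaging instead of RIFE.
row = fill_row_by_midpoint_interpolation({0: 0.0, 4: 4.0}, lambda x, y: (x + y) / 2)
print(sorted(row.items()))   # all five positions recovered, at synthesis quality only
```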

F. Flexibility
Due to its high flexibility, the proposed approach can address different trade-offs, including compression efficiency, random access, uniform quality distribution, and error resiliency, with adaptive bit allocation to the different layers. In this paper, the bits were empirically allocated among the different layers in a way that yields image views with similar qualities. For example, Fig. 15a shows the standard deviation of the PSNR of the views of the Bunny light field for the SL1, SL1 + SR, and SL2 points. The scatter plot of the absolute difference between the PSNR of each view and the PSNR of the central view (for SL2) is also shown in Fig. 15b to validate the uniform quality distribution. It can be seen that the criterion |q_view − q_c| < 1 dB is met for all views. However, the bits can also be allocated in a way that yields a higher compression efficiency or random access performance.
RIFE is capable of interpolating intermediate viewports without any limit on the maximum number of interpolated views at the same inference time. In this paper, it is used to interpolate only one intermediate view, i.e., the view equidistant from its two inputs. However, by interpolating more than one intermediate view, each viewport layer may contain more views, and the number of viewport layers and the inference time for interpolation may be reduced. This allows flexibility in the number of viewport layers (n). For example, we encode only the first and second viewport layers of the first spatial layer (i.e., SL1 VL1 and SL1 VL2), and then use the corner views in SL1 VL2 as inputs to RIFE to interpolate all intermediate views between the corner views without adding any quality enhancement layer. In this way, we need to run RIFE at most three times to access any arbitrary view in SL1, and additionally DASR once to access any arbitrary view in SL2, without any need to encode/decode an enhancement layer (see Fig. 16a). Note that in this structure, the number of viewport layers (i.e., four viewport layers for SL1 and one viewport layer for SL2) is independent of the light field's angular resolution. The compression efficiency of the above-mentioned structure (SL1 + SR (2)) for the Bunny light field is shown in Fig. 16b. It is seen that this structure shows lower performance in terms of compression efficiency; however, it results in fast access to any arbitrary view. Additionally, since the quality enhancement layer is not applied to the views, the average standard deviation of the PSNR of the views over all quality levels increases from 0.24 for SL2 to 1.04 for SL1 + SR (2).
Light field super-resolution (LFSR) approaches [53], [54], [55] may result in views with higher reconstruction quality compared to single image super-resolution (SISR) approaches [50], [56], since they better preserve angular consistency. However, it should be noted that LFSR approaches usually utilize all, or a large set of, low-resolution views as inputs to super-resolve all of them, which harms the random access performance and viewport scalability. To evaluate the impact of super-resolution on the performance of the proposed method, we take the 5 × 5 central views of the Tarot light field and encode SL1 with the proposed method (Section IV-A). For super-resolution, we select EDSR [56] as an SISR approach and LFT [55] as an LFSR approach from BasicLFSR, an open-source light field super-resolution toolbox. EDSR and LFT have been selected since they have both been trained with the same light fields, allowing a fair comparison. We super-resolve the views using EDSR and encode the views in SL2 using the proposed method (Section IV-B). When EDSR is replaced with LFT, all views in SL1 are used as inputs to LFT and the output of LFT is the full set of super-resolved views. Therefore, SL2 comprises only one viewport layer with the LFT approach. We show the compression efficiency and random access performance of both methods in Fig. 17. It is seen that utilizing the LFSR approach for super-resolution improves the compression efficiency at the cost of reduced random access performance.

G. Future Directions
RIFE has been trained for video frame interpolation; training it specifically on light field views may improve its efficiency for view synthesis. Both RIFE and DASR have been trained on uncompressed images, but we deploy them to interpolate and super-resolve compressed images. Fine-tuning these DNNs with compressed images may improve their accuracy.

VI. CONCLUSION
In this paper, we proposed a novel light field compression method based on video interpolation and image super-resolution techniques. Light field views are compressed in two spatial layers to support spatial scalability. The views at each spatial layer are divided into several viewport layers. The previously encoded views are used to synthesize their equidistant intermediate views, and the synthesized views are then used as virtual reference frames to inter-code the intermediate views and improve their quality. A super-resolution method is applied to the compressed views at the lowest resolution, and the super-resolved views are used as additional reference images to inter-code their corresponding views at the highest resolution. In addition to the spatial, viewport, and quality scalabilities, the proposed structure improves the flexibility of light field compression, provides random access to the viewports, and increases error resiliency.

ACKNOWLEDGMENT
The financial support of the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development, and the Christian Doppler Research Association are gratefully acknowledged. Christian Doppler Laboratory ATHENA: https://athena.itec. aau.at/.