Joint Light Field Spatial and Angular Super-Resolution From a Single Image

Synthesizing a densely sampled light field from a single image is highly beneficial for many applications. Moreover, jointly solving the angular and spatial super-resolution problems introduces new possibilities in light field imaging. The conventional method relies on physically based rendering and a secondary network to solve the angular super-resolution problem. In addition, a pixel-based loss limits the network's capability to infer scene geometry globally. In this paper, we show that both super-resolution problems can be solved jointly from a single image by a single end-to-end deep neural network that does not require a physically based approach. Two novel loss functions based on known light field domain knowledge are proposed to enable the network to consider the relation between sub-aperture images. Experimental results show that the proposed model successfully synthesizes a dense, high resolution light field and outperforms the state-of-the-art method in both quantitative and qualitative criteria. The method generalizes to various scenes rather than focusing on a particular subject. The synthesized light field can be used as if it had been captured by a light field camera, for example in depth estimation and refocusing.


I. INTRODUCTION
Light fields have attracted considerable interest from the computer vision and graphics communities due to their capability to capture multiple light rays from various directions. Recent studies utilized densely sampled light fields captured by off-the-shelf light field cameras. Many applications, such as depth estimation [15], [16], [25], [26], refocusing [13], and 3D reconstruction [8], [21], exploit the rich information of a light field image. At present, a light field image is captured using either plenoptic (light field) cameras [1] or camera arrays [24]. However, the absence of the only available consumer light field camera, i.e. Lytro, has created a gap between consumers and light field experiences. In addition, light field cameras suffer from a trade-off between spatial and angular resolution due to sensor limitations. Most light field cameras favor angular resolution, which results in a light field image with low spatial resolution. We focus on filling this gap so that end users can experience the advantages of light field imaging and beyond. The idea is to jointly synthesize a light field through angular and spatial super-resolution (SR) from only a single image, a type of input that is abundantly available in the real world. To do so, the geometry information must be inferred from the single image and used to synthesize the surrounding angular images. Synthesizing a 4D light field is a severely ill-posed problem, but the impact of such work is significant. For example, promoting a single image into a densely sampled light field can elevate existing AR/VR immersion experiences.
The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei.
In this context, light field synthesis has attracted considerable attention in recent years [9], [12], [17], [19], [22], [28], [30]. Previous approaches can be grouped into two categories based on the input type, i.e. single or multi-view inputs. The multi-view approach utilizes multiple images captured from specific viewpoints to infer geometric cues and uses them to synthesize the light field. However, only a few consumer cameras can simultaneously capture multi-view images, which makes the approach impractical for general use.
The existing single-input method utilizes two-stage neural networks and a depth image-based rendering (DIBR) technique to synthesize the light field [17]. Reference [17] is inspired by previous view synthesis techniques using geometry estimation [4]-[6], [29]. However, [17] is highly dependent on the estimated depth quality and on physically based depth warping to synthesize angular images. The depth-based approach also faces difficulty in reconstructing occluded and homogeneous regions. Typical learning-based works rely on directly minimizing the error between the synthesized view and the ground truth image. This leads the network to rely on pixel intensity cues, so it cannot easily generalize to data with different and complex distributions.
In this paper, we develop a novel joint deep neural network for spatial and angular light field SR that utilizes the appearance flow to synthesize novel views. To the best of our knowledge, this is the only work that tackles the joint super-resolution problem using only a single image. We also introduce a spatio-angular consistent loss function based on known light field domain knowledge that helps the network learn robustly and efficiently. Figure 1 shows the result and application of the proposed method. The key contributions of this paper are summarized as follows.
• An end-to-end encoder-decoder model for joint spatial and angular light field SR.
• A novel spatio-angular consistent (light field based) loss that imposes geometric reasoning on the network.
• Capability to generalize to arbitrary scenes rather than to a specific class of objects, in contrast to the previous approach.

II. RELATED WORKS
Learning-based light field synthesis (angular SR) has been investigated by many researchers in the past few years [9], [12], [17], [19], [22], [28], [30]. On the basis of the number of input images, related approaches are categorized into multi-view image-based [9], [22], [28], [30] and single image-based [17] methods. Multi-view image-based light field synthesis is impractical due to its specific input pattern. Meanwhile, single image-based light field synthesis is severely ill-posed, although it is the most practical approach for light field synthesis.

A. SPARSE-INPUT LIGHT FIELD SYNTHESIS
Wanner and Goldluecke [23] introduced a light field SR framework adopting the estimated depth information and variational optimization to fill missing views from a sparse light field image. Phase-based light field synthesis from a micro-baseline stereo pair was proposed by Zhang et al. [31]. Those studies were rooted in traditional approaches that use complex processing and various optimization techniques. Meanwhile, learning-based view synthesis achieves better results by using an end-to-end training strategy. Zhou et al. [32] proposed a new geometric representation called appearance flow to synthesize an image with a novel view. However, the proposed representation does not generalize well to complex scenes with multiple objects and non-homogeneous backgrounds. Zhou et al. [33] presented a novel geometric representation called multi-plane images (MPI) to synthesize a horizontal light field from a narrow baseline camera. Srinivasan et al. [19] extended the MPI extrapolation boundaries based on Fourier domain analysis. Recent work by Mildenhall et al. [12] exhibited state-of-the-art view synthesis performance with multiple MPIs and a blending technique. However, these approaches require camera pose information and/or multiple inputs to synthesize the novel view, which are not commonly available in the real world. In addition, MPI requires significant computing resources (an increased layer count) to get an accurate result. Kalantari et al. [9] introduced the first learning-based light field synthesis solution. They utilized four corner images to synthesize a 4D light field using a depth estimation, warping, and color refinement approach. The inputs to the depth estimation network were mean and variance images, as inspired by the depth estimation work of [20]. Wu et al. [28] utilized an epipolar plane image (EPI) obtained from sparse input images and synthesized an up-sampled EPI through a specially designed blur kernel. The framework was then extended further into several applications [27]. Wang et al. [22] employed a pseudo 4DCNN represented as 2D strided convolution and 3DCNN, where the light field image was synthesized in a step-by-step manner. Yeung et al. [30] applied a high dimensional convolutional kernel to infer spatial and angular information from sparse input images. In summary, sparse-input light field synthesis focuses on synthesizing in-between views and can be regarded as solving an interpolation problem. The specific input sampling pattern also hinders its practical usage.

B. SINGLE-INPUT LIGHT FIELD SYNTHESIS
Srinivasan et al. [17] introduced the first solution to light field synthesis from a single image. They proposed a single image-based depth estimation to obtain the approximate geometry of a scene. Then, the estimated depth was utilized to synthesize a novel view using the DIBR approach. However, their method is constrained to simple scenes and is highly dependent on pixel intensity.
In this paper, we focus on solving the problem of single-image light field angular synthesis. Contrary to [17], we propose to use an alternative geometric representation (appearance flow) and present a light field based loss function to avoid reliance on bright pixel color. Furthermore, we go beyond the problem scope of [17] and solve the spatial resolution problem simultaneously.

III. PROPOSED METHOD

A. JOINT LIGHT FIELD SUPER-RESOLUTION
This paper aims to synthesize a high resolution 4D light field L(x, u) given a single image that serves as the central sub-aperture image (SAI) L(x, 0). We follow the two-plane parametrization of the light field L(x, u) introduced by [10], where x and u are the coordinates in the spatial and angular planes, respectively. In general, light field synthesis is formulated as a DIBR problem in which each novel view is obtained by displacing the pixels of the central view by a disparity d(u) in the x direction. The disparity depends on the depth information of the central image and the novel angular coordinate u. We address the light field synthesis problem with an approximation function f(·), represented as a deep convolutional neural network, that takes the central SAI as input. Function f solves a highly ill-posed problem. To solve the problem jointly, we design an encoder and multi-decoder framework. We use the shared encoded feature to solve both angular and spatial light field SR. The joint framework is decomposed into two decoder branches. The top (angular) branch solves the angular SR problem, while the bottom (spatial) branch solves the spatial SR problem using additional information estimated by the first decoder branch. The overall network structure is shown in Figure 2.
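Written out, the DIBR relation and the learned mapping described in this subsection take a compact form. The expressions below are a reconstruction from the surrounding definitions; the sign of the disparity term and the hat notation for the network output are assumptions.

```latex
L(\mathbf{x}, \mathbf{u}) = L\big(\mathbf{x} + d(\mathbf{u}),\, \mathbf{0}\big),
\qquad
\hat{L}(\mathbf{x}, \mathbf{u}) = f\big(L(\mathbf{x}, \mathbf{0})\big)
```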

1) ANGULAR DECODER
The angular decoder branch estimates appearance flows to extrapolate the central view to each SAI in the 4D light field. Appearance flow [32] represents 2D coordinate vectors specifying where pixels are mapped in the reconstructed novel views. Appearance flow introduces little blur, preserves color identities, and removes the dependency on a physically based approach to synthesize novel views.
Considering that the ground truth light field appearance flow is difficult and expensive to obtain, the proposed network is designed to estimate appearance flow in an unsupervised manner. The network learns to estimate appearance flows through supervision of the synthesized light field image (the warped novel views). The angular decoder is decomposed into three sub-problems, i.e. appearance flow estimation for each viewpoint u, image shifting with respect to the central view, and novel view extrapolation, each handled by its own operator.
F estimates the appearance flow L_f for each novel angular view. S performs angular shifting to the position ∇(u) of the novel views; this image shifting serves as a bias initialization for the network. W is the warping function that warps the shifted image L_s(x, u) using its corresponding appearance flow L_f(x, u). Image warping is performed using a bilinear sampler module [7] to produce the light field image L̂(x, u).
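The warping step W can be illustrated with a small NumPy sketch of bilinear sampling. It is not the network's bilinear sampler module [7]; it assumes a single-channel view, treats the flow as per-pixel (x, y) offsets in pixels added to the identity grid, and clamps samples at the image border. The function names are hypothetical.

```python
import numpy as np

def bilinear_sample(image, xs, ys):
    """Sample a 2D image at float coordinates (xs, ys) with bilinear interpolation.

    Coordinates outside the image are clamped to the border (an assumption).
    """
    h, w = image.shape
    xs = np.clip(xs, 0.0, w - 1.0)
    ys = np.clip(ys, 0.0, h - 1.0)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = xs - x0  # horizontal interpolation weight
    wy = ys - y0  # vertical interpolation weight
    top = (1.0 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1.0 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1.0 - wy) * top + wy * bottom

def warp_with_flow(shifted_view, flow):
    """Warp a shifted view L_s with a flow L_f of shape (H, W, 2).

    flow[..., 0] and flow[..., 1] are assumed to be x and y offsets in pixels.
    """
    h, w = shifted_view.shape
    ys, xs = np.meshgrid(np.arange(h, dtype=float),
                         np.arange(w, dtype=float), indexing="ij")
    return bilinear_sample(shifted_view, xs + flow[..., 0], ys + flow[..., 1])
```

With a zero flow the warp returns its input unchanged, which is why the shifted image can act as an initial guess that the estimated flow only has to correct.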

2) SPATIAL DECODER
Within the joint framework, angular SR is solved first, followed by spatial SR. The spatial decoder branch estimates a residual light field image to enhance the initially estimated 4D light field (angular SR). The final HR light field is the result of adding multi-stage residual images. The purpose of multi-stage residual estimation is to recover missing high-frequency information and to post-process the initial light field. Both are further discussed in Section III-C.
The spatial decoder R outputs two residual light field images, namely the residual flow L_rf(x, u) and the residual intensity L_ri(x, u). The output of the spatial decoder layer is concatenated with the appearance flow and the initial light field image, as indicated by the red dashed lines in Figure 2. Residual flow and residual intensity are named after the prior information that is concatenated to produce each residual light field. The final high resolution light field is denoted by L̂_H(x, u). The proposed objective function is minimized with respect to θ, the deep neural network parameters. The problem formulation and objective function enable the network to estimate an appearance flow for each SAI at u in an unsupervised manner through the supervision of synthesized pixels. The common pixel-wise loss is not utilized in the proposed method because it does not enforce a geometry constraint on the network. Instead, we rely on known light field domain knowledge and design two geometrically constrained losses, i.e. the global L_g(θ) and local L_l(θ) light field losses. Both loss functions are useful in preserving the spatio-angular consistency between light field SAIs.
In addition, we apply a regularization loss ψ_tv(θ) to the estimated appearance flow to enforce angular consistency. L_sr(θ) denotes the spatial SR loss. Each loss is weighted with a corresponding λ to balance its contribution.
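Combining the four terms named in this subsection, the objective can be sketched as a weighted sum. The additive form is an assumption inferred from the statement that each loss is weighted by a corresponding λ.

```latex
\min_{\theta}\;
\lambda_{g}\, L_{g}(\theta)
+ \lambda_{l}\, L_{l}(\theta)
+ \lambda_{tv}\, \psi_{tv}(\theta)
+ \lambda_{sr}\, L_{sr}(\theta)
```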

B. ANGULAR SUPER-RESOLUTION
The angular SR decoder relies on estimating the appearance flow to warp the shifted image into novel SAIs. To estimate an accurate and dense appearance flow, we designed the encoder to extract all important representations in the image. Sharing the encoded feature with both decoders encourages the encoder to extract important features from the input image. The angular decoder consists of convolution layers with skip connections followed by leaky ReLU [11]. Additional details about the network structure are available in the supplementary material.
In particular, the proposed flow estimation network produces coordinate vectors (x, y) used to sample from L_s(x, u) with the bilinear sampler to synthesize the novel view L̂(x, u). The appearance flow is visualized in Figure 3. The estimated appearance flow is smooth, edge-aware, and consistent in the inverse direction. In addition, it robustly estimates the geometry of a scene with similar intensities. We learn F to estimate the appearance flow by minimizing the loss between the extrapolated novel views L̂(x, u) and the ground truth light field L(x, u). The angular decoder estimates the appearance flow for all novel SAIs in a single shot. Image shifting is designed to guide the network by providing an initial bias. This technique is inspired by the work of Xie et al. [29], in which the input image is shifted to guide the network in synthesizing the corresponding stereo image.
In the shifting operation, η is the constant angular shift value in the horizontal and vertical directions, and u is the angular distance between the novel and central views. Considering the redundancy in light field SAIs, we can partially imitate how pixels shift to each angular position and utilize this to provide a better initialization. The η value is predetermined based on the disparity between SAIs in the target light field.
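As an illustration of how a constant η turns angular distance into a per-view pixel shift, the sketch below tabulates η·u for every view of an 8 × 8 light field. The index of the central view, the sign convention, and the function name are assumptions made for the example; η = 0.8 is the value reported in the training details.

```python
import numpy as np

def angular_shift_offsets(angular_res=8, center=(4, 4), eta=0.8):
    """Return an (angular_res, angular_res, 2) array of (dx, dy) shifts in pixels.

    Each entry is eta * u, where u is the angular distance of that view from
    the central view. The central index is an assumption for illustration.
    """
    rows = np.arange(angular_res) - center[0]
    cols = np.arange(angular_res) - center[1]
    u_y, u_x = np.meshgrid(rows, cols, indexing="ij")
    return eta * np.stack([u_x, u_y], axis=-1).astype(float)
```

The central view receives a zero shift, and views farther from the center receive proportionally larger shifts.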

C. SPATIAL SUPER-RESOLUTION
In spatial SR, we estimate residual images to recover the high frequency information that is lost during the initial light field upsampling. Moreover, we try to refine the initial light field image. The spatial decoder is designed to solve this problem by adding residual images to the initial light field image. In particular, we incorporate information estimated by the angular decoder, such as the appearance flow and the initial light field image, into the spatial decoder. We concatenate this information before the final convolution layer in the decoder. The estimated residual images confirm the effectiveness of the proposed multi-stage residual framework, as shown in Figure 4. The first stage estimates the high frequency information, such as edge regions and textured (detailed) regions, while the second stage focuses on a sparser estimation of occluded and erroneous regions. The second stage can be seen as the post-processing or refinement part of the framework.
VOLUME 8, 2020
The spatial decoder estimates the residual images for every SAI in the light field image in a single shot. During inference, given a single input, the network estimates the high resolution light field in a single run. In the training stage, the spatial decoder is frozen for several iterations before both decoders are trained together in an end-to-end fashion. These details are discussed further in Section IV.

D. LIGHT FIELD LOSS
Although the proposed framework enables us to solve the super-resolution problem jointly, a good objective function for learning the relation between angular views is mandatory. The L1 or L2 loss, commonly used by conventional approaches, cannot provide proper geometric reasoning to the network. It encourages the network to look at the dominant pixel color individually instead of understanding the whole scene. We propose to use light field angular information instead of individual pixel information. Specifically, we use the mean and variance of a light field image, which exist for any light field image.

1) GLOBAL LIGHT FIELD LOSS
We propose a novel 4D light field loss that compares M(L) and V(L), the mean and variance of the light field taken over its SAIs, between the synthesized and ground truth light fields, where s denotes the index of an SAI in the light field. Computing the light field mean is equivalent to obtaining the refocused image at zero disparity. The refocused image correlates with the depth of the light field image. We do not need labelled light field depth because it is embedded in the light field itself through the light field mean. Therefore, the synthesized light field depth can be explicitly evaluated in an efficient and unsupervised manner. Meanwhile, the variance captures the difference between SAIs and helps the network learn occlusion and edge regions. This is known from the light field depth estimation works of [20], [25], [26], which utilize the mean and variance to compute defocus and correspondence responses, respectively. Kalantari et al. [9] also employed the mean and variance images as the input to their depth estimation network. In this paper, we show that mean and variance images can be used as a loss function to help the network learn the light field geometry efficiently.
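A minimal NumPy sketch of a mean-and-variance loss of this kind follows. The use of an L1 distance and the equal weighting of the two terms are assumptions, since the text only states that the mean M(L) and variance V(L) over the SAIs are compared; the function name is hypothetical.

```python
import numpy as np

def global_light_field_loss(pred, target):
    """Compare the mean and variance images of two light fields.

    pred, target: arrays of shape (S, H, W), one SAI per index s.
    The L1 distance and the unweighted sum are assumptions.
    """
    mean_term = np.abs(pred.mean(axis=0) - target.mean(axis=0)).mean()
    var_term = np.abs(pred.var(axis=0) - target.var(axis=0)).mean()
    return mean_term + var_term
```

Because both statistics are taken across views, an error in a single SAI changes the mean and variance images at the affected pixels, so the penalty depends on how the views relate to each other rather than on each view in isolation.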

2) LOCAL LIGHT FIELD LOSS
Although the 4D global loss captures geometric information globally, the network should also learn the local geometric relation between SAIs in a refined manner. The idea is to help the network explicitly understand the angular relation in the horizontal and vertical directions. We compute the mean and variance for the SAIs in each row and column of the 4D light field. The losses at each light field row and column are accumulated to obtain the final local loss. Here m, n denote the angular resolution of the light field image, and the mean and variance are computed for s ∈ {1, ..., U}. Without loss of generality, we assume the light field angular resolution is equal in the horizontal and vertical directions. Figure 5 visualizes the losses by computing the light field mean and variance.
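The row-and-column accumulation can be sketched in NumPy as below. As with the global term, the L1 distance and the unweighted accumulation are assumptions, and the function name is hypothetical.

```python
import numpy as np

def local_light_field_loss(pred, target):
    """Accumulate mean/variance errors over every angular row and column.

    pred, target: arrays of shape (U, V, H, W).
    Reducing over axis 1 gives statistics within each angular row;
    reducing over axis 0 gives statistics within each angular column.
    """
    total = 0.0
    for axis in (0, 1):
        mean_err = np.abs(pred.mean(axis=axis) - target.mean(axis=axis))
        var_err = np.abs(pred.var(axis=axis) - target.var(axis=axis))
        # Average over pixels, then sum over the remaining rows or columns.
        total += mean_err.mean(axis=(1, 2)).sum()
        total += var_err.mean(axis=(1, 2)).sum()
    return float(total)
```

Swapping two views within one angular row leaves the statistics over all views unchanged but alters the column statistics, so a term of this kind can penalize errors that a purely global term would miss.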

3) LOSS REGULARIZATION
An inconsistent appearance flow might appear and cause artifacts between SAIs. This problem is expected because the appearance flow is estimated from a single image in an unsupervised manner. An alternative approach is to apply a conventional flow estimation method to the ground truth light field and compare the result with the estimated flow. However, this approach is tedious and increases framework complexity. Thus, we present a strategy to remedy inconsistent and incorrect flow by incorporating a regularization term into the loss function.
To suppress artifacts from inconsistent appearance flow, total variation regularization is applied. Total variation is commonly used for noise removal in image processing. The idea is to smooth inconsistent (noisy) appearance flow while keeping important edge information. L2 minimization is performed on the gradient of the estimated appearance flow.
Moreover, the denoising capability of total variation is also carried over to the network, as shown in Figure 6.
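The regularizer can be illustrated with forward finite differences. This is a sketch assuming a flow tensor of shape (H, W, 2) and a mean of squared differences; the exact normalization used in the paper is not stated, and the function name is hypothetical.

```python
import numpy as np

def total_variation_l2(flow):
    """Mean squared finite-difference gradient of an appearance flow.

    flow: array of shape (H, W, 2) holding x and y flow components.
    """
    dx = flow[:, 1:, :] - flow[:, :-1, :]  # horizontal differences
    dy = flow[1:, :, :] - flow[:-1, :, :]  # vertical differences
    return float((dx ** 2).mean() + (dy ** 2).mean())
```

A constant flow incurs no penalty, while isolated noisy flow vectors produce large local differences and are penalized quadratically.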

4) SPATIAL SUPER-RESOLUTION LOSS
Spatial SR loss is straightforward. We minimize the error between the ground truth high resolution light field image L_H(x, u) and the upsampled initial light field with the estimated residual images added. We upsample the initial light field image using bilinear interpolation.
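One way to write the reconstruction and its loss, consistent with the description in this subsection, is given below. Here up(·) stands for bilinear upsampling; the choice of norm is not specified in the text and is left generic, and the additive combination of the two residuals is an assumption.

```latex
\hat{L}_{H}(\mathbf{x}, \mathbf{u})
= \mathrm{up}\big(\hat{L}(\mathbf{x}, \mathbf{u})\big)
+ L_{rf}(\mathbf{x}, \mathbf{u})
+ L_{ri}(\mathbf{x}, \mathbf{u}),
\qquad
L_{sr}(\theta) = \big\lVert \hat{L}_{H}(\mathbf{x}, \mathbf{u}) - L_{H}(\mathbf{x}, \mathbf{u}) \big\rVert
```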

IV. EXPERIMENTAL RESULTS
We evaluate the performance of the proposed framework both qualitatively and quantitatively. We then compare its performance with the state-of-the-art method of light field angular synthesis from a single image [17]. We utilize two datasets, i.e. Flower and Toys [17]. For the evaluation, all networks are re-trained with the corresponding dataset. For evaluating Srinivasan et al. [17], we use the authors' original code. Readers are encouraged to view the supplementary video for a better understanding and more detailed results.

A. EXPERIMENT ON AVAILABLE LIGHT FIELD DATASET
Light field images in the Flower dataset mostly have a clear distinction between the background and foreground. The flowers have a dominant color and are located in the foreground. Unlike the Flower dataset, the Toys dataset has more variance in object shape, color, and location in the image. We utilize the Toys dataset to verify the performance of the proposed model in more general scenes. The Toys dataset is significantly more complex and closer to real world images.
We evaluate both the initial (low resolution) light field and the high resolution light field in terms of PSNR and SSIM, as shown in Table 1. In the quantitative evaluation, we exclude the input image, which is directly used as the center of the synthesized light field. For each evaluation, the network is trained on the corresponding dataset. Note that we solve both super-resolution problems simultaneously, so our network needs to solve a more difficult problem than [17]. Nevertheless, the proposed method outperforms the state-of-the-art method [17] on both low resolution (LR) and high resolution (HR) light field images, which indicates its capability for handling more complex and general scenes.
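The exclusion of the input view from the metric can be sketched as follows. Averaging per-view PSNR values, the index of the central view, and the function names are assumptions made for illustration; the text only states that the input image is excluded.

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def light_field_psnr(pred, target, center=(4, 4), peak=1.0):
    """Average PSNR over all SAIs except the input (central) view.

    pred, target: arrays of shape (U, V, H, W).
    """
    scores = []
    for i in range(pred.shape[0]):
        for j in range(pred.shape[1]):
            if (i, j) == tuple(center):
                continue  # the input image is copied, not synthesized
            scores.append(psnr(pred[i, j], target[i, j], peak))
    return float(np.mean(scores))
```

Leaving out the central view avoids inflating the score with a view that is copied from the input rather than synthesized.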
The SSIM drop from the LR to the HR light field image stems from the second residual addition, which causes slight blur in erroneous regions. The spatial decoder tries to smooth out the erroneous regions in the initial light field. Figure 7 shows a qualitative comparison with the state-of-the-art work [17]. While [17] achieves high PSNR, it suffers from unpleasant artifacts around edges and occluded regions. In addition, [17] cannot robustly handle scenes with multiple objects. Reference [17] assumes that pixels with similar intensities are located close to each other. It fails on general scenes due to its dependency on finding a single object with a dominant color, which leads to inaccurate depth estimation. As shown in the error map in Figure 7, pixels on the flower in the background have large errors. On the other hand, the proposed method successfully synthesizes a proper light field image owing to the proposed light field based loss function. The error map and image patches show that the proposed method handles occlusion better and produces fewer artifacts.
In general, the result of [17] on the Toys dataset shows an incorrect EPI slope. The slope direction is either reversed or flat, indicating that the shifting direction of the object is incorrect. Reference [17] has difficulty in determining or inferring object positions. In contrast, the proposed method successfully synthesizes a proper light field.

B. ABLATION STUDY
To reveal the discriminative power of the proposed light field synthesis network, we evaluate the performance of different loss functions, i.e. pixel-wise L1 loss, global loss, local loss, and local-global loss. Table 2 shows the quantitative evaluation of each loss function's effect on the network. The simple pixel-wise loss is outperformed by the proposed loss functions. Meanwhile, the local loss achieves satisfactory performance, and its combination with the global loss leads to the best result. Global and local losses complement each other by enforcing local and global consistency in synthesizing SAIs. Image shifting and regularization are shown to improve network performance, as indicated by the S and L_g + L_l rows, respectively.
Table 2. Quantitative evaluation of each loss function's effect on the network trained using the Flower dataset. Each row represents the framework trained using only the corresponding loss, except for S (the full framework trained without image shifting). L_g + L_l can also be read as the result without regularization.
Additionally, we show that the synthesized light field can be used for common light field applications, such as depth estimation and refocusing. We test our network on real single images captured using a smartphone or downloaded from the internet, as shown in Figure 8. This shows our network's capability to handle images outside the training set distribution. In addition, Figure 9 shows convincing depth estimation and synthetic refocusing. The estimated depth in both figures shows that the network can estimate both steep and smooth discontinuities. It confirms that the synthesized light field is geometrically correct and can be used in various applications of light field processing.
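Refocusing a synthesized light field can be illustrated with the classic shift-and-add scheme. This sketch is not the refocusing code used for Figure 9; it assumes integer disparities, wrap-around borders via `np.roll`, and a hypothetical central view index. With zero disparity it reduces to the light field mean.

```python
import numpy as np

def refocus(light_field, disparity=0, center=(4, 4)):
    """Shift-and-add refocusing of a light field of shape (U, V, H, W).

    disparity: integer shift in pixels per unit of angular distance.
    Each view is shifted in proportion to its distance from the central
    view and the shifted views are averaged.
    """
    n_rows, n_cols = light_field.shape[:2]
    acc = np.zeros(light_field.shape[2:], dtype=float)
    for i in range(n_rows):
        for j in range(n_cols):
            dy = disparity * (i - center[0])
            dx = disparity * (j - center[1])
            acc += np.roll(light_field[i, j], shift=(dy, dx), axis=(0, 1))
    return acc / (n_rows * n_cols)
```

Objects whose per-view displacement matches the chosen disparity align across views and appear sharp, while the rest of the scene is averaged into blur.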
To justify our multi-stage residual estimation design, we perform an extensive experiment to identify the best combination. Table 3 shows that two-stage residual estimation produces the best light field quantitatively. The order of residual addition does not cause any significant difference. We hypothesize that in the first stage the network recovers the information lost during upsampling (anti-aliasing). Thus, we observe strong gradients in the estimated residual, and we believe this information strongly relates to the geometry information (appearance flow). On the other hand, the second stage focuses on homogeneous regions and therefore relates to the pixel color. We also tried increasing the multi-stage residual to three stages, which did not yield any meaningful boost to the performance. We choose to perform angular SR first because it saves the memory needed to train the network. Upsampling the spatial resolution of the input image first would lead to large memory consumption in estimating the appearance flow. The process of reconstructing the light field is expensive during the training stage. Performing angular upsampling first also gives us extra information to help the spatial SR. In addition, the residual approach acts as a post-processing module while also solving the spatial SR problem.
Finally, we show the scalability of the proposed spatial SR module. Since there is no available light field dataset in HR, we use a technique similar to that of [3] to predict images in HR. To obtain the HR light field, we run the network twice sequentially. During the second inference, we only take the estimated residual to enhance the previously estimated light field. Since no HR ground truth is available, we show qualitatively in Figure 10 that both 4× and 2× spatial SR are performed successfully by the proposed framework.

V. PROPOSED NETWORK STRUCTURE
In this section, we provide details of the proposed network structure, which takes the luminance component of the input. The dimension is represented as spatial (width and height) and channel or filter. Each convolution has its bias initialized with 0 and its weight initialized using the He initializer [34]. The image shifting operation is performed using tf.translate. Mean and variance for computing the loss are obtained using tf.nn.moments. The order of the light field stack is column- or u-based. The angular and spatial network details are tabulated in Tables 4 and 5, respectively. Every convolution is followed by a Leaky ReLU activation function. Convolution with stride 1, convolution with stride 2, transpose convolution, and convolution without activation function are denoted as 'Conv', 'Conv*', 'Conv_t', and 'Conv_o', respectively. The concatenate operation on the channel axis is denoted as 'Concat'.

VI. TRAINING DETAILS
The proposed network is implemented using TensorFlow [2]. We crop the light field into random patches of 128 × 128 × 8 × 8 for training. The full light field resolution is used for inference. The proposed network is trained in an end-to-end fashion for 280,000 iterations with a batch size of 1. The angular decoder is trained alone for 100,000 iterations while the spatial decoder is frozen. Hyperparameters η, λ_l, λ_g, λ_tv, and λ_sr are set to 0.8, 1.0, 10, 1e-4, and 10.0, respectively. We augment the input with a random gamma ranging from 0.4 to 1.0. The training input is also randomly cropped around the center region. The input augmentation is designed to avoid overfitting. We use the Adam optimizer [35] with the default parameters. Training takes approximately 1 day on an NVIDIA GTX 1080Ti GPU with 11GB of memory and an Intel i7-7700 @3.60 GHz CPU with 16GB of memory.
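The patch cropping and gamma augmentation can be sketched in NumPy as follows. Applying the same gamma to every view of the patch, the value range [0, 1], uniform sampling of the crop position, and the function name are assumptions; the original training pipeline is implemented in TensorFlow and may differ.

```python
import numpy as np

def random_patch_and_gamma(light_field, patch=128, gamma_range=(0.4, 1.0), rng=None):
    """Crop a random spatial patch and apply a random gamma.

    light_field: array of shape (U, V, H, W) with values in [0, 1].
    Returns the augmented patch of shape (U, V, patch, patch) and the gamma used.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, _, h, w = light_field.shape
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    crop = light_field[:, :, top:top + patch, left:left + patch]
    gamma = float(rng.uniform(*gamma_range))
    # One gamma for all views keeps the SAIs photometrically consistent.
    return crop ** gamma, gamma
```

For an 8 × 8 light field, the default patch size yields the 128 × 128 × 8 × 8 training patches mentioned above.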

VII. CONCLUSION
In this paper, we proposed an end-to-end deep model for joint angular and spatial light field SR from a single image. Novel light field based loss functions were introduced to preserve spatio-angular consistency and to remove the dependency on pixel intensity. A joint end-to-end framework was presented to solve both problems simultaneously. The experimental results showed that the proposed network outperformed the state-of-the-art algorithm qualitatively and quantitatively. In addition, the proposed method can be generalized to various scenes, such as toys and internet images. Future work includes wide-baseline light field synthesis and a higher degree of spatial SR.