Light Field Reconstruction With a Hybrid Sparse Regularization-Pseudo 4DCNN Framework

The densely-sampled light field (LF) is highly desirable in many applications, such as 3D reconstruction, post-capture refocusing, and virtual/augmented reality. However, capturing it is costly and challenging because of its high-dimensional nature. Existing view synthesis methods employ depth information to reconstruct a densely-sampled LF from an undersampled LF with a large disparity range, but they perform poorly on non-Lambertian scenes. In this article, a novel coarse-to-fine LF reconstruction method is proposed. We develop a hybrid reconstruction framework that fuses model-based sparse regularization with deep learning. Specifically, sparse regularization with a directional filter bank is used to handle the large-baseline problem and yields a coarse densely-sampled LF. The part that cannot be recovered by classical model-based methods due to limited angular resolution is then estimated by a pseudo 4D convolutional neural network, which refines the intermediate LF. Extensive experiments demonstrate the superiority of our method on both real-world and synthetic LF images when compared with state-of-the-art methods.


I. INTRODUCTION
As a promising representation of 3D visual scenes, the light field (LF) records the light rays traveling through every point in every direction in free space. The LF image contains almost all appearance information of the scene, which facilitates a wide range of applications, such as image refocusing [1], depth estimation [2], [3], 3D reconstruction [4], and virtual/augmented reality [5], to name just a few. In these applications, a densely-sampled LF, with high resolution in the angular domain, is required to provide sufficient information and avoid ghosting effects.
However, in many practical cases it is challenging to sample a real-world scene with a sufficient number of cameras to directly obtain a densely-sampled LF. For example, typical camera array systems [6] are bulky and expensive and thus unsuitable for most commercial uses, while the computer-controlled gantry [7] is time-consuming and limited to static scenes. Commercial hand-held LF cameras such as Lytro [8] and Raytrix [9] make it convenient to acquire LF images. Unfortunately, due to restricted sensor resolution, they must make a trade-off between spatial and angular resolution. Therefore, the required number of views has to be generated from the given sparse set of images using view synthesis technology. Modern view synthesis methods can be divided into two categories based on their prior assumptions. The first category estimates the scene depth and then synthesizes novel views using various methods, such as soft view rendering [10] or learning-based prediction [11]. However, these depth-dependent view synthesis methods inevitably rely on the accuracy of the depth information and tend to produce artifacts where depth estimation is inaccurate, such as in occluded regions and on non-Lambertian surfaces. The second category uses specific priors, such as sparsity in a transform domain, for dense reconstruction [12]-[15]. These methods try to resolve the wavefront set of the underlying plenoptic function with specific transforms, such as shearlets [16].
However, due to the limited angular resolution, the corresponding inverse problem is severely ill-posed, which makes it very difficult to robustly recover specific parts of the target, such as occluded and non-Lambertian regions.
With the recent development of deep learning solutions for visual modeling, several learning-based methods [18]-[21] have been proposed. However, most of them either need to be retrained for different interpolation rates, or suffer from aliasing when the input LF is extremely undersampled, i.e., when the samples have large baselines. Moreover, neural networks are currently mostly used as black boxes, trained either for direct inversion or as a replacement for iteration steps in optimization algorithms. For the inverse problem of limited-angle computed tomography, Bubba et al. [17] proposed a hybrid reconstruction framework that fuses model-based sparse regularization with data-driven deep learning. They utilize sparse regularization with shearlets to split the data into a visible part, recovered by classical model-based methods, and an invisible part, inferred in the shearlet domain by a neural network.
In this article, inspired by [17], a novel coarse-to-fine method is proposed to reconstruct densely-sampled LFs from undersampled LFs. The proposed method first utilizes sparse regularization to robustly recover the part of the wavefront set that is recoverable by classical model-based methods, referred to as the recoverable part, yielding an intermediate densely-sampled LF. In this process, instead of shearlets, we use our previously presented directional filter bank [22], so that a decent result can be achieved without precise disparity estimation; our method is therefore robust to depth inaccuracy. The remaining part, which cannot be recovered by classical model-based methods within a finite number of iterations and is referred to as the unrecoverable part, is then predicted with an encoder-like view refinement network. This ensures a maximal amount of reliability and interpretability of our results, since deep learning is only used to infer the unrecoverable part. A brief recoverability analysis is given in Section III-B. Additionally, in order to make full use of the 4D LF data while avoiding a high computational burden, a 3D convolutional neural network (CNN) is applied to stacked epipolar plane images (EPIs) in a row or column pattern. Moreover, since the sparse regularization module can handle different interpolation rates and the network only learns the unrecoverable part from the output of the former, our system can perform different interpolation rates with one model. Experimental results demonstrate the superior performance of our method over other state-of-the-art densely-sampled LF reconstruction methods on both real-world and synthetic light field datasets with various disparity ranges.
The rest of this article is organized as follows. Section II comprehensively reviews existing methods for view synthesis and densely-sampled LF reconstruction. Section III presents the proposed approach. In Section IV, extensive experiments are carried out to evaluate the performance of the proposed approach. Finally, Section V concludes this article.

II. RELATED WORK
Many studies have focused on densely-sampled LF reconstruction from an undersampled LF with a small set of views. These methods can be modeled as a reconstruction of the plenoptic function from limited samples. We divide the related work into two categories: methods that use depth estimation, and methods that do not.

A. DEPTH-DEPENDENT VIEW SYNTHESIS
These approaches typically first estimate the scene depth and then warp the given images to the novel view based on the estimated depth, where the depth information works as correspondence map for view reprojection. A number of depth estimation methods have been developed specifically for stereo images [23], and for multiview images as well [3], [4], [24]- [28].
In recent years, many learning-based methods have been proposed for depth image-based view synthesis. Flynn et al. [11] proposed a deep learning method to synthesize novel views using a stereo pair or sequence of images with wide baselines. Srinivasan et al. [29] proposed to synthesize a 4D LF image from a 2D RGB image based on estimated 4D ray depth. Kalantari et al. [19] proposed to synthesize novel sub-aperture images with two sequential networks that perform depth estimation and color prediction successively. Zhou et al. [30] and Mildenhall et al. [31] trained a network that infers alpha and multiplane images. The novel views are synthesized using homography and alpha composition. Jin et al. [32] proposed a coarse-to-fine framework that accepts irregular input with large baseline. However, all these approaches are based on the Lambertian assumption without explicitly addressing the non-Lambertian challenge.

B. DEPTH-INDEPENDENT LF RECONSTRUCTION
Densely-sampled LF reconstruction can be considered as a continuous reconstruction (interpolation) of the underlying plenoptic function from its given incomplete measurements (input views). The required bounds for sampling the LF of a scene have been derived in [33]. For a light field in which the disparity between neighboring views is less than one pixel, novel views can be produced using linear interpolation [33].
For undersampled LFs, direct interpolation causes ghosting effects in the rendered views. To mitigate this effect, some studies have investigated LF reconstruction with specific priors. Levin and Durand [34] exploit dimensionality-gap priors to synthesize novel views from a set of images sampled in a circular pattern. Shi et al. [13] sampled only the boundary or diagonal viewpoints to recover the full light field using sparsity analysis in the Fourier domain. However, Fourier and other isotropic transforms have been found less efficient for representing images; the sought transforms should be local, directional, and multiresolution [35]. Among the candidates, shearlets have been shown to be effective in reconstructing a densely-sampled EPI from an undersampled EPI with a large disparity range. Vagharshakyan et al. [14], [15] proposed an approach using the sparse representation of EPIs in the shearlet transform domain. Gao et al. further introduced optical flow [36], [37] and video frame interpolation methods [38] to improve the reconstruction results of shearlet-based methods. However, the shearlet transform requires a precise estimate of the disparity range of the undersampled LF in order to design a shearlet system with appropriate scales and to pre-shear the undersampled EPIs. In addition, due to the limited angular resolution, it is difficult for classical model-based methods to recover certain parts of the scene, such as occluded and non-Lambertian regions.
Recently, some learning-based approaches have also been proposed for depth-free reconstruction. Yeung et al. [39] applied a coarse-to-fine model using an encoder-like view refinement network for a larger receptive field. Wu et al. [20] proposed a ''blur-restoration-deblur'' framework as learning-based angular detail restoration on 2D EPIs. Wang et al. [40] introduced an end-to-end learning-based framework using 3D light field volumes, running over 10× faster than Wu et al.'s framework [20]. Based on the observation that an EPI shows a clear structure when sheared with the disparity value, Wu et al. [21] proposed to fuse a set of sheared EPIs for LF reconstruction. However, since each EPI is a 2D slice of the 4D LF, the spatial and angular information accessible to these EPI-based models is severely restricted.

III. THE PROPOSED APPROACH

A. LF REPRESENTATION AND PROBLEM STATEMENT
The 7D continuous plenoptic function [41] describes, from a geometric optics perspective, the set of light rays traveling in every direction through every point in 3D space. In practice, the plenoptic function is simplified to its 4D version, which records light rays by their intersections with two parallel planes, namely the two-plane parameterization. Consider a pinhole camera, with focal plane (u, v) and focal length f, moving along the second plane (s, t): an oriented light ray defined in this system first intersects the image plane at coordinate (u, v), corresponding to the spatial axes, and then intersects the second plane at coordinate (s, t), corresponding to the angular axes; it is thus denoted by L(u, v, s, t).
In the two-plane system, we can consider the (s, t) plane as a set of cameras with their focal planes on the (u, v) plane, where each camera collects the light rays leaving the (u, v) plane and arriving at a point on the (s, t) plane. Thus the 4D light field can be represented as a 2D array of images, such as the one shown in Fig. 1. Each image recorded by a camera is called a sub-aperture image. Let L(u, v, s, t) ∈ R^(W×H×N×M) denote a densely-sampled LF containing N × M sub-aperture images of spatial dimension W × H, sampled on the angular plane with a regular 2D grid of size N × M.
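Under this parameterization, a densely-sampled LF is simply a 4D array, and a sub-aperture image is the 2D spatial slice at a fixed angular coordinate. A minimal NumPy sketch (toy sizes and random data, purely illustrative):

```python
import numpy as np

# Toy densely-sampled LF with W x H x N x M = 64 x 48 x 7 x 7
# (spatial axes u, v; angular axes s, t); random values stand in for image data.
W, H, N, M = 64, 48, 7, 7
L = np.random.rand(W, H, N, M)

# The sub-aperture image recorded by the camera at angular position (s, t)
# is the 2D spatial slice over all (u, v).
s, t = 3, 3
sub_aperture = L[:, :, s, t]
assert sub_aperture.shape == (W, H)
```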
By gathering the light field samples with a fixed spatial coordinate v and an angular coordinate t (or u and s), one can produce a slice E_{v*,t*}(u, s) (or E_{u*,s*}(v, t)), resulting in a map called an EPI. The EPI encodes the relative motion between the camera and object points: points at different depths appear as lines with different slopes in the EPI. Compared with regular photo images, an EPI has a very well-defined structure. Any visible scene point appears in one of the EPIs as a line whose slope reflects the depth of the point, and the measured intensity along the line reflects the intensity of the light emanating from that scene point. Under the Lambertian reflectance model (any point in the scene emanates light in different directions with the same intensity), a point on the surface of an observed object forms a line with constant intensity in the EPI. In the Fourier domain, the frequency support of a layer with depth Z is confined to a line with a slope of Z/f. Therefore, the spectrum of a Lambertian scene between Z_min and Z_max is constrained within a wedge bounded by two lines passing through the origin with slopes Z_min/f and Z_max/f. Since the EPI is a 2D slice of the 4D LF with a more definitive structure, the reconstruction of the densely-sampled LF from the undersampled LF can be considered as an inpainting problem on the EPI. Without loss of generality, we assume that the undersampled EPI is a uniformly downsampled version of the densely-sampled one:

y = Ax + η,    (1)

where x is the densely-sampled EPI, A is a measuring matrix, and η models measurement noise. The measurements y form an incomplete EPI in which only the rows from the available views are present, while all other values are 0.
In order to reconstruct the densely-sampled EPI from the undersampled one (see Fig. 2), one needs to solve the inverse problem described in (1). Due to the limited angular resolution, not all features of the measured object x are captured under A, and the resulting inverse problem is severely ill-posed.
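The EPI slicing and the uniform angular downsampling of (1) can be sketched as follows; the specific sampling pattern (keeping every third view) is an illustrative assumption, not the paper's configuration:

```python
import numpy as np

W, H, N, M = 64, 48, 7, 7
L = np.random.rand(W, H, N, M)          # toy densely-sampled LF L(u, v, s, t)

# Horizontal EPI E_{v*,t*}(u, s): fix v = v* and t = t*, keep all (u, s).
v_star, t_star = H // 2, M // 2
E = L[:, v_star, :, t_star]             # shape (W, N): one pixel row per view
assert E.shape == (W, N)

# Uniform angular downsampling (here: keep every 3rd of the N views).
# This mimics y = A x + eta with a binary measuring mask and no noise:
# EPI rows from missing views are zero, as in the incomplete EPI y.
mask = np.zeros((W, N))
mask[:, ::3] = 1.0
y = E * mask
assert np.all(y[:, 1] == 0) and np.allclose(y[:, 0], E[:, 0])
```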

B. OVERVIEW OF THE PROPOSED METHOD
There exists an abundance of inversion strategies for natural images, among which shearlet-based methods [14], [15] have proven effective. Because of its multiresolution property, the shearlet transform is effective for reconstructing densely-sampled EPIs from undersampled EPIs with large baselines. However, for scenes containing occlusions or non-Lambertian surfaces, the inverse problem is severely ill-posed. Due to the limited angular resolution, some of the coefficients cannot be robustly recovered by shearlet-based methods, since not enough samples are contained in the measured data. For example, as shown in Fig. 2(a), some lines are partially occluded in the densely-sampled EPI. In a task with just 2 × 2 input views, only the top and bottom pixels are sampled in the undersampled EPI, and the presence of occlusion makes the endpoints of some lines take different values, see Fig. 2(b). Lines bounded by two endpoints with similar values (boxed in green, to list only a few) can be robustly recovered by shearlet-based methods and are referred to as the recoverable part, while lines bounded by two endpoints with different values (boxed in red, to list only a few) cannot, and are referred to as the unrecoverable part. In addition, lines in non-Lambertian regions are not exactly straight (see Fig. 2(a), boxed by yellow dashed lines, to list only a few), making part of the corresponding shearlet coefficients unrecoverable. Precisely this part is sought to be recovered by inference in the shearlet domain by means of a trained neural network. But due to the high-dimensional nature of the LF, it is challenging to apply learning methods in the shearlet domain. Note that the above recoverability analysis is only a qualitative insight inspired by [17], without rigorous proof; its effectiveness is demonstrated in Section IV through extensive experiments.
Therefore, we choose to first reconstruct the recoverable part using sparse regularization with our previously proposed directional filter banks [22], and then learn a network to estimate the unrecoverable part in the spatial domain.
The prediction of the missing part can be solved by residual learning. Such an inference of unrecoverable information from knowledge of its recoverable counterpart is feasible, since EPIs under the same scene geometry have a similar texture structure. Moreover, state-of-the-art LF reconstruction networks are more effective than shearlet-based methods in the case of small disparity ranges. Aware of this unique characteristic, as well as the great success of deep learning, we propose a learning-based approach that exploits the EPI texture structure for densely-sampled LF reconstruction. It can be considered as a refinement of the coarse result reconstructed by sparse regularization from the undersampled LF, which makes our framework work in a coarse-to-fine manner. Fig. 3 shows the flowchart of the proposed framework, which comprises three steps. The first step of our algorithm consists in solving a sparse regularization problem; promoting sparsity in the transform domain allows us to robustly recover the recoverable part. The next step generates an estimate of the unrecoverable part with a trained deep neural network. The architecture, referred to as the pseudo 4DCNN, is an encoder-like network that performs inpainting on stacked EPIs. Having an estimate of the unrecoverable part at hand, the final step consists in fusing both parts in the spatial domain. In the following, the details of the proposed approach are presented step by step.

C. RECOVERING THE RECOVERABLE PART VIA SPARSE REGULARIZATION
The shearlet transform is extremely effective in reconstructing a densely-sampled EPI from an undersampled EPI with a large disparity range. Since we process EPIs of scenes placed between the focal plane and infinity, whose spectrum lies in the wedge covering the range [3π/4, π) in the frequency plane, only the shear operation with a positive sign is considered. For a more comprehensive frequency analysis of both the LF and the EPI, we refer the interested reader to [14], [42].
In such elaborately-designed shearlet systems, pre-shearing by the minimum disparity is required in order to correctly process the undersampled EPI, so a precise depth range must be estimated first.
In our process, we utilize our previously proposed directional filter bank [22] in place of shearlets. Wedge-shaped filters with a cosine roll-off transition band are used, where ω_{m1} and ω_{m2} determine the angular range that the corresponding frame element covers in the frequency plane, L is the filter size, and ω_T is the width of the transition band. The wedge-shaped filters can be placed anywhere, even in an angular range across the diagonal, as shown in Fig. 4. In contrast, classical shearlets are restricted to two cone-like regions with a fixed shear operation. Since our filter bank has a more flexible partitioning, it can perform better frequency tiling than classical shearlets in the context of EPI representation. The analysis filter ψ̂_{j,ω} at scale j and direction range ω is built from ĝ_j, the Fourier transform of the highpass filter g_j = h_J ∗ (g_{j+1} ↑ 2) at scale j, where h_j = h_J ∗ (h_{j+1} ↑ 2) is the corresponding lowpass filter, 0 ≤ j < J, and ↑ 2 denotes upsampling by 2.
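The exact filter definition from [22] is not reproduced in the text, so the following NumPy sketch only illustrates the general idea of a wedge-shaped frequency mask with a raised-cosine (cosine roll-off) transition band; the function and its parameterization are hypothetical:

```python
import numpy as np

def wedge_mask(size, w1, w2, w_T):
    """Illustrative wedge-shaped mask in the 2D DFT plane covering the
    angular range [w1, w2] (radians), with a raised-cosine transition of
    width w_T outside each edge.  Not the exact filter of the paper."""
    f = np.fft.fftfreq(size)                     # normalized frequencies
    fu, fs = np.meshgrid(f, f, indexing="ij")
    theta = np.arctan2(fs, fu)                   # angle of each frequency bin
    mask = np.zeros((size, size))
    inside = (theta >= w1) & (theta <= w2)
    mask[inside] = 1.0
    # cosine roll-off just outside the wedge: 1 at the edge, 0 at distance w_T
    for sign, edge in ((-1, w1), (1, w2)):
        d = sign * (theta - edge)
        band = (d > 0) & (d < w_T) & ~inside
        mask[band] = 0.5 * (1 + np.cos(np.pi * d[band] / w_T))
    return mask

# Wedge covering [3*pi/4, pi), the range used for EPIs in this paper.
m = wedge_mask(64, 3 * np.pi / 4, np.pi, 0.1)
assert m.shape == (64, 64)
assert m.min() >= 0.0 and m.max() == 1.0
```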
In addition, h_J and g_J are the lowpass and highpass filters of the 2D two-channel nonsubsampled filter bank presented in [43], and the lowpass component of the signal is extracted with the lowpass analysis filter φ̂. As the frame elements are not orthogonal, one also needs the dual frame elements, which can be constructed via the pseudoinverse: the sum of the squared magnitude responses of all frame elements is computed first, and the synthesis filters φ̃ and γ_{j,ω} are then obtained in the Fourier domain by normalizing φ̂ and ψ̂_{j,ω} by this sum.
We only consider the wedge in the range [3π/4, π), which corresponds to the depth range from the focal plane to infinity, and obtain the resulting direct transform F for discrete values together with the corresponding inverse transform F^{-1}. The reconstruction of x given the sampling matrix A and the measurements y can then be cast as an inpainting problem constrained to have a solution that is sparse in the transform domain, i.e.,

x* = arg min_x ‖Fx‖_1  subject to  Ax = y.    (9)

It is typically solved via an iterative inpainting process with t iterations. We make use of the double relaxation method from [15]. Here, sum(·) returns the sum of all the elements of the input matrix, ''•'' denotes the elementwise (Hadamard) product, A is a logical measuring mask obtained by rearranging A, where ideally x_t • A = x_0, and typically x_0 = x. T_{λ_i}(·) is a hard thresholding operator applied to the transform-domain coefficients, and α is an acceleration parameter. Please note that the iteration (10) does not solve the optimization problem (9) if F is not an orthonormal basis; it is a reconstruction algorithm that is only inspired by (9) and often yields good reconstruction performance in practice. As will be demonstrated in Section IV, the proposed frame produces better results under inaccurate depth information, making our approach more robust.
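A minimal sketch of such an iterative hard-thresholding inpainting loop is given below. The 2D FFT stands in for the directional frame F (which is not reproduced here), the threshold decays linearly as in the paper, and the known samples are re-imposed after every step; the double-relaxation acceleration of [15] is omitted:

```python
import numpy as np

def inpaint_sparse(y, mask, n_iter=20, lam_max=2.0, lam_min=0.02):
    """Iterative hard-thresholding inpainting sketch.  The 2D FFT is an
    illustrative stand-in for the directional frame F; the threshold
    decreases linearly over iterations and measurements are re-imposed."""
    x = y.copy()
    for i in range(n_iter):
        lam = lam_max + (lam_min - lam_max) * i / max(n_iter - 1, 1)
        c = np.fft.fft2(x)
        # hard threshold: zero coefficients below a (decaying) fraction of max
        c[np.abs(c) < (lam / lam_max) * np.abs(c).max()] = 0
        x = np.real(np.fft.ifft2(c))
        x = mask * y + (1 - mask) * x        # keep the known EPI rows
    return x

# Toy usage: inpaint a smooth ramp from every-3rd-row measurements.
gt = np.outer(np.linspace(0, 1, 24), np.ones(24))
mask = np.zeros_like(gt)
mask[::3, :] = 1.0
rec = inpaint_sparse(gt * mask, mask)
assert rec.shape == gt.shape
assert np.allclose(rec[::3, :], gt[::3, :])  # measurements are preserved
```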

D. LEARNING THE UNRECOVERABLE PART WITH CNN
Given the intermediate LF reconstructed by the iterative process, we now infer the unrecoverable part by means of a CNN. The overall goal is to find a nonlinear mapping NN_θ from the recoverable part L_rec ≈ L_inter to the unrecoverable part L_unr, such that NN_θ(L_inter) ≈ L_unr.
Because of the high-dimensional nature of the LF, it is rather difficult to work on 4D data directly, and individually processing all EPIs cannot make full use of the input views. Therefore, we take advantage of the pseudo 4DCNN in [40] and build an encoder-like view refinement network on stacked EPIs. Taking the row volume as an example (shown in Fig. 5), both the angular and spatial correlations are exploited by the 3D CNN. Since the goal is to predict the unrecoverable part, which has a similar texture structure to the recoverable part, residual learning is used in this module. An overview of the pseudo 4DCNN is given in Fig. 3, where the unrecoverable part is recovered first horizontally, then vertically, and added to the recoverable part successively.
Concretely, to refine the intermediate LF L_inter(u, v, s, t) of size W × H × N × M, we first extract a 3D row volume for each column t*, t* = 1, 2, ..., M, and permute it to size H × W × N, where perm(·) denotes the permutation operation. The 3D row volumes are refined by the row network, modeled as F_row(·), and then composed into a new intermediate LF. Similarly, we extract a 3D column volume from L_inter(u, v, s, t) for each row s*, s* = 1, 2, ..., N, and permute it to size W × H × M. The 3D column volumes are refined by the column network, modeled as F_col(·), and then composed into the reconstructed LF. We use a fully convolutional encoder-decoder architecture for both proposed networks F_row(·) and F_col(·). As depicted in Fig. 6, the network for residual prediction consists of an encoder and a decoder. The encoder comprises three convolution layers. The first layer contains 64 filters of size 1 × 5 × 5 × 3, where each filter operates on 5 × 5 spatial regions across 3 adjacent EPIs. The second layer contains 16 filters of size 64 × 3 × 3 × 3. The last layer contains 16 filters of size 16 × 3 × 3 × 3. The decoder also comprises three convolution layers, exactly symmetric to those of the encoder. Each layer uses a stride of 1 followed by a rectified linear unit (ReLU), except for the last one. To avoid border effects, we appropriately pad the input and feature maps before every convolution operation so that the input and output have the same size.
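The row/column volume extraction above can be sketched in NumPy; the exact axis ordering produced by perm(·) is an assumption based on the stated output sizes (H × W × N and W × H × M), and the 3D convolutional networks themselves are represented only by placeholders:

```python
import numpy as np

W, H, N, M = 32, 24, 7, 7
L_inter = np.random.rand(W, H, N, M)       # intermediate LF L(u, v, s, t)

def row_volumes(L):
    """For each column t*, stack the 3D row volume L(u, v, s, t*) and
    permute it to (v, u, s), i.e. H x W x N -- the layout assumed to be
    fed to the row network F_row."""
    return [np.transpose(L[:, :, :, t], (1, 0, 2)) for t in range(L.shape[3])]

def col_volumes(L):
    """For each row s*, the 3D column volume L(u, v, s*, t) of size
    W x H x M, assumed to be fed to the column network F_col."""
    return [L[:, :, s, :] for s in range(L.shape[2])]

rows = row_volumes(L_inter)                # M volumes, each H x W x N
cols = col_volumes(L_inter)                # N volumes, each W x H x M
assert len(rows) == M and rows[0].shape == (H, W, N)
assert len(cols) == N and cols[0].shape == (W, H, M)
```

In the full pipeline each volume would pass through the 3D encoder-decoder, and the predicted residual would be added back before recomposing the LF.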

E. EDGE ENHANCEMENT LOSS FUNCTION
The proposed framework is designed to reconstruct the desired 4D LF by inpainting its EPIs. Since the EPIs have a clear texture structure, we propose an edge enhancement loss function consisting of three parts.
The first part supervises the output LF by computing a weighted average of the mean squared errors (MSE) between the output LF images and the ground-truth ones, where ω_{s*,t*} are the prior-sensitive weights from [40]. Specifically, views synthesized by the row network F_row(·) are inferred from the input views, while a portion of the views synthesized by the column network F_col(·) only receive prior information propagated from earlier synthesized views.
In the prior-sensitive scheme, more attention is paid to the errors of the later-synthesized views by assigning them larger weights. According to the order in which views are generated and the level of prior knowledge they receive, all synthesized views are divided into four groups, and their MSEs against the ground truth are summed with the corresponding weights. The weighting coefficient ω_{s*,t*} for the synthesized view at (s*, t*) is set according to the upscale factor p. As shown in Fig. 7, the EPI consists of homogeneous regions bounded by straight lines; these straight lines, which in the distributional setting correspond to the wavefront set of the EPI, cause the singularities that we are trying to reconstruct. To enhance the edge pixels of the EPIs, we penalize the mean absolute errors (MAE) of the gradients of both row and column EPIs, obtained with the Sobel operator and denoted as ℓ_row and ℓ_col, respectively, where ∇ is the gradient operator.

Finally, the output reconstructed LF image is optimized by minimizing the objective

ℒ = λ_1 ℓ_mse + λ_2 ℓ_row + λ_3 ℓ_col,

where λ_1, λ_2, and λ_3 weight the reconstruction accuracy and the edge enhancement, and are empirically set to 1, 9, and 9, respectively.

IV. EXPERIMENTAL RESULTS

A. DATASETS AND IMPLEMENTATION DETAILS
Both synthetic LF images from the 4D light field benchmark [44], [45] and real-world LF images captured with a Lytro Illum camera, provided by the Stanford Lytro LF Archive [46] and Kalantari et al. [19], were employed for training and testing. Specifically, 20 synthetic images and 100 real-world images were used for training. For testing, we used 9 synthetic images, including 4 LF images from the HCI [44] dataset and 5 LF images from the old HCI [45] dataset, and 3 datasets with a total of 70 real-world LF images captured with a Lytro Illum camera, namely 30scenes [19], Occlusions [46] and Reflective [46]. These datasets cover several important factors in evaluating light field reconstruction methods, including high-frequency texture, natural illumination and practical camera distortion, large baselines, occlusions, and reflective surfaces. We removed the border views and cropped the LF images to 7 × 7 views as ground truth, and then downsampled them to 2 × 2 or 3 × 3 views as input.
In the sparse regularization procedure, since no disparity information is known, the directional filter system is constructed with 4 scales, which means that the maximum disparity that can be processed is up to 16 pixels, i.e., J = log_2 d_max. We used 3, 5, 9, and 17 directions in the scales from coarser to finer, respectively. For each scale, the central frequencies of the direction subbands are uniformly placed from 3π/4 to π, resulting in slightly wider coverage than the classical shearlets. The thresholding level λ_i decreases linearly with the iteration number in the range [2, 0.02]. Note that these two values are for the case of a normalized undersampled EPI ε, i.e.,
where max(·) and min(·) return the maximum and minimum values of an input matrix, respectively. The x reconstructed from this normalized ε is then rescaled back to the original range of values. Besides, we set the maximum iteration number t = 20, and the acceleration parameter α to 10. Both the sparse regularization and the network are implemented with PyTorch. During training, small patches at the same position in each view are extracted; the spatial patch size is 48 × 48 and the stride is 20. Similar to other super-resolution methods, we only process the luminance (Y) channel in the YCbCr color space. Moreover, we adopted the ADAM optimizer with β_1 = 0.9 and β_2 = 0.999, and the batch size was set to 64. The filters of the 3D CNNs are initialized from a zero-mean Gaussian distribution with standard deviation 0.01, and all biases are initialized to zero. The learning rate is initially set to 10^-4 and then decreased by a factor of 0.5 every 10 epochs until the validation loss converges. Other hyperparameters not mentioned were left at PyTorch default values. Note that in the pseudo 4DCNN, the LF images in the range [0, 255] are automatically rescaled to [0, 1] by the PyTorch PILImage methods.
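The min-max normalization of the undersampled EPI and the subsequent rescaling back to the original range can be sketched as follows (the function names are illustrative):

```python
import numpy as np

def normalize(eps):
    """Min-max normalize an EPI to [0, 1]; also return the range so the
    reconstruction can be mapped back afterwards."""
    lo, hi = eps.min(), eps.max()
    return (eps - lo) / (hi - lo), lo, hi

def denormalize(x, lo, hi):
    """Rescale a reconstruction back to the original intensity range."""
    return x * (hi - lo) + lo

e = np.array([[10.0, 20.0], [30.0, 50.0]])
en, lo, hi = normalize(e)
assert en.min() == 0.0 and en.max() == 1.0
assert np.allclose(denormalize(en, lo, hi), e)   # round-trip is exact
```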

B. COMPARISONS WITH STATE-OF-THE-ART METHODS
We compared with 6 state-of-the-art learning-based methods specifically designed for densely-sampled LF reconstruction, i.e., Wu et al. [20], Yeung et al. [39], Wang et al. [40], Wu et al. [21], Kalantari et al. [19], and Jin et al. [32]. Among these methods, the last three are geometry-based. In the following, we conduct various experiments for comparison.

1) REAL-WORLD SCENES
For the experiments on real-world scenes, we constructed 7 × 7 LFs from both 3 × 3 and 2 × 2 sparse views (see Fig. 8). The performances of the comparative methods [19]-[21], [39] were obtained by running the source code released by the respective authors. For the approaches of Wang et al. [40] and Jin et al. [32], since their software has not been open-sourced, we directly take the results reported in [40] and [32]. Please note that since we used exactly the same datasets as Jin et al. [32] for both training and testing, taking their results directly for comparison is feasible. As for Wang et al. [40], because they used some private data, only the results on the public dataset 30scenes [19] are compared. For all these methods, we used the best configurations listed in the corresponding references. We used the average PSNR and SSIM over all views in each scene to quantitatively measure the quality of the reconstructed densely-sampled LFs. Due to limited space, we only report the average result over all data entries in each dataset.
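The per-scene metric described above (PSNR averaged over all reconstructed views) can be sketched as follows; SSIM is omitted here, and the function names are illustrative:

```python
import numpy as np

def psnr(gt, pred, peak=255.0):
    """PSNR of one reconstructed view against the ground-truth view."""
    mse = np.mean((gt.astype(float) - pred.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def mean_psnr(gt_views, pred_views, peak=255.0):
    """Average PSNR over all views of a scene, as used in the comparisons."""
    return np.mean([psnr(g, p, peak) for g, p in zip(gt_views, pred_views)])

# Sanity check: a constant error of 0.1 at peak 1.0 gives exactly 20 dB.
flat = np.zeros((4, 4))
assert abs(psnr(flat, flat + 0.1, peak=1.0) - 20.0) < 1e-9
```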
We first designed the experiment 3 × 3 to 7 × 7. For the evaluation of the proposed approach, we followed the protocols in [20] and used 30scenes, as well as two representative scenes, Reflective29 and Occlusion16, from the Stanford Lytro Light Field Archive [46]. We compared with Wu et al. [20], Yeung et al. [39] and Wang et al. [40]. Table 1 lists the corresponding results on the real-world datasets. Our proposed model performs better on all datasets than the three competing methods: with 3.53 dB, 0.16 dB, and 1.36 dB reconstruction advantage over Wu et al. [20], Yeung et al. [39] and Wang et al. [40] in terms of PSNR, respectively; and with 0.006, 0.004, and 0.002 reconstruction advantage over Wu et al. [20], Yeung et al. [39] and Wang et al. [40] in terms of SSIM, respectively.
For the task 2 × 2 to 7 × 7, we carried out comparisons with the methods of Kalantari et al. [19], Wu et al. [20], Wu et al. [21], Yeung et al. [39] and Jin et al. [32]. The method of Wang et al. [40] cannot be compared, since their approach requires 3 views in each angular dimension to provide enough information for the transposed convolution layers. Our testing data contain 30scenes, Occlusions and Reflective. This test set contains 113 LFs, which is sufficient to provide an objective evaluation of model performance.
We used the four corner images of a LF as input. Reconstruction quality was measured with PSNR and SSIM computed on the luma component and averaged over all reconstructed views. As shown in Table 2, our proposed approach obtains an average PSNR of 39.83 dB, which is 0.08 dB higher than that of Yeung et al. [39]. The latter achieves 39.75 dB and is the second best among all methods.
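As a concrete illustration of this evaluation protocol (a minimal sketch, not the authors' actual evaluation script), luma-component PSNR averaged over all reconstructed views could be computed as follows; the ITU-R BT.601 luma weights and the 8-bit peak value of 255 are our assumptions:

```python
import numpy as np

def rgb_to_luma(img):
    """ITU-R BT.601 luma from an RGB image with values in [0, 255]."""
    return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]

def psnr_luma(ref, rec, peak=255.0):
    """PSNR (dB) between the luma channels of a reference and a reconstructed view."""
    mse = np.mean((rgb_to_luma(ref) - rgb_to_luma(rec)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def average_psnr(ref_views, rec_views):
    """Average luma PSNR over all reconstructed views of one LF scene."""
    return float(np.mean([psnr_luma(r, s) for r, s in zip(ref_views, rec_views)]))
```

SSIM would be averaged over the views in the same way, e.g. with `skimage.metrics.structural_similarity` applied to the luma channels.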
TABLE 2. Quantitative comparisons (PSNR/SSIM) of the reconstruction quality of the proposed approach with the state-of-the-art methods under the task 2 × 2 to 7 × 7 on real-world scenes. The input undersampled LFs are sampled at the four corners during both training and test.

TABLE 3. Quantitative comparisons (PSNR/SSIM) of the reconstruction quality of the proposed approach with the state-of-the-art methods under the task 2 × 2 to 7 × 7 on synthetic scenes. The input undersampled LFs are sampled at the four corners during both training and test.

It can be observed that EPI-based methods, including Wu et al. [20] and Wu et al. [21], are inferior to the others. The former produces satisfactory reconstruction results with 3 × 3 inputs but degrades quickly for a sparser input (2 × 2). The latter performs relatively better on 30scenes but fails on Reflective, as depth information is utilized as guidance. The possible reason is that only 2 rows or columns of pixels are available during the reconstruction of each EPI, making it difficult to recover the intermediate linear structures, especially when the scenes are complicated. In contrast, Yeung et al. [39], Jin et al. [32] and our method exploit the similarity of neighboring pixels and make better use of the full 4D LF data, indicating that the pseudo 4D filters effectively explore the spatial and angular relations between input views. The difference is that we perform 3D CNNs on EPI volumes, while the other two apply 2D CNNs on the spatial and angular dimensions successively.
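The alternation between row and column EPI volumes described above can be sketched with plain tensor permutations; the axis ordering below is our own assumption for illustration, not the authors' exact implementation:

```python
import numpy as np

def row_epi_volumes(lf):
    """Group a 4D LF (u, v, x, y) into row EPI volumes.

    For each fixed vertical angular index u, stacking the horizontal
    EPIs (v-y slices) over all image rows x yields a 3D volume on
    which 3D convolutions can operate.
    """
    # (u, v, x, y) -> per-u volumes indexed as (u, x, v, y)
    return lf.transpose(0, 2, 1, 3)

def col_epi_volumes(lf):
    """Group the same LF into column EPI volumes (u-x slices stacked over y)."""
    # (u, v, x, y) -> per-v volumes indexed as (v, y, u, x)
    return lf.transpose(1, 3, 0, 2)

lf = np.random.rand(7, 7, 64, 64)   # densely-sampled 7x7 LF with 64x64 views
print(row_epi_volumes(lf).shape)    # (7, 64, 7, 64)
print(col_epi_volumes(lf).shape)    # (7, 64, 7, 64)
```

Applying a 3D CNN first to the row volumes and then to the column volumes touches all four LF dimensions while keeping the filters three-dimensional, which is the essence of a pseudo-4D design.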

2) SYNTHETIC SCENES
We used 9 synthetic LF images, including 4 from the HCI [44] dataset and 5 from the old HCI [45] dataset. The angular resolution of these synthetic images is 9 × 9, and they were cropped to 7 × 7. For the input, we adopted the four corner images (2 × 2) of a LF to evaluate the performance of the proposed framework under large disparity. Table 3 shows a quantitative evaluation of the proposed approach on the synthetic datasets compared with other methods. It can be observed that Jin et al. [32] achieves the best results on the synthetic datasets. This is because the HCI dataset contains LF images with large baselines, which test the ability to handle very sparse sampling, and geometry-based methods are effective at reconstructing LFs from large-baseline sampling. Nevertheless, our method is comparable to Jin et al. [32] on the synthetic datasets without explicitly using geometry information.
We also visually compared the reconstruction results of different algorithms, as shown in Fig. 9. It can be observed that Wu et al. [20] and Wu et al. [21] fail to recover delicate structures, such as the leaves and the textures on the wall, while Kalantari et al. [19] and Yeung et al. [39] struggle with large disparities. In contrast, our approach produces accurate estimations that are closer to the ground truth. Note that since the source code of Jin et al. [32] has not been released yet, some figures are extracted from [32].
The running time (in seconds) of different methods for reconstructing a densely-sampled LF is listed in Table 4. All methods were tested on a desktop with an Intel i9-9900K @ 3.6 GHz CPU, 32 GB RAM and an Nvidia GeForce RTX 2080 Ti GPU, except for Jin et al. [32]. Note that we directly take the results from [32], since their software has not been open-sourced yet. They used a desktop with the same RAM size and GPU as ours, but with an Intel i7-8700 @ 3.6 GHz CPU, so their method could run slightly faster on our platform. The reason why our method is relatively slow is that the sparse regularization is time-consuming. This is the cost of reliability and interpretability, as deep learning is only used to estimate the unrecoverable part.

C. ABLATION STUDY
To better illustrate the advantage of the proposed method, we conducted ablation experiments on 30scenes with respect to the two components of our framework, i.e., the sparse regularization module and the pseudo 4DCNN module.

1) THE EFFECTIVENESS OF THE DIRECTIONAL FILTER BANK
In the proposed method, the sparse regularization is performed by promoting sparsity of the coefficients obtained with our previously proposed directional filter bank. Alternatively, the shearlet transform [14], [15] has been used for such tasks. To validate the advantages of our directional filter bank, we evaluated both the shearlets of Vagharshakyan et al. [14], [15] and our filter bank under the task 3 × 3 to 7 × 7. Both methods used the same configuration as previously described. Since the EPIs should be pre-sheared, to avoid the influence of depth estimation error, we looped through all possible minimum disparities for the shearlets and tested our filter bank with the same best pre-shearing for each scene. Fig. 10 depicts quantitative comparisons of the average PSNR and SSIM on 30scenes. The proposed directional filter bank performs significantly better than the shearlets. Specifically, our directional filter bank achieves 37.18 dB per LF scene on average, 0.80 dB higher than the shearlets used in Vagharshakyan et al. [14], [15] (36.38 dB).
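The pre-shearing step can be illustrated with a toy sketch: each view row of an EPI is shifted horizontally in proportion to its angular index, so that lines with the chosen minimum disparity become vertical. The sign convention, the integer-disparity restriction and the wrap-around boundary (real implementations would pad instead) are all simplifying assumptions of ours:

```python
import numpy as np

def preshear_epi(epi, d_min):
    """Shear an EPI so the assumed minimum disparity maps to zero.

    epi:   2D array of shape (n_views, width), one row per angular sample.
    d_min: assumed integer minimum disparity (pixels per angular step).
    Row s is shifted by -d_min * s; np.roll wraps around at the border,
    which is a simplification of the padding used in practice.
    """
    sheared = np.empty_like(epi)
    for s in range(epi.shape[0]):
        sheared[s] = np.roll(epi[s], -d_min * s)
    return sheared
```

Looping this over candidate values of `d_min` and keeping the best reconstruction corresponds to the exhaustive search over minimum disparities described above.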

2) THE EFFECTIVENESS OF THE PSEUDO 4DCNN
To demonstrate the effectiveness of the pseudo 4DCNN module, we conducted ablation experiments with variants of the pseudo 4DCNN, and Table 5 lists the results. The average PSNR decreases by 0.87 dB without the prior sensitive weights from [40], and by 1.02 dB without the edge enhancement loss. This demonstrates that each component of the pseudo 4DCNN contributes to the overall performance.

V. CONCLUSION
In this article, a hybrid sparse regularization-pseudo 4DCNN framework is proposed to directly synthesize novel views of a 4D densely-sampled LF from sparse input views. To solve the problem of reconstructing LFs with a large disparity range, we combine sparse regularization with a directional filter bank to recover the recoverable part, and a pseudo 4DCNN to infer the unrecoverable part. Our framework works in a coarse-to-fine manner: it first reconstructs an intermediate densely-sampled LF using our previously proposed directional filter bank, and then applies 3D CNNs on row and column EPI volumes successively for refinement. Extensive evaluations on real-world and synthetic LF scenes show that our proposed framework outperforms various state-of-the-art approaches.