Light Field Image Super-Resolution Based on Multi-Level Structures

Light field (LF) images suffer from low spatial resolution due to the trade-off between angular and spatial resolutions. Thus, spatial super-resolution (SR) of LF images is an essential task to obtain high-quality LF images. However, the existing SR networks still have limitations, since they exploit only single-level features to use sub-pixel information in LF images. In this paper, we propose a light field super-resolution (LFSR) network to effectively improve the spatial resolution of light field images. The proposed network takes one target image and its 8-neighboring images as references. We construct multi-level structures for the proposed network to effectively estimate and mix sub-pixel information in reference images. The proposed network is composed of a feature extractor, a feature warping module, a feature mixing module, and an upscaling module. The feature extractor provides multi-level features for SR and offsets to the feature warping module to obtain aligned features for multiple reference images. The feature mixing module mixes multiple aligned features based on the similarity between the target and reference images to obtain multi-level mixed features. Finally, the upscaling module generates a high-resolution residual image using the multi-level mixed features. Experimental results demonstrate that the proposed network outperforms the state-of-the-art methods on various light field datasets. The pre-trained model and source codes are available at https://github.com/Hwa-Jong/LF_MLS.


I. INTRODUCTION
Light field cameras record not only spatial information but also angular information by inserting a micro-lens array between the main lens and the image sensor [1]. From the recorded spatial and angular data, multi-view images of a scene can be reconstructed. These light field images have been used in many computer vision tasks, such as saliency detection [2], [3], depth sensing [4], [5], and de-occlusion [6]-[8]. However, light field images have low spatial resolution due to the trade-off between angular and spatial resolutions. Low spatial resolution leads to performance degradation in computer vision applications. Therefore, light field image super-resolution (LFSR) is required to improve the performance of these applications. To solve this problem, we propose an LFSR network that achieves spatial super-resolution based on multi-level structures.
Since light field images are highly correlated with each other, sub-pixel information can be estimated using adjacent view images. Different from traditional single image super-resolution (SISR), LFSR can generate high-resolution images using sub-pixel information estimated from other light field images. Recently, with the release of large LFSR datasets [7], [9]-[13], many deep learning networks [14]-[18] have been developed based on convolutional neural networks (CNNs), which use multiple view images as references. Figure 1 shows some approaches in the existing LFSR methods using multiple view images. All LF images are used to generate all super-resolved LF images (all-to-all) [15], [19] or each super-resolved view image (all-to-one) [18]. Also, ResLF [14] uses several sets of LF images in various directions. These approaches require a fixed angular resolution of LF images to obtain super-resolved LF images. On the other hand, the approach in LFSR_AFR [16], which uses 8-neighboring view images, can provide super-resolved LF images regardless of the angular resolution. However, LFSR_AFR [16] still has a limitation in that it exploits only single-level features to increase the spatial resolution.

FIGURE 1: Examples of approaches in SR of 4 × 4 LF images: (a) all-to-all [15], [19], (b) all-to-one [18], (c) ResLF [14], and (d) LFSR_AFR [16]. Red border boxes represent target images, which are super-resolved, and blue boxes are reference images.
In this paper, we propose a light field super-resolution network, which enhances the spatial resolution of LF images, based on multi-level structures. The proposed network consists of a feature extractor, a feature warping module, a feature mixing module, and an upscaling module. The proposed network takes one target view image and its 8-neighboring view images as references. In the feature extractor, the proposed network extracts low-level as well as high-level features for each view image. Then, the feature warping module warps the features of reference images to the target image using deformable convolution to obtain aligned features. Next, the feature mixing module yields multi-level mixed features by combining the multiple aligned features of reference images based on the similarity between the reference images and the target image. Finally, the multi-level mixed features are used to produce a super-resolved residual image in the upscaling module. Experimental results demonstrate that the proposed LFSR network outperforms the state-of-the-art methods on various LF datasets without increasing the number of parameters.

II. RELATED WORK
Image SR aims to generate a high-resolution image from its low-resolution counterpart. However, image SR is a classical ill-posed problem [20]. To solve this problem, many CNN-based SR networks have been developed to improve the SR performance. SR can be categorized into single image super-resolution (SISR), video super-resolution (VSR), and light field super-resolution (LFSR).

SISR: The goal of SISR is to generate a high-resolution image from only one low-resolution image. SRCNN [21] is the first SISR network, which is composed of only three convolution layers. Later, VDSR [22], which includes 20 convolution layers, was developed. After that, many CNN-based SISR networks have been developed, including attention-based [23], [24] and generative adversarial network-based [25], [26] methods.
VSR: VSR attempts to increase the spatial resolution of frames based on temporally adjacent frames. In terms of exploiting sub-pixel information of reference frames through motion, VSR is similar to LFSR. DUF [27] generated high-resolution frames using dynamic upsampling filters. Tian et al. [28] adopted deformable convolution [29] for VSR. EDVR [30] reconstructed high-resolution frames using a deformable alignment module and temporal and spatial attention fusion modules.
LFSR: LFSR aims at generating high-resolution LF images from low-resolution LF images. To learn the mapping between low- and high-resolution light field images in a data-driven manner, Yoon et al. [31] proposed an early model for LFSR based on deep CNNs. Fan et al. [32] proposed two-stage CNNs for SISR and multi-patch fusion. Gul and Gunturk [19] proposed two networks for angular SR and spatial SR. Wang et al. [33] improved horizontally and vertically stacked images separately and combined them using stacked generalization. Zhang et al. [14] divided view images into four groups and stacked the views in each group to use residual information between neighboring views. Yeung et al. [34] generated high-resolution LF images using spatial-angular separable convolution. Ko et al. [16] proposed two networks to improve spatial and angular resolution based on adaptive feature remixing. Jin et al. [18] proposed an all-to-one strategy for LFSR, which enforces the LF parallax structure in the reconstructed LF images.

III. PROPOSED NETWORK
We adopt the 4D light field representation in [35]. Thus, the light field can be represented as a 4D function L(u, v, x, y), where L takes values in a color space, such as the RGB space. Here, (u, v) denotes an angular coordinate and (x, y) denotes a spatial coordinate. Thus, L consists of U × V light field images of H × W spatial resolution. Let I_u denote the view image at an angular coordinate u = (u, v).
The proposed network super-resolves each view image I_u to reconstruct a higher-resolution light field of size U × V × rH × rW with a scale factor r. Figure 2 shows the overview of the proposed network. To increase the spatial resolution of a view image I_u, the proposed network additionally takes its 8-adjacent view images in the angular domain to use sub-pixel information.

A. NETWORK ARCHITECTURE
Let {I_i}_{i=1}^9 denote the 3×3 view images centered on I_u. They are indexed from top-left to bottom-right, and thus we refer to I_5 = I_u as the target image and {I_i}_{i=1, i≠5}^9 as the reference images. As in Figure 3(a), a target image has 8 reference images when it is located in the center of the light field. On the other hand, some reference images are unavailable when the target image is located on the boundary of the light field, as in Figure 3(b). In this case, we use virtual images filled with zero values for the unavailable reference images. Given I_u, the proposed network estimates a super-resolved image I^HR_u through the feature extractor, feature warping, feature mixing, and upscaling modules.
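The gathering of 3×3 neighboring views, with zero-filled virtual images at angular boundaries, can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code; the function name `gather_references` and the (U, V, H, W) single-channel layout are our assumptions.

```python
import numpy as np

def gather_references(lf, u, v):
    """Collect the 3x3 block of view images centered on angular position (u, v).

    lf: array of shape (U, V, H, W) holding one channel of the light field.
    Views that fall outside the angular grid (boundary targets, Figure 3(b))
    are replaced by zero-filled virtual images, as described in the paper.
    Returns a list of 9 images indexed top-left to bottom-right, so that
    index 4 (I_5 in the paper's 1-based notation) is the target view itself.
    """
    U, V, H, W = lf.shape
    views = []
    for du in (-1, 0, 1):
        for dv in (-1, 0, 1):
            uu, vv = u + du, v + dv
            if 0 <= uu < U and 0 <= vv < V:
                views.append(lf[uu, vv])
            else:
                # Virtual reference: zero image for out-of-grid neighbors.
                views.append(np.zeros((H, W), dtype=lf.dtype))
    return views
```

For a corner target such as (0, 0), five of the nine returned views are zero images, which the similarity scores of the feature mixing module later suppress.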
Feature extractor: The feature extractor consists of a convolution layer and six residual blocks. It takes each view image and extracts a feature after every two residual blocks to construct multi-level features. Thus, 3-level features are obtained from the feature extractor. As in Figure 4, we use two feature extractors of the same structure to extract SR and offset features. To this end, for each I_i, multi-level SR features {F^l_i ∈ R^{H×W×C}}_{l=1}^3 and multi-level offset features {O^l_i}_{l=1}^3 are obtained.
Feature warping module: Reference images of I_u contain sub-pixel information for SR, but there are various sub-pixel shifts according to angular positions, such as horizontal, vertical, and diagonal offsets. To use sub-pixel information effectively, the SR features of reference images should be aligned to the target image I_5. The feature warping module estimates offsets between the target image I_5 and the reference images to warp the SR features of reference images to the target image. Figure 5 illustrates the detailed structure of the feature warping module. For each reference image I_i, i ≠ 5, the offset features O^l_i and O^l_5 are concatenated, and the concatenated feature is fed into a convolution layer to obtain an offset U^l_i ∈ R^{H×W×2}. Then, by employing the deformable convolution [36], F^l_i is warped to the target image to obtain an aligned SR feature A^l_i. Here, U^l_i is used as the offset in the deformable convolution. This warping process is performed for all feature levels, and thus multi-level aligned features for I_i, {A^l_i}_{l=1}^3, are obtained. Warping results at lower levels tend to preserve detailed local motions, while those at higher levels contain global shifts between the target and reference images. To explore these multi-level features effectively, the feature warping module combines them based on RCAB [23]. Specifically, the multi-level aligned features are concatenated along the channel dimension, and then the concatenated feature sequentially passes through one convolution layer and three RCABs to form the combined aligned feature A_i. Since RCAB is a channel attention block, it can generate a more effective feature for SR from the concatenated feature. Unlike the reference images, the multi-level SR features of the target image are combined without the deformable convolution, since they do not need the warping process. To this end, the feature warping module provides the set of aligned features {A_i}_{i=1}^9 for all view images.
FIGURE 5: Structure of the feature warping module.
Feature mixing module: Figure 6 illustrates the detailed structure of the feature mixing module.
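The alignment step can be illustrated with a minimal bilinear warping sketch. Note that the paper uses deformable convolution [36] for warping; the plain per-pixel bilinear sampling below is a simplified stand-in that only shows how an offset field like U_i resamples a reference feature. The function name and single-channel layout are our assumptions.

```python
import numpy as np

def warp_feature(feat, offset):
    """Bilinearly warp a feature map by a per-pixel offset field.

    feat:   (H, W) feature map (one channel, for brevity).
    offset: (H, W, 2) per-pixel (dy, dx) displacements, analogous to the
            offsets estimated by the feature warping module. The paper
            applies deformable convolution; this stand-in performs only
            the bilinear sampling step at the heart of the alignment.
    """
    H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sampling positions, clipped to the image borders.
    sy = np.clip(ys + offset[..., 0], 0, H - 1)
    sx = np.clip(xs + offset[..., 1], 0, W - 1)
    y0 = np.floor(sy).astype(int); x0 = np.floor(sx).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = sy - y0; wx = sx - x0
    # Standard bilinear interpolation of the four neighbors.
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

A zero offset field leaves the feature unchanged, while fractional offsets blend neighboring values, which is how sub-pixel shifts between views are compensated.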
The feature mixing module combines the aligned features of the reference images to explore sub-pixel information for SR of the target image. However, some reference images have low reliability for the use of sub-pixel information. For instance, as in Figure 3(b), reference images are filled with zero images when the target image is on the boundary of the LF. To alleviate the impact of those dummy images, we compute similarity scores between the target image and the reference images. For each reference image I_i, i ≠ 5, the feature mixing module compares the aligned features A_5 and A_i through a pointwise dot product. Thus, the similarity score s_i is defined as
s_i = Tr(Ã_5^T Ã_i),
where Tr(·) denotes the trace operation and Ã_i ∈ R^{HW×C} is the reshaped matrix of A_i. Then, the similarity scores are used as weights of the aligned features. Given the weighted aligned features of the reference images, {s_i · A_i}_{i=1, i≠5}^9, the feature mixing module combines them using several RCABs. The weighted aligned features of the reference images are concatenated, and then the concatenated feature sequentially passes through one convolution layer and ten RCABs to obtain mixed features. The feature mixing module extracts a mixed feature after every two RCABs. To this end, the feature mixing module produces multi-level mixed features {M^l}_{l=1}^5.
Upscaling module: The upscaling module takes A_5 and the multi-level mixed features {M^l}_{l=1}^5 to increase the spatial resolution of the target image I_5. Figure 7 illustrates the architecture of the upscaling module. To consider the multi-level mixed features sequentially, we use several U-blocks, each of which is composed of one convolution layer and three residual blocks. Each U-block takes the mixed feature of one level and the output of the previous U-block as inputs and sequentially processes them to extract a higher-level feature.
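The similarity score can be computed without forming any C × C matrix, since Tr(Ã_5^T Ã_i) equals the sum of the element-wise product of the two feature maps. A small NumPy sketch of this step (our illustration; the function name `similarity_scores` is hypothetical):

```python
import numpy as np

def similarity_scores(aligned, t=4):
    """s_i = Tr(A_t^T A_i) for every reference feature (i != t).

    aligned: list of 9 aligned features, each of shape (H, W, C), indexed
    top-left to bottom-right so index 4 is the target feature A_5.
    With the features reshaped to (H*W, C), Tr(A_t^T A_i) reduces to the
    sum of the element-wise product, so zero-filled virtual references
    automatically receive a score of 0 and contribute nothing after
    weighting.
    """
    At = aligned[t]
    return [float(np.sum(At * Ai)) for i, Ai in enumerate(aligned) if i != t]
```

In the network, each score s_i then scales its aligned feature A_i before the weighted features are concatenated and mixed by the RCABs.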
Also, we adopt the information pool [37] to combine the multi-level outputs of the U-blocks. The information pool analyzes the outputs of the first to fifth U-blocks to extract an information feature P. The information feature P is used in the last four U-blocks, as in Figure 7. U-blocks at the same level are linked with skip connections, as done in U-net [38]. The upscaling module adds the output of the last U-block to A_5, and the pixel-shuffle layer [39] generates a super-resolved residual image ΔI^HR_5, whose spatial resolution is rH × rW. Finally, the super-resolved image for the target image is defined as
I^HR_5 = ΔI^HR_5 + I^↑_5,
where I^↑_5 is the target image up-sampled using bilinear interpolation.
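The pixel-shuffle step [39] rearranges a feature with r² times more channels into an image r times larger in each spatial dimension. A minimal NumPy version of the rearrangement is shown below; the channel ordering follows the common sub-pixel convolution convention, and the (C, H, W) layout is our assumption.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, r*H, r*W), as in [39].

    Each group of r*r channels at spatial position (h, w) becomes an
    r x r block of pixels in the upscaled output.
    """
    C2, H, W = x.shape
    C = C2 // (r * r)
    x = x.reshape(C, r, r, H, W)
    x = x.transpose(0, 3, 1, 4, 2)   # -> (C, H, r, W, r)
    return x.reshape(C, H * r, W * r)
```

The residual produced this way is then added to the bilinearly up-sampled target image to obtain the final super-resolved output.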
Implementation details: As done in [14]-[16], [33], we convert the RGB color space into the YCbCr color space and super-resolve only the Y channel. The Cb and Cr channels are up-sampled using bicubic interpolation. We use the L1 loss function between the predicted super-resolved image and the ground truth. We use the Adam [40] optimizer and the leaky ReLU [41] activation function with a slope of 0.2 for negative inputs. The batch size is set to 32. Also, the learning rate is initially set to 0.001 and decreased by a factor of 0.5 every 50 epochs. We stop the training after 300 epochs.
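The step learning-rate schedule described above (initial rate 0.001, halved every 50 epochs) can be written in closed form. This helper is our illustration, not the authors' training script:

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.5, step=50):
    """Step schedule: start at base_lr and multiply by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Over the 300 training epochs this gives six plateaus, ending at a rate of base_lr / 32.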

IV. EXPERIMENTAL RESULTS
We first perform ablation studies to demonstrate the effectiveness of the components in the proposed network. Next, we compare the proposed network with the state-of-the-art networks, including [14], [16].

A. DATASET AND METRIC
For experiments, we use the HCI [9], HCI2 [10], EPFL [12], Stanford [11], and INRIA [7] datasets. Here, HCI and HCI2 are synthetic datasets, whereas EPFL, Stanford, and INRIA are real-world datasets. For a fair comparison, we use the same training set as [14], [16], which is composed of 246 LF images. Also, Table 1 lists the test LF images used in our experiments. For each scene in the test set, 9 × 9 view images are used for the evaluation. For quantitative evaluation, we use the PSNR/SSIM scores between the original high-resolution image and the super-resolved image.
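For reference, the PSNR metric used in the evaluation reduces to a few lines; SSIM is more involved, so only PSNR is sketched here. The peak value and float64 handling below are our assumptions about the evaluation setup.

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """PSNR in dB between a ground-truth image and a super-resolved image.

    ref, test: arrays of the same shape with values in [0, peak].
    Returns infinity for identical images (zero mean squared error).
    """
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 on images in [0, 1] gives an MSE of 0.01 and hence a PSNR of 20 dB.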

B. ABLATION STUDIES
We perform ablation studies to demonstrate the effectiveness of the components in the proposed algorithm. Table 2 shows the comparative results between the proposed network and its variants.
Without reference images: We do not use reference images to enhance the spatial resolution of the target image. In other words, the network takes only the target image as an input. Also, in this setting, the feature warping module and the feature mixing module are excluded from the proposed network, since there are no reference features. As in Table 2, without the sub-pixel information in reference images, the network provides unreliable SR results.
Without feature warping module: We remove the feature warping module from the proposed network to validate its effectiveness. Without the feature warping module, the features of reference images are not aligned to the target image, and thus the sub-pixel information between the target and reference images cannot be precisely exploited. As a result, incorrect sub-pixel information degrades the SR performance, as in Table 2.
Without feature mixing module: We measure the performance of the proposed network without the feature mixing module. In Table 2, we observe that the performance is degraded severely for all datasets. This indicates that the feature mixing module effectively combines various features in the target and reference images.
Without multi-level structure: We remove all multi-level structures in the proposed network. Specifically, in the feature extractor, only the single-level (highest-level) feature is extracted from each view image. Also, in the feature warping module, the warping process is performed on the single-level feature for each image. Finally, only the single-level mixed feature (M^5) is extracted from the feature mixing module. Then, the upscaling module increases the resolution using the single-level mixed feature. This variation degrades the performance of the proposed network, since receptive fields of various sizes are not available.

C. COMPARISON WITH STATE-OF-THE-ARTS
Table 3 and Figure 8 compare the proposed network with the existing LFSR networks (LFNet [33], ResLF [14], and LFSR_AFR [16]), the SISR network (EDSR [42]), and the video SR network (SOF-VSR [43]). The PSNR and SSIM scores of the existing algorithms on the HCI, HCI2, EPFL, and Stanford datasets are from LFSR_AFR [16]. For the INRIA dataset, we compare SR results using the source codes provided by the respective authors.
In Table 3, we observe that the proposed network achieves the best performance on all datasets. In particular, the proposed network outperforms the state-of-the-art method [16] by significant margins on the real-world LF datasets (EPFL, Stanford, and INRIA). Figure 9 and Figure 10 illustrate qualitative LFSR results for real-world and synthetic images, respectively. The proposed network generates visually pleasing results compared with the existing networks. Finally, Table 4 shows the number of network parameters and the run time of the proposed algorithm and the state-of-the-art methods [14], [16]. The proposed network requires the smallest number of parameters. It is worth pointing out that the proposed network surpasses the state-of-the-art methods even with the fewest network parameters. Also, the proposed network is faster than LFSR_AFR [16].

V. CONCLUSIONS
In this paper, we proposed the LFSR network based on multi-level structures. The proposed network extracts multi-level features to estimate sub-pixel information from reference images effectively. It then provides multi-level mixed features by combining reference features based on the similarity between the target and reference images. Using the multi-level mixed features, the proposed network gradually reconstructs a residual image for SR. Experimental results demonstrated that the proposed network outperforms the state-of-the-art LFSR methods on various LF datasets.

FIGURE 9: Qualitative comparison of the proposed network with Bicubic, ResLF [14], and LFSR_AFR [16] on the real-world datasets.