Comparison of Layer Operations and Optimization Methods for Light Field Display

A light-field display provides not only binocular depth sensation but also natural motion parallax with respect to head motion, which invokes a strong feeling of immersion. Such a display can be implemented with a set of stacked layers, each of which has pixels that can carry out light-ray operations (multiplication and addition). With this structure, the appearance of the display varies over the observed directions (i.e., a light field is produced) because the light rays pass through different combinations of pixels depending on both the originating points and outgoing directions. To display a specific 3-D scene, these layer patterns should be optimized to produce a light field that is as close as possible to that produced by the target 3-D scene. To deepen the understanding of this type of light-field display, we focused on two important factors: light-ray operations carried out using layers and optimization methods for the layer patterns. Specifically, we compared multiplicative and additive layers, which are optimized using analytical methods derived from mathematical optimization or faster data-driven methods implemented as convolutional neural networks (CNNs). We compared combinations within these two factors in terms of the accuracy of light-field reproduction and computation time. Our results indicate that multiplicative layers achieve better accuracy than additive ones, and CNN-based methods perform faster than the analytical ones. We suggest that the best choice in terms of the balance between accuracy and computation speed is using multiplicative layers optimized using a CNN-based method.


I. INTRODUCTION
Three-dimensional (3-D) displays have been the subject of study for many years [1]-[5]. These displays can be categorized on the basis of several criteria such as the necessity of wearing glasses and the number of supported viewing directions. Glasses-free (naked-eye) displays have attracted attention because they enable a more natural viewing experience than glasses-based ones. Multi-view displays have more potential than conventional stereo-only displays because they not only provide depth perception by showing different images to the left and right eyes but also present natural motion parallax in accordance with the movement of observers. In particular, multi-view displays that can support many and dense viewing directions are often referred to as light-field displays.
The associate editor coordinating the review of this manuscript and approving it for publication was Ganesh Naik.
To develop glasses-free multi-view/light-field displays, researchers have devised several methods, including those that use parallax barriers [1], [6]-[8], specially designed lenses (lenticular screens or integral photography lenses) [2], [3], [9]-[11], and stacked layers [12]-[17]. In this paper, we focus on the third method. This type of display, called a "layered display," can be viewed from many directions (angles) simultaneously without the resolution of each viewing direction being sacrificed (although the quality of each view degrades), which is deemed one of the desirable properties for glasses-free multi-view/light-field displays.
The structure of a layered display is illustrated in Fig. 1. A few layers, each of which has pixels that can carry out light-ray operations (multiplication and addition), are stacked at small intervals. With this structure, the appearance of the display varies over the observed directions, because the light rays pass through different combinations of pixels depending on both the originating points and outgoing directions. To display a specific 3-D scene with this structure, the layer patterns should be designed to make the direction-dependent views consistent with the appearance of the target 3-D scene. More precisely, a light field [18], [19] (i.e., tens of images), which is expected to be observed from different viewing directions, is given as the input, and the layer patterns are then optimized to reproduce the light field as accurately as possible. This design can also be applied to light-field projections [20], [21], head-mounted displays [22], [23], and table-top displays [24].
To deepen the understanding of this type of light-field display, we conducted comparisons based on two important factors: light-ray operations carried out using layers and optimization methods for the layer patterns. We compared multiplicative and additive layers. Multiplicative layers are implemented using liquid crystal display (LCD) panels and a backlight [16]. Additive layers are constructed with holographic optical elements (HOEs) and projectors [21]. Regarding the optimization of the layer patterns, we compared analytical methods used in previous studies [16], [21], which are slow due to heavy computation, with faster data-driven methods implemented as convolutional neural networks (CNNs) [25]. Such a CNN-based method requires significant time for training, but inference using a trained network is very fast, which paves the way for light-field displays running at video-rate speed.
We compared combinations within these factors (layer operations and optimization methods) in terms of the accuracy of light-field reproduction and computation time. To the best of our knowledge, we are the first to present such comparisons for layered light-field displays.¹ Our results indicate that multiplicative layers achieve better accuracy than additive ones, and CNN-based methods are faster than the analytical methods, with accuracy comparable to the best achievable accuracy of the analytical methods. We recommend that, in terms of the balance between accuracy and computation speed, one should adopt multiplicative layers optimized using a CNN-based method.
¹ A preliminary version of this paper was presented at a conference [26]. A more complete description, thorough discussions, and additional experimental results are included in the present paper.

II. LAYERED LIGHT-FIELD DISPLAY
We first introduce a parameterization for a light field. We then describe how a light field is produced using stacked layers that carry out multiplication or addition of light rays, as illustrated in Fig. 3. Finally, we describe the coordinate system we used in this study.

A. LIGHT-FIELD PARAMETERIZATION
A light field is defined as a 4-D function describing all the light rays that travel straight in free space [18], [19]. In this paper, we adopt a plane + angle parameterization, as shown in Fig. 2. A reference plane (z = 0) is defined, and a light ray is parameterized by the point of intersection with the reference plane [(u, v)] and the outgoing direction with respect to the z axis [(θ, φ)]. The intensity of each light ray is described as L(s, t, u, v) with s = tan(θ) and t = tan(φ). We assume that all the elements of L(s, t, u, v) take non-negative values because the light intensity is non-negative.

B. MULTIPLICATIVE LAYERS
Regarding multiplicative layers [16], we assume that a few light-attenuating panels (e.g., LCD panels) are stacked at evenly spaced intervals in front of a backlight. Let us consider a light ray passing through point (u, v) on the reference plane and going in the direction of (s, t). The intersection of this light ray with a layer located at depth z is (u + zs, v + zt). Therefore, the intensity of a light ray (normalized by the intensity of the backlight) emitted from this display can be described as

$$L_{\mathrm{mul}}(s, t, u, v) = \prod_{z \in Z} P_z(u + zs, v + zt) \tag{1}$$

where $P_z(u, v)$ denotes the transmittance of a layer located at z and Z denotes a set of depths where the layers are located.
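As an illustration of the multiplicative model, the following NumPy sketch renders every directional view as the product of the layer transmittances sampled at (u + zs, v + zt). It is a sketch under stated assumptions, not the implementation used in the experiments: the function name `render_multiplicative`, the mapping of u and v to array axes 0 and 1, and the wrap-around treatment of pixels shifted past the border are all hypothetical choices.

```python
import numpy as np

def render_multiplicative(layers, depths, views):
    """Return {(s, t): view}; each view is the product over z of P_z(u + z*s, v + z*t)."""
    light_field = {}
    for s, t in views:
        view = np.ones_like(layers[0], dtype=float)
        for pattern, z in zip(layers, depths):
            # Reading P_z at (u + z*s, v + z*t) is a shift of the pattern by (-z*s, -z*t).
            # np.roll wraps around at the borders, which is an assumption of this sketch.
            view *= np.roll(pattern, shift=(-z * s, -z * t), axis=(0, 1))
        light_field[(s, t)] = view
    return light_field

depths = [-1, 0, 1]                                             # Z = {-1, 0, 1}
views = [(s, t) for s in range(-2, 3) for t in range(-2, 3)]    # 5 x 5 viewpoints
rng = np.random.default_rng(0)
layers = [rng.uniform(0.0, 1.0, size=(8, 8)) for _ in depths]   # toy transmittances
light_field = render_multiplicative(layers, depths, views)
```

For the central view (s, t) = (0, 0), no layer is shifted, so the result is simply the pixel-wise product of the three patterns.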

C. ADDITIVE LAYERS
Regarding additive layers [21], we assume that they are evenly spaced and that the luminance values of the layer pixels are summed along the path of a light ray. Such operations can be implemented using HOEs and projectors. A light ray emitted from this display can be given as

$$L_{\mathrm{add}}(s, t, u, v) = \sum_{z \in Z} P_z(u + zs, v + zt) \tag{2}$$

where $P_z(u, v)$ denotes the luminance of a pixel (u, v) on the layer located at z.
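The additive model can be sketched in the same way, with the layer luminances summed instead of multiplied. The example below is only illustrative: `render_additive`, the array layout `[s, t, u, v]`, and the wrap-around border handling are assumptions introduced here. It also checks numerically that the rendered light field is linear in the layer patterns, which is the property that makes the additive fitting problem easier to handle than the multiplicative one.

```python
import numpy as np

def render_additive(layers, depths, half_width=2):
    """Return L_add as an array indexed [s + half_width, t + half_width, u, v]."""
    n = 2 * half_width + 1
    height, width = layers[0].shape
    light_field = np.zeros((n, n, height, width))
    for s in range(-half_width, half_width + 1):
        for t in range(-half_width, half_width + 1):
            for pattern, z in zip(layers, depths):
                # Add P_z(u + z*s, v + z*t); wrap-around borders are an assumption.
                light_field[s + half_width, t + half_width] += np.roll(
                    pattern, shift=(-z * s, -z * t), axis=(0, 1))
    return light_field

depths = [-1, 0, 1]
rng = np.random.default_rng(1)
layers_a = [rng.uniform(0.0, 1.0, size=(8, 8)) for _ in depths]
layers_b = [rng.uniform(0.0, 1.0, size=(8, 8)) for _ in depths]
lf_a = render_additive(layers_a, depths)
lf_b = render_additive(layers_b, depths)
# Rendering the sum of two layer sets equals the sum of the two renderings.
lf_sum = render_additive([a + b for a, b in zip(layers_a, layers_b)], depths)
```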

D. COORDINATE SYSTEM
Throughout the paper, we assume that all four variables (s, t, u, v) in a light field are integers. With this assumption, a light field can be regarded as a set of directional views, in which (s, t) corresponds to an index of a viewpoint (viewing direction) and (u, v) indicates a discrete pixel position. We assume that a light field consists of 5 × 5 views; thus, s and t are limited to the range of [−2, 2]. We also assume that a light-field display is composed of three layers located at Z = {−1, 0, 1}. Note that z corresponds to the disparity among the directional views rather than the physical length.

III. OPTIMIZATION METHODS
We now describe how the multiplicative or additive layer patterns are optimized to display a target 3-D scene. A light field that should be emitted from the display is described as L(s, t, u, v). The optimization goals for the layer patterns are expressed as

$$\arg\min_{P_z \, (z \in Z)} \sum_{s, t, u, v} \left| L(s, t, u, v) - L_{\mathrm{mul}}(s, t, u, v) \right|^2 \tag{3}$$

$$\arg\min_{P_z \, (z \in Z)} \sum_{s, t, u, v} \left| L(s, t, u, v) - L_{\mathrm{add}}(s, t, u, v) \right|^2 \tag{4}$$

To achieve these goals, we use the two types of optimization methods shown in Fig. 4. We first describe the analytical methods for optimizing the layer patterns for both multiplicative and additive layers, and then describe data-driven CNN-based methods as effective alternatives.

A. ANALYTICAL METHODS
We first describe an analytical optimization method for multiplicative layers, which is equivalent to the multiplicative update rule mentioned in a previous study [16]. Since (3) is a non-convex problem, we resort to alternating optimization. First, all the layer patterns are initialized with random values in the range of $[\epsilon, 1]$, where $\epsilon$ is a sufficiently small positive number. Next, we carry out optimization for one layer at a time and cycle through all the layers until convergence. When optimizing a specific layer $P_z(u, v)$, we assume that the other layers $P_{z'}(u, v)$ ($z' \in Z \setminus \{z\}$) are fixed. We define a column vector $l$ that includes all the elements of the light field L(s, t, u, v). Similarly, we define a column vector $p_z$ including all the elements of $P_z(u, v)$. We can also introduce a matrix $A_z$, with which $A_z p_z$ corresponds to the light-field vector generated by the layer patterns. Note that $A_z$ is determined by the fixed layers $z' \in Z \setminus \{z\}$, and all its elements are non-negative. The optimization of the target layer $P_z(u, v)$ is formulated as a non-negative least-squares problem:

$$\arg\min_{p_z} \| l - A_z p_z \|^2 \quad \text{subject to } p_z \geq 0 \tag{5}$$

The squared error can be reduced using the multiplicative update rule as follows:

$$p_z \leftarrow p_z \odot (A_z^{\top} l) \oslash (A_z^{\top} A_z p_z) \tag{6}$$

where $\odot$ and $\oslash$ represent element-wise product and division, respectively. Since all the elements on the right-hand side are non-negative, the left-hand side is ensured to be always non-negative. After (6) is applied, all the elements of $p_z$ are clipped to $[\epsilon, 1]$.

Next, we describe a similar analytical method for the additive layers. In contrast to the multiplicative case, (4) is a convex problem; thus, the layer patterns for all depths $z \in Z$ can be optimized simultaneously. We define a column vector $p$ that contains all the elements of all the layer patterns $P_z(u, v)$ ($z \in Z$). We can also define a matrix $A$ with which $A p$ corresponds to the light field produced by the layer patterns $p$. All the elements of $A$ are also non-negative (more specifically, 0 or 1). Using these variables, the optimization is again formulated as a non-negative least-squares problem [21]:

$$\arg\min_{p} \| l - A p \|^2 \quad \text{subject to } p \geq 0 \tag{7}$$

Similarly to the case with multiplicative layers, we can apply the multiplicative update rule described as

$$p \leftarrow p \odot (A^{\top} l) \oslash (A^{\top} A p) \tag{8}$$

After (8) is applied, the elements of $p$ are clipped to the valid range, as in the multiplicative case.
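To make the multiplicative update rule concrete, the sketch below applies it to a small non-negative least-squares problem with a dense random matrix. This is a toy illustration rather than the authors' CuPy implementation: the function name `multiplicative_update`, the problem size, the clipping range, and the guard against a zero denominator are assumptions. In the actual display problem the system matrix is large and structured, so it would not be stored as a dense array.

```python
import numpy as np

def multiplicative_update(A, l, p, n_iter=200, eps=1e-6):
    """Iterate p <- p * (A^T l) / (A^T A p), clipping to [eps, 1] after each step."""
    At_l = A.T @ l
    for _ in range(n_iter):
        denom = A.T @ (A @ p)
        p = p * At_l / np.maximum(denom, 1e-12)   # guard against division by zero
        p = np.clip(p, eps, 1.0)
    return p

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(40, 10))    # toy non-negative system matrix
p_true = rng.uniform(0.1, 0.9, size=10)     # toy ground-truth pattern
l = A @ p_true                              # noiseless target vector
p0 = rng.uniform(1e-6, 1.0, size=10)        # random initialization in [eps, 1]
err_start = np.sum((l - A @ p0) ** 2)
p_hat = multiplicative_update(A, l, p0)
err_end = np.sum((l - A @ p_hat) ** 2)
```

Because every factor in the update is non-negative, the iterate never leaves the non-negative orthant, and the clipping step keeps it inside the assumed range.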

B. CNN-BASED METHODS
We also evaluated CNN-based methods for optimizing multiplicative and additive layer patterns. These methods were constructed based on our previous study [25], but with one significant difference. In that study [25], we obtained the layer patterns from a compressively sampled light field. In this study, however, the layer patterns were obtained from a full light field. The optimization process for the layer patterns can be written in the form of a mapping as

$$f : \mathcal{L} \mapsto \mathcal{P} \tag{9}$$

where $\mathcal{L}$ represents a tensor that contains all the pixels of L(s, t, u, v) for all (s, t). Similarly, $\mathcal{P}$ represents a tensor that contains all the pixels of $P_z(u, v)$ for all $z \in Z$. To make the notations consistent, the mappings from the layer patterns to the light field ((1) and (2)) can be rewritten as

$$g_{\mathrm{mul}} : \mathcal{P} \mapsto \mathcal{L}_{\mathrm{mul}} \tag{10}$$

$$g_{\mathrm{add}} : \mathcal{P} \mapsto \mathcal{L}_{\mathrm{add}} \tag{11}$$

where $\mathcal{L}_{\mathrm{mul}}$ and $\mathcal{L}_{\mathrm{add}}$ represent all the light rays in $L_{\mathrm{mul}}(s, t, u, v)$ and $L_{\mathrm{add}}(s, t, u, v)$, respectively. We constructed two CNNs that correspond to the composite mappings $g_{\mathrm{mul}} \circ f$ and $g_{\mathrm{add}} \circ f$, respectively, and minimized the squared error losses given as

$$\arg\min_{f} \| \mathcal{L} - \mathcal{L}_{\mathrm{mul}} \|^2 \tag{12}$$

$$\arg\min_{f} \| \mathcal{L} - \mathcal{L}_{\mathrm{add}} \|^2 \tag{13}$$

over a massive amount of training samples.

The network architecture is rather straightforward, as illustrated in Fig. 5. The network consisted of 20 2-D convolutional layers stacked in a sequence. Throughout the networks, the spatial size of the tensors was kept constant, and only the number of channels changed. Tensors $\mathcal{L}$, $\mathcal{L}_{\mathrm{mul}}$, and $\mathcal{L}_{\mathrm{add}}$ had 25 channels, each of which corresponds to a viewpoint. Tensors $\mathcal{P}$ had 3 channels, each of which corresponds to one of the three layer patterns of the display. The other intermediate feature maps had 64 channels. During the training stage, training samples passed through the entire network. However, in a real application, only the mapping $f$ is conducted on a computer, whereas the mapping $g_{\mathrm{mul}}$ or $g_{\mathrm{add}}$ is conducted using the physical display hardware.
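The channel bookkeeping of such a convolutional stack can be sketched in a few lines of plain Python. The snippet assumes that the 20 convolutional layers map the 25-channel light field to the 3-channel layer patterns through 64-channel intermediate features, and it further assumes 3 × 3 kernels with biases. The kernel size is not stated in the text above, so the resulting parameter count is only indicative, and the helper names are hypothetical.

```python
def conv_stack_channels(n_layers=20, in_ch=25, mid_ch=64, out_ch=3):
    """List (input, output) channel pairs for a plain sequential conv stack."""
    widths = [in_ch] + [mid_ch] * (n_layers - 1) + [out_ch]
    return list(zip(widths[:-1], widths[1:]))

def count_parameters(channel_pairs, kernel=3):
    """Weights plus biases, assuming square kernels (the kernel size is an assumption)."""
    return sum(c_in * c_out * kernel * kernel + c_out
               for c_in, c_out in channel_pairs)

pairs = conv_stack_channels()        # 25 -> 64 -> ... -> 64 -> 3
n_params = count_parameters(pairs)   # learnable parameters under the assumptions above
```

The rendering step that follows the network during training has no learnable parameters, so only the convolutional stack contributes to this count.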

IV. EXPERIMENTS

A. IMPLEMENTATION
We used a Linux-based PC equipped with an NVIDIA GeForce GTX 1080 Ti. We executed all the methods on the GPU for fair comparison. To implement the analytical methods for multiplicative and additive displays, we used the open-source matrix library CuPy, which enables the methods to be executed on a GPU. Note that the analytical methods update the solution in an iterative manner; therefore, we need to investigate the trade-off between the computation time (the number of iterations) and accuracy of the solution. Regarding the CNN-based methods, we constructed two networks for multiplicative and additive displays. Following a previous study [25], we gathered training samples from several light-field datasets [27]-[30]. Three color channels of each dataset were used as three individual datasets. Each training sample was a set of 25 (corresponding to 5 × 5 views) 2-D image blocks with 64 × 64 pixels that were extracted from the same spatial positions in a light-field dataset. With data augmentation in the intensity levels, we finally collected 295,200 samples. The networks were implemented using Chainer version 3.2.0, a Python-based framework for neural networks. The batch size for training was set to 15. We used a built-in Adam optimizer. The number of epochs was 20 for both networks. Once the training finished, inference was conducted in almost constant time. We made our software available from our website [31].
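The construction of training samples described above can be sketched as follows. The array layout `(5, 5, H, W, 3)`, the non-overlapping stride, and the function name `extract_samples` are assumptions made for illustration; the text does not state how the block positions were chosen, and the intensity-level augmentation is omitted.

```python
import numpy as np

def extract_samples(light_field, block=64, stride=64):
    """Cut a (5, 5, H, W, 3) light field into samples of shape (25, block, block)."""
    n_s, n_t, height, width, n_colors = light_field.shape
    samples = []
    for c in range(n_colors):                       # each color channel separately
        views = light_field[..., c].reshape(n_s * n_t, height, width)
        for top in range(0, height - block + 1, stride):
            for left in range(0, width - block + 1, stride):
                # The same spatial window is taken from all 25 views.
                samples.append(views[:, top:top + block, left:left + block])
    return np.stack(samples)

rng = np.random.default_rng(0)
toy = rng.uniform(0.0, 1.0, size=(5, 5, 128, 192, 3))   # toy light field
samples = extract_samples(toy)
```

Each sample therefore has 25 channels, one per viewpoint, which matches the input tensor of the networks.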

B. RESULTS
We evaluated four combinations (a multiplicative or additive display optimized with an analytical or CNN-based method) in terms of the computation time and the accuracy of the reproduced light fields. The accuracy was measured using the peak signal-to-noise ratio (PSNR) against the original light field, which was obtained from the mean squared error over all 5 × 5 multi-view images and three color channels. The quantitative results with four light fields (which were not included in the training data) are summarized in Fig. 6. When the analytical methods were used, accuracy gradually increased with computation time, and it took several seconds to converge. In contrast, the CNN-based methods were very fast and achieved good accuracy, comparable to what the analytical methods can reach after many iterations. We can conclude that CNN-based methods are better than analytical ones in terms of the balance between computation time and accuracy. Regarding the light-ray operations carried out using layers, multiplicative layers yielded better results than additive ones.
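The PSNR computation described above can be written compactly as follows. This is a minimal sketch that assumes intensities normalized to [0, 1] (hence `peak=1.0`) and an array layout in which the views and color channels are simply additional axes; both are assumptions of the example.

```python
import numpy as np

def psnr(reference, reproduced, peak=1.0):
    """PSNR in dB from the MSE over all views, pixels, and color channels."""
    diff = (np.asarray(reference, dtype=np.float64)
            - np.asarray(reproduced, dtype=np.float64))
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")          # identical light fields
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.zeros((5, 5, 16, 16, 3))    # toy 5 x 5 light field, three colors
reproduced = reference + 0.1               # uniform error of 0.1, so MSE is 0.01
value_db = psnr(reference, reproduced)     # approximately 20 dB
```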
We also present visual results obtained with two datasets in Figs. 7 and 8. In each figure, we show input light fields and top-left views in (a), layer patterns obtained with the four combinations mentioned above in (b)-(e), and reproduced top-left views and errors from the ground truth (magnified by 2 for better visualization) in (f)-(i). The number of iterations for the analytical methods was fixed to 200, which produced sufficiently converged results. Each of the reproduced views was obtained from the corresponding layer patterns by calculating each of the light rays produced through the stack of those layers. We also report the PSNR and structural similarity (SSIM) values. The SSIM values reported here are the averages over all 5 × 5 multi-view images. The errors are perceivable mainly around the object edges. We can see that an additive display causes larger errors than a multiplicative one regardless of the optimization method. Please refer to the supplementary video for more details.

V. CONCLUSION
We compared the performance of layered light-field displays with different light-ray operations (multiplication and addition) conducted using layers and optimization methods (analytical and CNN-based methods) for the layer patterns. Our results indicate that multiplicative layers achieve better accuracy than additive ones, and CNN-based methods are faster than analytical ones, with accuracy comparable to the best achievable accuracy of the analytical methods. We recommend that, in terms of the balance between accuracy and computation speed, one should adopt multiplicative layers optimized using a CNN-based method. Our future work includes further improvement of the CNN structure for better accuracy with less computation time. We also need to increase the number of images in light fields to support wider viewing directions. Development of better display hardware is also necessary. These efforts will pave the way for high-quality light-field displays running at video-rate speeds.

KEITA MARUYAMA received the B.E. degree in electrical engineering from Nagoya University, Japan, in 2018, where he is currently pursuing the degree with the Graduate School of Engineering. His research interests are in light-field acquisition and rendering for 3D displays.

TOSHIAKI FUJII (Member, IEEE) received the B.E., M.E., and Dr.E. degrees in electrical engineering from The University of Tokyo, in 1990, 1992, and 1995, respectively. From 2008 to 2010, he was with the Graduate School of Science and Engineering, Tokyo Institute of Technology. Since 1995, he has been with the Graduate School of Engineering, Nagoya University, where he is currently a Professor. His current research interests include multidimensional signal processing, multicamera systems, multiview video coding and transmission, free-viewpoint television, and their applications.

VOLUME 8, 2020