End-to-End Optimization of an Achromatic Diffractive Optical Element Array for Integral Imaging Three-Dimensional Display

Diffractive optical element (DOE) arrays promise a compact form factor for full-parallax integral imaging three-dimensional (3D) display. However, DOEs suffer from large chromatic aberration because diffraction is strongly wavelength dependent, which degrades the quality of reconstructed 3D images. An end-to-end DOE optimization approach is proposed to reduce chromatic aberration in integral imaging. The end-to-end optimization framework includes RGB pre-processing convolutional neural networks and an achromatic optical optimization design of a rotationally symmetric DOE. An optical display model based on diffractive optics is proposed to analyze the integral imaging 3D display process for the achromatic optical optimization design. The pre-processed elemental image arrays are modulated by an optimized DOE array to reconstruct achromatic 3D images. A 3D artifact scene without chromatic aberration is reconstructed in different views with the proposed method, and both the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) are improved compared with the conventional Fresnel lens DOE.

to replace multi-layer lenses for medical, military, educational, entertainment, and even merging into a smartphone.
DOEs are important building blocks in science and technology because of their ultra-thin and lightweight form factors and excellent off-axis imaging performance [8], [9], [10], and they are currently used in holographic displays [11], [12], beam shaping [13], [14], and computational imaging [15], [16]. Such thin and lightweight surfaces provide remarkable control over the phase, polarization, and spectrum of light. Introducing a DOE reduces the optical complexity and enhances the depth of field of the integral imaging 3D system. However, the inherent chromatic aberration of the DOE deteriorates the display quality [17], [18]. Chromatic aberration is the inability of an optical element to focus light of different wavelengths on the same plane. The size and shape of the point spread function (PSF) change with wavelength, as shown in Fig. 1(a). During 3D display, the discrepancy between the dispersive PSFs at different wavelengths causes color distortion in the reconstructed 3D image, and oversized PSFs lead to blurred display images, as can be seen by comparison with Fig. 1(b). Numerous investigations of the chromatic aberration of DOEs have been performed in computational imaging [19], [20], [21], but research on achromatic DOEs for 3D displays remains insufficient.
In this paper, we propose an end-to-end optimization approach for an achromatic DOE array for integral imaging 3D display. The structure of the 3D integral imaging system based on the DOE array is shown in Fig. 1(a). The optimization framework includes RGB pre-processing convolutional neural networks (CNNs) and an achromatic optical optimization design of a rotationally symmetric DOE. An optical display model is proposed to analyze the integral imaging 3D display process with wavelength-based PSFs. The optimization design of the rotationally symmetric DOE reduces chromatic aberration using the optical display model. Meanwhile, RGB pre-processing CNNs are inserted at the front end to eliminate the residual chromatic aberration at three wavelengths (460, 550, and 640 nm), respectively. In the framework, high-quality 3D display performance without chromatic aberration is realized by jointly learning the DOE height map and the pre-processing CNN parameters. Experiments on integral imaging 3D display based on the conventional Fresnel DOE and on the proposed optimization method were carried out separately to verify the effectiveness of the proposed method. In future work, the end-to-end DOE optimization approach can also be used to reduce aberrations in different fields of view and enhance the depth of field.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. END-TO-END OPTIMIZATION OF OPTICS AND PRE-PROCESSING
The proposed end-to-end achromatic DOE optimization approach consists of two main parts: the achromatic optical optimization design and the RGB pre-processing CNNs. The optical optimization eliminates most of the chromatic aberration using the optical display model, and the RGB pre-processing CNNs eliminate the residual chromatic aberration at the three wavelengths, respectively. The DOE height map and the RGB pre-processing CNN parameters are jointly learned using stochastic gradient methods. Finally, the pre-processed elemental image (EI) array is optically modulated by the optimized achromatic DOE to reconstruct 3D images free of chromatic aberration.

A. End-to-End Optimization Framework
The DOE-based integral imaging optimization framework is divided into a capturing stage, an end-to-end optimization stage, and a reconstruction stage, as shown in Fig. 2. In the capturing stage, sub-image arrays containing stereo information are captured from 3D scenes using camera arrays. Subsequently, the sub-image arrays are encoded into the EI array and fed into the end-to-end optimization stage. The proposed end-to-end optimization jointly learns the parameters of three CNNs and the rotationally symmetric optics by building a differentiable pipeline consisting of a wave-based optical display model, RGB pre-processing CNNs, and a loss. The end-to-end optimization for reducing chromatic aberration is implemented in TensorFlow using stochastic gradient descent. In the reconstruction stage, the pre-processed EI array is loaded onto the 2D display panel and decoded by the optimized achromatic DOE array to reconstruct the 3D scene.
The end-to-end optimization is divided into two parts: the RGB pre-processing CNNs and the optical optimization, as shown at the bottom of Fig. 2. In the optical optimization process, the PSFs at different wavelengths in the reference plane are obtained from the height map according to the optical display model based on diffractive optics. A one-dimensional rotationally symmetric height map serves as the optimization variable, so that the PSFs are optimized uniformly for the different wavelengths. The three-channel EIs pre-processed by the RGB CNNs are fed into the display model. Each channel of the EI is convolved with the PSF of the corresponding wavelength, and the three results are integrated into the output EI. Because the optimized PSF still has residual aberration, the output EI would otherwise suffer from blurring, so the pre-processing CNNs are inserted before the display model. The three CNNs eliminate the residual chromatic aberration for the R, G, and B channel images, respectively. The variables of the DOE (height map) and of the pre-processing CNNs (network parameters) are optimized according to a mean square error and SSIM loss. The loss is calculated from the display EI and the captured ideal ground-truth EI. Ultimately, the pre-processing CNNs generate the corresponding sharp EIs that are loaded on the 2D display panel, and the optically optimized achromatic DOE is fabricated for use in the integral imaging system to modulate the EI array.

B. Optical Display Model
The conventional lens-based 3D display model using geometric optics is no longer applicable to DOE-based 3D display. Hence, an optical display model is derived that takes diffraction and wavelength-dependent effects into account. In the integral imaging reconstruction process, the EI array is loaded on a 2D display panel and each EI in the array is displayed by its corresponding DOE. As shown in Fig. 3, the display model consists of generating the wavelength-based PSFs of the DOE and simulating the reconstructed 3D images by convolving the EI with these PSFs. The PSF depends on the phase profile, so deriving the PSF in the display model requires calculating the phase retardation first. The phase change of light φ(u, v) at each point is calculated by:

$$\phi(u, v) = \frac{2\pi}{\lambda}\,(n_\lambda - 1)\,hei(u, v) \tag{1}$$

where λ is the wavelength, n_λ is the refractive index, and hei(u, v) is the height map variable of a transmissive substrate that will be optimized to achieve the desired PSF. (u, v) are coordinates on the DOE plane. A 2D variable height map increases the computational memory and the optimization complexity. Hence, as shown on the left side of Fig. 3, the proposed optimization method sets the height map as a rotationally symmetric one-dimensional variable H(r). The practical rotationally symmetric height map can be written as:

$$hei(u, v) = H(r), \qquad r = \sqrt{u^2 + v^2} \tag{2}$$

where r is the distance from the optimized variable to the center of the optical element. The pupil function p(u, v; λ) is represented by the aperture function P(u, v; λ):

$$p(u, v; \lambda) = P(u, v; \lambda)\,\exp\left[i\,\phi(u, v)\right] \tag{3}$$

Although the number of optimization parameters is reduced to one dimension, the PSF is still calculated in 2D. The wavelength-based PSF can be further derived as follows:

$$h(x, y; \lambda) \propto \left|\,\mathcal{F}\left\{ p(u, v; \lambda)\,\exp\left[\frac{i\pi}{\lambda z}\left(u^2 + v^2\right)\right] \right\} \right|^2 \tag{4}$$

where F denotes the 2D Fourier transform, z is the distance from the DOE to the reference plane, and (x, y) are coordinates on the reference plane.
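The PSF computation described above can be illustrated with a short NumPy sketch. It is not the authors' TensorFlow implementation: the function name `psf_from_radial_height`, the nearest-sample lookup of H(r), the single-FFT Fresnel form, and the refractive index of 1.5 used in the usage lines are all assumptions made for illustration.

```python
import numpy as np

def psf_from_radial_height(H, pitch, wavelength, n_ref, z):
    """Normalized 2D PSF of a rotationally symmetric height profile H(r).

    H[i] is taken as the height for radii in [i * pitch, (i + 1) * pitch).
    All lengths are in metres. Hypothetical helper, for illustration only.
    """
    n_r = H.size                      # number of radial samples
    n = 2 * n_r                       # full grid width in pixels
    coords = (np.arange(n) - n / 2 + 0.5) * pitch   # pixel centers on the DOE plane
    u, v = np.meshgrid(coords, coords)
    r = np.sqrt(u**2 + v**2)

    # Expand the 1D profile to 2D: hei(u, v) = H(r), nearest radial sample.
    idx = np.minimum((r / pitch).astype(int), n_r - 1)
    hei = H[idx]

    aperture = (r <= n_r * pitch).astype(float)     # circular aperture function P
    k = 2.0 * np.pi / wavelength
    phi = k * (n_ref - 1.0) * hei                   # phase delay of the height map
    pupil = aperture * np.exp(1j * phi)             # pupil function p

    # Fresnel propagation over z, written as a single Fourier transform.
    field = pupil * np.exp(1j * k * (u**2 + v**2) / (2.0 * z))
    psf = np.abs(np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(field))))**2
    return psf / psf.sum()

# Usage: an ideal (unwrapped) lens profile focusing at z = f gives a centered spot.
n_r, pitch, lam, n_ref, f = 50, 5e-6, 550e-9, 1.5, 13.02e-3
r_s = (np.arange(n_r) + 0.5) * pitch
H_lens = -r_s**2 / (2.0 * f * (n_ref - 1.0))
psf = psf_from_radial_height(H_lens, pitch, lam, n_ref, f)
```

In this single-FFT form the sample spacing on the reference plane is λz/(N · pitch), where N is the grid width in pixels, so PSFs computed at different wavelengths would need to be resampled onto a common grid before they are compared or used in a loss.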
The reconstructed 3D images at RGB wavelengths can be simulated with wavelength-based PSFs as follows:

$$RI^{c}_{mn}(x, y) = EI^{c}_{mn}(x, y) * h_c(x, y) \tag{5}$$

where * denotes convolution, m and n are the coordinates of the elemental image EI_mn, c represents the color channel of RGB, and h_c is the PSF h(x, y; λ) at the corresponding RGB wavelength. Three PSFs at red, green, and blue wavelengths on the reference plane are shown on the right side of Fig. 3. For an RGB EI displayed on the 2D display panel, the reconstructed image RI_mn(x, y) of EI_mn can be written in discrete form as:

$$RI_{mn}(x, y) = \sum_{c \in \{R, G, B\}} \sum_{x'} \sum_{y'} EI^{c}_{mn}(x', y')\, h_c(x - x', y - y') \tag{6}$$

The reconstructed 3D scene RI(x, y) of the EI array is integrated as follows:

$$RI(x, y) = \sum_{m} \sum_{n} RI_{mn}(x, y) \tag{7}$$
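A minimal NumPy sketch of this display simulation follows. It keeps the three color channels as separate planes of one RGB array, uses circular (FFT-based) convolution, and lays the reconstructed EIs side by side as a mosaic; these choices, and the helper names `display_ei` and `tile_array`, are illustrative assumptions rather than details given in the paper.

```python
import numpy as np

def display_ei(ei, psfs):
    """Convolve each color channel of one EI with the PSF of its wavelength.

    ei   : (H, W, 3) array, one elemental image with R, G, B channels.
    psfs : (H, W, 3) array, PSFs centered at index (H // 2, W // 2).
    Circular boundary handling is an assumption of this sketch.
    """
    out = np.empty_like(ei, dtype=float)
    for c in range(3):
        kernel = np.fft.ifftshift(psfs[..., c])   # move the PSF center to index (0, 0)
        spectrum = np.fft.fft2(ei[..., c]) * np.fft.fft2(kernel)
        out[..., c] = np.real(np.fft.ifft2(spectrum))
    return out

def tile_array(recon):
    """Arrange reconstructed EIs of shape (M, N, H, W, 3) into one (M*H, N*W, 3) image."""
    M, N, H, W, C = recon.shape
    return recon.transpose(0, 2, 1, 3, 4).reshape(M * H, N * W, C)

# Usage: a delta-function PSF leaves the EI unchanged.
rng = np.random.default_rng(0)
ei = rng.random((8, 8, 3))
delta = np.zeros((8, 8, 3))
delta[4, 4, :] = 1.0
same = display_ei(ei, delta)
```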

C. RGB Pre-Processing Convolutional Neural Networks
To display high-quality 3D images in DOE-based integral imaging, RGB pre-processing CNNs are proposed to eliminate residual chromatic aberration. Because of the residual chromatic aberration, the PSFs near different wavelengths cannot be completely consistent, so a traditional convolutional neural network (CNN) that processes the three RGB channels with shared parameters is no longer applicable. Hence, three independent CNNs (CNN-R, CNN-G, and CNN-B) were designed to pre-process the three channels of the EIs affected by chromatic aberration, as shown in Fig. 4. The EI (size 128 × 128 × 3) is separated by channel into three EIs (size 128 × 128 × 1), each of which is fed into its own network. Residual chromatic aberration near the 640 nm wavelength on the R channel is eliminated by CNN-R. CNN-G and CNN-B remove the residual chromatic aberration near the 550 nm and 460 nm wavelengths on the G and B channels, respectively.
Since the three networks are similar in structure, CNN-R is described here as an example, and the description is not repeated for CNN-G and CNN-B. CNN-R is a simple autoencoder divided into two parts: encoding and decoding. The first layer of the encoding part is a convolution layer with 32 kernels of size 4 × 4. The convolutional layers of the second to fifth blocks of the encoding part each double the number of 4 × 4 convolutional kernels and halve the output size to extract features. The decoding part includes four deconvolution layers with doubled number of kernels of a 4 × 4 matrix, which ensures that the residual output has the same size as the input image. All layers use a leaky rectified linear unit as the activation function.
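The layer description above can be checked with a small shape trace. The sketch assumes that the first convolution keeps the 128 × 128 resolution and that each later encoder block halves it, which is the reading under which four deconvolution layers restore the input size. Decoder kernel counts are left unspecified because the text does not pin them down, and the function name `trace_cnn_r` is hypothetical.

```python
def trace_cnn_r(input_size=128, first_kernels=32, n_down=4, n_up=4):
    """Return (layer name, spatial size, kernel count) for a CNN-R-like autoencoder.

    Assumes stride 1 in the first layer and a factor-of-two size change in
    every other layer; kernel counts of the decoder are not traced (None).
    """
    size, kernels = input_size, first_kernels
    shapes = [("conv1", size, kernels)]
    for i in range(n_down):            # encoder blocks 2 to 5
        size //= 2
        kernels *= 2
        shapes.append((f"conv{i + 2}", size, kernels))
    for i in range(n_up):              # four deconvolution layers
        size *= 2
        shapes.append((f"deconv{i + 1}", size, None))
    return shapes

layers = trace_cnn_r()
for name, size, kernels in layers:
    print(name, f"{size}x{size}", kernels)
```

Under these assumptions the encoder ends at 8 × 8 with 512 kernels, and the last deconvolution layer returns to 128 × 128.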

III. EXPERIMENTAL AND SIMULATION RESULTS
To demonstrate the viability of the proposed end-to-end achromatic approach in integral imaging 3D displays, we performed DOE optimization experiments for visible spectral wavelengths at different depths. In the experiments, pre-processed CNN parameters and DOE height map are trained as variables in the end-to-end optimization approach. We model a discretized height map with a resolution of 400 × 400 and a height map pixel pitch of 5 µm. The height map is wrapped with 2π phase modulation to a uniform maximum height. We consider a focal length of 13.02 mm and an aperture of 2 mm.
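The geometry and the 2π wrapping mentioned above can be sketched as follows. The refractive index of 1.5, the choice of 550 nm as the wavelength at which the height is wrapped, and the helper names are assumptions for illustration; the paraxial Fresnel lens profile is the standard thin-lens form, not necessarily the exact baseline used in the experiments.

```python
import numpy as np

def wrap_height(height, wavelength, n_ref):
    """Wrap a height map so that its maximum corresponds to a 2*pi phase delay."""
    h_max = wavelength / (n_ref - 1.0)     # height that delays the phase by exactly 2*pi
    return np.mod(height, h_max)

def fresnel_lens_height(r, focal_length, wavelength, n_ref):
    """Wrapped height profile of a paraxial Fresnel lens designed for one wavelength."""
    unwrapped = -r**2 / (2.0 * focal_length * (n_ref - 1.0))
    return wrap_height(unwrapped, wavelength, n_ref)

# Parameters from the text: 400 samples at 5 um pitch span the 2 mm aperture.
pitch, n_pixels, f = 5e-6, 400, 13.02e-3
aperture = n_pixels * pitch
lam, n_ref = 550e-9, 1.5                   # n_ref is an assumed value
r = np.arange(n_pixels // 2) * pitch       # radial samples out to 1 mm
h = fresnel_lens_height(r, f, lam, n_ref)
h_max = lam / (n_ref - 1.0)
```

At the design wavelength the wrapped and unwrapped profiles produce the same phase modulo 2π; at other wavelengths they do not, which is the origin of the chromatic aberration of the Fresnel lens DOE.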
In the proposed end-to-end framework, the height map is optimized to focus three target wavelengths, i.e., 460 nm, 550 nm, and 640 nm. We optimize the pipeline using stochastic gradient descent with a step size of 5 × 10⁻⁶. The prototype of the optimized DOE is shown in Fig. 5(a). As shown in Fig. 5(b), the performance of the optimized DOE design is demonstrated by the cross section of the normalized PSF. As shown in Fig. 5(f), a conventional Fresnel lens DOE that focuses light at a single wavelength (550 nm) is used as the baseline of regular DOEs that show strong chromatic aberration. The prototype of the conventional Fresnel lens DOE is shown in Fig. 5(e). In contrast to the Fresnel lens DOE, the diffractive achromat is optimized to create a wavelength-invariant PSF at 13.02 mm. Furthermore, the field distributions between 12.90 mm and 13.30 mm for the conventional Fresnel lens DOE and the proposed DOE are shown in Fig. 5(d) and (g), using pseudo colors corresponding to the respective wavelengths. The focal points of the optimized DOE corresponding to the RGB wavelengths all appear at z ≈ 13.02 mm. The conventional Fresnel lens DOE cannot focus the three wavelengths simultaneously within a partial continuous depth range. Depth-direction cross sections of Fig. 5(d) at the RGB wavelengths are shown in Fig. 5(c). The three wavelengths focus roughly similar energy in the yellow region around z = 13.02 mm, indicating that the optimized DOE achieves the achromatic property within a partial depth-of-focus range.
To describe the achromatic focal depth, we define the depth of field here as the range around the focus within which the normalized intensity of a single wavelength stays within 0.8 to 1.2 times the average intensity of the RGB wavelengths (gray area in Fig. 5(c)). Analyzing the depth of field at the RGB wavelengths gives an overlapping region from z = 12.97 mm to 13.05 mm (yellow in Fig. 5(c)). Reconstructed light with a scene depth between 12.97 mm and 13.05 mm can therefore be focused approximately onto the same focal plane.
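One way to evaluate such an overlapping depth range numerically is sketched below. It reads the definition as requiring every wavelength to stay within 0.8 to 1.2 times the RGB average, and it uses synthetic Gaussian focus curves rather than data from Fig. 5(c); the curve parameters and the function name `achromatic_depth_range` are assumptions.

```python
import numpy as np

def achromatic_depth_range(z, intensities, low=0.8, high=1.2):
    """Depth interval around the focus where all wavelengths stay near the RGB mean.

    z           : (K,) depth samples.
    intensities : (3, K) normalized on-axis intensity for R, G, B (all positive).
    Returns (z_min, z_max), or None if the focus itself fails the criterion.
    """
    mean = intensities.mean(axis=0)
    ratio = intensities / mean
    ok = np.all((ratio >= low) & (ratio <= high), axis=0)
    peak = int(np.argmax(mean))            # focus taken as the peak of the mean curve
    if not ok[peak]:
        return None
    lo = peak
    while lo > 0 and ok[lo - 1]:           # grow the interval toward smaller z
        lo -= 1
    hi = peak
    while hi < z.size - 1 and ok[hi + 1]:  # grow the interval toward larger z
        hi += 1
    return float(z[lo]), float(z[hi])

# Synthetic example: three slightly offset focus curves (illustrative values only).
z = np.linspace(12.90, 13.30, 401)         # depth in mm
centers = np.array([13.00, 13.02, 13.04])
curves = np.exp(-(z[None, :] - centers[:, None])**2 / (2 * 0.1**2))
depth_range = achromatic_depth_range(z, curves)
```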
Subsequently, the integral imaging 3D display system is set up to verify the overall feasibility of the end-to-end optimization approach. To reconstruct a 3D image, the integral imaging 3D display method first captures sub-image arrays from different perspectives of the 3D scene. The sub-image array is then encoded as an EI array, in which each EI contains 3D information from different viewpoints. Finally, the EI array is decoded by the optimized DOE array to reconstruct 3D images. The achromatic DOE array is composed of 30 × 17 optimized DOEs from the preceding experiment. As shown in Fig. 6, the chromatic aberration of the reconstructed 3D image is eliminated by achromatizing the EI array containing the 3D information. The resolution of each EI is 128 × 128. As shown in Fig. 6(a), a 30 × 17 EI array is loaded on the 2D display panel, which has a resolution of 7680 × 4320. The difference between the PSFs of different fields of view is slight compared with the difference between the PSFs of different wavelengths. Therefore, to increase the convergence speed, we use the normal direction in the optimization process. When the viewing angle increases further and the difference in PSF between fields of view grows, we will incorporate the chromatic differences of the different fields of view into the optimization by modifying the incident light direction.
Meanwhile, the conventional Fresnel lens DOE was used as a control group. The experiments were conducted with the same system parameters together with the Fresnel lens DOE and optimized DOE, respectively. Fig. 6(d) shows the EI based on the optimized DOE loaded on the 2D display panel after preprocessing CNNs in the end-to-end optimization framework. The results of display EI for the Fresnel lens DOE and optimized DOE are provided in Fig. 6(c) and (e). With the captured ideal EI as the standard, the Fresnel lens DOE displays significant out-of-focus blur and chromatic aberration, while the optimized DOE succeeds in focusing on all three RGB channels to produce an achromatic 3D scene with articulated details.
To further demonstrate the approach, a 3D display experiment based on human eye viewing was carried out. Fig. 7 shows the simulation of the 3D images observed by the human eye at different positions with the conventional Fresnel lens DOE and the proposed DOE. Two parameters, PSNR and SSIM, are used as evaluation metrics. PSNR is a commonly used image quality index that relates the maximum possible signal power of an image to the power of its error with respect to an ideal image. PSNR is calculated by the following equation:

$$PSNR = 10 \log_{10} \frac{(2^{k} - 1)^{2}}{MSE} \tag{8}$$

where k is the bit depth of the image, generally 8. MSE is the mean square error and is calculated by:

$$MSE = \frac{1}{UV} \sum_{u=1}^{U} \sum_{v=1}^{V} \left[ Ideal(u, v) - Display(u, v) \right]^{2} \tag{9}$$

where Ideal(u, v) represents the matrix data of the captured ideal image and Display(u, v) represents the matrix data of the reconstructed image. U is the number of pixel rows and u is the row index. V is the number of pixel columns and v is the column index. The SSIM quality assessment index is based on the computation of three factors: luminance (l), contrast (c), and structure (s). The overall index is a multiplicative combination of the three:

$$SSIM(x, y) = \left[ l(x, y) \right]^{\alpha} \left[ c(x, y) \right]^{\beta} \left[ s(x, y) \right]^{\gamma} \tag{10}$$

If α = β = γ = 1, the equation simplifies to:

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^{2} + \mu_y^{2} + C_1)(\sigma_x^{2} + \sigma_y^{2} + C_2)} \tag{11}$$

where μ_x, μ_y, σ_x, σ_y, and σ_xy are the local means, standard deviations, and cross-covariance for images x and y, and C_1 and C_2 are small constants that stabilize the division. Higher PSNR or SSIM values indicate a higher quality of the reconstructed 3D image. The 3D scene reconstructed by our method closely matches the captured ground truth, and the PSNR and SSIM values are higher than those of the Fresnel lens DOE (with average PSNR and SSIM improvements of 10.04 dB and 0.2821, respectively). The results show that our method successfully preserves both fine details and color fidelity. The end-to-end design approach can reconstruct colorful achromatic 3D scenes.
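The two metrics can be computed with a few lines of NumPy, as in the sketch below. The SSIM here is evaluated once over the whole image rather than over local windows, and the stabilizing constants follow the commonly used choice C1 = (0.01 L)² and C2 = (0.03 L)² with L = 2^k − 1; both simplifications are assumptions of this sketch, so its values will differ from a windowed SSIM implementation.

```python
import numpy as np

def psnr(ideal, display, k=8):
    """Peak signal-to-noise ratio in dB for images with k-bit pixel values."""
    mse = np.mean((ideal.astype(float) - display.astype(float))**2)
    if mse == 0:
        return float("inf")                 # identical images
    peak = 2**k - 1
    return 10.0 * np.log10(peak**2 / mse)

def ssim_global(x, y, k=8, k1=0.01, k2=0.03):
    """Single-window SSIM with alpha = beta = gamma = 1 (global statistics)."""
    x = x.astype(float)
    y = y.astype(float)
    dynamic_range = 2**k - 1
    c1 = (k1 * dynamic_range)**2
    c2 = (k2 * dynamic_range)**2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Usage on synthetic 8-bit data: a uniform error of one gray level gives MSE = 1.
rng = np.random.default_rng(0)
ideal = rng.integers(0, 255, size=(128, 128))
shifted = ideal + 1
print(psnr(ideal, shifted), ssim_global(ideal, shifted))
```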
In addition to the electronic sand table, DOE-based integral imaging 3D display can also be applied to the exhibition of cultural relics. Fig. 8 shows the 3D object 'Artifacts - Tang Dynasty Women' as observed by the human eye at different positions. The colors of the cultural relic are realistically reproduced with the proposed DOE, and the relative occlusions of the cultural relic can be clearly perceived. Most of the residual chromatic aberration is eliminated by the three networks. Although exact agreement with the ideal image cannot be achieved because of the limitations of the convolutional neural networks, the color distortion is significantly reduced and the SSIM and PSNR values are significantly improved compared with those before optimization. If the full-spectrum chromatic aberration needs to be optimized, the number of CNNs corresponding to the colors can be increased at the front end of the framework so that the full-spectrum chromatic aberration of the 3D images can be eliminated.

IV. CONCLUSION
In summary, an achromatic end-to-end DOE optimization approach for integral imaging 3D display with diffractive optics is proposed. The end-to-end DOE optimization framework consists of an optical display model and pre-processed CNNs. 3D images with reduced chromatic aberration are displayed by jointly optimizing the rotationally symmetric DOE height map and RGB pre-processed CNN parameters. In the simulations, we showed the chromatic aberration correction performance of the optimized DOE near the reference plane at RGB wavelengths. Compared with the conventional Fresnel lens DOE, the PSNR and SSIM of the proposed method for different viewing angles are improved by 10.04 dB and 0.2821, respectively, effectively reducing the chromatic aberration.