Color Diffraction Computer for Incoherent Digital Holography

Incoherent digital holography is a three-dimensional (3D) imaging technique that uses a natural light source. Through computer-based diffraction calculations, 3D information can be reconstructed from self-interferometer-captured holograms. As practical applications require a computer capable of high-speed diffraction calculations, an investigation was conducted into a special-purpose computer that uses a field-programmable gate array (FPGA). This approach had not been used for incoherent digital holography. In this study, the computational performance of a proposed monochromatic diffraction circuit was improved. By using direct memory access transfer, the data transfer time bottleneck, which is a problem in conventional circuits, was eliminated. The calculation was also parallelized, and the processing speed was 150 times faster than that of conventional circuits. In addition, a special-purpose computer for 3D color imaging was developed by implementing the proposed circuits for the three wavelengths on a single FPGA board. As a result of the verification, a speedup of approximately nine times over a 10-core CPU was achieved. The results of this research strongly promote the realization of a holographic portable camera with real-time 3D color imaging capability.


I. INTRODUCTION
Recently, the demand for three-dimensional (3D) imaging technology has increased in various fields.In bioimaging, it is necessary to noninvasively acquire 3D information of an object to elucidate biological phenomena.In machine vision, it is necessary to efficiently measure the shapes of objects and inspect their products.Digital holography [1], [2], [3], [4] is a 3D imaging technique.3D objects are recorded by an image sensor as interference fringes in holograms.An object image of a plane in focus at an arbitrary depth is reconstructed by transferring the recorded holograms to a computer and calculating the diffraction of the light waves.As opposed to conventional 3D imaging techniques, multiple images do not The associate editor coordinating the review of this manuscript and approving it for publication was Norbert Herencsar .
need to be recorded while adjusting the focal point.Once a hologram is recorded, it can be refocused to any depth within the 3D object.
Digital holography typically requires a coherent light source such as a laser to generate holograms.Therefore, the imaging environments are limited.In contrast, incoherent digital holography [5], [6] is a technique for recording holograms with incoherent illumination, such as that from light emitting diodes (LEDs), white light sources, and even sunlight.3D color information can be captured by recording multiple wavelengths [7], [8].Therefore, it is expected to have practical application as a holographic camera that is not limited by the imaging environment.
Various methods have been studied for the practical application of incoherent digital holography.For color imaging, the wavelength-multiplexed method [9], [10] can reduce the recording time compared with the conventional timedivision phase-shifting method [11].By separating the 3D information of each wavelength on a computer using the wavelength-selective phase-shifting interferometry method (WS-PSI) [12], [13], the number of recorded holograms was reduced from twelve to seven.In addition, instead of reducing the spatial resolution, a method to record all the required wavelength-multiplexed holograms in a single shot [14] was proposed, and a palm-sized optical system [15] was presented.
From the perspective of portability as a holographic camera, however, its application needs to accelerate the reconstruction calculations and to implement the algorithm in hardware as an embedded processor.Our research group has built computers dedicated to holography: the Holographic Reconstruction (HORN) series for electro-holography [16], [17], [18], [19], [20], [21], [22], [23], [24], [25] and the fast Fourier transform (FFT)-HORN series for digital holography [26], [27], [28], [29].Digital holography uses FFT calculations.Therefore, the FFT-HORN computers have been equipped with FFT calculation circuits.In this study, we applied FFT-HORN with a field programmable gate array (FPGA) technology to challenge the construction of a special-purpose computer for incoherent digital holography [30].The FPGA allows designers to freely program the calculation circuits.Acceleration methods with the FPGA in the field of computer holography have also been studied by other research groups [31], [32], [33].
In our previous study [30], we designed and implemented the calculation circuit for a monochromatic (single wavelength) reconstruction of incoherent digital holography.For the calculation, a speedup of 2.6 times faster than that of a CPU was achieved.However, it was found that the slow programmed input/output (PIO) transfer was a bottleneck.
In this study, we considered the use of high-speed direct memory access (DMA) transfer instead of PIO transfer.By eliminating the data transfer bottleneck and with parallelized reconstruction calculation, faster computation was achieved.Moreover, the calculation circuit was extended to a 3D color imaging circuit.The evaluation results demonstrated its effectiveness.This research will lead to realization of a holographic 3D camera, which can be applied in visual inspection.
The remainder of this paper is organized as follows.In Section II, we present a summary of 3D imaging in incoherent digital holography.In Section III, we provide a description of the circuit design to accelerate Fresnel diffraction.In Section IV, we present the implementation results to evaluate computation speed and quality.Finally, in Section V, we conclude the paper and discuss future works.

II. INCOHERENT DIGITAL HOLOGRAPHY
A. OPTICAL SYSTEM Fig. 1 shows an overview of incoherent digital holography.Conventional digital holography with lasers typically uses a two-beam interferometer in which the reference light and the object light pass through different paths.When an incoherent light source with a short coherence length is used, a selfinterferometer is used in which two light waves pass through the same path.In a self-interferometer, the object is viewed as a set of point light sources that are divided by polarizers into two light waves: one horizontally polarized and the other vertically polarized.By applying phase modulation only to the horizontally polarized light wave using a spatial light modulator (SLM), a difference in the optical path lengths of several wavelengths is produced.By aligning the polarization directions of the two light waves with those of another polarizer, the two light waves interfere and create a hologram.The images are recorded with a monochrome image sensor.To eliminate unwanted light specific to digital holography by computational processing, the phase of the interference fringes is shifted by an SLM, and multiple holograms are usually recorded for each object.

B. RECONSTRUCTION CALCULATION
The recorded holograms are transferred to a computer.3D imaging is performed using image reconstruction calculations.First, unwanted light is removed from the hologram using the phase-shifting method, and the complex amplitude distribution of the object light is calculated.Then, diffraction calculations are performed on the light of the object.Fig. 2 shows the method used to obtain a reconstructed 3D image using diffraction calculations.u(x, y; 0, λ ) is the complex amplitude distribution of the object light on a hologram plane, where λ is the wavelength of the object light.Diffraction calculations are performed to calculate the propagation of light waves.The plane reconstructed image u(x, y; z, λ ) at an arbitrary depth z is obtained by a single diffraction calculation.If the captured 3D object is in the region z 1 ≤ z t ≤ z T , it is necessary to perform the diffraction calculations T times while shifting the propagation distance z step by step.When acquiring 3D color images, T reconstruction images have to be acquired for each of the three wavelengths, resulting in a total number of calculations of 3T times.This makes realtime 3D imaging difficult owing to the high computational cost.

III. CIRCUIT DESIGN A. FRESNEL DIFFRACTION
To enable real-time imaging in incoherent digital holography, diffraction calculations are accelerated.Diffraction calculations are performed by convolution using twodimensional (2D) FFT.The FFT-based Fresnel diffraction [34], [35] is calculated as follows: where x, y, z represent 3D coordinates and m, l represent 2D spatial frequencies calculated by 2D FFT.IFFT is the 2D inverse FFT.The spatial frequency spectrum U (m, l) and transfer function H m, l; z, λ are calculated as follows: where j is an imaginary unit.Z is a constant related to the wavelength λ and propagation distance z.It is expressed as where p is the pixel pitch and n 2 is the hologram resolution as n × n pixels.In these equations, (2) is computed only once for the first time on the host PC because it is the spatial frequency spectrum on the hologram plane.Equations ( 1) and ( 3) are computed repeatedly according to the propagation distance z and are therefore implemented on the FPGA.As (4) involves division, the parameter Z is calculated on the host PC for single image reconstruction.To calculate the diffraction at multiple depths, only the initial parameters Z 0 for the initial position z 0 and Z for the interval z are calculated on the host PC.Z 0 and Z are transferred to the FPGA along with the number of reconstruction planes T z .Z is calculated on the FPGA as follows [30]: Fig. 3 shows an overview of the calculation system based on the aforementioned studies.Diffraction calculations require zero padding around the hologram image to prevent light-wave wraparound due to circular convolution.Because the height and width of the hologram are extended by a factor of two, the total number of pixels in the diffraction calculation is four times that of the original image.U is calculated using 2D FFT and then transferred to the FPGA to begin the calculations.The host PC uses floating-point arithmetic, whereas the FPGA uses fixed-point arithmetic.Therefore, a fixed-point conversion process is required for the host PC before the data transfer.After the calculation is completed on the FPGA, the host PC receives the calculation results, calculates the amplitude distribution and gradation, and saves the reconstructed image as an image file.

B. FPGA IMPLEMENTATION
Fig. 4 shows a schematic of the calculation circuit for color diffraction implemented on the FPGA.For the implementation, we used intellectual property (IP) cores, which are functional circuits already designed by FPGA vendors.We employed an FPGA evaluation board and IP cores from AMD Xilinx.DMA for the PCI Express (PCIe) subsystem (XDMA IP) [36] was used as the communication circuit.The XDMA IP allows DMA transfer between the host PC and the FPGA via PCIe.The advanced extensible interface (AXI) is a communication protocol inside of AMD Xilinx FPGAs.Most of the IP cores are provided with the AXI interface.Therefore, the calculation circuit was designed using the AXI interface.The data transfer rate also depends on the operating frequency of the calculation circuit and the data bus width of the AXI.In this study, the calculation circuit was designed with an operating frequency of 250 MHz and AXI data width of 512 bits to enable high-speed data transfer.Because the calculation circuits are the same for each wavelength, three monochromatic diffraction calculation circuits were implemented.Color diffraction calculations were performed by changing the input parameters for the wavelength λ .The implementation utilized the AXI SmartConnect IP [37], which can connect multiple AXI interfaces to link the three diffraction circuits and the XDMA IP.
Fig. 5 shows a schematic of the monochromatic diffraction calculation circuit.After verifying the fixed-point arithmetic accuracy of the software, the data width of the input/output per pixel was determined to be 32 bits, which consists of a 16-bit real part and 16-bit imaginary part [30].The data width of the AXI was set to 512 bits, so that 16 pixels of data could be transmitted and received simultaneously.The data were stored in the random access memory (RAM) at 512 bits and split into 32 bits immediately before calculation.To realize eight parallel calculations, eight RAMs were allocated for input and output.The input data U had a data width of 32 bits  and depth of 512 × 512 = 2 18 .Because the input data were divided into eight RAMs and 16 pixels of data were stored per address, the depth of each input RAM was 2 18  ÷ 8 ÷ 16 = 2 11 .The data widths of Z 0 and Z , which are the transfer function H calculation parameters, were of 32 bits each.The data of Z 0 and Z are sent to the UH calculation unit.The number of calculations T z is sent to the control unit.
When all the data have been entered, the control unit asserts the ''calc_start'' signal to ''1'' for one clock cycle, and the UH calculation unit starts the calculation.When the calculation is completed and the ''calc_finish'' signal is received from the 2D IFFT unit, the ''rdata_ready'' signal is asserted to ''1'' to indicate that the data are ready to be output to the host PC.When the host PC has received all the data, the ''rdata_ready'' signal is set to ''0'' and the internal counter of the control unit is incremented.This counter keeps the number of calculations performed and the control unit sends the ''calc_start'' signal until the counter matches T z .
Fig. 6 shows a block diagram of the UH calculation unit.This unit calculates U (m, l) × H m, l; z, λ in (1), which means the 2D spatial frequency spectrum at a distance z from the hologram plane.The transfer function, H , is computed using trigonometric function tables.The input address for the table is a decimal part with 8 bits of Z (m 2 + n 2 ) calculated using this unit.The bit width of H output from the table is 5 bits for the real part and 5 bits for the imaginary part.After computing H , complex multiplication is calculated with U .Let the real and imaginary parts of U and H be U re , U im , H re , H im , and the general complex multiplication result UH is expressed as follows: However, ( 6) and ( 7) require four multipliers.As the multiplier is limited as a circuit resource, the equations are transformed as follows: As H im (U re − U im ) is the common term, ( 8) and ( 9) require only three multipliers [38].We design them in this research.Fig. 7 shows a schematic of the 2D IFFT unit.The parallelization of the 2D IFFT was designed using the method described in [39].First, 1D IFFTs are calculated in the horizontal direction using 1D IFFT IPs [40], and the results are stored in RAMs.For eight parallel calculations, 8 × 8 = 64 RAMs are required.If a single RAM stores a single horizontal row of data, eight different addresses of the single RAM are accessed simultaneously when reading the data vertically.However, because a maximum of two different addresses can be accessed simultaneously in a single RAM unit, it is necessary to divide the RAM.After the horizontal 1D IFFTs are completed, the vertical 1D IFFTs are calculated.As in the conventional circuit [30], a device to reduce the number of 1D IFFT calculations using the diffraction calculation property is also used in this circuit.Computing a 2D IFFT for a 512 × 512 pixel image requires 512 + 512 = 1024 1D IFFTs and RAM resources with a depth of 512 × 512 = 2 18 .However, the final reconstructed image obtained by the diffraction calculation consists of only the central 256 × 256 pixels.Therefore, the 128 columns at both ends of the 512 horizontal 1D IFFT results are not required for the next vertical 1D IFFTs.The calculation results require only 256 × 512 = 2 17 , which saves 50% of memory resources.Moreover, the next vertical 1D IFFTs are only performed 256 times, so the total number of calculations is 512 + 256 = 768, which is a 25% reduction.In this case, eight parallel calculations are performed, so the 2D IFFT is completed in 768 ÷ 8 = 96 times the computation time for the 1D FFT.The circuit calculation result u is 256 × 256 pixels in size.It is connected to 16 pixels and stored in eight output RAMs.The depth of each output RAM is 2 16  ÷ 8 ÷ 16 = 2 9 .

IV. IMPLEMENTATION RESULT A. DEVELOPMENT ENVIRONMENT
In this study, we used the Alveo U250 Data Center Accelerator Card [41] from AMD Xilinx as the FPGA evaluation board.We designed and implemented the circuits using Vivado 2023.1 as the integrated development environment.
The CPU of the host PC was an Intel Core i9-10900X @ 3.70 GHz and the OS was Ubuntu 20.04.6LTS.Alveo U250 and the host PC transferred data via PCIe Gen3 × 16.For the device drivers, we used the drivers provided by AMD Xilinx [42].Processing in the host PC was written in C++, and g++ 10.3.0 was used as the compiler.

B. CIRCUIT SCALE
Table 1 lists the usage of each resource obtained by logic synthesis and implementation using Vivado 2023.1.The circuit was designed to enable the color diffraction calculations of 256 × 256 pixel holograms.The operating frequency was set at 250 MHz.Memory resources such as block RAMs (BRAMs) and ultra RAMs (URAMs) were heavily used at 23.6% and 30.0%, respectively.The used memory resources were specified during synthesis.The BRAMs were configured with a data width of 36 bits and a depth of 2 10 [43].
Owing to their small size, BRAMs were used for the 2D IFFT units that require splitting.BRAMs were also used for the IFFT and XDMA IP.The URAMs were configured with a data width of 72 bits and a depth of 2 12 [43].Because URAMs can store large amounts of data, they were used as input and output RAMs, as shown in Fig. 5.There were 16 input/output RAMs per wavelength for a total of 48 RAMs at the three wavelengths.Based on the usage presented in Table 1, each input/output RAM was configured with 384 ÷ 48 = 8 URAMs to provide a data width of 512 bits.The depth of the input RAM was 2 11 , whereas that of the output RAM was 2 9 .Because the depth of the URAM was 2 12 , one-half of the input RAM and 1/8 of the output RAM were used for the depth.In the conventional calculation circuit implemented on the VC707 Evaluation board [44], the single monochromatic diffraction circuit used more than 40% of the memory resources.Therefore, it was impossible to implement a color circuit.However, by selecting Alveo U250, which has abundant memory resources, the implementation of a color-diffraction calculation circuit was successful.

C. EVALUATION OF THE CALCULATION TIME
First, the calculation speed of the monochromatic diffraction circuits was evaluated.The computation time for a single reconstructed image was measured for the circuit implemented in this study, the conventional circuit implemented on a VC707 [30], and the CPU of the host PC.Table 2 lists the measurement results.The operating frequency was doubled from 125 to 250 MHz, and eight parallel computation processes were developed, resulting in a speedup of approximately 16 times over conventional circuits.By introducing the high-speed DMA transfer, the communication time was also improved by approximately 300 times compared with that obtained with the PIO transfer.As a result, the total processing time was 150 times faster than that of the conventional circuit and 7 times faster than that of a 10-core CPU.
Next, we evaluated the color-diffraction circuit.Table 3 lists the measurement results for the proposed circuit and the CPU of the host PC.The FPGA processing time in Table 3 indicates the total time required for calculation and data transfer.As shown in Fig. 3, this system requires a conversion to fixed-point numbers before transferring the data to the FPGA.The computation time, including the fixed-point conversion time, was evaluated in this study.Although the calculation time for the FPGA was fast (0.78 ms), the conversion time to fixed-point numbers was 2.61 ms, accounting for approximately 70% of the total processing time.In the context of single-image reconstruction, the total processing speed was only approximately two times faster than that of the CPU.However, as presented in Table 4 of Section IV-D, the performance improved as the number of reconstructed images increased.

D. EVALUATION OF 3D COLOR IMAGING
Finally, the accuracy of the image reconstruction was verified for multiple depths.To verify the quality of the reconstructed images, a computer-generated hologram (CGH) was calculated on a host PC.   3 be C 0 , the total cost is C 0 and O(N 2 log N ).Here, we suppose that T layers are required to reconstruct a 3D color object.As C 0 does not depend on T , the total computational cost is C 0 and O(T N 2 log N ).For a large T , it is proportional to O(T N 2 log N ).In this research, we set these parameters as n = 256 (N = 512) and T = 32.
Fig. 9 shows an RGB-D image [45] that was used as a 3D object with 32 layers.The resolution was 256 × 256 pixels.The distance between each layer was 0.8 mm, and the closest distance between the hologram and the object was 30 mm.The wavelengths of the red, green, and blue light sources were 625, 520, and 455 nm, respectively.The pixel pitch of the hologram was 12.0 µm, and the resolution of the hologram was 256 × 256 pixels.
The CGH was input to the proposed circuit to calculate the reconstructed image.The reconstructed images generated by the CPU using 64-bit floating-point arithmetic were contrasted.Peak signal-to-noise ratio (PSNR) was used as a quality assessment metric for the images.PSNR is a measure of the error between two images.In general, two images are said to be of equal quality if PSNR ≥ 30 dB.Fig. 10 shows the reconstructed images.Here, three reconstructed images are shown with planes that are different from the hologram: near, central, and far.It can be observed that different parts the image come into focus depending on the propagation distance.The PSNR of the reconstructed images at each depth was calculated, and the average PSNR of the 32 images was 47.6 dB.Therefore, the proposed circuit can acquire reconstructed images with the same quality as the CPU.In addition, Table 4 lists the measurement results for the total time required to calculate the 32 layers of the reconstructed images.Even when calculating for multiple depths, the fixed-point conversion for the input data is required only once, and the processing time is the same as that in Table 3.The ratio between the fixed-point conversion time and total processing time was reduced to approximately 17%.The acceleration rate was nine times higher than that of the CPU.The designed special-purpose computer is effective for accelerating 3D color imaging.

V. CONCLUSION
To accelerate 3D color imaging, a special-purpose computer was implemented for incoherent-color digital holography on the FPGA.The proposed computer was equipped with DMA transfer capability, which enabled high-speed data transfer to and from the host PC.It was also equipped with three units that performed diffraction calculations in eight parallel units for color imaging.Finally, the system achieved a speedup of approximately nine times over a 10-core CPU without compromising the quality of the reconstructed image.
In our future work, further improvements to the proposed computer will be considered.First, we plan to extend the computable hologram resolution from 256 × 256 to 512 × 512 pixels.Because extending the resolution also requires increasing the memory resources, a possible concern is that the circuit may lack URAM resources.However, the discussion in Section IV-B shows that even for URAMs currently counted as usage, only one-half of the input RAM and 1/8 of the output RAM are used.Even if the resolution is increased by a factor of four, the URAM consumption should only increase by a factor of 1.5, with input RAM doubling and output RAM remaining the same.While the problem of increasing data can thus be resolved, the extending resolution is limited to 512 × 512 pixels in this system.The limit of the number of parallels also needs to be considered.Increasing the number of parallels leads to an increase in memory resource usage.Therefore, this should be examined after the hologram resolution has been extended.If the processable resolution extends more than 512 × 512 pixels, we should use multiple FPGA boards or external memory.
As an alternative to preserving the resolution, we also contemplate integrating the computational operations presently executed on the host computer.The aim is to achieve the development of a portable holographic camera through the use of the FPGA for all computational operations.The first step is to implement zero padding and a 2D FFT.In this manner, the hologram is input data for the FPGA.The captured holograms are represented by integer values.By eliminating the fixed-point conversion process on the host PC, the entire system experiences a speedup.
In addition, the special-purpose computer developed in this study only used computer-generated holograms.In the future, we plan to incorporate the implemented computer into an optical system for incoherent color digital holography and verify the 3D image reconstruction for experimentally recorded holograms.

FIGURE 2 .
FIGURE 2. 3D imaging by Fresnel diffraction on a computer.

FIGURE 3 .
FIGURE 3. Overview of the calculation system.

FIGURE 4 .
FIGURE 4. Schematic of the color Fresnel diffraction circuit.

FIGURE 5 .
FIGURE 5. Schematic of the monochrome Fresnel diffraction circuit.

FIGURE 6 .
FIGURE 6. Block diagram of the UH calculation unit.

Fig. 8
Fig.8shows an overview of the calculation conditions.When the resolution of the hologram is n × n pixels, Fresnel diffraction calculates 2n × 2n pixels because zero padding is performed.By setting N = 2n, the computational cost of a slice image is expressed as O(N 2 log N ) because 2D FFT is utilized.Letting the computational cost of the fixed-point conversion in Table3be C 0 , the total cost is C 0 and O(N 2 log N ).Here, we suppose that T layers are required to reconstruct a 3D color object.As C 0 does not depend on T , the total computational cost is C 0 and O(T N 2 log N ).For a large T , it is proportional to O(T N 2 log N ).In this research, we set these parameters as n = 256 (N = 512) and T = 32.Fig.9showsan RGB-D image[45] that was used as a 3D object with 32 layers.The resolution was 256 × 256 pixels.The distance between each layer was 0.8 mm, and the closest distance between the hologram and the object was 30 mm.The wavelengths of the red, green, and blue light sources were 625, 520, and 455 nm, respectively.The pixel pitch of the hologram was 12.0 µm, and the resolution of the hologram was 256 × 256 pixels.The CGH was input to the proposed circuit to calculate the reconstructed image.The reconstructed images generated by the CPU using 64-bit floating-point arithmetic were contrasted.Peak signal-to-noise ratio (PSNR) was used as a quality assessment metric for the images.PSNR is a measure of the error between two images.In general, two images are said to be of equal quality if PSNR ≥ 30 dB.Fig.10shows the reconstructed images.Here, three reconstructed images are shown with planes that are different from the hologram: near, central, and far.It can be observed that different parts the image come into focus depending on the propagation distance.The PSNR of the reconstructed images at each depth was calculated, and the average PSNR of the 32 images was 47.6 dB.Therefore, the proposed circuit can acquire reconstructed images with the same quality as the CPU.

FIGURE 8 .
FIGURE 8. Overview of the calculation conditions.

TABLE 2 .
Computation time for a monochromatic image.

TABLE 3 .
Computation times for a color image.

TABLE 4 .
Computation times for 32 color images.