A 3D Iris Scanner From a Single Image Using Convolutional Neural Networks

A 3D model of the human iris provides an additional degree of freedom in iris recognition, which could help identify people in larger databases, even when only a piece of the iris is available. Previously, we reported developing a 3D iris scanner that uses 2D images of the iris from multiple perspectives to reconstruct a 3D model of the iris. This paper focuses on the development of a 3D iris scanner from a single image by means of a Convolutional Neural Network (CNN). The method is based on a depth-estimation CNN for the 3D iris model. A dataset of 26,520 real iris images from 120 subjects, and a dataset of 72,000 synthetic iris images with their aligned depthmaps were created. With these datasets, we trained and compared the depth estimation capabilities of available CNN architectures. We analyzed the performance of our method to estimate the iris depth in multiple ways: using real step pyramid printed 3D models, comparing the results to those of a test set of synthetic images, comparing the results to those of the OCT scans from both eyes of one subject, and generating the 3D rubber sheet from the 3D iris model proving the correspondence with the resulting 2D rubber sheet and binary codes. On a preliminary test the proposed 3D rubber sheet model increased iris recognition performance by 48% with respect to the standard 2D iris code. Other contributions include assessing the scanning resolution, reducing the acquisition and processing time to produce the 3D iris model, and reducing the complexity of the image acquisition system.


I. INTRODUCTION
The human iris is composed of two muscle systems and a sphincter to control the amount of light entering the retina [1]. These muscular fibers, as well as the pigmentation, provide a unique texture to each iris that can be used for identification [2]. Traditionally, the texture of the iris has been analyzed using 2D images to produce accurate iris recognition [2]-[8]. However, in recent years, a 3D iris scanning method that exploits the 3D relief of the iris has been proposed [9]-[11]. This method reconstructs a 3D model of the iris surface using images from several perspectives and Structure from Motion (SfM) algorithms [12], [13]. The 3D iris model opens new frontiers for biometric applications, as well as in ophthalmology [10]. For example, the 3D iris model can potentially be used as a screening method for angle-closure glaucoma, a disease currently diagnosed with Optical Coherence Tomography (OCT) scans [9], [10], [14].
The associate editor coordinating the review of this manuscript and approving it for publication was Derek Abbott.
A method for reconstructing a 3D model of the iris surface from several images was introduced by Bastias et al. [9] and improved by Benalcazar et al. [10]. The improved method consists of the following steps: First, visible light (VL) images of the iris are captured from different perspectives. These images are acquired with a custom device that illuminates the iris with Lateral and Frontal Visible Light (LFVL) [15]. Then, a modified SfM algorithm estimates the camera pose of every image jointly with a sparse 3D model of the iris [9], [10]. Then, a dense 3D point-cloud reconstruction is performed by extracting Shi-Tomasi keypoints from each image [10], [16]. Finally, the point-cloud model is converted into a mesh surface by the Screened Poisson Surface Reconstruction technique [17]. This mesh helps interpolate the depth information in areas of the iris with low texture [10]. The result is a 3D model that incorporates both depth and color information of the iris surface. The additional dimension aims to increase iris recognition accuracy particularly when the iris is occluded by eyelids, eyelashes, and reflections [9], [11]. The system recently developed by Cohen et al. [11] tracks fiducial points from two or more near-infrared (NIR) images of the eye to create the 3D model. They then calculate the geometric error between two 3D models as the Mean Square Error (MSE) of candidate matching points. They tested their method on a dataset of 20 irises, correctly classifying all of them.
As previously described, the 3D iris scanning method can produce a complete model of the human iris, but there are limitations to this technique. First, the SfM method requires a moving camera, which adds complexity to the system. Second, SfM was conceived to scan inanimate objects; however, the human iris can dilate from frame to frame, adding a source of distortion. This was solved by acquiring many images per position, and selecting those with a consistent dilation level [10]. This solution increases both acquisition and processing time. Third, because SfM relies on keypoints and descriptors, irises with richer texture generate more 3D points than those with fewer details. Finally, it is difficult to acquire 3D points from areas in the image that present no texture; thus the point-cloud 3D model has an uneven distribution of points in space. The mesh representation solves this issue at the expense of more processing time [10].
However, SfM is not the only method that can produce 3D scene reconstruction from 2D images. In recent years, Convolutional Neural Networks (CNN) have increased accuracy in depth prediction tasks [18]- [20]. Most of the CNNs rely on training an encoder-decoder architecture with the image of a scene as the input, and an aligned depthmap as the target [21]- [23]. As a result, the CNN learns to identify visual cues, such as perspective, that allow prediction of the depth of every object in the scene. The output depthmap captures the depth value of every pixel, even in low texture areas such as uniform color furniture or roads [23]. Therefore, the 3D model is always complete and evenly sampled regardless of the texture in the image.
The main contribution of this paper is a new method to obtain a 3D model of the iris from a single image using CNNs. The method is based on a depth-estimation CNN for the 3D iris model. A dataset of real iris images from 120 subjects, and a dataset of synthetic iris images with their aligned depthmaps were created. Then, depth-estimation CNNs were trained using the real and synthetic irises [18], [19], [24], [25], and two network architectures were combined to improve performance. We analyzed the performance of our method in predicting the iris depth in several ways: using real step pyramid printed 3D models, comparing the results to those of a test set of synthetic images, comparing the results to those of the OCT scans from both eyes of one subject, and generating the 3D rubber sheet from the 3D iris model to prove the correspondence with the resulting 2D rubber sheet and binary codes. Other contributions of the proposed method include assessing the scanning resolution, reducing the acquisition and processing time for producing the 3D iris model, and reducing the complexity of the image acquisition system, since the camera does not need to move to scan the iris.

II. RELATED METHODS IN DEPTH ESTIMATION USING CONVOLUTIONAL NEURAL NETWORKS
Depth estimation by a CNN can be formulated as a regression problem, in which the input is an image, and the target is the depth value of every pixel, also known as the depthmap. Eigen et al. [21] used a single image of an indoor scene as input, and the aligned depthmap of the same scene as the target. Such a depthmap had been acquired previously with an RGB-D camera. As a result, the CNN learned the depth of the walls and objects in indoor environments with great accuracy from their contexts [21]. Since then, several methods have been reported in the literature that have used similar training schemes and improved architectures with excellent depth estimation performance [18]- [20], [22], [23], [25].
The architecture of some depth estimation CNNs has been improved to produce more robust solutions. Eigen and Fergus [22] expanded their previous work to also predict surface normals and labels. Laina et al. [23] trained a ResNet-50 [26] based auto-encoder to increase accuracy. Alhashim and Wonka [18] developed DenseDepth, a DenseNet-169 based encoder with upsampling layers in the decoder to obtain high resolution depthmaps of indoor and outdoor scenes. Xu et al. [27] integrated Convolutional Neural Fields and a structured attention model to achieve pixel-level precision in depth estimation. Fu et al. [19] developed DORN, with a space-increasing discretization strategy to recast depth estimation as an ordinal regression problem. CNNs have also been incorporated into more complex methods for map reconstruction and navigation. For example, CNN-SLAM not only estimates depth from a single frame, but also integrates successive predictions of a video feed into a larger and more complete map of the environment [28]. Another deep network, FastDepth, by Wofk et al. [20], focused on a real-time implementation for robotic navigation.
One limitation of the previously described methods is the need for a large number of aligned depthmaps for training. That is why Godard et al. [29] developed Monodepth, an encoder-decoder CNN that is trained with stereo images. The input of that network is the left image and generating the right image is the target. In this sense, the network has to understand the 3D geometry of the scene implicitly to perform the task. Kuznietsov et al. [30] combined stereo image information with sparse depthmap ground truth to produce a semi-supervised implementation. Their approach uses a small number of aligned image-depthmap pairs as ground truth in a supervised manner, along with a greater number of stereo image pairs in an unsupervised manner [30].
Another solution for the limited availability of training data in depth estimation is the use of synthetic images. Tian et al. [31] trained detection and classification networks using a combination of real and synthetic images. In their work, CNNs trained with real and synthetic data outperformed those trained with only real images [31]. Moreover, Zheng et al. [25] developed a depth-estimation CNN (Translation and Task Network, T²Net) that incorporates the use of synthetic and real images in its architecture. T²Net is composed of two components. The translation component is a Generative Adversarial Network (GAN) that translates synthetic images to the domain of the real ones. The task component is an encoder-decoder that then predicts depth from the translated images [25]. T²Net achieved state-of-the-art results in widely used datasets, such as NYU-DepthV2 and KITTI [25].
Zheng et al. [25] analyzed various strategies for incorporating synthetic data in depth estimation tasks. As a result, they propose that the best alternative is incorporating both the translation and the task in the same training loop. In this way the GAN will learn to modify synthetic images only in their appearance while keeping the main features aligned with their depthmaps. They call it the full approach, and it had the best results among the other strategies analyzed [25].

III. METHODOLOGY
Our methodology for obtaining a 3D model of the iris from a single image is based on a depth-estimation CNN. First, we defined the requirements of the training images so that the CNNs could infer depth information from visual cues. Then, we acquired both real and synthetic iris datasets with the desired characteristics. After that, we used our datasets to train available depth-estimation CNNs for 3D iris scanning. We then analyzed the performance of our method in predicting iris depth in several ways: using printed 3D step pyramid models, comparing the results to those of a test set of synthetic images, comparing the results to those of the OCT scans from both eyes of one subject, and generating the 3D rubber sheet from the 3D iris model to demonstrate the correspondence with the resulting 2D rubber sheet and binary codes.

A. LEARNING DEPTH INFORMATION
Several visual cues provide depth information to humans. Cutting and Vishton [32] identified nine distinct mechanisms from which humans perceive depth. Occlusions indicate whether an object is behind or in front of another. The relative size of an object also indicates depth. Due to perspective, an object that is closer to a camera appears bigger than another that is farther away [33]. Similarly, the texture density of a cobble road appears to be coarser close to the viewer than farther away [32]. Binocular disparity allows triangulation to compute the distance of an object from the camera depending on how its position changes from one view to the next [12], [13], [33]. These visual cues are exploited by most SfM and CNN systems to reconstruct the precise 3D model of an object or a scene [13], [21].
Depth information of the human iris images has some particular issues that are different from those of general visual scenes. In iris images, the iris is the main object in the scene, and its size is normalized. Therefore, depth information cannot be inferred by occlusions or perspective. However, shadows cast by objects are another type of visual cue that provides depth information [32]. Elevations and craters can be identified by the shadows they cast. Similarly, in our method it is desirable to learn the relationship between the shadows on the surface of the iris, and the depth of the features that produce them.
In order for the iris features to cast shadows, a lateral source of illumination is needed. For this purpose, we developed a device with lateral and frontal lighting [15]. The device has a black frame that blocks external light sources, and has six white LEDs in front of each iris and three white LEDs on the side of each iris (LFVL illumination), to illuminate both eyes. The lateral illumination creates shadows from the relief of the iris surface, increasing the texture in the image [15]. This texture improved results in iris recognition [15]. LFVL illumination has also been used in 3D iris scanning with good results [10]. It was shown in [10] that LFVL improved the iris texture by producing more keypoints for 3D iris reconstruction. In our work, however, the use of LFVL illumination is important because shadows from iris features carry depth information.

B. REAL IRIS DATASET
The real iris dataset contains iris images with a wide range of dilation levels from 120 subjects. The study was approved by the Ethics and Biosafety Committee for Research, Faculty of Physical and Mathematical Sciences, Universidad de Chile, as stated in resolution No. 011 of May 9, 2019. Each of the 120 subjects signed a letter of consent for participating in this study. Iris images were captured under LFVL illumination using the device described in the previous section and in [15]. Iris images were captured in 3-second videos of pupil reaction to light changes. The pupil reaction test consisted of dark adaptation for 10 seconds, so that pupils would dilate, followed by turning on the LFVL illumination for 3 seconds. This experiment is harmless to the human eye since the LEDs used in this study are classified as Risk Group 0-1 [34]. The maximum admissible exposure time is 10,000 s for those risk groups, and our subjects were only exposed for 3 seconds [34]. The video captures how the pupil contracts from a dilated state, frame by frame, at 30 f/s (frames per second). Figure 1 shows some frames of the pupil reaction experiment for 3 subjects, while Figure 2 illustrates the evolution of the dilation level over time for one subject. The dilation level is measured as the ratio between the radii of the pupil/iris boundary (Rp) and the iris/sclera boundary (Rs) [1], [35]:
$$\text{dilation level} = \frac{R_p}{R_s} \qquad (1)$$
In order to remove artifacts and normalize the number of images per subject, 60 valid frames were selected per video. At 30 f/s, each video has 90 available frames; however, some frames in the videos contained motion blur caused by eye movements and blinking. Additionally, there were redundant frames with similar dilation levels, as can be seen in Figure 2 in the interval between 2.5 s and 3 s. Therefore, all images with motion blur or artifacts were removed manually, and 60 frames with different dilation levels were selected from the remaining images.
The selection consisted of keeping the images where the curve of Figure 2 has a steeper slope, and randomly sampling the images in the plateaus until 60 images were selected. Therefore, all the videos in the dataset contain exactly 60 valid frames. We captured two videos of pupil reaction from each eye of each subject. Of the 480 videos of the 120 subjects, 38 were eliminated because fewer than 60 frames without motion blur or artifacts were available. Therefore, a total of 442 videos were available from the 120 subjects. The total number of iris images available was 26,520.
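The frame-selection rule above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the central-difference slope estimate, and the `slope_threshold` value are assumptions introduced here, since the text only states that steep-slope frames are kept and plateau frames are sampled at random.

```python
import random


def dilation_level(r_pupil, r_iris):
    """Ratio between the pupil/iris radius (Rp) and the iris/sclera radius (Rs)."""
    if r_iris <= 0:
        raise ValueError("iris radius must be positive")
    return r_pupil / r_iris


def select_frames(levels, n_keep=60, slope_threshold=0.005, seed=0):
    """Return n_keep frame indices, or None when the video has too few frames.

    levels: dilation level of each artifact-free frame, in temporal order.
    slope_threshold: hypothetical cut-off separating steep frames from
    plateau frames (not a value reported in the paper).
    """
    n = len(levels)
    if n < n_keep:
        return None  # such videos were eliminated from the dataset
    # Central-difference estimate of how fast the dilation level changes.
    slopes = [abs(levels[min(i + 1, n - 1)] - levels[max(i - 1, 0)])
              for i in range(n)]
    by_slope = sorted(range(n), key=lambda i: slopes[i], reverse=True)
    steep = [i for i in by_slope if slopes[i] >= slope_threshold][:n_keep]
    steep_set = set(steep)
    plateau = [i for i in range(n) if i not in steep_set]
    # Fill the remaining slots by random sampling of plateau frames.
    extra = random.Random(seed).sample(plateau, n_keep - len(steep))
    return sorted(steep + extra)
```

With 442 retained videos and 60 frames each, this kind of selection yields the 26,520 images reported for the dataset.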
The dataset was acquired from 120 subjects with an average age of 23.2 ±5.0 years old. Of these subjects, 67% were male and 33% were female. Of the 120 subjects, their iris colors were 48 dark brown, 49 light brown, 19 green, 3 blue, and one gray iris. The average minimum and maximum dilation levels per iris among the subjects were 0.24 and 0.54 respectively in the dataset. However, the overall minimum and maximum dilation levels were 0.16 and 0.77 respectively.
The real iris dataset was partitioned in the following manner: 96 subjects were selected randomly for training, 12 for validation, and 12 for testing. There are, therefore, 20,940 training images, 2,700 validation images, and 2,880 testing images. It is worth mentioning that we have OCT scans available of both eyes of one subject in the dataset. This subject was placed in the test set in order to assess the generalization capacity of the 3D models in comparison to OCTs of that subject. Finally, each video was captured at a resolution of 8 Mpx, and the iris diameter is 800 pixels on average. However, due to GPU limitations, we resized the iris images to a resolution of 256 × 256. The resized images are similar in size to iris images in current commercial iris sensors.
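A subject-wise partition of this kind can be sketched as below. The helper name `split_subjects` and the `pinned_test` parameter are hypothetical; they only illustrate how a specific subject (here, the one with OCT scans) can be forced into the test set while the remaining subjects are shuffled at random.

```python
import random


def split_subjects(subject_ids, n_train=96, n_val=12, n_test=12,
                   pinned_test=(), seed=0):
    """Split subjects (not images) into train/validation/test lists.

    pinned_test: subjects that must end up in the test set, e.g. the
    subject for whom OCT scans are available.
    """
    ids = list(subject_ids)
    if n_train + n_val + n_test != len(ids):
        raise ValueError("partition sizes must add up to the number of subjects")
    pinned_set = set(pinned_test)
    pinned = [s for s in ids if s in pinned_set]
    if len(pinned) > n_test:
        raise ValueError("more pinned subjects than test slots")
    rest = [s for s in ids if s not in pinned_set]
    random.Random(seed).shuffle(rest)
    n_free_test = n_test - len(pinned)
    test = pinned + rest[:n_free_test]
    val = rest[n_free_test:n_free_test + n_val]
    train = rest[n_free_test + n_val:]
    return train, val, test
```

Splitting by subject rather than by image keeps all frames of one person in a single partition, so the test set measures generalization to unseen irises.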

C. SYNTHETIC IRIS DATASET
In order to acquire a synthetic iris dataset we used Blender, an open-source 3D-design application [36]. Blender can produce 3D models, simulate light sources and materials, render 2D images, and produce aligned depthmaps [36]. These characteristics allowed us to simulate LFVL illumination in virtual irises. We sculpted 100 virtual irises by obtaining texture information from the real iris dataset, and depth information from 36 OCT scans gathered from the internet. Figure 3 illustrates the process of sculpting irises using Blender. In this study, we define the xy plane as the same plane used in 2D iris images, while the z axis represents depth. First, one slice from one OCT is aligned with the yz plane. Then, the iris contour is carefully traced, and a surface of revolution is created by revolving the OCT slice around the z axis. The 3D texture is then added to the model so that it resembles that of the real iris. Each of the 100 virtual irises has a different dilation level, depth profile, and texture. To illustrate, Figure 3d shows four virtual irises that come from different OCTs, and therefore have different textures and dilation levels.
We then rendered synthetic iris images from those 3D models simulating LFVL illumination [15]. Thus, all the images have illumination sources from the side and from the front. We used the same resolution as the real iris dataset, which is 256 × 256. In the synthetic images a virtual iris of 12.1 mm in diameter was assigned 230 pixels in the image. This diameter corresponds to the average diameter of a human iris [37]. Figure 4 shows examples of synthetic images and their respective depthmaps. The shadows in a synthetic image (Figure 4) are simulated from the interactions of LFVL light with the 3D relief of virtual irises (Figure 3). Next, we used data augmentation on the 3D models rather than on the 2D images to avoid aliasing and distortions. For this purpose, we changed rotation, translation, scaling, mirroring, and color in the 3D models. We used 4 colors, 9 positions, 5 rotations, 2 scales, and mirroring, generating a total of 720 images per virtual iris. The synthetic iris dataset therefore has 72,000 images. Since the 3D information of each model is known, the corresponding synthetic images are accompanied by their aligned depthmaps. However, since color swapping produces the same depthmap, there are only 18,000 distinct depthmaps in the dataset. The depthmaps were encoded using 8 bits (0-255). The full scale range of 255 levels is equivalent to 1.936 mm in Blender for our virtual irises.
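The augmentation counts and the depth encoding above can be checked with a short sketch. The constant and function names are illustrative assumptions; only the numbers (4 colors, 9 positions, 5 rotations, 2 scales, mirroring, 1.936 mm over 255 levels, 12.1 mm over 230 pixels) come from the text.

```python
from itertools import product

N_COLORS, N_POSITIONS, N_ROTATIONS, N_SCALES = 4, 9, 5, 2
MIRROR = (False, True)
DEPTH_RANGE_MM = 1.936            # depth span covered by 255 gray levels
IRIS_DIAMETER_MM, IRIS_DIAMETER_PX = 12.1, 230


def augmentation_grid():
    """All (color, position, rotation, scale, mirror) combinations per iris."""
    return list(product(range(N_COLORS), range(N_POSITIONS),
                        range(N_ROTATIONS), range(N_SCALES), MIRROR))


def unique_depthmap_keys(grid):
    """Color does not change geometry, so it is dropped from the key."""
    return {combo[1:] for combo in grid}


def depth_level_to_mm(level):
    """Convert an 8-bit depthmap level (0-255) to millimeters."""
    if not 0 <= level <= 255:
        raise ValueError("depth level must be in [0, 255]")
    return level / 255 * DEPTH_RANGE_MM


def mm_per_pixel():
    """Lateral scale implied by a 12.1 mm iris spanning 230 pixels."""
    return IRIS_DIAMETER_MM / IRIS_DIAMETER_PX
```

Multiplying the grid size by the 100 virtual irises reproduces the 72,000 images, and dropping color reproduces the 18,000 depthmaps. One gray level corresponds to roughly 7.6 µm of depth.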
We also added eyelids, eyelashes, and reflections to the synthetic images, emulating the real iris dataset. This step also helps the networks to learn to predict depth information even in the presence of specular highlights. This will also allow the network to learn how to segment eyelids and eyelashes from the iris. Figure 5 shows the synthetic images with the characteristics described. Eyelids were given a depth value of 10 on the scale of 0-255. This number was selected to avoid saturations during training using backpropagation.
We then partitioned the synthetic iris dataset randomly, using 80 virtual irises for training, 10 for validation, and 10 for testing. We thus have 57,600 synthetic images for training, 7,200 for validation, and 7,200 for testing. The synthetic iris dataset will be available on GitHub at https://github.com/dpbenalcazar/irisDepth.

D. NETWORK ARCHITECTURE AND TRAINING
In this work we trained several state-of-the-art CNNs to compare their performances in solving the iris depth estimation problem. We compared DenseDepth [18], DORN [19], and T²Net [25], which have demonstrated great depth estimation performance in outdoor and indoor settings. We then introduce irisDepth, which combines the GAN of T²Net with the sophisticated depth prediction architecture of DenseDepth, to increase performance. Figure 6 shows the architectures of T²Net and irisDepth. The yellow module G_{S→R} is a GAN that is shared by both networks. In order to use iris images with these networks, we added lateral illumination (LFVL) of the iris, which enhances shadows produced by iris features [15]. The networks were then trained to relate shadows in RGB images to depth information [32]. Both real and synthetic images were illuminated with LFVL in this work.
To make use of synthetic and real data in the training process, Zheng et al. described two training schemes, called vanilla and full [25]. In the vanilla approach, the translation component is trained first, and the task component is trained afterwards. In the full approach, both translation and task are trained simultaneously. In the context of iris depth estimation, the translation component performs domain adaptation to the synthetic iris images to look realistic, and the task component estimates the depth value of every pixel in the iris image. We used both vanilla and full approaches to train available state-of-the-art networks for 3D iris scanning with the datasets that were described in the Methodology, subsections B and C.
For the vanilla approach, we trained CycleGAN [38], [39] to perform domain adaptation on synthetic images. We used the synthetic iris images as the input, and the real iris images as the target. We trained the network using the train partition of both datasets, and the stop epoch was determined with the validation set. After that, we used CycleGAN to translate all 72,000 of the synthetic images, and thus formed a photo-realistic iris dataset. This dataset was partitioned identically to that of the synthetic iris dataset. Then, with the photo-realistic irises as the input, and the depth ground truth of the synthetic images as the target, we trained DenseDepth [18], DORN [19], pix2pix [24] and T²Net [25]. In all these cases, we used the same networks available in the original code, with the exception of adjusting image sizes to 256 × 256. We used the train partition of the dataset to train these networks. The validation partition was used to determine the stop criterion for each network.
FIGURE 6. Architectures of T²Net [25] and irisDepth in the context of iris depth estimation. (a) T²Net consists of two parts: translation, in blue, and task, in red. The translation network is comprised of a GAN that enhances the realism of synthetic images. The task part is comprised of an encoder-decoder architecture f_T, which makes depth predictions from real and translated images. (b) irisDepth uses the DenseDepth [18] architecture to improve depth prediction performance. A pre-trained T²Net GAN enhances the realism of synthetic images while leaving iris features aligned with the corresponding depth features. After training with realistic irises with aligned depthmaps, irisDepth can make depth predictions in real iris images. The yellow module G_{S→R} is first trained in (a), and then used in (b) to generate the inputs.
The full version of T²Net, shown in Figure 6a, was trained using a similar procedure. We also made no changes in the network architecture other than adjusting input and output image sizes. The GAN part of T²Net (G_{S→R}) is based on SimGAN in the generator and PatchGAN in the discriminator [25]. The task network (f_T) uses ResNet-50 in the encoder and up-sampling layers in the decoder [25]. Due to GPU constraints, we had to reduce image resolution to 192 × 192 only for this network. Then, we used the train partitions of both real iris and synthetic iris datasets as the input, and the depth ground truth of the synthetic images as the target. Using the validation partition, we determined the stop epoch.
We propose a method to increase performance by merging DenseDepth and T²Net. As Zheng et al. described [25], the problem with the vanilla approach is that while the GAN could morph image features in favor of better appearance, those image features might no longer be aligned with depth features in the corresponding depthmap. We experienced this phenomenon with CycleGAN. As a solution to this problem, we propose using the GAN prediction of a pre-trained T²Net along with the auto-encoder of DenseDepth, instead of using a GAN that is blind to depth information. We call this approach irisDepth, and it makes use of the precision of DenseDepth while solving the main problem of the vanilla approach. Figure 6b illustrates irisDepth's architecture.
The following steps were performed for the purpose of using irisDepth in our problem: First, we changed the configuration of T²Net to handle images with a resolution of 256 × 256. We used 6 down-sample layers in the transform network, 3 down-sample layers in the task network, 3 down-sample layers in the discriminator, and kept the rest of the parameters of the original configuration of T²Net. Then, we trained T²Net (Figure 6a) with our datasets, and used the validation set to find the stop point. Figure 7 shows the evolution of the translation and task loss functions. This illustrates an example of convergence with the proposed method. We then discarded the task part of this T²Net, and used only its GAN at the best epoch for the next steps. This is the yellow G_{S→R} module in Figure 6. After that, we translated all the images in the synthetic iris dataset to obtain a realistic dataset. We partitioned this dataset to be identical to the synthetic iris dataset. Finally, we trained the standard version of DenseDepth using the train partition of the realistic dataset as the input, and the corresponding depthmaps of the original synthetic images as targets, as illustrated in Figure 6b. In this way, our irisDepth uses a GAN with information about depth data and a robust auto-encoder for the task part.
VOLUME 8, 2020

E. DEPTH EVALUATION WITH SYNTHETIC IMAGES
As one performance evaluation, we compared the depth estimation capacity of each network using the test set of 7,200 synthetic images. The goal of this test is to evaluate the depth estimation part of each network rather than the photo-realism of the translated images. The results of this test do not generalize to the performance on a real iris, but give a good indication of the precision of each network in the depth estimation task. First, the synthetic images were translated to the realistic domain using CycleGAN for the vanilla networks, and the respective GAN for the full networks. Both T²Net and irisDepth have loss functions for the translation, as well as for the task part. Therefore, the networks perform domain adaptation instead of leaving synthetic images unchanged. Depthmaps were then predicted from the translated images using each network. Finally, we evaluated how similar the depthmaps that were predicted from the translated images were to the ground-truth depthmaps of the synthetic images.
For this purpose, we used the standard metrics: Absolute Relative Difference (abs_rel), Squared Relative Difference (sq_rel), Root Mean Square Error (rmse), Logarithmic Root Mean Square Error (rmse_log), and the Accuracy Metrics (a_1, a_2 and a_3) [18]-[23], [25]. The accuracy metrics a_1, a_2 and a_3 are computed using:
$$\delta(u,v) = \max\left(\frac{\mathrm{depth}(u,v)}{GT(u,v)},\ \frac{GT(u,v)}{\mathrm{depth}(u,v)}\right) \qquad (2)$$
$$a_n = \frac{\left|\{(u,v) : \delta(u,v) < 1.25^{n}\}\right|}{\left|\{(u,v)\}\right|} \qquad (3)$$
where u and v are the coordinates of a pixel, depth(u,v) is the intensity of the predicted depthmap at the (u,v) coordinate, GT(u,v) is the intensity of the ground truth depthmap at the same coordinate, and n = {1, 2, 3}.
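As a minimal sketch, the metrics named above can be computed as follows using their usual definitions in the depth-estimation literature. The function name `depth_metrics` is an assumption, the 1.25^n thresholds are the conventional choice rather than a value quoted from this text, and the inputs are assumed to be flattened sequences of strictly positive depth values.

```python
import math


def depth_metrics(pred, gt):
    """Standard depth-estimation metrics for flattened, positive depth values.

    pred, gt: equal-length sequences holding the predicted and ground-truth
    depth of every pixel. Values must be > 0 (ratios and logs are taken).
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and equally long")
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Accuracy a_k: fraction of pixels whose ratio stays below 1.25**k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    metrics = {"abs_rel": abs_rel, "sq_rel": sq_rel,
               "rmse": rmse, "rmse_log": rmse_log}
    for k in (1, 2, 3):
        metrics[f"a{k}"] = sum(r < 1.25 ** k for r in ratios) / n
    return metrics
```

Lower is better for the four error metrics, while the accuracy metrics approach 1 as the prediction approaches the ground truth.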

F. 3D RECONSTRUCTION OF HUMAN IRISES
After all the networks are trained and tested, they can be used to generate depth estimates on human iris images. With an iris image and the predicted depthmap we can construct a 3D model of the iris. The 3D point-cloud model consists of a list of (x,y,z) coordinates of each 3D point. The x and y coordinates come directly from scaling the position of the pixels in the image, while the z coordinate is related to the depth value. If we use u and v to describe the horizontal and vertical position of a pixel in the image, and x, y and z to describe the 3D position of a point in the point-cloud model, the coordinates of such a point in millimeters are obtained by:
$$x = 13.47\,\frac{u}{W} \qquad (4)$$
$$y = 13.47\,\frac{v}{W} \qquad (5)$$
$$z = 1.936\,\frac{\mathrm{depth}(u,v)}{255} \qquad (6)$$
where W is the image width, and depth(u,v) is the intensity value of the predicted depthmap at the (u,v) coordinate. The constants in (4)-(6) depend on the size of the virtual iris and the distance to the camera. The constant 13.47 in the xy plane is computed assuming a design criterion where a virtual iris of 12.1 mm in diameter uses 230 pixels in the rendered image. Therefore, 256 pixels are 13.47 mm. The constant 1.936 mm is the maximum depth equivalent to a variation of 255 levels in the depthmap. Then, a 3D mesh model is formed by connecting neighboring points in the point cloud. As a result, two 3D model representations are formed, and they are compatible with our previous SfM approach [10]. These models can easily be sliced and compared with OCT scans.
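The conversion from a predicted depthmap to a point cloud can be sketched as below. This is an illustration under assumptions: the text only states that x and y are scaled pixel positions and that z is related to the depth value, so the origin at pixel (0, 0), the sign of z, and the function name are choices made here.

```python
def depthmap_to_pointcloud(depth, width_mm=13.47, depth_range_mm=1.936):
    """Convert a depthmap (list of rows of 0-255 levels) to (x, y, z) in mm.

    width_mm: physical width covered by the full image width W (13.47 mm
    for 256 pixels, from a 12.1 mm iris spanning 230 pixels).
    depth_range_mm: depth spanned by 255 gray levels.
    Assumption: the origin sits at pixel (0, 0) and z grows with the level.
    """
    if not depth or not depth[0]:
        raise ValueError("depthmap must be a non-empty 2D grid")
    w = len(depth[0])  # image width W, in pixels
    points = []
    for v, row in enumerate(depth):
        for u, level in enumerate(row):
            x = width_mm * u / w
            y = width_mm * v / w
            z = depth_range_mm * level / 255
            points.append((x, y, z))
    return points
```

Because every pixel yields one point, the resulting cloud is evenly sampled in x and y, which is what allows a mesh to be built by connecting neighboring points.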

G. DEPTH EVALUATION WITH OCT SCANS
For one subject in the test dataset, we acquired four Anterior-Segment OCT slices of each eye, using the Visante™ OCT system [40]. These 8 OCT slices provide a ground truth for the evaluation of depth estimation from real iris images. First, we normalized the scale of the OCTs and rotated them so that the intersections of the cornea and the iris lay on a horizontal line. Then we used Canny edge detection to obtain the positions of the points on the iris surface. Figure 8 shows an example of the OCT with its corresponding iris surface in red. After that, one 3D model was estimated for each iris using real images of the same subject, and using the trained CNNs. We also produced one 3D model for each iris using the SfM 3D-iris-scanning method described in [10].
Then we sliced each 3D model using the same angles as in the available OCTs: 0°, 45°, 90° and 145°. To compensate for dilation differences between the OCTs and the iris images, we transformed the 3D model slices linearly to match the beginning and ending points of the irises. Finally, we compared each 3D model slice with the corresponding OCT slice, and measured the mean absolute error (MAE). The scale information on the OCT scans allowed us to calculate MAE in micrometers.
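The slice comparison can be sketched as follows. This is one plausible reading of the linear alignment and not the authors' implementation: the radial positions of the model slice are rescaled so that its endpoints coincide with those of the OCT slice, the model heights are interpolated at the OCT sample positions, and the MAE is taken. All function names are hypothetical.

```python
from bisect import bisect_right


def interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_right(xs, x)
    t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return ys[j - 1] + t * (ys[j] - ys[j - 1])


def aligned_mae(model_r, model_z, oct_r, oct_z):
    """MAE between a 3D-model slice and an OCT slice after linear alignment.

    model_r, oct_r: increasing positions along the slice.
    model_z, oct_z: surface heights (e.g. in micrometers) at those positions.
    """
    scale = (oct_r[-1] - oct_r[0]) / (model_r[-1] - model_r[0])
    # Map the model slice so its first and last points match the OCT's.
    mapped_r = [oct_r[0] + (r - model_r[0]) * scale for r in model_r]
    errors = [abs(interp(r, mapped_r, model_z) - z)
              for r, z in zip(oct_r, oct_z)]
    return sum(errors) / len(errors)
```

When the heights are expressed in micrometers, as the OCT scale information allows, the returned MAE is in micrometers as well.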

H. RESOLUTION ASSESSMENT
We assessed the minimum depth that we could detect with our method, as well as the amount of error on all three axes. For the analysis, we manufactured and scanned 3D patterns of known dimensions. We 3D printed real truncated step pyramids of various step heights, as shown in Figure 9a. The x and y dimensions of every step are fixed, and the step height Z varies from 25 µm to 500 µm in increments of 25 µm. In total, we manufactured 20 pyramids for training and 5 for testing, using the FORMLABS FORM-2 stereolithography 3D printer. We set the 3D printer to its best resolution, which is 25 µm per layer. Then we trained our irisDepth network with images of the 3D patterns. We used the same architecture and the same training scheme described in the Methodology, in subsection D. In this way, we used real images, as well as synthetic images with aligned depthmaps.
For the real pyramid image dataset, we used the same device and setup that was used for the iris images to assess the depth performance of our method. Figure 9b shows one image captured under these conditions as an example. We captured 360 images of the 20 real step pyramid printed 3D models, which included 6 different angles on the z axis and 3 angles on the y axis. We augmented the data using translation and scaling to produce a total of 7,200 images. For the synthetic dataset, we used Blender to create 20 virtual pyramids with similar characteristics to those of the 3D printed ones. Then, we simulated the same LFVL illumination as was used in the synthetic iris dataset. Figure 9c shows an example of a synthetic image, and Figure 9d shows its corresponding depthmap. Using 3D data augmentation, we included 45 different angles from the z axis, and 6 angles from the y axis, rendering 5,400 synthetic images with aligned depthmaps. Then, using 2D data augmentation of 6 random translations and scales, we obtained 32,400 synthetic images. Finally, we partitioned the image dataset into 80% (25,920) for training, 10% (3,240) for validation, and 10% (3,240) for testing.
We followed the same procedure for training our irisDepth network with the real and synthetic pyramid datasets as was used for real and synthetic irises. Using the trained irisDepth network, we reconstructed five 3D models from images of the real truncated pyramids, one for each of the five test pyramids, whose step heights lie within the manufactured range of 25 µm to 500 µm. We then measured the size of each step in the reconstructed pyramids along the x and y axes. Figure 10a shows a reconstructed 3D pattern, and Figure 10b and Figure 10c show the segmented version of the 3D pattern in Figure 10a. After that, we measured the average z value, as well as the standard deviation (STD), of the 3D points that form each step. Figure 10c shows the height of each step, and the mean step size Z of the 3D model. We determined the measurement errors on each axis, using the absolute difference between the measured step on the 3D model and the measured step on the ground truth. The ground truth values (Z_GT) were measured using a Mitutoyo 293-330 micrometer on the real truncated pyramids. The precision of the ground truth measurements is given by the micrometer precision, which is ±1 µm.
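The step measurement described above can be sketched as follows. This is a minimal example under stated assumptions: the 3D points have already been segmented into steps, only their z values (in micrometers) are needed, and the population standard deviation is used (the text does not say which estimator was applied). The function name `step_statistics` is hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def step_statistics(steps_z_um: Sequence[Sequence[float]], step_gt_um: float) -> Dict[str, float]:
    """Summarize a reconstructed step pyramid.

    steps_z_um: one sequence of z values per segmented step, ordered bottom to top.
    step_gt_um: ground-truth step size measured on the printed pyramid.
    """
    levels = [mean(zs) for zs in steps_z_um]     # average height of each step
    spreads = [pstdev(zs) for zs in steps_z_um]  # deviation of each step from a flat plane
    sizes = [hi - lo for lo, hi in zip(levels, levels[1:])]  # height differences between steps
    mean_step = mean(sizes)
    return {
        "mean_step_um": mean_step,
        "mean_std_um": mean(spreads),
        "z_err_um": abs(mean_step - step_gt_um),
    }
```

The mean standard deviation plays the role of the noise floor: a step smaller than this value could not be told apart from the scatter of the points on a single step.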

I. 3D RUBBER SHEET MODEL PROOF OF CONCEPT
As indicated in the Introduction, a 3D model of the human iris could be used in the future to improve accuracy in iris recognition. In this paper we explore a proof of concept of constructing a 3D rubber sheet from the 3D iris. Additionally, we evaluate iris recognition performance in the test set of 12 subjects.
To build the 3D rubber sheet model, we applied a slicing procedure at regular angular intervals, as described in the Methodology section, in subsection F. Each slice is a 2D curve that represents the relief of the iris in the radial direction. If the radial axis of the slices is normalized between 0 and 1, the 3D rubber sheet is resilient to dilation within certain ranges, as is the case with 2D rubber sheet models. The slices obtained at different angles are then concatenated linearly to form a 3D structure. We built the 3D rubber sheet of the same subject used in the OCT test. We tested the similarity of a regular rubber sheet obtained from a 2D image [3] with the flattened version of the 3D rubber sheet. We measured the similarity separately using MAE, the zero-mean normalized cross-correlation (ZNCC) [41], and the Hamming Distance (HD) [2] of the iris codes from both rubber sheets. A close similarity would indicate that our 3D models contain the same information on the xy plane as a 2D iris image, with additional information available on the z axis to be exploited.
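The construction of the 3D rubber sheet from radial slices can be sketched as follows. This is a simplified illustration, assuming each slice is a list of (radius, height) pairs sorted by strictly increasing radius from the pupil boundary to the outer iris boundary; the function names are hypothetical. Normalizing the radius to [0, 1] before resampling is what makes two slices of the same iris at different dilations comparable.

```python
from typing import List, Sequence, Tuple

Slice = Sequence[Tuple[float, float]]  # (radius, height), sorted by radius


def _height_at(profile: Slice, r: float) -> float:
    """Piecewise-linear height at radius r."""
    for (ra, za), (rb, zb) in zip(profile, profile[1:]):
        if r <= rb:
            return za + (r - ra) / (rb - ra) * (zb - za)
    return profile[-1][1]


def resample_slice(profile: Slice, n_samples: int) -> List[float]:
    """Sample the slice at n_samples evenly spaced normalized radii in [0, 1]."""
    r_start, r_end = profile[0][0], profile[-1][0]
    heights = []
    for i in range(n_samples):
        t = i / (n_samples - 1)  # normalized radius
        heights.append(_height_at(profile, r_start + t * (r_end - r_start)))
    return heights


def build_rubber_sheet(slices: Sequence[Slice], n_samples: int = 75) -> List[List[float]]:
    """Stack one resampled row per angular slice: rows are angles, columns are radii."""
    return [resample_slice(s, n_samples) for s in slices]
```

With 360 slices and 75 radial samples, the resulting sheet holds 27,000 height values.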
TABLE 1. Designed and measured step sizes as well as estimated errors on the x, y and z axes, for the 3D truncated pyramids.
A preliminary 3D iris recognition method was implemented using a 3D rubber sheet model to extract 3D keypoints and descriptors, and to compare their distances. For this purpose, we constructed 480 3D rubber sheet models using 20 images per eye of the 12 subjects in the test set. We enrolled the 20 images with the dilation level closest to the median value of the subject, as recommended by Ortiz et al. [42]. We constructed the 3D rubber sheet models using 75 samples on the radial axis, and 360 slices on the angular axis. Our 3D rubber sheets, therefore, contain 75 × 360 = 27,000 3D points. Our proposed method for iris recognition in 3D has the following steps: First, we sample the 3D rubber sheet model with a 4 × 15 grid to find 60 keypoints. Then, we obtain the Spin Image descriptor [43], [44] for each keypoint. Finally, we assess the similarity of two 3D rubber sheet models as the average ZNCC [41] between corresponding Spin Images on the sampling grid. As with the 2D iris code, we account for small angular displacements by translating the 3D rubber sheet ±5° and storing the best result [2].
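The matching step can be illustrated with the sketch below. It is not the method described above: the Spin Image descriptors are replaced by raw rows of the rubber sheet to keep the example short, and one row is assumed to correspond to 1° on the angular axis. Only the ZNCC score and the search over angular shifts are shown, and the function names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def zncc(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat signal: correlation is undefined, treated here as no match
    return sum(x * y for x, y in zip(da, db)) / denom


def best_shifted_score(sheet_a: Sequence[Sequence[float]],
                       sheet_b: Sequence[Sequence[float]],
                       max_shift: int = 5) -> float:
    """Average row-wise ZNCC, maximized over circular angular shifts of sheet_b."""
    n = len(sheet_b)
    best = -1.0
    for shift in range(-max_shift, max_shift + 1):
        shifted = [sheet_b[(i + shift) % n] for i in range(n)]
        score = sum(zncc(ra, rb) for ra, rb in zip(sheet_a, shifted)) / n
        best = max(best, score)
    return best
```

Keeping the best score over the shifts makes the comparison tolerant to small rotations of the eye between acquisitions.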
We compared the iris recognition performance of our proposed 3D method with that of the 2D iris code. For this purpose, we obtained the 2D rubber sheets and iris codes of the 480 images in the test set using Osiris V4.1 [45]. We then used the d′ index to score iris recognition performance [2]. This index shows how well we can separate intra-class from inter-class comparisons, and it is computed using:
$$d' = \frac{|\mu_1 - \mu_2|}{\sqrt{(\sigma_1^2 + \sigma_2^2)/2}}$$
where µ1 and µ2 are the mean values of the intra-class and inter-class distributions, respectively, and σ1 and σ2 are the standard deviations (STD) of both distributions. The higher the d′ value, the easier it is to separate intra-class from inter-class distributions.
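The index described above can be computed as in the following sketch. It assumes the commonly used decidability form d′ = |µ1 − µ2| / sqrt((σ1² + σ2²)/2) and the population standard deviation; the function name `decidability` is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import Sequence


def decidability(intra: Sequence[float], inter: Sequence[float]) -> float:
    """Separation between intra-class and inter-class score distributions."""
    mu1, mu2 = mean(intra), mean(inter)
    s1, s2 = pstdev(intra), pstdev(inter)
    return abs(mu1 - mu2) / sqrt((s1 ** 2 + s2 ** 2) / 2)


def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of one index value over another, in percent."""
    return (new / baseline - 1.0) * 100.0
```

The index grows when the two means move apart or when either distribution becomes narrower, so it rewards both effects equally.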

A. RESOLUTION ASSESSMENT
The results on the 3D real truncated pyramids of different step sizes are as follows: Table 1 shows the five sizes for the 3D printed pyramids of the test set, with a designed step size (Z_Design) of 25 µm, 50 µm, 100 µm, 200 µm, and 400 µm. It also lists the step sizes measured with the micrometer, which are the ground truth for our depth measurements (Z_GT); the mean step sizes measured in the 3D reconstructions (Z); the standard deviation of the 3D points that form each step (STD); and the absolute errors measured along the z (Z_err), x (X_err), and y (Y_err) axes.
FIGURE 11. Examples of depth estimation using synthetic images. Each row is a different example. The first two columns are synthetic images and their corresponding ground truth depthmap (Depth GT). The succeeding columns show the outputs of each network. DenseDepth and pix2pix make depth predictions from the synthetic images translated by CycleGAN. T²Net and irisDepth make depth predictions from the results of their own GANs. The second row shows a red circle highlighting an iris feature that can be followed into the corresponding depthmaps.
The results of Table 1 show that the measured step size Z is close to the ground truth value (Z_GT) for all five 3D patterns. The average absolute error on the z axis is 4.1 µm. The standard deviation represents how much the 3D points deviate from a perfect plane [10]. Its average value is 17.7 µm. This means that a feature on the z axis that is smaller than 17.7 µm is within the noise level of the 3D points, whereas larger features can be detected by our system. Therefore, the resolution limit of our method is 17.7 µm. This figure is about 1/30th of the iris thickness [37]. The resolution limit of 17.7 µm is almost twice the 10 µm of conventional OCT scans, and about 1.6 times the 11 µm reported in [10] for SfM, so our method is coarser than both. Nevertheless, our results show a reasonable level of precision from a single 256 × 256 image.
The scale values on the OCT scans, as well as equations (4), (5) and (6), allow estimating the theoretical resolution of our method. According to (4) and (5), a variation of 1 mm on the x or y axis produces a variation of 19 pixels for the 256 × 256 images. Therefore, the resolution of the 3D model on the xy plane is 52.6 µm/px. This figure is around 1/230th of the iris diameter [37] and can be improved by increasing image resolution. For instance, if we used 800 × 800 images, equations (4) and (5) yield a resolution of 16.8 µm/px. A variation of 1 mm on the z axis produces a depth change of 132 on the depth scale between 0 and 255. Therefore, the resolution on the z axis is 7.56 µm. Measurements are therefore 7 times more precise along the z axis than on the xy plane. These figures roughly match those shown in the experimental results of Table 1, where there is almost 10 times more error along the xy plane than on the z axis.
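The arithmetic behind these theoretical resolutions can be checked with the short sketch below, which uses only the constants quoted in the text (13.47 mm across the image width and 1.936 mm of depth range). The number of depth levels is left as a parameter, since this is an assumption: dividing by 256 levels reproduces the 7.56 µm quoted above, while dividing by 255 gives about 7.59 µm.

```python
XY_EXTENT_MM = 13.47  # physical width covered by the image (from the text)
Z_RANGE_MM = 1.936    # depth span of the depthmap gray scale (from the text)


def xy_resolution_um(width_px: int) -> float:
    """Size of one pixel on the xy plane, in micrometers."""
    return XY_EXTENT_MM / width_px * 1000.0


def z_resolution_um(levels: int = 256) -> float:
    """Depth represented by one gray level, in micrometers."""
    return Z_RANGE_MM / levels * 1000.0


if __name__ == "__main__":
    print(f"xy at 256 px: {xy_resolution_um(256):.1f} um/px")
    print(f"xy at 800 px: {xy_resolution_um(800):.1f} um/px")
    print(f"z per level:  {z_resolution_um():.2f} um")
    print(f"xy/z ratio:   {xy_resolution_um(256) / z_resolution_um():.1f}")
```

Because the xy resolution scales inversely with the image width while the z resolution depends only on the gray-level quantization, a larger input image narrows the gap between the two.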

B. DEPTH EVALUATION WITH SYNTHETIC IMAGES
This test illustrates the precision of each network in the depth estimation task. The ground truth in this experiment comes from the depthmaps in the synthetic iris dataset, while the inputs are translated images. Figure 11 shows examples of ground truth synthetic images in the test set, results of translated images, and the depthmaps predicted by the networks. The vanilla networks, such as DenseDepth, DORN, and pix2pix, estimate depth from the photorealistic images produced by CycleGAN. T²Net and irisDepth estimate depth from the output of their own GANs. Figure 11 also illustrates the problem of training a GAN that is blind to depth estimation. The ground truth example in the second row has a concave feature highlighted with a red circle. Since this feature is not reproduced by CycleGAN, neither DenseDepth nor pix2pix can estimate its depth. However, the GANs trained in the full approach learn to reproduce this feature. Both T²Net and irisDepth were able to estimate the depth of this concave feature correctly.
TABLE 2. Similarity using standard metrics between depthmaps predicted from the translated images and depthmaps of the synthetic images in the test dataset of 7,200 images.
The results of the depth evaluation with the 7,200 synthetic images in the test set are presented in Table 2. For the abs_rel, sq_rel, rmse, and rmse_log metrics, a lower value means a better result, while for a_1, a_2 and a_3, a higher value is better [18]-[20]. The accuracy metrics a_n are computed using (2)-(3). The best result of each column is highlighted in bold. Table 2 shows that irisDepth produced the best results on almost all the tests. DenseDepth and DORN also produced good results due to their architectures specialized in depth prediction tasks. IrisDepth produced the best overall results since it combines a GAN that has information on depth data, and the powerful depth prediction architecture of DenseDepth.
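The metrics named above can be sketched as follows. The definitions used here are the ones commonly found in the monocular depth estimation literature and are an assumption about what the table reports: relative errors are normalized by the ground truth, and a_n is the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25^n. All depth values are assumed to be strictly positive, and the function name `depth_metrics` is hypothetical.

```python
from math import log, sqrt
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Error and accuracy metrics between predicted and ground-truth depths."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = sqrt(sum((log(p) - log(g)) ** 2 for p, g in pairs) / n)
    ratios = [max(p / g, g / p) for p, g in pairs]
    metrics = {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "rmse_log": rmse_log}
    for k in (1, 2, 3):
        # a_k: share of pixels within a factor of 1.25**k of the ground truth
        metrics[f"a{k}"] = sum(r < 1.25 ** k for r in ratios) / n
    return metrics
```

The four error metrics penalize the size of the deviation, while the three accuracy metrics only count how many pixels fall within a tolerance, which is why lower is better for the former and higher is better for the latter.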

C. DEPTH EVALUATION WITH OCT SCANS
We also assessed the performance of our method by comparing the generated 3D models against the depth ground truth provided by iris OCT slices. Figure 12 shows the comparison between an iris 3D model slice and the corresponding OCT. Figure 12a shows the OCT image with markings of the ground truth iris surface, and the slice of the iris 3D model. Figure 12b illustrates the angle of the slice and the iris features that are present along this line. Figure 12c shows a spatial comparison of the 3D model with the OCT. This visual comparison illustrates the changes in the 3D model across the profile, and shows how they closely match the OCT.
We then quantified the difference between the ground-truth iris surface in the OCT slices and the corresponding slices of the 3D models produced by both SfM and the different CNNs trained in this work. Figure 13 shows close-up comparisons between OCT slices and the 3D models produced by the different methods. Figure 13 shows that the models produced by DenseDepth, T²Net_full, and irisDepth follow the depth ground truth of the OCT closely. The model produced by SfM closely resembles the ground truth on the left side, but differs significantly on the right side of the iris. For each method, we have the curve of the OCT ground truth, and that of the 3D model slice. We computed the mean absolute error to quantify the error between both curves. We compared the 3D iris models produced by each method for the left and right eyes of the subject with the 8 available OCT slices (4 for each eye). Table 3 shows the mean absolute error in micrometers when comparing each 3D iris model to the ground truth (OCT). The minimum average error of 77 µm was obtained with our model, irisDepth. The typical thickness of the iris is around 500 µm [37], and therefore, the error achieved with the irisDepth method is about 15% of the thickness. Figure 13 also shows that irisDepth is the method that follows the ground truth most closely. SfM ranked second to last, and its error is 60% greater than that of the best CNN method (irisDepth). This indicates that the CNN irisDepth produces a more accurate 3D model from a single image than was achieved with SfM from multiple images.

D. 3D RECONSTRUCTION OF HUMAN IRISES
We produced pointcloud and mesh 3D models of the subjects in the test set using irisDepth. Figure 14 shows examples of 3D mesh models for five different subjects. For the purpose of appreciating the 3D information in a 2D image, a red-green grid was drawn on the surface of the 3D model. In this way, deformations in the grid illustrate depth variations across the iris surface. This figure also shows the estimation of the 3D information performed by irisDepth from a single image of the human iris. The pointcloud models produce depth predictions from every pixel in the image. At a resolution of 192 × 192, the models have 36,864 3D points, and at 256 × 256 pixels, there are 65,536 3D points. In contrast, the SfM method reported an average production of 11,005 3D points [10]. Therefore, our CNN approach has more information available for producing the 3D model of the iris compared to that of the SfM approach.
Our results show that there are advantages to using CNNs over SfM for 3D iris model generation. Besides requiring multiple images at a greater resolution, SfM has problems producing 3D points in areas of the iris that have no significant texture. In contrast, the CNN models produce a uniform distribution of points regardless of iris texture. The number of 3D points obtained by CNNs is always constant, and it can be 6 times greater than that of SfM. Additionally, artifacts such as lateral reflections produced noisy points in the SfM model. One of the main advantages of our proposed method is that CNNs require only a single image for the 3D model estimation. This saves acquisition and processing time, as well as storage space. Acquisition time matters to the subjects who use biometric applications. The SfM approach [10] requires capturing a burst of about 10 images per camera position for the 3D model construction. A set of one hundred 16 Mpx images, therefore, is typically used to reconstruct a single 3D model. Consequently, obtaining a 3D model from a single image is a significant improvement.
[...] Figure 12b and the flattened version of the 3D rubber sheet in Figure 15. A mask was used in the comparison to avoid the effects of eyelids, eyelashes, and reflections [2].

E. RUBBER SHEET MODEL AND 3D IRIS RECOGNITION PROOF OF CONCEPT
We reconstructed the 3D rubber sheet from the 3D model in Figure 12c by obtaining one 2D slice every 1°. The 3D rubber sheet is shown in Figure 15. The 3D rubber sheet captures the color information of the 2D image, as well as the depth of the iris. Just like a 2D rubber sheet, this is a representation of the human iris that normalizes dilation changes in a linear manner [2].
We then compared the rubber sheet from the iris image of Figure 12b with the projection of the 3D rubber sheet of Figure 15 onto the xy plane. Figure 16 shows the resulting rubber sheets, as well as the mask of eyelids, eyelashes, and reflections. This mask was used to ensure that those artifacts would not affect the comparison. The resulting MAE value for the comparison is 0.0313; ZNCC is 0.9385; and HD is 0.226. These values indicate a small error and a large correlation between the two images. This means that the reconstructions of the 3D model and the 3D rubber sheet preserve the information along the xy plane with a small error. Additionally, the low HD corresponds to a genuine match in biometric tests. For context, in a previous work, we found that the mean intraclass HD of LFVL images is 0.243, while that of the interclass distribution is 0.48 [15]. Therefore, the HD value of 0.226 falls in the range of two different images from the same individual.
[...] [45]. (b) Using the proposed 3D rubber sheet model, with Spin Image Descriptors [43], [44] and ZNCC [41].
The results of the 3D iris recognition are presented in Figure 17, which shows the iris recognition performance of the 3D rubber sheet compared to that of the 2D iris code in our test set of 12 subjects and 480 images. The distributions in Figure 17 are normalized so that they have an area of 1. The results with the 2D iris code yielded a d′ of 8.51, using Osiris V4.1. The 3D rubber sheet achieved a d′ of 12.63, which is 48% higher. The mean value of the inter-class distribution is similar for both methods, with a value of approximately 0.45. However, the mean value of the intra-class distribution is 0.111 units less for the proposed 3D method. The results of this preliminary test suggest that the 3D characteristics extracted from the human iris are more discriminative than the 2D iris code.
The preliminary results of iris recognition in the test set of 12 subjects, along with the proof of concept of the rubber sheet model, and the depth evaluation tests with 8 ground-truth OCT slices of one subject, illustrate the capabilities of the proposed method to reconstruct the surface of the human iris, and its applications in iris recognition. The tests with stepped pyramids of known dimensions establish the smallest depth variation that our method can resolve. All these evaluations show that our method can reconstruct a 3D model of the human iris with good performance.

V. CONCLUSIONS
Our proposed method for 3D iris model estimation from a single image produced complete 3D representations of the human iris using CNNs. Our method, irisDepth, combines the GAN part of a pre-trained T²Net with the depth prediction of DenseDepth. Therefore, the GAN is not blind to depth information during training, and the depth prediction is more powerful than that of T²Net alone. IrisDepth produced the best performance among the trained networks in both the synthetic and real iris tests. We used a dataset of 120 subjects, of whom 96 were randomly selected for training, 12 for validation and 12 for testing. There are 20,940 training images, 2,700 validation images and 2,880 testing images. We also used a dataset of 72,000 synthetic iris images. Both datasets used lateral illumination of the iris (LFVL) to enhance the shadows produced by iris features [15]. Thus, lateral illumination allowed the networks to relate shadows in RGB images to depth information.
We validated the results of our method for modeling the human iris by comparing slices of the 3D models with corresponding OCT slices of both eyes of one subject. The overall shape of the 3D models matches that of the OCT. Our method produced 65,536 3D points, with an absolute error of 77 µm on average. These numbers represent 6 times more 3D points than previous 3D iris models based on SfM [10], whose average error was 60% greater. We proposed a 3D rubber sheet model proof of concept, which had a 0.9385 correlation with a 2D rubber sheet on the xy plane, and additional information on the z axis to be exploited. On a preliminary test with 480 images, the proposed 3D rubber sheet model increased iris recognition performance by 48% with respect to the standard 2D iris code [2]. Finally, the resolution of our method is 17.7 µm, as was measured by scanning 3D pyramids of known dimensions. This is roughly 1/30th of the iris thickness.
A 3D model of the iris may open research lines in iris recognition and ophthalmology. In addition to increasing accuracy in iris recognition [11], obtaining 3D information of the iris could help in extreme pose detection [46]-[50]. Additionally, a 3D model of the iris could produce information similar to that of an OCT, which could help ophthalmologists in the detection of closure angle glaucoma [10], [14].
Future improvements could increase the precision of our method. First, modifying the architecture to train with OCT slices or OCT based 3D models would produce 3D iris models that correlate more closely with actual OCT scans. Also, although CNN and SfM are traditionally used separately, a combination of them could yield a more robust method [28].
The CNN prediction could be the starting point for SfM, which could output more 3D points from several views at a higher resolution, thus improving the 3D model [12], [13].