Towards an Efficient Segmentation Algorithm for Near-Infrared Eye Images

Semantic segmentation has been widely used for several applications, including the detection of eye structures. This detection is used in tasks such as eye-tracking and gaze estimation, which are useful techniques for human-computer interfaces, saliency detection, and Virtual Reality (VR), amongst others. Most state-of-the-art techniques achieve high accuracy but with a considerable number of parameters. This article explores alternatives for improving the efficiency of the state-of-the-art method, namely DenseNet Tiramisu, when applied to NIR image segmentation. This task is not trivial; reducing the number of blocks and layers also affects the number of feature maps. The growth rate (k) of the feature maps regulates how much new information each layer contributes to the global state; therefore, the trade-off amongst growth rate (k), IOU, and the number of layers needs to be carefully studied. The main goal is to achieve a lightweight and efficient network with fewer parameters than traditional architectures so that it can be used for mobile device applications. As a result, a DenseNet with only three blocks and ten layers is proposed (DenseNet10). Experiments show that this network achieved higher IOU rates than the Encoder-Decoder, DenseNet56-67-103, Mask R-CNN, and DeepLabV3+ models on the Facebook database. Furthermore, this method reached 8th place in the Facebook semantic segmentation challenge with a 0.94293 mean IOU and 202,084 parameters, for a final score of 0.97147. This score is only 0.001 lower than that of the first place in the competition. The sclera was identified as the most challenging structure to segment.


I. INTRODUCTION
Biometric recognition is gradually becoming part of our daily life thanks to the latest advances in sensor technology for capturing biometric data and the development of biometric recognition systems for smart devices, Virtual Lenses (VL), and Augmented Reality (AR). However, all these technologies present a common challenge: the correct estimation of the structures of the eye, such as the pupil, iris, and left and right sclera. Finding these structures is not a trivial task because of the presence of highlights, make-up, contact lenses, and other artifacts. This challenge has been studied using semantic segmentation algorithms, reaching high performance but with complex models that have a massive number of parameters. Biometric applications will require these kinds of trained models to run on smaller devices and hardware; therefore, it is also relevant to work towards efficient and lighter algorithms. Such efficient models may, in part, be used to estimate and measure where an individual is looking, where his/her eyes are, and the state of his/her eyes in a VR device, using a near-infrared spectrum camera. This application is colloquially known as 'eye-tracking' [1].
Tracking the position and orientation of the eye, as well as its gaze, can be useful for improving Virtual reality devices. For instance, the tracking information can be used to develop new displays and rendering architectures that could substantially alleviate the power and computational requirements to render 3D environments. Furthermore, eye-tracking enables gaze prediction, and from that, the intentions of the user may be inferred, allowing a more intuitive and immersive experience.
These applications require reliable eye-tracking systems that may be used under all environmental conditions and within the power and computational constraints imposed by Virtual Reality devices.
In order to estimate the gaze, it is relevant to know the position of the eyes and the center of the pupil. This is a hard task when using traditional NIR eye images and is even more complicated when the images come from VR devices, due to make-up, highlights, and reflections coming from the lights. A commonly used approach to estimate gaze is to segment parts of the eye, such as the pupil, iris, and sclera. However, image segmentation is a challenging problem that has arisen in several areas of computer vision.
State-of-the-art semantic segmentation algorithms have mainly been trained to localize very complex urban objects such as cars, buildings, and people, not for biometric gaze applications; moreover, more than one object class can be found in the same image. Traditional pre-trained models reach very low results when directly applied to eye segmentation and gaze estimation. Another limitation is that most state-of-the-art segmentation algorithms are based on deep convolutional networks with a large number of layers and parameters. Although they achieve high accuracy, they cannot be used on mobile devices such as VR lenses.
This article is an extension of our previous work reported in [2]. In that work, the results obtained in the Facebook OpenEDS challenge were reported. The goal was to achieve an accurate and efficient segmentation algorithm for NIR eye images taken from Virtual Reality (VR) lenses. As a result, a lighter version of a semantic segmentation algorithm that is able to discriminate among the pupil, iris, sclera, and eye background was proposed. This efficient architecture can be used for mobile device applications. As a complement, this article reports in more detail the DenseNet10 architecture proposed as an efficient algorithm for the segmentation of NIR eye images taken from VR lenses. A full explanation of the reasoning that led us to this architecture is given, since the reduction of layers in the architecture is not trivial. A full description of each part and new figures of the modified architecture are reported (feature extractor, down-sampling, and up-sampling). The comparison with U-Net and with Mask R-CNN using ResNet 50 and 101 backbones was also explored. In order to understand the segmentation performance across the different parts (classes) of the eye (sclera, iris, pupil, and background), a set of experiments was also performed using Mask-RCNN50, Mask-RCNN101, and our proposed DenseNet10 model on each class separately. As an additional contribution, this article includes experiments with several state-of-the-art algorithms that were implemented and tested using the NIR eye images provided by the Facebook competition. All these experiments are compared with those previously reported in our conference paper.
The remainder of the paper is structured as follows: A literature review for semantic segmentation is presented in Section II. The proposed segmentation method is described in Section III. The database and evaluation metrics are presented in Section IV. Results and conclusions are reported in Sections V and VI, respectively.

II. STATE OF THE ART
This section reviews the most used computer vision techniques applied to image analysis and the state of the art in semantic segmentation algorithms.

A. SEGMENTATION NETWORKS
Computer vision techniques have greatly improved in recent years, mainly due to the development of neural network techniques such as deep learning [3]. Deep learning algorithms have been applied to a wide range of fields, such as agriculture, medical image analysis, biometrics, scene understanding, and autonomous driving, amongst others [4]-[7]. The most common applications of deep learning for image analysis are image classification, object detection, semantic segmentation, and instance segmentation [8]-[11]. Image classification has the goal of categorising an image into a particular class, returning the corresponding label and the classification confidence rate. In object detection, the goal is to localise regions of interest and classify each one individually, resulting in a class label for each object in the image along with its coordinates, denoted by a bounding box. Semantic segmentation has the purpose of classifying each pixel of the image and grouping pixels according to their class. The task of instance segmentation can be thought of as object detection with the addition of semantic segmentation, where the goal is to detect each object in the image and classify each pixel of every instance. In contrast to object detection, the output of an instance segmentation algorithm is a mask around the object of interest instead of a bounding box.
Several deep learning techniques have been reported in the literature to address the semantic segmentation problem. For instance, Fully Convolutional Networks (FCNs) [12], [13] were introduced as a natural extension of CNNs to tackle per-pixel prediction problems such as semantic image segmentation. FCNs add upsampling layers to CNNs to recover the spatial resolution of the input at the output layer. As a result, FCNs can process images of arbitrary sizes. To compensate for the resolution loss caused by pooling layers, FCNs use skip connections between the downsampling and upsampling paths. These skip connections help the upsampling path recover fine-grained information from the downsampling layers.
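As a rough illustration of this pattern, the following minimal Keras sketch combines a downsampling path, learned upsampling, and a skip connection. It is not the network used in this work; the layer widths are arbitrary assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_fcn(num_classes, input_shape=(None, None, 1)):
    # Downsampling path: convolutions and pooling reduce resolution.
    inp = layers.Input(shape=input_shape)
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(c1)                       # 1/2 resolution
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)                       # 1/4 resolution
    # Upsampling path: transposed convolutions recover resolution;
    # the skip connection reinjects fine detail lost to pooling.
    u1 = layers.Conv2DTranspose(64, 3, strides=2, padding="same")(p2)
    u1 = layers.Concatenate()([u1, layers.Conv2D(64, 1)(p1)])
    u2 = layers.Conv2DTranspose(32, 3, strides=2, padding="same")(u1)
    out = layers.Conv2D(num_classes, 1, activation="softmax")(u2)
    return tf.keras.Model(inp, out)
```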
Among the CNN architectures extended like FCNs for semantic segmentation purposes, Residual Networks (ResNets) [14] are one of the most relevant approaches. ResNets are designed to ease the training of very deep networks (of hundreds of layers) by introducing a residual block that adds two signals: a non-linear transformation of the input and its identity mapping. The identity mapping is implemented by means of a shortcut connection. ResNets have been extended to work as FCNs [12], yielding very good results in different segmentation benchmarks. ResNets add additional paths (shortcut paths) to FCNs and thus increase the number of connections within a segmentation network. These additional shortcut paths have been shown not only to improve the segmentation accuracy but also to help the network optimization process, resulting in faster training.
The Encoder-Decoder [15] is known as the SegNet architecture. The novelty of SegNet lies in the manner in which the decoder up-samples its lower-resolution input feature map(s) coming from the encoding stage. Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear up-sampling. This characteristic eliminates the need for learning to upsample. The up-sampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. An Encoder-Decoder model with a fully convolutional, skip-connection architecture was implemented. This model is based on the Encoder-Decoder architecture taken from George Seif's Semantic Segmentation Suite (https://github.com/GeorgeSeif/Semantic-Segmentation-Suite), which is in turn based on SegNet. The U-Net [16] is a convolutional network architecture for fast and precise segmentation of images. It has been used in the ISBI challenge for the segmentation of neuronal structures in electron microscopy stacks [17]. U-Net was implemented by Ronneberger et al. [16] to supplement a usual contracting network with successive layers where pooling operators are replaced by upsampling operators; hence, these layers increase the resolution of the output. In order to localize, high-resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.
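SegNet's pooling-indices trick described above can be sketched with TensorFlow's max_pool_with_argmax. This is a minimal illustration of the idea, not the SegNet reference code; it assumes include_batch_in_index=True so the recorded indices address the flattened batch.

```python
import tensorflow as tf

def max_unpool_2x(pooled, argmax, output_shape):
    # Scatter each pooled value back to the position recorded during
    # max-pooling; all other positions stay zero, yielding the sparse
    # map that subsequent trainable filters densify.
    out_shape = tf.cast(output_shape, tf.int64)
    flat_vals = tf.reshape(pooled, [-1])
    flat_idx = tf.reshape(argmax, [-1, 1])
    total = tf.reduce_prod(out_shape)
    out = tf.scatter_nd(flat_idx, flat_vals, tf.expand_dims(total, 0))
    return tf.reshape(out, output_shape)

x = tf.random.normal([1, 8, 8, 16])
pooled, argmax = tf.nn.max_pool_with_argmax(
    x, ksize=2, strides=2, padding="SAME", include_batch_in_index=True)
restored = max_unpool_2x(pooled, argmax, tf.shape(x))  # back to (1, 8, 8, 16)
```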
Chen et al. [18] proposed DeepLabv3+ as an extension of the previous DeepLabv3 [19] by adding a simple yet effective decoder module to recover object boundaries. The rich semantic information is encoded in the output of DeepLabv3, with atrous convolution allowing control of the density of the encoder features depending on the budget of computational resources. Furthermore, the decoder module allows detailed object boundary recovery. DeepLabV3+ [20] is a cutting-edge architecture for semantic segmentation. This architecture is able to do multi-scale processing without increasing the number of parameters. DeepLabV3+ adds an intermediate decoder module on top. After processing the information via DeepLabV3+, the features are up-sampled N times. The features are then further processed along with the original features from the front-end before being up-scaled again N times. This eases the load at the end of the network and provides a shortcut path from the feature-extraction front-end to the near end of the network. However, DeepLabV3+ uses ResNet101 as a backbone, which is pre-trained on the ImageNet database [21].
He et al. recently introduced the Mask R-CNN algorithm [22]. It is based on the Faster R-CNN method reported by Ren et al. [13]. Mask R-CNN adds a branch for predicting segmentation masks on each Region of Interest (RoI). This branch is applied in parallel with the existing branch for classification and bounding-box regression. The mask branch is a small fully convolutional network applied to each region of interest, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation. However, this framework takes 48 hours on an 8 GB GPU to be trained. Different backbones, such as ResNet101 (60 million parameters) and ResNet50 (40 million parameters), have been used.
Very recently, some authors [23]-[27] have published improvements to make semantic segmentation more feasible for mobile devices. Those papers also used the OpenEDS dataset from the Facebook competition, creating a new test set and reaching very competitive results. However, these test sets differ from the one used in the official competition.
Densely Connected Convolutional Networks (DenseNets) [20], [28] are built from dense blocks and pooling operations, where each dense block is an iterative concatenation of previous feature maps. The pooling layers (average, sum, and mean pooling) are used mainly to down-sample the volumes spatially and to reduce the feature maps of previous layers. Mean-pooling is useful for two main reasons: (a) by eliminating maximal and minimum values, it reduces the computation for upper layers; and (b) it provides a form of translation invariance.
The DenseNet architecture can be seen as an improvement over ResNets, which perform iterative summation of previous feature maps. However, DenseNets are more parameter-efficient and perform deep supervision thanks to short paths to all feature maps in the architecture. All layers can be accessed from their previous layers, making it easy to reuse the information from previously computed feature maps. DenseNet56/67 and DenseNet103 were designed to represent very complex features in order to segment city elements such as cars, people, buildings, and others. Thus, a simpler architecture with fewer dense blocks is proposed here.
The characteristics of DenseNets mentioned above make them a perfect fit for semantic segmentation, as they naturally induce skip connections and multi-scale supervision. However, their large number of parameters limits their use, specifically in applications such as eye segmentation from VR-lens images. In order to achieve a more efficient architecture, a novel implementation is proposed in the following section.

III. TOWARDS AN EFFICIENT SEGMENTATION ALGORITHM
In order to achieve an efficient segmentation algorithm, several state-of-the-art methods were implemented and modified accordingly to be applied to NIR eye images (Section III-A). The results were evaluated in terms of accuracy and efficiency (see Table 3). As an improvement, a modified DenseNet10 architecture is proposed in Section III-B.

A. IMPLEMENTATION OF STATE OF THE ART ALGORITHMS FOR VR-NIR EYE IMAGE SEGMENTATION

1) ENCODER-DECODER
The Encoder-Decoder architecture proposed by [15] was implemented. In order to customize it to the NIR eye segmentation problem and to reduce the number of parameters (from 29.5 to 19.5 million), the following modifications were made:
• Reducing the number of layers: The original architecture has 36 layers (18 for the encoder and 18 for the decoder). In this work, both the encoder and the decoder were reduced to 12 layers. The layers were applied directly, using skip connections between layers similar to the original implementation.
• Finding the best parameter set: A grid search was used to find the parameter values (filter size, dropout rate, and batch size) of the Encoder-Decoder model that yield the best results. Several filter sizes, called A and B, were tested in some of the layers; the optimal values were A = 300 and B = 500, respectively. A dropout rate of 0.1 and a batch size of 1 were chosen. Despite the improvement (from 29.5 to 19.5 million parameters), this model is still too large to be implemented in real-time systems.

2) DeepLabV3+
The method proposed by [20] was implemented with the following changes:
• Block reduction: The last two of the original seven blocks were deleted. This affects the filter size, the atrous convolution rate, and the spatial pyramid pooling rate.
• The atrous convolution rate was applied from block two up to block five, with dilation rates of 2 up to 16 across the network. The image size of 640 × 320 was reduced to 80 × 40. This is the minimum rate that can be used without affecting the accuracy of the results; lower resolutions affect the segmentation of pixels between classes.
• Filter dilation size: Each convolution uses a different dilation rate to capture multi-scale context, from 16, 8, and 4 down to 2 (see the sketch after this list).
• Stride: The output stride was reduced from 16 to 8. The number of parameters was reduced from 40 to 31 million. The main reason for this modest reduction is the pre-trained backbone used by the method. Such a reduction is not enough for real-time system applications.
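To make the dilation-rate idea concrete, here is a minimal Keras sketch of parallel atrous (dilated) convolutions at rates 2, 4, 8, and 16, concatenated ASPP-style. It illustrates the mechanism only; the filter count is an arbitrary assumption and this is not the DeepLabV3+ implementation used here.

```python
from tensorflow.keras import layers

def atrous_pyramid(x, filters=64):
    # Each branch sees the same input at a different dilation rate,
    # capturing context at multiple scales with no extra parameters
    # compared to an ordinary 3x3 convolution of the same width.
    branches = [
        layers.Conv2D(filters, 3, padding="same", dilation_rate=r,
                      activation="relu")(x)
        for r in (2, 4, 8, 16)
    ]
    return layers.Concatenate()(branches)
```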

3) U-NET
Unlike the previous architectures, U-Net has the advantage of extracting features in the downsampling path without using a pre-trained model such as ResNet or VGG. It comprises ten layers: five in the downsampling path and five in the upsampling path. The main changes to the original architecture proposed to improve efficiency are summarised in Table 1. The number of parameters was reduced from 7.6 million to 6.9 million, which is still far above the Facebook challenge baseline requirement of 416,000 parameters.

4) DenseNet 56/103
Two traditional DenseNet architectures were implemented, with 56 and 103 layers respectively. These networks were designed to represent very complex features in order to segment city elements such as cars, people, buildings, and others. These models (DenseNet56 and DenseNet103) use 1.5 and 9.5 million parameters respectively, which is also impracticable for real-time systems. The main changes made to the architecture implemented by [20] to reduce the number of parameters were as follows:
• Growth rate (k): The growth rate (k) is the number of additional feature maps contributed by each layer. The original implementation has growth rates (k) of 40 and 16. Several values of k were explored, achieving the best results with k = 8 and k = 12 for DenseNet103 and DenseNet56 respectively. Reducing the growth rate implies a reduction in the number of features used and, therefore, in the number of parameters of the model. This reduction of features is possible thanks to the high proportion of black background in VR-NIR eye images.
DenseNet is a powerful architecture for the semantic segmentation task. However, it uses a huge number of layers (from 56 up to 103) and requires several GPUs to be trained. Reducing the number of layers is not straightforward, since each layer adds k feature maps of its own to the final state. The growth rate regulates how much new information each layer contributes to the global state; therefore, the trade-off between growth rate (k), IOU, and the number of layers needs to be carefully studied.
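As a quick numeric illustration of this trade-off: the input to layer ι of a dense block sees k0 + k × (ι − 1) feature maps (the formula given in Section III-B). The value k0 = 48 below is an arbitrary illustrative assumption, not a value from the paper.

```python
# Feature-map count entering layer l of a dense block with k0 input
# channels and growth rate k: every earlier layer appended k maps.
def feature_maps(l, k0, k):
    return k0 + k * (l - 1)

# Halving k halves the growth of the concatenated state:
print([feature_maps(l, k0=48, k=16) for l in range(1, 6)])  # [48, 64, 80, 96, 112]
print([feature_maps(l, k0=48, k=8) for l in range(1, 6)])   # [48, 56, 64, 72, 80]
```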
The results of all methods described in this section are presented in Table 3. Most of them reached a high IOU and accuracy, but with a large number of parameters. These methods do not meet the maximum number of parameters (416,000) required by the Facebook competition baseline.
In order to improve on these results, a novel DenseNet implementation with only 10 layers is proposed as follows.

B. DenseNet10
In this section, an efficient DenseNet10 architecture that is able to deliver high performance with very accurate representations of the original images is presented. The resulting fully convolutional model has few parameters, is based on [27], and does not require expensive hardware to be trained. Our model is based on the original DenseNet103. The mathematical model for the output x_ι of the ι-th layer of the dense block is:

x_ι = H_ι([x_{ι−1}, x_{ι−2}, . . . , x_0])  (1)

where H_ι represents the non-linear transformation applied to the concatenated outputs of the previous layers, and [. . .] denotes the concatenation operation. This transformation comprises batch normalization, ReLU, a convolutional layer, and dropout. The goal of this work is to reduce the number of layers and the size of the concatenation matrix. To do so, a feature extractor and two paths (one down-sampling and one up-sampling) are proposed. The down-sampling path has 1 Transition Down (TD) and the up-sampling path has 1 Transition Up (TU), instead of the 4 transitions (2 TD + 2 TU) used in the traditional approach. For each layer ι, the number of feature maps is given by k0 + k × (ι − 1), where k0 is the number of channels in the input layer.
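A minimal Keras sketch of this dense-block transformation (batch normalization, ReLU, convolution, dropout, then concatenation) is shown below. It mirrors Equation (1) rather than reproducing the exact DenseNet10 code; the layer count and dropout rate are placeholders.

```python
from tensorflow.keras import layers

def dense_block(x, num_layers, growth_rate, dropout_rate=0.2):
    # Each layer adds `growth_rate` feature maps and is concatenated
    # with all previous outputs: x_l = H_l([x_{l-1}, ..., x_0]).
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        y = layers.Dropout(dropout_rate)(y)
        x = layers.Concatenate()([x, y])
    return x
```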
A brief description of each stage of the architecture follows.

1) FEATURE EXTRACTOR
In this work, fewer dense blocks are used (3 instead of 5), and the number of layers is reduced from 103 to 10 in comparison with the original implementation [20]. Additionally, a strided average-pooling layer (D) is inserted between dense blocks, as shown in Figure 1. The features extracted from the output blocks are concatenated (C). The pooling layer helps to decrease the computational complexity of the model and increases the receptive field of all convolutions. At the end of the feature extractor, all DB units are concatenated into a 256-feature subsampled representation.

2) DOWNSAMPLING DATAPATH
As shown in Figure 1, the down-sampling path was modified in order to reduce the number of layers. It gradually reduces the spatial resolution of the input image while increasing the number of feature maps. The resulting features create the context-aware representation D, enriched by all the feature maps.

3) UPSAMPLING DATAPATH
The upsampling path, on the other hand, transforms the low-resolution features D into high-resolution semantic predictions. The semantic information from deeper layers is efficiently combined with the fine details of the early layers. In this work, a smaller transition-up (TU) block is proposed in order to combine two feature representations whose spatial resolutions differ by a factor of 2, where the smaller representation comes from the upsampling path while the larger one comes from the downsampling path through skip connections. In order to match resolutions and blend the two representations by summation, the smaller representation is upsampled with bicubic interpolation.
After adding both feature representations, the dimensionality is reduced by applying a 1 × 1 convolution followed by a 3 × 3 convolution. The resulting feature tensor is used for pixel-wise classification. In the case of NIR eye images, the background class has many more pixels than the pupil, and the number of features extracted for the pupil is smaller than in most applications where more objects (classes) are present. In the proposed method, max-pooling operations were replaced by average pooling and strided connections, since there is not much variability in the background; therefore, only meaningful changes are detected by the average pooling. The final model has 3 transitions with four, three, and four layers respectively, which is considerably smaller than the DenseNet Tiramisu model with its 5 transitions and 56/103 layers.
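A hedged sketch of this reduced transition-up block is given below, under the assumption that the two representations already have matching channel counts so they can be blended by summation; the filter count is a placeholder.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transition_up(small, skip, filters):
    # Bicubic upsampling of the low-resolution features to the skip
    # connection's spatial size, fusion by summation, then 1x1 and 3x3
    # convolutions to reduce dimensionality for pixel-wise prediction.
    up = layers.Lambda(
        lambda t: tf.image.resize(t[0], tf.shape(t[1])[1:3], method="bicubic")
    )([small, skip])
    fused = layers.Add()([up, skip])  # assumes equal channel counts
    fused = layers.Conv2D(filters, 1, padding="same", activation="relu")(fused)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(fused)
```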
The resulting DenseNet10 architecture is shown in Figure 1. In order to define the best growth rate k to use in each layer, a grid search from k = 3 to k = 21 was performed. The best result was obtained with k = 3. The resulting architecture can be trained with a batch size of 32 and only 1 GPU with 8 GB of memory. Figure 1 shows the proposed architecture and Table 2 shows the final parameters used.

IV. DATABASE AND EVALUATION METRICS

A. OpenEDS
For training and testing, the OpenEDS database was used. OpenEDS is a dataset of eye images captured using a virtual-reality head-mounted display (HMD) with two synchronized eye-facing cameras at a frame rate of 200 Hz under controlled illumination. This dataset contains semantic segmentation data collected from 152 participants. A total of 12,759 images with a resolution of 400 × 640 were annotated.
This database is divided into three folders: Train, Test, and Validation, with 9,054, 2,053, and 2,053 images respectively. Each folder also has a ground-truth set. This dataset corresponds to the Facebook semantic segmentation competition. It includes an additional test set without ground-truth labels, which was used by the Facebook team to evaluate the models using the online EvalAI software. All the results reported in this article represent the EvalAI score. Figure 2 shows examples from the database. To artificially increase the number of training images, and thus the robustness of the semantic segmentation models, an image generator function was used for data augmentation. The following geometric transformations were identified as most relevant: image rotation within a range of 15 degrees, image shifts of up to 15%, a zoom range within 15%, and horizontal and vertical flips. Also, two types of noise (Poisson and Gaussian) were added. All image modifications were performed using the corner fill mode, and a mirroring technique was additionally applied. A fixed-fold increase of all datasets is achieved by applying this aggressive data augmentation. Figure 3 shows examples of the resulting images from the data augmentation process used in the training stage of the semantic segmentation models.
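The geometric part of this augmentation can be approximated with Keras' ImageDataGenerator, as in the sketch below. The directory paths are hypothetical, the mapping of the corner fill mode to fill_mode="nearest" is an assumption, and the Poisson/Gaussian noise would be added separately (e.g. via a preprocessing function).

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug = dict(
    rotation_range=15,        # rotations within 15 degrees
    width_shift_range=0.15,   # horizontal shifts of up to 15%
    height_shift_range=0.15,  # vertical shifts of up to 15%
    zoom_range=0.15,          # zoom within 15%
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode="nearest",      # assumed equivalent of the corner fill mode
)
image_gen = ImageDataGenerator(**aug)
mask_gen = ImageDataGenerator(**aug)

# Using the same seed on both generators keeps each augmented image
# geometrically aligned with its segmentation mask.
image_flow = image_gen.flow_from_directory("train/images", seed=42, class_mode=None)
mask_flow = mask_gen.flow_from_directory("train/labels", seed=42, class_mode=None)
```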

B. EVALUATION METRICS
The mean Average Precision (mAP) was used to evaluate detection performance. The mAP value ranges from 0 to 1; the higher, the better. The mAP is computed by calculating the average precision (AP) separately for each class (pupil, iris, sclera left, right, and make-up) and then averaging over the classes. A detection is considered a true positive only if the IoU is above a certain threshold. In this work, a threshold of 0.7 was used, which is stricter than the 0.5 commonly used in the state of the art. In order to evaluate the algorithm, the rules of ''The Eye Tracking Semantic Segmentation Challenge'' sponsored by Facebook Technologies were followed. Eligible entries were scored through automated software on the EvalAI website using the performance metric M (0 < M ≤ 1), defined as follows:

M = (P + min(1/s, 1)) / 2  (2)

where M represents the final score; P (0 < P ≤ 1) measures the model accuracy, defined as the mean intersection-over-union score on the test set; and s > 0 measures the model complexity, defined as the model size in MB derived from the number of model parameters.
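Under the assumption that the model size is the float32 parameter count converted to megabytes (the exact size convention used by EvalAI is an assumption here), the reported score can be reproduced in a few lines:

```python
def openeds_score(mean_iou, num_params, bytes_per_param=4):
    # s: model size in MB, assuming float32 weights (4 bytes each).
    s = num_params * bytes_per_param / (1024 ** 2)
    return 0.5 * (mean_iou + min(1.0 / s, 1.0))

# DenseNet10: 202,084 parameters (~0.77 MB, so min(1/s, 1) = 1), mIoU 0.94293
print(openeds_score(0.94293, 202_084))  # -> 0.971465, i.e. the reported 0.97147
```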
The scores were computed using the test labels, considering the solutions that offer the best trade-off between model performance and model complexity, with model accuracy taking precedence. The ground-truth test data was hidden from the entrants and was used by the Facebook team to score model performance through the website, per the challenge criteria.
Semantic segmentation implies a pixel-wise classification that defines whether each pixel is part of one of the five classes and/or the background. Therefore, to evaluate the performance of the eye segmentation algorithm, the Intersection Over Union (IOU) measure is used [29]. This measure represents the best option for evaluating segmentation according to the state of the art; metrics such as accuracy and the confusion matrix are used for detection and classification and are not suitable for segmentation. The IOU is computed by comparing, pixel-wise, the predicted segmentation and the ground-truth mask. Figure 4 shows an example of the IOU metric: the dashed red box is the predicted detection, the continuous box is the ground truth, and the grey area is the overlap between the two; the figure shows three different IOU scores from left to right, with the rightmost being the best.
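A minimal NumPy sketch of per-class IOU over integer label maps is shown below; the class ids and array shapes are illustrative assumptions.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    # pred, target: integer label maps of identical shape (H, W).
    # IOU per class c = |pred==c AND target==c| / |pred==c OR target==c|.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious  # mean IOU = np.nanmean(ious)
```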

V. EXPERIMENTS AND RESULTS
DenseNet implementations do not use a pre-trained model as a backbone; therefore, the parameters can be reduced and the hyper-parameters fine-tuned. In order to find an efficient algorithm, several models using DeepLabV3+, U-Net, Mask R-CNN, DenseNet56, DenseNet103, and DenseNet10 were trained, as shown in Table 3 (DenseNet10_1 to DenseNet10_14). A PC with an Intel i7 CPU, 32 GB of RAM, and a GTX 1080 Ti GPU was used for all the experiments. An RMS optimizer with a learning rate from 0.1 down to 0.0001, a decay rate of 0.995, layers per block of (1, 3, 5, 10, 12), and growth rates of 3, 5, 7, and 9 were used. Each model took on average 3 days to train over 100 epochs. The number of layers for each dense block (TU and TD) was reduced from 5 to 3 and 4 respectively. The growth rate regulates how much new information each layer contributes to the global state; therefore, the trade-off between growth rate (k), IOU, number of layers, and pooling number needs to be carefully studied. Figure 5 shows examples with the highest IOU from DenseNet10_13.

Table 3 shows the models submitted to the Facebook competition for evaluation. Column 1 shows the name of each model, column 2 the number of parameters, and columns 3 and 4 the mean IOU and final scores estimated by third-party software (the EvalAI website). Models with backbones, such as DeepLabV3+, and also U-Net reached a higher IOU but scored low due to their large number of parameters. The best result was reached by the DenseNet model with 10 layers, with dropout rates of 0.5, 0.2, and 0.2 for dense blocks 1 to 3 respectively. Figure 5 shows three different eye segmentation results obtained with the proposed DenseNet10_13 model.

Table 4 shows examples of three challenging NIR eye images that present shadows, highlights, brightness artifacts, and even make-up. The segmentation results of all the explored models are compared (DenseNet10_13, DenseNet56, DeepLabV3+, MobileUNet, Mask-R50, and Mask-R101). The image with make-up shows the worst results; the main errors appear in the sclera and iris areas, mainly due to the light reflections and make-up present in the original image. Even the Mask-R50 model cannot detect the scleras.

Figure 6 reports the accuracy curve over 80 epochs for the best model (DenseNet10_13). The instability and noise decreased over the epochs. Note that the DenseNet10_13 model was trained with a high number of training examples and few parameters. Figure 7 shows the validation curve (IoU) over 80 epochs for the DenseNet10_13 model; similar to Figure 6, the curve tends to stabilize at epoch 80. Figure 8 shows a comparison of the number of parameters amongst the state-of-the-art models tested and the proposed DenseNet10_13 algorithm. The minimum number of parameters was 202,084, with a mIoU of 0.9429. This reduction in the number of parameters helps to reduce the data required to train eye segmentation algorithms, which is very important when dealing with environments with limited labeled data, such as biometrics and gaze estimation.
All eye components present different degrees of complexity. For instance, the pupil has fewer pixels relative to the background or the iris; therefore, the segmentation of each class does not contribute to the overall segmentation error in the same manner. To understand the segmentation performance across the different structures (classes) of the eye (sclera, iris, pupil, and background), a set of experiments was performed using Mask-RCNN50, Mask-RCNN101, and our proposed DenseNet10 model on each class separately. Surprisingly, the sclera shows a lower performance when compared with the pupil and iris, while the eye as an entire structure reaches a higher performance.
The OpenEDS database was used in this experiment with the following partition: 6,791 images for training, 2,265 for testing, and 2,263 for validation. The segmentation results for each class are reported in Table 5. Surprisingly, the sclera shows a lower performance when compared with the pupil and iris. However, its segmentation performance improves when using our proposed DenseNet10 model. Table 5 also reports the mean IOU, which represents the average result over all the structures belonging to each class. The eye class is the combination of all individual classes (iris, pupil, and sclera).

VI. CONCLUSION
In this work, a DenseNet10 model that uses only 10 layers and 202,084 parameters was implemented. This model achieved one of the best performances in the Facebook eye semantic segmentation competition when compared with SegNet (Encoder-Decoder), DenseNet Tiramisu 56-67-103, U-Net, DeepLabV3+, and Mask R-CNN based on ResNet 50 and 101. The trade-off between accuracy and model size was considered the success metric. Experiments have shown that, in this particular case (NIR eye images), there is no need for very deep CNN models in order to achieve competitive segmentation performance; a few layers can properly capture the information contained in the images.
As shown in Table 5, all eye components present different degrees of complexity. For instance, the pupil has fewer pixels relative to the background or the iris; therefore, the segmentation of each class does not contribute to the overall segmentation error in the same manner. To understand the segmentation performance across the different structures (classes) of the eye (sclera, iris, pupil, and background), a set of experiments was performed. Surprisingly, the sclera shows a lower performance when compared with the pupil and iris, reaching only 0.862, while the eye as an entire structure reaches a higher performance of 0.982.
For future work, the presence of make-up (mascara and shadows) and the loss estimation error will be explored in order to account for the number of pixels per class. In the eye images used in this challenge, the background class has many more pixels than the pupil, sclera, and iris. The categorical cross-entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, assigning equal learning weight to each pixel in the image. This can be a problem when the classes are unevenly represented in the image, because training can be dominated by the most prevalent class (the one with the largest number of pixels). Furthermore, it would be very interesting to study segmentation algorithms for eye images captured under different spectra.
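One standard remedy, sketched below for reference, is a class-weighted categorical cross-entropy; the weights shown are placeholders, not values used in this work.

```python
import tensorflow as tf

def weighted_categorical_crossentropy(class_weights):
    # class_weights: one weight per class, larger for rare classes
    # (e.g. pupil) so the background does not dominate training.
    w = tf.constant(class_weights, dtype=tf.float32)

    def loss(y_true, y_pred):
        # y_true: one-hot masks; y_pred: per-pixel softmax outputs.
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
        ce = -y_true * tf.math.log(y_pred)           # per-pixel, per-class CE
        return tf.reduce_mean(tf.reduce_sum(ce * w, axis=-1))

    return loss

# Hypothetical usage: background weighted down relative to eye structures.
# model.compile(optimizer="rmsprop",
#               loss=weighted_categorical_crossentropy([0.1, 1.0, 1.0, 1.0]))
```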
The proposed DenseNet10_13 algorithm reached 8th place in the Facebook semantic segmentation challenge with a 0.94293 mean IOU and 202,084 parameters, for a final score of 0.97147. This score is only 0.001 lower than that of the first place in the competition.