A DCNN-Based Fast NIR Face Recognition System Robust to Reflected Light From Eyeglasses

Due to an increasing need for face recognition under poor lighting conditions, near infrared (NIR) face recognition based on deep convolutional neural networks (DCNN) has become an active area of research. However, in NIR face images of eyeglasses wearers, reflected light is generated around the eyes due to active NIR light sources, and it is one of the main contributors to performance degradation in NIR face recognition. In addition, there have to date been no attempts to lighten DCNN models for NIR face recognition. To solve these problems, we propose a DCNN-based fast NIR face recognition system which is robust to reflected light. This work has two main contributions: 1) We generated synthetic face images of individuals with and without eyeglasses using our proposed CycleGAN-based Glasses2Non-glasses (G2NG) data augmentation. We then constructed an augmented training database by adding the synthetic images, and the database helps to make the NIR face recognition system robust against reflected light. 2) A lightweight NIR FaceNet (LiNFNet) architecture was developed to reduce the computational complexity of the proposed system by adapting the depthwise separable convolutions and linear bottlenecks to VGGNet 16. The proposed architecture reduces the computation required, while improving the performance of NIR face recognition. Through the experiments reported in this paper, we verified that the proposed G2NG data augmentation improved the face recognition validation rate by 99.09% for NIR face images which have the reflected light from eyeglasses. Also, LiNFNet reduces the number of multiplication operations by $4.4\times 10 ^{9}$ compared with VGGNet 16.


I. INTRODUCTION
Most deep convolutional neural network (DCNN)-based face recognition (FR) studies have been conducted using RGB face images [1]-[8]. However, Kim et al. [9] showed that the validation rate of RGB FR decreases significantly under poor lighting conditions. In these environments, the validation rate of Kim's near infrared (NIR) FR method [9] was 40% or more higher than that of RGB FR. Since such environments are common in FR scenarios, such as unlocking a cell phone with FR in a dark room, it is important to research NIR FR. Even though Kim's method [9] significantly improved accuracy by introducing the fine-tuning approach into NIR FR, DCNN-based NIR FR still has considerable room for improvement with respect to accuracy and computational complexity.
The associate editor coordinating the review of this manuscript and approving it for publication was Weizhi Meng.
One of the main issues with the existing NIR FR studies [9]-[11] is that their accuracy and validation rates are significantly reduced in Glasses and Non-glasses (G-NG) positive NIR FR scenarios. As shown in Fig. 1 (a), in this scenario the system performs NIR FR on a face image pair of one person with and without eyeglasses. The validation rate decreases because the gallery and probe images have large intensity differences around the eye regions due to reflected light. The validation rates of Kim's method [9] are less than 93% in this scenario, as shown in Fig. 1 (b). Performance at this level cannot guarantee sufficient security to justify the use of NIR FR in the real world. Since G-NG positive NIR FR scenarios are very common in real-world applications, improving the performance of FR in such scenarios is crucial.
FIGURE 1. (b) The validation rates of Kim's method [9] and the proposed method in the G-NG positive NIR FR scenarios. ''Existing method-I'' and ''Existing method-V'' are the Inception ResNet v1 and VGGNet 16 versions of Kim's method [9], respectively. (c) and (d) show deep features of Kim's method [9] and the proposed method for the same person's face images with and without eyeglasses. These deep features are represented using t-SNE [12].
Another issue with the existing approaches is computational complexity. Despite recent advances in NIR FR [9]-[11], there are very few studies related to reducing the computational costs of NIR FR. Since recently produced smartphones provide a feature that enables the unlocking of a phone using a face, it would be beneficial to make a lightweight and fast DCNN architecture for NIR FR.
In consideration of the above-mentioned issues, our goal was to develop a fast DCNN-based NIR FR system robust to reflected light. To achieve this objective, we built the proposed NIR FR system on two contributions:
1) CycleGAN-based Glasses2Non-glasses (G2NG) data augmentation
2) Lightweight NIR FaceNet (LiNFNet) architecture
The first contribution allows the DCNN architecture for NIR FR to be trained so that it is robust against reflected light. The second contribution not only effectively reduces the computational cost of NIR FR, but also models human faces well even if reflected light is present. Detailed explanations of the contributions are as follows.
A. CycleGAN-BASED G2NG DATA AUGMENTATION
When using publicly available NIR face databases to train DCNN architectures, we cannot adequately cover G-NG positive FR scenarios. This is because the numbers of face images with and without eyeglasses are not balanced in most face labels of the public NIR face training databases. To solve an unbalanced data problem, three methods are frequently used: under-sampling [13]-[15], over-sampling [15], and synthetic over-sampling [16], [17]. If synthetic over-sampling methods can generate images close to real ones, we can increase the proportion of minority samples in the database better than with other sampling methods. From this point of view, we adapted CycleGAN to implement synthetic over-sampling, and generated realistic face images of individuals with and without eyeglasses.

B. LiNFNet ARCHITECTURE
Recently, several architectures [18]-[24] have been developed to reduce the computational cost of problems such as classification and detection, while maintaining accuracy. However, it is not clear that such architectures can achieve state-of-the-art performance in NIR FR. Instead of using the architectures [18]-[24] that were successful in classification or detection, we aimed to improve VGGNet 16 [25] and Inception ResNet v1 [26], which are known to perform well in NIR FR. By adapting the depthwise separable convolutions [18] and linear bottlenecks [21], which efficiently reduce the number of parameters and computations of convolution filters, we created a lightweight architecture for NIR FR, and we call this architecture LiNFNet in this paper.
To visualize the effect of the two contributions on reflected light, we investigated the deep features used for NIR FR in the feature space using t-SNE [12]. The deep features produced by the proposed method, when applied to images of the same person wearing or not wearing eyeglasses, have less variance than those produced by Kim's method [9], as shown in Fig. 1 (c) and (d). The discriminative ability of Kim's method [9] is acceptable for the three identities (Fig. 1 (c)). However, NIR FR was conducted on a database that includes about two hundred identities, and the feature space is densely filled with the features from these identities. In this case, even slight distances between the features of face images of the same person with and without eyeglasses are likely to reduce the performance of NIR FR. In other words, the concentrated features that the proposed method produces for the same identity contribute to improving the NIR FR performance in the G-NG positive FR scenario, as can be seen in Fig. 1 (b).
The rest of this paper is organized as follows. In Section II, work related to the proposed system is explained. Section III elaborates on the training and inference processes of the proposed system. CycleGAN-based G2NG data augmentation and LiNFNet are described in Sections IV and V, respectively. In Section VI, the experimental results are presented. In Section VII, we conclude our work by summarizing the pros and cons of the proposed NIR FR system and discussing future work.

II. RELATED WORK
In this section, we summarize work related to the proposed NIR FR system's two contributions, the CycleGAN-based data augmentation and LiNFNet.

A. GAN-BASED DATA AUGMENTATION
Following the pioneering work of LeCun et al. [27] and Krizhevsky et al. [28], DCNN [25], [26], [29]-[31] became a mainstream approach to research into well-known computer vision problems such as recognition, classification, and segmentation. Using powerful deep models [25]-[31], performance on these problems has been drastically improved. However, such deep networks require numerous well-annotated databases to achieve state-of-the-art performance. Since obtaining such high-quality databases is time-consuming and expensive, data augmentation methods which generate synthetic training images have been actively researched. Recently, several studies [32]-[35] have utilized GAN [36]-[41] for data augmentation, and have succeeded in generating realistic synthetic training images.
DA-GAN [32] introduced the GAN architecture for instance-level image translation. In one example, synthetic bird images involving various poses were generated, and these images were used as training data for fine-grained classification.
Antoniou et al. [33] introduced a conditional GAN for data augmentation. From the encoder of the conditional GAN, a representation of the input image was acquired. The representation and a random vector were then concatenated, and the decoder generated a synthetic image from the concatenated vector. Using the conditional GAN, Antoniou et al. [33] constructed augmented databases for the Omniglot [42], EMNIST [43], and VGG-Face [1] databases. Antoniou et al. [33] showed that recognition accuracy was improved on these databases.
AugGAN [34] added a segmentation network to GAN to maintain the structures of the input images in the synthetic images.
FaceID-GAN [35] introduced the concept of three players: a generator, a classifier for identity classification, and a discriminator. With the training of the three players, the classifier for identity classification achieved high performance. Due to the classifier, the generator generated synthetic images while preserving the identities of the faces in the input images. Using Shen's method [35], synthetic frontal face images were generated from face images which had various poses, and face verification was conducted using the synthetic images. Shen et al. [35] improved the verification accuracy.
To prevent degradation of the NIR FR performance due to reflected light, Jo and Kim [58] added simple reflected light patterns, such as rectangle, circle, or ellipse shapes, to the parts of the NIR face images near the eyes. Although their data augmentation method improved the NIR FR performance, this approach did not generate sufficiently realistic reflected light patterns in the NIR face images.
After reviewing the existing methods [32]- [35], [58], we postulated that there could be a performance improvement in NIR FR in G-NG positive FR scenarios when G2NG data augmentation was well conducted using GAN. In this work, since G2NG data augmentation can be represented as an unpaired image-to-image translation problem, we utilized CycleGAN [44] to generate synthetic images. We demonstrated that the NIR FR accuracy in the G-NG positive FR scenarios was improved using CycleGAN-based G2NG data augmentation, as shown in Section VI.

B. LIGHTWEIGHT DCNN MODELS
Despite the high accuracy of most DCNN-based applications, they cannot be applied in most smartphones or embedded environments, due to limited computing resources. To extend deep learning applications to mobile environments, it is necessary to conduct studies into the reduction of computational cost, by making the DCNN models lightweight. There have been several studies [18], [19], [21]-[23] addressing this problem.
MobileNet v1 [18] introduced depthwise separable convolution to lighten the DCNN architecture. In the work of Howard et al. [18], ImageNet classification accuracy did not decrease significantly, while the computational burden was considerably reduced. Chollet [19] demonstrated that depthwise separable convolutions could be adapted to the inception modules [30]. The training speed of Chollet's lightweight architecture [19] was increased compared to Inception v3 [30]. ShuffleNet v1 [23] utilized pointwise group convolutions to reduce the computational cost of pointwise convolutions and developed channel shuffle to overcome the side effect of pointwise group convolutions. Channel shuffle made it possible to transfer information between groups of activation channels. MobileNet v2 [21] developed a DCNN architecture with linear bottlenecks. Linear bottlenecks helped the efficient reduction of the channels of the output activation, by estimating the manifold of the activation while retaining the information in the activation. In ShuffleNet v2 [22], channel split was introduced into the ShuffleNet v1 architecture to use the architecture efficiently.
[Table caption fragment] ... [25]. CASIA VIS-NIR 2.0 [48] is utilized as the training database for fine-tuning, and the validation database is the same as the test pairs in Fig. 4 and 5. The NIR FR was conducted on an NVIDIA GTX 1080ti GPU. ''Time'' means the average time taken to extract features for NIR FR.
Wu et al. [56] developed a light DCNN architecture for FR. They introduced max-feature-map (MFM) into each convolution layer, which helped their DCNN architecture to extract a compact face representation while reducing the number of parameters, and the computational costs. However, Wu's architecture [56] was not designed for NIR FR, and Wu et al. [56] did not sufficiently analyze the effects of reflected light in NIR face images on the performance of NIR FR. Zheng and Zu [57] developed a light DCNN architecture for RGB FR by adding a normalized layer to Wu's architecture [56]. Zheng's architecture [57], therefore, was also not designed for NIR FR.
In the work reported in this paper, we lightened one of the powerful off-the-shelf DCNN architectures, VGGNet 16 [25]; this architecture was shown to have high performance for NIR FR in the literature [9]. The reason for using VGGNet 16 as a backbone network is that, in our toy experiment, VGGNet 16 is about twice as fast as another powerful architecture, Inception ResNet v1 [26]. In addition, the NIR FR accuracy of VGGNet 16 in G-NG positive FR scenarios is higher than that of Inception ResNet v1. The results of the toy experiment can be found in Section V. We lightened VGGNet 16 by simultaneously adapting depthwise separable convolutions [18] and linear bottlenecks [21]; the proposed lightweight model is called LiNFNet. Depthwise separable convolutions and linear bottlenecks significantly reduced the computational complexity of VGGNet 16. In particular, linear bottlenecks considerably improved the accuracy of NIR FR by efficiently increasing the number of channels of the input activations using pointwise convolutions.

III. PROPOSED NIR FR SYSTEM
An overview of the proposed system is presented in this section. The proposed system was designed as an end-to-end framework which includes the LiNFNet architecture. The inference process of the proposed system is the same as that of FaceNet [2]: 1) A face image pair is fed into our NIR FR system, and two deep features are extracted by the LiNFNet architecture. 2) The Euclidean distance between the two features is calculated. 3) If the distance is less than a predefined threshold, the system considers that the two face images are from the same identity; otherwise, the images are from different identities. In Fig. 2, the training process of the proposed NIR FR system is depicted. Before training the LiNFNet architecture, G2NG data augmentation is conducted so that LiNFNet is trained to be robust against reflected light from eyeglasses. During the data augmentation, CycleGAN [44] generates synthetic NIR face images of individuals with and without eyeglasses. Then, we construct the augmented training database by merging the real and synthetic images. The numbers of face images with and without eyeglasses in the augmented database are balanced. According to Kim et al. [9], the fine-tuning approach to NIR FR achieved a better validation rate than the learning-from-scratch approach. As with the fine-tuning approach of Kim et al. [9], we utilized a model of LiNFNet pretrained on CASIA WebFace [45], and conducted fine-tuning on the augmented training database.
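The three inference steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the deep features are assumed to have already been extracted by the network and are passed in as plain lists of floats, the function names are hypothetical, and the threshold value is a placeholder because the text does not state one.

```python
import math


def feature_distance(feat_a, feat_b):
    """Step 2: Euclidean distance between two deep feature vectors."""
    # math.dist raises ValueError if the two vectors differ in length.
    return math.dist(feat_a, feat_b)


def same_identity(feat_a, feat_b, threshold):
    """Step 3: accept the pair only if the distance is below the threshold.

    `threshold` is a hypothetical parameter; in practice it would be
    chosen on a validation set.
    """
    return feature_distance(feat_a, feat_b) < threshold


if __name__ == "__main__":
    # Toy 2-D "features" standing in for real network outputs.
    gallery = [0.0, 0.0]
    probe = [3.0, 4.0]
    print(feature_distance(gallery, probe))
    print(same_identity(gallery, probe, threshold=5.1))
```

Because the comparison is strict, a pair whose distance equals the threshold is rejected, which matches the wording "less than a predefined threshold".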

IV. CYCLEGAN-BASED G2NG DATA AUGMENTATION

A. MOTIVATION
After reviewing publicly available NIR face images, we predicted that the accuracy of NIR FR would be decreased in the G-NG positive FR scenarios due to reflected light.
To investigate this hypothesis, we defined six types of input pairs as shown in Fig. 3, and conducted two toy experiments.
In Fig. 3, the input pairs containing 0, 1, and 2 eyeglasses wearers are denoted as ''non-glasses'', ''mixed'', and ''glasses'', respectively. If the input pair was taken from one person, we denoted it as a ''positive'' pair; otherwise, it is a ''negative'' pair. Therefore, mixed positive pairs are identical to the G-NG positive FR scenarios.
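The naming rule for the six pair types can be written down as a small helper. This is only a sketch of the rule as described in the text; the function name, argument layout, and identity labels are hypothetical.

```python
def pair_type(identity_a, wears_glasses_a, identity_b, wears_glasses_b):
    """Return the pair type, e.g. 'mixed positive'.

    The eyewear part depends on how many of the two images show an
    eyeglasses wearer (0, 1, or 2); the label part depends on whether
    both images come from the same person.
    """
    wearers = int(bool(wears_glasses_a)) + int(bool(wears_glasses_b))
    eyewear = ("non-glasses", "mixed", "glasses")[wearers]
    label = "positive" if identity_a == identity_b else "negative"
    return f"{eyewear} {label}"


if __name__ == "__main__":
    # Same person, one image with eyeglasses: the G-NG positive scenario.
    print(pair_type("id_001", True, "id_001", False))
```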
Through the two toy experiments, we evaluated the NIR FR accuracies of the six types of input pairs. For each type of input pair, we extracted 2,000 pairs from the CASIA NIR [46] database, producing a total of 12,000 pairs for evaluation. The first and second experiments used CASIA VIS-NIR 2.0 [48] and PolyU-NIRFD [47], respectively, as training databases for the fine-tuning approach. In both experiments, we utilized Inception ResNet v1 and VGGNet 16 as backbone networks for the NIR FR system. The results of the experiments are summarized in Fig. 4 and Fig. 5.
As shown in Fig. 4, all types of input pairs except for the mixed positive pairs achieved an accuracy of more than 97%. On the other hand, the mixed positive pairs achieved an accuracy of about 80%. This phenomenon can also be seen in Fig. 5. From these observations, we can say that the G-NG positive FR scenarios caused a number of failure cases in NIR FR due to the reflected light from eyeglasses.
To reduce the number of failure cases, each face label in the training NIR face databases should include face images both with and without eyeglasses, and the numbers of these two types of face images should be similar. In other words, the databases should have a number of Glasses and Non-glasses (G-NG) mixed face classes; a G-NG mixed face class denotes a face class that contains face images both with and without eyeglasses. In Table 1, information about G-NG mixed face classes and total face images in several public NIR databases [46]-[48] is presented. The CASIA VIS-NIR 2.0 [48] database has 86 G-NG mixed face classes. However, in this database, the ratio of G-NG mixed face classes to all face classes is low, at 11.9%. The PolyU-NIRFD [47] database has only two G-NG mixed face classes. Therefore, we expect that a DCNN model trained using the PolyU-NIRFD [47] and CASIA VIS-NIR 2.0 [48] databases will not be robust to G-NG positive FR scenarios. As shown in Table 1, the ratio of G-NG mixed face classes to all face classes is 32.5% in the CASIA NIR database [46]. Although this ratio is the highest among the databases summarized in Table 1, the CASIA NIR database is unsuitable for training DCNN models for NIR FR because there are only about 4,000 face images in the database. Therefore, G2NG data augmentation should be carried out to increase the number of G-NG mixed face classes in the CASIA VIS-NIR 2.0 and PolyU-NIRFD databases.
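The ratio of G-NG mixed face classes discussed above can be computed from per-image annotations as in the following sketch. The input format, a list of (identity, wears_glasses) tuples, and the function name are assumptions made for illustration; the public databases do not necessarily ship eyeglasses annotations in this form.

```python
from collections import defaultdict


def mixed_class_ratio(samples):
    """Count G-NG mixed face classes and their share of all classes.

    `samples` is an iterable of (identity, wears_glasses) tuples.
    A class is "mixed" if it contains at least one image with
    eyeglasses and at least one image without.
    """
    eyewear_seen = defaultdict(set)
    for identity, wears_glasses in samples:
        eyewear_seen[identity].add(bool(wears_glasses))
    if not eyewear_seen:
        return 0, 0.0
    mixed = sum(1 for states in eyewear_seen.values() if len(states) == 2)
    return mixed, mixed / len(eyewear_seen)


if __name__ == "__main__":
    toy = [("a", True), ("a", False), ("b", False), ("c", True), ("c", True)]
    count, ratio = mixed_class_ratio(toy)
    print(count, f"{ratio:.1%}")
```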

B. CYCLEGAN FOR G2NG DATA AUGMENTATION
The objective of the G2NG data augmentation is to produce synthetic face images both with and without eyeglasses. To make synthetic face images with eyeglasses, reflected light should be added; to make synthetic face images without eyeglasses, reflected light should be removed. This objective can be achieved by solving an image-to-image translation problem.
Compared to the well-known Pix2Pix [49], which solves the paired image-to-image translation problem, CycleGAN [44] has two advantages. Firstly, it does not require paired annotations; it only requires images from two domains. Secondly, it can learn to produce outputs in both directions (A2B and B2A). These two advantages are crucial for our application, because it is very difficult to acquire paired NIR face images with and without eyeglasses. Therefore, we used CycleGAN [44] rather than Pix2Pix [49] for the G2NG data augmentation.
To train CycleGAN for G2NG data augmentation, we used the same architecture and loss as in Zhu et al. [44], and identity loss [44] was also utilized to preserve the identities while generating the synthetic face images with and without eyeglasses. The images resulting from the CycleGAN-based G2NG data augmentation are shown in Fig. 8 and Fig. 10 in Section VI.
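For reference, the objective of Zhu et al. [44] referred to above can be summarized as below. This is a sketch under stated assumptions: $X$ and $Y$ are taken here to be the glasses and non-glasses domains (this labeling is ours), $G: X \to Y$ and $F: Y \to X$ are the two generators, $D_X$ and $D_Y$ are the discriminators, and the weights $\lambda_{\mathrm{cyc}}$ and $\lambda_{\mathrm{id}}$ are hyperparameters whose values are not stated in this text.

```latex
% Assumed labeling: X = glasses domain, Y = non-glasses domain.
\begin{align*}
\mathcal{L}_{\mathrm{cyc}}(G,F) &= \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big] \\
\mathcal{L}_{\mathrm{id}}(G,F) &= \mathbb{E}_{y}\big[\lVert G(y) - y \rVert_1\big]
  + \mathbb{E}_{x}\big[\lVert F(x) - x \rVert_1\big] \\
\mathcal{L}(G,F,D_X,D_Y) &= \mathcal{L}_{\mathrm{GAN}}(G,D_Y) + \mathcal{L}_{\mathrm{GAN}}(F,D_X)
  + \lambda_{\mathrm{cyc}}\,\mathcal{L}_{\mathrm{cyc}}(G,F)
  + \lambda_{\mathrm{id}}\,\mathcal{L}_{\mathrm{id}}(G,F)
\end{align*}
```

The cycle-consistency term is what removes the need for paired images, and the identity term penalizes a generator for changing an image that already belongs to its target domain.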

V. LINFNET ARCHITECTURE
In Kim et al. [9], it was shown that Inception ResNet v1 [26] and VGGNet 16 [25] achieved a high validation rate for NIR FR. Therefore, we expected that making lightweight versions of Inception ResNet v1 or VGGNet 16 would be effective. As shown in Table 2, VGGNet 16 is about 2.5 times faster than Inception ResNet v1; hence, VGGNet 16 is a more suitable architecture than Inception ResNet v1 for the proposed NIR FR system. In addition, VGGNet 16 has an advantage that its accuracy is higher than that of Inception ResNet v1 in the G-NG positive FR scenarios. Because of the NIR FR accuracy and speed, we chose VGGNet 16 to make a lightweight DCNN architecture for NIR FR.
In this study, we produced LiNFNet by lightening VGGNet 16 [25] using depthwise separable convolutions [18] and linear bottlenecks [21]. When constructing the LiNFNet architecture, we decreased the number of filters in the first convolution layer of the network by half. Fig. 6 shows several output activations extracted from the first convolution layers of VGGNet 16 for an NIR face image. These activations have similar patterns and structures of the intensity values. From this observation, we conclude that the activations contain redundant information. Thus, decreasing the number of convolution filters in the first layer does not significantly decrease the NIR FR accuracy. The result of such reduction is shown in Table 3.
We made the initial convolution layers of LiNFNet by adapting the depthwise separable convolutions [18] to the 2nd, 3rd, and 4th convolutions of VGGNet 16. We expected that the NIR FR accuracy would not significantly decrease upon replacing the full convolutions of the initial convolution layers with the depthwise separable convolutions [18], which are a lightweight version of full convolutions. This is because the initial convolution layers are simpler functions for extracting the output activations than the rest of the convolution layers; the initial convolutions extract low-level information, such as edges and combinations of edges, from the input NIR face images. From the experiment reported in this paper, we found that such replacement effectively reduces the computational complexity while improving the NIR FR accuracy in the G-NG positive FR scenarios.
It is necessary that the layers following the initial convolution layers extract rich feature information for NIR FR from the input activation. To produce output activations including this rich information, we should expand the input activation by increasing the number of channels, and extract the output activation by combining many channels of the expanded input activation. However, as the number of channels of the input activation increases, the computational complexity also increases. Therefore, we should efficiently extract the rich information for NIR FR from the input activation while preserving a low computational complexity. To do this, we adapted linear bottlenecks [21] to the last three convolution layers of VGGNet 16 to make the LiNFNet architecture.
In Fig. 7 (c), the expansion pointwise convolution of the linear bottleneck increases the number of channels of an input activation to extract the rich information for NIR FR. The depthwise convolution of the linear bottleneck extracts the rich information for each channel of the input activation. The final pointwise convolution linearly decreases the number of channels of the output activation to reduce the computational cost of the next convolution layer. This approach helped us to extract richer information for NIR FR more efficiently than full convolution or depthwise separable convolution. As explained in Sandler et al. [21], the information in the intermediate activation in Fig. 7 (c) is considerably redundant for NIR FR. Therefore, the number of channels of the intermediate activation can be linearly reduced using pointwise convolution. To prevent information loss, we did not use ReLU6 after the pointwise convolution, in the same manner as Sandler et al. [21]. Since the manifold of the output activation can be well acquired by linearly reducing the number of channels of the output activation, additional information loss from ReLU6, which is a nonlinear function, would cause a considerable drop in the NIR FR accuracy. The LiNFNet architecture is summarized in Table 4.
It is necessary to compare the computational complexity of a full convolution, depthwise separable convolution [18], and linear bottleneck [21] to verify the extent to which LiNFNet reduces the number of computations compared with VGGNet 16 [25]. In this paper, only the multiply operation is considered. The numbers of multiply operations in the convolution modules are given by equations (1)-(3), which can be derived from Fig. 7. $C_F$, $C_D$, and $C_L$ are the numbers of multiply operations of a full convolution, depthwise separable convolution, and linear bottleneck, respectively. The meanings of the other notations are shown in Fig. 7. Equations (1) and (2) were formulated in the literature [18].
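As a reading aid, the three multiply counts can be written in the notation of Howard et al. [18] and Sandler et al. [21]. This is a sketch under stated assumptions and may differ from the notation of Fig. 7: $D_K$ is the kernel size, $M$ and $N$ are the numbers of input and output channels, $D_F$ is the spatial size of the feature map (stride 1 and unchanged spatial size are assumed), and $t$ is the expansion factor of the linear bottleneck.

```latex
% Notation borrowed from [18] and [21]; not necessarily that of Fig. 7.
\begin{align*}
C_F &= D_K^2 \, M \, N \, D_F^2 \\
C_D &= D_K^2 \, M \, D_F^2 + M \, N \, D_F^2 \\
C_L &= M \,(tM)\, D_F^2 + D_K^2 \,(tM)\, D_F^2 + (tM)\, N \, D_F^2
     = tM \, D_F^2 \,\big(M + D_K^2 + N\big)
\end{align*}
```

The first term of $C_D$ is the depthwise convolution and the second is the pointwise convolution; the three terms of $C_L$ are the expansion pointwise, depthwise, and projection pointwise convolutions, respectively.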
To quantitatively verify how much lighter LiNFNet is than VGGNet 16 [25], we calculated the difference ($D_D$) between the numbers of multiply operations of the depthwise separable convolution and the full convolution. For the linear bottleneck, we calculated $D_L$ in the same manner as for the depthwise separable convolution.
If $D_D$ or $D_L$ is negative, the number of multiply operations of the depthwise separable convolution or linear bottleneck is lower than that of the full convolution, and vice versa. From equations (4) and (5), the number of multiply operations of LiNFNet is about $4.4\times 10^{9}$ lower than that of VGGNet 16.
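The comparison described above can be tried out numerically with the following sketch. It uses the standard multiply counts from [18] and [21] rather than the paper's own equations, assumes stride 1 and an unchanged spatial size, and all function and parameter names (`k`, `m`, `n`, `f`, `t`) are hypothetical. The layer shape in the example is illustrative and is not taken from Table 4.

```python
def full_conv_mults(k, m, n, f):
    """Multiplies of a k x k full convolution: m -> n channels, f x f map."""
    return k * k * m * n * f * f


def depthwise_separable_mults(k, m, n, f):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * m * f * f + m * n * f * f


def linear_bottleneck_mults(k, m, n, f, t):
    """Expansion 1 x 1 (m -> t*m), depthwise k x k, projection 1 x 1 (t*m -> n)."""
    expanded = t * m
    return (m * expanded + k * k * expanded + expanded * n) * f * f


def mult_difference(light, full):
    """Negative result means the lightweight module needs fewer multiplies."""
    return light - full


if __name__ == "__main__":
    k, m, n, f = 3, 64, 128, 112  # illustrative layer shape only
    c_f = full_conv_mults(k, m, n, f)
    c_d = depthwise_separable_mults(k, m, n, f)
    print(mult_difference(c_d, c_f) < 0)
```

Note that a linear bottleneck is not automatically cheaper than a full convolution: with a large expansion factor and few channels, its difference can be positive, which is why the sign of the difference is checked per layer.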

VI. EXPERIMENTS
In this section, we evaluate the performance of LiNFNet regarding robustness against reflected light and the trade-off between performance and computational complexity. In addition, a comparative analysis of the proposed system with existing systems [9]-[11] was conducted. For the two main experiments, the augmented database, which was constructed by CycleGAN-based G2NG data augmentation, should be utilized. Therefore, before the main experiments, we conducted qualitative and quantitative evaluations of the proposed data augmentation. In Section VI-A, the qualitative and quantitative evaluations of the proposed data augmentation are described. Databases and the training setup for the two main experiments are presented in Sections VI-B and VI-C, respectively. In Sections VI-D and VI-E, the descriptions of the main experiments are provided.

A. CYCLEGAN-BASED G2NG DATA AUGMENTATION
In these experiments, qualitative and quantitative performance evaluations were conducted for the CycleGAN-based G2NG data augmentation.

1) QUALITATIVE EVALUATION
Through the performance evaluation, we investigated how realistically the proposed G2NG data augmentation generates the synthetic NIR face images with and without eyeglasses from real images. We split the CASIA VIS-NIR 2.0 [48] and PolyU-NIRFD [47] databases into training and test databases. Table 5 shows the number of training face images with and without eyeglasses. For testing, we used all of the NIR face images in the CASIA VIS-NIR 2.0 and PolyU-NIRFD databases. In Fig. 8 and 10, the results of the proposed CycleGAN-based G2NG data augmentation are shown.
In Fig. 8, the synthetic images with and without eyeglasses are very similar to real images. In the synthetic images with eyeglasses, the reflected lights, which are generated around the eyes, appear in various patterns. Therefore, the generalization ability of CycleGAN is good with respect to the generation of various reflected lights. Even though the average intensity values of the synthetic images without eyeglasses were higher than those of the real images, the reflected lights of the real images with eyeglasses were successfully removed in the synthetic images, and the identities of the real images are well preserved in the synthetic images.
To analyze the phenomenon in which synthetic images without eyeglasses are brighter than real images with eyeglasses, we compared the 3D profiles of a real and synthetic image pair (Fig. 9). In the profile of the real image, the intensities of reflected light around the eyes were almost 255, and the rest of the image had intensities near 150. On the other hand, in the profile of the synthetic image, most parts of the face had intensities near 255. From this observation, we infer that CycleGAN for our augmentation method was trained to remove the reflected light around the eyes by increasing the overall intensities of the face rather than by adding information about the face to the areas of the reflected light.
Fig. 10 shows the failure cases of the proposed G2NG data augmentation. In the synthetic images without eyeglasses, black noise occurs around the eyes, and the eyes which are covered with the reflected lights are not realistically synthesized. However, the number of failure cases is much lower than that of the success cases. The numbers of successful and failed synthetic images are 32,992 and 4,191, respectively. Therefore, we can justify using CycleGAN for the proposed G2NG data augmentation.

2) QUANTITATIVE EVALUATION
Because it is not straightforward to quantitatively evaluate synthetically generated images, we assumed that if the synthetic images are realistic, the accuracy and validation rates of NIR FR would be increased in the G-NG positive FR scenarios after data augmentation. Therefore, as a quantitative evaluation of the proposed data augmentation, we compared the NIR FR validation rates with or without the use of the proposed data augmentation.
For this evaluation, instead of using LiNFNet, we utilized off-the-shelf DCNN architectures (Inception ResNet v1 [26] and VGGNet 16 [25]) to investigate the effects of the proposed data augmentation. The results of the data augmentation in LiNFNet are discussed in Section VI-D.
[Table caption fragment] ... [18], and the linear bottleneck [21]. The details of the VGGNet 16_light architecture were already explained in Table 3. ''DSC'' and ''LB'' mean the depthwise separable convolution and linear bottleneck, respectively. The architectures are trained using the Integrated NIR Face database.
FIGURE 12. The NIR FR accuracy of the architectures in Table 9 according to the types of the input pair in Fig. 3. The abbreviations of the input pair types on the X-axis have the same meanings as those in Fig. 4 and 5.
We prepared several databases to train DCNN architectures (Table 6). The validation database was generated from CASIA NIR [46], and contains 2,000 pairs for each input pair type described in Fig. 3. The architectures were trained using fine-tuning [9], and the pretrained models were trained with data from the CASIA WebFace database [45].
The results of this experiment are shown in Fig. 11. For Inception ResNet v1 and VGGNet 16, the augmented training databases (CASIA VIS-NIR 2.0_AUG and PolyU-NIRFD_AUG) helped these architectures achieve higher validation rates of NIR FR than the original training databases (CASIA VIS-NIR 2.0 and PolyU-NIRFD). The augmented training databases considerably improved the accuracy of NIR FR for the mixed positive pairs (see Table 7). For Inception ResNet v1, the CASIA VIS-NIR 2.0_AUG and PolyU-NIRFD_AUG databases increased the NIR FR accuracy for the mixed positive pairs by 17.25% and 40.15%, respectively. In the case of VGGNet 16, the NIR FR accuracy for the mixed positive pairs increased by 16.25% and 30%, respectively. The use of the augmented training databases significantly improved the validation rate of the DCNN models for NIR FR in the G-NG positive FR scenarios.

B. DATABASES
In this section, we explain the details of the training, validation, and test databases which were used in the experiments described in the next sections. As explained in Section III, the training stage consists of two steps: obtaining the pretrained model and fine-tuning.

2) FINE-TUNING DATABASES FOR NIR FR
We prepared two fine-tuning databases for NIR FR: the Integrated NIR Face database and the Integrated NIR Face_AUG database. The Integrated NIR Face database was constructed by combining the CASIA VIS-NIR 2.0 [48] and PolyU-NIRFD [47] databases. This database includes 37,183 NIR face images for 948 identities. The Integrated NIR Face_AUG database is an augmented version of the Integrated NIR Face database; the database was constructed by CycleGAN-based G2NG data augmentation. When augmenting the database, we excluded the failure cases of the synthetic images shown in Fig. 10. This database contains 70,175 NIR face images for 948 identities. We did not follow the performance evaluation protocols of CASIA VIS-NIR 2.0, because these protocols are designed for heterogeneous FR (using both RGB and NIR face images).

3) VALIDATION / TEST DATABASE
For the experiments described in the following sections, we used the CASIA NIR database [46] as the validation and test database, because this database has a number of G-NG mixed face classes including face images both with and without eyeglasses. By using the CASIA NIR database, we could construct a number of mixed positive pairs (Fig. 3 (c)) to evaluate the performance of the G-NG positive FR scenarios. The CASIA NIR database includes 3,938 NIR face images of 197 identities.
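The notion of a mixed positive pair can be made concrete with a small sketch. The code below labels every image pair by identity match and eyeglasses status, then collects the mixed positive pairs (same identity, one image with eyeglasses and one without). The record layout, the toy data, and the labels "G-G" and "NG-NG" are assumptions introduced for illustration; only the G-NG mixed type is named in the text, and the full set of pair types is defined in Fig. 3.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: (image_id, identity, wears_glasses).
images = [
    ("a1", "A", True),
    ("a2", "A", False),
    ("a3", "A", False),
    ("b1", "B", True),
    ("b2", "B", True),
]


def pair_type(img1, img2):
    """Label a pair by identity match and eyeglasses status."""
    same_identity = img1[1] == img2[1]
    glasses = (img1[2], img2[2])
    if all(glasses):
        kind = "G-G"      # both images with eyeglasses (assumed label)
    elif not any(glasses):
        kind = "NG-NG"    # both images without eyeglasses (assumed label)
    else:
        kind = "G-NG"     # mixed: exactly one image with eyeglasses
    return ("positive" if same_identity else "negative", kind)


# Count how many pairs of each type the toy data set yields.
counts = Counter(pair_type(p, q) for p, q in combinations(images, 2))

# Mixed positive pairs: same identity, one with and one without eyeglasses.
mixed_positive = [
    (p[0], q[0])
    for p, q in combinations(images, 2)
    if pair_type(p, q) == ("positive", "G-NG")
]

print(counts)
print(mixed_positive)
```

A database with many identities photographed both with and without eyeglasses yields many such pairs, which is the property the CASIA NIR database is chosen for here.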

4) DATABASE CONFIGURATION
In the following sections, we report two experiments: the performance evaluation of LiNFNet, and the performance comparison of the proposed NIR FR system and existing NIR FR methods. We describe the database configuration for both experiments in Table 8. In these experiments, both the Integrated NIR Face and Integrated NIR Face_AUG databases were used as the training databases.
The CASIA NIR database [46], however, was utilized differently in the two experiments. For the performance evaluation of LiNFNet, we acquired 12,000 pairs from the CASIA NIR database for validation; there are 2,000 pairs for each type of input pair (Fig. 3).
the CASIA NIR database. For the identification scenarios, we grouped the problems into two types: open-set and closed-set. The proposed system and Kim et al. [9] solve the open-set problem, and Zhang et al. [10] and Peng et al. [11] solve the closed-set problem.

C. TRAINING SETUP
In this section, we explain the detailed training settings for LiNFNet. The size of the NIR face images is 160 × 160 pixels. We applied random cropping and flipping as basic data augmentation, in addition to the proposed CycleGAN-based G2NG data augmentation. We set the number of iterations, batch size, and learning rate to 90,000, 32, and 0.001, respectively. Following the literature [2], we set the embedding size to 128. For all of the experiments in the following section, the dropout keep probability and weight decay were 0.8 and 0.00005, respectively, and we set the center loss factor to 0.01 and the center loss alpha to 0.9. When training LiNFNet, we used RMSProp, which is one of the gradient descent methods, and the fine-tuning method [9] was used as the training method. Ruder [53] has stated that RMSProp, Adadelta, and Adam are good gradient descent methods. Wilson et al. [54] also found that the image classification loss of RMSProp on the CIFAR dataset [55] was slightly lower than that of Adam. Since NIR FR is strongly associated with image classification, we chose RMSProp as the gradient descent method with which to train the DCNN architecture for NIR FR. We trained the LiNFNet architecture on an NVIDIA GTX 1080 Ti.
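For reference, the settings above can be gathered into a single configuration object, as in the following minimal sketch. The class and field names are hypothetical and do not come from the authors' code. The helper estimates how many passes over each fine-tuning database the iteration budget corresponds to, assuming that one iteration processes one mini-batch and that random cropping and flipping do not change the number of images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings reported in the text (field names are illustrative)."""
    image_size: int = 160            # input images are 160 x 160 pixels
    iterations: int = 90_000
    batch_size: int = 32
    learning_rate: float = 0.001
    embedding_size: int = 128
    dropout_keep_prob: float = 0.8
    weight_decay: float = 5e-5
    center_loss_factor: float = 0.01
    center_loss_alpha: float = 0.9
    optimizer: str = "RMSProp"


def approx_epochs(cfg: TrainConfig, num_images: int) -> float:
    """Approximate passes over the data, assuming one mini-batch per iteration."""
    return cfg.iterations * cfg.batch_size / num_images


cfg = TrainConfig()
# Image counts of the two fine-tuning databases, as reported in Section VI-B.
databases = {
    "Integrated NIR Face": 37_183,
    "Integrated NIR Face_AUG": 70_175,
}
for name, num_images in databases.items():
    print(f"{name}: about {approx_epochs(cfg, num_images):.1f} passes")
```

Under these assumptions, the same iteration budget covers the augmented database roughly half as many times as the original one, because the augmented database is almost twice as large.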

D. PERFORMANCE EVALUATION OF LINFNET
FIGURE 14. The NIR FR accuracy of LiNFNet and the existing architectures [18], [19], [21], [22], [25], [26] for the mixed positive pairs according to the training databases.

To evaluate the performance of the LiNFNet architecture in the G-NG positive FR scenarios, we conducted two experiments. The first experiment was an ablation study of the LiNFNet architecture. In the second experiment, we compared the performance of LiNFNet with existing DCNN architectures [18], [19], [21], [22], [25], [26]. As the performance metrics, we used accuracy, validation rate, the number of parameters, and FLOPs.
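As an illustration of the two recognition metrics, the sketch below computes the verification accuracy at a distance threshold and the validation rate at a target false accept rate (FAR). It follows the common definitions VAL = TA / P_same and FAR = FA / P_diff, where TA and FA are the true and false accepts and P_same and P_diff are the numbers of positive and negative pairs. Whether the evaluation here uses exactly these definitions, and which FAR target it uses, is an assumption; the function names and the toy distances are hypothetical.

```python
import numpy as np


def accuracy(dist, same, threshold):
    """Fraction of pairs classified correctly at the given distance threshold."""
    pred_same = dist <= threshold
    return float(np.mean(pred_same == same))


def val_far(dist, same, threshold):
    """Validation rate (TA / P_same) and false accept rate (FA / P_diff)."""
    pred_same = dist <= threshold
    true_accepts = np.sum(pred_same & same)
    false_accepts = np.sum(pred_same & ~same)
    return (float(true_accepts / np.sum(same)),
            float(false_accepts / np.sum(~same)))


def val_at_far(dist, same, far_target):
    """Best validation rate among thresholds whose FAR stays within the target."""
    best = 0.0
    for threshold in np.sort(dist):
        val, far = val_far(dist, same, threshold)
        if far <= far_target:
            best = max(best, val)
    return best


# Toy embedding distances for six pairs; the first three are positive pairs.
dist = np.array([0.2, 0.4, 0.9, 1.1, 0.5, 1.3])
same = np.array([True, True, True, False, False, False])

print(accuracy(dist, same, threshold=0.6))
print(val_far(dist, same, threshold=0.6))
print(val_at_far(dist, same, far_target=0.0))
```

The validation rate is the stricter of the two metrics: it fixes the tolerated rate of false accepts first and then asks how many genuine pairs are still accepted.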

1) ABLATION STUDY
We conducted an ablation study to investigate the effect of the depthwise separable convolutions [18] and linear bottlenecks [21] in LiNFNet. For the baseline, we utilized VGGNet 16_light, a lightweight version of VGGNet 16. Using this baseline, we compared the performance of the following architectures: Baseline+DSC, Baseline+LB, and Baseline+DSC+LB (LiNFNet). The results of the experiment are summarized in Table 9.
The accuracy and validation rate of the Baseline+DSC were 0.8% and 1.2% higher than those of the baseline, respectively. Although the Baseline+DSC does not contribute much to the reduction of the number of parameters, this architecture requires about $1.82 \times 10^{6}$ fewer FLOPs than the baseline. Depthwise separable convolution thus appears to be more suitable than full convolution for the initial convolution layers of the VGGNet 16 architecture in NIR FR.
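The source of the savings can be seen from the standard cost formulas for a single layer. The sketch below compares a full k × k convolution with a depthwise separable one (a k × k depthwise convolution followed by a 1 × 1 pointwise convolution) in terms of multiplications and weights, ignoring biases and assuming stride 1 with the output the same spatial size as the input. The layer dimensions are illustrative only and are not taken from LiNFNet.

```python
def standard_conv_cost(h, w, c_in, c_out, k=3):
    """Multiplications and weights of a full k x k convolution (no bias)."""
    params = k * k * c_in * c_out
    mults = h * w * params
    return mults, params


def depthwise_separable_cost(h, w, c_in, c_out, k=3):
    """Multiplications and weights of a depthwise k x k plus pointwise 1 x 1 conv."""
    depthwise_params = k * k * c_in       # one k x k filter per input channel
    pointwise_params = c_in * c_out       # 1 x 1 convolution that mixes channels
    params = depthwise_params + pointwise_params
    mults = h * w * params
    return mults, params


# Illustrative layer: 56 x 56 feature map, 64 input and 128 output channels.
h, w, c_in, c_out, k = 56, 56, 64, 128, 3
full_mults, full_params = standard_conv_cost(h, w, c_in, c_out, k)
dsc_mults, dsc_params = depthwise_separable_cost(h, w, c_in, c_out, k)

ratio = dsc_mults / full_mults
print(f"full convolution:      {full_mults:,} mults, {full_params:,} weights")
print(f"depthwise separable:   {dsc_mults:,} mults, {dsc_params:,} weights")
print(f"cost ratio (DSC/full): {ratio:.4f}")
```

The ratio equals 1/c_out + 1/k², so the saving in multiplications scales with the spatial size of the feature map. This is consistent with the observation above that replacing early layers, which have large feature maps but few channels, lowers the FLOPs noticeably while changing the parameter count only a little.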
As shown in Table 9, the NIR FR accuracy and validation rate of the Baseline+LB increased by 1.3% and 5.8% over the baseline. This is because the linear bottlenecks extract better features for NIR FR by using a number of channels of the input activation than the full convolutions. In addition, the number of parameters and FLOPs of the Baseline+LB are about twice those of the baseline. Therefore, the linear bottleneck is the main factor in improving the performance of NIR FR in terms of accuracy, validation rate, memory, and computational complexity.
As shown in Table 9, the validation rate of LiNFNet increased over the baseline by as much as the combined increases of the Baseline+DSC and Baseline+LB. This means that the contributions of the two lightweight convolution modules (the depthwise separable convolution [18] and linear bottleneck [21]) to the improvement of the NIR FR validation rate do not overlap. Therefore, applying the lightweight convolution modules to VGGNet 16_light to construct the LiNFNet architecture is extremely effective for improving the accuracy and validation rate of NIR FR. As shown in Fig. 12, LiNFNet showed a considerable increase in NIR FR accuracy for the mixed positive pairs over the other architectures. We demonstrated that LiNFNet is an efficient lightweight version of the VGGNet 16 architecture in the G-NG positive FR scenarios with respect to memory usage, computational complexity, and NIR FR accuracy.
The first experiment was designed to evaluate the performances of the pretrained models of LiNFNet and other DCNN architectures [18], [19], [21], [22], [25], [26] in the RGB domain. The LFW database [52] was used as a validation database. The results of the experiment are summarized in Table 10. In general, the performance of a DCNN architecture decreased as the architecture became lighter. However, although LiNFNet is a lightweight version of VGGNet 16, LiNFNet had higher accuracy and validation rate than VGGNet 16, and also achieved the best performance amongst all architectures for the performance comparison.
For the second experiment, the performances of the architectures without the proposed G2NG data augmentation are summarized in Table 11. LiNFNet achieved the highest NIR FR accuracy and validation rate among all architectures described in Table 11. As shown in Fig. 13, LiNFNet had the best FR accuracy of the mixed positive pairs. Even though LiNFNet was trained without the proposed G2NG data augmentation, it could achieve a high accuracy of 94% in the G-NG positive FR scenario.
As shown in Table 10 and Fig. 13, the LiNFNet architecture is more effective at recognizing the mixed positive pairs in the NIR domain and the challenging face image pairs in the RGB domain than the existing DCNN architectures [18], [19], [21], [22], [25], [26]. In addition, LiNFNet has considerably fewer parameters and FLOPs than VGGNet 16. Although LiNFNet is slightly heavier than the existing lightweight architectures [18], [21], [22] described in Table 11, the accuracy and validation rate of LiNFNet are considerably higher than those of the competitors. Therefore, LiNFNet achieves a good balance between accuracy and computational complexity.
To explore the performance improvements achieved through the proposed data augmentation, all architectures [18], [19], [21], [22], [25], [26] were fine-tuned using the Integrated NIR Face_AUG database. The results of the performance evaluation are summarized in Table 12. After the proposed data augmentation, all of the architectures in Table 12 performed better than the no-augmentation versions shown in Table 11. From the results shown in Fig. 14, it is apparent that the proposed data augmentation is effective in improving accuracy for the mixed positive pairs. By integrating CycleGAN-based G2NG data augmentation and LiNFNet, the proposed NIR FR system achieved an accuracy and validation rate of more than 99%, and the proposed system also had a better ability to recognize the mixed positive pairs than the off-the-shelf DCNN architectures [18], [19], [21], [22], [25], [26].

E. PERFORMANCE COMPARISON OF THE PROPOSED NIR FR SYSTEM AND EXISTING METHODS
We compared the proposed system with the existing DCNN-based NIR FR methods [9]- [11], [58]. For this experiment, we reproduced Zhang's method [10] and Peng's method [11], which are known to have NIR FR accuracies of around 98%. We verified that the two implemented methods achieved identification rates of 97.92% and 97.4%, respectively. These values are similar to those reported in [10] and [11]. Therefore, we verified that the implementations of [10] and [11] were correct. Kim's method [9] and Jo's method [58] were also reproduced. Kim's method [9] achieved an identification rate of over 99%. The NIR FR method developed by Kim et al. [9] had a better ability to recognize the pairs that included only NIR face images without eyeglasses than Zhang's method [10] and Peng's method [11].
Despite the high reported accuracy of the existing NIR FR methods [9]- [11], [58], these results did not consider mixed positive pairs. Peng et al. [11] excluded NIR face images with eyeglasses in the training and test processes of FR, and Zhang et al. [10] utilized the PolyU-NIRFD database [47] as training and test databases to conduct performance evaluation; as shown in Table 1, there are few mixed positive pairs in the PolyU-NIRFD database. In the literature [9], an analysis of the G-NG positive FR scenarios was lacking. Jo and Kim [58] added simple reflected light patterns to the areas of the NIR face image around the eyes. However, the patterns did not prove to be sufficiently realistic.
To compare the proposed NIR FR system with existing NIR FR methods [9]- [11], [58] in G-NG positive FR scenarios, we constructed a G2NG test database, as described in Table 8, and conducted performance evaluation of identification on the G2NG test database. The results of this experiment are presented in Table 13.
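To make the identification metric concrete, the following sketch computes a rank-1 identification rate by matching each probe embedding to its nearest gallery embedding. It covers only the simplest closed-set case with squared Euclidean distance, and the gallery/probe split, the function name, and the toy embeddings are assumptions for illustration. The actual protocol of the G2NG test database is the one given in Table 8, and an open-set evaluation would additionally need a rejection threshold.

```python
import numpy as np


def rank1_identification_rate(gallery, gallery_ids, probes, probe_ids):
    """Fraction of probes whose nearest gallery embedding has the same identity."""
    # Pairwise squared Euclidean distances, shape (num_probes, num_gallery).
    diff = probes[:, None, :] - gallery[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    nearest = np.argmin(dist, axis=1)
    return float(np.mean(gallery_ids[nearest] == probe_ids))


# Toy 2-D embeddings: one gallery image per identity, three probe images.
gallery = np.array([[1.0, 0.0], [0.0, 1.0]])
gallery_ids = np.array([0, 1])
probes = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
probe_ids = np.array([0, 1, 1])

rate = rank1_identification_rate(gallery, gallery_ids, probes, probe_ids)
print(f"rank-1 identification rate: {rate:.3f}")
```

In a G-NG setting, one natural arrangement would be a gallery of images without eyeglasses and probes with eyeglasses, so that every correct match is a mixed positive pair.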
When using the proposed CycleGAN-based G2NG data augmentation to train the LiNFNet architecture, the identification rate of the architecture increased. The proposed data augmentation therefore contributes to an improvement in the identification rates on the G2NG test database. In addition, LiNFNet trained without CycleGAN-based data augmentation achieved 4% and 0.6% higher identification rates than Kim's method [9] and Jo's method [58], respectively. Therefore, the LiNFNet architecture itself is robust against reflected light in the G-NG positive FR scenarios. The proposed NIR FR system (LiNFNet + CDA) has the best NIR FR ability to recognize the mixed positive pairs among the NIR FR methods, as shown in Table 13.

VII. CONCLUSION
In this paper, we proposed a DCNN-based fast NIR FR system robust to reflected light. The proposed system has two contributions: one is the CycleGAN-based G2NG data augmentation, and the other is LiNFNet. Through these two contributions, the performance of the proposed NIR FR system is improved with respect to accuracy and computational complexity. In particular, the proposed NIR FR system considerably improves the accuracy of DCNN-based NIR FR in G-NG positive FR scenarios. We showed that the proposed system has advantages in terms of striking a balance between accuracy and the computational complexity of NIR FR over existing lightweight architectures [18], [19], [21], [22] as well as off-the-shelf DCNN architectures [25], [26]. The proposed system also has the best identification rate, compared to the existing NIR FR methods [9]- [11], on the G2NG test database, which includes mixed positive pairs, as shown in Fig. 3. The system achieved an identification rate of 100% on the G2NG test database.
Before discussing future work, it is worth mentioning the pros and cons of our NIR FR system compared to existing methods [56]- [58]. Based on the experiment of [9], the proposed NIR FR method is expected to have an advantage over existing RGB FR methods [56], [57] regarding FR validation rate under poor lighting conditions. However, the architecture of Wu et al. [56] can be more versatile than LiNFNet for different modalities of FR, because it was designed to solve not only RGB FR scenarios, but also infrared-visible heterogeneous FR scenarios. As compared to the method of Jo and Kim [58], the proposed system has a better FR validation rate than the competitor; however, the DCNN architecture used in [58] is less complex than LiNFNet.
Based on the pros and cons of the proposed system, we can set two possible future directions of research: 1. Improving LiNFNet to handle various modalities of FR, 2. Developing a DCNN architecture which can produce more efficient facial representations than LiNFNet.
Also, the accuracy and validation rate of NIR FR depend upon the contents and characteristics of the training and validation databases. To address this problem, we will research methods that reduce the sensor dependency of NIR FR.