High-Fidelity Illumination Normalization for Face Recognition Based on Auto-Encoder

Nonuniform illumination is one of the main issues hindering the accuracy of face recognition because it complicates the intra-person variation. To minimize the intra-person differences caused by varying illumination, this paper presents a normalization method based on the Convolutional Auto-encoder (CAE). The CAE is employed to map face images under various illumination conditions to a normalized one, generating preliminary results whose facial details are blurry and insufficient, which hinders recognition. To recover these details, a restoration method based on a re-blurring strategy and frequency analysis is proposed to preserve the facial features lying in the high-frequency components of the discrete cosine transform (DCT). In our method, these components are extracted from the original images and re-introduced into the outputs of the CAE to enhance their fidelity. Thus, the facial details are preserved to the largest degree, benefiting subsequent tasks such as recognition. Experiments conducted on the AR, Extended Yale B, and CAS-PEAL databases demonstrate the effectiveness of our method.


I. INTRODUCTION
Face recognition has the potential to be widely applied in access control, identity authentication, watch-list surveillance, etc. However, uncontrolled illumination conditions pose an obstacle to its robustness [1], because the intra-person variations caused by varying illumination can be more complicated than inter-person variations. For example, the shadow cast over a face varies drastically with the direction and intensity of the lighting, which degrades the accuracy of face recognition. Therefore, it is crucial to conduct illumination normalization before recognition.
Over the last decades, many algorithms have been developed for illumination normalization. Early on, holistic normalization methods drew great attention. Histogram equalization (HE) [2] and histogram matching (HM) manage to deal with less complicated illumination problems by altering the pixel values and adjusting the intensity histogram of a gray-scale image. The gamma intensity correction (GIC) [3] and the logarithm transform (LT) focus on the overall brightness and achieve similar results. However, such holistic methods can only cope with simple illumination variations.
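For concreteness, the two holistic operations mentioned above can be sketched in a few lines of NumPy (a minimal illustration of the general techniques, not the exact implementations evaluated in [2], [3]):

```python
import numpy as np

def histogram_equalization(img):
    """Flatten the intensity histogram of an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize CDF to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)         # lookup table
    return lut[img]

def gamma_intensity_correction(img, gamma=0.5):
    """Brighten (gamma < 1) or darken (gamma > 1) the overall image."""
    x = img.astype(np.float64) / 255.0
    return np.round(255.0 * np.power(x, gamma)).astype(np.uint8)
```

Both operate on the whole image at once, which is exactly why they cannot correct a shadow that covers only part of the face.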
Later, numerous methods were proposed to model the illumination variation. Georghiades et al. [4] reconstruct the shape and albedo of a face using a small number of training samples taken under different lighting directions, based on the observation that, for a given identity in a fixed pose, the face images under all possible illumination conditions form a convex cone in the image space. In [5], illumination variation is modeled by low-dimensional linear spaces, and the linear subspaces spanned by the corresponding images are a good approximation of the illumination cone. The main drawback of this kind of method is that its accuracy relies heavily on precise face alignment to obtain samples under the same pose and expression. Besides, the cost of preparing and collecting face images covering various lighting conditions is quite high.
Extracting features invariant to illumination has also been studied extensively. For example, Gabor wavelets [6] are designed to simulate the receptive fields of striate neurons and can thus obtain illumination-robust features. Local binary patterns (LBP) [7] and Gradientfaces [8] have also been proposed as descriptors for extracting illumination-invariant features. Many other methods conduct illumination normalization in the frequency domain [9], [10], because, according to the Lambertian reflectance model [11], the low-frequency component (LF) is highly related to the illumination variations while the high-frequency component (HF) contains the intrinsic features of an image. Although reasonable results are reported, the features extracted with these methods cannot properly handle extreme lighting conditions and may ignore useful cues; therefore they cannot meet the rigorous demands of face recognition.
Recently, deep neural networks (DNNs) have been implemented to conduct illumination normalization. In [12], a local pattern extraction layer and an illumination elimination layer are designed and integrated into a Convolutional Neural Network (CNN) to obtain illumination-invariant feature maps. Wu et al. [13] devise a multi-task DNN to complete the tasks of normalization and reconstruction. A Generative Adversarial Network (GAN) with four types of loss functions is utilized in [14] to generate images under several fixed illumination conditions. Han et al. [15] propose to enhance the output quality of a primary GAN by incorporating another GAN, which relights the normalized results; by training the two GANs interactively, high-quality outputs are produced. However, these methods [14]–[16] usually rely on carefully designed network architectures or sophisticated loss functions, thus increasing the computational cost. Besides, labels indicating the lighting conditions are required in their training process.
The Auto-encoder (AE) and its upgraded version, the Convolutional Auto-encoder (CAE), have been widely used to tackle a wide range of face-related tasks, including pose alignment [17], [18], 3D face reconstruction [19], de-occlusion [20], etc. Building on the success of Convolutional Neural Networks (CNNs) in various computer vision tasks, the CAE incorporates the feature extraction power of the convolutional operation into the AE, which improves its ability to understand 2D image structures [21], [22]. Park et al. [23] achieve good performance in low-light image enhancement by utilizing two networks, an AE for illumination estimation and a CAE for image restoration. A stacked sparse denoising auto-encoder is employed in [24] to enhance low-light images. Overall, the CAE is capable of transforming its input images to a constricted domain and is promising for solving the illumination problems in face recognition tasks [25].
In this paper, a CAE combined with a detail restoration method is proposed for illumination normalization. The CAE is adopted to normalize the illumination because of its capacity for transforming images into a normalized domain, but its outputs often lack vital details, especially when there are great changes in light and shade. Traditional methods based on the discrete cosine transform (DCT), on the other hand, can preserve useful facial details invariant to illumination but rely on sophisticated parameter settings. Consequently, we propose to combine the generation power of the CAE and the detail-preserving capability of traditional methods to obtain better performance. This paper illustrates how the HF of the original images and the LF of the CAE outputs can be adaptively extracted and combined to optimize the performance by implementing an iterative re-blurring strategy [26], [27]. Experiments on the AR database, the CAS-PEAL dataset [28], and the Extended Yale B database validate the power of this combination. The main contributions of this paper are as follows:
• The CAE is first introduced to generate preliminary normalization results because of its power of reconstruction. Besides, no auxiliary network or extra loss function is required, so it can be easily optimized and the computational cost is reduced.
• The CAE cannot restore facial details well when the illumination condition is complex because it only focuses on holistic reconstruction. Therefore, inspired by existing normalization methods based on the DCT, we propose a detail restoration method to enhance the quality of its outputs. The HF of the original image is extracted and combined with the LF of the generated image to achieve higher fidelity. This strategy takes advantage of both the CAE and traditional methods, and can be extended to alleviate quality degradation problems in other similar fields.
• A re-blurring strategy is introduced to automatically decide the boundary between the HF and LF in the DCT domain, where the output of the CAE is taken as a reference. This strategy avoids sophisticated parameter settings and thus ensures effective and efficient detail restoration.
The rest of the paper is organized as follows: Section 2 summarizes the state-of-the-art methods related to our work. Section 3 describes the proposed method. Section 4 presents the experimental evaluation and Section 5 concludes this paper.

II. RELATED WORK
The AE couples an encoder with a decoder to learn a mapping between its inputs and targets. In the CAE, the encoder and decoder are composed of convolutional layers and deconvolutional layers, respectively. For example, Hinton et al. [25] conduct view-point transformation with an AE, and Tewari et al. [19] learn the mapping between 2D images and their corresponding 3D models with an expert-designed CAE.
The CAE has been implemented in many other face re-rendering tasks. Hinton et al. [25] conduct view-point transformation by translating images into codes describing their pose information with the encoder before computing the transformation parameters with the decoder. In [29], a very generic CAE is employed. The neurons in the last layer of its encoder are separated into several groups, each of which learns to represent a certain type of transformation (e.g., face rotation, lighting direction, etc.). The decoder then re-renders the input images to different viewpoints, lighting conditions, etc. For each group of neurons, a mini-batch of images corresponding to changes in only a single scene variable is used for training. Wu et al. [13] convert images into a code representing poses and illumination conditions and then reconstruct images in the frontal view and the neutral lighting condition. With the aid of 3D face models, numerous training samples are generated to optimize their network. These works have achieved reasonable results. However, a major drawback is that the outputs of the CAE are often low in quality and cannot meet the rigorous demands of face recognition. Specifically, the CAE often fails to recover facial details well when the illumination condition is complex, for example, when there are both low-contrast and high-contrast regions, which frequently occur in images under non-uniform lighting conditions.
As for the underperformance of the CAE in reconstructing high-quality images, the reasons are twofold [30]:
• The Compression Nature of the Encoder and the Pooling Layers: The encoder, acting as a feature extractor, converts the images into a lower-dimensional subspace, from which subtle details may be lost. In particular, if the network contains pooling layers, which may further discard useful details, the quality will be degraded significantly.
• The Optimization Objective Inconsistent With Human Visual Perception: The CAE is trained by minimizing the pixel-wise differences between the original images and the generated ones, and such a holistic error is far from human visual perception. Worse, minimizing this error may cause the local facial details to be ignored, because they only account for a small proportion of the image.
In terms of the first factor, Mao et al. [31] propose to pass image details from the encoder directly to the decoder. However, for transformation tasks such as pose alignment and illumination normalization, the feature maps extracted by the encoder and those of the decoder are in different conditions (i.e., poses, lighting directions), and as a result, such an algorithm cannot be applied directly. To address the second factor, the GAN is designed to compensate for this drawback by introducing an extra discriminator that ensures the fidelity of the images. However, its hyper-parameters are difficult to select, and sophisticated loss functions and structures are needed, or else visually absurd outputs may be generated, burdening the training process [30]. In [14], image quality is enhanced by introducing two types of loss functions based on two image quality assessment indexes, i.e., the peak signal-to-noise ratio (PSNR) and the Structural Similarity Index (SSIM). By maximizing these indexes, images of higher resolution are obtained. Han et al. [15] propose to enhance the quality of their results by attaching another GAN, GAN2, to a primary one, GAN1. The auxiliary GAN2 is trained to transform its input to a fixed lighting condition, although it must be provided with the lighting label.
By training the two GANs interactively, the whole network is optimized and can produce high-quality results. These methods have achieved reasonable results, but at the cost of additional computation and auxiliary architectures.
As illustrated above, in the frequency domain the LF corresponds to illumination variations and the HF contains facial details. Therefore, many existing methods based on frequency analysis preserve the HF while discarding or normalizing the LF, and their capability of maintaining image quality is considerable. For example, in one approach the images are transformed into the logarithm domain to disentangle the illumination component from the reflectance, and the DCT is then performed independently on 8 × 8 non-overlapping blocks; the LF is discarded to achieve good normalization performance. In [9], images are preprocessed using HE, Contrast Limited Adaptive Histogram Equalization (CLAHE), and the logarithm transformation, and in the DCT domain the LF coefficients are multiplied by an exponential function of their relative location. These methods all preserve crucial details and yield competitive results. However, many of them separate the LF and HF manually, and there is no efficient way to automatically locate the exact boundary between the HF and LF in the frequency domain. In the spatial domain, the logarithmic total variation (LTV) [10] is proposed to separate large- and small-scale features, corresponding to the LF and HF respectively, but the calculation is relatively complex. Therefore, it is natural to combine the LF of the CAE outputs and the HF of the original images to enhance image quality.
In this paper, we adapt the re-blurring strategy to extract the HF and LF. By comparing the blurred original images with the outputs of the CAE, the boundary between the HF and LF is located, making it convenient to combine the HF of the original images with the LF of the CAE outputs. The re-blurring strategy performs well in image quality assessment [32], [33] and de-blurring [34], [35]: taking the corresponding re-blurred restored results as references for the original images, the parameters in these algorithms are fine-tuned and optimized results are achieved.

III. PROPOSED METHODOLOGY
The CAE is an efficient network and has been applied to many face-related tasks. In our case, it translates images under arbitrary lighting conditions into a uniform one, and 3D-aided sample augmentation is utilized to boost the performance. However, the outputs of the CAE are relatively poor in quality, so frequency analysis is exploited to restore vital details. In this section, Subsection A introduces the theory and the defect of the CAE, and Subsection B illustrates the proposed method together with the implementation details and a discussion.
Traditionally, the training error of an auto-encoder is measured by the pixel-wise differences between its input and output, so as to ensure the completeness of the extracted features and the accuracy of the reconstruction process.

A. CONVOLUTIONAL AUTO ENCODER
In our case, however, the aim is to minimize the distance between the output image and the corresponding target image under the uniform illumination condition. As a result, we take the images under arbitrary illumination conditions as inputs, while the corresponding images under the frontal lighting condition with the same pose and expression are regarded as the expected results, as in [29]. The reconstruction loss is defined as the pixel-wise difference between them, which can be represented as

L_rec = || I_CAE − I_norm ||²,

where I_norm indicates the image with frontal uniform lighting and I_CAE represents the output of the CAE. The training pairs can be obtained with the aid of 3D models and the Cook-Torrance re-rendering model [36]. Although the images can be normalized to some extent, it is difficult to restore high-quality facial details if the lighting condition is complex. For example, when the lighting is too strong or too weak, the facial regions are prone to be either overexposed or dark, and both circumstances lead to low contrast in these regions. CNNs, however, treat every receptive field the same regardless of its contrast, so they cannot effectively extract the features of these regions. Although increasing the number of kernels would theoretically bring some benefit, the expansion of the network would inevitably lead to over-fitting. Consequently, the regions with low contrast cannot be restored effectively.
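As a minimal sketch, the reconstruction loss above amounts to a mean squared pixel-wise error between the CAE output and its frontal-lighting target (written here in NumPy for illustration; the actual training uses the network framework's equivalent):

```python
import numpy as np

def reconstruction_loss(i_cae, i_norm):
    """Mean squared pixel-wise difference between the CAE output
    and the frontal-lighting target image."""
    diff = i_cae.astype(np.float64) - i_norm.astype(np.float64)
    return np.mean(diff ** 2)
```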
According to previous research on image analysis in the frequency domain, the facial details lie in the HF while the illumination variation mainly affects the LF. Hence, it is reasonable to deduce that the LF of the original image, which is affected by the nonuniform illumination, can be normalized properly by the CAE, although some details lying in the HF are lost. In comparison, for traditional illumination normalization methods it is difficult to eliminate the effect of illumination in an effective, efficient, and adaptive manner, whereas extracting the HF is less complicated. Based on this consideration, we propose to compensate for the detail loss of the CAE by re-introducing the HF of the original images, as in the traditional methods. By taking advantage of both modern networks and traditional image processing methods, the results of our method are optimized and are ideal for recognition.

B. FREQUENCY ANALYSIS FOR DETAIL ENHANCEMENT
To restore the missing details, we propose to integrate the HF of the original image into the output of the CAE. To extract the illumination-invariant component lying in the HF, previous works either directly separate the HF and LF in the frequency domain or divide the images into large- and small-scale features in the spatial domain. The former mostly set the boundary between the HF and LF to constant values, ignoring the diversity of the frequency distribution of local details. The latter, meanwhile, rely on statistical indicators which cannot represent facial features properly.
In this article, we propose to estimate the boundary between them by analyzing the differences between the DCT coefficients of the original image and that of its filtered version which is most similar in image quality to the output of the CAE. Compared with the mentioned methods [10], our method can adaptively locate the boundary without complex parameter settings.
In our method, the 2D DCT is applied to transform the M × N image into a coefficient matrix in the frequency domain. As shown in Fig. 2, the DC component, which lies in the first row and first column of the matrix, represents the scaled average of the image. The remaining part constitutes the AC components, whose frequency increases along the direction of the arrows in Fig. 2.
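The 2D DCT and the (u, v) → w scan used later can be sketched as follows (a minimal orthonormal DCT-II built from its basis matrix, with a zig-zag ordering over anti-diagonals; the exact scan direction of Fig. 2 may differ):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row: constant basis vector
    return c

def dct2(img):
    """2D DCT; coefficient (0, 0) is the DC term (scaled image average)."""
    cm, cn = dct_matrix(img.shape[0]), dct_matrix(img.shape[1])
    return cm @ img @ cn.T

def idct2(coef):
    """Inverse 2D DCT (transpose of the orthonormal basis)."""
    cm, cn = dct_matrix(coef.shape[0]), dct_matrix(coef.shape[1])
    return cm.T @ coef @ cn

def zigzag_indices(m, n):
    """(u, v) -> w ordering along anti-diagonals, low to high frequency."""
    return sorted(((u, v) for u in range(m) for v in range(n)),
                  key=lambda t: (t[0] + t[1], t[1] if (t[0] + t[1]) % 2 else t[0]))
```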
As mentioned above, the CAE is capable of normalizing the LF of its input, where the illumination variation mainly lies, but it also deteriorates the details, which correspond to the HF of the input image. As a result, the CAE output contains the normalized LF and middle-frequency components. To achieve ideal performance, it is vital to retrieve the lost HF. Therefore, we propose an iterative re-blurring strategy combined with frequency analysis. As illustrated in Fig. 3, we first utilize the re-blurring strategy to gradually discard the HF of the input image, in order to find the filtered sample that is most similar to the CAE output in quality. In the frequency domain, this sample can be regarded as the combination of the LF and middle-frequency components of the original image, and it lacks the HF to the same degree as the CAE output. Hence, this sample and the input image are both transferred into the DCT domain, and the former is taken as a reference for the latter to locate the HF; the expected HF of the original image can thus be extracted and integrated into the CAE output to recover the lost details.

1) ITERATIVE RE-BLURRING STRATEGY
To find the sample I_sim that is most similar in quality to the output of the CAE, I_CAE, the original image I_org is processed with the same Gaussian low-pass filter from 1 to M times. As the number of filtering iterations increases, the blurriness of the sample rises as well. Each filtered sample is then compared with I_CAE in quality. Among these samples there exists one, I_sim, that is most similar to I_CAE, because discarding either too much or too little information leads to a quality mismatch. The parameter M is set to a constant. An alternative is to stop filtering once the quality similarity starts to decrease, because this indicates that the filtered sample contains fewer details than I_CAE and the dissimilarity would only increase if the filtering continued. The selected I_sim lacks details to the same extent as I_CAE while containing the same LF and middle-frequency components as I_org, and is therefore taken as the reference to locate the HF.
It is also worth noting that although the ideal low-pass filter can also be employed to conduct the blurring operation, the Gaussian low-pass filter is preferred in practice because it performs better in de-noising.
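The iterative re-blurring loop, including the early-stop alternative, can be sketched as follows (a simplified illustration with a separable Gaussian filter; `quality` stands for any similarity measure between two images, e.g. the SSIM discussed next, and the kernel width is an assumed setting):

```python
import numpy as np

def gaussian_kernel1d(sigma=1.0, radius=2):
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian low-pass filter with edge padding."""
    k = gaussian_kernel1d(sigma, radius)
    pad = np.pad(img, radius, mode='edge').astype(np.float64)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, 'valid'), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, 'valid'), 0, rows)

def most_similar_blur(i_org, i_cae, quality, max_iters=10):
    """Blur i_org repeatedly and return the iterate whose quality score
    w.r.t. i_cae is highest, stopping once similarity starts to drop."""
    best, best_q, cur = i_org, quality(i_org, i_cae), i_org
    for _ in range(max_iters):
        cur = gaussian_blur(cur)
        q = quality(cur, i_cae)
        if q < best_q:
            break  # further blurring would only widen the quality gap
        best, best_q = cur, q
    return best
```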

2) ASSESSMENT OF IMAGE QUALITY
The image quality resemblance is measured with a local-region SSIM index [37]; the illumination variation, which mainly affects large-scale features, therefore barely affects its accuracy and can be ignored when assessing the resemblance. Concretely, the Structural Similarity Index Measure (SSIM) [37] is employed to evaluate the similarity between the blurred sample I_b and the generated output I_g. This metric combines three factors concerning image distortion, namely the luminance comparison l, the contrast comparison c, and the structure comparison s.
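The global form of the SSIM can be sketched as below (a single-window version that illustrates the combined luminance/contrast/structure terms; the paper applies it on local regions with a sliding window, and the constants follow the common k1 = 0.01, k2 = 0.03 convention):

```python
import numpy as np

def ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Global SSIM = combined luminance, contrast, and structure terms."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```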

3) LOCATING THE BOUNDARY BETWEEN THE HF AND LF
To locate the boundary between the LF and HF, we compare the DCT coefficient matrices of the original image and of the blurred sample that is most similar to it in image quality. The coefficient matrices are converted into vectors by mapping (u, v) → w following the direction of the arrows in Fig. 2. The difference between them is measured by the relative variation of the coefficients:

v(w) = |c_filtered(w) − c_org(w)| / |c_org(w)|,

where c_filtered(w) indicates the coefficient of the filtered signal at frequency w, and c_org(w) is that of the original signal. The boundary is then defined as the mutation point of the coefficient variation, i.e., the lowest frequency at which the variation rises above a threshold α:

w_b = min{ w | v(w) ≥ α }.

The component located in the frequency band [w_b, ∞) is regarded as the HF, and the rest, located in [0, w_b), is defined as the LF. Clearly, when α is set to a constant, w_b decreases as the number of filtering iterations rises. The constant α is necessary because the boundary between the HF and LF is vague, and α controls the actual boundary in our algorithm. The higher α is, the less HF is preserved and the lower the image quality will be, and vice versa. However, when α is too low, the variation caused by illumination may be re-introduced into the result image. To achieve the best recognition performance, α should be carefully selected to balance the trade-off between illumination normalization and image quality. In our experiments, the results are usually optimal when α = 0.1.
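The boundary search can be sketched as below, under the reading that w_b is the first zig-zag frequency at which the relative coefficient variation exceeds α (the exact form of the variation measure is an assumption; the inputs are the zig-zag-ordered DCT coefficient vectors):

```python
import numpy as np

def locate_boundary(c_org, c_filtered, alpha=0.1):
    """First zig-zag frequency w_b at which the relative variation
    between original and filtered DCT coefficients exceeds alpha."""
    eps = 1e-8  # guard against division by zero coefficients
    variation = np.abs(c_org - c_filtered) / (np.abs(c_org) + eps)
    above = np.nonzero(variation >= alpha)[0]
    return int(above[0]) if above.size else len(c_org)
```

A larger α pushes w_b toward higher frequencies, so less HF is spliced back, matching the trade-off described above.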

4) INTEGRATION OF THE DETAIL
After the HF of the original image is located, its components in the frequency band [w_b, ∞) are integrated with the CAE output to optimize the result. With the HF of the original image, the details are preserved and the quality of the result is enhanced; with the LF of the CAE output, the adverse effect of illumination variation is eliminated. As a result, the restored result is both high in quality and invariant to illumination.

5) 3D-AIDED DATA AUGMENTATION
The illumination variation is inextricably related to the 3D geometry of faces, so it is beneficial to generate aligned samples for network training with 3D face models and the Cook-Torrance reflectance model [36]. According to this model, the intensity of a pixel I(x, y) can be decomposed into the product of the surface reflectance R(x, y) and the illumination component L(x, y). The Cook-Torrance model also enables us to obtain L(x, y) under arbitrary illumination conditions. The generated images I(x, y) are then fed into the network and the original images are regarded as the 'ground truth' to optimize the whole network. Some of the generated training samples are shown in Fig. 4.
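Under the multiplicative decomposition I(x, y) = R(x, y)L(x, y), the relighting used for data augmentation can be sketched as follows (a simplified illustration; the full Cook-Torrance model also includes specular terms that this plain division ignores):

```python
import numpy as np

def relight(i_org, l_org, l_new, eps=1e-6):
    """Re-render an image under a new illumination map: recover the
    reflectance R = I / L, then apply the new lighting L_new."""
    r = i_org.astype(np.float64) / (l_org + eps)
    return np.clip(r * l_new, 0.0, 255.0)
```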
Here we reaffirm the necessity of combining these two processes. Traditional illumination normalization methods based on the DCT achieve fair results by discarding the LF directly or by replacing it with specific values. However, the contrast of face images is greatly affected by non-uniform illumination, and as a result, the distribution of DCT coefficients in different facial regions can be considerably distinct. Hence, separating the HF and LF with a unified value cannot cope properly with complex illumination variation.
By contrast, the CAE is capable of extracting facial features invariant to illumination and then reconstructing a normalized image from them. Although the reconstruction is blurred to some extent, it contains the normalized LF and the useful middle-frequency information of the original image. The detail restoration process only makes up for the HF, which is robust to illumination changes, so it does not degrade the normalization effect. In addition, the boundary localization process enables us to disentangle the HF and LF adaptively according to the input image. Therefore, our method theoretically yields better results than simply extracting the HF from the original images.

IV. EXPERIMENT
In this section, we first introduce the architecture and the training of our network. The impact of the parameter α is illustrated and analyzed through experiments on the AR dataset [38], which also show the effectiveness of our method on RGB data. Experiments on the CAS-PEAL database [28] and the Extended Yale B database demonstrate the superiority of our method in handling complex lighting conditions, and comparisons with other state-of-the-art methods are provided.

A. NETWORK SETTINGS 1) NETWORK ARCHITECTURE
The encoder and the decoder both contain three fully convolutional layers and the kernels are all 3 × 3.
The memory occupied by the parameters of our network is extremely small compared with many popular networks for illumination normalization. As shown in Tab. 1, our network only needs 6 MB to store its parameters. Since our network is smaller, the training process is more time-saving and our algorithm can handle illumination variations with relatively low computational cost. Note that only part of the architecture of AJGAN [15] is listed because the whole network is not presented in their article.

2) NETWORK TRAINING
Our network is trained on the generated images described in Section 3 and fine-tuned in a self-supervised manner with the images under frontal lighting conditions in the AR, Extended Yale B, and CAS-PEAL datasets. Note that the faces in the generated database are detected by MTCNN [41]. For the Extended Yale B dataset, we crop the images and discard the hair and facial contour in order to align the generated images with the test images in this dataset.

B. EXPERIMENTS ON THE AR DATASET
In this section, we demonstrate the effectiveness of our proposed algorithm on RGB images and illustrate the impact of the parameter α by conducting experiments on the AR database. The AR dataset contains 4000 pictures of 126 identities with different expressions, lighting conditions, and occlusions, and it is widely used in the pattern recognition community. For each identity, 14 images are used to verify our algorithm; the ones with occlusions are excluded because they are irrelevant to our study.
To illustrate the benefit of the detail restoration process and discuss the effect of the parameter α, SSIM and PSNR values are used to measure the quality of the normalized images. These two indexes are positively associated with image quality. As shown in Tab. 2, both indexes rise as α decreases, indicating growing image quality. The reason is that the greater α is, the less HF is transferred from I_org, and therefore the two values are smaller. This is consistent with our analysis that the HF is related to facial details, and transferring the HF from the original image benefits detail enhancement. Fig. 5 shows the enhanced results under different values of α. When α equals 0, the final results are identical to the original images; when α approaches infinity, the results are simply the output of the network. Evidently, the quality of the images rises as α decreases, while the effect of illumination normalization strengthens as α increases, so there is a trade-off between image quality and normalization effectiveness. Fig. 6 provides some randomly selected results of our method. The first row of each block shows the original images, while the second and third rows compare the results of the CAE and the restored images. The restored images are both high in quality and normalized in illumination. As shown in this figure, the eye and mouth regions are enhanced after applying the proposed detail enhancement method. It is worth noting that the algorithm inevitably alters the skin color due to the disparity between the training samples and the practical input.
The Receiver Operating Characteristic (ROC) curve is also employed to assess the recognition performance on the normalized results. For each normalized image, a nearest-neighbor classifier searches for the most similar image in the rest of the dataset, with the cosine similarity measuring the distance between samples. Fig. 7 shows the ROC curves obtained for different values of α.
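The nearest-neighbor matching with cosine similarity can be sketched as (a minimal illustration, with images flattened to vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened images."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def nearest_neighbor(query, gallery):
    """Index of the gallery image most similar to the query."""
    return int(np.argmax([cosine_similarity(query, g) for g in gallery]))
```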
When α is 0.05, our method underperforms. As α increases, the AUC of the curves rises and reaches a high of 0.3494 when α equals 0.15, which is 17% more than that of the original images, indicating that the normalization process is conducive to recognition. When α is higher, less HF from I_org is preserved, so the results of our algorithm are of lower quality and the recognition rate is degraded. When α is lower, meanwhile, the effect of illumination normalization deteriorates; in other words, the variation caused by the lighting conditions is re-introduced into the images, so the recognition rate is adversely affected. There is thus a trade-off between illumination normalization and image quality. In the following experiments on the Extended Yale B and CAS-PEAL databases, α is set to 0.1 to reach the best performance.

C. EXPERIMENTS ON THE EXTENDED YALE B
In this section we compare our method with the state-of-the-art ones on the Extended Yale B dataset. This dataset contains frontal images of 38 identities under 64 illumination conditions. It is divided into five subsets according to the lighting incident angles (seen in Tab. 3). For face recognition, subset 1 (lighting angles less than 12°) is taken as the training set and the remaining ones as testing sets.
It is noteworthy that the proposed detail enhancement method is conducted on patches for the Extended Yale B dataset because the lighting condition is complex. For an input image of size 100 × 100, a 50 × 50 patch is taken every 10 pixels and enhanced with our method, and the average of the processed patches is taken as the final result. When the lighting condition is less complex, this process can be conducted on the whole image.
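The patch-wise scheme above can be sketched as follows, assuming some enhancement callable is applied per patch; the helper name `enhance_by_patches` and the count-based averaging of overlaps are illustrative assumptions.

```python
import numpy as np

def enhance_by_patches(image, enhance, patch=50, stride=10):
    """Apply `enhance` to overlapping patches and average the results:
    50x50 patches taken every 10 pixels from a 100x100 input, as done
    for the Extended Yale B experiments."""
    h, w = image.shape
    acc = np.zeros((h, w), dtype=float)   # sum of enhanced patches
    cnt = np.zeros((h, w), dtype=float)   # how many patches cover each pixel
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            acc[y:y + patch, x:x + patch] += enhance(image[y:y + patch, x:x + patch])
            cnt[y:y + patch, x:x + patch] += 1.0
    return acc / cnt
```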
To optimize the recognition result, α is set to 0.1. CLAHE is applied to the training samples and to the images in the dataset so that their distributions are consistent. As for the training set, the images are converted to gray-scale and cropped to align with those in the Extended Yale B. Fig. 8 shows the optimized results, randomly selected from the five subsets.
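As a simplified stand-in for the CLAHE step, global histogram equalization of an 8-bit gray-scale image can be sketched as below; CLAHE additionally works on local tiles and clips the histogram before equalizing, which this sketch omits.

```python
import numpy as np

def hist_equalize(gray):
    """Global histogram equalization of an 8-bit grayscale image:
    remap intensities through the normalized cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    lut = np.round(255 * cdf).astype(np.uint8)         # intensity lookup table
    return lut[gray]
```

In practice a library routine such as OpenCV's CLAHE implementation would be used instead; the point here is only the distribution-matching role of the preprocessing.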
According to Tab. 4, our recognition rates are comparable to those of existing state-of-the-art algorithms. One possible reason they do not surpass them is that images in this subset contain complicated local changes. For example, the shadow in the vicinity of the nose and eyes often causes sudden intensity changes, leading to disturbance in the HF. When the DCT fusion is conducted, this disturbance is re-introduced into the results, so the gap between the normalized images and the original ones widens and the recognition rate drops. Another reason is the disparity between the synthetic training samples and the realistic images: the Cook-Torrance reflectance model [36] cannot fully simulate complicated illumination conditions, so the generalization power of the network is limited. To sum up, although our proposed algorithm has some flaws, it achieves results comparable to the state-of-the-art algorithms, demonstrating its effectiveness.

D. EXPERIMENTS ON THE CAS-PEAL DATASET
The lighting subset of the CAS-PEAL-R1 dataset contains 2450 images of 233 subjects taken under more than 9 different illumination conditions [28]; the number of images per identity varies. Fig. 9 shows normalized results of our method, randomly selected and organized according to the azimuth of their lighting sources. The parameter α is set to 0.1.
Our results are high in quality and handle illumination changes well. It is inevitable, however, that some of the images contain high-frequency noise, because the noise is present in the original images: when the HF of the original images is combined with the LF of the CAE outputs, the high-frequency noise is re-introduced into the result, since the aim of our method is to recover as many details as possible.
To prove the effectiveness of our method for face recognition, features extracted from the normalized images by the pre-trained VGGFace2 model are used for classification. For every identity, images under a frontal lighting source are taken as the gallery set and the rest are used as the probe set. Altogether, 1973 faces of 188 identities are detected with MTCNN and resized to 160 × 160. Several learning-based methods [15], [39], [40] are compared with our algorithm and the recognition results are shown in Tab. 5. As shown there, our method exceeds the others in recognition rate, verifying its effectiveness. Besides, the gap between the recognition results on the enhanced images and on the raw CAE outputs indicates that the proposed detail enhancement method improves the quality of the network outputs and is very helpful for face recognition.
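The gallery/probe evaluation above amounts to a rank-1 identification rate over cosine similarity of feature embeddings; a minimal sketch follows, with `rank1_accuracy` a hypothetical helper name and the embeddings (e.g. VGGFace2 features) assumed to be extracted upstream.

```python
import numpy as np

def rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids):
    """Rank-1 identification rate: each probe is matched to the gallery
    feature with the highest cosine similarity; a match counts when the
    identities agree."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    nearest = np.argmax(p @ g.T, axis=1)   # best gallery index per probe
    hits = np.asarray(gallery_ids)[nearest] == np.asarray(probe_ids)
    return hits.mean()
```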

V. DISCUSSION AND CONCLUSION
In this paper, we propose an illumination normalization approach based on CAE and DCT fusion. The CAE is used to obtain generally normalized results, while the DCT fusion compensates for its deficiency in image quality. In addition, to decide the boundary between HF and LF, an iterative strategy based on frequency analysis is proposed.
Theoretically, our method is concise in structure and requires less computational cost, because the CAE does not involve complex losses or auxiliary networks, which simplifies the training process. The re-blurring strategy is also easy to implement.
More importantly, the framework proposed in this paper, which integrates a modern network with frequency analysis, is universal and robust for high-fidelity illumination normalization. The CAE, which conducts efficient yet imperfect normalization, and some of the tools in the detail-recovery process can be replaced while obtaining similar results. For example, the CAE can be replaced by GANs or CNNs that realize a similar function, and the Gaussian filter can be substituted by other kinds of low-pass filters. Moreover, the idea of incorporating the HF and LF of the original image and the generated image can be further explored for high-fidelity sample generation.
Experiments on the AR, Extended Yale B, and CAS-PEAL datasets further demonstrate the effectiveness and generalization ability of the proposed method. The result images from these databases show that our method achieves good visual performance for both RGB and gray-scale images. Besides, quantitative results on the AR, Extended Yale B, and CAS-PEAL databases indicate that normalizing face images under various illumination conditions with our algorithm is beneficial for face verification and recognition tasks.
However, there is still room for further improvement. The main problem is that the indicator only provides a rough boundary between HF and LF, so the parameter α is introduced to decide the precise boundary. To enhance the robustness of the algorithm, a strategy that adaptively computes α should be devised. Additionally, our method cannot handle extreme local variation because it treats every region the same; integrating 3D facial data is expected to mitigate this problem. Besides, the detail restoration method will inevitably re-introduce the noise of the original images, since it combines all the information in the HF into the results.
Overall, as a pre-processing method, the proposed one achieves good results and is beneficial for recognition tasks.
CHUNLU LI is currently pursuing the Ph.D. degree with the School of Automation, Southeast University, China. She is supervised by Prof. Da and has been studying face recognition and 3D face reconstruction.
FEIPENG DA received the Ph.D. degree in 1998. He is currently a Professor with the School of Automation, Southeast University. He has published an academic monograph and authored or coauthored over 150 high-quality articles, more than 100 of which are indexed by SCI, EI, and ISTP. He has 40 authorized invention patents, one authorized patent for utility models, four software copyrights, and three international invention patents (PCT applied). He also serves as a reviewer for journals from different areas, such as Optics Express, Optics Letters, Optics and Lasers in Engineering, the IEEE TRANSACTIONS ON NEURAL NETWORKS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II, Physics Letters A, Neural Networks, and Pattern Recognition.
CHENXING WANG received the Ph.D. degree in 2013. She was appointed as a Research Fellow at the Multi-platform Game Innovation Center, Nanyang Technological University, Singapore, in 2014. She is currently an Associate Professor with the School of Automation, Southeast University, China. She also serves as a reviewer for many leading journals, such as Optics Express, Optics Letters, Applied Optics, Optics and Lasers in Engineering, IEEE ACCESS, and IEEE SIGNAL PROCESSING LETTERS. She is a member of the Society of Photo-Optical Instrumentation Engineers (SPIE).