Purifying Low-Light Images via Near-Infrared Enlightened Image

Cameras usually produce low-quality images under low-light conditions. Though many methods have been proposed to enhance the visibility of low-light images, they are mainly designed for illumination correction and are less capable of suppressing artifacts. In this paper, we propose to enhance visibility and suppress artifacts by purifying low-light images under the guidance of a NIR enlightened image, captured using near-infrared light as compensation. Specifically, we introduce a disentanglement framework to disentangle the structure and color components from the NIR enlightened and RGB images, respectively. Correspondingly, we introduce a new dataset with RGB and NIR enlightened images for training and evaluation purposes. The experimental results show that our proposed method achieves promising results.


I. INTRODUCTION
THE wide usage of camera sensors has made photography a ubiquitous part of the human experience. However, due to the size limitation of some devices (e.g., mobile phones and surveillance cameras), the aperture size built into these devices is restricted, which limits the amount of light received by camera sensors and leads to artifacts. As a result, most commercial cameras can only produce low-light images dominated by noise and artifacts for low-light scenes (Fig. 1). Thus, purifying low-light images to produce an image with high visibility and fewer artifacts becomes a meaningful task.

Fig. 1. Given a low-light image captured under the visible mode and the near-infrared enlightened image captured under the grayscale/night-vision mode, both by a surveillance camera (the region labeled by the red box), our method enhances the low-light image with better visibility and quality than ZeroDCE [1].
By assuming the reflectance component to be the well-exposed image, most Retinex-based methods [7], [8] are already able to correct the illumination well using different learning strategies. However, the reflectance component is far from a well-exposed image for most off-the-shelf devices. For example, since surveillance images always need compression before being uploaded to the cloud or downloaded to mobile devices, low-light surveillance images largely suffer from compression artifacts, which cannot faithfully preserve the structure information. Without considering the various degradations during the image formation or transmission process, existing image enhancement methods (e.g., ZeroDCE [1] in Fig. 1) cannot get clean results and may even amplify artifacts during the brightness correction process. Supervised low-light image enhancement methods can suppress the artifacts [50]. However, the strict requirement for paired ground truth limits their practicability in some changing environments. Though unpaired and unsupervised methods like CycleGAN [28], EnlightenGAN [29], or DPE [12] can alleviate the strict requirement for training pairs in supervised methods, the learning-based strategy alone, without physical guidance, is not able to suppress artifacts [30].
Instead of solely relying on paired ground truth with normal brightness or on the learning strategy, we propose to utilize the NIR enlightened (NIRE) image, with information from both the visible and NIR bands, as guidance. Due to its invisibility to human eyes and its effectiveness in enlightening the environment, near-infrared light has been utilized by different devices to compensate for visible light. With more light in the near-infrared spectrum, the NIRE image effectively suppresses artifacts and provides reliable guidance for the whole image enhancement process [32]. For example, recently proposed methods [33], [53] recover the information of the visible band by extracting it from the NIRE image. However, since the visible and NIR information is highly mixed during the formation process, it may be difficult to accurately obtain the color information of the visible band in some cases.
Instead of extracting information of the visible band from NIRE images like previous methods [33], [53], we propose to purify low-light images in a weakly-supervised manner via the disentanglement of color, artifact, and structure components from low-light images and NIRE images, respectively. Since the latent factors related to color, structure, and artifact can be highly entangled and mixed in real-world examples, the learned representations are prone to mistakenly preserving the confounding of these factors [31], leading to color and structure inconsistency in the estimated images. We further use self-compensation constraints to avoid interference from highly entangled latent factors and achieve more accurate color and structure preservation. Besides, without using any specifically designed cameras or settings, we obtain the low-light and NIRE images using commercial off-the-shelf devices to explore the influence of practical scenarios.
Our whole framework is shown in Fig. 2. Since low-light conditions hide the color information, instead of directly building the self-compensation between the estimated image (the target image in Fig. 2) and the low-light image, we build the self-compensation loss on the color-extracted image, whose color information has been exposed from low-light conditions, to better preserve color consistency. Then, at the artifact purification stage, we further assume that a color-extracted image consists of an artifact component and a color component, while the NIRE image contains a structure component. As shown in Fig. 2, by employing encoders for color (E_C), artifact (E_A), and structure (E_S), we disentangle these three components from the corresponding images and then achieve image purification via the fusion of the color and structure components. At last, we propose a dataset with NIRE and low-light images for evaluation and training purposes.
Our major contributions can be concluded as follows:
- A disentanglement framework to purify low-light images with the NIRE image as guidance.
- A self-compensation loss with the color extraction module to mutually complement the NIRE and visible images for artifact suppression and color consistency.
- A hybrid dataset with images from the visible domain and NIRE domain for training and evaluation purposes.

II. RELATED WORK

A. Low-Light Image Enhancement
The enhancement of underexposed images has been studied for more than a decade. Guo et al. [7] developed a structure-aware smoothing model to estimate the illumination map. Wang et al. [23] utilized deep networks to estimate the illumination map more effectively. Recently, Wei et al. [8] combined the classical Retinex theory with deep learning techniques to enhance low-light images effectively. Chen et al. [12] proposed to correct the brightness using unpaired learning. A recent method [29] proposed an unsupervised approach based on the classical CycleGAN model [28]. Inspired by traditional non-learning-based methods, some unsupervised low-light image enhancement methods have been proposed recently to solve this problem. For example, the method proposed by [35] enhanced the visibility of night scenes by leveraging the bright channel prior. Li et al. [1] corrected the illumination by iteratively updating the illumination map. The above methods are effective in correcting image illumination but less capable of suppressing artifacts. Recently, some methods have also been proposed to suppress image artifacts during the low-light image enhancement process. For example, Chen et al. [10] proposed to operate directly on raw sensor data and replace much of the traditional image processing pipeline. The methods proposed in [27] and [9] utilized a restoration module and an adjustment module to suppress artifacts and correct the illumination simultaneously. Lore et al. [34] proposed a method based on the denoising autoencoder to perform denoising and low-light image enhancement jointly. Recently, Yang et al. [50] also proposed a semi-supervised method to handle the artifacts that exist in low-light images.

B. Near-Infrared Guided Computational Photography
Near-infrared information has been widely employed in different computational photography tasks. For example, Krishnan and Fergus [36] proposed to use gradients in both the NIR and ultraviolet (UV) bands to improve the performance of visible image denoising. Zhuo et al. [37] applied the weighted least squares smoothing method to the visible band and transferred details from the NIR band. The method proposed in [32] used joint bilateral filtering to decompose the visible image into a large-scale image and a detail image; the detail image is then restored and recombined with the large-scale image to get the final result. Shen et al. [38] proposed a cross-field method for image restoration based on both the visible and NIR information. Wang et al. [17] introduced a NIR image-guided deep network for color image denoising. Lyu [33] proposed to extract visible information from mixed multi-spectrum images to restore image contents. Besides, Li et al. [15] introduced an algorithm to fuse visible and near-infrared images by considering their different reflection and scattering characteristics. Using an image sequence as the input, Wu et al. [53] introduced a multi-task deep network with state-synchronization modules to better utilize texture and chrominance information for this problem. Recently, Duan et al. [16] utilized multi-scale edge-preserving decomposition and multiple saliency features for infrared and visible image fusion.
Besides image restoration, near-infrared information has also been adopted in other image processing tasks. For example, the method proposed by [39] solves the image intrinsic decomposition problem under the guidance of NIR images. Most of the above methods consider illumination correction or artifact suppression only. We propose to correct the image illumination and suppress artifacts simultaneously under the guidance of NIRE images.

C. Perceptual Quality Assessment
How to evaluate the quality of images after enhancement is a pivotal issue for low-light image enhancement, and indeed for almost all image restoration tasks. Currently, most methods mainly rely on PSNR and SSIM for evaluation. The development of deep learning has also introduced some novel error metrics based on deep learning features (e.g., LPIPS [59]). However, all those error metrics rely on ground truth/reference images. In some situations, the lack of reference images makes the evaluation difficult. A new error metric was proposed to address such difficulties for low-light image enhancement by comparing low-light images with their enhanced counterparts [55], which makes substantial progress in addressing no-reference evaluation for low-light image enhancement. Moreover, they further introduced an important IQA framework specifically for low-light image enhancement problems, which sets a standard for subsequent IQA frameworks in this area [54]. Besides those pivotal error metrics specifically designed for low-light image enhancement, several methods have been developed in the past several years to provide more robust perceptual quality assessment. Pioneers in perceptual quality assessment also proposed to use free energy for quality assessment [60]. Besides error metrics for images, some methods also proposed blind quality evaluators for UGC videos [61], [62]. Recently, more methods have been introduced to evaluate the quality of audio-visual signals [63]. More detailed surveys can be found in the following papers [56], [57].

III. DATASET COLLECTION
Since we use a data-driven approach to solve this problem, an appropriate dataset becomes necessary to learn the purifying process. Previous low-light image enhancement methods proposed to capture an image set with the low-light image I and its corresponding ground truth R under low/normal-light conditions (e.g., the LOL dataset [8]), respectively. This is a reasonable way to obtain pairs of normal/low-light images by adjusting the amount of light received by camera sensors. However, since a ground truth image with normal light is not available for training in our problem, the low/normal-light image pair (I, R) is not applicable for our purpose.
We instead introduce the low-light image set (I, N) and the non-paired reference image set (I_f, N_f) for training, where I and I_f denote the low-light images to be enhanced and the non-paired reference images, respectively, and N and N_f denote the corresponding NIRE images. We use a two-step setting to capture images. The low-light image I is first captured under low-light conditions, similar to the previous dataset [8]. Then, its NIRE image N is captured by turning on the near-infrared light emitter and switching the shooting mode to night-vision mode. This two-step process can be easily implemented for cameras with an internal or external near-infrared emitter. For the unpaired reference image sets (I_f, N_f), we first capture I_f under normal-light conditions using the visible mode and then capture N_f using the same mode as for N.
We capture images using off-the-shelf surveillance cameras (Wyze cam V2 and Anker Eufy Indoor Cam 2K) in the wild. Then, to investigate the performance in a more controlled scenario, we use a digital camera with a night vision function (Ordro V12) to capture images in different scenarios. We could not find specific details about the NIR wavelength of the three cameras, while most IR illuminators employ 850 nm in their settings [4]. Based on our experiments, the NIR wavelength of the Wyze cam V2, Anker Eufy Indoor Cam 2K, and Ordro V12 is less than 900 nm. In general, the spectral sensitivity of the three cameras with silicon sensors used in our experiments is from 300 nm to 950 nm [5].
The images captured by the two devices have distinct properties. As shown in Fig. 4, the images captured by the digital camera are mainly corrupted by thermal noise, and the structure information is relatively well preserved. However, since surveillance cameras usually compress images before transferring them to the cloud and their aperture size is limited, compression artifacts corrupt the images captured by surveillance cameras, and the structure information is usually lost during the transmission process. Such differences increase the diversity of our dataset and pose unique challenges. We manually make the near-infrared light distribution uniform to avoid the bright-spot phenomenon.

A. Training Dataset
As discussed above, the images captured by the two devices have distinct properties, which increases the diversity of the training data. By flipping, rotating, and cropping the captured images, our training dataset contains 1200 low-light image sets (I, N) and 1200 reference image sets (I_f, N_f), all from the real world.
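The augmentation step above has one practical constraint: every random flip or rotation must be applied identically to a low-light image and its NIRE counterpart so the pair stays pixel-aligned. A minimal sketch (the function and parameter names are ours, not from the paper):

```python
import numpy as np

def augment_pair(img, nir, rng):
    """Apply an identical random rotation/flip to a low-light RGB image
    (H, W, 3) and its NIRE counterpart (H, W) to keep them aligned."""
    k = int(rng.integers(0, 4))          # 0-3 quarter turns
    img = np.rot90(img, k, axes=(0, 1))
    nir = np.rot90(nir, k, axes=(0, 1))
    if rng.random() < 0.5:               # horizontal flip, applied to both
        img = img[:, ::-1]
        nir = nir[:, ::-1]
    return img, nir
```

Cropping would follow the same pattern: sample one crop window and slice both images with it.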

B. Evaluation Dataset
To evaluate the performance of the proposed method, besides the two steps used to capture the training dataset, as shown in the rightmost part of Fig. 3, we further increase the light intensity (e.g., by turning on the light) to obtain the corresponding reference image with normal light for evaluation. Our evaluation dataset contains 100 image triplets with 300 images in total. Among them, 60 triplets are from the controlled scenes, and 40 triplets are from the surveillance scenes. The surveillance cameras are all controlled remotely, as in their real working conditions, which also avoids potential misalignment between the low-light images and NIRE images.

C. Structure Vanishing Problem
Since some real-world materials exhibit slightly different reflective properties under the NIR and visible bands [39], NIRE images may face the structure vanishing problem [6], where some materials become invisible under the NIR band. Since some visible light is still captured by the camera for our NIRE images, the structure vanishing problem is not widely observed in our dataset. However, as shown in Fig. 4, if the NIR light dominates the spectrum received by the camera under highly dark situations, the structure vanishing problem may affect our NIRE images. We propose a self-compensation loss to address this issue in Section IV-B.

IV. PROPOSED METHOD
In this section, we describe the design methodology of the proposed method and the implementation details. As shown in Fig. 2, under the guidance of the NIRE image, we first extract the color information by correcting the image illumination in an unsupervised way and then employ a disentanglement framework with the self-compensation loss for image purification.

A. Color Extraction
Due to the setting of our approach, we directly leverage previous unsupervised methods [1], [7] to build the color extraction module by correcting the image illumination. A classical way to correct the image illumination without ground truth is to solve the following formulation [7]:

\min_{L} \|L - \hat{L}\|_F^2 + P(L),   (1)

where L denotes the corrected illumination map, \hat{L} denotes the illumination map initially extracted from the low-light image I, and P(L) denotes the regularization prior on L. In general, (1) can be solved in an iterative manner [40], [41], and the initial estimation of the illumination map can be approximated by the following equation [7]:

\hat{L}(x) = \max_{c \in \{R, G, B\}} I^{c}(x).   (2)

Inspired by the formulations in (1) and (2) and the iterative scheme used in [1], we train an extraction module F to learn the mapping from the input image to its corresponding illumination map. Then, similar to the iterative optimization strategy for (1), by unfolding F for T times, the illumination map can be iteratively updated as L_t = F(I_{t-1}), where I_{t-1} denotes the color-extracted image obtained at the (t-1)-th iteration and L_t denotes the illumination map obtained at the t-th iteration. As shown in Fig. 2, the extraction module can be divided into three layers: 1) the feature estimation layer f_est to initially extract the illumination-related features; 2) an illumination output layer f_out to generate the corrected illumination map; 3) an extraction layer f_extract to extract the color information based on the corrected illumination map, which can be calculated as follows:

z_t = f_est(I_{t-1}),  L_t = f_out(z_t),  I_t = f_extract(I_{t-1}, L_t).   (3)

In (3), z_t denotes the illumination features extracted at the t-th iteration, and L_t denotes the illumination map obtained at the t-th iteration. As shown in Fig. 2, the parameters of f_est and f_out are shared across each stage. We employ U-Net [42] with BatchNorm [44] as the backbone for the illumination output layer f_out. For f_extract in (3), we directly employ the pointwise multiplication used in [45] to get the color-extracted image.
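The unfolded scheme above can be illustrated with a hand-crafted stand-in for the learned modules: here the illumination map is simply the per-pixel channel maximum (the Eq. (2)-style initialization), and the extraction step is a pointwise division. Both choices are our simplifying assumptions; the actual method learns f_est and f_out instead.

```python
import numpy as np

def init_illumination(img):
    # Eq. (2)-style initialization: per-pixel maximum over RGB channels.
    return img.max(axis=-1, keepdims=True)

def color_extract(img, T=4, eps=1e-3):
    """Toy stand-in for the unfolded extraction module: the learned
    f_est/f_out are replaced by the max-channel heuristic, and
    f_extract by a pointwise division (our assumption)."""
    out = img.astype(np.float64)
    for _ in range(T):
        L = np.clip(init_illumination(out), eps, 1.0)
        out = np.clip(out / L, 0.0, 1.0)   # re-expose under the estimated map
    return out
```

Each iteration brightens the image under the current illumination estimate, mirroring the progressive refinement shown in Fig. 8.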

TABLE I
THE LOWER LPIPS VALUE INDICATES BETTER PERFORMANCE (↓). THE HIGHER VALUES INDICATE BETTER PERFORMANCE FOR THE OTHER THREE ERROR METRICS (↑)

TABLE II
QUANTITATIVE EVALUATIONS FOR THE MODEL WITHOUT THE COLOR EXTRACTION MODULE (CEM), THE MODEL WITHOUT THE SELF-COMPENSATION LOSS FOR COLOR (CCL), AND THE MODEL WITHOUT THE SELF-COMPENSATION LOSS FOR STRUCTURE (SCL)
Fig. 8. The progressive refinement stage with iteration number T equal to 1, 4 (our setting), and 7, respectively.
Specifically, following (2), since the illumination features are more related to pixels with larger values [22], instead of the ReLU activation layer used by many methods [28], we embed the maxout network into the feature estimation layer f_est for non-linear mapping as

F_i(x) = \max_{j} g_{i,j}(x),   (4)

where g(x) denotes the affine feature transformation, F_i(x) denotes the features after the maxout mapping, and i and j denote feature positions. For the estimated illumination map at each iteration, we employ the total variation loss to preserve the monotonicity relations between neighboring pixels as follows:

L_{tv} = \sum_{i,j} \left( |\nabla_h L(i,j)| + |\nabla_v L(i,j)| \right),   (5)

where i and j denote the pixel positions and ∇ represents the gradient operations.
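The two pieces above can be read concretely: maxout keeps the maximum over a group of affine responses, and the total-variation term sums absolute neighbor differences of the illumination map. A sketch with our own discretization of the gradients:

```python
import numpy as np

def maxout(responses):
    # responses: (j, i) array of affine features g_j(x); maxout keeps,
    # per position i, the maximum response over the group axis j.
    return responses.max(axis=0)

def tv_loss(L):
    # Total-variation smoothness on the illumination map: sum of
    # absolute horizontal and vertical differences between neighbors.
    dh = np.abs(np.diff(L, axis=1)).sum()
    dv = np.abs(np.diff(L, axis=0)).sum()
    return float(dh + dv)
```

A perfectly flat illumination map incurs zero TV loss, so the term only penalizes non-smooth estimates.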
We further adopt the color constancy loss proposed in [1] to correct potential color deviations in the estimated image and build relations among the three corrected channels as follows:

L_{col} = \sum_{(p,q) \in \varepsilon} \left( \bar{I}^p - \bar{I}^q \right)^2, \quad \varepsilon = \{(R,G), (R,B), (G,B)\},   (6)

where \bar{I}^p denotes the average intensity value of channel p in the corrected image and (p, q) represents a pair of channels.
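The color constancy term follows the gray-world assumption used by ZeroDCE [1]: the mean intensities of the three corrected channels should stay close to one another. A minimal sketch:

```python
import numpy as np
from itertools import combinations

def color_constancy_loss(img):
    # Penalize squared differences between the per-channel mean
    # intensities over all channel pairs (R,G), (R,B), (G,B).
    means = img.reshape(-1, 3).mean(axis=0)
    return float(sum((means[p] - means[q]) ** 2
                     for p, q in combinations(range(3), 2)))
```

A gray image incurs zero loss; any global color cast (e.g., a reddish tint) is penalized.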
By combining the terms in (5) and (6), the loss function for the illumination correction stage can be concluded as follows:

L_{stage1} = \alpha_c L_{tv} + \beta_c L_{col},   (7)

where α_c and β_c are the weighting coefficients to balance the two terms. An example of a color-extracted image is shown in Fig. 5, where the color information has been extracted from low-light conditions.

B. Artifact Purification
As shown in Fig. 5, the color-extracted image generated at the first stage still contains noise and artifacts. Due to the lack of ground truth for training, we utilize the NIRE image as guidance to purify image artifacts. Because of the light compensation from the near-infrared band, the monochrome NIRE image has fewer artifacts and better preserves the information related to structure and shape [39]. Instead of separating the visible and NIR information mixed in the NIRE image like previous methods [33], we only extract the artifact-free information related to structure and shape. If the structure information can be well extracted from the NIRE image, it may contribute to the artifact purification process by being combined with the color information from the color-extracted image. Based on this assumption, we propose using the disentanglement framework to obtain the artifact-free image by disentangling the color and structure components from the color-extracted image and its NIRE image, respectively. For simplicity, the color-extracted image obtained in Section IV-A is denoted as the artifact-remained image I_d in this section.
1) Forward Translation: As shown in Figs. 2 and 6, the branch for the artifact-remained image I_d contains a pair of encoders {E_C : I_d → z^C_d, E_A : I_d → z^A_d} to encode the artifact-remained image I_d into the color space C and artifact space A, respectively, and its corresponding structure encoder {E_S : N_d → z^S_d} to disentangle the structure information from the NIRE image. If the disentanglement is well addressed, the encoded color component should contain no information related to the artifact while preserving the color information, and the encoded structure space should also only contain the structure information of the image. Then, the decoder G_F can reconstruct a clean image I_u conditioned only on the color and structure components as I_u = G_F(z^C_d, z^S_d). Correspondingly, as shown in Figs. 2 and 6, the branch for the reference image I_f contains a pair of artifact-free image encoders, where I_u ∈ I_f denotes the target clean image and I_v ∈ I_d denotes the estimated artifact-remained image.
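The encode-fuse step can be illustrated with toy linear encoders: a color code comes from the artifact-remained image, a structure code from the NIRE image, and a decoder maps the concatenated codes back to image space. Everything here (dimensions, random weights, linearity) is an illustrative stand-in, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 4                           # toy image / code dimensions

E_C = rng.standard_normal((K, D))      # color encoder (acts on I_d)
E_S = rng.standard_normal((K, D))      # structure encoder (acts on N_d)
G_F = rng.standard_normal((D, 2 * K))  # decoder over the fused codes [z_C; z_S]

def purify(i_d, n_d):
    z_c = E_C @ i_d                    # color component from I_d
    z_s = E_S @ n_d                    # structure component from N_d
    return G_F @ np.concatenate([z_c, z_s])
```

The key design point survives the simplification: the output is conditioned only on the color and structure codes, so any artifact code z_A is excluded from the reconstruction path of the clean image.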
2) Backward Translation: We then encode I_u and I_v into {z^C_u} and {z^C_v} and perform the second translation, where Î_f and Î_d denote the reconstructed I_f and I_d, respectively. Specifically, the artifact component z^A_d is disentangled from the artifact-remained image and shared between the forward and backward translations to ensure that the generated artifact-remained images I_v and Î_d have consistent artifacts.
After the two translation stages, the cycle-consistency loss for this stage can be represented as

L_{cc} = \|\hat{I}_f - I_f\|_1 + \|\hat{I}_d - I_d\|_1.   (10)

As shown in Fig. 6, the color encoder and the structure encoder share a similar network architecture, just with different skip connections to the decoder network, which helps the decoder network better preserve the color and structure information.
3) Self-Compensation Loss: Though the NIRE image provides more reliable guidance for the whole enhancement process, it may cause a negative influence on the final estimated results. The first problem is the color shift caused by NIRE images. Since the latent factors related to structure, color, and artifact information are highly entangled in some cases, the residual grayscale information from the NIRE image may interfere with the color information from the artifact-remained image, leading to a color shift in the final estimated results. Besides, some essential structures may also not be preserved in the NIRE image due to the structure vanishing problem, which degrades the structure consistency of the final result.
We propose a self-compensation loss for color to complement the disentanglement framework by considering color and structure consistency. Our self-compensation loss for color penalizes the errors between the target image I_u and the color-extracted image I_d as follows:

L_{color} = \sum_{i} \rho\left( I_u(i) - I_d(i) \right),   (11)

where I_u(i) and I_d(i) denote the pixel values at position i, and ρ is the robust function used to reject part of the noise from I_d [3]. Equation (11) requires the estimated result I_u not to wildly deviate from the color-extracted image I_d [3] and also stabilizes the consistency in the color space. We set α = 0.8 for ρ empirically in our experiments. We introduce a self-compensation loss for structure to handle the structure vanishing problem by measuring the salient edge differences between the estimated artifact-free image I_u and its corresponding low-light image I. As shown in Fig. 7, though the low-light image I is corrupted by artifacts, its salient edges are still consistent with those of its corresponding artifact-free version. Since the image gradient may be enlarged during the purification process, instead of measuring their pixel-wise difference, we propose to maximize their correlation by minimizing a normalized correlation loss (13), where λ_u and λ_I are normalization factors, ‖·‖_F is the Frobenius norm, ⊙ denotes element-wise multiplication, and n is the image downsampling factor. Since the gradient information presents different properties under different scales [20], we set n = 3 to downsample the gradient map 3 times. The normalization factors are set similarly to the settings in [18]. Besides, we also impose the adversarial loss L_adv [49] on the estimated clean image I_u and the artifact-remained image I_v to make them similar to real images. Our discriminator network takes an input image of size 224 × 288 and has 6 strided convolutional layers, each followed by the ReLU activation function.
In the last layer, we use the sigmoid function to generate the final result.
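The two self-compensation terms above can be sketched as follows. The exact robust function ρ is not spelled out here, so we assume a generalized-Charbonnier form with α = 0.8, and the structure term is written as a negative normalized gradient correlation at a single scale (the paper additionally downsamples the gradient maps n = 3 times); both forms are our assumptions.

```python
import numpy as np

def rho(x, alpha=0.8, eps=1e-6):
    # Robust penalty; the generalized-Charbonnier form is an assumption.
    return (x * x + eps) ** alpha

def color_compensation_loss(i_u, i_d):
    # Penalize per-pixel deviation of the estimate I_u from the
    # color-extracted image I_d under the robust function rho.
    return float(rho(i_u - i_d).mean())

def gradient_map(img):
    # Simple forward-difference gradient magnitude.
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return np.hypot(gx, gy)

def structure_compensation_loss(i_u, i_low):
    # Maximize the correlation between the normalized gradient maps of
    # the estimate and the low-light input by minimizing its negative.
    gu, gl = gradient_map(i_u), gradient_map(i_low)
    corr = (gu * gl).sum() / (np.linalg.norm(gu) * np.linalg.norm(gl) + 1e-8)
    return -float(corr)
```

Because the structure term is a correlation rather than a pixel-wise difference, it tolerates the gradient magnification that the purification process may introduce.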
By combining the loss functions in (10)-(13) with the adversarial loss, the loss function for the second stage can be concluded as

L_{stage2} = \alpha_e L_{cc} + \omega_e L_{color} + \gamma_e L_{struct} + \delta_e L_{adv},   (14)

where α_e, ω_e, γ_e, and δ_e are the weighting coefficients and L_adv denotes the adversarial losses for I_v and I_u.

C. Implementation Details
We implemented our method using PyTorch. The whole training process of our network can be divided into two stages. In the first stage, we train the illumination correction network for five epochs. In the second stage, we connect the illumination correction network with the suppression network and then train the whole network to convergence. The learning rates for the first and second stages are both set to 1 × 10^-2. The weighting coefficients in (7) and (14) are empirically set as: α_c = 2, β_c = 1, λ_c = 0.005, α_e = 10, δ_e = 1, ω_e = 1, and γ_e = 0.5. In the first stage, the iteration number T is set to 4. From the results shown in Fig. 8, since our first stage is a progressive refinement process, this iteration number is set empirically to guarantee effective color extraction.

V. EXPERIMENTS
Due to the lack of ground truth for training, we choose several weakly-supervised low-light image enhancement methods as baselines for comparison: LIME [7], EnlightenGAN [29], and ZeroDCE++ [21]. Besides, we also compare with RetinexNet [8], KinD++ [9], and DRPB [50], three supervised methods. For the supervised methods, we employ the training samples with degradations and artifacts from VE-LoL [43] to finetune them, which improves their capability to handle the artifacts in our dataset. We also compare with CycleGAN [28] to investigate its performance. CycleGAN [28], EnlightenGAN [29], and ZeroDCE++ [21] are all trained on our dataset by only using the low-light images as input. We also compare with ScaleMap [3], a method specifically designed for NIR/RGB fusion tasks, to evaluate the effectiveness of our proposed method. Besides, to better evaluate the effectiveness of NIRE images within a relatively simple framework, we train an additional CycleGAN model by considering the NIRE image as another guidance. Since CycleGAN [28] does not have any branches for NIR feature extraction or embedding, we directly concatenate the NIRE image and the low-light image into a 4-channel tensor as the input for CycleGAN [28]; the input channel number of CycleGAN [28] is also changed to 4. The comparisons related to this part can be found in Table I and Section V-B.
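The NIRE-augmented CycleGAN baseline described above is a plain channel concatenation; the 224 × 288 size used here matches the discriminator input mentioned in Section IV-B:

```python
import numpy as np

# Low-light RGB image and its single-channel NIRE counterpart.
rgb = np.zeros((224, 288, 3), dtype=np.float32)
nir = np.zeros((224, 288), dtype=np.float32)

# Concatenate into the 4-channel tensor fed to CycleGAN
# (whose input channel number is changed from 3 to 4).
x = np.concatenate([rgb, nir[..., None]], axis=-1)
```

This keeps the baseline architecture unchanged except for its first layer, isolating the contribution of the NIRE guidance itself.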
Besides the low-light image enhancement methods, we also compare with grayscale/infrared image colorization methods, including CIC16 [52] and IDC17 [58]. Based on their settings, we directly use the NIRE images as their input. Some image restoration methods [32], [33], [37], [38] based on near-infrared information are not involved in the comparisons, since they have different settings or do not release their code.
In addition to the classical PSNR and SSIM error metrics, we adopt LPIPS [59], which measures perceptual image similarity using a pre-trained deep network; lower LPIPS values indicate better performance. We also employ a newly proposed error metric, NLIEE [55], to evaluate the performance. By directly comparing the enhanced results against their low-light counterparts, this method provides a reasonable and convenient way to evaluate the performance. Besides, the recently proposed LIEQA [54] also makes substantial progress in evaluating the performance of low-light image enhancement, which sets a standard for follow-up IQA methods. Though LIEQA [54] is not publicly available, the model used in NLIEE [55] is trained on the database used in LIEQA [54]. We therefore directly use NLIEE [55] for our evaluation.

A. Qualitative Evaluations
The qualitative comparisons are shown in Fig. 9, and more examples can be found in our supplementary material. Our proposed method not only enhances the visibility of the low-light images but also better suppresses the artifacts and preserves the image details. Though KinD++ [9] and DRPB [50] also suppress artifacts effectively, they introduce new artifacts into some results or cause over-smoothing effects (the regions labeled by the red boxes in Fig. 9), which also leads to their lower quantitative values. The other low-light image enhancement methods (e.g., LIME [7], EnlightenGAN [29], and ZeroDCE++ [21]) can correct the illumination but are less capable of artifact suppression, which influences the visual quality of their final results.
We further show the results on surveillance images in Fig. 10. Since the surveillance camera cannot accurately record the color information under low-light conditions, it becomes difficult to faithfully recover the color information; even the results of our method still show color bias (the first example in Fig. 10). Besides, in contrast to the images captured by the digital camera, the surveillance images are mainly corrupted by compression artifacts and blurring effects. Our method still shows a better ability to preserve the structure and color consistency. The results estimated by previous methods (e.g., CycleGAN [28] and DRPB [50]) still show noticeable compression artifacts. Though ScaleMap [3] is proposed for RGB/NIR fusion, it cannot extract the color information like our method. From the results shown in Figs. 9 and 10, the results obtained by ScaleMap [3] still have a dark appearance.
Since the grayscale/infrared image colorization methods (CIC16 [52] and IDC17 [58]) can well preserve the structure consistency by utilizing the NIRE image, they achieve acceptable SSIM values for the examples in Figs. 9 and 10. However, without the color representations disentangled from the color-extracted image, their estimated results show obvious color bias and lower PSNR values.

B. Quantitative Evaluation
The quantitative results in Table I reconfirm the observations in Fig. 9: our method achieves the best scores among all compared methods. The higher SSIM values indicate that our method recovers images with better quality, and the smaller LPIPS values indicate that our proposed method indeed generates images with better perceptual similarity. KinD++ [9] achieves the second-best results; however, due to the new artifacts in its final estimated images, it still cannot outperform our proposed method. Since the other low-light image enhancement methods cannot suppress the artifacts effectively, their error metric values fall behind both our method and KinD++ [9].
The SSIM values of CIC16 [52] and IDC17 [58] are better than those of the other low-light image enhancement methods. However, as discussed before, this is mainly because SSIM only measures structural similarity on grayscale images and cannot fairly reflect the color bias of the final results. The lower PSNR and higher LPIPS values indicate that the image colorization methods cannot accurately estimate the final results.
The results obtained by CycleGAN with NIRE also demonstrate the effectiveness of the NIRE image: as shown in Table II, NIRE images help CycleGAN [28] achieve better results than the original CycleGAN. Besides, from the examples shown in Fig. 13, the result of CycleGAN with the NIRE image (the third column of Fig. 13) better suppresses the artifacts than its counterpart shown in the second column of Fig. 13.
At last, the better NLIEE [54], [55] error metric values also confirm that our method achieves better results than the other methods. By directly comparing the enhanced results against their low-light counterparts, NLIEE provides another reasonable and convenient way to evaluate the performance.

C. Ablation Study
1) Two-Stage vs. One-Stage:
By directly utilizing the low-light image and the NIRE image as the input, a one-stage framework with only the artifact purification module is also a solution to this problem. However, since our method relies on the self-compensation loss to compensate for the color and structure information, the second stage alone cannot effectively correct the image illumination. As shown in Fig. 11, the results without the first stage appear darker than the result obtained by the full model. The quantitative values in Fig. 11 also prove the effectiveness of the two-stage framework.
If we further remove the self-compensation loss, the illumination information can be better recovered. However, the color consistency cannot be accurately preserved without the self-compensation constraints. More details about the effectiveness of the self-compensation loss can be found in Section V-C3.
2) Effectiveness of NIRE: We then remove the structure encoder for the NIRE image. As shown in Fig. 11, without the NIRE encoder, the artifacts cannot be well suppressed in the final results, and the quantitative values in Table II become similar to those obtained by EnlightenGAN [29]. However, this experiment shows that our method still has the ability to improve the illumination of low-light images even without the guidance of NIRE images. When the guidance from NIRE images is removed, our method can be regarded as a regular low-light image enhancement method; when NIRE images are incorporated, our method is expected to achieve better results.
3) Effectiveness of Self-Compensation Loss: We then remove the self-compensation loss to evaluate its effectiveness. We first remove the structure compensation loss in (13) and then the color compensation loss in (11). The examples in Fig. 12 show that, without the structure compensation loss, some regions become invisible in the final estimated results since these details do not exist in the NIRE images. However, since such a texture vanishing problem only affects a small portion of our near-infrared images, the improvement brought by this loss function is not very significant.
From the results shown in Fig. 12, the color compensation loss plays a vital role during the purifying process. Without the color compensation loss, the color information cannot be well embedded into the final estimated results, which degrades the performance of the final results.

D. Analysis of the Structure and Color Encoders
We provide further analysis of the structure (E_S) and color (E_C) encoders. From the gradient responses of the two encoders in Fig. 14, they indeed play different roles during the enhancement process: the gradient response of E_S mainly focuses on the edge regions with more meaningful structure information, while the gradient response of E_C focuses on the global regions.

E. Analysis of the Iteration Number
Our method relies on a color extraction module with progressive refinement to extract color information from low-light images. The iteration number T for this progressive refinement is currently set to 4, and we perform quantitative evaluations to validate this setting. As shown in Table III, when T is set to 4, our method achieves the best error metric values among all tested settings, which supports this choice. The artifact purifying module then disentangles the color and structure information from the color-extracted and NIRE images to suppress the artifacts, and the results show that our method achieves promising results.
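The learned color extraction module is described in Section IV; as an illustrative stand-in only, the effect of iterating a brightening curve T times can be sketched with a simple quadratic update (the curve form and the `alpha` parameter here are assumptions for illustration, not the paper's learned mapping):

```python
import numpy as np

def refine_illumination(image, alpha, iterations=4):
    """Progressively brighten an image in [0, 1] by iterating a quadratic
    curve -- an illustrative stand-in for a learned refinement module."""
    out = image.astype(np.float64)
    for _ in range(iterations):
        # Monotone update that maps [0, 1] into [0, 1] for alpha in (0, 1].
        out = out + alpha * out * (1.0 - out)
    return out

low = np.full((4, 4), 0.1)          # a dark, uniform toy image
bright = refine_illumination(low, alpha=0.8, iterations=4)
```

Each iteration lifts mid-tones while fixing 0 and 1, so a small number of iterations (here T = 4) already produces a substantial brightness gain without clipping.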

A. Limitations
In spite of the promising results, our method still has several limitations. First, due to the energy attenuation of the near-infrared light, areas covered by the near-infrared light may receive nonuniform light intensity, which influences the results of our proposed method (e.g., the example shown in Fig. 15). A more powerful near-infrared light emitter with a more uniform illumination distribution can partially address this problem. Second, since the color gamut of low-light images acquired under extremely dark situations may be distorted, the extracted color information may not be accurate enough for the next stage, which ultimately deteriorates the color consistency between our result and the reference image. We will address these issues in our future work by employing more sophisticated experimental devices and considering a more effective restoration model. Furthermore, due to the difficulty of capturing data with the proposed capturing setup, our dataset does not represent the entire truth of the real world. We will explore different techniques in the future to expand our dataset, such as updating the capturing setup or synthesizing images with better consideration of physical properties.

Fig. 2. The framework of our proposed approach. With the NIRE image as guidance, we adopt the disentanglement framework with the self-compensation loss for low-light image artifact purification. To facilitate the self-compensation, the color extraction module extracts the color information under low-light conditions by iteratively correcting the image illumination. I_d denotes the color-extracted image and I_f denotes the artifact-free image for unpaired training. N_d and N_f denote their corresponding NIRE images, respectively. For the color extraction stage, 1, 2, and 3 denote the feature estimation layer, the illumination output layer, and the extraction layer, respectively. E_S, E_C, and E_A denote the encoders for structure, color, and artifacts, respectively. n^S_d, z^C_d, z^A_d, z^C_u, n^S_f, and z^C_f denote the disentangled latent factors. More details about the network structure and latent factors can be found in Fig. 6 and Section IV-B.

Fig. 3. Examples of the low-light images, the NIRE images, and the reference image only used for evaluation.

Fig. 4. (Left) The low-light images captured by the digital camera (DC) and the surveillance camera (SC). For visualization purposes, we multiply the images in the left column by 5 in the right column. The green box labels the regions with compression artifacts. (Right) Examples of the reference image and the NIRE image with the structure vanishing problem labeled by the green and red boxes.

Fig. 5. Examples of the low-light image, the NIRE image, the reference image, the color-extracted image from the first stage, and the final result estimated by the second stage.

Fig. 6. The branch for {I_d, N_d} and {I_f, N_f}. I_u is the target clean image. Î_d and Î_f are the reconstructed I_d and I_f for the cycle consistency, respectively.

Fig. 7. Examples of the reference image, low-light image, NIRE image, and their corresponding Edge Maps (EM).The red boxes label the regions with the structure vanishing problem and all images are from the evaluation dataset.
F_i(x) generates a new feature map by taking a pixel-wise maximization operation over k affine feature maps. The maxout unit maps each kN-dimensional vector into an N-dimensional one by extracting the values with maximum responses related to the illumination components. Here, we empirically set k = 4.
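The maxout reduction above can be sketched in a few lines of numpy; the channel layout (k consecutive channels per output group) is an assumption for illustration:

```python
import numpy as np

def maxout(features, k):
    """Pixel-wise maxout: reduce a (k*N, H, W) feature map to (N, H, W)
    by taking the maximum over each group of k channels."""
    kn, h, w = features.shape
    assert kn % k == 0, "channel count must be a multiple of k"
    n = kn // k
    # Group consecutive channels, then reduce each group by max.
    return features.reshape(n, k, h, w).max(axis=1)

# 8 input channels with k = 4 give 2 output channels.
x = np.arange(8 * 2 * 2, dtype=np.float64).reshape(8, 2, 2)
y = maxout(x, k=4)
```

Since `x` increases monotonically across channels here, each output channel equals the last channel of its group (channels 3 and 7), which makes the grouping easy to verify.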

Fig. 11. Examples from our model without the color extraction module (CEM), our model without the color compensation loss (CCL), our model without the structure encoder (E_S) for the NIRE image, our complete model, and EnlightenGAN [29].

Fig. 12. Examples from our model without the structure compensation loss (SCL) and the complete model with SCL. The blue boxes denote the regions with the structure vanishing problem in the NIRE image, the reference images, and the results obtained without and with SCL, respectively.

Fig. 14. Examples of (a) low-light images, (b) their corresponding color-extracted images, (c) the gradient response of the structure encoder E_S, and (d) that of the color encoder E_C.