RetinexGAN: Unsupervised Low-Light Enhancement With Two-Layer Convolutional Decomposition Networks

In the field of image enhancement, deep learning methods are currently the mainstream approach to enhancing low-light images. However, these methods often have complex network structures with a large number of parameters, and their training often relies on paired data-sets, which are difficult to obtain in practice. To solve these problems, this paper proposes a simple model that combines a generative adversarial network with the Retinex model, dubbed RetinexGAN, which is trained entirely on unpaired data-sets. It contains a decomposition network and two discriminator networks. To reduce the number of network parameters, only two convolution layers are used in the decomposition network. We also test on more challenging data in which some parts of the image are underexposed while others are normally lit. Both quantitative and visual results show that RetinexGAN is largely superior to state-of-the-art methods.


I. INTRODUCTION
When an image is captured, its quality can be greatly reduced by factors such as insufficient light, the performance of the photography equipment, and the photographer's skill, which lead to low visibility, low contrast, and noise. These problems challenge not only human visual perception but also the accuracy of some computer vision algorithms.
To improve image quality, many hardware devices and software technologies are currently dedicated to clearly presenting the details hidden in low-light images. These methods have broad application prospects. For example, in photography, low-light image enhancement technology can help users shoot high-quality, more attractive images in low-light environments; for people working in low-light environments, it can make their working environment clearer on monitoring equipment to ensure their safety; and in artificial intelligence, especially in the field of computer vision, it can clearly improve target detection and recognition accuracy.
The associate editor coordinating the review of this manuscript and approving it for publication was Ze Ji.
Early low-light enhancement methods were mainly based on histograms or Retinex theory. The method based on histogram equalization only considers pixel-by-pixel mapping and ignores the fact that pixels are affected by their surrounding pixels. The Retinex method first predicts an illumination map and then reconstructs the image through this illumination map. However, estimating the illumination map is a complex mathematical problem, and it is difficult to predict an illumination map under extremely dark conditions with unavoidable noise.
Recent research mainly adopts deep learning methods to learn color, contrast, brightness, and saturation to produce more expressive results. These methods mainly rely on synthetic or captured damaged/clean image pairs for training. However, they still have limitations:
(1) The synthesized paired data-sets are often not realistic enough, so a model trained on them struggles to produce good results on real low-light images. Real paired images are extremely difficult to obtain, and the generalization ability of the trained models is often poor.
(2) A good result requires a complex neural network, which will produce a large number of model parameters during the training process. However, in some mobile devices, to ensure a high device storage performance, a smaller model file is often required.
In this paper, we propose a new learning-based low-light enhancement method. The network we use is simple in structure and fast to train, and it is trained on unpaired data-sets. FIGURE 1 shows the results of using our method to enhance a low-light image. Our contributions include the following:
(1) The decomposition network needs only two convolution layers to decompose the eigenimage. The low-light image is decomposed into a reflection map and an illumination map.
(2) Unpaired images are used in the network for low-light enhancement. This training strategy eliminates the dependence on paired training data so that the model has better generalization and can use more types of images from different fields for training.
(3) The network structure is simple, and the lightweight one-way GAN model greatly reduces the training complexity and training time; it can be trained in no more than 100 epochs. In our work, the size of the trained model is 15KB-150KB. In addition, we can still adjust the parameters at test time to achieve different enhancement results. Therefore, the model does not need to be retrained for different data-sets, and it can be applied to mobile devices with small memory and to flexible applications.
FIGURE 2 demonstrates the effectiveness of the two-layer convolution. Generated Gaussian noise is fed into the pretrained VGG-19 model to extract features. As shown in FIGURE 2, the first to fifth rows use one-layer, two-layer, three-layer, four-layer, and seventeen-layer convolutions. The first to sixth columns correspond to 5 to 30 iterations. Only the first two convolution layers retain color information.
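As a rough plausibility check on the model size quoted in contribution (3), the parameters of the decomposition network can be counted from the channel sizes given in the FIGURE 3 caption (6 to 64, then 64 to 6). The sketch below is only an illustration: the 3x3 kernel size, the presence of bias terms, and float32 storage are assumptions that are not stated in the text, and the helper name is hypothetical.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in one 2-D convolution layer."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


# Channel counts come from the FIGURE 3 caption.
# ASSUMPTION: 3x3 kernels with bias; the paper text does not state the kernel size.
KERNEL_SIZE = 3
layers = [(6, 64), (64, 6)]

total_params = sum(conv2d_param_count(i, o, KERNEL_SIZE) for i, o in layers)

# ASSUMPTION: parameters are stored as float32 (4 bytes each), no file overhead.
size_kib = total_params * 4 / 1024

print(total_params, round(size_kib, 1))
```

Under these assumptions the count is 6,982 parameters, or roughly 27 KiB, which falls inside the 15KB-150KB range reported above. Other kernel sizes would shift the figure accordingly.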

II. RELATED WORK
Research on image enhancement has a long history. Wang et al. [1] compared the low-light image enhancement methods in recent decades and divided them into seven categories. According to the research hotspots in recent years, we introduce four methods: histogram equalization, Retinex, methods based on deep learning, and methods based on GAN.

A. TRADITIONAL METHODS
In the past few decades, many scholars have endeavored to use histogram equalization and methods based on Retinex. Histogram equalization methods attempt to map the entire histogram into a simple mathematical distribution. Different histogram equalization methods differ in the additional priors and constraints they use. There are some newer methods, such as DOTHE (dominant orientation-based texture histogram equalization) [2], EDSHE (entropy-based dynamic subhistogram equalization) [3] and UMHE (unsharp masking with histogram equalization) [4]. These methods mainly enhance the contrast of the image but insufficiently enhance the illumination, which leads to under/over enhancement or color loss.
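To make the idea of mapping the histogram to a simple distribution concrete, the following is a minimal sketch of plain global histogram equalization, which maps intensities so that their cumulative distribution becomes approximately uniform. It is the textbook baseline, not DOTHE, EDSHE, or UMHE, and the function name and the synthetic test image are illustrative assumptions.

```python
import numpy as np


def equalize_histogram(gray):
    """Global histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]              # CDF value at the darkest occupied bin
    denom = max(cdf[-1] - cdf_min, 1)      # guard against constant images
    # Look-up table that stretches the CDF over the full [0, 255] range.
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255).astype(np.uint8)
    return lut[gray]


# Synthetic dark image: all intensities lie in [0, 63].
rng = np.random.default_rng(0)
dark = rng.integers(0, 64, size=(32, 32), dtype=np.uint8)
bright = equalize_histogram(dark)
```

Because each output pixel depends only on its own input intensity, the mapping is purely pixel-by-pixel, which is the limitation noted in the Introduction.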
Methods based on the Retinex theory assume that the image is composed of a reflection map and an illumination map. Some typical methods, such as MSR (multi-scale Retinex) [5] and SSR (single-scale Retinex) [6], use illumination maps for low-light image enhancement. AMSR (adaptive multiscale Retinex) [7] proposes a weighting strategy based on SSR. NPE (naturalness preserved enhancement) [8] balances the enhancement level and the naturalness of an image to avoid excessive enhancement. Fu et al. [9] propose a model named SRIE (simultaneous reflectance and illumination estimation), which can accurately maintain the estimated reflection coefficient and suppress noise to a certain extent. LIME (low-light image enhancement) [10], a structure-aware smoothing model, is designed to estimate the illumination map and uses the reflection map as the final enhancement result. BIMEF (bio-inspired multi-exposure fusion) [11] proposes a double exposure fusion algorithm, and Ying et al. [12] use the camera response model for further enhancement. Wang et al. [13] propose an adaptive low-light image enhancement method: they convert the image into HSV space and then estimate the illumination map of the image. Methods based on the Retinex theory can effectively enhance low-light images. However, the key to these methods is estimating the illumination map, which is mathematically an ill-posed problem. Moreover, the parameters of these methods are handcrafted and rely on careful adjustment.
FIGURE 3. Network structure. x is the low-light image. x_max/min/mean denotes the maximum, minimum, and mean values of the RGB channels in x. x_R is a reflection map. x_I is an illumination map. y is a normal image. y_3max is the maximum value of the RGB channels in y. ⊗ denotes pixel-by-pixel multiplication. ⊕ denotes pixel-by-pixel addition. The decomposition network uses two convolution layers. The first convolution layer has 6 input channels and 64 output channels. The second convolution layer has 64 input channels and 6 output channels.

B. METHODS BASED ON DEEP LEARNING
With the maturity of deep learning technology, it has been increasingly applied to the field of image enhancement, which has led to new developments in low-light image enhancement algorithms. LLNet (low-light net) [14] is the first network to truly apply deep learning to image enhancement. It enhances and denoises low-light images by learning a sparse adaptive encoder. HDRNet (high definition resolution net) [15] predicts the coefficients of a local affine model in bilateral space to enhance low-light images. LLCNN (low-light convolutional neural network) [16], [17] relies on some traditional methods during training and is not an end-to-end solution. MSRNet (multiscale Retinex net) [18] uses different Gaussian convolution kernels to learn the dark-to-bright mapping directly. MBLLEN (multibranch low-light enhancement network) [19] uses a multibranch structure to extract features at different levels, enhance them through multiple subnetworks, and generate output images through multibranch fusion. RetinexNet [20] is the first method to combine Retinex theory with a CNN; it enhances low-light images by estimating and adjusting the illumination map. Similarly, KinD (kindling the darkness) [21] applies Retinex theory to estimate the illumination map, adds a restoration network for noise removal, and then enhances the low-light image. Wang et al. [22] introduce intermediate illumination in their network (DeepUPE) and correlate the input with the expected enhancement result, which enhances the network's ability to learn complex photographic adjustments from paired data-sets modified by experts. Ren et al. [23] propose a deep hybrid network to improve image details and edges and use perceptual and adversarial losses to improve image quality. Zhang et al. [24] are inspired by the theory of information entropy and the retinal model and propose a Retinex model based on maximum entropy, which uses a simple network to separate illumination and reflection to achieve self-supervised learning. Huang et al. [25] use the attention mechanism to predict the illumination map and use the Retinex model to estimate the initial enhanced image. This method uses the image after histogram equalization as the reference image, so the enhanced image has chromatic aberration. However, most of the above methods use paired data-sets for training. Guo et al. [26] propose a lightweight deep network named DCE-Net (deep curve estimation net), which estimates pixel-wise high-order curves to adjust the dynamic range of a given image. The attraction of DCE-Net is that it does not require any paired or unpaired data during training yet achieves better enhancement results. However, the generated image is whitish.

VOLUME 9, 2021

C. METHOD BASED ON GAN
Currently, GANs have achieved excellent results in image-to-image conversion work [27], [28]. In the low-light image enhancement field, Chen et al. [29] regard the image enhancement problem as an image-to-image conversion problem and propose a two-way GAN that uses unpaired images to enhance low-light images. Different from the two-way GAN, EnlightenGAN [30] proposes a lightweight one-way GAN. It uses a UNet with self-feature retention loss and a self-regular attention mechanism, together with two global-local discriminators, to achieve unsupervised low-light image enhancement. Yang et al. [31] propose the DRBN (deep recursive band network) method, which first uses paired low-light/normal-light images to restore a linear band representation of normal-light images and then uses unpaired data with perceptual quality-driven adversarial learning, so that a linear transformation can be learned to recompose the given bands into an improved band representation. However, some areas of the image enhanced by this method become blurred. Moreover, in our tests, its enhancement results on real-world data-sets [32] are not ideal.
However, many current GAN structures are rather complicated, and GANs suffer from troublesome tuning and poor stability. A more complex network structure inevitably increases the difficulty of training. Our goal is to find a simpler network for low-light image enhancement. Unlike EnlightenGAN, our method uses only two convolution layers to decompose the eigenimage, which yields faster training and more pleasing images.

A. RETINEX WITH CORRECTION COEFFICIENT
In the original Retinex theory, an image S is regarded as the product of illumination map I and reflection map R. The formula is:
$$S_{input} = R * I \qquad (1)$$
where $S_{input}$ is the input low-light image and * represents pixel-by-pixel multiplication. The reflection map R is a constant determined by the nature of the object, whereas the illumination map I is affected by the external light source. Image enhancement can therefore be achieved by removing the influence of the illumination or by correcting the illumination component I.
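The original Retinex relation can be illustrated numerically. The sketch below composes a synthetic low-light image as the pixel-by-pixel product of a reflection map and a dim illumination map, and then recovers the reflection map by division when the illumination is known. The array sizes, the constant illumination value, and the small `eps` stabilizer are assumptions made for illustration; in practice the illumination is unknown and must be estimated.

```python
import numpy as np


def compose(reflection, illumination):
    """Retinex image formation: pixel-by-pixel product of R and I."""
    return reflection * illumination


def recover_reflection(image, illumination, eps=1e-6):
    """Invert the Retinex product when the illumination map is known.

    eps avoids division by zero in completely dark pixels (an assumption,
    not part of the original formula).
    """
    return image / (illumination + eps)


rng = np.random.default_rng(0)
R = rng.uniform(0.2, 1.0, size=(4, 4, 3))   # synthetic reflection map
I = np.full((4, 4, 1), 0.1)                 # dim, spatially constant illumination
S_input = compose(R, I)                     # simulated low-light image
R_recovered = recover_reflection(S_input, I)
```

With a known illumination map the inversion is trivial; the difficulty discussed in this paper comes from having to estimate both R and I from the single observation S.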
Most current methods use the estimated reflection map R as the enhanced (well-exposed) image [10]. However, we found in experiments that using only the reflection map causes loss of image details and color as well as excessive enhancement. To solve this problem, we improved the original Retinex formula (FORMULA 2). This formula is used in testing rather than in training. It is simple yet effective. The result is shown in FIGURE 4. In the formula, I and R are the illumination map and the reflection map, respectively, which are estimated using the decomposition network (the specific structure is introduced in the next section). We normalize I and R to [0, 1] and use (1 − I) to suppress the lighter areas and lighten the darker areas. $S_{input}$ is the original low-light image, and $S_{output}$ is the enhanced image. β and α are the image brightness correction coefficients. The value range of β is [1, 8]; as β increases, the brightness of the low-light areas increases. The value range of α is [1, 3]; as α increases, the overall brightness of the image increases.
It is important to point out that Retinex theory assumes that 1) the visual appearance can be decomposed into reflection maps and illumination maps, 2) reflectivity is the intrinsic property of the object, and 3) the illumination map can vary (depending on actual conditions). Therefore, unlike the illumination and reflection maps of the methods in [21], [22], our reflection map is closer to the normal image, and our illumination map is closer to the grayscale version of the normal image. This is more similar to [20].
We design a decomposition network with a two-layer convolutional neural network. We use only layer1 and layer2 to extract the color and brightness of the image without changing the texture and contour of the original image.
Lv and Lu [34] believe that the quality of information in the illumination map is lower. To solve the problem of less information in the illumination map, unlike other single-channel illumination maps based on Retinex theory, our illumination map uses 3 channels. We extract the minimum, maximum, and mean values of the RGB channels from the low-light image before training, and concatenate them to maintain more image details in the illumination map.
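The channel-statistics step described above can be sketched as follows. For each pixel, the maximum, minimum, and mean over the RGB channels are computed and stacked into a three-channel map. Concatenating this map with the RGB image to obtain the 6-channel input mentioned in the FIGURE 3 caption is an assumption about how the two are combined, as are the function names and the max/min/mean channel order.

```python
import numpy as np


def channel_statistics(x):
    """Per-pixel max, min, and mean over RGB, stacked as a 3-channel map.

    x: float array of shape (H, W, 3) with values in [0, 1].
    """
    return np.stack([x.max(axis=-1), x.min(axis=-1), x.mean(axis=-1)], axis=-1)


def build_network_input(x):
    """ASSUMPTION: the 6-channel input is RGB followed by the statistics map."""
    return np.concatenate([x, channel_statistics(x)], axis=-1)


rng = np.random.default_rng(0)
low_light = rng.uniform(0.0, 0.2, size=(8, 8, 3))   # synthetic dark image
stats = channel_statistics(low_light)
net_input = build_network_input(low_light)           # shape (8, 8, 6)
```

Keeping three statistics instead of a single channel gives the illumination branch more per-pixel information to work with, which is the motivation stated in the text.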

2) DISCRIMINATOR NETWORK
After experimenting, we found that the use of ordinary discriminators may overenhance the brighter areas or underenhance the darker areas in the image. To ensure that our network can adaptively enhance different regions in the image, we adopt the discriminator structure of PatchGAN, namely, the Markov discriminator. The discriminator structure is shown in FIGURE 5.
Currently, the Markov discriminator is used in GANs such as pix2pix [28] and CycleGAN [27]. In style transfer, it helps retain resolution and detail for very-high-resolution, sharp images. The discriminator maps the input to an N * N patch (matrix) X, which is actually the feature map output by the convolutional layer. Each value X_ij represents the probability that the corresponding patch is a real sample, and the average of all X_ij is the final output of the discriminator. Each element of the output matrix corresponds to a receptive field, that is, a patch, in the original image.
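The relation between one element of the output matrix and its patch in the input can be made concrete with a receptive-field calculation. The sketch below uses the layer configuration of the widely used 70 × 70 PatchGAN from pix2pix (five 4 × 4 convolutions with strides 2, 2, 2, 1, 1) purely as an example; the configuration of the discriminators in this paper (FIGURE 5) is not given in the text, so these values are an assumption, and the helper names are hypothetical.

```python
import numpy as np


def receptive_field(layers):
    """Receptive field of one output element.

    layers: list of (kernel_size, stride) pairs ordered from input to output.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# ASSUMPTION: pix2pix-style 70x70 PatchGAN, not necessarily the FIGURE 5 network.
PIX2PIX_70_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def discriminator_output(patch_scores):
    """Final discriminator score: the mean of the N x N patch score matrix X."""
    return float(np.mean(patch_scores))


rf = receptive_field(PIX2PIX_70_LAYERS)
score = discriminator_output(np.array([[0.0, 1.0], [1.0, 1.0]]))
```

For this example configuration, `rf` evaluates to 70, meaning each X_ij judges a 70 × 70 region of the input, and `score` is the average 0.75 of the four patch scores.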
We use two PatchGAN discriminators as the illumination map discriminator and the reflection map discriminator. The two discriminators have the same structure, but the parameters are different. They are used to judge the authenticity of the generated illumination map and reflection map. This not only ensures that the content (semantics) similarity is maintained between the generated image and the original image but also ensures that the illumination map and reflection map of the low-light image can learn the brightness information of the normal-light images.

C. LOSS FUNCTION
The loss function of the network in this paper consists only of adversarial loss. The adversarial loss is used to make $x_I \sim P_{real-patches}$, which means that the distribution of the illumination map of the low-light image approximates that of the normal image. Additionally, we need to make $x_R \sim P_{real-patches}$, which means that the distribution of the reflection map is close to that of the normal image.
For the local discriminator, sixteen patches are randomly cut from the output image and the real image. Here, we use the original least squares GAN (LSGAN) [35] as the adversarial loss. LSGAN, as its name suggests, uses the mean square error as the objective function. The discriminator no longer maximizes the cost function of the original GAN but minimizes it. The objective is to learn both forward mappings $G_I : x_I \to y_{3max}$ and $G_R : x_R \to y$. In this paper, we code the generated samples and the real samples as 0 and 1, respectively. With the square error as the objective function, the illumination map discriminator $D_I$ is trained to score $y_{3max}$ as 1 and $x_I$ as 0; similarly, the reflection map discriminator $D_R$ is trained to score $y$ as 1 and $x_R$ as 0.
Here, $x_I$ is the generated illumination map and $x_R$ is the generated reflection map. $y_{3max}$ is the maximum RGB value of the normal-light image cascaded into three channels. $D_I$ and $D_R$ are the discriminators of the illumination map and the reflection map.
The total loss function of our decomposition network combines $L_{G_I}$ and $L_{G_R}$, the adversarial losses of illumination and reflection.
[Figure caption, beginning truncated: … [9], (c) LIME [10], (d) RetinexNet [20], (e) SIEN [24], (f) EnlightenGAN [30], (g) DCE-Net [26], (h) DRBN [31], (i) ours, β = 6.2, α = 1.3. Please zoom in to see the details.]
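The least-squares objectives described in this section can be sketched in a few lines. The code follows the standard LSGAN formulation [35] with generated samples coded as 0 and real samples as 1. The 1/2 scale factors come from that standard formulation, and summing the two generator terms without weights is an assumption; the paper's exact equations are not reproduced in this text, so this is an illustration rather than the authors' definition.

```python
import numpy as np


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push discriminator scores toward 1 on real patches and 0 on generated ones."""
    real_term = 0.5 * np.mean((real_scores - 1.0) ** 2)
    fake_term = 0.5 * np.mean(fake_scores ** 2)
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    """Push discriminator scores on generated patches toward 1."""
    return 0.5 * np.mean((fake_scores - 1.0) ** 2)


def total_generator_loss(d_i_on_x_i, d_r_on_x_r):
    """ASSUMPTION: unweighted sum of the illumination and reflection terms."""
    return lsgan_generator_loss(d_i_on_x_i) + lsgan_generator_loss(d_r_on_x_r)


# Scores of a hypothetical, perfectly confident discriminator on 16 patches.
real = np.ones(16)
fake = np.zeros(16)
d_loss = lsgan_discriminator_loss(real, fake)
g_loss = total_generator_loss(fake, fake)
```

In this toy case the discriminator loss is zero and each generator term is at its maximum of 0.5 for scores in [0, 1], which shows how the two objectives pull the patch scores in opposite directions.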

A. DATASET AND EXPERIMENTAL DETAILS
We use unpaired low/normal-light images for training, and the data-set is provided by EnlightenGAN [30]. These images are converted to PNG format and resized to 600 × 400 pixels. We select 900 low-light images and 900 normal-light images from this data-set as the training set. To expand the data-set, the images are randomly cropped to 400 × 400 and flipped during training. For the testing set, we use standard images from previous works (LIME [10], LOL [20], SRIE [9], DeepUPE [22], DICM [36], and EnlightenGAN [30]).
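The augmentation described above can be sketched as a random crop followed by a random flip. The text only states that images are cropped to 400 × 400 and flipped; the horizontal flip direction, the 0.5 flip probability, and the (height, width, channels) array layout used below are assumptions, and the function name is hypothetical.

```python
import numpy as np


def random_crop_and_flip(image, crop_size=400, rng=None):
    """Randomly crop a square patch and flip it horizontally half of the time.

    image: array of shape (H, W, C) with H >= crop_size and W >= crop_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    top = int(rng.integers(0, height - crop_size + 1))
    left = int(rng.integers(0, width - crop_size + 1))
    patch = image[top:top + crop_size, left:left + crop_size]
    # ASSUMPTION: horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    return patch


rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(400, 600, 3))   # 600 x 400 training image
patch = random_crop_and_flip(image, crop_size=400, rng=rng)
```

Because the training data are unpaired, the low-light and normal-light images can be cropped and flipped independently of each other.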
Our batch size is set to 1. We use the Adam optimizer. The learning rate of the decomposition network is set to 1e-4. The learning rate of the two discriminator networks is set to 4e-4. The network is trained for 100 epochs on a single Nvidia 1080Ti GPU, which takes no more than 1 hour.
One of the shortcomings of GANs is that training is unstable. Since the network structure we use is a variant of the GAN, this problem is inevitable in our network training. After many repeated experiments, we found that the model achieves better results after approximately 50 epochs. When replicating the code in this paper, we suggest that readers save the model at intervals during training and check the generated reflection map. When the color, brightness, and texture of the reflection map meet expectations, the model parameters at that point are considered optimal. Readers can also refer to other articles on improving GAN stability [37]-[39] to make training more stable.

B. EXPERIMENTAL COMPARISON
1) COMPARISON WITH TYPICAL EXPERIMENTS
We compare our method with the following state-of-the-art low-light image enhancement methods: SRIE [9], LIME [10], RetinexNet [20], SIEN [24], EnlightenGAN [30], DCE-Net [26], and DRBN [31]. To ensure a fair comparison, we use the source code and pretrained models provided by the authors of these methods and the parameter settings they recommend. The data-set in FIGURE 6 is from DeepUPE [22]. As FIGURE 6 shows, SRIE [9] enhances the input image, but its brightness enhancement is limited compared with the other methods. LIME [10] overenhances image areas with light sources or areas with higher pixel values, resulting in loss of surrounding details. After enhancement with RetinexNet [20], the texture of the image is very unnatural. SIEN [24] overenhances and distorts the color. Although EnlightenGAN [30] uses an attention map to eliminate image artifacts, some artifacts remain in some test images, and the generated result is yellowish. DCE-Net [26] is whitish. DRBN [31] loses some details of the image.
We also carried out a quantitative comparison. Unpaired images (from DeepUPE [22]) and paired images (from EnlightenGAN [30]) are used for testing. For paired images, we use PSNR and SSIM to evaluate image quality. For unpaired images, we use NIQE to evaluate image quality. The average values are shown in TABLE 1.
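For reference, PSNR on a paired test image can be computed with its standard definition, 10·log10(MAX² / MSE). The sketch below is a generic implementation and not the evaluation script used for TABLE 1; the function name and the 8-bit peak value of 255 are assumptions.

```python
import numpy as np


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between a reference and an enhanced image."""
    reference = np.asarray(reference, dtype=np.float64)
    enhanced = np.asarray(enhanced, dtype=np.float64)
    mse = np.mean((reference - enhanced) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


reference = np.zeros((4, 4, 3), dtype=np.uint8)
worst_case = np.full((4, 4, 3), 255, dtype=np.uint8)
psnr_identical = psnr(reference, reference)
psnr_worst = psnr(reference, worst_case)
```

Higher values indicate a closer match to the normal-light reference; SSIM and NIQE require more involved computations and are not sketched here.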
We tested the running speed of each method. As shown in TABLE 2, our method is the fastest. For the traditional method, we use a CPU (Intel Core i7-7700HQ CPU) for testing. Due to the limitations of the hardware environment, we also use a CPU to test RetinexNet and SIEN. The other methods are tested on an Nvidia GTX1050 Ti GPU.
The current methods were proposed to enhance low-light images, but they fail to maintain enhancement performance when the input images have normal light. As shown in FIGURE 7, (i) are the input normal-light images, and (a)-(h) are the enhanced results output by different methods. Our method performs better than the other approaches; e.g., the enhanced images in line (h) are closer to the images in line (i) than those in the other lines.

2) ABLATION STUDY
Different people may like images with different brightness. However, most current methods based on deep learning can only obtain one kind of brightness image after the operation of low-light image enhancement. By simply adjusting the β and α, we can meet the requirements of most people for low-light image enhancement.
In the test, the different β and α values in FORMULA 2 will affect the test results. FIGURE 8 shows the different brightness enhancement effects caused by different β and α in the test. β is used to adjust the enhancement of dark areas in the image. As β increases, dark areas in the image are illuminated. This value can guarantee the texture of the image. However, if the β value is too large, it will cause the image details to be lost. α is used to control the overall enhancement of the image. As α increases, the overall brightness of the image will increase. This value can guarantee the details of the image. However, if the α value is too large, part of the image (especially at the light source) will be overexposed. In addition, we tested the effect of β and α on image quality. As shown in FIGURE 9, as the value of β increases, the value of NIQE will decrease, but the values of PSNR and SSIM will also decrease. When α is 2 and β is 1, PSNR and SSIM achieve optimal values. However, in reality, images with good visual effects may not have large PSNR and SSIM values [40].

C. REAL-WORLD IMAGE ENHANCEMENT
To verify the generalization of the model, the FD-LOL (face detection in low-light conditions) [32] data-set is used for testing. This data-set contains real low-light images taken at night and is used for face detection in low-light scenes. We use the DSFD (dual shot face detector) [41] method to detect faces in the original low-light images and in the images enhanced by RetinexGAN. FIGURE 10 shows that, before image enhancement, face detection produces missed detections and false detections. After image enhancement, face detection accuracy is significantly improved.

V. CONCLUSION
In this paper, we propose a deep learning model based on generative adversarial networks and Retinex to enhance low-light images with unpaired data-sets. It uses a decomposition network with only two convolution layers to decompose a low-light image into an illumination map and a reflection map. Then, Retinex with correction coefficients is used to reconstruct a normal-light image from the illumination map and the reflection map. Our experiments demonstrate the superiority of this method. In future work, we will focus on adaptive and real-time low-light image enhancement, and apply this model to multi-agent systems [42], cyberphysical systems [43], and automated manufacturing systems [44].
XINCHENG REN (Member, IEEE) was born in 1967. He received the B.S. degree in physics and theoretical physics from Yanan University, Yan'an, China, in 1990, the M.S. degree in physics and theoretical physics from Northwest University, Xi'an, China, in 2000, and the Ph.D. degree in radio physics from Xidian University, Xi'an, in 2008.
He is currently a Professor with the School of Physics and Electronic Information, Yanan University. He has authored over 100 articles. His research interests include propagation and scattering of electromagnetic (optical) waves in complex systems and random medium, computational electromagnetics, and theory and technology of wireless communication.
RUNTAO XI received the bachelor's degree in software engineering from Langfang Normal University, in 2019. He is currently pursuing the master's degree with the Xi'an University of Science and Technology. His research interests include computer vision and reinforcement learning, especially adaptive image enhancement in reinforcement learning.
YUANCHENG LI was born in Henan, China, in 1981. He received the B.S. degree in computer science and technology from the Xi'an University of Posts and Telecommunications, China, in 2004, and the M.Sc. degree in computer software theory and the Ph.D. degree in computer system structure from Xi'an Jiaotong University, China, in 2007 and 2012, respectively.
Since 2012, he has been a Lecturer with the College of Computer Science and Technology, Xi'an University of Science and Technology, China. His research interests include parallel computing and compiler optimization. He serves as a Reviewer for Concurrency and Computation: Practice and Experience and the Journal of the Chinese Institute of Engineers.
XINLEI ZHOU was born in Xi'an, Shaanxi, China, in 1997. He received the bachelor's degree in engineering from Anhui Polytechnic University, in 2020. He is currently pursuing the master's degree in engineering with the Xi'an University of Science and Technology. His research interests include computer vision and data visualization.