Low Light Image Enhancement Based on Multi-Scale Network Fusion

Researchers have made great progress in object detection. However, these studies mainly address images captured under normal lighting and neglect detection under low light, even though images in fields such as nighttime autonomous driving and surveillance are usually acquired in low-light environments. Such images suffer from poor brightness, low contrast, and obvious noise, which cause a large loss of image information and reduce object detection performance. In this paper, we propose a low-light image enhancement method based on multi-scale network fusion to address these problems. Because low-light images contain relatively little effective information, we propose a preprocessing method based on nonlinear image transformation and fusion, which increases the amount of available information in the low-light image. Then, to obtain a better enhancement effect, we propose a multi-scale feature fusion method that fuses features from different resolution levels in the network. This improves the details of low-light areas in the image and alleviates the feature loss caused by overly deep networks. Experimental results show that the proposed method achieves better enhancement on different datasets than current mainstream methods. The average recall of object detection with our method improves by 38.25%, which shows that the proposed method is effective and can promote the development of autonomous driving, monitoring, and other fields.


I. INTRODUCTION
With the development of technology, computer vision has been widely used in various fields, such as autonomous driving, video surveillance, object detection, and classification. However, in daily life, weather and lighting are key factors that affect image quality. Images obtained under uneven illumination or in weak light have problems such as small overall gray values, low brightness, low contrast, and low signal-to-noise ratio, which reduce the amount of information contained in the image. These problems reduce the performance of image processing algorithms, causing algorithms such as object detection and classification to fail to achieve expected results, and can even endanger life in the field of autonomous driving. Therefore, it is necessary to study low-light image enhancement algorithms, a topic that has aroused the interest of many researchers.
The associate editor coordinating the review of this manuscript and approving it for publication was Antonio J. R. Neves.
Traditional algorithms and deep learning-based algorithms have dominated low-light image processing in recent years. Histogram equalization methods and methods based on the Retinex model are both traditional low-light enhancement methods. Common examples of the former include Contrast Limited Adaptive Histogram Equalization (CLAHE) [1] and Ibrahim's Brightness Preserving Dynamic Histogram Equalization (BPDHE) [2]. This approach results in unnatural contrast and detail loss in the processed image. Jobson proposed the Single Scale Retinex (SSR) and Multi-Scale Retinex (MSR) [3], [25], both of which are based on Retinex theory, and there are other Retinex variants, such as Multi-Scale Retinex with Color Restoration (MSRCR) [4]. These algorithms estimate the illumination map in the logarithmic domain, but they frequently introduce noise and take a long time to run. Compared with traditional algorithms, deep learning-based methods [5], [6], [23], [24], [29], [37], [38] are faster and have better accuracy and robustness, so they are gradually becoming mainstream. For example, Lore et al. proposed a deep autoencoder-based algorithm (LLNet) [7] that identifies low-luminance image features and enhances the image with a self-supervised enhancement network. Although the image luminance is effectively improved, the enhanced image suffers from overexposure, leading to image distortion.
VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
There are also convolutional neural networks for image enhancement based on Retinex theory, such as the Retinex-Net algorithm [9], [22], [26] proposed by Wang et al. It consists of a Decom-Net module, which decomposes the input image into light-independent reflectance and structure-aware smooth illumination, and an Enhance-Net module for luminance enhancement. Although this approach recovers color and brightness adequately, the resulting image is noisy. In addition, the Zero-DCE++ algorithm [8] treats light enhancement as an image-specific curve estimation task, taking a low-light image as input and producing higher-order curves as output. These curves adjust the dynamic range of the image at the pixel level to produce the final enhanced image. This curve-based method does not require any paired or unpaired data during training; it implements zero-reference learning through a set of non-reference loss functions. Semi-supervised learning has also been proposed in recent years. Yang et al. proposed a semi-supervised deep recursive band network (DRBN) [8]. DRBN first recovers a linear band representation of the enhanced image under supervised learning, and then recombines the given bands through a learnable linear transformation based on unsupervised adversarial learning to obtain an improved band representation. DRBN is extended to achieve better enhancement performance by introducing a long short-term memory (LSTM) network and an image quality assessment network pre-trained on the analysis dataset. Apart from semi-supervised deep learning, Jiang et al. proposed an unsupervised learning method, EnlightenGAN [33], which uses an attention-guided U-Net as the generator and global-local discriminators to ensure that the enhanced results look like real normal-light images.
In addition to global and local adversarial losses, global and local self-feature preservation losses are also proposed to preserve image content before and after enhancement.
At present, low-illumination image enhancement algorithms based on deep learning still produce noise, artifacts, and distortion in the enhanced image. This research provides a low-illumination image enhancement technique using multi-level network fusion (MFIE-Net) to overcome these problems. The primary contribution of this work is a new enhancement approach. First, to enrich the effective information of the low-light image, the image is preprocessed with a nonlinear transformation, and the transformed image is fused with the original image. Then, a convolutional neural network extracts features from the fused image: the feature extraction module (FEM) and the feature enhancement module (EM) extract and enhance the features of each layer, and the feature maps of the same scale in the network are fused [36], which effectively alleviates the feature loss caused by network downsampling. Finally, a loss function that combines perceptual, structural, and color losses is devised to improve the enhancement effect of the model. The experimental results show that the algorithm effectively reduces image noise without causing color distortion and that its evaluation indexes are higher than those of other techniques.

II. MULTI-SCALE FEATURE FUSION NETWORK

A. STRUCTURAL DESIGN
This research proposes an end-to-end multi-scale network fusion approach to address issues such as artifacts and noise during image enhancement that are present in previous methods. Figure 1 depicts the network structure.
First, the image is logarithmically transformed to expand its low gray value range and compress its high gray value range, which enhances dark details. The transformed image is then fused with the original image along the channel dimension before entering the network, enriching the original features. Because shallow features degrade as network depth increases, the multi-scale network feature fusion method fuses low-layer information with high-layer information to reduce the information lost in this way. The encoder is made up of 3 × 3 convolutional kernels, while the decoder is made up of transposed convolutional and activation layers of the same scale. The feature-extracted image is then fused with the input image, and the final enhanced image is output.

1) IMAGE LOGARITHMIC CHANGE PRE-PROCESSING
Because the brightness of most low-light photographs is too low, the features of the over-dark parts are not visible and no useful features are extracted when the images pass through the neural network, so the images cannot be effectively enhanced. Most current low-light image preprocessing approaches linearly amplify the overall image gray value, but this leads to overexposure in some regions.
As a result, this design preprocesses the image with a logarithmic transformation, a type of nonlinear transformation [10]. As its function curve shows, the slope of the logarithmic transformation is higher at low gray values and lower at high gray values, so the darker pixels are effectively stretched while the brighter pixels are compressed. This method can effectively enrich the feature information of the low gray value area, suppress the overexposed area, and improve the feature extraction effect of the network. The formula is as follows:

$$s = c \cdot \log_{v+1}(1 + v \cdot r) \tag{1}$$

where $s$ is the transformed gray value, $c$ is a constant, $r$ is the gray value after image normalization, and $v$ is a constant; the larger $v$ is, the more obvious the improvement in gray level. To prevent the loss of detail features of the original image after preprocessing, the processed image and the original image are fused into a six-channel tensor (RGB × 2), which is passed to the neural network for training.

FIGURE 1. Overall network structure of the algorithm in this paper. The enhancement process is mainly divided into three steps: logarithmic transformation preprocessing, multi-scale feature enhancement, and reconstruction. In the preprocessing step, the gray values of the original image are logarithmically transformed to improve the features of low gray value areas, and the result is fused with the original image to reduce the feature loss at high gray values. In the multi-scale enhancement step, feature enhancement and extraction are performed on the fused image at different network scales. Finally, all the enhanced feature layers are transformed into feature maps of the same scale for fusion reconstruction.
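The preprocessing step can be illustrated with a minimal, pure-Python sketch. It assumes the common form s = c · log_{v+1}(1 + v · r) for the logarithmic transformation; the default values v = 10 and c = 1 and the function names are illustrative assumptions, not values reported in this paper.

```python
import math

def log_transform(r, v=10.0, c=1.0):
    """Map a normalized gray value r in [0, 1] to c * log_{v+1}(1 + v * r)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be normalized to [0, 1]")
    # Change of base: log_{v+1}(a) = ln(a) / ln(v + 1)
    return c * math.log(1.0 + v * r) / math.log(v + 1.0)

def fuse_pixel(rgb, v=10.0, c=1.0):
    """Concatenate an RGB pixel with its log-transformed copy (3 + 3 channels)."""
    transformed = [log_transform(ch, v, c) for ch in rgb]
    return list(rgb) + transformed

def fuse_image(image, v=10.0, c=1.0):
    """Apply fuse_pixel to an H x W x 3 nested list, giving H x W x 6."""
    return [[fuse_pixel(px, v, c) for px in row] for row in image]
```

With these assumed defaults, a dark value such as r = 0.1 is lifted well above 0.1 while r = 1 stays at 1, which matches the stretching of dark regions described above; the original three channels are kept unchanged in the fused pixel.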

2) FEATURE EXTRACTION MODULE (FEM) DESIGN
The feature extraction module consists of one convolution layer followed by a ReLU activation function [30], [31]; the convolution kernel size is 3 × 3 and the stride is 1. The ReLU activation enhances the nonlinear representation ability of the network. Each FEM is followed by an enhancement module (EM) that enhances the image feature information, and the output of the EM of the previous layer is used as the input of the FEM of the next layer. The network has 3 FEMs, which extract features at different resolutions.
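As an illustration of the operation inside one FEM, the sketch below applies a 3 × 3 convolution with stride 1 followed by ReLU to a single-channel feature map stored as nested lists. Zero padding of 1 and the single-channel simplification are assumptions made for brevity; the actual module operates on multi-channel tensors with learned kernels.

```python
def conv3x3_relu(feature, kernel, bias=0.0):
    """3x3 convolution (stride 1, zero padding 1) followed by ReLU on one channel."""
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # positions outside the map count as zero
                        acc += kernel[di + 1][dj + 1] * feature[y][x]
            out[i][j] = max(0.0, acc)  # ReLU
    return out
```

Because the stride is 1 and the padding is 1, the output has the same spatial size as the input, so the resolution change between scales comes from the enhancement modules rather than from the FEM itself.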

3) ENHANCEMENT MODULE (EM) DESIGN
The number of enhancement modules is equal to the number of feature extraction modules. Following the U-Net structure [11], [27], [28], the EM in this paper uses an encoder-decoder structure (shown in Figure 2). The encoding part consists of 5 × 5 convolutional kernels and activation function layers. The encoder as a whole gradually shrinks the resolution of the feature maps to capture contextual information. It is divided into 4 stages; in each stage, convolution is used for downsampling, so the final feature map is reduced by a factor of 16. The decoder consists of deconvolutional layers and activation function layers and is also divided into 4 stages. In each stage, after the input feature map is upsampled, it is concatenated with the feature map of the corresponding scale in the encoder, and the final feature map is enlarged by a factor of 16. Unlike U-Net, the EM in this paper does not use pooling layers, which reduces feature loss.
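The factor-of-16 reduction and enlargement can be checked with the standard output-size formulas for strided and transposed convolutions. In the sketch below, the stride of 2, padding of 2, and output padding of 1 are assumptions chosen so that each stage exactly halves or doubles the resolution; the paper states only the 5 × 5 kernel size and the 4 stages.

```python
def conv_out(n, kernel=5, stride=2, padding=2):
    """Spatial size after a strided convolution."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel=5, stride=2, padding=2, output_padding=1):
    """Spatial size after a transposed convolution."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

def em_sizes(n, stages=4):
    """Feature-map sizes through the encoder stages, then the decoder stages."""
    encoder = []
    for _ in range(stages):
        n = conv_out(n)
        encoder.append(n)
    decoder = []
    for _ in range(stages):
        n = deconv_out(n)
        decoder.append(n)
    return encoder, decoder
```

For a 256-pixel side, `em_sizes(256)` gives encoder sizes 128, 64, 32, 16 and decoder sizes 32, 64, 128, 256, so each decoder stage has an encoder feature map of matching size available for concatenation.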

B. LOSS FUNCTION DESIGN
A single loss function cannot provide a good recovery effect given the low brightness and visible noise of low-light images. In this paper, we use the combination of structural similarity loss $L_{SSIM}$, perceptual loss $L_{Perceptual}$, and color loss $L_c$ as the overall loss function.

1) LOSS OF STRUCTURAL SIMILARITY
Low-light image recovery usually introduces structural distortions such as blurring and artifacts. Structural similarity measures how similar the enhanced image and the labeled image are by comparing their brightness, contrast, and structure, which are estimated using the mean, standard deviation, and covariance, respectively, as shown in Equation (2):

$$SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{2}$$

where $\mu_x$ and $\mu_y$ are the means of images $x$ and $y$, $\sigma_x$ and $\sigma_y$ are their standard deviations, and $\sigma_{xy}$ is the covariance of $x$ and $y$. The constants $c_1$ and $c_2$ are set to 0.01 and 0.02 in this case to prevent a zero denominator. The structural similarity of two identical images is equal to 1, so the structure loss function is designed as follows:

$$L_{SSIM} = 1 - SSIM(x, y) \tag{3}$$
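A minimal sketch of the structural similarity loss follows, using the standard SSIM expression built from means, variances, and covariance, with the constants 0.01 and 0.02 stated above. It computes the statistics globally over flattened pixel lists, which is a simplification; SSIM is commonly evaluated over local windows and averaged, and the paper does not state which variant it uses.

```python
def ssim_global(x, y, c1=0.01, c2=0.02):
    """SSIM from global statistics of two equal-length lists of pixel values."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def ssim_loss(x, y):
    """Structure loss: zero for identical images, larger as they diverge."""
    return 1.0 - ssim_global(x, y)
```

For two identical inputs the numerator and denominator coincide, so the loss is zero up to floating-point rounding.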

2) PERCEIVED LOSS
In addition to the low-level information captured by the structural similarity loss, we also need to use the high-level information of the image to recover it. Following algorithms such as style transfer, the labeled image is first passed through convolutional layers to obtain feature maps, which are compared with the features of the enhanced low-illumination image so that the high-level information of the two becomes close. In this paper, we use VGG16 as the feature extraction network for the perceptual loss, as in Equation (4):

$$L_{Perceptual} = \frac{1}{C_j H_j W_j} \left\| \phi_j(x) - \phi_j(y) \right\|_2^2 \tag{4}$$

where $j$ denotes the $j$-th layer of the network; $C_j$, $H_j$, and $W_j$ represent the channel number, height, and width of the feature map of layer $j$; and $\phi_j(x)$ and $\phi_j(y)$ represent the feature maps of the enhanced image and the labeled image at layer $j$, respectively. The perceptual loss ensures the quality of the enhanced image through high-level information.
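The normalization in the perceptual loss can be sketched as a squared feature distance divided by the number of feature elements. The sketch assumes the layer-j feature maps have already been extracted (for example by a pretrained VGG16) and are passed in as C × H × W nested lists; the extraction itself is outside the scope of this illustration.

```python
def perceptual_loss(feat_x, feat_y):
    """Squared L2 distance between two C x H x W feature maps, divided by C*H*W."""
    c, h, w = len(feat_x), len(feat_x[0]), len(feat_x[0][0])
    total = 0.0
    for ch_x, ch_y in zip(feat_x, feat_y):
        for row_x, row_y in zip(ch_x, ch_y):
            for a, b in zip(row_x, row_y):
                total += (a - b) ** 2
    return total / (c * h * w)
```

Dividing by C · H · W keeps the loss on a comparable scale regardless of which layer is chosen.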

3) COLOR LOSS
Color bias can exist in low-illumination images. The $L_{SSIM}$ and $L_{Perceptual}$ functions enhance the image through structure and features, but color differences can remain in the enhanced image. Therefore, a loss function with color recovery is designed in this paper. The RGB pixels are treated as three-dimensional vectors, and the color difference is measured by computing and summing the angles between corresponding vectors. The expression is as follows:

$$L_c = \sum_p \angle\left((x)_p, (y)_p\right) \tag{5}$$

where $x$ and $y$ are the enhanced image and the labeled image, $\angle(\cdot, \cdot)$ is an operator that calculates the angle between two colors, and $(\cdot)_p$ denotes a pixel. $L_c$ is the sum of the angles between the color vectors of each pixel pair. In summary, the total loss function $L$ designed in this paper consists of three components, as in Equation (6):

$$L = w_1 L_{SSIM} + w_2 L_{Perceptual} + w_3 L_c \tag{6}$$

where $w_1$, $w_2$, and $w_3$ represent the weight of each loss function ($w_1 = 0.45$, $w_2 = 0.45$, $w_3 = 0.1$).
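The color loss and the weighted total loss can be sketched as follows. The angle is computed from the cosine similarity of the two RGB vectors; the small `eps` guard for black pixels and the assignment of the weights in the order structural, perceptual, color are assumptions made for this illustration.

```python
import math

def color_angle(p, q, eps=1e-12):
    """Angle in radians between two RGB vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    cos = dot / max(norm_p * norm_q, eps)  # eps avoids division by zero for black pixels
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

def color_loss(enhanced, label):
    """Sum of per-pixel angles between two lists of RGB pixels."""
    return sum(color_angle(p, q) for p, q in zip(enhanced, label))

def total_loss(l_ssim, l_perceptual, l_color, w=(0.45, 0.45, 0.1)):
    """Weighted sum of the three loss terms."""
    return w[0] * l_ssim + w[1] * l_perceptual + w[2] * l_color
```

Two pixels that differ only in brightness, such as (0.2, 0.4, 0.6) and (0.1, 0.2, 0.3), have an angle of approximately zero, so this term penalizes hue shifts rather than exposure differences.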

III. EXPERIMENTAL SETUP
In this paper, we select 15,144 images from the VOC dataset [14], [17] and process them with random gamma adjustment to form the training and validation sets, with 14,000 sets of images in the training set and 1,144 sets of images in the validation set. To evaluate the trained model objectively, we use the LOL (Low-Light) dataset [15] as the test set. The whole network is trained on an NVIDIA RTX 2080 Ti GPU and an Intel Xeon E5-2690 CPU using the PyTorch framework.
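Random gamma adjustment for synthesizing low-light training data can be sketched as below. The gamma range of 2.0 to 5.0 and the function names are hypothetical; the paper does not report the range it used. Pixel values are assumed to be normalized to [0, 1].

```python
import random

def darken(image, gamma):
    """Apply out = in ** gamma to normalized pixel values; gamma > 1 darkens."""
    return [[[ch ** gamma for ch in px] for px in row] for row in image]

def make_pair(image, rng, gamma_range=(2.0, 5.0)):
    """Return (synthetic low-light image, normal-light label) for one image."""
    gamma = rng.uniform(*gamma_range)
    return darken(image, gamma), image
```

Passing a seeded `random.Random` instance as `rng` keeps the synthetic pairs reproducible across runs.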

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. ABLATION EXPERIMENT RESULT
In order to illustrate the effectiveness of each part proposed in this paper, under the same conditions as other experiments, ablation experiments are performed on the enhanced network with and without preprocessing. The unprocessed RGB 3-channel image and the pre-processed RGB*2 6-channel image are used as the input of the network respectively. The experimental results of the two groups are compared to prove the effectiveness of fusing the preprocessed images to improve the image enhancement effect.
Comparing the outputs of the network with and without logarithmic transformation preprocessing in Figure 3 shows that preprocessing by logarithmic transformation effectively improves the image features in areas with lower gray values and more noticeably improves the enhancement ability of the network in low gray value regions.

B. VISUAL QUALITY COMPARISON
To visualize the enhancement effect, the proposed method is compared experimentally with other low-light image enhancement methods: HE, a traditional method; AMSRCR, based on Retinex theory; and the deep learning-based Retinex-Net, KinD [34], DRBN, EnlightenGAN, and Zero-DCE++. We performed experiments on images with different lighting conditions. The images come from the LOL dataset and the Night-driving dataset (which are not part of our training dataset), and Figure 4 displays the results. After enhancement by the HE method, the gray levels are reduced, image details are lost, and the contrast is unnatural. The AMSRCR method over-enhances color and contrast and also produces noisy images. The Retinex-Net method recovers color better, but the recovered image has obvious noise. The KinD and DRBN algorithms can enhance the color and brightness of low-light images, but they do not enhance local features in low gray value regions well. The image enhanced by the EnlightenGAN algorithm, which is based on unsupervised learning, contains noise. Although the Zero-DCE++ algorithm can effectively enhance image details, the overall image brightness is low. Compared with the other algorithms, the MFIE-Net algorithm proposed in this paper performs better in overall and local details and is more balanced in color recovery; the image details are clearly discernible and no additional noise is generated, resulting in a visual effect that is closer to that of the label image.
To objectively reflect the effect of the algorithm, in addition to subjective evaluation, PSNR, SSIM [16], and LPIPS are used to evaluate the performance of the model.
PSNR: peak signal-to-noise ratio, commonly used as a measure of image and signal reconstruction quality. The larger the value, the smaller the image distortion.

$$PSNR = 10 \log_{10}\left(\frac{MAX_I^2}{MSE}\right) \tag{7}$$

where $MAX_I$ represents the maximum possible pixel value of the image; if each pixel is represented by 8 bits, its value is 255. MSE is the mean square error between the original images and the resulting images.
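As a worked example of the PSNR definition, the sketch below computes the metric for two flattened 8-bit images using the standard expression 10 · log10(MAX² / MSE). The function name and the list-based image representation are illustrative choices.

```python
import math

def psnr(reference, result, max_value=255.0):
    """PSNR in dB between two equal-length lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, result)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)
```

When every pixel differs by exactly one gray level, the MSE is 1 and the PSNR equals 20 · log10(255), roughly 48.13 dB, which is the highest finite value an 8-bit image pair with integer pixel errors can reach.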
SSIM: structural similarity, which measures the similarity of two images in brightness, contrast, and structure. The larger the value, the more similar the two images are.

$$SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{8}$$

where $\mu_x$ and $\mu_y$ are the means of images $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ are their variances, and $\sigma_{xy}$ is their covariance. $c_1$ and $c_2$ are two constants that avoid division by zero.
LPIPS: learned perceptual image patch similarity, used to measure the difference between two images [32]. The lower the value, the more similar the two images are.
$$d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^l_{hw} - \hat{y}^l_{0hw} \right) \right\|_2^2 \tag{9}$$

where $d$ is the distance between $x$ and $x_0$. Feature stacks are extracted from $L$ layers and unit-normalized in the channel dimension, giving $\hat{y}^l, \hat{y}^l_0 \in \mathbb{R}^{H_l \times W_l \times C_l}$ for layer $l$. The activations are scaled channel-wise by the vector $w_l \in \mathbb{R}^{C_l}$.
Table 1 shows the comparison results with other methods on the LOL dataset (testing pairs). The data show that the proposed method has advantages in the PSNR, SSIM, and LPIPS indexes, which indicates that it is more effective for low-light image enhancement.
Table 2 compares the image processing time of different algorithms. Although the traditional HE algorithm has the fastest processing time, it has the worst image enhancement effect. Among the deep learning-based algorithms, although EnlightenGAN and Zero-DCE++ are slightly faster than the proposed method, our method has clear advantages in visual quality and evaluation metrics, so it is more suitable for low-light image processing.
To show that the proposed algorithm enhances well in different scenarios, Figure 5 and Figure 6 compare it with other algorithms on other datasets and on actual night road pictures. As the other photos in the test set show, the MFIE-Net algorithm does better in terms of brightness, contrast, and noise suppression. In the actual night road scene, the image enhanced by the proposed algorithm is the closest to the real image in terms of brightness and color, which shows intuitively that the proposed algorithm performs well.

C. NIGHTTIME DETECTION
In actual nighttime detection, to show the effect of enhancement, this paper uses the YOLOv5 [18], [19], [20], [21] detection algorithm to compare detection results on the original images and on the images enhanced by the proposed algorithm. As shown in Figure 7 and Figure 8, in each of the 4 groups of images, the first row is the detection result without low-light enhancement, and the second row is the detection result after low-light image enhancement. The recall of (a) increases from 0.25 to 0.5, and that of (b) increases from 0.5 to 0.83. In (a) in particular, the detector accurately identifies roadside cyclists and pedestrians after enhancement, which can effectively reduce the missed detection rate of visual recognition at night. The detection results clearly improve after image enhancement (including (c) and (d)), and the average recall increases by 38.25%, in keeping with practical application requirements.
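Recall here is the fraction of ground-truth objects that the detector finds. The sketch below shows the computation; the object counts in the usage note are hypothetical values chosen only to reproduce the recall figures quoted above, since the paper does not list per-image counts.

```python
def recall(true_positives, false_negatives):
    """Fraction of ground-truth objects that the detector found."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no ground-truth objects")
    return true_positives / total
```

For instance, if a scene contained 4 objects, detecting 1 before enhancement and 2 after would give `recall(1, 3)` = 0.25 and `recall(2, 2)` = 0.5, matching the change reported for (a); detecting 5 of 6 objects gives about 0.83, as reported for (b) after enhancement.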

V. DISCUSSION
This section discusses the main design choices and future work.

A. IMAGE LOGARITHMIC CHANGE PRE-PROCESSING
The logarithmic curve has a large slope where the pixel value is low, and a small slope where the pixel value is high. After the image is logarithmically transformed, the contrast of the darker area will be improved, so the features of the low gray value part of the image are enhanced. According to this feature, we logarithmically transform the image and fuse it with the original image into an RGB 6-channel image as the input of the network. Through the ablation experiments with and without logarithmic change preprocessing, it can be clearly seen that logarithmic change preprocessing can effectively enhance the feature information of the dark part of the image.

B. MULTISCALE NETWORK DESIGN
We use a combination of multiple feature extraction and feature enhancement modules to enhance the image. Each feature extraction module consists of a 3 × 3 convolution and a ReLU activation function. The feature enhancement module is similar to U-Net, but we discard the pooling layers to prevent feature loss. By processing the image with multiple feature extraction and enhancement modules, we extract image features at different resolutions. The feature maps at different resolutions are then unified to the same scale using bilinear interpolation upsampling, and finally a 1 × 1 convolution kernel merges them.
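The final fusion step, bilinear upsampling to a common scale followed by a 1 × 1 convolution, can be sketched in pure Python. The sketch treats each scale as a single-channel map, aligns the corner samples during interpolation, and uses fixed merge weights; all three are simplifying assumptions, since the real network uses multi-channel tensors and learned weights.

```python
def bilinear_resize(fmap, out_h, out_w):
    """Resize a 2-D map (nested lists) with bilinear interpolation, corners aligned."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = fmap[y0][x0] * (1 - fx) + fmap[y0][x1] * fx
            bottom = fmap[y1][x0] * (1 - fx) + fmap[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def merge_1x1(maps, weights, bias=0.0):
    """1x1 convolution across channels: a per-pixel weighted sum of same-size maps."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[bias + sum(wt * m[i][j] for wt, m in zip(weights, maps))
             for j in range(w)] for i in range(h)]

def fuse_scales(maps, weights):
    """Resize every map to the size of the first one, then merge with a 1x1 conv."""
    h, w = len(maps[0]), len(maps[0][0])
    resized = [bilinear_resize(m, h, w) for m in maps]
    return merge_1x1(resized, weights)
```

A 1 × 1 convolution mixes channels at each pixel independently, which is why it can be written as a per-pixel weighted sum once all maps share the same spatial size.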

C. FUTURE WORK
In the future, we will use methods such as dynamic range conversion [35] to study the influence of network visualization and network intermediate layers on low-light image enhancement, in order to further improve enhancement quality. Secondly, we will build datasets for specific fields such as night road scenes and night surveillance videos to improve the algorithm's performance in specific scenes. Finally, we intend to cascade the network of this paper with a target detection network to improve the practical applicability of the algorithm.

VI. CONCLUSION
In this paper, we have proposed a multi-scale network fusion enhancement method (MFIE-Net), which solves the problem of low-light image enhancement and improves the enhancement effect. We first perform nonlinear transformation on the input low-light image and fuse it with the input image to achieve information enhancement in the low-light area, so that the network can learn more features of the low-light area. Then in order to improve the enhancement effect of the low-light image, we design a multi-scale feature fusion module to fuse the features of different levels of resolution in the low-light area. It improves the detail information in low-light areas, fuses deep features with shallow features, and avoids feature loss caused by too deep network layers. The experimental results show that our method is better than other algorithms in terms of PSNR, SSIM, LPIPS, and other evaluation indicators. And the speed of our method is similar to the latest methods. In summary, our method can be applied to low-light fields such as automatic driving at night and video surveillance and helps to improve the performance of tasks such as object detection and recognition in low-light environments.

XUAN
TAILIN HAN received the B.Eng. degree in applied electronic technology, in 1991, and the M.Eng. and Ph.D. degrees in physical electronics, in 1999 and 2010, respectively. He is currently working as a Professor and a Doctoral Supervisor with the School of Electronic Information Engineering, Changchun University of Science and Technology. He presided over and participated in many national and provincial projects. He has published several scientific papers in national and international conference proceedings and journals, including two books, 24 papers, and 14 patents. His current research interests include special signal acquisition and processing, and digital image processing. He has received the title of the 14th Batch of Young and Middle-Aged Professionals and Technical Talents
YU TIAN received the bachelor's degree, in 2020. She is currently pursuing the master's degree with the Changchun University of Science and Technology. Her research interests include photoelectric detection technology and quality control.
BO XU received the B.Eng. degree in communication engineering, in 2016. She is currently pursuing the Ph.D. degree in information and communication engineering with the Changchun University of Science and Technology, China. Her research interests include special signal processing, digital signal processing, dynamic signal detection and processing, and transient signal compressed sensing.
MINGCHI JU received the B.Sc. degree in electronic information science and technology, in 2018. He is currently pursuing the Ph.D. degree in information and communication engineering with the Changchun University of Science and Technology, China. His research interests include special signal processing, digital signal processing, and transient signal compressed sensing.