LLISP: Low-Light Image Signal Processing Net via Two-Stage Network

Images taken in extremely low light suffer from various problems such as heavy noise, blur, and color distortion. Assuming that low-light images contain a good representation of the scene content, current enhancement methods focus on finding a suitable illumination adjustment but often fail to deal with heavy noise and color distortion. Recently, some works have tried to suppress noise and reconstruct low-light images from raw data. However, these works apply a single network in place of an image signal processing pipeline (ISP) to map the raw data to enhanced results, which places a heavy learning burden on the network and yields unsatisfactory results. To remove heavy noise, correct color bias, and enhance details more effectively, we propose a two-stage Low-Light Image Signal Processing Network named LLISP. The design of our network is inspired by the traditional ISP: processing the images in multiple stages according to the attributes of different tasks. In the first stage, a simple denoising module is introduced to reduce heavy noise. In the second stage, we propose a two-branch network to reconstruct the low-light images and enhance texture details. One branch aims at correcting color distortion and restoring image content, while the other branch focuses on recovering realistic texture. Experimental results demonstrate that the proposed method can reconstruct high-quality images from low-light raw data and replace the traditional ISP.


I. INTRODUCTION
Typically, captured raw sensor data is processed by an in-camera image signal processing pipeline (ISP) to generate JPEG-format images. The key steps in the ISP include ISO gain, denoising, demosaicing, detail enhancing, white balance, color manipulation, and color mapping. The quality of these JPEG-format images is very important both for daily life and for many computer vision tasks, e.g., video surveillance, segmentation, and object detection [1], [2]. However, images captured in low-light environments suffer from various problems such as heavy noise, color distortion, and blur. These problems are aggravated by quantization, clipping, and other processing in the traditional ISP. High ISO, a large aperture, or a long exposure time can be used to brighten the images, but they also have drawbacks, for example, amplified noise or inevitable blur.
The associate editor coordinating the review of this manuscript and approving it for publication was Orazio Gambino.
Researchers have proposed many techniques to restore low-light images. Retinex [3], [4] and histogram equalization [5] are traditional methods to brighten images. Due to the lack of content understanding, these methods may produce unnatural results. Recently, deep learning-based approaches have shown superior performance in image enhancement. Some methods [6], [7] directly brighten low-light images without special consideration of noise or blur. Other methods focus on challenges related to low-light image enhancement, such as denoising [8], [9], demosaicing [10], deblurring [11], and multi-exposure image fusion [12], [13]. However, these methods still cannot produce high-quality enhanced images for the following reasons. First, most low-light enhancement methods cannot handle images taken in extremely dark conditions that contain severe noise and color degradation. Under these conditions, JPEG-format images cannot provide enough information due to the information loss in the traditional ISP. Moreover, heavy noise often leads to inaccurate white balance and blurred results. Second, sequentially denoising, deblurring, and correcting color bias may accumulate errors. Hence, we need an effective method that can operate directly on raw sensor data and produce pleasing enhanced images.
In this paper, we propose a Low-Light Image Signal Processing Network (LLISP) to address the extremely low-light enhancement problem. As the traditional ISP cannot work well in such conditions, we reconstruct the images directly from raw sensor data to avoid further information loss. Inspired by the traditional ISP, we first use a U-net-based module [14] to remove noise, as heavy noise is one of the most challenging problems in dark conditions and also influences detail enhancement and white balance. Then, a two-branch network is proposed to reconstruct images and refine textural details simultaneously. Specifically, different network architectures are used in the two branches. The reconstruction branch aims at correcting color distortion and restoring image content. Hence, we use a U-net [14] to learn high-level features. The enhancing branch aims at recovering texture and focuses on detailed information. In this branch, the resolution of the features is not reduced, in order to preserve structural integrity, and dilated convolution [15] is applied to enlarge the receptive field.
In summary, we make the following contributions:
• We propose a novel two-stage low-light enhancement net which can directly brighten extremely low-light images from raw data and replace the traditional ISP. The proposed method inherits the benefits of both the end-to-end network and the traditional multistage ISP.
• A two-stream structure is presented in the second stage, which consists of a reconstruction branch and a texture enhancing branch. The reconstruction branch restores images from both the original input and pre-denoised features. The texture enhancing branch utilizes gradient information to reduce artifacts and enhance details.
• Experimental results demonstrate that, to enhance extremely dark images, a pre-denoising module is indispensable and can improve the robustness of the proposed method.
The rest of the paper is organized as follows. Section II briefly introduces the related works. Section III describes the proposed method in detail. Experimental results are shown in Section IV. Finally, Section V concludes this paper.

II. RELATED WORK
Low-light image enhancement has a long history and covers many aspects, such as denoising and demosaicing. We provide a short review of previous work closely related to our task.

A. LOW-LIGHT IMAGE ENHANCEMENT
Classic approaches can be roughly divided into two main categories: histogram equalization (HE) [16]-[18] and gamma correction (GC) [19]. These methods ignore the relationship between individual pixels and their neighbors. As a result, they often produce artifacts and compromised aesthetic quality. Another technical line is based on the Retinex theory [4], [20]-[22], which decomposes the image into two components, i.e., reflectance and illumination, and enhances the illumination component. However, a global adjustment tends to over- or under-enhance local regions. To further improve the adaptability of enhancement and avoid local over- or under-enhancement due to uneven illumination, Wang et al. [23] enhance the image via multi-scale image fusion. Unfortunately, these approaches still cannot handle heavy noise and color bias. Besides, the lack of understanding of the image content causes unnatural enhancement.
Deep learning-based methods perform more global analysis and try to understand image content. Some works use paired data to learn the mapping function from low-light images to high-quality outputs [6], [24], [25]. Other works use unpaired data to train the models, which removes the need to collect paired data [7]. However, these approaches generally assume that the images do not suffer from heavy noise and color distortion. As a consequence, under extremely low-light conditions, they may either amplify the noise together with the scene details or fail to restore the visibility of low-light images. Compared with these methods, our LLISP brightens the image while preserving the inherent color and details via a proper image processing pipeline and efficient utilization of the raw data.
More recently, some approaches [26]-[28] use neural networks to replace the traditional ISP and directly reconstruct high-quality images from raw data. By using raw data, they avoid the information loss caused by the traditional ISP. However, these works tend to learn the ISP pipeline as a black box, which increases the learning burden of the networks and causes inefficient utilization of data. Different from those approaches, our LLISP pays more attention to modeling a proper image processing pipeline and making full use of the raw data.

B. IMAGE DENOISING METHODS
Image denoising is a hot topic in low-level vision tasks and is essential for further image processing. Classic approaches [8], [9] use specific priors of natural clean images such as pixel-wise smoothness and non-local similarity. Recently, deep convolutional neural networks have led to significant improvements in denoising. Some works focus on applying effective network structures to learn the mapping between noisy images and clean images, e.g., auto-encoders [29], residual blocks [30], and non-local attention blocks [31]. Other works focus on simulating realistic noise models for better performance on real-world denoising tasks [30].
In our work, we adopt a simple but effective pre-denoising module so that severe noise does not disrupt the subsequent enhancement.

C. IMAGE SIGNAL PROCESSING PIPELINE
In order to reconstruct the images from raw data more accurately, it is necessary to understand the in-camera ISP. The typical ISP in everyday cameras includes ISO gain, denoising, demosaicing, detail enhancing, white balance, and color manipulation, followed by mapping the data to the sRGB color space and finally saving to file. There are many classical approaches for the above steps [32]. Recently, many deep learning-based methods have been proposed and outperform those classical approaches. Some works focus on applying convolutional neural networks (CNNs) to specific steps in the ISP, such as demosaicing [10] or white balance [33]. Other works [26], [34] use deep learning models to replace the entire ISP pipeline. In this paper, we propose a deep network to replace the entire ISP for low-light image reconstruction. Inspired by the typical ISP, the proposed net also adopts a multi-stage enhancement strategy.

III. METHOD
The proposed LLISP aims at removing noise, correcting color bias, and reconstructing high-quality images from raw data. As illustrated in Fig. 1, the proposed LLISP network consists of two components: a Denoising Module (DNM) and an Enhancement Net (EN).

A. DATA PREPARING
In the training stage, four types of data are used, i.e., low-light raw data ($I_{raw}$), the amplification ratio $k$, ground truth raw data ($GT_{raw}$), and ground truth sRGB data ($GT_{sRGB}$). The data can be collected from commonly used digital cameras or smartphones. In our experiment, we use the SID dataset [26], which consists of raw short-exposure images and the corresponding long-exposure images in both raw and RGB format. The corresponding exposure time for these images is also provided in the dataset. Following SID [26], the amplification ratio $k$ is set to the exposure ratio between the reference and input images (e.g., ×100, ×250, or ×300) for both training and testing. We scale the low-light raw data ($I_{raw}$) by the desired amplification ratio $k$ to get the inputs ($I^{*}_{raw}$) for our LLISP. In the testing phase, $k$ can also be specified by users.
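The scaling step above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the released implementation: the function names are hypothetical, the raw values are assumed to be already normalized to [0, 1], and clipping the amplified values to 1 is an assumption that the text does not state.

```python
def amplification_ratio(short_exposure_s, long_exposure_s):
    """Ratio k between the reference (long) and input (short) exposure times."""
    return long_exposure_s / short_exposure_s


def scale_raw(raw, k, max_value=1.0):
    """Multiply normalized raw values by k.

    Clipping to max_value is an assumption made for this sketch.
    """
    return [[min(v * k, max_value) for v in row] for row in raw]


# Toy 2x2 "raw" patch with normalized intensities (illustrative values).
raw = [[0.001, 0.002], [0.004, 0.02]]
k = amplification_ratio(0.1, 10.0)  # 1/10 s input, 10 s reference
scaled = scale_raw(raw, k)
print(k, scaled)
```

At test time the same `scale_raw` call would simply receive a user-chosen `k` instead of the exposure ratio.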

B. STAGE I: DENOISING MODULE
Denoising is essential in the image processing pipeline, especially for low-light images that suffer from heavy noise. Because heavy noise significantly influences subsequent processes, e.g., deblurring, white balance, and color mapping, we put the DNM in the first stage to obtain relatively clean data and reduce the difficulty for the following stages. Formally, given the scaled low-light raw inputs ($I^{*}_{raw}$), we generate clean raw data ($C_{raw}$) as
$$C_{raw} = \mathrm{DNM}(I^{*}_{raw}) \tag{1}$$
The architecture of this module can be seen in Table 1.
The commonly used U-net [14] is selected as the backbone of the DNM for its effectiveness in denoising tasks. The input and output channels are set to 4 to suit raw data. As a trade-off between efficiency and restoration performance, the kernel size is set to (3, 3) following SID [26]. Considering that, in extremely low-light conditions, even the long-exposure ground truth data still contains noise, we add $Loss_{TV}$ in addition to the pixel-wise $Loss_{L1}$ to further smooth the denoised output. $Loss_{L1}$ is defined as the $\ell_1$ distance between the output of the denoising module and the ground truth raw data:
$$Loss_{L1} = \| C_{raw} - GT_{raw} \|_1 \tag{2}$$
$Loss_{TV}$ is defined as a total variation regularizer that constrains the smoothness of the outputs (3), where $\nabla_h$ and $\nabla_v$ denote the gradients along the horizontal and the vertical directions. The total loss function for the DNM is defined as
$$Loss_{DNM} = \alpha_1 Loss_{L1} + \alpha_2 Loss_{TV} \tag{4}$$
We empirically set $\alpha_1 = 1$, $\alpha_2 = 0.05$. Note that the DNM is first pre-trained with $GT_{raw}$ and then fixed during the training stage of the following module.
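To make the combined objective concrete, the following pure-Python sketch evaluates a weighted sum of an L1 term and a total variation term on a single-channel toy image. The exact form of the total variation regularizer is not recoverable from the text, so the anisotropic version used here (mean absolute forward differences) and the mean normalization are assumptions; only the weights 1 and 0.05 come from the paper. The function names are hypothetical.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two 2-D arrays (lists of rows)."""
    n = sum(len(row) for row in pred)
    total = sum(abs(p - t)
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / n


def tv_loss(img):
    """Assumed anisotropic TV: absolute forward differences, averaged per pixel."""
    h, w = len(img), len(img[0])
    horiz = sum(abs(img[y][x + 1] - img[y][x])
                for y in range(h) for x in range(w - 1))
    vert = sum(abs(img[y + 1][x] - img[y][x])
               for y in range(h - 1) for x in range(w))
    return (horiz + vert) / (h * w)


def dnm_loss(pred, target, alpha1=1.0, alpha2=0.05):
    """Weighted sum of the L1 and TV terms, with the weights given in the text."""
    return alpha1 * l1_loss(pred, target) + alpha2 * tv_loss(pred)


pred = [[0.0, 1.0], [0.0, 1.0]]    # toy denoised output with a vertical edge
target = [[0.0, 0.0], [0.0, 0.0]]  # toy ground truth
print(dnm_loss(pred, target))
```

Because the TV term only depends on the prediction, it penalizes residual noise even where the long-exposure reference is itself noisy, which matches the motivation given above.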

C. STAGE II: ENHANCEMENT NET
After obtaining pre-denoised raw data from the DNM, the EN aims at mapping the raw data to the final sRGB outputs, which corresponds to the processes that need global information in the traditional ISP, as shown in Fig. 2. To produce high-quality outputs, the EN consists of two branches, i.e., the Reconstruction Branch (RB) and the Texture Enhancing Branch (TEB).

1) RECONSTRUCTION BRANCH
The RB is responsible for global color mapping, which is similar to the white balance and color space mapping steps in the traditional ISP. The architecture of the RB net can be seen in Fig. 1(b). For accurate color mapping, a global understanding of the whole image is required. The U-net architecture, which has a large receptive field, is used to extract high-level features. Specifically, to avoid checkerboard artifacts, we use bilinear interpolation for upsampling. Considering the loss of details caused by the denoising module, we input the original images and the denoised images together to this branch to get the reconstruction features ($RB_{feature}$). The input channel is set to 8 and the output channel is set to 12. Formally:
$$RB_{feature} = \mathrm{RB}([I^{*}_{raw}, C_{raw}]) \tag{5}$$
where [,] denotes the channel-wise concatenation operation.
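The channel counts in this branch follow directly from the concatenation: two 4-channel raw tensors stacked along the channel axis give the 8 input channels. The sketch below shows this bookkeeping with nested lists in (channels, height, width) order; the helper names and the toy sizes are hypothetical, and treating the raw data as 4 packed color planes is an assumption based on the 4-channel DNM input.

```python
def make_tensor(channels, height, width, value=0.0):
    """Build a (C, H, W) nested list filled with a constant value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(a, b):
    """Channel-wise concatenation of two (C, H, W) nested lists."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("spatial sizes must match")
    return a + b


scaled_raw = make_tensor(4, 2, 3, 0.5)  # stands in for the scaled raw input
denoised = make_tensor(4, 2, 3, 0.4)    # stands in for the DNM output
rb_input = concat_channels(scaled_raw, denoised)
print(len(rb_input), len(rb_input[0]), len(rb_input[0][0]))  # prints "8 2 3"
```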

2) TEXTURE ENHANCING BRANCH
The TEB aims at reducing artifacts and preserving high-frequency details which may be ignored in the RB net. The architecture of this branch can be seen in Fig. 1(c). In this branch, we use dense connection [8] and dilated convolutions [15] to make full use of multi-scale features and keep a large receptive field. Instead of using the denoised images as input, we simply calculate the gradients of the denoised images as inputs ($I_{TEB}$), as defined in (6), where $\nabla_h$ and $\nabla_v$ denote the gradients along the horizontal and the vertical directions, respectively. The input channel for the TEB net is set to 4 and the output channel is set to 12. Formally, the output of the TEB can be written as (7).

3) FUSION AND DEMOSAICING
After concatenating the features generated from the above two branches, we use convolution layers and a sub-pixel layer [35] to fuse them and up-sample the data to the original resolution. The final output $O_{RGB}$ is written as (8), where [,] denotes the channel-wise concatenation operation. We train the Enhancing Net using the $\ell_1$ distance between $O_{RGB}$ and $GT_{sRGB}$, defined as (9).

IV. EXPERIMENTS

A. DATASET
We adopt the Sony set in [26]. This set is captured by a Sony α7S II. It includes 2697 raw short-exposure images and 231 long-exposure images. The resolution of the images is 4280 × 2832. The exposure time for the low-light images is set between 1/30 and 1/10 second, and the corresponding long-exposure ground truth images are captured with exposures 100 to 300 times longer. We use the same training and testing split as [26]. In their public dataset, approximately 20% of the images with different exposure times are selected to form the test set.

B. IMPLEMENTATION DETAILS
Our proposed framework is implemented with PyTorch, and an Nvidia TITAN-V GPU is used in the experiments. The architecture of the denoising module is listed in Table 1, and the architecture of the enhancement net can be seen in Fig. 1. We train the denoising module with a learning rate of $10^{-4}$ for 2k epochs. Then, we fix the weights of the denoising module and train the Enhancing Net for 3k epochs using the Adam [36] optimizer. The learning rate is set to $10^{-4}$ and is reduced to $10^{-5}$ after 1500 epochs. We randomly crop 512 × 512 patches for training and apply random flipping and rotation for data augmentation. Following Chen et al. [26], we subtract the black level and divide by the maximal pixel value to map the data between 0 and 1. Training the whole net takes 30 hours, of which about 10 hours are used for pre-training. It takes about 0.5 s to process one full-resolution image (4280 × 2832). Our code is available at https://github.com/Aacrobat/LLISP.
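Two of the details above, the raw normalization and the step learning-rate schedule, are simple enough to sketch without any framework. The code below is illustrative only: the function names are hypothetical, the example black level (512) and white level (16383) are assumed values for a 14-bit sensor rather than numbers stated in the paper, dividing by the range above the black level is one common reading of the normalization, and the exact epoch index at which the rate switches is an assumption.

```python
def normalize_raw(raw, black_level, white_level):
    """Subtract the black level and rescale so values fall in [0, 1].

    Negative values after subtraction are clamped to 0 (an assumption).
    """
    scale = float(white_level - black_level)
    return [[max(v - black_level, 0.0) / scale for v in row] for row in raw]


def learning_rate(epoch, base_lr=1e-4, decayed_lr=1e-5, decay_epoch=1500):
    """Step schedule: base_lr before decay_epoch, decayed_lr from then on."""
    return base_lr if epoch < decay_epoch else decayed_lr


# Illustrative 14-bit sensor values; 512 and 16383 are assumed levels.
patch = [[512, 16383], [400, 8447.5]]
print(normalize_raw(patch, black_level=512, white_level=16383))
print(learning_rate(0), learning_rate(1500))
```

In a PyTorch training loop the same schedule would typically be expressed with a built-in scheduler; the plain function is used here only to keep the sketch self-contained.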

C. AMPLIFICATION RATIO k
The amplification ratio determines the brightness of the outputs. In our network, we first scale the low-light raw data by the desired amplification ratio. This is similar to the ISO gain in cameras. During the training stage, the amplification ratio is set to the ratio between the exposure times of the ground truth images and their inputs. During the test stage, users can adjust the brightness of the output images by setting different amplification factors. In Fig. 4, we show the effect of the amplification factor on images captured by smartphones. By choosing different amplification ratios, we can test the amplification range in which our method produces high-quality results. Images with different exposure times and different amplification ratios are fed into the network. As shown in Fig. 5, longer exposure times and smaller amplification ratios produce better results. Our method can reconstruct high-quality results with an amplification ratio of up to 100. However, the enhanced results with an amplification ratio of 300 still suffer from color bias and blur.

D. QUALITATIVE EVALUATION
We first compare our model with the traditional ISP. We use the in-camera auto-brightening to brighten the dark inputs. As we can see in Fig. 3(a,i), in extremely dark conditions, the traditional ISP breaks down. Most existing low-light enhancement methods [6], [7], [37] only focus on adjusting illumination without considering noise and other degradations. As can be seen in Fig. 3(b-d,j-l), heavy noise and color bias seriously spoil the enhanced results. Applying an existing denoising algorithm [9] to the enhanced images does not produce promising results either, which can be seen in Fig. 3(f,n). Taking heavy noise into consideration, Chen et al. [26] and our method start from raw data and get much better results. Compared with Chen et al., our method can correct color distortion accurately and suppress artifacts.
Since previous methods designed for JPEG-format images cannot handle extremely dark images, we mainly compare with Chen et al. [26] to show our improvements in detail. As can be seen in Fig. 6a, heavy noise makes it easy to produce artifacts during the enhancement. Owing to the denoising module and the texture enhancing branch, we can reduce artifacts during enhancement and produce more realistic images. Fig. 6b and Fig. 6c show that our method can correct color bias and preserve details.
As shown in Fig. 7, we test our model on three common cameras. We can see that, when there is a domain gap between training and testing data, our two-stage model has a stronger generalization ability. By using the denoising module, we can get clearer results (the third row of Fig. 7) and eliminate the influence of noise on white balance (the first row of Fig. 7). Thanks to our effective two-branch enhancing module, our results preserve more details (the second row of Fig. 7).

E. QUANTITATIVE EVALUATION
In this section, we compare our approach with the state-of-the-art methods [6], [7], [26], [28], [37]-[39]. We also apply the existing denoising method BM3D [9] post hoc to the results produced by LIME [37]. Besides, a baseline that simply duplicates the U-net is introduced. The first U-net learns to denoise the low-light raw data, and the second U-net learns to map raw data to sRGB outputs.
Table 2 reports quantitative results for different low-light enhancing methods. As the first five rows show, the traditional ISP cannot handle extremely dark scenes. Using the spoiled sRGB images produced by the traditional ISP as inputs, most existing enhancing methods cannot remove heavy noise and color bias. It is necessary to begin with raw data and suppress the heavy noise. Our baseline outperforms CAN and Chen et al., which means that simply denoising the data before enhancing it is very helpful for extremely low-light image enhancement tasks. Thanks to our effective two-branch Enhancement Net, we further improve the accuracy from 29.18/0.815 to 29.68/0.832 with respect to PSNR and SSIM. We also employ the LPIPS metric [40] to measure perceptual distance. A higher distance means the images are more different, and a lower distance means they are more similar. As we can see from Table 2, in terms of SSIM and LPIPS, our proposed method outperforms the state-of-the-art methods by a large margin. The experimental results demonstrate that we achieve state-of-the-art results in both pixel-wise distance and perceptual similarity.
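For readers less familiar with the pixel-wise metrics in Table 2, the sketch below computes PSNR and MAE for single-channel images stored as nested lists. It assumes intensities normalized to [0, 1] (so the peak value is 1); the evaluation code used for the paper may differ in color handling and data range, and SSIM and LPIPS are not reproduced here.

```python
import math


def mean_squared_error(a, b):
    """Mean squared difference between two 2-D arrays (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n


def mean_absolute_error(a, b):
    """Mean absolute difference between two 2-D arrays (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n


def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mean_squared_error(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


reference = [[0.0, 0.0], [0.0, 0.0]]
estimate = [[0.1, 0.1], [0.1, 0.1]]
print(psnr(reference, estimate), mean_absolute_error(reference, estimate))
```

Since PSNR is logarithmic in the mean squared error, a gain of 0.5 dB such as the one reported above corresponds to roughly an 11% reduction in mean squared error.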

F. ABLATION STUDY
Ablation experiments are performed in order to have a better understanding of our model and prove the indispensability of each module.

1) DENOISING MODULE
In this part, we show the importance of the DNM and compare the impact of different architectures and loss functions for this module. A single network can theoretically complete denoising and color space conversion at the same time. However, heavy noise hinders accurate color reconstruction, and it is difficult for a network to optimize both tasks at the same time. Learning denoising and color reconstruction in separate stages improves the final accuracy. In the second row of Table 3, we use the state-of-the-art denoising model RNAN [31] and retrain it on our dataset for denoising. However, due to the large memory consumption of the non-local module, we have to split the input images into blocks, which results in uneven brightness and poor results. Note that although the addition of the TV regularization term leads to a higher $\ell_1$ error between the denoised images and the corresponding ground truths in the denoising stage, the images smoothed by the TV loss help the subsequent enhancement and thus yield better results.

TABLE 3. Ablation study on the denoising module. The results are in terms of PSNR/SSIM. We also compare the L1 distance between denoised images and corresponding ground truths in the denoising stage. The best results are highlighted in bold.

2) TEXTURE ENHANCING BRANCH
In this part, we show the indispensability of the TEB and compare different types of inputs for this branch. An interesting result is shown in the third row of Table 4. If we input the original images into the TEB, the final results are even worse than when this branch is removed, which indicates that the improvement from this branch comes not from the increased number of parameters but from a more reasonable utilization of gradient features. We have also tried using a simple edge detection algorithm such as Canny to extract the edges of the denoised images and feed them to the network. However, the edge detection algorithm ignores texture details and only retains edge information, which is not conducive to texture enhancement and artifact removal.
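The gradient inputs compared in this ablation can be illustrated with simple finite differences. The sketch below computes horizontal and vertical forward differences of one image channel; the forward-difference operator and the zero padding at the border are assumptions, and the way the two directions are merged into the 4-channel TEB input is not specified in the text, so it is left out.

```python
def horizontal_gradient(img):
    """Forward difference along x; the last column is zero to keep the shape."""
    return [[row[x + 1] - row[x] if x + 1 < len(row) else 0.0
             for x in range(len(row))]
            for row in img]


def vertical_gradient(img):
    """Forward difference along y; the last row is zero to keep the shape."""
    h, w = len(img), len(img[0])
    return [[img[y + 1][x] - img[y][x] if y + 1 < h else 0.0
             for x in range(w)]
            for y in range(h)]


# One toy channel: a textured top row above a flat bottom row.
channel = [[0.0, 1.0, 3.0],
           [2.0, 2.0, 2.0]]
print(horizontal_gradient(channel))
print(vertical_gradient(channel))
```

Unlike a binary edge map, these responses keep the magnitude of every local intensity change, including weak texture, which is consistent with the observation above that Canny edges discard information the branch needs.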

3) RECONSTRUCTION BRANCH
As shown in Table 5, due to the loss of details caused by the denoising process, putting the original images and the denoised images into the network together can obtain better results.

V. CONCLUSION
In this paper, we present a novel low-light enhancement method, LLISP. Inspired by the traditional ISP, our network first focuses on image denoising and then completes the other image processing steps with a two-branch enhancement net. Extensive experiments demonstrate the effectiveness and indispensability of the different modules of the network. The proposed method is applicable not only to the training dataset but also to raw data captured by different devices.

FIGURE 1. The architecture of our proposed LLISP. Our proposed LLISP consists of two stages. The first stage is responsible for denoising. In the second stage, the divide-and-conquer network is responsible for producing high-quality images in the sRGB color space. The image reconstruction branch takes denoised raw data and original raw data as input to reduce color bias and recover image content. Using gradient information as input, the texture enhancing branch pays more attention to texture details and cooperates with the reconstruction branch to generate images with fewer artifacts.

FIGURE 2. The key steps in the traditional image processing pipeline.Although different cameras may apply different algorithms in the detail enhancing step, most of them use frequency filters to decompose the signal into different layers.

FIGURE 3. Qualitative results of state-of-the-art methods and our proposed LLISP evaluated on the SID test set. As we can see, the traditional ISP breaks down in extremely dark conditions, and most existing enhancing methods cannot reconstruct images successfully. Focusing on severe noise and the extremely dark conditions, both Chen et al. [26] and our method get much better results. Compared with Chen et al., our method can correct color distortion accurately and suppress artifacts.

FIGURE 4. The effect of different amplification ratios on the same images captured by smartphones.

FIGURE 5. The effect of different amplification ratios on images with different exposure times. The images were chosen from the SID test set.

FIGURE 6. Qualitative results for our proposed LLISP.As we can see, our method can accurately reconstruct low-light images.

TABLE 1. The architecture of the denoising module.

TABLE 2. Quantitative evaluation of low-light image enhancement algorithms in terms of PSNR/SSIM/MAE/NIQE/LPIPS. The best results are highlighted in bold. Note that a * indicates that we use the PSNR, SSIM and LPIPS values reported in their original papers.

TABLE 4. Ablation study on the texture enhancing branch. The results are in terms of PSNR/SSIM. The best results are highlighted in bold.

TABLE 5. Ablation study on the reconstruction branch. The best results are highlighted in bold.