Very Deep Learning-Based Illumination Estimation Approach With Cascading Residual Network Architecture (CRNA)

For the image signal processing (ISP) pipeline of digital imaging devices, it is highly important to remove undesirable illuminant effects and obtain color invariance, commonly known as ‘computational color constancy’. Achieving computational color constancy requires two phases: illumination estimation, which is the primary focus of this work, and human visual perception-based chromatic adaptation. In the first phase, illumination estimation predicts RGB triplets, the numeric representations of incident illuminant colors, from the values of image pixels. The accuracy of this estimate is key to realizing computational color constancy. With recent advances in deep learning (DL), many DL-based approaches have been suggested, bringing higher accuracy to computer vision applications, but obstacles such as learning instability remain. To address this ill-posed problem in illumination estimation, this article presents a novel DL-based approach, the Cascading Residual Network Architecture (CRNA), which incorporates the ResNet and a cascading mechanism into the deep convolutional neural network (DCNN). The cascading mechanism restrains the proposed network from sudden variations in size, mitigates learning instability, and accordingly reduces quality degradation. This is attributed to the ability of the cascading mechanism to fine-tune the pre-trained DCNN. Comparative experiments on several benchmark datasets show that the proposed approach delivers more stable and robust results and suggest its potential for generalization across deep learning applications.


I. INTRODUCTION
In digital photography, images may carry undesired color casts caused by an unintended source illuminant in the scene. A good way to understand this unwanted illuminant effect is to take multiple photos of the same scene with the same camera under varying illuminant conditions. The resulting images depict the same scene but in varying colors. The imaging model [1] calculates the pixel values from three factors: the spectrum of the source illuminant, the reflectance of the object surface and the spectral sensitivity of the camera sensor. The latter two factors are kept constant by shooting the same scene with the same camera, whereas the spectrum of the source illuminant varies with the illuminant conditions. This shows that a digital camera records the effect of any incident source illuminant but cannot detect the illuminant itself. Therefore, it is a top concern to build illuminant invariance into the ISP pipeline of digital cameras, based on the concept of human visual perception-based (HVP) color constancy [2]. Color constancy has been verified to be of great use in a wide range of image processing applications such as object recognition, scene comprehension and image reproduction [3]. Achieving computational color constancy requires two phases: illumination estimation, which is the primary focus of this work, and HVP-based chromatic adaptation. A network model predicts the colors of the scene illuminant from the pixel values of the image, and the estimated undesirable illuminant effect is then removed.
The associate editor coordinating the review of this manuscript and approving it for publication was Jeon Gwanggil.
Over the past several decades, much effort has gone into achieving more accurate illuminant estimation in image processing, and many methods have been proposed. They are largely classified into statistics-based and learning-based categories. The statistics-based approaches start from the assumption that the illuminant is uniform across the scene. They include WP (White-Patch) [4], [5] and its advanced versions [6]-[8], and the gray world assumption-based approaches such as GW (Gray-World) [9], SoG (shades of gray) [10], GE (gray edge) [11] and WGE (weighted gray edge) [12]. These approaches have the benefit of fast computation but, unfortunately, limited estimation accuracy. Meanwhile, the learning-based approaches adopt learning models for illuminant color estimation. Recently, more and more learning-based approaches employ deep learning for illumination estimation and thereby improve their accuracy. However, a drawback of these approaches is that they demand massive training datasets. In addition, the learning-based approaches require an increasing amount of computational resources and more and more complicated architectures. Yet, the latest DL approaches present shallow architectures that consist of only a few convolution layers and fully connected layers [13], [14]. A content-based CNN is proposed in refs. [15]-[17].
Refs. [18] and [19] address the illumination classification problem with the use of DL. Ref. [20] proposes two DCNN architectures and selects the better of the two through comparative studies. Other learning-based approaches include Bayesian learning [21], color moments [22], gamut mapping [23]-[25], spatial localization [26], [27], Choi's illuminant estimation approaches [28]-[30] and others [31], [32]. In short, the literature shows that the learning-based approaches outperform their statistics-based counterparts in terms of estimation accuracy. Many studies also report that deeper network architectures yield higher performance.
Motivated to address the remaining challenges and to advance inference accuracy, this article presents the Cascading Residual Network Architecture (CRNA), which builds the cascading mechanism and the residual network [33] into the DCNN. The residual network is commonly used in the DCNN. In the proposed CRNA, the key strategy is to raise estimation accuracy to the highest possible level and remove the undesired illuminant effect. Estimating and removing the undesired illuminant is referred to as white balance or HVP-based chromatic adaptation. What differentiates the proposed method from conventional DCNN architectures is the cascading mechanism, which restrains the network from sudden variations in size, thereby mitigating learning instability and reducing quality degradation. This is attributed to the ability of the cascading mechanism to fine-tune the pre-trained DCNN. The core contributions of this article are as follows:
1) Proposing a novel deep learning-based approach, the CRNA, which takes estimation accuracy to a higher level by building the ResNet and cascading mechanism into the network model.
2) Addressing the instability problem and the quality degradation of conventional DCNNs by enabling the architecture to fine-tune the pre-trained DCNN.
3) Showing that the proposed approach achieves higher stability and robustness, and suggesting its potential for generalization across deep learning applications, as supported by comparative experiments on several benchmark datasets.

II. RELATED WORK
In digital imaging, the imaging model is built on the hypothesis of Lambertian reflectance. An image f is made up of pixels. Each pixel at position x has a red, green and blue triplet value, one per color channel c ∈ {R, G, B}. The value of channel c is obtained by integrating the product of the source illuminant spectrum e(λ, x), the surface reflectance R(λ, x) and the camera sensitivity ρ_c(λ) over the wavelengths λ of the visible spectrum ω, which goes as follows:
\[ f_c(x) = \int_{\omega} e(\lambda, x)\, R(\lambda, x)\, \rho_c(\lambda)\, d\lambda \tag{1} \]

A. ILLUMINATION ESTIMATION
Figure 1 illustrates how the proposed CRNA network performs illuminant estimation and chromatic adaptation. The proposed network predicts illuminant colors from the image pixels described by Eq. (1). The illuminant color is determined by the source illuminant spectrum e(λ) and the camera sensitivity ρ_c(λ), which goes as follows:
\[ e = (e_R\ e_G\ e_B)^T, \qquad e_c = \int_{\omega} e(\lambda)\, \rho_c(\lambda)\, d\lambda \tag{2} \]
Yet illumination prediction is a big challenge, given that both the illuminant spectral distribution e(λ) and the camera sensitivity ρ_c(λ) are unknown and beyond control.
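The imaging model of Eq. (1) can be illustrated numerically by replacing the integral with a sum over sampled wavelengths. The following is a minimal sketch under stated assumptions: the three-sample spectra, the sensitivity values and the function names are hypothetical and chosen only for illustration; they do not come from the paper or from any real camera.

```python
def pixel_response(illuminant, reflectance, sensitivity, step_nm=1.0):
    """Riemann-sum approximation of the integral of e * R * rho_c over wavelength."""
    if not (len(illuminant) == len(reflectance) == len(sensitivity)):
        raise ValueError("all spectra must be sampled at the same wavelengths")
    return step_nm * sum(
        e * r * s for e, r, s in zip(illuminant, reflectance, sensitivity)
    )


def rgb_pixel(illuminant, reflectance, sensitivities, step_nm=1.0):
    """Return the (R, G, B) value of one surface point under one illuminant."""
    return tuple(
        pixel_response(illuminant, reflectance, sensitivities[c], step_nm)
        for c in "RGB"
    )


# Hypothetical spectra sampled at 450, 550 and 650 nm (illustrative values only).
SENSITIVITIES = {
    "R": [0.0, 0.1, 0.9],
    "G": [0.1, 0.8, 0.1],
    "B": [0.9, 0.1, 0.0],
}
SURFACE = [0.4, 0.5, 0.6]
WHITE_LIGHT = [1.0, 1.0, 1.0]
REDDISH_LIGHT = [0.5, 1.0, 1.5]

# Same surface and same camera, two illuminants: only the illuminant changes.
pixel_white = rgb_pixel(WHITE_LIGHT, SURFACE, SENSITIVITIES, step_nm=100.0)
pixel_reddish = rgb_pixel(REDDISH_LIGHT, SURFACE, SENSITIVITIES, step_nm=100.0)
```

With the surface and the sensor held fixed, the two pixel values differ only because the illuminant spectrum differs, which is the color cast that illumination estimation has to recover.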

B. CHROMATIC ADAPTATION TRANSFORMATION
In color constancy, chromatic adaptation is another critical step, which removes the color cast and performs color adjustment. In this white balance process, the proposed approach uses the prevalent diagonal matrix [34], known as the von Kries diagonal matrix model [35]. A triplet pixel p = (p_R p_G p_B)^T multiplied by the diagonal matrix D equals a color-rendered pixel p̄ = (p̄_R p̄_G p̄_B)^T, which goes as follows:
\[ \bar{p} = D\, p \tag{3} \]
The diagonal matrix is usually represented as follows:
\[ D = \mathrm{diag}\!\left( \frac{\bar{e}_R}{e_R},\ \frac{\bar{e}_G}{e_G},\ \frac{\bar{e}_B}{e_B} \right) \tag{4} \]
where e = (e_R e_G e_B)^T refers to the undesired illuminant, and ē = (ē_R ē_G ē_B)^T means the desired illuminant or ground truth illuminant. The perfect white source illuminant ē = (1 1 1)^T serves as the ground truth illuminant in the proposed approach.
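The diagonal correction described above amounts to a per-channel gain. The sketch below shows it in plain Python, assuming the image is a nested list of RGB tuples and that the undesired illuminant has already been estimated; the function names and the sample values are illustrative and not taken from the paper.

```python
def von_kries_gains(undesired, desired=(1.0, 1.0, 1.0)):
    """Diagonal entries of D: desired illuminant divided by undesired illuminant."""
    if any(e <= 0 for e in undesired):
        raise ValueError("illuminant components must be positive")
    return tuple(d / e for d, e in zip(desired, undesired))


def adapt_pixel(pixel, gains):
    """Apply the diagonal matrix to one RGB pixel (channel-wise product)."""
    return tuple(g * p for g, p in zip(gains, pixel))


def white_balance(image, undesired, desired=(1.0, 1.0, 1.0)):
    """Correct every pixel of an image given the estimated undesired illuminant."""
    gains = von_kries_gains(undesired, desired)
    return [[adapt_pixel(p, gains) for p in row] for row in image]


# Illustrative example: an illuminant estimate and a 1 x 2 image.
estimated_illuminant = (0.8, 1.0, 0.5)
image = [[(0.8, 1.0, 0.5), (0.4, 0.5, 0.25)]]
corrected = white_balance(image, estimated_illuminant)
```

A pixel whose color equals the estimated illuminant is mapped to the white target, and a pixel that is half as bright is mapped to a neutral gray, which is the intended effect of the correction.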

III. THE PROPOSED CRNA APPROACH
The objective of the proposed CRNA approach is to estimate the illuminant accurately and eliminate undesired color casts efficiently. This section sets out what the proposed CRNA approach is and how it works. The proposed CRNA incorporates the ResNet and a cascading mechanism into the DCNN so that the network can go deeper and thereby improve performance. It realizes the cascading mechanism by embedding local and global cascading modules. The ResNet architecture is widely used in DCNNs because of its high learning efficiency and strong performance, but it has neither local nor global cascading modules. In the proposed CRNA network, the embedded local and global cascading modules serve to improve performance.
To explain how the proposed CRNA works, let f be a convolution function and τ(z) ≡ max(0, z) an activation function. In Eq. (5), H^i represents the output of the i-th residual block (RB), W_R^i refers to the set of parameters used in the i-th RB, and W_R^{i,j} represents the set of parameters used in the j-th convolution layer inside the i-th RB. In the proposed CRNA, two consecutive convolution layers form a residual block, and R^i denotes the i-th RB, defined in Eq. (5) following [36]. The input image goes through the RBs of Eq. (5) in sequence, reaches the final RB of the ResNet and produces the output feature map H^u. Each RB is followed by a single convolution layer f(X; W_c), which has the parameter W_c.
In general, ResNet does not have a cascading block, but the proposed CRNA incorporates the local cascading block (CB) into it. B^i is the output of the i-th RB and the input of the i-th local CB, and the i-th local CB has a set of parameters W_c^i. The output of the i-th local CB is B^{i,U}, which is computed in a recursive manner. In the end, the final CB produces the output feature map H^b, which represents a combination of both local and global cascading mechanisms. H^0 refers to the output of the first convolution layer, and the parameters are fixed at u = b = 3 throughout the network.
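Written out under stated assumptions, the definitions above admit the following form. This is a hedged sketch in the usual notation of cascading residual networks, not the authors' verified equations: the placement of the activation τ, the concatenation operator [·], the upper limits U and B, and the per-stage fusion parameters W_c^{i,u} are assumptions, and only the label (5) is taken from the text.

```latex
% Assumed form of the residual block: two convolutions plus a shortcut.
\begin{align}
R^{i}\!\left(H^{i-1}; W_R^{i}\right)
  &= \tau\!\left( f\!\left( \tau\!\left( f\!\left(H^{i-1}; W_R^{i,1}\right) \right); W_R^{i,2} \right) + H^{i-1} \right) \tag{5} \\
% Assumed plain ResNet output after u residual blocks.
H^{u}
  &= R^{u}\!\left( \cdots R^{1}\!\left( f(X; W_c); W_R^{1} \right) \cdots ; W_R^{u} \right) \\
% Assumed local cascading block: its output is the last recursive state.
B^{i}_{\mathrm{local}}\!\left(H^{i-1}; W_c^{i}\right)
  &\equiv B^{i,U}, \qquad B^{i,0} = H^{i-1}, \\
B^{i,u}
  &= f\!\left( \left[ B^{i,0}, \ldots, B^{i,u-1},\, R^{u}\!\left(B^{i,u-1}; W_R^{u}\right) \right]; W_c^{i,u} \right),
  \quad u = 1, \ldots, U \\
% Assumed global cascading over the local blocks.
H^{0} = f(X; W_c), \qquad
H^{b}
  &= f\!\left( \left[ H^{0}, \ldots, H^{b-1},\, B^{b}_{\mathrm{local}}\!\left(H^{b-1}; W_c^{b}\right) \right] \right),
  \quad b = 1, \ldots, B
\end{align}
```

Under this reading, the setting u = b = 3 corresponds to three residual blocks inside each local cascading block and three local cascading blocks in the network.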
The key difference between the proposed CRNA and the ResNet lies in whether the cascading mechanism exists or not. As indicated above, the proposed CRNA has both global and local cascading mechanisms. Figure 2 (a) shows how the global cascading connection works and Figure 2 (b) depicts how the local cascading blocks work to form the global cascading connection. The ResNet block and its extended version, shown in Figure 2 (c) and (d), respectively, are built into the local cascading block. Cascading at the local and global levels has two benefits: 1) The architecture can combine multiple-level features from multiple layers. 2) Multiple-level cascading connections serve as multiple-level shortcut connections that rapidly propagate information from lower to higher layers in the proposed CRNA. In brief, the local and global cascading mechanisms advance the proposed CRNA to a higher level of estimation accuracy and computational efficiency.
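The two cascading levels described above can be sketched with a toy model that works on the channel vector of a single pixel, where a 1 × 1 convolution reduces to a matrix-vector product. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the fixed averaging fusion weights (a stand-in for learned 1 × 1 convolutions) and the two-channel size are all hypothetical.

```python
def relu(v):
    """Activation tau(z) = max(0, z), applied per channel."""
    return [max(0.0, x) for x in v]


def conv1x1(v, weights):
    """1 x 1 convolution at one pixel: each output channel is a weighted sum."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(h, w1, w2):
    """Two convolutions with a shortcut connection (assumed block layout)."""
    out = conv1x1(relu(conv1x1(h, w1)), w2)
    return relu([o + x for o, x in zip(out, h)])


def averaging_fuse(channels, n_feats):
    """Stand-in for a learned fusion conv: average matching channels."""
    return [
        [1.0 / n_feats if j % channels == k else 0.0
         for j in range(channels * n_feats)]
        for k in range(channels)
    ]


def cascade(h0, stages, fuse_weights):
    """After each stage, concatenate all earlier outputs with the new one and fuse."""
    history = [h0]
    h = h0
    for stage, w in zip(stages, fuse_weights):
        new = stage(h)
        concat = [x for feat in history for x in feat] + new
        h = conv1x1(concat, w)
        history.append(h)
    return h


def make_local_block(w1, w2, channels, n_rb=3):
    """Local cascading: the stages are residual blocks."""
    stages = [lambda h: residual_block(h, w1, w2)] * n_rb
    fuse = [averaging_fuse(channels, u + 2) for u in range(n_rb)]
    return lambda h: cascade(h, stages, fuse)


def make_network(w1, w2, channels, n_cb=3):
    """Global cascading: the stages are local cascading blocks."""
    local = make_local_block(w1, w2, channels)
    fuse = [averaging_fuse(channels, b + 2) for b in range(n_cb)]
    return lambda h0: cascade(h0, [local] * n_cb, fuse)
```

The same `cascade` routine serves both levels, which mirrors the statement that the local blocks form the global connection: every fusion step sees the features of all earlier stages, so lower-level information reaches higher layers through short paths.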

IV. EXPERIMENTAL RESULTS AND EVALUATIONS
This section reports comparative experiments conducted to optimize parameters, select the better-performing residual block, compare the proposed approach with its latest learning-based competitors, and verify illuminant invariance and camera invariance. The experiments use three benchmark image datasets: the Gehler and Shi dataset [21], the NUS-8 camera dataset [37] and the Gray Ball dataset [38]. The NUS-8 camera dataset consists of 8 subsets of images captured by 8 different cameras, and each subset contains 210 images. The Gray Ball dataset includes 11,340 images that depict a variety of scenes, each with a gray ball in sight. The Gehler and Shi dataset is made up of 568 images of various indoor and outdoor scenes. Importantly, these images are taken under diverse source illuminants, and each image has a color checker in sight so that the effect of the illuminant condition on the image can be evaluated.
In optimizing parameters, the initial training rate (learning rate) is chosen for tuning because it has the most significant impact on the accuracy of illuminant estimation. To find the optimal value, several initial training rates are compared in an experiment on the Gehler and Shi dataset. Other parameters, including a weight decay of 5 × 10^-5 and a momentum of 0.9, are kept fixed. In the experiment, the images are resized to 512 × 512 pixels and the batch size is set to 16. The proposed CRNA is implemented in TensorFlow [39] and runs on a single NVIDIA Titan RTX GPU. Training runs for 10K iterations and takes 1.5 days. Figure 3 shows the comparative result of applying several initial training rates to the proposed CRNA: Figure 3 (a) plots the median angular error and Figure 3 (b) plots the average angular error at the different initial training rates. Both the median and the average angular errors are lowest at the initial training rate of 3 × 10^-4.
In deep learning, there is a general belief, supported by various studies and experiments, that deeper networks perform better. A natural motivation is to test this belief through an experiment. The next experiment is designed to determine which residual block performs better: that of Figure 2 (c) or that of Figure 2 (d). The residual block in Figure 2 (c) is called the bottleneck network and that of Figure 2 (d) is its extended version. Figure 4 shows the comparative result of implementing the residual block of Figure 2 (c) and its extended version of Figure 2 (d) in the proposed CRNA. The comparison is made in terms of median angular error and average angular error. The result challenges the common belief and demonstrates that going deeper does not always yield higher performance.
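The angular error used in these comparisons is not defined in the text. A common definition in the color constancy literature, assumed here, is the angle between the estimated and the ground truth illuminant vectors. The sketch below computes it in degrees; the function name is illustrative.

```python
import math


def angular_error_deg(estimate, ground_truth):
    """Angle in degrees between two illuminant RGB vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(estimate, ground_truth))
    norm = math.sqrt(sum(a * a for a in estimate)) * math.sqrt(
        sum(b * b for b in ground_truth)
    )
    if norm == 0.0:
        raise ValueError("illuminant vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))
```

Because the metric depends only on direction, scaling an estimate by a constant leaves the error unchanged, so it measures chromaticity error rather than brightness error.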
In the following experiment, the proposed CRNA is compared with its latest learning-based counterparts, using the Gehler and Shi dataset. Table 1 summarizes their respective average, median, trimean, best-25%, and worst-25% angular errors. Notably, Hu et al. [49] present a segment-wise illuminant estimation approach, known as FC4. Their approach uses a confidence map over image patches to estimate undesired source illuminant colors; the network architecture is meant to learn semantic information and build the confidence map. However, it smooths local regions of the image because of partially inaccurate local estimation. The inaccuracy occurs because the confidence map, which is supposed to mask and remove noise, mistakenly perceives small objects as noise and masks them out. To solve this inaccuracy problem, Choi and his colleague [28], [29] set forth novel approaches that bring the ResNet and the dilated convolution to the architecture. The two approaches surpass their latest competitors in estimation accuracy. Unfortunately, they have a long-term dependency problem that stems from the linear layer arrangement, in which the input of the current layer is the output of the previous layer. To address the long-term dependency challenge, Choi and Yun [30] come up with another novel approach called PMRN. In a continuing effort to advance the network to the next level, this work presents the proposed CRNA, which builds the cascading mechanism and ResNet into the DCNN. The key strategy behind the proposed CRNA is to adopt the cascading mechanism and thereby improve learning stability. The cascading mechanism serves to restrain the network model from sudden variations in size.
This cascading mechanism differentiates the proposed approach from conventional DCNN architectures by mitigating learning instability and accordingly reducing quality degradation. Concretely, this is attributed to the ability of the cascading mechanism to fine-tune the pre-trained DCNN. As a result, the proposed CRNA delivers state-of-the-art performance in the field of color constancy, as shown in Table 1. Figure 5 displays another experiment, which compares the proposed CRNA approach and its latest learning-based counterparts in terms of total training loss and the resulting convergence behavior. Training uses standard stochastic gradient descent (SGD). The proposed CRNA approach outperforms its latest learning-based counterparts by converging towards the lowest total training loss. Figure 6 plots the angular error distribution, comparing the proposed CRNA approach with some high performers from Table 1: SNet-FC4 (or SqueezeNet-FC4), CMoDE and PMRN. The proposed CRNA approach shows an overall lower angular error trend. Another experiment follows to demonstrate illuminant invariance. Using the Gray Ball dataset, the experiment compares the proposed CRNA approach and its learning-based counterparts in terms of mean, median, trimean, best-25%, and worst-25% angular errors. Table 2 summarizes the experimental result, in which the proposed CRNA approach reports the lowest angular error among its conventional counterparts. To verify camera invariance, the final experiment compares the proposed CRNA approach and its latest conventional counterparts on the NUS-8 camera dataset. Table 3 summarizes the experimental result, in which the proposed CRNA approach outperforms its counterparts regardless of the camera sensitivity.
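The summary statistics reported in Tables 1 to 3 can be computed from a list of per-image angular errors as sketched below. The definitions are the ones commonly used in color constancy benchmarks and are assumed here rather than taken from the paper: the trimean is (Q1 + 2 × median + Q3) / 4, and best-25% and worst-25% are the means of the lowest and highest quarter of the errors. The function name and the quartile method are illustrative choices.

```python
import statistics


def summarize_errors(errors):
    """Mean, median, trimean, best-25% and worst-25% of angular errors."""
    ordered = sorted(errors)
    if len(ordered) < 2:
        raise ValueError("need at least two errors to compute quartiles")
    # Quartiles with linear interpolation between data points.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    quarter = max(1, len(ordered) // 4)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "trimean": (q1 + 2.0 * q2 + q3) / 4.0,
        "best25": statistics.fmean(ordered[:quarter]),
        "worst25": statistics.fmean(ordered[-quarter:]),
    }


# Illustrative errors in degrees (not results from the paper).
summary = summarize_errors([8.0, 1.0, 6.0, 3.0, 5.0, 2.0, 7.0, 4.0])
```

Reporting the worst-25% alongside the mean and median shows how a method behaves on its hardest images, which is where robustness differences between estimators tend to appear.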

V. CONCLUSION
This article presents a novel approach to more accurate illuminant estimation that embeds the ResNet and a cascading mechanism into the DCNN. The cascading mechanism differentiates the proposed CRNA from conventional DCNN architectures: it restrains the network model from sudden variations in size and thereby addresses learning instability. It also serves to fine-tune the pre-trained DCNN and thereby helps reduce quality degradation. The proposed architecture benefits from cascading at both local and global levels in two respects. First, the architecture can combine multiple-level features from multiple layers. Second, multiple-level cascading connections serve as multiple-level shortcut connections that rapidly propagate information from lower to higher layers. In summary, the local and global cascading mechanisms advance the proposed CRNA to a higher level of accuracy and computational efficiency. Experiments on prevalent benchmark image datasets highlight the superiority of the proposed CRNA approach over its state-of-the-art conventional competitors. Nevertheless, it remains important to continue optimizing the DCNN architecture and to take computer vision to the next level.