UnShadowNet: Illumination Critic Guided Contrastive Learning For Shadow Removal

Shadows are frequently encountered natural phenomena that significantly hinder the performance of computer vision perception systems in practical settings, e.g., autonomous driving. A solution to this would be to eliminate shadow regions from the images before the processing of the perception system. Yet, training such a solution requires pairs of aligned shadowed and non-shadowed images which are difficult to obtain. We introduce a novel weakly supervised shadow removal framework UnShadowNet trained using contrastive learning. It is composed of a DeShadower network responsible for the removal of the extracted shadow under the guidance of an Illumination network which is trained adversarially by the illumination critic and a Refinement network to further remove artefacts. We show that UnShadowNet can be easily extended to a fully-supervised set-up to exploit the ground-truth when available. UnShadowNet outperforms existing state-of-the-art approaches on three publicly available shadow datasets (ISTD, adjusted ISTD, SRD) in both the weakly and fully supervised setups.


I. INTRODUCTION
Shadows are a common phenomenon in most natural scenes. They occur due to inadequate illumination, which makes part of the image darker than other regions of the same image. They have a significant negative impact on the performance of various computer vision tasks such as object detection, semantic segmentation, and object tracking. Image editing [1] using shadow matting is one of the common ways to remove shadows. Shadow detection and correction can improve the efficiency of machine learning models for a broad spectrum of vision-based problems such as image restoration [2], satellite image analysis [3], information recovery in urban high-resolution panchromatic satellite images [4], face recognition [5], and object detection [6]. In this work, we focus on natural images captured in a terrestrial setting, such as may be obtained by commercial devices and, particularly, automotive cameras.
Shadows are prevalent in almost all images of automotive scenes. The complex interaction of shadow segments with objects of interest such as pedestrians, roads, lanes, vehicles, and riders makes scene understanding challenging. Additionally, like soiling [7], [8], shadows do not have any distinct geometrical shape or size. Thus, they commonly lead to poor performance in road segmentation [9], [10], pedestrian pose estimation [11], [12], [13], segmentation [14], [15] and trajectory prediction [16]. Moving shadows can be incorrectly detected as dynamic objects in background subtraction [17], motion segmentation [18], depth estimation [19], [20], and SLAM algorithms [21], [22]. The difficulty of shadows is further exacerbated in strong sun glare scenes where the dynamic range is very high across shadow and glare regions [23]. These issues lead to an incomplete or partial understanding of the 360° region surrounding the vehicle and raise major safety concerns for passengers and Vulnerable Road Users (VRU) during automated driving [24]. Alternate sensor technologies like thermal cameras [25], [26] are resistant to shadow issues and can be used to augment cameras.

FIGURE 1: The proposed shadow removal framework. The shadow image and its shadow mask are subjected to a pixel-wise product operation to obtain the extracted shadow region, which is fed as input to the DeShadower (D) and Illumination network (I) simultaneously. D learns contrastively from I, and the resultant shadow-removed region is embedded in the input image before feeding it to the Refinement network, which produces the final shadow-free image. The end-to-end network is trained in a weakly supervised manner.
In recent times, approaches based on convolutional neural networks (CNNs) have significantly surpassed classical computer vision-based shadow removal techniques [27]-[32]. The majority of recent deep learning-based shadow removal approaches are fully supervised. However, such an end-to-end training setup requires paired data, namely shadow images and shadow-free versions of the same images. These paired data are used to train CNNs [33]-[35]. In practice, paired data is difficult to obtain, particularly when the vehicle is moving fast. The challenges include the need for highly controlled lighting sources, object interactions, occlusions, and static scenes. Data acquired in such a controlled setting suffers from a lack of diversity and often exhibits color inconsistencies [31] between the shadow image and the shadow-free reference of the same scene. Additionally, it is very difficult to capture any High Dynamic Range (HDR) natural scene without any shadow present to serve as a shadow-free reference sample.
Some recent studies [36]-[43] address the above-mentioned challenges and solve the shadow removal problem using unpaired data. They extensively studied the physical properties of shadows such as illumination, color, and texture. Motivated by these recent works, we propose an end-to-end trained weakly-supervised architecture for shadow removal as illustrated in Figure 1. In brief, we pass the shadow region of an input image to the DeShadower network, which is aided by the Illumination network to contrastively learn to "remove" the shadow from the region by exploiting the illumination properties. It is followed by the Refinement network, which helps to remove any artifacts, maintains the overall spatial consistency with the input image, and finally generates a shadow-free image.
The contributions and distinctively novel features of this work are summarized as follows:
1) We develop a novel weakly-supervised training scheme, UnShadowNet, that uses contrastive learning to build a shadow remover in unconstrained settings, where the network can be trained even without any shadow-free samples.
2) We propose a contrastive loss-guided DeShadower network to remove the shadow effects and a Refinement network to efficiently blend away the artifacts from the shadow-removed area.
3) We achieve state-of-the-art results on three public datasets, namely ISTD, adjusted ISTD, and SRD, in both constrained and unconstrained setups.
4) We perform extensive ablation studies with the different proposed network components, diverse augmentation techniques, shadow inpainting, and tuning of several hyper-parameters.

II. RELATED WORK
Removing shadows from images has received a significant thrust due to the availability of large-scale datasets. In this section, first, we briefly discuss the classical computer vision methods reported in the literature. Then we discuss the more recent deep learning-based approaches. Finally, we summarize the details of contrastive learning and its applications since it is a key component in our framework.

A. CLASSICAL APPROACHES
Illumination-based shadow removal: Initial works [2], [27], [44], [45] on removing shadows were primarily motivated by the illumination and color properties of the shadow region. In one of the earliest studies, Barrow et al. [46] proposed an image-based algorithm that decomposes the image into a few predefined intrinsic parts based on shape, texture, illumination, and shading. Later, Guo et al. [28] reported a simplified version of the same intrinsic parts by establishing a relation between the shadow pixels and the shadow-free region using a linear system. Likewise, Shor et al. [47] designed a model based on the illumination properties of shadows that makes a hard association between shadow and shadow-free pixels. In another study, Finlayson et al. [48] proposed a model that generates an illumination-invariant image for shadow detection followed by removal. The main idea of this work is that pixels with similar chromaticity tend to have similar albedo. Further, histogram equalization-based models performed quite well for shadow removal, where the color of the shadow-free area was transferred to the shadowed area, as reported by Vicente et al. [49], [50].
Shadow matting: Porter & Duff [51] introduced a matting-based technique that proved effective for handling shadows that are less distinct and fuzzy around the edges. The matting method was only helpful to some extent, as computing a shadow matte from a single image is difficult. To overcome this problem, Chuang et al. [1] applied matting for shadow editing and then transferred the shadow regions to different scenes. Later, the shadow matte was computed from a sequence of video frames captured using a static camera. Shadow matte was adopted by Guo et al. [28] and Zhang et al. [30] in their frameworks for shadow removal.

B. DEEP LEARNING-BASED APPROACHES
Shadow removal using paired data: Deep neural networks have been able to learn the properties of a shadow region efficiently when trained in a fully supervised manner. Such a setup requires paired data, meaning that the shadow and shadow-free versions of the same image are fed as input to the network. Qu et al. [33] proposed an end-to-end learning framework called DeshadowNet for removing shadows, in which multi-scale contextual information is extracted from different layers. This information, containing the density and color offset of the shadows, finally helped to predict the shadow matte. ST-CGAN, a two-stage approach proposed by Wang et al. [31], is an end-to-end network that jointly learns to detect and remove shadows. This framework was designed based on conditional GAN [52]. In SP+M-Net [32], physics-based priors were used as inductive bias. The networks were trained to obtain the shadow parameters and matte information to remove shadows. However, these parameters and matte details were pre-computed using the paired samples, and the same were regressed in the network. Further, Hu et al. [35] designed a shadow detection and removal technique by analyzing the contextual information in image space in a direction-aware manner. These features were then aggregated and fed into an RNN model. In ARGAN [34], an attentive recurrent generative adversarial network was reported. The generator contained multiple steps in which shadow regions were progressively detected. A negative residual-based encoder was employed to recover the shadow-free area, and a discriminator was then set up to classify the final output as real or fake. In another recent framework, RIS-GAN [53], shadow removal was performed with adversarial learning, using three distinct discriminators to jointly validate negative residual images, shadow-removed images, and inverse illumination maps.
Shadow removal using unpaired data: Mask-ShadowGAN [36] is the first deep learning-based method that learns to remove shadows from unpaired training samples. Their approach was conceptualized on CycleGAN [54], where a mapping was learned from a source (shadow area) to a target (shadow-free area) domain. Le and Samaras [37] presented a learning strategy that crops the shadow area from an input image to learn the physical properties of shadow in an unpaired setting. CANet [38] handles the shadow removal problem in two stages. First, contextual information was extracted from the non-shadow area and then transferred to the shadow region in the feature space. Finally, an encoder-decoder setup was used to fine-tune the final results. LG-ShadowNet [39] explored the lightness and color properties of shadow images and put them through multiplicative connections in a deep neural network using unpaired data. Cun et al. [40] handled the issues of color inconsistency and artifacts at the boundaries of the shadow-removed area using a Dual Hierarchical Aggregation Network (DHAN) and a Shadow Matting Generative Adversarial Network (SMGAN). The weakly-supervised method G2R-ShadowNet [41] designed three sub-networks dedicated to shadow generation, shadow removal, and image refinement. Fu et al. [42] modeled the shadow removal problem from a different perspective, namely auto-exposure fusion. They proposed a shadow-aware FusionNet and a boundary-aware RefineNet to obtain the final shadow-removed image. Further, in [43] a weakly-supervised approach was proposed that can be trained even without any shadow-free samples.
Miscellaneous: In video sequences, cast shadows are often misinterpreted as moving objects. This was highlighted in [55], where they were considered as insignificant shadows. These cast shadows were removed in [56] using a conditional random field. Liu et al. [57] investigated cast shadows in detail by proposing a Gaussian Mixture Model at the pixel level in HSV color space, followed by a pre-classifier and finally Markov Random Fields for shadow removal. Patch-based illumination-invariant features such as binary patterns of local color constancy (BPLCC) and light-based gradient matching (LGM) were introduced in [58]. These features were used to create two dictionaries, one for objects and one for shadows. Each patch was assigned to an independent class in each iteration based on the distance from the reference dictionary. A feature fusion-based approach was followed in [59], where a Spatio-Temporal Kernel Density Estimation (ST-KDE) based model was proposed for background modeling and the Local Binary Pattern (LBP) features of this model were fused probabilistically with Gabor features. Apart from shadow removal, shadow detection is also a well-studied area; some of the recent works include [6], [60], [61]. Inoue et al. [62] highlighted the problem of preparing a large-scale shadow dataset. They proposed a pipeline to synthetically generate shadow/shadow-free/matte image triplets.

C. CONTRASTIVE LEARNING
Learning the underlying representations by contrasting positive and negative pairs has been studied earlier in the community [63], [64]. This line of thought has inspired several works that attempt to learn visual representations without human supervision. While one family of works uses the concept of a memory bank to store the class representations [65]-[67], another set of works builds on the idea of maximizing mutual information [68]-[70]. Recently, Park et al. [71] presented an approach for unsupervised image-to-image translation that maximizes the mutual information between the two domains using contrastive learning. In our work, we adapt the problem of shadow removal so that it can be solved with the help of contrastive learning, without using shadow-free ground truth samples.

III. PROPOSED METHOD
In this work, we define the problem of shadow removal as the translation of images from the shadow domain $\mathcal{S} \subset \mathbb{R}^{H \times W \times C}$ to the shadow-free domain $\mathcal{F} \subset \mathbb{R}^{H \times W \times C}$, using only the shadow image and its mask and without requiring its shadow-free counterpart. The proposed end-to-end shadow removal architecture, UnShadowNet, is illustrated in Figure 2. We briefly summarize its high-level characteristics here and discuss each part in more detail in the following subsections. The architecture can be divided into three parts: the DeShadower Network (D), the Illumination Network (I) and the Refinement Network (R). These three networks are jointly trained in a weakly-supervised manner. Let us consider a shadow image $S \in \mathcal{S}$ and its corresponding shadow mask $S_M$. We obtain the shadow region $S_s$ by cropping the area covered by $S_M$ from the shadow image $S$. The DeShadower Network learns to remove the shadow from the region using a contrastive learning setup. It is aided by the Illumination Network, which generates bright samples for D to learn from. The Refinement Network finally combines the shadow-free region $S_f$ with the real image and refines it to form the shadow-free image $\hat{S}$.

A. DESHADOWER NETWORK (D)
The DeShadower Network is designed as an encoder-decoder architecture that generates a shadow-removed region ($S_r$) from the shadow region ($S_s$). The shadow-removed regions generated by this network should associate more with the bright samples and dissociate from the shadow samples. We employ a contrastive learning approach to help the DeShadower network achieve this and learn to generate shadow-free regions. In a contrastive learning framework, a "query" maximizes the mutual information with a "positive" sample in contrast to other samples that are referred to as "negatives". In this work, we use a "noise contrastive estimation" framework [68] to maximize the mutual information between $S_f$ and the bright sample $B$. We treat the bright samples generated by the Illumination Network as the "positive" and the shadow regions as the "negatives" in this contrastive learning setup. Thus, the objective function for maximizing (and minimizing) the mutual information can be formulated with the InfoNCE loss [68], a criterion derived from both statistics [68], [72] and metric learning [63], [64], [73]. Its formulation bears similarities with the cross-entropy loss:

$$\ell(x, x^{+}, x^{-}) = -\log \frac{\exp(x \cdot x^{+} / \tau)}{\exp(x \cdot x^{+} / \tau) + \sum_{i=1}^{N} \exp(x \cdot x^{-}_{i} / \tau)}$$

where $x$, $x^{+}$, $x^{-}$ are the query, positive and negatives respectively, $N$ is the number of negatives, and $\tau$ is the temperature parameter that controls the sharpness of the similarity distribution. We set it to the default value from prior work [65], [66]: $\tau = 0.07$. The feature stack in the encoder of the DeShadower Network, represented as $D_{enc}$, already contains latent information about the input shadow region $S_s$. From $D_{enc}$, $L$ layers are selected and, following the practice of prior work [70], we pass these features through a projection head, an MLP ($M_l$) with two hidden layers. We thus obtain the features $\{M_l(D^{l}_{enc}(S_s))\}_{l=1}^{L}$, where $D^{l}_{enc}$ is the $l$-th chosen layer of $D_{enc}$.
Similarly, the output or 'unshadowed' region $S_f$ and the bright region $B$ are encoded as $\{M_l(D^{l}_{enc}(S_f))\}_{l=1}^{L}$ and $\{M_l(D^{l}_{enc}(B))\}_{l=1}^{L}$ respectively. We adjust the InfoNCE loss [68] into a layer-wise NCE loss, which accumulates the loss above over the $L$ selected layers with the features of $S_f$ as queries, those of $B$ as positives, and those of $S_s$ as negatives. The generator should not change the contents of an image when there is no need to. In other words, given a shadow-free sample as input, it is expected to generate the same output without any change. To enforce such a regularization, we employ an identity loss [54], [74], formulated as the L1 distance between a shadow-free input and the output the generator produces for it. Additionally, as described further in the following sections, the Illumination Critic $I_C$ is trained on real non-shadow samples and augmented bright samples. Therefore, we can additionally use the cues provided by the Illumination Critic to distil its knowledge of illumination to the DeShadower Network. This is achieved through the critic loss $\mathcal{L}_{critic}$.
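To make the contrastive criterion concrete, the following is a minimal pure-Python sketch of an InfoNCE-style loss for a single query and its layer-wise sum. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the use of plain lists as feature vectors, and the L2 normalization of features before the dot product are all assumptions not stated in the text.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """L2-normalize a vector (assumed here; common practice in contrastive setups)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss for one query: a cross-entropy over (1 + N) similarity
    logits in which the positive sample is the correct class."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / tau]
    logits += [dot(q, normalize(n)) / tau for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


def layerwise_nce(queries, positives, negatives_per_layer, tau=0.07):
    """Sum the per-layer InfoNCE terms over the selected encoder layers."""
    return sum(
        info_nce(q, p, negs, tau)
        for q, p, negs in zip(queries, positives, negatives_per_layer)
    )
```

In this sketch the loss is small when the query is close to the positive (the bright sample) and far from the negatives (the shadow regions), and it grows when the roles are swapped.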

B. ILLUMINATION NETWORK (I)
Shadow regions have a lower level of illumination compared to their surroundings. The exact illumination level can vary according to scene lighting conditions as illustrated in Fig. 3. To show that a real shadow image and an image with a region where brightness is reduced are similar even semantically, we designed a small experimental setup. We fine-tune a ResNet [75] with samples containing real shadows and no shadows for a Shadow/Non-shadow classification task and then test the images where we reduce the brightness in the shadow region. In the majority of the cases, the network classifies it to be a 'Shadow' image.
Using this heuristic, the Illumination Network (I) is designed as a Generative Adversarial Network [76] that serves as a complementary augmentation setup, generating synthetic images in which the illumination level is increased in a shadow region. The shadow region $S_s$ is passed through the generator $I_G$ to produce brighter samples $B$ of the shadow region. The illumination critic ($I_C$) learns to classify these samples generated by $I_G$ as 'fake'. The motivation of this discriminator is detailed in the following section. The generator $I_G$ and the discriminator $I_C$ thus learn from an adversarial loss. We observe that the better the samples the Illumination Network generates, the better it aids D in creating more realistic shadow-removed samples. Therefore, to improve I so that it creates well-illuminated samples, we employ an illumination loss, the L1 distance $\lVert B - S_f \rVert_1$ between the bright sample $B$ generated by $I_G$ and the shadow-removed sample $S_f$. The adversarial loss, with the help of the discriminator, and the illumination loss together play a role in generating well-illuminated samples, which in turn helps D to create better shadow-removed samples. In this regard, D and I complement each other for the task. The Illumination Network supervises D to generate shadow-removed regions and likewise, D encourages I to create well-illuminated samples by learning from it. The choice of using I is experimentally justified in the ablation study section, as it helps to generate better results than relying solely on a pre-determined illumination level increase.

FIGURE 2: UnShadowNet is the proposed end-to-end weakly-supervised shadow removal architecture. It has three main sub-networks: DeShadower Network (D), Illumination Network (I) and Refinement Network (R). The pixel-wise product operation between the shadow image ($S$) and its shadow mask ($S_M$) extracts the shadow region ($S_s$), which is then fed to D and I simultaneously. The generator of the adversarially trained Illumination network generates an illuminated version ($B$) of $S_s$, which is subjected to validation by a discriminator, called the Illumination Critic ($I_C$), trained on augmented shadow-free regions ($B_{aug}$). The DeShadower is trained to produce the shadow-removed region ($S_r$) of $S_s$. To create a more realistically illuminated region $S_r$, a contrastive approach is employed between $B$ and $S_r$. Finally, the shadow-removed image ($\hat{S}_r$) is obtained by applying the embedding operation and becomes the input to the Refinement network. R is trained to efficiently blend the areas between shadow-removed and non-shadow regions so that it is robust to noise, blur, etc. Here a contrastive learning approach is followed in which positive samples ($\hat{S}_{aug}$) are generated as per the method in [70].

[Figure 3 panels: Original; Shadow with varying illumination level]

C. ILLUMINATION CRITIC (IC)
The role of the Illumination Critic ($I_C$) is two-fold. Firstly, in the Illumination Network, which generates well-illuminated variations of the shadow region $S_s$, $I_C$ is designed as the discriminator for $I_G$. The knowledge $I_C$ learns from representations of shadow-free regions allows it to encourage $I_G$ to create well-illuminated variations of the shadow region $S_s$, which are later used as positive pairs to contrastively train D.
Additionally, the DeShadower Network utilizes the knowledge of $I_C$ to create realistic shadow-removed regions from $S_s$. Having learned the representations of shadow-free regions and augmented samples with varying illumination, $I_C$ can influence D to "remove" shadows from shadow regions using $\mathcal{L}_{critic}$ in Eqn. 6. This two-fold characteristic of $I_C$ facilitates the complementary nature of D and I, where they mutually improve each other.
To train $I_C$, we crop randomly masked non-shadow areas from $S$ as well as from other samples in the dataset, similar to [41]. Additionally, $I_C$ is trained on augmented samples, where each shadow region $S_s$ is converted to 3 different samples by varying the illumination levels. The illumination levels are increased by factors of $\mu - 5$, $\mu$, and $\mu + 5$, where $\mu$ is fixed empirically as presented in Table 3. $I_C$ is trained using the same adversarial loss as the Illumination Network.
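The three-level illumination augmentation used to train the critic can be sketched as follows. This is a rough illustration only: the text does not state the unit of $\mu$ or how the increase is applied, so the sketch assumes a percentage brightness gain applied multiplicatively to 8-bit intensities, and the names `brighten` and `critic_augmentations` as well as the example value of `mu` are hypothetical.

```python
def brighten(region, percent):
    """Scale every pixel of a 2D intensity region by (1 + percent / 100),
    clipping to the 8-bit range. The percentage interpretation is an assumption."""
    scale = 1.0 + percent / 100.0
    return [[min(255, round(p * scale)) for p in row] for row in region]


def critic_augmentations(region, mu, delta=5):
    """Produce three brightened copies of a shadow region at levels
    mu - delta, mu and mu + delta."""
    return [brighten(region, mu + d) for d in (-delta, 0, delta)]


# Example with a hypothetical mu; the paper fixes mu empirically (Table 3).
samples = critic_augmentations([[100, 200]], mu=50)
```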

D. REFINEMENT NETWORK (R)
After obtaining the shadow-removed region $S_r$, it is embedded into the original shadow image $S$ to form the shadow-removed image $\hat{S}_r$. Following the embedding operation, there remain additional artefacts around the inpainted area due to improper blending. The Refinement Network R is designed to get rid of such artefacts by making use of the global context in the image. The absence of explicit ground truths in this setting motivated us to design a contrastive setup to train R. To generate the positive samples, we follow [70] and augment the generated shadow-removed image ($\hat{S}_r$) by randomly cropping non-shadow regions. This is followed by additional transformations such as resizing the cropped region back to the original size, random cutout, Gaussian blur, and Gaussian noise; the result is represented as $\hat{S}_{aug}$. The objective is to maximize the mutual information between the query image and the positive image pairs and to reduce it with the negative ones. In this phase, we reuse the existing encoder of R, represented as $R_{enc}$, as a feature extractor. We extract the layer-wise features of the query ($F_l$), positive ($F^{+}_l$) and negative ($F^{-}_l$) images and pass them through an MLP with two hidden layers, similar to D, to obtain their feature representations. The objective function for this contrastive learning setup is then a layer-wise contrastive loss over these representations, analogous to the one used for D. Additionally, we find that, following [77], [78], using a "layer-selective" perceptual loss along with the contrastive loss helps to preserve the integrity of the overall spatial details present in the input and output images. It is computed on the features extracted by relu_5_1 and relu_5_3 of a VGG-16 [79] feature extractor.
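One plausible reading of the embedding operation is a mask-based composite: pixels inside the shadow mask are taken from the shadow-removed region and all other pixels from the original image. The sketch below illustrates that reading on 2D intensity grids; the exact form of the operation is not given in the text, so the compositing rule and the function name `embed_region` are assumptions.

```python
def embed_region(shadow_image, removed_region, mask):
    """Composite two equally sized 2D grids: use removed_region where the
    binary mask is 1 and keep shadow_image everywhere else."""
    return [
        [r if m else s for s, r, m in zip(s_row, r_row, m_row)]
        for s_row, r_row, m_row in zip(shadow_image, removed_region, mask)
    ]


# Only the masked pixel (top right) is replaced.
composite = embed_region(
    [[10, 20], [30, 40]],
    [[0, 99], [0, 0]],
    [[0, 1], [0, 0]],
)
```

A hard composite like this explains why seams can appear along the mask boundary, which is the artefact the Refinement Network is meant to remove.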

E. SUPERVISED SETUP
Paired data is difficult to obtain for large-scale real-world datasets; however, it can be collected for a smaller, controlled dataset. Here we demonstrate that UnShadowNet can be easily extended to exploit paired shadow-free ground truths ($G$) when they are available. Since the optimal level of illumination in the regions is available from $G$ itself, we remove I in the fully-supervised setup and use different augmented versions of $G$ directly. Additionally, we make use of different losses that help to generate more realistic shadow-free images. To avoid loss of details in terms of content [80], we employ the pixel-wise L1-norm between the generated shadow-free image and $G$ as the pixel loss ($\mathcal{L}_p$). Color plays an important role in preserving the realism of the generated image and maintaining consistency with the real image. To this end, we follow a recent study in the literature [81] and formulate the color loss ($\mathcal{L}_c$) from $\angle(\cdot, \cdot)$ evaluated over corresponding pixels of the generated image and $G$, where $\angle(\cdot, \cdot)$ computes the angle between two colors regarding the RGB color as a 3D vector [81], and $P$ represents the number of pixel-pairs. In addition, style plays an important role in an image and corresponds to the texture information [82]. We follow [83] and define a Gram matrix as the inner product between the vectorised feature maps $i$ and $j$ in layer $l$. The Gram matrix represents the style of the feature set extracted by the $l$-th layer of the VGG-16 net for an input image. Subsequently, the style loss ($\mathcal{L}_s$) is defined from the difference between $S_i$ and $\gamma_i$, the Gram matrices of the generated shadow-free image and the ground truth image respectively, computed using VGG-16. Therefore, the complete supervised loss can be formulated as a weighted sum ($\mathcal{L}_{sup}$) of the pixel ($\mathcal{L}_p$), color ($\mathcal{L}_c$) and style ($\mathcal{L}_s$) losses:

$$\mathcal{L}_{sup} = \lambda_1 \mathcal{L}_p + \lambda_2 \mathcal{L}_c + \lambda_3 \mathcal{L}_s$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights corresponding to the pixel, color, and style losses respectively and are set empirically to $1.0$, $1.0$ and $1.0 \times 10^{4}$ following [84], [81] and [83] respectively in our experiments.
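The three supervised terms and their weighted sum can be illustrated with a small pure-Python sketch. It is not the authors' implementation: images are represented as flat lists of RGB triples and features as lists of vectorised feature maps, the averaging used to normalize each term is an assumption, and all function names are hypothetical. Only the default weights (1.0, 1.0, 1.0e4) come from the text.

```python
import math


def pixel_l1(pred, target):
    """Mean absolute difference over all pixels and channels."""
    diffs = [abs(a - b) for p, t in zip(pred, target) for a, b in zip(p, t)]
    return sum(diffs) / len(diffs)


def color_angle(pred, target):
    """Mean angle in radians between corresponding RGB vectors."""
    total = 0.0
    for p, t in zip(pred, target):
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        if norm == 0.0:
            continue  # angle is undefined for a black pixel; count it as zero
        cos = sum(a * b for a, b in zip(p, t)) / norm
        total += math.acos(max(-1.0, min(1.0, cos)))
    return total / len(pred)


def gram(features):
    """Gram matrix: inner products between vectorised feature maps i and j."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


def style_loss(feat_pred, feat_target):
    """Mean squared difference between the two Gram matrices."""
    gp, gt = gram(feat_pred), gram(feat_target)
    n = len(gp)
    return sum((gp[i][j] - gt[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def supervised_loss(pred, target, feat_pred, feat_target,
                    lam1=1.0, lam2=1.0, lam3=1.0e4):
    """Weighted sum of the pixel, color and style terms."""
    return (lam1 * pixel_l1(pred, target)
            + lam2 * color_angle(pred, target)
            + lam3 * style_loss(feat_pred, feat_target))
```

In practice the feature maps would come from a pretrained VGG-16; here they are passed in directly so that the sketch stays self-contained.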

A. DATASET AND EVALUATION METRICS
Datasets: In this work, we train and evaluate our proposed method on three publicly available datasets discussed below.
ISTD: ISTD [31] contains image triplets: a shadow image, a shadow mask, and a shadow-free image, captured under different lighting conditions that make the dataset significantly diverse. A total of 1,870 image triplets were generated from 135 scenes for the training set, whereas the testing set contains 540 triplets obtained from 45 scenes.
ISTD+: The samples of the ISTD [31] dataset were found to have color inconsistency issues between the shadow and shadow-free images, as mentioned in the original work [31]. The reason was that the shadow and shadow-free image pairs were collected at different times of the day, which led to different lighting appearance in the images. This color irregularity issue was fixed by Le et al. [32], and an adjusted ISTD (ISTD+) dataset was published.
SRD: The SRD [33] dataset contains a total of 408 pairs of shadow and shadow-free images, without shadow masks. For the training and evaluation of both our constrained and unconstrained setups, we use the shadow masks publicly provided by Cun et al. [40].
Evaluation metrics: For all the experiments conducted in this work, we use Root Mean-Square Error (RMSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) as metrics to evaluate and compare the proposed approach with other state-of-the-art methods. Following the prior art [28], [31]-[33], [36], [37], [39], we compute the RMSE on the recovered shadow-free area, the non-shadow area and the entire image in LAB color space. In addition to RMSE, we also compute PSNR and SSIM scores in RGB color space. Lower RMSE is better, while higher PSNR and SSIM are better.
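As a reminder of how two of these metrics are defined, the sketch below computes RMSE (optionally restricted to a mask, as needed for the shadow and non-shadow areas) and PSNR on flattened pixel lists. It assumes the inputs have already been converted to the relevant color space (LAB for RMSE, RGB for PSNR) and omits SSIM; the function names and the 8-bit peak value are illustrative assumptions, and evaluation protocols in the cited works may differ in detail.

```python
import math


def rmse(pred, target, mask=None):
    """Root mean-square error over flattened values; if a binary mask is given,
    only positions where the mask is non-zero are included."""
    pairs = [
        (p, t)
        for i, (p, t) in enumerate(zip(pred, target))
        if mask is None or mask[i]
    ]
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)
```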

B. IMPLEMENTATION DETAILS
The configuration of the generator is adopted from the DenseUNet architecture [84]. Unlike the conventional UNet architecture [85], it uses skip connections to facilitate better information sharing among the symmetric layers. For the discriminator, we employ the architecture of the PatchGAN [52] discriminator, which penalizes generated image structure at the scale of patches instead of at the image level. We develop and train all our models using the PyTorch framework. The models are trained using a momentum optimizer with a base learning rate of $1 \times 10^{-4}$ for the first 75 epochs, after which we apply linear decay for the rest of the epochs. We train the whole model for a total of 200 epochs. Momentum was set to 0.9. All the models were trained on a system comprising one NVIDIA GeForce RTX 2080 Ti GPU, and the batch size was set to 1 for all experiments. In the testing phase, shadow-removed outputs are resized to 256 × 256 to compare with the ground truth images, as followed in [37], [43]. We used the shadow detector by Ding et al. [34] to extract the shadow masks during the testing phase. There remain some visible artifacts due to improper blending, which are taken care of by the Refinement network.
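The learning-rate schedule described above can be written as a small function. This is a sketch under one assumption the text does not state: that the linear decay reaches zero at the final epoch (epoch 200). The function name and the zero-based epoch index are illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, constant_epochs=75, total_epochs=200):
    """Constant base_lr for the first `constant_epochs` epochs, then a linear
    decay that is assumed to reach zero at `total_epochs`. Epochs are zero-based."""
    if epoch < constant_epochs:
        return base_lr
    remaining = max(0, total_epochs - epoch)
    return base_lr * remaining / (total_epochs - constant_epochs)


# The schedule over the whole training run, one value per epoch.
schedule = [learning_rate(e) for e in range(200)]
```

In PyTorch, a function of this shape (divided by the base rate) is typically passed to a lambda-based scheduler so that the optimizer's learning rate is updated once per epoch.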

C. ABLATION STUDY
We considered the adjusted ISTD [32] dataset to perform our ablation studies due to its large volume and common usage in most of the recent shadow removal literature. We design an extensive range of experiments on this dataset in both weakly-supervised and fully-supervised settings to evaluate the efficacy of the proposed network components of UnShadowNet and to find the best configuration of our model.
Network components: The DeShadower network (D) is the basic unit that acts as the overall shadow remover in the proposal. In the weakly-supervised setup, we first experiment with only D for shadow removal (D-Net). We then add the Illumination network (I) to include diverse illumination levels on the non-shadow regions in the image, coupling I with D in a contrastive learning setup (D+I-Net). After shadow removal, the shadow-free region needs refinement for efficient blending with the non-shadow area. Hence we add a Refinement network (R) to D, where an L1 loss guides it to preserve the structural details (D+R-Net). Next, we add I to D+R-Net to obtain an illumination-guided, contrastively learned network with refinement (D+I+R-Net). Further improvement is achieved when we add the contrastive loss in R, which completes the UnShadowNet framework. In the fully-supervised setup, as described earlier, I is not used. As a result, we present the study of D-Net, D+R-Net, and UnShadowNet respectively. Table 1 summarizes the ablation study of the various proposed network components. An improvement in accuracy is observed due to the addition of I in the contrastive learning setup. R adds a further significant benefit when the L1 loss is replaced with the contrastive loss. The improvements of the proposed components are consistent in both weakly-supervised and fully-supervised learning, as reported in the same table. All further experiments are performed based on the configuration marked as UnShadowNet.

[Figure 5 panels: Input; Inpainted Shadow Images]

Curriculum learning: Curriculum Learning [86] is a type of learning strategy that allows one to feed easy examples to the neural network first and then gradually increase the complexity of the data. This helps to achieve stable convergence toward the global optimum. As per Table 2, it is observed that the curriculum learning technique provides considerable improvement when applied along with shadow inpainting and data augmentation.
Shadow inpainting: Shadows are a natural phenomenon, yet it is not easy to define strong properties of a shadow, because a shadow has no distinguishable shape, size, texture, etc. Hence it becomes important to augment the available shadow samples extensively so that they can be learned effectively by the network.
In this work, we estimate the mean intensity of the existing shadow region of an image ($I_P$). Then we randomly select a shadow mask ($S_M$) from the existing set of shadow samples. The mask $S_M$ is inpainted on the shadow-free region of $I_P$. The pixels of $I_P$ that belong to $S_M$ have their brightness adjusted to the previously computed mean. To generate diverse shadow regions, we do not apply the same mean every time; instead, the estimated mean value is adjusted by ±5%. The motivations for this inpainting are two-fold:
1) It is difficult to learn complex shadows that interact with diverse light sources and other objects in the scene. The inpainted shadows are standalone and provide an easier reference sample for another shadow segment in $I_P$.
2) Inpainting shadows with more diverse variations also increases the robustness of the network for shadow removal.
Figure 5 shows the proposed shadow inpainting with random shadow masks and different shadow intensities. Table 2 indicates the significant benefits of inpainting as a complement to standard data augmentation.
Data augmentation: Data augmentation is an essential constituent for regularizing any deep neural network-based model. We use standard augmentation techniques such as image flipping with a probability of 0.3, random scaling of images in the range 0.8 to 1.2, adding Gaussian noise, blurring, and contrast enhancement. Table 2 sums up the role of curriculum learning, shadow inpainting, and data augmentation, individually and in various combinations. This ablation study is performed on both the weakly-supervised and fully-supervised setups and indicates that these training strategies are beneficial for learning the shadow removal task in both.
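The shadow inpainting procedure described above (mean shadow intensity, randomly chosen mask, ±5% perturbation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and arguments are hypothetical, images are assumed to be floats in [0, 1] with a non-empty existing shadow, and the choice to rescale the region so that its mean matches the target (which preserves texture) is an assumption of this sketch.

```python
import numpy as np

def inpaint_shadow(image, shadow_mask, new_mask, jitter=0.05, rng=None):
    """Paint a synthetic shadow onto a shadow-free part of `image`.

    image:       float array in [0, 1], shape (H, W) or (H, W, C)
    shadow_mask: bool array (H, W), True on the existing shadow
    new_mask:    bool array (H, W), True where the sampled mask is placed
    """
    rng = np.random.default_rng() if rng is None else rng
    # Mean intensity of the existing shadow region.
    mean_shadow = image[shadow_mask].mean()
    # Perturb the mean by up to +/- jitter (5% by default) for diversity.
    target = mean_shadow * (1.0 + rng.uniform(-jitter, jitter))
    # Only paint on pixels that are not already in shadow.
    region = new_mask & ~shadow_mask
    out = image.copy()
    if region.any():
        current = out[region].mean()
        if current > 0:
            # Assumption: rescale so the region's mean equals the target.
            out[region] = np.clip(out[region] * (target / current), 0.0, 1.0)
    return out
```

Passing a seeded `numpy.random.Generator` as `rng` makes the perturbation reproducible.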
Illuminance factor (µ): The DeShadower Network maximizes the information with "bright" synthetic augmentations generated by the Illumination Network. The effectiveness of the Illumination Network is verified from the results in Fig. 4. To train the Illumination Network, we sample shadow regions from the dataset and vary their brightness by µ − 5, µ, µ + 5.
The values of the Illuminance factor (µ) that we experimented with are presented in Table 3. We find that setting µ to 50 gives the best shadow removal performance. For the fully-supervised setup, since the ground-truth images are available, the optimal level of brightness is obtained from those samples themselves; consequently, µ = 0 gives the best performance.
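The brightness variation around µ can be illustrated with a short sketch. The text does not specify the units of µ or how the brightness change is applied, so this example assumes an additive offset on a 0 to 255 intensity scale; the function name and the `delta` parameter are hypothetical.

```python
import numpy as np

def brightened_variants(image, shadow_mask, mu=50.0, delta=5.0):
    """Return three copies of `image` whose shadow pixels are brightened
    by mu - delta, mu and mu + delta.

    Assumption: additive offsets on a 0-255 intensity scale.
    """
    image = np.asarray(image, dtype=np.float32)
    variants = []
    for offset in (mu - delta, mu, mu + delta):
        out = image.copy()
        # Brighten only the shadow pixels and keep values in range.
        out[shadow_mask] = np.clip(out[shadow_mask] + offset, 0.0, 255.0)
        variants.append(out)
    return variants
```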

D. QUANTITATIVE STUDY
We evaluate our proposals and compare quantitatively with the state-of-the-art shadow removal techniques on ISTD [31], Adjusted ISTD [32], and SRD [33] benchmark datasets.
ISTD: Table 4 compares the proposed method with the state-of-the-art shadow removal approaches using the RMSE, PSNR, and SSIM metrics for the shadow, shadow-free, and all regions. We achieve state-of-the-art results, and the improvement with respect to all metrics for the shadow area is significant in both training setups, namely weakly-supervised (UnShadowNet) and fully-supervised (UnShadowNet Sup.). A few other fully-supervised shadow removal methods have been evaluated on the ISTD [31] dataset, and we compare them with our proposed fully-supervised setup. In this setup as well, as per Table 5, our proposed method outperforms the other state-of-the-art approaches.
ISTD+: Table 6 shows the performance of our proposed shadow remover on the adjusted ISTD [32] dataset using the RMSE metric. The comparison of our method in a fully-supervised setup with other techniques trained in the same fashion demonstrates the robustness of our framework, as it shows an incremental improvement over the most recent state-of-the-art methods. In addition, we have performed experiments using a weakly-supervised setup, where the metrics are comparable to and only slightly behind those of the fully-supervised model.
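For reference, region-wise error metrics of the kind reported in these tables can be computed along the following lines. This is a minimal sketch with hypothetical function names; it operates directly on the arrays it is given, whereas published benchmark scripts may use a different colour space or error definition, so its numbers should not be expected to match the tables.

```python
import numpy as np

def masked_rmse(pred, target, mask=None):
    """Root-mean-square error, optionally restricted to pixels where mask is True."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    sq_err = (pred - target) ** 2
    if mask is not None:
        sq_err = sq_err[np.asarray(mask, dtype=bool)]
    return float(np.sqrt(sq_err.mean()))

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB for images with peak value max_val."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def region_rmse(pred, target, shadow_mask):
    """RMSE over the shadow region, the shadow-free region, and the whole image."""
    shadow_mask = np.asarray(shadow_mask, dtype=bool)
    return {
        "shadow": masked_rmse(pred, target, shadow_mask),
        "shadow_free": masked_rmse(pred, target, ~shadow_mask),
        "all": masked_rmse(pred, target),
    }
```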
SRD: We report our shadow removal results in both the constrained and unconstrained setups and compare them with existing fully-supervised methods on SRD [33] using the RMSE metric. Table 7 indicates that our proposal trained in a fully-supervised fashion obtains the lowest RMSE in all regions and outperforms the most recent state-of-the-art methods [42], [53].
Figure 6 shows qualitative results of the proposed model trained in the weakly-supervised setup on three challenging samples from the ISTD [31] dataset. We also compare visually with two recently published weakly-supervised shadow removal methods, by Le et al. [37] and G2R-ShadowNet [41], respectively. It can be observed that UnShadowNet removes shadows accurately against complex backgrounds. In addition to the unconstrained setup, Figure 7 shows the results of our UnShadowNet Sup. model on the ISTD [31] dataset. Note that visual results are not shown for the adjusted ISTD [32] dataset because its test samples are the same as in the ISTD dataset; the only difference is in the color of the ground truth. In addition, we consider the SRD [33] dataset, and this is the first work to present visual results on samples from that dataset. Figures 8 and 9 show the results of UnShadowNet in the weakly-supervised and fully-supervised setups.

F. RUNTIME ANALYSIS
We compare the runtime performance of our model with other recent architectures. For this purpose, the available code bases were used to estimate the run-time. During inference, LG-ShadowNet [39] [94]. UnShadowNet trained on the ISTD [31] dataset is able to remove shadows reasonably well in automotive scenes.
tively evaluate these datasets extensively. We sampled a few shadow scenes from the challenging IDD dataset [94], which contains scenes with varied lighting conditions on Indian roads. It was not possible to train our model on this dataset because shadow masks were unavailable. We therefore used it to evaluate the robustness and generalization of our pre-trained model on novel scenes. The qualitative results are illustrated in Figure 10. Although the performance of the proposed shadow removal framework is comparable or superior to the state-of-the-art, it is not yet robust enough to be used in real-world autonomous driving systems. We feel that more extensive shadow datasets have to be built to enable more detailed studies, and we hope this work encourages the creation of such datasets or the annotation of shadows in existing datasets.

V. LIMITATIONS AND FUTURE DIRECTIONS
As presented in our experiments, UnShadowNet outperforms the existing state-of-the-art on several standard shadow removal datasets. However, there are certain areas that can be improved. Our model relies upon an external shadow detector [34], which may not always predict the shadow regions accurately. This may leave areas where the shadow is not removed. In future work, we intend to build a single-stage architecture that incorporates both shadow detection and removal. Since shadows are physical phenomena, another interesting direction would be to exploit the inherent physical properties of illumination that result in shadows. Moreover, in our research, we observed that the focus is mainly on datasets that contain images with a narrow field-of-view and lack the complex situations that may arise in real-life automotive scenes. For future work, we think it will be important to develop a suitable dataset that comprises such challenging scenarios as occur in real-world automotive settings.
Although the proposed method is not optimized for run-time, we still obtained a reasonable inference time of 0.822 seconds. With optimization techniques such as pruning and multi-task learning, real-time performance can potentially be achieved.
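A per-image inference time such as the one quoted above is typically estimated by averaging wall-clock time over repeated calls. The following is a generic sketch of such a measurement, not the procedure used for the reported figure; `fn` stands for any hypothetical inference callable.

```python
import time

def mean_latency(fn, sample, warmup=2, runs=10):
    """Average wall-clock seconds per call of fn(sample)."""
    # Warm-up calls absorb one-off costs such as lazy initialization.
    for _ in range(warmup):
        fn(sample)
    start = time.perf_counter()
    for _ in range(runs):
        fn(sample)
    return (time.perf_counter() - start) / runs
```

When the model runs on a GPU, the device should be synchronized before each timestamp is read, since kernel launches are asynchronous.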

VI. CONCLUSION
In this work, we have developed a novel end-to-end framework consisting of a deep learning architecture for image shadow removal in unconstrained settings. The proposed model can be trained with full or weak supervision. We achieve state-of-the-art results on all the major shadow removal datasets. Although weak supervision yields slightly lower performance, it eliminates the need for shadow-free ground truth, which is difficult to obtain. To enable the weakly supervised training, we have introduced a novel illumination network composed of a generative model that brightens the shadow region and a discriminator trained using shadow-free patches of the image. The discriminator acts as a guide (called the illumination critic) for the generator when producing illuminated samples. DeShadower, another component of the proposed framework, is trained in a contrastive way with the help of the illuminated samples generated by the preceding part of the network. Finally, we propose a refinement network that is trained in a contrastive way and is used for fine-tuning the shadow-removed image obtained as the output of the DeShadower. We perform ablation studies to show that the three components of our proposed framework, namely the illuminator, the DeShadower, and the refinement network, work effectively together. To evaluate the generalization capacity of the proposed approach, we tested it on a few novel shadow-affected images from a generic automotive dataset and obtained promising shadow removal results. Shadow removal continues to be a challenging problem in dynamic automotive scenes, and we hope this work encourages further dataset creation and research in this area.

VII. ACKNOWLEDGEMENT
We would like to thank Valeo for encouraging advanced research. Many thanks to Tuan-Hung Vu (valeo.ai, France), Saikat Roy (DKFZ, Germany) and Aniruddha Saha (University of Maryland, Baltimore County) for providing a detailed review prior to submission.
ANDREI BURSUC is a research scientist at valeo.ai in Paris, France. He completed his Ph.D. at Mines ParisTech in 2012. He was a postdoc researcher at Inria Rennes and Inria Paris. In 2016, he moved to industry to pursue research on autonomous systems. His current research interests concern computer vision and deep learning, in particular annotation-efficient learning and predictive uncertainty quantification. Andrei serves regularly as a reviewer for major computer vision and machine learning conferences and journals. He is teaching undergraduate courses at Ecole Normale Supérieure and Ecole Polytechnique.
UJJWAL BHATTACHARYA is currently a member of the faculty of the Computer Vision and Pattern Recognition Unit of Indian Statistical Institute situated in Kolkata. He joined his current Institute in 1991 as a Junior Research Fellow after obtaining his M.Sc. and M. Phil. degrees in Pure Mathematics from Calcutta University. In the past, he collaborated with a few industries and research labs in India and abroad. In 1995, he received Young Scientist Award from the Indian Science Congress Association. Also, he received a few best paper awards from various groups. He is a senior member of the IEEE and a life member of IUPRAI, the Indian unit of the IAPR. He has served as a Program Committee member of various reputed International Conferences and Workshops. Also, he worked as a Co-Guest Editor of a few Special Issues of International Journals. His current research interests include machine learning, computer vision, image processing, document processing, handwriting recognition, etc.
SENTHIL YOGAMANI is an Artificial Intelligence architect and holds a director-level technical leader position at Valeo Ireland. He leads the research and design of AI algorithms for various modules of autonomous driving systems. He has over 16 years of experience in computer vision and machine learning including 14 years of experience in industrial automotive systems. He is an author of 125+ publications with 5200 citations and 100+ filed patents. He serves on the editorial board of various leading IEEE automotive conferences including ITSC and IV and the advisory board of various industry consortia including Khronos, Cognitive Vehicles, and IS Auto. He is a recipient of the best associate editor award at ITSC 2015 and the best paper award at ITST 2012. VOLUME 10, 2022