Indirect Domain Shift for Single Image Dehazing

Despite their remarkable expressibility, convolutional neural networks (CNNs) still fall short of delivering satisfactory results on single image dehazing, especially in terms of faithful recovery of fine texture details. In this paper, we argue that the inadequacy of conventional CNN-based dehazing methods can be attributed to the fact that the domain of hazy images is too far from that of clear images, which makes it difficult to train a CNN to learn a direct domain shift in an end-to-end manner while simultaneously recovering texture details. To address this issue, we propose to add explicit constraints inside a deep CNN model to guide the restoration process. In contrast to direct learning, the proposed mechanism shifts and narrows the candidate region for the estimation output via multiple confident neighborhoods. It is therefore capable of consolidating the expressibility of different architectures, resulting in a more accurate indirect domain shift (IDS) from the domain of hazy images to that of clear images. We also propose two training schemes, hard IDS and soft IDS, which further reveal the effectiveness of the proposed method. Our extensive experimental results indicate that the dehazing method based on this mechanism outperforms the state of the art.


I. INTRODUCTION
Deep convolutional neural networks (CNNs) have been tremendously successful in many high-level computer vision tasks, e.g., image recognition [18], [24] and object detection [15], [36]. Although recent works have shown that it is also possible to learn an end-to-end CNN model for low-level vision tasks, e.g., image dehazing [6], [20], the resulting performance is still not completely satisfactory. For high-level vision tasks, it suffices to extract specific features and simply express them as very low-dimensional vectors [24], which results in a relatively simple mapping. In contrast, low-level vision tasks require both global understanding of image content and local inference of texture details; as such, the associated mappings are more complicated.
One possible explanation for the performance discrepancy between high-level and low-level vision tasks is as follows. For high-level vision tasks such as image recognition, a slight perturbation of the output tends to be inconsequential, since the perturbed output is likely to be converted to the same one-hot vector and consequently the classification label remains unaffected. However, for low-level vision tasks such as image dehazing, any perturbation can potentially manifest in the final result, jeopardizing the image quality. From this point of view, despite the fact that a deep CNN can in principle approximate any function, it is still difficult to train an accurate mapping that lifts the input to the target domain in one shot, since the loss function is typically very close to zero in the neighborhood of the target image [29]. We argue that a different mechanism for domain shift is needed for image dehazing, which requires both memory and understanding of image contents.
To this end, we provide explicit guidance during model optimization to lead the domain shift path across several identified confident neighborhoods, resulting in the proposed framework shown in Figure 1. More specifically, instead of only imposing the loss function on the model output, we introduce multi-scale estimation, multi-branch diversity, and adversarial loss inside the model, thereby pulling the interim outputs to specific regions and then merging them in the target domain; this yields an indirect but more accurate mapping. The contributions of this paper include:
• By introducing loss functions inside a CNN model, we propose the framework of indirect domain shift (IDS) for image dehazing, which aggregates the powerful expressibility of different architectures, i.e., multi-scale, multi-branch, and generator, for lifting degraded images to the target domain indirectly.
• We provide theoretic justifications for IDS and show that it provides valuable guidance for network construction.
  - A multi-scale module takes advantage of a coarse-to-fine network to maintain global-local consistency.
  - A multi-branch architecture is adopted to enable precise inference of local details by providing diverse confident neighborhoods.
  - A FusionNet further improves the perceptual quality by informed 'imagination', rather than blindly pursuing a higher PSNR, as the multi-scale multi-branch structure has already shifted degraded images close enough to the corresponding ground truth in terms of objective image quality metrics.
• It is demonstrated that IDS leads to remarkable performance improvements compared with the state-of-the-art algorithms.

II. RELATED WORKS
Image dehazing, which aims to recover a haze-free image from its hazy version, is a highly ill-posed restoration problem. The haze effect is often approximated using the atmospheric scattering model [33], given as follows:

$$I(x) = J(x)\,t(x) + A\,\big(1 - t(x)\big), \tag{1}$$

where $I(x)$, $J(x)$, and $A$ are the observed hazy image, clear scene radiance, and global atmospheric light, respectively. The scene transmission $t(x)$ describes the portion of light that is not scattered and reaches the camera. It can be expressed as $t(x) = e^{-\beta d(x)}$, where $\beta$ is the medium extinction coefficient and $d(x)$ is the scene depth at pixel $x$.
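To make the atmospheric scattering model concrete, the following minimal sketch applies it pixel by pixel and then inverts it when the depth, extinction coefficient, and atmospheric light are known exactly. The function names, the lower bound `t_min`, and the sample values are illustrative assumptions and are not taken from the paper; real dehazing is hard precisely because $t(x)$ and $A$ are unknown.

```python
import math


def transmission(depth, beta):
    """Scene transmission t(x) = exp(-beta * d(x)) for one pixel."""
    return math.exp(-beta * depth)


def add_haze(clear, depth, beta, airlight):
    """Apply I = J * t + A * (1 - t) to an H x W x 3 nested list in [0, 1]."""
    hazy = []
    for clear_row, depth_row in zip(clear, depth):
        hazy_row = []
        for pixel, d in zip(clear_row, depth_row):
            t = transmission(d, beta)
            hazy_row.append([c * t + airlight * (1.0 - t) for c in pixel])
        hazy.append(hazy_row)
    return hazy


def remove_haze(hazy, depth, beta, airlight, t_min=0.05):
    """Invert the model: J = (I - A) / max(t, t_min) + A.

    t_min is a hypothetical safeguard against division by a tiny transmission.
    """
    clear = []
    for hazy_row, depth_row in zip(hazy, depth):
        clear_row = []
        for pixel, d in zip(hazy_row, depth_row):
            t = max(transmission(d, beta), t_min)
            clear_row.append([(c - airlight) / t + airlight for c in pixel])
        clear.append(clear_row)
    return clear


# Illustrative 1 x 2 image; beta and airlight are arbitrary example values.
clear = [[[0.2, 0.4, 0.6], [0.9, 0.1, 0.3]]]
depth = [[1.0, 2.0]]
hazy = add_haze(clear, depth, beta=1.2, airlight=0.85)
restored = remove_haze(hazy, depth, beta=1.2, airlight=0.85)
```

With the true parameters the inversion is exact up to rounding, which illustrates that the difficulty lies in estimating the transmission and atmospheric light rather than in the model itself.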
Based on this atmospheric scattering model [33], many strategies have been proposed that take advantage of various forms of prior knowledge. For example, the dark channel prior (DCP) [17] assumes that in non-sky patches, at least one color channel has very low intensity. The color attenuation prior [51] assumes that the image saturation decreases sharply in hazy patches, so that the difference between brightness and saturation can be utilized to estimate the haze concentration. To address the weakness of DCP in the sky region, [39] proposes to deal with the non-sky region and the sky region separately, using the dark channel prior and luminance stretching. In [40], the authors present a new color channel method to remove atmospheric scattering for single image dehazing. The overall algorithm consists of atmospheric light calculation, transmission map estimation, radiance estimation, and post enhancement. Furthermore, based on the assumption that a linear relationship exists in the minimum channel between hazy and haze-free images, a fast linear-transformation-based dehazing algorithm is introduced in [44].
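As an illustration of the dark channel prior mentioned above, the sketch below computes the dark channel of a small image: the minimum over color channels followed by a minimum over a local patch. The function name, the default patch size, and the border handling (patches are clipped at the image boundary) are assumptions made for this example, not details from [17].

```python
def dark_channel(image, patch=3):
    """Dark channel of an H x W x 3 nested list.

    For each pixel, take the minimum intensity over all color channels
    and over all pixels in a patch x patch window (clipped at the borders).
    """
    height, width = len(image), len(image[0])
    radius = patch // 2
    # Step 1: per-pixel minimum over the color channels.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    # Step 2: minimum filter over the local window.
    dark = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            dark[i][j] = min(
                min_rgb[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            )
    return dark


# Illustrative 3 x 3 image: bright everywhere except one dark channel value
# at the center pixel.
bright = [0.5, 0.6, 0.7]
image = [[list(bright) for _ in range(3)] for _ in range(3)]
image[1][1] = [0.1, 0.9, 0.9]
dark_3x3 = dark_channel(image, patch=3)
dark_1x1 = dark_channel(image, patch=1)
```

In a haze-free non-sky patch the dark channel is expected to be close to zero, so larger values can serve as a cue for haze thickness.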
Recently, data-driven approaches to image dehazing have received increasing attention. [37] and [7] propose to use CNNs for medium transmission estimation, which is further leveraged to recover the haze-free image. In [37], a multi-scale deep neural network is proposed to learn a mapping between hazy images and their corresponding transmission maps. A densely connected pyramid network is proposed in [47] to jointly estimate the transmission map, atmospheric light, and dehazed images, while an effective iteration algorithm is developed in [31] to learn the haze-relevant priors. [10] further embeds the atmospheric model into the design of the CNN and proposes a feature dehazing unit to keep the network end-to-end trainable. However, it is known that the atmospheric scattering model (ASM) is not valid in certain scenarios [28], which limits the applicability of the aforementioned dehazing methods. Unlike those ASM-dependent methods, [8] integrates multiple models to perform haze removal with attention, and [30] uses a GridNet-based network [14] to directly predict dehazed images via an ASM-agnostic approach. To further improve the performance in the ASM-agnostic setting, [9] proposes a multi-scale boosted dehazing network (MSBDN) with a boosting strategy and a back-projection technique. [19] is the first to introduce knowledge distillation to the dehazing problem; it allows the dehazing model to learn from both ground truths and teacher outputs.
Many methods that have been developed for other image restoration tasks, e.g., deblurring and denoising, are also highly relevant. To remove blurring caused by dynamic scenes, a multi-scale convolutional neural network is proposed in [32] to restore sharp images in an end-to-end manner. In [16], the weighted nuclear norm minimization (WNNM) problem is studied and applied to image denoising by exploiting non-local self-similarity. This work is later extended to handle arbitrary degradation, including blur and missing pixels [46]. To tackle the long-term dependency problem, MemNet [43] is proposed by introducing a memory block, consisting of a recursive unit and a gate unit, to explicitly mine persistent memory through an adaptive learning process. To make deep networks implementable on limited resources, a new activation unit is proposed in [23], which enables the network to capture much more complex features, thus requiring a significantly smaller number of layers to reach the same performance. A super-resolution generative adversarial network (SRGAN) is developed in [25] to recover high-frequency details and produce more natural-looking images.

III. FORMULATION FOR INDIRECT DOMAIN SHIFT
In this section, we provide a theoretical formulation of the image dehazing problem and propose an indirect domain shift method as an effective approach to obtaining an approximate solution.
Denote the prior distribution of clear images of size $m \times n$ by $p_X$, which is defined on a low-dimensional manifold $\mathcal{M}$ in $\mathbb{R}^{3 \times m \times n}$. The image degradation mechanism can be modeled as a conditional distribution $p_{Y|X}$, i.e., given the clear image $x$, a distorted image $y$ is generated according to $p_{Y|X}$. Note that $p_X$ and $p_{Y|X}$ induce the joint distribution $p_{X,Y}$ as well as the conditional distribution $p_{X|Y}$; in general, both $p_X$ and $p_{Y|X}$ need to be learned from the training data. Image dehazing can be formulated as a maximum a posteriori estimation problem:

$$x_{\mathrm{map}} = \arg\max_{x}\; p_{X|Y}(x \mid y). \tag{2}$$

In practice, one often considers the following alternative formulation:

$$x_{\ell} = \arg\min_{\hat{x}}\; \mathbb{E}_{X \sim p_{X|Y}(\cdot \mid y)}\big[\ell(X, \hat{x})\big], \tag{3}$$

where $\ell$ is a loss function. In general it is expected that both $x_{\mathrm{map}}$ and $x_{\ell}$ are close to the ground truth. However, there is no guarantee that $x_{\ell}$ belongs to $\mathcal{M}$. We shall describe an IDS method, which leverages multi-scale estimation and multi-branch diversity to obtain an approximate solution of (3), then lifts it into $\mathcal{M}$ using the adversarial loss to produce a candidate solution of (2). A network that realizes the IDS method is shown in Figure 1.
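A short worked example may help explain why the minimizer of an expected loss, as in (3), is not guaranteed to lie on the manifold of clear images. The derivation below assumes the squared-error loss, which is a choice made here purely for illustration; the symbol $\hat{x}^{\star}$ is introduced only for this example.

```latex
\begin{aligned}
\frac{\partial}{\partial \hat{x}}\,
\mathbb{E}\!\left[\,\|X-\hat{x}\|_2^2 \,\middle|\, Y=y\right]
  &= 2\hat{x} - 2\,\mathbb{E}[X \mid Y=y] = 0 \\
\Longrightarrow\quad
\hat{x}^{\star} &= \mathbb{E}[X \mid Y=y]
  = \int x\, p_{X|Y}(x \mid y)\, dx .
\end{aligned}
```

Under this assumption the minimizer is the posterior mean, i.e., a weighted average of clear images that are consistent with $y$. Since a low-dimensional manifold is in general not convex, such an average need not lie on it, which is consistent with the over-smoothed textures often associated with purely MSE-driven restoration.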

A. Multi-scale Estimation
Note that (3) requires the knowledge of $p_{X|Y}$, which needs to be estimated from the training data; hence we solve the following approximated version of (3):

$$\tilde{x}_{\ell} = \arg\min_{\hat{x}}\; \mathbb{E}_{X \sim \tilde{p}_{X|Y}(\cdot \mid y)}\big[\ell(X, \hat{x})\big], \tag{4}$$

where $\tilde{p}_{X|Y}$ is an approximation of $p_{X|Y}$ learned from the training data. To ensure that $\tilde{x}_{\ell} \approx x_{\ell}$, where $x_{\ell}$ denotes the solution of (3) (and consequently that $\tilde{x}_{\ell}$ is close to the ground truth), we need $\tilde{p}_{X|Y}(x \mid y) \approx p_{X|Y}(x \mid y)$ for $x \in \mathcal{M}$ (at least for $x$ in a neighborhood of $y$ that contains the ground truth). However, since the difference between the ground truth and the distorted version $y$ is not negligible, this neighborhood could be quite large, rendering a good approximation of $p_{X|Y}(\cdot \mid y)$ in this neighborhood difficult to obtain. Indeed, the number of parameters needed to specify $p_{X|Y}(\cdot \mid y)$ in this neighborhood might be comparable to or even larger than the amount of available training data; hence a direct approximation can be highly unreliable, especially considering that the approximation is in general done in a suboptimal way. For this reason, it is sensible to first approximate $p_{X'|Y}$ (with $x'$ being a low-resolution version of the ground truth), which itself is an approximation of $p_{X|Y}$ and can be specified by a significantly smaller number of parameters (as compared to $p_{X|Y}$). In this way, we can get a good approximation of $p_{X'|Y}$, denoted by $\tilde{p}_{X'|Y}$, and solve the following optimization problem instead:

$$\tilde{x}'_{\ell} = \arg\min_{\hat{x}}\; \mathbb{E}_{X' \sim \tilde{p}_{X'|Y}(\cdot \mid y)}\big[\ell(X', \hat{x})\big]. \tag{5}$$

Since $\tilde{p}_{X'|Y}(x \mid y)$ is a good approximation of $p_{X|Y}(x \mid y)$, it is expected that $\tilde{x}'_{\ell}$ is close to $x_{\ell}$ and consequently not very far away from the ground truth. Now with $\tilde{x}'_{\ell}$ at hand, we can further convert (3) to the following problem:

$$x_{\ell} = \arg\min_{\hat{x} \in \mathcal{N}(\tilde{x}'_{\ell})}\; \mathbb{E}_{X \sim p_{X|\tilde{X}'_{\ell},Y}(\cdot \mid \tilde{x}'_{\ell},\, y)}\big[\ell(X, \hat{x})\big], \tag{6}$$

where $\mathcal{N}(\tilde{x}'_{\ell})$ is a neighborhood of $\tilde{x}'_{\ell}$ that is large enough to cover the ground truth. It suffices to have a good approximation $\tilde{p}_{X|\tilde{X}'_{\ell},Y}(\cdot \mid \tilde{x}'_{\ell}, y)$ over $\mathcal{N}(\tilde{x}'_{\ell})$. The above procedure is repeated until the required neighborhood is small enough. We assume that the smaller the neighborhood becomes, the fewer parameters are needed to specify the distribution defined over this neighborhood, and consequently the easier the approximation becomes. Multi-scale estimation is introduced to mimic conventional coarse-to-fine optimization methods and has been widely applied in many computer vision tasks [11], [12], [32], [37].

B. Multi-branch Diversity
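The idea underlying multi-branch diversity is similar. Suppose we adopt two branches with different loss functions, denoted by $\ell_1$ and $\ell_2$, respectively; then (6) becomes

$$x_{\ell} = \arg\min_{\hat{x} \in \mathcal{N}(\tilde{x}'_{\ell_1}) \cap \mathcal{N}(\tilde{x}'_{\ell_2})}\; \mathbb{E}_{X \sim p_{X|\tilde{X}'_{\ell_1},\tilde{X}'_{\ell_2},Y}(\cdot \mid \tilde{x}'_{\ell_1},\, \tilde{x}'_{\ell_2},\, y)}\big[\ell(X, \hat{x})\big], \tag{7}$$

where $\tilde{x}'_{\ell_1}$ and $\tilde{x}'_{\ell_2}$ denote the estimates produced by the two branches. It should be clear that multi-branch diversity further narrows the region over which the distribution needs to be estimated. In our experiments, we choose $\ell_1$ and $\ell_2$ to be the mean square error (MSE) and structural similarity index (SSIM) losses, respectively. We choose MSE and SSIM as loss functions because MSE focuses on pixel-level differences while SSIM pays more attention to perceptual quality. See Figure 1 (a) and (b) for the architecture of the two multi-scale estimation branches of the proposed IDS network.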
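To illustrate why MSE and SSIM can identify different confident neighborhoods, the following sketch compares the two measures on a toy one-dimensional signal. The SSIM here is a simplified variant computed from global statistics of the whole signal, whereas practical SSIM implementations use local sliding windows; this simplification, the constants $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ with $L = 1$, and all sample values are assumptions made for this example rather than the paper's implementation.

```python
def mse(a, b):
    """Mean square error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def global_ssim(a, b, dynamic_range=1.0):
    """Simplified SSIM using global (not windowed) means and variances."""
    n = len(a)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    luminance = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (var_a + var_b + c2)
    return luminance * contrast_structure


reference = [0.2, 0.4, 0.6, 0.8]
# Two distortions with (nearly) identical MSE but different structure.
brightness_shift = [v + 0.1 for v in reference]
structure_change = [0.3, 0.3, 0.7, 0.7]

mse_shift = mse(reference, brightness_shift)
mse_struct = mse(reference, structure_change)
ssim_shift = global_ssim(reference, brightness_shift)
ssim_struct = global_ssim(reference, structure_change)
```

Both distortions have an MSE of about 0.01, yet the simplified SSIM penalizes the structural change more than the uniform brightness shift. A corresponding loss would typically be taken as one minus the SSIM value.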

C. Adversarial Loss
The role of the adversarial loss $\ell_{ad}$ is to lift $x_{\ell}$, the solution of (3), into $\mathcal{M}$. Specifically, consider a neural network subject to the weighted loss $\ell + \lambda \ell_{ad}$, which can be interpreted as solving the following problem:

$$x_{\ell + \lambda \ell_{ad}} = \arg\max_{x \in \mathcal{N}(x_{\ell},\, \lambda)}\; p_X(x), \tag{8}$$

where $\mathcal{N}(x_{\ell}, \lambda)$ is a neighborhood of $x_{\ell}$. In general, this optimization problem tends to give a reconstruction that falls into $\mathcal{M}$, since $p_X$ is only positive on $\mathcal{M}$. Note that the size of $\mathcal{N}(x_{\ell}, \lambda)$ depends on $\lambda$. Specifically, $\mathcal{N}(x_{\ell}, \lambda)$ is large when $\lambda$ is large. In the extreme case of $\lambda \to \infty$, we have $x_{\ell + \lambda \ell_{ad}} \to \arg\max_{x \in \mathcal{M}} p_X(x)$; when $\lambda$ is very small, $\mathcal{N}(x_{\ell}, \lambda)$ may have no intersection with $\mathcal{M}$, and in this case (8) reduces to (3). In principle it is desirable to choose the smallest $\lambda$ such that $\mathcal{N}(x_{\ell}, \lambda)$ intersects with $\mathcal{M}$. It is also worth noting that $p_X$ is in general unknown, so one has to solve a modified version of (8) with $p_X$ replaced by $\tilde{p}_X$, which is an approximation of $p_X$ learned from the training data.
The adversarial loss plays an important role in generating texture details in image restoration. One of the reasons for its success in our framework is that, by leveraging multi-scale estimation and multi-branch diversity, one can already obtain a good estimate that lies in a narrow neighboring region of $\mathcal{M}$, and consequently the generator does not need much "imagination" to produce a natural-looking image. We observe a phenomenon similar to that reported in [25]: the adversarial loss is helpful for faithful reproduction, even though the final PSNR is slightly lower. We therefore introduce the adversarial loss to obtain better perceptual quality, without expecting a higher PSNR value. The relevant ablation study can be found in Section V-C.

IV. IMPLEMENTATION
In this section, we provide a detailed implementation of the indirect domain shift (IDS). We also propose two training schemes, i.e., hard IDS and soft IDS.

A. Network Architecture
The proposed IDS network is shown in Figure 1. It consists of three basic components, i.e., the MSE branch, the MS-SSIM branch, and the FusionNet. The MSE and SSIM branches are built with a multi-scale structure to successively map hazy images to their clear counterparts at different resolution levels (as in (6)); moreover, they are supervised by non-identical loss functions to ensure differentiated outputs. The FusionNet completes the domain shift process by merging the outputs from the two branches together with the input hazy image into a single clear image (as in (7)). We train the FusionNet (see Figure 1 (d)) using a content loss defined as the weighted sum of the MSE loss and the perceptual loss [21]. The weight is selected by searching over $1.0$, $10^{-1}$, $10^{-2}$, and $10^{-3}$. We find that our network achieves the best performance when the weight is set to $10^{-2}$. An adversarial loss (see (8)) is also imposed on the FusionNet to enhance the perceptual quality of the final result.
To be specific, inside each diversity branch, there are three sub-networks, each performing domain shift at a different scale level. The input of the coarse-scale sub-network is obtained from the original hazy image via bilinear interpolation with a down-sampling factor of 4. Its output is up-sampled by a factor of 2 via pixel shuffle [42], then fed into the medium-scale sub-network, together with the hazy image representation down-sampled by a factor of 2. The input of the fine-scale sub-network is the concatenation of the original hazy image representation and the up-sampled output of the medium-scale sub-network.
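The scale schedule and the pixel-shuffle up-sampling described above can be sketched in plain Python as follows. The helper names are hypothetical, the channel ordering follows the common convention in which output position $(h r + i,\, w r + j)$ of channel $c$ is read from input channel $c r^2 + i r + j$, and the 180 × 180 example size is only an illustration; none of this is taken from the authors' code.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r) x H x W nested list into C x (H*r) x (W*r)."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


def branch_input_sizes(height, width):
    """Spatial sizes seen by the coarse, medium, and fine sub-networks."""
    return {
        "coarse": (height // 4, width // 4),  # hazy image down-sampled by 4
        "medium": (height // 2, width // 2),  # coarse output up-sampled by 2
        "fine": (height, width),              # medium output up-sampled by 2
    }


# Four 1 x 1 feature channels become one 2 x 2 channel when r = 2.
features = [[[0]], [[1]], [[2]], [[3]]]
upsampled = pixel_shuffle(features, 2)
sizes = branch_input_sizes(180, 180)
```

Pixel shuffle trades channels for resolution without interpolation, so a sub-network that emits $r^2$ times as many channels can have its output brought to the next scale in a learnable way.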
It is known that residual networks (ResNets) can facilitate gradient flow, while dense networks (DenseNets) help maximize the use of feature layers via concatenation and dense connection. To capitalize on their respective strengths, [49] proposes so-called residual dense networks (RDNs), which consist of contiguous memory blocks, local residual learning blocks, and global feature fusion blocks.
In this work, we use RDNs as the fundamental building components of the proposed IDS network. See Table I for detailed specifications. Note that hard IDS and soft IDS adopt the same network structure, but differ in terms of the number of trainable parameters. Model depth will be detailed in Section V-D.

B. Training Scheme
To handle the coexistence of multiple loss functions, we propose two back-propagation strategies characterized by different effective ranges of the loss functions. Specifically, we can separately update each module according to the associated loss function or jointly update all modules according to a global loss that aggregates the local ones. This results in the two IDS training schemes, i.e., hard IDS and soft IDS.
1) Hard IDS: We first present the isolated training strategy for hard IDS, shown in Figure 2. Specifically, each module is supervised independently by its associated loss function and delivers dehazed images to the next stage after updating its weights. Note that in this case, the convergence of the entire network does not depend on the convergence of all loss functions, which means that the network performance may become stable before all loss functions are small enough. This is a consequence of direct mapping, since for each mapping step it suffices to enter one of many (almost) equally good confident neighborhoods, resulting in lower computational load. One advantage of isolated updating is that the gradient vanishing problem can be alleviated. Recall that this problem is caused by the emergence of small gradients in the earlier layers of very deep networks during back-propagation. By comparison, isolated training shortens the back-propagation path while maintaining the depth of forward inference, at the expense of heterogeneous convergence rates of the different loss functions. It is also worth noting that the isolated training strategy closely follows our analytical formulation, which dictates how to shift from one domain to another. Therefore, the success of hard IDS can be viewed as a good indication of the correctness of our theoretical framework.
2) Soft IDS: In contrast to hard IDS, here a global loss function obtained by combining all local module losses is used to update network parameters via end-to-end back-propagation. Although the local losses are evaluated based on the images output by the respective modules, only the feature map from the penultimate convolutional layer of each module is delivered to the next module. This enables soft IDS to accomplish the desired task largely in the feature space. The fact that each module no longer has to re-map the previous module's output images back to the feature space helps reduce the number of parameters and also makes the indirect shifting path 'smoother'. Another advantage of soft IDS is that there is no need to be concerned with the convergence of a specific module as in hard IDS, which facilitates the training process.
In summary, hard IDS and soft IDS differ in two main aspects: (1) as shown in Figure 4, hard IDS delivers images to the next stage, whereas soft IDS delivers features; (2) hard IDS adopts isolated training (optimizing each module independently), while soft IDS sums all the local module losses and optimizes the entire network in each iteration.
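The difference between the two update rules can be sketched on a deliberately tiny toy problem: two scalar "modules" in sequence, each with its own squared-error loss against the same target. This is only an illustration of isolated versus joint gradient updates with hand-derived gradients; the scalar model, the initial weights, the learning rate, and the step count are all assumptions, and the sketch does not model the image-versus-feature hand-off that also distinguishes the two schemes.

```python
def train_hard(y, x, steps=5000, lr=0.02):
    """Isolated training: each module follows only its own loss."""
    w1, w2 = 0.1, 0.1
    for _ in range(steps):
        a = w1 * y
        # Module 1 is updated from its local loss (a - x)^2 only.
        w1 -= lr * 2.0 * (a - x) * y
        # Module 2 receives the updated module-1 output as a fixed input;
        # no gradient flows back into w1 from the second loss.
        a = w1 * y
        out = w2 * a
        w2 -= lr * 2.0 * (out - x) * a
    return w1, w2


def train_soft(y, x, steps=5000, lr=0.02):
    """Joint training: one global loss (a - x)^2 + (out - x)^2."""
    w1, w2 = 0.1, 0.1
    for _ in range(steps):
        a = w1 * y
        out = w2 * a
        # The second loss also contributes to the gradient of w1.
        grad_w1 = 2.0 * (a - x) * y + 2.0 * (out - x) * w2 * y
        grad_w2 = 2.0 * (out - x) * a
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2


# Toy data: scalar "hazy" input y and scalar "clear" target x.
hard_w1, hard_w2 = train_hard(2.0, 1.0)
soft_w1, soft_w2 = train_soft(2.0, 1.0)
hard_output = hard_w1 * hard_w2 * 2.0
soft_output = soft_w1 * soft_w2 * 2.0
```

In this toy setting both schemes reach the target; the point of the sketch is only that the isolated scheme shortens each back-propagation path to a single module, while the joint scheme lets later losses shape earlier modules.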

V. ABLATION STUDY
We conduct ablation studies to investigate the respective contributions of multi-scale estimation, multi-branch diversity, and adversarial loss using the RESIDE-standard indoor dataset [27], which will be introduced in detail in Section VI-A. To eliminate the influence of other factors, all training configurations are kept the same as those presented in Section VI-B, including the total number of trainable parameters for each network. A more detailed analysis is provided in the supplementary material.

A. Multi-scale Estimation
As mentioned in Section III-A, a direct mapping can be highly unreliable, since the number of trainable parameters might be comparable to or even larger than the amount of available training data. To overcome this problem, a multi-scale network is applied in the first stage of IDS. Another important property of such coarse-to-fine estimation is local-global consistency: the coarse-scale network first estimates the holistic structure of the image scene, and then a fine-scale network performs refinement based on both local information and the coarse global estimation. To further study the influence of such a coarse-to-fine structure, we test the performance of the IDS framework without multi-scale estimation (w/o scale).
Following the ablation principle, we remove the coarse-scale network and make the fine-scale network deeper so that it has the same number of parameters. One output example is presented in Figure 5a, indicating that hard IDS w/o scale is able to recover the image reasonably well, but with some local inconsistency: the haze in the upper-left corner is not removed faithfully. This supports the above analysis that a multi-scale network is able to capture both local and global features. We present the PSNR and SSIM performance for hard IDS and soft IDS in Table II (a) and (b), respectively. It can be seen that IDS w/o scale performs worse than IDS (especially for soft IDS), indicating that the local inconsistency affects both the quantitative metrics and the perceptual quality.

B. Multi-branch Diversity
Using multi-scale estimation with MSE loss, one can realize domain shift to a certain extent. However, some important information may get lost along the way. To preserve information diversity, we introduce one more multi-scale branch and employ SSIM loss in this branch. This strategy enables a more precise inference of local details by providing distinctive confident neighborhoods identified by different branches. To further illustrate the effectiveness of this strategy, we test the performance of IDS without multi-branch diversity (w/o div).
Similarly, we remove the second branch and make the first branch deeper. One example is presented in Figure 5b, in which IDS w/o div delivers an erroneous detail inference: the "dark area" between the light and the wall clearly should not exist. This is further supported by the overall validation results shown in Table II, in which there is a large performance gap between IDS and IDS w/o div, indicating that it is well worth having two branches.

C. Adversarial Loss
The adversarial loss (together with the content loss) is employed at the last stage (i.e., the FusionNet) of the proposed IDS framework and serves to obtain high visual quality. The FusionNet takes the estimates from the two branches, in conjunction with the original hazy image, as the input and generates the final output with perceptually satisfactory high-frequency details via proper fusion. Since the estimates produced by the two branches are already in the neighboring domains of the target, the generator does not need to rely on pure "imagination" to create texture details; instead, it can, to a great extent, maintain perceptual realism rather than blindly pursue a higher PSNR [25].
To support this claim, we show that IDS without adversarial loss is able to produce a high PSNR but not better perceptual quality. Following the ablation principle, we construct IDS without adversarial loss (w/o adv) by simply removing the discriminator. As can be seen in Figure 5c, IDS w/o adv produces a slightly higher PSNR (26.508) than IDS (26.094), but clearly lower perceptual quality, as part of the wall is rendered "darker" in order to minimize the MSE distance. This demonstrates the generalization capability of the generator and provides further justification for the IDS framework.
To further illustrate the necessity of the adversarial loss, we compare with GridDehaze [30], a pure CNN-based dehazing method that does not adopt an adversarial loss to generate naturally distributed outputs. Figure 6 shows that the images generated by soft IDS tend to be closer to the ground truth, with fewer inconsistent color gradients on the road, sky, and wall. This is consistent with the observation that the adversarial loss is introduced to obtain better perceptual quality rather than to blindly pursue a higher PSNR value.
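Since the comparison above is stated in terms of PSNR, a minimal sketch of the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, may be useful for interpreting the reported gap. The function names and sample values are illustrative assumptions; only the two PSNR figures (26.508 and 26.094) come from the text.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio(psnr_gap_db):
    """Factor by which MSE grows when PSNR drops by psnr_gap_db."""
    return 10.0 ** (psnr_gap_db / 10.0)


# Toy example: an MSE of 0.01 on a [0, 1] scale corresponds to 20 dB.
example_psnr = psnr([0.5, 0.5], [0.6, 0.4])

# PSNR values quoted for IDS w/o adv and IDS in the Figure 5c example.
gap_ratio = mse_ratio(26.508 - 26.094)
```

Under this definition, the 0.414 dB difference corresponds to roughly 10 percent higher MSE for the full model on that example, a small numerical cost relative to the perceptual difference described above.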

D. Model Depth
This section is devoted to investigating the impact of model depth on the performance of our hard IDS method. By adjusting the number of convolutional and residual dense blocks, we construct shallow, medium, and deep models with 8 M, 10.5 M, and 15 M trainable parameters, respectively. Detailed specifications are shown in Table I. As expected, the deep model achieves the best overall performance in terms of both PSNR and SSIM. As illustrated in Figure 3, both PSNR and SSIM values improve as the number of parameters increases, which further supports the effectiveness of the IDS framework. It is worth mentioning that, albeit with fewer trainable parameters (around 4.3 M), soft IDS still manages to outperform hard IDS, as shown in Table III.

VI. EXPERIMENTS

A. Benchmark Dataset
For training and testing purposes, we use the RESIDE-standard dataset [27], which is a benchmark for single image dehazing. The indoor training set (ITS) of RESIDE-standard contains 13990 synthetic hazy indoor images (together with haze-free counterparts). These synthetic images are generated using NYU2 [34] and Middlebury stereo [41], with the medium extinction coefficient β chosen uniformly from (0.6, 1.8) and the global atmospheric light A chosen uniformly from (0.7, 1.0). The outdoor training set (OTS) of RESIDE-standard contains 296695 hazy images generated from 8477 clear counterparts, with β chosen uniformly from (0.04, 0.2) and A chosen uniformly from (0.8, 1.0). The testing set (SOTS) of RESIDE-standard contains 500 synthetic hazy indoor/outdoor images (together with haze-free counterparts). We also perform comparisons using the real-world hazy image dataset in [13] to show the perceptual difference.

B. Training Details
Our algorithm is implemented using the PyTorch library [35], and all tests are conducted on the same Nvidia Titan Xp GPU. We train the network with the following configuration: the Adam optimizer [22] is applied with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, using a mini-batch size of 10, a patch size of 180 × 180, and an initial learning rate of $10^{-4}$. For hard IDS, the learning rate decays by a multiplicative factor of 0.5 every 120 epochs for a total of 700 epochs, while soft IDS is trained for 100 epochs with the learning rate halved at the 60th, 80th, and 90th epochs. In addition, random horizontal/vertical flipping is applied for data augmentation. It is worth mentioning that the same random flip is applied to both the input and target images, so the training data remain paired. Therefore, such an augmentation strategy is not harmful to supervised training but helps expand the size of the training data.
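The two learning-rate schedules described above can be written as simple functions of the epoch index. The sketch below assumes zero-based epoch counting and that a decay takes effect at the start of the stated epoch; these conventions and the function names are assumptions, since the text does not specify them. The behavior corresponds to what step-based and milestone-based schedulers (such as PyTorch's `StepLR` and `MultiStepLR`) compute.

```python
def hard_ids_lr(epoch, base_lr=1e-4, step=120, gamma=0.5):
    """Learning rate multiplied by gamma every `step` epochs (hard IDS)."""
    return base_lr * gamma ** (epoch // step)


def soft_ids_lr(epoch, base_lr=1e-4, milestones=(60, 80, 90), gamma=0.5):
    """Learning rate multiplied by gamma at each milestone epoch (soft IDS)."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rates at a few representative epochs of each schedule.
hard_schedule = {e: hard_ids_lr(e) for e in (0, 119, 120, 699)}
soft_schedule = {e: soft_ids_lr(e) for e in (0, 59, 60, 80, 95)}
```

Over 700 epochs the hard IDS rate is halved five times, ending near $3.1 \times 10^{-6}$, whereas the soft IDS rate ends at $1.25 \times 10^{-5}$ after three halvings.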

C. RCAN as Substitute
The proposed IDS framework is generic in nature and admits many different concrete implementations. In this work, we have focused on a particular implementation with RDNs as fundamental building blocks. However, this is by no means the best possible one. Indeed, the performance of our IDS network can be further improved by adopting more powerful substitutes for RDNs. To demonstrate this, we replace the RDNs in soft IDS with residual channel attention networks (RCANs) [48] with the same number of trainable parameters. We further illustrate the effectiveness of adopting RCANs as a substitute in the following experimental results.

D. Evaluation on Benchmark Dataset
We train our network from scratch on the RESIDE-standard ITS and OTS, and validate it on the separate testing dataset SOTS. The quantitative and qualitative results are shown in Table III and Figure 7, respectively. Here hard IDS corresponds to the deep model in Table I, while soft IDS is as described in Section V-D. It can be seen from Table III that soft IDS outperforms the other methods under comparison in terms of PSNR and SSIM. In particular, the PSNR achieved by soft IDS reaches 34.74 on the SOTS indoor dataset. Moreover, with the boost from the RCAN substitute, RCAN IDS outperforms the others by a large margin.
As for visual quality, prior-based methods [17] overestimate the haze thickness, which results in color distortion (e.g., the color of the wall turns purple in the fifth row of Figure 7). Although some learning-based baseline methods [7], [26] avoid the color distortion problem, they tend to deliver unsatisfactory haze removal results for shaded regions. For example, in the seventh row of Figure 7, the area behind the arch should be dark; however, the restoration results produced by most baseline methods show a light color instead. This is probably because the baseline methods fail to correctly estimate the depth information and are consequently misled by the haze effect. GFN generates decent results and removes the haze in this area reasonably well. A possible explanation is that GFN does not rely on depth estimation for haze removal; it can also be attributed to the multi-scale approach adopted by GFN, which is an important ingredient of the IDS framework as well. Exploiting the full strength of IDS enables us to obtain better dehazing results. GridDehaze [30], PFD [10], and MSBDN [9] are the methods that can produce dehazed images comparable to ours. However, they still generate inconsistent color gradients on the venetian blinds in the fourth row. On the other hand, it can be seen in Figure 7 that our dehazed images can hardly be distinguished from the ground truth.

E. Evaluation on Real-world Photographs
We further show the dehazing results on real-world images in [13] to illustrate the generalization ability of IDS. In Figure 8, the prior-based method [17] introduces color distortion and over-enhancement.
It is clear that DehazeNet [7] and AOD-Net [26] fail to remove haze completely, especially in the last column, where heavy haze can still be seen around the haystack. Moreover, they also tend to over-enhance the images (e.g., the mountains in the fourth column). Although GridDehaze [30], PFD [10], and MSBDN [9] work well on the synthetic dataset, their generalization performance on real images is unsatisfactory. The red boxes in Figure 8 mark their weaknesses, which include color distortion, incomplete haze removal, and over-enhancement. We also notice that the proposed IDS is able not only to remove haze successfully, regardless of whether it is dense or light, but also to restore texture details faithfully, which further supports the effectiveness of our method.

F. Evaluation on Real-world Datasets
The evaluation is conducted on the O-Haze [3] and Dense-Haze [1] datasets. These two real-world datasets are challenging, since they contain limited training images (45 and 55, respectively) and vivid haze patterns. Therefore, the performance on the two datasets can be a good indication of the effectiveness of the proposed method. The training on the two datasets adopts the same strategies as introduced in Section VI-B. For a fair comparison, we do not use pre-trained weights or any data augmentation other than that introduced in Section VI-B. We present the quantitative and qualitative evaluation in Table V and Figure 9, respectively.
Results on NTIRE2018 O-Haze.We evaluate our proposed IDS on O-Haze dataset [3] following the data split in official NTIRE2018-Dehazing challenge [5].It can be observed in Table V that our IDS outperforms the other methods in terms of PSNR and SSIM. Figure 9a shows that our approach reconstructs faithful and sharp haze free images with good perceptual quality.
Results on NTIRE2019 Dense-Haze.In contrast to O-Haze that mostly contains light haze, Dense-Haze [1] records images with denser and more homogeneous haze layer.We follows NTIRE-2019 challenge [4] to conduct evaluation.Qualitative results in Figure 9b demonstrate that even if the background scene is occluded by thick haze, our IDS is still able to restore these region.In particular, since the second testing sample in Figure 9b is covered by severe haze, the background scene is almost invisible to human eyes.Nevertheless, our IDS surprisingly removes dense haze and reconstructs identifiable details.Quantitative comparisons in Table V illustrate that our IDS is the top performing method.
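As a reminder of what the PSNR figures measure, the following is a minimal sketch of the metric in plain Python. It is not the evaluation code used for Table V: the function names `psnr` and `mean_psnr`, the flat pixel lists and the assumed intensity range [0, 1] are illustrative assumptions, and the toy values are not taken from any dataset.

```python
import math


def psnr(restored, reference, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences."""
    if len(restored) != len(reference):
        raise ValueError("images must contain the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(restored, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr(pairs, max_value=1.0):
    """Average PSNR over (restored, reference) pairs of a testing set."""
    scores = [psnr(r, g, max_value) for r, g in pairs]
    return sum(scores) / len(scores)


# Toy data (hypothetical values): every pixel is off by 0.1.
ground_truth = [0.0, 0.0, 0.0, 0.0]
dehazed = [0.1, 0.1, 0.1, 0.1]
print(round(psnr(dehazed, ground_truth), 2))
```

A higher PSNR means a smaller pixel-wise error with respect to the ground truth. SSIM, the second metric reported, additionally compares local structure and is not covered by this sketch.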

G. Runtime
Table IV shows runtime comparisons on the SOTS dataset. Our method ranks third among DNN-based methods. It is worth mentioning that in our implementation, multi-scale estimation is performed branch by branch. A significant reduction in runtime is possible via a parallel implementation of multi-scale estimation in the two branches.
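The difference between branch-by-branch and parallel execution can be sketched as follows. This is an illustration under stated assumptions and not the implementation timed in Table IV: `run_branch` is a hypothetical stand-in that only sleeps to simulate a forward pass, and the branch names are placeholders for the two multi-scale estimation branches.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_branch(name, delay=0.05):
    """Hypothetical stand-in for one multi-scale estimation branch."""
    time.sleep(delay)  # simulates the cost of the branch's forward pass
    return f"{name}-estimate"


def branch_by_branch(names):
    """Run the branches one after another."""
    return [run_branch(n) for n in names]


def in_parallel(names):
    """Run the branches concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return list(pool.map(run_branch, names))


def timed(fn, *args):
    """Return the function result together with its wall-clock time in seconds."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


branches = ["mse_branch", "ssim_branch"]  # placeholder names
seq_out, seq_time = timed(branch_by_branch, branches)
par_out, par_time = timed(in_parallel, branches)
print(f"branch by branch: {seq_time:.3f}s, parallel: {par_time:.3f}s")
```

Both variants return the same estimates; only the wall-clock time differs. The actual gain for a real network depends on the hardware and framework, for instance on whether the branches can run concurrently on the same GPU.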

VII. CONCLUSION
In this paper, we show that traditional direct mapping methods cannot provide an accurate mapping for image dehazing. To solve this problem, an indirect domain shift (IDS) method is proposed by adding explicit loss functions inside a deep CNN model to guide the dehazing process. Multi-scale estimation, multi-branch diversity, and adversarial loss play important roles in this method, as shown by the ablation studies. We also propose two training schemes, which have their respective advantages. Specifically, hard IDS is less demanding in terms of computational resources and alleviates the gradient vanishing problem. Besides, hard IDS is designed according to our theoretical formulation, and its success provides a strong empirical indication of the correctness of our indirect domain shift mechanism. On the other hand, soft IDS is easier to implement and in general yields better performance. We show that IDS achieves remarkable improvements over the state of the art on five dehazing datasets. Despite the success of our method, the visual performance of IDS is not completely satisfactory on the Dense-Haze dataset. Since deep learning methods often require large-scale datasets for training, we believe the performance of our method on the Dense-Haze dataset can be further improved by simply acquiring more training samples. From another perspective, one interesting direction for our future work is to enhance the IDS framework to enable good generalization with limited training data.

Fig. 1: One example of the proposed IDS network. (a) and (b) are the multi-scale estimation with MSE and SSIM loss, respectively. (d) is the FusionNet with adversarial and content loss. (c) shows the legend.

Fig. 2: The isolated training of one iteration in hard IDS.

Fig. 5: Some output examples of Hard IDS without multi-scale estimation (w/o scale), without multi-branch diversity (w/o div), only with adversarial loss (o/w adv), and without adversarial loss (w/o adv) in the ablation study, respectively.

Fig. 7: The output examples from SOTS indoor testing set of the SOTA methods.

Fig. 8: The output examples from real-world images in Fattal et al. [13] to compare with SOTA DNN-based methods.
Fig. 9: (a) The output samples from the O-Haze testing set. (b) The output samples from the Dense-Haze testing set.

TABLE I: The configuration of the shallow, medium, and deep Hard IDS corresponding to Figure 3.

TABLE II: Ablation studies on the SSIM/PSNR performance. The best performance is shown in bold, while the second-best results are underlined.

TABLE III: The SSIM/PSNR performance of different methods on SOTS-indoor and SOTS-outdoor. Our proposed methods and the improved network with RCAN outperform the others.

TABLE V: The SSIM/PSNR performance of different methods on the O-Haze and Dense-Haze datasets. Our proposed methods outperform the others.