Guided Image Deblurring by Deep Multi-Modal Image Fusion

Estimating sharp images from blurry observations remains a difficult task in image processing. Previous works may produce deblurred images that lose details or contain artifacts. A feasible solution is to seek the help of additional images, such as near-infrared or flash images. In this paper, we propose a fusion framework for image deblurring, called Guided Deblurring Fusion Network (GDFNet), which integrates multi-modal information for better image deblurring performance. Unlike previous works that directly compute a deblurred image from paired multi-modal degraded and guidance images, GDFNet employs image fusion techniques to obtain the deblurred image. GDFNet combines the advantages of single and guided image deblurring by fusing their pre-deblurred streams using a convolutional neural network (CNN). We adopt a blur/residual image splitting strategy that fuses the residual images to enhance the representation ability of the encoders and preserve details. We employ a 2-level coarse-to-fine reconstruction strategy that improves the fusion and deblurring performance by supervising the multi-scale output. Quantitative comparisons on multi-modal image datasets show that GDFNet recovers correct structures and produces fewer artifacts while preserving more details. The peak signal-to-noise ratio (PSNR) of GDFNet evaluated on the blurry/flash dataset is at least 0.9 dB higher than that of the compared algorithms.


I. INTRODUCTION
Image deblurring aims to recover sharp images from blurred observations degraded by camera/object movements or lens defocus. Under the uniform blur assumption, the blurry image $I$ can be modeled as the convolution of the sharp latent image $F$ and a point spread function (PSF) $k$,
$$I = F * k + n, \tag{1}$$
where $n$ represents the noise and $*$ denotes the convolution operator. Generally, the blurring degradation can be categorized into out-of-focus blur and motion blur. Out-of-focus blur occurs when the image plane is away from the ideal reference plane. Motion blur is caused by relative motions between the scene and camera during exposure.
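The uniform blur model (a sharp image convolved with a PSF, plus noise) can be illustrated with a small sketch. The pure-Python code below is only an illustration under stated assumptions: the image is a grayscale nested list, borders are zero-padded, the noise is Gaussian, and the helper names `convolve2d` and `blur` are hypothetical rather than taken from the paper.

```python
import random

def convolve2d(image, kernel):
    """'Same'-size 2D convolution with zero padding (assumed border handling)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # True convolution: the kernel is flipped relative to the image.
                    yy, xx = y + ph - i, x + pw - j
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out

def blur(sharp, psf, noise_sigma=0.0, seed=0):
    """I = F * k + n with additive Gaussian noise (noise model is an assumption)."""
    rng = random.Random(seed)
    blurred = convolve2d(sharp, psf)
    return [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in blurred]

# A single bright pixel blurred by a horizontal 1x3 motion PSF, without noise.
sharp = [[0.0] * 5 for _ in range(5)]
sharp[2][2] = 1.0
psf = [[1 / 3, 1 / 3, 1 / 3]]
blurry = blur(sharp, psf)
```

In this toy case the energy of the bright pixel is spread evenly over three horizontal neighbors, which is the behavior a motion PSF is meant to capture.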
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
Restoring the sharp image from a single blurry image is severely ill-posed, as it requires estimating both the point spread function $k$ and the sharp image $F$ simultaneously. The results of single blind image deblurring algorithms, such as the multi-input multi-output U-Net (MIMO) [1], the multi-stage progressive restoration network (MPR) [2], and a deep neural network embedded with residual Fourier transform (DeepRFT) [3], usually lose details during the process of removing motion blur. Guided image deblurring algorithms handle deblurring with the help of the additional structure and texture information provided by different wavelength sensors or camera settings. For example, near-infrared (NIR) sensors are highly adaptable to thick fog and darkness due to their different wavelength sensitivity, and flash imaging captures a clear picture by changing the environment illumination. Previous work has used aligned multi-modal image pairs such as flash/no-flash image pairs [4], [5], RGB/NIR image pairs [6], and blurry/noisy image pairs [7] to alleviate the ill-posedness of the blind image deblurring problem. However, the original visible information such as pixel intensity and texture is inaccurate in the guidance images due to structural inconsistency caused by the noise in NIR imaging [8], the reflectance differences, and the object movements. This is why integrating information from guidance images into degraded images sometimes produces artifacts or halos. This observation inspires us to deblur images in a fusion manner.
In this work, we propose a deep fusion network, called Guided Deblurring Fusion Network (GDFNet), to perform joint image deblurring by fusing the pre-deblurred images obtained by multiple image deblurring streams. It is motivated by the fact that the single image deblurring stream cannot effectively recover detailed contents, while the guided deblurring stream produces incorrect low-frequency content due to structural inconsistency. In comparison, our proposed GDFNet can effectively address these two problems through an image fusion mechanism. Inspired by residual learning [9] and the frequency principle [10], we embed a blur/residual image splitting strategy in GDFNet to estimate the fusion weights of residual images and enhance the representation ability of the encoders. We use a coarse-to-fine reconstruction strategy to generate finer fusion weight maps by training the network using multi-scale supervision. Experimental results show that our GDFNet outperforms the competitors, including blind deblurring algorithms, cascaded algorithms, and other fusion networks, on multi-modal datasets. In summary, the main contributions of this work are as follows:
• We propose a deep fusion framework GDFNet to deal with image deblurring by fusing the pre-deblurred streams of single and guided image deblurring using a multi-modal image pair as input.
• We employ a blur/residual splitting strategy that fuses the pre-deblurred residual images to enhance the representation ability, together with a coarse-to-fine reconstruction structure trained using multi-scale supervision, which improves the deblurring performance by generating finer fusion weights.
• We experimentally show that the GDFNet can fuse single and guided image deblurring streams, and outperforms the existing deblurring algorithms and fusion approaches on multi-modal datasets.
The organization of the paper is as follows. We review the related work in Section II. Section III presents the motivation, framework and details of the proposed method. Section IV illustrates the experimental results, and finally Section V concludes the work.

II. RELATED WORK
In this section, we provide a brief review of the work related to image deblurring and image fusion.

A. IMAGE DEBLURRING
Image deblurring techniques can be coarsely classified into two categories: single image deblurring and guided image deblurring. Single blind image deblurring refers to restoring the latent image from its degraded observation. Early works usually employ an alternating framework to estimate the blur kernel and latent image iteratively based on natural image priors. A heavy-tailed distribution on image gradients regularizes the iterative optimization for deblurring [11]. This regularization term is further improved by fitting the distribution using a hyper-Laplacian function [12]. The work [13] fits the logarithmic density of image gradients by concatenating two piece-wise continuous functions as a prior. L0 distribution is used to approximate the image gradients and intensity in [14] and [15]. Later, Pan et al. [16] and Yan et al. [17] define the dark channel and extreme channel and apply an L0 sparse constraint on the intensities. Bai et al. [18] use coarse-to-fine priors and recover the latent image using a multi-scale image pyramid. Levin et al. [19] modify a conventional lens by inserting a patterned disc into the aperture to produce a characteristic distribution of image frequencies that is very sensitive to defocus blur. Zhang et al. [20] estimate the sharp image using multiple blurry observations with a coupled sparse prior.
However, the recovered latent image can be visually poor when the kernel estimation is inaccurate. Recently, CNNs have been widely applied in image processing and computer vision tasks. A cascaded network is adopted to estimate the latent image and blur kernel iteratively in [21]. Two generative networks are used to capture the blur kernel and the latent image in [22]. The work [23] uses a scale-recurrent network that shares network weights across scales. Min et al. decompose the low- and high-frequency information using a wavelet transform followed by a recursive convolutional neural network to deblur. Further, a Multi-Input Multi-Output (MIMO) U-Net [1] is presented to deblur images in a coarse-to-fine strategy. Ople et al. [24] extract multi-scale features using dilated convolutions with different dilation rates.

FIGURE 2. The architecture of our proposed Guided Deblurring Fusion Network (GDFNet) using RGB/NIR image pairs as an example. First, the blurry RGB image and the sharp NIR guidance image are fed into three pre-deblurring streams to predict three pre-deblurred images. Then the residual images are computed by subtracting the blurry RGB image from the pre-deblurred images. Three individual encoders are used to extract the features of these residual images. Then the features of the three streams are concatenated, followed by two convolutional blocks for aggregation. We use three decoders to estimate the fusion weights of the three residual images of the pre-deblurred images in a coarse-to-fine scheme. Finally, the composite residual images are added to the original blurry input to generate the final fused deblurred result.
Liu et al. [25] refine the optimization-based deblurred results using an encoder-decoder network. A multi-stage architecture called MPRNet progressively learns restoration functions for the degraded inputs in [2]. A residual fast Fourier transform with convolution block is introduced in DeepRFT [3] to integrate both low- and high-frequency residual information. Saqlain et al. [26] introduce a generative adversarial network (GAN) based approach called DeblurFusedGAN (DFGAN) that fuses a lightweight attention (LSA) mechanism and gradient-based filters in the generator network. Wang et al. [27] introduce Uformer, a transformer-based architecture for image deblurring. It uses a locally-enhanced window (LeWin) transformer block and a learnable multi-scale restoration modulator to capture both local and global dependencies. Tsai et al. [28] construct a transformer-based architecture using intra- and inter-strip tokens to capture blurred patterns with different orientations. Chen et al. [29] introduce an efficient Nonlinear Activation Free Network (NAFNet) that lowers the computational cost and removes unnecessary activation functions. Chu et al. [30] investigate the distribution differences in the features between training and inference and introduce a Test-time Local Converter (TLC).
Guided image deblurring algorithms introduce additional information from the guidance image to facilitate image deblurring. The paired images used in image deblurring are multi-modal, such as RGB/NIR [31] and blurry/flash [32], [33]. However, extraneous artifacts can appear when the guidance and input images are captured in different spectra or have inconsistent structures. In the pioneering work [4], a robust flash gradient constraint is introduced to solve the flash deblurring problem by performing kernel estimation and non-blind deconvolution iteratively. Guided filtering [32] can be applied to deblur images by calculating the local linear model between two inputs. Further, a CNN-based joint filtering algorithm is designed to deblur images by estimating the coefficients of the spatially variant linear representation model (SVLRM) [5].

B. IMAGE FUSION
Image fusion can be roughly separated into three tasks: feature extraction, fusion rules, and feature reconstruction. Traditional fusion algorithms use domain transform approaches such as the wavelet transform, Laplacian pyramid decomposition, and guided filtering as feature extraction components. Recently, fusion algorithms based on deep learning have been introduced to improve the ability of feature representation, like DenseFuse [34], or to directly fuse images in an end-to-end manner [35]. Zhou et al. [36] introduced an algorithm called target-aware decomposition and parallel gradient fusion (TAD-PGF), which fuses infrared and visible images using the L0 filter, the weighted least squares (WLS) filter, and parallel gradient fusion. U2Fusion [37] solves different fusion problems using one fusion network in an unsupervised manner. In [38], a pair of infrared and visible images are used to fuse the obvious object information based on multi-level Gaussian curvature filtering image decomposition. Tang et al. [39] introduce a semantic-aware image fusion network (SeAFusion), which leverages the semantic segmentation task to guide the image fusion with a gradient residual dense block (GRDB). More multi-modal image fusion techniques in medical imaging are discussed in Tirupal et al. [40] and Srinvasu et al. [41].
In this work, we integrate the image deblurring and image fusion techniques and propose a deep learning based image fusion algorithm to deblur images by fusing the pre-deblurred streams.

III. PROPOSED METHOD

A. GDFNet
The idea of deblurring by image fusion is motivated by the following observations. Single image deblurring algorithms usually recover only low-frequency information or unreliable textures, since the degradation corrupts the details and there are few clues to recover them. In contrast, guided deblurring algorithms preserve more high-frequency components such as edges and textures according to the guidance image. However, they produce artifacts due to the structural inconsistency and the differences in wavelength sensitivities between multi-modal images. To overcome these drawbacks, we fuse these deblurred images into $I_{fuse}$, which is represented as the linear combination of three pre-deblurred images $J_1$, $J_2$, and $J_3$,
$$I_{fuse} = \omega_1 \odot J_1 + \omega_2 \odot J_2 + \omega_3 \odot J_3, \tag{2}$$
where $\omega_1$, $\omega_2$, and $\omega_3$ represent the corresponding fusion weights and $\odot$ is the element-wise product. The pre-deblurred images $J_1$, $J_2$, and $J_3$ can be represented as
$$J_1 = N_1(I_{blur}), \tag{3}$$
$$J_2 = N_2(I_{blur}, I_{guid}), \tag{4}$$
and
$$J_3 = N_3(J_1, I_{guid}), \tag{5}$$
where $N_1$, $N_2$, and $N_3$ represent three pre-deblurring networks, $I_{blur}$ is the blurry image, and $I_{guid}$ is the guidance image.
Since the pre-deblurred images computed by multiple image deblurring streams are all estimates of the sharp image, they are similar in low-frequency content. Therefore, small differences in fusion weights can greatly affect the fused details, which makes the image deblurring less robust. As we use a CNN to predict the fusion weights, the network tends to fit target functions from low frequencies to high frequencies, according to the frequency principle claimed by Xu et al. [10]. The success of residual learning [9], [42] inspires us to use an effective blur/residual image splitting strategy in GDFNet to focus on high-frequency component fusion. The blurry image itself can be regarded as the low-frequency component of the estimated sharp latent image, and the difference between a pre-deblurred image and the blurry image is a reasonable initial guess of the high-frequency component. Correspondingly, the residual $R_i$ is defined in the form
$$R_i = J_i - I_{blur}, \quad i = 1, 2, 3. \tag{6}$$
Combining (2) and (6), the fused deblurred image $I_{fuse}$ can be expressed as
$$I_{fuse} = I_{blur} + \sum_{i=1}^{3} \omega_i \odot R_i. \tag{7}$$
In this way, we encourage GDFNet to focus on the fusion of high-frequency components, which makes the convergence faster and the deblurring performance better.

The overall framework of our GDFNet is illustrated in Fig. 2. The network takes the blurry image $I_{blur}$ and the guidance image $I_{guid}$ as input to compute a deblurred image $I_{fuse}$ through image fusion. We use three deblurring streams to generate three different pre-deblurred images for image fusion. A single deblurring stream takes $I_{blur}$ as input and predicts a coarsely deblurred image $J_1$. A guided deblurring stream takes $I_{blur}$ and $I_{guid}$ as input and jointly estimates an edge-preserving deblurred image $J_2$. Another guided deblurring stream takes $I_{guid}$ and the output $J_1$ of the single deblurring stream as input to recover another pre-deblurred image $J_3$.
Then we compute the residual images $R_1$, $R_2$, and $R_3$ of the three pre-deblurred images $J_1$, $J_2$, and $J_3$, respectively, by subtracting the blurry observation $I_{blur}$ from them. The features of these residual images are extracted by three individual encoders and aggregated by a feature concatenation layer. Since the aggregated features contain the information of all three pre-deblurred images, they can be used to predict the fusion weights $\omega_1$, $\omega_2$, and $\omega_3$ of the three residual images. We compute the weighted summation of the multi-scale residual images using the coarse-to-fine fusion weights to obtain multi-scale fused residual images. During training, we add the multi-scale blurry inputs to them for multi-scale supervision, which yields finer fusion weights. During inference, a single-scale extraction step adds the blurry input $I_{blur}$ to the fused residual image at the original scale to generate the final fused output image $I_{fuse}$.
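The blur/residual splitting described in this subsection can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: images are single-channel nested lists, the weight maps are given by hand instead of being predicted by a network, and the function names `residual` and `fuse` are hypothetical. It computes each residual as the pre-deblurred image minus the blurry image, weights the residuals element-wise, and adds the result back to the blurry input.

```python
def residual(pre_deblurred, blurry):
    """R_i = J_i - I_blur, computed element-wise."""
    return [[j - b for j, b in zip(row_j, row_b)]
            for row_j, row_b in zip(pre_deblurred, blurry)]

def fuse(blurry, pre_deblurred_list, weight_list):
    """I_fuse = I_blur + sum_i (w_i * R_i), with element-wise products."""
    fused = [row[:] for row in blurry]
    for J, W in zip(pre_deblurred_list, weight_list):
        R = residual(J, blurry)
        for y, row in enumerate(R):
            for x, r in enumerate(row):
                fused[y][x] += W[y][x] * r
    return fused

# Toy 1x2 example with hand-picked weights (in GDFNet they are predicted).
i_blur = [[0.5, 0.5]]
j1, j2, j3 = [[0.6, 0.4]], [[0.9, 0.1]], [[0.7, 0.3]]
ones, zeros = [[1.0, 1.0]], [[0.0, 0.0]]

only_j1 = fuse(i_blur, [j1, j2, j3], [ones, zeros, zeros])
no_change = fuse(i_blur, [j1, j2, j3], [zeros, zeros, zeros])
```

With all weights at zero the output falls back to the blurry input, which shows why the network only has to learn how much high-frequency residual to add from each stream.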

B. PRE-DEBLURRING STREAMS
To generate the pre-deblurred images for image fusion, we use three pre-deblurring streams, including one single image deblurring stream and two guided image deblurring streams. All of the streams are replaceable and can be implemented using state-of-the-art approaches.
The single image deblurring stream $N_1$ provides an estimate $J_1$, which cannot restore the details completely because most of the high-frequency components are lost during degradation. We use the guided image deblurring network $N_2$ to recover more details by taking the concatenation of the blurry observation $I_{blur}$ and the guidance image $I_{guid}$ as input. The deblurred image $J_2$ embeds the information of the guidance image. Since the structure in the blurry input is not reliable, this stream tends to use the low-frequency contents in $I_{guid}$. However, these contents may be wrong due to object movements and sensor differences. As a result, the pre-deblurred image $J_2$ can be inconsistent with the ground truth in structure and color, and can contain halos and fake shadows. Therefore, we employ another guided image deblurring network $N_3$, which uses the output $J_1$ of the single deblurring stream and the guidance image $I_{guid}$ to estimate $J_3$. The pre-deblurred image $J_1$ is a prediction of the sharp image containing coarse but correct structures. Thus, this guided stream learns to trust the structures in $J_1$ rather than those in $I_{guid}$, because the structures in $I_{guid}$ can be wrong in some cases, such as object movements. In addition, $I_{guid}$ is less likely to corrupt the structure since the inconsistency between $J_1$ and $I_{guid}$ is reduced. Therefore, the output $J_3$ provides better estimates and different information compared to the other two streams.
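The wiring of the three streams can be summarized as a small sketch. The function `run_pre_deblurring` and the stand-in callables below are hypothetical and only trace which inputs each stream receives; in practice each callable would be a trained network such as the ones named in the experiments.

```python
def run_pre_deblurring(i_blur, i_guid, n1, n2, n3):
    """Return (J1, J2, J3) given three replaceable deblurring callables."""
    j1 = n1(i_blur)           # single image deblurring
    j2 = n2(i_blur, i_guid)   # guided deblurring on the blurry input
    j3 = n3(j1, i_guid)       # guided deblurring on the output of n1
    return j1, j2, j3

# Stand-ins that only record their inputs, to make the data flow visible.
def n1_stub(blur):
    return ("N1", blur)

def n2_stub(blur, guid):
    return ("N2", blur, guid)

def n3_stub(j1, guid):
    return ("N3", j1, guid)

j1, j2, j3 = run_pre_deblurring("I_blur", "I_guid", n1_stub, n2_stub, n3_stub)
```

Because the streams are passed in as arguments, swapping one deblurring network for another does not change the surrounding fusion code, which mirrors the replaceability claimed for the streams.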

C. FEATURE AGGREGATION
Based on our blur/residual image splitting strategy, the residual images of the three pre-deblurred streams are fed into the fusion network and encoded by three independent encoders. We choose to use individual encoders because the three pre-deblurred residual images focus on recovering different contents of the sharp image. Fig. 3 illustrates the details of the feature aggregation architecture. Each encoder contains 6 convolutional blocks and 2 max-pooling layers, and computes features with 64 channels whose spatial size is 1/4 of the input size. The features of the three residual images are then concatenated, followed by two convolutional blocks that reduce the channel dimension. The entire feature aggregation process is represented as
$$F_{fusion} = \mathrm{conv}\big(N_{E_1}(R_1) \oplus N_{E_2}(R_2) \oplus N_{E_3}(R_3)\big), \tag{8}$$
where $N_{E_1}$, $N_{E_2}$, and $N_{E_3}$ are the corresponding encoders of the three residual images, each of which includes two downsampling layers, $\oplus$ denotes the concatenation operation, and $\mathrm{conv}(\cdot)$ denotes the convolutional operator that reduces the channel dimension to 64. We use these encoders to keep the information of the three streams for the subsequent fusion weight generation. The blur/residual image splitting strategy enhances the feature representation ability by directly extracting features from the residual images.
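The tensor shapes implied by this description can be traced with a short sketch. It assumes that each max-pooling layer halves the height and width (2×2 pooling with stride 2) and that the convolutions preserve spatial size; both are assumptions, and the helper names are hypothetical. The 128 × 128 input corresponds to the training patch size reported in the implementation details.

```python
def encoder_output_shape(height, width, channels=64, num_pools=2):
    """Shape (C, H, W) after an encoder with `num_pools` halving layers."""
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return channels, height, width

def aggregated_shapes(height, width, num_streams=3, channels=64):
    """Shapes after concatenating the stream features and after channel reduction."""
    c, fh, fw = encoder_output_shape(height, width, channels)
    concat_shape = (num_streams * c, fh, fw)   # channel-wise concatenation
    fusion_shape = (channels, fh, fw)          # conv reduces channels back to 64
    return concat_shape, fusion_shape

concat_shape, fusion_shape = aggregated_shapes(128, 128)
```

Under these assumptions the concatenated tensor has 192 channels at a quarter of the input resolution, and the reduced fusion features have 64 channels at the same resolution.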

D. COARSE-TO-FINE RECONSTRUCTION
We adopt a coarse-to-fine reconstruction strategy that can further improve the quality of fused deblurred images [43]. The architecture of the coarse-to-fine reconstruction and its decoders for fusion weight generation is shown in Fig. 4. We use three individual decoders that contain 5 convolutional blocks and upsampling layers for every scale level. Each of them generates 2-scale features by applying the decoder $N_{D_i}^k$ to $F_{fusion}^k$, where $k$ represents the scale level, $N_{D_i}^k$ is the $i$-th ($1 \le i \le 3$) decoder at level $k$, and $F_{fusion}^k$ is the $2^k$-times upsampling of $F_{fusion}$.
The fusion weights $\omega_i^k$ of the residual images at scale level $k$ are computed using a tanh activation function instead of a ReLU because both positive and negative values must be preserved for fusion. In the expression for the weights, $\omega_i^k$ is the fusion weight map of the $i$-th residual image at scale level $k$, $N_{tanh}^k$ is the tanh activation function at level $k$, $(\cdot)\uparrow$ represents the upsampling operation, and $\oplus$ denotes concatenation. The fused deblurred image is computed by adding the residual images of the three pre-deblurred images, weighted by $\omega_i^k$, to the blurred observation,
$$I_{fuse}^k = I_{blur}^k + \sum_{i=1}^{3} \omega_i^k \odot R_i^k, \quad k \in \{0, 1\},$$
where $R_i^k$ represents the corresponding residual image at scale level $k$ downsampled using max-pooling, $I_{blur}^k$ is the blurry input at the same scale, $I_{fuse}^k$ represents the 2-level fused deblurred images, and $\odot$ denotes the element-wise product. Both deblurred images are used for back propagation in training, but only the original-scale image is needed in inference.
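Two details of this subsection lend themselves to a short illustration: why tanh is preferred over ReLU for the weights, and how a residual image is brought to the coarser scale by max-pooling. The sketch below assumes 2×2 pooling with stride 2 and uses hand-written pre-activation values in place of decoder outputs; the function names are hypothetical.

```python
import math

def max_pool2(img):
    """2x2 max-pooling with stride 2 (assumed pooling configuration)."""
    h, w = len(img), len(img[0])
    return [[max(img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1])
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]

def tanh_weights(logits):
    """Weights in (-1, 1): a stream's residual can be added or subtracted."""
    return [[math.tanh(v) for v in row] for row in logits]

def relu_weights(logits):
    """Weights in [0, inf): negative contributions are discarded."""
    return [[max(0.0, v) for v in row] for row in logits]

# A 4x4 residual image and its coarser 2x2 version.
residual_full = [[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [-1, -2, -3, -4],
                 [-5, -6, -7, -8]]
residual_half = max_pool2(residual_full)

# Hand-written pre-activation values standing in for decoder outputs.
logits = [[-2.0, 0.5]]
w_tanh = tanh_weights(logits)
w_relu = relu_weights(logits)
```

The negative pre-activation survives as a negative weight under tanh but is clamped to zero under ReLU, so only tanh lets the fusion subtract a residual that points in the wrong direction.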

E. LOSS FUNCTION
We employ the $L_1$ loss, structural similarity index measure (SSIM), learned perceptual image patch similarity (LPIPS) [44], and a frequency-domain loss function on the multi-scale output $I_{fuse}^k$ [43] to train GDFNet. The multi-scale LPIPS loss can be represented as
$$\mathcal{L}_{LPIPS} = \sum_{k} P\big(I_{fuse}^k, I_{gt}^k\big), \tag{12}$$
where $k$ and $P(\cdot)$ denote the scale level and the LPIPS network, respectively, and $I_{gt}^k$ denotes the ground-truth sharp image at scale $k$. LPIPS uses a pre-trained network to evaluate the perceptual similarity between two images. Recent studies show that reducing the frequency-domain discrepancy is essential for restoring the lost high-frequency components [1], [3]. We adopt the multi-scale frequency reconstruction (MSFR) loss function on our multi-scale output,
$$\mathcal{L}_{MSFR} = \sum_{k} \frac{1}{M^k} \left\| \mathcal{F}\big(I_{fuse}^k\big) - \mathcal{F}\big(I_{gt}^k\big) \right\|_1, \tag{13}$$
where $\mathcal{F}(\cdot)$ denotes the Fourier transformation and $M^k$ denotes the number of elements at scale $k$. The total loss function combines the $L_1$, SSIM, LPIPS, and MSFR terms. The balance parameters are empirically set as $\lambda_1 = 0.2$, $\lambda_2 = 0.1$, and $\lambda_3 = 0.1$.
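A frequency-domain reconstruction loss of this kind can be sketched without any external library by using a naive 2D discrete Fourier transform. The code below is an illustrative assumption rather than the training code: it uses the magnitude of the complex difference as the per-element distance, normalizes by the number of elements at each scale, and sums over scales. The names `dft2`, `frequency_l1`, and `msfr_loss` are hypothetical, and the naive transform is only practical for tiny images.

```python
import cmath

def dft2(img):
    """Naive 2D discrete Fourier transform (O(N^2) per bin, for illustration)."""
    h, w = len(img), len(img[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * cmath.pi * (u * y / h + v * x / w)
                    s += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = s
    return out

def frequency_l1(pred, target):
    """L1 distance between spectra, divided by the number of elements."""
    fp, ft = dft2(pred), dft2(target)
    m = len(pred) * len(pred[0])
    total = sum(abs(a - b) for ra, rb in zip(fp, ft) for a, b in zip(ra, rb))
    return total / m

def msfr_loss(preds, targets):
    """Sum of per-scale frequency losses over a list of scales."""
    return sum(frequency_l1(p, t) for p, t in zip(preds, targets))

# A constant offset only changes the DC bin of the spectrum.
ones = [[1.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
loss_offset = msfr_loss([ones], [zeros])
loss_same = msfr_loss([ones], [ones])
```

In a real training setup the transform would be a fast Fourier transform on tensors, and the per-element distance might be taken on the real and imaginary parts separately.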

IV. EXPERIMENTS
In this work, the three pre-deblurred streams $N_1$, $N_2$, and $N_3$ in GDFNet are flexible and can be replaced by other deblurring streams. To evaluate our algorithm, we conduct experiments using MIMO [1], MPR [2], and DeepRFT [3] as $N_1$. As for the choice of the guided deblurring networks $N_2$ and $N_3$, we use SVLRM [5] to compute $J_2$ and $J_3$. We evaluate GDFNet on popular public multi-modal image datasets including RGB/NIR [31] and flash/ambient [33]. Our method is compared with the single image deblurring algorithms MIMO [1], MPR [2], DeepRFT [3], Uformer [27], and NAFNet [29], the guided image deblurring algorithm SVLRM [5], and combinations such as MIMO+SVLRM, MPR+SVLRM, and DeepRFT+SVLRM. We also compare our fusion network with DenseFuse [34] and SeAFusion [39] using the same input as GDFNet. All compared networks are retrained on the same datasets for fair comparisons.

A. IMPLEMENTATION DETAILS
The experiments are conducted on an Intel Xeon Silver 4210R CPU @ 2.40GHz with 64GB memory and an NVIDIA Quadro RTX 8000. We implemented our network using PyTorch [45]. It is trained using the Adam optimizer [46] with an initial learning rate of $1 \times 10^{-4}$, a batch size of 64, and a maximum of 200 training epochs. The training images are of size 128 × 128. We use zero-padding in convolutional layers.
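For reference, the reported hyperparameters can be collected in a small configuration object. This is a sketch only: the `TrainConfig` class, its field names, and the helper `iterations_per_epoch` are hypothetical, and the number of training patches used in the example is an arbitrary placeholder rather than a value from the paper.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    optimizer: str = "adam"        # Adam optimizer
    learning_rate: float = 1e-4    # initial learning rate
    batch_size: int = 64
    max_epochs: int = 200
    patch_size: int = 128          # training images are 128 x 128
    padding_mode: str = "zeros"    # zero-padding in convolutional layers

def iterations_per_epoch(num_patches, cfg):
    """Number of optimizer steps needed to see every patch once."""
    return math.ceil(num_patches / cfg.batch_size)

cfg = TrainConfig()
steps = iterations_per_epoch(1000, cfg)  # 1000 is a placeholder patch count
```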

B. DATASETS
We evaluate GDFNet on two multi-modal image pair datasets. The RGB-NIR scene dataset [31] contains registered RGB and NIR image pairs. They are collected using several digital single-lens reflex (DSLR) cameras with and without infrared blocking filters in separate exposures. Although the cameras are mounted on tripods, there are still small misalignments between the RGB and NIR images. Many algorithms have been introduced to register multi-modal images, such as [47], [48], and [49]. This dataset applies a feature-based alignment algorithm [50] to register these image pairs. We generate the blurry input by computing the convolution of the sharp RGB images and the ground truth blur kernels in [51]. Fig. 5 shows the blur kernels used in our experiments. For each scene, we randomly select one kernel to blur the RGB image, and take the NIR image as the guidance reference image. We discard the image pairs with large misalignment in the RGB/NIR dataset.

The blurry/flash dataset is generated using the ambient and flash illumination image pairs presented in [33]. The ambient and flash dataset is collected using DSLR cameras controlled by a mobile app that sequentially captures the flash and no-flash photographs with a small delay between the two exposures. An improved version of the dual inverse compositional alignment algorithm (DIC) [52] is used to correct the misalignment. We blur the ambient image using the 20 ground truth blur kernels provided in [53], as shown in Fig. 6, which are resized to 13 × 13, 19 × 19, and 25 × 25 to increase the diversity of degradation types. For the guidance image, we multiply the intensities of the flash images by 1.5 to remove some structures by over-exposure, which makes the image deblurring more challenging.
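The over-exposure of the flash guidance image can be sketched as a gain followed by clipping. The code below assumes intensities normalized to [0, 1] and assumes that values above the valid range are saturated, which is how structures would be lost; it also shows a kernel normalization step that is commonly applied after resizing a blur kernel, although the paper does not state it. All function names are hypothetical.

```python
KERNEL_SIZES = (13, 19, 25)  # kernel sizes reported for the blurry/flash dataset

def overexpose(flash, gain=1.5, max_value=1.0):
    """Scale intensities and saturate at max_value (clipping is an assumption)."""
    return [[min(v * gain, max_value) for v in row] for row in flash]

def normalize_kernel(kernel):
    """Rescale a non-negative kernel so that its entries sum to one (assumed step)."""
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

# Bright regions saturate, so their internal structure disappears.
flash = [[0.2, 0.6, 0.8, 0.9]]
guidance = overexpose(flash)

kernel = normalize_kernel([[1.0, 2.0], [3.0, 2.0]])
```

The two brightest pixels in the toy row both map to the maximum value, so the difference between them is no longer available to the guided streams.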

C. EVALUATION ON RGB/NIR DATASET
We evaluate our GDFNet on the RGB/NIR dataset [31], which includes 300 scenes. The deblurring streams $N_1$ and $N_2$ are trained on 40 randomly chosen image pairs and validated on 10 image pairs. We choose the trained network with the best PSNR on the validation dataset. After $N_1$ is determined, $N_3$ takes the output $J_1$ and the guidance NIR image $I_{guid}$ as input and is trained on another 50 image pairs. Then we fix the parameters of these three pre-deblurred streams and train GDFNet using the corresponding pre-deblurred images of 50 scenes. The remaining 150 scenes are used for testing.
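The scene partition described above can be reproduced with a simple disjoint random split. The sketch below is an assumption about how such a split might be done, not the authors' script: the function `split_scenes`, the subset names, and the fixed seed are hypothetical, while the subset sizes follow the numbers in the text.

```python
import random

def split_scenes(scene_ids, sizes, seed=0):
    """Shuffle scenes and cut them into disjoint named subsets; the rest is 'test'."""
    ids = list(scene_ids)
    random.Random(seed).shuffle(ids)
    parts, start = {}, 0
    for name, count in sizes.items():
        parts[name] = ids[start:start + count]
        start += count
    parts["test"] = ids[start:]
    return parts

rgb_nir_split = split_scenes(
    range(300),
    {
        "n1_n2_train": 40,   # training pairs for N1 and N2
        "n1_n2_val": 10,     # validation pairs for N1 and N2
        "n3_train": 50,      # pairs used to train N3
        "fusion_train": 50,  # scenes used to train the fusion network
    },
)
```

Keeping the subsets disjoint matters here because the fusion network is trained on outputs of the frozen streams, and reusing their training scenes would give it unrealistically good pre-deblurred inputs.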
As shown in Table 1, we compare our method GDFNet with image deblurring algorithms quantitatively on 150 scenes of the RGB/NIR dataset. We split the comparisons into three parts based on using MIMO, MPR, and DeepRFT as $N_1$. Our method (GDFNet with MIMO) outperforms the image deblurring algorithms in terms of PSNR, SSIM, and LPIPS. Our method (GDFNet with MPR) and our method (GDFNet with DeepRFT) also perform favorably against state-of-the-art image deblurring algorithms. Fig. 7 and Fig. 8 show the image deblurring results produced by the competing algorithms. The single deblurring algorithms MPR [2], Uformer [27], and NAFNet [29] lose textures and generate ringing patterns around large edges. SVLRM [5] produces color shifts and ghost shadows. The details of textures can be obtained from the NIR input, but artifacts may appear since the local linear assumption [32] is weak on inconsistent structures. By exploiting both the single deblurred image and the NIR image, MPR+SVLRM reduces the appearance of artifacts. However, some blurry edges still exist. Instead of relying on a single or a guided deblurring algorithm alone, our GDFNet fuses three pre-deblurred streams, generating a clearer image with sharper edges.
To validate the superiority of our fusion network, we compare two image fusion approaches with our GDFNet using the same pre-deblurred images as input. We modified SeAFusion so that it can accept three images as input. The results in Table 2 show that GDFNet outperforms the other algorithms. As shown in Fig. 9, DenseFuse [34] and SeAFusion [39] produce chromatic aberrations around the organ pipes and false edges around the chandelier. They are designed for image fusion, but they do not correctly fuse the high-frequency information. Our GDFNet produces fewer artifacts and creates clear edges.
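PSNR, which is used above both for model selection on the validation set and for the quantitative comparisons, follows directly from the mean squared error. The sketch below assumes single-channel images with intensities in [0, 1]; the function name `psnr` and the toy images are illustrative only.

```python
import math

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images given as nested lists."""
    n = len(pred) * len(pred[0])
    mse = sum((p - t) ** 2
              for row_p, row_t in zip(pred, target)
              for p, t in zip(row_p, row_t)) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
target = [[0.5, 0.5], [0.5, 0.5]]
pred = [[0.6, 0.6], [0.6, 0.6]]
value = psnr(pred, target)
```

Because the scale is logarithmic, a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error.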

D. EXTENSION TO STRIPE NOISE ON RGB/NIR DATASET
Although our algorithm assumes that the NIR images are noise-free, line-pattern stripe noise [8] or random noise [54] can be present when the imaging quality is poor. Since we use the NIR images as the guidance image, the stripe noise or random noise may deteriorate the pre-deblurred images of the guided deblurring streams and further corrupt the final fused deblurred image. We therefore evaluate our GDFNet on the RGB/NIR dataset with noisy NIR images. We manually corrupt the NIR images using stripe noise and random noise. The stripe noise is simulated in a similar manner to [8], with its intensity varying within [−10, 10]. The random noise level is set to 10% following Tai and Lin [54]. Table 3 shows the quantitative evaluation results, where our GDFNet still performs better than the deblurring and fusion approaches. Since GDFNet fuses three streams to compute the deblurred image, it remains robust to noise even when the NIR images are severely degraded.
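One possible way to simulate the two corruptions is sketched below. The details are assumptions, since the exact procedures of [8] and [54] are not given here: stripe noise is modeled as a per-column additive offset drawn uniformly from [−10, 10] on an 8-bit intensity scale, and the 10% random noise is modeled as Gaussian noise whose standard deviation is 10% of the intensity range. The function names are hypothetical.

```python
import random

def add_stripe_noise(nir, amplitude=10.0, seed=0):
    """Add one random offset per column (assumed vertical stripe model)."""
    rng = random.Random(seed)
    offsets = [rng.uniform(-amplitude, amplitude) for _ in range(len(nir[0]))]
    return [[min(255.0, max(0.0, v + offsets[x])) for x, v in enumerate(row)]
            for row in nir]

def add_random_noise(nir, level=0.10, seed=0):
    """Add Gaussian noise with std = level * 255 (assumed noise model)."""
    rng = random.Random(seed)
    sigma = level * 255.0
    return [[min(255.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in nir]

# A flat mid-gray NIR patch makes the stripe pattern easy to see.
nir = [[128.0] * 4 for _ in range(3)]
striped = add_stripe_noise(nir)
noisy = add_random_noise(nir)
```

On the flat patch every row of the striped result is identical, which reflects the column-wise structure that distinguishes stripe noise from pixel-wise random noise.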
We show the deblurring results of an RGB and noisy NIR image pair in Fig. 10. Single image deblurring algorithms such as NAFNet [29], Uformer [27], and MPR [2] produce deblurred images with slight blur at the edges. The result of SVLRM [5] is blurry because the structure of the reference NIR image is destroyed by noise. Noise has little effect on MPR+SVLRM because the network learns that the guidance is less reliable and tends to use the output of MPR to regress the restoration result. DenseFuse [34] and SeAFusion [39] produce sharper images but are still affected by noise. Our GDFNet generates a sharp result because it can lower the fusion weights of unreliable deblurring streams.

E. EVALUATION ON BLURRY/FLASH DATASET
We evaluate our GDFNet on the Object category of the ambient/flash dataset [33], which includes 578 scenes. The deblurring streams $N_1$ and $N_2$ are trained on 160 scenes and validated on 40 scenes. We choose the trained network with the best PSNR on the validation dataset. Similar to the RGB/NIR dataset, we use 80 scenes and 20 scenes to train and validate $N_3$, respectively. Then we fix the three pre-deblurred streams and train GDFNet using another 40 scenes for training and 10 scenes for validation. The remaining 228 scenes are used for testing.
As shown in Table 4 and Table 5, we compare our GDFNet with image deblurring algorithms and image fusion approaches quantitatively on 228 scenes of the blurry/flash dataset. Our methods perform better than the state-of-the-art image deblurring algorithms and image fusion approaches. Fig. 11 and Fig. 12 illustrate the image deblurring results produced by the evaluated image deblurring and image fusion algorithms. In the images produced by MPR [2] and NAFNet [29], the texts are hardly recognizable since the recovered high-frequency information is incorrect. The content produced by Uformer [27] is barely readable due to ghost shadows. The texts produced by SVLRM [5] and MPR+SVLRM are recognizable, but the edges are less sharp. The image fusion approaches DenseFuse and SeAFusion produce texts that are generally clear but still contain artifacts. Our GDFNet produces the highest PSNR values and successfully recovers texts with sharp edges.

TABLE 6. Quantitative comparisons on loss functions using GDFNet (w/ MPR) on the blurry/flash dataset. The best metrics are in bold, and the second best ones are underlined.

F. ABLATION STUDIES AND RUNNING TIMES
As shown in (14), we use several loss functions, including L1 loss, SSIM loss, perceptual loss, and frequency loss. To analyze the effect of the loss function on the performance of GDFNet, we train it on the blurry/flash dataset using different versions of loss functions. Quantitative results in Table 6 show that our loss function is suitable when considering all the PSNR, SSIM, and LPIPS metrics.
We conduct an ablation study on the framework components, including the residual image, the multi-scale reconstruction, and the multiple guided deblurring streams. As shown in Table 7, our full framework works better than the versions with any component removed. The effect of using residual images is significant: it increases the PSNR by 0.98 dB. The multi-scale reconstruction with supervision improves the PSNR by 0.35 dB. The variants without $J_2$ and without $J_3$ demonstrate that both streams are useful, as they increase the PSNR by 0.47 dB and 0.3 dB, respectively.
To quantitatively compare the inference time of the proposed method with image deblurring algorithms and image fusion approaches, we evaluate all algorithms on the RGB/NIR dataset. Table 8 shows that the inference time of our GDFNet is lower than the deblurring algorithms and close to the image fusion approaches.

V. CONCLUSION
This paper proposes a novel guided image deblurring framework based on deep image fusion using multi-modal image pairs, called Guided Deblurring Fusion Network (GDFNet). Previous work on image deblurring has focused on directly computing the deblurred image, neglecting the alternative of image fusion. We use GDFNet to fuse the pre-deblurred streams of single and guided image deblurring algorithms to aggregate the structures and the sharp details based on fusion weights. In detail, GDFNet employs the blur/residual image splitting strategy and a coarse-to-fine reconstruction module supervised by multi-scale ground truths. The effectiveness of the strategies used in GDFNet is demonstrated by the ablation study. Our method can be easily extended by replacing the approaches used as pre-deblurred streams. Quantitative comparisons show that GDFNet outperforms the image deblurring algorithms and image fusion approaches. The average PSNR of GDFNet is at least 0.9 dB higher than that of existing algorithms evaluated on 228 test scenes from the blurry/flash dataset.