Inverted Residual Fourier Transformation for Lightweight Single Image Deblurring

Single image deblurring aims to restore a sharp image by removing blurred areas in a single image. Such blurred images are not only visually unpleasant but also cause problems in many applications, such as image recognition. In recent years, with the development of deep learning, neural networks have been used for single image deblurring. In particular, encoder-decoder structures are widely used and successfully restore high-quality images. However, FLOPs and the number of parameters tend to increase in order to restore high-quality images. Thus, this paper proposes a new lightweight network (IRFTNet) based on UNet, a widely used basic encoder-decoder network. Our proposed network has three features that improve performance while keeping the model lightweight. First, a new backbone called the Inverted Residual Fourier Transformation block (IRFTblock), based on the inverted residual block, is introduced to decrease computational complexity. Second, a new module called Lower Feature Synthesis (LFS) is introduced to efficiently transfer encoder information from lower layers to upper layers. Finally, the multiple outputs structure proposed in MIMO-UNet is introduced. These improvements resulted in a PSNR of 32.98 dB on the GoPro dataset with approximately half the FLOPs and parameters of DeepRFT. Further ablation studies show the effectiveness of the various components of our proposed model.


I. INTRODUCTION
The purpose of single image deblurring is to restore a sharp image by removing blurred areas in a single image [1]. Although camera performance has improved remarkably in recent years, there are two main reasons why images may be blurred. The first is camera shake caused by the photographer's hand. The second is blur caused by the motion of objects. Both occur during the exposure of the image sensor and cause blurring in the image. The blurred image B is expressed by the convolution of the ideal image S and the blur kernel k:

B = k * S,   (1)

where * represents the convolution. Such blurred images are not only visually unpleasant but also cause problems in many applications, such as image recognition. For this reason, single image deblurring has been an important task for many years. (The associate editor coordinating the review of this manuscript and approving it for publication was Byung Cheol Song.)
Traditionally, kernel-based methods that estimate the motion blur kernel were used for single image deblurring. The reason for estimating the blur kernel is that a blurred image can be deconvolved with the kernel to restore a sharp image. However, because kernel-based methods are vulnerable to noise and real motion blur kernels are very complex, the restoration quality on real images is poor.
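As a concrete illustration of the formation model B = k * S, the following sketch blurs a synthetic sharp edge with a horizontal motion kernel. The image content, the kernel length of 9, and the zero-padding choice are illustrative assumptions, not details from the paper.

```python
import numpy as np

def motion_blur_kernel(length=9):
    """Horizontal motion blur kernel k: a normalized line of ones."""
    k = np.zeros((length, length))
    k[length // 2, :] = 1.0
    return k / k.sum()

def blur(sharp, kernel):
    """B = k * S: 2D convolution with 'same' output size and zero padding."""
    kh, kw = kernel.shape
    h, w = sharp.shape
    padded = np.pad(sharp, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(sharp)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

S = np.zeros((32, 32))
S[:, 16] = 1.0                      # a sharp vertical line
B = blur(S, motion_blur_kernel(9))  # the line is smeared over 9 columns
```

Kernel-based methods attempt to invert exactly this operation, which is why a wrong or noisy kernel estimate degrades the restoration.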
In recent years, many methods based on Convolutional Neural Networks (CNNs) have been proposed with the development of deep learning. Early CNN-based methods [2], [3], [4], [5] perform deblurring in two stages: the first stage estimates the blur kernel using deep learning, and the second stage restores the sharp image with the estimated kernel. However, this approach is slow because of its two-step process. In contrast, recent CNN-based methods [6], [7] take an end-to-end approach and train the network on pairs of blurred and sharp images. These methods do not require estimation of the motion blur kernel and can restore a sharp image directly from the blurred image, resulting in high performance. Furthermore, encoder-decoder structures [8], [9], [10], [11], [12] restore images with higher quality than other direct methods. However, the latest single image deblurring networks tend to have high FLOPs and many parameters, since their purpose is to restore high-quality images. Because single image deblurring is increasingly used on edge devices such as mobile phones, it is necessary to build a lightweight network with fewer FLOPs and fewer parameters while maintaining performance.
In this paper, we propose the Inverted Residual Fourier Transformation Network (IRFTNet), a lightweight and high-performance deblurring method based on an encoder-decoder structure. IRFTNet has the following features. First, the Inverted Residual Fourier Transformation block (IRFTblock) is introduced. In a typical encoder-decoder network, ResBlock is used as the feature extraction block. Stacking ResBlocks over multiple layers improves accuracy, but their design does not consider model size. Because the feature extraction block is used repeatedly throughout the network, we design it to be lightweight. Specifically, based on the inverted residual block [13], we add the Fast Fourier Transformation branch proposed in DeepRFT [12] to improve both parameter efficiency and accuracy. Second, Lower Feature Synthesis (LFS) is introduced. LFS is inspired by the AFF module used in MIMO-UNet. AFF replaces the skip connections in UNet and fuses feature maps of all scales. However, using feature maps at all scales increases computational complexity. We also consider that the feature maps of the upper layers are not sufficiently encoded, so using them as input for the lower layers could have a negative impact. Therefore, we introduce LFS, which focuses on transferring information from the sufficiently encoded lower layers to the upper layers. To prove the effectiveness of these components, we evaluated IRFTNet on the GoPro [6], HIDE [14], and RealBlur [15] datasets. These improvements resulted in a PSNR of 32.98 dB on the GoPro dataset, with approximately half the FLOPs and parameters of DeepRFT. Comparisons of PSNR vs. FLOPs (G) and PSNR vs. parameters (M) with state-of-the-art methods are shown in Figs. 1 and 2.

II. RELATED WORK
A. SINGLE IMAGE DEBLURRING
In recent years, single image deblurring has developed rapidly. This is due to the progress of deep learning, which has led to the development of CNN-based networks. There are two CNN-based approaches: estimating the motion blur kernel and restoring the image directly. The kernel-based approach was used in the early days of CNN-based networks. It is divided into two stages: estimation of the motion blur kernel and restoration of the image using the kernel, with CNNs used to estimate the kernel in the first stage. However, the motion blur kernel is very complex and difficult to estimate, so such methods are not commonly used today. Thus, direct restoration methods such as DeepDeblur [6] are widely used. There are several types of direct restoration methods. Currently, two main types are used: the recurrent structure, which gradually restores the input image through a multi-stage network, and the encoder-decoder structure, which restores the image with a single network. The recurrent structure consists of multiple stages and must pass through multiple networks, so inference takes longer. In this paper, the encoder-decoder structure is employed to construct a lightweight network. We describe the following two encoder-decoder methods as conventional methods.

1) MIMO-UNet [10]
A network based on UNet [16] with multiple inputs and outputs achieves high performance for single image deblurring. In addition, performance is improved by a module called Asymmetric Feature Fusion (AFF), which efficiently handles information at different scales, used instead of skip connections. Rethinking the commonly used cascade network structure, this network achieves a significant decrease in runtime by using an encoder-decoder structure.
2) DeepRFT [12]
Performance is improved by adding a branch that applies the Fast Fourier Transform to the residual block used in MIMO-UNet. Residual blocks are commonly used in encoder-decoder structures. Although the residual block is good at capturing the high-frequency components of an image, it often overlooks the low-frequency components. The Fast Fourier Transform makes both high-frequency and low-frequency components available.
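The high/low-frequency split that motivates the FFT branch can be made concrete with a small sketch: masking the 2D FFT of a feature map separates it into complementary low- and high-frequency parts. The 4-bin cutoff is an arbitrary choice for illustration.

```python
import numpy as np

# Split a feature map into low- and high-frequency parts via the FFT.
# A residual branch sees the whole map; the FFT branch makes this
# frequency decomposition explicitly accessible to the network.
x = np.random.default_rng(0).standard_normal((16, 16))

X = np.fft.fft2(x)
mask = np.zeros_like(X)          # low-frequency mask: the 4 corner blocks
mask[:4, :4] = 1                 # (low frequencies sit at the corners of
mask[:4, -4:] = 1                #  an unshifted FFT spectrum)
mask[-4:, :4] = 1
mask[-4:, -4:] = 1

low = np.fft.ifft2(X * mask).real        # smooth, global structure
high = np.fft.ifft2(X * (1 - mask)).real # edges and fine detail
```

Because the two masks are complementary, `low + high` reconstructs the input exactly; the network can weight the two parts independently.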

B. NETWORK LAYERS
CNN-based networks use specific layers iteratively for feature extraction. Early networks used simple iterative convolutional structures. AlexNet [17], published in 2012, achieved better results than previous methods by using many convolution layers. Subsequently, ResNet [18], published in 2015, was able to use even more layers and is still used in many networks. ResNet employs Resblock as its network layer, which is composed of convolution layers and a shortcut connection. Resblock allows models to be deeper, but it is not efficient with respect to model size. Therefore, a lightweight network, MobileNet [19], was proposed in 2017 for use in mobile environments. This network uses depthwise separable convolutions, which are lighter than regular convolutions. A depthwise separable convolution consists of a depthwise convolution followed by a pointwise convolution. It achieves the same level of accuracy as regular convolution with very little computational complexity and very few parameters. In 2018, an improved version of MobileNet, MobileNetV2 [13], was proposed. This network uses an inverted residual block, which also uses depthwise and pointwise convolutions, like the depthwise separable convolution. The difference is that the depthwise convolution is sandwiched between two pointwise convolutions. In addition, the shortcut connection from Resblock is used to improve accuracy. In this way, the number of parameters and the computational complexity can be reduced without loss of accuracy.
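The parameter savings described above can be checked with simple counting. The channel width of 64 and the expansion factor of 6 (MobileNetV2's default) are illustrative assumptions; biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution."""
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    """Depthwise k x k (one filter per input channel) + pointwise 1 x 1."""
    return c_in * k * k + c_in * c_out

def inverted_residual_params(c, k=3, expand=6):
    """1x1 expand -> depthwise k x k -> 1x1 project (MobileNetV2 style)."""
    hidden = c * expand
    return c * hidden + hidden * k * k + hidden * c

regular = conv_params(64, 64, 3)            # 36,864 weights
separable = dw_separable_params(64, 64, 3)  # 4,672 weights, ~8x fewer
inverted = inverted_residual_params(64)     # 52,608 weights at 6x width
```

Even with a 6x expanded hidden width, the inverted residual block stays in the same order of magnitude as a single regular 3 x 3 convolution, while a stack of two regular convolutions (as in Resblock) would double the regular count.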
More recently, ConvNext [20] was proposed in 2022. In that work, various improvements to Resblock were proposed, resulting in the ConvNext block. Specifically, the kernel size, activation layer, and normalization are changed, and activation and normalization layers are reduced, to improve accuracy.

III. PROPOSED METHOD
To construct a network that is lighter than existing deblurring methods, an encoder-decoder structure is used instead of a recurrent structure. Therefore, we construct a network based on UNet [16], a general encoder-decoder structure. The proposed method has the following features.
1) IRFTblock, a lightweight feature extraction block compared to Resblock, is introduced. In this way, it is possible to reduce computational complexity compared to a residual block while maintaining performance.
2) LFS, a module that efficiently fuses bottleneck features, is introduced.
3) The multiple outputs structure employed in MIMO-UNet is introduced.
The architecture of the proposed network is shown in Fig. 3.
There is one input image (H × W) and three output images (H × W, H/2 × W/2, H/4 × W/4). We adopt pixelshuffle [21] and pixelunshuffle as the upsampling and downsampling modules, respectively. Multiple input and output structures are described in MIMO-UNet; only the multiple outputs structure is used here because of its superior accuracy improvement relative to the increase in the number of parameters.
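Pixelshuffle and pixelunshuffle are pure rearrangements between channels and spatial resolution, so they add no parameters. A minimal numpy sketch, independent of any deep-learning framework:

```python
import numpy as np

def pixel_unshuffle(x, r):
    """(C, H, W) -> (C*r*r, H/r, W/r): trade spatial resolution for channels."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def pixel_shuffle(x, r):
    """(C*r*r, H, W) -> (C, H*r, W*r): the inverse rearrangement."""
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    return x.transpose(0, 3, 1, 4, 2).reshape(c // (r * r), h * r, w * r)

x = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)
down = pixel_unshuffle(x, 2)  # shape (12, 4, 4)
up = pixel_shuffle(down, 2)   # shape (3, 8, 8), identical to x
```

Because the two are exact inverses, no information is lost when moving between scales, unlike strided convolution or pooling.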

A. INVERTED RESIDUAL FOURIER TRANSFORMATION BLOCK
We propose the Inverted Residual Fourier Transformation block (IRFTblock), as shown in Fig. 4. This block is based on the Res FFT-Conv block [12], and the new proposal is to introduce the Inverted Residual block (IRblock), shown in Fig. 5, in the middle branch. Conventional residual blocks use ordinary convolution, which increases the number of parameters. Thus, we employ IRblock, based on the inverted residual block used in MobileNetV2 [13], which adopts a depthwise convolution layer and 1 × 1 convolution layers instead of the normal convolution layers. In the left branch of IRFTblock, the Fast Fourier Transformation (FFT) is applied to the input features. By converting to the frequency domain, features of the entire image can be captured; therefore, non-local information that cannot be handled by ordinary convolution becomes available. The FFT is a type of Discrete Fourier Transformation (DFT) that can be computed more efficiently than the naive DFT. The 2D DFT is expressed by

F(u, v) = Σ_{x=0}^{H−1} Σ_{y=0}^{W−1} f(x, y) e^{−j2π(ux/H + vy/W)},   (2)

where j represents the imaginary unit, f(x, y) is the feature map before the DFT, and F(u, v) is the feature map after the DFT. Letting n ∈ {0, 1, 2} be the layer number, the final IRFTblock output is expressed by Eq. 3:

X_{n+1} = X_n + α_n ⊙ IR(X_n) + β_n ⊙ FFT-branch(X_n),   (3)

where α ∈ R^{1×1×C} and β ∈ R^{1×1×C} are learnable parameters that weight the channel direction.

TABLE 1. Comparison of PSNR and SSIM on each dataset. Best and second-best scores are written in red and blue, respectively. All networks were trained with the GoPro [6] training dataset. Despite its light weight, the performance of IRFTNet is not significantly worse on any dataset, and is equal to or better than other state-of-the-art networks.
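The 2D DFT formula above can be sanity-checked against numpy's FFT, which computes the same transform in O(HW log HW) instead of O(H²W²):

```python
import numpy as np

def dft2(f):
    """Direct 2D DFT: F(u,v) = sum_x sum_y f(x,y) e^{-j 2 pi (ux/H + vy/W)}."""
    H, W = f.shape
    x = np.arange(H)[:, None]
    y = np.arange(W)[None, :]
    F = np.zeros((H, W), dtype=complex)
    for u in range(H):
        for v in range(W):
            F[u, v] = np.sum(f * np.exp(-2j * np.pi * (u * x / H + v * y / W)))
    return F

f = np.random.default_rng(0).standard_normal((8, 8))
F_direct = dft2(f)
F_fast = np.fft.fft2(f)  # identical values, computed far more efficiently
```

This equivalence is exactly why the FFT branch is cheap enough to include in every IRFTblock.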

B. LOWER FEATURE SYNTHESIS
MIMO-UNet [10] employs a component called AFF instead of the skip connections in ordinary UNet [16]. AFF fuses the encoded features of all layers. However, in recurrent structures such as MPRNet [22], information is transferred only from the coarse image to the fine image. Therefore, considering a small-size image as a coarse image and a large-size image as a fine image in the multiple outputs structure, the transfer of information from the upper layers to the lower layers is unnecessary. Thus, we propose an approach that gradually fuses the encoded information of the lower layers into the upper layers. This approach is named Lower Feature Synthesis (LFS) and is shown in Fig. 6. Processing in this module is conducted in the following order: concatenation of lower and upper features, 1 × 1 convolution, 3 × 3 depthwise convolution, 1 × 1 convolution. LFS is expressed by

LFS_n = Conv_{1×1}(DWConv_{3×3}(Conv_{1×1}(Concat(E_n, E_{n+1}↑)))),   (4)

where E_n represents the output of the n-th layer's encoder. Features from different layers are fused by using up-sampling (↑) to align their sizes. By using LFS, information at different scales can be fused more efficiently, which not only lightens the network but also improves accuracy.
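A minimal numpy sketch of the LFS processing order (concatenate, 1 × 1 convolution, 3 × 3 depthwise convolution, 1 × 1 convolution) is given below. The nearest-neighbour upsampling, channel widths, and random weights are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):                      # x: (Cin, H, W), w: (Cout, Cin)
    return np.einsum('oc,chw->ohw', w, x)

def dwconv3x3(x, w):                    # x: (C, H, W), w: (C, 3, 3), zero pad
    c, h, wd = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * p[:, i:i + h, j:j + wd]
    return out

def upsample2(x):                       # nearest-neighbour 2x (assumed here)
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lfs(upper, lower, w1, wd, w2):
    """Concat upsampled lower features with upper, then 1x1 -> DW3x3 -> 1x1."""
    fused = np.concatenate([upper, upsample2(lower)], axis=0)
    return conv1x1(dwconv3x3(conv1x1(fused, w1), wd), w2)

e1 = rng.standard_normal((32, 16, 16))  # upper-layer encoder features
e2 = rng.standard_normal((64, 8, 8))    # lower-layer (deeper) features
w1 = rng.standard_normal((32, 96))      # 1x1 reduce after concat (96 -> 32)
wd = rng.standard_normal((32, 3, 3))    # depthwise 3x3
w2 = rng.standard_normal((32, 32))      # final 1x1
out = lfs(e1, e2, w1, wd, w2)           # shape (32, 16, 16)
```

Only one upsampled lower scale is fused per layer, which is where the FLOPs saving over AFF's all-scale fusion comes from.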

C. LOSS FUNCTION
The loss function of this network is expressed by

L = L_msc + λ_1 L_edge + λ_2 L_msfr,   (5)

where λ_1 and λ_2 are hyperparameters that control the weighting of the losses. Each term is expressed by Eq. 6, Eq. 7, and Eq. 8 below. Let Ŝ_n be the n-th restored image and S_n the n-th sharp image, n ∈ {0, 1, 2}, whose sizes are (H × W, H/2 × W/2, H/4 × W/4), respectively.
(1) Multi Scale Charbonnier loss [22]

L_msc = Σ_{n=0}^{2} √(‖Ŝ_n − S_n‖² + ε²),   (6)

where ε is a constant value of 1 × 10^−3. The Multi Scale Charbonnier loss is the main loss function of this network. It works similarly to the MSE loss.
(2) Multi Scale Edge loss [22]

L_edge = Σ_{n=0}^{2} √(‖ΔŜ_n − ΔS_n‖² + ε²),   (7)

where ε is a constant value of 1 × 10^−3 and Δ denotes the Laplacian operator. The Multi Scale Edge loss helps recover the edges of the generated images. Blurred images are fuzzy at the edges, so this loss function is effective.
(3) Multi Scale Frequency Reconstruction loss [10]

L_msfr = Σ_{n=0}^{2} ‖FT(Ŝ_n) − FT(S_n)‖_1,   (8)

where FT denotes the FFT operation. The Multi Scale Frequency Reconstruction loss helps recover frequency components. In this paper, experiments are conducted with λ_1 = 0.05 and λ_2 = 0.01.
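The three losses can be sketched as follows. Per-scale normalization constants and the exact Laplacian stencil are illustrative assumptions (a 4-neighbour stencil with zero padding is used here).

```python
import numpy as np

EPS = 1e-3  # the constant from the Charbonnier and edge losses

def laplacian(x):
    """4-neighbour Laplacian with zero padding (highlights edges)."""
    p = np.pad(x, 1)
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * x

def charbonnier(pred, target):
    """sqrt(||pred - target||^2 + eps^2): a smooth L1-like penalty."""
    return np.sqrt(np.sum((pred - target) ** 2) + EPS ** 2)

def edge_loss(pred, target):
    """Charbonnier distance between Laplacian edge maps."""
    return charbonnier(laplacian(pred), laplacian(target))

def freq_loss(pred, target):
    """L1 distance between FFT spectra."""
    return np.sum(np.abs(np.fft.fft2(pred) - np.fft.fft2(target)))

def total_loss(preds, targets, lam1=0.05, lam2=0.01):
    """L = L_msc + lam1 * L_edge + lam2 * L_msfr, summed over 3 scales."""
    msc = sum(charbonnier(p, t) for p, t in zip(preds, targets))
    edge = sum(edge_loss(p, t) for p, t in zip(preds, targets))
    msfr = sum(freq_loss(p, t) for p, t in zip(preds, targets))
    return msc + lam1 * edge + lam2 * msfr
```

With identical predictions and targets the loss reduces to the ε floors of the Charbonnier terms, so the gradient near a perfect restoration stays finite, unlike a plain L2 root.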

IV. EXPERIMENT
A. DATASETS AND EVALUATION METRICS
We used only the GoPro [6] training dataset, which consists of 2103 pairs of blurred and sharp images, for training our network. For testing, we used three datasets: GoPro [6], which consists of 1111 pairs; HIDE [14], which consists of 2025 pairs; and RealBlur [15], which consists of 980 pairs. There are two versions of RealBlur: RealBlur_R, which includes dark images, and RealBlur_J, which includes bright images. The blurred images in these datasets mainly contain motion blur.
PSNR and SSIM were used as evaluation metrics on the testing datasets. In addition, FLOPs and the number of parameters required to process a single image were calculated to evaluate the lightness of the network.

FIGURE 7. Some examples on the GoPro [6] testing dataset. From left top to right bottom: input image, MIMO-UNet [10], DeepRFT [12], Restormer [23], IRFTNet (proposed), target image. IRFTNet restores a clean image overall compared with the other networks.

TABLE 3. Ablation studies on the GoPro [6] testing dataset. This is the result of replacing AFF with LFS, and of adding the multiple outputs structure.
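PSNR, the main metric used in the evaluation, is a simple function of the mean squared error. A minimal sketch for images scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    mse = np.mean((pred - target) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

target = np.linspace(0, 1, 64).reshape(8, 8)
noisy = np.clip(target + 0.01, 0, 1)  # a uniform 1% error
score = psnr(noisy, target)           # roughly 40 dB
```

A 1% uniform error already sits around 40 dB, which puts the reported 32.98 dB (i.e., larger but structured errors) in context. Note that `mse = 0` would make the ratio undefined, so identical images need a separate check in practice.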

B. IMPLEMENTATION DETAILS
Training images were randomly cropped to a size of 256 × 256. We trained our network for 3,000 epochs with the Adam [24] optimizer. The learning rate was initially 1 × 10^−3 and gradually decreased to 1 × 10^−6. For data augmentation, horizontal and vertical flips were randomly applied. In testing, we employed the same slicing crop method as SDWNet [25] and sliced the test images into 256 × 256 patches. The number of IRFTblocks at each scale was set to 8, as in MIMO-UNet [10] and DeepRFT [12].
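The schedule type is not specified beyond "gradually decreased"; a cosine decay from 1 × 10^−3 to 1 × 10^−6 over 3,000 epochs is one common choice and is assumed here purely for illustration.

```python
import math

def lr_at(epoch, total=3000, lr_max=1e-3, lr_min=1e-6):
    """Cosine decay between the paper's endpoints (schedule type assumed)."""
    t = epoch / (total - 1)  # 0 at the first epoch, 1 at the last
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

start = lr_at(0)     # = lr_max = 1e-3
end = lr_at(2999)    # = lr_min = 1e-6
```

Any monotone schedule with the same endpoints (step or polynomial decay, for example) would equally match the description in the text.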

C. PERFORMANCE COMPARISON
We compared IRFTNet with state-of-the-art deblurring networks [10], [12], [22], [23]. The average PSNR of IRFTNet was 32.98 dB on the GoPro testing dataset. As shown in Table 1, IRFTNet has far fewer FLOPs and parameters than the other networks. Despite its light weight, the performance of IRFTNet is not significantly worse on any dataset, and is equal to or better than other state-of-the-art networks. In particular, compared with Restormer [23], the most recent network, FLOPs and the number of parameters are reduced to about 1/4 with almost the same performance. Fig. 7 shows some examples of restored images; from left top to right bottom: input image, MIMO-UNet [10], DeepRFT [12], Restormer [23], IRFTNet (proposed), target image. IRFTNet restores a clean image overall compared with the other networks. For example, in the upper images of Fig. 7, IRFTNet restores sharper text than Restormer and sharper fences than DeepRFT. For each result, we used author-released weights trained on the GoPro training dataset.

D. ABLATION STUDY
The following ablation studies are performed to demonstrate the effectiveness of each component.
1) To show the effectiveness of IRFTblock, we compared it with the Res FFT-Conv block, and with and without GELU and Layer Normalization.
2) To show the effectiveness of LFS, we compared it with the AFF used in MIMO-UNet.
3) To show the effectiveness of multiple outputs, we compared the results with and without multiple outputs.
First, the IRFTblock results are shown in Table 2. IRFTblock improved PSNR by 0.14 dB compared with the Res FFT-Conv block. Moreover, FLOPs and the number of parameters were reduced by approximately 40% and 50%, respectively. In addition, GELU and Layer Normalization also contributed to the accuracy improvement.
The results for LFS and multiple outputs are shown in Table 3. LFS improves PSNR by 0.03 dB. The number of parameters remains the same, but FLOPs are reduced by approximately 4%; although modest, this reduction matters when creating lighter models. Multiple outputs improve PSNR by 0.07 dB, with the number of parameters and FLOPs the same as the single-decoder structure.
Thus, all the proposed components are proven to be effective. In particular, IRFTblock achieves a significant reduction in FLOPs and the number of parameters by using the inverted residual structure.

V. CONCLUSION
In this paper, we proposed a lightweight and well-performing network named IRFTNet for single image deblurring. Instead of Resblock, the usual basic feature extraction layer, we proposed IRFTblock based on the inverted residual block. IRFTblock not only improves performance but also contributes significantly to reducing FLOPs and the number of parameters. LFS also contributes to improved performance and reduced FLOPs by efficiently transferring features from the lower layers of the network to the upper layers. Experimental results show that these proposed components are lightweight yet effective for deblurring. Future work is to explore even more lightweight networks while improving performance. Deblurring under more severe degradation, such as Robust Blind Deblurring [26], is also considered as future work.

FIGURE 1. Comparison of PSNR and FLOPs between the proposed and conventional methods on the GoPro dataset. IRFTNet achieves a PSNR of 32.98 dB with FLOPs about half those of DeepRFT.

FIGURE 2. Comparison of PSNR and the number of parameters between the proposed and conventional methods on the GoPro dataset. IRFTNet achieves a PSNR of 32.98 dB with about 60% of the parameters of DeepRFT.

FIGURE 3. The architecture of IRFTNet. IRFTNet is constructed based on UNet, a common encoder-decoder structure. Its main features are modules such as IRFTblock and LFS that reduce computational complexity. It has one input image (H × W) and three output images (H × W, H/2 × W/2, H/4 × W/4).

1) IRFTblock, a lightweight feature extraction block compared to Resblock, is introduced.

FIGURE 4. Illustration of IRFTblock. It consists of three branches: the first is the residual branch, the second is the FFT branch, and the third is the IR branch. The IR branch extracts local features, and the FFT branch extracts global features.

FIGURE 5. Illustration of IRblock. This block adopts a depthwise convolution layer and 1 × 1 convolution layers instead of the normal convolution layer. In this way, it is possible to reduce computational complexity compared to a residual block while maintaining performance.

FIGURE 6. Illustration of LFS. LFS gradually fuses the encoded information of the lower layers into the upper layers, allowing information to transfer efficiently with small computational complexity.

TABLE 2. Ablation studies on the GoPro [6] testing dataset. This is the result of replacing Res FFT-Conv with IRFTblock in IRFTNet, and of replacing ReLU with GELU and Batch Normalization with Layer Normalization.