Defocus Blur Detection by Fusing Multiscale Deep Features With Conv-LSTM

Defocus blur detection, which aims at distinguishing out-of-focus blur from sharpness, has attracted considerable attention in computer vision. Existing blur detectors suffer from scale ambiguity, which results in blurred boundaries and low detection accuracy. In this paper, we propose a defocus blur detector that addresses these problems by integrating multiscale deep features with Conv-LSTM. There are two strategies for extracting multiscale features: extracting features from images of different sizes, and extracting features from multiple convolutional layers of a single image. Our method employs both strategies, i.e., it extracts multiscale convolutional features from the same image at different sizes. The features extracted from different-sized images at the corresponding convolutional layers are fused to generate more robust representations. We use Conv-LSTMs to integrate the fused features gradually from top to bottom layers and to generate multiscale blur estimations. Experiments on the CUHK and DUT datasets demonstrate that our method is superior to state-of-the-art blur detectors.


I. INTRODUCTION
Defocus blur, or out-of-focus blur, is a common problem when shooting photos. It is mainly caused by the distance of an object from the focal plane. Some blur is intentional, to highlight salient objects, while other blur is unwanted and must be removed. Besides blur removal, defocus blur detection (DBD) is also widely applied in blur magnification [1], [2], image deblurring [3]-[5], salient object detection [6]-[9], and image quality assessment [10], [11].
Traditional DBD methods use gradient [12]-[16], frequency [17]-[21], SVD [22], LBP [23], and sparse representation [24] to conduct blur detection. Although these hand-crafted features are easily extracted, they lack discriminative power for complex scenes. As shown in Fig. 1(c), the hand-crafted feature-based method DBDF [18] fails to detect the blurred background marked with the yellow rectangle. With the development of deep learning (DL), an increasing number of blur detectors [25]-[31] are designed on the basis of convolutional neural networks (CNNs). DL-based methods learn effective features for distinguishing blurred and sharp parts from labeled images without well-designed hand-crafted features. In these methods, semantic and low-level features are integrated to generate robust blur detection. Compared with traditional blur detectors, the DL-based blur detectors DeFusionNet [27] and BTBCRL [28] achieve effective blur detection results (Figs. 1(d) and (e)). However, DeFusionNet and BTBCRL fail to distinguish blurred and sharp features in homogeneous and low-contrast regions. Apart from homogeneity and low contrast, scale ambiguity is an important problem in blur detection. As shown in Fig. 2, the vegetable leaves are sharp in Scale 1 and Scale 2 while remaining blurred in Scale 3. A defocus blur detector that integrates multiscale deep features with Conv-LSTM [32] is proposed in this paper to generate robust blur detection. We adopt VGG16 [33] as the basic feature extractor and extract multiscale deep convolutional features from images with different sizes to solve the scale problem. Specifically, we first extract multiscale convolutional features from images with different sizes. Then the features extracted from different-sized images at the corresponding convolutional layers are fused to generate more robust representations. These fused features are integrated by Conv-LSTMs from top to bottom layers to estimate multiscale blur maps. In this manner, the proposed method can fully exploit the semantic features of top layers and the low-level spatial features of bottom layers. Thus, our method can effectively alleviate misclassification in homogeneous and low-contrast regions and is robust to cluttered backgrounds. We use multiple losses as supervisors on the blur map of each Conv-LSTM layer to improve accuracy. The loss function is the weighted summation of cross-entropy loss, Precision, Recall, F-measure, and MAE. We compare our method with nine state-of-the-art blur detectors on the CUHK [18] and DUT [28] datasets.

The associate editor coordinating the review of this manuscript and approving it for publication was Tomasz Trzcinski.
The experimental results demonstrate that the proposed method is superior to the other methods. The main contributions are three-fold.
• Our method uses two types of scales, that is, image size and feature scale, to solve the scale ambiguity of blur detection and achieves superior blur detection results.
• We propose to use Conv-LSTM to integrate high-level semantic information and low-level spatial information for pursuing accurate blur detection in homogeneous and low-contrast regions.
• We use multiple losses, i.e., cross-entropy loss, Precision, Recall, F-measure, and MAE, as our loss functions to help our network learn effective features.

II. RELATED WORK
A. HAND-CRAFTED FEATURE-BASED METHODS
Traditional DBD methods are based on hand-crafted features, such as frequency, gradient, or other low-level cues, exploiting the fact that blurred image patches have smoother gradients, smaller singular values, and fewer high-frequency components than sharp ones [17], [25]. Thus, the gradient strength or SVD values can be used as a measurement of the blur level. Golestaneh and Karam [17] proposed a blur detection method based on high-frequency multiscale fusion and sort transform, which uses DCT coefficients of the gradient magnitudes from multiple resolutions. Shi et al. [18] deliberately developed local blur features, including local filters, gradient distribution, and spectral measures, to enhance discriminative power for blur detection. Zhang and Hirakawa [19] proposed a double discrete wavelet transform (DDWT) designed to sparsify the blurred image and the blur kernel simultaneously. Tang et al. [21] used a log-averaged spectrum as a blur metric for image blur region detection. Xiao et al. [22] proposed a blur metric based on the fusion of multiscale singular-value features in gradient domains and inferred the blur maps in multiple bands. Yi and Eramian [23] introduced an image sharpness metric based on local binary patterns (LBPs) to separate in- and out-of-focus image regions. Shi et al. [24] developed a blur feature via sparse representation and image decomposition, which directly establishes the correspondence between sparse edge representation and blur strength estimation. Traditional methods need well-designed features and fine-tuned classifiers, which hinders their general use. Furthermore, traditional blur detectors obtain poor results when images contain homogeneous or low-contrast regions.

B. DEEP LEARNING-BASED METHODS
DL has recently been successfully applied to saliency detection [6]-[9], image super-resolution [35], recognition [33], and other computer vision problems. Previous attempts to use DL for blur detection showed promising performance [25]-[28]. Park et al. [25] combined hand-crafted features with deep blur features at the image patch level and fed the combined features into a fully connected neural network classifier to determine the amount of defocus blur. Zeng et al. [26] trained a CNN on super-pixels and extracted effective filters by PCA; the blur maps are produced by convolving the image with the selected filters. In [28], Zhao et al. proposed a multi-stream bottom-top-bottom fully convolutional network and a cascaded residual learning network to solve the DBD problem. Tang et al. [27] proposed a deep convolutional network that fuses the features from shallow and deep layers for blur detection, performing feature fusion and refinement step by step in a recurrent manner.
Although deep blur detectors achieve superior performance to traditional blur detectors, blur detection results can still be improved by introducing new and effective modules. We introduce Conv-LSTM [32] into our defocus blur detector and achieve promising blur detection results. Fig. 3 shows the network architecture of the proposed defocus blur detector. We partition it into a multiscale feature extraction sub-network (MsFEN) and a multiscale blur estimation sub-network (MsBEN). The MsFEN has three branches extracting multiscale convolutional features from three resized images. The features of the same-level convolutional layers from different branches are then concatenated and fused by convolutional layers to generate more robust representations. Thus, the proposed network is robust to blur scale ambiguity by considering information from two types of scale, image scale and feature scale. The MsBEN is built from Conv-LSTMs and takes the fused features as input to estimate multiscale blur maps. We adopt a top-to-bottom strategy in which Conv-LSTMs integrate the features of lower layers to progressively refine the estimated blur maps. In the following sections, we introduce the model architecture, multi-layer losses, and implementation details of our method.

III. METHOD
A. MODEL ARCHITECTURE
1) MULTISCALE FEATURE EXTRACTION SUB-NETWORK
VGG16 is a popular network in blur detection [27], [28], and we also use it as our basic feature extractor. To make it suitable for our blur detection task, we remove the last pooling layer and the last two fully connected layers to preserve the resolution of the features. The remaining five convolutional blocks (conv1−conv5) are used to extract multiscale convolutional features from the input images. Each block is denoted by a blue box in Fig. 3. Note that we only use the feature of the last convolutional layer in each convolutional block. Table 1 details the parameters of the convolutional layers and the feature sizes of the different convolutional blocks for images with different sizes. The features extracted from the three scaled images have different spatial resolutions at the corresponding convolutional blocks; these multiscale features are useful for defocus blur detection.
Let I denote an input image. We first resize I into three scales, denoted as I^1, I^2, and I^3, with sizes of 320 × 320, 256 × 256, and 192 × 192, respectively. We use the MsFEN to extract multiscale convolutional features F^s_{conv_l} from I^s, where s = 1, 2, 3 denotes the image scale and l = 1, ..., 5 denotes the layer index. Owing to the different dimensions of F^1_{conv_l}, F^2_{conv_l}, and F^3_{conv_l}, we use bilinear interpolation to resize F^2_{conv_l} and F^3_{conv_l} to the same size as F^1_{conv_l} for feature fusion. The resized features can be formulated as

F^s_{up_l} = up(Conv(F^s_{conv_l}, 1 × 1 × 64), H^1_l × W^1_l), s = 2, 3, (1)

where H^1_l and W^1_l are the height and width, respectively, of the feature maps of the l-th layer of I^1; up(·) denotes bilinear interpolation upsampling to the given size; and Conv(·, 1 × 1 × 64) denotes a convolutional layer with filters of size 1 × 1 × 64. We use this convolutional layer before upsampling to change the channel number of the convolutional features to 64. We also apply batch normalization (BN) and ReLU after the convolutional layers; unless otherwise stated, ReLU is used by default after each convolution.
The features of the same-level convolutional layers from the different branches are then fused by convolutional layers. Specifically, we first concatenate F^1_{conv_l}, F^2_{up_l}, and F^3_{up_l} and then fuse them with a convolutional layer to obtain the fused feature F_{conv_l}:

F_{conv_l} = Conv(Cat(F^1_{conv_l}, F^2_{up_l}, F^3_{up_l})), (2)

where Cat(·) denotes the concatenation operation. By fusing the convolutional features in this way, we obtain more robust features that help overcome blur scale ambiguity.
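To make the fusion concrete, the resize-and-concatenate step can be sketched in a few lines of NumPy: features from the two smaller image scales are bilinearly upsampled to the scale-1 resolution and stacked along the channel axis. This is an illustrative sketch, not the released code; the 1 × 1 × 64 channel-reduction convolution, BN, and ReLU are omitted, and the helper names are our own.

```python
import numpy as np

def bilinear_resize(feat, out_h, out_w):
    """Bilinearly resize a (H, W, C) feature map to (out_h, out_w, C)."""
    h, w, _ = feat.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :, None]          # horizontal interpolation weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse_features(f1, f2, f3):
    """Resize scale-2/3 features to scale-1 resolution and concatenate.
    A learned 1x1 fusion convolution would follow in the real network."""
    h, w, _ = f1.shape
    f2u = bilinear_resize(f2, h, w)
    f3u = bilinear_resize(f3, h, w)
    return np.concatenate([f1, f2u, f3u], axis=-1)
```

Because the three branches share the VGG16 block structure, the same routine applies at every layer index l.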

2) MULTISCALE BLUR ESTIMATION SUB-NETWORK
With the fused features F_{conv_1}–F_{conv_5}, we conduct top-to-bottom integration by feeding them into Conv-LSTMs. The estimated blur map B_5 produced from F_{conv_5} is computed as

B_5 = Conv-LSTM(F_{conv_5}),

where Conv-LSTM(·) denotes the computation of the Conv-LSTM layer, which is introduced in the following section.
To gradually refine the blur maps, we concatenate the upsampled blur map of the (l+1)-th layer with the fused feature F_{conv_l}, because the features of the bottom layers contain more fine structure information. The concatenated features are then input into a Conv-LSTM to estimate the blur map B_l:

B_l = σ(Conv-LSTM(Cat(F_{conv_l}, up(B_{l+1})))),

where σ(·) is the sigmoid activation function, which normalizes the values of the estimated blur map into [0, 1]. Fig. 4 shows a visual comparison of the blur maps estimated by the different Conv-LSTM layers; the blur maps are refined progressively.
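The top-to-bottom integration described above amounts to a simple loop over the fused features, deepest first. The sketch below uses stand-in callables for the learned Conv-LSTM and upsampling modules, so it illustrates only the data flow, not the learned behavior; all names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def top_down_refine(fused_feats, conv_lstm, upsample):
    """Top-to-bottom blur-map refinement.

    fused_feats: list [F_conv1, ..., F_conv5], shallow to deep.
    conv_lstm:   stand-in for a learned Conv-LSTM returning a 1-channel map.
    upsample:    stand-in resizing a map to a target (H, W).
    Returns the estimated blur maps [B1, ..., B5]."""
    blur_maps = []
    b = conv_lstm(fused_feats[-1])              # B5 from the deepest feature
    blur_maps.append(b)
    for feat in reversed(fused_feats[:-1]):     # layers 4 -> 1
        b_up = upsample(b, feat.shape[:2])      # bring B_{l+1} to layer-l size
        b = conv_lstm(np.concatenate([feat, b_up], axis=-1))
        blur_maps.append(b)
    return blur_maps[::-1]                      # reorder as B1..B5
```

In the real network each layer has its own Conv-LSTM; a single callable is used here only to keep the sketch short.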

3) CONV-LSTM
Conv-LSTM is developed from the traditional fully connected LSTM [34] to capture spatially correlated features. It uses convolutional operations instead of dot products to encode spatial information. Fig. 5 shows the detailed architecture of the Conv-LSTM used in this paper. Similar to LSTM, Conv-LSTM has three gates, namely, the input gate i_t, output gate o_t, and forget gate f_t, which control the information flow. The cell state C_t and hidden state H_t are updated according to the values of i_t, o_t, and f_t. Let X_t denote the input of the Conv-LSTM layer at each time step. If i_t is activated, the input is accumulated into the cell C_t. If f_t is activated, the past cell state C_{t−1} is forgotten. The propagation of C_t to H_t is controlled by the output gate o_t. The update process at time step t can be expressed as follows:

i_t = σ(W_xi ∗ X_t + W_hi ∗ H_{t−1} + b_i),
f_t = σ(W_xf ∗ X_t + W_hf ∗ H_{t−1} + b_f),
o_t = σ(W_xo ∗ X_t + W_ho ∗ H_{t−1} + b_o),
C_t = f_t ∘ C_{t−1} + i_t ∘ tanh(W_xc ∗ X_t + W_hc ∗ H_{t−1} + b_c),
H_t = o_t ∘ tanh(C_t),

where σ and tanh are the sigmoid and hyperbolic tangent activation functions, respectively; ∗ is the convolution operation; ∘ is the Hadamard product; and the W and b terms are the learned weights and biases, respectively. The maximum time step is set to 3 in our experiments. We apply a convolutional layer with a 1 × 1 × 1 filter to the hidden state of the last time step to obtain the estimated blur map.
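For readers who want to trace the gate algebra, here is a minimal NumPy rendering of one Conv-LSTM update without peephole connections. Treat it as a standard-formulation sketch rather than the paper's exact wiring; `conv2d_same` is a deliberately naive helper of our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """Naive 'same'-padded 2-D convolution summed over input channels.
    x: (H, W, Cin), w: (k, k, Cin, Cout)."""
    k = w.shape[0]; p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def convlstm_step(X, H, C, Wx, Wh, b):
    """One Conv-LSTM update; gates are stacked along channels as (i, f, o, g)."""
    z = conv2d_same(X, Wx) + conv2d_same(H, Wh) + b
    n = z.shape[-1] // 4
    i = sigmoid(z[..., :n])            # input gate
    f = sigmoid(z[..., n:2 * n])       # forget gate
    o = sigmoid(z[..., 2 * n:3 * n])   # output gate
    g = np.tanh(z[..., 3 * n:])        # candidate cell input
    C_new = f * C + i * g              # forget past state, accumulate input
    H_new = o * np.tanh(C_new)         # output gate controls the hidden state
    return H_new, C_new
```

Unrolling `convlstm_step` three times with shared weights corresponds to the maximum time step of 3 used in the experiments.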

B. MULTI-LAYER LOSSES
In recent DL-based methods, multiple losses [37] are used as supervisors for fast convergence. We supervise the blur map at each scale. Specifically, we resize the blur maps to 320 × 320, the same as the input size, and use these blur maps to compute the losses against the groundtruth. Different from image segmentation, which only uses a cross-entropy loss function, we utilize Precision, Recall, F-measure, and MAE as parts of the loss function to obtain results with high Precision, Recall, and F-measure and low MAE and cross-entropy. Thus, the loss used in this paper is defined as:

L(B, G) = L_C + α_1 L_P + α_2 L_R + α_3 L_{F_β} + α_4 L_{MAE},

where B is the blur map and G is the groundtruth. We set α_1, α_2, α_3, and α_4 to 0.1 by experience.

Cross-entropy loss L_C:

L_C = −(1 / (W · H)) Σ_p [G_p log(B_p + ε) + (1 − G_p) log(1 − B_p + ε)],

where B_p and G_p are the pixel values at position p of the blur map and groundtruth, respectively; W and H are the width and height of the input image, respectively; and ε is a small constant to avoid division by zero and numerical errors.

Precision loss L_P:

L_P = −Σ_p B_p G_p / (Σ_p B_p + ε).

Recall loss L_R:

L_R = −Σ_p B_p G_p / (Σ_p G_p + ε).

F-measure loss L_{F_β}:

L_{F_β} = −(1 + β²) · P · R / (β² · P + R + ε), with P = −L_P and R = −L_R,

where β² = 0.3 as suggested by [36]. Negative values are used so that minimizing the loss maximizes the F-measure, Precision, and Recall, whose high values indicate improved network performance.

MAE loss L_{MAE}:

L_{MAE} = (1 / (W · H)) Σ_p |B_p − G_p|.

MAE is the mean absolute error, which reflects the dissimilarity between the output and the groundtruth. The total loss of our network is defined as:

L_total = Σ_{l=1}^{5} L(B_l, G),

where B_l is the blur map produced by the l-th Conv-LSTM layer.
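The loss terms can be rendered in a few lines of NumPy. This sketch assumes the standard definitions of cross-entropy, Precision, Recall, F-measure, and MAE named in the text, with `EPS` standing in for the small constant ε; the paper's exact equation forms may differ in minor details.

```python
import numpy as np

EPS = 1e-7  # small constant to avoid division by zero / log(0)

def cross_entropy(B, G):
    """Pixel-wise binary cross-entropy between blur map B and groundtruth G."""
    return -np.mean(G * np.log(B + EPS) + (1 - G) * np.log(1 - B + EPS))

def precision_loss(B, G):
    """Negative soft precision: true positives over predicted positives."""
    return -np.sum(B * G) / (np.sum(B) + EPS)

def recall_loss(B, G):
    """Negative soft recall: true positives over groundtruth positives."""
    return -np.sum(B * G) / (np.sum(G) + EPS)

def fmeasure_loss(B, G, beta2=0.3):
    """Negative F-measure built from the soft precision and recall."""
    p, r = -precision_loss(B, G), -recall_loss(B, G)
    return -(1 + beta2) * p * r / (beta2 * p + r + EPS)

def mae_loss(B, G):
    """Mean absolute error between blur map and groundtruth."""
    return np.mean(np.abs(B - G))

def blur_loss(B, G, alphas=(0.1, 0.1, 0.1, 0.1)):
    """Weighted sum of the five terms; alpha weights of 0.1 follow the paper."""
    a1, a2, a3, a4 = alphas
    return (cross_entropy(B, G) + a1 * precision_loss(B, G)
            + a2 * recall_loss(B, G) + a3 * fmeasure_loss(B, G)
            + a4 * mae_loss(B, G))
```

Summing `blur_loss` over the five Conv-LSTM outputs gives the total training loss.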

C. IMPLEMENTATION DETAILS 1) DATA AUGMENTATION
We first pre-train our network on synthesized blur data and fine-tune it on blur benchmark datasets.

a: SYNTHESIZED BLUR DATA
We randomly pick 2000 images from the Berkeley segmentation dataset (BSDS) [39], the uncompressed color image dataset (UCID) [40], and the PASCAL 2008 dataset [41]. We then use the method proposed in [28] to synthesize out-of-focus blurred images. Specifically, we apply a Gaussian blur with kernel size 7 × 7 and σ = 2 to the top-, bottom-, left-, and right-half image regions. We apply the same blur kernel to each image region up to five times to simulate different blur extents. Thus, each image generates 20 blurred images, and we obtain a total of 40,000 synthetic blur images for pre-training. Fig. 6 shows some examples of synthesized blur images and the corresponding groundtruth.
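The synthesis procedure above can be sketched as follows for a grayscale image in pure NumPy; `gaussian_kernel`, `convolve2d_same`, and `blur_half` are illustrative helpers of our own, not the authors' code.

```python
import numpy as np

def gaussian_kernel(size=7, sigma=2.0):
    """7x7 Gaussian kernel with sigma = 2, as used for the synthetic data."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def convolve2d_same(x, k):
    """Naive same-size 2-D convolution with edge padding (grayscale)."""
    p = k.shape[0] // 2
    xp = np.pad(x, p, mode="edge")
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def blur_half(img, kernel, side="top", times=1):
    """Blur one half of a sharp image `times` times (more passes = stronger blur).
    Returns the partially blurred image and a 0/1 sharpness groundtruth mask."""
    h, w = img.shape
    region = {"top": (slice(0, h // 2), slice(None)),
              "bottom": (slice(h // 2, h), slice(None)),
              "left": (slice(None), slice(0, w // 2)),
              "right": (slice(None), slice(w // 2, w))}[side]
    out = img.astype(float).copy()
    half = out[region]
    for _ in range(times):
        half = convolve2d_same(half, kernel)
    out[region] = half
    mask = np.ones((h, w))
    mask[region] = 0.0          # blurred half is labelled 0 (not sharp)
    return out, mask
```

Looping `side` over the four halves and `times` over 1 to 5 yields the 20 blurred variants per source image mentioned above.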

b: REAL BLUR DATA
We employ flipping, rotation, and cropping to augment the real blur images. Specifically, we rotate images by 11 random angles with zero padding and flip the images in the horizontal, vertical, and horizontal-vertical directions. We crop images whose aspect ratios fall outside [0.95, 1.05] into two images. Finally, the number of training images is approximately 16 times the number of original images.

2) TRAINING DETAILS
The model is implemented in Keras and runs on a Titan Xp GPU. We use the weights of VGG16 trained on ImageNet to initialize our feature extraction networks. The newly added layers are initialized via ''Xavier Uniform''. For pre-training, we set the learning rate to 5 × 10^−7 and the batch size to 2. We stop pre-training after three epochs because the network has converged by then. For fine-tuning, we set the learning rate to 1 × 10^−5 and decay it by a factor of 0.1 every five epochs until it reaches 1 × 10^−8. The batch size is set to 2. We stop training when the network has converged. The optimizer is Adam [38].

IV. EXPERIMENT
A. DATASET
In the experiments, we use two publicly available defocus blur detection datasets, DUT and CUHK, to evaluate the proposed method. The DUT dataset [28] contains 600 training and 500 testing images. The images of the DUT dataset have large homogeneous and low-contrast regions, which are challenging for blur detection.
The CUHK dataset [18] contains 704 defocus-blurred images, of which 604 are for training and 100 for testing. The images in the CUHK dataset have complex scenes and cluttered backgrounds. For a fair comparison, we utilize two training/testing partition strategies. The first partition, CUHK_1, is the same as that of Tang et al. [27]. The second partition, CUHK_2, is proposed by Zhao et al. [28].

B. EVALUATION CRITERIA
We use F-measure and MAE to evaluate the different blur detection methods. F-measure is a comprehensive indicator that weights precision and recall; a higher value denotes better performance. Our F-measure computation is the same as in [28], [29]. Specifically, we resize the blur map to 320 × 320 and obtain a binary map by OTSU thresholding. We then compute the precision and recall of the binary map. Finally, the F-measure is computed as follows:

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall),

where β² = 1 and β² = 0.3 are used in our experiments, as in other blur detection methods [27], [29], [30].
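The evaluation can be reproduced approximately with a histogram-based Otsu threshold followed by the F_β formula. This is a sketch under the assumption that blur-map values lie in [0, 1]; it is not the authors' evaluation code.

```python
import numpy as np

def otsu_threshold(x, bins=256):
    """Otsu's method on values in [0, 1]: maximize between-class variance."""
    hist, edges = np.histogram(x, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)                  # class-0 (below threshold) probability
    w1 = 1.0 - w0                      # class-1 probability
    mu = np.cumsum(p * centers)        # class-0 cumulative mean mass
    mu_t = mu[-1]                      # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu_t * w0 - mu) ** 2 / (w0 * w1)
    between = np.nan_to_num(between, nan=0.0, posinf=0.0)
    k = int(np.argmax(between))
    return edges[k + 1]                # pixels >= this edge are the sharp class

def f_measure(blur_map, gt, beta2=0.3, eps=1e-7):
    """Binarize the blur map with Otsu, then compute F_beta against gt."""
    binary = (blur_map >= otsu_threshold(blur_map)).astype(float)
    tp = float(np.sum(binary * gt))
    precision = tp / (binary.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```

Setting `beta2=1.0` gives the F_1 variant also reported in the experiments.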
The MAE is computed from the blur map and groundtruth without binarization:

MAE = (1 / (W · H)) Σ_p |B_p − G_p|,

where W and H are the width and height of the image, respectively, and B_p and G_p are the values of the blur map and groundtruth at position p, respectively.

C. COMPARISON OF METHODS
We compare against nine state-of-the-art blur detection methods. DHDE [25], BTBCRL [28], and DeFusionNET [27] are DL-based blur detectors. JNB [24], DBDF [18], HiFST [17], LBP [23], SVD [15], and SS [21] are hand-crafted feature-based blur detectors. We use the results provided by the authors or by Tang et al. [27]. Fig. 7 shows the blur detection results of the different blur detectors on images with different scenes. Scene1 contains images with simple scenes but large homogeneous regions. Scene2 contains images with cluttered backgrounds. Scene3 contains images whose foreground and background have similar colors. Scene1 shows that SS [21] and DeFusionNET [27] fail to detect the large homogeneous regions in the first and third images, respectively. DHDE [25] cannot suppress the values of the blurred backgrounds in any of the images. Although HiFST [17] generates high values on the sharp regions, its boundaries are blurred. Compared with the other blur detectors, the proposed method is accurate in homogeneous regions. Scene2 shows three images with cluttered backgrounds. The first and third images have bright spots in the blurred backgrounds; almost all compared blur detectors treat these bright spots as sharp regions, whereas our method accurately detects the sharp regions. In the second image, the compared methods fail to detect the doll's body, while our method detects the entire sharp body region.

D. EXPERIMENTAL RESULTS ANALYSIS
1) QUALITATIVE ANALYSIS
Scene3 shows the blur detection results of two images sharing similar foregrounds and backgrounds. BTBCRL [28] fails to detect the real sharp regions from the blurred backgrounds.
HiFST [17] treats the blurred background as sharp. SS [21] and DHDE [25] misclassify the blurred background as a sharp region. Compared with the other blur detectors, our method achieves correct blur detection results on both representative images.
The above analysis indicates that the proposed blur detection method is robust in different scenarios and superior to state-of-the-art blur detection methods.

2) QUANTITATIVE ANALYSIS
We report F_0.3, F_1, and MAE of the different blur detectors on DUT, CUHK_1, and CUHK_2 in Table 2. Except for DHDE [25], the DL-based blur detectors outperform the hand-crafted feature-based methods. Our method achieves the highest F-measure values and the lowest MAE values on the DUT and CUHK_2 datasets. On CUHK_1, our method ranks second, while DeFusionNET [27] ranks first.
On the DUT dataset, the F_0.3, F_1, and MAE of the proposed method are 0.873, 0.874, and 0.074, respectively. Among the state-of-the-art blur detectors, our method is the only one whose F_0.3 and F_1 are higher than 0.87 and whose MAE is lower than 0.1. Compared with DeFusionNET [27], our method achieves 5.1% and 7.2% relative F_0.3 and F_1 improvements, respectively, and a 37.8% relative MAE reduction.
We conduct our experiment on CUHK_2 for comparison with BTBCRL [28]. The results of DHDE [25] and DeFusionNET [27] are not reported on this dataset. On CUHK_2, compared with BTBCRL, our method achieves 1.4% and 3.0% relative F_0.3 and F_1 improvements, respectively, and a 39.1% relative MAE reduction.
We also report the PR curves and the Precision, Recall, and F-measure bars of the different blur detection methods on the two benchmark datasets in Fig. 8. Our Precision, Recall, and F-measure values are the highest among the compared methods on both datasets. Note that the PR curves of our method are flat when Recall is smaller than 0.8 because our results are close to binary blur maps, whose Precision and Recall are not sensitive to the threshold.
From the above analysis, the proposed method is superior to other state-of-the-art blur detection methods.

3) RUNNING TIME
We report the running time of the different blur detection methods for processing an image of size 320 × 320 in Table 3. Except for BTBCRL [28] and DeFusionNET [27], whose running times are taken from their papers, all compared methods are timed on the same computer. Our method is slightly slower than DeFusionNET, which is tested on a Titan Xp. Our method takes 0.06 s to process an image, which is faster than the other compared methods.

E. ABLATION STUDIES 1) USING DIFFERENT BASIC FEATURE EXTRACTORS
We replace VGG16 with VGG13 and VGG19 and report their performance on the DUT dataset in Table 4. The table shows that using deeper networks can improve defocus blur detection performance; however, deeper networks consume more computational resources. VGG16 achieves performance similar to VGG19 while taking less running time. Thus, we use VGG16 as our basic feature extractor.

2) EFFECTIVENESS OF MULTISCALE FEATURE
The proposed network utilizes three image scales with input sizes of 320 × 320, 256 × 256, and 192 × 192. We compare three networks with different combinations of input image sizes to verify the effectiveness of the proposed multiscale feature extraction. The first network uses all three input sizes and is dubbed Net-S_3. The second, Net-S_2, adopts the two input sizes 320 × 320 and 256 × 256. The third, Net-S_1, uses only the 320 × 320 input size.
We use F_0.3, F_1, and MAE to evaluate Net-S_1, Net-S_2, and Net-S_3 on the DUT and CUHK datasets. The results are shown in Table 5. All criteria of Net-S_3 are better than those of Net-S_1 and Net-S_2. On the DUT dataset, Net-S_1 and Net-S_2 have similar performance, which suggests that using only two scales as inputs cannot improve blur detection performance on this dataset, whereas Net-S_3 is better than both. On the CUHK dataset, Net-S_2 shows better blur detection performance than Net-S_1. Although the F_0.3 and F_1 of Net-S_3 are only slightly better than those of Net-S_2, Net-S_3 outperforms both Net-S_1 and Net-S_2 overall. This analysis shows that the multiscale feature extraction network effectively extracts multiscale deep features and successfully fuses features across scales.

3) EFFECTIVENESS OF CONV-LSTM
We replace the Conv-LSTM layers with convolutional layers in the blur estimation sub-network to verify the effectiveness of Conv-LSTM. The first variant, named Single-Conv, replaces each Conv-LSTM with a single convolutional layer. The second, denoted Multi-Conv, replaces each Conv-LSTM with three convolutional layers that have the same number of parameters as the corresponding Conv-LSTM. The details of these layers are shown in Table 6. We retrain these networks with the same training data and report F_0.3, F_1, and MAE on the DUT and CUHK datasets in Table 7. All evaluation criteria using Conv-LSTM layers are better than those using convolutional layers.

V. CONCLUSION
We propose a defocus blur detector in this paper to address blur scale ambiguity. Our method employs two strategies to obtain scale-robust blur detection results, namely, different image sizes and multiscale convolutional features. We use Conv-LSTM layers to integrate the multiscale features so that high-level semantic and low-level spatial features are effectively incorporated to generate accurate blur detection. We utilize the weighted loss of cross-entropy, F-measure, Precision, Recall, and MAE to supervise the network. The network is pre-trained on synthesized blur images and fine-tuned on real blur images. Experimental results on two benchmark datasets show that our method is superior to other state-of-the-art blur detection methods.