De-fencing and Multi-Focus Fusion using Markov Random Field and Image Inpainting

Multi-focus image fusion aims at combining source information from differently focused images, and has many applications in machine vision. This paper focuses on the removal of fence occlusions in multi-focus images. The proposed model extracts a fence occlusion map using salient image features, which is then refined by morphological operators. Binary operators and inpainting methods are used for fence removal and restoration. The proposed model estimates the fence area using statistical characteristics of the focused regions. Similarly, binary filtering is used to thin enlarged areas for optimized restoration. The proposed model employs guided filtering for consistency verification. Fusion and restoration results are compared using several (reference and no-reference) image quality metrics. Simulations show that the proposed scheme achieves better results, both visually and quantitatively, than existing state-of-the-art techniques.


I. INTRODUCTION
Restoration of fence-occluded images has emerged as an interesting research area, since it is often desirable to remove unwanted objects and distractors from images. Scenes captured at places like zoos, parks, windows, or balconies are commonly occluded by fences. Occlusion removal and restoration schemes can be classified into intensity-based, transform-domain, and deep-learning-based methods. Generally, these methods require multiple images or frames for fence removal and restoration; therefore, fusion of fenced images is in itself a challenging task.
In this paper, a two-stage framework is proposed for the fence removal task in multi-focus images. Salient image features in the focused regions are used to extract a fence occlusion map, and the raw map is refined using guided and morphological filters. The fence is removed using binary operators, and inpainting is then used for restoration. The fusion weight map is refined by employing guided filtering for consistency verification. The proposed fusion model for multi-focus fenced images is shown in figure 1, and a step-wise illustration of the scheme is presented in figure 2. The proposed scheme is compared with the latest fusion schemes, and its results are better, both visually and quantitatively, than those of existing state-of-the-art techniques. Several image quality metrics, based on entropy, phase congruency, and properties of the human visual system, are used for fusion evaluation. Similarly, reference and no-reference metrics are used for the evaluation of the restored images.
The rest of the paper proceeds as follows: Section 2 presents a brief survey of multi-focus fusion, and fence removal techniques. The proposed scheme, along with its mathematical formulations, is presented in section 3. Simulation results and comparisons are discussed in section 4. The final section then concludes the paper while suggesting some future work.
II. RELATED WORK
In [11], an image de-fencing technique uses graph cuts and dilation to detect and better represent the fence region irrespective of its shape and color. For noisy inputs, wavelet thresholding and convolutional neural network (CNN) based de-noising are performed. However, the scheme sometimes produces blobs in the resultant image. In [12], a single-stage image de-fencing network based on adversarial, structural, and perceptual information is proposed. The scheme provides promising results; however, it requires multiple GPUs and training data. In [13], a regression-based method is proposed for irregular or uniquely shaped fences. A deep-learning-based image inpainting method [2] uses a pyramid-based loss function (based on adversarial, style, and L1 information) to generate multi-scale structural information simultaneously.
In [5], multiple threshold values are used to extract a multi-colored fence based on the intensity component, morphological operations, and a hybrid inpainting scheme. In [10], exemplar-based image inpainting, along with a dictionary of salient structures (based on k-means labeling), is used to restore the damaged areas. The algorithm shows good performance; however, it needs manual initialization for the detection of damaged areas. In [14], the disparity map of stereo images is computed and refined by a matting technique to obtain the fence mask. The optical flow estimate (while blurring the fence regions) is used in a split Bregman optimization method with total variation as a regularization constraint. In [15], a modified CNN-based deep learning technique uses multiple stereo frames for fence detection. The scheme shows good performance; however, it sometimes fails to remove fences due to inaccurate motion estimation. Liu et al. [16] formulated fence removal as a deep-learning-based layer decomposition and reconstruction problem, in which coarsest-level feature representations and motion vectors are used to reconstruct the background image using a residual learning network. In [17], a supervised recurrent network with long short-term memory cells and an asymmetric loss function is employed to detect fences.
In [21], multi-scale decomposition and sparse representation are used to jointly perform noise removal and fusion. The fusion model achieves promising results for noisy images; however, the scheme could be further improved by employing hybrid features. Li et al. [22] proposed a dictionary-learning fusion framework based on a nuclear norm regularizer and morphological constraints. The scheme preserves fine-scale details while avoiding noise; however, the fusion and restoration tasks are performed separately. In [23], an all-in-focus resultant image is obtained using multi-scale sparse representation to extract salient information. The scheme defines adaptive fusion rules to decide in-focus, out-of-focus, and boundary regions. However, it sometimes suffers from information loss due to channel-wise processing of color images. Yu et al. [24] used fractional differential coefficients and gradient histograms to extract salient image features. The scheme learns an overcomplete dictionary using patch-based clustering to transfer structural information into all-in-focus images. Similarly, Tan et al. [25] used geometric sparse coefficients (obtained from a single dictionary image) to generate focus regions and preserve important information in the resultant images. The scheme also improves time complexity, as overcomplete dictionary training is not required.
Khalid et al. [19] proposed a supervised learning algorithm based on gradient distributions and edge orientations to classify fence pixels. Du et al. [18] employed a fully connected CNN and temporal refinement to remove fences; the optical-flow-based motion estimation method of [14] is then used for information fusion. Farid et al. [7] formulated a Gaussian distribution of the fence based on the k-neighbours of an initially chosen sample. To classify each pixel as fence or non-fence, the Bayesian rule along with a connected-component method is used, and the method of Criminisi et al. [8] is then applied at a down-sampled level to fill in the textural information. However, the scheme requires careful manual selection of fence pixels. Gupta et al. [20] proposed an image de-fencing framework based on cascaded adversarial networks: the first network outputs a binary mask, while the second network generates the fence-free image. In [26], median and anisotropic diffusion filtering are performed on high-resolution images for focus map extraction. The maximum-intensity rule and filtering-based difference images are then used to obtain the fusion weight map.
The curvature filtering (CF) [27] method defines a focus detection criterion based on spatial frequency and local characteristics. The scheme efficiently extracts salient features in focused regions and shows promising results. Qiu et al. [28] used guided filtering and focus region detection for the fusion of multi-focus images, and the scheme shows robust performance. A deep-learning-based technique [34] uses a convolutional layered architecture, binary segmentation, and morphological filtering to obtain focus maps. Its weighted-sum fusion strategy produces good-quality results, but sometimes suffers from inaccuracies, which can be minimized by using different training models. In [29], bilateral filtering is used to refine the coarse weight maps obtained from gradient-based saliency maps. This spatial-domain method seamlessly fuses images from different modalities. Recently, Li et al. [30] proposed residual-removal-based multi-focus fusion using multi-scale focus correction and the non-subsampled contourlet transform (NSCT). The hybrid framework, based on the structure tensor, shows promising results and is time efficient. In [6], subtraction and Gaussian filtering are used to extract fences in multi-focus images, and image inpainting methods are then used to restore the fence-occluded regions. However, the scheme sometimes suffers from improper focus map detection.
In [31], focus maps are obtained using the sum-modified Laplacian, and are further refined by block consistency verification and guided filtering. Neighbour distance filtering and decision maps are then used to reconstruct an all-in-focus image. Recently, Peng et al. [35] proposed a coupled neural-P architecture and NSCT-based fusion scheme for multi-focus images. The scheme, however, requires an additional focus measure step in contrast to other CNN-based methods. Yang et al. [32] used NSCT decomposition and consistency verification with a guided filter in their image fusion framework; the fusion map for the frequency components is obtained from visual uniqueness and saliency measures. Wang et al. [36] used K-means clustering and NSCT to fuse the frequency components of different segmented regions. Zhang et al. [33] proposed super-clustering with varying cluster sizes to handle both flat and detail-rich regions and minimize jagged artifacts. The scheme shows good results for color images, but lower performance on gray-scale images. Zhang et al. [37] used polarization imagery to train a deep CNN in an unsupervised manner without ground-truth images; the framework combines SSIM-based and encoder-feature-based loss functions with a suitable weighting parameter. In a nutshell, deep-learning-based methods provide better accuracy; however, they have more complex architectures and require large training data compared to classical methods.

III. PROPOSED SCHEME
Let A and B represent two multi-focus input images with dimensions X × Y × γ, where X designates the width, Y the height, and γ the number of channels. Image A has the fence in focus, whereas image B has the background in focus.
Since the focused regions contain more high-frequency components than the de-focused areas, the high-frequency components of the normalized intensity images are computed. Let A_g and B_g denote the normalized intensity (grayscale) versions of the source images A and B. Moreover, let Ā and B̄ represent their smoothed versions,

Ā = A_g * h_r,   B̄ = B_g * h_r,

where * denotes convolution and h_r represents the disk operator with radius r ∈ (1, 2), which makes it well suited to extracting high-frequency components in the fence region.
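The smoothing step above can be sketched in NumPy as follows; the function names, kernel normalization, and reflect padding are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def disk_kernel(r):
    """Binary disk of radius r, normalized to sum to 1 (averaging kernel h_r)."""
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x * x + y * y <= r * r).astype(float)
    return k / k.sum()

def smooth(img, r=2):
    """Convolve a 2-D image with the disk operator h_r (reflect padding)."""
    k = disk_kernel(r)
    p = np.pad(img, r, mode="reflect")
    out = np.zeros_like(img, dtype=float)
    # Accumulate shifted, kernel-weighted copies (kernel is symmetric,
    # so cross-correlation equals convolution here).
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += k[dy, dx] * p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

# Normalized intensity image (values in [0, 1]) and its smoothed version
A_g = np.random.rand(32, 32)
A_bar = smooth(A_g, r=2)
```

In practice the same operation is available as a library call (e.g. a disk-shaped averaging filter); the explicit loop is kept here only to make the operator concrete.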
The difference images, based on the intensity images and their blurred versions, roughly detect the focused regions in the source images:

D_A = |A_g − Ā|,   D_B = |B_g − B̄|.

To exclude false edges from the salient features estimated through the high-frequency components, range-threshold binarization is employed:

A_b(x, y) = 1 if n_1 ≤ D̄_A(x, y) ≤ n_2, and 0 otherwise,

where the limits pertaining to the fence-region components, n_1 and n_2, are derived from the intensity histogram, and D̄_A denotes the smoothed version of the difference image. The binary map A_b provides an initial estimate of the fence region, with some gaps, specifically at the crossings of the fence. The fence-region estimate is then refined by performing guided filtering on the binary map A_b:

A_s = ξ(A_g, A_b, φ, ϵ),

where ξ(.) represents the guided filter, A_g serves as the guidance image, φ is the radius, and ϵ is the regularization parameter.
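The refinement step ξ(.) can be sketched with the classic box-filter formulation of the guided filter. The helper names and the toy thresholds n_1, n_2 below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def box(img, r):
    """Mean filter over a (2r+1)x(2r+1) window (edge padding)."""
    h, w = img.shape
    p = np.pad(img, r, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2

def guided_filter(I, p, r=4, eps=1e-3):
    """Guided filter xi(.): guidance I, input p, radius phi=r, regularizer eps."""
    mI, mp = box(I, r), box(p, r)
    varI = box(I * I, r) - mI * mI
    covIp = box(I * p, r) - mI * mp
    a = covIp / (varI + eps)        # local linear coefficients
    b = mp - a * mI
    return box(a, r) * I + box(b, r)

def range_binarize(D, n1, n2):
    """Range-threshold binarization: keep responses within [n1, n2]."""
    return ((D >= n1) & (D <= n2)).astype(float)

# Toy pipeline: difference image -> binary map -> edge-aware refinement
A_g = np.random.rand(64, 64)
D_A = np.abs(A_g - box(A_g, 2))            # high-frequency (difference) image
A_b = range_binarize(box(D_A, 1), 0.02, 0.5)
A_s = guided_filter(A_g, A_b)              # refined fence-region estimate
```

Using the grayscale image A_g as guidance lets the refined map follow image edges, which is what closes the small gaps at fence crossings.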
The initial fence map is then extracted by taking the pixel-wise difference

M = A_s − M_B,

where A_s is the guided-filtered binary fence estimate, and M_B depicts the background activity extracted from B_s by performing guided filtering with B_g as the guidance image. The refined fence map is represented by

ω = δ(ζ(M)),

where ζ(.) and δ(.) represent the morphological closing and small-object removal operations, respectively. The disk size in the closing operation is optimized based on the blur metric [48]. The fence-region weight map is then used to generate the initial fused image

I = ω ⊙ A + (1 − ω) ⊙ B,

where ⊙ denotes element-wise multiplication. The final fusion map is obtained by employing guided filtering for consistency verification on ω, as W = ξ(I, ω), with I as the guidance image. The fused image is generated using the guided-filter-based final weight map as

F = W ⊙ A + (1 − W) ⊙ B.

To remove the fence, the initial weight map ω is eroded as

ω_a = ω ⊖ se,

where ⊖ indicates erosion and se is the structuring element. The erosion reduces the fence area for better inpainting. The binary eroded mask ω_a is then used to prepare the masked image

C = (1 − ω_a) ⊙ F.

Next, the fence removal task is achieved by using an inpainting method that uses local information to fill the empty

TABLE 4: Mean run time of the fusion schemes

Scheme          Time (sec)
CF [27]         7.18 ± 0.17
GFDF [28]       0.10 ± 0.01
CNN [34]        128.69 ± 4.01
MISF [29]       0.30 ± 0.02
NSCT-RR [30]    9.42 ± 0.31
Proposed        4.28 ± 0.29

regions while avoiding artifacts and distortions. Figure 8 shows the major steps followed to perform the de-fencing task. A Markov random field (MRF) based learnt network [49], using minimum mean squared error estimation, effectively restores the missing areas [50]. The generative properties of the MRF use flexible Gaussian scale mixtures, along with noisy and smoothed versions, to reconstruct the missing regions. The network has learnt the statistics of natural images, which makes it well suited for the restoration of natural-scene images. The generation process can be represented as

R = Ψ(C, Γ),

where Ψ represents the inpainting method that uses the learnt model Γ to fill the masked image C, and R represents the restored image. To overcome differences in the filled regions and produce a more pleasing look, the restored image is sharpened as S = χ(R, se), where χ(.) represents the sharpening operation with structuring element se. The restored and sharpened image S is then used to generate the final de-fenced image D.

IV. SIMULATION RESULTS
Fusion results are evaluated using several image quality metrics [46], including the normalized feature mutual information Q_FMI [47]. Figure 3 presents the fusion results of the proposed and existing schemes on a 'field' image (case-1). The scene contains fence, field, and sky regions. Most existing methods show good overall performance; CNN [34] and MISF [29] show better visual results than GFDF [28], NSCT-RR [30], and CF [27]. Still, the existing schemes produce various unwanted artifacts. For instance, CF [27] contains a partially focused fence, and MISF [29] suffers from a blurring effect near the lower part of the image. Similarly, a small portion of the fence is thinned (in the central part of the output image) by CNN [34]. The proposed scheme, however, properly focuses both the fence and the background.
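The learnt MRF model Γ of [49] is not reproduced here. As an illustrative stand-in only (an assumption, not the paper's actual inpainting), the sketch below erodes a toy fence mask ω with a square structuring element and fills the masked pixels by simple neighbour-averaging diffusion, mimicking how local information fills the empty regions:

```python
import numpy as np

def erode(mask, r=1):
    """Binary erosion with a (2r+1)x(2r+1) square structuring element se."""
    h, w = mask.shape
    p = np.pad(mask.astype(bool), r, mode="constant", constant_values=False)
    out = np.ones((h, w), dtype=bool)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out &= p[dy:dy + h, dx:dx + w]
    return out

def diffuse_fill(img, hole, iters=300):
    """Fill hole pixels by repeatedly averaging their four neighbours."""
    out = img.astype(float).copy()
    out[hole] = out[~hole].mean()        # crude initialization
    for _ in range(iters):
        nbr = 0.25 * (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
                      np.roll(out, 1, 1) + np.roll(out, -1, 1))
        out[hole] = nbr[hole]            # update only the masked pixels
    return out

# Toy example: a flat image occluded by a thin vertical 'fence' stripe
F = np.ones((32, 32))
fence = np.zeros((32, 32), dtype=bool)
fence[:, 15:18] = True
omega_a = erode(fence, 1)                # thinned mask (omega eroded by se)
R = diffuse_fill(F, omega_a)             # restored image
```

The actual scheme replaces the diffusion step with the learnt MRF statistics of natural images, which handle texture far better than plain smoothing.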
Figure 4 presents fusion results for the 'fuel-station' image (case-2). The scene contains abundant edge and color information. The fused results contain both background and foreground (fence) information. However, the output images generated by CNN [34] and NSCT-RR [30] suffer from blur along the sides. On careful observation, it can be seen that the fence information is not properly used by MISF [29], GFDF [28], and CF [27]. While most schemes under study show limited performance, the proposed scheme produces better results through proper extraction and fusion of the source information. Figure 5 shows fusion results for the 'pool' image (case-3). The first multi-focus input has the foreground fence in focus, while the second input has the background in focus. It can be seen that most of the schemes properly use the background information. However, the foreground fence is not properly fused, and the results suffer from blurring artifacts. For instance, the bottom (center) and top (right) parts of the GFDF [28] output are blurred. Similar observations can be made for the outputs generated by CNN [34], MISF [29], and NSCT-RR [30]. CNN [34] and MISF [29] also suffer from blurring artifacts along the fence in the tree regions. CF [27] properly uses the background information, but the fence information is not properly fused. NSCT-RR [30] has some unclear areas, particularly along the sides and boundary regions, and also suffers from a broken fence. While most of the existing schemes suffer from artifacts, the proposed scheme successfully recovers the details in the background as well as the fence region. Figure 6 shows fusion results for the 'deer' image. The multi-focus pair contains a thin occluded foreground. In this case, the existing schemes show limited fusion performance. For instance, the output generated by CF [27] is nearly identical to the background-focused input (figure 6(b)), ignoring the foreground net.
On the other hand, the outputs generated by CNN [34], MISF [29], GFDF [28], and NSCT-RR [30] are close in appearance to the fence-in-focus input (figure 6(a)). Further, part of the foreground fence is blurred in CNN [34]. The proposed scheme generates a pleasing fusion result in which both the foreground and background are entirely focused. Figure 7 shows fusion results for the 'canal' image. The fusion results of CF [27], CNN [34], MISF [29], and NSCT-RR [30] suffer from blurring artifacts. Furthermore, the entirely blurred fence in the CF [27] output indicates that the foreground information is barely used. The proposed scheme, along with GFDF [28] (figure 7(d)), shows acceptable output. Table 4 presents a time complexity analysis of the existing and proposed fusion methods. Simulations are carried out on a machine with an Intel(R) Core(TM) i7-6500U CPU @ 2.50-2.60 GHz and 8 GB RAM. The mean run times (for images of size 407×271) show that the proposed scheme is time efficient in comparison to most of the existing fusion methods. The proposed de-fencing method is compared with the hybrid inpainting method (HIM) [7] and fence removal based on pyramid inpainting (FRPI) [6]. Multi-focus image pairs are used as input for [6], whereas a single fenced image is used for [7]. Five cases are presented to show the superiority of the proposed algorithm. Figure 9 presents the de-fencing case called 'cow', in which the cow, grass, and sky regions are occluded by a fence. The image restored by HIM [7] successfully recovers the sky region and parts of the grass. However, the fence in the lower part is not fully removed, and parts of the cow's legs are merged with the grass. The regions restored by FRPI [6] are better than those of HIM [7]; however, the result suffers from fence residues. The proposed scheme shows better restoration results, as the fence is fully removed and the inpainted regions are least affected. Figure 10 presents the image de-fencing case called 'home'.
Here, the fence covers the body, clothes, leaves, shower, and other parts of the background. The image restored by HIM [7] removes most of the fence; however, some regions are filled with improper color information, and the shower is hardly restored. Moreover, it can be observed that the left part of the image is distorted with information from parts of the woman's shirt and hand.
FRPI [6] fully removes the fence, apart from some distortions in the fence region. In most regions, suitable background information is used to fill the fence region; however, some parts of the shower, leaves, and cap suffer from color distortions. The proposed scheme yields better results, as the fence is successfully removed, the fence regions are properly filled with local information, and there are no other distortions. Figure 11 presents the image de-fencing case called 'veg', in which the fence mainly occludes vegetation along with wooden crates and stones. The outputs generated by HIM [7] and FRPI [6] successfully remove most of the fence. A blurring effect is evident in the HIM [7] output, specifically in the lower-left corner of the image, while some noise distortions can be observed in the FRPI [6] output. The proposed scheme successfully removes the fence, and the fence regions are properly filled without other distortions. Figure 12 shows the multi-focus images, the fence-superimposed image, and the de-fencing results for the 'bird' case. It can be observed that the resultant images of HIM [7] and FRPI [6] are not properly de-fenced. For example, fence parts (with a gray shade) in the HIM [7] recovered image are not completely removed. On the other hand, the background fence area is not properly filled by FRPI [6]. The de-fenced image generated by the proposed scheme is better than those of the existing schemes. Figure 13 illustrates the de-fencing results for the 'lawn' case. In this case also, FRPI [6] suffers from improper inpainting in the extracted fence regions. Similarly, on careful observation, distortions are evident in the upper parts of the image recovered by HIM [7]. The proposed scheme shows better results in comparison to the existing methods.
In summary, HIM [7] suffers from poor fence detection, resulting in distortions even in parts not covered by the fence. FRPI [6] detects the fence better and thus avoids such problems; however, its fence map must be refined to reduce residual noise. The proposed scheme not only detects the fence regions more accurately, but also refines the fence map to avoid distortions in fence-free regions before applying the inpainting method, thereby producing better results. Image quality features like contrast, luminance, and structure are inherently related to the statistical characteristics of images.
Well-known image quality metrics [51], used for the evaluation of the proposed scheme, are briefly described here. Mean squared error (Q_MSE) measures the average squared error between original and recovered image pixels. Normalized absolute error (Q_NAE) [52] computes the mean absolute error between images, while the structural similarity index measure (Q_SSIM) [53] is based on local luminance and structural information. The blind/referenceless image spatial quality evaluator (Q_BRISQUE) [54] predicts a differential score using a feature model trained on natural-scene images with distortions like blurring and noise. The natural image quality evaluator (Q_NIQE) [55] computes the difference between image features and features extracted by a statistical model based on natural scenes. The perceptual image quality evaluator (Q_PIQE) [56] is inversely proportional to image quality, as it evaluates blocking artifacts in distorted blocks using local contrast information. Here, Q_MSE, Q_NAE, and Q_SSIM are full-reference metrics, while Q_BRISQUE, Q_NIQE, and Q_PIQE are no-reference quality metrics. The ideal values for Q_MSE, Q_NAE, and Q_SSIM are 0, 0, and 1, respectively, while Q_BRISQUE and Q_PIQE take values in the range [0, 100]. Lower scores for Q_MSE, Q_NAE, Q_BRISQUE, Q_NIQE, and Q_PIQE, and higher scores for Q_SSIM, indicate better quality images. The scores presented in table 5 support the proposed method in comparison to the state-of-the-art methods [7], [6].
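The full-reference metrics above can be made concrete with minimal NumPy implementations. These are illustrative sketches rather than the official implementations; in particular, Q_SSIM is shown in a simplified single-window (global) form, whereas the standard metric averages over local windows:

```python
import numpy as np

def q_mse(x, y):
    """Mean squared error between original x and recovered y (ideal: 0)."""
    return float(np.mean((x - y) ** 2))

def q_nae(x, y):
    """Normalized absolute error: sum|x - y| / sum|x| (ideal: 0)."""
    return float(np.abs(x - y).sum() / np.abs(x).sum())

def q_ssim_global(x, y, L=1.0):
    """Global (single-window) SSIM for images with dynamic range L (ideal: 1)."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cxy + c2) /
                 ((mx * mx + my * my + c1) * (vx + vy + c2)))

# Toy comparison: a clean patch versus a slightly distorted copy
rng = np.random.default_rng(0)
clean = rng.random((32, 32))
noisy = np.clip(clean + 0.05 * rng.standard_normal((32, 32)), 0, 1)
scores = (q_mse(clean, noisy), q_nae(clean, noisy), q_ssim_global(clean, noisy))
```

These definitions make the "lower is better" versus "higher is better" directions of the table scores explicit.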

V. CONCLUSION
This study focuses on the restoration of fence-occluded multi-focus images. The proposed framework uses salient image features in focused regions to extract fence occlusion maps. The raw map is refined using guided and morphological filtering, and the fusion map is refined by employing guided filtering for consistency verification. The fence is removed using binary operators, and inpainting is then used for restoration. The results of the proposed scheme are better, both visually and quantitatively, than those of existing state-of-the-art techniques.
It should be mentioned that the filtering parameters require fine tuning when working with different types of fences. Hence, optimal restoration needs fence map refinement and thinning operations with careful parameter selection. The proposed model also assumes multi-focus scenes without skewing artifacts. Deep learning methods can be incorporated for better extraction of depth segmentation maps from multi-focus fence-occluded images. Similarly, the work can be extended to videos by exploiting inter-frame similarity and projective geometry.