Focus Measurement in Color Space for Shape From Focus Systems

Shape from Focus (SFF) has been studied extensively in computer vision for 3D shape and depth recovery. The first stage in SFF methods is to compute the focus value of every pixel by converting the color images to gray scale and then applying a focus measure operator. This conversion can map pixels with different color values onto the same gray scale value, which reduces the overall accuracy of the system. In a color image, focused pixels maintain a considerable color difference from their neighboring pixels, whereas defocused pixels blend into their neighborhood. This article presents an alternative method that measures the degree of focus by processing color images directly. The color differences of the neighboring pixels with respect to the central pixel are obtained and summed, and their spread is then calculated. The sum and the spread are combined to measure the degree of focus of the pixel under consideration. The proposed focus measure is then used for shape recovery of various simulated and real objects and is compared with previous techniques. The comparison shows that the proposed method has the highest correlation and the smallest RMSE values, confirming the effectiveness of using color images for shape recovery.


I. INTRODUCTION
In computer vision, shape-from-X is a passive monocular technique to recover the 3D geometry of a scene or an object from a set of images. Here, X is a 2D characteristic (for example, shading, motion, stereo, focus, or defocus) used as a cue to infer the 3D shape. This approach has various industrial applications in diagnostics, autonomous vehicle guidance, microscopy, etc., [1]-[3]. Shape-from-Focus (SFF) deals with shape recovery using several images of the same scene that are captured after manipulating the focus settings of the imaging device, [4]. SFF measures the amount of focus of each pixel along the optical axis to identify the best-focused pixels, which are used to recover the depth of the scene, [5]-[7].
In SFF, all the images are taken either by moving the object along the optical axis in small steps, each of size $\Delta_{step}$, or by changing the focus of the imaging device in small steps.
The associate editor coordinating the review of this manuscript and approving it for publication was Wenming Cao.
An image (of dimensions l × m) is stored at each step, and n is the total number of images in the image stack, as shown in Fig. 1, and is given by:

$$n = \frac{U_{disp}}{\Delta_{step}}, \quad (1)$$

where $\Delta_{step}$ is the step size and $U_{disp}$ is the total displacement of the object plane. $U_{disp}$ is chosen in such a way that the magnification of the object remains constant, [8].
The lens law can be used to explain the focusing of every point in the image sequence. If the object distance from the lens shifts the image plane away from the sensor plane by a distance D, a circle of confusion (CoC) is created, [9], which leads to a blurred image, as shown in Fig. 2. However, according to geometric optics, if the image plane lies exactly on the sensor plane, no CoC is formed and a sharp image is obtained. The object distance $D_O$ from the lens is related to the image distance by:

$$\frac{1}{f} = \frac{1}{D_O} + \frac{1}{D_I}, \quad (2)$$

where f is the focal length of the lens and $D_I$ is the image distance from the lens.
After image acquisition, a focus measure (FM) operator is applied to find the best-focused points in the image stack. The FM operator acts as a high-pass filter, which suppresses the out-of-focus points and enhances the focused points in the image stack. Ideally, the FM yields its maximum value at the best-focused position. This value decreases as the object moves away from this point in either direction (towards or away from the lens). Images are taken in such a way that the object is initially completely defocused; the images then gradually become focused on every point before becoming defocused again.
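The relation between the lens law and the circle of confusion can be illustrated numerically. The following is a minimal sketch under a thin-lens assumption; the function names, the aperture parameter, and the example values are hypothetical and are not taken from the article.

```python
def image_distance(f, d_o):
    """Thin-lens image distance D_I for focal length f and object distance D_O."""
    if d_o == f:
        raise ValueError("object at the focal point: image forms at infinity")
    return f * d_o / (d_o - f)


def blur_circle_diameter(aperture, d_i, sensor_distance):
    """Circle-of-confusion diameter from similar triangles.

    aperture        : lens aperture diameter (assumed parameter)
    d_i             : distance at which the point is in perfect focus
    sensor_distance : actual lens-to-sensor distance
    """
    return aperture * abs(sensor_distance - d_i) / d_i


# Hypothetical values in millimetres.
d_i = image_distance(50.0, 200.0)
print(d_i)                                    # in-focus image distance
print(blur_circle_diameter(25.0, d_i, d_i))   # sensor on the image plane: no blur
print(blur_circle_diameter(25.0, d_i, 70.0))  # sensor displaced: blurred point
```

The blur diameter grows linearly with the offset between the sensor plane and the image plane, which is why the focus value peaks at one position along the optical axis.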
Finding the best-focused pixels in SFF is a crucial step in estimating depth information. Traditionally, an FM is applied to images after converting them to grayscale. Grayscale images can be obtained using the following:

$$\mathcal{G}(i, j) = 0.213R(i, j) + 0.715G(i, j) + 0.072B(i, j), \quad (3)$$

where $\mathcal{G}(i, j)$ is the grayscale value of the (i, j) pixel in the image and R(i, j), G(i, j), B(i, j) are the RGB values of that pixel. Several other conversions have also been outlined in, [10].
However, converting a color image to grayscale leads to a loss of information. Some distinct color values map to the same gray values incorrectly during this conversion, [11]. Hence, when the FM is applied to these grayscale images, the focus values are not calculated correctly, resulting in incorrect depth extraction.
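The collision described above can be reproduced with the coefficients of (3). This is a small illustrative sketch; the two example colors are chosen here for demonstration and do not come from the article's data-sets.

```python
def to_gray(r, g, b):
    """Weighted grayscale conversion using the coefficients of (3)."""
    return 0.213 * r + 0.715 * g + 0.072 * b


# Two visibly different colors (illustrative values only).
red = (255, 0, 0)
green = (0, 76, 0)

# After rounding to integer gray levels, both colors collapse to one value.
gray_red = round(to_gray(*red))
gray_green = round(to_gray(*green))
print(gray_red, gray_green)
```

An edge between these two colors is sharp in the color image but vanishes after conversion, so a gray-level FM has nothing to respond to.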
In this paper, a new focus measure, $FM_c$, is proposed that uses the color information in the images. $FM_c$ receives colored images as input and computes the focus values from them directly, eliminating the initial conversion of colored images to grayscale that conventional methods require. $FM_c$ is applied to image stacks of various objects to recover their 3D shapes. The proposed FM has shown better results when compared with previous methods.
This article is structured as follows. The next section provides a brief overview of the related work. In Section III, the importance of color images is discussed, while the proposed method is described in Section IV. Results are presented and discussed in Section V, which is followed by Section VI that concludes the article.

II. RELATED WORK
An FM operator acts as a High-Pass-Filter which enhances the focused regions and suppresses the defocused regions. It computes the sharpness of a selected center pixel by choosing a small local window. Since each object point has different focus values with various optical settings, these values are compared with neighboring object points to identify the position of the pixel where it is best focused, i.e., with the highest focus value.
A variety of FM operators have been proposed in the previous literature. An overview is provided in, [12]. FMs are divided into six significant groups depending upon their operating principles. Gradient-based operators take the first derivative of an image to compute the focus value, [13]-[16]. Laplacian-based operators measure the focus by calculating the second derivative of an image, [15]-[19]. Wavelet-based operators use the wavelet transform to describe an image's spatial and frequency contents, [15], [20], [21], whereas statistics-based operators take into account different statistics of an image to measure the focus, [16], [19], [22], [23]. Discrete cosine transform (DCT) based operators use DCT coefficients from the frequency content of the image to calculate the focus level, [24], [25]. Finally, operators that do not belong to any of the aforementioned categories are grouped as miscellaneous operators, [26], [27].
Laplacian operators can be applied to detect regions of high contrast. They have been used as a focus measure in both auto-focus, [28]-[30], and SFF, [15], [16], [18], [19], [31]. They compute the second derivative for each pixel of the given image window. However, the second derivatives in the i and j directions can yield equal values with opposite signs, which cancel each other. Therefore, the Sum of Modified Laplacian (SML) was proposed, which is the sum of the absolute second derivatives in the i and j directions, and is given by:

$$SML(i_o, j_o) = \sum_{(i,j) \in N(i_o,j_o)} \left( \left| \frac{\partial^2 \mathcal{G}(i,j)}{\partial i^2} \right| + \left| \frac{\partial^2 \mathcal{G}(i,j)}{\partial j^2} \right| \right),$$

where $\mathcal{G}(i, j)$ is the gray value of the pixel $p_{(i,j)}$ at coordinates (i, j), and $N(i_o, j_o)$ is the local neighborhood around $p_{(i,j)}$ in the window of ω × ω. Modified versions of Laplacian operators are further studied in, [16], [31], [32].

The Tenenbaum (TENG) operator is discussed in, [13], [16], [33], [34]. This FM is based on the Sobel gradient operator, [31], [35]. It obtains its value by summing the squared responses of the horizontal and vertical Sobel masks. To increase the robustness of this FM further, the results are summed over the local pixel window, which is defined as:

$$TENG(i_o, j_o) = \sum_{(i,j) \in N(i_o,j_o)} \left( S_i(i,j)^2 + S_j(i,j)^2 \right),$$

where $S_i$ and $S_j$ are the responses of the two Sobel masks.

Gray Level Variance (GLVA) is one of the most popular FM operators, [16], [31], [36], [37]. It follows the assumption that regions with high gray-level variance are sharper than regions with low gray-level variance. GLVA is defined as:

$$GLVA(i_o, j_o) = \sum_{(i,j) \in N(i_o,j_o)} \left( \mathcal{G}(i,j) - \mu \right)^2,$$

where µ is the mean gray value within a neighborhood $N(i_o, j_o)$ around $p_{(i,j)}$. This method is also discussed in, [22], [38].

The Sum of Wavelet Coefficients (WAVS) approach decomposes the image into four sub-images of wavelet transform coefficients. In the first level, the image is transformed into three detail sub-bands ($W_{LH}$, $W_{HL}$, $W_{HH}$) and one approximation sub-band ($W_{LL}$). In the higher levels, the approximation sub-band is further divided into detail and approximation sub-bands, [39]. The information from both kinds of sub-bands is then used to compute the degree of focus.
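The three gray-level operators described above can be sketched in a few lines. The code below uses the commonly cited textbook forms (discrete modified Laplacian, 3 × 3 Sobel masks, windowed variance); these forms, the normalization of the variance, and the test images are assumptions made for illustration and may differ in detail from the cited implementations.

```python
def window(i0, j0, half=1):
    """Index pairs of the (2*half+1) x (2*half+1) window centred at (i0, j0)."""
    return [(i, j) for i in range(i0 - half, i0 + half + 1)
                   for j in range(j0 - half, j0 + half + 1)]


def sml(img, i0, j0, half=1):
    """Sum of Modified Laplacian: absolute second differences in i and j."""
    total = 0
    for i, j in window(i0, j0, half):
        total += abs(2 * img[i][j] - img[i - 1][j] - img[i + 1][j])
        total += abs(2 * img[i][j] - img[i][j - 1] - img[i][j + 1])
    return total


def teng(img, i0, j0, half=1):
    """Tenenbaum measure: squared Sobel responses summed over the window."""
    total = 0
    for i, j in window(i0, j0, half):
        gx = (img[i - 1][j + 1] + 2 * img[i][j + 1] + img[i + 1][j + 1]
              - img[i - 1][j - 1] - 2 * img[i][j - 1] - img[i + 1][j - 1])
        gy = (img[i + 1][j - 1] + 2 * img[i + 1][j] + img[i + 1][j + 1]
              - img[i - 1][j - 1] - 2 * img[i - 1][j] - img[i - 1][j + 1])
        total += gx * gx + gy * gy
    return total


def glva(img, i0, j0, half=1):
    """Gray-level variance in the window (normalised here by the pixel count)."""
    vals = [img[i][j] for i, j in window(i0, j0, half)]
    mu = sum(vals) / len(vals)
    return sum((v - mu) ** 2 for v in vals) / len(vals)


# Illustrative 5 x 5 gray images: a sharp vertical edge and a featureless patch.
edge = [[0, 0, 255, 255, 255] for _ in range(5)]
flat = [[128] * 5 for _ in range(5)]

for fm in (sml, teng, glva):
    print(fm.__name__, fm(edge, 2, 2), fm(flat, 2, 2))
```

All three operators respond to the edge and return zero on the featureless patch, which is the high-pass behavior the section describes.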
All of these FMs can be applied to a stack of images to measure the focus values, with the aim of constructing a depth map (initial shape) of the object. Although these FMs have different working principles, they all take gray scale images as input and then compute the focus values of the image. The effect of window size on FM performance is discussed in, [40], which suggests that window sizes of 3 × 3 and 5 × 5 give better results. The mathematical calculation for window size is provided by, [41].
These FMs can be applied to color images by processing the color channels separately, [42]. While this can be done efficiently for a single image or a few images, processing each channel for a large number of images is computationally costly. This inefficiency hinders the use of SFF in real-time applications.

III. MOTIVATION
Chromatic aberration is a common optical problem when using lenses. It occurs when a lens fails to bring light of different wavelengths (colors) to the same focal plane, as shown in Fig. 3. An achromatic doublet can be used to reduce the amount of aberration, but it does not produce an exact correction. An accurate image distance ($D_I$) can only be recovered by processing each channel (R, G, B) individually, since each channel corresponds to a different wavelength and therefore affects $D_I$ differently. This can be shown by combining the Lens Maker equation, [43], [44], with the Cauchy equation, [45], which gives (8), where φ and µ are the coefficients that together determine the refractive index of the lens material, [45], and $r_1$ and $r_2$ are the radii of curvature of the lens surfaces closer to and farther from the light source, respectively. For a typical lens, (8) can be rewritten as (9) in terms of two quantities, ψ and ε. From (9), if ψ and ε are taken as constants, the relationship between the wavelength of light (λ) and the image distance ($D_I$) becomes directly proportional and can be written as:

$$D_I \propto \lambda. \quad (12)$$

Fig. 4 shows the effect of λ on $D_I$ for two different values of the object distance ($D_O$) when using a fused silica convex lens (LE4412). As λ is increased from 450 nm (blue spectrum) to 700 nm (red spectrum), the change in $D_I$ is approximately 2.5 mm.
Equation (12) shows that the depth varies by changing the wavelength, even though all the other parameters (including D O ) in (9) are kept constant. Therefore, accurate depth can only be recovered if colored images are considered instead of grayscale images.
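The dependence of the image distance on wavelength can be made explicit with textbook forms of the equations named above. The following is a sketch under stated assumptions (a thin lens in air and a two-term Cauchy dispersion model); it is not a reproduction of the article's own equations, whose exact form may differ.

```latex
\begin{align}
  n(\lambda) &\approx \varphi + \frac{\mu}{\lambda^{2}}
    && \text{(two-term Cauchy model, assumed)} \\
  \frac{1}{f(\lambda)} &= \bigl(n(\lambda) - 1\bigr)
    \left(\frac{1}{r_1} - \frac{1}{r_2}\right)
    && \text{(thin-lens Lens Maker equation, assumed)} \\
  \frac{1}{D_I(\lambda)} &= \frac{1}{f(\lambda)} - \frac{1}{D_O}
    && \text{(lens law)} \\
  D_I(\lambda) &= \left[\left(\varphi + \frac{\mu}{\lambda^{2}} - 1\right)
    \left(\frac{1}{r_1} - \frac{1}{r_2}\right) - \frac{1}{D_O}\right]^{-1}
\end{align}
```

Under these assumptions, with µ > 0 and a converging lens, a longer wavelength lowers n(λ), lengthens f(λ), and therefore increases $D_I$ for a fixed $D_O$ beyond the focal length. The red, green and blue components of the same object point thus come to focus at slightly different sensor positions.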
After image acquisition in SFF, an FM is applied to measure the focus value. All the FM operators from the previous literature described above use grayscale images to measure the focus value. The accuracy of an FM can be improved if RGB images are employed, since they contain more information than grayscale images, [46]. Some RGB color combinations give the same grayscale value because isoluminant changes are not preserved when converting to grayscale; therefore, distinct pixels cannot be distinguished from one another with this approach, [46], as shown in Fig. 5.
In colored images, visual cues are preserved which help detect salient features, and hence are of considerable importance in various applications, [47]. Grayscale conversion of colored images, such as taking the lightness channel in CIE L*a*b, suffers from information loss. Although many decolorization methods have been proposed to preserve this information, all of these methods have some limitations, [48]. In addition, grayscale mappings of color images that are constructed solely by approximating spectral uniformity are inadequate, because isoluminant visual cues signaled only by chromatic differences are lost, [49].

IV. PROPOSED FOCUS MEASURE
In Shape from Focus (SFF), images are acquired by moving the object in finite steps, $\Delta_{step}$, towards or away from the imaging device. Images can also be acquired by changing the focus settings of the imaging device, provided the magnification remains constant. At every step the acquired image is stored in the image stack, [8]. Each image in the image stack is represented by $I_k$, where 1 ≤ k ≤ n, and n is the total number of images given in (1). The dimension of each image is l × m. Thus, each pixel in the image stack can be represented by $P_{i,j,k}$, where 1 ≤ i ≤ l and 1 ≤ j ≤ m, as shown in Fig. 1. In any color space $\Theta$, the color value of a pixel $P_{i,j,k}$ in the image stack is defined by:

$$\Gamma_{i,j,k} = (C^{\Theta}_{i,j,k})^{T} \cdot \hat{c} = \Omega_{i,j,k}\,\hat{\omega} + \Xi_{i,j,k}\,\hat{\xi} + \Psi_{i,j,k}\,\hat{\psi}, \quad (13)$$

where $\Omega_{i,j,k}$, $\Xi_{i,j,k}$, $\Psi_{i,j,k}$ represent the color values of the pixel $P_{i,j,k}$, $\hat{c} = [\hat{\omega}\ \hat{\xi}\ \hat{\psi}]^{T}$ is the unit vector, and $C_{i,j,k}$ is the magnitude vector (in color space $\Theta$).
The first step in calculating the focus value of each colored pixel is to subtract the neighboring color vectors from the color vector of the pixel $P_{i,j,k}$ and to collect the results in $\Delta_{i,j,k} = \{\delta_1, \delta_2, \ldots, \delta_{\omega^2 - 1}\}$, where each $\delta_\gamma$, 1 ≤ γ ≤ ω² − 1, is the color difference between $P_{i,j,k}$ and its neighbor $P_{\omega_i,\omega_j,k}$. Here, $(\omega_i, \omega_j)$ represents the indices of the (ω² − 1) neighboring pixels around $P_{i,j,k}$ in a window of ω × ω, that is, every index pair in the window with $(\omega_i, \omega_j) \neq (i, j)$.
In the second step, the variance (σ²) of the collected color differences $\Delta_{i,j,k}$ is calculated. In the final step, the sum of the color differences and their variance are combined to obtain the focus measurement $FM_c(i, j, k)$ of $P_{i,j,k}$. After calculating the focus value of each pixel, the depth of each object point is obtained by finding the position of the best-focused pixel corresponding to that object point. This is acquired by maximizing the focus value along the optical axis using the following:

$$D(i, j) = \arg\max_{k} FM_c(i, j, k),$$

where 1 ≤ k ≤ n.
When the depth map is obtained for all the object points, a 3D shape is recovered. The focus curve obtained by FM c can be further improved by applying certain fitting functions using different models, [50]. The flow of the proposed method is summarized in Fig. 6.
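The steps of this section can be sketched as follows. This is a hedged illustration only: the Euclidean distance used for the color difference and, in particular, the additive combination of the sum and the variance are assumptions made here, since the exact combination rule is not recoverable from the text. The function names and the tiny test stack are hypothetical.

```python
import math


def color_distances(img, i, j, half=1):
    """Color distances from pixel (i, j) to every other pixel in its window.

    ASSUMPTION: the color difference is taken as the Euclidean distance
    between the two color vectors.
    """
    center = img[i][j]
    dists = []
    for di in range(-half, half + 1):
        for dj in range(-half, half + 1):
            if di == 0 and dj == 0:
                continue  # skip the central pixel itself
            neighbor = img[i + di][j + dj]
            dists.append(math.sqrt(sum((a - b) ** 2
                                       for a, b in zip(center, neighbor))))
    return dists


def fm_color(img, i, j, half=1):
    """Color focus value from the sum and the spread of the color distances.

    ASSUMPTION: sum and variance are combined by addition; this is a
    placeholder for the article's combination rule.
    """
    d = color_distances(img, i, j, half)
    total = sum(d)
    mean = total / len(d)
    variance = sum((x - mean) ** 2 for x in d) / len(d)
    return total + variance


def best_focus_index(stack, i, j, half=1):
    """0-based frame index k that maximises the focus value at (i, j)."""
    scores = [fm_color(frame, i, j, half) for frame in stack]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 3-frame stack of 3 x 3 RGB images.
B1, B2, C = (0, 0, 255), (0, 0, 100), (255, 0, 0)
flat = [[(120, 60, 90)] * 3 for _ in range(3)]            # fully blended
sharp = [[B1, B2, B1], [B2, C, B2], [B1, B2, B1]]         # strong color contrast
soft = [[(110, 0, 130)] * 3 for _ in range(3)]
soft[1][1] = (140, 0, 100)                                # weak color contrast
stack = [flat, sharp, soft]

print(best_focus_index(stack, 1, 1))
```

Repeating `best_focus_index` for every (i, j) gives the depth map as frame indices, which can be converted to physical depth using the known step size of the stack.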

V. RESULTS AND DISCUSSION
This section provides analysis of experimental results and discusses them in detail, and is divided into four subsections. First, details of the experimental setup are provided, followed by the depth map and shape assessment criteria, and the metric measures used. After the FM analysis is presented, the 3D shape results are analyzed at the end.

A. EXPERIMENTAL SETUP
The experiments were divided into two categories: FM analysis and 3D shape reconstruction analysis. Different data-sets were used for each analysis. The following sections explain the experimental setup.

1) EXPERIMENTAL SETUP FOR FOCUS MEASURE ANALYSIS
$FM_c$ is analyzed for the two images shown in Fig. 7 (left and right). In Fig. 7 (left) the foreground is focused and the background is defocused, whereas in Fig. 7 (right) the foreground is defocused and the background is focused. Three focused pixels are randomly selected from one image and their corresponding defocused pixels are selected from the other. The pixels are then analyzed for their focus-defocus difference. The results are presented in a later section of the manuscript. Fig. 8 provides the focus-defocus images of the selected pixels along with their 3 × 3 neighbors.

2) EXPERIMENTAL SETUP FOR 3D SHAPE RECONSTRUCTION ANALYSIS
Experiments for shape reconstruction analysis were performed on five publicly available data-sets comprising a total of 19 objects. Table 1 provides a summary of the objects used in the 3D shape analysis, including the name of the object, the object type, the color scheme, the dimensions of the images in the stack, and the number of images in the stack. The first set contains three simulated objects, Simulated Cone 1, 2 and 3, as given in Fig. 9. Three different colored data-sets of simulated cones were generated with different lens positions using camera simulation software (AVS). The details of AVS are provided in, [51]-[53]. The Matlab code used can be downloaded from, [53]. All of these data-sets consist of 97 images of 360 × 360 pixels.
The AVS software is provided with the depth map, texture image and camera parameters. The texture map consists of concentric circles of two alternating colors. The colors are chosen in such a way that their gray level values are the same.
The second data-set contains three real objects: Real Cone 1, Real Plane, and the LCD-TFT Filter. These image sequences were originally in gray scale, but for the sake of this experiment they were pseudo-colored, [54]-[56]. The colors were chosen in the same way as for data-set 1. Fig. 11 shows the 50th frame of each image sequence. These image sequences have been widely used by many researchers, [5], [38], [57]-[59]. Fig. 10 provides the ground truth for the Simulated Cones and Real Cone 1 data-sets.
The third data-set consists of three real objects: a Coin, a Measuring Tape, and Real Cone 2. The Coin and Measuring Tape image sequences were taken from a video source provided by, [60], using the frames from 1:15-2:23 seconds. The Real Cone 2 image sequence was generated by changing the focus of a Nikon D5300 camera with a 35-300 mm lens. The camera was mounted on a stand, using the system described in, [60]. The focal length of the lens was kept constant while the focus of the lens was changed in small steps. For this image sequence, a conical paper cup was used and color stripes were drawn onto it to provide texture. The two colors used were red and blue. Fig. 12 provides the 65th frame of the Real Cone 2 image sequence and the 50th frame of the Coin and Measuring Tape image sequences, in both color and gray scale.
The fourth data-set consists of six real objects: Balls, Bucket, Fruits, Keyboard, Plants and Kitchen. All of these image sequences were taken with a mobile phone. The details of the image sequences are provided in, [61], and are available at, [62].
VOLUME 9, 2021
The fifth data-set consists of four objects: Logi1, Logi2, Logi3 and Logi4. These image sequences were obtained from a Logitech webcam. The details of these are provided in, [12], and are available at, [53]. Fig. 13 provides the 10th image of each image sequence of the fourth and fifth data-sets.

B. DEPTH MAP/SHAPE ASSESSMENT CRITERIA
The shape reconstruction quality measures the perceived difference between the reconstructed shape and the ideal shape. The quality of shape reconstruction decreases as the difference increases. In this article, the quality of the depth map obtained by using different focus measures is also analyzed. In the ideal case, the obtained depth map is indistinguishable from the original map and the difference is zero; hence, the quality of the map is at its maximum. Several quality metrics have been provided in the previous literature. In this manuscript, RMSE and correlation are used to compare the proposed $FM_c$ with other focus measure operators.

1) ROOT MEAN SQUARE ERROR (RMSE)
RMSE is the square root of the variance of the residuals of the data under observation. It indicates how close the perceived shape is to the original shape. A lower value of RMSE indicates better results, [63]. RMSE is given by:

$$RMSE = \sqrt{\frac{1}{XY} \sum_{i=1}^{X} \sum_{j=1}^{Y} \left( D_{GT}(i, j) - D(i, j) \right)^{2}},$$

where X and Y are the horizontal and vertical number of pixels, D represents the processed depth map, and $D_{GT}$ represents the original (ground truth) depth map.
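As a concrete illustration of this metric, the sketch below computes the RMSE between two depth maps stored as nested lists. The function name and the toy depth maps are hypothetical.

```python
import math


def rmse(depth, reference):
    """Root mean square error between a recovered and a reference depth map."""
    rows, cols = len(depth), len(depth[0])
    squared_error = sum((reference[i][j] - depth[i][j]) ** 2
                        for i in range(rows) for j in range(cols))
    return math.sqrt(squared_error / (rows * cols))


# Toy 2 x 2 depth maps (illustrative values only).
recovered = [[0.0, 0.0], [0.0, 0.0]]
ground_truth = [[2.0, 2.0], [2.0, 2.0]]
print(rmse(recovered, ground_truth))
```

An RMSE of zero means the recovered depth map coincides with the reference; the value is expressed in the same units as the depth itself.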

2) CORRELATION
Correlation is the statistical relationship between two variables that are being measured. This metric is best used to demonstrate the linear relationship between two shapes, [64].
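For two depth maps, the Pearson form of this metric can be sketched as follows. The use of the Pearson coefficient is the standard choice and is assumed here; the function name and the toy maps are hypothetical.

```python
import math


def correlation(depth, reference):
    """Pearson correlation between two depth maps of identical size."""
    a = [v for row in depth for v in row]
    b = [v for row in reference for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant depth map")
    return cov / math.sqrt(var_a * var_b)


# Toy maps: one scaled copy and one inverted copy of the reference.
reference = [[1.0, 2.0], [3.0, 4.0]]
scaled = [[2.0, 4.0], [6.0, 8.0]]
inverted = [[4.0, 3.0], [2.0, 1.0]]
print(correlation(scaled, reference), correlation(inverted, reference))
```

Because the coefficient is invariant to scale and offset, it captures whether the recovered shape follows the reference shape, while RMSE captures the absolute depth error. An inverted shape gives a negative value.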
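The correlation coefficient is computed as:

$$Corr = \frac{\sum_{i}\sum_{j} \left( D(i,j) - \bar{D} \right)\left( D_{GT}(i,j) - \bar{D}_{GT} \right)}{\sqrt{\sum_{i}\sum_{j} \left( D(i,j) - \bar{D} \right)^{2} \, \sum_{i}\sum_{j} \left( D_{GT}(i,j) - \bar{D}_{GT} \right)^{2}}},$$

where D is the processed depth map, $D_{GT}$ is the original depth map, and $\bar{D}$ and $\bar{D}_{GT}$ are their mean values.

C. FOCUS MEASURE ANALYSIS
The RGB color distances from the selected center pixels of Fig. 8 to their neighbors, for the focused and the defocused cases, are shown in Fig. 14(a) and Fig. 14(b). It is evident from the figures that the RGB-color-distances, and the spread of the distances, from the focused center pixel to its neighbor pixels are larger than those from the defocused center pixel to its neighbors. Hence, the RGB-color-distances and their spread are higher for focused pixels. This leads to the conclusion that when the focus of the pixel is reduced, the color-distances and their spread are also reduced. This phenomenon allows the best-focused pixel to be separated from other pixels.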
The proposed $FM_c$ was also compared with some other well-known FM methods: Image Contrast (CONT), Image Curvature (CURV), DCT Energy Ratio (DCTE), Histogram Entropy (HISE), Sum of Modified Laplacian (SML), Modified Laplacian (LAPM), Tenenbaum (TENG), Gray Level Variance (GLVA), DCT Reduced Energy Ratio (DCTR), Spatial Frequency (SFRQ), Threshold Absolute Gradient (GRAT), and Sum of Wavelet Coefficients (WAVS). For comparison purposes, all the FMs were tested on equal terms. The colored pixels shown in Fig. 8 were converted to gray scale, since previous methods cannot process colored images. The percentage differences between the focused and defocused values of each of the selected pixels were calculated using $FM_c$ and then compared with the results obtained from the previous algorithms. Table 2 shows the percentage difference between the focused and defocused values of the selected pixels. It is clear that the proposed method gives a larger difference between the focused and defocused values, whereas the other methods give a smaller difference or, in some cases, no difference at all. The second to fourth columns of the table give the percentage difference for the selected pixels, whereas the fifth column provides the same analysis averaged over the whole image. Some of the methods, such as WAVS, give mixed results, and therefore cannot be considered as reliable as $FM_c$. Image Contrast has the best results among the other FMs, but the proposed FM still significantly outperforms it.

D. 3D SHAPE RECONSTRUCTION ANALYSIS
The shape reconstruction results of $FM_c$ are also compared with some well-known FMs (as listed in the focus measure analysis above). Some of these operators are discussed in detail in, [12].
For a fair comparison, the colored data-sets were converted to gray scale so that a 3D shape could be recovered using these FMs, as they cannot process colored images. Some gray scale data-sets showed the same gray value, and thus some frames could not be distinguished from one another. Fig. 9(b, d, f) shows the gray scale converted frames of the colored simulated cones, whereas Fig. 11(d, e, f) shows the 50th frame, converted to gray scale, from the image stacks of Real Cone 1, Real Plane and LCD-TFT Filter. Similarly, Fig. 12(d, e, f) shows example gray scale images from the stacks of Real Cone 2, Coin, and Measuring Tape. Also for a fair comparison, the FMs were applied to each of the R, G, and B channels separately. The resultant shapes acquired from each channel were then averaged to form the final shape. The shapes reconstructed using these FMs are also compared with the shape reconstructed using $FM_c$.
In both Table 3 and Table 4, $FM_c$ uses the color image sequence. In Table 3 the other FMs use the gray scale image sequence, and in Table 4 the other FMs are applied to each of the R, G and B channels separately before averaging the resultant shapes from each channel. [...] Fig. 9. For Simulated Cone 3 and Real Cone 1, the gray scale conversion has little effect and the results are better. In Table 4 the FMs are applied to each RGB channel separately, and no gray scale conversion is applied. Therefore, the results for Simulated Cones 1, 2 and 3, and for Real Cone 1, are very good in terms of RMSE and correlation. However, $FM_c$ still performs best among all the FMs. Figs. 15 and 16 provide a graphical representation of the data in the tables. These tables and figures clearly demonstrate that $FM_c$ performs better than the other FMs, even when they are applied to the RGB channels separately.
Applying the FMs to each RGB channel not only increases the computational complexity and time consumption, but also does not guarantee an improvement of the shape. For example, TENG for Simulated Cone 3 performed better, in terms of both RMSE and correlation, when applied to the gray scale image sequence (converted from the color images) than when applied to the R, G, and B channels separately. A similar trend for TENG is observed in the case of Real Cone 1. Other FMs, such as CURV, SML, SFRQ, and GRAT, have shown similar behavior.
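The per-channel procedure discussed above can be sketched as follows. This is an illustrative sketch only: a windowed variance stands in for the gray-level FM, and the function names and the toy stack are hypothetical.

```python
def window_variance(channel, i, j, half=1):
    """Stand-in gray-level FM: variance of one channel inside the window."""
    vals = [channel[a][b] for a in range(i - half, i + half + 1)
                          for b in range(j - half, j + half + 1)]
    mu = sum(vals) / len(vals)
    return sum((v - mu) ** 2 for v in vals) / len(vals)


def best_frame(channel_stack, i, j):
    """0-based index of the frame with the highest focus value at (i, j)."""
    scores = [window_variance(frame, i, j) for frame in channel_stack]
    return scores.index(max(scores))


def per_channel_depth(stack_rgb, i, j):
    """Apply the FM to R, G and B separately, then average the three depths."""
    depths = []
    for c in range(3):  # three full passes over the stack, one per channel
        channel_stack = [[[px[c] for px in row] for row in frame]
                         for frame in stack_rgb]
        depths.append(best_frame(channel_stack, i, j))
    return depths, sum(depths) / 3.0


def checker(on, off):
    """3 x 3 checkerboard of two RGB colors."""
    return [[on if (r + c) % 2 == 0 else off for c in range(3)] for r in range(3)]


# Toy stack: red texture is sharp in frame 0, green/blue texture in frame 2.
stack_rgb = [
    checker((255, 50, 50), (0, 50, 50)),
    checker((90, 90, 90), (90, 90, 90)),
    checker((50, 255, 255), (50, 0, 0)),
]
print(per_channel_depth(stack_rgb, 1, 1))
```

The stack is traversed three times, and when the channels peak at different frames the averaged depth falls between them and matches none of the per-channel estimates, which is consistent with the observation that per-channel processing does not guarantee a better shape.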
These FMs failed to process gray scale images (converted from the color sequences) for certain colors, and have also shown mixed results when applied to the RGB channels separately. [...] the shapes are ambiguous when recovered with previous techniques using gray scale data-sets.
$FM_c$ was also compared with CONT, CURV, DCTE, HISE, SML, LAPM, TENG, GLVA, DCTR, SFRQ, GRAT and WAVS by measuring RMSE and correlation. Table 3 shows the RMSE and correlation for the shape reconstruction using the simulated cones and Real Cone 1 data-sets. It is clear from Table 3 that $FM_c$ gives better results when compared with traditional FMs. The proposed $FM_c$ yields the highest correlation and the lowest RMSE when compared with the other FMs. The ground truth for the depth maps of the simulated cones and Real Cone 1 is provided in, [35], [52], [65]. This manuscript uses the same depth maps as the reference. Hence, it can be concluded that the shapes recovered by the other FMs when using isoluminant colored (gray scale) data-sets are imprecise; they therefore show very large values of RMSE and very small (even negative) values of correlation.
Experiments were also performed on the third data-set, which includes the Real Cone 2, Coin and Measuring Tape image sequences. Fig. 27(a-d) shows the reconstruction of Real Cone 2 using the proposed FM, SML, TENG, and GLVA. Fig. 27(e-h) shows the reconstruction results for the Measuring Tape. The shape of the Measuring Tape is similar to that of a planar object with sides (in black) as background.
The proposed method not only successfully rejected the background while preserving the planar property of the Measuring Tape, but also preserved the edges. The other FMs, on the other hand, were unable to do both together. Fig. 27(i-l) shows similar results. The Coin represents a cylindrical object. In the figure, only $FM_c$ is able to suppress the background (black) while preserving the cylindrical shape of the object, whereas the other FMs only preserved the edges and failed in other areas of shape reconstruction. Fig. 28 shows the shape reconstruction of the fourth data-set, and Fig. 29 shows the shape reconstruction of the fifth data-set. Both figures clearly show that the shape reconstruction using the proposed method (first column from the left in the figures) gives better results than the others.
Therefore, it can be concluded that the color information used by the proposed method increases the accuracy of shape reconstruction during image focus analysis. While existing focus measures cannot process color information, the proposed FM utilizes this feature and produces better results.

VI. CONCLUSION
Measuring focus quality is an essential step in the Shape from Focus (SFF) process for depth estimation and 3D shape recovery. All previously proposed methods use gray scale images to measure the degree of focus. Single-channel information is extracted from a three-channel color space while converting to gray scale, inevitably leading to information loss. In this article, an alternative focus measure is proposed to measure the degree of focus in color space. This color focus measure processes RGB images, and hence color-critical information is preserved. The color differences of the neighboring pixels with respect to the central pixel are obtained and summed, and their spread is then calculated. The sum and the spread are then combined to measure the degree of focus of the pixel under consideration.
The proposed focus measure $FM_c$ was tested with nine colored data-sets (three Simulated Cones, Real Cone 1, Real Plane, LCD-TFT Filter, Real Cone 2, Coin, and Measuring Tape), and their 3D shapes were recovered. Previously proposed focus measures (such as the Sum of Modified Laplacian, Tenenbaum, and Gray Level Variance) were also used for depth estimation and shape recovery. For a fair comparison, the results were computed in two ways: (i) the colored data-sets were converted to gray scale, as these FMs cannot process colored images, and (ii) the FMs were applied to each of the R, G, and B channels separately before averaging the resultant shapes. $FM_c$ was applied to the color image sequences only.
The results show better depth recovery with $FM_c$, as no information is lost, whereas the other methods produced inaccurate results. The proposed method also showed the highest correlation and the lowest RMSE when compared with previous methods in both scenarios.

APPENDIX
When working with RGB colors, the color space in (13) can be replaced with RGB. Thus, (13) can be written as:

$$\Gamma_{i,j,k} = (C^{RGB}_{i,j,k})^{T} \cdot \hat{c} = R(i, j, k)\,\hat{r} + G(i, j, k)\,\hat{g} + B(i, j, k)\,\hat{b} = \begin{bmatrix} R(i, j, k) \\ G(i, j, k) \\ B(i, j, k) \end{bmatrix},$$

where $\Gamma_{i,j,k}$ is the color value of the pixel $P_{i,j,k}$, 0 ≤ R ≤ 255, 0 ≤ G ≤ 255, 0 ≤ B ≤ 255, and $\hat{c} = [\hat{r}\ \hat{g}\ \hat{b}]^{T}$ is the unit vector in RGB color space. Similar results can be obtained by measuring the difference between the center pixel and the neighboring pixels in the CIE L*a*b and HSV color spaces.
For CIE L*a*b space, (13) is modified as:

$$\Gamma_{i,j,k} = (C^{L^*a^*b}_{i,j,k})^{T} \cdot \hat{c} = L(i, j, k)\,\hat{L} + a(i, j, k)\,\hat{a} + b(i, j, k)\,\hat{b} = \begin{bmatrix} L(i, j, k) \\ a(i, j, k) \\ b(i, j, k) \end{bmatrix},$$

where 0 ≤ a ≤ 100, 0 ≤ L ≤ 256, 0 ≤ b ≤ 256, and $\hat{c} = [\hat{L}\ \hat{a}\ \hat{b}]^{T}$ is the unit vector in L*a*b color space.
For HSV color space (13) can be written as: