Local Spherical Harmonics for Facial Shape and Albedo Estimation

In this paper, we present a novel facial albedo and 3D shape recovery method with a local spherical harmonic illumination model. From a face in an image, the proposed method can produce a high-quality 3D shape and albedo using a novel parameterization of local illuminations. Because a facial shape is partially convex, a single spherical harmonics is generally used for the illumination of a face within a constrained illumination environment. However, when a facial image is captured in an unconstrained scene, the illumination is inappropriately estimated due to the presence of shadow and specular reflections. To address this issue, we propose a novel local spherical harmonic illumination model for representing the illumination of a face. Unlike the existing parameterization of local illumination, our local spherical harmonic illumination model utilizes a smooth weight function for the seamless representation of natural illumination. Therefore, the albedo and shape information in an image can be precisely estimated using the first-order spherical harmonics. For accurate estimation of albedo, we also utilize facial albedo statistics to prevent the estimated albedo from becoming biased toward input image. Furthermore, we developed an accurate and reliable 3D shape reconstruction method from a normal map based on tetrahedron-based deformation. Comparing to the Laplacian deformation based method, our method is applicable to any mesh regardless of its structure. Through rigorous experiments, we demonstrate that the proposed local spherical harmonic illumination model is effective in estimating the complex illumination and can recover a high-quality facial albedo and 3D shape.


I. INTRODUCTION
The analysis of facial geometry and appearance is a classical problem and its applications are related to many computer vision and graphics tasks such as face recognition [1], pose estimation [2]- [4], and facial animation [5]. 3D face reconstruction, which is the process of inferring the 3D geometry of a human face from 2D images, is the most very fundamental core that powers those applications.
Among many studies of 3D face reconstruction, estimating 3D geometry of a human face with a single image is particularly challenging because of the loss of depth information caused by camera projection. Therefore, many of the recent studies in the computer vision field have focused on approximating the 3D geometry in a single image using prior The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei . knowledge regarding the target object [6]. A well-known prior model for face reconstruction is the 3D Morphable Model (3DMM), which is a statistical model of facial shape and albedo [7]. Several studies have attempted to reconstruct the 3D face model from a single image by optimizing the parameters of 3DMM [8]- [10] or utilizing the deeplearning method [11], [12]. However, such methods can only reconstruct the smooth parts of facial geometry and fail to reconstruct the high-frequency details such as beard, pore, and wrinkles. In recent years, several methods have been proposed to recover high-frequency details from the smooth geometry by using the shape from shading (SfS) technique [13], [14]. This technique is widely adopted for computing the 3D geometry of highly curved regions based on the illumination and reflectance model for shading variation [15], [16]. By assuming that a face can be represented by pure diffuse reflectance model, most of studies have focused on Lambertian reflectance in conjunction with second-order spherical harmonics (SH) for illumination model [9], [17], [18]. This approach is effective for images taken under controlled illumination conditions. However, for the face images with complex illumination, the SfS process with single secondorder SH may fail to reconstruct locations containing effects that cannot be modeled accurately.
Although the second-order SH is proven to be robust for Lambertian diffuse surfaces [19], facial images in natural scenes contain various lighting effects such as cast shadow, specular, and indirect illumination. For specular illumination, one way to model such effects is to adopt a non-Lambertian reflectance model such as the dichromatic model, with a higher-order SH basis [20]. Although such a model is useful for estimating surface normals of a reflective object with a strictly constrained setup [21], the estimation for a single image is difficult because it is a highly under-determined problem. Additionally, to model cast shadows and interreflection, a computationally expensive algorithm such as ray-tracing [22], is required. In early works on facial shape recovery, several researchers attempted to reduce the complex lighting effects by using an intensity map called corrective field [17], [23]. This map is introduced to reduce the shading error on a wide shadow region on a face, but not small shadow regions, such as a nasal cavity. Moreover, shading properties are not fully considered, resulting in the filtering shadow regions during the process of recovering albedo and shape. Recently, a global and local SH model for scene depth and reflectance estimation for single image is proposed [24]. Local SH captures partial illumination that compensates for residual discrepancies between image and global illumination. In [24], the effective range of each local SH was defined on a particular rectangular grid. This strategy produces fine approximations of the illumination variation in a scene with a set of small objects. However, for an image with spatially varying appearances, this method can produce aliasing artifacts near the boundary of the grid.
Motivated by these previous works, we address the problem of complex illumination for facial 3D shape and albedo recovery from a single image using a novel local SH illumination model. Because the illumination on a face varies depending on facial structure and lighting environments, local SH illumination with a rectangular grid fails to model regions with complex illumination effects. Rather than using a rectangular grid, we propose a more flexible local SH illumination model by clustering the pixels based on its luminance cues. Based on these clustered pixels, the shape and local contribution of each local SH illumination is computed using a local weight function that accounts for smooth variations around the cluster. The aliasing artifact caused by rectangular grids can be resolved because of the property of signal interpolation between local weight functions. On the leverage of the proposed illumination model, we propose an optimization framework for recovering facial albedo and 3D shape information from a single image. By using a prior face model initialized by the existing fitting method, we obtain illumination, albedo, normal, and 3D shape through an alternating optimization scheme. We observe that the proposed local SH illumination is able to model complex illumination efficiently by using a low order basis. Therefore, by using the first-order basis, we can linearize each problem while retaining the advantage of local SH illumination. To this end, we develop a simple but robust method for reconstructing a 3D facial shape by using tetrahedron-based deformation technique with triangle normals. Compared to the traditional approach using surface Laplacian, we show that this technique can reliably reconstruct a 3D shape regardless of a mesh structure of the prior 3D shape. An example of the proposed method compared to the single SH is presented in Fig. 1. Our main contributions can be summarized as follows: • We propose a robust and efficient optimization framework for estimating facial albedo and 3D shape from single image based on a simple but effective local SH illumination model for natural scenes.
• A local SH illumination model is constructed upon the smooth and flexible weight function based on superpixel clustering. This function smoothly interpolates the local illuminations of its surrounding regions.
• Unlike the traditional SfS method which recovers a 3D shape by optimizing the heightmap or surface Laplacian from computed normal map, we adopt a tetrahedron based deformation method that can be efficiently solved by linear optimization. The remainder of this paper is organized as follows. In Section II, several studies related to the proposed method are briefly reviewed. In Section III, we introduce the proposed local SH illumination model. We then present an efficient framework to recover facial shape and albedo in Section IV. Following the comprehensive experiments in Section V, the final conclusion is summarized in Section VI. VOLUME 8, 2020

II. RELATED WORKS
Because inferring facial albedo, and 3D geometry from a single image is an ill-posed problem, most studies have utilized prior models of a face such as the 3DMM [7]. By using the 3DMM, there are several approaches for reconstructing the 3D face model, which can be categorized into inverse rendering [8], SfS technique combined with 3DMM fitting [25], and end-to-end frameworks based on deep learning.
Inverse rendering aims to recover shape, illumination, and reflectance information from a single image or set of images. In particular, for 3D face reconstruction, the lowdimensional parameters for shape and albedo of the 3DMM are estimated based on a specific reflectance model. Blanz and Vetter [8] reconstructed a 3D face model using the Phong reflectance model with weak specular assumption. To incorporate complex illumination, a shading model represented by second-order SH was proposed [9]. Additionally, some recent works have attempted to merge the from-the-studio texture with in-the-wild texture [1], [10]. While the traditional 3DMM was constructed from 200 face scans, 9,663 distinct faces were used to generate the large-scale face model [26]. In [27], the linear representation of the 3DMM was transformed into a nonlinear representation using a deep neural network. However, this technique cannot reconstruct the fine details of facial geometry, such as deep wrinkles, because such models are constructed based upon the generalization of facial appearance.
SfS, which is a technique for inferring the normal map of a given image, has been widely adopted for enhancing high-frequency details with an initial estimate of a 3D model. After using various methods to compute an initial model, such as multi-view stereo [15], depth [16], or template-based methods [13], the surface normal, illumination, and albedo are inferred to reconstruct a point cloud and corresponding mesh. For human faces, SfS-based methods can successfully recover 3D geometry from a set of unconstrained [14] or studio level [13], [28] images. In [14], a 3D surface corresponding to computed point normals was generated by deforming a template shape using the Laplacian surface deformation technique [29]. Additionally, outlier pixels were filtered by using a structural similarity index (SSiM) [30] to avoid inaccurate albedo estimation. For single-image cases, the most common approach is to compute a depth map iteratively by updating albedo, illumination, and normal maps [25]. Biswas et al. [31] proposed the linear minimum mean squared error estimation for albedo in an image. Their reported results demonstrated that a linear least square process conjoint with error statistics computed by using 3DMM can generate an albedo that is visually similar to the true albedo. Another approach is to compute albedo by estimating coefficients for the principal components of a statistical albedo model constructed from hundreds of scan data [17]. To reduce the effects of complex illumination effects, such as cast shadows and specular highlights, [23] applied a corrective field to their illumination model. Most of the aforementioned works have made efforts to attenuate the effects of facial regions with complex illumination effects.
Deep-learning based methods are the most preferred approach for end-to-end inference of 3D geometry from a single image [32]. [33] proposed a convolutional neural network (CNN) to decompose facial images into albedo, shading, and normal maps. Because of a lack of ground-truth data, their network was trained using both real and synthetic data generated by using the 3DMM. To compensate for the domain discrepancies between real and synthetic data, photometric rendering loss with single second-order SH was adopted. Although this method is robust to complex illumination effects because of the generalization capabilities of CNNs, the reconstructed facial albedo maps were biased toward the average albedo of the 3DMM. Furthermore, the reconstructed facial geometry lacked high-frequency details. To recover high-frequency details, serial architectures consisting of a coarse-to-fine network have been proposed [27]. Recently, a UV map, which is mapping of 3D mesh to the 2D image, was used for inferring 3D displacement from 2D image [34], [35]. The latest prominent results by using a UV map combined with a CNN was proposed in [36].

III. LOCAL SH ILLUMINATION
Most methods for estimating illumination on a face assume that human skin acts as a diffuse reflector so that illumination can be modeled by single second-order SH. In [37], it was demonstrated that single second-order SH can capture the 99.2% of the illumination of a Lambertian object with distant lights. Since then, in many SfS methods for facial albedo and shape recovery, a set of coefficients of second-order SH has been used to represent illumination around a face. In such model, each pixel intensity I(x, y) in an image is expressed as the product of the facial albedo ρ(x, y) and illumination L(x, y). We can further simplify this model by factoring the spectral power of {R, G, B} channels into the albedo, meaning a vector L(x, y) can be converted into a scalar L(x, y). As a result, the pixel intensity of a face in an image can be defined as where N , l nm , and n(x, y) are the order of SH, illumination coefficients of the basis Y nm , and surface normal projected onto the pixel position (x, y), respectively. For a Lambertian object, such as a face, the global illumination is generally approximated by N = 2.

A. LOCAL SH MODEL
Although the aforementioned model is effective for modeling the global illumination of a Lambertian object, it cannot model cast shadows, self-shadows, and specular reflections, which appear in local areas of an object in natural scenes.
To model such illumination effects, a more moderate model is required because of the non-linear nature of shadows and high-frequency characteristics of specular reflections. To address this issue, we propose a simple and efficient representation for complex illumination effects using the local SH illumination model. As demonstrated in [24], the globallocal SH representation for estimating local illumination is effective for the robust estimation of illumination in an image. The authors of the aforementioned paper divided an image into rectangles so the illumination errors in each region could be compensated independently. However, because of the absence of smoothness constraint on border pixels, this approach yields aliasing artifacts. Furthermore, simply minimizing the illumination gradient between the adjacent regions would result in flat local illumination. To resolve this issue, an illumination effect on (x, y) is expressed as the sum of global illumination and the weighted contributions of adjacent local SH as follows: where L G (x, y), L k (x, y), C, and w k (x, y) represent the global illumination, k th local SH illumination, a set of clusters, and their contributions to (x, y), respectively. Note that each illumination follows the SH lighting formula in (2). Unlike in [24], we compute each region for local SH using a pixel clustering method, such as simple linear iterative clustering (SLIC) [38], in the luminance domain. As shown in Fig. 2, regions with shadows, such as the fourth row, and specular effects, such as the first and third rows, can be effectively segmented. Compared to using a rectangular grid, this approach enables more intuitive construction of local SH illumination. For each cluster region, we assign SH illumination that affects not only the area inside the region but also nearby clusters with a certain weight which is detailed in the following subsection.

B. ANISOTROPIC WEIGHTING
For an image clustered based on luminance cues, the contributions of illumination for the pixel location (x, y) from the k th cluster is expressed by a weighted value for the local SH as defined in (3). The weight should be monotonically decreasing function with distance from the centroid of the cluster. Since each cluster is not necessarily formed with ellipsoidal shape, simply using the distance between the centroid and a pixel location is not appropriate. Therefore, we define the weight w k (x, y) in terms of a Gaussian with the distance function of each cluster. The distance function φ k (x, y) computed for the cluster region k is expressed by the Euclidean distance d ((x, y), ∂ k ) between the pixel location (x, y) and boundary ∂ k of the k th cluster as follows: Then, the distance function φ k (x, y) is converted to the zero-centered distance function φ k (x, y) by subtracting minimum value of φ k (x, y). In addition to the distance function, we compute a local smoothness value ω j , which is defined as the Chi-square distance of the luminance distribution between the k th cluster and its adjacent clusters {j|j ∈ adj(k)}. The weight ω j determines the width and smoothness of the weight function in the direction of the j th adjacent cluster. This prevents w k (x, y) from affecting regions with distinctive illumination effects. As shown in Fig. 2, the adjacent cluster with a small Chi-square distance has a wider and higher weight value. Based on this distance function and local smoothness, the anisotropic weight function of the k th cluster for the pixel (x, y) for local SH is defined as where σ k is a control parameter that determines the radius of the weight. To determine σ k automatically, we adopt the concept of the full width at half maximum. Let α be a desired boundary value proportional to the maximum weight value of the cluster. Then, σ k is computed as follows: As shown in Fig. 2, excluding the region with flat illumination in the second row, the computed weights are consistent with the Gaussian and zero-centered distance function φ k (x, y) for adjacent regions with similar illumination effects, but fall off significantly for other regions. For all experiments in this paper, we empirically set α = 1.5.

IV. ALBEDO AND SHAPE RECOVERY
Because directly estimating facial shape and albedo from a single image is infeasible, we utilize the prior face information by fitting a 3DMM. In particular, the pose, expression, and identity parameters of the 3DMM are estimated from a given image with landmark points [39]. A weak perspective projection is assumed for the 3DMM fitting process.
To estimate the prior albedo, we choose initial illumination coefficients as uniform white illumination by setting the first coefficient of global SH as 2 √ π and setting all other coefficients to zero, as described in [20]. Because the mesh resolution of the model is insufficient for recovering facial details, we apply the Loop subdivision [40] to the initial shape model following initialization. Based on the face region identified by projecting an initial face model, the estimation of illumination, albedo, and normal from the image proceeds by minimizing the following data term: Because jointly optimizing the albedo, illumination coefficients, and surface normal in (8) is infeasible, we separate this problem into estimating global-local illumination coefficients L k , L G , albedo ρ, and surface normal n(x, y) iteratively. Each component is estimated by minimizing the (8) combining with proper regularization terms. At the end of each iteration, the vertex displacement with the computed normal is applied by using the tetrahedron-based mesh deformation technique [41]. To improve computational efficiency, we set N = 1 so that each optimization can be conducted using a linear leastsquares solver, such as bi-conjugate gradient stabilized solver [42]. Although higher-orders are desirable for the accurate estimation of shape and albedo, we found that our model can accurately reconstruct facial details even using the first-order SH. An overview of the facial shape and albedo recovery process is presented in Fig. 3.

A. ILLUMINATION ESTIMATION
For a given ρ(x, y) and n(x, y) computed from the initialization or previous iteration, the illumination coefficients, including global and local SH, are estimated jointly. In conjunction with E I from (8), the objective function E L for the illumination estimation step is defined as where λ L c , λ L s , and l k are the balancing weights for the regularization and smoothness terms, as well as a vector containing the illumination coefficients of (2) for the k th cluster, respectively. Because a large value for the coefficients of local SH would affect the stability of the normal estimation step, the coefficient regularization term constrains the coefficients for local SH illumination to be small. Smoothness term is introduced for enforcing the estimated illumination smooth in the spatial domain. The illumination of the input image is estimated by minimizing (9) with respect to the SH coefficients of L k and L G simultaneously.

B. ALBEDO ESTIMATION
After computing the global and local SH illumination coefficients, the albedo ρ(x, y) is computed. We incorporate additional regularization energy to enforce the smoothness of the albedo according to the uniform Laplacian [44] of mesh information of the face model. Additionally, facial albedo priors from the MERL/ETH Skin Reflectance Database [43] are adopted to prevent the computed albedo from overfitted to the input image. Unlike the definition of facial segments in this database, we merge some facial regions. Although predefined segments can be obtained using facial landmarks, this would produce an undesirable discontinuity between contiguous segments. Therefore, we utilize the definition of  Fig. 4. Finally, the objective function E ρ for albedo estimation is defined as follows: A smoothness term E s is introduced to obtain smooth albedo as follows: where N (i) and d i represent the local connectivity around the i th vertex and the normalizer, which represents the number of vertices connected to the i th vertex, respectively. The albedo prior term E p , which is used to make the estimated albedo to have higher likelihood in terms of GMM, is defined as where p(ρ i ) is GMM computed by applying the expectationmaximization method to the database. Because (12) is non-convex, we use an iterative method to relax the problem, as described in [45]. In each iteration, we compute the weighted log-likelihood of the computed albedo for the i th vertex with its adjacent vertices and derive the mixture k * with the highest likelihood. By using the chosen mixture k * with its mean µ k * and variance k * , the equation can be converted into a linear form as follows: In our experiments, five iterations were sufficient to achieve convergence. It should be noted that (11) is similar to the albedo regularization term proposed in [25], except that regularization is applied in the vertex domain. Because the albedo is estimated on the entire vertex, our formula can generate smooth albedo across an entire face model. This is particularly beneficial when visible triangles are updated during the iteration by z-buffering.

C. NORMAL ESTIMATION
After both the global and local SH illumination coefficients and albedo are estimated, we can compute the triangle normal n(x, y). Because solely minimizing E I is an under-constrained problem, the normal stabilization constraint defined as the difference between the estimated normal n o (x, y) in the previous iteration and current normal n(x, y) is applied to the objective function. Additionally, the smoothness of the estimated normals is penalized by adding a gradient term. The normal at each location (x, y) is then computed using weights λ n o and λ n s by minimizing the following energy formula: Note that, a mapping between the triangle of shape model to the location (x, y) at each iteration is computed by using zbuffering. Since we use only first-order SH, minimizing (14) is a linear least-squares problem that can be solved efficiently.

D. 3D SHAPE RECOVERY
In most SfS methods for single-view facial shape recovery, the surface is reconstructed by computing a heightmap for the computed normal map. Such methods compute a depth value for each pixel location with assistance from a prior face model [17], [25]. However, these methods take a lot of time to converge, because these are formulated in a non-linear least-square problem. Roth et al. [14] proposed a linear SfS method based on mesh deformation using a surface Laplacian. Although this method was developed for unconstrained images, it is applicable to single-image face reconstruction with an appropriate regularizer. However, it requires suitable re-meshing prior to optimization because the surface Laplacian can be computed incorrectly around flat surfaces due to the precision error of the processor. Since the surface Laplacian is computed as the sum of the differences between a vertex and its adjacent vertices, if the vertex lies on the flat surface, the computed direction of Laplacian should be a zero-valued vector. In practice, due to limited precision, the Laplacian around a flat surface is a vector that has a magnitude larger than zero in a random direction. Therefore, the computed shape by using the surface Laplacian would be severely distorted. VOLUME 8, 2020 FIGURE 5. Comparison of our deformation method with [14]. Without adapting the mesh structure, tetrahedron-based deformation produces accurate results, whereas the output of the Laplacian-based method is severely distorted.
To address this issue, we propose a tetrahedron-based deformation technique with triangle normals for the SfS process. Once the normal at (x, y) is computed as in the previous subsection, the 3D facial shape X is recovered by computing the rotation of triangles of current facial shape X c . For each triangle, the target rotation matrix R j ∈ SO(3) between the current normal n j c and target normal n j of the j th triangle is computed. The target normal can be computed easily by backprojecting the 3D facial shape onto an image. The rotation between two vectors can be expressed by a tetrahedron representation [41] where x j 4 = x 1 + n j is an additional vertex introduced to define the tetrahedron. By using (15), the rotation matrixR j between the current V j c and target (V j ) −1 for the j th triangle can be computed asR By using this representation, surface deformation can be performed by minimizing the following objective function E d : where B is the set of boundary vertices of the 3D facial shape. Because the first term in (17) can be converted into a system of linear equations, we can solve this minimization problem using the bi-conjugate gradient stabilized solver [42]. Fig. 5 presents an example of recovering 3D shape for input shape and target normal map by using the method proposed in [14] without re-meshing and tetrahedron-based method. Compared to the method proposed in [14], which stretches or shrinks flat regions, such as the cheek and forehead, our method reliably and accurately recovers the target shape from the given normal map.

V. EXPERIMENTS
In this section, we present comprehensive experimental results. As an initial face model, we adopted the Basel face model [46], which is a widely used statistical parametric model for the shape and albedo of a face. The initial model with its projection parameters, including scale, translation, and rotation, was estimated using the 3DMM fitting method described in [4] with detected landmarks from [39]. Our method was tested using several public datasets and high-resolution photos from the internet. In all experiments, the hyperparameters λ L c , λ L s , λ p , λ s , λ n o , and λ n s are set to 1, 2, 1, 5, 0.1, and 0.1 respectively.

A. ALBEDO RECOVERY
To demonstrate the effectiveness of the proposed lighting model, we tested our albedo recovery method, which was described in Section IV-B, on the YaleB face database [47]. The YaleB face database contains images of faces captured under different lighting conditions from a fixed viewing angle. For each subject, we computed a ground truth albedo using a generic photometric stereo technique based on the given ambient intensity image and direction of a light source for each image.
For the quantitative measure of estimated albedo and shading, we adopted three different error metrics. Reflectance/shading mean squared error (RS-MSE) is an error metric introduced in [48] that measures locally scaleinvariant error for both reflectance and shading with values between zero and one. For the estimated gray albedo ρ, illumination (shading) L, and corresponding ground truth ρ * , L * , the error is computed by summing the scale-invariant We also computed the SSiM [30], which is a measurement of the perceptual quality error between two images. This metric is sensitive to the local distortion of reconstructed albedo and shading. A high value of the SSiM indicates that a reconstructed signal is perceptually similar to the ground truth. Note that, the resulting SSiM values were calculated by averaging the SSiM values for all pixels in an image. The SSiM values for the estimated albedo and shading are denoted as A-SSiM and S-SSiM, respectively. To assess the effects of local SH with varying orders, we compared estimation results for albedo. We set the initial number of clusters to K = 100 for SLIC segmentation in the luminance domain. For orders of N = 1, 2, 4, 8, the results for RS-MSE, A-SSiM, and S-SSiM are listed in Table 1. Compared to the generic representation of illumination with N = 2 and K = 1, the proposed representation produces better results in terms of the error metrics. However, for RS-MSE and A-SSiM, error increases with an order. Because the proposed method estimates illumination in local regions, rather than an entire image, the results with higher-order SH could be biased to the appearance of local clusters, resulting in distortion of the recovered albedo. For visual compari-son, estimation results for albedo and shading are presented in Fig. 6. Note that K = 1 corresponds to the model without local SH (i.e., global illumination only). One can see that there are no visible differences between the estimated albedo with varying orders, except for the result of global illumination with K = 1. Although there are no significant differences with increasing order, the specular region around the nose tip is diminished with the proposed method. These results serve as experimental proof for the validity of using first-order SH for local illumination, as discussed in Section IV. Based on the observation that local SH of the first order provides sufficient quality for the estimated albedo, VOLUME 8, 2020 FIGURE 8. Comparison of results for 3D facial shape recovery from the Bosphorus database. We compare our proposed method to the methods proposed by Jiang et al. [17], Deng et al. [49], and Chen et al. [36]. Note that ground truth was constructed by meshing the noisy depth map provided in the database.
we present some example results with N = 1 in Fig. 7. For albedo estimation with single SH with K = 1, the results reveal albedo highlights stronger than the true albedo around the specular regions, such as the nose tip and forehead. In contrast, our method can capture the specular regions more accurately, so specular highlights are suppressed compared to the single SH case. Moreover, regions with selfshadow can be effectively handled by our method. Except for the regions which are not under-saturated, the proposed method can recover the albedo accurately in softly shadowed regions. Even the hard shadow regions can be approximated using the albedo prior constraint introduced in (13). The averages and standard deviations of RS-MSE, A-SSiM, and S-SSiM are listed in Table 2. As the number of divisions increases, better estimation results for albedo and shading are obtained. It seems that the increment of each SSiM value or decrement of RS-MSE with respect to increasing K saturated after K = 100. We found that the optimal number of clusters depends on the resolution of an input image. For our experiments on HD images, local SH with K = 300 are sufficient for producing acceptable results. Results of the proposed albedo and shape recovery method for faces under natural illumination. Because the subject in the last row was captured in an indoor environment, the estimated shading is darker than that in the other images.

B. 3D SHAPE RECOVERY
We tested the entire proposed process from illumination estimation to 3D face shape recovery on images captured under constrained illumination and natural images with N = 1 and the number of clusters K = 300. For images with constrained illumination, we sampled several images from the Bosphorus database [50], which provides structured-light scanned 3D point clouds of the faces of 105 subjects, as well VOLUME 8, 2020 as corresponding RGB images. Note that, for the SfS-based method without any metric information, such as the true depth information of several points, error estimation using MSE is unconvincing. Furthermore, the scanned point clouds in the database are relatively noisy, making the quantitative measurement of real data impractical. Therefore, we present qualitative results for shape recovery with high-frequency details and exclude quantitative measures. Fig. 8 presents comparative results for several images from the Bosphorus database for various methods. All of the compared methods are constructed using a single SH for modeling the illumination. The method from [17] is an SfS method that optimizes the height (depth) map using second-order SH. Because this method uses albedo statistics from 3DMM texture directly, the results exhibit over-smoothen geometry. It seems that the smoothness constraint is over-weighted because of the inaccurate estimation of albedo. The other two results were generated using pre-trained deep-learning models. While the method from [49] estimates the shape of a face by inferring the shape parameters of a 3DMM, a detailed facial shape with a UV displacement map is generated in [36]. Although our method is based on iterative linear optimization, the quality of the recovered 3D shapes is comparable to that of the latest methods based on deep-learning.
We conducted further experiments on the photos taken under natural illumination. The resulting shading, albedo, and normal, as well as the recovered 3D shape of faces, are presented in Fig. 9. One can see that the proposed local SH model can effectively handle natural illumination. Therefore, plausible results for albedo and normal, as well as detailed facial geometry, such as wrinkles around the eye and forehead or beards, can be obtained.

VI. CONCLUSION
In this paper, we presented a novel 3D facial shape recovery framework for a single image with a novel parameterization of local SH illumination. The proposed local SH illumination model with an anisotropic weighting strategy allows facial albedo and 3D shape recovery to enhance the details of a face while preventing noisy deformation. Additionally, shape recovery using tetrahedron-based deformation facilitates reliable and accurate reconstruction for SfS with a prior face model. Experimental results demonstrated that our method is capable of recovering plausible facial albedo and 3D shape with enhanced details compared to recent state-of-the-art methods.
Nevertheless, the proposed local SH method is proven to be robust to complex illumination in natural scenes, identifying local regions by performing clustering based on pixel luminance may yield poor results. Because each local region is computed based on its luminance discrepancies, the resulting segmentation can produce regions containing rough geometry. Therefore, the geometry of the face should be incorporated in determining local clusters. This can be accomplished by incorporating the geometrical information of the initial shape using a graph-based optimization technique. Also, our method can be combined with a statistical framework for physically based facial skin reflectance, such as the framework described in [51]. In future work, the proposed local SH model could be extended by accounting for these considerations.