A Selective Expression Manipulation With Parametric 3D Facial Model

This paper proposes a novel method to represent expressive 3D facial shapes called the Selective Expression Manipulation (SEM) by fitting the expression coefficients of delta-blendshapes, which is the standard parametric facial model widely used in industries. SEM focuses on preserving blendshape semantics to characterize the facial shapes since the facial shape obtained by minimizing the distance to sparse facial landmarks might fail to signify a facial expression from a human being’s perspective. Assuming each delta-blendshape corresponds to a facial movement with semantic meaning, SEM finds a series of facial motions required to compose the target facial expression. In addition, SEM sequentially determines a sufficient number of expressions and coefficients closely resembling the target facial movements by introducing similarities to quantify the directional correlation of facial motions between a target and a blendshape, excluding redundant expressions in terms of motions from the neutral shape. As a result, far fewer inter-correlated expressions that significantly increase the target correlation can be obtained. Furthermore, SEM exhibits substantial improvement in accuracy, correlation, semantics, and stability in experiments over previous facial fitting schemes and state-of-the-art methods. It is demonstrated that SEM enables accurate and realistic 3D facial shape generation by semantically manipulating expression delta-blendshapes.

global-blendshape and delta-blendshape. In the global-blendshape representation, as shown in Figure 1(a), a facial shape is generated in the span of the "whole"-facial blendshapes [16]. This approach ensures that the generated shape lies within a valid range of expressions, resulting in stable shapes and preventing unexpected expressions. Nevertheless, the concurrent representation of several expression models dilutes the expression of each model in the global-blendshape. To overcome this drawback, many researchers [17], [18], [19] have used the "delta"-blendshape representation along with expression blendshapes constructed according to the Facial Action Coding System (FACS) [20]. In the delta-blendshape representation, as shown in Figure 1(b), a facial shape is characterized in the span of the delta-shapes, which are the differences between each expression blendshape and the neutral expression blendshape. The delta-blendshape representation enables facial shapes to be described with a wide variety of facial expressions. However, combinations of these expression blendshapes do not always preserve the semantic meaning of the original expressions, as depicted in Figure 1(b), because they use many redundant expressions with high inter-correlations. When expression delta-blendshapes are fitted to the point clouds or facial landmarks of a target, the expressions of the blendshapes may be locally overshot to bring the facial shape closer to the fitting points. For these reasons, a combination of highly correlated blendshapes can cause an exaggerated and unstable facial shape. Thus, performance degradation is inevitable.
In this paper, we propose a robust selective fitting approach for expression delta-blendshapes, called Selective Expression Manipulation (SEM). We assume that the expression blendshapes do not semantically duplicate one another. In other words, each expression blendshape is semantically unique and has a distinct expression. The key idea is that each expression delta-blendshape is considered a facial motion, namely the difference from the neutral expression blendshape. From this idea, we suppose that an expressive target face is formed by taking a series of facial motions from the expressionless or non-posed face (i.e., the neutral expression face), as described in Figure 2. We can accumulate expression delta-blendshapes without losing facial expressiveness by composing facial motions that rarely correlate with each other. Thus, the proposed method obtains a set of expression delta-blendshapes and their coefficients that preserves the semantic meanings of the facial expressions by seeking the set whose members are least correlated with each other.
The expression selection and exclusion methods are proposed for SEM to highly correlate a combination of the expression delta-blendshapes with the target motion while decreasing the inter-correlations of the expression delta-blendshapes. Two measurements, the expressional and relative similarities, are introduced to quantify the directional similarity of facial motions between the target and the expression delta-blendshape. Based on these similarity measures, SEM obtains expressions that accurately represent the target's facial motions in the order of expressional similarity. The relative similarity allows SEM to compute the appropriate coefficients of the expression delta-blendshapes. SEM selects the expression with the highest similarity among the expression delta-blendshape candidates to update the facial shape in a greedy manner [22], [23]. Meanwhile, SEM excludes redundant expressions from the remaining candidates, in a backtracking manner [24], [25], so that the target face is represented with less inter-correlated expressions and the selected expressions form a series of motions from the initial shape. We prove that our greedy formulation, which iteratively synthesizes a facial shape with the selected expression delta-blendshapes, yields the shape that is most motionally correlated with the target. We conducted exhaustive experiments to evaluate the accuracy, the redundancy among expression delta-blendshapes, uniqueness, and semantics, including comparisons with state-of-the-art methods and with baseline methods that cover the previous facial fitting schemes. These experiments demonstrate that SEM finds appropriate expressions to uniquely compose the expressive target shape by manipulating expression delta-blendshapes semantically, enabling accurate and realistic facial shape generation.
In summary, we propose a selective expression delta-blendshape manipulation scheme, in which
• Expressional and relative similarities are introduced on the target shape and the expression delta-blendshapes to measure motional correlations under the assumption that the facial shape is generated by the motions from the neutral shape,
• Expression selection of the most similar expressions is carried out in a greedy manner to address the expression fitting as finding the highest correlated motion to the target, and
• Expression exclusion of redundant expressions is performed in a single-step backtracking manner to represent the expressive target face with a combination of less inter-correlated expressions, enabling SEM to generate realistic facial shapes with significantly increased semantics and uniqueness.

VOLUME 11, 2023

FIGURE 2. The proposed method selectively picks delta-blendshapes as a series of facial motions to semantically reveal a target expression under the assumption that each delta-blendshape is a facial movement needed to compose the target expression. Fitting the blendshapes for the given target points (marked by "x" on the face) by stacking the less inter-correlated expressions facilitates the representation of the facial shape semantically and expressively. The image was obtained from the CK+ database (©Jeffrey Cohn) [21].

II. PREVIOUS WORKS
Many other efforts have been made to overcome the drawbacks of the delta-blendshape approaches. In studies by several authors [26], [27], [28], [29], the $L_1$ norm of the blendshape weights has been used to achieve sparse weight activations, resulting in a smaller number of expressions. Also, $L_2$ regularization has been used in several works [30], [31], [32], [33], [34] to prevent over-fitting of the expression models to the target face. Although these norm-based regularizations are useful techniques to prevent the expression of the facial shape from being exaggerated, they can make the facial shape less expressive than the target face. This trade-off between shape stability and expressiveness is highly data-dependent, so the performance might be very sensitive to the individual and their expression, even when the same regularization definition is used. Simply increasing the sparsity of the expression coefficients (e.g., by penalizing the number of expressions used in the fit without considering the correlations between the expressions) does not help the facial shape become expressive and interpretable. Moreover, norm-based regularizations do not flexibly handle the semantic meaning of expression models. Some methods used different types of prior constraints to regularize the blendshape expressions [35], [36], [37], [38], [39]. While these approaches can perform well, their quality depends heavily on the priors, which are selected manually or require a sufficient number of training examples. Nevertheless, these types of regularization have not been guaranteed as a fundamental solution to overcome the drawbacks of the delta-blendshape, as they still use whole-expression models, which are highly correlated with each other.
Indeed, we found that using a selected set of expressions that are rarely correlated with each other can describe a target's face more expressively, while preserving uniqueness and interpretability by maintaining the semantics of each expression blendshape.
Parametric facial models, such as the 3D Morphable Model (3DMM) [30], [40], represent a facial shape on a low-dimensional face subspace by using Principal Component Analysis (PCA). In recent works [33], [41], [42], [43], [44], [45], [46], [47], [48], facial geometry and reflectance have been estimated using the parametric facial models, but the baseline is on a neutral target face and ignores the target expression. In other works [49], [50], [51], [52], [53], [54], [55], an additional basis was used to model expressions parametrically, and in [30], both facial appearance and expression variations of in-the-wild facial shapes were modeled concurrently on a basis model. Although decomposition methods such as PCA, Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA) help obtain orthogonal basis vectors, a dimension transfer makes the basis vectors of the facial shape difficult to interpret and apply [16]. The decomposition alters the basis vectors, which have been commonly modeled on the basis of the facial unit actions [20], from local to global deformations. Although decomposition enables the deformation to be modeled globally by capturing correlations across faces in the database used for training the model, global deformations depend on the statistics of the training set. They tend to require a sufficient number of coefficients to represent high-frequency details of the facial shape. Some works have shown that the data variance of facial movement is not jointly Gaussian and thus PCA is not sufficient to model expression variation [16], [56], [57]. Statistical priors are therefore required to fit the target more stably [58], [59]. For these reasons, expression blendshapes without decomposition have been used to manipulate the blendshape expression coefficients in many related works [8], [9], [27], [28], [36], [60], [61], [62], [63].
Recently, Kang et al. proposed a facial fitting method that uses a subset of the expressions in the whole expression set [64]. This method defines a metric to measure expressional redundancy between facial delta-blendshapes and uses the metric to find a subset of delta-blendshapes with fewer redundancies. Subsequently, the subset is used to generate a 3D facial shape by minimizing the distance between the 3D face and facial landmark points via gradient descent. Although this method is similar to the proposed method in that it uses a subset of less-correlated facial expressions, it does not sufficiently consider cancellations between expressional motions. Thus, while this method can efficiently address the "redundancy" between similar-shaped delta-blendshapes, allowing it to avoid faces with exaggerated expressions, it cannot describe fine facial details made from composites of completely different expressions. Also, the expression selection scheme was heuristically defined in [64]. In contrast, the proposed method addresses the expressional redundancy and the facial shape details simultaneously by introducing two facial similarity metrics. The expression most similar to the target can be found with the expressional similarity. Subsequently, the appropriate expression coefficient is obtained with the relative similarity.
In addition, it is demonstrated mathematically that the greedy selections in each iteration can lead to maximizing the expression similarity to the target from the global perspective.

III. EXPRESSION BLENDSHAPE FITTING
We briefly review delta-blendshape fitting to define our notation and explain the method clearly. A facial mesh or blendshape is a set of vertices denoted by a capital letter (e.g., $P$ or $F$), and its component vectors are denoted by bold lowercase letters (e.g., $\mathbf{p}_i$ or $\mathbf{f}_{1,i}$). A facial shape $B$ is represented as a weighted combination of $n_{exp}$ expression blendshapes $B_i$ with weights $e_i$, i.e., $B = \sum_{i=1}^{n_{exp}} e_i B_i$ [16], [39]. In the delta-blendshape approach, the facial shape is based on the following constraints on $B$:
• $B_1$ is a neutral facial shape.
• All the non-neutral facial weights are bounded between 0 and 1; that is, $0 \le e_i \le 1$ for $2 \le i \le n_{exp}$.
• The sum of all the weights is 1, so that $e_1 = 1 - \sum_{i=2}^{n_{exp}} e_i$.
By the above constraints, the blendshape mesh $B$ can be represented in terms of the displacements from the neutral face, the delta-blendshapes $\Delta B_i = B_i - B_1$, as follows:
$$B = B_1 + \sum_{i=2}^{n_{exp}} e_i \, \Delta B_i. \tag{1}$$
Thus, $(n_{exp} - 1)$ coefficients need to be determined to describe a target facial expression. For simplicity, we use $e$ to refer to all the expression coefficients except for the neutral coefficient in the following equations and discussions. A facial mesh $F$ can be represented by applying a rotation and a translation to the facial model in (1) as follows:
$$F = R \left( B_1 + \sum_{i=2}^{n_{exp}} e_i \, \Delta B_i \right) + \mathbf{t}, \tag{2}$$
where $R \in \mathbb{R}^{3 \times 3}$ is a rotation matrix and $\mathbf{t} \in \mathbb{R}^{3}$ is a translation vector, both applied to each vertex.
Fitting an expression model is a procedure for finding an optimal $e$, whereby a facial shape is generated to reveal the target's facial expression most accurately. The blendshape's expression weights are approximated by minimizing the distances between the target's facial features and the corresponding points on the 3D facial blendshape. In previous works [27], [39], 3D facial point clouds captured by 3D laser scanners or depth sensors such as Microsoft Kinect have been used to fit expression blendshapes to a target. Some other works [8], [18], [28], [60] have utilized sparse facial landmarks on an image jointly with 3D point clouds to match the target's expression more precisely, or have used landmarks alone to fit the blendshapes for more general applicability.
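The delta-blendshape representation in (1) and the rigid transform in (2) can be illustrated with a minimal NumPy sketch. The array layout `(n_exp, n_ver, 3)`, the function names, and the random toy model are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def delta_blendshapes(blendshapes):
    """Split a stack (n_exp, n_ver, 3) into the neutral shape B_1 and deltas B_i - B_1."""
    neutral = blendshapes[0]
    return neutral, blendshapes[1:] - neutral

def synthesize(neutral, deltas, e, R=None, t=None):
    """Evaluate B = B_1 + sum_i e_i * dB_i as in (1), then F = R B + t per vertex as in (2)."""
    shape = neutral + np.tensordot(e, deltas, axes=1)  # (n_ver, 3)
    R = np.eye(3) if R is None else R
    t = np.zeros(3) if t is None else t
    return shape @ R.T + t  # rotate and translate every vertex

# Hypothetical toy model: a neutral shape plus two expressions on four vertices.
rng = np.random.default_rng(0)
toy_blendshapes = rng.normal(size=(3, 4, 3))
toy_neutral, toy_deltas = delta_blendshapes(toy_blendshapes)

# A coefficient of 1 on the first expression reproduces that expression blendshape.
toy_face = synthesize(toy_neutral, toy_deltas, np.array([1.0, 0.0]))
```

With all coefficients set to zero the sketch returns the neutral shape, which mirrors the constraint that the neutral weight absorbs whatever the expression weights do not use.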
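The energy term $E_{pnt}$ to be minimized for point cloud matching can be defined as follows:
$$E_{pnt} = \sum_{j=1}^{n_{ver}} \left\| \mathbf{f}_j - \mathbf{p}_j \right\|_2^2, \tag{3}$$
where $\mathbf{f}_j \in \mathbb{R}^3$ is the $j$-th vertex of the facial mesh $F$, $\mathbf{p}_j \in \mathbb{R}^3$ is the closest point to $\mathbf{f}_j$ among the target point clouds, and $n_{ver}$ is the number of vertices of the facial mesh. The energy term $E_{fea}$ for the landmark matching is defined as follows:
$$E_{fea} = \sum_{k=1}^{n_{fea}} \left\| \Pi(\mathbf{f}_{v_k}) - \mathbf{l}_k \right\|_2^2, \tag{4}$$
where $v_k$ is the vertex index of the facial mesh $F$ corresponding to the $k$-th two-dimensional (2D) facial landmark $\mathbf{l}_k \in \mathbb{R}^2$, $\Pi(\cdot) : \mathbb{R}^3 \to \mathbb{R}^2$ is a perspective projection operator, and $n_{fea}$ is the number of facial landmarks utilized for the fitting.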
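As a rough illustration of the two data terms, the sketch below evaluates a point-cloud energy with brute-force closest points and a landmark energy with a pinhole projection. The squared Euclidean distance, the pinhole camera with a single focal length, and all function names are assumptions for this example; the paper does not specify the camera model here.

```python
import numpy as np

def project(points, focal=1.0):
    """Assumed pinhole perspective projection R^3 -> R^2 (requires z != 0)."""
    return focal * points[:, :2] / points[:, 2:3]

def closest_points(vertices, cloud):
    """For each mesh vertex, return the nearest point of the target cloud (brute force)."""
    d2 = ((vertices[:, None, :] - cloud[None, :, :]) ** 2).sum(axis=-1)
    return cloud[d2.argmin(axis=1)]

def energy_pnt(vertices, cloud):
    """Sum of squared distances between vertices and their closest cloud points."""
    return float(((vertices - closest_points(vertices, cloud)) ** 2).sum())

def energy_fea(vertices, landmark_idx, landmarks_2d, focal=1.0):
    """Sum of squared 2D distances between projected landmark vertices and landmarks."""
    proj = project(vertices[landmark_idx], focal)
    return float(((proj - landmarks_2d) ** 2).sum())

# Hypothetical toy data: three vertices in front of the camera.
toy_vertices = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0], [0.0, 1.0, 2.0]])
toy_cloud = toy_vertices + np.array([0.1, 0.0, 0.0])  # cloud shifted by 0.1 along x
toy_idx = np.array([0, 2])
toy_landmarks = project(toy_vertices[toy_idx])
```

The brute-force search is quadratic in the number of points; a spatial index would be the usual choice for real scans.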
In order to avoid over-fitting noisy points and getting stuck in local minima, an additional energy term is essential to regularize the expression coefficients or the facial shape. Several regularization terms have been used on the basis of the norms of the coefficients and the types of shape priors, which are defined on the basis of the probabilistic distributions of facial shapes in the training set [37], [39]. The $L_1$-norm and the $L_2$-norm of the delta-blendshape coefficients [30], [31], [32] are the regularization methods widely used for facial fitting to form stable and natural facial shapes in the delta-blendshape representation by controlling the concurrent activations of the delta expressions. The $L_1$- and $L_2$-coefficient regularization terms can be defined as follows:
$$E_{L_1} = \| e \|_1 = \sum_{i=2}^{n_{exp}} | e_i |, \qquad E_{L_2} = \| e \|_2^2 = \sum_{i=2}^{n_{exp}} e_i^2. \tag{5}$$
In the delta-blendshape fitting methods, the optimal expression coefficients $e^*$ can be obtained by minimizing the total fitting energy $E_{fit}$ as follows:
$$e^* = \arg\min_{e} E_{fit}, \tag{6}$$
where
$$E_{fit} = \omega_{pnt} E_{pnt} + \omega_{fea} E_{fea} + \omega_{L_1} E_{L_1} + \omega_{L_2} E_{L_2}, \tag{7}$$
and $\omega_{pnt}$, $\omega_{fea}$, $\omega_{L_1}$, and $\omega_{L_2}$ are the constants that balance the energy terms. In practice, the geometric parameters $R$ and $\mathbf{t}$ in (2) need to be determined in advance to solve the problem in (6). This is achieved by optimizing jointly or alternately over $R$, $\mathbf{t}$, and $e$ until the expression coefficients $e$ converge to minimize the fitting energy $E_{fit}$ in (7).
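For intuition about the regularized fit, the following sketch solves a simplified special case in closed form: only an $L_2$ penalty, known vertex correspondences to a target mesh, and an identity pose. These simplifications, the function name `fit_l2`, and the final clipping to $[0, 1]$ (a heuristic, not an exact constrained solution) are assumptions for illustration and do not reproduce any particular method cited above.

```python
import numpy as np

def fit_l2(neutral, deltas, target, weight_l2=1e-3):
    """Ridge fit: minimize ||target - (B_1 + sum e_i dB_i)||^2 + weight_l2 * ||e||^2."""
    A = deltas.reshape(len(deltas), -1).T      # columns are flattened delta-blendshapes
    b = (target - neutral).ravel()             # target motion from the neutral shape
    gram = A.T @ A + weight_l2 * np.eye(A.shape[1])
    e = np.linalg.solve(gram, A.T @ b)         # normal equations of the ridge problem
    return np.clip(e, 0.0, 1.0)                # heuristic projection onto the weight bounds

# Hypothetical toy problem with known coefficients.
rng = np.random.default_rng(1)
demo_neutral = rng.normal(size=(5, 3))
demo_deltas = rng.normal(size=(2, 5, 3))
demo_true = np.array([0.3, 0.7])
demo_target = demo_neutral + np.tensordot(demo_true, demo_deltas, axes=1)
demo_fit = fit_l2(demo_neutral, demo_deltas, demo_target, weight_l2=1e-9)
```

Increasing `weight_l2` shrinks all coefficients toward zero at once, which is the stability versus expressiveness trade-off discussed in Section II.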

IV. SELECTIVE EXPRESSION MANIPULATION
In the delta-blendshape approach, the facial shape is given by a combination of the delta-blendshapes as defined in (1).
Artists have widely used it for character modeling and user interactions since it enables the facial shape to be locally controlled and semantically interpretable. However, when this approach is used to estimate a set of expressions from a human face, the interpretability is often decreased because the weights of the expression blendshapes obtained by the fitting method are not uniquely determined. The concurrent activations of the expression models that are represented in similar facial regions can interfere with each other, decreasing the semantic meaning of each other's expression. Thus, fitting with the whole-expression blendshapes can cause the fitting procedure to become over-fitted or stuck in the local minima.
To address the problem, we attempted to find a set of expressions relevant to the target from among the entire set of expression blendshape models instead of using the whole-expression models. Inspired by greedy algorithms [22], [23], we propose a method whereby a subset of expressions is obtained by iteratively selecting a delta expression candidate that shows the highest expression similarity to the target. Figure 3 depicts an overview of SEM, in which the six expression delta-blendshapes related to the mouth shape, $\Delta B_i$ ($1 \le i \le 6$), are represented for simplicity. By interpreting vertex displacements as a facial motion, the expressional similarity is defined as a measure in terms of the facial movements of the blendshape model. Thus, SEM selectively obtains a set of expression blendshapes; it selects a series of facial motions required to compose the target facial shape. In order to interpret the expression delta-blendshapes as facial motions more semantically and precisely, the selected expressions should not be canceled out by the other blendshapes. As a result, SEM obtains a set of less inter-correlated expressions that are semantically close to the target face by removing expression candidates that lead to motions dissimilar from those of the target.
First, we introduce the SEM under the assumption that a target facial mesh has the same vertex correspondence to an expression blendshape. The two kinds of expressional similarities are defined by measuring the motion difference between the target mesh and the blendshape model. Based on the similarities, a robust method for predicting and selecting a subset of less inter-correlated expressions for best representing the target is described. Then, we extend the proposed method to expression selection for given landmarks and face point clouds.

A. THE SIMILARITY BETWEEN TWO MESHES
From an initial facial mesh $F_0 = [x_0, y_0, z_0, \ldots, x_{n_{ver}}, y_{n_{ver}}, z_{n_{ver}}] \in \mathbb{R}^{(3 \times n_{ver})}$, assume that two facial meshes $F_1$ and $F_2$ are generated by adding expressional motions to $F_0$. The corresponding motions are $\Delta F_1 = F_1 - F_0$ and $\Delta F_2 = F_2 - F_0$. The motion correlation $C_M$ between $F_1$ and $F_2$ w.r.t. $F_0$ can be measured as follows:
$$C_M(F_1, F_2) = \Delta F_1 \cdot \Delta F_2. \tag{8}$$
The expressional similarity (ESim) is defined by normalizing the motion correlation in (8) by the magnitudes of both motion vectors:
$$\mathrm{ESim}(F_1, F_2) = \frac{\Delta F_1 \cdot \Delta F_2}{\| \Delta F_1 \|_2 \, \| \Delta F_2 \|_2}. \tag{9}$$
Assuming that $F_2$ is an expressional facial shape of the target, ESim measures the similarity between the facial motion $\Delta F_1$ and the facial motion of the target $\Delta F_2$. As ESim is the cosine similarity of two motions, it gives a normalized measure, ranging from -1 to 1, of how similar the resulting facial shapes are. Using ESim, a blendshape motion is selected in each iteration that produces a facial shape most similar to the target facial shape. Figure 4 depicts two sets of facial shapes $F_1$ and $F_2$ generated from $F_0$. In Figure 4(a), they have similar poses, for which ESim is 0.82. Conversely, in Figure 4(b), ESim is -0.81 when the two faces have different poses corresponding to widely opening and closing the eyes.
The motion correlation in (8) can be considered an $(n_{ver} \times 3)$-dimensional projection of $\Delta F_2$ onto $\Delta F_1$ or vice versa. Based on this geometric property, we introduce the relative similarity that measures the relative amount of one expressional motion to the other. The relative similarity (ERSim) of $F_1$ to $F_2$ is defined by normalizing the motion correlation by the magnitude of $\Delta F_1$:
$$\mathrm{ERSim}(F_1, F_2) = \frac{\Delta F_1 \cdot \Delta F_2}{\| \Delta F_1 \|_2^2}. \tag{10}$$
This metric measures the magnitude of $\Delta F_1$ relative to $\Delta F_2$ in terms of the expressional motion. When the magnitudes of $\Delta F_1$ and the projection of $\Delta F_2$ onto $\Delta F_1$ are the same, ERSim produces a unit value. On the other hand, when the magnitudes of the two motions differ, ERSim provides a scale multiplier that makes $\Delta F_1$ equal to the projection of $\Delta F_2$, i.e., an optimal magnitude of $\Delta F_1$ for making it similar to $\Delta F_2$. In short, ESim in (9) measures the directional similarity between the facial motions, whereas ERSim in (10) quantifies the similarity between the facial shapes that the facial motions produce from the initial face. Thus, ERSim is used in SEM for two purposes: (a) measuring the difference of a facial shape from the target face from a motional perspective and (b) determining an optimal magnitude of the expression delta-blendshapes for making the shape similar to the target face. Figure 5 shows an example where the ERSims of $F_1$ relative to $F_2$ are measured by varying the magnitude of $\Delta F_1$. As shown in Figure 5(d), ERSim is 1.0 when the two facial motions are the most similar. Moreover, when $F_1$ ($= F_0 + \Delta F_1$) has either a less or a more expressive shape than $F_2$, ERSim can provide a multiplier $\alpha$ for the facial motion $\Delta F_1$ to make the facial shape similar to $F_2$, such that $F_0 + \alpha \Delta F_1 \approx F_2$ ($\alpha = 5.0$ and $0.5$ for Figures 5(c) and 5(e), respectively). Thus, for a unit motion $\Delta F_1$, ERSim provides an optimal magnitude of the motion similar to $F_2$. In SEM, the selected blendshape motion at each iteration is magnified to be close to the target facial shape from an expressional perspective by measuring ERSim.
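The two similarities are simple to compute once the motions are flattened into vectors. The following NumPy sketch implements the motion correlation, ESim as a cosine similarity, and ERSim as the projection coefficient described above; the function names and the toy motion are illustrative assumptions.

```python
import numpy as np

def motion_correlation(f0, f1, f2):
    """Inner product of the two motions measured from the initial mesh f0."""
    return float((f1 - f0).ravel() @ (f2 - f0).ravel())

def esim(f0, f1, f2):
    """Expressional similarity: cosine of the angle between the two motions."""
    d1, d2 = (f1 - f0).ravel(), (f2 - f0).ravel()
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

def ersim(f0, f1, f2):
    """Relative similarity of f1 to f2: scale that maps motion 1 onto the projection of motion 2."""
    d1, d2 = (f1 - f0).ravel(), (f2 - f0).ravel()
    return float(d1 @ d2 / (d1 @ d1))

# Hypothetical toy meshes with two vertices each.
base = np.zeros((2, 3))
motion = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
same_direction = esim(base, base + motion, base + 2.0 * motion)   # direction only
needed_scale = ersim(base, base + motion, base + 2.0 * motion)    # multiplier alpha
opposite = esim(base, base + motion, base - motion)
```

In this toy case the second mesh moves twice as far in the same direction, so ESim is unaffected by the magnitude while ERSim returns the multiplier needed to reach it.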

B. SEM ON CORRESPONDING MESHES 1) OBJECTIVE
Let $T \in \mathbb{R}^{(3 \times n_{ver})}$ be a target facial mesh and $B$ be the facial shape representation using the blendshapes defined in (1). In SEM, the target face is derived from an initial face by a series of facial motions, that is, the delta-blendshapes. Then, the expression fitting energy of the blendshape model $B$ to be minimized with respect to the expression coefficient vector $e$ is defined as
$$E_{mot} = \left\| \Delta T - \sum_{i=2}^{n_{exp}} e_i \, \Delta B_i \right\|_2^2, \tag{11}$$
where $\Delta T$ is the motion derived from the initial facial shape, i.e., $\Delta T = T - B_1$. The motional fitting energy in (11) can be represented in terms of the relative similarity ERSim in (10), which gives the absolute motional fitting energy $E'_{mot}$ in (12).

FIGURE 5. … As ERSim measures a relative similarity, $0.5 \cdot \Delta F_1$ is the delta expression most similar to $F_2$ as depicted in (d).
The full derivation of (12) appears in the Appendix. SEM determines pairs of delta-blendshapes and coefficients by greedily minimizing (12) so as to satisfy the following conditions: (a) a unique set of expression blendshapes and coefficients is obtained from a given facial mesh, (b) as few expression coefficients as possible are used to minimize the absolute motional fitting energy $E'_{mot}$, and (c) the semantic meaning of each expression delta-blendshape used to represent the facial shape is preserved.

2) PROBLEM FORMULATION
Starting from a neutral facial shape, SEM iteratively selects an expressional motion, that is, a delta-blendshape, in a greedy manner. The facial shape $F_t$ after the $t$-th selection is defined in (13), where $\Delta B_t$ and $e_t$ are the selected delta-blendshape and its coefficient in the $t$-th iteration ($t \ge 1$), respectively. Note that the initial facial shape is the neutral facial blendshape $B_1$. Since the greedy selection scheme sequentially determines the expressions one by one as it iterates, the motional energy in (12) to be minimized in each iteration $t$ is reformulated as (14).

3) EXPRESSION SELECTION
As SEM iterates by (13), it determines an expression blendshape $\Delta B_t$ from among the blendshape candidates, together with the magnitude $e_t$ of the selected blendshape, so that the resulting facial shape minimizes the fitting energy in (14). This can be accomplished by finding the expression that maximizes ESim in (9) and measuring its coefficient via ERSim in (10). A detailed description is given in the Appendix.
Assume that there are $(n_{exp} - 1)$ delta-blendshape candidates $\Delta B_i \in \mathbb{R}^{(3 \times n_{ver})}$, $2 \le i \le n_{exp}$. To determine $\Delta B_t$, SEM selects the blendshape candidate most expressively similar to the target facial shape. ESim between the motion toward the target facial shape, $T - F_t$, and each candidate, denoted $s_i$ ($2 \le i \le n_{exp}$), can be measured as follows:
$$s_i = \frac{(T - F_t) \cdot e_i \Delta B_i}{\| T - F_t \|_2 \, \| e_i \Delta B_i \|_2}. \tag{15}$$
However, $e_i$, which is the magnitude of each blendshape, is unknown. As ESim is defined by normalizing the magnitudes of the facial motions, it is scale-invariant with respect to a facial motion by the definition in (9). Therefore, the coefficient does not affect the expressional similarity, as described in Figure 5. Thus, $s_i$ in (15) can be substituted as follows:
$$s_i = \frac{(T - F_t) \cdot \Delta B_i}{\| T - F_t \|_2 \, \| \Delta B_i \|_2}. \tag{16}$$
Once the similarities for all the candidates are measured, SEM selects the blendshape that has the largest similarity as the delta-blendshape $\Delta B_t$ in the $t$-th iteration:
$$\Delta B_t = \Delta B_{i^*}, \qquad i^* = \arg\max_{i} \, s_i. \tag{17}$$
Then, the corresponding coefficient $e_t$ is obtained using ERSim in (10) as follows:
$$e_t = \frac{(T - F_t) \cdot \Delta B_t}{\| \Delta B_t \|_2^2}. \tag{18}$$
The blendshape selected in the current iteration $t$ is removed from the blendshape candidate set so that it is not chosen in the following iterations.
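The selection loop described above can be sketched as follows: at each step the candidate with the largest cosine similarity to the remaining motion is picked, its coefficient is the projection coefficient, and it is removed from the candidate set. This is a simplified illustration only. It omits the expression exclusion step, does not clamp coefficients to $[0, 1]$, and assumes dense vertex correspondence; the function name, the stopping tolerance, and the toy data are assumptions.

```python
import numpy as np

def greedy_select(neutral, deltas, target, n_iter=None, tol=1e-12):
    """Greedily pick delta-blendshapes by cosine similarity to the remaining motion."""
    shape = neutral.astype(float).copy()
    remaining = list(range(len(deltas)))
    picked = []  # list of (candidate index, coefficient)
    n_iter = len(deltas) if n_iter is None else n_iter
    for _ in range(n_iter):
        residual = (target - shape).ravel()        # motion still needed to reach the target
        res_norm = np.linalg.norm(residual)
        if not remaining or res_norm < tol:
            break
        flat = [deltas[i].ravel() for i in remaining]
        sims = [d @ residual / (np.linalg.norm(d) * res_norm) for d in flat]
        best = int(np.argmax(sims))                # most similar candidate
        d = flat[best]
        coeff = float(d @ residual / (d @ d))      # projection coefficient of the residual
        idx = remaining.pop(best)                  # never select the same candidate twice
        shape = shape + coeff * deltas[idx]
        picked.append((idx, coeff))
    return picked, shape

# Hypothetical toy model: three mutually orthogonal single-vertex motions.
demo_neutral = np.zeros((3, 3))
demo_deltas = np.zeros((3, 3, 3))
demo_deltas[0, 0, 0] = demo_deltas[1, 1, 1] = demo_deltas[2, 2, 2] = 1.0
demo_target = demo_neutral + 0.8 * demo_deltas[0] + 0.3 * demo_deltas[2]
demo_picked, demo_shape = greedy_select(demo_neutral, demo_deltas, demo_target)
```

Because the toy motions are orthogonal, the loop recovers exactly the two expressions used to build the target and never touches the unused one; with correlated candidates the order of selection matters.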

4) EXPRESSION EXCLUSION
SEM selects $(n_{exp} - 1)$ expression blendshapes over $(n_{exp} - 1)$ iterations in the order of expressional similarity. Expression blendshapes dissimilar to the target facial shape from a motional perspective are excluded from the set of candidates before the selection to obtain fewer expressions that are less inter-correlated. In SEM, the greedy selection at each iteration picks the motion most similar to the remaining motion required to reach the target face. Therefore, it is necessary to confirm from a global perspective that the selections of the SEM iterations decrease the motional fitting energy from an initial face in (12).
We introduce an expression exclusion step to efficiently accomplish such a process using a 'one-step' backtracking strategy [24], [25], [65]. Based on the formulation in (13), we define in (19) the motional energy index from the initial face, which is measured for all the remaining candidates $\Delta B_i$ at the beginning of the $t$-th iteration, where $\tilde{B}_i = \sum_{k=1}^{t-1} e_k \Delta B_k + e_i \Delta B_i$ and $e_i$ is calculated by (18). The term $\sum_{k=1}^{t-1} e_k \Delta B_k$ in $\tilde{B}_i$ is the accumulated motion of the expressions selected over the first $(t - 1)$ iterations, and the term $e_i \Delta B_i$ is the motion of the $i$-th expression among the remaining candidates. Thus, the motional energy index of the expression selected in the previous iteration, i.e., the $(t - 1)$-th iteration, can be measured as $r_{t-1, B_t}$. As a sequence of motions, an expression to be selected should not increase the absolute motional energy in (12) from the initial face. Thus, all the expression candidates that satisfy the condition in (20) in the $t$-th iteration are excluded from the subsequent selections. This exclusion weeds out candidates that largely cancel out the previously selected expressions while allowing the selected motions that move toward the target. It helps SEM to non-parametrically determine the number of expressions required to represent the target expression semantically. The SEM procedure terminates when no candidate remains after the selection and exclusion procedures.
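The exclusion test can be sketched in code under explicit assumptions. The exact forms of the motional energy index and of the exclusion condition are not reproduced in this section, so the sketch below assumes that the index is $|1 - \mathrm{ERSim}|$ between the trial accumulated motion and the target motion, and that a candidate is excluded when its trial index exceeds the index reached after the previous selection. Both choices, along with every function name, are hypothetical stand-ins and should not be read as the paper's definitions.

```python
import numpy as np

def energy_index(accumulated, target_motion):
    """ASSUMED index |1 - ERSim(accumulated, target)|; requires a non-zero accumulated motion."""
    a, m = accumulated.ravel(), target_motion.ravel()
    return abs(1.0 - float(a @ m) / float(a @ a))

def exclude_candidates(accumulated, target_motion, candidates, prev_index):
    """Return the keys of candidates whose trial index is larger than prev_index (ASSUMED rule)."""
    residual = (target_motion - accumulated).ravel()
    excluded = []
    for key, delta in candidates.items():
        d = np.asarray(delta, dtype=float)
        coeff = float(d.ravel() @ residual) / float(d.ravel() @ d.ravel())  # projection coefficient
        trial = accumulated + coeff * d        # accumulated motion if this candidate were added
        if energy_index(trial, target_motion) > prev_index:
            excluded.append(key)
    return excluded

# Hypothetical single-vertex example.
acc = np.array([[1.0, 0.0, 0.0]])              # motion accumulated so far
goal = np.array([[1.0, 1.0, 0.0]])             # target motion from the neutral face
cands = {2: [[0.0, 1.0, 0.0]],                 # adds only the missing component
         3: [[1.0, 1.0, 0.0]]}                 # also pushes along the already satisfied axis
dropped = exclude_candidates(acc, goal, cands, energy_index(acc, goal))
```

In the toy example, the candidate that re-adds an already satisfied component raises the assumed index and is dropped, while the candidate that supplies only the missing component is kept.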

C. SEM ON LANDMARKS
In Sec. IV-B, it is assumed that the target facial mesh corresponding to the blendshape model is given. However, in real applications of blendshape fitting, the point clouds of the target face or facial landmarks on the image are generally given, as used in (3) and (4). As the vertex-point correspondence between the blendshape model and the facial landmarks is known, the expression selection on the facial landmarks is similar to the selection on the facial mesh, except for the following two conditions: 1) the facial landmarks are far fewer than the blendshape vertices, and 2) the facial landmarks are projected 2D points on the image, rather than 3D vertices.
When 2D facial landmarks are given, facial motion is measured on the projected plane rather than in the 3D coordinates. Let $L = [\mathbf{l}_1, \ldots, \mathbf{l}_{n_{fea}}] \in \mathbb{R}^{(2 \times n_{fea})}$ and $F = [\mathbf{f}_1, \ldots, \mathbf{f}_{n_{fea}}] \in \mathbb{R}^{(3 \times n_{fea})}$ be the flattened vectors of the 2D facial landmarks and the corresponding vertices of the blendshape models, respectively. For ease of comparison with the previous blendshape fitting, the same notation as in (4) is used. Accordingly, (16), (18), and (19) are modified to measure the facial motions on the projected 2D landmarks. SEM greedily selects expressions in the order of the expressional similarity $s_i$ and directly computes the coefficient of the selected blendshape $e_t$ in (21). Thus, unlike the previous blendshape-fitting methods, SEM requires neither the regularization term in (7) to obtain sparse expressions nor iterative optimization to find the optimal coefficients for the facial landmarks. In addition, for the given geometric parameters and the target facial landmarks, SEM selects the same unique set of expressions in each trial.

D. SEM ON POINT CLOUDS
For point clouds, the vertex-point correspondences to the blendshape model are not given. In the previous blendshape fitting, the closest point to each vertex of the blendshape is chosen during the optimization expressed in (6). Similarly, SEM finds the corresponding pairs using Iterative Closest Point (ICP) [66]. However, in SEM, the corresponding pairs are obtained for each blendshape. Thus, SEM finds the target point set $P_i = [\mathbf{p}_{i,1}, \ldots, \mathbf{p}_{i,n_{ver}}]$ closest to the delta-blendshape candidate $\Delta B_i$ ($2 \le i \le n_{exp}$) using ICP in each iteration of the expression selection, which is denoted in (23), where $P$ is the point set closest to the facial mesh with a single expressional motion $\Delta B_i$ of magnitude $e_i$, i.e., $F_t + e_i \Delta B_i$. The objective function in (23) finds the closest point set $P_i$ with respect to $e_i$, and $e_i^*$ is not used further. Once all the correspondences are obtained, the parameters in (16), (18), and (19) can be measured for each candidate as in (24). We assume that the expression is selected by (17) from among the candidates in (24) at the $t$-th iteration, and denote by $P_t$ the point set of the selected expression. As $P_t$ is updated to include more accurate matches at each iteration, the coefficients for the previously selected expressions can be refined by using the updated correspondences.
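The per-candidate correspondence search can be illustrated by alternating a closest-point match with a one-dimensional coefficient update. This is a simplified sketch in the spirit of ICP, not the ICP formulation of [66]: rigid alignment is omitted, the nearest-neighbor search is brute force, and the function names, round count, and toy data are assumptions.

```python
import numpy as np

def nearest(vertices, cloud):
    """Closest cloud point for every vertex (brute force)."""
    d2 = ((vertices[:, None, :] - cloud[None, :, :]) ** 2).sum(axis=-1)
    return cloud[d2.argmin(axis=1)]

def candidate_correspondence(shape, delta, cloud, n_rounds=5):
    """Alternate closest-point matching and a 1-D coefficient update for one candidate."""
    coeff = 0.0
    d = delta.ravel()
    for _ in range(n_rounds):
        matched = nearest(shape + coeff * delta, cloud)   # correspondences for the trial shape
        residual = (matched - shape).ravel()              # motion toward the matched points
        coeff = float(d @ residual / (d @ d))             # projection coefficient of that motion
    return nearest(shape + coeff * delta, cloud), coeff

# Hypothetical toy data: one vertex moves by 0.2 along x in the target cloud.
demo_shape = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
demo_delta = np.zeros((3, 3))
demo_delta[0, 0] = 0.2
demo_cloud = demo_shape + demo_delta
demo_points, demo_coeff = candidate_correspondence(demo_shape, demo_delta, demo_cloud)
```

The matched point set returned for each candidate plays the role of the target when the similarity and the coefficient are evaluated for that candidate.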
The point cloud matching error using the closest point in (3) is largely non-linear and it has several local minima. Therefore, without strong regularizations or priors, jointly optimizing the full expressions using ICP causes the fitting procedure to get stuck in the local minima in many cases. SEM efficiently avoids these local minima and finds a better solution for the target point cloud by providing such strong priors of target expressions in the order of the motional similarity.

V. EXPERIMENTAL RESULTS
We conducted two main experiments to evaluate the performance of our proposed method. In the first part, the performance for facial expression reconstruction and alignment was evaluated against existing schemes. We categorized the fitting optimization functions used in the previous facial reconstruction methods into four baselines. The performance comparisons to the baselines were performed to verify the SEM scheme over $L_1$, $L_2$, and PCA regularizations. In addition, comparisons with the state-of-the-art reconstruction methods were performed to validate the performance improvements of SEM. In the second part, we evaluated the uniqueness and semantics of the proposed method by correlating the expression coefficients obtained from the expression fitting with two attributes closely related to facial expressions.

A. 3D EXPRESSION RECONSTRUCTION AND ALIGNMENT

1) QUANTITATIVE METRIC FOR FACIAL RECONSTRUCTION
To evaluate facial alignment performance, we measured the normalized error (NE) and the relative similarity (ERSim) of the obtained shape to the target shape. NE is the distance to the target points normalized by the inter-ocular distance. ERSim approaches one as the generated shape becomes more similar to the target in terms of facial motions, as defined in (13). In addition, the total redundancy among the blendshapes, denoted as Corr, was measured. Please refer to Appendix-VI for details of the measurement metrics.

2) COMPARISONS WITH THE BASELINE METHODS
The early methods for blendshape fitting used point clouds from depth sensors or 3D scanners. By contrast, recent methods have widely used facial images, or facial landmarks obtained from the images, rather than point clouds to synthesize a 3D human face. Regardless of the type of target features used for the blendshape fitting, many previous works using the delta-blendshape models [8], [27], [28], [36], [60], [61] have obtained the coefficients of the expression delta-blendshapes by minimizing the fitting error E_fit in (7) under the assumption that the user-specific blendshapes were given. Unlike 3DMM-based methods, these works did not use a low-dimensional subspace decomposition, such as PCA, for the expression blendshapes. Instead, the original shape of each expression blendshape was intended to be preserved as much as possible. Thus, either an L_1 or an L_2 regularization term for the expression coefficients in (7) was adopted in these methods to prevent the facial shape from being over-fitted and from having exaggerated expressions. To cover the delta-blendshape fitting schemes used in these methods, the fitting energy of the two baseline methods with the L_1 and L_2 regularizations is represented by (7).
The L_1 and L_2 norms were used mutually exclusively in our experiments for reliable measurement, i.e., either ω_L1 or ω_L2 was zero. For simplicity, we denote the results obtained with the L_1 and L_2 regularizations as L_1 Reg and L_2 Reg, respectively. The L_0 norm, which is a counting loss for the non-zero expression coefficients, was also considered for our experiments because it has been used to make sparse representations in other applications [67]. However, the L_0 norm produced unreliable results in our tests, and it was challenging to find common constant values or an interval to balance the normalization term and the others. Therefore, we do not include the L_0 norm in the baseline experiments. In addition to L_1 Reg and L_2 Reg, the performance comparisons were conducted with the fitting schemes using PCA-decomposed basis vectors, which have been used in many parametric model-based fitting methods [30], [49], [50], [51], [53], [54], [55], [68]. To cover the parametric model fitting schemes used in these methods, two types of expression bases were used for the baseline comparison. The first PCA basis was obtained by decomposing the entire expression set, including a neutral model. The second was obtained by decomposing the delta shapes of the expression blendshapes [69], which has been widely used in PCA-based facial fitting approaches. We denote the results of these methods as PCAExp and PCADel, respectively. Following the standard regularization technique for PCA coefficients [51], [53], [54], an L_2 regularization on the coefficients weighted by the inverse of the PCA eigenvalues was used, which is the so-called ''Mahalanobis distance'' regularization [70].
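For illustration, the PCADel basis and the eigenvalue-weighted regularizer described above could be computed as in the following sketch. It assumes that the delta shapes are mean-centred before the decomposition and that the eigenvalues are the sample variances along each component; the function names are hypothetical and the actual pipeline in the cited works may differ.

```python
import numpy as np

def pca_delta_basis(deltas, n_comp):
    """PCA of delta shapes (n_exp, n_ver, 3) -> (basis, eigenvalues)."""
    X = deltas.reshape(len(deltas), -1)
    X = X - X.mean(axis=0)  # assumption: centre the delta shapes
    # Rows of Vt are the principal directions of the centred data.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    eigvals = s ** 2 / (len(X) - 1)
    return Vt[:n_comp], eigvals[:n_comp]

def mahalanobis_reg(coeffs, eigvals, eps=1e-12):
    """Sum of squared PCA coefficients, each weighted by 1 / eigenvalue."""
    coeffs = np.asarray(coeffs, dtype=float)
    return float(np.sum(coeffs ** 2 / (np.asarray(eigvals) + eps)))
```

Components with small eigenvalues are penalized more strongly, so the fit is discouraged from moving far along directions that vary little in the blendshape set.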

a: RESULTS ON FACIAL LANDMARKS
The fitting energy for the facial landmarks in (7) can be represented with ω_pnt = 0 to disable the point cloud loss. To obtain an accurate neutral face model for each facial expression image for composing the user-specific blendshapes [36], we used the Extended Cohn-Kanade (CK+) database [21], where each sequence begins at the neutral expression and ends at the peak expression. For the 593 sequences from 123 subjects who portrayed seven basic emotions in the CK+ database, the images in the first frame were used to obtain the user identities. Then, SEM, L_1 Reg, and L_2 Reg used the obtained user-specific blendshape set directly. In contrast, we decomposed the obtained blendshape set into the PCA spaces for PCAExp and PCADel. Among the 68 landmark points annotated in the CK+ database images, it is difficult to clearly distinguish between the inner and outer lip corners in practice. In addition, the facial boundary landmark points can be defined ambiguously depending on the facial pose. Therefore, we used 49 landmarks for reliable evaluation by excluding 2 inner lip corners and 17 facial boundary points. For the performance evaluation, we split the 49 inner landmarks of the CK+ database into two sets, used and unused points, as represented in the first column of Figure 6. The former is for the facial fitting and the latter is for measuring facial alignment accuracy at unobserved points. For a fair comparison, the non-linear optimization for all the methods in the baseline experiments was performed using an off-the-shelf BFGS optimizer [71], [72] that constrains the boundaries of the expression coefficients.
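A bounded baseline fit of this kind can be sketched with SciPy. The exact form of (7) is not reproduced here: the squared landmark residual, the coefficient bounds [0, 1], the choice of the L-BFGS-B variant, and the name `fit_coefficients` are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def fit_coefficients(neutral, deltas, lm_idx, target_lm, w_l1=0.0, w_l2=0.0):
    """Fit bounded expression coefficients to target landmarks.

    neutral   : (n_ver, 3) neutral mesh
    deltas    : (n_exp, n_ver, 3) delta-blendshapes
    lm_idx    : (n_lm,) vertex indices tied to the landmarks used for fitting
    target_lm : (n_lm, 3) target landmark positions
    """
    n_exp = len(deltas)
    A = deltas[:, lm_idx, :].reshape(n_exp, -1).T  # (n_lm * 3, n_exp)
    b = (target_lm - neutral[lm_idx]).ravel()

    def energy(e):
        r = A @ e - b
        # With e >= 0 enforced by the bounds, the L1 norm is a plain sum.
        return r @ r + w_l1 * e.sum() + w_l2 * (e @ e)

    def grad(e):
        return 2.0 * A.T @ (A @ e - b) + w_l1 + 2.0 * w_l2 * e

    res = minimize(energy, np.zeros(n_exp), jac=grad,
                   method="L-BFGS-B", bounds=[(0.0, 1.0)] * n_exp)
    return res.x
```

In the setting above, `lm_idx` would hold only the used landmarks, and the alignment error would then be measured separately on the unused ones.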
The distance errors of the expression fitting methods on the CK+ database are summarized in Table 1. NE is measured for all (49pts), unused (26pts), and used (23pts) landmarks. The errors of the baseline methods do not differ meaningfully from one another. However, the results establish that SEM outperforms the baseline methods in facial alignment. In particular, SEM shows a significantly higher performance gain when it is evaluated on the unused landmarks. In contrast, the baseline methods predict the unused landmarks less accurately than the used ones, leading to an increase in the NE measured for all landmarks. Figure 6 visualizes the facial shapes obtained from these methods. SEM produces expressive faces whose expressions are visually distinct from those of the baseline methods. In contrast, the baseline methods generate exaggeratedly expressive faces for some facial images and less expressive faces for others, even with the same regularization balancing constants.
The measurements of ERSim and Corr are summarized in Table 2. The significantly lower Corr and higher ERSim than those of the baseline methods demonstrate that SEM can represent the target more expressively and semantically with significantly less correlation between the delta-blendshapes. In contrast, the baseline methods combine all the delta expressions to make the facial points of the blendshapes as close as possible to the given facial landmarks without knowing the appropriate expressions to be matched to the target expression. As a result, the obtained expressions may become redundant and yield higher Corr measurements (i.e., more than two times higher than SEM), implying that the delta shapes composing the facial shape are motionally redundant. To examine the redundancy between the obtained blendshapes in more detail, Corr is measured separately for its positive and negative parts, i.e., for the terms of (33) in the Appendix with e_F,i e_F,j B_i · B_j > 0 and < 0, respectively. A higher Corr(+) means that the expression blendshapes have redundant motions, and a higher Corr(−) means that the expressions tend to cancel each other out. In other words, lower Corr(+) and Corr(−) indicate a better representation of the target expressions. Table 3 summarizes the correlation measurements. SEM obtains expressions with lower Corr(+) and Corr(−) than the baseline methods. Cancellations between the expressions can prevent each blendshape from representing its own facial shape, considerably decreasing the semantics of the blendshapes. Compared to the baseline methods, the notably lower Corr(−) value demonstrates that SEM can produce a semantically meaningful set of expressions by avoiding the cancellations between the blendshapes.
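The split into positive and negative parts can be illustrated with the short sketch below. It sums the pairwise terms e_F,i e_F,j B_i · B_j over i < j by sign and deliberately omits the normalization of (33), which is not reproduced here; the function name is hypothetical.

```python
import numpy as np

def signed_redundancy(coeffs, deltas):
    """Return (positive part, negative part) of the pairwise motion products.

    coeffs : (n_exp,) expression coefficients e_F
    deltas : (n_exp, n_ver, 3) delta-blendshapes B_i
    """
    coeffs = np.asarray(coeffs, dtype=float)
    B = np.asarray(deltas, dtype=float).reshape(len(coeffs), -1)
    G = np.outer(coeffs, coeffs) * (B @ B.T)     # e_i e_j (B_i . B_j)
    pair = G[np.triu_indices(len(coeffs), k=1)]  # each unordered pair once
    pos = pair[pair > 0].sum()    # redundant motions (unnormalized)
    neg = -pair[pair < 0].sum()   # cancelling motions (unnormalized)
    return float(pos), float(neg)
```

Two blendshapes that move the same vertices in the same direction contribute to the first value, while two that move them in opposite directions contribute to the second.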

b: RESULTS ON 3D POINT CLOUDS
For the 3D point clouds, the Bosphorus database [73] was used. This database contains 4,666 3D scans from 105 subjects. We used 2,603 of the 4,666 scans for the performance evaluation after excluding facial scans without expressions. The user-specific expression blendshapes were obtained from the scans of the neutral face in the Bosphorus database. The Bosphorus database provides approximately 30,000 valid facial points per scan. The point clouds were randomly sampled 10 times to produce sets for the performance evaluation. For evaluations on point clouds, the facial landmark term was not used. Thus, the fitting energy of the baseline methods can be represented as E_fit in (7) with ω_fea = 0.
The average performances of the methods are summarized in Table 4. Figure 7(a) visualizes the facial shapes obtained from the proposed and the baseline methods. As strong constraints such as the facial landmarks are not used in this experiment, the facial shapes of the baseline methods tend to be trapped in local minima, as shown in Figure 7(a). PCA helps prevent the facial shapes from being overshot into local minima by decomposing the basis to move globally. However, the globally moving basis often results in missing local details. In contrast, SEM reliably finds the target expression by capturing the distinct expressions in the order of the motional similarity, showing the lowest NE, the highest ERSim, and considerably lower Corr measurements than the baseline methods. Figure 7(b) shows the facial shapes obtained from the baseline methods by using different regularization weights. Regularization does not help to form expressions with different shapes. Instead, it decreases the overall expressions of the facial shape and increases the target error as the magnitude of the weight increases. The baseline approaches do not often produce highly exaggerated facial shapes, even when no regularization is used, because a sufficient number of fitting points is provided. Instead, they often fail to find the target expressions or mis-capture them. The cancellation between expressions (higher Corr) allows the baseline methods to generate various shapes, increasing the possibility of overshooting or getting stuck in local minima. The cancellation also significantly decreases the semantics of the blendshapes, so that the strong priors of the blendshape faces or motions cannot be sufficiently utilized. Finally, it is demonstrated that SEM reliably constructs a realistic face highly correlated with the target facial motion (higher ERSim) with a combination of less inter-correlated expressions (lower Corr).

3) COMPARISONS WITH THE STATE-OF-THE-ART METHODS
In this section, we compared the results quantitatively and qualitatively with the state-of-the-art facial reconstruction methods using the 3D facial model to validate the performance of the proposed method. For thorough validation, multiple recently proposed facial reconstruction methods were used for the comparisons, namely those of Cao16 [9], Zhu16 [50], Chang18 [54], Sanyal19 [68], Guo20 [55], and Kang20 [64]. All of these methods align the 3D facial models [17], [49], [75] to a facial image and produce matching landmark points. SEM, Cao16, and Kang20 required the facial landmarks for aligning the model to the face to obtain the facial shape from an image, whereas the methods of Chang18, Sanyal19, and Guo20 needed the target image and a face bounding box on the image. To provide approximations of the facial landmarks and the bounding box in a target image, we used the Supervised Descent Method (SDM) [76], [77], which was trained with the 300W database [78]. The results of SDM, which is still widely used today for facial alignment tasks, were also included in the quantitative comparison. We used two public databases of facial emotional expressions to validate the facial fitting performance on various types of expressions: the CK+ [21] and AffectNet [74] databases. In the CK+ database, facial images were captured indoors under constrained conditions, where subjects were requested to pose specific emotional expressions. Therefore, it enables balanced evaluations of basic expressions. In contrast, in the AffectNet database, in-the-wild facial images were collected online, allowing the methods to be evaluated extensively on natural scenes. We used 10 uniformly sampled images per sequence in the CK+ database for evaluation, amounting to 5,930 facial images. The AffectNet database comprises 1,000K facial images with 68 facial landmark annotations, which are the same as the annotations used in the CK+ database.
In the AffectNet database, the landmarks of 420,300 images were annotated manually and those of the other images were obtained automatically using a facial alignment algorithm [79]. For reliable evaluation, 40,000 images in the manually annotated image set were used. We used 49 inner landmarks of 68 landmark annotations for evaluation.
The quantitative measurements of NE on the CK+ and AffectNet databases are summarized in Table 5. All methods show better performance on the CK+ database than the AffectNet database since in-the-wild expressions of the AffectNet database have significantly more variations. Furthermore, the images of the CK+ database have clean backgrounds and almost-frontal faces, allowing SDM to obtain better performance than facial model-based methods except for SEM. In contrast, facial model-based methods, including SEM, obtain the target landmarks with lower errors on the AffectNet database than SDM thanks to the human facial shape priors of the facial model. SEM obtains the target expression by fitting on the landmarks estimated by the landmark detector. Nevertheless, the results show that SEM decreases the mean error remarkably compared to that of SDM, enabling SEM to outperform the state-of-the-art methods in terms of the normalized distance error. This demonstrates that the increased semantic meaning of the expression blendshapes helps better characterize the facial shape and expression by providing strong priors of human expressional shapes. Figures 8(a) and 8(b) visualize facial models reconstructed by the methods on the CK+ and AffectNet databases. It can be seen that the methods of Cao16, Zhu16, Chang18, Sanyal19, Guo20, and Kang20 find coarse-scale details of the target expressions sufficiently. However, they are prone to miss out on expressive details, especially around the eyes, cheeks, and mouth. In contrast, the results show that SEM reconstructs expressive faces with fine details with less deviation, demonstrating facial expression blendshapes can be semantically manipulated with SEM to find appropriate and unique target expressions.

B. SEMANTICS AND UNIQUENESS OF EXPRESSION COEFFICIENTS
In this section, we evaluated the semantics and uniqueness of the proposed method. However, it is not easy to measure the performance of the expression fitting directly in terms of semantics and uniqueness. Therefore, we verified the proposed approach by examining the correlation of the coefficients obtained from the expression fitting with two attributes closely related to facial expressions: the Facial Action Unit (FAU) and the facial emotion. The expression coefficients and the attributes are expected to be highly correlated if unique coefficients with semantics are obtained according to the FAU and emotion. Therefore, we used the Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel to non-linearly correlate the coefficients with the attributes. Although the coefficients alone are insufficient to achieve extremely high correlations, relative comparisons between the facial fitting methods allow us to verify improvements in semantics and uniqueness. The comparisons were conducted with four facial fitting methods [9], [54], [55], [68] employed in Sec. V-A3.
[Figure caption fragment] ... (6, 25, and 26). SEM produces the expression coefficients uniquely over FAUs 25+26, whereas the other facial fitting methods produce quite different expression coefficients between the two faces.

1) FACIAL ACTION UNIT DETECTION
We used the Denver Intensity of Spontaneous Facial Actions (DISFA) database for the FAU detection experiment [80]. In the DISFA database, 4-minute videos of 27 subjects watching an emotional video stimulus were recorded, producing 4,845 image frames per subject. The intensities of 12 FAUs were manually coded in each frame. We used 80% of the images for training and 20% for testing the SVM classifier. The coefficients obtained from the comparison methods were fed as inputs to the SVM, and the FAU activations were used as labels. Three standard metrics were used for the evaluation with 10-fold validation: the Mean Squared Error (MSE), the Squared Correlation Coefficient (SCC), and the classification accuracy. Table 6 summarizes the quantitative performance measurements of the comparison methods for FAU detection. The results show that SEM outperforms the previous facial fitting methods. In particular, the SCC and accuracy of SEM are significantly higher than those of the others. One of the major factors leading to the remarkably higher correlation with FAU is the unique and semantic coefficients of expressions. Figure 9 depicts 3D facial shapes and expression coefficients for two faces in the DISFA database. The first face is nearly neutral and the other is smiling. Although these faces have different expressions, both faces share two facial actions, which are ''lips part'' and ''jaw drop'' (FAU 25+26). SEM shows similar coefficient distributions between the two faces. Specifically, the coefficient distribution of the second face almost covers that of the first one, implying that SEM can uniquely produce the expression coefficients for FAUs 25+26. In contrast, the other facial fitting methods produce considerably different coefficients between the faces. The results demonstrate that SEM significantly increases the semantics and uniqueness of the expression fitting.
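The classification protocol described above can be sketched with scikit-learn as follows. This is an illustrative setup only: the default SVM hyperparameters, the stratified split, the random seed, and the name `evaluate_coefficients` are assumptions, and the MSE and SCC metrics are not computed here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

def evaluate_coefficients(X, y, seed=0):
    """Hold-out accuracy (80/20 split) and 10-fold accuracies of an RBF SVM.

    X : (n_samples, n_exp) expression coefficients from a fitting method
    y : (n_samples,) binary activation labels of one FAU
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X_tr, y_tr)
    holdout_acc = clf.score(X_te, y_te)
    # Fresh classifier for the 10-fold evaluation on the full set.
    cv_acc = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=10)
    return holdout_acc, cv_acc
```

Running the same function on the coefficients of each fitting method gives the kind of relative comparison used in this section, since only the input features change.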

2) FACIAL EMOTION ESTIMATION
We used the AffectNet database [74] to estimate facial emotion values. In the AffectNet database, the facial intensity is defined in the Arousal-Valence space [81]. Since the intensity i for Arousal and Valence is composed of real numbers between −2 and 1 in this database, we divided the intensity values into 3 parts for classification: high for 0 ≤ i ≤ 1, middle for −1 ≤ i < 0, and low for −2 ≤ i < −1. Thus, three class labels (0, 1, and 2) were used for each of the Arousal and Valence parts to train the SVM classifiers. Table 7 summarizes the quantitative measurements of emotion estimation performance in terms of MSE and accuracy. The expression coefficients of SEM show significantly higher correlations with facial emotion than those of the other reconstruction methods, which is similar to the results of FAU detection in Sec. V-B1. Figure 10 depicts the 3D facial shapes belonging to the high and low labels in the Arousal coordinate and the corresponding expression coefficients. Each set of high and low Arousal faces represents similar emotional expressions. SEM produces clearly distinguishable distributions of expression coefficients between intra-label and inter-label faces. In contrast, it is not easy to find common characteristics in the coefficient distributions of the other methods to classify emotions. In the experiments, the unique coefficients according to emotion enable SEM to accomplish significant improvements over the state-of-the-art fitting methods, demonstrating the benefit of SEM in terms of uniqueness and semantics.
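The three-way binning of the intensity values can be written compactly, as in the sketch below. The assignment of 0, 1, and 2 to low, middle, and high is an assumption, since the text only states that three labels were used; the function name is hypothetical.

```python
import numpy as np

def intensity_to_label(values):
    """Map intensities in [-2, 1] to 0 (low), 1 (middle), or 2 (high)."""
    v = np.asarray(values, dtype=float)
    if np.any((v < -2.0) | (v > 1.0)):
        raise ValueError("intensity outside [-2, 1]")
    # Bins: [-2, -1) -> 0, [-1, 0) -> 1, [0, 1] -> 2
    return np.digitize(v, [-1.0, 0.0])
```

The same function would be applied separately to the Arousal and Valence values to produce the labels for the two classifiers.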

C. STABILITY OF BLENDSHAPE SELECTION
Here, we show temporal fitting results to verify the blendshape selection stability of the proposed method. After fitting the blendshape model on each frame of the facial expression sequences independently, we evaluated the temporal fitting stability through the intensity changes of ''key'' expression delta-blendshapes over time. A sufficient number of accurately registered 3D facial landmark points is essential to validate the stability of blendshape selection. Thus, following the work in [38], we captured temporal facial sequences with several expressions using 108 infrared (IR) markers with multiple IR cameras to obtain highly accurate and reliable facial landmark positions over time. For the evaluation, we compared the temporal fitting stability of the proposed method with that of the L_1 regularization method, which showed the lowest alignment error among the baseline methods in Section V-A2. In short, the delta-blendshape models were fitted on the temporal sequences using the SEM and L_1 Reg methods, respectively. For the stability experiments, we used the same balancing weight values as those used in the baseline experiments for the fitting methods. Then, we analyzed the intensities of the fitted delta-blendshapes over time. We chose two pairs of expressions in the delta-blendshapes, associated with the mouth and the eyes, for the evaluation. Figure 12(a) shows the intensity changes of the delta-blendshapes over time for the two key expressions, FAU 22 (mouth open) and FAU 23 (jaw twist). Although both delta-blendshapes change the mouth shape, the two expressions look clearly different from a human point of view. Thus, these expressions have quite different semantic meanings. At points A, C, and F in Figure 12(a), the subject opened his mouth wide. The results show that the previous scheme detects FAU 22 (mouth open) unevenly. It also recognizes FAU 23 (jaw twist) in frames that have semantically different expressions.
In contrast, the proposed scheme finds mouth openings in appropriate frames without activating irrelevant expression blendshapes. Figure 12(b) shows another example for the two key expressions, FAU 1 (right eye closed) and FAU 2 (left eye closed). The subject moves his eyebrows slightly (points A-C) and closes his eyes at the end of the sequence (points D-E). In the first three intervals (A-C) in particular, the subject moves his eyebrows, but the eyes do not get smaller than in the first frame of the sequence. Nevertheless, L_1 Reg used the expression delta-blendshapes with closed eyes (FAU 1 or FAU 2) to construct the target face for these intervals. By contrast, SEM robustly distinguishes between expressions with moving eyebrows and those with closed eyes. These results demonstrate that the proposed scheme accurately selects the target expression blendshapes with stability and semantics over time.

D. LIMITATION AND FURTHER WORK
The previous results show that SEM can detect the detailed target expressional shapes and poses using the inner landmarks. As a result, SEM produces the 3D facial shapes of the targets more expressively in a non-parametric manner for the given facial points. Nevertheless, as SEM is inherently affected by the landmark detector, a facial expression different from the target can be obtained because of wrong landmark positions, as illustrated in Figure 11. This limitation could be resolved by extending the fitting procedure of SEM to the selection of optimal landmarks. In addition, although utilizing a few less inter-correlated expressions can help represent a facial expression more accurately and semantically, it leads to discontinuous expression sets for a continuous image sequence. The discontinuity can be exploited to find the peak expressions or the anchor frames in the sequence. Therefore, SEM can be extended to the temporal domain by propagating to the other frames from the reference frames [82], [83]. The expression similarity can also be used to measure the correspondences between partially overlapped point clouds [84]. We intend to continue our research in these directions in the future.

VI. CONCLUSION
We introduced the selective manipulation approach for expression delta-blendshapes, where a target expression was approximated by a series of facial motions under the assumption that each delta-blendshape is a facial movement with semantic meaning. A series of facial motions required to compose the target facial shape were selected sequentially based on the expression similarities that quantify the motional correlation between the delta-blendshape and the remaining motion to reach the target points. The proposed method significantly decreased the cancellations between the selected expressions by excluding expression candidates that caused the motion to be uncorrelated to the target motion from an integrated perspective. Consequently, less inter-correlated expressions with considerably larger target correlations could be obtained by using the SEM. The experimental results on the public databases showed an increase in the quality, uniqueness, and semantics of the facial shape representation over the baseline and the state-of-the-art methods.

APPENDIX MOTIONAL SHAPE FITTING ENERGY AND ERSim
The motional shape fitting energy to be minimized in SEM can be defined as follows: Since |T|^2 is a constant value for the given target mesh P, we can minimize the energy E'_mot = E_mot / |T|^2 instead of E_mot such that

APPENDIX EXPRESSION SELECTION USING ESim AND ERSim
ESim and ERSim in Equations (9) and (10) defined in the original manuscript can be represented as follows: where α is the angle between the motions F_1 and F_2 in (n_ver × 3)-dimensional coordinates. Then, the measurement for s_i in Equation (17) defined in the original manuscript can be expressed as follows: where α_i is the (n_ver × 3)-dimensional angle between the i-th delta-blendshape B_i and T_t. Thus, finding a candidate that has the highest ESim in Equation (18) defined in the original manuscript is a procedure that selects an expression with the highest directional correspondence to the target motion as: where α_t is the multi-dimensional angle of the selected expression in the t-th iteration. The motional energy in Equation (15) defined in the original manuscript can be formulated using the selected delta-blendshape and its coefficient in Equations (17) and (18) in the manuscript, B_t and e_t, as: where e_t = ERSim(B_t, T_t) = |T_t| cos α_t / |B_t| in Equation (18) defined in the original manuscript. As α_t in Equation (27) minimizes E in Equation (28) for cos α_t ≥ 0, it is verified that the expression selection using ESim in each iteration can find an expressional motion closest to T_t among the candidates, and (1 − cos^2 α_t) is the remaining error between the motions.
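The relation between the selection and the remaining error can be checked numerically with the sketch below. It assumes ESim(F_1, F_2) = cos α and ERSim(F_1, F_2) = |F_2| cos α / |F_1|, which is consistent with the expression for e_t given above; the function names are hypothetical and only a single selection step is shown.

```python
import numpy as np

def esim(F1, F2):
    """Cosine of the angle between two motions, treated as flat vectors."""
    return float(np.vdot(F1, F2) / (np.linalg.norm(F1) * np.linalg.norm(F2)))

def ersim(F1, F2):
    """Magnitude of F2 along F1 relative to |F1|, i.e. |F2| cos(a) / |F1|."""
    return float(np.vdot(F1, F2) / np.vdot(F1, F1))

def select_expression(deltas, T):
    """Pick the candidate with the highest ESim to the remaining motion T.

    deltas : (n_cand, n_ver, 3) candidate delta-blendshapes
    T      : (n_ver, 3) remaining target motion
    Returns (index, coefficient, cos(alpha_t), squared residual).
    """
    sims = [esim(B, T) for B in deltas]
    t = int(np.argmax(sims))
    e_t = ersim(deltas[t], T)
    residual = float(np.sum((T - e_t * deltas[t]) ** 2))
    return t, e_t, sims[t], residual
```

For any selected candidate, the returned residual equals |T|^2 (1 − cos^2 α_t), so dividing by |T|^2 gives the remaining error (1 − cos^2 α_t) stated above.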

APPENDIX QUANTITATIVE METRIC FOR FACIAL ALIGNMENT
The alignment error for the predicted facial shapes can be defined by the distance to the target points. Let L T = [l 1 , . . . , l n fea ] be the target landmarks and S T denote the target point clouds. Similar to the formulations in Equations (3) and (4) defined in the original manuscript, the average distances to the facial mesh F from L T and S T , respectively, can be calculated as follows: where v j is the vertex index of F corresponding to j th landmark l j , p k is the closest point to f k ∈ F among S T , and n lnd /n pnt are the numbers of the target landmarks and the point clouds for the evaluation, respectively. As the magnitudes of d lnd and d pnt are scale-variant to the landmark L T and the point set S T of the target, an error metric used for the evaluation is defined by normalizing the distance error by the inter-ocular distance. The normalized error (NE) is measured as follows: where I i (1 ≤ i ≤ 4) are the indices of the inner points of eyes for calculating the inter-ocular distance. Two more metrics were utilized for measuring the motional similarity and the redundancies between the obtained expression blendshapes. ERSim in the latter part of Equation (13) defined in the original manuscript was utilized to evaluate the facial shape obtained for the landmarks and the point clouds as ERSim (L T − (B 1 ) , (F) − (B 1 )) and ERSim (S T − B 1 , F − B 1 ), respectively. Let e F = [e F,1 , . . . , e F,n exp ] be the blendshape coefficients of F. The correlation between the blendshapes (Corr) can be measured as follows: Corr(e F ) = (nexp−1) The sum of the absolute correlation in Equation (33) is measured for quantifying the total redundancy among the blendshapes used to represent the facial shape.
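As a rough illustration of the normalized error for landmarks, the sketch below divides the mean landmark distance by an inter-ocular distance. How the four eye indices are combined is an assumption here: each eye centre is taken as the mean of two of the four points, with the first two indices belonging to one eye and the last two to the other. The function name is hypothetical.

```python
import numpy as np

def normalized_error(pred_lm, target_lm, eye_idx):
    """Mean landmark distance divided by the inter-ocular distance.

    pred_lm, target_lm : (n_lnd, 3) predicted and target landmark positions
    eye_idx            : four landmark indices, two per eye (assumed order)
    """
    d_lnd = np.linalg.norm(pred_lm - target_lm, axis=1).mean()
    eye_a = target_lm[list(eye_idx[:2])].mean(axis=0)
    eye_b = target_lm[list(eye_idx[2:])].mean(axis=0)
    return float(d_lnd / np.linalg.norm(eye_a - eye_b))
```

Because both the numerator and the denominator scale with the size of the target face, the resulting value is comparable across subjects and capture setups.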