Low-Rank Active Learning for Generating Speech-Driven Human Face Animation

Emotion- and speech-based human facial animation is a useful component of many artificial intelligence systems. Given a speech signal, a recognizer outputs a sequence of phoneme and emotion pairs. From it, we calculate the corresponding sequence of viseme and expression pairs, which is subsequently transformed into a consistent and synchronous video describing the facial animation. This article introduces a novel facial animation technique that intelligently generates real human face animation videos from emotional speech. More specifically, we first extract acoustic features that are sufficiently discriminative for the emotion and phoneme pairs, and compute the corresponding sequence of phoneme and emotion pairs. Next, we propose a low-rank active learning paradigm for discovering multiple key facial frames that best represent the above phoneme and emotion pairs in the feature subspace. We associate each phoneme and emotion pair with a key facial frame, based on which the well-known morphing technique fits the associated key facial frames into a smooth animated facial video. In particular, we focus on generating multiple transitional facial frames between pairwise temporally adjacent key frames. Experiments demonstrate that the synthesized facial videos look real, smooth, and synchronous with different male/female speeches.


I. INTRODUCTION
Synthesizing facial animation video from human speech [1] is an important technique that is pervasively applied in modern AI systems. As an example, this technique helps fully or partially hearing-impaired people recognize speech in noisy environments. Besides, it is significant for synthesizing human lip movements, which is widely used in virtual reality. Further, as a computer-assisted multi-person communication tool, speech-guided human facial animation (e.g., Apple Memoji) is becoming a useful interface for online chatting in remote collaborative circumstances.
The associate editor coordinating the review of this manuscript and approving it for publication was Christian Pilato.

In the literature, a rich variety of facial animation frameworks have been proposed. We can broadly categorize facial modeling and animation techniques into two classes: geometric manipulation-guided techniques and image manipulation-guided techniques. Geometric manipulations include key-framing and geometric interpolations [2], [3], parameterizations [4], finite element methods [5], modeling using facial muscles [6], pseudo-muscle-based facial animation [7], spline-guided approaches [8], and free-form deformations [9]. Comparatively, image manipulations denote techniques like image morphing between pairwise photographic images [10], texture manipulations [11], image blending [12], and vascular expressions [13], [14]. In practice, these animation techniques are driven by tracked/localized visual features or by performance [15]. Despite the variety of methods above, deploying them in an emotional speech animation system that satisfies real-world requirements remains challenging:
• Many approaches need complicated human intervention in the model training stage. For example, the system designers have to determine the key frames corresponding to each phoneme/emotion tag and how many key frames need to be employed. Such human intervention makes the accuracy of the synthesized facial video overly dependent on the domain knowledge of the system designers. In practice, we expect a fully automatic training stage of the facial animation system, wherein no strong domain knowledge is needed.
• Owing to the popularity of portable devices like Apple Watch and Google Pixel, more and more communication apps are developed on mobile platforms, e.g., Skype and FaceTime. This stimulates the demand for facial animation systems on mobile devices. However, due to the limited computational capability, it is difficult to develop a real-time mobile facial animation app. Besides, no optimization has been proposed to transfer an off-the-shelf desktop animation system onto mobile platforms.
• Most of the previous facial animations are based on 2D/3D cartoon figures. Toward a more natural human-computer interface, animation based on real human faces is preferred. This is a challenging task because illumination, expression, and head position are difficult to control when synthesizing a real human face. These factors may lead to an unnatural face, as observed in practical speech animation systems.
To tackle these difficulties, we design an emotional speech-driven facial animation system that is trained in a fully automatic way. Moreover, the animation system is based on a real human face and can execute in real time on mobile devices. The flowchart of our animation system is shown in Fig. 1; it can be divided into three main components.
Part 1: For each recorded human speech with emotion, the six well-known acoustic features [16] (e.g., MFCC and LSF) are extracted in the first place. Thereafter, the phoneme and emotion pairs of an emotional speech can be rapidly and accurately calculated by a multi-label classifier.
Part 2: To match each phoneme and emotion pair with a representative face, a low-rank active learning technique is leveraged to discover multiple key facial images from the videos recorded during training. In our implementation, these videos are recorded by a volunteer from our Department, who is a native Mandarin/English speaker. Herein, we divide each video into multiple sentences, each spoken with a specific emotion (i.e., "happy", "surprise", "sad", "angry", and "neutral"). Our proposed low-rank active learning algorithm is effective since it exploits the underlying distribution of the facial frames in a video.
Part 3: After matching a key facial frame to each phoneme and emotion pair, toward a smooth synthesized video, the morphing [35] technique is adopted to produce a set of intermediate frames between temporally adjacent key facial frames. To make the synthesized facial expressions look natural, illumination compensation is applied to each facial frame.
In summary, our work has the following advantages: 1) an intelligent platform for real human facial animation that is trained with little human intervention; 2) an active learning paradigm for calculating key facial frames from multiple recorded training videos; and 3) a general system that can be trained from an arbitrary human face.

II. RELATED WORK
The proposed system is closely related to two topics in modern artificial intelligence systems: 1) recognizing emotion and phoneme from human speech, and 2) speech-driven facial animation.

A. EMOTION AND PHONEME RECOGNITION BY SPEECH
Identifying emotion and phoneme pairs from human speech [16], [17] aims to understand the affective attributes of each utterance by analyzing acoustic features engineered from the speech. In practice, this task can be formulated as a speech clip categorization problem. To categorize different speeches into emotion and phoneme pairs accurately and quickly, researchers have proposed a number of acoustic features. In the literature, machine learning researchers proposed models such as Latent Dirichlet Allocation (LDA) and Long Short-Term Memory (LSTM) to exploit the underlying distribution of these acoustic features. Afterward, they deployed a softmax layer or maximum posterior probability estimation to recognize different emotion and phoneme pairs [18], [19]. Another line of research focused on deriving so-called background models from the acoustic representations, based on which supervectors are calculated for categorization [20], [21]. Such a categorization pipeline has been pervasively utilized in domains like speaker localization. Some researchers designed statistical algorithms to learn the distribution of the acoustic representations, where the globally calculated statistical distributions are leveraged to classify each emotion and phoneme pair. In practice, the support vector machine is the most popular tool for classifying such global acoustic representations [22], [23]. Meanwhile, other classifiers, e.g., random forest [24] and softmax [25], are also pervasively applied in speech-based emotion and phoneme understanding. Notably, however, the above methods largely rely on possibly high-dimensional, manually designed acoustic features that are selected using prior knowledge.

B. FACIAL ANIMATION VIDEO DRIVEN BY SPEECH
In the literature, the synthesis of an aesthetically pleasing facial video from an input human speech has been investigated comprehensively. An extensive review of previously published speech-guided facial video synthesis methods is provided in [26]. The authors of [27] attempted to transform two-dimensional facial frames into a natural facial video by rebuilding 3D facial frames using a morphing technique. Thereafter, they calculated a so-called expression+viseme feature space using the synthesized 3D faces. The authors of [28] proposed a speech-driven lip framework that simultaneously constructs human speech co-articulation and expression-guided eigenspaces. A rich set of other methods [29] were designed to produce expression-guided speech videos. In [30], the authors proposed to intelligently predict lip movement trajectories from human speech. The designed system accurately calculates human lip movements from the original human speech. Simultaneously, it can optimally produce the video animation trajectory by leveraging the well-known hidden Markov model (HMM). The authors established a real-time framework for automatically generating speech-guided facial gestures in virtual contexts. More specifically, the method can produce gestures such as multiple nods/head movements and eye blinks. The system is realized by incorporating an HMM, multiple pre-defined criteria, and some statistical distributions. In conclusion, the facial animation pipelines discussed above are not specifically designed for mobile platforms. Besides, to the best of our knowledge, only a few animation pipelines can synthesize real-world human faces. Even worse, they cannot rapidly compensate for sub-optimal illumination.

III. OUR APPROACH

A. ACOUSTIC FEATURES EXTRACTION
In our implementation, for a male/female speech set, the full feature vector is assembled from multiple well-known acoustic features, that is, pitch, log energy, 3 formant frequencies, 11 MFCCs, 16 PLCCs, and 9 LSFs. We choose these acoustic features by cross validation. The resulting 41-dimensional features are used to train a multi-label classifier that classifies each speech sentence into the corresponding phoneme and emotion pairs. These pairs are subsequently used to synthesize the speech-driven facial animation video.
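The dimension count above can be checked with a minimal sketch. It only illustrates how the six feature groups add up to 41 dimensions; the dictionary keys, the function name `assemble_feature_vector`, and the concatenation order are assumptions made for illustration, and the actual feature extraction is not shown.

```python
# Feature groups and their dimensionalities as listed in the text.
# The key names and their order are illustrative assumptions.
FEATURE_DIMS = {
    "pitch": 1,
    "log_energy": 1,
    "formants": 3,
    "mfcc": 11,
    "plcc": 16,
    "lsf": 9,
}


def assemble_feature_vector(groups):
    """Concatenate per-group features in a fixed order, checking sizes."""
    vector = []
    for name, dim in FEATURE_DIMS.items():
        values = list(groups[name])
        if len(values) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(values)}")
        vector.extend(values)
    return vector


# Placeholder input: zeros stand in for real extracted features.
dummy_groups = {name: [0.0] * dim for name, dim in FEATURE_DIMS.items()}
feature_vector = assemble_feature_vector(dummy_groups)
```

Fixing the order of the groups keeps every utterance in the same 41-dimensional layout, which is what a downstream classifier would require.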

B. ACTIVE LEARNING FOR KEY FACES SELECTION
In order to build an optimal facial animation framework, we record facial videos of a male/female speaking English or Chinese during the system training stage. Each video contains a large number of facial frames, and it is non-trivial to detect the facial frames that best match the phoneme and emotion pairs. Previous AI systems typically employ pre-specified key facial frames, which might be sub-optimal. Herein, we select the key faces by leveraging a novel active learning paradigm that is conducted in a completely automatic way. In our implementation, the speech videos are captured from a Mandarin speaker in a well-established recording studio. In total, we obtain 105 recorded speech videos, each lasting about 420 seconds.
Theoretically, we treat active learning as a sample selection paradigm, wherein multiple criteria have been developed to select highly representative samples. For our system, we discover multiple key facial frames from the aforementioned recorded speech videos. Herein, the objective is that the discovered key facial frames best represent the frames of the recorded speech videos. Denote $A = [\vec{\alpha}_1, \ldots, \vec{\alpha}_N] \in \mathbb{R}^{58 \times N}$ as a collection of facial video frames distributed on the underlying subspace, where N is the number of training video frames. The objective is to conduct subspace learning and active frame selection jointly. We denote $B \in \mathbb{R}^{58 \times K}$ as the selected K representative frames.
In theory, we still adopt the strategy of minimizing the overall reconstruction loss in the original space to select the most representative samples. To this end, we minimize an objective function (1) composed of a reconstruction term and a regularizer, where λ ≥ 0 measures the significance of our designed regularizer. The left term attempts to maximally rebuild the input facial frames, wherein R is a matrix containing the rebuilding parameters. Meanwhile, the right term represents a predefined regularizer with a particular matrix norm. Herein, the objective is to acquire the top K key facial frames, and thus the rebuilding terms of the top K key frames should be heavily weighted. In contrast, the remaining N − K unselected facial frames should be lightly weighted. As a particular case, when all elements of one row in R become zero, the corresponding facial frame is not recognized as a key facial frame, because it is considered to make no contribution to rebuilding the remaining facial frames. In this way, R is a row-sparse matrix, as each row measures the importance of one facial frame in rebuilding the remaining ones. Toward a row-wise sparse matrix R, it is straightforward to instantiate the norm $\|R\|_l$ as $\|R\|_{2,1}$ or $\|R\|_\infty$. In our implementation, $\|R\|_{2,1}$ is deployed. In practice, we notice that $\|R\|_\infty$ is also an appropriate choice. In theory, R plays two key roles in the above objective function: i) it is a matrix containing the rebuilding parameters, where each column functions as the linear combination of the key facial frames used to rebuild a frame; and ii) it is a matrix for representing itself. That is, each column $r_i \in \mathbb{R}^N$ is considered as a feature for representing $\vec{\alpha}_i$. Herein, we treat A as an unknown dictionary.
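As an illustrative sketch only, one form of the first objective that is consistent with the description above is written out below. It assumes a squared Frobenius reconstruction loss and a square coefficient matrix R with N rows and N columns, neither of which is stated explicitly in the text; the $\ell_{2,1}$ norm is given by its standard definition as the sum of the row-wise $\ell_2$ norms.

```latex
\min_{R \in \mathbb{R}^{N \times N}} \; \|A - AR\|_F^2 + \lambda \|R\|_{2,1},
\qquad
\|R\|_{2,1} = \sum_{i=1}^{N} \Big( \sum_{j=1}^{N} R_{ij}^2 \Big)^{1/2}
```

Under this reading, column i of R holds the coefficients that rebuild frame i from all frames, and penalizing the row norms drives entire rows to zero, which is the row-wise sparsity discussed above.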
As mentioned, the facial frames are practically distributed on an underlying subspace hidden in a high-dimensional feature space. Accordingly, R is constrained to be a low-rank matrix, and the above objective function is updated to (2) by adding the term η rank(R), where η ≥ 0 denotes the weight of this term and rank(·) calculates the matrix rank. We minimize rank(R) to achieve a low-rank matrix R, and thereby recover the low-rank geometry from the input matrix. In practice, minimizing this objective function is NP-hard. Instead, we relax rank(R) to the well-known nuclear norm of R [33], which makes the problem convex and yields objective (3). Herein, $\|R\|_*$ denotes the nuclear norm that replaces the aforementioned rank function. Details of the solution are presented in [34]. By leveraging the calculated R, we acquire K representative frames to represent each facial video.
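The final selection step can be sketched as follows. The text does not spell out how the K frames are read off from R; this minimal example assumes the common choice of ranking the rows of R by their $\ell_2$ norms and keeping the K largest, in line with the row-sparsity argument above. The function name `select_key_frames` and the toy matrix are hypothetical.

```python
import math


def select_key_frames(R, k):
    """Return the indices of the k rows of R with the largest l2 norms.

    Assumption: a frame's importance is measured by the l2 norm of its
    row in the coefficient matrix R (rows near zero are never selected).
    """
    row_norms = [math.sqrt(sum(v * v for v in row)) for row in R]
    ranked = sorted(range(len(R)), key=lambda i: row_norms[i], reverse=True)
    return sorted(ranked[:k])


# Toy 4x4 coefficient matrix: rows 1 and 3 carry most of the weight.
R_toy = [
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.8, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.2, 1.0],
]
key_indices = select_key_frames(R_toy, 2)
```

Returning the indices in temporal order keeps the selected key frames aligned with their positions in the video.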

C. ANIMATION VIDEO GENERATION BY MORPHING
For each second of speech, we generate three phoneme and emotion pairs, which have three corresponding key frames. In practice, three frames per second cannot ensure a smooth and natural synthesized facial video, which requires, e.g., 24 frames per second. Herein, the well-known morphing [35] algorithm is leveraged to calculate the intermediate faces between pairwise temporally adjacent key frames.
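The frame-rate arithmetic above can be made explicit with a small sketch. It assumes that key frames are evenly spaced in time and that each key frame occupies one slot of the output video; the function name is hypothetical.

```python
def intermediate_frames_per_gap(key_frames_per_second, target_fps):
    """Number of morphed frames to insert between two adjacent key frames.

    Assumes evenly spaced key frames, each occupying one output slot.
    """
    if key_frames_per_second <= 0 or target_fps % key_frames_per_second != 0:
        raise ValueError("target_fps must be a positive multiple of the key-frame rate")
    return target_fps // key_frames_per_second - 1


# 24 / 3 = 8 output frames per key frame, so 7 of them must be morphed.
gap = intermediate_frames_per_gap(3, 24)
```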
Given two key facial frames as shown in Fig. 3, morphing combines them by cross-dissolving their corresponding image pixels (e.g., pixels from the lips in the two key facial frames in Fig. 3). Before this, we have to locate the corresponding pixels between pairwise key facial frames. Given a pair of corresponding lines PQ and P′Q′ from the destination and the source frames respectively, a mapping can be derived from the coordinate of the destination frame pixel X to that of the source frame pixel X′. Herein, Pen(·) returns the vector that is perpendicular to, and of the same length as, the input vector; $\vec{u}$ is measured along the direction of the line PQ (or P′Q′); and $\vec{v}$ is the distance between the pixel X and the line PQ (or the distance from X′ to P′Q′).
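A minimal sketch of such a line-pair mapping is given below. It follows the standard Beier-Neely feature-based formulation, which is assumed here to correspond to the morphing algorithm of [35]; only a single line pair is handled, and the function names `perpendicular` and `map_pixel` are hypothetical.

```python
import math


def perpendicular(vec):
    """Rotate a 2D vector by 90 degrees, keeping its length."""
    x, y = vec
    return (-y, x)


def map_pixel(X, P, Q, P_src, Q_src):
    """Map destination pixel X to source coordinates for one line pair.

    u is the normalized position of X along PQ and v is its signed
    distance from PQ (standard Beier-Neely definitions, assumed here).
    """
    pq = (Q[0] - P[0], Q[1] - P[1])
    px = (X[0] - P[0], X[1] - P[1])
    pq_len_sq = pq[0] ** 2 + pq[1] ** 2
    pq_src = (Q_src[0] - P_src[0], Q_src[1] - P_src[1])
    src_len = math.hypot(pq_src[0], pq_src[1])
    if pq_len_sq == 0 or src_len == 0:
        raise ValueError("control lines must have non-zero length")
    u = (px[0] * pq[0] + px[1] * pq[1]) / pq_len_sq
    perp = perpendicular(pq)
    v = (px[0] * perp[0] + px[1] * perp[1]) / math.sqrt(pq_len_sq)
    perp_src = perpendicular(pq_src)
    return (
        P_src[0] + u * pq_src[0] + v * perp_src[0] / src_len,
        P_src[1] + u * pq_src[1] + v * perp_src[1] / src_len,
    )


# A source line shifted by (1, 1) should shift the mapped pixel by (1, 1).
mapped = map_pixel((2, 3), (0, 0), (4, 0), (1, 1), (5, 1))
```

With several line pairs, the Beier-Neely method blends the per-line displacements with distance-based weights; that step is omitted here for brevity.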
Denoting O and O′ as the origins of the destination and the source frames respectively, we can write X = O + dX. By putting (4) and (5) into (6), we obtain the mapping for X in terms of the increment dX. Based on this derivation, given a destination frame, we start from its origin O and map each of its pixel coordinates to that of the source frame. Two directions of increments are used: dX_1 = (1, 0) and dX_2 = (0, 1). After matching the pixels in the destination key facial frame to those in the source one, we use cross-dissolve to obtain each intermediate facial frame. Denoting g_1(x_1, y_1) and g_2(x_2, y_2) as the RGB values of the corresponding pixels (x_1, y_1) and (x_2, y_2) in the source and the destination frames respectively, the RGB value of a pixel in the intermediate frame is a combination of g_1 and g_2 weighted by the interpolation coefficient k ∈ [0, 1]. We set k = 0.4 in our implementation.
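The cross-dissolve step can be illustrated with a short sketch. It assumes the usual linear blend in which k weights the destination pixel and 1 − k weights the source pixel; the text does not state which of the two frames k is attached to, so this assignment is an assumption, as is the function name.

```python
def cross_dissolve(g1, g2, k):
    """Linearly blend two RGB triples.

    Assumption: k weights the destination pixel g2 and (1 - k) weights
    the source pixel g1.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return tuple((1.0 - k) * a + k * b for a, b in zip(g1, g2))


# Blend a dark and a bright pixel with the coefficient used in the text.
blended = cross_dissolve((100, 100, 100), (200, 200, 200), 0.4)
```

Setting k = 0 returns the source pixel and k = 1 returns the destination pixel, so intermediate values move smoothly between the two key frames.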
In our animation system, we notice that the synthesized facial skin might be visually inconsistent. The underlying reason is the illumination discrepancy between the original and target human faces. To tackle this shortcoming, we adopt a lighting compensation scheme during the pixel cross-dissolve stage, in which the compensated value g(x, y) is computed using η = σ(g_2)/σ(g_1), where σ(·) is the variance of the RGB color in a frame and ḡ is the average RGB color of a frame. As shown in Fig. 4, the illumination compensation scheme makes the facial skin in the animation video more consistent.
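One possible reading of this compensation is sketched below. It assumes the form g = η (g_1 − ḡ_1) + ḡ_2, that is, the source intensities are centered, rescaled by η, and shifted to the mean of the target frame; this form is an assumption, since only η, σ(·), and ḡ are defined in the text. A single color channel is used for brevity, and the function name is hypothetical.

```python
from statistics import mean, pvariance


def compensate_illumination(src, dst):
    """Shift and scale src intensities toward the statistics of dst.

    Assumed form: g = eta * (g1 - mean(g1)) + mean(g2), with
    eta = var(g2) / var(g1) as defined in the text.
    """
    src_mean, dst_mean = mean(src), mean(dst)
    src_var = pvariance(src)
    if src_var == 0:
        # A flat source frame carries no contrast to rescale.
        return [dst_mean for _ in src]
    eta = pvariance(dst) / src_var
    return [eta * (value - src_mean) + dst_mean for value in src]


# A darker source channel is moved to the brightness of the target.
compensated = compensate_illumination([10, 20, 30], [110, 120, 130])
```

Whatever the value of η, the compensated channel has the same mean as the target frame, which is what removes the visible brightness jump between faces.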

IV. EXPERIMENTS
In this section, we test our animation system using three empirical validations. The first set of experiments evaluates, step by step, the important modules of our animation system. The second set evaluates the performance of our system under different parameter settings. The third set visualizes the synthesized animation video and some intermediate results. Our facial animation system for testing is briefly described as follows. During the training stage, we collected 5,600 English speech sentences. These sentences were recorded by five males and three females, who are from our Computer Sciences Department. Each sentence lasts approximately 42 ∼ 550 seconds. To accurately describe each speech sentence, 41 well-known acoustic features are calculated. To label each speech sentence, we employ five different emotion labels ("anger", "happiness", "neutrality", "sadness", and "surprise") and the 44 pre-defined phonemes (as shown in Fig. 5). To refine the speech sentences, a pre-emphasis stage is deployed, including blocking and Hamming windowing.

A. IMPORTANT MODULES EVALUATION

1) LOW-RANK ACTIVE KEY FACIAL FRAMES DISCOVERY
Here, our key facial frame selection algorithm is compared with multiple well-known frame selection algorithms, that is, online clustering key frames extraction (OCFE) [36] using the same ASM [37]-based facial feature as ours, dictionary selection based key frame selection (DSVS) [38], and motion-based key frame extraction (MKFE) [39]. OCFE first leverages a clustering algorithm to group the frames around different centers. Thereby, the remaining frames are progressively integrated into the clusters. DSVS formulates video frame selection as a dictionary selection problem by seeking sparsity. A key-frame-based dictionary is calculated such that it can best rebuild the training facial videos. For MKFE, camera and object motion features are predicted for extracting the descriptors. Each video is subsequently decomposed into multiple clips based on different motion types. Accordingly, multiple rules are leveraged for calculating the key frames.
The key frames of the training facial videos are calculated in the first place. Fig. 6 reports the accuracy of the key frames generated by the different algorithms. The accuracies indicate how well the key frames can rebuild the entire set of facial frames used during training. A high reconstruction accuracy means that the key frames produced by a method optimally capture the training facial videos. Meanwhile, for each competing method, we notice that some key frames capture faces with highly similar viseme and expression pairs. This contradicts the principle that key facial frames should be evenly distributed and effectively capture the facial videos.

2) MORPHING-BASED FACIAL ANIMATION VIDEO
In this subsection, our animation system is compared with the facial animation systems proposed by Deng et al. [28], Kshirsagar et al. [29], Hofer et al. [30], and Zoric et al. [40] respectively. Notably, neither accuracy nor ranking is an optimal choice for this task. The reason is that these measures are highly complicated for each observer to provide. In our implementation, we leverage the well-known paired comparison as the user study to test the effectiveness of the proposed facial animation system. In a paired comparison, we present each subject with a pair of videos produced by two different facial animation systems from the same input speech sentence. We record the testing results in a so-called preference matrix. As displayed in Fig. 6, the element in the row "Hofer" and the column "Zoric" is 12. This indicates that 12 subjects prefer the video produced by Hofer et al. to that produced by Zoric et al.
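The bookkeeping behind such a preference matrix can be sketched as follows. The system names "A", "B", and "C" and the votes are toy placeholders, not data from the user study, and the function name is hypothetical. The entry in row i and column j counts the subjects who preferred system i over system j.

```python
def build_preference_matrix(systems, votes):
    """Count pairwise preferences.

    votes is a list of (preferred, other) pairs; matrix[i][j] is the
    number of subjects who preferred systems[i] over systems[j].
    """
    index = {name: i for i, name in enumerate(systems)}
    matrix = [[0] * len(systems) for _ in systems]
    for preferred, other in votes:
        matrix[index[preferred]][index[other]] += 1
    return matrix


# Toy votes only (not results from the study).
toy_systems = ["A", "B", "C"]
toy_votes = [("A", "B"), ("A", "B"), ("B", "A"), ("C", "A"), ("B", "C")]
preference = build_preference_matrix(toy_systems, toy_votes)
```

Summing each row then gives a simple overall score per system, since a larger row sum means the system was preferred more often.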

B. SYSTEM PERFORMANCE UNDER DIFFERENT PARAMETERS
This subsection evaluates the performance of our system under different parameter settings, that is, the parameter η in the active key facial frame selection. We evaluate the reconstruction error under different values of η in (2). We set the number of selected key facial frames K to 10, 20, 30, and 40 respectively. Then, we tune η from 0.01 to 0.1 with a step of 0.01. As shown in Fig. 7, the reconstruction error is minimal when η = 0.05. This is because η reflects the importance of preserving the distribution of the facial frames in the training videos. Overemphasizing this property increases the reconstruction error.

C. VISUALIZATION OF THE FACIAL ANIMATION RESULTS
In this subsection, we visualize the intermediate results of our facial animation system. First, we show the facial features extracted by the ASM [37] model in Fig. 8. We deliberately use left-oriented faces, and each face is not in the middle of the video frame. As can be seen, the ASM model can accurately locate the faces. Then, we present the intermediate faces generated by the morphing technique in our proposed system. As shown in Fig. 9, the leftmost and the rightmost facial frames are the original frames, while the remaining frames are generated by morphing. These generated faces look natural and close to real human faces.

V. CONCLUSION
In this work, we design a novel AI system to synthesize aesthetically pleasing facial videos by leveraging human speech sentences. More specifically, high-quality acoustic features for recognizing phoneme and emotion pairs are identified using a multi-label SVM classifier. Afterward, we leverage a novel low-rank active learning algorithm to recognize the key faces from the large-scale training facial videos. By associating each emotion and phoneme pair with a key face, the well-known morphing algorithm fits the key frames into a smooth and natural synthesized facial video. Empirical results have shown that our method is efficient and effective, and that it is learned in a completely automatic way.

FIGURE 1. Pipeline of the proposed speech-driven real human facial animation system.

FIGURE 2. Left: the active shape model (ASM) of a human face; Right: projecting ASM facial features from all the facial frames (red dots) onto a manifold.

FIGURE 3. Coordinates mapping from the source image to the destination one.

FIGURE 4. An example of illumination compensation for the intermediate faces.

FIGURE 5. The viseme-phoneme for Chinese pronunciation (the red text denotes the recognition accuracy of the phonemes).

FIGURE 6. Reconstruction accuracy obtained by the different techniques mentioned above (PM denotes the proposed method).

FIGURE 7. Key facial frames reconstruction error by leveraging different values of η.

FIGURE 8. Human faces detected by the ASM model in the training facial videos.

FIGURE 9. The intermediate faces (blue rectangles) generated by the morphing technique.

FIGURE 10. An example of a male-face-based video animation framework.
Yuan et al. presented crucial insights into active learning applied in a visual context, particularly in tracking applications. Based on active learning, the proposed CNN-guided visual tracker can be conveniently trained by leveraging a highly diverse set of training video frames. In [32], Ren et al. systematically summarized the existing deep active learning algorithms, together with a comprehensive overview. They also presented the development of deep active learning in different vision applications.