Learning Flow-Based Disentanglement

Face reenactment aims to generate talking face images of a target person given a face image of a source person. Learning a latent disentanglement is crucial to tackle such a challenging task through domain mapping between source and target images, so that the attributes or talking features attached to domains or conditions become adjustable when generating target images from source images. This article presents an information-theoretic attribute factorization (AF) in which the mixed features are disentangled for flow-based face reenactment. The latent variables of a flow model are factorized into attribute-relevant and attribute-irrelevant components without the need for paired face images. In particular, domain knowledge is learned to provide the condition that identifies the talking attributes in real face images. The AF is guided by multiple losses for source structure, target structure, random-pair reconstruction, and sequential classification. The random-pair reconstruction loss is calculated by exchanging the attribute-relevant components within a sequence of face images. In addition, a new mutual information flow is constructed for disentanglement toward domain mapping, condition irrelevance, and condition relevance. The disentangled features are learned and controlled to generate image sequences with meaningful interpretation. Experiments on mouth reenactment illustrate the merit of the individual and hybrid models for conditional generation and mapping based on the informative AF.


I. INTRODUCTION
DOMAIN mapping aims to characterize the complicated relation between source and target domains, where the conditional generation of target data can be learned with specific embeddings of features, styles, or attributes from source data. It is essential to learn such a generative model that acts as the observation probability for the generation of new samples. Basically, deep generative models, which combine generative models with deep neural networks, have been recognized as a building block in the implementation of various multimedia information systems. Deep generative models in domain mapping have been developed for different pairs of mappings in the presence of various data types. The mapping pairs can also be under the same data type. For example, image-to-image translation is recognized as a popular domain mapping task for style transfer, where the style of source images is transferred and incorporated in target images [1]-[3]. In general, a key to the success of a generative model for domain mapping lies in the preservation of the overall data structure based on disentanglement in the latent representation. Latent disentanglement typically aims to strengthen the learned representation by disentangling the basic structure of observations into disjoint components or salient features in a latent variable model. Currently, there is no clear definition of or solution to latent disentanglement because the ground truth of disentangled features for the structured and mixed observations is missing. Nevertheless, the disentanglement needs to preserve the properties of independence as well as interpretation in the latent representation [4]. Independence means identifying the statistically independent factors that do not interfere with each other, while interpretation means capturing the semantic meanings of the separated components. The more the generative model understands the observations, the more precisely rich samples can be generated.

The authors are with the Department of Electrical and Computer Engineering, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan (e-mail: jtchien@nycu.edu.tw). Digital Object Identifier 10.1109/TNNLS.2022.3190068

A. Related Work
Generative models are developed as probabilistic distributions to reproduce or even create new data as a human does. Traditionally, latent disentanglement was developed for generative models based on the variational autoencoder (VAE) [5], [6], which depended on the expressiveness of the prior distribution, and the generative adversarial network (GAN) [7], [8], which required stability in the minimax optimization for adversarial training. VAE also suffered from posterior collapse [9] in variational inference, which resulted in blurred data. As a result, the goodness of disentanglement affected the controllable generation for domain mapping. In [10] and [11], the richness of generated images was attained by controllable generation where the styles and attributes were identified via transfer learning. Generation of face images is viewed as a popular task for image-to-image translation. The attributes of gender, race, hairstyle, and facial emotion were disentangled to pursue variety in the generated faces. In [12], the latent semantics of facial expressions were adjusted to manipulate the attribute mapping in local regions where the face landmarks were constructed for image synthesis [13]. In [14] and [15], the talking information of a target video was combined with a source image to implement talking face generation, where a GAN was applied for conditional generation with an audio-visual disentangled representation.
A recent paradigm, called the flow-based model [16], [17], has achieved state-of-the-art performance for the generation of various types of data, including face images [18], medical images [19], natural sentences [20], [21], and speech waveforms [22]. The attractiveness of the flow model is the exact estimation of the target distribution for highly nonlinear domain mapping through multiple simple and invertible transformations. However, conditional generation using a flow-based model was hard to implement because there was no reconstruction imposed in the generation phase and only the inference phase adopted the model. In [18], a post-processing mechanism was proposed as an alternative to indirectly impose the labels for semantic adjustment. In [23], an effective architecture for generative flow using masked convolution was proposed. In [19], dual invertible networks were exploited and learned for flow-based modality transfer. In [24] and [25], guided images were generated by conditional flow-based models for image colorization and edge detection.

B. Main Idea of This Work
This article presents flow-based disentanglement for conditional generation in face reenactment. Such a flow model is suitable for precise reconstruction due to the invertibility of the mapping between the observed domain and the latent domain. There are two approaches. The first approach is based on the attribute factorization (AF) flow, where the disentanglement is preserved by exploring the structural features in the consecutive changes given by an image sequence. The attribute-relevant and attribute-irrelevant encoders are introduced to identify facial features in the flow-based model, representing a specific talking attribute and the overall latent structure of a talking mouth, respectively. These two encoders are mutually collaborative and estimated according to the objectives for disentangled domain mapping in mouth reenactment. The second approach is called the mutual information (MI) flow, which conducts information-theoretic learning of the flow transformation and consolidates the disentanglement by connecting the relations between image sequences and latent variables. MI is optimized to build the flow-based model so as to disentangle the informative features. The attributes in facial features are retrieved and treated as the conditions for face generation. MI flow is implemented by using an invertible 1 × 1 convolution (known as Glow) [18], where high-quality image synthesis is assured. A conditional prior distribution is additionally learned to express the implicit talking attribute from face data. This article further handles the dimensional waste in latent vectors and preserves the capability of data compression in the construction of the face generative model. The physical meaning of latent disentanglement becomes intuitive and interpretable. The proposed AF flow and MI flow can be merged to reinforce the performance. In the experiments on face reenactment, the synthesis of a talking mouth from different domains of images and the image reconstruction from the disentangled features are illustrated.

The remainder of this article is organized as follows. Section II addresses flow-based representation and disentanglement. Sections III and IV present the conditional generation based on the disentangled flow models using AF flow and MI flow, respectively. Section V reports a series of experiments to evaluate these methods. The conclusions drawn in this study are given in Section VI.

II. FLOW-BASED REPRESENTATION AND DISENTANGLEMENT

A. Flow-Based Representation
Flow-based representation [16] was proposed as a new type of likelihood-based generative model where the distribution due to an invertible transformation z = f_θ(x), with parameter θ, from an observed sample x to a latent variable z is calculated by the change of variables

p_θ(x) = p(z) |det(∂f_θ(x)/∂x)|,  z = f_θ(x),

with the generative direction given by the inverse mapping g_θ = f_θ^{-1}. This normalizing process is reversed to form the generative process using g_θ. Fig. 1 shows the generative (upper) process and the normalizing (lower) process in the flow-based model, where a series of invertible functions f_θ = f_k ∘ ··· ∘ f_1 is estimated to smoothly generate from z_0 to x = z_k and normalize from x = z_k to z_0, respectively, with the dimension of the different variables {z_i}_{i=0}^{k} fixed. The observed variable x = z_k and the latent variable z_0 are represented by a complex distribution and a simple distribution (i.e., a standard Gaussian), respectively. The flow-based generative model then estimates the flow parameter θ by minimizing the expected loss of an exact likelihood-based model

L_f = −E_x[ log p(f_θ(x)) + log |det(∂f_θ(x)/∂x)| ].   (1)

In [16], nonlinear independent component estimation (NICE) provided a transformation f_θ for which it was easy to compute the inverse f_θ^{-1}(z) and the Jacobian determinant det(df_θ(x)/dx). A volume-preserving flow was built by imposing the additive coupling layer, where a unit Jacobian determinant was obtained to assure volume preservation. However, such a flow has difficulty handling a high-dimensional continuous space. Therefore, the real-valued nonvolume-preserving (RealNVP) flow [26] was proposed by implementing the affine coupling layer with masked convolution and a multiscale architecture. In [18], Glow inherited the RealNVP multiscale structure and was driven by an invertible 1 × 1 convolution. Flow models have been successfully developed for computer vision [19], [27] and natural language processing [20], [22].
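As a minimal illustration of this change-of-variables likelihood, the following numpy sketch uses a toy elementwise affine transform as a stand-in for the learned flow f_θ (the scale and shift values are hypothetical) and computes the exact negative log-likelihood under a standard Gaussian prior.

```python
import numpy as np

def affine_flow(x, scale, shift):
    """Toy invertible transform z = f(x) = scale * x + shift (elementwise)."""
    return scale * x + shift

def affine_flow_inverse(z, scale, shift):
    """Exact inverse x = f^{-1}(z), available by construction."""
    return (z - shift) / scale

def flow_nll(x, scale, shift):
    """Exact NLL via change of variables:
    -log p(x) = -log N(f(x); 0, I) - sum(log|scale|)."""
    z = affine_flow(x, scale, shift)
    log_pz = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))
    log_det = np.sum(np.log(np.abs(scale)))  # log|det Jacobian| of the transform
    return -(log_pz + log_det)

x = np.array([0.5, -1.0])
scale = np.array([2.0, 0.5])
shift = np.array([0.1, -0.2])
z = affine_flow(x, scale, shift)
x_rec = affine_flow_inverse(z, scale, shift)  # recovers x exactly
```

In a real flow, the coupling layers make both the inverse and the log-determinant equally cheap to evaluate; the exactness of the likelihood is what this sketch illustrates.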

B. Conditional Generation and Mapping
This article presents flow-based domain mapping via the conditional likelihood p(y|x) with a latent representation z = f_θ(x), which generates data y in the target domain conditioned on the source samples x. Talking face reenactment conducts a kind of domain mapping that generates face images with lip motion via video frames [28]. This task aims to generate the talking face images of a target person given a reference face of a source person. The challenges of this task are caused by the richness of lip movements and the sparseness of paired samples. Traditional methods to handle this task were based on a 3-D face structural model [29], a dense photometric consistency measure [30], or a facial embedding representation [31]. More recently, the GAN was employed in domain separation and adaptation, where adversarial learning was adopted to improve the generation by disentangling various information sources [32]-[34]. In [14], a conditional recurrent neural network (RNN) was considered as the discriminator for a GAN where spatial-temporal information was merged. In [15], the audio and visual features were both used to disentangle the information related to subject and speech from domain knowledge for face reenactment. Such a model was too large and hard to converge due to the adversarial training. This study proposes a flow-based model for mouth reenactment where the conditional likelihood function is maximized to sample the unseen talking mouths in the latent space. This flow model continuously transforms the observed data into a latent representation via a number of invertible functions, where the inverse mapping can be conducted to recover the original observations. The quality of the generated data, conditioned on some controllable factor, is accordingly assured [35]. There have been a variety of medical imaging tasks where the flow model was deployed, including vessel segmentation [25] and image transfer from magnetic resonance imaging to positron emission tomography imaging [19]. In [24], guided invertible domain mapping was proposed for color transfer in conditional image generation. In [36], a Glow-based makeup transfer was developed to estimate a target face image based on the decomposed latent vectors for makeup and face. The disentangled latent representation is crucial.

C. Latent Disentanglement
A key to successful conditional generation and mapping is disentanglement in the latent representation, which is essential in the construction of an unsupervised learning machine. The disentangled representation basically relies on distinct, separate, modular, and compact factors, learned from observation data, which are independent with minimum information redundancy and interpretable for semantic meaning [37]. In [38], group theory was developed as a new perspective for building disentangled representations. In [39] and [40], a precise criterion with general properties was presented to implement the disentangled representation, which connected different symmetry groups in the latent space. A disentangling procedure was performed by decomposing each symmetry group into subgroups that preserved independence. The representation redundancy was minimized to assure model compactness by using the independent generative factors. In [41] and [42], a similar perspective was presented in the form of a multidimensional disentangled representation. In [43] and [44], the disentangled representation was learned in an unsupervised manner where the semantic factors were automatically discovered from observed data. The model enforced a factorized aggregated posterior, which promoted disentanglement. In [45], weakly supervised disentanglement was learned with supervision in the presence of inductive bias. Recent works have addressed flow-based latent disentanglement. In [46], nonlinear independent component analysis was exploited for disentanglement over a flow model where a Gaussian mixture model was calculated to build a latent space conditioned on classes. In [47], a dedicated neural structure was constructed to separate mixed images into condition-dependent and condition-independent components. A compact module was learned as a disentangled model driven by a reference condition.

D. Motivation of the Proposed Flows
This article presents flow-based disentanglement for face reenactment, where two learning perspectives are developed. The first perspective is to factorize the latent representation of face images x into variables for talking attributes z^r and general faces z^i. Conditional mouth generation is implemented through the AF flow in accordance with structural and geometric objectives. The second perspective is to carry out the MI flow based on an information-theoretic disentanglement of the latent variables z^r and z^i, where the condition-relevant and condition-irrelevant informative objectives are optimized for conditional face generation, respectively. Although the two flow models are developed separately, the disentanglement in both is consistently performed to find the same attribute- or condition-relevant and -irrelevant variables {z^r, z^i}. Therefore, a cascaded combination of the two flow models under the shared variables can be implemented for mouth reenactment, as detailed in the following.

III. AF FLOW
First, the AF flow is presented for conditional generation, where the observed image x is transformed to a latent variable z using a flow model z = f_θ(x) with parameter θ. This invertible transformation assures information preservation for data reconstruction. Fig. 2 shows the architecture of the AF flow, where the target image x_T is obtained from a source image x_S driven by a query image x_q. Face reenactment aims to generate x_T of a target video with the facial features of x_q whose mouth movement replicates the movement in x_S of a source video. AF is performed over the latent variables z_q and z_S. This study adopts Glow [18], [48] as the backbone, shown by green bars, to carry out flow-based domain mapping for face images. In the implementation, the squeeze and unsqueeze operations were performed at the beginning of the input flow f_θ and the end of the output flow f_θ^{-1}, respectively. The multiscale architecture [26] was configured to alleviate the computation cost. A number of objectives are introduced to disentangle z = {z_q, z_S} into an attribute-relevant vector z^r = {z^r_q, z^r_S} for the local talking movement of a mouth and an attribute-irrelevant vector z^i = {z^i_q, z^i_S} for the global facial structure of a general face. The variable z_T is used for face reenactment, where the decomposed variables of the query vector of facial structure z^i_q and the source vector of lip movement z^r_S are combined.

A. Factorization by Structure Preserving
In particular, the AF aims to capture the global structure of the face samples x = {x_q, x_S} from the latent vector z, which is common to faces under various talking attributes. Latent vectors z^r = E_{φ^r}(z) and z^i = E_{φ^i}(z) are extracted by using the attribute-relevant and attribute-irrelevant encoders with parameters φ^r and φ^i, respectively. This factorization yields z = z^r + z^i with a fixed dimension D in z, z^r, and z^i, given by D = w × h × c for the width w, height h, and channel size c of a face image. A video clip of a talking face is viewed as a number of face images consisting of the global structure of a real face and the local movement of a facial expression or a lip language. The facial features of a query person in the target domain and the attribute of a talking mouth of a source person are represented by z^i_q and z^r_S, respectively, and then merged as z_T for conditional generation. The attribute-irrelevant encoder E_{φ^i} is learned to capture the facial structure by inferring the irrelevant variable z^i toward the centroid of the latent vectors z of a source person in the source domain S. The structural loss is formed by the squared regression error

L_s = (1/N) Σ_{n=1}^{N} ||E_{φ^i}(z_n) − z̄||²,   (2)

where the centroid is calculated as the ensemble mean z̄ = (1/N) Σ_{n=1}^{N} z_n of the variables {z_n}_{n=1}^{N} corresponding to the talking frames of a source person. The AF flow minimizes this loss to maintain the facial structure of a person across a sequence of talking images. The encoder E_{φ^i} is thus estimated to preserve the global features of a general face while neglecting the attribute-relevant features of talking details.
It is essential to learn latent disentanglement across various face identities. The disentanglement is strengthened by minimizing the structural loss in the generated target images x_T that are transferred from source images x_S conditioned on a query image x_q, where source and query come from different identities. The target sample is generated via z_T by merging the attribute-relevant feature z^r_S of the source image and the attribute-irrelevant feature z^i_q of the query image. The structural loss in (2) is further measured by using the synthesized feature z^r_S + z^i_q in the target domain via

L_t = (1/N) Σ_{n=1}^{N} ||(z^r_{S,n} + z^i_q) − z̄_T||²,

where z̄_T denotes the ensemble mean over the target images. This loss is minimized to preserve the structure of the target samples due to the domain mapping S → T. Fig. 3 (right) shows the structural loss measured by the synthesized variable z_T in the target domain by adding the latent variables of the query face z^i_q and the talking information of the source person z^r_S (encoded by E_{φ^r} and depicted in red). The structure encoder E_{φ^i} and the movement encoder E_{φ^r} are optimized to preserve the structures before and after domain mapping.
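The two structural losses can be sketched with numpy as follows; the latent codes and the fixed 0.5 split standing in for the learned encoders E_{φ^i} and E_{φ^r} are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent codes for N talking frames of one source person.
N, D = 8, 16
z = rng.normal(size=(N, D))

# Stand-ins for the learned encoders: any additive split z = z_r + z_i
# works for illustration (the real split is learned, not a fixed scaling).
z_i = 0.5 * z
z_r = z - z_i

# Source structural loss: pull attribute-irrelevant codes toward the
# ensemble mean (centroid) of the sequence.
z_bar = z.mean(axis=0)
L_s = np.mean(np.sum((z_i - z_bar) ** 2, axis=1))

# Target structural loss: synthesize target codes by merging the source
# attribute parts with the structure part of a (hypothetical) query face.
z_i_q = rng.normal(size=D)
z_T = z_r + z_i_q                 # broadcast over the N frames
z_T_bar = z_T.mean(axis=0)
L_t = np.mean(np.sum((z_T - z_T_bar) ** 2, axis=1))
```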

B. Factorization by Self-Supervised Learning
In practice, paired data between the two domains are missing in the task of face reenactment, so supervised factorization is not possible. However, the relation between any pair of talking frames within the same video clip of face images can be characterized in a self-supervised way. This study presents the random-pair reconstruction loss L_p, which is minimized to estimate E_{φ^i} and E_{φ^r}. As shown in Fig. 3 (left), the image frames of a single sequence are self-collected to form a set of pseudo-paired data {z_n, z_m}, where two different frames within a single sequence from a source person are randomly selected. The reconstruction error for a latent variable z_m is then calculated by

L_p = ||(E_{φ^r}(z_m) + E_{φ^i}(z_n)) − z_m||²,

which is minimized to preserve the structure information and the attribute evidence. This method not only augments the positive pairs but also encourages training diversity.
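A minimal numpy sketch of the random-pair reconstruction, again with random stand-ins for the encoder outputs (the fixed 0.3 split is a hypothetical placeholder for the learned encoders):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 8, 16
z = rng.normal(size=(N, D))
z_r = 0.3 * z        # stand-in for E_phi_r(z)
z_i = z - z_r        # stand-in for E_phi_i(z), additive split z = z_r + z_i

# Randomly pair two different frames (m, n) of the same sequence and
# reconstruct z_m from its own attribute part z_r[m] plus the structure
# part z_i[n] of the partner frame (the parts exchanged within a sequence).
m, n = rng.choice(N, size=2, replace=False)
z_m_hat = z_r[m] + z_i[n]
L_p = np.sum((z_m_hat - z[m]) ** 2)
```

If the structure part were truly shared across the frames of one sequence, z_i[n] would equal z_i[m] and L_p would vanish; minimizing L_p pushes the learned encoders toward that behavior.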
The previous three losses {L_s, L_t, L_p} closely affect the structure encoder E_{φ^i}. To enhance the flow-based disentanglement, the AF flow is further consolidated by minimizing the classification loss due to a word label k of the talking mouth along a training sequence, which is predicted by using the talking variables z^r [49]. This loss is considerably affected by the attribute encoder E_{φ^r}. The cross-entropy error between the one-hot class output y_n = {y_nk} and the posterior output C_ψ of a neural sequential classifier with parameter ψ is measured by

L_c = − Σ_{n=1}^{N} Σ_{k=1}^{N_y} y_nk log C_ψ(z^r_n)_k,

where N_y denotes the vocabulary size. This classification loss due to the word label is minimized to jointly train the two encoders {E_{φ^i}, E_{φ^r}} and the classifier C_ψ for AF.

C. Implementation and Optimization
In this study, the neural sequential classifier C_ψ with word outputs {y_n}_{n=1}^{N} is implemented by an RNN using a sequence of attribute-relevant vectors {z^r_n}_{n=1}^{N} as the inputs, as shown in Fig. 4(a). In addition, a single encoder E_{φ^i}, or simply E_φ, is configured as in Fig. 4(b) instead of using the two encoders {E_{φ^r}, E_{φ^i}} in Fig. 2. This scheme simplifies the training convergence and makes sure of the inverse procedure through the additive relation z = z^r + z^i. There are two training stages for the AF flow. The first stage trains an unsupervised Glow model z = f_θ(x) with parameter θ by maximizing the likelihood of the observed data x or, equivalently, minimizing the flow loss in (1). However, the observed data x consist of discrete pixel values of images. In the implementation, the dequantization method [18], [27] is applied by using a random noise variable ε in the flow model. The flow transformation and its inverse function are obtained by z = f_θ(x̃) and x = f_θ^{-1}(z; ε), respectively, where x̃ = x + ε is obtained by adding a positive uniform noise sample drawn by ε ∼ U(0, b) with a small value of the bounding parameter b. This Glow model using θ is then adopted to collect the mini-batches of training sequences {z, y}. The second stage uses them to fulfill the AF by continuously updating the two encoders and one classifier with parameters {φ, ψ}, where the combined AF loss

L_AF = L_s + L_t + L_p + L_c

is minimized by calculating the gradients ∂L_AF/∂φ and ∂L_AF/∂ψ in the stochastic gradient descent (SGD) algorithm. The AF flow carries out the disentanglement for conditional generation via the structure encoder E_φ and the classifier C_ψ by minimizing the structural losses in the two domains, the random-pair loss, and the classification loss. In what follows, an alternative disentanglement with information evidence is presented. Algorithm 1 shows the learning stages of the parameters {θ, φ, ψ} in the AF flow.
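The dequantization step and the combined AF loss can be sketched as follows; the pixel values, the bound b, and the four individual loss values are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# Dequantization: add uniform noise below one pixel step so that the
# discrete 8-bit image becomes a continuous sample for the flow model.
x = rng.integers(0, 256, size=(4, 4)).astype(np.float64)
b = 1.0                              # bounding parameter of the uniform noise
eps = rng.uniform(0.0, b, size=x.shape)
x_tilde = x + eps                    # de-quantized input to z = f_theta(x_tilde)

# Combined AF loss: the paper sums the four terms; the numbers below are
# placeholder loss values standing in for L_s, L_t, L_p, L_c.
L_s, L_t, L_p, L_c = 0.3, 0.2, 0.5, 0.1
L_AF = L_s + L_t + L_p + L_c
```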

IV. MI FLOW
This study further presents the MI flow for latent disentanglement where the attribute relevance and irrelevance are factorized.

A. Information-Theoretic Disentanglement
Algorithm 1 Learning Procedure for the AF Flow Model
  Input: queries, source samples, and labels {x_q, x_S, y}
  Initialize parameters θ, φ^r, φ^i, ψ and select parameter b
  while θ, φ^r, φ^i, ψ not converged do
    update θ using x_j from {x_q, x_S} via the gradient of the flow loss with x̃_j ← x_j + ε
    for each mini-batch {x_j, y_j} from {x_q, x_S, y} do
      calculate z, z^r, z^i via f_θ, E_{φ^r}, E_{φ^i} using x_j
      calculate z̄ and the source structural loss L_s in (2)
      calculate the target structural loss L_t, the random-pair loss L_p, and the classification loss L_c
      update φ^r, φ^i, ψ via the gradients of the combined AF loss
  end while

A key property of disentanglement is to factorize the latent variables into distinct features. Consider the disentanglement of the latent variable z into a condition-irrelevant variable z^i and a condition-relevant variable z^r, where z = {z^i, z^r}. The learning objective is formed as an MI loss L = I(z^i; z^r|x, c), which is minimized by using the observed data x (corresponding to a source sample x_S) and a relevance condition c (corresponding to his/her subimage of the mouth region). The mapping between the observed domain and the latent domain is thereby characterized. Fig. 5(a) shows how the separation between z^i (for facial structure) and z^r (for lip movement) is increased by minimizing I(z^i; z^r|x, c). However, direct calculation of the true MI is difficult. This MI is therefore arranged by manipulating the entropy terms H(·) and factorized in the form

L = I(z; x, c) − I(x; z^i) − I(z^r; x, c) + I(z^i; z^r),

where the last two terms reflect the condition-relevant MI. The MI I(z^r; x, c) is maximized to pursue the condition-relevant variable z^r, which is substantially correlated with the data and condition {x, c}. The MI I(z^i; z^r) is minimized to disentangle the condition-irrelevant and condition-relevant variables {z^i, z^r}. In [50], an information bottleneck in an invertible neural network was proposed to represent the information constraint for a generative classifier that optimally balances classification accuracy and model complexity. This study presents the MI-based flow model where these four MI terms are jointly optimized for the disentanglement of the structure and attribute variables {z^i, z^r}.
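The entropy manipulations used here can be checked numerically on a toy discrete example. The joint distribution below is hypothetical; the identity verified, I(z^i; z^r) = H(z^r) − H(z^r|z^i), is the same kind of entropy decomposition applied to the MI terms above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution, skipping zeros."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical joint distribution over two binary variables (rows: z_i, cols: z_r).
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])
p_zi = joint.sum(axis=1)
p_zr = joint.sum(axis=0)

# MI via the entropy identity I(z_i; z_r) = H(z_r) - H(z_r | z_i).
H_zr = entropy(p_zr)
H_zr_given_zi = sum(p_zi[i] * entropy(joint[i] / p_zi[i]) for i in range(2))
mi = H_zr - H_zr_given_zi

# MI via its direct definition, for cross-checking.
mi_direct = sum(joint[i, j] * np.log(joint[i, j] / (p_zi[i] * p_zr[j]))
                for i in range(2) for j in range(2))
```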

B. MI Objectives
The first MI objective I(z; x, c) is used to disentangle the domain mapping between {x, c} and z, which is driven by the posterior distribution p_θ(z|x, c) with parameter θ. The flow model is applied to transform the observed data x into the latent variable z by the invertible function z = f_θ(x) with parameter θ, while the condition c provides the attribute as prior information for conditional generation. In particular, minimizing this domain mapping MI is equivalent to minimizing its variational upper bound [51], which, after dropping the term H(x̃|c) that is independent of the flow parameter θ, yields the MI loss for domain mapping

L_d(x, c; θ) = −E[log p_θ(x̃|c)],   (10)

which is minimized to learn the flow-based representation. Equivalently, the conditional likelihood of the noisy samples x̃ given condition c is maximized to learn the flow model. Next, the condition-irrelevant MI I(x; z^i) is maximized to infer z^i, which sufficiently reflects x but is irrelevant to the attribute c. This MI is arranged to find a lower bound via

I(x; z^i) = H(x) − H(x|z^i) ≥ H(x) + E[log p_θ(x|z^i)],   (11)

where x is sampled from the inverse function g_θ = f_θ^{-1} using z^i, and the auxiliary distribution p_θ(x|z^i) is merged to approximate the true posterior p(x|z^i). The lower bound in (11) is obtained since the KL term is always nonnegative. Notably, rather than using an additional decoder to approximate the true distribution, it is meaningful to reuse the flow model by reversing its transformation direction or, equivalently, applying its inverse function to implement this generator or decoder. The flow parameter θ is affected not only by the domain mapping objective I(z; x, c) but also by the condition-irrelevant objective I(x; z^i). Such a scheme is helpful to train a flow-based generator p_θ(x|z^i). Maximizing I(x; z^i) is comparable with maximizing its lower bound. As a result, the loss function for the condition-irrelevant MI is constructed by removing the independent term H(x) to form the objective

L_i(x; θ) = −E[log p_θ(x|z^i)].   (12)

In addition, the informative latent disentanglement is further strengthened by inferring the condition-relevant latent variable z^r, where the correlation between the given condition c and the disentangled embedding z^r is increased by optimizing the condition-relevant MI −I(z^r; x, c) + I(z^i; z^r). There are two terms in this learning objective. The first term I(z^r; x, c) is maximized to consolidate the condition-relevant variable z^r, which reflects the image x as well as the condition c. Similarly, this term can be factorized and manipulated as

I(z^r; x, c) = H(z^r) − H(z^r|x, c) ≥ H(z^r) + E[log p_ϕ(z^r|x, c)].   (13)

Again, this lower bound is obtained due to the nonnegative KL term. Notably, in (13), a learnable conditional distribution p_ϕ(z^r|x, c), or p_ϕ(z^r|c), with Gaussian parameters ϕ consisting of mean μ and standard deviation σ, is incorporated to provide the prior information for the condition-relevant variable z^r, which is calculated by the flow model f_θ under the distribution p_θ(z^r|x, c). Alternatively, the second term is minimized to disentangle z^r from z^i. This term can be factorized as I(z^i; z^r) = H(z^r) − H(z^r|z^i) and combined with the first term in (13) to derive the variational upper bound of the condition-relevant MI

−I(z^r; x, c) + I(z^i; z^r) ≤ −E[log p_ϕ(z^r|x, c)] − H(z^r|z^i).   (14)

Minimizing this MI objective is equivalent to minimizing its corresponding upper bound. The loss function due to the condition-relevant MI is then obtained by minimizing

L_r(x, c; θ, ϕ) = −E[log p_ϕ(z^r|x, c)]   (15)

to update the flow parameter θ as well as the prior parameter ϕ.
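With a diagonal Gaussian conditional prior, the condition-relevant loss reduces to a Gaussian negative log-likelihood. A minimal numpy sketch, where the condition-relevant code and the prior statistics μ(c), σ(c) are hypothetical values rather than network outputs:

```python
import numpy as np

def gaussian_nll(z_r, mu, sigma):
    """-log p_phi(z^r | c) for a diagonal Gaussian prior whose mean mu
    and standard deviation sigma depend on the condition c."""
    return 0.5 * np.sum(((z_r - mu) / sigma) ** 2
                        + 2.0 * np.log(sigma) + np.log(2 * np.pi))

# Hypothetical condition-relevant code and prior statistics for one condition.
z_r = np.array([0.2, -0.4])
mu = np.zeros(2)
sigma = np.ones(2)
L_r = gaussian_nll(z_r, mu, sigma)  # the condition-relevant MI loss
```

Minimizing this quantity over both the flow (which produces z^r) and the prior network (which produces μ, σ) pulls the condition-relevant code toward the condition-dependent prior.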

C. Learning Algorithm
In the implementation, the latent disentanglement using the domain mapping MI can be further strengthened by expanding the loss function L_d. The generative likelihood given condition c in (10) is extended by jointly considering the condition-irrelevant prior p(z^i) and the condition-relevant prior p_ϕ(z^r|c) based on z = {z^i, z^r}. The parameters of the flow model and the prior model {θ, ϕ} are merged in the derivation as

log p_{θ,ϕ}(x̃|c) = log p(z^i) + log p_ϕ(z^r|c) + log |det(∂f_θ(x̃)/∂x̃)|.

The loss function for the condition-relevant MI, L_r in (15), is thus seen as a part of the loss term for the domain mapping MI L_d. Optimizing the conditional prior model p_ϕ(z^r|c), or equivalently p_ϕ(z^r|x, c), is performed while training the conditional flow model f_θ, which is used in p_θ(z|x, c). In the optimization, this conditional prior is provided to infer the condition-relevant variable z^r, while the condition-irrelevant variable z^i simply relies on a standard Gaussian prior. Besides, the loss functions L_d and L_r depend on the input data {x, c} as well as the model parameters {θ, ϕ}. The loss function for the condition-irrelevant MI, L_i(x; θ) in (12), is only related to the observed data x and the flow parameter θ. Assume that the likelihood function p_θ(x|z^i) given the condition-irrelevant variable z^i, calculated by the flow model f_θ, is Gaussian with zero mean for the reconstruction error and unit variance in each dimension. The loss function L_i is then simply seen as the aggregation of the reconstruction errors over the individual observations x as

L_i(x; θ) = Σ_j ||x_j − g_θ(z^i_j)||².

This loss is minimized to build a flow model whose inverse g_θ works toward the smallest reconstruction error. Such an inverse flow model is reused to act as the conditional generator. Note that only the condition-irrelevant variable z^i is used as the input to the generator. This property makes sure of an informative latent variable z^r, which substantially reflects its relation with the observed input x. Therefore, the loss function of the MI flow is combined with a weighting parameter α as

L_MI = L_d + L_i + α L_r.

Algorithm 2 illustrates the learning procedure of the proposed MI flow, where the flow model f_θ (or g_θ) with parameter θ and the Gaussian prior p_ϕ(z^r|c) with parameters ϕ = {μ, σ} are tightly merged and jointly optimized. This is different from the separate two-stage training in the AF flow.

Algorithm 2 Learning Procedure for the MI Flow Model
  Input: training mini-batches x = {x_j} and condition c
  Initialize parameters θ, ϕ and select parameters b, α
  while θ, ϕ not converged do
    for each mini-batch x_j do
      select c for mini-batch x_j
      find a noise sample by ε ∼ U(0, b)
      de-quantize x_j by x̃_j ← x_j + ε
      calculate z^r, z^i, z via f_θ using x̃_j
      update θ, ϕ via the gradients of the combined loss L_MI
  end while

D. Architecture and Implementation
Extended from the concept in Fig. 5, the architecture of the proposed MI flow is configured in Fig. 6(a) for the training stage and Fig. 6(b) for the generation stage, where Fig. 6(c) defines the different symbols. Different from AF flow, which uses two encoders E_φ^r and E_φ^i for the factorization of the flow-based latent vector z = f_θ(x) into z_r and z_i, respectively, the flow model in MI flow is factorized to calculate the condition-relevant and condition-irrelevant vectors z = {z_r, z_i} from the observed vector x. The function split(·) is used for variable splitting. Such a factorization implements the latent disentanglement z = {z_r, z_i} = f_θ(x) with the inverse x = f_θ^{-1}(z_r, z_i). In the training stage, the flow components f_θ^r and f_θ^i consisted of K flow steps or coupling layers. Considering image data with three channels of size 64 × 64 × 3, the flow component f_θ^r first adopted a squeeze operation to increase the channel number and reshape the three-way tensor input as 32 × 32 × 12.
The remaining layers had the same size. After these flow steps, the output h was split into two variables with the same shape 32 × 32 × 6, where one variable was used as the condition-relevant variable z_r and the other variable h_i was used as the input to repeat this flow step. Such a computation block was repeated L − 1 times to construct a multiscale layer architecture [26] so as to produce z_i. The condition-irrelevant vector z_i was obtained after the flow component f_θ^i with K flow steps was computed. This architecture was useful to reduce the computation cost and improve the model regularization. Notably, this Glow model is not only used in MI flow but also employed in AF flow when finding z_q and z_S in Fig. 2. Different from AF flow, the conditional prior is incorporated in MI flow to draw the sample of the condition-relevant variable z_r based on the Gaussian mean and variance parameters {μ, σ}, where L − 1 layers of convolution and max pooling were calculated to find p_ϕ(z_r|c) given the condition image c. This process matched the size of the multiscale architecture. The parameter ϕ of the conditional prior network was estimated to tightly impose the condition of lip movement to infer the relevance vector z_r. Here, K = 32 and L = 4 were used.
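The squeeze and split operations described above can be sketched at the shape level. The dimensions follow the text (a 64 × 64 × 3 input, 32 × 32 × 12 after squeezing, and 32 × 32 × 6 chunks after splitting); the function names and the channel-halving split are illustrative assumptions.

```python
import numpy as np

def squeeze(x):
    """Trade spatial resolution for channels: (H, W, C) -> (H/2, W/2, 4C)."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H // 2, W // 2, 4 * C)

def split(h):
    """Channel-wise split of the flow output into z_r and h_i."""
    C = h.shape[-1]
    return h[..., : C // 2], h[..., C // 2 :]

x = np.zeros((64, 64, 3))          # one RGB frame
h = squeeze(x)                     # input to the first K flow steps
assert h.shape == (32, 32, 12)

z_r, h_i = split(h)                # z_r is the condition-relevant chunk
assert z_r.shape == (32, 32, 6) and h_i.shape == (32, 32, 6)
```

The remaining half h_i then feeds the next computation block, which is repeated L − 1 times in the multiscale architecture.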
After the training stage, the generation stage was implemented for face reenactment by inverting all of the computations using the inverse functions {g_θ^i, g_θ^r} and replacing the splitting with concatenation, where the condition-irrelevant and condition-relevant clues from {z_i, z_r} were used in each computation block. There were L − 1 blocks. Given a condition of lip movement c_S of a query image x_q in the source domain, the proposed MI flow is able to generate a sequence of images of a target face x_T. This model implements the conditional generation for domain mapping between c_S and x_T. In the implementation, the trained conditional prior network p_ϕ(z_r|c) is applied to draw the sample of the condition-relevant variable z_r^S by using the query condition c_S, which is then concatenated with h_i. This h_i is calculated by the inverse model g_θ^i from the condition-irrelevant variable z_i. The concatenated vector h is then transformed by the inverse function g_θ^r, and this concatenation step is repeated L − 1 times to finally generate the target face image x_T. An unsqueeze operation is performed to restore the shape of the target mouth to match that of the original image sample. Notably, the same flow models in the reverse direction, g_θ^i and g_θ^r, and the same model structure as the training stage are employed in this conditional generation.
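In the generation stage the split is replaced by concatenation and the squeeze is inverted. A shape-level sketch follows, where the sampled z_r, the placeholder values, and the function names are illustrative assumptions.

```python
import numpy as np

def concat(z_r, h_i):
    """Inverse of the split: merge condition-relevant and -irrelevant parts."""
    return np.concatenate([z_r, h_i], axis=-1)

def unsqueeze(x):
    """Inverse of the squeeze: (H, W, 4C) -> (2H, 2W, C)."""
    H, W, C4 = x.shape
    x = x.reshape(H, W, 2, 2, C4 // 4).transpose(0, 2, 1, 3, 4)
    return x.reshape(2 * H, 2 * W, C4 // 4)

z_r = np.ones((32, 32, 6))    # drawn from the conditional prior p(z_r|c)
h_i = np.zeros((32, 32, 6))   # from the inverse flow g_theta^i
h = concat(z_r, h_i)
assert h.shape == (32, 32, 12)
assert unsqueeze(h).shape == (64, 64, 3)   # restored image shape
```

The concatenated tensor is then transformed by the inverse function g_θ^r, and the final unsqueeze restores the 64 × 64 × 3 image shape of the target mouth.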

E. Disentanglement by the Combined Flow
This article has presented two approaches to flow-based latent disentanglement for conditional generation and mapping from a source image x_S (or input image x) to a target or output image x_T driven by a query image x_q (or a condition subimage of lip movement c), where the paired data between source and target domains in face reenactment are missing. The AF flow and the MI flow are proposed by minimizing the structural loss and the information-preserving loss, respectively, toward inferring the attribute-relevant and irrelevant vectors {z_r, z_i} in the flow-based latent representation. Basically, AF flow carries out a two-stage separate training of a flow model f_θ and a disentanglement model with two encoders {E_φ^r, E_φ^i} [or a single encoder E_φ via Fig. 4(b)], where the likelihood-based loss L_f, the structural losses in source and target domains {L_s, L_t}, the random-pair loss L_p, and the word-level classification loss L_y are minimized. A flow model is built as a pretrained model, which is then fine-tuned to estimate the encoder. Alternatively, MI flow implements a single-stage disentanglement in the presence of a Gaussian prior p_ϕ for the relevant vector z_r, where informative disentanglement is performed. The variational upper bound of the MI of {z_r, z_i} conditioned on the source mouth x and his/her lip movement c is minimized. This bound is factorized into the bounds for domain mapping L_d, condition irrelevance L_i, and condition relevance L_r.
Basically, the two approaches originate from different perspectives and can be combined to strengthen the flow-based disentanglement with both geometric and informational meanings. The combined AF-MI flow is proposed here for latent disentanglement. This hybrid model is implemented by training MI flow as an initial model, which is then fine-tuned in accordance with the objectives of AF flow. A single-stage factorization of z_r and z_i is directly handled by a single flow model f_θ (with f_θ^r and f_θ^i) instead of the two-stage disentanglement using both the flow model f_θ and the encoder E_φ. The flow parameter θ is updated by jointly minimizing the structural losses in the source domain L_s and target domain L_t and the reconstruction loss due to random pairs L_p in a self-supervised manner. Importantly, the word-level classification loss L_y is also minimized. After that, the Gaussian prior p_ϕ is finally updated by the objective L_MI. In the generation stage, a target image x_T is generated from a source mouth x_S and his/her lip condition c_S by using the flow model and the conditional prior model, which minimize both L_MI and L_AF. In this study, both objectives are incorporated in L = L_MI + βL_AF with a hyperparameter β.
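The combined objective L = L_MI + βL_AF can be written down directly. The decomposition of L_AF into its four loss terms follows the text above; the numeric loss values below are placeholders for illustration.

```python
def af_loss(L_s, L_t, L_p, L_y):
    """AF objective: structural losses (source/target), random-pair
    reconstruction loss, and word-level classification loss."""
    return L_s + L_t + L_p + L_y

def combined_loss(L_MI, L_AF, beta):
    """Hybrid AF-MI objective: L = L_MI + beta * L_AF."""
    return L_MI + beta * L_AF

# placeholder loss values chosen to be exactly representable in binary
L_AF = af_loss(L_s=0.25, L_t=0.25, L_p=0.25, L_y=0.25)
assert combined_loss(L_MI=1.0, L_AF=L_AF, beta=0.5) == 1.5
```

Setting β = 0 recovers pure MI flow training, which matches the described schedule of training MI flow first and then fine-tuning with the AF objectives.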

V. EXPERIMENTS
In the experiments, the conditional generation and domain mapping were implemented for mouth reenactment, where the flow-based disentanglement is evaluated by using the Oxford-BBC Lip Reading in the Wild (LRW) dataset [52], with some face images shown in Fig. 7.

A. Experimental Setup
The LRW dataset consisted of short human talking videos where each video contained the pronunciation of a single vocabulary word with a length of 29 frames. There were 500 different words spoken by hundreds of different speakers. Each word had 1000 utterances. The image size was fixed at 64 × 64 with RGB channels. The audio signals were ignored in this study. The settings for training, validation, and test followed [52]. In addition to the proposed AF flow, MI flow, and the combined AF-MI flow, this study also carried out the related works based on the disentangled audio-visual system (DAVS) without and with adversarial learning [15], and Glow [18] for comparison. The AF and MI flow models were implemented with the generative flow based on Glow [18]. Glow was geared with invertible computation, which estimated the exact likelihood, where each flow step implemented three calculations. First, the activation normalization was calculated to act as a scaling layer, performing an affine transformation with scale and bias parameters per channel given by data-dependent initialization. Second, an invertible 1 × 1 convolution was calculated while the dimensions were swapped. This process was different from [16], which simply reversed the ordering of dimensions in each flow step, and from [26], which randomly scrambled the channels. A learnable invertible 1 × 1 convolution was performed as a generalized permutation by using a rotation matrix with randomly initialized weights [18]. Third, an affine coupling layer [16], [26] was introduced as an invertible transformation where the determinant was computationally efficient. In addition to evaluating the synthesized images, the quantitative evaluation for image reconstruction of the test images in LRW was analyzed in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) [53], which were averaged over all test images. PSNR measures the ratio of the maximum possible power of color images with RGB
channels to the corrupting noise power in decibels. This measure reflects the quality of an image; the higher, the better. SSIM measures the similarity between an undistorted image and a distorted image, where the factors of luminance, contrast, and structure are jointly considered with equal weighting. This metric is in line with human judgment and is seen as a measure of image quality; again, the higher, the better. The computation time of running the Python codes in PyTorch was measured on hardware with a GeForce RTX 3090 Ti GPU (24 GB), an Intel Core i9-10900K CPU, and 128 GB of DDR4 RAM.
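The PSNR computation just described can be sketched as follows; the toy images and `max_val = 255` (assuming 8-bit channels) are illustrative. SSIM involves windowed luminance, contrast, and structure terms and is typically computed with an image-processing library rather than reimplemented.

```python
import numpy as np

def psnr(ref, img, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    distorted image; higher means less distortion."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((64, 64, 3), 128.0)   # flat gray reference frame
img = ref + 10.0                    # uniform error of 10 gray levels
value = psnr(ref, img)              # MSE = 100 gives roughly 28.1 dB
assert 28.1 < value < 28.2
```

Doubling the per-pixel error quadruples the MSE and lowers the PSNR by about 6 dB, which is why small reported differences (e.g., 28.4 versus 29.2 dB) correspond to visible quality gaps.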
Using AF flow, the first stage was to build four computation blocks for the multiscale architecture. Each block had 32 flow steps. The mini-batch size was 16. The second stage was to implement two encoders E_φ^i and E_φ^r (or a single encoder E_φ) and one classifier C_ψ. The parameter ψ was used to transform the talking attribute vector z_r with a dimension of 6144 (32 × 32 × 6) into the posterior vector for 500 word labels. Each encoder had one hidden layer. The sequence of frames from a video was used as a batch. The total number of neurons in each layer was the same as the image size D. An ablation study on the effect of removing the individual losses L_t, L_p, and L_y was evaluated. Using MI flow, there were four computation blocks, consisting of three blocks and one block for calculating the relevant vector z_r and the irrelevant vector z_i, respectively. MI flow had two components: one was the generative flow f_θ and the other was the conditional prior p_ϕ. Different from [18], [26], MI flow was implemented by optimizing different MI objectives, where a learnable conditional prior distribution p_ϕ(z_r|c) was merged to establish the multiscale architecture. This conditional prior model served as an encoder to infer the condition-relevant latent variable z_r^S from a source subimage of lip movement c_S, which promoted the latent disentanglement for generation of the target face x_T by combining with the condition-irrelevant variable z_i^q of a query image x_q based on the trained flow models f_θ^i and f_θ^r. The generation was based on the inverse flow model g_θ. The mini-batch size was set as 8.
In the implementation of Glow [18] and the AF and MI flows, the dimensions of input and output in the flow model should be the same so as to preserve the invertibility for precise reconstruction. To mitigate the dimensional waste, the multiscale architecture [18], [26] was employed in the flow model by applying dimensional splitting. In the case of three flow blocks (or L = 4), the compressed ratio (denoted by γ) of the dimensions of the condition-irrelevant variable z_i relative to the condition-relevant variable z_r turned out as z_i : z_r = 1 : (2^3 − 1), which resulted in γ = 0.125. For the ablation study, the flow models were implemented with different compressed ratios γ = 0.25 (L = 3) and γ = 0.0625 (L = 5). Nevertheless, MI flow optimized the condition-irrelevant MI L_i, which was able to strengthen the latent variable z_i with image structural information. Given the condition of mouth movement c, the latent variable z_r was enhanced by optimizing L_d and L_r for providing attribute information. In addition, a simplified variant of MI flow was implemented to investigate the effect of different MI terms on inferring z_i and z_r. The ablation study on individually removing L_d, L_i, and L_r is evaluated. For the AF, MI, and AF-MI flows, the bounding parameter b = 1 was set. The Adam optimizer [54] was used with an initial learning rate of 0.0001 when updating the parameters θ, φ, ψ, and ϕ. Gradient clipping was applied. A total of 200k iterations were run.
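The invertibility that the implementation relies on can be demonstrated with a simplified single flow step: actnorm, a random rotation standing in for the learnable 1 × 1 convolution, and an additive coupling standing in for the affine coupling. This is a toy sketch under those stated simplifications, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 4                                         # toy channel dimension

s = rng.normal(size=C) ** 2 + 0.5             # actnorm scale (kept positive)
b = rng.normal(size=C)                        # actnorm bias
Q, _ = np.linalg.qr(rng.normal(size=(C, C)))  # rotation for the 1x1 conv

def flow_step(x):
    x = s * x + b                  # 1) activation normalization
    x = Q @ x                      # 2) invertible 1x1 convolution
    xa, xb = x[: C // 2], x[C // 2 :]
    xb = xb + np.tanh(xa)          # 3) (additive) coupling layer
    return np.concatenate([xa, xb])

def inverse_step(y):
    ya, yb = y[: C // 2], y[C // 2 :]
    yb = yb - np.tanh(ya)          # undo the coupling
    y = Q.T @ np.concatenate([ya, yb])  # undo the rotation
    return (y - b) / s             # undo actnorm

x = rng.normal(size=C)
assert np.allclose(inverse_step(flow_step(x)), x)  # exact reconstruction
```

Because every sub-operation has a closed-form inverse and a tractable Jacobian, the composed model supports both exact likelihood evaluation and exact reconstruction, which is what the PSNR/SSIM evaluation of the flow models exploits.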

B. Evaluation for Qualitative Results
First, the qualitative evaluation is illustrated for the generated images in the target domain based on the learned flow-based disentanglement for domain mapping from the source images. A source video {x_S^n}, n = 1, ..., N, and a query image x_q are used to generate the target video {x_T^n} frame by frame. A mask is used to crop the region or subimage around the mouth as the source condition c_S, which is mapped to the corresponding mouth region of the query image x_q to synthesize a target face x_T. In effect, the query image is given the condition of the source mouth with his/her lip movement. The scheme of Poisson blending [55] is applied to improve the image quality by tackling the blurring issue. Fig. 8 shows the generated target images x_T consisting of seven sequence frames where the source images x_S and a query image x_q are provided. Latent disentanglement using AF flow is applied. To investigate the generalization capability, the trained model from the LRW dataset is applied for conditional generation of out-domain data where the source videos are collected from YouTube and the query images are sampled from Getty Images (https://www.gettyimages.com/). Both source and query data are outside the LRW dataset. The evaluation over male/female, western/eastern, and photograph/painting is shown. In general, AF flow obtains desirable imitation of lip movement in target images conditioned on various query faces. The generalization over different genders, races, styles, and angles works well. Fig. 9 further evaluates the mouth reenactment for another source video where the ablation study on individual loss terms is conducted. Among the different losses, it is found that there is no clear change in the synthesized images caused by different source frames when the random-pair reconstruction loss L_p is removed. In particular, the synthesized mouth in the last two frames does not really reflect the closing mouth as seen in the source video. This implies the importance of L_p in shaping the details of lip movement. This loss substantially affects conditional generation and mapping for talking faces.
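The mask-based cropping of the mouth subimage c_S can be sketched as below; the crop coordinates and the crop size are purely hypothetical placeholders for illustration, not values reported in the paper.

```python
import numpy as np

def crop_mouth(frame, top=40, left=16, height=20, width=32):
    """Crop the subimage around the mouth region of a 64x64 frame, to be
    used as the lip-movement condition c_S (coordinates are illustrative)."""
    return frame[top:top + height, left:left + width]

frame = np.zeros((64, 64, 3))   # one source frame
c_S = crop_mouth(frame)
assert c_S.shape == (20, 32, 3)
```

The same crop location on the query face identifies where the synthesized mouth is pasted back before Poisson blending smooths the boundary.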
Next, MI flow is examined for domain mapping. Again, the query images are all excluded from the LRW dataset. The input data consist of a source video {x_S^n} or {x_n}, n = 1, ..., N, and a query or condition image c.

Fig. 9. Face reenactment with ablation study on various loss terms. AF flow is applied.

Fig. 10(a) shows the comparison of the synthesized images for a male and a female. The hyperparameter for tuning the MI terms is investigated with α = 0.1 and α = 0, where the condition-relevant MI L_r is active and ignored, respectively. MI flow with L_r does provide richer information for better generation. Without L_r, the imitation of lip motion is not obvious. With a smaller value of α, the synthesized lip images are closer to those of the source mouth. The condition-relevant variable z_r does work. Hereafter, α = 0.1 is used.

C. Evaluation for Quantitative Results
The performance of conditional generation is further evaluated by image reconstruction in terms of PSNR and SSIM averaged over the test images of the LRW dataset. The baseline results of DAVS without and with adversarial learning [15] and the Glow model [18] are included for comparison. The DAVS with GAN is examined. The flow model using Glow implements the conditional generation where latent disentanglement is missing. Note that the flow model performs complete reconstruction with the invertible property. The results of latent disentanglement using the individual AF and MI flows and the combined AF-MI flow are compared. The ablation studies on various compressed ratios, loss terms, and training styles are investigated. Following the scheme [15] for improving PSNR and SSIM in image reconstruction, the human face is generated by using a fixed z_i randomly selected from a video and the z_r inferred at different frames of the video. Table I compares the PSNR and SSIM scores over different conditional generation models. In addition, the AF for disentanglement using a single encoder E_φ performs better than that using individual encoders {E_φ^r, E_φ^i} for the attribute-relevant and irrelevant vectors, where PSNR 27.5 and SSIM 0.930 are measured. These variants of AF flow-2 consistently perform better than the baseline systems of DAVS and Glow. Furthermore, the ablation study on learning objectives shows that the performance is dropped when individual terms in the loss function L_AF are removed. We are accordingly motivated to combine the objective L_y with the MI objectives in the implementation of MI flow. The effect of adding L_y is evaluated. The performance of MI flow under different compressed ratios γ is investigated. By additionally merging L_y in MI flow, PSNR and SSIM are increased from 28.4 and 0.936 to 29.2 and 0.949, respectively, where γ = 0.125 is fixed. Notably, the PSNR and SSIM of MI flow with L_y are higher than those of AF flow. However, the computation cost of using MI flow is increased as well. The ablation study on individual objectives shows that the largest drops in PSNR and SSIM are caused by
the objective L_i, which is the most influential factor in the learning objective. The condition-irrelevant MI L_i is required to capture the mouth structure to improve image reconstruction. Compared with the MIs for domain mapping L_d and conditional relevance L_r, the condition-irrelevant MI L_i focuses more on controlling the overall structure and characteristics, which are closer to the performance measures based on PSNR and SSIM. In addition, among the different compressed ratios, the value γ = 0.125 achieves the highest PSNR and SSIM. In this comparison, even though the compressed ratio is increased to γ = 0.25, MI flow is not improved in terms of PSNR and SSIM. However, the computation cost is increased significantly by increasing the compressed ratio in the flow model. Basically, PSNR and SSIM using MI flow consistently perform better than those using DAVS without and with adversarial training, Glow with different γ, and AF flow with different training styles. Because of the complementary property in using AF and MI flows, this study presents the flow combination for disentanglement, which is investigated by measuring the results of the combined AF-MI flow with different hyperparameters β. In this comparison, the highest PSNR and SSIM are achieved as 30.2 and 0.958 by using AF-MI flow with β = 0.5 and β = 0.8, respectively. Nevertheless, the computation cost of implementing AF-MI flow is increased substantially. The training hours of the best settings using Glow, AF, MI, and AF-MI flows are measured as 20.6, 28.9, 30.5, and 45.1, respectively. Finally, a demo video is provided to illustrate different results of video clips by using AF and MI flows for face reenactment shown in Figs. 8 and 10, respectively. Source codes are commented and posted online.
VI. CONCLUSION

This article has presented the flow-based latent disentanglement to identify the attribute-relevant and attribute-irrelevant latent variables that were employed for domain mapping and conditional generation. The geometric and informative solutions to disentanglement based on the AF flow and the MI flow were proposed, respectively. AF flow trained the flow model and the feature extractor (or encoder) for attribute relevance and attribute irrelevance based on a two-stage method where the Glow model was estimated by maximizing the generative likelihood and the disentanglement model was inferred by minimizing the structural losses within and between domains. The feature encoder was trained to disentangle the latent vectors according to the structural information of the images. The random-pair reconstruction loss via self-supervised learning and the cross-entropy loss for word classification were additionally minimized without the need of paired data. The proposed loss functions made use of the properties of sequence data and identified the related domain information in different sequences. In addition, this study presented the information-theoretic latent disentanglement for the flow-based generative model. A kind of end-to-end training was proposed to carry out the conditional generation for domain mapping in mouth reenactment. The condition-irrelevant and condition-relevant latent variables were learned in accordance with the informative objectives for domain mapping and disentanglement. By introducing the conditional prior, these two latent variables were disentangled and embedded with the specific attribute. AF and MI flows were constructed with the multiscale architecture where the dimensional waste was handled. The hybrid AF-MI flow combining two flow models was further developed by a cascaded implementation. A series of experiments on qualitative and quantitative evaluation of face reenactment showed the merit of the AF and MI for face generation and
reconstruction. The objectives of random-pair reconstruction and condition irrelevance considerably affected the learning procedure. The proposed methods will be further investigated by extending to other types of flow models and other kinds of technical data under different domain mapping tasks.

Index Terms—Disentangled features, domain mapping, face reenactment, flow model, information-theoretic generation.

Manuscript received 18 December 2021; revised 16 April 2022 and 18 June 2022; accepted 7 July 2022. Date of publication 15 July 2022; date of current version 6 February 2024. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Contract MOST 110-2634-F-A49-003. (Corresponding author: Jen-Tzung Chien.)

Fig. 1. Generating and normalizing processes in flow model.

II. BACKGROUND SURVEY

First, the flow-based generative models for image-to-image translation with latent disentanglement are introduced.

Fig. 3. Right: structural loss between z_i^T = E_φ^i(z_r^S + z_i^q) and z_T in the target domain. Left: random-pair reconstruction loss for any paired data {z_n, z_m} in the source domain.

Fig. 4. Architectures for (a) sequential classifier with vanilla RNN and (b) single encoder for factorization of attributes.

Fig. 5. Illustration of (a) MI in latent domain and (b) informative mapping and disentanglement based on the condition-irrelevant and condition-relevant variables {z_i, z_r}.
The MI objective

L_MI = I(z; x, c) [domain mapping] − I(z_i; x) [condition irrelevance] − I(z_r; x, c) [condition relevance] + I(z_i; z_r)

consists of four individual MI terms as shown in Fig. 5(b). The first term is the domain mapping MI I(z; x, c), which is minimized to disentangle the relation between the observed domain {x, c} and the latent domain z. The second term I(z_i; x) denotes the condition-irrelevant MI, which is maximized to infer the condition-irrelevant variable z_i so as to sufficiently reflect the observed data x.

Fig. 6. Architectures for MI flow in training and generation stages. (a) Training stage. (b) Generation stage. (c) Symbol definition.

Fig. 8. Face reenactment from a source video and three query images covering different genders, races, and styles. Out-domain data are evaluated. AF flow is applied.
Fig. 10(b) compares the results of the generated videos based on MI flow where different query images with different races or even different portrait paintings are investigated: evaluation over different query faces, including western faces, which are different from but close to the training targets, and eastern faces and portrait painting faces, which are far from the training targets. As can be seen, the sequences of generated images of lip movement across different races and styles consistently look good. The qualitative results on various out-domain examples confirm the generalization performance of AF and MI flows for domain mapping and disentanglement.

Fig. 10. Face reenactment for (a) evaluation of query images over different genders and hyperparameters α and (b) evaluation of query images covering different races and styles. Out-domain data are evaluated. MI flow is applied.
Table I reports the PSNR and SSIM scores over different conditional generation models. The training costs of different disentanglement methods relative to Glow under different conditions are also reported. In addition to the two-stage implementation of AF flow (denoted as AF flow-2), AF flow is also implemented as a single-stage AF model (AF flow-1) where the flow model using Glow f_θ and the disentanglement model with encoder E_φ are jointly trained instead of treating the flow model as a pretrained model for fine-tuning the encoder model. The variants of AF flow in the presence of two encoders {E_φ^r, E_φ^i} and a single encoder E_φ are also compared. The compressed ratio in AF flow is set as γ = 0.125. It is found that the two-stage AF flow obtains PSNR 28.6 and SSIM 0.939, which are considerably higher than PSNR 25.6 and SSIM 0.918 using the single-stage AF flow. The training procedure via single-stage AF flow does not converge well.
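The compressed ratio γ used in these comparisons follows directly from the number of multiscale levels L; a sketch, under the assumption that half of the channels are split off at each of the L − 1 levels so that γ = 1/2^(L−1), which matches the three reported settings.

```python
def compressed_ratio(L):
    """Fraction of dimensions left for z_i after L-1 multiscale splits,
    assuming half of the channels are split off at each level."""
    return 1.0 / 2 ** (L - 1)

# reproduces the three settings reported in the ablation study
assert compressed_ratio(3) == 0.25
assert compressed_ratio(4) == 0.125
assert compressed_ratio(5) == 0.0625
```

Each extra level halves γ but also adds a full block of flow steps, which is consistent with the observation that larger γ (fewer levels) raises the computation cost without improving PSNR or SSIM.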
The ablation on individual terms in the loss function L_AF shows that the biggest drop in model learning was due to the removal of self-supervised learning via the random-pair reconstruction loss L_p. The random pairs provide crucial information for reconstruction across the two domains. Such a loss conveys sufficient evidence for the encoder to learn AF. Also, the cross-entropy loss L_y from the word label k of a talking mouth influences attribute disentanglement in the ablation study. The classification loss L_y is seen as an additional objective, which is feasible to enrich the inference of the attribute-relevant vector z_r. The computation costs due to different training styles of stages and encoders are comparable.

TABLE I
COMPARISON OF PSNR AND SSIM SCORES FOR IMAGE RECONSTRUCTION BY USING DIFFERENT MODELS WITH ABLATION STUDIES ON COMPRESSED RATIO, LOSS FUNCTION, AND TRAINING STYLES. THE TRAINING TIME RELATIVE TO GLOW IS EVALUATED. THE ERROR BAR WITH ONE STANDARD DEVIATION IS SHOWN.