Generative Adversarial Networks in Human Emotion Synthesis: A Review

Synthesizing realistic data samples is of great value to both the academic and industrial communities. Deep generative models have become an emerging topic in various research areas, including computer vision and signal processing. Affective computing, a topic of broad interest in the computer vision community, has been no exception and has benefited from generative models; indeed, it has seen a rapid proliferation of generative models over the last two decades. Applications of such models include, but are not limited to, emotion recognition and classification, unimodal emotion synthesis, and cross-modal emotion synthesis. We therefore review recent advances in human emotion synthesis, covering the available databases and the advantages and disadvantages of generative models, along with the related training strategies, for the two principal human communication modalities, audio and video. In this context, facial expression synthesis, speech emotion synthesis, and audio-visual (cross-modal) emotion synthesis are reviewed extensively under different application scenarios. Finally, we discuss open research problems to push the boundaries of this research area for future work.


Introduction
Deep learning techniques are best known for their success in uncovering the underlying probability distributions of various data types in artificial intelligence, including videos, images, audio samples, biological signals, and natural language corpora. The success of deep discriminative models owes primarily to the back-propagation algorithm and piece-wise linear units (LeCun et al., 1998; Krizhevsky et al., 2012). In contrast, deep generative models (Goodfellow et al., 2014) had been less successful, owing to the difficulty of approximating the intractable probabilistic computations that arise in methods such as maximum likelihood estimation.
Many reviews have studied the rapidly expanding topic of generative models, and Generative Adversarial Networks (GANs) in particular, from various points of view: algorithms, theory, and applications (Gui et al., 2020; Wang et al., 2017), recent advances and developments (Zamorski et al., 2019), comparative studies (Hitawala, 2018), GAN taxonomies (Wang et al., 2019b), and GAN variants (Hong et al., 2019; Creswell et al., 2018; Huang et al., 2018; Kurach et al., 2018). A few review papers also discuss the subject from the perspective of a specific application, such as medical imaging (Yi et al., 2019b), audio enhancement and synthesis (Torres-Reyes and Latifi, 2019), image synthesis, and text synthesis (Agnese et al., 2019). However, none of the existing surveys considers GANs from the standpoint of human emotion synthesis.
Searching the phrase "Generative Adversarial Network" on the Web of Science (WOS) and Scopus repositories returns 2538 and 4405 published documents, respectively, from 2014 to the present. Figures 1(a) and 1(b) show the statistics obtained from these repositories for this query. The large number of studies published on this topic within only six years inspired us to conduct a comprehensive review of one of the significant applications of GAN models: human emotion synthesis.
Synthesizing realistic data samples is of great value to both academic and industrial communities, and affective computing, a topic of broad interest in the computer vision community, benefits from human emotion synthesis and data augmentation. Throughout this paper, we concentrate on recent advances in GANs and their applications to human emotion recognition, which is in turn useful in other research areas such as computer-aided diagnosis, security and identity verification, multimedia tagging, and human-computer and human-robot interaction. Humans communicate their emotional state through various verbal and nonverbal channels, and every communication modality matters when interpreting the current emotional state of the user. In this paper, we focus on GAN-related work on speech emotion synthesis, face emotion synthesis, and audio-visual (cross-modal) emotion synthesis, because face and speech are the primary communication channels among humans (Schirmer and Adolphs, 2017; Ekman et al., 1988; Zuckerman et al., 1981; Mehrabian and Ferris, 1967). Researchers have developed many GAN-based models to address problems such as data augmentation, improvement of emotion recognition rates, and enhancement of synthesized samples, through unimodal (Ding et al., 2018; Choi et al., 2018; Tulyakov et al., 2018; Kervadec et al., 2018; Pascual et al., 2017; Latif et al., 2017; Gideon et al., 2019; Zhou and Wang, 2017; Wang and Wan, 2018) and cross-modal analysis (Duarte et al., 2019; Karras et al., 2017a; Jamaludin et al., 2019; Chen et al., 2017).
GANs were introduced in 2014 by Goodfellow et al. (2014). A GAN is composed of a generative model pitted against an adversary in a two-player minimax framework. The generative model captures the data distribution. Then, given a sample, the adversary, or discriminator, decides whether the sample is drawn from the true data distribution (real) or from the model distribution (fake). The competition continues until the generated samples are indistinguishable from the genuine ones.
This review deals with GAN-based algorithms, theory, and applications in human emotion synthesis and recognition. The remainder of the paper is organized as follows: Section 2 provides a brief introduction to GANs and their variants. Section 3 comprehensively reviews related work on human emotion synthesis using GANs, covering unimodal and cross-modal GAN-based methods developed for audio and visual modalities. Finally, Section 4 summarizes the review, identifies potential applications, and discusses open challenges.

Background
In general, generative models can be categorized into explicit density models and implicit density models. While the former uses the true data distribution or its parameters to train the generative model, the latter generates sample instances without an explicit parameter assumption or a direct estimation of the real data distribution. Examples of explicit density modeling are maximum likelihood estimation and Markov chain methods (Kingma and Welling, 2013; Rezende et al., 2014). GANs are an example of implicit density modeling (Goodfellow et al., 2014).
Generative Adversarial Networks (GAN)

Goodfellow et al. proposed Generative Adversarial Networks, or vanilla GAN, in 2014 (Goodfellow et al., 2014). The model is based on a two-player minimax game in which one player seeks to maximize a value function and the other seeks to minimize it. The game ends at a saddle point where, given their strategies, the minimizing and maximizing agents reach a minimum and a maximum, respectively. The model draws samples directly from the desired distribution without explicitly modeling the underlying probability density function. Its general framework consists of two neural network components: a generative model G capturing the data distribution and a discriminative model D estimating the probability that a sample comes from the training data rather than from G. Let us designate the input to G as z, a random noise vector sampled from a prior distribution p_z(z); denote a real sample drawn from the data distribution P_r as x_r; and write x_g for an output sample generated by G. The goal is then maximum visual similarity between the two kinds of samples. The generator G learns a nonlinear mapping function parametrized by θ_g and written G(z; θ_g). The discriminator D receives both x_r and x_g and outputs a single scalar O_1 = D(x; θ_d), the probability that the input sample is real rather than generated (Goodfellow et al., 2014); D(x; θ_d) is the mapping function learned by D and parametrized by θ_d. The distribution P_g formed by the generated samples is expected to approximate P_r after learning. Figure 2(a) illustrates the general block diagram of the vanilla GAN model.
Given two distributions P_r and P_g on the same probability space X, the KL divergence is

KL(P_r || P_g) = ∫_X p_r(x) log ( p_r(x) / p_g(x) ) dx,   (1)

where both P_r and P_g are assumed to admit densities p_r and p_g with respect to a common measure defined on X, which holds when both are absolutely continuous with respect to that measure. The KL divergence is asymmetric, i.e., KL(P_r || P_g) ≠ KL(P_g || P_r), and possibly infinite: KL(P_r || P_g) diverges when there are points with p_g(x) = 0 but p_r(x) > 0. A more convenient quantity for GANs is the Jensen-Shannon (JS) divergence, which may be interpreted as a symmetrized version of the KL divergence:

JS(P_r || P_g) = (1/2) KL(P_r || P_m) + (1/2) KL(P_g || P_m),   (2)

where P_m = (P_r + P_g)/2.
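As an illustrative sketch (not taken from any of the reviewed papers), the two divergences can be computed for discrete distributions with a few lines of NumPy; note how KL can blow up on disjoint supports while JS stays bounded by log 2:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    if np.any(q[mask] == 0):
        return float('inf')  # q(x) = 0 where p(x) > 0 makes KL infinite
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2.0
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

For fully disjoint supports, e.g. p = [1, 0] and q = [0, 1], kl_divergence returns infinity while js_divergence saturates at log 2 ≈ 0.693, which is exactly the situation that makes JS-based GAN training problematic when P_r and P_g barely overlap.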
In other words, G and D play a minimax game over the value function V(G, D):

min_G max_D V(G, D) = E_{x_r ~ P_r}[ log D(x_r) ] + E_{z ~ p_z(z)}[ log(1 − D(G(z))) ].   (3)

Here, the parameters of G are adjusted by minimizing log(1 − D(G(z))), while the parameters of D are adjusted by maximizing log D(x_r). For an optimal discriminator, minimizing log(1 − D(G(z))) is known (Goodfellow et al., 2014) to be equivalent to minimizing the JS divergence between P_r and P_g as expressed in Eq. (2). The value function V(θ_g, θ_d) determines the payoff of the discriminator, and the generator takes −V(θ_g, θ_d) as its own payoff; each attempts to maximize its own payoff during the learning process.
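The value function V(G, D) above is an expectation, so in practice it is estimated by Monte-Carlo averages over mini-batches. A minimal sketch (toy 1-D data, a hypothetical fixed logistic discriminator, nothing from the reviewed papers) shows how V behaves:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, w=2.0, b=0.0):
    """Toy logistic discriminator D(x) = sigmoid(w*x + b), standing in for D(x; theta_d)."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def value_function(x_real, x_gen, d=discriminator):
    """Monte-Carlo estimate of V(G, D) = E[log D(x_r)] + E[log(1 - D(x_g))]."""
    eps = 1e-12  # numerical floor to keep the logarithms finite
    return float(np.mean(np.log(d(x_real) + eps)) +
                 np.mean(np.log(1.0 - d(x_gen) + eps)))

# Real samples centred at +2, "generated" samples centred at -2.
x_r = rng.normal(2.0, 0.5, size=10_000)
x_g = rng.normal(-2.0, 0.5, size=10_000)

v_separated = value_function(x_r, x_g)  # D separates the two distributions easily
v_matched = value_function(x_r, x_r)    # same samples on both sides: D's job is hopeless
```

When the distributions are well separated, this fixed D keeps V near its maximum (close to 0); when the "generated" samples match the real ones, the same D is heavily penalized on the second term. At the theoretical optimum, with P_g = P_r and the optimal D(x) = 1/2 everywhere, V equals −log 4.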

Challenges of GAN Models
The training objective of GAN models is often cast as a saddle point optimization problem (Yadav et al., 2017) solved by gradient-based methods. One challenge here is that D and G should be trained together so that they advance and converge jointly. Minimizing the generator's objective is provably equivalent to minimizing the JS divergence only if the discriminator D is trained to its optimal point before each update of G; in practice, therefore, minimizing the JS divergence does not guarantee finding the equilibrium point between G and D during training. Training normally leads to D outperforming G, and at some point classifying real and fake samples becomes so easy that the gradients of D approach zero, leaving it ineffectual in the learning procedure of G. Mode collapse is another well-known problem in training GANs, in which G produces a limited set of repetitive samples because, while learning the approximating distribution P_g, it focuses on a few limited modes of the true data distribution P_r. We discuss these problems in more detail in Section 4.1.
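The vanishing-gradient effect can be seen analytically. Writing a for the discriminator's logit on a generated sample, so that D(G(z)) = sigmoid(a), the gradient of the saturating generator loss log(1 − sigmoid(a)) with respect to a is −sigmoid(a), which vanishes when D confidently rejects fakes (a very negative); the commonly used non-saturating alternative −log D(G(z)) keeps a strong gradient in that regime. A small numerical check (our own sketch, not from the reviewed papers):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the generator gradient under Eq. (3)."""
    return -sigmoid(a)

def grad_nonsaturating(a):
    """d/da of -log(sigmoid(a)): the alternative -log D(G(z)) generator loss."""
    return sigmoid(a) - 1.0

# a is the discriminator logit on a generated sample; a very negative a means
# D confidently labels the sample as fake (D(x_g) close to 0).
a = -8.0
g_sat = grad_saturating(a)      # almost zero: nearly no learning signal for G
g_ns = grad_nonsaturating(a)    # close to -1: strong signal even when D dominates
```

This is why, early in training, a discriminator that wins too decisively starves the generator of gradient unless the non-saturating loss is used.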

Variants by Architectures
The GAN model can be extended to a conditional GAN (CGAN) if both the generator and the discriminator are conditioned on some extra information y (Mirza and Osindero, 2014). Figure 2(b) shows the block diagram of the CGAN model. The condition vector y = c is fed into both the discriminator and the generator through an additional input layer: the latent variable z with prior density p_z(z) and the condition y with some value c ∈ R^d are passed through one perceptron layer to learn a joint hidden representation. Conditioning on c changes the training criterion of Eq. (3) to

min_G max_D V(G, D) = E_{x_r ~ P_r}[ log D(x_r | y) ] + E_{z ~ p_z(z)}[ log(1 − D(G(z | y))) ],   (4)

where c could be target class labels or auxiliary information from other modalities.
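In the simplest CGAN setups the conditioning is just a concatenation of the noise vector with an encoded label before the first layer. A minimal sketch (the vector sizes and the seven-class emotion label are illustrative assumptions, not from any reviewed paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def one_hot(label, num_classes):
    """Encode a class label c as a one-hot condition vector y."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

def conditioned_input(z, label, num_classes):
    """CGAN-style input: the noise z and the condition y are simply concatenated
    before being fed to the first layer of the generator (or the discriminator)."""
    return np.concatenate([z, one_hot(label, num_classes)])

z = rng.normal(size=100)                              # latent noise vector
g_in = conditioned_input(z, label=3, num_classes=7)   # e.g. 7 emotion classes
```

The same concatenation trick works for richer conditions, such as encoded audio features in the cross-modal models discussed later; projection-based conditioning replaces this concatenation with an inner product inside the discriminator.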
Another type of GAN model is the Laplacian Generative Adversarial Network (LAPGAN) (Denton et al., 2015), formed by combining CGAN models progressively within a Laplacian pyramid representation. LAPGAN includes a set of generative convolutional models G_1, . . . , G_K, and the synthesis procedure consists of two parts, a sampling phase and a training phase. The sampling phase starts with generator G_1, which takes a noise sample z_1 and generates sample x_g1. The generated sample is upscaled before being passed to the generator of the next level as a conditioning variable: G_2 takes both the upscaled version of x_g1 and a noise sample z_2 to synthesize a residual sample h_2, which is added to the upscaled version of x_g1. This process of upsampling and addition repeats across successive levels to yield a final full-resolution sample. Figure 3 illustrates the general block diagram of the LAPGAN model.

Fig. 3 Block diagram of the LAPGAN model; G_k: k-th generator, D_k: k-th discriminator, X_r1: real sample, X_rk: k-th real residual, X_gk: generated sample, O_1: output of the binary real/fake classification.

Stacked GAN (SGAN) is a second example, formed by top-down stacked GAN models (Huang et al., 2017b) to address the low performance of GANs in discriminative tasks with large variation in the data distribution. Huang et al. (2017b) exploit hierarchical representations in a discriminatively trained model by stitching GAN models together in a top-down framework, forcing the top-most level to take class labels and the bottom-most one to generate images. Alternatively, instead of stacking GANs on top of each other, Karras et al. (2017b) progressively increase the depth of both the generator and the discriminator by adding new layers. All of these models are developed within the conditional GAN framework (Denton et al., 2015; Huang et al., 2017b; Karras et al., 2017b).
Other models modify the input to the generator. For instance, in SPADE (Park et al., 2019) a segmentation mask is fed indirectly to the generator through an adaptive normalization layer instead of the standard input noise vector z. StyleGAN first maps z to an intermediate latent space, which helps avoid entangling the input latent space with the probability density of the training data.
In 2015, Radford et al. (2015) proposed the Deep Convolutional Generative Adversarial Network (DCGAN), in which both the generator and the discriminator are architecturally constrained convolutional networks. In this model, fully convolutional downsampling/upsampling layers replace the fully connected layers of vanilla GAN, alongside other architectural restrictions such as batch-normalization layers and the LeakyReLU activation function in all layers of the discriminator.
Further advancements include the spectral normalization layer, which adjusts the feature response criterion by normalizing the weights in the discriminator network. Residual connections are another approach introduced into GAN models, by Gulrajani et al. (2017) among others. While models like CGANs incorporate the conditional information vector simply by concatenation, others remodel the use of the conditional vector through a projection approach, leading to significant improvements in the quality of the generated samples.
The aforementioned GAN models build on Convolutional Neural Networks (CNNs). Along another line, GAN models have been developed on top of more recent deep learning models called Capsule Networks (CapsNets) (Sabour et al., 2017). Let v_k be the output vector of the k-th capsule in the final layer of a CapsNet, which represents the presence of a visual entity when classifying into one of K classes. Sabour et al. (2017) train with the CapsNet margin loss L_M, expressed as

L_M = Σ_k [ T_k max(0, m^+ − ||v_k||)^2 + λ (1 − T_k) max(0, ||v_k|| − m^−)^2 ],   (5)

where the margins m^+ and m^− are set to 0.9 and 0.1, respectively, and λ = 0.5 is a down-weighting factor that stops initial learning from shrinking the lengths of the capsule outputs in the final layer. The length ||v_k|| of each capsule in the final layer can then be viewed as the probability of the image belonging to class k, and T_k denotes the target label. CapsuleGAN (Jaiswal et al., 2018) is a GAN model based on CapsNets: the authors use a CapsNet in the discriminator instead of a conventional CNN. The final layer of this discriminator consists of a single capsule representing the probability that a sample is real or fake, and training uses the margin loss of Eq. (5) instead of the binary cross-entropy loss. The training criterion of CapsuleGAN is then formulated over L_M: the discriminator is trained to minimize L_M(D(x_r), T = 1) + L_M(D(G(z)), T = 0), and, practically, the generator is trained to minimize L_M(D(G(z)), T = 1) rather than −L_M(D(G(z)), T = 0). This eliminates the down-weighting factor λ in L_M when training the generator, which does not contain any capsules.
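The margin loss of Eq. (5) is straightforward to implement. The sketch below is an illustrative NumPy version operating directly on capsule output vectors (it is not the CapsNet routing code, only the loss):

```python
import numpy as np

M_POS, M_NEG, LAMBDA = 0.9, 0.1, 0.5  # margins m+ and m-, down-weighting factor lambda

def margin_loss(v, targets):
    """CapsNet margin loss of Eq. (5).

    v:       (K, d) array of final-layer capsule output vectors; ||v_k|| plays
             the role of the predicted probability of class k
    targets: (K,) one-hot target vector T, with T_k = 1 for the true class
    """
    lengths = np.linalg.norm(v, axis=1)
    present = targets * np.maximum(0.0, M_POS - lengths) ** 2
    absent = LAMBDA * (1.0 - targets) * np.maximum(0.0, lengths - M_NEG) ** 2
    return float(np.sum(present + absent))

# A confident, correct prediction: the true capsule is long, the other short,
# so both hinge terms are inactive and the loss is (near) zero.
v_good = np.array([[0.9, 0.0], [0.1, 0.0]])
t = np.array([1.0, 0.0])
```

A completely wrong prediction (true capsule length 0, wrong capsule length 1) is penalized by both terms, the second one down-weighted by λ.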

Variants by Discriminators
Stabilizing training and avoiding the mode collapse problem can be pursued by employing different loss functions for D. An entropy-based loss is proposed by Springenberg (2015) in the Categorical GAN (CatGAN), where the objective of the discriminator changes from real/fake classification to entropy-based class predictions. WGAN and an improved version, WGAN-GP (Gulrajani et al., 2017), are two GAN models whose discriminator loss is based on the Wasserstein distance. The Earth-Mover (EM) distance, or Wasserstein-1, is

W(P_r, P_g) = inf_{γ ∈ Π(P_r, P_g)} E_{(x, y) ~ γ}[ ||x − y|| ],   (7)

where Π(P_r, P_g) is the set of all joint distributions γ(x, y) whose marginals are, respectively, P_r and P_g. Here, γ(x, y) describes how much "mass" must be transported from x to y in order to transform the distribution P_r into P_g, and the EM distance is the "cost" of the optimal transport plan. Other models that adopt alternative loss metrics are the GAN based on Category Information (CIGAN), hinge loss, least-squares GAN (Mao et al., 2017), and f-divergence GAN (Nowozin et al., 2016). Further developments replace the encoder structure of the discriminator with an autoencoder, defining a new discriminator objective over the autoencoder loss distribution instead of the data distribution; Energy-based GAN (EBGAN) (Zhao et al., 2016) and Boundary Equilibrium GAN (BEGAN) (Berthelot et al., 2017) are examples of such frameworks. Figure 4 illustrates the block diagrams of GAN models obtained by modifying the discriminator. Another interesting model, proposed by Chen et al. (2016), is the Information Maximizing Generative Adversarial Net (InfoGAN), which modifies the discriminator to output both the fake/real classification result and the semantic features of x_g, illustrated as f in Figure 4(c).
The discriminator performs real/fake prediction while maximizing the mutual information between x_g and the conditional vector c. Other models, such as CIGAN and ACGAN (Odena et al., 2017), focus on improving the quality of the generated samples by employing class labels during synthesis and compelling D to provide entropy loss information as well as class probabilities. Figure 4(d) shows the structure of ACGAN.
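For one-dimensional samples, the Earth-Mover distance defined earlier has a closed form: the optimal transport plan simply pairs sorted samples. The sketch below (our own illustration, not WGAN training code) makes the key property visible, namely that W varies smoothly with a shift between two distributions even when their supports are disjoint, which is exactly where the JS divergence saturates:

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical Wasserstein-1 (Earth-Mover) distance between two 1-D sample
    sets of equal size: in 1-D the optimal transport plan pairs sorted samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    assert a.shape == b.shape, "this sketch assumes equally sized sample sets"
    return float(np.mean(np.abs(a - b)))
```

For samples [0, 1, 2] versus [3, 4, 5] the distance is exactly the shift, 3; halving the shift halves the distance, so a WGAN critic built on this metric still supplies a useful gradient where a JS-based discriminator would not.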

Variants by Generators
The objective of the generator is to transform the noise input vector z into a sample x_g = G(z). In the standard vanilla GAN, this is achieved by successively improving the quality of the generated sample, stopping when the desired quality is reached. The Variational AutoEncoder GAN (VAEGAN) (Larsen et al., 2015) is arguably the most popular GAN model obtained by varying the generator architecture. VAEGAN computes the reconstruction loss pixel-wise, and the decoder network of the VAE outputs patterns resembling the true samples (see Figure 5(b)).
One challenge in designing GAN models is controlling the attributes of the generated data, known as modes of the data. Using supplemental information allows sample generation with control over the modification of selected properties; the generator output then becomes x_g = G(z, c). GANs also lack the capability of interpreting the underlying latent space that encodes an input sample. ALI (Dumoulin et al., 2016) and BiGAN (Donahue et al., 2016) were proposed to resolve this problem by embedding an encoder network in the generator, as shown in Figure 5(a). Here, the discriminator performs real/fake prediction by distinguishing between the tuples (z_g, x_g) and (z_r, x_r), so the model can be categorized as a discriminator variant as well.
Other researchers have developed generators to solve specific tasks. Isola et al. (2017) designed pix2pix as an image-to-image translation network to study relations between two visual domains, and Milletari et al. (2016) proposed VNet with a Dice loss for image segmentation. The disadvantage of such networks was the need for aligned training with paired samples. In 2017, Zhu et al. and Kim et al. found solutions for unpaired image-to-image translation using a cycle consistency loss and cross-domain relations, respectively; the idea is to join two generators to perform translation between sets of unpaired samples. CycleGAN and UNIT are successful examples derived from the VAEGAN model, and Figure 6(c) illustrates the layout of the UNIT framework. It is important to highlight that, for the generators, the conditional input may vary from class labels (Mirza and Osindero, 2014) and text descriptions (Reed et al., 2016) to object locations, encoded audio features, or cross-modal correlations.

Applications in view of Human Emotion
In this section, we discuss applications of GAN models in human emotion synthesis. We categorize the related work into unimodal and cross-modal studies based on the audio and video modalities, to help the reader discover applications of interest without difficulty, and we explain each method in terms of the proposed algorithm and its advantages and disadvantages. Generally, applications of GANs to human emotion synthesis focus on two issues: data augmentation, which helps obviate the tedious job of collecting and labeling large-scale databases, and improving the performance of emotion recognition.

Facial Expression Synthesis
Facial expression synthesis using conventional methods confronts several important problems: most methods require paired training data, the generated faces are of low resolution, and the diversity of the generated faces is limited. The works reviewed in this section are drawn from the computer vision literature on facial expression synthesis.
One of the earliest works on facial expression synthesis was the study by Susskind et al. (2008), which could embed constraints like "raised eyebrows" in the generated samples. The authors build their framework upon a Deep Belief Network (DBN) that starts with two hidden layers of 500 units. The output of the second hidden layer is concatenated with the identity and a vector of the Facial Action Coding System (FACS) (Ekman and Friesen, 1978) to learn a joint model of them through a Restricted Boltzmann Machine (RBM) with 1000 logistic hidden units. The trained DBN model is then used to generate faces with different identities and facial Action Units (AUs).
Later, with the advent of GAN models, DyadGAN was designed specifically for face generation; it can generate facial images of an interviewer conditioned on the facial expressions of their dyadic conversation partner. ExprGAN (Ding et al., 2018) is another model designed to solve the problems mentioned above. ExprGAN can control both the target class and the intensity of the generated expression, from weak to strong, without needing training data annotated with intensity values. This is achieved by an expression controller module that encodes complex information, such as expression intensity, into a real-valued vector, and by introducing an identity-preserving loss function.
Other methods proposed before ExprGAN could synthesize facial expressions either by manipulating facial components in the input image (Yang et al., 2011; Mohammed et al., 2009; Yeh et al., 2016) or by using the target expression as auxiliary information (Susskind et al., 2008; Reed et al., 2014; Cheung et al., 2014).

Table 1: Comparison of facial expression image synthesis models; descriptions of the loss functions (L), metrics (M), databases (D), and purposes (P) used in the reviewed publications are given in Tables 2, 3, 4, and 6. M: metric, RS: results, RM: remarks. *: the method proposed by the authors; other listed methods are implemented by the authors for comparison. The result reported for expression classification accuracy (Mv2) belongs to the synthesized image datasets. †: Dv9 + Dv10 is used as the database. ‡: Dv9 + Dv11 is used as the database. All papers provide a visual representation of the synthesized images (Mv7).

Table 1 compares the reviewed publications based on the metrics, databases, loss functions, and purposes used by the researchers. Among the many subsequent variations of facial expression synthesis, the GAN-based model proposed by Song et al. (2018), called G2GAN, was one of the most interesting early ones. G2GAN generates photo-realistic and identity-preserving images and provides fine-grained control over the target expression and the facial attributes of the generated images, such as widening the smile of the subject or narrowing the eyes. The idea is to feed the face geometry into the generator as a condition vector that guides the expression synthesis procedure. The model consists of a pair of GANs, one removing the expression while the other synthesizes it, which enables unpaired training.
StarGAN (Choi et al., 2018) is the first scalable approach to multi-domain image-to-image translation using a unified GAN model (i.e., only a single generator and discriminator). In this model, a domain is defined as a set of images sharing the same attribute, where attributes are facial features like hair color, gender, and age that can be modified to a desired value; for example, one can set the hair color to blond or brown and the gender to male or female. Likewise, the Attribute editing GAN (AttGAN) (He et al., 2019) provides a framework that can edit any attribute in a set of face-image attributes by employing an adversarial loss, a reconstruction loss, and attribute classification constraints. DIAT (Li et al., 2016), CycleGAN, and IcGAN (Perarnau et al., 2016) can serve as baseline models for comparison.

Table 2: List of databases used for facial emotion synthesis in the reviewed publications.
Another VAEGAN-based model is the work of Lai and Lai (2018), which introduces a novel optimization loss called the symmetric loss. The symmetric loss helps preserve the symmetry of the face while translating various head poses to a frontal view, and the generator transforms non-frontal facial images into frontal ones while preserving both identity and emotion expression. Similar to Lai and Lai is FaceID-GAN (Shen et al., 2018a), where, in addition to the two players of vanilla GANs and the symmetry information, a face-identity classifier is employed as a third player that competes with the generator by distinguishing the identities of real and synthesized faces. Moreover, a recent publication (Vielzeuf et al., 2019) relies on a two-step GAN framework: the first component maps images to a 3D vector space issued from a neural network, representing the corresponding emotion of the image, and the second component, a standard image-to-image translator, uses the resulting 3D points to generate different expressions. The model provides fine-grained control over the synthesized discrete expressions through the continuous vector space representing arousal, valence, and dominance.
It should be noted that a series of GAN models focuses on 3D object/face generation; examples are the Convolutional Mesh Autoencoder (CoMA) (Ranjan et al., 2018), MeshGAN (Cheng et al., 2019), UVGAN (Deng et al., 2018), and MeshVAE (Litany et al., 2018). Despite the successful performance of GANs in image synthesis, they still fall short when dealing with 3D objects, and particularly with human face synthesis. We compare synthesized images of the aforementioned methods qualitatively in Figures 7 and 8, with images taken from the corresponding papers. As the images show, most of the generated samples suffer from blurring.
In addition to GAN-based models that synthesize single images, there are models that generate an image sequence or a video/animation. Video GAN (VGAN) (Vondrick et al., 2016) and Temporal GAN (TGAN) (Saito et al., 2017) were the first two models in this research line. Although these models could learn a semantic representation of unlabeled videos, they produced only fixed-length video clips. MoCoGAN was proposed by Tulyakov et al. to solve this problem; it is composed of four sub-networks: a recurrent neural network, an image generator, an image discriminator, and a video discriminator. A related method synthesizes the video given the pose sequence generated by a PSGAN, with the effect of noisy or abnormal poses between the generated and ground-truth poses reduced by semantic consistency; we list this method as PS/SCGAN in Table 5. It is worth mentioning that two recent and successful methods in video generation, MetaPix (Lee et al., 2019b) and MoCycleGAN, use motion and temporal information for realistic video synthesis; however, these methods have not been tested on facial expression generation. Table 5 lists the models developed for video or animation generation.

Table 4: List of evaluative metrics used for facial emotion synthesis in the reviewed publications.

Table 5: Comparison of facial expression video generation models; descriptions of the loss functions (L), metrics (M), databases (D), and purposes (P) used in the reviewed publications are given in Tables 2, 3, 4, and 6. M: metric, RS: results, RM: remarks. *: the method proposed by the authors; other listed methods are implemented by the authors for comparison. The result reported for expression classification accuracy (Mv2) belongs to the synthesized image datasets. All papers provide a visual representation of the synthesized images (Mv7). *†: TwoStreamVAN.
One of the main goals of synthesis is augmenting the number of available samples. Zhu et al. (2018) used GAN-based data augmentation to improve an imbalanced class distribution; the discriminator of their model is a CNN and the generator is based on CycleGAN. They report up to a 10% increase in classification accuracy (Mv2) from GAN-based data augmentation.
The objective functions, or optimization losses, fall into two groups: synthesis losses and classification losses. Although the definitions provided by the authors are not always clear, we have tried to list all the different losses used and propose a symbolic name for each, to bring some harmony to the literature. The losses are reported from a general point of view; that is, marking different papers with the classification loss (L7) in Table 1 does not necessarily mean that exactly the same loss function is used, but rather that a classification loss contributes in some way. A comprehensive list of these functions is given in Table 3. Additionally, we compare some of the video synthesis models in Figure 9.
Evaluation metrics for generative models differ from one study to another for several reasons (Hitawala, 2018). First, the quality of a synthesized sample is a perceptual concept and, as a result, cannot be measured exactly. Researchers usually provide their best synthesized samples for visual comparison, so problems like mode drop are not covered qualitatively. Second, employing human annotators to judge visual quality can cover only a limited number of data samples; specifically, in topics such as human emotion, experts are required for accurate annotation with the least possible labeling error, so approaches like Amazon Mechanical Turk are less reliable for classification based on those labels. Third, general metrics like photometric error, geometric error, and the inception score are not reported in all publications (Salimans et al., 2016). These problems make comparisons among papers either unfair or impossible.

Fig. 9 Visual comparison of the GAN models; images are courtesy of the reviewed papers.
The Inception Score (IS) can be computed as follows:

$\mathrm{IS} = \exp\Big(\mathbb{E}_{x_g \sim P_g}\big[\mathrm{KL}\big(p(y \mid x_g)\,\|\,p(y)\big)\big]\Big)$

where $x_g$ denotes a generated sample, $y$ is the label predicted by an arbitrary classifier, $p(y)$ is the marginal label distribution over generated samples, and $\mathrm{KL}(\cdot)$ is the KL divergence measuring the distance between probability distributions as defined in Eq. (1). Based on this score, an ideal model produces samples that are individually recognizable (a low-entropy $p(y \mid x_g)$) while remaining diverse across classes (a high-entropy $p(y)$), bringing them into close congruence with real data. In fact, KL divergence is a de-facto standard for training and evaluating generative models.
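As an illustration, the Inception Score can be estimated from classifier posteriors alone. The sketch below is a minimal NumPy version, assuming the class probabilities $p(y \mid x_g)$ have already been produced by a pretrained classifier; it is not the implementation of any reviewed paper.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Inception Score from classifier posteriors.

    p_yx: array of shape (N, C) holding p(y|x_g) for N generated samples.
    Returns exp of the mean KL divergence between each posterior and the
    marginal label distribution p(y)."""
    p_y = p_yx.mean(axis=0, keepdims=True)          # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

For two perfectly confident, evenly distributed classes the score approaches the number of classes; for fully ambiguous posteriors it drops to 1.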
Other widely used evaluation metrics are the Structural Similarity Index Measure (SSIM) and the Peak Signal-to-Noise Ratio (PSNR). SSIM is expressed as follows:

$\mathrm{SSIM}(x, y) = I(x, y) \cdot C(x, y) \cdot S(x, y)$

where I, C, and S are luminance, contrast, and structure, and they can be formulated as:

$I(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad C(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad S(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$

Here $\mu_x$, $\mu_y$, $\sigma_x$, and $\sigma_y$ denote the means and standard deviations of pixel intensity in a local image patch superimposed so that its center coincides with the center of the image. Typically, a patch is a square neighborhood of $n \times n$ pixels. Also, $\sigma_{xy}$ is the sample correlation coefficient between corresponding pixels in that patch. $C_1$, $C_2$, and $C_3$ are small constant values added for numerical stability.
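As a rough illustration, the sketch below computes a simplified SSIM over whole images rather than the standard sliding local window (e.g., 11x11 patches); with the common choice $C_3 = C_2/2$, the three-factor product collapses to the familiar two-term form used here. The constants $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ are the conventional defaults, not values taken from the reviewed papers.

```python
import numpy as np

def ssim_global(x, y, max_val=255.0):
    """Simplified SSIM computed over whole images instead of local
    patches. With C3 = C2/2 the luminance/contrast/structure product
    reduces to the two-term expression below."""
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()          # sample covariance
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)
```

Identical images give an SSIM of exactly 1; any structural disagreement lowers the score.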
PSNR, the peak signal-to-noise ratio, assesses the quality between two monochrome images $x_g$ and $x_r$. Let $x_g$ and $x_r$ be the generated image and the real image, respectively. Then:

$\mathrm{PSNR} = 10 \log_{10}\frac{\mathrm{MAX}_{x_r}^2}{\mathrm{MSE}(x_r, x_g)}$

where $\mathrm{MAX}_{x_r}$ is the maximum possible pixel value of the image and MSE stands for the Mean Squared Error. PSNR is measured in dB; generated images of better quality yield a higher PSNR.
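The PSNR definition above translates directly into code; a minimal NumPy sketch (the 8-bit default `max_val=255.0` is an assumption):

```python
import numpy as np

def psnr(x_r, x_g, max_val=255.0):
    """PSNR in dB between a real image x_r and a generated image x_g."""
    mse = np.mean((np.asarray(x_r, float) - np.asarray(x_g, float)) ** 2)
    if mse == 0:
        return float("inf")   # identical images: no noise at all
    return 10.0 * np.log10(max_val ** 2 / mse)
```

An all-zero image compared against an all-255 image hits the maximum possible error and thus 0 dB.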
In addition to metrics that evaluate the generated image, the Generative Adversarial Metric (GAM) proposed by Im et al. (2016) compares two GAN models by engaging them in a rivalry. In this metric, GAN models $M_1$ and $M_2$ are first trained. Then, model $M_1$ competes with model $M_2$ in a test phase by having $M_1$ try to fool the discriminator of $M_2$ and vice versa. In the end, two ratios are calculated using the discriminative scores of these models as follows:

$r_{test} = \frac{\epsilon\big(D_1(X_{test})\big)}{\epsilon\big(D_2(X_{test})\big)}, \qquad r_{sample} = \frac{\epsilon\big(D_1(G_2(z))\big)}{\epsilon\big(D_2(G_1(z))\big)}$

where $G_1$, $D_1$, $G_2$, and $D_2$ are the generators and discriminators of $M_1$ and $M_2$, respectively, and $\epsilon(\cdot)$ in Eq. (12) outputs the classification error rate. The test ratio $r_{test}$ shows which model generalizes better, because it discriminates based on $X_{test}$. The sample ratio $r_{sample}$ shows which model fools the other more easily, because the discriminators classify based on the synthesized samples of the opponent. The sample ratio and the test ratio can then decide the winning model: $M_1$ wins if $r_{sample} < 1$ and $r_{test} \simeq 1$, $M_2$ wins if $r_{sample} > 1$ and $r_{test} \simeq 1$, and the outcome is a tie otherwise.

To measure texture similarity, Peng and Yin (2019) simply calculated the correlation coefficient between $T_g$ and $T_r$, the texture of the synthesized image and the texture of its corresponding ground truth, respectively. Let $\rho$ be the texture similarity score. Then:

$\rho = \frac{\sum_{i,j}\big(T_g(i,j) - \mu_g\big)\big(T_r(i,j) - \mu_r\big)}{\sqrt{\sum_{i,j}\big(T_g(i,j) - \mu_g\big)^2 \sum_{i,j}\big(T_r(i,j) - \mu_r\big)^2}}$

where $(i, j)$ specifies pixel coordinates in the texture images, and $\mu_g$ and $\mu_r$ are the mean values of $T_g$ and $T_r$, respectively.

Other important metrics include the Fréchet Inception Distance (FID), Maximum Mean Discrepancy (MMD), the Wasserstein Critic, Tournament Win Rate and Skill Rating, and the Geometry Score. FID works by embedding the set of synthesized samples into a feature space using a certain layer of a CNN architecture. Then, means and covariances are estimated for both the synthesized and the real data distributions under the assumption that the embedding-layer activations follow a continuous multivariate Gaussian distribution.
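The texture similarity score $\rho$ described above is a Pearson correlation coefficient between the two texture images; a minimal NumPy sketch (not Peng and Yin's original code):

```python
import numpy as np

def texture_similarity(t_g, t_r):
    """Pearson correlation between a synthesized texture t_g and its
    ground-truth texture t_r, both given as 2D arrays."""
    g = t_g.ravel() - t_g.mean()
    r = t_r.ravel() - t_r.mean()
    return float((g * r).sum() / np.sqrt((g * g).sum() * (r * r).sum()))
```

Any affine rescaling of the ground truth yields a perfect score of 1, while an inverted texture scores -1.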
Finally, FID, the Wasserstein-2 distance between these Gaussians, quantifies the quality of the generated samples:

$\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$

Here, $(\mu_g, \Sigma_g)$ and $(\mu_r, \Sigma_r)$ represent the mean and covariance of the generated and real data distributions, respectively. A lower FID score indicates a smaller distance between the two distributions. MMD measures the dissimilarity between the two probability distributions using samples drawn from each distribution independently. The kernel MMD is expressed as follows:

$\mathrm{MMD}^2(P_r, P_g) = \mathbb{E}_{x_r, x'_r}\big[k(x_r, x'_r)\big] - 2\,\mathbb{E}_{x_r, x_g}\big[k(x_r, x_g)\big] + \mathbb{E}_{x_g, x'_g}\big[k(x_g, x'_g)\big]$

where $k$ is some fixed characteristic kernel function, such as the Gaussian kernel $k(x_r, x_g) = \exp\big(-\|x_r - x_g\|^2 / (2\sigma^2)\big)$, measuring the dissimilarity between the generated and real data distributions. Here, $x_r$ and $x'_r$ are samples drawn independently from the real data distribution $P_r$; similarly, $x_g$ and $x'_g$ are drawn from the model distribution $P_g$.
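As an illustration of the kernel MMD, the following NumPy sketch computes a biased estimate of $\mathrm{MMD}^2$ with a Gaussian kernel; the bandwidth `sigma` is a free parameter assumed here, not one prescribed by the reviewed papers.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) between rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x_r, x_g, sigma=1.0):
    """Biased estimate of squared kernel MMD between a batch of real
    samples x_r and a batch of generated samples x_g (rows = samples)."""
    k_rr = gaussian_kernel(x_r, x_r, sigma).mean()
    k_gg = gaussian_kernel(x_g, x_g, sigma).mean()
    k_rg = gaussian_kernel(x_r, x_g, sigma).mean()
    return float(k_rr + k_gg - 2 * k_rg)
```

Identical sample sets give an MMD of zero; separated sets give a strictly positive value.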
The Wasserstein Critic approximates the Wasserstein distance between the model distribution and the real data distribution. Let $P_r$ and $P_g$ be the real data and model distributions; then:

$W(P_r, P_g) = \sup_{\|f\|_L \le 1}\ \mathbb{E}_{x \sim P_r}\big[f(x)\big] - \mathbb{E}_{x \sim P_g}\big[f(x)\big]$

where $f : \mathbb{R}^D \to \mathbb{R}$ is a Lipschitz continuous function. In practice, the critic $f$ is a neural network with clipped weights and bounded derivatives (Borji, 2019), trained to produce high values for real samples and low values for generated ones, so the distance is approximated by:

$\hat{W}(X_{test}, X_g) = \frac{1}{N}\sum_{x \in X_{test}} f(x) - \frac{1}{N}\sum_{x \in X_g} f(x)$

where $X_{test}$ is a batch of test samples, $X_g$ is a batch of generated samples, and $f$ is the independent critic. An alternative version of this score, the Sliced Wasserstein Distance (SWD), estimates the Wasserstein-1 distance (see Eq. (7)) between real and generated images. SWD computes the statistical similarity between local image patches extracted from Laplacian pyramid representations of the images (Karras et al., 2017b).
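Once a critic is available, the empirical estimate above reduces to a difference of batch means. The sketch below uses a stand-in critic function for illustration; in practice $f$ would be the trained, weight-clipped network described above.

```python
import numpy as np

def critic_score(f, x_test, x_gen):
    """Empirical Wasserstein critic estimate:
    mean critic value on the real test batch minus the mean critic
    value on the generated batch. f maps a batch to per-sample scores."""
    return float(np.mean(f(x_test)) - np.mean(f(x_gen)))
```

A larger score means the critic separates real from generated samples more cleanly, i.e., a larger estimated distance between the distributions.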
For video generation metrics, content consistency is evaluated with the Average Content Distance (ACD), defined as the average pairwise L2 distance between the per-frame average feature vectors. In addition, the Motion Control Score (MCS) is suggested for assessing the motion-generation ability of the model: a spatio-temporal CNN is first trained on a training dataset and then classifies the generated videos to verify whether each generated video contains the required motion (e.g., an action or expression).
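The ACD definition above can be sketched directly; the example below takes a list of per-frame feature vectors (assumed already extracted by, e.g., a face-recognition network) and averages the pairwise L2 distances.

```python
import numpy as np

def average_content_distance(frame_feats):
    """ACD: average pairwise L2 distance between per-frame feature
    vectors of one generated video. Lower values indicate a more
    consistent content/identity across frames."""
    n = len(frame_feats)
    dists = [np.linalg.norm(frame_feats[i] - frame_feats[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```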
Other metrics include but are not limited to identification classification, true/false acceptance rate (Song et al., 2018), expression classification accuracy/error (Ding et al., 2018), real/fake classification accuracy/error (Ding et al., 2018), attribute-editing accuracy/error (He et al., 2019), and Fully Convolutional Network (FCN) scores. A list of the evaluation metrics used in the reviewed publications is given in Table 4. For a comprehensive survey of evaluation metrics for GAN models, we invite the reader to study "Pros and Cons of GAN Evaluation Measures" by Borji (2019).
Synthesis models are proposed with different aims and purposes; texture synthesis, image super-resolution, and image in-painting are some applications. In face synthesis, the most important goal is data augmentation for improved recognition performance. A complete list of such purposes and the model properties is given in Table 6.
Despite the numerous publications on image and video synthesis, some problems remain unsolved. For example, generating high-resolution samples is an open research problem: the output is usually blurry or impaired by checkered artifacts. Results obtained for video generation or synthesis of 3D samples are far from realistic. It is also important to highlight that the number of publications focused on expression classification is greater than the number employing identity recognition (Table 6 lists the purposes and characteristics of the GAN models used for facial emotion synthesis by the reviewed publications).

Speech Emotion Synthesis
Research efforts on synthesizing speech with emotional effect have continued for more than a decade. One application of GAN models in speech synthesis is speech enhancement. A pioneering GAN-based model for raw speech generation and enhancement is the Speech Enhancement GAN (SEGAN) proposed by Pascual et al. (2017). SEGAN provides a fast, non-recursive framework that works end-to-end (E2E) with raw audio. Learning from different speakers and noise types, and incorporating that information into a shared parameterization, is another contribution of the model. Similar to SEGAN, Macartney and Weyde (2018) propose a model for speech enhancement based on a CNN architecture called Wave-U-Net, which has been used successfully for audio source separation in music and for speech de-reverberation. As in Section 3.1, we compare the results of the reviewed papers in Table 7; additionally, Tables 8 to 11 present the databases, loss functions, assessment metrics, and characteristics used in speech synthesis. Sahu et al. (2018) made a two-fold contribution. First, they train a simple GAN model to learn a high-dimensional feature vector from the distribution of a lower-dimensional representation. Second, a cGAN is used to learn the distribution of the high-dimensional feature vectors by conditioning on the emotion label of the target class. The generated feature vectors are then used to assess the improvement in emotion recognition. They report that using cGAN-synthesized samples in the training set is helpful, and their experiments with synthesized samples in the test set suggest that estimating a lower-dimensional distribution is easier than estimating a high-dimensional, complex one. Employing the synthesized feature vectors from the IEMOCAP database in a cross-corpus emotion classification experiment on the MSP-IMPROV database is also reported to be successful.
Mic2Mic (Mathur et al., 2019) is another example of a GAN-based model for speech enhancement, addressing the challenging problem of microphone variability. Mic2Mic disentangles the variability problem from the downstream speech recognition task and minimizes the need for training data. Another advantage is that it works with unlabeled and unpaired samples from various microphones. The model treats microphone variability as a data translation from one microphone to another, reducing the domain shift between the training and test data, and is built on CycleGAN to keep the translated audio samples consistent. (Table 7 compares the speech emotion synthesis models; descriptions of the loss functions (L), metrics (M), databases (D), and purposes (P) used in the reviewed publications are given in Tables 8, 9, 10, and 11.) Gao et al. (2018) decompose each speech signal into two codes: a content code that represents emotion-invariant information and a style code that represents emotion-dependent information. The content code is shared across emotion domains and should be preserved, while the style code carries domain-specific information and should change. The extracted content code of the source speech and the style code of the target domain are combined at the conversion step. Finally, a GAN model is used to enhance the quality of the generated speech.
Another widely pursued research direction in speech synthesis is Voice Conversion (VC). Hsu et al. (2017) proposed a non-parallel VC framework called the Variational Autoencoding Wasserstein Generative Adversarial Network (VAWGAN). This method directly incorporates a non-parallel VC criterion into the objective function to build a speech model from unaligned data, and it improves the synthesized samples with more realistic spectral shapes. Even though VAE-based approaches can work without parallel data and aligned corpora, they have notable drawbacks. First, it is difficult to learn the time dependencies in the acoustic feature sequences of the source and target speech. Second, the decoder of a VAE tends to output over-smoothed results. To overcome these limitations, Kameoka et al. (2018a) adopted fully convolutional neural networks to learn conversion rules that capture short-term and long-term dependencies. Also, by transplanting the spectral details of the input speech into its converted version at the test phase, the proposed model avoids producing buzzy speech. Furthermore, to prevent losing class information during the conversion process, an information-theoretic regularizer is used.
In 2018, Kaneko and Kameoka made two modifications to the CycleGAN architecture to make it suitable for the voice conversion task, naming the modified architecture CycleGAN-VC. Representing speech with Recurrent Neural Networks (RNNs) is effective due to the sequential and hierarchical structure of speech; however, RNNs are computationally demanding with respect to parallel implementation. As a result, they used gated CNNs, which are proven to be successful both in parallelization over sequential data and in achieving high performance. The second modification is the use of an identity loss to preserve linguistic information. Here, a 1D CNN is used as the generator and a 2D CNN as the discriminator, to focus on 2D spectral texture. Later, they released CycleGAN-VC2 (Kaneko et al., 2019), an improved version of CycleGAN-VC that narrows the large gap between the real target and the converted speech. The architecture is altered by using a 2-1-2D CNN for the generator and PatchGAN for the discriminator; the objective function is also improved with a two-step adversarial loss. It is known that downsampling and upsampling severely degrade the original structure of the data. To alleviate this, the 2-1-2D CNN generator uses 2D convolution for downsampling and upsampling and only 1D convolution for the main conversion process. Another difference is that while CycleGAN-VC uses a fully connected CNN as its discriminator, CycleGAN-VC2 uses PatchGAN, whose last layer employs convolution to make a patch-based decision about the realness of samples. The difference in objective functions between these two models is reported in Table 7. They report that CycleGAN-VC2 outperforms its predecessor on the same database.
To overcome the shortcomings of CVAE-VC (Kameoka et al., 2018a) and CycleGAN-VC, the StarGAN-VC method (Kameoka et al., 2018b) combines the two to address non-parallel many-to-many voice conversion. While CVAE-VC and CycleGAN-VC require the attribute of the input speech to be known at test time, StarGAN needs no such information. Other GAN-based methods for VC, like WaveCycleGAN-VC and WaveCycleGAN-VC2 (Tanaka et al., 2019b), rely on learned filters that prevent quality degradation by overcoming the over-smoothing effect, which degrades the resolution of the acoustic features of the generated speech signal. WaveCycleGAN-VC uses cycle-consistent adversarial networks to convert synthesized speech to a natural waveform. Its drawback is aliasing distortion, which WaveCycleGAN-VC2 avoids by adding an identity loss. Conventional methods like VAEs, cycle-consistent GANs, and StarGAN share a common limitation: instead of converting prosodic features such as the fundamental-frequency contour, they convert spectral features frame by frame. A fully convolutional sequence-to-sequence (seq2seq) learning approach is proposed by Kameoka et al. (2018c) to solve this problem. Generally, all inputs of a seq2seq model must be encoded into a fixed-length vector. To avoid this limitation, the authors use an attention-based mechanism that learns where to pay attention in the input sequence for each output sequence. The advantage of seq2seq models is that they can transform a sequence into another, variable-length sequence.
The proposed model is called ConvS2S (Kameoka et al., 2018c) and its architecture comprises a pair of source and target encoders, a pair of source and target re-constructors, one target decoder, and a PostNet. The PostNet aims to restore the linear frequency spectrogram from its Mel-scaled version.
Similar to ConvS2S is ATTS2S-VC (Tanaka et al., 2019a), which employs attention and context-preservation mechanisms in a seq2seq-based VC system. Although this method addresses the aforementioned problems, it performs worse than CVAE-VC, CycleGAN-VC, CycleGAN-VC2, and StarGAN. An ablation study is required to evaluate each component of seq2seq methods with respect to this performance degradation.
Despite the promising performance of deep neural networks, they are highly susceptible to malicious attacks using adversarial examples. An adversarial example is created by adding an imperceptible perturbation with the intention of eliciting a wrong response from a machine learning model. Latif et al. (2017) studied how adversarial examples can be used to attack speech emotion recognition (SER) systems, proposing the first black-box adversarial attack on SER systems that directly perturbs speech utterances with small, imperceptible noise. The authors then perform emotion classification on cleaned audio utterances, removing the adversarial noise with a GAN model, to show that a GAN-based defense stands up better to adversarial examples. Other examples of malicious attacks are simulated spoofing attacks and cloning Obama's voice using GAN-based models and low-quality data (Lorenzo-Trueba et al., 2018a). The next target application of speech synthesis is data augmentation: increasing the amount and diversity of data to compensate for its scarcity in certain cases. Data augmentation can improve the generalization behavior of classifiers; despite its importance, only a few papers have contributed fully toward this concept.
One of the studies on data augmentation for SER improvement is the work of Sheng et al. (2018), who used a variant of the cGAN model that works at the frame level with two different conditions. The first condition is the acoustic state of each input frame, which is combined as a one-hot vector with the noise input and fed into the generator; the same vector is combined with real noisy speech and fed into the discriminator. The second condition is the pairing of speech samples during training: parallel paired data is used, e.g., original clean speech paired with manually added noisy speech, or a close-talk speech sample paired with far-field recorded speech. The discriminator learns the naturalness of a sample based on the paired data.
Table 11. List of purposes and characteristics used for speech synthesis by the reviewed publications:
Pa1: is tested for data augmentation
Pa2: is designed for speech enhancement
Pa3: is designed for non-parallel, identity-preserving VC
Pa4: is designed for defense against malicious adversarial attacks
Pa5: is designed for non-parallel many-to-many identity VC
Pa6: performs emotion conversion
Pa7: performance improvement
Pa8: generates vocoder-less-sounding speech
Pa9: alleviates the aliasing effect
Pa10: voice conversion (VC)
Pa11: fully convolutional sequence-to-sequence
Pa12: modifies prosodic features of voice
Pa13: uses an attention-based mechanism
Pa14: generates spectrograms with high quality

Another study with more focus on the improvement of SER is by Chatziagapi et al. (2019). They adopt a cGAN called Balancing GAN (BAGAN) (Mariani et al., 2018) and improve it to generate synthetic spectrograms for the minority, or underrepresented, emotion classes. The authors modified the architecture of BAGAN by adding two dense layers to the original generator; these layers project the input to a higher dimensionality. The discriminator is also changed, using double strides to increase the height and width of the intermediate tensors, which affects the quality of the generated spectrogram remarkably.
Other interesting applications, like cross-language emotion transfer and singing voice synthesis, are also investigated by various researchers. However, these applications are not yet thoroughly studied and hold plenty of potential for further research. One such example is ET-GAN (Jia et al., 2019), which uses a cycle-consistent GAN to learn language-independent emotion transfer from one emotion to another without requiring parallel training samples.
Some works are dedicated to speech synthesis in the frequency domain, since long-range dependencies are difficult to model in the time domain. For instance, the MelNet model (Vasquez and Lewis, 2019) shows that such dependencies can be modeled more tractably in two-dimensional (2D) time-frequency representations such as spectrograms. By coupling the 2D spectrogram representation and an autoregressive probabilistic model with a multi-scale generative model, the authors synthesized high-fidelity audio samples. The model captures local and global structure at time scales that time-domain models have yet to achieve; in a MOS comparison between MelNet and WaveNet, MelNet won with 100% of the votes for sample-quality preference.
For speech synthesis in the feature domain, several pieces of research are presented under the VC application. For instance, Juvela et al. (2018) proposed generating speech from filterbank Mel-frequency cepstral coefficients (MFCC). The method starts by predicting the fundamental frequency (F0) and intonation information from the MFCCs using an autoregressive model. Then, a pitch-synchronous excitation model is trained on the all-pole filters obtained, in turn, from the spectral envelope information in the MFCCs. Finally, a residual GAN-based noise model adds a realistic high-frequency stochastic component to the modeled excitation signal. The Degradation Mean Opinion Score (DMOS) is used to evaluate the quality of the synthesized speech samples.
To evaluate the local and global structure of the generated samples, researchers employ various metrics. In general, the Mean Opinion Score (MOS) and Perceptual Evaluation of Speech Quality (PESQ) are used widely, while other efficient metrics like Mel-cepstral distortion (MCD) and modulation spectra distance (MSD) appear less often in the literature. In the following, we provide a brief explanation of each metric in the hope of more cohesive future comparisons.
MOS is a quality rating method based on a subjective quality evaluation test: the quality of a given stimulus is assessed by human subjects. Each user rates their Quality of Experience (QoE) as a number within a categorical range from 1 ("Bad", the lowest perceived quality) to 5 ("Excellent", the highest perceived quality). MOS is expressed as the arithmetic mean over the QoEs.

$\mathrm{MOS} = \frac{1}{N}\sum_{n=1}^{N} r_n$

where $N$ is the total number of subjects contributing to the evaluation and $r_n$ is the QoE rating of subject $n$ for the stimulus. MOS is subject to certain biases; the number of subjects in the test and the content of the samples under assessment are some of the problems. The ordinal categories encode a wide range of perceptions, which is why MOS is considered an absolute measure of total quality, regardless of any specific quality dimension. This is useful in many applications related to communications; for other applications, a measure more sensitive to specific quality dimensions is preferable. Other biases include the user's expectation of quality changing over time, and the question of the smallest MOS difference that is perceptible to users and can actually support the claim that one method is better than another. For example, Pascual et al. and Macartney and Weyde achieved MOS values of 3.18 and 2.41 on the same database, a naive comparison yielding a 0.77 MOS difference in favor of the former method; however, one question here is whether the test samples and the number of subjects were the same. This becomes more interesting when comparing the methods proposed in Tanaka et al. (2018) and Tanaka et al. (2019b), where the authors achieved only a 0.11 MOS difference on the same database. Although MOS is time-consuming, it is applicable to different quantities: alongside the MOS of voice quality, the MOS of signal distortion and the MOS of background-noise intrusiveness are used as metrics. PESQ is an objective speech quality assessment based on subjective quality ratings; in effect, it automatically estimates the MOS of subjective perception. PESQ integrates disturbance over several time-frequency scales, using a method that takes optimal account of the distribution of error in time and amplitude (Rix et al., 2001; Recommendation, 2001).
The disturbance values are aggregated using an $L_p$ norm:

$\|d\|_p = \Big(\sum_{n} |d_n|^p\Big)^{1/p}$

Summing disturbance across frequency with an $L_p$ norm gives a frame-by-frame measure of perceived distortion. The subjective listening tests were designed to reduce the uncertainty arising from the listener's decision by specifying which of the three components of a noisy speech signal should form the basis of the rating of overall quality: the speech signal, the background noise, or both. In this method the listener successively attends to and rates the synthesized speech sample on: a) the speech signal alone, using a five-point scale of signal distortion (SIG); b) the background noise alone, using a five-point scale of background intrusiveness (BAK); and c) the overall quality, using the mean opinion score scale (OVRL). The SIG, BAK, and OVRL scales are described in Table 12. The Signal-to-Noise Ratio (SNR) can be expressed as follows:

$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{i=1}^{N} x(i)^2}{\sum_{i=1}^{N} \big(x(i) - y(i)\big)^2}$

where $x(i)$ and $y(i)$ are the $i$th real and synthesized samples and $N$ is the total number of samples. The Segmental Signal-to-Noise Ratio (SSNR/SegSNR) is the average of the SNR values of short segments (15 to 20 ms):

$\mathrm{SSNR} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{i=Nm+1}^{Nm+N} x(i)^2}{\sum_{i=Nm+1}^{Nm+N} \big(x(i) - y(i)\big)^2}$

where $N$ and $M$ are the segment length and the number of segments, respectively. SSNR tends to provide better results than SNR for waveform encoders, and SSNR results are generally poor on vocoders.
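The SNR and SegSNR definitions above can be sketched in a few lines of NumPy; the segment length of 320 samples below is an assumption (20 ms at a 16 kHz sampling rate), not a value fixed by the reviewed papers.

```python
import numpy as np

def snr_db(x, y):
    """Global SNR in dB between real signal x and synthesized signal y
    (assumes x != y somewhere, so the error energy is nonzero)."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2))

def seg_snr_db(x, y, seg_len=320):
    """Segmental SNR: mean of per-segment SNRs over non-overlapping
    segments of seg_len samples (~15-20 ms of audio)."""
    snrs = [snr_db(x[i:i + seg_len], y[i:i + seg_len])
            for i in range(0, len(x) - seg_len + 1, seg_len)]
    return float(np.mean(snrs))
```

When every segment has the same signal-to-error ratio, SegSNR equals the global SNR; real signals diverge because quiet segments pull SegSNR down.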
Other objective measurements include MCD, which evaluates the distance between the target and converted Mel-cepstral coefficient (MCEP) sequences. MSD assesses local structural differences by calculating the root mean square error between the target and converted logarithmic modulation spectra of MCEPs, averaged over all MCEP dimensions and modulation frequencies. For both metrics, smaller values indicate lower distortion between the real and converted speech. It is worth noting that some successful methods, like GANSynth (Engel et al., 2019), are not covered in this paper because they focus on musical note synthesis.

Audio-Visual Emotion Synthesis
Although GAN models achieve impressive performance on single-domain and cross-domain generation, they have had less success in cross-modal generation due to the lack of a common distribution between heterogeneous data. In cross-domain generation, one generates data samples of various styles from the same modality, so the generated sample and its original counterpart share a common shape structure. In cross-modal generation, however, the pair of samples have heterogeneous features with quite different distributions.
In this section, we investigate the cross-modal research line in which audio and video provide applications like talking heads, audio-video synchronization, facial animation, and visualizing the face of an unseen subject from their voice. Note that other modalities like text (Reed et al., 2016; Gu et al., 2018; Stanton et al., 2018) and biological signals (Palazzo et al., 2017) are also used in combination with audio and video; however, those modalities are beyond the scope of this review. The Vid2Speech model uses neighboring video frames to generate sound features for each frame; speech waveforms are then synthesized from the learned speech features. In 2017, the same authors designed a two-tower CNN framework that reconstructs a natural-sounding speech signal from silent video frames of the speaking person. Their model shows that using one modality to generate samples of another is indeed useful, because it allows natural supervision, meaning that segmentation of the recorded video frames and recorded sound is not required. The two-tower CNN relies on improving the performance of a residual neural network (ResNet) used as an encoder and on redesigning a CNN-based decoder. One of the foremost cross-modal GAN-based models is proposed by Chen et al. (2017). They explored the performance of cGANs with various audio-visual encodings for generating the sound or the player of a musical instrument from the pose of the player or the sound of the instrument. This model is not tested on any emotional database and hence is not listed in Table 13. Another leading and interesting research work is conducted by Suwajanakorn et al. (2017): an RNN is trained on weekly audio footage of President Barack Obama to map raw audio features to mouth shapes, and in the end a high-quality video with accurate lip synchronization is synthesized. The model can control fine details like lip texture and mouth pose.
Speech-driven video synthesis is the next application. The X2Face model proposed by Wiles et al. (2018) uses a facial photo or a sample from another modality (e.g., audio) to modify the pose and expression of a given face for video/image editing. The model is trained in a self-supervised fashion on two inputs: a source sample (video) and a driving sample (video, audio, or a combination). The generated sample inherits the identity and style (e.g., hairstyle) of the source sample and takes the pose and expression of the driving sample. The authors employ an embedding network that factorizes the face representation of the source sample and applies face frontalization. Unfortunately, the authors report only visual samples, and no further metric is used for comparison. Another noteworthy work in speech-driven video synthesis is presented by Vougioukas et al. (2018). They propose an E2E temporal GAN that captures facial dynamics and generates synchronized mouth movements and fine-detailed facial expressions such as eyebrow raises, frowns, and blinks. The authors pair still images of a person with audio speech to generate subject-independent realistic videos, using the raw speech signal as the audio input. The model includes one generator, comprising an RNN-based audio encoder, an identity image encoder, a frame decoder, and a noise generator, and two discriminators: a frame discriminator that classifies frames as real or fake, and a sequence discriminator that distinguishes real videos from fake ones. Evaluating the generated samples at both frame level and video level helps generate high-quality frames while keeping the video synchronized with the audio. In 2019, Vougioukas et al. modified their previous work to generate speech-driven facial animations.
This E2E model can generate lip movements synchronized with the speech audio and offers fine control over facial expressions like blinking and eyebrow movement. Duarte et al. (2019) proposed the Wav2Pix model, which generates the facial image of a speaker without prior knowledge of the face identity, by conditioning on the raw speech signal of that person. The model uses a Least-Squares GAN and SEGAN to preserve the identity of the speaker half of the time. The images generated by this model are of low quality (see Figure 10), and the model is sensitive to several factors, such as the dimensionality and quality of the training images and the duration of the speech chunk.
Likewise, Jamaludin et al. (2019), in "You said that?: Synthesising Talking Faces from Audio", designed the Speech2Vid model, which takes still images of the target face and an audio speech segment as input. The model synthesizes a video of the target face with lips synchronized to the speech signal. It consists of a VGG-M as an audio encoder, a VGG-Face as an identity image encoder, and a VGG-M in reverse order as a talking-face image decoder. Instead of raw speech data, the audio encoder uses MFCC heatmap images. The network is trained with the usual adversarial loss between the generated image and the ground truth, plus a content representation loss.

Table 17 lists the purposes and characteristics used for cross-modal synthesis by the reviewed publications:
Pav1: read speech
Pav2: controlling the pose and expression of a face based on the audio modality
Pav3: generating high-quality mouth texture
Pav4: generating the unseen face of a subject using raw speech
Pav5: generating a lip-synchronized video of a talking face
Pav6: generating an intelligible speech signal from a silent video of a speaking person

One of the most important models developed in the cross-modal community is the SyncGAN model, capable of successfully generating synchronous data. A common problem of the aforementioned cross-modal GAN models is that they are one-directional, since they learn the transfer between different modalities; this means they cannot generate a pair of synchronous data from both modalities simultaneously. SyncGAN addresses this problem by learning bidirectionally from a synchronous latent space representing the cross-modal data. In addition to the general generator and discriminator of the vanilla GAN, the model uses a synchronizer network that estimates the probability that two input data samples come from the same concept.
This network is trained on synchronous and asynchronous data samples to maximize a synchronization loss that rewards correctly classifying whether a pair of inputs belongs to the same concept. Similarly, CMCGAN (Hao et al., 2018) is a cross-modal CycleGAN that handles the mutual generation of cross-modal audio-visual data: given an image or a sound sample of a musical instrument, it outputs a log-mel spectrogram (LMS) of the sound or an image of the player. Unfortunately, neither SyncGAN nor CMCGAN is tested on a multi-modal emotional database.
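Several of the reviewed models operate on time-frequency representations rather than raw audio, e.g., the MFCC heatmaps used by Speech2Vid and the log-mel spectrograms (LMS) produced by CMCGAN. The following NumPy sketch computes a log-mel spectrogram; the frame size, hop length, and filterbank construction are illustrative choices, not the settings of any reviewed model.

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Compute a log-mel spectrogram (LMS) from a raw waveform."""
    # Frame the signal and apply a Hann window.
    window = np.hanning(n_fft)
    frames = np.array([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame, shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope

    # Log-compressed mel energies, shape (n_frames, n_mels).
    return np.log(power @ fbank.T + 1e-10)

# Example: one second of a 440 Hz tone at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
lms = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

Taking the discrete cosine transform of the rows of such an LMS would yield MFCCs, the representation Speech2Vid renders as heatmap images.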
In Figure 10, we compare the generated samples of the reviewed publications qualitatively.

Discussion
In this section, we discuss concepts related to GAN-based emotion synthesis that have not yet been explored thoroughly in the literature. Also, despite the active development of GAN models, open research problems remain, such as mode collapse, convergence failure, and vanishing gradients. In the following, we discuss these problems. The evolution of GAN models is shown in Figure 11.

Disadvantages
The most important drawback of GAN models is mode drop or mode collapse. Mode collapse occurs when the generator learns to produce only a limited variety of samples out of the many modes present in the training data. Roth et al. (2017) attempted to solve the mode collapse problem by stabilizing the training procedure through regularization. Numerical analysis of general algorithms for training GANs showed that not all training methods actually converge (Mescheder et al., 2018), which leads to the mode collapse problem. Several objective functions (Berthelot et al., 2017; Ghosh et al., 2018) have been developed to tackle this problem; however, none have solved it thoroughly. GANs also suffer from convergence failure, which happens when the model parameters oscillate and cannot stabilize during training. In the minimax game, convergence occurs when the discriminator and the generator reach the optimal point under a Nash equilibrium. A Nash equilibrium is the situation in which no player can improve its outcome by unilaterally changing its own action.
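To make the minimax game concrete, the following self-contained NumPy sketch trains a scalar generator g(z) = wz + b against a logistic discriminator on 1-D Gaussian data with alternating gradient steps. It is a toy illustration of adversarial training, not any reviewed architecture; all hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Real data: 1-D samples from N(3, 0.5). Generator g(z) = w*z + b maps
# N(0, 1) noise to fakes; discriminator D(x) = sigmoid(a*x + c).
w, b = 1.0, 0.0        # generator parameters
a, c = 0.0, 0.0        # discriminator parameters
lr, batch = 0.05, 64

for step in range(2000):
    real = rng.normal(3.0, 0.5, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = w * z + b

    # Discriminator step: maximize log D(real) + log(1 - D(fake)).
    d_real = sigmoid(a * real + c)
    d_fake = sigmoid(a * fake + c)
    grad_logit = np.concatenate([-(1 - d_real), d_fake])   # dLoss/dlogit
    xs = np.concatenate([real, fake])
    a -= lr * np.mean(grad_logit * xs)
    c -= lr * np.mean(grad_logit)

    # Generator step: non-saturating loss, minimize -log D(fake).
    d_fake = sigmoid(a * fake + c)
    g_logit = -(1 - d_fake)            # dLoss/dlogit at the fake samples
    w -= lr * np.mean(g_logit * a * z)
    b -= lr * np.mean(g_logit * a)

# After training, generated samples should have drifted toward the data mean.
fake_mean = float(np.mean(w * rng.normal(0.0, 1.0, 10000) + b))
```

Even in this tiny setting, the parameters oscillate rather than settle exactly, which is the convergence behavior discussed above.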
It is known that if the discriminator performs too accurately, the generator fails due to vanishing gradients: the discriminator no longer provides informative gradients for the generator to continue the learning process.
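This effect can be made concrete in terms of the discriminator logit. With the original saturating loss log(1 - D(G(z))), the gradient magnitude with respect to the logit equals D(G(z)) and vanishes as the discriminator becomes confident; the non-saturating alternative -log D(G(z)) keeps it near one. A small numeric sketch:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Discriminator logits on fake samples; large negative logits mean the
# discriminator confidently rejects the fakes (D(G(z)) close to 0).
logits = np.array([0.0, -2.0, -5.0, -10.0])
d = sigmoid(logits)                 # D(G(z))

# Gradient magnitudes w.r.t. the logit:
#   saturating loss   log(1 - D):  |grad| = D      -> vanishes as D -> 0
#   non-saturating   -log D:       |grad| = 1 - D  -> stays close to 1
grad_saturating = d
grad_non_saturating = 1.0 - d
```

As the discriminator grows confident (logit -10), the saturating gradient collapses to nearly zero while the non-saturating gradient stays near one, which is why the non-saturating loss is the usual practical choice.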

Open Research Problems
In addition to the theoretical problems mentioned in Section 4.1, GANs have task-based limitations. For instance, GANs cannot directly synthesize discrete data like one-hot coded vectors. Although this problem is addressed partially in some research works (Kusner and Hernández-Lobato, 2016; Jang et al., 2016; Maddison et al., 2016), it needs more attention to unlock the full potential of GAN models. A series of novel divergence algorithms, like FisherGAN, try to improve the convergence of GAN training. This area deserves more exploration through the study of families of integral probability metrics.
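The Gumbel-softmax (concrete) relaxation of Jang et al. (2016) and Maddison et al. (2016) cited above replaces a hard categorical sample with a differentiable, approximately one-hot vector, which is what makes discrete outputs trainable by gradient descent. A minimal NumPy sketch, with the temperature value chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Draw a differentiable, approximately one-hot sample from a
    categorical distribution; lower temperature tau -> closer to one-hot."""
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    # Numerically stable softmax over the last axis.
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)

# Example: 5000 relaxed samples from a 3-class distribution (0.5, 0.3, 0.2).
logits = np.log(np.array([0.5, 0.3, 0.2]))
samples = gumbel_softmax(np.tile(logits, (5000, 1)), tau=0.5)
```

The argmax of each relaxed sample follows the original categorical distribution exactly, while the soft values allow gradients to flow to the logits during GAN training.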
The objective of a GAN model is to generate new samples that come from the same distribution as the training data. However, GANs do not explicitly model the distribution that generated the training examples; as a result, they have neither a prior likelihood nor a well-defined posterior. The open question here is how one can estimate the uncertainty of a well-trained generator.
Considering the emotion synthesis domain, some problems have only been studied partially. First of all, data augmentation is not fully explored. At the time of writing, there is no large-scale image database generated artificially using GAN models and released for public use. Such a database could be compared with existing databases in terms of classification accuracy and quality. Although methods like GANimation and StarGAN successfully generate all sorts of facial expressions, generating a fully labeled database requires further processing. For example, the synthesized samples should be annotated and tested against a ground truth such as facial Action Units (AUs) to confirm that the generated samples carry the predefined standards of a specific emotional class. This issue becomes much more complicated when one deals with compound emotions rather than only the basic discrete emotions. Also, generated samples are not evaluated within a continuous space considering the arousal, valence, and dominance properties of the emotional state. Finally, despite the fact that some successful GAN models have been proposed for video generation, the results are not yet realistic.
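As a sketch of the AU-based validation suggested above, one can check whether the AUs detected on a generated face contain the prototypical AUs of the intended emotion label. The prototypes below follow common EMFACS-style descriptions (e.g., happiness = AU6 + AU12); the detector output is a placeholder, and the subset test is one deliberately simple acceptance criterion.

```python
# Prototypical AU sets for a few basic emotions (EMFACS-style descriptions).
PROTOTYPES = {
    "happiness": {6, 12},          # cheek raiser + lip corner puller
    "surprise": {1, 2, 5, 26},     # brow raisers + upper lid raiser + jaw drop
    "sadness": {1, 4, 15},         # inner brow raiser + brow lowerer + lip corner depressor
}

def matches_label(detected_aus, label, prototypes=PROTOTYPES):
    """Accept a generated face if the AUs an external detector fired on
    contain all prototypical AUs of the intended emotion label."""
    return prototypes[label].issubset(set(detected_aus))

# Example: a generated "happiness" sample whose detector fires on AU6, 12, 25.
accepted = matches_label([6, 12, 25], "happiness")
rejected = matches_label([1, 2], "surprise")
```

In a real annotation pipeline, the detected AUs would come from a trained AU detector, and intensity thresholds and compound-emotion prototypes would be needed on top of this subset check.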
In the case of speech emotion synthesis, the majority of papers focus on raw speech and spectrograms; as a result, feature-based synthesis is less explored. The human-likeness of the generated speech samples is another open discussion in this research direction. Furthermore, evaluation metrics in this field are less developed, and mostly metrics from traditional speech processing are applied to the generated results. Research works focusing on cross-modal emotion generation number only a few publications; this research direction requires both developing new ideas and improving the results of previous models.

Applications
One important application of GAN models in the computer vision community is synthesizing Super-Resolution (SR) or photo-realistic images. For example, SRGAN (Ledig et al., 2017) and Enhanced SRGAN (Wang et al., 2018b) generate photo-realistic natural images at large upscaling factors. In facial synthesis, these applications include manipulating the facial pose using DRGAN (Tran et al., 2017) and TPGAN (Huang et al., 2017a), generating a facial portrait (Yi et al., 2019a), generating the face of an artificial subject or manipulating the facial attributes of a specific subject (Radford et al., 2015; Choi et al., 2018), and synthesizing or manipulating fine-detail facial features like skin, lip, or teeth texture (Suwajanakorn et al., 2017). Generally speaking, the applications of GANs in the visual modality can be categorized into texture synthesis, image super-resolution, image inpainting, face aging, face frontalization, human image synthesis, image-to-image translation, text-to-image, sketch-to-image, image editing, and video generation. Some specific applications with respect to emotional video generation include the synthesis of talking heads (Tulyakov et al., 2018), (Pumarola et al., 2018).
In the case of speech emotion synthesis, as mentioned before, these applications can be categorized into speech enhancement, data augmentation, and voice conversion. Other directions like feature learning, imitation learning, and reinforcement learning are important research directions for the near future.

Conclusion
In this paper, we survey the state of the art in human emotion synthesis using GAN models. GANs were first proposed in 2014 by Goodfellow et al. The core idea of GANs is based on a zero-sum game from game theory. Generally, a GAN model consists of a generator and a discriminator, which are trained iteratively in an adversarial manner, approaching a Nash equilibrium. Instead of explicitly estimating the distribution of real data samples, GANs learn to synthesize samples that follow that distribution. Fields like computer vision, speech processing, and natural language processing benefit from the ability of GANs to generate virtually unlimited new samples from the underlying distributions.
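The zero-sum game mentioned above corresponds to the minimax objective of Goodfellow et al. (2014):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Here the discriminator D maximizes V while the generator G minimizes it; at the optimum, the generated distribution matches p_data and D outputs 1/2 everywhere.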