TLEFuzzyNet: Fuzzy Rank based ensemble of Transfer Learning models for Emotion Recognition from Human Speeches

Human speech is not only a verbal medium of communication; it also conveys emotions. The past decade has seen considerable research on speech data, which is especially important for human-computer interaction as well as healthcare, security and entertainment. This paper proposes the TLEFuzzyNet model, a three-stage pipeline for emotion recognition from speech. The first stage performs data augmentation of the speech signals and extracts Mel spectrograms. The second stage uses three pre-trained transfer learning CNN models, namely ResNet18, Inception_v3 and GoogleNet, whose prediction scores are fed to the third stage. In the final stage, we assign fuzzy ranks using a modified Gompertz function, which gives the final prediction scores after considering the individual scores from the three CNN models. We have used the Surrey Audio-Visual Expressed Emotion (SAVEE), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Berlin Database of Emotional Speech (EmoDB) datasets to evaluate the TLEFuzzyNet model, which has achieved state-of-the-art performance and is hence a dependable framework for speech emotion recognition (SER). The code is available on GitHub: https://github.com/KaramSahoo/SpeechEmotionRecognitionFuzzy.


I. INTRODUCTION
Speech emotion recognition (SER) has been gaining popularity within the research community for the past few decades and has potential in the domain of human-computer interaction, along with multimedia and biomedical applications, to name a few. Speech is one of the main forms of relaying information to our surroundings, hence a detailed analysis of speech signals is necessary. Emotion is a vital piece of information that speech signals carry apart from the verbal content. Any human-computer interface must be able to capture the underlying emotion of human speech, since the same sentence can carry different meanings depending on the emotion. SER has considerable potential in speech-enabled interfaces such as AI-enabled voice assistants, which could keep track of emotions and changes in their pattern to predict psychological changes or signs of mental stress and depression. This can be extended to the medical field for the detection of autism and Parkinson's disease as well as adjuvant treatment and diagnosis. Educational software and smart classrooms could make use of SER for student mental health detection. Automated vehicles could prevent accidents and ensure safer driving environments by analyzing the speech of the driver and judging whether the driver is fit to drive.
The entire process of SER has two indispensable phases, namely feature extraction and classification. For the former, we can divide the acoustic features into two broad categories. The first category is temporal features such as the energy of the signal, zero-crossing rate, maximum amplitude, and minimum energy. Transforming the time-domain signal to the frequency domain using the Fourier transform gives us spectral features. Some of the spectral features include spectral centroid, Mel spectrograms, Mel Frequency Cepstral Coefficients (MFCC), spectral flux, Shifted Delta Cepstral Coefficients (SDCC), spectral density, and chroma-stft. Since both deep learning and image classification have attained great heights in the past few decades, our paper proposes to bring image classification using 2D convolutional neural network (CNN) models to the field of SER. Hence, the chosen feature is the Mel spectrogram of the audio data.
VOLUME 4, 2016
In recent years, ensemble models that fuse the prediction scores from different constituent models have come into common practice. In our paper, we propose to build an ensemble model that assigns fuzzy ranks with the help of a modified Gompertz function. The audio data used are from the Surrey Audio-Visual Expressed Emotion (SAVEE), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Berlin Database of Emotional Speech (EmoDB) datasets. Ensemble learning combines the predictions of the individual models, hence providing results superior to those of any individual model. Imbalance and correlation problems are taken care of simultaneously by ensemble learning models. The fuzzy rank approach is a fusion technique that predicts the final classification results by assigning adaptive weights to the multi-class confidence scores of the constituent models. The Gompertz function, named after Benjamin Gompertz, is a sigmoid function of time whose growth is slowest at the beginning and end of a given period. It was originally formulated to describe human mortality as a function of age, with the curve saturating asymptotically. In the TLEFuzzyNet model, we choose three pre-trained transfer learning CNN models: ResNet18, GoogleNet, and Inception_v3. The fuzzy ranks were assigned to the prediction scores of the three models on the test set using the Gompertz function, which provided better accuracy than the constituent models, hence making this a novel approach.

A. MOTIVATION AND CONTRIBUTIONS
Since modern technology is all about automation with gestures and voice, SER holds the key to designing state-of-the-art human-computer interaction interfaces. From voice-enabled security devices and authentication systems to automated vehicle environments, emotion can be analysed to prevent identity mismatch or accidents. The medical sciences can also benefit from classifying emotion in patients' speech, for example in treating Parkinson's disease and autism. Key points regarding our proposed model are as follows:
1. A huge amount of data is required when it comes to building an end-to-end model. Due to the lack of enough labeled audio data, we preferred to use three pre-trained transfer learning models as the constituent models of the ensemble pipeline: Inception_V3, GoogleNet and ResNet18.
2. Training the audio samples on a single model for classification may lead to the problem of imbalance. The ensemble approach aggregates the opinions of all the individual models, thereby decreasing noise and giving better and unbiased prediction scores. This makes it a novel approach for SER.
3. A modified Gompertz function is employed to assign fuzzy ranks to the decision scores of the individual models. The prediction scores rarely go as low as zero and the Gompertz function saturates exponentially to an asymptote. This is different from, and more efficient than, traditional ensemble pipelines, since fuzzy-rank-based fusion assigns adaptive priority weights to the prediction scores.

Dellaert et al. [9] used smoothing spline approximation on the contour of the pitch for feature extraction. Pattern recognition techniques were employed for classification purposes. Noroozi et al. [36] applied machine learning models such as multi-class Support Vector Machine, Random Forest and single-layered adaptive boosting on SAVEE and a Polish database. They obtained testing accuracies of 75.71% and 87.91% on the respective databases using random forest models. By using a reduced feature size of 14 they were able to reduce computational time as well.
Nicholson et al. [35] used a special type of neural network called the one-class-in-one neural network, which predicted the best emotion by making use of 12 LPC and delta LPC parameters and attained an accuracy of 50%. Deshmukh et al. [11] worked with Mel-frequency cepstral coefficient (MFCC) features and trained a Support Vector Machine (SVM) on their dataset, achieving 80% accuracy. Slimi et al. [45] generated Mel spectrograms of the audio data and resized them to 150 by 66. These resized spectrograms were flattened and fed to a neural network with a single hidden layer, and the classifications were made by a softmax layer. Their proposed model achieved an accuracy of 81.82% after 1970 epochs of training on the EmoDB database. Badshah et al. [1] fed Mel spectrograms of the speech data to deep convolutional neural networks. They conducted two experiments: in the first they trained the CNN model from scratch, and in the second they fine-tuned a pre-trained AlexNet model, which gave them a test accuracy of 84.3%. Etienne et al. [13] tried to capture the long-term dependencies of speech by using a CNN+LSTM architecture. The high-level features were captured by the CNN model from spectrograms, whereas the recurrent LSTM layers extracted relations among the temporal features.
Zehra et al. [54] used an ensemble approach that combines results from a decision tree (J48), sequential minimal optimization (SMO), and random forest (RF). Their main objective was to design robotic systems that could analyze and detect emotions not only in in-corpus but also in cross-corpus data. Khan both gender dependent and independent, by training a Naive Bayes classifier on both pitch and MFCC features using the EmoDB database and have achieved an accuracy of 95.20%.

III. DATASET
Since SER is a classification problem, the availability of sufficient, correctly labeled data is of prime concern. Different individuals may perceive the same speech as different emotions, hence it is very important to choose a dataset that is labelled by people from the same group of database creators. We have chosen to work with simulated databases, where particular actors or speakers simulate the required emotion by reading out from a set of sentences. In our work we use the SAVEE, EmoDB [3] and RAVDESS [30] datasets, which contain accurately labelled, noise-free audio samples. All three databases are publicly accessible.

A. SAVEE
The SAVEE dataset contains standard TIMIT sentences recorded by 4 male actors. All the audio samples have been recorded, processed and labelled using high-quality equipment in a visual media laboratory. The dataset has 480 audio samples in .wav format, divided into seven emotion classes: anger, fear, disgust, surprise, sad, happy and neutral.

B. EMODB
EmoDB 3 is a German Emotional Speech dataset of 535 utterances available in .wav format. It was created by the Institute of Communication Science, Technical University, Berlin, Germany. There are 5 male speakers and 5 female speakers. There are seven labelled emotion classes: anger, boredom, neutral, happiness, anxiety, sadness and disgust. The sentences that the speakers are made to utter are everyday phrases and can be used with a variety of emotions.

C. RAVDESS
The RAVDESS 30 is a facial and vocal expression database in North American English. 24 professional actors vocalize statements in a North American accent. There are 8 classes of emotions which are calm, happy, sad, angry, fearful, surprise, disgust and neutral. All the classes except neutral are uttered in two levels of emotional intensity. 247 participants reviewed the database with 72 participants retesting for better accuracy of the dataset.

IV. METHODOLOGY
The proposed framework has been categorized into the following subsections that include data augmentation, feature extraction, transfer learning and finally assigning the fuzzy ranks using the Gompertz function for the final prediction. A diagrammatic workflow of the entire TLEFuzzyNet model from feature extraction to final prediction has been given in Figure 1.

A. AUGMENTATION
The size of a dataset is a crucial factor for any deep learning model. A small amount of data hinders a deep learning model from mapping the inputs to the ground-truth labels accurately. High variance in the predictions on test data is a major problem that inadequate data can give rise to. To overcome this problem we use augmentation to create multiple training samples from the limited audio files in the SAVEE and EmoDB databases. The technique we have employed is time shifting, whose equation is as follows:

$$X'[n] = X[n \pm s] \tag{1}$$

where X[n] is the audio signal, X'[n] is the shifted signal and s refers to the number of samples shifted. In the present work, the original audio is sampled at 44100 Hz and the shifting is done by s samples both to the right and to the left.
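The time-shifting augmentation can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `time_shift` and `augment` are hypothetical, and filling the vacated samples with zeros (instead of wrapping them around) is an assumption, since the text does not say how the signal boundaries are handled.

```python
def time_shift(signal, s):
    """Shift a 1-D signal by s samples, zero-padding the vacated positions.

    s > 0 shifts right (delays the signal); s < 0 shifts left (advances it).
    Zero padding is an assumption; a circular shift would wrap samples instead.
    """
    n = len(signal)
    if s == 0:
        return list(signal)
    if abs(s) >= n:
        return [0.0] * n
    if s > 0:
        return [0.0] * s + list(signal[: n - s])
    return list(signal[-s:]) + [0.0] * (-s)


def augment(signal, s):
    """Return the original signal plus its right- and left-shifted copies."""
    return [list(signal), time_shift(signal, s), time_shift(signal, -s)]
```

Under these assumptions each audio file yields three training samples: the original and one shifted copy in each direction.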

B. FEATURE EXTRACTION
The choice of effective features from the speech data is very important for the TLEFuzzyNet model to attain state-of-the-art performance. There is a wide range of important speech features, such as pitch, energy, Mel spectrograms, Mel-frequency cepstrum coefficients (MFCC), Linear Prediction Cepstrum Coefficients (LPCC) and modulation spectral features (MSFs), to name a few. Since we are using Convolutional Neural Network (CNN) models, which operate on images, the Mel spectrogram is an ideal choice of feature.
Spectrograms are generated by applying Fourier transforms to sound signals. The sound signal is divided into small time segments, and the Fourier transform is applied to each segment individually. As a result we get a frequency versus time graph in which the amplitude at each frequency is denoted by the color of the spectrogram. However, humans do not perceive frequency linearly but rather logarithmically. The Mel scale solves this problem by mapping the measured frequency of a tone to its perceived pitch. Figure 2 plots the mel pitch against the frequency of the sound waves. We can clearly see from Figure 2 that the slope of the curve decreases with increasing frequency, which shows that differentiating between higher frequencies is more difficult than differentiating between lower frequencies.
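The flattening of the Mel curve at high frequencies can be checked numerically. The sketch below assumes the widely used conversion m = 2595 log10(1 + f/700); the paper does not state which variant of the Mel formula it relies on, so this choice and the function names are assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels using 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse mapping from mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# The same 100 Hz step covers fewer mels at high frequencies than at low ones,
# which is the decreasing slope described for the Mel curve.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(8100.0) - hz_to_mel(8000.0)
```

With this formula, a 1000 Hz tone maps to roughly 1000 mels, and `low_step` is larger than `high_step`.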

C. TRANSFER LEARNING
The non-availability of sufficient data makes it difficult to build deep neural network models. Transfer learning solves this problem because it lets us reuse models that have been pre-trained on huge datasets. The weights of the neural network from one model are used in another model, which takes a totally new set of data as input. The last few layers of the pre-trained model are fine-tuned by re-training them on the new datasets, which achieves better performance and shorter training time. In image classification problems, transfer learning has been used extensively because obtaining millions of images for a database is not always feasible. CNNs demand hardware memory and are compute-intensive, which makes them difficult to train in scenarios with limited power supply. Therefore, not having to re-train the entire pre-trained model saves training time and system resources.
The three databases are split into training, validation and testing sets in the ratio 8:1:1. The training and validation sets are used to fine-tune the transfer learning models on our Mel spectrogram data. The testing set remains unseen by the model during training. The GoogLeNet, ResNet18 and Inception_v3 models are loaded from the PyTorch Model Zoo. They are pre-trained on ImageNet and fine-tuned using the Adam [24] optimizer. The confidence scores for the testing split are generated for all three models and stored as a csv file for use in the fuzzy-rank-based ensemble step.

Figure 6: A training batch of 16 spectrograms from the RAVDESS dataset, with the labels denoting the respective classes present in the database. The labels are mapped as follows: 'f' for fear, 'd' for disgust, 'h' for happiness, 's' for sad, 'n' for neutral, 'su' for surprised, 'a' for angry and 'c' for calm.
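An 8:1:1 split can be sketched with the standard library alone. This is only an illustration: the helper name `split_811`, the fixed seed and the plain random shuffle are assumptions, and the text does not say whether the original split was stratified by emotion class.

```python
import random


def split_811(items, seed=0):
    """Shuffle items and split them into train/validation/test in an 8:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (assumed)
    n = len(items)
    n_val = n // 10
    n_test = n // 10
    n_train = n - n_val - n_test  # remainder goes to the training set
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# 480 is the number of audio samples in SAVEE; indices stand in for file names.
train, val, test = split_811(range(480))
```

For 480 samples this gives 384 training, 48 validation and 48 testing items, with no item appearing in more than one set.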
The three transfer learning models are described below.

1) Inception_v3
Inception_v3 is a popular 2D CNN model with strong classification performance that is widely used for transfer learning. It is an extension of the GoogLeNet model that makes extensive use of Batchnorm [17] in the activation layers. Inputs to this model are of size 299×299×3 and pass through a series of convolution layers for feature map extraction. The inception blocks of Inception_v3 apply filters of different sizes in parallel and concatenate their outputs into a single feature map (see Figure 5 for further details). This architecture decreases the computational complexity by reducing the number of parameters in the model.

2) ResNet18
ResNet, or Residual Network, introduced in 2015, is an architecture that uses residual mapping and is very effective against the "degradation problem" in deep networks. The residual learning approach eases the optimization of the CNN model. Like most popular image recognition CNN models, ResNet-18 is pre-trained on the ImageNet dataset. It takes as input images of size 3x224x224, which is smaller than the input size of the Inception_v3 model. Deeper residual networks can generally achieve better performance, at a higher computational cost. There are implementations of the ResNet model at different depths, such as ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-110. The ResNet18 model (see Figure 3 for further details) used in our proposed TLEFuzzyNet framework offers a good balance between computational complexity and accuracy.

3) GoogleNet
GoogLeNet is a 22-layer CNN model and one of the popular networks of the Inception architecture, developed by researchers at Google. It takes as input images of shape 224x224. The network was specially designed for computational efficiency and practicality, so that inference can be run on individual systems, including those that do not have very high computational power. One method by which GoogLeNet achieves efficiency is through reduction of the input image whilst simultaneously retaining important spatial information. To prevent overfitting of the network, a regularisation technique called dropout (with a rate of 40%) is applied during training, just before the linear layer. It works by randomly deactivating a fraction of the neurons within the network. A pictorial view of the resulting network is depicted in Figure 4.
Each of the three models is trained separately on each dataset: for 50 epochs on SAVEE, 10 epochs on RAVDESS and 25 epochs on EmoDB. The numbers of epochs were determined experimentally to prevent overfitting on the sample data. A batch size of 16 is used with a learning rate of 0.001 and the Adam optimizer. Figure 6 shows a training batch used to classify the emotions of the corresponding audio data.

D. FUZZY RANK ENSEMBLE
In the literature, traditional ensemble techniques either give equal priority to the classification scores of all constituent models or use pre-computed weights for the classifiers. The main issue with such an ensemble is that the weights are static and difficult to modify in the phase where we classify the test samples. In the proposed fuzzy-rank-based ensemble approach, however, each base classifier's prediction scores are taken into account for every individual test case separately. In this way, enhanced and more accurate classification scores can be obtained. This is a dynamic process, and there is no need to change any weights for different test datasets.
The Gompertz function describes time series whose growth is slowest at the beginning and end of a time period. Originally used to describe the mortality rate as a function of age, it is now used extensively in the field of biology. The Gompertz function can describe the growth of a population, a cancerous tumor or a colony of bacteria, as well as the number of affected people during an epidemic. The equation of the function is:

$$f(t) = a e^{-b e^{-ct}} \tag{2}$$

where a is an asymptote, b sets the displacement along the x-axis, c is used for y-scaling and e is Euler's number. Figures 7-9 show the graphs of the Gompertz function with varying values of a, b and c respectively.
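The behaviour of the parameters can be explored with a direct implementation of the standard Gompertz curve a·exp(-b·exp(-c·t)). The function name and the default parameter values below are chosen only for illustration.

```python
import math


def gompertz(t, a=1.0, b=1.0, c=1.0):
    """Standard Gompertz curve a * exp(-b * exp(-c * t)).

    a is the upper asymptote, b displaces the curve along the t-axis,
    and c controls how quickly the curve rises towards the asymptote.
    """
    return a * math.exp(-b * math.exp(-c * t))
```

The curve grows monotonically in t and approaches a for large t; at t = 0 with b = 1 it takes the value a/e.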
In our proposed method, we use a redesigned version of the Gompertz function. Considering N to be the number of constituent models, we have N prediction scores for each image in the test split of the database. As discussed above, we have made use of three transfer learning CNN models, therefore N = 3. Let L be the number of labels in the dataset. The prediction scores, denoted by S in equation (3), of each class for each sample are taken into account while generating the fuzzy ranks. For the l-th class, the fuzzy rank on account of the n-th constituent model is given by equation (4), ∀ l, n; n = 1, 2, ..., N; l = 1, 2, ..., L. For each test sample, only the top k classes of each constituent model are considered, and in our proposed method we have chosen k = 2. Equations (5) and (6) are used to calculate the Fuzzy Ranks ($FRS_l$) and the complement of confidence factor sum ($CCFS_l$) for the class l. If the label l does not fall under the top k classes, penalty values $P^{R}_{l}$ and $P^{CF}_{l}$ are imposed on the corresponding class. The final predicted class for the data instance X is obtained by multiplying $FRS_l$ and $CCFS_l$ and finding the minimum value among all the classes, as shown in equation (7):

$$\text{class}(X) = \arg\min_{l = 1, \ldots, L} \left( FRS_l \times CCFS_l \right) \tag{7}$$
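The fusion procedure can be sketched in code for a single test sample. This is an illustration under stated assumptions, not the authors' implementation: the re-parameterized Gompertz form 1 - exp(-exp(-2s)) for the fuzzy rank, the rank penalty equal to the rank of a zero score (about 0.632) and the confidence penalty of 0 are taken from the general fuzzy-rank fusion literature and are not given in the text here. All function and parameter names are hypothetical.

```python
import math


def fuzzy_rank(score):
    """Assumed re-parameterized Gompertz rank: a higher score gives a lower rank."""
    return 1.0 - math.exp(-math.exp(-2.0 * score))


def fuzzy_ensemble(scores, k=2, rank_penalty=None, cf_penalty=0.0):
    """Fuse per-model class scores for one sample and return the predicted class index.

    scores: list of N lists, each holding L confidence scores from one model.
    """
    if rank_penalty is None:
        rank_penalty = fuzzy_rank(0.0)  # rank of a zero-confidence score (assumed penalty)
    n_models = len(scores)
    n_classes = len(scores[0])
    frs = [0.0] * n_classes  # sum of fuzzy ranks per class
    cfs = [0.0] * n_classes  # sum of confidence scores per class
    for model_scores in scores:
        ranks = [fuzzy_rank(s) for s in model_scores]
        # The top-k classes of this model are the k classes with the smallest ranks.
        top_k = set(sorted(range(n_classes), key=lambda l: ranks[l])[:k])
        for l in range(n_classes):
            if l in top_k:
                frs[l] += ranks[l]
                cfs[l] += model_scores[l]
            else:
                frs[l] += rank_penalty
                cfs[l] += cf_penalty
    # Complement of the averaged confidence sum, then the product to be minimized.
    ccfs = [1.0 - cfs[l] / n_models for l in range(n_classes)]
    fused = [frs[l] * ccfs[l] for l in range(n_classes)]
    return min(range(n_classes), key=lambda l: fused[l])
```

For example, with three models scoring a three-class sample as `[0.7, 0.2, 0.1]`, `[0.4, 0.5, 0.1]` and `[0.6, 0.3, 0.1]`, the sketch selects class 0 even though the second model prefers class 1, because class 0 has both the smallest rank sum and the smallest complement of confidence.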

V. RESULTS AND DISCUSSION
In this section, we provide tabular data for the results obtained on the three aforementioned datasets. A detailed explanation of the evaluation metrics, the performance of the CNN transfer learning models, and the final ensemble model is provided. We compare our work with previous research and show that our method attains state-of-the-art performance on the SER problem using an ensemble approach with deep learning 2D CNN models.

A. EVALUATION METRICS
To assess the performance of the TLEFuzzyNet model, we have considered F1-score, precision, recall and accuracy as our evaluation metrics. A majority of past studies have used accuracy as the standard metric for evaluating performance. We therefore use accuracy for the comparative study between the TLEFuzzyNet model and previous models for the SER problem.
The mentioned evaluation metrics can be calculated from the basic quantities True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). The corresponding formulas are as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{9}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{10}$$

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$

We use precision, recall and F1 score because of the imbalanced distribution of samples in our databases. In the upcoming sections we give a comparative study against previous SER models, which include deep learning as well as machine learning classifiers.
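These metrics can be computed per class in a one-vs-rest fashion, as sketched below. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed for this example.

```python
def per_class_metrics(y_true, y_pred, cls):
    """One-vs-rest precision, recall and F1 score for the class `cls`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

For instance, with true labels `['a', 'a', 'h', 'h', 's']` and predictions `['a', 'h', 'h', 'h', 's']`, the accuracy is 0.8 and class `'h'` has precision 2/3, recall 1 and an F1 score of 0.8.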

B. PERFORMANCE OF CONSTITUENT MODELS
Each of the three constituent models has been loaded from the PyTorch Model Zoo and is pre-trained on the ImageNet [10] dataset. All model weights have been frozen except those of the classification layers. The classification layers initially had an output layer with softmax activation of size (1, 1000), which is replaced by one of size (1, num_of_classes) and fine-tuned. Each model has been trained for 50 epochs on SAVEE, 25 epochs on EmoDB, and 10 epochs on RAVDESS, after which the best validation accuracy has been taken into consideration. The Adam optimizer has been used for gradient descent with a learning rate of 0.001 and β values of (0.9, 0.99). We experimented with different learning rates, batch sizes, and numbers of epochs, and the final values for the TLEFuzzyNet model were chosen experimentally.
The Inception_V3 model achieved 98.76% accuracy, ResNet18 model achieved 98.77% accuracy and GoogLeNet model achieved 99.08% accuracy for the EmoDB dataset. The ensemble model after applying ranks to the classification scores predicts the labels from the corresponding testing datasets with 99.38% accuracy which is greater than that of the constituent models.
For the RAVDESS dataset, the Inception_V3, ResNet18, and GoogLeNet models achieved individual classification accuracies of 92.07%, 95.29%, and 97.24% respectively. The ensemble model, however, assigned fuzzy ranks in a manner that classified the instances of the corresponding testing dataset with a final accuracy of 99.66%. These results indicate the ability of the ensemble approach to minimize the errors of each CNN model and generate more accurate classification scores.

Table 5: Comparison of TLEFuzzyNet performance with state-of-the-art works for the RAVDESS database.

Figure 10 shows the learning curves for the SAVEE dataset using the Inception_v3, GoogLeNet and ResNet18 models respectively. We can infer from the graphs that the models reach their maximum accuracy around the 10th epoch and learn little thereafter. Similarly, for the EmoDB dataset (shown in Figure 11) and the RAVDESS dataset (illustrated in Figure 12), training reaches a point where the accuracy stops improving, which is again around the 10th epoch. All the learning curves have been plotted using the TensorBoard library in Python.

C. PERFORMANCE OF ENSEMBLE MODEL
The ensemble model assigns fuzzy ranks to the classification scores given by the three CNN transfer learning models mentioned in IV-C. Classification scores from the previous transfer learning phase are stored for each sample in the testing dataset. In this phase we assign fuzzy ranks to the top k classes and give penalties to the other class predictions, as discussed in IV-D. The ensemble model generates the final prediction scores for the respective number of classes of each database. Table 2 gives the per-class accuracy, recall, precision, and F1 score along with the overall accuracy of the ensemble. The class column refers to the different emotion classes of each dataset. For the SAVEE dataset, classes 1-7 are anger, disgust, fear, happiness, neutral, sad and surprised in order. For the EmoDB dataset, classes 1-7 are fear, disgust, happiness, boredom, neutral, sad and anger in order. For the RAVDESS dataset classes 1-8 are neutral, calm, happiness,

In order to visualize the classification performance of our ensemble model, we make use of the receiver operating characteristic (ROC) curve. The ROC curve can be used for multi-class classification, though conventionally it is employed for binary classification. For this we use a method known as One vs All, where the ground truth class is treated as one label and all other classes are treated as a collective label. The ROC curve measures the ability of the model to distinguish between classes. The larger the area under the ROC curve, the more accurately a class is separated from the others. In the ROC curve, the True Positive Rate is plotted against the False Positive Rate.
The model achieves perfect accuracy if the area under the ROC curve is 1, meaning it can separate the multiple classes with 100% accuracy. A model whose ROC curve has an area of almost 0 provides the wrong prediction for nearly every sample, and is hence a poor model. The ROC curves for our ensemble model on the RAVDESS, EmoDB and SAVEE datasets are given in Figures 13 to 15. We can see in Figure 13 that the area under the ROC curve approaches 1, which is consistent with the accuracy of 99.66% on the RAVDESS dataset.
$$TPR = \frac{TP}{TP + FN} \tag{12}$$
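A One vs All ROC curve can be traced by sweeping a threshold over the scores that the model assigns to one class and recording the true positive rate TP/(TP+FN) and the false positive rate FP/(FP+TN) at each threshold. The sketch below is only an illustration with hypothetical function names; it assumes that the data contain at least one positive and one negative sample.

```python
def roc_points(y_true, y_score, positive):
    """Return (FPR, TPR) points for `positive` treated as one class vs. all others.

    y_score holds the score assigned to the `positive` class for each sample.
    """
    labels = [1 if t == positive else 0 for t in y_true]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step can only add predicted positives.
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(labels, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, y_score) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

When every positive sample scores higher than every negative sample, the curve passes through the point (0, 1) and the area equals 1; when the ordering is fully reversed, the area is 0.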

D. COMPARISON WITH OTHER FRAMEWORKS
In this section, we present a detailed comparison between the performance of past research frameworks and our proposed TLEFuzzyNet model. The datasets we have used are open source and widely used for research in the field of speech processing and speech emotion recognition. Since research in the domain of SER has been evolving over the last few decades, there is a wide range of deep learning and machine learning models to compare the TLEFuzzyNet model with. Tables 3-5 give a tabular comparison of the TLEFuzzyNet model with other state-of-the-art work.
The datasets we have worked with are popular SER datasets. As shown in Table 1, both EmoDB and SAVEE are relatively small compared to the RAVDESS dataset. Additionally, EmoDB is an example of an imbalanced dataset, with the most data for anger and the least for disgust. As described in section IV-A, we have implemented an additional augmentation phase in which we increase the data by time shifting the original speech signals. This helps to counter the problem of small dataset sizes. Since RAVDESS is a comparatively larger dataset, the models are able to capture more relevant features from the Mel spectrograms and map the input to the output (emotion) with less bias towards wrong emotion outputs. Despite the difference in dataset sizes, the prediction accuracies for the three datasets are numerically comparable. This is due to the dynamic nature of assigning ranks to the classification scores of each test sample. As highlighted in Tables 3-4, the previous state-of-the-art models were able to achieve accuracy >95% for EmoDB and SAVEE owing to the small dataset sizes, since the models were able to fit the small data with greater accuracy. Nevertheless, our proposed TLEFuzzyNet model beats the previous state-of-the-art models, though the margin of improvement is small due to the already high performance of previous models.
In the case of the RAVDESS dataset, the previous frameworks mentioned in Table 5 are able to classify different emotions with decent accuracy, in the range 70%-85%. This might be due to the comparatively large size of the dataset. Our framework does not depend entirely on the transfer learning phase for classification; instead, the classification scores are fed to the fuzzy-rank ensemble model, which provides the final classification results after assigning ranks to the top k (which is 2 in our proposed paper) predicted classes for each test sample and each constituent model. As a result, we gain superior performance compared to the previous benchmark models that were trained on the RAVDESS dataset.

VI. CONCLUSION
This paper proposed an ensemble-learning-based framework for SER using transfer learning 2D CNN models. It was found that models pre-trained on huge image datasets can extract essential features from Mel spectrograms of audio data, hence converting the task of speech processing and recognition into a computer vision task. The TLEFuzzyNet model combines transfer learning, CNNs and a fuzzy-rank-based ensemble approach that makes use of the Gompertz function. Since the datasets used were not of very large scale, transfer learning was a good choice for training the deep convolutional neural networks. The dynamic assignment of ranks to the classifiers makes it possible to make predictions on newer datasets without having to initialize a new set of weights for the ensemble phase of the framework. Errors of each individual CNN classifier are compensated by the fuzzy ranking algorithm. The experimental results show that the TLEFuzzyNet model has achieved state-of-the-art accuracies of 98.57%, 99.38% and 99.66% on the three benchmark datasets SAVEE, EmoDB, and RAVDESS respectively. Transfer learning and ensemble approaches are therefore promising for SER.
There are a few areas where the TLEFuzzyNet model can be improved, which are as follows:
1) In our CNN models, we have used traditional Mel spectrograms. This greatly increases computation due to the input image sizes of (224, 224) for the transfer learning models. In future we can use smaller feature sets such as MFCC, or feature vectors from neural network architectures of smaller size.
2) The generalization of the framework can be improved by using better data augmentation techniques such as voice conversion using generative models [51].