Estimating underlying articulatory targets of Thai vowels by using deep learning based on generating synthetic samples from a 3D vocal tract model and data augmentation

Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network based on a small set of representative seed samples. From a seeding set, a larger training set was generated that provided richer contextual variations for the model to learn. The deep learning model for acoustic-to-target mapping was then trained to model the inverse relation of the articulation process. This method allows the trained model to map the given acoustic data onto the articulatory target parameters which can then be used to identify the distribution based on linguistic contexts. The model was evaluated based on its effectiveness in mapping acoustics to articulation, and the perceptual accuracy of speech reproduced from the estimated articulation. The results indicate that the model can accurately imitate speech with a high degree of phonemic precision.


I. INTRODUCTION
In speech production, speakers convey messages to listeners in acoustic form by moving multiple articulators in specific patterns. These movement patterns are learned and used regularly in everyday communication. By observing speech and visible articulations, a child gradually learns to speak with minimal explicit instruction on articulatory movements [1]. Later, a child can mimic other speech by observing only a few samples and then practicing producing them. This learning phenomenon suggests that a key component in the early stage of language learning is the ability to recognize potential articulatory movements and test them by producing similar instances and improving their correctness. Understanding this learning process will help answer the question of how speech production learning should be represented, and will provide a framework for creating better learning algorithms for speech synthesis systems that can learn automatically from observations and interactions, as well as for other speech-related applications [2]-[9].
One way to address this issue is a corpus-based analysis-by-synthesis method that learns underlying articulatory targets through iterative exploration of candidate targets, comparing synthesized and original signals and using the differences to adjust the targets [10]. This modeling method simulates the speaker's iterative learning of articulatory movements, in which the speaker repeatedly synthesizes speech and then generalizes to estimate the movement pattern. While this approach allows the computational learning process to generate the targets, a sizable corpus with enough contextual variations is required to cover all possible utterances. It is also computationally expensive because of the large time complexity of the optimization. Moreover, it can only estimate a single utterance at a time. While this strategy is analogous to mimicking, it still does not address how a trained speaker can recognize or estimate the movement patterns of newly introduced speech utterances immediately after perceiving a few samples.
To address the latter issue, the learning process should be able to quickly estimate the articulatory targets once a few speech examples are received. This can be done with an acoustic-to-articulatory inversion model [11], which learns the mapping from acoustic to articulatory trajectories. After learning, this model can quickly recognize plausible articulatory trajectories and assimilate either acoustic or articulatory differences as an interactive learning strategy. However, the problem is complicated, as the solutions are non-linear, non-unique, and ambiguous [12], due to the characteristics of the speaker's vocal tract shape [13], environmental noise [14], coarticulation [15], and speaking rate [16]. Different methods have been proposed to learn the association between acoustics and articulation [17]-[19]. The most recent advances are methods that use deep learning models, which have achieved low error rates [20]-[26].
Deep learning [27] uses artificial neural networks trained with gradient-based optimization to approximate complex functions [28]. Accurate approximation of a complex function requires a large number of training samples, but the process of gathering data is expensive and not always possible. A common strategy to improve the learning process without acquiring additional data is data augmentation, i.e., generating additional samples from existing data [29]-[31]. Common strategies for speech data augmentation are vocal tract length perturbation [32], speed perturbation [33], pitch shifting [34], speech rate modification [35], speech feature masking [36], and data synthesis with a generative model [37]. To utilize this strategy in target learning for articulatory synthesis, the augmentation should reflect the variabilities in speech production.
In recent developments in articulatory target estimation, an analysis-by-synthesis approach using a distal learning strategy has been used [10], [38]. Conceptualized as learning by imitation, gradient descent and swarm optimization were used as learning strategies to acquire the target articulation. A three-dimensional articulatory synthesis model [39] was used to generate the speech signal from different parameter sets. The results suggest that the optimizer can imitate single vowel utterances. Further improvements of the optimization process using genetic algorithms and long short-term memory (LSTM) neural networks have also been developed and have shown promising results [40], [41].
Two kinds of articulatory spaces have been studied for acoustic-to-articulatory inversion: 1) actual articulatory spaces measured with electromagnetic articulography [25], [26] and magnetic resonance imaging [24], and 2) theoretical human articulatory spaces of two-dimensional [42], [43] and three-dimensional vocal tract models [44], [45]. Of these two, the theoretical space represented by a vocal tract model is more accessible for understanding speech production and has been studied extensively in recent years. One example is speech imitation via acoustic-to-articulatory inversion on a two-dimensional vocal tract synthesizer using distal learning [42] and chain metrics [43]. These studies show that the differences between synthesized speech and human speech pose a major constraint on the modeling process. To improve synthesis quality, a data generating method called the babbling generator was proposed, which uses an HMM to estimate realistic articulatory trajectories [46]. Further studies have used VocalTractLab, a three-dimensional vocal tract model [39], to improve the naturalness of the synthesized speech, with reinforcement learning using a reward function as the learning strategy [44] on preset vowel samples, although there was no quantitative assessment of the synthesis quality. In addition, the learning process was not end-to-end, and a human was involved in selecting optimized tokens during the vowel refinement process. There was also a proposal to implement an imitation algorithm based on an Echo State Network to refine synthetic syllables generated by VocalTractLab with preset gestural scores [45]. However, the results showed that the improvement due to the refinements was limited: intelligibility improved for only 40% of the items and deteriorated for 12%, while the rest were unchanged.
Therefore, much more work is needed to improve the intelligibility of the speech synthesized from the model, and to close the generalization gap, i.e., the difference between the ability to re-synthesize highly intelligible speech from utterances in the learning samples and from utterances by unseen speakers.
This study proposes a speech acquisition strategy for learning the underlying articulatory targets that can generate synthetic Thai vowels using a three-dimensional vocal tract model. The strategy uses deep learning to directly map an acoustic vowel representation to an underlying articulatory target, and then uses VocalTractLab, an articulatory synthesizer with a three-dimensional vocal tract model, as a forward function to reproduce speech utterances from the retrieved representations. The training samples were monosyllabic and disyllabic vowel-only utterances synthesized by interpolating and augmenting a few observed samples. The quality of the speech re-synthesized by the model was evaluated using a perceptual recognition test, in which the model re-synthesizes vowel speech from unseen human vowel utterances.

A. OVERVIEW
Our proposed underlying articulatory target acquisition strategy applies deep learning to model the acoustic-to-articulatory-target mapping, and a three-dimensional vocal tract model both to generate learning samples and to re-synthesize speech from the articulatory targets estimated by the model, as illustrated in Figure 1. The deep neural network maps the observed target speech to the underlying articulatory target. Next, the three-dimensional vocal tract model, designed by Birkholz et al. [47], maps the estimated articulatory targets into motor commands to reproduce the speech. In the acoustic domain, the target is the surface acoustical pattern that a speaker aims to produce, while in the articulatory domain, the underlying target is a set of parameters used to control the models of the vocal tract and the vocal folds; this corresponds to the motor commands in speech production. The deep learning model is analogous to the auditory-to-motor mapping realized by neural connections in the cortex of the human brain. To extract multiple representations from an observed sample, the data generator module was designed to generate a large speech and articulatory target corpus using interpolation and augmentation.
The model was evaluated by comparing the reproduced speech with the observed target speech vowels (as part of disyllabic utterances). Two kinds of observed target speech vowel samples were used: 1) observed pairs of speech and articulatory targets acquired from the VocalTractLab application, and 2) observed disyllabic speech signals recorded from 12 native Thai speakers. To determine the effectiveness of the proposed learning strategy, a listening test with 25 native Thai participants was conducted, in which the participants were asked to recognize disyllabic Thai vowel utterances re-synthesized from the recorded speech of the native Thai speakers.

B. VOCALTRACTLAB
VocalTractLab 2.2 (VTL), an articulatory speech synthesizer, is the core speech production model [48]. The application provides a 3D vocal tract model and predefined articulatory parameters for German vowels and some German consonants. The 3D vocal tract model used in VocalTractLab was developed from volumetric MRI of a native German male speaker [49]. VocalTractLab can synthesize a full range of speech sounds from the set of articulatory parameters. For the aero-acoustic simulation, the 3D vocal tract shape is mapped to an enhanced area function and its equivalent transmission-line circuit representation, which is numerically simulated in the time domain [50]. The target approximation model implemented in VocalTractLab simulates continuous articulatory trajectories [51]; it simulates dynamic articulatory movements in the same way that tones in Thai have been successfully simulated [52]. VocalTractLab generates speech waveforms at a sampling rate of 22.05 kHz with 16-bit resolution.
The shape of the vocal tract is controlled by 23 articulatory parameters, as shown in Table 1. These parameters control jaw angle, velum shape, velopharyngeal port, lips, tongue, and additional constraint parameters. The minimum and maximum parameter ranges are restricted as a soft constraint to prevent abnormal anatomic shapes.
Dynamic articulatory movements generated by VocalTractLab are controlled by gestural scores, which specify the underlying targets of each articulator in terms of their geometrical shapes and positions within a specified temporal speech interval. The motor commands for the individual articulators are then calculated based on the target approximation model [51], [53]. Besides articulatory parameters, pitch targets of the produced speech are also defined in terms of underlying targets [51].

C. DATA GENERATOR
The data generator module was designed to generate high variation of speech from a few observed samples. The acoustics were generated from the VTL using 1) articulatory parameters, 2) the speaker's vocal tract model, and 3) a gestural score. The high degree of freedom of the three-dimensional vocal tract model results in a many-to-many mapping between articulatory parameters and acoustics. The boundary of the interpolation function was defined by the predefined articulatory targets of German vowels from the VTL. The linear interpolation of the articulation is defined as follows:

R = P + u(Q − P),

where P and Q are vectors of the 23 articulatory parameters of two randomly selected predefined articulatory targets of German vowels, R is the generated articulatory target vector, and u is an interpolation parameter indicating the interpolating range from P to Q. The interpolating range was constrained around P, which prevents oversampling of the central part of the vowel space produced by averaging articulations between P and Q.
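The interpolation above can be sketched in a few lines of Python. The function name and the choice of bounding u near P (here `u_max = 0.4`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def interpolate_target(P, Q, u_max=0.4, rng=None):
    """Generate a new articulatory target R = P + u(Q - P) between two
    predefined 23-parameter vowel targets P and Q.

    u is drawn close to P (u in [0, u_max]) so that sampling does not
    concentrate in the centre of the vowel space; u_max is an
    illustrative value, not taken from the paper.
    """
    rng = rng or np.random.default_rng()
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    u = rng.uniform(0.0, u_max)
    return P + u * (Q - P)
```

The same linear form also covers the speaker vocal tract interpolation described next, with P and Q replaced by the adult and child anatomical parameter vectors.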
Similarly, the speaker's vocal tract model was constructed by linear interpolation between the existing adult vocal tract model and a child vocal tract model from the VTL, where the child vocal tract was transformed from the adult model [54]. The function is defined as follows:

V_intpl = V_adult + j(V_child − V_adult),

where V_intpl, V_adult, and V_child are vectors of anatomical parameters of the interpolated, adult, and child speakers, respectively, and j is an interpolation factor whose range was perceptually selected to ensure the naturalness of the synthesized speech.
The generated articulatory targets were then scaled to the min-max articulation range of the new interpolated speaker vocal tract model, defined as follows:

Ŷ_jk = min_jk + ((y_k − min_k) / (max_k − min_k)) (max_jk − min_jk),

where y_k is articulatory parameter k, k ∈ [1, 23], with range [min_k, max_k] in the source model, and Ŷ_jk is the generated target articulatory parameter k of the interpolated speaker vocal tract model j, with range [min_jk, max_jk].
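A generic min-max rescaling matching this description can be sketched as follows; the exact equation in the source is garbled, so this is a reconstruction under the stated assumption that each parameter is mapped from the source model's range into the interpolated speaker's range.

```python
import numpy as np

def rescale_to_speaker(y, src_min, src_max, tgt_min, tgt_max):
    """Rescale a generated articulatory target y from the parameter
    range of the source speaker model ([src_min, src_max] per
    parameter) to the min-max range of the interpolated speaker model
    ([tgt_min, tgt_max] per parameter)."""
    y, src_min, src_max, tgt_min, tgt_max = (
        np.asarray(a, float) for a in (y, src_min, src_max, tgt_min, tgt_max))
    norm = (y - src_min) / (src_max - src_min)   # normalize to [0, 1]
    return tgt_min + norm * (tgt_max - tgt_min)  # map into target range
```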
To simulate articulatory movement, the gestural score was generated such that gestures related to the production of the vowel utterance were randomly selected from a distribution of possible dynamic movements, while gestures related to consonants (lip, tongue, and velic gestures) were left blank. The glottal shape gesture was fixed to modal phonation. The syllable duration of both monosyllabic and disyllabic vowel-only utterances was randomly selected from a uniform distribution between 0.5 and 1.5 seconds. For the disyllabic utterance, the transition point between the first and second syllables was randomly drawn from a uniform distribution between −20% and +20% around the midpoint of the total duration. The time constant was uniformly sampled from the range C ∈ [0.015, 0.020] seconds. The glottal pressure was uniformly sampled from the range G ∈ [9000, 12000] dPa. These parameter ranges were chosen based on a perceptual evaluation of the intelligibility of the synthetic speech without distortion.
The speech was resampled to 16 kHz with 16-bit resolution. The disyllabic vowel data, along with the corresponding generated underlying articulatory target data, were split into first and second syllables. The split point was the midpoint of the disyllabic vowel speech sequence, regardless of the transition time. These split syllables were treated as individual data. The amplitude of the speech was normalized to a range between −1 and 1. The learning samples were then augmented into multiple representations per sample.
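The midpoint split and amplitude normalization can be sketched as below. Normalizing each half independently is an assumption; the paper does not state the order of the two steps.

```python
import numpy as np

def split_and_normalize(signal):
    """Split a disyllabic utterance at its midpoint, regardless of the
    annotated transition time, and peak-normalize each half to
    [-1, 1].  Each half is then treated as an individual sample."""
    signal = np.asarray(signal, float)
    mid = len(signal) // 2
    return [h / np.max(np.abs(h)) for h in (signal[:mid], signal[mid:])]
```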
The speech augmentation methods included: 1) random noise injection, 2) volume perturbation, 3) pitch shifting, and 4) feature masking. Vocal tract length perturbation was excluded because it produces the same effect as the speaker simulation method. All parameters of the augmentation functions were perceptually selected to prevent loss of intelligibility and speech distortion from over-augmentation. In random noise injection, a noise sequence A(t) was generated at random from a continuous amplitude range of A ∈ (0.001, 0.01). Given X(t), a normalized speech signal with an amplitude between −1 and 1 at time t, the noise injection is defined as follows:

X̂(t) = X(t) + A(t).

For volume perturbation, a perturbation factor α was randomly selected from the continuous range α ∈ (1.5, 3), defined as follows:

X̂(t) = αX(t).

The pitch shifting augmentation was based on the PSOLA algorithm [34] implemented in the Librosa Python package [55]. The shifting factor β was randomly selected from a continuous range of β ∈ (−1.0, 4). The ranges of α and β were perceptually selected to ensure the naturalness of the synthesized speech. The feature masking method was based on SpecAugment [36], where masking was applied to 1) one randomly chosen mel-frequency cepstral coefficient, and 2) a random segment of speech feature time frames, with the masking length set to 10.
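The two waveform-level augmentations can be sketched directly in NumPy. Pitch shifting (e.g., via `librosa.effects.pitch_shift`) and SpecAugment-style masking are omitted to keep the sketch dependency-free, and the Gaussian shape of the injected noise is an assumption.

```python
import numpy as np

def inject_noise(x, rng, lo=0.001, hi=0.01):
    """Random noise injection, X'(t) = X(t) + A(t): add noise whose
    amplitude is drawn from the continuous range (lo, hi)."""
    amp = rng.uniform(lo, hi)
    return x + amp * rng.standard_normal(len(x))

def perturb_volume(x, rng, lo=1.5, hi=3.0):
    """Volume perturbation, X'(t) = alpha * X(t), with the factor
    alpha drawn from the continuous range (1.5, 3)."""
    return rng.uniform(lo, hi) * x
```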

D. PRE-PROCESSING METHOD
The speech signal was represented as Mel-frequency cepstral coefficients (MFCCs) [56] with 13 cepstral coefficients plus their velocity and acceleration, resulting in a total of 39 features per time frame. The spectrum was computed using a Hanning window with a window length of 32 ms and a frame step of 10 ms, applying a mel filter bank followed by the discrete cosine transform to decorrelate the filter bank energies. Normalization was applied using cepstral mean and variance normalization (CMVN) [57], a feature compensation method that z-scores and scales each coefficient. The mean and variance were inferred from the training distribution. The CMVN is defined as follows:

X̂_ij[c] = (X_ij[c] − X̄[c]) / SD(X[c]),

where the mean X̄[c] and standard deviation SD(X[c]) of the c-th coefficient are

X̄[c] = (1 / (N J)) Σ_i Σ_j X_ij[c],
SD(X[c]) = sqrt((1 / (N J)) Σ_i Σ_j (X_ij[c] − X̄[c])²),

and X_ij[c] is the i-th input feature of the c-th coefficient at frame j, N is the total number of samples, and J is the total number of time frames.
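CMVN amounts to z-scoring each cepstral coefficient. A per-utterance sketch (the paper infers the statistics from the training distribution instead) could look like:

```python
import numpy as np

def cmvn(mfcc):
    """Cepstral mean and variance normalization: z-score each of the
    39 MFCC coefficients across the time axis.  `mfcc` has shape
    (n_frames, n_coeffs)."""
    mean = mfcc.mean(axis=0)
    std = mfcc.std(axis=0)
    return (mfcc - mean) / std
```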
The target articulatory parameters were min-max scaled to a range between 0 and 1, using Equation 3, where the min and max values were inferred from the training distribution. The JX, VO, TRX, TRY, MS1, MS2, and MS3 parameters were excluded from the model estimation and set to constants appropriate for all vowels. The TRX and TRY parameters are automatically calculated in VTL from the tongue body position, TCX and TCY. The parameter VO, controlling the velic opening, was fixed for a closed velopharyngeal port. JX, MS1, MS2, and MS3 were set to zero because their boundaries are very close to zero and they have little effect on vowel articulation.

E. DEEP LEARNING
A bidirectional LSTM recurrent neural network (BiLSTM) [58], [59] was used as the deep learning model architecture. The BiLSTM was composed of five LSTM layers with 128 hidden units each, processing the sequence in both forward and backward directions. The output layer was a fully connected layer that mapped the feature representation extracted by the BiLSTM to the articulatory representation. Dropout [60] with a 50% drop rate was applied. A simple multiple linear regression without any feature extraction layer was used as a baseline. Both the BiLSTM and the baseline take MFCC features as input and estimate the 23 articulatory target parameters as output.
The model was trained by supervised learning using gradient-based optimization with the Adam optimizer [61], minimizing the mean square error (MSE) between estimated and generated articulatory targets, defined as follows:

MSE = (1 / (N M)) Σ_{k=1}^{N} Σ_{j=1}^{M} (y_kj − ŷ_kj)²,

where y_kj and ŷ_kj are the target (label) and estimated underlying articulatory target values of parameter j at data point k, M is the total number of parameters, and N is the number of data points in a mini-batch. The learning rate used in the optimization was 0.0001 and the batch size was 64. The hyperparameters β1 and β2 of the Adam optimizer were 0.9 and 0.999, respectively. The weights were initialized using the Kaiming initialization method [62]. Models were trained for 150 epochs with an early stopping mechanism monitoring the loss computed from the development set to prevent overfitting [63].
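The training objective is a standard MSE over the mini-batch and the 23 target parameters; as a minimal sketch:

```python
import numpy as np

def articulatory_mse(y_true, y_pred):
    """MSE = (1 / (N * M)) * sum_k sum_j (y_kj - yhat_kj)^2 over a
    mini-batch of N samples and M articulatory target parameters."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))
```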

F. POST-PROCESSING METHOD
The articulatory targets estimated by the deep neural network were inverted with min-max rescaling, using the same min and max parameters from the training distribution. Then, the parameters JX, WC, TRX, TRY, MS1, MS2, and MS3 were added, where JX and WC were set to 0.0 and MS1 to MS3 were set to −0.05. These settings were based on the distribution of the predefined vowels in VTL. TRX and TRY were imputed using equations from the VocalTractLab synthesizer; the equation for TRY is

TRY = 0.831 TCX − 3.0300.

H. RECORDED THAI VOWEL DATASET
Each speaker recorded disyllabic utterances, resulting in a total of 81 disyllabic vowel utterances per recorded set (one set per speaker). Thus, a total of 12 × 81 = 972 disyllabic vowel utterances were recorded. Seven additional sets were recorded by another native Thai speaker, which were then used as additional data to train the speech recognition model, as described later. Some transformations were performed by hand prior to data processing: 1) resampling, 2) amplitude rescaling, and 3) syllabic transition marking. Resampling reduced the speech sample rate from 44.1 kHz to 16 kHz. The amplitude of the recorded audio was scaled to match the distribution of the training synthetic speech by multiplication with a constant factor. The transition time between syllables in the disyllabic vowel utterances was manually marked based on visual inspection of the waveform.
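The inversion of the output scaling and the TRY imputation described earlier in this section can be sketched as follows. Only the TRY equation appears in the text; the corresponding TRX relation is not reproduced here.

```python
import numpy as np

def inverse_minmax(scaled, p_min, p_max):
    """Invert the min-max scaling of the network output back to the
    original VTL parameter range, element-wise."""
    scaled = np.asarray(scaled, float)
    p_min, p_max = np.asarray(p_min, float), np.asarray(p_max, float)
    return p_min + scaled * (p_max - p_min)

def impute_try(tcx):
    """Impute the TRY parameter from the tongue-body x-coordinate TCX
    using the linear relation quoted from VocalTractLab."""
    return 0.831 * tcx - 3.0300
```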

I. DESIGN OF EXPERIMENT
The model was trained on two synthetic training sets from the generator: monosyllabic vowel utterances and disyllabic vowel utterances. To prevent overfitting from training the model too long, the dataset was split into training, validation, and test sets, where the validation set was used for performance monitoring during training and the test set was used for the final performance test.
After training, the model performance was evaluated in both the articulatory and acoustic domains using the predefined vowel dataset. Lastly, the model was evaluated in the acoustic domain using the recorded disyllabic Thai vowel samples. Since the acoustics of different speakers cannot be directly compared, the model performance was evaluated in the phonemic domain using a listening test, described in the following section. The effect of the proposed generator, i.e., of the speaker simulation and data augmentation applied during data generation, was studied in four experiments, each using a different dataset: 1) the originally proposed dataset; 2) the dataset without speaker simulation; 3) the dataset without data augmentation; and 4) the dataset without both speaker simulation and data augmentation. All four models were trained with the same experimental settings. The performance was evaluated by re-synthesizing the recorded disyllabic Thai vowel samples and then measuring phoneme recognition accuracy with a speech recognition model.

J. EVALUATION METRICS
This study evaluated models in the articulatory domain and the acoustic domain using both visual and numerical assessment. In the articulatory domain, the root mean square error (RMSE) and R-squared (R2) were used to measure the error between the observed underlying articulatory target and the underlying articulatory target estimated by the model. For the acoustic domain, the F1, F2, and F3 formant errors between the target speech and the speech reproduced by the model were measured using the mean absolute percentage formant error. Formants were extracted using a Praat script [64].
The mean absolute percentage formant error is defined as follows:

E = (100 / N) Σ_{n=1}^{N} |F_i,n − F_a,n| / F_a,n,

where F_i is a formant of the imitated speech, F_a is a formant of the target speech, and N is the total number of formant samples in the speech data. For the phonemic domain, precision was used as a numerical score, defined as follows:

Precision = TruePositive / (TruePositive + FalsePositive),

where TruePositive is the number of correct predictions, and FalsePositive is the number of incorrect predictions in which the actual target is negative.
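Both metrics are straightforward to implement; as a sketch:

```python
import numpy as np

def formant_error(f_imitated, f_target):
    """Mean absolute percentage formant error (in percent) between the
    re-synthesized (imitated) and target formant tracks."""
    f_i = np.asarray(f_imitated, float)
    f_a = np.asarray(f_target, float)
    return float(100.0 * np.mean(np.abs(f_i - f_a) / f_a))

def precision(true_positive, false_positive):
    """Precision = TP / (TP + FP)."""
    return true_positive / (true_positive + false_positive)
```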

K. LISTENING TEST
The listening test was conducted with 25 native Thai listeners, 14 female and 11 male, aged between 23 and 27 years. Each listener was asked to identify the phonemes of a given set of disyllabic vowel utterances reproduced by the proposed model. These utterances were composed of the vowels described in Subsection II-H, amounting to 81 utterances, and were presented to the participants in random order.

L. RECOGNITION TEST USING SPEECH RECOGNITION
To measure the effect of the proposed generator, a recognition test using a speech recognition model was used to evaluate the intelligibility of the reproduced speech in the phonemic domain. The speech recognition model was trained on the recorded disyllabic Thai vowel speech. This test assumes that if the reproduced speech is intelligible enough, a speech recognizer trained on Thai vowels should be able to identify its phonemes correctly. The model architecture was a shallow LSTM recurrent network consisting of two LSTM layers with 64 hidden units per layer. The output layer was a fully connected layer with nine units representing the phonemic target classes. The model was trained by supervised classification, where the MFCC representation was used as the speech feature and its phonemic representation was used as the learning target. The cross-entropy loss was used as the objective function, defined as follows:

L = − Σ_{c=1}^{C} z_c log ẑ_c,   with ẑ = Softmax(a) and Softmax(a)_c = e^{a_c} / Σ_{d=1}^{C} e^{a_d},

where ẑ is the predicted probability distribution produced by the Softmax function, C is the total number of classes, a is the activation from the previous layer, and z is the one-hot ground truth of the predicted class. A permutation test and bootstrap subsampling were used to ensure the model's fitness.
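The softmax and cross-entropy used to train the recognizer can be sketched as:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a vector of activations."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(a, z):
    """Cross-entropy between the one-hot ground truth z and the
    predicted distribution softmax(a)."""
    return float(-np.sum(np.asarray(z, float) * np.log(softmax(a))))
```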

M. DATA VISUALIZATION
A uniform manifold approximation and projection (UMAP) [65] was used to visualize the clusters of phonemes from the speech and articulatory targets in a two-dimensional space. UMAP reduces the dimensionality of the input by constructing a high-dimensional topological representation of the input using simplicial complexes, and then optimizing a lower-dimensional topology to be similar to the initial high-dimensional one. UMAP preserves both local and global structure, meaning that similar data are clustered together and similar categories of data appear close to each other.

Table 2 shows the model performance evaluated on the synthetic samples. The error in the articulatory domain shows that the BiLSTM performed better than the baseline on both monosyllabic and disyllabic vowel utterances. Next, the models were evaluated on the predefined vowel dataset; the results are shown in Table 3. The BiLSTM achieved small RMSEs for both monosyllabic and disyllabic vowel utterances, indicating that the model can estimate the articulation of unseen speech samples from a known speaker.

B. MODEL PERFORMANCE ON SYNTHETIC DATASET
To further analyze the model performance, the mean absolute percentage formant error with a 95% confidence interval was used to measure the re-synthesis error between the target speech from the predefined vowels and the speech re-synthesized by the model in the acoustic domain. As shown in Table 4, the BiLSTM achieved a low percentage error rate, indicating that the model can accurately reproduce unseen target speech from a known speaker.

The model performance for each estimated articulatory parameter on the predefined monosyllabic vowels is shown in Table 5. The result shows that the BiLSTM was weak in estimating the TTX and TS1 parameters. While TS1 has little effect on the produced speech, TTX, the tongue tip position, may cause a slight error when reproducing speech.

The model performance for each re-synthesized vowel in the acoustic domain, measured by the absolute percentage formant error, is shown in Table 6. The re-synthesized speech of the phonemes /i:/, /u:/, /o:/, and /ø:/ from the BiLSTM had a higher F1 error compared to the other vowels. Using a t-test, F1, F2, and F3 have p-values larger than 0.1, as shown in Table 7; thus, the null hypothesis that the formants are similar cannot be rejected, and these errors did not make the target and re-synthesized speech significantly different.

Table 8 shows the model performance for each estimated articulatory parameter on the predefined disyllabic vowels. The model estimated the articulation of disyllabic vowels better than that of monosyllabic vowels, with most estimated articulatory parameters having lower RMSE and higher R2. Table 9 shows the average F1, F2, and F3 formant errors between the target and re-synthesized disyllabic vowel utterances, where the first and second syllables were measured separately. The model has a noticeably high F2 error on the vowel /u:/ in both halves of the disyllabic vowel.
Using a t-test to assess the difference between the target and re-synthesized disyllabic vowel speech, the two were not significantly different: the p-values for F1, F2, and F3 of both parts are above 0.1, as shown in Table 10.

Figure 4 shows the articulations estimated by the model when reproducing monosyllabic vowels (top) and disyllabic vowels (bottom), visualized with VocalTractLab. These estimated articulations behaved in accordance with the International Phonetic Alphabet (IPA) chart [66]: /a:/ had the tongue towards the front and the mouth open, /i:/ had the tongue towards the front and the mouth slightly closed, and /u:/ had the tongue towards the back and the mouth slightly closed.

Table 11 shows the mean absolute percentage formant error between the target disyllabic Thai vowel speech and the speech re-synthesized from the underlying articulatory targets estimated by the model. Figure 5 shows a similar arrangement of formants between the target and re-synthesized speech. The t-test results in Table 12 also show that the average F1 and F2 frequencies were not statistically significantly different, with p-values larger than 0.1 for both the first and second syllables of the disyllabic vowels. While the t-test shows that the average F3 was statistically significantly different, F1 and F2 alone are sufficient to identify vowels [67]. As shown both numerically and visually, although the difference is not statistically significant, the error appears large. However, formants from different speakers cannot be directly compared because they are affected by the shape of the vocal tract. Therefore, the formants of the re-synthesized speech were compared with an empirical formant range [68]. Figure 6 illustrates that most of the average formants are within the empirical range, indicating that the model accurately re-synthesized target speech recorded from actual humans.
Figure 7 compares the spectrograms of the target speech recorded by a Thai speaker and the speech re-synthesized by the model. The red contours show the speech formants. Visually, the F1 and F2 formants of the target and re-synthesized speech signals are comparable, while F3 differed in some utterances, e.g., /E:O:/.

Figure 8 shows groups of underlying articulatory targets estimated by the model, where the color represents the phoneme of the target speech. The articulatory parameters were projected into a two-dimensional space using UMAP. Articulations of the same phoneme were clustered together. However, while there was a clear distinction between some groups, some mixing between phonemes was present. Figure 9 shows the articulations of the underlying articulatory targets estimated from the recorded disyllabic Thai vowels, visualized as estimations of the first and second syllables. Figure 10 shows the lip shapes of the /a:/, /i:/, and /u:/ vowels to visualize roundedness: the lip shape of /u:/ was rounded, while /a:/ and /i:/ were unrounded.

Figure 12 shows the phoneme recognition rate, where participants were asked to identify the Thai phonemes of the speech re-synthesized by the model from the target recorded Thai speech. A high recognition rate means that the re-synthesized speech was intelligible enough to be classified correctly. The results show that most of the vowels were correctly identified by the participants. Overall, the model achieved more than 80% classification accuracy for re-synthesized disyllabic vowel utterances, except for /a:/, with a recognition rate slightly below 80%, and /O:/, with a recognition rate below 40%. The only difference in the articulation of /a:/ and /O:/ is lip rounding, where the articulation of /O:/ is rounder than that of /a:/.
Figure 11 shows a confusion table where each row is the vowel the model tried to reproduce and each column is the vowel recognized by the participants. The diagonal represents correct recognition. Most of the incorrect identifications came from a pair of similar vowels, /a:/ and /O:/.
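The per-phoneme recognition rate is the diagonal of that confusion table divided by each row total. A small sketch, using hypothetical listener-response counts (the real counts are in Figure 11):

```python
import numpy as np

# Hypothetical response counts: rows = vowel the model reproduced,
# columns = vowel identified by participants (order: /a:/, /i:/, /u:/, /O:/).
vowels = ["a:", "i:", "u:", "O:"]
confusion = np.array([
    [15,  0,  0,  5],   # /a:/ sometimes heard as /O:/
    [ 0, 20,  0,  0],
    [ 0,  0, 19,  1],
    [ 8,  0,  0, 12],   # /O:/ often heard as /a:/
])

# Per-phoneme recognition rate = correct responses / total responses.
rates = np.diag(confusion) / confusion.sum(axis=1)
for v, r in zip(vowels, rates):
    print(f"/{v}/: {r:.0%}")
```

With these made-up counts, the off-diagonal mass sits almost entirely in the /a:/-/O:/ pair, mirroring the confusion pattern reported in the text.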

D. EFFECTS OF THE DATA GENERATOR
The effect of the proposed data generator module was evaluated by measuring the model's target-speech re-synthesis performance on the recorded disyllabic Thai vowel dataset. The results in Table 13 indicate that applying both the speaker simulation and the data augmentation methods improved model performance; performance dropped significantly when both were removed. Although the model was trained only on synthetic samples, it generalized the acoustic-to-underlying-articulatory-target relationship to unseen utterances spoken by unseen native Thai speakers. The variations introduced by speaker simulation and data augmentation increased the model's knowledge of the acoustic-to-target relationship, leading to better estimation of the underlying articulatory targets. Data augmentation affected model performance more than speaker simulation. This is because speaker simulation only increased variation in speaker characteristics, where the vocal tract area was interpolated from the same vocal tract model and thus the overall vocal tract shape changed little, whereas data augmentation varied pitch and volume and added random noise, which reduced overfitting. The listening test showed that the proposed strategy provides a nearly perceptually accurate mapping between Thai vowel speech and the German vocal tract configuration provided in VocalTractLab, indicating an ability to decouple the articulatory mechanisms from the linguistic information. Thus, the generalization of the strategy and the learned targets provides evidence that this learning strategy offers a potential means of target learning beyond a general supervised learning approach.
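The three augmentation operations named above (pitch variation, volume variation, and additive random noise) can be sketched on a raw waveform as follows. This is a minimal illustration, not the study's implementation; the parameter ranges are assumptions, and the pitch change is approximated by crude resampling (which also alters duration).

```python
import numpy as np

def augment(wave, rng):
    """Apply volume scaling, a crude pitch/speed change via resampling,
    and additive Gaussian noise. All ranges are illustrative."""
    # 1) Random volume scaling.
    out = wave * rng.uniform(0.7, 1.3)
    # 2) Crude pitch shift by linear-interpolation resampling.
    factor = rng.uniform(0.9, 1.1)
    idx = np.arange(0, len(out), factor)
    out = np.interp(idx, np.arange(len(out)), out)
    # 3) Additive noise at a small amplitude relative to the signal peak.
    out = out + rng.normal(0, 0.01 * np.max(np.abs(out)), len(out))
    return out

rng = np.random.default_rng(1)
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 220 * t)  # 1 s synthetic tone as stand-in speech
aug = augment(wave, rng)
```

Each training sample can be augmented several times with freshly drawn parameters, multiplying the contextual variation available to the model.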
The advantage of the proposed strategy over the previous study [10] is that the model performed well even though its training was based on only a few observed samples. Although trained only on synthetic samples, the model could estimate the underlying articulatory targets from Thai speech recorded, with minimal real background noise, from various native Thai speakers unknown to the model. This is a promising result for applying this underlying articulatory target acquisition strategy in real-world applications.
Further improvement of this model is needed. First, the model assumed that perceptual segmentation of the speech was learned before speech production. Second, the effect of consonants was not studied. Future work should explore how humans segment speech, e.g., how many syllables are in the target utterance, and then learn to imitate those signals. Attention models have proved very useful for speech recognition [70], [71], where a speech utterance is translated into a sequence of characters. Therefore, an attention model could also be applied to estimate a set of underlying articulatory targets from a given speech utterance. Next, the method could be modified to estimate articulatory targets related to consonants. In addition, to improve estimation performance and generalization, self-supervised methods [72], which learn speech features without any explicit target, are worth exploring.

V. CONCLUSION
This study explored the estimation of underlying articulatory targets by learning the mapping between the acoustics and the underlying articulatory targets of Thai vowels using a bi-directional long short-term memory recurrent neural network. VocalTractLab was used as a generative model to produce acoustic data from articulatory parameters, and a deep learning approach was used to model the acoustic-to-articulatory relationship. Using a few data points as representatives of Thai vowels, speech data augmentation and a speaker simulation method allowed us to extract more information from the data and improve the estimation of the underlying articulatory targets. The results demonstrated that the proposed strategy accurately reproduced speech from a given target utterance from unseen Thai speakers. Thus, the model represents an effective strategy for rapid mapping of acoustic data to articulatory target parameters.
The caveats of this study are: 1) the proposed method required a predefined syllable segmentation of the input speech, and 2) the study excluded consonants. Therefore, the recommended improvements are to include the estimation of underlying articulatory targets for consonant-vowel utterances, and to explore methods that can directly estimate the sequence of speech syllables without the need for predefined speech segmentation.
YI XU received the Ph.D. degree in Linguistics from the University of Connecticut, United States, in 1993. He is currently a Professor of Speech Sciences at the Department of Speech, Hearing and Phonetic Sciences, Division of Psychology and Language Sciences, University College London.