Augmentation Techniques for Adult-Speech to Generate Child-Like Speech Data Samples at Scale

Technologies such as Text-To-Speech (TTS) synthesis and Automatic Speech Recognition (ASR) have become important in providing speech-based Artificial Intelligence (AI) solutions in today's AI-centric technology sector. Most current research and solutions focus largely on adult speech rather than child speech. The main reason for this disparity is the limited availability of children's speech datasets that can be used to train modern speech AI systems. In this paper, we propose and validate a speech augmentation pipeline to transform existing adult speech datasets into synthetic child-like speech. We use a publicly available phase vocoder-based toolbox for manipulating sound files to tune the pitch and duration of adult speech utterances, making them sound child-like. Both objective and subjective evaluations are performed on the resulting synthetic child utterances. For the objective evaluation, the speaker embeddings of the selected adult speakers are compared to a mean child speaker embedding before and after augmentation. The average adult voice is shown to have a cosine similarity of approximately 0.87 (87%) to the mean child voice after augmentation, compared to approximately 0.74 (74%) before augmentation. Mean Opinion Score (MOS) tests were also conducted for the subjective evaluation, with average MOS scores of 3.7 for how convincing the samples are as child speech and 4.6 for how intelligible the speech is. Finally, ASR models fine-tuned with the augmented speech are tested against a baseline set of ASR experiments, showing modest improvements over the baseline model fine-tuned with only adult speech.


I. INTRODUCTION
In recent years, rapid advances in Machine Learning (ML) and Deep Neural Network (DNN) techniques, together with tremendous increases in computational power, have led to a significant boost in the development of speech-related technologies.
Child speech differs in multiple ways from adult speech, owing essentially to the anatomical and morphological differences in their vocal-tract structure. Children have shorter vocal cords, giving their voices higher fundamental and formant frequencies compared to adults. In addition, children may have less control over articulation and non-linguistic aspects of speech such as prosody, and therefore child speech exhibits higher spectral and temporal variation than adult speech [39].
On average, children also have slower speaking rates due to having longer phoneme durations [40]. They also exhibit higher pitch values: typically above 250 Hz, compared to average pitch values of 130 Hz for adult males and 220 Hz for adult females [41], [42]. For these reasons, it is important to gather and prepare good quality children's speech data to successfully train child-friendly speech-related AI models. However, there are additional challenges in the process of collecting child speech data [43], explaining the limited number of child-speech datasets available for research purposes.

A. EXISTING CHILD SPEECH DATASETS - DEFICIENCIES
There are some English child-speech datasets publicly available to researchers. Some of these [28], [29] were built using the approach of recruiting child speakers for recording sessions in professional recording studios, while others, for example the MyST dataset [30], were built using a tablet or smartphone based app to record children's conversational speech remotely. For the latter, audio quality is highly dependent on the consumer device that the app runs on. All of these datasets feature several drawbacks, which affect data quality and introduce challenges to the use of said data in training speech-related AI models such as ASR and TTS. Invariably, major cleaning, filtering, annotation and other preprocessing of the data becomes necessary. A summary of the statistics and pros and cons of these child-speech datasets is presented in Table 1, along with some adult speech datasets for comparison.
A common problem with many of the child speech datasets is that they are relatively small/short in duration, as can be seen in Table 1, and are simply not long enough (in hours) to train a speech model on their own. Another problem is the poor quality of recorded speech samples. Some datasets are generally of poor quality due to the recording devices and/or environments used while capturing the data; for example, audio samples may have too much background noise, noise from recording gear, or very low gain. Lastly, some datasets contain many unusable speech samples.
For instance, in Table 1, MyST is the largest child dataset and has a lot of data (approx. 393 hours) from multiple speakers; but many of the utterances are too short or too long, meaningless, indiscernible, or noisy. In addition, much of the dataset is not annotated, or annotations are of poor quality and cannot be used for training speech models [24].

B. CHALLENGES IN BUILDING CHILD-SPEECH DATASETS
Building a clean speech dataset even for adults is not an easy task. It requires a specially prepared environment (recording studio), the right recording and storage devices, as well as recruitment of speakers. Child speech data can also be collected using this traditional method of recruiting speaking actors for recording sessions in media studios; however, in the case of children, additional difficulties are introduced.
• Recruitment and data protection: The processes of recruiting child speakers (actors) and complying with data protection laws can be both expensive and time-consuming and must involve the parents or legal guardians of the children, as children cannot give their own legal consent.
• Low concentration and short attention spans: children have relatively lower levels of focus and shorter spans of attention, which could cut recording sessions short.
• Poor acoustic and linguistic capabilities of the youngest group of children.
• Poor quality of recording devices and environment.
Another approach that can be used to gather children's speech involves collecting audio recordings from the Internet, for example from YouTube, or through a dedicated recording application. With this approach, a different set of challenges is faced:
• Limited number of videos with children as main actors.
• Background noise and music.
• Lack of transcriptions and annotations.

C. RATIONALE FOR THIS RESEARCH
Taking all the above challenges into consideration, there is a need for alternative ways to build larger child speech datasets to facilitate the development of child-friendly speech technologies. To this end, the goal of this study is to explore the potential of augmenting adult speech to provide additional child-like speech samples to complement existing child-speech datasets. The resulting synthetic child voices can be used to generate more synthetic child speech with the appropriate (child-like) linguistic content using a fine-tuned TTS model.

D. RELATED WORKS
To improve the performance of ASR models for children's speech, some researchers have adopted similar data augmentation techniques. For example, Shahnawazuddin et al. [44] proposed a prosody modification (i.e., pitch and speaking rate scaling) using a Zero-Frequency Filtering based Glottal Closure Instants (ZFF-GCI) anchoring approach. The authors used these modifications to introduce more variability in order to achieve speaker-independent ASR and reported improvements in accuracy over their baseline for both adult and child test sets. Bhardwaj et al. [45] also used pitch and speaking rate modification to improve the performance of a Punjabi ASR system on children's speech. Their system uses the ZFF-GCI method for Linear Prediction based Pitch Synchronous Overlap and Add (LP-PSOLA) together with speaker adaptive training and achieves an improvement in recognition rate for Punjabi child speech. Chen et al. [46] applied multiple modifications, including pitch, tempo, speed, and volume perturbations, to both adult and child training datasets to diversify and increase the amount of available training data to improve child ASR.
The idea of generating synthetic child-like speech from adult speech was explored by Singh et al. [47]. In their work, they applied spectral modifications, namely Linear Predictive Coding (LPC)-based segmental warping perturbations (LPC-SWP) and formant energy perturbations (FEP), to adult data to generate child-like speech for data augmentation, and demonstrated an improvement in WER on both children's and adult test sets when these modifications were combined with vocal tract length perturbation (VTLP).
Most of these works used different algorithmic approaches to apply prosody-based modifications (pitch and speaking rate scaling) to the speech, and the modifications were applied in a somewhat randomized manner. That is, both increasing and decreasing adjustments were applied to the audio features (e.g., pitch and speaking rate). In addition, the quality of the modified speech generated was not assessed in detail.
In this work, the goal is to generate/create synthetic child-like speech data, and we consider augmenting the pitch and speaking rate of adult speech to achieve this using a publicly available phase-vocoder based sound manipulation tool. To determine the timestamps of words and spaces where the speaking rate should be reduced, a forced alignment system based on an ASR model is used. In addition, we employ a speaker encoder model to visualize and compare the adults' and children's speaker embeddings in a common latent space before and after modifications. The contributions of this paper are as follows: a) exploring an alternative algorithmic approach for the modification of adult speech (to make it more child-like through pitch and speaking rate adjustments), b) conducting Mean Opinion Score (MOS) studies to provide a qualitative evaluation of the augmented/modified speech, c) scaling the augmentation to generate large amounts of synthetic child-like speech, and d) conducting a proof-of-concept ASR experiment (example application) to provide a quantitative evaluation of the augmented adult speech.
The rest of this paper is organized as follows: Section II presents the foundation technologies used in this research. Section III describes the methodology and Section IV presents the experiments conducted. Results and discussions are presented in Section V. Section VI presents an example application, and finally, Section VII presents our conclusions and future work.

II. FOUNDATION TOOLS AND TECHNOLOGIES
To develop our augmentation pipeline, we need to use a number of specialized tools to modify the pitch and control the duration of speech samples.In this section we introduce these tools, outline their features and discuss their role in the pipeline.
Different tools were considered for the tasks defined. The Combinatorial Expressive Speech Engine (CLEESE) [48] was selected to implement these augmentations because it offers a combination of ease-of-use and flexibility by allowing transformations to be applied to specific segments of the input speech sample where desired.

A. THE COMBINATORIAL EXPRESSIVE SPEECH ENGINE (CLEESE)
CLEESE is a Python toolkit that can be used to perform deterministic or random transformations on input sound. Several features of the input sound can be modified, including the pitch, duration, and gain (amplitude). Originally designed to generate many random variations of a single input sound, CLEESE can also be used to perform individual and user-determined transformations, and the transformations can be either static or time-varying [48].
Using the phase-vocoder digital audio technique, CLEESE first takes the Short-Time Fourier Transform (STFT) of audio files, which decomposes each frame (segment) of the audio file into its frequency coefficients. Then CLEESE modifies the frames' STFT coefficients as required. For example, it shifts a frame's frequency coefficients to higher frequency positions to achieve a higher pitch [48]. After applying the modifications, CLEESE then generates a modified time-domain signal from the manipulated frames by applying a variety of techniques to ensure continuity or phase-coherence of the resulting sinusoidal components [48].
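To make the effect of these phase-vocoder operations concrete, the short sketch below applies a comparable pitch shift and time stretch with librosa. This is an illustration of the same class of operation only, not the CLEESE implementation used in this work, and the input file name is a placeholder.

```python
# Illustration only (not CLEESE): librosa's effects module applies comparable
# phase-vocoder-style pitch-shifting and time-stretching.
import librosa

y, sr = librosa.load("adult_utterance.wav", sr=None)       # hypothetical input file

# Raise the pitch by 3 semitones (300 cents) without changing the duration.
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=3)

# Slow the utterance down; in librosa, rate < 1 stretches and rate > 1 compresses
# (the opposite convention to CLEESE's time-shift factor described below).
y_slow = librosa.effects.time_stretch(y_shifted, rate=0.8)
```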
CLEESE operates by passing user-defined or random breakpoint functions (BPFs) to a spectral processing engine, together with other parameters for processing of the sound. The BPFs are functions that determine how transformations vary over the duration of the sound; in other words, they define one or more segments (time-windows) of the input sound where specified modifications should be applied. For each BPF, a transformed version of the input sound is generated.
For the pitch, time and gain transformations, the BPFs are temporal and are specified as two-column matrices. Each row (breakpoint) in a BPF matrix has two elements: time and value. The time indicates where the next modification should begin, and the value indicates the amount of modification to be applied. The desired transformation is specified separately in a configuration file. With the specified transformation, CLEESE modifies the input sound along the corresponding dimension (pitch, time, or amplitude) while keeping the other dimensions constant. CLEESE can also perform chained transformations; for example, pitch shifting followed by time stretching.
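As an illustration of the BPF format described above, the sketch below builds pitch and time BPF matrices as two-column arrays. The commented-out CLEESE call is hypothetical, since the exact entry point and argument names depend on the installed version of the toolbox and its configuration file.

```python
import numpy as np

# A BPF is a two-column matrix of breakpoints: [time_in_seconds, value].
# A single row applies the modification from that time to the end of the sound.
pitch_bpf = np.array([[0.0, 300.0]])   # +300 cents (3 semitones) over the whole utterance

# Several breakpoints: stretch only the segment between 1.2 s and 1.5 s by 1.8x.
time_bpf = np.array([
    [0.0, 1.0],    # original rate from the start
    [1.2, 1.8],    # stretch from 1.2 s
    [1.5, 1.0],    # back to the original rate from 1.5 s
])

# Hypothetical chained transformation (pitch shift, then time stretch); the actual
# CLEESE API call and configuration-file handling are version dependent:
# wave_out = cleese.process(wave_in, config_file="cleese_config.py", BPF=pitch_bpf)
# wave_out = cleese.process(wave_out, config_file="cleese_config.py", BPF=time_bpf)
```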

B. CLEESE TRANSFORMATIONS 1) PITCH-SHIFT TRANSFORMATION
Pitch-shifting involves shifting or displacing the fundamental frequency in a given audio frame to a different (higher/lower) frequency. In this study, the fundamental frequencies are shifted to higher frequency points specified in the BPFs, along with the corresponding times where the modifications should start. To determine the new frequency point, CLEESE takes a pitch-shift factor, a value expressed in units of cents (a cent is one hundredth of a semitone), provided in the BPF and uses it to compute the new frequency with respect to the original frequency. As an example, to shift the pitch of the input audio by 2 semitones, a pitch-shift factor of 200 cents is provided in the BPF. Pitch-shift factors less than 0 cents correspond to lowering the pitch, factors greater than 0 cents correspond to raising the pitch, and a factor of 0 cents implies no change or shift in pitch [48].
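Since 1200 cents correspond to one octave, a pitch-shift factor in cents maps to a frequency ratio of 2^(cents/1200); a short worked example:

```python
# f_new = f_old * 2 ** (cents / 1200): 1200 cents = 1 octave, 100 cents = 1 semitone.
def cents_to_ratio(cents: float) -> float:
    return 2.0 ** (cents / 1200.0)

print(cents_to_ratio(200))    # ~1.122 -> a 2-semitone upward shift
print(cents_to_ratio(-100))   # ~0.944 -> a 1-semitone downward shift

# Example: an adult male F0 of 130 Hz shifted by 600 cents lands near 184 Hz.
print(130 * cents_to_ratio(600))   # ~183.8 Hz
```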

2) TIME-STRETCH TRANSFORMATION
The time-stretching transformation involves shifting the audio frames from their original positions to earlier or later points. Similarly, for time-stretching, CLEESE takes a time-shift factor from the given BPF and uses it to determine the new position of a frame. A time-shift factor less than 1 corresponds to compressing the sound, a factor greater than 1 corresponds to stretching the sound, and a factor of 1 implies no change in the original sound duration [48]. For example, using a time-shift factor of 2 doubles the duration of the audio, i.e., a 3-second-long audio will become 6 seconds long after modification, if the modification is applied to the full length of the input audio.

C. WAV2VEC2 FORCED ALIGNMENT SYSTEM
The wav2vec2.0 forced alignment system uses the wav2vec2.0 [4] ASR model for extracting acoustic features from the audio and estimating the frame-wise label probabilities. It then constructs a trellis matrix using the ground-truth transcript of the utterance, which shows the probability of the transcript's labels at each timestep. The system then finds the most likely path through the trellis matrix, producing the alignments between the ground-truth transcript's words and the spoken audio. The output of the forced alignment process is the start and end timestamps for all words in an utterance, together with a confidence score for each alignment, as shown in Table 2.
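The trellis-and-backtracking idea can be sketched compactly on toy emission probabilities. The simplified, self-contained code below is a stand-in for the wav2vec2.0-based aligner: it omits CTC's repeated-token emissions, the grouping of character tokens into words, and the conversion from frame indices to seconds via the model's frame rate.

```python
import torch

def get_trellis(emission, tokens, blank_id=0):
    # emission: (num_frames, num_labels) log-probabilities from an acoustic model.
    # tokens:   transcript as a list of label indices.
    # trellis[t, j]: best log-probability of having emitted tokens[: j + 1] by frame t.
    # Simplification: "staying" on a token is modelled as emitting the blank label only.
    num_frames, num_tokens = emission.size(0), len(tokens)
    token_ids = torch.tensor(tokens)
    trellis = torch.full((num_frames, num_tokens), float("-inf"))
    trellis[0, 0] = emission[0, tokens[0]]
    for t in range(1, num_frames):
        stay = trellis[t - 1] + emission[t, blank_id]
        move = torch.cat([torch.tensor([float("-inf")]), trellis[t - 1, :-1]]) + emission[t, token_ids]
        trellis[t] = torch.maximum(stay, move)
    return trellis

def backtrack(trellis, emission, tokens, blank_id=0):
    # Walk back from the final cell, recording which token each frame is aligned to.
    t, j = trellis.size(0) - 1, trellis.size(1) - 1
    path = [(t, j)]
    while t > 0:
        stay = trellis[t - 1, j] + emission[t, blank_id]
        move = trellis[t - 1, j - 1] + emission[t, tokens[j]] if j > 0 else float("-inf")
        if move > stay:
            j -= 1
        t -= 1
        path.append((t, j))
    return path[::-1]   # (frame, token-index) pairs; word boundaries follow from the token grouping

# Toy usage: 20 frames, 5 labels (index 0 = blank), transcript token ids [3, 1, 4].
emission = torch.randn(20, 5).log_softmax(dim=-1)
tokens = [3, 1, 4]
alignment = backtrack(get_trellis(emission, tokens), emission, tokens)
```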

D. SPEAKER EMBEDDINGS
A speaker embedding is simply a representation of a speaker's identity in the form of a fixed-size vector derived from an utterance, regardless of the utterance duration. Speaker embeddings can be plotted in an embedding vector space to visualize how multiple speakers relate to each other. Speaker embeddings are commonly used for speaker recognition tasks [16], [49] and, more recently, to improve multi-speaker TTS models [8]. In addition to speaker identity, speaker embeddings may also carry paralinguistic information such as the prosody, emotion, and gender of a speaker.
Different approaches have been proposed to encode speaker embeddings. These include identity vectors (i-vectors) [50], which are low-dimensional projections of the differences between a speaker's pronunciations and the respective overall average pronunciations; d-vectors [51], which are deep neural network (DNN) based and extracted from a hidden layer of a model trained to predict speaker identities; and x-vectors [52], which are also DNN based but capture segment/utterance-level information as well as frame-level information by using a statistical or max-pooling method to aggregate the frame-level information into a segment-level representation [53].

III. METHODOLOGY
In this section, we describe the implementation of the proposed adult-to-child speech augmentation process. The Python toolkit CLEESE is used to perform two key transformations on the adult speech data with the aim of transforming it into child-like speech. Fig. 1 shows a flow diagram of the overall augmentation process.
First, we triage the adult speakers by comparing the cosine similarities of their speaker embeddings to a mean child speaker embedding prior to the augmentation process (see Fig. 2). This is done by computing the mean child speaker embedding as well as the mean speaker embedding per adult speaker. Each adult speaker's cosine similarity to the mean child embedding is computed, and the value is compared to a threshold value for a selection decision to be made. More details on this are given in Section IV, B.
Next, we apply the pitch-shifting transformation to the utterances of the selected speakers. For each utterance, the pitch transformation is applied to the full utterance length. To achieve this, a BPF is created with one breakpoint (time: start of utterance; value: the desired pitch-shift factor, e.g., 100 cents, i.e., 1 semitone). CLEESE applies the transformation from the specified timestamp to the end of the utterance unless another breakpoint is encountered. Therefore, a single breakpoint (row) in the BPF modifies the full utterance.
Next, the time-stretching transformation is applied to the pitch-shifted utterances. To successfully stretch the desired segments of the sound, the exact start and stop times for the segments are needed to create the appropriate BPFs for stretching. For this, the wav2vec2.0-based forced alignment system is employed to align the adult speech with the corresponding transcripts. Based on the word timestamps, the start and end times of all "white spaces" in the utterance are derived and used in creating BPFs for the time-stretching transformation. The start time of each word and white space is used as a breakpoint in the BPFs, and different stretch factors are used for words versus whitespaces.
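The sketch below shows one way the aligner's word timestamps could be converted into a time-stretch BPF that slows only the inter-word pauses; the word spans are hypothetical, the pause factor of 1.8 anticipates the value chosen in Section IV, and the exact BPF layout expected by CLEESE is an assumption.

```python
def pause_stretch_bpf(word_spans, utterance_end, pause_factor=1.8):
    """Build [time, stretch-factor] breakpoints that stretch only inter-word pauses.

    word_spans: list of (word, start_s, end_s) tuples from the forced aligner.
    Returns a list of [time, factor] rows: factor 1.0 over words, pause_factor over gaps.
    """
    bpf = []
    for i, (word, start, end) in enumerate(word_spans):
        bpf.append([start, 1.0])             # keep the word itself at its original rate
        next_start = word_spans[i + 1][1] if i + 1 < len(word_spans) else utterance_end
        if next_start > end:                 # there is a pause before the next word
            bpf.append([end, pause_factor])  # stretch the pause
    return bpf

# Example with hypothetical aligner output (times in seconds):
spans = [("HE", 0.32, 0.51), ("HAD", 0.64, 0.90), ("THAT", 1.05, 1.40)]
print(pause_stretch_bpf(spans, utterance_end=1.80))
# [[0.32, 1.0], [0.51, 1.8], [0.64, 1.0], [0.9, 1.8], [1.05, 1.0], [1.4, 1.8]]
```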

IV. EXPERIMENTS
The proposed techniques for augmentation were implemented on an NVIDIA GeForce RTX 2080 Super GPU, and to scale our experiments we used an NVIDIA RTX A6000 GPU.

A. PRELIMINARY TESTS
Pitch-shift and time-stretch transformations were applied to randomly selected subsets of two adult speech datasets: LJ Speech [31] and VoxCeleb1 [34]. Pitch-shift factors in the range of 100 cents (1 semitone) to 800 cents (8 semitones) were tested on both male and female speakers, and random time-shift factors in the range of 2 to 4 were also tested. The goal was to determine approximately the range of pitch-shift and time-shift factors that would make sense to use in subsequent experiments.
The time-stretched utterances were qualitatively evaluated by listening to them, and it was observed that a time-shift factor of 4, which quadruples the audio length, resulted in extremely sluggish augmented utterances, and even a factor of 2, which doubles the audio length, resulted in utterances that were still a bit too slow. Another observation was that stretching the individual words in the utterance made them sound unrealistic.
For the pitch-shift transformation, we observed that the range of pitch-shift factors that achieved the desired results for some speaker identities did not produce realistic results for others, even after extending the range of pitch-shift factors. From these initial tests we determined that not all adult voices can be successfully tuned to sound child-like.
To resolve this and allow a larger study to be conducted, it was necessary to first triage the adult speakers and determine those whose voices are more suitable for transforming into natural child voices. This could be achieved by projecting both adults' and children's speaker embeddings into a latent speaker embedding space for comparison.

B. INITIAL EXPERIMENTS 1) COMPARISON OF ADULTS' AND CHILDREN'S SPEAKER EMBEDDINGS
To compare the adults' and children's speaker identities, a Generalized End-to-End (GE2E) loss [49] based speaker embedding (encoder) tool known as Resemblyzer [54] was used. It employs a d-vector-based speaker encoder model [49] optimized with the GE2E loss. It also has multiple functionalities for visualizing and comparing the extracted embeddings using Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction.
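A minimal sketch of this embedding-and-projection step, assuming the resemblyzer and umap-learn packages, is shown below; the wav paths are placeholders, and in practice many utterances per speaker are embedded.

```python
import numpy as np
import umap                                            # umap-learn package
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Placeholder paths; in practice many utterances per adult and child speaker are embedded.
adult_wavs = ["adult_spk1_utt1.wav", "adult_spk2_utt1.wav"]
child_wavs = ["child_spk1_utt1.wav", "child_spk2_utt1.wav"]

embeds = np.stack([encoder.embed_utterance(preprocess_wav(p))
                   for p in adult_wavs + child_wavs])  # (n_utterances, 256) d-vectors

# Project to 2-D for visual inspection of how adult and child voices cluster.
projection = umap.UMAP(random_state=0).fit_transform(embeds)
print(projection.shape)                                # (n_utterances, 2)
```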
Initially, the speaker embeddings of multiple speakers (both adults and children) were plotted via UMAP with the aim of finding adult speaker embeddings closest to the children's speaker embeddings. In Fig. 3, we show some speaker embeddings in a UMAP plot for visualization.
All male speaker embeddings are marked with black crosses, female speaker embeddings are marked with blue triangles and all child speaker embeddings are marked with red circles. The children's embeddings cluster in a small section of the embedding space. However, it was challenging to accurately identify the adult speakers that are closest or most similar to children by visual inspection. Therefore, it was decided to perform a cosine similarity-based comparison and select speakers with the highest similarity values for the main augmentation experiments.

2) COMPARISON OF EMBEDDINGS BASED ON COSINE SIMILARITY
The cosine similarity score is a number between 0 and 1. A similarity of 1 means the two embeddings compared are identical, and a similarity of 0 means they are completely different. Firstly, we extracted the speaker embeddings for multiple child speakers taken from the CMU kids corpus [29]. From previous research [24] as well as initial experiments (see Fig. 3), it is known that children's speaker embeddings form a small cluster in the speaker embedding latent space; hence, we computed the mean child speaker embedding to represent the children's embedding cluster.
Next, we took a random subset from the train-clean-100 subset of the Librispeech dataset [32] and computed an average speaker embedding for each adult speaker by averaging the embeddings of their individual utterances. We then compared these to the mean child speaker embedding using the cosine similarity metric. A flow diagram of this process is shown in Fig. 2.
All adult speakers whose similarity scores exceeded a predefined threshold of 0.65 were selected for augmentation, as in Equation (1). This threshold was chosen by listening to some of the utterances and observing their corresponding similarities. Fig. 4 shows some examples of the computed cosine similarities. More statistics regarding the cosine similarities are shown in the next section.
Dec = select, if sim_score ≥ 0.65; reject, otherwise    (1)

where Dec is the adult speaker selection decision and sim_score is the computed cosine similarity score between an adult's mean speaker embedding and the mean child speaker embedding.
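A minimal sketch of the selection rule in Equation (1), assuming the per-utterance embeddings have already been extracted (e.g., with the encoder described above):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_adult_speakers(adult_embeddings, child_embeddings, threshold=0.65):
    # adult_embeddings: dict mapping speaker_id -> list of utterance embeddings (np arrays)
    # child_embeddings: list of child utterance embeddings
    mean_child = np.mean(np.stack(child_embeddings), axis=0)
    selected = {}
    for speaker, embeds in adult_embeddings.items():
        mean_adult = np.mean(np.stack(embeds), axis=0)
        score = cosine_similarity(mean_adult, mean_child)
        if score >= threshold:          # Dec = select when the threshold is exceeded
            selected[speaker] = score
    return selected
```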

3) AUGMENTATION PROCESS
Further tests were then done on the selected speakers (i.e., adult speakers whose cosine similarities exceeded the threshold). Two separate ranges of pitch-shift factors were empirically chosen for the two genders. This was done by listening to the pitch-shifted utterances and rating them in terms of how convincingly child-like they sounded. For male and female speakers, the ranges of 500 to 700 cents and 100 to 300 cents were chosen, respectively. Based on the observations made about the time-stretched utterances in the preliminary tests, it was decided to stretch only the pauses (whitespaces) between words in the utterances, as well as unusually long words, without stretching every single word. For stretching all the whitespaces, we first used a time-stretch factor of 2 (i.e., we doubled the length/duration of pauses) and then reduced it to a time-stretch factor of 1.8 after qualitatively evaluating a few of the augmented utterances. In addition, we identified the unusually long words in the utterances, which might be difficult for children to pronounce, and stretched them using a factor of 2. This was done by computing the duration of each word and comparing it to an empirically chosen word length threshold.
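The empirically chosen factors can be collected in a small helper, as sketched below; the 100-cent step used to enumerate the pitch variants and the word-duration threshold value are illustrative assumptions (the text above specifies only that the word-length threshold was chosen empirically).

```python
# Factor ranges from this section; the discrete 100-cent step and the word-length
# threshold value are assumptions made for illustration.
PITCH_FACTORS_CENTS = {"male": [500, 600, 700], "female": [100, 200, 300]}
PAUSE_STRETCH = 1.8           # applied to all inter-word pauses
LONG_WORD_STRETCH = 2.0       # applied to unusually long words
LONG_WORD_THRESHOLD_S = 0.6   # hypothetical word-duration threshold in seconds

def word_stretch_factors(word_spans):
    """Assign a stretch factor per word: 2x for unusually long words, 1x otherwise."""
    return {word: (LONG_WORD_STRETCH if (end - start) > LONG_WORD_THRESHOLD_S else 1.0)
            for word, start, end in word_spans}
```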

C. MAIN EXPERIMENTS
For the main experiments, we used all the data in the train-clean-100 subset of Librispeech [32] as the adult speech dataset. A subset of the CMU kids dataset [29] was used as the child speaker set, specifically the Fort Pitt (FP) subset. Firstly, the adult speakers whose voices are most similar to children's were determined by performing the cosine similarity comparison explained in Section IV, B, using the same decision threshold of 0.65 as in the initial experiment. Table 3 shows the number of adult speakers above and below the cosine similarity threshold.
Once the most similar speakers were selected, the two augmentation techniques explained in Section II, B, namely the pitch-shifting and time-stretching transformations, were applied to all individual utterances of the selected speakers. The same pitch-shift and time-stretch factors chosen in the initial experiment were applied here. First, we applied the pitch-shifting transformation and then applied the time-stretching transformation to its output. Table 4 shows the final shift factors used for the pitch-shift and time-stretch transformations.
This resulted in multiple sets/folders of data per speaker, each containing utterances augmented with different augmentation parameters. Specifically, the sets of utterances differed in terms of pitch-shift factors only, as the time-stretching parameters were kept constant for all sets and all genders.

D. OBJECTIVE EVALUATION
In the initial experiments, the cosine similarity value served as a good metric to determine the proximity of adult speaker embeddings to the average child speaker embedding. Therefore, to objectively evaluate the augmented speech, it made sense to recompute the cosine similarities between each adult speaker's average embedding (after augmentation) and the average child embedding. After recomputing the cosine similarities, we observed that there was a general increase in the similarity values for all the speakers. Fig. 5 shows the cosine similarities of selected speakers before and after the transformations were applied. A similarity score of 1 would indicate that a speaker is exactly the same as the average child speaker. Table 5 also shows a statistical analysis of the adult speech data before and after augmentation. Note that the cosine similarities of all individual child speakers' embeddings to the mean child embedding were in the range of 0.9 to 0.973, except for one child (0.837).

E. SUBJECTIVE EVALUATIONS
While the increase in the cosine similarity of an augmented adult speaker gives a strong indication that the augmentation pipeline is achieving its primary goal, it is not possible to judge from this metric alone how realistic or intelligible the augmented voice is. For some speakers, it was noted that while the cosine similarity was high, the corresponding speech was occasionally distorted and unrealistic.
For this reason, it was decided to conduct a human listener evaluation study to validate how realistic the augmented speech from a speaker is and to confirm that it remains intelligible. Such a study can also help confirm the best speakers and the optimal augmentation parameters to use for individual speakers to build a larger augmented speech dataset, a core goal of this research.
To subjectively evaluate the augmented speech samples, the MOS [55] subjective evaluation method was applied. MOS evaluation is widely used to evaluate speech models, such as TTS and Voice Conversion (VC) models, by asking human evaluators to rate various aspects of speech quality such as naturalness, intelligibility, similarity, etc.

1) DESIGN OF MOS STUDY
There were three specific goals for the study: i) determine the optimal pitch-shift factor per speaker, ii) determine how realistic (convincingly child-like) the augmented utterances sound, and iii) determine whether the augmented utterances are distorted beyond understanding or remain intelligible. To achieve these goals, three questions that capture the required information were chosen and presented to the evaluators.
For the first goal, evaluators were provided with multiple variations per utterance and asked to select the most child-sounding one. The difference between the variations is the pitch-shift factor used in the pitch-shift transformation. The sample selected for the first question is used in the remaining questions. Secondly, evaluators were asked to rate the selected sample in terms of how convincingly child-like or how realistic it sounds on a scale of 1 to 5. Note that the linguistic content of the utterances is adult-like and very different from the typical linguistic content of child speech. Evaluators were given prior notice and were asked to disregard the adult-like linguistic content while rating the convincingness.
Thirdly, evaluators were asked to rate the same selected sample in terms of intelligibility on a scale of 1 to 5. Evaluators were restricted to the five grading points (i.e., 1, 2, 3, 4 and 5); they were not allowed to give intermediate scores, such as 2.5. Evaluators were also asked to identify the gender of the speaker by choosing one of three options: Boy, Girl and Can't say. Finally, evaluators were given the option to leave comments if they had any. Table 6 shows explanations of the scales for convincingness (question 2) and intelligibility (question 3), following the approach in [24].
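As a small sketch of how the raw 1-to-5 ratings could be aggregated into the per-speaker and overall C-MOS/I-MOS statistics reported later, assuming a simple tabular layout with hypothetical column names and values:

```python
import pandas as pd

# Hypothetical ratings table: one row per (evaluator, speaker, phrase) judgement.
ratings = pd.DataFrame({
    "speaker":         ["spk1", "spk1", "spk2", "spk2"],
    "convincingness":  [4, 3, 5, 4],    # question 2, scale 1-5
    "intelligibility": [5, 4, 5, 5],    # question 3, scale 1-5
})

per_speaker = ratings.groupby("speaker").agg(["mean", "std"])
overall_cmos = ratings["convincingness"].mean()
overall_imos = ratings["intelligibility"].mean()
print(per_speaker)
print(f"C-MOS: {overall_cmos:.2f}, I-MOS: {overall_imos:.2f}")
```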
With this design, a first MOS study (Study A) was conducted on utterances augmented with the following pitch-shift factors: 100 cents, 200 cents and 300 cents, meaning that for each utterance there were three variations for evaluators to choose from. After this first study was completed and the results were processed, it was decided to conduct a second MOS study (Study B) to refine the outcome of the first. In particular, we wanted to see the effect of including utterances augmented with higher and more finely grained pitch-shift factors than in the first study: 250, 300, 350 and 400 cents. This is because in the first study, the same variation of each utterance (augmented with the highest pitch-shift factor of 300 cents) was selected by almost every evaluator as the most child-like for all female speakers, so the range of pitch-shift factors to investigate clearly needed expansion.
Study A: In the first evaluation, there was a total of 30 evaluators, mainly drawn from an undergraduate engineering class. These were divided into two groups of 15 evaluators. Augmented speech samples were taken from Librispeech speakers. 20 speakers were chosen for the MOS study, after listening to samples from all their recording sessions to check for noise and rate the quality. This was done after triaging the adult speakers as described in Section III. They included 16 female speakers and 4 male speakers with the highest cosine similarities and high-quality audio samples. Each group of 15 evaluators was given a unique set of 10 different speakers to review (8 females and 2 males). The purpose was to reduce the total number of utterances per evaluator. To diversify the phrases, each evaluator group was further divided into 3 subgroups, and each subgroup was given a unique (randomly selected) set of 2 phrases per speaker, resulting in a total of 20 phrases to evaluate per evaluator. Evaluators were given three augmented samples (variations) per phrase: A, B and C, corresponding to pitch-shift factors of 100, 200 and 300 cents, respectively.
Study B: In the second evaluation study, augmented samples from only the 16 female speakers out of the top 20 speakers (the same speakers as in the first study) were evaluated. There were 60 evaluators, again mostly engineering students, divided into 2 main groups of 30 evaluators. Each group was further divided into 3 subgroups of 10 students, similar to the approach used in Study A. This time, each evaluator received 16 phrases from 8 speakers: two phrases per speaker, as in the first evaluation. Specifically, there were four variations per utterance/phrase: A, B, C and D, corresponding to pitch-shift factors of 250, 300, 350 and 400 cents, respectively. Information about evaluators for the two MOS studies is presented in Table 7.

V. RESULTS AND DISCUSSION
In Section III, we described our adult-to-child speech augmentation experiments using the two augmentation techniques described in Section II, with the goal of making the adult voices sound child-like. We also conducted two MOS studies to evaluate the quality of the synthetic child-like speech. In this section, we present and discuss the results of our experiments.
Tables 8 and 9 show the results obtained from the first and second MOS studies, respectively. More detailed presentations of the MOS evaluation results are shown in Tables 12 and 13 in the Appendix. The results of the subjective evaluation showed that the utterances of adult female speakers consistently received higher scores for intelligibility and significantly higher scores for convincingness. We had anticipated this result, as only 4 males ranked in the top 20 speakers from Librispeech train-clean-100. It is clear that female speakers offer a better starting point for building synthetic child voices than male speakers.
As shown in both Table 12 and Table 13, the optimal pitch-shifting factors for the female speakers lie in the range of 300 to 400 cents. Augmenting the pitch above this range causes the augmented speech to sound more chipmunk-like rather than child-like. For the male speakers, the pitch-shift factor of 600 cents was selected for 3 out of 4 speakers, but the augmented speech was unconvincing as child voices, with a very low average MOS score of 1.76. The overall C-MOS score of the most child-like samples was approximately 3.0 when adult male speakers were considered, and 3.7 when only adult female speakers were evaluated (Study B). Both convincingness MOS values are above average and imply that the augmented samples are reasonably convincing in terms of human perception, and very convincing when only female speakers are used in the study.
A relatively high I-MOS was obtained for the augmented samples of both genders, showing that generating synthetic child voices using our proposed method does not significantly degrade the intelligibility of the original speech samples.
Note that there are limitations in going from adult speech to child speech; for example, the linguistic content of adult speech data is completely different from the typical linguistic content of children's speech. For this reason, tuning the pitch and speaking rate of adult speech would not make the speech sound completely natural as child speech in terms of the linguistic content. However, these tunings can make the voices alone sound reasonably child-like, which is the target for the current study.
The mean cosine similarity of adult speakers after augmentation was 0.83 for all speakers exceeding the similarity threshold and 0.87 for the top 16 female speakers (see Table 5), whereas the mean cosine similarity of the individual child speakers was 0.94, indicating that there is still potential to further augment the adult speakers to sound closer to child speakers. This suggests that additional prosodic features and paralinguistic elements could be investigated and added to our augmentation strategy to improve the cosine similarity scores of the adult speakers.
Finally, to validate the augmented child speech data in a practical application, we next run some ASR fine-tuning experiments, as presented in the next section.

VI. VALIDATION OF THE AUGMENTED SPEECH: EXAMPLE APPLICATION - ASR FINETUNING
In this section, as an example application, we conduct semi-supervised ASR finetuning experiments with our augmented adult speech dataset, to show that the augmented speech can achieve an improvement over simply using additional adult speech to finetune an ASR model for child speech.
Note that the main goal of our study was to explore data augmentations to make adult speech data sound more child-like (i.e., closer to child speech data) in order to provide more child-like data for training, testing and validation of ASR and TTS models, so as to improve their performance on real child speech. Here, we show that finetuning a semi-supervised ASR model with augmented adult speech data can improve the ASR model's performance on child speech. We show that even when finetuned with adult-only speech data, the performance of the model improves to an extent; however, there is some additional improvement when the augmented adult speech is used.
We used the state-of-the-art (SOTA) wav2vec2.0 ASR model [3], which uses a self-supervised learning approach and has a two-step training process. First, the model is pretrained on a large amount of unlabeled speech; then it is finetuned on labelled speech data for a downstream task, such as ASR. We used a publicly available pretrained wav2vec2.0 model, which was trained on approximately 1000 hours of unlabeled Librispeech data [32]. This model was then finetuned with different combinations of our augmented datasets in the various finetuning experiments, as presented in the next sub-section. The aim was to compare the performance of an ASR model finetuned with real child and/or adult speech versus the same model finetuned with our augmented data (synthetic child-like speech). The Word Error Rate (WER) metric was used to measure the performance of the finetuned ASR models.
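A hedged sketch of the WER scoring step, using the Hugging Face transformers implementation of wav2vec 2.0 and the jiwer package, is shown below; the checkpoint, audio file and reference transcript are placeholders, and the CTC fine-tuning itself (standard recipe) is omitted.

```python
import torch
import torchaudio
import jiwer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder checkpoint: in practice this would be the model finetuned on the
# augmented (child-like) speech; here a public Librispeech-finetuned model is loaded.
name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name).eval()

waveform, sr = torchaudio.load("child_test_utterance.wav")     # hypothetical mono test file
waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
hypothesis = processor.batch_decode(pred_ids)[0]

reference = "THE CAT SAT ON THE MAT"                            # ground-truth transcript
print("WER:", jiwer.wer(reference.lower(), hypothesis.lower()))
```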

A. ASR FINETUNING DATASETS
We created two sets of synthetic child speech:
• Augmented_17h: Contains augmented utterances from the 16 female speakers of the train-clean-100 Librispeech dataset whose speaker embeddings are most similar to an average child embedding from the CMU-kids corpus by cosine similarity. The female speakers were selected by ranking all female speakers by their similarity score. This data totals approximately 17 hours in duration.
• Augmented_311h: Contains augmented utterances of all female speakers in the Librispeech train-clean-360, train-clean-100, dev and test sets combined, whose similarity score to the average real child embedding from the CMU-kids corpus is above 0.6. This data totals approximately 311 hours in duration.
We also used original (non-augmented) adult speech from Librispeech [32] and real child speech data from the MyST child speech corpus [30] for our finetuning experiments:
• Original_12h: Contains 12 hours of original adult speech.
• Original_220h: Contains 220 hours of original adult speech.
• MyST_55h: Contains 55 hours of cleaned MyST child speech, which was prepared according to [56].
The Original_12h and Original_220h sets are the original Librispeech (adult speech) counterparts of the Augmented_17h and Augmented_311h sets, respectively. Note that there is an increase in the number of hours of speech data when augmenting from Original_12h to Augmented_17h and from Original_220h to Augmented_311h. More information about the finetuning datasets can be found in Table 10.

B. ASR FINETUNING EXPERIMENTS
To test our hypothesis of a lower WER on child test data after finetuning on our synthetic child-like speech data, we prepared multiple finetuning experiments. The details of these experiments are presented in Table 11. The experiments were divided into three groups: A, B and C. Group-A experiments contained only the Original and Augmented datasets. MyST_55h was added for the finetuning experiments in Group-B in addition to the Original and Augmented datasets. Group-C experiments used the combined Librispeech datasets across all speakers, both original and augmented versions. All the groups used a wav2vec2.0 model pretrained on 960 hours of Librispeech data.
We used four test datasets to test our finetuned models at the inference stage. These datasets were prepared in accordance with our previous research on child speech ASR [56]. Since MyST [30] is the largest child audio corpus publicly available for research use, it was used for both finetuning and inference. This was done to see the performance when finetuning and testing on similar data distributions. We used 10 hours of MyST child speech data, 10 hours of PFSTAR British English data [27], 9 hours of CMU-Kids American English child speech data [29], and 9 hours of Librispeech dev-clean data as our test datasets. Different child speech test datasets were selected specifically to check the performance of our finetuned models on datasets that have different acoustic attributes, as well as on adult speech. WER values obtained on these test datasets during inference are shown in Table 11.

C. ASR FINETUNING RESULTS
Group-A: Finetuning with Augmented_17h resulted in a decrease in WER on the PFS_10h data (British English child speech), and a slight increase in WER on the other child test sets, when compared to inference with a model finetuned on its original speech counterpart (Original_12h). Furthermore, combining just 17 hours of the augmented child speech (Augmented_17h) with original adult speech (Original_12h) leads to a slight improvement in WER on all child test sets, as well as on the adult speech test data.
Group-B: This group uses the cleaned MyST_55h dataset in addition to the datasets used in the Group-A experiments. Using the Augmented data along with the MyST child speech dataset led to a decrease in WER on all the test datasets (see model 6 in Table 11).
Group-C: This group used datasets created from large-scale augmentation. There was an 18.3x increase in dataset size from Original_12h to Original_220h and from Augmented_17h to Augmented_311h, respectively. Augmentation led to a decrease in WER on the PFS_10h test data, but an increase in WER for all other datasets, which is very similar to the results of the Group-A experiments.

D. DISCUSSION OF RESULTS
For Group-A, the WER decreases for all the test datasets when both original and augmented adult speech datasets were used for finetuning.
With MyST data inclusion in Group-B, we see a major decrease in WER compared to Group-A results.
Furthermore, in Group-B, it can be seen that adding augmented speech along with MyST_55h (model 6) led to a decrease in WER on all the test datasets compared to using only MyST child speech for finetuning (model 4) or using both MyST and original adult speech (model 5). Also, by adding both the original and augmented speech for finetuning (model 7), an increase in WER can be observed on PFS_10h and the adult data, while the WER on CMU_9h is reduced.
Using Original_220h and Augmented_311h in the Group-C experiments did not lead to improvements in ASR performance when compared with the Group-B results. Comparing models 2 and 9, even with an 18x increase in the amount of augmented data, the WER decreased by only 3.5 points on average on child speech. While improvements in child ASR performance were expected, the results from the example application do not show significant improvements from using just the large amount of synthetic child speech for finetuning. This could partly be attributed to a lack of natural prosody in the augmented adult data (synthetic) when compared to real child audio. Although the synthetic speech sounds reasonably child-like in terms of pitch and speaking rate, it still lacks natural prosodic characteristics such as stammering, long pauses (due to uncertainty) and other features seen in real child audio recordings. Features of natural child speech prosody could be modeled in addition to the proposed augmentation approach, which would be expected to improve WER further.

FIGURE 1. Flow diagram for the adult-child speech augmentation process.

FIGURE 2. A flow diagram for the adult speaker selection process.

FIGURE 4. Original cosine similarities showing how similar Librispeech adult speakers are to the mean CMU kids child speaker embedding. The speaker gender is suffixed to the speaker IDs.

FIGURE 5. Increases in cosine similarity between adult and child speaker embeddings after pitch shifting and time stretching.


TABLE 1. Summary of child speech research datasets with statistics, pros and cons.

TABLE 3. Number of Librispeech train-clean-100 speakers above and below the cosine similarity threshold.

TABLE 4. Final shift factors used for pitch and time transformations.

TABLE 5. Statistics of cosine similarities for Librispeech train-clean-100 before and after augmentation.

TABLE 7. Summary of the data distribution for MOS Studies A and B.

TABLE 8. Mean and standard deviation (std) of convincingness and intelligibility MOS scores (C-MOS and I-MOS) from Study A.

TABLE 9. Mean and standard deviation (std) of convincingness and intelligibility MOS scores (C-MOS and I-MOS) from Study B.

TABLE 10. Details of the synthetic and original data used in finetuning.

TABLE 11. WER of ASR models finetuned with synthetic and original speech data.

TABLE 12. Per-speaker MOS scores and best shift factors from the first evaluation (Study A).

TABLE 13. Per-speaker MOS scores and best shift factors from the second evaluation (Study B).