Modeling Speech Sound Radiation With Different Degrees of Realism for Articulatory Synthesis

Articulatory synthesis is based on modeling various physical phenomena of speech production, including sound radiation from the mouth. With regard to sound radiation, the most common approach is to approximate it in terms of a simple spherical source of strength equal to the mouth volume velocity. However, because this approximation is only valid at very low frequencies and does not account for the diffraction by the head and the torso, we simulated two alternative radiation characteristics that are potentially more realistic: the radiation from a vibrating piston in a spherical baffle, and the radiation from the mouth of a detailed model of the human head and torso. Using the articulatory speech synthesizer VocalTractLab, a corpus of 10 sentences was synthesized with the different radiation characteristics combined with three different phonation types. The synthesized sentences were acoustically compared with natural recordings of the same sentences in terms of their long-term average spectra (LTAS), and evaluated with respect to their naturalness and intelligibility. The intelligibility was not affected by the type of radiation characteristic. However, the more similar their LTAS was to that of real speech, the more natural the synthetic sentences were perceived to be. Hence, the naturalness was not directly determined by the realism of the radiation characteristic, but by the combined spectral effect of the radiation characteristic and the voice source. While the more realistic radiation models do not per se improve synthesis quality, they provide new insights into the study of speech production and articulatory synthesis.

a number of different applications, including text-to-speech synthesis [7], [8], the simulation of speech acquisition [9], [10], [11], [12], and the study of paralinguistic effects [13], [14]. However, the quality of articulatory synthesis is currently not yet competitive with the best commercial unit-selection or neural speech synthesizers [15]. The key to improving the speech quality is the identification and improvement of potential shortcomings of the currently employed simulation models.

(The associate editor coordinating the review of this manuscript and approving it for publication was Lin Wang.)

This study explored the effect of the radiation characteristic on the quality of the synthesized speech. Usually, acoustic models of the vocal tract generate the volume velocity at the lips and the nostrils (e.g. [16], [17]). Here, the radiating areas of the mouth and the nostrils are considered as vibrating pistons. The radiation characteristic of a simple spherical source (the ''simple source'') has the closed-form expression [17], [21], [22]

H_simple(ω) = P_rad(ω)/U_0(ω) = (jωρ_0/(4πr)) · e^(−jkr),

where P_rad(ω) is the sound pressure at a distance r from the source, U_0(ω) is the volume velocity generated by the source, ρ_0 is the ambient density of air, and k = ω/c is the wave number.

The simple model of a spherical source is an approximation that applies only to a small radiation area (mouth opening) and to low frequencies, where the wavelength is long compared to the head diameter. Although this is acknowledged in textbooks on speech acoustics [17], [22], the potential effects of this constraint on the naturalness of synthesized speech have not been studied. For a more precise radiation characteristic, Stevens [17] proposes to use that of a vibrating piston in a sphere. According to measurements by Flanagan [23], the slope of the radiation characteristic is about 7 dB/oct for a microphone 30 cm in front of the mouth (instead of 6 dB/oct for the simple source), which is similar to the radiation from a piston in a sphere. Unfortunately, these measurements were performed for only nine discrete frequency points spaced in 6-semitone steps between 250 Hz and 4000 Hz. There are a number of other studies on the radiation of speech (e.g. [24], [25]), but they have focused on the relative changes of sound pressure levels with varying microphone positions, i.e. on speech directivity patterns. To our knowledge, there are no published transfer functions from volume velocity at the lips to sound pressure in front of the mouth for realistic head and torso geometries with high frequency resolution and up to high frequencies.
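The 6 dB/oct behavior of the simple source can be checked numerically. The sketch below assumes the standard monopole expression H_simple(ω) = jωρ_0 e^(−jkr)/(4πr) (the equation in the text is truncated in this copy) and uses the constants given later in the paper; the function name is our own:

```python
import numpy as np

RHO0 = 1.225   # ambient air density (kg/m^3), as used in this paper
C = 343.2      # speed of sound (m/s)

def h_simple(f, r=0.3):
    """Radiation characteristic P_rad/U_0 of a simple spherical source.

    Standard monopole expression (assumed here; the paper defines
    H_simple = P_rad/U_0)."""
    omega = 2.0 * np.pi * f
    k = omega / C
    return 1j * omega * RHO0 * np.exp(-1j * k * r) / (4.0 * np.pi * r)

# |H_simple| is proportional to frequency, i.e. it rises by ~6 dB per octave:
slope_db_per_oct = 20.0 * np.log10(abs(h_simple(2000.0)) / abs(h_simple(1000.0)))
```

Doubling the frequency doubles |H_simple|, so the slope is exactly 20·log10(2) ≈ 6.02 dB/oct, the value the paper contrasts with the roughly 7 dB/oct of the more realistic models.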

The present study had two goals. The first goal was to determine detailed radiation characteristics for the piston-in-sphere model and for a realistic head-torso geometry, and to derive finite-impulse-response filters for their convenient implementation in articulatory synthesizers. These two radiation characteristics represent two alternatives to the simple source model with different degrees of realism.

The second goal was to explore the effect of the different radiation characteristics on the perceived naturalness of synthesized speech. Since the radiation characteristic affects the long-term spectral pattern of the speech signal in a similar way as the glottal source signal, the types of radiation characteristics were assessed in combination with three different phonation types. Therefore, a set of ten sentences was synthesized for each combination of three radiation characteristics (simple source model and the two alternatives) and three phonation types (breathy, modal, and pressed phonation) using the articulatory synthesizer VocalTractLab (www.vocaltractlab.de), and assessed in a perception experiment. Finally, the different variants of the synthesized sentences were acoustically compared with their naturally spoken counterparts in terms of long-term average spectra, and evaluated with respect to their recognition by automatic speech recognizers (as a proxy for their intelligibility).

This section starts with modeling sound radiation from a vibrating piston in a sphere, and from the mouth of a realistic head-torso model. The two obtained transfer functions are then converted into digital filters for articulatory synthesis.

Following Williams [26], the sound pressure radiated by a vibrating piston set in a rigid sphere can be written as

P_rad(ω) = −(jρ_0 c Ẇ/2) · Σ_{n=0}^{∞} [P_{n−1}(cos α) − P_{n+1}(cos α)] · (h_n(kr)/h′_n(ka)) · P_n(cos θ),    (3)

where Ẇ is the velocity of the piston, a is the radius of the head, α is the angle between the piston axis and the circular edge of the piston on the surface of the sphere, k = ω/c is the wave number, P_n(x) is the Legendre function of the first kind of order n, h_n(x) is the spherical Hankel function of the 2nd kind of order n, and h′_n(x) = dh_n(x)/dx is its first derivative. (Note that we put a minus sign in front of the right-hand side of the equation for P_rad, because Williams [26] used the expression e^(−jωt) instead of e^(jωt), as we do here, for describing the time variation of the acoustic quantities.)

The Legendre polynomial of order n can be efficiently calculated with the recursion formula

P_n(x) = [(2n − 1) · x · P_{n−1}(x) − (n − 1) · P_{n−2}(x)]/n.

For the special case of n = 0 in Eq. (3), P_{n−1}(·) = P_{−1}(·) = 1. The spherical Hankel function is given by [26, Eq. 6.57]

h_n(x) = √(π/(2x)) · [J_{n+1/2}(x) − j · Y_{n+1/2}(x)].

In the present study we assume that the microphone or the ear of the listener is on the piston axis, so that θ = 0. In this case, the factor P_n(cos θ) in Eq. (3) becomes 1 (because P_n(1) = 1 for all n). Furthermore, when the mouth is small compared to the head, i.e. when α is small, then the mouth area is A_mouth ≈ πa²α², where a is the radius of the sphere. This can be used to write the piston velocity Ẇ in terms of the volume velocity U_0 generated by the piston, i.e. as Ẇ = U_0/A_mouth, which gives the radiation characteristic as

H_piston(ω) = P_rad(ω)/U_0(ω) = −(jρ_0 c/(2A_mouth)) · Σ_{n=0}^{∞} [P_{n−1}(cos α) − P_{n+1}(cos α)] · h_n(kr)/h′_n(ka).    (6)
We implemented this equation in Matlab R2021a (Mathworks, Inc.) to obtain the radiation characteristics for different distances between the ''mouth'' and the microphone (located on the axis of the piston), and for different areas of the piston. The ambient density of air and the sound velocity were set to ρ_0 = 1.225 kg/m³ and c = 343.2 m/s (sound velocity at 20 °C), respectively. Following Flanagan [22], the radius of the sphere (head) was set to a = 9 cm. For the Bessel functions J_n(x) and Y_n(x), the Matlab functions besselj and bessely were used.
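A Python/scipy analogue of such an implementation might look as follows. The series is the piston-in-sphere expression as reconstructed from Williams [26] (a sketch, not the authors' Matlab code); the constants (ρ_0, c, a, n_max) and the mouth area of 3.4 cm² are taken from the paper, while the function names are our own:

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

RHO0, C, A_SPHERE = 1.225, 343.2, 0.09   # density (kg/m^3), sound speed (m/s), head radius (m)

def h2(n, x, derivative=False):
    """Spherical Hankel function of the 2nd kind (or its first derivative)."""
    return (spherical_jn(n, x, derivative=derivative)
            - 1j * spherical_yn(n, x, derivative=derivative))

def h_piston(f, r=0.3, mouth_area=3.4e-4, a=A_SPHERE, n_max=30):
    """On-axis radiation characteristic P_rad/U_0 of a piston in a sphere.

    Evaluates the series for theta = 0, where P_n(cos theta) = 1; the
    small-alpha relation A_mouth = pi * a^2 * alpha^2 is inverted to get
    the half-angle alpha of the piston cap."""
    omega = 2.0 * np.pi * f
    k = omega / C
    alpha = np.sqrt(mouth_area / (np.pi * a**2))
    cos_a = np.cos(alpha)
    s = 0.0 + 0.0j
    for n in range(n_max + 1):
        p_lo = 1.0 if n == 0 else eval_legendre(n - 1, cos_a)  # P_{-1} := 1
        p_hi = eval_legendre(n + 1, cos_a)
        s += (p_lo - p_hi) * h2(n, k * r) / h2(n, k * a, derivative=True)
    return -1j * RHO0 * C / (2.0 * mouth_area) * s
```

With n_max = 30, the series is fully converged in the audio band: increasing n_max further changes the result only at the level of machine precision.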

The upper limit for the sum in Eq. (6) was set to n_max = 30. With this value, the calculated solution was stable (without ringing artifacts) up to a frequency of 12 kHz. Figure 2a shows |H_piston(ω)| for the distances d of 15 cm (black curve), 30 cm (gray curve), and 60 cm (light gray curve). In accordance with the measurements by Flanagan [23], the slope of H_piston(ω) is about 7 dB/oct.

The phase of H_piston(ω) is very similar to that of H_simple(ω), as illustrated in Figure 3. Here, the gray curve shows the difference between the two phase responses.

The closed surface of this domain was partitioned into four patches based on their different acoustic boundary conditions: the lip region from where the sound waves are radiated, the surface of the head and torso that partly absorbs and partly reflects the sound waves, the non-reflecting outer boundary of the sphere, and the fully reflective midsagittal symmetry plane. The lip patch is shown in Figure 4c and has an area of about 3.4 cm², which is close to that of the radiating piston in the piston-in-sphere model.

248
The FE simulation was performed with the development version of the open-source software dolfinx (https://jorgensd.github.io/dolfinx-tutorial/). It requires as input a 3D volume mesh of the simulation domain and a definition of the boundary conditions. The 3D mesh was created with the software Gmsh [27] (www.gmsh.info) based on the boundary representation of the simulation domain, and the boundary conditions were defined by means of the patches. The acoustic wave propagation in the simulation domain was determined by solving the complex-valued Helmholtz equation

∇²P(x) + k² · P(x) = 0,

where k = ω/c is the wave number, ∇ is the nabla operator, P is the acoustic pressure, x is the position, and c = 343.2 m/s is the speed of sound (same as before).

263
At the lips (the radiating surface), we implemented a Neumann boundary condition

∇P · n = −jωρ_0 V_0,

where n is the outward-pointing surface normal vector, V_0 is a constant particle velocity, and ρ_0 = 1.225 kg/m³ (as before). At the surface of the head and torso, we applied the Robin boundary condition

∇P · n = −jωρ_0 P/Z,

with the surface impedance Z = 500 ρ_0 c. This impedance makes the body surface mostly reflective with small acoustic losses (absorption) comparable to the losses in the vocal tract [28], [29]. To make the surface of the sphere non-reflective, a 3rd-order infinite element according to [30] was applied. Finally, the symmetry plane was forced to be hard-walled.
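The roles of these boundary conditions can be illustrated with a one-dimensional finite-element toy problem (our own numpy sketch, not the dolfinx setup): a Neumann velocity source at x = 0 and a Robin impedance condition at x = L. In 1D, choosing Z = ρ_0 c makes the Robin condition exactly absorbing, so the solution is the outgoing plane wave P(x) = ρ_0 c V_0 e^(−jkx):

```python
import numpy as np

RHO0, C = 1.225, 343.2

def helmholtz_1d(f, L=0.5, n_elem=400, V0=1.0, Z=RHO0 * C):
    """Solve P'' + k^2 P = 0 on [0, L] with linear finite elements.

    Neumann source at x=0:  dP/dx = -j*omega*rho0*V0  (vibrating surface)
    Robin condition at x=L: dP/dx = -j*omega*rho0*P/Z (surface impedance)
    Returns the complex nodal pressures."""
    omega = 2.0 * np.pi * f
    k = omega / C
    h = L / n_elem
    n = n_elem + 1
    A = np.zeros((n, n), dtype=complex)
    b = np.zeros(n, dtype=complex)
    # element stiffness and mass matrices for linear elements
    Ke = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h
    Me = h / 6.0 * np.array([[2.0, 1.0], [1.0, 2.0]])
    for e in range(n_elem):
        idx = np.ix_([e, e + 1], [e, e + 1])
        A[idx] += Ke - k**2 * Me
    A[-1, -1] += 1j * omega * RHO0 / Z   # Robin (impedance) term at x = L
    b[0] = 1j * omega * RHO0 * V0        # Neumann velocity source at x = 0
    return np.linalg.solve(A, b)
```

For Z = ρ_0 c the computed pressure magnitude is ρ_0 c V_0 everywhere (no reflection), which is a convenient sanity check of both boundary terms.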

278
The trial and test functions in the variational form were approximated by Lagrange polynomials of 2nd order. The resulting number of degrees of freedom was on the order of 2.6 million for an element size of about 6 mm. The linear algebraic system was solved on a high-performance computer with 96 cores for 480 frequency points between 25 Hz and 12 kHz (increment of 25 Hz). The total computing time for all frequency points was about 6 hours.

For each frequency point, the pressure P_rad(ω) was picked up at the mesh node closest to 30 cm in front of the lips (see Figure 4b). Since the simulated volume velocity at the lips was constant and frequency-independent, the obtained pressure directly yields the radiation characteristic H_FEM(ω).

Figure 5b shows the deviation of H_FEM(ω) from the simple source model. As for the piston-in-sphere model, H_FEM(ω) generally boosts the higher frequencies compared to the simple radiation model, but the spectrum is much more complicated and indicates a complex diffraction pattern around the realistic head and torso geometry. As is known from studies on speech directivity [31], [32], this pattern is also likely to change significantly when the position of the receiver changes, e.g. to a position to the side of the speaker. The complex diffraction pattern is also indicated by the phase of H_FEM(ω) in Figure 3 (solid black line). However, overall the phase of H_FEM(ω) is still similar to that of the simple model. To check the general correctness of the FEM simulation, it was additionally performed for the geometric situation of a piston in a sphere, analogous to the case described in Sec. II-A. The obtained transfer function (up to 12 kHz) is shown as the black dotted line in Figure 2b and agrees very well with the explicit solution for H_piston(ω).

To apply the more realistic radiation characteristics H_piston(ω) and H_FEM(ω) to articulatory speech synthesis, they must be converted into digital filters H_piston(z) and H_FEM(z), where z is the complex frequency of the z-domain. With regard to the filter type, infinite-impulse-response (IIR) filters can generally approximate a prescribed amplitude spectrum with a lower filter order than a corresponding finite-impulse-response (FIR) filter, but provide little or no control over the phase response. Hence, we created FIR filters for a good approximation of both the amplitude and phase response.

FIGURE 6. a) Magnitude response of a digital first-order high-pass filter with the transfer function H(e^(jωΔt)) = e^(jωΔt) − 1 at a sampling rate of f_s = 1/Δt = 44100 Hz (solid line). At higher frequencies, its magnitude starts to deviate from its analog counterpart with the transfer function H(ω) = jω (dashed line). b) Deviation of the piston-in-sphere radiation characteristic from the (digital) simple source characteristic (black line) and its approximation by an FIR filter (gray line). c) Deviation of the head-torso radiation characteristic from the (digital) simple source characteristic (black line) and its approximation by an FIR filter (gray line).

Figures 6b and c show ΔH(ω) for the piston-in-sphere model and for the head-torso model, respectively. Note that these spectra differ from those in Figures 2b and 5b in the high-frequency part due to the frequency warping mentioned above. With the following approximation of ΔH(ω), this warping is automatically compensated for.

To approximate ΔH(ω) by an FIR filter, we exploit the fact that an FIR filter of order M has the frequency response H(e^(jΩ)) = Σ_{m=0}^{M} b_m · e^(−jΩm), where b_m are the filter coefficients and Ω is the normalized angular frequency.

In the present study, |ΔH(ω)| was sampled at the frequencies n · 25 Hz with n = 0 . . . 882 for a Nyquist frequency of 22050 Hz. The filter order M was set to the smallest value that made the root-mean-square error between |ΔH(ω)| and the approximation |ΔH(e^(jΩ))| smaller than 0.2 dB in the frequency range 0-1 kHz. The threshold of 0.2 dB and the frequency band of 0-1 kHz were empirically chosen as a tradeoff between a good overall curve fit and a low final filter order. The magnitude spectra of the obtained ΔH(e^(jΩ)) for the piston-in-sphere model and the head-torso model are shown as gray curves in Figures 6b and 6c, respectively. For a sampling rate of 44100 Hz, the required filter order was 58 for the piston-in-sphere spectrum and 164 for the (more complicated) head-torso spectrum. For lower sampling rates, the required filter order is proportionally lower.

As a baseline for the acoustic assessment of the synthesized sentences, one long-term average spectrum (LTAS) was calculated from the utterances of each speaker. To this end, all ten sentences of the same speaker were peak-normalized and concatenated into one long audio file. The silent pauses before and after the sentences were removed, because they would otherwise affect the overall LTAS. The LTAS were calculated as the average of the power spectra of overlapping frames of 23.2 ms (= 1024 samples) with 50% overlap between frames using a Hamming window. The power spectrum of each frame was calculated as

|X(n)|² = |Σ_{k=0}^{N−1} w(k) · x(k) · e^(−j2πnk/N)|²,

where N = 1024 is the frame length in samples, n is the frequency index, k is the time index, w(k) is the Hamming window, and x(k) is the audio signal of the frame. Each LTAS was normalized in such a way that the maximum amplitude was 0 dB.
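One standard way to realize such a fit is the frequency-sampling method: take the inverse FFT of the sampled complex response to obtain an impulse response, truncate it to M + 1 taps, and increase M until the in-band magnitude error criterion is met. The sketch below is our own illustration of this idea (not the authors' fitting code), demonstrated on the causal first-order high-pass 1 − z^(−1), for which two taps suffice:

```python
import numpy as np

def fit_fir(H_desired, fs=44100.0, f_limit=1000.0, tol_db=0.2):
    """Fit an FIR filter to a complex response sampled at n*25 Hz up to fs/2.

    Frequency-sampling sketch: inverse-FFT the one-sided spectrum, then keep
    the smallest number of leading taps whose magnitude response deviates by
    less than tol_db (RMS, on the dB scale) below f_limit."""
    h_full = np.fft.irfft(H_desired)                # real impulse response
    freqs = np.linspace(0.0, fs / 2.0, len(H_desired))
    band = (freqs > 0.0) & (freqs <= f_limit)       # skip DC to avoid log(0)
    target_db = 20.0 * np.log10(np.abs(H_desired[band]))
    for M in range(1, len(h_full)):
        taps = h_full[:M + 1]
        # frequency response of the candidate FIR filter in the band
        E = np.exp(-2j * np.pi * np.outer(freqs[band] / fs, np.arange(M + 1)))
        fit_db = 20.0 * np.log10(np.abs(E @ taps))
        if np.sqrt(np.mean((fit_db - target_db) ** 2)) < tol_db:
            return taps
    return h_full

# Demo: the causal first-order high-pass 1 - z^{-1} is recovered exactly.
n = np.arange(883)                  # bins 0..882 -> 22050 Hz in 25 Hz steps
w = np.pi * n / 882.0               # normalized angular frequency
taps = fit_fir(1.0 - np.exp(-1j * w))
```

Applied to the deviation spectra of the two radiation models, this kind of order-selection loop is what produces a small order for a smooth target and a much larger order for a complicated one, in line with the orders 58 and 164 reported above.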

F. ARTICULATORY SYNTHESIS AND ANALYSIS OF STIMULI
To assess how the different models for the radiation characteristic affect the perceived naturalness of synthetic speech, the ten sentences in Table 1 were synthesized with different settings and then used in a perception experiment (Sec. II-G). For each of the 10 sentences, 9 variants were synthesized, i.e. for all combinations of three phonation types (breathy, modal, and pressed voice) and the three models for the radiation characteristic. This allowed us to study a potential interaction between the phonation type and the radiation characteristic, as both features affect the long-term spectral balance of the speech signal. For all variants of the same sentence, the same (copied) phone durations and pitch contours were used. The three phonation types were implemented in terms of the corresponding glottal shapes defined for the geometric vocal fold model in the gestural scores [35]. The glottal shapes mainly differ in terms of the degree of glottal abduction and the size of a constant posterior chink.

Analogous to the recorded utterances, one LTAS was calculated across all 10 sentences for each of the 9 synthesized variants. Like the LTAS of the natural recordings, they were normalized to a maximum value of 0 dB. To quantify the differences between the LTAS of the synthetic and natural utterances, for each of the 9 synthetic LTAS, the root-mean-square deviation from the average natural LTAS (using differences on the dB scale) was calculated across the frequencies from 0-12 kHz.
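The LTAS computation and the RMS-deviation measure described above can be sketched as follows (our own implementation of the stated procedure; frame length, overlap, window, and dB normalization as in the text, with hypothetical input signals):

```python
import numpy as np

def ltas(x, frame_len=1024):
    """Long-term average spectrum: mean power spectrum of Hamming-windowed
    frames with 50% overlap, normalized so that the maximum is 0 dB."""
    hop = frame_len // 2
    w = np.hamming(frame_len)
    n_frames = (len(x) - frame_len) // hop + 1
    spectra = [np.abs(np.fft.rfft(w * x[i * hop : i * hop + frame_len]))**2
               for i in range(n_frames)]
    avg = np.mean(spectra, axis=0)
    return 10.0 * np.log10(avg / np.max(avg))   # power -> dB, max at 0 dB

def ltas_rms_deviation(ltas_a, ltas_b):
    """Root-mean-square deviation between two LTAS curves on the dB scale."""
    return np.sqrt(np.mean((ltas_a - ltas_b)**2))

# Toy usage: a pure tone exactly on FFT bin 64 gives an LTAS peaking there.
x = np.sin(2.0 * np.pi * 64.0 * np.arange(8192) / 1024.0)
db = ltas(x)
```

In the paper, `ltas_rms_deviation` would be applied to the synthetic and the average natural LTAS over the 0-12 kHz band; here it is shown only on toy data.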

The listening experiment was performed to assess potential differences in the naturalness of the synthesized sentences due to the different radiation characteristics. Thirty native speakers of German (18-41 years old, mean: 28 years; 17 male and 13 female) participated in the experiment after giving their informed consent. The participants performed the experiment individually in an audio studio where the stimuli were presented over high-quality closed headphones (STAX SR-202 Basic) that were connected to a headphone amplifier (STAX SRM-212) and a laptop computer. Since the perceptual differences due to different radiation characteristics were assumed to be small, the naturalness was assessed in forced-choice pairwise comparisons of the stimuli. For each of the 10 sentences in each of the 3 phonation types, 3 pairs were created to compare the radiation models (simple vs. piston-in-sphere model; simple vs. head-torso model; piston-in-sphere vs. head-torso model). Hence, there were 90 different pairs of stimuli in total. To assess the consistency of the responses, each pair was presented twice, requiring participants to compare 180 pairs. The pairs were presented in an individually randomized order for each participant. Between the two stimuli of a pair, a short pause of 500 ms was included. The participant had to decide which of the two stimuli sounded more natural by pressing one of two buttons on a computer screen using a mouse. Playback of each pair could be repeated once on demand. After a decision, the stimuli of the next pair were automatically played.
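For analyzing such forced-choice data, a simple tally of how often each radiation model was preferred can be computed as below (our own sketch; the tuple layout of the responses is hypothetical, as the paper does not specify its data format):

```python
from collections import Counter

def preference_scores(responses):
    """responses: iterable of (model_a, model_b, chosen) tuples from the
    pairwise forced-choice task. Returns, per model, the fraction of its
    comparisons that it won."""
    wins, appearances = Counter(), Counter()
    for model_a, model_b, chosen in responses:
        appearances[model_a] += 1
        appearances[model_b] += 1
        wins[chosen] += 1
    return {m: wins[m] / appearances[m] for m in appearances}

# Toy example with three of the 180 judgments of one participant:
demo = [("simple", "piston", "simple"),
        ("simple", "head_torso", "simple"),
        ("piston", "head_torso", "piston")]
scores = preference_scores(demo)
```

Aggregated over all participants, sentences, and repetitions, such scores give the per-model preference proportions that the results section compares against the LTAS deviations.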

To assess the intelligibility of the synthetic utterances depending on the settings for the phonation type and the radiation characteristic, we used automatic speech recognition (ASR) in a similar way as in the study by Krug et al. [15]. Three different state-of-the-art speech-to-text web services (including Google) were accessed via the Python libraries SpeechRecognition and IBM-WATSON. The audio files sent to the services were preprocessed in the same way as described in [15]. The word error rates (WER) were calculated from the true text representations of the utterances and the ASR responses using the Python library Jiwer. Before the WER calculation, true and recognized sentences were normalized on the textual level as in [15]. If numerals were returned by the ASR systems as numbers, they were converted into the corresponding normalized text, i.e. ''7'' was converted into ''sieben'' (English: ''seven'').

This can be explained by the acoustic theory of speech production [17]. According to this theory, for voiced sounds, the short-time spectrum of the radiated speech is the product of the voice source spectrum, the vocal tract filter, and the radiation characteristic. Hence, when we assume that the LTAS ''averages out'' the effect of the vocal tract filter, it is governed by the product of the source spectrum and the radiation characteristic. Apparently, the voice source settings used for the synthesis compensated for the lower realism of the simple radiation model, i.e., the too shallow spectral slope of the simple radiation characteristic was compensated for by a voice source with too strong high-frequency components. Finally, as observed above, the synthesis with modal voice deviated least from natural speech across all radiation models.
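The WER computation performed by Jiwer is a word-level edit distance; a self-contained version (our own sketch, including the numeral normalization step mentioned above) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                    # delete all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                    # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# Numeral normalization as described in the text ("7" -> "sieben").
# Only the paper's one example is mapped here; a real table would be larger.
NUMERALS = {"7": "sieben"}

def normalize(text: str) -> str:
    return " ".join(NUMERALS.get(w, w) for w in text.split())
```

For example, `word_error_rate("das ist ein test", "das ist test")` gives 0.25 (one deletion out of four reference words), and normalization makes ''7'' and ''sieben'' score as a match rather than a substitution.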
The most striking observation here is that a decreasing preference for a radiation model coincides with an increasing acoustic deviation of the corresponding synthetic stimuli from natural speech based on the LTAS (Figure 9). This suggests that the long-term spectral deviation of synthetic from natural speech is a good indicator of the perceived naturalness of synthetic speech.

Figure 11 shows the word error rates (WER) of the automatic speech recognizers for the sentences synthesized under the 9 different conditions, which can be interpreted as a proxy for the intelligibility of the synthetic speech. For each condition, the box plot represents the WER distribution of 30 data points (the 10 synthetic sentences recognized by each of 3 ASR systems). The box plots suggest that the radiation characteristic has only a minor effect on the WER.

FIGURE 11. Results of the automatic speech recognition experiment. Each box plot represents the word error rates obtained by 3 different ASR systems for the 10 sentences synthesized with a specific setting for the voice quality (breathy versus modal versus pressed voice) and radiation characteristic (simple versus piston-in-sphere versus head-torso). The box plot at the bottom represents the word error rates obtained by the 3 ASR systems for the 10 natural utterances that served as templates for the re-synthesis. The crosses indicate median values, and the filled triangles indicate mean values.

In contrast to our expectations, the synthetic speech with the simple radiation characteristic was more similar to the natural utterances in terms of the LTAS than the utterances synthesized with the two more realistic radiation models. This result can be explained by the fact that the LTAS is generally determined by both the radiation characteristic and the spectrum of the voice source. Since the vocal fold model settings used here were adjusted using the simple radiation model prior to this study [35], the voice source model was evidently tuned to compensate for the inaccurate (simple) radiation characteristic so that the synthesis sounded as human as possible. This means that a more realistic radiation model alone does not automatically lead to greater spectral similarity with natural speech, but that the combination of the voice source model and the radiation characteristic is crucial.

A greater long-term spectral similarity between a synthetic utterance and the natural reference was also associated with a higher perceived naturalness of the utterance. This finding suggests that the long-term spectral deviation of synthetic speech from natural speech (for the same utterances of mul-