Extraction and Utilization of Excitation Information of Speech: A Review



I. INTRODUCTION
Speech is the most sophisticated means of communication among people. The carrier of speech is the acoustic speech pressure signal. In speech, a small number of basic elements, such as phones or syllables, are combined to form a large number of units, such as words and phrases. The complexity of speech is due to the many-to-one relationship between the speech sound and its perceived counterpart in the way that several phonetic contrasts can be produced by the same acoustic cue. Conversely, several acoustic cues may indicate the same phonetic contrast.
In addition, the phonemic cues in conversational speech are enriched by characteristics such as vocal emotions. Thus, the information conveyed through the speech signal is related not only to what is said but also to how the spoken message is conveyed. While the former is useful in situations such as information announcements, the latter is important in casual conversations.
Speech is produced by the physiological apparatus of the human speech production system. The function of this system can be divided into two main parts: excitation, the major component of which is generated at the larynx, and filtering, which refers to the effects of the dynamic articulators on the excitation during speech production. The characteristics of the excitation vary depending on the speech sound to be produced. For the most prevalent category of speech sounds in most languages, voiced sounds (such as the vowel [a] and the nasal [n]), the excitation is the air flow waveform generated by the vibration of the vocal folds. This excitation is called the glottal flow due to the air passing through the orifice between the two vibrating vocal folds at the glottis (see Fig. 1). The filtering process extends from the vocal folds to the lips and nostrils. It is influenced by the positioning of the tongue, the degree of opening of the mouth, and the movement of the lips. By varying the acoustical properties of the excitation, humans are capable of changing some essential cues of speech, such as pitch (e.g., generating low or high voices) and voice quality (e.g., coloring speech to sound breathy or pressed). By changing the articulators, humans can produce sounds (called phones) representing different phonemes (e.g., /a/ or /i/). Among the three categories of speech sounds (voiced, unvoiced, and plosive), voiced sounds are of special interest in speech science [1], [2].
There are many situations where the decomposition of the speech signal into the excitation and filter components is needed. The source-filter decomposition helps to model the two components effectively in several speech technology applications, such as speech synthesis [3], [4], enhancement [5], [6], and coding [7]-[10]. Decomposition of the speech signal also helps to improve our understanding of the human speech production mechanism. Studies have shown that understanding the excitation component helps in generating acoustical cues of different voice qualities [11]-[13] and vocal emotions [14]-[20], as well as in the production of different paralinguistic and nonverbal sounds [21]-[23]. The excitation information is also useful for providing complementary information to the more widely used vocal tract spectral features to improve, for example, the detection of speech disorders [24]-[27]. Even though the importance of the excitation component is well established in many areas of speech science and technology [28]-[30], it has received relatively little study because decomposing the speech signal to obtain the excitation component is difficult. The decomposition is difficult, for example, in expressive speech because of large variations in the characteristics of speech sounds. In addition, the nonstationarity of the speech production process compounds the difficulty. The speech production process also involves nonlinearities [31]-[33] that cannot be handled using linear source-filter models.
This article is a review of the methods to extract and utilize the excitation component of mainly voiced speech. An important and widely studied excitation feature is the inverse of the glottal cycle duration, that is, the instantaneous fundamental frequency (F0). In addition to F0, one of the most important features is the strong impulse-like component that is present in each cycle of the glottal flow waveform in the production of voiced speech. This impulse-like component is caused by the sudden deceleration of the air flow in the vicinity of the glottal closure instant (GCI) due to adduction of the vocal folds. The characteristics of this impulse-like event in the excitation waveform, particularly its sharpness, are closely associated with several important speech attributes, such as voice quality [11], [34] and loudness [35], [36]. The excitation information also includes the locations of the glottal opening instants (GOIs), where secondary excitation of the vocal tract might take place. Furthermore, excitation information also involves estimating the entire excitation waveform from the microphone speech signal and then expressing this time-domain signal (or its spectrum) with a few parameters to quantify, for example, time ratios between the opening and closing phases of the glottal flow waveform.

Fig. 1. Demonstration of the production of voiced speech in two phonation types: normal (left) and breathy (right). The upper part of the figure shows three time-domain waveforms (the speech pressure signal, the glottal flow estimated by GIF, and the glottal area function) and the lower part shows images of the vocal folds. The gray vertical lines show the instants when the images of the vocal folds were taken by the transoral high-speed digital videoendoscopy system (adopted from [38]).
The extraction of speech excitation information has been previously addressed in three review articles, which are more than five years old. In [30] and [37], the review focused on glottal inverse filtering (GIF) and its applications. The review in [29] studied the GCI detection methods, F0 extraction, and GIF methods, as well as applications for speech synthesis, speaker recognition, expressive speech processing, and biomedical applications. In this article, a more holistic review of the recent advances in extraction and utilization of excitation information is provided by studying different types of GIF methods, F0 extraction methods, and GCI and GOI extraction methods. Furthermore, this review highlights the extraction and utilization of the excitation information based on recent developments in deep learning, an issue that is absent from all the previous reviews. In addition, the utilization of excitation information in speech-based biomarking of human health, an area that has become increasingly important in recent years, is discussed especially from the point of view of the detection of neurodegenerative diseases.
The organization of this article is given as follows. Section II describes the generation of speech signals by the physiological human speech production mechanism. Section III briefly describes the extraction of excitation information using nonacoustic techniques. In Section IV, the extraction of excitation information is studied by describing the estimation of glottal flow using GIF and describing the most important features of the excitation, namely, F0, GCI, and GOI, and the issues underlying their extraction. Section V describes how excitation information has been used in four specific areas of speech research by addressing the study of phonation types, vocal emotions, laughter sounds, and pathological voices. Section VI describes recent trends in this area by discussing the use of deep learning in GIF and the extraction of F0 and GCI, as well as the utilization of excitation information in the detection of neurological diseases. Finally, conclusions are given and future directions are discussed in Section VII.
The topics addressed in this article are thematically described in Table 1 (ranging from the speech production mechanism to the utilization of excitation information). The list of abbreviations used in this article is given in Nomenclature.

II. HUMAN SPEECH PRODUCTION MECHANISM
The speech production mechanism allows humans to produce a vast range of sounds ranging from verbal sounds (normal speech) to nonverbal sounds (laughter, cry, and so on) and sounds of different voice qualities and emotions. Understanding the physiological speech production mechanism helps in the analysis of speech signals. A simplified schematic presentation of the speech production system is shown in Fig. 2. The system consists of many organs, which can be categorized into three main groups: the lungs (the subglottal system), larynx, and vocal tract (the supraglottal system) [39], [40].
The lungs serve as the source of energy for the speech production process, generating pressure in the larynx due to airflow. The (mean) lung pressure, also called subglottal pressure, is controlled by the speakers for producing sounds of different vocal intensity levels and phonation types. It has been found that lung pressure can rise to values as high as 6 kPa (60 cm of H2O) in loud singing voices [41]. However, in the production of speech signals, lung pressure is typically much lower. For example, the measured lung pressure values in [42] were below 1 kPa (10 cm of H2O) for most vowels of soft or normal loudness, and lung pressure rose to larger values (around 4 kPa, that is, 40 cm of H2O) only in loud and very loud speech signals.
During the production of voiced speech, the pressure from the lungs causes vibration of the vocal folds. The vocal folds (see Fig. 1), located in the larynx, are the key physiological organs in the production of voiced speech. The vocal folds have a layered structure consisting of five layers (for more details, see [43]). The vibration of the vocal folds forms the acoustical excitation signal for voiced speech, called glottal volume velocity waveform or simply glottal flow. Fig. 1 demonstrates the production of voiced speech in two phonation types: normal (left) and breathy (right). The figure shows the speech signal, the glottal flow estimated by GIF, the glottal area function, and the images of the vocal folds. The gray vertical lines show the instants when the images of the vocal folds were taken by the transoral high-speed digital videoendoscopy system.
There are two other types of excitation during speech production, resulting in unvoiced and plosive sounds. The unvoiced sounds (e.g., [s] and [f]) are generated by forming a constriction at some point along the vocal tract and forcing air through this constriction to generate turbulence. Plosive sounds are generated by abruptly releasing the air pressure that builds up behind a closure along the vocal tract. These sounds are also called stops. Plosives can be both unvoiced (e.g., [k] and [t]) and voiced (e.g., [g] and [d]). The vocal tract system (consisting of the oral, nasal, and pharyngeal resonant cavities) shapes the excitation signal, and the resulting air flow signal is radiated at the lips to form the speech pressure signal.
The production of speech can be considered as exciting a filter (the vocal tract system) by an excitation. This is called the source-filter model of speech production. In the source-filter model, the source and filter are assumed to be independent. It is worth emphasizing that, even though the assumed independence of the source and filter enables using more straightforward technologies, for example, in speech analysis and synthesis, there is coupling between the source and tract in the production of speech as reported in many studies (e.g., [31]-[33]). In summary, according to the way the excitation signal of the human speech production system is generated, the produced speech signals can be roughly divided into the following three categories: 1) voiced sounds (excited by the quasi-periodic glottal flow); 2) unvoiced sounds (excited by aperiodic noise-type flow); and 3) plosive sounds (excited by burst-type flow).
Production of these three broad categories of sounds is shown in the simplified diagram in Fig. 2. The current review focuses on the excitation information in the voiced sounds.
The source-filter model of speech production [shown schematically in Fig. 3(a)] can be expressed mathematically in the time domain as follows:
$$ s[n] = e[n] * v[n] \tag{1} $$
where $s[n]$ is the speech signal, $e[n]$ is the excitation (i.e., the derivative of the glottal flow waveform), $v[n]$ is the impulse response of the vocal tract (filter), and $*$ denotes the convolution operation. In the z-domain, the corresponding equation is given as follows:
$$ S(z) = E(z)\,V(z) \tag{2} $$
where $S(z)$, $E(z)$, and $V(z)$ correspond to the z transforms of the speech signal, source, and filter, respectively. Given the speech signal $s[n]$, the excitation can be obtained as follows:
$$ E(z) = \frac{S(z)}{V(z)} \tag{3} $$
Equation (3) shows that the excitation can be computed by canceling the effect of the vocal tract filter from the speech signal, that is, by filtering the speech signal with $1/V(z)$ [see Fig. 3(b)]. This forms the basis for the GIF method for the extraction of excitation information (discussed in Section IV-A1). The objective of this review is to discuss signal processing approaches to extract information in the excitation signal $e[n]$, given the speech signal $s[n]$.
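As a minimal illustration of (1) and (3), the following Python sketch synthesizes a signal by passing an idealized excitation through a known all-pole filter, V(z) = 1/A(z), and then recovers the excitation by inverse filtering with A(z). The single-resonance filter and its parameters (500 Hz, pole radius 0.95), the 8-kHz sampling rate, the impulse-train excitation, and all function names are assumptions made only for this example; in practice, V(z) is unknown and has to be estimated from the speech signal.

```python
import math

def all_pole_filter(e, a):
    """Synthesis, S(z) = E(z) / A(z): s[n] = e[n] - sum_{k>=1} a[k] * s[n-k]."""
    s = []
    for n, x in enumerate(e):
        acc = x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * s[n - k]
        s.append(acc)
    return s

def inverse_filter(s, a):
    """Analysis, E(z) = S(z) * A(z): e[n] = sum_{k>=0} a[k] * s[n-k]."""
    return [sum(a[k] * s[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(s))]

# Hypothetical one-resonance "vocal tract": pole pair at 500 Hz, radius 0.95.
fs, freq, radius = 8000, 500.0, 0.95
a = [1.0, -2.0 * radius * math.cos(2.0 * math.pi * freq / fs), radius ** 2]

# Idealized excitation: impulse train with an 80-sample period (F0 = 100 Hz).
period = 80
e = [1.0 if n % period == 0 else 0.0 for n in range(400)]

s = all_pole_filter(e, a)       # Eq. (1): excitation convolved with the filter
e_hat = inverse_filter(s, a)    # Eq. (3): effect of the filter canceled
max_error = max(abs(x - y) for x, y in zip(e, e_hat))
```

Because the inverse filter here is the exact inverse of the synthesis filter, `max_error` stays at the level of floating-point rounding; with an estimated vocal tract filter, the recovered excitation is only an approximation.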

III. EXTRACTION OF EXCITATION INFORMATION USING NONACOUSTICAL TECHNIQUES
In this section, three nonacoustical techniques to extract excitation information, namely, electroglottography (EGG), high-speed videoendoscopy (HSV), and videokymography, are briefly discussed. These techniques have been used to obtain the ground truth for the excitation features, such as F0, GCI, and GOI. The ground truth is useful for the evaluation of the methods developed for extracting these features from speech.
EGG is an electrical method to study voice production by feeding high-frequency-modulated current through two electrodes placed on either side of the glottis [45]. The electrical impedance between the electrodes decreases as the vocal folds adduct, and the impedance increases when the vocal folds abduct. Hence, the EGG signal provides information about the area of contact between the vocal folds during the production of voiced speech. As a method to compute the ground truth, EGG benefits from being a low-cost approach and can be applied not only to isolated sounds but also to continuous speech.
Fig. 4 shows the EGG signal in one glottal cycle. The EGG signal consists of four distinct phases [46]: the closing phase, the closed phase (CP), the opening phase, and the open phase. In the closing phase (between t1 and t3), the vocal folds first come into contact at the lower margins (between t1 and t2), and the contact then moves to the upper margins (between t2 and t3). Generally, the closing of the vocal folds is faster than the opening, and the instant of the maximum slope occurs at t2, which can be seen as a prominent negative peak in the differentiated EGG (dEGG) signal shown in Fig. 4(b). The vocal folds are in full contact during the CP (between t3 and t4), blocking the passage of air through the glottis. In the opening phase (between t4 and t6), the lower margins of the vocal folds begin to separate slowly from each other (between t4 and t5), followed by separation along the upper margins of the vocal folds (between t5 and t6). The instant of the maximum slope occurs at t5, which can be seen as the positive peak in the dEGG signal [see Fig. 4(b)]. The locations of the peaks in the dEGG signal, i.e., the negative peak at t2 and the positive peak at t5, are considered to be GCI and GOI, respectively. F0 is estimated as the inverse of the time difference between two consecutive GCIs. The values of F0, GCI, and GOI extracted from dEGG are used as the ground truth in evaluating the corresponding features extracted from the acoustic speech signal. In general, the glottal opening is a relatively slow phenomenon compared to the glottal closing. Therefore, the glottal opening may not appear in the dEGG as a clear impulse. Note that the EGG signal does not carry any information about the variations in the acoustic pressure signal [47]. A recent review of EGG for applications, including basic voice science, clinical practice, and singing, is given in [48].
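The computation of F0 from consecutive GCIs is simple enough to be sketched directly. In the following Python fragment, the function name, the GCI locations, and the 16-kHz sampling rate are hypothetical values chosen only for illustration; the GCIs are assumed to be given as sample indices (e.g., the negative peaks of the dEGG signal).

```python
def f0_from_gcis(gci_samples, fs):
    """Return one F0 value (Hz) for each pair of consecutive GCIs.

    gci_samples: increasing GCI locations as sample indices.
    fs: sampling rate in Hz.
    """
    return [fs / (later - earlier)
            for earlier, later in zip(gci_samples, gci_samples[1:])]

# Hypothetical GCI locations (in samples) at a 16-kHz sampling rate.
gcis = [100, 260, 420, 585]
f0_track = f0_from_gcis(gcis, fs=16000)
```

The first two cycles last 160 samples (10 ms), giving 100 Hz each; the last cycle lasts 165 samples, giving a slightly lower value. Such cycle-to-cycle estimates are what make the GCI-based F0 an instantaneous measure.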
In addition to EGG, laryngeal imaging methods, such as HSV and videokymography, have been used to compute the ground truth for the evaluation of various methods to extract excitation information from speech [49]. HSV is a technology to extract 2-D images from the motion of the vibrating vocal folds, and it is widely used in voice clinics. Videokymography is a simplified version of HSV based on high-speed imaging of the vocal folds at a specifically selected location along a horizontal line. For more details on HSV, the reader is referred to the review article published in [50], and for more details on videokymography, the reader is referred to the reviews published in [51] and [52]. Compared to EGG, the use of HSV and videokymography is more challenging in the computation of the ground truth because both of these methods require expensive equipment. Also, the obtained imaging data might be of low temporal and spatial resolution, and the methods do not enable a noninvasive analysis of voice production. Laryngeal imaging has been used jointly with acoustical analysis of speech excitation information in studying, for example, glottic cancer [53], diplophonia [54], and phonation onsets [55]. For visualization, simultaneously recorded HSV and EGG signals are shown in Fig. 5 for the closing and opening phases for a nonpathological vowel production by a male speaker.

IV. EXTRACTION OF EXCITATION INFORMATION FROM SPEECH SIGNALS
In this section, the extraction of excitation information from speech signals is described by first discussing the estimation of the glottal flow waveform using GIF and the parameterization methods developed to express excitation information from the glottal flow waveforms. Next, the most important excitation information features that can be extracted directly from speech signals, such as F0, GCI, and GOI, are discussed.

A. Extraction of Excitation Information Using GIF
GIF refers to the approach to estimate the glottal source from speech signals. In this section, we will first give an overview of the GIF methods and then describe the parameters derived from the estimated glottal source waveforms.

1) GIF Methods:
The estimation of the glottal source waveform by GIF is based on estimating the vocal tract filter. The effect of the vocal tract resonances is reduced by filtering the speech signal through the inverse of the estimated vocal tract transfer function. The idea of GIF was proposed in the 1950s [56] using analog antiresonance circuits. Since the 1970s, GIF methods have used digital signal processing tools. These methods differ mainly in the way the vocal tract transfer function is estimated. Most methods are based on linear prediction (LP) analysis, which assumes that the vocal tract transfer function can be approximated by an all-pole filter [57]. A widely used LP-based GIF method, i.e., the CP analysis, was proposed in [58]. The CP analysis is based on computing the vocal tract transfer function with LP using the covariance criterion that is computed from speech samples in the CP of the glottal cycle (i.e., this method calls for the extraction of GCI and GOI). Another popular GIF method is the IAIF [57]. In this method, the average effect of the glottal source on the speech spectrum during the open phase and CP of the glottal cycle is first estimated with a low-order all-pole filter. By removing this estimated average effect of the glottal source, a vocal tract model is computed without using the knowledge of GCI or GOI.
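The LP-based estimation of an all-pole vocal tract model and the subsequent inverse filtering can be sketched as follows. The sketch uses conventional autocorrelation LP over the whole analysis frame, solved with the Levinson-Durbin recursion; it is not an implementation of the CP analysis [58] or of IAIF [57]. The test signal (the impulse response of a hypothetical two-pole resonator at 700 Hz) and all function names are assumptions of this example.

```python
import math

def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n+k] for k = 0..max_lag."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns a (a[0] = 1) of A(z) = sum_k a[k] z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # updated prediction error energy
    return a

def inverse_filter(s, a):
    """Filter s through A(z), canceling the all-pole model 1/A(z)."""
    return [sum(a[k] * s[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(s))]

# Hypothetical vocal tract: one resonance at 700 Hz (pole radius 0.9, fs = 8 kHz).
fs, freq, radius = 8000, 700.0, 0.9
a_true = [1.0, -2.0 * radius * math.cos(2.0 * math.pi * freq / fs), radius ** 2]

# Test signal: impulse response of 1/A_true(z), i.e., a single excitation impulse.
s = []
for n in range(400):
    value = 1.0 if n == 0 else 0.0
    for k in (1, 2):
        if n - k >= 0:
            value -= a_true[k] * s[n - k]
    s.append(value)

a_est = levinson_durbin(autocorrelation(s, 2), order=2)
residual = inverse_filter(s, a_est)         # estimate of the excitation
```

For this idealized signal, the estimated coefficients match the true ones and the residual is essentially the original impulse. For natural voiced speech, the glottal source also shapes the spectrum, so a plain LP fit does not isolate the vocal tract; this is the problem that methods such as the CP analysis and IAIF are designed to address.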
More recent GIF methods are based on the QCP analysis [59] and QPR [60]. In the former, the CP analysis is replaced by a temporally weighted LP analysis, called weighted LP. In the latter, QPR together with physically motivated optimization (e.g., the flatness of the CP) is used to model jointly the vocal tract and the lip radiation.

Fig. 5. Visualization of the closing and opening phases of the glottal cycle by simultaneous electroglottographic and high-speed recordings. Vertical bars on the EGG and dEGG signals indicate the moment in time at which the visual image occurs. The EGG sampling frequency is 44 444 Hz, and the high-speed camera sampling frequency is 3704 frames/s (reproduced from [46] with permission of the publisher, the Acoustical Society of America).
GIF methods have also been developed based on the joint optimization of the source and filter [61]-[64]. In these methods, glottal source models, such as the Liljencrants-Fant (LF) model [65] and the Rosenberg-Klatt (RK) model [30], [66], are used to represent the glottal flow pulse or its derivative in a parametric form. Due to the use of predefined mathematical functions for the glottal source, these GIF methods are limited in their ability to capture the behavior of the glottal source in natural speech, particularly across different phonation types. Moreover, the use of multiparameter source models usually prohibits the use of classical optimization methods due to the nonconvex nature of the error surface, thus increasing the computational complexity [62]. The joint optimization of the source and filter has also been applied in GIF using acoustical tube models of the vocal tract [67]. The GIF proposed in [67] uses state-space modeling based on a concatenated tube model of the vocal tract and the LF model of the source. By optimizing the model using extended Kalman filtering, estimates of the glottal source and intermediate pressure values within the vocal tract are obtained.
GIF methods have also been developed using a combination of causal (minimum phase) and anticausal (maximum phase) components of the speech signal. The ZZT method [68], [69] and the CCD method [70] are two methods in this category. In these methods, the response of the vocal tract and the return phase of the glottal flow are considered as causal signals, and the open phase of the glottal flow is considered as an anticausal signal. These signals are separated by the mixed-phase decomposition using analysis synchronized with the GCIs. The performance of the ZZT and CCD methods is limited by the use of short speech segments and by the computational cost [69], [70]. Moreover, the assumption that speech can be expressed as a combination of causal and anticausal components may not hold when the speech data are degraded due to noise.
In all GIF methods, the ultimate goal is to try to estimate the ground truth, that is, the true glottal volume velocity waveform produced by the vocal folds, with maximum accuracy. Unfortunately, noninvasive recording of the true glottal flow is not possible in the natural production of speech. This absence of the ground truth is an inherent obstacle in the assessment of all GIF methods. The problem has been circumvented in some studies by synthetic test vowels generated using artificial glottal flow waveforms, such as the LF model [61], [67]. In addition, some studies have used physical modeling of the human voice production [59], [71]-[73]. In this approach, the test data are generated by simulating physical laws in sound production and transmission, instead of using preselected artificial source waveforms, which are linearly filtered with digital vocal tract models. A few recent studies [74], [75] proposed using a physical apparatus, where synthetic speech signals are produced by using known voice source waveforms as inputs. The physical vocal tract replica is made of stacked plexiglass disks or 3-D-printed in plastic using MRI images of the true vocal tract. In the above studies, the glottal flow estimated by GIF was compared with the information of the glottal area. There are many recent investigations where GIF methods have been studied jointly with glottal area information extracted using HSV or with physical models of voice production. These investigations have addressed issues, such as the relationship between the glottal flow and glottal area in the presence of source-filter interaction [76], [77] and in phonation onsets [55], the computation of parameter values for physical models [78], [79], and the estimation of subglottal pressure, laryngeal muscle activation, and vocal fold contact pressure [80]. We argue that the strategy used in these investigations to study excitation information of speech (i.e., using GIF jointly with HSV and with physical modeling approaches) will become increasingly important and also increasingly feasible in the future due to the progress in HSV [50], [81], physical modeling [82], [83], and GIF [59], [67], [71].

Fig. 6. One cycle of (a) the estimated glottal flow and (b) its derivative, with the ac flow (f_ac), minimum flow (f_min), and the minimum of the derivative (d_min).
2) Parameterization of the GIF-Based Glottal Flow Estimates: Glottal flows estimated by GIF are parameterized by expressing some important features in a compressed numerical form. Methods for parameterization of the glottal flow estimates can be grouped into time- and frequency-domain methods.
a) Time-domain parameterization methods: The traditional way to parameterize the glottal flow waveform in the time domain is to compute time-based quotients. This involves measuring ratios of time durations of different phases of the glottal flow waveform in one cycle. These time-based measures require the identification of GCI and GOI in the estimated glottal waveforms. For illustration, one cycle of the estimated glottal flow and its derivative are shown in Fig. 6(a) and (b), respectively. In the figure, the glottal pulse is divided into three parts: the CP ($T_c$), the opening phase ($T_o$), and the closing phase ($T_{cl}$). The most widely used time-domain parameters are the open quotient (OQ), the speed quotient (SQ), and the closing quotient (ClQ), which are defined as follows:
$$ \mathrm{OQ} = \frac{T_o + T_{cl}}{T}, \qquad \mathrm{SQ} = \frac{T_o}{T_{cl}}, \qquad \mathrm{ClQ} = \frac{T_{cl}}{T} $$
where $T = T_c + T_o + T_{cl}$ is the period of the glottal cycle. Time-domain parameters are affected by distortions, such as ripple, caused by incomplete canceling of formants. To counter the effects of the ripple, time-domain parameters are sometimes computed by replacing the true closure and opening instants with the time instants when the glottal flow crosses a level, which is set to a value between the minimum and maximum amplitudes of the glottal pulse [84].
The time-domain parameterization of the glottal flow can also be computed using amplitude-based measures. The most widely used amplitude-based time-domain parameterization methods take advantage of two prominent amplitude values of the glottal flow and its derivative: the ac amplitude of the glottal flow pulse and the amplitude of the negative peak of the flow derivative [65], [85]-[87]. An amplitude-based parameter called the normalized amplitude quotient (NAQ), proposed in [86], is given by
$$ \mathrm{NAQ} = \frac{f_{ac}}{d_{\min}\, T} $$
where $f_{ac}$ is the ac amplitude of the glottal flow pulse, $d_{\min}$ is the amplitude of the negative peak of the flow derivative, and $T$ is the period of the glottal cycle.
b) Frequency-domain parameterization methods: Frequency-domain parameters of the glottal flow are obtained from the Fourier transform of the estimated glottal flow. In practice, only the power spectrum is used to derive the frequency-domain parameters. A widely used frequency-domain parameter is the alpha ratio, which measures spectral tilt by computing the ratio between the spectral energies below and above a certain frequency (typically ≤1 kHz) [88]. Another frequency-domain glottal flow parameter is the harmonic richness factor (HRF) [89]. The HRF measures the tilt of the glottal flow spectrum as the ratio between the sum of the amplitudes of harmonics above F0 and the amplitude of F0. Another measure for the spectral tilt of the glottal flow is the dB difference between the amplitude of the fundamental and the second harmonic, i.e., H1-H2 [90]. It is also possible to quantify the glottal flow using the ratio between the harmonic and nonharmonic components of the glottal flow spectrum, which is referred to as the harmonics-to-noise ratio (HNR) [91], [92].
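The time-domain quotients discussed above reduce to simple arithmetic once the phase durations and amplitude values of a glottal cycle are available. The following Python sketch assumes the commonly used definitions OQ = (To + Tcl)/T, SQ = To/Tcl, ClQ = Tcl/T, and NAQ = f_ac/(d_min T), with d_min taken as the magnitude of the negative peak of the flow derivative. The function names and the numerical values of the example cycle are hypothetical.

```python
def glottal_time_quotients(t_c, t_o, t_cl):
    """OQ, SQ, and ClQ from the closed-, opening-, and closing-phase durations.

    All three durations must be given in the same unit (e.g., ms).
    """
    period = t_c + t_o + t_cl
    return {
        "OQ": (t_o + t_cl) / period,   # open quotient
        "SQ": t_o / t_cl,              # speed quotient
        "ClQ": t_cl / period,          # closing quotient
    }

def normalized_amplitude_quotient(f_ac, d_peak, period):
    """NAQ from the ac flow amplitude and the negative peak of the flow derivative.

    d_peak is in flow units per unit of time, with the same time unit as period;
    its sign is ignored so that NAQ is positive.
    """
    return f_ac / (abs(d_peak) * period)

# Hypothetical 10-ms cycle: 4 ms closed, 4 ms opening, 2 ms closing.
quotients = glottal_time_quotients(t_c=4.0, t_o=4.0, t_cl=2.0)

# Hypothetical amplitudes: ac flow 0.5, derivative peak -250 per second, T = 0.01 s.
naq = normalized_amplitude_quotient(f_ac=0.5, d_peak=-250.0, period=0.010)
```

For this example cycle, OQ is 0.6, SQ is 2.0, ClQ is 0.2, and NAQ is 0.2. Note that NAQ uses only two amplitude values and the period, so it does not require the exact time instants of glottal opening and closure.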

B. Extraction of F0
F0 of the vocal fold vibration is one of the important components of excitation information in voiced speech. The value of F0 varies from about 60 Hz in low-pitched male voices to about 1500 Hz in sopranos' singing voices [93]. The temporal variation of F0 corresponds to intonation, which contributes to vocal emotions [94]. The factors affecting the performance of F0 estimation methods are the effects of vocal tract resonances, the rapid variation of F0 (e.g., in emotional speech and children's speech), and signal degradation due to noise and reverberation.
In the production of some speech sounds, the glottal excitation is inherently aperiodic, containing more noise (such as in breathy phonation) or diplophony (such as in vocal fry) [95]; F0 extraction for such sounds needs further investigation. As F0 extraction is covered in several tutorials and books, this topic is not handled in detail in this review article. Instead, we discuss the general aspects of F0 extraction briefly here and focus more on recent deep learning-based progress on the topic in Section VI-A. For more details on F0 extraction, please see [96]-[104], where various methods are reviewed for clean and noisy speech, as well as for singing voices.
The F0 extraction methods can be grouped into three broad categories: 1) time-domain; 2) frequency-domain; and 3) time-frequency-domain methods. Time-domain methods take advantage of the periodicity of the speech signal or the LP residual. In this category, autocorrelation-based methods are popular due to their simplicity. The autocorrelation function measures the degree of similarity between a signal and its delayed version [105]. An estimate of the pitch period, i.e., the inverse of F0, is obtained by using the location of the peak in the autocorrelation function computed from a segment of speech or LP residual. This approach is used in many F0 extraction methods, such as SIFT [97], [106], RAPT [107], YAAPT [108], and PRAAT [109]. Several modifications to the autocorrelation-based methods were proposed in the YIN method [93].
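The basic autocorrelation idea can be sketched in a few lines of Python. The following fragment picks the lag with the largest autocorrelation value inside a plausible pitch range and converts it to F0. It is a bare illustration, not an implementation of SIFT, RAPT, YAAPT, PRAAT, or YIN: it has no voicing decision, no normalization, and no postprocessing. The function name, the search range of 50-400 Hz, and the synthetic test frame are assumptions of this example.

```python
import math

def estimate_f0_autocorr(frame, fs, f0_min=50.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from the peak of its autocorrelation."""
    lag_min = int(fs / f0_max)              # shortest pitch period considered
    lag_max = int(fs / f0_min)              # longest pitch period considered
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Similarity between the frame and its version delayed by `lag` samples.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag

# Synthetic 50-ms frame: 100-Hz fundamental plus a weaker second harmonic.
fs = 8000
frame = [math.sin(2.0 * math.pi * 100.0 * n / fs)
         + 0.5 * math.sin(2.0 * math.pi * 200.0 * n / fs)
         for n in range(400)]
f0 = estimate_f0_autocorr(frame, fs)
```

The frame must be longer than the longest lag searched (here 160 samples). On natural speech, strong formants can create competing peaks in the autocorrelation function, which is one reason why the methods cited above operate on the LP residual or add further processing.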
The spectra of periodic time-domain signals consist of high-energy amplitude components, located at F0 and its harmonics. This property forms the basis for frequency-domain methods. Examples of methods belonging to this category are the SHRP [110], the SRH [111], the summation of impulse-sequence harmonics [104], the method of dominant harmonics [112], and the SWIPE [113].
In the time-frequency-domain methods, the speech signal is first decomposed into several frequency bands, and then the time-domain methods are applied to each subband signal. The auditory-model correlogram-based algorithm [114] is a popular method, in which speech is decomposed using an auditory filter bank, and an autocorrelation function is computed for each subband signal. MBSC-based F0 estimation [115] uses four wideband FIR filters to capture multiple harmonics in every subband. Different weighting schemes are used to obtain the peak of the enhanced summary correlogram for robust F0 estimation.

C. Extraction of GCI
The derivative of the glottal flow waveform estimated from natural speech typically shows a prominent negative peak during the closing phase [28], [86]. This negative peak serves as the main excitation of the vocal tract system in each glottal cycle. The time instant of the negative peak is called the GCI. The GCI is used in different areas of speech research, such as the study of glottal activity [116], the estimation of pitch [104], [117]-[119] and formants [120], [121], and the analysis of loudness [36] and nonverbal sounds (such as laughter [23] and shouting [122]). GCIs are also used in time delay estimation [123]-[125], in determining the number of speakers from mixed signals [126], and in speech enhancement [5], [6], multispeaker separation [127], prosody modification [128], and speech synthesis [3], [4].
The widely used GCI detection methods are grouped into three categories [129]. The first category is based on processing the excitation signal, the second category involves processing the speech signal, and the third category uses both the speech signal and the excitation signal.

1) Methods Based on Processing the Excitation Signal:
The methods in this category use the excitation signal derived from the speech signal after removing the contribution of the vocal tract.This is usually carried out by using the LP analysis.The location of the large error value in the LP residual within a glottal cycle corresponds to the GCI.Identification of GCI locations from the LP residual is sometimes difficult due to the polarity of the residual values around the GCI.To overcome this difficulty, the use of the Hilbert envelope of the LP residual was proposed in [130].In [131], the Gabor filtering of the Hilbert envelope of the LP residual was used to detect GCIs.Some methods use the group delay function of the LP residual to locate the GCIs [131], [132].It was found in [133] that the group delay-based methods gave high false alarms.Dynamic programming-based techniques were proposed to reduce false alarms.Methods in [134] and [135] use the glottal flow waveform instead of the LP residual to detect the GCIs.The ILPR was used to detect the GCIs by searching for transients in the ILPR using the dynamic plosion index [136].
2) Methods Based on Processing the Speech Signal: Earlier methods for GCI detection were based on short-time energy of the speech signal in the time-frequency representation [137]. For the energy computation and the time-frequency representation, block processing of the speech signal is required, which may affect the accuracy of the GCI detection. In [138], GCIs were detected by searching for the maximum of the determinant of the autocovariance matrix of the speech signal.
Some methods exploit the properties of the impulse-like excitation present in the speech signal due to GCI.ZFF is one such method that takes advantage of the nature of the impulse-like excitation.In ZFF, the speech signal is filtered around 0 Hz using a cascade of two digital resonators [139].The negative-to-positive zero crossings of the ZFF signal correspond to GCIs for a signal with positive polarity [140].Another technique in this category is the LoMA method, which uses the time-scale representation to locate GCIs [141].The idea of the LoMA method is that discontinuities in the speech signal at GCIs and GOIs are reflected as amplitude maxima at each scale of the wavelet transform.Within a pitch period, an optimal LoMA is computed using dynamic programming to detect the GCIs.In [142], singularity/discontinuity behavior present in the speech signal was exploited using a nonlinear technique, called the microcanonical multiscale formalism, for GCI detection.The method was shown to be robust in conditions of low SNR.Recently, the magnitude spectral properties of the time-domain impulses were exploited to detect the GCIs using the SFF method [143]- [145].The method was shown to be robust in detecting GCIs in emotional speech and telephone quality speech.
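The ZFF steps described above (filtering around 0 Hz with a cascade of two resonators, followed by detection of negative-to-positive zero crossings) can be sketched as follows. This is a minimal sketch under stated assumptions: the trend removal by repeated local-mean subtraction, the window of about 1.5 mean pitch periods, the number of passes, and all function names are choices made for this example rather than details taken from [139] or [140].

```python
def zero_frequency_resonator(x):
    """Ideal digital resonator at 0 Hz: y[n] = x[n] + 2*y[n-1] - y[n-2]."""
    y = []
    for n, v in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(v + 2.0 * y1 - y2)
    return y


def remove_trend(x, half_window):
    """Subtract the local mean over [n - half_window, n + half_window]
    (window truncated at the signal edges)."""
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    out = []
    for n in range(len(x)):
        lo = max(0, n - half_window)
        hi = min(len(x), n + half_window + 1)
        out.append(x[n] - (prefix[hi] - prefix[lo]) / (hi - lo))
    return out


def zff_epochs(speech, mean_period_samples, passes=3):
    """Return sample indices of negative-to-positive zero crossings of the
    trend-removed output of two cascaded zero-frequency resonators."""
    # Difference the signal to remove any DC offset before the 0-Hz filtering.
    diff = [speech[0]] + [speech[n] - speech[n - 1] for n in range(1, len(speech))]
    y = zero_frequency_resonator(zero_frequency_resonator(diff))
    # Assumed window: about 1.5 mean pitch periods in total.
    half_window = max(1, int(round(0.75 * mean_period_samples)))
    for _ in range(passes):
        y = remove_trend(y, half_window)
    return [n for n in range(1, len(y)) if y[n - 1] < 0.0 <= y[n]]
```

For a synthetic train of negative unit impulses (standing in for the excitation at GCIs of a positive-polarity signal), the detected crossings away from the signal edges fall within about two samples of the impulses. Crossings near the edges are unreliable because the local-mean window is truncated there.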
3) Methods Based on Processing Both Speech Signal and Excitation Signal: In this category, the methods use the speech signal to first identify possible GCI locations within a certain interval.After this, discontinuities in the excitation signal are used to locate the GCIs.SEDREAMS is one such method [146].SEDREAMS uses the mean-based signal to find the possible GCI locations in an interval, after which the peak of the LP residual in the interval is used to detect the GCI.The mean-based signal oscillates around the local pitch period, thus guaranteeing good performance in terms of reliability, i.e., reduction in the number of false alarms and misses.In [147], SEDREAMS was modified to handle speech of different voice qualities.This method uses postprocessing techniques and dynamic programming, in addition to SEDREAMS.Other methods, such as DYPSA [133] and YAGA [134], use the excitation signal (LP residual in DYPSA and glottal flow waveform in YAGA), wavelet transform, group delay, and dynamic programming by minimizing various cost functions.The cost function consists of various elements, such as the interpulse similarity, normalized energy values, pitch deviation, costs derived from the projected phase slope, and deviations from an ideal phase slope function.More details on the GCI detection methods and the GCI-based analysis of speech processing can be found in [28], [29], [129], [146], and [148].

D. Extraction of GOI
In comparison to the detection of GCIs, the detection of GOIs from speech signals is generally more difficult because the abduction (opening) of the vocal folds is typically a more gradual phenomenon than their adduction (closing) [28]. Methods for the detection of GOIs are mainly based on first detecting the GCIs, after which a suitable duration is assumed for the open phase, either by fixing a value or by using a ratio with respect to the pitch period. The detection of GOIs is needed for the CP analysis and for characterizing speech production using the OQ [149], [150].
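The ratio-based strategy mentioned above can be sketched in a few lines: given consecutive GCIs, each GOI is placed a fixed fraction of the local pitch period before the closing GCI of that cycle. The function name and the default open quotient of 0.6 are illustrative assumptions, not values taken from the cited works.

```python
def goi_from_gci(gcis, open_quotient=0.6):
    """Place one GOI per glottal cycle, open_quotient * T before the GCI that
    ends the cycle, where T is the distance between consecutive GCIs.
    The GCIs may be given in samples or in seconds."""
    if not 0.0 < open_quotient < 1.0:
        raise ValueError("open_quotient must lie strictly between 0 and 1")
    gois = []
    for start, end in zip(gcis, gcis[1:]):
        period = end - start
        gois.append(end - open_quotient * period)
    return gois
```

For example, GCIs at samples 0, 100, and 220 with an assumed open quotient of 0.5 give GOIs at samples 50 and 160. Such estimates inherit any error in the GCIs and ignore cycle-to-cycle variation of the true open phase.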
It is to be noted that there is no unique definition for GOI [134].Three main definitions of GOI are reported in the literature [134].Each one of these definitions is limited to a specific application of interest.In the first definition, the GOI occurs at the end of the CP, where an increase in the LP residual error occurs [58], [134].This definition is used in the estimation of the glottal flow with the CP analysis.The second definition is based on the dEGG signal, where the GOI is identified as the location of the maximum value of the dEGG signal, corresponding to the maximum rate of change of the glottal impedance/conductance [46], [151].This definition has been used to compute the OQ to describe pathological voices [149], [150].The third definition of GOI is based on the EGG signal, by defining GOI as the time instant where the amplitude of the EGG signal is equal to a given percentage of the maximum value of the EGG signal within the glottal cycle [152].Since the glottal opening is typically more gradual compared to glottal closing, it is appropriate to define the GOI as an interval within a glottal cycle rather than a time instant.
In [153], the Hilbert envelope of the LP residual was used for the detection of GOIs, after first detecting GCIs. In [134], [154], and [155], the multiscale product of the decomposed wavelet signals was shown to be effective for the GCI/GOI detection from speech and EGG signals.
In [134], the YAGA method was proposed for the detection of GCIs/GOIs using wavelet transform, group delay, glottal flow waveform, and dynamic programming. SEDREAMS uses the LP residual and mean-based signal to detect GCIs/GOIs [146].
From the point of view of speech production, when the vocal folds are completely open in a glottal cycle, the subglottal system is maximally coupled to the supraglottal system, and the resultant vocal tract is longer compared to the tract during the CP. The effect of opening on the response of the vocal tract system is different during different stages of the open phase. When the glottis starts to open, the bandwidth of the first formant of the supraglottal vocal tract begins to increase. On the other hand, at the end of the opening phase, the effective vocal tract length will be larger due to coupling, and therefore, the center frequency of the lowest resonance will decrease and its bandwidth will increase. This results in the increased spectral flatness of the response of the vocal tract system. Motivated by this phenomenon, the lower DRF is used for deriving the open phase using the ZTW method [156], [157]. The glottal open phase is determined using a threshold value of 0.5 over the normalized DRF contour. The interval below this threshold is identified as the open phase and the remaining part of the glottal cycle as the CP. It was shown in [145] that the spectral flatness computed at each instant of the ZTW spectrum highlights glottal opening, as the effective vocal tract length is longer in the glottal open phase, which increases the bandwidths of the resonances, making the spectrum flatter, compared to the CP. In [145], the open phase is identified as the interval between the peak in the spectral flatness plot within a glottal cycle and the following GCI.
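Two of the operations described above can be illustrated with short helper functions: a spectral flatness measure and the thresholding of a normalized contour at 0.5. Both are sketches under assumptions. Spectral flatness is computed here with the common definition (geometric mean divided by arithmetic mean of the power spectrum), which may differ from the exact measure used in [145], and the min-max normalization within a cycle is assumed because the normalization is not specified in the text above.

```python
import math


def spectral_flatness(power_spectrum, floor=1e-12):
    """Geometric mean over arithmetic mean of a power spectrum; close to 1 for
    a flat spectrum and close to 0 for a strongly peaked one."""
    vals = [max(p, floor) for p in power_spectrum]
    geometric = math.exp(sum(math.log(v) for v in vals) / len(vals))
    arithmetic = sum(vals) / len(vals)
    return geometric / arithmetic


def open_phase_mask(contour, threshold=0.5):
    """Min-max normalize a per-sample contour (e.g., a DRF contour) within one
    glottal cycle and mark samples below the threshold as open phase."""
    lo, hi = min(contour), max(contour)
    span = hi - lo
    if span == 0:
        return [False] * len(contour)
    return [(v - lo) / span < threshold for v in contour]
```

A flat spectrum such as `[1, 1, 1, 1]` gives a flatness of 1, whereas a spectrum dominated by a single sharp resonance gives a much smaller value, which is the contrast between the open phase and the CP that the methods above exploit.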
Research has also been conducted to extract impulse-like sequences and their relative strengths in each glottal cycle directly from the speech signal [23], [158]. In [9], [159], [160], the excitation component was represented as a multiple-pulse sequence for the purpose of speech synthesis. For this, the LP analysis and synthesis methods were used to determine the locations and strengths of the impulses by considering one pulse at a time or by jointly optimizing the strengths of several pulses (such as the regular pulse excitation and the random pulse excitation). In a more recent study [158], a method was proposed to extract a sequence of impulses from the signal by modifying the ZFF method using various levels of trend removals. This approach is justified by the pitch perception of expressive voices [104], [161]-[163]. However, there is a need for signal processing techniques that can exploit the impulse-like sequences derived directly from the input signal without using block processing and vocal tract system characteristics.
In addition to the issues described above, some studies have proposed extracting the excitation information using features, such as the strength of the impulse-like excitation at glottal closure (as in NAQ) and the sharpness or the abruptness of glottal closure [36], [164]. Fig. 7 illustrates some of the excitation features extracted from the speech signal. In Fig. 7, (a) shows a segment of voiced speech, (b) shows the dEGG signal, (c) shows the LP residual, (d) shows the glottal flow derivative, and (e) shows the instantaneous F0.

V. UTILIZATION OF EXCITATION INFORMATION IN DIFFERENT AREAS OF SPEECH RESEARCH
In this section, we discuss how the excitation information is utilized in different areas of speech research. The section is divided into four research areas, where extraction of excitation information plays a significant role: 1) study of phonation types; 2) study of vocal emotions; 3) study of laughter sounds; 4) study of pathological voices.

A. Study of Phonation Types
Humans are capable of coloring their speech by changing phonation type, i.e., the vibration mode of the vocal folds. The analysis and classification of different phonation types are needed in applications, such as in speech synthesis and modification systems [89], [165], [166], and tagging expressive speech corpora [167]. Furthermore, the identification of phonation type is useful in the assessment of the cognitive load of the speaker, speaker recognition, emotion recognition, and speech recognition [14], [17], [29], [168]-[173].
Generally, three broad phonation types are considered. They are breathy, modal (or normal), and pressed (or tense). When phonation type changes from breathy to modal and pressed, the characteristics of the glottal flow pulse change considerably. The glottal flow pulse changes from a smooth symmetric waveform in breathy phonation to an asymmetric waveform with sharp edges in pressed phonation [11], [174]. This variation in the time domain is reflected as the decrease in the decay of the spectral envelope of the glottal pulse in the frequency domain [175], [176].
Glottal source parameters were explored for discriminating breathy, modal, and tense voices in [11] and [147].Frequency-domain parameters, such as H1-H2 [176], HRF [89] and the PSP [177], were used for the discrimination task.In addition, time-domain parameters, such as ClQ, QOQ, OQ, and SQ, and amplitude-based parameters, such as NAQ, were also used [11], [30], [86].Some studies measured the amount of aspiration noise present in the signal for detection of breathy voice based on the observation that breathy voices are noisier compared to modal voices [176], [178].In [179] and [180], parameters were derived for various voice qualities by fitting the estimated glottal source waveform with the LF model.
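To make two of the parameters above concrete, the sketch below computes NAQ from one cycle of an estimated glottal flow, using the usual definition NAQ = f_ac / (d_peak · T), and H1-H2 as the level difference in decibels between the first two harmonics. The function names are hypothetical, F0 and the cycle boundaries are assumed to be known, and the harmonic amplitudes are read from single-frequency DFT projections, which is a simplification of how H1-H2 is measured in practice (for instance, no formant correction is applied).

```python
import math


def naq(flow_cycle):
    """Normalized amplitude quotient of one glottal cycle of flow samples:
    f_ac / (d_peak * T), with T expressed in samples."""
    f_ac = max(flow_cycle) - min(flow_cycle)            # peak-to-peak flow
    derivative = [b - a for a, b in zip(flow_cycle, flow_cycle[1:])]
    d_peak = -min(derivative)                           # magnitude of negative peak
    if d_peak <= 0:
        raise ValueError("flow cycle has no closing (falling) phase")
    return f_ac / (d_peak * len(flow_cycle))


def harmonic_amplitude(signal, freq_hz, fs):
    """Amplitude of the sinusoidal component at freq_hz via a DFT projection."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * n / fs) for n, v in enumerate(signal))
    im = sum(v * math.sin(2 * math.pi * freq_hz * n / fs) for n, v in enumerate(signal))
    return 2.0 * math.hypot(re, im) / len(signal)


def h1_h2_db(signal, f0_hz, fs):
    """Level difference (dB) between the first and second harmonics."""
    h1 = harmonic_amplitude(signal, f0_hz, fs)
    h2 = harmonic_amplitude(signal, 2.0 * f0_hz, fs)
    return 20.0 * math.log10(h1 / h2)
```

As a sanity check, a triangular flow pulse that rises over 60 samples, falls over 20 samples, and stays closed for 20 samples gives NAQ = 0.2, and a signal whose second harmonic has half the amplitude of the first gives H1-H2 of about 6 dB. A more gradual closing phase lowers d_peak and therefore raises NAQ.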
In [164] and [175], it was observed that H1-H2 and NAQ were the best parameters for discrimination of different phonation types.However, it was observed that the accuracy of the estimated glottal source parameters reduces for high-pitched voices and expressive voices [29], [30].To overcome this, attempts have been made recently to extract the excitation information directly from the speech signal.In [164], a parameter called the MDQ was proposed to capture the sharp changes in the glottal closure characteristics from the LP residual.In [175], using the spectral parameter LFSD, it was observed that pressed voices show smaller OQ, and breathy voices show higher OQ.The effect of the subglottal system on the spectrum is stronger for breathy voices due to larger OQ compared to the pressed voices.Larger OQ results in the increase in LFSD for breathy voices, typically around the region of the glottal formant (which is lower in frequency than the first formant).In [175], it was observed that LFSD and MDQ are close to NAQ, and HNR seems to provide poor discrimination for the three phonation types.However, HNR was shown to provide good discrimination of breathy and modal voices compared to pressed and modal voices.It was observed that H1-H2 performs poorly for female speakers, and it is as good as NAQ for male speakers.This may be due to the overlap of the second harmonic with the first formant for female voices.In general, it was observed that no single parameter performed consistently well for all the speakers in the discrimination of phonation type.
Kadiri et al. [34], Kadiri and Yegnanarayana [182], [183], and Kadiri and Alku [184] explored the features derived from the ZFF, ZTW, and SFF methods for discriminating phonation types. In these studies, cepstral coefficients were obtained from the spectra estimated by the three methods, and the cepstral coefficients were used in addition to scalar excitation features (such as spectral statistics). Recently, in [184], the MFCCs computed from the glottal source waveforms estimated by the QCP method and the ZFF method were shown to be effective for the classification of different phonation types from speech signals.

B. Study of Vocal Emotions
The features used for emotion recognition can be broadly characterized as spectral and prosodic features. The general trend of four spectral features, i.e., changes in the lowest two formant frequencies, the bandwidth of the first formant (F1), and spectral tilt, is indicated in Table 2 for anger, happiness, and sadness [185], [186]. The trend is indicated as an increase or decrease in the parameter value relative to the neutral state. Similarly, the trend of prosodic features, i.e., F0, energy, and speaking rate, is indicated in Table 3 [185], [186].
The basic technological principles of emotion recognition systems are similar to those used in speech and speaker recognition, as well as in language identification [187]-[189]. In most emotion recognition studies, short segments of speech are represented in terms of spectral features, such as MFCCs or LPCCs, prosody features, and their statistics [185], [187], [190]-[194]. These features are available in open toolkits, such as openSMILE [192], [195]-[197]. The features extracted from the emotional speech are used to develop nondiscriminative/discriminative models, such as GMMs, FFNNs, and DNNs [187], [198]-[200]. Binary classification techniques, such as SVMs and Bayesian logistic regression, have been used for the multiclass problem by adopting them in a hierarchical binary decision tree framework [188], [196], [201].
Emotion recognition systems generally use the features representing the vocal tract system characteristics. There are fewer studies of emotional speech involving the use of excitation information [14]-[17], [199], [202], [203]. Most of these studies use the voice source features computed from a specific category of speech sounds, such as vowels [14], [15], [17], [202], [204]. In [15] and [16], the role of the voice source in the perception of emotional arousal (active and passive) and valence (positive and negative) attributes was studied from short vowels (with a duration of 150 ms). The results showed that NAQ correlates better with arousal than valence for both genders. Similarly, in [14], emotions in short vowel segments of [a:] in continuous speech were analyzed. Significant differences were found in NAQ between most emotions. Even though NAQ correlates with emotions and voice quality changes, it was found that NAQ by itself is not sufficient for discriminating between emotions accurately [14]. In [17], the interdependencies among glottal source features were studied across five emotions using six glottal source parameters extracted from the glottal flows estimated by GIF. In [202] and [204], the robustness of the glottal source features was examined across databases for four emotions (anger, happiness, neutral state, and sadness).

C. Study of Laughter Sounds
Nonverbal sounds, such as laughter, convey nonlinguistic information.Production of these sounds is typically involuntary and spontaneous.Nonverbal sounds do not have any clear description of articulation.In laughter, changes occur in the excitation due to involuntary bursts of activity.Laughter conveys a variety of functions, such as indication of affection, aggressive behavior (laugh in someone's face), bonding behavior (such as in early infancy), or appeasement behavior (such as in situations of dominance) [210].Detection of laughter can help in understanding the emotional state of a speaker [211].The analysis of laughter also helps in spotting regions of laughter in continuous speech.Characterization of laughter helps in laughter synthesis.
Laughter sounds have been classified in different ways in different studies.In [212], laughter was classified into three classes: 1) spontaneous laughter; 2) voluntary laughter; and 3) speaking or singing laughter.In spontaneous laughter, there is an urge to laugh without restraining its expression.Voluntary laughter is a kind of fake laughter to produce a sound pattern that is similar to that in natural laughter.The laughter in speaking/singing is not based on forced breathing but on well-dosed air supply, which results in breathiness and aspiration.The continuum from speech to laughter was divided into three categories [213], [214]: speech, speech-laughter, and laughter.The duration of vocalization was observed to increase in speech-laughter.This is likely due to changes in one/more features of vowel elongation, pitch, breathiness, and syllabic pulsation [213].Voiced laughter was shown to induce a significantly more positive emotional response in listeners compared to unvoiced laughter [215].
In [216], laughter analysis was carried out using features, such as F0, time duration, root mean square amplitude, and formant frequencies.It was observed that laughter has significantly longer unvoiced regions compared to voiced regions.The mean F0 of laughter sounds was reported to be 472 Hz for (Italian and German) female speakers, and the F0 values ranged between 246 and 1007 Hz [210], [217].The average F0 of normal speech sounds was reported to be 214 and 124 Hz for female and male speakers, respectively.A group of acoustic features, including F0, the number of calls per bout, formant clusters (F1 versus F2), and spectrograms, were investigated in [218] to analyze temporal features of laughter, their production modes, and source-filter effects.Their study proposed a subclassification of F0 contours in each laughter call into rising, falling, flat, sinusoidal, and arched.The acoustic features of laughter-speech continuum, such as the pitch range, voice quality, and formant space, were studied in [214].Two specific acoustic features (the rhythm and the change in F0) of the laughter series were investigated in [219].In [211], combinations of several features (pitch, energy, voicing features, modulation spectrum, and PLP features) were used to model laughter and speech.The voice source characteristics were investigated using the OQ along with spectral tilt in [214].Voice source features, including the instantaneous pitch period, the SoE at glottal closure, and their slopes and ratio, were used for the analysis of laughter in [218] and [221].

D. Study of Pathological Voices
Excitation information of speech is also used in studying pathological voices.Voice pathologies are disorders in which the phonation process in the larynx is disturbed due to, for example, dysphonia, polyps, and vocal nodules [221], [222].Voice disorders are complex, and they often do not have a single etiology [221].Voice pathologies arise due to infections, psychogenic, and physiological causes, and due to vocal misuse, which is prevalent in professions, such as teaching, singing, and client service representatives [223], [224].Change in voice from normal to pathological may indicate early neurodegenerative disease, such as Parkinson's disease [225].The utilization of excitation information of speech has attracted increasing interest in the area of speech-based detection of neurodegenerative diseases (discussed in Section VI-B).Automatic detection of voice pathology is important because it enables early intervention for the diagnosis.
The features used in investigating pathological voices can be generally classified into the following three categories [226], [227]: 1) perturbation measures; 2) spectral and cepstral measures; and 3) complexity measures.The perturbation measures aim to capture the presence of aperiodicity and aspiration noise in the voice signals that occur due to irregular movements of the vocal folds and incomplete glottal closure.The most widely used parameters in this category are jitter, shimmer, HNR, normalized noise entropy, and GNE ratio [228]- [240].The popular features in the category of spectrum/cepstrum measures are MFCCs [227], [241]- [243].In addition, LPCCs [229], [244], [245] and PLP [227], [246] have been used in voice pathology detection.The complexity measures have been proposed to capture nonlinearity and nonstationarity of voice signals using estimators based on nonlinear dynamic analysis [231], [247]- [252].The popular features in this category are computed using the fractal and correlation dimension [246], [247], [253]- [255].More details on the features used for pathology detection can be found in recent review articles [227], [256].
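As a concrete example of the perturbation measures listed above, the following sketch computes local jitter and local shimmer as the mean absolute difference between consecutive cycles divided by the mean value. These are the commonly used "local" definitions. The function names are hypothetical, and the pitch periods and cycle peak amplitudes are assumed to have been extracted beforehand (e.g., from detected GCIs).

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean."""
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods):
    """Cycle-to-cycle variation of the pitch period (as a fraction of the mean period)."""
    return relative_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (as a fraction of the mean amplitude)."""
    return relative_perturbation(amplitudes)
```

A perfectly periodic voice gives zero for both measures, whereas periods alternating between 9 and 11 ms give a local jitter of 0.2 (20%), far above what is typical for healthy sustained phonation.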
Since voice pathologies may affect different parts of the speech production mechanism, both the vocal tract system and the glottal source need to be parameterized for the analysis and detection. Existing studies have captured the characteristics of the vocal tract effectively by utilizing spectral and cepstral features (such as MFCCs and PLPs). However, there is less research in the analysis and detection of voice pathologies using glottal source features. Recently, a systematic analysis of glottal source features in normal and pathological voices was carried out in [24]. In that study, the glottal source features were derived from the ZFF signal and the glottal flow waveform estimated using the QCP method [59]. The features derived from the ZFF signal consisted of the SoE, EoE, loudness measure, and ZFF signal energy [34], [181], [182]. The glottal flow signals estimated using QCP were parameterized in terms of time- and frequency-domain glottal features [30], [257]. In addition to these, features derived directly from speech signals that capture the specific property of the glottal source were also studied. These features were the CPP [176], PS [258], MDQ [164], and Rd shape parameter [259], [260]. Furthermore, MFCCs derived from the glottal source waveforms were shown to be effective for voice pathology detection. In [26], [262], and [263], it was shown that glottal source features were useful in the automatic detection of dysarthria and also in the assessment of intelligibility in speakers with dysarthria. In [263], glottal parameters computed by GIF were used to identify pathophysiological phonatory mechanisms for phonotraumatic and nonphonotraumatic vocal hyperfunction. In [264], detection of pathological voices caused by vocal nodules was investigated using several glottal parameters and a classifier based on a genetic algorithm. In [265], automatic detection of voice pathology was studied by using a random forest classifier and including several voice disorders, both functional and organic pathologies. The study compared glottal flow features with the widely used openSMILE feature set [266]. The results indicated that the best detection accuracy was obtained by combining glottal features with the openSMILE features. Similar results have been obtained in recent investigations on automatic speech-based detection of diseases, such as heart disease [267] and specific language impairment [268].

VI. RECENT TRENDS IN EXTRACTION AND UTILIZATION OF EXCITATION INFORMATION
This section describes recent developments in the extraction and utilization of excitation information. The section addresses the issue in two parts by first describing the use of deep learning for GIF and extraction of F0 and GCI. In the second part, the utilization of the excitation information in a popular health topic, the automatic detection of neurodegenerative diseases from speech signals, is discussed.

A. Deep Learning for GIF and for Extraction of F0 and GCI
Inspired by the success of deep learning in many areas of speech technology, the extraction of excitation information has been recently studied using approaches based on deep learning both in GIF and in the detection of F0 and GCI. It is known that signal processing-based GIF methods are affected by distortions in the speech signal due to ambient noise, the poor audio quality of the recording equipment, and compression and bandwidth limitation caused by speech transmission [30], [269]. To address this issue, a few recent studies [269]-[271] have proposed using DNN-based methods for estimation of the glottal source waveform. In [269], coded telephone quality speech was studied using a DNN-based GIF method by using both clean and coded speech in training. DNN was used to map the speech features (line spectral frequencies) extracted from the coded speech to the time-domain glottal flow waveforms estimated from the corresponding clean speech. The glottal flow estimated from clean speech (using an existing signal processing-based GIF method and the QCP method) was used to train the DNN. It was observed that the DNN-based GIF method showed good performance in the estimation of glottal flows under the coded condition for both high- and low-pitched vowels.
As described in Section IV-B, the existing F0 extraction methods are based on handcrafted signal processing frameworks working in the time-domain and/or frequency-domain. These signal processing approaches are known to be prone to pitch doubling/halving errors. In [102] and [273]-[275], machine learning models for F0 extraction were proposed. The method proposed in [272] first extracts spectral domain features (the normalized log-frequency power spectrogram) and then adopts a neural network to compute the F0 estimate. To capture the variation of F0, RNNs were explored. Specifically, the authors investigated both DNN- and RNN-based methods to produce reasonably accurate probabilistic outputs for pitch. From the pitch probability in each frame, a Viterbi decoding algorithm was used to derive a continuous pitch contour. By removing the feature extraction and Viterbi decoding modules, mapping the raw waveform directly to the states corresponding to F0 was proposed in [275]. In [276], the CREPE model, which is an end-to-end CNN that uses the raw waveform, was proposed. The network is trained in a supervised fashion by minimizing the cross-entropy loss between the output of the model and the ground-truth pitch. In [102], voicing detection was posed as a classification problem and pitch estimation as a regression problem. For both tasks, various acoustic features and traditional machine learning methods were used. In [277], vocoder-based modifications for speech data augmentation for neural network estimation (such as CREPE) of F0 were explored.
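The Viterbi decoding step mentioned above, which turns frame-wise pitch probabilities into a continuous contour, can be sketched as follows. This is a generic illustration rather than the decoder of any cited system: the function name, the linear penalty on jumps between pitch bins, and the probability floor are assumptions made for the example.

```python
import math


def viterbi_pitch_track(frame_probs, jump_penalty=1.0):
    """Find the pitch-bin sequence that maximizes the summed log-probability
    minus jump_penalty * |bin change| between consecutive frames."""
    n_states = len(frame_probs[0])
    log_p = [[math.log(max(p, 1e-12)) for p in frame] for frame in frame_probs]
    score = log_p[0][:]
    backpointers = []
    for t in range(1, len(log_p)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor for state j under the assumed transition cost.
            best_i = max(range(n_states),
                         key=lambda i: score[i] - jump_penalty * abs(i - j))
            new_score.append(score[best_i] - jump_penalty * abs(best_i - j) + log_p[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With three pitch bins and a single frame in which a neighboring bin is slightly more probable (0.5 versus 0.4), frame-wise maximization would produce an isolated jump, whereas the decoder with a penalty of 1.0 keeps the contour in the same bin throughout. The penalty controls the trade-off between following the network output and enforcing continuity.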
As described in Section IV-C, existing robust GCI detection methods use a two-stage approach. The initial stage involves the transformation of speech into a representative excitation signal (such as an LP residual), where GCIs can be localized better. The later stage involves the detection of locations of the GCIs. The initial stage uses signal processing approaches based on, for example, the source-filter model of speech production, and the later stage adopts algorithms, such as peak picking and dynamic programming. Recent developments in the area of data-driven representation learning have shown that it is possible to operate directly on the raw speech signal, and let the learning algorithm learn the abstract representations of the underlying task. As an example of this kind of approach, CNNs were utilized in [278] for the GCI detection by operating on low-pass filtered speech and regarding the negative peaks of the filtered signal as the correct GCIs. In [279], the GCI detection was posed as a temporal event detection problem, relaxing the constraints used in [278]. In [279] and [280], the GCI detection was formulated using a representation learning perspective, where an appropriate representation is implicitly learned from the raw signal. In [281] and [282], a deep CNN-based GCI detection method was proposed by fusing raw speech and LP residual features. In [283] and [284], classification-based data-driven algorithms were studied for the GCI detection, using conventional machine learning methods, such as SVMs, extremely randomized trees, k-nearest neighbors, and MLP with handcrafted features extracted from speech. In these studies, the problem was viewed as a two-class classification problem, where a peak in the speech signal could either correspond or not correspond to GCI. The handcrafted features are peak-based features comprising the amplitudes of the negative peak and the neighboring negative peaks, the time difference between the negative peak and each of the neighboring negative peaks, the amplitudes of the neighboring positive peaks, the width of the negative peak and each of the neighboring negative peaks, and the correlation of the signal around each of the neighboring negative peaks. In [285], features, such as voiced/unvoiced, harmonic/noise, and spectral features, were added to the handcrafted features for improving the performance of GCI detection.

B. Utilization of Excitation Information for Detection of Neurodegenerative Diseases
Neurodegenerative diseases, particularly Parkinson's disease and Alzheimer's disease, are becoming increasingly prevalent globally due to the aging of the population.The early detection of these diseases is essential, and speech provides an effective means of biomarking these diseases at an early stage of the disease's progress.Speech-based detection of neurodegenerative diseases has attracted increasing interest as an automatic, low-cost, and easy-to-administer method [231], [286].The detection methods proposed can be divided into traditional pipeline systems and modern end-to-end systems.In the former, selected handcrafted features are computed from speech to train classifiers (such as SVMs) to predict one of the two labels (disordered versus healthy).Many different speech features have been used in these studies.In the detection of PD, speech has been parameterized with handcrafted features based on articulation, phonation, and prosody [287]- [289].In the end-to-end systems, the use of handcrafted features is replaced by training deep learning networks that directly map the raw speech signal waveform (or its spectrogram) to the output labels (disordered versus healthy).Deep learning models, such as CNNs, MLPs, and LSTM [288], [290]- [292], for example, have been used for this purpose.
Since neurodegenerative diseases affect phonation, obtaining parameters based on speech excitation information is a justified approach to building traditional pipeline systems for the detection of neurodegenerative diseases from speech signals. A few recent studies [27], [293] have investigated the use of speech excitation information in the detection of PD with the traditional pipeline approach by estimating the glottal flow using the IAIF method (as described in Section IV-A1) and by training SVM classifiers using the computed parameters. These studies indicated that glottal parameters carry useful information to improve detection accuracy. In [294], excitation information was studied in PD by first estimating the glottal flow from speech using GIF, after which parameters of a biomedical two-mass model were determined by fitting the glottal flow spectrum to the model. The study showed that the biomedical model can be used to measure the instability of phonation, and that the resulting features are good biomarkers of PD. Some recent investigations have studied the use of time-domain excitation information to build end-to-end systems for the detection task. In this approach, voice excitation information is represented by the estimated glottal flow waveform, which is then used as input to a deep learning-based end-to-end system. There are two justifications for studying this kind of end-to-end system for the detection of neurodegenerative diseases. First, the glottal flow captures the phonation information, which is known to be affected by neurodegenerative diseases [287]-[289]. Second, compared to the speech signal, which is the default input in most end-to-end detection systems, the glottal flow is a more basic signal due to the absence of vocal tract resonances. Using such time-domain signals, deep learning systems can be trained with smaller amounts of training data, as indicated in [295]. This is particularly useful because long voice recordings cannot be obtained easily from patients. Such end-to-end systems were recently studied in the detection of voice pathologies [25]. The data for this study included voice pathologies caused by different diseases, including the neurodegenerative disease ALS. The study indicated improvements in detection accuracy when the glottal flow, instead of the speech signal, was used as input to deep learning-based classifiers. Similar results were recently reported in [296] for the detection of PD.
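The remark that the glottal flow is a more basic signal than speech, because it lacks vocal tract resonances, can be illustrated with a minimal sketch of inverse filtering. The sketch rests on strong simplifying assumptions that are not taken from the studies cited above: the excitation is an idealized impulse train rather than a realistic glottal flow, the vocal tract is a single second-order resonance, lip radiation is ignored, and the filter coefficients are known exactly instead of being estimated from speech (as a method such as IAIF would have to do). All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def all_pole_filter(excitation, a):
    """Synthesis with 1/A(z): s[n] = e[n] - sum_k a[k] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out.append(acc)
    return out


def inverse_filter(speech, a):
    """Analysis with A(z): e[n] = s[n] + sum_k a[k] * s[n-k]."""
    out = []
    for n, s in enumerate(speech):
        acc = s
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * speech[n - k]
        out.append(acc)
    return out


def resonance_coeffs(freq_hz, bandwidth_hz, fs):
    """Coefficients [a1, a2] of A(z) = 1 + a1 z^-1 + a2 z^-2 for one resonance."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius set by bandwidth
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle set by frequency
    return [-2.0 * r * math.cos(theta), r * r]


# Illustrative values only (assumptions, not from the reviewed studies).
fs = 8000                      # sampling rate in Hz
period = 80                    # samples per cycle, i.e. F0 = 100 Hz
excitation = [1.0 if n % period == 0 else 0.0 for n in range(400)]

a = resonance_coeffs(700.0, 100.0, fs)    # one formant-like resonance
speech = all_pole_filter(excitation, a)   # "speech": excitation plus ringing
recovered = inverse_filter(speech, a)     # resonance cancelled again

# Largest sample-wise deviation between the true and recovered excitation.
max_error = max(abs(x - y) for x, y in zip(excitation, recovered))
```

In this toy setting the recovered signal matches the excitation up to floating-point rounding, whereas the synthesized signal rings between the pulses. With real speech, the vocal tract filter is unknown and time-varying, so the quality of the estimated glottal flow depends on how well that filter is estimated.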

VII. CONCLUSION AND DIRECTIONS FOR FUTURE RESEARCH
In this article, a review was provided on the extraction and utilization of the excitation information of speech signals. First, the motivation of the topic was explained. Second, the functioning of the human speech production mechanism was briefly described. Third, the extraction of the main components of excitation information was presented by describing the GIF-based estimation of the glottal flow, the underlying excitation information parameters, and the extraction of F0, GCI, and GOI. Fourth, the utilization of excitation information in various speech processing tasks was discussed, including the analysis and classification of phonation type, the study of emotional speech, the study of nonverbal laughter sounds, and the study of pathological voices. Finally, recent trends of the review topic were discussed by addressing two issues: the utilization of deep learning in GIF and in the extraction of F0 and GCI, and the utilization of excitation information in studying neurodegenerative diseases.
Even though the fundamental theory underlying the review topic, that is, the linear source-filter theory of speech production [297], [298], has been known for more than five decades, the technologies discussed in the review are still topical, and the utilization of speech excitation information has attracted increasing interest in a few areas in recent years. One such area is speech-based biomarking of the state of health, especially the automatic detection and classification of neurodegenerative diseases. This research topic has gained momentum due to the aging of the population, a recognized global grand challenge. In the area of speech-based classification of neurodegenerative diseases, the traditional model-driven systems consisting of separate feature and classification stages are increasingly being replaced with data-driven end-to-end systems based on deep learning. The end-to-end approach is attractive because it enables building health monitoring systems that do not need any domain expertise in the system training phase. It can, however, be argued that, when the traditional approach is used together with effective speech excitation parameters (e.g., those discussed in Section IV), the analysis benefits from its better capability to demonstrate which particular functions of the speech production mechanism have been affected by the underlying disease. Clinicians and speech-language pathologists can readily take advantage of this demonstration capability of traditional speech excitation features. Even though the end-to-end approach has shown better accuracy than the traditional, feature-based approach in a few studies [288], [290], [292], the end-to-end technology can be criticized for providing a black-box type of solution with poor interpretability for the detection task [299]. Moreover, the end-to-end approach requires larger amounts of training data than the traditional feature-based pipeline approach. Collecting large amounts of speech data from patient populations is not as easy as it is from healthy speakers.
In addition to the health-related research area described above, we argue that the methods to extract excitation information from acoustic speech signals discussed in this review can be used to improve our knowledge of human speech production, particularly when these methods are used jointly with the latest imaging technologies and physical modeling approaches of voice production. In this area, we emphasize, particularly, the recent progress in HSV (e.g., [81]) and GIF (e.g., [59] and [67]), which, in principle, enables obtaining glottal area and glottal flow signals with good spatial and time resolutions from natural voice production, not only for isolated vowel sounds but also for continuous speech. Information extracted jointly by HSV and GIF can be used both to acquire new fundamental research knowledge about the human speech production process and to compute parameter values for physical models of voice production.
The review shows that, although many methods have been developed over the past few decades to extract excitation information from speech, the development of new methods is still continuing, and new research is needed to tackle known limitations in current methods. One such limitation is related to the extraction of GCI, where the performance of the state-of-the-art methods is good but is limited by requirements such as the computation of the average pitch period and the use of block processing. The resulting limitation in GCI extraction performance is severe, particularly in the analysis of expressive voices, due to rapid variations in F0 and source-filter coupling. In addition, improved robustness is needed in GCI extraction methods to enable their utilization in realistic environments with noise and reverberation. The second topic that calls for new research is the extraction of GOI. The performance of existing GOI extraction methods is poor because the glottal opening is a relatively slow phenomenon (compared to glottal closing), and therefore, it manifests itself weakly in the amplitude characteristics of the speech signal. Hence, a more fine-grained detection of excitation components within a glottal cycle (including instants of secondary excitation near the glottal opening) is needed because they contribute, in addition to the major excitation at the instant of glottal closure, to the production and perception characteristics of speech signals. Moreover, improved robustness of GIF analysis to noise and other nonideal recording conditions is still needed, even though it has been shown recently in [270] and [296] that conducting inverse filtering with DNNs helps to improve the robustness of GIF. To improve robustness further, deep learning architectures other than DNNs, such as CNNs and LSTMs, could be studied as computational inverse networks of the vocal tract. Furthermore, features that better reflect the physical functioning of the vocal folds in the production of, for example, pathological speech or different vocal emotions need to be developed further to enhance speech analysis and classification, as well as the general understanding of human speech production.

Fig. 1. Demonstration of the production of voiced speech in two phonation types: normal (left) and breathy (right). The upper part of the

Fig. 4. Segment of (a) EGG signal and (b) corresponding dEGG signal. Four parts of the glottal cycle are defined as follows: the closing phase (from t1 to t3), the CP (from t3 to t4), the opening 4(b)]. The vocal folds are apart during the open phase (between t6 and t7).

Fig. 5. Visualization of the closing and opening phases of the glottal cycle by simultaneous electroglottographic and high-speed

Fig. 6. Computation of time-based and amplitude-based parameters from (a) glottal pulse and (b) its first time-derivative.

Table 1. Topics Addressed in This Article

Table 2. Trend in Spectral Features of Emotional Utterances With Respect to Neutral State Utterance (Increase: ↑ and Decrease: ↓)

Table 3. Trend in Prosody Features of Emotional Utterances With Respect to Neutral State Utterance (Increase: ↑ and Decrease: ↓)