A Framework for Biomarkers of COVID-19 Based on Coordination of Speech-Production Subsystems

Goal: We propose a speech modeling and signal-processing framework to detect and track COVID-19 through asymptomatic and symptomatic stages. Methods: The approach is based on complexity of neuromotor coordination across speech subsystems involved in respiration, phonation and articulation, motivated by the distinct nature of COVID-19 involving lower (i.e., bronchial, diaphragm, lower tracheal) versus upper (i.e., laryngeal, pharyngeal, oral and nasal) respiratory tract inflammation, as well as by the growing evidence of the virus’ neurological manifestations. Preliminary results: An exploratory study with audio interviews of five subjects provides Cohen's d effect sizes between pre-COVID-19 (pre-exposure) and post-COVID-19 (after positive diagnosis but presumed asymptomatic) using: coordination of respiration (as measured through acoustic waveform amplitude) and laryngeal motion (fundamental frequency and cepstral peak prominence), and coordination of laryngeal and articulatory (formant center frequencies) motion. Conclusions: While there is a strong subject-dependence, the group-level morphology of effect sizes indicates a reduced complexity of subsystem coordination. Validation is needed with larger more controlled datasets and to address confounding influences such as different recording conditions, unbalanced data quantities, and changes in underlying vocal status from pre-to-post time recordings.


INTRODUCTION
The Supplementary Material section provides the following expansions of the main body topics: (1) more detailed description of the physiological motivation for the coordination model of Fig. 1 and Fig. 2; (2) more details of our standard lowlevel feature extraction, as well as introducing other vocal source features such as harmonic-to-noise ratio, vocal creak, and glottal open quotient; (3) effect sizes of summary statistics of the low-level features across the pre-and post-COVID- 19 conditions as a comparative reference to the high-level feature effect sizes; (4) further description of the subject-and sessiondependent environmental conditions; (5) more detailed description of the correlation methodology; and (6) expanded algorithm descriptions and software references to expedite use by others in the field.

A. Dataset Characterization
As introduced in the main body, audio data for five subjects, pre-COVID-19 (before exposure) and post-COVID-19 (after positive but asymptomatic), was obtained from YouTube, Instagram, and Twitter sources. The recordings were taken from press conferences, TV shows, and TV interviews all with celebrities or broadcast hosts, typically using high-quality recording facilities, but in varying environmental conditions. Generally, recordings occur in multiple environments within a pre-or post-stage. For each recording session, the data was segmented manually from the videos excluding secondary speakers such as interviewers or interviewees and other interferences. Table 1 provides for each of the five subjects the number of pre-and post-diagnosis segments and average segment durations, and approximate pre-and post-diagnosis recording times. Although the range of pre-diagnosis times (days-to-years) is much greater than post-diagnosis times (days), because the subjects are all young-to mid-aged adults, voice change due to development or aging is not a concern. 1  Nevertheless, there is always the possibility of a pathological state arising, e.g., nodules or polyps, over a long time stretch.   Table 2 provides a characterization of environmental conditions with respect to signal-to-noise ratio (SNR). Here we approximated SNR as the average signal energy during pauses (non-speech) divided by the average energy during speech in decibels (dB). Non-speech versus speech intervals were obtained using a standard speech activity detector [1]. On average there is a consistency across pre-and post-conditions per subject (about 1 dB difference) while the range of SNR from roughly 18 dB to 10 dB is on the order of noise levels we have encountered in using coordination measures effectively in tracking a variety of neurocognitive conditions [2] [3]. (We note that all videos were mp3 encoded.) A stronger environmental effect is that of reverberation. Reverberation has proven more difficult to measure quantitatively, but the effect seems to stand out perceptually in some segments, for example, one involving a hall and a few others in home environments during quarantine. We return to the effect of reverberation in Section II.C.

Supplementary Materials
It is also important to emphasize that due to the nature of the online video source (YouTube, Instagram, and Twitter), it is natural to consider the emotional context (e.g., the subjects' mood or stress level) that could confound the results. The subjects considered, however, routinely appear at press conferences and TV shows and interviews, so elevated heart rate and other confounders should be normalized, e.g., this isn't their first time in front of a camera, thus it is probably a consistent response for them. Nevertheless, there is the possibility of mood change such as due to self-knowledge of the COVID-19 diagnosis in the post-versus the pre-state. This is a consideration inherent in any longitudinal speech data collection.

B. Subsystem model
In the main Letter body, we presented a speech signal processing model that was physiologically motivated (Repeated here in Fig. 1.) In this model, coordination can occur both between the respiratory and laryngeal subsystems, and between the laryngeal and articulatory systems. It is hypothesized that this coordination across subsystems can change in the presence of COVID-19 either due to physiological insult to the respiratory system or to impaired neurological control of these subsystems.
In this model, the 'intensity' of the airflow (velocity), that we referred to as respiratory intensity, governs time-varying loudness, and is coupled (coordinated) with phonation, i.e., the rate and nature of vibration of the vocal folds (pitch), stability of phonation, and aspiration at the folds [4]. The estimation of low-level features associated with each subsystem, and used as a basis for the high-level coordination features, is discussed in next section. Stepping back momentarily for a deeper look into the model components, we see that respiratory intensity is modeled as an envelope that amplitude-modulates the airflow (induced by pressure in the lungs) that occurs at the vocal folds within the larynx. The airflow at the folds is either quasi-periodic or aspirated due to turbulence at the folds, amplitude-modulated by the respiratory envelope and is further modified by the vocal tract. Our assumption is that we can estimate the respiratory intensity as the envelope of the speech signal measured at the lip output..

C. Low-level feature extraction
We have found in previous work that the speech envelope is dominated by the respiratory intensity, but also contains a secondary resonance-harmonic interaction component that can occur as harmonics scan vocal tract resonances with time-varying pitch [5] [6]. Nevertheless, the manner in which we extract the speech envelope yields a signal that approximately follows the speech signal's time-varying loudness [5] [6] and is used as a proxy for respiration. Specifcially, the envelope was extracted using a custom MATLAB script that provides a smooth contour of amplitude peaks based on an iterative timedomain signal envelope estimation [5] [6].
For the laryngeal subsystem, we compute fundamental frequency ('pitch') of the vocal fold vibration with an autocorrelation-based approach [7] [8]. In addition, we compute many standard low-level features used in the speech community that characterize the nature of the vocal source. One such feature, cepstral peak prominence (CPP), captures the stability of periodicity in the vocal-fold vibration. The CPP is defined as the difference in dB between the magnitude of the highest peak and the noise floor in the power cepstrum which is the result of taking the inverse Fourier transform of the logarithm of the estimated spectrum of a signal [9]. Empirically, the CPP has been found to be robust in being used as an important measure of voice pathology and has been used effectively in neurocognitive applications [9][10] [11]. Because the respiratory and laryngeal systems are coupled, the CPP can potentially capture changes in the degree of aperiodicity and aspiration at the vocal folds, that can change with insult to the respiratory system by COVID19.
Although possibly a less robust measure in the presence of a noise background, another more direct approach to getting at aspiration is through the harmonics-to-noise (HNR) ratio which is the ratio, in decibels (dB), of the power of the harmonic (periodic) signal from vocal-fold vibration and the power of the speech noise signal at the vocal folds created by turbulence as air rushes past the vocal folds from the lungs [12]. HNR is thought to reflect breathiness in a voice. Yet another vocal source measure is termed vocal creak, corresponding to what is often referred to as a 'creaky voice', which reflects large irregularity in pitch periods (typically with low average pitch) and high peakiness of airflow pulses that excite the vocal tract [13] [14]. The value here is given as a creak probability. We have also have begun to explore the Glottal Open Quotient (GOQ) which is the ratio of the time duration over which the folds are open relative to the full glottal cycle. A larger GOQ often results in more turbulent airflow at the folds. Our fundamental frequency [8], CPP [15], and HNR [8] measures are obtained from the open source software package Praat [7] and the GOQ with the open source tool VOICEBOX [16]. Finally, our low-level features also aim to reflect the temporal dynamics of the articulatory component. As a proxy for articulation, a primary feature set is comprised of the vocal tract resonances, specifically the formant center frequencies, estimated by a Kalman filter technique, tracking the first three formant frequencies while also smoothly coasting through nonspeech regions. Here we use the open-source software package KARMA [1] to compute the first three formant frequencies (F1, F2, F3). Because of the embedded Kalman filtering, this method has proven robust in many applications and compares well against other methods of formant estimation [17].

D. High-level feature extraction
Our high-level features, derived from low-level features, are designed to capture coordination of the temporal dynamics of speech production subsystems at different time scales for the two coordination points illustrated in the model of Fig. 1. This approach has been used for characterization of psychomotor slowing in depression, cognitive load and fatigue, and a variety of other neuromotor conditions [2] [3].
Multivariate auto-and cross-correlations are computed within and across the underlying speech subsystems. We refer to this as correlation structure. Specifically, time-delay embedding is used to expand the dimensionality of the feature time series, resulting in a correlation matrix comprising auto-and crosscorrelation patterns that represents coupling strengths across feature channels at multiple time delays. The eigenspectrum of the correlation matrix quantifies and summarizes the frequency properties of the set of feature trajectories. Higher complexity across multiple channels is reflected in a more uniform distribution of eigenvalues, with lower complexity reflected in a larger proportion of the overall signal variability being concentrated in a small number of eigenvalues, often with the high-rank eigenvalues being lower in amplitude.
For each speech segment, correlation matrices are calculated from various combinations of feature trajectories from the different speech subsystems. Each matrix contains the correlation coefficients between the time-series at specific time delays to create the embedding space. Each matrix is computed at a specific delay scale, i.e., time-sampling interval (e.g., 10, 30, 70, or 150 ms), with 15 time-delays used per scale. The delay scales allow for characterization of coupling of signals at different temporal resolutions. In the current work, eigenspectra were computed with a 10 ms delay scale and thus a relatively fine time resolution. Each matrix comparing n signals has a dimension of (n*15 x n*15). For all correlations, an automatic masking technique was used to include only speech within a segment, i.e., no pauses. The pipeline for the methodology is illustrated in Fig. 2 with the example of the generation of a formant frequency track correlation matrix based on the first 3 formants. Further mathematical details of this approach can be found in [2][3]. Fig. 2: Signal processing and feature extraction pipeline for eigenspectrum feature calculation. A caricature of effect sizes for three different states is shown: "healthy" or baseline (x-axis), high complexity (excessive mode independence or "erratic"), and low complexity (too little mode independence or "overly coupled").

E. Effect size calculation
In understanding the relative importance of a feature set, we compute the Cohen's d effect size as the difference of the preand post-condition feature means normalized by the average of the standard deviations: where: M is the mean of each group The right-bottom panel of Fig. 2 provides a caricature of eigenspectral effect sizes for three different states: 'healthy' or baseline (x-axis), high complexity (excessive mode independence or 'erratic'), and low complexity (too little mode independence or 'overly coupled').
For the five subjects, independently and combined, the Cohen's d effect sizes pre-versus post-COVID-19 were computed based on auto-and cross-correlation for a variety of vocal features. These measures are obtained from low-level respiration intensity, fundamental frequency, cepstral peak prominence, and formant center frequencies. Eigenvalues were computed for each audio segment. The effect sizes are then compared between the pre-segments and the post-segments, and the total number of segments depends on the comparison being made. The group level, i.e., combined subjects, averages out differences within an individual and reveals trends at a higher level that is likely more robust than for any one subject. There are obviously individual differences depending on recording and also day-to-day variance, but the group comparison aims to get at a higher-level pattern.

II. RESULTS
In this section, effect sizes are computed first for low-level (univariate) features as a comparative reference for the high- We then proceed to expand on coordination results of the main Letter, investigating (1) effect sizes associated with the (auto)correlation structure of respiratory intensity (via the speech envelope) and (2) a glimpse into environmental effects by looking at signal-to-noise ratio (SNR) for a subset of subjects. Tables 3 and 4 provide mean and standard deviation summary statistics of the conventional univariate low-level features associated with phonatory and articulatory vocal subsystems. Statistics are shown for differences across the pre-and post-COVID-19 state and are denoted by ** for p values less than 0.05.  For the laryngeal-based variables, we see a lowering of average pitch perhaps consistent with heavier, less pliant vocal folds due to inflammation, while the smaller CPP and larger creak probability with the post-condition may be associated with a more erratic vocal-fold vibration. A lower HNR may be associated with increased aspiration in the speech signal due perhaps to atypical closing of the folds. However, the glottal open quotient (GOQ) (notoriously most difficult to measure especially with background noise), is not consistent with an atypical closing of the folds and barely changes, not showing significance (p>0.05). The changes in F1 and F2 mean and standard deviation have no intuitive interpretation.

B. Correlation structure of respiration
As with correlation structure across vocal subsystems, we can explore the (auto)correlation structure within a single component trajectory of a subsystem. Given the importance of respiration in the COVID-19 context, Fig. 3 provides the eigenspectral-based effect size associated with respiratory intensity for the group and individual subjects with high-rank eigenvalues tending lower for the post-COVID-19 cases, indicating perhaps less independence of movement.

C. Subject-dependent correlation structure
Regarding signal quality, due to the nature of the online video sources,we have noted there is a variety of inter-and intrasubject recording variability. Given that background noise levels are typically low in the given cases, the most perceptually notable effect is reverberation, possibly modifying the true effect sizes, over-or underestimating their importance. An example isolating two of the subjects with more consistent, least reverberant environments (as determined perceptually), shown in Fig. 4, enhances the combined effect sizes relative to  the N=5 case. Specifically, for the combined two selected subjects, coordination measures involving respiratory intensity show a stronger reduction in complexity of coordination, while coordination of fundamental frequency with articulation shows an increased variability relative to the full N=5 case.

III. DISCUSSION AND CONCLUSIONS
This exploratory study has provided a framework to investigate coordination-based vocal biomarkers for early warning and tracking of COVID-19. Our univariate summary statistics show generally small but interpretable effect sizes for mean and/or standard deviation that may be consistent with the pre-versus post-COVID-19 conditions. Such features could be used in conjunction with coordination-based features.
Our effect sizes associated with coordination of respiratory intensity (as measured through the speech envelope) with laryngeal activitiy indicate lower complexity and thus a breathing pattern and laryngeal coordination with less variability than typical during speaking. On the other hand, coordination of laryngeal and articulatory activity revealed the opposite effect with apparently larger variability. Although interpretations of some of these differences may be consistent with the effect of inflammation in COVID-19 and reported neurological motor deficits, it is too early to draw conclusions given the limited sample size and lack of physiological ground truth. It is interesting to note, however, that at a group level reduced complexity in coordination based on similar correlation-structure features has been observed within various vocal subsystems for neuromotor conditions of Parkinson's disease, ALS, dementia, and TBI [2] while greater variability of complexity has been observed in conditions of cognitive fatigue [18] and hypoxia [19].
At a fundmental level, there needs to be a physiological interpretation and meaning attached to our features. It's important to stress that we are using formant center frequencies as a proxy for articulation and loudness as a proxy for respiratory intensity and so it will be important in future work to connect with these motor functions more explicitly. For example, we are currrently using a method of waveform-toarticulatory inversion developed in the context of Major Depressive Disorder (MDD) to measure coordination of articulatory trajectiories (key points of movement of the tongue, lips, and jaw). Here, we are finding similar but enhanced effect size morphologies relative to formant coordination [20]. Note that GOQ, albeit a sensitive measure, is also a means to get at the physiological level, e.g., glottal airflow patterns, through inverse filtering methods. Nevertheless, ideally, we seek independent physiological measures of, for example, breathing patterns, lung capacity, and nasal obstruction that provide ground truth against which the analysis data can be compared.
Regarding signal quality, due to the nature of the online video sources, in addition to the subject's mood and stress levels alluded to earlier, other effects such as the acoustic environment and other forms of room changes, can clearly influence results, over-or underestimating the effect size importance. For example, given the quarantine requirement after testing positive and multiple pre-and post-interviews per subject, there is a variety of inter-and intra-subject recording variability, including mood change due to self-knowledge of the COVID-19 diagnosis. Because the recordings were typically made with high-quality microphones, noise levels in the recordings were fairly well controlled. As noted, the more noticeable effect, however, is occasional reverberation. As we discussed earlier, isolating two of the subjects with more consistent, least reverberant environments enhances the combined effect sizes relative to the N=5 case.
This short note has clearly raised more questions than answers but, in facing the above challenges, the authors hope the proposed framework and exploratory results motivate others to investigate these questions with larger and more controlled datasets toward vocal biomarkers for early warning and tracking of COVID-19. An approach to biomarkers based on speech lends itself to nonintrusive widespread use through mobile devices and thus longitudinal tracking in naturalistic environments through pre-and post-diagnosis of COVID-19. If in addition to being sensitive in asymptomatic stages of the virus, the vocal biomarkers are also proven specific, especially with other nonintrusive sensing, the approach would help provide a key capability for early warning to interrupt and treat the disease before its rapid spread.