ECoNet: Estimating Everyday Conversational Network From Free-Living Audio for Mental Health Applications

Sociability impairment, such as decreased social network size and socialization, is implicated in mental health disorders. To complement the existing self-report-based assessment of sociability measures, which can be error-prone and burdensome, we propose to estimate an individual's everyday conversational network from free-living speech recordings obtained with a wearable. Our first contribution is ECoNet, an automatic method to estimate the everyday conversational network using a modular audio processing architecture. Our second contribution is using ECoNet to analyze multiday egocentric audio recordings from 32 individuals representing diverse mental health conditions (healthy controls, depressive disorders, and psychotic disorders). Specifically, we find that the conversational network size, as a sociability measure, has a significant correlation with mental health scores. For example, the correlation coefficient between network size and depression severity score was $-0.56$ ($p<0.01$). Audio-based estimation of conversational network size using ECoNet, therefore, could provide a pervasive computing solution to complement existing mental health assessment methods.

Sociability, an individual's tendency to affiliate and interact with others, can be captured from their social network comprising friends, coworkers, and relatives. [1][2][3] Characteristics of such social networks have been implicated in mental health disorders. 1,[3][4][5] Of specific interest in mental health applications are one's network size (quantity) and whether one enjoys interactions, feels judged, etc. (quality). 1,2,4,5 Social network characteristics are currently obtained with self-report questionnaires or name generators, 3 which are subjective and, thus, error-prone. In recent years, passive behavioral sensing has shown immense potential for mental health applications. 6 Passive sensing could also be used to automatically infer social network characteristics and complement self-report-based assessments. Sensor-based inference can provide more frequent and fine-grained measurements than would be practical to solicit through self-reports, where frequent prompting is burdensome.
A potential method to estimate social network characteristics is to use free-living audio recordings that capture an individual's speech-based interactions. The recorded speech could be used to estimate social network size, i.e., the number of conversational partners, and the quality of interactions (inferred from turn-taking behaviors). Behavioral markers of sociability obtained from the social network characteristics could complement other mobile-sensing-based behavioral indicators, such as sleep patterns and stress, 7 for holistic mental health assessments. Speech is already being investigated as a possible behavioral marker of mental health, e.g., with timing-related features, such as speech rate and pause time. 8  capture the diversity of daily experiences. Recent studies have demonstrated that speech sparsely sampled from free living has applications in mental health. 9 These earlier works, however, used manual speech transcription to derive behavioral features. In this work, we extend the viability of speech-based mental health assessment by proposing automatic extraction of behavioral features (social network characteristics) from continuous audio recordings obtained in free living. As an alternative to speech-based methods, RFID tags have been proposed for estimating social network characteristics for mental health applications. 10 However, RFID tag-based methods are not practically viable in most cases, as tags are required on all individuals in one's network. Thus, in this article, we focus on speech-based sensing.
Everyday conversational network: We define the everyday conversational network as the network of conversation partners appearing in egocentric audio recordings from a day. The audio recordings are expected to span multiple days, thereby providing a longitudinal view of the individual's network. Such recordings are now conveniently possible using inexpensive wristband audio recorders with a smartwatch form factor. If we could assign a speaker label to each utterance in the recorded speech, we could estimate various network properties; the assigned speaker labels need not be the real identity of a person but could just be internal labels to distinguish unique speakers. The network's size, referred to as the conversational network size in this work, is one such property that could be useful as a sociability measure in mental health applications. Network size is commonly assessed, subjectively via self-report, for its potential relation with mental health. 1,[3][4][5]
Related work: Conversational network size estimation is closely related to the audio-based speaker-counting task. 11,12 The speaker-counting task has primarily focused on estimating the number of conversation partners in a short audio recording. However, conversational network size estimation requires inferring the number of conversation partners from multiday noise-filled recordings, with potentially repeated encounters with the same speakers. An audio processing pipeline that addresses the possible audio environment changes during the day and the higher computational costs of processing long audio recordings is also required. Xu et al. 11 proposed Crowd++, a speaker-counting system using a mel-frequency cepstral coefficient vector representation of speech segments. Variations of Crowd++ using different speech representations have also been proposed. 12 The Crowd++ system was used in Xu et al.'s work 13 to assess an individual's daily interactions. However, multiple encounters with the same speaker were not tracked, and the relation of speaker count with mental health was not investigated. A task related to speaker counting is concurrent speaker count estimation, i.e., estimating the maximum number of overlapped speakers in a short audio segment (e.g., Stöter et al.'s work 14). Concurrent speaker count models, however, have no individual speaker models and, thus, cannot track speakers over time for conversational network size estimation.
Key challenges: The key challenge in estimating everyday conversational networks in a privacy-preserving manner is the absence of a priori knowledge about potentially encountered speakers, specifically, the absence of their speech samples or knowledge about when and how often they will be encountered. Even in a noiseless environment, identifying unique speakers can be challenging, e.g., when two speakers have similar voice characteristics. In our case, the challenges are compounded by free-living recording originating in arbitrarily noisy environments.
Main contributions: Our contributions in this article are twofold. First, we develop an automatic and robust estimation pipeline, labeled ECoNet, that identifies the unique speakers and the timing of their speech from a day's audio recording. Our key innovation is unsupervised speaker identification using a speaker-discriminative speech segment representation along with a machine learning model for spurious speaker detection. The key insight in developing a spurious speaker detection model is that the duration and occurrence of the spurious speakers' speech segments might be different (more random) than those of true speakers. ECoNet has a 14.5% lower speaker-count error compared with the existing speaker-counting system. ECoNet has a modular architecture and uses block-based audio processing with speaker tracking to reduce computation and manage environment heterogeneity. We leveraged diverse public datasets for training ECoNet, and no speech content is parsed, which preserves privacy. We note an important limitation of ECoNet: the assessed network does not distinguish who is talking to whom and, thus, whether the target individual (wearing the device) is involved in the conversation. We discuss the implications of this limitation in the "Interpretation" section.
Our second contribution is evaluating the potential use of conversational network size in mental health. We evaluated ECoNet in a pilot study acquiring multiday egocentric audio recordings from 32 participants comprising healthy controls and participants with depressive and psychotic disorders. Our results show a significant association between conversational network size, estimated using ECoNet, and mental health scores, such as depression severity. Thus, our findings imply that the conversational network size could serve as a complementary objective sociability measure, potentially useful for monitoring network size and engagement fluctuations in light of mental symptom changes.

ECONET: EVERYDAY CONVERSATIONAL NETWORK ESTIMATOR
In this section, we describe our proposed everyday conversational network estimator pipeline, ECoNet. We adopted a modular architecture instead of a monolithic end-to-end architecture. The modular approach allowed us to leverage existing models and available well-labeled datasets to match our application. The architectural steps are shown in Figure 1. To address the key challenges in everyday conversational network estimation, Steps 3-5 in Figure 1 depict our proposed approach to robustly detecting unique speakers in challenging audio environments.
ECoNet builds on the premise that an individual's speech can provide a unique signature of the individual. In the absence of labeled speech samples from speakers, an unsupervised clustering with speaker-discriminative speech segment representation could potentially be used to identify the speakers. Any spuriously detected speakers due to noise and speech variabilities (e.g., speaking rate, style, etc.), however, should be filtered out to correctly assess key network parameters, such as the conversational network size. Higher computational costs and environmental heterogeneity in processing daylong audio recordings should also be managed, for which we propose block-based audio processing with speaker tracking.
ECoNet has five key steps in its pipeline, i.e., speech segment detection, speaker-embedding generation, clustering, filtering spuriously detected speakers, and speaker tracking.

Step 1: Speech Segment Detection
The first step in estimating the everyday conversational network is to detect all the speech segments with a voice activity detection (VAD) model. We evaluated multiple VAD models, including recently proposed deep neural network (DNN)-based VAD models, to identify the best model for ECoNet. Liaqat et al. 15 found existing VAD models to be highly erroneous for processing free-living audio, but DNN-based VAD models were not evaluated in their work. We employed the commonly used equal error rate (EER), the error rate at which the false-negative rate equals the false-positive rate, to compare VAD models. The PyanNet-architecture-based DNN model 16,17 had the lowest EER and was selected for ECoNet. The EERs were 0.05 and 0.12 on the DIHARD evaluation dataset and an in-house validation dataset, respectively, representing a 77.2% and 43.3% improvement over the next best model (refer to Table S1 in the supplementary material for evaluation results, which is available in the IEEE Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/MPRV.2022.3155698).
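As an illustration of the EER metric used to compare VAD models, the following is a minimal sketch, assuming each audio frame has a speech score and a binary ground-truth label. The function name `equal_error_rate` and the threshold sweep over observed scores are our own illustrative choices and are not taken from the ECoNet implementation.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a detector by sweeping thresholds.

    scores: per-frame speech scores (higher means more speech-like).
    labels: per-frame ground truth, 1 for speech and 0 for nonspeech.
    Returns (eer, threshold) at the threshold where the false-negative
    rate and the false-positive rate are closest to each other.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both speech and nonspeech frames are required")

    best = None
    for thr in sorted(set(scores)):
        # A frame is predicted as speech when its score is >= thr.
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        fnr = fn / positives
        fpr = fp / negatives
        gap = abs(fnr - fpr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, (fnr + fpr) / 2, thr)
    return best[1], best[2]
```

With a finite set of frames the two rates rarely match exactly, so the sketch reports the midpoint at the nearest crossing; interpolating between thresholds is a common refinement.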

Step 2: Speaker Embeddings
Speech segments can be represented by speaker embeddings and clustered to assign a unique speaker label to segments from the same speaker. A speaker embedding is a speaker-discriminative speech representation, obtained as activations from the final layers of a speaker identification DNN model. Speech segments from the same speaker have a smaller interembedding distance than those from different speakers. We used time-delay neural network-based X-vectors 18 as the speaker embedding. X-vectors have given state-of-the-art results across speaker identification and verification tasks. 18

Step 3: Embedding Scoring and Clustering
We computed the similarity between X-vectors (embedding scoring) using a probabilistic linear discriminant analysis (PLDA) model (see Section S-I-B in the supplementary material for details, available online). The similarity scores were clustered with agglomerative hierarchical clustering (AHC). PLDA-based scoring with AHC is commonly used for speaker recognition. 18 With clustering, we assign each speech segment a speaker label corresponding to the cluster it falls in.
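The scoring-and-clustering step can be sketched as follows. This is a simplified illustration under stated assumptions: cosine similarity stands in for the PLDA scores (the PLDA model itself is not reproduced here), average linkage is assumed, and the function names and the `stop_threshold` parameter are hypothetical rather than part of ECoNet.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors (stand-in for PLDA)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def ahc_labels(embeddings, stop_threshold):
    """Average-linkage agglomerative clustering on pairwise similarities.

    Starts with one cluster per segment and repeatedly merges the most
    similar pair of clusters. Merging stops once the best between-cluster
    similarity falls below stop_threshold. Returns one label per segment.
    """
    sim = [[cosine(u, v) for v in embeddings] for u in embeddings]
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        # Mean similarity over all cross-cluster segment pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_score, (p, q) = max(
            (linkage(clusters[p], clusters[q]), (p, q))
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if best_score < stop_threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

The number of resulting clusters, and hence the number of detected speakers, depends on the stopping threshold, which would need to be tuned on labeled data.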

Step 4: Spurious Speaker Detection
Noise and speech variability in free-living audio can cause spurious speaker detection, compromising the assessment of key everyday conversational network parameters. Spurious speakers are false positives in speaker detection: a new speaker label assigned to a nonspeech segment or to a speech segment of an existing speaker. A new speaker could also be assigned to overlapped speech whose embeddings differ significantly from those of the individual speakers. Xu et al. 11 used an interspeaker distance threshold to identify spurious speakers. However, a single threshold might not represent all spurious speaker detection scenarios. Toward a more representative model, we trained a random forest model using various speech-segment-based features. We hypothesized that the spuriously detected speakers would have different (likely random) features compared with the true speakers in a conversation. The trained model had an area under the ROC curve (AUROC) of 0.96 on the DIHARD evaluation dataset, indicating good detection performance (see Figure S1 in the supplementary material, available online). With the spurious speaker-detection model, ECoNet filters falsely detected speakers before further evaluation (details of the model and speaker filtering are presented in Section S-I-C in the supplementary material, available online).

Step 5: Speaker Tracking
Steps 1-4 of ECoNet are applied to smaller audio blocks to reduce computational costs and manage environment heterogeneity. To identify the same speaker across audio blocks of a given day, we propose clustering-based speaker tracking. We obtained a speaker representation for each local speaker of a block, based on their embeddings, and clustered local speakers across the audio blocks to obtain a global speaker label (in the scope of a day) for the local speakers. We employed an averaged-embedding-based speaker representation, which gave the best clustering-based speaker tracking results on the DIHARD dataset (see Table S3 and Section S-I-F in the supplementary material for details, available online).
The global speakers identified in a day's audio recording and their associated conversation properties constitute the inferred everyday conversational network. Its size, the conversational network size, represents the number of identified global speakers. Conversational network size is normalized by the audio duration and scaled by a reference duration to obtain a normalized metric comparable across days and individuals.
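The tracking and normalization described above can be sketched as follows. Several parts are assumptions made for illustration only: greedy matching against global representatives is a simplified stand-in for the clustering-based tracking, cosine similarity stands in for the actual embedding scoring, and the `match_threshold` and the 12-hour `reference_hours` values are hypothetical and not reported in the text.

```python
import math


def mean_embedding(vectors):
    """Average the segment embeddings of one local speaker."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def track_speakers(blocks, match_threshold=0.7):
    """Map local speakers of each block to day-level global speakers.

    blocks: list of dicts, one per audio block, mapping a local speaker
            label to the list of that speaker's segment embeddings.
    Returns (mapping, num_global), where mapping[(block_index, label)]
    is the global speaker id.
    """
    global_reps = []  # one representative embedding per global speaker
    mapping = {}
    for b, block in enumerate(blocks):
        for local_label, segments in block.items():
            rep = mean_embedding(segments)
            scores = [cosine(rep, g) for g in global_reps]
            if scores and max(scores) >= match_threshold:
                gid = scores.index(max(scores))  # reuse a known speaker
            else:
                gid = len(global_reps)  # first encounter of this speaker
                global_reps.append(rep)
            mapping[(b, local_label)] = gid
    return mapping, len(global_reps)


def normalized_network_size(num_speakers, audio_hours, reference_hours=12.0):
    """Scale the speaker count to a common reference duration."""
    if audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    return num_speakers * reference_hours / audio_hours
```

For example, three global speakers found in 6 hours of audio would give a normalized size of 6 under the assumed 12-hour reference.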

Evaluation of ECoNet for Speaker-Counting Task
We evaluated the performance of ECoNet in estimating the conversational network size, which we obtain by identifying the unique speakers present in the recording; this metric is used in the "Sociability and Mental Health: A Pilot Study" section. We evaluated ECoNet for speaker counting on the DIHARD evaluation set 17 consisting of diverse audio recordings, such as from restaurants, meetings, clinical settings, and web videos. As evaluation metrics, we used the average error count distance (AECD), $\mathrm{AECD} = \frac{1}{N}\sum_{i=1}^{N}(C_i - \hat{C}_i)$, with $C_i$ and $\hat{C}_i$ the true and estimated speaker counts, and the correlation coefficient between $C_i$ and $\hat{C}_i$. ECoNet outperformed Crowd++, 11 as well as its recently proposed extension, 12 for speaker counting. AECD reduced by 14.15% and correlation improved from 0.24 to 0.70 using ECoNet compared with Crowd++ (refer to Table S2 in the supplementary material for evaluation results, available online). We also evaluated whether ECoNet could detect group differences in the number of unique speakers present. Two groups of sequences from the DIHARD evaluation set were created with a significantly different number of speakers, testing five different grouping settings for evaluation. ECoNet was able to detect the difference, whereas the other speaker-counting systems were not (see Figure S2 in the supplementary material, available online).
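The two evaluation metrics can be computed as in the sketch below. The default follows the AECD formula as printed in the text (a signed difference); the `absolute` option is our own addition for readers who prefer an unsigned error and is not something the text states. Function names are illustrative.

```python
import math


def aecd(true_counts, est_counts, absolute=False):
    """Average error count distance over N recordings.

    With absolute=False the signed difference C_i - C_hat_i is averaged,
    as in the printed formula. absolute=True averages |C_i - C_hat_i|,
    which is an assumed variant.
    """
    if len(true_counts) != len(est_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [c - c_hat for c, c_hat in zip(true_counts, est_counts)]
    if absolute:
        diffs = [abs(d) for d in diffs]
    return sum(diffs) / len(diffs)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Note that with the signed form, over- and under-counts can cancel each other, so the two variants can differ substantially on the same data.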

SOCIABILITY AND MENTAL HEALTH: A PILOT STUDY
An individual's sociability measures could be computed from their everyday conversational network. One such sociability measure, related to the individual's social network size implicated in mental illness, 1,[3][4][5] is the conversational network size . In this work, we evaluated if the conversational network size assessed using ECoNet is related to an individual's mental health with a pilot study.

Study Design
Participants were recruited via a pilot study 19 at the psychiatry clinic of Baylor College of Medicine (BCM)/Harris Health System, Houston, Texas, USA, with 32 participants consisting of 11 outpatients with depression (depression group), eight outpatients with schizophrenia/schizoaffective disorders (psychosis group), and 13 healthy controls (healthy group). The healthy group was age-matched with the depression and psychosis groups. We included a heterogeneous group in our pilot study to map sociability across disorders, aiming for sociability assessment as a transdiagnostic construct. The depression and psychosis groups were selected as these are the two most prevalent mental illnesses in the psychiatric clinics of BCM/Harris Health System. The study was approved by the IRBs at Baylor College of Medicine, Rice University, and Harris Health System. Continuous audio recordings (suggested time of 8 AM to 8 PM) were obtained from participants in free living, for up to seven days, using a wristband audio recorder (refer to Sections S-II-A and S-II-B in the supplementary material, available online, for details on participant recruitment and a summary of the obtained clinical audio dataset, respectively). The participants also reported the number of persons they interacted with, on each day of the study, in one of the ranges: 0-5, 6-10, 11-15, and > 15. In our study, the wristband device added random date offsets on some recording days. Further, the number of days with self-reports differed from the number of days with audio recordings for some subjects, e.g., when audio bands were not worn but self-reports were provided. Thus, one-to-one mapping of all the audio recordings to their corresponding self-reports was not possible.

Mental Health and Personality Scores
Several questionnaire-based mental health and personality scores of the participants were obtained, and the relation of the conversational network size with these scores was analyzed. We obtained general measures, such as anxiety and depression scores, representing some of the prototypical fluctuations manifested when patients do better or worse clinically (e.g., feeling anxious before a social interaction).
In future studies, we will also include more symptom-specific measures, such as those assessing psychotic features. We also obtained some trait-like scores (e.g., personality) to evaluate their relationship with participants' conversational/social behavior. Such evaluation could inform future studies investigating multiple factors influencing one's sociability. In particular, the following questionnaire-based scores were obtained.

Statistical Test and Significance
We assessed group differences with a t-test (or Mann-Whitney test) for variables following (or not following) a normal distribution. We applied Bonferroni correction in multiple comparisons. The statistical significance was set as follows: * for p < 0.05, ** for p < 0.01, and *** for p < 0.001, where p represents the p-value.
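The significance notation and the Bonferroni correction can be illustrated with a short sketch. It assumes the correction is applied by multiplying each p-value by the number of comparisons (equivalent to dividing the significance level); the text does not state which form was used, and the function names are illustrative.

```python
def bonferroni(p_values):
    """Bonferroni-adjust p-values for m comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def stars(p):
    """Map a p-value to the significance markers used in this article."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""
```

For instance, a raw p-value of 0.01 across three comparisons is adjusted to about 0.03 and would be marked with a single asterisk.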

Conversational Network Size and Mental Health
ECoNet was used to assess the relationship between conversational network size and mental health in the clinical audio dataset. We set the audio block duration in the block-based processing of ECoNet to 2 hours and included days with at least 5 hours of audio for analysis.
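The day selection and block segmentation described above can be sketched as follows. The 2-hour block length and 5-hour minimum come from the text; keeping a shorter final block rather than discarding or merging it is an assumption, as are the function names and the use of seconds as the unit.

```python
def eligible_days(day_durations_s, min_s=5 * 3600):
    """Keep only days with at least min_s seconds of recorded audio."""
    return {day: d for day, d in day_durations_s.items() if d >= min_s}


def split_into_blocks(duration_s, block_s=2 * 3600):
    """Split a recording into consecutive (start, end) blocks in seconds.

    The last block may be shorter than block_s (assumed handling).
    """
    blocks = []
    start = 0.0
    while start < duration_s:
        end = min(start + block_s, duration_s)
        blocks.append((start, end))
        start = end
    return blocks
```

A 5-hour day would thus be processed as two 2-hour blocks followed by one 1-hour block, with speaker tracking linking speakers across the three.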

Group Differences in the Estimated Conversational Network Size
We analyzed the conversational network size differences in three groups represented in our clinical audio dataset. The result obtained is shown in Figure 2. The depression group had a significantly smaller conversational network size compared with the healthy group. The conversational network size of the psychosis group was also smaller than the healthy group (though not significantly).

Correlation Between Conversational Network Size and Mental Health/Personality Scores
We computed the Pearson's correlation coefficient between the average conversational network size of the participants and their mental health/personality scores. The result is shown in Figure 3. As an example, the correlation between the conversational network size and depression severity (PHQ score) was -0.56 (p < 0.01), shown in Figure 4. This association was not affected after accounting for possible confounding variables. The PHQ score had a coefficient of -0.10 (z-value: -3.14**) in predicting conversational network size within a generalized linear model with confounding variables included as predictors (see Section S-II-G in the supplementary material, available online).

Interpretation
The everyday conversational network is a function of the individual's sociability. When the target individual is a part of the conversation, as is often the case, the node representing the target individual is connected in the conversational network (see Figure S3 in the supplementary material, available online). Alternatively, when a conversation does not include the target individual, their node is disconnected. In this case, capturing surrounding speakers represents the target individual's proclivity to be around other people and avoid isolation. Thus, the ECoNet estimate of conversational network size includes both the conversations that the target is a part of and those that occur in the target's vicinity. While biased upward, it provides a representation of an individual's sociability by approximating (in-person) social network size and the tendency to affiliate with others.

DISCUSSION
In this work, we proposed ECoNet, an everyday conversational network estimator, based on the unsupervised clustering of speech segments. ECoNet outperformed the existing audio-based speaker-counting system and was used to assess the relation of the conversational network size, a sociability measure, with mental health scores. Conversational network size was significantly correlated with mental health scores, such as depression severity.

ECoNet: Everyday Conversational Network Estimator
We demonstrated that a DNN-based VAD model could provide robust speech detection in free-living audio. Several VAD models, in comparison, gave a high error rate (see Table S1 in the supplementary material, available online), similar to the observations in Liaqat et al.'s work. 15 The PyanNet-architecture-based DNN model had a lower error, likely because of automatically learned bandpass filters for speech-discriminative features. We also showed that spurious speaker detection can be framed as a machine learning problem, which gave a good detection performance (see Figure S1 in the supplementary material, available online). A multi-feature model likely better represents different spurious speaker-detection scenarios compared with the single-threshold approach in Xu et al.'s work. 11 Per-scenario performance of the spurious speaker-detection model, e.g., in the presence of overlapped speech, background noise, etc., should be assessed for further improvements. Overall, ECoNet outperformed Crowd++ 11 and its recent variations 12 in the speaker-counting task (see Table S2 in the supplementary material, available online), possibly due to the robust VAD model, speaker-discriminative embeddings, and representative spurious speaker-detection model.

Conversational Network Size and Mental Health Parameters
The conversational network size, a sociability measure related to the (in-person) social network size, was significantly smaller in the depression group compared with the healthy group (see Figure 2). The average conversational network size had a correlation coefficient of -0.56 (p < 0.01) with PHQ score (see Figure 4). These two observations from our study fall in line with the social isolation hypothesis 10 that links reduced social network size and lower sociability to depression. 1,20 The conversational network size of the psychosis group was also smaller compared with the healthy group, but not significantly so, possibly reflecting a different mechanism of social isolation in psychotic disorders. In general, reduced conversational network size indicates reduced sociability and was, as expected, associated with higher stress-related levels (higher PSS/GAD scores) and lower positive health scores (e.g., low DTS scores) (see Figure 3). Conversational network size estimated from ECoNet has potential (mental) health applications 1-3,5 as well as likely applications in crowd monitoring, interaction assessment in educational settings, etc. Estimation of fine-grained conversation features (turn-taking behaviors and engagement) using ECoNet could further enable new clinical and non-clinical applications.

Limitations and Future Work
Although ECoNet outperformed existing audio-based speaker-counting systems, it can be further improved. For example, a context-adaptive ECoNet that adapts the processing to detected scenarios, e.g., outdoor or indoor, could be explored. Each module within ECoNet could also be improved. The VAD model, for instance, could be extended with attention mechanisms for better temporal context. The ECoNet modules were developed and validated using diverse public datasets, such as the DIHARD dataset, 17 consisting of audio from representative real-life scenarios. The clinical audio dataset from our pilot study, although large, could not be used for development because the audio cannot be listened to, a privacy requirement of the study. A large labeled free-living audio dataset would be helpful to further improve the VAD models and ECoNet in general. We designed ECoNet to be privacy-preserving, as a speech-based system might raise privacy concerns. No analysis of the spoken content is done, raw audio data need not be stored, and all processing can potentially be done on the individual's device or on a private network. We analyzed the recordings offline in this work, but the execution time was much shorter than the audio duration (see Section S-I-G). Thus, (semi-)real-time processing on the target device/private network could be possible. We will investigate ECoNet's deployment on target devices and continue to assess the privacy implications in future developments.
We investigated the potential of conversational network size obtained from ECoNet in mental health applications. The depression severity, for example, was significantly correlated with conversational network size (see Figure 4). However, the correlation was lower and not significant when considering the individual groups (see Section S-II-E). Additional features derived from ECoNet might be helpful to better model depression severity differences within a group. It is to be noted that average daily conversational network size provides only one view of sociability, assigning importance to daily interactions even if they are with the same speakers across days. Other views of sociability, such as the total network size over a monitoring period, could also be relevant to study and will be explored in future work. As discussed in the "Interpretation" section, a limitation of ECoNet is that an individual's involvement in the conversation cannot be ascertained. We will extend ECoNet with a conversational scene analysis module and explore other sociability measures that could be used to develop mental health classification or severity prediction models in future work. Both the conversation properties of the most significant speakers, including the target individual, and characteristics of their utterances could be relevant to study (see Section S-II-H, available online). We will validate ECoNet's capability to extract fine-grained sociability measures in future work. Measures related to interaction dynamics and turn-taking behaviors could help clarify fulfillment or loneliness, dissatisfaction, etc., resulting from social interactions, which are relevant in mental health applications.

CONCLUSION
We developed ECoNet, an automatic estimator of the everyday conversational network from free-living audio, to complement self-report-based assessments of sociability with automatically inferred measures. Several sociability measures could be extracted from the estimated conversational network, such as conversational network size, a task at which ECoNet outperformed the existing system. Conversational network size inferred using ECoNet was found to be significantly correlated with mental health (e.g., depression severity) in our pilot study on 32 participants with diverse mental health conditions. Our work demonstrates the relevance of automatic sociability measure estimation from free-living audio for mental health applications. Further evaluation and validation with a larger population are required in future work to build on the observations from our work.