Speaker Diarization and Identification from Single Channel Classroom Audio Recordings Using Virtual Microphones

Speaker diarization refers to methods for identifying speakers from audio recordings. An important application comes from the need to assess student interactions in collaborative learning environments. Diarization is very difficult in these environments, where a single microphone is used to record multiple voices. Although there have been advancements in this field, little progress has been made for noisy and disorganized multi-speaker environments. Current state-of-the-art methods based on Deep Learning require large training databases and can suffer from significant noise interference and bias due to the speaker’s accent, age, and gender. In this paper, we propose a new method to identify speakers that does not require large training sets. To this end, we use a virtual array of microphones. The signals at the virtual microphones are simulated by extracting the spatial information of the speakers from a single-channel audio recording using the approximate speaker geometry observed in a video recording. The Room Impulse Responses (RIRs) at the virtual microphones are estimated using acoustic scene simulations and are then used to compute a cross-correlation matrix of possible audio sources. Speaker diarization is performed using the cross-correlation matrices as input to a classifier. For the task of identifying active student speakers in classroom audio, the proposed method significantly outperformed the diarization services offered by Google Cloud and Amazon AWS.


I. INTRODUCTION
Speaker identification in crowded rooms remains very challenging. Crosstalk and large amounts of background noise make speaker separation particularly difficult. The significant variations associated with picking up speakers in crowded rooms make it very difficult to develop ground truth for large datasets. As a result, Deep Learning methods are fundamentally limited by pre-training datasets that may not be representative of the complexities of crowded rooms.
For a single speaker in a non-crowded room, a typical speaker identification system involves the extraction of speech features such as formant frequencies, pitch contours, and coarticulation from the test samples and classification against a database of training samples [1]. The datasets still need to contain as many training examples as possible and should be updated periodically to maintain a proper performance level [2]. The accuracy of the identification depends on the size of the dataset; the bigger the dataset, the better the accuracy, but the longer the training times [3].
In addition to long training times, datasets are also prone to bias with respect to spoken language and accent [4]. This bias is usually unintentional and unconscious, and it is the product of the environment where the speech recognition system is developed [5].
The limitations of speech processing systems are more evident in challenging situations such as classroom environments. In this paper, we restrict our attention to speaker diarization in collaborative learning environments where a small group of 2 to 5 students sits around a table (see Fig. 1).
In this case, there is strong background interference coming from having up to 5 collaborative groups with over 20 students total, 5 facilitators, 2 teachers, and 5 researchers in the same room. The speakers can take turns to speak, but it is not unusual to have crosstalk, where two or more speakers talk at the same time.
A fundamental problem in educational research is to understand how the classroom material engages the students. To understand how students interact, classroom sessions are recorded and transcribed. An important problem here is to determine which participant is speaking at a particular moment, what she or he has said, and for how long the participant spoke. Manual diarization of meetings is a tedious and time-consuming task, subject in many cases to the interpretation of the transcriber. Automated methods usually require multi-channel audio recordings and are prone to errors due to noise and crosstalk. Also, these systems have limitations in the number of speakers they can process, as well as the length of the audio segments.
While diarization systems do not require enrollment of the speakers, they can only generate abstract labels of a speaker that is active in an audio segment. On the other hand, speaker identification systems can provide non-abstract labels by enrolling the participating speakers. The enrollment process consists of each speaker providing several seconds of noise-free speech without crosstalk. This requirement cannot be met when the data consists of audio recordings of busy meetings with noisy backgrounds. It is thus important to develop speech identification and diarization methods that do not impose any requirement to pre-enroll the speakers.
We present a method for speaker identification and diarization using virtual microphones that does not require prior speaker enrollment. The proposed approach only requires a rough estimate of the speaker geometry that can be derived from video recordings. The approach does not require pre-training, is independent of the spoken language or accent of the participants and works well in noisy environments.
The proposed approach relies on the fact that discriminant information about the 3D geometry of each speaker is embedded in the audio recorded by a single microphone. The basic idea is to recognize speakers using acoustical simulation. As part of the simulation process, the proposed method computes the Room Impulse Response (RIR) for each microphone-speaker pair and simulates the reception at each of the virtual microphones. The accuracy of the RIR computation is verified through real-life measurements of the correlation patterns. Based on the simulated reception at the virtual microphones, the method computes correlation patterns among the virtual microphones. The recorded audio is then also used to generate different correlation patterns based on hypothesized speaker locations. A classifier is applied to the generated correlation patterns to select the most likely speaker location.
For our approach, we do not consider diarization for multiple speakers talking at the same time within the same group. Our approach, however, accounts for the significant crosstalk that results from strong background interference across groups. Simultaneous speakers within a group can be addressed by simply adding an extra microphone for each subgroup of students talking, and then treating the two (or more) subgroups as separate groups. Without an extra microphone, our approach can be adapted to handle multiple people speaking simultaneously into the same microphone, as described in our methods section.
This paper is structured as follows: Section II provides background information. Section III describes the proposed method. Section IV describes the implementation of the method and its physical validation, and provides experimental results of the proposed method against current state-of-the-art methods. Section V provides concluding remarks.

II. BACKGROUND
Speaker diarization can be summarized as "who said what, and when", and for "how long" [6]. The task of determining for how long one speaker has been active in a multi-participant conversation requires speaker diarization and subsequent identification with non-abstract labels. Most speaker diarization systems work by segmenting the audio using a voice activity detector (VAD), then the segments considered to be only noise are discarded, while those containing speech are analyzed for distinctive features. The different segments are classified with abstract labeling (e.g., speaker 10, speaker 1, etc.), usually by using cluster classification. Speaker identification systems work by enrolling speakers in a database, then extracting speech features to determine if the audio segment contains one of the enrolled speakers. A system that accepts or rejects the identity claim by a speaker is called a speaker verification system. In what follows, we present a summary of current state-of-the-art speaker diarization methods. We begin by describing classical speaker diarization of single-channel recordings and continue with speaker diarization using virtual microphone augmentation. We conclude the section with a discussion of commercial state-of-the-art methods based on Deep Learning.
Hu et al. [7] proposed a method to utilize the reverberant information, known as the Direct-to-Reverberant Ratio (DRR), from a single channel recording for speaker diarization. Hu et al. estimate the DRR using the algorithm from Peso Parada et al. [8] and combine it with a Mel-Frequency Cepstral Coefficient (MFCC) Diarization method proposed by Vijayasenan et al. [9]. The method uses both MFCC and DRR features in combination so a trained system can perform a clustering type of classification. The estimates for the DRRs are computed using features such as Signal-to-Noise ratios, MFCCs, power spectrum, and zero-crossing rates, among others. It is important to note that this work was tested only using simulated meeting recordings with clean audio and assumes that the speakers are stationary (they do not change positions).
Yoshioka et al. [10] described a way of linking several recording devices, such as laptops or mobile phones, to emulate a microphone array. After linking the different devices, the multi-channel audio can be used for speaker diarization. Yoshioka et al. claim a 13.6% diarization rate when 10% of the speech duration contains more than one speaker. This approach is innovative but requires the presence of several recording devices in the meeting room, and therefore it is not achievable with a single microphone recording as in our proposed method.
Another approach to virtual microphone emulation was presented by Katahira et al. [11], Del Galdo et al. [12], and Izquierdo et al. [13]. The authors proposed to simulate arrays of microphones by synthesizing virtual microphone signals using two physical microphones. These methods of microphone emulation are not viable when there is only one physical microphone available.
The most recent single-channel methods for speaker identification and diarization are based on Machine Learning. Deep Belief Networks (DBN) are widely used in speech recognition [14,15]. In [16], the authors claimed the use of X-vectors can achieve state-of-the-art results in speaker recognition. In [17], the authors showed that Deep Neural Networks using X-vectors often outperformed classic i-vector methods in terms of Equal Error Rate (EER) on standard datasets (e.g., VoxCeleb, NIST SRE 2016, and SWBD). To achieve this increase in performance, X-vector DNNs require the data to be augmented by adding noise and reverberation to the training data. This extra step is not necessary for our proposed method, where the only training needed consists of just a few seconds of audio from each of the speakers.
Pawel Cyrta et al. [18] presented a speaker diarization method using a deep learning architecture that builds the speaker embeddings by training a recurrent convolutional neural network applied directly to magnitude spectrograms. The authors evaluated their method using several available datasets consisting of meetings and broadcast materials from news stations, claiming a 30% reduction of the diarization error rate when compared with the baseline, the LIUMJ Speaker Diarization system. Unlike our proposed approach, this method was tested on clean datasets with very low levels of noise rather than on noisy recordings of classroom environments. The method also demands large datasets for training the deep learning system.
IBM, Google, Amazon, and Microsoft offer speech processing services based on algorithms that use Deep Learning methods. These tech giants offer powerful computer systems and large databases for these services. Amazon's, Google's, and Microsoft's offerings are all closed-source cloud services that provide an API for speech-to-text processing and speaker diarization. In this paper, we reviewed Amazon's Transcribe (AWS) [19], Google's Cloud [20], and Microsoft Azure Speech Services [21], and experimentally compared Amazon's and Google's services against our proposed system.
Amazon's Transcribe accepts single-channel input, either as audio files or as streaming data, and outputs text files with speaker diarization based on a specified number of speakers. Amazon's Transcribe works better with 2-5 speakers, and it is language dependent. The length of the audio files is limited to a maximum of 120 minutes. Amazon's Transcribe stores the voice data to train the models [22], unless the users select the option to delete this data. Amazon's functionality can be accessed via the REST and SOAP protocols over HTTP [23]. Amazon offers a highly trained set of models called Amazon Transcribe Medical, which is aimed at medical transcriptions. Users can also customize the vocabulary to better fit their needs, which is a very desirable feature not offered by Google.
Google's Cloud works similarly, with an interface for long speech and single-channel input for transcription purposes [22]. The optimum number of speakers is set at a maximum of 5. As with Amazon Transcribe, Google offers the option of privacy that prevents data logging that could be used to improve the models. Google's models are optimized for phone conversations or videos, accepting 16kHz or 8kHz audio, respectively, depending on the application [23]. It also offers vocabulary customization. Google offers good scalability, infrastructure, and payment schemes that are considered the best among the technology giants [24].
Microsoft offers speaker diarization via its Cognitive Services. Microsoft's diarization system ranked first at the VoxSRC challenge 2020 by achieving a diarization error rate (DER) of 3.71% in development and 6.23% in evaluation testing [25]. The datasets consisted of audio collected from YouTube recordings. For the challenge, the network was trained with 1500 hours of simulated mixed training audio. Microsoft Speaker Recognition [21] offers text-independent speaker recognition/verification. The speakers need to be enrolled to create a signature, which is later compared with the audio to be analyzed. The minimum requirements are 20 seconds of speech for training and 4 seconds of speech for identification, with unlimited speaker enrollment and only one speaker present. In the case of diarization, Microsoft can only recognize up to two speakers. Microsoft Transcription requires multi-channel audio for diarization and the signatures of the participating speakers for identification, labeling each speech segment with its corresponding speaker. Microsoft does not collect users' voice samples to train its models. Users can customize their vocabulary and the environment in which they expect to operate, meaning that customization must include noise, indoor or outdoor environments, multi-gender speech, etc. [21].
Although the systems described above perform well in the environments for which they were designed and tested, they still have limitations with respect to training requirements, number of identified speakers, and interfacing. First, these systems are paid services that require connectivity to the API and subsequent batch processing. Our proposed system is completely standalone and does not require any connection, thus allowing for implementation in applications where connectivity may be impaired. The system can run on stand-alone computers without the need to access remote computer clusters or databases. Second, we do not require speech databases; our system is based on physical models that are adapted to the scene we are analyzing. Instead of large datasets, our system requires capturing only about 1 to 2 seconds of audio from each speaker for both training and recognition. In contrast, at a minimum, state-of-the-art systems require tens of seconds of clean audio for training and several seconds for identification. In addition, the lack of databases also eliminates privacy issues, as voice logging is not needed to improve the models. Also, the physical nature of the model allows, at least in theory, processing an unlimited number of speakers, regardless of the language spoken. Finally, our system has been conceived to operate in noisy environments, where microphone arrays and cross-correlation analysis have been proven to be efficient methods for speaker discrimination [26,27].

III. PROPOSED METHOD
We present a top-down diagram of the proposed method in Fig. 2. Our approach relies on estimating the acoustic scene to determine the most likely speaker in each speech segment. In Fig. 2, the acoustic scene is simulated by the room model generator, the source estimator, and the room model estimator. The room model is approximated from a video of the scene (e.g., see Fig. 1). During training, we compute cross-correlation patterns for each possible speaker. Then, during testing, we compute cross-correlation patterns over each audio segment and compare them against the training patterns to determine the speaker that produced the closest correlation pattern. The rest of the current section provides detailed descriptions of each component used in our proposed system. Informed consent was obtained for all study participants.

A. ROOM ACOUSTICS AND SIMULATION
We begin describing our approach using a single source and a single microphone. We then extend our model for several sources and microphones and, finally, we present how we adapt our models to different speaker geometries.
We begin with a simple model based on a single source signal $s(t)$ located in the far field and recorded by a microphone as a signal $x(t)$, which is the convolution of the source with the Room Impulse Response (RIR) $h(t)$ plus additive noise $v(t)$:
$$x(t) = h(t) * s(t) + v(t) \tag{1}$$
The RIR depends on the locations of the sources, the receiving microphone, the geometry of the room, the absorption of the materials in the room, and the audio frequencies of the sources [28]. The RIR captures audio propagation through a direct path, early reflections, and late reverberations. The direct path component is determined by the Euclidean distance from the source to the microphone, which sets the Time of Arrival (TOA), that is, the time it takes for the signal to travel from the source to the microphone. The other two components of the RIR are related to the reflections of the sound waves at the walls and objects in the room. The early reflections usually arrive 5 ms after the direct path. The late reverberations arrive 20 or 30 ms after the early reflections begin. The RIR can thus be expressed as the summation of the impulse responses corresponding to the direct path and the reflections:
$$h(t) = \sum_{k=0}^{K} h_k(t) + w(t) \tag{2}$$
where $K$ is the number of reflections, $k$ is used to index specific reflections (with $k = 0$ denoting the direct path), and $w$ is measurement noise. The acoustic reflections depend on the absorption of the materials of the room and the frequency components of the acoustic signal [28]. The reverberation signals result from acoustic wave reflections. The late reverberations depend heavily on the frequency components of the sources but, in the case of the early reflections, this influence is minimal [29].
We next extend our model to the case of multiple sources and microphones. Suppose that we have $N$ possible sources $s_1(t), \dots, s_N(t)$ and $M$ possible microphone signals $x_1(t), \dots, x_M(t)$. Next, let $h_{n,i}(t)$ denote the RIR that describes the propagation from the $n$-th source to the $i$-th microphone. At the $i$-th microphone, we receive signals from all sources as expressed by:
$$x_i(t) = \sum_{n=1}^{N} h_{n,i}(t) * s_n(t) + v_i(t) \tag{3}$$
where $v_i(t)$ represents additive white noise. In our collaborative learning environment, we only record $x_1(t)$. We thus need to use (3) to estimate the virtual microphone recordings $x_2(t), \dots, x_M(t)$ from $x_1(t)$. To use (3), we need estimates for $h_{n,1}(t)$ and their approximate inverses $h_{n,1}^{-1}(t)$. Note that the actual inverses may not exist [28]. We perform the source estimation in two steps. First, we estimate the sources using:
$$\hat{s}_n(t) = h_{n,1}^{-1}(t) * x_1(t), \quad n = 1, \dots, N \tag{4}$$
Second, we plug the estimated sources from (4) into (3) to compute $x_2(t), \dots, x_M(t)$. The estimation of $h_{n,i}(t)$ and $h_{n,1}^{-1}(t)$ requires acoustic scene simulation that depends on the geometry of the speakers (sources) and the room where the students are meeting. In what follows, we provide more information on how to set the parameters.
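As an illustration of the two-step estimation in (3) and (4), the following minimal sketch recovers a source from the real microphone by approximate inverse filtering and then propagates the estimate to one virtual microphone. It is not the implementation used in this work: the RIRs are short hypothetical FIR filters, noise is ignored, the inverse is approximated by truncated long division (which only behaves well for minimum-phase filters), and the function names are introduced here for illustration only.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def truncated_inverse(h, taps):
    """FIR approximation g of the inverse of h, so that h * g is close to a unit impulse.

    Assumption: h[0] != 0 and h is minimum phase; otherwise g grows without bound.
    """
    g = [0.0] * taps
    g[0] = 1.0 / h[0]
    for n in range(1, taps):
        acc = sum(h[k] * g[n - k] for k in range(1, min(n, len(h) - 1) + 1))
        g[n] = -acc / h[0]
    return g


def virtual_microphone(x_real, h_to_real, h_to_virtual, taps=32):
    """Step 1: estimate the source from the real microphone, as in (4).
    Step 2: propagate the estimate to a virtual microphone, as in (3) without noise."""
    s_hat = convolve(truncated_inverse(h_to_real, taps), x_real)
    return convolve(h_to_virtual, s_hat)


# Hypothetical toy RIRs for a single active speaker (not measured values).
h_real = [1.0, 0.5]               # speaker -> real microphone
h_virtual = [0.0, 0.0, 0.8, 0.3]  # speaker -> virtual microphone
source = [1.0, -0.5, 0.25, 0.0, 0.0, 0.0]

x_real = convolve(h_real, source)
x_virtual = virtual_microphone(x_real, h_real, h_virtual)
x_expected = convolve(h_virtual, source)  # what the virtual microphone would record
```

In this toy case the leading samples of `x_virtual` agree with `x_expected`. With realistic RIRs the inverse is only approximate, which is why the text notes that exact inverses may not exist.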
As shown in Fig. 1, we can estimate the relative locations of the speakers and the recording microphone from a single video shot. For example, we can approximate that the table is about 1.5 meters long by 1 meter wide, that the speakers are separated about 0.7 meters from each other, and the speaker's mouths are about 0.24 to 0.25 m from the table. We can also locate the reference microphone in coordinates that are relative to each of the speakers. These are just approximations to create a generic model from where to calculate the RIRs. For the simulation, we consider a simplified model with a small room, large wall absorptions, with a limited number of images due to sound reflections. The acoustic simulation is thus meant to capture early reflections and avoid complex, long-delayed reflections.

B. VIRTUAL MICROPHONES
The spatial locations of the virtual microphones can be directly related to the source audio frequencies. To understand the issues, instead of the classic time sampling, consider reconstructing an acoustic signal from its 3D spatial samples at a fixed time. In this case, the separation $d$ between the microphones of the 3D sampling array must be less than half the wavelength $\lambda$ of the audio source signal. Therefore, $d$ should satisfy
$$d < \frac{\lambda}{2} \tag{5}$$
which translates to a maximum frequency of
$$f_{\max} = \frac{c}{2d} \tag{6}$$
where $c$ is the speed of sound. For separating the speakers, there is a need to keep the distance between the microphones as large as possible. At larger distances, the correlation patterns will be very different for each speaker. Unfortunately, larger distances imply larger wavelengths and hence smaller spatial frequencies in (6).
For the maximum allowable separation, we select the fundamental frequency of human speech as the smallest spatial frequency that we are interested in. The fundamental frequency of human speech varies from 100 Hz to 120 Hz approximately, with some extreme cases going up to 255-300 Hz (children). Based on a max frequency average of $f = 180$ Hz and the speed of sound $c = 343$ m/s, we set the maximum separation between microphones to:
$$d_{\max} = \frac{c}{2f} = \frac{343}{2 \times 180} \approx 0.95 \text{ m} \tag{7}$$
For separating the voices of children, we clearly need to consider much smaller separations that correspond to higher frequencies. After some experimentation, we set $d = 0.05$ m for the final collaborative learning environment used for the final classification experiments presented in section IV.C. Here, we note that $d = 0.05$ m corresponds to a maximum frequency of 3.43 kHz.
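The half-wavelength relation between microphone spacing and frequency can be checked numerically. The short sketch below only evaluates that relation with the values quoted in the text; the function names are chosen here for illustration.

```python
SPEED_OF_SOUND = 343.0  # m/s, the value used in the text


def max_spacing(frequency_hz, c=SPEED_OF_SOUND):
    """Largest microphone spacing (m) that still equals half a wavelength at this frequency."""
    return c / (2.0 * frequency_hz)


def max_frequency(spacing_m, c=SPEED_OF_SOUND):
    """Highest frequency (Hz) whose half wavelength is not shorter than the spacing."""
    return c / (2.0 * spacing_m)


spacing_for_180_hz = max_spacing(180.0)   # roughly 0.95 m
frequency_for_5_cm = max_frequency(0.05)  # roughly 3430 Hz, i.e. 3.43 kHz
```

The trade-off discussed above is visible directly: moving from about 0.95 m to 0.05 m raises the supported frequency from 180 Hz to about 3.43 kHz, at the cost of smaller time differences between neighboring microphones.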
We present the proposed virtual microphone geometry in Fig. 3. Here, the speakers represent the sources ($N = 3$, but this number varies). The dark microphone (labeled M1 in the center) is the only real microphone and represents the recording microphone in the actual physical environment. The rest of the microphones are virtual ($M = 5$).
The distance between the microphones determines the Time Difference of Arrival (TDOA) between the microphones. The TDOA is simply defined as the difference in the time a signal takes to reach two points separated by a certain distance in space. Initially, let us assume that Fig. 3 is an ideal representation where there are no reflections or room absorptions. Then, the TDOA of an active speaker will be unique to at least a pair of microphones, either virtual or physical. For example, if speaker 3 is active, then the TDOA between M5 and M3 will be the same, and different from the TDOA between M2 and M3. These TDOAs are unique for speaker 3. Without loss of generality, we expect the uniqueness property to hold for the more complex models that we consider here. We are interested in the location of the peak of the normalized cross-correlation function $\rho_{i,j}(\tau)$ between the signals at microphones $i$ and $j$ (equation (8)). The peak location is given by:
$$\tau_{i,j} = \arg\max_{\tau} \rho_{i,j}(\tau) \tag{9}$$
If a source signal propagates to microphones $i$ and $j$, then $\tau_{i,j}$ represents the time lag that it takes for the signal to reach $j$ after reaching $i$. Thus, $\tau_{i,j} > 0$ implies that the signal arrived at microphone $i$ before $j$. On the other hand, $\tau_{i,j} < 0$ implies that the signal arrived at microphone $j$ before $i$. The cross-correlation matrix of all possible values $\tau_{i,j}$ will be used for determining the locations of the speakers.
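The peak-lag computation can be sketched as follows. This is a minimal illustration that assumes a standard energy-normalized cross-correlation over a finite window; the exact normalization used in this work may differ, and the function names are introduced here for illustration. The sign convention follows the text: a positive lag means the signal reached the first microphone before the second.

```python
import math


def normalized_xcorr(x_i, x_j, lag):
    """Energy-normalized correlation between x_i(t) and x_j(t + lag)."""
    num = 0.0
    for t in range(len(x_i)):
        u = t + lag
        if 0 <= u < len(x_j):
            num += x_i[t] * x_j[u]
    energy = math.sqrt(sum(v * v for v in x_i) * sum(v * v for v in x_j))
    return num / energy if energy > 0 else 0.0


def peak_lag(x_i, x_j, max_lag):
    """Lag in samples, within [-max_lag, max_lag], that maximizes the correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: normalized_xcorr(x_i, x_j, lag))


# Toy signals: microphone j receives the same pulse 3 samples after microphone i.
mic_i = [0.0, 1.0, 0.5, -0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mic_j = [0.0] * 3 + mic_i[:-3]

lag_ij = peak_lag(mic_i, mic_j, max_lag=5)  # signal reaches i first
lag_ji = peak_lag(mic_j, mic_i, max_lag=5)  # same pair, roles swapped
```

For these toy signals `lag_ij` is 3 and `lag_ji` is -3, which illustrates the antisymmetry of the lag matrix. Scaling either signal changes the height of the correlation peak but not its location.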
Before using $\tau_{i,j}$ for speaker recognition, we provide a summary of its properties. First, it is clear that the diagonal is zero. Second, based on the definition, it is clear that $\tau_{j,i} = -\tau_{i,j}$. Therefore, the matrix of $\tau_{i,j}$ values is antisymmetric. Hence, for differentiating among speakers, we only need to use the entries above or below the diagonal.
To develop a model for the approach, we consider the problem of recognizing one of several possible speakers from a given audio segment. First, we need to construct virtual microphone approximations to $h_{n,i}(t)$. Second, we estimate the correlation matrix features $\tau_{i,j}^{p}$ under the assumption that speaker $p$ is talking while all other speakers remain quiet: $s_n(t) = 0$, $n \neq p$. These $\tau_{i,j}^{p}$ models are only computed once. They do not need to be computed for each audio segment. Third, for each audio segment, we compute $\tau_{i,j}$, the cross-correlation matrix of the actual signal. Fourth, we estimate the active speaker by solving:
$$\hat{p} = \arg\max_{p} \ \text{match}\left(\tau, \tau^{p}\right) \tag{10}$$
where match(.,.) measures the agreement between $\tau$ and $\tau^{p}$. We thus allocate the speaker $p = \hat{p}$ that gives the best match among all considered speakers. A simple match function is given by the number of template entries that match:
$$\text{match}\left(\tau, \tau^{p}\right) = \sum_{i<j} \delta\left(\tau_{i,j} - \tau_{i,j}^{p}\right) \tag{11}$$
where $\delta\left(\tau_{i,j} - \tau_{i,j}^{p}\right)$ is the discrete delta function that is 1 when the correlation patterns match, $\tau_{i,j} = \tau_{i,j}^{p}$, and 0 when they are different, $\tau_{i,j} \neq \tau_{i,j}^{p}$.
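The template-matching step can be sketched as a simple match-and-vote classifier. The lag matrices below are hypothetical values chosen for illustration, and the function names are not taken from the original implementation. Only the entries above the diagonal are compared, since the lag matrix is antisymmetric.

```python
def upper_triangle(lags):
    """Entries above the diagonal of a square lag matrix."""
    m = len(lags)
    return [lags[i][j] for i in range(m) for j in range(i + 1, m)]


def match(segment_lags, template_lags):
    """Number of above-diagonal entries where the segment agrees with the template."""
    pairs = zip(upper_triangle(segment_lags), upper_triangle(template_lags))
    return sum(1 for a, b in pairs if a == b)


def identify_speaker(segment_lags, templates):
    """Label of the template with the most matching entries."""
    return max(templates, key=lambda label: match(segment_lags, templates[label]))


# Hypothetical templates for three microphones (lags in samples).
templates = {
    "S1": [[0, 2, 4], [-2, 0, 2], [-4, -2, 0]],
    "S2": [[0, -1, -3], [1, 0, -2], [3, 2, 0]],
}
# A noisy segment: one of the three lags differs from the S1 template.
segment = [[0, 2, 5], [-2, 0, 2], [-5, -2, 0]]

best = identify_speaker(segment, templates)
```

Here `best` is "S1", because two of the three above-diagonal lags agree with the S1 template and none agree with S2. Exact equality is a strict criterion; a tolerance of a few samples would be a natural variation, although it is not described in the text.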
Our approach rejects background noise using hypothesized directions and correlation pattern matching. Firstly, the RIRs model the position of the audio sources. Hence, acoustic sources that do not match the model will generate a different correlation pattern that will not affect our results. We use this approach to model background noise sources (e.g., S6 in Fig. 6). Secondly, we note that our use of correlation patterns remains robust with respect to additive white noise. To see this, note that while additive acoustic noise can reduce the cross-correlation coefficient $\rho_{i,j}$ (see equation (8)), the correlation patterns defined in terms of $\tau_{i,j}$ only depend on the location of the correlation-pattern maximum (see equation (9)). Thus, a uniform reduction of $\rho_{i,j}(\tau)$ throughout time is not expected to change the location of its maximum.
The proposed method can be extended to address the case when we need to differentiate among more than one speaker talking at the same time within the same group. For this case, we would need to consider a much larger number of correlation patterns. For example, for detecting up to $N$ active speakers talking at the same time, we have $2^N$ possibilities. However, the approach can be further complicated by the need to account for having students speaking at very different levels (e.g., loudly versus quietly).
Clearly though, within the group, we are not interested in having multiple speakers talking at the same time. Within the proposed framework, a simple solution would be to place additional microphones within each student subgroup.
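To give a sense of how quickly the number of correlation patterns grows in the multi-speaker extension, the sketch below enumerates the hypotheses under the assumption that each of the speakers is either active or silent. The function name is introduced here for illustration.

```python
from itertools import combinations


def active_speaker_hypotheses(speakers):
    """All subsets of speakers that could be talking at once (including silence)."""
    hypotheses = []
    for size in range(len(speakers) + 1):
        hypotheses.extend(combinations(speakers, size))
    return hypotheses


hypotheses = active_speaker_hypotheses(["S1", "S2", "S3"])
print(len(hypotheses))  # 2**3 = 8 hypotheses, each needing its own correlation template
```

For a group of 5 students this already gives 32 templates, before accounting for different speaking levels.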

IV. IMPLEMENTATION, VALIDATION, AND RESULTS
In this section, we present the experiments conducted to evaluate the capability of the proposed method to identify speakers in audio segments. We begin by applying the principles of section III to an acoustic model based on an approximated room geometry. We validate the physical model using audio experiments. We then provide speaker diarization results and compare our method against Amazon AWS and Google Cloud.

A. ACOUSTIC MODEL PARAMETERS
In Fig. 4, we present the basic setup for our acoustic simulation. We considered a maximum of 5 participants and hence 5 possible source directions. For the cases of 2, 3, or 4 speakers, we simply selected the closest directions from the 5 basic directions of Fig. 4. Hence, we did not recalibrate our models for every possible variation on the acoustic scene. Furthermore, we also considered all speakers to be at the same height from the table (0.25m). For realistic simulation, we also modeled room noise as a sixth speaker placed at the lower-left part of Fig. 4.
To estimate the RIRs, we used Pyroomacoustics [30,31,32]. Pyroomacoustics is an open-source software system that supports the reproducibility of our results. Pyroomacoustics calculates the RIRs using the Image Source Model Method (ISM) [33]. Image sources are computed based on the distance of each speaker to the absorbing boundaries. For the simulation, the software assumes vertical incidence on the walls and the corners. To control the number of generated sources, we do not consider greatly attenuated sources that are associated with long delays.
For the acoustic simulation, the generation of a large number of simulated sources tends to provide for a better approximation. We simulated the learning environments by assuming acoustic walls with high reflection coefficients located at a short distance behind each speaker. As a result, each speaker generated 2 to 3 reflected sources that were propagated to the virtual microphones. The distance between the table and the ceiling was set to 2m.
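The idea behind the image source computation can be illustrated with a self-contained sketch that does not use Pyroomacoustics. It handles only the direct path and first-order reflections in a 2-D rectangular room, and it assumes frequency-independent walls, a reflection coefficient of sqrt(1 - absorption), and spherical spreading of 1/(4*pi*distance). The room size and positions are hypothetical, and the function names are introduced here for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s


def first_order_images(source, room):
    """Mirror a 2-D source across the four walls of the room [0, Lx] x [0, Ly]."""
    x, y = source
    lx, ly = room
    return [(-x, y), (2 * lx - x, y), (x, -y), (x, 2 * ly - y)]


def sparse_rir(source, mic, room, absorption, fs):
    """(delay in samples, amplitude) taps for the direct path and first-order reflections.

    Assumes the source and the microphone are not at the same position.
    """
    beta = math.sqrt(1.0 - absorption)  # wall reflection coefficient
    paths = [(0, source)] + [(1, image) for image in first_order_images(source, room)]
    taps = []
    for order, position in paths:
        distance = math.dist(position, mic)
        delay = round(distance / SPEED_OF_SOUND * fs)
        amplitude = beta ** order / (4.0 * math.pi * distance)
        taps.append((delay, amplitude))
    return sorted(taps)


# Hypothetical 4 m x 3 m room, absorption 0.95, sampled at 48 kHz.
taps = sparse_rir(source=(1.0, 1.5), mic=(2.0, 1.5), room=(4.0, 3.0),
                  absorption=0.95, fs=48000)
```

With a high absorption such as 0.95, the reflected taps are much weaker than the direct path, which is consistent with keeping only a small number of image sources in the simulation.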

B. PHYSICAL MODEL SIMULATION AND VALIDATION
To validate our model simulation approach, we compared correlation patterns generated by our simulation environment and physical measurements using actual microphones and speakers. We consider two setups for validating our approach. First, we compare the performance of the virtual microphone array simulation against a physical microphone array. Second, we perform a controlled audio experiment to understand some of the limitations of the virtual microphone array in collaborative learning environments. Firstly, for validation using an array of physical microphones, we used the same microphones as the central microphone in our video recordings. The microphones were calibrated using a sinusoidal source of 450 Hz, and we compensated for any physical delay during the audio recordings. The model absorption was empirically set at 0.95. The 2D model included 4 loudspeakers and 5 microphones as depicted in Fig. 5. In the Pyroomacoustics model, sound reflections were simulated using 8 images of the actual audio sources.
As shown in Fig. 5, the physical microphones were placed farther out, closer to the loudspeakers. The larger separations still satisfied the constraint given in equation (5). We note that larger separations were needed to keep apart the large physical microphones, as opposed to the virtual microphones that do not have such constraints.

Figure 6: 2-D Model for Controlled Experiments

To generate the physical measurements, we used an anechoic male voice of 2 seconds duration. The voice was played through the four speakers and was simultaneously recorded through the six microphones. The same signals were simulated using Pyroomacoustics. For each recording, we compute the resulting correlation patterns.
A comparison of the measured correlation patterns is given in Table I. Here, we note that the signals were sampled at 48 kHz, at the same sampling frequency as our video recordings. The results summarized in Table I indicate general agreement between the simulation and the actual physical measurements. In most cases, the error is less than 20%. Most importantly, there are significant differences between the correlation patterns from different speakers. Hence, the simulation model appears to be sufficiently accurate for differentiating speakers based on their positions.
Secondly, we validate our approach in a controlled audio environment. Here, we study the performance of the system in identifying different speakers. For this experiment, we played each source from different loudspeakers in our audio lab and used only the central microphone M3 to capture the audio. To demonstrate that the method is not biased toward any speech or speaker, speaker 2 (S2) repeats the same speech as speaker 1 (S1) on two occasions. Noise was injected into the environment by playing a compact disc (CD) containing a recording of conversational room noise. The CD player was located about 2 m from the reference microphone. The audio was segmented using a Voice Activity Detector, preserving the noisy segments. The physical dimensions of the model and the location of the microphones were adjusted to better follow the geometry of the acoustic scene depicted in Fig. 4. The final 2-D model is shown in Fig. 6.
We employed the Diarization Error Rate (DER) [34,35] as a metric for diarization performance. The DER is defined as the fraction of the time that is not attributed correctly to a speaker or non-speech [36]. It is estimated using:

$$\mathrm{DER} = \frac{\mathrm{FA} + \mathrm{Miss} + \mathrm{Overlap} + \mathrm{Confusion}}{\text{Reference Length}}$$

where FA is the length of False Alarms; Miss is the length of missed speech segments; Overlap is the total length of overlapped speech; Confusion is the total length of misclassified segments; and the Reference Length is the total length of the audio reference. We did not use Overlap for our tests.
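As a minimal sketch of the arithmetic, the DER can be computed by reading the definition above as the sum of the error durations divided by the reference length. The function name and the durations below are hypothetical and serve only as an illustration; Overlap defaults to zero because it was not used in our tests.

```python
def diarization_error_rate(false_alarm, miss, confusion, reference_length, overlap=0.0):
    """Return the DER as a fraction of the reference length.

    All arguments are durations in seconds. The overlap term defaults to
    zero, matching the tests described in the text.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return (false_alarm + miss + overlap + confusion) / reference_length


# Hypothetical durations (seconds), chosen only to illustrate the arithmetic.
der = diarization_error_rate(false_alarm=2.0, miss=3.0, confusion=4.0,
                             reference_length=60.0)
print(f"DER = {der:.2f}")  # (2 + 3 + 4) / 60 = 0.15
```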
The test consisted of playing three separate audio tracks containing only two speakers at a time, and one audio track containing four different speakers. Audio samples A and B were played as speakers 1 and 3, while audio sample C was played as speakers 2 and 4. Audio sample D was played as speakers 1, 2, 3 and 4. The audio was divided into segments with a maximum length of 1.5 s, and a minimum of 0.5 s. We used 1 s long samples from each of the speakers to train the model. As described earlier, for classification, we used a simple match-and-vote classifier where the speaker position with the highest number of cross-correlation matches with respect to the training template is selected as the current speaker. Table II provides a summary of the results. Overall, the results indicate a good DER of not more than 0.27 in the worst case. The results validate the approach on this limited validation experiment. We present a careful comparison against state-of-the-art methods in the following section.
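The match-and-vote rule can be illustrated with the following minimal sketch. It rests on assumptions that are not specified in this section: each correlation pattern is represented as a tuple of peak lags (in samples), one per virtual microphone pair, and a "match" means that an observed lag falls within a small tolerance of the template lag. The names `match_and_vote`, `templates`, and `tolerance`, as well as the numeric values, are hypothetical.

```python
def match_and_vote(segment_pattern, templates, tolerance=1):
    """Return the speaker position whose template collects the most matches.

    segment_pattern: tuple of peak lags for the current audio segment
                     (assumed representation, one entry per microphone pair).
    templates: dict mapping a speaker position to its training pattern.
    tolerance: maximum lag difference (samples) that counts as a match.
    Returns None when no template entry matches.
    """
    votes = {}
    for pair_index, observed in enumerate(segment_pattern):
        for speaker, template in templates.items():
            if abs(observed - template[pair_index]) <= tolerance:
                votes[speaker] = votes.get(speaker, 0) + 1
    if not votes:
        return None
    # The position with the highest number of matches is selected.
    return max(votes, key=votes.get)


# Hypothetical training templates for three speaker positions.
templates = {
    "S1": (3, -2, 5),
    "S2": (-4, 1, 0),
    "S3": (0, 6, -3),
}
winner = match_and_vote((3, -1, 0), templates)
print(winner)  # S1 collects two matches, S2 one, S3 none
```

In this sketch each microphone pair can vote for more than one position, which keeps the rule simple at the cost of possible ties; a real system would need an explicit tie-breaking policy.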

C. RESULTS FOR COLLABORATIVE LEARNING ENVIRONMENTS
We next present a comprehensive validation of our approach based on actual collaborative classroom videos. We provide a detailed analysis for complex audio samples collected during the afterschool program [37]. The corresponding videos contain acoustic scenes like the one shown in Fig. 4. The room was very noisy, with 5 collaborative small groups of 3 to 4 students each, along with 5 facilitators, 2 teachers, and 5 researchers in the same room (over 32 speakers).
To process the videos, we assume the baseline model presented in Fig. 6. The parameters of the model are set as described in subsection IV.A. We made only minor adjustments to the baseline model to reflect the number of speakers and their locations, while maintaining the same geometry for the virtual microphone array.
We constructed 8 carefully chosen examples with 2, 3, 4, and 5 speakers. For the ground truth, we manually reviewed the video clips to associate lip movements with specific speakers, providing 0.5 second accuracy within a total duration of three minutes. Here, we note that the proposed method allowed us to identify each speaker based on their location. This was not possible for Amazon AWS and Google Cloud. Instead, for comparison purposes, we mapped the results from Amazon AWS and Google Cloud to the most likely speaker, that is, the assignment that would give these systems their best results. To train our system, we used a noisy sample of 1.8 seconds from each speaker. Here, we note that our method does not depend on the specific speakers. We use training to estimate the RIRs, which depend on the location of the speakers relative to the physical microphone. Hence, as long as the speakers return to their seats, we can handle any unknown speaker who takes a seat at the table. Furthermore, as discussed earlier, we only require a rough estimate of the seating arrangement. There is no need to retrain the model unless there are very significant changes in the seating arrangement.
We used simple voice activity detection to segment the audio. We used a maximum audio segment length of 1.2 seconds and discarded audio segments that were shorter than 0.5 seconds.
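The segmentation rule can be sketched as follows, assuming that the voice activity detector returns a list of (start, end) times in seconds. The function name, the input format, and the choice to split long active regions into consecutive chunks are assumptions made for illustration; only the 1.2 second maximum and the 0.5 second minimum come from the text.

```python
def segment_intervals(active_intervals, max_len=1.2, min_len=0.5):
    """Split voice-active intervals into segments of bounded length.

    active_intervals: list of (start, end) times in seconds, assumed to be
                      the output of a voice activity detector.
    Segments longer than max_len are cut into consecutive chunks, and any
    segment (or leftover chunk) shorter than min_len is discarded.
    """
    segments = []
    for start, end in active_intervals:
        t = start
        while end - t > max_len:
            segments.append((t, t + max_len))
            t += max_len
        if end - t >= min_len:
            segments.append((t, end))
    return segments


# Hypothetical detector output: a 3 s region, a 0.3 s blip, and a 1 s region.
segments = segment_intervals([(0.0, 3.0), (4.0, 4.3), (5.0, 6.0)])
for seg_start, seg_end in segments:
    print(f"{seg_start:.1f} s to {seg_end:.1f} s")
```

With these inputs, the 3 second region yields three segments, the 0.3 second blip is discarded, and the 1 second region is kept whole.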
We present detailed comparative results in Table III and summary results in Table IV. We begin with a summary of the results and then provide a much more detailed analysis.
From the summary results, it is clear that the proposed method significantly outperformed Amazon AWS and Google Cloud. For the results, the percentage error is given in terms of the actual speaker time as given by:

$$\text{Error}\,(\%) = \frac{\left| T_{\text{estimated}} - T_{\text{actual}} \right|}{T_{\text{actual}}} \times 100$$

where $T_{\text{estimated}}$ is the estimated talking time of a speaker and $T_{\text{actual}}$ is the actual talking time from the ground truth. In all cases, the proposed method gave substantially lower error rates. With two speakers, the error was acceptable at less than 20%. In comparison, the error rates for all alternative methods were much higher in every sample. As we shall describe next, alternative methods failed in many instances.

We provide a detailed analysis of the results in Table IV. We use red highlighting to denote cases of dramatic failures. In such cases, either a speaker was completely missed, or the estimated talking time of the speaker had more than a 100% error (e.g., an excessive over-estimation of speaker talking time).
Out of 28 possible speakers across all examples, Amazon AWS gave failing results for 14 cases (50%), Google Cloud gave failing results for 10 cases (36%), while the proposed method gave failing results for 2 cases (7%). Here, it is interesting to note that the proposed method never failed to detect a speaker (0% error), while Amazon AWS could not detect any talking time for 10 cases (36%). Google Cloud failed to detect any talking time for 4 cases (14%). It is also interesting to note that we have dramatic failure cases in all 8 samples for Amazon AWS and Google Cloud. In contrast, for the proposed method, we have 2 samples with examples of over-estimation, with 6 samples being free of dramatic failures. We use green highlighting to denote cases where the total estimated speaking time gave 20% or less error. Based on this criterion, both AWS and Google Cloud gave satisfactory results in 5 cases (18%) versus 11 cases (39%) for the proposed method.
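The highlighting criteria can be summarized with the following minimal sketch. It assumes that the percentage error is the absolute difference between the estimated and actual talking times, relative to the actual talking time. The function names, the label "intermediate" for cases that are neither red nor green, and the talking times in the example are hypothetical; the case counts in the final check are those reported above.

```python
def percent_error(estimated, actual):
    """Percentage error of the estimated talking time relative to the actual time."""
    return abs(estimated - actual) / actual * 100.0


def categorize(estimated, actual):
    """Label one speaker result using the red and green criteria from the text."""
    if estimated == 0:
        return "failure"        # speaker completely missed
    error = percent_error(estimated, actual)
    if error > 100.0:
        return "failure"        # e.g., excessive over-estimation
    if error <= 20.0:
        return "satisfactory"   # green highlighting
    return "intermediate"       # hypothetical label for all remaining cases


# Hypothetical (estimated, actual) talking times in seconds.
examples = [(0.0, 30.0), (70.0, 30.0), (27.0, 30.0), (45.0, 30.0)]
labels = [categorize(est, act) for est, act in examples]
print(labels)

# Case counts reported in the text, expressed as rounded percentages of 28 speakers.
total_speakers = 28
reported_counts = {"AWS failures": 14, "Google failures": 10, "Proposed failures": 2}
percentages = {name: round(100 * count / total_speakers)
               for name, count in reported_counts.items()}
print(percentages)
```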
Overall, it is clear that the problem remains challenging. However, the results show that the proposed approach is promising and that, on these samples, its performance was not matched by the current state-of-the-art methods.

V. CONCLUSIONS
In this paper, we have demonstrated the advantages of using virtual microphones and cross-correlation patterns to identify speakers in very challenging classroom environments from a single-channel recording. Our method achieved an error rate that was significantly lower than that of state-of-the-art systems from Amazon AWS and Google Cloud. Furthermore, in contrast with other methods, our proposed approach does not require extensive training, and it is directly applicable in challenging classroom audio environments where clean audio datasets are not available.