The Minimum Overlap-Gap Algorithm for Speech Enhancement

In this paper, we propose a novel speech enhancement paradigm which can effectively solve the problem of retrieving a desired speech signal in a multi-talker environment. The proposed paradigm involves a three-step procedure consisting of separation, ranking, and enhancement. First, a speech separation system – which could be a conventional spatial filter bank or a more advanced separation system – separates the mixtures of speech signals captured by the microphones into speech signals from candidate speakers. Next, novel ranking algorithms – proposed in this paper – are applied to determine the talker-of-interest amongst the separated speech signals. Finally, the speech signal of the talker-of-interest is estimated as a linear combination of the separated signals, whose weights are determined by the ranking algorithms. The proposed ranking algorithms exploit turn-taking patterns between conversational partners in order to determine the talker-of-interest amongst competing speakers. Unlike some existing solutions, they do not require access to additional sensors, e.g., EEG electrodes or cameras, but rely only on microphone signals. Specifically, the proposed algorithms rank the separated speech signals based on the probability of speech overlaps and gaps with the user's own voice: the highest-ranked speech signal belongs to the talker with the minimum probability of speech overlap and gap with the user's own voice. The proposed ranking algorithms are shown to be highly effective at determining the talker-of-interest, since conversational partners, i.e., the user and the talker-of-interest, behaviorally avoid speech overlaps and gaps. We evaluate the proposed speech enhancement paradigm in two practical hearing aid related applications, where the objective is to enhance the speech signal of a conversational partner in a multi-talker environment.
The results of the evaluation demonstrate that the proposed speech enhancement systems in both applications significantly outperform conventional speech enhancement systems.


I. INTRODUCTION
The cocktail party problem is often regarded as one of the most difficult situations any speech enhancement system may encounter. The acoustic environment is vastly complex and may include multiple competing speakers, music, reverberation, and noise. Solving the cocktail party problem, i.e., retrieving the speech signal(s)-of-interest (the target signal(s)), is commonly the goal of speech enhancement systems in applications such as hearing assistive devices (HADs) and speaker-phone systems. The enhancement system in these applications is crucial for many users, who rely on it to communicate more efficiently in noisy environments, particularly when competing speech and noise become dominant. However, achieving effective suppression of loud competing speech and noise remains a remarkably difficult problem, even for the most recent state-of-the-art speech enhancement systems.
The problem of interest in this paper is to enhance a conversational partner, i.e., the talker-of-interest, in the presence of multiple competing speakers and noise. The competing speakers are obviously undesired and can potentially be louder than the conversational partner. In order to enhance the conversational partner in such multi-speaker situations, any enhancement system faces the question: "Who is the user listening and talking to?". The traditional speech enhancement paradigm for single-microphone systems involves estimating temporal statistics of the conversational partner and the noise for the implementation of linear filters. In the multi-microphone case, beamformers are often implemented and typically require estimation of the direction-of-arrival (DOA) and/or spatio-temporal statistics of the conversational partner and the noise [1]-[5]. However, the presence of multiple speakers poses a great estimation challenge, since the conversational partner and competing speakers are often indistinguishable from an acoustic perspective. In worst-case scenarios, speech enhancement algorithms might in fact suppress the conversational partner and enhance competing speech. For example, current DOA estimators such as SRP-PHAT [6], maximum likelihood [7,8], and deep learning-based DOA estimators [9] are not able to robustly handle a conversational partner in a multi-speaker environment without additional a priori information on the conversational partner's location or voice activity. Consequently, these DOA estimators will indecisively switch between the candidate speakers as being the conversational partner, leading to an enhanced signal of unacceptable intelligibility and quality.
In this paper, we propose a speech enhancement paradigm that can efficiently identify the conversational partner in a multi-speaker environment and retrieve the desired speech signal. The paradigm is described through the three-step procedure shown in Fig. 1.
In the first step, the noisy microphone signals are fed into a speech separation system which separates the mixture of speech signals into individual source signals/components, which we refer to as candidate speakers. Examples of speech separation systems include beamforming systems, which separate speech using beams steered in different directions, and deep neural network (DNN) based separation algorithms, e.g., uPIT [10,11] and TasNet [12,13]. Some applications allow microphones to be placed physically on the candidate speakers, in which case the separation is trivial.
In the second step, the separated candidate speakers are ranked according to their likelihood of being the conversational partner. Existing ranking strategies may involve additional sensor signals and prior knowledge to support the decision of estimating the conversational partner channel after speech separation. As an example, beamforming systems in HADs often rank, or simply assume, the frontal speaker as the most likely conversational partner [2]. Unfortunately, the user may not always face the conversational partner, which leads to a loss of performance. Alternatively, the estimated candidate speakers may be ranked using EEG signals, retrieved from EEG electrodes placed on the scalp of the user, to detect the user's attention towards a conversational partner; EOG signals from in-ear electrodes to estimate eye-gaze; or cameras to track eye movements and estimate eye-gaze [14]-[18]. While these signals have the potential to support the decision of determining the talker-of-interest, they require additional sensors which increase equipment cost and wearing inconvenience, and likely also computational cost and power consumption. These trade-offs make acquisition of EEG, EOG, and visual signals impractical for small devices such as HADs, where power consumption and wearing comfort matter to the end user.
Finally, the last step involves enhancement of the conversational partner signal. The enhanced signal is formed as a linear combination of the separated speech signals where the weights are determined from the speaker ranking algorithm.
Additionally, we propose a method for the ranking problem in Fig. 1 which does not require sensors other than microphones. A microphone-only system is highly desirable from a practical perspective, both due to the cost of additional sensors and from an algorithm complexity perspective. Our method exploits the conversational behavior between the user and the conversational partner. We use the so-called turn-taking behavior between two conversational partners [19]-[23] to rank the candidate speakers according to how likely each is to be the user's conversational partner. Specifically, the method analyses the speech overlaps and gaps between the user and a candidate speaker to quantify turn-taking, and then selects the speaker with the minimum probability of speech overlap and gap with the user as the talker-of-interest.
This paper is organized as follows. Sec. II introduces the basics of conversation and turn-taking behavior and its potential use in ranking the candidate speakers and determining the talker-of-interest. In Sec. III, we derive our minimum overlap-gap (MOG) method and propose statistical models of speech overlap and gap behavior between a user and a conversational partner. Based on these statistical models, we propose an extension, namely the Bayesian MOG (BMOG) algorithm. In Sec. IV, we describe the estimation of the parameters of the proposed statistical models of turn-taking from datasets of real conversations, and we use the statistical models to derive the theoretical performance of the (B)MOG algorithm. Finally, in Sec. V, we evaluate the performance of the proposed speech enhancement paradigm and the (B)MOG algorithms in two speech enhancement applications.

II. SPEECH INTERACTION IN CONVERSATIONS
Determining the talker-of-interest by ranking the candidate speakers is needed for the proposed speech enhancement paradigm and can be an extremely difficult problem to solve. We propose to rank the candidate speakers using the turn-taking model presented in [19]. Human interaction comprises a group of behavioral mechanisms, learned from childhood, that are used when engaged in conversations to structure the exchange of information [20]. The addressing and turn-taking mechanisms found in conversations are examples of interaction management between conversational partners [20].
Addressing is used by the speaker to indicate whom the speech is directed to, i.e., the addressee. For example, humans may use gaze, gestures, and speech to indicate the conversational partner. Strong indicators are typically head pose and eye-gaze, which potentially could be utilized by speech enhancement systems to determine the talker-of-interest [20]. However, measuring head pose and eye-gaze would usually require additional sensors such as accelerometers, electrodes, or cameras in applications such as HADs.
The turn-taking mechanism is another type of interaction management and is universal across cultures and languages. Turn-taking structures conversations: it coordinates who should speak next and when, ensuring that only one speaker is talking at a time while the others remain silent. Conversational partners may occasionally overlap and gap in conversations, but these are often of short duration, such as when the listener responds to the talker by saying "yes" or "uh hm" [19,24]. In order to maintain rapid turn-taking, listeners also try to predict the end of a speech utterance of their conversational partner to minimize speech overlap and gap.
We use the turn-taking model in [19] to model a) the conversational behavior between the user and the conversational partner, and b) the voice activity pattern between the user and a competing speaker. We may describe a) and b) in terms of four voice activity states:
S1: The conversational partner/competing speaker speaks while the user is silent.
S2: The user speaks while the conversational partner/competing speaker is silent.
S3: The conversational partner/competing speaker and the user are both silent.
S4: The conversational partner/competing speaker and the user are both speaking.
States S1 and S2 are turns of the conversational partner/competing speaker and the user, respectively, while state S3 is referred to as gaps or pauses and state S4 as overlaps [19,24,25]. In [24] it was found that 77% of all recorded conversation time between a user and a conversational partner was spent in state S1 or S2, 19.2% in state S3, and 3.8% in state S4. For a user and a competing speaker, the proportion of time spent in each state may be argued to be significantly different compared to a user and a conversational partner. Specifically, a larger proportion of speech overlaps and gaps would be expected between a user and a competing speaker, since the turn-taking mechanisms would not exist. In addition, when conversational partners are exposed to noisy environments, the proportion of time spent in each state changes, with overlaps becoming more common as the noise level increases. In [26] it was found that in very noisy environments the proportion of time spent in state S1 or S2 decreased from 70% at a noise level of 54 dB SPL to 50% at 78 dB SPL, the proportion for S3 increased from 8% at 54 dB SPL to 24% at 78 dB SPL, and that for S4 from approximately 22% at 54 dB SPL to 26% at 78 dB SPL, at which point normal conversation breaks down. A possible reason for these observations is that conversational partners insist on maintaining rapid turn-taking during conversations, resulting in poorer timing and prediction of their partner's end of a turn, hence increasing the proportion of overlaps and gaps.
These results indicate that humans rely significantly on turn-taking to maintain normal conversations even in very noisy environments, as conversations otherwise would break down. Although speech overlaps and gaps become more frequent in noisy environments, these conversational patterns remain robust in noisy conditions, and the turn-taking patterns between a user and a conversational partner would presumably still differ significantly from the voice activity patterns between a user and a competing speaker. Hence, in the following we propose a method that exploits these turn-taking patterns to determine the talker-of-interest in a multi-talker environment.

III. THE MINIMUM OVERLAP-GAP ALGORITHM
In this section, we derive the proposed algorithm for ranking the candidate speakers using expected turn-taking patterns. Our primary focus in this section is the task of ranking the speakers by their likelihood of being the conversational partner, i.e. the Ranking block in Fig. 1.
First, the speech separation system separates the mixture of speech signals into individual discrete-time sequences s_i(n), i = 0, 1, ..., I, where s_0(n) is the user's own voice and the remaining s_i(n), i = 1, ..., I, are the I candidate speech signals. For each speech signal s_i(n), a binary voice activity detector (VAD) output α_i(n) is defined as

α_i(n) = { 1, if speech is present in s_i(n); 0, otherwise },   (1)

where α_0(n) is the user's own voice VAD (OVAD). We assume that α_i(n) represents the actual speech activity of the various speech sources; as we demonstrate in Sec. V-E, the proposed ranking and enhancement system works well even when the α_i(n) are estimated from sources separated with a practical beamforming system. Fig. 2 shows an example of VAD outputs for a real conversation between the user and the conversational partner in addition to two competing speakers. The outputs of the VADs are used to determine the voice activity state between the user and a candidate speaker i, i.e.,
S1: if α_0(n) = 0 and α_i(n) = 1.
S2: if α_0(n) = 1 and α_i(n) = 0.
S3: if α_0(n) = 0 and α_i(n) = 0.
S4: if α_0(n) = 1 and α_i(n) = 1.
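The mapping from a pair of binary VAD decisions to the four voice activity states can be sketched as a small helper function. This is an illustrative sketch, not code from the paper:

```python
def voice_activity_state(alpha_0: int, alpha_i: int) -> int:
    """Map a pair of binary VAD values to the four states S1..S4.

    alpha_0: user's own-voice VAD decision (0/1).
    alpha_i: candidate speaker i's VAD decision (0/1).
    Returns k such that the pair is in state S_k.
    """
    if alpha_0 == 0 and alpha_i == 1:
        return 1  # S1: candidate speaks, user silent
    if alpha_0 == 1 and alpha_i == 0:
        return 2  # S2: user speaks, candidate silent
    if alpha_0 == 0 and alpha_i == 0:
        return 3  # S3: gap/pause (both silent)
    return 4      # S4: overlap (both speaking)
```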
As discussed in Sec. II, conversational partners use turn-taking when engaged in a conversation. A consequence of the turn-taking mechanism is that conversational partners avoid speech overlaps and gaps, i.e., they minimize the proportion of time spent in states S3 and S4. In the following, we propose an algorithm that exploits this observation to determine the talker-of-interest.

A. MINIMUM PROBABILITY OF SPEECH OVERLAP AND GAP
The paradigm presented in Fig. 1 ranks the candidate speakers using their voice activity patterns prior to the enhancement. The proposed algorithm selects the speaker with the minimum probability of speech overlap and gap with the user's own voice as the talker-of-interest. We refer to this method as the Minimum Overlap-Gap (MOG) algorithm. Let A_i(n), i = 0, 1, ..., I, be Bernoulli random variables of the VADs and let α_i(n), i = 0, 1, ..., I, be their corresponding realizations. The probability of a speech overlap and of a speech gap between the user's own voice and candidate speaker i is denoted as P_{A_0 A_i}(α_0(n) = 1, α_i(n) = 1) and P_{A_0 A_i}(α_0(n) = 0, α_i(n) = 0), respectively. The MOG algorithm selects the speaker with the minimum probability of overlaps and gaps:

î_MOG(n) = arg min_{i ∈ {1,...,I}} [ P_{A_0 A_i}(α_0(n) = 1, α_i(n) = 1) + P_{A_0 A_i}(α_0(n) = 0, α_i(n) = 0) ],   (2)

where î_MOG(n) is the estimated conversational partner channel index. Minimizing the cost in (2) is equivalent to minimizing the occurrences of the states S3(n, i) and S4(n, i), i.e., gaps and overlaps, respectively. Alternatively, the optimization problem may also be formulated as maximizing the probability of mutual exclusion between the binary sequences α_0(n) and α_i(n) (see Appendix A), i.e.,
î_MOG(n) = arg max_{i ∈ {1,...,I}} [ P_{A_0 A_i}(α_0(n) = 1, α_i(n) = 0) + P_{A_0 A_i}(α_0(n) = 0, α_i(n) = 1) ].   (3)

Furthermore, as shown in Appendix B, solving (3) is also equivalent to finding the candidate speaker index which maximizes the mean-square-error (MSE) between the user's own-voice VAD (OVAD) and the candidate speaker's VAD, i.e.,

î_MOG(n) = arg max_{i ∈ {1,...,I}} E[(A_0(n) − A_i(n))^2].   (4)

Note that the objective is bounded in [0, 1], as A_0(n) and A_i(n) are binary. The definition of the MOG algorithm in (4) is a maximization of the MSE between two binary sequences and is thus computationally simple.
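The sample-based version of the MSE criterion in (4) can be sketched in a few lines; this is an illustrative sketch with hypothetical inputs, not the paper's implementation:

```python
def mog_rank(alpha_0, candidates):
    """Minimum Overlap-Gap selection in the spirit of Eq. (4):
    pick the candidate whose binary VAD maximizes the sample MSE
    against the user's own-voice VAD, i.e., minimizes the observed
    proportion of overlaps and gaps.

    alpha_0: list of 0/1 own-voice VAD decisions.
    candidates: dict {speaker index: list of 0/1 VAD decisions}.
    Returns (best index, {index: MSE score}).
    """
    n = len(alpha_0)
    scores = {
        i: sum((a0 - ai) ** 2 for a0, ai in zip(alpha_0, alpha_i)) / n
        for i, alpha_i in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

A conversational partner's VAD tends to be complementary to the user's (MSE near 1), while a competing speaker's VAD is unrelated or coincident (lower MSE).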

B. BAYESIAN MOG FOR PROBABILITY-BASED SPEAKER RANKING
Probability-based ranking of the candidate speakers can provide additional insights compared to the MOG algorithm in (4) which only identifies a single talker-of-interest. In this approach, a posterior probability is estimated for each candidate speaker which quantifies the uncertainty of a candidate speaker being the talker-of-interest. This information can be particularly useful for a speech enhancement system, for example, to adjust the level of noise suppression.

1) Statistical models of the sum of squared error
One approach to derive posterior probabilities for each candidate speaker is to statistically model the distribution of overlaps and gaps between 1) a user and a conversational partner, and 2) a user and a competing speaker, and then use Bayes' theorem to estimate the probabilities. To model the statistical distribution of overlaps and gaps, we introduce the random variable Z_i(n), which represents the squared error between the own-voice VAD and the candidate speaker VAD:

Z_i(n) = (A_0(n) − A_i(n))^2,   (5)

where Z_i(n), A_0(n), and A_i(n) are Bernoulli random variables. The random variable Z_i(n) indicates whether A_0(n) and A_i(n) form an overlap or a gap, i.e., Z_i(n) = 0, or not. We define the sum of squared errors (SSE) as

Φ_i(n) = Σ_{k=n−N+1}^{n} Z_i(k),   (6)

where N is the number of past observations of Z_i(n) upon which the decision will be based. The SSE counts the observations within the last N that are neither overlaps nor gaps: low SSEs indicate large amounts of overlaps and gaps between A_0(n) and A_i(n), whereas high SSEs indicate small amounts of overlaps and gaps. It is also worth noting that N is related to the integration time, which we define as

T_int = N / f_s,vad,   (7)

where f_s,vad is the sampling frequency of the VADs. The integration time T_int is more easily interpreted than N, as it also accounts for the sampling frequency of the VADs. In order to model the distribution of Φ_i(n), we use the fact that Φ_i(n) is a sum of N Bernoulli distributed random variables. If the Z_i(n) were independent and identically distributed, Φ_i(n) would follow a binomial distribution. However, preliminary experiments with natural conversations have shown that observations of Φ_i(n) have a higher dispersion than a binomial distribution allows; hence the binomial distribution is too restrictive to explain the observations. Instead, we have found that a beta-binomial distribution provides a significantly better fit than the binomial distribution.
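The SSE in (6) over a window of the last N VAD decisions can be sketched as follows (illustrative, list-based rather than streaming):

```python
def sse(alpha_0, alpha_i, n, N):
    """Sum of squared errors Phi_i(n) over the last N VAD decisions,
    i.e., the count of samples that are neither overlaps nor gaps.
    The window follows Eq. (6): k = n - N + 1, ..., n (0-indexed lists)."""
    assert n - N + 1 >= 0, "not enough past observations"
    return sum((alpha_0[k] - alpha_i[k]) ** 2 for k in range(n - N + 1, n + 1))
```

Complementary VADs yield the maximum SSE of N; identical VADs (all overlaps/gaps) yield 0.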
The beta-binomial distribution is parameterized by N and two shaping parameters γ and β, and its probability mass function (PMF) is given as

p(Φ_i = φ_i; γ, β, N) = (N choose φ_i) · B(φ_i + γ, N − φ_i + β) / B(γ, β),   (8)

where

B(γ, β) = Γ(γ)Γ(β) / Γ(γ + β)   (9)

is the Beta function parameterized by γ and β, and

(N choose φ_i) = N! / (φ_i!(N − φ_i)!)   (10)

denotes the binomial coefficient. In the remaining part of the paper, we use the PMF notation p_{Φ_i}(φ_i; γ, β, N) ≜ p(Φ_i = φ_i; γ, β, N) for brevity. First, we statistically model Φ_i when the user is engaged in a conversation, and afterwards we model Φ_i for the interaction between the user and a competing speaker. Hence, the first statistical distribution p_{Φ_i}(φ_i; γ_t, β_t, N) is fitted to observations of SSEs between a user and a conversational partner engaged in a conversation, where the subscript t denotes that the shaping parameters are related to the true conversational partner. The second distribution p_{Φ_j}(φ_j; γ_v, β_v, N) is fitted to observations of SSEs between a user and competing speakers, where the user and the competing speaker are engaged in different conversations.
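The beta-binomial PMF described above can be evaluated stably via log-gamma functions; a minimal sketch, not tied to any particular library:

```python
from math import lgamma, exp

def log_beta(a, b):
    # ln B(a, b) computed via log-gamma for numerical stability
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(phi, gamma, beta, N):
    """Beta-binomial PMF p(Phi = phi; gamma, beta, N)."""
    if phi < 0 or phi > N:
        return 0.0
    log_p = (lgamma(N + 1) - lgamma(phi + 1) - lgamma(N - phi + 1)  # ln binomial coeff.
             + log_beta(phi + gamma, N - phi + beta) - log_beta(gamma, beta))
    return exp(log_p)
```

As a sanity check, gamma = beta = 1 reduces the PMF to the discrete uniform distribution on {0, ..., N}.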

2) Hypothesis testing
In order to estimate probabilities for each candidate speaker, we define I hypotheses

H_i: Candidate speaker i is the conversational partner, and the remaining I − 1 speakers are competing speakers,

for i = 1, ..., I.
For each time n, we observe realizations φ_k of Φ_k for all k = 1, ..., I. Assuming that the Φ_k are statistically independent, the likelihood function conditioned on H_i is given by

p(φ_1, ..., φ_I | H_i) = p_{Φ_i}(φ_i; γ_t, β_t, N) ∏_{j ∈ I\i} p_{Φ_j}(φ_j; γ_v, β_v, N),   (11)

where I = {1, ..., I} is the set of candidate speaker indices and I\i denotes the set of competing speakers under hypothesis H_i, i.e., I excluding the element i. Using Bayes' theorem, the posterior probability of H_i is given by

P(H_i | φ_1, ..., φ_I) = p(φ_1, ..., φ_I | H_i) P(H_i) / Σ_{k ∈ I} p(φ_1, ..., φ_I | H_k) P(H_k),   (12)

where P(H_i) is the prior probability of the conversational partner being channel i. This method of estimating the posterior probability is referred to as the Bayesian MOG algorithm.

IV. PARAMETER ESTIMATION FROM CONVERSATIONAL SPEECH DATABASE
To implement the Bayesian MOG (BMOG) algorithm in (12), the shaping parameters γ t , β t , γ v , and β v for the statistical models p Φ (φ; γ t , β t , N ) and p Φ (φ; γ v , β v , N ) are estimated from speech databases containing real conversations. Next, using the estimated statistical models, we analyze the theoretical speaker ranking performance of the MOG algorithm in terms of misclassification rate.

A. SETUP AND SPEECH DATABASE 1) Conversational speech database
In order to estimate the shaping parameters γ_t, β_t, γ_v, and β_v, we use the speech database in [27], which contains dialogues between 19 pairs of native Danish talkers recorded during a task-dialogue experiment. The participants had normal hearing and were coupled into pairs to collaborate on solving DiapixUK tasks [28]. DiapixUK is a spot-the-difference task, in which partners are given two almost identical cartoon pictures with a few differences. The participants were not allowed to view each other's pictures, but had to solve the DiapixUK task by exchanging descriptions of their pictures through verbal communication. The partners were placed in different sound booths and communicated through headphones and head-worn microphones. The experiment had four test conditions: 1) native language (Danish) and no noise, 2) native language (Danish) and babble noise, 3) second language (English) and no noise, and 4) second language (English) and babble noise.

2) Voice activity detection
The presence of speech in the signal s_i(n) is determined by a binary VAD which produces an output sequence α_i(n) ∈ {0, 1} for each of the speakers in the dialogue. For voice activity detection, we used the robust voice activity detector (rVAD) proposed in [29], applied to the essentially noise-free dialogue recordings. The input to rVAD is L consecutive samples of s_i(l) with sampling frequency f_s. The output of rVAD is a sequence of N voice activity decisions α_i(n) at sampling frequency f_s,vad = 100 Hz. Version rVAD2.0 was used in this paper and can be found in [30].
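For intuition only, a deliberately simple energy-threshold VAD illustrates the frame-to-decision mapping. This toy detector is a stand-in, not rVAD; the frame length and threshold below are arbitrary choices, not rVAD's parameters:

```python
def energy_vad(samples, frame_len=160, threshold=1e-3):
    """Toy energy-threshold VAD: one 0/1 decision per non-overlapping frame.
    Illustrative stand-in for a real detector such as rVAD; frame_len and
    threshold are hypothetical values chosen for this sketch."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len  # mean power of the frame
        decisions.append(1 if energy > threshold else 0)
    return decisions
```

With a 16 kHz signal, frame_len = 160 corresponds to 10 ms frames, i.e., a 100 Hz decision rate matching f_s,vad.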

B. PARAMETER ESTIMATION FOR THE BETA-BINOMIAL DISTRIBUTION
We used the speech data recorded in the quiet condition and in the Danish language for parameter estimation. The speech signals were sampled at 22.05 kHz but downsampled to 16 kHz for compatibility with rVAD. In order to collect observations of the SSEs for a user and a conversational partner, we used the following procedure:
1) Select an integration time T_int, e.g., T_int = 10 seconds, where the integration time is related to N by N = T_int · f_s,vad.
2) Divide the speech signals into non-overlapping segments of length T_int.
3) Apply rVAD to the speech signals of the conversational partners.
4) Compute the SSE from the VAD outputs using (6).
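The segmentation steps above can be sketched as follows, assuming the VAD sequences have already been computed (illustrative sketch, not the paper's code):

```python
def collect_sse_observations(vad_a, vad_b, T_int=10.0, fs_vad=100):
    """Steps 1-4 above: split two precomputed binary VAD sequences into
    non-overlapping segments of duration T_int and compute one SSE
    observation per segment (N = T_int * fs_vad decisions per segment)."""
    N = int(T_int * fs_vad)
    n_seg = min(len(vad_a), len(vad_b)) // N
    return [
        sum((vad_a[k] - vad_b[k]) ** 2 for k in range(s * N, (s + 1) * N))
        for s in range(n_seg)
    ]
```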
To gather observations of the SSE between the user and a competing speaker, we perform a similar procedure, but instead of choosing a matching conversational pair, we randomly choose two non-conversational speakers to form a pair and compute the SSE. Histograms and fitted beta-binomial distributions of SSEs between a user and a conversational partner, as well as between a user and a competing speaker, are shown in Fig. 3 for different integration times. Clearly, and as expected, the separability between p_Φ(φ; γ_t, β_t, N) and p_Φ(φ; γ_v, β_v, N) becomes greater as T_int becomes larger. The dispersion of the SSE becomes smaller for both distributions as T_int increases. The shaping parameters γ_t, β_t, γ_v, and β_v are functions of T_int.
1) Parameter estimation of γ_t, β_t, γ_v, and β_v given T_int
For each T_int, the parameters γ_t, β_t, γ_v, and β_v are estimated using observations of the SSEs. The observations of SSEs are denoted as φ_t^(k) and φ_v^(k), k = 1, ..., K, respectively, where the subscript t denotes the SSE between the user and the conversational partner, v denotes the SSE between the user and a competing speaker, and K is the total number of observations. The observations are assumed independent. The parameters are found numerically using maximum likelihood (ML) estimation such that

(γ̂_t, β̂_t) = arg max_{γ,β} Σ_{k=1}^{K} ln p_Φ(φ_t^(k); γ, β, N),   (13)

and similarly for (γ̂_v, β̂_v) using φ_v^(k). In order to provide simple models of γ_t, β_t, γ_v, and β_v as functions of T_int, scatter plots of the estimated shaping parameters for different T_int are shown in Fig. 4. We choose to describe the shaping parameters using a power model. Let h(T_int; a, b) be the general form of a power model with parameters a and b:

h(T_int; a, b) = a · T_int^b.

This model can be useful for implementing the BMOG algorithm for any T_int, and it facilitates the theoretical performance evaluation of the MOG algorithm in Sec. IV-C. To estimate the parameters a and b of the power model, we use a non-linear least squares procedure of the general form

(â, b̂) = arg min_{a,b} Σ_{j=1}^{J} ( ĥ(T_int^(j)) − h(T_int^(j); a, b) )^2,   (14)

where ĥ(T_int^(j)) is an ML-estimated shaping parameter, i.e., either γ̂_t, β̂_t, γ̂_v, or β̂_v at integration time T_int^(j), and J is the total number of data points for each ML-estimated shaping parameter. We minimize (14) numerically. The estimated power model parameters are summarized in Table 1. Fig. 4 shows that the fitted power models γ_t(·; â, b̂), β_t(·; â, b̂), γ_v(·; â, b̂), and β_v(·; â, b̂) provide an excellent fit to the ML-estimated shaping parameters as a function of T_int.
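A power model h(T) = a·T^b can also be fitted by ordinary least squares in the log-log domain, since ln h = ln a + b ln T is linear. This is a simpler alternative to minimizing the nonlinear criterion in (14) and generally gives slightly different estimates; a minimal sketch:

```python
from math import log, exp

def fit_power_model(t_vals, h_vals):
    """Fit h(T) = a * T**b by ordinary least squares on ln h = ln a + b ln T.
    A simple alternative to the nonlinear least squares fit; estimates differ
    slightly because the error is minimized in the log domain."""
    xs = [log(t) for t in t_vals]
    ys = [log(h) for h in h_vals]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance / variance, intercept recovered from the means
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = exp(my - b * mx)
    return a, b
```

For noise-free power-law data the fit is exact, which makes the sketch easy to verify.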

C. THEORETICAL PERFORMANCE OF THE MOG ALGORITHM
In this section, we analyze the theoretical performance of the MOG algorithm and compare it with the performance achieved through simulations. Two quantities have a significant impact on the performance of the MOG algorithm: the number of candidate speakers, I, and the integration time, T_int. Increasing the number of candidate speakers increases the solution search space and hence the a priori risk of choosing a wrong candidate as the target speaker. Decreasing the integration time T_int leads to higher variance in the estimation of the SSEs in (6). The misclassification rate is used to measure the performance of the MOG algorithm and is defined as the probability of classifying a competing speaker as the conversational partner; we denote it by P_e(I, T_int). To derive an expression for the misclassification rate, we define

P_c = P(Φ_t > Φ_v,1, ..., Φ_t > Φ_v,I−1)

as the probability of correct classification, where Φ_t denotes the SSE between the user and the conversational partner, and Φ_v,j is the SSE between the user and the j-th competing speaker. The misclassification rate is then given by P_e(I, T_int) = 1 − P_c. In Appendix C, we show that the misclassification rate of the MOG algorithm can be expressed as

P_e(I, T_int) = 1 − Σ_{φ=0}^{N} p_{Φ_t}(φ; γ_t, β_t, N) [F_{Φ_v}(φ − 1; γ_v, β_v, N)]^{I−1},   (15)

where

F_{Φ_v}(φ; γ_v, β_v, N) = Σ_{z=0}^{φ} p_{Φ_v}(z; γ_v, β_v, N)   (16)

is the cumulative distribution function of the SSE between the user and a competing speaker. For verification, we compare the theoretical misclassification rate given by (15) with the misclassification rate achieved with the MOG algorithm in simulations, as seen in Fig. 5. The theoretical misclassification rate is shown in Fig. 5a, and the simulated MOG performance using the datasets L1/no noise and L2/babble noise is shown in Figs. 5b and 5c. From Fig. 5b, we clearly see a close match between the theoretical and simulated misclassification rates when the conversational partners speak Danish without any noise stimuli. Likewise, a close match between the theoretical and simulated misclassification rates can be seen in Fig. 5c, where the conversational partners speak English (their second language) with babble noise as the noise stimulus. The close match indicates that the fitted statistical models are able to generalize to unseen conditions.
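The misclassification rate can also be estimated by Monte Carlo simulation, drawing SSEs from the fitted beta-binomial models and counting how often a competitor's SSE reaches the target's. This is an illustrative check with hypothetical parameter values; ties are counted as errors here, which is one possible convention:

```python
import random

def sample_betabinom(gamma, beta, N, rng):
    """Draw from a beta-binomial: p ~ Beta(gamma, beta), then Binomial(N, p)."""
    p = rng.betavariate(gamma, beta)
    return sum(1 for _ in range(N) if rng.random() < p)

def misclassification_rate(gamma_t, beta_t, gamma_v, beta_v, N, I,
                           trials=2000, seed=0):
    """Monte Carlo estimate of the MOG misclassification rate: the target
    is missed whenever some competitor's SSE is at least as large as the
    target's SSE (ties counted as errors)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        phi_t = sample_betabinom(gamma_t, beta_t, N, rng)
        phi_v = [sample_betabinom(gamma_v, beta_v, N, rng) for _ in range(I - 1)]
        if max(phi_v) >= phi_t:
            errors += 1
    return errors / trials
```

Well-separated target and competitor distributions should drive the estimated rate towards zero, while identical distributions with I = 3 should give an error rate of roughly two thirds.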

V. EVALUATION IN SPEECH ENHANCEMENT APPLICATIONS
In this section, we demonstrate the use of MOG and BMOG for solving the problem of enhancing a conversational partner in a multi-talker environment, using the speech enhancement paradigm of Fig. 1. In particular, we use MOG/BMOG to rank the candidate speakers according to how likely they are to be the conversational partner. In Secs. V-A and V-B, we outline the practical implementation of the MOG and BMOG algorithms, and in Sec. V-C, we present the reference/baseline speaker ranking methods that will be used in our experiments. In Secs. V-D and V-E, we demonstrate the use of the proposed speech enhancement systems in two different applications for HADs. Fig. 6 shows an example of the speech enhancement paradigm of Fig. 1 employing multiple microphones. In many situations, the microphone signals consist of a mixture of speech signals (including the target and potential competing speakers) and noise from the environment. The unprocessed microphone signals are denoted as x_m(n) for m = 1, ..., M, where M is the number of microphones and n is the discrete-time index. Let x(n) = [x_1(n), ..., x_M(n)]^T be the noisy microphone signals stacked in a vector, which is processed by a speech separation system. The speech separation system separates the microphone signals into estimated speech signals ŝ(n) = [ŝ_0(n), ŝ_1(n), ..., ŝ_I(n)]^T. Next, voice activity detection is applied to each of the separated signals ŝ_i(n), i = 1, ..., I. A speaker ranking algorithm, e.g., MOG or BMOG, then assigns a ranking score to each candidate speaker. Finally, in the example system in Fig. 6, the enhancement of the conversational partner is achieved simply as a linear combination of the separated speech signals ŝ(n). The weights are found using a gain function which maps the ranking score to a gain value for each separated speech signal.
A. SPEECH ENHANCEMENT SYSTEM USING SPEAKER RANKING
A straightforward gain function for the MOG algorithm assigns a gain of 1 to the estimated conversational partner channel and a value 0 < g_min < 1 to the remaining channels, i.e.,

g_i(n) = { 1, if i = î; g_min, otherwise },   (17)

where î is the estimated channel of the conversational partner. It might occur that a competing speaker is estimated as being the conversational partner, which can lead to a severe loss in speech enhancement performance. It can also disrupt an ongoing conversation between a user and a conversational partner if the speaker ranking algorithm suddenly changes the estimated conversational partner. To increase robustness, the minimum gain g_min ensures that a small amount of speech from all candidate speakers is always let through. Likewise, g_min can be made a function of n, such that g_min = 1 in the initial phase of a conversation and gradually decreases towards a minimum value once the conversation has been established. Another approach, specifically for the BMOG algorithm, is to use the estimated posterior probabilities as weights for the linear combination, such that

g_i(n) = max(g_min, P(H_i | φ_1, ..., φ_I)).   (18)

A potential advantage of the posterior probability as a gain function is similar to that of introducing g_min > 0 in (17): it reduces perceptual switching artifacts and limits the effect of target loss in case of misclassification. For both approaches, the estimated conversational partner signal is

ŝ_t(n) = Σ_{i=1}^{I} g_i(n) ŝ_i(n).   (19)
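The hard-assignment gain in (17) and the subsequent linear combination can be sketched as follows (illustrative, with the separated signals represented as plain lists):

```python
def mog_gains(scores, g_min=0.1):
    """Gain function in the spirit of Eq. (17): gain 1 for the top-ranked
    channel, g_min for all others. scores: {channel index: ranking score}.
    g_min = 0.1 is a hypothetical value for this sketch."""
    i_hat = max(scores, key=scores.get)
    return {i: (1.0 if i == i_hat else g_min) for i in scores}

def enhance(separated, gains):
    """Linear combination of separated signals (equal-length lists),
    weighted per channel by the gains."""
    n = len(next(iter(separated.values())))
    return [sum(gains[i] * separated[i][k] for i in separated) for k in range(n)]
```

The g_min floor keeps a small amount of every candidate audible, which limits the damage of a misclassification.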

B. IMPLEMENTATION OF THE MOG AND BMOG ALGORITHMS
In order to implement the MOG algorithm in (4), we estimate the MSE as the average square-error between α_0(n) and α_i(n) over the integration time T_int. The MOG estimate of the conversational partner index then becomes

    î = arg max_{i ∈ {1,...,I}} Σ_{k=n−N+1}^{n} (α_0(k) − α_i(k))²,

where N is the number of VAD samples within T_int. Implementation of the BMOG algorithm is a two-step procedure. First, the shaping parameters of the beta-binomial distributions are computed for the given T_int using (13) and TABLE 1, which may be done offline. Secondly, the posterior probabilities P(H_i | φ_1, ..., φ_I) are computed. To do so, the likelihood function in (11) is computed in the logarithmic domain for numerical stability. For this purpose, we first define an auxiliary variable ψ_i and take its natural logarithm (21); substituting (21) into (12) and taking the logarithm on both sides yields (25). The posterior probability is then found by inserting (21), (22), and (25) into (24) and applying the exponential function exp(·) to (24). The implementation of the BMOG algorithm is summarized in Algorithm 1.
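The MOG ranking step itself reduces to a few lines: given binary VAD sequences over the integration window, pick the candidate that maximizes the sum of squared differences to the own-voice VAD. The sketch below assumes equal-length binary arrays; `mog_rank` is an illustrative name, not the paper's:

```python
import numpy as np

def mog_rank(alpha_0, alphas):
    """Rank candidate speakers by the MOG criterion.

    alpha_0 : binary own-voice VAD sequence over the integration window.
    alphas  : list of binary VAD sequences, one per candidate speaker.
    Returns the 0-based index of the candidate maximizing the sum of
    squared VAD differences, i.e., the talker with minimum probability
    of speech overlap and gap with the user's own voice.
    """
    a0 = np.asarray(alpha_0, dtype=float)
    scores = [np.sum((a0 - np.asarray(a, dtype=float)) ** 2) for a in alphas]
    return int(np.argmax(scores))
```

A conversational partner's VAD is roughly complementary to the user's, so the squared difference, and hence the score, is largest for the partner.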

C. STATE-OF-THE-ART METHODS FOR SPEAKER RANKING
The idea of using turn-taking to detect conversations between two speakers has been explored in [31]- [33] but was not used in the context of enhancing a conversational partner of a user as presented in Fig. 1. In [31], the presence of a conversation between two speakers was quantified using mutual information between the user's and candidate speakers' voice activity sequences. The normalized cross-correlation function was later proposed as a quantifier of conversations in [32]. Both methods can be compared to the MOG/BMOG algorithms in a fair manner, since all methods require access to VAD sequences for each speaker and they return a cost that can be used for ranking the candidate speakers.

1) Maximum mutual information [31]
The mutual information method is based on finding the candidate speaker that maximizes the mutual information between the user's and candidate speaker's voice activity sequences:

    î_MMI = arg max_{i ∈ {1,...,I}} Σ_{k=0}^{1} Σ_{j=0}^{1} P(α_0 = k, α_i = j) log [ P(α_0 = k, α_i = j) / ( P(α_0 = k) P(α_i = j) ) ],

where all joint and marginal probabilities are sample estimates obtained from α_i(n) over the integration time T_int. One problem with the MMI algorithm arises in situations where the numerator or denominator inside the logarithm becomes zero. These situations might occur if the integration time is short, e.g., 2 seconds, as there is a risk that the user or candidate speaker i is entirely silent within that period of time.
In the evaluations, we removed results where the numerator or denominator of the MMI algorithm becomes zero.
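The MMI baseline can be sketched as follows. Note one deviation from the evaluation protocol above: instead of discarding windows where a zero occurs inside the logarithm, this sketch uses the standard convention 0·log 0 = 0 by skipping empty joint cells:

```python
import numpy as np

def mmi_rank(alpha_0, alphas):
    """Rank candidates by mutual information between binary VAD sequences.

    Terms with zero joint probability are skipped (0*log 0 := 0); the
    paper's evaluation instead removes such results entirely.
    """
    a0 = np.asarray(alpha_0, dtype=int)
    scores = []
    for a in alphas:
        a = np.asarray(a, dtype=int)
        mi = 0.0
        for k in (0, 1):
            for j in (0, 1):
                p_kj = np.mean((a0 == k) & (a == j))  # joint sample estimate
                if p_kj > 0:
                    p_k = np.mean(a0 == k)  # marginal of own-voice VAD
                    p_j = np.mean(a == j)   # marginal of candidate VAD
                    mi += p_kj * np.log(p_kj / (p_k * p_j))
        scores.append(mi)
    return int(np.argmax(scores))
```

A perfectly turn-taking partner (VAD complementary to the user's) attains the maximum mutual information of log 2 nats, while an independent competing speaker scores near zero.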

2) Normalized cross-correlation [32]
Similarly, the normalized cross-correlation (NCC) method is here used to detect the presence of a conversational partner. The NCC optimization problem is formulated in terms of R_{0,i}(p), the normalized cross-correlation between A_0 and A_i at lag p, where r_1 and r_2 are the search region bounds for the lag p. We set p equal to zero in our evaluation.
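A zero-lag NCC sketch is given below. Since the text does not state the selection rule's sign convention, we assume the partner is identified by the most negative zero-lag correlation (conversational partners avoid overlap, so their VAD sequences anti-correlate); that selection rule is our assumption:

```python
import numpy as np

def ncc_zero_lag(a0, ai):
    """Normalized cross-correlation between two binary VAD sequences at lag 0."""
    a0 = np.asarray(a0, dtype=float)
    ai = np.asarray(ai, dtype=float)
    a0c, aic = a0 - a0.mean(), ai - ai.mean()
    denom = np.sqrt(np.sum(a0c ** 2) * np.sum(aic ** 2))
    return float(np.sum(a0c * aic) / denom) if denom > 0 else 0.0

def ncc_rank(alpha_0, alphas):
    # Most negative correlation = strongest turn-taking complementarity
    # (assumed selection rule; see lead-in).
    return int(np.argmin([ncc_zero_lag(alpha_0, a) for a in alphas]))
```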

3) Speaker ranking performance
We compare the speaker ranking performance of the proposed MOG algorithm against MMI and NCC. The performance is reported in terms of misclassification rate as a function of the number of competing speakers N_c and integration time T_int. We use speech signals from [27] for the performance evaluation. Specifically, we use the subset of the data set containing 2-person conversations in second-language English (L2) in babble noise. The speech signals are segmented into segments of length T_int. For each T_int, one 2-person conversation is randomly selected to constitute the user's own voice and the user's conversational partner. A number N_c of arbitrarily chosen speakers from the data set are selected to constitute the competing speakers. Fig. 7 shows the misclassification rate P(E = 1; I, T_int) as a function of T_int and the number of competing speakers N_c = I − 1 for each ranking algorithm. A comparison between MOG, MMI, and NCC shows that the misclassification rate is significantly lower for the MOG algorithm than for MMI and NCC, particularly when 1) the integration time is short, and/or 2) there is a large number of competing speakers. At long integration times, e.g., 40 s, the difference between the algorithms is smaller. However, the MOG algorithm consistently performs better than the MMI and NCC algorithms.

D. APPLICATION 1: WIRELESS HEARING AID NETWORK
In this section, we demonstrate the use of the proposed (B)MOG-based speech enhancement paradigm, cf. Fig. 1, in a hearing aid (HA) application in which the HAs of several users are wirelessly connected. The basic idea is that multiple HA users can distribute their own voice signals to the other users' HAs through a wireless network. This can be useful, e.g., in acoustically challenging social gatherings with multiple HA users. The proposed speech enhancement paradigm can in this situation assist the HA user by first ranking and then enhancing the estimated conversational partner amongst the users. The signal model of the sound picked up by the user's HA is described in terms of s_0(n), the HA user's own voice signal as picked up by the user's microphone, and s_i(n), i = 1, ..., I, the clean speech signals picked up by the microphones located at the candidate speakers.

1) Simulation Setup
We reuse the speech database presented in Sec. V-C3 for the candidate speakers and own voice signals. We use the data set with conversations in second language and babble noise, which was not used for estimation of the shaping parameters. Two conversational partners are randomly chosen from the data set, where one is randomly chosen as the HA user and the other as the conversational partner for each signal realization. The competing speakers are chosen from the same data set, but are not conversing with the HA user. The HA user's conversational partner is unknown to the speech enhancement systems. We use rVAD 2.0 [29,30] for voice activity detection, and the sampling frequency of the VAD output is f_s,vad = 100 Hz. The integration time needed for the speaker ranking algorithms is implemented as sliding windows of length T_int with a hop size of 1 sample at sampling frequency f_s,vad. The speech enhancement systems used in the evaluation are referred to as:
• No processing: The speech enhancement system does not apply any speaker ranking algorithm and simply outputs the sum of all candidate speakers.
• MMI, NCC, and MOG: The MMI, NCC, and the proposed MOG algorithms are used for speaker ranking. The gain function is implemented as in (17).
• BMOG: Posterior probabilities of the conversational partner are estimated using BMOG and used as a gain function for enhancement, cf. (19). The prior probability distribution P(H_i) in (12) is set uniform.

2) Results: Wireless hearing aid network
We evaluate the speech enhancement performance by comparing the enhanced conversational partner with the clean speech signal of the conversational partner in terms of ESTOI [34,35], PESQ [36], and segmental SNR [37]. We evaluate the speech enhancement systems for T_int = {5, 10, 20, 30} s and numbers of competing speakers N_c = {1, 2, ..., 10}. The minimum gain for MMI, NCC, and MOG is set to g_min = 0.01. A minimum gain g_min > 0 is necessary for the MMI, NCC, and MOG enhancement systems to avoid rare situations with a complete suppression of the conversational partner. These situations typically arise at low T_int and result in undefined PESQ and segmental SNR scores. The minimum gain for BMOG was set to g_min = 0 as it did not experience similar problems. The results are shown in Fig. 8, and each score is averaged over 100 realizations of conversations. Generally, we see a significant improvement in terms of both ESTOI and PESQ when using MOG and BMOG compared to NCC and MMI. The improvement is particularly notable at low integration times such as T_int = 5 s and T_int = 10 s. At higher integration times, the improvements become less prominent, with the exception of NCC, which performs the worst. We note that BMOG performs much better than MOG in terms of PESQ at T_int = 30 s. This is due to the minimum gain, which is set to 0.01 for MOG but 0 for BMOG. From our experiments, we have observed that setting the minimum gain above 0 can help NCC, MMI, and MOG perform better on average at low integration times, e.g., T_int = 5 s. However, the trade-off is slightly degraded performance at high integration times, as shown in the results.
From these results, it is clear that speech enhancement systems that use MOG and BMOG generally outperform the NCC and MMI methods for this particular application.

E. APPLICATION 2: BEAMFORMING SYSTEM IN HEARING AIDS
In this section, we demonstrate the use of the proposed speech enhancement paradigm in another hearing aid application. Modern hearing aids are equipped with multiple microphones, which allows for the implementation of acoustic beamformers to enhance the speech signal of a conversational partner of a HA user. However, retrieving the speech signal can be particularly difficult in situations with multiple competing speakers, because it is hard to decide who is the conversational partner. Hence, in this application the proposed (B)MOG speech enhancement paradigm is used to efficiently retrieve the speech signal of the conversational partner amongst several competing speakers. First, we model the received signal at the microphones of the HAs. The user's and candidate speakers' speech signals propagate to the microphones and are simulated using acoustic impulse responses (AIRs). The AIR from the i'th speaker to the m'th microphone is denoted as h_i,m(n), where i = 0, 1, ..., I is the speaker index and m = 1, ..., M is the microphone index; the index value i = 0 denotes the user. The AIRs can be decomposed as h_i,m(n) = h_i,m′(n) * d_i,m(n), where * denotes the convolution operator, h_i,m′(n) is the AIR from the i'th speaker to a pre-selected reference microphone m′ ∈ {1, ..., M}, and d_i,m(n) is the impulse response from the reference microphone to the m'th microphone, also referred to as the relative impulse response. Let s_i(n) be the received signal of the i'th speaker at the reference microphone m′, i.e., s_i(n) = s̃_i(n) * h_i,m′(n), where s̃_i(n) is the clean source signal of the i'th speaker. Then the received signal of the i'th speaker at the m'th microphone is s_i,m(n) = s_i(n) * d_i,m(n). Let v_m(n) denote the noise (e.g., ambient noise and microphone self-noise) as received at the m'th microphone. The noisy signal at the m'th microphone is then modeled as

    x_m(n) = Σ_{i=0}^{I} s_i,m(n) + v_m(n).

1) Speech separation using beamformers
The received microphone signal, x_m(n), is a mixture of the clean user and candidate speaker signals received at microphone m, s_i,m(n), plus noise v_m(n). Following the speech enhancement paradigm in Fig. 6, the microphone signals are first separated into user and candidate speaker signals before applying speaker ranking. We use the minimum power distortionless response (MPDR) beamformer to separate the speech signals. The MPDR beamformers are implemented in the time-frequency domain using the short-time Fourier transform (STFT) and are computed for each time-frequency tile as [38]

    W_i(k, l) = C_x^{−1}(k, l) D_i(k) / ( D_i^H(k) C_x^{−1}(k, l) D_i(k) ),

where k and l denote the frequency and frame indices, respectively, C_x(k, l) = E{X(k, l) X^H(k, l)} is the cross power spectral density (CPSD) matrix of the noisy microphone signals, and D_i(k) = [D_i,1(k), ..., D_i,M(k)]^T, i = 0, 1, ..., I, k = 0, 1, ..., K, denotes the relative acoustic transfer function (RATF) vector for the i'th speaker and k'th frequency bin [39,40]. The m'th element of the RATF vector is the frequency-domain representation of d_i,m(n). Unfortunately, the number of candidate speakers and their RATF vectors are seldom known in practice. Instead, we use I to denote the number of MPDR beamformers steered towards a set of I unique and fixed directions in the acoustic environment. In other words, the spatial filter bank is implemented using a dictionary of RATF vectors {D_1(k), ..., D_I(k)}, k = 0, 1, ..., K, where we assume that the dictionary is given in advance. Assuming that each beam contains at most one candidate speaker (i.e., that candidate sources are sufficiently spatially separated), each beamformer output, ŝ_i(n), is treated as a candidate speaker signal. The output of each beamformer is

    Ŝ_i(k, l) = W_i^H(k, l) X(k, l),

where Ŝ_i(k, l) is the enhanced signal from direction i and is treated as a speech signal from a candidate speaker. The beamformer outputs Ŝ_i(k, l) are transformed back to the time domain using the inverse STFT to obtain ŝ_i(n).
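The per-bin MPDR weight computation can be sketched as below. The diagonal loading term is our addition for numerical robustness of the matrix inversion; it is not discussed in the text:

```python
import numpy as np

def mpdr_weights(C_x, d_i, loading=1e-6):
    """MPDR beamformer weights for one time-frequency tile.

    C_x : (M, M) estimate of the noisy-signal CPSD matrix.
    d_i : (M,) RATF (relative acoustic transfer function) for direction i.
    loading : diagonal loading factor regularizing the inversion
              (implementation detail added here, not from the text).
    """
    M = C_x.shape[0]
    C_reg = C_x + loading * (np.trace(C_x).real / M) * np.eye(M)
    num = np.linalg.inv(C_reg) @ d_i          # C_x^{-1} D_i
    return num / (np.conj(d_i) @ num)         # normalize: distortionless in d_i

# Beamformer output for one tile: S_hat = np.conj(w) @ X, with X the
# stacked microphone STFT coefficients for that tile.
```

By construction the weights satisfy the distortionless constraint w^H d_i = 1, so the target direction passes unattenuated while the output power is minimized.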
The remaining part of the speech enhancement system, i.e., ranking and enhancement, follows the same procedure as in application 1 in Sec. V-D.

2) Simulation of the acoustic scene
We simulate the acoustic scenes to resemble a cocktail-party-like scenario with a HA user engaged in a conversation with a conversational partner. Such a situation involves the presence of speech signals from the HA user, the conversational partner, and competing speakers, and the presence of noise from the environment. To simulate the received signals at the microphones, we use a database of AIRs measured in a sound studio where room reverberation has been removed [41]. The measurement setup consists of a spherical loudspeaker array with a HA user seated in the center of the array. The HA user is wearing a behind-the-ear (BTE) hearing aid on each ear. Each BTE hearing aid has three microphones, where two are placed in a front/rear configuration on the HA and the third is placed in the ear canal. The microphones are used in a binaural HA configuration where we assume wireless, simultaneous, and error-free signal exchange between the left and right HAs. Hence, beamformers are implemented using the total of M = 6 microphones. The AIRs are measured from uniformly spaced positions in the horizontal plane with respect to the head of the HA user, with a resolution of 7.5°, resulting in AIRs for 48 different angles. We define 0° as the frontal direction from the user's point of view. The own voice AIRs are measured using a mouth reference microphone placed in front of the HA user's mouth.
We use the conversational speech database in [27], as in Application 1, as speech material in our simulation. Realistic noise measured in a canteen is used in our simulation. The noise is measured using a spherical microphone array to accurately capture the noise field [42]. The noise recordings are transformed and convolved with the AIRs to reproduce the same noise field as would have been experienced by a HA user in the canteen.
Competing speakers are added to the acoustic scenes. The speech material for the competing speakers is from the same speech database as in Sec. V-D [27]. The speech of the competing speakers is unrelated to the conversation between the user and the conversational partner. We experiment with N_c = 3 and N_c = 5 competing speakers in our evaluation. Increasing the number of competing speakers much beyond N_c = 5 results in poorer speech separation, as the beamformers cannot sufficiently suppress the speakers from other directions. The purpose here is mainly to demonstrate the feasibility of using (B)MOG ranking in a beamforming context; for a larger number of competing speakers, other better-performing speech separation systems could be used, e.g., (Conv-)TasNet [12,13] or Wavesplit [43].
The positions of the speakers are fixed for the whole duration of a realization of an acoustic scene. We do not simulate head movements of the HA user, but these movements can, in practice, be compensated using other sensors, e.g., accelerometers. To simulate the received signals of the speech sources at the microphones, we convolve the speech signals with the AIRs associated with their directions. The speech power of each competing speaker is approximately identical to the speech power of the conversational partner before convolving with the AIRs. Canteen noise is added to the acoustic scenes, and the SNR is defined as the ratio between the clean speech power of the conversational partner at the source location and the power of the background noise. The SNR is set to 12 dB.
To implement the OVAD/VAD blocks in Fig. 6, rVAD 2.0 [30] is used for voice activity detection on the separated speech signals ŝ_i(n) and the own voice signal ŝ_0(n).
The sampling frequency of the received microphone signals is set to 16 kHz. We use a square-root Hann window with a window size of 256 samples for the STFT and inverse STFT. The hop size is 128 samples. The noisy CPSD matrix is estimated over the most recent L frames as

    Ĉ_x(k, l) = (1/L) X̄(k, l) X̄^H(k, l), where X̄(k, l) = [X(k, l − L + 1), ..., X(k, l)].
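The sample CPSD estimate over a sliding block of STFT frames can be sketched as follows (a minimal illustration; the array layout is our choice):

```python
import numpy as np

def estimate_cpsd(X_frames):
    """Sample CPSD estimate from the L most recent STFT frames of one bin.

    X_frames : (M, L) complex array whose columns are the microphone STFT
               coefficient vectors X(k, l-L+1), ..., X(k, l).
    Returns the (M, M) estimate (1/L) * X_frames @ X_frames^H.
    """
    M, L = X_frames.shape
    return (X_frames @ X_frames.conj().T) / L
```

Longer blocks L give a smoother, better-conditioned estimate at the cost of slower adaptation to changes in the acoustic scene.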
The beamforming system is summarized in Algorithm 2 in pseudo-code.

3) Evaluation of the speech enhancement paradigm in beamforming systems
We evaluate the performance in terms of 1) speaker ranking performance in Sec. V-E4 and 2) speech enhancement performance in Sec. V-E5. First, the speaker ranking in this application is closely related to direction-of-arrival (DOA) estimation. DOA estimation often arises in beamforming applications where the goal is to estimate the direction of the talker-of-interest in order to steer a beamformer. In our context, DOA estimation is related to estimating the channel of the conversational partner. Hence, the MOG algorithm is in fact a DOA estimator in this context. Secondly, the speech enhancement performance will quantify the potential benefit of using the proposed speech enhancement paradigm in a beamforming context for HAs. The reported performance scores are averaged from simulations of 40 realizations of the acoustic scenes for the results in Sec. V-E4 and Sec. V-E5.
To evaluate the speaker ranking performance, we evaluate the DOA accuracy and the mean-absolute-error (MAE) between the estimated DOA θ̂_n and the true DOA θ_n of the conversational partner. The DOA accuracy is the probability of estimating the correct DOA of the conversational partner, and the MAE is estimated as the average absolute (circular) error

    MAE = (1/N) Σ_n | arg( e^{j(θ_n − θ̂_n)} ) |,

where θ_n and θ̂_n are in radians, j = √−1, and arg(·) is the argument of a complex number. The MAE is averaged over the realizations of the acoustic scenes. The speech enhancement performance is reported in terms of ESTOI, PESQ, and segmental SNR scores to estimate the speech intelligibility, speech quality, and noise suppression performance of the proposed speech enhancement paradigm, respectively. The ESTOI, PESQ, and segmental SNR scores are computed using the output of the enhancement system ŝ_t(n) and the clean conversational partner speech signal received at the reference microphone, s_t,m′(n).
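Wrapping the angular error through the complex argument keeps each error in (−π, π], so a 359° estimate of a 1° target counts as a 2° error rather than 358°. A minimal sketch of this circular MAE:

```python
import numpy as np

def circular_mae(theta_true, theta_est):
    """Mean absolute circular error between DOA sequences, in radians.

    The error is wrapped via the argument of exp(j*(theta - theta_hat)),
    so each term lies in (-pi, pi].
    """
    theta_true = np.asarray(theta_true, dtype=float)
    theta_est = np.asarray(theta_est, dtype=float)
    return float(np.mean(np.abs(np.angle(np.exp(1j * (theta_true - theta_est))))))
```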
Our evaluation includes four beamforming systems which are based on the speech enhancement paradigm in Fig. 6. All systems use the same spatial filter bank of MPDR beamformers for speech separation, and we use rVAD 2.0 for voice activity detection in all systems. We refer to the beamforming systems as:
• SE-Oracle: A reference system indicating the upper-bound performance when the direction of the conversational partner is known in advance.
• SE-SRP-PHAT: Uses the well-known SRP-PHAT algorithm [6] to estimate the DOA of the conversational partner. In contrast to the speaker ranking algorithms NCC, MMI, and MOG, the SRP-PHAT algorithm does not relate turn-taking of the candidate speakers to conversations but instead searches for the most dominant speaker. The output at time n of SE-SRP-PHAT is ŝ_t(n) = ŝ_{î_SRP-PHAT(n)}(n), where î_SRP-PHAT(n) is the DOA estimate of the conversational partner at time n.
• SE-BMOG: Uses the BMOG algorithm to compute a posterior probability distribution of the direction of the conversational partner. The output of SE-BMOG at time n is a linear combination of the separated candidate speakers using the posterior probabilities as weights, i.e., ŝ_t(n) = Σ_{i=1}^{I} P(H_i | φ_1, ..., φ_I) ŝ_i(n). The prior probability distribution for BMOG was set to a uniform distribution.
We did not include a beamforming system with an NCC-based speaker ranking algorithm, as it performed significantly worse than the other algorithms in preliminary experiments.

4) Results: DOA estimation performance in beamforming systems
This section focuses on the speaker ranking/DOA performance rather than the speech enhancement performance of the complete beamforming system, which is treated in Sec. V-E5. BMOG is not included here, since the output of BMOG is a probability distribution and not a point estimate of the conversational partner, as produced by the MOG algorithm.
The results for DOA estimation performance in terms of DOA accuracy and MAE are shown in Tables 2 and 3, respectively. Each score in the tables is an average over 40 realizations of the acoustic scenes.
The MOG algorithm consistently outperforms the MMI algorithm by approximately 15 percentage points. Similarly, the MAE for the MOG algorithm is lower than the MAE for MMI and SRP-PHAT for all T_int and N_c. It is also clear that the SRP-PHAT algorithm generally struggles to estimate the conversational partner DOA in a multi-speaker situation, as demonstrated in Fig. 9. Essentially, the SRP-PHAT algorithm constantly switches between the candidate speakers as the estimate of the conversational partner. The MOG algorithm, however, effectively exploits the turn-taking mechanism in conversations and is able to detect the conversational partner.
An interesting observation is that the DOA estimation accuracy is slightly higher for N_c = 5 than for N_c = 3 at low integration times, e.g., T_int = 5 s. Likewise, the MAE is lower for N_c = 5 than for N_c = 3 at low integration times. Note, however, that the angular distance between the conversational partner and the competing speakers is larger for N_c = 3 than for N_c = 5. That is, for N_c = 3 the speakers are located at {0°, 90°, 180°, 270°}, whereas for N_c = 5 the speakers are located at {0°, 60°, 120°, 180°, 240°, 300°}. Therefore, possible explanations of these observations at low integration times are that 1) the DOA estimates of MMI and MOG become more biased for N_c = 3, which results in a lower accuracy, and 2) in case of a DOA estimation error, MMI and MOG are more likely to return a higher absolute error for N_c = 3 than for N_c = 5. However, it is evident from the results that the MOG algorithm has significantly higher accuracy than MMI and SRP-PHAT for all combinations of T_int and N_c.

5) Results: Speech enhancement performance in beamforming systems
The results for beamforming performance are shown in Fig. 10, which plots the ESTOI, PESQ, and segmental SNR scores as a function of integration time T_int for the different beamforming systems. Clearly, the MOG algorithm significantly outperforms the MMI and SRP-PHAT algorithms in most situations. The results also indicate that the SRP-PHAT algorithm performs slightly worse than the unprocessed signal in multi-speaker environments unless additional knowledge of the conversational partner is given. The MMI algorithm also performs slightly worse than the unprocessed signal in terms of ESTOI at T_int = 5 s and T_int = 10 s, as the MMI algorithm can erroneously estimate a competing speaker as the conversational partner at low integration times. The speech enhancement system using the BMOG algorithm, however, performs best on average across all scores, especially in terms of ESTOI and PESQ. This is likely due to the softer gain function based on the estimated posterior probability, which is less aggressive than the gain function used with the MOG algorithm. The softer gain function translates to higher ESTOI and PESQ scores, but a slightly lower segmental SNR score. With long integration times, the speech enhancement systems using MOG and BMOG are both extremely effective at retrieving a conversational partner in a multi-speaker situation, as they perform close to the oracle beamformer. However, long integration times also require that the conversational partner stays within the same beam for a longer duration, e.g., in a restaurant where the speakers are seated.

VI. CONCLUSION
In this paper, we have proposed a speech enhancement paradigm using a speaker ranking algorithm which can effectively retrieve a desired speech signal in a multi-talker environment. Specifically, the proposed speech enhancement paradigm exploits turn-taking behavior to determine the conversational partner amongst a set of candidate talkers by finding the talker with minimum probability of speech overlaps and gaps with the user's own voice. The proposed algorithm only requires access to microphone signals, in contrast to existing methods which require additional sensor inputs, e.g., EEG, cameras, etc. We demonstrated the proposed speech enhancement paradigm in two applications where retrieval of a conversational partner's speech signal in a multi-talker environment is desired. We compared the proposed systems to current state-of-the-art speech enhancement systems, and the results indicate that the proposed systems significantly outperform them.

A. PROOF OF MINIMIZING SPEECH OVERLAP AND GAP, AND MAXIMIZING MUTUAL EXCLUSION
In this Appendix, it is shown that minimizing the probability of speech overlap and gap is equivalent to maximizing the probability of mutual speech exclusion between the user's own voice VAD and a candidate speaker's VAD, i.e., the equivalence stated in (32). Since the probabilities P_A0Ai(α_0 = k, α_i = j) sum to one over their support,

    Σ_{k=0}^{1} Σ_{j=0}^{1} P_A0Ai(α_0(n) = k, α_i(n) = j) = 1,

it follows that

    P_A0Ai(α_0(n) = 1, α_i(n) = 1) + P_A0Ai(α_0(n) = 0, α_i(n) = 0) = 1 − [ P_A0Ai(α_0(n) = 1, α_i(n) = 0) + P_A0Ai(α_0(n) = 0, α_i(n) = 1) ],

where the left-hand side is the probability of speech overlap and gap, and the right-hand side is 1 minus the probability of mutual speech exclusion. Hence, minimizing the probability of speech overlap and gap,

    î_MOG(n) = arg min_{i} Σ_{k=0}^{1} P_A0Ai(α_0(n) = k, α_i(n) = k),

is equivalent to maximizing the probability of mutual speech exclusion,

    î_MOG(n) = arg max_{i} Σ_{k=0}^{1} P_A0Ai(α_0(n) = k, α_i(n) = 1 − k),

which proves the equivalence in (32).

B. PROOF OF MINIMIZING SPEECH OVERLAP AND GAP, AND MAXIMIZING MEAN-SQUARE-ERROR
In this Appendix, we show that minimizing the probability of speech overlap and gap is identical to maximizing the mean-square-error between the own voice VAD and a candidate speaker VAD. Since A_0 and A_i are binary, (A_0 − A_i)² equals 1 exactly when A_0 ≠ A_i and 0 otherwise. Hence, the probability of speech overlap and gap is

    P(A_0 = A_i) = 1 − P(A_0 ≠ A_i) = 1 − E[(A_0 − A_i)²],

where E[(A_0 − A_i)²] is the mean-square-error (MSE) between A_0 and A_i. We see that the probability of speech overlap and gap is equivalent to 1 − E[(A_0 − A_i)²]. Hence, the optimization problem for the MOG algorithm is

    î_MOG(n) = arg min_{i} ( 1 − E[(A_0 − A_i)²] ) = arg max_{i} E[(A_0 − A_i)²],

which is a maximization of the MSE between the own voice VAD and a candidate speaker VAD.
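The identity between the overlap-and-gap probability and one minus the MSE holds exactly for binary sequences and is easy to verify numerically (a sanity check on random VAD sequences, not part of the paper's evaluation):

```python
import numpy as np

rng = np.random.default_rng(0)
a0 = rng.integers(0, 2, size=1000)  # own-voice binary VAD
ai = rng.integers(0, 2, size=1000)  # candidate binary VAD

# Probability of overlap (both active) plus gap (both silent) ...
p_overlap_gap = np.mean((a0 == 1) & (ai == 1)) + np.mean((a0 == 0) & (ai == 0))
# ... equals one minus the MSE between the binary VAD sequences.
mse = np.mean((a0 - ai) ** 2)
assert np.isclose(p_overlap_gap, 1.0 - mse)
```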

C. EXPECTED MISCLASSIFICATION RATE FOR MOG
The speaker misclassification rate is defined as the probability of classifying a wrong candidate speaker as the conversational partner. Using the MOG algorithm, a misclassification occurs when Φ_t is equal to or smaller than Φ_v. For a number I of candidate speakers and integration time T_int, the misclassification rate is given by P(E = 1; I, T_int) = 1 − P(E = 0; I, T_int), where P(E = 0; I, T_int) denotes the correct classification rate, and P_Φ(φ − 1; γ_v, β_v, N) is the cumulative distribution function of p_Φ(φ − 1; γ_v, β_v, N).

Proof
First, we consider the probability of correct classification under the assumption that the Φ_v,j for all j are independent. To simplify the expression, we define the cumulative distribution function in (47) for all j = 1, ..., I − 1. Inserting (47) into (46), and assuming further that the Φ_v,j are independent and identically distributed such that γ_v,j = γ_v and β_v,j = β_v for all j, we obtain the correct classification rate P(E = 0; I, T_int). The misclassification rate then follows as P(E = 1; I, T_int) = 1 − P(E = 0; I, T_int).