ICASSP 2023 Acoustic Echo Cancellation Challenge

The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge, and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic + buffering latency to 20 ms, and including a full-band version of AECMOS [1]. We open source two large datasets to train AEC models under both single talk and double talk scenarios: a real dataset of recordings from more than 10,000 real audio devices and human speakers in real environments, and a synthetic dataset. We open source an online subjective test framework and provide an objective metric so researchers can quickly test their results. The winners of this challenge were selected based on the average mean opinion score (MOS) achieved across all scenarios and the word accuracy (WAcc) rate.


I. Introduction
With the growing popularity and need for working remotely, the use of teleconferencing systems such as Microsoft Teams, Skype, WebEx, Zoom, etc., has increased significantly. It is imperative to have good quality calls to make the user's experience pleasant and productive. The degradation of call quality due to acoustic echoes is one of the major sources of poor speech quality ratings in voice and video calls. While digital signal processing (DSP) based AEC models have been used to remove these echoes during calls, their performance can degrade when model assumptions are violated, e.g., fast time-varying acoustic conditions, unknown signal processing blocks or non-linearities in the processing chain, or failure of other models (e.g., background noise estimates). This problem becomes more challenging during full-duplex modes of communication, where echoes from double talk scenarios are difficult to suppress without significant distortion or attenuation [2].
With the advent of deep learning techniques, many supervised learning algorithms for AEC have shown better performance compared to their classical counterparts, e.g., [3], [4], [5]. Some studies have also shown good performance using a combination of classical and deep learning methods, such as adaptive filters combined with recurrent neural networks (RNNs) [5], [6], but only on synthetic datasets. While these approaches are promising, they lack evidence of their performance on real-world datasets with speech recorded in diverse noise and reverberant environments. This makes it difficult for researchers in the industry to choose a good model that can perform well on a representative real-world dataset.
Most AEC publications use objective measures such as echo return loss enhancement (ERLE) [7] and perceptual evaluation of speech quality (PESQ) [8]. ERLE in dB is defined as

ERLE = 10 log10 ( E{y^2(n)} / E{e^2(n)} ),   (1)

where y(n) is the microphone signal and e(n) is the residual echo after cancellation. ERLE is only appropriate when measured in a quiet room with no background noise and only for single talk scenarios (not double talk), where we can use the processed microphone signal as an estimate for e(n).
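As a minimal sketch of this definition (the function and variable names are illustrative, not challenge tooling), ERLE can be computed from a far end single talk recording as follows:

```python
import numpy as np

def erle_db(mic: np.ndarray, residual: np.ndarray, eps: float = 1e-12) -> float:
    """Echo return loss enhancement in dB for far end single talk in a quiet room.

    mic:      microphone signal y(n), essentially pure echo in this scenario.
    residual: processed output e(n) after echo cancellation.
    """
    return 10.0 * np.log10((np.mean(mic ** 2) + eps) / (np.mean(residual ** 2) + eps))
```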
PESQ has also been shown not to have a high correlation with subjective speech quality in the presence of background noise [9]. Using the datasets provided in this challenge, we show that ERLE and PESQ have a low correlation to subjective tests (Table 1). For a dataset with recordings in real environments, we therefore cannot use ERLE and PESQ. A more reliable and robust evaluation framework is needed that everyone in the research community can use, which we provide as part of the challenge.

This AEC challenge is designed to stimulate research in the AEC domain by open-sourcing a large training dataset, test set, and subjective evaluation framework. We provide two new open-source datasets for training AEC models. The first is a real dataset captured using a large-scale crowdsourcing effort. This dataset consists of real recordings that have been collected from over 10,000 diverse audio devices and environments. The second dataset is synthesized from speech recordings, room impulse responses, and background noise derived from [10]. An initial test set was released for the researchers to use during development, and a blind test set near the end, which was used to decide the final competition winners. We believe these datasets are large enough to facilitate deep learning and representative enough for practical usage in shipping telecommunication products (e.g., see [11]). This is the fourth AEC challenge we have conducted. The first challenge was held at ICASSP 2021 [12], the second at INTERSPEECH 2021 [13], and the third at ICASSP 2022 [14]. These challenges had 49 participants, with entries ranging from pure deep models and hybrid linear AEC + deep echo suppression to DSP methods. While the submitted AECs have consistently been getting better, there is still significant room for improvement, as shown in Table 2. The two largest areas for improvement are (1) Single Talk Near End quality, which is affected by background noise, reverberation, and capture device distortions, and (2) Double Talk Other Degradations, which includes missing audio, distortions, and cut-outs. In addition, the overall challenge metric M was 0.883 out of 1.0 in the ICASSP 2022 challenge, which also shows significant room for improvement.
To improve the challenge and further stimulate research in this area, we have made the following changes: we added a second track for personalized acoustic echo cancellation, reduced the allowed algorithmic plus buffering latency to 20 ms, and included a full-band version of AECMOS. An overview of the four AEC challenges is given in Table 3.
Related work is reviewed in Section II. The challenge description is given in Section III. The training datasets are described in Section IV and the test set in Section V. We describe a baseline deep neural network-based AEC method in Section VI. The online subjective evaluation framework is discussed in Section VII and the objective metric in Section VIII. The challenge metric is given in Section IX, and the challenge rules are described at https://aka.ms/aec-challenge. The results and analysis are given in Section X, and conclusions are discussed in Section XI.

II. Related work
There are many standards for measuring AEC performance. For objective metrics, IEEE 1329 [2] defines metrics like terminal coupling loss for single talk (TCLwst) and double talk (TCLwdt), which are measured in anechoic chambers. TIA 920 [16] uses many of these metrics but defines required criteria. ITU-T Rec. G.122 [17] defines AEC stability metrics, and ITU-T Rec. G.131 [18] provides a useful relationship between acceptable Talker Echo Loudness Rating and one-way delay time. ITU-T Rec. G.168 [19] provides a comprehensive set of AEC metrics and criteria. However, it is not clear how to combine these dozens of metrics into a single metric, or how well they correlate with subjective quality.
Subjective speech quality assessment is the gold standard for evaluating speech enhancement processing and telecommunication systems, and the ITU-T has developed several recommendations for it. ITU-T P.800 [20] describes lab-based methods for the subjective determination of speech quality. In P.800, users are asked to rate the quality of speech clips on a Likert scale from 1: Poor to 5: Excellent. Many ratings are taken for each clip, and the average score for each clip is the MOS. ITU-T P.808 [21] describes a crowdsourcing approach for conducting subjective evaluations of speech quality. It provides guidance on test material, experimental design, and a procedure for conducting listening tests in the crowd. These methods are complementary to the laboratory-based evaluations of P.800. An open-source implementation of P.808 is described in [22]. ITU-T P.835 [23] provides a subjective evaluation framework that gives standalone quality scores of speech (SIG) and background noise (BAK) in addition to the overall quality (OVRL). An open-source implementation of P.835 is described in [24]. More recent multidimensional speech quality assessment standards are ITU-T P.863.2 [25] and P.804 [26] (listening phase), which measure noisiness, coloration, discontinuity, and loudness. An open-source implementation of P.804 using crowdsourcing is described in [27]. ITU-T Rec. P.831 [28] provides guidelines on how to conduct subjective tests for network echo cancellers in the laboratory. ITU-T Rec. P.832 [8] focuses on handsfree terminals and covers a broader range of degradations. Cutler et al. [29] provide an open-source crowdsourcing tool extending P.831 and P.832 and include validation studies showing it is accurate compared to expert listeners and repeatable across multiple days and different raters. Purin et al. [1] created an objective metric, AECMOS, based on this tool's results on hundreds of different AEC models. AECMOS has a high correlation with subjective opinion.
While there have been hundreds of papers published on deep echo cancellation since the first AEC challenge, we feel the winners of each challenge are of special note since they have been tested and evaluated using realistic and challenging test sets and subjective evaluations. Table 4 provides the top three papers for each previous AEC challenge. Note that because the performance rankings and paper acceptances were decoupled in ICASSP 2021 and INTERSPEECH 2021, the challenge placement and performance rankings are not identical, and for INTERSPEECH 2021 they are not well correlated. For ICASSP 2022 and 2023, the top five papers based on challenge performance were submitted for review, fixing the disparity between paper acceptance and model performance.

III. Challenge description

A. Tracks
This challenge included two tracks:
• Non-personalized AEC. This is similar to the ICASSP 2022 AEC Challenge.
• Personalized AEC. This adds speaker enrollment for the near end speaker. A speaker enrollment is a 15-25 second recording of the near end speaker that can be used for adapting the AEC for personalized echo cancellation. For training and model evaluation, the datasets at https://github.com/microsoft/AEC-Challenge can be used, which include both echo and near end only clips from users. For the blind test set, the enrollment clips will be provided.

B. Latency and runtime requirements
Algorithmic latency is defined as the offset introduced by the whole processing chain, including the short-time Fourier transform (STFT), inverse STFT, overlap-add, additional lookahead frames, etc., compared to just passing the signal through without modification. It does not include buffering latency. Some examples are:
• STFT-based processing with a window length of 20 ms and a hop length of 10 ms introduces an algorithmic latency of window length - hop length = 10 ms.
• A causal time-domain convolution, i.e., one whose input is padded with kernel size - 1 samples, introduces no algorithmic latency.
• STFT-based processing with a window length of 20 ms and a hop length of 10 ms that uses 2 future frames introduces an algorithmic latency of (window length - hop length) + 2 × hop length = 30 ms.
Buffering latency is defined as the latency introduced by block-wise processing, often referred to as hop length, frame shift, or temporal stride. Some examples are:
• STFT-based processing has a buffering latency corresponding to the hop size.
• Overlap-save processing has a buffering latency corresponding to the frame size.
• A time-domain convolution with stride 1 introduces a buffering latency of 1 sample.
Real-time factor (RTF) is defined as the fraction of time it takes to execute one processing step: RTF = compute time / time step. For an STFT-based algorithm, one processing step is the hop size. For a time-domain convolution, one processing step is 1 sample.
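As a hedged illustration of this bookkeeping (the function names and example numbers are ours, not challenge tooling), the sketch below computes the two latencies and the RTF for an STFT-based pipeline:

```python
def stft_latencies_ms(window_ms: float, hop_ms: float, lookahead_frames: int = 0):
    """Algorithmic and buffering latency of an STFT-based pipeline, in milliseconds."""
    algorithmic = (window_ms - hop_ms) + lookahead_frames * hop_ms  # overlap-add + lookahead
    buffering = hop_ms                                              # one hop per processing step
    return algorithmic, buffering

def real_time_factor(compute_ms_per_step: float, step_ms: float) -> float:
    """RTF = compute time per processing step / duration of that step."""
    return compute_ms_per_step / step_ms

# Example: 20 ms window, 10 ms hop, no lookahead -> 10 ms algorithmic + 10 ms buffering = 20 ms,
# and 4 ms of compute per 10 ms hop gives RTF = 0.4.
algo, buf = stft_latencies_ms(20.0, 10.0)
rtf = real_time_factor(4.0, 10.0)
print(algo + buf, rtf)
```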
All models submitted to this challenge must meet all of the following requirements:
1) RTF ≤ 0.5 on an Intel Core i5 quad-core clocked at 2.4 GHz using a single thread. This is required to be able to execute the algorithm in real time and to accommodate the variance in compute time that occurs in practice.
2) Algorithmic latency + buffering latency ≤ 20 ms.
3) No future information can be used during model inference.

IV. Training datasets
The challenge includes two open-source datasets, one real and one synthetic. The datasets are available at https://github.com/microsoft/AEC-Challenge.

A. Real dataset
The first dataset was captured using a large-scale crowdsourcing effort. This dataset consists of more than 50,000 recordings from over 10,000 different real environments, audio devices, and human speakers in the following scenarios:
1) Far end single talk, no echo path change
2) Far end single talk, echo path change
3) Near end single talk, no echo path change
4) Double talk, no echo path change
5) Double talk, echo path change
6) Sweep signal for RT60 estimation
RT60 is the time it takes for an initial signal's sound pressure level to attenuate by 60 dB from its original level. For the far end single talk case, only the loudspeaker signal (far end) is played back to the users, and the users remain silent (no near end speech). For the near end single talk case, there is no far end signal and users are prompted to speak, capturing the near end signal. For double talk, both the far end and near end signals are active: a loudspeaker signal is played while users talk at the same time. Echo path changes were incorporated by instructing the users to move their device around or to move themselves around the device. The RT60 distribution for the 4,387 desktop environments in the real dataset for which impulse response measurements were available was estimated using the method of Karjalainen et al. [39] and is shown in Figure 2. For 1,251 mobile environments, the RT60 distribution was estimated blindly from speech recordings [40]. The RT60 estimates can be used to sample the dataset for training. The near end single talk speech quality is given in Figure 1.
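The RT60 values above were estimated with the method of Karjalainen et al. [39] for measured impulse responses and blindly from speech for mobile devices [40]; neither method is reproduced here. As a much simpler, hedged illustration of the RT60 definition only, the sketch below applies Schroeder backward integration to an impulse response and extrapolates a -5 dB to -35 dB fit to 60 dB of decay:

```python
import numpy as np

def rt60_schroeder(ir: np.ndarray, fs: int) -> float:
    """Rough RT60 estimate from an impulse response via Schroeder backward integration."""
    energy = np.asarray(ir, dtype=np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]                 # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    # Fit a line to the decay between -5 dB and -35 dB, then extrapolate to -60 dB.
    idx = np.where((edc_db <= -5.0) & (edc_db >= -35.0))[0]
    t = idx / fs
    slope, _ = np.polyfit(t, edc_db[idx], 1)            # slope in dB per second (negative)
    return -60.0 / slope
```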
We use Amazon Mechanical Turk as the crowdsourcing platform and wrote a custom Human Intelligence Task (HIT) application that includes a tool users download and execute to record the six scenarios described above. The dataset includes Microsoft Windows and Android devices. Each scenario includes the microphone and loopback signals (see Figure 3). Even though our application uses the WASAPI raw audio mode to bypass built-in audio effects, the PC can still include audio DSP on the receive signal (e.g., equalization and dynamic range compression); it can also include audio DSP on the send signal, such as AEC and noise suppression.
For far end signals, we use both clean speech and real-world recordings. For clean speech far end signals, we use speech segments from the Edinburgh dataset [41]. This corpus consists of short single-speaker speech segments (1 to 3 seconds). We used a long short-term memory (LSTM) [42] based gender detector to select an equal number of male and female speaker segments. We then combined 3 to 5 of these short segments to create clips between 9 and 15 seconds in duration, each consisting of a single speaker of one gender. The resulting gender-balanced far end signal source comprises 500 male and 500 female clips. Recordings are saved at the maximum sampling rate supported by the device and in 32-bit floating point format; in the released dataset we down-sample to 48 kHz and 16 bits, using automatic gain control to minimize clipping.
For noisy speech far end signals we use 2000 clips from the near end single talk scenario.Clips are gender balanced to include an equal number of male and female voices.
For the far end single talk scenario, the clip is played back twice.This way, the echo canceller can be evaluated both on the first segment, when it has had minimal time to converge, and on the second segment, when the echo canceller has converged and the result is more indicative of a real call scenario.
For the double talk scenario, the far end signal is similarly played back twice, but with an additional silent segment in the middle, when only near end single talk occurs.
For near end speech, the users were prompted to read sentences from a TIMIT [43] sentence list. Approximately 10 seconds of audio is recorded while the users are reading. For track two (personalized AEC), we include 30 seconds of target speaker speech for each clip in the test set. In addition, the training and test sets from track two of the ICASSP 2022 Deep Noise Suppression Challenge [15] can be used.

B. Synthetic dataset
The second dataset provides 10,000 synthetic scenarios, each including single talk, double talk, near end noise, far end noise, and various nonlinear distortion scenarios. Each scenario includes a far end speech, echo signal, near end speech, and near end microphone signal clip. We use 12,000 cases (100 hours of audio) from both the clean and noisy speech datasets derived in [10] from the LibriVox project as source clips to sample far end and near end signals. The LibriVox project is a collection of public-domain audiobooks read by volunteers. The authors of [10] used the online subjective test framework ITU-T P.808 to select audio recordings of good quality (4.3 ≤ MOS ≤ 5) from the LibriVox project. The noisy speech dataset was created by mixing clean speech with noise clips sampled from the AudioSet [44], Freesound, and DEMAND [45] databases at signal-to-noise ratios sampled uniformly from [0, 40] dB.
To simulate a far end signal, we pick a random speaker from a pool of 1,627 speakers, randomly choose one of that speaker's clips, and sample 10 seconds of audio from it. For the near end signal, we randomly choose another speaker and take 3 to 7 seconds of audio, which is then zero-padded to 10 seconds. The selected far end speakers were 71% male, and 67% of the near end speakers were male. To generate an echo, we convolve the far end signal with a room impulse response randomly chosen from a large unreleased Microsoft database. The room impulse responses are generated using Project Acoustics technology, and the RT60 ranges from 200 ms to 1200 ms. The distribution of RT60 is shown in Figure 4. In 80% of the cases, the far end signal is processed by a nonlinear function to mimic loudspeaker distortion (the linear-to-nonlinear ratio is 0.25). For example, the transformation can clip the maximum amplitude, apply a sigmoidal function as in [46], or apply learned distortion functions, the details of which we will describe in a future paper. This signal is mixed with the near end signal at a signal-to-echo ratio uniformly sampled from -10 dB to 10 dB. The signal-to-echo ratio is calculated based on the clean speech signal (i.e., a signal without near end noise). The far end and near end signals are taken from the noisy dataset in 50% of the cases. The first 500 clips can be used for validation, as these have a separate list of speakers and room impulse responses. Detailed metadata can be found in the repository.
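A hedged sketch of how one such scenario could be assembled under the description above; the helper names, the hard-clipping nonlinearity, and the processing order are illustrative assumptions rather than the exact pipeline used to create the released dataset:

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_at_ser(near: np.ndarray, echo: np.ndarray, ser_db: float) -> np.ndarray:
    """Scale the echo so the near-speech-to-echo power ratio equals ser_db, then mix."""
    p_near = np.mean(near ** 2) + 1e-12
    p_echo = np.mean(echo ** 2) + 1e-12
    gain = np.sqrt(p_near / (p_echo * 10.0 ** (ser_db / 10.0)))
    return near + gain * echo

def make_scenario(far: np.ndarray, near: np.ndarray, rir: np.ndarray,
                  rng: np.random.Generator):
    """Assemble one synthetic example: far end, echo, near end, microphone signal."""
    loudspeaker = far
    if rng.random() < 0.8:                       # 80% of cases: loudspeaker distortion
        loudspeaker = np.clip(far, -0.3, 0.3)    # illustrative hard clipping
    echo = fftconvolve(loudspeaker, rir)[: len(far)]   # echo path via the room impulse response
    ser_db = rng.uniform(-10.0, 10.0)            # signal-to-echo ratio in dB
    mic = mix_at_ser(near, echo, ser_db)
    return far, echo, near, mic
```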

V. Test set
Two test sets are included: one at the beginning of the challenge and a blind test set near the end. Both consist of 800 real-world recordings, between 30 and 45 seconds in duration. The datasets include the following scenarios that make echo cancellation more challenging:
• Long or varying delays, i.e., files where the delay between loopback and mic-in is atypically long or varies during the recording
• Strong speaker and/or microphone distortions
• Stationary near end noise
• Non-stationary near end noise
• Recordings with audio DSP processing from the device, such as AEC or noise reduction
• Glitches, i.e., files with "choppy" audio, for example, due to very high CPU usage
• Gain variations, i.e., recordings where the far end level changes during the recording

VI. Baseline AEC Method
We adapt a noise suppression model developed in [47] to the task of echo cancellation. Specifically, a recurrent neural network with gated recurrent units takes concatenated log power spectral features of the microphone signal and far end signal as input and outputs a spectral suppression mask. The short-time Fourier transform is computed using 20 ms frames, a hop size of 10 ms, and a 320-point discrete Fourier transform. We use a stack of two gated recurrent unit layers, each with 322 nodes, followed by a fully connected layer with a sigmoid activation function. The model has 1.3 million parameters. The estimated mask is point-wise multiplied with the magnitude spectrogram of the microphone signal to suppress the far end signal. Finally, the enhanced signal is resynthesized with an inverse short-time Fourier transform applied to the estimated magnitude spectrogram combined with the phase of the microphone signal. We use a mean squared error loss between the clean and enhanced magnitude spectrograms. The Adam optimizer [48] with a learning rate of 0.0003 is used to train the model. The model and the inference code are available in the challenge repository (https://github.com/microsoft/AEC-Challenge/tree/main/baseline/icassp2022).
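The released baseline code lives in the challenge repository; the PyTorch sketch below is only an illustration of the architecture described above (two GRU layers of 322 units over the 161 bins of a 320-point DFT) and may differ from the official implementation in details such as feature normalization:

```python
import torch
import torch.nn as nn

class BaselineAEC(nn.Module):
    """GRU mask estimator: log-power spectra of mic and far end -> suppression mask."""

    def __init__(self, n_bins: int = 161, hidden: int = 322):
        super().__init__()
        self.gru = nn.GRU(input_size=2 * n_bins, hidden_size=hidden,
                          num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_bins)

    def forward(self, mic_logpow: torch.Tensor, far_logpow: torch.Tensor) -> torch.Tensor:
        # mic_logpow, far_logpow: (batch, frames, n_bins)
        x = torch.cat([mic_logpow, far_logpow], dim=-1)
        h, _ = self.gru(x)
        return torch.sigmoid(self.fc(h))   # mask in [0, 1], shape (batch, frames, n_bins)
```

The mask is multiplied with the microphone magnitude spectrogram, the signal is resynthesized with the microphone phase, and training minimizes the MSE between clean and enhanced magnitudes with Adam at a learning rate of 3e-4, as described above.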

VII. Online subjective evaluation framework
We have extended the open source P.808 Toolkit [22] with methods for evaluating echo impairments in subjective tests. We followed the Third-party Listening Test B from ITU-T Rec. P.831 [28] and ITU-T Rec. P.832 [8] and adapted them to our use case as well as to the crowdsourcing approach based on the ITU-T Rec. P.808 [21] guidance. A third-party listening test differs from the typical listening-only test (according to ITU-T Rec. P.831) in that the listener hears the recordings from the center of the connection rather than being positioned at one end of it [28] (see Figure 6). The speech material should therefore be recorded with this setup in mind. During the test session, we use different combinations of single- and multi-scale Absolute Category Ratings depending on the speech sample under evaluation. We distinguish between single talk and double talk scenarios. For the near end single talk, we ask for the overall quality. For the far end single talk and double talk scenarios, we ask about echo annoyance and about impairments from other degradations in two separate questions:
1) How would you judge the degradation from the echo?
2) How would you judge other degradations (noise, missing audio, distortions, cut-outs)?
Both impairments are rated on the degradation category scale (from 1: Very annoying, to 5: Imperceptible) to obtain degradation mean opinion scores (DMOS). Note that we do not use the Other degradation category for far end single talk when evaluating echo cancellation performance, since this metric mostly reflects the quality of the original far end signal. However, we have found that having this component in the questionnaire helps increase the accuracy of the echo degradation ratings (when measured against expert raters). Without the Other category, raters can sometimes assign degradations due to noise to the Echo category [29].
The setup illustrated in Figure 5 is used to process all speech samples with all of the AECs under study. To simplify the rating process for crowdworkers, we distinguished between the near end single talk, far end single talk, and double talk scenarios and simulated each of them for the test participants. In the case of near end single talk, we recorded the AEC output (S_out). For far end single talk, we added the output of the AEC (S_out) with a delay of 600 ms to the loopback (R_in) signal, yielding R_in + delayed S_out. For the listener, this simulates hearing the echo of their own speech (i.e., R_in as an acoustic sidetone). For double talk the process is similar, but because more speakers are involved, simply adding the delayed AEC output (S_out) would confuse the test participants. To mitigate this issue, the signals are instead played in stereo, with the loopback signal (R_in) played in one ear (i.e., acoustic sidetone) and the delayed output of the AEC (S_out) played in the other. Figure 6 was used to illustrate the double talk scenario to crowdworkers.
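A hedged sketch of the stimulus construction just described (function names, sample-rate handling, and array lengths are illustrative assumptions):

```python
import numpy as np

def far_end_st_stimulus(r_in: np.ndarray, s_out: np.ndarray, fs: int,
                        delay_s: float = 0.6) -> np.ndarray:
    """Mono stimulus: loopback R_in plus the AEC output S_out delayed by 600 ms."""
    delayed = np.concatenate([np.zeros(int(delay_s * fs)), s_out])
    n = max(len(r_in), len(delayed))
    mix = np.zeros(n)
    mix[: len(r_in)] += r_in
    mix[: len(delayed)] += delayed
    return mix

def double_talk_stimulus(r_in: np.ndarray, s_out: np.ndarray, fs: int,
                         delay_s: float = 0.6) -> np.ndarray:
    """Stereo stimulus: loopback in one ear, delayed AEC output in the other."""
    delayed = np.concatenate([np.zeros(int(delay_s * fs)), s_out])
    n = max(len(r_in), len(delayed))
    left, right = np.zeros(n), np.zeros(n)
    left[: len(r_in)] = r_in
    right[: len(delayed)] = delayed
    return np.stack([left, right], axis=-1)
```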
For the far end single talk scenario, we evaluate the second half of each clip to avoid initial degradations from initialization, convergence periods, and initial delay estimation.For the double talk scenario, we evaluate the final third of the audio clip.
The subjective test framework is available at https://github.com/microsoft/P.808.A more detailed description of the test framework and its validation is given in [29].

VIII. Objective metric
We have developed an objective perceptual speech quality metric called AECMOS.It can be used to stack rank different AEC methods based on MOS estimates with high accuracy.
It is a neural network-based model trained on ground truth human ratings obtained using our online subjective evaluation framework. The audio data used to train AECMOS was gathered from the numerous subjective tests we conducted while improving the quality of our own AECs, as well as from the first two AEC challenges. The performance of AECMOS, compared with subjective human ratings on the 18 submitted models, is given in Table 5. A more detailed description of AECMOS is given in [1]. Sample code can be found at https://aka.ms/aec-challenge.

IX. Challenge metric
The challenge performance is determined using the average of the five subjective scores described in Section VII and the word accuracy rate WAcc, all weighted equally:

M = (1/6) [ WAcc + (FE - 1)/4 + (NE_SIG - 1)/4 + (NE_BAK - 1)/4 + (DT_echo - 1)/4 + (DT_other - 1)/4 ],   (2)

where FE is the far end single talk echo DMOS, NE_SIG and NE_BAK are the P.835 SIG and BAK scores for near end single talk, DT_echo is the double talk echo DMOS, and DT_other is the double talk other DMOS. Each MOS-type score is mapped from the range [1, 5] to [0, 1] so that M lies between 0 and 1.
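A minimal sketch of Equation (2) as written above; the function name is ours:

```python
def challenge_metric(wacc: float, fe: float, ne_sig: float, ne_bak: float,
                     dt_echo: float, dt_other: float) -> float:
    """Overall score M: equal-weight average of WAcc and five normalized subjective scores."""
    mos_scores = [fe, ne_sig, ne_bak, dt_echo, dt_other]
    normalized = [(m - 1.0) / 4.0 for m in mos_scores]  # map MOS-type scores from [1, 5] to [0, 1]
    return (wacc + sum(normalized)) / 6.0
```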

X. Results and analysis
The challenge had 20 entries: 17 for the non-personalized track and 3 for the personalized track. In addition, we included two internally developed models based on [11], labeled MS-1 and MS-2. We batched all submissions into three sets:
• Near end single talk files for a MOS test (NE ST MOS)
• Far end single talk files for an Echo and Other degradation DMOS test (FE ST Echo/Other DMOS)
• Double talk files for an Echo and Other degradation DMOS test (DT Echo/Other DMOS)
The results are given in Figure 7, and the analysis of variance (ANOVA) for the top entries is given in Figure X. The 2nd and 3rd places were tied; for the ties, the winners were selected using the lower complexity model.
A high-level comparison of the top-5 entries is given in Table 8. Some observations are given below; a small example of how such correlations are computed follows the list.
• There is a PCC = -0.54 between model size and the overall score. For this challenge, smaller models tend to outperform larger models.
• There is a PCC = 0.67 between RTF and the overall score. More complex models tend to outperform less complex models.
• There is a PCC = 0.10 between whether the model was a hybrid and the overall score. Both hybrid and deep models perform well.
• There is a PCC = 0.19 between training dataset size and the overall score. Dataset size was not a significant factor in this challenge.
• There is a PCC = 0.49 between using additional datasets and the overall score. Only one team added additional data (LibriSpeech [50]), though they were the first-place team [51].
• The first-place entry showed that personalized AEC did increase performance, but only by a small amount (improving the final score by 0.002).
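For reference, the correlations above are ordinary Pearson coefficients between a per-entry attribute and the overall score; a minimal sketch (the arrays below are made-up illustration values, not challenge results):

```python
from scipy.stats import pearsonr

# Hypothetical per-entry attributes and overall scores, for illustration only.
model_size_millions = [1.3, 4.7, 8.2, 2.1, 16.0]
overall_score = [0.82, 0.80, 0.78, 0.83, 0.76]

r, p_value = pearsonr(model_size_millions, overall_score)
print(f"PCC = {r:.2f} (p = {p_value:.3f})")
```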

A. Performance comparison of the ICASSP 2022 and 2023 AEC Challenge
To compare the winning model performance from the ICASSP 2022 AEC Challenge with models from this year's challenge, we applied the top-scoring model MS-1 to the 2022 AEC Challenge data and used the online subjective evaluation framework to compare the results. Table 9 shows that MS-1 is statistically the same as the 2022 AEC Challenge winner [36], even though the algorithmic latency + buffering latency for MS-1 is 20 ms and for [36] it is 40 ms. In studies with the

FIGURE 3. The custom recording application recorded the loopback and microphone signals.

FIGURE 4. Distribution of reverberation time (RT60) for the synthetic dataset.

FIGURE 5. Echo canceller test set-up for Third Party Listening Test B according to ITU-T Rec. P.831 (after [28]). S is send and R is receive.

FIGURE 6. Double talk scenario in the Third Party Listening Test. The test participant (marked by "You") is positioned in the center of the communication.

TABLE 5. AECMOS performance using Pearson's correlation coefficient (PCC), Spearman's rank correlation coefficient (SRCC), and Kendall's Tau-b with a 95% confidence interval [49].

TABLE 2. Amount of improvement remaining based on the ICASSP 2022 AEC Challenge [14].

TABLE 6. Performance metrics used in Figure 7.