A Deep Denoising Sound Coding Strategy for Cochlear Implants

Cochlear implants (CIs) have proven to be successful at restoring the sensation of hearing in people who suffer from profound sensorineural hearing loss. CI users generally achieve good speech understanding in quiet acoustic conditions. However, their ability to understand speech degrades drastically when background interfering noise is present. To address this problem, current CI systems are delivered with front-end speech enhancement modules that can aid the listener in noisy environments. However, these only perform well under certain noisy conditions, leaving quite some room for improvement in more challenging circumstances. In this work, we propose replacing the CI sound coding strategy with a deep neural network (DNN) that performs end-to-end speech denoising by taking the raw audio as input and providing a denoised electrodogram, i.e., the electrical stimulation patterns applied to the electrodes across time. We specifically introduce a DNN that emulates a common CI sound coding strategy, the advanced combination encoder (ACE). We refer to the proposed algorithm as ‘Deep ACE’. Deep ACE is designed not only to accurately code the acoustic signals in the same way that ACE would but also to automatically remove unwanted interfering noises, without sacrificing processing latency. The model was optimized using a CI-specific loss function and evaluated using objective measures as well as listening tests in CI participants. Results show that, based on objective measures, the proposed model achieved higher scores when compared to the baseline algorithms. Also, the proposed deep learning-based sound coding strategy gave eight CI users the highest speech intelligibility scores.


I. INTRODUCTION
A cochlear implant (CI) is a surgically implanted neuroprosthetic device that restores the sensation of hearing in people who suffer from profound sensorineural hearing loss. The CI sound coding strategy is responsible for computing the electric stimulation current levels from the audio captured by the microphone of the CI sound processor. There are several CI sound coding strategies used in the industry [1]. Among these, a widely used one is continuous interleaved sampling (CIS) [2]. CIS decomposes the incoming sound into multiple frequency bands, which are used to modulate electric pulses that stimulate the auditory nerve. The set of pulses is sent to all available active electrodes to stimulate the auditory nerve across time in an interleaved way. Other strategies perform band selection by picking the most perceptually relevant channels for stimulation. Band selection has the advantage of reducing power consumption without compromising speech intelligibility, which is why it is widely used in the CI industry. Some common criteria to select relevant bands are based on magnitude, used in the advanced combination encoder (ACE) [3], or on psychoacoustic masking, used in the PACE/MP3000 sound coding strategy [4]. When these CI sound coding strategies are used, the electrodes located near the base of the cochlea represent higher frequencies, whereas those located in the most apical region transmit low-frequency information. In this work, we focus specifically on the ACE sound coding strategy. However, the presented approach could be generalized to any available sound coding strategy, as all of them generate electrodograms (i.e., the normalized amplitudes that are subsequently mapped to the current levels that each electrode will deliver to the auditory nerve over time).
In general, a CI together with its corresponding sound coding strategy allows the user to understand speech in quiet conditions. However, speech understanding breaks down when loud interfering signals, such as noise or other talkers, are present (i.e., at low signal-to-noise ratios; SNRs) [5]. To overcome the limitations that CI users face in noisy conditions, many speech enhancement techniques have been proposed to improve speech intelligibility, such as spectral contrast enhancement [6], [7], spectral subtraction [8], Wiener filtering [9] and time-frequency masking [10]. Although these techniques work reasonably well, the signal processing community has recently been leaning towards data-driven approaches to single-channel speech enhancement, such as deep learning models [11], [12], [13], [14].
Modern approaches to source separation and speech enhancement typically utilize time-frequency representations of the input signals for extracting features, which can lead to highly effective results [15], [16]. However, these do not exploit potentially rich sources of information, such as the phase, limiting speech separation quality. To overcome this problem, end-to-end deep learning-based approaches that work directly in the time domain have recently been proposed. For example, [17] proposed a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. This model addresses the shortcomings of separation in the frequency domain, achieves state-of-the-art performance, and is suitable for low-latency applications. Thus, approaches that perform end-to-end processing are getting more attention in the community, making them an attractive potential solution to the CI 'cocktail party' problem [18]. A front-end approach, however, may not fully exploit the CI processing characteristics.
In order to optimize speech enhancement for CIs, it may be beneficial to design algorithms that consider the CI processing scheme. Hence, there has been some work done specifically for CIs, where DNNs are included in the CI signal path [14], [19], [20], [21], [22]. These approaches reduce noise, for example, by directly applying masks in the filter bank used by the CI sound coding strategy. Recently, inspired by the aforementioned Conv-TasNet, [23] proposed a deep learning-based end-to-end CI sound coding strategy, referred to as 'Deep ACE'.
Deep ACE replaces the clinical ACE sound coding method and automatically performs speech enhancement by estimating denoised electrodograms directly from raw audio. It leverages audio-to-electrodogram domain transformation to improve noise reduction for CIs. Although phase information is not necessary for the synthesis of the electrodograms, using it may help generate a proper input signal encoding, and ergo, a better latent representation. Deep ACE is intended to take advantage of such signal representation in order to extract global patterns from its characteristics, identifying which ones are more likely to be embedding speech content.
This study extensively examines Deep ACE [23], introducing a novel and improved topology, along with an optimized hyperparameter configuration that enhances the model's generalization capabilities. The model was trained on a large dataset and optimized through a loss function tailored for CI listening that discourages the activation of irrelevant bands, with the aim of improving speech comprehension for CI users. The study evaluates the proposed model and compares it with baseline algorithms using objective measures and listening tests with CI users to determine if Deep ACE can outperform the tested baselines and the existing clinical ACE sound coding strategy.

A. Advanced Combination Encoder (ACE)
The ACE sound coding strategy processes the acoustic signal captured by the microphone by first sampling it at 16 kHz. Then, a filter bank implemented as a 128-point fast Fourier transform (FFT), commonly with a 32-point hop size, is applied, introducing a 2 ms algorithmic latency (this depends on the channel stimulation rate; CSR). Next, an estimate of the desired envelope $E_k$ is calculated for each spectral band ($k = 1, \ldots, M$). Each spectral band is mapped to an electrode and represents one channel; $M$ denotes the total number of channels/electrodes. In this study, the band selection block retains $N = 8$ out of $M = 22$ envelopes by selecting the ones with the largest amplitudes, which are then non-linearly compressed by a loudness growth function (LGF) given by:

$$p_k = \frac{\log\left(1 + \rho\,\frac{E_k - s}{m - s}\right)}{\log(1 + \rho)}, \quad s \le E_k \le m. \tag{1}$$

The output of the LGF at band $k$ ($p_k$) represents the normalized stimulation amplitude used to stimulate the auditory nerve using electrode $k$. The stimulation patterns across electrodes obtained from the LGF output over time constitute the electrodogram (see Fig. 2). For values of $E_k$ below base level $s$, $p_k$ is set to zero, and for values of $E_k$ above saturation level $m$, $p_k$ is set to one. We used $\rho = 416.2$, $s = 4/256$, and $m = 150/256$ in our experiments.
Finally, the last stage of the sound coding strategy maps every p k into the subject's dynamic range between threshold levels and most comfortable levels for electrical stimulation. The N selected electrodes are stimulated sequentially for each audio frame, representing one stimulation cycle. The number of cycles per second thus determines the CSR. A block diagram showing the described processes is shown in Fig. 1(a); ACE.
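The band selection and compression steps described above can be sketched for a single frame as follows. This is a minimal illustration rather than the clinical implementation: the function names, the example frame, and the choice to return zero for unselected channels are assumptions, while the LGF constants are the ones quoted in the text.

```python
import math

RHO = 416.2      # LGF steepness quoted in the text
BASE = 4 / 256   # base level s
SAT = 150 / 256  # saturation level m


def lgf(envelope: float) -> float:
    """Loudness growth function: map an envelope to a level in [0, 1]."""
    if envelope <= BASE:
        return 0.0
    if envelope >= SAT:
        return 1.0
    scaled = (envelope - BASE) / (SAT - BASE)
    return math.log(1.0 + RHO * scaled) / math.log(1.0 + RHO)


def ace_frame(envelopes: list[float], n_select: int = 8) -> list[float]:
    """Keep the n_select largest envelopes of one frame and compress them.

    Unselected channels are returned as 0.0 (assumed to mean no stimulation).
    """
    ranked = sorted(range(len(envelopes)), key=lambda k: envelopes[k], reverse=True)
    selected = set(ranked[:n_select])
    return [lgf(e) if k in selected else 0.0 for k, e in enumerate(envelopes)]


# Hypothetical 22-channel frame with linearly increasing envelopes.
frame = [0.03 * (k + 1) for k in range(22)]
levels = ace_frame(frame)
```

Only the eight largest envelopes yield non-zero levels here, and envelopes above the saturation level are clipped to one.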

1) Wiener Filter (Baseline #1):
Here, we use a classic front-end signal processing method based on Wiener filtering, a widely used technique for speech denoising that relies on a priori SNR estimation [24] (Fig. 1(b); Wiener+ACE). Different variations of this algorithm are used in commercially available single-channel noise reduction systems included in CIs [25], [26]. Therefore, this classic algorithm is an appropriate baseline to use when developing new speech enhancement methods in the context of CIs [19].

2) Conv-TasNet (Baseline #2):
The front-end DNN-based baseline system used in this study is the well-known Conv-TasNet (which we will refer to as TasNet for simplicity) [17]. This system performs end-to-end speech enhancement and feeds the denoised signal to ACE, where further processing is performed to obtain the electrodograms (Fig. 1(c); TasNet+ACE). The TasNet structure has proven to be highly successful for single-speaker speech enhancement tasks, improving on state-of-the-art algorithms and obtaining the highest gains with modulated noise sources [27].

3) Deep ACE (Proposed Method):
This architecture builds upon the previously developed deep denoising sound coding strategy described in [23]. Deep ACE is designed to estimate the output of the LGF by taking in raw audio input and predicting the denoised electrodograms. This approach is independent of individual CI fitting parameters and maintains the standard ACE strategy's 2 ms total algorithmic delay. The enhancer module in the Deep ACE presented here differs from the one in TasNet+ACE in three main ways (Fig. 1(d); Deep ACE). The previous version of the model presented in [23] shared two of these differences with the current version: the use of a trainable antirectifier unit as the activation function in the encoder, and the output dimensionality of the decoder. For details, refer to [23].
The primary architectural innovations in the Deep ACE presented here are the inclusion of a deep envelope detector (DED), positioned in the skipping path of the original Deep ACE model [23], and an improved hyperparameter configuration. The DED replaces the envelope detection block in the original ACE (see the 'DED' block in Fig. 1(d)). This module performs dimensionality reduction between the encoder and the decoder modules to match the number of bands to be stimulated and to extract other essential features from the encoded signal. This process is necessary for implementation purposes, specifically for the employed loss function (refer to Section II-B5, (5)), and it involves three consecutive 1-D convolution layers that are stacked together. The code for training and evaluating Deep ACE can be found online 1 .

4) Model Training Setup:
The deep learning models were trained for a maximum of 100 epochs using batches of two audio segments, each 4 seconds long. The initial learning rate was set to 1e-3 and was halved whenever the validation set's accuracy did not improve during three consecutive epochs. To further regularize training, early stopping with a 5-epoch patience was applied. Only the best-performing model was saved after the training session.
To optimize the different models, we used the Adam first-order gradient-based optimization algorithm for stochastic objective functions [28]. The range of hyperparameters used is presented in detail in Table I. Note that the hyperparameters used in this work have been adjusted through empirical testing to improve the overall performance of the models when compared to the ones used in [23]. For a comprehensive description of the various parameters, interested readers can refer to [17].
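The training schedule described above (learning-rate halving after three epochs without improvement, early stopping after five, at most 100 epochs) can be illustrated by replaying a sequence of validation scores. This is a sketch under stated assumptions: the function name is hypothetical, a validation loss (lower is better) stands in for the validation metric, and the way the two patience counters share a single "epochs since last improvement" count is our assumption, not a detail confirmed by the text.

```python
def run_schedule(val_losses, lr=1e-3, lr_patience=3, stop_patience=5, max_epochs=100):
    """Replay validation losses through the schedule.

    Returns (best_epoch, final_lr, epochs_run).
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0        # epochs since the last improvement
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        epochs_run = epoch + 1
        if loss < best_loss:
            # In a real training loop the model checkpoint would be saved here.
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale % lr_patience == 0:
                lr *= 0.5            # halve the learning rate on a plateau
            if stale >= stop_patience:
                break                # early stopping
    return best_epoch, lr, epochs_run


# Hypothetical validation curve that stops improving after the third epoch.
history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.6]
best_epoch, final_lr, epochs_run = run_schedule(history)
```

With this curve, training stops before the last value is ever seen, which is the trade-off early stopping accepts in exchange for regularization.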

5) Model Training Objectives:
In the case of the TasNet+ACE algorithm, the optimizer was used to maximize the scale-invariant (SI) SNR [29] at the output of the TasNet, before being processed by the ACE sound coding strategy (see Fig. 1(c)). The SI-SNR between a given signal with $T$ samples, $x \in \mathbb{R}^{1 \times T}$, and its estimate $\hat{x} \in \mathbb{R}^{1 \times T}$ is defined as:

$$\text{SI-SNR}(x, \hat{x}) = 10 \log_{10} \frac{\left\| \frac{\langle \hat{x}, x \rangle}{\|x\|^2}\, x \right\|^2}{\left\| \hat{x} - \frac{\langle \hat{x}, x \rangle}{\|x\|^2}\, x \right\|^2}. \tag{2}$$

TABLE I. HYPERPARAMETERS USED TO TRAIN THE DEEP LEARNING MODELS

In the Deep ACE model, the decoder module is developed to predict the output at the LGF of ACE to be fed into the band selection process. Therefore, the cost function employed to train it is based on the mean-squared error, denoted by $L_p$. The $L_p$ between an $M$-channel and $F$-frame target LGF output, $p \in \mathbb{R}^{M \times F}$, and its estimate $\hat{p} \in \mathbb{R}^{M \times F}$ is defined as:

$$L_p(p, \hat{p}) = \frac{1}{MF} \sum_{k=1}^{M} \sum_{f=1}^{F} \left( p_{kf} - \hat{p}_{kf} \right)^2. \tag{3}$$

In this work, Deep ACE is optimized by minimizing a variant of the loss function used in [23] (i.e., $L_p$). Specifically, we combine the loss function defined in (3) with a penalty term that aims at removing CI stimulation in unwanted channels. To penalize the selection of irrelevant channels, we introduce a second loss term, measured as the binary cross entropy between the ideal target mask $\mu \in \mathbb{R}^{M \times F}$ and the estimated mask $\hat{\mu} \in \mathbb{R}^{M \times F}$ (at the output of the separator). We denote this function by $L_\mu$; its value, computed in nats, is given by:

$$L_\mu(\mu, \hat{\mu}) = -\frac{1}{MF} \sum_{k=1}^{M} \sum_{f=1}^{F} \left[ \mu_{kf} \ln P(\hat{\mu}_{kf}) + (1 - \mu_{kf}) \ln\left(1 - P(\hat{\mu}_{kf})\right) \right], \tag{4}$$

where $\mu_{kf}$ is equal to one if channel $k$ at frame $f$ contains speech, and to zero otherwise (also known as an ideal binary mask), and $P(\hat{\mu}_{kf})$ is the predicted probability that channel $k$ in frame $f$ contains speech. The cost function used to optimize Deep ACE is denoted as $L_\delta$, and was constructed by linearly combining $L_p$ and $L_\mu$ as follows:

$$L_\delta = w_p\, L_p + w_\mu\, L_\mu. \tag{5}$$

Empirical testing was used to determine the values of the multiplicative weighting factors $w_p$ and $w_\mu$, which were set to 15 and 1, respectively. This cost function is motivated by prior research [30], which demonstrated that individuals using CIs can withstand significant distortions in speech segments provided that the selection of frequency bands is accurate.
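The training objectives described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' training code: the function names are hypothetical, averaging over channels and frames is assumed for both loss terms, the zero-mean normalization often applied before SI-SNR is omitted, and small constants guard against division by zero and log(0).

```python
import math


def si_snr(target, estimate, eps=1e-12):
    """Scale-invariant SNR in dB between two equal-length sample lists."""
    dot = sum(t * e for t, e in zip(target, estimate))
    energy = sum(t * t for t in target) + eps
    projection = [dot / energy * t for t in target]     # part of the estimate aligned with the target
    residual = [e - p for e, p in zip(estimate, projection)]
    num = sum(p * p for p in projection)
    den = sum(r * r for r in residual) + eps
    return 10.0 * math.log10(num / den)


def mse(p, p_hat):
    """Mean-squared error over an M x F electrodogram (list of channel lists)."""
    sq = [(a - b) ** 2 for row, row_hat in zip(p, p_hat) for a, b in zip(row, row_hat)]
    return sum(sq) / len(sq)


def bce(mask, prob, eps=1e-7):
    """Binary cross entropy in nats between a binary mask and predicted probabilities."""
    terms = []
    for row, row_prob in zip(mask, prob):
        for m, q in zip(row, row_prob):
            q = min(max(q, eps), 1.0 - eps)             # clamp to keep the logs finite
            terms.append(-(m * math.log(q) + (1.0 - m) * math.log(1.0 - q)))
    return sum(terms) / len(terms)


def deep_ace_loss(p, p_hat, mask, prob, w_p=15.0, w_mu=1.0):
    """Weighted sum of the MSE and mask cross-entropy terms (weights from the text)."""
    return w_p * mse(p, p_hat) + w_mu * bce(mask, prob)


# Hypothetical 2-channel, 2-frame example.
p = [[0.0, 0.5], [1.0, 0.0]]
mask = [[0.0, 1.0], [1.0, 0.0]]
uninformed = [[0.5, 0.5], [0.5, 0.5]]
loss = deep_ace_loss(p, p, mask, uninformed)
```

In this example the electrodogram is predicted perfectly, so the whole loss comes from the uninformative mask, which shows how the second term keeps penalizing wrong band selection even when amplitudes match.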
It is important to note that the second loss term is applied at the separator output, which means that the estimated mask must have the same dimensions as the LGF output. To achieve this, Deep ACE utilizes a DED module (described in II-B3) in the skipping path to decrease the channel dimension of the encoded input and enable the masking operation (see Fig. 1(d)). In addition, this module is motivated by the fact that an envelope detector is also a component of the ACE sound coding strategy, where it similarly reduces the dimensionality between the filter bank (FFT) and the band selection block (as depicted in Fig. 1(a)). Specifically, the envelope detector in ACE consolidates the frequency bins obtained from the spectral transformation into the number of available electrodes (M ).
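The consolidation of FFT bins into M band envelopes performed by the ACE envelope detector can be sketched as follows. The bins-per-band allocation, the first bin used, and the unweighted root-of-summed-power pooling are illustrative assumptions that do not come from the text; they only show how 65 bins of a 128-point FFT can be reduced to 22 channels.

```python
import math

# Illustrative bins-per-band allocation for M = 22 bands (an assumption).
BINS_PER_BAND = [1] * 9 + [2] * 4 + [3] * 2 + [4] * 2 + [5] * 2 + [6, 7, 8]
FIRST_BIN = 2  # skip DC and the lowest bin (assumption)


def band_envelopes(bin_magnitudes):
    """Pool FFT bin magnitudes into one envelope per band.

    Each envelope is the square root of the summed power of the band's bins.
    """
    envelopes = []
    start = FIRST_BIN
    for width in BINS_PER_BAND:
        power = sum(m * m for m in bin_magnitudes[start:start + width])
        envelopes.append(math.sqrt(power))
        start += width
    return envelopes


# A 128-point FFT of a real signal yields 65 non-redundant bins (0..64).
flat_spectrum = [1.0] * 65
envelopes = band_envelopes(flat_spectrum)
```

Wider bands at higher frequencies collect more bins, so a flat spectrum produces larger envelopes in the basal channels.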

C. Audio Material
In this work, we used three speech datasets and three noise types to assess the models' performance and generalization abilities. These audio sets are described in this section. As a preprocessing stage, all audio material was converted to mono and re-sampled at 16 kHz. The corresponding electrodograms were obtained by processing all audio data with the ACE sound coding strategy at a CSR of 1,000 pulses per second.

1) Speech Data: a) LibriVox corpus [31]:
This speech data was originally designed for end-to-end speech translation; in this study, however, we mix the speech material with noise to train our models for speech denoising. The speech data contained in this corpus consists of fluent spoken sentences with a total duration of 18 hours. The quality of the audio and sentence alignments was checked by manual evaluation, showing that the alignment quality is in general very high. In fact, the sentence alignment quality is comparable to that of widely used parallel translation data.
b) TIMIT corpus [32]: This corpus contains broadband recordings of 630 people speaking the eight major dialects of American English, each reading ten phonetically-rich sentences. In this work, files from 112 male and 56 female speakers in the test set were selected.
c) HSM corpus [33]: Speech intelligibility in quiet and in noise was measured by means of the Hochmair, Schulz, Moser (HSM) sentence test, based on a dataset composed of 30 lists with 20 everyday sentences each (106 words per list).
TABLE II. DATASETS USED TO TRAIN, VALIDATE AND TEST THE MODELS
[...] use stationary speech-shaped noise (SSN) and non-stationary modulated seven-speaker babble noise (ICRA7) as synthetic interferers.
3) Training, Evaluation and Testing Data: The training set was composed of speech from the LibriVox corpus and noise from the DEMAND dataset. Specifically, 30 male (M) and female (F) speakers were randomly selected from the speech corpus, and two environments were randomly selected from each of the noise categories. For validation, 20% of the training data was used. The noise and speech subsets used for training will be referred to as EN 1 and LibriVox 1 , respectively. For testing, the remaining audio data was used (the testing subsets from the DEMAND and LibriVox corpora are referred to as EN 2 and LibriVox 2 , respectively). A description of the dataset distribution for the experiments is shown in Table II.
Speech and noise signals were mixed at SNR values ranging uniformly from -5 to 10 dB. The processed clean speech signals were also included in the listening experiments to assess whether the proposed model introduced perceptually relevant distortions.
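Mixing speech and noise at a target SNR drawn uniformly from the stated range can be sketched as below. The function name and the synthetic stand-in signals are assumptions for illustration; the actual corpora and any level normalization used by the authors are not reproduced here.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then add it.

    Returns the mixture and the gain applied to the noise.
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)], gain


rng = random.Random(0)
snr_db = rng.uniform(-5.0, 10.0)                       # SNR drawn uniformly from the range in the text
speech = [math.sin(0.1 * t) for t in range(1600)]      # stand-in for a speech segment
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]     # stand-in for a noise segment
mixture, gain = mix_at_snr(speech, noise, snr_db)
```

Scaling the noise rather than the speech keeps the target level fixed across SNR conditions, which is one common design choice among several.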

1) Objective Evaluation:
To assess the objective performance of each of the tested algorithms, we compute the amount of noise reduction achieved, electrode-wise correlation coefficients between the denoised and clean signals, and a speech intelligibility score based on the short-time objective intelligibility (STOI) index [37]. Note that in this work we investigate end-to-end CI processing, so the latter measure is computed from signals synthesized from the electrodograms ($p$) using a vocoder; we refer to this version of STOI as the vocoder STOI (VSTOI; [38], [39]).
a) SNRi: To assess the amount of noise reduction performed by each of the tested algorithms, we compute the SNR improvement (SNRi). This measure is calculated in the electrodogram domain and compares the original input SNR to the one obtained after denoising, and is given by:

$$\text{SNRi} = 10 \log_{10} \frac{\sum_{k=1}^{M} \left\| p^{n}_{k} - p^{c}_{k} \right\|^2}{\sum_{k=1}^{M} \left\| p^{d}_{k} - p^{c}_{k} \right\|^2},$$

where $p_k$ represents the LGF output of band $k$ and the superscripts $n$, $c$, and $d$ denote the noisy, clean, and denoised electrodograms, respectively.
b) LCC: To characterize potential distortions and artifacts introduced by the tested algorithms, the linear correlation coefficients (LCCs) between the clean ACE electrodograms ($p^c$) and the denoised electrodograms ($p^d$) were computed. The LCCs were computed channel-wise (i.e., one correlation coefficient was computed for each of the 22 channels) as:

$$\text{LCC}_k = \frac{\mathrm{cov}\left(p^{c}_{k}, p^{d}_{k}\right)}{\sigma_{p^{c}_{k}}\, \sigma_{p^{d}_{k}}},$$

where $\mathrm{cov}(X, Y)$ is the covariance between $X$ and $Y$, and $\sigma_{p_k}$ is the standard deviation of the values in the corresponding electrodogram $p_k$.
c) VSTOI: To estimate the speech intelligibility performance expected from each of the algorithms, the VSTOI score [37], [38], [39] was used. This metric relies directly on STOI [37], which is modeled on normal-hearing speech performance. However, VSTOI has proven to be useful in CI studies for comparing relative expected speech intelligibility outcomes [39].
Specifically, the purpose of this metric is to evaluate the potential relative variations in speech performance that could be achieved in behavioral experiments, rather than providing an exact estimation of an individual's CI performance. The VSTOI score ranges from 0 to 1, where the higher score represents a predicted higher speech performance.
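The SNRi and LCC measures described above can be sketched for electrodograms stored as lists of channels. This is a minimal illustration with hypothetical function names and toy values; in particular, expressing SNRi as the ratio of the input error energy to the output error energy (both relative to the clean electrodogram) is our reading of the description, not a formula confirmed by the text.

```python
import math


def snr_improvement(noisy, clean, denoised):
    """SNRi in dB for electrodograms given as lists of channels (lists of frames)."""
    err_in = sum((n - c) ** 2 for rn, rc in zip(noisy, clean) for n, c in zip(rn, rc))
    err_out = sum((d - c) ** 2 for rd, rc in zip(denoised, clean) for d, c in zip(rd, rc))
    return 10.0 * math.log10(err_in / err_out)


def lcc(x, y):
    """Pearson linear correlation coefficient between two channel signals."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy single-channel electrodograms (hypothetical values).
clean = [[0.0, 0.5, 1.0, 0.5]]
noisy = [[0.2, 0.7, 0.8, 0.9]]
denoised = [[0.1, 0.6, 0.9, 0.7]]
snri = snr_improvement(noisy, clean, denoised)
channel_lccs = [lcc(c, d) for c, d in zip(clean, denoised)]
```

Halving every error, as in this toy case, quarters the error energy and therefore yields an improvement of about 6 dB.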
In this work, speech recognition performance was estimated using the clean unprocessed speech as a reference and the vocoded denoised speech as the processed signal. The vocoded speech was obtained from the electrodograms (p k ) by expanding the amplitudes contained in the electrodogram signals through the inverse LGF operation. Next, the expanded amplitudes contained in each band were used to amplitude-modulate band-pass filtered noise channels. The center frequencies of the band-pass filters used to obtain the modulated noise bands correspond to the ones mapped to each of the CI electrodes. Finally, by summing up all amplitude-modulated noise bands the vocoded signal is obtained.
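The inverse LGF expansion used as the first vocoder step can be sketched by inverting the compression function with the constants quoted in the ACE description. The function names are hypothetical, and mapping a zero level to a zero envelope (an unstimulated channel treated as silence) is an assumption.

```python
import math

RHO, BASE, SAT = 416.2, 4 / 256, 150 / 256  # LGF constants quoted in the text


def lgf(envelope):
    """Forward loudness growth function, clipped to [0, 1]."""
    if envelope <= BASE:
        return 0.0
    if envelope >= SAT:
        return 1.0
    return math.log(1.0 + RHO * (envelope - BASE) / (SAT - BASE)) / math.log(1.0 + RHO)


def inverse_lgf(level):
    """Expand a normalized level in [0, 1] back to an envelope amplitude."""
    if level <= 0.0:
        return 0.0  # assumption: an unstimulated channel is treated as silence
    scaled = ((1.0 + RHO) ** level - 1.0) / RHO
    return BASE + (SAT - BASE) * scaled


# Round trip on a few hypothetical envelope values inside the dynamic range.
recovered = [inverse_lgf(lgf(e)) for e in (0.05, 0.1, 0.3, 0.5)]
```

The expanded amplitudes would then modulate the band-pass filtered noise channels; that filtering stage is not shown here.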

2) Behavioral Evaluation: a) Participant demographics:
Eight postlingually deafened CI users participated in the listening tests. All participants were native German speakers and had been implanted for several years. They were invited to participate in a 3-hour test at the German Hearing Center of the Hannover Medical School (MHH), for which the travel costs were covered. The experiment was granted ethical approval by the MHH ethics commission. A synopsis of the patient-related data is shown in Table III.
b) Experimental setup: The testing material was processed to obtain electrodograms, which were then delivered to the CI on the participants' self-reported best-performing hearing side (as indicated in Table III) through direct stimulation using the RF GeneratorXS interface (Cochlear Ltd.), controlled by MATLAB and the Nucleus Implant Communicator V.3 (Cochlear Ltd.). The experiments were conducted on a personal computer with custom-made software written in MATLAB. Before the experiments began, a hardware security check was conducted by analyzing the signals generated by the research interface with an oscilloscope.
During the experiment, the CI participant was accompanied by two observers in the laboratory. One observer operated the software, while the other counted the number of correctly identified words by marking them on a corresponding printed list. Each listening condition was evaluated twice, using different randomly selected lists of HSM sentences. The final score was computed by averaging the number of correctly identified words for each condition, resulting in the word recognition score (WRS). The test SNR was adjusted to a level where the participant could understand between 20% and 80% of the presented words using the unprocessed ACE noisy condition and was assessed for the two different types of noise (as shown in Table III).

1) Objective Instrumental Results: a) SNRi:
Fig. 2 shows exemplary clean, unprocessed, and denoised electrodograms obtained with each of the algorithms. All results presented in this section are based on the electrodograms extracted from the HSM speech dataset with SSN and ICRA7 background noises. Fig. 3 illustrates the SNRi obtained with each of the algorithms. The Deep ACE model demonstrated superior performance over TasNet+ACE and Wiener+ACE in all conditions, particularly at low SNRs. This finding suggests that the Deep ACE sound coding approach presented here represents an improvement over the model introduced in [23], where no improvement in SNR was observed compared to the competing front-end deep-learning baseline (TasNet). The SNRi values for the TIMIT and LibriVox 2 speech datasets were also analyzed; although they are not reported in this study, similar patterns were identified under all testing noise conditions. In general, these observations demonstrate a substantial improvement with respect to the previous version of the model presented in [23].
Fig. 3. Box plots showing the SNRi scores in dB for the tested algorithms in SSN and ICRA7 noises for the different SNRs using the HSM speech dataset. All pair-wise differences were statistically significant. The black horizontal bars within each box represent the median for each condition, the circle-shaped marks indicate the mean improvement, and the top and bottom extremes of the boxes indicate the Q 3 = 75% and Q 1 = 25% quartiles, respectively. The box length is given by the interquartile range (IQR), used to define the whiskers that show the variability of the data above the upper and lower quartiles (the upper whisker is given by Q 3 + 1.5·IQR and the lower whisker is given by Q 1 − 1.5·IQR [41]). Black dots indicate observations that fall beyond the whisker range (outliers).
b) LCC: Here we assess the similarity between the original clean and denoised electrodograms produced by the different algorithms. Fig. 4 shows the obtained LCCs as a function of the CI electrode numbers. It can be seen that the Wiener+ACE condition shows the lowest correlation for the lower frequency bands and that Deep ACE shows, in general, the highest LCCs. The results suggest that denoising mid-low frequencies is more challenging, while denoising higher frequencies is easier. This may be due to the predominance of lower-frequency noise signals and the relative scarcity of higher-frequency signals in the target. Specifically, note how LCCs were lower for the SSN noise, in which low frequencies are dominant, than for the ICRA7 noise condition.
Fig. 4. Polynomial regressions showing the channel-wise LCCs between processed and clean electrodograms for the different algorithms and noises using the HSM dataset. Shaded areas represent the 95% confidence level interval [41]. Higher electrode numbers represent lower frequencies.
c) VSTOI: Fig. 5 illustrates the VSTOI scores obtained by the evaluated algorithms in different speech and noise conditions. In general, the VSTOI scores obtained with the proposed Deep ACE model are higher and agree with the obtained SNRi. These results also represent a substantial improvement compared to the model presented in [23].
In quiet, the mean VSTOI scores obtained by ACE, TasNet+ACE, and Deep ACE were 0.807, 0.789, and 0.803, respectively.
Fig. 5. Box plots showing the VSTOI scores for the tested algorithms in SSN and ICRA7 noises for the different SNRs using the HSM speech dataset. All pair-wise differences were statistically significant. The black horizontal bars within each box represent the median for each condition, the circle-shaped marks indicate the mean improvement, and the top and bottom extremes of the boxes indicate the Q 3 = 75% and Q 1 = 25% quartiles, respectively. The box length is given by the interquartile range (IQR), used to define the whiskers that show the variability of the data above the upper and lower quartiles (the upper whisker is given by Q 3 + 1.5·IQR and the lower whisker is given by Q 1 − 1.5·IQR [41]). Black dots indicate observations that fall beyond the whisker range (outliers).
2) Behavioral Results: Fig. 7 shows the WRS measured in quiet for eight CI subjects. We evaluated ACE and Deep ACE without background noise to test whether the latter introduced any artifacts that compromised the intelligibility of the clean speech signals. A Wilcoxon signed-rank test [42] showed no significant differences between the mean WRS measured using ACE and Deep ACE (p = 0.85), confirming that our method coded the clean speech accurately. Fig. 6 shows the WRS in noise measured using the different algorithms for the different noises. The normally distributed mean WRS values were evaluated using two 1-way repeated measures analyses of variance (ANOVA; [43]) for each noise condition, with the tested algorithm as the factor. Any ANOVA that revealed a significant effect was followed up by the required post-hoc tests, for which type I error was corrected based on the Holm-Bonferroni method [44].
The ANOVAs revealed a significant effect of algorithm in the measured mean WRS when using SSN background noise [F (3,21)  In order to assess the WRS benefit obtained with each of the three algorithms, the improvement in WRS with respect to ACE was computed ($\Delta\text{WRS} = \text{WRS}_{\text{denoised}} - \text{WRS}_{\text{ACE}}$).
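The Holm-Bonferroni correction applied to the post-hoc tests can be sketched as a step-down procedure. The function name and the p-values below are hypothetical and are not results from this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values for three pairwise comparisons.
decisions = holm_bonferroni([0.04, 0.004, 0.03])
```

Here only the smallest p-value survives: the second-smallest (0.03) exceeds its threshold of 0.05/2, so the procedure stops and the remaining comparison is not tested further.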

IV. DISCUSSION
In this work, we propose an end-to-end speech coding and denoising strategy for CIs: Deep ACE. The vast majority of speech enhancement algorithms for CIs rely on front-end processing that discards potentially rich sources of information. For this reason, we investigate an end-to-end deep learning model that merges the denoising preprocessing stage with the CI sound coding strategy. This approach leverages the simplicity of the output signal to be estimated, the electrodogram, which does not require any phase information to be reconstructed, potentially facilitating CI noise reduction.
Combining the noise reduction algorithm with the CI sound coding strategy has the added advantage of reducing processing latency when compared to front-end methods. For instance, using a front-end TasNet denoising block would result in a latency of 4 ms, whereas the Deep ACE model presented here only introduces 2 ms of latency. This is particularly crucial for devices like CIs that need to transmit signals with minimal latency. For example, in the case of single-sided deaf individuals (i.e., CI in one ear and normal hearing in the other), CI processing latency is of utmost importance, as these users are exposed to relative sound delays between the CI and the normal-hearing ear of 10-12 ms [45]. Here, the goal is to reduce CI processing time to align with the natural delay caused by the traveling wave inside the cochlea, which ranges from 1 to 9 ms and is longer at lower frequencies [46]. This is desirable because relative delay differences between the CI and acoustic listening sides can disrupt spatial hearing for single-sided CI users [45]. Additionally, lowering latency is important to avoid desynchronization between the speech being spoken and the speech being perceived, as well as other audiovisual mismatches that could reduce the benefit of lip reading.
This work builds on a previous study [23], which introduced Deep ACE for the first time. Here, we have optimized the architecture and introduced a new CI-specific loss function, aiming at improved speech enhancement performance and greater generalization power. The results indicate that the presented end-to-end CI speech enhancement model outperforms the front-end baseline algorithms in terms of SNRi and predicted speech intelligibility. Additionally, these findings indicate that the model has strong generalization capabilities, performing well with new, unfamiliar data and exhibiting resilience to various types of noise and speech signals. This represents a notable advancement over the model featured in [23], which utilized some of the same test materials as those used in the training phase.
Fig. 8. Violin plots showing the WRS improvement by processing the noisy signals with the different algorithms compared to ACE. The black horizontal bars within each of the boxes represent the median for each condition, the diamond-shaped marks indicate the mean improvement, and the top and bottom extremes of the boxes indicate the Q 3 = 75% and Q 1 = 25% quartiles, respectively. The box length is given by the interquartile range (IQR), used to define the whiskers that show the variability of the data above the upper and lower quartiles (the upper whisker is given by Q 3 + 1.5·IQR, and the lower whisker is given by Q 1 − 1.5·IQR [41]). Asterisks on top of the significance bar indicate the significance level (* p < 0.05, ** p < 0.01, *** p < 0.001). Black dots indicate observations that fall beyond the whisker range (outliers).
The behavioral speech tests with no background noise showed that the proposed end-to-end deep learning coding strategy 'Deep ACE' can be used to accurately code the clean speech captured by the CI microphone. Specifically, speech tests in quiet revealed no significant differences in speech understanding between the clinical ACE sound coding strategy and the proposed Deep ACE (see Fig. 7). Furthermore, word recognition scores measured in noise showed a benefit of using all the speech-denoising methods, obtaining a statistically significant improvement relative to the baseline ACE condition, as seen in Fig. 6. Note that the improvement obtained here by Wiener+ACE with ICRA7 background noise was not observed in [23]; however, it is consistent with other studies [9]. This result may be explained by the fact that, in this work, this condition was mostly tested at positive SNRs (see Table III). Finally, when comparing the WRS improvement with respect to ACE obtained by the three tested speech-denoising algorithms, Deep ACE outperformed the other two, obtaining the highest WRS benefit (see Fig. 8). This benefit of Deep ACE was also not observed in the listening tests performed in [23]. Although not statistically significant, the TasNet+ACE condition demonstrated a higher WRS improvement score compared to the Wiener+ACE condition when tested with ICRA7 background noise. This outcome is consistent with the objective measures that indicate a greater improvement in SNR and VSTOI scores, as shown in the right panels of Figs. 3 and 5.

V. CONCLUSION
In this study, we present Deep ACE, a sound coding and denoising strategy for CIs that utilizes end-to-end deep learning processing. This method aims to provide precise acoustic signal coding like ACE while effectively removing background noise without adding processing latency relative to ACE. We assessed the performance of the proposed model through both objective measures and listening tests with eight CI users, comparing its performance to the standard ACE and two front-end baseline models, namely the Wiener filter and TasNet. Our results indicated that Deep ACE effectively codes speech signals and outperforms the baseline models in both objective measures and listening tests. These findings suggest that Deep ACE has the potential to replace the current clinical ACE sound coding strategy and improve speech comprehension for CI users in noisy environments.