Skip to Main Content
The enhancement of speech degraded by real-world interferers is a highly relevant and difficult task. Its importance arises from the multitude of practical applications, whereas the difficulty is due to the fact that interferers are often nonstationary and potentially similar to speech. The goal of monaural speech enhancement is to separate a single mixture into its underlying clean speech and interferer components. This under-determined problem is solved by incorporating prior knowledge in the form of learned speech and interferer dictionaries. The clean speech is recovered from the degraded speech by sparse coding of the mixture in a composite dictionary consisting of the concatenation of a speech and an interferer dictionary. Enhancement performance is measured using objective measures and is limited by two effects. A too sparse coding of the mixture causes the speech component to be explained with too few speech dictionary atoms, which induces an approximation error we denote source distortion. However, a too dense coding of the mixture results in source confusion, where parts of the speech component are explained by interferer dictionary atoms and vice-versa. Our method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.