Speech recognition systems need to operate in a wide range of conditions. They should therefore be robust to extrinsic variability caused by acoustic factors such as speaker differences, the transmission channel, and background noise. In many scenarios, multiple factors simultaneously affect the underlying “clean” speech signal. This paper examines techniques to handle both speaker and background noise differences. An acoustic factorization approach is adopted, in which separate transforms represent the speaker factor [maximum-likelihood linear regression (MLLR)] and the noise and channel factors [model-based vector Taylor series (VTS)]. This framework is more flexible than the standard approach of modeling the combined impact of speaker and noise with a single transform. For example, factorization allows the speaker characteristics obtained in one noise condition to be applied in a different environment. To obtain this factorization, modified versions of MLLR and VTS training and application are derived. The proposed scheme is evaluated for both adaptation and factorization on the AURORA4 data.
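The factorization idea can be illustrated with a minimal sketch. It is not the paper's derivation. It assumes a single Gaussian mean in the log-spectral domain, an MLLR-style affine transform acting on the clean-speech mean, and the commonly used VTS mismatch function y = x + h + log(1 + exp(n - x - h)) applied per dimension. The ordering of the two transforms and all function and variable names are hypothetical choices made for illustration.

```python
import math

def apply_speaker_transform(mean, A, b):
    """MLLR-style affine mean transform, A * mean + b, on plain lists."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]

def apply_noise_compensation(mean, noise_mean, channel):
    """Assumed VTS-style mismatch function in the log-spectral domain,
    y = x + h + log(1 + exp(n - x - h)), evaluated per dimension."""
    return [x + h + math.log1p(math.exp(n - x - h))
            for x, n, h in zip(mean, noise_mean, channel)]

def adapt_mean(clean_mean, speaker, noise):
    """Compose the two factors: speaker transform first, then noise."""
    A, b = speaker
    noise_mean, channel = noise
    speaker_mean = apply_speaker_transform(clean_mean, A, b)
    return apply_noise_compensation(speaker_mean, noise_mean, channel)

# Hypothetical values: one speaker transform, two noise conditions.
clean_mean = [2.0, 1.0]
speaker = ([[1.1, 0.0], [0.0, 0.9]], [0.2, -0.1])
noise_a = ([0.5, 0.5], [0.0, 0.0])   # (noise mean, channel offset)
noise_b = ([3.0, 2.5], [0.1, 0.1])

# The same speaker transform is reused unchanged under both noise conditions.
mean_in_a = adapt_mean(clean_mean, speaker, noise_a)
mean_in_b = adapt_mean(clean_mean, speaker, noise_b)
```

Because the speaker and noise parameters are kept in separate arguments, the speaker transform estimated in one environment can be paired with noise parameters estimated in another, which is the property the abstract highlights.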
IEEE Transactions on Audio, Speech, and Language Processing (Volume: 20, Issue: 7)
Date of Publication: Sept. 2012