An utterance training system utilizing automatic speech recognition (ASR) has been developed as a computer-aided language laboratory system. Because the performance of ASR is seriously degraded by surrounding noise, many noise reduction methods have been proposed. In an utterance training system in particular, learners often sit side by side in a classroom, so each learner's utterances act as interfering noise for the other learners. Spectral subtraction, one such noise reduction method, suppresses noise by subtracting a noise spectrum estimated from segments of the observed signal that contain no voice activity. However, it relies on the assumption that the noise is stationary. Multi-channel methods such as Delay and Sum, Griffiths-Jim, and various Blind Source Separation methods are applicable to non-stationary noise, but they assume that all input signals are synchronized. In this paper, a time-frequency masking method is proposed that utilizes signals observed at distributed computers connected over a TCP/IP network. Because of the characteristics of TCP/IP-based networks, variable transmission delays are unavoidable, so signals from different computers cannot be perfectly synchronized even when a time synchronization protocol such as the Network Time Protocol is used. The proposed method is based on the assumption that the noise spectrum is stable for a certain duration. Under this assumption, the asynchronous signals observed at the distributed computers are used for speech enhancement based on time-frequency masking. Simulation results suggest that the method can improve ASR performance when several interfering speakers are present around the target speaker.
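The idea of masking with loosely synchronized reference signals can be illustrated with a minimal sketch. It is not the algorithm of the paper, whose details the abstract does not give. It assumes a binary mask, a hypothetical helper `stft_mag`, and a hypothetical `context` parameter: each interferer's magnitude spectrum is taken as the maximum over a few neighbouring frames, which is one simple way to use the assumption that the noise spectrum is stable for a certain duration, so a small timing offset between computers matters less.

```python
import numpy as np

def stft_mag(x, frame=256, hop=128):
    """Magnitude spectrogram of x, shape (num_frames, frame // 2 + 1)."""
    window = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * window for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1))

def binary_mask(target_mag, interferer_mags, context=4):
    """Boolean mask that keeps a time-frequency bin only where the target
    microphone is louder than every interferer microphone.

    Assumption (illustrative, not from the paper): each interferer spectrum
    is replaced by its maximum over +/- `context` frames, so the comparison
    tolerates a misalignment of up to `context` frames.
    """
    n = target_mag.shape[0]
    mask = np.ones(target_mag.shape, dtype=bool)
    for mag in interferer_mags:
        # Repeat edge frames so every frame has a full neighbourhood.
        padded = np.pad(mag, ((context, context), (0, 0)), mode="edge")
        shifted = np.stack([padded[k:k + n] for k in range(2 * context + 1)])
        local_max = shifted.max(axis=0)
        mask &= target_mag > local_max
    return mask
```

To enhance the target signal, such a mask would be multiplied with the complex spectrogram of the target microphone before resynthesis. The frame length, hop size, and `context` width here are arbitrary choices for illustration.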