
A Study of Child Speech Extraction Using Joint Speech Enhancement and Separation in Realistic Conditions


Abstract:

In this paper, we design a novel joint framework of speech enhancement and speech separation for child speech extraction in realistic conditions, targeting the problem of extracting child speech from daily conversations in the BabyTrain mega corpus. To the best of our knowledge, this is the first discussion of a feasible method for child speech extraction in realistic conditions. First, we present a detailed analysis of the BabyTrain mega corpus, which was recorded in adverse environments. We observe background noise, reverberation, and child speech that is partially obscured by adult speech (for instance, due to speaker overlap, but also imitation by the adult). Motivated by these observations, we construct a joint framework of speech enhancement and speech separation for child speech extraction. To measure extraction quality in realistic conditions, we propose several objective measurements to evaluate the performance of our system, which differ from those commonly used on simulated data. Compared with an unprocessed baseline and a classification approach, our proposed approach yields the best performance on all subsets of BabyTrain.
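The paper itself includes no code, so the following is only a minimal sketch of how such a two-stage pipeline might be wired together: an STFT-domain enhancement front-end followed by a mask-based separation back-end. The spectral-gating gain and the child_mask_fn callback (standing in for a trained separation network) are illustrative assumptions, not the authors' models.

    import numpy as np
    from scipy.signal import stft, istft

    def enhance(x, fs):
        # Crude spectral gating: attenuate time-frequency bins whose magnitude
        # is close to a per-frequency noise floor estimated from quiet frames.
        f, t, Z = stft(x, fs, nperseg=512)
        mag = np.abs(Z)
        floor = np.quantile(mag, 0.10, axis=1, keepdims=True)
        gain = np.clip((mag - floor) / (mag + 1e-8), 0.0, 1.0)
        _, x_hat = istft(Z * gain, fs, nperseg=512)
        return x_hat

    def extract_child(x, fs, child_mask_fn):
        # Apply a separation mask (e.g., predicted by a trained network) that
        # keeps the child speaker; child_mask_fn is a hypothetical placeholder.
        f, t, Z = stft(x, fs, nperseg=512)
        mask = child_mask_fn(np.abs(Z))  # values in [0, 1], same shape as Z
        _, child = istft(Z * mask, fs, nperseg=512)
        return child

    def extract_pipeline(x, fs, child_mask_fn):
        # One plausible ordering for a joint enhancement-plus-separation
        # pipeline: denoise first, then extract the target speaker.
        return extract_child(enhance(x, fs), fs, child_mask_fn)

In practice, both stages would be learned models trained (and possibly fine-tuned jointly) on data matched to the target conditions; the fixed spectral gate above merely stands in for the enhancement stage.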
Date of Conference: 04-08 May 2020
Date Added to IEEE Xplore: 09 April 2020
Conference Location: Barcelona, Spain
University of Science and Technology of China, Hefei, China
University of Science and Technology of China, Hefei, China
Laboratoire de Sciences Cognitives et Psycholinguistique, ENS, Paris, France
University of Science and Technology of China, Hefei, China
Georgia Institute of Technology, Atlanta, Georgia, USA

1. INTRODUCTION

Recent years have seen a veritable explosion in the use of child-centered audio recordings, gathered as infants and young children go about their day [1]. The resulting data are of interest both to a wide range of theoretical fields (e.g., developmental psychology, cognitive science) and to numerous applications (e.g., the diagnosis of potential language disorders, or measuring the effects of an intervention). Despite the interest in these data, very few analysis algorithms can cope with them, as they truly deserve the label 'in the wild'. To begin with, much of the recorded voice belongs to the infant or child wearing the device, who produces non-speech vocalizations (such as crying, as well as non-emotional, non-speech productions). Moreover, the other people recorded may vary in their closeness to the microphone, such that their voices alternate between near-field and far-field within the same recording. Finally, many people may be recorded; in our experience, children can come across 20 people over a normal day, with as many as 9 people in a 5-minute interval [2].
