1. INTRODUCTION
Recent years have seen a veritable explosion in the use of child-centered audio recordings, gathered as infants and young children go about their day [1]. The resulting data are of interest both to a wide range of disciplines (e.g., developmental psychology, cognitive science) and to numerous applications (e.g., the diagnosis of potential language disorders, the measurement of the effects of an intervention). Despite this interest, very few analysis algorithms can cope with such data, which truly deserve the label 'in the wild'. To begin with, much of the recorded voice belongs to the infant or child wearing the device, who produces non-speech vocalizations (such as crying, as well as non-emotional, non-speech productions). Moreover, the other people recorded may vary in their distance from the microphone, so that their voices alternate between near-field and far-field within the same recording. Finally, many people may be recorded: in our experience, children can encounter 20 people over a normal day, with as many as 9 people in a 5-minute interval [2].