1. INTRODUCTION
The detection of spoofed speech generated by text-to-speech (TTS) and voice conversion (VC) systems is usually formulated as a binary classification task [1]. A detector, referred to as a spoofing countermeasure (CM), requires a large amount of training data containing diverse human (bona fide) and synthesized (spoofed) speech waveforms. However, preparing diverse spoofed training data is costly. For example, it took a few months of trial and error to develop the TTS and VC systems that generated the training set of the ASVspoof 2019 logical access (LA) database [2].