I. Introduction
Voice user interfaces (VUIs) have recently gained widespread popularity owing to their contactless and human-centered interaction experience. Modern audio systems underlying VUIs offer powerful speech processing capabilities driven by deep learning, including keyword spotting [1], [2], speaker identification [3], [4], and speech understanding [5], [6]. However, these capabilities also bring multifaceted security risks, stemming from the inherent vulnerability of deep learning models and the ubiquity of VUIs. In particular, recent studies have demonstrated that adversarial example attacks pose a significant threat to audio systems: by only slightly perturbing the input, an adversary can readily compromise a VUI. This opens the door to stealthy device activation, targeted user impersonation, and even malicious command execution.