In this paper, we describe our recent effort to combine speech recognition and understanding into a single-pass decoding process. The goal is to use semantic structure not only to handle disfluencies better and improve overall understanding accuracy, but also to shorten response time and achieve higher interactivity. Three related techniques are instrumental in our approach. First, we employ the unified language model (ULM) to incorporate the semantic schema into the recognition language model. Second, we extend the search process from word-synchronous to semantic-object-synchronous (SOS) decoding. Finally, we use sequential detection to defer, reject, or accept semantic hypotheses and execute the consequent dialog actions while the user's utterance is still ongoing. We incorporated these methods into SALT and HTML and conducted comparative user studies based on the MiPad scenarios. The experimental results show that the system copes gracefully with spontaneous speech and that users prefer the highly interactive nature of such systems, even though there are no significant differences in task completion rate or understanding accuracy. The interactive interface does, however, allow a more effective visual prompting strategy, which contributes to significantly fewer out-of-grammar utterances.