Abstract:
An automatic speech recognition (ASR) system is a key component in current speech-based systems. However, the surrounding acoustic noise can severely degrade the performance of an ASR system. An appealing solution to this problem is to augment conventional audio-based ASR systems with visual features describing lip activity. This paper proposes a novel end-to-end, multitask learning (MTL), audiovisual ASR (AV-ASR) system. A key novelty of the approach is the use of MTL, where the primary task is AV-ASR, and the secondary task is audiovisual voice activity detection (AV-VAD). We obtain a robust and accurate audiovisual system that generalizes across conditions. By detecting segments with speech activity, the AV-ASR performance improves, as its connectionist temporal classification (CTC) loss function can leverage the alignment information from AV-VAD. Furthermore, the end-to-end system learns a discriminative high-level representation for both speech tasks from the raw audiovisual inputs, providing the flexibility to mine information directly from the data. The proposed architecture considers the temporal dynamics within and across modalities, providing an appealing and practical fusion scheme. We evaluate the proposed approach on a large audiovisual corpus (over 60 hours), which contains different channel and environmental conditions, comparing the results with competitive single task learning (STL) and MTL baselines. Although our main goal is to improve the performance of our ASR task, the experimental results show that the proposed approach achieves the best performance across all conditions for both speech tasks. In addition to state-of-the-art performance in AV-ASR, the proposed solution also provides valuable information about speech activity, solving two of the most important tasks in speech-based applications.
Published in: IEEE Transactions on Multimedia (Volume: 23)

Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA
Fei Tao (Student Member, IEEE) received the B.S. degree in electrical engineering from Beijing Jiaotong University (BJTU), Beijing, China, in 2009, the M.S. degree from Texas Southern University (TSU), Houston, TX, USA, and the Ph.D. degree from the University of Texas at Dallas, Richardson, TX, USA, in 2018. His research interests include digital signal processing, speech and video processing, audiovisual speech recognition, and multimodal fusion. At BJTU, he received the university scholarship from 2005 to 2008. He also received the second prize in the 2008 Beijing College-Student Circuits Design Contest. In 2011, he received the Dwight David Eisenhower President Fellowship for his research in Intelligent Transportation Systems at TSU.

Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA
Carlos Busso (Senior Member, IEEE) received the B.S. and M.S. (hons.) degrees in electrical engineering from the University of Chile, Santiago, Chile, in 2000 and 2003, respectively, and the Ph.D. degree in electrical engineering from the University of Southern California (USC), Los Angeles, CA, USA, in 2008. He is an Associate Professor with the Electrical Engineering Department, University of Texas at Dallas (UTD), Richardson, TX, USA. He was selected by the School of Engineering of Chile as the best electrical engineer who graduated in 2003 across Chilean universities. At USC, he received a Provost Doctoral Fellowship from 2003 to 2005 and a fellowship in Digital Scholarship from 2007 to 2008. At UTD, he leads the Multimodal Signal Processing laboratory [http://msp.utdallas.edu]. His research interest is in human-centered multimodal machine intelligence and applications. His current research includes the broad areas of affective computing, multimodal human-machine interfaces, nonverbal behaviors for conversational agents, in-vehicle active safety systems, and machine learning methods for multimodal processing. His work has direct implications in many practical domains, including national security, health care, entertainment, transportation systems, and education. He is the recipient of an NSF CAREER Award. In 2014, he received the ICMI Ten-Year Technical Impact Award. In 2015, his student received the third prize IEEE ITSS Best Dissertation Award (N. Li). He also received the Hewlett Packard Best Paper Award at the IEEE ICME 2011 (with J. Jain), and the Best Paper Award at the AAAC ACII 2017 (with Yannakakis and Cowie). He is the co-author of the winning paper of the Classifier Sub-Challenge event at the Interspeech 2009 emotion challenge. He was the General Chair of ACII 2017 and ICMI 2021. He is a member of ISCA and AAAC, and a senior member of ACM.