
Triggered Attention for End-to-end Speech Recognition


Abstract:

A new system architecture for end-to-end automatic speech recognition (ASR) is proposed that combines the alignment capabilities of the connectionist temporal classification (CTC) approach and the modeling strength of the attention mechanism. The proposed system architecture, named triggered attention (TA), uses a CTC-based classifier to control the activation of an attention-based decoder neural network. This allows for a frame-synchronous decoding scheme with an adjustable look-ahead parameter to control the induced delay and opens the door to streaming recognition with attention-based end-to-end ASR systems. We present ASR results of the TA model on three data sets of different sizes and languages and compare the scores to a well-tuned attention-based end-to-end ASR baseline system, which consumes input frames in the traditional full-sequence manner. The proposed TA decoder concept achieves similar or better ASR results in all experiments compared to the full-sequence attention model, while also limiting the decoding delay to two look-ahead frames, which in our setup corresponds to an output delay of 80 ms.
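
To make the triggering mechanism concrete, the following minimal sketch (not the authors' implementation) shows a frame-synchronous decoding loop in which a greedy CTC classifier scans the encoder output and, at the first frame of each new non-blank label, activates the attention decoder on the encoder frames seen so far plus a small look-ahead window. The function names, the blank index, and the toy inputs are illustrative assumptions; only the two-frame look-ahead (80 ms in the paper's setup, i.e., roughly 40 ms per encoder frame) comes from the abstract.

```python
# Minimal sketch of triggered-attention decoding (illustrative only).
import numpy as np

BLANK = 0          # assumed CTC blank index
LOOK_AHEAD = 2     # encoder frames of look-ahead (80 ms in the paper's setup)

def ctc_step(ctc_posteriors, t):
    """Greedy CTC label at encoder frame t (stand-in for the CTC classifier)."""
    return int(np.argmax(ctc_posteriors[t]))

def attention_decode(encoder_frames, history):
    """Placeholder for the attention decoder: conditioned on the truncated
    encoder sequence and the previously emitted tokens, returns one token id."""
    return len(history) + 1

def triggered_attention_decode(encoder_out, ctc_posteriors):
    outputs, prev_label = [], BLANK
    T = len(encoder_out)
    for t in range(T):
        label = ctc_step(ctc_posteriors, t)
        # Trigger: first frame of a new non-blank CTC label.
        if label != BLANK and label != prev_label:
            end = min(t + 1 + LOOK_AHEAD, T)   # truncated encoder sequence
            outputs.append(attention_decode(encoder_out[:end], outputs))
        prev_label = label
    return outputs

# Toy usage with random encoder outputs and CTC posteriors.
rng = np.random.default_rng(0)
enc = rng.standard_normal((20, 8))
post = rng.random((20, 5))
print(triggered_attention_decode(enc, post))
```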
Date of Conference: 12-17 May 2019
Date Added to IEEE Xplore: 17 April 2019
Conference Location: Brighton, UK

1. INTRODUCTION

End-to-end, or sequence-to-sequence, neural network models have recently gained increasing interest and popularity in the automatic speech recognition (ASR) community [1]–[4]. The output of an end-to-end ASR system is usually a sequence of graphemes, which can be single letters or larger units such as word pieces or entire words [5]. The appeal of end-to-end ASR is that it enables a simplified system architecture compared to traditional ASR systems [6], being composed of neural network components only and avoiding the need for language-specific linguistic expert knowledge to build such systems. Connectionist temporal classification (CTC) [7] and the attention mechanism [8] are the two most widely used neural network architectures for end-to-end ASR, and attention-based encoder-decoder neural networks have been shown to outperform CTC-based neural networks [9], [10]. However, attention-based decoders are not inherently well suited to streaming application, i.e., to computing outputs while audio samples are still being recorded, since the attention weights are typically computed over the encoded input sequence of an entire speech utterance, which is referred to as full-sequence mode; unlike CTC, the attention mechanism does not align the input and output sequences frame by frame. Recently, the neural transducer (NT) concept [11] was proposed, which adds a block-processing strategy to the attention mechanism by operating on chunks of a fixed number of input frames and by introducing a special symbol that marks the end of the output sequence for each chunk. Disadvantages of the NT model are that its training requires alignment information from an auxiliary ASR system and that parameter initialization from a pre-trained full-sequence model is needed to achieve high recognition accuracy [12].
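
The following minimal numpy sketch illustrates the full-sequence limitation described above: in dot-product attention, the context vector at each decoder step is a softmax-weighted sum over all encoder frames, so even the first output token depends on the final input frame. The variable names and dimensions are illustrative assumptions, not taken from the paper.

```python
# Sketch of why full-sequence attention is hard to stream (illustrative only).
import numpy as np

def full_sequence_attention(query, enc):
    """Dot-product attention context computed over the complete encoder sequence."""
    scores = enc @ query                      # one score per encoder frame, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over every frame
    return weights @ enc                      # context vector requires all T frames

rng = np.random.default_rng(0)
enc = rng.standard_normal((50, 16))           # 50 encoder frames, feature dim 16
context = full_sequence_attention(rng.standard_normal(16), enc)
print(context.shape)                          # (16,)
```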
