1. INTRODUCTION
End-to-end, sequence-to-sequence neural network models have recently gained increased interest and popularity in the automatic speech recognition (ASR) community [1]–[4]. The output of an end-to-end ASR system is usually a sequence of graphemes, which can be single letters or larger units such as word-pieces and entire words [5]. The appeal of end-to-end ASR is that it enables a simplified system architecture compared to traditional ASR systems [6], since it is composed of neural network components only and avoids the need for language-specific linguistic expert knowledge to build such systems.

Connectionist temporal classification (CTC) [7] and the attention mechanism [8] are the two most widely used neural network architectures for end-to-end ASR, and attention-based encoder-decoder neural networks have been shown to outperform CTC-based neural networks [9], [10]. However, attention-based decoders are inherently ill-suited to streaming application, i.e., to computing outputs while audio samples are being recorded, since the attention weights are typically computed over the input sequence of an entire speech utterance, which is referred to as full-sequence mode. Unlike CTC, the attention mechanism therefore cannot align the input and output sequences frame by frame. Recently, the neural transducer (NT) concept was proposed [11], which adds a block processing strategy to the attention mechanism by operating on chunks of a fixed number of input frames and by introducing a special symbol that marks the end of the output sequence for each chunk. Disadvantages of the NT model are that training requires alignment information from an auxiliary ASR system and that achieving high recognition accuracy requires parameter initialization from a pre-trained full-sequence model [12].
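The block processing strategy described above can be illustrated with a minimal sketch: the input frames are partitioned into fixed-size chunks, and for each chunk the decoder emits output symbols until a special end-of-block symbol is produced. The names here (`EOB`, `decode_block`, the toy decoder, the block size `W`) are illustrative assumptions, not the actual NT implementation of [11]:

```python
# Illustrative sketch of neural-transducer-style block processing.
# A toy rule-based decoder stands in for the attention decoder.

EOB = "<e>"  # assumed end-of-block symbol (illustrative name)

def split_into_blocks(frames, W):
    """Partition the input frames into chunks of at most W frames."""
    return [frames[i:i + W] for i in range(0, len(frames), W)]

def decode_block(block, decoder):
    """Run the (toy) decoder on one block until it emits EOB."""
    outputs = []
    while True:
        symbol = decoder(block, outputs)
        if symbol == EOB:
            break
        outputs.append(symbol)
    return outputs

def streaming_decode(frames, decoder, W=3):
    """Decode chunk by chunk, so outputs appear as frames arrive."""
    hypothesis = []
    for block in split_into_blocks(frames, W):
        hypothesis.extend(decode_block(block, decoder))
    return hypothesis

# Toy stand-in decoder: emit one label per frame, then EOB.
def toy_decoder(block, outputs):
    return block[len(outputs)] if len(outputs) < len(block) else EOB

print(streaming_decode(["a", "b", "c", "d", "e"], toy_decoder, W=2))
```

In a real NT model, `decode_block` would be an attention-based decoder whose attention is restricted to the current chunk, which is what makes streaming operation possible.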