Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust for long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective to synchronize the boundaries of MoChA to those of CTC. The CTC model shares an encoder with the MoChA model to enhance the encoder representation. Moreover, the proposed method provides alignment information learned in the CTC branch to the attention-based decoder. Therefore, CTC-ST can be regarded as self-distillation of alignment knowledge from CTC to MoChA. Experimental evaluations on a variety of benchmark datasets show that the proposed method significantly reduces recognition errors and emission latency simultaneously. The robustness to long-form and noisy speech is also demonstrated. We compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that the CTC-ST can achieve a comparable tradeoff of accuracy and latency without relying on external alignment information. The best MoChA system shows recognition accuracy comparable to that of RNN-transducer (RNN-T) while achieving lower emission latency.

I. INTRODUCTION
Online streaming automatic speech recognition (ASR) is a core technology for speech applications such as live captioning, simultaneous translation, voice search, and dialogue systems. The traditional but still dominant approach in production is a hybrid system that modularizes the entire system into an acoustic model, a pronunciation model, and a language model (LM). Recently, end-to-end (E2E) systems have achieved comparable performance to that of hybrid systems by optimizing a direct mapping function from the input speech to the target transcription [1]-[3]. Representative approaches include the connectionist temporal classification (CTC) [4], recurrent neural network transducer (RNN-T) [5], recurrent neural aligner (RNA) [6], hybrid autoregressive transducer (HAT) [7], and attention-based encoder-decoder (AED) [8], [9] models. With the simplified architecture, the E2E approaches are advantageous for rapid system development, on-device applications with a small footprint, and fast inference when a large amount of training data is available.
The E2E models have been compared in offline scenarios [10]-[12], and AED models are typically the best choice because of the strong token dependency on the decoder side. In online streaming scenarios, however, AED models are not suitable because they require the entire input in order to generate the initial token. On the other hand, frame-synchronous models such as CTC and RNN-T can easily be extended to the streaming setting. RNN-T has been a practical choice because of its better performance than CTC with the help of token dependency modeling in the prediction network [3], [13], [14]. However, it is known that RNN-T consumes significant memory during training [15], [16] and requires a large search space during inference because of its frame-wise prediction, which significantly slows down the decoding speed.
The main problem in making streaming AED models for practical systems is that they are not robust for long-form speech, which is not as problematic in frame-synchronous models [34]. Moreover, latency in the decision boundaries on token emission occurs in any E2E model [16], [32], [35]. This is because (1) the model is typically equipped with a time-restricted streaming encoder having limited future contexts, and (2) the model is optimized with an end-to-end objective, which encourages the decoder to use as many future observations as possible.
Previous studies on frame-synchronous models tackled this problem by shifting output frames [36] and leveraging frame-wise alignment supervision [37]-[40]. As for label-synchronous models, Inaguma et al. leveraged alignment information extracted from a hybrid system to reduce the emission latency of MoChA [32]. However, this approach still depends on a hybrid model and is not a purely end-to-end solution. While the delayed token generation problem occurs similarly in frame-synchronous models, CTC models are better than AED models in terms of latency, because they are optimized with the forward-backward algorithm and assume conditional independence on a per-frame basis. Moreover, the peaky alignments learned in CTC are expected to be compatible with the token boundaries in MoChA.
In this article, we propose a novel purely end-to-end training method to enhance the alignment learning process of streaming MoChA models without external alignments. We regard the peaks in CTC alignments as a reference for the token boundaries in MoChA; thus, we train a MoChA model to mimic a CTC model in order to detect token boundaries at similar positions. We refer to this method as CTC synchronous training (CTC-ST) [41]. This boundary supervision greatly eases the optimization of MoChA, especially with contaminated inputs as in SpecAugment [42]. This is because the accumulation of alignment errors in MoChA can be recovered with the help of the CTC alignments. The CTC model is jointly optimized with the MoChA model by having them share an encoder to encourage monotonic alignments in the MoChA decoder, similarly to the joint CTC/Attention framework [43]. In the proposed method, however, the CTC alignments are further provided to the MoChA decoder as supervision of token boundaries to restrict their positions. Because the alignment knowledge learned in CTC is transferred to improve MoChA's alignment in a unified architecture, we regard this framework as a form of self-distillation. Experimental evaluations on four benchmark datasets show that CTC-ST significantly improves the recognition accuracy, especially for long-form and noisy speech. We also demonstrate that CTC-ST can reduce the emission latency without external alignment information and achieve a tradeoff of the accuracy and emission latency comparable to that of alignment knowledge distillation from a hybrid system [32]. Finally, we compare MoChA with RNN-T in recognition accuracy and emission latency to demonstrate that CTC-ST can close the performance gap, so that the best MoChA system achieves recognition accuracy comparable to that of RNN-T and lower emission latency.

A. Streaming attention-based encoder-decoder model
We categorize streaming AED models into two groups in terms of how they segment speech frames for token generation.
1) Segmentation on encoder side: The first method in this category is NT [17], [45], [46], which performs label-synchronous decoding on every fixed-size input block and moves to the next block if it detects no additional token boundaries. ACS [19] extends the idea to an adaptive segmentation policy based on a halting mechanism [47]. CIF [20] further enhances ACS by applying a fine-grained segmentation within the encoder output. SCAMA [21] learns to count the number of tokens to be generated in each fixed input chunk. Sterpu et al. followed a similar idea [48]. Triggered attention [18], [49], [50] is based on the joint CTC/Attention framework and truncates encoder outputs with CTC spikes to perform global attention over the past encoder outputs from each spike position, but the decoding complexity is quadratic. Our work is different in that we leverage CTC alignments only during training, and the decoding complexity is linear. A scout network [51] learns to detect word boundaries with alignment information from a hybrid system during training to reduce the emission latency, and it performs global attention similarly to triggered attention. However, it introduces a dependency on frame-wise alignment supervision. These methods can detect token boundaries regardless of the input length, but they are limited in that contextual information is not used for segmentation.
2) Segmentation on decoder side: The methods in this category leverage decoder states as a query to segment the input speech for every token. Local windowing methods were proposed first [22]-[25]. GMM attention [26], [34] forces the center of attention to move monotonically to the end of the encoder output. Kong et al. further incorporated source-side information [52]. NAT [27] trains stochastic variables with a policy gradient, and the optimization was further improved in [53]. Hard monotonic attention (HMA) [54] also introduces stochastic variables to detect token boundaries, but the model can be trained efficiently with a cross-entropy objective. MoChA [28] relieves the strong monotonic constraint in HMA by introducing additional soft attention over a small window. Miao et al. proposed stable MoChA (sMoChA) by simplifying the attention calculation in MoChA to ease the optimization [30]. They further proposed monotonic truncated attention (MTA) [33] by using soft attention scores in HMA at test time as well to reduce the gap between training and testing. HMA was further extended to the Transformer decoder [55], [56], while Li et al. extended the idea of ACS to the Transformer decoder [57]. Incremental decoding uses offline models for streaming applications, but the decoding complexity is quadratic [58]-[61].

B. Emission latency in E2E ASR model
Emission latency inevitably occurs in any E2E ASR model because sequence-level optimization allows the model to use as much future information as possible. This problem was tackled in CTC acoustic models for the first time by constraining CTC paths during marginalization with frame-level alignments from a hybrid system [62]. Zhang et al. trained a CTC model jointly with a frame-level cross-entropy [38]. Similar methods have also been investigated for RNN-T by pretraining the encoder with a frame-level cross-entropy [39] and applying joint training [40]. Inaguma et al. investigated the application of alignment information to MoChA and reduced the recognition errors and emission latency simultaneously [32]. However, such alignment information is not necessarily available. Recently, FastEmit [63] was proposed by designing a new training objective for RNN-T to reduce the emission latency without any frame-level supervision, which was applied to two-pass E2E architectures in the voice search task successfully [64], [65]. The idea of FastEmit was extended to MoChA in [66], referred to as StableEmit. Self-alignment from RNN-T was also leveraged in [67]. Yu et al. trained a single RNN-T in both offline and streaming modes (dual-mode ASR) by sharing parameters [68].

C. Knowledge distillation for streaming ASR
Knowledge distillation [44] has also been investigated to improve the performance of streaming E2E ASR models. Previous works focused on distilling knowledge from an offline or streaming teacher model to a weaker streaming student model within the same decoder topology [69]-[76]. In frame-synchronous models, however, there exists a problem that the timing to emit tokens can differ between the teacher and student models, depending on the future context size in the encoder. Through an approach using bidirectional long short-term memory (BLSTM), Kurata et al. trained BLSTM-CTC to mimic LSTM-CTC in order to generate posterior spikes at similar positions and then distill the frame-level posterior probabilities to the LSTM-CTC model [77]. The idea was extended to RNN-T in [75]. Ding et al. jointly trained multiple teacher CTC models to synchronize their posterior spikes [78]. Distillation between different decoder topologies has also been investigated. Moriya et al. distilled knowledge from a teacher AED model to a student CTC model [79]. Self-distillation in a single E2E ASR model has also been proposed as an in-place operation, from an offline mode to a streaming mode [61], [68], and from a Transformer decoder to a CTC layer [80]. Unlike those previous methods, we focus on distilling the positions of token boundaries learned in a CTC model to an AED model, rather than distilling the posterior distributions. Moreover, the teacher and student models share the same encoder and are trained jointly from scratch.

III. BASICS
In this section, we review HMA and MoChA. Let $x = (x_1, \ldots, x_T)$ be an input speech sequence, $h = (h_1, \ldots, h_{T'})$ be encoder outputs ($T' \le T$), and $y = (y_1, \ldots, y_U)$ be the corresponding output token sequence. The encoder performs downsampling to reduce the input sequence length from $T$ to $T'$. We use $i$ and $j$ as the time indices of the output and input sequences, respectively.

A. Hard monotonic attention (HMA)
Standard offline AED models are based on the global attention mechanism [8], in which relevant source information is selected according to the target context via attention scores calculated by normalizing energy activations over $h$. However, this prevents the model from performing online streaming recognition, because the decoder must see all the encoder outputs to generate the initial token. Moreover, the decoding complexity at each generation step is in proportion to the encoder output length $T'$. This results in a total decoding complexity of $O(T'U)$. To start generating tokens when given partial acoustic observations during inference, HMA introduces discrete binary decision processes. As a result, it can perform decoding with linear-time complexity $O(T')$ during inference, but it behaves differently between the training and test times.
At test time, the decoder scans the encoder outputs $h_1, \ldots, h_{T'}$ from left to right. At every input frame index $j$, the decoder has the option to (1) stop at the current frame $j$ to generate a token or (2) move forward to the next frame $j + 1$ according to a selection probability $p_{i,j} \in [0, 1]$. A discrete decision $z_{i,j} \in \{0, 1\}$ on whether to stop at the $j$-th frame is sampled from a Bernoulli random variable parameterized by $p_{i,j}$ as
$$e_{i,j} = \mathrm{MonotonicEnergy}(s_i, h_j), \tag{1}$$
$$p_{i,j} = \sigma(e_{i,j}), \qquad z_{i,j} \sim \mathrm{Bernoulli}(p_{i,j}), \tag{2}$$
where $e_{i,j}$ is a monotonic energy activation, $s_i$ is the $i$-th decoder state, and $\sigma$ is a logistic sigmoid function. When $z_{i,j} = 1$, i.e., $p_{i,j} \ge 0.5$, the decoder stops at an index $j = t_i$ (referred to as the token boundary of the $i$-th token). Then, only the corresponding single encoder output $h_{t_i}$ is used for generating the next token and updating the decoder state. The next token boundary $t_{i+1}$ is determined by resuming scanning from the previous token boundary $j = t_i$.
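The left-to-right scan described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the selection probabilities are already available as a matrix `p[i, j]`, whereas in a real decoder each row depends on the decoder state and is computed only after the previous token has been emitted. The function name `hma_greedy_boundaries` and the 0-indexed frames are choices made for this sketch.

```python
import numpy as np

def hma_greedy_boundaries(p, threshold=0.5):
    """Return the stopping frame (0-indexed) for each output step.

    p[i, j] is the selection probability of output step i at frame j.
    Precomputing p is a simplification for illustration only.
    """
    num_steps, num_frames = p.shape
    boundaries = []
    j = 0  # scanning always resumes from the previous token boundary
    for i in range(num_steps):
        # Move forward while the decoder prefers not to stop (p < threshold).
        while j < num_frames and p[i, j] < threshold:
            j += 1
        if j == num_frames:
            break  # ran out of encoder outputs before emitting token i
        boundaries.append(j)
    return boundaries

p = np.array([[0.1, 0.7, 0.2, 0.9],
              [0.1, 0.2, 0.3, 0.8]])
print(hma_greedy_boundaries(p))
```

Because scanning resumes at the same frame where the previous token stopped, two consecutive tokens can share a boundary, and each frame is visited at most once per utterance apart from such repeated stops.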
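However, this hard assignment of $z_{i,j}$ is not differentiable. To perform the standard backpropagation training, the expected alignment scores $\alpha_{i,j}$ are calculated by marginalizing over all possible alignment paths as follows:
$$\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \Big( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \Big) = p_{i,j} \Big( (1 - p_{i,j-1}) \frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \Big). \tag{3}$$
Because $\alpha_{i,j}$ in Eq. (3) introduces a recurrence relation, it is difficult to calculate it in parallel over the input indices. However, by substituting $q_{i,j} = \alpha_{i,j}/p_{i,j}$, it can be calculated efficiently with the cumulative sum and product operations, denoted respectively as cumsum and cumprod, as follows:
$$\alpha_{i,:} = p_{i,:} \cdot \mathrm{cumprod}(1 - p_{i,:}) \cdot \mathrm{cumsum}\Big( \frac{\alpha_{i-1,:}}{\mathrm{cumprod}(1 - p_{i,:})} \Big),$$
where cumprod is the exclusive cumulative product, $\mathrm{cumprod}(x)_j = \prod_{k=1}^{j-1} x_k$. Finally, the monotonic energy activation $e_{i,j}$ in Eq. (1) is implemented as
$$e_{i,j} = g \frac{v^{\top}}{\lVert v \rVert} f(W_h h_j + W_s s_i + b) + r,$$
where $f$ is the nonlinear activation, $g$ and $v$ are parameters for weight normalization, and $W_h$, $W_s$, $b$, and $r$ are trainable parameters. We use a rectified linear unit (ReLU) activation function as $f$. Following [54], the scalar offset parameter $r$ is initialized as $-4$ in this work. To encourage the discreteness of $p_{i,j}$, zero-mean, unit-variance Gaussian noise is added to the pre-sigmoid activation in Eq. (2) during training. The subsequent token generation processes are the same as in the global AED model.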
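To make the equivalence between the recurrence and its parallel form concrete, the following NumPy sketch computes one output step of the expected alignment both ways, using the recurrence $q_{i,j} = (1 - p_{i,j-1}) q_{i,j-1} + \alpha_{i-1,j}$ with $\alpha_{i,j} = p_{i,j} q_{i,j}$. The function names and the clipping constant `eps` are assumptions of this sketch, and the initial alignment is taken to be a one-hot vector on the first frame.

```python
import numpy as np

def alpha_recurrence(p, alpha_prev):
    """Sequential form: q_j = (1 - p_{j-1}) q_{j-1} + alpha_prev_j, alpha_j = p_j q_j."""
    num_frames = p.shape[0]
    alpha = np.zeros(num_frames)
    q_prev = 0.0
    for j in range(num_frames):
        carry = (1.0 - p[j - 1]) * q_prev if j > 0 else 0.0
        q = carry + alpha_prev[j]
        alpha[j] = p[j] * q
        q_prev = q
    return alpha

def alpha_parallel(p, alpha_prev, eps=1e-10):
    """Parallel form with an exclusive cumulative product and a cumulative sum."""
    # Exclusive cumprod: entry j holds prod_{k<j} (1 - p_k); the first entry is 1.
    excl = np.concatenate(([1.0], np.cumprod(1.0 - p)[:-1]))
    # Clipping guards against division by a vanishing product (sketch-level safeguard).
    return p * excl * np.cumsum(alpha_prev / np.clip(excl, eps, 1.0))

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=(3, 8))   # 3 output steps, 8 encoder frames
alpha = np.zeros(8)
alpha[0] = 1.0                            # assumed initial alignment
for i in range(3):
    alpha = alpha_parallel(p[i], alpha)
    print(i, round(float(alpha.sum()), 4))  # the row sum never increases
```

The printed row sums illustrate why the scores are not a normalized distribution: each step can only lose probability mass relative to the previous one.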

B. Monotonic chunkwise attention (MoChA)
In HMA, source information is restricted to a single encoder output, and this strong constraint greatly sacrifices the accuracy in general. To overcome this problem by leveraging the surrounding contexts, MoChA introduces an additional soft attention mechanism over a fixed window of width w on top of HMA.
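At test time, soft attention scores $\beta_{i,j}$ are calculated over $w$ encoder outputs from every token boundary $t_i$ as follows:
$$\beta_{i,j} = \frac{\exp(u_{i,j})}{\sum_{k=t_i-w+1}^{t_i} \exp(u_{i,k})} \quad (t_i - w + 1 \le j \le t_i), \qquad c_i = \sum_{j=t_i-w+1}^{t_i} \beta_{i,j} h_j,$$
where $u_{i,j}$ is the chunk energy activation formulated as in Eq. (1) without weight normalization and the offset parameter $r$, and $c_i$ is a context vector for the $i$-th token. During training, $\beta_{i,j}$ can be calculated on top of the expected alignment score $\alpha_{i,j}$ as
$$\beta_{i,j} = \sum_{k=j}^{j+w-1} \frac{\alpha_{i,k} \exp(u_{i,j})}{\sum_{l=k-w+1}^{k} \exp(u_{i,l})}.$$
The computation of $\beta_{i,j}$ is expensive because of the nested summation. Fortunately, it can be computed more efficiently with a moving sum operation, denoted as MovSum, as follows:
$$\beta_{i,:} = \exp(u_{i,:}) \cdot \mathrm{MovSum}\Big( \frac{\alpha_{i,:}}{\mathrm{MovSum}(\exp(u_{i,:}), w, 1)}, 1, w \Big), \qquad \mathrm{MovSum}(x, b, f)_n = \sum_{m=n-(b-1)}^{n+f-1} x_m.$$
This computation can be implemented by 1-dimensional convolution. The context vector $c_i$ is calculated as
$$c_i = \sum_{j=1}^{T'} \beta_{i,j} h_j.$$
The objective function is formulated as the negative log-likelihood $L_{\mathrm{mocha}} = -\log P_{\mathrm{mocha}}(y|x)$, where $P_{\mathrm{mocha}}$ is the output probability distribution of MoChA.
To encourage MoChA to learn monotonic alignments, we train it jointly with an auxiliary CTC objective $L_{\mathrm{ctc}} = -\log P_{\mathrm{ctc}}(y|x)$, where $P_{\mathrm{ctc}}$ is the CTC output probability distribution, by sharing the encoder sub-network [43], [81]. Moreover, to avoid vanishing of $\alpha_{i,j}$, a quantity loss $L_{\mathrm{qua}}$ is introduced to bring the expected total number of token boundaries closer to the reference output sequence length $U$ as follows [20], [32]:
$$L_{\mathrm{qua}} = \Big| U - \sum_{i=1}^{U} \sum_{j=1}^{T'} \alpha_{i,j} \Big|.$$
We refer to this technique as quantity regularization (QR). Note that no external alignment information is used here. The total objective function $L_{\mathrm{total}}$ is defined as a linear interpolation of $L_{\mathrm{mocha}}$, $L_{\mathrm{ctc}}$, and $L_{\mathrm{qua}}$ as follows:
$$L_{\mathrm{total}} = (1 - \lambda_{\mathrm{ctc}}) L_{\mathrm{mocha}} + \lambda_{\mathrm{ctc}} L_{\mathrm{ctc}} + \lambda_{\mathrm{qua}} L_{\mathrm{qua}}, \tag{7}$$
where $\lambda_{\mathrm{ctc}}$ ($0 \le \lambda_{\mathrm{ctc}} \le 1$) and $\lambda_{\mathrm{qua}}$ ($\ge 0$) are tunable weights for the CTC loss and quantity loss, respectively.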
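The moving-sum trick for the training-time chunkwise scores can be checked numerically. The sketch below compares the nested summation, in which every candidate boundary $k$ spreads its score $\alpha_{i,k}$ over the $w$ frames ending at $k$ through a windowed softmax, against a convolution-based version. The helper `moving_sum(x, back, forward)` is a hypothetical utility whose arguments count the extra elements on each side of the current index, and frames outside the sequence are treated as zeros.

```python
import numpy as np

def moving_sum(x, back, forward):
    """y[n] = sum of x[n-back .. n+forward], with zeros outside the sequence."""
    padded = np.concatenate((np.zeros(back), x, np.zeros(forward)))
    kernel = np.ones(back + forward + 1)
    return np.convolve(padded, kernel, mode="valid")

def beta_nested(alpha, u, w):
    """Reference implementation with the explicit nested summation."""
    num_frames = alpha.shape[0]
    e = np.exp(u)
    beta = np.zeros(num_frames)
    for j in range(num_frames):
        for k in range(j, min(j + w, num_frames)):
            denom = e[max(k - w + 1, 0): k + 1].sum()  # softmax window ending at k
            beta[j] += alpha[k] * e[j] / denom
    return beta

def beta_movsum(alpha, u, w):
    """Same quantity with two moving sums (each one is a 1-D convolution)."""
    e = np.exp(u)
    denom = moving_sum(e, w - 1, 0)                 # backward-looking window
    return e * moving_sum(alpha / denom, 0, w - 1)  # forward-looking window

rng = np.random.default_rng(0)
alpha = rng.uniform(0.0, 0.1, size=10)
u = rng.normal(size=10)
print(np.allclose(beta_nested(alpha, u, 4), beta_movsum(alpha, u, 4)))
```

A useful sanity check is that the chunkwise scores redistribute the monotonic scores without changing their total, so the sum of the result equals the sum of `alpha`.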
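As a small worked example of quantity regularization and the interpolated objective, the sketch below takes the quantity loss to be the absolute gap between the reference length $U$ and the total expected alignment mass, and combines it with the MoChA and CTC losses using weights $(1 - \lambda_{ctc})$, $\lambda_{ctc}$, and $\lambda_{qua}$. The function names and the numeric values are illustrative assumptions, not settings from the experiments.

```python
import numpy as np

def quantity_loss(alpha):
    """|U - sum_ij alpha_ij| for expected alignments alpha of shape (U, T')."""
    num_tokens = alpha.shape[0]
    return float(abs(num_tokens - alpha.sum()))

def total_loss(l_mocha, l_ctc, l_qua, lam_ctc, lam_qua):
    """Linear interpolation of the MoChA and CTC losses plus the weighted quantity loss."""
    return (1.0 - lam_ctc) * l_mocha + lam_ctc * l_ctc + lam_qua * l_qua

# Two tokens whose alignment mass has partly vanished (each row sums to 0.8).
alpha = np.full((2, 4), 0.2)
l_qua = quantity_loss(alpha)
print(l_qua)
print(total_loss(l_mocha=1.0, l_ctc=2.0, l_qua=l_qua, lam_ctc=0.5, lam_qua=2.0))
```

When every row of `alpha` sums to one, the quantity loss is zero, so the term only penalizes alignments whose total mass drifts away from the number of reference tokens.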

IV. PROBLEMS IN MOCHA
In this section, we review two major problems in MoChA: vanishing alignment probabilities and delayed token generation.

A. Vanishing alignment probabilities
Conventional alignment models such as the hidden Markov model (HMM) [82] and CTC models use the forward-backward algorithm to calculate alignment probabilities. However, it is not straightforward to apply that algorithm to the HMA mechanism. This is because HMA does not normalize the monotonic energy $e_{i,j}$ across the entire set of encoder outputs $h$ to obtain the expected alignment score $\alpha_{i,j}$, which means that $\alpha_{i,j}$ is not a valid probability. Moreover, the decoder is an autoregressive model, and an incremental left-to-right update of the decoder state is required for each token. Therefore, during the marginalization process at training time, $\alpha_{i,j}$ depends only on past alignments, as can be seen in Eq. (3), and it can quickly be attenuated as the number of decoding steps increases [33], [83], because $\sum_{j=1}^{T'} \alpha_{i,j} < 1$. This is especially problematic for long-form speech because the model is more likely to fail to learn the scale of $p_{i,j}$ properly in the latter steps. Accordingly, the gap in the HMA behaviors between training and testing is widened. This leads to premature endpointing, which increases deletion errors [84], [85].

B. Delayed token generation
To enable online inference, a streaming ASR model needs to be equipped with a time-restricted encoder, which has access to only limited future information. However, because E2E models are optimized with sequence-level criteria, i.e., the cross-entropy, future encoder outputs are used as much as possible [29], [31], [32], [37]. As a result, token boundaries are shifted several frames later than the actual acoustic boundary positions. Moreover, MoChA allows emission of multiple tokens at the same input index. These problems inevitably cause a large perceived latency and make the model unusable in online ASR.
[Fig. 1: Token boundaries from MoChA and CTC posterior probabilities for the baseline model (top, see Table II) and the proposed model with CTC-ST (bottom, T6). Reference: "we might be putting lids and casting shadows on their power wouldn't we want to open doors for them instead."]
It is known that the posterior probabilities in a well-trained CTC model tend to peak in sharp spikes [4]. In Fig. 1, we can observe that MoChA's token boundaries are indeed shifted to the right (future side) from the actual acoustic boundaries and are poorly aligned to the CTC spikes.

V. ALIGNMENT KNOWLEDGE DISTILLATION FROM HYBRID ASR SYSTEM
This section describes previous approaches to tackle the delayed token generation problem in MoChA by leveraging word alignments extracted from a hybrid ASR system [32]. We review two methods in [32]: Delay Constrained Training (DeCoT) and Minimum Latency Training (MinLT), which can reduce both the emission latency and recognition errors. Because the alignment knowledge bootstrapped in a hybrid system is transferred to an AED model, we refer to the procedure as alignment knowledge distillation from the hybrid system.
Let $A = (a_1, \cdots, a_T)$ ($a_j$: $V$-dimensional one-hot vector; $V$: vocabulary size of the hybrid system) be a frame-level word alignment corresponding to the input sequence $x$, and $b^{\mathrm{ref}} = (b^{\mathrm{ref}}_1, \cdots, b^{\mathrm{ref}}_U)$ be a sequence of endpoints of token boundaries for a reference transcription $y = (y_1, \cdots, y_U)$. To convert the word alignment to a subword alignment compatible with MoChA, we split the total time duration of each word in $A$ among its subwords in proportion to their character lengths. Finally, we select the end timestamp of each subword as the token boundary.
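The word-to-subword conversion can be illustrated with a short sketch. It assumes that a word occupies the frame span from `word_start` to `word_end`, that the subwords are passed without any word-boundary marker characters, and that fractional positions are rounded to the nearest frame; none of these details are specified in the text, and the function name is hypothetical.

```python
def subword_end_frames(word_start, word_end, subwords):
    """Split a word's duration in proportion to subword character lengths
    and return the end frame of each subword."""
    total_chars = sum(len(s) for s in subwords)
    duration = word_end - word_start
    ends = []
    consumed = 0
    for s in subwords:
        consumed += len(s)
        # Cumulative share of the word covered up to the end of this subword.
        ends.append(word_start + round(duration * consumed / total_chars))
    return ends

# A word spanning frames 10 to 30, tokenized into three subwords (2, 6, and 4 characters).
print(subword_end_frames(10, 30, ["un", "believ", "able"]))
```

The last subword always ends exactly at the word's end frame, so word-level boundaries from the hybrid system are preserved.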

A. Delay constrained training (DeCoT)
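In DeCoT, inappropriate alignment paths that poorly match the reference alignment are removed by masking out their scores $\alpha_{i,j}$ during marginalization. Then, $\alpha_{i,j}$ in Eq. (3) is reformulated to $\alpha^{\mathrm{decot}}_{i,j}$ as follows:
$$\alpha^{\mathrm{decot}}_{i,j} = \begin{cases} p_{i,j} \Big( (1 - p_{i,j-1}) \dfrac{\alpha^{\mathrm{decot}}_{i,j-1}}{p_{i,j-1}} + \alpha^{\mathrm{decot}}_{i-1,j} \Big) & (j \le b^{\mathrm{ref}}_i + \delta_{\mathrm{decot}}), \\ 0 & (\text{otherwise}), \end{cases}$$
where $\delta_{\mathrm{decot}}$ is a hyperparameter to control the acceptable delay, and $b^{\mathrm{ref}}_i$ is a reference boundary of the $i$-th token transferred from the hybrid system.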
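The masking step can be sketched as follows, under the assumption that scores are kept only for frames no later than the reference boundary plus the allowed delay and are set to zero elsewhere. The sketch applies the mask to a single row of already computed scores; in training, the masked row would then feed the recurrence for the next token. The function name and the 0-indexed frames are choices made for this illustration.

```python
import numpy as np

def decot_mask(alpha_step, ref_boundary, delta):
    """Zero out alignment scores later than ref_boundary + delta (0-indexed frames)."""
    frames = np.arange(alpha_step.shape[0])
    return np.where(frames <= ref_boundary + delta, alpha_step, 0.0)

alpha_step = np.array([0.1, 0.2, 0.3, 0.2, 0.1])
masked = decot_mask(alpha_step, ref_boundary=1, delta=1)
print(masked)                          # frames 3 and 4 are removed
print(alpha_step.sum(), masked.sum())  # masking shrinks the total mass
```

The drop in total mass after masking is the reason a rescaling mechanism such as quantity regularization becomes more important under this constraint.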
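As explained in Section IV-A, MoChA has a problem of an exponential decay of $\alpha_{i,j}$, and the masking for $\alpha^{\mathrm{decot}}_{i,j}$ accelerates it further. To recover the proper scale of $\alpha^{\mathrm{decot}}_{i,j}$, QR is applied to the masked scores:
$$L^{\mathrm{decot}}_{\mathrm{qua}} = \Big| U - \sum_{i=1}^{U} \sum_{j=1}^{T'} \alpha^{\mathrm{decot}}_{i,j} \Big|.$$
Intuitively, QR in DeCoT emphasizes the valid alignment paths during marginalization. Moreover, this also leads to better estimation of $\beta_{i,j}$ in Eq. (5). Accordingly, the total objective function in Eq. (7) is modified as follows:
$$L_{\mathrm{total}} = (1 - \lambda_{\mathrm{ctc}}) L_{\mathrm{mocha}} + \lambda_{\mathrm{ctc}} L_{\mathrm{ctc}} + \lambda_{\mathrm{qua}} L^{\mathrm{decot}}_{\mathrm{qua}}.$$
Unlike in [32], we also add a CTC loss as an auxiliary objective for DeCoT.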

B. Minimum latency training (MinLT)
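While DeCoT can effectively reduce the emission latency, a fixed buffer size $\delta_{\mathrm{decot}}$ must be predefined for every token. However, the emission latency can vary depending on the speaking rate, subwords, and so on. To reduce the latency of each token more flexibly, MinLT directly minimizes the expected latency so that the expected token boundaries in MoChA get closer to the corresponding reference boundaries. A differentiable expected latency loss $L_{\mathrm{minlt}}$ is specified as
$$L_{\mathrm{minlt}} = \frac{1}{U} \sum_{i=1}^{U} \big| b^{\mathrm{mocha}}_i - b^{\mathrm{ref}}_i \big|, \qquad b^{\mathrm{mocha}}_i = \sum_{j=1}^{T'} j \cdot \alpha_{i,j},$$
where $b^{\mathrm{mocha}}_i$ is the expected boundary position in MoChA for the $i$-th token during training. The total objective function in Eq. (7) is modified as follows:
$$L_{\mathrm{total}} = (1 - \lambda_{\mathrm{ctc}}) L_{\mathrm{mocha}} + \lambda_{\mathrm{ctc}} L_{\mathrm{ctc}} + \lambda_{\mathrm{minlt}} L_{\mathrm{minlt}},$$
where $\lambda_{\mathrm{minlt}}$ ($\ge 0$) is a hyperparameter to control the latency.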

VI. PROPOSED METHOD: CTC SYNCHRONOUS TRAINING
In this section, we propose a novel training method to alleviate the problems of vanishing alignment probabilities and delayed token generation in MoChA by leveraging CTC alignments as a reference for token boundaries.

A. Overview
As observed in Fig. 1, there exists a gap in the timing to emit tokens between the CTC and MoChA models even when they share the same encoder. However, we can see that the CTC spikes are closer to the actual acoustic boundaries. The reason is that CTC does not suffer from alignment error propagation because of the optimization with the forward-backward algorithm and the assumption of conditional independence on a per-frame basis. Therefore, we expect that CTC can generate more reliable alignments than MoChA and serve as an effective guide for MoChA to learn to detect token boundaries more accurately. Motivated by this reasoning, we propose CTC synchronous training (CTC-ST), in which a MoChA model is trained to mimic a CTC model in order to generate token boundaries at similar positions. Figure 2 shows a system overview. Both the MoChA and CTC branches are jointly optimized by sharing the encoder, and reference token boundaries are obtained from the CTC alignments generated from the CTC branch. Therefore, the CTC model serves not only as a regularizer to enhance the encoder representations but also as a teacher alignment model to estimate accurate token boundary positions. In this sense, we regard CTC-ST as a form of self-distillation from CTC to MoChA. Specifically, the synchronization of token boundaries in CTC-ST can be viewed as explicit interaction between MoChA and CTC models on the decoder side, unlike in the conventional joint CTC/Attention framework [43]. However, we only leverage the discrete token boundary positions rather than the alignment probability distributions as in the conventional knowledge distillation method [44].
Because CTC is not allowed to emit multiple symbols at the same input index, the reference token boundaries from the CTC alignments can also enforce the monotonicity of $\alpha_{i,j}$ in MoChA, which leads to emission latency reduction as well. Moreover, unlike the methods described in Section V, the entire model can be trained in an end-to-end manner without relying on external alignments extracted from a hybrid system or manual annotation.

B. Extraction of CTC alignments
We use the most probable CTC path $\hat{\pi} = \arg\max_{\pi} p(\pi|x)$ ($|\hat{\pi}| = T'$), given by Viterbi alignment, via forced alignment with the forward-backward algorithm in a manner similar to triggered attention [18]. The time indices of non-blank tokens in $\hat{\pi}$ are used as the reference token boundaries $b^{\mathrm{ctc}} = (b^{\mathrm{ctc}}_1, \ldots, b^{\mathrm{ctc}}_U)$. When repeated non-blank labels exist, the leftmost index corresponding to the same non-blank token is used as a reference token boundary. The last time index $T'$ is used for the end-of-sentence (EOS) mark, <eos>. For instance, given a CTC path $\hat{\pi}$ = [φ,c,c,φ,a,a,a,φ,t,t,φ] (φ: blank) corresponding to a reference transcription "c a t <eos>", we convert it to [φ,c,φ,φ,a,φ,φ,φ,t,φ,<eos>] and then extract the time indices of the non-blank tokens $b^{\mathrm{ctc}}$ = (2,5,9,11) (1-indexed). Unless otherwise specified, $b^{\mathrm{ctc}}$ is generated with the model parameters at each training step on the fly and is expected to get more accurate as the training continues. We can also pre-compute $b^{\mathrm{ctc}}$ and use the fixed boundaries throughout training when adopting curriculum learning; this approach is described in Section VI-D and analyzed in Section XI-A.
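The boundary extraction rule can be written in a few lines. The sketch below assumes that the forced-alignment path is already available as a list of frame labels; it keeps the leftmost frame of every run of a repeated non-blank label and assigns the last frame to the EOS mark. The blank symbol `"<b>"` and the function name are placeholders chosen for this example.

```python
def ctc_boundaries(path, blank="<b>"):
    """Return 1-indexed reference boundaries from a frame-level CTC path.

    The leftmost frame of each run of a non-blank label is a token boundary,
    and the final frame is used as the boundary of the EOS mark.
    """
    boundaries = []
    prev = blank
    for t, label in enumerate(path, start=1):
        if label != blank and label != prev:
            boundaries.append(t)
        prev = label
    boundaries.append(len(path))  # EOS is placed at the last time index
    return boundaries

path = ["<b>", "c", "c", "<b>", "a", "a", "a", "<b>", "t", "t", "<b>"]
print(ctc_boundaries(path))  # [2, 5, 9, 11], matching the example in the text
```

Because a blank resets the previous label, a token that is genuinely repeated in the transcription (separated by a blank in the path) still yields two distinct boundaries.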

C. Optimization
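We define the objective function of CTC-ST $L_{\mathrm{sync}}$ (hereafter, the CTC-ST loss) as
$$L_{\mathrm{sync}} = \frac{1}{U} \sum_{i=1}^{U} \big| b^{\mathrm{ctc}}_i - b^{\mathrm{mocha}}_i \big|. \tag{13}$$
The total objective function in Eq. (7) is reformulated as
$$L_{\mathrm{total}} = (1 - \lambda_{\mathrm{ctc}}) L_{\mathrm{mocha}} + \lambda_{\mathrm{ctc}} L_{\mathrm{ctc}} + \lambda_{\mathrm{sync}} L_{\mathrm{sync}}, \tag{14}$$
where $\lambda_{\mathrm{sync}}$ ($\ge 0$) is a tunable hyperparameter. When using CTC-ST, we do not use the quantity loss together with the CTC-ST loss. This is because we found that the combination was not effective in our experiments, as described in Section VIII. Instead, we propose an effective curriculum learning strategy in Section VI-D.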
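A numerical sketch of the CTC-ST loss is given below. It assumes that the expected boundary of each token is the alignment-weighted mean frame index, $\sum_j j \cdot \alpha_{i,j}$ with 1-indexed frames, and that the loss is the mean absolute distance to the CTC boundaries; the function names are hypothetical. In actual training the CTC boundaries would be treated as constants, so that gradients flow only through the alignment scores.

```python
import numpy as np

def expected_boundaries(alpha):
    """Expected boundary per token: sum_j j * alpha[i, j] with 1-indexed frames."""
    num_frames = alpha.shape[1]
    frames = np.arange(1, num_frames + 1)
    return alpha @ frames

def ctc_sync_loss(alpha, b_ctc):
    """Mean absolute gap between CTC boundaries and expected MoChA boundaries."""
    b_ctc = np.asarray(b_ctc, dtype=float)
    return float(np.mean(np.abs(b_ctc - expected_boundaries(alpha))))

# Two tokens whose alignments are concentrated one frame later than the CTC spikes.
alpha = np.zeros((2, 8))
alpha[0, 2] = 1.0  # token 1 expected at frame 3
alpha[1, 5] = 1.0  # token 2 expected at frame 6
print(ctc_sync_loss(alpha, b_ctc=[2, 5]))  # 1.0
```

Since the alignment scores are not normalized, a row whose mass has decayed yields an expected boundary biased toward zero, which is one reason a reasonable scale of the scores matters before this loss becomes informative.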

D. Curriculum learning strategy
To calculate the effective gradient via Eq. (13) in CTC-ST, it is necessary to estimate reasonable expected token boundary positions in MoChA during training. However, $\alpha_{i,j}$ tends to diffuse over several frames in the early training stage. Again, we note that $\alpha_{i,j}$ is not explicitly normalized to sum up to one. Moreover, CTC alignments would not be very accurate in the early training stage either. Therefore, applying CTC-ST from a random parameter initialization leads to unstable, slower convergence. To tackle this problem, we propose a simple but effective curriculum learning strategy composed of the following two stages.

1) Stage 1: We first train a MoChA model equipped with a bidirectional encoder (e.g., BLSTM) together with QR by applying Eq. (7) from scratch until convergence. As the bidirectional encoder can see the entire context, we refer to this model as "offline." In this stage, we expect the model to learn a proper scale of $\alpha_{i,j}$.
2) Stage 2: Next, we optimize a MoChA model equipped with a latency-controlled bidirectional encoder (LC-BLSTM) [86] with CTC-ST by applying Eq. (14). We initialize the parameters with values optimized in stage 1. Because the BLSTM and LC-BLSTM encoders have the same model structure, we can reuse all of the parameters; the only difference between them is the lookahead context size. The optimizer's parameters and the learning rate are reset at the beginning of stage 2. In this stage, we expect the model to learn accurate token boundary locations. When using a unidirectional LSTM encoder, the same encoder is used in both stages. We also apply this curriculum learning strategy to DeCoT and MinLT to stabilize the training [32].

TABLE I: Corpus statistics
Corpus            Hours   Domain    Language   Vocabulary
Librispeech [88]  960     Reading   English    10k
CSJ [89]          586     Lecture   Japanese   10k
AMI (SDM) [90]    100     Meeting   English    500

E. Combination with SpecAugment
Recently, SpecAugment [42] has shown the capability to greatly improve the performance of E2E ASR models. SpecAugment is an on-the-fly data augmentation method that introduces stochastic time and frequency masks into input speech. $M_T$ time masks, each with a size sampled from a uniform distribution $\mathcal{U}(0, T)$, are applied to the input log-mel spectrogram, where $T$ here denotes the time mask parameter rather than the input length. Similarly, frequency masks are applied with mask parameters $M_F$ and $F$. However, such input masks easily collapse the recurrence of $\alpha_{i,j}$ in Eq. (3) right after the masked region. Although such a problem does not exist in offline global AED models or frame-synchronous models, it is a severe problem in MoChA. In our experiments, in fact, the performance of the baseline MoChA model was degraded by applying SpecAugment.
In contrast, CTC-ST can help MoChA recover the collapsed $\alpha_{i,j}$ in the masked region by leveraging the CTC spikes because CTC assumes conditional independence on a per-frame basis. Therefore, we expect that CTC-ST is beneficial for MoChA to learn monotonic alignments that are robust for noisy inputs.
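For readers unfamiliar with the masking operation, time masking can be sketched as below. This is a simplified illustration rather than the configuration used in the experiments: it assumes that mask widths are drawn uniformly from 0 to a maximum width, that start positions are drawn uniformly among the valid positions, and that masked frames are filled with zeros (the mean value for mean-normalized features). The function and argument names are hypothetical.

```python
import numpy as np

def apply_time_masks(features, num_masks, max_width, rng):
    """Zero out `num_masks` random spans of consecutive frames.

    features: array of shape (num_frames, num_channels); a copy is returned.
    """
    out = features.copy()
    num_frames = out.shape[0]
    for _ in range(num_masks):
        # Width is uniform over {0, ..., max_width}, capped by the utterance length.
        width = min(int(rng.integers(0, max_width + 1)), num_frames)
        start = int(rng.integers(0, num_frames - width + 1))
        out[start:start + width, :] = 0.0
    return out

rng = np.random.default_rng(0)
feats = np.ones((100, 80))  # 100 frames of 80-channel features
masked = apply_time_masks(feats, num_masks=2, max_width=10, rng=rng)
print(int((masked.sum(axis=1) == 0).sum()), "frames masked")
```

A run of fully masked frames carries no acoustic evidence, which is why a recurrence that must pass alignment mass through that region frame by frame is fragile under this kind of augmentation.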

VII. EXPERIMENTAL EVALUATION

A. Datasets
We used TEDLIUM release 2 (TEDLIUM2) [87], Librispeech [88], the Corpus of Spontaneous Japanese (CSJ) [89], and the single distant microphone (SDM) portion of the AMI Meeting Corpus [90] for our experimental evaluations. The corpus statistics and the utterance length distributions are presented in Table I and Fig. 3, respectively. We used vocabularies of 10k units based on the byte pair encoding (BPE) algorithm [91], except for the AMI corpus, for which a vocabulary of 500 BPE units was used.

B. Experimental configuration
Using the Kaldi toolkit [92], we extracted 80-channel log-mel filterbank coefficients computed with a 25-ms window that was shifted every 10ms. Input features were normalized by the global mean and variance calculated on each training set. We removed utterances longer than 16 seconds from the training data to conserve GPU memory. The training utterances were sorted by input length in ascending order throughout training. We applied 3-fold speed perturbation [93] to the TEDLIUM2 and AMI corpora with factors of 0.9, 1.0, and 1.1.
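The normalization, length filtering, and sorting steps can be sketched as follows. This is an illustrative sketch only: feature extraction itself is left to Kaldi, and the helper names (`global_mean_var`, `normalize`, `select_and_sort`) and the small variance floor `eps` are assumptions rather than part of the described setup.

```python
import math


def global_mean_var(utterances):
    """Per-channel mean and variance over all frames of a training set."""
    num_channels = len(utterances[0][0])
    count = sum(len(utt) for utt in utterances)
    mean = [sum(frame[c] for utt in utterances for frame in utt) / count
            for c in range(num_channels)]
    var = [sum((frame[c] - mean[c]) ** 2
               for utt in utterances for frame in utt) / count
           for c in range(num_channels)]
    return mean, var


def normalize(utt, mean, var, eps=1e-8):
    """Apply global mean and variance normalization to one utterance."""
    return [[(x - m) / math.sqrt(v + eps) for x, m, v in zip(frame, mean, var)]
            for frame in utt]


def select_and_sort(utterances, frame_shift_ms=10, max_seconds=16.0):
    """Drop utterances longer than max_seconds, then sort by length (ascending)."""
    max_frames = int(max_seconds * 1000 / frame_shift_ms)
    kept = [utt for utt in utterances if len(utt) <= max_frames]
    return sorted(kept, key=len)
```

With a 10ms frame shift, the 16-second limit corresponds to 1600 input frames per utterance.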
The encoders consisted of two CNN blocks followed by five layers of (LC-)BLSTM or unidirectional LSTM. Each CNN block consisted of two CNN layers, each of which had a 3 × 3 filter, followed by a max-pooling layer. This resulted in a 4-fold frame rate reduction in total and introduced a 60ms lookahead latency for every output of the CNN blocks. We set the number of units in each (LC-)BLSTM layer to 512 per direction. To reduce the input dimension of the subsequent (LC-)BLSTM layer, we summed the LSTM outputs in both directions at every layer [94]. For a unidirectional LSTM encoder, the unit size was increased to 1024. In this article, we denote an LC-BLSTM encoder with a hop size of $N_c$ frames and a future context of $N_r$ frames as "LC-BLSTM-$N_c$+$N_r$." The decoder was a single layer of unidirectional LSTM with 1024-dimensional units. We set the window size $w$ of chunkwise attention in MoChA to 4. Offline global AED models used location-based attention [8].
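To make the LC-BLSTM-$N_c$+$N_r$ notation concrete, the sketch below converts a configuration into an average lookahead latency. It uses the accounting that the Librispeech comparison in Section VIII-B applies (half the hop size, plus the future context, plus the CNN lookahead) and assumes that $N_c$ and $N_r$ are counted in 10ms input frames, which is consistent with the 400ms figures quoted there. The function name is hypothetical.

```python
FRAME_SHIFT_MS = 10      # input frame shift stated in the configuration
CNN_LOOKAHEAD_MS = 60    # lookahead introduced by the CNN blocks


def average_lookahead_ms(hop_frames, future_frames,
                         frame_shift_ms=FRAME_SHIFT_MS,
                         cnn_lookahead_ms=CNN_LOOKAHEAD_MS):
    """Average lookahead of an LC-BLSTM-N_c+N_r encoder in milliseconds.

    A frame waits on average half a hop (N_c / 2) for its chunk to fill,
    plus the full future context (N_r), plus the CNN lookahead.
    """
    hop_ms = hop_frames * frame_shift_ms
    future_ms = future_frames * frame_shift_ms
    return hop_ms / 2 + future_ms + cnn_lookahead_ms


print(average_lookahead_ms(40, 40))  # LC-BLSTM-40+40 -> 660.0
print(average_lookahead_ms(40, 20))  # LC-BLSTM-40+20 -> 460.0
```

The 660ms value for LC-BLSTM-40+40 matches the figure quoted in Section VIII-B; the 460ms value for LC-BLSTM-40+20 is derived here with the same formula and is not a number reported in the text.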
We also trained RNN-T models with the same encoder for comparison. The RNN-T models had a two-layer LSTM prediction network with 1024 memory units and a joint network with 512 units. A vocabulary of 1k BPE units was used, except for the AMI corpus, for which 500 units were used. These vocabulary sizes were selected to achieve the best performance for RNN-T. We also used an auxiliary CTC loss for RNN-T as $\mathcal{L}_{\mathrm{total}} = (1 - \lambda_{\mathrm{ctc}}) \mathcal{L}_{\mathrm{rnnt}} + \lambda_{\mathrm{ctc}} \mathcal{L}_{\mathrm{ctc}}$, where $\mathcal{L}_{\mathrm{rnnt}}$ is the RNN-T loss. $\lambda_{\mathrm{ctc}}$ was set to 0.3.
The Adam optimizer [95] was used with an initial learning rate of 1e-3, which was then decayed exponentially. For the TEDLIUM2 and AMI corpora, 4k warmup steps were used. We applied dropout and label smoothing [96] with probabilities of 0.4 and 0.1, respectively. The weight of the quantity loss $\lambda_{\mathrm{qua}}$ in Eq. (7) was set to 2.0, 0.1, 1.0, and 1.0 for the TEDLIUM2, Librispeech, CSJ, and AMI corpora, respectively. $\lambda_{\mathrm{qua}}$ in Eq. (9) was set to 2.0. The CTC-ST weight $\lambda_{\mathrm{sync}}$ in Eq. (14) was set to 1.0 for all corpora, unless otherwise noted. $\lambda_{\mathrm{ctc}}$ was set to 0.3 in all models. All training was performed with a single GPU.
For inference, we used a 4-layer LSTM LM with 1024 units per layer. For AED models, we used a beam width of 10 and scores normalized by the output sequence length at every output timestep, except for global AED models on the AMI corpus⁶. Joint CTC decoding was performed for global AED models [81]. For RNN-T models, we used the breadth-first time-synchronous decoding (TSD) algorithm with a monotonic constraint (hereafter, mono-TSD) [97] to speed up decoding⁷, and we merged paths corresponding to the same label history, except for the AMI corpus⁸. We also reduced the beam width of streaming RNN-T models to 5 because of the inference speed constraint. Our implementation is publicly available⁹.

⁶ This was because the utterance lengths in the AMI corpus are relatively short.
⁷ Although we did not explicitly use a monotonic RNN-T loss, the joint CTC training enforced a similar effect.
⁸ Again, this was because the utterance lengths are relatively short.
⁹ https://github.com/hirofumi0810/neural_sp

VIII. RESULTS

A. TEDLIUM2
1) Effectiveness of CTC-ST: Table II summarizes the results on the TEDLIUM2 corpus. In the offline scenario, the naive implementation of MoChA showed very poor performance. Quantity regularization (QR) drastically improved the performance (T3). Although CTC-ST also improved the performance, it was less effective than QR when it was applied from scratch. The BLSTM RNN-T outperformed the global AED model. In the streaming scenario, however, our proposed CTC-ST significantly improved the baseline performance as compared to QR regardless of the encoder type. We obtained relative word error rate (WER) improvements of 12.0, 13.9, and 12.3% for the UniLSTM, LC-BLSTM-40+20, and LC-BLSTM-40+40 MoChA models, respectively, on the test set. Although using a larger number of future lookahead frames improved the WER as expected, the effectiveness of CTC-ST was consistent. The token boundaries from the MoChA and CTC branches in T5 and T6 are visualized in Fig. 1. We can see that the MoChA boundaries moved to the left, and that the gap in the timing to emit tokens in both branches was reduced. This is desirable for reducing the perceived latency [32], which will be evaluated in Section X-B. Moreover, we found that the CTC spikes slightly shifted to the left. Although RNN-T was more robust for long-form utterances on the dev set, MoChA optimized with CTC-ST matched the performance of RNN-T on the test set.
2) Bucketing by input length: Figure 4 shows a plot of the WER bucketed by the input length on the test set. The plot confirms that the largest gains by CTC-ST were for utterances longer than 20 seconds. The offline global AED model with joint CTC decoding (T1) did not have difficulty in recognizing long utterances, whereas the baseline streaming MoChA models did, regardless of the encoder type (T5, T7). As we had expected, the proposed CTC-ST successfully mitigated this problem (T6, T8).
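A length-bucketed WER of the kind plotted in Fig. 4 can be computed along the following lines. This is a sketch under stated assumptions: the bucket edges, the function names, and the input layout (duration, reference words, hypothesis words) are hypothetical, and the WER of each bucket is taken as total word errors over total reference words within that bucket.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_length(utterances, bucket_edges_sec):
    """WER (%) per duration bucket.

    utterances: iterable of (duration_sec, ref_words, hyp_words).
    bucket_edges_sec: ascending upper edges; a final open-ended bucket is added.
    Buckets without any reference words are reported as None.
    """
    edges = list(bucket_edges_sec) + [float("inf")]
    errors = [0] * len(edges)
    words = [0] * len(edges)
    for duration, ref, hyp in utterances:
        b = next(k for k, edge in enumerate(edges) if duration <= edge)
        errors[b] += edit_distance(ref, hyp)
        words[b] += len(ref)
    return [100.0 * e / w if w else None for e, w in zip(errors, words)]
```

Pooling errors within a bucket, rather than averaging per-utterance WERs, keeps short utterances from dominating a bucket's score.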
3) Effectiveness of curriculum learning: Next, we investigate the effectiveness of the regularization methods (CTC-ST and QR) and the curriculum learning strategy by using LC-BLSTM-40+40 MoChA. Table III summarizes the results. Initialization with the offline model T3 was very helpful for both regularization methods, which is consistent with a previous study [31]. However, regularization with the CTC-ST loss or quantity loss in stage 2 was essential for achieving a performance gain even with curriculum learning. When using the LC-BLSTM encoder, CTC-ST was more effective than QR regardless of the use of curriculum learning, unlike the results listed in Table II for the BLSTM encoder. Combining both losses did not lead to any further improvement over CTC-ST alone, although it was more effective than applying the quantity loss alone. Therefore, the effect of CTC-ST overlaps with that of QR: it also encourages MoChA to learn the scale of $\alpha_{i,j}$ properly.

4) Combination with SpecAugment: We next investigate the combination of CTC-ST and SpecAugment, whose results are summarized in Table IV. We set $(F_{\mathrm{sp}}, T_{\mathrm{sp}})$ to (13, 50) for UniLSTM MoChA, (27, 50) for LC-BLSTM MoChA, and (27, 100) for the other models. We used the same configurations on other corpora as well. Because of a convergence issue, we applied SpecAugment in stage 2 only. Moreover, we increased $\lambda_{\mathrm{sync}}$ to 4.0 for the UniLSTM MoChA model.¹⁰ The naive streaming MoChA models trained only with QR did not obtain any improvement with SpecAugment, whereas the performance of the RNN-T models improved. Therefore, this was a problem on the decoder side, rather than the encoder side. However, CTC-ST mitigated this problem and showed additional 12.8% and 13.1% relative improvements for the UniLSTM and LC-BLSTM models, respectively, on the test set. Finally, MoChA optimized with CTC-ST matched the performance of RNN-T on the test set when SpecAugment was applied as well.

B. Librispeech
Table V summarizes the results on the Librispeech corpus. For the UniLSTM MoChA model, CTC-ST achieved relative improvements of 14.8% and 5.8% on the test-clean and test-other sets, respectively. For the LC-BLSTM MoChA model, we obtained gains of 27.0% and 2.1% with CTC-ST on the test-clean and test-other sets, respectively. Therefore, we conclude that CTC-ST is effective for large-scale data as well. Furthermore, the best MoChA models outperformed their RNN-T counterparts except on the dev-clean set.
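The relative improvements quoted throughout this section follow the usual definition of relative error reduction. The sketch below states it explicitly; the function name and the numbers in the example are hypothetical and are not taken from Table V.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: a baseline WER of 10.0% reduced to 8.5%
# corresponds to a 15% relative improvement.
example = relative_improvement(10.0, 8.5)
```

Because the measure is normalized by the baseline, the same absolute gain yields a larger relative improvement on an easier set such as test-clean than on test-other.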
We also compared our models with streaming RNN-based E2E systems reported in the literature. Our enhanced MoChA models optimized with CTC-ST showed the best performance. To compare our model with sMoChA [30] and MTA [33], we deactivated SpecAugment in the LC-BLSTM MoChA model. The resulting WERs were 3.9% and 11.2% on the test-clean and test-other sets, which confirmed that our method still outperformed them. The average lookahead latency of our LC-BLSTM MoChA was 660ms (= 400ms ($N_c$)/2 + 400ms ($N_r$) + 60ms (CNN)), whereas it was 640ms (= 640ms/2 + 320ms) in [30], [33]. Therefore, we consider the difference in lookahead latency to be negligible. Because CTC-ST is a method to enhance attention-based decoders, any type of encoder can be used. Further improvement is expected by adopting Transformer [101] and Conformer [102] encoders, which we leave for future work.

C. CSJ
Table VI summarizes the results on the CSJ. For both UniLSTM and LC-BLSTM MoChA models, we observed clear improvements with CTC-ST, which was consistent with the previous experiments. However, the relative gains for LC-BLSTM MoChA were smaller than those in other corpora. We reason that this was because the utterances in the CSJ are relatively short, as shown in Fig. 3. We will analyze this behavior by simulating long-form speech utterances in Section IX.

D. AMI
Table VII summarizes the results on the AMI corpus. We did not use any external LMs on this corpus because we did not observe any improvement. We observed significant improvement with CTC-ST in the far-field ASR task as well. CTC-ST gave 12.6% and 13.4% relative improvements over the baseline LC-BLSTM MoChA model on the dev and eval sets, respectively. LC-BLSTM MoChA also achieved better performance than that of LC-BLSTM RNN-T. Note that our baseline was very strong, as demonstrated by its superior performance compared to the TDNN+LF-MMI system [104]. As shown in Fig. 3, utterance lengths in the AMI corpus are relatively short compared to other corpora, so these gains cannot be attributed to long-form speech. Therefore, we can conclude that CTC-ST is also very effective for noisy speech, for which AED models have trouble learning alignments.

IX. EVALUATION OF ROBUSTNESS TO LONG-FORM SPEECH
In Section VIII-A, we observed that CTC-ST was effective for reducing the WER of long-form utterances on TEDLIUM2. In this section, we further analyze this behavior by simulating long-form evaluation sets on other domains. We used CSJ and Librispeech for this purpose because input lengths of the original utterances in the evaluation sets were seen during training, as shown in Fig. 3. We simulated long-form utterances by merging adjacent utterances according to timestamps. Specifically, given a maximum input length threshold $T_{\mathrm{cat}}$ [sec.], we concatenated adjacent utterances of the same speaker from the first utterance in a greedy way until the accumulated utterance length surpassed $T_{\mathrm{cat}}$. We continued this process until no segment was merged in an iteration.

Figures 5 and 6 show the results on CSJ and Librispeech, respectively. On CSJ, the baseline MoChA model without SpecAugment (blue bars) performed well with manual audio segmentation. However, as $T_{\mathrm{cat}}$ increased, the performance gradually degraded, whereas MoChA models trained with CTC-ST only (red bars) were robust in recognizing the long-form utterances. On the other hand, the WER of the naive MoChA trained with SpecAugment (green bars) increased quickly for longer utterances. This indicates that SpecAugment affected the training of the naive MoChA model, which could not be observed with the original test sets because they did not include unseen input lengths. However, we confirmed that CTC-ST mitigated this problem, showing the lowest WER on all lengths (purple bars). We also confirmed the effectiveness of CTC-ST on Librispeech in all length bins, although SpecAugment without CTC-ST did not degrade the WER as severely as on CSJ. Still, the gains by CTC-ST were larger in long-form speech. Therefore, we can conclude that CTC-ST is effective for recognizing long-form speech, which is generally challenging for AED models [34], [84].
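The greedy concatenation used to simulate long-form utterances can be sketched as follows. This is one possible reading of the procedure, not the authors' script: it assumes that a merged segment is closed as soon as adding the next utterance would make it longer than $T_{\mathrm{cat}}$, that segment length is measured from the start of the first utterance to the end of the last (pauses included), and that each utterance is a dict with hypothetical keys `speaker`, `start`, `end`, and `words`.

```python
def merge_adjacent(utterances, t_cat):
    """Greedily merge time-ordered utterances of the same speaker.

    A segment is closed as soon as adding the next utterance would make it
    longer than t_cat seconds, or when the speaker changes.
    """
    merged = []
    for utt in sorted(utterances, key=lambda u: (u["speaker"], u["start"])):
        last = merged[-1] if merged else None
        if (last is not None
                and last["speaker"] == utt["speaker"]
                and utt["end"] - last["start"] <= t_cat):
            # Extend the current segment with this utterance.
            last["end"] = utt["end"]
            last["words"] = last["words"] + utt["words"]
        else:
            # Start a new segment (copied so the input is left untouched).
            merged.append(dict(utt, words=list(utt["words"])))
    return merged
```

Under these assumptions a single pass already reaches the fixed point described in the text: two neighboring segments were split precisely because joining them would exceed $T_{\mathrm{cat}}$, so running the function again on its own output merges nothing further.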

X. EVALUATION OF EMISSION LATENCY
In this section, we evaluate the emission latency and compare CTC-ST with alignment knowledge distillation from a hybrid system [32]. Moreover, we also compare MoChA with RNN-T.
A. Emission latency metric
1) Token emission latency (TEL): Unlike the algorithmic latency introduced by lookahead frames in the encoder, the token emission latency (TEL) represents the user-perceived latency in a real application [105]. Although some previous works investigated the endpoint latency corresponding to the last token in voice search and assistance tasks [3], [105], [106], we mainly focus here on the per-token emission latency because we are interested in long-form speech applications, such as lectures and meetings.
Following [32], we define the TEL as the difference in timing between the reference and predicted boundaries. To obtain the reference token boundaries, we perform forced alignment with Kaldi. The predicted boundaries in an utterance are obtained from the input timesteps at which monotonic attention in MoChA is activated, i.e., $\{j \mid z_{i,j} = 1\}_{i=1,\ldots,U}$. The TEL of the $i$-th token in the $n$-th utterance, $\Delta_{n,i}$, is calculated as

$$\Delta_{n,i} = \hat{b}^{n}_{i} - b^{n}_{i},$$

where $\hat{b}^{n}_{i}$ and $b^{n}_{i}$ are the $i$-th predicted and reference boundaries, respectively, in the $n$-th utterance. We do not include the eos token for TEL calculation. A negative latency can be observed for some tokens because of premature boundary detection. To match the lengths of a hypothesis and the corresponding reference when calculating the TEL, we apply teacher forcing by conditioning the decoder on the ground-truth transcript. However, the WER is reported with beam search decoding using the same model. Therefore, the TEL is a corpus-level latency metric. In the following, we report the median (PT@50) and 90th (PT@90) percentile values of the corpus-level TEL distributions. We also evaluate the TEL of CTC. In this case, we perform forced alignment and use the most plausible alignment path to calculate the TEL.

[...] corresponding to the last word (last WEL) [63], [68].
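The TEL definition (predicted boundary minus reference boundary, pooled over the corpus) and its PT@50 and PT@90 summaries can be sketched as follows. The function names are hypothetical, the nearest-rank percentile is an assumption (the text does not say which percentile estimator was used), and boundaries are taken as frame indices that `frame_ms` converts to milliseconds.

```python
import math


def token_emission_latencies(predicted, reference, frame_ms=1.0):
    """Per-token TEL for one utterance: predicted minus reference boundary.

    Both lists hold one boundary per token (teacher forcing keeps the lengths
    equal); the eos token is assumed to be excluded already.
    """
    if len(predicted) != len(reference):
        raise ValueError("boundary lists must have the same length")
    return [(p - r) * frame_ms for p, r in zip(predicted, reference)]


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def corpus_tel(utterances, frame_ms=1.0):
    """Pool TELs over all utterances and report PT@50 and PT@90.

    utterances: iterable of (predicted_boundaries, reference_boundaries).
    """
    pooled = [d for pred, ref in utterances
              for d in token_emission_latencies(pred, ref, frame_ms)]
    return {"PT@50": percentile(pooled, 50), "PT@90": percentile(pooled, 90)}
```

Negative entries in the pooled list correspond to tokens emitted before their reference boundary, which is why percentiles rather than a mean of absolute values are a natural summary.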

B. Latency evaluation of CTC-ST
As shown in Fig. 1, the naive MoChA tended to emit tokens later than the corresponding CTC spikes, and CTC-ST reduced the gap in the example. To evaluate this quantitatively, we calculated the TEL. We used UniLSTM encoders for this purpose, because LC-BLSTM encoders introduce algorithmic latency on the encoder side, whereas we are interested in the emission latency on the decoder side. Tables VIII and IX summarize the results on the TEDLIUM2 and Librispeech corpora, respectively. The TEL on Librispeech was averaged over the test-clean and test-other sets. We first observed that SpecAugment significantly increased the TEL of the baseline MoChA model. We also evaluated the TEL from the CTC branch¹¹, and it also increased slightly when SpecAugment was applied. On the other hand, we confirmed that CTC-ST significantly reduced both TEL and WER on both corpora. Increasing $\lambda_{\mathrm{sync}}$ up to 4.0 improved both metrics on TEDLIUM2. On Librispeech, in contrast, the WER was best at $\lambda_{\mathrm{sync}} = 1.0$, while the TEL continued to decrease as $\lambda_{\mathrm{sync}}$ was increased up to 3.0. PT@50 of the baseline MoChA with SpecAugment was reduced by 240ms on both TEDLIUM2 and Librispeech. PT@90 was reduced by 600ms and 280ms on TEDLIUM2 and Librispeech, respectively. The TEL of MoChA matched that of CTC in most conditions, confirming the function of CTC-ST. CTC-ST traded WER and TEL effectively by changing $\lambda_{\mathrm{sync}}$. Interestingly, we found that increasing $\lambda_{\mathrm{sync}}$ also reduced the TEL of the CTC branch. This indicates that joint training reduced the TEL of the other branch interactively via the shared encoder. Although CTC itself had a delay from the reference acoustic boundaries, it provided better timing to emit tokens for MoChA.

[Table fragment: emission latency in ms; column headers not recoverable]
RNN-T      240  320  240  400    0  160
MoChA      320  840  320  640   80  440
+ CTC-ST    80  240   80  200  -80   40

C. Comparison with alignment distillation from hybrid system
Next, we compared CTC-ST with methods of alignment knowledge distillation from a hybrid system. We trained MoChA models with MinLT and DeCoT using the same model configuration. Tables VIII and IX summarize the results on the TEDLIUM2 and Librispeech corpora, respectively. We observed that both DeCoT and MinLT also reduced the WER and TEL from those of the baseline model.¹² DeCoT with the optimal $\delta_{\mathrm{decot}}$ outperformed MinLT in both metrics on both corpora. CTC-ST also outperformed MinLT in both metrics. On Librispeech, increasing $\lambda_{\mathrm{sync}}$ brought a large TEL reduction with little degradation in the WER, whereas MinLT degraded the WER considerably for only a small TEL reduction. Compared to DeCoT, CTC-ST achieved a lower TEL, especially for PT@50, with a comparable WER. This was because DeCoT focused on tokens whose emission latency surpassed $\delta_{\mathrm{decot}}$, and thus its TEL reduction was concentrated in the tail (PT@90). On the other hand, CTC-ST reduced the emission latency of all tokens. Therefore, we conclude that CTC-ST can achieve a similar or better tradeoff compared to alignment knowledge distillation from a hybrid system without relying on external alignment information.

D. Comparison with RNN-T
Finally, we compare the emission latency between MoChA and RNN-T. In addition to the average per-word statistics, we also calculated the WEL corresponding to the first and last tokens. The results in Tables X and XI present the WEL on TEDLIUM2 and Librispeech, respectively. We confirmed that MoChA trained with CTC-ST achieved lower WELs than those of RNN-T in all the conditions on both corpora. Note that RNN-T was also jointly trained with the CTC objective, and thus can be regarded as a strong baseline. Comparing the first WEL and the last WEL, we found that the latter had a lower latency. We reason that more acoustic contexts were necessary to emit the first word because there was no linguistic context on the decoder side. On the other hand, the last WEL of MoChA with CTC-ST was close to zero.

¹² Unlike in [32], we applied SpecAugment to DeCoT and MinLT. We found that those methods can also tolerate noisy inputs to some extent.

XI. ANALYSIS
In this section, we perform an ablation study of alignment generation in CTC-ST. Finally, we compare MoChA and RNN-T in terms of the inference speed.

A. Effect of incremental alignment update
In the above experiments, we generated the reference boundaries from CTC alignments with the model parameters at each training step on the fly. Here, we investigated the effect of using fixed reference boundaries throughout stage 2 by using parameters optimized in stage 1. We refer to this strategy as precomputing. When generating the CTC alignments for precomputing, we deactivated SpecAugment and other regularization methods such as dropout. We used $\lambda_{\mathrm{sync}} = 1.0$ in this experiment. Tables XII and XIII summarize the results on the TEDLIUM2 and Librispeech corpora, respectively. For TEDLIUM2, the on-the-fly CTC alignment generation consistently outperformed the precomputing strategy, regardless of the encoder type. Note, however, that the precomputing strategy also significantly outperformed the baseline listed in Table IV. For Librispeech, precomputing showed performance similar to that of on-the-fly computing. This was because the parameters learned in stage 1 had already provided good CTC alignments by leveraging more training data. This is also consistent with the observation in Section X-C that CTC-ST achieved WERs similar to those of DeCoT on Librispeech.

B. Inference speed
While we have shown that MoChA can match RNN-T in terms of accuracy, we also evaluated its efficiency as compared to RNN-T in terms of the inference speed. For both models, we precomputed token embeddings before decoding. For RNN-T, we cached prediction network states corresponding to the same hypothesis [13] and batched all hypotheses in the beam for updating the prediction network and joint network [107]. We applied both TSD and mono-TSD as the search algorithm [97]. The maximum expansion number was set to 3 per frame in the TSD algorithm. We used the best MoChA (optimized with CTC-ST) and RNN-T models trained with SpecAugment, with the beam width set to 10 and {5, 10}, respectively. The inference speed was measured with a 6-core Intel(R) Xeon(R) Gold 6128 CPU @ 3.4GHz. We investigated {1, 2, 4} threads, and we report the real-time factor (RTF) obtained by averaging five trials.

Figure 7 shows the results on the TEDLIUM2 test set. We observed that all MoChA models achieved an RTF of less than 1.0 with a single thread in a Python implementation. Using more threads led to faster decoding. The UniLSTM encoder was slower than the LC-BLSTM encoder because of the incremental state update on a per-frame basis. On the other hand, RNN-T required the mono-TSD algorithm with half the beam width to achieve a similar inference speed. Moreover, RNN-T with the TSD algorithm was much slower because of the multiple symbol expansions per frame.
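The real-time factor reported here is the ratio of decoding time to audio duration, so values below 1.0 mean faster than real time. A minimal sketch of the measurement, averaged over repeated trials, is given below. The function name, the `decode_fn` callback, and the injectable `clock` argument are assumptions for illustration and do not reflect the actual benchmarking script.

```python
import time


def real_time_factor(decode_fn, audio_seconds, num_trials=5,
                     clock=time.perf_counter):
    """Average RTF over several trials: decoding time / audio duration."""
    rtfs = []
    for _ in range(num_trials):
        start = clock()
        decode_fn()  # decode the whole evaluation set once
        rtfs.append((clock() - start) / audio_seconds)
    return sum(rtfs) / len(rtfs)
```

Passing the clock as an argument keeps the measurement logic testable with a fake timer; in practice the default monotonic `time.perf_counter` would be used, and the thread count would be fixed outside this function before each run.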

XII. CONCLUSIONS
In this article, we have proposed CTC synchronous training (CTC-ST), a method for self-distillation of input-output alignment knowledge that improves the performance of streaming AED models. Specifically, we distill knowledge of token boundary positions from a CTC model to a MoChA model, both of which share an encoder and are trained jointly. The proposed method forces MoChA to generate tokens in positions similar to those predicted by CTC, by synchronizing both sets of token boundaries during training. Experimental evaluations on four benchmark datasets demonstrated that the proposed method significantly improved MoChA in terms of both the recognition accuracy and the emission latency, especially for long-form and noisy utterances. We also compared the proposed method with methods of alignment knowledge distillation from an external hybrid ASR system and achieved a similar tradeoff between accuracy and latency without any external alignments. Finally, we showed that MoChA can achieve comparable recognition accuracy, lower emission latency, and faster inference speed compared to RNN-T.
In future work, we would like to further reduce the gap in recognition accuracy between RNN-T and MoChA on very long utterances. Reducing the flicker of MoChA by selecting stable partial hypotheses that do not change in the subsequent prefix expansion, which has been studied for incremental systems [59], [60], [108]-[111] and RNN-T [112], is also an interesting research direction.