Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-Hour Scale

End-to-End speech recognition has become the center of attention for speech recognition research, but Hybrid Hidden Markov Model / Deep Neural Network (HMM/DNN) systems remain a competitive approach in terms of performance. End-to-End models may be better at very large data scales, and HMM/DNN-systems may have an advantage in low-resource scenarios, but the thousand-hour scale is particularly interesting for comparisons. At that scale, experiments have not been able to conclusively demonstrate which approach is best, or whether the heterogeneous approaches yield similar results. In this work, we work towards answering that question for Attention-based Encoder-Decoder (AED) models compared with HMM/DNN-systems. We present two simple experimental design principles, and show how to build systems adhering to those principles. We demonstrate how those principles remove confounding variables related to data as well as to neural architecture and training. We apply the principles in a set of experiments on three diverse thousand-hour-scale tasks. In our experiments, the HMM/DNN-systems yield equal or better results in almost all cases.

The End-to-End approaches differ from HMM/DNN-systems in many ways, but there are also many differences among the End-to-End approaches themselves. Comparing these heterogeneous approaches is not straightforward: because of the many differences, experimental results cannot be attributed to a single cause. Broadly speaking, in very large data tasks, at the ten-thousand or hundred-thousand-hour scales, End-to-End models are reported to outperform HMM/DNN-systems [5], [6], and with limited resources, with a hundred hours or less, HMM/DNN-systems may have an advantage [7], [8]. This empirical result is also plausible theoretically, as End-to-End models generally rely less on in-built structure [9].
However, the thousand-hour scale is particularly interesting for comparisons, because no approach has been shown to be conclusively better [7], [10], and because the thousand-hour scale is still possible to reach, resource-wise, in multiple languages, in open datasets, and in multiple styles [11], [12], [13].
In this work we propose two simple experimental design principles, which allow making stronger statements from comparisons of heterogeneous speech recognition systems. The Equal Data Setting is a principle which avoids confounding differences in data. The Matched Encoder Setting is a principle which avoids differences in neural architecture and training.
We apply these principles in a set of experiments comparing HMM/DNN-systems and AED-models. We focus on these two approaches, because both have recently been shown to have high performance [10], and because they are very different: using input- vs. output-synchronous decoding, having implicit internal vs. explicit external language models, using hard and explicit vs. soft and implicit alignments. Additionally, building multiple well-performing speech recognition systems is a large effort, and thus we need to limit the scope of the work.
Our main contributions are as follows.
Firstly, we propose a conceptual framework for comparing End-to-End and HMM-based speech recognition systems. We develop the Equal Data Setting and Matched Encoder Setting principles for experimental design. We show how to build HMM/DNN-systems and AED-models, which adhere to these principles.
Secondly, we conduct a set of experiments comparing HMM/DNN-systems and AED-models, with the HMM/DNN-systems consistently reaching equal or better performance compared to our AED-models. Because of the principles we followed, we are able to say that the results are not due to having additional data, nor due to the neural architecture or training favouring the HMM/DNN-systems.
Thirdly, we make multiple discoveries about our HMM/DNN-systems. We develop a multi-head decoding method, which yields the best results. We find that frame-level training is still useful, but on the other hand, expert pronunciation-lexicons and tree-clustering for state-tying do not appear necessary, echoing other recent work [14], [15], and potentially simplifying the HMM/DNN-system.

A. Related Work
Speech recognition approaches are most commonly compared by reporting various state-of-the-art results from the literature so far. On popular benchmarks this may also lead to intense competition, with rapid progress on the state-of-the-art numbers. For example, in [16], an AED-model using SpecAugment was reported to surpass state-of-the-art results on Librispeech, but concurrently in [7], an HMM/DNN-system achieved the lowest error rates at that point. In [17] an AED-model was found to outperform other Switchboard-300 results in the literature. Though these results were later surpassed by an HMM/DNN-system in [10], the results of the earlier work with an AED-model were already improved to the lowest currently known numbers in [18]. In this competition for the state-of-the-art numbers, the systems are not constrained in any way, and implementations differ considerably, e.g. in terms of the number of training epochs.
Our proposed principled experimental design is more similar to experiments where speech recognition approaches are compared directly, applying some specific constraints. For example, [19] compares HMM/DNN-systems, CTC-models, AED-models, and Transducer-models using the same encoder architecture for all, though exact training parameters are not described. In [7], AED-models and HMM/DNN-systems are compared using the same neural model type (Bidirectional LSTM), though the authors do not use exactly matching architectures or training hyperparameters, but opt instead to optimize each model's recipe in isolation.
Concurrently with our work, [9] provides an overview survey of End-to-End speech recognition. The survey breaks down the term End-to-End into more precisely defined concepts, and also includes a section relating End-to-End speech recognition to HMM-based speech recognition (the term Classical speech recognition is used). Additionally, [20] presents the field from an industry perspective, stating how choosing an appropriate speech recognition approach to develop and deploy is not easy.

II. PRINCIPLES FOR COMPARISONS OF HETEROGENEOUS SYSTEMS
Here we introduce two constraining principles for building heterogeneous speech recognition systems for direct comparisons. The goal is to create comparisons which reveal more about the differences in the speech recognition approaches, as opposed to confounding elements such as differences in data or optimization.
Firstly, we introduce the Equal Data Setting, which we first explored in [8]. Secondly, in this work we also propose the Matched Encoder Setting.

A. Equal Data Setting
Different speech recognition approaches may be able to leverage different data sources: hybrid HMM/DNN-systems are able to leverage curated pronunciation lexicons and additional text-only data, while standard End-to-End AED-models only use transcribed speech. In a practical, commercial application, where the end goal is to train the best performing model, it is sensible to use all available data sources. However, we wish to quantify the differences in the models, not the differences in data. Thus, we argue that the models should be compared under an Equal Data Setting [8], where the data that is available to each approach is exactly the same.
If we are comparing a speech recognition approach that only uses End-to-End Data, i.e. just transcribed speech, the Equal Data Setting limits all approaches to End-to-End Data. Mostly, this limitation affects HMM/DNN-systems. In an Equal Data Setting, HMM/DNN-systems use grapheme-based lexicons and transcript-based language models. Training language models only on transcripts will likely lead to less capable models, but requires no special techniques, as it is just a reduction in the amount of data available. Grapheme-based lexicons, where the acoustic model units are based on characters, can be used without any pronunciation dictionary data. For languages such as English, with non-trivial pronunciation, grapheme-based systems can be expected to perform slightly worse, whereas for languages like Finnish, which have a transparent orthography, grapheme-based systems are the norm and curated lexicons do not offer a benefit [21].
Instead of limiting the HMM/DNN-system to End-to-End Data, it is possible to extend End-to-End models to use other data types. Developing methods to leverage text-only data is probably beneficial in any speech recognition approach. Joint Training can be maintained by synthesizing audio or encoder representations for the text-only data [22], [23]; however, this requires an additional synthesis model. A simpler method is using an external neural Language Model (LM) in shallow fusion with an AED-model. This way it is possible to use additional text-only data and retain an Equal Data Setting. However, the resulting model is no longer Jointly Trained. Though shallow fusion is the standard approach, it does not compensate for the internal language model in the AED-model (only learnt on the transcripts of the data). This compensation is possible through more sophisticated methods (e.g. [24]). To leverage pronunciation dictionaries, AED-models can be made to use phoneme-based units, though this can make decoding more complicated and may not offer any benefit over grapheme-based units [25].
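As a minimal sketch of the shallow fusion idea mentioned above, the score of each candidate token can be formed as the AED log-probability plus a weighted external LM log-probability. The function name, the toy distributions, and the LM weight of 0.5 are illustrative assumptions, not values from our recipes.

```python
import math

def shallow_fusion_scores(aed_logprobs, lm_logprobs, lm_weight):
    """Combine token scores as log p_AED + lm_weight * log p_LM.

    Both arguments map candidate tokens to log-probabilities.
    """
    return {
        tok: aed_logprobs[tok] + lm_weight * lm_logprobs[tok]
        for tok in aed_logprobs
    }

# Toy example: the AED-model slightly prefers "cap",
# while the external LM strongly prefers "cat".
aed = {"cat": math.log(0.45), "cap": math.log(0.55)}
lm = {"cat": math.log(0.9), "cap": math.log(0.1)}

fused = shallow_fusion_scores(aed, lm, lm_weight=0.5)  # weight is an assumption
best = max(fused, key=fused.get)
```

In this toy case the external LM overturns the AED-model's preference. Note that this plain sum does nothing to discount the internal language model of the AED-model.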
Speech recognition research has a long tradition of controlled benchmarks with clearly defined training, validation and test data (e.g. [11], [26], [27]). With the advent of End-to-End approaches, the Equal Data Setting is needed as a more precise definition for experiments, due to the benchmarks' additional text resources and pronunciation lexicons.

B. Matched Encoder Setting
Virtually all current ASR approaches use a notion of a Speech-Encoder, which maps the audio into a representation that contains only the information relevant for transcription. In an AED-model, the Speech-Encoder is simply the Encoder part of the model, and in an HMM/DNN-system, the Speech-Encoder is at the core of the Acoustic Model, before the representations are mapped into emission probabilities. Both approaches can use the exact same neural architectures for the Speech-Encoder. This similarity is contrasted by very different paths for decoding text. The AED-model includes the attentional decoder, from which text is produced output-synchronously. An HMM/DNN-system uses a search system that integrates the probabilities of the Acoustic Model, the hidden Markov model transition probabilities, a pronunciation model, a separate Language Model, and possibly other models.
Since the task of speech recognition remains the same regardless of approach, similar representations are probably useful in all approaches. Unfortunately, the representations learned by the AED-model encoder have not been studied extensively [28], [29].
We propose to use the same neural architecture for the Speech-Encoder in heterogeneous speech recognition systems, such as with HMM/DNN-systems and AED-models. This ensures that the Speech-Encoder has the same capacity and is equally good at modeling acoustics. In addition to the neural architecture, the important neural network training hyperparameters, such as batch size, learning rate schedule, and number of training epochs, are matched. Only the approach-specific hyperparameters (e.g. the weight of an auxiliary Cross-Entropy loss in an HMM/DNN-system) have no counterpart and thus cannot be matched. The initialization is also matched. In a similar vein, all speech recognition approaches probably benefit from using augmentation or auxiliary inputs such as speaker embeddings [16], [18], [30], [31], and their use should be matched.
We call this approach of using the same Speech-Encoder, with the same augmentation and auxiliary inputs, and the same training hyperparameters the Matched Encoder Setting.
In a general sense, the goal is to avoid the phenomenon where two systems are compared, but one of them is more heavily optimized, which could be called the favourite child problem. However, matched hyperparameter training is not trivial, since neural network training depends on the criterion: the same training hyperparameters could be closer to optimal for one approach. We still propose picking one model to optimize first, and then applying those parameters to the other system. Crucially, this sets a lower bound on the performance of the latter system: it could only improve through further optimization. Furthermore, if the latter system outperforms the former system, it is not a result of the favourite child problem (at least in terms of hyperparameter tuning). Finally, as we will show in our experiments (Section IV), hyperparameters which work well for one system are often applicable to another, as well.

C. What Else Should Be Controlled For?
There are certain things that we strived to control for, that do not clearly follow from the Equal Data Setting and Matched Encoder Setting. Firstly, we used the same subword vocabularies for the language models of the HMM/DNN-systems and for the outputs of the AED-models. Secondly, if we apply sequence-discriminative training, we should apply it to both approaches. Thirdly, we generally use single-pass decoding with both models, in this case restricting the HMM/DNN-system to N-gram language models, but we also explore neural language model rescoring in some experiments.
We believe the principles proposed in this section follow from good scientific practice. We also believe that following these principles allows us to draw stronger conclusions from our experiments in comparing HMM/DNN-systems and AED-models. Nevertheless, these principles cannot cover all design choices in determining the compared systems. We discuss limitations in Section V-A.

III. SPEECH RECOGNITION SYSTEMS
We use the principles introduced in Section II to build comparable HMM/DNN-systems and AED-models. We aim to build HMM/DNN-systems and AED-models following well-established, modern practices.
The Matched Encoder Setting practically necessitates using the same software tools for training both the HMM/DNN-system acoustic model and the AED-model, since different tools can very easily have subtle differences in neural implementations. There are few public, open source tools that allow this readily. One example is combining the Returnn and the RASR toolkits [32] in the TensorFlow ecosystem. In the PyTorch ecosystem, the Espresso [33] and the k2-fsa toolkits allow some form of AED-models and HMM/DNN-systems, but both lack, for example, sequence-discriminative AED-model training and Gaussian Mixture Models (GMM). We conduct our experiments in the PyTorch ecosystem, and opt to use SpeechBrain [34] to train neural networks, and build the full recipes by integrating many different toolkits. We release our recipes online, hoping to help further research in this implementation-intensive area.
We use three different Speech-Encoder neural architectures: the Convolutional-Recurrent-Feedforward (CRDNN) model [34], the Conformer (Confo) [35], and wav2vec 2.0 (w2v2.0) [36]. The CRDNN and Conformer models take 40- and 80-dimensional Mel-scale filter bank log-energy vectors as input, respectively. Both architectures have a front-end of two convolutional layers, with the CRDNN using layers of 64 and 128 channels, and the Conformer using layers of 64 and 32 channels, and with both architectures using 3-by-3 kernels. The convolutional layers subsample the input in time, three-fold for HMM/DNN-systems and four-fold for AED-models, resulting in 30 ms and 40 ms output frame-rates, respectively. This minor difference in the encoder does not change the number of parameters, but the different ASR approaches simply work best at different time-granularities. A small exception is the projection layer after the Conformer convolutional front-end, which is necessarily slightly wider when the total stride is 3 (compared to 4 in the AED). From here on the architectures diverge. On the CRDNN, the convolutional layers are followed by three 512-wide bidirectional LSTM layers, and finally by one 512-wide feed-forward layer.
The wav2vec 2.0 encoder is the Large size, which has 318 million parameters. The encoders have been pretrained on large untranscribed speech datasets using the wav2vec 2.0 Self-Supervised Learning (SSL) approach. The model has a convolutional frontend, which takes the raw audio waveform as input. We keep the convolutional frontend parameters frozen. The bulk of the model is made up of Transformer (Trafo) layers. We use openly available pretrained parameters (Uralic V2 for Finnish, LV60 for Librispeech). The pretraining SSL approach is explained in [36]. On top of the wav2vec 2.0 pretrained model, we add two randomly initialized feed-forward layers (1024-wide), which slightly improved our results in preliminary experiments. The wav2vec 2.0 encoder natively runs at a 20 ms output frame-rate. Thus its output does not cleanly divide into the 30 ms rate of the HMM/DNN-system. Instead, the wav2vec 2.0-based HMM/DNN-system uses the 20 ms output frame-rate. The AED-model simply takes every other output, yielding the regular 40 ms frame-rate.
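The frame-rate arithmetic above can be sketched as follows. The 10 ms input frame shift is an assumption implied by the three-fold and four-fold subsampling figures, and the function names are hypothetical.

```python
BASE_FRAME_SHIFT_MS = 10  # assumed filter bank frame shift

def output_frame_shift_ms(input_shift_ms, subsampling):
    """Output frame shift after subsampling the input in time."""
    return input_shift_ms * subsampling

# CRDNN and Conformer encoders
hmm_shift = output_frame_shift_ms(BASE_FRAME_SHIFT_MS, 3)  # HMM/DNN-system
aed_shift = output_frame_shift_ms(BASE_FRAME_SHIFT_MS, 4)  # AED-model

def take_every_other(frames):
    """Halve the frame-rate by keeping every second encoder output."""
    return frames[::2]

# wav2vec 2.0 natively outputs one frame per 20 ms.
w2v2_frames = list(range(10))                # ten 20 ms frames, 200 ms of audio
aed_frames = take_every_other(w2v2_frames)   # five frames, 40 ms apart
```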
The Conformer encoders are trained with SpecAugment [16], as it is part of the recipe we adopt. We make a small adjustment: we do not use on-the-fly time stretching, so that our original GMM-alignments can be used. Although purely sequence-trained models could use time-stretching, in preliminary tests, we found that removing the time stretching yields us the same results as the recipe we adopt. The original work on SpecAugment also suggests the time-stretching is not crucial [16]. The CRDNN and wav2vec 2.0 encoders do not use SpecAugment, keeping in line with our earlier work. Augmentation and SpecAugment yield better performance, but this applies to both AED-models [16] and HMM/DNN-systems [30], and as such we believe it to be mostly a matter orthogonal to comparisons such as those presented here.

A. Hybrid Hidden Markov Model / Deep Neural Network Systems
The main HMM/DNN-systems are built in many stages, starting with GMM acoustic models. These are then followed by DNN acoustic models which use both the Lattice-Free Maximum Mutual Information (LF-MMI) [37] and Cross-Entropy (CE) training criteria. Language models are trained separately.
To study the different benefits that the GMM alignments yield, we also train HMM/DNN-systems which use either Flat Start (FS) DNN acoustic models or only use Cross-Entropy targets.
Our acoustic models use word-position-dependent grapheme-units (permitting four variations of a character: at the beginning, inside, and at the end of a word, and as single character words [38]), except in Section IV-B, where we additionally present results with word-position-dependent phoneme-units for contrast.
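To illustrate the word-position-dependent grapheme-units, a minimal sketch of building a grapheme lexicon directly from transcripts is given below. The suffixes _B, _I, _E and _S follow the Kaldi convention for word-position-dependent units; the function names and the toy transcripts are assumptions for illustration, not our actual recipe code.

```python
def word_to_position_dependent_graphemes(word):
    """Map a word to grapheme units marked by their position in the word."""
    chars = list(word)
    if len(chars) == 1:
        return [chars[0] + "_S"]                   # single character word
    units = [chars[0] + "_B"]                      # beginning of the word
    units += [c + "_I" for c in chars[1:-1]]       # inside the word
    units.append(chars[-1] + "_E")                 # end of the word
    return units

def build_grapheme_lexicon(transcripts):
    """Build a lexicon from transcripts only, with no pronunciation data."""
    words = sorted({w for line in transcripts for w in line.split()})
    return {w: word_to_position_dependent_graphemes(w) for w in words}

lexicon = build_grapheme_lexicon(["a cat sat", "on a mat"])
```

Since the lexicon is derived from the transcripts alone, it needs only End-to-End Data and thus conforms to the Equal Data Setting.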
1) Gaussian Mixture Models: The hybrid HMM/DNN-system acoustic model recipe begins by training increasingly more complex GMMs. We use the Kaldi toolkit [39] for all GMMs. We follow the Kaldi standard four-stage GMM recipe outline, where the last stage is a speaker-adapted tri-unit tri-state HMM/GMM-system. The final GMM acoustic model is used to align the training data. These alignments are then used both in the HMM tree-clustering algorithm for state-tying and for Cross-Entropy target labels.

2) Deep Neural Network Acoustic Models:
The early DNN acoustic model formulations computed the probability of the input belonging to a particular HMM emission state, and turned this into emission likelihoods through normalizing (dividing) by the prior probability.These DNNs use the frame-wise Cross-Entropy criterion, which trains the network to match the GMM alignments.
Cross-Entropy training remains a mainstream DNN acoustic model training method, either used as the first training phase, or as an auxiliary task. However, to achieve state-of-the-art performance, HMM/DNN-systems use some form of sequence-discriminative training. We use the sequence-discriminative LF-MMI criterion, taking the implementation from PyChain [40]. Most of the improvement can be achieved with any sequence-discriminative criterion, but a criterion that directly minimizes the expected error, which we lack, could still yield some further improvements [37], [41]. This presents a small caveat in the interpretation of our results.
As recommended, we add L2-regularisation with weight 0.0005 to the outputs of the LF-MMI head [37], yielding minor improvements. The outputs of the LF-MMI DNN acoustic models are typically interpreted as logarithmic pseudo-likelihoods, requiring no division by the prior. It has been shown that LF-MMI can be used for Flat Start training, requiring no alignments, an output space based on simple pruning instead of tree-based clustering, and starting from a randomly initialized neural network [42]. This allows pure sequence-level training, offering simplicity similar to CTC training of acoustic models, but with a sequence-discriminative criterion.
Our main HMM/DNN-systems experiments use Cross-Entropy and LF-MMI in a multi-task learning setup. For the Cross-Entropy loss we apply uniform Label Smoothing (LS), which can help calibrate the output of the model, aiding in beam search [43]. In multi-task learning, the Cross-Entropy and LF-MMI criteria each have their own output head, which is a separate linear layer, though both heads use the same units. We note that since both heads are used to compute HMM emission likelihoods, perhaps the term multi-loss learning or multi-head model could also be appropriate. We use a three-fold reduced output frame-rate for both the Cross-Entropy and the LF-MMI heads, as is typical with LF-MMI. When we use the Cross-Entropy output head in inference, we normalize the output with a prior vector. The prior is estimated empirically by averaging the Cross-Entropy head outputs on a sample of the training data. We decode by computing log-likelihoods in SpeechBrain and then using beam search in the Kaldi Weighted Finite State Transducer (WFST) decoder.
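The prior normalization of the Cross-Entropy head can be sketched as follows: the prior is the average posterior over a data sample, and the pseudo log-likelihood of a state is its log-posterior minus the log-prior. The function names and the toy posteriors are illustrative assumptions, not our implementation.

```python
import math

def estimate_log_prior(posteriors):
    """Estimate state log-priors by averaging per-frame posteriors.

    posteriors: list of frames, each a list of state posteriors.
    """
    num_frames = len(posteriors)
    num_states = len(posteriors[0])
    return [
        math.log(sum(frame[s] for frame in posteriors) / num_frames)
        for s in range(num_states)
    ]

def pseudo_log_likelihoods(frame_posteriors, log_prior):
    """log p(x|s) up to a constant: log p(s|x) - log p(s)."""
    return [math.log(p) - lp for p, lp in zip(frame_posteriors, log_prior)]

# Toy sample of Cross-Entropy head outputs (after softmax) for three states.
sample = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.7, 0.2, 0.1]]
log_prior = estimate_log_prior(sample)      # priors are about 0.7, 0.2, 0.1

scores = pseudo_log_likelihoods([0.35, 0.40, 0.25], log_prior)
```

In this toy frame, the second state has the highest posterior and, being rarer in the prior, is boosted further by the normalization, while the first state is discounted.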
The common Kaldi inference time solution is to discard the Cross-Entropy head, and only use the LF-MMI outputs. Instead, we find that in our implementation, the best performance is achieved by using both output heads and linearly combining their outputs after a log-softmax, with the same weights as used during training (0.1 weight for Cross-Entropy). To the best of our knowledge, this proposed multi-head inference is a novel improvement for the HMM/DNN approach, though it resembles an efficient form of model combination. We presented initial results using this approach in [44] and explore it here in more detail. Table I compares the various acoustic model training criteria and output heads used during inference. The LF-MMI + Cross-Entropy training has a clear benefit both over LF-MMI alone or Cross-Entropy alone on both Librispeech and Finnish Parliament Train20 (see Section IV-A for dataset information). On Finnish Parliament Train20, perhaps because of the extensively tuned HMM/GMM recipe, the Cross-Entropy head yields the best single-head results. On Librispeech, the LF-MMI head is the better one of the single output heads. Additionally, we find that the simple pruning Flat Start LF-MMI outperforms the tree-clustering state-tied LF-MMI in our test on Finnish Parliament Train20. This is surprising, but seems to suggest that the tree-clustering does not always yield better performance in HMM acoustic modeling, which is also suggested in other recent work [15] and would simplify the HMM/DNN-system further. The investigation of this phenomenon is out of scope for this work.
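The linear combination step of the multi-head inference described above can be sketched as follows. The Cross-Entropy weight of 0.1 is from the text; the LF-MMI weight of 1.0, the function names, and the toy logits are assumptions, and any prior handling is left out of this sketch.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def combine_heads(lfmmi_logits, ce_logits, ce_weight=0.1, lfmmi_weight=1.0):
    """Weighted sum of the two heads' log-softmax outputs for one frame.

    lfmmi_weight=1.0 is an assumed value; ce_weight=0.1 follows the text.
    """
    lfmmi = log_softmax(lfmmi_logits)
    ce = log_softmax(ce_logits)
    return [lfmmi_weight * a + ce_weight * b for a, b in zip(lfmmi, ce)]

# One frame, three HMM states; both heads share the same output units.
combined = combine_heads([2.0, 0.5, -1.0], [1.0, 1.5, -0.5])
best_state = max(range(len(combined)), key=combined.__getitem__)
```

The combined per-frame scores would then be passed to the WFST decoder in place of single-head scores.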
3) Language Models: HMM/DNN-systems typically use N-gram language models, as they can easily be made computationally feasible for single-pass decoding. Additionally, large neural language models may be used in rescoring to improve results. Under the End-to-End Data limitation, the amount of data available for language modeling is lower than in typical systems. This emphasizes the data sparsity problem inherent with N-gram language models using large vocabularies. Thus, it is especially important to use subwords as the language modeling unit, which leads to a smaller vocabulary. We use Byte Pair Encoding (BPE) units with SentencePiece segmentation. With SentencePiece units, we take care to handle the word-position-dependent units correctly [8]. With subword units, it is especially important to use longer N-gram spans [45] and thus we use the VariKN toolkit, which can grow large span modified Kneser-Ney backoff language models [46]. We use 10-gram models for all transcript-based language models.
Neural language models are commonly thought to be more data hungry than N-gram models. Thus the benefits from neural language model rescoring may be diminished under the End-to-End Data limitation. Nevertheless, we present some experiments using neural language models, which are trained on the transcripts only. All of our neural language models are based on the Transformer architecture and use the same subword units as the corresponding N-gram models. With HMM/DNN-systems, we apply these neural language models in 100-best list rescoring (we also tried a 1 000-best list but it did not improve results). The neural language models are implemented in SpeechBrain.
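A minimal sketch of N-best list rescoring is given below. It assumes that the neural LM log-probability is added, with a weight, on top of the first-pass score; the exact score combination in our recipes may differ, and the scoring function here is a hypothetical stand-in for a Transformer LM.

```python
def rescore_nbest(nbest, neural_lm_score, lm_weight):
    """Pick the best hypothesis after adding a weighted neural LM score.

    nbest: list of (hypothesis, first_pass_score) pairs.
    neural_lm_score: callable returning a log-probability for a hypothesis.
    """
    rescored = [
        (hyp, score + lm_weight * neural_lm_score(hyp))
        for hyp, score in nbest
    ]
    return max(rescored, key=lambda item: item[1])

def toy_lm(hyp):
    """Hypothetical stand-in for a neural LM log-probability."""
    return -1.0 if hyp == "recognize speech" else -5.0

# A 2-best list for illustration; our experiments use 100-best lists.
nbest = [("wreck a nice beach", -10.0), ("recognize speech", -10.5)]
best_hyp, best_score = rescore_nbest(nbest, toy_lm, lm_weight=0.5)
```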
The language model weight and word-insertion-penalty are important decoding hyperparameters, and are optimized on development sets.
We note that it has been shown that with efficient implementations, arbitrary history length neural language models can be applied to single-pass search in HMM/DNN-systems [47]. However, here the single-pass HMM/DNN-system implementation is limited to local context language models, leaving the advantages of arbitrary-length history modeling to the AED-model in this comparison.

B. Attention-Based Encoder-Decoder Models
The AED-models add an attentional decoder on top of the Speech-Encoder. The decoder uses attention to find relevant parts in the input, and then computes a distribution over the output text units. We use the same set of subword units that the HMM/DNN-system language models use. To optimize the networks, we employ the Cross-Entropy criterion with label smoothing and add an auxiliary CTC criterion, which has its own output head on top of the encoder. The CTC head uses the same subword units as the main attentional decoder. For the CRDNN models, the CTC criterion is only used for the first 15 nominal epochs, the idea being to aid learning in the beginning of training, since the attention mechanism is difficult to learn from random initialisation. For the Conformer and wav2vec 2.0 encoders, we use hybrid CTC/Attention modeling, where the CTC outputs are also used in decoding [48]. This joint CTC/Attention decoding is somewhat symmetrically matched by the two-output-head decoding in the HMM/DNN-system. We decode with beam search in SpeechBrain. To deal with the length bias of AED-models [49], we use an end-of-sentence probability threshold and an attention coverage penalty [50].
Because the HMM/DNN-system uses sequence-discriminative training, we want to apply a sequence-discriminative criterion to the AED-model as well. We implement the Minimum Word Error Rate (MWER) criterion [51]. We use the recommended settings: sampler beam size 4, Cross-Entropy as regularisation with weight 0.01, and regularisation through subtracting the mean number of errors on a sample. We find it is important to use word-level MWER, not subword-level, though a subword implementation is faster because it requires no SentencePiece conversion.
We present some experiments where neural language models of Section III-A3 are used in shallow fusion with the AED-models.
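The word-level MWER criterion with the settings above can be sketched for a single utterance as follows. This is a simplified, non-differentiable illustration in plain Python: the hypothesis probabilities are renormalized over the beam, the mean number of word errors is subtracted, and a Cross-Entropy term is added with weight 0.01. The function names and the toy hypotheses are assumptions.

```python
import math

def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two transcripts."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            substitution = prev[j - 1] + (r[i - 1] != h[j - 1])
            cur[j] = min(substitution, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[-1]

def mwer_loss(ref, hyps, logprobs, ce_loss=0.0, ce_weight=0.01):
    """Expected word errors over the beam, relative to the mean error count."""
    m = max(logprobs)
    unnorm = [math.exp(lp - m) for lp in logprobs]
    total = sum(unnorm)
    probs = [p / total for p in unnorm]          # renormalize over the beam
    errors = [word_errors(ref, h) for h in hyps]
    mean_err = sum(errors) / len(errors)         # baseline subtraction
    expected = sum(p * (e - mean_err) for p, e in zip(probs, errors))
    return expected + ce_weight * ce_loss

# Beam of four hypotheses, matching the sampler beam size of 4.
hyps = ["the cat sat", "the cat sad", "a cat sad", "the sat"]
logprobs = [math.log(p) for p in (0.4, 0.3, 0.2, 0.1)]
loss = mwer_loss("the cat sat", hyps, logprobs)
```

In actual training the loss is differentiated with respect to the hypothesis log-probabilities, which this sketch does not do.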

C. Minimum Word Error Rate Training for Joint CTC/Attention Models
The classic MWER algorithm does not account for Joint CTC/Attention, so to use MWER, we needed to develop an additional solution. It could theoretically be possible to develop and implement MWER training for Joint CTC/Attention, but practically we deemed it out of the scope of this work. Another approach could be to use MWER training on the attentional decoder, and keep updating the CTC head with the regular CTC criterion. However, MWER training, which uses beam search at every step, is particularly compute intensive. Thus we deemed it best to freeze the encoder, and only update the attentional decoder with MWER. This way, after MWER finetuning, the encoder representations have not drifted away from the ones learned by CTC, and we can again apply Joint CTC/Attention decoding.

IV. EXPERIMENTS
We showcase how the Equal Data Setting and Matched Encoder Setting principles affect our results compared to not using such principles. The effects of the End-to-End Data limits placed in our proposed Equal Data Setting can be directly estimated with comparable systems, which do not conform to the Equal Data Setting. The effects of the proposed Matched Encoder Setting are seen indirectly.
In our main experiments, we compare AED-models and HMM/DNN-systems on three different tasks under the Matched Encoder Setting and the Equal Data Setting (MES-EDS Comparison). Though we are interested in the relative results of the different models, we also present some external baseline results where applicable. Section IV-E analyses the results. First, we introduce the datasets.

A. Datasets
We use three different thousand-hour-scale datasets: the Finnish Parliament Train20 dataset, Librispeech, and a Combined Finnish Data task.The Combined Finnish dataset includes both the full Finnish Parliament ASR Corpus and the Lahjoita Puhetta dataset, and it is a new task which we introduce here.
Table II summarises the data.
1) Finnish Parliament Train20: The Finnish Parliament (FP) ASR Corpus [52] is the largest publicly available transcribed Finnish speech corpus. The full training set is 3 087 hours, and has 449 different speakers. The transcripts consist of 19 million words. The speech, taken from recordings of the Finnish national parliament plenary sessions, is semi-spontaneous and covers a wide breadth of topics. The training data has two distinct subsets: Train16 and Train20. We pick the Train20 subset. The Train20 subset has an extensively tuned recipe for GMMs [52], which will aid in exploring the benefits of alignments in HMM/DNN-systems. The Train20 dataset has 1 783 hours of data from 302 speakers and its transcripts have 11 million words.
The Finnish Parliament ASR Corpus also has a development set, Dev16, and two test sets, Test16 and Test20. The Dev16 and Test16 sets are from the year 2016, and Test20 is from 2020. Additionally, the corpus has text-only resources for building language models. We use the 30 million word text dataset, and abbreviate it Parl30M. This text data also derives from the transcripts of the plenary sessions, so it has some overlap with the training data. For all the Finnish Parliament Train20 experiments we use a 1 750 BPE unit vocabulary, which was deemed to work well in prior experiments [8].
2) Librispeech: Librispeech [11] is a well-known and highly competitive English read speech task. We use the 960 h full set from more than a thousand different speakers, and the transcripts have 9 million words. Librispeech has two development and test sets: a clean one and a noisy, "other" one. Official 4-gram language models and official pronunciation dictionaries are distributed alongside the data. The 4-gram language model has been trained on an official 800 million word text corpus. For the main experiments, we train our own language models on the speech transcripts, and use grapheme-based acoustic model units. We use a 5 000 unit BPE vocabulary for all Librispeech experiments (this performed slightly better than 2 000 units in preliminary experiments).
We choose to experiment on Librispeech for a few reasons. Firstly, it is English, where expert-knowledge pronunciation dictionaries are typically used. Under the Equal Data Setting, we limit the HMM/DNN-systems to grapheme-units, and English allows us to quantify this side of the Equal Data Setting. Secondly, Librispeech has extensive baseline results for us to compare to. Lastly, Librispeech covers another style: read speech.
3) Combined Finnish Data: We combine two datasets, the Finnish Parliament ASR Corpus, and the Lahjoita Puhetta corpus [13], to form a new Combined Finnish task. This task uses the largest amount of transcribed Finnish speech training data published so far, to the best of our knowledge. This combined task is not just large, but also requires the speech recognition approaches to handle multiple styles and domains. Altogether the training set has 4 224 hours from 18 187 speakers, and the transcripts contain 29 million words.
The Finnish Parliament ASR Corpus is described above in Section IV-A1. The Lahjoita Puhetta corpus has 1 601 hours of transcribed speech from 17 821 different speakers. The speech was donated by the Finnish public and transcribed by professional transcription services. The speech is spontaneous and colloquial, covering many dialects and topics. A development and test set split was introduced in the original publication and we use the same setup in this work. The corpus includes automatically created time-alignments for the recordings. Since the original recordings are relatively long for speech recognition purposes, we split all the recordings by pauses, which were marked by the professional transcribers. This splitting is also done for the development and test sets, because the long recordings would lead to pathological output issues for AED-models [13]. These output issues require further research outside the scope of this work.
To create the Combined Finnish Data task development set, we simply combine the Lahjoita Puhetta and Finnish Parliament Dev16 development sets.

4) YLE Test:
The YLE Test data contains about six hours of Finnish broadcast news speech. The test data has a corresponding development set, but we do not separately optimize parameters on it in this work.

B. Equal Data Setting
We evaluate how the Equal Data Setting affects our results by comparing models which differ in resources. We contrast transcript-only language model results with extra-text language model results. We compare grapheme-unit models with phoneme-unit ones. With these results, we showcase how important it is to decouple differences in data from differences in the ASR approaches.
On Finnish data, the only End-to-End Data limitation is the amount of language model data. In Table III we present experiments with HMM/DNN-systems using different language model data and language model setups, and additionally we include results with AED-models with and without shallow fusion language models. We do not apply internal language model compensation here, which presents a caveat, as it could improve the language model integration. The comparisons with neural language models keep the data equal, but the AED-models are no longer jointly trained. The transcript-based Transformer language model yields a 4% relative improvement over the transcript 10-gram model for the HMM/DNN-system and a 13% relative improvement over the non-shallow-fusion result for the AED-model. The Parl30 M 10-gram and Transformer rescoring combination brings a 28% relative improvement over the transcript 10-gram HMM/DNN-system and a 36% improvement over the non-shallow-fusion result for the AED-model. The use of external language models is crucial to obtain the best speech recognition systems, and it is also important to develop better strategies and methods for language model integration in AED-models [24].
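As a rough illustration of shallow fusion, the sketch below combines next-token log-probabilities from an AED decoder and an external language model at a single beam-search step. It assumes the common log-linear interpolation form; the function name `fuse_step`, the `lm_weight` parameter, and the toy probabilities are hypothetical and are not taken from the systems described here.

```python
import math


def fuse_step(aed_logprobs, lm_logprobs, lm_weight):
    """Combine next-token log-probabilities for one beam-search step.

    Both inputs map a candidate token to its log-probability.
    lm_weight is a tunable interpolation weight (hypothetical name).
    """
    return {
        token: aed_logprobs[token] + lm_weight * lm_logprobs[token]
        for token in aed_logprobs
    }


# Toy example: the decoder slightly prefers "a", the language model prefers "b".
aed = {"a": math.log(0.6), "b": math.log(0.4)}
lm = {"a": math.log(0.1), "b": math.log(0.9)}

without_lm = fuse_step(aed, lm, lm_weight=0.0)
with_lm = fuse_step(aed, lm, lm_weight=1.0)
best_without_lm = max(without_lm, key=without_lm.get)
best_with_lm = max(with_lm, key=with_lm.get)
```

With a zero weight the decoder's own preference wins, while a sufficiently large weight lets the language model change the ranking. In practice the weight would be tuned on a development set.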
On English data, under the End-to-End Data limitation, both extra text data and expert-curated pronunciation dictionaries are excluded. In Table IV we present results with HMM/DNN-systems, which use phoneme- or grapheme-units, and systems which use the official 4-gram or our transcript-only language models. Creating English pronunciations for subword units is difficult, because the units should be pronounced differently depending on their context. Additionally, the segmentation is based on text compression, and as such does not take phonemic information into account. Therefore, our grapheme- and phoneme-unit comparison is performed with the official 4-gram word-level language model. Additional text-only data is a clear benefit in speech recognition. The main novelty in our chosen Equal Data Setting is to compare AED-models with HMM/DNN-systems using transcript-based language models. Both the Finnish and English results validate that this yields a meaningful comparison, where the data difference is eliminated. Finally, on the English data, we see that phoneme-units may offer little to no benefit over grapheme-units, echoing similar results in [14].

C. Matched Encoder Setting
We emphasize the importance of the Matched Encoder Setting by presenting various results from our experiments, as well as some external baselines. We show how not following the Matched Encoder Setting could lead to drawing the wrong conclusions.
First, we revisit some results that we previously presented on the YLE test data. In [8], we presented results under the Equal Data Setting, where the HMM/DNN-system and the AED-model have similar performance. The number of parameters and the use of auxiliary inputs were matched in that comparison. However, the AED-model used an ESPnet Transformer-architecture recipe, whereas the HMM/DNN-system relied on a Kaldi TDNN-recipe. This leaves open the question of just how much the AED-model gained over the HMM/DNN-system from using a more advanced neural model. In [52], we again used the Equal Data Setting, this time with a Kaldi HMM/TDNN-recipe compared against a CRDNN AED-model, using the Finnish Parliament Train20 data, with the Kaldi system outperforming the AED-model slightly. These latter results share the Equal Data Setting with our new results, presented in Table V. The new results show how under a Matched Encoder Setting, the HMM/DNN-system win over the AED-model is actually emphasized in this case. Furthermore, our new AED-model has roughly equal performance compared to the Kaldi HMM/TDNN on the Equal Data Setting comparison on Finnish Parliament Train20, which is further evidence that results not under a Matched Encoder Setting can be difficult to interpret. However, we note that the primary result in [8] was that the external text-only data is the key to improved results, which was shown by a clear margin, and served to emphasize the importance of using the Equal Data Setting.
In Table VI we report two initial sets of results from the course of performing these experiments. On the Finnish Parliament Train20 data, we tried two different combinations of batch size and number of updates, both adding up to seeing 40 million seconds of data. The HMM/DNN-system result change was statistically insignificant (by bootstrap estimate [53]), but the AED-model performed much better (≈ 30% relative) with 1 million iterations of 40 s batches. On Librispeech, we iteratively improved the AED-model, which leapt from 6.26% word error rate to 4.84% (a 23% relative improvement). The best AED-model surpassed the initial HMM/DNN-system, but with the changes in training hyperparameters, the HMM/DNN-system improved 4% relative, and still had the lowest error rate.
In Table V we showed how experiments using unmatched encoders can be difficult to interpret, particularly when the differences in error rates are small. It was important to apply the best AED-model neural modeling to the HMM/DNN-system as well; otherwise it might have seemed as if the HMM/DNN-system was outperformed. In Table VI we showed how the HMM/DNN acoustic model learning is not highly dependent on training hyperparameters, and the best AED-model training parameters also yielded the best HMM/DNN-system.
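The arithmetic behind two of the figures quoted for Table VI can be checked with a short sketch. The helper names are hypothetical; the inputs are the 1 million updates of 40 s batches and the 6.26% and 4.84% word error rates reported for the Librispeech AED-model.

```python
def seconds_seen(num_updates, batch_seconds):
    """Total audio seen in training, assuming every batch hits its target size."""
    return num_updates * batch_seconds


def relative_improvement(old_wer, new_wer):
    """Relative WER reduction as a fraction of the old WER."""
    return (old_wer - new_wer) / old_wer


total_seconds = seconds_seen(1_000_000, 40)
librispeech_gain = relative_improvement(6.26, 4.84)
```

Here `total_seconds` comes to 40 million, and `librispeech_gain` rounds to the quoted 23% relative improvement.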

D. Comparison Experiments
We compare our best AED-models and HMM/DNN-systems under the Matched Encoder Setting and Equal Data Setting in three tasks. We first optimize the AED-model, and then apply the same hyperparameters to the corresponding HMM/DNN-system. As explained in Section II-B, this sidesteps the favourite child problem: the neural model optimization at least does not favour the HMM/DNN-system.
Each task tests ASR models in slightly different conditions and, in the interest of making the set of experiments manageable to run, we do not test every model and approach on every task. On the Finnish Parliament Train20 and Librispeech tasks we optimize CRDNN and wav2vec 2.0 recipes. Additionally, on Librispeech, we adapt and apply a recently published well-performing Conformer recipe from SpeechBrain. On the Combined Finnish Data, we use the best CRDNN models from Librispeech.
1) Finnish Parliament Train20 Experiments: On Finnish Parliament Train20, we started with the AED-model from [52]. The CRDNN encoder is described in Section III. The attentional decoder was a single 512-wide Gated Recurrent Unit (GRU) layer, and used location-and-content aware attention. The system was trained for 100 nominal epochs, where each epoch had 10 000 updates on dynamically sized batches, targeting 40 seconds of audio per batch. The system was trained with Adam using a 0.0001 learning rate, without learning rate scheduling.
This initial system was slightly improved by doubling the attention context vector size. We also tried using multi-headed attention, using a larger decoder, not using label smoothing, and trading number of steps for a larger batch size (same amount of data seen overall), but these did not improve results in our implementation. Further improvements were found by training more (75 additional nominal epochs of 5 000 updates) with larger batches (80 s), using a NewBob learning rate schedule. Finally, we improved the AED-model through sequence-discriminative finetuning with MWER training. The MWER finetuning only needed a few thousand steps to reach the best performance. At that point, the resulting AED-model had reached parity with the Kaldi HMM/TDNN baseline from [52]. However, a Matched Encoder Setting HMM/DNN-system outperformed the AED-model.
The wav2vec 2.0 models require less training to reach good performance. Since they are considerably more compute-expensive, we train them for 25 nominal epochs (10 000 updates, 40 s batches). We use NewBob learning rate scheduling throughout training. After this, the AED-model is also slightly improved through our modified MWER finetuning approach as described in Section III-C.
We note that MWER training is a sequence-discriminative finetuning step, while the HMM/DNN-system uses the sequence-discriminative LF-MMI criterion throughout training. The MWER finetuning only took a few thousand steps, so on Finnish Parliament, we did not match this training step with anything on the HMM/DNN-system side. We could have continued the regular HMM/DNN-system training for a few thousand more steps to match the training length exactly, but the HMM/DNN-system had already converged, so it would not have changed the results. However, we note that some criterion minimizing the expected error could have been used here.
On the Finnish Parliament data, we also experimented with additional text resources. We present results using a 12-layer Transformer neural language model trained for 200 nominal epochs (with early stopping) on the transcripts, and another one trained on the Parl30 M text. We also train a single-pass-capable 10-gram language model on the Parl30 M data.
The Finnish Parliament Train20 results are reported in Table VII. Language model perplexities (normalized to the word level) are shown in Table VIII. We remind readers less familiar with Finnish that the Finnish absolute perplexity values are often much higher than e.g. English ones, due to the much larger vocabulary. Model sizes are reported in Table IX. The N-gram language model parameter counts are measured by the number of N-grams in the model, though there may be both a probability and a backoff weight associated with each N-gram.
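For reference, one common way to normalize a subword language model's perplexity to the word level is to sum the log-probabilities over all subword tokens but divide by the number of words. The formula below is a sketch under that assumption; the symbols are introduced here for illustration, and details such as whether sentence-end tokens are counted are not specified in this section.

```latex
\mathrm{PPL}_{\text{word}}
  = \exp\!\left( -\frac{1}{N_{\text{words}}}
      \sum_{i=1}^{N_{\text{tokens}}} \log p\left(t_i \mid t_1, \dots, t_{i-1}\right) \right)
```

Here $t_1, \dots, t_{N_{\text{tokens}}}$ are the subword tokens of the evaluation text and $N_{\text{words}}$ is its word count, which makes perplexities comparable across models with different subword vocabularies.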
2) Librispeech Experiments: On Librispeech we start with the best AED-model configuration from Finnish Parliament Train20. We improve the initial Librispeech AED-model with a 1024-wide decoder and training with an 80-second batch size from the start for 100 nominal epochs of 10 000 updates. After the 60th epoch, we use a NewBob learning rate schedule. This equals seeing the full data a little over 23 times. We also try a larger batch size, an even larger decoder, different learning rates and learning rate schedules including warm-up, an LSTM decoder, and lowering the label smoothing value, but the changes listed above yield our best model. On Librispeech we use the same wav2vec 2.0 configuration as Finnish Parliament Train20, except we increase the batch size to 180 seconds, which helps slightly.
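The statement that this training schedule equals a little over 23 passes over the data can be reproduced with a short calculation. The sketch assumes every batch reaches its 80-second target and that the training set is exactly 960 hours; the function name is hypothetical.

```python
def full_data_passes(nominal_epochs, updates_per_epoch, batch_seconds, dataset_hours):
    """Number of times the whole dataset is seen, assuming full batches."""
    hours_seen = nominal_epochs * updates_per_epoch * batch_seconds / 3600
    return hours_seen / dataset_hours


# 100 nominal epochs of 10 000 updates with 80 s batches on 960 h of speech.
passes = full_data_passes(100, 10_000, 80, 960)
```

Under these assumptions `passes` is roughly 23.1.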
We adapt the SpeechBrain Conformer L and Conformer S optimized recipes to our data pipeline. Unlike our other recipes, the Conformers use SpecAugment and large batch sizes (2520 seconds). The models train for 120 nominal epochs of 1824 updates (so that nominal epochs approximately match full dataset epochs), using Noam learning rate scheduling with a warm-up, and the AdamW optimizer [54]. Unlike the CRDNN and wav2vec 2.0 AED-models, which use a GRU decoder, the Conformer recipes use a 6-layer Transformer decoder. At decode-time the Conformer recipe uses 10 checkpoint parameter averaging [55], which improves results. However, the recipe does not include an MWER finetuning step, which we add. We decide to add MWER finetuning after parameter averaging, because our MWER finetuning is run for relatively few steps (20 nominal epochs of 200 updates, about 2 full epochs), which does not yield meaningfully different checkpoints to average. On all other results, MWER appears to provide a modest improvement, except with the Conformer L AED-model on Test Other. Furthermore, we verified that the improvements are from MWER and not just any training after parameter averaging: regular AED-model training after parameter averaging does not improve the results, as shown with the Conformer S AED-model results.
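Checkpoint parameter averaging, as used at decode-time in the Conformer recipe, can be illustrated with a minimal sketch. It assumes each checkpoint is a mapping from parameter names to flat lists of floats and takes a plain element-wise mean; this is a simplification for illustration, not the SpeechBrain implementation, and `average_checkpoints` is a hypothetical name.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of several checkpoints with identical parameter names."""
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # zip(*...) walks the same position in every checkpoint's parameter list.
        columns = zip(*(checkpoint[name] for checkpoint in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


# Toy example with two checkpoints and a single two-element parameter.
ckpt_a = {"w": [1.0, 3.0]}
ckpt_b = {"w": [3.0, 5.0]}
averaged = average_checkpoints([ckpt_a, ckpt_b])
```

Averaging only makes sense when the checkpoints differ meaningfully, which is the reason given above for running the short MWER finetuning after averaging rather than before it.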
Since the MWER finetuning happens after parameter averaging (a non-standard approach), we match this on the HMM/DNN-system side with regular training for an equivalent number of steps. The AED-model encoder is frozen, while the HMM/DNN-system encoder is not. This may be a small mismatch, but this way, the results more conclusively show that the HMM/DNN-system finetuning after parameter averaging is not an important step in our experiments. We only find very small improvements in the Conformer S results and the Flat Start Conformer L results, but not the main Conformer L HMM/DNN-system. We decide to report results both with and without MWER finetuning and continued training after parameter averaging, because the analysis in Section IV-E reveals that most of the improvements from training after parameter averaging do not stand credibility inspection.
The Librispeech results are reported in Table X. We have included many relevant results published elsewhere. We believe that as small CRDNN models, our results are reasonable, beating the Kaldi results and falling behind Returnn systems that also use classic (non-Transformer) neural layers. Further improvements to our results might be found through a combination of larger models, longer training with more complex optimisation (such as curriculum learning), augmentation, and additional language models. Our Conformer L AED-model falls slightly behind a comparable ESPnet model, which may be partly explained by the ESPnet advanced S4 Decoder [58]. Our HMM/DNN-system wav2vec 2.0 results roughly match the Clean results obtained in the original wav2vec 2.0 publication, but fall slightly behind on the Other data, most likely due to us not using any augmentation, and the original paper applying SpecAugment, which is very beneficial on Librispeech [16].
The Librispeech language model perplexities are listed in Table XI and a detailed look at the model sizes in Table XII.
3) Combined Finnish Data Experiments: Finally, we use the Librispeech CRDNN recipes on the Combined Finnish data. Since the Combined Finnish Data is computationally demanding, we decide to limit the Combined Finnish experiments to CRDNN models.
Our chosen Equal Data Setting places an upper bound on the language model data: the speech transcripts. However, because language model training is decoupled from acoustic model training in HMM/DNN-systems, we are able to further limit the text data to one domain only. We try limiting the language model data to either Finnish Parliament transcripts or Lahjoita Puhetta transcripts, to see if those models work better on their own domains. The Combined Finnish Data results are reported in Table XIII. The language model perplexities are shown in Table XIV and a detailed rundown of the parameter counts in Table XV.

E. Analysis of Results
We analyze the test results of the Finnish Parliament Train20 and Librispeech tasks in detail. In addition to WER, we briefly looked at Character Error Rate results, but they appeared to draw the same picture as the word-level results, and we decided to focus on WER in this work. Table XVI highlights key comparisons. A more comprehensive table of comparisons is published online. We use a bootstrap estimate to measure how credible it is that the winning system is truly better [53], treating the 95% mark as a cutoff. To measure the extent to which the compared systems produce similar output, we compute three quantities. Sentence Difference is simply the percentage of utterances that resulted in different outputs from the two systems. Additionally, we compute Kendall's rank correlation coefficient, tau (τ) [60], a measure of ranking agreement, on both utterance and speaker-level WER. A tau value close to one indicates the same utterances or speakers were (in relative terms) easy or difficult for both [...]
[...] data with Parl30 M language models. On Test20, the Parl30 M language models have higher perplexity (see Table VIII) and thus hurt performance (also seen in [52]), and this effect is larger on HMM/DNN-systems. Another exception is the wav2vec 2.0 system performance with the Parl30 M language models on Finnish Parliament Test16 data, which yielded roughly equal performance (credibility ≈ 79%).
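The three kinds of quantities used in this analysis (bootstrap credibility, Sentence Difference, and Kendall's tau) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the analysis code behind Table XVI: the bootstrap resamples utterances with replacement and counts how often system A makes fewer errors than system B, the tau shown is the simple tau-a variant without tie correction, and all function names are hypothetical.

```python
import random


def prob_of_improvement(errors_a, errors_b, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which system A has fewer errors than B.

    errors_a and errors_b hold per-utterance edit counts on the same utterances.
    Both systems share the resampled reference, so comparing summed errors
    is equivalent to comparing WER.
    """
    rng = random.Random(seed)
    n = len(errors_a)
    wins = 0
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(errors_a[i] - errors_b[i] for i in sample) < 0:
            wins += 1
    return wins / n_boot


def sentence_difference(hyps_a, hyps_b):
    """Percentage of utterances for which the two systems output different text."""
    differing = sum(a != b for a, b in zip(hyps_a, hyps_b))
    return 100.0 * differing / len(hyps_a)


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For example, `kendall_tau_a` would be applied to the per-utterance (or per-speaker) WERs of the two systems, and a `prob_of_improvement` value of 0.95 or above corresponds to the 95% credibility cutoff used in the text.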
The Conformer encoder comparisons all lead to approximately equal performance (credibility < 95%). Besides the encoder architecture, the difference to the CRDNN and wav2vec 2.0 experiments is that the Conformer recipes used very large batch sizes and needed to train for much longer. Another difference is that the Conformer recipes used SpecAugment, but we verified that this alone does not allow the AED-model to reach parity. Adding SpecAugment to the CRDNN AED-model recipe on Finnish Parliament improved the WER only from 14.36% → 13.64% on Dev16, 10.39% → 9.99% on Test16, and 8.57% → 8.47% on Test20, which is still worse than the HMM/DNN-system without SpecAugment (having WERs of 11.72%, 8.21%, 7.59% on those evaluation sets respectively). A final difference is that the Conformer AED-models used Transformer decoder layers, which may be beneficial in particular in conjunction with large batch sizes and longer training.
The wav2vec 2.0 systems consistently outperform both CRDNN and Conformer systems, with an exception on HMM/DNN-systems on Finnish Parliament Test16 when using Parl30 M language models, which has roughly equal performance (credibility ≈ 54%). The Parl30 M language models appear to provide the most improvement on that data.
Another exception is the Librispeech Other data, where the wav2vec 2.0 AED-model has roughly equal performance with the Conformer L AED-model (credibility ≈ 59%).
The Flat Start HMM/DNN-systems on the other hand have more varied comparisons. With transcript language models, Flat Start CRDNN HMM/DNN-systems have roughly equal (credibility 79%) performance with AED-models on Finnish Parliament Test16. Librispeech Test Clean and Finnish Parliament Test20 lead to AED-model wins, and Librispeech Test Other to a Flat Start HMM/DNN-system win. With larger language models, the CRDNN Flat Start HMM/DNN-system has equal performance with the AED-model on Finnish Parliament Test16 (credibility 79%) and the Librispeech test sets and loses on Finnish Parliament Test20, due to the aforementioned Parl30 M phenomenon. The Flat Start Conformer L HMM/DNN-system loses to the AED-model on both Librispeech test sets. Finally, in this work Flat Start HMM/DNN-systems lose to regular HMM/DNN-systems without exception (though contrary results exist in the literature [61]).
We computed errors on rare words, in this case words that do not appear in the training data transcripts. We used Levenshtein alignments to find instances where rare words in the reference resulted in substitutions and deletions. Table XVII lists the results. The AED-model has slightly higher error rates on the rare words than the corresponding HMM/DNN-system in every comparison except with Conformer L encoders on Librispeech Test Other. Yet the Flat Start HMM/DNN-system performs slightly worse than the AED-model on all comparisons except with CRDNN encoders on Librispeech Test Clean. Perhaps the frame-level training in the main HMM/DNN-systems gives them an edge in modeling an unfamiliar acoustic sequence. However, the differences are not very large in any comparison.
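The rare-word error computation can be sketched as a Levenshtein alignment followed by a count of rare reference words that were substituted or deleted. The code below is a minimal illustration under that reading of the procedure; `align`, `rare_word_error_rate`, and the toy vocabulary are hypothetical, and tie-breaking between equally cheap alignments is an arbitrary choice here.

```python
def align(ref, hyp):
    """Levenshtein alignment of two word lists as (op, ref_word, hyp_word) triples."""
    rows, cols = len(ref), len(hyp)
    cost = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        cost[i][0] = i
    for j in range(1, cols + 1):
        cost[0][j] = j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner, preferring the diagonal move.
    ops = []
    i, j = rows, cols
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "cor" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def rare_word_error_rate(pairs, train_vocab):
    """Share of reference words unseen in training that were substituted or deleted."""
    rare_total = rare_errors = 0
    for ref, hyp in pairs:
        for op, ref_word, _ in align(ref, hyp):
            if ref_word is not None and ref_word not in train_vocab:
                rare_total += 1
                if op in ("sub", "del"):
                    rare_errors += 1
    return rare_errors / rare_total if rare_total else 0.0


vocab = {"the", "sat"}
dropped = rare_word_error_rate([("the zyzzyva sat".split(), "the sat".split())], vocab)
correct = rare_word_error_rate([("the zyzzyva sat".split(), "the zyzzyva sat".split())], vocab)
```

In the toy example the unseen word is deleted in the first hypothesis and recognized in the second, so the two rates are 1.0 and 0.0 respectively.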
We looked into word error streaks: how often the systems have multiple consecutive edit operations. Fig. 1 plots the ratios of AED streaks to HMM streaks for streak lengths up to four. Longer streaks are too rare to yield meaningful data. With CRDNN encoders, the pattern is especially clear: AED appears to have relatively more long streaks than the transcript HMM/DNN counterpart. This pattern is also visible with Conformers (particularly on Librispeech Test Clean), although for Conformers the result is less significant since there are no results for the Finnish data. Additionally, the Conformer figures show the effect of MWER finetuning, which appears to decrease long streaks of errors. The pattern of AED-models having more long streaks than HMM/DNN-systems is not seen with wav2vec 2.0 encoders. We also looked into relative WER on the shorter and longer test set halves separately, but found it consistent across all speech recognition systems. On Finnish Parliament, the longer half had slightly higher WERs, whereas on Librispeech, the shorter half had higher WERs. We looked at largest wins (by number of edits) at the utterance level when comparing systems. This way we found some individual utterances which led to one system failing while the other succeeded, showing that the issue was not due to the utterance itself. In some cases the AED-model drops large portions of the utterance. We also find one case where the AED-model produces pathological repetitive output. Both are likely due to a failure of the attention mechanism.
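The error-streak counting described above can be sketched as follows. The code assumes each utterance has already been aligned against its reference and reduced to a sequence of operation labels, with "cor" marking a correct word and anything else an edit; the label names and both function names are hypothetical.

```python
from collections import Counter


def streak_lengths(ops):
    """Count runs of consecutive edit operations by run length."""
    counts = Counter()
    run = 0
    for op in ops:
        if op != "cor":
            run += 1
        else:
            if run:
                counts[run] += 1
            run = 0
    if run:  # close a streak that reaches the end of the utterance
        counts[run] += 1
    return counts


def streak_ratios(utterances_aed, utterances_hmm, max_len=4):
    """Ratio of AED streak counts to HMM streak counts for each streak length."""
    aed, hmm = Counter(), Counter()
    for ops in utterances_aed:
        aed.update(streak_lengths(ops))
    for ops in utterances_hmm:
        hmm.update(streak_lengths(ops))
    # None marks lengths that never occur for the HMM system (undefined ratio).
    return {n: (aed[n] / hmm[n] if hmm[n] else None) for n in range(1, max_len + 1)}


ratios = streak_ratios(
    [["sub", "sub", "cor", "del"]],
    [["sub", "cor", "del", "cor"]],
)
```

A ratio above one at a given length means the AED-model produced more streaks of that length than the HMM/DNN-system on the same test set.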
In the Combined Finnish Data results in Table XIII the HMM/DNN-system outperforms the AED-model. Both recognition families are able to handle multiple domains, and additionally, both are able to improve over the Transcript CRDNN systems trained on Finnish Parliament Train20. We find that limiting the language model data improves the language model perplexities on both Finnish Parliament and Lahjoita Puhetta, but in light of the WER results, the limiting is not helpful.

V. DISCUSSION
Our work highlights an open question: which speech recognition approach is the best one? We believe the Equal Data Setting and Matched Encoder Setting principles offer a compelling, fair alternative to competing for the state-of-the-art results. Even under the End-to-End Data limitation, and using the AED-model hyperparameters, the HMM/DNN-system consistently outperformed the AED-model in terms of WER in the CRDNN and wav2vec 2.0 experiments. In the Conformer experiments, with the hyperparameters tuned for the AED-model, and the HMM/DNN-system restricted to transcript language models and grapheme units, the AED-model did not surpass the HMM/DNN-system. One way to interpret these results is to see the HMM/DNN-system as a benchmark system: the results show that there is room for improvement in the AED-model. Another interpretation is that although research focuses more and more on End-to-End speech recognition approaches, it is worthwhile to apply the neural network innovations to HMM/DNN-systems as well.
Our observations emphasize the need for more strictly controlled comparisons of heterogeneous speech recognition systems in addition to competition for the state-of-the-art error rates. The thousand-hour scale is a particularly interesting ground for these comparisons, because no approach has proven conclusively better at that scale and because it is reachable in open datasets in many languages. Analysing our empirical results, we find that the systems with similar performance still fail on different utterances: each has its own weaknesses. Between different HMM/DNN-systems, the systems using GMM alignments consistently outperform Flat Start systems. It appears that frame-level Cross Entropy training with GMM alignments is still useful for producing the best results, though we note that contrary results have also been presented [61]. However, we find that some simplifications of the HMM/DNN-system are possible, at least at the thousand-hour scale. Tree-clustering for state-tying and phoneme-based units do not yield meaningful improvements in our experiments.

A. Limitations and Future Work
The experiments presented here leave some caveats regarding how the decoding-side implementations are matched. We believe something akin to a Matched Decoding Setting could be proposed in the future. This might match the language modeling context length, the use of neural language models in single-pass search and the language model capacity. Internal language model compensation (which matches the N-gram model probability replacement in neural language model rescoring) could also be a part of this.
Pure error rate performance is not the only relevant metric in choosing a speech recognition system. Our comparison does not consider for example the ability to deploy on mobile devices, the capability for online recognition (all encoder architectures use full utterance context in this work), nor the ease of development. We reported parameter counts, which matter for memory usage and model capacity, though in the latter area, neural network weights are not directly comparable with N-gram probabilities. If anything, the parameter counts probably favored the AED-models, which had in total more parameters than the corresponding HMM/DNN-system throughout all the experiments. We sidestepped the favourite child problem by optimizing the AED-model and applying the hyperparameters to the HMM/DNN-systems. This set an upper bound on the WER of the HMM/DNN-system. If the AED-model had outperformed the HMM/DNN-system, the conclusions would have been less clear. In that case, one solution could be to re-do the optimization in the other direction, applying the best HMM/DNN-system parameters to the AED-model.
Our practical experiments are naturally not able to cover all approaches. In particular, future work should include applying our proposed principles to comparisons involving Transducer models. Additionally, we concede that the manner in which we sidestepped the favourite child problem may lead to combinatorial amounts of work needed in comparisons involving more than two approaches. For example, had we attempted to include Transducers in this study, we would have had three pairs of approaches, with each pair potentially requiring their own set of hyperparameters and models.
Our analysis is able to show how models from different recognizer families make different errors, even though they may have similar performance. However, developing more advanced statistical methods would be a valuable contribution for the analysis of heterogeneous speech recognition systems.

VI. CONCLUSION
Choosing a speech recognition approach to use is currently difficult, both for deployment and for research, because there are many competing families of speech recognition approaches, each with their strengths and weaknesses.
We proposed two simple principles, and illustrated how those principles help to design more revealing speech recognition experiments. Experiments under the Equal Data Setting avoid confounding variables related to data, whereas experiments under the Matched Encoder Setting avoid confounding variables related to neural architecture and training. We demonstrated how to build AED-models and HMM/DNN-systems adhering to these principles. During the course of developing our HMM/DNN-systems, we made multiple discoveries. We presented the multihead decoding approach, and showed how GMM-alignments are still valuable for achieving the best results, though possibly not for tree-clustering.
We presented experiments on three thousand-hour-scale speech recognition tasks, comparing AED-models and HMM/DNN-systems. We optimized AED-models, reaching our HMM/DNN-system baselines from previous work. However, in comparisons under our proposed principles, our HMM/DNN-systems consistently yielded either equal or better error rates than AED-models. Our findings highlight the viability of HMM/DNN-systems in the era of End-to-End models.

TABLE I
HMM/DNN DEVELOPMENT RESULTS ON FINNISH PARLIAMENT (FP) TRAIN20 AND LIBRISPEECH, USING TRANSCRIPT LMS, ORDERED BY DECREASING WER

TABLE II
A DATA OVERVIEW: HOURS OF SPEECH, NUMBER OF SPEAKERS, AVERAGE UTTERANCE LENGTH, AND NUMBER OF WORDS IN THE TRANSCRIPTION, FOR EACH DATA SUBSET

TABLE V
REVISITING RESULTS FROM EARLIER WORK, WHICH USED THE EQUAL DATA SETTING, BUT NOT THE MATCHED ENCODER SETTING

TABLE VI
DEVELOPMENT RESULTS WITH MATCHED ENCODER SETTING
rate to 4.84% (a 23% relative improvement). The best AED-model surpassed the initial HMM/DNN-system, but with the changes in training hyperparameters, the HMM/DNN-system improved 4% relative, and still had the lowest error rate.

TABLE VII
MES-EDS COMPARISON EXPERIMENTS ON THE FINNISH PARLIAMENT TRAIN20 DATA

TABLE XVI
SELECTED PAIRWISE COMPARISONS ON THE TEST SETS, WITH (ITALIC) COMMENTS HIGHLIGHTING THE RESULTS INTERSPERSED
More sophisticated statistical study of this phenomenon is left as future work.

TABLE XVII
ERROR RATES ON WORDS THAT DID NOT APPEAR IN THE TRAINING TRANSCRIPTS