Momentum Pseudo-Labeling: Semi-Supervised ASR With Continuously Improving Pseudo-Labels

End-to-end automatic speech recognition (ASR) has become a popular alternative to traditional module-based systems, simplifying the model-building process with a single deep neural network architecture. However, the training of end-to-end ASR systems is generally data-hungry: a large amount of labeled data (speech-text pairs) is necessary to learn direct speech-to-text conversion effectively. To make the training less dependent on labeled data, pseudo-labeling, a semi-supervised learning approach, has been successfully introduced to end-to-end ASR, where a seed model is self-trained with pseudo-labels generated from unlabeled (speech-only) data. Here, we propose momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains an exponential moving average of the online model parameters. The interaction between the two models allows better ASR training on unlabeled data by continuously improving the quality of pseudo-labels. We apply MPL to a connectionist temporal classification-based model and evaluate it on various semi-supervised scenarios with varying amounts of data or domain mismatch. The results demonstrate that MPL significantly improves the seed model by stabilizing the training on unlabeled data. Moreover, we present additional techniques, e.g., the use of Conformer and an external language model, to further enhance MPL, which leads to better performance than other semi-supervised methods based on pseudo-labeling.


I. INTRODUCTION
The field of automatic speech recognition (ASR) has witnessed remarkable improvements in performance thanks to the advances in deep learning-based techniques [1], [2]. Much of the recent progress in ASR lies in the end-to-end framework [3]-[5], which directly models speech-to-text conversion using a single deep neural network. With well-established sequence-to-sequence modeling techniques [6]-[9] and sophisticated neural network architectures [10]-[12], end-to-end ASR has demonstrated promising results on various benchmarks [13]-[15]. However, the performance often depends on the availability of a large quantity of labeled data (speech-text pairs) [16], which requires great annotation costs and is not always achievable.
To alleviate the heavy requirement for labeled data, semi-supervised learning [17] has been attracting increasing attention for improving end-to-end ASR. Semi-supervised learning utilizes labeled data as well as unlabeled (or unpaired) data during model training, where the amount of labeled data is in general much smaller than that of unlabeled data. Some early works for semi-supervised end-to-end ASR are based on a reconstruction framework, including approaches based on a text-to-speech model [18]-[20] or a sequential auto-encoder [21]-[23]. Others adopted self-supervised pre-training techniques, such as BERT-like mask prediction [24]-[26], contrastive learning [27]-[29], and feature clustering [30], [31], to boost the performance of downstream ASR tasks.
We focus on self-training [32] or pseudo-labeling [33], which has recently been adopted for semi-supervised end-to-end ASR and shown to be effective [34]-[44]. In pseudo-labeling, a teacher model is first trained on labeled data and used to transcribe unlabeled (speech-only) data to obtain pseudo-labels. A student model is then trained using both the labeled and pseudo-labeled data to achieve better performance than the teacher. Assuming external text data is accessible, external language models (LMs) and beam-search decoding are often incorporated into the labeling process to generate higher-quality pseudo-labels [35], [38]. Data augmentation is also important for assisting a student model with training on pseudo-labels [36], [37], [40]. In addition to these techniques, ASR performance can be further improved by iterating the pseudo-labeling steps [39]-[43]. In [40], a model is continuously trained on pseudo-labels, which are generated on the fly by the model itself. Pseudo-labels are refined as the model learns, and the model benefits from training on the refined labels. However, we observed that this frequent update of pseudo-labels can easily cause unstable training, especially when there is a large amount of unlabeled data or domain mismatch between labeled and unlabeled data, which is likely to be the case in real-world scenarios.
In this paper, we present a semi-supervised learning framework for end-to-end ASR, referred to as momentum pseudo-labeling (MPL). In MPL, the pseudo-labels are iteratively updated based on an ensemble of models at different time steps within a single training process [45]. MPL consists of online and offline models that interact and learn from each other, similar to the teacher-student framework in the mean teacher method [46]. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains an exponential moving average of the weights of the online model, which can be regarded as an ensemble of the online models at different training steps. Through the interaction between the two models, MPL effectively stabilizes the training with unlabeled data and handles the constant change in pseudo-labels.
The contributions of this paper are summarized as follows:
- We propose MPL and show its advantages over other semi-supervised approaches based on pseudo-labeling.
- We present an effective way of controlling the MPL training, which reduces the burden of heuristic tuning.
- We evaluate MPL in various semi-supervised scenarios, demonstrating its robustness against variations in the amount of data, variations in domain mismatch severity, and over-fitting to LM knowledge.
- We perform thorough analyses to confirm the effectiveness of MPL and propose several methods to further improve ASR performance.

This paper summarizes our previous studies on MPL [47], [48] with the following extensions: we provide more detailed explanations of the relationship to prior work (Section II) and precise formulations of end-to-end ASR and pseudo-labeling (Section III); we present a consistent description of [47] and [48], with more specific implementations (Section IV); we conduct experiments on a variety of semi-supervised scenarios, including additional experiments on smaller and larger amounts of labeled data (Section V); and we further demonstrate the effectiveness of MPL through more detailed experimental results and discussions (Section VI).

II. RELATED WORK

A. Self-Ensembling for Semi-Supervised Learning
Self-ensembling [45] is a semi-supervised learning framework, where a target for an unlabeled sample is obtained by a consensus of predictions from models at different training steps or from different models. This prediction ensembling is expected to produce a more accurate pseudo-label than the most recent model prediction. Several approaches have been proposed to implement self-ensembling; we here refer to techniques based on an exponential moving average (EMA). Temporal ensembling [45] maintains an EMA of label predictions from different models, which is used as a target for model training at the current step. The mean teacher method [46] improves temporal ensembling by calculating an EMA of model weights and generating a pseudo-label using the averaged model. This can avoid sudden changes in pseudo-labels and enable the model to learn from unlabeled samples stably. The concept of EMA-based weight averaging has also been shown to be effective for stabilizing self-supervised representation learning [49], [50]. MPL is inspired by and similar to the mean teacher framework. However, we differentiate MPL from prior work in the following respects. 1) MPL is a semi-supervised learning framework for end-to-end ASR: while most previous studies focus on classification problems (e.g., image classification), few have introduced self-ensembling techniques to sequence-to-sequence mapping objectives, here connectionist temporal classification (CTC) [6]. 2) MPL uses hard (pseudo-)labels for training with unlabeled data: while soft labels generally contain richer information for promoting model training [51], applying a distillation loss to CTC-based ASR systems is known to be problematic [52]; as CTC models emit spiky posterior distributions and their predictions are naturally high-confidence, we consider hard labels more suitable for MPL. 3) MPL applies data augmentation (i.e., SpecAugment [55]) to the input only for training the online model, while the offline model generates pseudo-labels in inference mode: since we do not use soft labels in MPL, it is preferable for pseudo-labels to be accurate; moreover, the online model can learn to robustly predict pseudo-labels from noisy input, an effective approach known as consistency training [36], [40], [56].

B. Pseudo-Labeling With Multiple Iterations
A simple extension for enhancing the pseudo-labeling-based method is to conduct multiple rounds of the pseudo-label generation and model training processes, which has demonstrated promising results in various fields [57], [58], including end-to-end ASR [39]-[43]. Iterative pseudo-labeling (IPL) [39], an iterative version of [35], continuously trains a single ASR model with periodic regeneration of pseudo-labels. Here, the labeling process is performed via beam-search decoding with an external LM, which makes the pseudo-labels biased toward the LM training texts and the model over-fit to the LM knowledge [42], [43]. In [40], an ASR model is trained on pseudo-labels generated without using an LM, where the pseudo-labels are updated on the fly after every training iteration. However, this frequent relabeling is likely to make pseudo-labels unstable and thus cause the model training to diverge. slimIPL [42] mitigates this problem by introducing a dynamic cache mechanism, which stores and uses pseudo-labels generated from previous model states instead of regenerating them with the most recent model at every iteration.
MPL is another direction for improving pseudo-labeling with multiple iterations, and it can be considered a general framework extending [35] and [40] (see Section IV-B). In each iteration of MPL training, pseudo-labels are generated on the fly from the offline model without an LM and used as targets to train the online model. The offline model maintains an EMA of the online model weights to stabilize the pseudo-labels. This can be seen as an alternative caching mechanism to [42] for exploiting older models. A similar approach to MPL was proposed in [59], which focused on lower-resource settings and conducted experiments on a hybrid ASR system in addition to a CTC-based end-to-end system. This paper thoroughly investigates the robustness of MPL against variations in domain mismatch severity and over-fitting to LM knowledge.

III. BACKGROUND
In this section, we review the formulation of CTC-based end-to-end ASR [3] and semi-supervised ASR based on pseudo-labeling [35]. Let $X = (x_t \in \mathbb{R}^D \mid t = 1, \ldots, T)$ be an input sequence of length $T$, and $Y = (y_l \in \mathcal{V} \mid l = 1, \ldots, L)$ be the corresponding output sequence of length $L$. Here, $x_t$ is a $D$-dimensional acoustic feature at frame $t$, $y_l$ is an output token at position $l$, and $\mathcal{V}$ is a vocabulary. Note that, in general, the output length is much shorter than the input length (i.e., $L \ll T$).
A. Network Architecture

1) Transformer Encoder: For converting an audio sequence $X$ into a sequence of hidden representations, we build a Transformer-based encoder model [60] consisting of a stack of $N_{\mathrm{enc}}$ identical blocks:

$$H = \mathrm{TransformerEncoder}(X), \tag{1}$$

where $H = (h_t \in \mathbb{R}^{d_{\mathrm{model}}} \mid t = 1, \ldots, T')$, and $h_t$ is a $d_{\mathrm{model}}$-dimensional hidden representation at index $t$.
Given an audio sequence $X$, the encoder first applies a 2D convolution $\mathrm{Conv2D}(\cdot)$ to down-sample the input sequence length from $T$ to $T'$ $(= T/4)$ [61]. Positional encodings $\mathrm{PosEnc}(\cdot)$ are then added to each frame of the down-sampled sequence, which results in an initial hidden sequence:

$$H^{(0)} = \mathrm{Conv2D}(X) + \mathrm{PosEnc}(\mathrm{Conv2D}(X)). \tag{2}$$

The $i$-th encoder block outputs a sequence of hidden representations

$$H^{(i)} = \mathrm{EncoderBlock}^{(i)}(H^{(i-1)}), \tag{3}$$

which is computed as

$$\bar{H}^{(i)} = H^{(i-1)} + \mathrm{SelfAttention}(\mathrm{LayerNorm}(H^{(i-1)})), \tag{4}$$
$$H^{(i)} = \bar{H}^{(i)} + \mathrm{FeedForward}(\mathrm{LayerNorm}(\bar{H}^{(i)})), \tag{5}$$

where $i \in \{1, \ldots, N_{\mathrm{enc}}\}$, and $\mathrm{LayerNorm}(\cdot)$, $\mathrm{SelfAttention}(\cdot)$, and $\mathrm{FeedForward}(\cdot)$ indicate layer normalization, multi-head self-attention, and a feed-forward network, respectively. The final sequence $H$ is obtained by normalizing the output of the last encoder block, i.e., $H = \mathrm{LayerNorm}(H^{(N_{\mathrm{enc}})})$.
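To make the block structure in (4) and (5) concrete, the following is a minimal PyTorch sketch of a single pre-LN encoder block; the class and argument names are ours, and details such as dropout placement are assumptions rather than the exact ESPnet implementation.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Minimal pre-LN encoder block following (4) and (5)."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm_attn = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.norm_ff = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, h):  # h: (batch, T', d_model)
        # (4): residual multi-head self-attention over the normalized input
        x = self.norm_attn(h)
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        h = h + self.dropout(attn_out)
        # (5): residual position-wise feed-forward network
        x = self.norm_ff(h)
        return h + self.dropout(self.feed_forward(x))
```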
2) Conformer Encoder: Besides the Transformer encoder, we construct a model based on the Conformer encoder [12], consisting of a stack of $N_{\mathrm{enc}}$ identical blocks:

$$H = \mathrm{ConformerEncoder}(X). \tag{6}$$

Conformer is a variant of the Transformer encoder architecture augmented with convolution to increase the capability of capturing local feature patterns [12], and it has been shown to be more effective than standard Transformers on various speech processing tasks [62].
The computation in each Conformer encoder block can be defined by modifying the encoder steps in Transformer, where the self-attention step (4) and the feed-forward step (5) are replaced with

$$\tilde{H}^{(i)} = H^{(i-1)} + \tfrac{1}{2}\,\mathrm{FeedForward}(\mathrm{LayerNorm}(H^{(i-1)})), \tag{7}$$
$$\bar{H}^{(i)} = \tilde{H}^{(i)} + \mathrm{SelfAttention}(\mathrm{LayerNorm}(\tilde{H}^{(i)})), \tag{8}$$
$$\dot{H}^{(i)} = \bar{H}^{(i)} + \mathrm{Conv}(\mathrm{LayerNorm}(\bar{H}^{(i)})), \tag{9}$$
$$\ddot{H}^{(i)} = \dot{H}^{(i)} + \tfrac{1}{2}\,\mathrm{FeedForward}(\mathrm{LayerNorm}(\dot{H}^{(i)})), \tag{10}$$
$$H^{(i)} = \mathrm{LayerNorm}(\ddot{H}^{(i)}). \tag{11}$$

In addition to the self-attention layer, Conformer introduces a module $\mathrm{Conv}(\cdot)$ based on depthwise separable convolution [63].
The convolution module consists of a point-wise convolution, a gated linear unit activation, a 1D depth-wise convolution, batch normalization, a Swish activation, and another point-wise convolution. Unlike Transformer, each Conformer block adopts relative positional encoding [64] for the self-attention layer, which increases the robustness of the model to different input lengths. Moreover, Conformer employs the Macaron Net-style structure [65], where the original feed-forward layer (5) is replaced with two half-step feed-forward layers ((7) and (10)).

B. Connectionist Temporal Classification (CTC)
CTC [3], [6] optimizes end-to-end ASR by training a model to find monotonic alignments between an input sequence $X$ and a target sequence $Y$. To align the sequences at the frame level, $Y$ is augmented by adding a blank token $\varnothing$ and allowing repetitions of the same token, which results in a CTC alignment $Z = (z_t \in \mathcal{V} \cup \{\varnothing\} \mid t = 1, \ldots, T')$. Assuming the conditional independence of frame-wise token predictions, CTC models the probability $P(Z|X)$ as the product of token emission probabilities:

$$P(Z|X) = \prod_{t=1}^{T'} P(z_t|X), \tag{12}$$

where $P(z_t|X)$ is a probability distribution over the tokens and is obtained by applying a linear projection layer and a softmax layer to the encoded sequence $H$ from (1) or (6). For a given target sequence, there exist several possible alignments, depending on the positions of the blank tokens and the number of repeated tokens. Let $\mathcal{B}$ be a collapsing function that maps a CTC alignment $Z$ to a target sequence $Y$ by suppressing repeated tokens and removing blank tokens. With the collapsing function, CTC calculates the probability $P(Y|X)$ by marginalizing (12) over CTC alignments as

$$P(Y|X) = \sum_{Z \in \mathcal{B}^{-1}(Y)} P(Z|X), \tag{13}$$

where the inverse function $\mathcal{B}^{-1}(Y)$ returns the set of CTC alignments that are compatible with $Y$. While (13) has to deal with all possible $Z$, it is efficiently computed via dynamic programming (e.g., the forward-backward algorithm) [6]. Given a pair of input and output sequences $(X, Y)$, a model is trained to minimize the CTC loss defined as the negative log-likelihood of (13):

$$\mathcal{L}_{\mathrm{ctc}}(X, Y) = -\log P(Y|X). \tag{14}$$
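As a concrete illustration of (12)-(14), the following is a minimal PyTorch sketch that projects the encoder output to token posteriors and scores them with torch.nn.CTCLoss; the variable names and the choice of blank index 0 are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the CTC loss in (14): the encoder output H is projected
# to the vocabulary (plus blank) and scored against the reference tokens.
vocab_size, blank_id, d_model = 1000, 0, 256
ctc_proj = nn.Linear(d_model, vocab_size + 1)       # token + blank logits
ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)

def compute_ctc_loss(H, H_lens, Y, Y_lens):
    """H: (batch, T', d_model) encoder output; Y: (batch, L_max) target ids
    (assumed not to contain the blank index); *_lens: valid lengths."""
    log_probs = ctc_proj(H).log_softmax(dim=-1)      # P(z_t | X) in (12)
    # nn.CTCLoss expects (T', batch, vocab) log-probabilities
    return ctc_loss(log_probs.transpose(0, 1), Y, H_lens, Y_lens)
```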

C. Semi-Supervised ASR With Pseudo-Labeling
The goal of semi-supervised ASR, in this work, is to exploit a large amount of unlabeled (audio-only) data to enhance a pretrained ASR model in a self-training manner. To this end, we focus on an approach based on pseudo-labeling [33], [35], which is described in two steps: 1) the supervised training of a seed end-to-end ASR model, and 2) the semi-supervised training of the seed model using unlabeled data.
1) Supervised Training of a Seed Model: A seed model $P_\theta$ with parameters $\theta$ is first trained on labeled data $\mathcal{D}_{\mathrm{lab}} = \{(X_n, Y_n) \mid n = 1, \ldots, N\}$, using the CTC loss from (14):

$$\mathcal{L}_{\mathrm{lab}}(\theta) = \sum_{n=1}^{N} \mathcal{L}_{\mathrm{ctc}}(\mathcal{A}(X_n), Y_n; \theta), \tag{15}$$

where $\mathcal{A}(\cdot)$ denotes a data augmentation applied to improve the generalization of the model, here SpecAugment [55].

2) Semi-Supervised Training With Pseudo-Labels:
The seed model is then used to generate pseudo-labels for unlabeled data $\mathcal{D}_{\mathrm{unlab}} = \{X_{N+m} \mid m = 1, \ldots, M\}$. For each unlabeled sample, pseudo-labels $\hat{Y}_{N+m}$ are generated as

$$\hat{Y}_{N+m} = \operatorname*{argmax}_{Y} \left[ \log P_\theta(Y|X_{N+m}) + \gamma \log P_{\mathrm{lm}}(Y) \right], \tag{16}$$

where the argmax is performed using an external LM $P_{\mathrm{lm}}$ and beam-search decoding, and $\gamma$ is the LM weight. With the pseudo-labels from (16), the loss for the unlabeled data is calculated in the same manner as (15):

$$\mathcal{L}_{\mathrm{unlab}}(\theta) = \sum_{m=1}^{M} \mathcal{L}_{\mathrm{ctc}}(\mathcal{A}(X_{N+m}), \hat{Y}_{N+m}; \theta). \tag{17}$$

Finally, using both $\mathcal{D}_{\mathrm{lab}}$ and $\mathcal{D}_{\mathrm{unlab}}$, the seed model is further trained on the combined objective of $\mathcal{L}_{\mathrm{lab}}$ and $\mathcal{L}_{\mathrm{unlab}}$.

IV. MOMENTUM PSEUDO-LABELING
In this section, we explain our momentum pseudo-labeling (MPL) for semi-supervised ASR [47], [48]. The overall process of MPL is shown in Algorithm 1, which trains a pair of online and offline models that interact and learn from each other. Let us define the online and offline models as P ξ and P φ , with model parameters ξ and φ, respectively. Both models are initialized with the seed model P θ trained as in Section III-C1.

Algorithm 1: Momentum Pseudo-Labeling.
Input: D_lab, D_unlab: labeled and unlabeled data; A: an ASR model architecture; α: a momentum coefficient
1: Train a seed model P_θ with architecture A on D_lab using (15)
2: Initialize an online model P_ξ and an offline model P_φ with P_θ
3: for epoch = 1, . . . , E do
4:   for all mini-batches S ⊂ D_lab ∪ D_unlab do
5:     Compute loss L for P_ξ(Y|X) as in (15) on labeled samples in S
6:     Generate pseudo-labels Ŷ with P_φ for unlabeled samples in S using (18)
7:     Compute loss L for P_ξ(Ŷ|X) as in (19) on unlabeled samples in S
8:     Update ξ using ∇_ξ L
9:     Update φ ← αφ + (1 − α)ξ
10:   end for
11: end for
12: return P_ξ    ▷ the online model is returned for final evaluation

A. Online Model Training
On an unlabeled sample $X_{N+m} \in \mathcal{D}_{\mathrm{unlab}}$, the online model is trained using pseudo-labels $\hat{Y}_{N+m}$ generated on the fly by the offline model:

$$\hat{Y}_{N+m} = \operatorname*{argmax}_{Y} P_\phi(Y|X_{N+m}), \tag{18}$$

where the argmax is performed based on the best path decoding of CTC [6]. Specifically, the most probable tokens $\hat{Z}$ are selected at each frame, and an output sequence is obtained using the collapsing function, i.e., $\hat{Y} = \mathcal{B}(\hat{Z})$. Note that (18) differs from (16) in that we use neither an LM nor beam-search decoding. With unlabeled data $X_{N+m} \in \mathcal{D}_{\mathrm{unlab}}$ and the corresponding pseudo-labels from (18), the semi-supervised loss of the online model is defined in the same manner as (17):

$$\mathcal{L}_{\mathrm{unlab}}(\xi) = \sum_{m=1}^{M} \mathcal{L}_{\mathrm{ctc}}(\mathcal{A}(X_{N+m}), \hat{Y}_{N+m}; \xi), \tag{19}$$

which is minimized via gradient descent optimization. In (19), we apply data augmentation to the unlabeled input, aiming to provide the online model with training signals to learn robustly from noisy input [36], [40]. In Section VI-D, we show that data augmentation is an important factor of MPL. Assuming labeled data $\mathcal{D}_{\mathrm{lab}}$ is available during the semi-supervised process, we also use the supervised loss $\mathcal{L}_{\mathrm{lab}}(\xi)$ calculated as in (15), which helps stabilize the online model training as it learns from unlabeled data. Using $\mathcal{D}_{\mathrm{lab}}$ and $\mathcal{D}_{\mathrm{unlab}}$, the online model is trained with the combined objective of $\mathcal{L}_{\mathrm{lab}}(\xi)$ and $\mathcal{L}_{\mathrm{unlab}}(\xi)$. Note that, as demonstrated in Section VI-B, MPL remains stable and effective even when trained solely on unlabeled data, i.e., with $\mathcal{L}_{\mathrm{unlab}}(\xi)$ only.
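The label generation in (18) reduces to a frame-wise argmax followed by the collapsing function B. The following is a minimal Python sketch of this best path decoding for a single utterance; the function name and tensor layout are ours.

```python
import torch

def ctc_greedy_decode(log_probs, blank_id=0):
    """Best path decoding of CTC for (18): pick the most probable token at
    each frame and apply the collapsing function B (merge repeats, drop blanks).
    log_probs: (T', vocab+1) frame-wise log-probabilities from the offline model.
    """
    z_hat = log_probs.argmax(dim=-1).tolist()    # most probable CTC alignment Z
    y_hat, prev = [], None
    for token in z_hat:
        if token != prev and token != blank_id:  # suppress repeats, remove blanks
            y_hat.append(token)
        prev = token
    return y_hat                                 # pseudo-label token sequence Y
```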

B. Offline Model Training
After every update of the online model, the offline model accumulates the parameters of the online model via an exponential moving average with a momentum coefficient $\alpha \in [0, 1]$:

$$\phi \leftarrow \alpha\phi + (1 - \alpha)\xi. \tag{20}$$

This momentum update makes the offline model evolve more smoothly than the online model. We can thus control the change in the pseudo-labels generated on the fly by the offline model at each training step. This is important to prevent pseudo-labels from deviating too quickly from the initial labels generated by the seed model and to avoid collapsing to a trivial solution. Indeed, we empirically observe that training is prone to collapse (emitting only blank tokens for unlabeled data) for $\alpha = 0.0$, in which case the online and offline models share parameters and the online model is trained with self-generated pseudo-labels as in [40]. The problem is prominent when there is a domain mismatch between labeled and unlabeled data, as is often the case in real-world deployment. At the other end of the spectrum, when $\alpha = 1.0$, the approach becomes similar to standard pseudo-labeling [35] as described in Section III-C, where the offline model is never updated and the online model is trained on fixed pseudo-labels generated from the seed model. This can stabilize the semi-supervised training, at the cost of leaving no room for improving pseudo-labels and thus limiting the improvement of the online model. We demonstrate the effectiveness of the momentum update in Section VI-B. After training with MPL, either the online or the offline model can be used for evaluation; we use the online model as our default. We compare and analyze the performance of the two models in Section VI-C.
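The momentum update in (20) is applied parameter-wise after each optimizer step. A minimal PyTorch sketch is given below, assuming the online and offline models share the same architecture; handling of non-trainable buffers (e.g., normalization statistics) is left out and would need a similar treatment.

```python
import torch

@torch.no_grad()
def momentum_update(offline_model, online_model, alpha):
    """Offline update in (20): phi <- alpha * phi + (1 - alpha) * xi."""
    for phi, xi in zip(offline_model.parameters(), online_model.parameters()):
        phi.mul_(alpha).add_(xi, alpha=1.0 - alpha)
```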

C. Tuning the Momentum Coefficient
Instead of directly tuning $\alpha$ in (20), we design a more intuitive method for deriving an appropriate value of $\alpha$. Based on (20), the parameters of the offline model after $K$ iterations can be written as

$$\phi^{(K)} = \alpha^K \phi^{(0)} + (1 - \alpha) \sum_{k=1}^{K} \alpha^{K-k} \xi^{(k)}, \tag{21}$$

where $\phi^{(k)}$ and $\xi^{(k)}$ denote the parameters of each model at the $k$-th iteration, and $\phi^{(0)} = \xi^{(0)} = \theta$. We here assume that it is important to retain some influence of the seed model to stabilize the pseudo-label generation. As a measure of this influence, we focus on the term $\alpha^K \phi^{(0)}$ in (21) and define a weight $w$ of the seed model in $\phi^{(K)}$ as

$$w = \alpha^K, \tag{22}$$

where we consider $K$ as the number of iterations (i.e., mini-batches) in a training epoch. As $K$ can often be in the thousands, small changes in $\alpha$ lead to huge differences in $w$ (e.g., $0.999^{3000} \ll 0.9997^{3000}$), requiring small adjustments of $\alpha$ for different amounts of training data. Instead of directly tuning $\alpha$ for the momentum update, we propose to tune the weight $w$, which can be regarded as the proportion of the seed model parameters retained after a training epoch. Given $w$ and $K$, $\alpha$ is calculated as

$$\alpha = w^{1/K}. \tag{23}$$

By controlling the update through $w$, we expect MPL to be less affected by the amount of training data, which we examine in Section VI-B.
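A small sketch of (22) and (23): given the desired seed-model weight w and the number of iterations K per epoch, the momentum coefficient follows directly; the numbers in the comments illustrate why tuning α directly is brittle when K is large.

```python
def momentum_from_weight(w, K):
    """Derive the momentum coefficient alpha from the seed-model weight w
    retained after one epoch of K iterations, following (22) and (23):
    w = alpha ** K  =>  alpha = w ** (1 / K)."""
    return w ** (1.0 / K)

# Tiny changes in alpha translate into very different w when K is large:
# 0.999 ** 3000 ~= 0.05 while 0.9997 ** 3000 ~= 0.41, so tuning w directly
# (e.g., w = 0.5 as used in the experiments) is more intuitive.
alpha = momentum_from_weight(w=0.5, K=3000)   # ~= 0.99977
```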

D. Adopting Conformer for MPL Training
To further enhance the MPL performance, we investigate utilizing the Conformer-based architecture [12], which is expected to improve overall ASR performance and thus enable the model to generate more accurate pseudo-labels. While Conformer-based models have achieved outstanding ASR performance compared with standard Transformers [62], we empirically observe that Conformer suffers from poor generalization from labeled data to unlabeled data. A similar issue has been reported in other ASR tasks [68]-[70]. Simply adopting Conformer for MPL makes the training unstable and prone to divergence, especially when a domain mismatch exists between labeled and unlabeled data.
We assume that this problem comes from unreliable statistics computed and used by batch normalization (BN) [71] in the convolution module (in (9)). As the seed model is first trained on labeled data only, the mean and variance estimated in BN are fitted to the statistics of $\mathcal{D}_{\mathrm{lab}}$. When the model is then further trained on the combination of $\mathcal{D}_{\mathrm{lab}}$ and $\mathcal{D}_{\mathrm{unlab}}$ via MPL, the data variation among mini-batches becomes large, which makes BN unstable [72]. Hence, we replace BN with group normalization (GN) [73] in the convolution module, as it has been shown effective for Conformer-based streaming ASR [68]. GN divides feature maps into groups and normalizes the features within each group, which makes the training less dependent on the variations across mini-batches. This is found to be critical for stabilizing the Conformer-based MPL training, as carefully investigated in Section VI-E1.
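As an illustration, the following is a minimal PyTorch sketch of the Conformer convolution module in (9) with BN replaced by GN (8 groups, as in our configuration); the module layout follows the description in Section III-A2, while the class name and tensor layout are assumptions rather than the exact ESPnet implementation.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """Convolution module of (9) with GroupNorm in place of BatchNorm."""
    def __init__(self, d_model=256, kernel_size=31, num_groups=8):
        super().__init__()
        self.pointwise_in = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        # GroupNorm normalizes within each sample, so its statistics do not
        # depend on how labeled and unlabeled utterances are mixed in a batch.
        self.norm = nn.GroupNorm(num_groups, d_model)
        self.activation = nn.SiLU()              # Swish
        self.pointwise_out = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, h):                        # h: (batch, T', d_model)
        x = h.transpose(1, 2)                    # -> (batch, d_model, T')
        x = self.glu(self.pointwise_in(x))       # point-wise conv + GLU
        x = self.activation(self.norm(self.depthwise(x)))
        x = self.pointwise_out(x)
        return x.transpose(1, 2)                 # back to (batch, T', d_model)
```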

E. Exploiting LM Knowledge for MPL Training
While using an external LM and beam-search decoding has been shown to be effective for generating high-quality pseudo-labels [35], [38], [40], it is too computationally intensive to be adopted for MPL due to the on-the-fly label generation. To mitigate this limitation, we consider performing the standard pseudo-labeling (PL) training (as described in Section III-C2) prior to MPL. With this combination of PL and MPL, LM knowledge is implicitly transferred to the seed model, providing the MPL training with a better initialization for generating higher-quality pseudo-labels. Moreover, by avoiding explicit LM usage during the MPL training, we can prevent the ASR model from over-fitting to the LM training text data, which often degrades the generalization capability of the model [42], [43]. In addition to PL, we investigate iterative pseudo-labeling (IPL) [39], which extends PL by continuously training a model with periodic regeneration of pseudo-labels.

Algorithm 2: MPL With PL/IPL-Based Initialization.
Input: D_lab, D_unlab: labeled and unlabeled data; A: an ASR model architecture; α: a momentum coefficient; P_lm: an external LM; γ: an LM weight
1: Train a seed model P_θ with architecture A on D_lab using (15)
2: # (Iterative) pseudo-labeling
3: for i = 1, . . . , I_ipl do
4:   Generate pseudo-labels for D_unlab with P_θ, P_lm, and beam-search decoding using (16)
5:   for epoch = 1, . . . , E_ipl do
6:     for all (X, Y) ∈ D_lab ∪ D_unlab do
7:       Compute loss L for P_θ(Y|X) with Eq. (15)
8:       Update θ using ∇_θ L
9:     end for
10:   end for
11: end for
12: # Momentum pseudo-labeling
13: Initialize an online model P_ξ and an offline model P_φ with P_θ
14: for epoch = 1, . . . , E_mpl do
15:   for all mini-batches S ⊂ D_lab ∪ D_unlab do
16:     Compute loss L for P_ξ(Y|X) as in Eq. (15) on labeled samples in S
17:     Generate pseudo-labels Ŷ with P_φ for unlabeled samples in S using (18)
18:     Compute loss L for P_ξ(Ŷ|X) as in Eq. (19) on unlabeled samples in S
19:     Update ξ using ∇_ξ L
20:     Update φ ← αφ + (1 − α)ξ
21:   end for
22: end for
23: return P_ξ    ▷ the online model is returned for final evaluation

Algorithm 2 shows the proposed MPL training with the initialization strategy based on PL or IPL. In the beginning, a seed model is trained using the labeled set as in Section III-C1 (line 1). Then, the seed model is further trained via PL or IPL with the LM and beam-search decoding (lines 3-11). Here, we denote I_ipl as the number of iterations (pseudo-label updates) and E_ipl as the number of epochs trained in each iteration. Note that this process becomes PL [35] when I_ipl = 1 and IPL [39] when I_ipl > 1. Finally, the enhanced seed model is used to initialize the online and offline models for MPL (lines 13-22). The MPL training lasts E_mpl epochs.

V. EXPERIMENTAL SETUP

A. Data
The experiments were carried out using the LibriSpeech (LS) [74] and TEDLIUM3 (TED3) [75] datasets. LS is a corpus of read English speech, containing 960 hours of training data (split into train-clean-100, train-clean-360, and train-other-500). We also used the 10-hour Libri-Light (LL) training data (train-10h) [16], which is a low-resource subset extracted from the LS training data. TED3 is a corpus of English TED Talks consisting of 450 hours of training data (train-ted3). For each dataset, we used the standard validation and test sets for tuning hyper-parameters and evaluating performance, respectively. As input speech features, we extracted 80 mel-scale filterbank coefficients with three-dimensional pitch features using Kaldi [76]. For text tokenization, we used SentencePiece [77] to construct a 1k subword vocabulary, which was extracted from either the train-clean-100 or train-10h transcriptions, depending on the semi-supervised setting. We also considered two settings with more labeled data, using LS-100 and LS-360 (LS-460) for training a seed model.

C. Model Architecture
As an ASR model, we trained a Transformer [60]- or Conformer [12]-based encoder architecture described in Section III-A1 or III-A2, implemented in ESPnet [78]. The model consisted of the Conv2D layer (in (2)) followed by a stack of 12 encoder blocks ($N_{\mathrm{enc}} = 12$). The Conv2D layer down-samples the input length by a factor of 4, using two 2D convolution layers with 256 channels, a kernel size of $3 \times 3$, and a stride of 2. In the multi-head self-attention module (in (4) and (8)), the number of heads $d_h$ and the dimension of the self-attention layer $d_{\mathrm{model}}$ were set to 4 and 256, respectively. The dimension of the feed-forward network $d_{\mathrm{ff}}$ (in (5), (7), and (10)) was set to 2048. For the convolution module of Conformer (in (9)), we used a kernel size of 31. The number of groups was set to 8 for group normalization when it was used as a replacement for batch normalization in the convolution module.
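For reference, the following is a minimal PyTorch sketch of the Conv2D front-end described above (two 2D convolutions with 256 channels, 3x3 kernels, and stride 2), where the input dimension of 83 corresponds to the 80 filterbank coefficients plus 3 pitch features; the final linear projection and exact padding are assumptions rather than the exact ESPnet implementation.

```python
import torch
import torch.nn as nn

class Conv2dSubsampling(nn.Module):
    """Two stride-2 convolutions that down-sample T to T' ~= T / 4."""
    def __init__(self, input_dim=83, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2), nn.ReLU())
        # Flatten the channel and reduced-frequency axes into d_model features.
        freq = ((input_dim - 1) // 2 - 1) // 2
        self.proj = nn.Linear(d_model * freq, d_model)

    def forward(self, x):                        # x: (batch, T, input_dim)
        x = self.conv(x.unsqueeze(1))            # -> (batch, C, T', F')
        b, c, t, f = x.size()
        return self.proj(x.transpose(1, 2).reshape(b, t, c * f))
```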

D. Training Configuration
The seed model was trained for 150 epochs using the Adam optimizer [79] with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. We used Noam learning rate scheduling [60] with 25k warmup steps and a learning rate factor of 5.0. The semi-supervised training lasted up to 200 epochs, where the gradient-based optimization was done using the Adam optimizer with an initial learning rate of $10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. IPL was performed by iterating PL for a maximum of 8 times ($I_{\mathrm{ipl}} \leq 8$), where the model was trained for 25 epochs ($E_{\mathrm{ipl}} = 25$) in each iteration. After each iteration of IPL, we averaged model parameters over the last 5 checkpoints to stabilize the pseudo-label generation. For the momentum update of MPL (in (20)), we used $w = 0.5$ for all the semi-supervised settings, and Table I lists the actual value of the momentum coefficient $\alpha$ used in each setting. For training all models, we used SpecAugment [55] for data augmentation.
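The Noam schedule [60] used above can be written as a small function of the training step; the sketch below assumes it is wired to the optimizer via a multiplicative LambdaLR-style wrapper, which is one of several equivalent implementations.

```python
def noam_lr(step, d_model=256, warmup=25000, factor=5.0):
    """Noam schedule [60]: linear warmup followed by inverse square-root decay."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Example wiring (sketch): with a base lr of 1.0, LambdaLR scales it by noam_lr.
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
#                              betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lr_lambda=lambda step: noam_lr(step + 1))
```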

E. Decoding Configuration
For evaluation, a final model was obtained by averaging model parameters over 10 checkpoints with the best validation performance [60]. We trained an LM consisting of 4 long short-term memory (LSTM) layers with 2048 units, using the LS-100 or LL-10 transcriptions combined with the external text data provided by LibriSpeech [74]. For decoding with the LM, we adopted a frame-synchronous CTC prefix beam search algorithm [80], [81], where we used a beam-size of 20, a score-based pruning threshold of 14.0, an LM weight of 1.0, and an insertion bonus factor of 2.0. For decoding without the LM, we performed the best path decoding of CTC [6].

F. Evaluation Metrics
We used the word error rate (WER) to measure ASR performance. For evaluating the performance of semi-supervised training, we measured the WER recovery rate (WRR) [35], [82]. WRR compares the WERs of the oracle model (trained using ground-truth transcriptions for the unlabeled data as well) and the semi-supervised model by calculating the ratio between their absolute reductions from the seed model's WER:

$$\mathrm{WRR} = \frac{\mathrm{WER}_{\mathrm{seed}} - \mathrm{WER}_{\mathrm{semi}}}{\mathrm{WER}_{\mathrm{seed}} - \mathrm{WER}_{\mathrm{oracle}}}, \tag{24}$$

where $\mathrm{WER}_{*}$ denotes the WER of each model.
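A small sketch of the WRR computation in (24); the function name and the example numbers are ours.

```python
def wer_recovery_rate(wer_seed, wer_semi, wer_oracle):
    """WRR in (24): the fraction of the seed-to-oracle WER gap that the
    semi-supervised model recovers (usually reported as a percentage)."""
    return (wer_seed - wer_semi) / (wer_seed - wer_oracle)

# e.g., seed at 20.0% WER, oracle at 10.0%, semi-supervised model at 12.0%
# -> WRR = (20 - 12) / (20 - 10) = 0.8, i.e., 80%.
assert abs(wer_recovery_rate(20.0, 12.0, 10.0) - 0.8) < 1e-9
```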

VI. RESULTS AND DISCUSSION
In this section, we report and discuss results obtained from our semi-supervised ASR experiments. First, to verify the effectiveness of the proposed MPL, we perform some basic analyses based on the Transformer-based models. Then, we show results for further improving MPL, using the Conformer architecture and additional training/decoding techniques.

A. MPL Results Using Transformer

1) In-Domain Settings: Table II shows results on the in-domain LS settings, comparing PL [35] (from Section III-C) and the proposed MPL. Note that the MPL results also appear in our previous paper [48]. The oracle results were obtained by fine-tuning the seed model using ground-truth labels for both the labeled source and target training sets. Looking at the results in the LS-360 setting (A*), both PL and MPL led to a significant improvement over the seed model (A1, A2 vs. S1). When decoded without the external LM, PL resulted in better performance than MPL (A1 vs. A2), benefiting from the high-quality pseudo-labels generated using the LM. However, when decoded with the LM, the performance gain was larger for MPL, achieving much lower WERs on the other sets with similar WRRs to those obtained by decoding without the LM. PL, in contrast, showed smaller improvements with degraded WRRs, which indicates that PL is likely to over-fit to the LM knowledge, as reported in [42], [43], and to have less variation in the hypotheses during beam-search decoding.
In the LS-860 setting (B*) with more unlabeled data, MPL again outperformed the seed model with the same range of WRR as was observed in the LS-360 setting (A2 vs. B2), demonstrating its scalability with respect to the amount of unlabeled data. While PL was also effective in this setting, the gain from the seed model was smaller than that obtained from the LS-360 setting (A1 vs. B1); on the other sets, especially, the WRRs of PL dropped by an absolute difference of over 10%. MPL was capable of keeping high WRRs on the other sets, successfully increasing the generalization ability of the model. MPL greatly benefited from decoding with the external LM and achieved better results than those obtained with PL.
2) Out-of-Domain Setting: Table III lists results on the out-of-domain TED3 setting. Note that the MPL results also appear in our previous paper [48]. PL resulted in a modest improvement over the seed model, while the gain was more substantial for MPL (C1 vs. C2). As there is a domain mismatch between the LM training text and the actual TED3 transcriptions, PL was less effective at learning from the out-of-domain unlabeled data. Moreover, the LM decoding lowered the WRR of PL, indicating that the model was prone to over-fitting to LM knowledge. MPL, on the other hand, took great advantage of decoding with the LM while achieving as high a WRR as the result decoded without the LM.

B. Effectiveness of w for Tuning the Momentum Update

Fig. 1 shows the performance of MPL depending on the weight w (defined in (22)) used to derive α for the momentum update (in (20)). Note that the figures are reproduced from our previous paper [47]. We observed a similar trend among the curves in the different semi-supervised settings (Fig. 1(a), (b), and (c)). WERs increased as w was set closer to 0.0. When w = 0.0, which is a similar approach to [40], the training was likely to be unstable and failed under the LS-360 and TED3 conditions, as illustrated by the learning curves shown in Fig. 2. This indicates the importance of retaining the influence of the seed model to stabilize learning from unlabeled data. However, depending too much on the seed model (i.e., setting w closer to 1.0) also worsened WERs. A larger w slows down the progress of the offline model, causing MPL to become more like standard PL [35] but without an LM. Fig. 1(d) shows results under an extreme condition, where not only does a domain mismatch exist between the seed model and the unlabeled data, but labeled data is not used during the semi-supervised process (i.e., the online model is trained with L_unlab(ξ) only). Compared to the other settings, the performance was more sensitive to the change in w, but the overall trend was similar.

In general, the proposed tuning method with w effectively controlled the momentum update in all settings. It provides a more intuitive guide for tuning α, taking the amount of data into account. Based on the validation results, mainly on the LS-860 and TED3 settings, we set w = 0.5 for all the semi-supervised settings.

C. Comparison Between the Online and Offline Models

Table IV compares results obtained from the online and offline models after MPL training in LS-100/LS-360. Here, last1 indicates evaluating each model from the last checkpoint, and val10 from averaging model parameters over the 10 checkpoints with the best validation performance. Without the checkpoint averaging technique, the offline model gave better performance than the online model. Being an exponential moving average of the online model parameters over the MPL training (cf. (20)), the offline model naturally benefited from model ensembling, which has been shown effective in [83]. However, with checkpoint averaging, the performance of both models improved and the gap was reduced to almost none. As the online model was slightly better on the development sets, we used it as our default for evaluation, which is contrary to previous work [46].

D. Importance of Data Augmentation
In Table V, we investigate the importance of applying SpecAugment during the MPL training (cf. (19)). Even without the augmentation, MPL led to a decent improvement over the seed model (S1 in Table II). However, the WRRs significantly dropped compared to the results with SpecAugment (A2, B2, C2). Note that, for the models trained without the augmentation, we computed the WRR against an oracle model trained without the augmentation. Without SpecAugment, we observed that the training converged earlier and MPL was less effective.

E. MPL Results Using Conformer

1) Investigation on Normalization Method: In Table VI, we compare WERs of seed models trained using the Transformer or the Conformer architecture. Note that similar results also appear in our previous paper [48]. For the Conformer-based models, we investigated different normalization methods for the convolution module (in (9)), including {batch [71], instance [84], group [73], layer [85]} normalization ({BN, IN, GN, LN}). Note that IN and LN are the same as GN with group sizes of 1 and 256 (= d_model), respectively. Comparing the two architectures, the Conformer-based models significantly improved over the Transformer-based model (S1 in Table II). Within the Conformer-based models, GN resulted in the best performance on both LS and TED3, and the 100-hour training data seemed to be too small to take advantage of BN. As normalizing across feature maps (i.e., IN, GN, and LN) achieved better performance than BN on the out-of-domain TED3 set, this indicates that BN led to lower generalization capability due to unreliable statistics. Note that in [11], BN achieved better performance than the other normalization methods when another ASR model based on depth-wise separable convolution was trained on the labeled full 960-hour set of LS. Fig. 3 shows learning curves from MPL training using Conformer with BN or GN. The figure is reproduced from our previous paper [48]. In all semi-supervised settings, BN caused the training to become unstable. This was especially the case in the out-of-domain setting with TED3, where the model diverged more quickly than in the other settings. In contrast, GN successfully stabilized the MPL training with Conformer.
2) In-Domain Setting: Table VII lists results on the in-domain LS settings, comparing PL [35], IPL [39], and the proposed MPL. Note that similar results also appear in our previous paper [48]. Looking at the MPL results (X3, Y3), MPL led to a substantial improvement over the seed model (S2), effectively learning from unlabeled data using Conformer with GN. These Conformer results significantly outperformed those of Transformer-based MPL (A2, B2 from Table II). With pseudo-labels generated using the LM, PL [35] and IPL [39] achieved lower WERs on the clean sets than those obtained from MPL, and IPL resulted in better performance than MPL on the other sets as well (*1, *2 vs. *3). However, when decoded with the LM, the performance gain was larger for MPL, with only a slight decrease in WRRs compared with decoding without the LM, and MPL achieved much lower WERs on the other sets. PL and IPL, in contrast, showed smaller improvements with degraded WRRs, as was observed in the Transformer results (Table II).
3) Out-of-Domain Setting: Table VIII shows MPL results on the TED3 setting. Note that similar results also appear in our previous paper [48]. Conformer with GN significantly improved MPL over the seed model and Transformer-based MPL (Z3 vs. S2, C2), successfully stabilizing training on the out-of-domain unlabeled data. PL and IPL led to decent improvements over the seed model, but the gain was more substantial for MPL (C1 vs. C2), which is consistent with the Transformer results (Table III). Moreover, PL and IPL gained little from decoding with the LM, indicating that the model was over-fitted to the LM knowledge.

F. Exploiting LM Knowledge in MPL via PL or IPL
In Tables VII and VIII, *4 and *5 show results for applying MPL training after enhancing the seed model using PL and IPL, respectively (cf. Section IV-E). Note that we performed PL or IPL for 100 epochs and MPL for another 100 epochs to match the total training epochs of the other methods.
In the in-domain settings (Table VII), this initialization strategy provided MPL with distinct improvements (X3 vs. X4, X5 and Y3 vs. Y4, Y5). With the IPL-based initialization, MPL achieved the best overall performance in both settings with different amounts of unlabeled data (X5, Y5). When decoded with the LM, the improved MPL achieved higher WRRs than those obtained from PL and IPL (e.g., X2 vs. X5), preventing the model from over-fitting to LM knowledge while exploiting it to improve ASR performance.
In the out-of-domain setting (Table VIII), MPL further reduced WERs by using the initialization based on IPL (Z3 vs. Z5). However, the improvement was much smaller than those observed in the in-domain settings, and the standard MPL performed sufficiently well by decoding with the LM.

G. Adapting Language Model
To further improve the MPL result in the out-of-domain setting, we explore adapting the LM to the target domain (i.e., TED3). To this end, we make use of pseudo-labels generated from train-ted3 using the model obtained from MPL (Z3). The pseudo-labels were simply mixed with the LS training text for training an adapted LM. Table IX shows results decoded with different LMs, where the source LM is trained on the LS training text (the same LM used in the other experiments) and the target LM is trained on the external text-only data provided by TED3 [75]. With the adapted LM, MPL slightly improved over decoding with the source LM. The seed and oracle models also benefited from decoding with the adapted LM, where the gain was much larger than that of MPL. As the pseudo-labels are obtained as a result of MPL training, the adapted LM was less effective for MPL, which had already acquired the knowledge. Regarding LM training on pseudo-labels (uncertain ASR hypotheses), an effective approach has been proposed in [86], which trains a Transformer or LSTM LM by calculating a Kullback-Leibler divergence loss against token-wise predictions of ASR confusion networks. We found this method challenging to apply to our framework, as our work focuses on a CTC-based model with frame-wise predictions. Hence, future work should consider a better way of training an LM using pseudo-labels from a CTC-based ASR model.

H. MPL Results in Lower-Resource Settings

1) In-Domain Setting: Table X lists in-domain results where Conformer-based PL and MPL are applied to the LL-10 settings. With a smaller amount of labeled data, the seed model was inferior in quality compared to the one trained on LS-100 (S2 vs. S3). In both settings, PL and MPL successfully improved the seed model (*1, *2 vs. S3). Even without using the LM, MPL achieved much lower WERs than PL (*1 vs. *2), and PL resulted in a significant drop in WRRs compared to the previous experiments with more labeled data (I1 vs. X1 and J1 vs. Y1). While PL-based approaches often depend on the quality of a seed model, MPL managed to alleviate the problem by continuously improving the pseudo-label quality via the interaction between the online and offline models. Using PL as an initialization, MPL further improved the performance by exploiting the LM knowledge effectively.
2) Out-of-Domain Setting: Table XI shows results on the out-of-domain setting. While PL improved over the seed model, the gain was smaller when compared to the results from the in-domain settings (I1, J1 vs. K1). On the other hand, MPL performed better than PL and kept WRRs as high as in the in-domain results (I2, J2 vs. K2). Even with the low-quality seed model, MPL enabled the model to train stably on the out-of-domain data, and the tuning of the momentum update (Section IV-C) worked robustly with respect to the amount of labeled data. The PL-based initialization was also effective for improving MPL performance while maintaining high WRRs when decoded with the LM (K3).

I. MPL Results in Higher-Resource Settings
1) In-Domain Setting: Table XII lists in-domain results where Conformer-based PL and MPL are evaluated on the LS-460 settings. With a larger amount of labeled data, the seed model had better quality than the models trained on less data (S2, S3 vs. S4). In both settings, PL and MPL improved over the seed model (L1, L2 vs. S4), and the higher-quality seed model led to better overall results compared to the LS-100 settings (Table VII). When decoded without the LM, MPL achieved results similar to those obtained from PL on both the clean and other sets (L1 vs. L2). This is different from our previous observations in Table VII, where PL performed better than MPL by using pseudo-labels generated with the LM. With the better seed model trained on more labeled data, the LM was less effective in helping generate pseudo-labels, which is consistent with the findings in [42]. When decoded with the LM, MPL achieved lower WERs on the other sets than PL while keeping WRRs high. Overall, the PL-based initialization (L3) resulted in the best result, but it was less effective on the clean sets compared to the LS-100 settings (Table VII).
2) Out-of-Domain Setting: Table XIII shows results on the out-of-domain setting. The seed model trained on LS-460 was also effective for the TED3 setting, achieving much lower WERs than the models trained on less data (S2, S3 vs. S4). The overall trend was consistent with what we observed in Tables III, VIII, and XI, with MPL achieving higher WRRs.

VII. CONCLUSION AND FUTURE WORKS
We proposed momentum pseudo-labeling (MPL), a semi-supervised learning framework for end-to-end ASR. MPL consists of a pair of online and offline models that interact and learn from each other. The online model is trained to predict pseudo-labels generated by the offline model. The offline model maintains an exponential moving average of the online model weights. The interaction between the two models continuously improves the quality of pseudo-labels and stabilizes ASR training on unlabeled data. We applied MPL to a CTC-based end-to-end ASR model and conducted experiments on various semi-supervised settings based on LibriSpeech, Libri-Light, and TEDLIUM3. The results demonstrated that MPL significantly improves the seed model and is robust against variations in the amount of labeled/unlabeled data, variations in domain mismatch severity, and over-fitting to LM knowledge. With additional enhancements, e.g., Conformer with group normalization and the integration of LM knowledge via IPL, MPL achieved superior performance compared to other pseudo-labeling-based approaches.