Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature.


I. INTRODUCTION
The performance of automatic speech recognition (ASR) systems has improved dramatically in recent years thanks to the availability of larger training datasets, the development of neural network based models, and the computational power to train such models on these datasets [1]-[4]. However, the performance of ASR systems can still degrade rapidly when their conditions of use (test conditions) differ from the training data. There are several causes for this, including speaker differences, variability in the acoustic environment, and the domain of use.
Adaptation algorithms attempt to alleviate the mismatch between the test data and an ASR system's training data. Adapting an ASR system is a challenging problem since it requires the modification of large and complex models, typically using only a small amount of target data and without explicit supervision. Speaker adaptation (adapting the system to a target speaker) is the most common form of adaptation, but there are other important adaptation targets such as the domain of use and the spoken accent. Much of the work in the area has focused on speaker adaptation; however, many approaches developed for speaker adaptation do not explicitly model speaker characteristics, and can be applied to other adaptation targets. Thus our core treatment of adaptation algorithms is in the context of speaker adaptation, with a later discussion of particular approaches for domain adaptation and accent adaptation.
Adaptation algorithms require a data set for adaptation, which should be well-matched to the target test data. In the ideal case, the adaptation data would be labeled with a gold-standard transcription, to enable supervised learning algorithms to be used for adaptation. However, supervised data is rarely available: small amounts may be available for some domain adaptation tasks (for example, adapting a system trained on typical speech to disordered speech [5]). In the usual case, where supervised adaptation data is not available, supervised training algorithms can still be used with "pseudo-labels" obtained from a trained (non-adapted) system by semi-supervised training [6] or by teacher-student training [7]. Alternatively, unsupervised training can be applied to learn embeddings for the different adaptation classes, such as i-vectors [8] or bottleneck features extracted from an autoencoder neural network [9].
A second aspect of annotation required for adaptation is labeling of the adaptation class. Adaptation to the speaker can only reliably take place if there is metadata containing this information. In some cases, for example lecture recordings and telephony, this may be available. In other cases potentially inaccurate metadata is available, for instance in the transcription of television or online broadcasts. In many cases (for instance, anonymous voice search) speaker metadata is not available. In the absence of speaker metadata, adaptation can take place at the utterance level [10], or automatic clustering approaches can be used to define the adaptation classes [11], [12]. This is discussed in Sec. II.
This overview focuses on the adaptation of neural network (NN) based speech recognition systems, although we briefly discuss earlier approaches to speaker adaptation in Sec. III. Speaker adaptation algorithms for hidden Markov model (HMM) based systems are reviewed in more detail by Woodland [13] and Shinoda [14]. As we discuss, some of the algorithms developed for HMM-based systems, in particular feature transformation approaches, have been successfully applied to NN-based systems.
NN-based systems [1], [15], [16] have revolutionized the field of speech recognition, and there has been intense activity in the development of adaptation algorithms for such systems. Adaptation of NN-based speech recognition is an exciting research area for at least two reasons: from a practical point of view, it is important to be able to adapt state-of-the-art systems; and from a theoretical point of view, the fact that NNs require fewer constraints on the input than a Gaussian-based system, along with the gradient-based discriminative training which is at the heart of most NN-based speech recognition systems, opens a range of possible adaptation algorithms.
Neural networks were first applied to speech recognition as so-called NN/HMM hybrid systems, in which the neural network is used to estimate (scaled) likelihoods that act as the HMM state observation probabilities [15] (Fig. 1a). During the 1990s both feed-forward networks [15] and recurrent neural networks (RNNs) [17] were used in such hybrid systems and close to state-of-the-art results were obtained [18]. These systems were largely context-independent, although context-dependent NN-based acoustic models were also explored [19].
The modeling power of neural network systems at that time was computationally limited, and they were not able to achieve the precise levels of modeling obtained using context-dependent GMM-based HMM systems, which became the dominant approach. However, increases in computational power enabled deeper neural network models to be learned along with context-dependent modeling using the same number of context-dependent HMM tied states (senones) as GMM-based systems [1], [2]. This led to the development of systems surpassing the accuracy of GMM-based systems. This increase in computational power also enabled more powerful neural network models to be employed, in particular time-delay neural networks (TDNNs) [20], [21], convolutional neural networks (CNNs) [22], [23], long short-term memory (LSTM) RNNs [24], [25], and bidirectional LSTMs [26], [27].
Since 2015, there has been a significant trend in the field moving from hybrid HMM/NN systems to end-to-end (E2E) NN modeling [4], [16], [28]-[34] for ASR. E2E systems are characterized by the use of a single model transforming the input acoustic feature stream to a target stream of output tokens, which might be composed of characters, subwords, or even words. E2E models are optimized using a single objective function, rather than comprising multiple components (acoustic model, language model, lexicon) that are optimized individually. Currently, the most widely used E2E models are connectionist temporal classification (CTC) [35], [36], the RNN Transducer (RNN-T) model [31], [37], and the attention-based encoder-decoder (AED) model [16], [28].
CTC and the RNN-T both map an input speech feature sequence to an output label sequence, where the label sequence (typically characters) is considerably shorter than the input sequence. Both of these architectures use an additional blank output token to deal with the sequence length differences, with an objective function which sums over all possible alignments using the forward-backward algorithm [38]. CTC is an earlier, and simpler, method which assumes frame independence and functions similarly to the acoustic model in hybrid systems without modeling the linguistic dependency across words; its architecture is similar to that of the neural network in the hybrid system (Fig. 1a).
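As an illustration of how the blank token and the sum over alignments fit together, the following is a minimal sketch of the CTC forward recursion (the forward half of the forward-backward computation), checked against brute-force enumeration of all frame-level paths. The function names, the use of index 0 for the blank token, and the toy posteriors are assumptions made for this example only.

```python
import itertools

BLANK = 0  # index of the blank token; the choice of index 0 is an assumption


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level path to labels: merge repeated symbols, then drop blanks."""
    labels, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            labels.append(sym)
        prev = sym
    return labels


def ctc_forward(probs, labels, blank=BLANK):
    """Sum the probability of every alignment of `labels` with the forward recursion.

    probs[t][k] is the network posterior for symbol k at frame t.
    """
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]  # blank-augmented sequence: b l1 b l2 b ...
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for frame in probs[1:]:
        new = [0.0] * len(ext)
        for s, sym in enumerate(ext):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and sym != blank and sym != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame[sym]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def ctc_brute_force(probs, labels, blank=BLANK):
    """Reference implementation: enumerate every frame-level path explicitly."""
    total = 0.0
    for path in itertools.product(range(len(probs[0])), repeat=len(probs)):
        if ctc_collapse(path, blank) == list(labels):
            p = 1.0
            for frame, sym in zip(probs, path):
                p *= frame[sym]
            total += p
    return total


# Toy posteriors for 4 frames over {blank, symbol 1, symbol 2}; values are made up.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
for target in ([1, 2], [1, 1], [2]):
    assert abs(ctc_forward(probs, target) - ctc_brute_force(probs, target)) < 1e-12
```

The brute-force version makes the frame independence assumption explicit (each path probability is a product of per-frame posteriors), while the recursion computes the same sum in time linear in the number of frames.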
An RNN-T (Fig. 1b) combines an additional prediction network with the acoustic encoder. The prediction network is an RNN modeling linguistic dependencies whose input is the previously output symbol. Since the prediction network does not use the speech data, it is possible to train it on additional text data. The acoustic encoder and the prediction network are combined using a feed-forward joint network followed by a softmax to predict the next output token given the speech input and the linguistic context.
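The combination step can be sketched as follows. This is a minimal illustration, assuming an additive joint network with a tanh non-linearity (one common choice, not stated in the text above); all weight names, dimensions, and values are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]


def joint_network(enc_t, pred_u, W_enc, W_pred, bias, W_out):
    """Distribution over the next output token (including blank).

    enc_t:  acoustic encoder output at frame t
    pred_u: prediction network output after u previously emitted symbols
    """
    hidden = [
        math.tanh(a + p + c)
        for a, p, c in zip(matvec(W_enc, enc_t), matvec(W_pred, pred_u), bias)
    ]
    return softmax(matvec(W_out, hidden))


# Hypothetical toy sizes: encoder dim 3, prediction dim 2, joint dim 2, 4 output tokens.
enc_t = [0.5, -1.0, 0.25]
pred_u = [1.0, 0.0]
W_enc = [[0.1, 0.2, -0.3], [0.0, 0.4, 0.5]]
W_pred = [[0.3, -0.1], [0.2, 0.2]]
bias = [0.0, 0.1]
W_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]
token_probs = joint_network(enc_t, pred_u, W_enc, W_pred, bias, W_out)
```

Because `pred_u` depends only on previously emitted symbols, the prediction network can be trained on text alone, as noted above.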
Together, the RNN-T's prediction and joint networks may be regarded as a decoder, and we can view the RNN-T as a form of encoder-decoder system. The AED architecture (Fig. 1c) enriches this model with an additional attention network which interfaces the acoustic encoder with the decoder. The attention network operates on the entire sequence of encoder representations for an utterance, offering the decoder considerably more flexibility. A detailed comparison of popular E2E models in both streaming and non-streaming modes with large scale training data was conducted by Li et al. [39].
We present a general framework for adaptation of NN-based speech recognition systems (both hybrid and E2E) in Sec. IV, where we organize adaptation algorithms into three general categories: embedding-based approaches (discussed in Sec. V), model-based approaches (discussed in Secs. VI-VIII), and data augmentation approaches (discussed in Sec. IX).
As mentioned above, our treatment of adaptation algorithms is in the context of speaker adaptation. In Secs. X and XI we discuss specific approaches to accent adaptation and domain adaptation respectively.
Our primary focus is on the adaptation of acoustic models and end-to-end models. In Sec. XII we provide a summary of work in language model (LM) adaptation, mentioning both n-gram and neural network language models, and the use of LM adaptation in E2E systems.
Adaptation and transfer learning have become important and intensively researched topics in other areas related to machine learning, most notably computer vision and natural language processing (NLP). In both these cases the motivation is to train powerful base models using large amounts of training data, then to adapt these to specific tasks or domains, for which considerably less training data is available.
In computer vision, the base model is typically a large convolutional network trained to perform image classification or object recognition using the ImageNet database [40], [41]. The ImageNet model is then adapted to a lower resource task, such as computer-aided detection in medical imaging [42]. Kornblith et al. [43] have investigated empirically how well ImageNet models transfer to different tasks and datasets.
Transfer learning in NLP differs from computer vision, and from the speech recognition approaches discussed in this paper, in that the base model is trained in an unsupervised fashion to perform language modeling or a related task, typically using web-crawled text data. Base models used for NLP include the bidirectional LSTM [44] and Transformers, which make use of self-attention [45], [46]. These models are then trained on specific NLP tasks, with supervised training data, which is specified in a common format (e.g. text-to-text transfer [46]), often trained in a multi-task setting. Earlier adaptation approaches in NLP focused on feature adaptation (e.g. [47]), but more recently better results have been obtained using model-based adaptation, for instance "adapter layers" [46], [48], in which trainable transform layers are inserted into the pretrained base model.
More broadly, there has been extensive work on domain adaptation and transfer learning in machine learning, reviewed by Kouw and Loog [49]. This includes work on few-shot learning [50]-[52] and normalizing flows [53], [54]. Normalizing flows, which provide a probabilistic framework for feature transformations, were first developed for speech recognition as Gaussianization [55], and more recently have been applied to speech synthesis [56] and voice transformation [57].
Finally, we provide a meta-analysis of experimental studies using the main adaptation algorithms that we have discussed (Sec. XIII). The meta-analysis is based on experiments reported in 45 papers, carried out using 33 datasets, and is primarily based on the relative error rate reduction arising from adaptation approaches. In this section we analyze the performance of the main adaptation algorithms across a variety of adaptation target types (for instance speaker, domain, and accent), in supervised and unsupervised settings, in six different languages, and using six different NN model types in both hybrid and end-to-end settings.

II. IDENTIFYING ADAPTATION TARGETS
Adaptation aims to reduce the mismatch between training and test conditions. For an adaptation algorithm to be effective, the distribution of the adaptation data should be close to that encountered in test conditions. For the task of acoustic adaptation this requirement is typically satisfied by forming the adaptation data from one or more speech segments from given testing conditions (i.e. the same speaker, accent, domain, or acoustic environment). While for some tasks labels ascribed to speech segments may exist, allowing segments to be grouped into larger adaptation clusters, it is unrealistic to assume the availability of such metadata in general. However, depending on the application and the operating regime of the ASR system, it may be possible to derive reasonable proxies.
Utterance-level adaptation derives adaptation statistics using a single speech segment. This waives the requirement to carry information about speaker identity between utterances, which may simplify deployment of a recognition system, in terms of both engineering and privacy, as one does not need to estimate and store offline speaker-specific information. On the other hand, owing to the small amounts of data available for adaptation, the gains are usually lower than one could obtain with speaker-level clusters. While many approaches use utterances to directly extract corresponding embeddings to use as an auxiliary input for the acoustic model [8], [58]-[60], one can also build a fixed inventory of speaker, domain, or topic codes [61] or embeddings [62], [63] when learning the acoustic model or acoustic encoder, and then use the test utterance to select a combination of these at test time. The latter approach alleviates the necessity of estimating an accurate representation from small amounts of data. It may be possible to relax the utterance-level constraint by iteratively re-estimating adaptation statistics using one or more preceding segments [58]. Extra care usually needs to be taken to handle silence and speech uttered by different speakers, as failing to do so may deteriorate the overall ASR performance [63]-[65].
Speaker-level adaptation aggregates statistics across two or more segments uttered by the same talker, requiring a way to group adaptation utterances by talker. The generic approach to this problem relies on a speaker diarization system [66] that can identify speakers and accordingly assign their identities to the corresponding segments in the recordings. This is often used in the offline transcription of meetings or broadcast media. Some transcription tasks, such as lectures or telephone conversations, allow a single speaker identity to be assumed across a whole recording or conversation side. Such an approach is likely to result in the estimation of many adaptation transforms for the same (physical) speaker appearing across multiple recordings, but this is not an issue if there is enough acoustic material in each recording.
Domain-level adaptation broadens the speaker-level cluster by including speech produced by multiple talkers who share some common characteristic such as accent, age, medical condition, topic, etc. This typically results in more adaptation material and an easier annotation process (cluster labels need to be assigned at batch rather than segment level). As such, domain adaptation can usually leverage adaptation transforms with greater capacity, and thus offer better adaptation gains.
Depending on whether adaptation transforms are estimated on held-out data, or adaptation is iteratively derived from test segments, we will refer to these as enrolment or online modes, respectively. Semi-supervised techniques refer to unsupervised learning that requires training targets, which are automatically produced from a seed model. A two-pass system is a special case in which the necessary statistics are estimated from test data, using a first-pass decoding with a speaker-independent model to obtain adaptation labels, followed by a second pass with the speaker-adapted model. Finally, transforms in the enrolment mode can be estimated in either a supervised or an unsupervised manner, depending on whether the adaptation targets on held-out data were derived manually or automatically. For semi-supervised approaches, it is possible to further filter out regions with low confidence to avoid the reinforcement of potential errors [67]-[69]. There is some evidence in the literature that, for some limited-capacity transforms estimated in a semi-supervised manner, the first-pass transcript quality has a small impact on the adapted accuracy as long as the transcripts are obtained with the corresponding speaker-independent model [70], [71]. In lattice supervision, multiple possible transcriptions are used in a semi-supervised setting by generating a lattice or graph, rather than the one-best transcription [72]-[75].

III. ADAPTATION ALGORITHMS FOR HMM-BASED ASR
Speaker adaptation of speech recognition systems has been investigated since the 1960s [76], [77]. In the mid-1990s, the influential maximum likelihood linear regression (MLLR) [78] and maximum a posteriori (MAP) [79] approaches to speaker adaptation for HMM/GMM systems were introduced. These methods, described below, stimulated the field, leading to intense activity in algorithms for the adaptation of HMM/GMM systems, reviewed by Woodland [13] and in Section 5 of Gales and Young's broader review of HMM-based speech recognition [80].
In this section we review MAP, MLLR, and related approaches to the adaptation of HMM/GMM systems, along with earlier approaches to speaker adaptation. Many of these early approaches were designed to normalize speaker-specific characteristics, such as vocal tract length, building on linguistic findings relating to speaker normalization in speech perception [81], often casting the problem as one of spectral normalization. This work included formant-based frequency warping approaches [76], [77], [82], and the estimation of linear projections to normalize the spectral representation to a speaker-independent form [83], [84].
Vocal tract length normalization (VTLN) was introduced by Wakita [85] (and again by Andreou [86]) as a form of frequency warping with the aim of compensating for vocal tract length differences across speakers. VTLN was extensively investigated for speech recognition in the 1990s and 2000s [87]-[90], and is discussed further in Sec. V.
In model-based adaptation, the speech recognition model is used to drive the adaptation. In work prefiguring subspace models, Furui [91] showed how speaker-specific models could be estimated from small amounts of target data in a dynamic time warping setting, learning linear transforms between pre-existing speaker-dependent phonetic templates and templates for a target speaker. Similar techniques were developed in the 1980s by adapting the vector quantization (VQ) used in discrete HMM systems. Shikano, Nakamura, and Abe [92] showed that mappings between speaker-dependent codebooks could be learned to model a target speaker (a technique widely used for voice conversion [93]); Feng et al. [94] developed a VQ-based approach in which speaker-specific mappings were learned between codewords in a speaker-independent codebook, in order to maximize the likelihood of the discrete HMM system. Rigoll [95] introduced a related approach in which the speaker-specific transform took the form of a Markov model. A continuous version of this approach, referred to as probabilistic spectrum fitting, which aimed to adjust the parameters of a Gaussian phonetic model, was introduced by Hunt [96] and further developed by Cox and Bridle [97].
These probabilistic spectral modeling approaches can be viewed as precursors to maximum likelihood linear regression (MLLR), introduced by Leggetter and Woodland [78] and generalized by Gales [98]. MLLR applies to continuous probability density HMM systems, composed of Gaussian probability density functions. In MLLR, linear transforms are estimated to adapt the mean vectors and covariance matrices of the Gaussian components. If $\mu$ and $\Sigma$ are the mean vector and covariance matrix of a Gaussian, then MLLR adapts the parameters as follows, where $A_s$, $b_s$, and $H_s$ are the MLLR parameters for speaker $s$:
$$\hat{\mu}_s = A_s \mu + b_s, \qquad \hat{\Sigma}_s = H_s \Sigma H_s^{\top}.$$
The MLLR parameters are estimated using maximum likelihood. For efficient computation, the likelihood can be computed using the following [98]:
$$\mathcal{N}(x; \hat{\mu}_s, \hat{\Sigma}_s) = \frac{1}{|H_s|}\, \mathcal{N}\!\left(H_s^{-1} x;\; H_s^{-1}(A_s \mu + b_s),\; \Sigma\right).$$
MLLR is a compact adaptation technique since the transforms are shared across Gaussians: for instance, all Gaussians corresponding to the same monophone might share mean and covariance transforms. Very often, especially when target data is sparse, a greater degree of sharing is employed, for instance two shared adaptation transforms, one for Gaussians in speech models and one for Gaussians in non-speech models.
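The efficient likelihood computation can be checked numerically. The sketch below assumes the convention that the adapted mean is A mu + b and the adapted covariance is H Sigma H^T, and further restricts Sigma and H to be diagonal so that only the standard library is needed; the function names and numerical values are hypothetical.

```python
import math


def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def diag_gauss_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (variances in `var`)."""
    p = 1.0
    for xi, mi, vi in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return p


def mllr_adapt(mu, var, A, b, h):
    """Adapted mean A mu + b and adapted variances for diagonal H = diag(h)."""
    mu_hat = [m + bi for m, bi in zip(matvec(A, mu), b)]
    var_hat = [hi * hi * vi for hi, vi in zip(h, var)]
    return mu_hat, var_hat


def mllr_likelihood_efficient(x, mu, var, A, b, h):
    """Evaluate the adapted density while keeping the unadapted variances."""
    mu_hat, _ = mllr_adapt(mu, var, A, b, h)
    x_scaled = [xi / hi for xi, hi in zip(x, h)]
    mu_scaled = [mi / hi for mi, hi in zip(mu_hat, h)]
    det_h = math.prod(abs(hi) for hi in h)
    return diag_gauss_pdf(x_scaled, mu_scaled, var) / det_h


# Hypothetical speaker-independent Gaussian and speaker transform.
mu, var = [0.5, -1.0], [1.0, 0.25]
A, b, h = [[1.1, 0.2], [-0.1, 0.9]], [0.3, -0.2], [1.5, 0.8]
x = [0.9, -0.4]

mu_hat, var_hat = mllr_adapt(mu, var, A, b, h)
direct = diag_gauss_pdf(x, mu_hat, var_hat)
efficient = mllr_likelihood_efficient(x, mu, var, A, b, h)
assert abs(direct - efficient) < 1e-12
```

The efficient form leaves the stored covariances untouched, so a covariance transform shared by many Gaussians only requires rescaling each observation once per transform.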
Constrained MLLR [98], [99] is an important variant of MLLR, in which the same transform is used for both the mean and covariance:
$$\hat{\mu}_s = A_s \mu + b_s, \qquad \hat{\Sigma}_s = A_s \Sigma A_s^{\top}.$$
In this case, the likelihood is given by
$$\mathcal{N}(x; \hat{\mu}_s, \hat{\Sigma}_s) = \frac{1}{|A_s|}\, \mathcal{N}\!\left(A_s^{-1}(x - b_s);\; \mu,\; \Sigma\right).$$
It can be seen that this transform of the model parameters is equivalent to applying a linear transform to the data; hence constrained MLLR is often referred to as feature-space MLLR (fMLLR), although it is not strictly feature-space adaptation unless a single transform is shared across all Gaussians in the system.
MLLR and its variants have been used extensively in the adaptation of Gaussian mixture model (GMM)-based HMM speech recognition systems [13], [80].
The above model-based adaptation approaches have aimed to estimate transforms between a speaker-independent model and a model adapted to a target speaker. An alternative Bayesian approach attempts to perform the adaptation by using the speaker-independent model to inform the prior of a speaker-adapted model. If the set of parameters of a speech recognition model is denoted by $\theta$, then maximum likelihood estimation sets $\theta$ to maximize the likelihood $p(X \mid \theta)$. In MAP training, the estimation procedure maximizes the posterior of the parameters given the data:
$$P(\theta \mid X) \propto p(X \mid \theta)\, p(\theta)^{r}$$
where $p(\theta)$ is the prior distribution of the parameters, which can be based on speaker-independent models, and $r$ is an empirically determined weighting factor. Gauvain and Lee [79] presented an approach using MAP estimation as an adaptation approach for HMM/GMM systems. A convenient choice of function for $p(\theta)$ is the conjugate to the likelihood, that is, the function which ensures the posterior has the same form as the prior. For a GMM, if it is assumed that the mixture weights $c_i$ and the Gaussian parameters $(\mu_i, \Sigma_i)$ are independent, then the conjugate prior may take the form of a mixture model $p_D(c_i) \prod_i p_W(\mu_i, \Sigma_i)$, where $p_D()$ is a Dirichlet distribution (conjugate to the multinomial) and $p_W()$ is the normal-Wishart density (conjugate to the Gaussian). This results in the following intuitively understandable parameter estimate for the adapted mean of a Gaussian:
$$\hat{\mu} = \frac{\tau \mu_0 + \sum_n \gamma(n)\, x_n}{\tau + \sum_n \gamma(n)}$$
where $\mu_0$ is the unadapted (speaker-independent) mean, $x_n$ is the $n$th adaptation acoustic vector, $\gamma(n)$ is the component occupation probability (responsibility) for the Gaussian component at time $n$ (estimated by the forward-backward algorithm), and $\tau$ is a positive scalar-valued parameter of the normal-Wishart density, which is typically set to a constant empirically (although Gauvain and Lee [79] also discuss an empirical Bayes estimation approach for this parameter). The re-estimated means of the Gaussian components take the form of a weighted interpolation between the speaker-independent mean and data from the target speaker. When there is no target speaker data for a Gaussian component, the parameters remain speaker-independent; as the amount of target speaker data increases, the Gaussian parameters approach the target speaker maximum likelihood estimate.
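The interpolation behaviour of the MAP mean estimate can be illustrated with a short sketch. It assumes the standard form of the update, in which the prior mean is weighted by tau and each adaptation frame by its occupation probability; the function name and toy data are hypothetical.

```python
def map_adapt_mean(mu0, frames, gammas, tau):
    """MAP estimate of a Gaussian mean.

    mu0:    speaker-independent (prior) mean
    frames: adaptation acoustic vectors x_n
    gammas: occupation probabilities gamma(n) for this Gaussian component
    tau:    positive prior weight
    """
    occupancy = sum(gammas)
    adapted = []
    for i, m0 in enumerate(mu0):
        weighted_sum = sum(g * x[i] for g, x in zip(gammas, frames))
        adapted.append((tau * m0 + weighted_sum) / (tau + occupancy))
    return adapted


mu0 = [0.0, 0.0]
target = [2.0, 4.0]  # hypothetical target-speaker frames, all identical for clarity

no_data = map_adapt_mean(mu0, [], [], tau=10.0)
some_data = map_adapt_mean(mu0, [target] * 10, [1.0] * 10, tau=10.0)
lots_of_data = map_adapt_mean(mu0, [target] * 1000, [1.0] * 1000, tau=10.0)
```

With no adaptation data the estimate stays at the prior mean; when the total occupancy equals tau it lies halfway between prior and data; with much more data it approaches the maximum likelihood estimate.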
In feature-based adaptation approaches, it is usual to adapt or normalize the acoustic features for each speaker in both the training and test sets. For example, in the case of cepstral mean and variance normalization (CMVN), statistics are computed for each speaker and the features normalized accordingly, during both training and test. Likewise, VTLN is also carried out for all speakers, transforming the acoustic features to a canonical form, with the variation from changes in vocal tract length being normalized away. However, in the model-based approaches discussed above (MLLR and MAP), we have implicitly assumed that adaptation takes place at test time: speaker-independent models are trained using recordings of multiple speakers in the usual way, with only the test speakers used for adaptation. In contrast to this, it is possible to employ a model-based adaptive training approach.
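As a concrete example of per-speaker feature normalization, the following sketch applies CMVN with statistics pooled over all frames of each speaker. The function name, the small `eps` stabilizer, and the toy data are assumptions of this illustration.

```python
import math
from collections import defaultdict


def speaker_cmvn(frames, speakers, eps=1e-8):
    """Normalize each frame with the mean and standard deviation of its speaker.

    frames:   list of feature vectors
    speakers: parallel list of speaker labels
    Returns the normalized frames and the per-speaker (mean, std) statistics.
    """
    by_speaker = defaultdict(list)
    for x, spk in zip(frames, speakers):
        by_speaker[spk].append(x)
    stats = {}
    for spk, xs in by_speaker.items():
        n, dim = len(xs), len(xs[0])
        mean = [sum(x[i] for x in xs) / n for i in range(dim)]
        std = [math.sqrt(sum((x[i] - mean[i]) ** 2 for x in xs) / n) for i in range(dim)]
        stats[spk] = (mean, std)
    normalized = [
        [(xi - m) / (s + eps) for xi, m, s in zip(x, *stats[spk])]
        for x, spk in zip(frames, speakers)
    ]
    return normalized, stats


# Two hypothetical speakers with different feature offsets and scales.
frames = [[1.0, 10.0], [3.0, 14.0], [-2.0, 0.0], [2.0, 4.0], [0.0, 2.0]]
speakers = ["A", "A", "B", "B", "B"]
normalized, stats = speaker_cmvn(frames, speakers)
```

The same function would be applied to training and test speakers alike, which is what distinguishes this normalization from test-time-only model adaptation.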
In speaker adaptive training [100], a transform is estimated for each speaker in the training set, as well as for each speaker in the test set. In the case of MLLR, at each iteration of training, the adaptation transforms are updated for each speaker, followed by an estimate of the canonical speaker-independent model given the set of speaker transforms. Hence adapted mean vectors and covariance matrices are computed for each speaker in the training set. At test time, the target speaker transforms are estimated as usual. Multiple types of adaptive training can be profitably combined, for example by performing fMLLR adaptive training as a form of feature normalization, together with MLLR adaptive training for model adaptation [101].
Speaker space approaches represent a speaker-adapted model as a weighted sum of a set of individual models which may represent individual speakers or, more commonly, speaker clusters. In cluster-adaptive training (CAT) [11], the mean for a Gaussian component for a specific speaker $s$ is given by:
$$\mu_s = \sum_{c} w_c\, \mu_c$$
where $\mu_c$ is the mean of the particular Gaussian component for speaker cluster $c$, and $w_c$ is the cluster weight. This expresses the speaker-adapted mean vector as a point in a speaker space. Given a set of canonical speaker cluster models, CAT is efficient in terms of parameters, since only the set of cluster weights needs to be estimated for a new speaker. Eigenvoices [102] are an alternative way of constructing speaker spaces, with a speaker model again represented as a weighted sum of canonical models. In the Eigenvoices technique, principal component analysis of "supervectors" (concatenated mean vectors from the set of speaker-specific models) is used to create a basis of the speaker space.
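The weighted-sum representation used in CAT can be sketched in a few lines. The cluster means and weights below are hypothetical, and the weights are simply given rather than estimated from adaptation data.

```python
def cat_mean(cluster_means, weights):
    """Speaker-adapted mean as a weighted sum of per-cluster means."""
    dim = len(cluster_means[0])
    return [
        sum(w * mu[i] for w, mu in zip(weights, cluster_means))
        for i in range(dim)
    ]


# Three hypothetical speaker clusters for one Gaussian component (dimension 2).
cluster_means = [[0.0, 1.0], [2.0, -1.0], [4.0, 3.0]]

# A new speaker is described by one weight per cluster, independent of the
# feature dimension and of the number of Gaussians sharing these weights.
speaker_weights = [0.5, 0.25, 0.25]
adapted_mean = cat_mean(cluster_means, speaker_weights)
```

Here the speaker is described by three numbers only, which is the parameter efficiency referred to above.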
A number of variants of cluster-adaptive training have been presented, including representing a speaker by combining MLLR transforms from the canonical models [11], and using sequence discriminative objective functions such as minimum phone error (MPE) [103]. Techniques closely related to CAT have been used for the adaptation of neural network based systems (Sec. VI).

IV. ADAPTATION ALGORITHMS FOR NN-BASED ASR
The literature describing methods for adaptation of NNs has tended to inherit terminology from the algorithms used to adapt HMM-GMM systems, for which there is an important distinction between feature space and model space formulations of MLLR-type approaches [98], as discussed in the previous section. In a 2017 review of NN adaptation, Sim et al. [104] divide adaptation algorithms into feature normalisation, feature augmentation and structured parameterization. (They also use a further category termed constrained adaptation, discussed further below.)
The task of an ASR model is to map a sequence of acoustic feature vectors, $X = (x_1, \ldots, x_t, \ldots, x_T)$, $x_t \in \mathbb{R}^d$, to a sequence of words $W$. Although, as we discuss below, most techniques described in this paper apply equally to end-to-end models and hybrid HMM-NN models, we generally treat the model to be adapted as an acoustic model. That is, we ignore aspects of adaptation that affect only $P(W)$, independently of the acoustics $X$ (LM adaptation is discussed in Sec. XII). Further, with only a small loss of generality, in what follows we will assume that the model operates in a framewise manner, thus we can define the model as:
$$y_t = f(x_t; \theta)$$
where $f(x; \theta)$ is the NN model with parameters $\theta$ and $y_t$ is the output label at frame $t$. In a hybrid HMM-NN system, for example, $y_t$ is taken to be a vector of posterior probabilities over a senone set. In a CTC model, $y_t$ would be a vector of posterior probabilities over the output symbol set, plus the blank symbol. Note that NN models often operate on a wider windowed set of input features, $x_t(w) = [x_{t-c}, x_{t-c+1}, \ldots, x_{t+c-1}, x_{t+c}]$, with the total window size $w = 2c + 1$. For reasons of notational clarity, we generally ignore the distinction between $x_t$ and $x_t(w)$, unless it is specifically relevant to a particular topic.
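The windowed input described above can be formed by splicing each frame with its c left and c right neighbours. The sketch below is one way to do this; repeating the first and last frames at the utterance boundaries is an assumption of this example, since the text does not say how edges are handled.

```python
def splice(frames, c):
    """Return, for each frame t, the concatenation of frames t-c .. t+c.

    frames: list of feature vectors (each of dimension d)
    c:      number of context frames on each side, so the window size is 2c + 1
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for k in range(t - c, t + c + 1):
            k = min(max(k, 0), last)  # clamp the index: repeat edge frames
            window.extend(frames[k])
        spliced.append(window)
    return spliced


# Toy one-dimensional features so the windows are easy to read.
frames = [[1.0], [2.0], [3.0]]
windows = splice(frames, c=1)
```

Each spliced vector has dimension (2c + 1)d, so a speaker-dependent transform applied to the windowed input has correspondingly more parameters than one applied to a single frame.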
In this framework, we can define feature normalisation approaches as acting to transform the features in a speaker-dependent manner, on which the speaker-independent model operates. For each speaker $s$, a transformation function $g : \mathbb{R}^d \to \mathbb{R}^d$ computes:
$$x'_t = g(x_t; \varphi_s) \tag{13}$$
where $\varphi_s$ is a set of speaker-dependent parameters. This family is closely related to feature space methods used in GMM systems described above in Sec. III, including fMLLR (when only a single affine transform is used), VTLN, and CMVN.
Structured parameterization approaches, in contrast, introduce a speaker-dependent transformation of the acoustic model parameters:
$$y_t = f(x_t; h(\theta; \phi_s)) \tag{14}$$
In this case, the function $h$ would typically be structured so as to ensure that the number of speaker-dependent parameters $\phi_s$ is substantially smaller than the number of parameters of the original model. Such methods are closely related to model-based adaptation of GMMs such as MLLR.
Finally, feature-augmentation approaches extend the feature vector $x_t$ with a speaker-dependent embedding $\lambda_s$, which we can write as
$$x'_t = \begin{pmatrix} x_t \\ \lambda_s \end{pmatrix} \tag{15}$$
Close variants of this approach use the embedding to augment the input to higher layers of the network. Note that the incorporation of an embedding requires the addition of further parameters to the acoustic model controlling the manner in which the embedding acts to adapt the model, which can be written $f(x_t; \theta, \theta_E)$. The embedding parameters $\theta_E$ are themselves speaker-independent.
We argue that the distinctions described above are not particularly helpful in the field of NN adaptation. In ML-estimated generative acoustic models such as GMMs, the distinction between feature-space and model adaptation is important (as noted by Gales [98]) because in the former case, different feature-space transformations can be carried out per senone class if the appropriate scaling by a Jacobian is performed; in the latter case, it is necessary for the adapted probability density functions to be re-normalized. In NN adaptation, however, all three approaches can be seen to be closely related or even special cases of each other. For example, the normalisation function $g$ can generally be formulated as a shallow NN, possibly without a non-linearity. If there is a set of "identity transform" parameters $\varphi_I$ such that
$$g(x_t; \varphi_I) = x_t$$
then we have
$$f(x_t; \theta) = f(g(x_t; \varphi_I); \theta) = f'(x_t; \theta')$$
where $f'$ is a new network comprising a copy of the original network $f$ with the layers of $g$ prepended, with parameters $\theta' = \{\theta, \varphi_I\}$. Applying feature normalisation (13) leads to:
$$y_t = f(g(x_t; \varphi_s); \theta)$$
which we can write as a structured parameter transformation of $f'$, as defined in (14):
$$y_t = f'(x_t; h(\theta'; \phi_s))$$
where the transformation $h(\,\cdot\,; \phi_s)$ is simply set to replace the parameters pertaining to $g$ with the original normalisation parameters, $\varphi_s = \phi_s$, leaving the other parameters unchanged.
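The argument that feature normalisation is a special case of a structured parameter transformation can be made concrete with a small sketch. Here the model f is a single tanh layer and the normalisation g is an affine transform; both choices, along with all names and values, are assumptions made for illustration.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def g(x, phi):
    """Affine feature normalisation with parameters phi = (A, c)."""
    A, c = phi
    return [v + ci for v, ci in zip(matvec(A, x), c)]


def f(x, theta):
    """Speaker-independent model: a single tanh layer, theta = (W, b)."""
    W, b = theta
    return [math.tanh(v + bi) for v, bi in zip(matvec(W, x), b)]


def f_prime(x, theta_prime):
    """Network f with the layer g prepended; theta_prime holds both parameter sets."""
    return f(g(x, theta_prime["g"]), theta_prime["f"])


def h(theta_prime, phi_s):
    """Structured transformation: replace only the parameters belonging to g."""
    return {"g": phi_s, "f": theta_prime["f"]}


theta = ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1])
phi_identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
phi_s = ([[1.1, 0.0], [0.0, 0.9]], [-0.2, 0.3])  # hypothetical speaker transform
theta_prime = {"g": phi_identity, "f": theta}
x = [0.4, -0.7]

# With identity parameters, the extended network reproduces the original model.
si_original = f(x, theta)
si_extended = f_prime(x, theta_prime)

# Normalising the features equals transforming the parameters of f_prime.
normalised = f(g(x, phi_s), theta)
structured = f_prime(x, h(theta_prime, phi_s))
```

In both comparisons the two outputs agree, which is the sense in which the two views of adaptation coincide.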
Feature augmentation approaches may be readily seen to be a further special case of structured adaptation. In the simple case of input feature augmentation (15), we see that the output of the first layer, prior to the non-linearity, can be written as
$$W \begin{pmatrix} x_t \\ \lambda_s \end{pmatrix} + b$$
where $W$ and $b$ are the weight and bias of the first layer respectively. By introducing a decomposition of $W$, $W = \begin{pmatrix} U & V \end{pmatrix}$, we write this as
$$U x_t + V \lambda_s + b$$
with $U \in \theta$ and $V \in \theta_E$ being weight matrices pertaining to the input features and speaker embedding, respectively. This can be expressed as a structured transformation of the bias:
$$U x_t + h(b; \phi_s), \qquad h(b; \phi_s) = b + \phi_s$$
with $\phi_s = V \lambda_s$. Similar arguments apply to embeddings used in other network layers.
Certain types of feature normalisation approaches can be expressed as feature augmentation. For example, cepstral mean normalisation given by
$$x'_t = x_t - \mu_s$$
can be expressed as
$$x'_t = \begin{pmatrix} I & I \end{pmatrix} \begin{pmatrix} x_t \\ \lambda_s \end{pmatrix}$$
with augmented features $\lambda_s = -\mu_s$.
Approaches to NN adaptation under the traditional categorization of feature augmentation, structured parameterization and feature normalization can usually be seen as special cases of one another. Therefore, in the remainder of this paper, we adopt an alternative categorization:
• Embedding-based approaches, in which any speaker-dependent parameters are estimated independently of the model, with the model $f(x_t; \theta)$ itself being unchanged between speakers, other than the possible need to add additional embedding parameters $\theta_E$;
• Model-based approaches, in which the model parameters $\theta$ are directly adapted to data from the target speaker according to the primary objective function;
• Data augmentation approaches, which attempt to synthetically generate additional training data with a close match to the target speaker, by transforming the existing training data.
This distinction is, we believe, particularly important in speaker adaptation of NNs because in ASR it has become standard to perform adaptation in a semi-supervised manner, with no transcribed adaptation data for the target speaker. In this setting, as we will discuss, standard objective functions such as cross-entropy, which may be very effective in supervised training or adaptation, are particularly susceptible to transcription errors in semi-supervised settings.
We describe the model-independent approaches as embedding-based because any set of speaker-dependent parameters can be viewed as an embedding. Embedding-based approaches are discussed in Sec. V. Well-known examples of speaker embeddings include i-vectors [8], [105], and x-vectors [106], but the category can also include parameter sets more classically viewed as normalizing transforms such as CMVN statistics and global fMLLR transforms (see Sec. III above). However, for the reasons mentioned above, we exclude from this category methods where the embedding is simply a subset of the primary model parameters and estimated according to the model's objective function. Note that methods using a one-hot encoding for each speaker are also excluded, since it would be impossible to use these with a speaker-independent model, without each test speaker having been present in training data; such methods might however be useful for closely related tasks such as domain adaptation, discussed in Sec. XI.
The primary benefit of speaker adaptive approaches over simply using speaker-dependent models is the prevention of over-fitting to the adaptation data (and its possibly errorful transcript). A large number of model-based adaptation techniques have been proposed to achieve this; in this paper, we sub-divide them into:
• Structured transforms: Methods in which a subset of the parameters is adapted, with many instances structuring the model so as to permit a reduced number of speaker-dependent parameters, as in LHUC [71], [107]. These can be viewed as analogous to MLLR transforms for GMMs. They are discussed in Sec. VI.
• Regularization: Methods with explicit regularization of the objective function to prevent over-fitting to the adaptation data, examples including the use of L2 loss or KL divergence terms to penalize the divergence from the speaker-independent parameters [108], [109]. Such methods can be viewed as related to the MAP approach for GMM adaptation. They are discussed in Sec. VII.
• Variant objective functions: Methods which adopt variants of the primary objective function to overcome the problems of noise in the target labels, with examples including the use of lattice supervision [75] or multi-task learning [110]. They are discussed in Sec. VIII.
The second two categories above are collectively termed constrained adaptation in the review by Sim et al [104]. Within this, multi-task learning is labeled by Sim et al as attribute aware training; however, we do not believe that all multi-task learning approaches to adaptation can be labeled in this way.
Data augmentation methods have proved very successful in adaptation to other sources of variability, particularly those, such as background noise conditions, where the required model transformations are hard to explicitly estimate, but where it is easy to generate realistic data. In the case of speaker adaptation, it is significantly harder to generate sufficiently good-quality synthetic data for a target speaker, given only limited data from the speaker in question. However, there is a growing body of work in this area using, for example, techniques from the field of speech synthesis [111]. Approaches in this area are discussed in Sec. IX.
Most methods suitable for adapting hybrid acoustic models can be leveraged to adapt acoustic encoders in E2E models. Both Kullback-Leibler divergence (KLD) regularization (Sec. VII) and multi-task learning (MTL) methods (Sec. VIII) have been used for speaker adaptation of CTC and AED models [112], [113].
Sim et al [114] updated the acoustic encoder of RNN-T models using speaker-specific adaptation data. Furthermore, by generating text-to-speech (TTS) audio from the target speaker, more data can be used to adapt the acoustic encoder. Such data augmentation adaptation (discussed in Sec. IX) was shown to be an effective way to perform speaker adaptation of E2E models [115] even with very limited raw data from the target speaker. Embeddings have also been used to train a speaker-aware Transformer AED model [116].
Because AED and RNN-T also have components corresponding to the language model, there are also techniques specific to adapting the language modeling aspect of E2E models, for instance using a text embedding instead of an acoustic embedding to bias an E2E model in order to produce outputs relevant to the particular recognition context [117]- [119]. If the new domain differs from the source domain mainly in content instead of acoustics, domain adaptation on E2E models can be performed by either interpolating the E2E model with an external language model (Sec. XII) or updating language model related components inside the E2E model with the text-to-speech audio generated from the text in the new domain [120], [121], discussed in Sec. XI.

V. SPEAKER EMBEDDINGS
Speaker embeddings map speakers to a continuous space. In this section we consider embeddings that may be extracted in a manner independent of the model. They can therefore also be useful in a standalone manner for other tasks such as speaker recognition. When used with an acoustic model, the model learns how to incorporate the embedding information by, in effect, speaker-aware training. Speaker embeddings may encode speaker-level variations that are otherwise difficult for the AM to learn from short-term features [65], and may be included as auxiliary features to the network. Specifically, let $x_t \in \mathbb{R}^d$ denote the acoustic features, and $\lambda_s \in \mathbb{R}^k$ a $k$-dimensional speaker embedding. The speaker embeddings may be concatenated with the acoustic input features, as previously seen in (15):
$$x'_t = \begin{bmatrix} x_t \\ \lambda_s \end{bmatrix}$$
Alternatively they may be concatenated with the activations of a hidden layer. In either case the result is bias adaptation of the next hidden layer as discussed in Sec. VI. As noted by Delcroix et al. [122] the auxiliary features may equivalently be added directly to the features using a learned projection matrix $P$, with the benefit that the downstream architecture can remain unchanged:
$$x'_t = x_t + P \lambda_s$$
There are many other ways to incorporate embeddings into the AM: for example, they may be used to scale neuron activations as in LHUC [71]. More generally we may consider embeddings applied to either biases or activations through context-adaptive [123] or control networks [124]. It is possible to limit connectivity from the auxiliary features to the rest of the network in order to improve robustness at test time or to better incorporate static features [125]- [127]. Later in this section we shall discuss embeddings used as label targets, as well as embeddings as transformations of the input features themselves.
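The statement that concatenating a speaker embedding with the layer input amounts to bias adaptation can be checked numerically. The following is a minimal sketch in plain Python; the matrices, vectors and function names are hypothetical toy values chosen for illustration, not taken from any cited system.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def layer_with_concat(W, b, x, lam):
    """Pre-activation of a layer fed the concatenated vector [x; lam]."""
    return [z + bi for z, bi in zip(matvec(W, x + lam), b)]

def layer_with_bias_shift(U, V, b, x, lam):
    """Same pre-activation written as U x + (b + V lam)."""
    shifted_bias = [bi + vi for bi, vi in zip(b, matvec(V, lam))]
    return [z + bi for z, bi in zip(matvec(U, x), shifted_bias)]

# Toy sizes (assumed): d = 2 acoustic dims, k = 2 embedding dims, 3 hidden units.
U = [[0.5, -1.0], [1.5, 0.25], [0.0, 2.0]]   # weights on the acoustic features
V = [[1.0, 0.0], [-0.5, 0.5], [0.25, 0.75]]  # weights on the speaker embedding
W = [u + v for u, v in zip(U, V)]            # W = [U V], row-wise concatenation
b = [0.1, -0.2, 0.3]
x = [1.0, 2.0]                               # one acoustic frame
lam = [0.5, -1.0]                            # speaker embedding

z_concat = layer_with_concat(W, b, x, lam)
z_shift = layer_with_bias_shift(U, V, b, x, lam)
```

Both computations give the same pre-activations, so for a fixed speaker the embedding only moves the effective bias of the layer.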
Since embeddings are estimated independently of the AM, there is a large variety of extraction methods, which are typically unsupervised with respect to the transcript. Many types of embeddings stem from research in speaker verification and speaker recognition. One such approach is identity vectors, or i-vectors [8], [105], [128], which are estimated using means from GMMs trained on the acoustic features. Specifically, the extraction of a speaker i-vector, $\lambda_s \in \mathbb{R}^k$, assumes a linear relationship between the global means from a background GMM (or universal background model, UBM), $m_g \in \mathbb{R}^m$, and the speaker-specific means, $m_s \in \mathbb{R}^m$:
$$m_s = m_g + T \lambda_s$$
where $T \in \mathbb{R}^{m \times k}$ is a matrix that is shared across all speakers, which is sometimes called the total variability matrix from its relation to joint factor analysis [129]. An i-vector thus corresponds to coordinates in the column space of $T$. $T$ is estimated iteratively using the EM algorithm. It is possible to replace the GMM means with posteriors or alignments from the AM [125], [130], [131] although this is no longer independent of the AM and requires transcriptions. The i-vectors are usually concatenated with the acoustic features as discussed above, but have also been used in more elaborate architectures to produce a feature mapping of the input features themselves [132], [133].
Some approaches extract low-dimensional embeddings from bottleneck layers in neural network models trained to distinguish between speakers [65], [126] or across multiple layers followed by dimensionality reduction in a separate AM (e.g. CNN embeddings [134]). One such approach, using Bottleneck Speaker Vector (BSV) embeddings [65], trains a feed-forward network to predict speaker labels (and silence) from spliced MFCCs (Fig.
2a). Tan et al [126] proposed to add a second objective to predict monophones in a multi-task setup. The bottleneck layer dimension is typically set to values commonly used for i-vectors. In fact, Huang and Sim [65] note that if the speaker label targets are replaced with speaker deviations from a UBM, then the bottleneck features may be considered frame-level i-vectors. The extracted frame-level features, $\lambda_t$, are averaged across all speech frames, $T_s$, of a given speaker:
$$\lambda_s = \frac{1}{|T_s|} \sum_{t \in T_s} \lambda_t$$
There are a number of later approaches that we may collectively refer to as ⋆-vectors. Like bottleneck features, these approaches typically extract embeddings from neural networks trained to discriminate between speakers, but not necessarily using a low-dimensional layer. For instance, deep vectors, or d-vectors [135], [136], extract embeddings from feed-forward or LSTM networks trained on filterbank features to predict speaker labels. The activations from the last hidden layer are averaged over time. X-vectors [106], [124] use TDNNs with a pooling layer that collects statistics over time and the embeddings are extracted following a subsequent affine layer. A related approach called r-vectors [137] uses the architecture of x-vectors, but predicts room impulse response (RIR) labels rather than speaker labels.
In contrast to the above approaches, label embeddings, or l-vectors [138], are designed to be used as soft output targets for the training of an AM. Each label embedding represents the output distribution for a particular senone target. In this way they are, in effect, uncoupled from the individual data points and can be used for domain adaptation without a requirement of parallel data. We will discuss this idea further in Sec. XI. For completeness we also mention h-vectors [139] which use a hierarchical attention mechanism to produce utterance-level embeddings, but have only been applied to speaker recognition tasks.
X-vector embeddings are not widely used for adapting ASR algorithms in practice, especially in comparison to commonly used i-vectors, as experiments have not shown consistent improvements in recognition accuracy. One reason for this is related to the speaker identification training objective for the x-vector network which implicitly factors out channel information, which might be beneficial for adaptation. The optimal objective for speaker embeddings used in ASR differs from the objective used in speaker verification.
Summary networks [60], [122] produce sequence level summaries of the input features and are closely related to ⋆-vectors (cf. Fig. 2b). Auxiliary features are produced by a neural network that takes as input the same features as the AM, and produces embeddings by taking the time-average of the output. By incorporating the averaging into the graph, the network can be trained jointly with the AM in an end-to-end fashion [122]. A related approach is to produce LHUC feature vectors (Sec. VI) from an independent network with embedded averaging [140].
We also consider speaker-level transformations of the acoustic features as speaker embeddings. These include methods traditionally viewed as normalisation, such as CMVN and fMLLR, which produce affine transformations of the features:
$$x'_t = A_s x_t + b_s$$
CMVN derives its name from the application to cepstral features, but corresponds to the standardization of the features to zero mean and unit variance (z-score):
$$x'_t = \frac{x_t - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\mu$ is the cepstral mean, $\sigma^2$ is the cepstral variance, and $\epsilon$ is a small constant for numerical stability. fMLLR belongs to the family of Maximum Likelihood Linear Regression (MLLR) speaker adaptation methods originally developed for HMM-GMM models [98], but which has later been used with success to transform features for hybrid models [141], [142]. fMLLR obtains feature-space affine transforms by maximising the likelihood of the data, typically using the EM algorithm and HMM-GMM models. The transform may also be estimated using a neural network trained to estimate fMLLR features [143] (structurally similar transforms estimated using the main objective function are discussed in Sec. VI). Instead of transforming the input features, some work has explored fMLLR features as an additional, auxiliary, feature stream to the standard features in order to improve robustness to mismatched transforms [127], or to obtain speaker-adapted features derived from GMM log-likelihoods [144], otherwise known as GMM-derived features.
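As an illustration of CMVN viewed as a speaker-level transform, the sketch below standardizes all frames of one speaker per feature dimension. It is a minimal plain-Python example; the function name `cmvn`, the toy frames, and the placement of the stability constant inside the square root are assumptions made for illustration.

```python
import math

def cmvn(frames, eps=1e-8):
    """Standardize each feature dimension over all frames of one speaker.

    frames: list of feature vectors (lists of floats) from a single speaker.
    eps:    small constant for numerical stability (placement is an assumption).
    """
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(dim)]
    var = [sum((f[j] - mean[j]) ** 2 for f in frames) / n for j in range(dim)]
    return [[(f[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(dim)]
            for f in frames]

# Hypothetical 2-dimensional features for four frames of one speaker.
speaker_frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = cmvn(speaker_frames)
```

The per-speaker mean and variance play the role of the speaker-dependent parameters here: they are estimated from the speaker's data alone, without reference to the acoustic model.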
VTLN is a physiologically motivated feature transformation technique [85], [86], [88], [145] which aims to control for varying vocal tract lengths between speakers by adjusting the filterbank in feature extraction. Typically, a piecewise linear warping function is used, which requires a single warping factor parameter. This can be estimated using any AM with a line search. Alternatively there are a range of techniques called linear-VTLN which obtain a corresponding affine transform similar to fMLLR, but choosing from a fixed set of transforms at test time (e.g. [89]). A related idea is that of the exponential transform [146], which forgoes any notion of vocal tract length, but akin to VTLN is controlled by a single parameter. More recently, adaptation of learnable filterbanks, operating as the first layer in a deep network, has resulted in updates which compensate for vocal tract length differences between speakers [147].
The embedding method is also helpful to the adaptation of E2E systems.Fan et al [116] generated a soft embedding vector by combining a set of i-vectors from multiple speakers with the combination weight calculated from the attention mechanism.The soft embedding vector is appended to the acoustic encoder output of the E2E model, helping the model to normalize speaker variations.
In addition to acoustic embeddings, E2E models can also leverage text embeddings to improve their modeling accuracy. For example, E2E models can be optimized to produce outputs relevant to the particular recognition context, for instance user contacts or device location. One solution is to add a context bias encoder in addition to the original audio encoder into E2E models [117]- [119]. This bias encoder takes a list of biasing phrases as the input. The context vector of the biasing list is generated by using the attention mechanism, and is then concatenated with the context vector of the acoustic encoder and fed into the decoder.

VI. STRUCTURED TRANSFORMS
Methods to adapt the parameters $\theta$ of a neural network-based acoustic model $f(x; \theta)$ can be split into two groups. The first group adapts the whole acoustic model or some of its layers [108], [109], [148]. The second group employs structured transformations [104] to transform input features $x$, hidden activations $h$ or outputs $y$ of the acoustic model. Such transformations include the linear input network (LIN) [149], linear hidden network (LHN) [150] and the linear output network (LON) [151]. These transforms are parameterized with a transformation matrix $A_s \in \mathbb{R}^{n \times n}$ and a bias $b_s \in \mathbb{R}^n$. The transformation matrix $A_s$ is initialized as an identity matrix and the bias $b_s$ is initialized as a zero vector prior to speaker adaptation. The adapted hidden activations then become
$$h' = A_s h + b_s$$
However, even a single transformation matrix $A_s$ can contain many speaker dependent parameters, making adaptation susceptible to overfitting to the adaptation data. It also limits its practical usage in real world deployment because of memory requirements related to storing speaker dependent parameters for each speaker. Therefore there has been considerable research into how to structure the matrix $A_s$ and the bias $b_s$ to reduce the number of speaker dependent parameters.
The first set of approaches restricts the adaptation matrix $A_s$ to be diagonal. If we denote the diagonal elements as $r_s = \mathrm{diag}(A_s)$, then the adapted hidden activations become
$$h' = r_s \odot h + b_s$$
There are several methods that belong to this set of adaptation methods. Learning Hidden Unit Contributions (LHUC) [71], [107] adapts only the parameters $r_s$:
$$h' = r_s \odot h$$
Speaker Codes [152], [153] prepend an adaptation neural network to an existing SI model in place of the input features. The adaptation network, which operates somewhat similarly to control networks, described below, uses the acoustic features as inputs, as well as an auxiliary low-dimensional speaker code which essentially adapts speaker dependent biases within the adaptation network:
$$h' = h + b_s$$
The network and speaker codes are learned by backpropagating through the frozen SI network with transcribed training data. At test time the speaker codes are derived by freezing all but the speaker code parameters and backpropagating on a small amount of adaptation data. Similarly, Wang and Wang [154] proposed a method that adapts both $r_s$ and $b_s$ as parameters of a batch normalization layer, a scale $\gamma_s$ and an offset $\beta_s$, applied to the hidden layer activations normalized with mean $\mu$ and standard deviation $\sigma$:
$$h' = \gamma_s \odot \frac{h - \mu}{\sigma} + \beta_s$$
Mana et al [155] showed that batch normalization layers can also be updated by recomputing the statistics $\mu$ and $\sigma$ in an online fashion.
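The element-wise scaling used by LHUC can be sketched in a few lines. The example below is an illustrative plain-Python sketch; the re-parameterization of the amplitudes as twice a sigmoid of an unconstrained parameter is one common choice and is an assumption here, as are the function names and toy values.

```python
import math

def lhuc_amplitudes(a):
    """Map unconstrained speaker parameters a to amplitudes r in (0, 2).

    The 2 * sigmoid(a) form is an assumed re-parameterization; a = 0 gives
    r = 1, i.e. the unadapted speaker-independent activations.
    """
    return [2.0 / (1.0 + math.exp(-ai)) for ai in a]

def apply_lhuc(h, a):
    """Scale hidden activations h element-wise by the speaker amplitudes."""
    return [ri * hi for ri, hi in zip(lhuc_amplitudes(a), h)]

hidden = [0.2, -1.5, 3.0]                           # toy hidden activations
unadapted = apply_lhuc(hidden, [0.0, 0.0, 0.0])     # identity at initialization
adapted = apply_lhuc(hidden, [0.5, -0.5, 0.0])      # after hypothetical updates
```

Only one parameter per hidden unit is stored for each speaker, which is what keeps the speaker-dependent footprint small compared with a full matrix $A_s$.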
A similar approach with a low-memory footprint adapts the activation functions instead of the scale $r_s$ and offset $b_s$. Zhang and Woodland [156] proposed the use of parameterised sigmoid and ReLU activation functions. With the parameterised sigmoid function, hidden activations $h_i$ are computed from hidden pre-activations $z_i$ as
$$h_i = \eta_s \cdot \frac{1}{1 + \exp(-\gamma_s z_i + \zeta_s)}$$
where $\eta_s$, $\gamma_s$ and $\zeta_s$ are speaker dependent parameters. $|\eta_s|$ controls the scale of the hidden activations, $\gamma_s$ controls the slope of the sigmoid function and $\zeta_s$ controls the midpoint of the sigmoid function. Similarly, parameterised ReLU activations were defined as
$$h_i = \begin{cases} \alpha_s z_i & \text{if } z_i > 0 \\ \beta_s z_i & \text{if } z_i \leq 0 \end{cases}$$
where $\alpha_s$ and $\beta_s$ are speaker dependent parameters that correspond to slopes for positive and negative pre-activations, respectively.
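The roles of the speaker-dependent parameters in the parameterised activation functions can be made concrete with a small sketch. The functional forms below follow the description in the text (scale, slope and midpoint for the sigmoid; separate slopes for the ReLU); the function names and default values are assumptions for illustration.

```python
import math

def p_sigmoid(z, eta=1.0, gamma=1.0, zeta=0.0):
    """Parameterised sigmoid: eta scales the output, gamma sets the slope,
    and zeta shifts the midpoint (located at z = zeta / gamma)."""
    return eta / (1.0 + math.exp(-gamma * z + zeta))

def p_relu(z, alpha=1.0, beta=0.0):
    """Parameterised ReLU: slope alpha for positive pre-activations,
    slope beta for non-positive ones."""
    return alpha * z if z > 0 else beta * z

# Defaults recover the standard sigmoid and ReLU.
standard = (p_sigmoid(0.0), p_relu(3.0), p_relu(-3.0))
# Hypothetical adapted parameters for one speaker.
adapted = (p_sigmoid(0.5, eta=2.0, gamma=2.0, zeta=1.0), p_relu(-2.0, beta=0.1))
```

With the default parameters both functions reduce to their standard forms, so adaptation can start from the speaker-independent model.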
Other approaches factorize the transformation matrix $A_s$ into a product of low-rank matrices to obtain a compact set of speaker dependent parameters. Zhao et al [157] proposed the Low-Rank Plus Diagonal (LRPD) method, which reduces the number of speaker dependent parameters by approximating the linear transformation matrix $A_s \in \mathbb{R}^{n \times n}$ as
$$A_s \approx D_s + P_s Q_s$$
where $D_s \in \mathbb{R}^{n \times n}$, $P_s \in \mathbb{R}^{n \times k}$ and $Q_s \in \mathbb{R}^{k \times n}$ are treated as speaker dependent matrices ($k < n$) and $D_s$ is a diagonal matrix. This approximation was motivated by the assumption that the adapted hidden activations should not be very different from the unadapted hidden activations when only a limited amount of adaptation data is available; hence the adaptation linear transformation should be close to a diagonal matrix. In fact, for $k = 0$ LRPD reduces to LHUC adaptation. LRPD adaptation can be implemented by inserting two hidden linear layers and a skip connection as illustrated in Fig. 3b. Zhao et al [158] later presented an extension to LRPD called Extended LRPD (eLRPD), which removed the dependency of the number of speaker dependent parameters on the hidden layer size by performing a different approximation of the linear transformation matrix $A_s$,
$$A_s \approx D_s + P T_s Q$$
where matrices $D_s \in \mathbb{R}^{n \times n}$ and $T_s \in \mathbb{R}^{k \times k}$ are treated as speaker dependent, and matrices $P \in \mathbb{R}^{n \times k}$ and $Q \in \mathbb{R}^{k \times n}$ are treated as speaker independent. Thus the number of speaker dependent parameters is mostly dependent on $k$, which can be chosen arbitrarily.
Instead of factorizing the transformation matrix, a technique typically known as feature-space discriminative linear regression (fDLR) [141], [159], [160] imposes a block-diagonal structure such that each input frame shares the same linear transform. This is, in effect, a tied variation of LIN with a reduction in the number of speaker dependent parameters.
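To give a feel for the savings from these factorizations, the following sketch counts the speaker-dependent parameters of a single transformation matrix under each scheme and builds a small LRPD matrix. The layer width and rank in the example are hypothetical, the bias $b_s$ is left out of the counts, and the function names are assumptions.

```python
def full_params(n):
    """Speaker-dependent parameters in an unconstrained n x n matrix A_s."""
    return n * n

def lrpd_params(n, k):
    """LRPD: diagonal D_s (n) plus P_s (n x k) and Q_s (k x n)."""
    return n + 2 * n * k

def elrpd_params(n, k):
    """eLRPD: diagonal D_s (n) plus T_s (k x k); P and Q are shared."""
    return n + k * k

def lrpd_matrix(d, P, Q):
    """Build A_s = diag(d) + P Q from lists of rows; k = len(Q) may be 0."""
    n, k = len(d), len(Q)
    return [[(d[i] if i == j else 0.0) + sum(P[i][l] * Q[l][j] for l in range(k))
             for j in range(n)] for i in range(n)]

# Hypothetical layer width and rank.
n, k = 2048, 64
counts = (full_params(n), lrpd_params(n, k), elrpd_params(n, k))

# With k = 0 the low-rank term vanishes and only the diagonal scaling remains.
diagonal_only = lrpd_matrix([1.0, 0.5, 2.0], [[], [], []], [])
```

For these assumed sizes the counts are 4194304, 264192 and 6144 respectively, and the `k = 0` case leaves a purely diagonal transform, matching the remark that LRPD then reduces to LHUC.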
Another set of approaches uses the speaker dependent parameters as mixing coefficients $\alpha_s$ for a set of bases $B_i$ which factorize the transformation matrix $A_s$. Samarakoon and Sim [161], [162] proposed to use factorized hidden layers (FHL) that allow both speaker-independent and speaker dependent modelling. With this approach, activations of a hidden layer $h$ with an activation function $\sigma$ are computed as
$$h = \sigma\Big(\big(W + \sum_i \alpha_s^{(i)} B_i\big) x + b_s + b\Big) \tag{40}$$
Note that when $\alpha_s = 0$ and $b_s = 0$, the activations correspond to a standard speaker independent model. If the bases $B_i$ are rank-1 matrices, $B_i = \gamma_i \psi_i^T$, then this allows the reparameterization of (40) as [162]:
$$h = \sigma\big((W + \Gamma D \Psi^T) x + b_s + b\big)$$
where $D = \mathrm{diag}(\alpha_s)$, and $\Gamma$ and $\Psi$ are the matrices whose columns are the vectors $\gamma_i$ and $\psi_i$, respectively. This approach is very similar to the factorization of hidden layers used for Cluster Adaptive Training of DNN networks (CAT-DNN) [12] that uses full rank bases instead of rank-1 bases.
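The rank-1 reparameterization of factorized hidden layers can be verified on a toy example: a weighted sum of rank-1 bases equals a product of two speaker-independent matrices with a diagonal matrix of speaker coefficients in between. The sketch below uses plain Python lists; all values and helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

# Hypothetical toy bases: k = 2 rank-1 bases for a 3 x 2 weight update.
gammas = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]  # vectors gamma_i, length n = 3
psis = [[1.0, 2.0], [0.0, -1.0]]              # vectors psi_i, length m = 2
alpha = [0.3, -0.7]                           # speaker mixing coefficients

# Form 1: explicit weighted sum of rank-1 bases, sum_i alpha_i gamma_i psi_i^T.
delta_sum = [[sum(a * g[i] * p[j] for a, g, p in zip(alpha, gammas, psis))
              for j in range(2)] for i in range(3)]

# Form 2: Gamma diag(alpha) Psi^T, with gamma_i and psi_i as columns.
Gamma = transpose(gammas)                     # n x k
Psi_T = psis                                  # k x m, i.e. Psi transposed
D = [[alpha[i] if i == j else 0.0 for j in range(2)] for i in range(2)]
delta_fact = matmul(matmul(Gamma, D), Psi_T)
```

Both forms give the same speaker-dependent update to the weight matrix, while the second makes explicit that only the diagonal of `D` needs to be stored per speaker.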
Similarly, Delcroix et al [123] proposed to adapt activations of a hidden layer using a mixture of experts [163]. The adapted hidden unit activations are then obtained as a speaker-dependent weighted combination of the expert outputs.
There have also been approaches that further reduce the number of speaker dependent parameters by removing the dependency on the hidden layer width by using control networks that predict the speaker-dependent parameters
$$r_s = c_r(z_s, \theta_r), \qquad b_s = c_b(z_s, \theta_b)$$
In contrast to the adaptation network used in the Speaker Codes scheme, the control networks themselves are speaker-independent, taking as input some lower dimensional speaker dependent representations $z_s \in \mathbb{R}^k$, typically speaker embeddings. As such, they form a link between structured transforms and the embedding-based approaches of Sec. V. The control networks $c_*(z_s, \theta_*)$ can be implemented as a single linear transformation or as a multi-layer neural network. These control networks are similar to the conditional affine transformations referred to as Feature-wise Linear Modulation (FiLM) [164]. For example, Subspace LHUC [165] uses a control network to predict LHUC parameters $r_s$ from i-vectors $\lambda_s$, resulting in a 94% memory footprint reduction compared to standard LHUC adaptation. Cui et al [166] used auxiliary features to adapt both the scale $r_s$ and offset $b_s$. Other approaches adapted the scale $r_s$ or the offset $b_s$ by leveraging the information extracted with summary networks instead of auxiliary features [167]- [169].
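A control network that predicts scaling parameters from a speaker embedding can be sketched as a single linear layer followed by a squashing function. This is only an illustrative sketch: the linear form, the 2 * sigmoid output range, the function names and the toy values are assumptions, not the configuration of any cited system.

```python
import math

def control_network(z, C, c):
    """Predict per-unit scales r_s from a speaker representation z.

    C (n x k) and c (n) are speaker-independent parameters; only z is
    speaker dependent. The 2 * sigmoid output is an assumed choice that
    yields r = 1 when the pre-activation is zero.
    """
    pre = [sum(w * zi for w, zi in zip(row, z)) + ci for row, ci in zip(C, c)]
    return [2.0 / (1.0 + math.exp(-p)) for p in pre]

def adapt_layer(h, z, C, c):
    """Scale hidden activations h with the predicted speaker scales."""
    return [r * hi for r, hi in zip(control_network(z, C, c), h)]

# Hypothetical sizes: n = 3 hidden units, k = 2 embedding dimensions.
C = [[0.4, -0.2], [0.0, 0.3], [-0.5, 0.1]]
c = [0.0, 0.0, 0.0]
embedding = [1.0, -1.0]          # e.g. a low-dimensional speaker embedding
scaled = adapt_layer([1.0, 2.0, -1.0], embedding, C, c)
```

Per speaker, only the `k`-dimensional representation has to be stored, since the mapping to the `n` scales is shared across speakers.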
Finally, the number of speaker dependent parameters in all the aforementioned linear transformations can be reduced by applying them to bottleneck layers that have much lower dimensionality than the standard hidden layers.These bottleneck layers can be obtained directly by training a neural network with bottleneck-layers or by applying Singular Value Decomposition (SVD) to the hidden layers [170], [171].

VII. REGULARIZATION METHODS
Even with the small number of speaker dependent parameters required by structured transformations, speaker adaptation can still overfit to the adaptation data. One way to prevent this overfitting is through the use of regularization methods that prevent the adapted model from diverging too far from the original model. This can be achieved by using early stopping and appropriate learning rates, which can be obtained with a hyper-parameter grid-search or by meta-learning [172], [173]. Another way to prevent the adapted model from diverging too far from the original is to limit the distance between the original and the adapted model. Liao [108] proposed to use the L2 regularization loss of the distance between the original speaker dependent parameters $\theta_s$ and the adapted speaker dependent parameters $\theta'_s$:
$$L_{\text{L2}} = \| \theta_s - \theta'_s \|_2^2$$
Yu et al [109] proposed to use Kullback-Leibler (KL) divergence to measure the distance between the senone distributions of the adapted model and the original model:
$$L_{\text{KL}} = D_{\text{KL}}\big(f(x; \theta_s) \,\|\, f(x; \theta'_s)\big)$$
If we consider the overall adaptation loss using cross-entropy:
$$L = (1 - \rho) L_{\text{xent}} + \rho L_{\text{KL}}$$
we can show that this loss is equal to cross-entropy with the target distribution
$$(1 - \rho) P(Y \mid X) + \rho f(X; \theta_s)$$
where $P(Y \mid X)$ is a distribution corresponding to the provided labels $y_{\text{adapt}}$ and $\rho$ is the regularization weight. Although initially proposed for adapting hybrid models, the KLD regularization method may also be used for speaker adaptation of E2E models [112], [113], [174].
Meng et al [175] noted that KL divergence is not a distance metric between distributions because it is asymmetric, and therefore proposed to use adversarial learning which guarantees that the local minimum of the regularization term is reached only if the senone distributions of the speaker independent and the speaker dependent models are identical. They achieve this by adversarially training a discriminator $d(x; \phi)$ whose task is to discriminate between the speaker dependent deep features $h'$ and speaker independent deep features $h$ that are obtained by passing the input adaptation frames through speaker dependent and
speaker independent feature extractors respectively. This process is illustrated in Fig. 4. The regularization loss of the discriminator, $L_{\text{disc}}$, is its classification loss over the two sets of features, where $h$ are hidden layer activations of the speaker independent model and $h'$ are hidden layer activations of the adapted model. The discriminator is trained in a minimax fashion during adaptation by minimizing $L_{\text{disc}}$ with respect to $\phi$ and maximizing $L_{\text{disc}}$ with respect to $\theta_s$. Consequently, the distribution of activations of the $i$-th hidden layer of the speaker dependent model will be indistinguishable from the distribution of activations of the $i$-th hidden layer of the speaker independent model, which ought to result in more robust performance of speaker adaptation.
Other approaches aim to prevent overfitting by leveraging the uncertainty of the speaker-dependent parameter space. Huang et al [176] proposed Maximum A Posteriori (MAP) adaptation of neural networks, inspired by MAP adaptation of GMM-HMM models [79] (Sec. III). MAP adaptation estimates speaker dependent parameters as a mode of the distribution
$$\hat{\theta}_s = \arg\max_{\theta_s} \, p(Y \mid X, \theta_s) \, p(\theta_s)$$
where $p(\theta_s)$ is a prior density of the speaker dependent parameters. In order to obtain this prior density, Huang et al [176] employed an empirical Bayes approach (following Gauvain and Lee [79]) and treated each speaker in the training data as a data point. They performed speaker adaptation for each speaker and observed that the speaker parameters across speakers resemble Gaussians. Therefore they decided to parameterise the prior density $p(\theta_s)$ as
$$p(\theta_s) = \mathcal{N}(\theta_s; \mu, \Sigma)$$
where $\mu$ is the mean of adapted speaker dependent parameters across different speakers, and $\Sigma$ is the corresponding diagonal covariance matrix. With this parameterisation the regularization term of the prior density $p(\theta_s)$ is
$$L_{\text{MAP}} = \frac{1}{2} (\theta_s - \mu)^T \Sigma^{-1} (\theta_s - \mu)$$
which for the prior density $p(\theta_s) = \mathcal{N}(\theta_s; 0, I)$ degenerates to the L2 regularization loss. Huang et al investigated their proposed MAP approach with LHN structured transforms, but noted that it may be used in combination with other schemes. Xie et al [177]
proposed a fully Bayesian way of dealing with uncertainty inherent in speaker dependent parameters $\theta_s$, in the context of estimating the LHUC parameters $r_s$ (see Sec. VI). In this method, known as BLHUC, the posterior distribution of the adapted model is approximated as:
$$p(y \mid x, D_{\text{adapt}}) = \int p(y \mid x, r_s) \, p(r_s \mid D_{\text{adapt}}) \, dr_s \approx p\big(y \mid x, \mathbb{E}[r_s \mid D_{\text{adapt}}]\big) \tag{53}$$
Xie et al propose to use a distribution $q(r_s)$ as a variational approximation of the posterior distribution of the LHUC parameters, $p(r_s \mid D_{\text{adapt}})$. For simplicity, they assume that both $q(r_s)$ and $p(r_s)$ are normal, such that $q(r_s) = \mathcal{N}(r_s; \mu_s, \gamma_s)$ and $p(r_s) = \mathcal{N}(r_s; \mu_0, \gamma_0)$, which results in the expectation for the speaker dependent parameters in (53) being given by:
$$\mathbb{E}[r_s \mid D_{\text{adapt}}] = \mu_s$$
The parameters are computed using gradient descent with a Monte Carlo approximation. Similarly to MAP adaptation, the effect is to force the adaptation to stay close to the speaker independent model when we perform adaptation with a small amount of adaptation data.
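The equivalence between KL-divergence regularization and cross-entropy training against an interpolated target, discussed earlier in this section, can be checked on a single frame. The sketch below is a plain-Python illustration; the weight name `rho`, the toy posteriors and the function names are assumptions. The two losses differ only by a term that is constant with respect to the adapted model (the entropy of the speaker-independent posterior).

```python
import math

def cross_entropy(target, pred):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def kl_divergence(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kld_target(one_hot, p_si, rho):
    """Interpolate the hard label with the speaker-independent posterior."""
    return [(1.0 - rho) * t + rho * s for t, s in zip(one_hot, p_si)]

# Hypothetical single-frame example with three senones.
one_hot = [0.0, 1.0, 0.0]      # label from the (possibly errorful) transcript
p_si = [0.7, 0.2, 0.1]         # posterior of the speaker-independent model
p_sd = [0.5, 0.3, 0.2]         # posterior of the model being adapted
rho = 0.5                      # regularization weight (assumed name)

# Loss written as weighted cross-entropy plus KL regularizer.
loss_regularized = ((1.0 - rho) * cross_entropy(one_hot, p_sd)
                    + rho * kl_divergence(p_si, p_sd))
# Loss written as cross-entropy against the interpolated target,
# minus the model-independent entropy term.
loss_interpolated = (cross_entropy(kld_target(one_hot, p_si, rho), p_sd)
                     - rho * cross_entropy(p_si, p_si))
```

In practice this means the regularizer can be implemented simply by replacing the hard targets with the interpolated ones, without changing the training code for the cross-entropy loss.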

VIII. VARIANT OBJECTIVE FUNCTIONS
Another challenge in speaker adaptation is overfitting to targets seen in the adaptation data and to errors in semi-supervised transcriptions. This issue can be mitigated by an appropriate choice of objective function.
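Gemello et al [150] proposed Conservative Training, which modifies the target distribution to ensure that labels not seen in the adaptation data will not be catastrophically forgotten. The adjusted target distribution is defined as
$$P(y \mid x_t) = \begin{cases} P_{\text{SI}}(y \mid x_t) & \text{if } y \in U \\ 1 - \sum_{y' \in U} P_{\text{SI}}(y' \mid x_t) & \text{if } y \in S \text{ and } y \text{ is the correct label} \\ 0 & \text{otherwise} \end{cases}$$
where $P_{\text{SI}}$ denotes the output distribution of the original speaker independent model, $S$ is a set of labels seen in the adaptation data and $U$ is a set of labels not seen in the adaptation data.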
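A minimal sketch of how such conservative targets could be constructed for one frame is given below. It assumes the commonly described form in which unseen labels keep the posterior of the original model, the correct label receives the remaining probability mass, and other seen labels get zero; the function name and toy values are hypothetical.

```python
def conservative_targets(correct, seen, p_si):
    """Build the adjusted target distribution for one frame.

    correct: index of the labeled class for this frame (must be in `seen`).
    seen:    set of class indices that occur in the adaptation data.
    p_si:    posterior of the original speaker-independent model.
    """
    unseen_mass = sum(p for c, p in enumerate(p_si) if c not in seen)
    targets = []
    for c, p in enumerate(p_si):
        if c not in seen:
            targets.append(p)                  # keep the original posterior
        elif c == correct:
            targets.append(1.0 - unseen_mass)  # remaining probability mass
        else:
            targets.append(0.0)                # seen but incorrect label
    return targets

# Hypothetical frame: four classes, of which only 0 and 1 occur in adaptation data.
targets = conservative_targets(correct=0, seen={0, 1}, p_si=[0.5, 0.2, 0.2, 0.1])
```

For this toy frame the targets are approximately `[0.7, 0.0, 0.2, 0.1]`, so the unseen classes 2 and 3 continue to be trained towards their original posteriors rather than towards zero.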
To mitigate errors in semi-supervised transcriptions we can replace the transcriptions with a lattice of supervision, which encodes the uncertainty arising from the first pass decoding. Lattice supervision has previously been used in work on unsupervised adaptation [72] and training [73] of GMMs, as well as discriminative [178] and semi-supervised training [74], and adaptation [75], of neural network models. For instance, lattice supervision can be used with the MMI criterion:
$$F_{\text{MMI}} = \sum_r \log \frac{p(X_r \mid M_r^{\text{num}})}{p(X_r \mid M_r^{\text{den}})}$$
where $X_r$ denotes the acoustic observations of utterance $r$, $M_r^{\text{num}}$ is a numerator lattice containing multiple hypotheses from a first pass decoding and $M_r^{\text{den}}$ is a denominator lattice containing all possible sequences of words.
Another family of methods prevents overfitting to adaptation targets by performing adaptation through the use of a lower entropy task such as monophone or senone cluster targets. This has the advantage that the unsupervised targets might be less noisy and also that the targets have higher coverage even with small amounts of adaptation data. Price et al [179] proposed to append a new output layer predicting monophone targets on top of the original output layer predicting senones. The layer can be either full rank or sparse, leveraging knowledge of relationships between monophones and senones. Its parameters are trained on the training data with a fixed speaker independent model. Only the monophone targets are used for the adaptation of the speaker dependent parameters.
Huang et al [110] presented an approach that used multi-task learning [180] to leverage both senone and monophone/senone cluster targets. It worked by having multiple output layers, each on top of the last hidden layer, that predicted the corresponding targets. These additional output layers were also trained after a complete training pass of the speaker independent model with its parameters fixed. Thus, the adaptation loss was a weighted sum of individual losses, for example monophone and senone losses (Fig. 5). Swietojanski et al [181] combined these two approaches and used multi-task learning for speaker adaptation through a structured output layer, which predicts both monophone targets and senone targets. Unlike the approach by Price et al [179], the monophone predictions are used for the prediction of senones.
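The weighted sum of task losses used in multi-task adaptation can be written down directly. The sketch below computes a frame-level loss from the posteriors of a senone output layer and a monophone output layer; the convex weighting with a single `mono_weight`, the function names and the toy posteriors are assumptions for illustration.

```python
import math

def frame_nll(posterior, label):
    """Negative log-likelihood of the labeled class for one frame."""
    return -math.log(posterior[label])

def multitask_loss(senone_post, senone_label, mono_post, mono_label,
                   mono_weight=0.5):
    """Weighted sum of senone and monophone losses for one frame.

    mono_weight is a hypothetical interpolation weight: 0 uses only the
    senone loss, 1 uses only the monophone loss.
    """
    return ((1.0 - mono_weight) * frame_nll(senone_post, senone_label)
            + mono_weight * frame_nll(mono_post, mono_label))

# Toy posteriors from the two output layers for a single frame.
senone_post = [0.1, 0.6, 0.3]
mono_post = [0.8, 0.2]
loss = multitask_loss(senone_post, 1, mono_post, 0, mono_weight=0.3)
```

Because the monophone layer has far fewer classes than the senone layer, its targets are covered more fully by a small adaptation set, which is the motivation given in the text for including it in the loss.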
Li et al [112] and Meng et al [113] applied multi-task learning to speaker adaptation of CTC and AED models. These E2E models typically use subword units, such as word piece units, as the output target in order to achieve high recognition accuracy. The number of subword units is usually at the scale of thousands or even more. Given very limited speaker-specific adaptation data, these units may not be fully covered. Multi-task learning using both character and subword units can significantly alleviate such sparseness issues.

IX. DATA AUGMENTATION
Data augmentation has been proven to be an effective way to decrease the acoustic mismatch between training and testing conditions. Data augmentation approaches supplement the training data with distorted or synthetic variants of speech with characteristics resembling the target acoustic environment, for instance with reverberation or interfering sound sources. Thanks to realistic room acoustic simulators [182] one can generate large numbers of room impulse responses and reuse clean corpora to create multiple copies of the same sentence under different acoustic conditions [183]- [185].
Similar approaches have been proposed for increasing robustness in speaker space by augmenting training data with, typically label-preserving, speaker-related distortions or transforms. Examples include creating multiple copies of clean utterances with perturbed VTL warp factors [186], [187], augmenting related properties such as volume or speaking rate [21], [188], [189], or voice-conversion [190] inspired transformations of speech uttered by one speaker into another speaker using stochastic feature mapping [187], [191], [192].
While voice conversion does not create any new data with respect to unseen acoustic / linguistic complexity (just replicas of the utterances with different voices, often from the same dataset), recent advances in text-to-speech (TTS) allows the rapid building of new multi-speaker TTS voices [193] from small amounts of data.TTS may then be used to arbitrarily expand the adaptation set for a given speaker, possibly to cover unseen acoustic domains [111], [115].If TTS is coupled with a related natural language generation module, it is possible to generate speech for domain-related texts.In this way, the speaker adaptation uses more data, not only from the speaker's original speech but also from the TTS speech.Because the transcription used for TTS generation is also used for model adaptation, this approach also circumvents the obstacle of the hypothesis error in unsupervised adaptation.Moreover, TTS generated data can also help to adapt E2E models to a new domain which has more discrepancy in contents from the source domain, which will be discussed in Sec.XI.
Finally, for unbalanced data sets the acoustic models may under-perform for certain demographics that are not sufficiently represented in training data.There is an ongoing effort to address this using generative adversarial networks (GANs).For example, Hosseini-Asl et al [194] used GANs with a cycleconsistency constraint [195] to balance the speaker ratios with respect to gender representation in training set.

X. ACCENT ADAPTATION
Although there is significant literature on automatic dialect identification from speech (e.g. [196]), there has been less work on accent and dialect adaptive speech recognition systems. The MGB-3 [197] and MGB-5 [198] evaluation challenges have used dialectal Arabic test sets, with a modern standard Arabic (MSA) training set, using broadcast and internet video data. The best results reported on these challenges have used a straightforward model-based transfer learning approach in an LF-MMI framework, adapting MSA trained baseline systems to specific Arabic dialects [199], [200].
Much of the reported work on accent adaptation has taken approaches for speaker adaptation, and applied them using an adaptation set of utterances from the target accent. For instance, Vergyri et al [201] used MAP adaptation of a GMM/HMM system. Zheng et al [202] used both MAP and MLLR adaptation, together with features selected to be discriminative towards accent, with the accent adaptation controlled using hard decisions made by an accent classifier.
Earlier work on accent adaptation focused on automatic adaptation of the pronunciation dictionary [203], [204]. These approaches resemble approaches for acoustic adaptation of VQ codebooks (discussed in Section III), in that they learn an accent-specific transition matrix between the phonemic symbols in the dictionary. Selection of utterances for accent adaptation has been explored, with Nallasamy et al [205] proposing an active learning approach.
Approaches to accent adaptation of neural network-based systems have typically employed accent-dependent output layers and shared hidden layers [206], [207], based on a similar approach to the multilingual training of deep neural networks [208]-[210]. Huang et al [206] combined this with KL regularization (Sec. VII), and Chen et al [207] used accent-dependent i-vectors (Sec. V); Yi et al [211] used accent-dependent bottleneck features in place of i-vectors; and Turan et al [212] used x-vector accent embeddings in a semi-supervised setting.
Multi-task learning approaches, where the secondary task is accent/dialect identification, have been explored by a number of researchers [213]-[217] in the context of both hybrid and end-to-end models. Improvements with multi-task training were observed in some instances, but the evidence indicates that it gives a small adaptation gain. Sun et al [218] replaced multi-task learning with domain adversarial learning (Sec. VIII), in which the objective function treated accent identification as an adversarial task, finding that this improved accented speech recognition over multi-task learning.
More successfully, Li et al [219] explored learning multi-dialect sequence-to-sequence models using one-hot dialect information both as input. Grace et al [220] also used one-hot dialect codes and also explored a family of cluster adaptive training and hidden layer factorization approaches. In both cases using one-hot dialect codes as an input augmentation (corresponding to bias adaptation) proved to be the best approach, and cluster-adaptive approaches did not result in a consistent gain. These approaches were extended by Yoo et al [221] and Viglino et al [217] who both explored the use of dialect embeddings for multi-accent end-to-end speech recognition. Ghorbani et al [222] used accent-specific teacher-student learning, and Jain et al [223] explored a mixture of experts (MoE) approach, using mixtures of experts both at the phonetic and accent levels.
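The remark that a one-hot dialect code appended to the input corresponds to bias adaptation can be checked numerically. The following minimal sketch assumes a single affine layer; the names `affine`, `W_x`, `W_d` and all numerical values are hypothetical and chosen only for illustration.

```python
def affine(weights, bias, inputs):
    """Compute weights @ inputs + bias for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

# Hypothetical layer with 2 outputs, 3 acoustic inputs and 2 dialects.
W_x = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]   # weights on acoustic features
W_d = [[0.3, -0.2], [0.1, 0.4]]              # weights on the one-hot code
b = [0.1, -0.1]

x = [1.0, 2.0, 3.0]
dialect = 1
one_hot = [1.0 if k == dialect else 0.0 for k in range(2)]

# (a) Augmented input: concatenate the features and the one-hot code.
W_aug = [rx + rd for rx, rd in zip(W_x, W_d)]
augmented = affine(W_aug, b, x + one_hot)

# (b) Equivalent view: the code selects one column of W_d, which acts
# as a dialect-dependent offset added to the shared bias.
dialect_bias = [bi + row[dialect] for bi, row in zip(b, W_d)]
biased = affine(W_x, dialect_bias, x)
```

Both computations give the same layer output, so for the first layer the one-hot code only shifts the bias per dialect.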
Yoo et al [221] also applied a method of feature-wise affine transformations on the hidden layers (FiLM) that are dependent both on the network's internal state and the dialect/accent code (discussed in Sec. VI). This approach, which can be viewed as a conditioned normalization, differs from the previous use of one-hot dialect codes and multi-task learning in that it has the goal of learning a single normalized model rather than an implicit combination of specialist models. A related approach is gated accent adaptation [224], although this focused on a single transformation conditioned on accent.
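A feature-wise affine transformation scales and shifts each hidden unit with condition-dependent parameters. The sketch below is a simplified illustration: it looks up the scale and shift from the accent code alone, whereas the approach described above also conditions on the network's internal state. The parameter table and all values are hypothetical.

```python
def film(hidden, gamma, beta):
    """Apply a feature-wise affine transformation: gamma * h + beta per unit."""
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]

# Hypothetical (gamma, beta) pairs per accent code. In a trained system
# these would be predicted by a small conditioning network.
film_params = {
    "accent_a": ([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]),   # identity transform
    "accent_b": ([0.5, 2.0, 1.0], [0.1, -0.1, 0.0]),
}

hidden = [0.2, -0.4, 1.0]
unchanged = film(hidden, *film_params["accent_a"])
modulated = film(hidden, *film_params["accent_b"])
```

Because the same hidden layer is shared across accents and only the scale and shift change, the model remains a single network rather than a set of accent-specific specialists.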
Winata et al [225] experimented with a meta-learning approach for few-shot adaptation to accented speech, where the meta-learning algorithm learns a good initialization and hyperparameters for the adaptation.

XI. DOMAIN ADAPTATION
The performance of automatic speech recognition (ASR) typically drops significantly when the recognition model is evaluated in a mismatched new domain. Domain adaptation is the technology used to adapt the well-trained source domain model to the new domain. The most straightforward way is to collect and label data in the new domain to fine-tune the model. Most adaptation technologies discussed in this paper can also be applied to domain adaptation [148], [226]-[228]. In the following, we focus on technologies more specific to domain adaptation.
While conventional adaptation techniques require large amounts of labeled data in the target domain, the teacher-student (T/S) paradigm [229], [230] can better take advantage of large amounts of unlabeled data and has been widely used for industrial scale tasks [231], [232].
The most popular T/S learning strategy was proposed in 2014 by Li et al. [229] to minimize the KL divergence between the output posterior distributions of the teacher network and the student network. This can also be considered as learning soft targets generated by a teacher model instead of 1-hot hard targets:

$$-\sum_{t=1}^{T} \sum_{y=1}^{N} P_T(s_t = y \mid x_t) \log P_S(s_t = y \mid x_t)$$

where $P_T$ and $P_S$ are posteriors of teacher and student networks, $x_t$ and $s_t$ are the input speech and senone at time $t$, respectively, $T$ is the total number of speech frames in an utterance, and $N$ is the number of senones in the network output layer. Later, Hinton et al. [230] proposed knowledge distillation by introducing a temperature parameter (like chemical distillation) to scale the posteriors. This has been applied to speech by e.g. Asami et al. [233]. There are also variations such as learning the interpolation of soft and hard targets [230] and conditional T/S learning [234]. Although initially proposed for model compression, T/S learning is also widely used for model adaptation if source and target signals are frame-synchronized, which can be realized by simulation. The loss function is [7]

$$-\sum_{t=1}^{T} \sum_{y=1}^{N} P_T(s_t = y \mid x_t) \log P_S(s_t = y \mid \hat{x}_t)$$

where $x_t$ is the source speech signal while $\hat{x}_t$ is the frame-synchronized target signal.
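The soft-target objective can be sketched in a few lines. The example assumes that the teacher and student expose per-frame output logits (a hypothetical interface), with the teacher fed the source frame and the student fed the frame-synchronized target frame. The `temperature` argument illustrates the scaling used in knowledge distillation; the function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ts_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between teacher and student posteriors, summed over frames.

    teacher_logits[t] is assumed to come from the source frame, and
    student_logits[t] from the frame-synchronized target frame.
    """
    loss = 0.0
    for z_teacher, z_student in zip(teacher_logits, student_logits):
        p_t = softmax(z_teacher, temperature)
        p_s = softmax(z_student, temperature)
        loss -= sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
    return loss

teacher = [[2.0, 0.0, -1.0]]          # one frame, three senones
matched = ts_loss(teacher, teacher)   # student agrees with the teacher
mismatched = ts_loss(teacher, [[0.0, 0.0, 0.0]])  # uniform student
```

Since the teacher posteriors are fixed, minimizing this cross-entropy is equivalent to minimizing the KL divergence between the teacher and student distributions, and no transcription is needed.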
The biggest advantage of T/S learning is that it can leverage large amounts of unlabeled data by using soft labels $P_T(s_t = y \mid x_t)$. This is particularly useful in industrial setups where effectively unlimited unlabeled data is available [231], [232]. Furthermore, soft labels produced by the teacher network carry knowledge learned by the teacher on the difficulty of classifying each sample, while the hard labels do not contain such information. Such knowledge helps the student to generalize better, especially when adaptation data size is small.
One constraint to T/S adaptation is that it requires paired source and target domain data. While the paired data can be obtained with simulation in most cases, there are scenarios in which it is hard to simulate the target domain data from the source domain data. For example, simulation of children's speech or accented speech remains challenging. In [138], a neural label embedding scheme was proposed for domain adaptation with unpaired data. A label embedding, l-vector, represents the output distribution of the deep network trained in the source domain for each output token, e.g., senone. To adapt the deep network model to the target domain, the l-vectors learned from the source domain are used as the soft targets in the cross entropy criterion.
It is usually hard to obtain the transcription in the target domain, therefore unsupervised adaptation is critical. Although the transcription can be generated by decoding the target domain data using the source domain model, the generated hypothesis quality is often poor given the domain mismatch. Recently, adversarial training was applied to the area of unsupervised domain adaptation in a form of multi-task learning [235] without the need for transcription in the target domain. Unsupervised adaptation is achieved by learning deep intermediate representations that are both discriminative for the main task on the source domain and invariant with respect to mismatch between source and target domains. Domain invariance is achieved by adversarial training of the domain classification objective functions using a gradient reversal layer (GRL) [235]. This GRL approach has been applied to acoustic models for unsupervised adaptation in [236]-[238].
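The behaviour of a gradient reversal layer can be sketched without an automatic differentiation framework: the forward pass is the identity, and the backward pass negates (and optionally scales) the gradient flowing from the domain classifier to the feature extractor. The class below is an illustrative, framework-free sketch; the `scale` parameter is a hypothetical trade-off weight.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient of the domain loss is reversed, so the layers below
        # are updated to make domains harder to tell apart.
        return [-self.scale * g for g in grad_output]

grl = GradientReversal(scale=0.5)
features = grl.forward([1.0, -2.0])
reversed_grad = grl.backward([0.2, -0.4])
```

The domain classifier itself is trained with the normal gradient, so the two parts of the network play an adversarial game over the shared representation.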
There is also increasing interest in the use of GANs with cycle consistency constraints for domain adaptation [239]-[241]. This enables the use of non-parallel data without labels in the target domain by learning to map the acoustic features into the style of the target domain for training. The cycle-consistency constraint also provides the possibility of mapping features from the target to the source style for, in effect, test-time adaptation or speech enhancement.
Meng et al. [242] combine adversarial learning and T/S learning as adversarial T/S learning, shown in Fig. 6, to improve the robustness against condition variability during adaptation. When only the left side of the figure is kept, adversarial T/S learning is reduced to T/S learning. If the teacher network is removed and the main network consumes source domain data and its ground-truth labels, then adversarial T/S learning is reduced to adversarial learning.
E2E models tend to memorize the training data well, and therefore may not generalize well to a new domain. Meng et al [243] proposed T/S learning for the domain adaptation of E2E models. The loss function is defined over the source and target domain speech sequences, $X$ and $\hat{X}$, and a label sequence $U$ of length $L$, which is either the ground truth in the supervised adaptation setup or the hypothesis generated by the decoding of the teacher model with $X$ in the unsupervised adaptation setup. Note that in the unsupervised case, there are two levels of knowledge transfer: the teacher's token posteriors (used as soft labels) and one-best predictions as decoder guidance.
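By analogy with the frame-level T/S objective, a token-level loss of this kind can be written as follows. This is a sketch under assumed notation rather than the exact formulation of [243]: $u_l$ denotes the $l$-th token of $U$, $U_{1:l-1}$ its preceding tokens, and $y$ ranges over the output token vocabulary; these symbols are introduced here for illustration.

```latex
\mathcal{L}(X, \hat{X}, U) =
  -\sum_{l=1}^{L} \sum_{y}
   P_T(u_l = y \mid X, U_{1:l-1})
   \log P_S(u_l = y \mid \hat{X}, U_{1:l-1})
```

The teacher is conditioned on the source sequence and the student on the target sequence, while both are conditioned on the same label history, which is how the one-best hypothesis acts as decoder guidance.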
While most of the time domain adaptation focuses on adaptation to a new acoustic environment, there are scenarios in which the new domain differs from the source domain mainly in content. In such situations, adapting the language model (LM) is more effective. Because E2E models usually have a sub-network that plays the role of the LM in traditional hybrid systems, it is possible to adapt E2E models to a new domain using only domain-specific text data. In [120], [121], RNN-T models were adapted to a new domain with the TTS data generated from the domain-specific text. Because the prediction network in RNN-T works similarly to an LM, adapting it without updating the acoustic encoder is shown to be more effective than interpolating the RNN-T model with an external LM trained from the domain-specific text [121].

XII. LANGUAGE MODEL ADAPTATION
LM adaptation typically involves updating an LM estimated from a large general corpus, with data from a target domain. Many approaches to LM adaptation were developed in the context of n-gram models, and are reviewed by Bellegarda [244]. Hybrid NN/HMM speech recognition systems still make use of n-gram language models and a finite state structure, at least in the first pass; it is difficult to use neural network LMs (with infinite context) directly in first pass decoding in such systems. Neural network LMs are typically used to rescore lattices in hybrid systems, or may be combined (in a variety of ways) in end-to-end systems.
The main techniques for n-gram language model adaptation include interpolation of multiple language models [245]-[247], updating the model using a cache of recently observed (decoded) text [245], [248]-[250], or merging or interpolating n-gram counts from decoded transcripts [251]. There is also a large body of work incorporating longer scale context, for instance modelling the topic and style of the recorded speech [252]-[255]. LM adaptation approaches making use of wider context have often built on approaches using unigram statistics or bag-of-words models, and a number of approaches for combination with n-gram models have been proposed, for example dynamic marginals [256].
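Linear interpolation of a general and a domain-specific language model can be sketched as follows. For brevity the example uses toy unigram distributions; for n-gram models the same interpolation would be applied to the probabilities conditioned on each history. The interpolation weight `lam` and the vocabulary are hypothetical.

```python
def interpolate(p_general, p_domain, lam):
    """Return lam * p_domain + (1 - lam) * p_general over the joint vocabulary."""
    vocab = sorted(set(p_general) | set(p_domain))
    return {w: lam * p_domain.get(w, 0.0) + (1.0 - lam) * p_general.get(w, 0.0)
            for w in vocab}

# Toy distributions: the domain model assigns more mass to a domain term.
p_general = {"the": 0.6, "cat": 0.3, "stent": 0.1}
p_domain = {"the": 0.5, "cat": 0.1, "stent": 0.4}
adapted = interpolate(p_general, p_domain, lam=0.25)
```

In practice the weight is usually tuned on held-out data from the target domain, for example to minimize perplexity.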
Neural network language modelling [257] has become state-of-the-art, in particular recurrent neural network language models (RNNLMs) [258]. There has been a range of work on adaptation of RNNLMs, including the use of topic or genre information as auxiliary features [259], [260] or combined as marginal distributions [261], domain specific embeddings [262], and the use of curriculum learning and fine-tuning to take account of shifting contexts [263], [264]. Approaches based on acoustic model adaptation, such as LHUC [264] and LHN [260], have also been explored.
There have been a number of approaches to apply the ideas of cache language model adaptation to neural network language models [261], [265], [266], along with so-called dynamic evaluation approaches in which the recent context is used for fine tuning [261], [267].
E2E models are trained with paired speech and text data. The amount of text data in such a paired setup is much smaller than the amount of text data used in training a separate external LM. Therefore, it is popular to adjust E2E models by fusing the external LM trained with a large amount of text data. The simplest and most popular approach is shallow fusion [268], in which the external LM is interpolated log-linearly with the E2E model at inference time only.
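Log-linear interpolation at inference time can be sketched as adding a weighted LM log-probability to the E2E log-probability. For simplicity the example rescores complete hypotheses, whereas shallow fusion is commonly applied at each step of beam search. The hypotheses, probabilities, and the `lm_weight` value are invented for illustration.

```python
import math

def shallow_fusion_score(log_p_e2e, log_p_lm, lm_weight):
    """Log-linear combination of E2E and external LM scores."""
    return log_p_e2e + lm_weight * log_p_lm

def rescore(hypotheses, lm_weight):
    """Return the hypothesis with the highest fused score."""
    return max(hypotheses,
               key=lambda h: shallow_fusion_score(*hypotheses[h], lm_weight))

# Each hypothesis maps to (E2E log-probability, external LM log-probability).
hypotheses = {
    "recognize speech": (math.log(0.40), math.log(0.30)),
    "wreck a nice beach": (math.log(0.45), math.log(0.01)),
}
best_without_lm = rescore(hypotheses, lm_weight=0.0)
best_with_lm = rescore(hypotheses, lm_weight=0.3)
```

In this toy case the external LM overturns the E2E model's acoustically plausible but linguistically unlikely first choice.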
However, shallow fusion does not have a clear probabilistic interpretation. McDermott et al [269] proposed a density ratio approach based on Bayes' rule. An LM is built on text transcripts from the training set which has paired speech and text data, and a second LM is built on the target domain. When decoding on the target domain, the output of the E2E model is modified by the ratio of target/training LMs. While it is well grounded with Bayes' rule, the density ratio method requires the training of two separate LMs, from the training and target data respectively. Variani et al [270] proposed a hybrid autoregressive transducer (HAT) model to improve the RNN-T model. The HAT model builds a training set LM internally and the label distribution is derived by normalizing the score functions across all labels excluding blank. Therefore, it is mathematically justified to integrate the HAT model with an external or target LM using the density ratio formulation. Other domain adaptation methods for E2E models were discussed in Sec. XI.
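In the log domain, modifying the E2E output by the ratio of target and training LMs amounts to subtracting a weighted training-set LM score and adding a weighted target-domain LM score. The sketch below operates on sequence-level log-probabilities; the function name and the two scaling weights are hypothetical tuning parameters introduced for illustration.

```python
def density_ratio_score(log_p_e2e, log_p_source_lm, log_p_target_lm,
                        source_weight=0.3, target_weight=0.3):
    """E2E score corrected by the target/source LM ratio in the log domain."""
    return (log_p_e2e
            - source_weight * log_p_source_lm
            + target_weight * log_p_target_lm)

# A hypothesis that the target-domain LM finds more likely than the
# training-set LM is boosted relative to its raw E2E score.
raw = -2.0
boosted = density_ratio_score(raw, log_p_source_lm=-3.0, log_p_target_lm=-1.0,
                              source_weight=0.5, target_weight=0.5)
```

Setting the source weight to zero recovers the shallow fusion score, which makes the relationship between the two methods explicit.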

XIII. META ANALYSIS
In this section we present an aggregated review of published results in experiments applying adaptation algorithms to speech recognition. This differs from typical experimental reporting that focuses on one-to-one system comparisons typically using a small fixed set of systems and benchmark tasks and data. The proposed meta-analysis approach offers insights into the performance of adaptation algorithms that are difficult to capture from individual experiments.
We divide this section into four main parts. The first, Sec. XIII-A, explains the protocol and overall assumptions of the meta-analysis, followed by a top-level summary of findings in Sec. XIII-B, with a more detailed analysis in Sec. XIII-C. The final part, Sec. XIII-D, aims to quantify the adaptation performance across languages, speaking styles and datasets.

A. Protocol and Literature
The meta-analysis is based on 45 peer-reviewed studies selected such that they cover a wide range of systems, architectures, and adaptation tasks. Each study was required to compare adaptation results versus a baseline, enabling the configurations of interest to be compared quantitatively. Note that the meta-analysis spans several model architectures, languages, and domains; although most studies use word error rate (WER) as the evaluation metric, some studies used character error rate (CER) or phone error rate (PER). Since we are interested in the relative improvement brought by adaptation, we report Relative Error Rate Reductions (RERR).
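For concreteness, the relative error rate reduction for a single baseline/adapted system pair can be computed as below. This is the standard definition of a relative reduction; the function name and the example error rates are illustrative and not taken from any of the studies.

```python
def rerr(baseline_error, adapted_error):
    """Relative error rate reduction in percent (positive means improvement)."""
    return 100.0 * (baseline_error - adapted_error) / baseline_error

# A baseline WER of 20.0% reduced to 18.0% by adaptation.
improvement = rerr(20.0, 18.0)

# Adaptation that slightly hurts performance gives a negative RERR.
deterioration = rerr(10.0, 10.03)
```

Because the measure is relative, it can be aggregated across studies that report WER, CER, or PER on very different baselines.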
Overall, the meta-analysis is based on ASR systems trained on datasets of combined duration of over 30,000 hours, while baseline acoustic models were estimated from as little as 5 hours to over 10,000 hours of speech. Adaptation data varies from a few seconds per speaker to over 25,000 hours of acoustic material used for domain adaptation.

B. Overall findings
Fig. 7 (Top) presents the average adaptation gains for all considered systems, adaptation methods, and adaptation classes. The overall RERR is 9.96%. Since grouping data across attributes of interest may result in unbalanced (or very sparse) sample sizes, we also report additional statistics such as the number of samples, datasets and studies the given statistic is based on. As can be seen in the right part of Fig. 7 (Top), the results in this review were derived from 337 samples produced using 33 datasets reported in 45 studies. A single sample is defined as a 1:1 system comparison for which one can unambiguously state RERR. Likewise, a dataset refers to a particular training corpus configuration. Note that there may be some data-level overlap between different corpora originating from the same source (e.g. TED talks) and we make a distinction for acoustic condition (e.g. AMI close-talking and distant channels are counted as two different datasets when they are used to estimate separate acoustic models). A study refers to a single peer-reviewed publication. Depending on which property we want to measure, the analysis set can be split into smaller subsets, such as the ones shown in the lower part of Fig. 7. The majority of analyses in this review are reported for models adapted using a single method, with some additional groupings used to better capture additional details such as complementarity of adaptation methods or their performance in different operating regimes.
As mentioned in Sec. IV, adaptation methods were historically categorized based on the level they operated at in the speech processing pipeline. Fig. 8 (top) quantifies the ASR performance along this attribute, showing that model-based adaptation obtains the best average improvements of 11.8%, followed by embedding and feature levels at 7.2% and 5.0% RERR, respectively. This is not surprising, as model level adaptation allows large amounts of adaptation data to be leveraged by allowing the update of large portions of the model (including re-training the whole model). In more data-constrained regimes, such as utterance or speaker-level adaptation, where only a limited amount of adaptation data is typically available, differences are less pronounced and model-based speaker adaptation obtains 8.9% RERR while adapting to domains gives 15.5% RERR (cf. middle and bottom plots in Fig. 8). Embedding approaches stay at a similar level for speaker adaptation, improving to 9.2% RERR for domain adaptation (although based on only two studies). Feature-space domain adaptation was used in only one study, which reported a small deterioration of -0.3% RERR.
The results for different adaptation clusters, introduced in Sec. II, are shown in Fig. 9. Models benefit more when adapting to accent, from adult to child speech, to the domain, and to disordered speech conditions (such as those arising from speech motor disorders), as opposed to speaker or utterance adaptation. This is expected, since domain adaptation usually has more adaptation data, and the acoustic mismatch introduced by unseen domains is greater than the mismatch caused by unseen speakers, unless these are substantially mismatched to the training data, as is often the case for child or disordered speech recognition. But in the latter case the adaptation is typically not carried out at the speaker level, but at the domain level (i.e. tailoring the acoustic model to better handle dysarthric speech, not a single dysarthric speaker).
Fig. 10 aggregates the adaptation along the two main neural network-based ASR approaches, hybrid and E2E. It is interesting to observe that E2E systems benefit more from adaptation (12.8% RERR) than hybrid systems (9.2% RERR) in both the overall and speaker-based regimes. This reverses for domain adaptation, with E2E and hybrid improving by 12.2% and 14.9% RERR, respectively. This is somewhat expected, as hybrid systems benefit from strong inductive biases, such as access to pronunciation dictionaries and hand engineered modelling constraints, whereas E2E models must learn these from data. Given limited amounts of training data one may expect that E2E may struggle to learn these as well as hybrid models; as such, adaptation brings greater gains. These results suggest that adaptation for E2E is a promising direction for future investigations that remains under-investigated as of now: there are 10 studies in total on this topic in this meta-review.
Next we compare feed-forward (FF) and recurrent neural network (RNN) architectures in both hybrid and E2E models. Hybrid models can leverage either FF or RNN architectures while most E2E systems use some form of RNN. (Note that transformer-based E2E models [301] are built from FF (CNN) modules; however, due to their relative novelty in ASR there is only one accent adaptation study included in our meta-analysis [225].) Fig. 11 reports similar adaptation gains of 9.8% RERR for both FF and RNN architectures. RNNs seem to benefit more when adapting to speakers (9.2% vs 7.4% RERR for RNN and FF, respectively), and less when adapting to domain (10.4% vs 17.0% RERR for RNN and FF, respectively). When controlling for the system paradigm (E2E vs. Hybrid), RNNs mostly benefit through adapting E2E models (cf. Fig. 12: 6.6% vs 15.7% RERR for Hybrid (RNN) and E2E (RNN), respectively). We observed a similar trend for speaker and domain clusters separately (figure not shown).
Fig. 13 compares the RERR for unsupervised and supervised modes of adaptation. Overall, deriving the adaptation transform with manually annotated targets results in an average 12.8% RERR, whereas unsupervised methods result in 8% RERR. Fig. 13 shows results specifically for semi-supervised adaptation, which are captured by the 2pass and enrol (Unsup.) conditions. Fig. 14 also shows further analysis on the modes of deriving adaptation statistics (Sec. II). Both online and two-pass adaptation are unsupervised, while enrolment mode may be either supervised or unsupervised. The supervised approach offers the most accurate adaptation, as expected. Unsupervised enrolment outperforms the other two unsupervised methods mainly due to the T/S domain adaptation study [243] (Sec. XI) that leverages large amounts of data. When considering speaker adaptation only, the two-pass approach obtains 8.2% RERR and is more effective than enrol (Unsup.) (7.3% RERR) and online adaptation (6.5% RERR).
Finally, we consider the overall trends for the considered systems and their operating regions. Model-based adaptation is more powerful and gives better results than embedding or feature-based approaches, and adaptation is particularly effective in scenarios with a large mismatch where obtaining matched training data is difficult. Since this meta-analysis combines results across many different studies with many reference systems, the results are not necessarily to be compared at the sample level, but rather in an aggregated form to outline dominant trends and the typical data regimes each category was tried in. For the purpose of plotting, data amounts for some systems were assumed to be approximately at a given level: e.g. two-pass systems, unless shown otherwise, were assumed to use 10 minutes per speaker, while embedding approaches were assumed to use 30 seconds.

C. Detailed findings
In this subsection we investigate the effect of the specific approach to adaptation, beyond the broad categories discussed above. Fig. 17 reorganizes the earlier split into feature, embedding, and model-level adaptation (Fig. 8) into embedding (cf. Sec. V) and model-based transformations (cf. Sec. VI).
The second group in Fig. 17 comprises model-based approaches split into Linear Transform (LT), Activation, and Finetuning-based methods. LT methods introduce new speaker dependent affine transformations in the model, either in the form of new LIN/LHN/LON layers (i.e. [141], [149], [151], [274]) or transforms estimated using a GMM system such as fMLLR [8], [107], [141], [303]. Finetune refers to approaches which assume that the adaptation is carried out by altering a subset of the existing model parameters. This is often done in a similar manner to an LT approach by adapting an input, output and/or one or more hidden layers that are already present in the model [108], [147], [206], [207]. Finally, activation methods perform adaptation by introducing speaker-dependent parameters in the activation functions of the neural network [107], [156], [304]-[306]. Note that, as outlined in Sec. VI, some of the activation-based methods can be expressed as constrained LT methods. The results obtained by LT, Activation and Finetune-based methods score 6.7%, 9.0% and 13.9% average RERR, respectively.
Fig. 18 (a) [...] in a speaker adaptive manner, whereas the majority of model-based techniques are carried out in a test-only manner, meaning that speaker-level information is not used during training, though some methods offer SAT variants [161], [307]. Fig. 19 shows that SAT trained systems offer a small advantage (8% vs. 7.6% RERR) when adapted with limited amounts of data (up to around 10 minutes). When looking at the average performance across all data-points, however, test-only approaches obtain 10.8% RERR, primarily because of greater adaptation gains for larger amounts of data. See also Fig. 18 (b) for operating regions of SAT and non-SAT systems.
Fig. 20 quantifies gains for different adaptation objectives and regularization approaches. Results for the online condition are given only for reference, as in this case adaptation information is obtained via an embedding extractor (which is usually not updated, although not always [211]). The second group depicts approaches where the adaptation information is derived by adapting a GMM in model-space using an MLE or MAP criterion when extracting speaker-adapted auxiliary features for NN training [144], [308] or by estimating fMLLR transforms with MLE under a GMM to obtain speaker adapted acoustic features [8], [141], [303].
The third group comprises methods which aim to explicitly match the model's output distribution to the one found in adaptation data. CE is a non-regularized frame-level cross-entropy baseline obtaining 8.7% average RERR. This can be improved to 14.8% average RERR by penalizing the adapted model's predictions such that they do not deviate too much from the speaker independent variant by KL regularization (CE-KL) [109]. KL regularization can be applied to either CE or sequential objective functions [148], although most models estimated in a sequential discriminative manner can be successfully adapted with a CE (or CE-KL) criterion [71], [162], [189], [279] (see also Fig. 21). Teacher-student (T/S) [229] is a special case (see Sec. XI) where the adaptation is carried out with the targets directly produced by a teacher model, rather than the ones obtained from first pass decodes (possibly KL-regularized with the SI model). T/S allows large amounts of unsupervised data to be leveraged, and in this analysis was found to offer an average 28.2% RERR when adapting to domains [222], [231], [243].
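A common way to implement KL regularization with a cross-entropy criterion is to train against targets that interpolate the hard label with the speaker independent (SI) model's posterior. The sketch below illustrates this for a single frame; the regularization weight `rho`, the function names, and the toy posteriors are assumptions made for illustration.

```python
import math

def kl_regularized_targets(hard_label, si_posterior, rho):
    """Interpolate the one-hot target with the SI model posterior.

    rho=0.0 gives plain cross-entropy adaptation; rho=1.0 keeps the
    adapted model tied to the SI model's predictions.
    """
    return [(1.0 - rho) * (1.0 if k == hard_label else 0.0) + rho * p
            for k, p in enumerate(si_posterior)]

def cross_entropy(targets, adapted_posterior):
    """Cross-entropy of the adapted posterior against soft targets."""
    return -sum(t * math.log(q)
                for t, q in zip(targets, adapted_posterior) if t > 0.0)

si_posterior = [0.6, 0.3, 0.1]        # SI model output for one frame
adapted_posterior = [0.2, 0.7, 0.1]   # adapted model output for the same frame
targets = kl_regularized_targets(hard_label=1, si_posterior=si_posterior, rho=0.5)
loss = cross_entropy(targets, adapted_posterior)
```

The interpolated targets limit how far the adapted model can drift when the adaptation labels are few or come from an errorful first pass decode.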
Fig. 21 further summarizes the adaptability of acoustic models trained in a frame-based (CE) or a sequential (Seq) manner. The results indicate that sequential models benefit more from adaptation when compared to frame-based systems (11.6% vs. 9.8% average RERR). However, when controlling for the same dataset and baseline (reference systems were expected to exist for both CE and Seq) the difference decreases to around 0.6% RERR in favor of the frame-based systems.
Fig. 22 compares the adaptation gains obtained using various model architectures. LSTM benefits the most (15.4% average RERR). The feed-forward TDNN, DNN, and ResNet architectures all improve by around 10.5% RERR. Smaller gains were observed for Transformer, CNN and BLSTM, improving by 7.6%, 6.5% and 4.9% average RERR, respectively. This result is somewhat expected as the last three architectures either normalize some of the variability by design, or have access to a larger speech context during recognition.
[Fig. 23: Relative Error Rate Reduction for combined adaptation methods: BSV + ivectors [126]; DNNEmb + LHUC [70]; DNNEmb + fMLLR [70]; DiffPooling + fMLLR [278]; LHUC + fMLLR [71]; f-LHUC + LHUC [140]; f-LHUC + ivectors [140]; ivectors + LHUC [162]; ivectors + fMLLR/VTLN [8]; pRELU + fMLLR [280]]
Fig. 23 is based on 6 studies for which there was a complete set of baseline experiments, allowing improvements to be quantified when adapting an SI model with Method1, and then measuring further gains when adding Method2. Fig. 23 shows that, on average, stacking adaptation techniques improved the adaptation performance by an additional 4%, from 8% to 12% RERR.
Finally, in Fig. 24 we report results for all techniques included in the meta-analysis. These are based on samples where only a single method was used to adapt the acoustic model (cf. Fig. 7 (middle)), spanning results for all adaptation clusters (cf. Fig. 9). These should not be directly compared owing to differences in operating regions, but they offer an indication of the performance of the individual methods.

D. Speech styles, applications, languages
In this subsection, we analyze the efficacy of adaptation methods across acoustic and linguistic dimensions by reporting adaptation gains for different types of speech styles, applications (including ones with a large mismatch to the training conditions), and languages.
Fig. 25 compares gains as obtained for different speech styles. At the top we report three special cases spanning disordered, children's, and accented speech (these are similar to the adaptation clusters from Fig. 9). As expected, acoustic models estimated largely from adult speech of healthy individuals perform poorly in these highly mismatched domains, especially for disordered and children's speech, and domain adaptation improves ASR by over 50% average RERR.
Performance gains from adapting models with accented speech are similar to those obtained on other speech tasks. Note that the presence of non-native speakers in (English) training corpora is fairly common, so the underlying acoustic models may learn to better normalize this variability at the training stage. Interestingly, adaptation brings relatively larger gains in commercial applications such as VoiceSearch and Dictation tasks (14% RERR on average). This is also visible in Fig. 26, which compares performance on public and proprietary data. We hypothesize that commercial data is more likely to contain a mix of speech from a diverse set of speakers (including non-native and children's speech) and thus benefits more from adaptation. Another explanation could be that the public benchmarks have been around for some time, and systems built on these are likely to be more over-fitted in general.
Finally, Fig. 27 summarizes adaptation performance for several languages. Note that speaker adaptation was performed on English, French, Japanese, and Mandarin, while for Korean and Italian we only report adaptation gains for disordered and children's speech recognition. The overall improvements for non-English languages when adapting to speakers are similar to gains obtained for English when controlling for the adaptation method (i.e., improvements are between 6 and 10% average RERR), giving some evidence that adaptation helps to a similar degree for different languages, and that some of these primarily English-based findings generalize across languages.

In this overview article we have surveyed approaches to the adaptation of neural network-based speech recognition systems. We structured the field into embedding-based, model-based, and data augmentation adaptation approaches, arguing that this organization gives a more coherent understanding of the field compared with the usual split into feature-based and model-based approaches. We presented these adaptation algorithms in the context of speaker adaptation, with a discussion on their application to accent and domain adaptation.
A key aspect of this overview was a meta-analysis of recently published results for the adaptation of speech recognition systems. The meta-analysis indicates that adaptation algorithms apply successfully to both hybrid and E2E systems, across different corpora and adaptation classes.
E2E modeling is less mature than the hybrid approach, and much of the research focus on E2E modeling is to improve the general modeling technology. Therefore, in this overview paper, many more adaptation methods were introduced in the context of hybrid systems. However, most adaptation technologies successfully applied to hybrid models by adapting the acoustic model or language model should also work well for E2E models, because E2E models usually contain subnetworks corresponding to the acoustic model and language model in hybrid models; this is supported by findings in our meta-analysis. Unlike hybrid models, in which components are optimized separately, E2E models are optimized using a single objective function. Therefore, E2E models tend to memorize the training data more, and hence generalization or robustness to unseen data [185] is challenging for E2E models. Consequently, adaptation to a new environment or a new domain is very important to the large-scale application of E2E models. We would expect more research in this direction as E2E modeling becomes increasingly mainstream in ASR.
Because the size of E2E models is much smaller than that of hybrid models, E2E models have clear advantages when deployed on device. Therefore, personalization or adaptation of E2E models [114], [115], [120], [121] is a rapidly growing area. While it is possible to adapt every user's model in the cloud and then push it back to the device, it is more reasonable to adapt the model on device, which requires adjusting the adaptation algorithm to cope with limited memory and computation power [114]. Another interesting direction for the adaptation of E2E models is how to leverage unpaired data, especially text-only data, in a new domain. In [121], several methods have been explored in this direction, but we expect more innovations there.
Adaptation algorithms are often deployed for conditions in which there is very limited labeled data, or none at all. In this case unsupervised and semi-supervised learning approaches are central, and indeed many current adaptation approaches strongly leverage such algorithms. However, there are significant open research challenges in this area, particularly relating to unsupervised and semi-supervised training of E2E systems, using methods which are able to propagate uncertainty. Current approaches often do this indirectly (e.g., through T/S training), but more direct modeling of uncertainty would be desirable.
Domain adaptation has become central to work in computer vision and image processing, as discussed in Sec. I, with large-scale base models (typically trained on ImageNet) being adapted to specific tasks. The closest analogies to this in speech recognition are some of the domain recognition approaches discussed in Sec. XI and approaches for multilingual speech recognition. The idea of shared multilingual representations and language-specific or language-adaptive output layers was proposed in 2013 [208]- [210] and has become a standard architectural pattern. More recently, several authors have proposed highly multilingual E2E systems with a shared multilingual output layer [309]- [312], with the potential to be adapted to new languages.
State-of-the-art NLP systems are characterized by an unsupervised, large-scale base model [45], [301] which may then be adapted to specific domains and tasks [46]. An analogous approach for speech recognition would be based on the unsupervised learning of speech representations, from diverse and potentially multilingual speech recordings. Initial work in this direction includes the unsupervised learning from large-scale multilingual speech data [313], [314]. More generally, deep probabilistic generative modeling has become a highly active research area, in particular through approaches such as normalizing flows [53], [54], [56], [57]. Such deep generative models offer different ways of addressing the problem of adaptation including powerful approaches to data augmentation, and the development of rich adaptation algorithms building on a base model with a joint distribution over acoustics and symbols. This offers the possibility of finetuning general encoders to specific acoustic domains, and adapting the decoder to specific tasks (such as speech recognition, speaker identification, language recognition, or emotion recognition), noting that classic adaptation to speakers can bring further gains [315], [316].

Fig. 1: NN architectures used for hybrid NN/HMM and end-to-end (CTC, RNN-T, AED) speech recognition systems: (a) Scheme of NN architecture used for NN/HMM hybrid systems and for connectionist temporal classification (CTC); (b) architecture for the RNN Transducer (RNN-T); (c) architecture for attention based encoder-decoder (AED) end-to-end systems.

Fig. 2: (a) Bottleneck feature extraction that uses a pretrained speaker classifier. (b) Summary network extracting speaker embeddings which is trained jointly with the acoustic model.

Fig. 3: Structured transforms of an adaptation matrix $A_s$: (a) Learning Hidden Unit Contributions (LHUC) adapts only diagonal elements of the transformation matrix, $r_s = \mathrm{diag}(A_s)$; (b) Low-Rank Plus Diagonal factorizes the adaptation matrix as $A_s \approx D_s + P_s Q_s$; (c) Extended LRPD factorizes the adaptation matrix as $A_s \approx D_s + P T_s Q$.
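To make the trade-off between the structured transforms in the Fig. 3 caption concrete, the sketch below counts the speaker-dependent parameters each one would need for a single layer. The shapes are assumptions introduced for illustration and are not stated in the caption: the adaptation matrix is taken to be n x n, the low-rank factors n x k and k x n, the speaker-dependent core in Extended LRPD k x k, and its outer factors shared across speakers. The function name and the values of n and k are hypothetical.

```python
def per_speaker_params(n: int, k: int) -> dict:
    """Speaker-dependent parameter counts for one n x n adaptation matrix.

    Assumed shapes (illustrative only): diagonal term with n entries,
    low-rank factors of size n x k and k x n, and a k x k core for
    Extended LRPD whose outer factors are shared across speakers.
    """
    if n <= 0 or k <= 0:
        raise ValueError("n and k must be positive")
    return {
        "full matrix": n * n,          # every element is speaker-specific
        "LHUC": n,                     # diagonal elements only
        "LRPD": n + 2 * n * k,         # diagonal + two low-rank factors
        "Extended LRPD": n + k * k,    # diagonal + small core matrix
    }


# Hypothetical layer width and rank, for illustration only.
counts = per_speaker_params(n=2048, k=100)
for name, count in counts.items():
    print(f"{name}: {count} parameters per speaker")
```

Under these assumptions the per-speaker footprint shrinks from quadratic in the layer width (full matrix) to linear (LHUC), with the two low-rank variants in between.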

Fig. 11: Comparison of adaptation results for FF and RNN architectures.

Fig. 24: Comparison of adaptation results for the standalone techniques.