Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview

The polarization of world languages is becoming increasingly apparent. Many languages, mainly endangered ones, are low-resource because little data about them is available, so both language preservation and cultural heritage face serious challenges. Speech recognition for low-resource scenarios has therefore become a hot topic in the speech field. With its complex network structures and large numbers of model parameters, deep learning has become a powerful tool for speech recognition and has broad, far-reaching significance for the study of low-resource speech recognition. Focusing on the characteristics of low-resource conditions, this article reviews the history and research status of two kinds of deep learning acoustic models: neural network structures and end-to-end structures. We further elaborate on several key techniques for improving performance from the two aspects of data and model training. Two projects for low-resource languages are introduced, and possible future developments are pointed out. These works provide a reference for computer speech and language processing.


I. INTRODUCTION
Speech is the simplest and most natural mode of communication in human interaction, able to convey effective information quickly and accurately. Nowadays, people are devoted to studying how to communicate with various smart devices through the medium of speech. There are already many voice assistants, such as Apple's Siri and Google's Assistant, that understand human voice information through interactive real-time intelligent dialogue and carry out operations automatically according to its content. Automatic Speech Recognition (ASR) is therefore the key technology throughout human-computer interaction. The purpose of ASR is to transform speech signals into textual information, providing a strong foundation for further semantic understanding. ASR is an interdisciplinary and comprehensive technology, drawing on computer technology, acoustics, digital signal processing, statistics, linguistics, artificial intelligence, and so on. Thanks to the rapid development of these related disciplines, the performance of ASR systems has greatly improved, and they are widely used in various scenarios, including the military, medical, and service industries, which greatly saves human resources and improves work efficiency.
The rapid development of ASR relies heavily on the support of large amounts of speech data and annotated text. Recognition of majority languages such as Chinese and English has reached very mature performance, but it is difficult to apply to many dialects and minority languages. Many of these languages lack resources in terms of transcribed speech data, pronunciation dictionaries, linguistic knowledge, text annotation, and more; they are defined as low-resource languages [1]. It is estimated that there are more than 7,000 languages in the world, of which at least 40% are endangered, and about half have no written form [2]. From the perspective of natural ecology, the language pattern of the contemporary world is polarized between the major languages and the endangered languages.
As a carrier and important part of culture, the extinction of a language is an irreparable loss to the world's rich language resources and greatly damages cultural diversity. Faced with the impact of majority languages, globalization, and the internet, the international linguistic community has gradually attached importance to the theoretical creation and technological development of language protection. It has adopted high-tech digital means to collect all kinds of endangered languages in the world and has established audio databases of digital languages [3]. However, processing the speech collected when constructing such corpora is extremely difficult. All the speech in the audio and video materials obtained through field investigations must be manually annotated to make the corpora more widely usable. Manual annotation requires a great deal of manpower, and there is a shortage of native speakers and professionals who can perform corpus annotation. As a result, a large amount of original audio and video material for many languages is piled up and cannot be processed. Efforts to preserve linguistic diversity and sustainability remain to be explored.
Although ordinary ASR systems are far from ideal for low-resource languages, which cannot provide large amounts of training data, speech technology is still the most direct and effective way to conduct research, as the case of unwritten languages demonstrates. More and more scholars have begun in-depth research on, and continuous improvement of, speech recognition technology under conditions of limited language resources. This work is called low-resource speech recognition, and it has become one of the hot issues and important challenges in the field of ASR. Figure 1 is a statistical graph of the number of articles published in the past decade matching the keywords Low Resource and Speech Recognition on Web of Science, Engineering Village, and Scopus. On the whole, the number of papers is increasing year by year, which reflects international attention to low-resource speech recognition research. Low-resource speech recognition has important research significance, playing an irreplaceable role in the protection and promotion of linguistic diversity, cultural inheritance, and human communication around the world.
The rest of this article is organized as follows. Section 2 briefly reviews the principle and history of ASR. Section 3 introduces the application and improvement of acoustic models based on neural networks and end-to-end structures in low-resource scenarios. Section 4 discusses how to improve the performance of low-resource speech recognition from both the data and model perspectives. Section 5 introduces two projects for low-resource languages. Possible future directions for low-resource speech recognition are given in Section 6.

II. BACKGROUND OF AUTOMATIC SPEECH RECOGNITION
The traditional architecture of a speech recognition system is shown in Figure 2; it consists of four parts: feature extraction, acoustic model, language model, and decoder. Signal preprocessing and feature extraction take the speech signal as input, enhance speech quality by eliminating noise and channel distortion, and transform the signal from the time domain to the frequency domain to extract feature vectors. After the acoustic model and language model, the best word sequence $W^*$ corresponding to the speech is output using the maximum a posteriori probability criterion and a related decoding algorithm. $W^*$ is computed with the Bayesian formula shown in (1):

$$W^* = \arg\max_{W} P(W \mid O) = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)} \tag{1}$$

where $W$ is the text label sequence, $O$ is the feature vector sequence, $P(W)$ is the probability given by the language model, and $P(O \mid W)$ is the probability given by the acoustic model. In other words, given a text label sequence $W$, the probability of the acoustic feature sequence $O$ measures how well the speech feature sequence $O$ matches the text label sequence $W$. $P(W \mid O)$ is the posterior probability of $W$, and the text sequence that maximizes this posterior is taken as the recognition output. For a given utterance the feature sequence is fixed, so $P(O)$ does not affect the recognition result and can be ignored. The acoustic model plays the most important role in ASR and is the focus of this article. The most popular ASR systems in recent years usually extract speech features such as spectrograms, Linear Prediction Coefficients (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Perceptual Linear Prediction (PLP), together with feature transformation methods including Linear Discriminant Analysis (LDA), Maximum Likelihood Linear Transformation (MLLT), i-vectors, feature-space Maximum Likelihood Linear Regression (fMLLR), and others. The breakthrough period of acoustic modeling was in the 1980s.
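The decision rule above can be illustrated with a toy example. The candidate sequences and log-probabilities below are entirely hypothetical; the point is only that the decoder ranks hypotheses by the sum of acoustic and language model log-scores, while P(O) is constant across candidates and can be dropped:

```python
# Toy illustration (hypothetical scores): pick the label sequence W that
# maximizes P(O|W) * P(W), working in log-space for numerical stability.
# P(O) is the same for every candidate, so it is dropped from the argmax.
candidates = {
    "recognize speech": {"log_p_o_given_w": -12.0, "log_p_w": -3.2},
    "wreck a nice beach": {"log_p_o_given_w": -11.5, "log_p_w": -7.9},
}

def decode(cands):
    # argmax_W [ log P(O|W) + log P(W) ]
    return max(cands, key=lambda w: cands[w]["log_p_o_given_w"] + cands[w]["log_p_w"])

best = decode(candidates)
```

Here the second candidate has the better acoustic score, but its poor language model score lets the first candidate win, which is exactly the interplay the Bayesian formula captures.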
Methods based on statistical models, represented by the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) approach [4], gradually became dominant in speech recognition research. A series of related techniques based on the HMM were also derived, such as Maximum Likelihood Linear Regression (MLLR) [5] and the maximum a posteriori criterion to address parameter adaptation during HMM training. Furthermore, the idea of merging states was used to achieve decision-tree state tying when there are many training parameters but little training data [6]. Artificial Neural Networks (ANNs) later provided a new research direction for speech recognition. In 2006, Hinton et al. [7] used the Restricted Boltzmann Machine (RBM) to initialize the nodes of a neural network, and the Deep Belief Network (DBN) came into being. The network used an unsupervised greedy layer-by-layer method to initialize its weights so as to preserve as much information about the modeled object as possible, and then continued fitting to refine them. Since then, the combination of deep learning and traditional methods has occupied the mainstream, and the Deep Neural Network (DNN) began to surpass the GMM for HMM state modeling, replacing the traditional GMM approach. The first breakthrough was the DNN-HMM acoustic model, which greatly promoted the application of deep learning in speech recognition [8] and demonstrated the power of deep learning.
Most importantly, the powerful feature extraction ability of the Convolutional Neural Network (CNN) can better capture complex speech features, while the Recurrent Neural Network (RNN), which is suited to sequence modeling, can exploit time-series relationships to build context-dependent models. In particular, the latest end-to-end speech recognition systems overcome the forced-alignment problem of traditional HMMs and realize overall optimization at the sentence-sequence level. The two most mainstream end-to-end models are Connectionist Temporal Classification (CTC) and the attention-based encoder-decoder model.
Deep learning uses the multi-layer nonlinear structure to transform the low-level features into more abstract high-level features, and transforms the input features with or without supervision, thereby improving the accuracy of classification or prediction [9]. Deep learning models generally refer to deeper structural models, which have more layers of nonlinear transformations than traditional shallow models. They are more powerful in expression and modeling [10]. They also have advantages in complex signal processing such as non-stationary and random speech signals.

III. LOW-RESOURCE SPEECH RECOGNITION ACOUSTIC MODELS

A. ACOUSTIC MODELS WITH NEURAL NETWORKS
In current speech recognition systems based on neural networks, the common hybrid DNN acoustic model has been gradually replaced by the more accurate RNN or CNN, which are the two best options for effectively using variable-length contextual information [11]. This section introduces applications and improvements of these two structures in low-resource speech recognition.

1) RECURRENT NEURAL NETWORK
RNN has the ability to remember and strong modeling capability for time-series data, which solves the DNN-HMM problem of not modeling the dynamic characteristics of speech, and it has become the most widely used neural network structure in the field of ASR. In practice, if the memory window of the basic RNN is too long, training becomes unstable, gradients vanish or explode, and long-term dependencies are difficult to handle. Therefore, the Long Short-Term Memory (LSTM) structure [12] is now commonly used to replace the traditional RNN. LSTM is a complex and delicate network unit with a memory function that can store information for a long time. The LSTM structure, which can selectively remember historical information, contains three types of gates: the input gate, forget gate, and output gate. The input gate decides when to let the input into the cell unit, the forget gate decides how much of the previous moment's memory to retain, and the output gate decides when to let the memory flow to the next moment. LSTM is calculated at time step $t$ according to the following equations:

$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \phi(W_{cx} x_t + W_{ch} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$$
$$h_t = o_t \odot \phi(c_t)$$
$$y_t = W_{yh} h_t + b_y$$

where $i_t$, $f_t$, $c_t$, $o_t$, $h_t$, $x_t$, and $y_t$ respectively denote the input gate, forget gate, memory cell, output gate, hidden state, input, and output at time step $t$; $W$ denotes the weight matrix of each part (for example, $W_{ix}$ is the weight matrix between the input gate and the input layer); $b$ denotes the bias vector; $\sigma$ denotes the sigmoid function and $\phi$ the neuron activation function. Methods based on RNN have shown excellent performance due to their powerful modeling capability and deeper architectures. For example, a deep acoustic model with five layers of LSTM proposed by Google achieved impressive improvements on a large-vocabulary speech recognition task [13]. It is well known that the depth of neural networks is critical for acoustic modeling.
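As a rough illustration of the LSTM gate computation described above, the following NumPy sketch runs a single LSTM cell over a few frames of random input. All parameters are random stand-ins, not a trained model, and the toy sizes are chosen only for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical random parameters for one LSTM cell. Names mirror the prose:
# the first matrix multiplies the input x_t, the second the previous hidden
# state h_{t-1}, and the vector is the bias.
W = {g: (rng.standard_normal((n_hid, n_in)) * 0.1,
         rng.standard_normal((n_hid, n_hid)) * 0.1,
         np.zeros(n_hid))
     for g in ("i", "f", "o", "c")}

def lstm_step(x_t, h_prev, c_prev):
    def gate(g, act):
        Wx, Wh, b = W[g]
        return act(Wx @ x_t + Wh @ h_prev + b)
    i_t = gate("i", sigmoid)                # input gate
    f_t = gate("f", sigmoid)                # forget gate
    o_t = gate("o", sigmoid)                # output gate
    c_tilde = gate("c", np.tanh)            # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde      # memory cell update
    h_t = o_t * np.tanh(c_t)                # hidden state
    return h_t, c_t

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):  # run 5 time steps
    h, c = lstm_step(x_t, h, c)
```

Because the output gate is a sigmoid and the cell state passes through tanh, every component of the hidden state stays strictly inside (-1, 1), which is part of what keeps deep stacks of such cells trainable.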
However, stacking multiple LSTM layers in a low-resource scenario makes the model more difficult to train, because performance tends to saturate and then decline as depth increases. The improvement of ASR performance by an LSTM structure with residual learning was studied in [14], which introduced cross-layer shortcut connections into multilayer LSTMs instead of simply stacking layers. These shortcut connections represent feature mappings between shallow and deep layers; they not only ensure that information flows forward across several layers, but also let the error pass backward across several layers without attenuation. Zhou et al. [15] further proved the effectiveness of Shared Hidden Layer (SHL) LSTMs with residual learning for multilingual low-resource speech recognition, which alleviated the degradation problem without adding extra parameters or computational complexity.
In fact, to make full use of subsequent context information, most speech recognition systems adopt the BLSTM (Bidirectional LSTM) structure, which is composed of two unidirectional LSTMs, one processing the sequence forward and the other backward. Its output is determined jointly by the states of these two LSTMs and provides the output layer with complete past and future context. The calculation process is as follows:

$$\overrightarrow{h}_t = \mathcal{H}(x_t, \overrightarrow{h}_{t-1})$$
$$\overleftarrow{h}_t = \mathcal{H}(x_t, \overleftarrow{h}_{t+1})$$
$$y_t = W_{\overrightarrow{h}y}\,\overrightarrow{h}_t + W_{\overleftarrow{h}y}\,\overleftarrow{h}_t + b_y$$

where $\mathcal{H}$ denotes the LSTM cell computation. BLSTM processes the data in both directions with two separate parameter sets, the forward parameters and the backward parameters.
Graves [16] first used BLSTM for acoustic modeling in speech recognition and achieved the best recognition performance of the time on the TIMIT corpus. Subsequently, many researchers have studied low-resource acoustic modeling with BLSTM [17]-[20]. Although BLSTM has achieved good results, the structure has a large number of parameters and some complex training mechanisms. To address this, several layers of Bidirectional Gated Recurrent Units (BGRU) can be added to the model; they are used in combination with BLSTM and usually do not replace it completely [21]. The GRU [22] can be understood as a simplified version of the LSTM: it retains the long-term memory function of the LSTM but replaces the input, forget, and output gates with an update gate and a reset gate, and it merges the cell state and the output into a single vector. With fewer parameters, GRU shows better expressiveness in low-resource scenarios. Kang et al. [23] proposed a local BGRU with residual learning in which all time-dependency relationships are considered within a fixed local window.
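The parameter saving of the GRU comes from having three weight blocks (update, reset, candidate) instead of the LSTM's four, with no separate cell state. A minimal NumPy sketch of one GRU step, using random placeholder parameters, might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical random parameters: GRU needs 3 weight blocks (update "z",
# reset "r", candidate "h") versus the LSTM's 4 gates.
P = {g: (rng.standard_normal((n_hid, n_in)) * 0.1,
         rng.standard_normal((n_hid, n_hid)) * 0.1)
     for g in ("z", "r", "h")}

def gru_step(x_t, h_prev):
    Wz, Uz = P["z"]; Wr, Ur = P["r"]; Wh, Uh = P["h"]
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev))   # candidate state
    # Convex combination of old state and candidate; no separate cell state.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

h = np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):
    h = gru_step(x_t, h)
```

The update gate interpolates between the previous state and the candidate, which is how the GRU keeps long-term memory despite having one gate fewer than the LSTM.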

2) CONVOLUTIONAL NEURAL NETWORK
The spectral characteristics of a speech signal can be regarded as an image with two dimensions, time and frequency, so the extensive application of CNN [24] in the image field also provides ideas for speech processing. For example, pronunciation differs greatly between speakers, and the frequency bands of the formants differ on the spectrogram; CNN can effectively remove such differences, which is conducive to acoustic modeling in low-resource scenarios. A typical CNN is usually divided into two parts: convolutional filtering and max-pooling. The convolutional filter captures local structural characteristics, and each node of its feature map is convolved only with f nodes of the local frequency band of the previous layer. Local convolution has two advantages: (1) features can be computed reliably from clean parts of the spectrum, with only a few features affected by the noisy parts, improving the robustness of the model; (2) higher layers of the network combine the values computed in each frequency band, balancing speech information across adjacent bands. After the convolution operation, max-pooling is applied to provide additional translation and rotation invariance [25].
In the acoustic model based on CNN, the input feature vectors are divided into $N$ non-overlapping frequency bands $\{v_i \mid i = 0, \ldots, N-1\}$. Each group of $s$ adjacent bands $\{v_{i+r} \mid r = 0, \ldots, s-1\}$ is passed through the weight matrix $W$, bias vector $b$, and nonlinear transformation function $\theta$ of the convolutional layer to produce the output $h_j$:

$$h_j = \theta\left(\sum_{r=0}^{s-1} W_r\, v_{jd+r} + b\right), \quad j = 0, \ldots, M-1$$

During the convolution operation, the weight matrix $W$ and bias vector $b$ are shifted by $d$ frequency bands each time, generating a set of convolutional-layer band outputs $\{h_j \mid j = 0, \ldots, M-1\}$. The number $s$ of frequency bands input to the convolutional layer is called the band width, and the shift $d$ is called the band shift. Max-pooling follows the convolution operation to reduce the dimensionality of the hidden nodes:

$$p_i^m = \max_{j=0}^{k-1} h_{i \cdot k + j}^m$$

where $p_i^m$ is the output of the max-pooling operation, $i$ and $j$ are band indexes, $m$ is the index of the neuron within a band, and $k$ is the pooling size.
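The band convolution and pooling described above can be sketched in NumPy. The sizes below (N = 8 bands, band width s = 3, band shift d = 1, pooling size k = 2) are toy values, and each band group is reduced to a single scalar output for simplicity rather than a full feature map:

```python
import numpy as np

rng = np.random.default_rng(2)
N, band_dim = 8, 5          # N frequency bands, each a 5-dim vector (toy sizes)
s, d, k = 3, 1, 2           # band width s, band shift d, pooling size k
v = rng.standard_normal((N, band_dim))

# Hypothetical shared filter: one weight vector W and bias b reused at every shift.
W = rng.standard_normal(s * band_dim) * 0.1
b = 0.0
theta = np.tanh             # nonlinearity

# Convolution along frequency: each output looks at s adjacent bands,
# and the filter shifts by d bands each time.
h = np.array([theta(W @ v[j:j + s].ravel() + b)
              for j in range(0, N - s + 1, d)])

# Non-overlapping max-pooling of size k reduces the number of band outputs.
p = np.array([h[i:i + k].max() for i in range(0, len(h) - k + 1, k)])
```

With these sizes the convolution yields M = 6 band outputs and pooling reduces them to 3, matching the dimensionality-reduction role described in the text.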
The Time Delay Neural Network (TDNN) was the first simple one-dimensional CNN applied to speech recognition tasks, originally without pooling or subsampling. It can efficiently build long-term time-dependency relationships in both small and large data scenarios, and it has been shown to be effective in learning the temporal dynamics of signals even from short-term feature representations [26]. TDNN systems for multilingual training were further established in [27]-[29]. Moreover, chain-model training, characterized by the Lattice-Free Maximum Mutual Information (LF-MMI) training criterion without frame-level cross-entropy pre-training, can significantly improve both speed and Word Error Rate (WER).
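A TDNN layer can be viewed as a one-dimensional convolution over time that splices frames at fixed context offsets. The sketch below is a minimal single-layer illustration with made-up sizes and random parameters, not the multi-layer subsampled architectures used in practice:

```python
import numpy as np

rng = np.random.default_rng(6)
T, feat_dim, n_out = 20, 8, 4
x = rng.standard_normal((T, feat_dim))      # T frames of toy acoustic features

def tdnn_layer(x, context, W, b):
    # A TDNN layer splices the frames at the given context offsets
    # (e.g. [-2, 0, 2]) and applies one shared affine transform plus
    # ReLU at every valid time step.
    lo, hi = -min(context), max(context)
    out = []
    for t in range(lo, len(x) - hi):
        spliced = np.concatenate([x[t + c] for c in context])
        out.append(np.maximum(0.0, W @ spliced + b))
    return np.array(out)

context = [-2, 0, 2]                        # a typical subsampled context
W = rng.standard_normal((n_out, len(context) * feat_dim)) * 0.1
b = np.zeros(n_out)
h = tdnn_layer(x, context, W, b)
```

Stacking such layers with growing offsets is what lets a TDNN cover long temporal context while each layer only performs a narrow local computation.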
Initially, CNN was used only as a tool for robust feature extraction, so generally just one or two convolutional layers were added at the bottom and the upper layers were modeled with other neural network structures. For example, a CNN layer was added at the bottom of a hybrid neural network-HMM model on a small-vocabulary task to normalize the spectral variation of the speech signal [30]. Two CNN layers were used for high-dimensional speech feature extraction in low-resource speech recognition; compared with MFCC features, their frequency-domain energy changes are smaller, which benefits the learning of higher-level networks [31]. Then, inspired by VGGNet [32], a very deep convolutional architecture with up to 14 weight layers was applied to low-resource speech recognition in [33]. A deep structure based on the Gated Convolutional Network (GCN) was proposed in [34]. GCN combines a gating mechanism with convolution for sequential tasks and can better learn acoustic feature representations, with gates controlling the information passed up the hierarchy. It combines the advantages of RNN and CNN on low-resource tasks to improve training speed and robustness.
However, gradient disappearance and overfitting during CNN optimization are still two factors that affect model performance. The Convolutional Maxout Neural Network (CMNN), which uses maxout neurons and dropout training, is an effective way to address them [35], [36]. The nonlinear function of the original network is changed from sigmoid to maxout, which selects the maximum output among neuron nodes at adjacent positions within the same frequency band, making the model easier to optimize. Dropout discards neurons in the network with a certain probability during each training pass, reducing the number of network parameters to be adjusted and preventing overfitting.
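The two operations just described are simple to state in code. The sketch below shows maxout over groups of adjacent units and inverted dropout on a small toy vector; the group size and drop probability are illustrative choices, not values from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(3)

def maxout(z, group_size=2):
    # Maxout keeps the maximum over each group of adjacent linear units,
    # replacing a fixed nonlinearity such as the sigmoid.
    return z.reshape(-1, group_size).max(axis=1)

def dropout(a, p=0.5, training=True):
    # During training each unit is dropped with probability p; the inverted
    # scaling keeps the expected activation unchanged, so nothing special
    # is needed at test time.
    if not training:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

z = np.array([0.3, -1.2, 2.0, 0.1, -0.5, -0.7])
a = maxout(z)               # -> [0.3, 2.0, -0.5]
out = dropout(a, p=0.5)
```

Note that maxout is piecewise linear, so its gradient is 1 for the winning unit, which is what eases the vanishing-gradient problem relative to saturating sigmoids.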
Compared with computer vision, the behavior of the time and spectrum dimensions of speech signals may be quite different. Therefore, a two-dimension convolution structure was proposed in [37], which emphasized the importance of covering both time and spectrum in a convolutional filter. The experimental results showed that the two-dimension convolutional network was superior to the fully connected DNN with only 10 hours of training data, and that over time the spectral convolution became much less important.

B. ACOUSTIC MODELS WITH END-TO-END STRUCTURE
Deep learning algorithms play only a limited role in speech recognition systems built as a traditional pipeline. The end-to-end model integrates the multiple modules of traditional speech recognition, such as the acoustic model, pronunciation lexicon, and language model, into one network for joint training [38]. It realizes a direct mapping from the input sound sequence to the label sequence without carefully designed intermediate states, which greatly simplifies the training process and significantly reduces computational complexity. This integration reduces the dependence on prior expert knowledge and avoids the obstacles the traditional framework faces when effective data is scarce, so it is also applied to low-resource speech recognition. End-to-end learning allows a wide variety of sounds to be processed, including noisy environments, accents, and different languages [39].

1) CONNECTIONIST TEMPORAL CLASSIFICATION
In the ASR task, the feature sequence of input speech frames is usually longer than the output label sequence, and effective training requires a label for each speech frame. Hence, the speech must be preprocessed with frame-by-frame alignment before training, which must be iterated repeatedly. The proposal of CTC solves this problem [40]. CTC focuses on whether the output as a whole is consistent with the label: it can be trained without frame-level alignment of the labels in time and does not care about the prediction for the input at any particular moment. The output of CTC is the probability of the overall sequence, which removes the tedious work of forced alignment to obtain frame-level annotations. In addition, CTC adds an extra blank label to the target label set and uses it to indicate the probability of emitting no label at a given time step.
Given an input sequence $x = \{x_1, \ldots, x_T\}$ of length $T$ and a target output sequence $y = \{y_1, \ldots, y_N\}$ of length $N$, we define a set of target labels $\mathcal{L}$ (with $y_n \in \mathcal{L}$) and an extended set of CTC output labels $\mathcal{L}' = \mathcal{L} \cup \{-\}$, where $-$ is the blank label. CTC first maps $x$ to a path $\pi = \{\pi_1, \ldots, \pi_T\}$ of the same length $T$, with $\pi_t \in \mathcal{L}'$. The conditional probability of any path $\pi$ is calculated as follows:

$$P(\pi \mid x) = \prod_{t=1}^{T} P(\pi_t \mid x)$$

The set of all length-$T$ paths that map to $y$ is denoted $B^{-1}(y)$. CTC then performs a many-to-one, long-to-short mapping $B$ that aggregates multiple paths into a shorter label sequence: labels that appear consecutively in a path are merged into one, and the blank labels are removed. The probability of the target label sequence is calculated as follows:

$$P(y \mid x) = \sum_{\pi \in B^{-1}(y)} P(\pi \mid x)$$

The loss function of CTC is defined as the sum of the negative log probabilities of the correct labels, which means minimizing the following objective:

$$L_{\mathrm{CTC}} = -\sum_{(x,y) \in D} \ln P(y \mid x)$$

where $(x, y) \in D$ denotes the training samples.
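For very short sequences the CTC quantities above can be computed by brute force, which makes the collapse mapping B and the path sum concrete. The per-frame posteriors below are made-up numbers, and real systems use the forward-backward algorithm rather than enumerating all paths:

```python
import itertools
import math

# Brute-force CTC on a toy problem: labels {a, b} plus blank "-", T = 3 frames.
labels = ["a", "b", "-"]
# probs[t][l]: P(pi_t = l | x); each row sums to 1 (hypothetical values).
probs = [{"a": 0.6, "b": 0.3, "-": 0.1},
         {"a": 0.2, "b": 0.5, "-": 0.3},
         {"a": 0.1, "b": 0.7, "-": 0.2}]

def collapse(path):
    # B: merge repeated labels, then drop blanks ("aab" and "a-b" both -> "ab").
    merged = [l for l, _ in itertools.groupby(path)]
    return tuple(l for l in merged if l != "-")

def ctc_prob(y):
    # P(y|x) = sum over all length-T paths pi with B(pi) = y of prod_t P(pi_t|x).
    total = 0.0
    for path in itertools.product(labels, repeat=len(probs)):
        if collapse(path) == tuple(y):
            total += math.prod(probs[t][l] for t, l in enumerate(path))
    return total

p_ab = ctc_prob("ab")       # sums the 5 paths aab, abb, ab-, a-b, -ab
loss = -math.log(p_ab)      # CTC loss for target "ab"
```

Exactly five of the 27 possible paths collapse to "ab", which illustrates why the training objective is a sum over an alignment set rather than a single alignment.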
In essence, CTC is a special loss function, or optimization criterion, for sequence modeling. Its introduction effectively solves the classification problem for time-series data. The combination of deep neural networks and CTC lets deep learning be applied more fully in speech recognition and has become a hot spot in end-to-end speech recognition research. The output unit of CTC is very flexible: it can be phonemes, graphemes, syllables, other sub-word units, or even whole words [21].
Rosenberg et al. [41] explored the use of CTC for keyword search and speech recognition in low-resource languages; it did not exceed the results obtained by DNN-HMM but was competitive, indicating a direction for improving end-to-end speech recognition systems. The encoder in the CTC model has great potential for improvement. A method of using segmentation to correct the CTC loss during training was proposed, resulting in improvements when decoding with a small beam size [21]. Vydana et al. [18] studied the Subspace Gaussian Mixture Model (SGMM) and a joint acoustic model based on RNN-CTC; experimental results showed that the joint acoustic model trained with RNN-CTC outperformed the SGMM system on 120 hours of Indian-language data. Wang et al. [42] combined CTC with Tibetan linguistic knowledge and used tied triphones as the modeling unit to address Tibetan acoustic modeling under resource constraints, making the recognition rate of the end-to-end acoustic model exceed that of a BLSTM-HMM speech recognition system. Yu et al. [20], [31] used a joint BLSTM-CTC model to achieve phoneme-level speech recognition with a few hours of data.
However, CTC excludes the case where the output sequence is longer than the input sequence and makes independence assumptions between time frames, neither modeling the interdependence between outputs nor the correlation between frames. The RNN-Transducer (RNN-T) combines acoustic and language modeling on the basis of the CTC model by adding a prediction network [43]. RNN-T regards the acoustic model as an encoder, the language model as a prediction network, and the joint network as a decoder. This model has been proven effective in speech recognition tasks [44], [45]. However, RNN-T is more difficult to train, and the encoder and the prediction network must be pre-trained separately to obtain good results, so it has not yet been well applied to low-resource speech recognition.

2) SEQUENCE-TO-SEQUENCE MODELS
The attention-based model is an end-to-end encoder-decoder model. The attention mechanism eliminates the need for pre-segmented, aligned data and instead implicitly learns a soft alignment between input and output sequences, avoiding the conditional-independence assumption of CTC [46]. The encoder converts the entire speech input sequence $x$ into a high-level hidden vector sequence $h = \{h_1, \ldots, h_L\}$; at each step of generating an output label $y$, the decoder uses the attention mechanism to select or assign different weights to the vectors in $h$, so that the most relevant vectors are used for prediction.
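One common scoring choice, dot-product attention, can be sketched in a few lines. The encoder states and decoder state below are random placeholders standing in for learned representations:

```python
import numpy as np

rng = np.random.default_rng(4)
L, dim = 6, 4
h = rng.standard_normal((L, dim))   # encoder hidden vectors h_1..h_L
s = rng.standard_normal(dim)        # current decoder state (hypothetical)

# Score each encoder vector against the decoder state; dot-product scoring
# is one common choice, additive (MLP) scoring is another.
scores = h @ s

# Softmax over positions gives the attention weights alpha.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# The context vector is the alpha-weighted sum of encoder states,
# used by the decoder for the next label prediction.
context = alpha @ h
```

The weights form a distribution over input positions, so at each output step the decoder effectively chooses where in the utterance to look, which is the soft alignment described above.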
Chorowski et al. [47] first used the attention mechanism to align input and output sequences in a speech recognition task. The encoder was a bidirectional RNN; the decoder was an RNN that directly emitted the phoneme stream, producing each symbol based on a context created from the subset of inputs selected by the attention mechanism. Later, this model was used for low-resource speech recognition in [41], with the encoder and decoder replaced by GRUs. This structure attends over the entire encoded sequence and must wait until encoding is completely finished, which increases latency. The introduction of the Listen, Attend and Spell (LAS) model [48] solved this problem. The listener is a pyramidal BLSTM that encodes the input sequence x into a high-level feature sequence h; the speller is an attention-based decoder that generates characters y from h. The training difficulty and efficiency of the model can be improved by scheduled sampling [49], label smoothing [50], and minimum word error rate training [51]. Moreover, joint training of attention and CTC can greatly improve the convergence of the model [52].
The Transformer is a special encoder-decoder structure that avoids recurrence and convolution and relies entirely on the attention mechanism to describe global dependencies between input and output [53]. The encoder has 6 layers, each containing two sublayers: a multi-head attention mechanism and a position-wise fully connected feed-forward network. A residual connection is used around each sublayer, followed by layer normalization. The decoder has the same structure as the encoder, except that each layer contains three sublayers: two multi-head attention mechanisms and a fully connected layer. The first multi-head attention uses a mask operation, and the second attends to the encoder's output. The Transformer structure requires a fixed input length; if a sequence is shorter, padding is applied to the remaining positions. The mask operation keeps the padded positions out of the attention calculation and ensures that each predicted position can depend only on known outputs at earlier positions. In addition, positional encoding is added to the inputs at the bottom of the encoder and decoder stacks, which provides information about the relative or absolute position of tokens in the sequence. The advantage of this internal structure is that the model can be trained in parallel.
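Two of the ingredients described above, the sinusoidal positional encoding and the padding mask, can be sketched as follows. This is a simplified illustration with toy sizes, not a full Transformer:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal encoding: even dimensions use sin, odd dimensions use cos,
    # with geometrically spaced wavelengths, so each position gets a
    # distinctive pattern the model can learn to exploit.
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def padding_mask(lengths, max_len):
    # True where a frame is padding; in attention these positions get a large
    # negative value added to their scores before the softmax, so they
    # receive (near-)zero weight.
    return np.arange(max_len)[None, :] >= np.array(lengths)[:, None]

pe = positional_encoding(max_len=50, d_model=8)
mask = padding_mask(lengths=[3, 5], max_len=5)
```

Because the encoding is added to the inputs rather than computed recurrently, every position can be processed at once, which is what enables the parallel training mentioned above.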
The Transformer was originally used for Neural Machine Translation (NMT) tasks, where both its training speed and results far surpassed other algorithms [53]. It has also been shown to perform well on other Natural Language Processing (NLP) tasks [54]. Since ASR is likewise a sequence-to-sequence transduction task, researchers began to introduce the Transformer into ASR.
The ASR Transformer was proposed in [55], [56]. Its structure is basically the same as the NMT Transformer; only a linear transformation with layer normalization is added to map the log-Mel filterbank features into a dimension matching the model input. Zhou et al. [57] studied using the ASR Transformer for multilingual speech recognition on 6 low-resource languages. Another model, the Speech-Transformer, was put forward in [58]; it took a two-dimension spectrogram with time and frequency axes as input and added two 3 × 3 CNN layers with stride 2, plus M optional additional modules, in front of the encoder. Considering that combining the time and frequency axes might help model the temporal and spectral dynamics of a spectrogram, a two-dimension attention mechanism was proposed as an additional module to capture temporal and spectral dependencies. Mohamed et al. [59], [60] also combined convolutional layers with the Transformer but eliminated the positional encoding: a two-dimension convolutional block with layer normalization and ReLU and a two-dimension max-pooling layer were added at the bottom of the encoder, and a one-dimension convolutional block with layer normalization and ReLU at the bottom of the decoder [59]. Adding convolutional layers to the Transformer can better learn the long-range acoustic characteristics of speech.

IV. HOW TO LEARN WITH LOW-RESOURCE LANGUAGES
In addition to the improvements to the basic acoustic model units mentioned above, several important techniques are applied in low-resource speech recognition and are often the key to improving performance. They mainly fall into two categories: data-level and model-level methods.

A. DATA AUGMENTATION
The amount of data directly affects the performance of deep learning, and deep learning on small data sets is prone to overfitting. This problem is usually addressed first at the data level. However, collecting additional resources can be difficult for languages that are not widely used. Data augmentation, a technique designed to increase the amount of data available for training speech recognition systems, has therefore become a widely adopted approach in low-resource speech recognition. Common data augmentation methods include semi-supervised training, multilingual processing, acoustic data perturbation, and speech synthesis [61]. Furthermore, many studies combine multiple data augmentation methods instead of using a single one. For instance, semi-supervised training and acoustic data perturbation were combined in [61], [62], and data from other languages were additionally used in [62]. Acoustic data perturbation and speech synthesis were combined in [63], resulting in a 14.8% relative WER improvement. Semi-supervised training refers to training the model with both supervised data and unsupervised data selected through a confidence threshold. Multilingual processing refers to expanding the resource-poor data with resource-rich data. The data obtained by these two methods are natural. The other two methods, acoustic data perturbation and speech synthesis, produce artificial data: the former interferes with the raw data in some way, and the latter generates new data through speech synthesis technology.
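The confidence-threshold selection at the heart of semi-supervised training can be sketched as follows. The function name and the triple format are hypothetical; in practice the confidence scores would come from the decoder's lattice or posterior probabilities:

```python
def select_pseudo_labels(hypotheses, threshold=0.9):
    """Keep unlabeled utterances whose 1-best hypothesis is confident enough.

    `hypotheses` is a list of (utterance_id, transcript, confidence) triples
    produced by decoding unlabeled audio with a seed model; utterances that
    pass the threshold are added to the supervised training set.
    """
    return [(uid, text) for uid, text, conf in hypotheses if conf >= threshold]

# Toy usage: only the confident hypotheses survive.
hyps = [("utt1", "hello", 0.95), ("utt2", "???", 0.40), ("utt3", "world", 0.91)]
selected = select_pseudo_labels(hyps)
```

The seed model is then retrained on the union of the original transcribed data and these pseudo-labeled utterances, often for several iterations.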

1) ACOUSTIC DATA PERTURBATION
Acoustic data perturbation covers multiple types of data interference used to expand the training set, such as speed and volume perturbation, noise injection, adding reverberation, and so on. This method is easy to implement and has been widely used in low-resource speech recognition tasks, effectively improving the robustness of the acoustic model. Its main disadvantage is that the generated data can be of low quality.
Vocal Tract Length Perturbation (VTLP) augments speech data by applying random linear distortion along the frequency dimension of the spectrogram. Unlike Vocal Tract Length Normalization (VTLN) [64], it generates a random warp factor α for each utterance to warp the frequency axis and map each frequency to a new value, instead of estimating a warp factor for each training and testing speaker [65]. It laid the foundation for enlarging data sets in speech recognition without changing the labels, and was later applied to several low-resource speech recognition tasks [61], [66]-[69]. Speed perturbation creates two counterparts of the original training data by modifying the speed to 0.9 and 1.1 times the original rate. Tempo perturbation was additionally used to modify the rhythm of the signal while keeping its pitch and spectrum unchanged. Because the signal length changes, a GMM-HMM system was used to realign the data after speed perturbation [68]. Gokay et al. [63] used speed perturbation, volume perturbation, and a combination of the two for data augmentation. Kanda et al. [70] studied three distortion methods: vocal tract length distortion, speech rate distortion, and frequency-axis random distortion. In a large vocabulary continuous speech recognition task with only 10 hours of training data, the relative WER with DNN-HMM training was reduced by 10.1%. Hsiao et al. [71] improved the robustness of a speech recognition system by artificially adding noise and reverberation. Hartmann et al. [67] used fMLLR transformations of random speakers to enhance bottleneck features, combining this approach with noise injection and speed perturbation.
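Speed perturbation by a factor such as 0.9 or 1.1 amounts to resampling the waveform. The toy sketch below uses plain linear interpolation, not the sox-based resampling typically used in practice, so it only illustrates the idea:

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample a 1-D waveform so it plays `factor` times faster.

    factor=0.9 yields a longer (slower) signal, factor=1.1 a shorter
    (faster) one; both pitch and duration change together.
    """
    n_out = int(round(len(signal) / factor))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)
```

Because the output length differs from the original, the frame-level alignments are no longer valid, which is why [68] realigned the perturbed data with a GMM-HMM system.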
Inspired by data augmentation in the image field, Google proposed SpecAugment, an augmentation strategy operating on the log-Mel spectrogram [72]. The strategy includes warping along the time direction and applying masking blocks to frequency channels and time steps. SpecAugment converts ASR from an over-fitting to an under-fitting problem; consequently, better performance can be obtained with a larger network and longer training time. Wang et al. [60] further improved on SpecAugment and proposed a semantic mask, a regularization method for training end-to-end speech recognition models. It masks all features corresponding to an output token during training, such as a word or a word fragment. The motivation is to encourage the model to fill in the missing tokens from context using fewer acoustic features, so that the model acquires stronger language modeling capability and stronger resistance to acoustic distortion.
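The masking part of SpecAugment can be sketched as below. The parameter names F and T follow [72], but this is a simplified illustration that fills masked regions with the spectrogram mean and omits the time-warping step:

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, F=8, num_time_masks=2, T=10, rng=None):
    """Apply frequency and time masking to a log-Mel spectrogram.

    `spec` has shape (num_mel_bins, num_frames); each mask covers a random
    band of up to F mel bins or T frames, filled with the global mean.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()
    for _ in range(num_freq_masks):          # mask random frequency bands
        f = int(rng.integers(0, F + 1))
        f0 = int(rng.integers(0, n_mels - f + 1))
        out[f0:f0 + f, :] = fill
    for _ in range(num_time_masks):          # mask random time spans
        t = int(rng.integers(0, T + 1))
        t0 = int(rng.integers(0, n_frames - t + 1))
        out[:, t0:t0 + t] = fill
    return out
```

Since the masks are redrawn every time an utterance is visited, the model effectively never sees the same spectrogram twice.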

2) SPEECH SYNTHESIS
Compared with the acoustic data perturbation method, speech synthesis is more flexible. A generation model is often used to obtain new speech data, which is similar to using the Generative Adversarial Networks (GAN) architecture [73] in the image field to generate synthetic new images that simulate the distribution of input data.
Voice Conversion (VC) is a technology that converts the non-linguistic information of a given voice while retaining the linguistic information. Kaneko et al. [74] proposed CycleGAN-VC, a non-parallel VC method that does not rely on parallel data. It used a Cycle-consistent Generative Adversarial Network (CycleGAN) with gated CNNs and an identity-mapping loss. CycleGAN uses both an adversarial loss and a cycle-consistency loss to learn forward and inverse mappings [75], making it possible to find the best pseudo pairs from unpaired data. The gated CNN trained with the identity-mapping loss allows the mapping functions to capture sequential and hierarchical structure while retaining linguistic information. Because CycleGAN-VC only learns a one-to-one mapping, Kameoka et al. [76] proposed using StarGAN [77] for non-parallel many-to-many VC. This method is an extension of CycleGAN-VC that introduces a domain classifier to predict which class an input belongs to. Since StarGAN-VC only requires a few minutes of non-parallel and unlabeled speech for each speaker, the architecture is well suited to low-resource VC. Hsu et al. [78] proposed a non-parallel Variational Autoencoding Wasserstein Generative Adversarial Network (VAW-GAN) VC framework, in which a Variational Autoencoder (VAE) [79] models the speech characteristics of each speaker and a Wasserstein Generative Adversarial Network (W-GAN) [80] synthesizes speech for different speakers. Thai et al. [81] used StarGAN-VC and VAW-GAN to perform VC on 720 minutes of the Seneca language; the experimental results showed that the data augmentation helped to reduce the WER. Another speech synthesis method is Text-to-Speech (TTS). Two TTS methods were used as data augmentation strategies in [63], namely the Google Translate Text-to-Speech (gTTS) and Deep Convolutional TTS (DCTTS) architectures [82]; about 10 hours of Turkish speech data were ultimately synthesized.
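The cycle-consistency loss that lets CycleGAN-style VC learn from unpaired data can be illustrated with a toy example. Here G and F stand for the forward (X→Y) and inverse (Y→X) mapping networks, reduced to simple functions for illustration:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L_cyc = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1 (toy illustration).

    Driving this loss to zero forces G and F to be inverses of each other,
    so a sample converted to the other domain and back is reconstructed.
    """
    return np.abs(F(G(x)) - x).mean() + np.abs(G(F(y)) - y).mean()

# Toy mappings: a shift and its exact inverse give zero cycle loss.
G = lambda v: v + 1.0
F = lambda v: v - 1.0
loss = cycle_consistency_loss(np.array([0.5, 2.0]), np.array([3.0, 4.0]), G, F)
```

In the full CycleGAN-VC objective this term is combined with the adversarial and identity-mapping losses; only the cycle term is shown here.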
In order to solve the problem that synthesized speech can only provide limited speaker diversity for data augmentation in low-resource tasks, Du et al. [83] proposed a speaker augmentation method that used VAE speaker representation to train the end-to-end TTS system so that TTS could synthesize sounds from unknown new speakers by sampling from the training potential distribution.

B. TRANSFER KNOWLEDGE FROM MODELS
When deep learning is applied to small data sets, superior performance generally cannot be achieved on a single target task. Researchers therefore proposed learning additional tasks to improve network learning. This idea is widely used in low-resource speech recognition and provides an effective way to mitigate data sparsity. It mainly includes multitask learning, transfer learning, and meta learning; an intuitive comparison of the three is shown in Table 1.

1) MULTITASK LEARNING
Multitask learning is a machine learning technique that aims to improve the generalization performance of models by learning multiple related tasks in parallel. Especially for small data sets, a multitask learning model can outperform a model that optimizes performance for only one task. If multiple tasks are related and share some internal representations, they can transfer knowledge to each other by being learned together. The general framework for multitask learning consists of three parts: (1) task-specific input layers that transform features from domain-specific to domain-general representations; (2) Shared Hidden Layers (SHL) for task-independent feature extraction; (3) task-specific output layers, in which each task has a separate softmax used to estimate the posterior probabilities of that language's senones. In practical applications these components are adjusted according to the specific tasks. For low-resource scenarios, multitask learning is usually divided into monolingual multitask ASR and multilingual multitask ASR.
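The three-part framework above can be sketched in NumPy as shared hidden layers feeding per-task softmax output layers. The class and task names are hypothetical and the weights below are random and untrained; a real system would learn them by alternating or joint gradient descent over all tasks:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SharedHiddenLayerModel:
    """Shared hidden representation with one softmax head per task."""
    def __init__(self, in_dim, hidden_dim, task_out_dims, seed=0):
        rng = np.random.default_rng(seed)
        self.W_shared = rng.standard_normal((in_dim, hidden_dim)) * 0.1
        # One output matrix per task, e.g. one per language's senone set.
        self.heads = {task: rng.standard_normal((hidden_dim, n)) * 0.1
                      for task, n in task_out_dims.items()}

    def forward(self, x, task):
        h = np.tanh(x @ self.W_shared)        # task-independent features
        return softmax(h @ self.heads[task])  # task-specific posteriors
```

Gradients from every task flow into `W_shared`, which is how the shared layers absorb knowledge from all languages at once.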
Monolingual multitask learning in ASR usually requires finding tasks related to the language of the main task without additional resources. Such tasks can be defined over abstract phonetic categories, with the category labels used as auxiliary targets for frame-level classification [84], [85]. For instance, triphone modeling and trigrapheme modeling are highly related learning tasks that can be estimated in parallel under the multitask learning framework [86]. Chen et al. [87] took grapheme modeling as an additional learning task and used a multitask learning DNN to learn the phone model of the target language. Fantaye et al. [88] conducted joint training through multitask learning over the basic acoustic units of the low-resource language Amharic, including syllables, phones, and rounded phones.
Multilingual multitask ASR usually treats each language as a separate task and builds a model by bringing together multiple related languages. Taking multilingual correlation as a prerequisite, the usual structure shares the input layer and hidden layers across all languages while keeping separate output layers, i.e., the SHL multilingual model [15], [89]-[92]. Because the hidden layers are shared among different languages, the SHL encodes rich senone information that can be used to recognize different languages, and layer-by-layer abstraction makes the input features more discriminative across languages [93]. Multilingual multitask ASR also derives Multilingual Bottleneck Features (MBNF) to better support acoustic modeling of low-resource languages in a multilingual environment. Unlike the SHL multilingual model, the MBNF-based model contains a bottleneck layer with only a few nodes, usually a linear layer, to retain as much multilingual information as possible [94]-[96]. The MBNF-based model is shown in Figure 3.
However, some unnecessary language-specific information may be included in the MBNF-based model. To ensure that the SHL learns language-invariant features, adversarial training is introduced into multilingual learning: a language discriminator is added to the model to identify the language label of each frame from the shared features [97]-[99].

2) TRANSFER LEARNING
Transfer learning uses the similarities among data, tasks, or models to quickly and effectively develop a better-performing system for a new domain by using knowledge learned from a source domain. Unlike multitask learning, which focuses on improving the performance of all tasks, transfer learning emphasizes improving the performance of the target task by transferring knowledge acquired on similar but different tasks. In transfer learning, the source and target domains should be similar to each other so that knowledge transfers smoothly. In ASR, transfer learning takes the form of cross-lingual acoustic modeling, which aims to transfer knowledge from one or more source-language systems built with large amounts of training data to a target-language system for which only a limited amount of transcribed audio is available [100], [101]. Transfer learning is used in two ways for low-resource speech recognition.
The first method, shown on the left of Figure 4, is fine-tuning: the network weights are initialized with pre-trained weights instead of random values and then readjusted for the task in the target domain. Because the pre-trained model is usually not fully applicable to the target domain, fine-tuning is necessary. This method is suitable for tasks with high similarity between the source and target domains, and its implementation is relatively simple: the input and hidden layers of the network remain common to the two domains, the model parameters are borrowed, and a new softmax output layer is created for the target low-resource language [29]. The goal of the output layer is to obtain the posterior information of the monolingual model of the low-resource language.
The second transfer learning method, shown on the right of Figure 4, is weight-based transfer, which is realized by transferring the hidden layers. First, an acoustic model is trained on a high-resource language and its first n hidden layers are retained. Then m randomly initialized hidden layers and a softmax layer are added on top of the n retained layers. Finally, the transferred model is retrained on the target low-resource language. In this case, the pre-trained model can be regarded as a feature extractor [31]. In practice, pre-trained models can also be combined with multilingual training.
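Weight-based transfer can be sketched as follows, treating each hidden layer as a weight matrix. The function is a hypothetical illustration of the stacking step, not a recipe from [31]:

```python
import numpy as np

def transfer_hidden_layers(source_weights, n_keep, new_layer_shapes, seed=0):
    """Build a transferred layer stack for the target language.

    Keeps the first `n_keep` weight matrices of the source-language model
    and stacks freshly initialized layers (the last shape playing the role
    of the new softmax layer) on top; the result is then retrained on the
    target low-resource language.
    """
    rng = np.random.default_rng(seed)
    kept = list(source_weights[:n_keep])                     # frozen or slow-lr
    fresh = [rng.standard_normal(shape) * 0.01 for shape in new_layer_shapes]
    return kept + fresh
```

Whether the kept layers are frozen or merely given a smaller learning rate during retraining is a design choice; both variants appear in practice.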

3) META LEARNING
Meta learning is also called learning to learn. It acquires experience from previous learning in a systematic and data-driven manner to speed up the learning process on new tasks. Generally, meta learning requires the following steps: (1) collecting metadata describing previous learning tasks and previously learned models, including the algorithm configurations used to train the models (hyperparameter settings, pipeline compositions, and network architectures), the resulting model evaluations (such as accuracy and training time), the learned model parameters (such as the trained weights of a neural network), and measurable properties of the tasks themselves; (2) learning from this metadata to extract and transfer knowledge that guides the search for the best model on a new task [102]. Different from transfer learning, meta learning uses previous knowledge to make the model adjust and optimize itself for new tasks.
Meta learning explores how to adapt quickly to unseen data. It has been widely used in computer vision under the few-shot learning setting [103], [104], and preliminary attempts in low-resource language and speech processing have achieved good results, for example in machine translation [105], [106], dialogue generation [106], and speaker adaptation [107]. Klejch et al. [107] studied adapting a speaker-independent model to unseen speakers using limited adaptation data, which can be regarded as a special case of few-shot learning. They proposed an acoustic model weight adaptation method based on meta learning: speaker adaptation is treated as a function that uses the adaptation data to adjust the weights θ of the acoustic model f(x, θ) into a set of adapted weights θ*. Three hours of data divided among 18 speakers were used to train the meta-learner, which finally achieved a lower WER with DNN and TDNN models. Hsu et al. [91] proposed MetaASR, which learns from six source tasks through the Model-Agnostic Meta Learning (MAML) algorithm to obtain a good initialization of the shared encoder that can be quickly fine-tuned for four target tasks. The results showed that MetaASR performed better than MultiASR when applied directly to the four target languages. Meta learning is seen as a key step toward general artificial intelligence; although it is not yet widely used in low-resource speech recognition, it is a direction worth further study.
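The MAML-style inner/outer loop can be illustrated on a toy problem in which each task i has the scalar loss (w - a_i)^2, so all gradients can be written analytically. This is a didactic sketch of the algorithm, not the MetaASR setup:

```python
def maml_step(w, task_targets, inner_lr=0.1, meta_lr=0.05):
    """One MAML meta-update on toy tasks with loss L_i(w) = (w - a_i)^2.

    Inner loop: adapt w separately for each task with one gradient step.
    Outer loop: update the shared initialization w using the gradient of
    each post-adaptation loss, differentiated through the inner step.
    """
    meta_grad = 0.0
    for a in task_targets:
        w_adapted = w - inner_lr * 2 * (w - a)               # inner step
        # d/dw of (w_adapted - a)^2, accounting for w_adapted's dependence on w
        meta_grad += 2 * (w_adapted - a) * (1 - 2 * inner_lr)
    return w - meta_lr * meta_grad / len(task_targets)

# Repeated meta-updates drive w toward an initialization that adapts
# well to every task; here that is the mean of the task optima.
w = 0.0
for _ in range(300):
    w = maml_step(w, [1.0, 3.0])
```

In MetaASR the scalar w becomes the shared encoder's parameter vector and the toy tasks become source languages, but the two-level structure is the same.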

V. PROJECTS FOR LOW-RESOURCE LANGUAGES
A. IARPA BABEL PROGRAM
Many speech transcription systems were originally developed for English and have significantly lower performance on non-English languages. For many minority languages no speech technology exists at all, and developing it often takes years while covering only a small portion of the world's languages. Effective triage capabilities to assist the few available analysts must therefore be developed rapidly. The goal of the IARPA Babel program [108] is to develop agile and robust speech processing technology that can quickly adapt to any human language, providing effective search capability for analysts and processing large volumes of real-world recorded speech. The Babel program worked with diverse languages from the outset and acquired speech data in-country for languages from a broad set of language families, such as Afro-Asiatic, Niger-Congo, Sino-Tibetan, Austronesian, Dravidian, and Altaic. In the Babel program, data from more than twenty low-resource languages were collected, enabling multilingual experiments on feature extraction and acoustic modeling. Detailed information on the 23 languages is shown in Table 2; they are currently available on the Linguistic Data Consortium (LDC) website [109]. Audio data is presented as 8 kHz 8-bit a-law encoded audio in SPHERE format and 48 kHz 24-bit PCM encoded audio in WAV format. Transcripts are encoded in UTF-8 in Latin script and are included for approximately 80% of the speech. The gender distribution among speakers is approximately equal, and speakers' ages range from 16 to 70 years. Calls were made using different telephones from a variety of environments, including the street, a home or office, a public place, and inside a vehicle. This allows real recording conditions to be handled from the start.
Since 2013, the National Institute of Standards and Technology (NIST) has held an international keyword search (OpenKWS) evaluation every year as part of the Babel program. The goal of the NIST OpenKWS evaluation is to establish a speech recognition system with limited training resources and perform keyword search tasks within a limited time. In each evaluation, a surprise language is released whose identity is unknown beforehand. The primary performance measure for NIST OpenKWS is the Actual Term-Weighted Value (ATWV) [110], calculated as

ATWV(θ) = 1 - (1/K) Σ_kw [ N_Miss(kw, θ)/N_True(kw) + β · N_FA(kw, θ)/(T - N_True(kw)) ]

where θ is the uniform threshold used to decide whether each candidate detection is accepted as a keyword hit, K is the number of different keywords, N_Miss(kw, θ) and N_FA(kw, θ) respectively denote the number of missed detections and false alarms of keyword kw at threshold θ, N_True(kw) is the number of reference occurrences of keyword kw, T stands for the size of the test speech corpus, and β is a constant used to penalize the false alarm rate of the system. The higher the ATWV, the better the performance.
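Under these definitions, ATWV can be computed as in the sketch below. The keyword-statistics format is hypothetical, and β defaults to 999.9, the value used in the NIST OpenKWS evaluations:

```python
def atwv(keyword_stats, T, beta=999.9):
    """Actual Term-Weighted Value at a fixed decision threshold.

    `keyword_stats` maps each keyword to (n_miss, n_fa, n_true);
    `T` is the duration of the test speech corpus in seconds.
    """
    loss = 0.0
    for n_miss, n_fa, n_true in keyword_stats.values():
        p_miss = n_miss / n_true            # miss probability for this keyword
        p_fa = n_fa / (T - n_true)          # false-alarm probability
        loss += p_miss + beta * p_fa
    return 1.0 - loss / len(keyword_stats)
```

A perfect system scores 1.0; because of the large β, even a handful of false alarms quickly drags ATWV down, which is exactly the intended penalty.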
The OpenKWS evaluations for the Babel program span a base period, option period 1, option period 2, and option period 3; the performance goals for each period are shown in Table 3.

B. DARPA LORELEI PROGRAM
Understanding the emotions of the affected population in a particular area may help inform decision makers on how to best allocate resources for effective disaster relief. However, such work can be severely limited by language barriers. The DARPA Low Resource Languages for Emergent Incidents (LORELEI) program further developed language processing techniques for low-resource languages in the context of humanitarian crises. The goal of the LORELEI program is to make language technology broadly available where existing methods cannot, eliminating the current dependence on huge manually translated, manually transcribed, or manually annotated corpora and making full use of whatever language resources exist for low-resource languages. In this program, rather than translating foreign-language materials into English, information elements are identified directly in the local-language materials, including situation descriptions, names, places, events, emotions, and relationships [111].
The LORELEI program proceeds in three stages. The first is the language analysis stage: to reduce dependence on language-specific information, the common attributes and rules of language are analyzed from known language data to establish a universal language technology model. Known resource-rich language information is mapped to the resource-poor language to establish projection hypotheses, and optimization algorithms for language-specific resources are also studied. The second is the language technology development stage: driven by knowledge fusion engines, run-time models are built and language processing tools are developed by combining the knowledge of linguistic experts with scenario information. Third, to make it easier for analysts to use and analyze incident-related data, the LORELEI program creates a web service that integrates the Incident Language (IL) tools. Analysts can use these tools to convert low-resource ILs into English summaries or other visual forms by accessing the web service, which is constantly updated and improved as the data increases [112].
Unlike Babel, which focuses on speech processing, LORELEI focuses on situational awareness when emergencies occur, with an emphasis on text processing in low-resource languages. LDC is building text language packs for LORELEI, which include data, annotations, NLP tools, lexicons, and grammatical resources for 23 representative languages (Uzbek, Turkish, Hausa, Amharic, Arabic, Farsi, Hungarian, Mandarin, Russian, Somali, Spanish, Vietnamese, Yoruba, Akan, Bengali, Hindi, Indonesian, Swahili, Tagalog, Tamil, Thai, Wolof, Zulu) and 12 ILs (Uzbek, Mandarin, Oromo, and other undisclosed languages) [113]. The representative language packs, which contain monolingual text, parallel text, annotations, text processing tools, segmentation, entity tagging, lexicons, and grammar, are selected to provide broad typological coverage, while the ILs are selected to evaluate system performance on a language whose identity is disclosed at the start of the evaluation. There are two tools in each language pack: one to recreate the original source data from the processed XML material and the other to condition text data users download from Twitter. Data were collected from discussion forums, news, reference material, social networks, and weblogs. All text data is encoded as UTF-8.
The Low Resource Human Language Technologies (LoReHLT) evaluation was designed by NIST in collaboration with the LORELEI program. LoReHLT 2019 [114] included three tasks: Machine Translation (MT), Situation Frame (SF), and Entity Detection and Linking (EDL). The three tasks are described in Table 4.

VI. CONCLUSION
In recent years, with the rise of deep learning, ASR technology has made great progress, and significant results have been achieved in low-resource scenarios. Data augmentation and multilingual and cross-lingual training have become the most widely used techniques. But speech recognition systems still need more sophisticated models to deal with accented speakers or high levels of background noise. Low-resource speech recognition may see the following improvements or breakthroughs in the future.
First, improvement of end-to-end systems. The integration of additional linguistic knowledge and learning from complex data such as noisy speech still need to be studied. Joint modeling of the acoustic model and the language model should be strengthened to better exploit the correlation and complementarity between acoustics and language, so as to build a more thorough end-to-end speech recognition system with improved performance.
Second, mining speech structure knowledge through multimodal information fusion. We can expand from the single speech modality to related modalities such as images and video. In an era in which multimedia technology is highly popular, such data is easy to obtain and relatively easy to label. This is an important research direction for low-resource speech recognition under data scarcity.