Projected Minimal Gated Recurrent Unit for Speech Recognition

Recurrent neural networks (RNNs) can learn long-term dependencies, which makes them suitable for acoustic modeling in speech recognition. In this paper, we revisit an RNN model used in acoustic modeling, the mGRUIP with Context module (mGRUIP-Ctx), and propose an improved model named the Projected minimal Gated Recurrent Unit (PmGRU). The paper makes two main contributions. First, since widening the context window of the context module in mGRUIP-Ctx adds a large number of parameters, we insert a smaller output projection layer after the mGRUIP-Ctx cell's output to form the PmGRU, inspired by the idea of low-rank matrix decomposition. The output projection layer is shown to preserve most of the effective information while reducing the number of model parameters. Second, since too much context information from the previous layer, introduced by the context module, degrades model performance, we adjust the ratio of context information from the previous layer to that of the current layer by moving the position of the batch normalization layer, yielding the final RNN model, the Normalization Projected minimal Gated Recurrent Unit (Norm-PmGRU). On five automatic speech recognition (ASR) tasks, experiments show that Norm-PmGRU is more effective than mGRUIP-Ctx, TDNN-OPGRU, TDNN-LSTMP, and other baseline RNN acoustic models.


I. INTRODUCTION
In recent years, deep learning has been widely used in various fields, such as pattern recognition [1], [2], system control [3]-[5], and combinatorial optimization [6], [7]. In the field of speech recognition, the hybrid model, which relies on a combination of a deep neural network (DNN) and a Hidden Markov Model (HMM), still maintains state-of-the-art performance compared to end-to-end models [8]-[12]. The DNN in the DNN-HMM hybrid model, specifically a feed-forward neural network (FFNN) or a recurrent neural network (RNN), is used for acoustic modeling to predict phone-level targets.
Time delay neural networks (TDNN) [13]-[15] and feedforward sequential memory networks (FSMN) [16], [17] are two popular FFNN architectures that improve speech recognition performance by modeling long temporal contexts. For contextual modeling in a TDNN, each neuron at each layer receives input from a contextual window composed of multiple outputs of the layer below. The FSMN instead acquires context information through memory blocks: fixed-size memory blocks that use a tapped delay line structure to encode long-term context information. (The associate editor coordinating the review of this manuscript and approving it for publication was Yan-Jun Liu.)
Unlike an FFNN, which uses only the output of the previous layer as input, an RNN uses both the previous time step and the previous layer as input to determine the forward activation at the current time step. Thus, considering its ability to model a dynamic window over the entire sequence history instead of the fixed contextual window that an FFNN uses over the input sequence, the RNN is more suitable for sequence modeling. However, since vanishing and exploding gradients frequently occur when training vanilla RNNs, Long Short-Term Memory (LSTM) [18], the Gated Recurrent Unit (GRU) [19], and their variants Long Short-Term Memory with Projection (LSTMP) [20], Bidirectional Long Short-Term Memory with Projection (BLSTMP) [21], the Output-Gate Projected Gated Recurrent Unit (OPGRU) [22], the minimal Gated Recurrent Unit (mGRU, also named LiGRU) [23], [24], and the mGRUIP with Context module (mGRUIP-Ctx) [25] have been proposed to address these problems and are widely used in speech recognition. In [21], the authors compared the performance of TDNN-LSTMP and BLSTMP. In the LSTM, an input gate, output gate, and forget gate control the flow of information, whereas the mGRU [23], [24] uses only an update gate: as a simplified version of the GRU, the mGRU removes the reset gate, which gives it shorter training time and higher accuracy than the GRU. However, the mGRU cannot model future context. In [25], the authors insert an input linear projection layer into the mGRU to compress the input vector and the hidden state vector, and add context information from the layer below to the projection layer, yielding mGRUIP-Ctx. Due to the introduction of context information, mGRUIP-Ctx achieves better performance than LSTMP and GRU.
Unfortunately, because of the large dimension of the mGRUIP-Ctx output (2560 in [25]), increasing the length of the context window significantly increases the number of model parameters.
Low-rank factorization uses matrix/tensor decomposition to estimate the effective information in a matrix/tensor. If a weight matrix is low-rank, a decomposition can express it as the product of two small matrices, which significantly reduces the number of parameters. This method is widely used in neural networks: in [16], [26], [27], it is applied to FFNNs such as fully connected networks, TDNN and FSMN; in [21], [28]-[30], it is applied to RNNs such as LSTM and GRU. These experiments preserve the original network performance as much as possible while reducing the number of network parameters.
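As an illustration of this idea (our own sketch in Python with NumPy, not code from the cited works; the matrix dimensions and rank are chosen for illustration only), a truncated SVD factorizes a weight matrix into two small factors and the parameter counts can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 2560 x 1024 weight matrix (illustrative dimensions).
W = rng.standard_normal((2560, 1024))

# Truncated SVD keeps the r strongest singular directions, giving
# W ~= A @ B with far fewer parameters when r << min(m, n).
r = 256
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]          # 2560 x 256 factor
B = Vt[:r, :]                 # 256 x 1024 factor

full_params = W.size                 # 2,621,440 parameters
lowrank_params = A.size + B.size     # 917,504 parameters (~65% reduction)
print(full_params, lowrank_params)
```

In practice the rank r trades off compression against approximation error; the cited works typically retrain (or fine-tune) the factored network rather than relying on the SVD alone.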
Batch normalization (BN) [31] is an optimization method that normalizes the variance and mean of each layer's pre-activations in each training mini-batch. It can be applied at different places in an RNN. In [32], the authors suggest that the best way to apply batch normalization in mGRUIP-Ctx is as follows: for the candidate state, BN is used for the input-to-hidden (ItoH) and hidden-to-hidden (HtoH) connections; for the update gate, BN is used only for ItoH. In [33], the authors suggest applying BN to ItoH only for the candidate state. In [24], the authors apply BN to ItoH for both the candidate state and the update gate.
In this work, in order to increase the length of the context window without significantly increasing the number of model parameters, we propose a new architecture named the Projected minimal Gated Recurrent Unit (PmGRU). The proposed PmGRU is inspired by the low-rank decomposition of weight matrices in the previous works LSTMP [21] and cFSMN [16]: we insert a separate, smaller linear projection layer, named the output projection layer, after the mGRUIP-Ctx cell's output. We believe this linear projection layer can retain most of the valid information from the mGRUIP-Ctx cell's output. To further improve the performance of PmGRU, we apply BN to both ItoH and HtoH for the input projection vector, yielding the final model, Norm-PmGRU.
We evaluated our proposed model on five ASR tasks. Results show that Norm-PmGRU is more effective than mGRUIP-Ctx, TDNN-OPGRU, TDNN-LSTMP and other baseline models: it achieves comparable or better recognition accuracy while using the fewest parameters.
The rest of the paper is organized as follows. Section 2 presents the model architectures of the mGRU and its variants, including mGRUIP-Ctx and the proposed PmGRU and Norm-PmGRU. Section 3 presents the experimental setup. Section 4 presents the experimental results and analysis. Finally, the conclusion is given in Section 5.

II. PROPOSED MODEL ARCHITECTURES
In this section, we first introduce the model structures of the minimal Gated Recurrent Unit (mGRU) and the mGRUIP with Context module (mGRUIP-Ctx). Then the Projected minimal Gated Recurrent Unit (PmGRU) and the Normalization Projected minimal Gated Recurrent Unit (Norm-PmGRU) are introduced in detail.

A. MGRU (LIGRU)
The minimal Gated Recurrent Unit (mGRU), also named the Light Gated Recurrent Unit (LiGRU), is a simplified version of the GRU proposed in [23], [24]. Compared to the GRU, the mGRU makes three changes: removing the reset gate, replacing the hyperbolic tangent activation function with the rectified linear unit, and applying batch normalization to the feed-forward connections. This leads to the following equations:

z_t^l = σ(BN(W_zx^l x_t^l) + W_zh^l h_{t-1}^l + b_z^l)    (1)
h̃_t^l = ReLU(BN(W_hx^l x_t^l) + W_hh^l h_{t-1}^l + b_h^l)    (2)
h_t^l = z_t^l ⊙ h_{t-1}^l + (1 − z_t^l) ⊙ h̃_t^l    (3)
y_t^l = h_t^l    (4)

where x_t^l, z_t^l, h̃_t^l, h_t^l and y_t^l are the input vector, the update gate activation, the candidate state vector, the output state vector, and the output vector of layer l at the current frame t, respectively. The activation of the update gate is the logistic sigmoid function σ(·). W denotes weight matrices (e.g., W_zx^l is the matrix of weights from the input vector x_t^l to the update gate activation z_t^l), and b denotes bias vectors (e.g., b_z^l is the bias vector of z_t^l). ReLU(·) is the rectified linear unit function, ⊙ is the element-wise product of vectors, and BN(·) denotes batch normalization. For a layer with d-dimensional input x = (x^(1), ..., x^(d)), batch normalization makes each dimension have zero mean and unit variance:

x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(Var[x^(k)] + ε)    (5)

where x^(k) represents the k-th dimension of x and ε is a small constant for numerical stability. The expectation E[x^(k)] and variance Var[x^(k)] are computed over the training batch. In this article, the batch size is 128; that is, we normalize each dimension over all 128 features in the batch.
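To make the cell concrete, the following NumPy sketch (our own illustration with hypothetical dimensions and randomly initialized weights, not the paper's implementation) performs per-dimension batch normalization over a mini-batch and one mGRU time step, applying BN to the feed-forward connections only, as described above:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each dimension over the mini-batch (axis 0):
    # zero mean and unit variance per feature dimension.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def mgru_step(x, h_prev, Wzx, Wzh, Whx, Whh, bz, bh):
    """One mGRU (LiGRU) time step for a mini-batch.
    BN is applied to the feed-forward (input-to-hidden) connections only."""
    z = 1.0 / (1.0 + np.exp(-(batch_norm(x @ Wzx) + h_prev @ Wzh + bz)))  # update gate
    h_cand = np.maximum(0.0, batch_norm(x @ Whx) + h_prev @ Whh + bh)     # ReLU candidate
    return z * h_prev + (1.0 - z) * h_cand                                # interpolation

rng = np.random.default_rng(1)
B, D_in, D_h = 128, 40, 64            # batch of 128, as in the text; dims hypothetical
x, h0 = rng.standard_normal((B, D_in)), np.zeros((B, D_h))
params = [rng.standard_normal(s) * 0.1 for s in
          [(D_in, D_h), (D_h, D_h), (D_in, D_h), (D_h, D_h), (D_h,), (D_h,)]]
h1 = mgru_step(x, h0, *params)
print(h1.shape)  # (128, 64)
```

At inference time a real implementation would replace the batch statistics with running averages accumulated during training; this sketch shows only the training-time computation.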

B. MGRUIP-CTX
The mGRUIP with Context module (mGRUIP-Ctx) [25], [32] is obtained by inserting a linear input projection layer into the mGRU, with a context module (a temporal convolution) designed in the input projection layer to effectively model history and future context. The model is defined by the following equations:

v_{1t}^l = W_vx^l x̃_t^l    (6)
v_{2t}^l = W_vh^l h_{t−1}^l    (7)
v_t^l = v_{1t}^l + v_{2t}^l    (8)
z_t^l = σ(W_zv^l v_t^l + b_z^l)    (9)
h̃_t^l = ReLU(BN(W_hv^l v_t^l) + b_h^l)    (10)
h_t^l = z_t^l ⊙ h_{t−1}^l + (1 − z_t^l) ⊙ h̃_t^l    (11)

where v_t^l is the projection vector, calculated by adding the input vector x̃_t^l and the previous output state vector h_{t−1}^l after both are compressed into a lower-dimensional space by the weight matrices W_vx^l and W_vh^l, respectively; v_{1t}^l and v_{2t}^l are the projected vectors from x̃_t^l and h_{t−1}^l. z_t^l is the update gate, whose activation function is the logistic sigmoid σ(·). The output state vector h_t^l is also the output vector of the model.
x̃_t^l is the concatenation of the current input vector x_t^l and the output state vectors of the preceding layer for both the history and future frames:

x̃_t^l = [h_{t−s_1×K_1}^{l−1}; ···; h_{t−s_1×i}^{l−1}; ···; x_t^l; ···; h_{t+s_2×j}^{l−1}; ···; h_{t+s_2×K_2}^{l−1}]    (12)

In particular, x_t^l is the input vector of layer l, and h_{t+s×i}^{l−1} is the output state vector of layer l−1 at frame (t + s × i). s_1 (s_1 ≥ 1) and s_2 (s_2 ≥ 1) are the step sizes for the history and future frames, respectively. K_1 (K_1 ≥ 1) is the order of the history information and K_2 (K_2 ≥ 1) is the order of the future information; i (1 ≤ i ≤ K_1) and j (1 ≤ j ≤ K_2) are loop indices.
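As a small illustration of the index arithmetic in this concatenation (our own sketch; the particular orders and step sizes shown are hypothetical), the following function lists the frame indices the context module gathers for given K_1, s_1, K_2, s_2:

```python
def context_frames(t, K1, s1, K2, s2):
    # Frame indices concatenated by the context module: K1 history frames
    # spaced s1 apart, the current frame t, and K2 future frames spaced s2 apart.
    history = [t - s1 * i for i in range(K1, 0, -1)]
    future = [t + s2 * j for j in range(1, K2 + 1)]
    return history + [t] + future

# e.g. history order 3 with step 1, future order 1 with step 3 (hypothetical)
print(context_frames(t=100, K1=3, s1=1, K2=1, s2=3))
# [97, 98, 99, 100, 103]
```

The asymmetry between the history and future sides is what lets the model trade look-ahead latency against the amount of future context.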
The mGRUIP-Ctx architecture is shown in Figure 1. In this diagram, each line carries an entire vector from the output of one node to the inputs of others. Merging lines denote concatenation, and forking lines mean the contents are copied, with the copies going to different destinations. The circles represent pointwise operations, such as the vector element-wise product, and the solid boxes represent network layers that need to be learned.

C. PMGRU
The introduction of context information generally leads to an improvement in model performance. Assume that the number of hidden neurons of the model is dim(cell) and that the dimension of the input projection layer is dim(ip). The expressions for v_t^l and x̃_t^l show that, if the number of contexts contained in the context module is n, the size of the context module for mGRUIP-Ctx can be calculated as follows:

N_mGRUIP-Ctx = n × dim(cell) × dim(ip)    (13)

In order to reduce the growth of model size caused by increasing the number of contexts, an output projection layer is added after the mGRUIP-Ctx cell's output. The new model is called the Projected minimal Gated Recurrent Unit (PmGRU); it reduces the final output to a lower dimension. Assuming that the dimension of the output projection layer is dim(op), the size of the context module for PmGRU can be calculated as follows:

N_PmGRU = n × dim(op) × dim(ip)    (14)

Assuming dim(op) is equal to dim(ip), the difference between N_mGRUIP-Ctx and N_PmGRU is:

N_mGRUIP-Ctx − N_PmGRU = n × dim(ip) × (dim(cell) − dim(op))    (15)

Compared to the expressions of mGRUIP-Ctx, PmGRU adds one equation:

y_t^l = W_yh^l h_t^l    (16)

and equation (12) now becomes:

x̃_t^l = [y_{t−s_1×K_1}^{l−1}; ···; y_{t−s_1×i}^{l−1}; ···; y_t^{l−1}; ···; y_{t+s_2×j}^{l−1}; ···; y_{t+s_2×K_2}^{l−1}]    (17)

where y_t^l is the output of h_t^l through the output projection layer, and x̃_t^l is the concatenation of the output vectors y^{l−1} of the preceding layer. Figure 2 is an expansion diagram of PmGRU along the time axis. The superscript of each letter denotes the layer where the PmGRU unit is located, and the subscript denotes its time step. The dashed arrows in the graph represent the context module, which extracts context information from the previous layer and adds it to the affine transformation of h_{t−1}^l to obtain the input projection vector v_t^l.
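The parameter saving can be checked numerically. The sketch below is our own illustration: the dimensions dim(cell) = 2560 and dim(ip) = dim(op) = 256 follow the experimental setup described in Section IV, while the context count n = 5 is a hypothetical value:

```python
def ctx_params_mgruip(n, dim_cell, dim_ip):
    # Each of the n context frames contributes a dim(cell)-dimensional output
    # of the layer below, projected into the dim(ip)-dimensional space.
    return n * dim_cell * dim_ip

def ctx_params_pmgru(n, dim_op, dim_ip):
    # With the output projection layer, each context frame is only
    # dim(op)-dimensional before projection into the input projection space.
    return n * dim_op * dim_ip

n, cell, ip, op = 5, 2560, 256, 256   # n is hypothetical
print(ctx_params_mgruip(n, cell, ip))  # 3,276,800
print(ctx_params_pmgru(n, op, ip))     # 327,680
```

With dim(op) = dim(cell)/10, the context module shrinks by a factor of ten regardless of n, which is what allows the context window to grow without a corresponding parameter explosion.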

D. APPLYING BATCH NORMALIZATION IN PMGRU
In the PmGRU architecture, context information comes from two parts. The first part is the history context information from the same layer (the previous hidden state vector h_{t−1}^l), called the hidden-to-hidden (HtoH) connection. The second part is generated by the context module, which is similar to a TDNN and is called the input-to-hidden (ItoH) connection; it uses a set of time-shift-invariant filters to extract context information from the layer below.
Theoretically, introducing more context information into the context module should improve the performance of the model. However, experiments demonstrate that once the context length reaches a certain size, the model's accuracy declines. We attribute this to a ratio imbalance between the context information from HtoH and from ItoH in the input projection vector. From equations (6), (7), (8) and (12), the input projection vector v_t is obtained by adding the lower-layer information v_1t introduced by ItoH and the current-layer information v_2t introduced by HtoH. Since v_1t is obtained by concatenating the input values of multiple time points in the lower layer, the values in each dimension of v_1t will be much larger than those in the corresponding dimension of v_2t.
In this case, if we want to adjust the ratio of v_1t and v_2t in v_t, the obvious idea is to standardize the interiors of v_1t and v_2t separately. However, since each dimension of a vector represents a different feature, it is not appropriate to standardize within a vector directly. Instead, we use batch normalization to standardize each dimension of the vector over a batch.
There are three ways to set the BN layer.
(1) Apply BN only to HtoH:

v_t^l = v_{1t}^l + BN(v_{2t}^l)    (18)

(2) Apply BN only to ItoH:

v_t^l = BN(v_{1t}^l) + v_{2t}^l    (19)

(3) Apply BN to both HtoH and ItoH:

v_t^l = BN(v_{1t}^l) + BN(v_{2t}^l)    (20)

The rest of the equations in Norm-PmGRU are as follows:

v_{1t}^l = W_vx^l x̃_t^l    (21)
v_{2t}^l = W_vh^l h_{t−1}^l    (22)
z_t^l = σ(W_zv^l v_t^l + b_z^l)    (23)
h̃_t^l = ReLU(BN(W_hv^l v_t^l) + b_h^l)    (24)
h_t^l = z_t^l ⊙ h_{t−1}^l + (1 − z_t^l) ⊙ h̃_t^l    (25)
y_t^l = W_yh^l h_t^l    (26)
x̃_t^l = [y_{t−s_1×K_1}^{l−1}; ···; y_{t−s_1×i}^{l−1}; ···; y_t^{l−1}; ···; y_{t+s_2×j}^{l−1}; ···; y_{t+s_2×K_2}^{l−1}]    (27)

The input projection vector v_t^l is used to update z_t^l and h̃_t^l after an affine transformation. When updating h̃_t^l, a BN layer is added before the activation function ReLU(·); no special operation is performed when z_t^l is updated.
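The three placements can be sketched as follows (our own NumPy illustration with random data, not the paper's code). Deliberately giving v_1t a larger scale than v_2t mimics the imbalance discussed above, and normalizing both parts brings their contributions back to a comparable scale:

```python
import numpy as np

def bn(x, eps=1e-5):
    # Per-dimension batch normalization over the mini-batch axis.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def input_projection(v1, v2, mode):
    # The three candidate BN placements for forming v_t from the
    # ItoH part v1 (context module) and the HtoH part v2 (recurrence).
    if mode == "htoh":            # (1) BN on HtoH only
        return v1 + bn(v2)
    if mode == "itoh":            # (2) BN on ItoH only
        return bn(v1) + v2
    if mode == "both":            # (3) BN on both, used by Norm-PmGRU
        return bn(v1) + bn(v2)
    raise ValueError(mode)

rng = np.random.default_rng(2)
# v1 is given a deliberately larger magnitude than v2, mimicking the
# imbalance caused by concatenating many context frames.
v1 = 10.0 * rng.standard_normal((128, 256))
v2 = rng.standard_normal((128, 256))
v = input_projection(v1, v2, "both")
# After BN on both parts, each contributes at a comparable scale.
print(np.abs(bn(v1)).mean() / np.abs(bn(v2)).mean())  # ~1.0
```

In modes (1) and (2) the un-normalized part keeps its original scale, so whichever part has the larger magnitude still dominates v_t; only mode (3) equalizes both.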

III. EXPERIMENTAL PREPARATION
This section will introduce five ASR tasks and their experimental setups used in this work. All our experiments were performed on the Kaldi toolkit [34].

A. EXPERIMENTAL TASK
In order to compare the results of the models, we conducted experiments on four Chinese Mandarin tasks and one English task.

1) ST-CMDS TASK
The ST Chinese Mandarin Corpus contains 86 hours of speech. We divide the data into a training set and a test set at a ratio of 14:1. The test set contains 74,359 characters.

3) AIDATATANG TASK
The AIDATATANG open Mandarin speech task contains 200 hours of acoustic data, mostly mobile-recorded. The database is divided into training, development, and test sets at a ratio of 7:1:2. The development and test sets contain 234,524 and 468,933 characters, respectively.

5) LIBRISPEECH TASK
LibriSpeech is a 1000-hour English speech corpus. We select the two subsets with the highest label accuracy, dev_clean and test_clean, as test sets; they contain 54,402 and 52,576 words, respectively.

B. GMM-HMM EXPERIMENTAL SETUP
A GMM-HMM model was used to generate alignments between the corpus and its transcription. A monophone GMM-HMM model was first trained using 13-dimensional Mel-frequency cepstral coefficient (MFCC) features plus 3-dimensional pitch features [36]. After binding the monophones with a decision tree, a small triphone model and a larger triphone model were trained consecutively using 48-dimensional delta features [35]. Finally, maximum likelihood linear transform (MLLT) and speaker adaptive training (SAT) were applied to train a speaker-dependent GMM-HMM model. For AISHELL-2, however, we stopped GMM-HMM training at the speaker-independent stage: in our experience, on thousands of hours of corpus the performance of the final system depends primarily on the generalization ability of the DNN model, so it is not worth spending too much time at the GMM-HMM stage.

C. DNN EXPERIMENTAL SETUP
After obtaining the alignment between the corpus and the word transcription, we used it to train the DNN models. All the DNN models in this paper are trained with the following settings. To make the DNN-based acoustic models more robust to the tempo and volume variances of the test data, we used the speed perturbation technique [37] to expand the training set to three times its original size, with perturbation factors of 0.9, 1.0 and 1.1. The 40-dimensional MFCC, 3-dimensional pitch and 100-dimensional i-vector features are used to train the DNN-based acoustic models [35]. The input feature at time t is spliced from frame t − 2 to frame t + 2. All models had a delay tolerance of 50 ms during training; adding the fixed delay of 20 ms from the previous step, all models thus have a delay of 70 ms. We also used the subsampling technique to reduce the output frame rate of the model from 100 Hz to 33 Hz [38].
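The frame-splicing step can be sketched as follows (our own illustration on a toy feature matrix, not Kaldi's implementation; edge frames are handled here by clamped repetition, which is one common convention):

```python
import numpy as np

def splice(frames, left=2, right=2):
    # Splice each frame with its +-2 neighbours, matching the
    # t-2 .. t+2 input context described above. Edge indices are
    # clamped, so boundary frames are repeated.
    T, D = frames.shape
    idx = np.clip(np.arange(T)[:, None] + np.arange(-left, right + 1), 0, T - 1)
    return frames[idx].reshape(T, (left + right + 1) * D)

feats = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 dims (toy)
spliced = splice(feats)
print(spliced.shape)   # (6, 10): 5 spliced frames x 2 dims each
```

Each row of the result is the concatenation of five consecutive frames centered on t, which is what the first DNN layer actually receives.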
The training criterion adopted in this paper is the lattice-free maximum mutual information (LF-MMI) criterion [39]. The objective function of LF-MMI is given by:

F_LF-MMI = Σ_m log ( p(O_m | S_m^r) P(W_m^r) / Σ_{W_m} p(O_m | S_m) P(W_m) )    (28)

where O_m is the speech corpus and W_m^r is the corresponding reference transcription; W_m is a hypothesis sequence from the denominator graph; and S_m^r and S_m are the state sequences corresponding to W_m^r and W_m, respectively. The gradient of LF-MMI training can be written as:

∂F_LF-MMI / ∂ log p(o_t | s_t = j) = γ_m^num(s_t = j) − γ_m^den(s_t = j)    (29)

where log p(o_t | s_t = j) is the log-likelihood of o_t at a given state j, and γ_m^num(s_t = j) and γ_m^den(s_t = j) are the posterior probabilities that the t-th state of the reference and hypothesis state sequences, respectively, is j, given the input corpus O_m.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we evaluate the effectiveness of our two proposed models on ASR tasks. The first model is the Projected minimal Gated Recurrent Unit (PmGRU), and the second is the Normalization Projected minimal Gated Recurrent Unit (Norm-PmGRU). All the models in this section are trained after the GMM-HMM step, following the DNN training process introduced above.

A. PMGRU EXPERIMENTS
We evaluated PmGRU on the AISHELL-1 task and compared its performance with that of the mGRUIP with Context module (mGRUIP-Ctx). When setting up the PmGRU architecture, we followed the parameter settings of mGRUIP-Ctx [32]. The architecture of the neural network is shown in Figure 3. The parameters of the PmGRU architecture are set as follows: (1) The architecture has five PmGRU hidden layers, the first of which does not contain a context module.
(2) Considering that the dimension of the input projection layer in mGRUIP-Ctx is 1/10 of the cell dimension, we set the dimension of the output projection layer of PmGRU to 1/10 of the original output dimension. Specifically, the dimension of the memory cells in the PmGRU structure is 2560, so the dimension of the PmGRU output projection layer is set to 256. For the last PmGRU hidden layer, however, no output projection layer is set, since a 512-dimensional bottleneck layer is already added between the last hidden layer and the output layer.
The performance of PmGRU with different context modules is compared. The step sizes s_1 and s_2, the corresponding context orders K_1 and K_2, and the final results are shown in Table 1. The format of context order and step size in the table is K_1 × s_1; K_2 × s_2. The "Total" item in the table is the weighted average of the "Dev" and "Test" items, where the weights are the numbers of characters in the corresponding data sets.
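The "Total" column is a simple character-count-weighted average, which can be sketched as follows (our own illustration; the CER values and character counts below are placeholders, not numbers from Table 1):

```python
def weighted_total_cer(dev_cer, test_cer, dev_chars, test_chars):
    # "Total" = average of the Dev and Test CER, weighted by the
    # number of characters in each data set.
    return (dev_cer * dev_chars + test_cer * test_chars) / (dev_chars + test_chars)

# Illustrative values only.
print(round(weighted_total_cer(6.0, 7.0, 100000, 50000), 3))  # 6.333
```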
Firstly, comparing mGRUIP-Ctx and PmGRU_A with the same context module, the character error rates (CER) of the two models are close, but PmGRU_A is 20% smaller than mGRUIP-Ctx. This shows that the output projection layer can reduce the size of the model while retaining valid information. Compared with mGRUIP-Ctx, the best-performing model, PmGRU_C, achieves a 3.2% relative CER reduction while being 16% smaller.
Secondly, comparing the four PmGRU models, we find that, within a certain range, the accuracy of the model increases with the context length. However, once the context length exceeds a certain limit, the performance of the model decreases.

B. NORM-PMGRU EXPERIMENTS
For the situation mentioned above, in which a longer context length in the context module (ItoH) reduces the model's accuracy, we believe the cause is a ratio imbalance between the context information from HtoH and from ItoH in the input projection vector v_t: too much context information introduced by ItoH can mask the context information from HtoH. Therefore, adding a batch normalization layer should adjust the ratio of context information from HtoH and ItoH in v_t. Several groups of comparative experiments were conducted on AISHELL-1; the experimental setup and results are shown in Table 2. The parameter settings of Norm-PmGRU_A and Norm-PmGRU_B in Table 2 are the same as those of PmGRU_C and PmGRU_D in Table 1, respectively. From the comparison of the Norm-PmGRU_A models in Table 2, we find that applying batch normalization to both ItoH and HtoH gives the best performance: the values of HtoH and ItoH in a batch are standardized separately, so their effects on the input projection vector become comparable. Comparing the results in Table 2 and Table 1, Norm-PmGRU_A achieves a 1.7% relative CER reduction over PmGRU_C. Comparing Norm-PmGRU_A and Norm-PmGRU_B, the performance of Norm-PmGRU_B is not significantly better, while its latency is 20 ms higher; therefore, the Norm-PmGRU used in the following experiments refers to Norm-PmGRU_A.
In order to study the effects of these two improvements on the model, we conducted ablation experiments; the results are shown in Table 3. "Y" in the table indicates that the item exists in the model, and "N" indicates that it does not. From the results, we find that both applying BN (to HtoH and ItoH) and adding the output projection layer improve the performance of the model; of the two, BN brings the larger improvement.
The impact of BN on model performance is further analyzed. Figure 4 shows the result of decoding a segment of audio (utterance BAC009S0764W0121 from the AISHELL-1 test set) with PmGRU_C and the three Norm-PmGRU_A models with different BN placements. For each decoded frame, we retain the input projection vector v_t of the fifth PmGRU layer, the context information vector v_1t from the lower layer (ItoH), and the context information vector v_2t from the current layer (HtoH). In Figure 4, the horizontal axis represents time and the vertical axis represents the average value over the dimensions of each vector. The blue circles represent the input projection vector v_t, the green pentagrams represent v_1t (ItoH), and the red triangles represent v_2t (HtoH).
From Figure 4, we find that the average value of the input projection vector in PmGRU is mainly affected by the average value of the context information vector from ItoH. When BN is applied to only ItoH or only HtoH, the input projection vector is dominated by the other, non-normalized connection. Only when BN is applied to ItoH and HtoH separately does the input projection vector take into account both the information from the current layer and the information from the layer below. Table 4 lists the baseline architectures, including the Time Delay Neural Network-minimal Gated Recurrent Unit (TDNN-mGRU). When setting up the TDNN-mGRU architecture, the TDNN parameters follow the context module parameters of mGRUIP-Ctx [32], and the cell dimension is consistent with the LSTMP setting in TDNN-LSTMP [20]. In Table 4, L_f denotes a forward LSTMP layer, L_b a backward LSTMP layer, T a TDNN layer, m an mGRU layer, O an OPGRU layer, and B a bottleneck layer.

1) COMPARISON BETWEEN NORM-PMGRU AND LSTM'S VARIANTS
In this section, we compare the performance of Norm-PmGRU with TDNN-LSTMP and BLSTMP in different ASR tasks. All results are shown in Table 5.
LSTMP is formed by inserting a recurrent projection layer and a non-recurrent projection layer into the LSTM after the cell output units. These two projection layers let the LSTMP architecture use parameters more efficiently than the standard LSTM architecture. BLSTMP is a combination of a forward LSTMP and a backward LSTMP; this structure provides complete past and future context for each point in the output sequence. Comparing Norm-PmGRU with BLSTMP, Norm-PmGRU performs better on ST-CMDS, AIDATATANG, AISHELL-1 and LibriSpeech, while using about 45% fewer parameters and having lower latency. Because BLSTMP has a latency of 2030 ms, it is not suitable for online speech recognition.
Norm-PmGRU also performs better than TDNN-LSTMP on the four Chinese ASR tasks, with both a lower CER and a smaller size. Specifically, with 45% fewer parameters than TDNN-LSTMP, Norm-PmGRU achieves relative CER reductions over TDNN-LSTMP of 1.56% on ST-CMDS, 6.18% on AISHELL-1, 5.40% on AIDATATANG, and 1.68% on AISHELL-2. At the same time, Norm-PmGRU decodes about 3.5 times faster, as commonly measured by the real-time factor (RTF). The minor drawback is that Norm-PmGRU introduces 80 ms more latency than TDNN-LSTMP.

2) COMPARISON BETWEEN NORM-PMGRU AND GRU'S VARIANTS
In this section, we compare the performance of Norm-PmGRU with TDNN-mGRU, TDNN-OPGRU and mGRUIP-Ctx in different ASR tasks. All results are shown in Table 5.
The OPGRU is developed from the GRU by projecting its memory cell into a recurrent part and a non-recurrent part, and by replacing its reset gate with an output gate. The recurrent part of the projected memory cell is used for the recurrence of the OPGRU, while the whole projected memory cell is the output fed into the next layer; the concept is similar to LSTMP. Comparing Norm-PmGRU with TDNN-OPGRU, Norm-PmGRU performs better on all five tasks. Specifically, even though the model size of Norm-PmGRU is 39% smaller than that of TDNN-OPGRU, its CER (WER) is 3.65%, 7.36%, 8.51%, 4.15% and 4.32% relatively lower on ST-CMDS, AISHELL-1, AIDATATANG, AISHELL-2 and LibriSpeech, respectively. mGRU is a simplified version of the GRU obtained by removing the reset gate; mGRUIP-Ctx and Norm-PmGRU are both built on the basis of mGRU. Even with a TDNN layer added to introduce context information, TDNN-mGRU still performs worse than mGRUIP-Ctx and Norm-PmGRU. mGRUIP-Ctx is formed by adding an input projection layer to the mGRU and then introducing a context module, while Norm-PmGRU is obtained by adding an output projection layer to mGRUIP-Ctx and changing the position of the batch normalization layer. Norm-PmGRU performs better than mGRUIP-Ctx on the four Chinese tasks: the CER of Norm-PmGRU is relatively reduced by 5.95%, 4.52%, 4.83% and 2.43% on ST-CMDS, AISHELL-1, AIDATATANG and AISHELL-2, respectively, which we attribute to the additional context information in Norm-PmGRU. At the same time, thanks to the output projection layer, Norm-PmGRU uses parameters more efficiently: more context information is added, while the number of parameters is about 16% smaller than in mGRUIP-Ctx.

V. CONCLUSION
In this paper, we revised the mGRUIP-Ctx acoustic model for speech recognition. The proposed Norm-PmGRU architecture is an improved version of mGRUIP-Ctx, in which an output projection layer is added after the mGRUIP-Ctx cell's output. In addition, batch normalization is used to further improve model performance and to adjust the ratio of context information from the previous layer to that of the current layer.
Experiments on five different ASR tasks have shown the effectiveness of the proposed model. Compared to the baseline models, our model achieves comparable or better recognition accuracy while using the fewest parameters. Specifically, on ST-CMDS, AISHELL-1, AIDATATANG and AISHELL-2, the character error rate (CER) of Norm-PmGRU is relatively reduced by 5.95%, 4.52%, 4.83% and 2.43% compared with mGRUIP-Ctx. On LibriSpeech, Norm-PmGRU has almost the same word error rate (WER) as mGRUIP-Ctx. Meanwhile, Norm-PmGRU uses about 16% fewer parameters than mGRUIP-Ctx.

RENJIAN He is the author of more than 50 research papers in various refereed international journals and conferences. He has successfully completed a number of research projects for various national agencies. His current research interests include speech recognition, digital signal processing, multi-source information fusion, wireless sensor networks, and intelligent instrumentation.

JIAXUAN YAN received the B.E. degree from the Hefei University of Technology, Hefei, China, in 2020. She is currently pursuing the M.A.Eng. degree with the School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing, China. Her current research interests include speech recognition, speech signal processing, and deep learning.

VOLUME 8, 2020