End-to-End Speech Endpoint Detection Utilizing Acoustic and Language Modeling Knowledge for Online Low-Latency Speech Recognition

Speech endpoint detection (EPD) benefits from the decoder state features (DSFs) of an online automatic speech recognition (ASR) system. However, the DSFs are obtained via the ASR decoding process, which can become prohibitively expensive, especially in limited-resource scenarios such as embedded devices. To address this problem, this paper proposes a language model (LM)-based end-of-utterance (EOU) predictor, which is trained in an end-to-end manner to determine the framewise probabilities of the EOU token conditioned on the previous word history obtained from the 1-best decoding hypothesis of the ASR system, so that no actual decoding is required at test time. Further, a novel end-to-end EPD strategy is presented that incorporates phonetic embedding (PE)-based acoustic modeling knowledge and the proposed EOU predictor-based language modeling knowledge into an acoustic feature embedding (AFE)-based EPD approach within a recurrent neural network (RNN)-based EPD framework. The proposed EPD algorithm is built upon ensemble RNNs, which are independently trained for three parts, namely the proposed LM-based EOU predictor, the AFE-based EPD, and the PE-based acoustic model (AM), in accordance with each target. The ensemble RNNs are concatenated at the level of their last hidden layers and then fed into a fully-connected deep neural network (DNN)-based EPD classifier, which is trained in accordance with the ultimate EPD target. Thereafter, all networks are jointly retrained in a second training step to yield a lower endpoint error. The proposed EPD framework was evaluated in terms of endpoint accuracy and word error rate on the CHiME-3 and large-scale ASR tasks. The experimental results show that the proposed EPD algorithm efficiently outperforms the conventional EPD approaches.


I. INTRODUCTION
Spoken dialogue systems make it possible to control contemporary devices, such as smartphones, navigation systems, and AI speakers, through natural voice interaction. Usually, the interaction with such devices is user-initiated by uttering a wake-up word. Then, an automatic speech recognition (ASR) technique is performed in an online manner until an end-of-utterance (EOU) is automatically detected by a speech endpoint detection (EPD) algorithm. EPD is a challenging task since the utterance can be endpointed late due to ambient noise or early due to a long pause hesitation. Since an early endpoint undesirably cuts off the speech region, the performance of speech recognition is often seriously degraded; on the other hand, a late endpoint increases the response latency of the online ASR system. Consequently, degraded endpoint performance often causes user dissatisfaction [1], [2].
The traditional EPD approaches consist of two cascaded decision processes. First, input speech is classified into speech and non-speech on a frame basis using a speech activity detection (SAD) algorithm designed with engineered features [3]-[7]. Then, the EOU is finally detected when the duration of non-speech obtained using the SAD algorithm reaches a pre-defined threshold value, e.g., 500 ms or 1 s [8]. Chung et al. proposed an EPD algorithm that classifies speech and non-speech states using the SAD technique based on a log-likelihood ratio (LLR) test proposed in [9], and then finds the endpoint with an online decoder designed based on a weighted finite-state transducer (wFST) [10]. Since it is difficult to optimize the LLR test-based SAD and the wFST jointly, this EPD scheme was further improved by adopting quantized LLR states as the wFST input instead of the binary speech/non-speech state [11]. The performance of these EPD structures is dramatically enhanced with the help of SAD algorithms based on deep neural networks (DNN), which yield state-of-the-art SAD performance via deep nonlinear hidden layers [12]-[17]. In particular, it was observed that the bottleneck features of the DNN-based acoustic model (AM), called the phonetic embedding (PE), which is trained to predict senones (tied triphone states) [18], lead to improved SAD and EPD performance [19]-[21]. Another approach is to directly find the EOU from the sequential input features by employing a long short-term memory (LSTM) network [22], whereas the traditional EPD schemes consist of a separate SAD and online decoder.
The LSTM can model the complex relations between the input feature sequence and the corresponding framewise EPD targets since its memory cell, which serves as the temporal state of the network, can be successfully controlled by the input, forget, and output gates [23]-[26]. Notably, a unified architecture comprising convolutional neural networks (CNN), LSTM, and a fully-connected DNN, called CLDNN, was proposed to exploit their complementary advantages for the ASR [27] and SAD tasks [28]. However, it was observed that the capability of the convolution layer is diminished when extracting features in adverse noisy conditions [29]. To address this problem, an alternative method called the grid-LSTM [30] was presented to properly model the time and frequency variations of input sequential features within separate time and frequency LSTM cells, respectively. Furthermore, a grid-LSTM DNN (GLDNN) [31] was introduced by employing the grid-LSTM in the first layer instead of the convolutional layer of the CLDNN to improve the EPD performance. However, these feature mapping-based EPD approaches often prematurely abandon the speech region due to a pause hesitation, or cause a higher detection latency, since they cannot adequately consider the context of input feature sequences such as phone or word alignments. In addition, the performance of the LSTM-based EPD approach can be degraded for a long utterance since the LSTM suffers from a state saturation problem due to the degradation of its gate controls [32].
In addition, EPD approaches that use the decoder state features (DSFs) of the ASR module as auxiliary features have been introduced to distinguish the EOU from a short or long pause under noisy environments. First, Ferrer et al. developed a prosodic feature-based EPD method that yields the EOU decision when a pause of any length is detected, where the decision statistic is determined by using the non-speech duration, prosodic features, and language model (LM) knowledge [33]-[35]. In addition, Stuker et al. proposed a simple EPD approach, similar to the aforementioned methods, to segment continuously recorded speech by triggering the EOU when the pause duration reaches a maximum pause threshold. Here, the pause duration can be obtained from the phone alignment corresponding to the 1-best ASR decoding hypothesis [36]. However, this approach cannot be applied to online ASR systems since the 1-best decoding hypothesis frequently changes during the online decoding process, which makes the EPD system unstable. To overcome this disadvantage, the expected pause duration was introduced as a stable feature for the online EPD task since it is obtained by interpolating the pause durations over all active hypotheses [37], [38]. Furthermore, it was observed that the word embedding (WE), which is obtained from a word LSTM [39] trained with the 1-best ASR decoding hypothesis to detect the turn-taking word, can significantly improve the performance of acoustic feature embedding (AFE)-based EPD without an actual decoding process, whereas the combination of the AFE, WE, and expected pause durations achieves the state-of-the-art EPD performance. Also, [40] incorporated the EOU symbol into the output of a unified recurrent neural network (RNN) transducer-based ASR system.
Although various frameworks for on-device speech recognition have been proposed [41]-[43], the speech recognition accuracy is still limited by the heavy computational cost since there is a trade-off between the word error rate (WER) and computational cost. Indeed, it is difficult to exploit the advantages of a superior ASR system, which requires high complexity, in limited-resource scenarios. Moreover, the WE cannot be considered a reliable feature for the online EPD task since context-dependent EPD approaches, including the WE-based EPD, suffer from the ambiguity of the turn-taking word due to the flexibility of natural language [44].
In order to address the aforementioned disadvantages of conventional EPD approaches, this paper proposes an end-to-end EPD algorithm that incorporates both acoustic and language modeling knowledge into the AFE-based EPD algorithm. First, the LM-based EOU predictor, which is trained to determine the framewise probabilities of the EOU token conditioned on the previous word history of the 1-best hypothesis obtained using the ASR decoding process, is presented. Once the proposed EOU predictor is trained, it can derive the framewise probabilities of the EOU token given the input acoustic features without the actual ASR decoding process. Further, we introduce a novel EPD framework consisting of the proposed LM-based EOU predictor, the PE-based AM, and the AFE-based EPD. When training the EOU predictor, the 1-best ASR decoding hypothesis with the N-gram LM is used to obtain the probabilities of the EOU token, which serve as the training target of the RNN. The AFE-based EPD algorithm is designed with an RNN to classify each frame into four labels, namely speech, initial silence, final silence, and intermediate silence, on a frame-by-frame basis. Also, the AM for extracting acoustic modeling knowledge through the PE is trained with an RNN using the sequential input features as the input along with senone targets. Then, the last hidden layers of the three ensemble RNNs are concatenated to train the fully-connected DNN-based classifier according to the hand-made EPD label. Finally, all the designed EPD networks are jointly retrained, thereby leading to a lower endpoint error. The proposed EPD algorithm was evaluated in terms of the early endpoint time, late endpoint time, and WER for the CHiME-3 ASR task [45], which includes various simulated and real acoustic conditions, and a large-scale ASR task.
Overall, the proposed EPD algorithm without the decoding process was observed to achieve a lower endpoint error, which leads to a lower WER and lower latency.
The rest of this paper is organized as follows. In the next section, we review the recently proposed EPD algorithms. In Section III, we describe the design of the proposed EPD approach. An extensive evaluation of the proposed algorithm is discussed in Section IV, and the conclusions are presented in Section V.

II. REVIEW OF PREVIOUS WORKS
This section briefly describes the conventional EPD algorithms which will be compared with the proposed EPD algorithm later.

A. EPD USING A GLDNN
The CLDNN-based architecture was previously introduced to exploit the complementary modeling advantages of the CNN, LSTM, and fully-connected DNN [27]. First, the CNN can extract time- and frequency-invariant features from sequential input features such as the Mel-frequency cepstral coefficients (MFCC) and log-Mel filterbank energies. In addition, the LSTM can model the short- and long-term temporal contexts of input features, and the fully-connected DNN can model the complex relation between the features, which are represented via the CNN and LSTM, and the EPD target through multiple nonlinear hidden layers. As discussed in [29], the convolution layer for feature extraction deteriorates in highly noisy conditions; hence, the alternative architecture called GLDNN [31] was introduced to replace the convolution layer with the grid-LSTM layer [30]. The grid-LSTM models the variations of successive features along the time and frequency axes through a separate grid time LSTM (gT-LSTM) and grid frequency LSTM (gF-LSTM), respectively. Here, the grid-LSTM is similar to the convolution layer in that both models are used to represent the input features over a restricted local time-frequency block and they use shared model parameters. However, the grid-LSTM differs from the convolution layer in that it models frequency variations through a recurrent state that is passed along the frequency axis, whereas the convolution layer independently extracts the locally invariant features via the convolution and pooling operations.
The GLDNN-based EPD technique consists of stacked grid-LSTM layers, standard LSTM layers, and fully-connected DNN layers. Once the time- and frequency-invariant features are extracted by the grid-LSTM layer, their short- and long-term temporal contexts are modeled by the standard LSTM layers. Each frame is finally classified into four distinct classes, namely speech, initial silence, final silence, and intermediate silence, to distinguish the final silence from the other silence states in the utterance. In the test step, the posterior probability of the final silence is computed and the EPD is triggered when it exceeds a given threshold value.

B. EPD BASED ON COMBINING AFE, WE, AND DSFs
The combined feature-based EPD algorithm [39] consists of three parts to detect the EOU exactly by fusing multiple features: an acoustic LSTM trained on the acoustic features, a word LSTM trained on the 1-best ASR decoding hypothesis, and the DSFs composed of three types of pause durations, which are described as follows. First, the acoustic LSTM learns the AFE in accordance with the framewise endpoint target. The corresponding SAD target is also trained in a multi-task fashion to distinguish the final silence from the initial and intermediate silence. Unlike the acoustic LSTM, the word LSTM is trained on the 1-best word sequence to detect the turn-taking word. Hence, the word LSTM is triggered when alignments corresponding to the turn-taking word are observed instead of the final silence region, where the alignment is obtained from the 1-best ASR decoding hypothesis. To consider the decoder state, three types of expected pause durations extracted from the active ASR decoding hypotheses are utilized as the DSFs: the best path pause duration, the expected pause duration, and the end pause duration. Letting $X_t = \{x_1, x_2, \ldots, x_t\}$ and $s_t^i = \{s_1^i, s_2^i, \ldots, s_t^i\}$ be the input feature sequence until the $t$-th frame and the state sequence of the $i$-th active hypothesis until the $t$-th frame, respectively, the posterior probability of the $i$-th hypothesis is denoted by $P(s_t^i \mid X_t)$. First, the best path pause duration is $L_t^{\hat{i}}$, where $L_t^i$ denotes the pause duration according to the $i$-th hypothesis and $\hat{i} = \arg\max_i P(s_t^i \mid X_t)$. Second, the expected pause duration $D(L_t)$ is obtained by interpolating over the active hypotheses as $D(L_t) = \sum_{i=1}^{N_t} P(s_t^i \mid X_t)\, L_t^i$, where $N_t$ denotes the number of active hypotheses at the $t$-th frame. Finally, the end pause duration $D_{\mathrm{end}}(L_t)$ can be determined as $D_{\mathrm{end}}(L_t) = \sum_{i=1}^{N_t} P(s_t^i \mid X_t)\, L_t^i\, \mathbb{1}(s_t^i \in S_{\mathrm{end}})$, where $S_{\mathrm{end}}$ denotes the end state of the LM.
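The three pause-duration features can be illustrated with a minimal numpy sketch. This is our own illustration, not the implementation of [39]; variable names and values are illustrative, and the back-off-weight and end-state bookkeeping of a real decoder is abstracted away into simple arrays.

```python
import numpy as np

def decoder_state_features(posteriors, pause_durations, in_end_state):
    """Sketch of the three DSFs (illustrative names, not from [39]).

    posteriors      : P(s_t^i | X_t) for the N_t active hypotheses (normalized)
    pause_durations : L_t^i, current pause duration (in frames) of each hypothesis
    in_end_state    : whether each hypothesis has reached the LM end state S_end
    """
    p = np.asarray(posteriors, dtype=float)
    L = np.asarray(pause_durations, dtype=float)
    end = np.asarray(in_end_state, dtype=bool)

    best_path_pause = L[np.argmax(p)]       # pause duration of the 1-best hypothesis
    expected_pause = float(np.sum(p * L))   # posterior-weighted interpolation D(L_t)
    end_pause = float(np.sum(p * L * end))  # restricted to end-state hypotheses
    return best_path_pause, expected_pause, end_pause
```

For example, with posteriors (0.6, 0.3, 0.1), pause durations (30, 10, 0), and only the first hypothesis in the LM end state, the sketch yields a best path pause of 30 frames, an expected pause of 21 frames, and an end pause of 18 frames.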
At each frame, the feature vectors for the EPD are combined with the last hidden layer of both the acoustic LSTM and word LSTM along with the DSFs. The fully-connected DNN-based classifier is finally trained with the combined feature vector in accordance with the framewise endpoint target.
In the inference step, the EPD is triggered when the posterior probability of the endpoint exceeds a given threshold. To safeguard the lower and upper latency (pause duration) bounds, the SAD decision of the acoustic LSTM is additionally used as follows. If the pause duration obtained by the trained SAD does not reach the minimum pause duration, T min , the endpoint is not triggered. Furthermore, the endpoint is forced to be triggered if the pause duration obtained by the SAD is longer than the maximum pause duration, T max .
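The safeguarded decision rule above can be sketched as a small function. The frame counts and threshold here are illustrative assumptions, not values from [39]:

```python
def safeguarded_endpoint(endpoint_posterior, threshold, sad_pause_frames,
                         t_min=20, t_max=100):
    """Endpoint decision with latency bounds (T_min/T_max values illustrative).

    The classifier decision is suppressed until the SAD pause reaches T_min,
    and an endpoint is forced once the pause exceeds T_max.
    """
    if sad_pause_frames < t_min:   # lower bound: never trigger before T_min
        return False
    if sad_pause_frames > t_max:   # upper bound: always trigger after T_max
        return True
    return endpoint_posterior > threshold
```

Between the two bounds, the decision falls back to the posterior threshold test.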

III. PROPOSED END-TO-END EPD ALGORITHM BASED ON ENSEMBLE RNNs
As shown in Fig. 1, the novel EPD algorithm is proposed to exploit the ensemble of the AFE-based EPD, PE-based AM, and decoder embedding (DE) derived from the LM-based EOU predictor that is the main contribution of this study. The LM-based EOU predictor directly yields the framewise probability of the EOU token conditioned on the previous word history of the 1-best ASR decoding hypothesis with the N -gram LM. Accordingly, the LM-based EOU predictor, AFE-based EPD, and PE-based AM are separately trained, and then the fully-connected DNN-based EOU predictor is trained with the combined feature vector, which is composed of the last hidden layers of the three ensemble RNNs, in accordance with the framewise hand-labeled EPD targets as described in the following subsections.

A. PROPOSED LM-BASED EOU PREDICTOR
As shown in [39], the combination of the AFE and WE can yield superior EPD performance without the actual decoding process, closely matching the performance of the EPD system based on the AFE, WE, and DSFs, which requires performing the online ASR decoding process. However, in natural language processing (NLP), it is difficult to detect the turn-taking word in an online fashion due to the flexibility of natural language, which can express the user's intention in various ways according to grammatical rules [44]. For instance, the user's intention is often expressed with the action and object information only, such as ''turn the lights on''. The user's intention may also include specific location information through an additional phrase such as ''in the kitchen''. In other words, from the expression ''turn the lights on'' alone, it cannot be clearly identified whether ''on'' is the turn-taking word, whereas ''on'' is not the turn-taking word if the phrase ''in the kitchen'' follows. Thus, the WE, which is extracted from the word LSTM, cannot be considered a reliable feature for the online EPD task since it is trained to detect the turn-taking word, as depicted in Fig. 2(a). This figure shows example pairs of sentences and labels from [46], which are used to train the word LSTM. It can be observed that different labels are given for the same word sequence, depending on whether the additional phrase follows. In order to address these problems, this paper proposes the LM-based EOU predictor, which is similar to the word LSTM in that both are trained depending on the 1-best ASR decoding hypothesis. However, it differs from the word LSTM in that the proposed EOU predictor is trained to determine the framewise probabilities of the EOU token conditioned on the previous word history, instead of performing a binary classification to find the turn-taking word. As depicted in Fig. 2(b), the same probabilities of the EOU token are given for training the EOU predictor regardless of whether each word is the turn-taking word, where each probability of the EOU token is obtained from the 4-gram LM. After the word ''on'' is observed, the probability of the EOU token is 0.372 since the probability that an additional phrase with further specific information follows the observed sentence is 0.628, as obtained from the 4-gram LM. The probability of the EOU token then decreases to almost zero after the mid-sentence word ''in'' is observed, since P(EOU|lights, on, in) ≈ 0. On the other hand, the probability of the EOU token rapidly increases after the last word of the sentence, ''kitchen'', is detected. The method for obtaining the framewise probability of the EOU token conditioned on the previous word sequence is described as follows.
The ASR technique aims to determine the most likely word sequence $\hat{w}$ given the input acoustic feature sequence $X$:
$$\hat{w} = \arg\max_{w} P(w \mid X). \tag{3}$$
By Bayes' rule, this can be represented in the equivalent form
$$\hat{w} = \arg\max_{w} P(X \mid w)\,P(w), \tag{4}$$
where the likelihood $P(X \mid w)$ is determined by the AM, usually based on a DNN, and the prior probability $P(w)$ is obtained by the LM. Here, the LM is utilized to derive the probability of each word conditioned on the previous word history as $P(w_i \mid w_{<i})$. For large vocabulary continuous speech recognition (LVCSR), it is approximated by the N-gram LM according to the Markov chain rule, where the N-gram LM determines the probability of each word conditioned on the last $N-1$ words only, instead of the entire word history. However, the major drawback of the N-gram LM originates from data sparsity when it is trained with insufficient corpora. This can be mitigated by the combination of discounting and backing-off algorithms, called the Katz smoothing algorithm [47]. The 3-gram LM is adopted to obtain the probability of the EOU token conditioned on the word history:
$$P(\mathrm{EOU} \mid w_{i-1}, w_i) =
\begin{cases}
\dfrac{C(w_{i-1}, w_i, \mathrm{EOU})}{C(w_{i-1}, w_i)}, & C(w_{i-1}, w_i, \mathrm{EOU}) > C, \\[4pt]
d_{C}\,\dfrac{C(w_{i-1}, w_i, \mathrm{EOU})}{C(w_{i-1}, w_i)}, & 0 < C(w_{i-1}, w_i, \mathrm{EOU}) \le C, \\[4pt]
\alpha(w_{i-1}, w_i)\,P(\mathrm{EOU} \mid w_i), & C(w_{i-1}, w_i, \mathrm{EOU}) = 0,
\end{cases} \tag{5}$$
where $C(\cdot)$ denotes the count, $d_C$ the discounting coefficient, $\alpha(\cdot)$ the back-off weight, and $C$ the count threshold. As in (5), $P(\mathrm{EOU} \mid w_i)$ can also be obtained via the backing-off method if $C(w_i, \mathrm{EOU}) = 0$ or the discounting method if $0 < C(w_i, \mathrm{EOU}) \le C$. The 1-best ASR decoding hypothesis at $t$, denoted $\hat{w}_t$, is obtained as
$$\hat{w}_t = \arg\max_{w} P(w \mid X_t). \tag{6}$$
The probability of the EOU token at $t$ can then be derived according to the last two words of the 1-best ASR decoding hypothesis by employing the 3-gram LM:
$$P(\mathrm{EOU} \mid X_t) \approx P(\mathrm{EOU} \mid \hat{w}_{t,U-1}, \hat{w}_{t,U}), \tag{7}$$
where $\hat{w}_{t,u}$ and $U$ denote the $u$-th word of $\hat{w}_t$ and the number of words in $\hat{w}_t$, respectively. Specifically, the probability of the EOU token given $X_t$ can be obtained by marginalizing over all possible hypotheses at $t$. With the assumption that the probability of the 1-best hypothesis dominates the probability mass of all possible hypotheses, i.e., $P(\hat{w}_t \mid X_t) \approx 1$, it can be represented as
$$P(\mathrm{EOU} \mid X_t) = \sum_{w} P(\mathrm{EOU} \mid w, X_t)\,P(w \mid X_t) \approx P(\mathrm{EOU} \mid \hat{w}_t, X_t). \tag{8}$$
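The count-based trigram estimate of the EOU probability can be illustrated with a toy sketch. The code below builds trigram counts over sentences terminated with an `<EOU>` token and returns a discounted estimate of P(EOU | w_{i-1}, w_i); it uses simple absolute discounting and omits the back-off weights, so it is a simplified stand-in for the Katz smoothing described above, with toy sentences and an arbitrary discount value:

```python
from collections import Counter

class EOUTrigramLM:
    """Toy estimate of P(EOU | w_{i-1}, w_i) from counts.

    Simplified stand-in for Katz smoothing: absolute discounting only,
    no back-off weights; training sentences and discount are illustrative.
    """

    def __init__(self, sentences, discount=0.5):
        self.d = discount
        self.tri = Counter()  # trigram counts C(w_{i-2}, w_{i-1}, w_i)
        self.bi = Counter()   # context counts C(w_{i-2}, w_{i-1})
        for s in sentences:
            words = s.split() + ["<EOU>"]
            for i in range(2, len(words)):
                self.tri[tuple(words[i - 2:i + 1])] += 1
                self.bi[tuple(words[i - 2:i])] += 1

    def p_eou(self, w1, w2):
        ctx = self.bi[(w1, w2)]
        c = self.tri[(w1, w2, "<EOU>")]
        if ctx == 0 or c == 0:
            return 0.0                       # a real LM would back off here
        return max(c - self.d, 0.0) / ctx    # absolute-discounted estimate
```

Trained on the two toy sentences "turn the lights on" and "turn the lights on in the kitchen", the sketch gives a moderate EOU probability after "lights on" (the sentence may continue) and a near-zero one after "on in", mirroring the behavior described for Fig. 2(b).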
Furthermore, (8) can be rewritten as in (9) since the EOU token can be assumed to be conditionally independent of $X_t$ given $\hat{w}_t$:
$$P(\mathrm{EOU} \mid \hat{w}_t, X_t) \approx P(\mathrm{EOU} \mid \hat{w}_t). \tag{9}$$
Finally, the probability of the EOU token given $X_t$ can be determined according to the last two words of the 1-best hypothesis at $t$ via the 3-gram LM approximation:
$$P(\mathrm{EOU} \mid \hat{w}_t) \approx P(\mathrm{EOU} \mid \hat{w}_{t,U-1}, \hat{w}_{t,U}). \tag{10}$$
In this study, the LM-based EOU predictor is presented to directly determine the probability of the EOU token $P(\mathrm{EOU} \mid X_t)$ in an end-to-end manner. As depicted in the upper part of Fig. 3, the framewise probabilities of the EOU token in the training stage are obtained from the 1-best ASR decoding hypothesis $\hat{w}_t$ of each training-dataset utterance with the help of the decoding module. The probability of the EOU token conditioned on the previous word history, obtained by the N-gram LM, is used as the target for training. Then, the proposed LM-based EOU predictor using an RNN is trained with the targeted probability of the EOU token. The key idea is to train the LSTM network to minimize the mean square error (MSE) function for the LM-based EOU predictor:
$$h_{l,t}^{\mathrm{DE}} = \mathrm{LSTM}\big(h_{l-1,t}^{\mathrm{DE}}, h_{l,t-1}^{\mathrm{DE}}; \Theta_l^{\mathrm{DE}}\big), \tag{11}$$
$$\hat{P}(\mathrm{EOU} \mid X_t) = \sigma\big(V^{\mathrm{DE}} h_{L,t}^{\mathrm{DE}} + b^{\mathrm{DE}}\big), \tag{12}$$
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{T} \sum_{t=1}^{T} \Big(\hat{P}(\mathrm{EOU} \mid X_t) - P(\mathrm{EOU} \mid \hat{w}_{t,U-1}, \hat{w}_{t,U})\Big)^2, \tag{13}$$
where $\Theta_l^{\mathrm{DE}}$ is the model parameter of the $l$-th RNN layer and $h_{l,t}^{\mathrm{DE}}$ denotes the hidden state of the $l$-th hidden layer at the $t$-th frame for the DE. Also, $V^{\mathrm{DE}}$, $b^{\mathrm{DE}}$, and $\sigma$ denote the weight parameter, bias parameter, and logistic sigmoid function, respectively. Once the proposed end-to-end EOU predictor is completely trained, the framewise posterior probabilities of the EOU token are determined at the inference stage as in (11)-(13) without the actual ASR decoding process, eliminating the gray block depicted in Fig. 1. Furthermore, they are used as the LM knowledge for the final EPD decision.
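The sigmoid readout and MSE objective of the EOU predictor can be sketched in numpy (the LSTM stack itself is omitted; shapes and parameter names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eou_posterior(h_last, V, b):
    """Sigmoid readout of the last DE hidden state, as in the output layer
    described above (V, b, and h_last shapes are illustrative)."""
    return sigmoid(V @ h_last + b)

def mse_loss(predicted, lm_targets):
    """MSE between predicted EOU posteriors and the framewise N-gram LM
    targets used as training labels (toy values only)."""
    predicted = np.asarray(predicted, dtype=float)
    lm_targets = np.asarray(lm_targets, dtype=float)
    return float(np.mean((predicted - lm_targets) ** 2))
```

With zero weights and bias the readout is 0.5, and the loss vanishes when the prediction matches the LM target exactly.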

B. AFE-BASED EPD
According to [31], the AFE-based EPD method can be used to classify each frame into four states, i.e., speech, initial silence, final silence, and intermediate silence, to distinguish the final silence from the other silence states, where a high posterior probability of the final silence is likely to indicate the true endpoint. The AFE-based EPD is formulated as
$$h_{l,t}^{\mathrm{AFE}} = \mathrm{LSTM}\big(h_{l-1,t}^{\mathrm{AFE}}, h_{l,t-1}^{\mathrm{AFE}}; \Theta_l^{\mathrm{AFE}}\big),$$
$$P(y_t \mid X_t) = \mathrm{softmax}\big(V^{\mathrm{AFE}} h_{L,t}^{\mathrm{AFE}} + b^{\mathrm{AFE}}\big),$$
where $\Theta_l^{\mathrm{AFE}}$ is the model parameter of the $l$-th RNN layer, $h_{l,t}^{\mathrm{AFE}}$ denotes the hidden state of the $l$-th hidden layer at the $t$-th frame for the AFE, and $y_t$ denotes the state label among the four classes. Also, $V^{\mathrm{AFE}}$ and $b^{\mathrm{AFE}}$ denote the weight and bias parameters of the output layer, respectively. All the parameters of the LSTM for the AFE-based EPD are trained to minimize the cross-entropy (CE) error function.
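The four-class softmax readout and framewise CE loss can be sketched as follows (a numpy illustration; state names, shapes, and parameters are illustrative, and the LSTM stack is omitted):

```python
import numpy as np

STATES = ["speech", "initial_silence", "final_silence", "intermediate_silence"]

def softmax(z):
    z = z - np.max(z)          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def epd_posteriors(h_last, V, b):
    """Softmax readout over the four EPD states from the last AFE hidden state."""
    return softmax(V @ h_last + b)

def cross_entropy(posteriors, target_state):
    """Framewise CE loss against the hand-labeled state."""
    return float(-np.log(posteriors[STATES.index(target_state)]))
```

With zero weights each state receives probability 0.25, giving a CE loss of log 4 for any target.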

C. PE-BASED AM
According to previous studies [19], [20], [48], and [49] on the phone-aware training method using the latent feature of the DNN-based AM (called the PE), this idea can be further exploited for other applications such as the speech enhancement and SAD tasks. Hence, in this study, we incorporate the PE into the EPD task to reduce the endpoint error. The PE-based AM is derived as
$$h_{l,t}^{\mathrm{PE}} = \mathrm{LSTM}\big(h_{l-1,t}^{\mathrm{PE}}, h_{l,t-1}^{\mathrm{PE}}; \Theta_l^{\mathrm{PE}}\big),$$
$$P(c_t \mid X_t) = \mathrm{softmax}\big(V^{\mathrm{PE}} h_{L,t}^{\mathrm{PE}} + b^{\mathrm{PE}}\big),$$
where $\Theta_l^{\mathrm{PE}}$ is the model parameter of the $l$-th RNN layer and $h_{l,t}^{\mathrm{PE}}$ denotes the hidden state of the $l$-th hidden layer at the $t$-th frame for the PE. Furthermore, $V^{\mathrm{PE}}$ and $b^{\mathrm{PE}}$ represent the weight and bias parameters of the output layer, respectively. The LSTM for the PE-based AM is trained to minimize the CE error function with the framewise senone label $c_t$, which is obtained by performing the forced alignment process based on a Gaussian mixture model-hidden Markov model (GMM-HMM)-based ASR system [50].
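At EPD time only the latent representation of the senone classifier is needed, not its senone posteriors. The sketch below forwards an acoustic feature vector through illustrative hidden layers and returns the last hidden activation as the PE; the layer shapes and tanh activation are our assumptions, not the trained model of this work:

```python
import numpy as np

def phonetic_embedding(x, layers):
    """Return the last hidden activation of a senone classifier as the PE.

    layers : list of (W, b) pairs for the hidden layers (shapes illustrative);
    the senone softmax output layer is deliberately dropped.
    """
    h = np.asarray(x, dtype=float)
    for W, b in layers:
        h = np.tanh(W @ h + b)   # hidden layers only
    return h
```

The returned vector is what gets concatenated with the AFE and DE hidden states later in the framework.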

D. PROPOSED END-TO-END ENDPOINT DETECTION BASED ON ENSEMBLE RNNs
We propose a novel EPD framework that reduces the early and late endpoint times simultaneously by augmenting the AFE-based EPD algorithm with acoustic and language modeling knowledge. As in [19]-[21], which show that the complementary advantages of multiple features can be easily combined using a DNN by injecting the features together, the last hidden layers of the PE-based AM and the proposed LM-based EOU predictor are concatenated with that of the AFE-based EPD algorithm as the acoustic modeling context and language modeling context, respectively. After the AFE-based EPD, PE-based AM, and LM-based EOU predictor are independently trained in accordance with each target, the ensemble RNNs are concatenated at the level of the last hidden layers and then fed into the DNN-based EPD classifier, which classifies each input frame into the four states of speech, initial silence, final silence, and intermediate silence:
$$h_{0,t}^{\mathrm{EPD}} = \big[h_{L,t}^{\mathrm{AFE}};\, h_{L,t}^{\mathrm{PE}};\, h_{L,t}^{\mathrm{DE}}\big],$$
$$h_{l,t}^{\mathrm{EPD}} = \sigma\big(V_l^{\mathrm{EPD}} h_{l-1,t}^{\mathrm{EPD}} + b_l^{\mathrm{EPD}}\big),$$
$$P_{\mathrm{EPD}}(y_t \mid X_t) = \mathrm{softmax}\big(V_L^{\mathrm{EPD}} h_{L-1,t}^{\mathrm{EPD}} + b_L^{\mathrm{EPD}}\big),$$
where $h_{l,t}^{\mathrm{EPD}}$ denotes the hidden state of the $l$-th layer at the $t$-th frame, and $V_l^{\mathrm{EPD}}$ and $b_l^{\mathrm{EPD}}$ denote the weight and bias parameters at the $l$-th hidden layer, respectively. To build the model, the CE error function is directly applicable as the objective criterion; thus, the posterior probability of the final silence representing the speech endpoint is established. After the DNN-based classifier is trained, all the modules, including the ensemble RNNs for extracting the AFE, PE, and DE and the DNN-based classifier, are jointly optimized again by the joint retraining (JRT) process, which is similar to phase 3 of [19], in accordance with the EPD label to further enhance the EPD performance, since they all consist of differentiable parameters as shown in Fig. 1, which illustrates the feed-forward and error back-propagation paths.
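The ensemble readout can be sketched in numpy as follows. This is a minimal forward-pass illustration with illustrative layer shapes; the actual model in this work is trained end-to-end (and jointly retrained) with TensorFlow:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ensemble_epd_posteriors(h_afe, h_pe, h_de, dnn_layers, V_out, b_out):
    """Concatenate the last hidden states of the three RNNs and forward the
    result through the fully-connected classifier (shapes illustrative)."""
    h = np.concatenate([h_afe, h_pe, h_de])   # ensemble feature vector
    for V, b in dnn_layers:                   # sigmoid hidden layers
        h = 1.0 / (1.0 + np.exp(-(V @ h + b)))
    # four states: speech, initial silence, final silence, intermediate silence
    return softmax(V_out @ h + b_out)
```

At inference, the endpoint is declared when the posterior of the final silence exceeds those of the other three states.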
In the inference stage, the probability of the EOU is computed by feeding the input acoustic feature sequence into the proposed EPD algorithm. The EOU is finally detected when P EPD (EOU|X t ) exceeds the probabilities corresponding to the speech, initial silence, and intermediate silence.

IV. EXPERIMENTS AND RESULTS
This section describes the performance evaluation of the proposed EPD approach. For an objective comparison, our approach was compared with the conventional GLDNN-based EPD [31] and the EPD based on combining the AFE, WE, and DSFs [39]. Since the DSF-based approach in [39] and the proposed EPD approach are both based on combinations of trained embeddings, namely [AFE, WE, DSFs] and [AFE, PE, DE], respectively, the performances of the sub-EPD systems based on each single embedding alone and their combinations were also tested to verify the superiority of the DE in the proposed EPD algorithm. As in [31] and [39], the performances of the EPD systems were evaluated using the following metrics. First, the EPD performances were assessed using the late endpoint time, which describes how late the final EPD decision is triggered compared with the EPD label. The late endpoint time can be considered the response latency of the online speech recognition system since the 1-best ASR decoding hypothesis can be obtained when the EPD is triggered. In addition, the EPD performances were compared using the WER since bad early endpoint errors undesirably cut off the speech region and increase the deletion error rate. In order to evaluate the performance of the EPD systems in terms of the early endpoint error itself, we also reported the early endpoint time, which describes how prematurely the final EPD decision is triggered compared with the true EPD label. For the performance comparison, the EOU of each speech sample in the evaluation-dataset was obtained by independently running the EPD systems, and then the EPD performances were assessed as follows. The early endpoint time was obtained by averaging the gap between the actual EOU and the moment the EPD algorithm was triggered over the speech samples for which the EPD approach was prematurely triggered.
The late endpoint time was obtained by averaging the gap between the actual EOU and the moment the EPD algorithm was triggered over the speech samples for which the EPD decision was triggered late. Then, the WER was evaluated by performing the ASR decoding process from the first frame to the EOU frame determined by each EPD algorithm, where the WER was computed as the sum of the substitution, deletion, and insertion error rates [51].
The first part of the experiments used a relatively small speech dataset, namely CHiME-3, to evaluate and analyze the conventional and proposed EPD algorithms under various acoustic configurations. The second part of the experiments scaled up the amount of data by augmenting a clean speech database, namely SiTEC Dict01 [52], using an acoustic environment simulation method. These experiments mainly show the effectiveness of the proposed EPD framework. Note that all the frameworks were implemented using the TensorFlow library [53].
A. CHiME-3 ASR TASK

1) DATA PREPARATION

We emphasize that the simulation must be very realistic; hence, we selected the CHiME-3 dataset [45] developed for the far-field ASR task with a multi-microphone tablet device in everyday environments, i.e., a bus, cafe, pedestrian area, and street, each of which consists of real speech data (REAL) and simulated speech data (SIMU). The real speech data consist of six-channel recordings sampled at 16 kHz. Twelve English speakers were asked to read sentences from the WSJ0 corpora [54] while using the multi-microphone tablet. They were encouraged to adjust their reading positions so that the target distance continued to change over time. The simulated speech data were generated by artificially mixing the clean utterances from WSJ0 into background recordings. The speech data consist of three datasets, namely the training-, development-, and evaluation-datasets, which contain 18 h of speech data (3 h REAL and 15 h SIMU) uttered by 87 speakers, 2.9 h of speech data uttered by 4 speakers, and 2.2 h of speech data uttered by 4 speakers, respectively. The development- and evaluation-datasets have a 1:1 ratio of REAL and SIMU. We used the training- and development-datasets for the training of each EPD framework and the evaluation-dataset for the performance comparison. In particular, ''BeamformIt'', a weighted delay-and-sum beamforming algorithm [55], was performed to extract the speech signal of interest from the background noise. Note that the beamforming algorithm was carried out using only the five microphones facing the speaker; we excluded the second microphone since it was located on the rear side of the tablet device and contained less speech.
To prepare the senone targets and the P(EOU|X_t) labels used to train the AM and the LM-based EOU predictor, respectively, we used the baseline ASR system of the CHiME-3 task provided by the KALDI framework [56]. The baseline ASR system was prepared using the training- and development-datasets as follows. First, the training- and development-datasets were represented with 25 ms frames of 13-dim MFCC features computed every 10 ms with the Hamming window. We obtained 1,952 types of senone labels by training the GMM-HMM-based ASR system with 40-dim feature-space maximum likelihood linear regression (fMLLR) features obtained via speaker adaptive training (SAT), where the input feature was spliced with three left and three right feature frames (91-dimension). Subsequently, the DNN for the AM, which has 7 hidden layers with 2,048 units and the sigmoid activation function per hidden layer, was trained in the following steps. First, each hidden layer of the DNN was initialized via the layerwise unsupervised learning process called pre-training, using the contrastive divergence (CD) algorithm [57]. Then, the DNN was trained to minimize the CE error function, where the DNN input was also spliced with five left and five right fMLLR contexts (440-dimension). Finally, the DNN was retrained with the state-level minimum Bayes risk (sMBR) criterion [58]. A 3-gram LM, built on the 5k vocabulary and pruned with pre-defined threshold values, was used for the baseline ASR system. After the DNN-based AM was trained, the senone labels of each dataset were prepared by performing the ASR decoding process. Furthermore, the framewise P(EOU|X_t) labels of each dataset were established according to the word alignment obtained from the 1-best ASR decoding hypothesis.
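The label-preparation step above can be illustrated with a small sketch. The data structures below (a word/end-time alignment list and a dictionary of end-of-utterance probabilities standing in for the N-gram LM) are our own illustrative assumptions, not the paper's implementation: each 10 ms frame receives the LM probability of the EOU token conditioned on the word history available up to that frame.

```python
def eou_targets(alignment, lm_eou_prob, num_frames, frame_shift=0.01):
    """alignment: list of (word, end_time_sec) pairs from the 1-best
    hypothesis; lm_eou_prob: dict mapping the last word of the history
    to P(EOU | history), standing in for the N-gram LM."""
    targets = [0.0] * num_frames
    prob, idx = 0.0, 0
    for t in range(num_frames):
        time = t * frame_shift
        # extend the word history with every word that has ended by now
        while idx < len(alignment) and alignment[idx][1] <= time:
            prob = lm_eou_prob.get(alignment[idx][0], 0.0)
            idx += 1
        targets[t] = prob  # framewise P(EOU|X_t) training target
    return targets
```

A real N-gram LM would condition on the full (N-1)-word history rather than the last word alone; the single-word lookup here only keeps the sketch short.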
In addition, we made framewise reference EPD decisions on the enhanced speech data of each dataset by manually labeling each 10 ms frame as speech, initial silence, final silence, or intermediate silence.
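As an illustration of this labeling scheme, the following sketch (the segment format and function names are our assumptions) converts annotated segments into framewise four-class labels at the 10 ms frame rate:

```python
CLASSES = ("speech", "initial_silence", "final_silence", "intermediate_silence")

def framewise_labels(segments, frame_shift=0.01):
    """segments: list of (label, start_sec, end_sec) tuples covering the
    utterance without gaps; returns one class index per 10 ms frame."""
    end = max(e for _, _, e in segments)
    num_frames = int(round(end / frame_shift))
    labels = []
    for t in range(num_frames):
        time = (t + 0.5) * frame_shift          # frame centre
        for name, s, e in segments:
            if s <= time < e:
                labels.append(CLASSES.index(name))
                break
    return labels
```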

2) TRAINING PROCESS FOR EACH EPD MODEL
The proposed EPD framework was constructed as follows. First, the training- and development-datasets were represented with 25 ms frames of 64-dim log-Mel filterbank energies computed every 10 ms, which were used as the input features for the EPD task. The AFE-based EPD and the PE-based AM each consisted of two LSTM layers with 100-dim cells per layer and a fully-connected DNN-based classifier with the soft-max function, yielding 4-dim and 1,952-dim outputs for classifying each input frame into the four types of states (speech, initial silence, final silence, and intermediate silence) and into the senone labels, respectively. The EOU predictor-based EPD also consisted of two LSTM layers with 100-dim cells per layer and a fully-connected DNN yielding a 1-dim output through the logistic sigmoid function. The EOU predictor-based EPD was trained with the MSE function, where the probabilities of the EOU token, P(EOU|X_t), obtained using the N-gram LM were used as the targets. After these models were trained, their last hidden layers were concatenated to train the EPD classifier, which consists of two 100-dim fully-connected DNN layers and a 4-dim soft-max layer. The batch size was set to 64. The learning rates for the training of the AFE-based EPD, PE-based AM, LM-based EOU predictor, and classifier were set to 0.01, 0.01, 0.001, and 0.01, respectively, for the first 10 epochs, and were then decreased by 10% after each epoch. When the proposed EPD architecture was jointly retrained for further optimization, the initial learning rate was set to 0.0001 and then decreased by 10% after each epoch.
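At the shape level, the concatenation of the three 100-dim last hidden states and the classifier head can be sketched as follows. The weights here are random placeholders rather than trained parameters, so this only illustrates the dimensionalities and data flow, not the trained behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100-dim last hidden states of the three branches (AFE-based EPD,
# PE-based AM, LM-based EOU predictor) for one frame
h_afe, h_pe, h_eou = (rng.standard_normal(100) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# placeholder weights for the two 100-dim layers and the 4-dim output
W1 = rng.standard_normal((100, 300))
W2 = rng.standard_normal((100, 100))
W3 = rng.standard_normal((4, 100))

h = np.concatenate([h_afe, h_pe, h_eou])   # 300-dim joint embedding
h = np.tanh(W1 @ h)
h = np.tanh(W2 @ h)
posterior = softmax(W3 @ h)                # posteriors over the 4 EPD states
```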
For the EPD performance comparisons, the conventional EPD approaches were established as follows. For the GLDNN-based EPD [31], the grid-LSTM used 12-dim grid-LSTM units, where the filter size was 8 with the stride 2 (overlapped by 6). Furthermore, two LSTM layers with 64-dim cells per layer and two 100-dim fully-connected DNN layers with the 4-dim soft-max layer were cascaded. The GLDNN-based EPD was trained with the 64-dim log-Mel filterbank energies in accordance with four types of labels: speech, initial silence, final silence, and intermediate silence.
The batch size and learning rate were set to 64 and 0.01, respectively. As for [39], 64-dim log-Mel filterbank energies were also used as the features. The acoustic LSTM and the word LSTM were each constructed with two LSTM layers with 100-dim cells per layer and a fully-connected DNN-based classifier. The acoustic LSTM was trained in accordance with the four types of EPD labels (speech, initial silence, final silence, and intermediate silence), unlike [39], since the post-processing that safeguards the lower and upper pause-duration bounds was not used, for a fair comparison. The word LSTM was trained with binary labels indicating whether or not the current word was a turn-taking word, which were obtained by performing the ASR baseline of the CHiME-3 task. After the acoustic LSTM and the word LSTM were trained, the classifier, consisting of two 100-dim fully-connected DNN layers with a 4-dim soft-max function, was trained on sequential features composed of the last hidden layers of both LSTMs and the DSFs (three types of expected pause durations), which were obtained by performing the ASR decoding process in an online manner. The batch size and learning rate were set to 64 and 0.01, respectively, for training the acoustic LSTM, word LSTM, and classifier.
The sub-EPD systems based on a single embedding alone or on their combinations were also built to verify the superiority of the DE for the proposed EPD. Specifically, the AFE, PE, WE, and DE embeddings were prepared by feeding the training-dataset into the AFE-based EPD, the AM, the word LSTM, and the proposed LM-based EPD system, respectively, and capturing the hidden states at the last hidden layer. The classifiers of the sub-EPD systems were separately trained in accordance with the framewise endpoint target by feeding a single embedding alone or their combinations into the EPD classifiers, using the CE error function. The batch size and learning rate were set to 64 and 0.01, respectively, for training each EPD classifier.
The Adam optimization algorithm [59] was commonly applied for all the training processes. Furthermore, an early stopping scheme based on the development-dataset was applied to avoid over-fitting after 50 epochs were completed.
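Our reading of this early-stopping rule can be sketched as follows; the patience value is our assumption, since the text only states that stopping is considered once 50 epochs have completed:

```python
def stop_epoch(dev_losses, min_epochs=50, patience=3):
    """Return the 1-indexed epoch at which training stops: after at
    least min_epochs, stop once the development-set loss has not
    improved for `patience` consecutive epochs."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(dev_losses, start=1):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if epoch >= min_epochs and since_best >= patience:
            return epoch
    return len(dev_losses)  # ran out of epochs without triggering
```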

3) EXPERIMENTAL RESULTS
Before presenting the main experiments, we assessed the performance of the EPD systems based on the various DEs obtained with N-gram LMs of different orders, since the performance of the LM-based EOU predictor for the EPD task is highly dependent not only on the DNN architecture but also on the N-gram LM used to build the targets. Table 1 shows the average early and late endpoint times obtained on the development-dataset, where DE_N denotes the EPD based on the DE trained using the N-gram LM, and the bold numbers indicate the best result among the DE-based EPD systems. The performances of the AFE-, PE-, and WE-based EPD algorithms are also reported for relative comparison. The DE trained with the 4-gram LM achieved lower endpoint errors than the others; thus, we used the 4-gram LM for training the EOU predictor. The proposed EPD algorithm and the conventional methods were then extensively evaluated on the CHiME-3 ASR task to assess the EPD performance under the bus, cafe, pedestrian, and street scenarios for both the simulated acoustic condition and the everyday environment. Fig. 4 shows an example of the prediction result of P(EOU|X_t) and the final EPD decision of each EPD algorithm under the REAL bus noise scenario; this example includes short pause regions at 2.4, 3.6, and 4.2 s and a long pause region from 4.8 to 5.0 s. As shown in Fig. 4(b), P(EOU|X_t) was high in both the short and long pause regions, and the GLDNN-based EPD is likely to detect a short pause as an endpoint since it cannot fully consider language modeling knowledge such as the phone or word alignments. In particular, the probability of the EOU was high enough to trigger the final EPD decision prematurely in the short and long pause regions. In contrast, the final EPD decision of the proposed EPD algorithm was correctly triggered in the final silence region.
Further, the late endpoint time could be reduced by the JRT process. The performance of the proposed EPD algorithm was compared with that of the conventional EPD approaches in terms of objective measures described as follows.
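As we understand the two objective measures, they can be written as follows, with t_ref the reference EOU frame and t_det the detected one (the paper's exact definitions, e.g. how the averages over utterances are taken, may differ):

```python
def endpoint_errors(t_ref, t_det, frame_ms=10):
    """Early error: how far before the reference EOU the detector
    fired; late error: how far after. One of the two is always zero.
    Frame indices are converted to milliseconds (10 ms per frame)."""
    early = max(t_ref - t_det, 0) * frame_ms
    late = max(t_det - t_ref, 0) * frame_ms
    return early, late
```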
First, the performance of each EPD algorithm was evaluated in terms of the early endpoint time. Table 2 shows the performance comparison of the conventional and proposed EPD algorithms under the various acoustic conditions, where the bold numbers indicate the best result in terms of the early endpoint time. In Table 2, the GLDNN- and DSFs-based entries and the proposed entry correspond to [31], [39], and the proposed EPD framework, respectively. These results indicate that the WE and DE are useful features for the EPD task to avoid early endpoints, since they can distinguish the EOU from the intermediate silence better than the AFE and PE, which were trained without considering the context of the input feature sequence. Furthermore, the [DE] classifier achieved a better EPD performance than the [WE] classifier, as the WE was trained to detect the turn-taking word, which cannot be considered reliable for natural language, as mentioned earlier. Moreover, the performance of the [AFE] classifier can be improved by incorporating the WE or DE as an additional input feature. In particular, the [AFE, WE] classifier showed a higher endpoint error than the [AFE, DE] classifier, the latter being more desirable for the EPD task on natural language. The GLDNN-based EPD algorithm, which can be considered a more complex version of the [AFE] classifier, showed a lower early endpoint error than the single embedding-based EPD systems. However, the EPD systems based on embedding combinations outperformed the GLDNN-based EPD approach in terms of the early endpoint time. The additional use of the DSFs for the EPD task enhanced the performance of the [AFE, WE] classifier. Furthermore, the proposed EPD algorithm, namely the [AFE, PE, DE] classifier, showed superior EPD performance compared with the conventional EPD algorithms under all acoustic conditions, and the early endpoint time of the proposed EPD algorithm was further improved by the JRT process, as reported in Table 2.
Moreover, the performance of each EPD algorithm was compared in terms of the late endpoint time. Table 3 shows the performance comparison of the conventional and proposed EPD algorithms under the various acoustic conditions. From Table 3, it is evident that the [WE] classifier exhibited the highest late endpoint time among the single embedding-based EPD architectures. Furthermore, the late endpoint time of the [AFE] classifier was reduced by using the WE or DE as an additional feature, while the DE can be considered a more reliable feature for the EPD task than the WE. The proposed EPD framework yielded superior EPD performance compared with the conventional EPD approaches in terms of the late endpoint time, which was further reduced by the JRT process.
The performance of each EPD algorithm was also assessed in terms of the WER by using the baseline ASR system. The EOU frame of each speech utterance was obtained by performing each EPD algorithm, and the ASR decoding was then performed from the first frame to the EOU frame determined by each EPD algorithm. As reported in Table 4, the proposed EPD algorithm achieved better performance than the conventional EPD approaches, and the WERs were further improved by the JRT scheme since this scheme improves both the early and late endpoint times. The final decision of each EPD algorithm can also be obtained based on a soft decision instead of a hard decision. The EPD decision trades off a quick endpoint against the risk of cutting off the speech uttered by the user. More specifically, an aggressive decision threshold provides a faster response at the expense of an increased WER, whereas a lower WER comes at the cost of an increased late endpoint time.
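A minimal sketch of such a threshold-based hard decision is given below; the consecutive-frame requirement (min_run) is our own assumption for robustness against single-frame spikes and is not described in the paper:

```python
def detect_endpoint(final_sil_posteriors, threshold=0.8, min_run=3):
    """Fire the endpoint at the first frame that starts a run of
    min_run consecutive frames whose final-silence posterior exceeds
    the decision threshold; return None if no endpoint is detected."""
    run = 0
    for t, p in enumerate(final_sil_posteriors):
        run = run + 1 if p > threshold else 0
        if run >= min_run:
            return t - min_run + 1   # first frame of the triggering run
    return None
```

Lowering the threshold makes the detector more aggressive (earlier endpoints and a faster response), which matches the trade-off against WER described above.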
To show the trade-off between the WER and the late endpoint time for each EPD algorithm, the WER versus median late endpoint time curve, obtained by varying the decision threshold, is shown in Fig. 5, where lower curves are better. As shown in Fig. 5, the median late endpoint times of the GLDNN-based EPD approach, the DSFs-based EPD approach, and the proposed EPD algorithm without and with the JRT are approximately 270, 230, 190, and 170 ms, respectively, at the same WER of approximately 20%. As shown above, the proposed EPD algorithm with the JRT process showed better EPD performance than the conventional EPD approaches.

B. LARGE-SCALE ASR TASK

1) DATA PREPARATION
To assess the EPD performance of the conventional and proposed EPD approaches with large corpora, we used a large vocabulary continuous Korean speech dataset, namely DICT01, developed by the Speech Information Technology and Industry Promotion Center (SiTEC) [52]. This dataset consists of 20,833 sentences, each containing 6 to 25 words (average: 7.63 words). The speech database was recorded with 200 male and 200 female speakers, each of whom uttered 104 or 105 sentences. The speech signal was sampled at 16 kHz, and the recording conditions are described in [52]. We randomly divided the speech database into three datasets, namely the training-dataset (160 males and 160 females), development-dataset (20 males and 20 females), and evaluation-dataset (20 males and 20 females). We made the reference decisions on the clean speech data of each dataset by manually labeling each 10 ms frame as one of the four types of states: speech, initial silence, final silence, and intermediate silence.
We constructed a noisy and reverberant speech database using the image method [60] for a comparison among the EPD approaches under various acoustic conditions similar to real-life environments. We first simulated the reverberant environments by convolving the clean speech of the training-, development-, and evaluation-datasets with room impulse responses corresponding to the small rooms of the REVERB challenge dataset, for which the reverberation time T60 is approximately 0.25 s [61]. Then, the bus, cafe, pedestrian, and street noises obtained from CHiME-3 [45] were artificially added to each reverberant speech dataset in the time domain while maintaining the signal-to-noise ratio (SNR) at 5, 10, 15, and 20 dB. In addition, office noise from YouTube was artificially added to the reverberant speech of the evaluation-dataset to evaluate the performances of the conventional and proposed EPD approaches under an unseen acoustic condition. Consequently, approximately 1,342, 171, and 168 h of noisy and reverberant speech data were prepared for the training-, development-, and evaluation-datasets, respectively.
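The mixing recipe can be sketched as follows, assuming time-domain signals as NumPy arrays; the impulse response and noise below are synthetic placeholders, not the REVERB or CHiME-3 data:

```python
import numpy as np

def mix_at_snr(clean, rir, noise, snr_db):
    """Convolve clean speech with a room impulse response, then add
    noise scaled so the resulting reverberant-speech-to-noise ratio
    equals snr_db."""
    reverberant = np.convolve(clean, rir)[: len(clean)]
    p_sig = np.mean(reverberant ** 2)
    p_noise = np.mean(noise[: len(reverberant)] ** 2)
    # scale factor so that p_sig / (scale^2 * p_noise) = 10^(snr_db/10)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return reverberant + scale * noise[: len(reverberant)]
```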

2) TRAINING PROCESS FOR EACH EPD MODEL
We constructed each EPD framework by using the large corpora for the performance comparison between the conventional and proposed EPD approaches for the large-scale ASR task. The experimental setup was similar to that of the previous CHiME-3 ASR task. First, the ASR baseline was trained to obtain the senone and framewise P(EOU|X_t) labels, which were used to train the PE model and the LM-based EOU predictor, respectively, as follows. The SAT algorithm was carried out with the fMLLR features extracted from each utterance of the training-dataset to obtain the forced alignment. After the DNN for the AM was initialized via the pre-training procedure based on the CD algorithm [57], it was trained with the CE error function and then retrained with the sMBR criterion. In each step, as in the CHiME-3 task, the development-dataset of the large corpora was used for the early stopping scheme. The ASR decoding process was performed on the training-dataset of the large corpora to prepare the senone labels for the training of the PE model. Furthermore, the framewise P(EOU|X_t) labels of the training-dataset were prepared using the 4-gram LM and the 1-best hypothesis obtained from the ASR system built above. Second, the conventional GLDNN-based and DSFs-based EPD algorithms were constructed with configurations similar to the experimental setup of the CHiME-3 task. The ensemble RNNs for the proposed LM-based EOU predictor, the PE-based AM, and the AFE-based EPD modules were separately trained in accordance with the P(EOU|X_t) labels and senones, which were obtained by performing the ASR system described above, and the hand-made EPD labels, respectively. Subsequently, their last hidden layers were concatenated and fed into a fully-connected DNN-based classifier, which was then trained according to the EPD labels. Finally, the proposed EPD framework was jointly retrained to further optimize the EPD performance.
During all the training processes, the development-dataset was used to perform the early stopping scheme after 50 epochs.

3) EXPERIMENTAL RESULTS
The performances of the EPD frameworks for the large-scale ASR task were also evaluated in terms of the early endpoint time, late endpoint time, and WER by using the evaluation-dataset of the large corpora prepared in this study.
First, the proposed and conventional EPD algorithms were evaluated in terms of the early endpoint time under the reverberant and noisy conditions, including the bus, cafe, pedestrian, street, and office environments. Table 5 shows the early endpoint time of each EPD approach for the large-scale ASR task, where the bold numbers indicate the best result in terms of the early endpoint time. It is shown in Table 5 that the [WE] and [DE] classifiers, which were trained according to the 1-best ASR decoding hypothesis, yielded a relatively lower early endpoint time compared with the [AFE] and [PE] classifiers, which were trained without considering the context of the input sequences such as the word or phone alignments. The [WE] classifier yielded a higher early endpoint time compared with the GLDNN-based EPD method under the overall acoustic conditions. In contrast, the [DE] classifier achieved a better EPD performance than the GLDNN-based EPD method under most of the low-SNR conditions in terms of the early endpoint time. The early endpoint time of the [AFE] classifier was significantly improved by incorporating the WE or DE. From these results, it is concluded that the context-dependent embeddings such as the WE and DE can prevent the early endpoint within the short or long pause regions. While the performance of the [AFE, DE] classifier was enhanced by using the DSFs as the additional feature, the proposed EPD framework yielded a superior EPD performance which was further improved by the JRT process. Note that the proposed EPD algorithm also showed considerable performance improvement under the office noise environment as summarized in Table 5, where the office is the unseen acoustic condition; hence, it was not used in the training-step.
Second, the proposed and conventional EPD algorithms were evaluated in terms of the late endpoint time. The late endpoint time of the [AFE, WE] classifier was further improved with the help of the DSFs, which can only be obtained by the online ASR decoding process at the cost of a great deal of computation and a large amount of memory. Notably, the proposed EPD scheme yielded superior EPD performance in terms of the late endpoint time without the actual ASR decoding process, and this performance was further improved by the JRT process.
Finally, the proposed and conventional EPD algorithms were evaluated in terms of the WER. Table 7 shows the WER, which was obtained by performing the ASR system from the first frame to the EOU frame determined by each EPD algorithm. As shown in Table 7, the proposed EPD approach yielded better performance in terms of the WER with the help of the superiority of the proposed EPD architecture, especially in terms of the early endpoint time. Overall, the proposed EPD algorithm outperformed the conventional EPD approaches under both the seen and unseen noise conditions.

V. CONCLUSION
In this paper, we proposed a speech EPD strategy for robust online low-latency speech recognition that combines the AFE, DE, and PE to incorporate acoustic and language modeling knowledge into the AFE-based EPD.
The first contribution of this study is the LM-based EOU predictor, which uses an RNN to derive the framewise probabilities of the EOU token given the input speech without an actual decoding process, thereby exploiting decoder-state knowledge that is particularly useful for the EPD task but otherwise demands a great deal of computation and a large amount of memory. Second, we presented a novel EPD architecture constructed by combining the last hidden states of the AFE-based EPD, the PE-based AM, and the LM-based EOU predictor and training a DNN-based classifier in accordance with the framewise endpoint label, which can be further enhanced by the JRT technique.
The superiority of the proposed EPD algorithm was assessed on the CHiME-3 and large-scale ASR tasks. According to the experimental results, the proposed EPD algorithm showed significantly improved EPD performance in terms of both the endpoint accuracy and the WER.