Self-Defined Text-Dependent Wake-Up-Words Speaker Recognition System

In recent years, wake-up-word (WUW) technology has advanced considerably in speaker recognition systems. Speaker recognition is the process of verifying a person's claimed identity from their voice characteristics, and it can be efficiently deployed in consumer applications. In this paper, we propose a self-defined text-dependent wake-up-word (WUW) speaker recognition system and its implementation. The system is divided into two phases: a training phase and a testing phase. In the training phase, a wake-up word is recorded and the voice segment is cut out using Voice Activity Detection (VAD). We then use Mel-Frequency Cepstral Coefficients (MFCC) as the pre-processing step to extract speech features. After obtaining the speech features, we train a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM) simultaneously. In the testing phase, the input is scored against the GMM and HMM in sequence, and the Levenshtein Distance (LD) is used to calculate the difference between the state sequences of the stored models and the unknown speech input. If the unknown speech input passes the thresholds, a wake-up event is triggered. The experimental results show average accuracies of 93.31%, 82.42%, and 3.38% at Signal-to-Noise Ratios (SNR) of 10 dB, 5 dB, and 0 dB, respectively. The CPU and memory usage of the entire system are around 757 MIPS and 40 MB, respectively.


I. INTRODUCTION
With the rapid development of human-computer interaction and Internet of Things (IoT) technology, Natural Language Processing (NLP) has become more and more popular. One such application is the intelligent voice assistant, which helps users obtain information from household appliances easily by voice. The speech recognition problem can be described as seeking the most suitable word sequence for a segment of speech; a model is constructed to convert the speech and find the word sequence, for example by translating it into a sequence of Hidden Markov Model states [1]-[3].
Key Word Spotting (KWS) and WUW are two main techniques with a similar basis. KWS is not user specific: it detects specific keywords among other words, sounds, and noises, often without individually modeling the non-keywords [4]. Some research used HMMs to model the specific keyword and the non-keywords to determine whether the correct wake-up word was spoken [5]-[7]. The problem with most KWS methods is that they need large datasets to train the model to achieve high accuracy, as in the systems from Apple and Google. It is generally difficult to obtain such a large amount of speech data, and accuracy decreases when the training data are insufficient. To address this problem, dynamic time warping (DTW), a template matching method, was proposed in [8]. However, that approach is not robust since it relied on DTW alone for KWS. Other research shows a useful method combining Human Factor Cepstral Coefficients ENS (HFCC-ENS) with DTW to improve the result [9]. In [10], each speech frame is represented by combining segmental DTW and GMM, and the TIMIT dataset was used for training and testing. Their results show that too many Gaussian components make the model very sensitive to variations, such as small amounts of noise in the training data, due to overfitting. Based on these results, 50 GMM components was the best choice in their work.
WUW is related to KWS. The difference between them is that the goal of a WUW system is to detect one specific word. In [11], the authors explain that the system always listens for the voice and wakes up when the right word is detected; in other words, WUW allows these systems to be activated by speech commands. There is research on the WUW paradigm from different aspects, such as WUW in noisy environments, the speed of utterance, and the location of the target speaker. One study set up a WUW speech recognition system using the open-source CMU Sphinx and modified it, as in [12].
In practice, most devices use a Wake-Up-Word (WUW) to activate their service. Usually, a WUW of 3 to 6 syllables is most suitable for daily life. If there are too few syllables, other words with the same syllables spoken in daily conversation can easily trigger a false awakening. On the other hand, when there are too many syllables (more than 6), the WUW becomes less intuitive and inconvenient to use, and the voiceprint and sequence comparisons also become less accurate. Nowadays, most WUWs are fixed and cannot be changed on mobile devices: to wake up the device, users must say the words set by the developers, which is inconvenient for consumers.
In this paper, a self-defined WUW recognition system is proposed. The key self-defined feature means that speakers can customize their own WUW as they wish: any WUW of 3 to 6 syllables can become their personal wake-up word. Our system performs well, with high accuracy and a low false accept rate. To make the technique widely applicable, we implement it on an embedded system, where it operates smoothly in real time. Another feature of our system is that it does not need an Internet connection, so users' voice data remain private and safe.
This paper is organized as follows. Section 2 introduces the related works on WUW. Section 3 introduces the proposed system. Section 4 presents the experimental results of the software algorithm and the embedded system implementation. Finally, Section 5 gives the conclusions.

II. RELATED WORKS
Speaker recognition technology has been widely used. An important technique is to recognize the speaker by comparing the input against the corresponding WUW stored in the system. There are two kinds of speaker recognition: text-independent and text-dependent; a study shows the difference between them [13]. In the text-independent technique, any text can be spoken during the testing and training phases. In the text-dependent technique, the spoken text must be the same in each phase.
Text-independent speaker recognition is introduced in [14], which combined GMM and support vector machine (SVM) approaches to improve a speaker identification system. To extract voice features, the authors take a 20 ms frame every 10 ms and use 20-dimensional MFCCs to build the feature matrix. The MFCC feature matrix is then used to train a 50-component GMM with the Expectation-Maximization algorithm. The Gaussian components were first shown to represent characteristic spectral shapes of the phonetic sounds that comprise a person's voice [15], [16]. The identification performance of the GMM is insensitive to the method of model initialization: initialization can be random, and high identification performance is maintained as the number of speakers increases.
The text-dependent speaker recognition technique is more suitable for a pre-defined system and is introduced in the following works. In [17], a method for text-dependent speaker recognition with a Cepstral Compensating Vector and HMM was proposed. The system is based on learning speaker-specific compensators, where each compensator is essentially a speaker-to-speaker transformation that captures the characteristics of the speaker to be recognized. The experimental results show high accuracy in that work. In [18], Vector Quantization (VQ) was used to design a source codebook representing a particular speaker saying a specific utterance; the speaker is accepted if the quantization distortion of the verification utterance is less than a speaker-specific threshold, which shows that VQ can also serve as a method for text-dependent speaker recognition. In [19], the author proposed a method for text-dependent speaker recognition in Vietnamese. Each speaker is modeled with a GMM, and the phonemes in the keywords are represented by HMMs [20]; the prior and posterior probabilities of keywords and speakers are combined to identify speakers. The results show that using posterior probability models improves recognition for short keywords, giving a high correct recognition rate. In addition, a GMM-based speaker recognition system and a speaker verification system are introduced in [21], [22], which present a framework for speaker verification that preserves speaker privacy and show that GMM is a simple and effective approach for speaker recognition.
Speech feature extraction plays an important role in speech processing: it transforms the speech signal into a set of feature vectors. The speech spectrum has been shown to be very effective for speaker recognition because it reflects a person's vocal tract structure, the main physiological factor distinguishing one person's voice from another. Many feature extraction methods, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC), have been discussed in general speaker/speech recognition works [23]; the use of LPC in speaker recognition can be found in [24]-[26]. One study combined the MFCC method with the Hilbert spectrum [27]. That structure was tested not only on clean speech but also on speech corrupted by low-frequency and environmental noise, and showed a better result than MFCC alone. Other improved structures are the Modified Mel-Frequency Cepstral Coefficients (MMFCC) and the Gammatone Frequency Cepstral Coefficients (GFCC) in [28]. Overall, there are many methods and modified algorithms for extracting speech features; however, the choice of extraction method depends on the specifications of each system. Some systematic evaluations have also been discussed: [29] describes speaker verification work, and a recent study surveys the classifiers and databases of text-dependent speaker recognition [30].
To find the exact and correct pronunciation, the work in [31] requires an assessment system that measures distances in the pronunciation of English words. The assessment system needs a method to measure the distance parameters used in the assessment; the parameters to be measured are the phonetics, the syllables, and the phonetic length.

III. PROPOSED SYSTEM
The flowchart of the proposed system is provided in Figure 1. Since our motivation is to construct a complete and deployable system, several individual voice processing techniques are developed. In the training part, we first use Voice Activity Detection (VAD) to cut out the voice segment, and then use MFCC to extract the features. From the MFCC feature data, a GMM is built as the speaker recognition model, and an HMM is trained as the time sequence model at the same time. After training, the GMM and HMM are placed into a model pool for comparison. In our system, whenever we want to add a new wake-up word (whatever it is), we must re-enter the training phase. After the wake-up word has been recorded three times, the speech data are modeled and placed in the database. In the testing part, the system first checks the similarity against the GMM models. If the similarity score is higher than the threshold, the system then compares against the HMM models. In the HMM comparison, the Levenshtein distance algorithm is used to calculate the difference between the state sequences in the model pool and that of the unknown voice input. After the two scores pass their individual thresholds in sequence, the system is awakened.

FIGURE 1. System overview. After recording the wake-up word, the system builds a speaker model pool and then enters detection mode.

A. PRE-PROCESSING ON VOICE ACTIVITY DETECTION
In the pre-processing stage, the system captures the voice segment by VAD at a 16 kHz sample rate with 16 bits per sample. Following the system in [32], we cut a 20 ms frame every 10 ms, so consecutive frames overlap by 10 ms. Next, the VAD module calculates the energy of each frequency band: 80 Hz-250 Hz, 250 Hz-500 Hz, 500 Hz-1 kHz, 1 kHz-2 kHz, 2 kHz-3 kHz, and 3 kHz-4 kHz. These band energies are fed to a GMM to judge whether the audio segment is silence. If the segment is not silence, it is stacked for further processing.
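As a rough illustration of this band-energy front end, the numpy sketch below sums the power spectrum of one 20 ms frame over the six bands listed above. The function name `band_energies` and the 512-point FFT here are our own illustrative choices, not details taken from the paper's implementation:

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN = int(0.020 * SAMPLE_RATE)   # 20 ms frame -> 320 samples
N_FFT = 512
# Band edges in Hz, as listed in the text
BANDS = [(80, 250), (250, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

def band_energies(frame):
    """Energy of one 20 ms frame in each of the six VAD frequency bands."""
    spectrum = np.abs(np.fft.rfft(frame, n=N_FFT)) ** 2   # power spectrum, 257 bins
    freqs = np.fft.rfftfreq(N_FFT, d=1.0 / SAMPLE_RATE)   # bin centre frequencies
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in BANDS])

# Example: a 700 Hz tone concentrates its energy in the 500 Hz - 1 kHz band
t = np.arange(FRAME_LEN) / SAMPLE_RATE
e = band_energies(np.sin(2 * np.pi * 700 * t))
```

The resulting six-dimensional energy vector is what would be handed to the silence/non-silence GMM.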

B. FEATURE EXTRACTION
MFCC is a mainstream method for extracting features from sound data, where the Mel filter bank simulates the human ear [33]. The proposed system uses MFCC to extract features; Figure 2 shows the detailed steps. First, pre-emphasis boosts the high frequencies and reduces the effects of the vocal cords and lips during speaking. Second, a sliding window is applied and multiplied by a Hamming window. Third, the Fast Fourier Transform (FFT) converts the signal from the time domain to the frequency domain. Fourth, the energy is calculated after the FFT and multiplied by 20 Mel filter banks. Figure 3 shows the Mel filter bank, where 20 sets of 257-dimensional Mel matrices are constructed for the following multiplication, and Figure 4 demonstrates the matrix multiplication: after the FFT, the data in a frame pass through the Mel filter banks and yield a 1 × 20 feature vector. Finally, we take the logarithm and the Discrete Cosine Transform (DCT) to convert to the cepstral domain, keeping the first 20 coefficients as the MFCC output. The Mel filter bank is built from M triangular filters, where the M + 2 Mel-spaced corner frequencies are calculated from the Mel scale [34]:

m = 2595 · log10(1 + f / 700),

where f is the frequency in Hz. We choose a 512-point FFT because there are 320 samples per frame. After taking the 20 MFCC dimensions, we calculate the first-order difference as another 20-dimensional feature and stack the two together into a 40-dimensional feature matrix, which captures more of the timing variation in the feature data.
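The pipeline above (Hamming window, 512-point FFT, 20 triangular Mel filters, log, DCT, and the stacked first-order difference) could be sketched in numpy roughly as follows. This is a simplified generic illustration, not the authors' code; pre-emphasis and framing are assumed to have been done already, and the helper names are our own:

```python
import numpy as np
from scipy.fftpack import dct

SR, N_FFT, N_MELS, N_CEPS = 16000, 512, 20, 20

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # the Mel-scale formula from the text

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular Mel filters: an (n_mels x 257) matrix, matching Figure 3."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)  # M + 2 corner points
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frames):
    """frames: (n_frames, 320) pre-emphasised 20 ms frames -> (n_frames, 40) features."""
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, n=N_FFT)) ** 2          # (n_frames, 257)
    mel_energy = power @ mel_filterbank().T                      # (n_frames, 20)
    ceps = dct(np.log(mel_energy + 1e-10), type=2, axis=1, norm='ortho')[:, :N_CEPS]
    delta = np.vstack([np.zeros((1, N_CEPS)), np.diff(ceps, axis=0)])  # first-order difference
    return np.hstack([ceps, delta])                              # 40-dimensional feature
```

Stacking the static coefficients with their first-order difference yields the 40-dimensional feature matrix described in the text.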

C. SPEAKER MODEL
GMM is a probability model describing irregular distributions and is commonly used in speaker recognition [35]. The parameter ''component'' is the number of Single Gaussian Models (SGMs) in the GMM; detailed derivations are given in [36], [37]. When the input x is multi-dimensional, the SGM distribution is:

N(x | u, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −(1/2)(x − u)^T Σ^(−1) (x − u) ),

where u is the expectation (mean) vector, Σ is the covariance matrix, and D is the dimension of the input x. The GMM is then:

p(x | λ) = Σ_{k=1}^{K} α_k N(x | u_k, Σ_k),

where λ denotes the GMM, x is the input, K is the number of components, and α_k is the weight (prior probability) of each SGM, with Σ_k α_k = 1. To train the GMM for each speaker, we use the Expectation-Maximization (EM) algorithm [38], which is very popular for training GMMs in an unsupervised manner. The E-step uses the current parameters to compute the posterior probability (responsibility) of each component, and the M-step uses these responsibilities to update the parameters:

E-step: γ_{jk} = α_k N(x_j | u_k, Σ_k) / Σ_{k'=1}^{K} α_{k'} N(x_j | u_{k'}, Σ_{k'}), where j = 1, 2, ..., N and k = 1, 2, ..., K.

M-step: α_k = (1/N) Σ_{j=1}^{N} γ_{jk}, u_k = Σ_j γ_{jk} x_j / Σ_j γ_{jk}, Σ_k = Σ_j γ_{jk} (x_j − u_k)(x_j − u_k)^T / Σ_j γ_{jk}, where k = 1, 2, ..., K.

We repeat the E and M steps until the model converges, which completes the construction of the speaker model. Figure 5 shows the development of the speaker model: the feature data are differentiated once and stacked with the original features.
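The E and M steps above can be sketched directly in numpy/scipy. This is a generic full-covariance EM illustration with our own function name `em_gmm`; the farthest-point initialisation is our own choice to keep the small sketch stable, whereas the paper notes that random initialisation works in practice:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """Plain EM for a full-covariance GMM; X is (N, D). Returns (weights, means, covariances)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    alpha = np.full(K, 1.0 / K)
    # Farthest-point initialisation: spread the initial means across the data
    mu = [X[rng.integers(N)]]
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - m) ** 2, axis=1) for m in mu], axis=0)
        mu.append(X[int(np.argmax(d2))])
    mu = np.array(mu)
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_{jk}
        dens = np.column_stack(
            [alpha[k] * multivariate_normal.pdf(X, mu[k], cov[k]) for k in range(K)])
        gamma = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        # M-step: update weights, means, covariances
        Nk = gamma.sum(axis=0)
        alpha = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            cov[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
    return alpha, mu, cov
```

On well-separated data the loop recovers the component means; in the paper's setting X would be the 40-dimensional MFCC feature matrix of one speaker.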

D. WAKE-UP-WORD (WUW) MODEL
We use the Baum-Welch algorithm to train an HMM (Fig. 6) and the Viterbi algorithm to find the state sequence of the WUW model. We choose an HMM for the word model because an HMM is a statistical model whose hidden states can only be inferred from the observations; it describes a time-dependent series well and suits problems that must consider timing information [39], [40]. This method is very useful in speech processing [41]. Simultaneously, we train a 16-component GMM and save these models and the state sequence in the model pool. In our simulations, we observe that with 16 Gaussian components in the GMM the accuracy is highest without being overly sensitive. An HMM can be expressed as:

λ = (A, B, π),

where A is the state transition probability matrix, B is the observation probability matrix, and π is the initial state probability vector. To train the HMM with the Baum-Welch algorithm, we first set random parameters for A, B, and π. Then the forward procedure computes:

α_i(t) = P(y_1, ..., y_t, X_t = i | λ),

the probability of seeing the observations y_1 to y_t and being in state i at time t. This is found recursively as:

α_i(1) = π_i b_i(y_1), α_j(t + 1) = b_j(y_{t+1}) Σ_{i=1}^{N} α_i(t) a_{ij}.

The backward procedure computes:

β_i(t) = P(y_{t+1}, ..., y_T | X_t = i, λ),

the probability of the ending partial sequence y_{t+1} to y_T given starting state i at time t. We calculate this by:

β_i(T) = 1, β_i(t) = Σ_{j=1}^{N} a_{ij} b_j(y_{t+1}) β_j(t + 1).

Now we can calculate the probability variables:

γ_i(t) = α_i(t) β_i(t) / Σ_{j=1}^{N} α_j(t) β_j(t),

ξ_{ij}(t) = α_i(t) a_{ij} b_j(y_{t+1}) β_j(t + 1) / Σ_{i'=1}^{N} Σ_{j'=1}^{N} α_{i'}(t) a_{i'j'} b_{j'}(y_{t+1}) β_{j'}(t + 1).

The parameters of the HMM can now be updated as:

π*_i = γ_i(1),

which is the expected frequency of state i at time 1,

a*_{ij} = Σ_{t=1}^{T−1} ξ_{ij}(t) / Σ_{t=1}^{T−1} γ_i(t),

b*_i(v_k) = Σ_{t=1}^{T} 1[y_t = v_k] γ_i(t) / Σ_{t=1}^{T} γ_i(t).

Here, a*_{ij} is the expected number of transitions from state i to state j over the expected number of times in state i, and b*_i(v_k) is the expected number of times the observation equals v_k while in state i over the expected total number of times in state i. These steps are repeated iteratively until convergence or for a fixed number of iterations.
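For a discrete-observation HMM, the forward-backward recursions and re-estimation formulas above translate into numpy as the following unscaled sketch. It is adequate for short sequences only; a real implementation would scale α and β or work in the log domain, and the paper's HMM uses Gaussian observation densities rather than a discrete table:

```python
import numpy as np

def baum_welch(y, n_states, n_obs, n_iter=30, seed=0):
    """Baum-Welch re-estimation for a discrete-observation HMM.
    y: sequence of observation symbols in {0, ..., n_obs-1}."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    T = len(y)
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)  # random init
    B = rng.random((n_states, n_obs));    B /= B.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    for _ in range(n_iter):
        # Forward: alpha[t, i] = P(y_1..y_t, X_t = i)
        alpha = np.zeros((T, n_states))
        alpha[0] = pi * B[:, y[0]]
        for t in range(1, T):
            alpha[t] = B[:, y[t]] * (alpha[t - 1] @ A)
        # Backward: beta[t, i] = P(y_{t+1}..y_T | X_t = i)
        beta = np.zeros((T, n_states))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
        evidence = alpha[-1].sum()
        gamma = alpha * beta / evidence                  # state posteriors gamma_i(t)
        # xi[t, i, j] = alpha_i(t) a_ij b_j(y_{t+1}) beta_j(t+1) / evidence
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, y[1:]].T * beta[1:])[:, None, :]) / evidence
        # Re-estimation
        pi = gamma[0]
        A = xi.sum(axis=0) / (gamma[:-1].sum(axis=0)[:, None] + 1e-12)
        B = np.array([(gamma * (y == k)[:, None]).sum(axis=0) for k in range(n_obs)]).T
        B /= (gamma.sum(axis=0)[:, None] + 1e-12)
    return A, B, pi
```

Each iteration leaves A, B, and π row-stochastic, matching the update formulas above.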
Next, after completing the HMM training, we can find the state sequence. To do this, the Viterbi algorithm is used to find the most probable state path:

Q* = (q*_1, ..., q*_T) = argmax_Q P(Q, y_1, ..., y_T | λ),

where y is the input observation sequence and Q* is its most probable state sequence. Figure 7 shows a schematic diagram of the state sequence produced by the Viterbi algorithm. The proposed system saves the HMMs and state sequences together with the speaker GMM models in the model pool for testing in the next step.
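The Viterbi search for the most probable state path can be sketched in the log domain as follows, again for a discrete-observation HMM as an illustration:

```python
import numpy as np

def viterbi(y, A, B, pi):
    """Most likely hidden-state sequence for a discrete-observation HMM (log domain)."""
    y = np.asarray(y)
    T, S = len(y), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    V = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    V[0] = logpi + logB[:, y[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] + logA        # scores[i, j]: best path ending in i, then i -> j
        back[t] = scores.argmax(axis=0)          # remember the best predecessor of each state
        V[t] = scores.max(axis=0) + logB[:, y[t]]
    # Trace back the best path from the final state
    states = np.zeros(T, dtype=int)
    states[-1] = int(V[-1].argmax())
    for t in range(T - 2, -1, -1):
        states[t] = back[t + 1, states[t + 1]]
    return states
```

With a near-diagonal emission matrix the decoded path simply tracks the observations, which makes the backtracking easy to check by hand.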

E. SPEECH MODEL ON TESTING
After the model building phase, the proposed system operates in the testing phase, also called the listening phase, because the system continuously listens for sound. In the testing phase, the system takes an unknown speech segment cut by VAD and finds the corresponding speaker by calculating the log-likelihood under each speaker's GMM:

log p(X | λ) = Σ_{t=1}^{T} log Σ_{k=1}^{K} π_k N(x_t | µ_k, Σ_k),

where π_k, µ_k, and Σ_k are the parameters of the GMM and K is the number of components. To judge how similar two state sequences are, we use the Levenshtein distance to compare them. The Levenshtein distance is a dynamic programming algorithm that calculates the similarity between two sequences of different (or equal) lengths; it is commonly used for DNA analysis, spelling checking, and speech recognition [42]-[44]. For sequences a and b of lengths I and J, it is defined by the recurrence:

lev(i, j) = max(i, j), if min(i, j) = 0;
lev(i, j) = min( lev(i − 1, j) + 1, lev(i, j − 1) + 1, lev(i − 1, j − 1) + 1[a_i ≠ b_j] ), otherwise.

Filling the matrix element by element, the Levenshtein distance of the two sequences is lev(I, J). The value is non-negative, and the smaller it is, the more similar the sequences are. If the input passes the GMM log-likelihood threshold and the state sequences are similar, the system is awakened successfully and the whole process is done. If the WUW is not correct, the system is not awakened and continues to listen for the next voice input.
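The Levenshtein recurrence above maps directly to a small dynamic-programming table:

```python
import numpy as np

def levenshtein(a, b):
    """lev(I, J) computed with the dynamic-programming recurrence above."""
    I, J = len(a), len(b)
    lev = np.zeros((I + 1, J + 1), dtype=int)
    lev[:, 0] = np.arange(I + 1)   # boundary: all deletions
    lev[0, :] = np.arange(J + 1)   # boundary: all insertions
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            lev[i, j] = min(lev[i - 1, j] + 1,        # deletion
                            lev[i, j - 1] + 1,        # insertion
                            lev[i - 1, j - 1] + cost) # substitution (or match)
            # each cell is the cheapest edit path to align a[:i] with b[:j]
    return int(lev[I, J])
```

In the proposed system the inputs would be two Viterbi state sequences (integer lists), and the wake-up decision compares `levenshtein(stored, observed)` against a threshold; smaller distances mean more similar sequences.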

IV. EXPERIMENT RESULTS
We set up the testing environment with 1.5 meters between the speaker and the microphone, and an angle of 60 degrees between the white noise source and the speaker's voice. The experimental environment is shown in Fig. 8.
The whole system is divided into two phases: the training phase and the testing-comparison phase. The training phase was constructed and discussed in Section 3. In the testing-comparison phase, VAD and Mel-Frequency Cepstral Coefficients are again applied to the unknown voice input. Next, the log-likelihood of this feature under each GMM is calculated to find the corresponding speaker, and the Viterbi algorithm computes the state sequence of the unknown speech through the Hidden Markov Model.
The system was tested under 10 dB (task A), 5 dB (task B), and 0 dB (task C) of SNR, respectively. Due to the difficulty of collecting voice data, we tested five people in the laboratory. We chose ten wake-up words as training data: ''Hello Smart Light'', ''Chih ma kai men'', ''Ni hao pai tu'', ''Hsiao ai tung hsueh'', ''Smart Mirror'', ''OK Google'', ''Hey Siri'', ''Hello Jarvis'', ''Magic Robot'', and ''My Computer''. These ten wake-up words cover English, Chinese, and Vietnamese. Every wake-up word was played 1000 times to test the accuracy. Table 1 and Table 2 show the accuracy of the WUW testing under a clean environment and under different white noise environments, respectively. To evaluate the whole system, the False Reject Rate (FRR) and the False Accept Rate (FAR) are introduced. FRR is the rate at which a correct input is rejected; in our work, it means the speaker says the correct WUW but the system does not wake up, so FRR = 1 − accuracy. FAR is the rate at which a wrong input is incorrectly accepted as correct. These two indicators are very important for evaluating the performance of a WUW system.
We tested with actual human speech 100 times in the experiment and obtained 99% accuracy. To test FAR, we played a 24-hour audio file from the Amazon Alexa open-source testing data. To make a fair evaluation, this test sequence contains no correct WUW. The FAR test results show three false accepts in 24 hours, as also shown in Table 3.
The embedded system design is also provided. We constructed a real, deployable demonstration system on a Raspberry Pi 3B board and operated it in the real environment shown in Fig. 9. We use a ReSpeaker Microphone Array v2 to capture the sound data and a control board to control the proposed system, so no keyboard, screen, or mouse is needed, as in smart home appliance products. As shown in Table 4, the CPU and memory usage of the entire system are around 757 MIPS and 40 MB, respectively. Fig. 10 shows the timing profile of each module. According to the test results, the overall identification time is about 0.5 s, which meets the real-time requirement. Finally, Fig. 11 shows a histogram of WUW accuracy for different numbers of syllables in the three environments; the x-axis is the number of syllables and the y-axis is the accuracy. The results show that our system performs well for wake-up words with 3 to 6 syllables.

V. CONCLUSION
This paper presents a self-defined wake-up-word speaker recognition system and its embedded implementation, which operates in real time. All processing is executed without the Internet, so users' sound data remain private and safe. In the proposed system, the VAD technique cuts out the voice, and MFCC extracts the voice features. We build a GMM to form the speaker's voiceprint database and a Gaussian-distributed HMM to model the phoneme sequence of each wake-up word. In detail, after cutting the voice slices with VAD, we take the 20-dimensional MFCC and its first-order difference to form the speech feature matrix, and we use the Levenshtein distance to compare state sequences. In the training phase, the GMM and the HMM establish the speaker model and the speech model simultaneously. In the testing phase, we first calculate the Gaussian Mixture Model similarity and then use the Levenshtein distance to compare the stored state sequence with that of the unknown speech. If both pass their thresholds, a wake-up event is triggered successfully. The experiments were all performed on a Raspberry Pi 3B embedded board with different languages as test data, and several kinds of noisy environments were also simulated. The results show that the system achieves 99.84% accuracy in a clean environment with a False Accept Rate of three events per day, and the speaker recognition processing time is around 0.5 s, meeting the real-time requirement.