Recognizing Tonal and Nontonal Mandarin Sentences for EEG-Based Brain–Computer Interface

Most current research has focused on nontonal languages such as English. However, more than 60% of the world’s population speaks tonal languages. Mandarin is the most spoken tonal language in the world. Interestingly, the use of tone in tonal languages may represent different meanings of words and reflect feelings, which is very different from nontonal languages. The objective of this study is to determine whether a spoken Mandarin sentence with or without tone can be distinguished by analyzing electroencephalographic (EEG) signals. We first constructed a new Brain Research Center Speech (BRCSpeech) database to recognize Mandarin. The EEG data of 14 participants were recorded while they articulated preselected sentences. To the best of our knowledge, this is the first study to apply asymmetric feature extraction to speech recognition using EEG signals. This study shows that the rational asymmetry (RASM) feature extraction method achieves the best accuracy in cross-subject classification. In addition, our proposed binomial variable algorithm methodology can achieve 98.82% accuracy in cross-subject classification. Furthermore, we demonstrate that the use of eight channels [(F7, F8), (C5, C6), (P5, P6), and (O1, O2)] can achieve an accuracy of 94.44%. This study explores the neurophysiological correlates of Mandarin pronunciation, which can help develop a tonal language synthesis system based on BCI in the future.


I. INTRODUCTION
SPOKEN language is one of the most common forms of communication between people. However, for patients with locked-in syndrome (LIS), such as those with severe spastic quadriplegic cerebral palsy, stroke, or advanced amyotrophic lateral sclerosis, most voluntary muscles are paralyzed, leaving only vertical eye movements or blinking for communication. Even if these patients are conscious, they cannot communicate through language. This may lead to undesirable long-term consequences, including reduced quality of life, reduced social interaction, and increased burden on caregivers [1]. Although patients with communication difficulties can benefit from long-term support and speech therapy, the long-term care needs of this population today. Despite motor abnormalities such as LIS or quadriplegia, their brains still function well.
With advances in sensor technologies, it is now possible to develop intelligent applications to manage, control, and automate our living environments without human intervention. The Internet of Things (IoT) is a good example of automation science and information technology. Many studies have combined the IoT and intelligent medical or rehabilitation systems with brain-computer interfaces (BCIs) [2]-[6]. In recent decades, BCI technologies have made many advances [7]-[14]. A BCI is considered to be a new communication platform that utilizes the dynamics of the user's brain. Recognizing speech through neural signals has been an emerging research area in the past few years. Previous studies usually required the performance of hand motor imagery or some other conversation-irrelevant task [15], [16]. However, those BCI methods are all nonintuitive. Various methods of recording brain activity can be used as the basis for direct speech synthesis in brain-computer communication. Electroencephalography (EEG) is widely used in the field of BCI due to its high temporal resolution and low cost. Electrocorticography (ECoG) can provide more information about brain signals but requires invasive implantation of subdural electrodes [17].
Many speech perception studies have focused on nontonal languages (e.g., English or German), and neural evidence of lexical tone processing is scarce. More than 60% of the languages in the world are tone languages, and the words in them are distinguished by tonal features [18]. The "tones" of a word can represent different meanings [19], [20]. Mandarin is one of the most spoken languages in the world (about 1. it uses five tones, which is very different from nontonal language (see Fig. 1) [21]. In Mandarin, the meaning of words cannot be determined without tonal information. For example, the syllable /ma/ can be accented with four lexical tones (i.e., Tone 1: flat-level tone; Tone 2: mid-rising tone; Tone 3: mid-falling-rising tone; and Tone 4: high-falling tone) to represent four distinct meanings: mother "媽", hemp "麻", horse "馬", and curse "罵", respectively.
To the best of our knowledge, fewer BCI studies have focused on translating brain signals into Mandarin than into English [22]-[27]. The ultimate goal of this work is to develop a direct BCI for Mandarin. To achieve this goal, we first need to answer the following two questions. 1) What is the difference in brain activity between speaking Mandarin with and without tones (i.e., what is the tone feature)? 2) What is the cognitive process in the brain when speaking a tonal language like Mandarin? Therefore, this study investigates the functional differences in the brain while speaking tonal and nontonal Mandarin. Native Mandarin speakers were asked to participate in two experiments: one in which they spoke normally, and one in which they spoke flat-tone Mandarin. We then analyzed the difference in EEG activity between the two. This experimental design is mainly intended to avoid effects caused by a second language [28]-[31]; by using the same language in both conditions, the observed phenomena can be attributed to tones. One can use either a bottom-up or a top-down approach to decode neural signals during speech production. The bottom-up approach maps the basic language units [32], [33] (e.g., phonemes or syllables) onto articulation areas (e.g., motor cortex and premotor cortex). This study uses the top-down approach to decode Mandarin: we first map speech to the sentence level and then relate it to brain signals.
The purpose of this study is to investigate whether Mandarin spoken with and without tone can be distinguished based on the subject's EEG. A previous study [59] demonstrated that the left hemisphere is relevant for generating grammatical sentences and syntax rules, and that the right hemisphere plays a key role in adding emotional intonation to speech. However, how tone is processed in the human brain is still not clear. Because of these functional hemispheric asymmetries, this study proposes an asymmetric feature extraction method to obtain the important features for tone. This study also proposes the binomial variable algorithm (BVA) to extract the significant features across subjects.
The remaining parts of this article are organized as follows. Section II introduces the related work. Section III describes the experiment design. Section IV presents our research method. Section V shows the evaluation results for single-subject and cross-subject classification. Section VI discusses our numerical results. Section VII concludes this study.

II. RELATED STUDY
Recently, different neuroimaging modalities, such as functional magnetic resonance imaging (fMRI), ECoG, and EEG, have been used to measure neural activities for decoding speech.
1) fMRI measures brain activities by detecting changes related to blood flow. This technique relies on the coupling of cerebral blood flow and neuron activation. Several studies have used fMRI to decode the spatial correlates of speech [34]-[36]. However, the temporal resolution of fMRI (including that of the latest high-field fMRI) is limited to a few seconds, whereas the human speech articulation process takes less than a quarter of a second. Moreover, human speech articulation involves the cooperation of different functional cortices, which originates from the mind and is manifested by the sensorimotor areas. The temporal correlation of different cortical locations and its relation to speech articulation therefore cannot be decoded by fMRI, given its low temporal resolution.
2) ECoG directly measures the electrical activities on the cortical surface. It has been used for accurate preoperative localization of epileptic seizures and provides high-density neural recordings. Recent studies have shown that ECoG can decode speech, including the ability to map speech-evoked sensorimotor activations [37]; generate neural encoding models of perceived phonemes [32], words [38], and sentences [22]; reconstruct acoustic properties of perceived speech [39]; generate natural-sounding synthetic speech from brain activity [24]; and immediately identify volunteers' spoken responses to a set of standard questions based solely on their brain activities [25]. Although recent studies have reported impressive progress in using neural signals for speech decoding, the complex dynamics, especially for Mandarin speakers, have yet to be fully elucidated.
3) EEG uses electrodes placed on the scalp to measure the electric potential of a large ensemble of simultaneously firing neurons. It is the most commonly used method for recording neural signals and has the major advantage of being noninvasive. EEG is widely used in BCI research because of its high temporal resolution and low cost [7]. It is easy to access, which helps the development of a BCI-based system for language generation. Various studies have used EEG to convert vocal speech to imaginary speech of the English vowels [40], syllables [41], [42], and "yes" and "no" [43]. Some EEG-based BCIs have used deep-learning-based automatic recognition for English words [44], vowels [44], and vocabulary [45].

A. Subjects
Fourteen healthy subjects aged 20-26 years (average age: 23.50 ± 1.99 years) were recruited to participate in this study. All subjects were right-handed native Mandarin speakers with no neurological or mental illness and no drug or alcohol abuse. The experiment was performed in accordance with the country's laws, and the experimental protocol was approved by the Institutional Review Board (IRB) of the National Chiao Tung University (NCTU) under the number NCTU-REC-108-127E. Each participant provided written informed consent prior to participation. The participants were compensated approximately U.S. $25 after the experiments.

B. Experimental Paradigm
This study uses a "Focus Group Interview" [46] method to create a new Brain Research Center Speech (BRCSpeech) Database to analyze spoken Mandarin. Focus Group Interview is a method of collective discussion of specific research issues. During the interview, the interviewees are stimulated to construct ideas [46]. The study aimed to determine the differences in EEG activity when subjects spoke preselected sentences with and without tone in the BRCSpeech Database.
We adopt the contrast method proposed by Duanmu's language experts [33]-[35]. A "contrast" refers to two words that sound different, that is, two words with different phonetic forms. We use two contrasts. One is the tone contrast; for example, tones 1 and 4 form the maximal pitch contrast among tones. The other is the contrast of articulators; for example, Labial and Dorsal are different articulators. In linguistics, these are differences in speech, and we assume that the differences will also be reflected in brain activities. The BRCSpeech Database we created also referenced the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database [24], [50]. Because normal speech mixes both long and short sentences, each of the 460 sentences in the TIMIT database contains 3-12 words. Likewise, each sentence in our BRCSpeech Database is composed of 3-12 words.
The BRCSpeech Database collected sentences composed of all Mandarin pronunciations, covering all combinations of Mandarin vowels and consonants. In this study, the BRCSpeech Database serves as the source of the sentences spoken in the tonal and nontonal experiments; it contains combinations of various Mandarin characters' sounds, which enriches the data collection. In this study, the important features of tones are identified. In our future study, we will analyze the four types of Mandarin tones based on the results of this study.
In our EEG speech experiment, sentences were shown on the computer monitor. Each sentence is composed of 3-12 words, which were randomly selected from the BRCSpeech Database, and the duration of the sentence displayed on the monitor was adjusted according to the length of the sentence. The baseline between two consecutive trials is a white screen that is 2 s long. Each session of the experiment lasts 25 min (about 185-191 sentences), as shown in Fig. 2. In order to avoid the differences in degree of cognitive control with and without tonal information in Mandarin, the experimental design of this study is divided into two different sections. One section required the subject to speak normally (involving tone), and the other required the subject to speak a flat tone like a robot without changing the tone (only Tone 1 was permitted). Each subject was asked to complete two different sections in random order.
Before recording data, the subjects would practice at least 16 trials in order to reduce the unexpected phenomenon caused by cognitive control with and without tone in Mandarin.

C. EEG Data Acquisition
This study used the SynAmps system (Australia Compumedics Ltd.) to record the EEG data, which has 64 unipolar sintered Ag/AgCl EEG electrodes placed on the scalp according to the international 10-20 system and referred to the linked mastoids (average of channel A1 and channel A2). Fig. 4 shows the layout of EEG electrodes on the cap. The impedance of all electrodes was kept below 5 kΩ. The EEG data were sampled at 1000 Hz with a 32-bit quantization. The spoken sentences were recorded with a microphone (sampling rate: 44.1 kHz/16 bit; dimensions: 325-mm circumference).

D. Speech Phone Labeling
To avoid unwanted noise (other than the main physiological signals) [51] and simplify the experiment to obtain better results, we designed an experiment to adapt to the random speaking speed of the subjects. In addition, the subject did not need to perform other activities (e.g., pressing buttons) other than speaking [24]. Subjects were asked to speak the sentence shown on the monitor at their regular pace immediately after watching the display. This study used a high-quality microphone to record the speech and synchronized the audio recording with the subjects' EEG. Speech was synchronized by using Presentation (© 2020 Neurobehavioral Systems, Inc.).

IV. RESEARCH METHODS
In this section, we detail the processes applied to the EEG data speech recognition.

A. EEG Data Preprocessing
EEG signals were first filtered to 0.5-180 Hz, and then downsampled to 500 Hz for data compression. We used MATLAB R2019b (The Mathworks, Inc.), Python, and the open-source EEGLAB toolbox (http://sccn.ucsd.edu/eeglab) [52]. As shown in Fig. 3, by using the EEGLAB visualization tool, EEG signals containing electrode noise and a large number of muscle artifacts can be identified and removed to improve the signal-to-noise ratio.
[Figure: Timing of a trial of the paradigm. EEG signals from baseline to end of speaking.]

B. Feature Extraction
This study analyzed the brain activities in three periods: 1) BS: 1 s before the sentence is displayed (the 2nd second of the baseline); 2) Before Speak: after the sentence is displayed and before the subject articulates the sentence; and 3) Speak: when the subject is articulating the sentence (see Fig. 5). Many previous studies have shown that emotions cause differences in brain activities during rest [53], [54]. To exclude the influence of emotions, we subtracted the BS-state power from the Before Speak and Speak states, so that the classification results reflect the tonal versus nontonal distinction rather than the baseline emotion. Specifically (see Fig. 5), the power in the range of 0.5-170 Hz was averaged across the time bins of BS to give the average baseline power, and this average baseline power was then subtracted from the power spectrum at each time bin of the Before Speak and Speak states.
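The baseline correction described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `subtract_baseline` and the numbers are hypothetical, and each list stands for the band power of one channel over successive time bins.

```python
def subtract_baseline(bs_power, state_power):
    """Subtract the mean BS power from every time bin of a task state.

    bs_power:    per-time-bin power values recorded during BS
    state_power: per-time-bin power values for Before Speak or Speak
    """
    if not bs_power:
        raise ValueError("baseline segment must contain at least one time bin")
    # Average baseline power over all BS time bins
    baseline_mean = sum(bs_power) / len(bs_power)
    # Remove that average from each time bin of the task state
    return [p - baseline_mean for p in state_power]


# Made-up power values for a single channel
bs = [4.0, 6.0]
speak = [7.0, 9.0, 5.0]
corrected = subtract_baseline(bs, speak)
print(corrected)  # [2.0, 4.0, 0.0]
```

The same call would be made once for the Before Speak bins and once for the Speak bins of each trial.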
Zheng et al. [55] found that the following different features and electrode combinations are effective for EEG-based emotion recognition: 1) power spectral density (PSD); 2) differential entropy (DE); 3) differential asymmetry (DASM); and 4) rational asymmetry (RASM) features from EEG. As a result, we used these features in this study. Further, this study also uses additional asymmetric (AASM) for feature extraction. The length of the window size used in this study was 0.5 s, which was based on the average time of each word (about 0.4-0.6 s in this study).
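As a rough illustration of the asymmetry features, the sketch below computes DASM and RASM for left/right electrode pairs, taking DASM as the difference and RASM as the ratio of the left and right values (the ratio form matches the description of RASM given later in this article). The function name, the input layout, and the numbers are assumptions made for the example; AASM is left out because its definition is not given here. The window length in samples assumes the 500-Hz rate obtained after downsampling.

```python
SAMPLING_RATE_HZ = 500   # rate after downsampling
WINDOW_SECONDS = 0.5     # window size used in this study
WINDOW_SAMPLES = int(SAMPLING_RATE_HZ * WINDOW_SECONDS)


def asymmetry_features(values, pairs):
    """Return (DASM, RASM) lists for left/right electrode pairs.

    values: dict mapping channel name -> feature value for one band and window
    pairs:  list of (left_channel, right_channel) tuples
    """
    dasm, rasm = [], []
    for left, right in pairs:
        dasm.append(values[left] - values[right])  # differential asymmetry
        # Rational asymmetry; a zero right-hand value would need special handling
        rasm.append(values[left] / values[right])
    return dasm, rasm


# Made-up feature values for two symmetric pairs
pairs = [("F7", "F8"), ("C5", "C6")]
values = {"F7": 3.0, "F8": 2.0, "C5": 1.0, "C6": 4.0}
dasm, rasm = asymmetry_features(values, pairs)
print(WINDOW_SAMPLES, dasm, rasm)  # 250 [1.0, -3.0] [1.5, 0.25]
```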

1) Power Spectral Density (PSD):
The EEG signals of each trial for all 62 channels were first transformed into the time-frequency domain to obtain the EEG PSD using the short-time FFT. The power in six frequency bands, delta-theta (0.5-7 Hz), alpha (8-12 Hz), beta (13-30 Hz), gamma (30-60 Hz), high gamma, and all bands (0.5-170 Hz), was averaged as the feature of each time bin. This procedure resulted in 372 features (six frequency bands by 62 channels) for each trial (sentence).
2) Differential Entropy (DE):
Shi et al. [56] found that EEG signals follow a Gaussian distribution in a few subbands after band-pass filtering from 2 to 44 Hz. As such, the DE (denoted by $h(X)$) of the EEG signals (denoted by $X$) in the frequency band $i$ can be derived by substituting the probability density function $f(x)$ of a Gaussian random variable $X \sim N(\mu, \sigma^2)$ into
$$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx.$$
Then, we can obtain
$$h(X) = \frac{1}{2} \log\left(2 \pi e \sigma^2\right)$$
where $\sigma^2$ is the signal variance of $X$.
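The closed-form DE of a Gaussian signal, one half of the logarithm of 2πeσ², is easy to check numerically. The sketch below is a minimal illustration, not the processing code of this study: the function name is hypothetical, the natural logarithm is assumed, and the input is taken to be the samples of one band-pass-filtered window.

```python
import math


def differential_entropy(samples):
    """DE of a band-limited window under a Gaussian assumption."""
    n = len(samples)
    mean = sum(samples) / n
    # Population variance of the window; a constant signal (variance 0)
    # has no finite DE and would raise an error in math.log
    variance = sum((x - mean) ** 2 for x in samples) / n
    return 0.5 * math.log(2 * math.pi * math.e * variance)


unit_variance = [1.0, -1.0, 1.0, -1.0]   # variance 1
doubled = [2.0, -2.0, 2.0, -2.0]         # variance 4
de_unit = differential_entropy(unit_variance)
de_doubled = differential_entropy(doubled)

# PSD feature count stated in the text: six bands by 62 channels
n_psd_features = 6 * 62
```

Doubling the amplitude multiplies the variance by four, so the DE grows by exactly ln 2, which shows that DE is a logarithmic measure of band power.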

C. Dimensionality Reduction
The aim of this study is to implement a real-time BCI. Fewer features mean less computation time, which a real-time BCI requires. This study used principal component analysis (PCA) to reduce the dimensionality. We also customized the BVA to extract the significant features across subjects, based on the binomial hypothesis test and multifactor-dimensionality reduction (MDR) [64]. There are two hyperparameters to be set in BVA as follows.
1) O: The proportion of the optimal feature.
2) X: The threshold of the number of interactions between individuals.
In the BVA method, the two dimensionality-reduction steps are as follows.
Step 1 (Selecting the Optimal Features for Each Subject by Setting O Value): f(s, n): The setting of optimal features for each subject.
Step 2 (Selecting the Important Features for the Cross-Subjects by Setting X Value): Bf(n): To compute the important features for cross-subjects.
Here, s is the subject number, ranging from 1 to 14, and n is the feature number, ranging from 1 to 162 (note that the RASM method yields 162 features).
The value of f(s, n) equals 1 if feature n is selected as an optimal feature for subject s; otherwise, it equals 0. Then, Bf(n) is defined as
$$Bf(n) = \sum_{i=1}^{s} f(i, n)$$
where the sum runs over all subjects. If Bf(n) is greater than or equal to X, feature n is regarded as an important feature for cross-subjects. For the pseudocode, see Algorithm 1.
A small value of O meant that the number of optimal features was small, that is, the feature achieving excellent classification performance of each subject was taken as the threshold; contrarily, a larger value of O indicated a larger range of thresholds. A larger value of X meant that a feature would be set as an important one if the feature was shared by multisubjects, which imposed relatively strict restrictions.
Therefore, both the values of O and X would influence the final number of selected important features. A larger number of important features were less favorable to the real-time BCI design, but were likely to enhance the classification effect, suggesting that the optimization of operating parameters was a crucial consideration.
We now describe how the hyperparameters are selected in our methodology. We first set the value of O (trying O = 5, 10, 15, 20, and 25 in turn), and then set the value of X (with 14 subjects in total, X starts from 14, then 13, and so on, decreasing one at a time). We obtain the number of important features for each pair of hyperparameters O and X. Finally, we choose the appropriate number of important features according to the results. In this study, it is expected that the final number of channels will be between 10 and 20, which will be favorable for future BCI development.

Algorithm 1
    for i = 1 to s do
        for j = 1 to n do
            if Accuracy(i, j) ∈ Oset(i) then
                f(i, j) = 1
            else
                f(i, j) = 0
            end if
        end for
    end for
    Bf ← 0
    for j = 1 to n do
        for i = 1 to s do
            Bf(j) = Bf(j) + f(i, j)
        end for
    end for
    % According to the X value, compute the important features for cross-subjects.
    for j = 1 to n do
        if Bf(j) ≥ X then
            feature j is an important feature for cross-subjects
        end if
    end for

Therefore, if O is set to five, it will be too harsh, and if O is set to 20, it will be too loose. In this study, we set O = 10. After setting O, we set X. More important features require more channels, which makes the BCI harder to implement, whereas fewer important features may lead to poor classification. Therefore, we selected the settings most likely to yield a practical BCI: X = 7 gives nine important features, and X = 6 gives 14 important features. This study therefore selected X = 7 and X = 6 for further analysis.
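The two BVA steps can be sketched in Python as follows. This is an illustrative reading rather than the implementation used in this study: it assumes that the optimal set of a subject consists of the top O percent of features ranked by that subject's single-feature accuracy, and the function names and the small accuracy matrix are made up for the example.

```python
def select_optimal(accuracy, o_percent):
    """Step 1: f[i][j] = 1 if feature j is in the top O% for subject i.

    accuracy: list of per-subject lists, accuracy[i][j] for feature j
    """
    f = []
    for subject_acc in accuracy:
        n = len(subject_acc)
        k = max(1, round(n * o_percent / 100))  # size of the optimal set (assumed)
        ranked = sorted(range(n), key=lambda j: subject_acc[j], reverse=True)
        top = set(ranked[:k])
        f.append([1 if j in top else 0 for j in range(n)])
    return f


def bva(accuracy, o_percent, x_threshold):
    """Step 2: count subjects sharing each feature and keep those with Bf >= X."""
    f = select_optimal(accuracy, o_percent)
    n = len(accuracy[0])
    bf = [sum(row[j] for row in f) for j in range(n)]
    important = [j for j in range(n) if bf[j] >= x_threshold]
    return bf, important


# Made-up accuracies: 3 subjects by 4 features
accuracy = [
    [0.90, 0.60, 0.80, 0.50],
    [0.70, 0.90, 0.80, 0.40],
    [0.95, 0.50, 0.60, 0.90],
]
bf, important = bva(accuracy, o_percent=50, x_threshold=2)
print(bf, important)  # [2, 1, 2, 1] [0, 2]
```

Raising `x_threshold` shrinks the list of important features, while raising `o_percent` enlarges each subject's optimal set, which mirrors the trade-off between O and X discussed above.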

D. Classification
This study applied linear discriminant analysis (LDA) and K-nearest neighbor (KNN) algorithms with fivefold cross-validation to the EEG features to classify spoken Mandarin with versus without tones. K-fold evaluation is a popular and easy-to-understand technique. It ensures that every observation in the original data set has a chance to appear in the training and test sets. This study evaluated the associations between EEG power in different frequency bands at different channels and Mandarin speech tones.
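To make the evaluation procedure concrete, the following sketch implements a plain KNN classifier and a fivefold split using only the Python standard library. It is a simplified illustration under stated assumptions: the helper names are hypothetical, the folds are contiguous without shuffling, the two-dimensional features are made up, and LDA is omitted because it needs linear-algebra routines. The last line reproduces the fold sizes for the 370 trials per subject used in the single-subject evaluation.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training points."""
    # Every training sample is compared with the query
    neighbours = sorted(zip((math.dist(x, query) for x in train_x), train_y))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def kfold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists for contiguous folds of equal size."""
    fold_size = n_samples // n_folds  # assumes n_samples divides evenly
    for fold in range(n_folds):
        test = list(range(fold * fold_size, (fold + 1) * fold_size))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test


def cross_validate(xs, ys, n_folds=5, k=3):
    """Return the fraction of held-out samples classified correctly."""
    correct = 0
    tested = 0
    for train, test in kfold_indices(len(xs), n_folds):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for i in test:
            correct += knn_predict(train_x, train_y, xs[i], k) == ys[i]
            tested += 1
    return correct / tested


# Made-up features: two well-separated clusters with interleaved classes
xs = [(0.1 * i, 0.0) if i % 2 == 0 else (5.0 + 0.1 * i, 5.0) for i in range(20)]
ys = ["tonal" if i % 2 == 0 else "nontonal" for i in range(20)]
accuracy = cross_validate(xs, ys)
fold_sizes = [(len(train), len(test)) for train, test in kfold_indices(370)]
```

In practice the trials would be shuffled (and ideally stratified by class) before splitting; the contiguous folds here only keep the example deterministic.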

V. EVALUATION
In this section, the classifier is combined with different feature extraction methods to classify tonal Mandarin versus nontonal Mandarin. First, we classify a single subject and compare the impacts of the feature extraction methods and classification on different frequency bands. Tables I and II exhibit the results of single subjects, and  Tables III and IV show the results of the classification of 14 subjects by the leave-one-out cross-validation (LOOCV) method (13 were trained, and another one was tested, with each subject being a test subject once). Table III shows the classification results of EEG activities measured in the baseline (BS), Before Speak, and Speak, and Table IV compares the classification results of the PCA dimensionality reduction method and BVA dimensionality reduction method.

A. Evaluating Single-Subject Performance
For each EEG feature (data point) of each subject's trial, fivefold cross-validation was used to estimate the accuracy of LDA classification. We randomly used 370 trials for classification from the total trials. Therefore, there are 370 trials for each subject (185 "nontonal" and 185 "tonal" sentences). We split the data into five subsets, and each subset has 74 trials. We trained the model with 296 trials, and tested it on the remaining 74 trials. We repeated this procedure five times and averaged the accuracy obtained by each subject. Brain activities in different frequency bands usually reflect their distinct cognitive activities [53], [54], [65]-[67]. Therefore, this study divided brain activities into six frequency bands.
[Table III: Using all features to run LOOCV to train the LDA-based cross-subject model; this table shows the cross-subject mean accuracy.]

B. Evaluating Cross-Subjects Performance
The LOOCV method was used to perform an LDA-based cross-subject classification on the 14 subjects, in which data from 13 subjects were used for training and data from the one remaining subject were used for testing. Table III shows the results obtained from using different feature-extraction methods. The RASM feature-extraction method performed best in the cross-subject classification. It achieved a classification accuracy of 98.82% in the Speak state, with a standard deviation of 0.66, which is the minimum standard deviation among the five feature-extraction methods in the Speak state. The RASM feature-extraction method achieved the best accuracy of 97.93%, 97.72%, and 98.82% in the BS, Before Speak, and Speak states, respectively. Averaged over the three states, the accuracy (%) of the PSD, DE, AASM, DASM, and RASM-based classification is 52.06, 70.91, 68.26, 67.73, and 98.16, respectively. This suggests that RASM features achieved the highest classification accuracy (98.16%), followed by DE (70.91%). Table III indicates that RASM achieved the highest classification accuracy under the LDA-based cross-subject classification. Table IV compares the classification results using the RASM method in conjunction with the dimensionality reduction algorithm (PCA versus BVA). In the PCA algorithm, 10% of the features are taken (RASM has 162 features; 10% × 162 ≈ 16), that is, the first 16 principal components were used as the important features.
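The leave-one-subject-out protocol and the number of retained principal components described above can be sketched as follows. The generator name is hypothetical, and the subjects are simply numbered 1 to 14 for illustration.

```python
def leave_one_subject_out(subject_ids):
    """Yield (training_subjects, held_out_subject) for every subject in turn."""
    for held_out in subject_ids:
        training = [s for s in subject_ids if s != held_out]
        yield training, held_out


subjects = list(range(1, 15))  # 14 subjects
splits = list(leave_one_subject_out(subjects))

# 10% of the 162 RASM features, rounded down to a whole number of components
n_rasm_features = 162
n_components = n_rasm_features * 10 // 100

print(len(splits), len(splits[0][0]), n_components)  # 14 13 16
```

Each subject serves as the test subject exactly once, so the reported cross-subject accuracy is the mean over 14 such splits.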
Under the LDA-based classifier, the PCA-based dimensionality-reduction method reached accuracies of 84.51%, 73.91%, and 72.90% in the BS, Before Speak, and Speak states, respectively. With X ≥ 7, nine important features were taken in the BVA-based dimensionality-reduction method, which achieved accuracies of 77.71%, 71.22%, and 72.62% in the BS, Before Speak, and Speak states, respectively; with X ≥ 6, 14 important features were taken in the BVA-based method, which achieved accuracies of 85.09%, 77.52%, and 76.90% in the BS, Before Speak, and Speak states, respectively. For the LDA-based classifier, we find that the number of features/components retained by both the PCA and BVA methods affects the performance significantly. For the PCA algorithm, when the dimension is reduced to 16, the accuracy dropped from 97.93% to 84.51%, from 97.72% to 73.91%, and from 98.82% to 72.90% in the BS, Before Speak, and Speak states, respectively. For the BVA (X ≥ 6) algorithm, when the dimension is reduced to 14, the accuracy dropped from 97.93% to 85.09%, from 97.72% to 77.52%, and from 98.82% to 76.90% in the BS, Before Speak, and Speak states, respectively.
With BVA (X ≥ 6), there are 14 important features. The results of the LDA-based classification with these 14 important features can be compared with the results of the PCA-based classification using 16 components. The accuracy of using PCA during the BS, Before Speak, and Speak states is 84.51%, 73.91%, and 72.90%, respectively. The accuracy of using BVA (X ≥ 6) during the BS, Before Speak, and Speak states is 85.09%, 77.52%, and 76.90%, respectively. We also performed a t-test to assess the statistical significance of the difference between the BVA and PCA methods. The BVA (X ≥ 6) using 14 features outperformed the PCA using 16 components significantly (p < 0.01).
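The text does not state which variant of the t-test was used. As one plausible reading, the sketch below computes a paired t statistic over matched accuracies of the two methods. The function name is hypothetical and the accuracy values are made-up numbers chosen for easy checking, not results from this study.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same subjects, same order)."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Made-up matched accuracies (%) for four test subjects
bva_acc = [86.0, 79.0, 78.0, 80.0]
pca_acc = [85.0, 77.0, 75.0, 78.0]
t_stat = paired_t_statistic(bva_acc, pca_acc)
degrees_of_freedom = len(bva_acc) - 1
```

The statistic would then be compared with the critical value of the t distribution for the given degrees of freedom to decide whether the difference is significant at the chosen level.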
When the PCA/BVA methods are used to reduce the dimensionality, the KNN classifier can obtain a higher accuracy, because identifying the important features mitigates the disadvantages of the traditional KNN algorithm. The traditional KNN classification has three limitations.
1) High Calculation Complexity: To find the k nearest neighboring samples by KNN, all the similarities between the training samples must be calculated. When there are few training samples, the calculation time is not overwhelming, but if the training set contains a large number of samples, the KNN classifier needs more time to calculate the similarity [68].
[Figure: The blue circles represent All-band, the purple circles represent High Gamma, the gray circles represent Gamma band, the pink circles represent both Alpha and All-band, and green circles represent both Alpha and Gamma band.]

2) Dependence on the Training Set:
The classifier is only generated with the training samples and does not use any additional data. This makes the algorithm dependent on the training set excessively. It needs to be recalculated even if the training set has a small change.

3) No Weight Difference Between Samples: Because training samples are treated equally in the KNN, there is no difference between the samples with small and large amounts of data.
With the help of the BVA (X ≥ 6) method using only 14 features, the KNN classifier can achieve an accuracy of 99.2%, 94.3%, and 96.7% in the BS, Before Speak, and Speak states, respectively. We also find that the BVA outperforms the PCA (p < 0.01) under a KNN-based classifier. Table IV reveals that using all features, the LDA classifier could achieve higher accuracy. Presumably, the LDA can only learn simple linear boundaries among the data clusters. The high classification performance obtained by the LDA indicates that there were obvious differences in the data distributions of EEG signals under different conditions, which may reflect different cognitive activities of the brain.
As shown in Table IV, we speculate that the 14 features with BVA (interaction threshold X ≥ 6) are important for distinguishing tonal and nontonal sentences in the EEG classification. Table V. When the (C5, C6) All-band feature was removed, the accuracy dropped significantly; however, the removal of other features did not have a significant influence, indicating that (C5, C6) was the critical feature in this study. This is consistent with prior literature reporting that C5 and C6 are most relevant to the brain areas active in the Speak state [69]. Table V divides the 14 important features obtained through the BVA into the frontal lobe, temporal lobe, parietal lobe, and occipital lobe of the brain. To reduce the dimensionality based on regions, we took one channel pair as the feature values in each region, giving a total of 20 combinations (Frontal: 5 × Temporal: 1 × Parietal: 2 × Occipital: 2 = 20). The KNN method was used to classify the tones based on the brain dynamics in the Speak state. The results given in Table VII indicate that the combination of (F7, F8), (C5, C6), (P5, P6), and (O1, O2) achieved the highest accuracy (94.44%), and the frontal-lobe channel pair achieved the highest accuracy in conjunction with the combination of (C5, C6), (P5, P6), and (O1, O2) in each block in Table VII. Thus, using eight channels can achieve an accuracy of 94.44%.
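The count of 20 region-wise combinations can be reproduced by enumerating one channel pair per region. In the sketch below, only the pairs named in the text, (F7, F8), (C5, C6), (P5, P6), and (O1, O2), are real; the remaining pair names are placeholders standing in for the entries of the tables, which are not reproduced here.

```python
from itertools import product

# One list of candidate channel pairs per brain region.
# Names such as "frontal_L2" are hypothetical placeholders.
region_pairs = {
    "frontal": [("F7", "F8")]
    + [(f"frontal_L{i}", f"frontal_R{i}") for i in range(2, 6)],
    "temporal": [("C5", "C6")],
    "parietal": [("P5", "P6"), ("parietal_L2", "parietal_R2")],
    "occipital": [("O1", "O2"), ("occipital_L2", "occipital_R2")],
}

# Choose exactly one pair from every region
combinations = list(product(*region_pairs.values()))

# Number of distinct channels used by each combination
channels_per_combination = {
    len({channel for pair in combo for channel in pair}) for combo in combinations
}

best = (("F7", "F8"), ("C5", "C6"), ("P5", "P6"), ("O1", "O2"))
print(len(combinations), channels_per_combination)  # 20 {8}
```

Each combination uses four pairs, that is, eight channels, and each would be evaluated with the KNN classifier to find the best-performing set.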

VI. DISCUSSION
Communication in real life is achieved through sentences. A sentence-level design for studying speech is more applicable to natural language than a word-level design. This study aims to understand the neural processing before and during speaking. We investigated the brain activities during the BS (baseline), Before Speaking, and Speaking a sentence in
[Table VII: Data channel pair corresponding to each brain area in Table VI, with the features extracted using the RASM method. The KNN classifier is used to classify the tone and nontonal sentences.]
according to [72], and with the aid of dimensionality reduction, our proposed BCI method can be used for accurate and real-time classification of the tonal versus nontonal language. This study will have the following potential impacts.

A. New Findings in Tone-Speaking Brain Dynamics
By analyzing the EEG recordings, the study demonstrates that it is possible to differentiate whether a native Mandarin speaker is using tone or not. Specifically, the cognitive processes of speaking tonal or nontonal Mandarin are different. Many feature extraction methods for analyzing EEG have been reported in the literature, such as the asymmetric feature extraction method [55], [61]-[63] and the single-channel-based method [40]-[45], [73]. Most of the previous EEG speech recognition research works were based on single-channel-based methods for feature extraction. To the best of our knowledge, this is the first study to apply the asymmetric feature extraction method for speech recognition through EEG signals. Additionally, this study finds that the RASM feature extraction method can achieve the best accuracy among these feature extraction methods (PSD, DE, AASM, DASM, and RASM) in cross-subject classification. RASM is one of the asymmetric feature extraction methods and acts like a normalization that brings the values in the data set onto a common scale. It is obtained by dividing the left channel's value by the right channel's value. The fact that the classifier achieves its best result with RASM features also supports an asymmetric cognitive process of the left and right hemispheres when speech is delivered with or without tone [74] (important result 1).
To exclude the influence of emotions, we removed the BS-state data from the corresponding Before Speak and Speak state data to ensure that the classification results reflected the tonal versus nontonal distinction. According to previous studies, emotion might affect brain activities in the resting state [53], [54], [75]. From Table IV, we observe that the Baseline and Before Speak states can be classified with high accuracy. The results therefore support that speaking with or without tone is related to speaking motivation and articulation (important result 2). Consequently, this study can help design the direct-speech BCI and facilitate human-machine interaction (HMI). In the design of a BCI or HMI, a prejudgment involving emotion before speaking is a very important design element for many patients with aphasia. It also truly implements the spirit of automation science and engineering [76].
When we communicate with others, whether in a tonal or a nontonal language, emotions may accompany our speech. For example, when we say "I am very happy," we may feel happy, and the prosody of our speech may change as well. From (important result 1) and (important result 2), we also speculate that the asymmetric feature extraction method may be helpful not only for tonal languages but also for nontonal languages when we speak in a natural situation. This hypothesis is worth verifying in future work.
Previous research indicated that the brain elicits high-gamma (70-160 Hz) oscillations during linguistic phonetic processing [77], [78]. Although the cognitive process of speech in the brain is still unclear, we can speculate that high-gamma activity is an important feature for analyzing brain dynamics during speaking. The single-subject results in this study showed that, with the RASM method, the best classification band in the Speak state is high gamma (99.00%) (important result 3).
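To make the band-limited feature concrete, the following sketch estimates the power of one channel in the 70-160 Hz range from a simple periodogram. It is only an illustration under stated assumptions: the function name `band_power`, the 500-Hz sampling rate, and the synthetic sine waves are hypothetical and do not describe the recording setup or the processing pipeline of this study.

```python
import numpy as np


def band_power(signal, fs, f_lo=70.0, f_hi=160.0):
    """Mean periodogram power of `signal` between f_lo and f_hi (in Hz).

    fs is the sampling rate in Hz; it must exceed 2 * f_hi so that the
    band lies below the Nyquist frequency.
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()            # remove the DC offset
    spectrum = np.fft.rfft(signal)             # one-sided spectrum
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(spectrum) ** 2 / signal.size
    mask = (freqs >= f_lo) & (freqs <= f_hi)   # keep only the chosen band
    return float(power[mask].mean())


fs = 500                                       # assumed sampling rate (Hz)
t = np.arange(fs) / fs                         # one second of samples
in_band = np.sin(2 * np.pi * 100 * t)          # 100 Hz: inside the band
out_of_band = np.sin(2 * np.pi * 10 * t)       # 10 Hz: outside the band

p_in = band_power(in_band, fs)
p_out = band_power(out_of_band, fs)
```

Here `p_in` is much larger than `p_out`, since only the 100-Hz component falls inside the band. Computing such a band power for a left and a right channel and taking their ratio would give a band-specific asymmetry feature in the spirit of RASM.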
The results of this study implicate not only the language-related brain areas, such as the parietal [79] and temporal [69] areas, but also the frontal and occipital areas, which may be triggered by stimulus-driven executive control. We also found that the channel pair (C5, C6) is the critical feature in this study, which is consistent with prior research reporting that C5 and C6 are the channels most relevant to the brain area involved in speaking (articulation) [69] (important result 4).
From these results, this study obtained a satisfactory classification accuracy, indicating that tonal and nontonal Mandarin may rely on different brain mechanisms in terms of cognitive behaviors. While the majority of previous studies have focused on the brain processing of nontonal languages, this is the first study to analyze the presence or absence of tones in sentences based on EEG signals for a tonal language. By using a machine learning classification approach, we confirmed that brain activities (cognitive-behavioral differences) differ when people speak with or without tone.

B. Key Step to Direct-Speech BCI
This study found that the cognitive process of the brain differs when speaking with or without tone. Previous studies of speech synthesis have already identified the articulation space of the brain when speaking English [24]-[27]. English is a nontonal language, and according to this study, a synthesis model built for a nontonal language cannot be directly applied to a tonal language. Yet over 60% of the world's population uses tonal languages [18], and Mandarin is one of the most widely spoken tonal languages. Ascertaining the tonal feature is therefore a key step not only toward direct-speech BCI for tonal languages but also toward cross-language direct-speech BCI.
The largest difference between the tonal and the nontonal languages lies in their tones. If the tonal feature can be interpreted by physiological signal analysis, there is an opportunity to add tonal features based on the articulation space of English shown in past studies. Then, not only can the tonal languages be synthesized but the cross-language direct-speech BCI can also be achieved. BCIs can serve all ethnic groups and languages, which is the ultimate goal of Automation Science and Engineering [76]. We are looking forward to the invention of such a BCI.

VII. CONCLUSION
This study investigated the brain dynamics of human speech in tonal and nontonal Mandarin based on EEG recognition. In contrast to ECoG and fMRI, EEG has the advantages of low cost, mobility, fieldability, high temporal resolution, and noninvasiveness. The brain activities corresponding to tonal and nontonal Mandarin sentences exhibit different behaviors that can be distinguished by classifying EEG signals. To the best of our knowledge, this is the first study to apply the asymmetric feature extraction method to speech recognition through EEG signals. This study finds that the RASM feature extraction method achieves the best accuracy in cross-subject classification. In addition, our proposed methodology, BVA, achieves an accuracy of 98.82% in cross-subject classification. Furthermore, we show that using eight channels [(F7, F8), (C5, C6), (P5, P6), and (O1, O2)] achieves an accuracy of 94.44%. The methods developed in this study to discover different brain activities will benefit and shed light on the design of future speech-synthesis BCIs for the more than 60% of people in the world who use tonal languages.