electromyography; sensors placement; sequential forward selection algorithm.

Automatic speech recognition (ASR) based on surface electromyography (sEMG) sensors is an important technology that converts electrical signals into computer-readable textual messages, which can overcome the limitation of acoustic sensors that are easily contaminated by environmental noise. However, current placements of sEMG sensors mainly depend on the experimenter's experience, which could miss important information about the major muscular activities and degrade classification performance. In this study, 120 closely spaced sEMG sensors were utilized to collect high-density sEMG (HD sEMG) signals for recognizing ten digits in English and Chinese. A linear discriminant analysis classifier was used to classify the speaking tasks, and the sequential forward selection algorithm was utilized to analyze the optimal positions of the sensors. The results showed that the HD sEMG energy maps could help visualize the dynamic muscle activities during the speaking process, and significantly different muscular contraction patterns were observed for different speaking tasks. The classification accuracies obtained with the facial sensors were significantly lower than those obtained with the neck sensors, even with the same number of sensors. Moreover, the classification rates could be higher than 90% with only 15 optimally selected sensors that were mainly distributed on the neck instead of the face. This study suggests that the neck muscles could be the main contributor, and more sEMG sensors should be placed on the neck to improve the ASR performance. The findings of this study could provide valuable clues for the development of a practical sEMG-based speech recognition system, especially for patients with speaking disorders.

Speaking, as one of the necessary ingredients of human life, is an essential means of human social communication. Speaking is a complex process controlled by a large number of articulatory muscles associated with phonation. Speaking different words or languages requires different ways of pronunciation and therefore involves different muscular contraction patterns, which can be recorded by a non-invasive technique called surface electromyography (sEMG), in which EMG sensors placed on the skin surface measure the corresponding electrical signals. Since the sEMG signals contain substantial dynamic information about the articulatory muscle activities, sEMG sensors can be used in automatic speech recognition (ASR) systems that convert the electrical sEMG signals associated with human speaking into computer-readable textual messages [1]. Unlike conventional recognition methods that use the human voice collected by acoustic sensors, sEMG-based ASR systems do not rely on acoustic signals, which are not always available and are easily contaminated by environmental noise. Consequently, they can be used even if the subject does not produce any audible voice (silent speech), as is the case for patients with speech disorders [2,3]. The sEMG-based ASR system has thus developed into a prevalent technique with a wide variety of applications for speech recognition in both audible and silent modes [4][5][6].
Since the sEMG technique is non-invasive and easy to use, the sEMG-based ASR has been reported in numerous studies during the past decades. For example, fifteen English words were classified by using the sEMG signals recorded from two sensors over the neck muscles of the subject with a firefighter's self-contained breathing apparatus [7]. In another study, a three-channel EMG system was developed for patients with speech impairment, and three Arabic vowels were recognized by using the sEMG signals recorded from facial muscles [8]. Three channels of sEMG sensors were placed on the facial muscles, and eleven voiceless Bangla vowels were classified by using the artificial neural network [9]. A total of eight sEMG sensors (4 on the face and 4 on the neck) were used to record the sEMG signals when reading phrases constructed from a 2500-word vocabulary for silent speech recognition of patients at least 6 months after total laryngectomy [3]. Five sensors (two on the face and three on the neck) were used to acquire the sEMG signals, and fourteen sEMG features and four classifiers were examined to classify eleven Thai words [10]. Five channels of sEMG sensors located on the facial muscles were utilized to classify nine Thai syllables for the rehabilitation of dysarthric patients [11]. Ten sEMG sensors placed on the facial and neck muscles were used to recognize ten specific silent speech commands in Chinese [12]. Moreover, different sEMG-based speech recognition systems have also been developed for different languages, such as English [13][14][15], Chinese [16][17][18], Japanese [19,20], Thai [21,22], Korean [23], Aceh [24] and Malay [25].
However, the placements of the sensors in the above-mentioned studies were mostly decided by the experimenter's experience or by trial and error without any quantitative analysis, which could degrade the performance of the ASR system because of improperly placed sensors. A possible solution might be to place the sensors according to the physical distribution of the articulatory muscles [26]. Nevertheless, speaking is a complex neuromuscular activity involving a large number of small facial and neck muscles, and therefore the speaking of different words might generate dramatically different patterns of muscular involvement [27]. Moreover, each language may have its unique activation pattern of the articulatory muscles because of its specific pronunciation style [28]. Studies showed that the role of the articulatory muscles could be significantly different for different languages, and the placement of sensors could considerably affect the performance of the ASR system accordingly [29,30]. Therefore, investigating the contributions of different articulatory muscles is helpful for providing objective guidelines on optimal sensor placement when only a limited number of sensors is available, so that the accuracy of the sEMG-based ASR system could be considerably improved. However, to date, few studies have investigated the contributions of the facial and neck muscles to speech recognition in different languages.
In addition, most of the previous studies used only a few sensors located on the facial and/or neck muscles to record sEMG signals as the input of the sEMG-based ASR system. However, the muscles responsible for speaking are numerous and small, and they span a relatively large area across the face and neck to achieve subtle movements. A few empirically placed sensors may not provide adequate information to investigate the contributions of the facial and neck muscles to speech recognition. It is still unclear which region's muscles (the face or the neck) play a more important role in the ASR system, owing to the lack of comprehensive analyses of the full information from all the muscles. These challenges have motivated the emergence of the high-density sEMG (HD sEMG) technique using multi-channel sEMG sensors in the sEMG-based ASR field. The HD sEMG technique uses a large number of closely placed sensors to record the electrical activities of a large area of muscles so that comprehensive information about a group of target muscles can be fully revealed [31]. Over the past few decades, HD sEMG signals have been adopted in many research studies to decode motion intents for human-machine interaction systems, to evaluate the swallowing functions in patients with dysphagia, to study the behavior of the paraspinal muscles in people with low back pain, and to analyze the motor unit decomposition in a non-invasive way [32][33][34]. The technique is also clinically useful in the assessment of motor fiber conduction velocity [35] and fatigue evaluation of motor unit action potentials [36] due to its non-invasiveness and its capacity to record over very long periods.
The introduction of HD sEMG technique into the sEMG-based ASR system could overcome the limitation of current methods with insufficient sensors so that complete information about articulatory muscles could be obtained to analyze the contributions of the facial and neck muscles in ASR, which helps to provide practical guidelines on how to place the sEMG sensors to improve the performance of the sEMG-based ASR system.
The purpose of this study is to investigate the contributions of different articulatory muscles for English and Chinese speech recognition using multi-channel sEMG sensors. A total of 120 surface sensors closely placed over the facial and neck muscles were utilized to simultaneously collect HD sEMG signals when the subjects were speaking ten English and Chinese digits, respectively. A set of topographic maps were constructed to visualize the dynamic energy distribution of the articulatory muscular activities during the speaking process. The classification accuracies were calculated and compared for different sensor groups of the face and neck regions. The distribution of the optimal sensors automatically selected by a sequential forward selection algorithm was also analyzed to investigate the roles of different articulatory muscles. This study could provide a useful guideline for appropriately placing sEMG sensors and pave the way for the development of a clinically feasible system for sEMG-based speech recognition, especially for patients with speaking disorders.

A. Subjects and experimental procedure
A total of eighteen healthy volunteers (eleven males and seven females) with normal speaking and hearing capabilities were recruited to participate in the experiment of this study. All of the volunteers were native Chinese speakers with no less than ten years of English learning. Before the speaking tasks, the subjects were briefed in detail on the purpose and procedures of the experimental protocol. The experiments were approved by the Institutional Review Board of Shenzhen Institutes of Advanced Technology (#IRB ID: SIAT-IRB-170815-H0178). Every subject provided written informed consent and permitted the scientific and educational use of their photos and data.
In the experiments, the subjects were required to speak different digits audibly in both English and Chinese, and the corresponding HD sEMG signals were collected from the articulatory muscles of the face and neck regions by multi-channel sEMG sensors. Before each session, 40 seconds of electrical signals were recorded while each subject remained in a relaxed state without any speaking or movement to obtain the baseline for the sEMG signals. Then the subjects were asked to speak ten digits (0 to 9) in English and Chinese, respectively (Table I). For each trial, each digit was spoken within one second, followed by a three-second rest to avoid muscle fatigue. Each trial was repeated 28 times before continuing to the next digit. The experiments were carried out in an electromagnetically shielded room to ensure high-quality HD sEMG recordings.

B. HD sEMG acquisition
In this study, the HD sEMG signals were synchronously recorded by a total of 120 sEMG sensors closely placed on the face and neck regions. The REFA 128 system (TMSI, REFA, the Netherlands) was used for the data collection with a sampling rate of 2048 Hz for each channel. Before the data acquisition, the skin surface was cleaned carefully with an alcohol pad to remove dust, dander, and skin oil that could affect the quality of the sEMG signals. The 120 sEMG sensors were arranged as a set of two-dimensional arrays to cover all the facial and neck muscles, and the distance between adjacent sensors was kept at a small interval of 15 mm to obtain comprehensive electrophysiological information at a high spatial resolution. As shown in Fig. 1, eighty sensors were structured in a 5 × 16 grid evenly located on the neck muscles. Meanwhile, two sensor arrays, each in a 4 × 5 grid (40 channels in total), were symmetrically placed on the left and right sides of the facial muscles.
In order to compare the contributions of different muscles in speech recognition, the sensor arrays were grouped in six ways (Fig. 2): (1) F-40: all the 40 sensors on the facial muscles (channel F1 to F40); (2) NO-40: the 40 sensors located at the odd columns of sensor arrays in the neck region (channel N1, N3, …, N79); (3) NC-40: the 40 neighboring channels located at the central area of the neck (channel N5 to N12, …, N69 to N76); (4) NE-40: the 40 sensors located at the even columns of the neck region (channel N2, N4, …, N80); (5) NA-80: all the 80 sensors on the neck muscles (channel N1 to N80); (6) FN-120: all of the 120 sensors on the facial and neck muscles.
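The six sensor groups can be written down as explicit channel lists. The following is a minimal sketch, not the authors' implementation: it assumes that the neck channels are numbered row by row across the 5 × 16 grid (so that N5 to N12 are the eight central columns of the first row), which is inferred from the channel ranges quoted above. The function names are hypothetical.

```python
# Sketch of the six sensor groups. Row-major numbering of the neck grid is an
# assumption inferred from the quoted channel ranges (N5-N12, ..., N69-N76).
NECK_ROWS, NECK_COLS = 5, 16   # 5 x 16 neck grid, channels N1..N80
FACE_CHANNELS = 40             # two 4 x 5 facial arrays, channels F1..F40


def neck_channel(row, col):
    """1-based neck channel number for a 0-based (row, col) position."""
    return row * NECK_COLS + col + 1


def build_sensor_groups():
    """Return a dict mapping each group name to its list of channel labels."""
    n_neck = NECK_ROWS * NECK_COLS
    face = [f"F{k}" for k in range(1, FACE_CHANNELS + 1)]
    neck = [f"N{k}" for k in range(1, n_neck + 1)]
    odd = [f"N{k}" for k in range(1, n_neck + 1, 2)]    # N1, N3, ..., N79
    even = [f"N{k}" for k in range(2, n_neck + 1, 2)]   # N2, N4, ..., N80
    # Central 8 columns (0-based columns 4..11) of every neck row.
    central = [f"N{neck_channel(r, c)}"
               for r in range(NECK_ROWS) for c in range(4, 12)]
    return {
        "F-40": face,
        "NO-40": odd,
        "NC-40": central,
        "NE-40": even,
        "NA-80": neck,
        "FN-120": face + neck,
    }
```

Under this numbering, the NO-40 and NE-40 groups interleave across the whole neck, whereas NC-40 covers only the central half of the grid, which is the distinction the later group comparisons rely on.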

C. HD sEMG topographic energy maps
Firstly, the original sEMG data were filtered by a fourth-order Butterworth band-pass filter with cut-off frequencies of 30 and 500 Hz to attenuate low-frequency baseline wander and high-frequency noise. In addition, a custom notch filter was utilized to reduce the power-line interference at 50 Hz and its harmonic frequencies. Then, the HD sEMG signal of each channel was segmented into analysis windows (length of 250 ms), and the root mean square (RMS) of the recordings was computed for each window. The RMS values were then normalized (NRMS) across all channels of electrodes by using the maximum and minimum RMS of the HD sEMG recordings. Afterward, a sequence of topographic energy maps was constructed from the NRMS values for visualizing and evaluating the contraction patterns of the facial and neck muscles during the speaking tasks.
The RMS was calculated for each analysis window to obtain the average energy distribution of the muscular activities as follows:

$$R\{v[m]\} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} v^{2}[i]} \qquad (1)$$

where R{v[m]} is the RMS value of the sEMG signals for each analysis window, v[i] is the i-th sample in the analysis window, and m is the total number of samples in the window.
The normalized RMS values were denoted by NR as follows:

$$NR(i) = \frac{R(i,j) - \min(R)}{\max(R) - \min(R)} \qquad (2)$$

where NR(i) is the normalized RMS value of sEMG signals in channel i, R(i, j) is the RMS value of channel i in analysis window j, min(R) is the minimum RMS value of channel i, and max(R) is the maximum RMS value of channel i.
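As an illustration of the windowed RMS and the min-max normalization, a minimal sketch is given below. It is not the authors' Matlab code. Two assumptions are made and marked in the comments: the 250 ms windows are taken as non-overlapping, and the extrema are taken over all channels and windows (the text describes the normalization both as across all channels and as per channel, so this is only one reading). The function names are hypothetical.

```python
import math

FS = 2048                 # sampling rate in Hz, as stated for the acquisition
WIN = int(0.250 * FS)     # 250 ms analysis window -> 512 samples


def window_rms(samples, win=WIN):
    """RMS of each consecutive window (non-overlapping windows are assumed)."""
    return [
        math.sqrt(sum(v * v for v in samples[start:start + win]) / win)
        for start in range(0, len(samples) - win + 1, win)
    ]


def normalize_rms(rms_by_channel):
    """Min-max normalize RMS values to [0, 1].

    Assumption: the minimum and maximum are taken over all channels and
    windows, so that intensities are comparable across the energy map.
    """
    values = [r for series in rms_by_channel.values() for r in series]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:  # flat input: avoid division by zero
        return {ch: [0.0] * len(s) for ch, s in rms_by_channel.items()}
    return {ch: [(r - lo) / span for r in s]
            for ch, s in rms_by_channel.items()}
```

Each list returned by `normalize_rms` holds one value per temporal frame for a channel, so that the values of all channels at a given frame index form one topographic map.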

D. Feature extraction and word classification
Then, features were extracted from the HD sEMG signals to capture the information needed to recognize the intended speech tasks. The filtered HD sEMG signals containing all the 28 repetitions were manually sliced for each digit, and only the sEMG signals corresponding to the audible speaking process were retained to form the activity data. Afterward, the activity data containing the 28 repetitions of the same digit were partitioned into sEMG segments by a 400-point (almost 200 ms) sliding window with a 200-point increment for the computation of the sEMG features. Signal features in the time domain (TD), frequency domain (FD), and time-frequency domain (TFD) are used in sEMG-based pattern recognition. Among these, the TD features are used most frequently in sEMG classification due to their easy implementation, low computational complexity, and satisfactory performance [37][38][39][40][41][42]. Moreover, Hudgins's feature set, including the Mean Absolute Value (MAV), Waveform Length (WL), Zero Crossing (ZC), and Slope Sign Change (SSC), could comprehensively reflect the temporal and spectral properties of sEMG signals [10], and it has therefore been widely used in studies on prosthesis control and muscle-computer interfaces [43][44][45][46][47][48]. Thus, in this study, these four time-domain features (MAV, WL, ZC, and SSC) were extracted from the preprocessed sEMG signals for English and Chinese word classification, and the formulas used to compute them are shown in Table II.

[Table II: formulas of the MAV, WL, ZC, and SSC features [37,49]; x_k represents the EMG signal in a segment k and n denotes the length of the EMG signals.]
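Because the formulas of Table II are not reproduced here, the sketch below uses the commonly cited definitions of the four Hudgins features together with the 400-point window and 200-point increment described above. It should be read as an illustration under assumptions rather than as the exact formulas of the study: in particular, the noise threshold `thr` used for ZC and SSC (zero by default) is an assumption, and the function names are hypothetical.

```python
def mav(x):
    """Mean Absolute Value of one segment."""
    return sum(abs(v) for v in x) / len(x)


def wl(x):
    """Waveform Length: cumulative absolute difference of adjacent samples."""
    return sum(abs(x[k + 1] - x[k]) for k in range(len(x) - 1))


def zc(x, thr=0.0):
    """Zero Crossings: sign changes whose amplitude step is at least thr."""
    return sum(
        1 for k in range(len(x) - 1)
        if x[k] * x[k + 1] < 0 and abs(x[k] - x[k + 1]) >= thr
    )


def ssc(x, thr=0.0):
    """Slope Sign Changes: local extrema whose slope step is at least thr."""
    return sum(
        1 for k in range(1, len(x) - 1)
        if (x[k] - x[k - 1]) * (x[k] - x[k + 1]) > 0
        and (abs(x[k] - x[k - 1]) >= thr or abs(x[k] - x[k + 1]) >= thr)
    )


def sliding_windows(x, size=400, step=200):
    """Split a signal into overlapping segments (400 points, 200 increment)."""
    return [x[s:s + size] for s in range(0, len(x) - size + 1, step)]


def extract_features(x, size=400, step=200):
    """Return one [MAV, WL, ZC, SSC] vector per segment of a single channel."""
    return [[mav(w), wl(w), zc(w), ssc(w)]
            for w in sliding_windows(x, size, step)]
```

For a multi-channel recording, the four values of every channel would be concatenated per segment, so that a group of 40 sensors yields a 160-dimensional feature vector for each window.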
Then, 5-fold cross-validation was employed to split the matrix of extracted features and the corresponding targets into training and testing sets. These sets were subsequently fed into the linear discriminant analysis (LDA) classifier for recognizing the speech patterns inherent in the extracted sEMG features. Classification accuracy is one of the most popular metrics in various pattern recognition applications including speech recognition. It is also the simplest measure for evaluating predicted labels against the ground truth. It is essential for the accurate realization of a user's intent, and directly presents the recognition results of the speaking tasks. Thus, classification accuracy was considered as our core metric for evaluating the contributions of different articulatory muscles in speech recognition [50]:

$$Acc = \frac{N_{cor}}{N_{test}} \times 100\% \qquad (3)$$

where Acc is the classification accuracy, Ncor is the number of correctly classified samples, and Ntest is the total number of testing samples.
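The evaluation protocol can be sketched in a few lines. The example below only covers the accuracy metric and the 5-fold partition; the LDA classifier itself is not implemented. Shuffling the samples with a fixed seed before splitting is an assumption, since the text does not state how the folds were formed, and the function names are hypothetical.

```python
import random


def accuracy(predicted, actual):
    """Classification accuracy in percent: correct samples / test samples."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Assumption: samples are shuffled once with a fixed seed, and every
    sample is used for testing exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

In use, a classifier would be trained on the `train` indices of each fold and scored with `accuracy` on the `test` indices, and the five scores would then be averaged.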

E. Sensor optimization analysis
In this study, the optimal sensor number was also calculated, and the distribution of the optimal sEMG sensors was analyzed to compare the contributions of different muscles to speech recognition. The sequential forward selection (SFS) algorithm, which automatically selects the subset of features most relevant to the problem, was employed to calculate the optimal sensor number for a given classification accuracy. The SFS algorithm is easy to implement and performs well in various data dimension reduction settings [51,52]. The SFS algorithm started with an empty feature set, and the single channel with the highest classification accuracy was selected among all the 120 channels. Subsequently, the channel yielding the largest accuracy increment was added at each step until the target classification accuracy was reached, as shown in Fig. 3. Given that the optimal channel set {(i-1)Sch}, containing a total of (i-1) channels, had already been selected in the (i-1)th iteration, each channel EMGj among the remaining sensors was combined in turn with {(i-1)Sch} in the i-th iteration, and the channel EMG* giving the highest classification accuracy was identified once all the remaining channels had been tested:

$$EMG^{*} = \arg\max_{EMG_j} Acc\left(\{(i-1)S_{ch} + EMG_j\}\right) \qquad (4)$$

Accordingly, the set {(i-1)Sch + EMG*} was selected as the i-th optimal channel set {(i)Sch}:

$$\{(i)S_{ch}\} = \{(i-1)S_{ch} + EMG^{*}\} \qquad (5)$$
In this study, different numbers of optimal sEMG sensors, involving 5 channels (5-ch), 10 channels (10-ch), 15 channels (15-ch), 20 channels (20-ch), 25 channels (25-ch), and 30 channels (30-ch), were selected from the total 120 sEMG sensors by using the SFS algorithm, respectively. Then, the location of these optimally selected sensors was analyzed according to their distribution and the sensor number from different groups of muscles was counted separately.
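The greedy search described in this subsection can be sketched as follows. This is a generic illustration of sequential forward selection rather than the authors' Matlab implementation: the `score` callable is a hypothetical stand-in for the cross-validated LDA accuracy of a channel subset, and the two stopping rules (a fixed number of channels or a target accuracy) mirror the two uses mentioned in the text.

```python
def sequential_forward_selection(channels, score, n_select=None, target=None):
    """Greedily add the channel that maximizes score(selected + [channel]).

    channels: iterable of channel labels to choose from.
    score:    callable taking a list of channels and returning an accuracy
              (hypothetical stand-in for the cross-validated classifier).
    n_select: stop once this many channels have been selected (optional).
    target:   stop once the best score reaches this value (optional).
    Returns (selected_channels, score_after_each_step).
    """
    selected, history = [], []
    remaining = list(channels)
    while remaining:
        best_channel, best_score = None, None
        for candidate in remaining:          # test every remaining channel
            s = score(selected + [candidate])
            if best_score is None or s > best_score:
                best_channel, best_score = candidate, s
        selected.append(best_channel)        # keep the best one of this step
        remaining.remove(best_channel)
        history.append(best_score)
        if n_select is not None and len(selected) >= n_select:
            break
        if target is not None and best_score >= target:
            break
    return selected, history
```

Selecting k channels out of 120 in this way requires on the order of 120 × k classifier evaluations, which is what makes the search tractable compared with testing every possible subset. The price is that the result is greedy and not guaranteed to be the globally best subset.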
One-way ANOVA was performed to analyze the effects of different sensor groups on the classification accuracies for Chinese and English speech recognition, respectively. Meanwhile, the distribution of the optimal sEMG sensors was also compared among different sensor groups to evaluate the contribution of different muscles to different speech recognition tasks. Statistical significance was assessed by comparing the p-value with a significance level of 0.05. In this study, all the analyses of the offline HD sEMG data, such as digital filtering, feature extraction, SFS algorithms, and pattern recognition, were implemented in the Matlab software platform (MathWorks, Natick, MA, USA).
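For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed directly from the per-group accuracies. The sketch below is an illustration only and is not the Matlab routine used in the study. It stops at the F statistic and its degrees of freedom; obtaining the p-value additionally requires the cumulative F distribution from a statistics library. The function name is hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups.

    Each group is a list of observations, for example the per-subject
    classification accuracies obtained with one sensor group.
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F value indicates that the accuracy differences between sensor groups are large relative to the variability across subjects within each group.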

A. HD sEMG topographic energy maps for the entire speaking process
In this study, the dynamic HD sEMG topographic energy maps, which could demonstrate the energy distribution of the articulatory muscular activities when the subject was speaking, were constructed from the sEMG signals and a typical example was shown in Fig. 4, where high energy intensity was represented by red color. The entire speaking process was segmented into six temporal frames (frame 1 to frame 6) for exhibiting the dynamic activities of the facial and neck muscles when the subject was speaking the English words "zero" and "one", respectively.
Before the subject started to speak the word "zero", the energy map remained at low intensity on both the face and neck regions in frame 1, as shown in Fig. 4(a). In frame 2, a high-energy area appeared at the bottom center of the neck, indicating the beginning of the word. Then the energy concentration area started to move upward, and the maximum muscular activities were observed in the middle of the facial region in frame 3, with EMG activities diminishing steadily with distance from the mouth. Afterward, the region with maximum muscular activities traveled downward, back to the lower edge of the neck region, while the activities of the facial muscles decreased to a low intensity in frame 4. Thereafter, the intensity of the high-energy area on the neck gradually declined in frame 5, and finally disappeared in frame 6 when the speaking task was completed.
In contrast, the HD sEMG topographic energy map in Fig. 4(b) demonstrated a significantly different pattern when the subject spoke the word "one". Unlike Fig. 4(a), in which the energy concentration area traveled back and forth between the face and the neck, the EMG activities of the word "one" showed a briefer and simpler pattern. In Fig. 4(b), noticeable muscular activities were first observed in frame 2 over the facial muscles around the mouth region. Then the intensity of the facial muscular activities considerably increased in frame 3, and the active region spread downward to the center of the neck. After that, the intensity of the active areas significantly decreased in frame 4, with some residual energy distributed along the mouth region. From frame 5 to frame 6, no apparent muscular activities were observed on either the face or the neck region.
Additionally, when comparing the two speaking tasks in Fig. 4(a) and 4(b), it was observed that the energy maps showed approximately symmetric left-and-right distributions for both the face and neck muscles during the whole speaking process.

B. Averaged HD sEMG topographic energy maps for different words
To compare the HD sEMG topographic energy maps among different speaking tasks, all the temporal frames (Fig. 4) during the speaking process were averaged for each digit, and the averaged energy maps of the 10 different words are shown in Fig. 5. It was observed that the EMG activities of the facial muscles were mainly located around the mouth region, while those of the neck muscles appeared at the center of the neck across all the ten speaking tasks. Nevertheless, evident differences were also observed among different speaking tasks. While rather high intensities of muscular activities were observed on the neck region for the words "four", "five" and "seven", significantly lower amplitudes of the neck energy distribution were seen for other words such as "two", "three", "six" and "nine". For the neck region, the area with the highest energy tended to be located at the lower portion for most of the speaking tasks. Moreover, it was observed that the muscular activities showed coarse left/right symmetry for the neck region. However, significant differences between the left and right sides could be observed in the facial region, especially for the words "zero", "two" and "five". In addition, the facial areas with the highest energy tended to be distributed around the mouth, in the lower portion of the face.

Fig. 5. Typical HD sEMG topographic maps when speaking ten different English words: zero, one, two, three, four, five, six, seven, eight, and nine.

C. Comparison of classification accuracies among different sensor groups
To evaluate the performance of the speech recognition system among different sensor groups, the confusion matrices of classification accuracies were computed and compared for the F-40 and NO-40 sensor groups, as shown in Fig. 6. It was noted that the accuracy of the "rest" task attained 100% on the F-40 and NO-40 groups for both English and Chinese recognition. In Fig. 6(a), for the F-40 group, while the accuracy reached 91.3% for recognizing digit one, it dropped to around 67% in classifying digits eight and nine. Most of the accuracies were lower than 80% for the F-40 group. In contrast, for the NO-40 group, most of the English words had a recognition accuracy above 80%, with the only exception of digit seven. In Fig. 6(b), using the F-40 sensor group for Chinese speech recognition showed slightly higher overall accuracy than for the English recognition tasks, with half of the tasks having accuracies higher than 80%. For the NO-40 group, only one task (digit 6) had a classification accuracy of less than 80%, and the highest accuracy reached 94.5%.
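The per-digit accuracies read off a confusion matrix are the diagonal entries divided by the row totals. A minimal sketch is shown below; it assumes that rows correspond to the true class and columns to the predicted class, which is a common convention but is not stated in the text. The labels in the usage note are illustrative and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Count matrix with rows = true class and columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Percentage of each true class that was classified correctly."""
    return [
        100.0 * row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]
```

For example, with the labels `["rest", "one", "two"]`, a window labeled "one" but predicted as "two" adds a count to row 1, column 2, and lowers the accuracy reported for "one" only.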
To investigate the contributions of different muscle activities to sEMG-based speech recognition, the 120 HD sEMG sensors were grouped in six different ways based on their locations (Fig. 2): F-40, NC-40, NO-40, NE-40, NA-80, and FN-120. A typical example of the classification accuracy (averaged across digits) as a function of the sensor group is shown for both languages in Fig. 7. It was observed that the F-40 and NC-40 groups had the lowest averaged classification accuracies (as low as 77.92%) for both languages. With the same number of sensors, the NO-40 and NE-40 groups showed significantly better performance with a classification accuracy as high as 91.58%, and there were no significant differences between the two groups. When all the 80 neck sensors were used for the recognition, the NA-80 group demonstrated the highest classification accuracy, up to 95.09%. Moreover, the Chinese recognition showed higher averaged classification accuracies and smaller standard deviations across all the sensor groups when compared with the English tasks, especially for the F-40 and NC-40 groups.
To further investigate the contributions of different regions of muscles, the classification accuracies averaged across all the different digits and subjects were compared among all the six different sensor groups (F-40, NC-40, NO-40, NE-40, NA-80, and FN-120) for both English and Chinese, as shown in Fig. 8. It was observed that the classification accuracy of the F-40 group was the lowest for both English (76.9%) and Chinese (81.11%) recognition, with the NC-40 group having slightly better performance. The NO-40 and NE-40 groups showed considerably higher accuracies than the F-40 and NC-40 groups, and there were no significant differences between the NO-40 and NE-40 groups. A further increase in the sensor number, as in the NA-80 and FN-120 groups, led to additional improvement in the speech classification, with the highest accuracy up to 96.54%.
It was also observed that the accuracies for English recognition were slightly lower than those for Chinese recognition across all the sensor groups, especially for the F-40 group.

D. Distribution of optimal sensors for different classification accuracies
To further localize the subset of HD sEMG sensors that contributed most to speech recognition and to reduce the sensor number for practical sEMG-based applications, the SFS algorithm was applied to automatically find the optimal channels by searching all the 120 sEMG sensors. Then the number of optimal channels that came from each of three sensor groups (F-40, NO-40, and NE-40) was counted, and the distribution of the optimal channels among the three sensor groups is illustrated in Fig. 9. As shown in Fig. 9(a), as the optimal channel number increased from 5 to 30, the corresponding classification accuracy improved from 74.11% to 94.9% for the English recognition tasks. It was also observed that far fewer optimal channels were selected from the facial muscles than from the neck muscles. For instance, for an optimal channel number of 5, only one optimal channel was selected from the F-40 group, while two channels were selected from each of the NO-40 and NE-40 groups. Notably, when the optimal channel number increased, significantly more optimal channels came from the neck region (either the NO-40 or the NE-40 group) than from the face region. Similar patterns of the optimal channel distribution were also observed for the Chinese recognition tasks in Fig. 9(b), with significantly more channels coming from the neck muscles. Moreover, the Chinese recognition tasks seemed to have slightly more optimal sensors coming from the facial muscles (F-40 group) when compared with the English speech recognition.
To further investigate the contributions of the facial and neck muscles to speech recognition, the numbers of sEMG sensors optimally selected by the SFS algorithm were statistically analyzed across all the enrolled subjects, and the averaged optimal channel numbers coming from the three different sensor groups (F-40, NO-40, and NE-40) are shown in Fig. 10. It was observed from Fig. 10(a) that the classification performance showed substantial improvement (from 64.66% to 90.85%) when the optimal channel number increased from 5 to 30 for English speech recognition. The optimal channel number from the F-40 group was significantly lower than that from either the NO-40 or the NE-40 group, and there was no significant difference between the two groups of the neck region. Similar observations were found for the distribution pattern of the optimal channels for Chinese speech recognition in Fig. 10(b). It was noteworthy that the average classification accuracies for Chinese recognition were systematically higher than those for English recognition for the same optimally selected channel number.

E. Contributions of different muscle groups with the increasing class number
To further examine the overall performance of our system, we combined the ten English and ten Chinese speaking tasks into a new set of 20 speech tasks and compared the classification accuracies across the sensor groups (F-40, NE-40, and NO-40) as the class number increased from 1 to 20. As shown in Fig. 11, the classification accuracies of the F-40 group remained the lowest of the three groups. The classification accuracy decreased consistently as the number of speech classes increased, regardless of the sensor group, but the rate of decline differed markedly among the groups. When the class number increased from 1 to 20, the classification accuracy dropped from 100% to 73.76% for the F-40 group, to 85.1% for the NO-40 group, and to 87.67% for the NE-40 group. The classification accuracies of the NO-40 and NE-40 groups were significantly higher than that of the F-40 group when recognizing 20 words, consistent with the findings from classifying 10 English or Chinese words. Moreover, the distribution of the optimally selected sensors among the F-40, NO-40, and NE-40 groups was calculated and compared when recognizing 20 speaking classes, as shown in Fig. 12. When the optimal channel number increased from 5 to 30, the corresponding classification accuracy increased from 70.94% to 93.87% for recognizing 20 speech classes. Further examination of the sensor distribution showed that the optimally selected sensors were mainly distributed on the neck muscles (either the NO-40 or the NE-40 group) rather than on the face muscles (F-40). For example, among the 5 optimally selected channels, only one sensor came from the F-40 group. As the optimal channel number increased, the number of optimal sensors from the F-40 group was always smaller than that from the NO-40 or NE-40 group, similar to the findings for the 10 English or Chinese word classes.
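The difference in decline among the groups can be made explicit with a little arithmetic. The snippet below is only an illustrative calculation using the accuracies quoted above for 1 class (100%) and 20 classes; the variable names are hypothetical and not part of the original analysis.

```python
# Accuracies (%) at 20 classes, taken from the text above.
acc_at_20 = {"F-40": 73.76, "NO-40": 85.1, "NE-40": 87.67}
ACC_AT_1 = 100.0  # accuracy (%) reported for a single class

# Absolute drop in percentage points when going from 1 to 20 classes.
drop = {group: round(ACC_AT_1 - acc, 2) for group, acc in acc_at_20.items()}

for group, points in drop.items():
    print(f"{group}: -{points} percentage points")
```

On these numbers the facial group loses roughly twice as many percentage points as either neck group.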

IV. DISCUSSION
The sEMG-based ASR is a technique that converts speaking activities into a textual representation using the sEMG signals recorded by sensors placed over the articulatory muscles. The principal objective of this study was to examine the contributions of different articulatory muscles to sEMG-based ASR, in order to provide practical guidelines for sEMG sensor placement. This was achieved by using the HD sEMG signals recorded from the facial and neck muscles while subjects spoke ten digits in English and in Chinese.
The study showed that the energy maps calculated from the HD sEMG signals could help to visualize the dynamic energy distribution of the muscular activities during the speaking process (Fig. 4) and provide physiological clues to identify different word pronunciations (Fig. 5). The HD sEMG topographic energy maps are attributed to the vocal cord vibration and mouth movement during the physiological process of speaking [53,54]. The dynamic spatiotemporal patterns in normal subjects (Fig. 4 and 5) could illustrate the characteristics of a normal speaking process and, therefore, could serve as a reference for assessing articulatory muscle activities. Meanwhile, the placement of the electrodes used for evaluating speaking functions should follow the myoelectrical characteristics of the speaking activities. Based on the results of this study, the electrodes located in the center of the neck or close to the mouth picked up the largest sEMG amplitudes, and they are therefore important for providing the most reliable information for speaking assessment. These findings suggest that the HD sEMG topographic energy maps could be used as a tool for finding proper sensor placements in speech-related research, such as speech recognition or the evaluation of phonation function. It should also be noted that there were significant individual differences in the classification accuracies when using the same group of sensors (Fig. 7), which might result from the different speaking styles or habits of different individuals or languages. For this reason, this study aims to obtain an individual-independent, general understanding of the contributions of different articulatory muscles and thereby provide practical guidelines for sensor placement that are applicable to all individuals.
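As a rough illustration of how one frame of such an energy map might be formed, the sketch below computes a per-channel energy value over a short window and arranges the values on a sensor grid. It rests on two assumptions that are not stated in this section: that the energy is the mean squared amplitude of the window, and that the channels are ordered row by row on the grid. The function name `energy_map` and the grid dimensions are hypothetical.

```python
from typing import List, Sequence


def energy_map(window: Sequence[Sequence[float]],
               n_rows: int, n_cols: int) -> List[List[float]]:
    """Return an n_rows x n_cols grid of per-channel energies.

    window[c] holds the samples of channel c within one time window.
    Energy is ASSUMED to be the mean squared amplitude, and channels are
    ASSUMED to be laid out row-major on the grid.
    """
    if len(window) != n_rows * n_cols:
        raise ValueError("channel count does not match the grid size")
    energies = [sum(s * s for s in ch) / len(ch) for ch in window]
    return [energies[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Toy 2 x 2 example with two samples per channel.
toy_window = [[1.0, -1.0], [2.0, 2.0], [0.0, 0.0], [3.0, -3.0]]
toy_map = energy_map(toy_window, n_rows=2, n_cols=2)
```

Computing such a grid for successive windows gives a sequence of frames that can be displayed as a dynamic map.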
The HD sEMG technique with multi-channel sensors played an important role in this study because it covers all the small articulatory muscles with high spatial resolution and provides full information about the muscular activities during the speaking process. Most previous sEMG-based ASR investigations depended on sEMG signals recorded from a small number of sensors whose positions were chosen empirically without quantitative analysis, such as five facial sensors for Thai word recognition [11], eight face and neck sensors for English silent speech recognition [10], and ten face and neck surface sensors for Chinese silent speech classification [13]. However, a small number of sEMG sensors placed by experience might fail to cover important muscles and miss major electrical activities that are essential for speech recognition. For example, placing all the sEMG sensors along the edges of the face or the neck region (Fig. 1) may miss the large-amplitude muscular activities in the middle regions and yield low-amplitude sEMG signals containing little information about the speaking activities (Fig. 4 and 5), which would degrade the classification performance. The HD sEMG technique used a total of 120 sEMG sensors, which are enough to cover all the face and neck muscles, so that no important information about the muscular activities was missed and the contributions of different muscles could be investigated thoroughly.
In this study, the 120 HD sEMG sensors were divided into six different groups based on their locations to assess the contributions of different articulatory muscles to sEMG-based ASR systems. The results in Fig. 7 and 8 showed that the facial sensor group (F-40) had significantly lower classification accuracies than any of the neck sensor groups (NC-40, NO-40, NE-40), despite having the same number of channels. Meanwhile, the results in Fig. 11 showed that the classification accuracy of the facial group F-40 was the lowest compared with the sensor groups on the neck (NO-40 and NE-40) as the class number of speaking tasks increased from one to twenty. These results indicate that the neck muscles are the main contributor to satisfactory speech recognition performance. The findings confirm that the placement of the sensors greatly affects the classification rate of the ASR system, and that the neck muscles play a more important role in speech recognition than the facial muscles. This may be explained by the physiological fact that more articulatory muscles are distributed along the neck region and more of these muscles are activated during speech production [27]. The insignificant differences between the NO-40 and NE-40 groups may arise because the two groups are interlaced, with very closely spaced neighboring columns, and therefore cover nearly the same information sources. The finding that the NC-40 group had significantly lower classification accuracy than either the NO-40 or the NE-40 group suggests that the sEMG sensors should cover larger areas to achieve better classification performance, which is consistent with the findings of our previous studies on sEMG-based speech recognition [55,56]. These findings may provide useful recommendations about sensor placement in the routine practice of sEMG-based speech recognition.
Considering that the HD sEMG signals could be redundant and that placing as many as 120 sEMG sensors is time-consuming, the SFS algorithm was used to automatically select the optimal channels with the highest classification accuracy so that the sensor number could be greatly reduced. The results in Fig. 9 showed that the classification accuracy increased sharply with the optimal channel number and reached about 90% with only 15 optimally selected sensors. Further analysis of where the selected sensors originated showed that significantly more optimal sensors came from the neck sensor groups (either NO-40 or NE-40) than from the facial sensor group (F-40), although the sensor groups had the same number of channels. Similar findings were shown in Fig. 12, where the classification accuracy was 88.62% with only 15 optimally selected sensors when mixing all the English and Chinese words. These findings indicate that the neck muscles are a more significant contributor to sEMG-based speech recognition, which agrees with the findings in Fig. 7 and 8. This may be explained by the fact that speech is generated by the quasi-periodic vibration of the vocal cords located within the larynx, which are mainly controlled by the articulatory muscles around the neck. The results of our study suggest that instead of placing an equal number of sensors on the face and neck, it may be better practice to place more sensors along the neck region to further improve the classification performance. Approaches other than the SFS algorithm could also be used to further reduce the number of sensor channels, so that wearable sEMG-based speech recognition systems or devices could be developed by placing only a few electrodes at the optimal locations. Acoustic and inertial sensors could also be employed in future studies so that the information from different types of sensors could be fused to further reduce the number of channels and improve the performance of sEMG-based speech recognition.
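The greedy channel selection discussed above can be sketched as follows. This is a minimal illustration of sequential forward selection, not the implementation used in the study: the scorer is a simple nearest-centroid rule evaluated on the training data, swapped in for the LDA classifier so that the example needs only the Python standard library. The function names and the toy data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Features = Sequence[Sequence[float]]  # one row per trial, one value per channel


def nearest_centroid_accuracy(X: Features, y: Sequence[int],
                              channels: List[int]) -> float:
    """Training-set accuracy of a nearest-centroid rule on `channels`.

    Stand-in scorer for illustration only (the study used LDA).
    """
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(channels))
        for k, ch in enumerate(channels):
            acc[k] += row[ch]
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row: Sequence[float]) -> int:
        # Pick the class whose centroid is closest in squared distance.
        return min(centroids, key=lambda c: sum(
            (row[ch] - m) ** 2 for ch, m in zip(channels, centroids[c])))

    return sum(predict(r) == l for r, l in zip(X, y)) / len(y)


def sequential_forward_selection(
        X: Features, y: Sequence[int], n_select: int,
        score: Callable[[Features, Sequence[int], List[int]], float]
        = nearest_centroid_accuracy) -> Tuple[List[int], List[float]]:
    """Greedily add the channel that maximizes `score` at each step."""
    selected: List[int] = []
    history: List[float] = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda ch: score(X, y, selected + [ch]))
        selected.append(best)
        remaining.remove(best)
        history.append(score(X, y, selected))
    return selected, history


# Toy data: 4 channels, only channel 2 separates the two classes.
X_toy = [[0.5, 0.1, 0.0, 0.3],
         [0.4, 0.2, 0.1, 0.3],
         [0.5, 0.2, 1.0, 0.3],
         [0.4, 0.1, 0.9, 0.3]]
y_toy = [0, 0, 1, 1]
selected, history = sequential_forward_selection(X_toy, y_toy, n_select=2)
```

In practice the score would be a cross-validated accuracy rather than a training-set accuracy, and counting how many of the selected channel indices fall in each sensor group gives the kind of origin analysis described above.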
Language is also an essential factor in speech recognition, and many sEMG-based speech recognition systems have been developed in previous studies for different languages, such as English [14,15], Chinese [16,17], Japanese [20], Portuguese [57], Spanish [22], and Arabic [58]. In this study, the performance of sEMG-based speech recognition was systematically compared between English and Chinese under different conditions. The results in Fig. 8, 9, and 10 indicated that the classification accuracies of Chinese speech recognition were slightly higher than those of English recognition, regardless of the sensor groups or the optimal sensor channels. The slightly superior performance of Chinese speech recognition may be attributed to the fact that all the recruited subjects were native Chinese speakers and were more fluent in Chinese. It was also observed that when using only the facial sensors (F-40) or only the neck sensors (NO-40 or NE-40), English recognition showed significantly lower classification accuracies than Chinese recognition, indicating that the English-speaking task may rely more heavily on the coordination between the facial and neck muscles. However, only ten digits were employed in the experiments of this study; more words or phonemes should be included in future studies to further investigate the differences between English and Chinese speech recognition.

V. CONCLUSION
In this study, multi-channel sEMG sensors (120 channels) were placed on the facial and neck muscles with high spatial resolution, and the recorded HD sEMG signals were used for automatic speech recognition of English and Chinese digits. The energy maps calculated from the HD sEMG signals showed that the muscular activities at different locations exhibited distinct patterns during the speaking process, and the maps could help to visualize the dynamic energy distribution of the articulatory muscular activities. The classification accuracies when using only the sensors on the face were significantly lower than those when using the sensors on the neck, despite the same number of channels. The optimal sensors automatically selected by the sequential forward selection algorithm were mainly distributed over the muscles of the neck rather than the face. The classification accuracies of Chinese speech recognition were slightly higher than those of English recognition, regardless of the sensor groups or the optimal channel number. The findings of this study showed that multi-channel sEMG sensors are useful for comprehensively studying the muscular activation patterns during speaking, and that the muscles on the neck are the main contributor to satisfactory classification performance. This study could provide valuable clues for the development of a practical sEMG-based speech recognition system, especially for patients with speaking disorders.

VI. ACKNOWLEDGMENTS
We would like to thank all the members of our research laboratory at the Research Center for Neural Engineering, Institute of Advanced Integration Technology, Shenzhen Institutes of Advanced Technology, for their support and assistance in conducting the experiments and signal processing. The work was also supported by the Shenzhen Institute of Artificial Intelligence and Robotics for Society.