Optimizing Arabic Speech Distinctive Phonetic Features and Phoneme Recognition Using Genetic Algorithm

Distinctive phonetic features play an important role in Arabic speech phoneme recognition. In a given language, distinctive phonetic features are extrapolated from acoustic features using different methods. However, exploiting a lengthy acoustic features vector for phoneme recognition carries a huge cost in terms of computational complexity, which in turn affects real-time applications. The aim of this work is to consider methods to reduce the size of the features vector employed for distinctive phonetic feature and phoneme recognition. The objective is to select the relevant input features that contribute to the speech recognition process. This, in turn, leads to a reduced computational complexity of the recognition algorithm and an improved recognition accuracy. In the proposed approach, a genetic algorithm is used to perform optimal features selection. To this end, a baseline model based on feedforward neural networks is first built. This model is used to benchmark the results of the proposed features selection method against a method that employs all elements of a features vector. Experimental results, utilizing the King Abdulaziz City for Science and Technology Arabic Phonetic Database, show that the average genetic algorithm based phoneme overall recognition accuracy remains slightly higher than that of the recognition method employing the full-length features vector. The genetic algorithm based distinctive phonetic features recognition method has achieved a 50% reduction in the dimension of the input vector while obtaining a recognition accuracy of 90%. Moreover, the results of the proposed method are validated using the Wilcoxon signed rank test.

background noise. Some features that are used in ASR systems are acoustic features such as the spectrogram, mel-frequency cepstral coefficients (MFCCs), and short-time energy, just to name a few. There are also other types of features that are highly representative, namely the distinctive phonetic features (DPFs). These features are introduced to a system as binary vectors, where each bit of the vector describes the presence or absence (denoted as + or -, respectively) of some articulatory and acoustic properties that are associated with a particular phoneme utterance. DPFs are language dependent, and each spoken language has its own finite set of DPFs, where a unique binary vector is assigned to each phoneme of the language [1]. Theoretically, the DPFs can describe all phonemes with uniquely distinctive binary patterns. DPF elements (bits) can also be very useful in categorizing phonemes based on the similarity among them, which can be directly traced by matching the DPF vectors of the different phonemes [2].
For example, the phonemes /s/ and /z/ are very close to each other in the DPF space. Both phonemes share the same phonetic features (e.g., consonant, fricative, alveodental, etc.) except for the voicing feature, which refers to the physiological activity of the vocal folds, for which this pair of phonemes shows contrary values. That is, the vocal folds must vibrate in order to vocalize /z/, otherwise a pure /s/ will be uttered [4]. The DPF elements in modern standard Arabic, as agreed upon by most references, are listed in Table 1 [1].
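The /s/–/z/ relationship above can be made concrete with a minimal sketch. The feature inventory and bit assignments here are illustrative (a real Arabic DPF vector has many more elements, per Table 1); the point is that two phonemes sharing all features except voicing sit at Hamming distance 1 in the DPF space.

```python
# Hypothetical 4-element DPF inventory; real vectors are much longer.
FEATURES = ["consonant", "fricative", "alveodental", "voiced"]

DPF = {
    "/s/": (1, 1, 1, 0),  # voiceless alveodental fricative
    "/z/": (1, 1, 1, 1),  # voiced counterpart
}

def hamming(a, b):
    """Number of DPF elements in which two phonemes disagree."""
    return sum(x != y for x, y in zip(a, b))

print(hamming(DPF["/s/"], DPF["/z/"]))  # -> 1: the pair differs only in voicing
```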

A. ARABIC LANGUAGE OVERVIEW
Modern standard Arabic (MSA) has 34 phonemes: three short vowels /a, i, u/, three long vowels /a:, i:, u:/, and 28 consonants that are grouped under a number of subcategories such as plosives, affricates, nasals, trills, etc. There are two subcategories of Arabic phonemes that are not found in many languages: the pharyngeal and the emphatic phonemes [4]. The duration of phonemes in Arabic is phonemic. That is, phonemes (vowels and consonants) can be uttered over short or long periods, where both ways directly affect the word meaning [5].
Words in Arabic consist of syllables, where each syllable must have at least one vowel. Therefore, a word would have as many syllables as there are vowels in that word [6].

B. LITERATURE REVIEW
Extracting DPFs has been tackled in many published studies. In [7], DPFs are extracted using a multilayer perceptron (MLP), and have been demonstrated to enhance the robustness of ASR. In [8], a canonicalization process composed of multiple DPF extractors was proposed in order to neutralize the effect of the speaker's gender on ASR system robustness. Similarly, in [9], multiple DPF extractors were deployed to eliminate the effect of hidden factors and to reduce the effect of noise. Extending this, in [10], the DPF extractors are utilized to neutralize hidden factors of speakers' variability, in addition to gender, and to eliminate the effect of noise. In [11], a DPF extractor was proposed to enhance the accuracy of speech segmentation, using a recurrent neural network (RNN) followed by an MLP neural network. In [12], a phoneme recognition system was proposed consisting of a two-stage DPF extraction: the first stage converts acoustic features to a 45-bit DPF vector, while the second stage makes the vectors orthogonal before they are fed to a hidden Markov model (HMM) classifier. The work in [13] proposed the use of recurrent neural networks to detect phonological features in continuous speech.
Articulatory features (AFs) are utilized in [14] to develop pronunciation models for ASR systems. In [15], the articulatory features are investigated with respect to monolingual, cross-lingual, and multilingual ASR. The work published in [16] is an attempt to develop a large-vocabulary ASR system utilizing the distinctive phonetic features instead of the ordinary short-term spectral features. DPF-based phone-level segmentation is reported in [17], where the system is built using recurrent neural networks and multilayer neural networks. In [3], a representation method is proposed such that a speech waveform is represented by some abstract linguistic descriptors from which a set of discriminative features is derived and fed to ASR systems. The work in [18] attempted to improve ASR performance by adopting a multi-stream technique of DPFs and spectral features. A noise-robust ASR system that applies logarithmic normal distributions of HMMs for the purpose of approximating DPF elements was proposed in [19]. The robustness of ASR under low-SNR car environments was investigated in [20], where DPFs along with spectral cues are utilized to enhance system robustness. Phoneme classification for the Bengali language using DPFs and a deep neural network was reported in [21]. In [22], a deep neural network is used to predict historical phonetic features drawing upon synchronic phonetic patterns arising from coarticulation and statistical constraints in the Proto-Indo-European language. In [23], acoustic features extracted from the speech signal using a Hamming window and a pre-emphasis filter, together with decompositional features extracted using a Daubechies-filtered 5th-depth wavelet packet transform (WPT), are optimized using a genetic algorithm to classify Turkish vowels.
The relevance of evolutionary-based algorithms, like a genetic algorithm, that belong to a family of search algorithms inspired by the process of evolution in nature, was demonstrated in a recent study showing that optimizing the topology of an Artificial Neural Network may lead to a high classification rate of spoken utterances of both native and non-native English speakers [24].

C. DISTINCTIVE PHONETIC FEATURES IN THE ARABIC LANGUAGE
The Arabic language has a number of unique characteristics, such as the presence of a relatively large number of pharyngeal and emphatic sounds, in addition to various types of allophones, many of which are the result of emphaticness and gemination. Arabic also has several lexical stress systems, likely unknown in other languages, but regrettably unstudied and in need of thorough investigation. In the context of the present investigation of DPFs in Arabic, only a limited number of previous studies dedicated to the subject are available. In [25], Arabic DPFs were extracted using modular connectionist architectures with rule-based systems (SARPH). In [26], Selouani et al. deployed neural networks of mixed architectures fed with continuous speech in order to recognize complex Arabic phonemes. In our previous works, the multidimensional phonological feature structure of Arabic was investigated by assessing the performance of statistical and connectionist approaches in performing the complex mappings between DPFs and associated acoustic cues [27]. In a review paper [28], a background on Arabic DPFs was presented, highlighting the historical and geographical varieties, the problem of ambiguous definitions between classical and modern phonology, and the deviations in phonemes and DPF elements across dialects of Arabic. In [29], an HMM was used with an original normalization technique to perform Arabic phoneme classification using the DPF elements, where DPFs were utilized to introduce a canonical process for phoneme-level classification by substituting the speech waveform with its binary DPF vector. In another work [30], the problem of DPF modeling and extraction for modern standard Arabic is tackled using deep neural networks (DNNs) and compared with the classical MLP models.
The representativeness of several acoustic cues for different DPF elements was measured, and proper evaluation measures that account for the imbalanced nature of the DPF elements were addressed. It is important to note that our previous work on DPF modeling using DNNs had the objective of acoustic-to-phonetic conversion, where Arabic DPFs were extracted from acoustic features using DNNs. However, input feature selection was not within the scope of that previous work. On the other hand, the present work has a different objective and scope, which is to come up with a unified reduced set of acoustic features that can be used to extract any DPF element using any machine learning technique.

D. MOTIVATION AND OBJECTIVES
The aim of this work is to consider methods to reduce the size of the features vector employed for phoneme recognition by using a genetic algorithm-based approach. The objective is to select the relevant input features so as to reduce the computational complexity of the recognition algorithm while improving the DPF recognition accuracy. Genetic algorithms (GAs) have been successfully integrated into various speech-processing applications such as speaker adaptation of acoustic models and speech enhancement [31]. GAs have also shown an advantage in enhancing the performance of voice communication systems [32]. The main advantage of using a GA to optimize the feature selection is its ability to extend the search space of best parameters by applying the principle of maintaining and manipulating a large population of solutions. The GA methodology implements a 'survival of the fittest' strategy in its search for better solutions. Thus, the original idea of this article is to use the ability of GAs to select the relevant acoustic features from speech. GAs and neural networks are very common and effective in processing digital speech, mainly in recognition and classification problems. Reducing speech acoustic features while keeping the nominal system accuracy is a very important goal that helps to reduce central processing unit (CPU) time and memory requirements. Hence, the main contribution of this work is to build a robust features selection model whose input is a wide range of multiple acoustic features. The proposed model is a hybrid of a genetic algorithm and a neural network used to predict distinctive phonetic features of phonemes in modern standard Arabic.

E. PAPER'S ORGANIZATION
After Section I, Introduction, the remainder of this article is organized as follows. Section II presents an introductory background on the genetic algorithm and gives an overview of the proposed GA-based DPF recognition method. Section III provides information about the dataset and features used in this study. Also, in this section, the extracted features are examined and preprocessed for the purpose of normalization and reducing the outliers. Section IV presents experimental results pertaining to the development of a baseline model based on feedforward neural networks (FF-NNs). This model is used to evaluate the phoneme recognition accuracy of the GA-based features selection method against the whole-feature-vector-based method. Details of the GA-based features selection method are given in this section. Section V presents and analyzes the performance of phoneme recognition based on DPF elements, while the discussion is presented in Section VI. Section VII gives the concluding remarks.

II. OVERVIEW ON THE GENETIC ALGORITHM BASED METHOD
This section presents the architecture of the proposed GA-based DPF recognition method and gives a brief introduction to the basics of the GA. Figure 1 shows the proposed architecture for phoneme classification using DPF elements and a GA for adaptive features selection. In this model, the dataset consists of N preprocessed features. The N-point features vector is applied to M GAs followed by M FF-NNs working in parallel, where M is the number of DPF elements. The output of each branch is one bit with value either '0' or '1', depending on the DPF element it represents.

A. SYSTEM MODEL
The features selected by each GA and the parameters of each FF-NN are determined through a training process. In the testing phase, a phoneme is identified by measuring the Euclidean distance between the outputs of the M FF-NNs, a vector of M bits, and the actual DPF vectors of all phonemes.
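The identification step described above can be sketched as a minimum-distance match: the M-bit output of the FF-NNs is compared against the stored DPF vector of every phoneme, and the phoneme with the smallest Euclidean distance wins. The phoneme labels and 3-bit vectors below are illustrative placeholders, not the paper's actual 30-element DPF table.

```python
import math

def identify(predicted, dpf_table):
    """Return the phoneme whose DPF vector is closest to the predicted bits."""
    def dist(vec):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(predicted, vec)))
    return min(dpf_table, key=lambda ph: dist(dpf_table[ph]))

# Toy DPF table (hypothetical phonemes and bits).
dpf_table = {"/s/": [1, 1, 0], "/z/": [1, 1, 1], "/a/": [0, 0, 0]}

print(identify([1, 1, 1], dpf_table))  # -> /z/ (exact match, distance 0)
```

Note that the predicted vector need not match any stored pattern exactly; the minimum-distance rule still returns the nearest phoneme.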

B. GENETIC ALGORITHM
GAs belong to a family of computational models, namely evolutionary algorithms, inspired by the process of natural evolution [33]. They have received increasing popularity due to their robustness and efficiency in solving complex problems in which many classical mathematical methods fail [34]. In particular, GAs work by executing five main steps [35].
• Coding. The parameters of a given problem are encoded, often using a binary string.
• Initiation of population. A set of randomly generated strings (chromosomes) are generated as candidate solutions.
• Evaluation of responses. A fitness measure is applied to each string (chromosome) to determine its chance of being selected for creating the next generation.
• Reproduction. It involves two steps: 1) selecting a set of strings from the previous population, and 2) generating a new population by combining parts of the selected strings (crossover operation).
• Mutation. It maintains genetic diversity from one generation of a population to the next. It alters one or more elements (genes) in a string (chromosome) from its initial state. In binary encoding, a gene of value '1' gets changed to '0' and vice versa.

The proposed genetic algorithm based feature selection method using a feedforward neural network is a metaheuristic method for dimensionality reduction. It has potential application in automatic speech recognition and its applications. For example, in [24], the authors used an evolutionary-algorithm-optimized deep neural network for recognition of diphthong vowel sounds in the English phonetic alphabet. In [36], the author tied a genetic algorithm with the Manhattan distance to classify plain and emphatic vowels in continuous Arabic speech. Also, in [37], the genetic algorithm was exploited in the segmentation of Arabic speech, and in [38], it was used with the K-nearest neighbour algorithm to build a voice command recognition system. The proposed GA-based optimization method makes it possible to retain the selected feature indices after they are found during the training process. The training process has to be applied only once over the corpus. Therefore, the reduced size of the features vector contributes to reducing the time cost of real-time applications. In addition, using a genetic algorithm can help in removing redundancy in the dataset under investigation [39].
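The five GA steps above can be sketched on a toy binary-coded problem (maximizing the number of '1' bits). The operator choices here, tournament selection, single-point crossover, and elitism, are illustrative defaults, not the exact configuration adopted later in the paper.

```python
import random

def run_ga(n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=1):
    rng = random.Random(seed)
    # 1) Coding + 2) population initiation: random binary strings.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    fitness = sum                            # 3) evaluation: count of '1' genes
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = [best[:]]                      # elitism: carry the best string over
        while len(nxt) < pop_size:
            # 4) reproduction: tournament selection + single-point crossover.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # 5) mutation: flip each gene with a small probability.
            child = [g ^ 1 if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=fitness)
    return best
```

With elitism, the best fitness never decreases, so a short run is enough to approach the all-ones optimum on this toy problem.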

III. DATASET AND FEATURES EXTRACTION
This section addresses a fundamental step in this work, pertaining to the preparation of data for the subsequent experiments. It is of great importance to provide the system with suitable data that carries rich phonetic information. Also, feature extraction is an essential preprocessing step, where representative acoustic features are extracted and prepared for training the acoustic-to-phonetic conversion models.

A. KAPD DATASET
The dataset used in this work is extracted from the KACST Arabic Phonetic Database (KAPD) [40], [41], as summarized in Table 2. KAPD is a phonetically rich speech corpus recorded by seven native Saudi male subjects. Each Arabic phoneme appears in a carrier word in one of three different positions: initial, middle, and final. Also, for each of these positions, three different carrier words exist, where the target phoneme co-articulates with one of the three short vowels (i.e., /a/, /i/, and /u/) of Arabic in each word. For a consonant phoneme in a middle position, the carrier word contains the phoneme in one of two states: single or geminated. Each one of the aforementioned combinations is uttered by each of the seven subjects in eight different experiments, each of which aims at capturing some physical characteristics of the speech signal, in addition to recording the uttered word. KAPD is composed of the following subsets: Subset A for aerodynamic data, Subset C for lip-labeled face images, Subset E for epiglottal imaging data, Subset G for electroglottographic measurement data, Subset N for nasal and oral air pressure measurements data, Subset P for electropalatal imaging data, Subset V for vocal folds imaging data, and Subset X for velopharyngeal imaging data. The experiments carried out in this work are based on random samples taken from the KAPD dataset. That is, 13,766 phonemes are used and split into two subsets: a training subset consisting of 9,636 phonemes (70%), and a test subset consisting of 4,130 phonemes (30%).

B. FEATURES EXTRACTION
The input acoustic features are extracted from each phoneme waveform. That is, a number of 15, evenly spaced, 20-ms long frames are sampled from each waveform. Spacing between frames varies from one waveform to another since phonemes are not all equal in time duration. Each frame is windowed by a 20-ms Hamming window. Other preprocessing steps of DC removal and pre-emphasis using α = 0.97 are also applied. The following acoustic features are extracted from each frame:
1) Spectrogram, where each frame yields a 256-point spectrum, giving a total of 3,840 points per input vector.
2) Mel-frequency Cepstral Coefficients (MFCCs), where each frame yields 39 coefficients, giving a total of 585 points per input vector.
3) Zero-crossing Rate (ZCR), where each frame yields one scalar value, bringing the total to 15 ZCR values per input vector.
4) Short-time Energy, where there is also one scalar value per frame, summing up to 15 values per input vector.
5) Voicing Percentage, which is one value representing the percentage of frames (among the 15 frames) that carry valid (nonzero) pitch values. There is only one percentage value in each input vector.
Therefore, the total length of the original features vector is 4,456 (= 3,840 + 585 + 15 + 15 + 1) points. In this study, only the first 15 MFCC coefficients are considered for each frame instead of all 39 coefficients. This is because the remaining 24 MFCC coefficients have been found to be of significantly small values. Selecting a large number of MFCC coefficients results in more complexity in the model [42]. Based on this modification, the number of MFCC coefficients for all 15 frames is now 15 × 15 = 225 points instead of 585 points, which brings the features vector used in our experiments down to 4,096 points.
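The dimensions above can be sanity-checked with a few lines of arithmetic. The 256-points-per-frame spectrum is inferred from the stated totals (4,456 original and 4,096 reduced points over 15 frames); the per-frame counts for the other feature types are taken directly from the text.

```python
FRAMES = 15

spectrogram = 256 * FRAMES   # 3,840 points (inferred from the totals)
mfcc_full   = 39 * FRAMES    # 585 points: 39 coefficients per frame
zcr         = FRAMES         # one scalar per frame
energy      = FRAMES         # one scalar per frame
pitch_pct   = 1              # a single voicing-percentage value

full_vector = spectrogram + mfcc_full + zcr + energy + pitch_pct
print(full_vector)           # -> 4456

mfcc_reduced = 15 * FRAMES   # keep only the first 15 coefficients per frame
reduced_vector = spectrogram + mfcc_reduced + zcr + energy + pitch_pct
print(reduced_vector)        # -> 4096, the vector length used in the experiments
```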

C. FEATURES NORMALIZATION
A features vector consists of the aforementioned five different types of features, each of which has its own dynamic range. Therefore, it is essential that features vectors are normalized before being applied to a machine learning algorithm. In our development, each type of features of a given features vector is normalized so that it has unit variance. Figure 2 shows a sample of the normalized spectrogram features of a phoneme represented by 15 records.
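A minimal sketch of this per-type normalization, assuming plain scaling by the group's own (population) standard deviation so each feature type ends up with unit variance; the feature values below are toy numbers, not actual KAPD measurements.

```python
from statistics import pstdev

def normalize_group(values):
    """Scale a list of same-type features so the group has unit variance."""
    sd = pstdev(values)
    return [v / sd for v in values] if sd > 0 else list(values)

# Toy feature groups with very different dynamic ranges.
spectrogram = [120.0, 80.0, 95.0, 130.0]
zcr = [0.10, 0.25, 0.15, 0.30]

norm_spec = normalize_group(spectrogram)
norm_zcr = normalize_group(zcr)
# Both groups now have unit variance regardless of their original ranges.
```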
Note that the normalized spectrogram has spikes, corresponding to resonances in the vocal tract. The amplitudes of the spikes vary between the records of a phoneme, and also vary between those of other phonemes. Figure 3 shows the boxplot of the normalized spectrogram features of the dataset under consideration. A boxplot is a standardized way to display a data distribution using five statistical measures: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum of the dataset [43]. It tells about the skewness of the data distribution and the presence of outliers.
In this context, the minimum and maximum of a dataset are defined as Q1 − 1.5×IQR and Q3 + 1.5×IQR, respectively, where IQR is the difference between Q3 and Q1. Therefore, any sample of value less than the minimum or greater than the maximum is considered an outlier. For the normalized spectrogram features, the minimum value is -1.2661, the first quartile (Q1) is -0.47869, the third quartile (Q3) is 0.046241, and the maximum value is 0.83364. However, there is a large number of outliers due to the presence of spikes. In fact, the presence of these spikes causes a large dynamic range for the spectrogram features. Therefore, it is important to limit the spikes' amplitudes so that they all have relatively comparable values.
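The reported fences can be reproduced directly from the quoted quartiles, which serves as a quick consistency check of the boxplot definition above.

```python
# Quartiles of the normalized spectrogram features, as reported in the text.
q1, q3 = -0.47869, 0.046241
iqr = q3 - q1

minimum = q1 - 1.5 * iqr   # ≈ -1.2661, matching the reported minimum
maximum = q3 + 1.5 * iqr   # ≈  0.83364, matching the reported maximum

def is_outlier(x):
    """A sample outside [minimum, maximum] is flagged as an outlier."""
    return x < minimum or x > maximum

print(minimum, maximum)
```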
Let s(n) be the n-th sample of the features shown in Figure 2. The new scaled feature sample, s′(n), is then given by (1), where the parameter β is a scalar whose value is greater than or equal to 0 and which controls the amount of scaling. If β = 0, then no scaling is performed; that is, s′(n) = s(n). However, for extremely large values of β, the value of s′(n) approaches zero. In what follows, the value of β is set to 1. With this value, samples of large amplitudes undergo high attenuation, while those of relatively low amplitudes pass almost unchanged. Figure 4 shows the resulting normalized and scaled spectrogram features. Figure 5 shows the normalized MFCC features of the same phoneme considered above. The boxplot of the whole MFCC features of the normalized KAPD dataset is shown in Figure 6. It is clear from the figure that these features have a large dynamic range. Thus, the scaling operation is applied to these features in a similar manner to what is described in (1). Figure 7 shows the normalized and scaled MFCC features. Figure 8, on the other hand, shows the complete features vector, including the pitch percentage and the normalized energy and zero-crossing features. The boxplot of all features vectors after being preprocessed is shown in Figure 9.

IV. FEATURE SELECTION APPROACH
This section considers the selection of an appropriate configuration for the GA-based features selection process. Therefore, a model suitable for phoneme classification is first introduced. This model is needed to act as a baseline against which the performance of the GA-based features selection method is compared.

A. DEVELOPMENT OF A BASELINE MODEL
With the normalized and scaled dataset described in Subsection III-C, machine learning algorithms can be used to classify different phonemes. However, the performance of such algorithms is greatly affected by many factors, including specific parameters related to the input data, e.g., the length of the features vector, the correlation among its entries, and the signal-to-noise ratio. For the normalized and scaled KAPD (NS-KAPD) dataset, the length of each features vector is 4096, as described in Subsection III-B. With this high-dimensional vector, its entries may not all be equally important, as it may contain redundant and/or irrelevant features and/or noise. Therefore, it is essential that the size of the input features vector be reduced to contain only the features which contribute to the classification process. By doing so, the original representation of data will not be affected, and may even provide better readability and interpretability. Furthermore, the computational complexity will be reduced, and the classification accuracy could be improved.
In this subsection, the problem of selecting a small subset of entries of a features vector is addressed by applying a GA. Features selection based on GA has been widely studied, and a large number of methods have been developed in different applications; in [44], the authors used a genetic algorithm to design a decoder-tailored polar code, while in [45], the problem of finding the optimal distance for a traveling salesman is solved using a genetic algorithm. In [46], the effect of using different configurations of GA is investigated. Using the available NS-KAPD dataset, a model suitable for phoneme classification is considered here. This model will be used as a baseline against which the performance of the GA-based features selection method is compared. The proposed model is a simple feedforward neural network (FF-NN), consisting of an input layer, two hidden layers each of which has 100 neurons, and an output layer in the form of a binary vector of size 34. In the ideal case, one element of the output binary vector is '1' and the remaining elements are zeros. The active element corresponds to one of the 34 different phonemes. Figure 10 shows the architecture of the proposed FF-NN model, where W is the weights vector and b is the bias.
The FF-NN model is trained using 70% randomly selected features vectors of the NS-KAPD dataset. The remaining 30% features vectors are used for testing. Table 3 presents the performance in terms of four measures: the average classification accuracy, Area Under Curve (AUC), G-mean, and F-score. These numbers are our baseline to evaluate the performance of GA to select a subset out of the 4096 features of an input vector. In other words, the performance given in the table is the yield of the system when our full features vector is used without the involvement of GA selection scheme.

B. GENETIC ALGORITHM BASED FEATURES SELECTION METHOD
In our development, each features vector is encoded by 61 bits divided in order as follows:
a) 15 bits encode the 15 spectrogram records. That is, each bit encodes one spectrum record. If the bit value is '1', the corresponding record will be included in the new features vector; otherwise, it will not be included.
b) 15 bits encode the 15 MFCC records in the same manner, where each bit encodes one MFCC record.
c) 15 bits encode the 15 zero-crossing values.
d) 15 bits encode the 15 energy values.
e) 1 bit encodes the value of pitch percentage.
Figure 11 shows the schematic diagram of the encoding process. For each possible binary string of length 61 bits, the corresponding features are selected and used to train and test the proposed FF-NN, as described in Figure 12. Note that the GA needs to search for the binary string which gives the maximum possible classification accuracy. Each spectrum or MFCC record is encoded by one bit to reduce the search space of the GA. Following this encoding scheme, the search space contains 2^61 candidate features vectors. If, however, entries of the spectrogram and MFCC records were encoded individually, this would lead to a search space of size 2^4096.
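The 61-bit decoding can be sketched as follows: each spectrogram bit gates one whole spectrum record, each MFCC bit one reduced 15-coefficient record, and the remaining bits gate individual ZCR, energy, and pitch-percentage values. The 256-points-per-spectrum-record figure is inferred from the 4,096-point total; the rest follows the grouping described above.

```python
# (group name, number of chromosome bits, features gated per bit)
GROUPS = [
    ("spectrogram", 15, 256),  # 256-point spectrum record per bit (inferred)
    ("mfcc",        15, 15),   # 15 retained coefficients per record
    ("zcr",         15, 1),
    ("energy",      15, 1),
    ("pitch_pct",    1, 1),
]

def selected_length(chromosome):
    """Total number of features selected by a 61-bit GA chromosome."""
    assert len(chromosome) == 61
    total, pos = 0, 0
    for _name, n_bits, per_bit in GROUPS:
        total += sum(chromosome[pos:pos + n_bits]) * per_bit
        pos += n_bits
    return total

print(selected_length([1] * 61))  # -> 4096: every feature kept
print(selected_length([0] * 61))  # -> 0: nothing selected
```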
For the remaining genetic operations, there are many possibilities, each of which may be effective for one type of application but worse for another. In fact, there is no one choice fitting all, and the majority of research efforts are focused on finding an optimum choice for a specific setting. In what follows, the GA performance is investigated using configurations commonly used in the literature. For the parent selection operator, Roulette wheel selection, tournament selection [47], and their hybrid combination [48] are considered. For the crossover operator, single-point crossover, double-point crossover, and uniform crossover [49], [50] are considered. The crossover operator is followed by bit-flip mutation. The details of each operator are well explained in its relevant reference. Table 4 shows the performance of the five configurations, in terms of the classification accuracy, when the GA is applied along with the FF-NN to the NS-KAPD dataset. It is evident from the table that the GA with the selection operator combining the Roulette wheel and tournament schemes, together with uniform crossover, is the best performing algorithm. Therefore, this GA configuration is selected for the analysis to follow. Table 5 gives further details about the performance of the best performing GA in terms of AUC, G-mean, and F-measure. Compared with the performance of the FF-NN alone, it is observed that the GA gives almost similar results but with a reduced-size features vector. By comparing the performances using all four measures, it is noticed that there is almost a full match with the corresponding figures given in Table 3. From this, it can be concluded that the same performance is kept while using a reduced features vector generated by the GA. In particular, the GA shows that a features vector of size 3231 is sufficient to achieve the performance of the full-length features vector. The confusion matrix is depicted in Figure 13.
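The two parent-selection operators compared above can be sketched briefly: Roulette wheel selection draws a parent with probability proportional to fitness, while tournament selection returns the fittest of a small random sample; a hybrid scheme can simply alternate between the two. This sketch is illustrative, not the paper's exact implementation.

```python
import random

def roulette(pop, fits, rng):
    """Draw one parent with probability proportional to its fitness."""
    return rng.choices(pop, weights=fits, k=1)[0]

def tournament(pop, fits, rng, k=3):
    """Draw k random contenders and return the fittest one."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: fits[i])]

def hybrid(pop, fits, rng):
    """Hybrid scheme: pick either operator with equal probability."""
    pick = roulette if rng.random() < 0.5 else tournament
    return pick(pop, fits, rng)

rng = random.Random(0)
pop = ["a", "b", "c"]
fits = [0.0, 0.0, 1.0]           # all fitness mass on "c"
print(roulette(pop, fits, rng))  # -> c: roulette must pick the only fit parent
```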
This matrix shows that the phonemes 'sb10', 'db10', and 'fs10' are greatly confused with the phonemes 'ss10', 'zb10', and 'vs10'. This is intuitively not surprising because the features vectors of these phonemes may not be well separable. Figures 14 (a) and (b) show results when the t-distributed stochastic neighbor embedding (t-SNE) algorithm [51] is applied to the corresponding features of phonemes 'sb10' and 'ss10', and to the two most separable phonemes ('hz10' and 'ss10'). The t-SNE algorithm is used to reduce the data dimensionality from 4096 to 2, while preserving both the local and global structure of the data, hence facilitating its visual inspection.
From the figures, it is observed that the features of phonemes 'sb10' and 'ss10' overlap, which makes phoneme discrimination difficult. This overlap led to 59 confusions between these two phonemes. Therefore, it can be concluded that there is a large similarity between the two phonemes 'sb10' and 'ss10', while, on the other hand, there is a large dissimilarity between the 'hz10' and 'ss10' phonemes.

V. PHONEME RECOGNITION PERFORMANCE USING DISTINCTIVE PHONETIC FEATURES ELEMENTS
In this study, each phoneme is represented by the 30 DPF elements that are listed in Table 1, which can be used for phoneme recognition. Note that the NS-KAPD dataset has feature vectors of dimension 4096. It is possible that the 4096 features may not all contribute to the recognition of a DPF element. In this section, the GA is used to determine the features that best represent a particular DPF element, and an FF-NN is built for the classification of each element. Table 6 shows the performance of the developed 30 FF-NNs, along with the number of selected features for each DPF element. The table also displays the number of '1s' in the final output code vector of the GA.
From Table 6, it is of interest to note that the average accuracy across all 30 DPF elements is 90%, while the average AUC, G-mean, and F-score are 0.85, 0.84, and 0.78, respectively. This excellent performance has been achieved with a great reduction in the required number of features, which ranges from 2982 down to 1131, with an average of 2047 features, i.e., 50% (= 2047/4096) of the total number of features. Table 7 gives more details about the selected features for each DPF element, and provides the final 61-bit binary string output of the GA. For example, the first DPF element has a 61-bit string vector with 31 entries of '1s'. This corresponds to the selection of spectrogram features computed from 8 frames, MFCC features computed from 10 frames, zero-crossing features computed from 6 frames, and energy features computed from 7 frames. The pitch percentage for this DPF element, however, is not selected. Figure 15 depicts the number of times each entry of the 61-bit string vectors carries the value of '1'. It is evident from the figure that entries number 42 and 57 have the lowest frequency of having the value of '1', while entry number 15 has the highest. These three entries represent the corresponding frames of the zero-crossing percentage, energy, and spectrogram features, respectively.
As described in Subsection II-A, Figure 1 shows the targeted architecture for phoneme classification using DPF elements and GA for adaptive feature selection. In this figure, N = 4096 and M = 30. Therefore, the 4096-feature vector is applied to 30 GAs followed by 30 FF-NNs working in parallel. The output of these FF-NNs constitutes a binary vector of length 30, composed of '0s' and '1s' depending on the DPF elements of the phoneme under consideration. Figure 16 shows the performance, in terms of the confusion matrix, of the proposed classification system, where the features nominated for training and testing are those selected by the GA. In this setting, the outputs of the 30 FF-NNs constitute the predicted DPF vector corresponding to a particular phoneme. Therefore, a phoneme is identified by measuring the Euclidean distance between the output of the model and the actual DPF vectors of all phonemes; the phoneme whose DPF vector has the minimum distance is selected. The test set is composed of 100 samples of each phoneme, except the phonemes 'as21', 'is21', and 'us21', which have 28, 26, and 26 samples, respectively, in the KAPD dataset. Note that the two confusion matrices in Figure 13 and Figure 16 represent the results of two methods of phoneme recognition using the output of the GA-FFN model: the first method computes the confusion matrix directly from the neural network outputs, while the second computes it from the predicted DPF elements. Figure 17 shows the identification accuracy for each phoneme computed from Figure 16, where the two phonemes 'bs10' and 'fs10' show the worst performance, as each gets confused with other phonemes. The two phonemes 'sb10' and 'ss10' also perform poorly, as they are mutually confused. This latter observation is consistent with our previous observation in Section IV.B.
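The minimum-Euclidean-distance decision rule described above can be sketched directly. The canonical DPF table below is a randomly generated stand-in (the real one assigns a fixed 30-bit vector to each Arabic phoneme); the function name `identify_phoneme` is illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical canonical 30-bit DPF vectors, one row per phoneme.
phoneme_dpf = rng.integers(0, 2, size=(40, 30)).astype(float)

def identify_phoneme(predicted_dpf, dpf_table):
    """Return the index of the phoneme whose canonical DPF vector is
    closest, in Euclidean distance, to the predicted 30-element vector."""
    dists = np.linalg.norm(dpf_table - predicted_dpf, axis=1)
    return int(np.argmin(dists))

# A noisy FF-NN output is still mapped to the nearest canonical vector.
noisy = phoneme_dpf[7] + rng.normal(scale=0.1, size=30)
decided = identify_phoneme(noisy, phoneme_dpf)
```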
The t-SNE plot for the features of both phonemes, 'sb10' and 'ss10', reveals the presence of severe overlap between them, as depicted in Figure 14 (a). The Wilcoxon signed-rank test [52] is used to judge the significance of the GA-FNN results, as compared to the corresponding uttered phonemes. The Wilcoxon signed-rank test is a non-parametric statistical test used for comparing two paired sets of observations whose differences come from a distribution with zero median. In our experiments, the p-value of a two-sided Wilcoxon signed-rank test is 0.12. This result indicates that the test fails to reject, at a significance level of 5%, the null hypothesis that the differences have zero median. That is, the difference between the median of the GA-FNN based outputs and that of the corresponding real phoneme sequence is zero; hence, the two paired sets (the model's output and the corresponding real pronounced sequence) are not statistically different.
These promising results demonstrate the effectiveness of the approach applied in this work in realizing an efficient compromise among the following three challenging requirements that are commonly encountered when developing speech processing systems. First, the ability to deal with variability in the speech signal, which is a crucial requirement for system robustness; such variability is captured in the system input via a diversity of acoustic features that, in turn, would significantly increase input dimensionality and model complexity. Second, the pressing need to reduce input dimensionality, which would greatly limit the involvement of multiple types of acoustic features in the system input. Lastly, the fundamental requirement to increase system performance, which is directly affected by the two aforementioned requirements. Therefore, these results pave the way for more efficient DPF-based system design approaches that enhance system robustness by diversifying input acoustic cues, while maintaining lower input dimensions and superior system performance. This effort also validates the ability of GA to reduce the feature space of speech signals in general, which is very useful in the digital signal processing front-end for minimizing CPU time and memory usage by removing redundant features. Certainly, this will have a direct positive impact on real-time speech recognition systems implemented on low-resource computers.

VI. DISCUSSION
The GA has different parameters to configure and cost functions to estimate, which contribute to the total complexity of the entire algorithm. Therefore, each variant of GA has a different time complexity depending on the algorithm implementation; for example, the time complexity is shown to be polynomial of degree two in [53], while in [54] it is proportional to the number of samples in the training set multiplied by the squared number of total features, and in [55] it is proportional to the number of features under investigation. By analyzing the whole process of the proposed GA-FFN model, as described in Section IV-B, it is not difficult to determine the time complexity as O(NPG), where N is the length of the features vector, P is the population size, and G is the number of generations. This result is consistent with the finding reported in [56]. In the proposed model, the GA is only used in the training phase to select the optimal set of input features. The optimal configuration of input vectors composed of the selected features is then used in the testing phase. That is, in the testing phase the GA is no longer needed, and the time complexity is solely due to the FFN network, which is O(N) [57]. On the other hand, for a training phase constrained by short processing time, a fully parallel GA implementation on dedicated hardware (e.g., Field Programmable Gate Arrays (FPGAs)) can be considered; see [58] and the references therein. This solution provides higher performance and speed when compared to sequential solutions.
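The O(NPG) bound can be made concrete with a skeletal GA for feature selection: each of the G generations performs P fitness evaluations plus P crossover/mutation passes, each of O(N) work. The fitness function below is a deliberately trivial stand-in (it merely rewards small masks); in the paper it is the FF-NN-based cost, and all parameter values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

N, P, G = 64, 10, 5  # feature-vector length, population size, generations

def fitness(mask):
    # Stand-in for the FF-NN-based cost in the paper: here we simply
    # reward chromosomes that keep few features.
    return -int(mask.sum())

population = rng.integers(0, 2, size=(P, N))

for _ in range(G):                                  # G generations
    scores = np.array([fitness(ind) for ind in population])  # P evaluations
    order = np.argsort(scores)[::-1]
    parents = population[order[: P // 2]]           # keep the best half
    # Crossover: single-point recombination of consecutive parents, O(N) each.
    children = []
    for i in range(len(parents)):
        a, b = parents[i], parents[(i + 1) % len(parents)]
        cut = int(rng.integers(1, N))
        children.append(np.concatenate([a[:cut], b[cut:]]))
    children = np.array(children)
    # Mutation: flip each bit with small probability.
    flips = rng.random(children.shape) < 0.01
    population = np.concatenate([parents, np.where(flips, 1 - children, children)])

best = population[np.argmax([fitness(ind) for ind in population])]
```

Counting the work per generation (P evaluations and P O(N) recombinations) over G generations gives the O(NPG) total stated above.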
Although the proposed GA-FNN method achieved good results in reducing the dimensionality of the feature space, it has some limitations. Because the model uses GA for feature selection, the running time of the training phase on an Intel Core i9-9900K processor with 64 GB of RAM is quite long (a few hours). To cope with this computational cost, a fully parallel GA implementation on dedicated hardware (e.g., FPGAs) can be considered [58]. Another limitation is related to the selection of appropriate operators, such as crossover and mutation, to prevent algorithm divergence; consequently, there is no guarantee of the optimality of the obtained solution. On the other hand, besides its role as a classifier (phoneme recognizer), the FNN was used to estimate the GA objective function during the evaluation of the large number of individuals produced through the generations. This dual role assigned to FNNs, as fitness estimators as well as classification and recognition engines, should be assessed by comparing it to an approach using two different systems dedicated separately to the estimation of the objective function on the one hand, and to the classification of phonemes on the other.
Regarding comparisons with alternative methods, GA was selected to reduce the complexity of speech acoustic features in order to remove data redundancy in signal front-end processing; to the best of our knowledge, this is the first time such an approach has been considered in the literature. On the other hand, the FNN was used as a vehicle for performing the GA task and as a baseline for evaluating the huge number of produced generations and chromosomes, helping to avoid exhaustive search. FNN methods are well-known engines for classification and recognition, with straightforward design methods and tune-ups. In linguistics, there are comparative methods based on the systematic process of reconstructing the segmental and suprasegmental inventory of an ancestral language from cognate reflexes by performing a feature-by-feature comparison across genetically related ancestor languages [59]. Indeed, such comparative methods are meant to deal with higher-level language units, such as phonemes, in NLP disciplines, but the scope of the current work is the acoustic feature engineering of speech signals.

VII. CONCLUSION
This work has considered the problem of reducing the size of the features vector employed for DPF and phoneme recognition. Specifically, the GA has been used to perform the feature selection process. The experimental results obtained using the GA-based selection method show that the features vector can be reduced to 79% of its original size while achieving performance at least as good as that obtained using the full-fledged features vector. In particular, a features vector of an average size of 3231 elements, selected by GA aided with an FF-NN for phoneme recognition, achieves an accuracy of 68.2%, as compared to 68% obtained using the full-fledged features vector, whose length is 4096 elements. For DPF recognition, the reduction in the features vector size is 50% on average, with a recognition accuracy of 90%. Therefore, the proposed method contributes to the reduction of the computational complexity of the problem at hand with no degradation in the system's performance. Further, it opens a new direction for research, where other evolutionary algorithms can be tested to achieve further reductions in the size of the features vector.
MANSOUR ALGHAMDI received the Ph.D. degree in speech analysis, synthesis, and perception from the University of Reading, in 1990. He has held several positions, including General Director of Scientific Awareness and Publishing with KACST. He is currently a Consultant at the Education and Training Evaluation Commission. He has more than 80 published articles and books, and five patents. He is also a PI and a team member of more than 20 scientific research projects that produced software systems, algorithms, and databases. He has supervised and examined several Ph.D. students. He has lectured and reviewed articles and projects in his field. His published work has more than 1000 citations with an H-index of 20 on Google Scholar. He has been working in the public sector for 47 years.