Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network

Speaker identification refers to the process of recognizing a human voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. In the speaker identification process, extracting discriminative and salient features from speaker utterances is an important task for accurately identifying speakers. Various features for speaker identification have recently been proposed by researchers. Most studies on speaker identification have utilized short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), owing to their ability to capture the repetitive nature of speech signals efficiently. Various studies have shown the effectiveness of MFCC features in correctly identifying speakers. However, the performance of these features degrades on complex speech datasets, and consequently they fail to accurately capture speaker characteristics. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the strengths of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with DNN outperformed existing baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, the DNN obtained better classification results than five machine learning algorithms recently utilized in speaker recognition. Moreover, this study evaluated the effectiveness of one-level and two-level classification methods for speaker identification. The experimental results showed that two-level classification yields better results than one-level classification.
The proposed features and classification model for identifying a speaker can be widely applied to different types of speaker datasets.


I. INTRODUCTION
Automatic speaker identification (ASI) is the process by which a machine extracts the identity of a speaker from a group of known speech signals. Speech signals are a powerful medium of communication that convey rich and useful information, such as the emotion, gender, accent, and other unique characteristics of a speaker.
The associate editor coordinating the review of this manuscript and approving it for publication was Ahmed Farouk.
These unique characteristics enable researchers to distinguish among speakers when calls are conducted over phones although the speakers are not physically present. Through such characteristics, machines can become familiar with the utterances of speakers, similar to humans. Speaker utterances are trained with machine learning algorithms from the collected dataset, and then speakers are identified using the test utterances.
In general, speakers can be identified using two different approaches: text-independent and text-dependent. For a text-dependent speaker identification system, the text spoken during testing must be exactly the same as that spoken during the training of the system. By contrast, for a text-independent speaker identification system, the identification process does not depend on the text being spoken by the speaker. Furthermore, speaker recognition is divided into two processes: speaker identification and speaker verification. Speaker identification involves matching a speaker utterance against a group of trained speaker utterances; the trained speaker whose model assigns the highest probability to the test utterance is identified as the speaker. Alternatively, speaker verification is the process of determining, through binary classification, whether the speaker of a test utterance belongs to a group of speakers. In this study, the text-independent speaker identification task is considered due to its applications in current speech technology. Speaker recognition has become an area of intense research due to its wide range of applications, including forensic voice verification to detect suspects by government law enforcement agencies [1], [2], access control to different services, such as telephone network services [3], voice dialing, computer access control [4], mobile banking, and mobile shopping [5]. Furthermore, speaker identification systems are extensively used to improve security [6], automatic speaker labeling of recorded meetings [7], and personalized caller identification using intelligent answering machines [8]. Various studies have been conducted in the area of speaker identification.
These studies utilize Mel frequency cepstral coefficients (MFCC)-based features [9], Gaussian mixture models (GMM) [9]- [11], and vector quantization [12] to identify speakers. Then, these features are fed to simple machine learning classifiers [13] to construct speaker identification models.
The major challenge in speaker identification is the extraction of discriminative features from speech signals that can elicit improved performance from classification algorithms. In this regard, many studies have proposed different feature-engineering techniques, such as MFCC, linear prediction cepstral coefficient (LPCC), power-normalized cepstral coefficient, spectral features, and time-domain features. However, the aforementioned features are inefficient for speaker recognition in complex and noisy datasets, such as LibriSpeech [14], and exhibit low classification performance. The classification performance of MFCC and LPCC degrades as a result of channel variations caused by environmental noise and magnetic interference in handsets or microphones [15]. To overcome the limitations of the aforementioned features, this study proposed a novel fusion of MFCC and time-based features (MFCCT) from speech signals for the speaker identification task. In addition, a deep neural network (DNN) [16], [17] was used to construct an artificial neural network (ANN) to identify speakers based on unique voice patterns [18]. Moreover, the proposed MFCCT features and the constructed DNN-based speaker identification system were evaluated on the publicly available LibriSpeech [14] corpus. The main contributions of this paper are:
1. Propose efficient MFCCT features and a deep neural network (DNN) for speaker identification in large speech data to improve recognition accuracy.
2. Propose a two-level hierarchical classification model to identify speakers' gender and identity. The first level identifies the gender of the speaker (i.e., male or female), whereas the second level identifies the specific identity of the speaker.
3. Evaluate the performance of the proposed features on the well-standardized, complex, and publicly available LibriSpeech corpus.
4. Rigorously evaluate the performance of the proposed DNN model and MFCCT features by comparing them with baseline techniques and features.
5. Compare the suitability of the proposed two-level hierarchical classification model with one-level classification models.
6. To the best of our knowledge, this study is the first to evaluate an efficient fusion of MFCC-based and time-domain features (MFCCT) for speaker identification. Moreover, existing techniques were evaluated on small corpora that contained speaker utterances of minimal length and varying sample rates, whereas the proposed technique is evaluated on a large speech dataset.
The rest of this paper is organized as follows. Section II describes existing works on speaker identification. Section III presents the corpus used for the experiments, the feature extraction process, the classification process, the evaluation metrics, and the different experimental settings. Section IV reports the results of the different experimental settings. Section V discusses the significance of the observed findings. Finally, Section VI concludes this paper.

II. LITERATURE REVIEW
The field of artificial intelligence combined with cognitive science is rapidly growing. It includes design and development of various real-time applications such as speech recognition, decision making, face recognition, and DNA analysis. Recently, voice biometrics have been utilized to authenticate individual identification.
The human voice is the most useful medium of communication due to its simplicity, uniqueness, and universality. In comparison with other biometric verification systems, the benefits of speaker identification are as follows:
1. Voice is easily accessible, easy to use, and low in cost.
2. Voice samples are easy to obtain, and voice makes it comparatively simple for users to recognize people.
Because speech recognition systems need to operate under a wide variety of conditions, they should be robust to extrinsic variations induced by a number of acoustic factors, such as the transmission channel, speaker differences, and background noise. To enhance classification performance, most speech applications perform digital filtering, in which an estimate of the clean utterance is learned by passing the noisy utterance through a linear filter. Under this formulation, noise reduction becomes the problem of designing a filter that can considerably reduce noise without a noticeable loss of useful information. Therefore, several researchers are investigating how to minimize the effects of environmental noise in order to correctly classify speech signals. For instance, Lim, et al. [19] proposed spectral subtraction, which overlays a slight background noise over the speech signals so that the components corresponding to the noise are hidden. However, spectral subtraction can also destroy several spectral features of the original speech signal [20], leading to the loss of some valuable information. To overcome this issue, a support vector machine (SVM) [21] can classify speech features into various classes, aiming to minimize the differences among speech features of the same class to enhance classification accuracy. Nonetheless, this approach often needs a large number of training utterances and is not suitable for applications that require a timely response.

A. RELATED STUDIES
The human voice is universally used to exchange information with one another. Speaker recognition refers to the identification of speakers based on the vocal features of the human voice. It has become an area of intense research due to its wide range of applications, such as forensic voice verification to identify suspects by government law enforcement agencies [1], [2]. Feature extraction plays a key role in the speaker recognition process because it significantly affects the performance of a speaker recognition classification model. In recent years, various researchers in the area of speaker recognition have proposed novel features that have proven useful in effectively classifying human voices. Murty, et al. [22] extracted residual phase and MFCC features from 149 male speaker utterances from the NIST 2003 dataset to form a master feature vector. The authors fed the extracted MFCC features as input to an auto-associative neural network classifier and obtained approximately 90% classification accuracy. Nonetheless, the proposed features and classifier may be ineffective for complex datasets, such as LibriSpeech. Fong, et al. [23] performed a comparative study to classify speakers using various time-domain statistical features and machine learning classifiers; they obtained the highest accuracy of approximately 94% by using the multilayer perceptron classifier. Although the experimental results of the study achieved good classification accuracy, the results cannot be generalized to a wider scale because the authors used only 16 speaker voices from the PDA speech dataset in the experiment. In addition, the study used a small number of speaker utterances in the training and testing sets. Ali, et al. [24] recently proposed a speaker identification model for identifying 10 different speakers using an Urdu language dataset.
The study fused deep learning-based and MFCC features to classify speakers using a support vector machine (SVM) algorithm. The experimental results achieved 92% classification accuracy, which is promising. However, the dataset used in the experiments suffers from several weaknesses. First, only 10 speaker utterances were used in the experiments. Second, each utterance comprised only one word. Thus, the fusion-based features proposed by the authors may be inefficient and ineffective for complex human voices. Soleymanpour, et al. [25] investigated clustering-based MFCC features coupled with an ANN classifier to categorize 22 speakers from the ELDSR dataset. The experimental results of the study achieved 93% classification accuracy. Laptik, et al. [26] and Prasad, et al. [27] evaluated MFCC features and a GMM classifier to classify 50 and 138 speakers from the CMU and YOHO datasets, respectively. The experiments using the proposed feature extraction methods exhibited 86% and 88% classification accuracy. Nidhyananthan, et al. [28] proposed a set of discriminative features to classify 50 speaker utterances from the MEPCO speech dataset. These authors extracted RASTA-MFCC features to classify speaker utterances. The extracted features were inputted into a GMM-universal background model classifier to learn the classification rules. The results achieved 97% classification accuracy. Although the results demonstrated reasonable classification accuracy, they cannot be applied to a wider scale because the study utilized only six utterances, with each utterance lasting only 3 s. Therefore, for speaker utterances that are over 3 s long, RASTA-MFCC features may prove to be insignificant. To address the issues in the existing literature, Panayotov, et al. [14] provided a standard and complex speaker utterance dataset, called ''LibriSpeech,'' for the speaker identification problem.
The well-known MFCC features did not exhibit promising results when they were extracted from the LibriSpeech dataset and fed to a classifier. To improve the classification accuracy on the LibriSpeech dataset for speaker identification, the present study proposed novel MFCCT features to classify speaker utterances. Furthermore, a DNN was applied to the extracted MFCCT features to construct a speaker identification model. The details of the proposed features and model are discussed in the subsequent sections.

III. PROPOSED METHODOLOGY
This section describes in detail the methodology ( Figure 1) used to identify the speakers. First, several speaker utterances were collected for the experiments. Second, various useful features were extracted from the collected speaker utterances to form a master feature vector. This master feature vector was then fed as an input to a feed forward deep neural network architecture to construct the speaker identification model. To investigate the classification performance of the proposed speaker identification model, two performance metrics, namely, overall accuracy and AUROC (Area Under the Receiver Operating Characteristics), were used. Finally, the performance of the constructed model was evaluated using a separate test set and existing speaker identification baseline techniques. The details of these methods are discussed in subsequent sections.

A. DATASET
The LibriSpeech [14] corpus was used for the experiments conducted in this study. This corpus is publicly available and is prepared from LibriVox audiobooks, with careful segmentation and alignment, to develop automatic speech recognition and speaker identification models using machine learning and deep learning techniques. LibriSpeech includes audio files of English speech from male and female speakers with various accents, the majority being US English. All the utterances in this dataset are sampled at 16 kHz with a sample size of 16 bits. This corpus includes five different training and testing sets for developing an automatic speaker identification model. In this study, one subset of the corpus, i.e., train-clean-100, was considered for the experiments because it includes 100 h and 25 min of speech from male and female speakers with several utterances each. In addition, 50 male and 50 female speakers were selected from this dataset for the experiment (Table 3). Moreover, 80% and 20% of the utterances of each male and female speaker were used for training and testing, respectively. Each speaker served as a class label in the selected corpus to identify the speaker through MFCCT features and the DNN [16] architecture.

B. SPEECH PRE-PROCESSING
Speech signal pre-processing is a critical phase in systems where background noise or silence is completely undesirable. Systems such as automatic speaker identification and speech recognition require efficient feature extraction approaches in which most of the spoken portion includes speaker-related attributes. Therefore, pre-emphasis and silence removal techniques were employed in this study.
The pre-emphasis method increases the strength of the high frequencies of a speech signal, while the low frequencies remain in their original condition, in order to improve the signal-to-noise ratio. Pre-emphasis works by enhancing the high-frequency energy through a high-pass finite impulse response (FIR) filter, which is equivalent to
y(n) = x(n) - α · x(n - 1), (Eq 1)
where α is the pre-emphasis coefficient. The FIR filter inevitably changes the distribution of energy across frequencies along with the overall energy level, which could have a critical impact on energy-related acoustic features [29]. On the other hand, signal normalization makes speech signals comparable irrespective of variations in magnitude by using Eq 2:
S_Ni = (S_i - µ) / σ, (Eq 2)
where S_i is the i-th part of signal S, σ and µ are the standard deviation and mean of S, respectively, and S_Ni is the normalized i-th part of signal S.
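The two pre-processing operations above (Eq 1 and Eq 2) can be sketched in a few lines of NumPy. This is an illustrative sketch; the pre-emphasis coefficient α = 0.97 is a common default rather than a value stated in this paper:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # High-pass FIR filter, Eq 1: y(n) = x(n) - alpha * x(n - 1).
    # alpha = 0.97 is a common default; the paper does not state its value.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def normalize(s):
    # Eq 2: subtract the mean and divide by the standard deviation of S.
    return (s - np.mean(s)) / np.std(s)
```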

C. FEATURE ENGINEERING
In general, classification performance relies on the quality of a feature set. Thus, irrelevant features may produce less accurate classification results. In deep learning and machine learning, extracting discriminative feature sets is an important task to obtain reasonable classification performance [30]. Moreover, the authors of [30] concluded that the feature engineering step is a key step in machine learning and deep learning because the success or failure of any speaker identification model heavily depends on the quality of the features used in the classification task. If the extracted features correlate well with the class, then classification will be easy and accurate. By contrast, if the extracted features do not correlate well with the class, then the classification task will be difficult and inaccurate. Furthermore, the collected speaker utterances are frequently unavailable in a proper form to learn the classification rules. Thus, to make these utterances useful for the speaker identification task, various useful features are extracted from collected utterances, and the extracted features are appropriate for learning classification rules. In general, most of the effort in speaker identification is required in the feature engineering step. It is an interesting step in the speaker identification process, where perception, innovation, intuition, creativity, and ''black art'' are equally important as technical and subject knowledge. The construction of a classification model is frequently the fastest step in the speaker identification task because feature engineering is responsible for extracting discriminative features from speaker utterances and transforming these features into a numeric master feature vector. This vector is then used by a machine learning or deep learning classifier to quickly learn the classification rules and develop a classification model. 
Feature extraction is more challenging than feature classification because of its domain-specific nature compared with the general-purpose nature of the classification task. Thus, in the present study, an innovative feature extraction process was adopted to extract useful and effective features, known as MFCCT features, from speaker utterances to construct an accurate classification model for speaker identification. The detailed functionality of the proposed MFCCT features is discussed in the subsequent subsection.

1) PROPOSED MFCCT FEATURES
This section discusses the functionality of MFCCT features, which comprises three distinct steps: (1) MFCC feature extraction, (2) time-domain feature extraction from the MFCC features, and (3) appending target SIDs to the extracted features of each speaker utterance. These steps are discussed in the subsequent paragraphs.

a: EXTRACTING MFCC FEATURES
MFCC features were initially extracted from speaker utterances using Algorithm 1. MFCC-based features have proven useful in speaker identification tasks [31]. These features represent the vocal tract information of a speaker. The MFCC feature extraction process comprises framing, windowing, discrete Fourier transform (DFT), logarithm of the magnitude, warping of frequencies on the Mel scale, and application of the discrete cosine transform (DCT). Each speaker utterance was divided into frames of 25 ms length. Moreover, a 10 ms overlap was used between successive frames to avoid information loss, as shown in Figure 2. Thus, the total number of frames for each speaker can be determined using Eq 3:
frames = (utterance length - frame length) / frame step + 1. (Eq 3)
In addition, the total number of samples per frame (N) can be computed using Eq 4:
N = sample rate × frame length. (Eq 4)
In the dataset used in this study, speaker utterances were recorded at a sample rate of 16 kHz, and a frame step of 10 ms was used.
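With the values used in this study (16 kHz sampling, 25 ms frame length, 10 ms frame step), Eq 3 and Eq 4 work out as in the following sketch:

```python
# Framing arithmetic for the settings used in this study:
# 16 kHz sample rate, 25 ms frame length, 10 ms frame step.
sample_rate = 16000
frame_len = int(0.025 * sample_rate)   # Eq 4: 400 samples per frame
frame_step = int(0.010 * sample_rate)  # 160 samples per step

def num_frames(n_samples):
    # Eq 3: number of full frames obtained from an utterance of n_samples.
    return 1 + (n_samples - frame_len) // frame_step
```

For example, a 1 s utterance (16,000 samples) yields 1 + (16000 - 400) // 160 = 98 frames.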

Algorithm 1 MFCC Features of Speaker Utterance
Input: path to speaker utterances.
Procedure: after the framing step, Hamming windowing was performed on each individual frame to smooth its edges using Eq 5:
w(n) = 0.54 - 0.46 · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1, (Eq 5)
where
N is the number of samples in each frame. Thereafter, the magnitude spectrum of each frame of N samples was computed by using the DFT in the third step. Each magnitude spectrum was passed through a Mel-filter bank. Mel is a measuring unit based on the perceived frequency of human ears. The estimation of Mel can be written as
Mel(f) = 2595 · log10(1 + f / 700), (Eq 6)
where f represents the physical frequency and Mel(f) represents the perceived frequency.
To imitate the perception of human ears, the warped axis was implemented using Eq 6. The most widely employed triangular filter bank with Mel-frequency warping is the Mel-filter bank. Afterwards, the Mel-spectrum was calculated by multiplying each of the triangular filters with the magnitude spectrum X(k) using Eq 7:
s(m) = Σ_{k=0}^{N-1} |X(k)|² · H_m(k), 0 ≤ m ≤ M - 1, (Eq 7)
where M is the number of triangular filters and H_m(k) is the weight assigned to the k-th bin of the energy spectrum contributing to the m-th output band, written as
H_m(k) = 0, for k < f(m - 1);
H_m(k) = (k - f(m - 1)) / (f(m) - f(m - 1)), for f(m - 1) ≤ k ≤ f(m);
H_m(k) = (f(m + 1) - k) / (f(m + 1) - f(m)), for f(m) ≤ k ≤ f(m + 1);
H_m(k) = 0, for k > f(m + 1), (Eq 8)
with m varying from 0 to M - 1 and f(·) denoting the filter boundary points.
Finally, the MFCC features were computed by taking the DCT of each log Mel spectrum using Eq 9:
c(n) = Σ_{m=0}^{M-1} log(s(m)) · cos(πn(m + 0.5) / M), n = 0, 1, ..., C - 1, (Eq 9)
where C is the number of cepstral coefficients.
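The whole pipeline of Eq 3 to Eq 9 can be sketched in NumPy as follows. This is an illustrative sketch: the number of triangular filters (26) and cepstral coefficients (13) are common defaults, not values given in this paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # Eq 6

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, frame_len=400, frame_step=160,
         n_filters=26, n_ceps=13):
    # Framing (Eq 3 and Eq 4): 25 ms frames with a 10 ms step at 16 kHz.
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    idx = (np.arange(frame_len)[None, :]
           + frame_step * np.arange(n_frames)[:, None])
    frames = signal[idx] * np.hamming(frame_len)          # Eq 5
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2   # |X(k)|^2 via DFT
    # Triangular Mel filter bank (Eq 6 to Eq 8).
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                          n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts)
                    / sample_rate).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        for k in range(bins[m - 1], bins[m]):     # rising edge of triangle
            fbank[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
        for k in range(bins[m], bins[m + 1]):     # falling edge of triangle
            fbank[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])
    mel_spec = np.log(power @ fbank.T + 1e-10)    # log Mel spectrum (Eq 7)
    # DCT of the log Mel spectrum (Eq 9).
    n = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * n * (m + 0.5) / n_filters)
    return mel_spec @ dct.T                       # one row of MFCCs per frame
```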

b: EXTRACTING MFCCT FEATURES
After extracting the MFCC features from the speaker utterances, MFCCT features were extracted from them. The detailed functionality is depicted in Algorithm 2. MFCCT features were extracted in three distinct steps. First, binning was performed on the extracted MFCC features for every 1500 rows of each column. A binning size of 1500 was used because it achieved better accuracy (for details, refer to Figure 9). In the second step, 12 different time-domain features (Table 2) were extracted from each bin of the extracted MFCC features. These 12 features were used because they obtained the highest classification accuracy (for details, refer to Table 4). As shown in Algorithm 2, the variable matrix represents the extracted MFCC features in matrix format, and the variable size represents the bin size (1500 in this case). GetFeatureVector is a method that returns the final master feature vector (MFV) for classification. The variable rows represents the number of rows of the MFCC feature matrix. The variable cols holds the number of speaker utterances, one per column. The variable bins contains the total number of bins, and n represents the number of MFCCT features (12 in this case). Thus, for each speaker, the total number of rows will be the number of bins (bins) multiplied by n (MFCCT features), and the number of columns will be the number of utterances for each speaker.
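A sketch of the binning step is given below. The paper's 12 time-domain features are listed in Table 2, which is not reproduced here, so a smaller hypothetical set of statistics is used as a placeholder:

```python
import numpy as np

# Placeholder statistics; the actual 12 time-domain features are in Table 2.
STATS = [np.mean, np.median, np.std, np.var, np.min, np.max]

def mfcct_from_mfcc(mfcc_matrix, bin_size=1500):
    # Bin the MFCC matrix every bin_size rows (1500 in the paper) and
    # compute time-domain statistics over each bin (Algorithm 2 sketch).
    rows = mfcc_matrix.shape[0]
    n_bins = max(1, rows // bin_size)
    features = []
    for b in range(n_bins):
        chunk = mfcc_matrix[b * bin_size:(b + 1) * bin_size]
        for stat in STATS:
            features.append(stat(chunk, axis=0))
    # Master feature vector for one utterance:
    # length = n_bins * len(STATS) * number of MFCC coefficients.
    return np.concatenate(features)
```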
The generalized form of the above equations can be written with m = number of features, n = number of utterances for a single speaker, i = 1, 2, 3, ..., m, j = 1, 2, 3, ..., n, and t = 1, 2, 3, ..., 100. To get the class label, Eq 13 can be rewritten as Eq 14. Two feature vectors were prepared in the third step. In the first feature vector, each row represents one MFCCT feature, and the columns represent speaker utterances (Eq 13). In the second feature vector, each row represents an SID, and the columns represent the number of utterances of each speaker (Eq 14 and Algorithm 3). Finally, both feature vectors were fed as input to the DNN [16] to construct a classification model for speaker identification. A hierarchical classification approach was used to identify the speaker. In this approach, the top-level classification layer identifies whether the speaker is male or female. Then, the second-level classification model identifies the specific SID. Thus, three classification models, namely, gender identification, male SID, and female SID models, were constructed. The detailed functionality is depicted in Figure 3. The two-level hierarchical classification approach was used because it obtained better results than the one-level classification model (Section IV-E). Furthermore, several recent studies in various domains have applied hierarchical classification models and reported that they outperformed one-level classification models [32]. In all three classification models, a feedforward deep neural network was used to construct the classification model. This classification algorithm was selected because it has achieved promising results in several pattern recognition applications. Moreover, the performance of the feedforward deep neural network is compared with various traditional classification algorithms in Section IV-A. In the subsequent paragraphs, a brief description of the deep neural network is presented.

1) DEEP NEURAL NETWORK
In recent years, many ANNs have been proposed for speech recognition, speaker identification, image processing, sensor data processing, and other application areas [17], [33]. The feedforward neural network (FFNN), as shown in Figure 5, has one input layer, one output layer, and one or more hidden layers. The input layer feeds the features to the hidden layers. The output layer computes the prediction for each class by applying a series of hidden-layer functions to the input data. Each layer consists of neuron-like information processing units, which are the basic building blocks of ANNs. Each neuron computes a simple weighted sum of the information it receives and then applies a transfer function to normalize the weighted sum, as shown in Figure 4 [34]. Neural transfer functions are used to compute the output of a hidden layer from its input and return a matrix of n elements. However, the softmax neural transfer function is used in the output layer, unlike in the hidden layers, to compute the predictions for each class. Figure 4 shows the weights w connected to each input x of a neuron, together with the bias b. These two parameters are updated by the neural network during the training phase through the training function. Other details of ANNs and the various types of training and transfer functions can be found in [34].
In the current study, a customized FFNN was used as a classifier to identify speakers. Several configurational changes were made to the FFNN to identify the speakers and reduce the overall misclassification rate [35]. The default FFNN architecture consists of one input layer, one hidden layer, and one output layer. The customized DNN architecture used in this study to classify speakers consists of 1 input layer, 5 hidden layers, and 1 output layer, as shown in Figure 5. The input layer used 48 neurons, equal to the number of features of each speaker utterance. Each hidden layer used 200 neurons, because the performance of a neural network depends on the number of neurons: too few neurons can contribute to underfitting, whereas too many can lead to overfitting [34], [35]. Each hidden layer used the hyperbolic tangent-sigmoid (tansig) transfer function to compute its output from the input within the range of -1 to 1.
However, the output layer used the softmax transfer function to compute the output values for multiclass classification (Table 3). Moreover, the trainscg function, which is widely utilized for pattern recognition problems, was used to train the DNN [34]. Furthermore, to achieve generalized performance of the training model and to avoid overfitting, different training functions, namely, trainscg, trainrp, traincgb, and traincgp, were used to train the DNN [36]. The MFCCT features were then fed to the trained DNN to identify the speaker based on the unique patterns of the speaker's utterances.
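The forward pass of the customized architecture (48 inputs, five tansig hidden layers of 200 neurons each, and a softmax output) can be sketched as follows. Training with scaled conjugate gradient (trainscg) is omitted, and the random weight initialization and the output size of 50 classes (one gender model over 50 speakers) are assumptions for illustration:

```python
import numpy as np

def tansig(x):
    # Hyperbolic tangent-sigmoid transfer function; output in [-1, 1].
    return np.tanh(x)

def softmax(x):
    # Softmax over the last axis; rows sum to 1 (class probabilities).
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_dnn(n_in=48, n_hidden=200, n_layers=5, n_out=50, seed=0):
    # 1 input layer, 5 hidden layers (tansig), 1 output layer (softmax).
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return [(rng.standard_normal((a, b)) * np.sqrt(1.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for w, b in params[:-1]:
        x = tansig(x @ w + b)       # hidden layers
    w, b = params[-1]
    return softmax(x @ w + b)       # class probabilities per utterance
```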

E. EVALUATION METRICS
Overall accuracy and AUROC were used to measure classification performance in all the experiments. These metrics are briefly discussed in the next paragraphs.

1) OVERALL ACCURACY
Overall accuracy is the ratio of the number of accurately predicted utterances to the total number of utterances. Eq 15 presents the mathematical definition of overall accuracy:
Accuracy = (number of correctly predicted utterances) / N, (Eq 15)
where N is the total number of instances.
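As a minimal sketch, Eq 15 amounts to:

```python
def overall_accuracy(predicted, actual):
    # Eq 15: correctly predicted utterances divided by the total number N.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```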

2) AUROC
The area under the receiver operating characteristic curve (AUROC) is a useful measure that is extensively used in machine learning tasks involving imbalanced datasets [39], [40]. This measure analyzes the performance of a classifier with respect to each class and summarizes the ROC curve by computing the area under it. If the value of the area under the curve (AUC) is close to 1, then the performance of the classifier is good; by contrast, a value less than 0.5 indicates poor performance [40], [41].
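For a single class treated one-vs-rest, the AUROC can be computed without plotting via the rank-based (Mann-Whitney U) formulation, sketched below; ties between scores are not rank-averaged in this simplified version:

```python
def auroc(labels, scores):
    # Probability that a randomly chosen positive instance is scored
    # above a randomly chosen negative one (rank-sum formulation).
    # labels: 0/1 per instance; scores: classifier scores per instance.
    pairs = sorted(zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    rank_sum = sum(rank for rank, (_, lab) in enumerate(pairs, start=1) if lab)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)
```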

3) EQUAL ERROR RATE
The equal error rate (EER) is the common value at which the false acceptance rate (FAR) equals the false rejection rate (FRR); a lower EER value indicates a higher accuracy of the system. FAR and FRR can be calculated using Eq 16 and Eq 17 [42]:
FAR = (number of false acceptances) / (number of impostor attempts), (Eq 16)
FRR = (number of false rejections) / (number of genuine attempts), (Eq 17)
while the EER (Eq 18) is the error rate at the operating threshold where FAR = FRR.
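A threshold-sweep sketch of the EER computation (Eq 16 to Eq 18), assuming per-trial similarity scores for genuine and impostor attempts:

```python
def eer(genuine_scores, impostor_scores):
    # Sweep decision thresholds and report the error rate at the point
    # where FAR (Eq 16) and FRR (Eq 17) are closest, i.e. the EER (Eq 18).
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```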

F. EXPERIMENTAL SETUP
This section presents the experimental setup for constructing the speaker identification model using the proposed MFCCT features and the DNN algorithm. An extensive set of experiments was performed to measure the performance of the constructed model and to compare it with baseline speaker identification models. To evaluate the performance of the constructed speaker identification model through the proposed MFCCT features, experiments were performed systematically in six different settings, as follows: 1. Proposed MFCCT features and classification algorithms: In this setting, the proposed MFCCT features were extracted from human voices. The extracted MFCCT features were then fed to six different classification algorithms, namely, DNN, random forest (RF), k-nearest neighbor (k-NN), SVM, naïve Bayes (NB), and J48, to construct the speaker identification models. In this setting, six analyses (one feature engineering technique (MFCCT) × six classification algorithms) were run to evaluate the performance of the classification algorithms coupled with the proposed MFCCT features.
2. Performance comparison of the proposed MFCCT features with baseline features: In this setting, the performance of the proposed MFCCT features was compared with those of MFCC features and time-domain features.
3. Comparison of different binning sizes for MFCCT features: In this setting, the performance of several binning sizes (such as 500, 1000, 1500, 2000, 2500, and 3000) was evaluated to obtain the optimal learning curve for the DNN algorithm. In addition, these binning sizes were used because of their implementation feasibility, which allows the evaluation of the performance of the classification algorithms within a suitable operating range.
4. Selection of various time-domain features to compute effective MFCCT features: In this setting, the performance of several time-domain features (shown in Table 3) was evaluated to obtain the best set of time-domain features for computing the MFCCT features and to determine the optimal learning curve for the DNN algorithm.
5. One-level versus two-level classification models: The hierarchical classification method was designed to improve the accuracy of speaker identification. To ascertain its efficacy, experiments were performed to compare the results of one-level classification with those of two-level classification. In one-level classification, all 40 speakers were labeled using their respective SID numbers, and the proposed MFCCT features were used with the DNN algorithm to construct a classification model.
6. Evaluation of the proposed method on different databases: In this setting, the performance of the proposed MFCCT features coupled with DNN and four other classification algorithms was evaluated on three different speaker identification datasets to observe the effectiveness of the proposed method. For this setting, 30 analyses (3 datasets × 5 machine learning algorithms × 2 classification models, i.e., male and female) were performed, and the EER metric was used to measure the effectiveness of all 30 analyses.
For all the experiments, speaker voice preprocessing, feature extraction, and classification were performed in MATLAB R2017a. The matplotlib Python library was used to generate accuracy graphs, AUC graphs, and utterance patterns.
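The MFCC-to-MFCCT transformation that underlies all of these settings can be sketched as follows. This is a minimal illustration, not the authors' implementation: only five descriptive statistics stand in for the paper's twelve, and the input is a single toy MFCC coefficient track rather than a full MFCC matrix.

```python
import statistics

# Assumed subset of descriptive statistical functions; the paper
# uses 12 such functions, of which only five are sketched here.
STATS = {
    "mean": statistics.mean,
    "std": statistics.pstdev,
    "min": min,
    "max": max,
    "median": statistics.median,
}

def mfcc_to_mfcct(coeff_track, bin_size):
    """Collapse one MFCC coefficient track (one value per frame) into a
    few statistics per bin of `bin_size` frames, so that thousands of
    frame-level values become a short fixed-length feature vector."""
    features = []
    for start in range(0, len(coeff_track), bin_size):
        chunk = coeff_track[start:start + bin_size]
        features.extend(fn(chunk) for fn in STATS.values())
    return features

track = [float(i % 7) for i in range(3000)]   # stand-in MFCC track
print(len(mfcc_to_mfcct(track, 1500)))        # 2 bins x 5 stats = 10
```

With the paper's binning size of 1500 and 12 statistical functions, each coefficient track of a real utterance would be reduced in the same way, bin by bin.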

A. RESULTS OF EXPERIMENTAL SETTING I
This section presents the results of Experimental setting I, in which the extracted MFCCT features were fed to five machine learning classification algorithms (i.e., RF, k-NN, NB, J48, and SVM) and DNN. The overall accuracies of all the aforementioned algorithms for the first level (gender-based classification model) and second level (male and female classification models) are shown in Figure 6. As shown in Figure 6, the DNN algorithm outperformed the five machine learning algorithms by obtaining an overall accuracy of 92.9% for the gender identification model, as well as 88.5% and 83.5% overall accuracy for the male and female speaker identification models, respectively. Among the five machine learning algorithms, an irregular accuracy trend can be observed. The k-NN and RF algorithms obtained the highest accuracies (89.6% and 88.2%) for the gender-based speaker identification model compared with the other three machine learning algorithms. In the male and female speaker identification models, the RF algorithm obtained the highest accuracies (81.2% and 80.4%, respectively) compared with the other four machine learning algorithms. In all the experiments, the NB algorithm obtained the lowest accuracy, followed by J48 and SVM.
In summary, the DNN algorithm outperformed the other five classification algorithms in both the first-level and second-level classifications for speaker identification. In addition, the ROC diagrams [43] for all three classification models obtained through the best-performing DNN algorithm are presented. Figure 8 (c) and (d) shows the ROC diagram and the confusion matrix for the first-level classification model. The performance of the male speaker class is marginally better than that of the female speaker class because many analysis techniques, such as pitch and formant estimation, are less accurate for high-pitched (female) utterances than for low-pitched (male) utterances [44]. Figure 8 (a) shows the ROC diagram for all 50 male speakers; the prediction accuracy of all male speakers is acceptable.

C. RESULTS OF EXPERIMENTAL SETTING III
This section presents the results of Experimental setting III, in which the performances of different binning sizes for MFCCT features were compared. All three classification models were evaluated using different binning sizes (500, 1000, 1500, 2000, 2500, and 3000). The overall accuracies for these binning sizes are shown in Figure 9. A bin size of 1500 achieved the highest overall accuracy when MFCCT features were used. Overall accuracy gradually decreased once bin size exceeded 1500, and the lowest accuracy was observed at a bin size of 3000.

D. RESULTS OF EXPERIMENTAL SETTING IV
This section presents the results of Experimental setting IV, which evaluates combinations of various time-domain features (shown in Table 2) for computing effective MFCCT features. Initially, the first two time-domain features in Table 2 were used to compute MFCCT features; the number of time-domain features was then increased to 4, 6, 8, 10, and 12. The resulting MFCCT features, computed from 2, 4, 6, 8, 10, and 12 time-domain features, were fed to the DNN algorithm to construct 18 different classification models (as shown in Table 4) and evaluate classification accuracy across all of them. Table 4 shows an incremental trend in classification accuracy. The highest accuracies of 92.9%, 88.5%, and 83.5% for the three models (i.e., the gender-of-speaker, male speaker, and female speaker models) were observed when 12 different time-domain features were used to compute the MFCCT features, whereas the lowest classification accuracy was observed when the MFCCT features were computed using only 2 time-domain features.
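The feature fusion grown in this setting can be sketched as follows: MFCC-derived statistics are concatenated with the first k time-domain descriptors of the raw signal. This is a hedged illustration; the descriptor set and function names below are assumptions and do not reproduce the paper's Table 2.

```python
def time_domain_descriptors(signal):
    """Four illustrative time-domain descriptors of a raw signal
    (assumed set, not the paper's): energy, zero-crossing rate,
    mean absolute amplitude, and peak amplitude."""
    n = len(signal)
    energy = sum(s * s for s in signal) / n
    zcr = sum(
        1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0)
    ) / (n - 1)
    mean_amp = sum(abs(s) for s in signal) / n
    peak = max(abs(s) for s in signal)
    return [energy, zcr, mean_amp, peak]

def fused_features(mfcc_stats, signal, k):
    """MFCCT-style fusion: MFCC statistics plus the first k
    time-domain descriptors, so the feature set can be grown
    incrementally as in Setting IV."""
    return mfcc_stats + time_domain_descriptors(signal)[:k]

sig = [0.5, -0.5, 0.25, -0.25, 0.5, -0.5]
print(len(fused_features([1.0, 2.0, 3.0], sig, 2)))  # 3 + 2 = 5
```

Repeating the fusion for k = 2, 4, ..., 12 and retraining the classifier each time reproduces the incremental evaluation pattern of this setting.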

E. RESULTS OF EXPERIMENTAL SETTING V
In this setting, experiments were performed to ascertain the effectiveness of the hierarchical classification model. Table 5 shows the classification results obtained from the one- and two-level classification models. The hierarchical classification model achieved better results than the one-level classification model. Nonetheless, 22 speakers exhibited the same accuracies in the one-level and hierarchical classification models. Moreover, one speaker (SID 060) achieved high accuracy in the one-level classification model, but the accuracy was reduced to 9% in the two-level classification model. To summarize, two-level classification yielded better results than one-level classification in most cases.
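The two-level scheme compared here can be sketched as a simple router, assuming scikit-learn-style classifiers that expose a `predict` method; the class and the stub classifiers below are hypothetical illustrations, not the authors' code.

```python
class TwoLevelSpeakerID:
    """Level 1 predicts gender; level 2 routes the same feature
    vector to the matching gender-specific speaker model."""
    def __init__(self, gender_clf, male_clf, female_clf):
        self.gender_clf = gender_clf
        self.male_clf = male_clf
        self.female_clf = female_clf

    def predict(self, features):
        gender = self.gender_clf.predict(features)
        second_level = self.male_clf if gender == "male" else self.female_clf
        return second_level.predict(features)

class Stub:
    """Placeholder classifier returning a fixed label, for illustration."""
    def __init__(self, label):
        self.label = label

    def predict(self, features):
        return self.label

model = TwoLevelSpeakerID(Stub("male"), Stub("SID 001"), Stub("SID 051"))
print(model.predict([0.1, 0.2]))  # routed through the male speaker model
```

Each second-level model is trained on only one gender's utterances, which is why the hierarchical split can simplify the decision each classifier has to make.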

F. RESULTS OF EXPERIMENTAL SETTING VI
This section presents the experimental findings of the 30 analyses performed to evaluate the effectiveness of the proposed MFCCT features across three different randomly selected datasets. The detailed results are shown in Table 6. Across all three datasets, the proposed MFCCT features coupled with DNN showed the best classification performance in both the male and female classification models. The proposed approach achieved the lowest EER on the LibriSpeech dataset, followed by the VCTK dataset, across both classification models, while the highest EER was observed on the ELSDSR dataset. The male classification model yielded a lower EER than the female classification model across all three datasets. The DNN classifier outperformed the SVM, RF, k-NN, and J48 classifiers in both classification models across all the datasets. It can be inferred from these 30 analyses that the proposed MFCCT features coupled with DNN are robust and perform well across several speaker identification datasets.

G. COMPARISON OF PROPOSED MODEL WITH BASELINES
To show the effectiveness of the proposed MFCCT features coupled with DNN, we compared the performance of the proposed method with three baseline methods. The results of these experiments are shown in Table 7. The proposed MFCCT features coupled with a deep neural network outperformed the baseline methods of [45] and [46]. However, the performance of the proposed method is marginally lower than that of the baseline in [47]. A possible reason for this marginal difference is that the authors of [47] classified only 10 speakers with 8 utterances, whereas our proposed model was developed using 100 speakers (50 male and 50 female) and achieved 89% accuracy. The model proposed in [47] would therefore likely degrade as the number of classes or speakers increases. To confirm this, we performed an experimental evaluation in which we initially employed our proposed model on 10 speakers and then gradually increased the number of speakers in steps of 10 up to 50. The comparative analysis of these experiments is shown in Figure 11: as the number of speakers increases, the accuracy of the classification model decreases. Nevertheless, the accuracy of the proposed model on 10 speakers from the LibriSpeech dataset (Figure 10) is much better than that of the model proposed in [47]. Hence, it can be concluded that our proposed model is more accurate and more general than the model proposed in [47].

V. DISCUSSION
This section provides the theoretical analysis of the speaker identification techniques used in this study. The experimental results show that the proposed MFCCT features and DNN can classify speaker utterances with an overall accuracy between 83.5% and 92.9%. As indicated in the experimental results (Section IV-B), the proposed MFCCT features presented the highest accuracy and outperformed MFCC and time-domain features. A possible reason for the poor performance of MFCC features is their reliance on the short-time Fourier transform, which has an extremely weak time-frequency resolution, together with the inherent assumption that the signal is stationary [48]. Meanwhile, a possible reason for the poor performance of time-domain features is their inability to produce representative and discriminative visual patterns for different speaker utterances. To support this assertion, the visual patterns of three different speaker utterances are presented in Figure 12; each column shows the patterns generated through the proposed MFCCT, MFCC, and time-domain features, respectively. The utterance patterns generated by the MFCCT features are discriminative across all three speakers, whereas those generated by the MFCC and time-domain features are not sufficiently discriminative.
Thus, the classifier can effectively classify the patterns generated by MFCCT features and produce fewer classification errors. By contrast, the classifier may encounter difficulties in classifying the patterns generated by MFCC and time-domain features, which can cause high misclassification rates. Therefore, MFCCT features are recommended over MFCC and time-domain features for accurately identifying speaker utterances with a minimal misclassification rate. The proposed MFCCT features consume less computational time during speaker identification model training and classification because they are extracted from MFCC features by computing various descriptive statistical functions over a specific binning size; an enormous number of MFCC features can thus be transformed into a few powerful and discriminative MFCCT features. To transform MFCC features into MFCCT features, the 12 descriptive statistical functions shown in Table 3 were applied for each binning size. In our experiments, combinations of different descriptive statistical functions were evaluated to effectively transform MFCC features into MFCCT features, and the results showed a high correlation between the number of descriptive statistical functions used and the improvement in classification accuracy. Moreover, classification performance was evaluated using various binning sizes to transform MFCC features into MFCCT features and to obtain the optimal learning curve for the classifier. The obtained results showed that the learning curve improved as binning size increased from 1 to 1500 and degraded once binning size exceeded 1500. Thus, a binning size of 1500 should be used when transforming MFCC features into MFCCT features.
The findings of Experimental setting I are discussed in Section IV-A. The DNN classifier outperformed the five other machine learning classifiers. The DNN classifier is well suited to identifying complicated and nonlinear patterns in high-dimensional datasets [49], thereby providing better discriminative power for speaker identification. Moreover, the visual patterns learned by DNN were similar across intra-speaker utterances and discriminative across inter-speaker utterances; thus, classification accuracy was better with the DNN classifier than with the other classifiers. However, the experimental results showed that the female speaker model yielded lower accuracy than the male speaker model. This is because the dataset contains one female speaker (SID 2092) with a high misclassification rate. A possible reason for this misclassification is that the voice frequency of this speaker is very similar to that of other female speakers, making her utterances challenging for the classifier to distinguish. To confirm this, we performed experiments in which this female speaker (SID 2092) was replaced with another female speaker (SID 2182) from the dataset; the results showed improved accuracy (90%) for the female model. In future work, we will investigate features that can classify the voices of speakers such as SID 2092 with a minimal misclassification rate.
FIGURE 12. Visual patterns of different feature engineering techniques.
The classification performance of RF was marginally lower than that of DNN. The fully grown trees in RF were not pruned [50], and the random split selection of features [51] led to better classification results. However, a large number of trees in RF can make classification slow in real-time applications [52]. The classification performance of k-NN was marginally lower than that of RF. k-NN is highly effective when the amount of training data is large [53]; however, its computation cost is high due to the distance calculations for each cluster [53]. In many cases, J48 and SVM exhibit good performance; however, they demonstrated poor performance on the LibriSpeech dataset, because slight differences in training speaker utterances and single uncharacteristic features [54] can lead J48 to poor classification performance [55], while the default settings of several key parameters may cause the SVM classifier to present low classification performance [56]. The lowest overall accuracy was observed for the NB classification algorithm. The NB classifier assumes conditional independence among features, which is probably invalid for the current dataset [57] and may result in poor performance. This conditional dependence among features becomes more pronounced as the number of features increases, thereby negatively affecting the performance of the NB classifier. The results of Experimental setting V are discussed in Section IV-E. The hierarchical classification approach outperformed the traditional one-level classification approach, possibly because it splits speaker training utterances to build two sub-classifiers [58] for effective prediction. Moreover, hierarchical classification takes less computational time than one-level classification [59].
The possible reason for the low performance of traditional one-level classification is because it considers 40 classes at a time; hence, differentiating among the utterances of 100 different speakers with different genders may be challenging for any classifier [60].

VI. CONCLUSION
In this study, effective MFCCT features were proposed for speaker identification through a hierarchical classification approach. The hierarchical classification approach was implemented in a cascading style, where the first-level classification layer identifies the speaker's gender and the second level identifies the specific speaker identity. Moreover, five machine learning algorithms and one deep learning-based DNN were used to classify speaker gender and SID. The rigorous experimental results showed that the performance of the proposed MFCCT features in terms of overall accuracy was approximately 83.5%-93%. Moreover, DNN was found to be suitable for speaker identification through the proposed MFCCT features. The experimental results show that the proposed speaker identification system is efficient, accurate, and robust in terms of the number of speakers, testing utterances, and utterance length compared with other baseline speaker identification models. The promising results show that the proposed speaker identification system can be used in many application areas, including access control and security. In the future, we intend to improve classification accuracy by reducing the classification errors between speakers with similar voice patterns using deep learning with deeper architectures. Moreover, deep learning hyper-parameter tuning can be implemented to enhance the speaker recognition model. In addition, we are currently collecting a large speaker identification corpus to further improve the proposed model.

He is currently working with King Faisal University, Saudi Arabia, as an Assistant Professor with the College of Computer Sciences and Information Technology. His current research interests include digital image processing, digital image watermarking, pattern recognition, data authentication, and cryptography.
He has published more than 40 articles in international journals and conferences in the field of image processing, data authentication, medical image watermarking, data security, and biometrics.
GHULAM MUJTABA received the master's degree in computer science from FAST National University, Karachi, Pakistan, and the Ph.D. degree from the Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia. He received the gold medal for the master's degree. He has been an Associate Professor with Sukkur IBA University, Sukkur, Pakistan, since 2006. Prior to joining Sukkur IBA University, he was with a well-known software house in Karachi for four years. He has vast experience in teaching and research and has published several articles in academic journals indexed in well-reputed databases, such as ISI and Scopus. His research interests include machine learning, online social networking, text mining, deep learning, and information retrieval. His research interests also include wireless sensor and ad hoc networks, energy harvesting, cognitive radio networks, and performance optimization. He is a member of the Mexican National Researchers System (level I). He is also serving as an Associate Editor for IEEE ACCESS.
UZAIR ISHTIAQ received the bachelor's degree in information technology from Bahauddin Zakariya University, Multan, Pakistan, and the master's degree in computer science from National University, FAST, Islamabad, Pakistan. He is currently pursuing the Ph.D. degree with the Faculty of Computer Science and Information Technology, University of Malaya, Malaysia. He has been a Lecturer with COMSATS University Islamabad, Vehari Campus, Pakistan, since 2014. His research interests include image processing, medical image analysis, and deep learning. He received a Gold Medal for the bachelor's degree.
MUHAMMAD ZAHEER AKHTAR received the bachelor's degree in computer science from Allama Iqbal Open University, Islamabad, Pakistan, and the master's degree from the University of Agriculture, Faisalabad, Pakistan. He is currently pursuing the Ph.D. degree with the Department of Computer Science, Allama Iqbal Open University. He has been a Lecturer with COMSATS University Islamabad, Vehari Campus, Pakistan, since 2014. His research interests include pattern recognition, time-series data analysis, data mining, and machine and deep learning.
IHSAN ALI received the M.S. degree in computer system engineering from the GIK Institute, in 2008. He is currently pursuing the Ph.D. degree with the Faculty of Computer Science and Information Technology, University of Malaya.
He is currently an active Research Associate with the Centre for Mobile Cloud Computing Research (C4MCCR), Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia. He has published more than 40 high-impact research journal articles, including in the highly reputable IEEE Communications Magazine. He has been actively involved in research and teaching activities for the last ten years in different countries, including Saudi Arabia, the USA, Pakistan, and Malaysia. His research interests include wireless sensor networks, robotics in WSNs, sensor cloud, fog computing, the IoT, and ML/DL in wireless sensor networks.
Mr. Ihsan has served as a Technical Program Committee Member for sev-