Mel Frequency Cepstral Coefficient and its Applications: A Review

Feature extraction and representation have a significant impact on the performance of any machine learning method. The Mel Frequency Cepstrum Coefficient (MFCC) is designed to model the features of audio signals and is widely used in various fields. This paper reviews the applications in which the MFCC is used, as well as several issues that face the MFCC computation and their impact on model performance. These issues include the use of the MFCC for non-acoustic signals, adopting the MFCC alone or combining it with other features, the use of a time-series versus a global representation of the MFCC, following the standard form of the MFCC computation versus modifying its parameters, and feeding traditional machine learning methods versus deep learning methods.


I. INTRODUCTION
The complexity of any system is a critical issue, and reducing it has been a promising area for researchers. One of the basic demands of any machine learning application is a large dataset in terms of the number of samples and variables. Nowadays, collected datasets tend to be so large that they increase the complexity of the adopted system. For this reason, many approaches have been developed to reduce dimensionality [1]. Dimensionality reduction helps in data compression, since it reduces storage space and computation time. It is also a way to eliminate redundancies from the dataset. Dimensionality reduction can be conducted using feature extraction and feature selection techniques [2].
Feature extraction is the process of extracting hidden information from the raw data signal [3]. Feature extraction makes data more manageable because it removes ineffective features without losing important or relevant information. Moreover, feature extraction techniques help to develop a system that requires less computational effort, and they speed up the learning and generalization steps of the machine learning process [4], [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino.
Nowadays, many feature extraction techniques are available in a variety of fields, depending on the characteristics of the raw data. In most fields, finding the harmonics and sidebands of a signal in both the time and frequency domains is important to any pattern recognition system. The power spectrum, computed using the Fast Fourier Transform (FFT), is used to capture the harmonics and sidebands of the time-domain signal, while a cepstrum, such as the Mel Frequency Cepstrum Coefficient (MFCC) or the Gamma Tone Cepstrum Coefficient (GTCC), is capable of extracting the harmonics and sidebands of the spectrum of the signal [6].
Although the MFCC is used to extract features from input data in various fields, it faces many challenges that have not been covered comprehensively in the literature. The purpose of this paper is to conduct a comprehensive review of the MFCC and its applications, including speech recognition, speaker recognition, emotion recognition, bearing fault detection, gear fault detection, and Electrocardiogram (ECG) and Electroencephalogram (EEG) classification. Additionally, the paper aims to answer the following questions regarding the use of the MFCC: 1. How widely is the MFCC applied to fields other than acoustic signals? 2. Is the MFCC well tuned to fit the demands of various applications? 3. Since the MFCC is computed on short-term signals, has it been used more as a time series or as a global representation? 4. Is the MFCC mostly combined with other features, or is it used alone? 5. Does the MFCC feed deep learning methods or traditional machine learning methods more often? 6. What are the most commonly used windows in the data framing process? 7. Which classifier is used most frequently with the MFCC, and what could be the reason?
The rest of this paper is organized as follows. The computation of the MFCC is explained in Section 2. Section 3 introduces MFCC applications, followed by a summarized analysis in Section 4. Finally, the conclusion is drawn in Section 5.

II. MFCC IMPLEMENTATION
The MFCC is one of the most commonly used features in a variety of applications, especially in voice signal processing such as speaker recognition, voice recognition, and gender identification [6]. The MFCC can be calculated by conducting five consecutive processes, namely signal framing, computing the power spectrum, applying a Mel filter bank to the obtained power spectra, calculating the logarithm of all filter bank outputs, and finally applying the DCT. Figure (1) illustrates the processes of MFCC computation [7].

A. PRE-EMPHASIS
Pre-emphasis is a common pre-processing practice in signal processing that is used to compensate for the high-frequency components of the signal that were suppressed during signal production. Pre-emphasis is the very first step of the MFCC computation and can be carried out by simply applying a high-pass filter with the coefficients [1, −0.97]. The filtering process alters the energy distribution across frequencies, as well as the overall energy level [8].
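
As an illustration, a filter with the coefficients [1, −0.97] corresponds to the difference equation y(n) = x(n) − 0.97·x(n−1). The following Python sketch applies it to a list of samples. The function name pre_emphasis, the parameter alpha, and the choice to pass the first sample through unchanged are assumptions made for this example and are not details given in the text.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] to a list of samples.

    Assumption: the first sample has no predecessor and is copied as-is.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


# A constant (low-frequency) input is strongly attenuated after the first
# sample, whereas a sign-alternating (high-frequency) input is amplified.
constant = [1.0] * 5
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]
print(pre_emphasis(constant))
print(pre_emphasis(alternating))
```

The two toy inputs show the high-pass behaviour: slowly varying content is suppressed while rapidly varying content is emphasised.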

B. SIGNAL FRAMING AND WINDOWING
The idea behind splitting a signal into distinct "frames" is to break the raw data signal down into segments within which the signal tends to be more stationary. For stable acoustic characteristics, speech needs to be examined over a sufficiently short period of time. For speech signals, a period of 20-30 ms is reported to be a Quasi-Stationary Segment (QSS), since the time between two glottal closures is shown to be around 20 ms. However, vowel sounds are reported to be captured in 40-80 ms [9]. Hence, short-term spectral measurements are typically carried out over 20 ms windows, and each frame overlaps the next one by 10 ms. Overlapping the frames by 10 ms enables the temporal characteristics of the speech signal to be tracked. With overlapping frames, each speech sound is approximately centered in some frame.
On each frame, a window is applied to taper the signal towards the borders of the frame. In general, the Hanning and Hamming windows [10] are among the most well-known choices. These windows can enhance harmonics, smooth the edges, and diminish the edge effect when taking a DFT of the signal. Figure (2) illustrates the rectangular, Hamming, and Hanning windows in both the time and frequency domains.
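
The framing and windowing steps can be sketched as follows, assuming 20 ms frames with a 10 ms hop as described above. The function names, the decision to drop an incomplete trailing frame, and the symmetric window definitions (0.54 − 0.46·cos for Hamming, 0.5 − 0.5·cos for Hanning) are assumptions of this example; implementations differ in how they pad the signal and define the window endpoints.

```python
import math


def frame_signal(signal, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Split a list of samples into overlapping frames.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop_len
    return frames


def hamming(n_points):
    """Symmetric Hamming window of n_points samples."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def hanning(n_points):
    """Symmetric Hanning (Hann) window of n_points samples."""
    if n_points == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


# One second of a toy signal sampled at 16 kHz (hypothetical values).
sample_rate = 16000
signal = [math.sin(2.0 * math.pi * 200.0 * n / sample_rate)
          for n in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
windowed = [apply_window(f, hamming(len(f))) for f in frames]
print(len(frames), len(frames[0]))
```

At 16 kHz, a 20 ms frame holds 320 samples and a 10 ms hop advances by 160 samples, so consecutive frames share half of their samples.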

C. POWER SPECTRUM
A power spectrum can be described as the distribution of power over the frequency components that compose the signal [12]. Traditionally, the Discrete Fourier Transform (DFT) is used to compute the power spectrum. The power spectrum of each of the obtained frames is determined by equation (1), where x(n) is the discrete signal and N is the length of the signal.
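
A minimal sketch of a DFT-based power spectrum is given below. It assumes the common periodogram form P(k) = |X(k)|² / N with X(k) = Σ x(n)·exp(−j2πkn/N); the 1/N normalisation and the restriction to the non-negative frequency bins k = 0 … N/2 are assumptions of this example, and other scalings are also in use. A direct O(N²) DFT is used for clarity, whereas a practical implementation would use an FFT.

```python
import cmath
import math


def power_spectrum(frame):
    """Return |X(k)|^2 / N for k = 0 .. N//2 using a direct DFT."""
    n_points = len(frame)
    spectrum = []
    for k in range(n_points // 2 + 1):
        acc = 0j
        for n, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * n / n_points)
        spectrum.append(abs(acc) ** 2 / n_points)
    return spectrum


# Toy frame: a cosine completing exactly 4 cycles over 32 samples,
# so its power should concentrate in bin k = 4.
n_points = 32
frame = [math.cos(2.0 * math.pi * 4 * n / n_points) for n in range(n_points)]
ps = power_spectrum(frame)
peak_bin = max(range(len(ps)), key=lambda k: ps[k])
print(peak_bin)
```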

D. MEL FILTER BANK
The Mel filter bank is a bank of band-pass filters constructed on the basis of pitch perception. The Mel filter was originally developed for speech analysis and, like the human ear's perception of speech, it aims to extract a non-linear representation of the speech signal. The conventional Mel filter bank is constructed of 40 triangular filters [13]. The transfer function (TF) of the m-th filter can be computed via equation (2), where f(m) is the centre frequency of the m-th triangular filter. The conversion from frequency to the Mel scale and vice versa is computed by equations (3) and (4).
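
The construction of a triangular Mel filter bank can be sketched as follows. The sketch assumes one widely used form of the Mel mapping, m = 2595·log10(1 + f/700) with inverse f = 700·(10^(m/2595) − 1); several variants of this mapping exist, so the constants should be read as an assumption. The function names, the default frequency range of 0 Hz to half the sampling rate, and the unit peak height of each triangle are likewise illustrative choices.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to the Mel scale (assumed 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_points(n_filters, f_min, f_max):
    """Return n_filters + 2 frequencies (Hz) equally spaced on the Mel scale.

    Points 1 .. n_filters are the centre frequencies f(m); points 0 and
    n_filters + 1 are the outer edges of the first and last triangles.
    """
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_filters + 1)
    return [mel_to_hz(mel_min + i * step) for i in range(n_filters + 2)]


def triangular_response(freq_hz, f_left, f_center, f_right):
    """Response of one triangular filter: 0 at the edges, 1 at the centre."""
    if freq_hz <= f_left or freq_hz >= f_right:
        return 0.0
    if freq_hz <= f_center:
        return (freq_hz - f_left) / (f_center - f_left)
    return (f_right - freq_hz) / (f_right - f_center)


def mel_filter_bank(n_filters, n_fft, sample_rate, f_min=0.0, f_max=None):
    """Return n_filters rows, each holding weights for bins 0 .. n_fft//2."""
    if f_max is None:
        f_max = sample_rate / 2.0
    pts = mel_points(n_filters, f_min, f_max)
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    return [[triangular_response(f, pts[m - 1], pts[m], pts[m + 1])
             for f in bin_freqs]
            for m in range(1, n_filters + 1)]


# 40 filters over a 512-point spectrum at 16 kHz (hypothetical settings).
bank = mel_filter_bank(40, 512, 16000)
print(len(bank), len(bank[0]))
```

Applying the bank to a power spectrum amounts to a weighted sum of the spectrum bins for each filter row; because the centre frequencies are equally spaced in Mel rather than in Hz, the filters are narrow at low frequencies and wide at high frequencies.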

E. DISCRETE COSINE TRANSFORMS (DCT)
A Discrete Cosine Transform (DCT) expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. The DCT was introduced by Nasir Ahmed in 1972. In the MFCC process, the DCT is applied to the log Mel filter bank outputs to select the most significant coefficients or to decorrelate the log spectral magnitudes from the filter bank [14]. The DCT is computed by equation (5), where x_n is the discrete signal and N is the length of the signal.
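
The final step can be sketched with an unnormalised DCT-II, c(k) = Σ x_n·cos(π/N·(n + 1/2)·k), applied to the logarithm of the filter bank energies. The choice of DCT type and scaling, the natural logarithm, the small floor that avoids log(0), and the default of 13 retained coefficients are assumptions of this example; the function names are hypothetical.

```python
import math


def dct_ii(values):
    """Unnormalised DCT-II of a list of real values."""
    n_points = len(values)
    return [sum(x * math.cos(math.pi / n_points * (n + 0.5) * k)
                for n, x in enumerate(values))
            for k in range(n_points)]


def mfcc_from_filter_energies(energies, n_coeffs=13, floor=1e-10):
    """Take the log of the filter bank energies, apply the DCT, and keep
    the first n_coeffs coefficients."""
    log_energies = [math.log(max(e, floor)) for e in energies]
    return dct_ii(log_energies)[:n_coeffs]


# With equal energy in all 40 filters, only the zeroth coefficient is
# non-zero, because a flat log spectrum has no cosine variation.
flat = [math.e] * 40
coeffs = mfcc_from_filter_energies(flat)
print(len(coeffs))
```

Keeping only the low-order coefficients retains the smooth envelope of the log spectrum and discards fine detail, which is one reason the number of retained coefficients recurs as a tuning parameter in the applications reviewed later.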

III. MFCC APPLICATIONS
Nowadays, many huge datasets have been collected in various fields. It is not practical to process these datasets in their entirety when developing automatic systems [15]. The Mel Frequency Cepstrum Coefficient (MFCC) is a feature extraction technique that has been adopted in many areas and is frequently reported to be useful for various applications. In this section, a set of these applications is presented (see Figure 3).

A. MFCC IN ACOUSTIC ANALYSIS

1) SPEECH ANALYSIS
Speech signal analysis aims to obtain knowledge that is more informative, compact, and relevant than the raw speech signal itself. Vocal tract features (also called segmental, spectral, or system features) [16] are among the well-known representations in speech analysis. When speaking starts, air travels out from the lungs, and the air movements within the vocal tract create a unique version of the sound. Vocal tract features are well reflected in the frequency domain of the speech signal. Various techniques have been developed to extract vocal tract features, one of which is the MFCC. The MFCC can capture the vocal tract features, since it models the human ear, whose response follows a non-linear scale rather than a linear one [17]. Among the most well-known applications that use the MFCC of the speech signal are speech emotion recognition, language and dialect recognition, and speech recognition, as shown in the following subsections.

a: AUTOMATIC SPEECH RECOGNITION
ASR models target four main categories of speech, namely isolated words, connected words, continuous speech, and spontaneous speech [18]. In isolated word speech recognition, the speaker needs to pause briefly between words, while connected word recognition aims to recognize words that are separated by a minimal pause. A continuous speech recognition system, on the other hand, does not require a pause between words, as the speech is continuous [19]. The continuous speech recognition model deals with almost natural speech, whereas spontaneous speech recognition can handle natural speech in various languages. One of the challenges shared by all these models is providing a generalized and robust feature of the speech signal to achieve high recognition accuracy. The MFCC is one of the well-known features, especially for isolated word speech recognition. For instance, the MFCC feature was extracted from speech signals of isolated spoken words and then classified via a Support Vector Machine (SVM) and a Maximum Likelihood Classifier [20]. Dhingra et al. proposed a model to recognize 10 isolated digits from their speech signals. The MFCC was extracted from the spoken words, which were identified on the basis of a similarity measure computed by Dynamic Time Warping (DTW) [21]. Omer et al. used the MFCC with two other spectral-based features to feed a pairwise SVM classifier, and the model was used in a recognition system for isolated uttered Kurdish digits (0-9) [22]. The authors of [23] used the MFCC to feed an Artificial Neural Network (ANN) classifier to identify isolated Arabic words.
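
To illustrate how a DTW-based similarity between two MFCC sequences can be computed, the following sketch implements the classic dynamic-programming recursion. It is a generic textbook form and not the specific procedure of [21]; the Euclidean frame distance, the unconstrained warping path, and the function names are assumptions of this example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best alignment between two sequences of
    feature vectors (for example, one MFCC vector per frame)."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


# Toy one-dimensional "feature" sequences: the second is a time-stretched
# version of the first, so the alignment cost is zero.
template = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(template, stretched))
```

In an isolated-word setting, an utterance would be compared against one stored template per word and assigned to the word with the smallest accumulated cost, which tolerates differences in speaking rate.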
In 2019, Nassif et al. reviewed 174 papers, published between 2006 and 2018, that developed ASR models based on deep learning, where all the deep learning models were fed with specific extracted features. They found that most of the researchers, about 69%, still used MFCCs as the feature to feed the machine learning models [24].
Despite this widespread use of the MFCC, the feature has a limitation for speech recognition systems due to the fixed window size (20-30 ms), and this limitation leads to two problems. Firstly, for a Quasi-Stationary Segment (QSS) longer than 20 ms, the obtained frequency resolution is quite low compared to what could be achieved using a longer analysis window. Secondly, the analysis window can span the transition between two QSSs, and the MFCC vectors extracted from such transition segments do not provide information about a single unique (stationary) class, which may lead to poor discrimination in a pattern recognition problem [9]. Table (1) summarizes some studies from the past decade in which the MFCC is involved in speech recognition applications.
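
The first problem can be made concrete with a short calculation. Assuming the usual DFT relation in which the spacing between frequency bins equals the reciprocal of the window duration (Δf = f_s / N = 1 / T), and ignoring the additional main-lobe widening caused by the window shape, the resolution for several window lengths can be sketched as follows; the function name is hypothetical.

```python
def frequency_resolution_hz(window_ms):
    """DFT bin spacing in Hz for an analysis window of window_ms
    milliseconds, using delta_f = 1 / T."""
    return 1000.0 / window_ms


for window_ms in (20, 30, 40, 80):
    print(window_ms, "ms ->", frequency_resolution_hz(window_ms), "Hz")
```

Under this assumption, a 20 ms window resolves frequencies only to within 50 Hz, whereas an 80 ms window covering a long vowel would resolve them to within 12.5 Hz.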

b: SPEECH EMOTION RECOGNITION
Speech Emotion Recognition (SER) is the process of identifying human emotion from the speech signal. Based on the existing literature, it is well understood that MFCC features can reach a high recognition rate for SER compared to other cepstrum features, because the MFCC is a short-term spectral-based feature that extracts a rich amount of information from speech signals [38]. Consequently, the MFCC is frequently adopted for SER by researchers. For instance, Aouani and Ayed examined two sets of features, 39 MFCCs and 65 MFCCs, and used an SVM classifier to evaluate both sets. The findings of that paper show that the first set, 39 MFCCs, can outperform the 65 MFCCs [39]. On the other hand, the information related to emotion is reported to be sparse and distributed over various features [40]. Consequently, researchers have integrated MFCCs with other sorts of features. For example, MFCC and Gamma-Tone Cepstral Coefficient (GTCC) features were fused for SER and fed either to a GMM classifier [41] or to an ESN model [42]. An improvement in emotion recognition has also been observed by hybridizing several features, including formants, pitch, zero crossing, and the MFCC and its statistical parameters [43]. Aouani et al. studied an SVM-based model for emotion recognition in which the SVM was fed with a 42-dimensional vector consisting of 39 MFCC coefficients, a Teager Energy Operator, a Zero Crossing Rate, and a Harmonic to Noise Rate. The 42-dimensional feature vector was reduced by an Auto-Encoder, and the results confirmed that the reduced-dimension representation effectively improved the SER accuracy [44]. However, in another study, six basic emotional states were classified by a GMM that was fed with two sets of features, namely Sub-band based Cepstral Parameter features and MFCC features. Here, the Sub-band based Cepstral Parameter features are reported to outperform the MFCC features [45]. Table (2) provides more details about other research that used the MFCC for human emotion and music emotion recognition.

c: LANGUAGE AND DIALECT RECOGNITION
A language recognition system aims to recognize the language of the speaker and can be developed on the basis of a machine learning model. Dialect recognition is similar to language recognition. However, dialect recognition focuses on categorizing dialects or accents rather than languages and is consequently more challenging, due to the linguistic similarities among dialects. Dialect recognition is an important application, and it has been considered one of the very first steps towards speech recognition. This may be the reason for the large number of articles that have been published. Various features for both language and dialect recognition are available in the literature, such as the MFCC (see Table (3)), singular value decomposition, and Linear Predictive Codes (LPC).
Warohma et al. developed a model for Indonesian dialect recognition based on a multilayer perceptron, which was fed with the MFCC obtained from the speech signal [49]. Tawaqal and Suyanto claimed that traditional machine learning approaches give low accuracy for dialect recognition, and they proposed a model to increase the performance of an Indonesian dialect recognition system based on a deep recurrent neural network (DRNN), where the DRNN was trained with the MFCC feature [50]. Mansour et al. investigated the use of the MFCC and LPC for identifying three languages, namely Arabic, English, and French. Both features were evaluated with an Artificial Neural Network (ANN), and the MFCC showed better performance than the LPC in identifying these languages [51].
In another study, the MFCC feature was extracted and fed to a GMM to identify dialects of Himachal Pradesh, and the results show high performance of the MFCC in identifying some of the dialects [52]. However, Al-Talabani et al. [53] evaluated three features, namely the MFCC, the Local Binary Pattern (LBP), and the LPC, both separately and in merged form. The adopted features were used to feed an SVM model to identify Kurdish dialects and to distinguish between several languages, including Kurdish, Arabic, Turkish, and Persian. The results show that the LPC performs better than the MFCC and that the fused LBP-LPC feature outperformed the remaining feature forms.
Based on the existing literature, MFCC features do not always outperform other well-known features such as the LPC. This may be due to the properties of the languages and dialects, in addition to the nature of the collected data. Nevertheless, the studies in the literature show that the MFCC is one of the informative features for language and dialect recognition.

2) BIOMETRIC APPLICATION
A biometric application is an application that either confirms or determines the identity of an individual from a certain characteristic. The aim of these applications is to ensure that the provided services are accessed only by a legitimate user. Biometric applications can be classified into two categories based on their task, namely verification and identification. A verification application validates a person's identity by comparing the captured biometric information with that person's own biometric template stored in the database, so the relationship between the subject and the database is called one-to-one. An identification application, in contrast, is called one-to-many, as the application tries to identify an individual by searching the templates of all the users in the database [22]. Based on the existing literature, the MFCC has been used in two biometric applications, namely speaker recognition and gender recognition over phone calls.
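
The difference between the one-to-one and one-to-many settings can be sketched with a toy template database. Everything in this example is hypothetical: the user names, the two-dimensional feature vectors standing in for averaged MFCC features, the Euclidean distance, and the acceptance threshold. Practical systems use statistical or neural models rather than a single stored vector.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(probe, claimed_template, threshold):
    """One-to-one: accept the claimed identity if the probe is close
    enough to that user's own template."""
    return euclidean(probe, claimed_template) <= threshold


def identify(probe, templates):
    """One-to-many: search all templates and return the closest user."""
    return min(templates, key=lambda user: euclidean(probe, templates[user]))


# Hypothetical enrolled templates (for example, averaged MFCC vectors).
templates = {"alice": [0.0, 0.0], "bob": [3.0, 4.0]}
probe = [2.5, 4.0]

print(verify(probe, templates["bob"], threshold=1.0))
print(identify(probe, templates))
```

Verification compares the probe with a single template and therefore needs a decision threshold, whereas identification compares it with every enrolled template and returns the best match.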

a: SPEAKER RECOGNITION
Speaker recognition is the process of automatically recognizing speakers from their speech signals, and it can be classified into six categories, namely speaker identification, speaker verification, speaker detection, speaker segmentation, speaker clustering, and speaker diarization [55]. Speech signals contain speaker-specific information, which can be extracted and fed to a machine learning algorithm to learn a specific pattern in it. The MFCC and LPC are considered the two most well-known features and have been widely used by researchers for speaker recognition applications, due to their capacity to capture the repetitive nature and efficiency of speech signals [55], [56]. Many studies have claimed that the MFCC is an effective feature for correctly recognizing speakers. For instance, three features for automatic speech recognition, namely the MFCC, dynamic time warping, and the fast Fourier transform, were evaluated with a fuzzy approach. According to the results, the MFCC improves the performance of the fuzzy model compared to the fast Fourier transform features [57]. Al-Ali et al. enhanced forensic speaker verification using fused features, namely the MFCC and the Discrete Wavelet Transform (DWT), and evaluated their models in a noisy environment [58]. Abdul investigated the MFCC feature for speaker identification. He used the MFCC features to feed a CNN, and the results show that the MFCC features can be used to train a CNN model to distinguish between speakers [7]. Despite the capacity of the MFCC to capture the characteristics of a speaker, its performance degrades on complex speech datasets and in noisy environments. For instance, [59] shows that speaker recognition using the MFCC and k-NN degrades significantly in a noisy environment and concludes that denoising the input signal improves the result more when the highest MFCCs are adopted. To overcome this problem, a multi-channel training framework within a deep speaker embedding network was proposed in [60] for speaker recognition in reverberant and noisy environments. The method receives time, frequency, and spatial information from the multi-channel input to improve the robustness of the speaker embedding process. The work concludes that, with a small increase in model parameters, the method can significantly outperform an i-vector system with MFCC features and front-end signal enhancement. Additionally, Jahangir et al. proposed fused features based on the MFCC and time-based features. The fused features were fed to a DNN to identify speakers, and the results show that the limitation of the MFCC feature can be overcome by this approach [56]. Table (4) summarizes a further thirty papers that used the MFCC for speaker recognition.

b: GENDER RECOGNITION OVER THE PHONE CALL
Gender recognition is the process of identifying a speaker's gender via the analysis and comparison of patterns. Gender recognition is one of the effective biometric techniques and is an application of wide interest to forensic teams. The most common feature for gender recognition is the pitch of the speech. However, the literature also reports the use of MFCC features for gender recognition. For example, MFCC features were merged with the speaker's mean pitch for gender recognition; both features were extracted from the acoustic signal and fed to a Gaussian Mixture Model [89]. In another study, MFCC features were extracted from telephone speech and evaluated with several machine learning algorithms, such as random forest, k-nearest neighbor, multilayer perceptron, naïve Bayes, and support vector machine [90]. Kang and Chang optimized the MFCC feature based on the minimum classification error "SVM" and called the result the weighted MFCC, since a weight vector over the coefficients was optimized. The weighted MFCC feature outperformed the conventional MFCC [91]. More details are shown in Table (5).

3) DIGITAL FORENSIC
Digital forensics can be defined as an application or a process that helps the forensic team to identify, analyze, and extract important information about a specific crime or forgery from digital data. Usually, it is considered robust evidence in a court of law. Digital forensic applications vary depending on the information obtained, such as network information and personal information about criminals.

a: SENTIMENT SPEECH ANALYSIS
Sentiment speech analysis is the analysis of the speech signal to extract subjective information from the source material in order to identify the mood of the speaker or customer, which helps a business understand the social sentiment about its products and services while monitoring online chats. Generally, sentiment speech analysis classifies acoustic data into three categories, namely positive, negative, and neutral [94]. The MFCC feature has been used extensively for sentiment speech analysis; for instance, Amiriparian et al. [96]. The authors of [97] studied sentiment analysis on speaker-specific speech by extracting the MFCC feature from the speech signal and feeding it to Dynamic Time Warping (DTW).

b: FAKE SPEECH DETECTION
Although many recent technologies have been developed for voice recognition and speaker verification, they still face an open issue, namely fake speech, which needs to be detected. The main aim of fake speech detection is to differentiate between fake speech and natural speech [98]. Many features have been used for detecting fake speech, such as the MFCC and LPC. Sanchez et al. [99] proposed a model based on a statistical classifier for synthetic speech, in which the MFCC was used as an authorized baseline. Lui et al. in [100]

B. MEDICAL APPLICATIONS
Nowadays, machine learning approaches are involved in many healthcare applications, where they help to find critical clues in huge amounts of patient data that are beyond the scope of human capability or difficult to identify in a short time [112]. Both traditional machine learning and deep learning approaches have been adopted by researchers to convert medical signals such as ECG and EEG into clinical insight that is meaningful for doctors. One of the crucial features is the MFCC, which was originally developed to extract vocal tract features but has shown the ability to extract useful features from medical signals to obtain clinical insight, as described below.

1) EEG ANALYSIS
An electroencephalogram (EEG) is a recording of the electrical activity generated by the brain. EEG plays a crucial role in identifying or recognizing brain activities and their meaning. A large body of research focuses on brain activities, and there are still many challenges in this field. One of the vital challenges is obtaining a proper feature from the brain signal.
Based on the literature, some researchers have proposed MFCC features for monitoring brain activity and determining the emotion of a patient or client. For example, Othman et al. extracted MFCC features from the EEG signal to identify and recognize emotions. The MFCC feature was then fed to a multi-layer perceptron to classify the emotions (happy, fear, sad, and calm) from the EEG signal, and the results show that the accuracy can reach up to 90% [113]. The performance of an EEG classification method was improved by combining 13 MFCCs with 19 articulatory features, which were then classified by a gated recurrent unit [114]. Rajesh extracted 15, 25, and 35 MFCC features from the EEG signal and fed them separately to an ANN model to classify normal and abnormal activity. According to his results, 25 MFCC components are enough to provide the maximum accuracy compared to 15 and 35 MFCCs [115]. As shown in Table (7), the MFCC is extracted from both EEG and acoustic sources to diagnose brain activities.

2) ECG ANALYSIS
An electrocardiogram (ECG) signal is a recording of the electrical activity generated by the heart. Doctors and health staff usually use the ECG to check the status of the heart. Automatic ECG classification is still an experimental area, and there are many ongoing efforts and challenges in identifying and classifying the various waveforms in the ECG signal [121]. Recently, MFCC features have been widely used for this purpose. For example, Yusuf and Hidayat evaluated two well-known features, 13 MFCCs and the Discrete Wavelet Transform (DWT), when fed to a kNN classifier. The results show that the 13 MFCCs outperformed the DWT [122]. Boussaa et al. claimed that the MFCC is a robust feature for classifying ECG signals into normal and abnormal when it is fed into a multilayer perceptron [123]. The authors of [124] extracted two sets of features, MFCCs and Motifs, from the ECG signal and used each set individually to train an SVM. In their experiments, the best results were achieved when the SVM was trained with the MFCC. Table (8) summarizes some other research on this application.

3) DISEASE DETECTION APPLICATION
Nowadays, the adoption of innovative technologies is increasing rapidly in the field of medicine, in order to detect signs of illness at an earlier stage or to monitor and improve the status of patients. Behind any modern technology in medical applications there are several signal processing methods, as they all work on observation data. Recently, the MFCC has appeared in the detection of several diseases, such as Parkinson's disease, voice disorder, and vocal fold disorder, as well as in pathology detection and cough detection (see Table (9)), which are described briefly below.

a: PARKINSON DISEASE
Parkinson's disease is a brain disease that was described by James Parkinson in 1817. The brain cells that produce and store dopamine are subject to gradual loss, and the disease is also classed as a neurological disorder. Parkinson's disease causes difficulties for the patient such as shaking, difficulty walking, stiffness, and loss of balance [129].
Many intelligent systems have been developed for distinguishing healthy people from patients by analyzing the voice, as the disease affects the voice at an early stage and the voice becomes impaired [130]. Since the disease is related to the voice, MFCC features have been used extensively for this purpose. For example, coefficients 1-20 of the MFCC were extracted from recordings of the vowel /a/ and fed to a linear-kernel SVM to detect Parkinson's disease (PD). The experimental results show that the best classification accuracy, 91.17%, was achieved with the first 12 MFCC coefficients [131]. In another study, the MFCC was modified by changing the bandwidth of the Mel-scaled filter bank to extract the region of interest [300 Hz, 1700 Hz], which is more related to PD, and a radial basis function (RBF) network was then trained to distinguish healthy people from patients [132]. The authors of [133] used linear discriminant analysis to reduce the dimension of the MFCC coefficients before feeding them to an SVM, and the results show that 10, 12, and 18 MFCCs are more related to the signs of PD.

b: VOICE PATHOLOGY DISORDERS
A voice pathology disorder is a problem of the vocal tract that appears in the pitch, volume, tone, and other qualities of a voice. The problem occurs whenever the vocal cords do not vibrate normally. Computer-aided tools have played a significant role in the diagnosis and identification of many diseases, including voice pathology disorders [134]. The MFCC has been adopted as a feature to train computer-aided tools for distinguishing normal voices from voice pathology disorders. For instance, a model based on the MFCC feature was developed for this purpose, in which the MFCC was extracted from speech that had undergone the distortion of an analog communications channel. The MFCC still performed reasonably, although the performance of the model was degraded by the added distortion [135]. Ali et al. used the MFCC with a Gaussian mixture model (GMM) to develop an automatic detection system that could differentiate between normal and pathological voices [136]. In another study, the traditional MFCC was enhanced into a variant named the adaptive weighted Thomson multi-taper MFCC. The enhanced MFCC was given to a GMM as input for differentiating normal and disordered speech, and it outperformed the traditional MFCC [137]. In 2018, 19 MFCCs were extracted from the speech signal and an ANN was trained to discriminate between healthy and pathological voices, achieving a classification rate of up to 99.96% [138].

c: COVID 19 DETECTION VIA COUGH
Cough is a symptom of over thirty medical conditions, including COVID-19. For more than three decades, the detection of medical conditions based on coughing has been of interest to researchers working in this field [139], [140]. The performance of the MFCC feature in classifying cough and ordinary speech was reported to be better than that of some other features, such as the Short-Time Fourier Transform (STFT) and Mel-scale filter banks (MFB) [141]. Recently, MFCCs have been used as a feature for detecting whether or not a patient has COVID-19. For example, MFCC features were obtained from cough sounds and fed to a kNN classifier to detect COVID-19 infected patients [142]. Bansal et al. trained a CNN model with MFCCs obtained from cough sounds to classify them into two classes, Covid and Non-Covid [143].

C. INDUSTRY ANALYSIS
The reliability of machines and tools is gaining importance in industry because of the need to reduce the possible loss of production whenever a machine experiences an abnormal situation under working load. There are plenty of market assessment tools that analysts use to understand the competitive dynamics of an industry, and these are used in industry [157]. A health monitoring system is one of the well-known assessment tools that has been used to monitor a dynamic system and predict a failure at an early stage [158]. In condition monitoring systems, whether they deal with vibration [159] or acoustic signals [160], the MFCC is known as an effective feature for monitoring tools or equipment and has recently been used for purposes such as gear health monitoring, bearing health monitoring, turbine health monitoring, and pump health monitoring. Table (10) provides more details on other industrial applications.

1) GEAR HEALTH MONITORING
A gear is a machine element with teeth cut at equal spacing around a conical or cylindrical surface. It transmits force from one shaft, named the driving shaft, to another shaft (the driven shaft). Monitoring gears in rotating machinery has been a focus in industry, as a gear failure brings down the entire system in which the gears are used [161]. Various faults are related to gears, and many machine learning-based models have been developed to recognize these faults [162]. The MFCC has become one of the effective features in gear fault detection. For instance, Benkedjouh et al. extracted the MFCC feature, fed it to an SVM, and claimed that the first three MFCC components contain most of the defect information of gears [163]. However, according to Abdul et al., MFCCs 1 to 13 are more effective for training an LSTM [164]. Jin et al. evaluated several sets of MFCCs (16, 21, 26, 31, 36 and 41 MFCCs) by feeding them individually into a CNN model, and the results show that 41 MFCCs outperformed the others for gear fault detection. In conclusion, the dimension of the MFCC features is important for increasing the classification rate [109].

2) BEARING HEALTH MONITORING
Bearings are the most widely used components in industrial rotating machinery. A bearing consists of rolling elements, an inner race, an outer race, and a separator, and it is used in rotating or linear shaft applications. Many types of bearings are available in industry, namely ball bearings (such as deep-groove ball bearings and angular contact ball bearings) and roller bearings (such as cylindrical roller bearings and tapered roller bearings). Earlier, researchers used statistical features such as mean, variance, standard deviation, root mean square, skewness and kurtosis to predict how healthy a bearing is. However, these features are not sufficient to monitor all faults that might occur in bearings [165]. Many distinctive features, such as the MFCC and gammatone spectrum coefficients, have therefore been extracted to overcome this problem. Nelwamondo and Marwala investigated a combination of the MFCC and kurtosis as input to a Gaussian mixture model. Based on their results, adding kurtosis can improve the performance of the MFCC features by 5% [166]. Nelwamondo et al. used the first thirteen MFCCs as temporal features and trained hidden Markov models (HMM) to classify bearing faults. They observed that the first 13 MFCCs are optimal for bearing faults and that going beyond 13 does not increase the fault detection rate [165]. In another study, 12 MFCC coefficients were extracted from the vibration signal and then combined with both wavelet packet decomposition energy features and zero-crossing rate features. The combined feature was fed to a support vector machine to classify gear faults [167]. Mingsi et al., however, studied two limitations of the MFCC in rolling bearing fault diagnosis. First, the Mel scale was originally developed for speech signals based on human hearing and is not suited to the frequency distribution of rolling bearing vibration signals. Second, the denoising ability of the MFCC is limited. To address these limitations, the adaptive frequency cepstrum coefficient (AFCC) method was proposed and fed to the XGBoost algorithm for classifying bearing faults [168]. Table (10) displays some studies in which the MFCC has been used as a feature.
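To illustrate why the Mel scale may not match the frequency distribution of a vibration signal, the following minimal sketch places filter center frequencies uniformly on the Mel axis and shows how they spread out in Hz. It assumes one widely used Hz-to-Mel formula, m = 2595 log10(1 + f/700); the reviewed papers may use other variants. The function names and the 0 to 8000 Hz band are illustrative choices, not taken from any cited work.

```python
import math


def hz_to_mel(f_hz):
    # One common Hz-to-Mel mapping (assumed here; other variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    # Center frequencies (in Hz) of filters spaced uniformly on the Mel axis.
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]


# Hypothetical band of 0 to 8000 Hz with 10 filters.
centers = mel_center_frequencies(n_filters=10, f_min=0.0, f_max=8000.0)
# Spacing between neighbouring centers, in Hz.
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

In this sketch the gaps grow with frequency, so the filter bank resolves low frequencies finely and high frequencies coarsely. That allocation follows human hearing, and it may be a poor fit when the fault-related energy of a bearing signal sits in higher bands.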

3) TURBINE HEALTH MONITORING
A turbine is a device that harnesses the kinetic energy of a fluid such as water, air, or steam and turns it into rotational motion of the device itself [172]. Turbines play a vital role in power conversion, and many tools have been developed for monitoring them [173]. The essential requirement for any monitoring system based on machine learning is a dataset, as decisions are derived from data analysis. In the existing literature, the MFCC has been used to obtain fault-related features in turbines. For example, the MFCC and Code Excited Linear Prediction were used to extract feature sets from the acoustic signal, and the features of each technique were then compared across the fault and non-fault states [174]. Wang et al. used the entropy weight method to optimize MFCC features extracted from wind turbine sound signals for fault diagnosis of wind turbine blades. The optimized features were later clustered using K-means clustering [175]. Kilic evaluated three sets of features (MFCC, spectrogram, and Mel spectrogram) by feeding them into three machine learning models, namely LSTM, RNN and CNN. Based on Kilic's results, the MFCC feature with a CNN performed better than the other adopted models [176].

4) PUMP HEALTH MONITORING
A pump is a device that moves fluids from one place to another by mechanical action, and it is particularly important to monitor it well, as any pump failure will affect productivity and safety. Many types of pumps are available in the industrial field, and they are classified by working principle as dynamic or positive displacement pumps. The MFCC has been used as a feature to determine the status of a pump. For instance, for fault isolation of solenoid pumps, Akpudo and Hur employed locally linear embedding (LLE) to reduce the dimension of MFCC features obtained from the vibration signal, and the reduced MFCC features were fed to an SVM with a Gaussian kernel [177]. In 2021, Akpudo and Hur proposed another model for fault detection and isolation in electromagnetic pumps. They used the same feature, 12 MFCCs (the 2nd to 13th MFCCs), together with its first-order and second-order differentials (12 Delta and 12 Delta-Delta features), which are derived from the MFCC feature. A feature selection technique known as rank-based recursive feature elimination was then used to select the most effective features among the 36 dimensions, which were later given to the SVM classifier [169]. Table (11) shows some other studies in which the MFCC has been used in the industrial field.
Table (12) shows statistics obtained from the survey about the use of the MFCC in various methods for different applications. Since the MFCC was mostly derived for audio-based applications, one can clearly observe that it has been applied frequently to acoustic signals. Nevertheless, this cepstral feature has recently started to be used for non-acoustic signals in areas such as medical and industrial applications. The MFCC is reported to be a powerful feature; hence it has mostly been adopted alone and much less often combined with other features. When the MFCC is combined with other features, the reason is mostly to overcome its sensitivity to noisy environments [109].
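The 36-dimensional feature set described above for the electromagnetic pump study (12 static MFCCs plus 12 Delta and 12 Delta-Delta features) can be sketched as follows. This is only an illustration under stated assumptions: the delta is approximated by a simple central difference with repeated edge frames, whereas toolkits often use a regression over a wider window, and the exact formula used in [169] is not given here. The function names and the toy input are hypothetical.

```python
def delta(frames):
    # Central difference along the frame axis; edge frames are repeated,
    # so the first and last frames use a one-sided difference halved.
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for p, n in zip(prev_frame, next_frame)])
    return out


def stack_static_delta(mfcc_frames, first=1, last=13):
    # Keep the 2nd to 13th coefficients (0-based slice 1:13), then append
    # first-order and second-order differences for each frame.
    static = [frame[first:last] for frame in mfcc_frames]
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 5 frames x 13 coefficients with placeholder values.
toy = [[float(t * (k + 1)) for k in range(13)] for t in range(5)]
features = stack_static_delta(toy)  # 5 frames x 36 dimensions
```

A feature selection step such as recursive feature elimination would then operate on these 36 columns; that step is omitted here.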

IV. SUMMARY
Most of the works in the literature followed the standard approach to using the MFCC. For example, most of the works adopt 13 or fewer coefficients, and a third of the papers tried to enhance the MFCC computation parameters. However, the traditional MFCC computation was designed for audio signal modeling and is not necessarily useful for non-audio signals such as ECG, EEG, and vibration. Even though the input signals of the applications covered in this paper are time series in nature, most researchers represented the MFCC globally by computing statistics of the coefficients across the signal frames. Despite the promising results of the MFCC in the industrial field, Table (12) shows that only around 13% of the papers using the MFCC apply it to industrial applications. This encourages more investigation of the MFCC in this field.
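As a minimal sketch of the global representation mentioned above, the per-frame MFCC matrix can be collapsed into a fixed-length vector by taking statistics of each coefficient across frames. The choice of mean and population standard deviation is an assumption for illustration; the reviewed works use various statistics. The function name and toy matrices are hypothetical.

```python
import statistics


def global_representation(mfcc_frames):
    # mfcc_frames: list of frames, each a list of coefficients.
    # Returns per-coefficient means followed by per-coefficient
    # standard deviations, so the length is independent of frame count.
    per_coefficient = list(zip(*mfcc_frames))
    means = [statistics.mean(c) for c in per_coefficient]
    stds = [statistics.pstdev(c) for c in per_coefficient]
    return means + stds


# Two toy signals with different numbers of frames (placeholder values).
short_signal = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
long_signal = [[1.0, 0.0, 3.0], [2.0, 0.0, 3.0], [3.0, 0.0, 3.0], [2.0, 0.0, 3.0]]

short_vector = global_representation(short_signal)
long_vector = global_representation(long_signal)
```

Both vectors have the same length even though the signals differ in duration, which is what lets classifiers with fixed-size inputs consume them. The cost is that the order of the frames is discarded.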
Handcrafted features, including the MFCC, are mostly fed to classical machine learning methods and rarely to deep learning classifiers. The reason could be that some deep learning methods are able to extract deep features directly from the raw data.
If we take the reviewed papers as a sample of what has been conducted using the MFCC, we can observe the following:
• Most of the acoustic-based applications, such as speech recognition, speaker recognition, gender recognition from speech and language recognition, use the MFCC alone more often than in combination with other features, as shown in figure (4). However, works on emotion recognition mostly combine the MFCC with other features. This could indicate that emotional information is sparse in the speech signal and may not be captured properly by the MFCC alone.
• Among the same acoustic applications, most works use the standard form of the MFCC, while around half of the speaker recognition articles modify the MFCC parameters to improve them (see figure (4)). This may be due to the diversity of speakers' vocal tracts, which calls for modifying the MFCC computation process.
• Although the MFCC is computed as a short-time feature, most works use the global feature representation rather than the time series one, as illustrated in figure (5). The reason could be the issues that arise with the time series representation, such as the varying length of the input signals. Additionally, the choice depends on the application itself, which may benefit from the time series representation more than from other representations.
• Works on non-acoustic applications mostly use the standard form of the MFCC computation. Since the standard parameters of the MFCC were derived for audio applications, optimizing the MFCC parameters may have a significant impact on the recognition results (see figure (6)).
• Most of the publications use the Hamming window in the signal framing step. Very few use the Hanning window, and none adopts the rectangular window; however, some do not mention the window they adopted (see figure (7)).
• Based on a review of 186 articles, the MFCC has been used with various classifiers, including classifiers suited to the global representation of features (such as SVM, kNN, ANN and MLC) and classifiers used with the time series representation (such as DTW, HMM and LSTM). The review shows that the most used classifier with the MFCC feature depends on the application. Table (13) presents the ratio of the most used classifier for each application covered in this review. It is clearly seen that SVM and ANN are widely adopted with the MFCC for their capability to find optimum separators among classes. SVM is a high-performance classifier with representative features and is not affected by the curse of dimensionality. On the other hand, ANN has the capability to learn the linear and non-linear characteristics of various classes from their data samples. Despite their importance, deep learning classifiers such as CNN have not been widely utilized with the MFCC, because a CNN is capable of extracting learned features directly from the raw data. The nature of the application also plays a role in the choice of classifier. For example, ASR is a time series-based application; therefore, classifiers that detect the characteristics of time-varying samples, such as DTW, HMM and RNNs, are adopted more often for such an application. One can conclude from these results that nothing special is observed about the relation between the MFCC and the classifiers, since the most adopted classifiers are those that are widely used with other applications as well.
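The framing and windowing step referred to in the observations above can be sketched as follows. The sketch uses the standard symmetric definitions of the Hamming window, 0.54 - 0.46 cos(2πn/(N-1)), and the Hann (often written Hanning) window, 0.5 - 0.5 cos(2πn/(N-1)). The frame length, hop size, and toy signal are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def hamming(n_samples):
    # Symmetric Hamming window; assumes n_samples >= 2.
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def hann(n_samples):
    # Symmetric Hann (Hanning) window; assumes n_samples >= 2.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def frame_signal(signal, frame_len, hop):
    # Split the signal into overlapping frames; a trailing partial frame is dropped.
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def apply_window(frame, window):
    # Element-wise product of a frame with a window of the same length.
    return [x * w for x, w in zip(frame, window)]


# Toy signal of 10 samples, frame length 4, hop 2 (placeholder values).
signal = [float(i) for i in range(10)]
frames = frame_signal(signal, frame_len=4, hop=2)
window = hamming(4)
windowed = [apply_window(frame, window) for frame in frames]
```

The Hamming window tapers to 0.08 at the frame edges while the Hann window tapers to zero. A rectangular window would leave each frame unchanged, which increases spectral leakage between frequency bins.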

V. CONCLUSION
This paper reviews the use of the MFCC in various applications and presents some issues regarding its computation. One can conclude from the review, which covers 186 papers, that the MFCC is a widely used feature for acoustic applications and a promising one for other applications involving EEG, ECG and industrial signals. However, there is no comprehensive investigation of the MFCC for non-acoustic applications. Another finding is that most of the papers have not adapted the MFCC computation for non-acoustic applications, although there are many indications that such modification may improve the performance of non-acoustic models. Even though the MFCC is computed over short-time signal frames, most of the works have adopted the global representation of the MFCC, usually obtained by computing statistics of the features across the frames. The duality of time series versus non-time series representations is worth much more investigation to show their capability in modelling various applications.