Mel-Weighted Single Frequency Filtering Spectrogram for Dialect Identification

In this study, we propose Mel-weighted single frequency filtering (SFF) spectrograms for dialect identification. The spectrum derived using SFF has high spectral resolution for harmonics and resonances while simultaneously maintaining good time-resolution of some speech excitation features such as impulse-like events. The SFF spectrum can represent speech characteristics such as burst onsets and glottal closure instants better than the short-time Fourier transform (STFT) spectrum. Our hypothesis is that these intricate representations in the SFF spectrum should help in distinguishing dialects. Therefore, we built a dialect identification system which uses an unsupervised, bottleneck feature representation of the Mel-weighted SFF spectrogram (Mel-SFF spectrogram) with sequence-to-sequence deep autoencoders. The language invariance of the proposed system was evaluated using two datasets: the UT-Podcast database (English) and the STYRIALECT database (German). The proposed representations gave a relative improvement of 9.47% and 4.69% in unweighted average recall (UAR) compared to the best baseline method on the development and test datasets, respectively, of the UT-Podcast database. The proposed representations also gave comparable performance to the best baseline method for the STYRIALECT database. In addition, the fusion of the autoencoder bottleneck features computed from the Mel-SFF and Mel-STFT spectrograms improved the overall performance, indicating complementary information between these features. By further analyzing the performance of the proposed representation with different utterance lengths using the UT-Podcast database, we observed that the proposed representation performed better on short utterances. The improved performance given by the Mel-weighted SFF spectrogram for recognizing dialects in both databases supports our hypothesis.


I. INTRODUCTION
In listening to speech, humans not only analyse the speech signal's linguistic content but also draw conclusions about the speaker's regional origin, social background and emotional state. Dialect identification refers to a research area where the goal is to find the regional origin of the speaker using the temporal and spectral characteristics of his or her speech signal. Each dialect group has its own pronunciation patterns and vocabulary. These variations in speech due to dialect have been shown
to decrease the performance of automatic speech recognition (ASR) systems. An efficient dialect identification system followed by a dialect-specific pronunciation dictionary and a dialect-specific language model can improve the performance of ASR [1]–[3]. In addition, dialect information can be used in speaker profiling in biometric applications, it can help solve dialect-related issues in speaker and language identification, and it can also be used in the development of dialect-personalized voice assistants.
FIGURE 1. Block diagram describing the steps involved in the single frequency filtering (SFF) method [33].

Dialect identification studies in the literature can be classified into two groups: text-dependent and text-independent [4]. In the former, the transcription of an utterance for which the dialect needs to be identified should be known a priori.
The dialect is determined by finding the closest dialect match for a word/utterance by either phone modeling or word modeling. Phone/word sequences are modeled using n-gram models. For the dialect identification task, different phone modeling approaches are widely used and they include phone recognition followed by language modeling (PRLM) and parallel phone recognition followed by language modeling (PPRLM) [5]–[8].
Dialect identification studies belonging to the second class, text-independent dialect identification, model dialectal variations using acoustic features derived from speech signals [9]–[13]. The acoustic features used in this area include shifted delta cepstral coefficients (SDCs) [14], prosody based features, frame-by-frame phone posteriors, supervectors [15]–[17], and i-vectors obtained from acoustic features and speech attributes [18]–[21]. In one of the recent studies in language identification, bottleneck features (BNFs) derived from a pre-trained deep neural network with i-vector modeling showed significant improvement over the SDC features, and the developed BNF-based system stands out as state of the art [22], [23]. A shortcoming of this approach is that the deep neural network had to be trained over a transcribed corpus which contains only English speech without phonetic variations in pronunciation [24].
This article studies text-independent dialect identification without taking advantage of any pre-trained models or transcriptions. The proposed system takes advantage of an autoencoder which is trained using the Mel-weighted single frequency filtering (Mel-SFF) spectrogram to obtain BNFs which are used in the classification. A baseline system with a similar architecture is trained using the Mel-weighted short-time Fourier transform (Mel-STFT) spectrogram. The autoencoder architecture used is similar to the one developed in [25], [26]. This autoencoder model converts a variable-length feature sequence to a fixed-length representation in an unsupervised manner. This architecture was chosen in the current study because it was shown in [25] to be the best performing system in dialect classification compared to two reference techniques. The spectrum computed by single frequency filtering (SFF) has been shown to give good spectral resolution to indicate harmonics and resonances [27] and good temporal resolution to model speech excitation features such as impulse-like events [28]. The SFF spectrum has also shown promising performance in determining burst-onset points related to voice-onset time (VOT) and glottal closure instants compared to the short-time Fourier transform (STFT) spectrum [28]–[30]. Previous studies in dialect identification have shown the significance of VOT for accent identification [31]. Inspired by this, we propose to use the Mel-weighted SFF spectrogram with autoencoders to derive fixed-length speech representations for dialect identification.
The organization of the paper is as follows: Section II describes the SFF method and the computation of the Mel-weighted SFF spectrogram. Section III provides a detailed description of the proposed dialect identification system. The experimental setup is described in Section IV. Results are presented in Section V. Finally, Section VI summarizes the study.

II. SFF AND COMPUTATION OF THE MEL-WEIGHTED SFF SPECTROGRAM
This section describes the steps involved in the SFF method and in the computation of the Mel-weighted SFF spectrogram.

A. SFF
The SFF method is used to derive the amplitude envelope of the speech signal at every sample for a given frequency [32]. The SFF spectrum has been shown to be useful in finding burst-onset points [29] and glottal closure instants [30], and it has been demonstrated to exhibit high spectral resolution for important speech features such as harmonics and resonances [27].
In SFF, the pre-emphasized speech signal is used for deriving the amplitude envelope at each frequency by frequency-shifting the signal and by filtering it using a single-pole filter, as shown in Figure 1. The pole of the filter is located on the negative real axis close to the unit circle in the z-plane, i.e., the angle of the pole corresponds to the Nyquist frequency ($f_s/2$). Therefore, the effect of the other frequency components is reduced, giving high spectral resolution. The steps to derive the SFF spectrum are given below [32], followed by a minimal code sketch.
• The speech signal $s[n]$ is pre-emphasized to remove low-frequency variations:
$$x[n] = s[n] - \alpha\, s[n-1], \quad (1)$$
where $\alpha$ is set to 0.95 in the present study.
• The pre-emphasized speech signal $x[n]$ is multiplied by the complex sinusoid $e^{j\omega_k n}$:
$$x_k[n] = x[n]\, e^{j\omega_k n}, \quad (2)$$
where $\omega_k = \pi - \frac{2\pi f_k}{f_s}$, $f_k$ is the desired frequency and $f_s$ is the sampling frequency.
• The signal $x_k[n]$ is passed through a single-pole filter whose transfer function is
$$H(z) = \frac{1}{1 + r z^{-1}}, \quad (3)$$
where $r \approx 1$, i.e., the pole is close to the unit circle on the negative real axis in the z-plane. In this study, $r$ is set to 0.99.
• The output of the filter is given by
$$y_k[n] = -r\, y_k[n-1] + x_k[n]. \quad (4)$$
• The amplitude envelope $v_k[n]$ of the signal at the desired frequency $f_k$ is given by
$$v_k[n] = \sqrt{y_{kr}^2[n] + y_{ki}^2[n]}, \quad (5)$$
where $y_{kr}[n]$ and $y_{ki}[n]$ are the real and imaginary parts of $y_k[n]$. The amplitude envelope can be computed for several frequencies using a frequency interval $\Delta f$ as $f_k = k\,\Delta f$; in the present study, $\Delta f = f_s/2048$. From the amplitude envelopes $v_k[n]$, the SFF spectrum of the signal is obtained at each instant of time.
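To make these steps concrete, the following NumPy/SciPy sketch implements Eqs. 1-5 for a chosen frequency grid. The function name and the example grid are our own illustration, not the authors' implementation:

```python
import numpy as np
from scipy.signal import lfilter

def sff_envelopes(s, fs, freqs, alpha=0.95, r=0.99):
    """Amplitude envelopes v_k[n] of the signal s at the frequencies in freqs."""
    # Eq. 1: pre-emphasis, x[n] = s[n] - alpha * s[n-1]
    x = np.append(s[0], s[1:] - alpha * s[:-1])
    n = np.arange(len(x))
    v = np.empty((len(freqs), len(x)))
    for k, f_k in enumerate(freqs):
        # Eq. 2: shift the desired frequency f_k to the Nyquist frequency
        omega_k = np.pi - 2.0 * np.pi * f_k / fs
        x_k = x * np.exp(1j * omega_k * n)
        # Eqs. 3-4: single-pole filter H(z) = 1 / (1 + r z^{-1}),
        # i.e., y_k[n] = x_k[n] - r * y_k[n-1]
        y_k = lfilter([1.0], [1.0, r], x_k)
        # Eq. 5: amplitude envelope from the magnitude of the complex output
        v[k] = np.abs(y_k)
    return v

# Example frequency grid with Delta_f = fs / 2048, covering frequencies up to fs / 2
fs = 16000
freqs = np.arange(1, 1025) * fs / 2048
```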

B. COMPUTATION OF THE MEL-WEIGHTED SFF SPECTROGRAM
This section describes the computation of the Mel-weighted SFF spectrogram. The procedure, depicted in Figure 2, consists of the extraction of filter-bank energies by filtering the SFF spectrum with triangular Mel-spaced filters, followed by a logarithmic compression. The resulting Mel-filter bank energies (MFBE) are referred to as MFBE-SFF or simply as the Mel-weighted SFF spectrum. For convenience, we refer to the spectrogram obtained by this process as the Mel-weighted SFF spectrogram or Mel-SFF spectrogram.
As explained in Section II-A, SFF provides the spectrum at each instant of time. Instead of considering the spectrum at each time instant, the computational load is reduced in the current study by assuming the spectrum to be unchanged within a segment of T ms. One of the following four approaches can be used to define the spectrum of a segment of T ms (a code sketch of these variants follows the list):
(a) Average SFF spectrum (S_avg): the SFF spectrum is computed by averaging the amplitude envelope $v_k[n]$ defined in Eq. 5 for every frequency k over the entire segment.
(b) Minimum SFF spectrum (S_min): the SFF spectrum is selected as the instantaneous spectrum of $v_k[n]$ which shows the minimum spectral energy (sum of the squared amplitude envelope values) over the entire segment.
(c) Maximum SFF spectrum (S_max): the SFF spectrum is selected as the instantaneous spectrum of $v_k[n]$ which shows the maximum spectral energy over the entire segment.
(d) Uniform SFF spectrum (S_uniform): the SFF spectrum is computed by sampling $v_k[n]$ at regular intervals defined by the segment duration.
The performance of the above four approaches was compared in this study. As will be reported in Section V, S_avg gave the best performance. Therefore, the Mel-SFF spectrogram computed using S_avg was used as the spectral representation of speech, and this representation was further processed in an unsupervised manner by an autoencoder to obtain fixed-sized BNFs for dialect identification.
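The four pooling variants can be sketched as follows, assuming `v` is the envelope matrix returned by the `sff_envelopes` function above and `seg` is the segment length T expressed in samples (the function name is ours):

```python
import numpy as np

def segment_spectra(v, seg, mode="avg"):
    """Pool SFF envelopes v (freqs x samples) into one spectrum per segment."""
    spectra = []
    for start in range(0, v.shape[1] - seg + 1, seg):
        block = v[:, start:start + seg]
        if mode == "avg":          # S_avg: average envelope over the segment
            spectra.append(block.mean(axis=1))
        elif mode == "uniform":    # S_uniform: sample at regular intervals
            spectra.append(block[:, 0])
        else:
            # S_min / S_max: instantaneous spectrum with the lowest / highest
            # spectral energy (sum of squared envelope values) in the segment
            energy = (block ** 2).sum(axis=0)
            idx = energy.argmin() if mode == "min" else energy.argmax()
            spectra.append(block[:, idx])
    return np.array(spectra)       # shape: (num_segments, num_freqs)
```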

III. PROPOSED SYSTEM
The proposed system has three stages: Mel-SFF spectrogram extraction, derivation of an unsupervised representation from the spectrograms using an autoencoder, and classification. The block diagram shown in Figure 3 describes the proposed system architecture. For convenience, we refer to the unsupervised representations (BNFs) derived from the Mel-SFF and Mel-STFT spectrograms as BNF Mel-SFF and BNF Mel-STFT, respectively. In addition, we also studied a system using feature fusion where the unsupervised representations derived from Mel-SFF spectrograms (BNF Mel-SFF) were combined with the unsupervised representations derived from conventional Mel-STFT spectrograms (BNF Mel-STFT). The fusion system, depicted in Figure 4, is discussed in Section III-C.

A. MEL-SFF SPECTROGRAM CONFIGURATION
The time-domain speech signal is processed as described by Eqs. 1-5 in Section II to compute the SFF spectrogram. Instead of considering the spectrum at every sample, the spectrogram is averaged over segments as explained in Section II-B. We varied the segment duration of the averaging operation (S_avg, explained in Section II-B) by considering seven values of T (6.25 ms, 12.5 ms, 18.75 ms, 25 ms, 31.25 ms, 37.5 ms, 43.75 ms). The best performance was obtained with T = 37.5 ms; therefore, this segment duration is used throughout this study unless otherwise mentioned. Mel-filter bank energies are obtained from the spectrum using 128 triangular filters spaced linearly on the Mel scale. To eliminate the effect of background noise, the spectral values are clipped using a threshold as in [25]. In this study, five thresholds (−30 dB, −40 dB, −50 dB, −60 dB, and −70 dB) are explored. Spectrograms are normalized to the range [−1, 1] to match the decoder output function in the autoencoder. The clipping and normalization operations are carried out in the same way for the proposed Mel-SFF spectrogram and for its reference, the conventional Mel-STFT spectrogram.
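A minimal sketch of the clipping and normalization steps is given below. This is our own illustration; the exact dB reference convention used in [25] may differ:

```python
import numpy as np

def clip_and_normalize(log_mel, clip_db=-40.0):
    """Clip a log-Mel spectrogram at clip_db below its peak and scale to [-1, 1]."""
    # Clip everything more than |clip_db| dB below the peak (noise-floor removal)
    clipped = np.maximum(log_mel, log_mel.max() + clip_db)
    # Min-max normalize to [-1, 1] to match the decoder's tanh output range
    lo, hi = clipped.min(), clipped.max()
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0
```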

B. SEQUENCE-TO-SEQUENCE AUTOENCODERS
Sequence-to-sequence autoencoders compress high-dimensional frame-level representations into low-dimensional utterance-level latent representations by capturing the information needed to reproduce the input sequence. These compact utterance-level representations can then be used for classification. A sequence-to-sequence recurrent neural network (RNN) within an autoencoder framework is used to convert the Mel-SFF spectrogram into a fixed-length utterance-level representation. The autoencoder framework has two modules, the encoder and the decoder, both implemented as sequence-to-sequence RNNs. Motivated by [26], we used gated recurrent units (GRUs) throughout the study.
The encoder converts the Mel-SFF spectrogram into a fixed-length representation. A fully-connected layer with a tanh activation function transforms the encoder output into the initial hidden state of the decoder; this fixed-length vector is taken as the BNF. The BNFs extracted from the trained autoencoder are used for dialect identification.
The BNFs are passed to the decoder as its initial hidden state, and the decoder learns to reproduce the Mel-SFF spectrogram frame by frame. The estimated output is recurrently passed as the input of the next time step. The autoencoder network is trained by minimizing the root mean square error (RMSE) between the estimated decoder output and the original Mel-SFF spectrogram. The initial hidden state of the encoder RNNs and the initial input of the decoder are set to 0 for all utterances. The autoencoders of this study are implemented using the auDeep toolkit [26], [34].
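Purely as an illustration of the encoder-bottleneck-decoder structure described above, a stripped-down PyTorch sketch is given below (the actual systems use auDeep). The class and variable names are ours, and the sketch simplifies two details: the paper's decoder is bidirectional and feeds back its own previous output, whereas this sketch uses a unidirectional decoder with teacher forcing.

```python
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, n_mels=128, hidden=256, layers=2):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, layers, batch_first=True)
        # Fully-connected tanh layer: encoder state -> bottleneck feature (BNF)
        self.to_bnf = nn.Sequential(nn.Linear(layers * hidden, layers * hidden),
                                    nn.Tanh())
        self.decoder = nn.GRU(n_mels, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, spec):                      # spec: (batch, time, n_mels)
        _, h = self.encoder(spec)                 # h: (layers, batch, hidden)
        bnf = self.to_bnf(h.permute(1, 0, 2).reshape(spec.size(0), -1))
        # BNF becomes the decoder's initial hidden state
        h0 = bnf.reshape(spec.size(0), -1, h.size(2)).permute(1, 0, 2).contiguous()
        # Decoder input: previous target frame, starting from a zero frame
        dec_in = torch.cat([torch.zeros_like(spec[:, :1]), spec[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, h0)
        return self.out(dec_out), bnf

# Training minimizes the RMSE between the reconstruction and the input
model = Seq2SeqAutoencoder()
spec = torch.randn(4, 53, 128)                    # toy batch of Mel-SFF spectrograms
recon, bnf = model(spec)
loss = torch.sqrt(nn.functional.mse_loss(recon, spec))
```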

C. FUSION SYSTEM
In order to investigate whether there is complementary information between the STFT and SFF spectrograms, we developed a fusion system. The block diagram of the fusion system is shown in Figure 4. In this system, the bottleneck features extracted from the autoencoders trained on the Mel-SFF spectrogram (the proposed BNF Mel-SFF system) and Mel-STFT spectrogram (the baseline BNF Mel-STFT system) are concatenated to train the classifier. Two separate autoencoders are trained, each one capturing the underlying latent space representations from respective input spectrograms. The classifier trained using these fused features is used for dialect identification in a similar manner as the system described in Section III-D.
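In code, the fusion reduces to concatenating the two BNF vectors of each utterance before classifier training; a minimal sketch with placeholder array names follows:

```python
import numpy as np
from sklearn.svm import LinearSVC

# bnf_sff_*, bnf_stft_*: hypothetical (num_utterances, bnf_dim) arrays holding
# the bottleneck features from the two separately trained autoencoders
fused_train = np.concatenate([bnf_sff_train, bnf_stft_train], axis=1)
fused_test = np.concatenate([bnf_sff_test, bnf_stft_test], axis=1)

clf = LinearSVC().fit(fused_train, train_labels)  # classifier as in Section III-D
predicted_dialects = clf.predict(fused_test)
```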

D. CLASSIFICATION
The third stage in the proposed system is classification. We experimented with three different classifiers: a Gaussian linear classifier (GLC), multi-class logistic regression (MCLR) and a support vector machine (SVM). GLC is a generative classifier, while MCLR and SVM are discriminative classifiers. GLC was implemented based on [35]. For MCLR and SVM, we used the implementations from [36]. Both MCLR and SVM use the one-vs-rest strategy to classify dialects.
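As an illustration, the two discriminative classifiers could be set up with scikit-learn along the following lines (the array names are placeholders; the GLC, implemented after [35], is close in spirit to a linear discriminant classifier with Gaussian class models and is omitted here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# MCLR and SVM, both in a one-vs-rest setup; the features are the
# utterance-level BNFs and the labels are the dialect classes
mclr = make_pipeline(StandardScaler(),
                     OneVsRestClassifier(LogisticRegression(max_iter=1000)))
svm = make_pipeline(StandardScaler(),
                    OneVsRestClassifier(LinearSVC()))

for name, clf in [("MCLR", mclr), ("SVM", svm)]:
    clf.fit(train_bnf, train_labels)
    print(name, clf.score(dev_bnf, dev_labels))
```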

IV. EXPERIMENTAL SETUP
In this section, the databases used in the study are described. In addition, the section discusses the baseline systems and the evaluation metrics adopted in the study.

A. DATABASES
Two speech databases, STYRIALECT and UT-Podcast, are used in the study. The STYRIALECT database includes the dialects of Styria in German [25]. The database consists of 5227 utterances for training and 2570 utterances for development. The STYRIALECT test set is not provided with labels and therefore the results of this database are reported only for the development set in this study. The sampling frequency is 16 kHz and the duration of each utterance is 2 s. The database has three classes (Styrian dialects of German) with unevenly distributed class sizes. More details about the database can be found in [25].
The UT-Podcast database consists of three major dialects (US, UK, and AU) of English [37]. This data is collected from different websites for each dialect and it covers a wide range of topics. Since the data is collected from online podcasts, speech is more spontaneous than in STYRIALECT and not very well structured. Therefore, the collected speech captures all the dialectal traits (pronunciation, vocabulary, and grammatical variations). The speech signals are segmented in such a way that each utterance is 17 s in duration and contains 46 words on average. The sampling frequency is 8 kHz. The database is divided into train and test sets as described in [37].
For the experiments in this study, half of the original test set of the database is used for development and the other half for testing.

B. BASELINE CONFIGURATIONS
The proposed system is compared to three baseline systems. The first baseline is the ComParE'19 system, which uses the BNFs of a sequence-to-sequence autoencoder trained using the Mel-STFT spectrogram [25]. This system will be referred to as the BNF Mel-STFT system. The second baseline is an i-vector system trained using Mel frequency cepstral coefficients (MFCCs) [20], [38]. The third baseline is an i-vector system trained using the BUT/phonexia bottleneck features [39]. The second and third systems will be referred to as the i-vector MFCC system and the i-vector BUT-BNF system, respectively.
In the BNF Mel-STFT system, spectral analysis of speech is computed with the STFT using 80-ms Hann-windowed frames and a shift of 40 ms. Mel-filter bank energies are computed from the spectrum using 128 channels. Amplitude clipping is performed on the Mel-STFT spectrogram to reduce the effect of noise captured in the recordings. Five clipping thresholds (−40 dB, −50 dB, −60 dB, −70 dB, and −80 dB) are explored, as in [25]. The baseline BNF Mel-STFT system uses a similar autoencoder architecture as the proposed BNF Mel-SFF system in order to make a fair comparison between the two Mel-weighted spectrograms in dialect identification. The RNNs in the autoencoder have two layers with 256 GRUs in each layer; the encoder network is unidirectional while the decoder network is bidirectional. The network is trained for 16 epochs with a dropout of 30%. The BNF Mel-STFT system is trained in the same manner as the proposed BNF Mel-SFF system to obtain unsupervised representations (BNFs) from the Mel-weighted spectrograms. These BNFs are used to train the classifier, which is then used for dialect prediction.
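For illustration, the baseline spectral analysis under these settings could be computed with librosa as sketched below. The input file name is a placeholder, and the original system uses its own pipeline rather than this one:

```python
import numpy as np
import librosa

# Hypothetical input file; STYRIALECT audio is sampled at 16 kHz
y, sr = librosa.load("utterance.wav", sr=16000)
win = int(0.080 * sr)          # 80-ms Hann window
hop = int(0.040 * sr)          # 40-ms shift
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win, win_length=win,
                                     hop_length=hop, window="hann", n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)   # 0 dB at the spectrogram peak
log_mel = np.maximum(log_mel, -70.0)             # clip at the -70 dB threshold
```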
The other two baseline systems (the i-vector MFCC system and the i-vector BUT-BNF system) differ only in the feature representations used for i-vector training. In the former, i-vectors are extracted from 13 static mean normalized MFCC features and their shifted delta coefficients. In the latter, BNFs are extracted from a multi-lingual phone recognizer neural network. For our experiments, we considered a pre-trained phone recognizer from BUT/phonexia [39] which was trained using 17 Babel languages. For both systems, 100-dimensional i-vectors are extracted using 256 Gaussian mixture components and the obtained i-vectors are transformed by a whitening transformation to be used in the dialect prediction [20], [38].
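As a sketch of the final transformation step, whitening centers the i-vectors and rotates and scales them by the inverse square root of their training-set covariance. This is our own minimal implementation, not the toolkit code used in [20], [38]:

```python
import numpy as np

def whiten(train_ivecs, ivecs):
    """Whitening transform estimated on training i-vectors, applied to ivecs."""
    mu = train_ivecs.mean(axis=0)
    cov = np.cov(train_ivecs - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    vals = np.maximum(vals, 1e-10)               # guard against tiny eigenvalues
    w = vecs @ np.diag(vals ** -0.5) @ vecs.T    # inverse matrix square root
    return (ivecs - mu) @ w
```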

C. EVALUATION METRICS
The evaluation metrics used are the unweighted average recall (UAR) and accuracy. UAR gives a class-balanced score for the classification and is therefore considered the primary evaluation metric of this study. These evaluation metrics were chosen so as not to create any bias towards the majority class, as the classes are unevenly distributed.
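Concretely, UAR is the unweighted mean of the per-class recalls, which scikit-learn exposes as macro-averaged recall. A sketch with placeholder label arrays:

```python
from sklearn.metrics import accuracy_score, recall_score

# UAR is the unweighted mean of per-class recalls ("macro" recall), so each
# dialect class contributes equally regardless of its number of utterances
uar = recall_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)
```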

V. RESULTS
In this section, the results obtained in dialect identification using the proposed system and the baseline systems are reported separately for the STYRIALECT database and the UT-Podcast database.

A. RESULTS FOR THE STYRIALECT DATABASE
In this section, different variants of the SFF spectrogram computation described in Section II-B are first investigated to find the best approach for the proposed BNF Mel-SFF system in dialect identification. Then, the proposed system with the best approach is compared to the baseline systems and to the fusion system. In both parts, SVM is used as the classifier. Finally, we validate the performance of the proposed system and the fusion system with different classifiers (SVM, MCLR and GLC) in comparison to the best baseline system identified in the former analysis. In order to find out which of the four spectrogram computation approaches described in Section II-B is best, we conducted dialect identification experiments using all of these approaches with four values of the clipping threshold (−40 dB, −50 dB, −60 dB, and −70 dB). The obtained results (in UAR %) are plotted in Figure 5. It can be observed that S_avg outperformed the other approaches for all threshold values. Therefore, in all further experiments of this study, the features were extracted from the Mel-SFF spectrogram computed using S_avg with a segment duration of 37.5 ms.
The dialect identification results, reported in UAR [%] and accuracy [%], are shown in Table 1 for the proposed BNF Mel-SFF method and for the three reference methods using the SVM classifier. The first column refers to the systems under comparison. In addition to the BNF Mel-SFF system and the three reference systems, the table also includes the fusion system (the lowest row of the first column). The second column shows the different clipping thresholds. From the table, it can be observed that the MFCC-based i-vector system performed better than the BUT/phonexia i-vector system. Among the three baseline systems, the BNF Mel-STFT system gave the best performance in both metrics, with its best result at the threshold of −70 dB.
The proposed BNF Mel-SFF system showed better performance in UAR at the threshold of −40 dB when compared to the BNF Mel-STFT baseline system. In addition, the proposed system showed comparable performance to BNF Mel-STFT at the threshold levels of −50 dB and −60 dB. Furthermore, the fusion system outperformed the best baseline configuration at the threshold of −40 dB by 0.42% in UAR.
TABLE 1. Performance in UAR (%) and accuracy of the three baseline systems, the proposed system and the fusion system using the development data of STYRIALECT. SVM is used as the classifier. The utterance length is 2 s.

Furthermore, we compared the dialect identification performance of the three different classifiers described in Section III-D using the BNF Mel-STFT baseline system, the proposed BNF Mel-SFF system and the fusion system. The results reported in Table 2 are shown for the best configurations from Table 1, i.e., with the threshold of −70 dB for the BNF Mel-STFT system and with the threshold of −40 dB for the BNF Mel-SFF system and the fusion system. From these experiments, it can be observed that the SVM classifier performed better than the other two classifiers. Furthermore, it can also be observed that the fusion of the BNFs derived from the Mel-STFT and Mel-SFF spectrograms improved the overall performance compared to either of the individual feature extraction methods.

B. RESULTS FOR THE UT-PODCAST DATABASE
From the results reported for the STYRIALECT database in Section V-A, it can be observed that the performance of the i-vector MFCC and i-vector BUT-BNF systems was poorer compared to the other systems. Hence, these two systems were excluded from the evaluation of the UT-Podcast database. Table 3 presents the UAR results for the three remaining systems separately for the development and test sets and for the three different classifiers. The configurations of these systems are as in Sections III-A and IV-B, except that the clipping threshold is fixed to −40 dB, and the autoencoder is trained for 128 epochs with a batch size of 4 and 128 GRUs in each layer.
From the table, it can be observed that the BNF Mel-SFF system performed better than the BNF Mel-STFT system for all classifiers. As in the results discussed in Section V-A, SVM showed higher performance than the two other classifiers. The proposed BNF Mel-SFF system gave a relative improvement of 9.47% and 4.69% in UAR on the development and test sets, respectively, when compared to the baseline BNF Mel-STFT system. These results support our hypothesis: since the SFF spectrum is in principle extracted at every sample, the temporal resolution of the spectrogram is preserved, and the BNFs derived from a spectrogram with high spectral and temporal resolution discriminate speech sounds across dialects better.
Furthermore, from the results reported in Tables 2 and 3, it can be observed that the fusion of the BNFs derived from the Mel-STFT and Mel-SFF spectrograms improved the overall performance compared to any of the individual systems. The improvement achieved with the fusion system for both databases shows that there is complementary information between the spectral representations computed by STFT and SFF.
It is to be noted that the results reported in Table 3 were obtained by using the UT-Podcast utterances over their entire length (i.e., 17 s). In order to study the effect of the utterance length on dialect identification, additional experiments were carried out using utterance lengths of 10 s and 2 s for the UT-Podcast database. The results reported in Table 4 show that the proposed BNF Mel-SFF system performed consistently across all utterance lengths. Furthermore, it can be observed that the proposed system showed a clearly larger relative improvement over the STFT-based reference system for shorter utterances (15.35%) than for longer utterances (4.69%). Furthermore, the fusion system showed an improvement for both utterance lengths compared to the individual reference systems, again indicating complementary information between the features.

VI. SUMMARY AND CONCLUSION
This study explored the use of the Mel-weighted single frequency filtering spectrogram for dialect identification using the STYRIALECT and UT-Podcast databases. Dialects were identified by training an autoencoder with the Mel-SFF spectrogram and by feeding the bottleneck features of the autoencoder to a classifier. The proposed Mel-SFF spectrogram gave better performance than the i-vector based baseline systems. Furthermore, the fusion of the unsupervised representations (BNFs) computed from the Mel-SFF and Mel-STFT spectrograms using the sequence-to-sequence autoencoders yielded the best UAR score (46.9%) for the STYRIALECT database. For UT-Podcast, the proposed and fusion systems gave relative UAR improvements of 4.69% and 7.71%, respectively, compared to the Mel-STFT spectrogram-based baseline system. Furthermore, the proposed system showed better performance especially on short utterances compared to the baseline system in the experiments with the UT-Podcast data. Therefore, we conclude that the high spectral and temporal resolution of the SFF spectrum leads to an improvement in dialect identification for the studied German and English dialects. In addition, we conclude that the proposed Mel-SFF spectrogram system distinguishes dialects from short utterances better than its STFT-based reference system. In the future, we plan to explore the Mel-SFF spectrogram derived features for dialect identification in noisy conditions [27], [28], [30] and for larger corpora.