Voice Spoofing Countermeasure for Logical Access Attacks Detection

Voice-driven devices (VDDs) such as Google Home and Amazon Alexa, well-known connected devices in consumer IoT, have applications in various domains, e.g., home appliance automation, next-generation vehicles, and voice banking. However, these VDDs, which rely on automatic speaker verification (ASV) systems, are vulnerable to voice-based logical access (LA) attacks such as Text-to-Speech (TTS) synthesis and converted voice signals. Intruders can exploit these attacks to bypass the security of such systems and gain access to a victim's bank account or home controls. Thus, there is a need for an effective voice spoofing countermeasure that can reliably protect VDDs against such malicious attacks. This work presents a novel audio feature descriptor named extended local ternary pattern (ELTP) to capture the dynamically induced vocal tract attributes of bonafide speech and the algorithmic artifacts in synthetic and converted speech. We fuse the novel ELTP features with linear frequency cepstral coefficients (LFCC) to further strengthen their ability to capture the traits of bonafide and spoofed signals. The proposed ELTP-LFCC features are used to train a deep bidirectional Long Short-Term Memory (DBiLSTM) network to classify bonafide and spoofed signals (i.e., TTS synthesis and converted speech). The performance of our spoofing countermeasure is measured on the large-scale and diverse ASVspoof 2019 logical access dataset. Experimental results demonstrate that the proposed audio spoofing countermeasure can reliably detect LA spoofing attacks.


I. INTRODUCTION
We have witnessed a tremendous evolution in voice biometrics-based user authentication systems over the last few years. Automatic speaker verification (ASV) systems are commonly embedded in devices such as mobile phones and smart speakers (Google Home, Amazon Alexa) for user authentication in application domains such as banking, electronic-commerce systems, home automation, and app login [1]. For example, Siri on the iPhone, Baidu's ASV on Lenovo devices, and Google Home receive voice commands from users and execute functions such as opening/closing doors, setting reminders, calling or texting a contact, unlocking a cellphone [2], and playing songs, all based on the ASV [3]. In the banking sector, many voice-driven authentication solutions have been deployed for customer verification: Barclays Wealth and BBVA's bank in Turkey have been using ASV to verify telephone callers, while Garanti bank has developed a voice-driven interface that allows users to perform transactions in its app through voice commands [4].
The COVID-19 pandemic has resulted in exponential growth of voice-based authentication systems, as lockdowns and social distancing measures have restricted the ability to verify claimants face-to-face using facial or fingerprint recognition. The pandemic compelled the world to drastically change verification practices by discouraging human-to-human and human-to-machine contact (e.g., fingerprint scanning, password-based verification). Thus, voice biometrics has emerged as a feasible solution among the various biometric techniques (facial, iris, and fingerprint). Moreover, voice biometrics-based authentication systems are considered economical and computationally more efficient than other biometric systems. Although voice biometrics-based user authentication systems are considered more practical these days, they are susceptible to various malicious presentation/spoofing attacks, e.g., speech synthesis, voice conversion, and replays. These presentation attacks allow a claimant to imitate an authorized person and gain control of someone's home, bank account, or device (laptop or mobile). Recently, three cases were filed in the United States in which imposters used synthetic voices of CEOs of different organizations to fool their employees and steal millions of dollars electronically [5]. To address these vulnerabilities, researchers are developing robust voice spoofing countermeasures/detection systems that add a protective layer in front of ASV systems and discard spoofed samples before they reach the ASV system.
Voice spoofing attacks are categorized into logical-access attacks, i.e., voice conversion (VC) [6] and Text-to-Speech (TTS) synthesis [7], and physical-access attacks, i.e., replays [8] and impersonation [9]. These spoofing attacks are generated by modifying a bonafide audio signal in a variety of ways. For example, in voice conversion, a speech signal spoken by the original speaker is manipulated to sound as if it was spoken by a target speaker while keeping the linguistic information unchanged. Speech synthesis produces an artificial, machine-generated voice of the target speaker. As both voice conversion and TTS synthesis can impersonate a target speaker's voice, they pose a significant threat to ASV systems. Additionally, since converted voice originates from a live person and retains the dynamic variations of human speech, whereas synthetic speech lacks these variations and contains cloning-algorithm artifacts, we believe that detection of converted speech is more challenging. In replay spoofing, the imposter plays a pre-recorded speech in front of the ASV system to gain access on behalf of the bonafide speaker.
With the advent and evolution of generative adversarial networks (GANs) over the last few years, we have witnessed remarkable results in synthetic image and audio generation that look and sound very realistic. FIGURE 1 shows a practical example of deepfake voice phishing (vishing), in which synthetic speech is used against Google Assistant to initiate a fraudulent transaction. Mr. Visher, the intruder, researches the targeted victim and collects his/her voice samples (from online meetings, voice mails, phone calls, etc.). These voice samples are then used to train voice conversion or speech synthesis algorithms to imitate the victim's voice. The victim's bank account is connected to voice-enabled devices such as Google Home on Android or iPhone devices. Equipped with voice synthesis/conversion capability, the intruder can attempt to manipulate VDDs. It is relatively easy for someone with malicious intentions to get access to VDDs by being in proximity to the victim; in this scenario, we assume the intruder has permanent or temporary access to the VDDs. When attacking, the intruder can simply play synthetic/converted voice to impersonate the legitimate user and exploit the VDD into transferring funds to his account with the command "Hey Google, I want to transfer funds to Mr. Visher's account". Because current systems cannot detect synthetic/converted speech, the attacker succeeds in transferring the funds. This scenario shows that current VDDs cannot reliably differentiate between bonafide and spoofed voice samples, which motivates the development of a reliable spoofing detection system that provides a protective countermeasure layer in front of the ASV system.
In recent years, various research efforts have been made to detect spoofing (synthetic/converted) attacks [10] in conventional ASV systems and authenticate legitimate users in the financial sector [11]. Voice texture is a relatively new concept of voice characterization: spectral analysis reveals that the texture of cloned voice signals differs from that of bonafide ones. The concept of texture is well explored in the image processing domain, where texture descriptors such as local binary patterns (LBP) and local ternary patterns (LTP) have proven effective for texture-based image classification. Later, these descriptors were introduced to develop an acoustic LBP-based voice spoofing countermeasure. LBP has two main limitations: i) it is sensitive to noise, and ii) different LBP patterns may be assigned to the same class, which reduces its discriminative power. We proposed acoustic local ternary patterns (LTP) [12] to overcome these limitations. However, acoustic-LTP features remain vulnerable in certain scenarios that must be addressed. The potential limitations of the fixed-threshold approach of our prior acoustic-LTP method are: (a) non-robustness to dynamic pattern detection: spectral analysis of synthetic voice reveals that the signal has a dynamic repetition pattern that can be effectively captured using a dynamic threshold; however, acoustic-LTP uses a static threshold for computing the LTP codes, so the existing acoustic-LTP features need to be improved for ASV applications; (b) brute-force optimization: acoustic-LTP requires a brute-force approach for threshold optimization, which makes it difficult to achieve good accuracy in real-time applications under diverse conditions;
(c) intolerance to non-uniform noise: acoustic-LTP is robust against the consistent, uniform noise present in the indoor audio encountered in fall detection applications, whereas the outdoor environments relevant to voice spoofing detection exhibit non-uniform noise. Static threshold-based acoustic-LTP features are therefore not robust under non-uniform noise and hence not reliable for voice spoofing detection in diverse environments. The motivation behind the proposed work is to develop an effective feature representation scheme that overcomes the above limitations and can reliably detect logical-access (LA) attacks in diverse scenarios. To address these issues, we develop a novel audio feature descriptor named extended local ternary pattern (ELTP), in which the threshold is computed automatically from the standard deviation of each audio frame. Our ELTP features analyze the patterns of the audio in the time domain using this dynamically computed threshold, capturing the algorithmic artifacts in synthetic speech signals and the vocal tract induced variations in genuine signals. Moreover, we exploit the ability of linear frequency cepstral coefficient (LFCC) features to extract significant information from the low- and high-frequency bands of the audio. Thus, we integrate the frequency-domain LFCC with our novel time-domain ELTP features to better capture the vocal tract induced variations of bonafide voice and the algorithmic artifacts of synthesized speech. The proposed ELTP-LFCC features are then used to train a DBiLSTM model to reliably detect LA attacks. The main contributions of our research work are:
1. We propose a novel extended local ternary pattern feature descriptor to effectively capture the traits of speaker-induced variations in bonafide audio and algorithmic artifacts in converted and synthetic audio.
2. Our novel ELTP features are robust to non-uniform noise and dynamic pattern detection, which makes them perform well for voice spoofing detection in diverse indoor and outdoor environmental conditions.
3. We integrate our ELTP features with LFCC to develop a more effective descriptor that further strengthens the performance of our spoofing countermeasure.
4. Rigorous experimentation was performed to illustrate the significance of the proposed countermeasure for detection of LA-based voice spoofing attacks.
The rest of the paper is organized as follows. Section II reviews existing state-of-the-art voice spoofing countermeasures. Section III explains the proposed voice spoofing detection framework. Section IV details the dataset and the experiments conducted to measure the performance of our countermeasure. Lastly, Section V presents the conclusion.

II. RELATED WORK

A. Shallow Machine Learning-Based Approaches
Existing methods have heavily explored the GMM and its variants for synthetic speech and converted voice detection. In [13], constant Q transform cepstral coefficients (CQCC) were used to classify speech samples as synthetic or bonafide. A few works have highlighted the significance of the modified group delay function (MGDF) for synthetic/converted speech signals. In [14], MGDF-based and relative phase shift features were employed for synthetic speech detection. Similarly, in [16], a feature set comprising mel-frequency cepstral coefficients (MFCC), cosine-normalized phase-based cepstral coefficients (CNPCC), and linear prediction residual cepstral coefficients (LPRCC), along with existing features such as modified group delay cepstral coefficients (MGDCC), was used to train a bi-class GMM to distinguish spoofed (synthetic/converted) speech from bonafide speech. In [15], mean pitch stability (MPS), mean pitch stability range (MPSR), and jitter were computed from the pitch pattern to distinguish between genuine and synthetic speech. The integration of multiple features makes these solutions [15], [16] computationally complex for real-time applications. A few works [17], [26] have used LBP, MGDF, and CNPF to detect LA attacks; since LBP is sensitive to noise and generates similar patterns for both classes, it is less effective at differentiating between bonafide and spoofed samples. Similarly, [27] highlighted the significance of relative phase information derived from the Fourier spectrum, and of fusing it with existing phase-based features, for voice spoofing (synthetic/converted) detection. In [28], the authors used a fusion of long-term modulation and short-term spectral features to discriminate between bonafide and synthetic speech.
That method uses filter-bank energies to reduce dimensionality, which may result in the loss of detailed information in the modulation features. In [29], a combination of cochlear filter cepstral coefficients (CFCC) and change in instantaneous frequency (IF) was used to capture the traits of natural and spoofed (synthetic/converted) speech; the classification performance of the CFCC-IF features increased when combined with MFCC. An anti-spoofing system based on computing linear predictive coding (LPC) pair-wise distances between genuine and converted speech was proposed in [30]; this countermeasure relies on prior knowledge of the attack. A spoofing countermeasure based on high-order spectral analysis, specifically quadrature phase coupling (QPC) and Gaussianity and linearity test statistics, was used in [7] for cloned audio detection. In [31], the authors investigated an utterance-level feature termed longer contexts or high-level feature (HLF) and a voice assessment tool (P.563), which calculates a Mean Opinion Score, to detect artificial signals; the latter approach could not effectively discriminate between genuine and artificially produced signals.

B. Deep Learning-Based Approaches
In recent years, the ASV research community has widely explored deep learning-based methods for logical access attack detection. In [18], MFCC, CQCC, and STFT features were employed to train ResNet models for audio spoofing detection; the fusion of three variants of residual convolutional neural networks (MFCC-ResNet, CQCC-ResNet, and Spec-ResNet) achieved better classification performance than the ASVspoof baseline spoofing detection methods (LFCC-GMM, CQCC-GMM). In [20], a spoofing-discriminant network was employed to obtain a spoofing vector (s-vector) for each utterance; Mahalanobis distance with normalization was then applied to the s-vectors for spoofing (synthetic/converted) detection. A fusion of two magnitude-based features was used with a multilayer perceptron classifier in [19] to detect LA attacks; this method attained improved classification performance but at a higher feature computation cost. In [21], a deep dense convolutional network with 135 layers was used to detect converted voice spoofing. Similarly, in [23], two low-level acoustic features, log power magnitude spectra (logspec) and CQCC, were employed to train deep neural network (DNN) models based on several variants of Squeeze-and-Excitation and residual networks to classify spoofed and bonafide speech. This method [23] achieves better classification results than other contemporary methods; however, the fusion of several DNN models significantly increases training time.
Besides extracting spectral features such as MGDF and MFCC, which are then fed to machine learning or deep learning models for classification, a few works [22], [32], [33] have also employed machine-learned features. In [32], a DNN was used to generate bottleneck features and frame-level posteriors to discriminate between bonafide and spoofed (synthetic/converted) samples; a GMM classifier was trained using both the extracted and machine-learned features. In [22], the authors used a fusion of a light convolutional neural network (LCNN) and a deep feature extractor termed a gated recurrent neural network (GRNN); the extracted deep features were then used to train three classifiers, linear discriminant analysis (LDA), its probabilistic version (PLDA), and SVM, for voice spoofing detection. Similarly, in [33], DNN-based frame-level features and RNN-based sequence-level features were extracted to train classifiers such as LDA, Gaussian density function (GDF), and SVM for LA attack detection. More specifically, for the DNN the authors employed three model structures, a stacked autoencoder, a spoofing-discriminant DNN, and a multi-task joint-learned DNN, whereas the RNN-based system implemented LSTM-RNN and bidirectional LSTM-RNN. An autoencoder compresses the information, which can result in loss of relevant content. These methods achieve better classification performance, but with increased feature computation cost.

III. PROPOSED FRAMEWORK
This section provides a detailed description of our voice spoofing countermeasure. We propose a novel audio feature descriptor, ELTP, to represent the input audio signals, and describe it in detail below. We fuse the ELTP features with LFCC for audio signal representation. We then design a deep bidirectional LSTM (DBiLSTM) recurrent neural network and train it with the ELTP-LFCC features to classify bonafide and synthetic/converted signals. The architecture of the proposed framework is presented in FIGURE 2.

A. FEATURE EXTRACTION
For accurate detection of logical access attacks, we need a robust audio feature descriptor that can effectively capture the algorithmic artifacts in synthesized signals and the dynamic speech attributes of the human speaker in bonafide speech. Moreover, the audio features must be robust to non-uniform noise, which is common in the outdoor environments where voice samples may be recorded and later used for spoofing detection. To address these concerns, we propose a novel audio feature representation method, ELTP, that is robust to non-uniform noise and dynamic pattern detection, and is capable of capturing the dynamically varying attributes of genuine speech and the algorithmic artifacts of cloned voice. We further integrate LFCC features with our ELTP features to enhance spoofing detection performance.

1) EXTENDED LOCAL TERNARY PATTERNS (ELTP)
We partition the input audio signal Y[n], having N samples, into non-overlapping frames of length l. The idea behind the ELTP features is taken from image processing research, which considers the closest neighborhood of a pixel, comprising the 8 surrounding pixels in a 3×3 window, for 2D LTP features [34]. However, for 1D audio signals, we experimented with different numbers of neighbors and found the best feature representation with 10 neighbors. Thus, we selected

FIGURE 2. Architecture of proposed framework
10 neighbors around a central sample to create each frame of length 11 (FIGURE 3) in the input audio. LTP extends LBP to 3-valued codes: samples within a band of width ±θ around the central sample c are quantized to zero, while those above and below this band are quantized to +1 and −1, respectively. The ternary quantization of a neighbor p_i is:

s(p_i, c, \theta) = \begin{cases} +1, & p_i \ge c + \theta \\ 0, & |p_i - c| < \theta \\ -1, & p_i \le c - \theta \end{cases}  (1)

where c is the central sample of the frame F, p_i (i = 1, …, 10) are its neighbors, and θ is the threshold. To compute the ELTP, we compare the magnitude difference between the central sample c and the 10 surrounding audio samples against θ. In our prior work on 1D LTP features [12], we used a fixed threshold, which is not robust to noise. To overcome this limitation, we compute the threshold dynamically using an auto-adaptive scheme instead of a fixed value:

\theta = \alpha \cdot \sigma  (2)

where σ is the standard deviation computed for each frame of the audio and α is a scaling factor. We employed a linear search to optimize α between 0 and 1 and found that α = 0.6 gives the best results; thus, we use α = 0.6 for computing the threshold. Next, we split each ternary pattern of ELTP into its positive (ELTP⁺) and negative (ELTP⁻) halves: values quantized to +1 are retained in ELTP⁺, values quantized to −1 are retained in ELTP⁻, and all other values are replaced with zeros:

ELTP^{+}_i = \begin{cases} 1, & s(p_i, c, \theta) = +1 \\ 0, & \text{otherwise} \end{cases}  (3)

ELTP^{-}_i = \begin{cases} 1, & s(p_i, c, \theta) = -1 \\ 0, & \text{otherwise} \end{cases}  (4)

Inspired by the concept of uniform patterns in image processing research [35], we apply this idea to voice signals since uniform patterns provide valuable information about the signal. In contrast to non-uniform patterns, which carry less significant information, uniform patterns contain substantial signal information and are also more prevalent.
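As a concrete illustration of the feature computation described above, the following pure-Python sketch builds the ternary codes with the auto-adaptive threshold θ = α·σ, splits them into positive and negative halves, and accumulates the 20-dimensional uniform-pattern histogram. The frame layout (center sample at index 5 of an 11-sample frame) and the pattern-to-bin ordering (ascending decimal, with everything else folded into a catch-all bin) are our illustrative assumptions, not the paper's exact implementation.

```python
import math

ALPHA = 0.6  # scaling factor found by the linear search described above

def eltp_code(frame, alpha=ALPHA):
    """Ternary code for an 11-sample frame: each of the 10 neighbors is
    quantized to +1 / 0 / -1 against theta = alpha * sigma, where sigma
    is the standard deviation of this frame alone."""
    assert len(frame) == 11, "expects a center sample with 10 neighbors"
    c = frame[5]                                   # central sample (assumed layout)
    mu = sum(frame) / len(frame)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in frame) / len(frame))
    theta = alpha * sigma                          # auto-adaptive threshold
    code = []
    for idx, p in enumerate(frame):
        if idx == 5:
            continue                               # skip the center itself
        if p >= c + theta:
            code.append(1)
        elif p <= c - theta:
            code.append(-1)
        else:
            code.append(0)
    return code

def halves(code):
    """Positive and negative binary halves of a ternary code."""
    pos = [1 if v == 1 else 0 for v in code]
    neg = [1 if v == -1 else 0 for v in code]
    return pos, neg

def is_uniform(bits):
    """Uniform pattern: at most two 0/1 transitions in the circular code."""
    n = len(bits)
    return sum(bits[i] != bits[(i + 1) % n] for i in range(n)) <= 2

# uniform 10-bit patterns as decimals, ascending (an illustrative ordering)
UNIFORM = [p for p in range(1024) if is_uniform([(p >> i) & 1 for i in range(10)])]

def histogram(binary_codes, n_bins=10):
    """Counts of the first n_bins - 1 uniform patterns plus one catch-all
    bin for every remaining pattern, keeping the result n_bins-dimensional."""
    bin_of = {u: i for i, u in enumerate(UNIFORM[:n_bins - 1])}
    hist = [0] * n_bins
    for bits in binary_codes:
        d = sum(b << i for i, b in enumerate(bits))  # decimal form of the pattern
        hist[bin_of.get(d, n_bins - 1)] += 1
    return hist

def eltp_feature(frames):
    """20-dimensional ELTP: positive histogram followed by the negative one."""
    pos_codes, neg_codes = zip(*(halves(eltp_code(f)) for f in frames))
    return histogram(list(pos_codes)) + histogram(list(neg_codes))
```

For a toy frame that is a slow ramp with one spike, only the spiked neighbor exceeds the adaptive band, so the ternary code is all zeros except a single +1.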
We compute the positive-uniform and negative-uniform patterns from ELTP⁺ and ELTP⁻ and represent them in decimal form:

ELTP^{+}_{dec} = \sum_{i=1}^{10} ELTP^{+}_i \cdot 2^{\,i-1}  (5)

ELTP^{-}_{dec} = \sum_{i=1}^{10} ELTP^{-}_i \cdot 2^{\,i-1}  (6)

Next, we compute the histograms of ELTP^{+}_{dec} and ELTP^{-}_{dec} separately to capture the statistics of both pattern types. The number of bins is significantly reduced, without losing much information, by assigning all non-uniform patterns to a single bin:

H^{+}(n) = \sum_{f \in F} \mathbb{1}\{ELTP^{+}_{dec}(f) = n\}  (7)

H^{-}(n) = \sum_{f \in F} \mathbb{1}\{ELTP^{-}_{dec}(f) = n\}  (8)

where n indexes the histogram bins. Through extensive experimentation, we found that the first 10 uniform patterns of each polarity were sufficient to capture the distinctive traits of bonafide and spoofed samples; therefore, we use a 10-dimensional ELTP code each for the positive and negative uniform patterns. Finally, the histograms in (7) and (8) are concatenated to create the 20-dimensional ELTP feature:

ELTP = [H^{+}, H^{-}]  (9)

2) LINEAR FREQUENCY CEPSTRAL COEFFICIENTS (LFCC)
Recently, many methods have employed spectral features, alone or in combination, for voice spoofing detection. Spectral features such as MFCC, GTCC, and CQCC have been used to build feature representation schemes for anti-spoofing methods [36], [37]. The ASVspoof community provides two baseline models, one using CQCC and the other LFCC, for physical- and logical-access attack detection. MFCC features were designed to resemble the human auditory system; LFCC is identical to MFCC in terms of feature extraction except that it uses a linear filter bank. Furthermore, according to speech production theory, some speaker characteristics associated with the anatomy of the vocal tract are strongly reflected in the high-frequency regions of speech [38], which argues for a linear frequency scale for speaker identification and spoofing detection. Additionally, a comparative analysis of synthetic speech detection in [39] demonstrates the effectiveness of LFCC, relative to other cepstral coefficients, in capturing the distinctive traits present in the high-frequency bands.
This motivated us to integrate LFCC with our novel ELTP features to better capture the vocal tract induced variations of bonafide voice and the algorithmic artifacts of synthesized speech. For this work, we extracted 20-dimensional LFCC features with the MATLAB implementation provided by the ASVspoof 2019 challenge [10] and fused them with our ELTP features for acoustic signal representation. The LFCC extraction process, which returns 20 cepstral coefficients, is shown in FIGURE 4. The pre-processing block performs framing and windowing. We obtain the spectrum of each audio frame using the Fast Fourier Transform (FFT), apply a set of linearly spaced filters to the FFT of the audio signal, and compute the gain g_k of each filter. Next, the logarithm of each g_k is taken and the discrete cosine transform (DCT) is applied to obtain the LFCC features:

LFCC(n) = \sum_{k=1}^{K} \log(g_k) \cos\!\left(\frac{\pi n (k - 0.5)}{K}\right)  (10)

where K is the number of filters and n indexes the cepstral coefficients.

B. CLASSIFICATION USING THE DBiLSTM NETWORK
Given an input feature sequence x = (x_1, …, x_T), a standard recurrent neural network computes the hidden vector sequence h = (h_1, …, h_T) and the output sequence y = (y_1, …, y_T) using (11) and (12) from t = 1 to T:

h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h)  (11)

y_t = W_{hy} h_t + b_y  (12)

where the W terms are weight matrices (e.g., W_{xh} is the input-hidden weight matrix), the b terms are bias vectors (e.g., b_h is the hidden bias vector), and \mathcal{H} is the hidden-layer function. For the LSTM network, we compute the hidden function at time t using (13) to (17):

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)  (13)

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)  (14)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)  (15)

c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)  (16)

h_t = o_t \odot \tanh(c_t)  (17)

where σ is the hard-sigmoid function, and f, i, o, c, and h are the forget gate, input gate, output gate, cell memory, and hidden vector, respectively.
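As a sanity check on the LSTM recurrence of eqs. (13)-(17), the following scalar sketch runs one time step with the hard-sigmoid gate activation. The weight naming (W['xf'] for the input-to-forget weight, etc.) is our own convention, and real layers operate on vectors and matrices rather than scalars.

```python
import math

def hard_sigmoid(x):
    """Piecewise-linear gate activation: clamp(0.2*x + 0.5, 0, 1)."""
    return max(0.0, min(1.0, 0.2 * x + 0.5))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step following eqs. (13)-(17), scalar weights for clarity."""
    f = hard_sigmoid(W['xf']*x + W['hf']*h_prev + W['cf']*c_prev + W['bf'])  # (13) forget gate
    i = hard_sigmoid(W['xi']*x + W['hi']*h_prev + W['ci']*c_prev + W['bi'])  # (14) input gate
    c = f*c_prev + i*math.tanh(W['xc']*x + W['hc']*h_prev + W['bc'])         # (16) cell memory
    o = hard_sigmoid(W['xo']*x + W['ho']*h_prev + W['co']*c + W['bo'])       # (15) output gate, peephole on new cell
    h = o * math.tanh(c)                                                     # (17) hidden vector
    return h, c
```

With all weights and biases zero, every gate sits at hard_sigmoid(0) = 0.5 and both the cell and hidden states stay at zero, which is a useful quick check of the equations.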

FIGURE 4. Illustration of LFCC descriptor
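The pipeline of FIGURE 4 (framing/windowing, FFT, linear filter bank, log, DCT) can be sketched in plain Python as follows. The naive DFT, the filter spacing, and the small epsilon inside the logarithm are simplifications of ours, not the MATLAB implementation distributed with ASVspoof 2019.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT magnitude-squared (a stand-in for the FFT block)."""
    N = len(frame)
    spec = []
    for k in range(N // 2 + 1):
        s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        spec.append(abs(s) ** 2)
    return spec

def linear_filterbank(n_filters, n_bins):
    """Triangular filters with linearly spaced centers; the linear spacing
    is the only difference from the mel filterbank used for MFCC."""
    edges = [i * (n_bins - 1) / (n_filters + 1) for i in range(n_filters + 2)]
    banks = []
    for m in range(1, n_filters + 1):
        lo, ctr, hi = edges[m - 1], edges[m], edges[m + 1]
        fb = []
        for k in range(n_bins):
            if lo <= k <= ctr:
                fb.append((k - lo) / (ctr - lo))
            elif ctr < k <= hi:
                fb.append((hi - k) / (hi - ctr))
            else:
                fb.append(0.0)
        banks.append(fb)
    return banks

def lfcc(frame, n_filters=20, n_ceps=20):
    """Filterbank gains g_k -> log -> DCT-II, per the text above."""
    spec = power_spectrum(frame)
    gains = [sum(w * s for w, s in zip(fb, spec))
             for fb in linear_filterbank(n_filters, len(spec))]
    logs = [math.log(g + 1e-12) for g in gains]      # epsilon avoids log(0)
    return [sum(logs[k] * math.cos(math.pi * n * (k + 0.5) / n_filters)
                for k in range(n_filters)) for n in range(n_ceps)]
```

For a pure sinusoid at an integer bin frequency, the power spectrum peaks at exactly that bin, and `lfcc` returns the requested 20 coefficients.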
LSTM networks are commonly employed for the classification of time-series data; however, an LSTM uses only the previous context. The bidirectional RNN (BRNN) [40] overcomes this limitation by processing the data in both directions; thus, we employ a BiLSTM network in the proposed method. As illustrated in FIGURE 2, a forward hidden sequence \vec{h}, a backward hidden sequence \overleftarrow{h}, and the output sequence y are computed by iterating the forward layer from t = 1 to T and the backward layer from t = T to 1. The output layer is updated by concatenating the outputs of the forward and backward hidden sequences:

y_t = W_{\vec{h}y} \vec{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y  (18)

Our model uses 10 bidirectional LSTM layers, each with 64 hidden units. The extracted ELTP-LFCC features are fed to the first BiLSTM layer; the outputs of each BiLSTM layer are concatenated and passed to the next. The feature vector from the 10th BiLSTM layer is passed to a fully connected (FC) layer, whose output is propagated to a softmax layer and finally to a classification layer that assigns each input to one of the mutually exclusive classes, as shown in FIGURE 2. We used the Adam optimizer [41] to tune our network, with the initial learning rate set to 0.001 and the squared gradient decay factor set to 0.999. We tuned various hyperparameters during training, specifically the state and gate activation functions, mini-batch size, maximum number of epochs, and number of hidden units. We experimented with 64, 100, and 150 hidden units and found the best results with 64. The mini-batch size was tuned over 128, 64, and 30, with the best results at 30. The maximum number of epochs was varied and finally set to 100, where optimal results were achieved.
We also tuned the state activation function over tanh and softsign, where tanh outperformed softsign in almost all experiments, as tanh delivers better training performance for multilayer neural networks [42]. Similarly, we tuned the gate activation function over sigmoid and hard-sigmoid and obtained the best results with hard-sigmoid.
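The forward/backward recurrence and per-step concatenation of a bidirectional layer, as described in this section, can be sketched generically. Here `step` stands in for any recurrence of the form (x, h, c) → (h, c), such as the LSTM equations; separate parameters per direction are implied but omitted for brevity.

```python
def bilstm_layer(xs, step, h0=0.0, c0=0.0):
    """One bidirectional layer: a forward pass over t = 1..T, a backward
    pass over t = T..1, then concatenation of the two hidden sequences."""
    def run(seq):
        h, c, out = h0, c0, []
        for x in seq:
            h, c = step(x, h, c)
            out.append(h)
        return out

    fwd = run(xs)
    bwd = list(reversed(run(list(reversed(xs)))))  # realign to t = 1..T
    return list(zip(fwd, bwd))                     # concatenated per-step output
```

Stacking ten such layers, with each layer's concatenated output feeding the next, mirrors the architecture described above; a trivial cumulative-sum `step` makes the directionality easy to verify by hand.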

C. ADDRESSING THE LIMITATIONS OF ACOUSTICS-LBP AND ACOUSTICS-LTP APPROACHES
As discussed in the Introduction, existing approaches like acoustic-LBP are sensitive to noise, and the hard-coded threshold-based acoustic-LTP features are non-robust to dynamic pattern detection, which makes it difficult to achieve good accuracy in real-time applications under diverse conditions. Because the proposed ELTP features analyze the audio patterns in the time domain with an adaptive threshold, they reliably capture the algorithmic artifacts of synthetic samples and the dynamic vocal tract traits of genuine audio, enabling effective classification of bonafide and cloned samples.
To demonstrate the effectiveness of our ELTP features for distinctive representation of bonafide and synthetic/cloned samples, we generated box plots of ELTP for bonafide and synthetic samples of the same speaker, shown in FIGURE 5. From FIGURE 5, we can see that the spoofed sample has a larger distributional variance than the bonafide sample of the same speaker; moreover, most of the feature values of spoofed samples are higher than those of bonafide samples. These observations signify the effectiveness of our ELTP features for distinctive representation of bonafide and spoofed samples. Our ELTP features also address the noise sensitivity of the acoustic-LBP approach. As FIGURE 3 shows, noise that could otherwise raise or lower the value of the central sample within a frame, producing an incorrect code, leaves the value of c within the [c − θ, c + θ] range, so the codes are more robust against noise.

IV. EXPERIMENTAL RESULTS AND DISCUSSION
This section presents the details of the experiments conducted to measure the performance of our technique, along with a discussion of the results and of the dataset used for evaluation. The evaluation plan of the ASVspoof 2019 dataset considers the tandem detection cost function (t-DCF) and the equal error rate (EER) as the primary and secondary evaluation metrics, respectively; thus, we also use t-DCF and EER to measure the performance of the proposed countermeasure. For experimentation, we used the training subset of the ASVspoof 2019 LA dataset for training and the evaluation subset for testing.
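For intuition on the EER metric, here is a minimal threshold-sweep sketch of our own; the official ASVspoof toolkit computes EER and t-DCF with a more careful ROC-based procedure. Higher scores are assumed to mean "more likely bonafide".

```python
def eer(bonafide_scores, spoof_scores):
    """Equal error rate: sweep a decision threshold over all observed scores
    and return the operating point where the false-acceptance rate (spoofed
    samples accepted) and false-rejection rate (bonafide samples rejected)
    are closest, averaging the two rates there."""
    best = (2.0, None)  # (smallest |FAR - FRR| seen, EER at that threshold)
    for thr in sorted(bonafide_scores + spoof_scores):
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]
```

Perfectly separated score distributions give an EER of 0, while heavily overlapping ones push it toward 0.5 (chance level for this two-class decision).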

A. DATASET
The performance of our proposed countermeasure is investigated on the logical access subset of the ASVspoof 2019 dataset, which comprises training, development, and evaluation subsets. Each subset contains bonafide and spoofed samples, where the spoofed samples are generated from genuine speech using several spoofing algorithms (A01-A19) [43]. Genuine speech samples were collected from 107 speakers. The training subset contains 25,380 samples, the development subset 24,986 samples, and the evaluation (eval) subset 71,933 audio samples. The statistics of the ASVspoof 2019 LA dataset, in terms of the number of spoofed and bonafide samples in each subset, the number of male and female speakers, the spoofing algorithms, and the sampling rate, are listed in TABLE II. The duration of each utterance is in the range of one to two seconds, and all audio files in these three subsets are stored in FLAC format; further details can be found at [43]. From the classification results, we can see that the proposed countermeasure outperforms the existing methods, including the ASVspoof baselines, for LA attack detection; thus, we argue that our method can effectively detect LA voice spoofing attacks.

E. PERFORMANCE COMPARISON OF PROPOSED ELTP-LFCC AND BASELINE FEATURES FOR LA SPOOFING DETECTION
Since we propose a novel feature descriptor for voice spoofing detection, a feature-wise comparison against the existing baseline features (CQCC and LFCC) with the same classifier is important to evaluate the significance of our ELTP-LFCC feature set. For this, we compared the performance of our features against the ASVspoof baseline features CQCC and LFCC using the DBiLSTM classifier, and the results are shown in

V. CONCLUSION
This paper has presented an effective voice spoofing countermeasure that uses the novel ELTP-LFCC features and a deep bidirectional LSTM to combat the TTS synthesis and converted voice samples of logical-access attacks. We presented a novel audio feature descriptor, ELTP, and fused it with LFCC to better capture the characteristics of the vocal tract speech dynamics of bonafide voice and cloning-algorithm artifacts. Performance evaluation on the diverse ASVspoof 2019 LA dataset demonstrates the significance of our system for reliable detection of logical access spoofing attacks, and comparison against the baseline and contemporary methods shows that our countermeasure provides better detection performance than existing voice spoofing countermeasures. The fact that the ASVspoof evaluation set contains unknown bonafide and spoofed samples and voice samples of unseen human speakers indicates that our system can also perform well in cross-dataset scenarios. Experimental analysis showed encouraging results on TTS synthesis attacks; however, we found that converted voice samples are more difficult to detect because voice conversion algorithms take voice samples as input, unlike TTS, which takes digitized text. This allows voice conversion algorithms to better preserve the prosodic qualities of the speaker, which may be missing in synthetic speech generated by TTS algorithms. In the future, we plan to improve the performance of our countermeasure against voice conversion attacks.