A Survey on Text-Dependent and Text-Independent Speaker Verification

Speaker verification (SV) aims to verify a person's claimed identity from his/her voice. SV has been successfully applied in various areas such as access control, remote service customization, and financial transactions. Depending on whether the text content is pre-defined or not, SV can be text-dependent or text-independent. This paper reviews recent research on text-dependent SV (TD-SV) and text-independent SV (TI-SV). Because most modern SV systems apply deep learning methods to boost performance, we focus on the studies that use deep speaker embedding, a technique representing a person's identity via a fixed-dimensional vector encoded from a variable-length utterance. Rather than detailing every existing SV system, we provide an overview of the representative SV systems that have attracted wide attention. Furthermore, an increasing number of SV systems have been devoted to addressing real-world challenges such as reverberation and noise, which has driven a large number of studies on practical SV. Therefore, the survey compares the existing SV systems in the Far-Field Speaker Verification Challenge 2020 (FFSVC 2020) to illustrate the most effective techniques for both TD-SV and TI-SV.

The associate editor coordinating the review of this manuscript and approving it for publication was Prakasam Periasamy.

SV can be text-independent or text-dependent. For text-independent SV (TI-SV), because there is no constraint on the lexical content, the speaker embedding extractor is trained on long utterances to suppress the adverse effect of phonetic variability [3], [4]. TI-SV has been studied extensively due to the ease of collecting large-scale text-independent data. In contrast, the lexicon in text-dependent SV (TD-SV) is constrained to a small set of words or phrases. Because of the low degree of phonetic variability, TD-SV usually outperforms TI-SV under short-duration scenarios. This property makes TD-SV more advantageous when the utterance duration is short and the response should be quick [5]. However, to build a well-performing TD-SV system, we need to collect a large amount of in-domain data, which is very expensive in practice.

The recent advances in deep learning and deep neural networks have changed the landscape of speaker verification. Various backbones, such as ResNets and DenseNets, have been integrated into speaker embedding networks. Compared with the i-vector/PLDA framework, deep speaker embedding has led to significant performance improvement in TI-SV systems [2], [9], [10], [11]. Classical deep speaker embedding uses a speaker identification network to create a speaker-embedding space. Typically, the embedding network comprises a frame-level subnetwork, a pooling layer, and an utterance-level subnetwork. After frame-level feature extraction and utterance-level aggregation, the speaker embeddings are extracted from the affine output of a fully-connected (FC) layer of the utterance-level subnetwork.
Under this framework, various architectures based on convolutional neural networks (CNNs) have been used for frame-level processing. A classic example is the x-vector, which uses time delay neural networks (TDNNs) to extract the frame-level features [2]. Later, more advanced networks, such as ResNets [9], DenseNets [10], [12], and Res2Nets [11], [13], were introduced to better model the spectral-temporal relationship across the acoustic frames. Simultaneously, diverse aggregation methods have been proposed to aggregate the frame-level information into utterance-level embeddings, e.g., statistics pooling [2] and multi-head attentive pooling.
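As an illustration of the aggregation step, statistics pooling [2] concatenates the per-dimension mean and standard deviation of the frame-level features, yielding a fixed-size vector regardless of utterance length. Below is a minimal numpy sketch; the frame-level subnetwork (TDNN/CNN) that would actually produce the features is omitted.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Aggregate frame-level features of shape (T, D) into a 2*D vector.

    Concatenates the per-dimension mean and standard deviation, as in the
    x-vector recipe; the result is independent of the utterance length T.
    """
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# Utterances of different lengths map to vectors of the same size.
short = np.random.randn(120, 64)   # 120 frames, 64-dim features
long_ = np.random.randn(900, 64)
assert statistics_pooling(short).shape == statistics_pooling(long_).shape == (128,)
```

Attentive pooling variants replace the uniform frame weighting here with learned, per-frame (or per-head) attention weights, but the fixed-size output property is the same.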

Besides the cascade of a front-end and a backend, there are end-to-end architectures with an SV-loss objective (see Fig. 1(b)) for TI-SV [23], [24]. These systems strictly conform to the SV objective in that they directly map an enrollment-test pair to a score or a decision probability.

This difference requires TD-SV to adopt additional strategies to deal with the content information, which is ignored in TI-SV.
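Whether produced by a cascade system or learned end-to-end, the final step reduces an enrollment-test pair to a single score. A common choice, sketched below in numpy, is the cosine similarity between the two embeddings, thresholded to obtain a decision; the threshold value here is an arbitrary placeholder, since in practice it is calibrated on development data.

```python
import numpy as np

def verification_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between an enrollment embedding and a test embedding."""
    a = enroll_emb / np.linalg.norm(enroll_emb)
    b = test_emb / np.linalg.norm(test_emb)
    return float(a @ b)

def decide(score: float, threshold: float = 0.5) -> bool:
    """Accept the identity claim when the score clears the threshold."""
    return score >= threshold
```

End-to-end systems effectively learn this pair-to-score mapping jointly with the embedding extractor, whereas cascade systems compute it (or a PLDA log-likelihood ratio) as a separate backend step.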

For example, a phrase recognizer may be used to assist the verification of Target-wrong and Impostor-wrong trials.

Under the x-vector framework, the standard deviation vectors rather than the affine output from the utterance-level subnetwork are used as the speaker representations for TD-SV [40]. In [37], a bidirectional attentive pooling layer is incorporated into a DenseNet to better establish the contextual information across the frames. When these models are trained on sufficient in-domain data, good performance can be achieved even though they are expected to perform speaker classification only, i.e., the single-task style.
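The idea in [40] can be pictured directly on top of statistics pooling: of the pooled [mean; std] statistics, only the standard-deviation half is retained as the TD-SV representation. A toy numpy sketch, under the simplifying assumption that the representation is read straight from the pooled statistics:

```python
import numpy as np

def std_representation(frames: np.ndarray) -> np.ndarray:
    """Keep only the per-dimension standard deviation of the frame-level
    features (shape (T, D)) as the speaker representation, following the
    idea in [40] of bypassing the utterance-level affine output."""
    return frames.std(axis=0)
```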

Although TI-SV models can be well adapted to TD-SV, the text information is actually exploited in an implicit way. A more intuitive strategy is to explicitly explore the phrase information through multi-task learning. In [35], the authors proposed a j-vector system to deal with the contextual information in utterances through multi-task learning. Besides using a speaker classifier as in single-task learning, the j-vector network applies a phrase classifier to explicitly propagate the phrase information to the speaker-embedding layer. To address the text mismatch between the training data and the evaluation data, and also the text mismatch between the enrollment phrases and the test phrases, a speaker-text factorization network was proposed.

It has been found that fine-tuning the whole model is more effective than fine-tuning the upper layers only. To explicitly exploit the text information in the fine-tuning operation, a multi-task fine-tuning strategy was introduced for TD-SV in [42], where both a speaker classification head and a phrase classification head were used. Another fine-tuning example for TD-SV is illustrated in [23], where a GE2E contrastive loss was used to fine-tune a text-independent speaker embedding network.

1 https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2

To make the organization of this paper clear, we illustrate the detailed structure of the "Front-end + Backend" SV system in Fig. 2. Note that the system structure is applicable to both TI-SV and TD-SV.

When there is domain mismatch between the training data and the evaluation data, fine-tuning is an effective way to boost SV performance. In [64], a domain-balanced hard prototype mining (HPM) technique was proposed to exploit the "harder" speakers who confuse the system during the fine-tuning process.
In contrast to metric learning, in which considerable effort is made to mine hard negative samples [19], HPM is easier to implement and readily adaptable to the AAM-Softmax loss.
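For reference, the AAM-Softmax logits can be sketched as follows. Each row of the L2-normalized weight matrix acts as a class-center direction, which is exactly the class-center-proxy property that HPM exploits. The margin and scale values below are typical but arbitrary.

```python
import numpy as np

def aam_softmax_logits(embedding: np.ndarray, weights: np.ndarray,
                       label: int, margin: float = 0.2,
                       scale: float = 30.0) -> np.ndarray:
    """Additive angular margin (AAM) softmax logits.

    Both the embedding and the class weights are L2-normalized, so each
    logit is the cosine of the angle to a class center; the margin is
    added to the target-class angle before scaling.
    """
    e = embedding / np.linalg.norm(embedding)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ e
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    logits = cos.copy()
    logits[label] = np.cos(theta[label] + margin)  # penalize the target class
    return scale * logits
```

Because the normalized weight rows approximate the class centers, "hard" speakers can be mined by comparing these rows directly, without computing a full utterance-by-utterance similarity matrix.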

Because it is impossible to compute a similarity matrix across all utterances in a mini-batch to deduce speaker confusion, the weights of the AAM-Softmax layer are used as the proxies of the class centers of the training speakers.

TD-SV not only deals with the speaker information but also the text information in the utterances. Therefore, TI-SV methods cannot be directly used in the TD scenarios. In this section, we will introduce several typical TD-SV systems according to the organization in Fig. 2.

Unlike the conventional i-vector extractor, a BN i-vector extractor computes the Baum-Welch statistics from bottleneck features (see Fig. 4(b)).

To incorporate phonetic information into the segment-level subnetworks shown in Fig. 5(b), the frame-level phoneme posteriors can be aggregated into a segment-level phoneme distribution y_ps whose c-th entry is N_c/N, where N_c is the number of occurrences of the c-th phoneme, N is the number of frames in X, and C is the number of phonemes in the selected phoneme set. To optimize the network, we define the total loss as a combination of the speaker classification loss L_s, the frame-level phoneme classification loss L_pf, and the segment-level phoneme classification loss L_ps:

L = L_s + α L_pf + β L_ps,

where α and β are hyperparameters controlling the contribution of L_pf and L_ps, respectively. For example, the segment-level phoneme classification loss is a KL divergence between the predicted and target phoneme distributions,

L_ps = KL(M_ps(M_e(X)), y_ps),

where M_e denotes the embedding network and M_ps the segment-level phoneme classification head. Just as multi-task learning is used in pre-training, we can also use multi-task learning in the fine-tuning process to improve TD-SV performance.
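The segment-level phoneme target y_ps (the normalized phoneme histogram N_c/N) and a total loss of the form L_s + α·L_pf + β·L_ps can be sketched in numpy as follows; the KL term mirrors L_ps = KL(M_ps(M_e(X)), y_ps), with the network outputs replaced by plain arrays, and the weighting defaults are placeholders.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def segment_phoneme_target(phoneme_ids: np.ndarray, num_phonemes: int) -> np.ndarray:
    """y_ps: the normalized phoneme histogram N_c / N of an utterance,
    built from per-frame phoneme indices."""
    counts = np.bincount(phoneme_ids, minlength=num_phonemes)
    return counts / counts.sum()

def total_loss(l_s: float, l_pf: float, l_ps: float,
               alpha: float = 0.1, beta: float = 0.1) -> float:
    """L = L_s + alpha * L_pf + beta * L_ps."""
    return l_s + alpha * l_pf + beta * l_ps
```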

In [42], the authors investigated two different fine-tuning strategies using both speaker labels and phrase labels: "speaker + phrase" and "speaker × phrase". As shown in Fig. 6, "speaker + phrase" follows a multi-task fine-tuning style with two separate classification heads. In the "speaker × phrase" mode, however, only a single head is used in the output layer of the classifier, and utterances in different phrases with the same speaker identity are considered different classes. It was shown in [42] that the "speaker + phrase" mode outperforms the "speaker × phrase" strategy on the TD task in SdSVC 2021, which verifies the effectiveness of multi-task fine-tuning.
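The two strategies can be contrasted in a toy numpy sketch. All sizes and the weighting factor are hypothetical, and the heads in [42] are trained jointly with the embedding network rather than being fixed random matrices as here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs: np.ndarray, label: int) -> float:
    return float(-np.log(probs[label] + 1e-12))

emb_dim, n_speakers, n_phrases = 16, 100, 10      # hypothetical sizes
W_spk = rng.standard_normal((n_speakers, emb_dim)) * 0.01
W_phr = rng.standard_normal((n_phrases, emb_dim)) * 0.01

def speaker_plus_phrase_loss(emb, spk, phr, beta=0.5):
    """"speaker + phrase": two separate heads, weighted sum of losses."""
    l_spk = cross_entropy(softmax(W_spk @ emb), spk)
    l_phr = cross_entropy(softmax(W_phr @ emb), phr)
    return l_spk + beta * l_phr

def speaker_x_phrase_label(spk: int, phr: int, num_phrases: int = n_phrases) -> int:
    """"speaker x phrase": one head; each (speaker, phrase) pair is its own
    class, so the single output layer has n_speakers * n_phrases nodes."""
    return spk * num_phrases + phr
```

With 100 speakers and 10 phrases, the "speaker × phrase" head would thus have 1,000 output nodes, which is the "more output nodes" trade-off mentioned in the figure caption.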

In this section, we compare the performance of existing systems on the recent Far-Field Speaker Verification Challenge (FFSVC) 2020 data [68]. FFSVC20 focuses on the smart-home scenario where far-field distributed microphone arrays are used in noisy environments. The utterances in FFSVC20 are recorded by one close-talking microphone, one iPhone, and six circular microphone arrays. The language is Mandarin. The enrollment utterances and the test utterances in both TI-SV and TD-SV tasks come from different microphones.

TABLE 3. TD-SV performance of existing best systems (without fusion) on the evaluation set of FFSVC 2020.

FIGURE 6. Illustration of two fine-tuning strategies in TD-SV. The left subfigure shows the "speaker + phrase" method with a speaker classification head and a phrase classification head, whereas the right "speaker × phrase" method uses a single classification head but with more output nodes. (Adapted from [42].)

TABLE 2.

The official baseline system [43] (the first row of Table 2) used public data from openslr.org for pre-training the embedding model and adopted fine-tuning to transfer the knowledge learned from the pre-training data to the FFSVC20 TI-SV task. The pre-training set comprises 10,544 speakers. The system in [71] used a similar number of speakers for pre-training and fine-tuning. Because the system uses a more advanced VAD (with a U-Net structure) in the front-end and mean adaptation in the backend, its performance is better than that of the baseline.
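The "mean adaptation" mentioned here plausibly amounts to re-centering embeddings with an in-domain mean before backend scoring, a common unsupervised adaptation step when the evaluation domain differs from the pre-training domain. A hedged numpy sketch under that assumption (the function name and the choice of in-domain set are ours):

```python
import numpy as np

def mean_adapt(embeddings: np.ndarray, in_domain_embeddings: np.ndarray) -> np.ndarray:
    """Subtract the mean of unlabeled in-domain embeddings from the
    evaluation embeddings so the backend operates around the evaluation
    domain rather than the pre-training domain."""
    mu = in_domain_embeddings.mean(axis=0)
    return embeddings - mu
```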

The concluding remarks are summarized as follows:
1) Advanced convolutional layers/blocks such as DenseNet and ResNet are prevalent in SV.
2) Most existing SV systems are implemented in a "Front-end + Backend" structure.

3) Fine-tuning is an effective tool to improve the performance of TI-SV and TD-SV.

As mentioned in Section II-C, SV faces many challenges in real-world applications. Background noise, reverberation effects, short utterances, microphone mismatches, and language mismatches have always been and will continue to be the critical issues in robust speaker verification. Although the current SV systems can partially address these problems, the solutions are scenario-specific, e.g., an SV system that can address noise could fail miserably when the utterances are very short. Therefore, seeking principled solutions that can generalize across different tasks is essential in the future.

On the other hand, to facilitate system deployment, model compression techniques such as knowledge distillation [76] and network pruning [77] have received increasing attention. However, due to the trade-off between system performance and runtime efficiency, developing lightweight and effective SV systems is challenging and warrants further research.

Recently, research on security in SV has also attracted great attention, and many studies have focused on defending SV systems against malicious spoofing attacks through replay, speech synthesis, voice conversion, and adversarial samples [78], [79], [80]. Unlike previous ASVspoof tasks [78], [79], which aim to develop countermeasures (CMs) for a fixed SV system, the spoofing-aware speaker verification (SASV) challenge [80] focuses on the optimization of both CMs and SV subsystems to improve SV reliability. In this regard, SASV will attract extensive attention in the future.