Optimizing Tandem Speaker Verification and Anti-Spoofing Systems

As automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically used in conjunction with spoofing countermeasure (CM) systems to improve security. For example, the CM can first determine whether the input is human speech, then the ASV can determine whether this speech matches the speaker's identity. The performance of such a tandem system can be measured with a tandem detection cost function (t-DCF). However, ASV and CM systems are usually trained separately, using different metrics and data, which does not optimize their combined performance. In this work, we propose to optimize the tandem system directly by creating a differentiable version of t-DCF and employing techniques from reinforcement learning. The results indicate that these approaches offer better outcomes than finetuning, with our method providing a 20% relative improvement in the t-DCF in the ASVSpoof19 dataset in a constrained setting.


I. INTRODUCTION
An automatic speaker verification (ASV) system attempts to verify if a given speech utterance matches the claimed identity [1]. As with any machine learning system, this system is susceptible to malicious inputs such as replay attacks [2] or speech synthesis [3], where the attacker aims to fool the ASV system with crafted audio samples. Spoofing countermeasure (CM) systems aim to detect these crafted audio samples, and thus improve security when combined with an ASV system [3]. This improvement is achieved by separately training the two systems, then using them in conjunction with each other and evaluating their performance using a tandem detection cost function (t-DCF) [4].
Although they are evaluated using this tandem metric, the original ASV and CM systems are not trained to minimize the t-DCF. They are instead trained using their respective losses in their individual tasks, with no consideration given to the other system with which each will be combined. While there are no immediate downsides to this practice (these tandem systems have achieved high accuracy [3]), there could be room for improvement. Some attack systems used to generate spoof samples could fool the CM but may be easily detected by the ASV system, as is the case with system A17 in the ASVspoof19 dataset [5].
Alas, as the t-DCF metric requires hard counts of misses and false accepts, it becomes a non-differentiable metric that cannot be directly optimized with gradient descent methods. Fortunately, metrics can be "softened" (made differentiable) by approximating the non-differentiable operations with differentiable ones, such as maximal figure-of-merit [6], [7] or hinge-loss [8], [9]. Here we use the same technique and extend a soft (differentiable) version of the detection cost function (DCF) metric [10] to accommodate the t-DCF. However, using approximations in place of exact counts leads to inaccuracies, and these soft metrics may require tuning for efficient learning [7].
While reinforcement learning (RL) [11] is often associated with games and robotics, it has also, perhaps surprisingly, been used in conjunction with multi-step supervised tasks such as sequence generation [12]. In this study, we consider an entirely novel application area of RL: optimizing tandem systems (here, ASV, CM, and their combination). Although RL methods have been applied for this purpose to small datasets [13] and adversarial attacks [14], the application of RL to classifier optimization remains largely unexplored. Compared to differentiable objective functions that are typically optimized using gradient-based methods, RL methods (in particular, REINFORCE [15]) are widely applicable for optimizing any objective that involves sampling from a probability distribution. REINFORCE offers a zero-bias estimate of the gradient for minimizing the objective.
Because REINFORCE can be directly applied to non-differentiable objective functions without approximations, it may lead to different results compared to optimizing soft versions of the objectives. To this end, we propose optimizing tandem ASV and CM systems with the REINFORCE algorithm. We pre-train the two systems and then optimize them in tandem using REINFORCE, a softened evaluation metric, and, as a baseline result, cross-entropy loss for the same data. The key advance of this work is the process of optimizing the entire tandem system at once; moreover, we are the first to optimize t-DCF directly. In contrast to preliminary results reported in [16], here we conduct experiments using state-of-the-art ASV and CM systems, perform more theoretically sound REINFORCE training using score calibration, and include a soft version of the t-DCF metric in the comparison.
Our contributions can be summarized as follows: 1) We propose jointly training a tandem ASV and CM system using methods from RL literature. 2) We present a novel reward function for optimizing t-DCFs using RL and discuss the importance of score calibration for RL training. 3) We analyze the proposed method against baseline solutions with the ASVSpoof19 dataset; these baselines include a soft (differentiable) version of t-DCF, which we derive based on previous work on soft decision cost functions. 4) We study the results of each spoofing attack and the generalization of the different joint training methods. 5) We study how the cross-task performance changed during tandem training (i.e., did ASV learn to detect spoof samples during tandem training?).

II. TANDEM DETECTION COST FUNCTIONS (T-DCFS)
The t-DCF [4] extends the conventional DCF [17] used in speaker recognition studies. We briefly review each approach.
A. Conventional Detection Cost Function (DCF): one system, two classes
A binary classifier (or a detector) is any function g : X → {0, 1} that assigns a predicted class label, ŷ = g(x), to a given input x ∈ X. One can view g(·) as a device that either accepts (ŷ = 1) or rejects (ŷ = 0) a given hypothesis. Since there are only two possible outcomes (classes), accepting one hypothesis means rejecting its complement and vice versa. In practice, one first computes a real-valued score (soft decision), s ∈ R, and compares it to a pre-set threshold, τ, to reach the binary decision. The hypothesis is accepted if and only if s > τ. In ASV [18], x comprises a pair of enrollment and test utterances and the hypothesis is that the same speaker is present in the two. In voice anti-spoofing, x is a single utterance and g(·) tests whether x is genuine human speech or a spoofing attack.
Since neither the classifier nor its input x is perfect, the prediction ŷ may differ from the actual (true) class, y ∈ {0, 1}. There are two possible errors. A false accept happens when ŷ = 1 but y = 0, and a miss (false rejection) happens when ŷ = 0 but y = 1. By viewing y and ŷ as instances of random variables Y and Ŷ, the miss and false accept rates can be defined as the conditional probabilities P_miss ≡ p(Ŷ = 0 | Y = 1) and P_fa ≡ p(Ŷ = 1 | Y = 0), respectively. Their empirical estimates are obtained by simply counting the errors (i.e., computing the proportion of erroneous decisions at a given threshold τ).
The miss and false accept rates can be considered proxies for user convenience and security, respectively. Neither should be compromised, but the trade-off between the two can be adjusted (by selecting different values of τ), and there might be different preferences for their relative severity depending on the application. Performance is therefore gauged using a flexible, parameterized DCF [17], [19],
$$\mathrm{DCF} = \rho_{\mathrm{tar}} C_{\mathrm{miss}} P^{\mathrm{asv}}_{\mathrm{miss}} + \rho_{\mathrm{non}} C_{\mathrm{fa}} P^{\mathrm{asv}}_{\mathrm{fa}} = \rho_{\mathrm{tar}} C_{\mathrm{miss}} P^{\mathrm{asv}}_{\mathrm{miss}} + (1 - \rho_{\mathrm{tar}}) C_{\mathrm{fa}} P^{\mathrm{asv}}_{\mathrm{fa}}, \tag{1}$$
where $\rho_{\mathrm{tar}} \equiv p(Y = 1)$ and $\rho_{\mathrm{non}} \equiv p(Y = 0) = 1 - \rho_{\mathrm{tar}}$ are the prior probabilities of the positive (target) and negative (non-target) class, while $C_{\mathrm{miss}}, C_{\mathrm{fa}} > 0$ are the costs of a miss and a false accept, respectively. We introduced the superscript 'asv' to emphasize that these errors are those of an ASV system (generally, any single classifier). The DCF parameters $(C_{\mathrm{miss}}, C_{\mathrm{fa}}, \rho_{\mathrm{tar}})$ are set in advance and remain fixed in a given evaluation setup. Note that the prior $\rho_{\mathrm{tar}}$ reflects prior uncertainty concerning the class label: the prior can be (and usually is) different from the empirical proportion of test trials in an evaluation dataset.
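To make the DCF in (1) concrete, the following minimal Python sketch counts misses and false accepts at a fixed threshold and combines them with the cost and prior parameters. The function names, toy scores, and parameter values are illustrative assumptions, not values from this work.

```python
def error_rates(scores, labels, threshold):
    """Empirical miss and false-accept rates; a trial is accepted iff score > threshold."""
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    p_miss = sum(s <= threshold for s in targets) / len(targets)
    p_fa = sum(s > threshold for s in nontargets) / len(nontargets)
    return p_miss, p_fa


def dcf(p_miss, p_fa, rho_tar, c_miss, c_fa):
    """Detection cost: prior- and cost-weighted sum of the two error rates."""
    return rho_tar * c_miss * p_miss + (1.0 - rho_tar) * c_fa * p_fa


# Toy trial list (hypothetical scores): label 1 = target, 0 = non-target.
scores = [2.1, 0.3, -0.5, 1.7, -1.2, 0.9]
labels = [1, 1, 1, 0, 0, 0]

p_miss, p_fa = error_rates(scores, labels, threshold=0.0)
cost = dcf(p_miss, p_fa, rho_tar=0.05, c_miss=1.0, c_fa=10.0)
print(p_miss, p_fa, cost)
```

Because the prior is a free parameter, the same error rates yield a different cost when `rho_tar` changes, independently of how many target trials the toy list contains.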

B. t-DCF: two systems, three classes
While (1) is well-suited to assessing the performance of regular ASV systems in which only target (same speaker) and non-target (different speaker) trials are encountered, it does not consider the impact of spoofing attacks. The t-DCF framework [20], [4] expands the scope by adding a third class (spoofing attack) and another classifier, a spoofing CM. The CM aims to discriminate between true utterances and spoofing attacks. To leverage the benefits of each system, we place a CM in front of an ASV system as a bonafide/spoof gate. The t-DCF metric assesses the performance of this cascaded, or tandem, system.
While conventional ASV systems have two possible errors, the tandem system may now experience four different types of errors: (a) a target user is accepted by the CM but rejected by the ASV system, (b) a non-target user gets accepted by both systems, (c) a spoofing attack gets accepted by both systems, or (d) a target user is rejected by the CM. With this motivation, [4] extended the DCF to a t-DCF,
$$\text{t-DCF} = C_{\mathrm{miss}} \cdot \rho_{\mathrm{tar}} \cdot (P_a + P_d) + C_{\mathrm{fa}} \cdot \rho_{\mathrm{non}} \cdot P_b + C_{\mathrm{fa,spoof}} \cdot \rho_{\mathrm{spoof}} \cdot P_c, \tag{2}$$
where $P_\bullet$ denotes the error rates of the four cases noted above (obtained by treating the CM and ASV decisions independently) and $\rho_{\mathrm{spoof}}$ is the prior probability of a spoofing attack. There are now three mutually exclusive classes, so $\rho_{\mathrm{tar}} + \rho_{\mathrm{non}} + \rho_{\mathrm{spoof}} = 1$. The three cost parameters, $C_{\mathrm{miss}}$, $C_{\mathrm{fa}}$, and $C_{\mathrm{fa,spoof}}$, are now the detection costs of the tandem system. The first two have the same meaning as their corresponding ASV-system costs in (1), while $C_{\mathrm{fa,spoof}}$ is the cost of falsely accepting a spoofing attack. The tandem cost used in this work is thus specified by five parameters $(C_{\mathrm{miss}}, C_{\mathrm{fa}}, C_{\mathrm{fa,spoof}}, \rho_{\mathrm{tar}}, \rho_{\mathrm{non}})$.
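As an illustration of how the four error cases combine, the sketch below evaluates a t-DCF from the marginal error rates of the two systems. It assumes the usual form from [4]: the miss term weights cases (a) and (d), and separate false-accept terms weight cases (b) and (c). It also assumes the two systems' decisions are independent, so that joint error rates are products of marginal rates. All names and numbers are hypothetical.

```python
def tandem_dcf(p_miss_cm, p_fa_cm, p_miss_asv, p_fa_asv, p_fa_spoof_asv,
               c_miss, c_fa, c_fa_spoof, rho_tar, rho_non):
    """t-DCF from marginal error rates, treating CM and ASV decisions as independent."""
    rho_spoof = 1.0 - rho_tar - rho_non
    p_a = (1.0 - p_miss_cm) * p_miss_asv   # (a) target passes CM, rejected by ASV
    p_b = (1.0 - p_miss_cm) * p_fa_asv     # (b) non-target accepted by both
    p_c = p_fa_cm * p_fa_spoof_asv         # (c) spoof accepted by both
    p_d = p_miss_cm                        # (d) target rejected by CM
    return (c_miss * rho_tar * (p_a + p_d)
            + c_fa * rho_non * p_b
            + c_fa_spoof * rho_spoof * p_c)


# Hypothetical operating point and cost parameters, for illustration only.
value = tandem_dcf(p_miss_cm=0.1, p_fa_cm=0.2,
                   p_miss_asv=0.05, p_fa_asv=0.02, p_fa_spoof_asv=0.5,
                   c_miss=1.0, c_fa=10.0, c_fa_spoof=10.0,
                   rho_tar=0.9, rho_non=0.05)
print(round(value, 4))
```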
While the previous version of t-DCF [20] contained four separate costs (two miss cases, two false-accept cases), this formulation was later revised [4]. For completeness, (2) includes the miss rates of both systems, but the impact of a miss is ultimately the same from the end-user's perspective regardless of which subsystem caused the error. For this reason, the two miss terms share the same cost and prior terms. If the cost and priors of false acceptance terms are the same for both scenarios, the t-DCF becomes the original DCF (1). However, the system designer may want to set the cost of falsely accepting a spoofing attack higher than the cost of a miss, as accepting an intentional attack (spoofing attack) is more undesirable than accepting a non-target speaker.
Each of the four error rates in (2) now depends on two different thresholds: one for the CM system and another for the ASV system. The authors of [4] discuss two different use cases: one where the ASV remains a fixed black box (fixed threshold and error rates that cannot be adjusted) and another where both the CM and the ASV systems are adjustable. The former, ASV-constrained t-DCF, is suited for scenarios where the two classifiers are developed by different parties. The latter, unconstrained t-DCF, is applicable when both classifiers are adjustable. A CM-constrained setup, where the CM is a fixed system, has not been considered in the context of t-DCF. The t-DCF was designed to evaluate the performance of CM systems in combination with ASV systems and to further develop CM systems; ASV systems are provided by a separate research community with rapid methodological progress [21], [22]. In a CM-constrained t-DCF setup, one would assume the CM system to be fixed and the ASV would be adjusted to fit the CM, which is contrary to the assumption of ASV being a fixed black box.

C. Optimizing DCF or t-DCF
Both (1) and (2) are helpful for performance evaluation but not optimization. All the error rates needed for computing these metrics are obtained by counting errors, leading to a generally non-differentiable optimization problem. We address this challenge using two approaches. First, we design a soft t-DCF metric using approximations for the hard error counts [6], [23], [7], which we can then differentiate and optimize for directly (Section V). However, due to these approximations, the softened metric does not match the original t-DCF. Instead, we can optimize directly for the original t-DCF metric without approximations using methods from reinforcement learning [11]. Before detailing our approaches (Sections IV and V), we provide a brief background on RL.

III. REINFORCEMENT LEARNING AND POLICY GRADIENTS
In this section, we cover the terminology and base concepts of RL and, more specifically, methods using the policy gradient theorem (PG) [24], which lies at the core of our proposed approach. A key feature of PG is the concept of sampling actions: instead of designing a system to output scores and then determining the decision threshold, RL trains systems that pick the actions themselves. These actions can be chosen deterministically or via stochastic sampling. This practice allows the RL system to naturally explore different paths of actions. Once a promising action has been found, the system is updated to choose this action more frequently ("exploit" knowledge) [11]. Unlike back-propagation-based loss-minimization training, PG methods can be used to train systems that are non-differentiable, much like the t-DCF metric, by estimating the true gradient which can then be used to optimize for the objective [24].

A. Reinforcement learning (RL)
RL aims to solve problems formulated as Markov decision processes (MDPs), which consist of five elements. Using RL notation and the speaker verification setup as an example, the ASV system π : X → {ACCEPT, REJECT} (policy) chooses an action a ∈ {ACCEPT, REJECT} given a state x ∈ X, either deterministically or stochastically. Here π is usually a neural network with millions of parameters, and the state is a pair of utterances the ASV will compare. After choosing an action, the policy is rewarded with r = R(x, a), where the reward function R is determined by the designer; for example, the reward can be positive for the correct choice and zero otherwise. The task for the policy is simple: pick the action with the highest expected reward. To minimize an error metric, one can maximize its negative.
In the generic case, an MDP lasts for multiple timesteps t ∈ N, starting from an initial state sampled from a distribution $x_0 \sim p(x_0)$ and, potentially, ending at a terminal state $x_T$. The subscripts 0 and T denote the timestep in which the state was encountered, and T ∈ N represents the final timestep when the episode ends. After each action, the MDP transitions to the next state according to the transition dynamics $x' \sim p(x' \mid x, a)$, where $x'$ denotes the state succeeding $x$. The goal in MDPs is to obtain an optimal policy $\pi^{*}$ which maximizes the expected sum of rewards
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}\left[\sum_{t=0}^{T} R(x_t, a_t)\right]. \tag{3}$$
The expectation is computed over all of these sampling processes.

B. Policy gradients and REINFORCE
One approach to obtain the optimal policies is to use the PG [24] to compute the gradient of the learning objective with respect to the policy. Compared to other approaches such as value-based learning [11], PG can be applied more generally; the only requirement is that the system must output probabilities for the actions it chooses. If the policy $\pi_\theta$ is parametrized with parameters θ (e.g., parameters of a neural network), the PG can be used to update parameters as follows:
$$\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}\left[\sum_{t=0}^{T} R(x_t, a_t)\right], \tag{4}$$
where α > 0 is the step size and the gradient of the expectation is
$$\nabla_\theta \mathbb{E}\left[\sum_{t=0}^{T} R(x_t, a_t)\right] = \mathbb{E}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid x_t)\right)\left(\sum_{t=0}^{T} R(x_t, a_t)\right)\right], \tag{5}$$
where $\nabla_\theta \bullet$ denotes the vector of all partial derivatives of $\bullet$ with respect to θ, or the gradient. The objective can be increased by moving the policy's parameters a small step in the direction of the gradient (a gradient ascent step). Intuitively, the right side of PG (5) suggests that to increase the expected return, one should encourage the policy to take actions that lead to a high sum of rewards and discourage actions that lead to a negative sum of rewards. This becomes even more apparent in our ASV setting, where we only have one step and one reward: the system sees a single state (the input), determines an action, and gets rewarded. If we set the reward for the correct decision to be r = 1, the PG suggests that we should encourage the ASV system to take that same action more often. Likewise, if the decision was wrong and we set the reward to be negative, r = −1, the PG suggests that we should reduce the probability of that action to improve our reward.
One of the earliest RL algorithms using PG is REINFORCE [15], which iteratively updates the policy by performing two steps:
1) Run a policy in the environment for multiple episodes.
2) Estimate PG with collected data and update the policy.
If the completed episodes have a high/positive sum of rewards, the actions the policy took will be encouraged; likewise, actions related to bad episodes will be discouraged. This estimation of the PG has been shown to have zero bias but high variance: PG does provide the correct gradient in expectation, but in practice, any single sampled estimate has a large error. Reducing this variance and improving PG-based methods is an ongoing area of research (e.g., [25] and [26]).
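The zero-bias property can be checked numerically in the one-step setting discussed above. The sketch below is an illustration under stated assumptions rather than the training code of this work: the policy is a single Bernoulli decision with logit `theta`, the reward is +1 for ACCEPT and −1 for REJECT, and the gradient of the log-probability with respect to the logit is written out by hand as `a - p`.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_estimate(theta, reward, rng):
    """Single-sample REINFORCE estimate of d E[R] / d theta for a Bernoulli policy."""
    p = sigmoid(theta)
    a = 1 if rng.random() <= p else 0
    # For a Bernoulli policy with logit theta: d/dtheta log pi(a) = a - p.
    return reward(a) * (a - p)


def exact_gradient(theta, reward):
    """Closed-form gradient of E[R] = p * R(1) + (1 - p) * R(0)."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (reward(1) - reward(0))


def reward(a):
    # Hypothetical reward: ACCEPT (1) is the correct decision for this trial.
    return 1.0 if a == 1 else -1.0


rng = random.Random(0)
theta = 0.3
n_samples = 20000
estimate = sum(reinforce_estimate(theta, reward, rng) for _ in range(n_samples)) / n_samples
exact = exact_gradient(theta, reward)
print(estimate, exact)
```

Averaging many single-sample estimates approaches the exact gradient, while an individual sample can be far from it when rewards differ strongly between actions, which is the variance issue noted above.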

IV. REINFORCEMENT LEARNING
Building upon the background information provided in Section III, we now describe how to use RL to optimize tandem systems composed of ASV and CM systems. We also compare the described method to alternative solutions.

A. Tandem systems as Markov decision processes (MDPs)
The training procedure is outlined in Algorithm 1, and the flow of information is illustrated in Figure 1. The tandem system consisting of ASV and CM systems can be considered a simple MDP with only one transition and two actions: ACCEPT or REJECT. In each episode, both systems receive their respective inputs and determine an action via stochastic sampling. In practice, this process is performed by picking a uniform random number from the interval [0, 1] and checking if it is lower than or equal to the sigmoided output value; if so, the decision is ACCEPT; otherwise, REJECT. If the sigmoided output value is close to one, then the decision is almost certainly ACCEPT. The final decision is determined by computing the logical-and of the two system decisions.
Finally, the tandem system is rewarded if it correctly accepted or rejected the trial. We then compute the PG loss (5) by averaging the loss over a mini-batch and update all parameters of both systems with a gradient ascent step. We use the PyTorch library [27] and its automatic differentiation to compute gradients throughout the ASV and CM systems.
While this solution is general and can be applied to different systems, two aspects require more in-depth study: how to select an action and what the reward function should be. Picking the action is especially relevant to our scope of ASV and CM systems, as these systems are normally trained to output raw scores rather than probabilities. For the sake of generalization, we assume that the scores are scalars, where a higher value reflects more evidence towards accepting the trial.
Algorithm 1 Optimizing a tandem system with REINFORCE.
Input: Pre-trained ASV π_asv and CM π_cm with combined parameters θ, dataset D, reward function R, mini-batch size B
while training do
    Initialize loss L ← 0
    for i ∈ {1, 2, ..., B} do
        // Sample one trial with true class labels y_asv and y_cm.
        x_asv, x_cm, y_asv, y_cm ∼ D
        // Sample actions (binary decisions).
        a_asv ∼ π_asv(x_asv)
        a_cm ∼ π_cm(x_cm)
        // Combine for tandem action.
        a_tandem ← a_asv ∧ a_cm
        // Compute probability of the tandem action.

The reward function, on the other hand, defines the objective function we ultimately wish to maximize (or a loss function we wish to minimize).
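The tandem episode described in this subsection can be sketched in plain Python as follows. Several parts are assumptions made for illustration: the probability of the tandem action is taken to be the product of the two ACCEPT probabilities (since the two actions are sampled independently), the reward is the simple ±1 reward for a correct or incorrect tandem decision, and the loss is written as the negative of reward times log-probability so that minimizing it corresponds to a gradient ascent step on the reward. The sketch only evaluates the loss; in practice an automatic differentiation library such as PyTorch would propagate its gradient to the parameters of both systems.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_action(p_accept, rng):
    """ACCEPT (1) if a uniform draw from [0, 1) is at most p_accept, else REJECT (0)."""
    return 1 if rng.random() <= p_accept else 0


def tandem_log_prob(a_tandem, p_asv, p_cm):
    """Log-probability of the tandem action; tandem ACCEPT requires both systems to accept."""
    p_accept = p_asv * p_cm
    return math.log(p_accept) if a_tandem == 1 else math.log(1.0 - p_accept)


def simple_reward(a_tandem, y_asv, y_cm):
    """+1 if the tandem decision matches the logical-and of the labels, else -1."""
    return 1.0 if a_tandem == (y_asv & y_cm) else -1.0


def reinforce_loss(batch, reward_fn, rng):
    """Mini-batch REINFORCE loss; batch items are (s_asv, s_cm, y_asv, y_cm)."""
    loss = 0.0
    for s_asv, s_cm, y_asv, y_cm in batch:
        p_asv, p_cm = sigmoid(s_asv), sigmoid(s_cm)
        a_tandem = sample_action(p_asv, rng) & sample_action(p_cm, rng)
        r = reward_fn(a_tandem, y_asv, y_cm)
        loss += -r * tandem_log_prob(a_tandem, p_asv, p_cm)
    return loss / len(batch)


# Hypothetical calibrated scores and labels: target, non-target, spoof, target.
batch = [(2.0, 1.5, 1, 1), (-1.0, 2.0, 0, 1), (1.0, -2.5, 1, 0), (0.5, 0.5, 1, 1)]
loss = reinforce_loss(batch, simple_reward, random.Random(0))
print(loss)
```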

B. Mapping system scores to decision probabilities with calibration
By default, the outputs of the system are raw detection scores s. To turn these into action probabilities, we could apply the sigmoid function
$$\pi(\mathrm{ACCEPT} \mid x) = \sigma(s) = \frac{1}{1 + e^{-s}}, \tag{6}$$
where s is a score from either an ASV or CM system and σ is the sigmoid function. However, depending on the range of the scores, this design may result in degenerate action probabilities (see Figure 2 for a demonstration). Fortunately, a related problem has been studied in the form of score calibration [19], [28]. After calibration, the output scores can be treated as log-likelihood ratios (LLRs) of two hypotheses [19],
$$s = \log \frac{p(x \mid H_s)}{p(x \mid H_d)}, \tag{7}$$
where, for ASV, $H_s$ is the hypothesis that two utterances come from the same speaker and $H_d$ is the hypothesis that the utterances come from different speakers; for CM, $H_s$ is the hypothesis that a sample is bonafide and $H_d$ is the hypothesis that the sample is a spoof sample. A convenient property of LLR is that we can obtain a posterior probability of either hypothesis with [29]
$$p(H_s \mid x) = \sigma\left(s + \log \frac{p(H_s)}{p(H_d)}\right), \tag{8}$$
where $\log \frac{p(H_s)}{p(H_d)}$ is the prior log-odds of the two hypotheses. These two priors are set by the system designer according to their best knowledge of how often target (positive) trials occur [30]. Using this relationship, we define the posterior probability $p(H_s \mid x)$ as the probability of choosing the ACCEPT action:
$$\pi(\mathrm{ACCEPT} \mid x) = p(H_s \mid x).$$
To calibrate the systems, we use parametric discriminative calibration, where we transform scores with an affine transformation $s' = as + b$, where the calibration parameters (a, b) are obtained via training [30] and $s'$ is used in (8) instead of $s$. Following [30] and [22], we obtain (a, b) by minimizing a calibration loss function over P and N, the sets of positive (same speaker) and negative (different speaker) trials, respectively.
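The following sketch illustrates the calibration idea on toy data. The exact calibration loss is not reproduced here, so the sketch assumes a class-balanced binary cross-entropy on the affinely transformed scores, minimized with plain gradient descent. The scores, step size, and number of steps are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    """log(1 + exp(z)) computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def calibration_loss(a, b, pos, neg):
    # Assumed loss: class-balanced cross-entropy on calibrated scores a * s + b.
    loss_pos = sum(softplus(-(a * s + b)) for s in pos) / len(pos)
    loss_neg = sum(softplus(a * s + b) for s in neg) / len(neg)
    return loss_pos + loss_neg


def fit_calibration(pos, neg, steps=500, lr=0.01):
    """Gradient descent on (a, b), starting from the identity mapping."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s in pos:
            g = -(1.0 - sigmoid(a * s + b)) / len(pos)
            grad_a += g * s
            grad_b += g
        for s in neg:
            g = sigmoid(a * s + b) / len(neg)
            grad_a += g * s
            grad_b += g
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def accept_probability(score, a, b, prior_target=0.5):
    """Posterior of the accept hypothesis: sigmoid of calibrated score plus prior log-odds."""
    prior_log_odds = math.log(prior_target / (1.0 - prior_target))
    return sigmoid(a * score + b + prior_log_odds)


# Hypothetical raw scores: all positive, so a plain sigmoid accepts almost everything.
pos = [8.0, 12.0, 5.0, 9.0]
neg = [6.0, 2.0, 1.0, 4.0]
a, b = fit_calibration(pos, neg)
print(sigmoid(1.0), accept_probability(1.0, a, b))
```

With the raw scores, even the lowest negative trial would be accepted with probability above 0.7; after the affine transformation the same trial receives a clearly lower ACCEPT probability, which is the degenerate behavior calibration is meant to remove.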

C. Optimizing for the t-DCF
Above we described a simple reward function of r = 1 for correct decisions and r = −1 for incorrect decisions. However, while maximizing this reward function would improve the tandem system, it will not optimize the t-DCF function itself. A convenient property of PG (5) is that it maximizes any reward function on the right-hand side. Therefore, if we substitute the negative of our t-DCF function as a reward function, we can then directly optimize the t-DCF: maximizing the negative of the t-DCF leads to minimizing the t-DCF.
Recall that the t-DCF (2) consists of three separate error terms/scenarios: the tandem system rejects a target speaker (miss), the system accepts a non-target speaker (false accept, or fa), or the system accepts a spoof trial (fa, spoof). If we compute t-DCF for a single trial, we will have exactly one of these errors or a true positive/negative. We can describe the cost of an individual trial as follows:
$$\text{t-DCF}_{\mathrm{single}} = \begin{cases} C_{\mathrm{miss}} \cdot \rho_{\mathrm{tar}} & \text{if } y_{\mathrm{tandem}} = 1 \text{ and } a_{\mathrm{tandem}} = 0 \text{ (miss)}, \\ C_{\mathrm{fa}} \cdot \rho_{\mathrm{non}} & \text{if } y_{\mathrm{asv}} = 0,\ y_{\mathrm{cm}} = 1 \text{ and } a_{\mathrm{tandem}} = 1 \text{ (fa)}, \\ C_{\mathrm{fa,spoof}} \cdot \rho_{\mathrm{spoof}} & \text{if } y_{\mathrm{cm}} = 0 \text{ and } a_{\mathrm{tandem}} = 1 \text{ (fa, spoof)}, \\ 0 & \text{otherwise}, \end{cases}$$
where $y_{\mathrm{asv}}$ and $y_{\mathrm{cm}}$ are true labels for the ASV and CM systems, respectively, and $y_{\mathrm{tandem}}$ is the 'logical-and' of these true labels. Actions $a$ are the system outputs. We assume that all spoof trials aim to mimic the target speaker. We take the negative of $\text{t-DCF}_{\mathrm{single}}$ to construct the reward function for optimizing the t-DCF. Much like the t-DCF metric used for evaluation (2), a core aspect of this reward function is that different error situations are weighted differently according to their impact (cost) and prevalence (prior terms). Even if the tandem optimization were not able to improve the discriminative performance of the systems, weighing errors according to the t-DCF may bias the systems in the desired direction (e.g., assigning more cost to incorrectly accepting spoofing attacks biases the CM system to reject borderline trials).
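A per-trial reward of this kind can be sketched as a small function. The case structure below is an assumption that follows the three error scenarios named in the text (miss, false accept of a non-target, false accept of a spoof), with each error weighted by its cost times its prior. The dictionary keys and numeric values are hypothetical.

```python
def tdcf_single(a_tandem, y_asv, y_cm, costs, priors):
    """Cost of one trial; labels and actions are 1 (accept/positive) or 0 (reject/negative)."""
    y_tandem = y_asv and y_cm
    if y_tandem and not a_tandem:
        return costs["miss"] * priors["tar"]        # target rejected (miss)
    if a_tandem and y_cm and not y_asv:
        return costs["fa"] * priors["non"]          # non-target accepted
    if a_tandem and not y_cm:
        return costs["fa_spoof"] * priors["spoof"]  # spoof accepted
    return 0.0                                      # correct decision


def tdcf_reward(a_tandem, y_asv, y_cm, costs, priors):
    """Reward is the negative single-trial cost, so maximizing it minimizes the cost."""
    return -tdcf_single(a_tandem, y_asv, y_cm, costs, priors)


# Illustrative parameter values only.
costs = {"miss": 1.0, "fa": 10.0, "fa_spoof": 10.0}
priors = {"tar": 0.9, "non": 0.05, "spoof": 0.05}

print(tdcf_reward(0, 1, 1, costs, priors))  # target rejected
print(tdcf_reward(1, 1, 0, costs, priors))  # spoof accepted
```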

V. METRICS
An alternative approach to RL is to design a differentiable ("soft") version of the t-DCF, which then can be minimized by gradient descent methods. This solution has been used with other metrics [6], [7], [8], including DCF [10], but not with t-DCF. Given its success with other metrics, we extend the idea of soft DCF to t-DCF to assess its applicability in tandem optimization.
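The only non-differentiable quantities in both the DCF and t-DCF metrics are the miss rate $P_{\mathrm{miss}}$ and the false acceptance rate $P_{\mathrm{fa}}$, defined as
$$P_{\mathrm{miss}}(\tau) = \frac{1}{|P|} \sum_{x \in P} \mathbb{1}[\pi(x) \le \tau], \qquad P_{\mathrm{fa}}(\tau) = \frac{1}{|N|} \sum_{x \in N} \mathbb{1}[\pi(x) > \tau],$$
where π is the detection system, τ is the decision threshold, P and N are the sets of positive and negative trials, and $\mathbb{1}$ is the indicator function that outputs 1 if the condition is true and 0 otherwise. This indicator function is non-differentiable. The authors in [10] proposed to soften these rates by applying the following approximations:
$$\hat{P}_{\mathrm{miss}}(\tau) = \frac{1}{|P|} \sum_{x \in P} \sigma(\tau - \pi(x)), \qquad \hat{P}_{\mathrm{fa}}(\tau) = \frac{1}{|N|} \sum_{x \in N} \sigma(\pi(x) - \tau), \tag{15}$$
where σ is the sigmoid function (6). Essentially, each hard error count is replaced by a smooth function of the signed distance between the score and the threshold. The soft DCF derived using these rates can then be minimized, and to minimize for any available threshold, the authors proposed optimizing the threshold τ as well as the parameters of the detection system. Following this approach, we derive a soft t-DCF by applying (15) to the error rates of (2). Optimization can therefore be applied to tune the systems and scores for the priors and cost weights set by the evaluation metric.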
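To illustrate the softening, the sketch below compares hard error rates (indicator counts) with their sigmoid-based counterparts on toy scores. It assumes a trial is accepted when its score exceeds the threshold and that the sigmoid is applied to the plain difference between score and threshold, without any extra scaling factor. The scores are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hard_rates(pos_scores, neg_scores, tau):
    """Exact miss / false-accept rates from indicator counts (non-differentiable)."""
    p_miss = sum(s <= tau for s in pos_scores) / len(pos_scores)
    p_fa = sum(s > tau for s in neg_scores) / len(neg_scores)
    return p_miss, p_fa


def soft_rates(pos_scores, neg_scores, tau):
    """Sigmoid-softened rates; smooth in both the scores and the threshold."""
    p_miss = sum(sigmoid(tau - s) for s in pos_scores) / len(pos_scores)
    p_fa = sum(sigmoid(s - tau) for s in neg_scores) / len(neg_scores)
    return p_miss, p_fa


pos = [4.0, 2.5, -1.0]   # hypothetical positive-trial scores
neg = [-3.0, -2.0, 1.5]  # hypothetical negative-trial scores

hard = hard_rates(pos, neg, tau=0.0)
soft = soft_rates(pos, neg, tau=0.0)
# Scores far from the threshold make the soft rates approach the hard counts.
soft_scaled = soft_rates([10 * s for s in pos], [10 * s for s in neg], tau=0.0)
print(hard, soft, soft_scaled)
```

The gap between `hard` and `soft` is the approximation error mentioned above; it shrinks when scores lie far from the threshold, which is one reason such soft metrics can be sensitive to the scale of the scores.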

VI. EXPERIMENTAL SETUP
Here we cover the datasets, features (front-end), neural network architectures, and training procedures used in the experiments to evaluate the above solutions. Before tandem training, we pre-train the ASV and CM systems separately. We repeat all the experiments three times and report the average results to avoid drawing conclusions influenced by noise [31], [32]. Such noise may stem from, for example, random initialization of the network, the sampling of data, and the sampling of actions in the REINFORCE method.

A. Datasets
The main dataset used in this work is the ASVspoof19 dataset (logical access, or LA, scenario) [3], which provides labels for speaker verification and spoof samples generated using different techniques. We use the ASVspoof19 dataset to train the initial countermeasures and to perform the tandem training. Statistics of the portion used for training independent ASV and CM are summarized in Table I.
As ASVspoof19 does not contain enough data on different speakers for training a reliable speaker verification system, we use the VoxCeleb2 dataset [33] to pre-train it. Specifically, we use the development part of the authors' protocol, which contains 5,994 unique speakers and 1,092,009 utterances. The whole VoxCeleb2 dataset was extracted from 150,480 unique videos (2,442 hours) with an average individual utterance length of 7.8 seconds; face recognition was used to capture speech of the persons of interest (celebrities). In this dataset, 61% of the speakers are male and 39%, female. Roughly 39% of the speakers are from the USA or UK by nationality [33]. We do not use evaluation sets of the VoxCeleb2 in this work.

B. Evaluation metrics
We evaluate the tandem systems using the normalized minimum t-DCF with the same parameters and protocol as in the ASVspoof19 challenge [3]. This approach uses the ASV-constrained form of the t-DCF, where we fix the threshold of the ASV system to its equal error rate (EER) point on target-non-target discrimination and then sweep over different thresholds for the CM system to obtain a minimal, normalized t-DCF. Individual ASV and CM systems are evaluated on their respective tasks with EER (ASV on target-non-target discrimination; CM on bonafide-spoof discrimination). Unless otherwise noted, all the t-DCF and EER values are computed using the ASV protocols provided by the ASVspoof19 dataset (trials in ASVspoof2019.LA.asv.dev.gi.trl.txt), including the performance of the CM system.
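For reference, an EER can be approximated by sweeping the threshold over the observed scores and taking the point where the miss and false-accept rates are closest. The sketch below is a simplified illustration of that idea, not the official ASVspoof scoring tool, and the scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds over all observed scores (accept iff score > tau)."""
    best_gap, best_eer = None, None
    for tau in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s <= tau for s in target_scores) / len(target_scores)
        p_fa = sum(s > tau for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            # Report the average of the two rates at the closest crossing point.
            best_gap, best_eer = gap, (p_miss + p_fa) / 2.0
    return best_eer


eer = equal_error_rate([0.9, 0.8, 0.3], [0.7, 0.2, 0.1])
print(eer)
```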

C. Speaker verification system (ASV)
The ASV system is inspired by the results of x-vector (or r-vector) systems [22] (MagNetO) and [21] (BUT-system), both of which have achieved state-of-the-art results in multiple ASV corpora. Due to memory limitations, we use the network architecture of the BUT-system [21] with the short-segment training setup from MagNetO [22], with the following modifications. First, we set the batch size to 32 (instead of 256) to accommodate our memory constraints. To compensate, we train for 200,000 steps (instead of 150,000). We use standard cross-entropy loss with the Adam optimizer to train the system with an initial learning rate of 0.001 and halve the learning rate every 10,000 training steps up to 100,000 training steps. We also use a small L2 regularization weight of $10^{-5}$. These settings were selected based on preliminary results with cosine scoring on the evaluation set of VoxCeleb2.
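The learning-rate schedule can be written as a small function of the training step. This sketch assumes the rate is halved at steps 10,000, 20,000, and so on up to and including step 100,000, after which it stays constant; the function and argument names are illustrative.

```python
def learning_rate(step, base_lr=0.001, halve_every=10_000, stop_after=100_000):
    """Step-wise halving schedule that stops decaying after `stop_after` steps."""
    halvings = min(step, stop_after) // halve_every
    return base_lr * 0.5 ** halvings


for step in (0, 10_000, 50_000, 100_000, 200_000):
    print(step, learning_rate(step))
```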
During training, each utterance is augmented with one of the following: music, babble, reverberation, or noise. We use impulse responses from the AIR RIR database [34] for reverberation and the MUSAN database [35] for speech, music, and noise sources. After augmentation, we cut out a random four-second segment from the utterance or discard the utterance completely if the utterance is shorter than four seconds. Log-mel filterbank energies are used as features (80 features per frame, 25ms window length, 10ms window hop), post-processed with sliding-window mean-normalization.
For the ASV back-end we chose to use neural PLDA (NPLDA) [36], which we train using the same VoxCeleb2 dataset for 100 epochs. The non-neural alternative, PLDA [37], [38], has been a successful back-end on the ASVspoof19 dataset [3]. Unlike PLDA, NPLDA is suitable for gradient updates, which we need for our tandem training. Each epoch consists of 1,000 mini-batches, each with 2,048 random pairs of utterances (and a balanced number of target and non-target pairs). The NPLDA model is trained with cross-entropy and L2 regularization (weight $10^{-5}$) losses using the Adam optimizer with a learning rate of 0.0001, which is halved every 20 epochs. All parameters were taken from the source code of the original NPLDA publication [36], apart from the training length, which we increased until the training loss converged. After training the system using the VoxCeleb2 dataset, we adapt the NPLDA to the ASV part of the ASVspoof19 corpus (bonafide trials from ASVspoof2019.LA.cm.trn.txt) by finetuning its parameters for one epoch, using the same settings, which we found to improve performance without overfitting to this small dataset. This approach reduced the EER from an average of 10.1% to 5.8% in the ASVspoof19 evaluation set. Training longer than one epoch worsened the results.

D. Spoofing countermeasure (CM) system
For the CM system, we use the source code and a system similar to the one described in [32], where the light convolutional neural network (LCNN) of [39] is used with linear frequency cepstral coefficient (LFCC) features. We chose this system based on its high accuracy with the ASVspoof19 dataset at the time of these experiments [32] and for its neural-based architecture, which we require for the gradient updates in the tandem training. The LFCC features follow the baseline recipe presented in the ASVspoof19 [3], with 512-length fast Fourier transform computed with 20ms windows and 10ms hops, processed with 20 linearly spaced filterbanks, along with delta and delta-delta features. Every utterance is cropped or padded to a fixed size of 750 feature vectors. Shorter utterances are padded with zeros from the end, and longer utterances are randomly cropped to fit 750 feature vectors. The model outputs a single sigmoid scalar and is trained on the dataset using cross-entropy. For details, see the description of the LCNN-trim-pad (sigmoid) system in [32].
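The crop-or-pad step can be sketched as follows. The feature matrix is represented as a list of frames, and the 60-dimensional frames in the example (20 coefficients plus delta and delta-delta) are an assumption made for illustration, as is the function name.

```python
import random


def trim_or_pad(frames, target_len=750, rng=random):
    """Randomly crop long utterances or zero-pad short ones (at the end) to target_len frames."""
    n = len(frames)
    if n >= target_len:
        start = rng.randint(0, n - target_len)  # inclusive bounds
        return frames[start:start + target_len]
    dim = len(frames[0])
    padding = [[0.0] * dim for _ in range(target_len - n)]
    return frames + padding


rng = random.Random(0)
short = [[1.0] * 60 for _ in range(300)]   # hypothetical 300-frame utterance
long = [[1.0] * 60 for _ in range(1200)]   # hypothetical 1200-frame utterance
print(len(trim_or_pad(short, rng=rng)), len(trim_or_pad(long, rng=rng)))
```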

E. Tandem training and baseline
After developing the pre-trained systems (see "Initial" results in Table II), we aim to optimize them using the techniques described in Sections IV and V. In all cases, we optimize the parameters of the NPLDA back-end of the ASV system and all parameters of the CM system. We only train the back-end of the ASV system due to memory limitations. As a baseline solution, we use the same training regimen to finetune the pre-trained systems separately using the same data. We emphasize that no tandem training is used in this setup; we merely use the same data to further update the models for a fair comparison against the proposed methods.
For tandem optimization, we use the trial list from ASVspoof19's development list, (ASVspoof2019.LA.asv.dev.gi.trl.txt). This file list contains 10 different speakers with 2,548 bonafide and 22,296 spoof samples. To avoid catastrophic forgetting and overfitting to the small number of samples of this dataset, we use standard stochastic gradient descent (without momentum) with a learning rate of 0.0001 and mini-batch size of 64, with a total of five epochs of training over the entire dataset. Each item in a mini-batch has an equal sampling probability of being either a target, a non-target, or a spoof trial.
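Because spoof samples heavily outnumber bonafide ones in this list, drawing trials uniformly from the file would yield mostly spoof trials. One way to obtain the equal per-type probability described above is to pick the trial type first and then a trial of that type. The sketch below illustrates this under assumptions: the function name, the dictionary layout, and the placeholder trial identifiers are hypothetical and are not taken from the original code.

```python
import random

TRIAL_TYPES = ("target", "nontarget", "spoof")


def sample_minibatch(trials_by_type, batch_size=64, rng=random):
    """Draw a mini-batch in which each item's trial type is chosen uniformly.

    `trials_by_type` maps each trial type to a non-empty list of trials.
    """
    batch = []
    for _ in range(batch_size):
        trial_type = rng.choice(TRIAL_TYPES)             # 1/3 probability each
        trial = rng.choice(trials_by_type[trial_type])   # uniform within the type
        batch.append((trial_type, trial))
    return batch


if __name__ == "__main__":
    # Placeholder pools with deliberately unequal sizes (hypothetical IDs).
    pools = {
        "target": [f"tar_{i}" for i in range(50)],
        "nontarget": [f"non_{i}" for i in range(200)],
        "spoof": [f"spf_{i}" for i in range(2000)],
    }
    batch = sample_minibatch(pools, rng=random.Random(0))
    print(len(batch), batch[:3])
```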
In total, we compare six different methods. We use the same t-DCF cost and prior parameters for these training methods as with our evaluation metric. These six methods are as follows: 1) Finetune.

VII. RESULTS
We report results in two ways: raw metrics are provided in Table II, and learning curves of relative changes are shown in Figure 3. The learning curves show the change in a metric as training progresses. This allows us to study how the change in one metric affects others and, importantly, how ASV and CM performance affects t-DCF.
All optimization groups are able to improve t-DCF in the training set, but not reliably in the evaluation set, unless we remove outlier spoofing attacks (these results are further discussed in Section VIII-A). Interestingly, while finetuning improves the EER of both ASV and CM in the training set, REINFORCE with t-DCF reward and score calibration only marginally improves ASV EER and worsens CM EER while still improving t-DCF. This finding indicates that tandem training is able not only to optimize the individual systems for better discrimination but also to optimize for the operating point induced by t-DCF and its cost function.
In the evaluation set, finetuning reduces the t-DCF while the EER of both the ASV and CM systems increases. This observation is interesting, as finetuning is not done in tandem, so we would expect any improvement in t-DCF to come from the two systems collaborating. Nonetheless, results indicate that the systems sacrifice per-task performance for better tandem results.
While REINFORCE approaches do not yield positive results, they have one advantage over finetuning: the performance of the ASV system does not get considerably worse. The disparity between the training and evaluation sets in the ASV EER suggests that finetuning overfits to the training set, but this is not the case for REINFORCE. Optimizing for soft t-DCF yields similar results to REINFORCE but with more pronounced changes in metrics.
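As a reading aid for the learning curves, the sketch below computes a signed relative change, assuming it is measured against the metric's value before tandem training; the exact definition used for the figures is not restated here. The function name and the t-DCF values are hypothetical and chosen only to illustrate what a 20% relative reduction looks like.

```python
def relative_change(initial, current):
    """Signed relative change in percent; negative means the metric decreased."""
    if initial == 0:
        raise ValueError("initial value must be non-zero")
    return 100.0 * (current - initial) / initial


# Hypothetical t-DCF values after successive epochs, for illustration only.
tdcf_values = [0.100, 0.095, 0.088, 0.080]

# Curve of relative changes with respect to the starting value.
tdcf_curve = [relative_change(tdcf_values[0], value) for value in tdcf_values]

if __name__ == "__main__":
    print([round(point, 1) for point in tdcf_curve])
```

Since lower is better for both EER and t-DCF, a negative relative change corresponds to an improvement.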

VIII. DISCUSSION AND FURTHER ANALYSIS
The aggregated results are less than promising. While we see positive learning in the training set in all cases, this success does not translate to the unseen attacks in the ASVspoof19 dataset. To understand these results better, we perform further analysis by studying results for each spoofing attack, cross-task performance, and optimization of the ASV front-end.

A. Splitting results by spoofing attacks
The ASVspoof19 dataset consists of 19 different attacks for creating the spoof samples: six in the development set and 13 in the evaluation set. All 13 evaluation attacks are unseen. Given the variation of CM performance between these attacks, it is instructive to study per-attack performance [3]. Table III separates the CM EER results by attack. As with the baseline results in ASVspoof19 [3], most attacks are well detected with low EERs, apart from the last three (A17-A19). A17 is exceptionally difficult to detect.
TABLE III: The CM EERs (%) after tandem training, separated by spoofing attack. Values in grey indicate results worse than the initial EER. Systems A01-A06 are included in the development (training) set, while A07-A19 are only in the evaluation set.
A common feature of these three attacks is that they modify genuine human speech to perform the spoofing attack, while the other systems synthesize speech in one way or another. This feature makes it harder for CM systems to identify A17-A19 as spoofing attacks, but these attacks also do not fool the ASV system (i.e., they are less effective as spoofing attacks) [3]. Specifically, A17 applies waveform filtering to genuine human speech but makes only small modifications to the input speech. As such, the utterance resembles the source speaker more than the target speaker (low ASV EER) while remaining difficult to detect (high CM EER). By studying the ASV EERs of the different spoofing attacks (Appendix A), we find that attacks A17-A19 are weaker than most of the other attacks. A17 has an ASV EER of 7.7%, which is close to the bonafide ASV EER of 5.77% (Table II), indicating that most of the error stems from imperfections in the ASV system. The same was observed in the original ASVspoof19 publication [3]. The difficulty in detecting A17-A19 could also be attributed to the length of silences in these attacks [40], which makes them outliers among all the attacks. In our tandem training, the CM EERs of these three attacks increase, while the EERs of most other attacks decrease.
Figure 4 and Table II present the learning curves and results in the evaluation set without A17-A19. All methods except finetuning improve results in all metrics. REINFORCE combined with the t-DCF reward yields the greatest improvement in the t-DCF. Interestingly, while soft t-DCF optimization also reduces t-DCF, it quickly plateaus and shows no further improvement. Meanwhile, REINFORCE with the t-DCF reward shows a declining learning curve, indicating that the results could improve with further training.
Curiously, while the original ASV and CM systems were not guaranteed to output scores that are calibrated to be LLRs, as discussed in Section IV-B, calibrating the scores and using class posterior probabilities did not seem to have a drastic effect on the training results (Figure 4, third and fifth panel). The use of a reward function based on t-DCF costs, however, stabilized the learning and yielded better performance.

B. Analysis of ASV and CM systems in cross-system tasks
The goal of tandem optimization is to train the two systems to function better together, with one intuitive benefit being that both systems might learn each other's task. For example, the ASV system could learn to reject spoof samples that the CM system struggles to detect. To this end, Table IV presents the performance of the ASV system in the CM task. This cross-task evaluation is done by taking the scores of the ASV system for a trial but using the true spoofing labels to evaluate its CM EER. We omit results for the CM system's performance in the ASV task, which were 50% in all cases. This is because the CM system does not receive the enrollment utterance of the claimed identity as an input, and as such, it cannot tell whether the input utterance is from the target speaker.
The results indicate that the ASV system does not substantially shift towards performing the CM task, especially in the evaluation set. We only see a 1% EER decrease when using finetuning, which could explain the increased ASV EER in the main results (Figure 3, lower left corner). However, this result cannot be explained by the ASV system learning to perform the CM task, as finetuning does not involve real tandem training and the ASV system was only trained with ASV labels. Removing the spoofing attack systems A17-A19 and recomputing these numbers did not change the results.
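The cross-task evaluation amounts to computing an EER from one system's scores against the other task's labels. The following is a simple threshold-sweep approximation of the EER, not the official ASVspoof scoring tool; the function name and the toy scores are assumptions for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    `labels` holds 1 for the positive class (e.g. bonafide) and 0 otherwise.
    Returns the mean of miss and false-alarm rates where they are closest.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # Miss: a positive scored below the threshold.
        miss = sum(s < threshold for s in positives) / len(positives)
        # False alarm: a negative scored at or above the threshold.
        false_alarm = sum(s >= threshold for s in negatives) / len(negatives)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


if __name__ == "__main__":
    # Toy ASV scores evaluated against spoofing labels (1 = bonafide, 0 = spoof).
    asv_scores = [2.1, 1.7, -0.3, -1.2, 0.4, -2.0]
    is_bonafide = [1, 1, 1, 0, 0, 0]
    print("ASV system, CM task EER:", equal_error_rate(asv_scores, is_bonafide))
```

A scorer whose output carries no information about the label, such as one that returns the same value for every trial, ends up at 0.5 under this measure, which matches the chance-level 50% reported for the CM system in the ASV task.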

C. Training ASV front-end
We limited ourselves to optimizing the ASV system's back-end only, as the utterance embedding (front-end, Section VI-C) model was too large to fit into memory. The results above indicate that ASV performance does not change considerably, which could be a result of this limitation. To study this possibility, we repeat the training procedure, but this time we also optimize the ASV utterance embedding model. Due to memory restrictions, we use a batch size of 24 instead of 64, with other settings remaining as above. The policies are thus updated for more steps than in the previous experiments.
The results for all spoofing attacks are shown in Figure 5, and the results with attacks A17-A19 excluded are shown in Figure 6. Again, ASV EER either does not change or gets considerably worse. Unlike in the original results, soft t-DCF optimization leads to increased ASV EER in the evaluation set, which seems to affect the tandem system as a whole and results in an increased t-DCF value. REINFORCE yields similar results to earlier findings with no significant change. Note that the CM EER and t-DCF keep improving in REINFORCE training when outlier spoofing attacks are removed, despite the smaller batch size and longer training (Figure 6). We hypothesize that the ASV EER does not improve due to a lack of training material, in which case it is easy to overfit the training data. Curiously, REINFORCE optimization does not seem to be subject to this flaw.
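The effect of the smaller batch size on the number of parameter updates can be estimated with simple arithmetic. The sketch assumes that one epoch visits as many items as there are samples in the development list (2,548 bonafide plus 22,296 spoof); the text does not state exactly how an epoch is counted, so the absolute numbers are indicative only. The function name is hypothetical.

```python
import math


def updates_per_epoch(num_items, batch_size):
    """Number of mini-batch updates needed to visit `num_items` items once."""
    return math.ceil(num_items / batch_size)


# Assumption: an epoch covers every sample in the development list once.
NUM_ITEMS = 2_548 + 22_296

steps_batch_64 = updates_per_epoch(NUM_ITEMS, 64)
steps_batch_24 = updates_per_epoch(NUM_ITEMS, 24)

if __name__ == "__main__":
    print("updates per epoch, batch size 64:", steps_batch_64)
    print("updates per epoch, batch size 24:", steps_batch_24)
    print("ratio:", round(steps_batch_24 / steps_batch_64, 2))
```

Under this assumption, the batch size of 24 gives roughly 64/24, or about 2.7 times, as many updates per epoch as the batch size of 64.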

IX. CONCLUSION
ASV and spoofing CM systems are trained separately from each other, yet they are evaluated as one "tandem" system. In this work, we evaluated three approaches for optimizing such tandem systems for spoof-robust speaker verification: finetuning, RL, and a differentiable (soft) tandem evaluation metric. Our results indicate that while traditional finetuning can yield benefits, it tends to overfit the training data. RL-based optimization methods and differentiable evaluation metrics yield solid improvements once outlier systems have been removed (up to 20% relative improvement), while finetuning still exhibits overfitting behaviour. These results indicate that the two suggested methods, RL and the soft metric, yield the same or better results than naive finetuning of the systems separately and that they should be used for such optimization. RL in particular yielded stable results regardless of changes in the training setup and, at the least, did not degrade the results in the way the soft metric occasionally did.
Considering that two works in this area have reported similar results (this and the previous work [16]), the tandem optimization of ASV systems combined with a CM system seems promising. Nevertheless, this work should be continued by, for example, performing large-scale experiments with different ASV and CM models. Compared to previous work [16], this work applied recent state-of-the-art systems and evaluated multiple variations of the tandem optimization methods. Nevertheless, new systems and datasets are continuously presented, and the same tandem optimization experiments could be repeated with these. Specifically, given the variations in recording conditions and speakers of different datasets, future work should explore the cross-domain performance of tandem training. This could be done by training on one ASVspoof dataset and evaluating on another, such as ASVspoof19 and ASVspoof2021 [41], once the ground truth labels have been publicly shared. Finally, given the general applicability of the RL approach, we believe that it can be used as a general tool for directly optimizing the evaluation metrics in different domains, but its performance should be evaluated with proper experimentation.

Table V shows the EERs of the ASV system in discriminating bonafide target trials from spoofing attack trials: if the ASV EER is high, the spoofing attack was successful at bypassing the ASV system.