Encoder-Decoder Based Attractors for End-to-End Neural Diarization

This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional cascaded approach to speaker diarization, EEND methods handle speaker overlap better. However, EEND still has the disadvantage that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce an encoder-decoder-based attractor calculation module (EDA) to EEND. Once frame-wise embeddings are obtained, EDA sequentially generates speaker-wise attractors on the basis of a sequence-to-sequence method using an LSTM encoder-decoder. The attractor generation continues until a stopping condition is satisfied; thus, the number of attractors can be flexible. Diarization results are then estimated as dot products of the attractors and embeddings. The embeddings from speaker overlaps result in larger dot product values with multiple attractors; thus, this method can deal with speaker overlaps. Because the maximum number of output speakers is still limited by the training set, we also propose an iterative inference method to remove this restriction. Further, we propose a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against cascaded approaches. Extensive evaluations on simulated and real datasets show that EEND-EDA outperforms the conventional cascaded approach.


I. INTRODUCTION
SPEAKER diarization is the task of estimating multiple speakers' speech activities from input audio (sometimes referred to as the "who spoke when" problem) [1]. It can be placed as a downstream task of automatic speech recognition (ASR), in which speaker information is tagged to each transcribed utterance [2]-[4]. It can also be used as a prior step to speech separation and the following ASR. For example, in guided source separation [5], speech activities are used as constraints to update time-frequency masks of a complex angular central Gaussian mixture model. The speech-activity-driven speech-extraction neural network [6] takes acoustic features and a target speaker's speech activity to perform fully neural speech separation.
Classical cascaded methods treat speaker diarization as a partition problem. Given a set of time frames, they first detect speaker-active frames and then divide them into clusters by using speaker embeddings extracted with a sliding window. The number of clusters, which represents the number of speakers, is determined in the clustering step during inference. Eigenvalue analysis on the graph Laplacian of a similarity matrix calculated from frame-wise embeddings is one way to estimate the number of speakers explicitly [7], [8]. If agglomerative hierarchical clustering is employed as a clustering algorithm, a threshold value is usually preset, and the number of clusters, i.e., the number of speakers, is dynamically determined by the threshold value [9]. Either way, the number of clusters can be set flexibly during inference. However, cascaded methods have one fundamental problem: they basically cannot handle speaker overlaps because each speech frame is usually assigned to a single speaker.
S. Horiguchi and Y. Xue are with Hitachi, Ltd. Y. Fujita is with LINE corporation. This work was done while he was with Hitachi, Ltd. S. Watanabe is with Carnegie Mellon University. P. García is with Johns Hopkins University.
©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Some neural-network-based end-to-end methods, in comparison, naturally handle speaker overlap with a single network. For example, the Recurrent Selective Attention Network (RSAN) [10], [11] decodes speech activity for each speaker one by one until a stopping condition is satisfied. However, it requires clean speech to be trained as a mask-based speech separation model. End-to-end neural diarization (EEND) [12]-[14], which estimates multiple speakers' speech activities at once from input audio, does not require such clean speech for training. The limitation is that the original EEND fixes the output number of speakers; thus, knowing the number of speakers in advance is a requirement.
In our previous study [15], we introduced an encoder-decoder-based attractor calculation module (EDA) as part of the self-attentive EEND model [13] to handle unknown numbers of speakers (EEND-EDA). It calculates attractors from frame-wise embeddings using a sequence-to-sequence method with an LSTM encoder-decoder; thus, the number of attractors can be flexible. In general, sequence-to-sequence methods require a stopping criterion in their decoding process. To decide when to stop the attractor calculation, EDA also estimates whether each calculated attractor really corresponds to a speaker. The diarization results are calculated as dot products between the attractors and frame-wise embeddings. Despite being designed for the diarization of flexible numbers of speakers, it has also performed better than the original EEND under fixed-number-of-speakers conditions. Compared with other EEND extensions for unknown numbers of speakers [16], [17], it performed the best on various datasets including the CALLHOME and DIHARD III datasets [18]. Several studies have also proposed extensions to EEND-EDA to allow online processing [19], [20].

arXiv:2106.10654v2 [eess.AS] 28 Mar 2022
In this paper, we revisit EEND-EDA with more comprehensive discussions and formulations and propose several extensions from the original EEND-EDA presented in [15]. The modifications from the original EEND-EDA study are summarized as follows:
• We discuss the relationship between the original EEND and EEND-EDA, which explains EEND-EDA's better performance in a fixed-number-of-speakers evaluation.
• We also propose refining the training strategy of EEND-EDA, which resulted in a 2.41% DER improvement on the CALLHOME dataset from the original paper [15].
• In the history of diarization studies, it has been difficult to compare the results of cascaded approaches and EEND-based approaches because the former are often evaluated with an oracle speech activity detection (SAD), while EENDs perform SAD and diarization simultaneously. To conduct fair comparisons between cascaded and EEND-based approaches, this paper introduces SAD post-processing to align diarization results from EEND-EDA with external SAD results.
• We also propose an iterative inference method for handling the problem of the number of outputs of EEND-EDA being empirically limited by its training dataset.
• We conduct thorough evaluations and analyses on simulated and real datasets including CALLHOME, CSJ, AMI, DIHARD II, and DIHARD III.
Neural-network-based methods that directly produce diarization results from audio are emerging [10], [11]. One strength of such methods is that they require no extra modules for SAD or overlap handling. For some methods, models have been trained for speech separation, and diarization results have been obtained as byproducts [10], [11]. Such models have been trained on the basis of clean speech (or time-frequency masks calculated from clean speech); thus, they cannot be trained on real mixtures like DIHARD datasets [38], [39]. However, EEND-based models are trained to output multiple speakers' speech activities; they do not require clean speech for training and real mixtures can be used. The original EEND [12]- [14] can output diarization results for a fixed number of speakers. To extend the EEND for an unknown number of speakers, two approaches have been investigated. One is an attractor-based approach [15], [19], and the other is a speaker-wise conditional EEND (SC-EEND) [16], [17]. In this paper, we investigate the attractor-based EEND because it showed better performance compared to SC-EEND.
B. Speech processing based on neural networks for unknown numbers of speakers
While some methods have achieved promising results with a fixed number of output speakers in diarization [12], [13], [40] and speech separation [41]-[44] contexts, it is challenging to make them able to deal with unknown numbers of speakers. The difficulty of neural-network-based speech processing for unknown numbers of speakers is that we cannot fix the output dimension.
One possible approach is to determine the maximum number of speakers to decode. In this case, the number of outputs is set to a sufficiently large value. Some methods treat a flexible number of speakers by outputting null speech activities if the number of outputs is smaller than the network capacity [45]. However, this approach did not work well with EEND (see [16]). In other methods, the number-of-speaker-wise output branches are trained independently, and the most probable is used during inference [46]. In this case, we have to know the maximum number of speakers. One of the strengths of EEND is that it can be finetuned using a target domain dataset from a pretrained model, but we usually cannot access the maximum number of speakers of the target domain beforehand. Therefore, a method that does not require that the maximum number of speakers be defined would be preferable.
Another approach is to decode speakers one by one until a stopping condition is satisfied, like SC-EEND [16]. For speech separation, RSAN [10], [11] and one-and-rest permutation invariant training (OR-PIT) [47] can be used. The key difference between speech separation and diarization is whether or not the residual output can be defined. RSAN uses a mask-based approach, in which each time-frequency bin is softly assigned to each speaker so that the process finishes when all the elements of the residual mask become zero. OR-PIT is time-domain speech separation by which residual output is determined as a mixture that contains other speakers rather than the target speaker. Both require clean recordings to determine oracle masks or signals. However, they are not always accessible in the diarization context, in which only multi-talker recordings and speech segments are provided.
In this paper, we adopted an attractor-based approach like deep attractor networks (DANet) [45], [48]. While the number of speakers [48] or maximum number of speakers [45] is fixed for the original DANet, in this paper, we calculated a flexible number of attractors without defining them.

C. Neural-network-based representative vector calculation
There have been several efforts to calculate representative vectors from a sequence of embeddings in an end-to-end trainable fashion. For example, Set Transformer [49] enables set-to-set transformation, which can be used to calculate cluster centroids from a set of embeddings. However, the number of outputs has to be known in advance, so it cannot be used for our purpose. Meier et al. proposed an end-to-end clustering framework [50], in which clustering is performed for every possible number of clusters $K \in \{1, \dots, K_{\max}\}$ and the result for the most probable number of clusters is used. The framework performs the clustering of a flexible number of clusters in an end-to-end manner, but the maximum number of clusters is limited by $K_{\max}$. EDA in this paper, in comparison, determines a flexible number of attractors from an input embedding sequence without prior knowledge of the number of speakers. Thus, we can use datasets with different maximum numbers of speakers during pretraining and finetuning.

III. METHOD
In this section, we first introduce the conventional EEND in Section III-A followed by an explanation of a natural extension of the method called attractor-based EEND in Section III-B. We also provide novel inference techniques in Section III-C.

A. Conventional end-to-end neural diarization
End-to-end neural diarization (EEND) [12], [13] is a method for estimating multiple speakers' speech activities simultaneously from an input recording. Given frame-wise $F$-dimensional acoustic features $(x_t)_{t=1}^{T}$, where $t \in \{1, \dots, T\}$ is a frame index, EEND estimates speech activities $(y_t)_{t=1}^{T}$. Here, $y_t := [y_{1,t}, \dots, y_{s,t}, \dots, y_{S,t}]^\top$ denotes the speech activities of $S$ speakers at $t$, defined as

$$ y_{s,t} = \begin{cases} 0 & (\text{Speaker } s \text{ is inactive at } t) \\ 1 & (\text{Speaker } s \text{ is active at } t) \end{cases}. \tag{1} $$

EEND assumes that $y_{s,t}$ is conditionally independent given the acoustic features, namely,

$$ P(y_1, \dots, y_T \mid x_1, \dots, x_T) = \prod_{t=1}^{T} \prod_{s=1}^{S} P(y_{s,t} \mid x_1, \dots, x_T). \tag{2} $$

With this assumption, speaker diarization can be regarded as a multi-label classification problem and can thus be easily modeled using a neural network $f_{\mathrm{EEND}}$ as

$$ p_1, \dots, p_T = f_{\mathrm{EEND}}(x_1, \dots, x_T), \tag{3} $$

where $p_t := [p_{1,t}, \dots, p_{S,t}]^\top \in (0, 1)^S$ is the posterior probabilities of the $S$ speakers' speech activities at frame index $t$. The estimation of speech activities $(\hat{y}_t)_{t=1}^{T}$ is

$$ \hat{y}_1, \dots, \hat{y}_T = \operatorname*{arg\,max}_{y_1, \dots, y_T} P(y_1, \dots, y_T \mid x_1, \dots, x_T), \tag{4} $$

$$ \hat{y}_{s,t} = \mathbb{1}(p_{s,t} > 0.5), \tag{5} $$

where $\mathbb{1}(\mathrm{cond})$ is an indicator function that returns 1 if cond is satisfied and 0 otherwise. Note that the threshold value in (5) is always set to 0.5 in this paper for simplicity.
The conventional EEND is implemented as a composition of an embedding part $g : \mathbb{R}^{F \times T} \to \mathbb{R}^{D \times T}$ and a classification part $h : \mathbb{R}^{D \times T} \to (0, 1)^{S \times T}$, i.e.,

$$ f_{\mathrm{EEND}} = h \circ g. \tag{6} $$

The first embedding part $g$ converts input acoustic features into $D$-dimensional frame-wise embeddings. It is implemented with $N$-stacked encoders, each of which converts an embedding sequence of flexible length $(e_t^{(n-1)})_{t=1}^{T}$ into a sequence of the same length $(e_t^{(n)})_{t=1}^{T}$ as

$$ e_1^{(n)}, \dots, e_T^{(n)} = g^{(n)}\left(e_1^{(n-1)}, \dots, e_T^{(n-1)}\right) \quad (1 \le n \le N), $$

where $g^{(n)}$ is the $n$-th encoder layer. As examples of encoders, bi-directional long short-term memories (BLSTM) [12] and Transformers [13] have been used in conventional studies.
In this paper, we used Transformer encoders but without positional encodings to prevent the outputs from being affected by the absolute position of the frames. Hereafter, for simplicity, we use $e_t$ to denote the embeddings from the last encoder, i.e., $e_t := e_t^{(N)}$. The classification part $h$ is implemented with a fully connected layer followed by an element-wise sigmoid function $\sigma(\cdot)$:

$$ p_t = \sigma\left(W_{\mathrm{cls}}^\top e_t + b_{\mathrm{cls}}\right), \tag{10} $$

where $(\cdot)^\top$ denotes the matrix transpose, $\mathbf{1}_D$ is the $D$-dimensional all-one vector, and $W_{\mathrm{cls}} \in \mathbb{R}^{D \times S}$ and $b_{\mathrm{cls}} \in \mathbb{R}^{S}$ are the weight and bias of the fully connected layer, respectively.
EEND outputs posteriors of multiple speakers simultaneously but without any conditions to decide the order of the speakers. Such a network is optimized by using a permutation-free objective [41], [51], which was originally proposed for multi-talker speech separation. It computes the loss for all possible speaker assignments between predictions $(p_t)_{t=1}^{T}$, as introduced in (3), and groundtruth labels $(y_t)_{t=1}^{T}$, and it picks the minimum one for backpropagation as follows:

$$ \mathcal{L}_{\mathrm{diar}} = \frac{1}{TS} \min_{\phi \in \Phi(S)} \sum_{t=1}^{T} H\left(y_t^{\phi}, p_t\right), \tag{11} $$

where $\Phi(S)$ is the set of all possible permutations of the sequence $(1, \dots, S)$, $\phi := (\phi_1, \dots, \phi_S)$ is the permuted sequence, $y_t^{\phi} := [y_{\phi_1,t}, \dots, y_{\phi_S,t}]^\top \in \{0, 1\}^S$ is the groundtruth labels permuted using $\phi$, and $H(\cdot, \cdot)$ is the binary cross entropy defined as

$$ H(y_t, p_t) := \sum_{s=1}^{S} \left\{ -y_{s,t} \log p_{s,t} - (1 - y_{s,t}) \log (1 - p_{s,t}) \right\}. \tag{12} $$

Compared with cascaded approaches, EEND has two significant strengths. One is that the cascaded approaches conduct diarization by dividing frame-wise speaker embeddings, so they require SAD as pre-processing and overlap detection and assignment as post-processing. In contrast, EEND estimates each speaker's speech activities independently, so no extra modules for speech activity detection and overlap detection are needed. The other strength is that the EEND model can be adapted to the desired domain's dataset, while cascaded approaches typically tune only probabilistic linear discriminant analysis (PLDA) parameters to optimize intra- and inter-speaker similarity between speaker embeddings [9], [18], [52].
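The permutation-free objective in (11) and (12) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `bce` and `permutation_free_loss` and the clipping constant `eps` are assumptions made for illustration, and the brute-force search over all $S!$ permutations is only practical for a small number of speakers.

```python
import itertools
import numpy as np

def bce(y, p, eps=1e-7):
    """Binary cross entropy summed over all elements, in the spirit of (12).
    eps is an illustrative clipping constant to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).sum())

def permutation_free_loss(y, p):
    """y: (T, S) binary labels, p: (T, S) posteriors.
    Returns the minimum loss over all speaker permutations, normalized
    by T * S, and the permutation of label columns that attains it."""
    T, S = y.shape
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        loss = bce(y[:, list(perm)], p) / (T * S)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy example: the predictions match the labels with the speaker order swapped.
y = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
p = np.clip(y[:, ::-1], 0.05, 0.95)
loss, perm = permutation_free_loss(y, p)
print(loss, perm)
```

In the toy example the swapped assignment is selected, so the loss reflects only the residual uncertainty of the posteriors rather than the arbitrary output order.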

B. Attractor-based end-to-end neural diarization
The limitation of the conventional EEND is in the classification part $h$ in (6); the number of output speakers $S$ is fixed by the fully connected layer as in (10). One possible way to treat a flexible number of speakers with this fixed-output architecture is to set the number of outputs to be large enough. However, as discussed in Section II-B, it requires knowing the maximum number of speakers in advance, and it has been already verified that such a strategy results in poor performance (see [16]). It is also a problem that the calculation cost of the permutation-free loss increases if we set a large number of speakers to be output. Therefore, a significant research question is how to output diarization results for a flexible number of speakers.
In this paper, we extend the conventional EEND to handle a flexible number of speakers. We assume that the embedding part $g$ in (6) is implemented in the same manner as the conventional EEND described in Section III-A. Given frame-wise $D$-dimensional embeddings $(e_t)_{t=1}^{T}$, our goal is to produce posteriors for a flexible number of speakers in the classification part $h$. To achieve this goal, we propose a method to calculate a flexible number of speaker-wise attractors from embeddings and then calculate diarization results on the basis of attractors and embeddings. The proposed method is depicted in Figure 1.

1) EDA: Encoder-decoder-based attractor calculation:
EDA converts frame-wise embeddings into speaker-wise attractors using a sequence-to-sequence method with an LSTM encoder-decoder. The LSTM encoder $h^{\mathrm{enc}}$ takes the frame-wise embeddings as input and updates its hidden state $h_t^{\mathrm{enc}}$ and cell state $c_t^{\mathrm{enc}}$ as

$$ h_t^{\mathrm{enc}}, c_t^{\mathrm{enc}} = h^{\mathrm{enc}}\left(e_t, h_{t-1}^{\mathrm{enc}}, c_{t-1}^{\mathrm{enc}}\right) \quad (t = 1, \dots, T). \tag{13} $$

The hidden and cell states of the encoder are initialized with zero vectors, i.e., $h_0^{\mathrm{enc}} = c_0^{\mathrm{enc}} = \mathbf{0}$. The LSTM decoder $h^{\mathrm{dec}}$ estimates speaker-wise attractors as

$$ h_s^{\mathrm{dec}}, c_s^{\mathrm{dec}} = h^{\mathrm{dec}}\left(\mathbf{0}, h_{s-1}^{\mathrm{dec}}, c_{s-1}^{\mathrm{dec}}\right) \quad (s = 1, 2, \dots). \tag{14} $$

We treat the hidden state at each step $h_s^{\mathrm{dec}} =: a_s \in (-1, 1)^D$ as speaker $s$'s attractor, whose dimensionality $D$ is the same as that of the frame-wise embeddings $e_t$. The hidden and cell states of the decoder are initialized by the final hidden and cell states of the encoder as

$$ h_0^{\mathrm{dec}} = h_T^{\mathrm{enc}}, \quad c_0^{\mathrm{dec}} = c_T^{\mathrm{enc}}, \tag{15} $$

which is shown as a right arrow from the LSTM encoder to the LSTM decoder in Figure 1.
In general applications of a sequence-to-sequence method, e.g., speech recognition or machine translation, the output is sentences, i.e., a sequence of words, so the order of output is fixed. However, EDA cannot determine the order of output speakers in advance because this order is determined by minimizing cross entropy as in (11). Even if the order could be predetermined, it would not be possible to determine the optimal attractor outputs. Thus, the well-known strategy of teacher forcing, for which the optimal outputs with their order have to be known in advance, cannot be used. Furthermore, the $s$-th attractor can correspond to any speaker that is not contained in the first $(s - 1)$ attractors.
To make this attractor calculation procedure fully order-free, we input a zero vector at each step as in (14). Using zero vectors as inputs, rather than, for example, trainable parameters, provides flexibility to change the number of output speakers across pretraining and finetuning. This is why we chose an LSTM-based encoder-decoder rather than a Transformer encoder-decoder, which requires input queries rather than zero vectors. Here, the input order to the EDA encoder affects the output attractors because EDA is based on a sequence-to-sequence method. To investigate the effect of the input order, we tried two types of input orders: chronological and shuffled orders. In the chronological order setting, embeddings are input in the order of frame indexes as in (13). In the shuffled order setting, we use the following instead of (13):

$$ h_t^{\mathrm{enc}}, c_t^{\mathrm{enc}} = h^{\mathrm{enc}}\left(e_{\psi_t}, h_{t-1}^{\mathrm{enc}}, c_{t-1}^{\mathrm{enc}}\right) \quad (t = 1, \dots, T), \tag{16} $$

where $(\psi_1, \dots, \psi_T)$ is a randomly chosen permutation of $(1, \dots, T)$.
The diarization results $p_t$ in (3) are calculated on the basis of the dot product of the frame-wise embeddings and speaker-wise attractors ($\otimes$ in Figure 1):

$$ p_t = \sigma\left(A^\top e_t\right), \tag{18} $$

where $A := [a_1, \dots, a_S]$ are the speaker-wise attractors. The posteriors are optimized by using (11) in the same manner as the conventional EEND. This posterior calculation no longer depends on the fully connected layer, which determines the output number of speakers as in (10); therefore, EDA-based diarization can vary the output number of speakers.
Comparing (10) and (18), the conventional EEND can also be regarded as using fixed attractors W cls (with bias b cls ). In comparison, EDA calculates attractors from an input sequence of embeddings, which makes attractors adaptive to the embeddings. This makes EEND-EDA more accurate even under the fixed-number-of-speakers condition (see Table III).
2) Attractor existence probability: As in (14), we can obtain an infinite number of attractors. To decide when to stop the attractor calculation, we calculate the attractor existence probabilities from the calculated attractors by using a fully connected layer followed by sigmoid activation:

$$ q_s = \sigma\left(w_{\mathrm{exist}}^\top a_s + b_{\mathrm{exist}}\right), \tag{19} $$

where $w_{\mathrm{exist}} \in \mathbb{R}^{D}$ and $b_{\mathrm{exist}} \in \mathbb{R}$ are the trainable weight and bias parameters of the fully connected layer, respectively. During training, we know the oracle number of speakers $S$, so the training objective of the attractor existence probabilities is based on the first $(S + 1)$ attractors using the binary cross entropy defined in (12):

$$ \mathcal{L}_{\mathrm{exist}} = \frac{1}{S + 1} H(l, q), \tag{20} $$

where

$$ l := [\underbrace{1, \dots, 1}_{S}, 0]^\top, \tag{21} $$

$$ q := [q_1, \dots, q_{S+1}]^\top. \tag{22} $$

The total loss is defined as the weighted sum of $\mathcal{L}_{\mathrm{diar}}$ in (11) and $\mathcal{L}_{\mathrm{exist}}$ in (20) with the weighting parameter $\alpha \in \mathbb{R}_{+}$ as

$$ \mathcal{L} = \mathcal{L}_{\mathrm{diar}} + \alpha \mathcal{L}_{\mathrm{exist}}. \tag{23} $$

In this paper, we use $\alpha = 1$. This multi-task loss aims to optimize frame- and speaker-wise posteriors with $\mathcal{L}_{\mathrm{diar}}$ and attractor existence probabilities with $\mathcal{L}_{\mathrm{exist}}$. While (23) was used for the network optimization in our previous study [15], we found that the optimization of $\mathcal{L}_{\mathrm{exist}}$ inhibits the minimization of $\mathcal{L}_{\mathrm{diar}}$, which is more important for improving diarization accuracy, during the training of a model with a flexible number of speakers. Therefore, when a dataset with a flexible number of speakers is used for training, we use $\mathcal{L}_{\mathrm{exist}}$ to update only the fully connected layer parameterized by $w_{\mathrm{exist}}$ and $b_{\mathrm{exist}}$ in (19). This can be implemented by cutting the graph before the fully connected layer to disable backpropagation to the preceding layers.
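During inference, we cannot access the oracle number of speakers; thus, it is estimated using $q_s$ in (19) as the number of attractors generated before the existence probability first falls below a threshold:

$$ \hat{S} = \min \left\{ s \in \mathbb{Z}_{\ge 0} \mid q_{s+1} < \tau \right\}, \tag{24} $$

where $\tau \in (0, 1)$ is a thresholding parameter, which is set to 0.5 in this paper. We then use the first $\hat{S}$ attractors to calculate posteriors as in (18).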
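The attractor calculation, existence-based stopping, and dot-product posteriors described in this subsection can be sketched end to end in NumPy. This is only an illustrative sketch under several assumptions: the LSTM cell is a textbook cell with random, untrained weights and an arbitrary gate ordering; the names `lstm_step` and `eda` are hypothetical; and `max_speakers` is a safety cap added here so that the decoding loop always terminates, which is not part of the formulation above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell. W: (4D, D_in), U: (4D, D), b: (4D,).
    The gate order (input, forget, cell, output) is an arbitrary choice here."""
    D = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:D]), sigmoid(z[D:2 * D]), sigmoid(z[3 * D:])
    g = np.tanh(z[2 * D:3 * D])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def eda(E, enc, dec, w_exist, b_exist, tau=0.5, max_speakers=10, rng=None):
    """E: (T, D) frame-wise embeddings. Returns attractors A with shape
    (S_hat, D) and posteriors P with shape (T, S_hat)."""
    T, D = E.shape
    # Chronological order by default; shuffled order if an rng is given.
    order = np.arange(T) if rng is None else rng.permutation(T)
    h, c = np.zeros(D), np.zeros(D)          # zero-initialized encoder states
    for t in order:
        h, c = lstm_step(E[t], h, c, *enc)
    # The decoder starts from the final encoder states and receives zero inputs.
    attractors, zero = [], np.zeros(D)
    for _ in range(max_speakers):
        h, c = lstm_step(zero, h, c, *dec)
        q = sigmoid(w_exist @ h + b_exist)   # attractor existence probability
        if q < tau:                          # stopping condition
            break
        attractors.append(h)
    A = np.array(attractors).reshape(-1, D)
    P = sigmoid(E @ A.T)                     # dot products followed by sigmoid
    return A, P

rng = np.random.default_rng(0)
D, T = 8, 50
make = lambda: (rng.normal(0, 0.3, (4 * D, D)), rng.normal(0, 0.3, (4 * D, D)), np.zeros(4 * D))
E = rng.normal(size=(T, D))
A, P = eda(E, make(), make(), rng.normal(size=D), 0.0, rng=rng)
print(A.shape, P.shape)
```

Because the number of columns of `P` follows the number of attractors kept, the output dimension is decided at run time rather than by a fixed layer size.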
C. Inference methodology
1) SAD post-processing: Diarization methods, especially cascaded ones, are sometimes evaluated with oracle speech segments. When evaluated in such a way, the comparison between cascaded methods and EEND-based methods becomes hard, mainly because EEND-based methods perform SAD and diarization simultaneously. One reason evaluations of cascaded approaches are mainly based on oracle speech segments is to consider speaker errors and SAD errors separately. It is reasonable to use oracle speech segments to focus on reducing speaker errors. However, such segments are not accessible in real scenarios, and the existence of SAD errors may worsen the clustering performance, which directly affects the diarization accuracy. Thus, we believe that SAD errors should also be considered in the context of cascaded methods. However, it is hard to say how accurate the SAD should be for a fair comparison between cascaded and EEND-based methods. Therefore, to align with the cascaded methods, we introduce SAD post-processing for evaluating EEND. With this method, we can conduct a fair comparison between cascaded and EEND-based methods with the same SAD. Note that it can also be used to improve the diarization performance by eliminating false alarm speech and recovering missed speech when an accurate external SAD system is given.
The SAD post-processing algorithm is described in Algorithm 1. Here, we assume that we have SAD results z 1 , . . . , z T in addition to frame-and speaker-wise posteriors p 1 , . . . , p T . We first estimate speech activities as usual by using (5) (line 1). However, this estimation is not always consistent with SAD results. Thus, we first filter false alarms (FA) by using SAD results. For each frame (line 2), if it is estimated that some speakers are active while the speech activity should be zero (line 3), we update the estimations with a zero vector (line 4). This procedure will always improve DER if z 1 , . . . , z T are the oracle speech activities. We also recover missed frames (MI) if no speaker is estimated as active while the speech activity is one (line 5). For each of such frames, we treat the speaker with the highest posterior as an active speaker (line 6-line 7). Including the oracle SAD as input will also improve the DER because missed-frame errors are replaced by correct estimation or at least speaker errors.
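The two steps described above (false alarm filtering and missed-frame recovery) can be sketched as follows. This is a minimal NumPy sketch written from the prose description, not a transcription of Algorithm 1; the function name `sad_postprocess` and the array layout (frames along the first axis) are assumptions.

```python
import numpy as np

def sad_postprocess(P, z, threshold=0.5):
    """P: (T, S) frame- and speaker-wise posteriors.
    z: (T,) binary SAD results (1 = speech, 0 = non-speech).
    Returns (T, S) binary speech activities consistent with z."""
    Y = (P > threshold).astype(int)           # usual thresholding of posteriors
    z = np.asarray(z).astype(bool)
    if P.shape[1] == 0:                       # no speaker decoded: nothing to fix
        return Y
    # Filter false alarms: no speaker may be active where SAD says non-speech.
    Y[~z] = 0
    # Recover missed frames: if SAD says speech but nobody is active,
    # activate the speaker with the highest posterior.
    rows = np.flatnonzero(z & (Y.sum(axis=1) == 0))
    Y[rows, P[rows].argmax(axis=1)] = 1
    return Y

P = np.array([[0.9, 0.2], [0.6, 0.7], [0.3, 0.4], [0.1, 0.2], [0.8, 0.1]])
z = np.array([1, 1, 1, 0, 0])
Y = sad_postprocess(P, z)
print(Y)
```

In the example, the third frame is recovered by assigning it to the second speaker, and the last frame is cleared because the SAD marks it as non-speech.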
2) Iterative inference: Even if the model is trained to output a flexible number of speakers, the output number of speakers is empirically limited by the maximum number of speakers in a recording observed during pre-training (see Table VII). How to output the results of more than N speakers even if the model is trained on at most N -speaker mixtures is still an open question. In this paper, we propose an iterative inference method to produce results for more than N speakers by applying EEND decoding with iterative frame selection.
As a preliminary, we first describe the characteristics of the EEND models that consist of stacked Transformer encoders and EDA. A Transformer encoder involves neither recurrence nor convolutional calculation, and we do not use positional encoding in this paper; thus, the embedding part $g$ in (6) is an order-free transformation. EDA contains an LSTM encoder-decoder, but if the order of the input sequence to EDA is shuffled, we can say that EDA does not depend on the input order, so the EDA's classification part $h$ in (6) is also an order-free function. Therefore, EEND-EDA does not depend on the order of the input features, which makes it possible to process features that are not extracted at equal intervals along the time axis, as in EEND as post-processing [53]. The proposed iterative inference also utilizes this characteristic.
Algorithm 2 shows the algorithm of iterative inference. In the algorithm, two processes are iteratively conducted: decoding and silence frame selection. Each process at the n-th iteration is described as follows.
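1) Decoding (line 3): Acoustic features $x_t$ of the selected frames $\mathcal{T}$ are fed into EEND, and the corresponding posteriors $p_t^{(n)} \in (0, 1)^{S^{(n)}}$ are estimated as

$$ \left(p_t^{(n)}\right)_{t \in \mathcal{T}} = f_{\mathrm{EEND}}\left((x_t)_{t \in \mathcal{T}}\right), \tag{25} $$

where $S^{(n)} \in \{0, \dots, S_{\max}\}$ is the number of decoded speakers. The posteriors of the frames that are not in $\mathcal{T}$ are set to zero as

$$ p_t^{(n)} = [0, \dots, 0]^\top \quad (t \notin \mathcal{T}). \tag{26} $$

With the posteriors $p_t^{(n)}$ of all the frames, the speech activities $\hat{Y}^{(n)}$ are computed using (5). Note that $\hat{Y}^{(n)}$ corresponds to the speech activities of the $((n-1) S_{\max} + 1)$-th through $((n-1) S_{\max} + S^{(n)})$-th speakers.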
2) Silence frame selection (line 4): Given the diarization results decoded at the $n$-th iteration, we select the frames in which no speaker is active to update $\mathcal{T}$ as

$$ \mathcal{T} \leftarrow \left\{ t \in \mathcal{T} \mid \hat{y}_{s,t}^{(n)} = 0 \ \text{for all } s \right\}. $$

The above processes start with the initial value of $\mathcal{T}$ as the set of all frames $\{1, \dots, T\}$ (line 1), and last until $\mathcal{T}$ becomes the empty set or until it is assumed that all the speakers are decoded (line 5-line 6). Here, we assume that all the speakers are decoded if the number of output speakers $S^{(n)}$ is smaller than the maximum output of EEND $S_{\max}$. After the iterative process is finished, the final results $\hat{Y}$ are obtained by concatenating the results calculated at each iteration (line 7). With iterative inference, the number of speakers to be decoded is no longer limited by the training dataset. The iterative inference workflow when $S_{\max} = 3$ is also illustrated in Figure 2.
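The decoding and silence frame selection loop described above can be sketched in NumPy as follows. This is a sketch written from the prose, not a transcription of Algorithm 2: `decode` stands in for a trained EEND-EDA model and is purely hypothetical, as are the toy decoder `toy_decode` and the extra guard that stops the loop when no frame is removed (added here only to guarantee termination).

```python
import numpy as np

def iterative_inference(X, decode, s_max, threshold=0.5):
    """X: (T, F) acoustic features.
    decode: callable mapping (T', F) features to (T', S') posteriors, S' <= s_max.
    Returns (T, S_total) binary speech activities concatenated over iterations."""
    T = X.shape[0]
    remaining = np.arange(T)                  # initially, all frames are selected
    results = []
    while remaining.size > 0:
        P = decode(X[remaining])              # decoding on the selected frames
        s_n = P.shape[1]
        Y_n = np.zeros((T, s_n), dtype=int)   # unselected frames stay at zero
        Y_n[remaining] = (P > threshold).astype(int)
        results.append(Y_n)
        if s_n < s_max:                       # all speakers assumed decoded
            break
        silent = Y_n[remaining].sum(axis=1) == 0
        if silent.all():                      # guard (assumption): nothing removed
            break
        remaining = remaining[silent]         # silence frame selection
    if not results:
        return np.zeros((T, 0), dtype=int)
    return np.concatenate(results, axis=1)

def toy_decode(X_sel, s_max=2):
    """Hypothetical stand-in for EEND: the toy features are the true activities,
    and at most s_max speakers that are active in the given frames are returned."""
    active = [k for k in range(X_sel.shape[1]) if X_sel[:, k].any()]
    return X_sel[:, active[:s_max]].astype(float)

# Three speakers, but the toy decoder can output at most two per pass.
X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1]])
Y = iterative_inference(X, lambda x: toy_decode(x, s_max=2), s_max=2)
print(Y)
```

In the toy run, the first pass decodes two speakers, the second pass sees only the frames in which neither is active, and the third speaker is recovered there.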
3) Iterative inference with DOVER-Lap (or iterative inference+): Although iterative inference can produce more than $S_{\max}$ speakers' speech activities, it has a potential problem in that the speech activities of two speakers decoded at different iterations never overlap. For example, the $(S_{\max} + 1)$-th speaker's speech activities never overlap with those of the first $S_{\max}$ speakers. This is because the frames in which the first $S_{\max}$ speakers are active will not be processed in the second iteration. To ease this problem, we introduce DOVER-Lap [54], which is an extension of DOVER [55]. Both of them are methods for combining multiple diarization results on the basis of majority voting, but unlike DOVER, DOVER-Lap takes speaker overlap into account. We used a modified version of DOVER-Lap presented in [18], in which the speaker assignment strategy when multiple speakers were ranked equally was slightly different from the original DOVER-Lap [54]. Note that we did not use the hypothesis-wise weighting of DOVER-Lap, which is also introduced in [18].
The algorithm of iterative inference incorporated with DOVER-Lap is shown in Algorithm 3. In this paper, we refer to this inference as iterative inference+. The difference from the iterative inference in Algorithm 2 is that we limit the number of speakers to decode at the first iteration with $S_{\mathrm{limit}}$ $(\le S_{\max})$ (line 5-line 6). After the decoding step at the first iteration using (25), (26), and (5), we choose at most the first $S_{\mathrm{limit}}$ speakers' speech activities from $\hat{Y}^{(1)}$. The other procedures are the same as those in Algorithm 2, and finally, we obtain $S_{\mathrm{limit}}$-wise diarization results $Y_{S_{\mathrm{limit}}}$ (line 10). In iterative inference+, $S_{\mathrm{limit}}$ is varied from 1 to $S_{\max}$ (line 1), which results in $S_{\max}$ diarization results for each recording. We then combine them by using DOVER-Lap to obtain the final result $\hat{Y}$ (line 11). With this procedure, the $k$-th speaker's speech activities can overlap with those of the $\max(1, k - S_{\max} + 1)$-th to $(k + S_{\max} - 1)$-th speakers.

IV. EXPERIMENTS
A. Datasets
1) Simulated datasets: To train the EEND-EDA model, we created simulated speech mixtures from single-speaker recordings of the following corpora. We used the following simulation protocol to create multitalker mixtures from single-speaker recordings:
1) Select $N$ speakers,
2) For each speaker, randomly sample speech segments and concatenate them with silences that are interlaid between speech segments,
3) For each of the $N$ long recordings created, randomly select a room impulse response and convolve it with the recording,
4) Mix the $N$ long recordings and a noise signal with a randomly determined signal-to-noise ratio.
The detailed algorithm for creating simulated mixtures can be found in [12]. In the second process, we assume that the occurrence of an utterance is a Poisson process, so the duration of the silence between speech segments follows the exponential distribution $\frac{1}{\beta} \exp\left(-\frac{x}{\beta}\right)$, where $\beta$ is the mean value. $\beta$ can be used to control the overlap ratio of the mixtures. To obtain a similar overlap ratio among various numbers of speakers, we varied $\beta$ according to the number of speakers as summarized in Table I.
2) Real datasets: For real datasets, we employed five multitalker datasets below.
• CALLHOME [56]: A dataset that consists of telephone conversations whose average duration is two minutes. We used the splits provided in the Kaldi x-vector recipe¹, which are denoted as Part 1 and Part 2, respectively. Two- and three-speaker subsets were used in the fixed-number-of-speakers evaluations, which are denoted as CALLHOME-2spk and CALLHOME-3spk.
• CSJ [57]: A dataset that consists of monologues and dialogues of Japanese speech. In this paper, we used the dialogue part of the dataset. The average duration of the recordings is about 13 minutes. Following [58], we used 54 dialogue recordings out of 58.
• AMI headset mix [2]: A meeting dataset that consists of 100 hours of multi-modal meeting recordings. Each meeting session is about 30 minutes. We used headset mix recordings, which were obtained by mixing the headset recordings of all the participants. We used the split and reference RTTMs provided in the VBx paper [35].
• DIHARD II [38]: A dataset used in the second DIHARD challenge. We used single-channel audio, which is used challenge. It also consists of recordings from 11 domains (including telephone data) with an average duration of about 8 minutes. The test set has two evaluation conditions called core and full. The core set is a subset of the full set, in which the recordings are selected to balance the duration of each domain. In terms of the number of speakers, the full set contains more recordings of two speakers than the core set.
Their statistics are summarized in Table II. Note that the recordings in CSJ, AMI, DIHARD II, and DIHARD III were sampled at 16 kHz, so we downsampled them to 8 kHz to be aligned with those of the simulated datasets. We also note that the recordings of the CSJ corpus are in stereo, so we mixed them to create monaural recordings.
¹ https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization/v2

B. Training
For the embedding part g in (6) of the proposed EEND-EDA, we used four-stacked Transformer encoders with four attention heads without positional encodings, each of which outputs 256-dimensional frame-wise embeddings. The inputs for the model were log-scaled Mel-filterbank-based features. We first extracted 23-dimensional log-scaled Mel-filterbanks with a frame length of 25 ms and frame shift of 10 ms. Each of them was then concatenated with those of the preceding and following seven frames, followed by subsampling with a factor of 10. As a result, a 345 (= 23 × 15) dimensional acoustic feature was extracted for each 100 ms.
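The context splicing and subsampling described above can be sketched as follows. This is an illustrative example only: the function name `splice_and_subsample` and the zero padding at the sequence edges are assumptions, since the text does not specify how boundary frames are handled.

```python
import numpy as np


def splice_and_subsample(feats, context=7, subsampling=10):
    """Map (T, 23) log-Mel frames to (ceil(T / subsampling), 23 * (2 * context + 1))."""
    n_frames = feats.shape[0]
    # Zero-pad both ends so every frame has a full left and right context
    # (the padding mode is an assumption of this sketch).
    padded = np.pad(feats, ((context, context), (0, 0)), mode="constant")
    # Shifted views placed side by side: frame t receives frames t-context .. t+context.
    spliced = np.concatenate(
        [padded[i:i + n_frames] for i in range(2 * context + 1)], axis=1
    )
    # Keep every `subsampling`-th frame: a 10 ms shift becomes 100 ms for a factor of 10.
    return spliced[::subsampling]
```

For a 1 s input (100 frames of 23 dimensions), this returns an array of shape (10, 345).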
In this paper, we evaluated EEND-EDA under both fixed-number-of-speakers and unknown-number-of-speakers conditions; thus, a model was trained for each purpose. For the fixed-number-of-speakers evaluation, the model was first trained on the Simkspk training set for 100 epochs and evaluated on the Simkspk test set. We also adapted the model to CALLHOME-kspk for another 100 epochs to evaluate the model on real recordings. We used k ∈ {2, 3} in this paper. For the unknown-number-of-speakers evaluation, the model that was trained on Sim2spk was finetuned by using the concatenation of Sim{1,2,3,4}spk or Sim{1,2,3,4,5}spk for 50 epochs. The model was also adapted to each target dataset for another 500 epochs.
For network training using simulated mixtures, we used the Adam optimizer [59] with the Noam scheduler [60] with 100,000 warm-up steps. For adaptation, we also used the Adam optimizer but with a fixed learning rate of $1 \times 10^{-5}$. For efficient batch processing during training, we split each recording into chunks of 500 frames when using Simkspk and 2000 frames when using the adaptation sets. The batch size for training was set to 64. Note that an entire recording is fed into the network without splitting during inference.
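For reference, the learning rate schedule of [60], commonly called the Noam schedule, can be sketched as below. The function name `noam_lr` and the `scale` argument are hypothetical, and using the 256-dimensional embedding size as the model dimension is an assumption of this example rather than a setting stated in the text.

```python
def noam_lr(step, d_model=256, warmup_steps=100_000, scale=1.0):
    """Learning rate at a given update step under the schedule of [60]."""
    step = max(step, 1)  # avoid division by zero at step 0
    # Linear warm-up for the first `warmup_steps` updates,
    # then decay proportional to the inverse square root of the step.
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate increases linearly until step 100,000, where the two branches of the `min` meet, and decreases afterwards.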

C. Evaluation
As an evaluation metric, we used diarization error rates (DERs) defined as
$$\mathrm{DER} = \frac{T_{\mathrm{MI}} + T_{\mathrm{FA}} + T_{\mathrm{CF}}}{T_{\mathrm{Speech}}},$$
where $T_{\mathrm{Speech}}$, $T_{\mathrm{MI}}$, $T_{\mathrm{FA}}$, and $T_{\mathrm{CF}}$ denote the duration of total speech, missed speech, false alarm speech, and speaker confusion, respectively. Following the prior work in [12], [61], we used 0.25 s of collar tolerance at each speech boundary for the Simkspk, CALLHOME, and CSJ evaluation. For AMI, DIHARD II, and DIHARD III, we allowed no collar tolerance and used a subsampling factor of 5 during inference, which results in acoustic features extracted every 50 ms, to obtain more fine-grained results. We emphasize that speaker overlaps were NOT excluded from the evaluations.
We also report Jaccard error rates (JERs) in addition to DERs. To calculate JER, the optimal assignment between reference and system speakers is first calculated. JER is then the average of the per-speaker scores over the reference speakers, defined as
$$\mathrm{JER} = \frac{1}{S_{\mathrm{ref}}} \sum_{s=1}^{S_{\mathrm{ref}}} \frac{T_{\mathrm{MI},s} + T_{\mathrm{FA},s}}{T_{\mathrm{Union},s}},$$
where $S_{\mathrm{ref}}$ is the number of reference speakers, $T_{\mathrm{MI},s}$ and $T_{\mathrm{FA},s}$ are the durations of missed speech and false alarm speech for the s-th reference speaker, and $T_{\mathrm{Union},s}$ is the duration in which at least one of the s-th reference speaker and its paired system speaker is active.
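To make the DER components concrete, the following is a simplified frame-level sketch. It is not the scoring tool used in the experiments: it applies no collar, counts durations in frames, assumes the reference and system outputs have the same number of speaker columns, and searches all speaker permutations by brute force. The function name `frame_der` is hypothetical.

```python
from itertools import permutations

import numpy as np


def frame_der(ref, hyp):
    """Frame-level DER for binary (T, S) reference and system activity matrices."""
    ref = np.asarray(ref, dtype=int)
    hyp = np.asarray(hyp, dtype=int)
    n_ref = ref.sum(axis=1)  # number of active reference speakers per frame
    n_hyp = hyp.sum(axis=1)  # number of active system speakers per frame
    t_speech = n_ref.sum()
    t_mi = np.maximum(n_ref - n_hyp, 0).sum()  # missed speech
    t_fa = np.maximum(n_hyp - n_ref, 0).sum()  # false alarm speech
    # Speaker confusion depends on the speaker mapping, so keep the best permutation.
    t_cf = None
    for perm in permutations(range(hyp.shape[1])):
        n_correct = (ref & hyp[:, list(perm)]).sum(axis=1)
        confusion = (np.minimum(n_ref, n_hyp) - n_correct).sum()
        t_cf = confusion if t_cf is None else min(t_cf, confusion)
    return (t_mi + t_fa + t_cf) / t_speech
```

Because overlapped frames contribute once per active speaker to the total speech duration, errors on overlaps are scored rather than excluded.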

V. RESULTS
A. Fixed numbers of speakers
1) Two-speaker experiment: First, we evaluated our method under the two-speaker condition. In this case, the model was first trained on Sim2spk and then adapted to CALLHOME-2spk Part 1. For the EEND-based methods, we used the model trained on Sim2spk to evaluate the simulated datasets and the one adapted to CALLHOME-2spk Part 1 to evaluate CALLHOME-2spk Part 2 and CSJ. For EEND-EDA, we used the first two output attractors for speech activity calculation.
Table III shows the results of the two-speaker evaluation. We observed that the proposed method with the shuffled order setting achieved the best DERs. Although EEND-EDA is designed to deal with flexible numbers of speakers, it outperformed the conventional EENDs, i.e., BLSTM-EEND and SA-EEND, which output diarization results for fixed numbers of speakers. This is because the conventional EEND can be regarded as a fixed-attractor-based method, while EEND-EDA is an adaptive-attractor-based method as described in the last paragraph of Section III-B. This flexibility of attractors makes the proposed method more accurate even in fixed-number-of-speakers evaluations. In terms of the order of the input to EDA, shuffled sequences always performed better than chronologically ordered sequences. This indicates that the global context is more important than the temporal context for calculating attractors.
2) Three-speaker experiment: We also evaluated the method under the three-speaker condition. We first trained the model on Sim3spk and then adapted it to CALLHOME-3spk Part 1. We validated the performance on Sim3spk using the model trained on Sim3spk and that on CALLHOME-3spk Part 2 using the model adapted to CALLHOME-3spk Part 1. We used the first three attractors to evaluate EEND-EDA's performance. As shown in Table IV, EEND-EDA with sequence shuffling performed best on both simulated and real datasets.
3) Effect of input order: For a better understanding of EDA, we tried various types of sequences as inputs to the models, each of which was trained on chronologically ordered sequences or shuffled sequences. We evaluated matched and unmatched conditions of orders, and we also evaluated the effect of reducing the sequence length by subsampling or using the last 1/N part of the sequences. Table V shows the results on Sim2spk (β = 2). The EEND-EDA that was trained on chronologically ordered sequences performed well on chronologically ordered sequences but did poorly on shuffled sequences. It was also affected by subsampling, while it was only slightly influenced by using the last 1/N part. These results indicate that the length of each utterance is an important factor in deciding the output attractors for the model trained on chronologically ordered sequences. On the other hand, when the model was trained on shuffled sequences, it was not much affected by the order of sequences or by subsampling. However, when the last 1/N of the sequences was used, its performance degraded more than that of the model trained on chronologically ordered sequences. These results indicate that EDA trained on shuffled sequences captured the distribution of embeddings; thus, subsampling did not affect the performance much, while using the last 1/N, i.e., biased sampling, degraded the DERs.
4) Embedding visualization: For an intuitive understanding of the behavior of EDA, we visualized the embeddings $e_t$ and attractors $a_s$ within a two-speaker mixture from Sim2spk (β = 2) in Figure 3b. They were projected to two-dimensional space by using principal component analysis (PCA). We observed that the embeddings of the two speakers were well distinguished from those of silence frames, and those of overlapped frames were distributed between the areas of the two speakers. For EEND-EDA, an attractor was successfully calculated for each of the two speakers, as shown in Figure 3b.
In comparison, the fixed attractors $W_{\mathrm{cls}}$ of the conventional EEND in Figure 3a were less well separated than the attractors calculated using EDA.
To understand the characteristics of attractors from EDA, we also visualized the inter-mixture relationship of attractors. For visualization, we first chose an anchor speaker and then selected mixtures that contained the anchor speaker. We calculated two attractors from each mixture by using EEND-EDA and mapped them onto a two-dimensional space using PCA. The speaker assignment from the calculated attractors to speaker identifiers was based on the ground-truth labels. Figure 4 shows the attractors of two-speaker mixtures that contain the same anchor speaker. It clearly shows that each anchor speaker's attractors were not distributed near each other. These results suggest that the embeddings and attractors were calculated only to separate speakers within each mixture, and that the attractors are therefore not suited for speaker identification. This also supports the idea that attractors are adaptively calculated from input embeddings. A similar observation on attractors from DANet [48] in speech separation was reported in Section 5 of [62], namely that attractors cannot be used for speaker identification or tracing.
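The two-dimensional projection used for such visualizations can be sketched with a plain SVD-based PCA. This is a generic illustration, not the plotting code behind Figures 3 and 4; the function name `pca_project` is hypothetical, and fitting a single projection on embeddings and attractors stacked together is an assumption.

```python
import numpy as np


def pca_project(points, n_components=2):
    """Project the rows of `points` (N, D) onto their top principal components."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

For example, stacking the frame-wise embeddings of shape (T, 256) on top of the attractors of shape (S, 256) and projecting the result places both in the same plane, so that attractors can be plotted over the embedding clusters.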

5) Evaluation on the mismatched number of speakers:
We also evaluated two-speaker EEND-EDA on three-speaker datasets, and three-speaker EEND-EDA on two-speaker datasets. We used the model trained on Sim2spk or Sim3spk for the evaluation on the simulated datasets, and used the model adapted to CALLHOME-2spk or CALLHOME-3spk for the evaluation on the real datasets. The order of the embeddings was shuffled before being fed into EDA. The results are shown in Table VI. The DERs clearly degraded when the number of speakers during training and inference was different. It is worth mentioning that three-speaker EEND-EDA did not work well on the two-speaker datasets; this indicates that training with a larger number of speakers does not cover inference with a smaller number of speakers.
B. Unknown numbers of speakers
1) Simulated mixtures: To train EEND-EDA to output results for flexible numbers of speakers, we finetuned the model from the two-speaker model for at most 50 epochs using Sim1spk to Sim4spk or Sim1spk to Sim5spk. Table VII shows the step-by-step improvement of the model. Note that the results on the top row correspond to our previous paper [15]. First, disabling backpropagation from the attractor existence loss $L_{\mathrm{exist}}$ so that it updates only $w_{\mathrm{exist}}$ and $b_{\mathrm{exist}}$ improved the DERs for Sim1spk to Sim4spk. However, we observed that the model still did not perform well on Sim5spk, which was not included in the training set. Adding Sim5spk to the training set solved the problem as shown in the third row, where the DER for Sim5spk improved from 23.08 % to 13.70 %. This indicates that EEND-EDA's number of output speakers was empirically limited by its training datasets, even though its network architecture does not limit the number of output speakers. Increasing the number of training epochs further improved the DERs as shown in the fourth row. We also show the DERs of SA-EEND [13] trained on datasets with a flexible number of speakers in the last two rows. In each case, the model's output number of speakers was set to the maximum number of speakers in the dataset, i.e., four or five, and the model was trained to output null speech activities if a recording with fewer speakers was input. EEND-EDA outperformed SA-EEND on all datasets. Hereafter, we use the EEND-EDA model of the fourth row (k ∈ {1, . . . , 5}, 50 epochs, using $L_{\mathrm{exist}}$ to update only $w_{\mathrm{exist}}$ and $b_{\mathrm{exist}}$ during training) and the SA-EEND model of the sixth row (k ∈ {1, . . . , 5}, 50 epochs).
2) CALLHOME: Since the CALLHOME dataset does not include an official dev/eval split, we used the split provided in the Kaldi recipe and performed cross-validation. For comparison with the prior work on EEND, we also report the results obtained for Part 2 of the dataset using the model adapted to Part 1. For SAD post-processing described in Section III-C1, we used the TDNN-based SAD provided in the Kaldi ASpIRE recipe² and oracle speech segments.
We show the number-of-speakers-wise results of cross-validation in Table VIIIa. We also show, in brackets, the results evaluated only on single-speaker regions. For this purpose, we selected the most probable speaker in each time frame of the EEND-EDA results for fair comparison with x-vector-based methods. EEND-EDA outperformed the state-of-the-art x-vector-based methods in total DERs. One reason is that EEND-EDA can handle speaker overlap, but it showed a competitive DER (5.29 %) even when speaker overlaps were excluded from the evaluation. Considering the number of speakers in a mixture, EEND-EDA did especially better than the x-vector-based methods with VBx clustering when the number of speakers was small (#Speakers=2,3,4), while it was worse or on par when the number of speakers was large (#Speakers=5,6,7). One reason is that the pretraining was based on mixtures with at most five speakers, and another reason is that mixtures with a larger number of speakers are rare in the CALLHOME dataset. Compared to SA-EEND, EEND-EDA achieved better DERs in all cases. Table VIIIb shows the results on CALLHOME Part 2. It clearly shows that EEND-EDA outperformed the other EEND-based methods [16], [17] by over two percentage points of absolute DER. Table IX shows confusion matrices for the speaker counting of x-vector (TDNN) + AHC, x-vector (ResNet101) + AHC + VBx [35], SC-EEND [16], and EEND-EDA on CALLHOME Part 2. Our method achieved a higher speaker counting accuracy than the other methods by a large margin.
3) AMI headset mix: We next evaluated our method on the AMI headset mix, which has a different domain from the pretraining data (telephone conversation vs. meeting). We trained the model on the training set for 500 epochs and evaluated it on the dev and eval sets. The oracle speech segments were also used for SAD post-processing.
The results are shown in Table X. EEND-EDA outperformed the x-vector-based methods on both the dev and eval sets with the oracle SAD. Note that the x-vector-based methods tuned the PLDA parameters on the dev set, so the margin of EEND-EDA was smaller on the dev set than on the eval set. EEND-EDA also outperformed SA-EEND with and without the oracle SAD. We also note that the average duration of the recordings in the AMI headset mix test set is over 30 min. These results show that EEND-EDA generalized well to such long recordings even though only 200 s segments were used during adaptation.

4) DIHARD II & DIHARD III:
Finally, we evaluated our method on the DIHARD II and III datasets, which contain recordings from multiple domains. In this evaluation, we used iterative inference with and without DOVER-Lap, which are described in Section III-C2 and Section III-C3, respectively, to deal with large numbers of speakers. For SAD post-processing, we used oracle segments and the system used in the Hitachi-JHU submission to the DIHARD III challenge [18].
The results are shown in Tables XI and XII. We can see that iterative inference with DOVER-Lap (iterative inference+) consistently improved DERs. Compared with the x-vector-based methods, EEND-EDA performed best on DIHARD III full, while the x-vector-based methods were better on DIHARD II and DIHARD III core.
We show the number-of-speakers-wise DERs and JERs on DIHARD III in Table XIII. Our method performed better when the number of speakers was small and worse when the number of speakers was large. This is why EEND-EDA performed well on DIHARD III full and worse on DIHARD II and DIHARD III core. We also observed that the proposed iterative inference+ improved the performance, especially in terms of JERs in cases with a large number of speakers, but it was still worse than the x-vector method. Handling a large number of speakers with EEND is left for future work.

VI. CONCLUSION
In this paper, we proposed EEND-EDA, an end-to-end speaker diarization method for unknown numbers of speakers that uses an encoder-decoder-based attractor calculation module (EDA). In EEND-EDA, frame-wise embeddings are first calculated from an input acoustic feature sequence, then speaker-wise attractors are calculated from the embeddings using EDA, and finally diarization results are obtained by the dot product of the embeddings and attractors. We also proposed to improve the diarization performance by shuffling the order of the embeddings before they are input to EDA and by limiting the scope of backpropagation of the attractor existence loss.
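The final step of this pipeline, from embeddings and attractors to speech activities, can be sketched as follows. This is a minimal illustration and not the released implementation: the attractors and their existence probabilities are taken as given inputs instead of being produced by an LSTM encoder-decoder, and the function names, the sigmoid applied to the dot products, and the 0.5 thresholds are assumptions of this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def count_attractors(exist_probs, threshold=0.5):
    """Number of attractors kept: stop at the first existence probability below threshold."""
    for s, p in enumerate(exist_probs):
        if p < threshold:
            return s
    return len(exist_probs)


def diarize(embeddings, attractors, exist_probs, threshold=0.5):
    """Return a boolean (T, S) speech activity matrix.

    embeddings: (T, D) frame-wise embeddings.
    attractors: (S_max, D) candidate attractors in generation order.
    exist_probs: length-S_max existence probabilities (assumed given).
    """
    n_spk = count_attractors(exist_probs, threshold)
    # Dot product between every frame embedding and every kept attractor.
    posteriors = sigmoid(embeddings @ attractors[:n_spk].T)
    return posteriors > threshold
```

A frame whose embedding has a large dot product with several attractors is marked active for all of them, which is how overlapped speech is represented.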
To conduct fair comparisons between EEND-based methods and cascaded methods under the same SAD condition, we introduced SAD post-processing for EEND-based methods. We also proposed iterative inference to cope with the problem of EEND-EDA's number of outputs being empirically limited by its training dataset. The evaluations on both simulated and real datasets showed that the proposed EEND-EDA performed well in both fixed-number-of-speakers and flexible-number-of-speakers evaluations.
One possible future direction of this research is to train EEND-EDA with simulated data containing a larger number of speakers. Preparing a large amount of training data in advance increases storage usage. Therefore, we will need a method to prepare simulated mixtures on the fly during training, as recently studied in [64]. In addition, to create a simulated mixture, we first create N recordings, each of which contains one speaker, and then mix them into an N-speaker mixture. To control the overlap ratio, we increased the value of β as the number of speakers in the mixture increased, but this leads to an increase in the duration of silence in the mixture. An investigation of a better simulation protocol is also left for future work.
Even if EEND-EDA is trained with datasets containing a large number of speakers, the maximum number of output speakers would still be limited by the datasets, as shown in Table VII. One reason is that EEND-EDA decides the number of speakers by using a neural network trained in a fully supervised manner. One of our later works has shown that unsupervised clustering can be introduced into EEND-EDA to remove the limitation on the output number of speakers caused by the training dataset [65].
Another direction is the network architecture. Currently, EDA employs a vanilla LSTM encoder-decoder, but an attention-based LSTM or Transformer encoder-decoder may be possible alternatives. The Transformer encoders that extract frame-wise embeddings from input features can also be replaced with other architectures such as Conformers [66] or time-dilated convolutional neural networks [64].