Attention-Based End-to-End Differentiable Particle Filter for Audio Speaker Tracking

Particle filters (PFs) have been widely used in speaker tracking due to their capability in modeling non-linear processes and non-Gaussian environments. However, particle filters are limited by several issues. For example, pre-defined handcrafted measurements are often used, which can limit the model performance. In addition, the transition and update models are often preset, which makes the PF less flexible to adapt to different scenarios. To address these issues, we propose an end-to-end differentiable particle filter framework that employs multi-head attention to model long-range dependencies. The proposed model uses self-attention as the learned transition model and cross-attention as the learned update model. To our knowledge, this is the first proposal to combine the particle filter and the transformer for speaker tracking, where the measurement extraction, transition and update steps are integrated into an end-to-end architecture. Experimental results show that the proposed model achieves superior performance over the recurrent baseline models.


I. INTRODUCTION
Speaker tracking plays an important role in speech separation [1], speech enhancement [2] and speaker diarization [3]. The task of speaker tracking is to estimate the 2D position, 3D position or Direction of Arrival (DOA) of speakers at each time step. Generally, speaker tracking consists of two steps: measurement extraction and Bayesian filtering. Speaker localization can provide measurements for speaker tracking. For speaker localization, there are two types of methods: parametric-based methods [4] and learning-based methods [5]. One of the important parametric-based methods is the global coherence field (GCF), which is widely used for obtaining measurements in speaker tracking [6]. The GCF map accumulates the generalized cross-correlation phase transform [4] (GCC-PHAT) computed from the signals of each microphone pair. A grid search is then employed to find the maximum over the acoustic map, and the position producing this maximum is regarded as the position of the sound source. Compared to parametric-based methods, learning-based methods are more robust against room reverberation and background noise [7] when trained on audio data recorded in different acoustic environments. They learn the relationships between audio features, such as GCC-PHAT [8], and the speakers' positions through neural networks.
Tracking considers the temporal variations of a speaker trajectory. Tracking algorithms often focus on modeling temporal variations, smoothing trajectories, removing estimation outliers and compensating for missing observations. The family of Bayesian filtering algorithms is often used to address these problems, aiming to estimate the target states recursively given the previous states and the current measurements. There are two recursive steps in Bayesian filtering, namely prediction and update. In the prediction step, the target states are propagated from the last time step to the current time step through a transition model. In the update step, the states are updated from prior to posterior by the measurement model. Several methods have been developed in this family, including the Kalman Filter (KF) and the PF. The KF assumes the transition and update processes to be linear and the noise to follow a Gaussian distribution. It uses a Gaussian distribution to represent the target states and updates the mean and covariance at each time step. The performance is satisfactory in this linear Gaussian setting, but the assumption limits the generalization of the model to complicated scenarios. The extended KF (EKF) and unscented KF (UKF) were proposed to mitigate the linear-Gaussian limitations. The EKF approximates the non-linear transition and update processes using a first-order Taylor series expansion. The UKF employs deterministic sampling to generate a set of sigma points to calculate the mean and covariance of the state distribution. The particle filter (PF) is a Sequential Monte Carlo (SMC) method that uses a group of particles instead of Gaussian distributions to represent the target states, and can therefore handle non-linear and non-Gaussian scenarios. The PF contains four steps. At first, the particles are initialized with the same weights. The particle states are propagated in the prediction step, and the particle weights are updated by the measurement likelihood in the update step. The target states are calculated as the weighted sum of the particle states. At last, the particles are resampled to avoid the weight degeneracy problem: the particles with high weights are maintained and duplicated, while the particles with low weights are discarded.
The Bayesian filter is a first-order Markov process. The estimation of the current states depends on the states at the last time step and the measurements at the current time step. Similarly, the transformer is an auto-regressive model when used in temporal prediction tasks such as machine translation, text summarization and tracking, whose output at the current time step also depends on the inputs and outputs at previous time steps. Based on this similarity, we propose a new method combining the two models, with the following novel aspects. First, the transition model, which changes the particle states in a particle filter, is learned with a multi-head self-attention model, instead of being pre-defined as a constant-velocity model as in a conventional particle filter. Second, in the observation model, which updates the particle weights according to the measurement likelihood, we use multi-head cross-attention to model the interaction between different modalities, in order to capture the relationship between the encoded audio embedding and the particle embedding. The designed model combines the advantages of the particle filter and the transformer. 1) The particle filter is not an end-to-end architecture: the measurement needs to be obtained before the update step. The combination is an end-to-end architecture where the filter and the measurement model can be trained jointly. 2) The transition and update models are often preset, which makes them hard to generalize to complex scenarios. The self-attention and cross-attention modules are employed as learnable transition and update models. 3) It has been shown that the algorithmic priors introduced by the particle filter improve the model performance [9]. In addition, the prediction and update steps introduced into the transformer bring explainability to the model, as compared to training a black-box neural network model.
The remainder of this paper is organized as follows. In Section II, we summarize related topics. In Section III, we formulate the tracking problem. In Section IV, we present our differentiable particle filter transformer for single-speaker tracking and discuss the extension of this model to multiple-speaker scenarios. In Section V, we show the experimental results of the baseline methods and the proposed method, and discuss the model robustness against noise and sequence length. In Section VI, we conclude the paper and point out the limitations of our method and potential directions for future work.

II. RELATED WORK

A. SPEAKER TRACKING
In the past few years, Bayesian filter based methods have been developed for speaker tracking. Most methods adopt the paradigm of filtering with a measurement model. In [10], an adaptive particle filter is proposed for single-speaker tracking. It uses GCF as the audio measurement, and face detection and color histograms as the video measurement. An adaptive weighting mechanism is designed to dynamically determine the importance of the audio and visual modalities. In general, training a model for extracting measurements requires abundant labeled data, which is not always available. Self-supervised learning and active learning, proposed in [11] and [12], can be used to leverage unlabeled data to obtain measurements. In [13], an algorithm similar to [10] is explored in a reverberant and noisy environment with occluded speakers, speakers out of the field of view, and speakers not facing the cameras. In [6], a new dataset named CAV3D is proposed for audio-visual speaker tracking. Compared to the widely used AV16.3 dataset [14], CAV3D contains recordings with stronger reverberation and more complicated scenarios. In [15], a particle filter is used for multiple-speaker tracking with discriminative and generative measurement likelihoods. In [16], a two-layer particle filter is proposed: two groups of particles are passed through the audio and visual layers separately, and the particle weights are determined by the likelihoods of the two modalities.
The random finite set (RFS) based method is another branch of Bayesian filtering, which can handle a varying number of speakers. An RFS contains a varying number of elements, and both the target and measurement sets can be represented by RFSs. At each time step, the speaker RFS is the combination of surviving speakers, spawned speakers from the last time step, and new speakers. Here, spawned speakers refer to speakers that appeared in the last time step and may still exist in the current time step without associated measurements, such as occluded speakers. To lower the computational complexity of the RFS, the probability hypothesis density (PHD) filter propagates the first-order moment of the multiple-target state distribution. It has a linear Gaussian form, which represents the targets with Gaussian distributions, and an SMC form, which represents the targets with particles. In [17], the PHD filter is used for tracking an unknown and varying number of moving audio sources. In [18], the SMC PHD filter is employed with the help of mean-shift to move particles to the local maximum. Several works [19], [20] combine particle flow with the PHD filter, where the particle flow helps to transfer the particles from the prior distribution to the posterior distribution. Unlike the PHD filter, the multi-target multi-Bernoulli (MeMBer) filter propagates the posterior density function rather than the first-order moment. The target state is represented as a Bernoulli RFS, which is either empty or has a single element. In [1], the generalized labeled multi-Bernoulli (GLMB) filter is employed to solve the problem of multi-modal space-time permutation and deal with a varying number of speakers. In [21], the Poisson multi-Bernoulli mixture (PMBM) filter is proposed for multi-speaker tracking, which employs Poisson distributions to represent undetected targets and a multi-Bernoulli mixture to represent detected targets under different data association strategies.

B. DIFFERENTIABLE BAYESIAN FILTER
There have been some recent works that combine Bayesian filters and deep learning models for temporal prediction tasks. In [22], a backprop Kalman filter is proposed, which takes raw images as input and outputs the tracking results. In [23], a dynamic weighting mechanism is jointly trained with the backprop Kalman filter so that the importance of different modalities can be determined by the quality of the measurements. There are also works that combine neural networks with particle filters. In [24], a differentiable particle filter is designed with a semi-supervised learning strategy to reduce the requirement for labeled data. In [25], a particle filter is combined with simultaneous localization and mapping (SLAM) for visual navigation. In [26], a particle filter network is proposed for visual localization, which encodes the measurement model and the particle filter in a single neural network. In [9], a similar architecture is implemented, where the training strategy is formed in three steps, with two steps on training the transition and measurement models, and the final step on end-to-end learning of the whole model. In addition to conventional neural networks, recurrent neural networks, such as long short-term memory (LSTM) and the gated recurrent unit (GRU), can also be combined with a particle filter. In [27], PF-LSTM and PF-GRU are proposed, which replace the deterministic update with a stochastic Bayesian update. In [28], the particle transformer is proposed, which leverages weighted multi-head attention for differentiable resampling. Compared to [26] and [9], our model extends the localization (tracking) task to multiple objects for the first time. Compared to [27], our model integrates the particle filter and the transformer for object tracking for the first time, while in [27] the particle filter is combined with recurrent neural networks. Compared to [28], we combine the particle filter and the transformer for object tracking, while in [28] the two models are combined for differentiable resampling. In our proposed model, the particle states are also changed in the observation model [27], unlike in a vanilla particle filter where the particle states remain unchanged in the observation model. The design of the differentiable particle filter provides a training strategy, such as using weighted particles for state representation and particle resampling.

III. PROBLEM FORMULATION
The whole algorithm design is based on the particle filter framework. In a particle filter, N particles {w_{t,i}, x_{t,i}}_{i=1}^{N} are used to represent the target states, where x_{t,i} denotes the state of the i-th particle at time t, containing the DOA d_{t,i} and distance δ_{t,i}, with w_{t,i} being its corresponding weight. A standard PF has four steps: initialization, prediction, update and resampling. In the first step, all particle weights are initialized at t = 1 to be the same:

w_{1,i} = 1/N,  i = 1, ..., N.

In the prediction step, the particle states {w_{t,i}, x_{t,i}}_{i=1}^{N} are transited from the last time step to the current time step, {w_{t,i}, x_{t+1,i}}_{i=1}^{N}, using the transition model:

x_{t+1,i} = T(x_{t,i}),

where T is the transition model, assumed to be a constant-velocity model. In the update step, the measurement likelihood l is first calculated:

l_{t+1,i} = M(x_{t+1,i}, Z_{t+1}),

where Z_{t+1} is the observation set and M is the measurement model. The particle weights are updated by the measurement likelihood:

w_{t+1,i} ∝ w_{t,i} l_{t+1,i},  with Σ_{i=1}^{N} w_{t+1,i} = 1.

The target state X is obtained as the weighted sum of the particle states:

X_{t+1} = Σ_{i=1}^{N} w_{t+1,i} x_{t+1,i}.

After some iterations, the particle filter may suffer from the weight degeneracy problem, where the target state is determined by only a few high-weight particles. Therefore, the last step is particle resampling, where particles with higher weights are duplicated while particles with lower weights are discarded. The resampled particles are assigned identical weights.
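The four steps above can be sketched in a few lines of NumPy. This is a minimal illustrative filter, not the paper's model: the state is reduced to a 1-D DOA, the constant-velocity transition is simplified to a random walk, and the Gaussian measurement likelihood is an assumption chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(states, weights, measurement, sigma_trans=1.0, sigma_meas=5.0):
    """One prediction/update/estimate/resample cycle of a vanilla particle filter.

    states : (N,) array of particle DOAs in degrees (illustrative 1-D state).
    """
    n = states.shape[0]
    # Prediction: transition model, reduced here to a random walk.
    states = states + rng.normal(0.0, sigma_trans, size=n)
    # Update: illustrative Gaussian measurement likelihood on the DOA.
    lik = np.exp(-0.5 * ((states - measurement) / sigma_meas) ** 2)
    weights = weights * lik
    weights /= weights.sum()
    # Target state: weighted sum of the particle states.
    estimate = float(np.sum(weights * states))
    # Resampling: duplicate high-weight particles, reset weights to uniform.
    idx = rng.choice(n, size=n, p=weights)
    return states[idx], np.full(n, 1.0 / n), estimate

# Initialization: uniform weights over randomly placed particles.
states = rng.uniform(0.0, 360.0, size=200)
weights = np.full(200, 1.0 / 200)
for _ in range(20):
    states, weights, estimate = pf_step(states, weights, measurement=90.0)
```

After a few iterations the weighted estimate concentrates around the (here static) measurement; wrap-around at 0°/360° is ignored in this sketch.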
In this paper, we explore using audio signals captured by microphones for speaker tracking. Given the binaural audio waveform {a_1, a_2} captured by two microphones, where a_1, a_2 ∈ R^{|a|} with |a| being the length of the waveform, the task of speaker localization aims to predict the direction of arrival (DOA) of the speakers' sound sources with respect to the microphone array at each time step.
We use GCC-PHAT as the audio feature, which is commonly used in speaker tracking and is calculated as:

G_{ij}(t, τ) = Σ_f ( STFT_i(t, f) · STFT_j(t, f)^* / |STFT_i(t, f) · STFT_j(t, f)^*| ) e^{j2πfτ},

where G ∈ R^{T×C}, with T being the temporal dimension and C the number of coefficients of the delay lags, τ is the time delay lag, (i, j) denotes a microphone pair, STFT represents the Short-Time Fourier Transform with (t, f) being the time frame and frequency bin indexes, respectively, and * denotes the complex conjugate.
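For one microphone pair, the feature above can be sketched frame by frame as follows. The frame length (1024), hop (320) and number of retained lags (96) mirror the values given later in Section V.B; the rectangular windowing and the lag-centering convention are assumptions of this sketch.

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024, hop=320, n_lags=96):
    """GCC-PHAT between two channels, framed along time.

    Returns an array of shape (T, n_lags), lags centered so that
    index i corresponds to lag i - n_lags//2.
    """
    frames = []
    for start in range(0, len(x1) - n_fft + 1, hop):
        X1 = np.fft.rfft(x1[start:start + n_fft])
        X2 = np.fft.rfft(x2[start:start + n_fft])
        cross = X1 * np.conj(X2)
        cross /= np.abs(cross) + 1e-12          # phase transform (PHAT)
        cc = np.fft.irfft(cross, n=n_fft)
        # Keep only the central lags: negative lags sit at the end of cc.
        cc = np.concatenate([cc[-(n_lags // 2):], cc[:n_lags // 2]])
        frames.append(cc)
    return np.stack(frames)

# Sanity check: x2 is x1 delayed by five samples, so the peak lag is -5.
rng = np.random.default_rng(1)
x1 = rng.standard_normal(8192)
x2 = np.roll(x1, 5)
g = gcc_phat(x1, x2)
peak_lag = int(np.argmax(g.mean(axis=0))) - 48
```

The PHAT weighting whitens the cross-spectrum, so for a pure delay the correlation collapses to a sharp peak at the true lag.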

IV. PROPOSED METHODS
In this section, we present an end-to-end differentiable architecture that combines the particle filter and the transformer for single-speaker tracking. Then, we discuss the extension of the proposed model to the problem of multi-speaker tracking.

A. ATTENTION-BASED DIFFERENTIABLE PARTICLE FILTER
The overview of the model can be seen in Fig. 1. The self-attention acts as an implicit learnable transition model with which the particle states are transferred to the next time step without applying an explicit motion model to the particles. The cross-attention module is used to calculate the measurement likelihood.
The overall architecture of our model is shown in Fig. 2, which follows the paradigm of a vanilla transformer. The GCC-PHAT G is first added with the positional encoding G_POS ∈ R^{T×C} along the time dimension and input to the transformer encoder. In the transformer encoder, the input goes through the multi-head self-attention (MSA) layers and the fully connected layers with residual connections. This can be described mathematically as follows:

Ẑ^l = MSA(LN(Z^{l-1})) + Z^{l-1},
Z^l = MLP(LN(Ẑ^l)) + Ẑ^l,  l = 1, ..., L,

where Ẑ^l is the intermediate state after the MSA layers and Z^l is the output of one transformer encoder module. Ẑ^l, Z^l ∈ R^{T×F}, with T being the length of the temporal feature and F being the feature dimension, l is the index of the transformer module, L denotes the number of repeated transformer encoder modules, MLP represents a multi-layer perceptron, and LN is layer normalization. The MSA is defined as

MSA(Z) = [H_1; ...; H_n] W^O,

where the heads H_ω, ω = 1, ..., n, are computed as

H_ω = softmax( Z W_ω^Q (Z W_ω^K)^T / sqrt(d_*) ) Z W_ω^V ∈ R^{T_*×d_*},

where T_* is the sequence length and d_* is the per-head feature dimension. In the transformer decoder, the particles are represented as embedding matrices S_t ∈ R^{N×D}, where N is the number of particles and D is the hidden dimension of the particle embedding. Each particle embedding implicitly encodes the particle position.
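The pre-norm residual form of the encoder equations can be written out directly in NumPy. This is a sketch for checking shapes and dataflow, not the trained model: the weights are random, the head count and dimensions are illustrative, and biases in the attention projections are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(x, Wq, Wk, Wv, Wo, n_heads=4):
    """Multi-head self-attention: split features into heads, attend, merge."""
    T, F = x.shape
    d = F // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = [softmax(Q[:, h*d:(h+1)*d] @ K[:, h*d:(h+1)*d].T / np.sqrt(d))
             @ V[:, h*d:(h+1)*d] for h in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ Wo

def encoder_block(z, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    z_hat = msa(layer_norm(z), Wq, Wk, Wv, Wo) + z             # Z^ = MSA(LN(Z)) + Z
    return np.maximum(layer_norm(z_hat) @ W1 + b1, 0) @ W2 + b2 + z_hat  # MLP branch

T, F, F_mlp = 12, 128, 256
params = [rng.normal(0, 0.1, s) for s in
          [(F, F), (F, F), (F, F), (F, F), (F, F_mlp)]] \
         + [np.zeros(F_mlp), rng.normal(0, 0.1, (F_mlp, F)), np.zeros(F)]
z_out = encoder_block(rng.normal(0, 1, (T, F)), *params)
```

The residual connections keep the input and output shapes identical, so L such blocks can be stacked.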
An advantage of the proposed model is that, as the feature extractor, the transformer encoder can be optimized over a sequence of audio frames instead of a single frame [22]. The particle embedding S_t is first added with the positional encoding S_POS ∈ R^{N×D} and then passed through the self-attention layer. The self-attention layer is applied on the first dimension of the particle embedding and is regarded as the transition model of the particle filter. The transition of one particle state depends on the self-attention with the other particle embeddings:

S_{t+1} = MSA(LN(S_t + S_POS)) + S_t.

After the self-attention transition, the particle states are transferred from S_t to S_{t+1}. Then cross-attention is applied between the predicted particle states and the output of the encoder. The encoder output also contains corrupted information for DOA estimation, such as clutter, outliers and noise. The multi-head cross-attention (MCA) layer is regarded as the measurement model and the output of the encoder is regarded as the measurement.
Ŝ_{t+1} = MCA(LN(S_{t+1}), Z^L) + S_{t+1},

where the MCA operation is defined as

MCA(S, Z) = [H_1; ...; H_n] W^O,

in which the heads H_ω, ω = 1, ..., n, are computed as

H_ω = softmax( S W_ω^Q (Z W_ω^K)^T / sqrt(d_*) ) Z W_ω^V,

where W^O, W^Q, W^K and W^V are defined similarly as earlier. A fully connected layer is then used to calculate the measurement likelihood from the particle embedding.
The particle weights are updated according to the likelihood:

w_{t+1,i} ∝ w_{t,i} l_{t+1,i}.

The corresponding DOA posteriors d_{t+1,i} ∈ R^360 are derived by an MLP layer from the updated particle states:

d_{t+1,i} = MLP(Ŝ_{t+1,i}).

The final DOA posterior d_{t+1} ∈ R^360 is the weighted sum of the DOA posteriors over all the particles:

d_{t+1} = Σ_{i=1}^{N} w_{t+1,i} d_{t+1,i}.

Finally, the DOA is obtained as the peak index of the posterior:

DOA_{t+1} = argmax_j d_{j,t+1},

where d_{j,t+1} is the j-th element of the vector d_{t+1}, and j = 1, ..., 360/Δ, with Δ being the angle resolution, set to Δ = 1° in our experiments.
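Put together, one decoder step — transition, update, weighting and DOA read-out — can be sketched as below. Single-head attention stands in for the MSA/MCA layers, the sigmoid likelihood head and all dimensions are assumptions of this sketch, and layer normalization and positional encodings are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, Wq, Wk, Wv):
    """Single-head attention; queries from q_in, keys/values from kv_in."""
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

N, T, D = 30, 12, 32                       # particles, time frames, embedding dim
Ws = {name: rng.normal(0, 0.1, (D, D)) for name in
      ["Wq_s", "Wk_s", "Wv_s", "Wq_c", "Wk_c", "Wv_c"]}
W_lik = rng.normal(0, 0.1, (D, 1))         # FC head -> measurement likelihood
W_doa = rng.normal(0, 0.1, (D, 360))       # MLP head -> per-particle DOA posterior

S_t = rng.normal(0, 1, (N, D))             # particle embeddings
Z_L = rng.normal(0, 1, (T, D))             # encoder output = learned measurement
w_t = np.full(N, 1.0 / N)

# Prediction: self-attention transition over the particle set.
S_pred = S_t + attention(S_t, S_t, Ws["Wq_s"], Ws["Wk_s"], Ws["Wv_s"])
# Update: cross-attention between predicted particles and the measurement.
S_upd = S_pred + attention(S_pred, Z_L, Ws["Wq_c"], Ws["Wk_c"], Ws["Wv_c"])
# Likelihood from a fully connected layer; sigmoid keeps it positive (assumption).
lik = 1.0 / (1.0 + np.exp(-(S_upd @ W_lik)[:, 0]))
w_new = w_t * lik
w_new /= w_new.sum()
# Per-particle DOA posteriors, combined by the particle weights.
d_particles = softmax(S_upd @ W_doa, axis=-1)         # (N, 360)
d_post = (w_new[:, None] * d_particles).sum(axis=0)   # (360,)
doa = int(np.argmax(d_post))
```

Since each per-particle posterior and the weights are normalized, the combined posterior is itself a distribution over the 360 DOA classes.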

B. RESAMPLING
The resampling step selects and duplicates the important particles and discards the unimportant ones. However, the resampling step is not differentiable. To integrate the resampling step into the transformer, similar to [26], [27], we employ soft-resampling. In soft-resampling, we resample the particles from a new distribution q instead of the original distribution p, where q is a mixture of p and the uniform distribution u = 1/K, with 0 < α < 1 being a hyperparameter to balance the two distributions:

q(i) = α w_{t,i} + (1 − α) 1/K,

where K is the number of particles. Instead of assigning equal weights to all the resampled particles, the new particle weights are calculated as:

w'_{t,j} = w_{t,i_j} / ( α w_{t,i_j} + (1 − α) 1/K ),

where i_j is the index of the j-th resampled particle.
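Soft-resampling is easy to express concretely. A sketch in NumPy, with hard index sampling standing in for the differentiable gather of the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_resample(states, weights, alpha=0.2):
    """Resample from q = alpha*p + (1-alpha)/K and reweight by p/q, so the
    new weights keep a differentiable dependence on the old ones."""
    k = weights.shape[0]
    q = alpha * weights + (1.0 - alpha) / k
    idx = rng.choice(k, size=k, p=q)
    new_w = weights[idx] / q[idx]            # importance correction p/q
    return states[idx], new_w / new_w.sum()

w = rng.dirichlet(np.ones(30))               # illustrative particle weights
s = rng.normal(0, 1, (30, 128))              # illustrative particle embeddings
s2, w2 = soft_resample(s, w, alpha=0.2)
# Edge case for intuition: alpha = 1 reduces q to p, recovering hard
# resampling with exactly uniform corrected weights.
s3, w3 = soft_resample(s, w, alpha=1.0)
```

Smaller α pushes q toward uniform, so low-weight particles survive more often but carry proportionally larger corrected weights.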

C. EXTENSION TO MULTIPLE SPEAKERS
The extension of the proposed method to the problem of tracking multiple speakers follows the setting in a conventional particle filter.
In the conventional particle filter, several groups of independent particles are used for different objects. In this paper, as an example, we consider the scenario of tracking two speakers. To this end, two independent groups of particles are employed as input to the decoder of the transformer. For the self-attention based transition step, the particles from the two groups are processed independently with an attention mask to ensure that the transitions of different groups do not interfere with each other, where, in the definition of the mask, I is an identity matrix whose dimension equals the number of particles used for tracking each speaker.
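The paper's exact mask definition (in terms of the identity matrix I) is not reproduced here; one realization consistent with the description — a sketch where attention is simply confined to each particle group by a block-diagonal additive mask — is:

```python
import numpy as np

def group_mask(n_per_group, n_groups=2):
    """Additive attention mask: 0 within a particle group, -inf across groups,
    so each group's transition cannot attend to the other group."""
    n = n_per_group * n_groups
    mask = np.full((n, n), -np.inf)
    for g in range(n_groups):
        s = g * n_per_group
        mask[s:s + n_per_group, s:s + n_per_group] = 0.0
    return mask

def masked_softmax(scores, mask):
    s = scores + mask
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(0, 1, (60, 60))           # raw scores for two groups of 30
attn = masked_softmax(scores, group_mask(30))
```

After the masked softmax, the attention matrix is exactly block-diagonal: particles for one speaker receive zero attention weight on the other speaker's particles.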
The cross-attention update is the same as that of (14)-(19). Compared to single-speaker tracking, multi-speaker tracking has a data association step, in which the measurements are matched with the speaker states. The data association is hidden in the cross-attention update step, where each particle embedding automatically finds the related measurements. When obtaining the DOAs, (20) and (21) are performed for each group of particles. In the multi-speaker tracking scenario, the number of speakers may vary over time. Thus, we add a binary classifier to estimate whether the particles correspond to an existing target:

Ê_{t+1} = MLP(Ŝ_{t+1}),  (24)

where Ê ∈ R^{2×2} in the two-speaker scenario, which is rescaled using the softmax function to obtain the predicted speaker existence probability. After obtaining the DOAs of the multiple speakers and the existence probabilities by averaging the particle states according to their weights, the Hungarian algorithm [29] is used to match the estimated DOAs with the ground truth. Similar to [30], the matching strategy is applied on the basis of the speaker existence probability and the DOA estimation error. At last, soft-resampling in terms of (23) is performed for each group of particles independently. The overall process for multi-speaker tracking is shown in Algorithm 1.

D. LEARNING OBJECTIVE
For the DOA distance loss, we cannot use the distance between the predicted DOA and the ground truth DOA, because the argmax operation in (21) is not differentiable. Some works [31], [32] treat DOA estimation as a classification task and use the cross entropy loss. With a resolution Δ, the DOA space is split into 360/Δ classes. However, the cross entropy loss cannot describe the relationship among different classes. For instance, the error between 0° and 180° should be larger than that between 0° and 90°, yet the cross entropy loss treats them equally. To model the relationship between different classes, inspired by [5], we encode the ground truth r with a Gaussian distribution centered on the ground truth DOA:

r_ψ ∝ exp( −(θ_ψ − θ_gt)² / (2σ²) ),

where r_ψ is the ψ-th element of the ground truth encoding r and θ_ψ is the angle of the ψ-th class. We generate the Gaussian distribution centered on the ground truth DOA with resolution Δ = 1° and covariance σ² = 1°. Then we use the earth mover's distance (EMD) loss [33], also used for speech quality evaluation [34], to measure the difference between the DOA posterior and the Gaussian distribution of the ground truth:

L_EMD = Σ_ψ | Σ_{k≤ψ} (d_k − r_k) |,

where d_ψ and r_ψ are the ψ-th elements of the DOA posterior d_{t+1} and the ground truth r, respectively. Besides, we adopt the evidence lower bound (ELBO) loss [27] to maximize the particle likelihood:

L_ELBO = − log ( (1/N) Σ_{i=1}^{N} p(r_t | S_{1:t,i}, Z^L_{1:t}) ),

where r_t is the ground truth DOA and S_{1:t} are the updated particle embeddings. For the likelihood, we adopt p(r_t | S_{1:t,i}, Z^L_{1:t}) = L_EMD(d_i) to calculate the EMD loss between the particle states and the ground truth Gaussian distribution. The EMD loss provides a macro optimization strategy to improve the model performance, while the ELBO gives a micro optimization strategy that focuses on the particle estimation. The learning objective is the combination of the ELBO loss and the DOA distance loss with the hyperparameter λ:

L = L_EMD + λ L_ELBO.

For the multiple-speaker scenario, the cross entropy loss L_CE(Ê, c) is added for class prediction, defined as:

L_CE(Ê, c) = − Σ_k [ (1 − c_k) log Ê(k, 0) + c_k log Ê(k, 1) ],

where c_k ∈ {0, 1} is the ground-truth existence label of the k-th speaker, Ê(k, 0) is the element of the matrix Ê at the k-th row and first column, and likewise Ê(k, 1) represents the element at the k-th row and second column.
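The Gaussian encoding and the EMD loss can be made concrete as below. The wrap-around angular distance in the encoding is an assumption of this sketch (the handling of the 0°/360° boundary is not stated above); the EMD is the standard cumulative-sum form for 1-D histograms.

```python
import numpy as np

def gaussian_target(doa_deg, sigma=1.0, resolution=1.0):
    """Encode a ground-truth DOA as a normalized Gaussian over 360/resolution classes."""
    angles = np.arange(0.0, 360.0, resolution)
    diff = np.abs(angles - doa_deg)
    diff = np.minimum(diff, 360.0 - diff)        # wrap-around distance (assumption)
    r = np.exp(-0.5 * (diff / sigma) ** 2)
    return r / r.sum()

def emd_loss(pred, target):
    """1-D earth mover's distance between two normalized histograms."""
    return float(np.abs(np.cumsum(pred - target)).sum())

# The property the cross entropy lacks: being wrong by 180 degrees costs
# more under EMD than being wrong by 90 degrees.
g0, g90, g180 = (gaussian_target(a) for a in (0.0, 90.0, 180.0))
```

Unlike cross entropy, the EMD grows with the angular distance between the predicted and true classes, which is exactly the inter-class relationship argued for above.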

V. EXPERIMENTS
In this section, we first evaluate the performance of the proposed model, as compared with the baseline models on single-speaker tracking. We then compare the robustness of the different models against sequence length and additive audio noise. At last, we show the model performance on multi-speaker tracking.
Algorithm 1: Attention-Based Differentiable Particle Filter for One Time Step.

A. DATASET
In this paper, we focus on audio speaker tracking. Most speaker tracking datasets (2 hours for the AV16.3 dataset [14], 2 hours for the CAV3D dataset [6], and 0.5 hours for the AVDIAR dataset [35]) are of limited size and do not support the training of deep learning models. Therefore, we resort to a simulated dataset. We use the Two Ears auditory model¹ for simulating binaural audio with a moving sound source, due to its easy implementation. Binaural audio has been used for localization, tracking and navigation [36], [37]. The dataset is simulated within a 2D room of 20 × 20 square meters. The initial position of the speaker is chosen randomly within the room. The walking speed of the sound source is set to three meters per second to mimic a normal walking scenario. The direction of the velocity is randomly generated and fixed, so the sound source moves linearly with constant speed. We have created almost 33 k trajectories, with each trajectory containing 50 k sampling points. The speech corpus is from LibriSpeech [38]. The training set is from train-clean-360, the development set is from dev-clean and dev-other, and the test set is from test-clean and test-other. We cut each speech clip to 10 seconds and input it to the Two Ears auditory system together with a trajectory to generate binaural audio with a sampling interval of 368 samples at a sampling rate of 44100 Hz. We collect around 33 k spatial speech clips of more than 100 hours in total. We use 27 k clips as the training set, 3 k clips as the development set, and another 3 k clips as the test set.
For the multi-speaker tracking scenario, we randomly choose two audio clips from the single-speaker dataset and add their waveforms to simulate the two-speaker scenario. In this way, we create a corpus of another 100 hours of speech data. Together with the single-speaker corpus, we use the blended corpus to train the multi-speaker tracking model. The numbers of audio clips in the training set, the development set and the test set are 60 k, 6 k and 6 k, respectively.

¹[Online]. Available: https://github.com/TWOEARS/TwoEars

B. IMPLEMENTATION DETAILS
For the GCC-PHAT calculation, we split each audio clip into chunks with a hop size of 368 to match the simulation interval. For each chunk, the GCC-PHAT is calculated by including the six previous chunks and the six following chunks. The n_fft is set to 1024 and the hop size is set to 320. The number of coefficients of the delay lags is set to 96.
Both the transformer encoder and decoder use one transformer module to mimic the process in a particle filter. The dimension of the query, key and value is 128. The dimension of the latent representation from the fully-connected layer is 256. The number of heads for the multi-head attention is 4. For the particle filter, we use 30 particles with a 128-dimensional latent representation. The α for soft-resampling is set to 0.2.
For model training, we run the Adam optimizer for 100 epochs in total, with the learning rate set to 5e-4 for the initial 50 epochs and then decayed by a factor of 0.1 for the remaining 50 epochs. We adopt an early stopping mechanism with a patience of 30. For the Hungarian matching, λ_cls = 3 and λ_DOA = 5/180. For the learning objective, the hyperparameter λ is set to 0.5. For the multiple-speaker scenario, we adopt the EMD loss and the cross entropy loss, and discard the ELBO loss to reduce the computational cost.

C. EVALUATION METRICS
We use the mean absolute error (MAE) and Accuracy to evaluate the model performance. The MAE is calculated as:

MAE = (1 / (T M)) Σ_{t=1}^{T} Σ_{m=1}^{M} | θ̂_{t,m} − θ_{t,m} |,

where T is the number of time steps, M is the number of speakers, and θ̂_{t,m} and θ_{t,m} are the estimated and ground truth DOAs, respectively, with the angular difference wrapped so that it is smaller than 180 degrees. MAE is the average error along the time dimension within one trajectory. The Accuracy is calculated as the percentage of trajectories whose MAE is smaller than three degrees. For the multi-speaker scenario, we also report the cardinality error, calculated as the absolute difference between the estimated number of speakers and the ground truth.
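A direct implementation of the two metrics (the wrapped angular difference keeps each error below 180 degrees, as noted above):

```python
import numpy as np

def angular_error(pred_deg, gt_deg):
    """Absolute angular difference wrapped into [0, 180] degrees."""
    d = np.abs(np.asarray(pred_deg, float) - np.asarray(gt_deg, float)) % 360.0
    return np.minimum(d, 360.0 - d)

def mae(pred, gt):
    """Mean absolute DOA error over all time steps and speakers of a trajectory."""
    return float(angular_error(pred, gt).mean())

def accuracy(trajectory_maes, threshold=3.0):
    """Fraction of trajectories whose MAE is below the threshold (3 degrees here)."""
    return float((np.asarray(trajectory_maes) < threshold).mean())
```

For example, a prediction of 350° against a ground truth of 10° counts as a 20-degree error, not 340 degrees, because of the wrap-around.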

D. COMPARISON WITH OTHER METHODS
We compare the proposed model with other temporal prediction models, including the vanilla RNN, LSTM and GRU, and with models which combine RNNs and the PF, such as PF-LSTM and PF-GRU [27]. We choose the RNN-based models as baselines as they estimate the states at the current time step based on the current measurements and the previous states, in a similar spirit to Bayesian filters. When re-implementing the baseline models, we choose a proper embedding dimension to ensure that the different models have a roughly equivalent number of parameters.
The GCC-PHAT is directly input to the RNN-based method to obtain DOA.
The experimental results on the simulated dataset are shown in Table 1. It is observed that the proposed particle filter transformer outperforms the baseline methods by a large margin. The transformer encoder provides the extracted features, and the transformer decoder, combined with the particle filter, estimates the speaker state. LSTM and GRU offer better performance than the vanilla RNN, as they have a better ability to model longer sequences through gates that remember important information and discard redundant information.

E. ABLATION STUDY
We conduct an ablation study to show the effectiveness of the proposed end-to-end attention-based particle filter. The results are shown in Table 2. The first model uses the transformer to obtain the measurements and a conventional particle filter for tracking. In each training iteration, only the transformer is optimized. For the transformer to obtain measurements, we use the transformer encoder and add a [CLS] token at the beginning of the GCC-PHAT. We take the first position of the output and pass it to a classification layer to get the DOA. We adjust the dimension of the hidden layers to match the models in Table 1. It is observed that the performance of this two-stage model is not as good as the one-stage end-to-end model, which shows the effectiveness of the end-to-end design. The second model uses the same architecture as the proposed model but without temporal dependency: at each time step, new particles are generated as input to the decoder instead of using the resampled particles from the last step. The performance of the second model is better than that of the two-stage model, which shows the competitive performance of our proposed architecture. The reason is that the feature contains information from both previous and subsequent audio chunks: as explained in Section V.B, one chunk is calculated with the six previous chunks and the six following chunks. Since the input features already incorporate temporal information, the benefit of explicit temporal prediction is less immediately apparent.
We also show how the number of particles affects the model performance in Table 3. It can be seen that the model achieves the best Accuracy with 30 particles and the lowest MAE with 10 particles. A larger or smaller number of particles may lead to a performance decline. In addition, we integrate the differentiable resampling mechanism [28] into the proposed model and compare its performance with that of the model using soft-resampling. The experimental results are shown in Table 5. We find that the performance of the model leveraging differentiable resampling is not as good as that using soft-resampling. Both the differentiable resampling method and our proposed model use transformer blocks; however, the cascaded transformer architecture is hard to converge to optimal performance. In contrast, the model in [28] leverages fully connected layers as the transition and update models, which are more lightweight and thus more suitable for integration with differentiable resampling. The soft-resampling mechanism is model-free and fits better with our transformer-based model.

F. THE IMPACT OF SEQUENCE LENGTH
In this section, we explore the impact of the sequence length on the model performance. We test the three best-performing models from Table 1, LSTM, GRU and our proposed model, with sequence lengths of 100, 200, 300, 400 and 500; the MAE and Accuracy results are reported in Fig. 3. The performance of LSTM and GRU drops significantly as the sequence length increases (Accuracy decreases from 78% to 50% for LSTM and from 80% to 73% for GRU). In contrast, the performance of our model remains relatively stable on long sequences, with Accuracy maintained at around 95%, which shows that our model has strong modeling and memory capacity over long sequences.
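The exact definitions of the MAE and Accuracy metrics are not restated in this section. A plausible sketch, assuming angular errors wrap around 360° and Accuracy counts the fraction of frames whose error falls within a tolerance (the 5° threshold here is a hypothetical choice, not taken from the paper):

```python
import numpy as np

def doa_errors(pred_deg, gt_deg):
    """Wrap-around angular error in degrees, in the range [0, 180]."""
    diff = np.abs(np.asarray(pred_deg) - np.asarray(gt_deg)) % 360.0
    return np.minimum(diff, 360.0 - diff)

def mae(pred_deg, gt_deg):
    """Mean absolute angular error over all frames."""
    return doa_errors(pred_deg, gt_deg).mean()

def accuracy(pred_deg, gt_deg, tol_deg=5.0):
    """Fraction of frames whose angular error is within the tolerance."""
    return (doa_errors(pred_deg, gt_deg) <= tol_deg).mean()

pred = [359.0, 10.0, 180.0]
gt = [1.0, 12.0, 170.0]
# per-frame errors: 2, 2, 10 degrees
```

The wrap-around step matters near 0°/360°: a prediction of 359° against a ground truth of 1° is only 2° off, not 358°.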

G. THE IMPACT OF NOISE
The simulated dataset we generate does not contain noise in the binaural audio. In real applications, however, captured audio signals are often contaminated by noise. Therefore, in this section, we explore the model robustness against noise. To this end, we add noise to the development and test sets of the simulated dataset. The noise is taken from DEMAND [39], which provides recordings from different scenarios including offices, parks, sports fields, and so on. The noise is added to the magnitude spectrum at Signal-to-Noise Ratios (SNRs) of 20 dB, 10 dB, 0 dB and −10 dB, respectively. The experimental results are shown in Table 4. We use the models pre-trained on the clean training set and evaluate them on the noisy version of the test set without fine-tuning. Our model performs better than the baseline models at all SNR levels. The performance of RNN is the worst due to its simple architecture, while the other baselines perform at a similar level. At SNRs of 20 dB, 10 dB and 0 dB, our model outperforms the baselines by a large margin (Accuracy increases by almost 20% at 20 dB and 10 dB, and by almost 10% at 0 dB), which shows that our model is more robust to additive noise. At an SNR of −10 dB, the noise power exceeds that of the speech, which is a very challenging scenario, and no model performs satisfactorily. Even so, in this noisier environment our model still obtains about 10% higher Accuracy than the recurrent baselines.
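Mixing noise at a target SNR can be sketched as follows. This is a generic implementation that scales the noise so the mixture reaches the requested SNR; it works on any array of samples, whereas the paper applies the addition on the magnitude spectrum:

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that mixing it with `signal` yields the
    requested signal-to-noise ratio, then return the mixture.
    Assumes `signal` and `noise` have the same shape."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power satisfies p_signal / p_target = 10^(snr_db / 10)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = add_noise_at_snr(clean, noise, snr_db=20.0)
```

The same routine covers all four SNR levels in the experiment by changing `snr_db`; at −10 dB the scaled noise carries ten times the signal power.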
To further increase robustness against noise, we also train the model on a noisy dataset. Specifically, we add Gaussian white noise to the training set at an SNR of 20 dB, and test the model on the dataset contaminated by DEMAND [39]. The model performance (Ours*) improves: since Gaussian white noise is a generic form of noise, a model trained on it tends to generalize to more specific noise types.

H. VISUALIZATION
To present the tracking results more intuitively, we show the trajectories and particle states in Fig. 4. As the DOA classification resolution is 1°, the estimated DOAs take only integer values, so the estimated trajectories appear saw-toothed. At the start, the particles are scattered over different positions, as in conventional particle filters. After some iterations, the particles converge to certain points. Although several outliers exist (the impulses in the third and fifth sub-figures), the estimated trajectories are close to the ground-truth trajectories.

I. RESULTS ON MULTI-SPEAKER SCENARIO
We report the results of two-speaker tracking in Table 6. We compare against a two-stage baseline that combines the DEtection TRansformer (DETR) [30] with the Poisson multi-Bernoulli mixture (PMBM) filter [21]. DETR [30] is used to extract audio measurements. The vanilla DETR, originally proposed for object detection, has a classification head and a localization head that determine the position of the bounding box; a Hungarian matching module performs bipartite matching during training, which is not differentiable. We modify the classification head into a binary classifier that determines speaker existence, and modify the localization head to classify over 360 classes corresponding to the DOA angles. DETR is trained on the simulated two-speaker dataset. PMBM is used for state estimation. Compared to a PF, PMBM can estimate a varying number of speakers, whereas a PF requires the number of speakers as a prior. PMBM has been used for vehicle tracking [40] and multiple speaker tracking [41]. It takes the audio measurements from DETR [30] and estimates the number of speakers and each speaker's DOA.
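The bipartite matching step can be illustrated with a small sketch. For two or three speakers, a brute-force search over permutations recovers the same optimal assignment as the Hungarian algorithm used by DETR; the cost below combines an assumed existence term and a wrapped DOA error, with illustrative weights `lam_cls` and `lam_doa` (not values from the paper):

```python
import itertools

def match_cost(pred_exist_prob, pred_doa, gt_exist, gt_doa,
               lam_cls=1.0, lam_doa=1.0):
    """Cost of assigning one prediction to one ground-truth slot:
    reward existence confidence, penalize wrapped angular error."""
    diff = abs(pred_doa - gt_doa) % 360.0
    ang = min(diff, 360.0 - diff)
    return -lam_cls * pred_exist_prob * gt_exist + lam_doa * ang * gt_exist

def hungarian_match(preds, gts, **kw):
    """Brute-force optimal bipartite matching (fine for <= 3 speakers);
    DETR uses the Hungarian algorithm for the same purpose."""
    best, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p][0], preds[p][1],
                              gts[g][0], gts[g][1], **kw)
                   for g, p in enumerate(perm))
        if cost < best:
            best, best_perm = cost, perm
    return best_perm, best

# predictions: (existence prob, DOA); ground truth: (exists flag, DOA)
preds = [(0.9, 100.0), (0.8, 250.0)]
gts = [(1, 252.0), (1, 98.0)]
perm, cost = hungarian_match(preds, gts)
# perm maps each ground-truth index to a prediction index
```

Here the optimal assignment pairs each ground-truth DOA with the nearby prediction (252° with 250°, 98° with 100°), not with the prediction listed in the same order.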
In the PMBM baseline, the survival probability is set to 0.99, the birth model is a Gaussian mixture, and the detection probability is set to 0.9. Both the baseline method and our proposed model can estimate the number of speakers. The proposed method performs competitively with the baseline in MAE and outperforms it in cardinality estimation. Multi-speaker tracking is clearly more challenging than single-speaker tracking, for two reasons. On one hand, the GCC-PHAT feature is hard to adapt to the multi-speaker scenario [7], and its performance degrades significantly as the number of speakers increases [15]. On the other hand, data association is needed to match the measurements with the targets.

J. RESULTS ON REAL DATASET
We evaluate our methods on a real dataset, the AVRI dataset [42], which is recorded with a four-microphone array and a KINOVA robot. Compared to the simulated dataset, the real dataset is more complicated, covering varying reverberation, noise and speaker motions. The experimental results are shown in Table 7. Our proposed method offers competitive performance compared with the state-of-the-art methods, despite having a smaller model size. Both A-CRNN [7] and AV-CRNN [42] employ GCC-PHAT and mel-spectrograms as audio features. In addition, both AV-CRNN [42] and CMAF [42] use additional facial features, yet achieve only marginal performance improvements. Our lightweight model takes only GCC-PHAT as input, which reduces the computational complexity and strikes a balance between performance and model complexity.

TABLE 7 Experimental Results on the AVRI Dataset

VI. CONCLUSION AND FUTURE WORK
We have presented a new end-to-end model for speaker tracking that combines the conventional particle filter with a transformer-based learning architecture. The particle filter provides potential explainability for the transformer, while the transformer offers a strong measurement model for the particle filter. This combination abandons the traditional tracking pattern of first extracting measurements and then feeding them to a Bayesian filter; instead, it provides an end-to-end differentiable architecture. Experiments on simulated and real datasets show that the proposed model offers improved modeling capacity and robustness to long sequences and noise. However, the proposed method has limitations. On the one hand, when the algorithm is used for multi-speaker tracking, the maximum number of speakers needs to be specified, which is unknown in some scenarios. On the other hand, the model performance degrades in multi-speaker scenarios due to the limitations of GCC-PHAT and the need for data association. In future work, we will improve the performance of the proposed method for multi-speaker tracking. Motivated by insights in [44], the resampling step can also be further optimized, for example by leveraging parallel processing to reduce the computational cost and by adjusting the resampling frequency according to the degree of weight degeneracy to mitigate sample impoverishment.

FIGURE 1. Overview of the model architecture, where a learned transition model and a learned update model are proposed. Here, s_t is the speaker state at time t and z_t is the measurement at time t.

FIGURE 2. Architecture of the proposed model. The GCC-PHAT of the two-channel audio is calculated and split as input to the encoder. The decoder takes the particle embeddings as input and performs self-attention and cross-attention with the encoder output, serving as the transition model and the update model, respectively. Initially the particles share the same color, indicating equal weights. After the update, the particles are shown in different colors, with deeper colors indicating higher weights. Finally, soft-resampling is used to select the important particles.
1{c_ζ ≠ ∅} is an indicator function, taking the value 1 if c_ζ is not empty. ζ is the index of the ground truth and σ(ζ) is the matching index. ŷ_σ(ζ) = (Ê_σ(ζ), d̂_σ(ζ)) and y_ζ = (c_ζ, r_ζ), where c_ζ ∈ {0, 1} indicates speaker existence and r_ζ is the ground-truth DOA. L_DOA is the absolute difference between d̂_σ(ζ) and r_ζ. λ_cls and λ_DOA are hyperparameters that balance the classification error and the DOA error.

FIGURE 3. Impact of sequence length on the performance of the model.

FIGURE 4. Visualization of tracking trajectories. The horizontal axis is time and the vertical axis is the DOA. The red and green lines represent the estimated trajectories and the ground truth, respectively. The stars represent the particles, whose colors encode the particle weights; darker colors mean higher weights.