Multi-Head Self-Attention-Based Deep Clustering for Single-Channel Speech Separation

Attending to a particular speaker when many people talk simultaneously is known as the cocktail party problem. It remains a difficult task, especially for single-channel speech separation. Inspired by the physiological observation that humans can pick out an attractive sound from a mixture of signals, we propose the multi-head self-attention deep clustering network (ADCNet) for this problem. We combine the widely used deep clustering network with the multi-head self-attention mechanism and investigate how the number of attention heads affects separation performance. We also adopt the density-based canopy K-means algorithm to further improve performance. We trained and evaluated our system on two- and three-talker mixtures from the Wall Street Journal (WSJ0) corpus. Experimental results show that the new approach achieves better performance than many advanced models.


I. INTRODUCTION
The cocktail party problem, which is of great significance for automatic speech recognition and voiceprint recognition [1], was first posed by Cherry in [2]. Conventional solutions for single-channel speech separation mainly include computational auditory scene analysis (CASA) [3], [4] and non-negative matrix factorization (NMF) [5]. CASA simulates the processing of sound by the human auditory system using certain organizational principles and appropriate separation cues. NMF is based on the assumption that an audio spectrogram has a low-rank structure and can be represented with a small number of bases [5]. However, these traditional techniques suffer from similar limitations: restricted modeling of spectral and temporal dynamics, difficulty with unknown speakers, and high complexity [6]. Recently, deep learning has been successfully applied to source separation. Deep neural networks (DNNs) were adopted to capture the highly non-linear relationship of speech characteristics between mixed signals and the target speaker [7]-[10]. Deep clustering (DPCL) was first exploited for monaural source separation in [11]. The essence of DPCL is to determine source assignment based on a similarity measurement in the original spectral space or an embedding space. Following DPCL, many innovative approaches formulate separation as a regression problem and learn an effective mapping from the mixture to source time-frequency masks. Typical solutions include permutation invariant training (PIT) [12], utterance-level permutation invariant training (uPIT) [13] and the deep attractor network (DANet) [14]. However, the reconstruction of source signals with time-frequency masking is imperfect; a time-domain audio separation network (TasNet) [15] was proposed and achieved good results. (The associate editor coordinating the review of this manuscript and approving it for publication was Mohamad Forouzanfar.)
Furthermore, a joint audio-visual model for speech separation was put forward in [16] and outperforms audio-only models. It is notable, however, that these methods treat the input mixed speech as a pure engineering signal and ignore the physiological characteristics of human hearing. For instance, humans can concentrate on one or two sounds of interest in a noisy party or other complex auditory environment; it seems easy for a human to recognize a single voice within a signal consisting of multiple sources. According to the research on cortical representation of multi-talker mixed speech by Mesgarani and Chang [17], the human auditory system can restore the representation of the speaker of interest while suppressing irrelevant competing speech when several speakers talk simultaneously. In other words, the auditory system assigns different weights to different units in the speech sequence. This behavior resembles the attention mechanism widely used in natural language processing (NLP), initially proposed in [18] for machine translation. Attention acts as a pooling layer that selects the most discriminative features over a sequence, learning the weight distribution by computing the similarity of elements in the sequence. In this paper, we therefore apply the attention mechanism to the cocktail party problem. We conduct the joint optimization of the DPCL model [11] and multi-head self-attention, which can jointly attend to information from different representation subspaces at different positions. The main contributions of this paper are:
1) Human beings can easily separate target speech from mixed signals owing to the specialization of the auditory system. We exploit this physiological characteristic for the cocktail party problem.
2) We creatively combine the original DPCL model with the multi-head self-attention mechanism.
Multi-head self-attention can capture relevant information in different subspaces through multiple parallel computations, so the network can learn more comprehensive information. We also explore how the number of heads affects model performance.
3) K-means is used in the DPCL model, but the value of K must be known in advance and the algorithm does not scale well to large data. We therefore adopt the density-based canopy K-means clustering algorithm, which not only improves the final performance but also makes the system compatible with conditions where the number of speakers is unknown.
The organization of this paper is as follows: Section II introduces the original DPCL model and its drawbacks. Section III introduces the proposed methods, including the multi-head self-attention mechanism and density-based canopy K-means. Section IV presents the experimental settings and results on the Wall Street Journal (WSJ0) speech corpus. Section V describes our conclusions and directions for future work.

II. DPCL MODEL
DPCL for single-channel speech separation was first put forward by Hershey et al. [11]. Its essence is to train a neural network to learn a high-dimensional embedding for each time-frequency unit so that embeddings belonging to the same speaker have minimum distance in the embedding space. Let X_tf denote the value of the complex spectrogram at time index t and frequency index f. A single time-frequency bin X_tf and a frame X_t* are mapped to p_tf and P_t* by a deep neural network, where p_tf ∈ R^{D×1}, P_t* ∈ R^{DF×1} and |p_i|^2 = 1; D is the dimension of the embeddings. Y = {y_{i,c}} is a one-hot label matrix indicating which source dominates the mixture: y_{i,c} = 1 when element i belongs to class c, and 0 otherwise. YY^T is a binary affinity matrix representing the clustering result: (YY^T)_{ij} = 1 when elements i and j belong to the same category, and 0 otherwise. DPCL defines an objective function that pushes the estimated affinity matrix as close as possible to the true one. Writing the embeddings as the rows of a matrix P, the objective is

C_Y(P) = ||PP^T - YY^T||_F^2.

The structure of DPCL is shown in Figure 1. High-dimensional embeddings are extracted by a bidirectional long short-term memory (Bi-LSTM) network, and K-means is then used to separate the time-frequency bins. It should be emphasized that K-means clustering is used only at test time, unlike other source separation methods that learn a direct mapping from mixed signals to source signals. A Bi-LSTM has two parallel long short-term memory (LSTM) layers propagating in opposite directions: it processes the sequence from front to back to obtain the forward hidden states, then from back to front to obtain the backward hidden states.
The final hidden state for each element is computed by merging its forward and backward hidden states according to some merging strategy such as concatenation, averaging or summation. However, a traditional LSTM compresses a variable-length sequence into a fixed-length vector [19]. Speech sequences of different lengths are thus encoded into fixed-length feature vectors, which loses the information of intermediate LSTM states, especially when processing long speech. Moreover, temporal pooling averages these representations over time to obtain the final encoded embedding; the major drawback is that every element of the sequence contributes equally to the representation. A trainable layer is therefore needed to assign a weight to each representation in the sequence. In addition, in the K-means clustering algorithm the value of K must be known in advance, which is impractical in real applications. To address these problems, we adopt the multi-head self-attention mechanism, which can learn features of a sequence from different aspects while retaining the outputs of the LSTM hidden units, and we also improve the original K-means clustering algorithm.
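As a concrete illustration of the DPCL training objective, the loss can be evaluated without forming the large affinity matrices by expanding the Frobenius norm. The sketch below is illustrative only, not the authors' implementation; the helper name `dpcl_loss` is hypothetical.

```python
import numpy as np

def dpcl_loss(V, Y):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2.

    V: (TF, D) unit-norm embeddings for TF time-frequency bins.
    Y: (TF, C) one-hot source-assignment labels.
    Expanding the Frobenius norm avoids building the TF x TF
    affinity matrices explicitly, which matters when TF is large."""
    a = np.sum((V.T @ V) ** 2)   # ||V^T V||_F^2
    b = np.sum((V.T @ Y) ** 2)   # ||V^T Y||_F^2
    c = np.sum((Y.T @ Y) ** 2)   # ||Y^T Y||_F^2
    return a - 2 * b + c
```

The expansion relies on tr(AB) being expressible through the smaller D×D, D×C and C×C Gram matrices, which is why DPCL scales to long utterances.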

III. MODEL
The overall architecture of our proposed model is shown in Figure 2. It consists of four parts: a pretreatment layer, an embedding network, a multi-head self-attention layer and an improved K-means layer. The last two parts are presented in detail in the following sections.

A. ATTENTION
Recently, attention mechanisms have become an integral part of models that capture the strength of relevance between representation units. The input of an attention mechanism consists of queries Q, keys K and values V; its structure is shown in Figure 3, where i ∈ [1, L] and L is the length of the input sequence. There are three main steps in calculating the attention value:
(1) Calculate the similarity between the query Q and each key K_i to obtain a weight. Commonly used similarity functions include the dot product s(Q, K_i) = Q^T K_i and the general (bilinear) form s(Q, K_i) = Q^T W_a K_i, where W_a is a trainable weight matrix.
(2) Normalize the weights using the softmax function.
(3) Compute the weighted sum of the values with the normalized weights to obtain the final attention result, where Q, K, V denote the query, key and value matrices respectively.
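The three steps can be sketched in a few lines of NumPy; this is a minimal sketch using the scaled dot-product similarity, with illustrative function names.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Steps (1)-(3): similarity, softmax normalization, weighted sum."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (1) scaled dot-product similarity
    weights = softmax(scores, axis=-1)       # (2) normalize to a distribution
    return weights @ V                       # (3) weighted sum of the values
```

Each output row is a convex combination of the value vectors, with the combination weights determined by query-key similarity.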

1) SELF-ATTENTION
Self-attention is a special form of the attention mechanism in which the queries Q, keys K and values V are identical and equal to the input sequence [18]. It calculates the representation of one unit by attending to all units within the same sequence, and can be formalized as a non-local operation modeling spatio-temporal dependencies in sequences. Self-attention has been successfully applied in various tasks such as semantic role labeling [20], machine translation [21] and relation extraction [22].

2) MULTI-HEAD SELF-ATTENTION
Developed from self-attention, multi-head attention maps the input sequence and groups of key-value pairs into weighted outputs, where the weights are calculated by a compatibility function of the corresponding key and the input sequence. Many speech recognition and speech enhancement tasks have adopted multi-head attention with good results. For example, [23] encodes short-term talker characteristics from the spectrogram and uses a multi-head attention model to map these representations into a long-term speaker embedding. By employing multi-head attention, [24] models the inner dependencies between units at different positions in the learned feature sequence, which enhances the flow of information. Reference [25] employs multi-head attention to highlight speaker-related features learned from context information in the frequency and time domains. As shown in Figure 4, for a given sequence R, the multi-head attention mechanism relies on scaled dot-product attention, which operates on queries, keys and values of dimensions d_k, d_k, d_v after linear projections. Stacking them into matrices Q, K, V, the output of attention is

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

In self-attention, Q = K = V and they come from the output of the previous layer. To exploit information from different representation subspaces at different positions, multi-head self-attention performs the self-attention function h times, generating query, key and value matrices Q_i, K_i, V_i for i = 1, ..., h; h can thus be regarded as the number of heads. Multi-head self-attention is calculated as

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,

where W_i^Q, W_i^K, W_i^V and W^O are trainable weight matrices. Since each attention head captures a different modality, an ensemble effect is expected to improve performance. But there is a potential problem: the network becomes more complex, and its performance may even deteriorate, as h increases.
In the remainder of this paper, we combine the DPCL model of [11] with multi-head self-attention and explore the most appropriate value of h for our proposed model.
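A minimal NumPy sketch of h-head self-attention is given below; random matrices stand in for the trained projections W_i^Q, W_i^K, W_i^V and W^O, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_head_self_attention(X, h):
    """Run h independent self-attention heads on sequence X (L, d),
    concatenate their outputs, and mix them with an output projection.

    The random weight matrices are stand-ins for trained parameters;
    each head works in a d // h dimensional subspace."""
    L, d = X.shape
    d_k = d // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_k)                  # scaled dot-product
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                    # softmax over keys
        heads.append(w @ V)
    W_o = rng.standard_normal((h * d_k, d)) / np.sqrt(h * d_k)
    return np.concatenate(heads, axis=-1) @ W_o          # Concat(...) W^O
```

Note the output keeps the input shape (L, d), so attention layers can be stacked on top of the Bi-LSTM outputs without reshaping.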

B. DENSITY-BASED CANOPY K-MEANS
K-means is a widely used partition-based clustering algorithm; it is simple and efficient, but determining the number of clusters and selecting the initial cluster centers remain difficult [26]. Canopy clustering, while of low accuracy, does not require the number of clusters K to be specified [27]. Density-based canopy clustering can therefore be employed as a pre-processing step, and its result used as the cluster number and initial cluster centers for the K-means algorithm [28]. Each sample x_m in the dataset M = {x_1, x_2, ..., x_N} is a vector x_m = (x_{m1}, x_{m2}, ..., x_{mr}), with m ∈ [1, N], where N is the number of elements in M. The average distance S of all elements in M is defined as the mean pairwise Euclidean distance, where d(x_i, x_j) denotes the Euclidean distance between x_i and x_j. As illustrated in (10), ρ(i) is the number of samples j whose distance to point i is less than S; f(x) is a Boolean function whose value is 1 when x < 0 and 0 otherwise.
a(i) is the average distance between samples within a cluster; it is defined in (11).
O(i) stands for the distance between element i and other elements j with higher local density; it is defined in (12).
ω is defined in (13) as the weighted product of ρ(i), a(i) and O(i).
From the practical significance of ρ(i), a(i) and O(i) we can conclude: (a) the greater ρ(i), the more samples gather around point i and the tighter the cluster; (b) the smaller a(i), and hence the larger 1/a(i), the more compact the cluster; (c) the larger O(i), the greater the dissimilarity between two clusters. To divide the dataset appropriately, ω should therefore be maximized. First, the density of every element in the dataset M is calculated according to (10). The sample with the maximum ρ becomes the first cluster center c_1; all samples whose distance to this center is less than S are added to the new cluster C_1 and simultaneously removed from the original dataset M. For the remaining samples in M, ρ(i), a(i) and O(i) are calculated according to (10)-(12), and the second cluster center c_2 is chosen to maximize ω. Samples meeting the distance condition are removed from M and added to cluster C_2, as was done for C_1. For the remaining elements of M, the distances to the points in C = {C_1, C_2} are calculated, and the third cluster center is chosen to maximize ω(i, c_1) * ω(i, c_2), where i indexes the samples in M; elements meeting the condition are then moved from M into the new cluster. This process repeats until M is empty, at which point the original dataset has been divided into K subsets. The clusters and the value of K are then passed to the K-means algorithm for more precise clustering.
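To make the procedure concrete, here is a simplified NumPy sketch of the canopy pre-clustering stage. It is illustrative rather than the authors' implementation: for brevity it picks each successive center by local density ρ alone, a simplification of the full ω criterion described above.

```python
import numpy as np

def canopy_centers(X):
    """Pick initial centers (and hence K) for K-means via density canopies.

    Simplification: each new center is the densest remaining sample;
    the full method maximizes the weight omega combining rho, a and O.
    Every sample closer than the average pairwise distance S to a new
    center is absorbed into that canopy and removed from the pool."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    S = D.sum() / (n * (n - 1))          # average pairwise distance (self-pairs are 0)
    remaining = list(range(n))
    centers = []
    while remaining:
        sub = D[np.ix_(remaining, remaining)]
        rho = (sub < S).sum(axis=1) - 1  # local density: neighbors closer than S
        c = remaining[int(np.argmax(rho))]
        centers.append(X[c])
        # drop the whole canopy; c itself has D[c, c] = 0 < S, so it goes too
        remaining = [i for i in remaining if D[c, i] >= S]
    return np.array(centers)
```

The number of returned centers serves as K, and the centers themselves can be passed to K-means as its initialization.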

IV. EXPERIMENTS
A. DATASETS
Our model is trained and evaluated on the Wall Street Journal (WSJ0) dataset. In the original WSJ0 dataset, the training part contains fifty male and fifty-one female speakers, while the testing part includes ten male and eight female speakers. Every speaker has 141 or 142 utterances, each lasting about five or six seconds. The bit rate of the recordings is 256 kb/s and the sample rate is 16 kHz. We trained and evaluated our network on the two-speaker separation problem using the WSJ0-2mix dataset [11], which contains 18000 utterances totaling about 30 hours. The mixtures were generated by randomly selecting speeches from different speakers in the WSJ0 training set si_tr_s and mixing them at random signal-to-noise ratios (SNR) between -5 dB and 5 dB. A five-hour evaluation set was generated in the same way, using utterances from 16 unseen speakers from si_dt_05 and si_et_05 in the WSJ0 dataset. The WSJ0-3mix dataset was generated with a similar approach but contains mixtures of three speakers. Most existing speech separation methods require the identity of the speakers to be known in advance, and separating unknown speakers is challenging. Besides, since mixtures of different-gender speakers separate with different performance, we created known-speaker and unknown-speaker datasets covering different gender combinations from the WSJ0 test section. Our test data therefore include a known-speaker set and an unknown-speaker set, each of which is divided into a two-male mixed set, a two-female mixed set, and a male-plus-female mixed set.
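The mixing step described above can be sketched as follows; `mix_at_snr` is a hypothetical helper that assumes the two signals are already aligned and of equal length.

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Mix interference s2 into target s1 at a given SNR in dB.

    s2 is rescaled so that the power ratio of s1 to the scaled s2
    equals 10**(snr_db / 10); the two signals are then summed."""
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2)
    g = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    return s1 + g * s2
```

Drawing snr_db uniformly from [-5, 5] for each pair of utterances reproduces the random-SNR mixing protocol.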

B. SETUP
In the pretreatment layer, every audio file is downsampled to 8 kHz. Log short-time Fourier spectral magnitudes of the mixture speech are extracted with a Hamming window of 32 ms length and 8 ms shift. Each utterance is divided into segments of 100 frames, which is roughly the length of one word. To avoid silent speech, time-frequency bins whose amplitude is more than 40 dB below the speaker's maximum amplitude are removed. A 256-point DFT is applied to each frame to extract 129-dimensional logarithmic amplitude features for Bi-LSTM training. In all experiments, the embedding network in ADCNet contains four Bi-LSTM layers with 600 hidden units per layer. The embedding dimension D is set to 40 following [11]. Adam is used as the optimizer and L2-norm regularization is applied during training to avoid overfitting. For comparison with other models, we primarily evaluate our system with the source-to-distortion ratio (SDR) from the BSS-EVAL metrics [29]. Other evaluation metrics, including the scale-invariant signal-to-distortion ratio (SI-SDR) [30], perceptual evaluation of speech quality (PESQ) scores [31] and the scale-invariant signal-to-noise ratio (SI-SNR) [32], are also used in our experiments; higher values of SDR, SI-SDR, PESQ and SI-SNR represent better separation quality. SDR is defined as

SDR := 10 log10 ( ||s_target||^2 / ||e_interf + e_noise + e_artif||^2 ),    (14)

where e_interf, e_noise and e_artif are the interference, noise and artifact error terms respectively. SI-SDR and SI-SNR are computed from the estimated source Ŝ ∈ R^{1×t} and the original clean source S ∈ R^{1×t}, where t is the length of the signals, via the scale-invariant target projection

s_target = (⟨Ŝ, S⟩ / ||S||^2) S,    e_noise = Ŝ - s_target,

SI-SNR := 10 log10 ( ||s_target||^2 / ||e_noise||^2 ),    (18)

where ||S||^2 = ⟨S, S⟩ denotes the power of the signal. Ŝ and S are both normalized to zero mean to ensure scale-invariance.
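For reference, Eq. (18) can be computed as below; this is a minimal sketch in which the zero-mean normalization makes the measure scale-invariant.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB, Eq. (18)."""
    est = est - est.mean()   # zero-mean normalization for scale invariance
    ref = ref - ref.mean()
    # project the estimate onto the reference to get the target component
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection, rescaling the estimate by any non-zero constant leaves the score unchanged, which is exactly the property the metric is designed to have.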
In the PESQ computation, d_SYM is the symmetric disturbance and d_ASYM is the asymmetric disturbance. To evaluate the effect of the multi-head self-attention mechanism, we first set up five comparative experiments using the combination ''DPCL + multi-head self-attention + K-means'', which we call ''*ADCNet''. To further improve performance, the original clustering algorithm in *ADCNet was replaced by density-based canopy K-means; this model is named ADCNet. In addition, an ablation study using the combination ''DPCL + density-based canopy K-means'' was conducted to support the effectiveness of multi-head self-attention. For every *ADCNet experiment, K was set to 2 in the K-means algorithm because each audio file in our experiments contains two speakers, while in ADCNet the number of clusters was not specified because the algorithm computes it automatically. The value of h was set to 1, 2, 3, 4 and 5 respectively; h was not increased further, mainly because for h of 6 or more the computational cost outweighed any gain. To reduce the randomness introduced by random variables, cross-validation was adopted in each experiment: runs were conducted on different test sets with different speakers, and the statistical mean over all runs was taken as the final result.

C. EXPERIMENTAL RESULTS AND ANALYSIS
SDR results for different gender combinations, together with overall performance across all combinations, for the original DPCL model [11], DPCL with density-based canopy K-means, *ADCNet and ADCNet are recorded in Table 1. SI-SDR, PESQ and SI-SNR results are shown in Tables 2, 3 and 4 respectively.

1) THE EFFECT OF MULTI-HEAD ATTENTION LAYER AND DENSITY-BASED CANOPY K-MEANS
According to Table 1, comparing the results of ''DPCL + K-means'' with ''DPCL + canopy K-means'', and ''DPCL + K-means'' with ''*ADCNet (h = 1, 2, 3, 4)'', it is clear that both the multi-head self-attention mechanism and the density-based canopy K-means algorithm improve performance, with multi-head self-attention being much more effective. Comparing ''*ADCNet'' and ''ADCNet'' in Tables 1-4, we observe that when *ADCNet performs best, that is, when h is set to 3, the SDR difference between *ADCNet and ADCNet is 0.27 dB for known speakers and 0.22 dB for unknown speakers, and the SI-SDR, PESQ and SI-SNR differences for unknown speakers are 0.27 dB, 0.14 and 0.19 dB respectively; the optimization of the clustering algorithm thus has only a small influence on ADCNet. We attribute this to multi-head self-attention: after the signal sequence has been processed by self-attention three times, the network has learned sufficiently discriminative speaker features, which both K-means and density-based canopy K-means can easily separate. Moreover, our system achieves much better separation on ''male + female'' combinations than on same-gender conditions. These results agree with previous observations in other works [11], [13], [33], [34] and confirm that same-gender mixed speech separation remains a tough task.

2) THE EFFECT OF THE NUMBER OF HEADS IN MULTI-HEAD SELF-ATTENTION LAYER
Observing the results of ''*ADCNet'' and ''ADCNet'' in Tables 1-4, we find that as h increases, separation performance first rises and then falls; the model performs best when h is 3. But when h is 5, the SDR of *ADCNet falls below the baseline DPCL model in Table 1, which indicates that multi-head self-attention does not help in all cases and that a larger h is not always better. We offer a theoretical interpretation. Multi-head self-attention acquires information from multiple aspects for feature extraction: it concatenates the outputs of several self-attention heads, so the input data are fully utilized. In addition, the dot products in multi-head attention can be realized with highly optimized matrix multiplication, so compared with a single attention layer, multiple heads do not cost much additional computation. The model therefore improves as h grows at first. However, when h becomes very large, much redundant information is retained and the number of model parameters grows rapidly; model complexity increases greatly and computational resources are exhausted, so performance degrades. An appropriate value of h thus improves performance, while an overly large value weakens the model.

3) COMPARISON WITH OTHER SEVERAL STUDIES
Our ADCNet is also compared with other single-channel speech separation models on unknown speakers in WSJ0-2mix and WSJ0-3mix. As shown in Table 5, ADCNet achieves performance comparable to the models reported in [11], [13], [14] and [30]. Figure 5 shows the spectrograms for separating a male utterance and a female utterance from mixed speech. The utterances of Sp1 and Sp2 are ''We didn't like that'' and ''A print media campaign will begin the following day'' respectively. The spectrograms of the original signals and their recovered counterparts share many similarities, which indicates the effectiveness of our model.

V. CONCLUSION
In this paper, we propose combining the physiological characteristics of human hearing with deep learning for single-channel speech separation. Specifically, we propose the joint optimization of multi-head self-attention and the DPCL model, and show experimentally that multi-head self-attention achieves better results. We further optimize the model with density-based canopy K-means, and our model outperforms other advanced models. In future work, we will investigate the label permutation problem and reduce the latency of speaker-independent speech separation.