Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET

One of the most common approaches to sound event detection is the traditional convolutional neural network (CNN) or the convolutional recurrent neural network (CRNN) and their variants. However, the pooling operation of the CNN loses the location information of the target object. We therefore drop the pooling operation, retain the ReLU and convolution operations, and exploit the strong dictionary constraints and the penalty-function prior of multi-layer convolutional sparse coding (ML-CSC). We propose an iterative deep neural network, the unfolded multi-layer local block coordinate descent network (ML-LoBCoD-NET), driven by the multi-layer local block coordinate descent (ML-LoBCoD) algorithm, which extends the local block coordinate descent (LoBCoD) algorithm. The ML-LoBCoD-NET extracts features different from those of the CNN. More importantly, for the weakly-supervised sound event detection task, we propose the MRNN-Att network, which combines the ML-LoBCoD-NET, a recurrent neural network (RNN), and an attention network. The MCRNN-Att network combines the MRNN-Att and CRNN networks to fuse their different features. Furthermore, for the semi-supervised sound event detection task, the MRNN-Att mean teacher model (MRNN-Att-MT) and the MCRNN-Att mean teacher model (MCRNN-Att-MT) are proposed, in which the MRNN-Att and the MCRNN-Att networks, respectively, are selected as the student model. These models were tested on the dataset of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 4. The F1 score of the MRNN-Att-MT on the development set was 22.83%, which was 8.77 percentage points higher than the baseline system. Its F1 score on the evaluation set was 15.68%, which was 4.88 percentage points higher than the baseline system. The MCRNN-Att-MT model had an F1 score of 20.35% on the development set, which was 6.29 percentage points higher than the baseline system, and an F1 score of 14.56% on the evaluation set, which was 3.76 percentage points higher than the baseline system.


I. INTRODUCTION
People rely on sounds in the environment to obtain important information. Sound event detection (SED) can detect specific audio events from audio recordings, estimate the onset and offset times of sound events, and provide a label for each event. SED has great potential in many applications, such as information retrieval, monitoring systems, and automatic control of devices in smart home systems [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Lorenzo Mucchi.
Typical traditional methods for SED use hidden Markov models (HMMs) [2], support vector machines (SVMs) [3], and non-negative matrix factorization (NMF) [4]. However, to build a system based on HMMs, multiple labels need to be provided at the same time. When a deep learning method is chosen for SED, the structure and training of the neural network directly allow multi-label classification. The parameters of the neural network can be trained simultaneously, and the network outputs the results directly [5]. Therefore, in recent years, most SED problems have been addressed with deep learning methods, such as the convolutional neural network (CNN) [6], the convolutional recurrent neural network (CRNN) [7], or the capsule network [8].
However, the CNN pooling operation has the disadvantage of losing the location information of the target object [9]. The traditional CNN is also not interpretable [10]; it is a black-box method. Interpretability is nevertheless necessary in many applications, such as monitoring, health care, and education. Interpretable machine learning means that a machine learning model can explain why certain predictions are made [11].
Interpretable deep network models are a current research hotspot. One class of interpretable methods is the optimization algorithm-driven deep network [12], and deep neural networks driven by optimization algorithms have recently become increasingly popular. Gregor and LeCun [13] proposed the learned iterative shrinkage-thresholding algorithm (LISTA) network, which uses learned matrices to produce the lowest possible loss in a given number of iterations. Borgerding et al. [14] proposed a learned approximate message passing (LAMP) network and a learned vector AMP (LVAMP) network. The LAMP network significantly improved on the LISTA network. Ito et al. [15] proposed trainable ISTA (TISTA), a novel sparse signal recovery algorithm. TISTA consists of two estimation units, a linear estimation unit and a minimum mean squared error (MMSE) estimator-based shrinkage unit. The numerical results show that TISTA converges faster than AMP and LISTA.
The convolutional sparse coding (CSC) model and its optimization algorithms carry strong prior knowledge [16]. The CSC prior replaces the traditional image patch-based model with a global shift-invariant model. It imposes a global dictionary constrained by a specific structure, a concatenation of banded circulant matrices, which limits the degrees of freedom introduced by a general sparsity-based model. The dictionary is an important factor in forming the prior, because its atoms determine the signals that the model can represent sparsely. The $\ell_1$ sparsity prior is applied to the sparse code, which is solved for by optimization algorithms [17], [18]. The dictionaries and sparse codes of multi-layer convolutional sparse coding (ML-CSC) inherit the same prior knowledge [19]. The ML-CSC optimization algorithms can be converted into iterative neural networks that extract features different from those extracted by the CNN, which may not have comparably strong constraints. For CSC problems, Zisselman et al. [20] proposed a slice-based local block coordinate descent (LoBCoD) algorithm for performing the global basis pursuit and introduced a new stochastic gradient descent version of LoBCoD for training the convolutional filters. For ML-CSC problems, Sulam et al. [21] proposed the multi-layer ISTA (ML-ISTA) and multi-layer FISTA (ML-FISTA) algorithms. The two algorithms can converge to the global optimum. ML-ISTA-NET is a deep network structure based on the iterative unfolding of the ML-ISTA algorithm. The learnable network parameters are updated by the backpropagation algorithm used in deep learning. One iteration of the ML-ISTA algorithm implements a traditional CNN, while a new recurrent architecture emerges with the subsequent iterations.
Inspired by ML-ISTA and the corresponding iterative unfolding network ML-ISTA-NET, we extend the LoBCoD algorithm to the multi-layer basis pursuit problem of ML-CSC. A multi-layer local block coordinate descent (ML-LoBCoD) algorithm and a multi-layer local block coordinate descent network (ML-LoBCoD-NET) with iterative unfolding are proposed. ML-LoBCoD-NET implements the representation learning of the signal, and the output of the deepest convolutional sparse coding layer is used for classification. ML-LoBCoD-NET retains the ReLU and convolution operations, uses the strong constraints of the ML-CSC algorithm, and does not use the pooling operation.
Inspired by the CRNN-Att network [22], we propose the MRNN-Att network, which combines the ML-LoBCoD-NET, a recurrent neural network (RNN), and an attention network. The MCRNN-Att network combines the MRNN-Att and CRNN networks to fuse their different features. The MRNN-Att and MCRNN-Att networks are used for weakly-supervised sound event detection tasks.
Many methods for solving SED problems rely on a fully supervised approach using strongly labeled data (SLD). However, strongly labeled data require the onset and offset times of every audio event to be annotated, and creating a large amount of SLD is a difficult, time-consuming, and expensive process [23]. Recently, many audio datasets have been weakly labeled, and these are typically larger than strongly labeled SED datasets. Compared with SLD, weakly labeled data (WLD) only indicate whether an audio event is present in the recording. A strong label gives the class of an audio event together with its onset and offset times. A weak label gives only the class labels of the audio clip. Weakly supervised learning has been studied for sound event detection from weakly labeled datasets, and the models include the joint separation-classification (JSC) model [24], the attention and localization model [25], and the multiple instance learning (MIL) method [26]. Tarvainen and Valpola [27] proposed the mean teacher (MT) model for the semi-supervised learning of images. The mean teacher model can solve the semi-supervised learning problem and can effectively use unlabeled data. It includes a student model and a teacher model, which share the same architecture. The main idea of the mean teacher model is that averaging the model weights over the training steps tends to produce a more accurate model than using the final weights directly. A key issue of the mean teacher model is the choice of the student model. For example, if the student model is a traditional model such as an SVM, the mean teacher model can only solve the traditional supervised learning problem. If the student model is the CRNN model, which is commonly used for sound event detection, or one of the models proposed in this paper, the mean teacher model can solve the weakly supervised learning problem.
In summary, the choice of the student model in the mean teacher framework determines whether it can deal with the supervised learning problem or the weakly supervised learning problem. Whichever learning mode is chosen, the mean teacher framework can deal with semi-supervised learning.
Inspired by the use of the mean teacher model for semi-supervised problems, this paper proposes two mean teacher models for sound event detection tasks in the domestic environment. The first proposed mean teacher model is the MRNN-Att-MT, whose student model is the MRNN-Att. The second proposed mean teacher model is the MCRNN-Att-MT, whose student model is the MCRNN-Att.
The weakly-labeled sound event detection task is the core problem of this paper, and the proposed MRNN-Att network is its core method. The MRNN-Att network is based on the ML-LoBCoD-NET, which is driven by the ML-LoBCoD algorithm. For the sound event detection task, the ML-LoBCoD-NET shows a feature extraction ability different from that of the CNN. The MCRNN-Att network is a combination of the MRNN-Att and the CRNN-Att. We use the MRNN-Att and the MCRNN-Att, respectively, as the student model in the mean teacher model.
The remainder of the paper is organized as follows. The CRNN-Att network is introduced in Section II-A, and the CRNN-Att-MT model is introduced in Section II-B. The ML-LoBCoD algorithm and the ML-LoBCoD-NET are proposed in Section III. For the weakly-labeled, weakly-supervised sound event detection task, the MRNN-Att network based on the ML-LoBCoD-NET is proposed in Section IV-A. To fully utilize the feature information of both the CNN and the ML-LoBCoD-NET, the MCRNN-Att network is proposed in Section IV-B. For the weakly-labeled, semi-supervised sound event detection task, the MRNN-Att-MT and the MCRNN-Att-MT are proposed in Section V. The experimental results and analysis are given in Section VI. The conclusion is given in Section VII.

A. THE CRNN-ATT NETWORK FOR SOUND EVENT DETECTION TASK
For the sound event detection task, a weakly-supervised learning model is needed. The CRNN-Att network is a common model for this task [25] and is described below. A CNN consists of three basic components: convolutional layers, pooling layers, and fully-connected layers. A convolutional layer first performs convolution operations to produce a set of linear activations, which are then fed into a non-linear activation function such as ReLU or tanh. Pooling layers are usually used after each convolutional layer to reduce the size of the convolutional output and the computational burden of the following layers. The pooling function divides its input into a set of rectangles, and each sub-region generates a summary statistic of the nearby input. Pooling is very useful for extracting the most salient information from an area. After several convolutional and pooling layers, fully-connected layers are adopted at the end of a CNN. A fully-connected layer in a CNN is similar to a layer in a standard neural network, where the neurons in adjacent layers are fully pairwise connected and the neurons in the same layer share no connection.
The advantage of a CNN is that it can effectively process spatially structured data with large width and height. An RNN, in contrast, is designed to handle longer sequences. In an RNN, a hidden layer with self-connected units acts as a memory that accumulates information over time from the input sequence. However, the gradient vanishes when an RNN is trained to capture long-term dependencies. To combat the vanishing gradient problem, several techniques have been proposed, such as the long short-term memory (LSTM) [28] and the gated recurrent unit (GRU) [22]. The LSTM and GRU architectures accumulate information by replacing self-connected units with memory blocks, which better capture the long-term dependencies in time series data.
The CRNN is a network structure that combines a CNN and a RNN, benefiting from the advantages of both. A RNN can work well in a time domain while a CNN can apply a linear convolutional filter in the time domain and frequency domain of local features. In addition, a CRNN has proven to work well in sound event detection tasks [7].
The attention mechanism increases the focus on the important time frames through weighting: the attention layer automatically selects the important frames of the target and ignores the irrelevant parts (such as background noise segments). It can also be viewed as learning a weighting factor for each frame. Because the system suppresses the background noise, the whole system is more robust [22]. The CRNN-Att network, which combines a CRNN with an attention network, has also been used for sound event detection tasks and has achieved good results [25].

B. THE MEAN TEACHER MODEL BASED ON THE CRNN-ATT NETWORK FOR SOUND EVENT DETECTION TASK
For the semi-supervised sound event detection task, the mean teacher model is a recent method [27]. It can effectively utilize large amounts of unlabeled data. The mean teacher model based on the CRNN-Att network was used for the SED task in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge and obtained first place [29]. It was then adopted as the baseline system in the DCASE 2019 challenge [30].
The mean teacher model consists of a student model and a teacher model, and the teacher model uses the same architecture as the student model [27]. For the sound event detection task, the student model in the mean teacher model is a weakly-labeled, weakly-supervised deep learning model.
The inputs of the student model and the teacher model are the same sample with different noise. The outputs of the student model and the teacher model both include strong labels and weak labels. The main idea of the mean teacher model is to average the parameters of the student model over the training steps to obtain the parameters of the teacher model. It is easier to obtain accurate results with the teacher model, which uses the averaged parameters, than with the student model, which directly uses the final parameters. Thus, the final strong and weak label outputs of the mean teacher model are those obtained from the teacher model.
For weakly labeled training data, there are three losses, which are the classification loss, strong consistency loss, and weak consistency loss. The classification loss is the multi-class cross entropy loss of the weak labels generated by the student model and weak reference labels of the training data. The strong consistency loss refers to the consistency loss of time frame level between the strong labels generated by the student model and the strong labels generated by the teacher model. The weak consistency loss refers to the consistency loss of the clip level between the weak labels generated by the student model and the weak labels generated by the teacher model.
There are two losses for unlabeled training data, the strong consistency loss and weak consistency loss.
The five loss functions are weighted and summed for optimization. The backpropagation algorithm is used to update the parameters of the student model. After the parameters of the student model are updated, the parameters of the teacher model are updated as an exponential moving average or a randomly weighted average of the student parameters.
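The teacher update described above can be illustrated with a minimal sketch. It assumes the exponential moving average form, with parameters stored as plain lists of floats; the function name `ema_update` and the toy values are hypothetical and are not taken from the paper's implementation.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher parameters as an exponential moving average.

    teacher, student: lists of floats of equal length (illustrative stand-ins
    for network weights). alpha is the EMA decay.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


# Toy run: the student parameters stay fixed, and the teacher drifts toward them.
teacher = [0.0, 0.0]
student_steps = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
for student in student_steps:
    # In training, this call would follow each backpropagation step of the student.
    teacher = ema_update(teacher, student, alpha=0.9)
```

With a decay close to 1 the teacher changes slowly, so it averages the student over many training steps rather than tracking the most recent update.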

III. THE PROPOSED ML-LoBCoD ALGORITHM AND ITS ITERATIVE UNFOLDED NETWORK
The pooling operation in the CNN has the disadvantage of losing the location information of the target object. We therefore drop the pooling operation, retain the ReLU and convolution operations, and use the strong dictionary constraints and the $\ell_1$ penalty-function prior of multi-layer convolutional sparse coding (ML-CSC). The iteratively unfolded multi-layer local block coordinate descent network (ML-LoBCoD-NET) is proposed in this section. The ML-LoBCoD-NET is driven by the multi-layer local block coordinate descent (ML-LoBCoD) algorithm, which extends the local block coordinate descent (LoBCoD) algorithm.

A. THE PROPOSED ML-LoBCoD ALGORITHM
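Given a set of convolutional dictionaries $\{D_j\}_{j=1}^{J}$, or equivalently local convolutional filters $\{D_{L,j}\}_{j=1}^{J}$, with appropriate dimensions, the global signal $X \in \mathbb{R}^{N}$ can be represented by the slice-based multi-layer convolutional sparse coding (MLCSC-S) model as follows:

$$X = \sum_{i=1}^{N} P_{1,i}^{T} D_{L,1}\,\alpha_{1,i}, \qquad \|\Gamma_1\|_{0,\infty} \le \lambda_1 \tag{1}$$
$$\Gamma_1 = \sum_{i=1}^{N} P_{2,i}^{T} D_{L,2}\,\alpha_{2,i}, \qquad \|\Gamma_2\|_{0,\infty} \le \lambda_2 \tag{2}$$
$$\Gamma_{J-1} = \sum_{i=1}^{N} P_{J,i}^{T} D_{L,J}\,\alpha_{J,i}, \qquad \|\Gamma_J\|_{0,\infty} \le \lambda_J \tag{3}$$

where the norm $\ell_{0,\infty}$ is defined as the maximal number of non-zeros in any local stripe of the vector; $\Gamma_j$ is the sparse representation of the $j$-th layer; $P_{j,i}^{T}$ is the operator that places the $i$-th $n$-dimensional slice $D_{L,j}\alpha_{j,i}$ at its position in the global vector and pads the rest with zeros; $D_{L,j}$ is the $j$-th layer local dictionary; $\alpha_{j,i}$ is the $i$-th needle (local sparse vector) of the $j$-th layer; $\lambda_j$ is a hyperparameter; and $N$ is the number of slices. In general $N$ differs from layer to layer, but for the sake of simplicity it is assumed that all layers have the same $N$. Let $\Gamma_0$ denote the signal $X$; the MLCSC-S model can then be rewritten as follows:

$$\Gamma_{j-1} = \sum_{i=1}^{N} P_{j,i}^{T} D_{L,j}\,\alpha_{j,i}, \qquad \|\Gamma_j\|_{0,\infty} \le \lambda_j, \qquad j = 1, \dots, J \tag{4}$$

According to Formula (4), the multi-layer basis pursuit problem proposed in this paper is as follows:

$$\min_{\{\alpha_{j,i}\}_{i=1}^{N}} \; \frac{1}{2}\Big\|\Gamma_{j-1} - \sum_{i=1}^{N} P_{j,i}^{T} D_{L,j}\,\alpha_{j,i}\Big\|_2^2 + \lambda_j \sum_{i=1}^{N} \|\alpha_{j,i}\|_1, \qquad j = 1, \dots, J \tag{5}$$

The slice-based local block coordinate descent algorithm is extended to multiple layers, which gives the slice-based multi-layer local block coordinate descent (ML-LoBCoD) algorithm. The ML-LoBCoD algorithm divides the layer sparse vector $\Gamma_j$ into the local vectors (needles) $\{\alpha_{j,i}\}_{i=1}^{N}$ and then searches for the optimal solution of a single needle $\alpha_{j,i}$ while the other needles are fixed and regarded as constants. Equation (5) can be written as

$$\min_{\alpha_{j,i}} \; \frac{1}{2}\Big\|\Big(\Gamma_{j-1} - \sum_{m \ne i} P_{j,m}^{T} D_{L,j}\,\alpha_{j,m}\Big) - P_{j,i}^{T} D_{L,j}\,\alpha_{j,i}\Big\|_2^2 + \lambda_j \|\alpha_{j,i}\|_1 \tag{6}$$

Defining $R_{j,i} = \Gamma_{j-1} - \sum_{m \ne i} P_{j,m}^{T} D_{L,j}\,\alpha_{j,m}$ as the layer residual without the contribution of the needle $\alpha_{j,i}$, Equation (6) can be rewritten as

$$\min_{\alpha_{j,i}} \; \frac{1}{2}\big\|R_{j,i} - P_{j,i}^{T} D_{L,j}\,\alpha_{j,i}\big\|_2^2 + \lambda_j \|\alpha_{j,i}\|_1 \tag{7}$$

Equation (7) is further organized as follows:

$$\min_{\alpha_{j,i}} \; \frac{1}{2}\big\|P_{j,i} R_{j,i} - D_{L,j}\,\alpha_{j,i}\big\|_2^2 + \lambda_j \|\alpha_{j,i}\|_1 \tag{8}$$

where $P_{j,i} \in \mathbb{R}^{n \times N}$ is the operator that extracts the $i$-th $n$-dimensional patch at the $j$-th layer (here, from the residual $R_{j,i}$). Equations (7) and (8) have the same minimizer because the part of $R_{j,i}$ outside the $i$-th patch does not depend on $\alpha_{j,i}$.

VOLUME 8, 2020

The optimal convolutional sparse representation of each layer is updated by a gradient step followed by soft thresholding. The gradient step is $\alpha^{k} \leftarrow \alpha^{k-1} - \partial f / \partial \alpha^{k-1}$, where $f$ denotes the error term in the objective function of Equation (8).
The derivative can be given as follows, using $P_{j,i} P_{j,i}^{T} = I$:

$$\frac{\partial f}{\partial \alpha_{j,i}} = -D_{L,j}^{T}\big(P_{j,i} R_{j,i} - D_{L,j}\,\alpha_{j,i}\big) = -D_{L,j}^{T} P_{j,i}\big(R_{j,i} - P_{j,i}^{T} D_{L,j}\,\alpha_{j,i}\big) \tag{9}$$

Similar to the ML-ISTA, the update of the needle $\alpha_{j,i}$ is expressed as

$$\alpha_{j,i}^{k} = S_{\lambda_j}\Big(\alpha_{j,i}^{k-1} + D_{L,j}^{T} P_{j,i}\big(R_{j,i}^{k} - P_{j,i}^{T} D_{L,j}\,\alpha_{j,i}^{k-1}\big)\Big) \tag{10}$$

where $S_{\lambda}$ is the soft threshold function, which can also be replaced by the ReLU function [21]; $P_{j,i}$ represents the image-to-column operation; $P_{j,i}^{T}$ represents the column-to-image operation; $D_{L,j}$ represents the transposed convolution operation; and $D_{L,j}^{T}$ represents the convolution operation.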
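The needle update described above (a gradient step on the quadratic error term followed by the soft threshold $S_\lambda$) can be sketched on a toy problem. The sketch assumes a unit step size, a small dense local dictionary `D` stored as a list of rows, and an already extracted residual patch; the function names are hypothetical, and a real implementation would use convolution and transposed convolution instead of explicit matrices.

```python
def soft_threshold(v, lam):
    """Element-wise soft threshold: sign(x) * max(|x| - lam, 0)."""
    out = []
    for x in v:
        mag = max(abs(x) - lam, 0.0)
        out.append(mag if x >= 0 else -mag)
    return out


def matvec(D, a):
    """Compute D @ a for D given as a list of rows."""
    return [sum(d * x for d, x in zip(row, a)) for row in D]


def matTvec(D, r):
    """Compute D^T @ r for D given as a list of rows."""
    return [sum(D[i][j] * r[i] for i in range(len(D))) for j in range(len(D[0]))]


def needle_update(alpha, D, patch_residual, lam):
    """One update of a needle: alpha <- S_lam(alpha + D^T (patch_residual - D alpha)).

    patch_residual plays the role of the extracted patch of the local residual.
    """
    err = [p - q for p, q in zip(patch_residual, matvec(D, alpha))]
    step = [a + g for a, g in zip(alpha, matTvec(D, err))]
    return soft_threshold(step, lam)


# Toy example: with an identity dictionary the update reduces to thresholding the patch.
D = [[1.0, 0.0], [0.0, 1.0]]
alpha_new = needle_update([0.0, 0.0], D, [1.0, -0.2], lam=0.5)
```

For non-negative codes the soft threshold coincides with a ReLU applied after subtracting the threshold, which is why the ReLU can stand in for it in the unfolded network.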
The proposed slice-based multi-layer local block coordinate descent algorithm is shown in Algorithm 1. The input is the signal $y$ contaminated by the noise $w$, and the output is the set of layer sparse representations $\{\Gamma_j\}_{j=1}^{J}$. At the start, the top convolutional sparse coding $\Gamma_0$ and the top layer residual $R_0$ are initialized. In each iteration, first, the local residual $R^{k}_{j,i}$ is obtained from the layer residual $R^{k-1}_{j}$ of the previous iteration and the $i$-th needle $\alpha^{k-1}_{j,i}$ of the $j$-th layer convolutional sparse code of the previous iteration. Secondly, the needle $\alpha^{k}_{j,i}$ is updated using Formula (10). All the updated needles together form the convolutional sparse coding $\Gamma^{k}_{j} = \{\alpha^{k}_{j,i}\}_{i=1}^{N}$ of the $j$-th layer. Thirdly, the layer residual $R^{k}_{j}$ is computed from $\Gamma^{k}_{j-1}$ of the $(j-1)$-th layer and $\Gamma^{k}_{j}$ of the $j$-th layer. The convolutional sparse coding of the next layer is then updated. The iteration is repeated $K$ times until the optimized deep convolutional sparse coding is obtained.
Algorithm 1 (ML-LoBCoD), for each iteration $k$ and each layer $j$: calculate the local residual $R^{k}_{j,i}$; update the needle $\alpha^{k}_{j,i}$ by Formula (10); update the layer residual $R^{k}_{j}$.
The slice-based multi-layer local block coordinate descent algorithm (ML-LoBCoD) provides an effective algorithm for the multi-layer basis pursuit. The iteratively unfolded multi-layer local block coordinate descent network (ML-LoBCoD-NET), which is similar to the ML-ISTA-NET [21], is proposed, as shown in FIGURE 1. The LoBCoD algorithm is unfolded into a single layer of a neural network, and the ML-LoBCoD algorithm is unfolded into a multi-layer neural network. ML-LoBCoD-NET iterates $K$ times to form a recurrent structure, because the algorithm needs several iterations to reach its best performance. It has the same parameters as a traditional CNN, so unfolding more iterations does not change the number of network parameters. In FIGURE 1, the ML-LoBCoD-NET is a three-layer feedforward neural network. One iteration of the ML-LoBCoD-NET corresponds to the traditional CNN, and FIGURE 1 shows three iterations. The input of ML-LoBCoD-NET is $X$ and the output is $\Gamma_3$.
Firstly, the input signal is $X$, and the initial values $\Gamma_1$, $\Gamma_2$ and $\Gamma_3$ are generated by standard convolution operations. A second estimate $\tilde{\Gamma}_1$ is generated by a deconvolution operation using $\Gamma_2$, and $\tilde{\Gamma}_2$ is generated by a deconvolution operation using $\Gamma_3$. Then, the final initial value of $\Gamma_1$ is obtained by weighting $\Gamma_1$ and $\tilde{\Gamma}_1$, and the final initial value of $\Gamma_2$ is obtained by weighting $\Gamma_2$ and $\tilde{\Gamma}_2$, where $\beta$ is the weight. When $\beta = 0$, the signal does not satisfy the ML-CSC model; when $\beta = 1$, the signal satisfies the ML-CSC model. In this experiment, the value of $\beta$ was gradually increased from 0 to 1.
Secondly, the first iteration of the algorithm is unfolded into the three-layer network corresponding to the third column in FIGURE 1. The CSC estimate $\Gamma_1$ of the first layer is obtained by the ML-LoBCoD algorithm using $\Gamma_0$ and $\Gamma_1$. The CSC estimate $\Gamma_2$ of the second layer is obtained by the ML-LoBCoD algorithm using $\Gamma_1$ and $\Gamma_2$. The CSC estimate $\Gamma_3$ of the third layer is obtained by the ML-LoBCoD algorithm using $\Gamma_2$ and $\Gamma_3$. Thirdly, the second iteration of the algorithm is unfolded into the three-layer network corresponding to the fourth column in FIGURE 1, and the three layer estimates are updated in the same way from the results of the first iteration.
Finally, the third iteration of the algorithm is unfolded into the three-layer network corresponding to the fifth column in FIGURE 1, again updating $\Gamma_1$, $\Gamma_2$ and $\Gamma_3$ layer by layer in the same way. A fully connected layer is added after $\Gamma_3$ as a classifier. ML-LoBCoD-NET was tested on the MNIST dataset (http://yann.lecun.com/exdb/mnist/). The classification accuracy rates of the CNN, ML-ISTA, ML-FISTA, ML-LISTA, Layered Basis Pursuit (LBP), and ML-LoBCoD are given in Table 1. The classification accuracies of the ML-ISTA, ML-FISTA, and ML-LoBCoD networks under different numbers of iterations are given in FIGURE 2. As shown in Table 1, the classification accuracy of the ML-LoBCoD network is higher than that of the CNN and ML-ISTA on the MNIST dataset. In addition, FIGURE 2 shows that the classification accuracy of the ML-LoBCoD network is higher and more stable than that of the ML-ISTA and ML-FISTA.
ML-LoBCoD-NET was also tested on the CIFAR10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html). The classification accuracy rates of the CNN, ML-ISTA, ML-FISTA, ML-LISTA, LBP, and ML-LoBCoD are given in Table 2. As shown in Table 2, the classification accuracy of the ML-LoBCoD network is higher than that of the CNN, ML-ISTA, ML-FISTA, ML-LISTA and LBP on the CIFAR10 dataset, which indicates that the ML-LoBCoD network extracts features better than these methods on this dataset.

IV. THE PROPOSED MRNN-ATT NETWORK BASED ON ML-LoBCoD-NET FOR SOUND EVENT DETECTION TASK
In this section, for the weakly-supervised learning problem of the sound event detection task, we first replace the CNN of the CRNN-Att network in Section II-A with the ML-LoBCoD-NET. The resulting MRNN-Att network is proposed for the sound event detection task in Section IV-A. Moreover, to fully utilize the feature information of both the CNN and the ML-LoBCoD-NET, the MCRNN-Att network is proposed in Section IV-B.

A. THE MRNN-ATT NETWORK
The proposed MRNN-Att network is shown in FIGURE 3. The input is a log-mel spectrogram of an audio clip. The output is the prediction of the strong labels and the weak labels, where the weak label prediction is used for training with the weak labels and the strong label prediction is used for locating the sound events in time. The network includes the ML-LoBCoD-NET with $K$ unfolded iterations, an RNN, and an attention network. The ML-LoBCoD-NET is used to extract the features of the audio clip. The RNN is a two-layer Bi-GRU network, which can capture the context information of sound events and model the long-term patterns of the entire clip well. The attention network has two FNN layers, one with a softmax activation and one with a sigmoid activation.
The network has two outputs. The first attention layer uses the sigmoid activation function to predict the probability of occurrence of each sound event class $c$ in each time frame $t$, which gives the strong label. The weak label is obtained by aggregating these frame-level predictions over time using the output of the second attention layer. Here $Z_{class}(c, t)$ and $Z_{att}(c, t)$ denote the outputs of the first attention layer and the second attention layer, respectively, and $T$ is the number of time frames, that is, the temporal resolution of the input spectrogram or the feature map. For training, the loss function is the multi-class cross-entropy loss between $O(c, n)$ and $P(c, n)$, which represent the weak prediction labels and the weak reference labels for the $n$-th sample of the $c$-th class, respectively. The batch size is $N$, and the total number of classes is $C$. The gradient of the loss function with respect to the network parameters is calculated using the back-propagation algorithm, and the parameters of the neural network are updated accordingly.
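The relation between the strong and weak outputs can be illustrated with a short sketch. It assumes the attention pooling form commonly used with CRNN-Att models, $O(c) = \sum_{t=1}^{T} Z_{class}(c,t)\,Z_{att}(c,t) \,/\, \sum_{t=1}^{T} Z_{att}(c,t)$; this exact form, the function name, and the toy values are assumptions and are not confirmed by the text above.

```python
def attention_pool(z_class, z_att):
    """Aggregate frame-level class probabilities into one clip-level probability.

    z_class: frame-level probabilities for one class (strong prediction), length T.
    z_att:   positive attention weights for the same class, length T.
    Returns the attention-weighted average (assumed pooling form).
    """
    weighted = sum(c * a for c, a in zip(z_class, z_att))
    return weighted / sum(z_att)


# Toy clip with T = 3 frames for a single class.
z_class = [0.9, 0.1, 0.5]     # strong (frame-level) prediction
z_att = [0.5, 0.25, 0.25]     # attention weights favouring the first frame
weak = attention_pool(z_class, z_att)
```

Frames with larger attention weights contribute more to the clip-level prediction, so frames dominated by background noise can be down-weighted.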

B. THE MCRNN-ATT NETWORK
The proposed MCRNN-Att network is shown in FIGURE 4. The input is a log-mel spectrogram of the audio clip. In the mean teacher model, the student model and the teacher model receive the same sample with different noise $(\eta, \eta')$. The student model outputs strong labels and weak labels. The teacher model also outputs strong labels and weak labels. Five loss functions are computed. After the parameters of the student model have been updated with the backpropagation algorithm, the teacher model weights are updated as an exponential moving average of the student weights.
The specific definition of the loss function is given below. In the semi-supervised setup, the training data consist of weakly labeled data and unlabeled data. The parameters of the student model are denoted by $\theta$, and the parameters of the teacher model are denoted by $\theta'$. $f(x; \theta)$ denotes the weak label output of the student model, and $f(x; \theta')$ denotes the weak label output of the teacher model. The strong label output of the student model is denoted by $f_{strong}(x; \theta)$, and the strong label output of the teacher model is denoted by $f_{strong}(x; \theta')$.
Firstly, the multi-class cross entropy loss function $L_{ce}(x, y, \theta)$ is used for the supervised training. Secondly, given the sample $x$ with two input perturbations $\eta$ and $\eta'$ and the two sets of network parameters $\theta$ and $\theta'$, the strong consistency loss between the strong prediction label of the student model $f_{strong}(x, \eta; \theta)$ and the strong prediction label of the teacher model $f_{strong}(x, \eta'; \theta')$ is defined in mean square error form. The weak consistency loss between the weak prediction label of the student model $f(x, \eta; \theta)$ and the weak prediction label of the teacher model $f(x, \eta'; \theta')$ is defined in multi-class cross entropy form. Finally, the total loss for training the student model is defined as the weighted sum of the five losses, Formula (16). The parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ control the relative importance of the consistency terms in the total loss. In this study, the values of $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ were set to 1 for simplicity. The training process is organized as follows:
(1) The inputs of the mean teacher model are the unlabeled in-domain training set and the labeled training set. The data are fed into the mean teacher model to generate the outputs, which are strong labels and weak labels.
(2) Five losses are calculated, and then the total loss is calculated according to Formula (16).
(3) The parameters of the student model are updated using the backpropagation algorithm by minimizing the total loss.
(4) Using the parameters of the student network, the parameters of the teacher model are updated as a moving average of the current and previous student parameters.
(5) The above steps are repeated until the network converges.
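The combination of the five losses can be sketched as follows. The sketch makes several assumptions that the text does not confirm: the cross entropy is taken as the mean binary cross entropy over classes, the strong outputs are flattened into lists of frame-level probabilities, and $\lambda_1$ to $\lambda_4$ are assigned to the four consistency terms in the order shown. All function and key names are hypothetical.

```python
import math


def cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross entropy over classes (assumed form of the CE loss)."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)


def mse(a, b):
    """Mean square error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(labeled, unlabeled, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the classification loss and four consistency losses."""
    l1, l2, l3, l4 = lambdas
    class_loss = cross_entropy(labeled["student_weak"], labeled["weak_ref"])
    strong_l = mse(labeled["student_strong"], labeled["teacher_strong"])
    weak_l = cross_entropy(labeled["student_weak"], labeled["teacher_weak"])
    strong_u = mse(unlabeled["student_strong"], unlabeled["teacher_strong"])
    weak_u = cross_entropy(unlabeled["student_weak"], unlabeled["teacher_weak"])
    return class_loss + l1 * strong_l + l2 * weak_l + l3 * strong_u + l4 * weak_u


# Toy batch: one weakly labeled clip and one unlabeled clip, two classes, two frames.
labeled = {
    "weak_ref": [1.0, 0.0],
    "student_weak": [0.8, 0.1], "teacher_weak": [0.9, 0.2],
    "student_strong": [0.7, 0.9, 0.1, 0.0], "teacher_strong": [0.6, 0.8, 0.2, 0.1],
}
unlabeled = {
    "student_weak": [0.3, 0.6], "teacher_weak": [0.4, 0.5],
    "student_strong": [0.2, 0.4, 0.7, 0.5], "teacher_strong": [0.3, 0.3, 0.6, 0.6],
}
loss = total_loss(labeled, unlabeled)
```

Only the classification term needs reference labels, which is why the unlabeled data can still contribute to training through the two consistency terms.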

A. DATASET
The dataset used for the experiments in this paper was the dataset of the DCASE 2018 Challenge Task 4 [31]. The task used weakly labeled data (without timestamps) to evaluate large-scale detection systems for sound events. The test set was annotated with strong labels, including time boundaries, by human annotators. The evaluation set included 880 clips (3227 events). The audio clips were divided into ten categories: alarm, speech, dog, cat, dishes, frying, blender, running water, vacuum cleaner, and electric shaver. Table 3 shows the number of 10-s clips in the labeled training set and the test set of the development set and in the evaluation set, along with the number of complete sessions for each activity.

B. FEATURE EXTRACTION
The log Mel spectrum is widely used in sound event detection [32], so we used log Mel filters to process the audio clips. Each audio clip was first resampled at 44.1 kHz, because we believe that resampling at lower rates may confuse some categories, such as "electric shaver/toothbrush" and "vacuum cleaner". After resampling, we applied a short-time Fourier transform with a window size of 2048 and a shift of 512 between neighboring windows to extract the spectrogram of the audio clips. This configuration provides good resolution in both the time and frequency domains. Then a Mel filter bank with 64 bands was applied to the spectrogram, and a logarithmic operation was performed to obtain the log Mel spectrogram, which is a time-frequency representation. Thus, for a 10-s clip, an 864×64 feature was obtained. The log Mel spectrograms of each audio event are shown in FIGURE 6.
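The size of the feature follows from the sampling rate, the clip length, and the window shift. The sketch below assumes that 512 is the shift (hop) between neighboring windows and uses two common framing conventions; the function name is hypothetical. Both conventions give a frame count slightly below 864, so the reported size presumably reflects additional padding that the text does not specify.

```python
def num_frames(n_samples, hop, n_fft, center=True):
    """Number of STFT frames for a signal of n_samples samples.

    center=True assumes the signal is padded by n_fft // 2 on both sides
    (a common convention); center=False counts only full windows.
    """
    if center:
        return 1 + n_samples // hop
    return 1 + (n_samples - n_fft) // hop


n_samples = 44100 * 10                      # 10-s clip at 44.1 kHz
frames_centered = num_frames(n_samples, hop=512, n_fft=2048)
frames_valid = num_frames(n_samples, hop=512, n_fft=2048, center=False)
feature_shape = (frames_centered, 64)       # time frames x Mel bands
```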

C. EXPERIMENTAL SETUP
In this study, the ML-LoBCoD-NET in the MRNN-Att network and in the MCRNN-Att network used three-layer convolutional sparse coding. The first layer had 64 filters, a kernel size of (4,8), and a stride of 2. The second and third layers each had 64 filters, a kernel size of (3,12), and a stride of 2. The CNN in the MCRNN-Att network used a three-layer convolutional neural network with 64 filters per layer; the kernel size of each convolution layer was (3,3), the stride was 1, and the kernel size of the pooling layer was (2,4).
The development dataset is divided into a training dataset and a test dataset. Eighty percent of the training dataset is used to train the model. The remaining 20 percent serves as the validation dataset, on which the F1 score of the trained model is measured; the validation results are used to adjust the model parameters and obtain an optimal trained model. To evaluate the performance of the trained model on the development dataset, the test dataset is fed into the model to obtain the predicted strong labels, and performance indicators such as the F1 score are computed from the predicted strong labels and the ground truth. The performance indicators on the test dataset represent the experimental results for the development dataset. Furthermore, the evaluation dataset is fed into the optimal trained model to obtain performance indicators, such as the F1 score, from the predicted strong labels and the ground truth. This training and test process is the same as that of the baseline system [33].
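To give a feel for how three layers with a step size (stride) of 2 shrink the input, the sketch below traces the feature-map shape through the ML-LoBCoD-NET layers. It assumes "same"-style padding, so that each output size is the input size divided by the stride and rounded up, and it assumes the stride applies to both the time and frequency axes. Neither detail is stated in the text, and the function names are hypothetical.

```python
import math

def same_padding_out(size, stride):
    """Output length of a strided convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)

def feature_map_shapes(input_shape, strides):
    """Trace (time, frequency) shapes through a stack of strided layers."""
    shapes = [input_shape]
    for stride in strides:
        t, f = shapes[-1]
        shapes.append((same_padding_out(t, stride), same_padding_out(f, stride)))
    return shapes

# 864x64 log Mel input, three layers with stride 2
shapes = feature_map_shapes((864, 64), [2, 2, 2])
# shapes == [(864, 64), (432, 32), (216, 16), (108, 8)]
```

Under these assumptions the time axis is reduced by a factor of 8, so each output frame covers eight input frames. This reduction comes from strided convolution rather than from a pooling layer.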
In the training phase, models were trained with different hyperparameters, and the model with the highest F1 score on the validation set was selected. The hyperparameter ranges were as follows. The learning rate was selected from 0.1, 0.01, and 0.001. The sampling frequency was selected from 44.1 kHz, 16 kHz, and 8 kHz. The number of Mel bands was selected from 128, 64, and 32. The window shift was selected from 512 and 256. The batch size was selected from 8, 16, and 32. The EMA decay was selected from 0.9, 0.99, and 0.999. After extensive experiments, the optimal values were a learning rate of 0.001, a sampling frequency of 44.1 kHz, 64 Mel bands, a window shift of 512, a batch size of 8, and an EMA decay of 0.99; the number of iterations was 100.
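The search space above can be written down explicitly. The sketch below enumerates every combination and picks the one with the highest validation F1. The dictionary keys and function names are hypothetical, and the scoring function stands in for a full training run. The text does not say whether the search was exhaustive, so the full grid is shown only to make the size of the space concrete.

```python
from itertools import product

SEARCH_SPACE = {
    "learning_rate": [0.1, 0.01, 0.001],
    "sample_rate_hz": [44100, 16000, 8000],
    "n_mels": [128, 64, 32],
    "window_shift": [512, 256],
    "batch_size": [8, 16, 32],
    "ema_decay": [0.9, 0.99, 0.999],
}

def all_configs(space):
    """Every combination of the listed hyperparameter values."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]

def select_best(configs, validation_f1):
    """Return the configuration with the highest validation F1 score."""
    return max(configs, key=validation_f1)

configs = all_configs(SEARCH_SPACE)  # 3 * 3 * 3 * 2 * 3 * 3 = 486 combinations
```

With 486 combinations, an exhaustive search would be expensive, which is one reason a staged or partial search is common in practice.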
We chose the error rate (ER) and the F1 score as evaluation metrics. ER is used as a secondary metric to assess errors in terms of insertions, deletions, and substitutions. The F1 score, which was used to evaluate the model, is defined for each class as follows:
$$F1_c = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}$$
where $TP_c$, $FP_c$, and $FN_c$ represent the numbers of true positives, false positives, and false negatives of sound event class $c$, respectively. The F1 score of the final model was calculated as the macro average
$$F1 = \frac{1}{C} \sum_{c=1}^{C} F1_c$$
where $C$ represents the number of sound event classes, which is 10. All ER and F1 calculations in this paper used the sed_eval toolkit provided in the competition [32]. The experiments were run on an Nvidia GeForce 1080 Ti GPU, and each experiment took about seven hours.
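As a small worked example of the macro-averaged F1 score, the sketch below uses the standard per-class definition F1 = 2TP / (2TP + FP + FN) and averages it over classes with equal weight. It is a hand-rolled illustration, not the sed_eval toolkit used in the paper. The function names and the counts are made up, and returning 0 for a class with no counts is an assumed convention.

```python
def class_f1(tp, fp, fn):
    """Per-class F1 from true positive, false positive, false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty classes

def macro_f1(counts):
    """Unweighted mean of per-class F1; counts maps class -> (tp, fp, fn)."""
    scores = [class_f1(*c) for c in counts.values()]
    return sum(scores) / len(scores)

# Made-up counts for two classes, for illustration only
example = {"dog": (5, 5, 5), "cat": (0, 3, 2)}
score = macro_f1(example)  # (0.5 + 0.0) / 2 = 0.25
```

Because every class contributes equally, a rare class that is detected poorly lowers the macro average as much as a frequent one. This is why the per-class results matter alongside the overall score.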

D. EXPERIMENTAL RESULTS AND ANALYSIS
The baseline system [33] provided by DCASE 2018 Task 4 takes the log Mel spectrogram of an audio clip as input and outputs the F1 score. The baseline system is a CRNN, which consists of a CNN and an RNN. The first-place system in DCASE 2018 Task 4 was a fusion model based on the mean teacher model [30]. For comparison with our proposed models, we use the single model of the first-place system, called GCRNN-Att-MT. The GCRNN-Att-MT is a mean teacher model whose student model is GCRNN-Att, which consists of a CNN, a Bi-GRU network, and an attention network. The CNN in the GCRNN-Att-MT model was a three-layer convolutional neural network with 64 filters per layer. For each filter, the kernel size was (3,3), the stride was 1, and the kernel size of the pooling layer was (2,4). The RNN was a 2-layer network with 64 filters per layer, and the batch size was 24. The number of iterations was 100. In this study, the proposed models were compared with the baseline system, the GCRNN-Att-MT model, and the GCRNN-Att network using the F1 score.
The F1 score and error rate (ER) of the baseline, MRNN-Att, MCRNN-Att, and GCRNN-Att models are given in Table 4. On the development set in Table 4, the F1 scores of the MRNN-Att model and the MCRNN-Att model were 2.18% and 1.95% higher, respectively, than that of the baseline system, which indicates that the attention network can improve sound event detection performance by focusing on relevant frames and ignoring irrelevant ones. The F1 score of the MRNN-Att model was 0.68% higher than that of the GCRNN-Att model, and the F1 score of the MCRNN-Att model was 0.45% higher than that of the GCRNN-Att model. These results indicate that the MRNN-Att model and the MCRNN-Att model were better than the GCRNN-Att model and that the features extracted by the ML-LoBCoD-NET were effective.
On the evaluation set in Table 4, the F1 score of the MCRNN-Att model was 0.43% higher than that of the baseline system and 0.27% higher than that of the GCRNN-Att model.
The F1 score and error rate of the different mean teacher systems are given in Table 5. On the development set in Table 5, the F1 score of the MRNN-Att-MT model was 8.77% higher than that of the baseline system, and the F1 score of the MCRNN-Att-MT model was 6.29% higher than that of the baseline system. These results show that the MRNN-Att-MT model and the MCRNN-Att-MT model detected sound events better than the baseline system on the development set. The F1 score of the MRNN-Att-MT model was 3.49% higher than that of the GCRNN-Att-MT model, and the F1 score of the MCRNN-Att-MT model was 1.01% higher than that of the GCRNN-Att-MT model. This indicates that MRNN-Att-MT and MCRNN-Att-MT performed better than GCRNN-Att-MT on the development set.
On the evaluation set in Table 5, the F1 score of the MRNN-Att-MT model was 4.88% higher than that of the baseline system, and the F1 score of the MCRNN-Att-MT model was 3.76% higher than that of the baseline system. This shows that the MRNN-Att-MT model and the MCRNN-Att-MT model performed better than the baseline system (CRNN). The F1 score of the MRNN-Att-MT model was 0.95% higher than that of the GCRNN-Att-MT model. The comparison of F1 scores between the MCRNN-Att-MT model and the GCRNN-Att-MT model indicates that the ML-LoBCoD-NET performed better at feature extraction than the CNN.
The competition ranking is based on the F1 score, with ER as a reference index. The ER values in our experiments are on par with the baseline in Table 4 and Table 5. We note two differences between the proposed model and the other two leading methods. First, the number of parameters in the proposed approach does not grow with the depth of the model. Second, sound event detection methods based on traditional deep learning almost always employ batch normalization, which is known to improve the performance and convergence rate of the trained model. As our method relies only on the CSC prior, we did not include batch normalization operators.
The F1 score and error rate (ER) of the proposed models with and without the mean teacher model are given in Table 6. On the development dataset, the F1 score of the MRNN-Att-MT model was 6.59% higher than that of the MRNN-Att model, and the F1 score of the MCRNN-Att-MT model was 4.34% higher than that of the MCRNN-Att model. On the evaluation dataset, the F1 score of the MRNN-Att-MT model was 4.9% higher than that of the MRNN-Att model, and the F1 score of the MCRNN-Att-MT model was 3.33% higher than that of the MCRNN-Att model. These results indicate that the mean teacher model improves sound event detection, raising the F1 score on both the development set and the evaluation set. The error rate of MRNN-Att-MT was 0.95% lower than that of MRNN-Att, and the error rate of MCRNN-Att-MT was 0.82% lower than that of MCRNN-Att, which indicates that the proposed models with the mean teacher model are better than those without it.
The F1 scores for the ten classes of audio events using the four systems without the mean teacher model are given in Table 7. The F1 score of the MRNN-Att model was significantly better than that of the baseline system for the "cat", "dog", and "running water" classes, and significantly better than that of the GCRNN-Att model for the "blender", "cat", and "dog" classes. The F1 score of the MCRNN-Att model was significantly better than that of the baseline system for the "ringing", "dog", and "running water" classes, and significantly better than that of the GCRNN-Att model for the "blender", "electric shaver", and "vacuum cleaner" classes.
The F1 scores for the ten classes of audio events using the four systems with the mean teacher model are given in Table 8. The F1 score of the MRNN-Att-MT model was significantly better than that of the baseline system for the "ringing", "cat", and "dog" classes, and significantly better than that of the GCRNN-Att-MT model for the same three classes. The F1 score of the MCRNN-Att-MT model was significantly better than that of the baseline system for the "ringing" and "dog" classes, and significantly better than that of the GCRNN-Att-MT model for the "dishes", "dog", and "speech" classes.

VII. CONCLUSION
This paper proposes the MRNN-Att network for the weakly labeled sound event detection task. The CNN pooling operation has the disadvantage of losing the location information of the target object. We do not use the pooling operation; we retain the ReLU and convolution operations and use the strong constraints of the ML-CSC model. The MRNN-Att network is based on the ML-LoBCoD-NET, which is driven by the ML-LoBCoD algorithm. The ML-LoBCoD-NET shows a feature extraction ability different from that of the CNN for the weakly supervised sound event detection task.
Furthermore, two mean teacher models, the MRNN-Att-MT model and the MCRNN-Att-MT model, are proposed to solve the semi-supervised sound event detection problem. The MRNN-Att network and the MCRNN-Att network are selected as the student models of the MRNN-Att-MT model and the MCRNN-Att-MT model, respectively.
The proposed models were tested on the DCASE 2018 Task 4 dataset. The experimental results showed that the F1 scores of the proposed MRNN-Att-MT model and MCRNN-Att-MT model were superior to those of the baseline system and the GCRNN-Att network for sound event detection. Furthermore, the F1 score of the MRNN-Att-MT model was superior to that of the GCRNN-Att-MT model. The F1 scores of the MRNN-Att model and the MCRNN-Att model were superior to that of the baseline system, which shows that adding an attention network can improve the performance of sound event detection. The MRNN-Att model and the MCRNN-Att model also performed better than the GCRNN-Att model. These results indicate that the ML-LoBCoD-NET shows a feature extraction ability different from that of the CNN for the sound event detection task, and that the proposed MRNN-Att network can be used for sound event detection and is superior to the baseline system.
There is still considerable room for improvement in the MRNN-Att network, for example by adding different attention networks or data augmentation methods. The MRNN-Att network can also be used for acoustic scene classification and audio tagging.
JING XIA was born in 1993. She is currently pursuing the master's degree in communication engineering. She studied at Yanshan University, Qinhuangdao, Hebei. Her research interests include sensor signal processing and audio classification. She completed the main thesis writing, revision, and experimental work.
QIAN YANG was born in 1994. She is currently pursuing the master's degree in communication engineering. She studied at Yanshan University, Qinhuangdao, Hebei. Her research interest is in sensor signal processing. She mainly wrote the mean teacher model section of the paper and carried out its subsequent revision.
YUZHEN ZHANG was born in 1994. She is currently pursuing the master's degree in communication engineering. She studied at Yanshan University, Qinhuangdao, Hebei. Her research interest is in sensor signal processing. She contributed to the first version of the ML-LoBCoD section.