Weakly Supervised Sound Event Detection Based on Puzzle CAM

Sound event detection methods based on deep learning have achieved excellent performance. However, training a high-performance sound event detection model depends on a high-quality strongly labeled dataset, which makes dataset annotation costly and laborious. In this paper, we propose a weakly supervised sound event detection method based on the Puzzle Class Activation Map (CAM), which aims to obtain the timestamp information of sound events from a sound event classification network trained with weak labels only. Specifically, the method discovers more complete regions of sound events by minimizing the difference between the CAM of the original features and the merged CAM of the separated feature blocks. The CAM highlights which features along the time dimension the model attends to during decision making. We determine the occurrence time of a sound event from the positions of the frames in which these features are located, and use the model's classification predictions to guide the CAM output, thereby achieving weakly supervised sound event detection. Experiments on the Domestic Environment Sound Event Detection (DESED) dataset demonstrate that the proposed method outperforms the baseline system on the sound event detection task. The CAM of a single system achieves a best PSDS2 value of 0.732; with an ensemble of CAMs from multiple systems, the PSDS2 value improves to 0.751. Both values are higher than those of the baseline system.


I. INTRODUCTION
Sound event detection (SED) is a challenging task whose goal is to detect the categories of the multiple events that may appear in an audio clip and to determine the time boundaries of these events [1], [2], [3], [4]. It can be used in smart homes, autonomous driving, and smart security, among other applications. Since the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge was first held, it has attracted an increasing number of researchers, and many SED methods have been proposed based on DCASE. The baseline system for SED released by DCASE Task 4 provides a valuable reference for contestants and researchers [1], [2]. The semi-supervised learning methods in [5] and [6] achieved strong results in DCASE Task 4.
In recent years, deep-learning-based neural network approaches have achieved significant performance in the field of SED [7], [8]. Reference [9] demonstrated the effectiveness of using Convolutional Neural Networks (CNN) for SED. In addition, Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) models have been utilized for time-series modeling in SED [10]. Cakır et al. [11] applied a Convolutional Recurrent Neural Network (CRNN), which combines a CNN and an RNN, to SED. CRNN uses the powerful feature extraction capability of the CNN to capture deep, high-dimensional audio features, and uses bidirectional RNNs to establish long-term dependencies among sound events. It has achieved excellent performance on SED tasks and has become the most widely used architecture. In the field of SED, the log-mel spectrogram is the primary feature: its two-dimensional representation works well with convolution operations. Although CNNs were proposed for images, log-mel spectrograms differ from images. The information in an image is shift-invariant along both dimensions, whereas a spectrogram is not shift-invariant along the frequency axis; traditional 2D convolutions nevertheless enforce translational equivariance on sound events in both the time and frequency domains. To address this problem, Nam et al. [12] proposed frequency dynamic convolution, which achieved superior performance in detecting non-stationary sound events. In addition, kernel attention mechanisms [13] have been employed in the SED task to enhance the feature extraction capability of CNNs. With the great success of the Transformer [14], [15] in natural language processing (NLP) and Automatic Speech Recognition (ASR), the self-attention mechanism has also garnered attention in SED. Compared to RNNs, self-attention can capture global context information at lower computational cost. However, the self-attention mechanism of the Transformer does not necessarily outperform the existing CRNN on SED [16], [17], [18]. Moreover, the Conformer [19], which achieved the best performance in ASR, also failed to demonstrate stable performance in SED [20], [21]. Recently, large-scale audio pre-training models such as PANNs [22], AST [23], and BEATs [24] have achieved great success in the SED task. References [25] and [26] utilized a pre-trained model to extract audio embeddings and fused them with CRNN features, yielding a significant improvement in SED performance.
However, most existing SED methods are based on semi-supervised learning, which relies on large datasets with strong labels and requires costly human annotation. To alleviate the pressure of building strongly labeled datasets, weakly supervised learning has emerged as a crucial research direction. Inspired by weakly supervised semantic segmentation [27], this paper explores a weakly supervised algorithm for sound event detection using the Class Activation Map (CAM). The goal is to obtain the temporal information of sound events from a network model trained with weak labels. The CAM technique was originally proposed in the field of computer vision. It visualizes the internal representations of convolutional neural networks and is widely used in weakly supervised object localization and weakly supervised semantic segmentation. The convolutional units of a convolutional neural network have a remarkable ability to localize targets. To maintain this localization ability, the earliest CAM generation method replaced the fully connected layer in the network with global average pooling [28]. However, this method requires modifying the network structure and retraining, which greatly limits its application. Gradient-based CAM generation methods [29], [30] then emerged; they use the gradient information flowing into the last convolutional layer of the CNN to assign weights to each neuron without any modification to the network architecture. Mask-based CAM generation techniques [31], [32] obtain the weight of each activation map by calculating the forward-propagation score on the target class; this approach addresses the false-confidence problem of gradient-based CAM generation methods. These techniques can efficiently extract CAMs from the network. However, since classification networks tend to capture only the most salient features of objects, CAMs are often sparse and coarse. As a result, they cannot be directly used as pseudo-labels for semantic segmentation. Researchers have proposed a range of approaches to address this problem. So et al. [27] divided the image into blocks, which forced the model to focus on the features of each image block; this generated a more comprehensive CAM that covers the object. Zhang et al. [33] complementarily hid input image information and summed the CAMs of a pair of input images with complementary hidden parts, which activated more foreground object regions. References [34] and [35] drove a classifier to discover new and complementary object regions by erasing its discovered regions from the feature maps. Xie et al. [36] utilized class-agnostic activation maps of unlabeled image data for contrastive learning, disentangling the foreground and background with a class-agnostic activation map and a novel contrastive loss; this improves the CAMs generated by classification networks for semantic segmentation. Chen et al. [37] clustered all the local features of an object class, calculated the cosine similarity between each cluster center and the local features at each location, derived similarity matrices, and aggregated them into a CAM that fully covers the object.
CAM technology has been extensively researched and applied in computer vision, but its application in audio detection and classification is still relatively limited. In the SED task, the log-mel spectrogram is an excellent acoustic feature. It appears as a two-dimensional matrix in the time-frequency domain, resembling the pixel matrix of a two-dimensional image, and the sound events occurring in the log-mel spectrogram can be regarded as objects to be localized in an image. Therefore, weakly supervised learning methods based on CAM have the potential to be applied to sound event detection. This paper explores how well CAM's object localization ability performs in sound event detection. Since CAM-based weakly supervised object localization heavily relies on the classification performance of the network, we conduct our research on weakly supervised sound event detection with the pretrained sound event classification model CNN14.
The main contributions of this work are as follows:
• We propose a Puzzle CAM method for sound event detection. The method divides the audio features into multiple blocks and forces the model to establish consistency between the merged CAM of the separated feature blocks and the CAM of the original features, thereby activating a more complete sound event activity region.
• We successfully apply CAM to the field of sound event detection. Through noise removal and CAM stitching, we can accurately locate the active regions of the multiple sound events that may occur in an audio clip.
• The performance of the proposed method is assessed on the DESED dataset and compared with three recent methods. The results demonstrate the superiority of the proposed method on the SED task.

A. ARCHITECTURE
We use the pretrained CNN14 from Pretrained Audio Neural Networks (PANNs) [22] as the base model. CNN14 is a network trained on the AudioSet dataset to classify 527 kinds of sound events. It contains six convolutional modules, and each convolutional block consists of two convolutional layers. The convolution kernel size is 3 × 3, and Batch Normalization (BN) and Rectified Linear Unit (ReLU) activation are applied after each convolutional layer. Average pooling of size 2 × 2 is applied to each convolutional block, and global pooling is applied after the last convolutional layer. This work targets ten categories of sound events in the domestic environment, so we adjust the CNN14 network structure by replacing the CNN14 fully connected (FC) layer and mapping the output dimension to 10. The network structure is shown in TABLE 1. The number after the "@" symbol indicates the number of feature maps; "Pooling 2 × 2" is average pooling with a size of 2 × 2; and "×2" after "()" indicates that the structure has two layers, each performing the operations inside "()".
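As an illustration, a minimal PyTorch sketch of this head replacement, assuming the PANNs reference implementation of CNN14, where the classifier layer is named `fc_audioset` and the penultimate embedding has 2048 dimensions:

```python
import torch.nn as nn
from models import Cnn14  # PANNs reference implementation (assumed available)

# Load CNN14 configured as in PANNs, pretrained on AudioSet (527 classes).
model = Cnn14(sample_rate=32000, window_size=1024, hop_size=320,
              mel_bins=64, fmin=50, fmax=14000, classes_num=527)

# Replace the 527-way classifier with a 10-way head for the ten DESED
# domestic sound event classes; the rest of the network is kept as-is.
model.fc_audioset = nn.Linear(2048, 10, bias=True)
```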

B. PUZZLE CAM
1) MOTIVATION
Since classification networks tend to capture the most salient features of objects, a conventional CAM highlights the most representative regions of each class. Therefore, when generating a CAM for a given class from the mel-spectrogram of an audio clip, the model focuses on only a part of the object to find the key features of the class: it mainly attends to the most discriminative time frames of the sound event and ignores the information in other frames, so the generated CAM is relatively sparse and coarse. Inspired by [27], we conjecture that the merged CAM of mel feature blocks can cover the object region more completely than the CAM of a single mel feature map. We therefore propose the Puzzle CAM method for sound event detection, which minimizes the difference between the CAM of the original audio features and the merged CAM of the separated feature blocks. The method consists of a puzzle module and two regularization functions to discover more complete regions of audio events. The overall architecture is shown in Fig. 1.

2) PUZZLE MODULE
Unlike images, the width and height of an audio spectrogram represent different information (i.e., time and frequency). In order to better locate the time at which a sound event occurs, we divide the input feature I of size Frame × Freq into n non-overlapping feature blocks {I_1, I_2, ..., I_n} along the time-frame axis, each of size (Frame/n) × Freq. Each feature block I_i is then input into the model, which outputs a feature map A_i at the last convolutional layer, and all A_i are merged into a single CAM A*.
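A minimal PyTorch sketch of this split-and-merge step; `forward_features` is a hypothetical helper that returns the model's last convolutional feature map:

```python
import torch

def puzzle_forward(model, x, n=10):
    """Split a log-mel input along time, encode each block, and re-tile.

    x: (batch, 1, frames, mel_bins) log-mel features.
    n: number of time blocks (the paper divides the feature into 10).
    Returns the merged feature map A* of the n separated blocks.
    """
    blocks = torch.chunk(x, n, dim=2)                    # n blocks of (frames/n) frames each
    feats = [model.forward_features(b) for b in blocks]  # per-block feature maps A_i
    return torch.cat(feats, dim=2)                       # merge along time -> A*
```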
Following [27], the consistency loss between the CAM A of the original features and the merged CAM A* of the separated feature blocks is defined as their l1 distance:

L_con = ||A − A*||_1 (3)

The final loss is defined as:

L = L_class + L_pl−class + w·L_con (4)

where L_class is the classification loss computed from the original features, L_pl−class is the classification loss computed from the merged feature blocks, and w is the weight of the CAM consistency loss.
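A sketch of the combined objective under these definitions, assuming clip-level multi-hot weak labels and binary cross-entropy for both classification terms (the loss types for L_class and L_pl−class are assumptions; the consistency term follows Eq. (3)):

```python
import torch
import torch.nn.functional as F

def puzzle_cam_loss(logits, pl_logits, cam, merged_cam, targets, w=1.0):
    """Total loss of Eq. (4): L_class + L_pl-class + w * L_con.

    logits:     clip-level predictions from the original features
    pl_logits:  clip-level predictions from the merged puzzle branch
    cam:        CAM A of the original features
    merged_cam: merged CAM A* of the separated feature blocks
    targets:    multi-hot weak labels, shape (batch, classes)
    """
    l_class = F.binary_cross_entropy_with_logits(logits, targets)
    l_pl_class = F.binary_cross_entropy_with_logits(pl_logits, targets)
    l_con = torch.mean(torch.abs(cam - merged_cam))  # l1 consistency, Eq. (3)
    return l_class + l_pl_class + w * l_con
```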

C. GENERATE CAM
During the test stage, the CAM of the model is extracted for sound event detection. We calculate the class activation map at the last convolutional layer of the model, aggregate the information along the frequency dimension, and normalize it to obtain a CAM for a given class. Each sound event category corresponds to one CAM, which shows which frames of the feature map the model pays more attention to when judging this category. We determine the region of sound event activity from the locations of the frames in which the model captures features. In addition, since sound event detection is a multi-label detection task, it is necessary to generate CAMs for all categories and stitch them together (Figs. 2–3 show the CAMs of the ten sound event categories for an audio clip in the test set). Compared with sound event detection, sound event classification is a relatively simple task, and CNN14 can achieve high-precision classification. Therefore, we expect the model's classification results to further improve the CAM-based sound event detection performance [38]. Specifically, we use the model's classification results to guide the CAM output: according to the model's classification predictions, the CAMs of categories with probabilities below a threshold are removed as noise. Let the prediction vector of the model be P, the CAM of the i-th class be C_i, and L_cam the final CAM used for sound event detection. The calculation is defined as:

C_i' = C_i, if P_i ≥ θ;  C_i' = 0, if P_i < θ

L_cam = [C_1', C_2', ..., C_N']

where i denotes the i-th class, P_i is the model's predicted probability for the i-th class, θ is the threshold (0 < θ < 1), and the CAM C_i of the i-th class is generated using Grad-CAM++ or similar techniques [29], [30], [31], [32].
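A compact sketch of this noise-removal and stitching step; the mean over the frequency dimension and the min-max normalization are assumptions consistent with the description above:

```python
import torch

def build_event_cam(cams, probs, theta=0.5):
    """Stitch per-class CAMs, keeping only classes the classifier accepts.

    cams:  (classes, frames, freq_bins) class activation maps C_i
    probs: (classes,) clip-level classification probabilities P
    """
    t = cams.mean(dim=2)  # aggregate the frequency dimension -> (classes, frames)
    # Min-max normalize each class's time curve to [0, 1].
    t_min, t_max = t.amin(1, keepdim=True), t.amax(1, keepdim=True)
    t = (t - t_min) / (t_max - t_min + 1e-8)
    mask = (probs >= theta).float().unsqueeze(1)  # remove classes with P_i < theta as noise
    return t * mask  # stitched multi-class CAM L_cam
```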

A. IMPLEMENTATION DETAILS
1) DATASETS
The experiments are primarily based on the Domestic Environment Sound Event Detection (DESED) dataset, which has been used since DCASE 2020 Task 4. DESED is composed of 10-second audio clips recorded in domestic environments (taken from AudioSet) or synthesized using Scaper (a library for soundscape synthesis and augmentation) to simulate a domestic environment. We use subsets of the DESED dataset to evaluate the method proposed in this paper [39]. Specifically, we use the weakly labeled set (1578 clips) and the validation set (1168 clips). The details of the dataset are shown in Table 2. In addition, we copied the validation set and made weak annotations on it, so that the copied validation set's labels contain only the category information of the sound events, without their timestamp information.
The copied validation set (no timestamps) is used to evaluate the classification performance of the model. The validation set (with timestamps) is used to evaluate the sound event detection performance of CAM.

2) EVALUATION METRICS
Segment-based F1-scores on 1-second segments are used to evaluate the classification performance of the model [40]. Segment-based F1 is computed on a fixed time grid, using one-second segments to compare the ground truth and the system output. Each target class and prediction is marked as active over an entire segment if it is active in at least one frame of the segment. Precision (P) and recall (R) are calculated as

P = TP / (TP + FP),  R = TP / (TP + FN)

and the F1-score is computed from P and R:

F1 = 2PR / (P + R)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative samples, respectively.
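A minimal sketch of these metric formulas from segment-level counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P, R, and F1 from segment-based TP/FP/FN counts."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1
```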
The sound event detection performance of CAM is evaluated with polyphonic sound event detection scores (PSDS) [41] computed over the real recordings in the test set. This metric is based on the intersection between events: PSDS is defined as the normalized area under the receiver operating characteristic (ROC) curve,

PSDS = (1 / e_max) ∫₀^{e_max} r(e) de

where e_max is the maximum eFPR value, r(e) is the ROC curve, and eFPR refers to the effective false positive rate. More detailed and visualized explanations of PSDS can be found in [41] and [42]. In order to test the SED system in different scenarios, we set two different PSDS parameterizations. In Scenario 1, the system needs to react quickly to events, so the localization of sound events is paramount. In Scenario 2, the system must avoid confusion between classes, but the requirements on event reaction time are less strict than in Scenario 1.
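Under the definition above, the normalized area can be approximated numerically; a sketch using trapezoidal integration over a sampled ROC curve:

```python
import numpy as np

def psds_from_roc(efpr, etpr, e_max=100.0):
    """Approximate PSDS as the normalized area under r(e) up to e_max.

    efpr: increasing array of effective FP rates e
    etpr: corresponding ROC values r(e)
    """
    keep = efpr <= e_max
    return np.trapz(etpr[keep], efpr[keep]) / e_max
```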

3) TRAINING SETTINGS
We choose log-mel spectrograms as input features (1000 × 64). The training batch size is set to 24; the shallow layers (the first three convolutional blocks) are frozen, the learning rate of the deeper layers is set to 0.0001, and the learning rate of the last fully connected layer is set to 0.001 (a minimal configuration sketch is given at the end of this subsection). In the puzzle module, the log-mel feature is divided into 10 blocks, and the consistency loss weight w is set to 1. We employ the data augmentation methods SpecAugment [43] and FilterAug [44]. The model is implemented in the PyTorch framework and trained on an NVIDIA RTX 3060 (12 GB) GPU for 10 epochs. During inference, the puzzle module is not used, and the CAM is extracted directly from the trained model as the sound event detection result.
The DCASE2023 Task 4A baseline system is based on a mean-teacher CRNN model, which combines a CNN and an RNN followed by an attention layer. The output of the RNN gives frame-level predictions, while the output of the attention layer gives clip-level predictions. See [1] and [2] for more details on the baseline system. FDY-CRNN is proposed by Nam et al. [12]. Frequency dynamic convolution uses a frequency-adaptive kernel to enhance the frequency dependence of 2D convolutions, thereby improving the physical consistency of SED models with the time-frequency patterns of sound events. FDY achieves excellent performance on the DESED dataset and is especially helpful for detecting non-stationary sound events.
SKA-RCRNN is proposed by Kim et al. [13]. This method uses mean-teacher semi-supervised learning to address the problem of insufficient strongly labeled data. The SKA-RCRNN model replaces the residual convolutional blocks in RCRNN with Selective Kernel Attention (SKA) blocks. This gives SKA-RCRNN an adaptive receptive field that can effectively capture the features of different types of sound events, and SKA has consequently become widely used in SED.
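Returning to the training settings above, a minimal PyTorch sketch of the layer freezing and per-group learning rates; the module names `shallow_blocks`, `deep_blocks`, and `fc` are hypothetical, and the optimizer choice is an assumption:

```python
import torch

# Freeze the first three convolutional blocks of the pretrained CNN14.
for p in model.shallow_blocks.parameters():
    p.requires_grad = False

# Deeper convolutional blocks train at 1e-4; the new 10-way FC head at 1e-3.
optimizer = torch.optim.Adam([
    {"params": model.deep_blocks.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
```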

2) COMPARISON WITH EXISTING METHODS
The comparison of PSDS values between the proposed method and existing methods is shown in Table 3. The CAM generated from the best model based on our proposed method achieves a PSDS2 of 0.732 on the test set, which is 0.17 higher than the DCASE2023 Task 4A baseline system. We also ensemble the CAMs (by averaging) generated from multiple optimal models trained under different hyperparameter settings; specifically, we take the mean of the CAMs generated by the multiple optimal models as the final CAM. Combining the CAMs of multiple optimal models by averaging fully exploits their complementarity, thereby reducing the bias and variance of the whole system and improving overall SED performance. The ensemble CAM achieves the highest PSDS2 value of 0.751, an improvement of 0.189 over the DCASE2023 Task 4A baseline. Compared to Nam et al. [12] and Kim et al. [13], our proposed method also shows a remarkable improvement in PSDS Scenario 2. The PSDS2 value reflects that CAM effectively avoids confusion between sound event categories and has excellent event discrimination ability. The PSDS1 value is lower than those of the existing methods, which indicates that CAM is poor at responding quickly to sound events: although the timestamp information of sound events is obtained through the object localization capability of CAM, the event boundary information provided by CAM is not accurate enough. In addition, Puzzle CAM performs better than the original CAM (applying CAM to the entire audio without splitting and merging).
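A minimal sketch of the CAM averaging used for the ensemble:

```python
import torch

def ensemble_cam(cam_list):
    """Average CAMs of shape (classes, frames) from several trained models."""
    return torch.stack(cam_list, dim=0).mean(dim=0)
```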
From the results in Table 3, it can be seen that CAM's ability to precisely locate sound event boundaries is weak, but the timestamp information provided by CAM performs considerably well in detecting the time regions of sound event activity (Fig. 5).
It is worth noting that the existing methods shown above are based on semi-supervised learning and require strongly labeled datasets for training, while the method proposed in this paper is weakly supervised sound event detection: the model is never given timestamp annotations during training. The sound event timestamp information provided by CAM, even if not completely accurate, is therefore invaluable for weakly supervised learning.

3) ABLATION EXPERIMENTS
To study the impact of the classification losses and the consistency loss on model training, we combine the loss functions of Puzzle CAM in different ways. Table 4 shows the best results of these experiments. Sb-F1 refers to segment-based macro F1, which measures the sound event classification performance of the model. On top of the classification loss L_class, adding only the classification loss L_pl−class does not produce significant differences. When only the consistency loss L_con is added, PSDS1 improves significantly and CAM's ability to locate certain types of sound events improves, but the PSDS2 and Sb-F1 values decrease, i.e., CAM's event discrimination ability and the model's classification performance degrade. When the consistency loss and the two classification losses are combined, CAM achieves the best sound event detection ability without loss of classification performance.
We also explored the sound event detection capability of CAM under different CAM generation techniques, and evaluated the impact of different thresholds θ on the sound event detection performance of CAM. The threshold θ was set from 0.2 to 0.9 in the experiments, and the results are shown in Table 5. A lower threshold means that the model's classification result provides weaker guidance to the CAM and more of the CAM's original information is preserved, whereas a higher threshold makes the CAM output more conservative. Under the different CAM generation techniques (Grad-CAM, Grad-CAM++, and Group-CAM), there is no significant difference in the PSDS values, and the best PSDS2 values are all higher than that of the baseline system.

IV. CONCLUSION
This paper proposes a weakly supervised sound event detection algorithm based on Puzzle CAM, which uses the object localization ability of CAM to obtain the timestamp information of sound events from a sound event classification network trained with weak labels. The Puzzle CAM approach covers more complete regions of sound events by employing a consistency loss that matches the merged CAM of the separated feature blocks to the CAM of the original audio features. In addition, we clean the noise in the CAM according to the classification decisions of the model. Experiments on the DESED dataset show that the proposed Puzzle CAM method outperforms the existing methods in Scenario 2. Further research could address CAM's coarse localization of sound event activity boundaries.
XICHANG CAI received the Ph.D. degree in mechanical and electronic engineering from the Chinese Academy of Sciences, in 2009. He is currently a Senior Engineer with the School of Information Science and Technology, North China University of Technology, Beijing. His research interests include weak signal processing, machine learning, and embedded AI systems.
YANGGANG GAN is currently pursuing the master's degree with the School of Information, North China University of Technology. His research interests include speech signal processing, sound event detection, object localization, and deep learning.
MENGLONG WU received the Ph.D. degree in communications and information systems from the Beijing University of Posts and Telecommunications of China, in 2015. He is currently an Associate Professor with the School of Information Science and Technology, North China University of Technology, Beijing. His research interests include wireless communication, signal processing, machine learning, and neural networks.
JUAN WU is currently pursuing the master's degree with the School of Information, North China University of Technology. Her research interests include speech signal processing, sound event detection, and deep learning.