Sound Event Detection for Human Safety and Security in Noisy Environments

The objective of a sound event detector is to recognize anomalies in an audio clip and return their onset and offset times. However, detecting sound events in noisy environments is a challenging task, since several sound sources co-exist in a real audio signal. Moreover, the characteristics of polyphonic audio recordings differ from those of isolated recordings. It is also necessary to consider the presence of noise (e.g., thermal and environmental). In this contribution, we present a sound anomaly detection system based on a fully convolutional network which exploits image spatial filtering and an Atrous Spatial Pyramid Pooling module. To cope with the lack of datasets specifically designed for sound event detection, a dataset for the specific application of noisy bus environments has been designed. The dataset has been obtained by mixing background audio files, recorded in a real environment, with anomalous events extracted from monophonic collections of labelled audio. The performance of the proposed system has been evaluated through segment-based metrics such as error rate, recall, and F1-score. Moreover, robustness and precision have been evaluated through four different tests. The analysis of the results shows that the proposed sound event detector outperforms both state-of-the-art methods and general-purpose deep learning solutions.


I. INTRODUCTION
Hearing is a mechanical sense that converts physical movements, i.e. sounds, into nerve impulses by detecting pressure variations of the surrounding medium [1]. This process is performed by the human ear, which acts as transducer and amplifier for the human brain. Hence, an audio signal conveys relevant information on the surrounding environment. For this reason, many efforts have been devoted to audio pattern recognition [2], [3], [4], [5], [6], [7]. Several applications of audio pattern recognition are in the audio security domain. This includes tasks such as audio classification and tagging [8], [9], [10], [11], [12], [13], [14], sentiment analysis [15], [16], audio source separation [17], [18], [19], and Anomalous Sound Event Detection (SED).
In more detail, differently from monophonic audio classification, the objective of a SED system is to identify both the type of event and the exact time of its onset and offset [24]. However, recent state-of-the-art models trained on AudioSet [8] have been shown to be unsuitable for human-security and safety-oriented applications [24]. In fact, SED systems generally provide accurate start and end time identification of acoustic events, while their recall, that is the fraction of relevant elements that are successfully retrieved, may be unsatisfactory. Moreover, the lack of large annotated audio datasets has a significant impact on the performance of SED models, leading to weak generalization capabilities of deep learning-based systems.
To address these problems, a new lightweight SED approach, AuSPP, which aims to detect audio events that potentially denote the presence of circumstances threatening public safety and security (e.g., broken glass, gunshots, or shouting), has been proposed. AuSPP exploits state-of-the-art spatial filters on Mel-Spectrograms and introduces the Atrous Spatial Pyramid Pooling (ASPP) module [25] for the first time in a SED system. In addition, in the proposed approach, the number of learnable parameters depends on the number of anomaly classes and on the time resolution. Hence, with respect to state-of-the-art SED approaches, it is possible to customize the behaviour of AuSPP, drastically reducing its complexity while detecting more sound events than state-of-the-art models.
The proposed model has been tested in a public transportation vehicle. The motivation of the choice is twofold. First, modern buses are equipped with on-board sensors able to collect heterogeneous data that can be employed for maintenance, i.e., the Automatic Vehicle Monitoring (AVM) paradigm [26]. Second, the same raw audio information can be exploited for granting passenger security and safety in a noisy urban environment.
In this context, an annotated audio SED dataset, Sound Event Detection Dataset On Bus (SEDDOB), specifically designed for the bus environment, has been devised. In more detail, labelled audios with onset and offset time of different types of audio events are provided. To the best of our knowledge, this is the first contribution of an audio dataset for human safety in the public transport system.
To summarize, the contributions of the work are:
• the definition of a new augmented spectrogram that exploits spatial filters to enhance time-frequency patterns by means of the ASPP module;
• the design of an end-to-end SED model that is more lightweight than state-of-the-art approaches. Moreover, the number of parameters depends on the time resolution and on the number of classes, thus being customizable;
• the generation of a new SED dataset for the bus environment, obtained by synthesizing real background recordings with anomalous events from state-of-the-art monophonic datasets.
This work is organized as follows: Section II reviews state-of-the-art approaches for sound feature extraction and classification, deep learning models, and audio datasets; Section III contains the description of the proposed SED model; Section IV details the synthetic dataset called SEDDOB and its statistics; Section V describes the experimental results of general-purpose deep learning methods, state-of-the-art SED models, and the proposed solution. Finally, in Section VI the conclusions are drawn.

II. RELATED WORK

A. DEEP LEARNING MODELS
One of the first approaches to SED was the use of Hidden Markov Models (HMMs) combined with Gaussian Mixture Models (GMMs), with Mel-Frequency Cepstral Coefficients (MFCCs) as features [27], [28]. In these approaches, each anomaly class is modeled by an HMM and the maximum-likelihood state sequence is obtained by exploiting the Viterbi algorithm. Nonetheless, together with Non-negative Matrix Factorization (NMF) approaches [29], HMM-based methods show limited performance and can hardly generalize.
Deep learning-based SED techniques have reached good performance. Ding et al. [30] proposed a Convolutional Recurrent Neural Network (CRNN) able to identify short-term dependencies of audio patterns by applying a multi-scale detection method. With a similar approach, Chatterjee et al. [31] introduced a Transposed Convolutional Recurrent Neural Network (TCRNN) that incorporates Mel Instantaneous Frequency spectrogram (mel-IFgram) features. However, recurrent layers in neural networks are subject to the vanishing gradient problem, which limits their generalization capabilities. Moreover, the training process is slower since the data has to be fed sequentially and parallel computing cannot be exploited.

B. SED DATASETS
From monophonic collections of labelled audio, i.e., audio classification databases such as UrbanSound8k [9], FSD50K [36], ESC50 [10], AudioSet [8], and FreeSound [37], [38], the research community has started to generate SED datasets by synthesizing normal background recordings with audio anomalies. Turpault et al. [39] proposed Domestic Environment Sound Event Detection (DESED), which contains precisely labelled audio (with precise onset and offset times) with the aim of recognizing sound event classes in domestic environments. In more detail, the authors injected recorded anomalies onto background soundscapes of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. The process is performed by using the open source software Scaper [40]. From UrbanSound8k [9], in [40] the authors proposed URBAN-SED, which contains 10,000 soundscapes with sound event annotations for urban sound monitoring use-cases. With the same technique, the USM-SED [41] dataset, based on sounds taken from the FSD50k dataset and consisting of 20,000 polyphonic soundscapes, has been built.

III. PROPOSED MODEL
One of the peculiarities of the proposed model, AuSPP, is the exploitation of a higher-dimensional input than state-of-the-art Mel-Spectrogram based methods. This choice allows the extraction of a larger number of semantic audio features. More specifically, the information provided by the concatenation of spatial derivatives of the Mel-Spectrogram helps the model to learn frequency patterns for jointly classifying the audio events and their corresponding onset and offset times.
AuSPP can be partitioned into three main blocks: a pre-processing stage, the ASPP, and a Fully Convolutional Network (FCN). The first stage is responsible for extracting a time-frequency audio representation, applying spatial filters to the input audio spectrum, and arranging the output into an augmented Mel-Spectrogram. Subsequently, the ASPP module is introduced to combine the output of dilated convolutional filters. Finally, a FCN processes the ASPP output to obtain the predicted activity heatmap. The overall architecture is depicted in Figure 1.

A. PRE-PROCESSING
The input audio signal x, sampled at rate f_s, is first transformed by means of the Short-Time Fourier Transform (STFT) into a spectrogram with M time frames and a frequency axis that extends up to the Nyquist frequency f_s/2. The Mel-Spectrogram X is computed by means of the Mel-Filterbank H_mel(·) applied to the squared magnitude of the STFT:

$$X = H_{mel}\left(\left|\mathrm{STFT}\{x\}\right|^2\right)$$

In more detail, the Mel-Filterbank groups the spectral values of each time frame into n_mels logarithmic bins in order to model human sound perception. Hence, the Mel-Spectrogram X has size M × n_mels.
To enhance the audio patterns of the Mel-Spectrogram, Sobel [42] and Laplacian [43] operators have been employed. Let H_u and H_v be the horizontal and vertical first-order spatial derivative kernels, respectively. Furthermore, let H_l be the second-order spatial derivative kernel. The first Mel-Spectrogram spatial derivative is calculated by convolution:

$$X_1 = X \ast H_u + X \ast H_v$$

Similarly, the second spatial derivative is obtained as:

$$X_2 = X \ast H_l$$

Finally, the Mel-Spectrogram is concatenated along the frequency axis with its derivatives into an augmented Mel-Spectrogram X̃ of size M × 3n_mels:

$$\tilde{X} = X \bullet X_1 \bullet X_2$$

where • denotes the concatenation function. An example of an augmented spectrogram X̃ is depicted in Figure 2.
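As a reference, the following sketch illustrates the pre-processing chain described above: Mel-Spectrogram extraction followed by Sobel and Laplacian filtering and frequency-axis concatenation. The hyperparameter values and the combination of the two Sobel responses into a single derivative channel are illustrative assumptions, not the exact configuration of AuSPP (see Table 3 for the actual hyperparameters).

```python
import torch
import torch.nn.functional as F
import torchaudio

def augmented_mel_spectrogram(waveform: torch.Tensor, sample_rate: int = 16000,
                              n_fft: int = 1024, hop_length: int = 512,
                              n_mels: int = 64) -> torch.Tensor:
    # Mel-Spectrogram X of size (M, n_mels): squared-magnitude STFT
    # grouped into logarithmic Mel bins.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=n_fft,
        hop_length=hop_length, n_mels=n_mels, power=2.0)
    X = mel(waveform).squeeze(0).T          # (M, n_mels), time on the first axis
    X = X.unsqueeze(0).unsqueeze(0)         # (1, 1, M, n_mels) for conv2d

    # First-order (Sobel) and second-order (Laplacian) derivative kernels.
    H_u = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    H_v = H_u.T
    H_l = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
    kernels = torch.stack([H_u, H_v, H_l]).unsqueeze(1)   # (3, 1, 3, 3)

    D = F.conv2d(X, kernels, padding=1)     # (1, 3, M, n_mels)
    X1 = D[:, 0] + D[:, 1]                  # first derivative (sum is an assumption)
    X2 = D[:, 2]                            # second derivative (Laplacian response)

    # Concatenate along the frequency axis: (M, 3 * n_mels).
    return torch.cat([X.squeeze(0).squeeze(0), X1.squeeze(0), X2.squeeze(0)], dim=-1)
```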

B. ATROUS SPATIAL PYRAMID POOLING
In this work, the ASPP module [25] is employed. In more detail, it applies atrous convolutions to the augmented Mel-Spectrogram in order to extract more relevant spatial features. Considering two-dimensional signals, given the augmented Mel-Spectrogram X̃, a convolutional kernel W, and an output location i, the output of the atrous convolution, Y[i], is:

$$Y[i] = \sum_{k} \tilde{X}[i + r \cdot k] \, W[k]$$

where the atrous rate r is the dilation parameter of the convolutional layer of the network, which is equivalent to applying standard convolutions to the input spectrogram X̃ with upsampled zero-padded filters. The variable k spans the locations of the kernel W. In the proposed model, the ASPP module is composed of 5 atrous convolutions with kernels of size 3 × 3 and atrous rates of 1, 2, 4, 8, and 16, respectively. This design choice has been made to capture new time-frequency patterns across the augmented spectrogram.
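A minimal PyTorch sketch of such an ASPP module is given below. The fusion of the five branches by concatenation followed by a 1 × 1 convolution is an assumption, since the text only states that the branch outputs are combined.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 2, 4, 8, 16)):
        super().__init__()
        # Five parallel 3x3 atrous convolutions; padding=r keeps the
        # spatial size unchanged while enlarging the receptive field.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees the same input at a different dilation rate.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# x: batch of augmented Mel-Spectrograms, shape (B, 1, M, 3 * n_mels)
y = ASPP(in_ch=1, out_ch=32)(torch.randn(4, 1, 200, 192))
```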

C. FULLY CONVOLUTIONAL NETWORK
Inspired by the network architecture proposed in [11], let ConvBlock(C) be the generic convolutional block with C output channels, shown in Figure 3. It is composed of two consecutive Conv2D-BatchNormalization-Exponential Linear Unit (ELU) [44], [45] blocks with 3 × 3 kernels. Variable pooling sizes are applied to intermediate ConvBlock(·) outputs to reduce the size of the feature maps. Moreover, a Dropout [46] layer is applied at the end of each ConvBlock(·) to reduce overfitting. Finally, the output of the module, that is, the predicted activity matrix Ŷ, is obtained by applying the sigmoid activation function. Details about the configurations of AuSPP convolutional blocks are reported in Table 1. It is worth noticing that the number of parameters changes with the size of the predicted activity matrix. More specifically, the model becomes more complex if a finer time resolution and/or a larger number of audio classes are required.
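The generic ConvBlock(C) can be sketched as follows; the pooling size and dropout probability below are placeholders for the actual values reported in Table 1.

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, pool=(2, 2), p_drop: float = 0.2):
    # Two consecutive Conv2D-BatchNorm-ELU sub-blocks with 3x3 kernels,
    # followed by pooling and Dropout, as described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ELU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ELU(),
        nn.MaxPool2d(pool),     # variable pooling shrinks the feature maps
        nn.Dropout(p_drop),     # regularization against overfitting
    )
```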

D. LOSS FUNCTION
The loss function is a fundamental component of a deep learning-based approach. The training process aims at finding the configuration of weights that minimizes the training loss. This procedure is performed by means of the mini-batch gradient descent algorithm, exploiting the derivative of the loss function with respect to the weights of the model. Let Ŷ and Y be the predicted and ground truth binary activity matrices, respectively, with size T × N_class. Then, we employ the Binary Cross Entropy (BCE) loss:

$$\mathcal{L}_{BCE} = -\frac{1}{J} \sum_{j=1}^{J} \sum_{t=1}^{T} \sum_{c=1}^{N_{class}} \left[ Y^{(j)}[t,c] \log \hat{Y}^{(j)}[t,c] + \left(1 - Y^{(j)}[t,c]\right) \log\left(1 - \hat{Y}^{(j)}[t,c]\right) \right]$$

where J is the number of training audio samples in a batch. An example of the two activity matrices is shown in Figure 5.
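A minimal sketch of the loss computation, assuming sigmoid outputs and activity matrices of shape (J, T, N_class); the tensor shapes are illustrative:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()                                    # expects probabilities, not logits
Y_hat = torch.rand(8, 200, 10, requires_grad=True)    # stand-in for sigmoid outputs
Y = (torch.rand(8, 200, 10) > 0.9).float()            # sparse binary activity matrix
loss = bce(Y_hat, Y)                                  # averaged over the mini-batch
loss.backward()                                       # gradients for gradient descent
```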

E. DATA AUGMENTATION
Time and time-frequency audio augmentations have been exploited. This procedure helps the model to generalize and increases the overall performance by exploiting modified versions of the training dataset. As a drawback, the training process becomes slower compared to models not adopting data augmentation strategies. We employ SpecAugment [47], which randomly drops time and frequency stripes of the initial Mel-Spectrogram X. In addition, before spectrum extraction, we apply pitch shift, frame shift, polarity inversion, and gain augmentations [48] to the input batch.
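The sketch below illustrates this pipeline: SpecAugment-style time and frequency masking on the Mel-Spectrogram via torchaudio, plus two of the waveform-domain augmentations mentioned above (polarity inversion and random gain) implemented directly. Pitch shift and frame shift are omitted for brevity, and the masking parameters and gain range are illustrative assumptions.

```python
import torch
import torchaudio.transforms as T

def augment_waveform(x: torch.Tensor) -> torch.Tensor:
    if torch.rand(1) < 0.5:
        x = -x                                   # polarity inversion
    gain_db = torch.empty(1).uniform_(-6.0, 6.0)
    return x * 10.0 ** (gain_db / 20.0)          # random gain in dB

spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=8),       # drops a random frequency stripe
    T.TimeMasking(time_mask_param=20),           # drops a random time stripe
)

X = torch.randn(1, 64, 200)                      # (channel, n_mels, time)
X_aug = spec_augment(X)
```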

IV. SYNTHETIC DATASET-SEDDOB
In this work, a synthetic audio dataset, SEDDOB, has been designed. As previously mentioned, a large number of high-quality annotated audio samples is required for training data-driven approaches. In the following we describe the recording setup and the synthesis of the sound classes.

A. BACKGROUND AUDIO
In order to collect a typical bus background, a microphone array and a recording unit have been deployed inside a bus. The microphone array has been positioned in 3 different locations (Figure 4). This choice accounts for the observation that in the bus used for the recordings, the engine is located in the rear. Therefore, the background sound pressure level and its spectral characteristics change with distance from the engine, which is the main source of background noise. The total length of the recorded background noise is 1 hour. All the recordings have been acquired in dedicated runs in compliance with General Data Protection Regulation (GDPR).

B. FOREGROUND EVENTS
The audio events selected for the classification task are extracted from existing monophonic datasets: UrbanSound8k [9], ESC50 [10], and AudioSet [8]. Ten audio classes that can occur in a bus environment and that relate to potentially threatening events for public safety and security have been selected: breaking_glass, car_horn, gunshot, siren, slap, scream, cry, jackhammer, car_alarm, smoke_alarm.

C. DATASET GENERATION
Scaper [40] has been used to generate the proposed dataset by adopting the parameter configuration reported in Table 2. The audio length is selected based on the characteristics of human auditory perception. In fact, tests on human subjects confirm that 4 seconds are sufficient to correctly classify events [9]. Furthermore, the distribution of class events, together with their number, onset, and offset times, is set as uniform to create a balanced dataset. In addition, augmentation strategies such as pitch shift, time stretch, and variable SNR have been introduced for generalization purposes.
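A hedged sketch of the soundscape synthesis with Scaper is shown below; the paths, labels, and distribution bounds are illustrative stand-ins for the configuration in Table 2 (only the 4 s clip length and the uniform event distribution are taken from the text).

```python
import scaper

sc = scaper.Scaper(duration=4.0, fg_path='foreground/', bg_path='background/')
sc.ref_db = -20

# Bus background recorded in a real environment.
sc.add_background(label=('const', 'bus'),
                  source_file=('choose', []),
                  source_time=('const', 0))

# One anomalous event with a uniform onset and a variable SNR; pitch shift
# and time stretch act as augmentations.
sc.add_event(label=('choose', ['breaking_glass', 'gunshot', 'siren']),
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0.0, 3.0),
             event_duration=('uniform', 0.5, 4.0),
             snr=('uniform', 6, 24),
             pitch_shift=('uniform', -2, 2),
             time_stretch=('uniform', 0.8, 1.2))

# Writes the audio file and a JAMS annotation with onset/offset labels.
sc.generate('soundscape.wav', 'soundscape.jams')
```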
To assess the performance of the proposed SED, two datasets, denoted as full and reduced, with 10 and 4 audio classes, respectively, have been generated. Both datasets are split into 10 folds to use different portions of the data for training and testing.

V. EXPERIMENTAL RESULTS
The AuSPP weights are updated by means of the Adam optimizer with learning rate λ. The STFT, Mel-Filterbank, and training hyperparameters for each model are listed in Table 3. The model implementation is based on PyTorch Lightning and is available at the following GitLab repository.
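For reference, a minimal PyTorch Lightning training skeleton consistent with this setup (Adam optimizer with learning rate λ) is sketched below; the module name, the model body, and the learning rate value are placeholders, not the released implementation.

```python
import pytorch_lightning as pl
import torch

class AuSPPModule(pl.LightningModule):
    def __init__(self, model: torch.nn.Module, lr: float = 1e-3):
        super().__init__()
        self.model, self.lr = model, lr
        self.loss = torch.nn.BCELoss()

    def training_step(self, batch, batch_idx):
        x, y = batch               # augmented spectrogram, binary activity matrix
        return self.loss(self.model(x), y)

    def configure_optimizers(self):
        # Adam with learning rate lambda, as stated above.
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```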

A. METRICS
The validation of an audio classification system is usually based on the accuracy score. More specifically, let TP be the number of true positives, FP the number of false positives, FN the number of false negatives, and TN the number of true negatives. The classification accuracy is evaluated as:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$

which can be decomposed into Sensitivity (S_t) and Specificity (S_p):

$$S_t = \frac{TP}{TP + FN}, \qquad S_p = \frac{TN}{TN + FP}$$

Inspired by the state-of-the-art [53], we discard TN from the accuracy and obtain:

$$Acc' = \frac{TP}{TP + FP + FN}$$

At the moment, a rigorous quantitative evaluation of SED systems is still not universally accepted by the research community. For this reason, the comparison between the output of the SED algorithm and the ground truth is also performed on fixed-length time intervals, thus measuring segment-based metrics [54].
Segment-based metrics evaluate the system prediction and the reference in fixed short time segments.

TABLE 4. Performance of SoA models and of the proposed approach on 4 classes with threshold δ = 0.8 and time frame T = 20 ms. We denote with ↑ the metrics for which higher values are better and with ↓ otherwise.

TABLE 5. Performance of SoA models and of the proposed approach on 10 classes with threshold δ = 0.8 and time frame T = 20 ms. We denote with ↑ the metrics for which higher values are better and with ↓ otherwise.

Thanks to this activity representation, it is possible to define intermediate statistics like TN, TP, FP, and FN. To evaluate the robustness of the system, an activity threshold δ is applied to the predicted activity matrix Ŷ. In the following, we denote with loose threshold the case δ = 0.8 and with strict threshold the case δ = 0.9. By applying one of the thresholds to the predicted activity matrix Ŷ, the resulting matrix is binary and the intermediate statistics can be evaluated. Precision and Recall [55] are used for measuring the effectiveness of the retrieval.
For a generic i-th class event, the precision (P_i) and recall (R_i) are evaluated as:

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}$$

Then, class-wise Precision (P_c) and Recall (R_c) are calculated by averaging with respect to the number of class events N_class:

$$P_c = \mathbb{E}_i[P_i] = \frac{1}{N_{class}} \sum_{i=1}^{N_{class}} P_i, \qquad R_c = \mathbb{E}_i[R_i] = \frac{1}{N_{class}} \sum_{i=1}^{N_{class}} R_i$$

where E[·] is the expected value. In addition, the F-score can be derived as:

$$F_i = \frac{2 P_i R_i}{P_i + R_i}$$

In the literature, two types of averaging approaches for the F-score are proposed [54]. We define the class-wise F_c as the average of all the F-scores:

$$F_c = \frac{1}{N_{class}} \sum_{i=1}^{N_{class}} F_i$$

We evaluate the audio-wise F-score F_a by calculating the precision P_j and the recall R_j of the j-th audio recording (ignoring the class events). Let N_audio be the number of audio samples. Then, the metric is computed as:

$$F_a = \frac{1}{N_{audio}} \sum_{j=1}^{N_{audio}} \frac{2 P_j R_j}{P_j + R_j}$$

TABLE 6. Performance of SoA models and of the proposed approach on 10 classes with threshold δ = 0.9 and time frame T = 20 ms. We denote with ↑ the metrics for which higher values are better and with ↓ otherwise.

TABLE 7. Performance of SoA models and of the proposed approach on 10 classes with threshold δ = 0.9 and time frame T = 50 ms. We denote with ↑ the metrics for which higher values are better and with ↓ otherwise.
Furthermore, we use the Error Rate (ER) metric to account for the amount of wrong predictions in terms of substitution, deletion, and insertion errors. More precisely, let k be a specific time segment, with its intermediate statistics (TP_k, TN_k, FP_k, and FN_k), in the j-th audio file with K time frames. We can define the errors as:

$$S_k = \min(FN_k, FP_k), \qquad D_k = \max(0, FN_k - FP_k), \qquad I_k = \max(0, FP_k - FN_k)$$

The total ER of the j-th audio file is evaluated as:

$$ER_j = \frac{\sum_{k=1}^{K} S_k + \sum_{k=1}^{K} D_k + \sum_{k=1}^{K} I_k}{\sum_{k=1}^{K} N_k}$$

where N_k is the number of active ground-truth events in the k-th segment. Finally, the ER is averaged with respect to the number of audio files in the validation set:

$$ER = \frac{1}{N_{audio}} \sum_{j=1}^{N_{audio}} ER_j$$

It is worth noticing that the ER is not a probability, so its value can be larger than 1. However, thanks to the use of the activity threshold δ and the fact that an all-zero predicted activity matrix yields unit ER, all the approaches used for comparison have an ER lower than or equal to 1.
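A hedged numpy sketch of these segment-based metrics for a single audio file is given below: binarization with the activity threshold, class-wise precision/recall/F-score, and the error rate built from substitutions, deletions, and insertions, following [54]. The function name and the handling of empty denominators are illustrative choices; averaging over the audio files of the validation set is left to the caller.

```python
import numpy as np

def segment_metrics(Y_hat: np.ndarray, Y: np.ndarray, delta: float = 0.8):
    """Y_hat: predicted activity (K, N_class) in [0, 1]; Y: binary ground truth."""
    P = (Y_hat >= delta).astype(int)             # binarize with activity threshold
    tp = ((P == 1) & (Y == 1)).sum(axis=0)       # per-class, summed over segments
    fp = ((P == 1) & (Y == 0)).sum(axis=0)
    fn = ((P == 0) & (Y == 1)).sum(axis=0)
    prec = tp / np.maximum(tp + fp, 1)
    rec = tp / np.maximum(tp + fn, 1)
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)

    # Per-segment substitution/deletion/insertion counts.
    fp_k = ((P == 1) & (Y == 0)).sum(axis=1)     # per segment k
    fn_k = ((P == 0) & (Y == 1)).sum(axis=1)
    S = np.minimum(fn_k, fp_k)
    D = np.maximum(0, fn_k - fp_k)
    I = np.maximum(0, fp_k - fn_k)
    er = (S + D + I).sum() / max(Y.sum(), 1)     # can exceed 1 in general
    return prec.mean(), rec.mean(), f1.mean(), er
```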

B. RESULTS
We present the results of four tests in which the generalization ability and the robustness of the proposed framework are compared with state-of-the-art techniques [11], [49], [50], [51], [52]. Since SEDDOB is split into 10 folds, we report the mean values of the aforementioned metrics over the folds.
As shown in Tables 4, 5, 6, and 7, some metrics, such as Acc, S_t, and P_t, are not meaningful for rare anomalous event activity, since they are biased by the high value of TN. Moreover, they do not take into account: i) the fact that multiple class events must be recognized in the same time frame (class errors) and ii) the wrong predictions of onset and offset timestamps for a correct class event (time errors).

1) REDUCED DATASET, SMALL TIME FRAME, AND LOOSE THRESHOLD
This test is performed to evaluate the AuSPP performance on a reduced number of audio classes with a high time resolution. In more detail, 10,000 audio tracks with 4 anomalous events (breaking_glass, car_horn, gunshot, and siren) are used for training and testing. The temporal resolution is set to 20 ms and the loose threshold δ = 0.8 is used to filter low probabilities from the predicted activity matrix Ŷ. As shown in Table 4, AuSPP outperforms both general-purpose and SED deep learning approaches on all the metrics, except for precision.
However, for a security-driven system, the proposed model is preferable since it has a larger recall than the state-of-the-art CNN14 [11], with an improvement of 7.68%. The results also show that the model pre-trained on AudioSet [8] does not outperform the other models, even though it has been trained on a large annotated audio dataset. This behaviour could be caused by the reduced number of audio events to be classified. As a drawback, AuSPP shows a lower precision than the other models. More specifically, the onset and offset timestamps predicted by AuSPP are less accurate, as can be seen in Figure 5. This can be due to the low complexity of the model with respect to state-of-the-art models.

2) FULL DATASET, SMALL TIME FRAME, AND LOOSE THRESHOLD
In this case, a dataset of 10,000 audio samples is generated with all 10 audio event classes. Hence, the ability of AuSPP to recognize multiple class anomalies is tested together with the other models. The results reported in Table 5 show that AuSPP achieves results comparable to the version of CNN14 [11] pre-trained on AudioSet [8]. It is worth noticing that AuSPP, with 75% fewer parameters than CNN14, does not require additional data. Therefore, AuSPP can be employed on edge computing devices where computational resources are limited.

3) FULL DATASET, SMALL TIME FRAME, AND STRICT THRESHOLD
Exploiting the aforementioned extended dataset, this test analyzes the predictions of all the models under the strict threshold (δ = 0.9). With this test, we evaluate which model is more confident in its predicted activity matrix, i.e., assigns higher probabilities to the detected events. Table 6 shows that the proposed AuSPP outperforms state-of-the-art approaches with a Recall improvement of 4.74% over CNN14 [11], the second best model. As in the other tests, our model is less precise than the state-of-the-art model pre-trained on AudioSet.

4) FULL DATASET, LARGE TIME FRAME, AND STRICT THRESHOLD
Finally, we increase the time frame from 20 ms to 50 ms and re-train all models in order to assess their performance.
As shown in Table 7, AuSPP outperforms state-of-the-art SED models with a recall improvement of 5.16%. It is worth noticing that the proposed method achieves better performance without the ASPP module. Hence, when a larger time frame is used, it is preferable to avoid the module, since no improvements can be obtained from it.

VI. CONCLUSION
In this paper we propose AuSPP, a lightweight deep learning model that applies spatial filters to the Mel-Spectrogram in order to predict different anomaly classes with the corresponding onset and offset times. Moreover, we introduce SEDDOB, a human safety-oriented dataset which provides high-quality annotated audio waveforms for detecting anomalies on buses. Four tests on SEDDOB demonstrate that AuSPP outperforms state-of-the-art general-purpose deep learning approaches when a reduced number of audio event classes is considered. In addition, AuSPP achieves results comparable to SED models pre-trained on large-scale audio datasets, while using fewer learnable parameters.
However, the results show that significant improvements can still be achieved. All the considered models suffer from false alarms on normal recordings, which significantly impacts the task of ensuring human safety. As a possible solution, further studies on a coarse-to-fine approach for increasing the quality of the audio predictions, in terms of recall and F-score metrics, could be performed. Moreover, tests could be conducted in a real scenario exploiting edge computing devices. Finally, model interpretability could be investigated.

ACKNOWLEDGMENT
The article reflects only the authors' view and the Commission is not responsible for any use that may be made of the information it contains.