A Large-Scale Benchmark Dataset for Anomaly Detection and Rare Event Classification for Audio Forensics

With the emergence of new digital technologies, the volume of multimedia data generated by smart devices has surged. Extracting useful information from this data raises several analysis challenges, one of which is the early and accurate detection of anomalies. This study proposes an efficient technique for anomaly detection and classification of rare events in audio data. We develop a large audio dataset containing seven different rare events (anomalies) mixed with 15 different background environmental settings (e.g., beach, restaurant, and train), targeting both the detection of anomalous audio and the classification of rare sound events (e.g., baby cry, gunshots, broken glasses, footsteps) for audio forensics. The proposed approach extracts mel-frequency cepstral coefficient (MFCC) features from the audio signals of the newly created dataset and uses principal component analysis (PCA) to select the minimum number of best-performing features. These features are input to state-of-the-art machine learning algorithms for performance analysis. We also apply the machine learning algorithms to an existing state-of-the-art dataset and obtain good results. Experimental results reveal that the proposed approach effectively detects all anomalies and achieves superior performance to existing approaches in all environments and cases.


I. INTRODUCTION
Owing to the technological advancements the world has seen during the past decade, the volume of digital media data on the internet has nearly quadrupled [1], [2]. Smartphones have enabled people to record and store every aspect of their lives in the form of multimedia files [3]-[5]. Moreover, the number of surveillance cameras monitoring streets, offices, and traffic for security has increased [6]. This exponential increase in multimedia data calls for techniques to analyze the data so that it can be managed and used effectively. Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior [7]. Anomaly detection is significant because anomalies in data translate to vital information in a broad range of application domains [8]. In the audio category, anomaly detection has several critical applications [9]. For instance, detecting abnormal activities or events in audio can usefully supplement video-based methods [2]; analyzing the sounds of machines can reveal abnormal machine behavior in advance [10]; and audio analysis can detect abnormal situations that may represent a risk to public security [11].
With the arrival of smart video surveillance systems, innovative ways of quickly and effectively detecting malicious occurrences or behaviors in monitored settings, based on real-time analysis of multimedia streams, have emerged [12], [13]. Most real-world audio recordings are complicated in that they are composed of sequences of many different sounds [9], [14], [15]. If a comparably short sequence of sounds can be distinguished by a human regardless of the acoustic context in which it occurs, it can be classified as an anomaly. Examples are the sound of a gunshot in a beach environment or the sound of screams in an office environment. Detecting such anomalies has been studied over the past few years using a variety of machine learning and deep learning algorithms [16]. However, these studies either operated at poor signal-to-noise ratios or achieved low accuracies [11], [17].

A. MOTIVATION
This section provides the primary motivations for our research. Law enforcement and private investigators may learn a lot from a media forensics expert [18]. An audio forensic investigator must be able to detect events in audio in a short time. For example, in a gunshot incident, the gun may not be readily noticeable at the scene or on camera; if a gunshot event occurs in the audio, the proposed system should detect it efficiently and thereby help the forensic investigator during the investigation. In addition, there is a real need for a system that detects abnormal events related to burst-like environmental sounds (such as screams, gunshots, glass breaks, and explosions) that may be considered "anomalous" for the observed environment, thereby directing the attention of human security operators to a potentially dangerous situation. Furthermore, we highlight the issue of public security, which may be addressed by identifying anomalous sounds (e.g., footsteps, police siren, baby cry, scream) so that the surveillance or human security operator can shift focus to the specific situation and avoid further harm.

B. CONTRIBUTIONS
This study makes the following contributions toward successfully and efficiently identifying abnormalities in audio recordings with varied background environments.
• Present a multi-modal open dataset for anomaly detection and rare event classification from audio. For the time being, the dataset consists primarily of rare events with 15 background audios. The final dataset comprises seven different types of rare events (baby cry, gunshots, broken glasses, footsteps, police siren, explosions, and screams) that have been artificially blended with background audio recordings from 15 different environmental contexts (e.g., office, library, and park). Detailed descriptions of the dataset are presented in Section III.
• Extract MFCC features from the audio signals and use Principal Component Analysis (PCA) to select suitable features for anomaly detection in the audio signal.
• Propose a practical machine learning approach for anomaly detection and classification of rare events in audio data embedded in various types of background sounds.
• Analyze and validate the effect of the feature extraction and feature selection approaches on the performance of the machine learning algorithms for anomaly detection, and present a comparative analysis of the suggested approach against other state-of-the-art studies, showing that it enhances the detection rate with consistent performance.

C. ORGANIZATION
The structure of this paper is as follows. Section II addresses related work. The dataset used for testing and early analysis is discussed in Section III. The proposed technique for anomaly identification and categorization of unusual occurrences in audio data is described in Section IV. Section V describes the experimental setup and findings. Section VI contains a discussion, and Section VII provides the conclusion and future work.

II. LITERATURE REVIEW
Several methods, primarily based on AI/ML methodologies, have been used to detect abnormalities throughout the last decade. In [19], the authors employed a technique that reconstructs the features for anomaly identification based on an LSTM-based network that detects abnormalities from subsampled signals. The authors of [20] used non-uniform sampling for audio subsampling to produce low-volume samples that include frequencies higher than the Nyquist frequency. An LSTM-based auto-encoder network is then used for anomaly detection, in which the signal is demultiplexed and accepted as input from the endpoint. The authors in [21] used a Convolutional Autoencoder (CAE) to detect abnormal activities in audio overlaid on natural factory soundscapes. The CAE-based approach gives better results than One-Class Support Vector Machines, although only a limited number of audios was used in the experimental work. The authors in [22] used sequence-to-sequence autoencoder models on audio features extracted from streaming audio signals. They found that Convolutional Long Short-Term Memory autoencoders perform better than sequential Convolutional autoencoders under diverse signal-to-noise ratio conditions of audio events.
In [23], the authors used two models to achieve the goal. In [24], the authors used a modular deep convolutional autoencoder with a dense bottleneck structure for unsupervised anomaly detection. They also applied Maximum Mean Discrepancy (MMD) to learn the features efficiently. For training, the authors employed two models: the first is a 1D-convolutional encoder, and the second is a WaveNet decoder. The identical encoder/decoder structures are trained to learn a mapping function between different mel-scaled frequency bands. An SVM model is trained to predict anomalies and to examine the latent space representation learned by the autoencoders. They found that this method paves the way towards semi-supervised or self-supervised training for detecting anomalies.
The authors of [25] suggested a methodology based on Huffman coding. This approach is used for anomaly detection in audio to obtain benefits such as variable event length and reduced reliance on cluster information, and it was found to improve results with little computing overhead. In [26], the authors introduced a training strategy primarily intended for unsupervised anomaly detection systems (ADS). The authors suggested a batch uniformization technique in which the weighted mean score is reduced, where each weight is defined as the reciprocal of the sample's probability density. The authors found that this method is appropriate for an unsupervised anomaly detection system based on a deep neural network (DNN).
The authors of [27] suggested an auto-encoder that leverages the residual error, which represents reconstruction quality, to find the anomaly. In [28], the authors employed an auto-encoder model to detect anomalies in audios recorded in home surroundings. The main limitation of this work is that very few audio events and background audios were used for training and testing. In [17], the authors adopted the WaveNet architecture. The model, originally developed for raw audio synthesis, was applied to ADA and showed significant performance increases over deep convolutional autoencoders (DCA). The WaveNet model outperformed the DCA technique; however, it earned a relatively low AUC ROC score overall, indicating that the model did not perform well on the dataset. Table 1 summarises the literature review.

III. NETWORK MODEL, DATASET AND PRELIMINARIES
We evaluate the suggested system's performance for an automated surveillance application that should be capable of identifying the following occurrences (called "abnormal" or "anomaly" in the observed environment): baby cry, gunshots, broken glasses, footsteps, police siren, explosions, and screams. We create the dataset by mixing different rare events with 15 background audio datasets fetched from the TUT Acoustic Scenes 2016 dataset [29]. A Sound-man OKM II Classic/Studio A3 head-microphone and an R-09 wave recorder from Edirol/Roland were used for the TUT Acoustic Scenes 2016 recordings. The recording quality is excellent. The TUT Acoustic Scenes 2016 collection is made up of real-world audio recordings, and the recorded audio is remarkably comparable to the sound that reaches the auditory system of the person wearing the equipment. The final dataset consists of seven types of unusual events: baby cry, gunshots, broken glasses, footsteps, police siren, explosions, and screams. These audio events were then synthetically mixed with audios from the 15 background environmental contexts (e.g., beach, bus, home). The final dataset is available at https://www.kaggle.com/ahmedabbasi/audioanomalydataset. The datasets used in this experiment are summarised in Figure 1. Existing techniques concentrated on only one type of audio (only a rare event or only a background sound), resulting in poor performance when tested under realistic conditions. As a result, the aims of extending the dataset are:
• To focus on detecting anomalous audio and classifying rare sound events.
• To focus on using the audio information for surveillance purposes.
• To encourage other researchers in the field to use this dataset for testing their methods for anomaly detection and rare event classification.
Rare events are randomly mixed at different "event-to-background" ratios. The original audio mixtures are sampled at 44.1 kHz with a 24-bit resolution.
The data set contains highly noisy environmental sounds, making event detection more difficult in some environments and challenging the detection and classification of events.
We begin by gathering 1170 background audio recordings from the TUT 2016 dataset. First, the background sounds $bg_i(n)$ are selected randomly for a defined number of audios $n \in \{1, 2, 3, \ldots\}$, as given in equation 1. The $bg_i(n)$ are the $n$ chosen background audios that are used to produce the complex environmental sound by combining them with several unusual events.
Once the background audio has been selected, a number $N_e$ of rare events is randomly chosen and superimposed on the background audios. As a result, a given rare event may appear in the final dataset with a different background audio each time.
In equation 2, $\oplus_{[N_e, B_j]}$ denotes an operator that mixes the $N_e$ rare events with the background audio $B_j(n)$ at random positions of the audio signal. The final dataset consists of 8,922 audios with a total duration of about 75 h, which makes the database large. We split the final 30-second audio mixtures into two partitions. We employed the first partition for training and the second for subsequent assessment. Figure 2 depicts sound waves of anomalous sounds for all types of rare events.
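The mixing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mix_event`, the decibel-valued `ebr_db` parameter, and the synthetic stand-in signals are all assumptions made for illustration, and the event-to-background ratio is assumed to be measured on RMS levels.

```python
import numpy as np

def mix_event(background, event, ebr_db, rng):
    """Superimpose `event` on `background` at a random offset.

    `ebr_db` is a hypothetical event-to-background ratio in decibels,
    measured on RMS levels (an assumption, not taken from the paper).
    """
    # Pick a random start such that the whole event fits in the background.
    start = int(rng.integers(0, len(background) - len(event) + 1))
    segment = background[start:start + len(event)]
    # Scale the event so its RMS sits ebr_db above the local background RMS.
    bg_rms = np.sqrt(np.mean(segment ** 2)) + 1e-12
    ev_rms = np.sqrt(np.mean(event ** 2)) + 1e-12
    gain = (bg_rms / ev_rms) * 10.0 ** (ebr_db / 20.0)
    mixture = background.copy()
    mixture[start:start + len(event)] += gain * event
    return mixture, start

rng = np.random.default_rng(0)
sr = 44100                                             # sampling rate in Hz
background = 0.1 * rng.standard_normal(30 * sr)        # stand-in for a 30 s scene
event = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # stand-in for a 1 s rare event
mixture, start = mix_event(background, event, ebr_db=6.0, rng=rng)
```

Drawing a fresh offset and ratio for every call is what lets the same rare event appear against different backgrounds and at different loudness levels.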

IV. PROPOSED APPROACH
The proposed approach comprises multiple steps: data analysis, feature extraction, feature reduction, data processing, and finally, detection of anomalous audio and classification of rare sound events, as shown in Figure 3. The data analysis involves visualizing the audio waveform and spectrogram to extract meaningful insights from the audio. The feature extraction step uses a feature ensemble of MFCC, spectral_rolloff, spectral_centroid, spectral_contrast, spectral_bandwidth, and zero_crossing_rate features. Because the feature ensemble contains both useful and non-useful features, we apply feature reduction to remove the non-useful features and pass only the useful ones to the machine learning classifiers for better classification results. For feature reduction, we use Principal Component Analysis (PCA), which selects the most appropriate features that are finally passed to the classifiers for classification and anomaly event detection. An audio surveillance system must recognize unusual occurrences even when they are blended with a variety of background audios of varying energy levels. Consequently, training a model on a training set that contains only one sort of audio at a time (either a rare event or a background sound) would lead to poor performance in the test phase under realistic scenarios. To address this issue, we decided to create training and test sets in which the different audios are already layered rather than separated. Additionally, the proposed approach can efficiently detect anomalous events in audio, which helps the forensic investigator in further investigation.

VOLUME 4, 2016

TABLE 1: Summary of the literature review
Reference | Approach | Limitation
[28] | Adversarial autoencoders | Limited anomaly events and backgrounds
[17] | Adapt WaveNet | Lower accuracy
[21] | Convolutional Autoencoder (CAE) | Limited dataset and number of audios
[22] | Convolutional Long Short-Term Memory autoencoders | Lower accuracy

FIGURE 1: Dataset Overview
We performed extensive feature engineering, which gives the models a strong ability to detect anomalies. After gathering many background audios from the TUT challenge 2016, we combined them with events of interest in various ways to obtain a large dataset. To provide very challenging event detection tasks, the dataset comprises loud environmental noises, so that events may be harder to detect in specific situations, which makes event detection and categorization difficult. The audio clips are divided into two distinct partitions containing 80% and 20% of the total sounds from the original collection. The audios from the first partition were used to create the training set, while the audios from the second partition were used to create the test set. Pre-processing is necessary to obtain good performance from any machine learning model before training the classifiers. Audio data therefore undergoes a handful of pre-processing procedures before it is delivered for further analysis. The first stage is data framing, which involves converting the audio data into a machine-readable format. We acquire values at regular time intervals. For example, in a 10-second audio file, we might extract a value every second; this is audio data sampling, and the sampling rate is the rate at which the audio is sampled. In our case, if an audio file "file1" has a duration of 30 s, then the total frame rate of this file can be calculated by the formula in equation 4.
Data framing is used to fix the sampling (frame) rate of each audio file [30]. The audio processing procedure begins with extracting key acoustical characteristics, followed by decision-making techniques involving detection, classification, and knowledge fusion. In the following phase, we represent the audio data by transforming it to a new data representation domain, the frequency domain. Sampling requires many data points to represent the entire audio signal, and the sampling rate should be as high as feasible. Each sample represents the amplitude of the audio waveform at a certain time instant. To visually inspect the audio signal, we create a spectrogram. It shows the signal intensity, or "loudness," across time at the various frequencies contained in a waveform. Figure 4 shows the spectrogram of a baby cry audio waveform. The vertical axis displays frequencies ranging from 0 to 8 kHz, while the horizontal axis displays the duration of the audio clip. In the spectrogram, color (shades of purple in Figure 4) represents the amplitude of the sound wave.

FIGURE 4: Spectrogram of Audio signal
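A spectrogram of the kind described above can be computed with a short-time Fourier transform. The sketch below is illustrative only: the frame length, hop size, and the 16 kHz sampling rate (chosen so that the frequency axis spans 0 to 8 kHz, as in Figure 4) are assumptions, and a synthetic 1 kHz tone stands in for a real recording.

```python
import numpy as np

def power_spectrogram(x, frame_len=1024, hop=512):
    """Return |STFT|^2 with shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)                 # Hann window
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One-sided FFT of each frame, then squared magnitude.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

sr = 16000                                 # assumed rate; Nyquist is then 8 kHz
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)           # 1 kHz test tone, 1 s long
spec = power_spectrogram(x)
freqs = np.fft.rfftfreq(1024, d=1 / sr)    # frequency of each column in Hz
peak_hz = freqs[spec.mean(axis=0).argmax()]
```

Each row of `spec` is one time frame and each column one frequency bin, so plotting `spec.T` on a logarithmic color scale gives the usual time-frequency image.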
We choose a standard scaler for feature normalization. Standard scaling is used in this study to bring all features onto a common scale. The goal of the standard scaler is to rescale the features so that each has zero mean and unit variance, which makes them comparable to a standard normal distribution. The scaler removes the mean of each feature/variable and scales it to unit variance, as shown in equation 5, where y is the standardized form of x.
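As an illustration, and assuming equation 5 takes the usual form $y = (x - \mu) / \sigma$ with $\mu$ and $\sigma$ the per-feature mean and standard deviation, the scaling can be sketched as follows. The helper name `standardize` and the toy matrix are hypothetical.

```python
import numpy as np

def standardize(X):
    """Rescale each column of X to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # leave constant features unscaled
    return (X - mean) / std

# Toy feature matrix: 3 audio files, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Y = standardize(X)
```

In practice the mean and standard deviation would be estimated on the training partition only and then reused on the test partition, so that no test information leaks into training.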

B. FEATURE EXTRACTION
Every audio signal contains a variety of characteristics/features. We must, however, extract the features related to the event that we want to detect. We employed Mel-frequency Cepstral Coefficients (MFCC) for this purpose. MFCC is a feature extraction method, and in this study, we used 39 MFCC features. In sound processing, MFCC features are the ones most often used in speech recognition [31]. In this work, MFCC is exploited for anomaly detection. After the pre-processing of the anomaly audio signals, the MFCC vector is extracted from each frame of the audio waveform in the form of a vector group. This study uses MFCC, spectral_rolloff, spectral_centroid, spectral_contrast, spectral_bandwidth, and zero_crossing_rate features for experimentation. We construct these features by taking the mean and standard deviation of the values computed at each frame and combining them to obtain the value for the relevant feature. FIGURE 5 depicts the MFCC series of infant cry audio files, obtained by converting the audio waveform into the frequency domain using the Fourier transform of the signal and then mapping the powers of the resulting spectrum onto the mel scale. After that, the logs of the powers at each of the mel frequencies are taken, and the discrete cosine transform of the list of mel log powers is computed.
The MFCCs are the amplitudes of the resulting spectrum. The features collected from each feature group are described in Table 2. A total of 270 features are extracted from each audio file, and the results are saved in a data frame. We selected only suitable features and removed all non-useful features using Principal Component Analysis [32]. In the end, the 65 most important features are passed to the models for anomaly detection. To evaluate the usefulness of the selected features, we calculate the explained_variance_ratio of PCA. The 97% value of explained_variance_ratio shows that the selected features retain most of the variance of the original data.
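The role of the explained variance ratio can be made concrete with a small PCA sketch based on an eigendecomposition of the covariance matrix. This is not the authors' code: `pca_fit_transform` is a hypothetical helper, the input is assumed to be standardized already, and the synthetic data (10 features driven by 3 latent factors) only stands in for the 270-dimensional feature table.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project X (n samples x p features) onto its top principal components."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()            # explained variance ratio
    W = eigvecs[:, :n_components]              # p x m projection matrix
    return Xc @ W, ratio[:n_components]

rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 3))
X = latent @ rng.standard_normal((3, 10)) + 0.01 * rng.standard_normal((200, 10))
X_pca, ratio = pca_fit_transform(X, n_components=3)
retained = ratio.sum()                         # fraction of variance kept
```

Summing `ratio` over the retained components gives the figure that the text reports as 97% for 65 components.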

C. CLASSIFICATION MODELS AND PARAMETER SETTING
We employ several machine learning techniques (listed in Section V) to detect anomalies and classify unusual occurrences, as well as to assess the efficacy of our suggested approach.
X(m, ω) is essentially the Fourier transform of x[n]w[n − m], a complex-valued function describing the signal's phase and amplitude over time and frequency, where w(n) is the window function, commonly a Hann window, and x(n) is the signal to be transformed (note the difference between the window function w and the frequency ω). In X(m, ω), m is time and is discrete, whereas ω is the frequency and is continuous. In the last step, the squared magnitude of STFT{x[n]}(m, ω) gives the spectrogram, the visual representation of the signal strength, or "loudness," of a signal over time at various frequencies. In our case, we have frequencies from 0 to 8 kHz. After the power spectrogram is computed, filter bank processing is carried out on the power spectrum using the mel scale to obtain the filterbank outputs $S_k$ and extract the valuable features. Eventually, the 270 MFCCs are calculated, where k is the number of mel cepstrum coefficients, $S_k$ is the output of the filterbank, and $C_n$ are the final MFCC coefficients. The feature vector is formed in the next step by calculating the mean and standard deviation of the values determined for each frame and storing them in the data frame (df). To select convenient features, the first step is to standardize the data.
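The pipeline from power spectrum to MFCCs can be sketched as below. The sketch assumes the common textbook form $C_n = \sum_{k=1}^{K} \log(S_k) \cos\left[n\,(k - \tfrac{1}{2})\,\pi / K\right]$ with K mel filters, which may differ in detail from the equation used in the paper. The mel conversion constants, the filter count, the coefficient count, and all function names are assumptions; random frames stand in for real audio.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_filters, len(fft_freqs)))
    for k in range(n_filters):
        left, center, right = hz_points[k:k + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[k] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mfcc_from_power(power_spec, sr, n_filters=40, n_coeffs=13):
    """power_spec: (n_frames, n_fft // 2 + 1) -> MFCCs: (n_frames, n_coeffs)."""
    n_fft = 2 * (power_spec.shape[1] - 1)
    S = power_spec @ mel_filterbank(n_filters, n_fft, sr).T   # filterbank outputs S_k
    log_S = np.log(S + 1e-10)                                 # log mel powers
    k = np.arange(n_filters)
    n = np.arange(n_coeffs)[:, None]
    dct = np.cos(np.pi * n * (k + 0.5) / n_filters)           # DCT-II basis
    return log_S @ dct.T                                      # coefficients C_n

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 512)) * np.hanning(512)     # stand-in frames
power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
coeffs = mfcc_from_power(power, sr=16000)
# Per-file feature vector: mean and standard deviation over frames.
features = np.concatenate([coeffs.mean(axis=0), coeffs.std(axis=0)])
```

Summarizing each coefficient by its mean and standard deviation over frames turns a variable-length recording into a fixed-length vector, which is what allows the files to be stored as rows of a data frame.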
A standard scaler is used to transform the numeric data so that the mean is removed and each feature is scaled to unit variance. The second step is to calculate the covariance matrix of the standardized df, whose dimension is n × p. The third step is to calculate the eigenvalues and eigenvectors; the matrix of retained eigenvectors has dimension p × m. Finally, X_PCA represents the 65 suitable features used for experimentation. Once the features are chosen, the resulting feature vectors are fed into the machine learning models. The model learns the anomaly sequence patterns. Properly chosen features speed up learning and help achieve greater accuracy in the anomaly detection process. The last phase is model prediction, which is used to discover abnormalities in new data.
{Model Training}
for each i in MLModels do
    acc ← getClassification(i)
end for

V. EXPERIMENTAL ANALYSIS AND RESULTS
This study proposes a generic system, built on an extended dataset, for anomaly detection and classification of rare events in real-life audio. We conduct an experimental evaluation of the suggested technique using a typical audio surveillance application in which seven types of audio events must be detected: baby cry, gunshot, glass breaking, footsteps, police siren, explosion, and scream. Six machine learning models are used for the experiments: Random Forest, KNN, XGB, MLP, SVM, and logistic regression. The experiments were run on Google Colab with the Windows 10 Professional operating system, an Intel Xeon CPU, and a Tesla K80 GPU. Following the experimentation, the findings are compared to the other existing state-of-the-art approaches, as depicted in Figure 6. Accuracy, precision, recall, and F-score are the performance evaluation metrics. To conduct the experimental assessment, the dataset is divided into two disjoint groups, training and testing, which account for 80% and 20% of the total number of audios in the newly created dataset, respectively.
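A disjoint 80/20 partition of this kind can be sketched as follows. The helper `train_test_split_indices` and the fixed seed are hypothetical, and the total of 8,922 files is used only for illustration; the paper does not say whether the split is stratified by event type or by environment, so none is applied here.

```python
import numpy as np

def train_test_split_indices(n_items, test_fraction=0.2, seed=0):
    """Return disjoint (train, test) index arrays covering range(n_items)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)               # shuffle file indices
    n_test = int(round(n_items * test_fraction))
    return order[n_test:], order[:n_test]

train_idx, test_idx = train_test_split_indices(8922)
```

Fixing the seed keeps the partition reproducible, so every classifier is trained and evaluated on exactly the same files.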

A. BASELINE CAE AND WAVENET MODEL
We employ a convolutional autoencoder (CAE) and the WaveNet model as baselines and compare the results of our machine learning models with the results of these two baseline models. The CAE model has 20 layers: ten encoder layers and ten decoder layers. The WaveNet model also comprises 20 layers, arranged as two stacks of 10 convolutions. The baseline study used the DCASE Challenge Task 2 dataset published in 2017 [33]. This dataset includes three unusual events (baby cry, glass break, and gunshot) and background audio from 15 different environmental situations (e.g., bus, forest path, and home). The CAE and WaveNet models are trained on the training sets and evaluated on the test sets. The Area Under the ROC Curve (AUC) was calculated during the testing phase to demonstrate the models' capacity to identify abnormalities. TABLE 3 shows the performance of both models across the 15 datasets, with a tie in the home and office settings.

B. RESULTS
The proposed system's performance is assessed using a large dataset of audio samples that includes seven distinct rare events (anomalies) artificially blended with fifteen diverse background audios. The suggested technique is used for anomaly identification and categorization of unusual occurrences in real-world audio. Five evaluation metrics are used: accuracy, precision, recall, F1-score, and, finally, the ROC curve, which is plotted to evaluate each model's ability on the given dataset. Tables 4 and 5 show the experimental outcomes of the machine learning models. Table 4 displays the machine learning models' accuracy, precision, recall, and F1-score, whereas Table 5 displays the ROC curve score of the machine learning models across the 15 datasets. On the proposed dataset, the MLP model performed well on average and obtained the highest accuracy score, 99.08%, on the cafe environment dataset.
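For reference, the metrics named above can be computed as in the following sketch for the binary case (anomaly versus normal). The function names and the toy labels and scores are hypothetical; the ROC score is computed with the rank-based (Mann-Whitney) formulation of the area under the curve.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()    # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

m = binary_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 1 means every anomalous clip receives a higher score than every normal clip, while 0.5 corresponds to chance-level ranking.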

C. COMPARATIVE ANALYSIS WITH BASELINE APPROACHES
This study compares the findings of the suggested technique to another state-of-the-art study [17], whose experimental conditions are similar to the settings used in this study. The performance of both the baseline approach and the proposed approach across the 15 datasets is shown in FIGURE 6. ROC curves, which provide an overall evaluation of classification performance, are used to evaluate both approaches (baseline and proposed). We find that the suggested method consistently outperforms the baseline CAE and WaveNet models in virtually all datasets, with corresponding scores closer to 1. A ROC curve score of 1 corresponds to perfect classification, so the greater the value of this metric, the better the overall performance of the system. An audio surveillance system must identify events even when they are blended with various background audios of varying energy levels. As a result, we suggested a general approach for anomaly identification and categorization of rare events in real-life audio by expanding the dataset and employing machine learning. The baseline technique trains a system on a set of training samples only to decide whether an anomaly exists or not, whereas we concentrated on both detecting anomalous audio and categorizing rare sound events. Because the original dataset comprises very loud ambient noises, events may be harder to identify in various situations, making event detection and categorization difficult. In the baseline approach, the performance of the CAE and WaveNet models differs dramatically across diverse acoustic settings, demonstrating that varied acoustic environments may significantly alter the models' capacity to detect anomalies. TABLE 6 compares our suggested technique to the best performing classifier in the referenced paper.
Across all datasets, we find that the suggested strategy outperforms the baseline approach by a substantial margin.

VI. DISCUSSION
This study presents anomaly detection and classification using machine learning algorithms. Terrorism poses serious and rising threats all across the world. Anomaly detection in audio is critical for detecting events connected to environmental burst and explosion-like sound occurrences (e.g., gunshots, screaming, explosions) that may be regarded as "odd" for the observed environment, and thus for making our living environment safer. For this purpose, a dataset is created using seven different anomalous sounds and 15 different background sounds. All the anomalous sounds are embedded in each of the different backgrounds to create a dataset for experimentation. Then, many features are extracted from the dataset, and a feature selection technique, Principal Component Analysis, is used to limit the number of features and keep only the useful ones. For anomaly detection, several machine learning methods are applied to the dataset. After thorough experimentation, we found that the MLP classifier consistently performed well for every anomaly event in each background scenario. Using PCA for feature selection and an MLP classifier for detection gives high detection accuracy that can be very useful when applied in real-world scenarios. As a result, our experimental results demonstrate that the suggested technique detects and classifies abnormalities more effectively than existing state-of-the-art studies.

VII. CONCLUSION
As the velocity of multimedia generation increases, there is a need for techniques to analyze that data. An anomaly is a deviation from normal or expected behavior. Detection of anomalies can serve as an essential tool for enhancing personal security and protecting public and private assets. A customized dataset has been created by mixing rare events with 15 background audios fetched from the TUT Acoustic Scenes 2016 dataset to detect anomalous audio and classify rare sound events. To detect anomalies in audio data, this study conducted experiments using multiple features of audio data. For feature engineering, this study extracts various features from the audio signal and then applies the PCA feature selection technique to select the minimum number of best-performing features. Several machine learning algorithms are employed on the selected feature set to detect seven different anomalous events in 15 different background environments. Experiments demonstrated that our technique outperformed existing state-of-the-art research for anomaly identification in audio data in all circumstances. In the future, we plan to expand our dataset to include a wider variety of anomalies and background scenes and to analyze the effectiveness of multiple machine learning algorithms on different types of anomalies.