A Survey on Artificial Intelligence-Based Acoustic Source Identification

The concept of Acoustic Source Identification (ASI), which refers to the process of identifying noise sources, has attracted increasing attention in recent years. ASI technology can be used for surveillance, monitoring, and maintenance applications in a wide range of sectors, such as defence, manufacturing, healthcare, and agriculture. Acoustic signature analysis and pattern recognition remain the core technologies for noise source identification. Manual identification of acoustic signatures, however, has become increasingly challenging as dataset sizes grow. As a result, the use of Artificial Intelligence (AI) techniques for identifying noise sources has become increasingly relevant and useful. In this paper, we provide a comprehensive review of AI-based acoustic source identification techniques. We analyze the strengths and weaknesses of AI-based ASI processes and associated methods proposed by researchers in the literature. Additionally, we present a detailed survey of ASI applications in machinery, underwater applications, environment/event source recognition, healthcare, and other fields. We also highlight relevant research directions.


I. INTRODUCTION
Acoustic data carry valuable insights for scientific and engineering research communities across different sectors, including human speech recognition [1], ocean exploration and localization [2], animal and bird localization [3] and underwater geographical imaging [4]. Acoustic data analysis is a complex process that encounters a number of challenges, including inaccurate data, inadequate measurements, noise/reverberation, and large amounts of data. For instance, multiple arrivals of an acoustic signal can result in poor source localization. Utterances and background noises in sound recordings make it difficult for machines to interpret an acoustic signal [5], [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Mostafa M. Fouda.
In recent years, acoustic processing has been advanced and broadened through the application of AI principles. With the progress of AI, the capabilities of pattern recognition have increased tremendously in image processing, computer vision applications and speech processing. AI in acoustics has progressed significantly in the past few years. Advanced acoustic processing techniques can leverage the strengths of AI methods to achieve better performance in recognition, identification and localization than conventional audio processing methods.
AI-based ASI can be tailored to meet the needs of a diverse range of applications. For example, AI-based ASI plays a vital role in the industrial sector through continuous condition monitoring [7]. It can be used to detect and identify faults in different components of machines, thereby improving their safety, efficiency and reliability.
The undersea domain is becoming more contested day by day and therefore demands constant surveillance and monitoring operations. The acoustic signature radiated by marine vessels carries unique information that can be utilized to detect, identify and recognize a marine vessel. In addition, AI-based ASI can be employed in target detection and recognition, which aids the navy in the crucial investigation of both deep and shallow underwater environments [8]. The surveillance of human activities in modern environments is predominantly carried out in urbanized areas for the safety and security of the general public. Environmental sound recognition (ESR) systems are incorporating AI-based ASI to identify sound sources that exist in our everyday environment [9]. Additionally, ASI can be tailored to meet the requirements of a wide range of healthcare applications, such as cardiac auscultation [10], fall detection [11] and wearable devices for the hearing impaired [12]. This technology is not limited to the above-mentioned applications; it is also useful for music genre classification [13], animal and bird species identification [14], [15], robotics [16], drone detection [17], and insect identification [18].
Owing to the growing significance of ASI, a number of research works have recently been published in the literature across different fields of application. We have summarized some of the latest review works in ASI, and they are outlined in Table 1.
The authors in [19] surveyed environmental sound identification (ESI) and sound event recognition for surveillance applications, comparing features from various domains that are suitable for sound event and scene identification systems. Moreover, AI model-based approaches were also compared using different available data sets for ESI and event detection. Lei et al. [20] presented a detailed review of intelligent fault diagnosis using the sound of machines in industrial settings and AI techniques. In addition, they discuss a detailed road map for future researchers to enhance the quality and outcome of AI models for intelligent fault diagnosis. AlShorman et al. [25] also published a study on fault diagnosis in motor components using radiated acoustic patterns and AI methods. The authors in [21] surveyed different machine learning (ML) techniques used in the past few years and elucidated a deep learning (DL) framework for underwater target recognition. The information from underwater images is used to classify targets using DL. Similarly, Chen et al. [23] reviewed underwater target recognition based on DL methods and discussed problems in feature extraction (FE) and in the quality of captured underwater images.
In addition, the authors in [22] systematically reviewed the most commonly used DL methods for the classification of heart sounds for cardiac auscultation. In that paper, two DL methods, the convolutional neural network (CNN) and the recurrent neural network (RNN), are emphasized, covering the past five years. Nunes [24] published a detailed review on anomaly detection in objects based on their acoustic signatures. In that review, ML techniques from 2010 to 2020 are studied and analyzed for anomaly detection. Most recently, Bansal and Garg [26] focused on ESI and classification using various traditional ML classifiers and deep neural networks (DNNs). They also explicated various pre-processing and FE schemes for ESI. Lastly, the authors in [27] reviewed the integration of internet of things (IoT) and ML approaches for smart environments, outlining acoustic sensing using IoT and ML algorithms.
To the best of our knowledge, there is no other survey published so far that presents such a detailed overview of AI-based ASI. We have compiled a detailed overview of AI-based ASI along with its various applications to provide future researchers with a holistic understanding of this concept. Our contributions through this survey are as follows:
• We surveyed and compared recent reviews and surveys on AI-based ASI for various applications, including surveillance, healthcare, smart cities, underwater detection and machinery fault detection.
• We present a detailed overview of the AI-based ASI process. We discuss data acquisition and traditional audio processing methodologies along with well-known databases used in various fields by researchers. Moreover, we compile and provide a detailed discussion on the significance and methodology of traditional audio processing techniques, FE methods, and ML and DL algorithms that have been mostly used in the literature, to aid readers in forming an effective model for a given problem.
• We provide a detailed overview of AI-based ASI in diagnosing faults in industrial machinery, underwater applications, event source detection (ESD) and ESI, healthcare, music-genre classification and wildlife monitoring applications. In our review, we provide a thorough comparison and analysis based on the limitations and performance metrics of previous works.
• We discuss possible future research directions in the light of this survey, together with some generalized methodological recommendations for future researchers to extend the pathways in this area and overcome problems in AI-based ASI.
We organized the remainder of the paper as follows: In Section II, the methodology of data acquisition and audio data pre-processing is discussed. We highlight some well-known databases that have contributed to AI-based ASI. Next, we compile various popularly used FE techniques, AI algorithms and evaluation metrics. In Section III, we survey ASI in industrial fault detection, underwater applications, ESI, healthcare, music-genre classification, wildlife monitoring and forensics. Future research directions are discussed in Section IV, followed by the conclusion of this paper in Section V.

II. ACOUSTIC SOURCE IDENTIFICATION OVERVIEW
AI-based acoustic source identification is a systematic process of recognizing an unknown source using the sound that it generates. It includes five basic stages: data acquisition, data pre-processing, FE, feature selection, and identification or classification using AI algorithms. Generally, the model performs better when all of these steps are followed. These stages are illustrated in Figure 1. In this section, we explain all the steps of ASI in detail and highlight methods that have frequently been used in the literature.

A. DATA ACQUISITION
In this subsection, we detail the methodology of data acquisition for the ASI process. Data acquisition can be defined as the process of collecting and gathering relevant information to drive the aims and objectives of an AI-based problem. Data collection is the fundamental step of the AI-ASI process. The methodology of data collection can impact the performance of an AI algorithm; therefore, it can alter the decision for a given problem.
There are two sources of data for further processing: synthetic data and real data. Supervised learning algorithms are limited by the scarcity of labeled data. There can be cases where a sufficient amount of data is unobtainable to characterize a particular problem, for example, underwater environments, hyper-diverse rain forests, etc. Therefore, an investigator can simulate a large amount of data to achieve an efficient recognition system. To generate realistic data, the investigator has to take natural noise into account, and reverberation must be added to the generated sound. To limit the massive use of simulated data, data augmentation techniques [49] are a promising solution. Data augmentation generates additional training examples without more recordings, often leading to improved performance. Real acoustic data collection can be conducted via microphones and hydrophones. After collecting data from an experimental procedure, raw data needs to be initially labeled to avoid mixing. Real data can also be enlarged using data augmentation techniques if the gathered samples are insufficient.
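As an illustration of making simulated data more realistic, the following minimal Python sketch adds white Gaussian noise to a clean signal at a requested signal-to-noise ratio (SNR). The function names `add_noise_at_snr` and `measure_snr_db`, the choice of white Gaussian noise, and the 440 Hz test tone are assumptions made for this example only; reverberation, which would normally be simulated by convolving with a room impulse response, is not covered here.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return the signal plus white Gaussian noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the scale so that signal_power / scaled_noise_power == 10^(snr_db / 10).
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measure_snr_db(clean, noisy):
    """Measure the SNR in dB of a noisy signal against its clean reference."""
    signal_energy = sum(x * x for x in clean)
    noise_energy = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10 * math.log10(signal_energy / noise_energy)


# Hypothetical example: a 440 Hz tone sampled at 16 kHz, corrupted at 10 dB SNR.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noisy = add_noise_at_snr(tone, snr_db=10.0)
```

Fixing the random seed keeps the augmented examples reproducible, which may be convenient when comparing models trained on the same synthetic set.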
In AI-based ASI, public evaluation and benchmark datasets help the research community to investigate the performance of various proposed systems. We have categorized some of the popular data sets with respect to their applications. In addition, each data set's properties and web links are mentioned in Table 2.

B. DATA PRE-PROCESSING
Raw acoustic data is mostly not suitable for FE and needs to be preprocessed. Raw data in its original state is also not suitable for the AI algorithm, as it can compromise the performance of the model. Various factors can affect the suitability of data for learning algorithms, such as excess data, insufficient data and tampered data. When the volume of data is very large, some of it might be corrupted. In contrast, insufficient data lacks the necessary attributes of the dataset. In both cases, the predictive ability of the model is weakened, resulting in poor accuracy. For example, a decision-tree algorithm splits the data set according to attribute values, so missing information may lead to an inaccurate decision [50]. Moreover, there are other important data pre-processing steps required to ensure the data is prepared for the next stage, which are as follows:

1) DATA AUGMENTATION AND INTEGRATION
Data integration is defined as combining various heterogeneous data into unified data. This involves two techniques known as tight coupling and loose coupling [51]. In contrast, data augmentation is the technique of adding data by synthesizing new data from available data. Data augmentation can be carried out for time and frequency domain features. Recently, SpecAugment has become popular in audio processing as an effective data augmentation technique, especially for spectrograms [52].
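To make spectrogram augmentation concrete, the sketch below applies one frequency mask and one time mask to a spectrogram stored as a list of rows (frequency bins) by columns (time frames), in the spirit of the masking operations used by SpecAugment. It is a simplified illustration under stated assumptions, not a reproduction of the method in [52]: the function name `mask_spectrogram`, the zero fill value, and the single mask per axis are choices made here, and the time-warping step of SpecAugment is omitted.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, seed=0):
    """Return a copy of spec with one frequency band and one time band set to zero.

    spec is a list of rows, one row per frequency bin, one column per time frame.
    The maximum widths must not exceed the corresponding spectrogram dimensions.
    """
    rng = random.Random(seed)
    n_freq, n_time = len(spec), len(spec[0])
    masked = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: zero out f_width consecutive frequency bins.
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        masked[f] = [0.0] * n_time

    # Time mask: zero out t_width consecutive time frames in every bin.
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, n_time - t_width)
    for row in masked:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return masked


# Hypothetical 8-bin by 10-frame spectrogram filled with ones.
spectrogram = [[1.0] * 10 for _ in range(8)]
augmented = mask_spectrogram(spectrogram, max_freq_width=3, max_time_width=4)
```

Calling the function with different seeds yields different masked copies of the same recording, which is how a single labeled example can be turned into several training examples.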

2) DATA CLEANING
Data cleaning is done to enhance the quality of the signal. The process involves the identification of inaccurate or irrelevant data and the elimination or replacement of such unwanted data in a data set. Errors can arise from incorrect naming, missing entries, or human negligence during data gathering. Denoising of data is also a part of the data cleaning process. Noisy components can be removed from audio samples by filtering [53]. Noise problems can also be countered by signal enhancement techniques [54]. Similarly, silence can be easily detected in audio samples and removed using amplitude-based silence detection algorithms [55].
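A minimal sketch of amplitude-based silence removal is given below. It assumes a simple rule, chosen for illustration rather than taken from [55]: the signal is cut into fixed-length frames, and a frame is discarded when its peak absolute amplitude falls below a threshold. The function name `remove_silence`, the frame length, and the threshold value are all hypothetical.

```python
def remove_silence(samples, frame_len, threshold):
    """Keep only the frames whose peak absolute amplitude reaches the threshold."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(x) for x in frame) >= threshold:
            kept.extend(frame)
    return kept


# Hypothetical recording: near-silent frame, active frame, near-silent frame.
audio = [0.0, 0.01, -0.02, 0.0,
         0.5, -0.6, 0.4, -0.3,
         0.0, 0.0, 0.01, 0.0]
trimmed = remove_silence(audio, frame_len=4, threshold=0.1)
print(trimmed)  # [0.5, -0.6, 0.4, -0.3]
```

In practice the threshold would typically be set relative to the noise floor of the recording rather than as a fixed constant.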

3) DATA TRANSFORMATION
Data transformation is required when acoustic data attributes need to be brought to the same scale. Multiple features present in a data set that are mapped to different scales need to be rescaled to a standard range. Normalization methods such as min-max normalization, z-score normalization and decimal scaling can be used to bring all features to the same scale.
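The three normalization schemes named above can be sketched as follows. The function names and the example feature values are assumptions for illustration; the sketch also assumes that the feature is not constant, so that the range and the standard deviation are non-zero.

```python
import math


def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


# Hypothetical feature column, e.g. spectral centroids in Hz.
feature = [120.0, 250.0, 310.0, 980.0]
scaled_min_max = min_max(feature)
scaled_z = z_score(feature)
scaled_decimal = decimal_scaling(feature)
```

For this example the decimal-scaling exponent is 3, so the values become 0.12, 0.25, 0.31 and 0.98 up to floating-point rounding.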

4) DATA LABELING
Pre-processing of the dataset also involves annotation/labeling of data after denoising and transformation. Usually, this is carried out by expert acousticians who are familiar with the targeted sounds and able to track sounds in an audio file. This involves identifying the targeted sound in an audio file and assigning it a label, also known as a class. These labels are used to train an AI algorithm. Annotation can also be done by visually inspecting spectrograms of audio files.

C. FEATURE EXTRACTION AND SELECTION
AI models require discriminatory and distinct features to learn information about any particular sound. Therefore, FE is defined as the process of extracting meaningful information from raw data by removing most of the redundant data. The extent of training decides the performance of an AI algorithm. The effectiveness of these features results in accurate predictions from an algorithm. Therefore, FE and selection is the method of finding the features that possess most of the information of a particular data set. In this subsection, we discuss the popularly used audio FE methods as illustrated in Figure 2.

1) TIME DOMAIN FEATURES
• Zero Crossing Rate (ZCR) is defined as the rate at which an audio signal changes from positive to negative or from negative to positive, crossing the zero level in between. In simple terms, it is the number of times the signal crosses the zero level in a one-second period. The ZCR for the $k$th frame is represented mathematically as:
$$Z_k = \frac{1}{2M} \sum_{m=1}^{M-1} \left| \mathrm{sgn}\big(x_k(m)\big) - \mathrm{sgn}\big(x_k(m-1)\big) \right|$$
where $M$ is the length of the frame, $x_k(m)$ is the $m$th sample of the $k$th frame, and sgn(.) is the sign function, that is,
$$\mathrm{sgn}(x) = \begin{cases} 1, & x \geq 0 \\ -1, & x < 0 \end{cases}$$
ZCR estimates the fundamental frequency and has proven efficient for voice-based systems [56]. ZCR conveys important information about the voiced and silent frames of a voice signal. Due to its ability to give discriminating frequency information, this feature can be designed as a classifier [57].
• ADSR envelope detection: ADSR stands for Attack, Decay, Sustain and Release. This FE method is mostly used in music-genre classification and is not applicable to real-time sounds due to the absence of a decay envelope. Additionally, it does not work with environmental sounds since they lack a sustain temporal envelope. The envelope of such sounds is therefore known as an AR envelope, which is mostly used in timbre analysis of musical instruments [58].
• Log attack time: As its name implies, this is the logarithm (base 10) of the time interval between the start of the signal and the point at which it reaches its stable stage. If $T_0$ is the starting time of the signal and $T_1$ is the time at which it reaches its maximum, then the log attack time is computed as follows:
$$\mathrm{LAT} = \log_{10}(T_1 - T_0)$$
Among its applications are the detection of musical onsets [59] and the detection of environmental and event sounds [60].
• [...] and voice signals [62]. As compared to voiced frames, it is relatively low in unvoiced frames. A number of applications can be found in audio analysis, including environmental sound and event detection [63], music systems [59], and acoustic monitoring systems [9].
• Auto-correlation is the extent of similarity of a signal with its delayed version. This measure ranges between −1 and +1. The maximum relation is given by +1, the minimum relation is given by −1, and the absence of any relation is represented by 0. For example, the correlation at some value of lag is less than 1 but greater than 0, depending on the extent of similarity [64]. The correlation at zero lag is always 1, since the signal is compared with an undelayed copy of itself. Music analysts use auto-correlation as a method of analyzing beats, tempo, and pitch.
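Three of the time-domain features described in this list can be sketched in a few lines of Python. The sketch rests on assumptions that are not confirmed by the cited works: ZCR is computed as the sum of absolute sign differences between consecutive samples divided by twice the frame length, log attack time as the base-10 logarithm of the attack duration, and auto-correlation is normalized by the signal energy so that it equals 1 at zero lag. All function names are hypothetical.

```python
import math


def sign(x):
    """Sign function with sgn(0) taken as +1."""
    return 1 if x >= 0 else -1


def zero_crossing_rate(frame):
    """Sum of |sgn(x[m]) - sgn(x[m-1])| over the frame, divided by 2M."""
    M = len(frame)
    changes = sum(abs(sign(frame[m]) - sign(frame[m - 1])) for m in range(1, M))
    return changes / (2 * M)


def log_attack_time(t_start, t_max):
    """Base-10 logarithm of the attack duration in seconds."""
    return math.log10(t_max - t_start)


def autocorrelation(signal, lag):
    """Energy-normalized autocorrelation; equals 1.0 at lag 0."""
    energy = sum(x * x for x in signal)
    overlap = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
    return overlap / energy


# Hypothetical frame that alternates in sign at every sample.
frame = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(zero_crossing_rate(frame))   # 0.875
print(autocorrelation(frame, 0))   # 1.0
print(autocorrelation(frame, 1))   # -0.875
print(autocorrelation(frame, 2))   # 0.75
```

The alternating frame shows the expected behavior: a high ZCR, a strongly negative correlation at a lag of one sample, and a positive correlation at a lag of two samples, which corresponds to the period of the signal.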

2) FREQUENCY DOMAIN FEATURES
• Peak frequency: Peak frequency conveys information about the most dominant frequency and the fundamental frequency of the signal. It is defined as the frequency of maximum power. In music and speech classification, peak frequency information is used since vocal sounds have pure tones (sine waves). Peak frequency provides the best estimate of the pitch in this case.
• MSAF-Multiexpanded stands for the method of selection of amplitudes of frequency, multi-expanded filter. These features are mostly used in fault diagnosis of electrical drilling motors [65] and commutator motors [66]. These acoustic features are handcrafted and generated by computing the difference between Fast Fourier Transform (FFT) spectra of different classes. The absolute values of the differences form a feature vector that is used to construct classes.
• SMOFS-Multicrafted stands for the shortened method of frequency selection. It is similar to MSAF-Multiexpanded and is applied in the industrial sector, where it is also used to classify faults in motors [65]. The only difference between the SMOFS-Multicrafted and MSAF-Multiexpanded FE methods is the selection of frequency components after the FFT computation.
• Short-time Fourier transform (STFT) is the time-frequency transform of a signal, represented as a time-frequency distribution (TFD). In the time-frequency analysis of an audio signal, time is on one axis and frequency is on the other. Changes in the amplitude of the signal over time can be observed along with the magnitude of the frequency content of the signal. With the use of the STFT, a time-frequency analysis can be performed on audio signals with abrupt discontinuities and patterns, which makes it a promising method for non-stationary signals. There are different types of TFD techniques depending on the requirement, such as linear [67], quadratic [68], positive [69] and matching pursuit TFDs [70]. TFDs are used in audio processing for the detection of industrial gear faults [71], seismic data processing [72] and environmental sound source recognition [67].
• Chroma and tonality based: Chroma-based features represent an audio signal, for example music audio, in the form of 12 chroma segments mapped from the spectrum. A logarithmic STFT is used to compute these bins/segments. This mapped representation is called a chromagram. As the statistics of the chroma energy distribution also carry information about the audio, they are an important means of obtaining chroma-based features. Tonality-based features depend on the fundamental frequency of the harmonic audio signal. Tonality-based FE is only applicable to stationary periodic audio signals. The fundamental frequency is the lowest frequency of a periodic signal. For example, the pitch of music audio gives an estimate of the fundamental frequency. Tonal features find their applications in music onset detection [59], environmental sound source detection [73], and audio retrieval systems [74].
• Long-term Average Spectrum (LTAS): LTAS is an FFT-based feature that captures unusual spectral information in an audio signal. Due to its ability to capture the spectrum of both the glottal source and the vocal tract, it is widely used in pathological speech analysis [75]. LTAS acquires spectral information from every octave of a filtered speech signal. The spectral information comprises certain parameters which are combined to form a 99-dimensional feature vector. These parameters are root mean square (RMS) values, normalized mean and standard deviation (SD) of segment RMS, segment SD normalized by full-band and band RMS, skewness, kurtosis, range of segment RMS and variation in RMS energy in ensuing segments.
• Envelope Modulation Spectrum (EMS): This FE method uses the amplitude-modulated audio signal. EMS is a representation of the energy distribution in amplitude variations across different frequencies. In the first step, an 8th-order Butterworth filter is used to generate octave bins centered at certain frequencies from the audio signal. Following this, the Hilbert transform is used to extract the envelope of the original signal and of each filtered octave bin. The power spectrum is then estimated by taking the Discrete Fourier Transform (DFT) of the envelope. A 60-dimensional feature vector is then constructed containing 6 features derived from the power spectrum, including peak frequency, peak amplitude, spectrum energy (0-4 Hz and 4-10 Hz), and energy ratio. EMS features can be utilized to solve classification problems in pathological and control speech [76], [77].
• Spectrum-shape based: Among spectrum-based features, the spectral centroid is commonly used to describe the position of a spectrum's center of mass. It is computed by treating the normalized amplitudes as a probability distribution over the frequencies of the spectrum. The spectral centroid is a parameter that describes the brightness of an acoustic signal. Additionally, it conveys information about musical timbre [78], which is why it is employed in music-mood classification [79] scenarios. The spectral center is another type of spectrum-based feature that relies on the median frequency of the signal spectrum. Due to its energy balancing attribute, this feature is used for rhythm tracking in the music field [80]. The spectral roll-off feature is defined as the frequency under which 95 percent of the energy remains. Audio surveillance systems [9], music-genre classification [47] and speech-music classification [61] use this feature for discrimination. There are other spectrum-based features that capture different characteristics of the spectrum. Spectral spread, for example, categorizes sounds according to their spectral bandwidth, while spectral skewness and spectral kurtosis indicate the symmetry and flatness of the spectrum, respectively.
• Auto regression-based features commonly include linear prediction coding coefficients (LPCCs) synthesized using linear prediction analysis of a signal. This approach eliminates the problem of redundancy by estimating new values based on the previous coefficients. The linear prediction model generates a compressed spectral envelope of digital speech; therefore, it is commonly used in audio segmentation and retrieval applications. Additionally, there is a modified version of LPCCs known as Code Excited Linear Prediction (CELP), which models the human vocal tract using a linear prediction model. In linear prediction models, excitation signals are fed into adaptive or fixed code-book entries. Afterwards, the model performs the search in a closed loop in the perceptually weighted domain. Due to its promising ability to code speech, this delivers better quality than low bit-rate algorithms. Therefore, such features are used in ESI applications [81].
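The spectrum-shape descriptors above can be illustrated with a short sketch that computes the peak frequency, spectral centroid and 95 percent spectral roll-off from a power spectrum. A naive DFT is used so that the example has no external dependencies; a real system would use an FFT routine. The function names, the power weighting of the centroid (amplitude weighting is also common), and the 8 Hz test tone are assumptions made for this example.

```python
import cmath
import math


def power_spectrum(signal):
    """Naive O(N^2) DFT; returns the power of bins 0..N/2."""
    N = len(signal)
    return [
        abs(sum(signal[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))) ** 2
        for k in range(N // 2 + 1)
    ]


def peak_frequency(power, sample_rate, n_samples):
    """Frequency (Hz) of the bin with maximum power."""
    k = max(range(len(power)), key=lambda i: power[i])
    return k * sample_rate / n_samples


def spectral_centroid(power, sample_rate, n_samples):
    """Power-weighted mean frequency, i.e. the spectrum's center of mass."""
    freqs = [k * sample_rate / n_samples for k in range(len(power))]
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)


def spectral_rolloff(power, sample_rate, n_samples, fraction=0.95):
    """Lowest frequency below which the given fraction of total power lies."""
    target = fraction * sum(power)
    running = 0.0
    for k, p in enumerate(power):
        running += p
        if running >= target:
            return k * sample_rate / n_samples
    return (len(power) - 1) * sample_rate / n_samples


# Hypothetical pure 8 Hz tone, 64 samples at a 64 Hz sampling rate.
fs, N = 64, 64
tone = [math.sin(2 * math.pi * 8 * n / fs) for n in range(N)]
spec = power_spectrum(tone)
peak = peak_frequency(spec, fs, N)
centroid = spectral_centroid(spec, fs, N)
rolloff = spectral_rolloff(spec, fs, N)
```

For a pure tone all three descriptors coincide at the tone frequency; for broadband sounds they diverge, which is what makes them useful for discrimination.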

3) CEPSTRAL DOMAIN FEATURES
Cepstrum represents the cepstral domain that is generated by taking the inverse Fourier transform of the log spectrum of a waveform. Cepstrum is categorized into three types depending on different audio applications. In speech processing, power cepstrum features are used, while real cepstrum features are used for pitch detection [82]. Analyzing cepstrum features is called cepstrum analysis or quefrency analysis.
Cepstral features have a number of benefits such as source-filter separation, orthogonality and conciseness. These attributes make them suitable for training ML algorithms.
In this subsection, we discuss various types of cepstrum features and their potential applications.
• Mel spectrogram: Mel spectrograms are widely used features for DL algorithms. They convey useful information about an acoustic signal, such as loudness or intensity over time at different frequencies. They are based on the mel scale, which is a logarithmic transformation of a signal's frequency. The behavior of the mel scale resembles human perception of sound at different frequencies. The relationship between the mel scale and frequency is shown mathematically as:
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
where $f$ is the frequency in Hz and $m$ is the corresponding mel value. If a signal is denoted by $x(n)$ and $k_a$ is the index of the mel scale filter, then the log mel spectrogram is denoted by $S_a(n_a, k_a)$, which can be computed as shown in Figure 3. Mel spectrograms have been used in a variety of applications such as speech-emotion recognition [83], healthcare [84], underwater target recognition [85], industrial fault diagnosis [86] and many others.
• Mel Frequency Cepstral Coefficients (MFCCs) are the most widely used cepstrum features for audio processing due to their ability to resemble the human auditory system. An audio frame is pre-emphasized and Hamming windowed. Subsequently, the time-domain signal is converted into the frequency domain using an $N$-point DFT. If $s(n)$ is an audio signal, then the energy spectrum in the frequency domain can be represented as:
$$S(k) = \left| \sum_{n=0}^{N-1} s(n)\, e^{-j 2 \pi n k / N} \right|^2, \quad 0 \leq k \leq N-1$$
Then the mel filter banks are applied to the frequency spectrum $S(k)$. The discrete cosine transform (DCT) is then taken of the log filter bank energies, and the MFCCs are obtained, which can be written as:
$$c_i = \sum_{m=1}^{M_f} \log(E_m) \cos\left[ \frac{\pi i}{M_f} \left( m - \frac{1}{2} \right) \right]$$
where $E_m$ is the energy of the $m$th filter bank and $M_f$ is the number of filter banks. MFCCs are prominently used in speech and speaker recognition systems [88], [89], vowel detection [90], music-genre classification and audio similarity analysis [91].
• [...] noise elimination [92], music-genre classification [93] and speech recognition systems [94].
• Perceptual linear prediction (PLP) cepstral coefficients are another form derived from linear prediction coefficients. The PLP coefficients represent critical band spectral resolution, the equal-loudness curve and the intensity-loudness power law [95]. To generate PLP coefficients, perceptual processing is performed; afterwards, autoregressive modeling is done before converting those coefficients into cepstral coefficients. PLP coefficients are useful in animal sound classification [96], emotion identification [97] and speech recognition systems [98].
• Greenwood function cepstral coefficients (GFCCs) build on mel-scale features and are regarded as a generalized form of MFCCs; they deliver fine vocal representations of animals and birds. This is why GFCC features are primarily used for terrestrial mammals. GFCCs are derived from the Greenwood equation, which closely maps the cochlear frequency-position relationship for all terrestrial animal and bird species. Their primary applications include animal and bird sound identification and classification [99].
• Gammatone cepstral coefficients (GTCCs) are among the most noise-robust features in automatic speech recognition systems. GTCCs are extracted in a similar way to MFCCs and are based on gammatone filter banks. These filter banks generate an output that is the time-frequency representation of an acoustic signal. Additional features can be derived from GTCCs by taking first- and second-order derivatives. They are employed in environmental sound recognition and automatic speech recognition (ASR) systems [99].
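Two building blocks of the cepstral features in this list can be sketched briefly: the conversion between frequency in Hz and the mel scale, and the cosine transform that turns log filter-bank energies into cepstral coefficients. The sketch assumes the widely used mel mapping with constants 2595 and 700 and an unnormalized DCT-II; other variants exist, and the function names are hypothetical. The filter-bank design itself is omitted.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (assumed constants: 2595 and 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def cepstral_coefficients(log_energies, n_coeffs):
    """Unnormalized DCT-II of log filter-bank energies."""
    M = len(log_energies)
    return [
        sum(e * math.cos(math.pi * i * (m + 0.5) / M) for m, e in enumerate(log_energies))
        for i in range(n_coeffs)
    ]


# By construction of the scale, 1000 Hz maps to roughly 1000 mel.
mel_1k = hz_to_mel(1000.0)

# Hypothetical flat log-energy vector: only the zeroth coefficient is non-zero.
coeffs = cepstral_coefficients([1.0, 1.0, 1.0, 1.0], n_coeffs=3)
```

The flat-input case illustrates why the cosine transform yields compact features: a spectrum without shape puts all of its energy into the zeroth coefficient, while higher coefficients describe progressively finer spectral detail.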

4) IMAGE BASED FEATURES
An image of an object contains patterns and points that help in the identification of that particular image. AI-based algorithms distinguish those objects with the help of those patterns. DL algorithms mostly use image-based features as inputs for recognition, identification and classification. In this subsection, we discuss popularly used image-based features.
• Local binary patterns (LBPs): In audio processing, local binary patterns are used as visual descriptors and can be extracted from the spectrograms of audio signals. These patterns carry information on grayscale contrast and local spatial descriptors. The feature vectors extracted from the spectrogram are used by ML and DL algorithms for textural analysis. LBPs are powerful features used in computer vision applications. They have proven useful in audio scene detection [100], analysis of psychological diseases from speech [101] and emotion detection applications [102]. LTPs carry significance in audio scene detection and classification [103] and healthcare analysis [104].
• Histogram of gradients (HOG) descriptor: HOG is another feature descriptor that conveys information about the structure and shape of an object. This descriptor measures the magnitude and angle of the gradient and generates histograms. Similar to the other image-based features discussed above, these features also extract information in the time-frequency domain. These features have been used in emotion detection [102], audio scene classification [105] and snore sound classification [106].
• Scale-invariant feature transform (SIFT) descriptor: SIFT is an image-based FE method used to generate local features for small and large objects. Its processing is efficient and close to real-time. Another benefit of SIFT features is their extensibility to a wide range of other feature types. SIFT features are used in computer vision applications, emotion detection [102] and audio/video concept classification [107].
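As an illustration of how a texture descriptor can be computed from a spectrogram treated as a grayscale image, the sketch below implements the basic 3x3 local binary pattern: each of the eight neighbors is compared with the center value, the comparison results form an 8-bit code, and the codes are collected in a 256-bin histogram. The neighbor ordering (clockwise from the top-left) and the function names are assumptions; published LBP variants differ in these details.

```python
def lbp_code(patch):
    """8-bit LBP code of the center pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left corner.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist


# Hypothetical 3x3 spectrogram patch; neighbors 9, 7 and 8 exceed the center 6.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
print(lbp_code(patch))  # 26
```

The histogram, rather than the individual codes, is what would typically be passed to a classifier as the feature vector.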

5) DISCRETE WAVELET FEATURES
An audio signal can be converted into a time-frequency representation using a wavelet transform. This is essentially the inner product of the audio signal with scaled and shifted versions of a wavelet. The wavelet transform works in two ways: continuous and discrete. The discrete wavelet transform (DWT) is more efficient due to its frequency filter bank and can extract information from non-stationary signals such as audio signals. DWT delivers uniformity in time-frequency resolution. The coefficients generated by the DWT are wavelet features, which can also be extracted from wavelet packet decomposition. Discrete wavelet features are widely used in audio analysis [108], music classification [109], motor fault detection [110] and emotion recognition [111].
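The filter-bank view of the DWT can be illustrated with a single decomposition level of the Haar wavelet, the simplest wavelet, chosen here only for brevity. Pairs of samples are combined into approximation (low-pass) and detail (high-pass) coefficients, and the inverse step recovers the original signal. The function names and the example signal are hypothetical; the sketch assumes an even-length input.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT; the signal length must be even."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert one level of the Haar DWT."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


# Hypothetical short signal.
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_dwt(x)
reconstructed = haar_idwt(approx, detail)
```

Applying `haar_dwt` again to the approximation coefficients gives the next, coarser level, which is how a multi-level decomposition is built; the resulting coefficients, or statistics of them, serve as wavelet features.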

6) OTHER SPECIAL FEATURES
Researchers have combined a number of approaches to improve the extraction of discriminatory features and identification accuracy. The combination of DWT and MFCCs is usually done by concatenating the MFCC and DWT features. A combination of the DWT and MFCC methods performs relatively better in noisy scenarios than either technique alone. For example, the authors in [112] and [113] used the hybrid method of MFCC and DWT in speaker recognition and speaker verification and achieved higher accuracy in different noisy cases. Similarly, Hidayat et al. [114] implemented the same combinational approach in a text-dependent speaker recognition system and achieved 96.67% overall recognition accuracy. Researchers have also combined MFCCs with GFCCs in various applications in order to achieve greater efficiency. In [115], the authors used the fusion of MFCCs, GFCCs and mel-spectrograms to classify heart conditions based on heart sounds. On the PhysioNet2016 set, an accuracy of 96% was achieved, which is higher than the accuracy achieved using MFCCs alone. Moreover, Al-Qaderi et al. [116] developed a two-stage speaker identification system using the fusion of MFCCs and GFCCs and various classification approaches, and analyzed them under different environmental noises. The fusion of FE techniques and classifiers is evaluated against the signal-to-noise ratio (SNR). The proposed fusion method demonstrated better recognition rates compared to the base classifiers and FE methods.
There are various vector-based FE methods that are widely used by researchers with ML and DL algorithms. The i-vector approach was the first of these to be employed in speaker recognition. Its front end (feature extractor) is implemented using Gaussian mixture models (GMM) and universal background models (UBM), and its back end is implemented with a probabilistic linear discriminant analysis (PLDA) classifier [117]. After the early success of i-vector-based systems, a number of hybrid methods combined i-vector and DL architectures [118], [119]. Subsequently, researchers implemented DL-based speaker embedding systems such as d-vectors [120], x-vectors [121], and t-vectors [122]. Deep d-vectors are trained on frame-level speech information. The d-vector architecture takes as input 300 ms speech frames that contain 40 filterbanks. In addition, there are four dense (fully connected) layers in the network, each containing 256 nodes. In contrast, the x-vector algorithm produces speaker embeddings from variable-length speech input [121]. X-vector systems achieve a lower equal error rate (EER) than i-vector and d-vector systems. Inspired by the performance of FaceNet [123], which learns embeddings of facial images, other domains have adopted similar embedding approaches. The t-vector system, also known as a triplet network, is trained as a shared DNN triplet network, normally with the triplet loss function. T-vector systems do not outperform x-vector systems, although the two are usually competitive with each other [124].
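The triplet loss mentioned above can be sketched as follows. This is a minimal illustration in the FaceNet style (squared Euclidean distance with a margin), not the exact formulation used by any cited t-vector system; the function name `triplet_loss` and the margin value of 0.2 are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors (last axis = embedding dim).

    The loss is zero once the anchor is closer to the positive (same speaker)
    than to the negative (different speaker) by at least `margin`.
    """
    a, p, n = (np.asarray(v, dtype=float) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2, axis=-1)   # squared distance anchor-positive
    d_neg = np.sum((a - n) ** 2, axis=-1)   # squared distance anchor-negative
    return np.maximum(d_pos - d_neg + margin, 0.0)
```

During training, the loss is averaged over many (anchor, positive, negative) triplets, which pulls embeddings of the same source together and pushes embeddings of different sources apart.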

7) PERFORMANCE OF ACOUSTIC FEATURE EXTRACTION TECHNIQUES
Traditional ML algorithms use almost all of the aforementioned features from the time, frequency, and cepstral domains to solve various problems. Features need to be handpicked based on the performance of each model. DL algorithms, in contrast, work on unstructured audio representations such as spectrograms and MFCCs and are capable of extracting patterns on their own. Furthermore, they are supported by a vast amount of data and computing power [125]. Therefore, the feature representations most widely used by DL algorithms, which can be fed directly into neural network architectures, are spectrograms, mel-spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs).
In order to solve a specific problem, researchers have used a variety of features.The best ones were selected by their performance and evaluation metrics.Researchers in [126] have compared the performance of MFCC, PLP, and LPC techniques for speaker recognition.Among these, PLP has performed better at low SNR values thus it is proven as a robust technique in the presence of noise.In [127], it has been validated that MFCCs are good at computing the distance between sounds.Moreover, the performance of a FE technique also depends on the nature of the sound.For example, speech and music sound share similarities in phones and notes, and harmonic structures in spectra.Unlike speech and music, environmental sounds have obscure periodicities and an indefinite dictionary of sounds which is why they are complex.Therefore, FE techniques that mimic human perception of sounds such as LPCCS [98], MFCCs [91], and GTCCs [99] are mainly used in speaker and music systems.
One study [128] has shown that the right combination of time- and frequency-based techniques can outperform well-known techniques. In their paper, Robert et al. compared the performance of a combined approach of line spectral frequencies (LSF), ZCR and spectral flux (SFX) with MFCCs in a standard audio recognition system. The F-scores indicated that the proposed approach achieved 97.5% and MFCC 78.9%. The chromagram has demonstrated great potential in one study [129] of Thai classical music instruments and has proven to be a good representation of the complex internal structure of Thai music. One investigation [130]

D. AI METHODS
This subsection discusses some of the traditional and widely used ML and DL algorithms.After appropriate features are extracted successfully, these AI algorithms are trained on the informative features to learn about the acoustic signature generated by a particular sound source.Figure 4 represents the traditional machine learning and deep learning algorithms that are used in ASI in various scenarios.

1) MACHINE LEARNING ALGORITHMS
To date, few researchers have compared ML algorithms in research work [131], [132]. Some of the prominent ML algorithms are discussed as follows:
• K-Nearest Neighbour (KNN): K-nearest neighbour is an instance-based, non-parametric, supervised ML algorithm. It is mostly used to solve regression and classification problems. An audio sample is assigned a class label when most of its nearest neighbours belong to that class. This is known as majority voting, which is the core concept of the KNN algorithm. KNNs are used in identifying patterns in texts [133], ESI [131] and finance studies [134].
HMM is employed in various fields such as speech recognition [137], gesture recognition [138] and target classification [139].
• Gaussian Mixture Model (GMM) is another probabilistic unsupervised learning model, i.e., it does not need prior information about the class labels of the data points. GMM can approximate complex class density functions with arbitrary precision. Further, it can also be employed as a supervised classifier. However, its performance has been shown to be lower than that of KNN and SVM in various applications [140].
• Artificial Neural Networks (ANN) are a subset of ML that work like biological neurons signaling each other. ANNs are made up of a set of artificial neurons, also called nodes. These nodes form layers comprising an input layer, one or more hidden layers, and an output layer. Every node has a weight and a threshold value that govern communication between layers. In this way data are passed between layers, and the network is trained on the data and gains accuracy over time. ANN is a supervised algorithm that learns from examples. For instance, a network can identify a dog in an image after it has been trained with manually labeled dog images, using what it learned from those examples to recognize dogs in other images. The learning rate in ANN varies over time. ANNs are used extensively in many fields such as healthcare [141], stocks and finance [142], 3D reconstruction [143] and environmental sound source identification [131].
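The majority-voting rule at the core of KNN, described in the list above, can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and unweighted votes; the function name `knn_predict` and the toy labels in the usage note are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Assign `query` the most common label among its k nearest training samples."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    votes = Counter(train_y[i] for i in nearest)      # majority voting over their labels
    return votes.most_common(1)[0][0]
```

For example, with feature vectors from two sound sources labeled `"fan"` and `"pump"`, a query vector that lies among the `"fan"` samples is assigned the label `"fan"`. In practice the feature vectors would be acoustic features such as MFCCs rather than raw coordinates.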

2) DEEP LEARNING ALGORITHMS
This subsection outlines commonly used DL approaches, which are as follows:
• Convolutional Neural Network (CNN): CNN is a type of DL model that belongs to the family of feed-forward neural networks. CNNs
The confusion matrix forms the basis for the other types of metrics. The classification accuracy rate is a common metric calculated from subsets of these values. In simpler terms, the accuracy can be obtained by dividing the sum of the values on the main diagonal of the confusion matrix by the total number of samples.
The precision score from the confusion matrix can be calculated by the ratio of true positives and total positives predicted.A low precision score of less than 0.5 depicts the outcome of a high number of false positives due to imbalanced class or untuned model hyperparameters.
Recall or sensitivity is defined as the number of correct positive outcomes divided by the total number of actual positive instances, i.e., true positives plus false negatives. The area under the precision-recall curve delivers an average precision score (APS). The mean of the average precision score is termed mAP and can be computed by taking the mean of APS over all classes. The F1 score is another evaluation metric that measures a test's accuracy. It is the harmonic mean of the precision and recall scores for a classification task.
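The confusion-matrix metrics discussed above can be computed directly from the four binary counts. The sketch below uses the standard definitions, with recall taken as TP / (TP + FN); the function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # TP over all predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # TP over all actual positives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For instance, with 8 true positives, 2 false positives, 4 false negatives and 6 true negatives, the precision is 0.8, the recall is 8/12, and the accuracy is 0.7.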
Area under the curve (AUC-ROC) is one of the frequently used metrics to analyze the performance of a classifier. AUC is used in binary classification problems. By definition, the AUC is the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. Three terms explain the characteristics of AUC: true positive rate (TPR), true negative rate (TNR) and false positive rate (FPR), which are defined mathematically as:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

These rates are displayed on the receiver operating characteristic (ROC) graph. The AUC-ROC graph is drawn with FPR on the x-axis and TPR on the y-axis. The values of FPR and TPR range from 0 to 1. The greater the value of the AUC, the higher the performance of the model.
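The probabilistic definition of AUC given above can be evaluated directly by comparing every positive score with every negative score. The following is a minimal sketch; the function name `auc_by_ranking` and the convention of counting ties as one half are assumptions, and the pairwise comparison is only practical for small sample sizes.

```python
import numpy as np

def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` holds 1 for positive and 0 for negative instances; tied scores
    contribute one half to the count.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative instance")
    diff = pos[:, None] - neg[None, :]          # all positive-negative score gaps
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(correct / (len(pos) * len(neg)))
```

With scores `[0.1, 0.4, 0.35, 0.8]` and labels `[0, 0, 1, 1]`, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.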
The Matthews correlation coefficient (MCC) [149] is a more reliable metric than the aforementioned traditional metrics. MCC measures the correlation between the true and predicted classes and is analogous to the χ² statistic on a confusion matrix.

$$\mathrm{MCC} = \frac{TN \cdot TP - FN \cdot FP}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
MCC achieves a high score only when the prediction is good in all four categories (TP, TN, FP, and FN) relative to the sizes of the positive and negative classes. MCC offers numerous advantages over the F1 score and accuracy in binary classification problems [150].
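A direct implementation of the MCC formula illustrates why it is more informative than accuracy on imbalanced data. This is a minimal sketch; the function name `mcc` and the convention of returning 0.0 when the denominator vanishes are assumptions.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0          # assumed convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom
```

Consider a classifier that labels every sample as positive on a set with 95 positives and 5 negatives. Its accuracy is 0.95, yet `mcc(tp=95, tn=0, fp=5, fn=0)` returns 0.0, which reflects that the classifier carries no information about the negative class.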

III. APPLICATIONS OF AI-BASED ASI
This section discusses the significance of AI-based ASI and its applicability to a variety of applications.As we mentioned before, this section provides a detailed discussion of AI-based ASI in fault diagnosis, underwater detection, ESR, healthcare, music-genre classification and wildlife monitoring.For all these applications, we compared previous works based on evaluation metrics, limitations and advantages.In addition, we also provide statistical analysis of proposed models from the literature review.

A. MACHINERY FAULT DETECTION
This subsection provides a detailed discussion of ASI methods that have been applied in industrial settings for intelligent fault diagnosis in machines and their components. In addition to the discussion of ASI methods for fault diagnosis, we present state-of-the-art artificial intelligence models for the identification and detection of faults. We have also summarized relevant studies in Table 3 to give a clear overview of their works. With the progress in production processes, science, and technology, machines and equipment are becoming more advanced and automated. Modern machines are complex and their components are linked to each other. A slight fault can trigger a chain of issues in a machine if it is not diagnosed in time. For instance, the crash of a US space shuttle occurred due to a minor problem in one of its components. Therefore, research is needed in the fields of urgent condition monitoring and intelligent fault diagnosis so that prompt maintenance can be carried out to ensure the smooth functioning of equipment. This will increase the reliability and safety of the industrial environment and reduce costs as well. Fault diagnosis has been conducted using vibration, thermal, current, and sound signals from different machine components such as motors, gearboxes, bearings, and transformers.
In 2013, Pandya et al. [151] discussed fault diagnosis in rolling element bearings in one of the earliest ASI-related papers.In their work, acoustic signals from bearings are captured and time-frequency features are derived using intrinsic mode functions.Then, supervised machine learning classifiers such as KNN and weighted KNN are used for the classification of faults and KNN has been chosen as the best classifier with 92.77% accuracy.Later on, Yoon and He [152] and Yao et al. [163] investigated the fault diagnosis in planetary gearbox using acoustic emissions and supervised learning algorithms.To accomplish this, Yoon and team set up the power gearbox (PGB) test rig experiment, created faults in gears artificially, and collected acoustic samples using acoustic sensors.In their study, KNN, back propagation (BP), and LAMSTAR learning algorithms are used to compare their performances based on their error rates.Subsequently, Yao utilized four classification models back propagation neural networks (BPNN), Extreme learning, random forests (RF), and SVM for fault classification and compared the results.
Right after Yoon's work, Waqar and Demetgul [154] did another experiment with worm gear in motors.Both vibration and sound signatures from the motors at different speeds are acquired.The collected signatures were preprocessed and classified using Multilayer Perceptron Artificial neural network (MLP-ANN).The trained algorithm achieved successful prediction of 2 different speeds and 4 different oil levels.Adam et al. [66] used real acoustic data of four states of faulty three-phase induction motors, extracted two types of features, and used nearest neighbour, BPNN, and a modified classifier based on words coding for recognition.
The authors in [156] proposed a fault diagnosis model that can identify new fault modes through K-means unsupervised clustering and store the real-time data for future fault diagnosis.An experiment has been conducted to validate their study in which they recorded acoustic signals from bearings at different speeds to create a database.Further, KNN classifier has been used to estimate the prediction performance.
In the literature of the last three years, Potovcnik et al. [159] classified the condition of a system valve using acoustic features and various ML algorithms. To accomplish this, an experimental setup of the valve assembly with a microphone was established in a semi-anechoic chamber. The proposed methodology also involves a comparative analysis of feature selection using different classification models. Later, Yaman [160] and Santos et al. [164] investigated faults in the bearings of three-phase induction motors. In these studies, audio data were collected through experiments and classified using supervised learning algorithms such as SVM, KNN, and MLP. Later in 2022, Orhan et al. [166] proposed a lightweight method for the detection of faults in unmanned aerial vehicle (UAV) motors. Audio datasets were collected from healthy and faulty motors of various UAV sources, and an SVM classifier was used for fault diagnosis. Moreover, Cai et al. [162] and Fu et al. [168] studied anomaly detection in transformers using acoustics. Fu and team developed a method named lightFD to perform SVM classification on edge devices with limited computing power. Most recently, Liu et al. [167] studied rotor-bearing fault analysis using sound data acquired from faulty rotating machinery and classified by SVM, KNN, and decision trees.

B. UNDERWATER APPLICATIONS
ASI in the underwater medium has a wide range of applications.For example, ASI capability plays a vital role in the military to identify friendly/adversarial objects (e.g., submarines, torpedoes) in water.ASI is also beneficial for scientists studying marine ecology, geology, oceanography, and seismology.Moreover, the ability to localize objects and analyze transients in ocean acoustics can be utilized by the mining industry for offshore oil and gas discovery and plant maintenance.
Traditionally, ASI has been performed successfully in the underwater medium using matched-field processing (MFP) [183].However, one of the severe limitations of MFP is its sensitivity to the mismatch between model-generated datasets and real-world conditions [169], [183].In other words, MFP may not be flexible enough to adapt its parameters to changing channel conditions (e.g., sound speed profiles, bathymetry, and chemical composition of water) which is an inherent and peculiar nature of the underwater medium.In order to combat this challenge, many in the literature have resorted to data-driven ML techniques which can learn from and adjust themselves to rapidly varying channel conditions, yielding better results such as improved accuracy for ASI.
In this work, we have conducted a comprehensive literature review on ML techniques for ASI in the underwater medium. Table 4 compares the AI-based ASI works in underwater applications based on their advantages and limitations. Additionally, based on the type of AI technique, we have organized the works into the following main categories:
The authors in [169] have formulated the ASI problem as an ML problem where the ML model learns directly from observed data. In this work, the authors utilize a vertical linear array to build a normalized covariance matrix, which is used as the training dataset. Three ML techniques (FFNN, SVM, and ensemble learning-based random forest (RF)) are evaluated against the traditional MFP technique. Results indicate that ML algorithms yield better results when ASI is posed as a classification problem rather than a regression problem. Moreover, FFNN yields better predictive performance at multi-frequency inputs with SNRs above 0 dB despite a small number of training samples.
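To make the idea of a normalized covariance matrix as classifier input concrete, the sketch below builds one from array snapshots at a single frequency. The unit-norm scaling of each snapshot, the averaging over snapshots, and the vectorization of the upper triangle into real and imaginary parts are assumptions made for illustration; the exact preprocessing in [169] may differ. The function name `normalized_scm_features` is hypothetical.

```python
import numpy as np

def normalized_scm_features(snapshots):
    """Normalized sample covariance matrix (SCM) and a real feature vector.

    `snapshots` is a complex array of shape (n_snapshots, n_sensors) holding
    the pressure measured on the array at one frequency.
    """
    p = np.asarray(snapshots, dtype=complex)
    norms = np.linalg.norm(p, axis=1, keepdims=True)
    p = p / np.where(norms == 0, 1.0, norms)      # remove source amplitude per snapshot
    # Average of the outer products p p^H over all snapshots.
    scm = np.einsum("si,sj->ij", p, p.conj()) / p.shape[0]
    iu = np.triu_indices(scm.shape[0])            # SCM is Hermitian: keep upper triangle
    features = np.concatenate([scm.real[iu], scm.imag[iu]])
    return scm, features
```

For an array of N sensors the feature vector has N(N + 1) entries, and vectors from several frequencies can be concatenated when multi-frequency input is used. Each labeled vector (for example, labeled by source range class) then forms one training sample for the FFNN, SVM, or RF classifier.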
In their follow-up work in [170], the same authors show that ML-based classifiers deliver better results in estimating ship range for up to 10 km when MFP fails at approximately 4 km range when environmental information is limited.
More recently, the work in [171] has proposed a two-stage process of underwater target detection where the first stage involves beamforming-based direction of arrival (DoA) estimation and the second stage involves taking the DoA information and feeding it into an FFNN which develops a detection model.The proposed method can yield a detection accuracy of as high as 97.32% at particular locations of the ocean.

2) MULTI-LAYER PERCEPTRONS (MLP)
Traditional recursive algorithms such as gradient descent used in neural networks (NN) encounter multiple issues such as low accuracy, slow convergence, and local minima entrapment.This has led researchers to use heuristics/metaheuristics-based algorithms for training NNs.The work in [172] has used grey wolf optimization (GWO) for training NN for target classification.Results show that compared to the Particle Swarm Optimization (PSO) algorithm, Gravitational Search Algorithm (GSA), and the hybrid algorithm (i.e.PSOGSA), the multi-layer perceptron (MLP NN) using GWO yields better results across all three datasets used (i.e., Iris, Lenses, and Sonar-1988) in terms of higher accuracy (>95% for Sonar), lower probability of local minima entrapment and higher convergence rate.
The authors in [173] have used another meta-heuristic-based algorithm, biogeography-based optimization (BBO), for classification with NNs on the same three datasets as [172] (except that Sonar-2015 has been used in [173]). Non-linear migration models offer a two-pronged advantage: first, they help search agents explore the solution space better, resulting in local optima avoidance; second, they accelerate the search agents towards the global optimum, enhancing the convergence rate without sacrificing classification accuracy.
More recently, the authors in [174] have used another meta-heuristic Dragonfly Algorithm (DA) on active, passive, and Gorman and Sejnowski sonar datasets and compared its performance against BBO, GWO, Ant Lion Optimization (ALO), ACO, GSA and Multi-verse Optimization (MVO) algorithms where DA outperforms the rest in terms of accuracy and convergence speed.
The work in [175] presents a method for classifying targets in passive sonar using an MLP trained by a salp swarm algorithm (SSA). The authors have also used MFCC to improve the dataset's dimensions. The proposed method utilizes SSA to optimize the weights and biases of the MLP, which is then used to classify sonar signals. SSA allows for faster and more efficient training of the MLP even in the presence of noise. One limitation of the proposed method is that it is based on passive sonar, which may be limited in its ability to detect targets that are quiet or have low echo strength. Additionally, the method relies on the SSA, which may be sensitive to initial conditions and may not be able to find the globally optimal solution.
More recently, the Whale Optimization Algorithm (WOA) and an improved WOA with Local Wavelet Acoustic Pattern (LWAP) have been utilized in [176] and [177], respectively, for training MLP NNs.These works indicate that meta-heuristics-based MLP NN training for sonar target classification is a promising technique.

3) DEEP NEURAL NETWORKS (DNN)
Besides MLP NNs, researchers have also used DNNs for ASI. The work in [178] uses DNNs for range and depth estimation of an acoustic source in shallow water (100 m). The authors propose two methods: a) a two-stage process of FE followed by model building, and b) a direct process of training a combined convolutional neural network (CNN) and FNN (CNN-FNN) architecture on raw acoustic data. Both methods provide better performance than MFP under mismatched environments.
Inspired by the human auditory system of sound perception, the authors in [179] propose a deep CNN for underwater target recognition.The architecture maps various frequency components into a bank of multi-scale deep filter subnetworks.Then it mimics the neuro-plasticity mechanism of the human brain to train those multi-scale deep filter subnetworks using raw time-domain ship noise.The proposed method, when trained with raw time domain waveforms, achieves better classification accuracy as compared with standard CNN/DNN methods with other types of training input such as MFCCs.
The work in [180] has proposed a DNN-based source localization technique for very shallow water environments (a 1.1 × 1.4 m laboratory tank with a depth of 0.1 m) with high-frequency components, while the work in [181] has devised a deep transfer learning technique that can be adopted by models such as the one proposed in [180] to transfer their capabilities to real-world deep-sea environments. The transfer learning approach in [181] can open new possibilities for effectively training DNNs, since real-world deep-sea trial data are difficult to obtain.
More recently, the work in [182] has proposed a method where learning features are extracted from five different dimensions, i.e., noise spectrum level (NL), time-frequency spectrum (Spec), power spectral density (PSD), Mel-frequency cepstral coefficients (MFCC), and Mel filter bank energy (FBANK). The authors then compared the performance of SVM and CNN on noise-added data with various SNR levels. They found that underwater noise can be best characterized by NL and PSD features. Additionally, CNN outperforms SVM in noise classification.

C. EVENT AND ENVIRONMENTAL SOUND SOURCE DETECTION
In this subsection, we provide a detailed discussion of the significance of acoustic source identification (ASI) in environmental sound source recognition and event detection. First, we discuss the applications and benefits of environmental sound source identification. Then, we discuss in detail most of the FE and AI methods available in the literature. Finally, we present a summary of the event and environmental source identification works and highlight their limitations in Table 5.
ASI can be used to recognize events and scenes. Context recognition is becoming a popular and important research area. Acoustic scene identification is defined as the recognition and classification of acoustic scenes such as schools, offices, hospitals, and trains/buses based on the sounds generated in them [200]. The aim is to create applications that can improve urban environments by detecting and identifying the activities in the surroundings. Environmental sound source identification is more complex than speech and music recognition because environmental sounds are very dynamic in nature and it is sometimes difficult to identify targeted sounds due to background noise. Some sounds have a low signal-to-noise ratio, which can make it difficult to recognize their sources properly [201].
ESR is a promising method for audio surveillance applications [202].In addition, ESR can be used in robotics to improve their navigating abilities and interactions with the environments [203], [204].ESI along with video analysis contributes its benefits majorly in surveillance applications of homes and cities [205], [206].Home surveillance is very important, especially for elderly people who are living alone or other smart home applications [207], [208].ESI has been used to recognize animal species [209], [210], [211], bird species [212], [213] by their distinct acoustic signatures for bioacoustic applications and wildlife monitoring.Recently in a study [214], hive health has been monitored by analyzing hive sounds using AI algorithms.
In our study, we have reviewed the literature extensively to compile the renowned work which has been done in the field of ESI and sound event detection (SED).The authors in [19] did the survey and compiled the works which have been investigated in environment audio scene and sound event recognition for surveillance purposes.Recently, a detailed review [26] has been published in the area of ESI.This research work highlighted the environmental and events sound datasets, FE methods, and different ML and DL algorithms used in recent studies.In the past five years of literature, the prominent works accomplished by authors in urban sound event detection and environment sound source identification are as follows.
The authors in [184] have developed an efficient urban sound classification mechanism using ML. Local and global FE techniques are employed to obtain the most discriminative features for the ML algorithm. A mixture-of-experts model is also introduced to combine information from the local and global features. Zhu et al. [187] performed multi-scale FE on audio data. In multi-scale convolution, the signal waveform is convolved with filters at different scales and the resulting features are fused. CNNs are trained after the pooling of useful features. With this setup, sound recognition results better than those of previous methods are achieved. The authors in [188] improved sound recognition in hearing aids by using ensemble techniques. Moreover, automation in devices is introduced with respect to sensing and recognizing sounds and their sources. Similarly, Ahmed et al. [192] established an automatic environmental sound recognition system using deep learning. Image-based features are used to train the DL algorithm. Moreover, the performance of various FE methods is compared, and different recognition accuracy rates are obtained on publicly available datasets.
In 2021, many researchers actively worked in this area and published their works. The authors in [194] recorded ambient sounds in an indoor environment for the purpose of recognizing an activity based on the sound it produces. Spectral information was extracted from data collected using smart IoT sensors. The research team used a CNN-based DL model along with fuzzy logic to achieve coherent recognition of activities. Nanni et al. [195] proposed combining ensembles of classifiers that exploit six data augmentation schemes for the training of CNNs. These ensembles are tested on open-source environmental sound datasets. The performance of the ensembles is compared extensively with results reported in the literature, yielding a high-performing ensemble with high accuracy. Zinemanas and team [196] proposed a novel interpretable architecture employing a DNN for ESI. Audio domain knowledge is used to improve the distinction between classes. The key idea was to incorporate frequency-dependent similarity by assigning different weights to each frequency bin in the latent space. Due to the system's interpretability, it can be evaluated and debugged easily. Moreover, the authors in [197] proposed an intelligent forest monitoring system that applies signal processing techniques such as dynamic time warping and ML algorithms trained on MFCC and spectral feature spaces.

D. HEALTHCARE APPLICATIONS
This subsection outlines the usefulness of AI-based ASI in the healthcare field and the research that has been done in the past few years. AI-based ASI has been widely used in the healthcare field in fall detection, health and fitness wearable devices, equipment for hearing-impaired patients, and cardiac auscultation. Elderly people are prone to falls, and sometimes they do not have access to any external help. A fall can be serious and can lead to severe injuries that may take longer to heal in old age. In such cases, early first aid is crucial to reduce the risk of death [215], [216]. Therefore, the consequences of these incidents can be considerably reduced by using modern AI techniques along with traditional methods. In the past, various methods for fall detection have been proposed, for example using cameras [217], sensors [218], [219] and radars [220], [221]. Health support wearable devices for hearing-impaired people provide a promising way for them to interact with their surroundings [222], [223]. Nowadays, heart auscultation is used extensively in conjunction with AI techniques for the diagnosis of cardiovascular diseases [224] and condition monitoring of arteries and valves [225].
The authors in [103] proposed a fall detection framework that is based on signal processing methods such as silent zone suppression and acoustic ternary pattern FE. SVM has been used to classify and detect fall events. Their proposed method works well in a multi-class environment. Yauganouglu et al. [226] developed a real-time detection system for hearing-impaired people using a wearable device in which information about sound events is conveyed to the user through vibrations. A combination of pre-processing methods is used for FE. Correct perception and recognition of sounds are delivered using a KNN classifier and audio fingerprinting. Later, Ramadhan and team [227] published their work in which acoustic event recognition is investigated as part of a smart home system for elderly people. In their work, spectrograms are extracted from practically collected audio data to train a DL model, i.e., a CNN. In their investigation, accuracy rates of 97.5% in silent scenarios and 85% in normal scenarios are achieved.
Recently, Jain et al. [12] developed an interactive tool namely ProtoSound for the hearing impaired or people with hard-of-hearing problems.The system has the ability to personalize a sound recognition model from user recordings.User recordings undergo the same set of steps of FE and classification performed by the chosen model.CNN architecture is being used in the ProtoSound system for the prediction of sounds.Their proposed ProtoSound achieved an average accuracy of 88.9%.Authors in [228] also presented a review of sound recognizer tools for hearing-impaired individuals.In their review, a user-driven automated sound recognition system is studied using ML techniques.The potential use of personalizable sound recognition systems is also highlighted for prospective research.
Authors in [229] used curve fitting and a KNN algorithm.The normal and abnormal heart sounds have been classified with 92% accuracy.Nouman et al. [230] developed a framework for automatic heart sound detection by using neural networks.An optimal combination of 1D-CNN and 2D-CNN is employed which exhibited an accuracy of 89.22%.Furthermore, Zhang and team [231] proposed a novel method based on temporal quasi-periodic features and LSTM algorithm.The popular 2016 PhysioNet dataset is used and an accuracy of 94.66% is achieved.
In addition, the authors in [232] performed their research in determining the heart condition based on its sounds and developed an AI-enabled tool for automatic quality assessment.Two datasets (2016 PhysioNet/CinC Challenge and self-collected) are used to compute necessary de-noised features and trained MLP classifier to perform binary classification of heart sounds.Then in the same year Zhiming et al. [233] proposed a heart sound recognition method to identify congenital heart disease in patients.Two classifiers, SVM and BP are used to train MFCC features extracted from heart sounds obtained from the 2016 Heart sound Challenge dataset.Their work improved the accuracy of detection of the congenital disease up to 93.52%.Kui et al. [234] proposed a promising approach in the classification of heart sounds using the duration-dependent hidden Markov model (DHMM) in the segmentation of heart sounds.Additionally, dynamic frame length is used to extract MFCCs from heart audio.Then, the extracted features are classified using DL i.e.

E. OTHER APPLICATIONS
This subsection reviews recent research in various other applications, such as music-genre classification, wildlife monitoring, and forensic applications.

1) MUSIC-GENRE CLASSIFICATION
Inspired by the advancements in natural language processing (NLP), Zhuang et al. [237] designed a transformer classifier for music-genre classification. They used the well-known GTZAN dataset; the transformer model is fed mel-spectrograms as features and achieved an accuracy of 76.0%. Later, Mounika et al. [238] applied CNN and CRNN to classify music into various genres. The classification is performed on the GTZAN dataset with mel-spectrograms as distinguishing features. Their model reached a classification accuracy of 73.2%, with train accuracy being 12% lower than validation accuracy due to the overfitting problem in the model. The authors in [239] developed a transformer-based music recognizer that uses MFCCs to recognize the genre of audio. Their model is analyzed on the original GTZAN dataset and on a data-augmented set, which resulted in a better accuracy of 75.1%.
Furthermore, Shah and team [240] classified music into different genres using various time and frequency domain features. They extracted spectral centroid, onset strength, ZCR, tempo, spectral contrast, spectral bandwidth, roll-off contrast, and flatness to train SVM, random forest, and gradient-boosting ML algorithms. In addition, spectrograms are extracted to train DCNNs. A comparison of their classification performance shows that the CNN outperforms the ML algorithms. Lately, Cheng et al. [13] studied the music-genre classification problem using visual mel spectrograms with the YOLOv4 neural network, which is based on CNN. Their model is evaluated on various metrics such as precision, recall, F1-score, mAP, and the confusion matrix. The average mAP results indicated 97.93% accuracy on the test set and 91.49% on the training set. They achieved better accuracies on the GTZAN dataset; however, the graphical spectrum feature used increases hardware cost. Most recently, the authors in [241] introduced a hybrid model based on CNN, multimodal, and transfer learning. In this approach, the GTZAN and Ballroom datasets have been used for analysis and benchmarking. Wavelet features are extracted, and mel-spectrograms serve as visual representations for the CNN. Results demonstrated that their hybrid model scored 81% on both the GTZAN and Ballroom datasets. In addition, the computational performance of their model is analysed on a laptop and a supercomputer, with the supercomputer having a much lower computation time.
2) WILDLIFE MONITORING AND BIO-ACOUSTICS
In wildlife monitoring and bio-acoustics, several studies have examined animal behavior and classification problems based on acoustic features. The behavior of domestic cats was studied by Pandeya et al. [242], who developed an automated classification system for cat-generated sounds. The cat sounds dataset is enlarged with a data augmentation technique, and mel spectrograms are extracted as features for classification. Transfer learning of a CNN and a convolutional deep belief network (CDBN) has been carried out due to the close relation of cat sound and music. This resulted in good overall performance in classification accuracy and receiver operating characteristic (ROC) metrics. Later on, the same authors in [243] treated cow sound analysis as an SED task and developed an autonomous monitoring system. Using a data-driven approach, the mel spectrogram is selected as a potential feature for video object description models (VODMs), and this approach is compared with a conventional CNN. The proposed approach achieved better quantitative and qualitative scores. In addition to this, Li et al. [244] proposed an automatic sound recognition system for dairy cows that classifies their ingestive behavior. A publicly available jaw movements dataset of two forage species is used. Time-domain, frequency-domain, and MFCC features are formed and a statistical model is developed. Then, three DL models, CNN (Conv1D), CNN (Conv2D), and LSTM, are trained and optimized to classify the ingestive behaviors. The resulting performance under different forage species and heights was 0.93, and the difference between the best and poorest results obtained was 0.4-0.5.
The authors in [43] studied insect sound recognition. An ARS center dataset is used, which comprises sound files of various insect activities such as moving, feeding, and calling. CNNs are trained with feature maps created from MFCCs and obtained a 92.56% recognition rate. Sun et al. [245] developed a reliable rainforest monitoring system using data augmentation and CNN-based transfer learning due to the scarcity of datasets. This system enables the detection and classification of various animal species (birds, amphibians, invertebrates, mammals). Their model achieved an average accuracy of ≥ 90% with mel-spectrogram features. However, their model included only a limited set of rainforest sonotypes. Moreover, Echinski and team [246] investigated bird species using bird sounds and established a recognizer model. The spectrograms of bird sounds are fed into a ResNet34 CNN for training. Performance metrics indicated a macro-average F1 score of 0.74. However, their system could not recognize new entries in the test dataset, which needs to be addressed. Recently, Jiang et al. [247] addressed the classification of ape calls as an SED problem using an LSTM neural network. Three types of input features are used for training the NNs, i.e., raw waveform, spectrogram, and wav2vec 2.0. In their study, the results demonstrated that wav2vec 2.0 outperformed both the raw waveform and the spectrogram in the classifier.

3) SPEAKER IDENTIFICATION (SID) SYSTEM FOR FORENSICS
AI-based ASI is not limited to the above-mentioned applications. Over time, it has gained significant attention in speaker identification and verification systems for forensics and surveillance applications. The authors in [248] proposed a novel model to detect disguised voices for forensic identification systems. A GMM supervector obtained from the Gaussian distribution of the speaker's voice and extracted MFCCs are used as features to train an SVM classifier. The proposed model achieved good identification rates and an error rate lower than 7%. Later, the authors in [249] presented another method based on the evaluation of speech quality data. Three experiments are performed on the SRE dataset to assess the impact of quality data on forensic speaker recognition (FSR). A GMM universal background model (UBM) is trained on MFCC and delta-MFCC vectors. The results indicated that their model obtained an Equal Error Rate (EER) of 0.6% as compared to state-of-the-art performance on the TIMIT dataset. Rozario et al. [250] implemented a speaker recognition system using ANNs. The performance of Relative Spectral Amplitude (RASTA) PLP, MFCC, and Power Normalized Cepstral Coefficient (PNCC) features is compared on the TIMIT database. The results demonstrated that MFCCs outperformed PNCC and RASTA-PLP in speaker identification, with the highest accuracy score of 90.66% on the full speech segment.
Subsequently, the authors in [251] proposed a DL-based speaker identification mechanism using improved shuffled MFCC (SHMFCC). A data augmentation approach is used in conjunction with the extraction of shuffled MFCC features. Three different datasets, LibriSpeech, TSP, and VoxCeleb1, are used to conduct the experiments. The tuned feed-forward DNN was trained and tested under various noisy conditions. The proposed method demonstrated high accuracy in all noisy scenarios. Later, Bakir et al. [252] presented a forensic voice application. Datasets comprising recordings from 1000 people have been gathered and MFCC features are extracted. The identification rates of a CNN and a DBN trained on these features are compared. The CNN performed better than the DBN on all MFCC vector lengths. Recently, the authors in [253] first performed speech enhancement by employing spectral subtraction and log Minimum Mean Square Error (MMSE) techniques. Then, speaker identification on the Australian Forensic Voice Comparison database is carried out by training a GMM on MFCC features. The average scores of log MMSE are observed to be much higher than those of spectral subtraction. The trained model achieved a 63.5% accuracy score for speech signals enhanced with the log-MMSE technique. Babu et al. [254] presented a short review of the forensic speaker identification (FSID) system. The authors highlighted the physical properties of speech signals. Several FE techniques and AI algorithms for FSID are also discussed in the paper.
• Computational complexity of SID systems: Speaker recognition is an important technology in forensics, access control systems, and the financial sector. AI-based approaches introduce a new direction to this technology in terms of recognition accuracy, computational complexity, and identification rates. Over the past decades, considerable research has addressed problems in speaker recognition and speaker verification systems. Inspired by the remarkable performance of DL algorithms in SID systems, researchers have applied DL algorithms in speaker recognition [255], [256] and delivered high accuracy. However, advanced DL algorithms are computationally intensive, which restricts their implementation on hardware platforms. ML algorithms implemented in SID systems, such as GMM and SVM, are less computationally complex than DL algorithms. The execution time of a GMM trained on MFCCs is 0.8 ms at a 48 MHz frequency for a speech set of length 20 [257]. Moreover, the execution time of SVM with MFCCs is 4.6 ms at 50 Hz [258]. Another group of researchers [259] developed a SID framework based on MFCCs and SVM and reported an execution time of 9.10 ms per frame. DL algorithms comprise several hidden layers, which add time complexity to a system. In speaker recognition, deep CNNs, RNNs, and ResNet models are mostly employed by researchers. Cai et al. [260] presented a DNN-based framework with an i-vector approach for the speaker verification system. The inefficient use of DNN layers resulted in the high computational complexity of this method.
The authors in [261] modified ID-ResNet20 by changing its convolutional kernels from 3 × 3 to 1 × 3, which reduced the computational complexity of the system by approximately two-thirds. The peak computation-to-communication ratios of layers resulted in 3.75 Gb/s for a speech length of 3 s. In addition, the authors further modified the ResNet20 by adding a pooling layer after the convolutional layer. In comparison to the original ResNet20, the modified model achieved 51% reduced parameters and 64% reduced computational complexity.
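The roughly two-thirds saving reported for the 3 × 3 to 1 × 3 kernel change is consistent with a simple count of multiply-accumulate operations (MACs) in a standard convolution layer, where the cost scales with the kernel area. The following is a minimal sketch of that arithmetic; the layer shape (channel counts and output size) is a hypothetical choice for illustration and is not taken from [261].

```python
def conv2d_macs(in_ch, out_ch, kernel_h, kernel_w, out_h, out_w):
    """Multiply-accumulate count of one standard 2-D convolution (bias ignored)."""
    return in_ch * out_ch * kernel_h * kernel_w * out_h * out_w


# Hypothetical layer shape, assumed only for this illustration.
layer = dict(in_ch=16, out_ch=16, out_h=40, out_w=100)

macs_3x3 = conv2d_macs(kernel_h=3, kernel_w=3, **layer)
macs_1x3 = conv2d_macs(kernel_h=1, kernel_w=3, **layer)

# Everything except the kernel area cancels, so the ratio is 3 / 9.
reduction = 1 - macs_1x3 / macs_3x3
print(f"3x3 MACs: {macs_3x3}, 1x3 MACs: {macs_1x3}, reduction: {reduction:.3f}")
```

Because all other factors cancel, the relative saving does not depend on the assumed layer shape; only the kernel area (9 versus 3 weights per filter position) matters.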
In [262], the authors developed a less complex attacking toolkit, PhoneyTalker, for DNN-based speaker recognition systems. The results from the proposed framework demonstrated a low average time cost (ATC) of 0.03 s and a 15% ASR improvement over state-of-the-art methods. In addition, the authors in [263] proposed a lightweight few-shot speaker identification (FSSI) method based on a recurrent convolutional block (RCB) on the backbone of a bidirectional LSTM. A softmax layer is introduced in the proposed model, which is evaluated on three datasets (VoxCeleb1, VoxCeleb2, and LibriSpeech). The performance metrics model size (MS) and the number of multiplication and addition operations (MACs) indicated improvements as compared to state-of-the-art methods. The method achieved the highest accuracy scores of 92.89%, 92.74%, and 98.51% (V2-set, V1-set, L-set) on feature subset size 4 with low values of MS (54.14k) and MACs (103.16M).
In addition to the above-mentioned applications, a few more applications are discussed here. The authors in [264] classified sonar targets of different shapes and sizes in the air using MLP neural networks. In their method, feature vectors are generated after extracting the spectrograms of raw echoes and other spectral features using STFT. After training the MLP-NN, the performance is compared for narrowband and wideband excitation signals. Jin et al. [265] developed an object recognition framework for robots in an open environment based on acoustic signatures collected using the dynamic contact method. A K-nearest neighbour ML algorithm is trained with MFCC features. Their framework showed that robots can detect objects by their acoustic waveforms, with the best results at 180° joint rotation and 180° horizontal rotation. Moreover, He et al. [266] investigated drone sound identification in a noisy environment. Feature vectors of drone sounds are created by employing harmonic line association (HLA) and wavelet packet transform (WPT) FE methods. An SVM with parameters optimized by a genetic algorithm (GA) is used to identify drones. They achieved 100% identification probability during trials.

IV. FUTURE RESEARCH DIRECTIONS
With numerous challenges and limitations faced by ASI, there exists immense potential for future research in various areas. This section discusses some recommendations based on our perspective for future consideration. We believe these directions would be interesting to investigate and would improve both the performance of ASI and the understanding of this concept. Some of these recommendations address general methodological problems and some are specific to ASI in the light of the earlier analysis in this survey:
• Real-time database expansion: Sometimes, the datasets required to solve a particular ASI problem are not available, for example in the underwater domain. Due to the scarcity of real databases, it is very challenging to address undersea problems. DL approaches cannot be applied in this case because of insufficient data. In these cases, the underwater research community can expand underwater databases by collecting new, reliable real-time audio datasets.
• Poor generalization ability of DL: Another problem that demands real-world datasets is the poor generalization ability of DL models. DL models trained on simulated data perform poorly when tested on real-world datasets. This is called train-test data mismatch. This problem also occurs when acoustic data acquisition trials are performed unrealistically and room geometries are not considered. Therefore, domain adaptation [267] and transfer learning [268] techniques should be investigated, since they aim to improve the performance of a network on one problem (real data) when it was actually trained for another (simulated data).
• Improvements in audio processing (AP): Multiple acoustic data acquisition trials conducted in various situations can introduce background noise and polyphonic sounds. Robust techniques need to be developed to identify and eliminate such anomalous sounds, especially in the case of multiple sound sources. New robust hybrid FE approaches with better discrimination for real-time ASI applications need to be explored. Our survey revolves around ASI only, and we have not considered acoustic source separation (ASS), diarization, and sound source enhancement, which are all connected to ASI. In this survey paper, we have shown how an AI-based, data-driven approach to ASI can replace conventional AP techniques. We believe that a combination of AP techniques and powerful DL models, in particular deep generative models such as generative adversarial networks (GANs) [269], variational autoencoders (VAEs) [270], and dynamical VAEs [271], can model the temporal and/or spectral characteristics of sounds. Therefore, along with AP, these DL approaches can improve performance on the aforementioned problems and may be implemented by future researchers.
• Multi-task learning approach (MTL): Multi-task training is a general method used to improve the performance of DNNs on a given problem by training the model to simultaneously handle several other tasks [272]. To the best of our knowledge and based on our survey, no one has used the MTL approach to tackle ASI tasks jointly. In an ASI-based problem, this approach can be implemented as follows: the first part of the model (e.g., an FE module of several blocks) is common to the different tasks; afterwards, the model divides into different modules, each performing a different specialized task. The common module ensures the discovery of an efficient signal representation that is used for the other tasks. This approach offers data efficiency and shared representations and also reduces the overfitting problem.
• We noticed in our survey that many of the deep networks presented are computationally inefficient and thus require high computing power. Therefore, future research can consider developing powerful DNNs that use computing power efficiently on big datasets.
• In this paper, we have carried out a detailed survey; in future, meta-analysis, simulations, and results can be added as well, which will foster pathways to better research and development.
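The shared-trunk arrangement described under the multi-task learning direction can be sketched as a forward pass with one common feature module feeding several task-specific heads, trained with a weighted sum of per-task losses. This is only an illustrative sketch in plain Python: the two task names, the layer sizes, the random weights, and the loss weights are all assumptions made for the example and do not come from any surveyed work.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
N_FEATURES, N_SHARED = 13, 8  # e.g. 13 cepstral coefficients (assumed)
TASKS = {"source_type": 4, "noise_condition": 2}  # hypothetical tasks

# Common module shared by every task.
shared_w, shared_b = init_layer(N_SHARED, N_FEATURES, rng)
# One specialised head per task, all reading the shared representation.
heads = {name: init_layer(n_cls, N_SHARED, rng) for name, n_cls in TASKS.items()}


def forward(features):
    shared = relu(dense(features, shared_w, shared_b))
    return {name: softmax(dense(shared, w, b)) for name, (w, b) in heads.items()}


def joint_loss(outputs, labels, task_weights):
    """Weighted sum of per-task cross-entropy losses."""
    return sum(task_weights[t] * -math.log(outputs[t][labels[t]]) for t in outputs)


features = [rng.uniform(-1.0, 1.0) for _ in range(N_FEATURES)]
outputs = forward(features)
loss = joint_loss(outputs,
                  labels={"source_type": 2, "noise_condition": 0},
                  task_weights={"source_type": 1.0, "noise_condition": 0.5})
print(outputs, loss)
```

In a trained model, the gradient of the joint loss would update the shared weights from every task at once, which is where the data efficiency and regularization attributed to MTL are expected to come from.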

V. CONCLUSION
ASI faces numerous challenges in accuracy, automation, and robustness. AI methods have evolved and serve as a promising solution to these problems. In the past decade, considerable research has been carried out in various domains to identify and recognize sound sources from their acoustic signatures, but these research works have not been surveyed and compiled into a comprehensive review. Our work serves as a detailed guide for future researchers. In our work, we have attempted to study and review recent research works on acoustic source identification using AI methods and have organized them by application.
In this paper, we have presented an in-depth survey of ASI in industry for fault detection, underwater for target recognition, surveillance, medical applications for disease diagnosis and fall detection, and some others. Initially, we highlighted potentially available databases for future research to start with. Then, we highlighted a few basic audio processing steps. Afterwards, an overview of FE techniques in the time, frequency, and cepstral domains was presented to help researchers choose the best technique for the given problem, dataset, and AI algorithm. We have also briefly discussed some of the traditional ML and DL algorithms that have been most used in the literature. In addition, we have given a comprehensive survey of ASI works along with their significant contributions in various fields in subsequent sections. Lastly, we have discussed some future research directions for the readers after explaining the concept and its significance.

• Local Ternary Patterns (LTPs): The local ternary pattern is an extended version of LBPs. Similar to LBPs, LTPs are extracted from spectrograms. The difference lies in the measurement scales of pixels. LBPs are scaled in a binary pattern (0 and 1) only, whereas LTPs are scaled into three values: −1, 0, and 1.
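The difference between the binary and ternary scales can be illustrated on a single 3 × 3 spectrogram patch. The sketch below assumes the commonly used formulation in which each neighbour is compared with the centre pixel, and in which LTP applies a tolerance t around the centre value; the patch values, the tolerance t = 5, and the function names are hypothetical and chosen only for illustration.

```python
def neighbours(patch):
    """Eight neighbours of the centre pixel, clockwise from the top-left."""
    return [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
            patch[2][2], patch[2][1], patch[2][0], patch[1][0]]


def lbp_code(patch):
    """Binary pattern: 1 if the neighbour is at least the centre value, else 0."""
    centre = patch[1][1]
    return [1 if p >= centre else 0 for p in neighbours(patch)]


def ltp_code(patch, t):
    """Ternary pattern with tolerance t (assumed parameter) around the centre."""
    centre = patch[1][1]
    code = []
    for p in neighbours(patch):
        if p >= centre + t:
            code.append(1)
        elif p <= centre - t:
            code.append(-1)
        else:
            code.append(0)
    return code


# Hypothetical spectrogram magnitudes for one 3 x 3 patch.
patch = [[54, 57, 62],
         [48, 55, 70],
         [40, 53, 58]]

print(lbp_code(patch))       # [0, 1, 1, 1, 1, 0, 0, 0]
print(ltp_code(patch, t=5))  # [0, 0, 1, 1, 0, 0, -1, -1]
```

Neighbours that lie within the tolerance band of the centre are mapped to 0 by LTP, so small fluctuations around the centre value no longer flip the code as they do for LBP.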
compared different acoustic FE techniques based on robustness to noise and spectro-temporal representation. According to the results, the spectrogram, gammatone filterbank, and Zweig impedance function-based linear transmission line produced good outcomes against noise at −5 dB, whereas the wavelet feature scored worst at +2 dB. In terms of spectro-temporal representation, the Zweig impedance function-based linear transmission line and the wavelet feature performed better than the Mel spectrogram. The gammatone filterbank and spectrogram performed satisfactorily during the test phase.

FIGURE 4. Traditional machine learning and deep learning methods.

• Support Vector Machines (SVM): SVM is another supervised learning classifier used for classification and regression analysis. SVM works with various kernels based on the number of classes. Different SVM kernels are used for various problems, such as linear, polynomial, Radial Basis Function (RBF), and Gaussian. SVM uses a set of hyperplanes or decision boundaries in N-dimensional space to separate different classes. The number of features is the decisive factor that determines the dimension of the hyperplane. SVM has proven to be a promising classifier for generalization problems and can work well with small datasets. Some of the applications of SVM include text categorization [135], image classification [136], and environmental source detection [54].
• Hidden Markov Model (HMM): HMM is a statistical classifier that consumes less computational power than other classifiers. HMM works on the principle of Markov chains. The Markov chain stays hidden in the process of observing events in different states of the Markov chain. In HMM, variables can be continuous or discrete. HMM learns the path of a trajectory from an existing dataset containing classified trajectories. Therefore, for classification and recognition purposes, a flying object (hidden model) is classified knowing only its trajectory.
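To make the kernel families named for SVM concrete, the sketch below evaluates the standard linear, polynomial, and RBF kernel functions on two small feature vectors. The hyperparameter defaults (degree, coef0, gamma) and the example vectors are assumptions for illustration. Note that the RBF kernel as written here, an exponential of the negative squared Euclidean distance, is the form usually also called the Gaussian kernel.

```python
import math


def linear_kernel(x, y):
    """Plain inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree; degree and coef0 are assumed defaults."""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); gamma is an assumed default."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical 2-D feature vectors.
x, y = [1.0, 2.0], [2.0, 0.0]

print(linear_kernel(x, y))      # 2.0
print(polynomial_kernel(x, y))  # 9.0
print(rbf_kernel(x, y))         # exp(-2.5)
```

Each kernel acts as a similarity score between two feature vectors; the choice of kernel decides whether the resulting decision boundary is linear or non-linear in the original feature space.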
are primarily used to detect, identify and classify objects based on given visual image data [144]. CNN comprises four types of layers: convolution layer, pooling layer, fully connected layer, and non-linear layer. In CNNs, nodes are capable of weight sharing and thus possess an important property called shift-invariance. This is why CNNs have convolutional layers along with linear filter banks on input layers. They have applications in video recognition, recommender systems, object detection, image classification, and natural language processing.
• Tensor Deep Stacking Network (TDSN): TDSN is a DL algorithm that learns from parallel hidden layers in each unit. TDSN is an extension of a deep stacking network (DSN) that has sequential layers only in its modules. There is no change in the stacking operation of TDSN compared to DSN. TDSN and DSN have the same computational complexity and scalability. In addition to this, TDSN offers training in hidden representations to encode speaker and environment information to include their factors [145]. Khamparia et al. performed sound classification using TDSN and obtained an accuracy of 56.00% [146].
• Image recognition network: Image-based recognition networks are very deep CNNs specifically designed for image features. Some of the image recognition networks include AlexNet, GoogLeNet, LeNet, and VGG16. AlexNet used a gradient descent optimization function with all the layers using a uniform learning rate of 0.001. AlexNet has eight layers, whereas GoogLeNet is deeper with 22 layers. GoogLeNet is a promising deep CNN that can avoid the overfitting caused by many deep layers by using multiple-size filters at the same level of operation [147]. These image recognition networks are used on the ImageNet dataset for various applications.
• Deep Belief Neural Network (DBNN): DBNN is a traditional deep neural network (DNN) that faces problems like slow learning and works on big databases only. DBNN has multiple connected layers. When a network is trained unsupervised, it can construct its input layer depending on probabilities. Other layers can detect features and can therefore be further trained under supervision for accurate predictions. DBNN has performed better than HMM and neural networks for event source detection [148].
• Convolutional Recurrent Neural Network (CRNN): CRNN combines a convolutional neural network (CNN) and an RNN and has presented better results in the audio processing domain.

E. EVALUATION METRICS
The evaluation metric is a useful criterion to study the quality of an AI algorithm. Evaluation of an AI algorithm is essential for any project. Many different performance metrics can be used to test a model. It is very important to include multiple evaluation metrics in a study to deliver a detailed performance report of an AI model. In this subsection, we discuss some of the prominent evaluation metrics relevant to AI-based ASI from the literature. The confusion matrix is the most widely used evaluation method for classification-based problems. A confusion matrix is a 2-dimensional N × N matrix that is used to summarize the classification results. In this matrix, one dimension represents the predicted class and the other dimension represents the actual (true) class in a given problem. In a binary classification problem, four important terms describe each entry of the confusion matrix. True positive (TP) is the correct prediction of the positive class; true negative (TN) is the correct prediction of the negative class; false positive (FP) is the wrong prediction of the positive class; and false negative (FN) is the wrong prediction of the negative class.
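The four confusion-matrix entries defined above can be counted directly from paired true and predicted labels, and the commonly reported metrics follow from them. The sketch below is a minimal illustration for the binary case; the label vectors and function names are hypothetical and do not come from any surveyed study.

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # positive predicted, positive in truth
        elif pred == positive:
            fp += 1   # positive predicted, negative in truth
        elif truth == positive:
            fn += 1   # negative predicted, positive in truth
        else:
            tn += 1   # negative predicted, negative in truth
    return tp, tn, fp, fn


def summary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 derived from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical labels: 1 = target sound present, 0 = absent.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

counts = binary_confusion(y_true, y_pred)
print(counts)                    # (3, 3, 1, 1)
print(summary_metrics(*counts))  # (0.75, 0.75, 0.75, 0.75)
```

Reporting precision and recall alongside accuracy matters mainly for imbalanced datasets, where a classifier can reach high accuracy while missing most instances of the rarer class.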

TABLE 5. (Continued.) Summary of event or environmental source identification using artificial intelligence techniques.
(CNN). A majority voting optimization algorithm is used to optimize the classification results. They achieved 93.89% and 86.25% accuracy for the binary-class and multi-class cases, respectively. Another study by Bilal and team [235] classified heart sounds using a 1D-CNN. They proposed a classification model employing Local Binary Pattern (LBP) and Local Ternary Pattern (LTP) features. Using the PASCAL and PhysioNet 2016 datasets, they scored 91.66% and 91.78% classification accuracy, respectively. Recently, the authors in [236] investigated heart sound classification aimed at the diagnosis of disease due to heart failure. Two heart sound datasets (PhysioNet and PASCAL) are used and preprocessed to generate MFCC features. Principal component analysis and linear discriminant analysis have been used for feature selection and dimensionality reduction. Eventually, SVM, gradient boosting algorithm (GBA), and random forest classifiers are trained on those features to perform the classification task.

TABLE 1. Brief summary of relevant review works.
FIGURE 1. Acoustic source identification process.

TABLE 2. Datasets used for ASI in various applications.

TABLE 3. (Continued.) Summary of ASI in industrial machinery fault detection.

TABLE 4. Comparison of AI-based ASI in underwater applications.