An Investigation of Preprocessing Filters and Deep Learning Methods for Vessel Type Classification With Underwater Acoustic Data

The illegal exploitation of protected marine environments has consistently threatened the biodiversity and economic development of coastal regions. Extensive monitoring in these, often remote, areas is challenging. Machine learning methods are useful in object detection and classification tasks and have the potential to underpin techniques for the development of robust monitoring systems to overcome this problem. However, development is hindered by the limited number of publicly available labelled and curated datasets. Furthermore, there are relatively few open-source state-of-the-art methods to be used for evaluation. This paper presents an investigation of automated classification methods using underwater acoustic signals to infer the presence and type of vessels navigating in coastal regions. Various combinations of deep convolutional neural network architectures and preprocessing filter layers were evaluated using a new dataset based on a subset of the extensive open-source Ocean Networks Canada hydrophone data. Tests were conducted in which VGGNet and ResNet networks were applied to classify the input data. The data was preprocessed using either Constant Q Transform (CQT), Gammatone, Mel spectrogram, or a combination of these filters. With over 97% accuracy, using all three preprocessing representations simultaneously yielded the most reliable result. However, a high accuracy of 94.95% was achieved using CQT as the preprocessing filter for a ResNet-based convolutional neural network, providing a trade-off between model complexity and accuracy; a result that is more than 10% higher than previously reported approaches. This more accurate classifier for underwater acoustics could be used as a reliable autonomous monitoring system in maritime environments.


I. INTRODUCTION
Illegal fishing represents a serious problem for society in general, affecting not only the marine life through destructive trawling but also the local economy of coastal areas, which depends economically on this ecosystem for subsistence. Therefore, the detection and classification of illegal vessels situated in law-protected areas represent a poignant need for the surveillance and protection of the coastal ecosystem.
The associate editor coordinating the review of this manuscript and approving it for publication was Yougan Chen.
Nowadays, there is a large number of applications that involve maritime classification tasks, such as the identification of underwater archaeological remains [1], the inspection of underwater structures for the offshore industry [2], [3], the surveillance of shorelines [4], the identification of vessels [5], as well as applications in environmental sciences, like counting and classifying the various marine species for biological research [6]. Also worth mentioning are studies relating the acoustic signals in the sea to environmental pollution, affecting not only the marine life [7], [8], [9], but also the human activities in port areas [10], [11], [12]. In this context, the identification of vessels from acoustic data is selected as the domain of interest for this paper.
Some technologies, such as the Automatic Identification System (AIS), which contains Global Positioning System (GPS) data, and satellite images, can be applied in the surveillance of the marine environment. However, these approaches have limitations. For instance, high costs and the maintenance of accurate instrument calibration are still challenges for satellite imagery [13]. GPS signals, on the other hand, can be masked or falsified to limit the system capabilities or even to hide illegal activities. In contrast, the acoustic signals emitted by vessels and captured using hydrophones provide a low-cost and fraud-resistant data source to be used in surveillance tasks, as acoustic signals can be difficult to suppress or mask. Efforts in this area can also be the initial step towards determining how such signals can be analysed, becoming the gateway to understanding how acoustic environmental pollution is affecting marine life.
As the classification of underwater acoustic signals gained importance, the complexity of the data made the task infeasible for traditional (time-frequency) methods, which relied primarily on human operators. Time-frequency representations, such as those based on the Fourier transform or on temporal data segments [14], can be applied in different forms, such as a linear scale (e.g., short-time Fourier transforms) or a logarithmic scale (e.g., Mel filter banks). Both strategies produce a two-dimensional time-frequency representation of the signal, which can be used to analyse the features of the sound. However, the underwater acoustic signal is a mixture of environmental, biological, and human-generated sounds. Therefore, it has a low signal-to-noise ratio (SNR) and a high degree of variability for the same source [15], [16], raising the difficulty of the recognition task. In order to cope with this issue, recent studies, mostly based on the application of Deep Learning (DL) methods [17], [18], [19], have shown promise in automatic data classification tasks.
Many DL methods for object detection and classification have been successfully developed in the past few years for computer vision applications [20]. The use-case presented in this paper allows for the application of these methods to other data domains, which could inherit from the solutions developed in the visual domain. In this context, as the time-frequency representations are two-dimensional representations of the acoustic signal, an opportunity arises to apply the DL strategies, originally developed for computer vision, to acoustic analysis. Numerous DL solutions for the acoustic domain are now based on Convolutional Neural Networks (CNNs) [21], [22]. Although they can be applied to raw audio, they are often applied to two-dimensional audio representations, such as spectrograms. The most recent studies have used VGGNet [23] or ResNet [24] models as base algorithms for this development, owing to the models' high accuracy in complex classification tasks. Although the use of time-frequency representations as inputs to CNNs is a promising approach, its further development depends on both the representation used as input and the model that receives it. Also, as the sound generated by sources in the underwater domain is dependent on the environment, it is essential not only that the type of two-dimensional representation matches the problem, but also that its parameters are optimised for the task.
This paper describes an investigation into the automated classification of four distinct classes of vessels in marine environments using DL. Single channel underwater acoustic signals obtained by research-grade hydrophones were used as input, and the impact of the application of distinct preprocessing methods on the DL classification task was explored. Classification results from two key state-of-the-art DL methods, VGGNet and ResNet, were compared. The impact of applying three distinct preprocessing filters, Mel Spectrogram, Constant Q Transform (CQT), and the Gammatone-like spectrograms (or just Gammatone), was also evaluated, as was the impact of a combination of these three filters into a three-channel representation. The complete pipeline was trained and tested on three scenarios characterised by the distance between the objects of interest and the hydrophone.
In order to better monitor and protect marine environments, there is a need for an autonomous monitoring system that can generate an alert whenever a particular class of vessel is detected in an area. Towards that end, the main contributions of this paper can be summarised as follows:
• Creation of an open-source pipeline for the classification of vessels from underwater acoustic signals using Machine Learning;
• Comparison of the Adam and Stochastic Gradient Descent (SGD) optimisers for spectrogram analysis;
• Evaluation of two different neural network architectures for acoustic classification (VGGNet and ResNet);
• Comparison of three different preprocessing filters (CQT, Mel Spectrogram, and Gammatone);
• Investigation into combining CQT, Mel Spectrogram, and Gammatone representations into a three-channel signal, generating a higher dimensional input signal to the network;
• Analysis of the relation between the distance of the object of interest to the hydrophone and the accuracy of classification methods;
• Presentation of a new open-source curated dataset containing underwater acoustic signals classified into different scenarios based on the distance from the vessel to the sensor.

II. RELATED WORK
The task of underwater acoustic target classification is challenging due to the complex nature of the sound produced by vessels [25]. Usually, this sound is produced by the set of mechanical components in the vessel's propulsion system, such as its engine, as well as by hydrodynamic interactions of the propeller. The former typically produces a broadband continuous spectrum, while the latter generates narrow band components whose spectrum consists of power at discrete frequencies [26]. As there are different types of vessels, in diverse states of upkeep, the sound produced by them is fundamentally distinct from one another, depending on the vessel's speed, the state of its mechanical parts, and the hydrodynamics of its design. Also, additional complexity exists due to the background sound produced by the region and the complexity of sound propagation in shallow waters, which causes multi-path reflections [27]. Environmental conditions, such as temperature, depth, salinity, pressure, and even precipitation, can directly influence how the signal travels from emitter to receiver [28]. In this context, some classical signal processing methods, such as Cepstral analysis [14], can improve the quality of the processed sound by reducing the effects of the reflections interference and scattering losses, but only if applied on signals for short ranges with a high SNR [29]. Early developments in the analysis and classification of underwater acoustic signals focused on time-frequency analyses, such as the use of Fourier transforms [14]. However, recent state-of-the-art methods are largely based on the application of deep-learning algorithms to solve similar tasks [17], [18], [19]. Advances in machine learning techniques mean CNNs are now being considered for underwater acoustic classification applications [25]. 
Consistent with this trend, the trade-off between the accuracy and model size of various CNN models for mine-like object detection from side-scan sonar images was investigated [30]. The comparative results reported suggest that deeper models (i.e., models with multiple layers) achieved an accuracy improvement of less than 1% when compared with shallow models, at the cost of a 17-fold increase in computational requirements. This suggests that smaller models can offer a beneficial trade-off between processing time and accuracy. Similarly, the impact caused by distinct network topologies on the problem of underwater acoustic target classification is an important issue that has been recently considered [31]. A properly tuned model is capable of outperforming recent DL methods, such as a CNN-extreme learning machine [32], ResNet18 [24], and SqueezeNet [33]. The strong results presented in [30], [31] suggest that the search for the most suitable network topology, and the optimisation of its parameters, are essential tasks that should be considered in the development of any CNN-based classification system.
Although CNNs can be applied directly to the audio signal [31], [34], [35], acoustic filters are frequently used as preprocessing layers to improve the quality of the resulting audio representations [25]. Therefore, not only should the CNN parameters be investigated, but also which filters and features best contribute to the development of effective DL methods for the underwater acoustic domain. To this end, recent work has investigated the effect of various preprocessing methods on the original audio signal, including magnitude Short-time Fourier transform (STFT) spectrum, complex-valued STFT spectrum, Mel-log spectrum, and Mel-frequency cepstral coefficients (MFCCs), as inputs to real-valued and complex-valued ResNet and DenseNet CNNs [36]. The results obtained using preprocessing filters were considerably better than the baseline approach where a CNN was directly applied to classify the raw audio signal. Similarly, Mel-log spectrograms, delta, and delta-delta features were also used as acoustic filters in a ship detection task using a CNN, where high accuracy in the detection and localisation of vessels was reported [37]. Other studies in the literature also successfully applied filters to the DL inputs, showing a consistent improvement in underwater audio classification tasks [38], [39], [40], [41]. This strongly suggests that, although CNNs are capable of learning distinct filters in their convolutional layers, there may be insufficient training time or data for the network to converge on the best solution. Therefore, superior results, in addition to smaller networks and reduced training time, are obtained with the use of appropriate preprocessing filters in the classification pipeline.
Analogous to the research described in the present paper, recent work has been driven by the advantage of using preprocessing filters to extract optimised features from the audio, also using stacks of multiple filters as inputs to the CNN models [42], [43]. The rationale behind this approach is to take advantage of the strengths of each method, feeding the network with different representations of the sound. For instance, a joint learning framework was developed to address the underwater acoustic target classification using MFCC, CQT, Gammatone, and Log-Mel feature extraction methods to feed a CNN-based architecture [42]. The comparison of the results obtained with individual approaches and their combination showed that superior outcomes could be achieved with the latter. Another relevant work used a fusion of the Mel-spectrogram, MFCC, chromatogram, spectral contrast, and Tonnetz filters, resulting in a one-dimensional representation, to improve the performance of a CNN model for the classification of underwater acoustic signals [43].
A summary of the classification methods cited in this section is shown in Table 1 which relates, for each method, the preprocessing applied (if any), the model architecture, the dataset used, the best reported accuracy, and the main contributions. A more complete up-to-date survey of this field can be found elsewhere [25].
There are a number of recent papers concentrating on the classification of underwater acoustic data. However, there is a pertinent need for a complete investigation into the application of DL algorithms for the task, an investigation that considers the optimisation of the DL model parameters, and the comparison between different preprocessing filters. Additionally, analysis of the impact of environmental variables on vessel classification is virtually non-existent in the related literature. These issues are taken into account in the research reported in this paper.

III. DATASET
The data used in this work consisted of signals obtained from the Ocean Networks Canada initiative, captured during the deployment at the Strait of Georgia, Canada, from June 24 to November 3, 2017, representing typical pre-pandemic operations during the Summer and Autumn seasons. An icListen AF Hydrophone, located 147 metres below sea level, was used to obtain the acoustic signals. In addition, the positional information about the vessels was obtained using Automatic Identification System (AIS) data.
The first part of the annotation process focused on the translation and filtering of the AIS signals. These signals contained position, identification, speed, course, and other information about active maritime traffic. Some of the information contained in AIS data is not necessary for vessel classification tasks. Only messages related to position report, as well as static and voyage related data, were used. Duplicated messages, and messages that did not have positional arguments, were filtered out. The vessel's class was then inferred from the type of ship and cargo fields of the AIS messages, generating four categories: Tug, Passengership, Cargo, and Tanker. Using the positional coordinates, a geodesic distance calculation was performed to estimate the distance from the vessels in the area of interest to the hydrophone. As the update rate of AIS data is related to the vessel's size, cargo, velocity, etc., there are intervals where gaps appear in the AIS reports. It was deemed safe to linearly interpolate between these sparse data points to provide greater resolution of the vessel's distance to the hydrophone.
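The distance estimation and gap filling described above can be illustrated with a minimal sketch. It assumes a spherical-Earth haversine formula as a stand-in for the geodesic calculation, and it assumes that the interpolation is applied directly to the distance values; neither detail is specified in the text, and the function names are hypothetical.

```python
import math

# Mean Earth radius in metres; a spherical approximation (assumption).
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def interpolate_distance(t, t0, d0, t1, d1):
    """Linearly interpolate the distance at time t between two AIS reports.

    (t0, d0) and (t1, d1) are the time and distance of the surrounding
    reports; t is expected to lie between t0 and t1.
    """
    if t1 == t0:
        return d0
    w = (t - t0) / (t1 - t0)
    return d0 + w * (d1 - d0)
```

For example, with reports at 0 s (1000 m) and 10 s (3000 m), `interpolate_distance(5, 0, 1000, 10, 3000)` gives the midpoint distance of 2000 m.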
Different subsets of data were generated from the original data considering the distance from the vessel to the hydrophone picking up the vessel's sound. These subsets, or scenarios, were created considering inclusion and exclusion radii. The inclusion radius was defined as the radial distance within which only one vessel was present at a given moment, whereas the exclusion radius defined the wider region within which no other vessel was present. To isolate a single vessel as much as possible, scenarios were generated as illustrated in Figure 1, where a vessel was within the inclusion radius while no other vessels were within the wider exclusion radius.
These scenarios facilitated the analysis of the classification accuracy concerning the distance between the object of interest and the sensor. As the problem of vessel classification using machine learning depends on the quality of the input data, it was expected that the sound emitted by distant sources would have a lower SNR and, thus, lower classification accuracy. The three scenarios considered in this work were created based on the available data: the first had an inclusion radius of 2 km and an exclusion radius of 3 km; the second had 3 km and 4 km as the inclusion and exclusion radii; and the third had radii of 4 km and 6 km. Table 2 summarises the scenario descriptions. The background class for each scenario was then generated based on the absence of vessels within the inclusion and exclusion radii combined. The final stage of the dataset formulation was the combination of every AIS instance, defined as the period that matched a specific scenario, with the acoustic data.
FIGURE 1. Diagram representing a scenario. Data was isolated where only a single vessel was within the inclusion zone while no other vessels were within the exclusion zone. This ensured a more reliable acoustic signature without interference from other vessels.
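The scenario logic can be sketched as a simple rule applied to the distances of all vessels at one instant. This is an illustrative sketch only: the radii come from the text, while the function name, the handling of boundary values, and the decision to discard ambiguous instants are assumptions.

```python
# (inclusion_km, exclusion_km) for the three scenarios given in the text.
SCENARIOS = {
    1: (2.0, 3.0),
    2: (3.0, 4.0),
    3: (4.0, 6.0),
}


def label_instant(distances_km, inclusion_km, exclusion_km):
    """Label one time instant from the vessel-to-hydrophone distances.

    Returns "vessel" when exactly one vessel is inside the exclusion radius
    and it also lies inside the inclusion radius, "background" when no
    vessel is inside the exclusion radius, and None when the instant is
    ambiguous and should be discarded (assumed behaviour).
    """
    inside_exclusion = [d for d in distances_km if d <= exclusion_km]
    if not inside_exclusion:
        return "background"
    if len(inside_exclusion) == 1 and inside_exclusion[0] <= inclusion_km:
        return "vessel"
    return None
```

For the first scenario, a single vessel at 1.5 km with another at 5 km is labelled as a vessel instant, whereas a second vessel at 2.5 km would cause the instant to be discarded.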
This automatic annotation procedure could generate mislabelling in the dataset; therefore, the results were further analysed and filtered to avoid this issue. A data cleaning process was performed, noting that the variation of the time domain amplitude of a vessel was greater than that of the background sound. First, a median filter (med()) was used to de-noise the original signal (a(t), where t represents time). The resulting audio was subtracted from the original signal (Equation (1)), producing an audio signal (g(t)) free of DC offset. The standard deviation of g(t), written std(g) and given in Equation (2), was used to generate a scalar value of the amplitude variation for each 1-second signal segment. The mean and standard deviation of these values were obtained for the vessel and background sounds, respectively ($\mu_{\text{vessel}}$, $\sigma_{\text{vessel}}$) and ($\mu_{\text{back}}$, $\sigma_{\text{back}}$). As expected, this analysis showed that the tagged vessel data delivered higher variation when compared with the background audio (i.e., $\mu_{\text{vessel}} > \mu_{\text{back}}$).
Individual segments tagged to contain a vessel, but with a value that was less than the mean of the background values increased by their standard deviation ($\mu_{\text{back}} + \sigma_{\text{back}}$), were removed from the collection as they represented potentially mislabelled signals in the dataset.
$$g(t) = a(t) - \mathrm{med}(a(t)) \qquad (1)$$
$$\mathrm{std}(g) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(g_i - \bar{g}\right)^2} \qquad (2)$$
In the equations above, $\bar{g}$ represents the mean value of the signal with the median removed, and N is the number of audio recordings.
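The cleaning rule can be sketched as follows. This is a minimal illustration under stated assumptions: the median filter kernel size is not given in the text (a value of 5 samples is used here purely for illustration), the population standard deviation is assumed, and all function names are hypothetical.

```python
import statistics


def median_filter(signal, kernel=5):
    """Sliding-window median; the window is truncated at the signal edges."""
    half = kernel // 2
    return [
        statistics.median(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]


def amplitude_variation(segment, kernel=5):
    """Standard deviation of g(t) = a(t) - med(a(t)) for one segment."""
    smoothed = median_filter(segment, kernel)
    g = [a - m for a, m in zip(segment, smoothed)]
    return statistics.pstdev(g)


def keep_vessel_segment(variation, background_variations):
    """Keep a vessel-tagged segment only if its variation reaches the
    background mean plus one background standard deviation."""
    mu_back = statistics.mean(background_variations)
    sigma_back = statistics.pstdev(background_variations)
    return variation >= mu_back + sigma_back
```

With background variations of 1.0 and 3.0, the threshold is 2.0 + 1.0 = 3.0, so a vessel-tagged segment with a variation of 2.5 would be removed as potentially mislabelled.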
The final version of the dataset was composed of the three scenarios, summarised in Table 2, each one with audio files saved as raw, uncompressed, WAV files. Also, a Comma-Separated Value (CSV) file was generated with the annotation of the vessel type for each scenario. In this work, each audio file was divided into 1-second segments, which were used as inputs to the preprocessing filters. The complete data was divided into Training, Validation, and Test subsets, following the proportion of 85%, 10%, and 5%, respectively. As there was a class imbalance problem, only the Training subset was balanced using an oversampling strategy. An oversampling factor, Equation (3), was used to define the size of each class based on the class with the smallest length.
In Equation (3), L represents the size in seconds of the class with the most data points, and l represents the size in seconds of the class with the fewest data points.
For each category, the audio recordings were selected randomly to compose the dataset. If the size of the class did not reach the minimum size defined by the oversampling factor (Equation (3)), the selection started again, gathering repeated recordings until the desired length was achieved. However, uniqueness of each recording was enforced. Table 3 contains the duration of each subset for the dataset scenarios.
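One possible reading of this selection procedure is sketched below. The target duration is passed in as a parameter, since the oversampling factor itself is not reproduced here. The interpretation that uniqueness holds within each pass over the recordings, with repeats arising only from additional passes, is an assumption, as are the function and parameter names.

```python
import random


def oversample(clips, target_seconds, seed=0):
    """Select clip identifiers until their total duration reaches the target.

    clips is a list of (clip_id, duration_seconds) pairs. Each pass visits
    every clip at most once in random order; further passes introduce
    repeats only when a single pass is not enough.
    """
    if not clips or sum(duration for _, duration in clips) <= 0:
        raise ValueError("clips must contain a positive total duration")
    rng = random.Random(seed)
    selected, total = [], 0.0
    while total < target_seconds:
        order = list(clips)
        rng.shuffle(order)
        for clip_id, duration in order:
            if total >= target_seconds:
                break
            selected.append(clip_id)
            total += duration
    return selected
```

For a class holding three 10-second recordings and a target of 50 seconds, the first pass contributes all three recordings and a second pass contributes two repeats.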
The next section introduces the concepts of each of the preprocessing filters used in this work, and the preprocessing pipeline is described in Section V-A.

IV. PREPROCESSING FILTERS
Two-dimensional representations of audio files can take the form of spectrograms, which represent the frequency distribution of the original signal over time. One of the possible ways to formulate such representations is using a window function applied along the length of the one-dimensional signal, dividing it into smaller (fixed) chunks. These chunks are then processed, generating the information about the frequencies in that period. Therefore, the horizontal axis of the resulting two-dimensional representation is highly dependent on the chosen initial time window. The vertical axis represents the frequency distribution of the sound and it is commonly represented either linearly or logarithmically. For the problem of sound classification, the logarithmic representation of the frequency is preferred over the linear representation, following the analogy with the human auditory system [46]. In this context, the present work focused on the application of three common methods for spectrogram generation based on non-linear frequency scales: Mel Spectrograms, Constant Q Transform (CQT), and Gammatone Spectrograms (as described below). These methods were used here to enhance features of the original signal, and their output served as input to the CNN models investigated in this work.

A. MEL SPECTROGRAMS
The Mel spectrogram is a representation of the short-term sound power spectrum. Mel's scale is empirically based on the way humans perceive sound [47]. The formulation of this scale consisted of submitting observers to different frequencies of sounds, while recording their perception and sensitivity to the stimulus. There are different mathematical formulations for the conversion between the frequency f in Hertz to m in Mels, such as that represented in Equation (4) [48].
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (4)$$
Mel Spectrograms are commonly used in speech recognition analysis and music processing, where human perception is extremely relevant [49].
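As a small worked example, the Hertz-to-Mel conversion can be written in a few lines, using the widely used form m = 2595 log10(1 + f/700). The inverse function is included for convenience and is not part of the text; both function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hertz to Mels."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse conversion, from Mels back to Hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Under this formulation a 1 kHz tone maps to approximately 1000 Mels, and the mapping becomes increasingly compressive at higher frequencies.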

B. CQT
The Constant Q Transform (CQT) [50] uses a constant base scale (Q) to create a representation. This improves the resolution between frequencies of interest, while providing the means to solve the problem of fundamental frequency identification. In contrast with the classic Fourier transform, CQT is a bank of geometrically-spaced filters in which, for the k-th filter, the central frequencies are evaluated with Equation (5), where b represents the number of filters per octave. Thus, the spacing between two adjacent filters is given by Equation (6). The quality factor Q (or constant Q) is defined as the ratio of frequency to resolution, as stated by Equation (7).
The correct tuning of the quality factor Q can supply the needed information for the acoustic analysis, with resolution to distinguish adjacent musical notes, where a sound with harmonic frequency components will produce a constant pattern in the log frequency domain [50]. This representation also increases time resolution towards higher frequencies, resembling the human auditory system, while emphasising lower frequencies.
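The geometric spacing and the constant quality factor can be illustrated numerically. The sketch below assumes the commonly cited CQT formulation, in which the k-th centre frequency is f_min · 2^(k/b) and Q = 1 / (2^(1/b) − 1); the symbol f_min, the function names, and the example values are assumptions introduced for illustration rather than notation from this paper.

```python
def cqt_center_frequencies(f_min, n_bins, bins_per_octave):
    """Geometrically spaced centre frequencies: f_k = f_min * 2**(k / b)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def quality_factor(bins_per_octave):
    """Ratio of centre frequency to the spacing to the next filter."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


# Example values chosen for illustration only.
freqs = cqt_center_frequencies(f_min=55.0, n_bins=24, bins_per_octave=12)
ratios = [freqs[k] / (freqs[k + 1] - freqs[k]) for k in range(len(freqs) - 1)]
```

Every entry of `ratios` equals `quality_factor(12)` up to floating-point error, which is the sense in which Q is constant across the filter bank, and the frequency doubles every 12 bins.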

C. GAMMATONE SPECTROGRAMS
The gammatone filter was first defined as a filter bank capable of representing the shape of the impulse response function of the human auditory system [51]. A gammatone function can be obtained with the mathematical formulation shown in Equation (8), where n is the filter order, i is the filter index (ranging from 1 to the total number of filters), b is a bandwidth parameter, f is the filter centre frequency, and α is the phase of the impulse response. The function defined in Equation (8) was used by [51] to summarise the RevCor, a representation of the correlation between a sound stimulus on the human ear and the response of a primary auditory fibre [52]. The first term of Equation (8) represents a gamma function, and the second term represents the tone of the stimulus. This representation has an amplitude characteristic that can be used to predict the human auditory response. It also has a minimum-phase characteristic, which is a preferred feature for an auditory filter bank [52]. Gammatone filter-banks facilitate the representation of the signal's time domain response, as gamma filters are broader on lower frequencies and narrow on higher ones, emphasising the lower spectrum.
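The product of a gamma-shaped envelope and a tone can be sketched numerically. The code below assumes the commonly cited form of the gammatone impulse response, t^(n−1) · exp(−2πbt) · cos(2πft + α) for t ≥ 0, with amplitude normalisation omitted; this form is an assumption and may differ in detail from Equation (8), and the function name and example values are hypothetical.

```python
import math


def gammatone_impulse(t, n, b, f, alpha=0.0):
    """Gammatone impulse response at time t (seconds).

    n: filter order, b: bandwidth parameter (Hz),
    f: centre frequency (Hz), alpha: phase (radians).
    """
    if t < 0:
        return 0.0
    envelope = t ** (n - 1) * math.exp(-2.0 * math.pi * b * t)  # gamma term
    tone = math.cos(2.0 * math.pi * f * t + alpha)              # tone term
    return envelope * tone


# Illustrative 4th-order filter centred at 1 kHz, sampled at 32 kHz for 10 ms.
response = [gammatone_impulse(k / 32_000, 4, 125.0, 1000.0) for k in range(320)]
```

The envelope rises from zero, peaks at t = (n − 1) / (2πb), and then decays, while the cosine term sets the oscillation at the centre frequency.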
The raw signal and its processed representations (CQT, Gammatone, and Mel spectrograms) are shown in Figure 2.
In this work, the three preprocessing methods described herein were used on the underwater acoustic signals to extract relevant features from the acoustic signal, generating the two-dimensional representations used as inputs to CNNs. Section V describes the successive stages of the implementation of this work.

V. DEVELOPMENT STEPS
This section describes the development steps performed to address the classification of underwater acoustic signals using preprocessing filters and CNNs, detailing the procedures and experimental setup.

A. PREPROCESSING FILTERS
Each entry in the original dataset was divided into one-second segments. Segments smaller than one second were padded with zeros. After that, the three proposed preprocessing methods, CQT, Gammatone, and Mel spectrogram, were applied to each audio file. Initially, to establish a baseline for the audio classification based on standard values found in the literature, the window chosen to generate the spectrograms had 1024 samples, with a hop length of 512, resulting in 64 frequency bins over 63 time intervals per data segment. This resulted in each method producing 64 × 63 element images. This set of parameters is referred to as Version 1.
A second set of parameters (Version 2) was obtained by means of an optimisation process. The majority of the power in the underwater acoustic signal was concentrated in the low-frequency band, below 3 kHz. To maintain a safe range above the maximum frequency, spectrograms were generated from 18 Hz (the minimum acceptable for the CQT representation for 1-second audio segments) to the frequency of 4186 Hz (C8 note and ≈ 1 kHz above the 3 kHz experimentally observed maximum value). Using a hop length of 256, which represented half of the value proposed in Version 1, the resulting representation (Version 2) had a size of 95×126. The values for the two versions of parameter sets are summarised in Table 4, where the values not related to a particular representation are marked with ''-''.
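The reported image sizes can be checked with simple arithmetic. The sketch below rests on assumptions that are not stated in the text: a sampling rate of 32 kHz, centre-padded framing (so that the frame count is 1 + ⌊samples / hop⌋), and 12 CQT bins per octave. These values were chosen because they reproduce the reported dimensions, not because the source confirms them, and the function names are hypothetical.

```python
import math

SAMPLE_RATE = 32_000   # Hz; assumption, not stated in the text
BINS_PER_OCTAVE = 12   # assumption, not stated in the text


def n_frames(n_samples, hop_length):
    """Frame count for centre-padded framing: 1 + floor(n_samples / hop)."""
    return 1 + n_samples // hop_length


def n_cqt_bins(f_min, f_max, bins_per_octave):
    """Number of geometrically spaced bins needed to span f_min to f_max."""
    return math.ceil(bins_per_octave * math.log2(f_max / f_min))


print(n_frames(SAMPLE_RATE, 512))                # 63  (Version 1 time axis)
print(n_frames(SAMPLE_RATE, 256))                # 126 (Version 2 time axis)
print(n_cqt_bins(18.0, 4186.0, BINS_PER_OCTAVE)) # 95  (Version 2 frequency axis)
```

Halving the hop length doubles the temporal resolution of each one-second segment, which is why the Version 2 time axis is twice as long as that of Version 1.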
Inspired by the large variety of machine learning methods applied to three-channel images, such as colour images in, e.g., RGB (Red-Green-Blue) or HSV (Hue-Saturation-Value) colour spaces, the three preprocessing methods cited previously were combined into a single three-dimensional representation, which was then used as the input to the CNNs. This was motivated by providing the Neural Network with more complete representations, aiming to take advantage of the strengths of each of the preprocessing methods. Combining CQT, Gammatone, and Mel spectra resulted in data samples with dimensions of 64 × 63 × 3 for Version 1 and 95 × 126 × 3 for Version 2. This representation is called Complete.
The next section presents the Deep Learning methods used to classify the representations obtained with the preprocessing filters described above.

B. DEEP LEARNING MODEL DESIGN
As mentioned previously, this work used two distinct CNN models: VGGNet [23] and ResNet [24].
VGG-based methods can perceive granular spatial relations on images due to their use of a 3 × 3 kernel size, the smallest possible size to capture the four cardinal directions (up, down, left, and right). This reduced kernel size also produces a good trade-off between classification accuracy and hyperparameter complexity. The implementation of this model in the present work contained two main modifications from the original VGGNet: 1) A Leaky ReLU was used as the activation function instead of a normal ReLU; and 2) A Batch Normalisation layer was added. Both changes aimed at reducing overfitting. The resulting model architecture is shown in Figure 3 and was composed of four feature extracting convolutional layers. The signal then passed through a Batch Normalisation and a Leaky ReLU activation layer associated with a Max Pooling layer, which resized the image by a factor of 2. Lastly, the classification weights were delivered by a fully connected layer.
The universal approximation theorem [53] states that a deep enough neural network is capable of approximating any complex function, although the vanishing gradient and the accuracy degradation problems become problematic as more layers are added to the models. ResNet addresses these issues by introducing the identity shortcut connection, bypassing one or more layers in a forward pass, defining Residual Blocks. These blocks facilitate the learning ability of the intermediary layers, reducing the vanishing gradient problem, and penalising the ones that could potentially degrade accuracy [24]. This work used ResNet18, with modifications to the input layer to match the preprocessed images. The ResNet18 architecture, shown in Figure 4, is composed of a convolutional layer followed by 8 Residual Blocks, each one formed by two other convolutional layers. As usual, the classification weights are generated by the final fully connected layers.
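The identity shortcut can be illustrated with a toy example on plain Python lists. This is a conceptual sketch only, not the ResNet18 implementation used in this work: the `transform` argument stands in for the stacked convolutional layers of a block, and all names are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return relu(F(x) + x), where F is `transform` and x is carried
    unchanged along the identity shortcut."""
    fx = transform(x)
    return relu([a + b for a, b in zip(fx, x)])


# If the learned transform contributes nothing, the block passes its
# (non-negative) input through unchanged.
x = [1.0, 2.0, 0.5]
y = residual_block(x, lambda v: [0.0] * len(v))
```

Because the shortcut adds the input directly to the block output, a block can fall back to the identity mapping, which is the property that eases the training of deeper networks.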
After the model definition, an optimiser had to be chosen, aiming to minimise the error in the training procedure. This work investigated the use of the Stochastic Gradient Descent (SGD) [54] and the Adam [55] optimisers. SGD is an iterative method that starts from a random initialisation and seeks a minimum of the objective function. It is among the most common optimisers used in the literature. Adam, on the other hand, is an extension of SGD based on the combination of the Adaptive Gradient Algorithm (AdaGrad) and the Root Mean Square Propagation (RMSProp). The use of SGD and Adam was compared in the tests executed in this work, where an initial learning rate of 0.001, decayed exponentially with a gamma value of 0.95, was used over 40 epochs.
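The learning rate schedule can be made concrete with a short calculation. The sketch assumes that the decay is applied once per epoch, so that the rate at epoch e is 0.001 × 0.95^e (the behaviour of an exponential scheduler such as PyTorch's `ExponentialLR`); the exact scheduling used in the experiments is not detailed in the text, and the function name is hypothetical.

```python
def exponential_lr(initial_lr, gamma, epoch):
    """Learning rate after `epoch` exponential decay steps."""
    return initial_lr * gamma ** epoch


# Values from the text: initial rate 0.001, gamma 0.95, 40 epochs.
schedule = [exponential_lr(0.001, 0.95, epoch) for epoch in range(40)]
```

Under this assumption the learning rate falls by roughly an order of magnitude over training, from 1e-3 at the first epoch to between 1.3e-4 and 1.4e-4 at the last one.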
The resulting architecture was then composed of the preprocessed acoustic signals, produced by the four strategies described in Section V-A, applied to both CNN models (VGG-based and ResNet18). Each model was trained with batches of 8 images over 40 epochs using the Categorical Cross-Entropy loss function. The block diagram of the complete pipeline is shown in Figure 5. The preprocessing block is the representation of CQT, Gammatone, Mel, or the combination of all three preprocessing filters (Complete). The model block represents VGGNet or ResNet18.

VI. RESULTS
All training sessions were executed for the three chosen preprocessing filters in addition to the complete representation. Results are reported using micro-average accuracy, which measures the correct classifications of the classes combined. This provides a global overview of the model performance in realistic scenarios, i.e., with real observed class imbalances. Three additional metrics were used, providing complementary information: Precision, which represents the rate of correct positive predictions over the total positive predictions; Recall, which measures the rate of correct positive predictions over the real positive instances; and F1-score, which represents the weighted harmonic mean between precision and recall. These three additional metrics were evaluated using macro-averaging, which evaluates the classes separately before taking the average, aiming to obtain a balanced evaluation across the classes. The metrics were obtained using Equations (9), (10), (11), and (12), where K represents the number of classes in the dataset, TP stands for True Positive, TN for True Negative, FP for False Positive, and FN for False Negative.
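The relationship between the four metrics can be sketched from a confusion matrix. The code below is an illustration under stated assumptions, not a transcription of Equations (9) to (12): rows are taken as true classes and columns as predicted classes, macro F1 is taken as the mean of the per-class F1 values, and the two-class matrix is hypothetical.

```python
def classification_metrics(cm):
    """Micro-averaged accuracy and macro-averaged precision, recall, F1.

    cm[i][j] counts items of true class i predicted as class j (assumed layout).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    precisions, recalls, f1s = [], [], []
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp  # predicted i, truly other
        fn = sum(cm[i]) - tp                       # truly i, predicted other
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Hypothetical two-class example.
result = classification_metrics([[8, 2], [1, 9]])
```

Because the macro-averaged metrics weight every class equally, they drop sharply when a minority class is misclassified, even if the micro-averaged accuracy remains high.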
All work was performed on an Intel(R) Core(TM) i7-1065G7 machine and implemented using the PyTorch framework (version 1.11.0).

A. OPTIMISER SELECTION
SGD and Adam are two of the most common optimisers used in DL. However, their performances are domain dependent. This adds to the difficulty of selecting a standard approach for any classification problem. Therefore, the choice of a suitable optimiser is an essential step in the development of DL solutions. Table 5 presents the results of applying SGD and Adam to train a VGG-based classifier on the first dataset scenario described in Section III. This scenario provides the best SNR since the signals were collected at a short distance from the sensor. Also, the three preprocessing filters were applied using the Version 1 parameters to maintain the same comparison basis.
The results presented in Table 5 show that SGD outperformed Adam for CQT, Mel, and the Complete representation, where the latter had the highest values (as shown in bold font in Table 5). Adam performed marginally better than SGD in the test where the Gammatone filter was used as the preprocessing method. In addition, the training session using Adam and the Mel spectrogram did not converge to a global minimum, as the model predicted that almost everything belonged to the same class, as per the class-normalised confusion matrix shown in Figure 6. Comparing the best results for SGD and Adam, with CQT and Complete inputs, the higher accuracy was obtained with the SGD optimiser (84.46%), which is about 7 percentage points higher than the best result obtained with Adam (77.39%). Also, the SGD approach was more stable during the training procedure.
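A class-normalised confusion matrix makes this kind of collapse easy to spot. The sketch below uses hypothetical counts (not the values behind Figure 6) for a three-class model that assigns every item to the first class; rows are assumed to be true classes and columns predicted classes.

```python
def normalise_rows(cm):
    """Divide each row by its total so every true class sums to 1."""
    normalised = []
    for row in cm:
        total = sum(row)
        normalised.append([c / total if total else 0.0 for c in row])
    return normalised


# Hypothetical counts: every item is predicted as class 0.
collapsed = [
    [50, 0, 0],
    [30, 0, 0],
    [20, 0, 0],
]
normalised = normalise_rows(collapsed)

# Micro-averaged accuracy equals the share of the majority class,
# even though two of the three classes are never predicted.
accuracy = sum(collapsed[i][i] for i in range(3)) / sum(map(sum, collapsed))
```

In the normalised matrix every row places all of its mass in the first column, which is the pattern expected from a model that has collapsed onto a single class.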

B. PREPROCESSING OPTIMISATION
As mentioned in Section V-A, the Version 1 parameters for the preprocessing filters were generated based on the information from the related literature, resulting in a 64 × 63 image. Version 2 had images of dimensions 95 × 126, which were generated according to underwater acoustics features. An experiment was conducted to establish a comparison between these two representations, where the same baseline setup used in Section VI-A was applied: the VGG-based model trained on data from the first dataset scenario. As the results obtained in Section VI-A showed that the SGD optimiser produced better results, the experiments were only performed using this optimiser. The results of this test are summarised in Table 6.
Both the Gammatone and Mel Spectrogram methods presented lower accuracy values when compared with the CQT and Complete representations, as shown in Table 6. The worst CQT result, obtained with Version 1, was 17 percentage points better than the best result for the Gammatone filter. Moreover, when using the CQT method, Version 2 had an accuracy improvement of 9.13% (7.22 percentage points) over Version 1. This improvement was likely due to the parameter optimisation process, which led to an increase in the temporal scale with the shorter hop length, and an improvement of the frequency representation with optimised frequency boundaries. These results suggest that the VGG model performed better with Version 2 of the preprocessing parameters than with Version 1. Thus, Version 2 was considered as the baseline setting in the remainder of this work.
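The two ways of quoting the CQT improvement, relative (9.13%) and absolute (7.22 percentage points), are linked by the Version 1 accuracy. The sketch below only rearranges those two stated figures; the function names are hypothetical and the accuracies it produces are implied approximations subject to rounding, not values read from Table 6.

```python
def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


def implied_baseline(gain_points, gain_percent):
    """Baseline accuracy implied by an absolute and a relative gain."""
    return gain_points / (gain_percent / 100.0)


# Figures stated in the text for CQT: +7.22 percentage points, +9.13%.
baseline = implied_baseline(gain_points=7.22, gain_percent=9.13)
improved = baseline + 7.22
```

Under these figures the implied Version 1 accuracy is about 79% and the implied Version 2 accuracy about 86%.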

C. MODEL EVALUATION
Following optimiser and preprocessing parameters selection, the next step in the development of the underwater acoustic signal classifier was the selection of the DL model. Training sessions were performed using both VGG-based and ResNet18 models. As the results obtained in the previous experiments suggested a better performance using the combination of SGD optimiser, with either CQT or the Complete representation (generated using Version 2 parameters), this setup was selected for model evaluation. Table 7 shows the results obtained from these tests.
The results showed that the ResNet18 model outperformed VGG for both CQT and Complete preprocessing filters, presenting an improvement of 8.09 percentage points for the CQT, and 14.15 for the Complete, the latter being the best result obtained for this dataset scenario overall. The ResNet's capacity to have more intermediate layers proved to be suitable for the feature extraction stage, as it gave the model the ability to generalise the problem function better, thus resulting in higher classification performance. The loss obtained during the training stage (represented in Figure 7 and Figure 8) showed similar convergence behaviours for both models. Figure 8 also shows that the loss curves for CQT and Complete preprocessing methods (feeding a ResNet18 classifier) are almost identical. This contrasts with the curves shown in Figure 7, which indicate a better performance for the CQT than for the Complete representation when applied to a VGG-based model.
Although the Complete representation, combined with the ResNet model, presented an improvement of 2.12 percentage points in accuracy, the CQT was able to obtain a similar value using only one-third of the input size and preprocessing, since Complete is a three-channel representation. This result suggests that a fair trade-off between accuracy and model size is obtained when using the CQT as a single preprocessing method.

D. SCENARIOS VALIDATION
The final test executed in this work evaluated how the distance from the sensor to the target vessel influenced the classification results. Tests were conducted with the combination of methods that produced the best results, as reported in previous sections. The architecture, composed of the CQT preprocessing filter applied to the ResNet18 model, was used to compare the results obtained in training and testing on the three scenarios described in Section III. Also, a test using all of the data from the three dataset scenarios combined was performed, aiming to evaluate if the distance impacted the accuracy, or if the generalisation ability of the architecture was capable of dealing with this variable. The results obtained from these tests are shown in Table 8.
As the different scenarios do not contain the same number of instances (or the same vessels), they are not directly comparable, making it difficult to precisely state which situation allowed the best outputs. However, these results suggest that there is no practical difference in accuracy between the individual scenarios. This means that range to target had minimal impact on system accuracy, at least up to the tested 6 km distance boundary. One explanation for this could be the depth and ocean temperature where the hydrophone was located, which provided the best context for underwater sound propagation [25], thereby not degrading the SNR sufficiently to invalidate the signal representation. However, combining all of the data from the three scenarios caused an accuracy drop of between 8.98 and 10.82 percentage points compared to the individual scenarios alone. The confusion matrix for the Combined scenario is shown in Figure 9. This suggests that the data from the different scenarios interacted negatively, confusing the model during training. In particular, results with the Combined scenario show a higher confusion rate between background and tug than when the scenarios were trained separately. This was probably due to the similar range of frequencies of these two classes, an overlap which may have been accentuated by the combination of SNR levels from the various scenarios.

VII. DISCUSSION
This paper has reported experimental evaluations of the main aspects related to the development of a DL-based classifier for vessel types using underwater acoustic data.
The first tests reported focused on the selection of the most suitable elements to compose the classifier architecture, such as the optimiser and preprocessing methods. Initially, the two most commonly used optimisers, SGD and Adam, were tested and compared. The results reported in Section VI-A showed that Adam's performance was not satisfactory in this domain, owing to lower accuracy rates as a result of its inefficient treatment of local minima. In comparison, SGD produced higher accuracy and a more stable performance. This agrees with other studies (e.g., [56], [57]) that argue adaptive optimisation methods, like Adam, often generalise significantly worse than stochastic methods, such as SGD, since the strategy used by the former to escape saddle points causes difficulties in achieving flat global minima. In contrast, the momentum-based strategy of the latter provides a drift effect to escape saddle points without affecting the flat minima selection [57]. The tests performed herein seem to corroborate this hypothesis, as the best results obtained for SGD (using CQT) were 11.38 percentage points better than the performance (for the same preprocessing filters) obtained when using the Adam optimiser.
A second issue considered in this work was the selection of the most suitable representation of the signal to be used by the CNN. In our dataset, the CQT representation presented better performance, with an accuracy of around 86%, when compared with Gammatone and Mel spectrograms, with best accuracies of 62% and 54%, respectively, a minimum of 24 percentage points improvement. There are various possible explanations for this finding. One is that the acoustic signals generated by the vessels are predominantly composed of lower frequencies. However, the Gammatone filter does not emphasise low frequencies sufficiently, resulting in a lower classification performance. Mel spectrogram, on the other hand, does emphasise the lower frequencies, by mapping the frequency axis to the logarithmic Mel scale. However, it maintains the conversion from time-domain using fixed time windows, which negatively affects the temporal resolution. In contrast, CQT increases the time resolution towards higher frequencies while reducing the frequency resolution; this results in emphasising the lower frequencies, which is akin to the human aural perception [50]. This feature makes the CQT spectrogram the most suitable representation for automated classification of underwater acoustic data using CNNs, owing to the nature of the convolutional layers.
The tests conducted with the Complete representation (all three preprocessing filters combined) aimed to obtain a preprocessing method that includes the advantages of each of the methods considered in this work. The results obtained showed that the classification accuracy obtained using this three-channel representation was marginally better than the best single filter (CQT) for ResNet, but not for the VGG model. As the Complete representation combines multiple preprocessing methods, its generation and processing are more computationally expensive than applying any one preprocessing method on its own. Even in the cases where the Complete representation showed the best results (ResNet model), its performance was similar to that obtained using only the CQT spectrograms as input. This indicates that the Complete preprocessing method should only be used in situations requiring the highest possible accuracy or where computational cost is essentially irrelevant.
With respect to the DL model selection, the ResNet approach outperformed the VGG-based model for both CQT and Complete data representation methods by at least 8.09 percentage points in accuracy. This superior performance was likely due to the existence of residual blocks in the ResNet model, which reduce the probability of overfitting. It is worth mentioning that the ResNet model used had 17 convolutional layers, in contrast to four composing the VGGNet. Considering the relative complexity and accuracy results of both models tested in this work, we can conclude that, although ResNet18 produced the best classification results, VGG-based classifiers are still suitable models to be used in applications with limited computational resources.
The final test executed in this work evaluated the influence of the distance between the sensor and the targets on the classification performance. Although a minor degradation in accuracy was observed as the distance to the sensor increased, the results obtained for the three scenarios showed similar figures. The tests using the combination of the data points from all three scenarios presented the worst results compared with the performance values obtained for each of its constituent scenarios. This was probably due to the fact that, although a minor variation in the SNR (resulting from the distance between target and sensor) did not affect the results obtained in each of the individual scenarios, this difference was large enough to increase the complexity of the audio patterns contained in the combined dataset, thus hindering the capacity of a simple classifier to find a suitable generalisation that accurately represented the distinct classes. It should be noted that the largest (by far) point of confusion in the Combined dataset was the classifier detecting the presence of a tug when there was no vessel present. While additional data could help overcome this error, another option for future investigation is a two-stage detect-and-classify process. A computationally simple detection algorithm could be used to determine the possible presence of a vessel, and the classifier could then be used to determine what sort of vessel it is. This has the potential both to reduce the error rate when no vessels are present and to reduce the computational resources required, as the classification network would not have to run continuously. It is likely that a detection algorithm of sufficient sensitivity will detect a vessel before sufficient structure is present for a classification algorithm to accurately classify said vessel.
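The two-stage idea can be sketched as a gate in front of the classifier. The sketch is purely illustrative of the proposed future work: an RMS energy threshold is assumed as the "computationally simple detection algorithm", and the threshold value, the signal windows, and the stub classifier are all hypothetical.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude of one audio window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_then_classify(samples, classifier, threshold):
    """Run the (expensive) classifier only when the detector fires."""
    if rms_energy(samples) < threshold:
        return "background"  # detector decides no vessel is present
    return classifier(samples)


# Stub standing in for the CNN; it records how often it is invoked.
calls = []


def stub_classifier(samples):
    calls.append(len(samples))
    return "tug"


quiet = [0.001] * 100        # hypothetical low-energy window
loud = [0.5, -0.5] * 50      # hypothetical high-energy window

quiet_label = detect_then_classify(quiet, stub_classifier, threshold=0.01)
loud_label = detect_then_classify(loud, stub_classifier, threshold=0.01)
```

In this arrangement the classifier never sees the quiet window, so it cannot label it as a tug, and it is executed only once for the two windows.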
Considering related work developed with data from the ONC initiative, accuracy values of the order of 80% were previously reported using raw audio data, where time-frequency filter dependency was not considered in DL pipelines [34], [35]. An accuracy value of around 87% was reported as a result of the application of a bio-inspired cochlea model preprocessing filter to a CNN-based classification [40]. In addition, a comparison of various deep learning methods was conducted in [41] using a set of ONC raw data analogous to that used in the present paper. However, 77.53% was the highest accuracy reported in that work, which was developed on a dataset called DeepShip, whose recordings were divided into 613 files varying in length from about 6 seconds to 1530 seconds. Only the identification of a single vessel within a range of 2 km from the hydrophone was used to generate the data, and the background noise recordings were added from a distinct source. Table 9 shows a summary of the best results obtained in the present work (first block), against the results reported in [41].
Although it is virtually impossible to faithfully reproduce the results reported in [41] (as the code used to generate them is not publicly available) it is possible to observe that the results reported in that paper are similar to those obtained in the present work for the Combined dataset (84%). This already places the baseline results obtained in the present work within the state-of-the-art of the field. However, the research reported here achieved superior results when distinct scenarios were considered with respect to the distance to the sensor, obtaining an accuracy of 95% or 97% depending on the preprocessing filtering method used (CQT or Complete, respectively), as shown in Table 9.

VIII. CONCLUSION
This paper presented a machine learning approach for vessel type classification using underwater acoustic data. CQT, Gammatone, and Mel Spectrogram filters were used to preprocess the acoustic data, aiming to extract relevant features of the signals. Striving to achieve a better representation of the signal, the combination of the three outputs into three-channel data (the Complete representation) was also investigated. The results showed that the CQT and the Complete approaches achieved the highest accuracy results for the dataset used in this paper. This study also compared the SGD and Adam optimiser performance applied to the vessel type classification problem, showing that the SGD optimiser is more stable and presents a better generalisation than Adam. Concerning the deep learning model, results showed that ResNet18 yielded the highest evaluation metrics when compared to the VGGNet model.
A new dataset was defined, using the raw data from Ocean Networks Canada, which includes the underwater acoustic signals annotated with the related vessel types. Three distinct scenarios were defined with respect to the distance between the target vessel and the hydrophone used to capture the signal. These three scenarios were compared using the proposed pipeline, achieving a maximum accuracy of 94.95% when CQT was used to preprocess the data fed into a ResNet classification model. Higher accuracy (97%) was achieved when all three preprocessing methods (CQT, Gammatone, and Mel Spectrogram) were used simultaneously. However, the increase in computational costs may not be worth the slight accuracy improvement. Furthermore, a test was conducted combining the three distance scenarios into a single dataset, resulting in a classification accuracy of 84.13% for the CQT preprocessed data.
A complete pipeline for the classification of underwater acoustic signals was proposed in this paper, whose source code and data are publicly available. However, some issues were not addressed and will be considered in future work. Despite the promising results obtained in classification using the isolated distance scenarios, the results for the combined approach have a large scope for improvement, since the accuracy across distinct scenarios had a variation of 10 percentage points. Future work will also consider the application of other machine learning models, combined with novel biologically inspired filters, to the classification of vessels from acoustic data, in order to investigate better trade-offs between accuracy and model size when image classification techniques are applied to acoustic data.

PHILLIP S. M. SKELTON (Member, IEEE) received the Ph.D. degree in biologically inspired vision systems for robotics from the University of South Australia, in 2020. He joined Flinders University, as a Postdoctoral Research Associate, in 2020, where he is currently a Full Member of the Centre for Defence Engineering Research and Training, College of Science and Engineering. He works across all aspects of maritime autonomous systems and oversees a variety of projects. His current research interests include developing adaptive biologically inspired signal processing algorithms for multi-modal autonomous perception tasks in the underwater domain, and the complex task of tuning these algorithms using evolutionary computation techniques.
KARL SAMMUT (Senior Member, IEEE) received the Ph.D. degree from the University of Nottingham, U.K., in 1992. From 1992 to 1995, he was employed as a Postdoctoral Fellow with the Politecnico di Milano, Italy, and Loughborough University, U.K. He commenced his appointment at Flinders University, in 1995, where he is currently a Professor with the College of Science and Engineering. He works as the Co-Director of the Centre for Defence Engineering Research and Training, College of Science and Engineering, Flinders University, and the Theme Leader of the Maritime Autonomy Group. His research interests include navigation, optimal guidance and control systems, and mission planning systems for autonomous marine surface and underwater vehicles.