Deep Semisupervised Semantic Segmentation in Multifrequency Echosounder Data

Multifrequency echosounder data can provide a broad understanding of the underwater environment in a noninvasive manner. The analysis of echosounder data is, hence, a topic of great importance for the marine ecosystem. Semantic segmentation, a deep learning-based analysis method predicting the class attribute of each acoustic intensity, has recently been in the spotlight of the fisheries and aquatic industry since its result can be used to estimate the abundance of marine organisms. However, a fundamental problem with current methods is the massive reliance on the availability of large amounts of annotated training data, which can only be acquired through expensive handcrafted annotation processes, making such approaches unrealistic in practice. As a solution to this challenge, we propose a novel approach, where we leverage a small amount of annotated data (supervised deep learning) and a large amount of readily available unannotated data (unsupervised learning), yielding a new data-efficient and accurate semisupervised semantic segmentation method, all embodied into a single end-to-end trainable convolutional neural network architecture. Our method is evaluated on representative data from a sandeel survey in the North Sea conducted by the Norwegian Institute of Marine Research. The rigorous experiments validate that our method achieves comparable results utilizing only 40% of the annotated data on which the supervised method is trained, by leveraging unannotated data.

image pixel to a semantic class [1], [2], [3]. When analyzing echosounder data, the aim is to assign an observed acoustic backscattering intensity to one of several given acoustic classes, often referred to as acoustic target classification [4], [5], [6], [7]. In practice, semantic segmentation of the echosounder data is still a manual and heuristic process, which is rather vulnerable to human error and bias. It is also expensive in terms of cost and time [8].
There are a few studies that intend to automate the semantic segmentation based on statistical modeling and machine learning techniques [9], [10], [11], [12], [13]. However, they are exposed to limitations such as relying heavily on handcrafted feature selection and not being able to scale well to large amounts of data. As recent echosounder technology leverages increasing numbers of frequency channels and wider bandwidth [14], automated analysis methods should therefore be scalable to cope with increased resolution and multifrequency data.
Convolutional neural networks (CNNs) are a framework renowned for excelling at image segmentation tasks [15]. Recent echosounder segmentation studies introduce CNN-based segmentation methods as alternative strategies [5], [16], [17], [18], [19], where the main advantage is the capacity to learn discriminating features from the training data without requiring a handcrafted process, allowing the analysis to scale to large-sized data. Note that these methods are trained in a fully supervised manner, meaning that the network learns from fully annotated training data. The fully supervised approaches achieve good performance provided that high-quality training data and an appropriate choice of prediction model are assured. However, obtaining the class annotation for each backscattering intensity pixel is highly challenging for echosounder data because it relies on the manual annotation process, which is expensive and error-prone.
Hence, a new learning scheme is required to considerably reduce the dependence on the manual annotation process while still facilitating powerful deep-learning approaches for the segmentation of the echosounder data. As a key step in this direction, we propose a novel deep semisupervised semantic segmentation method that efficiently uses a small amount of manually annotated data by combining it with a large amount of readily available unannotated data in the learning process [20], [21], [22].
The key concept invoked to train the semisupervised segmentation network is to alternate between two objective functions, namely, an unsupervised clustering objective and a supervised segmentation objective, encapsulated by a single CNN. The unsupervised clustering objective is to search the underlying structure within the training data without using the class annotation. In contrast, the supervised segmentation objective is to map the input echosounder data to the given classes presented in the available annotated data. These two objective functions alternatively optimize the single CNN and gradually integrate the underlying clustering structure to the class decision boundaries presented in the small amount of annotated training data. Our proposed method can create pixel-level prediction maps using the same CNN architecture as [5] and [23]. Still, it is data-efficient because it can significantly reduce the use of the annotated data. To the best of our knowledge, our work is the first semisupervised semantic segmentation method for multifrequency echosounder data that provides prediction maps on a pixel scale, advancing the existing semisupervised method of providing patch-scale prediction maps (see Section III-C) [22]. In addition, our proposed method is end-to-end trainable, which refers to a holistic gradient-based learning system where a formulated objective function reflects the principle of a given task without requiring extensive human intervention and prior knowledge [24].
Extensive and rigorous experiments are conducted on the multifrequency echosounder data collected in the North Sea by the Norwegian Institute of Marine Research. A severe class imbalance in the echosounder data is an ever-present source of bias that hampers the training of neural networks, as 99% of all acoustic backscattering intensities belong to the background class [5], [25]. To mitigate the bias, we introduce a class-rebalancing weight to each learning objective, where the weight is calculated from the model prediction without relying on the annotation.
The contributions of the article are the following.
1) To propose a novel deep semisupervised semantic segmentation method for the multifrequency echosounder data, which considerably advances the existing methods.
2) To achieve comparable results with the fully supervised segmentation method by leveraging a small amount of the annotated data in addition to unannotated data.
3) To exploit the underlying structure of the training data using unsupervised deep clustering in a semisupervised learning manner.
4) To demonstrate the innovation potential of the proposed method in a real-world test case.
5) To regulate the class imbalance based on the model prediction without leveraging the annotated part of data.
6) To operate in an end-to-end and mini-batch training scheme.

II. BACKGROUND
Semantic segmentation is the process of partitioning an image into mutually exclusive subsets by assigning a class annotation to each intensity of the data, in which each subset represents a meaningful region of the original image [26]. It thereby provides a comprehensive scene description that includes object class, location, and shape. A wide range of real-world problems require semantic segmentation [27], [28], [29], [30], [31], [32], such as self-driving vehicles [33], and polyp detection [34], [35], to name a few, all depending on different types of image data.
Semantic segmentation has been considered a challenging computer vision task due to the large distribution variance as well as the huge class imbalance among objects in the input data [25]. In recent years, however, deep learning has been rapidly advancing and has become a game-changer in many image analysis tasks including semantic segmentation. The CNN [36] is a deep learning framework that has had particular success for grid-structured data such as images. Traditional CNNs consist of convolutional layers and pooling layers, where these layers are stacked in a deep and hierarchical architecture in a particular order, providing unique properties to the analysis. For example, the weight-sharing property of the convolutional filters provides a symmetric transformation between the input space and the output space, referred to as "equivariance to translation." The pooling layers help the learned representation become approximately invariant to small translations of the input [15], [37]. Another advantage of the CNN is a more straightforward learning process than that of the conventional methods: CNN-based models learn by minimizing a formulated objective function that reflects the strategies of a given task without requiring extensive human intervention and prior knowledge, which is referred to as end-to-end learning.
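The equivariance property mentioned above can be checked numerically on a toy example. The following is a minimal sketch, assuming a 1-D signal and circular (wrap-around) padding so that the property holds exactly; the function names, signal, and kernel are illustrative and do not come from the article.

```python
def circular_conv1d(signal, kernel):
    """Cross-correlate `signal` with `kernel` using circular (wrap-around) padding."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, steps):
    """Circularly shift a sequence to the right by `steps` positions."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps] if steps else list(signal)


# Toy values chosen for illustration only.
signal = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0]
kernel = [1.0, -1.0, 0.5]

# Equivariance: filtering then shifting equals shifting then filtering.
conv_then_shift = shift(circular_conv1d(signal, kernel), 2)
shift_then_conv = circular_conv1d(shift(signal, 2), kernel)
equivariant = conv_then_shift == shift_then_conv
```

With zero padding instead of circular padding, the equality holds only away from the borders, which is one reason the property is usually stated for the interior of the feature map.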
CNN-based segmentation models are distinguishable through their model architecture. Their architecture consists of a downstream module that extracts the abstracted feature representations of the input data and an upstream module that reconstructs the prediction map exhibiting the class attributes of each intensity in the input data based on these extracted feature representations. Thanks to the dual architecture, those models can make class predictions on arbitrary-sized inputs [38]. Fully convolutional networks [1] and U-Net [23] are representative architectures, where the models are composed of (transposed) convolutional layers and pooling layers, and end-to-end trainable depending on their formulation of the objective functions.

A. Echosounder Data
For the sustainable management of commercially harvested marine organisms, reliable information on their abundance is essential. For example, lesser sandeel, the fish species of interest in this study, is the primary food source in the North Sea food web thanks to its ample population [39] and is the preferred prey of a variety of predators, including marine mammals, seabirds, and piscivorous fishes [40]. Therefore, monitoring the sandeel stock is critical for the sustainability of the marine ecosystem and fishery management in the North Sea. The echosounder data can contribute to estimating the abundance, leveraging the characteristics of the backscattered responses and knowledge of the target species [8]. The multifrequency echosounder data that we use in this study have been collected by multifrequency Simrad EK60 echosounder systems operating at four different frequency channels (18, 38, 120, and 200 kHz) on the vessel, where the vessel speed is approximately ten knots. The Norwegian Institute of Marine Research has collected the data through the annual trawl surveys in the sandeel areas in the North Sea [41].
We leverage the data preprocessing protocol from the earlier works [5], [22], for which we share the echosounder data. For each frequency channel, a volume backscattering coefficient $s_v$, the average amount of backscattering intensity per cubic metre [42], is stored in the 2-D echosounder data. In the physical context, the horizontal and vertical lengths of a single backscattering coefficient are, respectively, one second and 19.2 cm based on the pulse duration of 1.024 ms with respect to a common time-range grid based on the resolution of the 200 kHz echosounder data. All the volume backscattering coefficients $s_v$ are first converted to a decibel unit (dB re 1 m$^{-1}$). We set the minimum value to −75 dB re 1 m$^{-1}$. Coefficients less than −75 dB re 1 m$^{-1}$ and missing coefficients are set to this minimum value.
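The decibel conversion and flooring described above can be sketched as follows. This is a minimal illustration, assuming the standard conversion $S_v = 10 \log_{10}(s_v)$ and assuming that missing values are encoded as `None`, NaN, or non-positive numbers; the function names are hypothetical and the article's actual preprocessing code may differ.

```python
import math

MIN_DB = -75.0  # floor stated in the text, in dB re 1 m^-1


def sv_to_db(sv):
    """Convert one linear volume backscattering coefficient to dB, floored at MIN_DB.

    Assumption of this sketch: None, NaN, and non-positive values count as missing
    and are imputed with the floor value.
    """
    if sv is None or math.isnan(sv) or sv <= 0.0:
        return MIN_DB
    return max(10.0 * math.log10(sv), MIN_DB)


def preprocess_channel(channel):
    """Apply the conversion to a 2-D grid (list of rows) for one frequency channel."""
    return [[sv_to_db(value) for value in row] for row in channel]


# Toy grid: one value above the floor, one below it, and two missing values.
example = preprocess_channel([[1e-5, 1e-9], [float("nan"), None]])
```

In this toy grid only the first coefficient lies above the floor; the other three entries all end up at −75 dB.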
For segmentation of the echosounder data, one common approach is manual annotation, which relies on the operators' domain expertise in acoustic properties, such as relative frequency response [43], [44], echo traces [45], and trawl sampling [46]. For that reason, the manual process is vulnerable to bias from the operators. In extreme cases, the systematic error associated with the manual method can be as high as ±80% [8]. Hence, more structured and automated approaches are required to apply consistent criteria to the analysis while reducing dependence on human intervention. To this end, postprocessing systems, including the large scale survey system (LSSS) [9], have been developed to facilitate the manual process. The systems support thresholding, error-checking, noise removal, and manipulation of the echosounder data. By adjusting the threshold of backscattered intensities, the postprocessing systems visualize the corresponding morphology of the fish schools to enable the operators to detect and delineate the most plausible morphology. In addition, these postprocessing systems enable relatively consistent criteria for the analysis by leveraging their acoustic feature libraries. The library consists of a selected part of the backscattered responses and their manually annotated class attributes. By comparing the statistical properties of the collected data to the feature library, the postprocessing system predicts the class attribute of the fish school, where the prediction is verified by the scattering model for the corresponding marine organism if available [47], [48].
The sandeel data in this study are manually annotated with the aid of LSSS, where expert operators determine the class of each backscattering coefficient as sandeel (SE), other fish species (OT), or background (BG). The primary frequency for LSSS is chosen as 200 kHz because it has the highest sandeel signal-to-noise ratio [49]. The operators alter the detection threshold centered at −63 dB at the primary frequency to discover the fish school boundaries visually. The delineated boundary is refined using binary morphological closing to obtain smoother and more pragmatic edges [5]. However, the final decision for both morphology and species is still a manual process, which is time-consuming and requires tacit knowledge that can be potentially biased as with any expert system. Therefore, recent studies have focused on the automated identification of the fish species using machine-learning methods while leveraging a conventional detection algorithm to detect and delineate the morphology of the schools. The shoal analysis and patch estimation system (SHAPES) [50], [51] is often chosen as the fish school detection algorithm, which extracts a feature vector from each fish school leveraging a single frequency channel of 38 kHz. A random forest-based classifier [12] is introduced to classify feature vectors of silver cyprinid from the other species in Lake Victoria. Aronica et al. [52] propose a classifier leveraging a shallow feedforward network and classify pelagic Mediterranean fish schools such as anchovy, sardine, and horse mackerel. Those studies show that automated identification can save time and cost while also achieving robust performance. However, they have limitations in generalizability and scalability because the SHAPES algorithm only exploits a single channel of the echosounder data, and a handcrafted feature selection is required to improve the performance.
Deep learning-based models generalize and scale well on various types of data using their flexibility [15], [37]. Among them, the fully supervised deep learning approaches, approaches that learn from the fully annotated training data, achieve a good level of performance provided a high quality of the training data and an appropriate choice of the prediction model are assured. To take advantage of supervised deep learning in the analysis of echosounder data, CNN-based semantic segmentation model [5] is introduced to segment the schools of lesser sandeel from the other species leveraging the U-Net architecture [23]. Without relying on the deterministic school detection algorithms and the feature vectors as input, the model constructs the prediction map directly from the input echosounder data.

B. Deep Clustering
We here discuss deep clustering since our novel CNN-based semisupervised semantic segmentation for echosounder data, presented in Section III, relies heavily on this concept. Deep clustering refers to unsupervised deep learning-based approaches that aim to cluster data into underlying groups without requiring the class attributes of the data [53]. Deep clustering leverages the representation power of the neural network in conjunction with clustering algorithms, and partitions the input data into clusters with respect to the learned representation. As clustering performance heavily depends on the underlying structure of the data, deep clustering leverages the neural network to encode the training images into feature representations in which the clustering task becomes much easier [54].
Our proposed method is inspired by a well-known deep clustering framework, referred to as DeepCluster [53], which explicitly models the density of datapoints leveraging the k-means clustering algorithm. For a given image data set, the k-means algorithm partitions the feature representation into K different densities, where each density refers to an image descriptor or a visual feature. This has the advantage that it is easy to increase the capacity of more visual features by simply increasing the number of clusters K, leading to all-purpose visual features. The neural network produces cluster indices that can be thought of as clustering-induced annotations for the training data. The network is then updated in a supervised manner to learn the clustering structure. This annotation technique is referred to as pseudolabeling, allowing the supervised deep learning approach to be applied to unannotated training data [55].

III. PROPOSED METHOD
In this article, we propose a novel semisupervised semantic segmentation method, PredKlus, that enables a CNN to simultaneously learn from large amounts of unannotated data and a few annotated data, all in the same network.
The major novelty of our work is the methodology of how the network learns in a semisupervised manner, illustrated in Fig. 1. Our proposed segmentation network operates for two different goals: 1) searching for the internal structure of the training data without relying on external information, e.g., ground truth; 2) mapping input echosounder data to given classes. The former goal can be achieved by an unsupervised clustering objective, which clusters every pixel in the input based on its features to reveal a clustering structure of the input data in an unsupervised manner. Fig. 1(b) illustrates the clustering structure. A supervised segmentation objective, on the other hand, aims to map the input to given classes by leveraging the annotated part of training data, albeit in a small amount. Fig. 1(c) illustrates this. As these two objective functions alternately optimize the network using gradient descent, the segmentation network gradually learns the class decision boundaries (supervised) with respect to the clustering structure (unsupervised), as illustrated in Fig. 1(d). We implement the entire learning process in an end-to-end manner and a mini-batch setting, which are additional novelties of our method.

A. Model Architecture
Fig. 2 describes the model architecture of our proposed method. The encoder-decoder architecture with the skip connections is inspired by U-Net [23] and the recent segmentation study of the echosounder data [5]. The encoder part extracts the abstracted feature map of the echosounder input with a shape of 256 × 256 × 4 over five stages, where the area of the feature map is reduced to one-fourth at each stage due to a 2 × 2 max-pooling layer. By processing two sets of a 3 × 3 convolutional layer, a batch-normalization layer [56], and a rectified linear unit (ReLU) [57] at each stage, we abstract the feature map by doubling the depth. The encoder eventually creates five feature maps of different area sizes and depths, where the shape of the last feature map is 16 × 16 × 1024.
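The encoder shapes stated above can be traced with a few lines of arithmetic. This is a minimal sketch, assuming that the first stage produces 64 channels (consistent with the 256 × 256 × 64 decoder output described later) and that the 2 × 2 max-pooling is applied between consecutive stages, i.e., four times; `encoder_shapes` is a hypothetical helper, not part of the article.

```python
def encoder_shapes(size=256, base_depth=64, stages=5):
    """Return the (height, width, depth) of the feature map produced by each encoder stage."""
    shapes = []
    depth = base_depth
    for stage in range(stages):
        if stage > 0:
            size //= 2    # 2 x 2 max-pooling halves height and width (area / 4)
            depth *= 2    # the convolutional blocks double the depth
        shapes.append((size, size, depth))
    return shapes


shapes = encoder_shapes()
# shapes[-1] is the coarsest feature map, expected to be 16 x 16 x 1024.
```

Under these assumptions the five feature maps are 256², 128², 64², 32², and 16² in area, with depths 64, 128, 256, 512, and 1024.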
The decoder part reconstructs the prediction map leveraging five feature maps from the encoder. At each stage, a 2 × 2 transposed convolutional layer and the concatenation of the feature maps along the depth axis play an important role. The 2 × 2 transposed convolutional layer increases the area of the feature map fourfold while halving the depth. The halved feature map is concatenated with the feature map in the same shape from the encoder. The concatenated feature map is processed by two sets of a 3 × 3 convolutional layer, a batch-normalization layer, and an ReLU, where the depth becomes halved.
The novelty in our architecture is to introduce a convolutional layer for each objective function at the end of the CNN to employ two objective functions in one network. The alternation of the two objective functions takes place at the end of the decoder, where the decoder reconstructs the feature map with a shape of 256 × 256 × 64. To alternately leverage two objective functions, we append a 1 × 1 convolutional layer at the end of the network for each objective function, namely, conv1 for the unsupervised clustering objective and conv2 for the supervised segmentation objective. Note that the number of filters in conv1 matches the number of clusters or pseudoclasses K. Similarly, the number of filters in conv2 is equal to the number of classes C.
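The decoder and the two output heads can be traced in the same way as the encoder. The sketch below assumes the encoder shapes 256² × 64 down to 16² × 1024 and only tracks tensor shapes; `decoder_trace`, `head_shape`, and the value K = 8 are hypothetical choices for illustration (the text fixes C = 3 but K is a hyperparameter).

```python
# Assumed encoder feature-map shapes (height, width, depth), finest first.
ENCODER = [(256, 256, 64), (128, 128, 128), (64, 64, 256), (32, 32, 512), (16, 16, 1024)]


def decoder_trace(encoder):
    """Follow the feature-map shape through each decoder stage."""
    h, w, depth = encoder[-1]
    trace = []
    for skip in reversed(encoder[:-1]):
        # 2 x 2 transposed convolution: area x 4, depth halved.
        h, w, depth = 2 * h, 2 * w, depth // 2
        assert (h, w, depth) == skip, "skip connection must match the upsampled map"
        trace.append({
            "upsampled": (h, w, depth),
            "concatenated": (h, w, 2 * depth),  # concatenation along the depth axis
            "output": (h, w, depth),            # two 3 x 3 conv blocks halve the depth again
        })
    return trace


def head_shape(features, n_filters):
    """A 1 x 1 convolution keeps the spatial size and sets the depth to n_filters."""
    h, w, _ = features
    return (h, w, n_filters)


K = 8  # hypothetical number of clusters (pseudoclasses)
C = 3  # number of classes given in the text (BG, SE, OT)

trace = decoder_trace(ENCODER)
features = trace[-1]["output"]          # shared decoder output
conv1_out = head_shape(features, K)     # unsupervised clustering head
conv2_out = head_shape(features, C)     # supervised segmentation head
```

Both heads read the same 256 × 256 × 64 feature map, so switching objectives only swaps the final 1 × 1 layer while the rest of the network is shared.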

B. Two Objective Functions
Our proposed method leverages two objective functions, where those objectives alternately optimize the model. Through the alternating optimization, the CNN indirectly incorporates the class annotations (supervised) to a structured representation (unsupervised) and eventually discovers a structured representation consistent with the available annotations. The yellow box in the middle of Fig. 2 shows the overview of our semisupervised segmentation method. The first two steps of the figure, i.e., Fig. 2(a) creating pseudolabels using k-means, and Fig. 2(b) updating the model to learn the clustering structure with the pseudolabels using conv1, contribute to learning the structured representation in an unsupervised manner. The next step, Fig. 2(c) training with the partially available annotation using conv2, represents how the CNN learns in a supervised manner using the supervised segmentation objective and the available class annotations. Note that a cross-entropy loss (CE) is leveraged to update the model, as depicted in Fig. 2(b) and (c).
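The alternation between the two objectives can be sketched as a simple control-flow skeleton. This is an illustration only: `cluster_step` and `segment_step` are hypothetical callables standing in for the model updates of steps (a)-(b) and (c), and cycling over the smaller annotated subset is an assumption of this sketch rather than a detail stated in the text.

```python
from itertools import cycle


def alternate_training(unlabeled_batches, labeled_batches, cluster_step, segment_step):
    """Alternate one unsupervised and one supervised update per iteration.

    `labeled_batches` must be non-empty; it is cycled because the annotated
    subset is smaller than the full training set (assumption of this sketch).
    """
    log = []
    labeled = cycle(labeled_batches)
    for x in unlabeled_batches:
        # Steps (a) and (b): create pseudolabels by clustering, then update via conv1.
        log.append(("cluster", cluster_step(x)))
        # Step (c): update via conv2 using an annotated mini-batch.
        x_a, y_a = next(labeled)
        log.append(("segment", segment_step(x_a, y_a)))
    return log


# Toy usage with placeholder batches and identity "updates".
updates = alternate_training(
    unlabeled_batches=["x0", "x1", "x2"],
    labeled_batches=[("xa0", "ya0")],
    cluster_step=lambda x: x,
    segment_step=lambda x_a, y_a: (x_a, y_a),
)
```

In a real implementation each callable would run a forward pass, compute its loss, and apply one gradient step to the shared network.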
1) Unsupervised Clustering Objective: The unsupervised clustering objective exploits the underlying structure of the data using an unsupervised clustering algorithm, such as k-means, to create pseudolabels with respect to the clustering structure [53]. Defining the number of clusters K beforehand, the proposed model partitions the feature map $Z = \{z^{(i)}\}_{i=1}^{N}$ located at the end of the decoder into K clusters, finding the best assignment by minimizing the k-means loss in (1). In (1), N is the number of feature vectors in a mini-batch of the feature map. If the batch size $B_s$ is equal to one, N becomes 65 536, as each feature map consists of 65 536 vectors (256 × 256). The function $d(\cdot, \cdot)$ measures the $L_2$ distance between two vectors, $c_k \in \mathbb{R}^{32}$ is the centroid of cluster k, and $z_{PC}^{(i)} \in \mathbb{R}^{32}$ is the dimensionality-reduced representation of the feature vector $z^{(i)} \in \mathbb{R}^{64}$. For dimensionality reduction, we use principal component analysis (PCA) [58], which computes the principal components and uses only the first few principal components corresponding to the largest eigenvalues for manageable computational complexity.

Fig. 2 (caption, partially recovered). The supervised segmentation objective is involved in (c), training with the partially available annotation using conv2. The rectangular bars in blue or gray represent feature maps, where the size of each feature map is specified around it, e.g., 256² or 16². The depth is omitted for a few feature maps where it is the same as that of the feature map on its right, e.g., 64 or 512.
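The pseudolabeling step can be illustrated with a small k-means routine. This is a minimal standard-library sketch, assuming the loss is the mean $L_2$ distance of each vector to its nearest centroid (the exact normalization is an assumption) and omitting the PCA step for brevity; all function names are hypothetical. The `init_centroids` argument allows the centroids of one clustering run to seed the next.

```python
import math


def l2(a, b):
    """L2 distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(features, centroids):
    """Return nearest-centroid indices (pseudolabels) and the mean distance to them."""
    labels, total = [], 0.0
    for z in features:
        dists = [l2(z, c) for c in centroids]
        k = min(range(len(centroids)), key=dists.__getitem__)
        labels.append(k)
        total += dists[k]
    return labels, total / len(features)


def update(features, labels, centroids):
    """Move each centroid to the mean of its members; empty clusters keep their centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [z for z, label in zip(features, labels) if label == k]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids


def kmeans(features, init_centroids, iterations=10):
    """Run a fixed number of Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(iterations):
        labels, _ = assign(features, centroids)
        centroids = update(features, labels, centroids)
    labels, loss = assign(features, centroids)
    return labels, centroids, loss


# Toy 2-D feature vectors (the article uses 32-D PCA-reduced vectors).
toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
pseudolabels, centroids, loss = kmeans(toy, init_centroids=[[0.0, 0.0], [10.0, 10.0]])
```

For 131 072 vectors per mini-batch a vectorized or GPU implementation would be needed in practice; the pure-Python version is only meant to make the assignment and update steps explicit.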
The clustering result creates the pseudolabels, which have K different pseudoclass attributes according to the K cluster indices. The CNN learns the structured representation from the pseudolabels using the cross-entropy loss. The unsupervised clustering objective is given in (2), where $\mathrm{CE}(p, q) = -\sum_{k} q_k \log(p_k)$ is the cross-entropy loss of the probability distribution p for the one-hot encoded label q, $y^{(i)} \in \{0, 1\}^{K}$ is the one-hot encoded pseudolabel, and $g_\theta(z^{(i)})$ is a probability distribution of the output from the CNN, where conv1 is appended at the end of the decoder. The scalar $w_{cls,k}^{(i)}$ indicates the class-rebalancing weight to penalize the class imbalance of the pseudolabels. How to obtain this scalar is explained in Section III-C. After updating the CNN with the unsupervised clustering objective, we use the current centroids of the K clusters as the initial centroids for the next clustering to provide consistency of the pseudolabels over the mini-batches.
2) Supervised Segmentation Objective: To enforce consistency of predictions with regard to the given classes, we train the CNN using the partially available annotated data. The supervised segmentation objective is involved here, where the conv2 layer, another 1 × 1 convolutional layer, replaces the conv1 layer to allow end-to-end training. The supervised segmentation objective is given in (3), where C represents the number of given classes, $y^{(i)} \in \{0, 1\}^{C}$ represents the one-hot encoded vector of the available annotation, and $f_\theta(x^{(i)})$ is a probability distribution of the output from the CNN, where conv2 replaces conv1.
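For orientation, objectives of the kind referred to as (1)-(3) are commonly written as below. This is a sketch assembled from the symbol definitions in the text; the $1/N$ normalization and the exact placement of the class-rebalancing weights are assumptions, and the expressions should not be read as the article's verbatim equations.

```latex
% (1) k-means loss over the PCA-reduced feature vectors (assumed form)
\min_{c_1,\dots,c_K}\; \frac{1}{N}\sum_{i=1}^{N}\,
  \min_{k \in \{1,\dots,K\}} d\!\left(z_{PC}^{(i)},\, c_k\right)

% (2) unsupervised clustering objective with pseudolabels y^{(i)} \in \{0,1\}^K (assumed form)
L_{cls} = \frac{1}{N}\sum_{i=1}^{N} w_{cls,k}^{(i)}\,
  \mathrm{CE}\!\left(g_\theta\!\left(z^{(i)}\right),\, y^{(i)}\right)

% (3) supervised segmentation objective with annotations y^{(i)} \in \{0,1\}^C (assumed form)
L_{seg} = \frac{1}{N}\sum_{i=1}^{N} w_{seg,c}^{(i)}\,
  \mathrm{CE}\!\left(f_\theta\!\left(x^{(i)}\right),\, y^{(i)}\right)

% Cross-entropy as defined in the text
\mathrm{CE}(p, q) = -\sum_{k} q_k \log(p_k)
```

In each weighted sum, the weight attached to sample $i$ is the one belonging to its pseudoclass $k$ or class $c$.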
3) Training Procedure: In addition to end-to-end learning, the proposed method operates in a mini-batch training manner, meaning that the network is updated once after each objective processes the information in each mini-batch [59]. We form two training subsets, one for each objective function, to facilitate the alternating mini-batch training. The training subset for the unsupervised clustering objective consists of the entire training input data, whether annotated or not, and does not include any class annotation of the data. On the other hand, the training subset for the supervised segmentation objective includes the annotated part of the training data, which makes up a small portion of the entire training data in the semisupervised learning scheme. Algorithm 1 illustrates the semisupervised training procedure with two training subsets.

Algorithm 1
Input:
  $\mathcal{X}$: training input data
  $\mathcal{X}_A \subset \mathcal{X}$: the annotated part of the training input data
  $\mathcal{Y}_A$: class annotation of $\mathcal{X}_A$
  $X$: an unannotated mini-batch of $\mathcal{X}$
  $(X_A, Y_A)$: an annotated mini-batch of $\mathcal{X}_A$ and $\mathcal{Y}_A$
Output:
  $Z$: feature map of the mini-batch $X$ at the end of the decoder
  $\hat{Y}$: created pseudolabel of the mini-batch $X$
  $P_A$: class prediction of the mini-batch $X_A$
Procedure:
  for each mini-batch do
    - Compute $Z$ by processing $X$ through the model
    - Create pseudolabel $\hat{Y}$ by clustering the principal components of $Z$
    - Compute $w_{cls}$ with respect to $\hat{Y}$
    - Append conv1 at the end of the decoder
    - Update the model end-to-end using $(X, \hat{Y})$ and the unsupervised clustering objective in (2) with gradient descent
    - Replace conv1 by conv2
    - Compute $P_A$ by processing $X_A$ through the model
    - Compute $w_{seg}$ with respect to $P_A$
    - Update the model end-to-end using $(X_A, Y_A)$ and the supervised segmentation objective in (3) with gradient descent
  end for

C. Advance on the Semisupervised Image Classification for Echosounder Data [22]
The difficulty of obtaining manual annotations is much more severe for semantic segmentation than for image classification, since in the former case annotations refer to the pixel level and not the entire image. The semisupervised method we propose in this article therefore solves a much more challenging problem than our previous preliminary work on semisupervised echosounder data patch classification [22], which is only able to classify whole image patches and cannot perform proper segmentation. Some elements of the new segmentation method resemble the previous classification method, however with significant differences due to the completely different aims of the two approaches. For the benefit of the reader, and since we use [22] as one of the comparison models in the experiments (referred to as SemiClf, Section IV-D), we elaborate on these differences in this section.
SemiClf [22] is an image classification method, which is also semisupervised by design, built around two alternating objective functions. However, this semisupervised algorithm has some critical drawbacks. The minimum patch size that the method can classify is 32 × 32 intensity pixels. This is far too coarse-grained to provide information at a pixel level. Second, the training procedure is inefficient. During training, the method samples the patches to tackle the imbalance in the cluster size. The sampling  [22] classifies echosounder patches with a shape of 32 × 32 × 4 into three classes using the modified architecture of VGG-16 [60], where 4 in the patch shape indicates the number of frequency channels. The architecture corresponds to an encoder of the neural networks. The result can be interpreted as a coarse-grained segmentation, where the minimum resolution of prediction is equal to the patch shape. On the contrary, our method leverages the modified U-Net architecture [23], providing a fine-grained segmentation where the minimum resolution is 1 × 1 × 4.
Training the CNN for semantic segmentation is much more challenging than the one for classification because the large and sophisticated architecture may hinder the backpropagation of the gradient to the other end of the network. We leverage the coupled architecture of encoder and decoder using dilations and concatenation functions to facilitate the backpropagation of the gradient, as suggested in U-Net. In addition, we simplify the data preprocessing by avoiding applying the criteria for determining which class each patch belongs to, which is required for the classification task.
2) Annotation-Free Class-Rebalancing Weight: Our method utilizes the cross-entropy loss for both the unsupervised and supervised learning schemes. However, the cross-entropy loss does not account well for imbalanced classes as it sums over all the intensities [61]. A common approach to tackle the class imbalance problem is to allocate class importance to mitigate the imbalance based on the class distribution. This includes rebalancing the class weights [62], [63], [64] and regulating the learning frequency by sampling [22], [53], [65]. Table I shows that the echosounder data are severely class-imbalanced to the given classes, where more than 99% of the backscattering intensities belong to the background (BG) class consisting of the water and seabed features. The supervised segmentation objective, therefore, should deal with the class imbalance problem in the echosounder data.
The unsupervised clustering objective should also tackle the class imbalance problem. The clustering approaches based on DeepCluster [53] can result in a trivial solution, such as empty clusters or clusters immensely larger than the average size. This causes an imbalance among the pseudoclasses, hindering the CNN from learning the structured representation. To tackle the imbalance, approaches based on DeepCluster [22], [53] purposely equalize the cluster size by sampling to uniformly distribute the pseudoclasses. For the segmentation task, however, sampling pixels to create class-balanced pseudolabels is not a strategic choice in terms of learning efficiency, as the discarded pixels may create a mask in the pseudolabel, hindering end-to-end mini-batch training.
Hence, we apply the class-rebalancing weight technique to the objective functions to bypass the sampling procedure. The weight leverages the number of predictions for each pseudoclass or class attribute instead of the available class annotation, differentiating our method from the previous studies [5], [22]. The class-rebalancing weight $w_{cls,k}$ for the unsupervised clustering objective $L_{cls}$ in (2) is given in (4). In this expression, N represents the total number of pseudolabels in a mini-batch, K represents the predefined number of pseudoclasses or clusters, and $N_k$ represents the number of pseudolabels of the pseudoclass k, where the sum over the K pseudoclasses is equal to N ($N = \sum_{k=1}^{K} N_k$). Equation (4) indicates that pseudoclasses larger than the average size N/K are penalized with a smaller weight than the other classes. In this study, rather than forcing the balance in a few available annotations, we introduce the class-rebalancing weight $w_{seg,c}$ for the supervised segmentation objective $L_{seg}$ in (3), which is given in (5). In this expression, C represents the number of classes in the annotated data, and $N_c$ represents the number of predictions of the class c, where the sum over the C classes is equal to N ($N = \sum_{c=1}^{C} N_c$). Note that we count $N_c$ from the prediction of the model rather than from the available annotation to avoid deterministic weight values, resulting in the annotation-free class-rebalancing weight.
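A weight with the behavior described above can be sketched as follows. The inverse-frequency form $w_k = (N/K)/N_k$ used here is an assumption chosen because it gives classes larger than the average size N/K a weight below one; it is not taken from the article's equations, and the zero weight for classes without predictions is likewise a choice of this sketch. The function name is hypothetical.

```python
from collections import Counter


def rebalancing_weights(predicted_labels, num_classes):
    """Return one weight per class from predicted (pseudo)labels.

    Assumed form: w_k = (N / K) / N_k, so classes larger than the average
    size N / K receive a weight below 1. Classes with no predictions get 0.0.
    """
    n = len(predicted_labels)
    counts = Counter(predicted_labels)
    average_size = n / num_classes
    return [
        average_size / counts[k] if counts[k] else 0.0
        for k in range(num_classes)
    ]


# Toy predictions: class 0 dominates, class 1 is rare.
toy_predictions = [0] * 8 + [1] * 2
weights = rebalancing_weights(toy_predictions, num_classes=2)
```

The same function applies to both objectives: for the clustering objective the labels are the cluster indices, and for the segmentation objective they would be the per-pixel class predictions of the model (for example, the argmax of its output) rather than the annotations.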

IV. EXPERIMENT
The purpose of the experiments is to explore the robustness of the proposed method in a semisupervised learning setting with limited annotations and, at the same time, the contribution of the unannotated data. We evaluate our method by comparing it with other segmentation models applied to the analysis of echosounder data, where the evaluation metrics include prediction accuracy, F1 score, confusion matrix, Cohen's kappa [66], and area under the receiver operating characteristic curve (AUC-ROC) [67].

A. Data Setup
We leverage the echosounder data from 2016 to 2017 to train the CNN-based segmentation model, and the trained model is tested on the echosounder data from 2019. Each input has four frequency channels (18, 38, 120, and 200 kHz). We randomly extract patches from the echosounder data. 200 patches from the data between 2016 and 2017 are used for the training set, and 60 patches from the data in 2019 are used for the test set. In addition, we extract 30 patches from the data between 2016 and 2017 as a validation set to tune the hyperparameters. There is no overlap among the patches. The model output is the segmentation map of the corresponding input, segmented into the three given classes. Fig. 3 and Table I show, respectively, a subset of the training patches and the general information of the training and test sets.

B. Annotation Ratio
To explore the impact of our semisupervised method, we define the annotation ratio as the ratio of the number of annotated patches to the total number of training patches. Six ratios are studied, namely, 1.00, 0.40, 0.35, 0.30, 0.25, and 0.20.
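As a small worked example of the annotation ratio, the sketch below converts ratios into numbers of annotated training patches. It assumes the six ratios are the ones that appear in the results (0.20, 0.25, 0.30, 0.35, 0.40, and 1.00) and uses the 200 training patches stated in the data setup. The helper name `annotated_count` and the rounding are assumptions.

```python
TRAINING_PATCHES = 200  # size of the training set stated in the data setup
ANNOTATION_RATIOS = [0.20, 0.25, 0.30, 0.35, 0.40, 1.00]  # assumed from the results


def annotated_count(ratio, total=TRAINING_PATCHES):
    """Number of annotated patches for a given annotation ratio (rounded)."""
    return round(ratio * total)


counts = {ratio: annotated_count(ratio) for ratio in ANNOTATION_RATIOS}
for ratio, count in counts.items():
    print(f"ratio {ratio:.2f}: {count} annotated, {TRAINING_PATCHES - count} unannotated")
```

Under these assumptions, each step of 0.05 in the annotation ratio corresponds to ten additional annotated patches.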

C. Training Configuration
The following training configuration is shared by all experiment setups. The model learns by mini-batch training, where the batch size $B_s$ is set to 2 in view of the available computational resources. Thus, the number of feature representations in a mini-batch, $N$, is 131 072 (2 × 256 × 256). The Adam optimizer [68] is applied with learning rate $3 \times 10^{-5}$, betas (0.9, 0.999), and weight decay $10^{-5}$. Training runs for up to 500 epochs in all experiments, with early stopping [69] when the accuracy has not improved for 100 epochs. For PCA, we choose the first 32 principal components shown in (1), as they capture most of the variance of the data. Three prediction classes are given ($C = 3$): background (BG), sandeel (SE), and other fish species (OT).
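The stopping rule above (at most 500 epochs, stop once the accuracy has not improved for 100 epochs) can be sketched as follows. This is an illustrative sketch only: the function name `train_with_early_stopping` is hypothetical, the per-epoch accuracies are passed in as a precomputed list instead of coming from real training, and which accuracy is monitored is not specified here.

```python
def train_with_early_stopping(accuracy_per_epoch, max_epochs=500, patience=100):
    """Return (last_epoch_run, best_epoch, best_accuracy) for a list of accuracies.

    Training stops at the first epoch where `patience` epochs have passed
    without any improvement over the best accuracy seen so far.
    """
    best_accuracy = float("-inf")
    best_epoch = -1
    history = accuracy_per_epoch[:max_epochs]
    for epoch, accuracy in enumerate(history):
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch, best_accuracy
    # No early stop: every available epoch (up to max_epochs) was run.
    return len(history) - 1, best_epoch, best_accuracy


# Toy run: accuracy peaks at epoch 1 and never improves afterwards.
accuracies = [0.5, 0.6] + [0.55] * 300
stopped_epoch, best_epoch, best_accuracy = train_with_early_stopping(accuracies)
print(stopped_epoch, best_epoch, best_accuracy)
```

In the toy run, epochs 2 through 101 bring no improvement, so training stops at epoch 101 and the model from epoch 1 would be kept.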
Regarding the number of clusters $K$, we choose $K = 512$ after testing a set of different values. Table II shows one of the tests, with the annotation ratio of 0.20, where the AUC-ROC value of the SE class (0.8306), the prediction accuracies of the BG and SE classes (BG accuracy 0.9861; SE accuracy 0.5312), Cohen's kappa (0.3449), and the F1 score (0.9856) are highest when $K = 512$. As addressed in the DeepCluster work [53], the number of clusters $K$ does not have a significant impact on the performance as long as the feature representations are clustered with a sufficiently large number of clusters compared to the number of classes. We tune these hyperparameters using the validation set. All the code is implemented in PyTorch [70].

D. Validation Methods
Our proposed method, PredKlus, is designed specifically to exploit the intrinsic nature of unannotated data, as well as to enforce class structure by supervision, all while handling the inherent class-imbalance of echosounder data by class-rebalancing weights. One could envision other approaches for exploiting unannotated data in semantic segmentation for acoustic target detection.
As the first comparison model to highlight this, we reimplement a recently published method for generic semisupervised semantic segmentation [71] for our specific task of acoustic target classification. This method, which we refer to as SemiCPS, also aims to integrate pseudoclass predictions (unsupervised) with class predictions (supervised). It does so by introducing an auxiliary segmentation network that mirrors the architecture of the main segmentation network but is initialized differently.
SemiCPS intends to encourage high similarity between the predictions of the two networks with different initialization for the same input image. For the annotated input, each network is individually trained in a supervised manner. For the unannotated input, the main network first creates the class prediction map by processing the input. This prediction map becomes the pseudolabel that will supervise the auxiliary network. Once the auxiliary network is updated by the pseudolabel, the main network is also supervised by the prediction map from the auxiliary network.
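The mutual supervision on unannotated inputs can be illustrated with a minimal sketch. It is not the SemiCPS implementation: it uses plain Python lists instead of network outputs, and it computes both directions of supervision in a single symmetric loss instead of the sequential updates described above. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values):
    """Index of the largest entry, used as the hard pseudolabel."""
    return max(range(len(values)), key=values.__getitem__)


def cross_entropy(logits, target):
    """Cross-entropy of one pixel's logits against a hard target class."""
    return -math.log(softmax(logits)[target])


def cross_pseudo_loss(main_logits, aux_logits):
    """Mean per-pixel loss where each network is supervised by the other's pseudolabel."""
    total = 0.0
    for main_pixel, aux_pixel in zip(main_logits, aux_logits):
        pseudo_from_main = argmax(main_pixel)   # pseudolabel for the auxiliary network
        pseudo_from_aux = argmax(aux_pixel)     # pseudolabel for the main network
        total += cross_entropy(aux_pixel, pseudo_from_main)
        total += cross_entropy(main_pixel, pseudo_from_aux)
    return total / len(main_logits)


# One pixel, three classes: the two networks either agree or disagree.
main = [[2.0, 0.0, 0.0]]
loss_agree = cross_pseudo_loss(main, [[3.0, 0.0, 0.0]])
loss_disagree = cross_pseudo_loss(main, [[0.0, 3.0, 0.0]])
print(loss_agree, loss_disagree)
```

The loss is small when both networks predict the same class and large when they disagree, which is what pushes the two predictions toward each other. The sketch also shows that a confident but wrong pseudolabel is treated exactly like a correct one.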
With SemiCPS, we implicitly explore how the unsupervised clustering objective affects the predictive performance when data are noisy. Due to the unpredictable nature of the underwater environment, the features of the target class and the nontarget classes are visually indistinguishable in some echosounder data. This may lead the mirrored network of SemiCPS to generate incorrect pseudolabels, which are tied to the supervision of the main network. If this repeats, neither of the two networks can make correct predictions. In contrast, the pseudolabels in our proposed method leverage the internal structure of the data set and are not tied to the class supervision. This makes our proposed approach more robust against noisy data, such as the echosounder data. As we will show in Section V, SemiCPS does not compare favorably to our approach. We believe this is because SemiCPS cannot exploit the intrinsic nature of the unannotated data, so that errors induced by pseudolabeling the noisy data propagate. This will be further discussed in Section V-A.
The second comparison model is the semisupervised patch classification method [22], referred to as SemiClf, where both the annotated and the unannotated parts are involved in the analysis. This model learns from small input patches of size 32 × 32 × 4 and classifies each patch into the given classes using a modified VGG-16 architecture [60]. We train SemiClf on the same training set, after splitting each provided echosounder input (256 × 256 × 4) into 64 small patches. In the inference phase, on the other hand, we extract the small patches with a stride of one pixel, resulting in a fine-grained prediction map. A voting mechanism determines the class of each pixel based on the class prediction frequency among the overlapping small patches. This significantly increases the computational complexity of SemiClf, but provides a pixel-level comparison between all methods.
The third comparison model is the fully supervised segmentation method [5], referred to as SupSeg in this study. It utilizes the same CNN architecture and supervised segmentation objective as our proposed method, and provides the class prediction of each backscattering intensity. However, it exploits neither the unannotated part of the data nor the unsupervised clustering objective. In semisupervised settings where the annotation ratio is smaller than one, this fully supervised method ignores the unannotated part and learns only from the annotated part of the training set.
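The per-pixel voting used for SemiClf at inference time can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation of [22]: patch predictions are supplied as a dictionary keyed by the top-left corner of each patch, the function name `vote_pixel_classes` is hypothetical, and ties are broken by whichever class was counted first.

```python
from collections import Counter


def vote_pixel_classes(height, width, patch_size, patch_predictions):
    """Assign each pixel the class predicted most often by the patches covering it.

    patch_predictions maps (top, left) -> predicted class for one patch.
    Pixels not covered by any patch are returned as None.
    """
    votes = [[Counter() for _ in range(width)] for _ in range(height)]
    for (top, left), predicted_class in patch_predictions.items():
        for row in range(top, min(top + patch_size, height)):
            for col in range(left, min(left + patch_size, width)):
                votes[row][col][predicted_class] += 1
    return [
        [cell.most_common(1)[0][0] if cell else None for cell in row_votes]
        for row_votes in votes
    ]


# Toy 3 x 3 image covered by 2 x 2 patches extracted with a stride of one pixel.
predictions = {(0, 0): "BG", (0, 1): "SE", (1, 0): "SE", (1, 1): "SE"}
prediction_map = vote_pixel_classes(3, 3, 2, predictions)
for row in prediction_map:
    print(row)
```

The center pixel is covered by all four patches and is assigned SE by three votes to one, while the top-left pixel is covered by a single patch and keeps its BG prediction. With a stride of one pixel, the number of patch classifications grows with the number of pixels, which is the source of the extra computational cost noted above.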

V. RESULT AND DISCUSSION
Our method and the three comparison models, namely SupSeg [5], SemiClf [22], and SemiCPS [71], are evaluated with various performance measures using the test echosounder data specified in Table I. The measures include the AUC-ROC value and the class prediction accuracy for each class and annotation ratio (see Table III), as well as Cohen's kappa (kappa) and the F1 score for each annotation ratio (see Table IV). AUC is the area under the ROC curve, where a higher AUC indicates better segmentation performance. Regarding the class prediction accuracy, note that the SE class has the lowest prediction accuracy among the classes in many setups. This indicates that the SE class prediction is a conservative estimate [22].
In addition to these measures, the confusion matrix and the corresponding ROC curve for each setup are computed for comparison, as shown in Figs. 4-9. In the confusion matrices, each row corresponds to a ground-truth class and sums to one, and each column corresponds to a predicted class. The first row and column indicate the BG class, and the second and third rows and columns denote the SE and OT classes, respectively. For the ROC curves, the vertical axis indicates the true-positive rate, while the horizontal axis shows the false-positive rate. For visual comparison, we provide the prediction maps of the test data in Figs. 10-12, where four parts of the echosounder data from 2019 and their prediction maps are visualized. Overall, the results show that our semisupervised method outperforms the comparison models across the annotation ratios.
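For readers unfamiliar with these measures, the sketch below computes a row-normalized confusion matrix and Cohen's kappa from toy labels. The labels are made up for illustration and are unrelated to the results in the tables. The class indices 0, 1, and 2 stand for BG, SE, and OT, and the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count matrix with ground-truth classes as rows and predictions as columns."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_class, predicted_class in zip(true_labels, predicted_labels):
        matrix[true_class][predicted_class] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its sum so that every non-empty row sums to one."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def cohens_kappa(matrix):
    """Cohen's kappa: agreement beyond what the row and column marginals give by chance."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / n
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels: 0 = BG, 1 = SE, 2 = OT.
true_labels = [0, 0, 0, 0, 1, 1, 2, 2]
predicted_labels = [0, 0, 0, 1, 1, 0, 2, 2]
counts = confusion_matrix(true_labels, predicted_labels, num_classes=3)
normalized = normalize_rows(counts)
kappa = cohens_kappa(counts)
print(normalized, kappa)
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.375, which gives a kappa of 0.6. Because kappa discounts chance agreement, it is more informative than plain accuracy when one class, such as BG, dominates the pixels.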

A. Comparison to Semisupervised Segmentation Method Using Pseudolabels (SemiCPS)
Tables III and IV show that our proposed method outperforms SemiCPS across all evaluation metrics in the semisupervised setups with annotation ratios of 0.20-0.40. The greatest performance difference is observed at the annotation ratio of 0.20, which is the most extreme semisupervised setup. Our method achieves a kappa score of 0.3449, which is 18.8 times greater than the kappa score of SemiCPS (0.0183).
The prediction maps in Fig. 10 also visually confirm the better results of our proposed method. SemiCPS does not make predictions close to the fish patterns for the annotation ratios of 0.20-0.25, but tends to capture the fish class patterns from the annotation ratio of 0.30 and higher. However, quite a few fish patterns are still misclassified as the BG class, yielding a smaller predicted area and worse results than our proposed method. Our proposed method, in contrast, tends to capture most of the major fish patterns in the prediction map from the annotation ratio of 0.20. Although the prediction map appears noisy at low annotation ratios due to misclassification of small clutter patterns, the noise is filtered out as the annotation ratio increases, and the prediction map becomes close to the ground truth and the input. We observe the same visual trends in Figs. 11 and 12.

B. Comparison to Semisupervised Patch-Based Segmentation (SemiClf)
Our proposed method outperforms SemiClf [22] across all measures and setups. We argue that the novelties of our method, such as the learning mechanism for fine-grained segmentation and the annotation-free class-rebalancing technique, contribute to this result by addressing the shortcomings of the patch-based SemiClf. The kappa scores show the difference clearly: at the annotation ratio of 0.20, our score is about 18 times greater than that of SemiClf (ours 0.3449; SemiClf 0.0191).
In addition to the poor prediction maps shown in Figs. 10-12, another critical drawback of SemiClf is misclassification of the seabed feature, which is known for a considerably higher intensity than the fish classes [5]. The seabed feature is marked with a distinct yellow horizontal line in the input echosounder data. As shown in the prediction maps, SemiClf and SemiCPS predict the seabed as one of the fish classes (blue or red) throughout the annotation ratios. In contrast, our method learns the seabed feature and correctly predicts it as the BG class (white), as intended.

C. Comparison to Fully Supervised Method (SupSeg)
We compare the results of our method to those of SupSeg [5] to investigate how the unsupervised clustering objective and the unannotated data improve the predictive performance. Overall, our proposed method outperforms SupSeg across all annotation ratios for all AUC-ROC values and for the SE and OT class accuracies in Table III. The results indicate that the unsupervised clustering objective improves the performance of the segmentation task by effectively exploiting the structured representation from both the unannotated data and the available annotated data.
Note that our proposed method outperforms SupSeg for the annotation ratio of 1.00 (fully supervised case). With this result, we argue that our proposed method is generic and can outperform conventional fully supervised learning methods, such as SupSeg. Alternating between the two objective functions is also applicable to the fully supervised case, where it couples the two objectives so that the annotated data are used in combination with the clustering structure. Over the iterations, the datapoints in each cluster gradually share the dominant class annotation and eventually have the same class prediction, approximating to some extent the decision boundaries that SupSeg achieves.
In Table IV, we find two inconsistent cases for the annotation ratios of 0.35 and 0.40, where SupSeg achieves greater kappa and F1 scores. However, we argue that this result does not undermine the robustness of our proposed method. Instead, we believe that SupSeg is biased towards predicting the BG class, where the bias is related to a severe class imbalance in the training data, especially in the newly added part of the annotated data. The prediction accuracy of the BG class for these annotation ratios supports our reasoning, where SupSeg achieves better accuracy than our proposed method for these annotation ratios (SupSeg 0.9857, ours 0.9842 with the annotation ratio of 0.35; SupSeg 0.9811, ours 0.9857 with the annotation ratio of 0.40). On the other hand, the prediction accuracies of the two fish classes do not increase as much for SupSeg as they do for our method (ours 0.6609, SupSeg 0.6128 with the annotation ratio of 0.35 and the SE class accuracy; ours 0.6304, SupSeg 0.6238 with the annotation ratio of 0.40 and the SE class accuracy; ours 0.6419, SupSeg 0.6399 with the annotation ratio of 0.35 and the OT class accuracy; ours 0.7307, SupSeg 0.6029 with the annotation ratio of 0.40 and the OT class accuracy). Visual inspection of the annotated part of the training data provides further grounds for our argument.
When visually inspecting the part of the training set added between the annotation ratios of 0.30 and 0.35, where ten input-annotation data pairs are added, we find that five out of the ten data pairs consist of only BG class pixels without any fish class pixel. Analogously, we find that six out of the ten data pairs added between the annotation ratios of 0.35 and 0.40 consist of only BG class pixels without any fish class pixel. Across the entire training data, inputs without any fish intensity pixels make up about 20% of the data on average. Hence, we argue that the class imbalance at these annotation ratios is more severe than in the other cases and causes the prediction bias towards the BG class for SupSeg.

D. Confusion Matrix and ROC Curve
Figs. 4-9 compare our proposed method to the comparison models using confusion matrices and ROC curves. When comparing the diagonal components of the confusion matrices visually, our proposed method shows more distinct diagonal components than the other models. This implies that 1) our proposed method outperforms the comparison models in terms of class accuracy, as shown in Table III; and 2) our proposed method also achieves lower false-positive rates between the fish classes than the other models, as seen from a closer look at the SE and OT entries (second and third rows and columns). For example, comparing the false-positive rate of SE predictions for the OT class ground truth, shown in the second column and the third row of the confusion matrices, ours achieves lower false-positive rates throughout the semisupervised setups. This result is consistent with the false-positive rate of OT predictions for the SE class ground truth, shown in the third column and the second row of the matrices.
The ROC curve shows the tradeoff between the true-positive and false-positive rates. Segmentation models with curves closer to the top-left corner perform better, consistent with the AUC values in Table III. The curves and the AUC values together confirm the better performance of our method.

VI. CONCLUSION
In this article, we propose a novel semisupervised deep learning method for semantic segmentation of echosounder data. Our method considerably reduces the dependence on annotated data, achieving results comparable to the fully supervised segmentation method [5] while using only 40% of the annotated data in addition to unannotated data. Our method also outperforms the other semisupervised methods for echosounder data [22], [71]. Our methodological novelty is to take advantage of deep clustering to exploit the underlying structure of the training data, regardless of the annotation, in a semisupervised learning scheme. In addition, our method is end-to-end and mini-batch trainable, and regulates the class imbalance based on the model predictions without relying on the annotated part of the data. Extensive experiments with various performance measures validate the robustness of the proposed method.
Our proposed method is generic and applicable to other fish species with a small amount of annotated echosounder data. To the best of our knowledge, this is the first semisupervised semantic segmentation article for the echosounder data analysis based on deep learning. The promising results imply that our proposed method can reduce the expensive costs required for the annotation. The performance can be improved by utilizing semantic information, e.g., a simple classifier that can exclude the background class pixels when collecting the echosounder data.
In future work, we intend to explore the uncertainty of the segmentation results to improve the interpretability of the model prediction. We also intend to extend our method to take this uncertainty into account when the pseudolabels are created, in order to obtain crisper and clearer decision boundaries among the clusters.