Automated Detection of Hydrothermal Emission Signatures From Multibeam Echo Sounder Images Using Deep Learning

Seafloor massive sulfide deposits have attracted attention as a mineral resource, as they contain a wide variety of base, precious, and other valuable critical metals. Previous studies have shown that signatures of hydrothermal activity can be detected by a multibeam echo sounder (MBES), which would be beneficial for exploring sulfide deposits. Although detecting such signatures from acoustic images is currently performed by skilled humans, automating this process could improve the efficiency and cost effectiveness of exploration for seafloor deposits. Herein, we attempted to establish a method for automated detection of MBES water column anomalies using deep learning models. First, we compared the “Mask R-CNN” and “YOLO-v5” detection model architectures, wherein YOLO-v5 yielded higher F1 scores. We then compared the number of training classes and found that models trained with two classes (signal and noise) exhibited superior performance compared with models trained with only one class (signal). Finally, we examined the number of trainable parameters and obtained the best model performance when the YOLO-v5l model, which has a large number of trainable parameters, was used in the two-class training process. The best model had a precision of 0.928, a recall of 0.881, and an F1 score of 0.904. Moreover, this model achieved a low false alarm rate (less than 0.7%) and had a high detection speed (20−25 ms per frame), indicating that it can be applied in the field for automatic and real-time exploration of seafloor hydrothermal deposits.


I. INTRODUCTION
As the discovery of high-grade terrestrial metal deposits has become difficult, increasing attention has been paid to seafloor mineral resources such as ferromanganese nodules, Co-rich ferromanganese crusts, seafloor massive sulfides, and rare-earth element and yttrium-rich mud [1], [2], [3], [4], [5], [6]. Among these, seafloor massive sulfide deposits, which are composed of sulfide chimneys and mounds, form in close proximity to submarine hydrothermal vents. Since these deposits are rich in various base, precious, and critical metals [1], [2], [7], [8], [9], they have attracted attention as potential sources of minerals that are essential for a sustainable society in the future [2], [10], [11]. Moreover, seafloor massive sulfide deposits also attract attention as an analogue of metal deposits on land that formed over a geological time scale of hundreds of millions of years [7]. Thus, the study of submarine hydrothermal deposits is important both to thoroughly understand the resources that humankind utilizes at present and to secure resources that humankind will require in the future.
In addition to its value in resource engineering, understanding hydrothermal activity is also important for understanding the Earth and the origins of life [12], [13], [14]. Hydrothermal fluids form when cold seawater seeping into the ocean crust is heated by magma beneath the seafloor. This process transfers heat and various elements (including alkali metals, Fe, Mn, and P) from the solid earth to the ocean [12], [15], [16]. A unique ecosystem also forms around hydrothermal vents [17], [18], [19]. Unlike the photosynthesis-supported ecosystems that are widespread on land and at the ocean surface, hydrothermal systems support ecosystems that produce energy using the redox potential difference between hydrothermal fluid and seawater, which is considered to be an analogue for early Earth ecosystems [13], [20], [21], [22], [23]. Therefore, exploration of seafloor massive sulfide deposits is important, not only for resources, but also for various scientific purposes.
Fig. 1. Representative image of an acoustic water column anomaly observed by the R/V Yokosuka with an EM122 (12 kHz) system.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Generally, exploration for seafloor resources is technically challenging and costly. Under these circumstances, multibeam echo sounders (MBES) have attracted attention as a method for detecting hydrothermal signatures associated with seafloor massive sulfides [24], [25]. An MBES is an instrument that transmits multiple fan-shaped acoustic beams to the seafloor and measures the arrival time and intensity of the backscattered waves. Although it was originally developed to measure bathymetry by detecting waves reflected from the seafloor, it has also been used to image the water column above the seafloor to detect and monitor various targets such as gas seeps, shipwrecks, and marine organisms [26], [27], [28]. Surveys around the Okinawa Trough revealed vertical acoustic reflections in water column images above hydrothermal vents (see Fig. 1). Since the depth at which the acoustic reflections disappear corresponds to the phase-change point of CO2 from hydrate or liquid to gas, CO2 droplets released by hydrothermal activity are considered to be the cause of the acoustic reflections [24].
In previous studies, water column backscatter data observed by the MBES were converted into color images, and potential hydrothermal signatures were detected by skilled observers. If this process were automated, it would not only improve the exploration efficiency for seafloor hydrothermal deposits but could also be applied to future fully autonomous exploration with autonomous underwater vehicles (AUVs) and autonomous surface vehicles (ASVs).
Recently, image recognition technology has demonstrated remarkable progress. Deep learning-based models can not only classify images but can also identify and locate various objects included within an image. This latter technique, called object detection, has been applied in various academic fields, such as medical image diagnosis [29], cell detection [30], and fossil observation [31], and can perform an automatic advanced image diagnosis with high accuracy, which was previously only possible for skilled observers. Therefore, we aimed to develop a deep learning model that could automatically and accurately detect signals of seafloor hydrothermal activity from MBES images. For this purpose, we compared several learning conditions and determined the optimal conditions.

A. Data Description
MBES images acquired during research cruises YK14-17 and YK15-14 onboard the R/V Yokosuka were used as the dataset. Cruise YK14-17 was conducted in the area north of Okinawa Island in the Central Okinawa Trough (see Fig. 2), from August 19 to September 19, 2014, and YK15-14 was conducted west of Okinawa Island in the Central Okinawa Trough (see Fig. 2) from August 13 to 26, 2015.
An EM122 system (Kongsberg Maritime) was used to acquire the MBES data. The acoustic beams were emitted every 5 s. The colored heat maps of reflection intensity generated by the MBES system were captured as images of 1600 × 740 pixels.
A previous study [24] defined the signal of hydrothermal activity in MBES images as one to several streams of acoustic scatterers that rise vertically from the seafloor to a water depth of 500-1000 m without a significant change in stream width. MBES images also include some patterns of "noise," which are similar to signals but are not caused by hydrothermal activity. Examples of noise include layering noise, which occurs horizontally; sidelobe noise, which forms semicircular patterns that always occur at the sides of the water column; and background noise, which represents turbulence or reflections from particles in seawater [33] (supplementary material 1). Using these definitions, 1180 images containing at least one signal of hydrothermal activity were selected for the dataset in this article. We note here that the signals in the dataset were related to actual hydrothermal sites confirmed by a manned submersible or a remotely operated vehicle. These images were randomly divided into three subsets: 800 images were used for the training dataset, 100 were used for validation, and 280 were used for testing. Images in each subset were annotated with the classes "signal," indicating signs of hydrothermal activity, and "noise," which has color patterns similar to those of "signal," but with different shapes. The datasets used in this article are available on Mendeley Data [34].
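The random 800/100/280 division described above can be sketched as follows. This is only an illustration: the file names, the fixed seed, and the helper `split_dataset` are hypothetical and are not taken from the published dataset or code.

```python
import random


def split_dataset(image_ids, n_train=800, n_val=100, n_test=280, seed=0):
    """Randomly divide image IDs into training, validation, and test subsets."""
    if len(image_ids) != n_train + n_val + n_test:
        raise ValueError("subset sizes must add up to the number of images")
    shuffled = list(image_ids)
    # A seeded generator keeps the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 1180 selected MBES images.
image_ids = [f"mbes_{i:04d}.png" for i in range(1180)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

Fixing the seed is a design choice that makes the split repeatable across the ten training cases, so that every model is compared on the same test images.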

B. Model Training Conditions
Ten training and evaluation runs were performed by varying the three training conditions described below (see Table I).
1) Architecture Selection: Image recognition techniques that use deep learning include image classification, which determines the class of an image; object detection, which predicts the boxes that surround objects in images; semantic segmentation, which predicts the position of objects at the pixel level; and instance segmentation, which is similar to semantic segmentation but can distinguish different objects. All of these techniques seem to be capable of determining whether an image contains signals of hydrothermal activity. However, considering the application of the deep learning system to actual exploration, the system should be able to determine the position and number of signals within an image. The techniques that meet this purpose are object detection and instance segmentation. Here, we selected Mask R-CNN [see Fig. 3(a)] [35] and YOLO-v5 [see Fig. 3(b)] [36] as representative architectures of instance segmentation and object detection, respectively. In this article, Mask R-CNN was used for Cases 1-4, while YOLO-v5 was used for Cases 5-10 (see Table I).
2) Learning Classes: To investigate the importance of learning noise, we compared the model performances based on the dataset in which only signals were annotated [see Fig. 4(b)] and a dataset in which both signals and noise were annotated [see Fig. 4(c)]. In Cases 1, 3, 5, 7, and 9, the models learned only the signal information, whereas the models learned both signal and noise information in Cases 2, 4, 6, 8, and 10.
3) Number of Trainable Parameters: In general, machine learning models perform better when the number of trainable parameters increases. However, as the number of trainable parameters increases, a larger dataset is required to suppress overtraining [37]. Therefore, selecting the appropriate number of trainable parameters according to the size of the prepared dataset is effective for achieving a high performance.
In Mask R-CNN, the number of trainable parameters can be changed by selecting the trainable layers. The Mask R-CNN model can be divided into three major parts: the backbone, which extracts features from images; the region proposal network (RPN), which suggests candidate regions for objects; and the head, which predicts the classes and contours of the objects proposed by the RPN (see Fig. 5) [35]. In the "Heads" learning condition, only the weights of layers in the RPN and head are updated during training, while the "3+" condition trains a portion of the backbone in addition to the layers trained in "Heads." For layers that were not trained, the weights were fixed to those of the model pretrained on COCO [38], which is a large dataset of general images. In this article, trainable layers were set to "Heads" for Cases 1 and 2, and to "3+" for Cases 3 and 4.
In YOLO-v5, several models with different numbers of parameters were provided. In this article, we trained the YOLO-v5s model with a small number (∼7.2 million) of parameters in Cases 5 and 6. Similarly, we trained YOLO-v5m with a medium number (∼21.2 million) of parameters in Cases 7 and 8, and YOLO-v5l with a large number (∼46.5 million) of parameters in Cases 9 and 10.
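The ten cases follow from crossing five model variants with two class sets. The sketch below enumerates them in the order given in the text; the dictionary layout and the function name `build_cases` are illustrative assumptions, not the authors' configuration format.

```python
from itertools import product

# Five model variants in the order of the case numbering described in the text.
VARIANTS = [
    ("Mask R-CNN", "Heads"),
    ("Mask R-CNN", "3+"),
    ("YOLO-v5", "YOLO-v5s"),
    ("YOLO-v5", "YOLO-v5m"),
    ("YOLO-v5", "YOLO-v5l"),
]

# Odd-numbered cases learn only signals; even-numbered cases also learn noise.
CLASS_SETS = [("signal",), ("signal", "noise")]


def build_cases():
    """Return the ten training cases as a list of dictionaries."""
    cases = []
    pairs = product(VARIANTS, CLASS_SETS)
    for case_no, ((architecture, variant), classes) in enumerate(pairs, start=1):
        cases.append({
            "case": case_no,
            "architecture": architecture,
            "variant": variant,
            "classes": classes,
        })
    return cases


cases = build_cases()
for case in cases:
    print(case)
```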

C. Experimental Procedures
For all cases, we trained the model for up to 80 epochs using the training dataset. The training of Mask R-CNN was conducted using ResNet-101 [39] as the backbone of the model. We used the default loss functions implemented for Mask R-CNN and YOLO-v5, both of which sum the losses for locating objects and for determining the classes of the objects [35], [36]. Based on preliminary examination [31], [36], the initial learning rate was set to 0.001 for Mask R-CNN and 0.01 for YOLO-v5. In all cases, the training images were randomly flipped, scaled, and translated horizontally and vertically to prevent overfitting.
After each training epoch, the model was evaluated using the validation dataset. In Mask R-CNN, the mean average precision (mAP) was monitored to determine the best epoch. In YOLO, the best model is automatically determined by a built-in function that measures the fitness between the predicted and original data, which is based mainly on mAP.
Finally, the best-epoch model was used to detect signals in the test dataset. The evaluation was based on whether the signal was detected correctly, regardless of whether the model learned the noise. For both Mask R-CNN and YOLO, a detection with an intersection over union (IoU) greater than 50% was considered a "correct" answer. To compare the performance of the two architectures under the same conditions, the IoU values of both the Mask R-CNN and YOLO models were evaluated based on the bounding box. The numbers of true positives (TPs), false positives (FPs), and false negatives (FNs) were counted, and the precision, recall, and F1 score were calculated according to the following equations:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Precision represents the extent to which a model does not misclassify nonsignals as signals, recall represents the extent to which a model does not fail to detect an actual signal, and the F1 score is the harmonic mean of precision and recall, indicating the overall balance of the model.

The evaluation results for all cases are summarized in Table I. Examples of the detected images are shown in supplementary material 2. The F1 scores of Mask R-CNN ranged from 0.20 to 0.65, which were much lower than those of YOLO-v5 (0.84−0.90; see Fig. 6). The difference between the F1 scores of the Mask R-CNN and YOLO-v5 models may have been due to the differing complexity of the tasks. While YOLO performs the simpler task of determining boxes surrounding objects, Mask R-CNN performs the more complex task of determining the region of the object at the pixel level (see Fig. 3). Therefore, with the limited dataset of approximately 1000 images in this article, YOLO may have been able to achieve higher accuracy.
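The evaluation procedure described above (a bounding-box IoU greater than 50% counted as correct, followed by precision, recall, and F1) can be sketched in Python as follows. The box format `(x_min, y_min, x_max, y_max)` and the greedy one-to-one matching in `count_matches` are assumptions made for illustration; the article does not state how predictions were paired with annotations.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(pred_boxes, true_boxes, iou_threshold=0.5):
    """Count TP, FP, and FN with greedy one-to-one matching (an assumed strategy)."""
    unmatched = list(true_boxes)
    tp = fp = 0
    for pred in pred_boxes:
        best = max(unmatched, key=lambda t: box_iou(pred, t), default=None)
        if best is not None and box_iou(pred, best) > iou_threshold:
            tp += 1
            unmatched.remove(best)  # each annotated signal is matched at most once
        else:
            fp += 1
    return tp, fp, len(unmatched)


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from the counts, guarding against zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one good detection, one spurious detection, one missed signal.
preds = [(0, 0, 10, 10), (100, 100, 110, 110)]
truths = [(1, 0, 11, 10), (50, 50, 60, 60)]
counts = count_matches(preds, truths)
scores = precision_recall_f1(*counts)
print(counts, scores)
```

As a consistency check on the reported figures, a precision of 0.928 and a recall of 0.881 give 2 × 0.928 × 0.881 / (0.928 + 0.881) ≈ 0.904, which agrees with the F1 score stated for the best model.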
2) Effects of Noise Learning: In Mask R-CNN, models that learned noise had higher precisions and F1 scores by more than 20% and had lower recall by several percent [see Fig. 7(a)]. As the total balance of the model indicated by the F1 score increased, noise learning was considered to be effective. Cases 1 and 3, in which the model did not learn noise, were characterized by very low precisions and slightly higher recalls compared to Cases 2 and 4. This suggests that all suspicious regions were detected as signals. In this scenario, learning negative (nonsignal) objects would suppress detection and yield a more balanced model.
In YOLO, models that learned noise had higher values of all evaluation metrics by several percent, suggesting that noise learning was also effective for YOLO [see Fig. 7(b)]. However, the models without noise learning also had relatively high precisions. This suggests that the models that learned only signals were already able to distinguish between signal and noise to some extent. Therefore, the effect of noise learning was not as large as that for Mask R-CNN. Interestingly, recall also increased with noise learning in YOLO, suggesting that learning noise may have sharpened the criteria that separate signal from noise.

3) Effects of the Numbers of Trainable Parameters: For Mask R-CNN, increasing the number of trainable parameters resulted in increases in all evaluation indices [see Fig. 8(a)]. Under the condition of training only the "Heads" layers, the weights of the backbone, which extracts features from the image, are fixed to the weights pretrained on the large dataset of general images (cf. Fig. 5). Because MBES images differ substantially from such general images, the Mask R-CNN model trained only on "Heads" may not have been able to extract features from the MBES images effectively.
In YOLO, the YOLO-v5m model demonstrated the highest F1 score when only learning signals, whereas YOLO-v5l had the highest F1 score when learning both signals and noise [see Fig. 8(b)]. This suggests that YOLO-v5m and YOLO-v5l are suitable for a dataset with a scale of approximately 1000 images. The best number of parameters differed depending on the number of learning classes. A model with relatively few parameters (YOLO-v5m) may be suitable for simple single-class detection, whereas a model with a large number of parameters (YOLO-v5l) may be suitable for multiclass detection.

B. Applicability to Actual Exploration
Of the ten cases examined in this article, Case 10 had the highest precision, recall, and F1 score, which were 0.928, 0.881, and 0.904, respectively. Here, we further investigate the applicability of this model for actual exploration of hydrothermal activities.
First, signals occur rarely in practical exploration, so the model must minimize false alerts from nonsignal images. To evaluate this, the best model was applied to 1016 images that do not contain a signal. False signals were generated from only seven images, corresponding to a false alarm rate of 0.69%. Therefore, by using this model, the exploration of hydrothermal activity will not be severely hampered by a large number of false positives.
Second, in terms of the prediction output format, YOLO only encloses the region of the acoustic anomaly with a box, whereas Mask R-CNN can obtain the contours of the anomaly at the pixel level. This means that Mask R-CNN can identify the shapes of the anomalies more accurately. However, the main purpose of the MBES survey is to detect the locations of hydrothermal vents, and determining the shapes of the anomalies is not a high priority. Therefore, the output of YOLO is sufficient for detecting hydrothermal activity.
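The false alarm rate quoted above is simply the fraction of signal-free images in which the model reported at least one signal. A minimal sketch of this arithmetic is given below; the function name `false_alarm_rate` is illustrative and not part of any published code.

```python
def false_alarm_rate(n_false_alarm_images, n_signal_free_images):
    """Fraction of signal-free images in which at least one signal was reported."""
    if n_signal_free_images <= 0:
        raise ValueError("the number of signal-free images must be positive")
    return n_false_alarm_images / n_signal_free_images


# Values reported in the text: 7 false alarms among 1016 signal-free images.
rate = false_alarm_rate(7, 1016)
print(f"{rate:.2%}")  # prints 0.69%
```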
Finally, to determine the detection speed, we performed a detection test on a movie file using the model from Case 10 (supplementary material 3). On the Google Colaboratory platform, the average detection time using a Tesla P100 GPU (NVIDIA Corporation) was 20−25 ms/frame, which is faster than the pace of typical MBES image generation. Therefore, the model from Case 10 can instantly recognize the signs of hydrothermal activity from images generated by MBES and can be used for real-time detection.
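To put the detection speed in perspective, the short calculation below compares the reported 20−25 ms per frame with the 5 s ping interval used in this survey. It assumes that one water column image is produced per ping, which is an illustrative assumption; the function names are hypothetical.

```python
PING_INTERVAL_S = 5.0  # acoustic beams were emitted every 5 s in this survey


def throughput_fps(ms_per_frame):
    """Frames that can be processed per second at the given detection time."""
    return 1000.0 / ms_per_frame


def realtime_margin(ms_per_frame, interval_s=PING_INTERVAL_S):
    """How many times faster detection is than image generation (one image per ping assumed)."""
    return interval_s * 1000.0 / ms_per_frame


for ms in (20.0, 25.0):
    print(ms, throughput_fps(ms), realtime_margin(ms))
```

Under that assumption, detection runs at roughly 40 to 50 frames per second, about 200 to 250 times faster than the rate at which new images arrive.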
Since a single commercial GPU card is sufficiently fast, the detection system can be built in a compact and power-efficient manner without a huge machine learning server or power supply unit. This suggests that detecting seafloor massive sulfide deposits can be achieved using a relatively small-scale platform. AUVs and ASVs are small-scale platforms that have made remarkable progress recently [40], [41]. If the detection system proposed herein could be loaded onto AUVs, they would not only be able to automatically predict the locations of hydrothermal vents by detecting acoustic anomalies from the sea surface but would also be able to dive to the prospective area and perform a camera survey to confirm the presence of a deposit. Such fully automated exploration is very cost-effective compared to conventional exploration using research vessels and piloted submersibles and has the potential to accelerate future exploration of seafloor massive sulfide deposits.

IV. CONCLUSION
We trained deep learning models to automatically detect signals of seafloor hydrothermal activity from MBES images under ten training conditions and evaluated their performances. When comparing the Mask R-CNN and YOLO-v5 model architectures, YOLO-v5 performed better. When comparing models that learned only signals with those that learned both signals and noise, the F1 scores of the cases with noise learning were higher than those of the cases without noise learning in all comparison pairs, indicating that noise learning was effective. When considering the number of trainable parameters, the YOLO-v5l model with a large number of trainable parameters was the most suitable. The best model had high performance scores, with a precision of 0.928, a recall of 0.881, and an F1 score of 0.904. In addition, this model achieved a low false alarm rate and had a high detection speed (20−25 ms per frame), indicating that the model is applicable to automated exploration of seafloor massive sulfide deposits.

ACKNOWLEDGMENT
The authors would like to thank the crew of the R/V Yokosuka for their technical support during the research cruises. They would also like to thank H. Okamoto for providing invaluable insight into signal detection through preliminary discussions.