Optimizing the Hyperparameter Tuning of YOLOv5 for Underwater Detection

This study optimized the latest YOLOv5 framework and its subset models by training them on datasets that differed in image contrast and turbidity and assessing model performance with quantitative metrics and image-processing speed. The hyperparameters of the feature-extraction phase were configured through the learning rate and momentum and further improved with the adaptive moment estimation (ADAM) optimizer and the reduce-learning-rate-on-plateau function to optimize the training scheme. The optimized YOLOv5s achieved the best performance, with a mean average precision of 98.6% and a high inference speed of 106 frames per second. The ADAM optimizer with fine-tuned learning rate (0.0001) and momentum (0.99) yielded a sufficient convergence rate (0.69% at the 55th epoch) to help YOLOv5s attain more precise detection of underwater objects.


I. INTRODUCTION
To analyze captured images as video-sequence frames, image processing of real-time or recorded underwater video becomes imperative for extracting underwater features. However, real-time video processing is technically laborious and time-consuming when performed on a serial processor for several reasons: the images constitute a large data set, and processing them requires complex operations (e.g., high design-space complexity and a huge amount of labelled data) [1]-[4].
Meanwhile, data-driven classification models such as neural networks meet the requirements of an automatic, non-destructive tool for managing underwater biodiversity [5]. However, unlike atmospheric conditions, underwater scenes exhibit degraded illumination, low contrast, and changes in visibility due to turbidity [6]-[8], as depicted in Fig. 1. Therefore, it is crucial for an underwater vision system (UVS) to overcome such limitations by mitigating these scene constraints.

II. UNDERWATER OBJECT DETECTION AND CHALLENGES
Although underwater object detection using image capture is the most accessible approach, it poses several challenges for extracting detailed imaging information, even when marine vessels and robots are integrated with advanced imaging technologies. Factors governing underwater colour absorption and scattering, i.e., water properties and impurities, degrade the quality of the photographs captured by underwater imaging devices [48]. Light attenuation in water further complicates the processing of sea imaging data. Several studies showed that intrinsic deficiencies in underwater images arise from the appearance of objects and ambient noise [49]-[51]. Consequently, a real-time system struggles to distinguish objects from their surroundings in these images.

A. OVERCOMING THE CHALLENGES OF THE UNDERWATER ENVIRONMENT
The complex nature of the underwater environment poses the biggest challenge to object detection and recognition in underwater images [55]-[57]. Advancements in camera and video technologies have increasingly evolved into broad applications with higher-quality and better-resolution images, enabling efficient and precise underwater analysis. However, manual extraction and analysis of each frame in a recorded video are labor-intensive, cost-ineffective, and prone to fatigue errors [2]-[4]. The main challenge in underwater imaging is the limited availability of light, which causes high variability in light intensity and yields poor luminosity, distortion, and light attenuation [6]-[9]. Other challenges include water murkiness and confusion of the underwater floor with marine organisms in the background [4], which decrease the accuracy of visual perception in the recorded images. To overcome these challenges, deploying a machine-learning algorithm within a computer vision system can enhance the resolution of underwater imagery degraded by high turbidity [10], [42]. Besides improving the quality of underwater images, some computational algorithms can accelerate automatic detection in machine learning, enhancing the efficiency of monitoring and analysis [3]. Several imaging methods have been devised specifically to improve the imaging range and quality of underwater imaging systems [58]-[60]. For example, hyperspectral imaging [61], [62] has been used in underwater object detection because spectral images in different bands provide researchers with a better understanding of the image information. Meanwhile, to correct underwater images distorted by scattering, absorption, colour loss, diffraction, polarization, or light attenuation, an image restoration method has been suggested in [63]. Furthermore, to address water-quality problems in the underwater environment, a deep learning-based method [64] has shown promising results, especially for monitoring large numbers of mariculture fish. Therefore, computer-based methods are highly recommended for addressing underwater environment issues in underwater research and engineering.

B. MACHINE LEARNING AND IMAGE PROCESSING FOR UNDERWATER IMAGING
Underwater imaging deals with detecting instances in an image or video and locating their positions in a particular frame. However, detecting and localizing underwater classes accurately is particularly challenging in low-resolution recorded images. Several authors have adapted convolutional neural network (CNN) models for developing smart UVSs and achieved satisfactory performance [10]-[14]. These models comprise a stack of distinct layers (convolutional layers, Rectified Linear Unit (ReLU) layers, pooling layers, and a fully connected layer) that transform the input volume into an output volume (e.g., holding the class scores) through a differentiable function. By contrast, most existing systems are designed for shallow waters or well-illuminated areas of underwater environments [7], [13], [14], where objects are visible.
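To make the layer stack described above concrete, a minimal sketch in PyTorch is shown below; the layer sizes are arbitrary placeholders and do not correspond to any model used in this study.

```python
import torch
import torch.nn as nn

# Minimal CNN of the kind described above: convolution, ReLU, pooling, and a
# fully connected layer mapping an input volume to class scores.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                   # ReLU activation layer
    nn.MaxPool2d(2),                             # pooling layer
    nn.Flatten(),
    nn.Linear(16 * 208 * 208, 5),                # fully connected layer -> 5 class scores
)

scores = model(torch.randn(1, 3, 416, 416))      # one 416 x 416 RGB image
print(scores.shape)                              # torch.Size([1, 5])
```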
To date, only a few researchers have investigated resolving the issues of low light intensity and murky water while improving image quality and processing time to produce a more accurate UVS. However, resolving these issues requires more computational capacity for object classification and algorithm processing. Therefore, this study aimed to develop a new optimized model using a deep-learning network architecture, focusing on the feature-extraction stage. In the proposed architecture, features are learned automatically from the input data, eliminating the requirement and engineering effort of hand-crafted feature selection and extraction.

C. UNDERWATER IMAGING USING THE DEEP-LEARNING METHOD
Currently, the most frequently used algorithm for object detection is the model known as You Only Look Once (YOLO), owing to its high efficacy and accuracy [15]-[18]. As a subset of the CNN model, YOLO employs a single forward propagation through a neural network to detect objects in real time, i.e., the entire image is predicted in a single algorithm run for training and validation [43]. This study used this CNN model to predict various class probabilities and bounding boxes simultaneously for two reasons. First, the latest versions of the YOLO-based algorithm, particularly version 3 or higher, are appropriate for improving object detection with more accurate positioning, faster speed, and more accurate classification. Second, comparative studies on these models for object detection in different underwater environments are not yet available.
However, the improvement of YOLO models is most significant through tuning the parameters of the model's optimizer. Tunable parameters that can improve YOLO performance include the input image size, number of epochs, batch size, learning rate, momentum, and activation function. Also, tuning hyper-parameters such as the learning rate and momentum during training would significantly reduce training time and improve the performance of the model [44]. Otherwise, poor hyper-parameter tuning can cause the model to underfit or overfit (e.g., through excessive regularization, an overly aggressive training speed, or instability). Thus, this study aimed to generate a robust YOLO model for underwater detection by improving its optimizer, learning rate, and momentum tuning. Among all the YOLO series, YOLOv5 was selected for model optimization for several reasons: (i) it is the most advanced target-detection algorithm, with two cross-stage partial (CSP) structures (CSP1_X and CSP2_X) [52] that are able to extract generic features, particularly in underwater images; (ii) it can adaptively change the depth and width of the network by changing parameters to suit its own data volume scale [53] (self-adaptation to small underwater objects); and (iii) it can guarantee good training results [54], with the highest detection accuracy, as shown in this study (Table 5).

D. YOLO VERSION 5 (YOLOv5)
The latest version of the YOLO family is YOLOv5, which is extended from YOLOv4 [34]. In general, the architectures of YOLOv5 and YOLOv4 are similar, especially in the backbone, neck, and head (Fig. 2). These two models use the same cross-stage partial (CSP) connection in their backbone to generate a rich gradient combination while reducing computational usage. CSP partitions the feature map of the base layer, splitting the gradient flow so that it propagates through different paths. Implementing the CSP connection in a deep network enhances the learning ability of the CNN and hence improves accuracy while keeping the network lightweight [35]. CSP also clears the computational bottleneck by distributing the computation uniformly across the layers of the CNN. Besides, CSP helps reduce memory costs by using cross-channel pooling to compress the feature maps during the generation of the feature pyramid in the model neck. These feature pyramids help identify identical objects of different sizes and scales. In this respect, YOLOv5 uses the path aggregation network (PANet) as the model neck, which is particularly beneficial for instance segmentation because it preserves spatial information accurately and thus helps locate pixels correctly. Additionally, PANet provides a bottom-up path using clean horizontal connections from the lower layers to the top ones (green dot) [36]. This path, called the "shortcut" connection, comprises only ten layers. In the head section, YOLOv5 uses a dense prediction similar to YOLOv4 and YOLOv3. The final prediction consists of a vector containing the coordinates of the predicted bounding box, its confidence score, and the class label. The output processing discards boxes with a low score (i.e., below the confidence threshold) and selects only one box out of several overlapping boxes that detect the same object.
Based on the architectural variation, YOLOv5 comprises several models, namely YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, with alterations in the depth and width of each model. However, this study used only two of them, namely YOLOv5s and YOLOv5m. In general, each model scales portions of the network independently using two tuned configurations, namely the depth multiple and the width multiple. The depth multiple represents the model's depth factor, while the width multiple constitutes the layer channel multiple used to scale up the backbone and feature network of the model.
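For reference, the depth and width multiples of the four variants can be summarized as below; the values are quoted from the public YOLOv5 model configurations as an illustration (an assumption drawn from the public release, not a result of this study), together with a hypothetical helper showing how they scale a block.

```python
# Indicative depth/width multiples associated with the public YOLOv5 release.
# depth_multiple scales the number of repeated blocks; width_multiple scales
# each layer's channel count.
YOLOV5_SCALES = {
    "yolov5s": {"depth_multiple": 0.33, "width_multiple": 0.50},
    "yolov5m": {"depth_multiple": 0.67, "width_multiple": 0.75},
    "yolov5l": {"depth_multiple": 1.00, "width_multiple": 1.00},
    "yolov5x": {"depth_multiple": 1.33, "width_multiple": 1.25},
}

def scaled(base_repeats, base_channels, variant):
    """Hypothetical helper: scale a block's repeat count and channel width."""
    cfg = YOLOV5_SCALES[variant]
    repeats = max(round(base_repeats * cfg["depth_multiple"]), 1)
    channels = int(base_channels * cfg["width_multiple"])
    return repeats, channels

print(scaled(9, 256, "yolov5s"))  # (3, 128): shallower and narrower
print(scaled(9, 256, "yolov5m"))  # (6, 192)
```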

III. THE PROPOSED METHOD
In this study, all YOLO models were trained, validated, and tested on the same platform through Jupyter notebooks in Google Colaboratory (Google Colab). This platform allows users to prototype machine-learning models on devices such as Graphical Processing Units (GPUs) and Tensor Processing Units (TPUs) [22]. The training and testing of all YOLO models were performed using an Nvidia Tesla T4 with 16 GB of Graphics Double Data Rate 6 (GDDR6) memory and Compute Unified Device Architecture (CUDA) support.
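For completeness, a quick way to confirm the GPU runtime before training in Colab is sketched below (an illustrative check, not the authors' script).

```python
import torch

# Report the CUDA device available to the Colab runtime (e.g. a Tesla T4).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(torch.cuda.get_device_name(0), f"{props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA device found; training will fall back to the CPU.")
```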

A. COMPARING AND SELECTING YOLO MODELS
This study compared YOLOv3, YOLOv4, and YOLOv5 over three open-source benchmark datasets and selected the best-performing model. The training and validation were performed using the default configuration of each model. Upon completion of training, each model was evaluated on the test dataset to assess its performance on a never-seen dataset. Table 1 shows the three public datasets used to train the YOLO architectures: the Open Image Dataset V6 [23], the Aquarium Dataset [24], and The Brackish Dataset [25]. The Open Image Dataset V6 was employed to investigate the detection capability of the YOLO models in non-underwater environments, in contrast to the underwater datasets. The complexity and diversity of these images provided an ample challenge for testing each YOLO architecture on different input image characteristics.
A non-underwater set of 2732, 341, and 342 images from the Open Image Dataset V6 was used for training, validation, and testing, respectively, while the underwater sets comprised 510, 64, and 64 images from the Aquarium Dataset and 11,615, 1451, and 1452 images from The Brackish Dataset. Each dataset was split by taking an image list as input and randomly partitioning it according to the provided percentages: 80% (0.8) training, 10% (0.1) validation, and 10% (0.1) testing. Before splitting, it was verified that the train, validation, and test ratios summed to 1. This separation allowed the performance of YOLOv3, YOLOv4, YOLOv5, and their sub-models, all structured and designed as multilayer CNNs, to be verified and compared. This performance was indicative of the behavior of the YOLO models, and the selected model was subsequently improved by tuning the optimizer's hyper-parameters. The YOLO algorithm was fed with 416 × 416 input images, running them through backbone blocks and layers that learned to extract statistical features for locating objects along with their labels. During the training, the batch size was set to 64, representing the number of samples/images propagated through the YOLO network before updating the model's internal parameters. Additionally, all three models used stochastic gradient descent (SGD) as the default optimizer. During the training, the weights of the YOLO model were tuned, since these parameters determine how strongly the input affects the output while the optimizer minimizes the loss function. Meanwhile, all YOLO models in this study used a similar head structure, i.e., the head of YOLOv3. In a single-stage detector, the head section performs dense prediction comprising the coordinates of the predicted bounding box, the prediction's confidence score, and the class label. In this study, the confidence threshold was set to 0.25 and the Intersection over Union (IoU) threshold to 0.5 for detection. The training was run for the same 65 epochs at an input image size of 416 × 416, a batch size of 64, and the default SGD optimizer; each forward- and backward-pass cycle corresponded to one update of the algorithm's parameters. Table 2 summarizes the configuration of all training parameters.
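A minimal sketch of the 80/10/10 splitting step described above is given here; the file names and random seed are placeholders.

```python
import random

def split_dataset(image_paths, train=0.8, valid=0.1, test=0.1, seed=0):
    """Randomly split an image list into train/validation/test subsets."""
    assert abs(train + valid + test - 1.0) < 1e-9, "ratios must sum to 1"
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train, n_valid = int(train * len(paths)), int(valid * len(paths))
    return (paths[:n_train],
            paths[n_train:n_train + n_valid],
            paths[n_train + n_valid:])

# e.g. the Aquarium Dataset's 638 images give roughly the 510/64/64 split above
train_set, valid_set, test_set = split_dataset([f"img_{i}.jpg" for i in range(638)])
print(len(train_set), len(valid_set), len(test_set))
```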
During the training, validation, and testing, the performance of each YOLO model was assessed through evaluation metrics, with detections evaluated against the ground truth. Each frame of the tested images was scored in terms of true positives (TP), false positives (FP), and false negatives (FN); these counts determine two performance parameters, precision and recall, as denoted in Equations 1 and 2, respectively [25].
The precision represents the usefulness of the detection; a high precision indicates that the trained model returns truly detected objects rather than falsely detected ones, while the recall defines the fraction of true objects that the model returns [25]. A high precision corresponds to a small number of FPs, while a high recall corresponds to a small number of FNs. Therefore, higher precision and recall percentages indicate that a model performs better [22]. The model's performance was also evaluated using the F1-score given by Equation 3 below [45].
This F1-score represents the harmonic mean of precision and recall, and a higher F1-score shows a better performance. All models were then evaluated with the mean average precision (mAP) using Equation 4 below [25], [45].
Averaging the average precision (AP) over all classes in the trained model yields the mAP; a well-performing model tends to produce higher accuracy. Finally, the processing rate was calculated in frames per second (FPS) with Equation 5 below [45] to evaluate the model's speed in processing input for real-time applications.
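Since Equations 1 to 5 reduce to simple ratios of these counts, they can be summarized in a short illustrative sketch; the counts and timings below are placeholders, not results from this study.

```python
def precision(tp, fp):
    return tp / (tp + fp)            # Equation 1

def recall(tp, fn):
    return tp / (tp + fn)            # Equation 2

def f1_score(p, r):
    return 2 * p * r / (p + r)       # Equation 3: harmonic mean

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)   # Equation 4

def frames_per_second(num_frames, total_time_s):
    return num_frames / total_time_s               # Equation 5

p, r = precision(tp=90, fp=5), recall(tp=90, fn=10)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))   # 0.947 0.9 0.923
print(round(frames_per_second(1000, 9.4), 1))               # 106.4
```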

FPS = \frac{\text{Number of Frames}}{\text{Total Detection Time (s)}} \qquad (5)

In general, the higher the FPS, the faster the model detects objects. After all training runs, the most robust model was selected through this comparative performance evaluation and then improved by tuning its optimizer to increase precision.

B. TUNING THE HYPER-PARAMETERS OF LEARNING RATE AND MOMENTUM
Fig. 3 shows the proposed workflow for improving the selected model. In this proposed method, the YOLO model was optimized through hyper-parameter tuning. The selected YOLO model was improved via the optimizer algorithm, focusing on the learning rate and momentum, since both contribute to detection accuracy and processing speed. The training hyper-parameter, i.e., the optimizer algorithm, was used to reduce the losses of the regression problem while providing estimates that are as accurate as possible. Since YOLO is a CNN-based neural network that transmits the output error of a hidden layer to each neuron through backward propagation, it updates the connection weights of each neuron iteratively [24]. Updating the weights to reduce the error was performed through a gradient-descent-type optimiser algorithm. By default, the optimiser in the YOLO model is SGD with momentum. SGD computes the gradient of the cost function with respect to the parameters θ for the entire training dataset, as given in Equation 6 below [46]. SGD minimises the objective function J(θ), parameterised by the model's parameters θ ∈ R^d, and updates the parameters in the opposite direction of the gradient of the objective function, ∇_θ J(θ) [46]. The learning rate, η, determines the size of the steps taken to reach a local minimum. Equation 7 below gives the SGD update for each training example x_i and label y_i [46].
Equation 8 shows that adding a fraction γ of the past time step's update vector, v_{t-1}, to the current update vector accelerates SGD in the relevant directions while dampening oscillations [46]. This modification gives the update in Equation 9 [46].
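The equations themselves are not reproduced in this text; their standard forms, consistent with the descriptions above and with [46], are reconstructed here using the notation defined in the text.

```latex
% Eq. (6): batch gradient descent
\theta = \theta - \eta \cdot \nabla_\theta J(\theta)
% Eq. (7): SGD update for one training example (x_i, y_i)
\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x_i, y_i)
% Eqs. (8)-(9): SGD with momentum
v_t = \gamma \, v_{t-1} + \eta \, \nabla_\theta J(\theta), \qquad
\theta = \theta - v_t
```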
An extension of SGD is adaptive moment estimation (ADAM) [6], which computes exponentially decaying averages of past gradients, m_t, and of past squared gradients, v_t, as shown in Equations 10 and 11, respectively [46].
The estimates m_t and v_t yield the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively. These estimates update the parameters according to the ADAM rule in Equation 12 [46]. Here, β_1 represents the exponential decay rate for the first-moment estimate, β_2 the exponential decay rate for the second-moment estimate, and g the gradient on the current mini-batch [46]. The ADAM optimiser proposes default values of 0.9 for β_1, 0.999 for β_2, and 10^{-8} for ε.
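The corresponding ADAM equations, again reconstructed in their standard form consistent with the description and with [46]:

```latex
% Eqs. (10)-(11): decaying averages of past and past squared gradients
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
% bias-corrected estimates
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
% Eq. (12): ADAM parameter update
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t
```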
Focusing on optimising the model, this study compared two optimisers for training the selected YOLO model (YOLOv5). The default SGD optimiser was compared with the implemented ADAM optimiser. The first step of the ADAM implementation was to tune the best-fit parameters for model training. During the training, the learning rate and momentum affected the behaviour of ADAM, so the tuning involved training with different values of both parameters. A range of learning rates and momentum values was tested with ADAM to determine the combination giving the best training performance for the selected YOLO model. The learning rate ranged between 10^{-6} and 1.0 [37], while common momentum values used in practice were 0.5, 0.9, and 0.99 [38]. Based on these references, this study employed combinations of learning rate and momentum, with the parameters named in alphabetical order using lowercase and uppercase letters (Table 3). The YOLOv5s with ADAM and all combinations of hyper-parameters was trained using the same configuration except for the optimiser, learning rate, and momentum. Besides, the ADAM configuration zZ was compared with the SGD default values to study the effect of the same learning rate and momentum on different optimisers. Each experimental model was trained using the Brackish dataset. After training with all the varied hyper-parameters, the training performances were compared. This step was essential for choosing the best-fit optimiser combination for the selected YOLO model and for the subsequent improvement of the model.
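A minimal sketch of how the compared optimiser configurations map onto PyTorch is given below; the model is a placeholder, and mapping the momentum hyper-parameter to ADAM's β1 is an assumption about the public YOLOv5 implementation, not a statement from this paper.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the YOLOv5s network parameters

# Default configuration: SGD with momentum (learning rate 0.01, momentum 0.937)
sgd_default = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)

# Configuration bA from Table 3: ADAM with learning rate 1e-4 and momentum 0.9
adam_bA = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Configuration zZ: ADAM with the SGD default learning rate and momentum
adam_zZ = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.937, 0.999))
```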

C. HYPER-PARAMETER TUNING OF THE LEARNING RATE ON A PLATEAU
Besides using ADAM as an optimizer to adapt the learning rate for each weight, this study improved the behaviour of the learning rate at each epoch during training using the reduce-learning-rate-on-plateau technique. This technique improved model accuracy by monitoring the loss during training and descending into areas of lower loss: the learning rate was reduced if the loss stagnated for several epochs. The term plateau indicates the point at which the change in the loss across training iterations falls below the threshold, θ; in short, the curve of loss against epoch becomes flat and stops improving. Since a specific parameter setting was not yet available for the YOLOv5 model, the implementation in this study used the ReduceLROnPlateau module in PyTorch. First, the training parameter mode was set to minimum (min) to reduce the initial learning rate (LR_1) once the loss stopped decreasing. Secondly, the factor (by which the learning rate is reduced) determined the new learning rate (LR_new), as given by Equation 13 below [47].
Thirdly, the patience parameter reduced the learning rate when the model showed no improvement for eight consecutive epochs. Finally, the threshold measured the new optimum, focusing only on remarkable changes. Table 4 summarises the value for each parameter.

IV. RESULTS AND DISCUSSION
In general, a precision-recall (PR) curve interprets the performance of an object detector by representing the trade-off between the two metrics (precision and recall) across different confidence thresholds [42]. From this trade-off, an object-detector model is robust in locating correct bounding boxes if its precision stays high as its recall increases, as shown by the area under the curve. Since detecting underwater animals requires high precision and recall, the area under the PR curve needs to be as large as possible. In this study, the PR curve was plotted using the validation dataset.
Fig. 4 shows the test images detected by the YOLO models on the Open Image V6 dataset. These images contained three multiscale ground-truth objects to evaluate the capability of each YOLO model to detect far-away and small-scale traffic signs. YOLOv3, Tiny-YOLOv3, and YOLOv5m detected two bounding boxes, while YOLOv4 and Tiny-YOLOv4 could not detect the far-end traffic signs. Thus, YOLOv5s was the only model that detected all three bounding boxes, contributing TP values that improved the model's precision.
Fig. 5 shows the detection output on the testing set of the Aquarium Dataset. YOLOv3, YOLOv4, and YOLOv5s detected all the fishes and stingrays of varying sizes in the image, demonstrating that they can detect targets at different scales. In addition, all three models were able to handle the complexity of differentiating the stingray from the background. Among the YOLO models that detected the underwater animal (stingray), YOLOv4 showed the highest detection confidence (0.99), followed by YOLOv5 (0.90) and YOLOv3 (0.81).
Fig. 6 shows the detection outputs of the YOLO models on the Brackish Dataset. In general, the tiny models were inefficient in detecting the two crabs; Tiny-YOLOv3 located three bounding boxes, and Tiny-YOLOv4 uncovered just one. Tiny-YOLOv3 thus introduced an FP and Tiny-YOLOv4 an FN, reducing the precision of Tiny-YOLOv3 and the recall of Tiny-YOLOv4. In terms of mAP, YOLOv3 reached 96.4%, Tiny-YOLOv4 88.6%, and Tiny-YOLOv3 87.2%; even the shallow network models, Tiny-YOLOv3 and Tiny-YOLOv4, achieved mAP values above 87%. Despite the blurred images of the Brackish Dataset, all models correctly captured the semantic representation, pixel by pixel, when detecting the objects.

A. PERFORMANCE OF YOLO MODELS
In the Open Image Dataset V6, the YOLOv5s model recorded the highest FPS, i.e., 125.0, on the Tesla T4 GPU, while YOLOv4 was the lowest at 46.6 on The Brackish Dataset. YOLOv5s also outperformed the other YOLO models in all datasets with an excellent execution speed, whereas the other primary models recorded lower FPS values (54, 57.7, and 46.6, respectively). In general, the tiny models had smaller weight sizes than their primary counterparts: on average, the weight size of Tiny-YOLOv3 was 86% smaller than that of YOLOv3, while Tiny-YOLOv4 was 90.8% smaller than YOLOv4. The tiny models also performed better in FPS, i.e., more than 20% faster than the primary models. Thus, FPS and weight size appeared correlated: the smaller the weight size, the faster the execution speed of a model.

B. TRAINING EFFICIENCY BASED ON THE TUNING OF HYPER-PARAMETERS ON THE LEARNING RATE AND MOMENTUM
YOLOv5s was selected as the model for further development based on its performance. Model optimization was based on the Brackish Dataset only, since this study aimed to improve underwater animal detection in low-light and murky environments. Fig. 7 compares the training progress of the ADAM and SGD optimizers. The SGD-based model, YOLOv5s-SGD, outperformed YOLOv5s-zZ with a faster convergence speed. Besides, at the final epoch, YOLOv5s-zZ did not yield consistent performance, with its mAP barely exceeding 0.9. Such fluctuating performance was due to inadequate generalization in YOLOv5s-zZ and came with an increase in training time. Thus, learning rate and momentum values that fit one model well did not apply to a model with a different optimizer. This behavior indicated that the SGD and ADAM optimizers performed differently, even when set to the same learning rate and momentum values.
Fig. 8 shows the tuning of the learning rate and momentum of each YOLOv5s model. Among the four trained models, YOLOv5s-aA achieved nearly the same performance as YOLOv5s-SGD towards the end, with less fluctuation from the 20th epoch onward. Meanwhile, YOLOv5s-aC and YOLOv5s-aD converged more slowly, indicating that these learning rate and momentum values would need additional epochs to generalize better. Meanwhile, Fig. 9 shows that YOLOv5s-bA, trained with a learning rate of 0.0001 and a momentum of 0.9, performed excellently; its performance was comparable to YOLOv5s-SGD from the 34th epoch to the last. This finding became the focal point for tuning the ADAM optimizer in this study because it represented a workable combination of learning rate and momentum. Throughout the training, this combination consistently gave a smoother convergence than YOLOv5s-SGD, which fluctuated at earlier epochs. Also, YOLOv5s-bB, with a momentum of 0.99, yielded a well-trained model towards the final epochs, despite a slower convergence speed up to the 50th epoch.
Fig. 10 shows that all four YOLOv5s models with a learning rate of 0.00001 struggled to converge, and their mAP values remained below 80%. For example, YOLOv5s-cA reached only 75% mAP at the final epoch. Besides, all these models experienced under-fitting, with lower mAP values even as the number of training epochs increased. This result indicated that all these models would require more learning iterations and additional training time to reach optimal performance. Also, Fig. 11 shows that a learning rate of 0.000001 with momentum variations of 0.9, 0.99, 0.999, and 0.9999 for the ADAM optimizer yielded poor convergence for all models, i.e., these values were incompatible.
Table 6 tabulates the testing performance of all the trained YOLOv5s models with different combinations of learning rate and momentum. YOLOv5s-zZ yielded a lower mAP value (91.5%) than YOLOv5s-SGD (97.7%) despite the two models having the same learning rate and momentum. YOLOv5s-zZ struggled to generate a fast and smooth convergence, leading to lower performance during testing; hence, it generalized poorly. Meanwhile, YOLOv5s-aA and YOLOv5s-bA, with mAP values of 97.7% and 97.6%, respectively, performed nearly as well as YOLOv5s-SGD. These two models replicated the best training performance at optimum learning rates of 0.001 and 0.0001, respectively, with a momentum of 0.9.
In comparison, when trained with a learning rate of 0.000001 and a momentum of 0.9999, YOLOv5s-dD yielded only 21.5% mAP, the worst performance on the never-seen dataset. Overall, YOLOv5s-bA showed outstanding training mAP together with a smoother training curve and faster convergence than the other ADAM-based models and even the default YOLOv5s. The smoother training was probably due to the ADAM algorithm's capability to adapt the gradient descent after each iteration, keeping the updates controlled and unbiased throughout training. Consequently, it could efficiently process huge input data, such as the large image samples required for developing an underwater animal detector. Fig. 12 shows the effect of the learning rate on the testing mAP at a fixed momentum value. In general, the mAP performance of YOLOv5s rose or fell with the choice of learning rate.
Meanwhile, Fig. 13 shows that a momentum of 0.9 yielded the best and optimum testing value for the ADAM-based YOLOv5s model. Also, at the same learning rate, increasing the momentum reduced the mAP performance. Each momentum value changes the steps taken towards the minimum by incorporating previous updates into the current one; the resulting step size depends on how momentum in ADAM adjusts the update direction for a fast descent towards the minimum point [28], [47].

C. TRAINING EFFICIENCY BASED ON THE REDUCTION OF LEARNING RATE ON A PLATEAU
The YOLOv5s-bA model with a reduced learning rate on a plateau was designated YOLOv5s-bA-LRP. Fig. 14 compares the training curves of YOLOv5s-SGD, YOLOv5s-bA, and YOLOv5s-bA-LRP. In the early epochs, from the 8th to the 21st, the YOLOv5s-bA-LRP curve showed better feature-extraction performance than YOLOv5s-SGD and YOLOv5s-bA; for example, it was 3.9% better than the other two models at the 9th epoch and 3.6% higher at the 17th epoch. Since better training performance was produced at earlier epochs, YOLOv5s-bA-LRP converged faster than YOLOv5s-bA. Then, from the 18th epoch to the 26th epoch, the curve plateaued, triggering the reduce-learning-rate-on-plateau function for the YOLOv5s-bA-LRP model. Consequently, at the 26th epoch, the learning rate was reduced by a ratio of 0.1. YOLOv5s-bA-LRP then yielded the best performance, with a 0.41% increment upon reaching the 27th epoch and a further 0.51% increase at the 28th epoch. The improvement then stayed below the threshold for eight consecutive epochs, activating the function again at the 36th epoch.
Hence, mAP at the 38th epoch increased by 0.35%. Subsequently, the auto-tuning of the reduce-learning-rate-on-plateau function was triggered again at the 53rd epoch, as the YOLOv5s-bA-LRP model reached the eight-epoch patience limit owing to its below-threshold improvement in mAP. Finally, Fig. 15 shows the performance of YOLOv5s-SGD, comprising the curves of all classes, with the overall curve calculated as the average mAP over all classes. YOLOv5s-bA-LRP improved from 0.9779 to 0.9844 at the 55th epoch, a 0.69% improvement in mAP, indicating that by reducing the learning rate, the network took smaller steps that continued to develop the learning progress.
Also, Fig. 16 shows a similar pattern in the PR performance of YOLOv5s-bA, with a larger area under the curve, denoting excellent detection by YOLOv5s-bA. Besides, this model yielded a mAP value similar to YOLOv5s-SGD, with a slight increment of 0.001, showing that YOLOv5s-bA could reach precision and recall as high as YOLOv5s-SGD. However, Fig. 17 shows that, among all tested models, YOLOv5s-dD had the lowest area under the PR curve: it could achieve a satisfying recall exceeding 0.8 but hardly reached 0.7 in precision. In other words, although the YOLOv5s-dD model detected most of the positive samples correctly, it generated many FPs. Therefore, the low area under the curve led to a low mAP of only 0.214 over all classes.
Fig. 18 compares the testing mAP of the improved YOLOv5s-bA-LRP model, YOLOv5s-SGD, and YOLOv5s-bA. Overall, YOLOv5s-bA-LRP outperformed the other two models, attaining the highest mAP of 98.6%. Although the mAP value of YOLOv5s-bA-LRP was just 1% higher than those of YOLOv5s-bA and YOLOv5s-SGD, combining the ADAM optimizer with the reduce-learning-rate-on-plateau scheme was sufficient to improve the model performance. Fig. 19 shows the final validation of the optimised model through output detections on the test dataset, confirming that YOLOv5s-bA-LRP located bounding boxes in the challenging underwater environment. Specifically, the multiscale detection further reinforced the efficiency of its head section and the competency of its backbone in learning feature complexity, i.e., differentiating underwater animals from the background.
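For reference, the scheduling scheme described in Section III-C maps directly onto PyTorch's ReduceLROnPlateau; a minimal, illustrative configuration using the settings reported above (mode 'min', factor 0.1, patience 8) is sketched below. The model, optimizer state, and validation routine are placeholders, not the authors' code.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(10, 2)                                   # stand-in for YOLOv5s
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,  # configuration bA
                             betas=(0.9, 0.999))
# On a plateau: LR_new = LR * factor (Equation 13), applied after `patience`
# epochs without sufficient improvement of the monitored loss.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=8)

def validate(epoch):
    return 1.0 / (epoch + 1)        # hypothetical placeholder validation loss

for epoch in range(65):
    val_loss = validate(epoch)
    scheduler.step(val_loss)        # reduces the learning rate when loss plateaus
```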

V. CONCLUSION
This study improved the performance of YOLO models by optimizing the tuning of the learning rate and momentum in the optimizer algorithm. Among all compared models, YOLOv5s produced the highest mAP at 97.7% with an FPS of 106.4, marking it as an outstanding model for detecting underwater objects in blurred images. Furthermore, the improved model, presented as YOLOv5s-bA, consistently yielded a smooth training curve and faster convergence throughout the training phase, with optimized parameters of a 0.0001 learning rate and 0.9 momentum producing a mAP of 97.6%. However, the superiority of the ADAM-based improved model alone was marginal, since the default SGD optimizer (learning rate of 0.01 and momentum of 0.937) produced a close mAP of 97.7%. Thus, implementing the reduce-learning-rate-on-plateau function in the improved YOLOv5s (named YOLOv5s-bA-LRP) facilitated tuning of the learning rate so that the model could descend into areas of lower loss. YOLOv5s-bA-LRP improved to 98.6% mAP at the 55th epoch, indicating that by scaling down the learning rate over time, YOLOv5s converged better and produced a high performance rate for underwater detection. Since the learning rate is not a one-size-fits-all parameter, a scheme that cuts the step size by a constant ratio in the absence of progress may still be difficult to apply when building other deep-learning models. Overall, hyper-parameter tuning of the YOLOv5s optimizer together with reducing the learning rate on a plateau enhanced the training and optimized the underwater detection model more effectively.