DAMNet: Dual Attention Mechanism Deep Neural Network for Underwater Biological Image Classification

The complex backgrounds and biodiversity of underwater biological images make the identification of marine organisms difficult. To solve these problems, we propose a dual attention mechanism deep neural network for underwater biological image classification (DAMNet). Firstly, the proposed DAMNet uses multi-stage stacking to suppress the complex underwater background; the stacking also reduces the number of model parameters and improves generalization ability. Secondly, the dual attention mechanism module is combined with an improved inverted residual bottleneck based on depthwise convolution to extract feature information of underwater biological images along the spatial and channel dimensions, giving better discrimination and feature extraction capability. Finally, the gravity optimizer is selected to update the model weights, and its exponential moving average improves the model's convergence speed and learning rate. Extensive experiments on a dataset of seven types of underwater biological images demonstrate that the DAMNet model has higher learning ability and robustness than state-of-the-art methods. Our DAMNet model achieves 96.93% classification accuracy across all categories, an improvement of at least 2 percentage points over the other models.


I. INTRODUCTION
Nowadays, underwater images are increasingly studied in image enhancement, restoration, and classification [1]. Understanding marine organisms and their abundance distribution is important for marine ecosystems, environmental monitoring, and fisheries [2], [3]. Due to the scattering and absorption of light, underwater images suffer from image blur [4], low contrast [5], and blurred details [6], [7], [8], [9], [10], and the diversity of underwater image backgrounds is also an important influencing factor for classification. Therefore, classifying marine organisms is a very challenging and critical research topic.
The associate editor coordinating the review of this manuscript and approving it for publication was Fan Zhang.
Underwater biometric identification is gradually being applied in aquaculture, for example in automatic fishing and underwater surveys, and is of great significance for fisheries and seabed detection. Much research addresses underwater image classification, but few effective underwater image recognition models exist. Existing image classification methods remain largely unsatisfactory for underwater images, because the diversity of backgrounds and the variable image quality make it hard to extract salient features [11], [12], [13].
Underwater images have complex backgrounds and certain feature similarities among different species, so how to extract valuable information from the complex images is the key to solving the problem. For the underwater species classification problem, this paper proposes a DAMNet model, which uses a dual-attention mechanism to extract rich image feature information and uses a gravity optimization algorithm to improve the accuracy of the model. Experiments show higher model training accuracy than common attention mechanisms and optimizers. The main contributions of this paper are as follows.
• In this paper, we propose a method for classifying images of underwater organisms and conduct many comparative experiments on images of different underwater organisms in different environments. Compared with other methods, our DAMNet model has better classification ability.
• We use a multi-stage stacking method to reduce the number of feature parameters by stacking modules, and update the weight information through the extraction of underwater features and the change of parameter gradients. The gravity optimization algorithm uses exponential moving averages, which can effectively improve the classification performance of the model.
• A self-attention mechanism is added to the model, and a classification module combining channel and spatial attention gives the model better feature description ability. By further using a residual network and global variable features, a better underwater image feature extraction method is proposed, which improves the classification accuracy and practical application value.
In what follows, Section II briefly reviews the classification methods for underwater images. Section III presents the overall framework and methodology of the model. Section IV provides a comparative analysis of different models and ablation studies. Finally, Section V gives the conclusion and outlook of this paper.

II. RELATED WORKS
In this section, we summarize the related work from two aspects, including machine learning methods and deep learning methods.

A. MACHINE LEARNING METHODS
Classification methods based on machine learning extract image features semi-automatically, which requires manual definition and verification of functions. They use algorithms to convert images into data from which the desired target features are extracted. Sun et al. [14] used a Support Vector Machine (SVM) as the final classification algorithm for their low-resolution underwater image classification model. Iswari et al. [15] used K-nearest neighbors (KNN) as a classification algorithm for color summarization of fish images and achieved 91.36% classification accuracy. Deep and Dash [16] used KNN and SVM combined with convolutional neural networks (CNN) to classify underwater fish species and obtained better results than the compared algorithms. Zhang et al. [17] classified eight different fish species based on a genetic algorithm feature construction method and a histogram of oriented gradients, obtaining an average accuracy of 98.9%. Cheng et al. [18] showed that combining support vector machines for the classification of underwater plankton images can improve the accuracy and recall of the model, with the best result reaching 94%. Wang et al. [19] compared machine learning and deep learning and experimentally showed that machine learning has better results on small sample datasets. Salman et al. [20] combined machine learning processing methods to classify 15 and 10 underwater fish species and reported more than 90% accuracy. Khishe and Mosavi [21] compared an artificial neural network (ANN) trained with the chimp optimization algorithm against the ion motion algorithm and demonstrated the reliability of the algorithm. Almero et al. [22] used a hybrid model composed of hidden layers in an ANN and achieved an accuracy of 93.6% in fish detection. Machine learning has good fault tolerance and computational power, and convolutional neural networks have shorter training times and higher accuracy rates. Still, manually extracted features are not rich, have weak generalization ability, and are inefficient.

B. DEEP LEARNING METHODS
Deep learning methods process images by extracting features automatically, emphasizing feature learning and feature extraction and transformation in multi-layer networks, which gives deep learning excellent feature expression capabilities. Mahmood et al. [23] used residual networks to extract new image features from different convolutional layers and combined them to obtain more compact and powerful deep features. Multi-channel convolution has also made outstanding contributions to image classification, recovery, and enhancement [24]. Xu et al. [25] improved the classification performance of deep convolutional neural networks on underwater images by optically transforming images and adding generative adversarial networks. Qi et al. [26] proposed a semantic region enhancement module that senses the degradation of different semantic regions at multiple scales and feeds it back to the global attention features extracted at their original scales to enhance underwater images. Li et al. [27] proposed an underwater image enhancement network with multi-color space embedding guided by medium transmission, coupled with an attention mechanism in which the most discriminative features extracted from multiple color spaces are adaptively integrated and highlighted. Xie et al. [28] normalized the total variation term and the sparse prior knowledge of the blur kernel, estimating the blur kernel from coarse to fine by varying the image resolution to avoid local minima, and validated the superiority of the method relative to other state-of-the-art algorithms. Li et al. [29] surveyed deep learning methods applied to underwater image enhancement and evaluated the existing methods qualitatively and quantitatively.
VOLUME 11, 2023
FIGURE 1. Flowchart of DAMNet model. Our model consists of a dual attention mechanism module and a stacking module. In our approach, we pool and convolve image features, input them into the dual-attention mechanism module to extract feature information, and filter the important features through the stacking of modules. In the model, the redundancy of feature information is reduced at each stage using residual connections, and all convolutional layers are of size 3 × 3 and 1 × 1.
Jaeger et al. [30] presented a Croatian fish dataset containing 12 fish species for fine-grained visual classification of fish, with an accuracy rate of 66.78%. Villon et al. [31] proposed a method to assist in the recognition of fish species, using a CNN whose fish recognition rate reached 94.9%, which is higher than the human recognition rate. Balakrishnan et al. [32] used transfer learning to classify underwater images, combined it with data augmentation, and achieved an accuracy of 86%. Aridos et al. [33] proposed a deep underwater image classification algorithm for turbid underwater images, which provides better classification accuracy. Fu et al. [34] used a graph CNN model to represent features in three aspects (local space, global space, and channel correlation), with better accuracy on real datasets. Yang et al. [35] proposed an initial attention network that outperformed other networks in distinguishing underwater images from images in non-underwater environments, with 99.3% classification accuracy. Paraschiv et al. [36] used a lightweight CNN model to classify underwater fish, which improved the accuracy by 7% compared with a network model with many parameters. Mathur and Goel [37] proposed a transfer learning based method for underwater image classification, which improved the classification accuracy by training only the last few layers of the network and achieved 98.44% and 84.92% accuracy on large and small datasets, respectively. Jiang et al. [38] proposed an improved deep convolutional generative adversarial network model to classify underwater objects and add data for small-sample objects; combining CNN models with a lightweight neural network achieves a good trade-off between model complexity and classification accuracy. Deep learning methods have high generalization ability and robustness but require large data volumes and long training times.

III. PROPOSED METHOD
We show the flowchart of our proposed DAMNet model in Fig 1. It includes the following steps: 1) multi-stage stacking, 2) dual-attention module, and 3) optimizer. Specifically, the first step uses multi-stage stacking to reduce the model's number of parameters and improve the model's generalization ability by combining and stacking between different modules and attention. The second step adds a dual-attention mechanism module to extract saliency information of images in space and channels to improve the model's classification accuracy. Finally, it uses a gravity optimization algorithm, a gradient-based optimization algorithm, to reduce the loss of the deep learning model and improve its training results.

A. MULTI-STAGE STACKING
The training images are 224 × 224, so each contains many pixels. If relative attention is applied directly after convolution and attention are combined, the large number of pixels leads to heavy computation, high cost, and slow speed. The number of features is therefore reduced first, without affecting the important features, and relative attention is applied afterwards.
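To make the cost argument concrete, the following Python sketch counts the pairwise attention scores needed when global self-attention is applied at different spatial resolutions. The repeated halving of the side length and the helper names `attention_pairs` and `downsampling_schedule` are illustrative assumptions, not the exact DAMNet configuration.

```python
def attention_pairs(height: int, width: int) -> int:
    """Number of pairwise scores in global self-attention over an H x W map.

    Each spatial position is treated as one token and every token attends to
    every token, so the score matrix has (H * W) ** 2 entries.
    """
    tokens = height * width
    return tokens * tokens


def downsampling_schedule(size: int, halvings: int) -> list:
    """Side lengths obtained by halving a square map `halvings` times."""
    sizes = [size]
    for _ in range(halvings):
        sizes.append(sizes[-1] // 2)
    return sizes


if __name__ == "__main__":
    for side in downsampling_schedule(224, 4):
        pairs = attention_pairs(side, side)
        print(f"{side:>3} x {side:<3} -> {pairs:,} attention scores")
```

Under these assumptions, each halving of the side length divides the number of attention scores by 16, which is why attention is only affordable after the feature map has been reduced.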
The model adopts a multi-stage layout, and each stage uses max pooling to gradually reduce the spatial size and increase the number of channels, so that residual connections can be used. The model is divided into five parts: $S_0$, $S_1$, $S_2$, $S_3$, and $S_4$. The first part ($S_0$) extracts features and increases the number of channels, and is defined as:
$$S_0 = \mathrm{Conv}_3(\mathrm{Conv}_3(x))$$
where $\mathrm{Conv}_3$ is a 3 × 3 convolutional layer, applied twice, and $x$ is the input of the features.
The second ($S_1$) and third ($S_2$) parts combine convolution and self-attention to improve the model's accuracy. In their definition, the input $x_1$ is $S_0$ for $S_1$ and $S_0 \oplus S_1$ for $S_2$, $\mathrm{Conv}_1$ stands for 1 × 1 convolution, Lin denotes a fully connected layer, Avg denotes average pooling, and Max denotes maximum pooling.
The fourth ($S_3$) and fifth ($S_4$) parts are Transformer blocks, which rely purely on the attention mechanism and improve the efficiency of the data and the model by directly stacking attention-based Transformer blocks. In their definition, $y_{TF}$ denotes a Transformer block, consisting of multi-head attention and a position-wise fully connected feed-forward network (FFN); ∗5 or ∗2 stands for stacking this module 5 times or 2 times (the stacking schematic is shown in Fig 2); and the inputs $x_2$ and $x_3$ are $S_1 \oplus S_2$ and $S_2 \oplus S_3$, respectively. Transformer blocks are used late in the model to balance efficiency and performance.
Through the model's stage layout, many parameters are removed so that the features reach a manageable level, which speeds up training and helps improve the model's generalization ability when the training data is limited. When the data features are rich, the stacking of modules increases the model's capacity and improves the visual processing capability to a certain extent. High capacity and good generalization ability can effectively improve the classification performance of the model.

B. DUAL ATTENTION MODULE
This paper introduces an attention mechanism to reduce the feature parameters extracted by the model. This strategy selectively focuses on regions where valuable features are closely related and ignores unimportant feature information. We use Depthwise Convolution and the Convolutional Block Attention Module (CBAM) to form the DBAConv module, which can compute and extract features without changing the number of channels and greatly reduces the number of parameters. The CBAM module structure diagram is shown in Fig 3. In Depthwise Convolution, each channel is convolved by only one convolution kernel, and the number of feature channels does not change. For each dimension, the output is the sum of the weighted values in a predefined receptive field, defined as:
$$y_n = \sum_{m \in \eta(n)} z_{n-m} \odot x_m$$
where $x_n$ and $y_n$ are the input and output of position $n$, respectively, $\eta(n)$ is the local neighborhood of $n$, and $z_{n-m}$ is the weight matrix of position $(n - m)$.
FIGURE 3. CBAM structure diagram. The CBAM module consists of two modules, spatial attention and channel attention, which infer feature maps along two dimensions, channel and spatial, and then multiply them with the original map for adaptive feature refinement, respectively.
The CBAM attention module is a hybrid attention module consisting of channel attention (CAM) and spatial attention (SAM) modules, and the structure of the CAM is shown in Fig 4. In the figure, ⊗ denotes multiplication with the previous input, W and H represent the width and height of the feature map, C is the number of channels, and the input feature map F has size W × H × C. For channel attention, the input first passes through adaptive average pooling (Avg) and adaptive maximum pooling (Max), each of which compresses it into a 1 × 1 × C feature map. Secondly, each pooled result is fed into a two-layer MLP neural network with two activation functions, whose output is 1 × 1 × C. This gives a one-dimensional vector whose length equals the number of channels, from which the correlation between channels is learned and the channel attention is obtained. Finally, the features computed by the module are summed and passed through a Sigmoid activation, and the resulting channel attention ($M_C$) is expressed as:
$$M_C(x) = \mathrm{Sigmoid}\big(\mathrm{MLP}(\mathrm{Avg}(x)) + \mathrm{MLP}(\mathrm{Max}(x))\big)$$
where MLP is the two-layer neural network, $x$ is the input, the number of neurons in the first MLP layer is $N/n$ ($n$ is the scaling rate, set to 16), and the number of neurons in the second MLP layer is $N$.
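The channel attention described above can be sketched in plain Python on a small nested-list feature map. This is a minimal illustration rather than the authors' implementation: the weight matrices `w1` and `w2`, the ReLU in the hidden layer, and the function names are assumptions, and a real model would use a tensor library.

```python
import math
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [channel][row][col]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def shared_mlp(vec: List[float], w1: List[List[float]],
               w2: List[List[float]]) -> List[float]:
    """Two-layer MLP: C -> C/n (ReLU, assumed) -> C.

    w1 has shape (C/n) x C and w2 has shape C x (C/n); biases are omitted.
    """
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(x: Tensor3D, w1: List[List[float]],
                      w2: List[List[float]]) -> List[float]:
    """Return one weight in (0, 1) per channel: Sigmoid(MLP(Avg) + MLP(Max))."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    mx = [max(map(max, ch)) for ch in x]
    from_avg = shared_mlp(avg, w1, w2)
    from_max = shared_mlp(mx, w1, w2)
    return [sigmoid(a + m) for a, m in zip(from_avg, from_max)]


def apply_channel_attention(x: Tensor3D, weights: List[float]) -> Tensor3D:
    """Scale every value of a channel by that channel's attention weight."""
    return [[[wt * v for v in row] for row in ch] for ch, wt in zip(x, weights)]


if __name__ == "__main__":
    features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
    w1 = [[0.5, -0.25]]    # hypothetical weights, C = 2 reduced to 1
    w2 = [[1.0], [-1.0]]   # hypothetical weights, 1 expanded back to C = 2
    weights = channel_attention(features, w1, w2)
    print(weights)
    print(apply_channel_attention(features, weights))
```

The same MLP weights are shared between the average-pooled and max-pooled descriptors, so the module adds only two small weight matrices per block.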
Unlike channel attention, spatial attention focuses on where the effective information lies on the feature map; the SAM is shown in Fig 5. To compute spatial attention, the output of channel attention is first average-pooled and max-pooled to obtain two H × W × 1 two-dimensional feature maps. The two feature maps are then spliced (Cat) together, and the final spatial attention feature map ($M_S$) is obtained by a 7 × 7 convolution operation and the Sigmoid function. It is defined as:
$$M_S(x') = \mathrm{Sigmoid}\big(\mathrm{Conv}_7(\mathrm{Cat}(\mathrm{Avg}(x'), \mathrm{Max}(x')))\big)$$
where $x'$ is the output of channel attention and $\mathrm{Conv}_7$ is the 7 × 7 convolution.
FIGURE 5. SAM flow chart. The feature map output from Channel attention module is used as the input feature map of SAM module, the channel dimension is compressed by maxpool and avgpool, the two feature maps are stitched together, and after 7 × 7 convolution operation, the dimension is reduced to one channel, and then the feature map of Spatial Attention is obtained by sigmoid activation, and finally the result is multiplied by the original map to obtain the original size feature map.
Finally, the attention maps are applied by element-wise multiplication, and the CBAM attention module is defined as:
$$x' = M_C(x) \otimes x, \qquad \mathrm{CBAM}(x) = M_S(x') \otimes x'$$
where ⊗ represents element-wise multiplication. The DBAConv module combines depthwise convolution with this CBAM module. By extracting the feature information of the channel and spatial dimensions and multiplying it with the original input feature map to correct it, the module generates the final feature map, which can suppress the complex noise information in underwater images.
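The spatial attention step can be sketched in the same plain-Python style. This is an illustrative sketch under stated assumptions: the kernel values are hypothetical, zero padding is assumed so that the H × W size is preserved, and the function names are not taken from the paper.

```python
import math
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [channel][row][col]


def spatial_attention(x: Tensor3D, kernel: Tensor3D,
                      bias: float = 0.0) -> List[List[float]]:
    """Return an H x W map: Sigmoid(Conv(Cat(Avg(x), Max(x)))).

    `kernel` has shape [2][k][k] with k odd (k = 7 in the text); zero padding
    is assumed so the output keeps the input's spatial size.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    # Pool across the channel axis to get two single-channel maps.
    avg_map = [[sum(x[c][i][j] for c in range(channels)) / channels
                for j in range(width)] for i in range(height)]
    max_map = [[max(x[c][i][j] for c in range(channels))
                for j in range(width)] for i in range(height)]
    stacked = [avg_map, max_map]  # the "Cat" step
    size = len(kernel[0])
    pad = size // 2
    attention = []
    for i in range(height):
        row = []
        for j in range(width):
            total = bias
            for p in range(2):
                for di in range(size):
                    for dj in range(size):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < height and 0 <= jj < width:
                            total += kernel[p][di][dj] * stacked[p][ii][jj]
            row.append(1.0 / (1.0 + math.exp(-total)))
        attention.append(row)
    return attention


def apply_spatial_attention(x: Tensor3D,
                            attention: List[List[float]]) -> Tensor3D:
    """Multiply every channel element-wise by the spatial attention map."""
    return [[[attention[i][j] * value for j, value in enumerate(row)]
             for i, row in enumerate(channel)] for channel in x]
```

Applying `apply_spatial_attention` to the output of a channel attention step gives the sequential channel-then-spatial refinement that the text describes for CBAM.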
Although attention mechanisms have a larger capacity, they may generalize worse than CNNs because they lack the right inductive bias. The DBAConv module unifies depthwise convolution and attention to make the model more generalizable in classification. The stacking of Transformer blocks mainly establishes the global connection between features, while the DBAConv module establishes the local connection between features. Merging global and local features can improve the model's performance.
Our final DAMNet mainly consists of a stack of DBA blocks and Transformer blocks, and the detailed building blocks and parameter settings are shown in Table 1. Before the first DBA module, two convolutional layers reduce the image size to 112 × 112 and increase the number of channels to 64. After each DBA module, the size of the feature map is reduced and the number of channels of the feature map is increased. The final prediction of the model is obtained after average pooling and fully connected layers.

C. OPTIMIZER
In this paper, we use the Gravity optimizer, a gradient-based optimization algorithm that reduces the loss of the deep learning model. The gravity optimization algorithm has three hyperparameters: the learning rate (l), the initial step size (α), and the moving-average parameter (β). The recommended values are l = 0.1, α = 0.01, and β = 0.9; the learning rate is a common hyperparameter in deep learning. Like every optimization algorithm, it takes the gradient as input and outputs the steps used to update the parameters. The velocity is initialized as v = N(γ, σ), where N is a normal distribution with mean γ and standard deviation σ = α/l, G is the gradient of the objective function, and t is the number of update steps. β is a positive real number between 0 and 1; because values that are too large or too small affect the training results, a substitute for β is used, and a maximum-gradient term is computed for each weight matrix. The main problem before applying the moving average is the initial delay in loss reduction. To solve this problem, a gradient term and the velocity $v_t$ of the current update step are computed, and the weights are then updated with this velocity. The larger the value of the gradient coefficient, the larger the step size of the weights with higher gradients. By increasing m, a larger range of gradients is processed linearly, and weights with larger gradient values take larger update steps, which gives low validation loss and high accuracy. The experimental results are shown in Section IV-B.
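A Gravity-style update can be sketched as follows. The specific formulas for the substitute moving-average coefficient, the maximum-gradient term m, and the damped gradient term are assumptions based on our reading of the publicly described Gravity optimizer rather than a transcription of this paper's equations; the mean γ is assumed to be 0, and all function names are hypothetical.

```python
import random
from typing import List, Tuple


def corrected_beta(beta: float, t: int) -> float:
    """Assumed substitute for beta: smaller early on, approaching beta as t grows."""
    return (beta * t + 1.0) / (t + 2.0)


def init_velocity(size: int, lr: float = 0.1, alpha: float = 0.01,
                  seed: int = 0) -> List[float]:
    """Draw initial velocities from N(0, alpha / lr); the zero mean is assumed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, alpha / lr) for _ in range(size)]


def gravity_step(weights: List[float], grads: List[float],
                 velocity: List[float], t: int, lr: float = 0.1,
                 beta: float = 0.9) -> Tuple[List[float], List[float]]:
    """One assumed Gravity-style update of a flat weight vector (t starts at 1)."""
    largest = max(abs(g) for g in grads)
    if largest == 0.0:
        # Guard added for this sketch: with a zero gradient nothing is updated.
        return list(weights), list(velocity)
    m = 1.0 / largest  # assumed maximum-gradient term for this weight vector
    b = corrected_beta(beta, t)
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        zeta = g / (1.0 + (g / m) ** 2)    # gradient term, near-linear for |g| << m
        v_next = b * v + (1.0 - b) * zeta  # moving average of the gradient term
        new_velocity.append(v_next)
        new_weights.append(w - lr * v_next)
    return new_weights, new_velocity


if __name__ == "__main__":
    w = [1.0, -0.5]
    v = init_velocity(len(w))
    for step in range(1, 4):
        grads = [2.0 * wi for wi in w]  # gradient of the toy loss sum(w_i ** 2)
        w, v = gravity_step(w, grads, v, step)
        print(step, w)
```

In this sketch a larger m widens the range in which the gradient term grows linearly with the gradient, which matches the behavior the text attributes to increasing m.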

D. LOSS FUNCTION
The dataset used in this study consists of about 9,500 underwater images of seven types, and different types of images have similar backgrounds. To improve the learning speed of the model against the complex underwater image background and to speed up convergence during feature extraction, this paper uses the Cross-Entropy (CET) loss function for classification.
In the DAMNet model, the CET loss function is defined in terms of the sample x, the predicted output f, the actual label r, and the total number of samples n; the total loss over n samples is obtained by summing the n per-sample losses. Using the Sigmoid together with this loss solves the problem that the weight update is too slow. During feature extraction, similar features extracted from different classes of underwater images are affected by the output error as the training converges, and the CET loss function avoids the resulting slowdown in learning. The gradient of the last layer's weights is proportional to the difference between the output and the true values and no longer depends on the derivative of the activation function, so convergence is faster. Because backpropagation is multiplicative, the whole weight matrix is updated faster.
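The gradient property mentioned above can be checked numerically. The sketch below assumes a single sample, a Sigmoid output f = Sigmoid(z), and the binary form of the cross-entropy loss; these choices and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(z: float, r: float) -> float:
    """Binary cross-entropy for one sample with pre-activation z and label r."""
    f = sigmoid(z)
    return -(r * math.log(f) + (1.0 - r) * math.log(1.0 - f))


def cross_entropy_grad(z: float, r: float) -> float:
    """Analytic gradient with respect to z: simply (f - r)."""
    return sigmoid(z) - r


def squared_error_grad(z: float, r: float) -> float:
    """Gradient of 0.5 * (f - r) ** 2 with respect to z, for comparison.

    The extra factor f * (1 - f) is the Sigmoid derivative, which shrinks the
    gradient when the unit saturates.
    """
    f = sigmoid(z)
    return (f - r) * f * (1.0 - f)


def numeric_grad(z: float, r: float, eps: float = 1e-6) -> float:
    """Central finite difference of the cross-entropy loss."""
    return (cross_entropy(z + eps, r) - cross_entropy(z - eps, r)) / (2.0 * eps)


if __name__ == "__main__":
    for z, r in [(-6.0, 1.0), (0.3, 0.0), (1.5, 1.0)]:
        print(z, r, cross_entropy_grad(z, r), numeric_grad(z, r),
              squared_error_grad(z, r))
```

For a badly wrong, saturated prediction such as z = -6 with label 1, the cross-entropy gradient stays close to -1, while the squared-error gradient is much smaller in magnitude because of the Sigmoid derivative factor.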

IV. EXPERIMENTAL RESULTS AND ANALYSIS
Experimental Dataset: This paper selects images of fish, turtles, sea urchins, sea cucumbers, corals, humans, and remains from OceanDark [39], RUIE [40], UIEB [41], UFO-120 [42], EUVP [43], and other publicly available online underwater biological samples. We collected about 5,800 underwater images containing different degraded and enhanced images, such as real underwater images, artificial low-light images, hyper-segmented images, and underwater images acquired by different devices, and expanded them to about 9,500 images by image filtering and rotation. The images were divided into training and test sets with a ratio of 5:1, with the training set containing 7,984 images and the test set containing 1,600 images. Each training sample was scaled to 224 × 224, and images of some different underwater species are shown in Fig 6.
Compared methods: The proposed network is compared with traditional, classic neural networks, namely AlexNet [44], VGG19 [45], GoogLeNet [46], and ResNet50 [47], and with the newer network models EfficientNet [48], CoAtNet [49], RepVGG [50], and AlterNet [51].
Evaluation metrics:
1) Accuracy (ACC): The percentage of correctly predicted samples among all samples.
2) Precision (P): The proportion of samples correctly predicted as true among all samples predicted as true.
3) Recall (R): The proportion of samples correctly predicted as true among all samples that are actually true.
4) F1: The harmonic mean of Precision and Recall, F1 = 2 * P * R / (P + R).
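The four metrics can be computed from label lists as in the following sketch. The class names in the usage example and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall_f1(y_true: Sequence[str], y_pred: Sequence[str],
                        positive: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = ["fish", "fish", "turtle", "coral", "fish", "turtle"]
    preds = ["fish", "turtle", "turtle", "coral", "fish", "fish"]
    print("ACC:", accuracy(truth, preds))
    print("fish P/R/F1:", precision_recall_f1(truth, preds, "fish"))
```

For a multi-class dataset, the per-class values can be averaged over the classes to obtain overall precision, recall, and F1.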

A. EXPERIMENTAL CONFIGURATION
This paper uses the DAMNet model for feature extraction and learning. The model was run on a personal workstation with an RTX 3090 GPU and on a GPU server with an NVIDIA Tesla V100s with 32 GB of graphics memory, and was implemented using the PyTorch framework.
The original underwater creature images were uniformly converted to 256 × 256 × 3, and the input images were center-cropped to 224 × 224 × 3. The dataset of 9,584 images was divided into a training set and a test set in a 5:1 ratio. The training set is used to train and optimize our DAMNet model, and the test set is used to verify the validity of the model. We first initialize the network parameters using normalization and then train the model iteratively with the gravity optimizer, with the batch size set to 16, the decay rate of the optimizer set to β = 0.9, the base learning rate set to 0.01, and training ending after 200 iterations. In the following sections, the experimental results of classification evaluation, loss analysis, and ablation study are presented.
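The center-cropping step and the reported training settings can be sketched as follows. The helper names and the `TRAIN_CONFIG` dictionary are hypothetical; only the numeric values are taken from the text.

```python
from typing import List, Tuple

# Settings reported in the text, collected in one (hypothetical) config dict.
TRAIN_CONFIG = {
    "resize": 256,
    "crop": 224,
    "batch_size": 16,
    "beta": 0.9,
    "base_learning_rate": 0.01,
    "iterations": 200,
}


def center_crop_box(height: int, width: int, crop: int) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a centered crop x crop window."""
    if crop > height or crop > width:
        raise ValueError("crop size must not exceed the image size")
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


def center_crop(image: List[List[float]], crop: int) -> List[List[float]]:
    """Center-crop a single-channel image stored as a nested list [row][col]."""
    top, left, bottom, right = center_crop_box(len(image), len(image[0]), crop)
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    size, crop = TRAIN_CONFIG["resize"], TRAIN_CONFIG["crop"]
    print(center_crop_box(size, size, crop))
```

With a 256 × 256 input and a 224 × 224 crop, a 16-pixel border is discarded on every side.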

B. OPTIMIZER SELECTION
In this paper, the gravity optimization algorithm was selected as the optimizer of the model and compared with the Adam, AdamW, SGD, and RMSProp optimizers. The results of training on the underwater biological image dataset are shown in Table 2. They show that the gravity optimizer has lower validation loss, faster convergence, and higher accuracy. The RMSProp optimizer has a high accuracy rate, but its average loss is higher than that of the other optimizers. Like the SGD optimizer, its average loss fluctuates greatly and its convergence is slow. For different datasets, a good optimizer that achieves faster weight updates and model convergence improves the model's classification accuracy and reduces the model's loss and training time.

C. CLASSIFICATION EVALUATION
As shown in Figure 6, this paper's training and test images are composed of seven types of biological images under different complex backgrounds and degradation conditions. The DAMNet model is used to classify the underwater biological image dataset and is compared with common network models, namely AlexNet [44], VGG19 [45], GoogLeNet [46], and ResNet50 [47], and with the novel network models EfficientNet [48], CoAtNet [49], RepVGG [50], and AlterNet [51]. Table 3 shows the classification accuracy for each class of underwater creature images. The DAMNet model has high accuracy in each class, but the accuracy for fish is lower, reaching only 91.84%, while the accuracy for all other species is around 97.5%.
The average training results of the different models over all categories of the dataset are shown in Table 4, which lists each model's accuracy, precision, recall, F1 value, and loss and verifies the learning and classification ability of the DAMNet model. The line chart comparing training accuracy is shown in Fig 7. Table 4 shows that the DAMNet model accurately classifies underwater images with complex and similar backgrounds. Compared with models that are purely deep convolutional networks or incorporate a single attention mechanism, the dual-attention model can extract richer image features and captures object features more completely, so it identifies underwater creatures well and improves classification accuracy. The DAMNet model outperforms the other network models in all metrics, with an accuracy rate of 96.93% and an average loss value of 0.1860. The accuracy is about 5 percentage points higher than that of VGG19, EfficientNet, ResNet50, and AlexNet, and about 2 percentage points higher than that of GoogLeNet, CoAtNet, RepVGG, and AlterNet.
Combined with Fig 7, it is found that the stacking of attention and the combination of convolution and attention make the DAMNet model converge faster, reach high accuracy within a few epochs, and train more smoothly than the other models. The results in Fig 7 show that the DAMNet model has good learning ability and convergence capacity and obtains satisfactory classification results for underwater biological images.
Combined with Table 3, the classification accuracy of fish is low in all the models with higher accuracy. We found that the main reason is that fish in the dataset is a large class containing many different kinds of fish, so the features extracted within the class have low similarity, which causes classification errors.

D. LOSS ANALYSIS
In this paper, the data is augmented by rotation, and the ratio of samples used for training and testing is 5:1, which reduces the risk of overfitting. We initialize the weights and use the gravity optimization algorithm with a batch size of 16, with the initial step size, moving-average parameter, and learning rate set to 0.01, 0.9, and 0.01, respectively. The gravity optimizer speeds up the updating of weights through its exponential moving average, and the addition of the CBAM module enables the extraction of rich features in complex underwater backgrounds. The dual attention mechanism is employed to speed up feature extraction and reduce the number of parameters, so that the model converges faster and with lower losses. The average loss curve of the DAMNet model training is shown in Fig 8.

E. ABLATION STUDY
To further demonstrate the effectiveness of the features obtained by combining dual attention mechanisms, this paper incorporates different attention mechanisms in the model for the ablation study. The attention mechanisms SENet [52], SKNet [53], and CBAM [54] were compared: each was added to the model and trained on the underwater biological image dataset. The experimental results are shown in Table 5. SENet is a channel attention mechanism for feature maps, and SKNet is an attention mechanism for convolutional kernels. Both are similar in that they enhance useful information and compress useless information.
In contrast, CBAM generates attention feature maps along both the channel and spatial dimensions and has a stronger ability to improve the scalability of the underlying network. It can extract image feature information more effectively in complex underwater backgrounds. On the underwater biological image dataset, its accuracy reaches 96.93%. In conclusion, combining channel and spatial attention gives higher classification accuracy on underwater images.

V. CONCLUSION
In this paper, we propose a deep neural network model based on a dual attention mechanism for underwater biological image classification. Due to the complex underwater background and the situation that there are similar backgrounds in different categories, we choose to use the CBAM module to increase the saliency of image features. The combination and stacking of attention and convolution are realized to reduce feature parameters, thereby improving the learning performance and robustness of the model. Extensive experiments on underwater biological image datasets have shown that the DAMNet model obtains better results compared to other models. The model provides technical support in fishery and ocean exploration and deserves further study.
Although the classification results of the current model are good, the method has certain limitations. First, fish in the dataset is a large class, which has a poor learning effect and low classification accuracy. Second, the images used for marine organism identification are limited and uniform, for example containing only one category per image. In addition, because underwater images contain more feature information and noise than land images, the model has a large number of parameters and a longer training time. In our future work, we will add the fine-grained classification of fish and will consider designing a model for underwater creature target detection to recognize multiple targets in a single image.