Fry Counting Models Based on Attention Mechanism and YOLOv4-Tiny

Accurate counting is difficult in the case of large numbers of overlapping and adhering fry. In this study, we propose a lightweight target detection counting method based on deep learning that can meet the deployment requirements of edge computing devices for automatic fry counting while achieving a high counting accuracy. We improve the structure of YOLOv4-tiny by embedding different attention mechanisms in the cross stage partial connections blocks of the backbone network to enhance the feature extraction performance. In addition, the low efficiency of feature fusion in the original model is addressed by adding different attention mechanisms to the neck network structure to promote the effective fusion of deep feature information with shallow feature information and improve the counting accuracy. The experimental results showed that the six models proposed in this study improved the model accuracy and recall to varying degrees compared with the original YOLOv4-tiny model, while retaining the advantages of the YOLOv4-tiny model in terms of its small number of parameters and fast inference rate. The CBAM(n)-YOLOv4-tiny model, obtained by adding the CBAM to the neck network, showed the most significant improvement, with a mean average precision (mAP) of 94.45% and a recall of 93.93%. Compared with the YOLOv4-tiny model, there were increases of 27.06% in accuracy, 30.66% in recall, 38.27% in mAP, and 28.77% in the F1-score, along with a 67.82% decrease in the log-average miss rate (LAMR).

There is an urgent need for an accurate, automated, fry-friendly, real-time counting method that can easily be deployed in embedded devices for applications such as fry sales.
With the development of electronics and computer technology, methods have emerged that no longer rely on manual counting. Lemarie et al. [11] designed an electronic fry counting device to count fish larvae and embryos, achieving an error of less than 10% compared with manual counting, but the accuracy of the method was affected by particles or sediment in the water. Baumgartner et al. [12] and Ferrero et al. [13] applied optical counters based on infrared radiation to estimate the biomass of fry, but the method was difficult to adapt to large migration rates and situations where multiple fish overlap or are in close proximity. These fry counting methods brought some convenience to different applications, but their respective drawbacks also limited their use on a large scale.
With the rapid development of computer vision technology in recent years, its application to fry counting began to gain the attention of scholars [14], [15], [16]. Images of fish are collected by cameras and then analyzed to count the fry. Compared with the traditional methods, the method using computer vision has the advantages of a high efficiency, a low workload, and less damage to the fry. Many scholars have conducted studies on image analysis methods [6], [16], [17], which can be divided into image processing using traditional machine learning methods and deep learning methods based on neural networks.
Traditional machine learning-based counting methods use techniques such as image segmentation, which requires the human extraction of features, manual setting of thresholds, and setting of regression functions to perform counting. Researchers like Ibrahin et al. [18], Albuquerque et al. [19], Zhang et al. [20], and Garcia et al. [21] adopted traditional computer vision processing methods such as speckle detection and edge contour extraction to achieve the automatic counting of fry. Researchers such as Zhang et al. [22] further utilized binarization and expansion erosion to determine the biomass of fry through independently refined connected domains, improving the accuracy of overlapping regions to some extent. For complex backgrounds due to lighting conditions, researchers like Jing et al. [23] used the Sobel operator, which is effective for low-noise, low-gradient images, to detect the contour edges of fish and count the fish in such backgrounds.
In image segmentation-based counting methods, target adhesion is an important factor affecting the counting accuracy, and for the problem of the presence of adhesions among the fish, Labuguen et al. [4] adopted adaptive thresholding segmentation and edge detection to segment and count fry. Duan et al. [24] applied morphological operations to double layer images, along with a watershed algorithm based on the Otsu automatic threshold segmentation algorithm, and achieved good results in segmenting and counting fish eggs.
Deep learning has powerful feature extraction and data representation capabilities, and can achieve high-accuracy detection with sufficient samples. Compared with the counting methods based on image segmentation, a target detection method based on deep learning locates and detects the regions of interest in the selected images, which can effectively solve problems such as adhesion and overlapping among the targets, and can accurately identify individual targets even in complex backgrounds. Thus, the method outperforms traditional machine learning image segmentation in the fry counting task. In recent years, target detection methods in deep learning models have mainly been categorized as one-stage and two-stage methods. The target detection algorithm based on the region convolutional neural network (R-CNN) [25] is a typical two-stage method, with a processing flow that includes two steps: 1) candidate region acquisition and 2) candidate region classification and regression. Tseng et al. [26] used a mask region-based CNN (mask R-CNN) for the pixel-level detection of fish images and finally obtained an accuracy of 77.31%. Li et al. [27] achieved an 11.2% improvement compared with the deformable parts model (DPM) by using a fast R-CNN algorithm for underwater fish detection on a laboratory-made dataset of fish images. Although two-stage approaches have advantages in terms of accuracy, they suffer from problems like repeated encoding, which leads to redundant model computation and poor real-time performance. Single-stage target detection algorithms such as the you only look once (YOLO) series and the single shot multibox detector (SSD) are based on regression analysis. In contrast to two-stage approaches, these methods treat target detection as a regression problem and input candidate frames directly into the model for end-to-end training, resulting in better real-time performance.
One study [28] used a multi-column convolutional neural network (MCNN) as the front-end network to extract the features of fish images with different receptive fields, and adjusted the size of the convolutional kernel to adapt to the angle, shape, and size changes caused by fish movement, while applying a wider and deeper dilated convolutional neural network (DCNN) as the back-end network to detect fish targets, achieving fish school counting on that basis. In another study [29], the authors used the multi-scale fusion of anchor-free YOLOv3 to perform fish counting, achieving an average accuracy of 90.20%. Although the above methods have good counting accuracy, they require a relatively large number of parameters and high computational performance. Thus, existing deep learning models struggle to satisfy both the counting accuracy and detection speed requirements, making them difficult to apply in edge devices.
Therefore, a lightweight target detection fry counting model based on deep learning is urgently needed. In this study, six lightweight fry counting models were obtained by combining three attention mechanisms, namely, the spatial attention module (SAM), channel attention module (CAM), and convolutional block attention module (CBAM), with the backbone network and neck network of the YOLOv4-tiny lightweight model, respectively. In Section II, we discuss the source and preprocessing of the dataset used for training, and propose our models based on a discussion of the attention mechanisms. In Section III, the accuracy and inference performance of the six proposed models are measured and analyzed against four other mainstream models. In Section IV, the research conclusions are given.

II. MATERIALS AND METHODS
The shooting location of the dataset was the Guangdong International Fisheries High Technology Park, located at No. 4, Dongyong Section, Shinnan Highway, Nansha District, Guangzhou (longitude 113.4211, latitude 22.8897). The experimental equipment mainly consisted of a white-bottomed fish tray (length: 40 cm, width: 28 cm, depth: 8 cm) and a camera mounted directly above the fish tray. The camera was 0.8 m above the horizontal plane to realize overhead photography of the fry, and it transmitted the acquired videos of the fry through a network cable to a computer and the cloud for storage. A schematic diagram of the whole experiment is shown in Figure 1. A frame rate of 30 FPS was used for the camera, and 268, 323, 345, 313, and 181 images were obtained for the five species of fry (black carp, crucian carp, grass carp, Squaliobarbus curriculus, and variegated carp), respectively, as well as 184 images of the five kinds of fry mixed together (i.e., a total of 1593 images).

2) DATASET LABELING AND PRELIMINARY ANALYSIS
The dataset format used in this study was the Pascal VOC2007 standard, and the labeling software was LabelImg, a graphical image labeling tool. The process of making a label using this software is shown in Figure 2. All the fry in each image were framed using rectangular marker boxes labeled as "fish".
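Because the labels follow the Pascal VOC format, the number of fry in an image equals the number of "fish" bounding boxes in its annotation file. As an illustrative sketch (the XML snippet below is hypothetical, not taken from the actual dataset), the boxes can be counted with Python's standard library:

```python
import xml.etree.ElementTree as ET

def count_fish_boxes(xml_string):
    """Count bounding boxes labeled 'fish' in a Pascal VOC annotation."""
    root = ET.fromstring(xml_string)
    return sum(1 for obj in root.iter("object")
               if obj.findtext("name") == "fish")

# Minimal two-box annotation in VOC style (illustrative only).
voc = """<annotation>
  <object><name>fish</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>40</xmax><ymax>60</ymax></bndbox>
  </object>
  <object><name>fish</name>
    <bndbox><xmin>50</xmin><ymin>30</ymin><xmax>90</xmax><ymax>70</ymax></bndbox>
  </object>
</annotation>"""

print(count_fish_boxes(voc))  # 2
```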
In fry detection and counting, it is often difficult to accurately count the biomass of fry in practice because of their physiological characteristics, the differences between individuals, and the environment of the experimental equipment. For example, the fry individuals shown in Figure 3(a) are very small, and the density distribution of the fry in the fish tank is uneven because they like to gather in the corners of the tray. Different volumes caused by the individual differences in the fry result in large size differences, as can be seen in Figure 3(b), where the volumes of the fish in the larger frames are three to four times larger than those of the fish in the smaller frames. Figures 3(c) and (d) show that because of the large local density caused by aggregation, the fry images are obscure. Hence, aggregation, which is commonly seen in obtained image datasets, makes it difficult to distinguish individual fry by means of traditional image segmentation, causing challenges to accurate fry counting. The white boxes in Figure 3(e) indicate fry excretions, which can produce interference during fry biomass estimation, and lead to problems such as model misjudgment. However, diverse fry images help to enhance the effectiveness of the model during deep learning training, while simultaneously improving the generalization ability of the model.

B. DATA AUGMENTATION
In order to improve the robustness and recognition accuracy of the model, the image data in this study were augmented before the images were used for model training. Mosaic [30], which builds on CutMix [31], was used to enhance the data. The method selects four original photos at random and performs data enhancement operations on each of them. As shown in Figure 4, the operations included a color gamut change, scaling, flipping, and rotating. Subsequently, for each of the four processed images, an arbitrary rectangular region was selected, and the images from these four rectangular regions were stitched together to form a new image, which carried over information from the original images, such as the detection frames marked in them. Through this data enhancement process, the new image contained richer image features, which enabled the training model to achieve a better result.
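The stitching step of Mosaic can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: it stitches random crops of four (randomly flipped) images into the quadrants of a 416 × 416 canvas, and omits the color gamut change and the remapping of the annotation boxes that a full implementation also performs:

```python
import numpy as np

rng = np.random.default_rng(0)

def mosaic(images, out_size=416):
    """Stitch random crops of four images into one mosaic image."""
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # A random split point defines the four quadrant sizes.
    cx = rng.integers(out_size // 4, 3 * out_size // 4)
    cy = rng.integers(out_size // 4, 3 * out_size // 4)
    regions = [(slice(0, cy), slice(0, cx)),             # top-left
               (slice(0, cy), slice(cx, out_size)),      # top-right
               (slice(cy, out_size), slice(0, cx)),      # bottom-left
               (slice(cy, out_size), slice(cx, out_size))]  # bottom-right
    for img, (rs, cs) in zip(images, regions):
        h = rs.stop - rs.start
        w = cs.stop - cs.start
        # Take an arbitrary h x w crop from the (possibly flipped) source.
        flipped = img[:, ::-1] if rng.random() < 0.5 else img
        y0 = rng.integers(0, flipped.shape[0] - h + 1)
        x0 = rng.integers(0, flipped.shape[1] - w + 1)
        canvas[rs, cs] = flipped[y0:y0 + h, x0:x0 + w]
    return canvas

four = [rng.integers(0, 256, (480, 640, 3), dtype=np.uint8) for _ in range(4)]
m = mosaic(four)
print(m.shape)  # (416, 416, 3)
```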
YOLO is a target recognition and detection algorithm based on a deep neural network that is able to detect and classify objects in images and videos simultaneously. As a single-stage target detection algorithm, the network divides the image into an H×H grid. If the centroid of a target falls inside a cell, that cell is responsible for predicting the target, i.e., for predicting the relative coordinates of the bounding box locations and their corresponding target confidence scores. Finally, the detection box with the largest score is selected as the final target detection box using the non-maximum suppression (NMS) method. The core concept of YOLO is to solve object detection as a regression problem, which makes the network structure simple and the detection speed much faster. Since the YOLO detection algorithm was first proposed in 2016, researchers have continued to improve and enhance it, developing derived versions with successful applications in the manufacturing and industrial sectors.
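The non-maximum suppression step mentioned above can be sketched as follows; this is an illustrative NumPy version of the standard greedy algorithm, not the exact implementation used in the YOLO framework:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

The second box is suppressed because it overlaps the highest-scoring box with an IoU above the threshold; the third survives because it does not overlap at all.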
YOLOv4-tiny is a lightweight model of YOLOv4. As shown in Figure 5, it has a relatively low detection accuracy compared with YOLOv4, but it only has about one-tenth of the parameters of YOLOv4, which makes it ideal for deployment on edge devices. The network structure can be divided into three parts, the backbone network, neck network, and detection head, and it has a reduced version of the CSPDarknet53 backbone network, which is responsible for the extraction of image features. The neck network adopts feature pyramid networks [33], convolving P5 once and then upsampling twice. The sampling result is channel spliced with P4 to realize the fusion of different high-level semantic information features extracted by the backbone network, and finally enriches the semantic information of P4. However, compared with YOLOv4, YOLOv4-tiny uses only two detection heads, and the detection effect is theoretically weaker than that of YOLOv4. In addition, the backbone network of YOLOv4-tiny has a large number of convolutional layers stacked, resulting in low feature diversity [34].

2) ATTENTION MECHANISMS
Attention mechanisms originated from research on human vision: when viewing a scene, humans selectively focus on a portion of the information they see while ignoring the rest of the visible information. In the field of computer vision, attention mechanisms score the various dimensions of the data input to the models, and weight the features according to their scores to highlight the impact of the main features on the downstream models. The attention mechanism in deep learning can thus be generalized as a vector of importance weights. In order to improve the accuracy of the model and solve problems such as unrecognized or misrecognized fry caused by their uneven distribution, an attention mechanism module needed to be added to the model to make it focus accurately on the fry themselves rather than on other objects during fry recognition and counting. Because the number of parameters in an attention mechanism module is generally small, its addition usually does not have a significant impact on the size and inference performance of a model. Section III compares the number of parameters and the inference performance of the network structure models after the addition of the attention mechanisms.
To obtain more comprehensive information about the features in each dimension of the image, as well as more fine-grained image features, three different attention mechanism modules were used in this study: SAM [35], CAM, and CBAM [36], which integrates spatial attention and channel attention. As shown in Figure 6, the CAM module first compresses the spatial dimensions of the feature map by max pooling and average pooling to generate two channel-dimension feature weights. Then, the two one-dimensional feature weight vectors are compressed and expanded by a multilayer perceptron with a compression rate of 16. The two resulting feature weights are added together, and the weight of each channel is calculated by the activation function to obtain the features of the channel dimension. In terms of spatial attention, as shown in Figure 7, the channel dimension is first compressed through max and average pooling to generate two spatial attention feature maps with one channel each. Then, these two feature maps are stacked and reduced to a feature map with one channel using a 3 × 3 convolution kernel. Through the sigmoid of this feature map, the feature weights in the spatial dimensions are obtained. Figure 8 shows that in the CBAM module, the input feature map is multiplied by the channel-dimension weights after channel attention to obtain a new feature map. Then, the new feature map is multiplied again by the output of the spatial attention mechanism to obtain a final feature map with the same shape as the input feature map.
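The channel-then-spatial pipeline of CBAM can be sketched numerically as follows. This is an illustrative NumPy forward pass with random weights, not the modules used in the paper: the shared two-layer MLP uses a reduction rate r, and a simple 1 × 1 mixing of the two pooled maps stands in for the k × k convolution of the spatial branch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). Shared MLP over max- and avg-pooled channel vectors."""
    c = x.shape[0]
    avg = x.mean(axis=(1, 2))            # (C,) average-pooled descriptor
    mx = x.max(axis=(1, 2))              # (C,) max-pooled descriptor
    # Shared two-layer MLP: C -> C/r -> C, applied to both descriptors.
    out = w2 @ np.maximum(w1 @ avg, 0) + w2 @ np.maximum(w1 @ mx, 0)
    return sigmoid(out).reshape(c, 1, 1)

def spatial_attention(x, k):
    """x: (C, H, W). Per-pixel max/mean maps mixed into one attention map."""
    mx = x.max(axis=0)                   # (H, W)
    avg = x.mean(axis=0)                 # (H, W)
    stacked = np.stack([mx, avg])        # (2, H, W)
    # A 1x1 "convolution" mixing the two maps stands in for the kxk conv.
    mixed = k[0] * stacked[0] + k[1] * stacked[1]
    return sigmoid(mixed)[None, :, :]    # (1, H, W)

def cbam(x, w1, w2, k):
    x = x * channel_attention(x, w1, w2)   # channel attention first
    x = x * spatial_attention(x, k)        # then spatial attention
    return x

rng = np.random.default_rng(1)
C, H, W, r = 16, 8, 8, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1   # compression C -> C/r
w2 = rng.standard_normal((C, C // r)) * 0.1   # expansion C/r -> C
k = np.array([0.5, 0.5])
y = cbam(x, w1, w2, k)
print(y.shape)  # (16, 8, 8)
```

Note that the output shape equals the input shape, which is what allows these modules to be dropped between existing blocks of the backbone or neck network.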

3) IMPROVED MODELS
Taking YOLOv4-tiny as the basic model, this study proposed two network structures that combined attention with different network regions: one integrated it into the backbone network, and the other into the neck network, as shown in Figure 9. As shown in Figure 9(a), two attention modules were used, embedded between the resBlocks in the backbone network to enable it to extract more useful features through the attention modules in the feature extraction stage. When the attention mechanism module was SAM, the obtained model was SAM(b)-YOLOv4-tiny, and when the modules were CAM and CBAM, the obtained models were CAM(b)-YOLOv4-tiny and CBAM(b)-YOLOv4-tiny, respectively. As seen in Figure 9(b), a total of three attention mechanism modules were used, all integrated into the neck network to enable the model to pay attention to dense areas of fish when performing target detection and feature fusion, thus improving the accuracy and efficiency of the model's recognition. Three additional models, SAM(n)-YOLOv4-tiny, CAM(n)-YOLOv4-tiny, and CBAM(n)-YOLOv4-tiny, were eventually obtained. The YOLO detection heads used in these six improved models were the same as those used in YOLOv4-tiny.

D. LOSS FUNCTION
The loss function used in the statistical model for fry counting consisted of three components, the bounding box regression loss (Loss_CIoU), confidence loss (Loss_confidence), and classification loss (Loss_class), as seen in (1). The intersection over union (IoU) metric is used to measure the degree of overlap between the predicted frame and the ground-truth frame in anchor-based target detection, with positive and negative samples distinguished based on the IoU value. In addition, it plays two other roles: filtering the predicted frames using non-maximum suppression (NMS) and adjusting the loss function to make the model more accurate. However, the direct use of the IoU in the loss function has disadvantages in the process of model training and optimization (i.e., it does not consider the distance between two frames, and cannot accurately reflect the degree of overlap between them). Because the IoU between two frames is 0 when they do not overlap, if the IoU is directly used in the loss function for optimization, it is possible that no gradient will be returned, which would eventually prevent training from proceeding. Therefore, based on the IoU, researchers have proposed the generalized IoU (GIoU), distance-IoU (DIoU), and complete-IoU (CIoU) [37]. The GIoU solves the problem of the IoU when the target frame does not overlap with the prediction frame. The DIoU considers the centroid distance and overlapping area of the two frames. The CIoU is more comprehensive because it considers the aspect ratio on top of the DIoU. The formulas for the CIoU are as follows.

LOSS = Loss_CIoU + Loss_confidence + Loss_class (1)

Loss_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + av

v = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))²

a = v/((1 − IoU) + v)
Here, w^gt and h^gt are the width and height of the ground-truth frames, while w and h are the width and height of the predicted frames. In addition, ρ(b, b^gt) denotes the Euclidean distance between the center points of the predicted and target frames, where b is the predicted box and b^gt is the target box, while c denotes the diagonal length of the smallest box that covers both boxes, and a is a weight parameter.
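A minimal numeric sketch of the CIoU computation follows (illustrative only; a small epsilon is added to the weight term to avoid division by zero for perfectly overlapping boxes):

```python
import math

def ciou(box_p, box_g):
    """Complete-IoU between predicted and ground-truth [x1, y1, x2, y2] boxes."""
    # Intersection and union areas.
    ix = max(0.0, min(box_p[2], box_g[2]) - max(box_p[0], box_g[0]))
    iy = max(0.0, min(box_p[3], box_g[3]) - max(box_p[1], box_g[1]))
    inter = ix * iy
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter)
    # Squared centre distance rho^2 and enclosing-box diagonal c^2.
    pcx, pcy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    gcx, gcy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    cw = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ch = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term v and its weight a.
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    a = v / ((1 - iou) + v + 1e-9)  # epsilon guards the identical-box case
    return iou - rho2 / c2 - a * v  # Loss_CIoU = 1 - ciou(...)

# Identical boxes give CIoU = 1, i.e. zero regression loss.
print(round(1 - ciou([0, 0, 10, 10], [0, 0, 10, 10]), 6))  # 0.0
```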
The bounding box regression loss (Loss_CIoU), confidence loss (Loss_confidence), and classification loss (Loss_class) are shown in the following.
Here, Loss_BCE(n, n̂) represents the binary cross-entropy loss, where n and n̂ are the actual and predicted categories of the j-th anchor in the i-th grid; p is the probability of belonging to the fry class; S represents the number of grids; and B is the number of anchors in each grid, with the anchor value taken as 3 in this study. In addition, 1_ij^obj takes the value 1 if the j-th anchor in the i-th grid contains an object, and 0 otherwise, meaning that the object is not included.
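As an illustration of the binary cross-entropy term, the confidence loss summed over object-containing anchors can be computed as follows (a sketch only: the full loss also includes an analogous weighted term over anchors without objects):

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy between an actual label and a predicted probability."""
    p = min(max(y_pred, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# (object present?, predicted confidence) for a few hypothetical anchors.
preds = [(1, 0.9), (1, 0.6), (0, 0.2)]
# Sum only over anchors where 1_ij^obj = 1.
loss = sum(bce(t, p) for t, p in preds if t == 1)
print(round(loss, 4))  # 0.6162
```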

E. TRAINING PROCESS
The parameters for the experimental platform and environment for this experiment are listed in Table 1, and the training parameters are listed in Table 2. The flow of the experimental training is shown in Figure 10, in which the dataset was divided into a training set (60%), validation set (20%), and test set (20%). The training and validation sets were used for model training and the adjustment of the hyperparameters in each epoch, while the test set was used for model testing and the adjustment of the optimization method until the optimal results were obtained.
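The 60/20/20 split described above can be reproduced as a simple shuffled partition; this is an illustrative sketch (the random seed is arbitrary, not from the paper):

```python
import random

def split_dataset(items, seed=42):
    """Shuffle and split into 60% train / 20% validation / 20% test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1593))   # 1593 images in the dataset
print(len(train), len(val), len(test))  # 955 318 320
```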
Because the model was downsampled five times in the backbone network, the input size had to be a multiple of 32. Taking 416 × 416 as the input size for all the models, with a batch size of 32, the Adam optimizer, and an initial learning rate of 5e-4, the cosine annealing algorithm was used to adjust the learning rate. In addition, label smoothing [38] set at 0.005 was added to prevent the model from overfitting and to improve the generalization ability of the model on unknown data.
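The learning-rate schedule and label smoothing can be sketched as follows; these are the standard illustrative formulas, and the exact forms used by a particular training framework may differ slightly:

```python
import math

def cosine_annealing(epoch, total_epochs, lr_init=5e-4, lr_min=0.0):
    """Cosine-decay the learning rate from lr_init down to lr_min."""
    cos = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return lr_min + (lr_init - lr_min) * cos

def smooth_labels(one_hot, factor=0.005):
    """Soften a one-hot target by a small smoothing factor."""
    k = len(one_hot)
    return [y * (1 - factor) + factor / k for y in one_hot]

print(cosine_annealing(0, 100))    # 0.0005 (start of training)
print(cosine_annealing(100, 100))  # 0.0 (end of training)
print([round(v, 4) for v in smooth_labels([1.0, 0.0])])  # [0.9975, 0.0025]
```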
To verify the superiority of the models, this study compared them with advanced target detection algorithms, including SSD, YOLOv4, YOLOv4-tiny, and GhostNet-YOLOv4. Among these, GhostNet-YOLOv4 replaced the CSPDarknet backbone network in the original YOLOv4 with a more advanced backbone network (GhostNet), which had fewer parameters and less computation. The loss of each model was relatively large at the beginning of training. After 100 iterations of network training, the 10 models tended to converge in both the training and test sets, and the loss plateaued, at which point training was stopped, as shown in Figure 11.

F. EVALUATION INDICATORS
The evaluation indicators for the models fell into two categories. The first was related to the detection accuracy, for which five evaluation metrics were used in the study: the accuracy, recall, mean average precision (mAP), log-average miss rate (LAMR), and F1-score. The goal of the study was to accurately detect the number of individual fry, resulting in high requirements for both the accuracy and recall; thus, the F1-score was included to balance the two, with a higher F1 usually indicating a better model. The second category comprised indicators related to the inference speed: the number of neural network parameters and the FPS. These metrics were calculated as shown in the following.
Here, TP denotes positive samples predicted to be positive by the model; FP denotes negative samples predicted to be positive by the model; FN denotes positive samples predicted to be negative; and Miss denotes the rate of missed detections. FPPI is the number of false detections in each test image, while T represents the total number of test images, and FD is the number of false detections.
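From these definitions, the accuracy-related metrics follow directly. The values below are illustrative; LAMR, which additionally averages the miss rate over FPPI values sampled on a log scale, is omitted from this sketch:

```python
def detection_metrics(tp, fp, fn, false_detections, total_images):
    """Precision, recall, F1, miss rate, and false positives per image (FPPI)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    miss = fn / (tp + fn)                    # Miss = 1 - recall
    fppi = false_detections / total_images   # FPPI = FD / T
    return precision, recall, f1, miss, fppi

# Hypothetical counts from a test run.
p, r, f1, miss, fppi = detection_metrics(tp=90, fp=10, fn=10,
                                         false_detections=10, total_images=50)
print(round(p, 2), round(r, 2), round(f1, 2), round(miss, 2), fppi)
# 0.9 0.9 0.9 0.1 0.2
```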
In this study, the image inference speed was further tested on the Jetson Nano, a development kit released by NVIDIA in 2019 that delivers modern AI computing power in a compact, easy-to-use platform. It features complete software programmability, a quad-core 64-bit ARM CPU, and a 128-core integrated NVIDIA GPU at an affordable price, making it suitable for deployment in a variety of edge environments.

III. RESULTS AND ANALYSIS

A. ACCURACY COMPARISON
The six improved models with the added attention mechanisms were evaluated on the fry training dataset, and their accuracy and recall values at different score_threshold values are shown in Figure 12, from which it can be seen that when the score_threshold was 0.5, the accuracy and recall were stable. As the score_threshold increased, the accuracy of most of the models improved slightly, but the recall rates of the models decreased. In order to accurately detect the number of individual fry, both high accuracy and recall are required. Thus, the score_threshold was set at 0.5 in this study.
As can be seen in Figures 12 and 13, models (b), (d), (e), and (f) all reached an inflection point near a recall of 92% (i.e., they reached an equilibrium point), after which the accuracy dropped sharply. At that point, both the accuracy and recall were at their highest, and the average precision (AP) values of the fry were all greater than 91%, with the highest AP, for CBAM(n)-YOLOv4-tiny, reaching 94.45%. Models (a) and (c) reached the inflection point at recall values of approximately 74%, with average precision values of 72.75% and 70.21%, respectively.
The overall performances of the six detection models in the test set are listed in Table 3. Analyses performed from the perspectives of the different attention mechanism modules showed that the CBAM module was more effective than SAM and CAM when applied to both the neck and backbone networks, where the LAMR of the CBAM(n)-YOLOv4-tiny model was the lowest, merely 0.28, and the F1 value was the highest, reaching 0.94. The CAM(b)-YOLOv4-tiny and SAM(b)-YOLOv4-tiny models were less effective, with LAMR reaching 0.86 and 0.85, respectively, and F1 scores for both at 0.76. The results indicated that, among these three attention mechanisms, the detection network with the CBAM module performed better for the fry dataset than those with the SAM and CAM modules. From the perspective of the positions where the attention mechanism was embedded, a comparison of the additions of the modules to the backbone and neck networks showed that the LAMR value of SAM added to the neck network decreased from 0.85 to 0.44, a decrease of 48.2%, and the F1 score increased from 0.76 to 0.90, an increase of 18.4%, relative to the values when it was added to the backbone network. Relative to the addition of CAM to the backbone network, when it was added to the neck network, LAMR decreased from 0.86 to 0.40, a drop of 53.5%, and the F1 score improved from 0.76 to 0.91, an increase of 19.7%. Compared with the values when CBAM was added to the backbone network, when it was added to the neck network, LAMR decreased from 0.34 to 0.28, a decline of 17.6%, and the F1 score increased from 0.93 to 0.94, an improvement of 1.1%. The results showed that the addition of the attention mechanism to the neck network significantly reduced LAMR and improved the F1 score compared with the addition to the backbone network, suggesting that combining attention in the neck network was more likely to improve the model's counting accuracy.
Based on the data in Table 3 and the above discussion, CBAM(n)-YOLOv4-tiny was the optimal model among the six proposed improved models.
For the practical application of the models, pictures with low and high fry density were selected to compare the actual results. The results of the unimproved original model (YOLOv4-tiny) and the six different improved models are presented in Figures 14-20. Figure 14 shows that the original model (YOLOv4-tiny) detected 18 fry in the fry picture with low density (21 fry) and 133 fry in the picture with high density (169 fry). In Figures 16 and 17, both CAM(b)-YOLOv4-tiny and SAM(b)-YOLOv4-tiny detected only 19 fry in the low-density picture, with the two undetected fry being located at the edge of the plate, and only 143 fry and 135 fry, respectively, in the high-density picture. Figures 15 and 18 show that both CBAM(b)-YOLOv4-tiny and CBAM(n)-YOLOv4-tiny detected all the fry in the image with low density and 151 fry in the image with high density. As seen in Figures 19 and 20, CAM(n)-YOLOv4-tiny detected all the fry in the low-density image and 152 fry in the high-density image, while SAM(n)-YOLOv4-tiny detected 20 fry in the low-density image and 151 fry in the high-density image. The above results indicated that, compared with the original YOLOv4-tiny model, the detection accuracy was improved by the addition of the attention mechanism modules for both low- and high-density fish populations. Whether it was CAM, SAM, or CBAM, when integrated into the neck network, relatively high accuracy could be achieved, which was basically consistent with the calculated results in Table 3. As can be seen from the corresponding plots, the detection of the fry at the edge of the fish plate, as well as in the dense regions, improved the fry counting accuracy. However, because only two images were tested, these results alone do not lead to the conclusion that CBAM(n)-YOLOv4-tiny was the best of the six models, as shown in Table 3.
In order to verify the performance of the CBAM(n)-YOLOv4-tiny model proposed in this paper, four current mainstream target detection algorithms, namely SSD, GhostNet-YOLOv4, YOLOv4, and YOLOv4-tiny, were also tested on the fry dataset in this study. The results of these tests are listed in Table 4. When the score_threshold was 0.5, the accuracies of the four mainstream models ranged from 74.23% to 95.76%, and the recall rates ranged from 65.27% to 76.88%, among which SSD had the highest accuracy of 95.76%, but its recall rate was the lowest at only 65.27%. The CBAM(n)-YOLOv4-tiny model proposed in this paper had the highest mAP and F1 values in the fry test set, with values of 94.45% and 0.94, respectively, in addition to having the lowest LAMR of only 0.28. Compared with the original YOLOv4-tiny model, the accuracy was improved by 27.06%; the recall was increased by 30.66%; the mAP was increased by 38.27%; the F1 score improved by 28.77%; and LAMR decreased by 67.82%.

B. EFFICIENCY COMPARISON
The counting accuracy of a model is an important indicator in the evaluation of its performance. In addition to accuracy, the model size and detection speed are also important, especially for applications such as fry sales, where these models are often deployed in embedded devices to improve the ease of use. Based on the PC parameters listed in Table 1, the frames per second (FPS) value was used as the detection speed indicator, where a larger value usually indicates a better result. Because the FPS calculation was related to the performance of the computing device, and different values could be obtained at different times, the FPS of each model was calculated five times, and the average was taken as the final FPS. Table 5 compares the model size and detection speed values of the fry counting models based on the above 10 target detection networks.
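The averaged FPS measurement protocol can be sketched as follows; fake_infer is a hypothetical stand-in for a model forward pass, not the detection model itself:

```python
import time

def measure_fps(infer, n_frames=100, n_runs=5):
    """Average FPS over n_runs timed passes of n_frames inferences each."""
    fps_values = []
    for _ in range(n_runs):
        start = time.perf_counter()
        for _ in range(n_frames):
            infer()
        elapsed = time.perf_counter() - start
        fps_values.append(n_frames / elapsed)
    return sum(fps_values) / n_runs

# Stand-in for a model forward pass (hypothetical workload).
def fake_infer():
    sum(i * i for i in range(1000))

fps = measure_fps(fake_infer)
print(fps > 0)  # True
```

Using `time.perf_counter` rather than `time.time` avoids clock-resolution artifacts on short runs, and averaging over several runs smooths out load-dependent variation, which is the rationale the paper gives for taking five measurements.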
From Table 5, it can be seen that both SSD and YOLOv4 had larger numbers of parameters and poorer performance in terms of efficiency, while GhostNet-YOLOv4 and YOLOv4-tiny had smaller numbers of parameters and performed better on embedded devices. SSD had 90.07 MB of parameters and its FPS on the PC was 105.71, while its speed on the edge device Jetson Nano was only 1.70 FPS. YOLOv4 had a parameter count of 245.53 MB and an FPS of 61.63 on the PC, and the model failed to load on the Jetson Nano. Because GhostNet-YOLOv4 and YOLOv4-tiny both used lightweight backbone networks, they had parameter counts of 43.60 MB and 22.41 MB, and FPS values of 5.29 and 12.27 on the Jetson Nano, respectively. GhostNet-YOLOv4 had a lower FPS value on the PC than SSD, but on the Jetson Nano it performed better than SSD. This was because the PC had ample video memory, while the embedded device had limited video memory, revealing that the model size has a significant impact on the inference speed of embedded devices.
For the models proposed in this paper, it can be seen from Table 5 that the number of parameters for the six models increased very little after the integration of the attention mechanism modules. Compared with YOLOv4-tiny, after adding the CAM module to the neck and backbone networks, respectively, the CAM(n)-YOLOv4-tiny and CAM(b)-YOLOv4-tiny models increased in size by only 0.65 MB and 0.15 MB, and their inference speeds on the Jetson Nano decreased by only 0.68 FPS and 0.18 FPS, respectively.
For SAM(n)-YOLOv4-tiny and SAM(b)-YOLOv4-tiny, the parameters of the models hardly increased at all, and their inference speeds on the Jetson Nano decreased by only 0.34 FPS and 0.23 FPS, respectively. After combining spatial attention and channel attention in the CBAM module, the numbers of parameters for the CBAM(n)-YOLOv4-tiny and CBAM(b)-YOLOv4-tiny models increased by only 0.66 MB and 0.15 MB, respectively, and their speeds on the Jetson Nano were reduced by only 0.91 FPS and 0.67 FPS, respectively.
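The small parameter overhead is a property of the CBAM design itself: channel attention operates on pooled per-channel descriptors and spatial attention on pooled per-position descriptors, so neither scales with the feature map size. The numpy sketch below illustrates this sequential reweighting; it is a simplified illustration, not the paper's implementation (the shared MLP of channel attention and the 7×7 convolution of spatial attention are replaced by plain sums to keep the example self-contained):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    """x: (C, H, W). Combine global average- and max-pooled descriptors
    into per-channel weights (shared MLP omitted in this sketch)."""
    avg = x.mean(axis=(1, 2))                  # (C,)
    mx = x.max(axis=(1, 2))                    # (C,)
    return sigmoid(avg + mx)[:, None, None]    # (C, 1, 1), values in (0, 1)

def spatial_attention(x):
    """Combine channel-wise average and max maps into per-position weights
    (the 7x7 convolution of full CBAM is omitted in this sketch)."""
    avg = x.mean(axis=0)                       # (H, W)
    mx = x.max(axis=0)                         # (H, W)
    return sigmoid(avg + mx)[None, :, :]       # (1, H, W), values in (0, 1)

def cbam(x):
    x = x * channel_attention(x)               # reweight channels first,
    return x * spatial_attention(x)            # then spatial positions

feat = np.random.randn(64, 13, 13)             # a YOLOv4-tiny-scale feature map
out = cbam(feat)
assert out.shape == feat.shape                 # attention preserves the shape
```

Because the attention weights are broadcast multiplications, embedding such a module adds only the small learnable components (MLP and one convolution in the full version), consistent with the sub-megabyte size increases reported in Table 5.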

C. DISCUSSION
The comparison results for mAP and FPS obtained on the PC and Jetson Nano for the six improved models proposed in this paper, together with the four models SSD, ghostnet-YOLOv4, YOLOv4, and YOLOv4-tiny, are shown in Figure 21. The results showed that the mAP was higher when using SSD and YOLOv4 for fry counting, but because of their relatively large model sizes, their detection speeds were slow, and they were difficult to deploy on embedded devices to meet the demand of real-time monitoring. In terms of detection speed, YOLOv4-tiny performed the best on both the PC and the Jetson Nano, but in terms of accuracy, its mAP was below 70%, which was not sufficient for fry counting. When the attention mechanism modules were added, the precision improved to different degrees compared with the original model, without much reduction in recognition speed; CBAM(n)-YOLOv4-tiny, CBAM(b)-YOLOv4-tiny, SAM(n)-YOLOv4-tiny, and CAM(n)-YOLOv4-tiny performed better in terms of accuracy, with a maximum mAP of 94.45%. The accuracies of CAM(b)-YOLOv4-tiny and SAM(b)-YOLOv4-tiny were slightly improved compared with the original model, but their improvements were lower than those of the other four models. A preliminary explanation is that SAM and CAM alone have weak feature extraction capabilities when embedded in the backbone network. In terms of parameter count and inference speed, all six improved models proposed in this paper could be effectively deployed on embedded devices.
To explore the effect of adding attention mechanisms on the recognition capability of the model, the weights of the last layer of the model were extracted to form a heat map. Figure 22 shows that after adding the attention mechanism, the weights of the fry regions were significantly larger than those of the original model. In the detection map of YOLOv4-tiny overlaid with the heat map, the weights of the region in the yellow box were lower, which caused the fry in this region to go unrecognized and reduced the recognition accuracy. After adding the attention mechanism, the weights at the edges of the image increased, meaning that the fry at the edges could be successfully recognized. Thus, the model with the attention mechanism attended to regions that the original model did not notice, which increased the fry recognition accuracy.
FIGURE 22. Heatmaps and detection maps with and without attention.
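Turning last-layer weights into a heat map can be sketched as collapsing the activation tensor over channels and min-max normalizing the result; this is a common visualization recipe and an assumption about how Figure 22 was produced, not a description taken from the text:

```python
import numpy as np

def activation_heatmap(feature_map: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) activation tensor to a single (H, W) heat map
    by averaging over channels and min-max normalizing to [0, 1]."""
    heat = feature_map.mean(axis=0)
    heat -= heat.min()
    rng = heat.max()
    return heat / rng if rng > 0 else heat   # guard against a flat map

# Hypothetical last-layer activations at YOLOv4-tiny's coarse grid size.
fmap = np.random.rand(128, 13, 13)
heat = activation_heatmap(fmap)
assert heat.shape == (13, 13)
assert heat.min() >= 0.0 and heat.max() <= 1.0
```

The normalized map can then be color-mapped and blended with the input image, so that low-weight regions (such as the yellow box discussed above) stand out visually.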
Nevertheless, the fry tended to aggregate and adhere to one another, especially in the case of external disturbances. Severe adhesion could cause two or more fry to be identified as a single fish, possibly because the fry were packed so densely that overlapping detection boxes were discarded by non-maximum suppression, resulting in missed detections.
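This failure mode can be reproduced with a standard greedy non-maximum suppression sketch: when two adhering fry produce heavily overlapping boxes, the lower-scoring box is suppressed and one fish goes uncounted. The box coordinates and the 0.5 IoU threshold below are illustrative, not the paper's settings:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep boxes in descending score order, dropping any box
    that overlaps an already-kept box beyond the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two adhering fry yield near-identical boxes; a third fish sits apart.
boxes = [(10, 10, 50, 30), (14, 11, 54, 31), (100, 100, 140, 120)]
scores = [0.92, 0.88, 0.90]
print(nms(boxes, scores))   # → [0, 2]: the second adhering fry is missed
```

Here boxes 0 and 1 have an IoU of about 0.75, so NMS keeps only the higher-scoring one, and the count drops from three fish to two, mirroring the missed detections described above.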

IV. CONCLUSION
(1) To develop a fry counting method suitable for deployment on edge devices, this study investigated six different lightweight fry counting models obtained by adding three different attention mechanisms (SAM, CAM, and CBAM) to different network structures of YOLOv4-tiny. All of these combinations showed different degrees of improvement over the YOLOv4-tiny model, with CBAM(n)-YOLOv4-tiny achieving the highest mAP, at 94.45%, and a recall of 93.93%. Compared with the YOLOv4-tiny model, the accuracy improved by 27.06%, the recall by 30.66%, the mAP by 38.27%, and the F1-score by 28.77%, while the LAMR decreased by 67.82%; meanwhile, the number of model parameters and the inference rate did not change significantly. With its lightweight design, the model is suitable for deployment on various edge computing devices.
(2) The three attention mechanism modules were added to the backbone and neck networks of the models. The experimental results showed that adding an attention mechanism to the neck network produced a better combined effect than adding it to the backbone network, and the enhancement obtained by adding the CBAM module was the most obvious.
(3) Compared with SSD, ghostnet-YOLOv4, and YOLOv4, the six models proposed in this paper achieved significantly higher frame rates in different operating environments while having smaller numbers of parameters.
DACHUN FENG was born in Nanchong, China, in 1973. He received the Ph.D. degree from the South China University of Technology, China, in 2009. He is currently a Professor with the College of Information Science and Technology, Zhongkai University of Agriculture and Engineering. His current research interests include intelligent information systems for agriculture, the Internet of Things, artificial intelligence, and big data.
JIEFENG XIE received the bachelor's degree in computer science and technology from the Zhongkai University of Agriculture and Engineering, in 2021, where he is currently pursuing the master's degree in computer science. His research interests include intelligent information systems for agriculture and artificial intelligence.
VOLUME 10, 2022