Research on Spider Sex Recognition From Images Based on Deep Learning

The rapid and accurate identification of spider sex is the first step in spider image recognition. Traditional manual identification of the sex of mature spiders is based mainly on their genital structures (male palps or female epigynum) and is highly dependent on the professional background of the identifier. This article uses deep learning and transfer learning to identify the sex of spiders, explores the design and application of convolutional neural networks for spider sex recognition from images, and establishes a neural network model that performs excellently in experiments. In addition to optimizing the network model, we select appropriate hyperparameters to improve the recognition accuracy and reduce the influence of human factors in the identification process. Through a comparison of multiple sets of experiments based on existing sample data collected in the laboratory, we find that transfer learning based on Xception achieves better prediction accuracy than ResNet-152. After data augmentation, the optimization of a combined activation function and the fine-tuning of frozen layers, the prediction accuracy reaches 98.02%, and for an actual measurement of independent samples, the recognition accuracy reaches 92.38%. Therefore, the proposed method can basically replace manual identification and provide a reference for the artificial intelligence-based identification of spider species. Additionally, the model results indicate that sexual dimorphism may exist in the non-genital characteristics of spiders.


I. INTRODUCTION
There are nearly 50,000 species of spiders in the world and more than 5,000 species in China [1]. The identification of spider species depends mainly on the characteristics of the genitals (male palps or female epigynum); this process is time consuming, laborious, subjective, and highly dependent on spider classification experience. With the rapid development of the microbiology field, the status of traditional taxonomy has been strongly challenged. For example, there are few permanent taxonomy positions, and funding for taxonomy research is limited [2]. However, while traditional taxonomy faces many challenges, the rapid development of modern technology has created opportunities for new taxonomic methods, such as deep learning [3]; such methods have provided a new basis for spider taxonomy [4]. In particular, the rapid development of deep learning in the past ten years has led to its use in image recognition in various fields [5], but research on spider image recognition has not yet been reported. The first step in spider species classification is identifying spider sex from images; this result directly determines the accuracy of subsequent species recognition tasks. Traditional methods for the manual identification of the sex of mature spiders are based mainly on the genital structures (male palps or female epigynum) of the spiders [6]. In addition, body color, pattern and shape are sexually dimorphic in some spider groups [7]. With 142 species endemic to East Asia, South Asia and Southeast Asia, Pseudopoda Jäger is the third largest genus in the family Sparassidae (World Spider Catalog 2021) [8]. These spiders are highly diversified in China, where 63 species have been reported; however, according to the results of laboratory investigations, there are at least 110 species in China.
As a result, Pseudopoda is an ideal candidate for studying spider image recognition. Although Pseudopoda spiders are typically nocturnal, almost all species have similar body colors (generally yellow) and spot patterns (with the fovea and radial furrows distinctly marked), so it is difficult to manually distinguish males from females or identify species based on body coloration and spot patterns alone. However, multiple studies have found that the body colors and spot patterns of spiders play important roles in attracting the opposite sex and increasing the success rate of courtship [9], [10]. Do the body colors and spot patterns of spiders of the genus Pseudopoda have similar functions, and can these features be identified by artificial intelligence? To date, no studies have addressed these questions.
In recent years, with the continuous improvement of deep learning technology, convolutional neural network (CNN) models have made considerable progress in the field of image recognition [11]-[14]. Each network has distinct characteristics, the recognition accuracy of networks has been continuously improved [15], and applications have been refined in various fields, such as face recognition [16], medicine [17], [18], and agriculture [19], [20], with good results. This progress provides ideas and a research basis for studying spider sex recognition. Since Alex Krizhevsky released AlexNet in 2012, many types of deep learning networks have been invented, such as VGGNet, GoogLeNet, Inception, and ResNet, and the abstract reasoning ability of these networks has been continuously improved [21]. Additionally, computing frameworks are becoming increasingly mature; currently popular frameworks include TensorFlow, Caffe, Theano, MXNet, Torch, and PyTorch [22], [23]. Among them, TensorFlow supports model training and testing based on large standardized datasets such as ImageNet; models pretrained in this way achieve very high prediction accuracy and very strong generalization ability. According to the official ImageNet website, the dataset has reached 14,197,122 manually labeled images spanning 21,841 categories, thus providing sufficient samples to support learning and training for various models [24], [25]. However, the resolution of the images in the standard database is low. During training, a model can learn only primary features from such a training set, such as the outline or texture of an object; therefore, the standard set is not specific enough to meet the requirements of certain research areas. In terms of spider sex recognition, it is relatively easy to determine the shape of a spider's genitals, but features such as the back pattern are difficult to distinguish.
Therefore, a learning model that can extract primary features while also learning minor features related to the target object is needed.
Transfer learning involves applying models trained with large datasets from source fields to data from target fields [26]. This approach is important for small-sample machine learning and can effectively alleviate the various problems caused by small sample sizes, such as overfitting and weak generalization ability, among others. Transfer learning in deep learning is widely used in small-sample learning, and the results are typically good [19], [27]- [30].
For example, Issam Dagher and Dany Barbara used networks such as VGG, ResNet, and Inception for transfer learning to solve problems related to face age estimation [31]. Li Miao and Wang Jingxian et al. applied a transfer learning method for crop disease recognition [32]. Ashraf Darwish and Dalia Ezzat et al. used transfer learning and integrated learning to identify corn disease problems [33].
In this paper, we performed spider sex image recognition based on 42 Chinese Pseudopoda species and a transfer learning method. We mainly addressed two questions: 1. Can image-based sex recognition be achieved for Pseudopoda spiders, and if so, how can the recognition accuracy be improved for small sample sets? 2. Do the non-genital features of Pseudopoda spiders, such as body color and spot patterns, display sexually dimorphic trends and play important roles in image-based sex recognition?

II. MATERIALS AND METHODS
A. SPIDER MICRO GRAPHICS DATA SET
The spider samples used in this study are stored at the Center for Behavioural Ecology and Evolution (CBEE; College of Life Sciences, Hubei University, Wuhan, China). These samples comprise 3,133 habitus photos of 30 Pseudopoda species (Table 1). We randomly took photos in dorsal view, ventral view or both views for each spider. All photos were taken with a Leica DFC450 digital camera attached to a Leica M205C stereomicroscope, with 10-20 photographs taken in different focal planes and combined using image stacking software (Leica LAS). The captured TIFF files were converted into JPEG format through Python; the file size was drastically reduced while maintaining a resolution of 2560 × 1920, and a standard data set was established in JPEG format. The image annotation result is shown in Figure 1. In this study, the samples of the 30 Pseudopoda species are divided into a model training set, a validation set, and a test set, and samples of 12 other Pseudopoda species are used as generalization test set A, which contains 800 samples (Table 2) and is used in a supplementary experiment to verify the reliability of the machine learning model. In addition, to verify whether nongenital structural features, such as the back color and pattern of spiders, display sexual dimorphism, 328 pictures without any genital structure information are manually selected from set A as generalization test set B. Following the general practice of model training, after randomly shuffling the samples, the 30-species set is divided into a training set and a validation set at a ratio of 3 to 1; additionally, 4/5 of the validation set is used for model validation, and 1/5 is used as the model test set. The final training set contains 2,350 samples, the validation set contains 626 samples, and the test set contains 157 samples. The data in Table 1 and Table 2 show that the samples are balanced [34], [35].
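The split described above (3:1 train/validation, then 4/5 of the held-out pool kept for validation and 1/5 reserved as the test set) can be sketched in plain Python; the fixed seed and the use of sample indices are illustrative assumptions, not details from the paper:

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle samples, hold out 1/4, and split the held-out pool 4:1."""
    samples = list(samples)
    rng = random.Random(seed)  # seed is an illustrative choice
    rng.shuffle(samples)
    pool_size = len(samples) // 4            # 1/4 held out for validation/test
    train = samples[pool_size:]              # remaining 3/4 for training
    pool = samples[:pool_size]
    n_val = pool_size * 4 // 5               # 4/5 of the pool -> validation
    val, test = pool[:n_val], pool[n_val:]   # remaining 1/5 -> test
    return train, val, test

train, val, test = split_dataset(range(3133))
print(len(train), len(val), len(test))  # 2350 626 157
```

With 3,133 samples this reproduces the set sizes reported above (2,350 / 626 / 157).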

B. TRANSFER LEARNING METHOD
A transfer learning method for image recognition with TensorFlow is used, and the base model is pretrained on ImageNet. The model is designed with five flows: a data augmentation flow, a data preprocessing flow, a general feature extraction flow, a domain feature extraction flow and a label prediction flow (Figure 2). The general feature extraction flow adopts the structures and parameters of the base network; this part does not need to be trained, so there is no backward propagation. In contrast, the domain feature extraction flow and the label prediction flow are specific to the target domain (here, spiders); the parameters of these layers need to be retrained, so both forward and backward propagation occur, and the parameters are adjusted through backpropagation. The domain feature extraction flow and the label prediction flow are redesigned relative to the traditional flows. To retain the contributions of subtle features in the domain feature extraction flow, the ReLU activation function is modified to an ELU function. In the label prediction flow, the feature output of the convolutional layers is aggregated by global average pooling, and a dropout layer is added before the fully connected layer to prevent overfitting; the dropout rate is set to 0.2, that is, 20% of neurons are randomly discarded [33], [36].
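A minimal Keras sketch of the frozen/trainable split and the label prediction flow described above, assuming the tf.keras Xception application. The two-class head and the `weights=None` setting (which skips the ImageNet download) are illustrative; the paper's setup would use `weights="imagenet"`, and its ELU substitution inside the base network is omitted here for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(299, 299, 3), n_frozen=96, n_classes=2):
    base = tf.keras.applications.Xception(
        weights=None, include_top=False, input_shape=input_shape)
    # General feature extraction flow: freeze the first n_frozen layers,
    # so no backward propagation occurs through them.
    for layer in base.layers[:n_frozen]:
        layer.trainable = False
    x = base.output
    # Label prediction flow: global average pooling over the convolutional
    # feature maps, then dropout (rate 0.2, as in the paper) before the
    # fully connected softmax layer.
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(base.input, out)

model = build_model()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Only the unfrozen base layers and the new head are updated during training, which is what limits the number of trainable parameters on a small sample set.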
1) BASE MODEL SELECTION
Based on our sample size and data characteristics, ResNet-152 and Xception are selected as the candidate base models for the transfer learning network, and the final base model is chosen based on the experimental results [37], [38]. After several groups of experiments, Xception is ultimately selected as the base model for transfer learning; the model architecture is shown in Figure 3.

2) DATA RESOLUTION SELECTION
The resolution of the input samples has a considerable influence on the prediction accuracy of a model. The default input resolution of ResNet-152 is 224 × 224, and that of Xception is 299 × 299; both defaults are low. Thus, to study the impact of resolution on model accuracy, this study designs three groups of experiments with image resolutions of 299 × 299, 800 × 600 and 1600 × 1200.

3) DATA AUGMENTATION SELECTION
This study uses five augmentation methods, namely, random flipping, random rotation, random cropping, random scaling and random contrast correction, and the related parameters are randomly drawn from the ranges shown in Table 3. In addition, two sets of experiments are performed to assess random cropping: one with random cropping and one without it [39].
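The augmentation methods listed above can be sketched in NumPy for images stored as (H, W, 3) arrays; the parameter ranges below are illustrative, not the values from Table 3:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative fixed seed

def random_flip(img):
    # Horizontal flip with probability 0.5.
    return img[:, ::-1] if rng.random() < 0.5 else img

def random_rotation(img):
    # Rotate by a random multiple of 90 degrees.
    return np.rot90(img, k=int(rng.integers(0, 4)))

def random_contrast(img, low=0.8, high=1.2):
    # Scale pixel deviations from the per-channel mean.
    factor = rng.uniform(low, high)
    mean = img.mean(axis=(0, 1), keepdims=True)
    return np.clip((img - mean) * factor + mean, 0, 255)

def random_crop(img, crop_frac=0.9):
    # The paper ultimately drops this step: cropping can remove the
    # subtle features that matter for sex discrimination.
    h, w = img.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    return img[top:top + ch, left:left + cw]

img = rng.uniform(0, 255, size=(600, 800, 3))
aug = random_contrast(random_rotation(random_flip(img)))
```

In practice such transforms are applied on the fly during training so that each epoch sees slightly different versions of the same samples.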

4) SELECTION OF FROZEN LAYERS IN THE GENERAL FEATURE EXTRACTION FLOW
The parameters of the general feature extraction flow are obtained by ImageNet training, and they reflect the general rule of the network model for sample feature extraction.
The Xception network has 134 layers grouped into 14 blocks [38]; according to the characteristics of this network, four sets of experiments are designed to compare the effect of the depth of the general feature extraction flow on the accuracy of the model. In these experiments, the first 66, 86, 96 and 126 network layers are frozen.

5) ACTIVATION FUNCTION OPTIMIZATION
The activation function affects the output of each layer and has a direct impact on the quality of the final prediction results. Some reports have indicated that the ELU and ReLU activation functions perform well in various machine learning domains [38], but ELU provides a wider excitation boundary than ReLU. The Xception network model uses the ReLU activation function, and in this paper, four sets of experiments involving the activation function in the domain feature extraction flow are performed: 1) ReLU is used as the default, 2) only the Block 14 activation function is modified to an ELU function, 3) the Block 13 and Block 14 activation functions are modified to ELU functions, and 4) the activation function in Block 14 is removed [40].
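The "wider excitation boundary" of ELU can be seen directly from the two definitions: ReLU zeroes every negative input, whereas ELU maps negative inputs onto a smooth curve in (-1, 0), so small negative responses still contribute downstream. A minimal NumPy comparison:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x) -- all negative activations are discarded.
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # ELU: x for x > 0, alpha * (exp(x) - 1) otherwise,
    # so negative inputs are compressed into (-alpha, 0) rather than zeroed.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # negative entries become exactly 0
print(elu(x))   # negative entries map to about -0.865 and -0.393
```

This is the motivation for swapping ReLU for ELU in the domain feature extraction flow: subtle, weakly negative feature responses are retained instead of being clipped away.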

D. GENERALIZED PRACTICAL TEST EXPERIMENTS
To verify the network model and assess sexual dimorphism in the nongenital characteristics of spiders, two sets of experiments are performed. With the optimally trained network models, prediction experiments based on generalization test set A and generalization test set B are performed.
E. EVALUATION METRICS
V1 reflects the highest accuracy that the model achieves for the validation set during the whole training process, and V2 reflects the highest accuracy achieved for the training set; the difference between V1 and V2 reflects the degree of fitting of the model. V3 and V4 reflect the stability of the model prediction accuracy, V5 and V6 reflect the cross-entropy loss of the model, and V7 and V8 provide intuitive feedback regarding the fit of the model [41].

III. EXPERIMENTAL RESULTS
A. BASE MODEL SELECTION
In the base model selection experiments, 200 training epochs were used for the two groups of experiments, and the evaluation metrics V1-V8 are shown in Table 4. Additionally, the accuracy and loss value curves are plotted in Figure 5. The data show that the values of V1, V2, V3, V4, V5, V7, and V8 for ResNet-152 are generally higher than those for Xception; only V6 is lower for ResNet-152 than for Xception. Overall, the accuracy of ResNet-152 is higher, but ResNet-152 appears to overfit the data: based on Figure 5, its loss curve for the validation set is not smooth enough and tends to rise. Thus, based on previous deep network model research [15], the Xception network is used as the base neural network in this study.

B. DATA RESOLUTION SELECTION
With a single Nvidia Tesla V100 GPU with 32 GB of video memory, the 1600 × 1200 resolution directly led to memory overflow in the experiments, and the learning model could not be trained; therefore, the corresponding group of experiments was abandoned. The remaining two groups of experiments were performed, and the V1-V8 values were calculated; the results are shown in Table 5. The accuracy and loss value curves are plotted in Figure 6. The figures show that the training duration and the number of fitted parameters at a resolution of 800 × 600 were greater than those at a resolution of 299 × 299; however, the fine features of the original images were much better preserved, thereby improving the accuracy of the model. In terms of V3, the model reaches 0.9458 for images with an 800 × 600 resolution but only 0.8515 for a 299 × 299 resolution. Thus, the prediction accuracy of the former is improved by 11.07%, and the model that uses images with an 800 × 600 resolution has obvious advantages, especially considering the results in Figure 6.
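The 11.07% figure above is the relative improvement of V3 at 800 × 600 (0.9458) over 299 × 299 (0.8515), not the absolute difference in percentage points:

```python
# Relative improvement of V3 between the two resolution settings.
v3_hi, v3_lo = 0.9458, 0.8515
improvement = (v3_hi - v3_lo) / v3_lo * 100
print(f"{improvement:.2f}%")  # 11.07%
```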

C. DATA AUGMENTATION SELECTION
In this case, the V1-V8 values for the two groups of experiments were obtained; the experimental results are shown in Table 6, and the accuracy and loss value curves are shown in Figure 7. With no random cropping, the prediction accuracy for the validation set reaches 0.96 after 104 epochs of training, whereas with random cropping, the accuracy reaches only 0.93 after 200 epochs. Based on both Table 6 and Figure 7, the prediction accuracy of the model is greatly reduced after adding random cropping, possibly because spiders are small, and some subtle features that play an important role in sex discrimination may be cropped out; thus, random cropping leads to the loss of these subtle features. Consequently, the random cropping augmentation method is not considered further.

D. SELECTION OF FROZEN LAYERS IN THE GENERAL FEATURE EXTRACTION FLOW
The four groups of models in the experimental design were trained for 200 epochs, and the V1-V8 values were obtained, as shown in Table 7. The accuracy and loss value curves are plotted in Figure 8. The experimental results indicate that the accuracy for the training set is generally higher than that for the validation set after 125 epochs of training when the number of frozen layers is 66; additionally, an overfitting phenomenon appears. The prediction accuracy is lower overall and the cross-entropy loss slowly decreases when the number of frozen layers is set to 126 layers. The results for 86 and 96 frozen layers are similar, but the V1 value in the latter case is slightly higher and V7 is smaller; thus, model performance is best when 96 layers are frozen. Subsequently, 96 frozen layers were used in all other experiments involving the spider sex recognition model.

E. ACTIVATION FUNCTION OPTIMIZATION
The four groups of experiments involved training for 200 epochs, and the V1-V8 values were obtained, as shown in Table 8; the accuracy and loss value curves are shown in Figure 9. Good prediction accuracy was achieved in all four groups of experiments. In the default ReLU experiment, the V1 value reached 0.9786, and the V3 value reached 0.9736. In the experiment in which the activation function of Block 14 was modified to an ELU function, the V1 value reached 0.9802, and the V3 value reached 0.9739. In the experiment in which the activation functions of Block 13 and Block 14 were modified to ELU functions, the V1 value reached 0.9618, and the V3 value reached 0.9521. In the experiment in which the activation function of Block 14 was removed, the V1 value reached 0.9791, and the V3 value reached 0.9747. The accuracies achieved for the training set and the validation set in these four experiments were then compared. In the experiment with the default ReLU activation function, the accuracy for the training set was higher than that for the validation set after 200 epochs of training, and overfitting occurred. In contrast, in the experiment combining ReLU with an ELU function in Block 14, the accuracy for the training set was similar to that for the validation set after 200 epochs; an overfitting state was not reached, indicating that the model could continue learning beyond 200 epochs, which may further increase model accuracy. Based on a comparison of the second and third experiments, the second experiment displayed a faster gradient decrease and higher accuracy than the third. Finally, based on a comparison of the second and fourth experiments, the V1 value of the second was higher than that of the fourth. In summary, the second group of experiments, in which the ReLU function was changed to an ELU function in Block 14 of Xception, produced the best experimental results.

F. GENERALIZED PRACTICAL TEST EXPERIMENTS
After the experiment and optimization process, the final number of parameters in the deep neural network was 20,865,578, of which 10,020,434 needed to be learned during training. The model structure is shown in Table 9. According to the results, after 179 epochs, the prediction accuracy reaches a maximum value of 0.9802; the corresponding model and parameters are used in predictions based on generalization test set A and generalization test set B. For test set A, 61 samples were incorrectly predicted, and the prediction accuracy reached 92.38%. For test set B, 19 samples were incorrectly predicted, and the prediction accuracy reached 94.21%. Based on the results of these two generalization experiments, machine learning can be effectively used in spider sex image recognition tasks, and the existence of sexual dimorphism in nongenital features is tentatively verified.

IV. DISCUSSION
In this study, 42 Chinese Pseudopoda species belonging to the Sparassidae family were considered in spider sex recognition based on convolutional neural networks, and through model tests, data augmentation and model optimization, the prediction accuracy for the validation set reached 98.02%; additionally, the generalization accuracy for independent samples reached 92.38%. Thus, the proposed method can replace manual identification and provide a reference for future spider species image recognition problems.
Deep learning models are generally classified into several categories, and popular models include VGG, ResNet, GoogLeNet, and Inception, among others. According to Alfredo Canziani et al., ResNet-152, Inception-V3, and Inception-V4 are highly suitable as base models for image recognition with transfer learning. Xception is based on the improved Inception-V3 network, and it outperforms Inception-V3 and Inception-V4 in some aspects. Our results show that the Xception model is slightly better than ResNet-152 in spider sex recognition from images, which may be related to the characteristics of the samples. The results of this study confirm that deep learning problems with small samples can be effectively solved using transfer learning. For small-sample training sets, transfer learning can take advantage of the existing knowledge obtained through training with large general datasets, and the unique abilities of the model can be applied in new domains. For example, this approach was applied for sex recognition from images of spiders of the genus Pseudopoda in this study and could be applied in future studies of spider species identification from images.
Data augmentation randomly transforms existing samples according to certain rules; this process is analogous to the randomness of taking pictures in a natural environment and increases the number of samples in the study set. This approach can prevent overfitting caused by a small training set and thus improve the generalization ability of the network. Common data augmentation methods include random flipping, random rotation, random cropping, random scaling, boundary enhancement, random deletion, random blending, and random contrast correction. It has been confirmed that data augmentation directly affects the learning ability and training accuracy of models. In this study, in addition to four common data augmentation methods (random flipping, random rotation, random scaling and random contrast correction), we focus on the effects of random cropping and of varying the resolution of the input images on model performance in spider sex recognition, which requires a fine scale. In addition to the genitalia and other features, the dorsal pattern, color, morphology, etc., of spiders are also related to sex, and random cropping leads to the loss of these important features.
Data preprocessing techniques also influence whether model learning can succeed, as reflected by learning ability and training accuracy. For example, in this study, the photos were uniformly converted to JPEG format, which greatly reduces the file size of the images and thus the hardware demands of the network model; within the limits of the available hardware, as high a resolution as possible should be used so that the subtle features of a sample are retained in addition to the general outline or textural features of an object, thus improving the prediction accuracy of the model.
In this study, the number of frozen layers in the general feature extraction flow was investigated experimentally, and the results show that this number should be neither too large nor too small; the appropriate number of frozen layers must be found experimentally for the specific target samples. The activation function of the domain feature extraction flow was redesigned by combining ReLU and ELU activations, and a dropout layer was added to the label prediction flow to prevent model overfitting. All of these optimizations can have a great impact on the prediction ability of the model. However, in some experiments, the accuracy achieved for the validation set was higher than that for the training set, possibly due to the randomness of sample segmentation, small sample sizes and the dropout of some neurons. Moreover, removing the activation function from Block 14 did not improve the accuracy of the model, potentially due to the use of transfer learning and a small sample set.
In addition, the manual identification of spider sex relies mainly on the genital structures of mature spiders (male palps or female epigynum). However, we found that the trained model was able to identify the sex of spiders from pictures in which manual identification was not possible, with a generalization accuracy as high as 94.21%. This result suggests that nongenital features, such as the body color and spot patterns of Pseudopoda spiders, may be dimorphic at the genus level; this phenomenon is currently unrecognizable by humans. Although male and female dimorphism in body morphology is typical in some specific taxa, such as Nephila species in the family Araneidae (the female is almost five times larger than the male), it has rarely been reported in nocturnal spiders such as Sparassidae and Lycosidae. However, it must be noted that the sample size in this study was relatively small, and generalization test set B contained only 328 images of 12 species; therefore, this conclusion needs to be further confirmed with a larger sample. In addition, how the computer specifically uses body color and dorsal pattern for sex recognition needs to be further investigated using deep neural network interpretation methods, and this result is only a preliminary conclusion.
In conclusion, this study proposes a deep learning-transfer learning model based on Xception, and the trained model can be used to solve sex recognition problems for spiders after optimization. This approach provides a reference for future studies of spider species recognition from images.