Zero-Shot Classification Based on Word Vector Enhancement and Distance Metric Learning

Zero-shot classification has attracted wide attention in recent years because it requires no labeled samples of new categories, which reduces annotation cost in applications. This paper presents a zero-shot method for image classification based on word vector enhancement and distance metric learning. Specifically, a convolutional neural network (CNN) is employed to extract image feature vectors that have the same dimension as the semantic feature vectors. Then, word vectors are obtained by unsupervised learning on the Wikipedia corpus with the skip-gram model. The analysis dictionary learning model is improved to reduce redundant information in the word vectors. The resulting sparse vectors are used as semantic features, and a distance metric learning method is employed to measure the distance between image features and semantic features. Finally, classification is performed by a nearest-neighbor classifier. The effectiveness of the proposed algorithm is validated on the AwA and CUB data sets. Experimental results demonstrate that the proposed method performs well in terms of both accuracy and robustness.


I. INTRODUCTION
Most existing object classification methods fall within the scope of supervised learning. Accurately identifying a given type of data requires a large amount of labeled data to train the related models [1], [2]. However, labels for some categories are difficult to obtain or require manual labeling of large amounts of data. Moreover, the number of object types in the real world keeps growing, which requires the recognition system to continuously add new data and be rebuilt. According to statistics, there are currently about 30,000 types of human-identifiable objects [3]. Labeling such a huge amount of data is arduous. Therefore, there is an urgent need for a technology that can still identify data of the target category even if visual annotation data for that category is completely missing [4]. Driven by this practical requirement and the continuous development of technology, zero-shot classification technology came into being.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.
Zero-shot learning (ZSL) in visual classification aims to recognize novel categories for which few or even no training samples are available [5]. Zero-shot learning can therefore be employed as an effective way to handle missing class labels [6]. For the current zero-shot classification task, the key to success is to learn the cross-modal mapping between the visual and semantic modalities [7]-[9]. In the early stages, most zero-shot classification methods were variants of direct attribute prediction (DAP) [10], [11]. To alleviate the unreliability of attribute predictions, Jayaraman et al. proposed a novel random forest approach [12], which leverages statistics about each attribute's error tendencies in order to select discriminative and predictable decision nodes, thereby obtaining a more robust discriminative model for unseen classes. Models built on this idea are highly interpretable. However, their shortcomings are also evident.
For example, human labeling of attributes is not always reliable, and mislabeled attributes have a large negative impact on the performance of such methods. In addition, correlation between attributes generates redundant information, which also seriously affects the performance of the model [13].
To better resolve the problems of the DAP method, methods based on embedding models [14], [15] have been proposed. The core idea of the embedding approach is to map all visual features and category labels into a common space simultaneously, and then perform zero-shot classification based on a similarity measure [16], [17]. Li et al. [18], [19] learn a discriminative category description feature from the representation of the visual distribution. Xian et al. [20] constructed a nonlinear mapping model with piecewise linear properties, trained with a ranking-based loss function. Akata et al. [21] used a bilinear model to establish the compatibility between visual samples and category attribute descriptions, and used the 0-1 loss to learn discriminative information between different categories. Yu et al. presented a transductive classification method [22] for zero-shot images. That work proposed a structured joint embedding (SJE) model, which maps image features and semantic features into a common feature space through a mapping matrix so that the sum of the inner products of the two features is maximized. On this basis, Xian et al. proposed a latent embedding model (LatEm) and achieved good results [23]. However, these methods are easily affected by the hubness phenomenon [24]: the distances between many features are quite close, which degrades performance when a nearest neighbor classifier is used.
Due to the limitations of attribute features, this paper uses word vectors as semantic features. A semantic word vector is a high-dimensional vector representation of an entity word, obtained through unsupervised learning on large-scale text corpora with a natural language model. Each category name has a unique corresponding semantic word vector, which provides distinct distance relationships between categories. However, these semantic word vectors contain some redundant information, which weakens the expression of the distance structure between categories. To reduce this redundancy, this paper uses analysis dictionary learning (ADL) to sparsely encode the word vectors [25]. To better adapt the overall model to the zero-shot classification problem, this paper improves the basic ADL algorithm and proposes the LC-ADL method, which enhances operational efficiency and has a positive impact on classification accuracy.
When image features and semantic features are mapped to the same feature space, a well-chosen distance measure can accurately reflect the correspondence between them, which helps to improve classification accuracy. Traditional ZSL methods usually rely on the Euclidean distance. However, images and semantics belong to different modalities; if all dimensions of the sample features are still given equal importance, the relationship between samples cannot be described effectively. This paper therefore uses distance metric learning (DML) to measure the distance between the image feature vector and the semantic feature vector, and finally uses a nearest neighbor classifier that classifies according to this distance. Specifically, an improved Large Margin Nearest Neighbor (LMNN) algorithm is used. LMNN shows advantages in this respect and can alleviate the hubness phenomenon to a certain extent. Experimental results show that introducing DML with LMNN into zero-shot learning achieves satisfactory results and improves the performance of image classification.
The main contributions of this paper are as follows:
1) The analysis dictionary learning method is applied to the sparse representation of word vectors to alleviate redundant information. The objective function of the ADL model is improved, and an error term is added to improve the discriminative ability of the model. An LC-ADL model that incorporates a synthetic linear classifier is proposed. It further reduces noise and errors from word vectors.
2) In the distance measurement module, the LMNN algorithm of the DML family is introduced. To avoid falling into a local optimum when using gradient descent, the loss function is reconstructed, which effectively reduces the error rate and the computational complexity and gives the method better applicability.
3) A zero-shot image classification method based on word vector enhancement and distance metric learning is proposed, which achieves better accuracy and robustness than several mainstream ZSL methods.

II. RELATED WORK
In this section, we introduce the selection of the basic model. We summarize the notations and variables used in this paper in Table 1.

A. ANALYSIS DICTIONARY LEARNING
Word vectors are obtained by unsupervised learning on large-scale text corpora, and each dimension contains some redundant information. This affects the accuracy of the word vectors and the final classification result. Therefore, this paper uses the ADL method to enhance the word vectors by sparsely representing the word vector library initially extracted from the corpus. On the one hand, redundancy between the dimensions of these vectors can be compressed, and the information loss caused by compression may even be beneficial. On the other hand, more compact vectors are more efficient to compute with. Dictionary learning can be regarded as a method of data dimensionality reduction, and it is mainly divided into two categories: synthesis dictionary learning (SDL) and analysis dictionary learning (ADL). The idea of SDL is that the input features can be reconstructed from the dictionary and the corresponding sparse coefficients. ADL can be seen as the dual of SDL: it applies a dictionary to known input features, and the sparse coefficients of the input features are obtained through this transformation. The advantage of ADL is that its dictionary is learned from the relevant data, so it adapts better to the characteristics of the data. It is highly interpretable and represents the encoding process more intuitively. The schematic diagram of ADL is shown in Figure 1.
In addition, the ADL algorithm processes data with comparatively high efficiency. Given an input feature $y \in \mathbb{R}^{n}$, the first goal of the ADL algorithm is to learn an analysis dictionary $\Omega \in \mathbb{R}^{m \times n}$ under the constraint $\|a\|_0 \le T$, such that $\|a - \Omega y\|_F^2$ attains its minimum. The sparsity of the code $a$ is enforced by the parameter $T$ and the $\ell_0$ norm. Therefore, the analysis dictionary can be obtained by solving the following objective function:

$$\min_{\Omega, A} \; \|A - \Omega Y\|_F^2 \quad \text{s.t.} \quad \Omega \in \Gamma, \; \|a_i\|_0 \le T, \; i = 1, \dots, n \tag{1}$$

where $Y$ is the matrix whose columns are the input features and $A = [a_1, a_2, \cdots, a_n] \in \mathbb{R}^{m \times n}$ is the sparse coding matrix. To obtain a normalized, well-defined solution, $\Omega$ is restricted to the constraint set $\Gamma$: each row of $\Omega$ has unit norm, and $\Omega$ has the smallest Frobenius norm among such matrices. The coding coefficient $a$ is computed by a matrix multiplication followed by a thresholding function, which is highly efficient [26], [27].
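The coding step described above (a matrix multiplication followed by thresholding) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `hard_threshold` and `analysis_sparse_code`, the use of hard thresholding to the T largest-magnitude entries, and the random dictionary are all assumptions made for the example.

```python
import numpy as np

def hard_threshold(v, T):
    """Keep the T largest-magnitude entries of v and zero the rest."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    if T <= 0:
        return out
    keep = np.argsort(np.abs(v))[-T:]   # indices of the T largest magnitudes
    out[keep] = v[keep]
    return out

def analysis_sparse_code(omega, y, T):
    """Analysis-side coding: apply the dictionary, then threshold."""
    return hard_threshold(omega @ y, T)

# Hypothetical data: a random dictionary with unit-norm rows and one input feature.
rng = np.random.default_rng(0)
omega = rng.standard_normal((6, 4))
omega /= np.linalg.norm(omega, axis=1, keepdims=True)
y = rng.standard_normal(4)
a = analysis_sparse_code(omega, y, T=2)
print(a)
```

No iterative solver is needed at coding time, which is the source of the efficiency noted above.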

B. DISTANCE METRIC LEARNING
Traditional measurement methods often use the Euclidean distance or the cosine distance. However, these are not suitable when the components of the vector differ in importance. After the image feature vector and the semantic feature vector are obtained, a more accurate measure is needed to improve classification. Therefore, this paper adopts the DML method. Distance metric learning was proposed by Xing et al. [28]. Pan et al. proposed an objective function based on the cosine distance to learn the conversion from semantic to visual features [29]. Duan et al. proposed a deep adversarial metric learning (DAML) framework [30] that generates synthetic hard negatives from original negative samples. The framework is widely applicable to existing supervised deep metric learning algorithms. To take advantage of the nonlinear structure of data points, Hu et al. seek a variety of nonlinear transformations using a neural network architecture [31] and extend MvML to a multi-view deep metric learning (MvDML) method.
The idea is that for two feature vectors $a$ and $b$, the learned distance metric takes the following form:

$$D_M(a, b) = \sqrt{(a - b)^{T} M (a - b)} \tag{2}$$

To ensure that $D_M(a, b)$ is non-negative and satisfies the triangle inequality, the matrix $M$ must be positive semidefinite. When $M = E$ (the identity matrix), $D_M(a, b)$ is the Euclidean distance; when $M$ is a diagonal matrix, the diagonal elements can be regarded as weights assigned to each dimension; when $M$ is a full matrix, the learned distance metric is a Mahalanobis distance.
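The three cases above can be checked numerically with a short sketch. The function name `metric_distance` and the example vectors are illustrative assumptions, not part of the original method.

```python
import numpy as np

def metric_distance(a, b, M):
    """D_M(a, b) = sqrt((a - b)^T M (a - b)) for a positive semidefinite M."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(max(d @ M @ d, 0.0)))  # clamp tiny negative round-off

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print(metric_distance(a, b, np.eye(2)))            # identity matrix: Euclidean distance, 5.0
print(metric_distance(a, b, np.diag([4.0, 0.0])))  # diagonal matrix: second dimension ignored, 6.0

L = np.array([[1.0, 1.0], [0.0, 1.0]])             # any L gives a PSD matrix M = L^T L
print(metric_distance(a, b, L.T @ L))              # full matrix: equals ||L(a - b)||
```

Writing a full matrix as M = L^T L makes the positive semidefinite requirement hold by construction, which is the parameterization LMNN uses in the next subsection.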

C. LARGE MARGIN NEAREST NEIGHBOR ALGORITHM
The goal of DML is to find a metric matrix that minimizes the distance between pairs of similar samples, subject to the constraint that the sum of the distances between pairs of dissimilar samples exceeds a fixed value. This paper uses the Large Margin Nearest Neighbor (LMNN) algorithm to address this problem. The core of LMNN is to replace the Euclidean distance in the traditional K-nearest-neighbor method with the Mahalanobis distance. LMNN only penalizes points that have a different label from the target sample but lie close to it, and points that have the same label as the target sample but lie far from it. The K nearest neighbors of each sample in the training set must be known in advance as prior knowledge. The algorithm solves for the optimal Mahalanobis distance matrix M by semidefinite programming. The optimization maximizes the margin between different classes, which improves classification accuracy compared with the standard KNN algorithm. However, as the data scale increases, the size of the semidefinite program in LMNN also grows considerably, which raises the cost of each iteration and the overall computational complexity of the algorithm.
To improve the efficiency of the LMNN algorithm, Shen et al. used gradient descent to solve an unconstrained form of the objective function [32]. Weinberger and Saul [33] incorporated slack variables into the objective function, thereby reducing the algorithm complexity. To further improve the efficiency of LMNN, they also proposed a method of spatial mapping using an ellipsoid tree structure. Ying et al. exploit the manifold structure of the symmetric positive-definite matrix group and derive an intrinsic steepest descent method [34], which guarantees that the metric matrix remains strictly symmetric positive-definite at each iteration. Peng et al. address nonlinear metric learning by constructing smooth nonlinear metrics based on data [35]. The partition coefficients obtained from a partition of unity are smooth, so the metric at any point on the manifold can be defined directly. Huo et al. proposed a CML method that directly maximizes AUC [36]. The method is formulated as a log-determinant regularized semidefinite optimization problem. Li et al. used a multiple kernel representation to describe nonlinear metrics, and projected the data into a high-dimensional space where the data can be well represented by linear metric learning [37]. They designed an intrinsic steepest descent algorithm to learn the positive definite metric matrix.
The effect of the LMNN algorithm is illustrated in Figure 2. Figure 2(a) shows the original data space, and Figure 2(b) shows the data space after mapping by the LMNN algorithm. After LMNN is applied, the data of the same category become more compact, which benefits the accuracy of vector mapping.
The image feature set is $X = [x_1, x_2, \cdots, x_s]$ and the semantic feature set is $Y = [y_1, y_2, \cdots, y_n]$, where $n$ is the total number of categories. The basic idea of the LMNN algorithm is to set a suitable margin, learn a mapping matrix $L$ from the training data, and then map the original data as $x_i \to L x_i$. Cross-validation is used on the training set for each point $x_i$ in $X$. Suppose that $x_l$, among the K nearest neighbors of $x_i$, has a different class label but lies within the large margin, and that $x_j$ has the same class label as $x_i$. The large-margin condition can then be written as:

$$\|L(x_i - x_l)\|^2 \le \|L(x_i - x_j)\|^2 + 1 \tag{3}$$

where $L$ is the distance metric matrix. This condition is used to define the non-equivalence constraint:

$$\varepsilon_{push}(L) = \sum_{i,\, j \rightsquigarrow i} \sum_{l} (1 - y_{il}) \left[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \right]_+ \tag{4}$$

where $K_p$ is the prior knowledge that determines the neighbor relation $j \rightsquigarrow i$; the distance between the points $x_i$ and $x_j$ after the mapping is $D_L(x_i, x_j) = \|L(x_i - x_j)\|^2$; $j \rightsquigarrow i$ indicates that the training sample $x_i$ is a K-nearest neighbor of the test sample $x_j$; $y_{il} = 1$ when the semantic vectors satisfy $y_i = y_l$, and $y_{il} = 0$ when $y_i \ne y_l$; and $[Z]_+ = \max(Z, 0)$. $\varepsilon_{push}(L)$ only affects training samples whose category differs from that of the test sample but which lie within the maximum distance. It has the visual effect of 'pushing'. Similarly, the equivalence constraint is:

$$\varepsilon_{pull}(L) = \sum_{j \rightsquigarrow i} \|L(x_i - x_j)\|^2 \tag{5}$$

$\varepsilon_{pull}(L)$ only affects training samples that share the category of the test sample but whose distance exceeds the maximum. It has the visual effect of 'pulling'. Finally, formulas (4) and (5) are combined to construct the loss function:

$$\varepsilon(L) = (1 - \mu)\, \varepsilon_{pull}(L) + \mu\, \varepsilon_{push}(L) \tag{6}$$

where $\mu$ is the weight coefficient, set to 0.5 in this paper. When the mapping matrix $L$ is computed, only the points that cause classification errors are penalized, which reduces the computational cost of obtaining the mapping and effectively reduces the error rate.
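To make the 'pull' and 'push' terms concrete, the following sketch evaluates them on a toy data set. It assumes the standard LMNN form of the two terms with a unit margin; the function name `lmnn_loss`, the `targets` argument (precomputed same-class neighbors), and the three-point example are hypothetical.

```python
import numpy as np

def lmnn_loss(L, X, labels, targets, mu=0.5):
    """Return (pull, push, total) for the rows of X under the map x -> L x.

    targets[i] lists the indices j of the same-class target neighbors of
    sample i, chosen beforehand (for example by Euclidean K-nearest neighbors).
    """
    Z = X @ L.T                                    # mapped samples
    sq = lambda i, j: float(np.sum((Z[i] - Z[j]) ** 2))
    pull = push = 0.0
    for i, neigh in enumerate(targets):
        for j in neigh:
            d_ij = sq(i, j)
            pull += d_ij                           # pull target neighbors closer
            for l in range(len(X)):
                if labels[l] != labels[i]:         # (1 - y_il) = 1 only for other classes
                    push += max(0.0, 1.0 + d_ij - sq(i, l))  # hinge on margin violations
    return pull, push, (1.0 - mu) * pull + mu * push

# Toy example: two samples of class 0 and one intruding sample of class 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 0.0]])
labels = [0, 0, 1]
targets = [[1], [0], []]
print(lmnn_loss(np.eye(2), X, labels, targets))    # (2.0, 1.75, 1.875)
```

Only the second sample contributes to the push term here, because the class-1 point lies inside its margin; samples whose neighborhoods are already clean add nothing to the hinge sum.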

III. MODEL IMPROVEMENT

A. LC-ADL MODEL
This paper uses the ADL method to sparsify the word vector library, which enhances the word vectors by retaining as much useful information as possible. However, the discriminative capacity of basic ADL is limited and needs to be improved. We therefore add a classification error term based on a synthetic linear classifier to the objective function of the basic ADL model. The synthetic linear classifier $I$ uses the dual form of the universal linear classifier: $A \cong IC$. $I$ establishes a correspondence between the coding coefficients and the category labels of the data. The classification error term based on the synthetic linear classifier is:

$$\|A - IC\|_F^2 \tag{7}$$

Therefore, the objective function of the LC-ADL model proposed in this section is:

$$\min_{\Omega, A, I} \; \|A - \Omega Y\|_F^2 + \alpha \|A - IC\|_F^2 \quad \text{s.t.} \quad \Omega \in \Gamma, \; \|a_i\|_0 \le T \tag{8}$$

where $Y$ is the matrix of input features from Section II-A and $A = [a_1, a_2, \cdots, a_N] \in \mathbb{R}^{m \times N}$ is the sparse coding matrix. $C = [c_1, c_2, \cdots, c_N] \in \mathbb{R}^{K \times N}$ is the word vector matrix extracted from the Word2Vec model that has been trained on the Wikipedia corpus. $K$ is the total number of categories of training samples. The parameter $\alpha$ controls the weight of the classification error term. $\Gamma$ is the constraint set of the analysis dictionary $\Omega$; every matrix in $\Gamma$ has rows of unit norm. In addition, to ensure that the results are reproducible, the matrices in $\Gamma$ are also required to have minimal Frobenius norm. $A$, $\Omega$, and $I$ are obtained by solving the optimization problem (8). We adopt an alternating iterative algorithm to solve the LC-ADL model. The solution of formula (8) is computed by alternating between the following two steps:

1) FIX A, UPDATE Ω AND I
According to the constraint set $\Gamma$ defined above, the sub-optimization problem of the analysis dictionary can be described as follows:

$$\min_{\Omega} \; \|A - \Omega Y\|_F^2 + \beta \|\Omega\|_F^2 \quad \text{s.t.} \quad \Omega \in \Gamma \tag{9}$$

The penalty term $\|\Omega\|_F^2$ in equation (9) is included to obtain a stable solution, and $\beta$ is a scalar parameter. After the optimal solution of formula (9) is obtained, each row of $\Omega^*$ must be renormalized to unit norm in order to avoid trivial solutions. Since the term $\|A - IC\|_F^2$ has no bearing on the subproblem for $\Omega$, it is omitted in this step. Similarly, the sub-optimization problem of the classifier is:

$$\min_{I} \; \|A - IC\|_F^2 \tag{10}$$

Differentiating the objective function in formula (9) and setting its first derivative to 0 gives a closed-form solution for $\Omega$:

$$\Omega^* = A Y^{T} \left( Y Y^{T} + \beta E \right)^{-1} \tag{11}$$

Each row of $\Omega^*$ is then renormalized to unit norm to obtain the final analysis dictionary. Similarly, the closed-form solution of $I$ is:

$$I^* = A C^{T} \left( C C^{T} + \gamma E \right)^{-1} \tag{12}$$

where γ = 10e-6 ensures that the inverse involving $C C^{T}$ exists, and $E$ is the identity matrix of the corresponding size.

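2) FIX Ω AND I, AND SOLVE A
The solution of the coding coefficient matrix $A$ can be obtained from formula (8). With $\Omega$ and $I$ fixed, completing the square in formula (8) shows that the objective equals $(1 + \alpha) \left\| A - \frac{\Omega Y + \alpha I C}{1 + \alpha} \right\|_F^2$ plus a constant, so under the sparsity constraint the solution is:

$$A^* = \mathcal{H}_T\!\left( \frac{\Omega Y + \alpha I C}{1 + \alpha} \right) \tag{13}$$

where $\mathcal{H}_T(\cdot)$ keeps the $T$ largest-magnitude entries of each column and sets the rest to zero. The result obtained through this process is the best sparse coefficient matrix $A^*$.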
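The two alternating steps can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: `omega` stands for the analysis dictionary, `clf` for the synthetic linear classifier I, and `Y` for the input feature matrix; the A-update uses column-wise hard thresholding of the weighted average of the two targets, which is an assumption derived from objective (8). The default values of `alpha`, `beta`, `gamma`, and `iters` are placeholders.

```python
import numpy as np

def hard_threshold_cols(M, T):
    """Keep the T largest-magnitude entries in every column of M."""
    out = np.zeros_like(M)
    idx = np.argsort(np.abs(M), axis=0)[-T:, :]   # top-T row indices per column
    cols = np.arange(M.shape[1])
    out[idx, cols] = M[idx, cols]
    return out

def lc_adl(Y, C, m, T, alpha=1.0, beta=1e-3, gamma=1e-6, iters=20, seed=0):
    """Alternating minimization sketch for the LC-ADL objective."""
    rng = np.random.default_rng(seed)
    n, K = Y.shape[0], C.shape[0]
    omega = rng.standard_normal((m, n))
    omega /= np.linalg.norm(omega, axis=1, keepdims=True)
    A = hard_threshold_cols(omega @ Y, T)
    for _ in range(iters):
        # Step 1: fix A, update the dictionary and the classifier in closed form.
        omega = A @ Y.T @ np.linalg.inv(Y @ Y.T + beta * np.eye(n))
        norms = np.linalg.norm(omega, axis=1, keepdims=True)
        omega /= np.maximum(norms, 1e-12)          # renormalize rows, guard zero rows
        clf = A @ C.T @ np.linalg.inv(C @ C.T + gamma * np.eye(K))
        # Step 2: fix dictionary and classifier, update the sparse codes.
        A = hard_threshold_cols((omega @ Y + alpha * clf @ C) / (1.0 + alpha), T)
    return omega, clf, A

# Hypothetical toy data: 12 five-dimensional inputs and one-hot targets for 3 classes.
rng = np.random.default_rng(1)
Y = rng.standard_normal((5, 12))
C = np.eye(3)[:, rng.integers(0, 3, 12)]
omega, clf, A = lc_adl(Y, C, m=8, T=3)
print(A.shape, int(np.count_nonzero(A, axis=0).max()))
```

Both updates in step 1 are ridge-type closed forms, so each iteration costs a few matrix products and two small inversions.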

B. IMPROVED LMNN DISTANCE METRIC LEARNING ALGORITHM
When measuring the distance between image feature vectors and semantic feature vectors, the traditional Euclidean and cosine distances often cannot effectively describe the mapping relations between them. A well-chosen distance measure can alleviate the hubness phenomenon, which helps to improve classification accuracy. We use the improved LMNN algorithm in the metric learning module. The objective in formula (6) is non-convex in $L$, so the stochastic gradient descent (SGD) algorithm may fall into a local optimum. Different initial matrices lead to different final results, so the result is not reproducible for some problems and the applicability needs to be strengthened. By reformulating formula (6), the problem can be converted into a semidefinite program. Define the symmetric positive semidefinite matrix $Q = L^{T} L$ and use $Q$ in place of $L$. With $D_Q(x_i, x_j) = (x_i - x_j)^{T} Q (x_i - x_j)$, the loss function can be defined as follows:

$$\varepsilon(Q) = (1 - \mu) \sum_{j \rightsquigarrow i} D_Q(x_i, x_j) + \mu \sum_{i,\, j \rightsquigarrow i} \sum_{l} (1 - y_{il}) \left[ 1 + D_Q(x_i, x_j) - D_Q(x_i, x_l) \right]_+ \tag{14}$$

To facilitate the solution over a larger feasible domain, this paper converts equation (14) into a convex program. Non-negative slack variables $\xi_{ijl}$ are introduced. The number of non-zero $\xi_{ijl}$ equals the number of triples in which a sample intrudes into the maximum margin. The following positive semidefinite program is constructed:

$$\begin{aligned} \min_{Q,\, \xi} \quad & (1 - \mu) \sum_{j \rightsquigarrow i} D_Q(x_i, x_j) + \mu \sum_{i,\, j \rightsquigarrow i} \sum_{l} (1 - y_{il})\, \xi_{ijl} \\ \text{s.t.} \quad & D_Q(x_i, x_l) - D_Q(x_i, x_j) \ge 1 - \xi_{ijl}, \\ & \xi_{ijl} \ge 0, \quad Q \succeq 0 \end{aligned} \tag{15}$$

Although this positive semidefinite program has many constraints, $\xi_{ijl}$ is very sparse. The reason is that most samples are reasonably distributed, and only a relatively small number of samples intrude into the neighborhoods of other samples and incur hinge loss, so most of the values are 0. This optimization problem can be solved by the subgradient descent method.
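One common way to run subgradient descent on a loss of this kind while keeping Q positive semidefinite is to project onto the PSD cone after every step. The sketch below shows a single such step; it assumes the standard LMNN hinge form with unit margin, and the function names, step size `lr`, and toy data are hypothetical rather than taken from the paper.

```python
import numpy as np

def project_psd(Q):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    Q = (Q + Q.T) / 2.0
    w, V = np.linalg.eigh(Q)
    return (V * np.clip(w, 0.0, None)) @ V.T       # zero out negative eigenvalues

def subgradient_step(Q, X, labels, targets, mu=0.5, lr=0.01):
    """One projected subgradient step on the LMNN loss written in terms of Q."""
    G = np.zeros_like(Q)
    outer = lambda i, j: np.outer(X[i] - X[j], X[i] - X[j])
    dist = lambda i, j: float((X[i] - X[j]) @ Q @ (X[i] - X[j]))
    for i, neigh in enumerate(targets):
        for j in neigh:
            G += (1.0 - mu) * outer(i, j)          # gradient of the pull term
            for l in range(len(X)):
                violated = 1.0 + dist(i, j) - dist(i, l) > 0.0
                if labels[l] != labels[i] and violated:
                    G += mu * (outer(i, j) - outer(i, l))  # active hinge term only
    return project_psd(Q - lr * G)

# Toy example with one margin violation.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 0.0]])
Q1 = subgradient_step(np.eye(2), X, labels=[0, 0, 1], targets=[[1], [0], []])
print(Q1)
```

Because only violated triples contribute to the gradient, the sparsity of the slack variables noted above translates directly into cheap iterations.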

C. ZERO-SHOT CLASSIFICATION MODEL
The flowchart of the zero-shot image classification model based on word vector enhancement and distance metric learning is shown in Figure 3. The model structure is shown in Figure 4 and mainly includes the following four steps:
Step 1: Extract the image features of the sample. We use the VGGNet-19 convolutional neural network model. As shown in Figure 4, three-channel images of 224 × 224 are input. After the convolution and pooling operations, the output is flattened into a 4096-dimensional vector. Two fully connected layers are added at the end, which finally output image feature vectors of 200 dimensions.
Step 2: Extract word vectors of all categories. We use the skip-gram neural language model for unsupervised learning on large-scale text corpora. The dimension of the word vectors is set to 300, and each category obtains a unique corresponding word vector. As shown in Figure 4, the word vectors of all categories form a category word vector library of size 300 × N, where N is the number of categories of the sample images. The word vectors obtained at this point still contain some redundant information. After LC-ADL processing, the corresponding sparse coding matrix of the word vector library is obtained. Each code has 200 dimensions.
Step 3: Perform distance metric learning on image feature vectors and semantic feature vectors. Using the Euclidean distance between training samples, the K nearest neighbors of each data point in the training set are computed as prior knowledge, with K chosen by cross-validation, and the labels are set. This K value is denoted K p . The improved LMNN algorithm is then used to learn the mapping rules, and the mapping matrix Q is obtained. The training samples and test samples in the image features are each mapped as x → Lx, where Q = L^T L.
Step 4: Test the sample classification. Utilizing the nearest neighbor classifier, the category corresponding to the text ...
... Table 2.
In the AwA database, the image features are the same CNN features (VGGNet-19) as in [40]. Compared with AwA, the CUB dataset is more challenging. Because all of its objects are birds, the differences between categories are small, which makes it a fine-grained classification data set. In addition, the CUB dataset contains more categories, and the number of samples in each category is relatively small, which further increases its difficulty. This paper uses the text corpus provided by Wikipedia to extract 300-dimensional semantic features for the category names of the AwA and CUB datasets.

B. EVALUATION OF EXPERIMENTAL RESULTS
Since the CUB dataset contains many categories and the time cost of distance metric learning over all samples is very high, this paper randomly samples the training set: 30 samples were selected for each category in AwA, and 5 samples were selected for each category in CUB. Using the method described above, random sampling was performed 20 times for repeated experiments. The performance of the algorithm is measured by the average classification accuracy M. The images of unseen categories are input into the model, the classification accuracy within each class is computed first, and then the class-average accuracy is obtained by averaging over classes [41], [42]. The class-average accuracy is calculated as follows:

$$M = \frac{1}{k} \sum_{i=1}^{k} Acc_{x_i}$$

where $k$ is the total number of unseen classes, $x_i$ denotes the $i$-th unseen class, and $Acc_{x_i}$ is the classification accuracy within that class.
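The difference between class-average accuracy and overall accuracy matters when classes are imbalanced, as a short sketch shows. The function name `mean_class_accuracy` and the label strings are illustrative only.

```python
from collections import defaultdict

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical predictions: every test image is assigned to the majority class.
y_true = ["zebra", "zebra", "zebra", "zebra", "seal"]
y_pred = ["zebra", "zebra", "zebra", "zebra", "zebra"]
print(mean_class_accuracy(y_true, y_pred))  # 0.5, whereas overall accuracy is 0.8
```

Averaging per class prevents a large class from dominating the score, which is why this measure is preferred for unseen-class evaluation.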
To show that the combination of LC-ADL and the improved DML method improves classification performance, four groups of experiments are set up for evaluation: 1) Euclidean distance (Euc) for classification; 2) the DML method; 3) LC-ADL for vector analysis combined with Euclidean distance; 4) LC-ADL for vector analysis combined with DML. Table 3 shows the recognition rates of these four groups of methods over 20 random trials on the AwA and CUB data sets. The results in Table 3 show that the ADL-DML method significantly outperforms the Euc method. The classification accuracy increased by 20.4% on the AwA dataset and by 9.7% on the CUB dataset. On the one hand, because the semantic feature vectors contain considerable noise, the LC-ADL method used in this paper effectively reduces redundant information and makes the semantic vectors more accurate. On the other hand, the LMNN algorithm in the DML method gives lower weight to noise, which yields good noise immunity and robustness. Because CUB is a fine-grained classification data set, it has many categories with small differences between them. The LMNN algorithm used in this paper handles this problem better: it keeps elements of the same category close and pushes different categories farther apart, which alleviates the hubness phenomenon to a certain extent and gives better classification than the Euclidean distance.
The classification accuracy of the four methods on AwA over the 20 random trials is shown in Figure 5. The figure shows that our method achieves better classification. The comparison between experiments (1) and (2) demonstrates the effectiveness of the distance metric learning method. The comparison between experiments (1) and (3) shows that sparse coding of the original word vectors helps to improve accuracy. Table 4 lists the average classification accuracy of different algorithms on the AwA and CUB datasets. The comparison algorithms selected in the experiments include DeViSE [43], ESZSL [44], SJE [22], LatEm [23], and Ba et al. [45]. The reported performance of the comparison algorithms is the value provided in the corresponding article. Figure 6 shows the recognition rate fluctuations of these algorithms using word vectors as semantic features on AwA and CUB. Figure 6 shows that the word vector enhancement and distance metric learning method proposed in this paper achieves good performance when using word vectors. Referring to Table 4, on the AwA dataset the performance of this model is 4.9% and 6.6% higher than LatEm and ESZSL, respectively. On the CUB dataset, the performance of this model is 3.2% higher than Ba et al. These results demonstrate the effectiveness of the proposed combination of LC-ADL and DML.

D. ROBUSTNESS ANALYSIS OF ALGORITHMS
A typical advantage of dictionary learning methods is their robustness on noisy data sets. It is therefore necessary to compare the robustness of the LC-ADL algorithm with that of other algorithms. To verify this, this paper conducts a comparative experiment. Random Gaussian noise is added to the word vectors extracted from the Wikipedia corpus with the skip-gram method, and the variance of the Gaussian noise is gradually increased to test the robustness of the LC-ADL algorithm.
The comparison algorithms selected in this paper are the basic ADL method and the SDL algorithm. The results of the robustness comparison of the three algorithms are presented in Figure 7. The curves show that the LC-ADL algorithm is more robust than the method based on the synthesis dictionary and the basic ADL method.
In order to examine the robustness of the distance metric learning method, we designed a comparative experiment on the AwA dataset. We add Gaussian noise and Speckle noise to the training images of the AwA dataset. For Gaussian noise, the mean value is 0, and the standard deviation is altered from 0 to 0.1σ , where σ is the standard deviation of the image data. For Speckle noise, we increase its content from 0 to 10%, and observe the changes in algorithm performance.
The comparison methods selected in this paper are the KNN method and the Euclidean distance. Figure 8 shows the experimental results in these two cases. As the curves in the figure show, the performance of LMNN is consistently better than that of KNN and the Euclidean distance, which indicates that the LMNN method is more robust to noise.
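The noise protocol used in this robustness experiment can be sketched as follows. The Gaussian case follows the description above (zero mean, standard deviation given as a fraction of the data standard deviation σ). The speckle case is written in the common multiplicative form x + x·n, which is an assumption, since the exact definition of the speckle "content" is not given; the function names and the random test images are hypothetical.

```python
import numpy as np

def add_gaussian_noise(images, ratio, rng):
    """Add zero-mean Gaussian noise with standard deviation ratio * sigma,
    where sigma is the standard deviation of the image data."""
    sigma = images.std()
    return images + rng.normal(0.0, ratio * sigma, size=images.shape)

def add_speckle_noise(images, level, rng):
    """Multiplicative noise x + x * n with n ~ N(0, level^2) (assumed form)."""
    return images + images * rng.normal(0.0, level, size=images.shape)

rng = np.random.default_rng(0)
imgs = rng.random((4, 8, 8))                    # stand-in for training images
for ratio in (0.0, 0.05, 0.1):                  # sweep from 0 to 0.1 * sigma
    noisy = add_gaussian_noise(imgs, ratio, rng)
    print(ratio, float(np.abs(noisy - imgs).mean()))
```

Sweeping the ratio and re-running the classifier at each level yields a robustness curve of the kind plotted in Figure 8.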

E. EFFECT OF TRAINING SAMPLE NUMBER ON ALGORITHM PERFORMANCE
In practical applications, because of the computation and efficiency involved, the training set is usually subsampled to reduce the number of training samples. Since the number of randomly drawn samples affects the experimental results, this paper explores the impact of the number of training samples on the performance of the model. Figure 9 shows the change in ADL-DML performance for different sample numbers on the AwA and CUB datasets. As the number of training samples increases, the accuracy rises slowly; once the number of samples reaches a certain level, the accuracy stabilizes. After weighing classification performance against computational cost, the number of samples in AwA and CUB is set to 1200 and 750, respectively.

V. DISCUSSION AND CONCLUSION
This paper proposes a zero-shot image classification method based on word vector enhancement and distance metric learning. The method improves classification accuracy and overcomes the limitations of attribute learning without requiring a large amount of labeled data. Word vectors of the corresponding categories are obtained by unsupervised learning on a large amount of text in the Wikipedia corpus. Processing the word vectors with the improved LC-ADL model yields semantic feature vectors that are more consistent with the distance structure of the image feature vectors, which improves the robustness of the model.
We introduce distance metric learning when calculating the correspondence between image feature vectors and semantic feature vectors. The LMNN algorithm effectively alleviates the hubness phenomenon by keeping elements with the same label closer together within the maximum margin and elements with different labels far from each other. The classification results are given by the nearest neighbor classifier according to the distance. In the evaluation, apart from the controlled factors, the experiments in the different groups use the same network structure and classifier. On this basis, the experiments show that the proposed model achieves better classification accuracy than the traditional classification models.
One limitation of this paper concerns the ImageNet dataset, where classification becomes increasingly difficult because of the growing number of sample categories, which cover diverse animals, plants, objects, and scenes rather than only animals and birds. This task will be addressed in our subsequent research work.