An Improved Automatic Image Annotation Approach using Convolutional Neural Network-Slantlet Transform

Every day, websites and personal archives generate an increasing number of photographs, and the scale of these archives is enormous. The ease of use of these digital image collections contributes to their popularity. However, not all of these databases provide appropriate indexing data, so it is difficult for users to find the information they are interested in. To find information about an image, its content must therefore be classified in a meaningful way. Image annotation is one of the most difficult problems in computer vision and multimedia research. The objective is to map an image to one or more labels, which requires an understanding of the visual content of the image. One of the challenges of image annotation is the need for unambiguous information to build semantic-level concepts from raw image pixels. In text annotation, a dictionary links words to their meanings and a well-specified syntax combines letters into words and words into sentences; raw image pixels, by contrast, are insufficient to construct semantic-level concepts directly. This paper focuses on automatic feature extraction for automatic annotation. A deep learning convolutional neural network is employed to build and improve image coding and annotation capabilities. The performance of the proposed technique is evaluated on the Corel-5K, ESP-Game, and IAPRTC-12 datasets, and the experimental findings on these three datasets demonstrate the usefulness of the model for image annotation.


I. INTRODUCTION
In recent years, it has become very difficult to search for an image in a large image database. Many methods have been proposed to access an image [1], [2]. An image can be retrieved using low-level visual content such as shape, color, and texture, or using labels or keywords that convey its semantic meaning. To access images using low-level visual features, the user must provide a query image, and the search returns a set of images that are visually similar to it. However, many users find it challenging to supply a query image that matches their needs every time. CBIR (Content-Based Image Retrieval) is a technique for retrieving images based on low-level visual attributes. One way of overcoming the issues of CBIR systems is to assign labels to all images in the database, so that images can be found using these labels [3]. The key advantage of this method is that an image can be retrieved in the same way that a text document can be retrieved. This label assignment method is called image annotation. During the last decade, there have been significant breakthroughs in the field of computer vision, and using computers to identify and classify images has never been easier. Because of this, an alternative to the existing manual process is now possible: one that uses a deep learning neural network model to classify and tag images automatically. However, as training deep neural networks from scratch is time-consuming, a computationally cheaper alternative was desirable. Thus, the goal of the project became to design and implement such a solution: something that could categorize images in a reasonable amount of time even on large datasets and with limited computing power, while still achieving reasonably accurate results. Furthermore, additional resources could be allocated to the project for more complex deep learning systems if the results were good.
A deep neural network, such as those used in deep learning, works in much the same way as the neural structure of the human brain: neurons are connected, and the neurons further along in the network take the outputs of earlier neurons as their inputs. A deep neural network is usually trained through what is commonly referred to as "supervised learning": the network is given a batch of data and its associated labels and attempts to process the data. By comparing the network's actual output to the expected output given by the data labels, the network's weights can be adjusted so that the actual output matches the expected output. The novelty of this study is as follows:
1. Multiple features (SLT, YCbCr, LBP) are combined on the basis of CNN features and neighbors to achieve a balance between precision and recall by selecting CNN with the Slantlet Transform. Flexible annotation and improved accuracy are achieved.
2. A word2vec model with CNN-SLT is used to predict the image annotation using both the word2vec distributed representation and the learning representation. The distributed representation approach includes the encoding and storage of information regarding the image features.
3. Some image annotation models require considerable computation time and complexity during the training phase and therefore become computationally intensive when the training datasets are large. The proposed method is efficient in terms of computing time.
The rest of this paper is organized as follows. Section 2 introduces important background knowledge in the form of related works, Section 3 illustrates an improved deep feature extraction method, and Section 4 proposes a novel method for image annotation. In Section 5, we compare and analyze our proposed method against several methods such as MBRM [4], SEM [5], FastTag [6], and 2PKNN [7]. Finally, Section 6 concludes by summarizing the current research and suggesting possible avenues for future research.

II. CNN related works
An automatic image annotation system assigns keywords from a dictionary to an image. The input is the target image, and the output is the best description of that image in terms of keywords. A computer can easily measure color, texture, and shape, but it cannot interpret them semantically, unlike people, who can easily deduce meaning from images. Thus, an essential challenge in automatic image annotation is to bridge the semantic gap between low-level computer features and the interpretation of images by humans [8], [7]. Several approaches have been proposed in recent years to address automatic image annotation, and they can be described by three main families of models: graphical models, generative models, and discriminative models. In [9], recurrent neural networks (RNNs) and deep convolutional neural networks (CNNs) are combined in a unified framework to address the dependencies between labels in images. In the proposed CNN-RNN framework, label-image relationships characterize both the semantic label dependencies and the correlations between image labels. The CNN generates the feature vector of an image. A multilabel prediction is then computed sequentially by the RNN from the image feature vector and the outputs of the recurrent neurons, where the prior probability of a label is calculated at each step. Figure 1 illustrates the general pattern of this method.
FIGURE 1. General Method Scheme [9]
Murthy, Maji, and Manmatha [7] used CNN features and word representation vectors to perform image annotation. Canonical correlation analysis (CCA) is the basis of the proposed model, which helps to model visual and textual features simultaneously. Convolutional neural networks are used to extract the visual features of an image, and the word2vec architecture is used to extract the textual features [10].
By the late 1990s, a CNN-based system was already reading almost 10% of all the cheques circulated in the United States. Later, Microsoft deployed many optical handwriting recognition and character recognition systems using CNNs [11]. As an experiment in the early 1990s, CNNs were used to detect objects in natural images, including hands and faces [12], [13]. In the 1990s, convolutional networks were also employed for speech recognition [14] and document reading [15], [16], while time-delay neural networks were employed for extracting meaningful content. In document reading, a CNN was combined with a probabilistic model to handle language constraints, and in the United States this application was employed extensively to read cheques. Meanwhile, Tao et al. proposed a deep learning approach based on transfer learning and multi-task learning for analyzing images of biological components [17], [18]. In addition, [19] put forward a deep learning algorithm based on CNN and reported results exceeding those of existing machine learning strategies; this work earned the researchers accolades in the visual recognition challenge. Artificial neural networks can learn and model complex relationships. However, choosing the number of hidden layers and the number of neurons in each layer is a problem, and the exact relationship between input and output can be challenging to explain.

III. The Architectures of the CNN
Convolutional neural networks (CNNs) are artificial neural networks used to extract local features from data. A CNN simplifies the network model by sharing weights across features, thereby reducing the total number of weights. CNNs have become widely popular in the field of pattern recognition due to these characteristics [20]. For example, a document reading system can employ a CNN trained jointly with a probabilistic model that encodes language constraints. A CNN architecture has three key kinds of layers: 1) input, 2) hidden (latent), and 3) output. The hidden layers can be categorized as convolutional layers, pooling layers, or fully connected layers. Figure 2 shows these layers, adapted from [21].

A. Convolutional layer
The convolutional layer is the primary layer of a CNN architecture. Convolution slides a small filter over the input and evaluates the response at every position [22]. This layer is made up of several maps of neurons, also known as filters or feature maps. Each map has roughly the same dimensionality as the input data. The response of a neuron can be interpreted as a discrete convolution over its receptive field: the weighted sum of the inputs is computed and then passed through an activation function. Figure 3 briefly demonstrates the discrete convolution layer.
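The weighted sum followed by an activation function can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function name `conv2d_valid`, the absence of padding, the stride of 1, and the choice of ReLU as activation are all assumptions made for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding, stride 1), then apply ReLU.

    Hypothetical helper: as in most CNN libraries, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the inputs under the current filter position.
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel) + bias
    # Activation function (ReLU is an assumption; the text names none).
    return np.maximum(out, 0.0)
```

Each filter of a convolutional layer would produce one such feature map; without padding the map is slightly smaller than the input.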

B. Max pooling layer
In the max pooling layer, the output of the convolutional layer is split into several grids (sub-matrices) [22]. The maximum value of each sub-matrix (or the average value, in the case of average pooling) is then computed, and these values are arranged into a smaller matrix. Figure 4 illustrates the procedure for max pooling.
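The pooling step can be sketched as follows. This is an illustrative NumPy example, not the code used in this work; the function name `pool2d`, the non-overlapping square windows, and the trimming of incomplete border blocks are assumptions.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Split `feature_map` into non-overlapping size x size blocks and
    reduce each block to its maximum (or mean) value."""
    feature_map = np.asarray(feature_map, dtype=float)
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop rows/columns that do not fill a complete block (an assumption).
    trimmed = feature_map[:h_out * size, :w_out * size]
    # blocks[i, a, j, b] == trimmed[i * size + a, j * size + b]
    blocks = trimmed.reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")
```

For a 4x4 feature map and 2x2 windows, the result is a 2x2 matrix holding one value per block.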

C. Full connection layer
The fully connected layer accounts for about 90% of the parameters of the overall CNN architecture. In this layer, the input is passed through the network as vectors of predefined length [20]. Figure 5 presents a brief illustration of the fully connected layer. The multidimensional data is transformed by the preceding layers before it can be classified, and the output of the convolutional layer is transformed in a way that maintains the integrity of the information [20]. Neurons in a fully connected layer are connected to the neurons of the preceding layer. As the final layers of the network, the fully connected layers perform the classification. An example of a CNN that illustrates all three layer types is shown in Figure 6. Because standard CNNs are designed for object recognition, they may not be the best solution for our problem. To improve performance, we design a customized network structure according to the problem domain, and we show through experiments that our methods achieve strong performance on the tasks considered.

IV. Proposed AIA Architecture
This section discusses the architecture proposed on the basis of an initial investigation of the reviewed literature. The proposed system consists of three major phases. Training is an essential part of the system, in which a database of tagged images is used. In the second phase, the trained system works on raw data to output the annotated image. In the last phase, image retrieval is carried out to evaluate the annotation results. In the first (training) phase, the standard training database is processed by automatic feature extraction using a CNN. The automatic feature extraction process produces the feature vector directly from the contents of the images. The features are then modelled through a learning mechanism, which generates the annotation model used to annotate new images.
In the second phase, the input is an un-annotated image. Features are extracted to obtain the visual characteristics of its contents, which are then applied to the annotation model trained in the previous phase. The model generated in the earlier phase assigns proper semantic labels to the image according to its contents, and the output is an annotated image. In the third phase, the images from the annotation phase are used as the data store. A textual query is issued, and the system returns a list of appropriate images. Since the annotation is content-based, the retrieval of images becomes easier and more accurate. Figure 7 shows the framework of the proposed system architecture for automatic image annotation.
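The third phase, retrieval by textual query over the annotated data store, can be sketched with a simple inverted index. This is a hypothetical illustration only: the names `build_index` and `search`, the lowercase normalization, and the rule that every query term must match are assumptions, not details given in this paper.

```python
from collections import defaultdict

def build_index(annotations):
    """Map each label to the set of image ids annotated with it.

    `annotations` is a dict {image_id: iterable of labels}, i.e. the
    output of the annotation phase (format assumed for illustration).
    """
    index = defaultdict(set)
    for image_id, labels in annotations.items():
        for label in labels:
            index[label.lower()].add(image_id)
    return index

def search(index, query):
    """Return the ids of images whose labels contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Usage with made-up annotations:
store = {"img1": ["sky", "jet", "plane"], "img2": ["sky", "sea"]}
idx = build_index(store)
matches = search(idx, "sky jet")  # only "img1" carries both labels
```

Because retrieval reduces to set lookups on labels, its quality depends entirely on the labels assigned in the annotation phase.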

V. Features Extraction
Several factors contribute to an automatic image annotation (AIA) process, such as feature extraction, the identification of suitable features for use in the AIA, and the mathematical transforms selected for determining the feedback usage. An effective annotation system combines these factors. Researchers have used the low-level and high-level information contained in an image, such as texture, shape, and color, to represent the image. This section discusses automatic feature extraction in support of the research objective, which is to implement a new AIA system based on automatic feature extraction and object learning representation and to select the most adequate features. Shape is extracted using the Slantlet Transform, color using the YCbCr colour space, and texture using the Local Binary Pattern (LBP). The three types of features used in our experiments are described in what follows.

A. Wavelet Transform
Wavelet techniques are used to remove noise from images or signals, for data classification, and for data compression; in other words, wavelets can be used to perform various image and signal processing operations. However, the wavelet technique has certain disadvantages. If the input signal is shifted in time, the values of the transform coefficients change unpredictably, so the coefficients of the Discrete Wavelet Transform (DWT) cannot discriminate between shifts of the input signal [23], [24]. Furthermore, images contain edges with various orientations, whereas the DWT only supports horizontal, vertical, and diagonal orientations, so the wavelet has poor directionality. The major drawback of the DWT is that it considers only real-coefficient filters associated with real wavelets and gives only real-valued approximations. Complex signals can be used for various operations in image processing, and the phase of a complex signal is calculated from its real and imaginary coefficients, so the DWT fails to provide accurate phase information. This disadvantage can be overcome by using complex-valued filtering [25]. Figure 8 shows a 4-level DWT decomposition of an input image.

(1) Slantlet Transform
The Slantlet Transform (SLT) is an orthogonal Discrete Wavelet Transform (DWT) with 2 zero moments and improved time localization. The SLT retains all the usual features of a filter bank implementation with a scale dilation factor of 2. Its basis does not depend on an iterated filter bank like the DWT; instead, different filters are used for every scale.
Generally, in the 2D SLT decomposition, the image is divided into 4 components, LL (Low-Low), LH (Low-High), HL (High-Low), and HH (High-High) [26], as shown in Figure 9, where L and H signify the low and high frequency bands, respectively. Each of them carries different image information. The low-frequency component LL maintains the original image information. Conversely, the medium- and high-frequency bands LH, HL, and HH carry the information related to the image's edges, contours, and other details. Large coefficients therefore represent the critical information in the image, while the insignificant (small) coefficients are considered noise or worthless information and should be ignored to get the best results in subsequent operations. The SLT [27] is a multi-resolution method suitable for piecewise linear data. The SLT has previously been used in a variety of applications such as estimation, compression, fast algorithms, and de-noising of various input images. The SLT is implemented as a filter bank with a parallel structure, in which different filters are configured for different scales, as opposed to filter iterations at different levels. Following [28], the coefficients of the filters are calculated using the SLT equations.
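The split into four sub-bands and the suppression of small coefficients can be illustrated with the sketch below. It deliberately swaps in the simple one-level Haar filters as a stand-in, because the Slantlet filter coefficients of [28] are not reproduced here; the function names, the sub-band naming convention (first letter for the horizontal direction, second for the vertical), and the hard threshold are assumptions.

```python
import numpy as np

def haar_subbands(image):
    """One-level 2D split into LL, LH, HL, HH using Haar filters.

    Haar is only a stand-in for the Slantlet filters: it shows the
    sub-band structure, not the SLT itself.
    """
    x = np.asarray(image, dtype=float)
    if x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("image height and width must be even")
    s = np.sqrt(2.0)
    # Low/high-pass along the horizontal direction (pairs of columns).
    lo = (x[:, 0::2] + x[:, 1::2]) / s
    hi = (x[:, 0::2] - x[:, 1::2]) / s
    # Low/high-pass along the vertical direction (pairs of rows).
    ll = (lo[0::2, :] + lo[1::2, :]) / s
    lh = (lo[0::2, :] - lo[1::2, :]) / s
    hl = (hi[0::2, :] + hi[1::2, :]) / s
    hh = (hi[0::2, :] - hi[1::2, :]) / s
    return ll, lh, hl, hh

def suppress_small(band, threshold):
    """Zero out coefficients whose magnitude is below `threshold`."""
    band = np.asarray(band, dtype=float)
    return np.where(np.abs(band) >= threshold, band, 0.0)
```

For a perfectly flat image all detail bands are zero and only LL is non-zero, which matches the description of LL as the carrier of the coarse image information.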

B. YCbCr Colour Space
Different colour models have shown varied visibility of tampering traces. Image forgery detection methods generally use the RGB or grey-scale colour systems. Many recent studies [29] have noted that using the chromatic channels instead of RGB or luminance improves the detection performance. The YCbCr colour model represents colours by a luminance component (Y) and two chrominance components (Cb and Cr). Eq. (1) presents the formula that computes the Y, Cb, and Cr channels from the R, G, and B channels.
Figure 10. RGB image and its YCbCr counterpart
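Since several RGB-to-YCbCr conventions exist, the sketch below shows one common choice, the full-range ITU-R BT.601 conversion used by JPEG/JFIF for 8-bit images. Whether Eq. (1) uses exactly these coefficients is an assumption; the function name `rgb_to_ycbcr` is illustrative.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (..., 3) array of 8-bit RGB values to Y, Cb, Cr.

    Uses the full-range BT.601 (JFIF) coefficients; this convention is
    an assumption, not necessarily the one behind Eq. (1).
    """
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b               # luminance
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b  # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b  # red-difference chroma
    return np.stack([y, cb, cr], axis=-1)
```

Grey pixels (R = G = B) map to Cb = Cr = 128, so all colour information is isolated in the two chrominance channels that the texture features are later computed on.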

C. Local Binary Pattern (LBP)
LBP is a local operator that can discriminate between various texture types. The initial LBP operator [30] defines a label (LBP code) for every pixel in the image. To compute this LBP code, the pixels of the 3×3 neighborhood are compared with the central pixel value, which acts as a threshold: if a neighboring pixel value is less than the center value, it is assigned the binary digit '0'; otherwise, it is assigned '1'. The binary digits of all neighbors are concatenated to build a binary code, and the LBP code is the decimal value of this binary code. The example shown in Figure 12 describes the LBP code computation method. LBP_(P, R) denotes the LBP operator, where P is the number of pixels in the neighborhood, R is the radius, and P_c is the center pixel value. In the LBP computation method, the texture T of the local 3×3 neighborhood of a Cb image is first defined as the joint distribution of the grey levels of the 9 image pixels, where g_i (i = 0, …, 8) are the grey values of the pixels in the 3×3 neighborhood, arranged according to the spatial layout described in Figure 11. The neighborhood pattern is known as the "window". It slides over the complete image, pixel by pixel, from left to right until it reaches the final column; the window then returns to the first column and moves down one row, proceeding from top to bottom.
This research proposes a novel CBIR approach that uses a merged feature extraction system based on a CNN architecture; the proposed system automatically combines and merges the extracted features.
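The thresholding and sliding-window procedure for the basic 3×3 operator can be sketched as follows. The neighbour ordering (clockwise from the top-left pixel) and hence the bit weights are assumptions, since the layout of Figure 11 is not reproduced here; the function names are illustrative.

```python
import numpy as np

# Offsets (row, column) of the 8 neighbours, clockwise from the top-left.
# This ordering, and therefore the weight of each bit, is an assumption.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(window):
    """LBP code of the centre pixel of a 3x3 window."""
    window = np.asarray(window)
    center = window[1, 1]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        # Neighbour >= centre gives '1', neighbour < centre gives '0'.
        if window[1 + dy, 1 + dx] >= center:
            code |= 1 << bit
    return code

def lbp_image(gray):
    """Slide the 3x3 window over the image, left to right, top to bottom.

    Border pixels without a full neighbourhood are skipped, so the
    result is two rows and two columns smaller than the input.
    """
    gray = np.asarray(gray)
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y - 1, x - 1] = lbp_code(gray[y - 1:y + 2, x - 1:x + 2])
    return out
```

A histogram of the resulting codes, computed for example on the Cb channel, is the usual way to turn the per-pixel labels into a texture feature vector.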

VI. Word2vec Representation
Word2vec is a successful and popular natural language processing (NLP) approach for word representation [4]. It is a distributed representation, in which information is encoded and stored within the system through interaction with other objects. The distributed representation technique was inspired by the structure of human memory, in which all memories are stored in a "content-addressable" manner. Content-based storage allows memories to be recalled efficiently from a partial description. Since content-addressable items and their properties are stored in close proximity, such systems have a viable infrastructure for generalizing the features of any item. Continuous vector representations, which act as distributed representations of words, are used in NLP systems to efficiently represent semantic and syntactic units, and they have multiple applications. Figure 13 illustrates the distributed representation used by word2vec in NLP. Word2vec learns word representations in an unsupervised manner [55], and the model must be fed a sufficiently large, properly encoded text. As shown in Figure 14, the main principle of word2vec is that a piece of text is divided into fragments of a fixed size (called windows) and given to the neural network. The network analyzes every fragment as a pair consisting of a target word and its context. In the example, the target word is "music" and the context is "I", "like", and "spiked". During training, the middle word of each fragment is used as the target word, while the rest is used as context. The word2vec model learns word embeddings by predicting the middle word from its context. The hidden layer of the neural network contains a set of weights for each word (7 neurons in the example above). When the learning process is complete, these weights act as the vectors representing the words.
The important trick in word2vec is that we are not too concerned with the output of the neural network. At the end of the training phase, we extract the internal state of the hidden layer, which yields a vector representation for every word. In Figure 14, the neural network is composed of an input layer, a hidden layer, and an output layer. There is no activation function in the hidden layer: its neurons combine weights and inputs linearly (each input is multiplied by its weight and the products are summed). Word2vec requires each word to be represented as a one-hot encoded vector in the input layer. In the model, every word is embedded as a vector in an n-dimensional space. Similar words, such as "King" and "Queen" or "Woman" and "Man", have vectors that lie close together, as shown in Figure 15, where the similarity is based on syntax and semantics. These vectors are trained on the idea that the meaning of a word is characterized by its context, i.e., its neighboring words. Hence, words and their contexts are considered positive training samples [4], [46]. Interesting patterns emerge when word vectors are trained on natural language: words with similar vector representations exhibit multiple degrees of similarity. For example, Figure 16 shows words together with their closest vectors [7]. In this paper, the image annotation is predicted using both the word2vec distributed representation and the learning representation of the word2vec model. The distributed representation approach includes the encoding and storage of information regarding the image features. The learning representation of objects in images uses the internal state of the hidden layer at the end of the training phase, which yields exactly one vector representation per object. This section has described the distributed and learning representations and how both techniques can be used to create a new object or image representation.
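The windowing, one-hot encoding, and linear hidden layer described in this section can be sketched as below. This is a minimal illustration of the data flow only, with no training loop; the function names, the symmetric window, the averaging of context vectors, and the example sentence are assumptions and do not come from the paper.

```python
import numpy as np

def context_target_pairs(tokens, half_window=1):
    """Cut the text into windows; the middle word of each window is the
    target and the remaining words are its context."""
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        context = tokens[i - half_window:i] + tokens[i + 1:i + 1 + half_window]
        pairs.append((context, tokens[i]))
    return pairs

def one_hot(index, size):
    """One-hot encoded input vector for the word with id `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def hidden_state(context_ids, w_in):
    """Linear hidden layer: no activation, only weights times inputs.

    `w_in` has one row per vocabulary word; after training, row k is
    the embedding of word k. Averaging the context is an assumption.
    """
    vocab_size = w_in.shape[0]
    x = sum(one_hot(i, vocab_size) for i in context_ids) / len(context_ids)
    return x @ w_in
```

Because the inputs are one-hot, the hidden state is simply the mean of the weight rows of the context words, which is why those rows can be read off as word vectors once training is complete.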

VII. Experimental results and comparison
We evaluate the quality of the CNN-SLT model, and examine the reasons for choosing it, on three standard benchmark datasets. The CNN annotation framework is a comprehensive method for solving image annotation problems. To ensure that the annotation results of the framework are as good as possible, the system combines and merges the extracted features with the CNN architecture. We begin with an explanation of the datasets and evaluation metrics. Secondly, the results of each method are presented and analyzed briefly. Finally, we compare our model with several state-of-the-art annotation methods. We also provide several examples of how the annotation process works.

A. Dataset
In our experiments, we used three popular image annotation databases: Corel-5K [31], ESP-Game [32], and IAPRTC-12 [33].
Corel-5K: This is the most popular database for image annotation and image search. It has 4,500 images for training and 500 images for testing, and a vocabulary of 260 keywords is used for both. The images are grouped into 50 categories, each containing 100 images. Each image is manually annotated with 1 to 5 keywords, with an average of 3.4 keywords per image.
ESP-Game: The subset of 20,770 images commonly used in the literature was obtained from the ESP-Game dataset [54]. With a vocabulary of 268 keywords, this subset consists of 18,689 images for training and 2,081 images for testing.
Table I shows detailed information about each database [56]. Table II presents some sample images with their annotations from the three databases used in the experiments. For example, the second image of the Corel-5K database is annotated with the keywords "sky," "jet," and "plane."

B. Performance Evaluation
Several quality measures for image annotation systems are used in the literature. According to Kwasnicka [34], they can be divided into two main categories: measures by annotation and measures per word. In the following sections, we detail these two categories and the measures used in this study.

1) MEASURES BY ANNOTATION
Measures by annotation focus on the result of image-by-image annotation. First, the measures are calculated after the annotation of each image. The values are then averaged over all images in the test set [35].

2) ANNOTATION RATE
One of the most fundamental quality measures for automatic annotation methods is the annotation rate. It measures the proportion of words predicted correctly in the annotation. If all words are correctly predicted, the measure has a value of 1, and if none of the words are correctly predicted, it has a value of 0. The average annotation rate is the arithmetic average over all test images [36]. In Eq. (4), τi is the annotation rate of image i, τ is the average annotation rate over the test set, and ci is the number of words correctly predicted in the annotation of image i; the remaining quantities are the length of the annotation and the size of the test set [37].
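A minimal sketch of the annotation rate is given below. It assumes that the rate of an image is the number of correctly predicted words divided by the length of the predicted annotation, which is one reading of the description above rather than a confirmed form of Eq. (4); the function names are illustrative.

```python
def annotation_rate(predicted, truth):
    """Fraction of the predicted words that are correct (1 = all, 0 = none).

    Dividing by the length of the predicted annotation is an assumption.
    """
    predicted, truth = set(predicted), set(truth)
    if not predicted:
        return 0.0
    return len(predicted & truth) / len(predicted)

def mean_annotation_rate(all_predicted, all_truth):
    """Arithmetic average of the per-image rates over the test set."""
    rates = [annotation_rate(p, t) for p, t in zip(all_predicted, all_truth)]
    return sum(rates) / len(rates)
```

For instance, predicting "sky", "jet", and "car" for an image whose reference annotation is "sky", "jet", and "plane" gives a rate of 2/3 under this assumption.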

3) STANDARDIZED SCORE
The standardized score, denoted NS, is the second measure in this category. It is similar to the annotation rate, but it also counts every incorrectly predicted word as a penalty.
Here V represents the size of the vocabulary and di represents the number of words predicted incorrectly. The average standardized score is calculated over all annotations in the test set [38].
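One widely used form of such a score rewards the fraction of reference words that were found and subtracts the fraction of non-reference vocabulary words that were wrongly predicted. The sketch below implements that form; that the NS equation of this paper has exactly this shape is an assumption, and the function name is illustrative.

```python
def standardized_score(predicted, truth, vocabulary_size):
    """Assumed form: c_i / n_i - d_i / (V - n_i).

    c_i: correctly predicted words, d_i: incorrectly predicted words,
    n_i: number of reference words, V: vocabulary size.
    """
    predicted, truth = set(predicted), set(truth)
    n_truth = len(truth)
    if n_truth == 0 or n_truth >= vocabulary_size:
        raise ValueError("need 0 < number of reference words < vocabulary size")
    correct = len(predicted & truth)   # c_i
    wrong = len(predicted - truth)     # d_i
    return correct / n_truth - wrong / (vocabulary_size - n_truth)
```

Under this form a perfect annotation scores 1, while predicting every vocabulary word except the correct ones scores -1.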

4) MEASURES PER WORD
Measures per word can be calculated once all images in the test set have been annotated. The information contained in the annotated images is grouped by word, and the averages are then calculated over the words of the vocabulary [39].

5) PRECISION AND RECALL
Suppose a label e appears in m_1 images of the ground truth and is predicted for m_2 images during testing, of which m_3 predictions are correct.
Precision is the ratio between the number of images correctly annotated with a keyword and the number of all images annotated by the model with that keyword [40]:
$P(e) = m_3 / m_2$ (6)
Recall is the ratio between the number of images correctly annotated with a given keyword and the number of all images that carry that keyword in the ground truth [41]:
$R(e) = m_3 / m_1$ (7)
To get an overview of the performance of an annotation system, the average precision and recall are calculated across the entire vocabulary of size V [40].

6) E-SCORE
The E-score combines the recall and precision measurements into a single quality measurement that can be compared easily [41].

7) F-MEASURE
The F-measure is the weighted harmonic mean of recall and precision [52]. Its parameter α >= 0 allows more or less weight to be assigned to precision. When α = 1, recall and precision have the same weight. In this case, the F-measure can be expressed in terms of the E-score [53].

8) N+
N+ is another measure used for annotation systems: it is the number of words correctly assigned to at least one test image (i.e., the number of words with strictly positive recall). N+ therefore represents the amount of vocabulary covered by the method [42].
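The per-word measures can be sketched together as follows, using the counts m1, m2, and m3 defined for precision and recall. The conventions chosen here are assumptions: a word that is never predicted gets precision 0, the F-measure is computed for α = 1 from the averaged precision and recall, and the function names are illustrative.

```python
def per_word_metrics(predictions, ground_truth, vocabulary):
    """Per-word precision and recall.

    `predictions` and `ground_truth` are lists of label sets, one per
    test image, in the same order.
    """
    precision, recall = {}, {}
    for word in vocabulary:
        m1 = sum(word in t for t in ground_truth)   # images with the word in the ground truth
        m2 = sum(word in p for p in predictions)    # images where the word is predicted
        m3 = sum(word in p and word in t            # correct predictions
                 for p, t in zip(predictions, ground_truth))
        precision[word] = m3 / m2 if m2 else 0.0
        recall[word] = m3 / m1 if m1 else 0.0
    return precision, recall

def summarize(precision, recall):
    """Average P and R over the vocabulary, F1 (alpha = 1), and N+."""
    v = len(precision)
    p = sum(precision.values()) / v
    r = sum(recall.values()) / v
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    n_plus = sum(1 for value in recall.values() if value > 0)
    return p, r, f1, n_plus
```

With a four-word vocabulary in which one word is never predicted correctly, N+ is 3, which shows how N+ exposes vocabulary coverage that the averaged precision and recall can hide.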

VIII. Evaluation Criteria Selected
The performance of the newly proposed improved AIA scheme was assessed by designing and implementing it on the standard datasets. In this paper, like the majority of state-of-the-art works, we have chosen measures per word. The performance of the designed AIA was evaluated in terms of recall, precision, F-measure, and N+. We have also used the annotation rate, which belongs to the measures by annotation [51].

IX. Experiments results
The proposed method was implemented in Keras (Chollet, 2015), a publicly available deep learning library built on TensorFlow [6]. Keras was used to initialize the weights of the neural networks, and all layers of the deep network were optimized simultaneously with ADADELTA [43]. The complete network was trained on a Dell Precision T1700 CPU system with 16 GB of memory. We assessed the classification accuracy of the deep learning system using the procedure described in Section 3. The proposed CNN configuration, which combines the Y, Cb, and Cr color channels of the image, was built with the Keras library and is summarized in the accompanying figure. Table III shows the average precision, recall, and F-measure of the CNN model for each dataset, and a comparison of the experimental data is shown in Figure 18. Table III and Figure 18 show that our method improves on other methods (2PKNN, SEM, and GAN) and is more applicable to the annotation task owing to its improved precision and recall. In addition, we achieve higher recall and F-values on ESP-Game and IAPRTC-12. The primary objective of this paper was to propose and implement a new AIA system based on automatic feature extraction and object learning representation that selects the best features. Our model has the highest F-value of all the compared methods, which indicates its effectiveness. For the training-set images shown in Table IV, our proposed method extends the labels effectively, while an image can also retain its original labels. By applying the method to the test subset, each dataset is effectively annotated. The experimental results were analyzed for the Corel5k, ESP-Game, and IAPRTC-12 datasets, respectively.

1) Proposed approach's annotation performance in Corel5k is as follows:
P: the CNN-SLT approach provides the highest P, which is 0.40. R: the R value for the CNN-SLT and 2PKNN techniques is 0.41, which is higher than the R value of the other algorithms. F1: the highest F1 is 0.39 for the CNN-SLT method and 2PKNN, while the highest F1 among all other algorithms is 0.46. N+: the highest N+ is provided by the CNN-SLT technique, with a value of 280, which equals the 280 of FastTag among the other nine analyzed algorithms and is an improvement of at least one over the remaining compared algorithms. We also find that, although the CNN-SLT approach provides higher P and R than the other algorithms and shares the highest F1 with 2PKNN, the difference in N+ between the CNN-SLT approach and 2PKNN is 7, indicating that the CNN-SLT approach has better annotation performance than 2PKNN and the other algorithms.
Furthermore, although the differences between the CNN-SLT approach and GAN are only 0.02 in P, 0.08 in R, and 3 in N+, the difference in F1 is 0.1, which is quite substantial. This shows that the CNN-SLT technique outperforms GAN in terms of annotation performance. Finally, comparing CNN-SLT with 2PKNN, we find that the annotation performance of CNN-SLT is consistently better, indicating that CNN-SLT can improve the annotation result.

X. Comparison with other CNN methods
This study combines and merges the extracted features with the CNN architecture. We investigate how different CNN architectures affect the experimental results on the three datasets; Table V summarizes the results. As can be observed in Table V, more detailed network architectures improve the experimental results on datasets of appropriate size. To construct the network architecture, we use multiple features: shape extracted using the Slantlet Transform, color extracted using the YCbCr colour space, and texture extracted using the Local Binary Pattern (LBP) [44]. Our proposed method is compared with the traditional methods shown in Table V. Additionally, techniques such as deep convolutional neural networks (CNN) and the k-nearest neighbors algorithm (KNN) [45] have become increasingly popular. Table V shows that, on the investigated measures, CNNs perform noticeably better than traditional machine learning methods for multilabel annotation. On the IAPRTC-12 dataset of natural scenes, average precision is improved in comparison with other methods, and recall on this dataset is improved as well [47].

XI. Conclusions
This paper has presented an annotation method that uses CNN features and neighbors, representing each image by CNN features combined with the Slantlet Transform. Furthermore, an algorithm for semantic extension was presented along with detailed implementations. The information in an image was assembled from low-level and high-level features, such as shape, texture, and color. The distributed representation approach included the encoding and storage of information regarding the image features. Our study also investigated how different CNN architectures affect the experimental results on three datasets. Lastly, the effectiveness of this model for image annotation was demonstrated through experimental results on three public datasets. For COREL5K, ESP-Game, and IAPRTC-12, the average precision of CNN-SLT is 0.40, 0.38, and 0.42, respectively, and the average recall is 0.55, 0.46, and 0.41, respectively. The F1 value reaches 0.42, 0.50, and 0.39, respectively, and N+ reaches 200, 260, and 280, respectively. While some research on image annotation systems tends towards high precision and low recall, an ideal system balances precision and recall while maintaining precision. The proposed method achieves a balance between precision and recall using the SLT and selects the most appropriate features. Meanwhile, as the training phase of some image annotation models requires considerable computation time and complexity, they become computationally intensive when large training datasets are used. The proposed method is successful in terms of computational efficiency, which has long been a challenge for researchers.