Deep Image Sensing and Retrieval Using Suppression, Scale Spacing and Division, Interpolation and Spatial Color Coordinates With Bag of Words for Large and Complex Datasets

Intelligent and efficient image retrieval from versatile image datasets is an essential requirement of the current era. Primitive image signatures are vital for reflecting the visual attributes in content based image retrieval (CBIR). Algorithmically descriptive and well identified visual contents form the image signatures used to correctly index and retrieve similar results. Hence feature vectors should contain ample image information from the color, shape, object, and spatial information perspectives to distinguish an image category as a qualifying candidate. This contribution presents a novel feature detector that locates interest points by applying non-maximum suppression to the sum of products of pixel derivatives computed from the differential of corner scores. The interest points are described by applying scale space interpolation to the scale space division produced by a Hessian blob detector after Gaussian smoothing. The computed shape and object information is fused with color features extracted from the spatially arranged L2 normalized coefficients. High variance coefficients are selected for the object based feature vectors to reduce the massive data, which in fused form is transformed into bag-of-words (BoW) for efficient retrieval and ranking. To check the competitiveness of the presented approach, it is evaluated on nine well-known image datasets: Caltech-101, ImageNet, Corel-10000, 17-Flowers, Columbia object image library (COIL), Corel-1000, Caltech-256, tropical fruits, and Amsterdam library of textures (ALOT), which belong to the shape, color, texture, and spatial and complex object categories. Extensive experimentation is conducted against seven benchmark descriptors: maximally stable extremal regions (MSER), speeded up robust features (SURF), difference of Gaussian (DoG), red green blue local binary pattern (RGBLBP), histogram of oriented gradients (HOG), scale invariant feature transform (SIFT), and local binary pattern (LBP). The outcomes show that the presented technique achieves significant precision rates, recall rates, average retrieval precision and recall, and mean average precision and recall rates for many image semantic groups of the challenging datasets. The results are also compared with existing research techniques and show improvement.


I. INTRODUCTION
Digital media is growing and is in increasing demand nowadays due to its applications in many areas of life [1]. Advancements in digital image processing are required for efficient image searching and indexing in large databases. Generally, images are retrieved in three different ways: content based retrieval, semantic based retrieval, and text tag oriented retrieval. The increasing demand for digital images requires very specific data representation and retrieval; for this reason, image indexing and retrieval has become an active research area and plays a direct and effective role in searching images from huge databases. Content-Based Image Retrieval (CBIR) is an important procedure for detecting matching primitive image features based on visual properties [2]. A CBIR system extracts features to represent an image. The feature extraction process is also known as image preprocessing. The visual features are classified into two categories: global features, also known as the overall characteristics, and local features, known as the visual properties of an image [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Sunil Karamchandani.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Most CBIR systems use global and local features including shapes, edges, spatial coordinates, texture information, and color channels, while others use local features such as regions, segmented features, and interest points to extract similar images. Texture features represent neighborhood relationships as a combination of pixels and are categorized into spatial texture and spectral texture. Shape features are also categorized into two types [4], region-based and contour-based. Region-based methods are mostly applied with color features [4] and extract shape keypoints from the whole area of interest. Contour-based methods, which extract shape based anchors from the corners and edges of the image, are sensitive to noise [5]. Moreover, color histogram representations are rotation and scale invariant, but spatial distribution cannot be represented by color channels alone. The major problem with global features is that they are unable to reduce the semantic gap, because they cannot represent all the characteristics of an image. For this reason, global features are not applicable to the partial matching of images in a retrieval system, whereas local features reduce the semantic gap. To overcome the drawback of global feature extraction, interest point detectors are used to represent the local features of an image. Interest point based algorithms include Hessian [6], Harris [7], affine invariant [8], and scale invariant [6] detectors. For object recognition, global and local features of the image are combined to capture the maximum image content [9]. The proposed method also uses an interest point detector and a global feature descriptor.
This contribution presents a corner detector that locates interest points by taking the derivative at every pixel. A feature extractor algorithm, which uses Gaussian smoothing, serves as the global feature descriptor. The color image is converted to grey level, and L2 normalization is applied to the RGB channels. Principal component analysis is performed on redundant features. A bag of visual words architecture is employed to retrieve relevant images from the visual BoW repository after indexing. The remainder of the article is organized as follows: Section 2 reviews related work on robust corner and feature detectors and descriptors. Section 3 explains the presented methodology. The experimental results with graphical representations are presented in Section 4, and the conclusions are discussed in Section 5.

II. RELATED WORK
Remote sensing research aims to learn in depth the potential of primitive feature synthesis for high resolution images. Several techniques have been proposed to implement image content matching. Similar to the work presented in this paper, local features for detection and classification have been investigated in different ways. A feedback method that combines a Harris-Laplace corner detector with a support vector machine is presented in [10]. The Harris-Laplace corner detector is first used to extract image corners, and then a density ratio is used to obtain the salient regions for all distinguished parts of the image. For the initial retrieval, shape features are merged with color information to detect the salient regions. Lastly, Support Vector Machine (SVM) classification is used to compute the relevance feedback for CBIR. The Harris corner detector is used with the Bi directional Decomposition technique for a CBIR demonstration in [11]. The Harris corner detector detects corners and the BEMD technique extracts edge information. The features extracted by these two techniques are merged for retrieval from the database. Experiments are performed on the COIL-100 database. Fisher vectors are introduced for image classification in [12], in which a super-pixel approach with an edge and Zernike filter repository is used for efficient image retrieval, with classification benchmarks including the Harris, Hessian, and difference of Gaussian (DoG) detectors. The results of this research show that the condensed descriptor is remarkable for blob and super-pixel extraction if the patches are considered along the edges. In another approach [13], a visual feature attentive technique is proposed by applying salient point detection for CBIR. The Corel-10K and GHIM-10K databases are used to test the presented algorithm, which shows improved performance over Bag-of-Words and micro level descriptors. Local invariant features are evaluated in [14] for geographic image retrieval; the study reports the effects of tuned parameters on the BoW structure and also performs comparisons on specific types of standardized data for primitive features.
In the past, recognition based work has been presented using the speeded up robust features (SURF) descriptor. SURF works as an interest point detector and descriptor for images, as proposed in [15], [16]. SVM and neural network (NN) classifiers are used in [15] for classification. The SURF detector is applied to extract the required images and the matching feature points from the image. The results show better accuracy compared to existing methods. SURF for feature extraction and a Multiple Instance Learning Support Vector Machine are proposed in [16] for image classification. In that approach, the image is segmented by the quad-tree method and a codebook is built with the Linde-Buzo-Gray (LBG) technique. Similarity measurement is performed using Histogram Intersection (HI). The use of SURF and SIFT visual words is presented in [17]. Integrating SURF and SIFT visual words adds robustness to changes in rotation, scale, and illumination for image extraction. In this method, statistical comparisons on image benchmarks including Corel-2000, Corel-1000, Torralba, and Corel-1500 are used to validate the efficiency of the presented method. A new image probing scheme is presented in [18] to extract image features using the fusion of Advanced SURF with a dominant color description. The approach is evaluated in simulation using the F-score and average precision, and it results in stability and high accuracy. Invariant image moments, which are affine and describe the localities of image regions, are used in [19]. The method is evaluated using three different setups. The retrieval results are evaluated on the UCID and UKBench datasets and are promising compared with other extensively used local descriptors. A medical image retrieval system that uses SURF features for medical databases is described in [20]. The SURF algorithm is applied as a detector and descriptor to extract the referenced images and the corresponding image feature points. Experiments are performed on medical images using SURF features and produce improved results. An operative deep learning framework is presented in [21] to produce binary hash codes for efficient image extraction. This method learns point-wise hash codes and image representations. Experiments are performed on the CIFAR-10 and MNIST datasets and result in high accuracy. Furthermore, 1 million clothing images are used to demonstrate scalability and efficacy. Robust Visual Descriptor with Whitening (RVD-W) is proposed in [22], in which local descriptors are used to assign ranks to clusters. A new normalization method is also proposed to improve the separability between matched global descriptors and unmatched ones. Moreover, the accumulation framework is established using SIFT signatures and is also shown to perform with Convolutional Neural Network (CNN) features.
Another CBIR scheme was introduced in [23], which uses Multi-scale Geometric Analysis (MGA) of the Contourlet Transform to retrieve images. A Relevance Feedback (RF) mechanism was used to improve retrieval performance. Experiments were performed on three datasets, and the results were compared with state-of-the-art methods when images were corrupted by noise. A new method to retrieve relevant images in three stages is presented in [24]. A color feature similarity measure was used to retrieve a fixed number of images. Then texture and shape features were used to find the relevancy of images. Additionally, global and region features were joined to obtain accurate retrievals. Experiments were performed on the COREL and CIFAR datasets. A feature descriptor for varying illumination is proposed in [25], which uses HSI to compute the channel intensity and the red green blue channels to eliminate variations in image intensity. To show robustness and uniformity under illumination changes, the experimental results were compared with benchmark feature descriptors. A rapid model for CBIR composed of four phases is presented in [26]. The phases include feature abstraction, dimensionality reduction, an ANN classifier, and a matching strategy. Experiments demonstrate improved performance with less computation time. A CBIR algorithm based on improved Histograms of Oriented Gradients (HOG) was introduced in [27]. The method uses a sliding HOG window to adjust the HOG structure and the principal component analysis (PCA) technique to reduce the feature dimension. The experimental results show better performance. An accurate and rapid model for the CBIR process that depends on a new matching method is presented in [28]. The model combines four major phases: feature extraction, an ANN classifier, dimensionality reduction, and a matching strategy. The method presents considerable results with less computational time. A novel framework is presented in [29] to retrieve color images by using low level features. To capture texture, color, and shape from an image, the Angular Radial Transform (ART) and Color Difference Histogram (CDH) are exploited. The framework combines the experimental results of the standard descriptors by using post-classification techniques such as min-max, the Borda count method, and Z-score normalization. For scene categorization, a new mechanism called mCENTRIST was proposed in [30]. Moreover, Sobel information was embedded into the opponent color space to improve results. Experiments were performed on RGB-near infrared databases, which include aerial orthoimagery, indoor, and outdoor scene category recognition tasks. For a grid computing environment, [31] presents a multiple support vector machine (SVM) based architecture for CBIR. In-depth texture analysis was used to extract features, and Gabor filters, wavelet packets, and curvelets were used for the image representation. The proposed work is compared with the research methods to endorse the efficiency of the presented approach.
The method presented in this paper concentrates on: 1) finding effective interest points using the presented technique and describing these anchors to precisely produce reflective signatures; 2) applying L2 normalization to the RGB channels and using the results of the L2 norm for the spatial arrangement; and 3) presenting a new technique to index images by effectively detecting their interest points and their primitive and global features. In the proposed method, Principal Component Analysis is applied to redundant features. Bag of features is used to provide significant results.

III. METHODOLOGY
The first step of the proposed CBIR method is to convert the query image to grey level. The color image is converted to grey level because each pixel of a greyscale image contains only intensity information. Such images, also called black-and-white or monochrome images, consist of grey shades varying from black to white, with values ranging from 0 to 255. RGB coefficients are converted to grey level by discarding the hue and saturation information while retaining the luminance values.
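The grey level conversion described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the text does not name the luminance weights, so the ITU-R BT.601 weights (0.299, 0.587, 0.114) are an assumption, and the function names `rgb_to_grey` and `image_to_grey` are hypothetical.

```python
def rgb_to_grey(pixel):
    """Map one (R, G, B) pixel with 0..255 channels to a 0..255 grey level."""
    r, g, b = pixel
    # BT.601 luma weights: an assumption, the paper does not state its weights.
    luminance = 0.299 * r + 0.587 * g + 0.114 * b
    return int(round(luminance))


def image_to_grey(image):
    """Convert a 2D list of (R, G, B) tuples to a 2D list of grey levels."""
    return [[rgb_to_grey(pixel) for pixel in row] for row in image]
```

Hue and saturation are discarded because only the weighted sum of the channels is kept; green contributes most to the result under these weights.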
Harris and Stephens introduced a corner detector in 1988 [22]. The Harris detector was later adopted by Schmid and Mohr for interest point detection. The method is designed to capture image regions with texture and other salient image attributes. The Harris detector is based on the auto-correlation of pixel intensities. It localizes corners as points where the gradient values change in different directions. The Harris matrix is a scale-adapted version of the second moment matrix [7]. Equation 1 gives the distribution of gradient values in the neighborhood of a point [7], where the integration scale, derivative, and differentiation scale are denoted by δ_I, D_z, and δ_D respectively, and q and r denote the directions of the derivatives. The derivatives are calculated with Gaussian kernels of scale δ_D. A Gaussian window is used to smooth and average over the point neighborhood. The eigenvalues depict the gradient signal changes in two orthogonal directions. Eigenvalues of very different magnitude indicate an edge, and two large eigenvalues of similar magnitude indicate a corner. Consequently, intensity variations produce edges and corners, and the corners are later described as interest points [32]. Each detected interest point is surrounded by a square block of twenty pixels.
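For reference, the scale-adapted second moment matrix is commonly written in the form below. This is a sketch in the notation of the surrounding text, not a quotation of equation 1: it assumes that D_q and D_r denote the Gaussian derivatives in directions q and r at point z, and that g(δ_I) is the Gaussian window at the integration scale. The corner response R with constant κ is the usual Harris measure and is likewise an assumption here.

```latex
\mu(z, \delta_I, \delta_D) = \delta_D^{2}\, g(\delta_I) *
\begin{bmatrix}
D_q^{2}(z, \delta_D) & D_q D_r(z, \delta_D) \\
D_q D_r(z, \delta_D) & D_r^{2}(z, \delta_D)
\end{bmatrix},
\qquad
R = \det(\mu) - \kappa \, \operatorname{trace}^{2}(\mu)
```

Two large eigenvalues of μ make the determinant dominate and R positive (a corner), whereas one dominant eigenvalue makes the trace term dominate and R negative (an edge).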
The proposed method utilizes the corner detector to spot points of interest in the images, building on the strength of the Harris and Stephens method for potential interest point detection. The significance of this method is that it performs texture analysis along with the detection of salient points to segment the image regions for potential shape formation. It is adopted because of its sub-optimization and computational efficiency in contrast to the correlation approach. Another advantage of this algorithm is the simplicity of its operations, which use directional corner scores instead of iterating over expensive shifted patches. In the introduced method, the differential of the corner scores is computed along the image directions. The advantage of this mechanism is that it finds repeating patterns and disturbances in the series of pixels, from which similarities and dissimilarities are ultimately concluded. It applies to 2D images in a simple step and produces results promptly without loss of generality. It computes the x and y derivatives and calculates the products of the derivatives for each pixel. Then it calculates the sums of the derivative products and forms the matrix for each pixel. Furthermore, it calculates the detector response for each pixel and applies non-maximum suppression (NMS) to the response values. NMS is applied for its ability to eliminate cascading proposals, which would otherwise create ambiguity as successive candidate regions. It overcomes the problem of neighborhood windows that generate hundreds of bounding boxes. Another advantage of NMS is its control over the recall rates by removing repeated proposals.
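The steps listed above (derivatives, derivative products, windowed sums, detector response, and NMS) can be sketched in plain Python as follows. This is an illustrative simplification, not the authors' code: central differences stand in for Gaussian derivatives, a 3 × 3 box window stands in for the Gaussian window, and the constant k = 0.04 and the names `harris_response` and `non_max_suppression` are assumptions.

```python
def harris_response(img, k=0.04):
    """Corner response for a 2D list of grey levels; border pixels stay 0."""
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    # x and y derivatives by central differences (assumed derivative filter).
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = sxy = syy = 0.0
            # Sums of the derivative products over a 3x3 window.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx = ix[y + dy][x + dx]
                    gy = iy[y + dy][x + dx]
                    sxx += gx * gx
                    sxy += gx * gy
                    syy += gy * gy
            # Detector output from the matrix [[sxx, sxy], [sxy, syy]].
            resp[y][x] = (sxx * syy - sxy * sxy) - k * (sxx + syy) ** 2
    return resp


def non_max_suppression(resp, threshold):
    """Return (row, col) of interior pixels above threshold that are 3x3 maxima."""
    h, w = len(resp), len(resp[0])
    keep = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = resp[y][x]
            if v <= threshold:
                continue
            neighbours = [resp[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy, dx) != (0, 0)]
            if all(v >= n for n in neighbours):
                keep.append((y, x))
    return keep
```

On a synthetic image containing a single bright quadrant, the response is positive around the quadrant's corner and slightly negative along its straight edges, so the suppression step leaves one interest point instead of a cluster of neighboring candidates.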
A Hessian matrix approximation is used to detect the interest points. The blob technique is chosen for its ability to find image regions with constant properties; this similarity leads to the formation of objects. It also compares properties such as brightness and color with the surrounding regions. Another strength of Hessian blob detection is automatic scale selection for saddle responses. The proposed method uses integral images, proposed by Viola and Jones [33], which lessen the computation time. Simard et al. [34] proposed the same type of images to be adjusted in boxlets. Integral images are used for the fast calculation of box-shaped convolution filters. For an input image I, the integral image I_Σ(k) at a point k = (x, y)^T represents the sum of all pixels of I within the rectangular region formed by the origin and k, as shown in eq. 2:
$$I_{\Sigma}(k) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j) \qquad (2)$$
Once the integral image is computed, it takes three additions to calculate the sum of the intensities over any upright rectangular area [34].
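The constant-time box sum that integral images provide can be illustrated with the short sketch below. It is written in plain Python for clarity, and the function names `integral_image` and `box_sum` are hypothetical; the rectangle is addressed by inclusive row and column bounds, which is a choice made for this example.

```python
def integral_image(img):
    """Return s with s[y][x] equal to the sum of img[0..y][0..x], inclusive."""
    h, w = len(img), len(img[0])
    s = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            s[y][x] = row_sum + (s[y - 1][x] if y > 0 else 0)
    return s


def box_sum(s, top, left, bottom, right):
    """Sum of the original pixels in rows top..bottom and columns left..right."""
    a = s[top - 1][left - 1] if top > 0 and left > 0 else 0
    b = s[top - 1][right] if top > 0 else 0
    c = s[bottom][left - 1] if left > 0 else 0
    d = s[bottom][right]
    # Four lookups combined with three additions or subtractions.
    return d - b - c + a
```

Because the cost of `box_sum` does not depend on the rectangle size, a large box filter is no more expensive to evaluate than a small one.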
The detector uses the Hessian matrix [34] to obtain better accuracy. The blob detector is embedded in this scheme to find the locations where the determinant is maximal. This results in optimal scale selection under image transformations, which is better than the Laplacian operator. For a point k = (a, b), the Hessian matrix H(k, λ) at k and scale λ is defined in equation 3 [34]:
$$H(k, \lambda) = \begin{bmatrix} C_{xx}(k, \lambda) & C_{xy}(k, \lambda) \\ C_{xy}(k, \lambda) & C_{yy}(k, \lambda) \end{bmatrix} \qquad (3)$$
where C_xx(k, λ), C_xy(k, λ), and C_yy(k, λ) are the convolutions of the image at point k with the Gaussian second order derivatives such as ∂^2/∂p^2 g(λ). These derivatives are also called Laplacian of Gaussians. Gaussians are optimal for scale-space analysis, but in practice they have to be discretized and cropped [35], [36]. Moreover, the Hessian matrix is approximated with box filters, which approximate the second order Gaussian derivatives and reduce the computational cost. The 9 × 9 box filters approximate the Laplacian of Gaussians and represent the lowest scale for calculating the blob response maps. Interest points are required at different scales, and scale spaces are usually treated as pyramids, in which the images are repeatedly smoothed with a Gaussian and then sub-sampled to attain a higher level. Gaussian smoothing is applied at this step to perform image enhancement. Another advantage of Gaussian smoothing is that it can be applied at different scale levels to obtain the maximum image information. A further merit of the Gaussian is that it is separable: the two-dimensional convolution can be carried out as two one-dimensional passes with fewer calculations. It reduces the computational cost by selecting fewer samples with a small kernel size, so the resultant feature vectors are small and efficient. Thereafter, the box filters are applied to the image. Box filtering is applied because it is a linear filter over the spatial domain in which each resultant pixel value is generated by averaging its neighbors, and it produces sharp edged information. Another strength is its convolutional pattern, which is time and computation efficient. Box filtering is also adopted because of its equal weights, which allow simple accumulation that is significantly faster than a sliding window algorithm. Moreover, it combines well with Gaussian smoothing. Thus the scale space is analyzed by increasing the filter size instead of iteratively reducing the image size.
The output of the 9 × 9 filters is treated as the initial scale layer, approximating Gaussian derivatives with scale λ, and gives the blob response at the lowest level. The following layers are obtained by filtering the image with gradually bigger masks, taking into account the specific structure of the filters and the discrete nature of integral images. The step size for the succeeding masks is also scaled accordingly. This scale space step reveals the benefits of suppressed fine scaling achieved by applying parametric smoothing kernels with a minimum of parameters. These scale parameters help to generate a fine level of scaling with maximum image information. Another advantage is that the scale space is widely applicable and can be derived from a limited set of axioms. The scale space is divided into octaves, where an octave represents a sequence of filter response maps acquired by convolving the image with filters of increasing size. An octave incorporates a scaling factor of 2 and is equally divided into scale levels. Owing to the discrete nature of integral images, the smallest scale difference between two successive scales depends on the length l_0 of the positive or negative lobes of the partial second order derivative in the direction of derivation. For the 9 × 9 filter, the length l_0 is 3. A 3D non-maximum suppression [37] is applied spatially and over the neighboring scales to localize the interest points. The first and last Hessian response maps are used for comparison only. Therefore, interpolation of the Hessian matrix determinant is applied in scale and image space for the maxima response computation, so that, after interpolation, the smallest obtainable scale is λ.
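The growth of the filter size across scale levels and octaves can be sketched as follows. The schedule below is an assumption in the style of SURF: the text states only the 9 × 9 base filter, the lobe length l_0 = 3, and the doubling of the size increase per octave, so the step of 6 (twice the lobe length), the four layers per octave, and the rule that each octave starts at the second filter of the previous one are illustrative choices. The function name `filter_sizes` is hypothetical.

```python
def filter_sizes(octaves=3, layers=4, base=9, base_step=6):
    """Return one list of box-filter side lengths per octave.

    Assumed schedule: the size step doubles with every octave and each
    octave starts at the second filter size of the previous octave.
    """
    schedule = []
    first = base
    for octave in range(octaves):
        step = base_step * (2 ** octave)
        sizes = [first + layer * step for layer in range(layers)]
        schedule.append(sizes)
        first = sizes[1]
    return schedule
```

Under these assumptions the first octave uses filters of side 9, 15, 21, and 27, and the second octave starts with a filter of side 15.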
Similar considerations hold for the other octaves. For each new octave, the increase in filter size is doubled. Moreover, doubling the sampling intervals minimizes the computational time while the accuracy remains comparable to, or better than, that of the traditional approach. The other octaves can be calculated in the same way. In a typical scale space analysis, the number of interest points detected per octave decreases very quickly. The scale change between the first filters of these octaves is large, which makes the scale sampling coarse. Scale spacing is therefore implemented with an improved sampling rate of scales for the given image. A filter size of fifteen is then used to start the first octave. The lowest scale is computed using quadratic interpolation for better accuracy.
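Quadratic interpolation of a response maximum can be illustrated along a single axis as follows. This is a sketch under a simplifying assumption: the text interpolates in both scale and image space, whereas the example fits a parabola through three equally spaced samples only, for instance the determinant responses at three neighboring scales. The function name `refine_peak` is hypothetical.

```python
def refine_peak(prev, centre, nxt):
    """Fit a parabola through three equally spaced samples.

    Returns (offset, value): offset is the extremum position relative to the
    centre sample in units of the sample spacing, and value is the
    interpolated response at that position.
    """
    curvature = prev - 2.0 * centre + nxt
    if curvature == 0.0:
        # The samples are collinear, so there is no extremum to refine.
        return 0.0, float(centre)
    offset = 0.5 * (prev - nxt) / curvature
    value = centre - 0.25 * (prev - nxt) * offset
    return offset, value
```

When the centre sample is a discrete maximum, the offset lies between -0.5 and 0.5, which places the refined extremum between the sampled scales rather than on one of them.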
The filter responses are already scale normalized, since the Frobenius norm remains constant for the proposed filters at any size. The maxima of the Hessian matrix determinant are interpolated in scale and image space with the method of [37]. Scale space interpolation is especially important, since the difference in scale between the first layers of every octave is relatively large. In the scale space division process, some pixel values are missed and need to be estimated. Interpolation is introduced at this step to best approximate the missing pixel intensities from their neighbors; it calculates the missing points from the known data. This scale space interpolation facilitates the composition of real intensities for correct feature vector generation, which leads to better precision. Interpolation at this step is a novelty of the proposed model; without it, discarded samples would produce less reflective image signatures. The losses that occur during subsampling, reduction, and truncation are approximated using the interpolated values.
To reduce the feature vector size, principal component analysis (PCA) is performed by computing eigen coefficients in cyclic steps to obtain the principal components. It is an orthogonal transformation in which uncorrelated coefficients are formed from correlated variables. These computed uncorrelated coefficients are called principal components (PCs). The number of computed PCs is normally smaller than the number of original variables; however, it can be equal to the number of input variables. The maximum variance is found in the first component, then in the second, and it decreases successively. Each following component is orthogonal to the preceding ones and carries less information. If the data have few internal relations, the results are not meaningful and do not converge to the original values.
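The variance ordering of the principal components can be illustrated with the following sketch, which extracts only the first component. It is not the authors' procedure: power iteration on the sample covariance matrix is a stand-in chosen here for brevity, the function names are hypothetical, and a practical system would use a numerical library for the full decomposition.

```python
import math


def covariance_matrix(rows):
    """Return (covariance matrix, mean-centred rows) for a list of feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centred


def first_component(cov, iterations=200):
    """Dominant eigenvector and eigenvalue of cov by power iteration.

    The uniform start vector is an assumption; it must not be orthogonal
    to the dominant eigenvector for the iteration to converge to it.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    return v, eigenvalue


def project(centred, component):
    """Coordinates of the centred rows along one principal component."""
    return [sum(c * e for c, e in zip(row, component)) for row in centred]
```

For perfectly correlated two-dimensional features, all of the variance falls on the first component, so a single coefficient per sample replaces the two original values without loss.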
PCA is chosen over independent component analysis (ICA) and linear discriminant analysis (LDA) because of its strong data covariance computation and factor scores. Moreover, ICA searches for separable components instead of successive ones. Separately, an RGB image is decomposed into its R, G, and B components, each treated as a channel to represent those features. The RGB channels carry the primary colors that represent the image features. The significance of the proposed method is that it collects the color channel coefficients along with the grey level intensities. The proposed approach performs spatial mapping of these colors to reveal the deep image contents. Coupling the color information with the grey level values generates the maximum image content representation. Color information specifies typical objects, and their positioning with spatial coordinates resolves the semantic similarities; this is the focus of our approach for obtaining better precision and recall rates.
Moreover, a general physical model [38] states that the reflection of a dense sample of a material, with its related components, can be correctly estimated by a function as in equation 4 [38], where ϑ represents the dependency on the angles, σ represents the wavelength [39], and S and B indicate the surface reflection and body reflection respectively. The second part of equation 4 represents the result for non-homogeneous materials. Since surface reflection from an inhomogeneous dielectric is focused in a single direction [38], the contribution of M_S(ϑ), illuminated from a single direction, is measured.
The sensor values measured at each image point (x, y) can be expressed as in equations 5 and 6 [38], [39], where ϑ is a function of the image point (x, y). The material surface given by (5.5) is illuminated by a spectral distribution L(σ). Let S = (s_0, ..., s_{N−1}) denote the directional line; the L2 normalization is then represented in equation 7 [38], [39]. It follows from equation 6 [38], [39] that ŝ_i(x, y) depends on the input sensor, the brightness, and the reflective output; it does not depend on ϑ. L1 normalization is used to produce the color space coordinates that are also called chromatic coordinates [40], [41], so that L1 normalized coordinates, like L2 normalized ones, do not fully reflect the scene geometry. The proposed model applies L2 normalization instead of L1 because L2 optimizes the mean cost rather than the median. The choice of L2 results in a performance gain. Comparatively, the overall error rate is lower with L2 regularization because it limits the outliers. L1 suffers from limited differentiability owing to its overfitting prevention and sparsity enforcement. L2 smooths this and shows invariance with better coverage. Another advantage of L2 is that it squares the input and has a closed form, whereas L1 is an absolute value function; therefore L1 is computationally more expensive. With the L1 norm, the distance between the coordinates of two different materials depends on the line location. Consider two texture based materials separated by an angle θ_1, where θ_1 denotes the angle between the materials. For L1 normalization, the Euclidean distance d_l is given in [39]. For L2 normalization, the Euclidean distance d between the normalized coordinates is given in [39], where d depends on θ, k = 2, and s denotes the sine function. For higher dimensional sensor spaces the situation is the same. It is evident that two points of the same color on a surface will have the same normalized color even if one point is directly illuminated and the other is in shadow [42]. It is also seen that the normalized color behaves differently for dark points.
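The behavior of L2 normalized color coordinates can be checked numerically with the sketch below. It makes two assumptions that are not taken from the omitted equations: shading is modeled as a uniform scaling of the sensor vector, and the distance between two L2 normalized vectors is compared with the chord-length identity d = 2 sin(θ/2) for unit vectors, which is offered as one reading of the statement that d depends on θ with k = 2 and a sine term. All function names are hypothetical.

```python
import math


def l2_normalize(sensor):
    """Divide a sensor vector by its Euclidean norm."""
    norm = math.sqrt(sum(s * s for s in sensor))
    if norm == 0.0:
        raise ValueError("normalized color is undefined at the zero point")
    return [s / norm for s in sensor]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angle_between(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))


def normalized_distance(a, b):
    """Distance between the L2 normalized versions of two sensor vectors."""
    return euclidean(l2_normalize(a), l2_normalize(b))
```

A directly lit point and a shadowed point of the same material, for example (120, 60, 30) and a quarter of it, map to the same normalized color, while two different materials are separated by a distance that depends only on the angle between their sensor vectors.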
Two undesirable properties associated with computing normalized color using the L1 norm have been analyzed. The first property is that the color proportions in sensor space have a zero point s_0 = s_1 = ... = s_{N−1} = 0, where the normalized color is undefined. The next and final step is to perform the indexing and retrieval of images using the bag-of-words (BoW) architecture.
The bag-of-words, or bag-of-visual-words, framework is applied rather than a support vector machine (SVM) because of the multiclass nature of classification and retrieval. BoW uses occurrences as features instead of class-by-class binary matching. Moreover, BoW applies the k nearest neighbors (KNN) model, which stores the current instances, classifies based on similarity, and produces the results efficiently. BoW has the strength of representing the image by local patches, which are treated as numeric vectors in our approach. These vectors are candidate descriptors to gauge and handle the variations with invariances that are otherwise difficult to manage in binary classification schemes. BoW is efficient owing to its clustering and codeword modeling, in which the learned patches are mapped to codewords by clustering. Moreover, BoW is also a powerful alternative to other models, including AdaBoost [43] and pyramid matching [44]. The BoW representation is a histogram description in which each local descriptor is allocated to a visual word. In the offline training stage, a vocabulary {s_1, ..., s_n} of n clusters is trained by K-means. The descriptors of a given image are vector quantized into this pre-built vocabulary. Each descriptor is assigned to its nearest cluster, and a histogram of the local descriptors is constructed to form a fixed length representation of the image with n bins. For efficient comparison of BoW representations, the inverse document frequency is applied with an inverted list. The BoW images are indexed, relevant results are searched in the visual BoW database, and the retrieval results are displayed.
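The quantization and indexing steps described above can be sketched as follows. The example assumes that the vocabulary has already been trained by K-means and is passed in as a list of cluster centres; the logarithmic form of the inverse document frequency and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the cluster centre closest to the descriptor."""
    best, best_dist = 0, float("inf")
    for idx, centre in enumerate(vocabulary):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, centre))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best


def bow_histogram(descriptors, vocabulary):
    """Fixed length histogram with one bin per visual word."""
    hist = [0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1
    return hist


def idf_weights(histograms):
    """log(N / df) per visual word; 0.0 for words that never occur."""
    n_images = len(histograms)
    weights = []
    for word in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[word] > 0)
        weights.append(math.log(n_images / df) if df else 0.0)
    return weights


def inverted_index(histograms):
    """Map each visual word to the ids of the images that contain it."""
    index = {}
    for image_id, hist in enumerate(histograms):
        for word, count in enumerate(hist):
            if count:
                index.setdefault(word, []).append(image_id)
    return index
```

At query time only the images listed under the query's visual words need to be compared, and words that occur in every image receive zero weight because they do not discriminate between images.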
This research work makes the following contributions:
1. The model comprehensively collects and analyzes the entire image content, including texture, color, shape, object, and spatial information, which produces high precision and recall rates.
2. A light-weight feature detection and description model is introduced that efficiently retrieves relevant results from complex and cluttered datasets.
3. A novel image feature fusion method is incorporated by assembling the spatial coordinates with primitive candidates.
4. For the first time, a technique is presented that performs suppression, scaling, and interpolation together to obtain deep, fine image content details.
5. To enforce the semantic difference, a new method with spatial color mapping is introduced to highlight the objects.
6. An innovative methodology is presented that returns remarkable performance on tiny objects, similar textures, objects with complex backgrounds, overlaid ambiguous objects, resized or enlarged images, cluttered patterns, color dominant arrangements, and mimicked, occluded, and cropped objects.
7. The strength of the presented technique is that it reveals only the relevant image content information from anchor translation rather than from complete image iterations.
8. A unique recipe works over color channels and grey levels simultaneously to act upon the symmetric content representation strategy.
9. A time, computation, and storage efficient retrieval system is introduced that retrieves results in a fraction of a second.
10. A new idea is presented that combines the strength of normalized scaled features with the bag-of-words architecture to stimulate indexing and classification.

1) COREL-1,000 DATASET
The Corel-1000 dataset is a renowned standard used for image classification and retrieval [49]. The dataset comprises 1,000 images divided into 10 categories, namely Africa, bus, beach, dinosaur, horse, food, building, elephant, flower, and mountains. Each category includes one hundred images of size 384 × 256 or 256 × 384, as shown in figure 2(a).

2) ALOT DATASET
Amsterdam Library of Textures (ALOT) [51] is a color image dataset of 25,000 images with 250 rough textures, used for scientific purposes and available to download. To capture the sensory variation in object recognition, 10 categories with 100 images of each material were selected. The selected categories include fruit-sprinkles, rope, red-coal, orange-parts, toy-marbles, coins, corn, stones, ice-thick-layer and mandarin-peel, as shown in figure 2(b).

4) COIL-100 DATASET
Columbia Object Image Library (COIL-100) [52] is a standardized database which includes 7200 images of one hundred different objects. The dataset contains 72 rotated views of each object. Sample images of COIL-100 are depicted in figure 3(b). For experimental purposes, 15 objects were chosen. Each object was placed on a turntable against a plain black background and rotated through 360 degrees. To change the object position, images were taken at rotation intervals of 5 degrees.

5) COREL-10,000 DATASET
The Corel-10K dataset [50] is one of the most widely used databases representing various scenes and subjects to test CBIR performance. The Corel-10K database consists of 100 categories and 10,000 images. Every category has 100 images in JPG format of size 85 × 128 or 128 × 85. Images in the dataset cover diverse contents such as sunset, planets, flowers, butterfly, cars, hospital, flags, trees, food, texture, etc., as represented in figure 4(a).

6) IMAGENET SYNSET
ImageNet [45], [46] is a large scale dataset (synset) with over fifteen million high resolution images belonging to a hundred thousand categories, used to index and retrieve multimedia data. The repository contains a huge collection of more than 14,197,120 images. For experimentation, 15 synsets, each containing 100 images, were randomly selected: dust bag, aeria, cherry radio-telephone, nard, tomato, coffee cup, dish, car, gas fixture, flower, golf ball, scootie, wooly bear-caterpillar, flag, and walnut. These classes were chosen due to their complex nature, textures and art, versatility, and object features. Sample images are shown in fig. 4(b).

7) CALTECH-101 DATASET
The Caltech-101 [48] image database is used for image retrieval, image classification, object matching and recognition tasks; sample images are shown in figure 5(a). It contains more than nine thousand images belonging to more than a hundred distinct image categories. Fifteen categories with 80 images each were selected for the retrieval task, including face, airplane, bonsai, face-easy, brain, ketch, chandeliers, things, buddha, tortoise, motorbikes, leopard, wrist watch, butterfly, and ewer. These categories were selected to test the presented technique because they share spatial values and contain rounded and multi-shaped objects with texture information integrated with color channels.

8) CALTECH-256 DATASET
The Caltech-256 dataset [47] contains more than 30 thousand images, which are assigned to 257 categories. The Caltech-256 database is more complex than the Caltech-101 dataset, with greater variation in image categories and contents. Experiments are performed on 15 diverse categories including airplane, swan, back-pack, boxing gloves, bonsai, spider, billiards, tomato, cactus, bulldozer, teapot, butterfly, and teddy-bear. The selected semantic groups, represented in figure 5(b), belong to many areas of real life. All categories in the dataset are important due to their foreground and background objects and texture patterns. A total of 1500 images, with 100 images per category, were selected for experimentation.

9) 17-FLOWERS DATASET
The 17-flowers dataset [54] contains 1360 images of flowers of different sizes belonging to 17 classes of 80 images each. These classes show a high level of variation within each class and resemble other classes. The dataset images were gathered by web surfing and by taking pictures. Images from the dataset are represented in figure 6.

B. EXPERIMENTAL RESULTS
Experiments were performed on 9 standard datasets and results were compared with the research techniques and benchmark descriptors. These descriptors include HOG [57], SIFT [55], SURF [56], DoG [60], LBP [61], RGBLBP [67], and MSER [59]. Experimental results of the top 20 outcomes against the query image for each dataset are shown in figure 7. The accuracy of the proposed approach is calculated by applying challenging measures. These are average precision (AP), mean average precision (mAP), average recall (AR), mean average recall (mAR), precision & recall (P&R), average retrieval precision (ARP) and average retrieval recall (ARR). Precision is calculated by dividing the number of relevant results by the total number of retrieved results. Average precision is the ratio of the summed precision in the respective image category to the total number of iterations. Average retrieval precision is the ratio of the accumulated average precision, in which each category precision is summed back to the first category, to the total number of categories. Before plotting, these ARP values are sorted to show the gain or loss gradually.

Precision
In equations 19-21, @C represents each category, and AP and AR are the computed average precision and average recall rates.
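The precision measures described above can be sketched as follows. This is one possible reading of the textual definitions, not a transcription of equations 19-21: the function names are hypothetical, AP is taken as the mean precision over the queries of one category, ARP as the running mean of the category AP values, and mAP as the plain mean over categories.

```python
def precision_at_k(retrieved_labels, query_label, k=20):
    """Relevant results divided by the number of retrieved results."""
    top = retrieved_labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == query_label) / len(top)

def average_precision(per_query_precisions):
    """Mean precision over the queries (iterations) of one category."""
    return sum(per_query_precisions) / len(per_query_precisions)

def average_retrieval_precision(category_aps):
    """Running mean of AP: entry c averages the APs of categories 1..c."""
    arp, running = [], 0.0
    for count, ap in enumerate(category_aps, start=1):
        running += ap
        arp.append(running / count)
    return arp

def mean_average_precision(category_aps):
    """Single-number summary: mean of the per-category AP values."""
    return sum(category_aps) / len(category_aps)
```

Under this reading, sorting the category AP values before calling `average_retrieval_precision` yields the gradually rising or falling curve mentioned in the text.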
Retrieved results for the Corel-10,000 and ALOT datasets are shown in figure 7 with up to 95% accuracy. The 17-Flowers, Caltech-101, FTVL and COIL datasets have up to 100% accuracy rates, as shown in figure 7. The Corel-1,000 dataset has up to 90% accuracy, as represented in figure 7. Caltech-256 has an accuracy rate of 70% against the query image of the category Airplanes, as presented in figure 7. Fig. 7(h) shows that the ImageNet synset has a 65% accuracy rate in the complex category of tomatoes. The retrieval time for the top 20 images taken by the presented method is approximately 0.3 to 2.19 sec. The variation in the computation time is due to the size and number of images in the datasets. The experiments are conducted on a Core i5 @ 2.5 GHz with 8 GB RAM.

1) EXPERIMENTAL RESULTS OF THE COREL-1000 DATASET VS. RESEARCH METHODS
The Corel-1000 benchmark is used to measure the accuracy of the presented technique. Results of the presented method are shown in comparison with research methods from the literature, namely Kundu et al. [23], Shrivastava and Tyagi [24], Dubey et al. [25], ElAlami et al. [26], Pan et al. [27], ElAlami et al. [28], Walia and Pal [29], Xiao et al. [30], and Irtaza et al. [31]. Results are graphically represented in figure 8, which shows that the presented approach works better than many research methods. The presented technique shows better average precision rates in 'dinosaur' and 'horse' than the comparison research techniques. The mean average precision (mAP) of the presented approach is also compared with other research techniques, as shown in table 13. The presented approach shows an improved mAP value of 0.804 relative to the research techniques. Cumulative results are also remarkable for the presented technique.
The average precision of the presented approach is compared with research techniques in Table 1. The existing methods also show good results in the categories of bus, dinosaur, flower and horse. Shrivastava and Tyagi [24] reported improved rates for the dinosaur category because their method works well with plain background images, but similar results are not observed in all categories because its content analysis is specific and misses overlaid object detection. However, the presented approach shows improved results in dinosaur and other image groups due to its unique ability to find foreground and background objects.
Moreover, Pan et al. [27] reported better precision for the building and mountain categories because their scheme focuses on the scene domain and is weak in object recognition. Walia and Pal [29] provided better precision in dinosaur, flower and horse, showing strength in color and uncluttered images, but low accuracy in texture and complex images. Xiao et al. [30] reported the highest precision in the bus category but are weak in the spatial domain. Irtaza et al. [31] report the highest precision rates in food but fail to capture scene domain attributes. Comparatively, the proposed method reports improved precision in scene, texture, color, and spatial features owing to its entire content analysis methodology.

2) EXPERIMENTAL RESULTS OF THE FTVL DATASET IN COMPARISON WITH EXISTING METHODS
The FTVL dataset [53] is used for experimentation due to its illumination differences, pose variations, partial occlusions and cropped objects. Figure 9 graphically depicts the average precision rates of the presented technique, with significant results in comparison with the existing research techniques [62]. Table 2 presents the average precision results of the presented approach in comparison with existing research techniques. Some methods, such as CDH + SEH, show low accuracy because they miss cropped objects. Some methods incorporate the texture attribute and report average precision but confuse ambiguous objects. Two methods with deep texture patterns reported above average precision by also addressing the shortcomings of other methods. However, objects with similar shape and color are still difficult for them to recognize. The proposed method takes into consideration the color coordinates with texture and shape properties and reports 0.937 mean average precision.

3) EXPERIMENTAL RESULTS OF THE 17-FLOWERS DATASET IN COMPARISON WITH EXISTING METHODS
The 17-Flowers dataset is used for experimentation on colored images with texture and shape. Figure 16 shows comparative results of the presented approach with research techniques for average precision and average recall. The approach in [64] performs spatial matching and computes differences based on this criterion but lacks color with shape matching and therefore yields low precision. The other approaches [64]-[66] incorporate linear coding and report average results. These methods lack shape with texture pattern analysis. The fine-grained approach [63] reports improved results and could be improved further by adding deeper color and shape details. The average precision rate of the presented approach is superior in many flower groups because it combines spatial color and texture patterns with shape information. Table 3 shows the average precision rates for the presented approach versus the existing research techniques. The proposed method reports an mAP of 0.876, higher than all competitive methods.

4) EXPERIMENTAL RESULTS OF BENCHMARK DESCRIPTORS VERSUS THE PROPOSED METHOD
Experimentation is performed on 7 benchmark descriptors and detectors. These widely used key point descriptors and detectors are MSER [59], LBP [61], DoG [60], SIFT [55], SURF [56], RGBLBP [67], and HOG [57]. Image retrieval systems use feature descriptors and detectors for texture analysis, object recognition and detection. Image features such as edges, corners or blobs are extracted from the interest points using detectors and descriptors. Speeded up robust features (SURF), presented at ECCV, is used for image retrieval [18]. Histogram of oriented gradients (HOG), presented at CVPR, is used for classification, object detection and recognition [68], [69] and image retrieval [70]. To detect colors in an image, the RGB model is used, in which red, green and blue channels are fused to produce different colors. An image is represented in the RGB model as a matrix of X × Y × 3 values, one plane per color component, where X and Y are the numbers of rows and columns of pixels. Maximally Stable Extremal Regions (MSER), described in [59], is used for blob detection to find correspondences between image features. Difference of Gaussians (DoG) enhances features by subtracting one Gaussian-blurred version of the image from another, less blurred version. Local binary patterns (LBP), a texture descriptor presented in 1994 [71], is used for texture classification. Scale invariant feature transform (SIFT), presented at ICCV, is used for object detection [72] and content based image retrieval [73], [74]. These descriptors are compared with the proposed method to measure the efficiency of the presented approach.
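As an illustration of one of the benchmark operators, the difference of Gaussians can be sketched with NumPy alone. The kernel radius of three standard deviations, the edge-replication padding and the scale ratio of 1.6 are assumptions chosen for the example and are not taken from the experiments reported here.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian truncated at about three sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur of a 2-D grey image with edge replication."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(image.astype(float), radius, mode="edge")
    # Filter along rows, then along columns; 'valid' removes the padding.
    rows = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, rows)

def difference_of_gaussians(image, sigma=1.0, ratio=1.6):
    """Less blurred image minus more blurred image (band-pass response)."""
    return gaussian_blur(image, sigma) - gaussian_blur(image, ratio * sigma)
```

A flat region gives a response close to zero, while a small bright blob gives a positive peak at its centre, which is why the operator is commonly used to highlight blob-like interest points.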

a: AVERAGE PRECISION (AP) & AVERAGE RETRIEVAL PRECISION (ARP)
In pattern recognition, image retrieval and object classification, Average Precision (AP) is somewhat tricky to interpret. Precision is the fraction of retrieved instances that are relevant. AP further averages precision over all queries and is reported as a single score. To measure the versatility of the proposed method, experiments were performed on 9 standardized datasets. Figure 11 graphically represents the average precision rates of the top 20 retrievals, compared with the benchmarks. Input validation images are collected from all categories to compute the P&R rates for all semantic groups. Figure 11(a) shows that the presented approach outperforms the benchmarks in the semantic groups Horse, dinosaur and Africa from Corel-1,000 due to its superior object recognition capability. Figure 11(b) shows that the ALOT database has a remarkable average precision rate for texture images. The FTVL database shows significant results in most of the image groups, as shown in figure 11(c). The results represented in figure 11(d) show that the COIL database has significant results in all categories. Average Retrieval Precision is shown graphically in Figure 12 and compared with the benchmarks. Figure 12(a) shows that the presented approach outperforms the benchmarks in all categories of the Corel-1000 database. Figure 12(b) shows that the ALOT database has a remarkable average precision rate for texture images. The FTVL database shows remarkable results in most of the semantic groups, as shown in figure 12(c). The results represented in figure 12(d) show that the COIL database has significant results in all categories.
The average precision for the Corel-10,000 database, shown in figure 13(a), is improved in many categories due to the spatial features. Figure 13(b) shows that the ImageNet synset has a considerable precision rate in complex categories. The Caltech-101 dataset contains images with versatile contents, texture patterns and shapes. The presented approach achieves a better precision rate in several image groups, as shown in Figure 13(c). The Caltech-256 dataset has improved results in many categories due to the robustness of the approach, as represented in figure 13(d). As shown in figure 15(a), the proposed method outperforms the others in many semantic groups. The ARP for the Corel-10K database, shown in figure 14(a), is improved in many categories. Figure 14(b) shows that the ImageNet synset has considerable ARP in complex categories. The Caltech-101 dataset includes images with similar objects and complex backgrounds. The presented approach reports significant precision for this dataset, as shown in Figure 14(c). The Caltech-256 dataset has improved results in many categories due to the robustness of the approach, as represented in figure 14(d). As shown in figure 15(b), the proposed method outperforms the others in most of the image categories. The results show that descriptors suitable for object recognition achieve better AP and ARP in the corresponding categories of the image datasets. This is reflected in the graphs: figure 11(a) shows the highest AP for HOG in the bus category of Corel-1000, (b) in rope and red-coal of ALOT, and (c) in Granny Smith apple of the tropical fruits dataset. Similarly, average and below average AP is returned for color and texture image classes such as crocus and iris in figure 15(a). Descriptors suitable for texture and colored texture, RGB and RGBLBP, report improved AP and ARP in most of the categories of Corel-10000 and ALOT in figure 12(a, b). Outside the descriptor's domain, significant performance is not returned, as for FTVL in figure 12(c) and COIL in figure 12(d).
DoG, which applies Gaussian differences, performs better on tiny cropped objects in figure 12(c) and below average on cross-domain categories in figure 12(d) and figure 15(b). SIFT is used for object detection tasks and hence reports better AP and ARP in most of the categories of the object dominant datasets Caltech-101 and Caltech-256 in figure 14(c, d). However, in the color texture dataset 17-Flowers, below average results are reported in figure 15(b). COIL, with its rotated objects, has better precision rates for SURF in figure 11(d) due to its specialty, whilst SURF reports average results on out-of-domain datasets, as depicted in figure 14(a-c). The proposed method reports significant AP and ARP in most of the descriptor domains, including object, texture, color and shape, as observed in figures 11-15. The proposed approach is capable of finding cluttered and cropped objects in the foreground and background, as in figure 13(c, d). It is also able to distinguish texture patterns, as shown in figure 12(b) and figure 15(a, b). Tiny similar objects are successfully distinguished in figures 11(c) and 12(c). Objects in small and large images are accurately identified by the proposed method, as confirmed by figures 11(a), 12(a) and 13(a-d).

b: AVERAGE RECALL (AR) & AVERAGE RETRIEVAL RECALL (ARR)
Recall is the number of retrieved matched images divided by the total number of matched images. Average Recall (AR) is the average probability of complete retrieval, graphically represented in Figures 16, 18 and 20. In the AR graphs, the presented method is compared with the benchmark descriptors. It is evident from the outcomes that the presented method has significant recall rates in all databases. Figure 16(a) shows significant recall rates in different image groups of the Corel-1000 dataset. The ALOT dataset has a significant recall rate, as shown in Figure 16(b). The FTVL dataset has improved recall, represented in figure 16(c). Figure 16(d) shows a considerable recall rate for the COIL dataset. Figure 17 shows the Average Retrieval Recall (ARR) rate for each category. It is evident from the graphs that the presented approach has a significant recall rate in all databases. Figure 17(a) shows significant ARR for the Corel-1000 dataset. The ALOT dataset also has better ARR, as represented in fig. 17(b), and fig. 17(c) presents remarkable ARR for the FTVL dataset. The COIL dataset has better ARR, as shown in fig. 17(d). Figure 18(a) shows a better recall rate in many of the categories of the Corel-10,000 dataset. The ImageNet synset has a significant recall rate, as shown in figure 18(b). The Caltech-101 dataset has improved recall, represented in figure 18(c). Figure 18(d) shows a considerable recall rate for the Caltech-256 dataset.
Figure 20(a) shows that the 17-Flowers dataset has a remarkable recall rate. The ARR for the Corel-10,000 dataset is presented in figure 19(a), which shows considerable results. The ImageNet synset has better ARR, as represented in fig. 19(b). The Caltech-101 and Caltech-256 datasets have improved ARR, as shown in figure 19(c, d). It is observed from fig. 20(b) that the 17-Flowers dataset has remarkable ARR.
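When reading recall values for a fixed number of retrievals, it helps to note that recall is bounded by the cut-off. The short sketch below uses hypothetical function names and assumes, for the example only, a category of 100 relevant images and a top-20 result list; in that setting recall cannot exceed 0.20 even when every retrieved image is correct.

```python
def recall_at_k(retrieved_labels, query_label, n_relevant, k=20):
    """Retrieved matched images divided by all matched images in the database."""
    hits = sum(1 for label in retrieved_labels[:k] if label == query_label)
    return hits / n_relevant

def max_recall(k, n_relevant):
    """Upper bound on recall when only k results are returned."""
    return min(k, n_relevant) / n_relevant

# A perfect top-20 list for a category with 100 relevant images.
perfect_top20 = ["horse"] * 20
best_case = recall_at_k(perfect_top20, "horse", n_relevant=100)
```

Here `best_case` equals `max_recall(20, 100)`, so low absolute recall values at a small cut-off do not by themselves indicate poor retrieval.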

c: PRECISION AND RECALL RATIO
In an image retrieval system, precision is the fraction of the output images that are relevant to the query, while recall is the fraction of the relevant images that are retrieved. Precision and recall are inversely related set-based measures, computed using unordered results. A retrieval system needs to achieve a balance between precision and recall. This situation leads to the need for a Precision-Recall (PR) graph based on the measure and understanding of relevance. Figure 21 depicts the precision and recall of the Corel-1000, ALOT, FTVL and COIL datasets.
PR rates for the Corel-1000 dataset are between 65% and 100%, as shown in fig. 21(a). The PR rate for the ALOT dataset, shown in fig. 21(b), is from 75% to 100%. Figure 21(c) represents the PR for the FTVL dataset, which is from 70% to 100%, and the COIL dataset has more than 80% PR rate in many image groups, as presented in figure 21(d). The Corel-10,000 dataset has more than 70% PR rate in many categories, as shown in figure 22(a). PR rates for the ImageNet synset and the Caltech-256 dataset, shown in figure 22(b, d), are around 75%. Figure 22(c) represents the PR for the Caltech-101 dataset, which is more than 60%. The PR rates for the 17-Flowers dataset, graphically shown in figure 22(d), are from 60% to 100%.
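The trade-off behind a PR graph can be made concrete by computing one (precision, recall) pair at every cut-off of a ranked result list. The sketch below is illustrative only; the function name and the boolean relevance input are assumptions, and it does not reproduce how the plotted curves in this work were generated.

```python
def precision_recall_points(ranked_relevance, n_relevant):
    """(precision, recall) after each rank of a ranked result list.

    ranked_relevance: booleans, True where the result at that rank is relevant.
    n_relevant: total number of relevant images in the database.
    """
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / rank, hits / n_relevant))
    return points
```

For a list such as relevant, irrelevant, relevant with two relevant images in total, recall rises from 0.5 to 1.0 while precision drops from 1.0 to 2/3, which is the inverse relationship the PR graph visualizes.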

d: MEAN AVERAGE PRECISION (mAP) AND MEAN AVERAGE RECALL (mAR) RATES
Mean Average Precision (mAP) is a very popular standard single-number performance measure in information retrieval for comparing search algorithms. mAP is the average of AP across a set of queries. A single bar in the graph represents overall mAP of all categories for the presented approach and the benchmarks.
For the ALOT benchmark, the presented approach shows 0.93 mAP, which is greater than that of the comparison benchmarks, as depicted in figure 23. SIFT shows the lowest mAP rates since it shows the lowest results in the Rope and Ice-Thick-Layer categories. Table 4 shows the mAP of the presented approach and the research based benchmarks. It can be seen that the MSER and RGBLBP descriptors report almost equal mAP. Moreover, HOG, DoG and LBP report average mAP.
For the COIL dataset, the mAP of the presented approach is 0.93, as represented in figure 24, which is 15% higher than HOG. HOG shows the second highest mAP, 0.77. SIFT and DoG show the lowest mAP, 0.20 and 0.30 respectively. SIFT shows low results in the tomato and truck categories because of their identical color. Furthermore, DoG reports mAP rates in the categories having the same color features. Table 5 shows the mean average precision rates for all seven descriptors experimented along with the proposed method.
Figure 25 shows that the mAR of the presented approach for the Corel-10,000 dataset is 0.16. Other methods report lower mAR as compared to DoG and the proposed method. DoG reports the second highest mAR, 0.18. SURF, HOG and SIFT report 0.297 and 0.317 mAP, as shown in table 6, while RGBLBP, MSER and LBP report mAR between 0.317 and 0.367.
For the ImageNet synset, the mAP for MSER and SIFT is 0.433 and 0.47, which is among the highest results. The proposed method reports the highest mAP, 0.5, as represented in figure 26. Other descriptors show average results since the ImageNet synset is difficult to categorize. Table 7 shows the mean average precision in tabular format for the presented approach in comparison with the benchmark descriptors. ImageNet shows average results due to the complex nature of its images, which fall into different semantic groups at the same time. An image may well be correctly indexed based on its individual contents; however, matching against the label of the respective category then marks the result as false. In our case, the bag of words approach collects and indexes the images using the KNN approach, with which the presented approach returns better results in comparison with the challenging descriptors.
The mAP of the proposed method for the Caltech-101 dataset is 0.67, as shown in figure 27. HOG reports less than 0.5 mAP in most of the categories; for this reason, HOG reports below average mAP rates. SIFT and MSER report better mAP rates, 0.437 and 0.460, as compared to the other benchmark descriptors, as depicted in table 8. HOG, RGBLBP and LBP report mAP below 0.270. Moreover, SURF and DoG have mAP in the range of 0.38, which is an average rate as compared to the other methods.
Caltech-256 results are presented to show the competitive outcomes of the presented technique in contrast with the benchmarks. The mAP for the presented approach is 0.50. Other benchmarks show results below 50%. HOG and RGBLBP show the lowest mAP for the Caltech-256 dataset, as shown in table 9.
For the ALOT dataset, the presented approach returns 0.10 mAR, which is better than the competitive benchmarks, as shown in figure 29. SIFT shows the lowest mAR rates. Table 10 shows the mAR for the presented approach and the existing benchmarks. It can be concluded that the MSER and RGBLBP descriptors report almost equal mAR. Moreover, HOG, DoG and LBP report mAR up to 0.17.
For the COIL dataset, the mAR of the proposed method is 0.10, as shown in figure 30. HOG shows the second highest mAR, 0.14. SIFT and DoG report the lowest mAR.
Figure 31 shows the mAR of the presented approach for the Corel-10,000 dataset, which is 0.65. Other methods show much lower mAR, except DoG. Difference of Gaussian shows the second highest mAR, 0.56. SURF, HOG and SIFT report mAR between 0.40 and 0.46, as shown in table 12. Moreover

V. CONCLUSION
This paper introduced an approach to retrieve images by discriminatively recognizing shapes, objects, texture and spatial color information. The described method extracts the spatial image color components and distinctive shape and object features for cluttered objects and complex background images from diverse image categories. Precision is computed in all respects, including average, mean average and average retrieval precision, and the same is done for recall rates against seven challenging benchmarks. Results are also compared with many others from the latest literature. Experimental results for nine highly recognized benchmarks showed that the presented approach yielded outstanding performance as compared to the research techniques and benchmark descriptors. The results show that the fused spatial color and shape features can distinctively retrieve images from texture, shape, color and object datasets. An extension to the presented work will be the integration of convolutional neural networks to achieve further improved results.