Directional Magnitude Local Hexadecimal Patterns: A Novel Texture Feature Descriptor for Content-based Image Retrieval

Social media platforms such as Twitter, Facebook, and Flickr, together with the evolution of digital image capturing devices, have resulted in the generation of a massive number of images. Consequently, digital image repositories have grown exponentially over the last decade. Content-based image retrieval (CBIR) has been extensively employed to reduce the dependency on textual annotations for image searching. An effective feature descriptor is essential to retrieve the most relevant images from the repository. Additionally, CBIR methods often suffer from the semantic gap problem, which must be addressed. In this paper, we propose a novel texture feature descriptor, Directional Magnitude Local Hexadecimal Patterns (DMLHP), based on texture orientation and magnitude, to retrieve the most relevant images. The objective of the proposed feature descriptor is to examine the relationship of neighboring pixels with their adjacent neighbors based on texture orientation and magnitude. Our DMLHP texture descriptor effectively captures the texture and semantic information of images with the same visual content. Furthermore, the proposed method employs a learning-based approach to lessen the semantic gap problem and to improve the understanding of the contents of query images. The presented descriptor achieves an average retrieval precision (ARP) of 66%, 92%, 83%, average retrieval recall (ARR) of 66%, 92%, 83%, average retrieval specificity (ARS) of 99%, 99%, 76%, and average retrieval accuracy (ARA) of 98%, 99%, 85% on the AT&T, MIT Vistex, and Brodatz Texture image repositories, respectively. Our experiments reveal that the proposed DMLHP descriptor performs considerably better, i.e., 95% on AT&T, 92% on BT, and 99% on MIT Vistex, when used with a learning-based approach than with a non-learning-based approach (similarity measure).
Experimental results show that the proposed texture descriptor outperforms state-of-the-art descriptors such as LNIP, LTriDP, LNDP, LDGP, LEPSEG, and CSLBP for CBIR.

of media in the presence of complex backgrounds, and the semantic gap between computers and humans. Therefore, there is a need to develop more effective image retrieval systems that are robust to the above-mentioned limitations. Text-Based Image Retrieval (TBIR) is commonly employed for retrieval, where the search is based on automatic and manual annotation of images. TBIR-based techniques use the text associated with images to search and retrieve relevant images from the repositories. Describing texture images with text can be difficult because different users employ distinct keywords for annotation. Text descriptors are therefore subjective, which results in low retrieval accuracy. Additionally, it is difficult to express the entire visual content of an image in words, so TBIR may produce irrelevant results. To overcome the limitations of TBIR systems, researchers introduced content-based image retrieval. CBIR does not need manual annotation to retrieve visually similar images [2]. A CBIR system builds its feature repository from the visual contents of the images, described by low-level features, that is, texture, shape, color, and spatial location. CBIR gives more prominent attention to local and global information such as the color, texture, and shape of an image. In a CBIR system, an image is provided as the input query instead of a textual query.
We have witnessed a tremendous evolution in CBIR systems over the years to solve many retrieval issues in large repositories. However, some open issues in the image retrieval domain must still be addressed. Bridging the semantic gap for large image repositories remains a challenging problem. The semantic gap refers to the limitation of low-level feature representations of images in describing the actual visual perception of the image, that is, human semantics. Visual similarity between images belonging to two different semantic categories reduces the performance of the CBIR system because images that have no semantic relation are retrieved. Fig. 1 shows two sample images from the MIT Vistex image repository that appear similar in terms of visual perception and image semantics. The degree of similarity in visual content between these images appears the same due to low-level features, that is, color, shape, and texture, as well as high-level image semantics. In fact, these two images belong to different semantic classes: the first image belongs to the "Food" class, whereas the second image belongs to the "Fabric" class. The CBIR system may retrieve irrelevant images due to common visual contents like color, shape, and texture. Similarly, we can observe from Fig. 2 that the similar-looking image "GroundWaterCity" is returned for the query image "Clouds". These images seem to be semantically related due to the presence of common visual contents such as the sky and water, but they actually belong to two different semantic groups. These images illustrate the semantic gap between low-level feature representations and high-level user semantics.
Existing local texture descriptors such as local binary patterns (LBP), local ternary patterns (LTP), and others compute limited directional information and ignore the magnitude information. The retrieval performance of these texture descriptors can be further enhanced by capturing more directional and magnitude information from the neighboring pixels. This observation motivated us to develop a novel texture descriptor that addresses the aforementioned limitations of existing texture descriptors. The proposed DMLHP feature descriptor effectively captures the characteristics of an image based on texture orientation and magnitude. The texture orientation-based pattern (TOBP) encodes more detailed discriminative information in sixteen directions of the neighborhood region using first-order derivatives. The amount of intensity variation is computed along the 0°, 45°, 90°, and 135° directions to capture the maximum changes using a texture magnitude-based pattern (TMBP). To represent the image features, two patterns are formulated (i.e., texture orientation is extracted from TOBP, and magnitude information is extracted from TMBP). Each image is represented as a fusion of texture orientation and magnitude patterns. The histograms of both patterns are concatenated, and the classifier is trained using the resulting features. A learning-based approach is required to bridge the semantic gap in CBIR systems. The proposed descriptor is therefore also used with a learning-based approach to enhance the CBIR accuracy. The results show that the proposed texture descriptor achieves high retrieval and classification performance with both the traditional (similarity-based) and the learning-based approaches. The major contributions of this work are as follows:
1) We propose a novel texture descriptor that is robust to noise, color, pose variations, and a variety of natural and artificial regular textures.
2) We present an effective feature fusion of texture orientation and texture magnitude to extract more discriminative information from the input image.
3) Our method addresses the issue of the semantic gap between high-level semantics and local features.
The remaining sections of the article are organized as follows. In Section II, we provide a comprehensive overview of existing state-of-the-art CBIR systems. The detailed procedure of the proposed texture descriptor is described in Section III. In Section IV, we present the performance evaluation of the proposed CBIR method in detail. Finally, Section V concludes the proposed work.

II. RELATED WORK
This section provides a critical investigation of existing state-of-the-art CBIR techniques. Effective image representation is required for accurate and highly relevant image retrieval. For this purpose, researchers have proposed different local and global descriptors over the last few decades. Although global descriptors are more computationally efficient than local descriptors, they are unable to perform well under certain conditions, e.g., scaling, rotation, and viewpoint changes. On the other hand, local features are more robust under the above-mentioned conditions and are better able to capture the complex texture information in the images [3].
VOLUME 4, 2016
Local descriptors have recently received considerable attention in the image retrieval community because of their robustness to illumination changes, noise, and variations in pose. Different local feature descriptors such as BRISK [4], SURF [5], SIFT [6], and HOG [7] have been used for effective feature extraction. SIFT has been widely employed as a local feature descriptor because of its invariance to scale, rotation, changes in lighting, and 3D camera perspective. The original SIFT descriptor has several drawbacks, such as high computational cost and slow processing. SURF improves the computational efficiency of feature computation compared to SIFT. SURF performs well on blurred and rotated images, but does not perform well when the illumination or viewpoint changes. An enhanced SURF keypoint descriptor was introduced in [8], which incorporated color information to improve the accuracy of keypoint-based descriptors. HOG [7] is an effective feature descriptor used for object detection. The HOG descriptor finds the edges and shape of an object in a localized portion of the image. Therefore, it is invariant to photometric and local geometric variations.
Many extensions have been made to the conventional local binary pattern descriptor, such as the multichannel adder and decoder local binary pattern (MadLBP), modified local binary pattern, multiscale local binary pattern, pyramid local binary pattern, and local derivative radial pattern (LDRP). The dimensionality of the conventional patterns is increased by concatenating the LBP histograms derived from each channel. MaLBP and MdLBP were introduced in [9] to obtain the LBP's cross-channel co-occurrence details. However, the traditional LBP dimensionality issue is reduced by using the multi-channel decoded-based LBP fusion scheme. Mamta et al. [10] presented a variation of the local binary pattern that reduced the execution time by considering the information of only four local neighbors. In [11], LBP is used as a texture descriptor to extract surface textures of multi-object images in face recognition applications. Since the majority of the approaches employed a fixed-scale LBP that was not effective in extracting structural features, Manish et al. [12] used LBP to extract texture features and decomposed the LBP image into multiple scales through DWT. Some of the multi-scale LBP methods consider only boundary pixels while ignoring the inner block information. Prashant et al. [13] introduced a multi-scale LBP, which efficiently extracts dominant features at multiple blocks of 3×3, 5×5, and 7×7 windows. In [14], the conventional LBP is extended to the spatial pyramid domain to improve computational efficiency. In [15], LDRP used multi-level encoding in various directions to overcome the problem of information loss in existing local patterns. Cevik et al. [49] introduced a directional local gradient-based descriptor (DLGBD) for face recognition. DLGBD calculated the relationship of the reference pixel with its neighboring pixels based on ternary encoding (9 states) by considering both predecessor and successor pixels. 
Moreover, this approach examined the relationship of neighboring pixels with their adjacent neighbors by utilizing the mean information of the successor and predecessor adjacent neighbors. However, this descriptor is unable to extract spatial structure information in various directions. To overcome the problem of missing information, this descriptor can be further improved by considering texture orientation, texture magnitude, and edge orientation-based information in different directions.
As we experience an extensive range of colors and textures in natural images, feature descriptors must effectively capture such rich information of colors and textures. In [16], color histogram features, texture discrete wavelet transform features, and edge histogram (EH) features were combined to create a new CBIR system. Li et al. [17] introduced neighbor intensity co-occurrences local ternary pattern (NI-CLTP) for image representation. The Gabor filters were integrated with NI-CLTP to extract the texture details at different scales and directions. By integrating the different cross channel combinations of HSV, Megha et al. [18] implemented a multichannel local ternary pattern to capture texture-chromatic characteristics. In [19], color features were used in combination with a local directional pattern for CBIR. Feature normalization was performed before integrating the histograms of color and LDP features. In [50], Rehan et al. fused the color and texture features where CM was employed to extract the color features, whereas Gabor Wavelet and Discrete Wavelet transforms were used to extract the texture data. Moreover, Color and Edge Directivity Descriptor (CEDD) was employed to improve the feature representation. To overcome the challenge of extracting features from the most visible objects, Rehan et al. [51] presented a CBIR method to extract the texture and color features of the most salient objects. Texture features were extracted through bandelet transform and fused with the color histograms features in the HSV domain to improve the feature space capability. SVM was used to determine the semantic association.
In CBIR, several descriptors have been introduced to combine the benefits of the local visual contents of the images. To integrate the color spatial structure information of an image with local visual contents, low-level local features were used with color features in [20]. Low-level salient visual features were extracted using the accelerated segment test feature descriptor. Then color features were extracted and segmented using the L*a*b* color space model. The fusion of these features performed well compared to the existing CBIR methods. Vibhav et al. [21] extracted color and shape features using color moments (CM) and invariant moments (IM), respectively. The fusion of color and shape features gives significantly higher precision than previous CBIR frameworks. Prashant et al. [22] used a Local Ternary Wavelet Gradient Pattern to capture shape and texture characteristics in images at various resolutions. Since the descriptor computes the features at different image scales, the feature extraction process is computationally more expensive. Existing descriptors such as the structure element histogram (SEH) and color difference histogram (CDH) integrated texture with color features and have certain limitations. SEH was not rotation invariant, and thus did not employ the idea of symmetry with the center pixel, whereas CDH's scale and rotation invariance properties are affected by the usage of strongly correlated adjacent data. The rotation and scale-invariant hybrid descriptor (RSHD) [52] was introduced to address the issues associated with both descriptors by incorporating rotation-invariant structure elements. RSHD fused the texture and color features. The fusion of all local visual features to accurately describe an image has both advantages and drawbacks: the aforementioned integrated low-level local visual features capture high-level semantic meaning much better, but they are computationally expensive in terms of retrieval time due to the large dimension of the feature vector.
FIGURE 1. Related visual textural appearance of two sample images belonging to the MIT Vistex image repository.
Color-based descriptors are unable to achieve reasonable accuracy for images in which multiple objects have similar colors [20]. Shape features often fail to extract sufficient information in the presence of noise, occlusion, and non-rigid deformations. Images captured from a single point of view are particularly sensitive to these effects. The content of many real-world objects, such as clouds, trees, valleys, food, and fabric, can be better described using texture. As a result, texture plays an important role in describing high-level semantics for feature extraction. Many texture-based descriptors have been introduced in the CBIR domain. Bhavana et al. [23] presented a computationally efficient descriptor called the Dual Cross Pattern (DCP). Local DCP sampling was performed in eight directions to extract the information of each pixel twice. DCP improved the grouping approach through joint Shannon entropy to reduce information loss. Ranjit et al. [24] presented a fused feature descriptor by integrating the threshold local binary AND pattern and the local adjacent neighborhood average difference pattern for CBIR. This method provides higher accuracy than other local feature-based methods (e.g., SURF, SIFT, LBP); however, this comes with an increased feature computation cost. Satya et al. [25] implemented a local mean differential excitation pattern (LMDeP) to extract features from noisy texture images. The LMDeP descriptor encoded the correlation between center pixels and their neighbors through differential excitation rather than a gray-level difference. The contourlet tetra pattern was introduced in [26] for efficient image retrieval. This descriptor employed the contourlet transform, instead of spatial first-order derivatives, to determine the direction. The contourlet transform used the Laplacian pyramid and directional filter bank to decompose an image into multidirectional scales. Faiq et al. 
[27] introduced an amended LTP version called the Extended Local Ternary Pattern for CBIR. This descriptor introduced an automatic threshold calculation mechanism for ternary code generation, in contrast to the static threshold used in LTP. Existing patterns such as CSLBP and CSLTP encoded a restricted number of pixels, which makes them unable to work effectively in an unconstrained environment. The feature lengths of these descriptors, on the other hand, are a major concern. In [53], the center symmetric quadruple pattern (CSQP) was introduced to address these drawbacks by encoding the pixels of a large neighborhood in diagonally opposite quadruple space. CSQP generated an 8-bit pattern from 16 pixels in the immediate vicinity. CSQP has a clear advantage in terms of computational complexity. The primary issue with existing intensity order-based descriptors such as the Local Intensity Order Pattern (LIOP) and soft ordinal spatial intensity distribution (soft OSID) is that the size of the descriptor grows with even a small increase in the number of neighboring pixels. Shiv et al. [54] introduced the intensity order-based interleaved local descriptor (IOLD), which is based on the division of N neighbors into k interleaved sets and has been shown to achieve substantially lower time complexity while maintaining reasonable performance under noisy conditions. The internal spatial structural knowledge of an image is represented by texture features, which are more descriptive of high-level image semantics than color features. However, existing texture descriptors have certain limitations: they are sensitive to noise, are less efficient, and fail to capture finer variations and smooth image regions. These limitations must be addressed to achieve better retrieval performance [24].
All the above-mentioned methods employed the traditional similarity matching-based approach (Euclidean distance, Manhattan distance, L2, and Canberra distance) to perform the retrieval task. Existing methods have also employed learning-based approaches for CBIR. Gajanan et al. [28] presented a directional magnitude local triplet pattern to effectively extract both the orientation and structural microscopic details. This feature descriptor improves the retrieval performance using a similarity distance measure as well as an artificial neural network learning approach. In [29], a feature descriptor consisting of SIFT, LBP, LDP, LTP, and HOG was proposed to better extract the patch-level information from the images. The performance of the patch-based descriptor was also evaluated with classifiers, including the support vector machine (SVM) and Random Forest. Chandani et al. [30] integrated the SIFT and Gabor descriptors. The method applied both learning-based and non-learning-based approaches. Experimental results indicate better performance when these features were used with the SVM. Dakshata et al. [31] employed a region-based descriptor based on Zernike's moments and Hu's seven moments with the SVM classifier for CBIR. According to the experimental results, Hu's seven moments perform best in almost all classes of the CT database in terms of accuracy. In [32], EH features were used in combination with color autocorrelogram, color moment, and Gabor wavelet transforms to train the SVM for classification.
Ahmad et al. [36] presented a bilinear CNN-based architecture that utilizes a pre-trained CNN (VGG-m or VGG-16) to maintain highly discriminative image representations at a compact scale. This significantly improves the retrieval efficiency in terms of extraction and search time. Furthermore, the highest retrieval accuracy is reported with various distance measures (Euclidean, Manhattan, and city block). In [37], the DBN method was introduced for learning effective image representations. Safa et al. [38] introduced ensembles of CNNs for the CBIR task. In comparison to individual CNNs, it was demonstrated in [38] that using CNN ensembles is very effective in producing a strong image representation. Amjad et al. [39] suggested a CNN-based deep learning algorithm for feature extraction to improve the CBIR retrieval performance. Miroslav et al. [41] presented 1D GLCM with 25 layers of CNN. In [42], the image retrieval problem was examined using two deep learning approaches: DBN SAID.
The common issues associated with the above approaches further inspired the researchers to progressively investigate CBIR and design more robust feature descriptors to improve retrieval accuracy. In CBIR, texture-based descriptors have been actively explored over the years to effectively capture the discriminative information from the images. The proposed DMLHP texture descriptor is an effort in this direction to capture maximum discriminative information from the images via orientation and magnitude patterns.

III. PROPOSED METHODOLOGY
This section provides a detailed discussion of the proposed CBIR method, which introduces a novel directional magnitude local hexadecimal pattern descriptor. The flow of the proposed method is illustrated in Fig. 3. The proposed method initially applies image resizing as a preprocessing step to prepare the images for further processing. Afterward, we employ the novel DMLHP descriptor to extract the features of each image in the training and testing sets. In the next step, histograms of the texture orientation and magnitude patterns are computed and fused together to generate the feature vector. Next, we train the classifier using this feature vector. After training the classifier, we apply the same procedure to the selected query image. Finally, we compute the similarity between the feature values of the query image and those of the images stored in the repository.

A. DIRECTIONAL MAGNITUDE LOCAL HEXADECIMAL PATTERN FEATURE DESCRIPTOR
Over the last few years, various local patterns such as LBP, MLBP, LDP, and ELTP have been designed for CBIR. Existing local texture patterns have certain limitations, such as sensitivity to noise and lighting conditions, failure to capture finer variations, and difficulty with multiple objects and complex backgrounds. These limitations motivated us to propose a novel directional magnitude local hexadecimal pattern (DMLHP) descriptor. The framework of the proposed feature descriptor, based on texture orientation and texture magnitude, is shown in Fig. 4. The texture orientation-based pattern (TOBP) effectively extracts more discriminative information from the images, as it computes 16 aspects of directional information based on the horizontal, diagonal, and vertical derivatives. Furthermore, we compute the texture magnitude-based pattern (TMBP) using horizontal, diagonal, and vertical derivatives, which capture more detailed edge information from the images. Moreover, we employed the proposed descriptor with learning methods to enhance the classification performance of CBIR.
FIGURE: Block diagram of the proposed work with similarity-based and learning-based approaches.
For a given image I(x, y), we compute the 1st-order derivatives of the grayscale values of the surrounding pixels along the 0°, 45°, 90°, and 135° directions, defined as
where $g_h$, $g_d$, $g_v$, and $g_{db}$ denote the horizontal, diagonal, vertical, and diagonal-back neighbors of the center pixel, respectively. Based on the 1st-order derivative values of the center pixel, the direction of the center pixel is computed using (5). The 2nd-order derivative of the center pixel is defined as:
Finally, we obtain an 8-bit texture orientation-based pattern for each pixel by comparing the (n − 1)th-order derivative direction of the center pixel with the directions of all eight surrounding neighbors, using (6) and (7). Then the orientation pattern is separated into 15 binary patterns. Similarly, for the nth-order texture orientation-based pattern, the (n − 1)th-order derivatives in the horizontal, diagonal, vertical, and diagonal-back directions, denoted as $I^{n-1}_{\alpha}(g_s)|_{\alpha=0^\circ,45^\circ,90^\circ,135^\circ}$, are computed as:
An example of the TOBP calculation procedure for a center pixel highlighted in yellow and its surrounding neighbors highlighted in red is presented in Fig. 5. The direction of the center pixel and of each surrounding neighbor is calculated using (5). As illustrated in Fig. 5, if the direction of the center pixel is the same as that of a surrounding neighbor, the corresponding bit of TOBP is assigned 0 according to (7); if the direction of the center pixel is different from that of the surrounding neighbor, the corresponding bit of TOBP takes the direction of that neighbor using (7). In Fig. 6, for center pixel 5, let the direction of the center pixel $I^{1}_{\mathrm{dir.}}(g_c)$ obtained using (5) be 14. The direction of the first neighboring pixel is 15, which is different from the direction of the center pixel, so the first bit of TOBP retains the neighboring pixel's direction, which is 15 based on (7). Similarly, for the second neighborhood pixel 3, the calculated direction is 11, which is again different from the center pixel direction. 
Hence, the second bit of TOBP is coded with 11. For the third neighborhood pixel 6, the direction is 14, which is the same as the direction of the center pixel; thus, the pattern is coded with 0 according to (7). Furthermore, the remaining neighborhood directions are different from that of the center pixel, and thus the other bits of TOBP are coded with 2, 3, 8, 16, and 5, respectively. The resultant 8-bit TOBP is 15 11 0 2 3 8 16 5. Afterward, the 8-valued orientation code for each direction is separated into 15 binary patterns based on the direction of the central pixel.
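The coding rule in (7) and the worked example above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, directions are assumed to be labeled 1 to 16, and the splitting rule (one binary pattern per direction other than the center's, with a bit set wherever the code equals that direction) is an assumption modeled on tetra-pattern style descriptors rather than a definition taken from this paper.

```python
def tobp_code(center_dir, neighbor_dirs):
    """Code each neighbor as 0 if its direction equals the center's, else its own direction."""
    return [0 if d == center_dir else d for d in neighbor_dirs]


def split_into_binary_patterns(code, center_dir, num_directions=16):
    """Assumed splitting rule: one 8-bit binary pattern per non-center direction."""
    patterns = {}
    for direction in range(1, num_directions + 1):
        if direction == center_dir:
            continue  # the center direction never appears in the code (it is coded as 0)
        patterns[direction] = [1 if value == direction else 0 for value in code]
    return patterns


# Directions taken from the worked example in the text (center direction 14).
neighbor_dirs = [15, 11, 14, 2, 3, 8, 16, 5]
code = tobp_code(14, neighbor_dirs)
binary = split_into_binary_patterns(code, 14)

print(code)         # [15, 11, 0, 2, 3, 8, 16, 5]
print(len(binary))  # 15
print(binary[15])   # [1, 0, 0, 0, 0, 0, 0, 0]
```

With 16 direction labels and the center direction excluded, the split yields the 15 binary patterns mentioned in the text.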
Although the sign information is more important for every binary pattern, the magnitude information, which the binary pattern ignores, also plays a significant role. The magnitude information captures the edge and gradient structure more effectively than other texture descriptors such as LBP. The idea of LBP guided us to introduce a novel magnitude pattern for image retrieval. Since compact texture information lies along the horizontal, diagonal, and vertical directions, the main aim of the proposed pattern is to examine the relationship of neighboring pixels with their adjacent neighbors by utilizing the magnitude information in the horizontal, diagonal, vertical, and diagonal-back directions. The 241st pattern (TMBP) is computed from the magnitudes of the horizontal, diagonal, vertical, and diagonal-back 1st-order derivatives using (12).
Here $T_3$ is a function whose argument x is the difference between the magnitude of a surrounding neighbor and the magnitude of the center pixel, as can be seen in (12), that is,
In the example, the magnitude of the center pixel $M^{1}_{I}(g_c)$ is 2.6, which is calculated using (11). Similarly, the magnitudes of the surrounding neighbors $M^{1}_{I}(g_s)$ are calculated. For the 1st neighborhood pixel 4, the magnitude is 5.4, which is greater than the magnitude of the center pixel. Hence, the first bit of TMBP is coded with 1. Moreover, the magnitude of the second neighbor is 5.8, which is again greater than the magnitude of the center pixel. Hence, we assign the value of 1 to the corresponding bit of TMBP. For the third and fourth neighbors, TMBP is coded with 1. For the fifth neighbor 5, the magnitude is 1.4, which is less than the magnitude of the center pixel, so the magnitude pattern is coded with 0. Based on the neighboring magnitudes, the remaining bits are coded with 1, 1, 1. The resultant TMBP is 1 1 1 1 0 1 1 1.
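The magnitude comparison above can be illustrated with a short Python sketch. Several details are assumptions rather than the paper's definitions: the magnitude is taken as the Euclidean norm of the four 1st-order derivatives, the threshold function is assumed to return 1 when the difference is greater than or equal to zero, the first neighbor is assumed to be the least significant bit, and all function names are hypothetical. Only the magnitudes 2.6, 5.4, 5.8, and 1.4 come from the text; the other neighbor magnitudes are made-up values chosen to reproduce the stated pattern.

```python
import math


def magnitude(d_h, d_d, d_v, d_db):
    """Assumed magnitude: Euclidean norm of the four first-order derivatives."""
    return math.sqrt(d_h ** 2 + d_d ** 2 + d_v ** 2 + d_db ** 2)


def tmbp_bits(center_mag, neighbor_mags):
    """Bit is 1 when the neighbor magnitude is at least the center magnitude (assumed tie rule)."""
    return [1 if m - center_mag >= 0 else 0 for m in neighbor_mags]


def bits_to_value(bits):
    """Pack the bits into one integer, first neighbor as least significant bit (assumed order)."""
    return sum(bit << position for position, bit in enumerate(bits))


center_mag = 2.6
# 5.4, 5.8 and 1.4 are from the worked example; the remaining values are hypothetical.
neighbor_mags = [5.4, 5.8, 3.0, 4.1, 1.4, 6.2, 3.3, 2.9]
bits = tmbp_bits(center_mag, neighbor_mags)

print(bits)                 # [1, 1, 1, 1, 0, 1, 1, 1]
print(bits_to_value(bits))  # 239
```

Packing the bits into a single integer is one convenient way to obtain a pattern value that can later be counted in a histogram.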
After extracting the local image pattern LIP (TOBP and 241st TMBP) of each pixel (j, k), we obtain the histograms of TOBP and TMBP using (14). The final feature descriptor is formed by concatenating the two histograms of our TOBP and TMBP patterns, $\mathrm{Hist}_{TOBP}$ and $\mathrm{Hist}_{TMBP}$, as shown in (16).
where $l \in [0, s(s-1)+2]$, with $s(s-1)+2$ the maximum LIP pattern value, $M \times N$ represents the size of the input image, $x = LIP(j, k)$, $y = l$, and the function $T_4$ is defined as:
The complete procedure of the proposed work is given in Algorithm 1.
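The histogram and fusion steps can be sketched as follows. The sketch assumes that $T_4(x, y)$ is the usual indicator function (1 when x equals y, 0 otherwise), so that each histogram bin simply counts the pixels whose pattern value equals l. The function names and the tiny 2 × 3 pattern maps are hypothetical and serve only to show the mechanics.

```python
def pattern_histogram(pattern_map, num_bins):
    """Count, for each pattern value l, the pixels (j, k) whose pattern equals l."""
    hist = [0] * num_bins
    for row in pattern_map:
        for value in row:
            hist[value] += 1  # equivalent to summing an indicator T4(value, l) over all pixels
    return hist


def fuse_histograms(hist_tobp, hist_tmbp):
    """Concatenate the orientation and magnitude histograms into one feature vector."""
    return hist_tobp + hist_tmbp


# Hypothetical pattern maps with values in [0, 3], one value per pixel.
tobp_map = [[0, 1, 1],
            [3, 0, 1]]
tmbp_map = [[2, 2, 0],
            [1, 2, 3]]

hist_tobp = pattern_histogram(tobp_map, num_bins=4)
hist_tmbp = pattern_histogram(tmbp_map, num_bins=4)
feature_vector = fuse_histograms(hist_tobp, hist_tmbp)

print(feature_vector)  # [2, 3, 0, 1, 1, 1, 3, 1]
```

Concatenation keeps the orientation and magnitude bins separate, so the feature length is the sum of the two histogram lengths.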

B. IMAGE CLASSIFICATION
Ensemble classifiers have demonstrated their effectiveness for various classification tasks in computer vision. We employed the proposed descriptor to train the ensemble subspace discriminant (ESD) classifier. Later, we performed the testing on the unseen images of the repository. Next, we used the proposed descriptor to extract the features of the query image selected by the user, and the trained classifier determined the category of the query image. We employed the ESD classifier for the proposed method, as it gave the best results when used with our novel features. Another benefit of subspace discriminant-based ensemble methods is their ability to handle large image repositories because they are fast, accurate, and easy to interpret. In our case, ESD took a reasonable computational time, with a prediction speed of 40 observations per second.
The range of evaluation parameters for the ensemble subspace discriminant classifier in our work is as follows: the number of learners, range [10, 500]; the learning rate, range [0.001, 1]; and the subspace dimension, range [1, 319]. In our experiments, we explored different hyper-parameter values to enhance the classification performance of different classifiers, namely ESD, SVM, KNN, and XGBoost. We trained these classifiers with different hyper-parameters and reported the results for the settings that gave the best performance. For ESD, we set the learning rate = 0.1, the number of learners = 30, and the subspace dimension = 160, as we achieved optimal results with these settings. For SVM, we set the kernel function = Gaussian and kernel scale = 18 with the one-vs-all multiclass method, whereas for KNN, we set the method = weighted KNN, number of neighbors = 1, and distance metric = Euclidean. For XGBoost, we set the maximum depth = 30, maximum rounds = 500, and evaluation metric = mlogloss.

C. IMAGE MATCHING
In the CBIR approach, our goal is to retrieve the top k most similar images for a query image based on distance. We employed the weighted Manhattan distance (weighted L1 norm) for this purpose due to its ability to yield robust results. Moreover, the weighted Manhattan distance outperforms other similarity measures, such as Euclidean, Minkowski, and chi-square, on higher-dimensional data [40]. Thus, the image matching function is written as DM(p, q), where p_i and q_i represent the feature vectors of the images present in the repository and of the target image, respectively, and DM(p, q) is the image retrieval function of CBIR that returns the most visually similar images ordered from most to least similar.
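The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact weighting used in DM(p, q) is not reproduced in this section, so the per-component weight vector `weights` is an assumption, as are the function names `weighted_manhattan` and `retrieve_top_k` and the toy feature values.

```python
from typing import List, Sequence, Tuple


def weighted_manhattan(p: Sequence[float], q: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Weighted L1 distance: sum_i weights[i] * |p[i] - q[i]|.

    The weights are a hypothetical input; the paper's own weighting
    scheme is not specified in this section.
    """
    if not (len(p) == len(q) == len(weights)):
        raise ValueError("p, q and weights must have the same length")
    return sum(w * abs(pi - qi) for pi, qi, w in zip(p, q, weights))


def retrieve_top_k(query: Sequence[float],
                   repository: Sequence[Sequence[float]],
                   weights: Sequence[float],
                   k: int) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs of the k closest repository images,
    ordered from most to least similar (smallest distance first)."""
    scored = [(idx, weighted_manhattan(feat, query, weights))
              for idx, feat in enumerate(repository)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy example with three-dimensional feature vectors (illustrative values only).
repository = [[1.0, 0.0, 2.0], [0.0, 0.0, 0.0], [1.0, 1.0, 2.0]]
query = [1.0, 0.0, 2.0]
uniform_weights = [1.0, 1.0, 1.0]
top2 = retrieve_top_k(query, repository, uniform_weights, k=2)
```

With uniform weights the function reduces to the plain Manhattan distance; in the toy example the query matches the first repository vector exactly, so that image is ranked first.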

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
This section discusses the experiments conducted to evaluate the performance of our method. We evaluated our method on three standard image repositories, that is, AT&T [33], Brodatz Texture (BT) [34], and MIT Vistex [35]. The comparative methods, namely LNIP [43], LTriDP [44], LNDP [45], LDGP [46], LEPSEG [47], and CSLBP [48], have also used the AT&T, Brodatz Texture, and MIT Vistex open datasets with the same experimental setup used for our proposed approach. The details of these image repositories and evaluation metrics are also provided in this section.

Algorithm 1: Proposed Algorithm
Input: Image Repository C = I_1, I_2, ..., I_n, Query Image X
Output: Classification Output, Retrieved Images S = S_1, S_2, S_I, ..., S_n
1 preprocessing
2 for TOBP do
...
// Retrieval output
28 Image classification methods
29 Classification output

A. PERFORMANCE EVALUATION PARAMETERS
The performance of the proposed descriptor is evaluated using the average retrieval precision (ARP), average retrieval recall (ARR), average retrieval specificity (ARS), and average retrieval accuracy (ARA). Precision is defined as the ratio of the number of relevant images retrieved to the total number of images retrieved. We computed the precision as follows:

$$\text{Precision} = \frac{I_R}{I_T}$$

where $I_R$ and $I_T$ represent the number of relevant images retrieved and the total number of retrieved images, respectively, in response to the inquiry image represented by i. Recall is another evaluation parameter that denotes the ratio of the number of relevant images retrieved to the total number of relevant images in the repository:

$$\text{Recall} = \frac{I_R}{I_C}$$

Here, $I_R$ is the number of relevant images retrieved, and $I_C$ is the total number of images in each category of the repository in response to the inquiry image represented by i. ARP is calculated by averaging the precision over the categories:

$$ARP = \frac{1}{T_C} \sum_{k=1}^{T_C} C_k \quad (22)$$

Here, $C_k$ represents the precision of the k-th category of the image repository, and $T_C$ represents the total number of categories present in the image repository. Similarly, ARR is calculated by using (23).

$$ARR = \frac{1}{T_C} \sum_{k=1}^{T_C} G_k \quad (23)$$

In (23), $G_k$ represents the recall of the k-th category of the image repository. Specificity is the ratio of the number of correctly labeled negative images to the total number of negative images, and we compute the specificity and its average over the categories as follows:

$$S_k = \frac{I_{TN}}{I_{TN} + I_{FP}}, \qquad ARS = \frac{1}{T_c} \sum_{k=1}^{T_c} S_k$$

Here, $S_k$ represents the specificity of the k-th category of the image repository, $T_c$ is the total number of categories present in the image repository, $I_{TN}$ is the number of not matched images that are correctly identified, and $I_{FP}$ is the number of not matched images that are not correctly identified. Accuracy is another evaluation parameter that is calculated as

$$A_K = \frac{I_{TP} + I_{TN}}{I_{TP} + I_{TN} + I_{FP} + I_{FN}} \quad (26)$$

Here, $A_K$ represents the accuracy of the k-th category of the image repository, $I_{TP}$ is the number of matched images that are correctly identified, and $I_{FN}$ is the number of matched images that are not correctly identified.
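The four per-category measures and their averages over categories can be computed from confusion counts, as in the minimal sketch below. The function names, the zero-denominator convention (returning 0.0), and the example counts are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division; returning 0.0 for an empty denominator is an assumption."""
    return numerator / denominator if denominator else 0.0


def category_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Per-category precision, recall, specificity, and accuracy.

    tp: matched images correctly identified
    fp: not matched images incorrectly identified
    tn: not matched images correctly identified
    fn: matched images incorrectly identified
    """
    return {
        "precision": _ratio(tp, tp + fp),        # relevant retrieved / all retrieved
        "recall": _ratio(tp, tp + fn),           # relevant retrieved / all relevant
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
    }


def average_metrics(per_category: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each measure over all categories (ARP, ARR, ARS, ARA)."""
    count = len(per_category)
    return {name: sum(m[name] for m in per_category) / count
            for name in per_category[0]}


# Hypothetical confusion counts for two categories of a 100-image repository.
first = category_metrics(tp=8, fp=2, tn=88, fn=2)
second = category_metrics(tp=10, fp=0, tn=90, fn=0)
averages = average_metrics([first, second])
```

Averaging per category, rather than pooling all counts, gives every category the same weight regardless of how many images it contains.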

B. DESCRIPTION OF THE IMAGE REPOSITORIES
To evaluate the performance of any CBIR method, it must be tested on a diverse and challenging image repository. For this purpose, we selected three standard image repositories covering a wide range of image themes. Moreover, these repositories are diverse in terms of pose variations, noise, occlusions, and a variety of natural and artificial regular textures. We used the commonly used natural-scene and texture-rich image repositories MIT Vistex and BT, together with the AT&T face image repository, for the image retrieval tasks. Every experiment is repeated multiple times, and the average retrieval precision, recall, specificity, and accuracy values are reported. We split each image repository into an 80-20 training-testing ratio for experimentation. The division of image repositories is shown in Table 1. Some samples of each category of image repository are shown in Fig. 8.
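An 80-20 split can be drawn per category so that every category is represented in both sets. The sketch below is one way to do this under stated assumptions: the per-category (stratified) sampling, the fixed random seed, the function name `split_per_category`, and the placeholder image identifiers are illustrative choices rather than details given in the paper.

```python
import random
from typing import Dict, Hashable, List, Tuple

Sample = Tuple[Hashable, str]  # (category label, image identifier)


def split_per_category(images_by_category: Dict[Hashable, List[str]],
                       train_fraction: float = 0.8,
                       seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Randomly split each category into training and testing subsets."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    train: List[Sample] = []
    test: List[Sample] = []
    for category in sorted(images_by_category):
        shuffled = list(images_by_category[category])
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train.extend((category, image) for image in shuffled[:cut])
        test.extend((category, image) for image in shuffled[cut:])
    return train, test


# Placeholder repository shaped like AT&T: 40 categories with 10 images each.
att_like = {c: [f"cat{c:02d}_img{i}" for i in range(10)] for c in range(40)}
train_set, test_set = split_per_category(att_like)
```

For a repository of 40 categories with 10 images each, this yields 8 training and 2 testing images per category, that is, 320 and 80 images in total.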

C. PERFORMANCE ANALYSIS ON THE AT&T IMAGE REPOSITORY
The AT&T image repository consists of 400 images partitioned into 40 different categories, and each category consists of 10 images with a resolution of 92 × 112. Some sample images from the 40 categories of the AT&T image repository are depicted in Fig. 8(a). We performed two different experimental analyses, that is, a similarity matching-based approach and a learning-based approach. For the similarity matching-based approach, we randomly selected an inquiry image from each category of the image repository to measure the retrieval accuracy of the proposed descriptor. The proposed texture features (TOBP + TMBP) are extracted for a given inquiry image and compared with the feature values of the images stored in the image repository based on the similarity index, i.e., the weighted Manhattan distance. For this image repository, the number of top matches (NT) is retrieved in groups of 1, 2, 3, ..., 10 images. The proposed descriptor gives an ARR of 66% using the weighted Manhattan distance. For the learning-based approach, a set of 320 random images from the AT&T image repository is selected for training, and the remaining 80 images are used for testing. To measure the robustness of the proposed descriptor, we evaluated the CBIR results using our features with the ESD classifier. We obtained a high average precision rate of 95%. These results signify the effectiveness of this learning-based approach for CBIR.

1) Performance comparison against different state-of-the-art descriptors
To show the robustness of the proposed descriptor, a comparative analysis of the proposed descriptor against existing state-of-the-art descriptors is provided in terms of ARP and ARR in Figs. 9 and 10, respectively. From these results on the AT&T image repository, we can conclude that the ARP and ARR are inversely related as the value of NT varies. As the value of NT is increased, the ARR also increases because more true positives are retrieved, while the ARP decreases because more false positives are retrieved. The proposed texture descriptor gives an ARR of 66% at the maximum value of NT, that is, 10, which is higher than those of recent texture-based CBIR descriptors. The ARR indicates that our method yields better image retrieval performance than LNIP by 9%, LTriDP by 12%, LNDP by 13%, LDGP by 21%, LEPSEG by 30%, and CSLBP by 23%, as shown in Table 2.
We also tested the robustness of the proposed method for the face recognition task. For this purpose, we designed an experiment to compare the performance of the proposed method against the existing state-of-the-art DLGBD descriptor [49] for facial recognition on the AT&T face image repository, which is diverse in terms of variations in pose, face angles, gender, and race. Moreover, this image repository also includes the face images of people with and without glasses. For this experiment, we used 80% of the images from the AT&T image repository for training and the remaining 20% of the images for testing. For training, we used 8 images from each category (i.e., 40 × 8 = 320 images in total), whereas for testing, we used 2 images from each category (i.e., 40 × 2 = 80 images in total). For classification, we employed the ESD classifier and tuned the following parameters: learning rate, number of learners, and subspace dimension. After extensive experimentation, we selected a learning rate of 0.1, 30 learners, and a subspace dimension of 160, as we obtained the best results with these parameter settings. Using this experimental protocol, we evaluated the performance of the proposed DMLHP and the DLGBD method. The proposed method achieves a classification accuracy of 95%, while the DLGBD method [49] obtains an accuracy of 82.5%. The results of this experiment reveal that the proposed method provides superior recognition performance, achieving 12.5% higher accuracy than the DLGBD method. We can conclude from this experiment that the proposed DMLHP descriptor is also capable of effectively representing facial images to achieve remarkable performance for facial recognition.
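The opposite trends of precision and recall over NT follow from their denominators: precision divides the number of relevant hits by NT, which grows, whereas recall divides it by the fixed category size. The sketch below illustrates this for a single query; the ranked relevance list is hypothetical and the function name `precision_recall_at_nt` is an assumption.

```python
from typing import Sequence, Tuple


def precision_recall_at_nt(ranked_relevance: Sequence[bool],
                           category_size: int,
                           nt: int) -> Tuple[float, float]:
    """Precision and recall after retrieving the top nt images.

    ranked_relevance[i] is True when the i-th retrieved image belongs
    to the same category as the query.
    """
    if nt < 1 or nt > len(ranked_relevance):
        raise ValueError("nt must lie between 1 and the length of the ranking")
    hits = sum(1 for is_relevant in ranked_relevance[:nt] if is_relevant)
    return hits / nt, hits / category_size


# Hypothetical ranking for one query in a repository with 10 images per category.
ranking = [True, True, True, False, True, False, True, False, True, False]
curve = [(nt,) + precision_recall_at_nt(ranking, category_size=10, nt=nt)
         for nt in range(1, 11)]
```

In this toy ranking, precision falls from 1.0 at NT = 1 to 0.6 at NT = 10, while recall rises from 0.1 to 0.6. When NT equals the category size, the two denominators coincide, so precision and recall take the same value for that query.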

2) Performance comparison against different classifiers
To evaluate the effectiveness of the ESD classifier with the proposed descriptor for CBIR, we compared the performance of ESD against conventional classifiers, including SVM, k-nearest neighbors (KNN), and Extreme Gradient Boosting (XGBoost). The results of this comparative analysis in terms of average precision, average recall, average specificity, and average accuracy rate are provided in Table 3. From the results shown in Table 3, we can conclude that the ESD classifier outperforms the other classifiers on almost all the categories and achieves the highest average precision of 95%, the highest average recall of 96%, the highest average specificity of 99%, and the highest average accuracy of 95%. SVM performs second best by achieving an average precision, recall, specificity, and accuracy rate of 86%, 87%, 99%, and 86%, respectively, while KNN achieves an average precision, recall, specificity, and accuracy rate of 85%, 86%, 99%, and 85%, respectively. XGBoost performs the worst and achieves 81% average precision, 79% average recall, 99% average specificity, and 81% average accuracy. A sample retrieval is shown in Fig. 11, where the query image is taken from the 11th category, and all the retrieved images are relevant to that query image.

D. PERFORMANCE ANALYSIS ON THE BRODATZ TEXTURE IMAGE REPOSITORY
The Brodatz texture image repository is a collection of 112 grayscale textures with a resolution of 640 × 640. Each category (D_1, ..., D_112) is divided into 25 non-overlapping sub-images. Thus, the BT image repository contains 2800 images in the form of 112 textural categories, and each category contains 25 images with a resolution of 128 × 128. A few sample images from the BT image repository are depicted in Fig. 8(b).
For the similarity-based approach, the retrieval performance of our descriptor is evaluated by randomly selecting images from each category. Initially, 25 images are retrieved, and the number of retrieved images is then increased in steps of 5 until 70 images are retrieved. The proposed descriptor achieves a high retrieval accuracy of 83% using the weighted Manhattan distance. For the learning-based approach, a set of 2240 images is used to train the classifier, and a test set of 560 images is used to evaluate the proposed descriptor using the ESD classifier. We obtained an average precision rate of 92%, which indicates the superior performance of the proposed descriptor for classification.
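The sub-image layout described above can be checked with a short sketch that enumerates the tile coordinates. The function name `tile_boxes` and the (left, top, right, bottom) box convention are assumptions for illustration; no image library is required.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Enumerate non-overlapping square tiles in row-major order."""
    if width % tile or height % tile:
        raise ValueError("image size must be a multiple of the tile size")
    return [(x, y, x + tile, y + tile)
            for y in range(0, height, tile)
            for x in range(0, width, tile)]


# One 640 x 640 Brodatz texture split into 128 x 128 sub-images.
boxes = tile_boxes(640, 640, 128)
images_per_category = len(boxes)            # 5 x 5 grid of tiles
total_images = 112 * images_per_category    # 112 textures in the repository
```

Each 640 × 640 texture yields a 5 × 5 grid, that is, 25 sub-images, and 112 textures therefore give 2800 images, which matches the repository size stated above.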

1) Performance comparison against different state-of-the-art descriptors
A performance comparison of the proposed descriptor, based on texture orientation and magnitude, with other state-of-the-art methods on the BT image repository is shown in terms of ARP and ARR in Figs. 12 and 13, respectively. When comparing our proposed descriptor with the recent texture-based descriptors, we can see that the retrieval accuracy of the proposed texture descriptor is significantly higher than those of the LNIP, LTriDP, LNDP, LDGP, LEPSEG, and CSLBP descriptors by up to 4%, 7%, 8%, 19%, 20%, and 30%, respectively, as shown in Table 2. These results signify the effectiveness of our descriptor over comparative descriptors for CBIR on the Brodatz Texture image repository.

2) Performance comparison against different classifiers
Due to the high impact of classifiers on the semantic gap problem, we performed a comparative analysis using different classifiers with the proposed descriptor. Apart from the ESD classifier, we evaluated SVM, KNN, and XGBoost. From Table 3, we can observe that the ESD classifier achieved the best results when used with the proposed descriptor. More specifically, we achieved an average precision of 92%, an average recall of 94%, an average specificity of 99%, and an average accuracy of 92%. SVM performed second best and achieved an average precision of 87%, an average recall of 90%, an average specificity of 99%, and an average accuracy of 87%. KNN achieved an average precision of 85%, an average recall of 87%, an average specificity of 99%, and an average accuracy of 85%. The proposed descriptor with the XGBoost classifier performed the lowest by achieving an average precision, recall, specificity, and accuracy of 89%, 88%, 99%, and 89%, respectively, as shown in Table 3.
In Figs. 14(a) and (b), the single image shown in the first row is the query image, while the remaining 10 images are the images retrieved in response to the query image. It can be seen from Fig. 14 that the retrieval results for categories D_16 and D_21 are quite effective. The sample images may look similar in terms of smoothness and the spatial arrangement of colors or intensities; despite this resemblance, the proposed texture descriptor can accurately recognize the texture of images that share the same visual appearance.

E. PERFORMANCE ANALYSIS ON THE MIT VISTEX IMAGE REPOSITORY
The third image repository, MIT Vistex, contains 30 visual texture categories of natural scenes, including MtValley, ValleyWater, GrassLand, GroundWaterCity, and GrassPlantsSky, with a resolution of 512 × 512. It also contains some random categories, e.g., clouds, food, fabric, and buildings. Each category is divided into 16 sub-images with a resolution of 128 × 128, so there are a total of 30 categories with 16 images in each category. Sample images are presented in Fig. 8(c). For retrieval purposes, the images are retrieved in groups of 16, 32, 48, ..., 96. For this experiment, we randomly selected the inquiry images from each category of the MIT Vistex image repository, and we achieved an ARR of 92% at an NT value of 16 using the similarity matching-based approach. For the learning-based approach, we split the image repository into a training set of 360 images and a testing set of 120 images. We observed a significant performance improvement over the similarity matching-based approach. More specifically, we obtained an average precision rate of 99%, which shows that the learning-based approach is clearly the better of the two.

1) Performance comparison against different state-of-the-art descriptors
To evaluate the retrieval capabilities of the proposed descriptor, Figs. 15 and 16 present graphical plots of the proposed descriptor with comparative CBIR descriptors in terms of ARP and ARR on different values of NT, i.e., from the top 16 to 96 images. From the results (Table 2), we can easily observe that the proposed texture descriptor outperforms the comparative CBIR descriptors in terms of ARR. Our descriptor achieves better retrieval performance over LNIP by 2%, LTriDP by 6%, LNDP by 7%, LDGP by 12%, LEPSEG by 14%, and CSLBP by 19%.

2) Performance comparison against different classifiers
We performed the same comparative analysis experiment on different classifiers with the proposed descriptor for the MIT Vistex repository. The average precision, recall, specificity, and accuracy rate comparison is given in Table 3. The proposed (DMLHP) feature descriptor achieves the best average precision, average recall, average specificity, and average accuracy rates with the ESD, i.e., 99%, 98%, 99%, and 98%, respectively. SVM performs second best and achieves an average precision rate, average recall rate, average specificity rate, and average accuracy rate of 98%, 98%, 99%, and 98%, respectively, which is nearly identical to the ESD. Similarly, the average precision, recall, specificity, and accuracy rates for the KNN classifier are 92%, 94%, 99%, and 92%, respectively, while the XGBoost classifier gives an average precision of 85%, an average recall of 86%, an average specificity of 99%, and an average accuracy rate of 85%. Similar to earlier experiments, XGBoost performs the worst for CBIR.
Moreover, we evaluated the accuracy of each semantic class of the MIT Vistex image repository, and we observed that ESD achieves a 100% precision ratio on all the categories that are semantically more enriched because of their overlapped background, color, and texture, while ESD obtains a 60% precision rate on just one category (WheresWaldo). Similarly, ESD achieves a 100% recall ratio on more complex and semantically enriched categories (Clouds, ValleyWater, GroundWaterCity, Grass, Leaves, etc.), while the recall ratio is 75% and 80% on just two categories, i.e., GrassPlantsSky and MtValley, respectively. Moreover, our method achieves 100% specificity on all categories except WheresWaldo, where we obtained 98% specificity. Additionally, the accuracy ratio is 100% on almost all the categories except for three: GrassPlantsSky, MtValley, and WheresWaldo. The accuracy ratio on these categories is 99%, 99%, and 98%, respectively. Overall, the average precision (99%), average recall (98%), average specificity (99%), and average accuracy (98%) rates of the ESD classifier on the Vistex image repository are high compared to those of the other selected classifiers, namely SVM, KNN, and XGBoost. To show the semantic robustness of the proposed descriptor, the results of the top 16 image retrievals, obtained by taking the target image from the semantic category "Clouds" of the Vistex image repository, are shown in Fig. 17. From the visual samples shown in Fig. 17, all the retrieved images belong to the same query class of Clouds, which clearly shows that the proposed descriptor provides accurate results. The most related image class, GroundWaterCity, may be retrieved as the retrieval output because both classes of images (Clouds and GroundWaterCity) are related due to common visual properties such as texture and color, and image semantics such as sky and water. However, the two classes are distinct.
These results demonstrate that our proposed descriptor is capable of retrieving the images of a relevant class even in the presence of images belonging to other semantically similar classes, such as the Clouds and GroundWaterCity classes. Thus, we can argue that the proposed descriptor successfully addresses the issue of the semantic gap in CBIR.

F. PERFORMANCE COMPARISON AGAINST STATE-OF-THE-ART DEEP LEARNING METHODS
The objective of this experiment is to compare the performance of the proposed method against state-of-the-art deep learning methods. For this purpose, we compared the performance of our method (DMLHP-ESD) against the deep learning systems, i.e., DBN & SAID [42] and CNN [41], on the MIT Vistex image repository, and the results are reported in Table 4. From these results, we can observe that the CNN model [41] achieves the lowest accuracy of 95.28%, whereas the proposed method performs best and obtains the highest accuracy of 98%. This comparative analysis illustrates the effectiveness of the proposed method over deep learning models for CBIR.

G. TIME COMPLEXITY ANALYSIS
The objective of this experiment is to compare the computational cost of the proposed method with that of state-of-the-art CBIR methods based on feature descriptors. The response time of a CBIR system is determined by the time it takes to extract the features and retrieve the images. For this purpose, we calculated the average computational cost from two perspectives: average feature computation time and average retrieval time. We computed these times for the proposed and baseline models, and the results are shown in Table 5. We calculated the average feature extraction time of random query images for each of the three image repositories individually: 33.8 seconds for AT&T, 31.29 seconds for Brodatz Texture, and 29.9 seconds for MIT Vistex. Similarly, we computed the average retrieval time on AT&T, Brodatz Texture, and MIT Vistex for the retrieval of 10, 25, and 16 images, respectively, against multiple query images, i.e., 2.1 seconds for AT&T, 2.21 seconds for Brodatz Texture, and 2.7 seconds for MIT Vistex. We then averaged these query feature extraction times, i.e., 31.66 seconds, and retrieval times, i.e., 2.33 seconds, and reported the results of the proposed and comparative methods in Table 5. For a fair performance comparison, we computed the average feature computation time and average retrieval time on the same query images for all of the comparative feature descriptors. From this time complexity analysis, we can observe that the CSLBP descriptor requires the least time for both feature computation and retrieval, whereas LTriDP requires the most time for both feature extraction and retrieval. The proposed (DMLHP) descriptor ranks 4th in terms of efficiency among the seven descriptors used in this experiment. It is worth mentioning that the feature dimensions of the CSLBP, LDGP, and LTrP descriptors are 16, 64, and 80, respectively, while the proposed descriptor has a feature dimension of 320.
Although our descriptor has a much larger feature dimension and requires numerous spatial computations to obtain the gradient directions, LDGP and LTrP are just 0.03 and 0.04 seconds faster than our proposed descriptor. This negligible difference in computational cost between the proposed descriptor and the LDGP and LTrP descriptors is compensated by our method yielding the best average retrieval rate. Moreover, the extraction and retrieval time of the proposed descriptor is less than those of the LEPSEG, LTriDP, and LNIP descriptors. This comparative time complexity analysis demonstrates that the proposed system achieves high efficiency as well as effectiveness for the CBIR task.
It is to be noted that all the experiments in our implementation were executed in MATLAB R2018a, running on a computer system with the following specifications: Intel(R) Core(TM) i3-8130U CPU @ 2.21 GHz processor and 8 GB RAM. The feature extraction and retrieval time of the proposed method can be further improved by using a high-performance GPU, which would make our method more suitable for real-time CBIR applications.

V. CONCLUSION
This paper has presented a novel local texture descriptor that captures the internal structure of an image, based on texture orientation and magnitude, for effective image retrieval. The proposed approach uses 16 directions to represent the visual contents of the image in a robust way due to the formation of orientation and magnitude patterns. Additionally, we employed a learning-based approach to reduce the semantic gap problem. The performance of the proposed descriptor is measured on three standard image repositories that are diverse in terms of pose variations, noise, occlusions, and a variety of natural and artificial regular textures. The experimental results signify the effectiveness of the proposed CBIR system. To measure the retrieval performance of the proposed descriptor, we compared our method with relevant state-of-the-art descriptors. The experimental results indicate the superiority of the proposed method over comparative approaches for image retrieval. Moreover, we have compared the results of both the similarity matching-based approach and the learning-based approach. The comparative results show that the classification-based approach outperforms the conventional similarity matching-based approach by a clear margin, i.e., AT&T 30%, BT 11%, and MIT Vistex 7%. These results demonstrate the effectiveness of the proposed descriptor with a classification-based approach. As compared to some existing feature descriptors, the proposed feature descriptor takes more time to extract a feature vector. Therefore, there is room to improve the computational efficiency of our method. In the future, we plan to enhance the efficiency of the proposed method. Additionally, we will also explore other similarity metric techniques to improve the retrieval results.