Implementing Relevance Feedback for Content-Based Medical Image Retrieval

Content-based medical image retrieval (CBMIR) is a technique for retrieving medical images on the basis of automatically derived image features such as colour, texture and shape. CBMIR has many applications, such as teaching, research, diagnosis and electronic patient records. The retrieval performance of a CBMIR system depends mainly on the representation of image features, which researchers have studied extensively for decades. Although a number of methods and approaches have been suggested, feature representation remains one of the most challenging problems in current CBMIR studies, largely due to the well-known "semantic gap" between machine-captured low-level image features and human-perceived high-level semantic concepts, and many techniques have been proposed to bridge this gap. This study proposes a novel relevance feedback retrieval method (RFRM) for CBMIR. The feedback implemented here is based on voting values produced by each class in the image repository. Eighteen features based on colour moments and GLCM texture descriptors were extracted to represent each image, and eight common similarity coefficients were used as similarity measures. After a brief search using a single random image query, the top images retrieved from each class are used as voters to select the most effective similarity coefficient, which is then used for the final search process. Our proposed method was evaluated on the Kvasir dataset, which contains 4,000 images divided into eight classes and has recently been widely used for gastrointestinal disease detection. Intensive statistical analysis of the results shows that our proposed RFRM method gives the best performance for enhancing both recall and precision, whichever group of similarity coefficients is used.

INDEX TERMS Content-based image retrieval, feature extraction, voting method, relevance feedback.

I. RESEARCH BACKGROUND
Automatic image retrieval has become a major research problem, given its role in managing and retrieving the large amounts of unlabelled image data, often gigabytes in size, that are generated and digitally stored in large databases and that carry visual information both on the web and in networked computing systems.
The traditional retrieval method is text-based image retrieval (TBIR), which uses text annotations alongside images for both storage and retrieval. This method has fallen out of favour because of the manual human effort it requires and because of two common disadvantages. First, accurate textual knowledge about images is hardly available, which makes it difficult to construct successful image retrieval systems. Second, manually labelled collections are problematic because a human annotator can miss an image's specific definition or may disagree with another person over its ''correct'' interpretation [1].

(The associate editor coordinating the review of this manuscript and approving it for publication was Maurizio Tucci.)
In the early 1990s [2]-[4], content-based image retrieval (CBIR) was introduced to address those limitations, and it is currently considered the leading approach to image retrieval. The core of a CBIR system is its ability to produce machine-readable representations of the physical characteristics of an image, such as colour, texture and shape. These descriptions, known as the extracted feature space, can then be compared using similarity coefficients or metrics. The similarity between a query image and each image in the archive is determined by the CBIR method, and the images in the database are retrieved in descending order, with the most similar images at the top, which may help specialists in what is known as case-based diagnosis. In recent years, the use of digital images has become increasingly popular across various sectors, including the medical, scientific and educational sectors. Hospitals and medical facilities generate a large number of digital images as part of their daily routines, such as X-rays, mammograms and magnetic resonance imaging (MRI). Interpreting medical images is a complex task that requires extensive knowledge. Scientists have developed support systems such as computer-aided diagnosis (CAD) systems and content-based medical image retrieval (CBMIR) systems to help radiologists interpret medical images. The most important benefit of CBMIR is that it aids radiologists in identifying similar medical images and recalling previous cases during diagnosis [5]-[7]. Many CBMIR systems are based on image similarity: a user enters a query image, and the system responds by supplying the most similar images according to a certain similarity measure.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The results of the image query are then shown in descending order of similarity. Any CBMIR system involves two main phases: feature extraction (an offline phase) and similarity measurement (an online phase), with similar images retrieved in descending order as the outcome.

II. RELATED STUDIES
Many studies summarize the latest developments and enhancements in this area. From the literature, we can classify them into two categories. The first category covers CBMIR systems, methods and approaches; an intensive review of this category, with future directions, can be found in [8], where the authors gave a brief description of the most common web-based CBMIR systems along with the suitability of each. The second category can be summarized into three main sub-disciplines: local features; visual words and dictionaries; and convolutional neural networks (CNNs) and deep learning. The following paragraphs give a brief description of these approaches.

A. LOCAL FEATURES
Regarding the use of local image features, Tao [9] proposed a multilevel learning-based algorithm for automated medical image annotation based on the image's local appearance, achieving high precision even across different diseases. With the support of content-sensitive hanging protocols and autoinvocation, the algorithm improves the detection of PA-AP chest images. Additionally, Srinivas et al. [10] used a dictionary learning method to retrieve relevant images from the IRMA and FE1 databases. In their study, they used the mean and variance of the image pixels as features, and K-SVD was used to generate dictionaries for each cluster, achieving up to 87.5% performance.
In the same discipline, Shamna et al. [11] examined and analysed the automated medical image retrieval system, integrating the performance-enhancing subject and position probabilities. They presented an automated medical image retrieval system in their study using topic and location models.

B. VISUAL WORDS DICTIONARY
With the increasing usage of bag-of-visual-words (BoVW) techniques, Shamna et al. [12] proposed an unsupervised content-based medical image retrieval (CBMIR) framework based on visual word spatial matching. Additionally, Zhang et al. [13] proposed a BoVW retrieval method for identifying discriminative characteristics between different medical images with a pruned dictionary based on the description of the latent semantic topic.

C. CNN AND DEEP LEARNING
Convolutional neural networks (CNNs) and deep learning have been widely used in the area of CBMIR. Qayyum et al. [14] and Cai et al. [15] proposed deep learning frameworks for CBMIR systems based on a deep CNN trained for medical image classification; the learned features and the classification results are used to retrieve relevant medical images. Sun et al. [16] proposed a novel deep learning approach to medical image retrieval to improve the accuracy of CBMIR in integrated RIS/PACS.

D. FEEDBACK IN OTHER RETRIEVAL AREAS
Relevance feedback, as proposed in this study, has recently been used in information retrieval in general to enhance the retrieval process in terms of recall and precision. Karamti et al. [17] developed an effective retrieval framework that combines a vectorization technique with a pseudo-relevance model. Banerjee et al. [18] proposed a retrieval system that exploits a hybrid feature space (HFS), built by combining low-level image features and high-level semantic words through relevance feedback (RF) rounds, and performs similarity-based retrieval to enable semi-automatic image interpretation. Qazanfari et al. [19] proposed a short-term learning approach focused on relevance feedback for content-based image retrieval, in which a collection of low-level features is used to find images similar to the query image. Recently, Trojmen-Khemakhem and Gasmi [20] used a new relevance feedback idea based on query expansion and the selection of significant concepts for context-based retrieval of medical images. Additionally, relevance feedback concepts were successfully used in our previous studies to improve the performance and effectiveness of ligand-based virtual screening for molecular retrieval [21], [22].

III. PROPOSED METHOD
The main idea of relevance feedback is that the user marks some important images as positive feedback and a number of insignificant images as negative feedback. Based on these labelled samples, the CBMIR system then refines its retrieval process. In this study, the idea is adapted to the top images retrieved by a group of eight similarity measures or coefficients: the top similar retrieved images are used as voters to select the best-performing measure for each image class. Fig. 1 illustrates the general framework of CBMIR with our proposed feedback addition, which is shown as a dotted line.
This study proposes the use of relevance feedback to improve the retrieval accuracy, in terms of recall and precision, of the CBMIR method for gastrointestinal disease images. Our proposed method, shown in Fig. 1, involves the relevance feedback stage, which is the main contribution, as well as several important phases that are described in the following paragraphs.

A. PREPROCESSING
This stage is important because it affects the subsequent feature extraction phase. Here, the RGB colour space of our images is converted to the HSV colour space; many studies have noted that the HSV colour space is more suitable for the colour moment functions, has been used successfully in previous studies and gives accurate numeric values [23], [24]. Additionally, preprocessing includes cropping some images to remove image boundaries that contain noise or labelling information.
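The preprocessing stage described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the fixed border width and the per-pixel stdlib colour conversion are illustrative choices.

```python
import numpy as np
import colorsys

def preprocess(rgb, border=10):
    """Crop a fixed border (noise / label area) and convert RGB -> HSV."""
    h, w, _ = rgb.shape
    cropped = rgb[border:h - border, border:w - border]
    # Per-pixel conversion with the stdlib; a production system would
    # vectorise this or use an image library.
    flat = cropped.reshape(-1, 3) / 255.0
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in flat])
    return hsv.reshape(cropped.shape)

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
hsv = preprocess(img)
print(hsv.shape)  # (44, 44, 3): a 10-pixel border removed on each side
```

In practice the border width would be chosen per image class, since the noisy or labelled margins differ between endoscopic capture settings.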

B. FEATURE EXTRACTION
A group of eighteen features (six features from each colour channel) was extracted from all image data using colour moment feature extraction functions and grey-level co-occurrence matrix (GLCM) texture functions, as shown in Table 1. These functions were proposed by [25] and were used in our previous studies [26], [27]. The other three texture features are part of the texture features introduced by [28], where z_i is a variable indicating intensity, p(z_i) is the histogram of the intensity levels in a region, L is the number of possible intensity levels, m is the mean and σ is the standard deviation.
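A minimal sketch of this extraction step is shown below. It assumes six features per channel: the three colour moments (mean, standard deviation, skewness) plus three histogram-based texture descriptors in the spirit of [28] (smoothness, uniformity, entropy). The exact feature set in Table 1 may differ, and the histogram descriptors here stand in for the GLCM functions to keep the sketch dependency-free.

```python
import numpy as np

def channel_features(ch, levels=256):
    """Six features for one channel: three colour moments plus three
    histogram-based texture descriptors (an assumed reading of Table 1)."""
    ch = ch.astype(float).ravel()
    m, sd = ch.mean(), ch.std()
    skew = ((ch - m) ** 3).mean() / (sd ** 3 + 1e-12)   # third colour moment
    p, _ = np.histogram(ch, bins=levels, range=(0, levels))
    p = p / p.sum()                                     # p(z_i), the histogram
    smoothness = 1 - 1 / (1 + sd ** 2)                  # R = 1 - 1/(1 + sigma^2)
    uniformity = (p ** 2).sum()                         # sum of p(z_i)^2
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()     # -sum p log2 p
    return [m, sd, skew, smoothness, uniformity, entropy]

def extract_features(img):
    """18-D descriptor: six features from each of the three channels."""
    return np.concatenate([channel_features(img[..., c]) for c in range(3)])

img = np.random.randint(0, 256, (32, 32, 3))
vec = extract_features(img)
print(len(vec))  # 18
```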

C. SIMILARITY MEASURING
In this phase, similarity scores are calculated using eight different similarity metrics or coefficients. These similarity coefficients have been widely and successfully used in many previous studies. A comparative review of similarity and dissimilarity coefficients can be found in [29]; a brief description of their formulas and previous related studies is shown in Table 2.
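The paper's exact eight coefficients (Table 2) are not reproduced here, but a ranking step over the feature database with a few common similarity and dissimilarity coefficients might look like this. Distances are negated so that a higher score always means more similar.

```python
import numpy as np

# Illustrative coefficient set; the paper's own eight may differ.
COEFFICIENTS = {
    "euclidean": lambda a, b: -np.linalg.norm(a - b),
    "manhattan": lambda a, b: -np.abs(a - b).sum(),
    "chebyshev": lambda a, b: -np.abs(a - b).max(),
    "cosine":    lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12),
    "canberra":  lambda a, b: -(np.abs(a - b) / (np.abs(a) + np.abs(b) + 1e-12)).sum(),
}

def rank_database(query, db, coeff):
    """Return database indices in descending order of similarity."""
    scores = np.array([COEFFICIENTS[coeff](query, v) for v in db])
    return np.argsort(-scores)

db = np.random.rand(20, 18)          # 20 images, 18-D feature vectors
order = rank_database(db[0], db, "euclidean")
print(order[0])  # 0: the query is its own nearest neighbour
```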

D. FEEDBACK BASED ON RANKING METHOD
This phase constitutes our amendment and contribution to enhance the retrieval process. Here, a few of the top images retrieved from a rapid search based on a single query for each class are used as a voting population to select the most effective, best-performing similarity coefficient. Then, a new search is conducted using this selected coefficient, and finally, the related images are retrieved for the end user. The ranking of the group of similarity coefficients is fully automated and requires no extra effort from the user: the user sends a query image, and our method ranks the eight similarity coefficients and selects the best one for the final search process. A summary of our proposed method is shown in Fig. 2. For m image classes and n top retrieved images, our method first uses a rapid search to retrieve the top similar images for each class. Here, our model was based on the top ten similar images, but any value could be used. Then, for each class, the top retrieved images were counted and used as votes to select the similarity coefficient with the highest score of relevant images. After that, the selected coefficient was used for the final search process, and the final average recall and precision were calculated. Additional explanation is given in the pseudocode in Fig. 3.
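The voting step can be sketched as follows on synthetic data. The two-coefficient set and the clustered class structure are illustrative stand-ins; the real method votes over eight coefficients and the eight Kvasir classes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic repository: two well-separated classes of 18-D feature vectors.
db = np.vstack([rng.normal(0, 1, (50, 18)), rng.normal(5, 1, (50, 18))])
labels = np.array([0] * 50 + [1] * 50)

coeffs = {
    "euclidean": lambda a, b: -np.linalg.norm(a - b),
    "cosine":    lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12),
}

def select_coefficient(query, query_class, db, labels, coeffs, top_n=10):
    """Rapid search per coefficient: the number of relevant (same-class)
    images among the top-n retrieved acts as that coefficient's votes."""
    votes = {}
    for name, fn in coeffs.items():
        scores = np.array([fn(query, v) for v in db])
        top = np.argsort(-scores)[:top_n]
        votes[name] = int((labels[top] == query_class).sum())
    best = max(votes, key=votes.get)   # highest-voted coefficient wins
    return best, votes

best, votes = select_coefficient(db[0], 0, db, labels, coeffs)
print(best, votes)
```

The final search would then call the winning coefficient for the full ranking, exactly as in the similarity-measuring phase.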

E. KVASIR DATASET
Our experiments in this study used the Kvasir dataset [30], which consists of 4,000 coloured endoscopic images annotated by medical experts. The images were grouped into 8 different classes (500 images in each class) based on anatomy, pathology, or polyp removal procedures. Fig. 4 shows sample images from the Kvasir dataset; this dataset has been used in the latest CBMIR-related studies [31], [32].

IV. EXPERIMENTAL RESULTS AND DISCUSSION
There are two types of experiments in this study. The first experiment was performed using one image query from each class, and the top retrieved images for each class were used to provide feedback information in the form of a voting process: the top retrieved images per class voted to select the most effective similarity coefficient for that class, so the coefficient with the highest vote score was used in the subsequent search. The second, main experiment was performed using ten random queries for each class, and the average recall and precision were calculated for different cut-off (top) values of retrieved images, as explained in the following paragraphs. Recall and precision were computed with the following well-known formulas:

Recall = (number of relevant images retrieved) / (total number of relevant images in the database)

Precision = (number of relevant images retrieved) / (total number of images retrieved)

The accuracy of the various methods is also stated with a 95% confidence interval: the recall rate (mean) of any method is predicted to fall within this range 95% of the time. Confidence intervals help us decide whether a difference in recall rate is large enough to be of practical significance. The results in Fig. 6 (the mean, lower and upper bounds of the confidence intervals of the different similarity methods) reveal that we can be 95% confident that our proposed RFRM works best for the Kvasir dataset. Based on these results, we can therefore say with 95% statistical certainty that similarity searching based on our method will do better (or at least no worse) than the conventional similarity coefficients; the same finding was observed for the precision rate (mean) shown in Fig. 7. Visual inspection of the average precision for the top-10 and top-20 retrieved images is also important, since an expert in the diagnosis field could make similar investigations, and it enables comparison of the different retrieval methods.
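Recall and precision for a single query reduce to counting relevant hits in the retrieved list; a minimal helper:

```python
def recall_precision(retrieved, relevant_set):
    """Recall = relevant retrieved / all relevant in the database;
    Precision = relevant retrieved / all retrieved."""
    hits = sum(1 for r in retrieved if r in relevant_set)
    return hits / len(relevant_set), hits / len(retrieved)

# 5 retrieved image ids, 4 relevant ids in the database, 2 hits (2 and 4):
r, p = recall_precision([1, 2, 3, 4, 5], {2, 4, 6, 8})
print(r, p)  # 0.5 0.4
```

Averaging these values over the ten random queries per class, at each cut-off, gives the figures reported in the tables.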
Table 3 shows the average precision values at the top-10 and top-20 for all base coefficients and the proposed method. The table makes it obvious that our proposed method successfully outperformed all base coefficients. Fig. 8 (a and b) also shows the top-10 retrieved image samples for the ''normal z line + normal caecum'' and ''polyps + ulcerative colitis'' classes; in both classes, nine out of ten images are retrieved successfully, as shown.
However, a more quantitative approach is possible using the Kendall W concordance test [33]. This test was developed to measure the level of agreement between multiple sets of rankings of the same set of objects; we used this test in our previous works [34], [35] to evaluate the effectiveness of different retrieval methods. In the present context, the image classes were considered judges, and the recall rates of the different retrieval methods were considered objects. The test outputs are the value of the Kendall coefficient, ranging from 0 (no agreement between the sets of ranks) to 1 (complete agreement), and the associated significance level, which indicates whether this coefficient value could have occurred by chance.
If the value is significant (we used the 0.01 and 0.05 cut-off values), then an overall ranking of the graded items may be given. The Kendall analytical results (for the top-100, top-200, top-300 and top-400) are summarized in Table 4, which identifies the ranking of the different retrieval methods. Tables 3 and 4 show that our proposed method gives the best performance of all retrieval methods for this dataset. The results in Table 3 show that RFRM has the best retrieval performance across the eight image classes in terms of the average precision of ten random image queries. In Table 4, the top-100 results show that the Kendall coefficient value of 0.438 is significant at the 0.01 level; based on this significant result, we can conclude that an overall ranking of the nine retrieval methods can be given, with our proposed method ranked first. Similarly, from the same table, the rankings of the nine methods at the other cut-offs (top-200, top-300 and top-400) show that our proposed method has the best retrieval performance according to the Kendall W test. A comparison of the average recall and precision values in Tables 3 and 4 and Figures 5, 6 and 7 shows that our proposed method often yields significant enhancements. A clear pattern of enhancement of the average recall and precision for each image class is also observed in Table 4, and the same observations can be made from the recall and precision curves in Fig. 5.
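For reference, Kendall's coefficient of concordance W can be computed directly from an (m judges × n objects) rank matrix. This is a minimal sketch that ignores tie corrections; in our setting, each row would hold one image class's ranking of the nine retrieval methods.

```python
import numpy as np

def kendall_w(ranks):
    """Kendall's W for an (m judges x n objects) rank matrix:
    W = 12 S / (m^2 (n^3 - n)), where S is the sum of squared
    deviations of the per-object rank sums from their mean."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    R = ranks.sum(axis=0)                  # rank sum per object
    S = ((R - R.mean()) ** 2).sum()
    return 12 * S / (m ** 2 * (n ** 3 - n))

# Three judges in perfect agreement over four objects:
print(kendall_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))  # 1.0
```

Significance is then assessed against a chi-squared approximation (or exact tables for small n), which is what licenses reading off an overall ranking when W is significant.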

V. CONCLUSION
We developed a new relevance feedback method using a voting technique that can operate over any group of similarity measures. A group of colour and texture features was extracted using well-known feature extraction methods based on colour moments and GLCM texture features, respectively. For similarity measurement, eight common similarity coefficients were used. Experiments on the well-known Kvasir medical image dataset show that this method provides a very simple way to enhance the effectiveness of retrieving related medical images. In future work, a weighted voting method could be considered: instead of treating all eight similarity measures with equal weight, a weighting scheme or another technique could assign a larger vote to an effective coefficient based on the images retrieved in the rapid search.