Automatic Estimation of Ulcerative Colitis Severity by Learning to Rank With Calibration

For automatic disease-severity-level estimation, a large-scale medical image dataset with level annotations is generally necessary. However, attaching absolute-level annotations (such as levels 0, 1, and 3) is very costly and even inaccurate due to the level ambiguity. In this study, we show experimentally that using a ranking function for level estimation can relax this difficulty. We propose a multi-task learning method for automatically estimating disease-severity levels that combines learning to rank with regression. The ranking function of the proposed method is trainable with relative-level annotations and a small number of absolute-level annotations. For relative-level annotation, an annotator only needs to specify that one image has a higher disease level than another; this is much easier than absolute-level annotation. The proposed method enables disease-severity classification by calibrating, through regression, the ranking function trained on relative-level annotations. The effectiveness of the method was demonstrated through a large-scale experiment of ulcerative colitis severity estimation with colonoscopy images.


I. INTRODUCTION
To realize automatic disease-severity-level estimation, we often prepare a dataset with level annotation. Fig. 1 (a) shows an absolute-level annotation, where an annotator attaches an absolute disease level to each image. Using the annotated dataset, we can estimate the disease-severity level by using a regression method or classification method.
Even for medical specialists, attaching accurate absolute-level annotations is difficult. This is because the level of disease is inherently continuous with gradual tissue and organ changes; thus, discrete levels such as absolute levels always have quantization errors. For example, even when a four-level annotation (0, 1, 2, 3) is requested, annotators easily find medical images that should be "level 1.5". Moreover, the level itself easily fluctuates among annotators (e.g., [1]) or even with the same annotator.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.
The purpose of this study was to relax this difficulty by using relative-level annotation instead of the above absolute-level annotation. Fig. 1 (b) shows the idea of the relative-level annotation. Given a pair of images $(x_i, x_j)$, the annotator just specifies the image with a higher severity level. This task is far easier and even more accurate than the absolute-level annotation, especially when the paired images have a clear severity-level difference. Therefore, the difficulty in annotating a large number of images can be greatly reduced by using relative-level annotation.
Given a dataset with relative-level annotations, we can automatically estimate the severity level by training a ranking function $f(x)$. Fig. 1 (c) shows the idea of the so-called bipartite ranking problem. The basic objective of this problem is to train a function $f(x)$ that maximizes the number of sample pairs whose relative-level annotation is "satisfied." More specifically, assume a pair of images $x_i$ and $x_j$ and their relative-level annotation stating that $x_i$ has a higher level (i.e., rank) than $x_j$. The annotation is then satisfied when $f(x_i) > f(x_j)$. The trained function $f$ is expected to correlate with the original (continuous) disease level.
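The notion of a "satisfied" pair can be illustrated with a minimal Python sketch. The function name, the scores, and the pair list below are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def satisfied_fraction(scores, pairs):
    """Fraction of annotated pairs (i, j), meaning 'image i is more severe
    than image j', for which the ranking function agrees: scores[i] > scores[j]."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(1 for i, j in pairs if scores[i] > scores[j])
    return hits / len(pairs)


# Hypothetical rank scores f(x) for four images.
scores = [-100.0, 35.0, 2.0, 40.0]
# Hypothetical relative-level annotations as (higher, lower) index pairs.
pairs = [(1, 0), (3, 1), (2, 1), (3, 0)]

frac = satisfied_fraction(scores, pairs)  # three of the four pairs are satisfied
```

Training a ranking function amounts to pushing this fraction toward 1 on the annotated pairs, usually through a differentiable surrogate loss rather than the count itself.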
Since there is a nonlinear relationship between the original image features and the disease levels, the function $f$ should be highly nonlinear to satisfy as many relative-level annotations as possible. Therefore, we use representation learning with a convolutional neural network (CNN) to obtain such a nonlinear $f$. Fig. 1 (d) shows a nonlinear $f$ obtained with representation learning by using a CNN. The thick green arrow curve shows the nonlinear $f$. The dotted curve shows an isoline on which all samples have the same rank value.
It should be emphasized that the above ranking function is still not enough for practical diagnosis. This is because $f$ satisfies only the order between the samples, and its value has no specific meaning for diagnosis. For example, if we realize a ranking function $f$ for ulcerative colitis (UC) diagnosis with colonoscopy images, the value of $f$ does not have a clear relationship with a common severity level, such as the Mayo score [2]. Colonoscopy images $x_i$ and $x_j$ with Mayo levels 0 and 2 might have the "satisfactory" rank values −100 and 35, but it is impossible to guess the Mayo levels from these rank values.
In this paper, we propose a multi-task learning method for obtaining a calibrated ranking function $f(x)$ by using a large number of (easy) relative-level annotations and a small number of (costly) absolute-level annotations. Roughly speaking, we train a ranking function $f$ to satisfy the relative-level annotations while satisfying $y \sim f(x)$ for each sample $x$ with an absolute-level annotation $y$, as shown in Fig. 1 (e). Through this calibration, the ranking function $f$ can estimate the real-valued absolute levels (such as Mayo scores) of all samples. Fig. 1 (f) shows the three steps of the proposed method. In the first step, an initial ranking function is obtained through a training process using only relative-level annotations, as in Fig. 1 (d). In the second step, several samples are selected based on the estimated rank scores, and absolute-level annotations are attached to them by humans (e.g., medical experts). In the final step, the ranking function is calibrated by multi-task learning of ranking and regression, as shown in Fig. 1 (g). The calibrated ranking function $f$ gives a real-valued severity score. If a discrete severity score is expected, the score can be quantized into several levels, such as Mayo 0, 1, 2. In other words, the calibrated ranking function can be seen as a severity classifier.
We applied the proposed method to a UC-level classification task. Specifically, we obtained an $f$ that estimates the Mayo level of the given endoscopic image $x$. The Mayo level takes discrete values ranging from 0 (normal) to 3 (the most severe). We show that $f$ trained with the proposed method achieves high classification performance (accuracy and F1-score) with far less annotation effort, which can support UC diagnosis.
VOLUME 10, 2022
The main contributions of this paper are summarized as follows:
1) To the best of our knowledge, this is the first trial of using the learning to rank framework for drastically reducing the annotation effort for a medical image dataset through relative-level annotation.
2) We developed a new multi-task learning method that calibrates the rank score to the absolute disease level.
3) Through an experiment to estimate the UC severity, the proposed method achieved even higher performance than the conventional classification methods trained with fully absolute-level annotations. This means that our method increases the estimation performance with much less annotation effort.

II. RELATED WORK
In gastrointestinal diseases, various lesions exist in different parts of the digestive organs, and endoscopy is used for lesion detection. Research on supporting endoscopic imaging diagnosis using machine learning is currently being conducted. There have been many investigations on automating classification tasks, such as classification of gastric cancer [3], [4], gastric precancerous disease [5], colorectal cancer using narrow-band imaging (NBI) images [6], and severity using endoscopic and biopsy histological images of UC [7]. An automatic abnormality detection task on capsule endoscope images has also been investigated [8]-[10]. These machine-learning applications aim to support diagnosis through classification, segmentation, and abnormality detection but do not focus on reducing the annotation cost of training data.
Learning to rank is widely used for recommendation systems and has been used for several image-analysis problems. For example, the ranking function has been applied to image-quality assessment and image attractiveness [11]-[15] because it is difficult to give an absolute quality evaluation for each image in these tasks.
Learning to rank is not common in medical image analysis, despite its usefulness in drastically reducing annotation effort. UC-level estimation is still often formulated as a classification task [16]-[18] and requires a dataset with absolute-level annotation. To the best of our knowledge, only a few studies [19]-[22] used the bipartite ranking problem for medical image analysis. However, none focused on the advantage of the ranking function for annotation-cost reduction. Moreover, some of these studies [19]-[21] just used simple or handcrafted features and thus did not use representation learning, although it drastically enhances the performance of the ranking function.
On the basis of a previous study [23], a ranking task is often converted into a multi-task learning problem (instead of the original bipartite ranking problem) and then used in age estimation [24]-[26] and medical analysis [27]-[30]. Each task in this formulation is a binary classification that determines whether the input sample is above a certain level. This approach requires absolute-level annotation and thus cannot benefit from relative-level annotation.

III. TWO ANNOTATION TYPES
In this study, there are two types of ground-truth labels: absolute labels (ALs) and relative labels (RLs). In the proposed method, RLs are initially given, and then ALs are given to a small number of samples, as shown in Fig. 1 (f).

A. ABSOLUTE LABELS
An AL is a disease-severity level. In this study, it corresponds to one of the four-level Mayo scores. As noted in Section I, giving an accurate AL for a medical image is a difficult task even for experts. This is because of large image appearance variations within each level and ambiguous samples that fall in the middle of two levels, say Mayo 1 and 2. These difficulties increase the annotation costs and thus prevent the realization of a large medical image dataset with ALs.

B. RELATIVE LABELS
An RL is attached by comparing the disease severity of two images, as shown in Fig. 1. For a pair of images $(x_i, x_j)$, the RL $P_{ij}$ takes one of three values according to the following equation:

\[
P_{ij} =
\begin{cases}
1, & \text{if } x_i \text{ has a higher level than } x_j, \\
0.5, & \text{else if } x_i \text{ and } x_j \text{ have the same level}, \\
0, & \text{otherwise.}
\end{cases}
\tag{1}
\]

The annotation for RLs is much easier than that for ALs because annotators do not need to identify the level of difficult samples that have a middle level of severity, such as level 1.5.
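As a small illustration of Eq. (1), the following Python sketch derives an RL from two known severity levels, which is one way to simulate relative-level annotation when ALs are available for evaluation. The function name is hypothetical, and the value 1 for the "higher" case is assumed from the usual RankNet convention.

```python
def relative_label(level_i, level_j):
    """Relative label for the pair (x_i, x_j) given their severity levels.

    Returns 1.0 if x_i is more severe, 0.5 if both have the same level,
    and 0.0 otherwise (assumed three-valued encoding of Eq. (1)).
    """
    if level_i > level_j:
        return 1.0
    if level_i == level_j:
        return 0.5
    return 0.0


# Hypothetical Mayo levels for three images.
levels = [0, 2, 2]
labels = [relative_label(levels[i], levels[j]) for i, j in [(1, 0), (1, 2), (0, 2)]]
```

In practice an annotator would provide these labels directly by comparing two images, without ever assigning the underlying levels.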

IV. LEARNING TO RANK WITH CALIBRATION
The proposed method consists of three steps. In step 1, we first train the initial ranking function by using learning to rank with RLs. In step 2, we then select a small number of samples from the training data and annotate them with ALs. The samples are selected using the ranking function trained with the RLs. In step 3, we finally carry out multi-task learning with the RLs and the additionally and adaptively prepared ALs, to calibrate the ranking function to more meaningful disease-severity levels. An overview of the proposed method is shown in Fig. 1 (f).
Before providing further details of the above steps, two important aspects should be clarified. First, the rank score given by the ranking function in step 1 is not a disease-severity level, and thus the calibration of step 3 is necessary. Second, ALs are not given in advance but after the ranking function is trained with RLs. This allows a more appropriate choice of the samples to which ALs should be attached, resulting in more accurate severity-level estimation with less AL annotation cost.

A. LEARNING TO RANK
In step 1, we train the initial ranking function using learning to rank. The ranking function $f(x)$ is trained with a CNN for the representation (i.e., feature extraction) that is suitable for the ranking. The CNN is composed of multiple convolutional layers, a single fully connected layer, and a single output node that gives a single scalar value $f(x)$. This can be considered a powerful extension of the classical RankNet [31], where a linear ranking function is trained using a very shallow neural network.
The CNN is trained using sample pairs with RLs. For training, we input two images to two CNNs with shared weights and then minimize the loss for the pair. Specifically, the CNN is trained with the loss function $L_{\mathrm{rank}} = \sum_{(i,j)\in\mathcal{P}} L^{i,j}_{\mathrm{rank}}$, where $\mathcal{P}$ is the set of sample pairs. The per-pair loss $L^{i,j}_{\mathrm{rank}}$ is defined as a cross-entropy (Eq. (2)).
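The explicit form of Eq. (2) is not reproduced in this text. As a hedged sketch, assuming the standard RankNet formulation [31], the per-pair cross-entropy could be written as follows, where $\hat{P}_{ij}$ (a symbol introduced here for illustration) is the model's estimated probability that $x_i$ ranks higher than $x_j$ and $P_{ij}$ is the RL of Eq. (1):

```latex
\hat{P}_{ij} = \frac{1}{1 + \exp\bigl(-(f(x_i) - f(x_j))\bigr)}, \qquad
L^{i,j}_{\mathrm{rank}} = -P_{ij}\,\log \hat{P}_{ij} \;-\; (1 - P_{ij})\,\log\bigl(1 - \hat{P}_{ij}\bigr)
```

Under this assumed form, a pair with $P_{ij} = 1$ is penalized unless $f(x_i)$ clearly exceeds $f(x_j)$, and a pair with $P_{ij} = 0.5$ is penalized least when the two scores coincide.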

B. SAMPLING FOR ABSOLUTE-LEVEL ANNOTATION
In step 2, a small number of samples are selected from the training data and annotated with ALs. As noted above, the proposed method assumes that the samples to which ALs are attached are selected after $f(x)$ is estimated. This is more reasonable than, for example, a random selection because the clues from $f(x)$ let us select samples that are expected to be more useful for the calibration step.
To select a small number of samples, we first obtain the rank scores of the training samples using $f(x)$ and represent each rank score as a point on a number line. Next, we select $M$ ($\ll N$) samples at equal intervals on the number line between the minimum and maximum rank scores. Finally, ALs are attached to the selected $M$ samples by absolute-level annotation.
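The equal-interval selection can be sketched in Python as follows. This is one possible reading, not necessarily the exact procedure used in the experiments: it places M equally spaced target values between the minimum and maximum rank scores and picks, for each target, the nearest not-yet-selected sample. The function name and the example scores are hypothetical.

```python
import numpy as np


def select_for_absolute_annotation(rank_scores, m):
    """Return indices of m samples whose rank scores are closest to m
    equally spaced targets between the minimum and maximum rank score."""
    scores = np.asarray(rank_scores, dtype=float)
    if m > len(scores):
        raise ValueError("m must not exceed the number of samples")
    targets = np.linspace(scores.min(), scores.max(), m)
    available = np.ones(len(scores), dtype=bool)
    chosen = []
    for t in targets:
        dist = np.abs(scores - t)
        dist[~available] = np.inf      # never pick the same sample twice
        idx = int(np.argmin(dist))
        chosen.append(idx)
        available[idx] = False
    return chosen


# Hypothetical rank scores of six training samples.
selected = select_for_absolute_annotation([0.0, 0.1, 0.2, 5.0, 9.9, 10.0], m=3)
```

Compared with random selection, spacing the targets over the score range tends to cover rare, high-score samples, which matters when severe cases are scarce.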

C. MULTI-TASK LEARNING
In the final step 3, the proposed method calibrates the ranking function to give the absolute severity score. This calibration process can be seen as a fine-tuning process of $f(x)$ so that the output of $f(x)$ becomes closer to the AL of $x$. At the same time, we need to be careful that the fine-tuning process does not destroy the sample ranks learned in $f(x)$. These two requirements lead to a multi-task learning formulation for fine-tuning $f(x)$.
As shown in Fig. 1 (g), the multi-task learning combines regression, which makes $y \sim f(x)$ for each sample $x$ with an AL, and learning to rank for the pairs $(x_i, x_j)$ with RLs. The loss function of learning to rank is the cross-entropy of Eq. (2). The loss function for regression, $L_{\mathrm{reg}} = \sum_{(i,j)\in\mathcal{P}} L^{i,j}_{\mathrm{reg}}$, is defined by adding the mean squared error (MSE) loss for each sample pair (Eq. (3)), where $y_i$ and $y_j$ are the ALs attached to $x_i$ and $x_j$, respectively. Furthermore, the multi-task loss function $L_{\mathrm{multi}}$ is defined as the sum of the loss functions of learning to rank and regression (Eq. (4)), where $\lambda$ is a hyper-parameter to balance the losses. The trained multi-task $f(x)$ is expected to be a ranking function corrected for the region of severity levels in the feature space optimized by representation learning, as shown in Fig. 1 (e). Note that, in step 3, we only use the $M$ samples with ALs. Therefore, $L_{\mathrm{rank}}$ in Eq. (4) is minimized with the RLs of $M(M-1)$ pairs. Finally, the calibrated rank score $f(x)$ is quantized into the nearest discrete disease-severity level as the classification result of $x$. For example, with the severity levels in $\{0, 1, 2, 3\}$, the level becomes 3 for an $x$ with $f(x) = 2.7$.
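The explicit forms of Eqs. (3) and (4) are not reproduced in this text, so the following NumPy sketch rests on stated assumptions: the ranking term is the RankNet-style cross-entropy, the regression term is the sum of squared errors of both samples in a pair, λ weights the regression term, and the RLs of the M(M−1) ordered pairs are derived from the ALs. All function names are hypothetical.

```python
from itertools import permutations

import numpy as np


def rank_loss(f_i, f_j, p_ij):
    """RankNet-style cross-entropy for one pair (assumed form of Eq. (2))."""
    diff = f_i - f_j
    log_p = -np.logaddexp(0.0, -diff)     # log(sigmoid(diff)), numerically stable
    log_1mp = -np.logaddexp(0.0, diff)    # log(1 - sigmoid(diff))
    return -(p_ij * log_p + (1.0 - p_ij) * log_1mp)


def reg_loss(f_i, f_j, y_i, y_j):
    """Squared errors of both samples in the pair (assumed form of Eq. (3))."""
    return (y_i - f_i) ** 2 + (y_j - f_j) ** 2


def multi_task_loss(f, y, lam=0.01):
    """Sum over all M(M-1) ordered pairs of rank loss + lam * regression loss.

    Placing lam on the regression term is an assumption of this sketch.
    """
    total = 0.0
    for i, j in permutations(range(len(y)), 2):
        p_ij = 1.0 if y[i] > y[j] else (0.5 if y[i] == y[j] else 0.0)
        total += rank_loss(f[i], f[j], p_ij) + lam * reg_loss(f[i], f[j], y[i], y[j])
    return float(total)


# Hypothetical ALs and two candidate score vectors for M = 4 samples.
y = [0.0, 1.0, 2.0, 3.0]
loss_good = multi_task_loss([0.0, 1.0, 2.0, 3.0], y)   # correct order and scale
loss_bad = multi_task_loss([3.0, 2.0, 1.0, 0.0], y)    # reversed order
```

The regression term anchors the scale of the scores to the ALs, while the ranking term keeps the ordering learned in step 1 from being destroyed during fine-tuning.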

A. DATASET
We used 10,265 colonoscopy images of UC from 388 patients at Kyoto Second Red Cross Hospital as the dataset. These images were taken from multiple patients (including healthy participants). The images have different sizes and therefore were resized to 224 × 224 pixels. Fig. 2 shows several examples of each of four levels of Mayo, which is the standard disease severity score for UC. According to Schroeder et al. [2], Mayo 0 is normal or endoscopic remission. Mayo 1 is a mild level showing erythema (i.e., abnormal redness), a decreased vascular pattern, and mild friability. Mayo 2 is a moderate level showing marked erythema, an absent vascular pattern, friability, and erosions. Mayo 3 is a severe level with spontaneous bleeding and ulceration.
Although our method does not require a dataset with full ALs, we attached ALs to all samples for a quantitative performance evaluation. Specifically, a four-level Mayo score is carefully attached to each colonoscopy image by multiple medical experts. The dataset contains 6,678, 1,995, 1,395, and 197 samples for Mayo 0, 1, 2, and 3, respectively. Note that it is common to have such a heavily imbalanced dataset for colonoscopy, as well as other medical image diagnosis tasks.
In the following experiments, five-fold cross-validation was performed. The colonoscopy images were divided into 60%, 20%, and 20% for patient-disjoint training, validation, and test sets, respectively. Note that we divided the images so that the training, validation, and test sets have the same severity proportion, to keep a fair and practical evaluation scenario.
For a fair comparison with a conventional method (detailed later), we carefully controlled several conditions. First, we used the same number of annotations for the conventional and proposed methods. More precisely, the conventional method uses 8,212 images (80% of the entire data) with ALs for training, and the proposed method uses 8,212 − M pairs with RLs at step 1 and M images with ALs at step 3. Since an AL carries more information than an RL, this condition is a handicap for the proposed method. Nevertheless, we adopted this condition so that the conventional method would not be disadvantaged.
Second, we allowed over/under-sampling to remove class imbalance for the conventional method but not for the proposed method. This is because the conventional method has ALs for all samples, which makes such sampling possible, whereas the proposed method does not. This condition is another large handicap for the proposed method. Table 1 shows the number of ground-truth labels (RL and AL) and the annotation time for each method. According to our interview with endoscopists, AL labeling takes 20 seconds per image, and RL labeling takes only about one second (or slightly less). For the case M = 400, this indicates that the proposed method requires just about 10% of the annotation time of the conventional method.
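The annotation-time figure can be checked with a short calculation. The sketch below uses the counts and per-label times stated above and assumes exactly one second per RL pair (the text gives "less than but roughly" one second); the function and constant names are hypothetical.

```python
N_LABELS = 8212      # total number of annotations used by either method
AL_SECONDS = 20      # time per absolute label, from the endoscopist interview
RL_SECONDS = 1       # assumed time per relative label (roughly one second)


def annotation_seconds(m):
    """Return (conventional, proposed) annotation time in seconds
    when the proposed method uses m ALs and N_LABELS - m RL pairs."""
    conventional = N_LABELS * AL_SECONDS
    proposed = (N_LABELS - m) * RL_SECONDS + m * AL_SECONDS
    return conventional, proposed


conventional, proposed = annotation_seconds(400)
ratio = proposed / conventional   # slightly below 0.10 for m = 400
```

With these numbers the conventional method needs 164,240 seconds and the proposed method 15,812 seconds, which is consistent with the stated figure of about 10%.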

C. IMPLEMENTATION
The implementation environment was as follows. We used an Intel(R) Core(TM) i9-10980XE 3.00 GHz as the CPU and two NVIDIA TITAN RTX 24 GB as GPUs for training. We wrote the code in Python 3.6 and used TensorFlow 1.13.1 and Keras 2.2.4 as the deep learning library. The CUDA version was 10.0. We used Adam as the optimizer to train the weight parameters. The learning rate was set to $5 \times 10^{-6}$. The learning was terminated by the early stopping rule (no decrease in validation loss for 20 epochs). For $\lambda$ in Eq. (4), we examined the range of 0.001 to 1 and obtained the highest F1-score on the validation set at $\lambda = 0.01$.
We used DenseNet [32] as the CNN. DenseNet has been widely used in various medical-image classification and analysis tasks due to its state-of-the-art performance (e.g., [33], [34]).

D. EVALUATION METRICS
The proposed method is evaluated on four-class Mayo classification performance by accuracy, recall, precision, and F1-score. Recall that the class is determined by quantizing the rank score into its nearest level, e.g., 2.7 → 3. We leave the test samples imbalanced to mimic realistic medical situations. To avoid the risk of under- or over-estimation by the accuracy values in this imbalanced situation, the F1-score is also employed.
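The quantization and an F1 computation can be sketched with NumPy as below. The averaging scheme is not specified in this text, so the macro average over the four classes used here is an assumption, as are the function names and example values.

```python
import numpy as np

LEVELS = (0, 1, 2, 3)   # Mayo levels


def quantize(scores, levels=LEVELS):
    """Map each real-valued score to the nearest discrete level.
    Exact ties (e.g. 1.5) go to the lower level because argmin takes the first."""
    scores = np.asarray(scores, dtype=float)
    lv = np.asarray(levels)
    idx = np.argmin(np.abs(scores[:, None] - lv[None, :]), axis=1)
    return lv[idx]


def macro_f1(y_true, y_pred, levels=LEVELS):
    """Unweighted mean of per-class F1-scores (assumed averaging scheme)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1s = []
    for c in levels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))


# Hypothetical calibrated scores; out-of-range values are clipped to 0 or 3.
pred = quantize([2.7, -0.4, 1.2, 5.0])
score = macro_f1([3, 0, 1, 3], pred)
```

Because every class contributes equally to a macro average, a rare class such as Mayo 3 influences the score as much as Mayo 0, which is why an F1-type metric complements accuracy on imbalanced test data.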

E. COMPARISON METHOD
The performance of the proposed method was compared with that of a conventional CNN-based multi-class classification method. DenseNet-169 trained with the standard categorical cross-entropy is used for this comparative method. As the training data, all 8,212 training samples are used with their absolute-level annotations. This means that it uses all of the absolute-level annotations.
Table 2 shows the classification performance of the proposed method at M = 400. The proposed method achieves a higher F1-score than the conventional method. This result shows that the proposed method achieves even higher classification performance than the conventional method, although it needs only about one-tenth of the annotation cost of the conventional method.
We also evaluated the performance of the proposed method for severity classification with various numbers of ALs (M = 50, 100, 200, 300, and 400). Fig. 3 shows the results for the different numbers of ALs. Acc and F1 represent accuracy and F1-score, respectively. The F1-score increases as the number of ALs increases. From M = 300, the F1-score of the proposed method is higher than that of the conventional method. The accuracy of the proposed method is higher than that of the conventional method for M = 200 and over. Therefore, the proposed method achieved higher performance than the conventional method in both metrics when the number of ALs is M = 300 or more.

G. ABLATION STUDY
We examined the effect of calibrating the rank scores by multi-task learning in the proposed method. Specifically, we verified the effect by comparing the classification performance between the calibrated and uncalibrated cases. To determine the classification result for the uncalibrated case, we defined the range of the rank score for each Mayo score by logistic regression using the same M = 400 ALs that were used for multi-task learning, for a fair comparison.
Fig. 4 shows box-plots of the rank scores of the test samples for each correct Mayo score for (a) the uncalibrated case and (b) the proposed method. The horizontal and vertical axes correspond to the correct Mayo score attached by the annotators and the rank scores, respectively. The rank scores obtained with the proposed method are located nearer to the Mayo score range of 0 to 3 than those of the uncalibrated case. These results indicate that the rank scores are calibrated with the ALs as the anchor by using regression as an additional task in multi-task learning.
Table 3 shows the results of the performance evaluation for the uncalibrated case. The overall precision, recall, and F1-score of the uncalibrated case were lower than those of the proposed method. Comparing the F1-score for each class, Mayo 3 was particularly low in the uncalibrated case, indicating an imbalance in the classification performance across classes. Compared with the proposed method, the uncalibrated case had a higher rate of incorrectly predicting Mayo 2 and Mayo 3 as Mayo 1 and could not accurately classify images with high severity. These results indicate that calibration improves the performance of classifying images with high severity and that the proposed method has higher performance than the uncalibrated case.

VI. CONCLUSION
We proposed a multi-task learning method that combines learning to rank with regression for automatically estimating UC severity levels (Mayo scores). The proposed method has a strong advantage in that it can substantially reduce annotation costs by using relative-level annotation instead of costly absolute-level annotations. Our experimental result shows that the proposed method achieved even higher classification performance (accuracy and F1-score) than the conventional classification method while requiring just 1/10 annotation cost.
The limitation of the proposed method is that it requires more training time than the conventional method because the number of pairs increases with the number of images to which AL is attached. We will investigate ways to make effective pairs for learning with as few AL combinations as possible.
Future work will involve the proposal of a new continuous-valued UC severity level. Currently, we discretize the regression result into four levels to follow the traditional Mayo-based evaluation. However, the regression result by itself can express an intermediate score, such as Mayo 1.75. Our continuous severity score could become an accurate and precise alternative to the Mayo score after discussion with the medical expert committee.

ACKNOWLEDGMENT
All of the endoscopic images used in this article are approved by the ethical review committee at the Kyoto Second Red Cross Hospital.