Active Learning for Imbalanced Ordinal Regression

Ordinal regression (OR), also called ordinal classification, is a special kind of multi-class classification designed for problems with ordered classes. Imbalanced data hinders the performance of classification algorithms, and OR algorithms are particularly affected, since imbalanced class distributions often arise in OR problems. In this article, we address the imbalanced OR problem with an active learning based solution. We propose an active learning algorithm for OR (AL-OR) that selects the most informative samples from the unlabeled pool, labels them, and adds them to the training set. Based on AL-OR, we put forward an improved active learning algorithm for imbalanced OR (IAL-IOR), which dynamically adjusts the sampling strategy of AL-OR to make the training data as valuable and balanced as possible. A recall rate for multi-class classification and a new mean absolute error are designed to measure the performance of the algorithms. To the best of our knowledge, our algorithm is the first algorithm-level solution for imbalanced OR. The experimental results show that the proposed algorithms have faster convergence and much better generalization ability than the classical methods and the state-of-the-art methods under evaluation measurements designed for imbalance problems. In addition, we also prove the effectiveness of our algorithms by statistical analysis.


I. INTRODUCTION
Multi-class classification is an important task in machine learning. As a special case of multi-class classification, OR is designed to solve problems with ordered labels. For example, a teacher often rates his/her students by giving grades (A, B, C, D, F) on their performance [1]. When we use common multi-class classification algorithms to predict the grades, the order information of the grades is obviously ignored. OR algorithms are designed to make full use of the order among the labels. OR problems appear in many research areas, such as medical research, age prediction, brain-computer interfaces, credit rating, econometric modeling, face recognition, facial beauty assessment, image classification, wind speed prediction, social sciences, text classification, and more [2].
The class imbalance problem is a big challenge for classification algorithms. In this problem, a class with more samples is a majority class and a class with fewer samples is a minority class. It is important to predict the minority classes correctly, for the minority classes often represent the unusual cases to which we should pay special attention, such as a high rate in credit rating or a high speed in wind speed prediction. Owing to their design principles, most machine learning algorithms optimize the overall classification accuracy while sacrificing the accuracy on minority classes. Therefore it is necessary to design methods that improve the classification accuracy on minority classes without severely jeopardizing the accuracy on the majority classes [3]. (The associate editor coordinating the review of this manuscript and approving it for publication was Sabah Mohammed.)
The imbalance problem has been widely studied for standard classification problems (binary and multi-class classification). Data-level methods balance the skewed distributions of the dataset by data pre-processing, such as under-sampling and over-sampling. In under-sampling algorithms, samples of the majority classes are removed to reach the desired rates of the different classes. In over-sampling algorithms, such as SMOTE (Synthetic Minority Over-sampling Technique) [4] and MDO (Mahalanobis Distance-based Over-sampling technique) [5], the sample size of the minority classes is increased by generating new samples to reach the desired rates. When imbalanced data occurs in OR, new methods should be designed to tackle its peculiarities. Currently, Graph-Based Over-sampling can generate synthetic samples by considering the distribution of minority class data and the order of samples. During over-sampling, it captures the structure of the data by constructing a sample graph and considers the paths which contain the ordinal constraints of the data. Besides, new samples are generated near the boundary of two adjacent classes to soften the ordinal structure of the samples [6]. Cluster-Based Weighted Over-sampling first clusters the minority classes, then over-samples them based on their distance, and finally orders the classes [7]. The Synthetic Minority over-sampling technique for imbalanced Ordinal Regression (SMOR) is a direction-aware over-sampling algorithm [8]; it can effectively avoid generating wrong synthetic samples by considering the rank of the classes. SMOR computes, for each candidate generation direction, a selection weight of being used to generate synthetic samples. However, these methods are all designed at the data level for OR. Under-sampling may dismiss valuable samples which are decisive for building classifiers, and over-sampling increases the probability of overfitting by replicating the minority class samples.
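As an illustration of the over-sampling idea behind SMOTE, a synthetic minority sample is generated by interpolating between a minority sample and one of its minority-class nearest neighbors. The Python sketch below shows only this interpolation step (the neighbor search is omitted, and the function name is ours):

```python
import numpy as np

def smote_interpolate(x, neighbor, rng):
    # SMOTE-style interpolation: the synthetic sample lies on the line
    # segment between a minority sample and a same-class neighbor.
    gap = rng.random()  # uniform in [0, 1)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)
x = np.array([0.0, 0.0])
neighbor = np.array([1.0, 2.0])
synthetic = smote_interpolate(x, neighbor, rng)
```

Because the synthetic point lies between two same-class samples, it stays inside the minority region instead of duplicating an existing sample exactly.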
Instead of focusing on modifying the training set to combat class skew, other methods aim at modifying the classifier learning procedure itself, such as cost-sensitive methods [9], ensemble learning algorithms [10], active learning [11]-[14], and one-class learning [15]. Among them, active learning can counteract the harmful effects of learning under imbalanced classes by selecting the most useful samples for the classifier [11]. Since active learning is implemented on a random subset of the training population rather than the entire training dataset, it can reduce the computational complexity of dealing with large imbalanced datasets [14]. Moreover, active learning provides a progressive sampling strategy, which makes it possible to adjust the sampling strategy dynamically by observing indicators of the learning ease of the different classes [13], and to gradually improve the performance of the classifier by selecting, at each step, the samples that have the most learning value for the current classifier. The time complexity of active learning mainly comes from the time it takes to find the best samples in the unlabeled data and to retrain. The cost of searching for the most informative sample in the unlabeled data has been addressed by small pools and early stopping, and the cost of retraining has been addressed by incremental learning [12]. Active learning not only inherits the advantage of reduced computational complexity (by small pools, early stopping and incremental learning) from under-sampling, but also avoids the disadvantage that under-sampling may delete useful information. It is noteworthy that the learning difficulties of the different classes that make up a learning task may differ (in terms of the number of instances required). Therefore, we need sampling strategies that can be adjusted dynamically by observing different learnability indicators.
Such sampling strategies can be achieved by active learning [13], which uses them to select the most useful samples for the OR classifier. Moreover, as shown in Figure 1, we can also reasonably assume that the samples inside the different boundaries of OR are balanced [12].
In this article, we propose an active learning method to deal with imbalanced OR at the algorithm level. First, we transform OR into an extended binary classification [16], [17], so that ordinal regression can be achieved by an SVM (Support Vector Machine). Then we design a sampling strategy of active learning for OR, and adjust the sampling strategy dynamically to obtain a more valuable and balanced training set from the imbalanced data. Finally, we propose new evaluation methods specifically for imbalanced OR to prove the efficiency of our algorithms.
The main contributions of this article are summarized as follows.
1) To the best of our knowledge, our algorithm is the first algorithm-level solution for imbalanced OR.
2) We put forward a sampling strategy for OR (AL-OR) and an improved active learning method for imbalanced OR (IAL-IOR).
3) We propose improved evaluation methods to evaluate the performance of imbalanced ordinal regression.
We organize the rest of the paper as follows. In section II, we give a brief review of the related works. In section III, we present the transformation of OR to an extended binary classification model, and its SVM based solution.
In section IV, we put forward a sampling strategy for OR, an active learning method with a balanced sampling process, and novel evaluation methods for imbalanced OR. In section V, we carry out our experiments on a variety of datasets and discuss the experimental results. Finally, in section VI, we give some conclusions.

II. RELATED WORK
In this section, we give a brief overview on ordinal regression and active learning.

A. ORDINAL REGRESSION
Many real-world applications present labels with an ordinal structure, and the number of ordinal regression methods and algorithms developed has increased over the last years [2]. Over the past decade, a number of noteworthy research advances have been made in supervised learning of ordinal regression [17]-[19]. Since support vector machines (SVMs) have gained profound interest because of their good generalization performance [16], several support vector OR (SVOR) formulations have been proposed to tackle OR problems. Shashua and Levin [20] proposed a fixed-margin-based formulation and a sum-of-margins-based formulation by finding multiple parallel hyperplanes. Chu and Keerthi [19] improved the fixed-margin-based formulation by explicitly and implicitly keeping ordinal inequalities on the thresholds. Cardoso and Costa [21] proposed a data replication method and mapped it into SVM by implicitly using the fixed-margin-based formulation. Li and Lin [16] proposed a reduction framework from ordinal regression to binary classification based on extended examples. This framework allows one to design an ordinal regression model based on a binary classifier and to derive new generalization bounds for ordinal regression from known bounds for binary classification. Moreover, it unifies many existing ordinal regression algorithms. In our paper, we build the ordinal regression model by using this reduction framework and SVM. Recently, [22] presented a new Kernel Extreme Learning Machine for ordinal regression (KELMOR), exploiting a quadratic cost-sensitive encoding scheme to deal with the efficiency of OR in the big data scenario. Reference [23] proposed a novel ordinal regression model named nonparallel support vector ordinal regression (NPSVOR), in which a set of possible nonparallel hyperplanes are constructed independently. However, only a small number of studies have considered imbalanced ordinal regression [24].
Reference [6] creates synthetic samples by considering the distribution and ordering of minority data. The main assumption of this method is that when resampling in an ordinal regression problem, the ordering of the classes should be considered, and the ordering is generally represented by a latent manifold. To take advantage of this manifold, it captures the structure of the data by constructing a sample-based graph, and considers paths that preserve the order constraints of the data when over-sampling. In addition, new samples are created at the boundaries between adjacent classes to smooth the ordinal nature of the dataset.
Reference [7] aims to address imbalanced ordinal regression by first clustering the minority classes and over-sampling them based on their distance, and then ordering the relationship with the samples of other classes. The final size of an over-sampled cluster depends on its complexity and initial size, so that more synthetic instances are generated for more complex and smaller clusters and fewer instances for less complex and larger clusters. An improved agglomerative hierarchical clustering algorithm is proposed to reduce the occurrence of superimposed synthetic samples during over-sampling. Moreover, a new measurement method is proposed to quantify the balance between the complexity of a cluster and its initial size.

B. ACTIVE LEARNING
Active learning, as a standard machine learning problem, has been extensively studied in many research fields. Based on different sampling strategies, active learning methods can be grouped into the following categories [25]: 1) uncertainty sampling, where an active learner queries the samples about which it is least certain how to label; 2) query-by-committee, which maintains a committee of models and considers the most informative query to be the samples about which the committee members most disagree; 3) expected model change, where an active learner queries the samples that would impart the greatest change to the current model. Moreover, more and more studies focus on active learning for imbalanced data [11]-[14], [26].
Reference [12] assumes that the samples inside the boundaries are balanced, as shown in Figure 1, and active learning is used to choose samples in the boundaries so that the learner has a more balanced training set. Fully active learning is used to solve the imbalanced problem, and the experimental results show that active learning provides fast solutions with competitive prediction performance in imbalanced classification.
A co-selecting method is proposed in [26], which uses two feature-subspace classifiers to choose balanced samples from imbalanced sentiment data by adjusting a sampling strategy dynamically. Experiments in four domains demonstrate the great potential and effectiveness of the approach for imbalanced sentiment classification.
Reference [27] analyzes the effect of resampling techniques used in active learning for word sense disambiguation. It is worth noting that these techniques do not require modification of the architecture or learning algorithms, which makes them very easy to use and extend to other areas. Experimental results show that under-sampling causes negative effects on active learning, while over-sampling is a relatively good choice.
Reference [14] proposes an ensemble-based active learning algorithm to tackle the class imbalance problem in medical diagnosis. Artificial data is created according to the distribution of the training set to make the ensemble diverse, and the random subspace re-sampling method is used to reduce the data dimension. When selecting member classifiers based on misclassification cost estimation, the minority class is assigned higher weights for misclassification costs, while each testing sample has a variable penalty factor to induce the ensemble to correct the current error. Experimental results show that, compared with other ensemble methods, the proposed method has the best performance and needs fewer labeled samples.

III. SVM BASED OR SOLUTION
In this section, we first transform OR into an extended binary classification problem, and then give an SVM based solution.

A. OR AS AN EXTENDED BINARY CLASSIFICATION
The problems that OR handles can be described as follows: given an input vector x, we want to predict a label y, where x ∈ X ⊆ R^d and y ∈ Y = {C_1, C_2, ..., C_K}, i.e., x is a sample of a d-dimensional input space, and y is one of K different labels with C_1 < C_2 < ... < C_K. Suppose that OR is a threshold model; then a K-class ordinal problem has K − 1 ordered thresholds: θ_1 < θ_2 < ... < θ_{K−1} [28]. Thus, a sample x is assigned to class C_i when the predictive function h(x) = w^T x − b falls between θ_{i−1} and θ_i, where w ∈ R^d, b is an offset, and θ_0 = −∞ and θ_K = ∞ are typically assumed. For example, the output for a sample of class C_3 should fall between θ_2 and θ_3.
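For concreteness, the threshold decision rule can be sketched in Python (the function name is ours, not from the paper): a sample's class index is one plus the number of thresholds that h(x) exceeds, which is exactly the interval test above.

```python
import numpy as np

def threshold_predict(x, w, b, thresholds):
    # Class C_i is assigned when h(x) = w^T x - b lies between
    # theta_{i-1} and theta_i; equivalently, i = 1 + #{k : h(x) > theta_k}.
    h = float(np.dot(w, x)) - b
    return 1 + int(np.sum(h > np.asarray(thresholds)))
```

For example, with w = [1.0], b = 0 and thresholds (θ_1, θ_2, θ_3) = (0, 1, 2), an input x = [0.5] gives h(x) = 0.5, which lies between θ_1 and θ_2, so class 2 is predicted.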
OR as an extended binary classification can be formed as:

x_i^k = (x_i, e_k), y_i^k = 2 I[y_i > C_k] − 1, k = 1, ..., K − 1, (1)

where e_k ∈ R^{K−1} denotes a (K − 1)-dimensional vector whose kth element is 1 and the rest of the elements are 0, and the function I[·] is an indicator function which returns 1 if the inside condition holds and 0 otherwise.
The extended weight vector (w, −θ) can be used to predict y_i^k, since (w, −θ) x_i^k = w^T x_i − θ_k. Therefore, the thresholds θ_k can be handled through the feature extension. Finally, the label of each OR sample can be predicted as:

r(x) = 1 + Σ_{k=1}^{K−1} I[g(x^k) > 0], (3)

where g(x^k) is the output of the binary classifier on the kth extended sample x^k.
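The extension and the rank prediction of the reduction framework can be sketched as follows (a minimal illustration assuming a linear model; function names are ours):

```python
import numpy as np

def extend_sample(x, K):
    # Build the K-1 extended samples (x, e_k), where e_k is the k-th
    # standard basis vector of R^{K-1}.
    ext = []
    for k in range(K - 1):
        e = np.zeros(K - 1)
        e[k] = 1.0
        ext.append(np.concatenate([x, e]))
    return np.array(ext)

def reduction_predict(x, w, theta, b, K):
    # Extended weight (w, -theta): the k-th binary score is w^T x - theta_k - b.
    # The predicted rank is 1 + the number of positive binary answers.
    w_ext = np.concatenate([np.asarray(w, float), -np.asarray(theta, float)])
    scores = extend_sample(np.asarray(x, float), K) @ w_ext - b
    return 1 + int(np.sum(scores > 0))
```

With w = [1.0], θ = (0, 1, 2), b = 0 and K = 4, an input x = [1.5] yields two positive binary answers, so rank 3 is predicted, matching the threshold model.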

B. SVM BASED SOLUTION FOR OR
Given the original OR dataset L = {(x_1, y_1), ..., (x_L, y_L)}, we can extend the dataset L into the corresponding dataset of binary classification with y_i^k ∈ {1, −1}. Thus, we can minimize the structural risk function following [29] and [16] as the following primal problem:

min_{w, θ, b, ξ} (1/2) ||(w, −θ)||^2 + C Σ_{i=1}^{L} Σ_{k=1}^{K−1} ξ_i^k
s.t. y_i^k ((w, −θ)^T φ(x_i^k) − b) ≥ 1 − ξ_i^k, ξ_i^k ≥ 0, ∀ i, k, (4)

where φ denotes the feature mapping induced by the kernel function, C is a positive number, and ξ_i^k denotes the slack variable that allows x_i to have some error at the kth boundary. The kernel function in Equation (4) makes the decision function nonlinear by virtue of the kernel trick [30].

Algorithm 1 AL-OR
Input: Labeled data L and unlabeled data U
Output: The OR model
1: for i < N do
2:   Learn an OR classifier using the current L
3:   Use the classifier to predict the unlabeled data U
4:   Use the sampling strategy in Equation (8) to select informative samples for manual annotation
5:   Move the informative samples from U to L
6: end for
7: return An OR classifier
By introducing Lagrange multipliers α_i^k and μ_i^k, the dual form of the minimization problem in Equation (4) becomes:

max_α Σ_{i,k} α_i^k − (1/2) Σ_{i,k_1} Σ_{j,k_2} α_i^{k_1} α_j^{k_2} y_i^{k_1} y_j^{k_2} K(x_i^{k_1}, x_j^{k_2})
s.t. Σ_{i,k} α_i^k y_i^k = 0, 0 ≤ α_i^k ≤ C, ∀ i, k, (5)

where K(x_i^{k_1}, x_j^{k_2}) = φ(x_i^{k_1})^T φ(x_j^{k_2}) is the resultant kernel evaluation of x_i^{k_1} and x_j^{k_2}. In this way, the theoretical rigor of SVM is inherited; moreover, typical caching and optimization techniques such as SMO [31], [32] can also be used for OR [28].

IV. IMPROVED ACTIVE LEARNING FOR IMBALANCED OR
In this section, we first put forward a sampling strategy of active learning for OR. Then we design a balanced active learning method for imbalanced OR. Finally, we introduce two improved evaluation indicators for imbalanced OR. The general flow of the algorithm is shown in Figure 2.

A. ACTIVE LEARNING FOR OR
It is easy to collect a large amount of unlabeled data in many real-world applications, so effective pool-based active learning, as shown in Algorithm 1, becomes more and more important [33]. The most critical step of pool-based active learning is how to evaluate the informativeness of samples.
The most commonly used technique in active learning focuses on selecting samples from the area of uncertainty (the area closest to the prediction decision boundary of the current model), and many existing popular techniques are specializations of uncertainty selection, including query-by-committee-based approaches [34]-[36]. Therefore, the most frequently used active learning strategy in SVM is to check the distance of each unlabeled sample to the hyperplane, by which the most informative samples are decided for the learner [37]. By solving the problem in Equation (4), we obtain the parameters (w, −θ) and b of the ordinal regression model. A sample x_i is extended to K − 1 samples by Equation (1); therefore a sample's confidence can be calculated as:

c(x_i) = min_{1≤k≤K−1} | (w, −θ) x_i^k − b |. (7)

As Equation (7) shows, we calculate the distance from each extended sample x_i^k to the boundary by |(w, −θ)x_i^k − b|. The minimum of the K − 1 distances represents the distance between the sample x_i and the decision surface of its final category. Then, we can get the most informative sample from the unlabeled data:

x* = argmin_{x_i ∈ U} c(x_i). (8)

According to Equation (8), we can get our AL-OR algorithm as shown in Algorithm 1.
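The confidence measure of Equation (7) and the selection rule of Equation (8) can be sketched as follows, assuming a linear model with learned w, θ and b (variable and function names are ours):

```python
import numpy as np

def confidence(x, w, theta, b):
    # min over k of |(w, -theta) x^k - b| = min_k |w^T x - theta_k - b|:
    # the distance of x to the boundary of its nearest threshold.
    scores = np.dot(w, x) - np.asarray(theta) - b
    return float(np.min(np.abs(scores)))

def most_informative_index(U, w, theta, b):
    # AL-OR query: the unlabeled sample closest to a decision boundary.
    return int(np.argmin([confidence(x, w, theta, b) for x in U]))
```

With w = [1.0], θ = (0, 1) and b = 0, a pool U = {[0.05], [3.0]} yields confidences 0.05 and 2.0, so the first sample is queried.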

B. BALANCED ACTIVE LEARNING FOR IMBALANCED OR
The sampling strategy for OR proposed above just finds the informative samples in the entire unlabeled dataset. This may aggravate the imbalance rate of the labeled data if the informative samples always come from the majority classes. To balance the samples across classes, once a majority-class sample is chosen by Equation (8), we also choose an informative sample from the smallest class. Algorithm 2 illustrates the improved active learning for imbalanced OR (IAL-IOR) in detail.
The main difference of our algorithm from the original active learning is steps 6-8. If the most informative sample in the current unlabeled dataset U belongs to one of the majority classes of the current labeled dataset L, a sample of the smallest minority class in the current L will be added by manually annotating the most informative sample of that class in U. After many iterations, the samples in the labeled dataset L become more balanced.

Algorithm 2 IAL-IOR
Input: Labeled data L and unlabeled data U
Output: A balanced OR classifier
1: for i < N do
2:   Learn an OR classifier using the current L
3:   Use the classifier to predict the unlabeled data U
4:   Use the sampling strategy in Equation (8) to select the most informative sample x_in
5:   Put x_in into set A
6:   if x_in belongs to a majority class then
7:     Choose the most informative sample from the smallest class and put it into A
8:   end if
9:   Manually annotate the samples in A
10:  Move the informative samples in A from U to L
11: end for
12: return A balanced OR classifier
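The balancing step of IAL-IOR (steps 6-8) can be sketched as below. Note that the sketch uses the classifier's predictions to decide which class an unlabeled sample belongs to before annotation, which is an assumption on our part:

```python
import numpy as np
from collections import Counter

def ial_ior_select(labels_L, predicted_U, conf_U):
    # Pick the most informative unlabeled sample; if it is predicted to
    # belong to the current majority class of L, also pick the most
    # informative sample predicted to belong to the smallest class.
    counts = Counter(labels_L)
    best = int(np.argmin(conf_U))
    chosen = [best]
    majority = max(counts, key=lambda c: counts[c])
    if predicted_U[best] == majority:
        minority = min(counts, key=lambda c: counts[c])
        cands = [i for i, p in enumerate(predicted_U)
                 if p == minority and i != best]
        if cands:
            chosen.append(min(cands, key=lambda i: conf_U[i]))
    return chosen
```

For instance, if L currently contains three samples of class 1 and one of class 2, and the most informative unlabeled sample is predicted as class 1, the sketch additionally queries the most informative sample predicted as class 2.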

C. EVALUATION METHODS FOR IMBALANCED OR
Making proper comparisons between classification models is a complex and unsolved challenge. This task depends not only on the understanding of errors, but also on the nature of the problem itself. To avoid erroneous biases in the assessment, the evaluation indicator error rate is designed to evaluate the accuracy of classification algorithms, while for OR, the mean absolute error (MAE) may be a more reasonable metric [38]:

MAE = (1/N) Σ_{i=1}^{N} | ŷ_i − y_i |, (9)

where N is the number of samples and ŷ_i and y_i are the predicted and true ranks of sample x_i. Small errors are not as important as large errors in OR. For example, when a student's true grade is A, being predicted as B is more acceptable than being predicted as F. However, the error rate and MAE may be deceptive in imbalanced situations. For example, for a given dataset with 10 percent of the samples belonging to the minority class and 90 percent belonging to the majority class, if a classifier predicts every sample to be in the majority class, it will be evaluated to have an accuracy of 90 percent by the error rate. It is obvious that such a classifier may be ineffective on the minority class. Researchers use other evaluation indicators to provide a comprehensive assessment of imbalanced problems, such as Recall, which is defined in binary classification as:

Recall = TP / (TP + FN), (10)

where TP is the number of positive samples correctly predicted and FN is the number of positive samples predicted as negative. Equation (10) can be improved for OR (and can also be used in multi-class classification) as:

Recall_m = (1/K) Σ_{k=1}^{K} Recall_k, (11)

where Recall_k denotes the recall rate of class k. There are also some well-accepted measures for imbalanced OR, the average mean absolute error (AMAE) [39] and the maximum mean absolute error (MMAE) [40]:

AMAE = (1/K) Σ_{k=1}^{K} MAE_k, (12)

MMAE = max_{k=1,...,K} MAE_k, (13)

where MAE_k is the MAE of class k. The MAE can also be improved for OR as:

MAE_im = (1/(KN)) Σ_{i=1}^{N} (N/N_{y_i}) | ŷ_i − y_i |, (14)

where N is the size of the dataset and N_k is the size of class k. N/N_k denotes the weight of the different classes. It is easy to prove that Equation (14) is equivalent to MAE when the dataset is balanced.
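These measures can be computed as in the sketch below (classes are assumed to be encoded as integers 1..K; the normalization of MAE_im follows our reading that it reduces to MAE on balanced data):

```python
import numpy as np

def imbalanced_or_metrics(y_true, y_pred, K):
    # Compute MAE, AMAE, MMAE, Recall_m and MAE_im for ranks in 1..K.
    y_true = np.asarray(y_true); y_pred = np.asarray(y_pred)
    N = len(y_true)
    err = np.abs(y_pred - y_true)
    mae = float(err.mean())
    mae_k = [float(err[y_true == k].mean()) for k in range(1, K + 1)]
    recall_k = [float(np.mean(y_pred[y_true == k] == k)) for k in range(1, K + 1)]
    amae = float(np.mean(mae_k))          # average per-class MAE
    mmae = float(np.max(mae_k))           # worst-class MAE
    recall_m = float(np.mean(recall_k))   # macro recall for OR
    # MAE_im: each sample weighted by N / N_k, normalized so that the
    # measure equals plain MAE when every class has the same size.
    w = np.array([N / np.sum(y_true == k) for k in y_true])
    mae_im = float(np.sum(w * err) / (K * N))
    return mae, amae, mmae, recall_m, mae_im
```

On a balanced toy example with y_true = [1, 1, 2, 2] and y_pred = [1, 2, 2, 2], MAE and MAE_im coincide at 0.25 while MMAE exposes the weaker class with 0.5.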

V. EXPERIMENTS
In this section, we first introduce the experimental setup, and then present and discuss our experimental results.

A. EXPERIMENTAL SETUP 1) DESIGN OF EXPERIMENTS
As an active learning algorithm, we should verify the effectiveness of our sampling strategy, and as an algorithm for imbalanced OR, we should show the generalization ability of our algorithms.
To show the effectiveness of our sampling strategy, we compare the performance of our algorithms (including AL-OR and IAL-IOR) with random sampling (randomly selecting query samples). To show the generalization ability of our algorithms, we compare their performance with state-of-the-art imbalance methods (including under-sampling (US) and an over-sampling method (SMOTE [4])) and recently proposed imbalance methods (including SMOR [8] and SMOM [41]). The performance of all algorithms is evaluated by general measures such as accuracy and MAE, and by measures especially for imbalanced OR, such as AMAE, MMAE, Recall_m and MAE_im.

2) IMPLEMENTATION DETAILS
We implement our algorithms and the other comparison experiments in MATLAB. Experiments are run on a 2.4-GHz Intel Xeon machine with 128-GB RAM. The active learning algorithms we designed are based on the OR classifier described in Section III. To be fair, all the base classifiers of the comparison algorithms are this OR classifier. This classifier mainly involves two parameters: the kernel function and C. The kernel is a Gaussian kernel K(x_1, x_2) with width k = 0.1, and C is fixed to 10. The values for the parameters of SMOR are: k = 2, w = 0.25 and r = 1/4. The nearest neighbor parameter k in SMOTE is set to 5. The parameters of SMOM are set as follows: k1 = 12, k2 = 8, rTh = 5/8, nTh = 10, w1 = 0.2, w2 = 1/2, r1 = 1/3 and r2 = 0.2. The code of SMOR [8] and SMOM [41] is available at https://github.com/zhutuanfei/SMOR, and the code of our algorithms is available at https://github.com/gjmrookie/active-learning-forimbalanced-OR.
For the experiments verifying the effectiveness of our sampling strategy, we divide the original dataset into a test set and a training set at a ratio of 4:1, and the proportions of the different classes in the training and test sets are the same as in the original dataset. In the training set, we set different sizes of L to ensure that only a small amount of labeled data is available at the beginning of training. Random sampling and AL-OR take 10 samples per generation. For the experiments verifying the generalization ability of our algorithms, the original dataset is divided into a test set and a training set in the same way.

3) DATASETS
The datasets used in our experiments are summarized in TABLE 1. The first three benchmark datasets were originally used for metric regression problems. Equal-length merging is used to discretize their target values into ordinal numbers by dividing the range of target values into a given number of equal-length intervals [42]. The last twelve benchmark datasets are OR datasets from the real world.

B. EXPERIMENTAL RESULTS
Figure 3 and Figure 4 report the performance on the test data over 20 trials. The results show that our algorithms (AL-OR and IAL-IOR) are much better than random sampling, and our algorithms converge early. The performance of AL-OR demonstrates that active learning can deal with the class imbalance problem. The performance of IAL-IOR is even better, although its improvement per generation is limited because only a few samples are added each time. From the result tables, we can draw two findings: 1) our algorithms have similar or even better generalization ability than OR; 2) accuracy and MAE cannot reflect the accuracy on the minority classes. Our algorithms outperform OR and the standard imbalance algorithms on AMAE, MMAE, Recall_m and MAE_im, which means our algorithms are better than under-sampling and SMOTE on most datasets. It also means that fully active learning is an efficient method for imbalanced OR.
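The equal-length merging used to discretize the regression datasets can be sketched as follows (classes assumed to be 1..K; the function name is ours):

```python
import numpy as np

def equal_length_bins(y, K):
    # Split [min(y), max(y)] into K equal-length intervals and map each
    # target to the (1-based) index of the interval it falls in.
    y = np.asarray(y, dtype=float)
    edges = np.linspace(y.min(), y.max(), K + 1)
    # use interior edges only; clip so max(y) stays in class K
    return np.clip(np.digitize(y, edges[1:-1]) + 1, 1, K)
```

For example, targets [0, 2, 5, 9, 10] split into K = 2 classes become [1, 1, 2, 2, 2], since the midpoint of the range is 5.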
In addition, we adopted two statistical evaluation methods, the Bayes Sign Test and the Bayes Signrank Test [43], and applied them to our three experimental indexes (accuracy, AMAE and MAE_im) for verification. The Bayes Signrank results are shown in TABLE 5 and the Bayes Sign Test results are shown in TABLE 6. The comparison results of each pair of algorithms have three indicators: left, rope and right, where left represents the probability that Classifier 1 is superior to Classifier 2, rope represents the probability that the algorithms are equivalent, and right is the opposite of the first case. The experimental results come from 20 trials on 15 datasets. According to the experimental results, our active learning algorithm designed specifically for imbalanced OR (IAL-IOR) performs the best. Active learning for OR (AL-OR) and the compared over-sampling algorithms (SMOR, SMOM and SMOTE) have similar generalization performance. Both are definitely better than the under-sampling algorithm (US).
From Figure 3 and Figure 4, we can conclude that the proposed algorithms have faster convergence and better generalization ability on Recall_m and MAE_im. Our algorithms have generalization ability similar to the classical methods (US, SMOTE) and the recently proposed methods (SMOM, SMOR) under the general evaluation measurements (accuracy, MAE), but achieve clearly better results under the evaluation measurements (Recall_m, MMAE, AMAE, MAE_im) for imbalance problems. TABLE 5 and TABLE 6 also prove the effectiveness of our algorithms through statistical analysis.

VI. CONCLUSION
In this article, we put forward a sampling strategy for ordinal regression (AL-OR) and design an improved active learning method for imbalanced ordinal regression (IAL-IOR). First, we convert the ordinal regression problem into an extended binary classification problem. Second, we design a sampling strategy for ordinal regression and a balanced active learning method based on this sampling strategy. In order to get more reasonable evaluations, we design the improved recall Recall_m and the improved mean absolute error MAE_im for imbalanced ordinal regression. Moreover, we conduct experiments to compare our algorithms with other algorithms on benchmark datasets. The results show that the proposed AL-OR and IAL-IOR can both deal with the class imbalance problem in OR efficiently.