Multi-class Classification Performance Curve

Quality of predictive models is a critical factor. Many evaluation measures have been proposed for binary and multi-class datasets. However, less attention has been paid to the graphical representation of classification performance: the ROC curve is extensively used for binary datasets, but there is no standard method accepted by the scientific community for multi-class datasets. In this work, a multi-class classification performance (MCP) curve based on the Hellinger distance between the true and prediction probabilities of the classifier is introduced. The MCP curve shows the classification performance, helps to highlight the low or high confidence in correct predictions, and quantifies the quality by means of the area under the curve.


I. INTRODUCTION
Quality of decision is an important concept in machine learning, because it assesses the performance of a predictive model by comparison with reality. The simplest measure of decision quality is the accuracy, i.e. the fraction of correct decisions. While it seems obvious that high accuracy is good, it can be very misleading. Many medical datasets contain very few positive cases (disease) relative to negative cases (normal), which is very common for rare diseases. For example, if only 3% of patients in the dataset have colorectal cancer, a predictive model that blindly predicts the negative class would be correct 97% of the time.
Accuracy is therefore of little use in many contexts, particularly in biomedicine and psychology. When prevalence is not around 0.5 (balance between positive and negative cases), accuracy loses credibility as a measure of quality. To remedy this, other measures can be used to compare the performance of several predictive models, since even when the accuracy is equal for two models, their performances could be quite different: one due to a low rate of false negative cases and the other to a low rate of false positive cases. More sophisticated measures, like the Matthews correlation coefficient [1], the K-category correlation coefficient $R_K$ (a less well-known generalization of the two-class Matthews correlation coefficient) [2], Cohen's κ [3], or the F1-score [4], try to provide a better perspective.
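The 3% prevalence example can be checked numerically. The following is a minimal Python sketch, not part of the original study: the cohort size (1,000 patients) and the helper name `binary_metrics` are illustrative assumptions, and the conventions for undefined values (F1 and MCC set to 0 when their denominators vanish) are one common choice among several.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, F1, MCC, Cohen's kappa) from binary confusion counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Convention assumed here: F1 is 0 when there are no true positives.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
    # Convention assumed here: MCC is 0 when any marginal total is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    # Cohen's kappa: observed agreement corrected by the agreement expected by chance.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    return accuracy, f1, mcc, kappa

# Hypothetical cohort: 1,000 patients, 3% prevalence, model always answers "negative".
always_negative = binary_metrics(tp=0, fp=0, fn=30, tn=970)
print(always_negative)
```

Under these assumptions the accuracy is 0.97 while F1, MCC and κ are all 0, which is the discrepancy the paragraph describes.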
The interpretation of false decisions (positive or negative) could lead to complex situations, e.g. expensive medical treatments for healthy patients (false positive) or absence of treatment for seriously ill patients (false negative). Sensitivity and specificity are two measures that mitigate the effect of an incorrect interpretation of accuracy. Sensitivity is the ratio between the true positive decisions and the number of positive cases; specificity is the ratio between the true negative decisions and the number of negative cases. Both are combined in a very useful graphical measure: the Receiver (or Relative) Operating Characteristic (ROC) curve.
The ROC curve [5]-[7] represents the relation between the false positive rate (FPR) and the true positive rate (TPR), and illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied, with both rates obtained from the confusion matrix at each threshold. FPR and TPR are calculated from sensitivity and specificity (FPR = 1 − specificity and TPR = sensitivity). The curve gives a solid idea of the behavior of a classifier, and allows several classifiers to be compared in one image for the same dataset. Also, it has been shown that the area under the ROC curve is an excellent measure of classification performance [8], as it conveys more information than many scoring metrics by visualizing the performance of the classifier as a curve instead of providing a single scalar value. Most importantly, the ROC curve, unlike the accuracy, only depends on the FPR and TPR, which are independent of imbalance, i.e. of class distribution. Given its easy interpretation and usefulness for comparative analysis, the ROC curve has been used in a myriad of diverse applications [9]-[12], sometimes with questionable statistical confidence [13]. However, its binary character has substantially limited its use to strictly dichotomous decision contexts.
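As an illustration of how the threshold sweep produces the (FPR, TPR) points, here is a minimal Python sketch. The function name `roc_points` and the six labelled scores are made up for the example; they do not come from the paper.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by sweeping the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        # TPR = sensitivity, FPR = 1 - specificity.
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical binary test set: 1 = positive, 0 = negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points)
```

The first point corresponds to a threshold above all scores and the last to a threshold at the lowest score, which is why the curve runs from (0, 0) to (1, 1).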
The ROC curve presents many interesting properties, but also an important shortcoming: it can only be applied to two-class (binary) datasets. In the case of multi-class datasets (more than two labels for the nominal target variable), there is no solution that aggregates and summarizes the classifier behavior for all the classes together (e.g. 3 classes: early disease, advanced disease, and normal cases). The most common approach is to represent the class of interest (positive) against the others (negative), but this does not help to understand the global performance of the classifier (also, information is always lost in the reduction of a problem to a dichotomy). If all the classes are important, the smoothed solution named macro-average is to average all the ROC curves (one-versus-rest for each class), but this compromises the natural insensitivity of the ROC curve to class skew. In contrast, the micro-average approach takes into account the proportion of every class (aggregating the contributions of all classes), giving more importance to larger classes and, therefore, assigning greater values when the largest class performs better (i.e. it is dominated by the more frequent class). In short, macro-average gives equal importance to each class, and micro-average to each sample. In the case of an equal number of samples for each class, both macro- and micro-average ROC curves will result in exactly the same score. However, when the imbalance is significant, both approaches lead to very different curves with debatable interpretations.
The area under the ROC curve (AU(ROC)) is a robust evaluation measure [14]. It is equivalent to the probability that a randomly chosen positive sample will be rated higher than a randomly chosen negative sample, and it is useful to compare classifiers when no one dominates the others [15]. There exist approaches for the AU(ROC) in the context of multi-class classification, based on averaging pairwise comparisons of classes [16] or on the use of a K-dimensional space to compute the volume of the ROC surface [17]-[19], where K is the number of class labels. However, none of these approaches provides a graphical representation.
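The probabilistic reading of AU(ROC) can be sketched directly in Python by comparing every positive score with every negative score. The function name `auc_by_ranking`, the tie handling (ties counted as one half) and the toy data are assumptions made for this illustration.

```python
def auc_by_ranking(labels, scores):
    """Estimate AU(ROC) as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A pair counts 1 if the positive is rated higher, 0.5 on a tie, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical binary test set: 1 = positive, 0 = negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_by_ranking(labels, scores)
print(auc)
```

For this toy data 8 of the 9 positive-negative pairs are ordered correctly, so the estimate is 8/9.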
This paper presents an intuitive method to calculate the classification performance in the multi-class context, i.e. for datasets whose target variable contains any number of class labels. The approach provides a two-dimensional classification performance curve (MCP curve) to visualize the behavior of predictive models and subsequently, as a scalar measure of quality, the area under the MCP curve (AU(MCP)).
Unlike most approaches for binary datasets, it is not based on the confusion matrix but on the prediction probabilities generated by the classifier for the K class labels.
The paper is organized as follows: after introducing the research context, Sec. II presents the mathematical foundations based on probability distributions that lead to the definition of the MCP curve; Sec. III shows, based on the results of two classifiers for multi-class data, how the MCP curve can graphically compare classifier performances, which is not possible with ROC curves; finally, Sec. IV discusses the most important conclusions and future work.

II. MULTI-CLASS CLASSIFICATION PERFORMANCE
Let $D = \{e_i \mid e_i = (x_i, y_i),\ x_i \in \mathcal{X}^m,\ y_i \in \mathcal{Y},\ i = 1, 2, \ldots, n\}$ be a dataset with n samples, which belong to an m-dimensional feature space and have a corresponding outcome y in the space $\mathcal{Y}$. When $|\mathcal{Y}| = 2$ we say that it is a 2-class (binary) classification problem; otherwise, when $|\mathcal{Y}| > 2$, we say that it is a multi-class classification problem.
For simplicity, let us assume that $\mathcal{Y} = \{1, 2, \ldots, K\}$ (set of class labels). The true probability of sample $e_i$, denoted by $t(e_i)$, can be encoded as a K-dimensional one-hot vector, in which all the values are 0 except for a single 1 at the position k that satisfies $y_i = k$ (following the indicator function, for all $k \in \mathcal{Y}$, $t(e_i)_k = \mathbb{1}_{y_i = k}$). There will therefore only be K different true probability vectors. The prediction probability of sample $e_i$, denoted by $p(e_i)$, is also a K-dimensional vector, whose values are provided by the classifier as output probabilities of belonging to each class $k \in \mathcal{Y}$.
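The one-hot encoding of the true probability can be written in a few lines of Python. This is only a sketch: the function name `one_hot` is hypothetical, and labels are assumed to be the integers 1, ..., K as in the text.

```python
def one_hot(label, num_classes):
    """Return the true probability vector t(e_i) for a label in {1, ..., K}."""
    if not 1 <= label <= num_classes:
        raise ValueError("label must lie in {1, ..., K}")
    # Indicator function: 1 at the position of the observed class, 0 elsewhere.
    return [1.0 if k == label else 0.0 for k in range(1, num_classes + 1)]

# For K = 3 there are exactly three distinct true probability vectors.
true_vectors = [one_hot(k, 3) for k in (1, 2, 3)]
print(true_vectors)
```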
In order to observe the quality of the classifier prediction, a distance function between the two discrete probability distributions must be applied to each sample. Let d be a distance function between two distributions t and p, such that $d : [0, 1]^K \times [0, 1]^K \to [0, 1]$ (certainly, the true probability could be expressed in the space $\{0, 1\}^K$, but it is generalized to $[0, 1]^K$ with the intention of including uncertainty in later work). The interest lies in calculating the distance d for each sample $e_i \in D$, i.e. $d(t(e_i), p(e_i))$ for all $i = 1, 2, \ldots, n$. The distance value is very important because it indicates how far the prediction is from the true observation for each sample. Values close to 0 mean that the prediction is good, and values close to 1 mean that the prediction substantially differs from the observed class.
The value $1 - d(t(e_i), p(e_i))$ can be interpreted as the probability that the classifier correctly assigns the observed class label to the test sample $e_i$ (certainty).
The Hellinger distance [20], [21] is a metric oriented to probability distributions. Let $P = [p_1, \ldots, p_K]$ and $Q = [q_1, \ldots, q_K]$ be two discrete probability distributions. The Hellinger distance is defined as:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{K} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$
The choice of the Hellinger distance is due to its interesting properties: a) it is symmetric; b) it is bounded in [0, 1] (thanks to the factor $1/\sqrt{2}$); c) it is convex with respect to both P and Q; and d) it satisfies the triangle inequality. As a consequence, it has recently attracted a considerable amount of attention in many scientific fields [22]-[30].

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3186444 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The discrete α-divergence [31] defines a family of functions as follows:
$$D_\alpha(P \| Q) = \frac{1}{\alpha(\alpha - 1)} \left( \sum_{i=1}^{K} p_i^{\alpha} q_i^{1-\alpha} - 1 \right), \quad \alpha \in \mathbb{R} \setminus \{0, 1\}$$
The α-divergence has some important properties: a) it is non-negative; b) it is convex with respect to both P and Q; c) it is 0 if and only if P = Q. In addition, there is a very interesting relationship between the Hellinger distance and the Kullback-Leibler divergence [32] through the α-divergence.
The special cases of α = 0 and α = 1 are related to the Kullback-Leibler divergence (KL):
$$\lim_{\alpha \to 1} D_\alpha(P \| Q) = KL(P \| Q) = \sum_{i=1}^{K} p_i \log \frac{p_i}{q_i}, \qquad \lim_{\alpha \to 0} D_\alpha(P \| Q) = KL(Q \| P)$$
and the case α = 1/2 is related to the Hellinger distance:
$$D_{1/2}(P \| Q) = 4 \left( 1 - \sum_{i=1}^{K} \sqrt{p_i q_i} \right) = 2 \sum_{i=1}^{K} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2 = 4 H^2(P, Q)$$
Hence the Hellinger distance $H(P, Q) = \frac{1}{2} \left(D_{1/2}\right)^{1/2}$ can be expressed as an α-divergence, and its square is related to the midpoint between $KL(Q \| P)$ and $KL(P \| Q)$ taking $D_\alpha(P \| Q)$ as a reference, although unlike the KL divergence, H(P, Q) is a mathematically valid metric. H(P, Q) can also be expressed as a function of the symmetric Bhattacharyya coefficient (BC) [33] (also known as fidelity), which can be derived from the Chernoff α-coefficient [31]:
$$H(P, Q) = \sqrt{1 - BC(P, Q)}$$
where
$$BC(P, Q) = \sum_{i=1}^{K} \sqrt{p_i q_i}$$

If the distance between the true probability $t(e_i)$ and the prediction probability $p(e_i)$ of a sample $e_i$ is close to 0, then the classifier is making a good prediction for $e_i$; otherwise, when it is close to 1, the prediction is bad. Good or bad predictions are not necessarily correct and incorrect, respectively. The distance is a quality measure of the classification performance, so a correct prediction can still come with a distance that is not small (e.g. for a 3-class problem, when t = [1, 0, 0] and p = [0.4, 0.3, 0.3], H(t, p) = 0.606 and the prediction is correct).
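The relations among the Hellinger distance, the Bhattacharyya coefficient and the α-divergence at α = 1/2 can be checked numerically on the 3-class example. The sketch below is illustrative only: the function names are hypothetical, and the α-divergence is written in the form for normalized distributions, $D_\alpha = (\sum_i p_i^\alpha q_i^{1-\alpha} - 1) / (\alpha(\alpha - 1))$, which is an assumption chosen to be consistent with $H = \frac{1}{2}\sqrt{D_{1/2}}$.

```python
import math

def bhattacharyya(p, q):
    """Bhattacharyya coefficient BC(P, Q) = sum_i sqrt(p_i * q_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger(p, q):
    """Hellinger distance with the 1/sqrt(2) factor, so values lie in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)

def alpha_divergence(p, q, alpha):
    """Assumed normalized form of the alpha-divergence, for alpha not in {0, 1}."""
    total = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p, q))
    return (total - 1) / (alpha * (alpha - 1))

t = [1.0, 0.0, 0.0]   # true (one-hot) probability
p = [0.4, 0.3, 0.3]   # prediction probability, correct but not confident

h = hellinger(t, p)
h_from_bc = math.sqrt(1 - bhattacharyya(t, p))
h_from_alpha = 0.5 * math.sqrt(alpha_divergence(t, p, 0.5))
print(round(h, 3), round(h_from_bc, 3), round(h_from_alpha, 3))
```

All three routes give the same value, about 0.606, for this example.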
The boundary for a correct prediction for a K-class classification problem is given by the expression:
$$H(t(e_i), p(e_i)) \le \theta = \sqrt{1 - \frac{1}{\sqrt{K}}}$$
which means that all correct predictions must satisfy the expression (it is a necessary but not sufficient condition), because the largest predicted probability, which a correct prediction assigns to the true class, cannot be smaller than 1/K. Let $\varphi_i = 1 - H(t(e_i), p(e_i))$ be the performance of the classification model with sample $e_i$. The measure $\varphi_i$ is equivalent to the probability of correctly classifying the sample $e_i$. Therefore, if $\varphi_i < 1 - \theta$ then sample $e_i$ will not be correctly classified. However, the opposite is not true, that is, even when $\varphi_i > 1 - \theta$ there will be some samples that will not be correctly classified (e.g. when t = [1, 0, 0] and p = [0.4, 0.5, 0.1] the prediction is not correct), thus establishing a theoretical bound for incorrect classifications (error rate). In fact, this is a very interesting aspect, because two classifiers, both with accuracy = 1 (no errors), could differ in their aggregated mean distance $\bar{H}$, so the performance would be very different in terms of prediction certainty (equal in accuracy, but not in confidence).
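The necessary-but-not-sufficient character of the bound can be illustrated with the counterexample from the text. The Python sketch below assumes $\theta = \sqrt{1 - 1/\sqrt{K}}$, a form that reproduces the value 1 − θ = 0.21 reported for the 7-class experiments in Sec. III; the function names are hypothetical.

```python
import math

def hellinger(t, p):
    """Hellinger distance between two discrete distributions (values in [0, 1])."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(t, p))
    return math.sqrt(squared / 2)

def theta(num_classes):
    """Assumed bound on H for correct predictions: sqrt(1 - 1/sqrt(K))."""
    return math.sqrt(1 - 1 / math.sqrt(num_classes))

def is_correct(t, p):
    """A prediction is correct when the arg max of p matches the observed class."""
    return max(range(len(p)), key=p.__getitem__) == t.index(1)

t = [1, 0, 0]
p_wrong = [0.4, 0.5, 0.1]        # example from the text: phi above the bound, yet wrong
phi = 1 - hellinger(t, p_wrong)

print(round(phi, 3), round(1 - theta(3), 3), is_correct(t, p_wrong))
print(round(1 - theta(7), 2))    # threshold for a 7-class problem
```

Here φ is about 0.394, which exceeds 1 − θ (about 0.350 for K = 3), and yet the arg max selects the wrong class.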
Mathematically, a sequence S is an ordered collection of objects, $(S_n)_{n \in \mathbb{N}}$. The set Φ of values $\varphi_i$ calculated for all the test samples can be increasingly ordered, producing a sequence. The points of Ω, which pair the relative position of each sample in that sequence (x-axis) with its value of φ (y-axis), can be plotted as a line connecting the points, and this provides a curve within the unit square. The AU(MCP) can also be obtained as a quantification of the classifier performance, comparable to AU(ROC) (as the sum of the trapezoidal areas between every two consecutive points), although in this case for multi-class datasets.
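A minimal Python sketch of the construction follows. It is not the authors' implementation: the function names are hypothetical, and the x-coordinates are assumed to be evenly spaced on [0, 1] (so that the curve starts at x = 0 and ends at x = 1), since the exact definition of the points is not given here.

```python
import math

def hellinger(t, p):
    """Hellinger distance between two discrete distributions (values in [0, 1])."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(t, p))
    return math.sqrt(squared / 2)

def mcp_curve(true_probs, pred_probs):
    """Return the points (x, phi) with phi = 1 - H sorted increasingly."""
    phis = sorted(1 - hellinger(t, p) for t, p in zip(true_probs, pred_probs))
    n = len(phis)
    # Assumption: x-coordinates are evenly spaced from 0 to 1.
    xs = [i / (n - 1) for i in range(n)] if n > 1 else [0.0]
    return list(zip(xs, phis))

def area_under_curve(points):
    """Sum of trapezoidal areas between consecutive points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

truth = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
perfect = area_under_curve(mcp_curve(truth, truth))
worst = area_under_curve(mcp_curve([[1, 0], [1, 0]], [[0, 1], [0, 1]]))
print(perfect, worst)
```

With perfect predictions every φ is 1 and the area is 1; when every prediction puts all the probability on a wrong class, every φ is 0 and the area is 0.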
The design of the MCP curve algorithm (see Alg. 1) prioritizes interpretability over efficiency. In the case of k-fold cross-validation, it would be necessary to join all the k sequences $\Phi_k$ and then (merge)sort the final sequence, before generating the points of Ω. The complexity of the algorithm is O(n log n) and it is independent of the complexity of the classifier.
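For the cross-validation case, one possible way to join the per-fold sequences is a k-way merge. The sketch below uses Python's standard `heapq.merge`; the function name `merge_fold_sequences` and the fold values are hypothetical, and this is just one way to obtain the final ordered sequence.

```python
import heapq

def merge_fold_sequences(fold_phis):
    """Merge the per-fold phi sequences into a single increasing sequence."""
    # Each fold is sorted on its own, then all folds are merged in one pass.
    return list(heapq.merge(*(sorted(fold) for fold in fold_phis)))

# Hypothetical phi values from three folds.
folds = [[0.9, 0.2], [0.5], [0.1, 0.7]]
merged = merge_fold_sequences(folds)
print(merged)
```

Concatenating the folds and sorting once would give the same result; the merge variant simply mirrors the (merge)sort wording of the text.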
Algorithm 1 MCP curve
Require: $D_\lambda$, $D_\pi$: datasets for learner (λ) and predictor (π); C: classifier; d: distance function
Ensure: Ω: sequence of points of the MCP curve

As Ω contains all the points of the MCP curve, increasingly ordered by φ, the points up to position α on the x-axis have values of φ below 1 − θ, which means that these samples are not correctly classified. From the point α onwards, all the remaining samples might be correctly classified, although not with the same confidence, as the values of φ range in [1 − θ, 1]. However, if the probability assigned to the true class exceeds 0.5, i.e. $BC(t(e_i), p(e_i)) > 1/\sqrt{2}$, then sample $e_i$ is correctly classified, which defines another threshold for H values:
$$\delta = \sqrt{1 - \frac{1}{\sqrt{2}}} \approx 0.54$$
Therefore, if φ < 1 − θ then the sample is incorrectly classified; if φ > 1 − δ then the sample is correctly classified; and if φ ∈ [1 − θ, 1 − δ] then the behavior of the classifier is uncertain. As the MCP curve is a monotonically increasing function, the prediction performance will depend on the shape of the function within the three possible regions (incorrect, uncertain and correct) defined by the two points (α, 1 − θ) and (β, 1 − δ). Thus, given these two points, we could observe many different performance behaviors for the same confusion matrix, on which most classification performance measures are based.

The promising scenario opened by the MCP curve requires further analysis and study of its properties in relation to the most commonly used metrics in the scientific literature, since prediction probabilities have much more potential than the confusion matrix (counters from the one-hot vector provided by a discrete mapping function over the probabilities: $g : [0, 1]^K \to \{0, 1\}^K$).
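The split of the samples into incorrect, uncertain and correct regions can be sketched in Python. This is an illustration under stated assumptions: the thresholds are taken as $\theta = \sqrt{1 - 1/\sqrt{K}}$ and $\delta = \sqrt{1 - 1/\sqrt{2}}$, which reproduce the values 1 − θ = 0.21 and 1 − δ = 0.46 quoted for the 7-class experiments; the function names and the φ values are hypothetical.

```python
import math

def thresholds(num_classes):
    """Return (theta, delta) under the assumed closed forms."""
    theta = math.sqrt(1 - 1 / math.sqrt(num_classes))
    delta = math.sqrt(1 - 1 / math.sqrt(2))
    return theta, delta

def region_fractions(phis, num_classes):
    """Fractions of samples in the incorrect, uncertain and correct regions."""
    theta, delta = thresholds(num_classes)
    n = len(phis)
    incorrect = sum(1 for phi in phis if phi < 1 - theta) / n   # below 1 - theta
    correct = sum(1 for phi in phis if phi > 1 - delta) / n     # above 1 - delta
    return incorrect, 1 - incorrect - correct, correct

theta7, delta7 = thresholds(7)
print(round(1 - theta7, 2), round(1 - delta7, 2))

# Hypothetical phi values for a 7-class problem.
fractions = region_fractions([0.1, 0.3, 0.5, 0.9], num_classes=7)
print(fractions)
```

In the notation of the text, the first fraction plays the role of α and the sum of the first two fractions that of β.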

III. EXPERIMENTS
The Drug Consumption dataset [34] contains records for 1,885 respondents, and 12 attributes: neuroticism, extraversion, openness to experience, agreeableness, conscientiousness, impulsivity, sensation seeking, level of education, age, gender, country of residence and ethnicity. All input attributes were originally categorical and were quantified by the authors to be considered as real-valued, by using ordinal and nominal feature quantification techniques (polychoric correlation and non-linear categorical principal component analysis, respectively). In addition, participants were questioned concerning their use of 18 legal and illegal drugs (alcohol, amphetamines, amyl nitrite, benzodiazepine, cannabis, chocolate, cocaine, caffeine, crack, ecstasy, heroin, ketamine, legal highs, LSD, methadone, mushrooms, nicotine and volatile substance abuse) and one fictitious drug (Semeron), which was introduced to identify over-claimers. For each drug they had to select one of the answers: never used the drug (CL0), used it over a decade ago (CL1), or in the last decade (CL2), year (CL3), month (CL4), week (CL5), or day (CL6). Therefore, the dataset contains eighteen 7-class classification problems.
Two well-known classifiers were used: Naïve-Bayes (NB) and Random Forests (RF). All the experiments were validated by means of stratified 10-fold cross-validation.
Basic statistics for NB (default parameters) and RF (gain ratio as splitting criterion for decision trees, no pruning and 500 trees) are shown in Tables 1a and 1b, respectively (results for accuracy and Cohen's κ are also included). The difficulty of classification is remarkable, mainly due to the imbalance of the target variable. Values for precision and sensitivity are notably low for all the classes except for CL0, which represents about 85% of the data ([CL0, 85.1%], [CL1, 3.6%], [CL2, 5%], [CL3, 3.5%], [CL4, 1.3%], [CL5, 0.8%], [CL6, 0.7%]). However, the interest lies in the higher-numbered classes (CL4-CL6), as they indicate recent drug consumption. Accuracy, as an overall measure, reveals a much better performance of RF (0.851) over NB (0.585). However, RF is assigning most test cases to the majority class (CL0), which is of no practical use. On the other hand, Cohen's κ (a measure of agreement between what is observed and what would be expected by chance) does show the unsatisfactory behavior of NB (0.100) and, more importantly, of RF (0.019), as opposed to accuracy [35].
Considering the ROC curves, which in principle should visually show the behavior of each class with respect to the others, the interpretation becomes inconsistent and inconclusive. For NB (Fig. 1a), the behavior ranges from the worst (CL2) to the best (CL6) over a wide range. However, the accuracy for CL6 is minimal (1.87% in Tab. 1a). For RF (Fig. 1b), the behavior is more stable, with lower variance, but still offers good performance for CL6 (AU(ROC) = 0.823), although its precision is null. In sum, both accuracy and ROC curves (as well as the area under the curve) offer quality indicators that tend to overestimate the real performance of the classifiers.

Fig. 1. (a) One-versus-all ROC curves for NB. (b) One-versus-all ROC curves for RF.

The curves depicted in Fig. 1c show the multi-class classification performance (MCP) curves of the Naïve Bayes and Random Forests classifiers on the Drug Consumption dataset for the target heroin. The best case would be when all the distances are zero, so the φ values would be 1, and the curve would be a flat line equivalent to the constant function y = 1 (AU(MCP) would be 1). The worst case would be when all the distances are 1, i.e. the true and prediction probabilities are maximally distant, and the curve would be a flat line equivalent to the constant function y = 0 (AU(MCP) would be 0). Therefore, the visual interpretation of the MCP curves is quite similar to that of the ROC curve. The goal of this image is to show that both the shape of the curve and the AU(MCP) consistently illustrate the classification performance of both predictive models (NB and RF).
RF quickly reaches the values 1 − θ = 0.21 (for x = 0.12) and 1 − δ = 0.46 (for x = 0.16), which means that 12% and 84% of samples are incorrectly and correctly classified, respectively, leaving only 4% of uncertainty. For NB, these values are reached at x = 0.33 and x = 0.43, respectively, which means 33% incorrectly classified, 57% correctly classified, and 10% of uncertainty. These values are directly related to the accuracy (NB = 0.585 and RF = 0.851) shown in Tabs. 1a and 1b. As for the values of AU(MCP) (NB = 0.515 and RF = 0.667), they are more conservative than the values of accuracy, which is consistent with the very low precision and sensitivity values for six out of seven class labels. The most relevant aspect is that the MCP curve reveals the low confidence of predictions for correctly classified cases, as both classifiers only provide probabilities close to 1 for just a few samples. In contrast, the ROC curves (Figs. 1a and 1b) reach high TPR values very soon (when FPR = 0.5, values of TPR are over 0.8 for RF).
Considering the accuracy and ROC results for RF, it could be concluded that they are satisfactory, although they are certainly not reliable. The MCP curve not only shows the classification performance, but also helps to highlight the low or high confidence in correct predictions.
There exists a structural difference between ROC and MCP curves: the ROC curve always starts at (0, 0) and ends at (1, 1), because this is a consequence of setting the decision threshold at 1 and 0, respectively. However, the MCP curve could start at any point $(0, \varphi_{min})$ and end at any point $(1, \varphi_{max})$, where $\varphi_{min}$ and $\varphi_{max}$ are the minimum and maximum values of φ (one minus the distance between the true and the prediction probabilities) over the samples, respectively. For instance, if no sample $e_i$ gets a perfect prediction, with $d_i = 0$, then $\varphi_i < 1$ for every sample, and the curve will not reach the upper-right corner of the unit square. This feature is also interesting when comparing the performance of classifiers.

IV. CONCLUSIONS
Multi-class classification is a very common and important problem. Many quality measures, most of them based on the confusion matrix, exist to assess classification performance for binary datasets. The ROC curve stands out from the others in that it provides a graphical representation of classifier performance, and also a quantification of its quality (AU(ROC)). However, it cannot provide a unique representation for multi-class datasets. Several approaches have been formulated, like micro-average and macro-average ROC curves, but none of them has proven to be effective, as they do not provide a unique curve for the classifier (instead, one per class), and therefore they have not been widely adopted by the scientific community. The MCP curve arises as an alternative to fill this gap in the context of multi-class classification.
The confusion matrix is based on prediction probabilities. All the values of this matrix are calculated by the arg max function from the probabilities (class with the largest predicted probability). Therefore, real numbers (probabilities) are transformed into a one-hot vector for each test instance before updating the confusion matrix. The MCP curve works directly with prediction probabilities, avoiding a loss of information (with respect to classification performance) caused by the transformation into the confusion matrix.
Understanding the confidence of predictions is an important issue, because it lends mathematical continuity to discrete decisions. To calculate the MCP curve, the classic correct or incorrect classification is replaced by probabilities, which enriches the interpretation of results and allows a better understanding of the behavior of predictive models. For example, for a 4-class problem, a medical decision would not be as reliable when the probability for the predicted class is 0.51 as it would be when it is 1 (both correct). In well-known measures like accuracy, the Matthews correlation coefficient or the F1-score, the decision is made by voting without analyzing the outcome of prediction probabilities for each class. Therefore, from the perspective of reliability, for a given dataset, several classifiers yielding exactly the same confusion matrix might provide different MCP curves, which contributes to a deeper insight into the prediction performance of each classifier.
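The point that identical confusion matrices can hide different confidence levels can be illustrated with a small Python sketch. The two sets of prediction probabilities below are invented for the example, and the function names are hypothetical; the mean of φ = 1 − H is used here as a simple stand-in for the aggregated performance.

```python
import math

def hellinger(t, p):
    """Hellinger distance between two discrete distributions (values in [0, 1])."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(t, p))
    return math.sqrt(squared / 2)

def predicted_labels(probs):
    """Arg max decision: the only information kept by the confusion matrix."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

def mean_phi(true_probs, pred_probs):
    """Average of phi = 1 - H over all samples."""
    phis = [1 - hellinger(t, p) for t, p in zip(true_probs, pred_probs)]
    return sum(phis) / len(phis)

truth = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Two hypothetical classifiers: both always right, with very different confidence.
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

print(predicted_labels(confident), predicted_labels(hesitant))
print(round(mean_phi(truth, confident), 3), round(mean_phi(truth, hesitant), 3))
```

Both classifiers produce the same decisions, hence the same confusion matrix and accuracy, yet the mean φ of the hesitant one is much lower.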
As stated, the ROC curve is very useful to graphically compare the prediction performance of classifiers when datasets are binary (2-class), but it is not applicable to multi-class datasets. The MCP curve offers the research community a novel mathematical tool for the comparative analysis of classifiers when dealing with multi-class datasets.
Finally, the use of prediction probabilities instead of the confusion matrix opens new research directions to deepen prediction performance measures, like the MCP curve. Future work will focus on investigating the impact of prediction probabilities on performance metrics for classifiers.

Since 2012, he is an assistant professor at SUT and senior specialist at the Research Network Łukasiewicz - Institute of Innovative Technologies EMAG in Katowice, Poland. He published over 90 scientific papers. His scientific fields of interests include rough set theory, biclustering, machine learning and data analysis.