To describe multiclass classification performance, several figures of merit (FOMs) have been proposed. Among the earliest and most widely known of these is the three-class Hotelling trace (3-HT). The goal of this paper is to present theoretical and empirical evidence demonstrating the failure of the 3-HT as a measure of three-class task performance. To this end, we contrast it with a newly proposed three-class FOM, the volume under the three-class receiver operating characteristic (ROC) surface (VUS). The VUS is obtained from a decision-theory-based three-class ROC analysis method that has been proved to extend the decision-theoretic, linear discriminant analysis (LDA), and psychophysical foundations of binary ROC analysis to a three-class paradigm. We demonstrate empirically that the VUS and the 3-HT do not, in general, have a monotonic relationship when describing three-class task performance. Numerical experiments demonstrated that the VUS provided reasonable results, while the 3-HT failed to distinguish the case where all objects could be perfectly classified from the case where only one pair of the classes could be perfectly classified. We provide theoretical explanations of this failure of the 3-HT. The significance of this work goes beyond merely demonstrating the problems of the 3-HT: it shows that a FOM that is mathematically correct and has a strong theoretical basis can provide results that violate a common-sense understanding of three-class task performance. This fact raises the question of “how to evaluate a classification performance evaluation method?” We believe the answer to this question lies in the theoretical foundations of binary ROC analysis. We have thus contrasted the two FOMs in terms of three fundamental theories underlying binary ROC analysis: decision theory, binary linear discriminant analysis, and the equivalence of two psychophysical classification procedures. 
These theoretical investigations demonstrated the importance of extending and unifying all the fundamental theories of binary classification in the development of a three-class FOM; violating one of these fundamental binary classification theories may, as it did for the 3-HT, yield predictions of three-class task performance that do not agree with a common-sense understanding of that performance.
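
The kind of failure described above can be illustrated with a minimal sketch. It assumes the common textbook form of the Hotelling trace, J = tr(S_w^{-1} S_b), with S_w the prior-weighted within-class covariance and S_b the prior-weighted between-class scatter; the abstract does not state the exact definition used for the 3-HT, so this form, the function name `hotelling_trace`, and the two Gaussian configurations below are illustrative assumptions rather than the paper's own experiments.

```python
import numpy as np


def hotelling_trace(means, covs, priors):
    """Population Hotelling trace J = tr(S_w^{-1} S_b) for K classes.

    Assumed (textbook) definition; hypothetical helper, not the paper's code.
    """
    means = np.asarray(means, dtype=float)    # shape (K, d)
    covs = np.asarray(covs, dtype=float)      # shape (K, d, d)
    priors = np.asarray(priors, dtype=float)  # shape (K,)

    # Prior-weighted overall mean.
    grand_mean = priors @ means
    # Within-class scatter: prior-weighted average of the class covariances.
    s_w = np.einsum("k,kij->ij", priors, covs)
    # Between-class scatter: prior-weighted outer products of mean offsets.
    diffs = means - grand_mean
    s_b = np.einsum("k,ki,kj->ij", priors, diffs, diffs)
    # Solve S_w X = S_b instead of forming the inverse explicitly.
    return float(np.trace(np.linalg.solve(s_w, s_b)))


priors = np.full(3, 1.0 / 3.0)
covs = np.stack([np.eye(2)] * 3)  # unit-variance, uncorrelated features

# Configuration A: classes 1 and 2 coincide (indistinguishable),
# class 3 lies 30 standard deviations away.
means_one_pair = [[0.0, 0.0], [0.0, 0.0], [30.0, 0.0]]

# Configuration B: all three classes at the vertices of an equilateral
# triangle with side sqrt(600), about 24.5 standard deviations apart.
s = np.sqrt(600.0)
means_all = [[0.0, 0.0], [s, 0.0], [s / 2.0, s * np.sqrt(3.0) / 2.0]]

j_one_pair = hotelling_trace(means_one_pair, covs, priors)
j_all = hotelling_trace(means_all, covs, priors)
print(j_one_pair, j_all)  # both approximately 200
```

Under these assumptions the two configurations yield the same trace value, even though two of the three classes cannot be told apart at all in configuration A, while every pair is essentially perfectly separable in configuration B. A single scalar built from pooled scatter matrices cannot tell which pairs of classes contribute the separation.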