Visualizing Classification Results: Confusion Star and Confusion Gear

Recent developments in machine learning applications are deeply concerned with the poor interpretability of most of these techniques. To gain some insight into the process of designing data-based models, it is common to graphically represent the algorithm's results, either in their final or intermediate stage. Especially challenging is the task of plotting multiclass classification results, as they involve categorical variables (classes) rather than numeric results. Using the well-known MNIST dataset and a simple neural network as an example, this paper reviews the existing techniques to visualize classification results, from those centered on a particular instance or set of instances, to those representing an overall performance metric. As classification results are commonly summarized in the form of a confusion matrix, special attention is paid to its graphical representation. From this analysis, a new visualization tool is derived, which is presented in two forms: confusion star and confusion gear. The confusion star is centered on the classification errors, while the confusion gear focuses on the classification hits. The proposed visualization tools are also evaluated when facing: (i) balanced and imbalanced classification problems; (ii) the problem of representing errors with different orders of magnitude. By using shapes instead of colors to represent the value of each matrix cell, the new tools significantly improve the readability of confusion matrices. Furthermore, we show how the areas enclosed by the confusion stars and gears are directly related to standard classification metrics. The new graphic tools can also be usefully employed to visualize the performance of a sequence of classifiers.


I. INTRODUCTION
Machine learning models in general, and deep learning algorithms in particular, are powerful techniques able to provide very good results when there is a pattern to be learnt from the available data, but at the cost of operating as black boxes.
On the other hand, having some insights about how they work is a key issue for several reasons: improving the interpretability and explainability of the models [1], debugging and improving architectures and algorithms [2], comparing and selecting results [3], and even for pedagogical purposes [4]. Therefore, a common approach to unveil their functioning relies on some kind of visualization of their inner operation and final results e.g., in the computer vision domain [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Weiping Ding .
The main target audience of these tools is the model developer community [6], but technically skilled model users [7] and even non-experts [8] can also benefit from a visual description.
These users may be interested in the visual representation of different types of models' information, such as model architecture [9], neural network's weights [10], convolutional filters' values [11], neurons' activation outputs [12] or edges' backpropagation gradients [13]. However, by far the most represented information is the model's predictions either for a particular instance [14], for a group of instances [15] or for the overall dataset [16].
Many methods have been described with the aim of visualizing the prediction process. An up-to-date comprehensive survey, structured using the Five W's and How questions (Why, Who, What, When, Where, and How), can be found in [17]. Also, a perspective of visual analytics for understanding, diagnosing, and refining models is reviewed in [18]. Additionally, different visualization tools integrating several approaches have been developed [19]–[23].
Focusing on how to visualize the results predicted by machine learning algorithms, different approaches should be considered depending on the type of problem addressed. The information to be represented (prediction results) is qualitatively different for tasks such as regression, classification, clustering, reinforcement learning, etc. This paper addresses the issue of visualizing the results obtained by multiclass classification algorithms, since this is one of the most frequent tasks in machine learning applications (for instance, around 75% of the datasets in the well-known University of California Irvine Machine Learning Repository [24] contain classification problems).
In most cases, the performance of a classifier is summarized by a single metric (accuracy, precision, etc.), but ''it is important to understand both what a classification metric expresses and what it hides'' [25]. For this reason, a classification metric can also be disaggregated as a set of values with the purpose of gaining better insight into the classifier's results.
As for the level of disaggregation to be used in visualizing classification results, three approaches are considered in the paper:
• Low-detailed results, using a single-valued metric for the classification of the whole dataset.
• Medium-detailed results, where the classification of the whole dataset is summarized by a small set of values.
• High-detailed results, representing classification scores for a single instance or a set of instances in the dataset.
Although the paper briefly examines how to represent low and high-detailed classification results, its main focus is on how to visualize them at a medium level of detail, which is commonly described by the multiclass confusion matrix [26].
The main contributions of this research can be summarized as follows:
• Two new approaches to visualize the results of a multiclass classifier are proposed, namely the confusion star and confusion gear graphics.
• Their use as an intuitive guideline to understand the classification behavior is explored.
• Their application to imbalanced datasets is considered.
• Their role in comparing different classifiers is highlighted, as well as in understanding the influence of the classifiers' hyperparameters.
• The relationships between the shape of these graphs and common classification metrics are derived.
The paper is organized as follows. Section II describes the structure of the dataset used in the research, defines the classification scoring procedure and formalizes the concept of confusion matrix. Then, in section III, several techniques to visualize classification scores and multiclass confusion matrices are reviewed. The extension of these ideas is addressed in section IV, where the confusion star and confusion gear concepts are presented. Later, in section V, these new tools are discussed, tackling issues such as the impact of imbalanced datasets, the inner and outer areas of the graphics, the use of logarithmic scale and the visualization of evolving classifiers by means of a sequence of the new graphics. Finally, the main findings of the research are presented in the conclusion section.

A. DATASET
Throughout this research the MNIST (Modified National Institute of Standards and Technology) dataset [27] has been used as the primary dataset. It contains 70,000 images, each of them representing a handwritten digit (0 to 9). The dataset is split into a 60,000-image subset that is used to train the classifier (training dataset) and a 10,000-image subset employed for generalization purposes (testing dataset). In this case there are 10 classes, one for each digit.
This dataset has been widely used as a reference to analyze different classification algorithms. Our goal in this paper is not to obtain a better classifier but, given the results of any of them, to explore how to represent its confusion matrix.
As a first example, a classifier implemented as a very simple neural network has been considered, with a single 8-neuron hidden layer using a sigmoid activation function. The output layer contains 10 nodes (one for each class) with a softmax activation function. Such a network is trained for just 5 epochs, and its generalization results are evaluated on the testing dataset. These test results are used in the following to show different visualization methods.
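The forward pass of such a network can be sketched in plain NumPy; the weights below are random placeholders rather than trained parameters, and all names (`forward`, `W1`, `W2`) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # shifted for numerical stability
    return e / e.sum()

# 784 input pixels -> 8 hidden sigmoid units -> 10 softmax outputs.
W1, b1 = rng.normal(scale=0.1, size=(8, 784)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(10, 8)), np.zeros(10)

def forward(x):
    h = sigmoid(W1 @ x + b1)         # hidden-layer activations
    return softmax(W2 @ h + b2)      # one score per class, summing to 1

x = rng.random(784)                  # stand-in for one flattened 28x28 image
scores = forward(x)
estimated_class = int(np.argmax(scores))
```

Because the output layer is a softmax, the scores behave as probabilities, which is what later allows the probabilistic reading of the score matrix.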
This classifier is deliberately simple for the purpose of obtaining low performance: in this case, differences among the considered visualization techniques can be more easily appreciated. By increasing the number of hidden layers, the number of nodes per layer, and the number of training epochs, much better classification results can be obtained. For instance, using convolutional neural networks, excellent results (99.8% accuracy) have been reported [28].
In the final part of the paper, the evolution of the classification performance as a function of the number of training instances is discussed. In this case the MNIST dataset has also been used, now raising the number of neurons in the hidden layer up to 128 and training the neural networks for 100 epochs.
To show the ability of confusion stars and gears to visualize classification results in problems with a high number of classes, a second dataset, the CIFAR-100, has also been considered [29]. This dataset consists of 60,000 32×32 color images in 100 classes, with 600 images per class, where 50,000 images (83%) are used for training and 10,000 (17%) for testing. A 6-layer Convolutional Neural Network (CNN) classifier has been employed, according to the code in [30]. This is not a very powerful classifier, as it shows an accuracy of about 40%, while the state-of-the-art classifiers for this problem reach figures over 96% [31]. However, this moderate accuracy is quite convenient to depict confusion stars and gears with many classes.
Finally, to show the impact of imbalanced datasets on confusion stars and gears, a reduced version of the Abalone dataset is employed. This dataset, available in [24], derives from a non-machine-learning study [32] and contains physical measurements (height, several lengths, diameter, sex) of abalone mollusk specimens, along with the number of ''rings'' present in the shell. The number of rings is proportional to the age of the mollusk. The purpose is to classify each observation into its age class. Certain classes in the original dataset contain very few instances (some classes with one or no elements), making any prediction unfeasible. To overcome this problem, a reduced dataset has been obtained by selecting only 10 classes, from class (age) 4 to 13, containing 3670 instances, which represent 88% of the total population. The resulting dataset contains the same number of classes (10) as the MNIST problem, but they are highly imbalanced, which is quite convenient for the sake of comparison. A simple multiclass logistic regression has been used as classifier.

B. CLASSIFICATION SCORE MATRIX
Let us consider a statistical population P that contains a set of elements, usually in a large and potentially infinite number. From this population, n elements are randomly sampled, obtaining a dataset $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ represents the i-th element. Let also $\Theta = \{\theta_1, \theta_2, \ldots, \theta_C\}$ be a set of classes, where C is the number of classes and $\theta_j$ represents the j-th class. A certain element $d \in D$ is defined by a pair $(\Phi, \theta)$ formed by a vector $\Phi = [\phi_1, \phi_2, \ldots, \phi_F]$ that contains the F features defining the element, and the class $\theta$ to which the element belongs. Let us call $\Phi_P = \{\Phi_1, \Phi_2, \ldots, \Phi_n\}$ the set containing the feature vectors of the population P.
A classifying algorithm A is defined as a function from the population P to $\mathbb{R}^C$ (the set of real vectors of dimension C), which can be expressed as $A: \Phi_P \to \mathbb{R}^C$. Therefore, each element belonging to P is associated with a scoring vector $\Psi = [\psi_1, \psi_2, \ldots, \psi_C]$, that is, a score for each class in $\Theta$. If the scores can be interpreted as probabilities, the algorithm is a probabilistic classifier. If, instead, the scores are binary values (0, 1), the algorithm is a hard classifier.
A decision rule R is defined as a function that maps a scoring vector $\Psi$, defined in $\mathbb{R}^C$, into an estimation of the class $\hat{\theta} \in \Theta$, that is, $R: \mathbb{R}^C \to \Theta$.
Finally, a classifier C is defined as an ordered pair of functions $(A, R)$, indicating that it first applies the classification algorithm A and then the decision rule R, so that $C: \Phi_P \to \Theta$. Considering not the whole dataset but each single element, a classifier can be described as two sequential transformations, $\Phi \xrightarrow{A} \Psi \xrightarrow{R} \hat{\theta}$. Therefore, the result obtained applying the classifier C to an element in the dataset D is a scoring vector $[\psi_1, \psi_2, \ldots, \psi_C]$ and a class estimation $\hat{\theta}$. To measure the classifier performance, the actual class of the element must also be included. Then, the performance of a classifier operating on a dataset with n instances can be expressed by the score matrix (SM), with $SM \in \mathbb{R}^{n \times (C+2)}$, whose i-th row is $[\psi_{i1}, \psi_{i2}, \ldots, \psi_{iC}, \hat{\theta}_i, \theta_i]$. This matrix contains the information about the performance of the classifier at its maximum level of disaggregation.

C. CONFUSION MATRIX
In many situations, the classifier performance is analyzed not by considering the scores associated with each class, but just by comparing the estimated and the actual class for each instance in the dataset. So, by discarding the first C columns of the score matrix, the more compact estimation matrix (EM) is obtained, whose i-th row is $[\hat{\theta}_i, \theta_i]$. The estimation matrix EM has a smaller dimension (fewer columns) than the score matrix SM, but it still has a high level of disaggregation, since it contains information for each instance in the dataset. Therefore, it is common to summarize it using the confusion matrix, $CM = \{m_{ij}\}$, a $C \times C$ matrix where $m_{ij}$ represents the number of instances of class $\theta_i$ estimated by the classifier as belonging to the class $\theta_j$. The results obtained classifying the MNIST dataset with the neural network previously described are summarized in the confusion matrix shown in TABLE 1. Its last column also shows that there is a similar number of instances in each class, which means that the classes are quite balanced.
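Building the confusion matrix from the estimated and actual classes amounts to a simple counting pass over the estimation matrix; a toy 3-class sketch (`confusion_matrix` here is an illustrative helper, not a library call):

```python
import numpy as np

def confusion_matrix(actual, estimated, C):
    """cm[i, j] = number of instances of class i estimated as class j."""
    cm = np.zeros((C, C), dtype=int)
    for a, e in zip(actual, estimated):
        cm[a, e] += 1
    return cm

# Toy estimation matrix: actual and estimated classes for 6 instances, C = 3.
actual    = [0, 0, 1, 1, 2, 2]
estimated = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(actual, estimated, C=3)
```

The diagonal collects the hits, so its trace divided by the number of instances is the accuracy.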
In the definition of the confusion matrix, it is usual to describe $m_{ij}$ as a fraction of the total number of instances $m_i$ belonging to the class $\theta_i$. Calling this ratio $\lambda_{ij} \equiv m_{ij}/m_i$, then $m_{ij}$ can be expressed as $m_{ij} = \lambda_{ij} \cdot m_i$, and the confusion matrix can be rewritten as $CM = \Lambda \bullet M$. The symbol $\bullet$ represents the element-wise multiplication (also called Hadamard product), $\Lambda = \{\lambda_{ij}\}$ is the unit confusion matrix, and M is the matrix whose i-th row repeats the value $m_i$ in its C columns. TABLE 2 summarizes the main elements considered in the definition of the confusion matrix, where $g_j$ is the number of instances estimated as belonging to the j-th class.
VOLUME 10, 2022

A. CLASSIFICATION SCORES OF INSTANCES
Fully detailed classification results regarding the i-th instance correspond to the i-th row of the score matrix, $SM_i = [\psi_{i1}, \psi_{i2}, \ldots, \psi_{iC}, \hat{\theta}_i, \theta_i]$. The classification scores for the first three instances in the MNIST testing dataset can be depicted as in Fig. 1 (in colors blue, orange and green respectively). In this classifier the scores are generated by the softmax activation function of the 10-neuron output layer, so they are in the range [0, 1], sum up to 1 and, therefore, can be interpreted as probabilities. For example, the first instance (represented by a blue line) has probabilities (0.05, 0.07, 0.12, . . .) of belonging to the classes (0, 1, 2, . . .) respectively. Belonging to class 7 obtains the highest probability (0.34), so this is the class estimated by the classifier. In this case, the instance is classified correctly, which is indicated using filled dots. For the second instance (represented by an orange line), belonging to class 0 obtains the highest probability (0.26). In this case, this is an error, as the actual class is 2, which is indicated using empty dots.
This type of representation is only meaningful for a single instance or for a very reduced number of them. In order to depict classification scores for many instances, a polar scatter plot has been proposed [33], as shown in Fig. 2 for every instance in the MNIST testing dataset. Each class is represented by a certain angle, which for the j-th class is defined by $\varphi_j = 2\pi j / C$. The classification result for the i-th instance is depicted as a dot in the position defined by the vector $\mathbf{r}_i = \sum_j \mathbf{r}_{ij}$, where $\mathbf{r}_{ij}$ is a vector of modulus $\psi_{ij}$ and phase $\varphi_j$.
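Assuming the j-th class sits at angle 2πj/C, the position of one instance in this polar scatter plot is the vector sum of its score components; a small sketch:

```python
import numpy as np

def polar_position(scores):
    """2-D position of an instance: the sum of C vectors, the j-th one
    with modulus psi_j and phase phi_j = 2*pi*j/C."""
    C = len(scores)
    phi = 2 * np.pi * np.arange(C) / C
    return float(np.sum(scores * np.cos(phi))), float(np.sum(scores * np.sin(phi)))

# A score of 1 for class 0 lands on the class-0 axis at radius 1...
x, y = polar_position(np.eye(10)[0])
# ...while a fully undecided instance (equal scores) collapses to the origin.
xu, yu = polar_position(np.full(10, 0.1))
```

Confident instances therefore cluster near their class axis, while ambiguous ones drift toward the center of the plot.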

B. CLASSIFICATION SCORES OF CLASSES
A partial perspective of the score matrix may consider not every instance in the dataset, but only those belonging to a certain class. In this case, the scoring results of the instances belonging to the j-th class are defined by a slice $SM^{(j)}$ of the score matrix, containing the rows of the instances whose actual class is $\theta_j$. As the level of disaggregation of this matrix is still very high, it is commonly summarized using some statistics for each column (mean score value, standard deviation, density function, etc.). Fig. 3 depicts one of these summaries in the form of a boxplot. The i-th subplot considers the $m_i$ instances belonging to the i-th class, and its j-th box indicates the distribution of the scores $\psi_{ij}$ over those instances, that is, the scores of the elements of the i-th class for being estimated as belonging to the j-th class.
The instances belonging to class 1, for example, are estimated as belonging to class 1 with a probability distributed as it is shown in the second box of the second plot, clearly outperforming the remaining probability distributions. Then, very good classification results should be expected for instances belonging to class 1.
Conversely, instances belonging to class 2 (third plot) have a probability of being correctly classified distributed as shown in the third box. This distribution is only slightly better than those corresponding to the estimated classes 1 and 7, so many classification errors should be expected for instances belonging to class 2.
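The per-class slicing that feeds these boxplots can be sketched as follows, using a randomly generated, purely illustrative score matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical score matrix: 30 instances, C = 3 classes, rows sum to 1.
scores = rng.random((30, 3))
scores /= scores.sum(axis=1, keepdims=True)
actual = np.arange(30) % 3                 # actual class of each instance

def class_slice_summary(scores, actual, j):
    """Mean score per estimated class over the instances of actual class j,
    i.e., a one-number summary of each box in the j-th subplot."""
    block = scores[actual == j]            # the slice SM^(j)
    return block.mean(axis=0)

summary = class_slice_summary(scores, actual, j=0)
```

For a well-behaved classifier, the j-th entry of the j-th summary would clearly dominate the others, as the paper observes for class 1.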

C. REPRESENTATION OF THE CONFUSION MATRIX
Let us now focus on how to represent the classification results using a medium level of detail, that is, based on the confusion matrix. The most common way to depict a certain multiclass confusion matrix is straightforwardly drawing it as a C × C colored grid where each cell has a color scaled according to its value. Sometimes the cell also contains a text with its numeric value, as shown in Fig. 4.
In the case of an imbalanced dataset, it is better to represent the unit confusion matrix, commonly expressed in percentage values, as depicted in Fig. 5.
The confusion matrix or the unit confusion matrix can alternatively be represented as in Fig. 6 where, for each actual class, a stack of C bars is drawn. The height of each bar in a certain stack (actual class) is proportional to the number (or ratio) of instances estimated as belonging to each class, that is, to the values of a row in the confusion matrix. A similar stacked bar approach is used in [34].

D. REPRESENTATION OF BINARY CONFUSION MATRICES
Sometimes it is worth assessing the classification results of one class versus all the remaining ones (OvA binary classification). So, let us consider the instances belonging to the i-th class, which will be denoted as the ''positive'' (P) class. The remaining instances belong to different classes, which will be collectively denoted as the ''negative'' (N) class. In this way, the number of instances correctly classified as positives (TP: True Positives) is $TP_i = m_{ii}$. Similarly, the number of positive instances erroneously classified as negatives (FN: False Negatives) is $FN_i = \sum_{j \neq i} m_{ij}$. The number of elements not belonging to the i-th class (that is, belonging to the negative class) which are erroneously classified as positives (FP: False Positives) is $FP_i = \sum_{j \neq i} m_{ji}$. Finally, the number of elements not belonging to the i-th class which are correctly classified as negatives (TN: True Negatives) is $TN_i = n - TP_i - FN_i - FP_i$. Considering these results, the binary matrices corresponding to every class can be represented as shown in Fig. 7. Alternatively, they can be represented using stacked bar plots, as depicted in Fig. 8.
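The four one-vs-all counts can be read directly off the multiclass confusion matrix; a sketch on a made-up 3-class matrix (`one_vs_all` is an illustrative helper):

```python
import numpy as np

def one_vs_all(cm, i):
    """TP, FN, FP, TN for the one-vs-all binary problem of class i."""
    tp = int(cm[i, i])
    fn = int(cm[i, :].sum()) - tp      # class-i instances sent elsewhere
    fp = int(cm[:, i].sum()) - tp      # other instances sent to class i
    tn = int(cm.sum()) - tp - fn - fp
    return tp, fn, fp, tn

cm = np.array([[5, 1, 0],
               [2, 6, 1],
               [0, 2, 7]])
tp, fn, fp, tn = one_vs_all(cm, i=0)
```

By construction, the four counts always add up to the total number of instances n, whichever class plays the positive role.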
Binary classification results can also be analyzed using the receiver operating characteristic (ROC) curve [35]. Converting the classification scores of an instance into its estimated class requires a decision rule R which, in the binary case, is usually a threshold τ: if the score of belonging to the positive class $\psi_{iP}$ is greater than the threshold, the instance is estimated as positive; otherwise, as negative. So, the elements of the binary confusion matrix depend on τ, and so do their related metrics. Specifically, the True Positive Rate and the False Positive Rate are defined as $TPR = TP/(TP+FN)$ and $FPR = FP/(FP+TN)$. The ROC is built as a parametric curve in τ, with FPR(τ) on the horizontal axis and TPR(τ) on the vertical axis. The resulting ROC curves for the 10 binary classifiers are depicted in Fig. 9.
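A threshold sweep producing the (FPR, TPR) points of such a curve can be sketched as follows; the scores and labels are made up, and `roc_points` is an illustrative helper:

```python
import numpy as np

def roc_points(pos_scores, labels, thresholds):
    """(FPR, TPR) pairs for the rule: estimate positive when the
    positive-class score exceeds the threshold tau."""
    s = np.asarray(pos_scores)
    y = np.asarray(labels, dtype=bool)
    pts = []
    for tau in thresholds:
        pred = s > tau
        tpr = np.sum(pred & y) / y.sum()
        fpr = np.sum(pred & ~y) / (~y).sum()
        pts.append((float(fpr), float(tpr)))
    return pts

scores = [0.9, 0.8, 0.4, 0.3, 0.2]       # made-up positive-class scores
labels = [1, 1, 1, 0, 0]                 # made-up actual positives/negatives
pts = roc_points(scores, labels, thresholds=[-0.1, 0.5, 1.1])
```

Sweeping τ from below the lowest score to above the highest one traces the curve from the (1, 1) corner down to (0, 0).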

E. ALTERNATIVE REPRESENTATIONS OF THE CONFUSION MATRIX
Some authors have proposed alternative representations for the confusion matrix such as, for instance, the chord diagram in [14], called by its authors the confusion wheel. This plot is depicted in Fig. 10, where each class corresponds to a circular sector with a size proportional to the number of instances belonging to that class. Then a chord is drawn starting at the actual class sector and ending at the estimated class sector. The width of the chord at each side is proportional to the number of instances belonging to that side's class classified as belonging to the other side's class. The color of the chord is that of its widest side. Also, in [36] it is proposed to represent the confusion matrix using a Sankey diagram, as in Fig. 11. In the upper part, each class (origin) is represented by a rectangle with a width proportional to the number of instances belonging to that class. In the lower part, the estimated classes (destinations) are drawn with a width proportional to the number of instances predicted as belonging to that class. The ribbons drawn in the middle represent the instances belonging to the upper-side class but classified as belonging to the lower-side class.
In [37] the confusion matrix is conceived as a similarity matrix between classes. Then, it is transformed into its opposite, that is, a dissimilarity or distance matrix. Finally, this matrix is represented in a two-dimensional plane using the multidimensional scaling (MDS) technique.
The result is shown in Fig. 12, where each class is represented by a point in the new 2D plane. The closer a pair of classes, the more similar they are and, therefore, the more difficult it is to separate them. For example, classes 4 and 9 are very close in the MDS plane, which means that it is very difficult to separate them, and so a high number of classification errors should be expected.
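The embedding step can be sketched with classical (Torgerson) MDS in plain NumPy; to keep the numeric check exact, the toy distance matrix below is genuinely Euclidean (three points on a line) rather than one derived from a real confusion matrix:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed a symmetric distance matrix D into `dim` dimensions
    (classical scaling via double centering)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of the embedding
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]          # keep the top `dim` eigenpairs
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

# Three "classes" whose pairwise dissimilarities are exactly Euclidean
# (points 0, 1, 3 on a line), so the embedding must reproduce them.
D = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])
xy = classical_mds(D)
dist = np.linalg.norm(xy[:, None] - xy[None, :], axis=-1)
```

In the paper's use case, the input would instead be the dissimilarity matrix obtained by inverting the class-similarity reading of the confusion matrix.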
Along with the individual representations described above, it is also common to find visual representations of the classifier results that combine several of the preceding graphs. Although these more sophisticated graphics may seem visually very appealing, they do not necessarily provide additional information compared to the more conventional representations. Therefore, a new graphical representation is proposed in the following section. As a first step, the confusion matrix can be represented as a sequence of C lines, each of them corresponding to a row $CM_i$. Every line is defined by C values, corresponding to the elements $m_{ij}$. The result is depicted in Fig. 13. A similar approach is used in [38].

IV. BEYOND CONFUSION MATRIX
In a good classifier most instances are correctly estimated as belonging to their actual class, so $m_{ii} \approx m_i$ and $m_{ij} \approx 0, \forall j \neq i$. That is, each row contains a single very high value escorted by the remaining very low values. This important imbalance in the values of each row is clearly seen in the plot and makes its interpretation difficult. This is also the reason why such a graphic is not commonly used to represent the confusion matrix.
To overcome the issues raised by the previous representation, the row $CM_i$ containing the classification results corresponding to the i-th class is transformed into a new vector $EM_i \equiv [e_{i1}, e_{i2}, \ldots, e_{iC}]$, whose elements are defined as $e_{ii} = m_i - m_{ii}$ and $e_{ij} = m_{ij}, \forall j \neq i$. Then, for a perfect classification, $e_{ij} = 0, \forall j$. The matrix $EM = \{e_{ij}\}$ is denominated the error matrix.
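The transformation from confusion matrix to error matrix only rewrites the diagonal; a sketch on a made-up 3-class matrix:

```python
import numpy as np

def error_matrix(cm):
    """e_ii = m_i - m_ii (misses of class i); e_ij = m_ij for j != i."""
    em = cm.copy()
    np.fill_diagonal(em, cm.sum(axis=1) - np.diag(cm))
    return em

cm = np.array([[8, 1, 1],
               [0, 9, 1],
               [2, 0, 8]])
em = error_matrix(cm)
# Each new diagonal element equals the sum of the off-diagonal errors in its
# row, and a perfect classifier yields an all-zero error matrix.
```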
The i-th row of this matrix can also be formulated in terms of the ratio over the total number of instances belonging to the i-th class, $EM_i = [\epsilon_{i1} m_i, \epsilon_{i2} m_i, \ldots, \epsilon_{iC} m_i]$, where the ratio $\epsilon_{ij} \equiv e_{ij}/m_i$. The matrix $E = \{\epsilon_{ij}\}$ is denominated the unit error matrix and can be represented as a sequence of C lines, as shown in Fig. 14. In this linear representation an abnormally high value is observed in each line, corresponding to the element $e_{ii}$, that is, the number of instances belonging to the i-th class erroneously classified as belonging to any other class.
To explain these peaks, let us first recall that the C elements of the i-th row in the confusion matrix are mutually dependent, having C − 1 degrees of freedom, since they obey the equation $\sum_j m_{ij} = m_i$. Then, the number of hits (correct classifications) for the i-th class is $m_{ii} = m_i - \sum_{j \neq i} m_{ij}$. Recalling the definition of the elements of the error matrix, its diagonal elements can be written as $e_{ii} = m_i - m_{ii} = \sum_{j \neq i} e_{ij}$. The term $e_{ij}$ counts the number of instances belonging to the i-th class erroneously classified as belonging to the j-th class. Calling $\bar{e}_{ij}$ its mean value over $j \neq i$, it can be written that $e_{ii} = (C-1) \cdot \bar{e}_{ij}$. Then, in the MNIST example (with C = 10), the value of $e_{ii}$ will be 9 times higher than the mean of the remaining $e_{ij}$. This is the reason why a peak appears in the linear representation of Fig. 14. The distribution of the classification errors for each class is depicted in Fig. 15, where it is clearly shown that the value of $e_{ii}$ (in green) is much higher (9 times) than the value of $\bar{e}_{ij}$ (in blue).
Considering the C − 1 degrees of freedom in the rows of the error matrix, any of its elements can be omitted without losing information. Removing the element $e_{ii}$ is then a convenient decision, as it eliminates the peaks in the plot, as depicted in Fig. 16. It must be noted that the horizontal axis does not indicate the estimated class but an index to this class once the redundant element, the one corresponding to the actual class itself, has been removed. Then, for instance, in the green line (actual class 2), the index runs over the estimated classes 0, 1, 3, 4, . . . , 9, a sequence where class 2 has been omitted. More formally, for the i-th actual class (the row in the matrix) and the j-th estimated class (the column), the index k of the estimated class is defined by $k = j$ if $j < i$ and $k = j - 1$ if $j > i$.
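The index mapping that skips the redundant diagonal element can be written explicitly (0-based class labels, as with the MNIST digits):

```python
def estimated_class_index(i, j):
    """Index k of estimated class j in row i of the error matrix once the
    redundant element e_ii is removed (0-based class labels)."""
    if j == i:
        raise ValueError("the diagonal element is the one removed")
    return j if j < i else j - 1

# Row for actual class 2: estimated classes 0, 1, 3, ..., 9 map to 0..8.
indices = [estimated_class_index(2, j) for j in range(10) if j != 2]
```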

B. STEP REPRESENTATION OF THE ERROR MATRIX
In the linear representation without redundancies of the unit error matrix (Fig. 16), let us focus on a particular class, for instance, class 2, as this is the class obtaining the worst classification results. The row of the matrix corresponding to this class can be represented as in Fig. 17.
Recalling that the row values in the error matrix are $e_{ij} = m_{ij}, \forall j \neq i$, the sum of these values is $\sum_{j \neq i} e_{ij} = m_i - m_{ii}$. As $e_{ij} = \epsilon_{ij} m_i$, this equation can be rewritten as $\sum_{j \neq i} \epsilon_{ij} = 1 - m_{ii}/m_i$, which is the sum of the values in Fig. 17. The term $m_{ii}/m_i$ is usually denominated the True Positive Rate of the i-th class ($TPR_i$), also known as Sensitivity or Recall. Its complementary value, $1 - TPR_i$, is called the False Negative Rate ($FNR_i$) or Miss Rate. Then it can be said that the sum of the values in Fig. 17 is $\sum_{j \neq i} \epsilon_{ij} = FNR_i$. To visualize this value as the area under the line in Fig. 17, it is better to transform the linear representation of the error matrix into a step representation, as depicted in Fig. 18. There, each non-redundant value of the error matrix for the i-th class is represented as a step of unit width. The equivalent linear representation is also drawn as a dashed line.
Considering the unit width of each step, the area under the step line is $A_i = \sum_{j \neq i} \epsilon_{ij} = FNR_i$. The cumulative values of these areas are also drawn in the graphic (dashed blue line).
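The identity between the summed non-redundant unit errors and the per-class miss rate can be checked numerically; `cm` below is a made-up 3-class confusion matrix:

```python
import numpy as np

def miss_rate(cm, i):
    """FNR_i = 1 - m_ii / m_i, the fraction of class-i instances misclassified."""
    m_i = cm[i, :].sum()
    return float(1.0 - cm[i, i] / m_i)

cm = np.array([[8, 1, 1],
               [0, 9, 1],
               [2, 0, 8]])
i = 0
unit_row = cm[i, :] / cm[i, :].sum()                     # unit confusion values
off_diagonal_sum = float(unit_row.sum() - unit_row[i])   # area under the step line
```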

C. POLAR REPRESENTATION OF THE ERROR MATRIX
The visualization of the error matrix row for class 2 (linearly represented in Fig. 17) can be redrawn in a radial shape. For this purpose, C − 1 radii are sketched, each one corresponding to a non-redundant element of the i-th row in the error matrix. The k-th non-redundant element is represented by a line at an angle (with respect to the horizontal) $\varphi_k = k \cdot \Delta\varphi$, where the angular width corresponding to each class is $\Delta\varphi = 2\pi/(C-1)$. The result of this plot is depicted in Fig. 19. It must be noted again that the labels in the outermost circle do not indicate the estimated classes but the indices to the estimated classes.
To make the resulting area meaningful, it is better to transform the radial representation of the error matrix into a representation by arcs, as shown in Fig. 20. In this plot, which can be denominated the sectorial or pie representation, each non-redundant value of the error matrix for the i-th class is represented by a circular sector of constant angular width $\Delta\varphi$. The dashed line represents the equivalent radial representation. The area inside the resulting plot is $A_i = \frac{\pi R^2}{C-1} \sum_{j \neq i} \epsilon_{ij} = \frac{\pi R^2}{C-1} FNR_i$, where R is the radius of the outer circle. It can be seen that this area is proportional to the miss rate.

D. CONFUSION STAR
In the radial (Fig. 19) and sectorial (Fig. 20) plots discussed in the previous subsection, the representation of a single row of the error matrix has been addressed. To extend this visualization to the whole matrix, one plot for each actual class can be drawn, using different colors to distinguish them. The resulting graphic is depicted in Fig. 21. Reading this plot is not an easy task, as the C lines are overlapped. An alternative to improve its readability is to divide the circle into C regions, each one corresponding to an actual class (a row of the error matrix). Then, each region is again divided into C − 1 sectors, one for each column once the redundant $e_{ii}$ element is removed.
If the C regions have the same size, a balanced representation is obtained, where the angular separation between two radii is $\Delta\varphi = 2\pi/(C(C-1))$.
The star-like result so obtained is depicted in Fig. 22. This shape justifies naming this representation the confusion star. It must be noted that the gray labels in the outermost circle do not indicate the estimated classes but the indices to these classes once the redundant elements $e_{ii}$ have been removed. In [39] a similar although simpler polygonal solution is proposed under the name of cobweb.

E. CONFUSION GEAR
The confusion star has been defined based on the error matrix. An alternative choice is to use the classification hits instead of the errors. So, the classification results of the instances belonging to the i-th class, summarized in the i-th row of the confusion matrix $CM_i$, are now transformed into the vector $HM_i \equiv [w_{i1}, w_{i2}, \ldots, w_{iC}]$, whose elements are defined as $w_{ii} = m_{ii}$ and $w_{ij} = m_i - m_{ij}, \forall j \neq i$. For a perfect classification, $w_{ij} = m_i, \forall j$. The matrix $HM = \{w_{ij}\}$ is called the hit matrix of the classifier.
The i-th row of this matrix can also be formulated in terms of the ratio over the total number of instances belonging to the i-th class, $HM_i = [\omega_{i1} m_i, \omega_{i2} m_i, \ldots, \omega_{iC} m_i]$, where $\omega_{ij} \equiv w_{ij}/m_i$. To represent this matrix, a procedure similar to that used for the error matrix is followed: the circle is divided into C regions (one per class), and then each region is again divided into C − 1 sectors, one for each column once the redundant $w_{ii}$ element is removed. If the C regions have the same size, a balanced representation of the hit matrix is obtained, as in Fig. 23. The resemblance of this graph to a gear is used to refer to it as the confusion gear.
Recalling that the row values in the hit matrix are $w_{ij} = m_i - m_{ij}, \forall j \neq i$, the sum of these values is $\sum_{j \neq i} w_{ij} = (C-1) m_i - \sum_{j \neq i} m_{ij} = (C-2) m_i + m_{ii}$. As $\omega_{ij} = w_{ij}/m_i$, this equation can be rewritten as $\sum_{j \neq i} \omega_{ij} = (C-2) + m_{ii}/m_i$. Recalling that the term $m_{ii}/m_i$ is the True Positive Rate of the i-th class ($TPR_i$), it can finally be expressed as $\sum_{j \neq i} \omega_{ij} = C - 2 + TPR_i$.
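The hit matrix and the identity just derived can be checked numerically on a made-up 3-class confusion matrix:

```python
import numpy as np

def hit_matrix(cm):
    """w_ii = m_ii (hits); w_ij = m_i - m_ij for j != i."""
    hm = cm.sum(axis=1, keepdims=True) - cm
    np.fill_diagonal(hm, np.diag(cm))
    return hm

cm = np.array([[8, 1, 1],
               [0, 9, 1],
               [2, 0, 8]])
hm = hit_matrix(cm)

# Identity: sum over j != i of w_ij / m_i equals C - 2 + TPR_i.
C, i = cm.shape[0], 0
m_i = cm[i, :].sum()
lhs = (hm[i, :].sum() - hm[i, i]) / m_i
rhs = C - 2 + cm[i, i] / m_i
```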

A. IMBALANCED CONFUSION STAR AND GEAR
To obtain the balanced confusion star (Fig. 22) and gear (Fig. 23), the circle was divided into C equal-sized regions.
A different, imbalanced approach is also possible, using regions whose sizes are proportional to the number of instances belonging to each class. The region corresponding to the i-th class spans an angle of $\varphi_i = 2\pi \, m_i/n$, and the angular separation between two radii within that region is $\Delta\varphi_i = \varphi_i/(C-1) = 2\pi \, m_i/(n(C-1))$.
As the classes in the MNIST dataset are only slightly imbalanced, the reduced Abalone dataset is used in this case. The balanced confusion star is depicted in Fig. 24, while the corresponding imbalanced version is shown in Fig. 25. In the imbalanced star it can be noted, for example, that the region corresponding to class 9 (with 138 instances) is remarkably wider than that corresponding to class 4 (with 11 instances).

B. AREAS OF THE CONFUSION STAR AND GEAR
Intuitively, the area enclosed by the confusion star is a metric of the classifier's performance: the larger the area, the worse the classifier. The opposite holds for the confusion gear: the larger the area, the better the classifier. Analyzing these areas can therefore be useful, as they may serve as alternative classification metrics.
Let us consider a balanced confusion star, whose enclosed area is the sum of the areas of its sectors, $A_{ij}$. Recalling (25) and considering (20), the ratio of this area to the total area of the circle, called the Internal Area Ratio (IAR), turns out to be proportional to the multiclass miss rate (FNR). For binary classification (C = 2), IAR = FNR.
Focusing on the area outside the confusion star, an analogous External Area Ratio (EAR) can be defined; recalling that FNR = 1 − TPR, it can be expressed in terms of the multiclass True Positive Rate.

Considering now the imbalanced confusion star, the enclosed area can be obtained recalling (33). Since $e_{ij} = m_i\,\varepsilon_{ij}$, this expression can be rewritten and, recalling (18), expressed in terms of the classification results. Two of the most common classification performance metrics are the accuracy, defined as
$$ACC = \frac{1}{m}\sum_{i=1}^{C} m_{ii},$$
where m is the total number of instances, and the error rate, ER ≡ 1 − ACC. Substituting these expressions in (47), the Internal Area Ratio (IAR) turns out to be proportional to the multiclass error rate (ER). For binary classification (C = 2), IAR = ER.

Regarding now the confusion gear, similar expressions can be derived for its internal and external areas. All these results are summarized in TABLE 3. From them it can be seen that the areas of the confusion star and gear are directly related to classical performance metrics. For example, the imbalanced confusion gear has an internal area linearly proportional to the accuracy (ACC), while its external area is linearly proportional to the error rate. These areas can thus be considered a visual representation of the classification performance.
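The classical metrics to which these areas are related can be computed directly from the confusion matrix. A minimal sketch (the function name and the toy matrix are ours):

```python
def star_gear_metrics(cm):
    """Classical metrics the star and gear areas are related to.

    cm is a confusion matrix with rows as actual classes and columns as
    estimated classes.  Returns the accuracy (ACC), the error rate
    (ER = 1 - ACC), and the per-class True Positive / False Negative Rates.
    """
    m = sum(sum(row) for row in cm)                        # total instances
    acc = sum(cm[i][i] for i in range(len(cm))) / m        # diagonal over total
    tpr = [row[i] / sum(row) for i, row in enumerate(cm)]  # TPR_i = m_ii / m_i
    fnr = [1 - t for t in tpr]                             # FNR_i = 1 - TPR_i
    return acc, 1 - acc, tpr, fnr

acc, er, tpr, fnr = star_gear_metrics([[8, 1, 1], [0, 9, 1], [2, 0, 8]])
```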
However, the relation between the areas and the classical metrics has to be carefully considered. While this relation is linear for a given dataset (with a constant number of classes C), it becomes nonlinear if the classification performance is analyzed across different datasets. The relationship between the IAR and the ACC for the imbalanced confusion gear is depicted in Fig. 26, both for a single dataset (left) and for different datasets (right).

C. LOGARITHMIC CONFUSION STAR
Neither the balanced (Fig. 22) nor the imbalanced (Fig. 24) confusion star properly visualizes the values of the error matrix when they are very small. To overcome this problem, and to accommodate very different error values in a single graphic, the lengths of the radii are made proportional to the logarithm of the errors. The result obtained using this procedure is depicted in Fig. 27.
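The logarithmic mapping can be sketched as follows; the floor value matches the 0.01 used in the graphic, but the function name and normalization to a unit radius are our own choices:

```python
import math

def log_radius(error_ratio, floor=0.01):
    """Map an error ratio in (0, 1] to a radial length in [0, 1] on a
    logarithmic scale.

    The centre of the circle represents the (arbitrary) floor value
    rather than a null error: ratios at or below the floor collapse to
    the centre, and a ratio of 1 reaches the rim.
    """
    clipped = max(error_ratio, floor)
    return math.log10(clipped / floor) / math.log10(1 / floor)

# With the default floor of 0.01: 1.0 -> rim, 0.1 -> halfway, <= 0.01 -> centre
radii = [log_radius(e) for e in (1.0, 0.1, 0.01, 0.001)]
```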
In this graphic the center of the circle does not correspond to a null error but to an arbitrarily chosen small value (0.01 in the graphic).
In general, hit matrices do not have very small values (usually greater than 50%), so the use of the logarithmic scale is not required.

D. CONFUSION STARS FOR MANY CLASSES
As the number of classes increases, any graphical representation of the confusion matrix becomes less clear. For instance, the colored grid corresponding to the classification of the CIFAR-100 dataset is shown in Fig. 28. In that graphic it is very difficult to identify the classes for which the classifier is underperforming and should be improved. If the classifier performance is visualized using the confusion star instead, the result is that depicted in Fig. 29. In this plot it is easier to see that, for example, the classifier has difficulty correctly identifying instances of classes 47 and 52. Therefore, although the confusion star becomes less clear as the number of classes increases, it remains a better representation than the classical colored grid.

E. SEQUENCE OF CONFUSION STARS
Following the evolution of a certain feature or metric is a common task in science and engineering [40]. In the field of classification algorithms there are applications where it is convenient to visualize the performance not of a single classifier but of a sequence of classifiers, comparing their results as a function of a certain parameter or hyperparameter. Some tools have indeed been proposed to visualize the evolution of the classification process, either at the instance level [36] or at the confusion matrix level [41]. Confusion stars and gears can also be used for this purpose. To show how, let us consider again the MNIST dataset and the same neural network classifier with a single hidden layer and a sigmoid activation function. For this analysis, the number of neurons in the hidden layer is increased from 8 to 128 and the number of training epochs from 5 to 100. The objective of these changes is to obtain a wider range of classification performances.
To determine the impact of the number of training instances on the classification performance, a variable number of training instances is used, observing the accuracy of the classification in each case. The result, usually known as the learning curve, is depicted in Fig. 30.
This representation properly summarizes the performance of a classifier in a single metric, the accuracy in this example. However, it is possible to exploit the descriptive power of the confusion star for a better and more detailed insight into the evolution of the classification performance. Indeed, each dot in the learning curve has a corresponding confusion matrix that can be visualized as a confusion star.
Let us consider, for example, the significant increase in accuracy occurring around 500 training instances. While the learning curve does not detail what this improvement is due to, or how it is distributed among the classes, an analysis of the confusion stars before and after the jump can shed more light on the question. In Fig. 31, the confusion stars corresponding to a point with 502 samples (before the jump, accuracy of 38%) and to another point with 610 samples (after the jump, accuracy of 67%) are shown. Quite important improvements (smaller errors) can be observed in, for example, the 0s classified as 2s, the 2s classified as 1s, and so on. In other words, the representation of the confusion matrix informs us not only of the overall improvement of the classifier, but also of how this improvement is distributed.
A similar representation can also be obtained using the confusion gear. The application of the confusion stars to compare two points of the learning curve can be extended to a sequence of points, drawing a grid of stars as shown in Fig. 32. In that graphic, which resembles the concept of small multiples [42], it can be seen that, for example, the problems classifying instances of classes 4, 8 and 9, shown by the classifiers trained with up to 1000 training instances, are mostly solved once the 3000-instance barrier is overcome. From this point on, a smooth and continuous improvement of the classification results is obtained. The same information can be obtained by analyzing the corresponding confusion gears.
Representing a sequence of confusion matrices by a grid of stars has an obvious space limitation: the more matrices to be represented, the smaller each star becomes. To tackle this problem, the sequence of confusion matrices can be represented by generating a movie in which each frame corresponds to a single confusion matrix. An example of this video can be seen in the online version of the paper (see also the appendix). In Fig. 33, an example frame of the movie is shown.

F. SUMMARY OF VISUALIZATION METHODS
Throughout the paper, up to 13 methods for visualizing classification performance have been described. Some of them focus on the classification scores of single instances, while others are concerned with how classifiers behave for the instances of certain classes. Likewise, some visualization methods are designed primarily for two classes (binary classification), while others can represent multiple classes.
Visualization methods can also be characterized by how they represent the different classes (actual or estimated) and the classification performance. Some of them use color to convey the required information, while others use geometric elements for this purpose: X and/or Y position in rectangular plots, radial and/or angular position in polar plots, the length and/or width of graphical elements, etc.
A summary of the visualization methods described in the paper is shown in TABLE 4.

VI. CONCLUSION
This paper has reviewed several methods to visualize classification results at different levels of detail: from those centered on how a particular instance or set of instances are classified, to those that summarize the classification performance in a single metric.
Particular attention has been devoted to classification results summarized in the form of a confusion matrix, presenting the main procedures to visualize it: from the straightforward row-column matrix representation, with colors indicating the value of each matrix cell, to more complex and sophisticated graphics.
From this analysis, a new way of representing the information conveyed by confusion matrices is proposed in the form of a confusion star (focusing on the errors) or a confusion gear (centered on the hits). The new visualization tool can be employed to represent the original and possibly imbalanced confusion matrix, or the balanced unit version of that matrix.
The new tool successfully represents multiclass classification results in the form of a radial plot. The traditional way to represent a confusion matrix uses colors (and possibly text) to indicate the number of instances belonging to an actual class that are classified into an estimated class. Instead, confusion stars and gears use shapes to convey that information. Replacing colors with shapes significantly improves the readability of the proposed graphics.
An additional property of the confusion stars and gears is that the enclosed area provides information about the overall classification performance. The relation of these areas to standard classification metrics has also been derived.
Finally, it has also been shown that the new graphic tools can usefully be employed to visualize the performances of a sequence of classifiers.

APPENDIX
Supplementary materials can be found in the online version of the paper and can also be downloaded from https://github.com/amalialuque/confusionstar. They contain: 1) three Excel files with the confusion matrices described in Section II.A; 2) an Excel file with the sequence of confusion matrices described in Section V.E; 3) a video file (in Graphics Interchange Format, GIF) visualizing the learning process described in Section V.E; and 4) a Jupyter notebook providing an implementation of the functions required to plot a confusion matrix as a confusion star (or confusion gear), and to generate a video file visualizing a sequence of confusion matrices in the form of confusion stars (or gears).
Additionally, the algorithm that converts a confusion matrix into a confusion star plot can be found as supplementary material to the paper.