Decision Making in Evolutionary Multiobjective Clustering: A Machine Learning Challenge

Evolutionary multiobjective algorithms have become a popular choice to tackle the clustering problem. On the one hand, the simultaneous optimization of complementary clustering criteria offers an increased robustness to changes in data characteristics. On the other hand, the evolutionary search is able to approximate the Pareto optimal front and deliver a set of trade-offs between these criteria in a single algorithm execution. Decision making is the concluding stage of the pipeline, having as its goal the selection of a single, final solution from the set of candidate trade-offs produced. This is a complex task for which a definitive answer does not seem to be available, as the underlying assumptions of existing techniques may not hold for all applications. In this paper, we investigate an alternative approach to address this challenge: posing it as a learning problem. The key idea is to build a model that, given a proper characterization of solutions and their context (defined by the full approximation solution set and the specific clustering task at hand), is able to estimate quality and facilitate the identification of the best choice. To evaluate the suitability of this approach, we conduct a series of experiments over diverse synthetic and real-world datasets, including comparisons against a range of representative decision-making strategies from the literature. Our proposal exhibits greater flexibility in dealing with problems of varying characteristics, consistently outperforming the reference methods considered. This study demonstrates that it is possible to learn from the decision-making process in example settings and generalize the acquired knowledge to new scenarios.


I. INTRODUCTION
Clustering is a fundamental, unsupervised data analysis and machine learning task. Its goal is to determine the intrinsic organization of a set of elements into groups, such that this partition reflects the similarities and differences between them. Evolutionary multiobjective clustering (EMC) involves formulating this task as a multiobjective problem and adopting evolutionary algorithms as the optimization engine [1], [2], [3]. By exploiting multiple clustering criteria simultaneously, EMC methods are able to assess partition quality more comprehensively, which translates into an increased effectiveness and the ability to handle problems with a wider range of features. However, there is unlikely to be a single best solution for the resulting multiobjective formulation; due to the complementary but conflicting nature of the optimization criteria chosen, EMC methods generally produce a set of trade-offs between these criteria as output (this is illustrated in Figure 1 and further explained in Section II-B) [4]. Given that all the obtained trade-offs are considered equally good, i.e., they are all nondominated in the Pareto-optimality sense [5], how can one of them be selected and delivered as the final solution? The decision-making process is concerned with this particular question, being the last step of the EMC pipeline. Identifying a single, promising solution may represent the ultimate goal in practice; this highlights the relevance of decision making, which is the specific focus of this study.
The associate editor coordinating the review of this manuscript and approving it for publication was Amir Masoud Rahmani.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The complexity of decision making has hindered the development of a definitive approach to carry out this process.
Some of the existing methods employ an additional clustering criterion to induce an ordering (break ties) over the set of nondominated solution alternatives [6], [7], [8], [9]; this assumes compatibility between the criterion chosen and the clustering task at hand, being inconsistent with the motivations behind the adoption of a multiobjective problem formulation. Other methods rely on geometric considerations, analyzing the relative location of solutions in objective space [10], [11], [12], [13]; although such an approach is widely used in multiobjective optimization [14], we show later in this paper that it can lead to poor decisions in the specific context of EMC. Finally, there are also methods that construct a consensus solution from the set of nondominated candidates available [15], [16], [17], [18]; this approach assumes that all candidates are equally important, but the inclusion of low-quality partitions (despite being nondominated) can negatively affect the outcome of this process. Representative examples of the above three categories of decision-making methods and a detailed discussion of their limitations are provided in Section II-C.
Acknowledging the importance of decision making and its complexity, and motivated by the limitations of existing techniques, we explore a novel approach to tackle this challenge. Specifically, we frame decision making as a supervised learning problem. Our approach relies on the construction of a predictive model that is able to associate characteristics of individual solutions and their context with a measure of partition quality (as explained in Section III, by context we refer to the full set of competing nondominated solutions as well as to the particular clustering problem they try to solve). In this way, the derived model can be used to estimate the quality of candidate partitions and guide the decision-making process. We investigate the suitability of this approach in terms of its ability to automate the selection of high-quality partitions from the nondominated solution sets produced by a state-of-the-art EMC algorithm. To this end, our proposal is compared against different baselines and a set of reference approaches that are representative of the three categories of existing methods discussed above. Our experimental analysis is conducted over a diverse collection of both synthetic and real-world data clustering problems.
The remainder of this paper is structured as follows. Section II introduces the necessary background and reviews the related literature. Then, Section III describes our proposed approach to decision making in detail. The experimental setup is described in Section IV. Section V discusses our results and main findings. Finally, Section VI concludes this study and highlights potential directions for future research.

II. BACKGROUND AND RELATED WORK
Below we introduce background concepts and review the relevant literature. First, Section II-A presents a formal definition of clustering and states it as an optimization problem. Then, the formulation of clustering as a multiobjective problem is discussed in Section II-B. Finally, Section II-C covers the central topic of this paper: decision making in EMC.

A. DATA CLUSTERING
Clustering is the task of finding the best way to partition a collection of samples into two or more disjoint subsets. Because of its unsupervised nature, this task is mostly based on the analysis of the similarities between the samples, and it heavily relies on a mechanism that supports the effective assessment of partition quality. Such a mechanism, known as clustering criterion or cluster validity index [19], allows us to frame the clustering task as an optimization problem and use a range of techniques to search for the best possible partition.
Let $X = \{x_1, \ldots, x_N\}$ be a set of $N$ samples, $f : \Omega \to \mathbb{R}$ be a clustering criterion, and $\Omega = \{\{c_1, \ldots, c_k\} \mid c_i \in \mathcal{P}(X) \setminus \{\emptyset\} \text{ and } X = \bigcup_{i} c_i, \text{ for } i = 1, \ldots, k\}$ be the set of all possible partitions of $X$. Clustering can be stated, without loss of generality, as the following optimization problem:

$$\min_{C} \; f(C) \quad \text{subject to} \quad C \in \hat{\Omega},$$

where $C = \{c_1, \ldots, c_k\}$ is a partition of $X$ into $k$ subsets, called clusters, and the constraint that $C$ belongs to the feasible space $\hat{\Omega} \subset \Omega$ implies that it must be a proper partition; that is, it must hold that $X = \bigcup_{i} c_i$, $c_i \neq \emptyset$, and $c_i \cap c_j = \emptyset$, for $i, j = 1, \ldots, k$ and $i \neq j$. Although many clustering methods require the specific value of $k$ to be known in advance, this information is frequently not available in practice. The task of partitioning $X$ without any prior knowledge of the correct value of this parameter is commonly referred to as automatic clustering in the literature [20].
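As an illustration of the definitions above, the following sketch checks whether a candidate set of clusters is a proper partition and evaluates one possible criterion on it. The function names, the representation of samples by their indices, and the choice of mean squared deviation from the cluster centroid as the criterion f (to be minimized) are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations


def is_proper_partition(clusters, n_samples):
    """Check that `clusters` (lists of sample indices) is a proper partition
    of the indices 0..n_samples-1: no empty cluster, no overlap, full cover."""
    sets = [set(c) for c in clusters]
    if not sets or any(len(s) == 0 for s in sets):
        return False
    if any(a & b for a, b in combinations(sets, 2)):
        return False
    return set().union(*sets) == set(range(n_samples))


def within_cluster_variance(clusters, data):
    """Illustrative criterion f (assumed, to be minimized): mean squared
    Euclidean distance of every sample to the centroid of its cluster."""
    total = 0.0
    count = 0
    for cluster in clusters:
        points = [data[i] for i in cluster]
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
        )
        count += len(points)
    return total / count


# Hypothetical toy data: two well-separated pairs of points.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [[0, 1], [2, 3]]
poor = [[0, 2], [1, 3]]
```

Under this criterion, the partition `good` scores lower (better) than `poor`, which is the kind of ordering a search procedure would exploit.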

B. MULTIOBJECTIVE CLUSTERING
From the above definition, the critical role of the clustering criterion f is evident, as we aim to find a partition that is optimal according to it. Thus, f needs to correctly evaluate the properties that determine a good partition, being responsible for guiding the search process towards high-quality solutions. Many criteria have been proposed so far [21], [22], each presenting a specific formulation to evaluate (either one or a combination of) properties such as intra-cluster homogeneity (compactness), connectedness, and inter-cluster separation. The diversity of existing clustering criteria highlights the lack of consensus on how to assess partition quality, and the fact that it is unlikely that a single solution can simultaneously satisfy all the desirable but (usually) conflicting properties [23]. The effectiveness of a clustering algorithm strongly depends on whether the underlying assumptions of the specific criterion adopted hold for the particular characteristics of the problem, which is in line with the No-Free-Lunch theorems [24], [25]. Acknowledging these facts, however, allows us to realize that reasonable trade-offs may be possible through the simultaneous consideration of multiple criteria, an approach which may deal more effectively with a wider range of characteristics in the problem domain.
FIGURE 1. The clustering phase of EMC produces a Pareto front approximation (PFA). This PFA was obtained using algorithm Δ-MOCK [26], which simultaneously optimizes the connectivity and intra-cluster variance criteria (to be minimized). The PFA obtained includes trade-off partitions of varying quality and numbers of clusters (k). The decision-making phase is concerned with identifying a member from this set as the final solution.
That is, rather than focusing on a single criterion, clustering can be stated as a multiobjective optimization problem:

$$\min_{C} \; \mathbf{f}(C) \quad \text{subject to} \quad C \in \hat{\Omega},$$

where $\mathbf{f}(C) = [f_1(C), \ldots, f_m(C)]^T$ and $f_i : \Omega \to \mathbb{R}$ is the $i$-th criterion to be optimized. Due to the conflict that may exist between the $m$ clustering criteria, we are now interested in identifying the set of the best possible trade-off solutions [5]: $P^* = \{C^* \in \hat{\Omega} \mid \nexists\, C \in \hat{\Omega} : C \prec C^*\}$.¹ $P^*$ is referred to as the Pareto-optimal set, and the image of $P^*$ in the objective function space is the so-called Pareto front.
The simultaneous optimization of multiple, complementary criteria commonly results in the identification of promising, high-quality partitions which may be unattainable through the optimization of a single criterion. Moreover, set $P^*$ may include solutions with a range of potential k values if the criteria optimized present opposing biases regarding this parameter, which can be particularly useful in an automatic clustering setting. Figure 1 helps to illustrate the above behaviors in the context of two specific clustering criteria: connectivity and intra-cluster variance [26]. Whereas the optimization of the former tends to decrease k, the optimization of the latter tends to increase it; consequently, simultaneously optimizing these criteria produces a set of trade-off partitions that can vary greatly both in characteristics and k values.
¹ Symbol $\prec$ refers to the Pareto-dominance relation. Solution $C$ is said to dominate solution $C'$ (which is denoted by $C \prec C'$) if and only if $\forall i : f_i(C) \leq f_i(C')$ and $\exists j : f_j(C) < f_j(C')$, with $i, j = 1, \ldots, m$. All solutions in set $P^*$ are said to be nondominated with respect to each other.
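The Pareto-dominance relation and the notion of a nondominated set can be made concrete with a short sketch. It assumes that all criteria are to be minimized and that each solution is represented only by its vector of objective values; the function names and the toy values are hypothetical.

```python
def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` under minimization:
    `a` is no worse in every objective and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points):
    """Return the members of `points` that no other member dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors (both objectives minimized).
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = nondominated(candidates)
```

Here `(3, 4)` and `(5, 5)` are discarded because `(2, 3)` is at least as good in both objectives and strictly better in one; the remaining three vectors are mutually incomparable trade-offs.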
Although the intrinsic multiobjective nature of clustering has been acknowledged since the early work of Delattre and Hansen [27], it was not until the application of metaheuristics to this multiobjective task that the topic started to attract increasing attention [1], [2], [3], [4], [10], [28]. In particular, population-based metaheuristics such as multiobjective evolutionary algorithms offer a significant advantage: they are able to construct a Pareto front approximation (PFA) in a single execution. Such is the case of the recently reported algorithm Δ-MOCK [26], which produced the PFA of Figure 1. Δ-MOCK is a newer version of MOCK [10], one of the most representative EMC algorithms from the literature.

C. DECISION MAKING
EMC can be seen as a two-phase process. First, the clustering phase is concerned with the generation of a PFA of candidate partitions. Then, the decision-making phase focuses on the selection of one of the PFA members as the final solution. Whilst obtaining the full PFA can be useful in some cases, obtaining a single solution more likely corresponds to the ultimate goal in practice. The relevance of decision making is even more evident in automatic clustering, where the PFA may offer a diversity of choices with respect to the value of k, as shown in Figure 1; the selection of a final solution is the step that actually materializes a decision on this parameter.
Decision making in EMC has been accomplished using strategies that fall into three broad categories, as discussed in Section I. These categories and representative works for each of them are separately described below, followed by a discussion of the limitations and areas of opportunity motivating our new proposal introduced later in Section III.

1) DECISION MAKING BASED ON ADDITIONAL CLUSTERING CRITERIA
Perhaps the most widely adopted approach to select a final solution in EMC corresponds to the use of an additional criterion to discriminate between the (otherwise incomparable) nondominated partitions in the PFA. In algorithm MOGAC (multiobjective genetic algorithm for clustering) [28], for example, decision making relies on the use of index I [29]. In a closely related work [6], however, the authors replace index I, adopting the silhouette index instead [30]. Other examples of the use of the silhouette index for decision-making purposes include algorithm MOVGA (multiobjective variable string length genetic fuzzy clustering) [31] and algorithm MVMC (multi-view multiobjective clustering) [7].
Algorithm MOKGA (multiobjective k-means genetic algorithm) considers both indices Davies-Bouldin [32] and SD [33] at the decision-making phase [34]. In a separate study [35], the authors use six different criteria for decision making: the silhouette, C [36], Dunn [37], Davies-Bouldin, SD, and S_Dbw [38] indices. In the work of Garcia-Piquer et al. [8], the silhouette, Dunn, Davies-Bouldin, and Calinski-Harabasz [39] indices are all explored as alternatives to guide decision making, within the context of algorithm CAOS (clustering algorithm based on multiobjective strategies) [40]. In particular, it is shown that the effectiveness of these indices in identifying a final solution can be improved by filtering solutions at the extreme regions of the PFA (hence, this proposal relates also to the strategies discussed in Section II-C2, considering geometric aspects of the PFA).
Besides the use of additional clustering criteria in an individual manner, decision making has also been assisted by index combinations. In a recent study [9], Zhu et al. report that a linear combination of the Calinski-Harabasz, Davies-Bouldin, and silhouette indices (denoted CH+DB+SIL) in most cases outperforms the use of several individual approaches. Their analysis initially centers on the application of decision making to PFAs generated by algorithm Δ-MOCK [26], but combination CH+DB+SIL is later explored in the context of the consensus-based decision-making strategy of algorithm MOAC (multi-objective automatic clustering) [9], which is further discussed in Section II-C3.

2) DECISION MAKING BASED ON THE SHAPE OF THE PARETO FRONT
In the general area of multiobjective optimization, selecting a final solution based on geometric considerations is a popular strategy, one that has carried over to the specific domain of EMC. The prominent regions of the Pareto front, commonly referred to as knees [14], [41], are considered particularly relevant. These regions tend to offer the most interesting trade-offs between the optimization criteria, where a minor improvement in one dimension causes a more significant deterioration in another. In the absence of any explicit preference information, knees are usually assumed to correspond to the likely choices of a decision maker. Thus, some of the strategies that have been proposed for decision making in EMC are based on identifying knee regions in the PFA and selecting promising trade-off partitions from such regions.
A representative example of the methods in this category is algorithm MOCK [10]. To locate a knee in the PFA, this method computes a number of control fronts by applying its clustering strategy (i.e., the same strategy used initially to obtain the PFA) to randomly generated data. Then, a potential knee is identified and selected as the final result based on the distance of candidate solutions with respect to such control fronts. Alternative schemes have also been proposed with the aim of lowering the computational complexity of MOCK's strategy. One of them relies on a simple heuristic: the solution minimizing the sum of (normalized) objective values is always selected [12]. Another proposal reduces the cardinality of the PFA, removing solutions not complying with the assumptions of convexity; then, knee-like solutions are identified by analyzing the adjacent angles of PFA members with respect to their immediate neighbors [11].
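The simple heuristic mentioned above (selecting the solution that minimizes the sum of normalized objective values) can be sketched as follows. The min-max normalization over the PFA is an assumption, since the text does not specify how objectives are normalized in [12]; the function name and the toy front are hypothetical, and all objectives are assumed to be minimized.

```python
def select_by_normalized_sum(front):
    """Return the index of the PFA member with the smallest sum of
    min-max normalized objective values (all objectives minimized)."""
    m = len(front[0])
    lows = [min(p[i] for p in front) for i in range(m)]
    highs = [max(p[i] for p in front) for i in range(m)]

    def normalized_sum(p):
        total = 0.0
        for i in range(m):
            span = highs[i] - lows[i]
            # A constant objective carries no information; count it as zero.
            total += (p[i] - lows[i]) / span if span > 0 else 0.0
        return total

    return min(range(len(front)), key=lambda j: normalized_sum(front[j]))


# Hypothetical two-objective PFA.
pfa = [(0.0, 10.0), (2.0, 3.0), (5.0, 2.0), (10.0, 0.0)]
chosen = select_by_normalized_sum(pfa)
```

For this toy front the normalized sums are 1.0, 0.5, 0.7, and 1.0, so the second member is selected. Note that the rule always favors the central region of the front, regardless of which criterion is more informative for the data.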
Recently, two multiobjective fuzzy clustering methods were proposed [13]: ECM-NSGA-II (entropy c-means-nondominated sorting genetic algorithm II) and ECM-MOEA/D (entropy c-means-multiobjective evolutionary algorithm based on decomposition). These algorithms apply certain rules regarding the location of solutions with respect to a reference line connecting the extreme points of the PFA. Starting at the extreme point where cluster compactness (first objective) reaches its lowest value, the position of the next points along the PFA, relative to the reference line, determines the solution to be selected. If the points fall above the line, the decision is simply to keep the extreme point. Otherwise, when the points fall below the reference line, the one maximizing the distance with respect to this line is identified as the knee and delivered as the final partition.
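A simplified reading of the reference-line rule can be sketched as below. This is not the exact procedure of [13]: it assumes two objectives to be minimized, treats "below the line" as the side facing the ideal point, and collapses the sequential inspection of points into a single pass. The function name and toy fronts are hypothetical.

```python
import math


def select_knee_by_reference_line(front):
    """Return the index of the selected member of a two-objective PFA.

    The reference line joins the two extreme points. Among intermediate
    points lying below the line, the farthest one is taken as the knee;
    if none lies below, the extreme with the lowest first objective is kept.
    """
    order = sorted(range(len(front)), key=lambda j: front[j][0])
    a, b = front[order[0]], front[order[-1]]
    length = math.hypot(b[0] - a[0], b[1] - a[1])
    if length == 0:
        return order[0]

    best, best_depth = order[0], 0.0
    for j in order[1:-1]:
        p = front[j]
        # Signed area test: negative cross product means p is below the line.
        cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
        depth = -cross / length  # positive when p is below the line
        if depth > best_depth:
            best, best_depth = j, depth
    return best


convex_front = [(0.0, 10.0), (2.0, 3.0), (5.0, 2.0), (10.0, 0.0)]
concave_front = [(0.0, 10.0), (6.0, 8.0), (10.0, 0.0)]
```

On `convex_front` the point `(2.0, 3.0)` is the farthest below the line and is returned as the knee; on `concave_front` the only intermediate point lies above the line, so the extreme point at index 0 is kept.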

3) DECISION MAKING BASED ON ENSEMBLE CLUSTERING
In ensemble clustering [42], the goal is to derive a new, consensus partition by integrating the information contained in a collection of base partitions. This concept has been adopted by several EMC methods, where the solutions in the PFA are taken as the ensemble members and a consensus is constructed on the basis of them. The motivation behind this approach is that every nondominated solution contains useful information on the cluster structure of the data, and the incorporation of ensemble techniques thus provides a means to exploit all this information in delivering a final answer.
Consensus-based approaches have also been combined with the use of a classifier [17], [18]. The resulting technique initially produces a consensus partition by means of a voting strategy, which assigns samples to clusters whenever the majority of the PFA members agree on the assignment. This is likely to result in a partial clustering, as some samples can be left unassigned due to the lack of agreement across voters. Hence, the complementary step exploits this partial consensus solution for the purposes of training a classifier, which is later employed to determine the cluster membership of the remaining (originally unassigned) samples. It is unclear, however, whether the above approach is applicable when the PFA involves solutions with different numbers of clusters.
Decision making in algorithm MOAC relies on both the generation of consensus solutions and the use of additional criteria [9]. First, the cardinality of the PFA is reduced based on an indicator which takes into account the quality and diversity of solutions. Then, a new set of consensus partitions with a range of numbers of clusters is generated, from which a final solution is chosen by means of index combination CH+DB+SIL (refer to Section II-C1). To generate the consensus partitions, two recently proposed ensemble techniques are adopted [44]: LWEA (locally weighted evidence accumulation) and LWGP (locally weighted graph partitioning).

4) LIMITATIONS OF CURRENT DECISION-MAKING APPROACHES
Despite their simplicity, methods that use an additional clustering criterion present an inherent limitation: they implicitly assume that the criterion chosen will be compatible with the properties of the data. This opposes the motivations for the use of a multiobjective formulation of clustering. It is unlikely that a single criterion can properly capture all aspects of a partition and induce an effective discrimination between candidate solutions in every scenario. Each criterion introduces a specific bias and its effectiveness will depend on the characteristics of the particular problem at hand (usually unknown in advance). To a certain extent, this may be offset by the use of more elaborate criteria, attempting to evaluate multiple aspects simultaneously, or by the use of criteria combinations (e.g., CH+DB+SIL [9]). However, finding the right weighting between various aspects or criteria may not be straightforward; it may certainly be problem-dependent.
Methods in the second category are supported by common assumptions in multiobjective optimization regarding the regions of the Pareto front which should be prioritized (in the absence of explicit decision-maker's preferences). Whilst favoring knee regions would probably be a wise choice in the general case [14], in the specific setting of EMC the best trade-off in the PFA may not always correspond to the correct answer. It has been argued, for example, that the cluster structure of the data is reflected in the shape of the Pareto front [10]. Evidently, this depends on the particular optimization criteria used but, more importantly, on how compatible these criteria are with the target data. If all criteria are compatible and contribute equally to solving the problem, then we would expect that the best trade-offs between these criteria would provide us with the most reasonable answers.
Otherwise, if one of the criteria is much more helpful than the others, we would expect better choices to be located closer to the corresponding extreme region of the Pareto front. These behaviors are clearly illustrated in Figure 2.
Finally, consensus-based approaches do not necessarily select one of the solutions in the PFA; instead, they construct a new solution from them. These strategies operate under the premise that all partitions in the PFA contain useful information that can be integrated and exploited to construct a higher-quality final solution. This assumes that PFA members are all equally reliable, which implies that the optimized clustering criteria are all compatible with the characteristics of the data. This may not be the case, as discussed before. When one of the objectives contributes significantly more than others in solving the problem, and the best choices are therefore located at one of the extreme regions of the PFA (as seen in Figure 2), including the information of solutions from other regions of the PFA may be equivalent to introducing noise and can negatively affect the generated consensus.

III. DECISION MAKING BASED ON MACHINE LEARNING
The above discussion highlights the challenging nature of decision making, and the fact that current techniques operate under assumptions that do not necessarily apply given the peculiarities of EMC. This stresses the need for alternative, more effective and robust strategies to accomplish this task.
This section introduces a novel approach as our attempt to meet this need: machine learning-based decision making (MLDM). Our MLDM strategy treats decision making as a supervised learning problem. The goal is to learn by example, i.e., to learn from the decision-making process in example settings, and build a model which can capture the available knowledge for subsequent exploitation in unknown settings. MLDM consists of two main stages: the learning stage and the decision-making stage. These stages are separately described in Sections III-A and III-B. Then, Sections III-C and III-D respectively discuss some design choices adopted in this study and the characterization process of PFAs, which is an essential component of the proposed methodology.

A. LEARNING STAGE: MODEL CONSTRUCTION
The learning stage of strategy MLDM, depicted in Figure 3, is responsible for the construction of a regression model. This model is used later, at the decision-making stage (see Section III-B), to predict the quality of the solutions in a given PFA and enable the identification of the best alternative.
This stage starts with the formation of a repository of PFAs for the purposes of training the predictive model. These PFAs are produced through multiple independent executions of a chosen EMC algorithm (process A in Figure 3), over a collection of sample clustering problems. Each training PFA obtained consists of a set of nondominated partitions that the EMC method offers as potential solutions to the given problem. By sample clustering problems, we mean datasets for which the correct cluster structure (i.e., the ground truth) is known, so that learning can occur in a supervised manner. Since the correct clustering solution is known for these (sample) problems, a direct comparison using this solution as a reference provides us with an objective measurement of the quality of every member of the training PFAs (process B in Figure 3). Such a comparison is made by means of an external cluster validity index (refer to Section III-C for details), and the resulting quality value is used as the target (response) variable to be predicted by the regression model.
At the core of the proposed decision-making methodology is the characterization of the PFAs (process C in Figure 3), i.e., the extraction of a set of features (explanatory variables) that the model will later learn to associate with the target variable at the training step (process D in Figure 3). These features are extracted for every member of the PFA (just as quality measurements are computed independently for each of these members, in process B of the figure). The feature set encompasses individual aspects of the PFA members and the partitions they represent, as well as global aspects of the PFA and the particular clustering problem being solved. Section III-D elaborates further on the characterization process.

B. DECISION-MAKING STAGE: MODEL APPLICATION
In the decision-making stage of MLDM, the knowledge acquired during the learning stage is exploited to guide the selection of a final solution in a real (unsupervised) setting.
As illustrated in Figure 4, the input to this stage is a PFA generated by the chosen EMC method, which comprises a range of candidate solution alternatives for an unknown clustering problem (a problem for which information about the correct partitioning is unavailable, as generally occurs in practice). Given that all these solution alternatives are nondominated, i.e., equally good in the Pareto-optimality sense, the goal is to employ the regression model constructed in advance to estimate their quality, enable discrimination (breaking ties), and hence identify a promising final choice.
More specifically, once the input PFA is characterized (process A in Figure 4, which is explained later in Section III-D), the regression model is applied to the feature vectors extracted for all PFA members to get their corresponding quality estimates. The candidate PFA member with the highest estimated quality value is chosen and presented as the final solution recommendation (process B in Figure 4).
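The decision-making stage reduces to scoring every PFA member with the trained model and taking the maximum. The sketch below keeps the model abstract: `predict_quality` is any callable that maps a feature vector to an estimated quality value. In the paper this role is played by the regression model built in the learning stage; the toy model and the two-element feature vectors used here are purely hypothetical stand-ins.

```python
def select_final_solution(feature_vectors, predict_quality):
    """Return (index of the best PFA member, list of quality estimates).

    `feature_vectors` holds one feature vector per PFA member and
    `predict_quality` is a callable returning an estimated quality
    (higher is better) for a single feature vector.
    """
    estimates = [predict_quality(fv) for fv in feature_vectors]
    best = max(range(len(estimates)), key=lambda j: estimates[j])
    return best, estimates


def toy_model(fv):
    """Hypothetical stand-in for a trained regressor: prefers partitions
    whose first feature (imagined here as the number of clusters) is near 3."""
    return 1.0 - abs(fv[0] - 3) / 10.0


# Hypothetical feature vectors for three PFA members.
features = [[2, 0.1], [3, 0.4], [7, 0.9]]
best_index, quality_estimates = select_final_solution(features, toy_model)
```

Keeping the predictor behind a plain callable means the same selection routine works with any regression technique, which matches the methodology's stated independence from the specific learning method.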

C. DESIGN CHOICES AND CONSIDERATIONS
The methodology proposed, as described above, is independent of the specific definition of its main design components: (i) the EMC algorithm used to generate the PFAs; (ii) the set of features used to characterize the PFAs; (iii) the external cluster validity index used as the target variable; and (iv) the machine learning technique used to build the predictive model. Despite this flexibility, evaluating the impact of varying these components is beyond the scope of this study. We adopted specific choices for our experimental analysis:
• Algorithm Δ-MOCK [26] is chosen as the EMC method.
• To our knowledge, this is the first work that explores a methodology like the one proposed, including the need for defining a set of features (explanatory variables) to characterize the solutions in the PFAs produced by EMC methods. As such, the engineering of these features is seen as one of the contributions of this paper, thus receiving a separate, detailed treatment in Section III-D.
• Our supervised learning approach (building a model from sample problems with known solutions) allows us to employ an external cluster validity index as an indicator of partition quality. This indicator is exploited as the target (output) variable, i.e., as the solution quality measure that the model will learn to estimate on the basis of the extracted features of PFA members. The adjusted Rand index (ARI) is the particular measure we adopted [48]. ARI evaluates the pairwise co-assignment of elements to clusters between two given partitions (in this case, a candidate partition in the PFA and the correct clustering for the sample problem). This measure is defined in the range [∼0, 1], where a value of 1 indicates a perfect agreement between the two partitions.
• Finally, given that the model is intended to predict solution quality, i.e., the (continuous) ARI value for PFA members, we approach decision making as a regression (rather than classification) task. The random forest technique has been adopted to construct such a regression model [49]. This technique has shown robustness in dealing with mixed-type and high-dimensional feature sets (as in our case, see Section III-D), obtaining promising results in different application domains [50], [51], [52].
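To make the target variable concrete, the following sketch computes the ARI from the contingency counts of two label vectors, using the standard pair-counting formula. The function name and the convention of returning 1.0 in the degenerate cases (fewer than two samples, or both partitions trivial) are assumptions of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumed convention for fewer than two samples

    # Pairs co-assigned in both partitions, and in each partition separately.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = in_a * in_b / total_pairs  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # assumed convention when both partitions are trivial
    return (both - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same grouping, different label names
crossed = [0, 1, 0, 1]     # splits every true cluster
```

The index is invariant to label names, so `relabeled` scores 1 against `truth`. The `crossed` labeling agrees with `truth` less often than chance would, so its score falls below zero, which is why the lower end of the range is only approximately 0.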

D. CHARACTERIZATION OF APPROXIMATION SETS
The characterization process of the PFAs is a critical component that enables the application of the above-described decision-making methodology. This process involves extracting a set of features that allow us to describe each of the candidate partitions in the PFA, so that these features can later be linked to the adopted measure of solution quality. In other words, these features will assume the role of explanatory (input) variables during the supervised construction and subsequent utilization of our method's predictive model. Although separate feature vectors are to be extracted for the different PFA members, these features need to capture aspects of these members both at the individual level and at the global, context level. That is, the decision on a final solution should not be made by only analyzing individual candidates in isolation, but by also taking into account the relationships between these candidates in the PFA. Looking at such relationships makes possible the evaluation of properties related to the geometry of the PFA as well as those referring to the entire set as a whole. These properties, together with information regarding the particular clustering problem being addressed, represent the context against which these alternative solution choices are presented for consideration.
In this study, a total of 55 features are defined to evaluate different aspects of PFA members and their context. The full description of these features, as well as of the five categories in which they are organized, is provided in Appendix A.

IV. EXPERIMENTAL SETUP
The following subsections describe the main settings of our experimental study, including the clustering problems used for testing, the decision-making approaches adopted as references, and the performance assessment measures considered.

A. CLUSTERING PROBLEMS
A total of 50 clustering problems are considered in our experiments, out of which 40 are synthetic and 10 are real-world datasets. The 40 synthetic problems, as illustrated in Figure 5, vary in size and present a diversity of characteristics regarding the shape, overlap/separation, and density of the clusters. The motivation for using these synthetic, low-dimensional problems is to be able to associate the performance of the methods evaluated with the observable features of the data.
The 10 real-world problems are included to evaluate our proposal and reference methods under conditions which can be more representative of those encountered in practice. Table 1 specifies the size, dimensionality, and correct number of clusters in these real-world datasets: Banknote authentication (Banknote); Breast cancer Wisconsin diagnostic (Breast); Optical recognition of handwritten digits (Digits); Ecoli; Iris; Statlog landsat satellite (Landsat); Palmer archipelago penguin data (Palmer); Seeds; Thyroid disease (Thyroid); and Wine. All these datasets except one are available from the UCI machine learning repository. 2 The remaining problem, namely, Palmer, is provided by its authors via GitHub. 3

B. REFERENCE APPROACHES
Our proposed MLDM method is evaluated with respect to a diverse set of decision-making approaches that have been proposed in the EMC literature. We consider representatives from the three categories described in Section II-C as well as some additional references, as detailed below.

1) REFERENCE METHODS BASED ON ADDITIONAL CLUSTERING CRITERIA
We include comparisons against the use of three separate clustering criteria: SIL [30], DB [32], and DUNN [37]. These indices, as discussed in Section II-C1, are commonly used for decision-making purposes. In addition, we consider the recently proposed index combination CH+DB+SIL [9].

2) REFERENCE METHODS BASED ON THE SHAPE OF THE PFA
Our analysis includes two methods based on the selection of a final solution from the knee of the PFA: MOCK's strategy based on the computation of control fronts [10], and Shirakawa and Nagao's approach of selecting the solution that minimizes the sum of objective values (SUMO) [12]. These approaches have previously been discussed in Section II-C2.

3) REFERENCE METHODS BASED ON ENSEMBLE CLUSTERING
The two ensemble-based approaches by Zhu et al. [9], to be referred to as LWEA and LWGP, are included in our comparison. These approaches generate a set of consensus partitions using the techniques proposed by Huang et al. [44], and then apply index combination CH+DB+SIL to select a final solution (refer to Section II-C3 for additional details).

4) ADDITIONAL BASELINE REFERENCES
Finally, the following baselines are considered: the best and worst solutions in the PFA (BEST and WORST), representing the upper and lower bounds on the achievable performance; the extreme points of the PFA (EXT1 and EXT2), referring to the naive approach of always favoring one objective function over the other (intra-cluster variance and connectivity, respectively); and a random selection (RAND), which any reasonable decision-making strategy should outperform.
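The baselines above can be sketched in a few lines. In this illustrative example, each PFA member is represented by a hypothetical tuple (obj1, obj2, ari), both objectives are assumed to be minimized, and the function name baseline_choices is introduced only for this sketch.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical representation of a PFA member: (obj1, obj2, ari).
Solution = Tuple[float, float, float]


def baseline_choices(pfa: List[Solution], rng: random.Random) -> Dict[str, int]:
    """Return the index selected by each baseline (objectives assumed minimized)."""
    if not pfa:
        raise ValueError("the PFA must contain at least one candidate")
    indices = range(len(pfa))
    return {
        "BEST": max(indices, key=lambda i: pfa[i][2]),   # highest actual ARI
        "WORST": min(indices, key=lambda i: pfa[i][2]),  # lowest actual ARI
        "EXT1": min(indices, key=lambda i: pfa[i][0]),   # best first objective
        "EXT2": min(indices, key=lambda i: pfa[i][1]),   # best second objective
        "RAND": rng.randrange(len(pfa)),                 # uniform random pick
    }


toy_front = [(0.0, 1.0, 0.3), (0.5, 0.5, 0.9), (1.0, 0.0, 0.1)]
picks = baseline_choices(toy_front, random.Random(0))
```

Note that BEST and WORST require the actual ARI values, so they serve only as bounds for analysis and are not usable decision-making strategies.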

C. PERFORMANCE ASSESSMENT
The results of our experiments are evaluated considering three different aspects of performance. First, we evaluate prediction performance, referring to the capacity of our proposed method, and more specifically of the regression model employed, to accurately estimate the quality of the candidate solutions in the PFA. As indicated in Section III-C, we adopted ARI [48] as the measure of solution quality to be predicted by the model. Therefore, we evaluate prediction performance in terms of the root-mean-square error (RMSE) between the predicted and actual (measured) ARI values of candidate solutions. Lower RMSE values are always preferred, with 0 being the best possible value for this measure.
The second aspect evaluated is decision-making performance. Since the goal of decision making is to ultimately select a good solution from the PFA, this aspect refers precisely to the quality of the solutions chosen, which is given by their actual ARI values (as computed with respect to the correct solution of each problem). As stated in Section III-C, ARI is defined in the range [∼0, 1] and is to be maximized.
The last considered aspect of performance is the effectiveness of the methods at determining the correct number of clusters, k. As discussed in Section II-C, selecting a final solution also implies deciding on the value of k, given the diversity of choices available in the PFA. Thus, we complement our evaluation of decision-making approaches by analyzing the absolute differences between the correct value k* and the value of k of the solutions selected, |k* − k| (the lower the difference, the better the performance of the method).
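The three measures described above can be sketched as follows. This is a minimal, self-contained illustration using only the Python standard library: the ARI is computed with the standard pair-counting formulation, and the function names (rmse, adjusted_rand_index, k_error) are chosen for this example rather than taken from any implementation used in the study.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square error between predicted and measured ARI values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


def _pairs(count: int) -> int:
    """Number of unordered pairs that can be formed from `count` items."""
    return count * (count - 1) // 2


def adjusted_rand_index(
    labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]
) -> float:
    """Adjusted Rand index (pair-counting form) between two partitions."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    sum_cells = sum(_pairs(c) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(_pairs(c) for c in Counter(labels_true).values())
    sum_cols = sum(_pairs(c) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / _pairs(n)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Degenerate case (e.g., both partitions have a single cluster).
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def k_error(k_true: int, k_selected: int) -> int:
    """Absolute difference |k* - k| between correct and selected cluster counts."""
    return abs(k_true - k_selected)
```

For example, adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]) evaluates to 1, since the measure is invariant to the relabeling of clusters.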
Finally, given the stochastic nature of some of the decision-making methods evaluated, we consider 20 independent repetitions of each experiment. In the specific case of MLDM, each repetition consists of the full process of training the regression model and testing; during training, the main hyperparameters of the random forest technique are adjusted by means of exhaustive (grid) search and 5-fold cross-validation. Statistical significance analyses are conducted for our main results using the (nonparametric) Mann-Whitney U test, considering in all the cases a significance level of α = 0.05 and the Holm-Bonferroni correction procedure.
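To make the correction step concrete, the sketch below implements the standard Holm-Bonferroni step-down procedure over a list of p-values (such as those produced by pairwise Mann-Whitney U tests). The function name holm_bonferroni and the example p-values are illustrative assumptions; the exact way in which the procedure was applied in our analyses is not reproduced here.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each hypothesis, whether it is rejected under Holm's procedure.

    The p-values are visited in ascending order; the i-th smallest (0-based)
    is compared against alpha / (m - i). Testing stops at the first failure,
    and all remaining hypotheses are retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


# Made-up p-values for four pairwise comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

With these made-up values, the two smallest p-values (0.005 and 0.01) pass their thresholds of 0.0125 and approximately 0.0167, whereas 0.03 exceeds 0.025, so the procedure stops and the two remaining hypotheses are retained.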

V. EXPERIMENTS AND RESULTS
This section presents the results of a series of experiments conducted to evaluate the suitability of the MLDM method proposed in this paper. First, the experiments of Sections V-A and V-B focus on synthetic clustering problems, each considering a particular decision-making scenario with different difficulty. Then, Section V-C extends this evaluation, analyzing the performance of our proposal on real-world datasets.

A. EXPERIMENT 1: KNOWN DATASETS
In the scenario considered in this experiment, we aim to select a final solution from an unknown PFA, generated for a dataset which is already known to MLDM. That is, the specific PFAs used for testing are completely unknown to MLDM, but other sample PFAs, obtained for the same dataset, were used during the training of MLDM's regression model. Despite being generated independently, the testing PFAs may share some similarities with the ones included in the training set. Hence, this particular scenario is evidently less challenging for MLDM, in comparison to the one analyzed later in Section V-B. Note, however, that this scenario is representative of situations where the same clustering problem (or a similar one) needs to be solved repeatedly (with certain frequency). In market segmentation, for example, the goal is to identify groups of customers so that differentiated, more effective strategies can be devised. We may expect the characteristics of the problem (and those of the PFAs produced for it) to remain comparable if the analysis always centers on the same type of information (e.g., demographics). Thus, it should be possible to learn from the outcome of previous decision-making processes, as evidenced by historical data or even by data from other business branches.
This experiment considers the 40 synthetic problems described in Section IV-A. For each problem, a total of 40 PFAs were generated through independent runs of Δ-MOCK. Out of these 40 PFAs, 20 were included in the training set and the remaining 20 were reserved for testing purposes. Considering that the average cardinality of the PFAs is 100, the training and testing sets used in this experiment each contain approximately 80,000 solution samples. As indicated in Section IV-C, we performed 20 independent repetitions of the full process of training (including cross-validated hyperparameter tuning) and testing. The results are summarized in Figures 6, 7, and 8, which cover the three performance aspects discussed in Section IV-C and include comparisons against the references described in Section IV-B. Detailed results on decision-making performance, focusing on individual problems and including the findings of the statistical significance analysis, are provided in Table 2 (Appendix B).
As can be seen from Figure 6, MLDM reports promising results in terms of prediction performance, scoring relatively low RMSE values in most cases when applied to the unknown, testing PFAs (note, however, that these RMSE values are higher in comparison to those observed for the training data, which is an expected behavior in supervised learning). Low RMSE values confirm that MLDM's regression model has reasonably succeeded in estimating partition quality. This translates into a highly competitive decision-making performance, as shown in Figure 7 and Table 2. The solutions selected by MLDM yield high ARI values, significantly surpassing those selected by the reference methods and competing closely with the best solutions available in the PFAs (baseline BEST). MLDM was able to identify high-quality solutions for most problems (indeed, it chose the best solution alternative in many cases), in spite of the wide range of qualities observed across PFA members (see the large differences between baselines BEST and WORST). These observations are also consistent with the results of Figure 8, where MLDM is found to be the best performer in terms of the correct determination of the number of clusters.
The strongest contender among the references considered is MOCK's strategy, based on the identification of the knee of the PFA. This is followed by five approaches that seem to provide comparable (average) performances: SIL, DB, and CH+DB+SIL, which are based on the use of additional clustering criteria, and LWEA and LWGP, which are based on ensemble clustering. It is interesting to observe that the use of strategies DUNN and SUMO leads to the selection of solutions which are, at least on average, poorer in quality than those selected at random (baseline RAND).

B. EXPERIMENT 2: UNKNOWN DATASETS
The scenario of our second experiment focuses on the selection of a final solution from a PFA generated for a dataset which is unknown to MLDM. That is, contrasting with the experiment presented earlier in Section V-A, in the more challenging setting considered herein the training set completely excludes PFAs produced for the same dataset being used for testing. Therefore, this experiment is intended to investigate the ability of our proposal to learn from the knowledge available for example problems and exploit it to guide decision making in the context of new applications.
For each of the 40 synthetic problems, we generated 20 PFAs by means of independent runs of algorithm Δ-MOCK. In this case, however, we only used 39 problems at a time for training, leaving the remaining problem out for testing. Consequently, 40 configurations of this experiment were required, each allowing a different problem to be excluded from training and reserved for testing. These configurations consider training and testing sets with roughly 78,000 and 2,000 solution samples, respectively (given that the 20 PFAs of every problem contain about 100 solutions). Furthermore, for each experiment configuration 20 independent repetitions of the training and testing processes were performed, as indicated in Section IV-C. Below, the performance of MLDM is compared against several reference methods (Section IV-B) and analyzed from multiple perspectives (Section IV-C).
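The leave-one-problem-out design described above can be sketched as follows. The generator name leave_one_problem_out and the placeholder sample tuples are hypothetical; the toy data merely reproduce the sample counts of this experiment (40 problems, 20 PFAs per problem, about 100 solutions per PFA).

```python
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[int, int]  # placeholder for a (feature vector, ARI) pair


def leave_one_problem_out(
    samples_by_problem: Dict[str, List[Sample]],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held-out problem, training samples, testing samples) per configuration."""
    for held_out in samples_by_problem:
        train = [
            sample
            for name, samples in samples_by_problem.items()
            if name != held_out
            for sample in samples
        ]
        test = list(samples_by_problem[held_out])
        yield held_out, train, test


# Toy data: 40 problems, each with 20 PFAs x 100 solutions = 2,000 samples.
toy_data = {
    f"problem_{i}": [(i, j) for j in range(20 * 100)] for i in range(40)
}
held_out, train_set, test_set = next(leave_one_problem_out(toy_data))
```

The first configuration holds out problem_0, leaving 39 × 2,000 = 78,000 training samples and 2,000 testing samples, in agreement with the set sizes reported above.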
Our results make evident the more challenging conditions of this new scenario. From Figure 9, it is possible to observe a decrease in prediction performance when MLDM is applied to PFAs of unknown problems, with significantly higher RMSE values than those reported for the previous experiment (see Figure 6). Such a decrease in prediction performance is also reflected in reduced decision-making and k-determination capabilities, as can be seen from Figures 10 and 11 (and by contrasting these results with those shown previously in Figures 7 and 8). It is noteworthy that the increased difficulty of this scenario is only relevant to MLDM; reference methods thus maintain the same behaviors as observed and discussed at the end of Section V-A.
Despite the aforementioned performance drops, MLDM's results in terms of decision-making and k-determination are clearly competitive. From the overall results of Figures 10 and 11, our method is seen to outperform all reference approaches evaluated. This suggests that the estimations of partition quality produced by MLDM's regression model, although not as accurate as those observed in Section V-A, are still sufficiently informative so as to induce an effective discrimination among the competing solutions in the PFAs. At this point, it is worth considering the question of how strong the correlation between prediction performance and decision-making performance is, which we investigate by analyzing Figure 12. The figure confirms that the two performance aspects are certainly correlated, where lower RMSE values tend to be associated with a higher quality of the solutions chosen. More interestingly, though, the figure also reveals that there is an important number of cases indicating the selection of high-quality solutions in spite of relatively high prediction errors.
From this, we can stress that even inaccurate predictions can provide useful information to guide the identification of promising solution alternatives, which is the ultimate goal of decision making. The accuracy of the regression model's predictions is therefore a sufficient, but not necessary condition for the effectiveness of MLDM.
Finally, some interesting behaviors can be derived from the analysis of problem-specific results of MLDM and reference methods. Figure 13 presents individual results for a sample of 12 problems, but detailed results for the full set of 40 problems can be found in Table 3 (Appendix B). On the one hand, we would like to emphasize that all contestant methods have stood out, showing a good performance for specific subsets of problems. However, they lack robustness and fail when problem properties change. For example, DB, DUNN, and SUMO are among the best performers for problems that present non-overlapping and non-linearly separable clusters (these properties can be visualized in Figure 5), such as: atom, chainlink, circles1, inside, orange, part2, and smile1. Note that the remaining references, namely, SIL, CH+DB+SIL, MOCK, LWEA, and LWGP, in most cases scored a poor performance for this particular subset of problems. The completely opposite situation occurs when we consider datasets with linearly separable clusters and varying degrees of overlap, such as: blobs1, blobs2, data_5_2, data_9_2, r15, sizes1, sizes5, square2, triangle2, and twodiamonds. In these problems, methods SIL, CH+DB+SIL, MOCK, LWEA, and LWGP report high ARI values, whereas methods DB, DUNN, and SUMO show a low performance.
On the other hand, it is possible to highlight the increased robustness that MLDM has shown across the diversity of characteristics covered by our dataset collection. Our proposal competes with some of the best results for most of the above-mentioned problems. Moreover, there are other problems for which our proposal is clearly the best performer (in fact, for some problems MLDM is the only method providing a reasonable result): flamesize5, longsquare, moons3, moons5, spiralsizes5; with the exception of longsquare, these problems seem to combine characteristics of nonlinearly separable and overlapping clusters. Finally, we can also identify three datasets, namely, circles2, data_9_2, and spiralsdata92, for which MLDM exhibits a notably low performance. These are difficult problems, as can be judged from the results of all baselines and strategies evaluated. It might also be the case, however, that the particular properties of these problems are not well represented in the training data, which would certainly explain the low performance observed (as a supervised learning method, MLDM's success depends on the availability of representative training samples).

C. EXPERIMENT 3: REAL-WORLD DATASETS
So far we have considered low-dimensional, synthetic clustering problems. This has allowed us to ensure that our collection of test scenarios spans a diversity of characteristics and, more importantly, has enabled us to relate such characteristics to the performance of MLDM and the reference methods evaluated. Nevertheless, it is essential to validate our findings and the suitability of these approaches under more realistic conditions. We thereby replicate the experiment presented previously in Section V-B, focusing now on the set of 10 real-world datasets described in Section IV-A.
As explained in Section V-B, the experiment has been designed to investigate the ability of MLDM to model available knowledge from example settings and exploit it to accomplish decision making in the context of a completely new (previously unseen) problem. Distinct configurations of the experiment are considered so that each of the 10 real-world problems is used exactly once for testing, ensuring that no sample PFAs for the same problem are included in the training set (the training set involves only the remaining 9 problems). This results in training and testing sets with 18,000 and 2,000 solution samples, respectively (we generated 20 PFAs for each dataset, each containing about 100 solution samples). As before, we ran every configuration of this experiment multiple times independently, presenting summaries of the results obtained from the perspective of different performance indicators in Figures 14, 15, and 16. Additionally, Figure 17 and Table 4 (Appendix B) provide separate results for the 10 problems considered.
The results obtained on the real-world problems bear some resemblance to those observed for the synthetic data in Section V-B. Despite the relatively high RMSE values reported in Figure 14, Figure 15 indicates that MLDM selected final solutions which on average present a better quality in comparison to those selected by all the reference methods and baselines (with the evident exception of BEST), therefore being the best overall performer in this test. Analyzing decision-making performance from the perspective of individual problems, the results of Figure 17 are consistent with previous observations regarding the increased robustness that our proposal exhibits across test scenarios. MLDM ranks among the best performers for all the 10 problems considered, unlike the reference methods which only stand out in particular cases. Specifically, the most robust references, SIL, LWEA, and LWGP, only performed well in half of the problems: Breast, Digits, Ecoli, Iris, and Wine; CH+DB+SIL performed well in all these problems, except Digits; MOCK appears as one of the best performers only for three problems: Ecoli, Iris, and Palmer; DB competes with some of the best results only for problems Landsat and Thyroid; and, finally, approaches DUNN and SUMO scored a poor performance in all cases.
It is interesting to note, however, that the real-world datasets certainly posed some challenges for all the methods evaluated. In general, we can see from Figure 15 that all the methods scored ARI values which are consistently lower than those reported for the previous experiment (Figure 10). Approaches DB, DUNN, MOCK, and SUMO have performed even worse than baseline RAND (which selects a solution at random). Furthermore, it is possible to observe that baseline BEST (upper bound on achievable performance) shows an average ARI of about 0.75 (in contrast to the value of 0.95 that it reports in Figure 10). This indicates an unavailability of high-quality solutions in most of the PFAs considered (at least from the perspective of the ARI indicator), which can indeed explain the lower performance of all methods.
An alternative explanation for the lower ARI values observed consistently in this experiment lies in the assumption that the class assignments (labels) specified for these real-world problems reflect the correct partition of the data, which we use as the reference (ground truth) for the computation of this measure. Although these class assignments effectively group the samples and are undoubtedly relevant to the particular application domains of these datasets, they do not necessarily match exactly the inherent cluster structure of the data. To illustrate this, consider the multidimensional scaling projection of dataset Banknote, shown in Figure 18. As can be seen from the figure, the two classes defined for this problem result in a clear separation of the data samples; however, from the clustering perspective it makes more sense to split these classes further into multiple clusters. This finding also offers an explanation for the large errors that all the methods report in terms of k-determination, as shown in Figure 16.

VI. CONCLUSION
Limitations of existing decision-making methods challenge the applicability of EMC algorithms, as the delivery of a single final solution is a necessary step for them to be fully useful in practice. The underlying assumptions of current proposals do not always hold under the peculiarities of the data and the application domains, stressing the need for alternative approaches to cope with the complex nature of this task. In view of this, we explored a novel approach to decision making, demonstrating the viability of addressing it through machine learning. The key concept of our supervised learning-based proposal involves: (i) learning in advance the association between partition quality and a set of features extracted from nondominated solutions; and (ii) exploiting the knowledge gained to drive decision making, enabling the identification of a promising final solution. Our main finding is that, by following the proposed methodology, it is possible to generalize prior learning to completely new scenarios.
Our proposal was evaluated and compared to eight representative decision-making approaches from the literature and some additional baselines. This evaluation was conducted over a diverse collection of synthetic and real-world datasets, under different experimental conditions. Our method consistently reported the best overall performance throughout our experiments. Moreover, it showed an increased versatility with respect to the changing problem characteristics. In general, our results underline the suitability of this proposal and its superiority with respect to existing techniques. On the other hand, the proposed method also presents a limitation that is inherent to supervised learning settings: it relies on the availability of training samples which are representative of the scenarios seen in practice. Although our proposal has shown some robustness, providing competitive results for problems which were completely excluded from the training process, special attention needs to be given to the training set compilation process in order to overcome this limitation.
The outcomes of this study are encouraging, confirming that the development of alternative decision-making approaches is a valuable direction which merits additional research. Preliminary analyses have revealed potential opportunities to further improve the effectiveness of our proposal through an in-depth inspection of our feature set. This should lead to the removal of irrelevant or redundant features, to achieve meaningful dimensionality reductions, as well as to the engineering of new features that can better capture the complexities of the decision-making task. Despite being proposed as a generic framework, to delimit the scope of this study the evaluation of our proposal centered around a specific EMC algorithm, Δ-MOCK [26], and the particular optimization criteria used by it (similar design choices had to be made regarding other components, such as the machine learning method used to construct the regression model and the partition quality criterion used as the response variable). Hence, extending this study to other conditions, including the use of more than two optimization criteria, would certainly support the generality of our conclusions. Finally, our method exploits characteristics of the decision-making task which apply only to the specific EMC context. An interesting research direction concerns exploring the applicability of a methodology like the one proposed in this paper with the aim to assist decision making in the more general context of multiobjective optimization.

APPENDIX A DEFINITION OF FEATURES
As discussed in Section III-D, a set of 55 features are used in this study for the characterization of candidate solutions in the PFAs. These features have been assigned specific acronyms and are organized into five distinct categories, which are separately defined in the following subsections.

A. CATEGORY 1: FEATURES DESCRIBING PFA MEMBERS INDIVIDUALLY
The first category involves features which are defined to describe aspects of the PFA members at the individual level. A total of 11 features are included in this category:
• Objective values (OBJ1, OBJ2). These features correspond to the (raw) objective values of the candidate partition (its specific coordinates in objective space, see Figure 19). In the particular case of algorithm Δ-MOCK, OBJ1 and OBJ2 refer to the values scored for the intra-cluster variance and connectivity criteria.

FIGURE 21.
Feature TEND captures the tendency that a PFA member can present towards favoring one objective over the other, which depends on its location with respect to a line connecting reference points (0, 0) and (1, 1).
• Average of normalized objectives (ANOBJ). This feature is computed as the arithmetic mean of the normalized objective values (NOBJ1 and NOBJ2 features).

• Tendency towards a particular objective (TEND).
Each solution in the PFA exhibits a particular trade-off, which can favor one of the optimized criteria more than the other. Accordingly, we assign three possible values to the TEND feature, {0, 1, 2}. As shown in Figure 21, this depends on the location of the PFA member with respect to a reference line connecting points (0, 0) and (1, 1) of the normalized objective space. If the PFA member appears in the upper half, then it shows a tendency towards the first criterion and we assign a value of 0. If it locates in the opposite half, then it favors the second criterion and we assign a value of 2. Otherwise, no tendency is observed and we assign a value of 1.
In multiobjective optimization, the vector given by the best (feasible) value that can be reached for every objective function constitutes the so-called ideal point. Considering the normalization of the PFA to the range [0, 1] in both dimensions, point (0, 0) can thus be seen as our current approximation to the ideal point. We compute feature IDEAL as the distance from the PFA member to such an approximated ideal point, see Figure 25.
• Distance to the nadir point approximation (NADIR).
Contrary to the ideal point, the nadir point is given by the worst value for each objective function in the entire Pareto-optimal set. Therefore, point (1, 1) of the normalized objective space represents an approximation to the nadir point from the perspective of the PFA under consideration. Feature NADIR, as illustrated in Figure 26, is calculated as the distance of the PFA member with respect to such an approximation.
• Number of clusters (KCLU). As initially exemplified through Figure 1, the PFA may involve solutions showing a diversity of numbers of clusters, k. Feature KCLU refers to the specific value of k of the partition represented by a given PFA member.

• Internal cluster validity indices (SIL, DB, DUNN).
These are unsupervised measures that assess solution quality by analyzing specific aspects of the clusters in the partitions they define. We include in our feature set three indices which are popular choices and have also been exploited for decision-making purposes, as seen in Section II-C1: SIL [30], DB [32], and DUNN [37].
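To illustrate the geometric features defined above, the following sketch computes TEND, IDEAL, and NADIR for the members of a toy PFA. Several aspects are assumptions made only for this example: min-max normalization over the PFA, the first objective plotted on the horizontal axis (so that the "upper half" corresponds to a normalized second objective larger than the first), the Euclidean metric for both distances, and a small tolerance to decide when a point lies on the reference line. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalize_pfa(pfa: Sequence[Point]) -> List[Point]:
    """Min-max normalize both objectives to [0, 1] (assumed normalization)."""
    xs = [p[0] for p in pfa]
    ys = [p[1] for p in pfa]

    def scale(value: float, low: float, high: float) -> float:
        return 0.0 if high == low else (value - low) / (high - low)

    return [
        (scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys))) for x, y in pfa
    ]


def tend(point: Point, tol: float = 1e-9) -> int:
    """0: favors the first criterion; 2: favors the second; 1: no tendency."""
    x, y = point
    if y - x > tol:   # above the line through (0, 0) and (1, 1)
        return 0
    if x - y > tol:   # below the line
        return 2
    return 1          # on the line (within the assumed tolerance)


def ideal_distance(point: Point) -> float:
    """Euclidean distance to the approximated ideal point (0, 0)."""
    return math.hypot(point[0], point[1])


def nadir_distance(point: Point) -> float:
    """Euclidean distance to the approximated nadir point (1, 1)."""
    return math.hypot(1.0 - point[0], 1.0 - point[1])


# Toy PFA with raw (OBJ1, OBJ2) values; both objectives assumed minimized.
raw_front = [(10.0, 50.0), (25.0, 30.0), (40.0, 10.0)]
norm_front = normalize_pfa(raw_front)
tend_values = [tend(p) for p in norm_front]
```

In this toy example, the middle solution is normalized to (0.5, 0.5); it lies on the reference line and is therefore equidistant from the ideal and nadir approximations.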

C. CATEGORY 3: FEATURES DESCRIBING THE PFA MEMBER IN RELATION TO OTHER PFA MEMBERS
Unlike previous categories, the third category of features considers properties of the PFA member whose evaluation depends on other candidate members of the PFA (such as neighboring solutions or the extreme points of the PFA). This category includes the following 16 features:
• Ranking for individual objectives (RANK1, RANK2). Features RANK1 and RANK2 are computed as the rank positions of the candidate solution after sorting the full list of PFA members according to their values for the first and second objective functions, respectively. Given that all PFA members are nondominated with respect to each other, a total order is obtained when considering the two objective functions independently. Thus, no PFA member is ranked equal to any other, and every member receives a distinct value for features RANK1 and RANK2. The solution with the best objective value, i.e., the extreme point of the PFA, is assigned rank 1 for the corresponding objective, whereas the solution with the worst objective value (at the opposite extreme) is assigned a rank that equals the cardinality of the PFA.
• Extreme points of the PFA (EXT1, EXT2). Binary features indicating whether or not the PFA member is the extreme point (best objective value) for the first and second objective functions, respectively (see Figure 27).
• Membership to convex hull (INCVX). The convex hull (or convex closure) is given by the smallest subset of points which define a convex polygon enclosing all points in a set. After discarding solutions outside the zone of interest (see Figure 22 and description of feature ZINT), the convex hull is computed for the remaining PFA members, as shown in Figure 28. Feature INCVX is assigned a value of 1 whenever the PFA member is a vertex of the resulting convex hull, and 0 otherwise.

FIGURE 29.
Feature ANGNE is defined as the angle between the lines connecting the PFA member characterized to its left and right neighbors.
• Angle between closest neighbors (ANGNE). As illustrated in Figure 29, this feature is computed as the angle between the lines joining the solution being characterized with its left and right closest neighbors in the PFA.
• Angle between left and right approximations (ANGAP). ANGAP is proposed as a ''smooth'' version of feature ANGNE defined above. Rather than considering the left and right neighbors, ANGAP uses hypothetical points calculated as the average of the coordinates of all PFA members at the left (respectively right) side of the point being characterized, see Figure 30.
• Contribution to hypervolume (CHVOL). The hypervolume, as discussed further in Appendix A-D (definition of feature HVOL), is a well-known performance indicator in evolutionary multiobjective optimization, evaluating the quality of a PFA as a whole [53]. The contribution of individual solutions to the value of this indicator (see Figure 31) has been used as a criterion to guide the search process [54]. This approach is adopted as one of our features to characterize PFA members.
Features DEXT1 and DEXT2 are given by the Euclidean distance from the point being characterized to the extreme points of the PFA, as shown in Figure 32 (objective values are normalized to range [0, 1]).

• Number of points within a certain radius (RADIUS).
This feature, exemplified in Figure 33, refers to the total number of solutions lying within a certain radius r from the PFA member under consideration. In this study, we adopted a fixed value of r = 0.1, which represents about 7% of the distance between the extreme points of the PFA (assuming the previous normalization of the PFA).
• Crowding distance (CROWD). The crowding distance is a measure implemented within the nondominated sorting genetic algorithm II (NSGA-II) as a means to promote the diversity and distribution of solutions in the PFA [55]. Feature CROWD refers to the value of this measure, whose computation is illustrated in Figure 34.
• Neighbors with the same k value (NEIGK). As shown in Figure 35, feature NEIGK specifies the number of consecutive neighbors (at both sides) presenting the same value of k as the PFA member characterized.
• Percentage of points with the same k value (PERCK). PERCK is the percentage of the full set of solutions in the PFA that exhibit the same value of k as the specific PFA member being characterized.
• Triangle area (TRIAR). This feature is illustrated in Figure 36. As can be seen, it is defined as the area of the triangle which has the PFA member and the extreme points as its vertices. The area can be calculated from the length of the triangle's sides, using Heron's formula. If the point being characterized is outside the zone of interest (see description of feature ZINT, Appendix A-A), the negative of the computed area is used instead.
• Triangle height (TRIHE). In a similar manner to feature TRIAR, this feature considers the triangle defined by the extreme points of the PFA and the candidate solution being characterized. Feature TRIHE is given by the height of this triangle, taking as its base the line connecting the extreme points of the PFA (see Figure 37). If the solution considered is beyond the zone of interest (refer to the definition of feature ZINT in Appendix A-A), the negative of the computed height is used as the value of feature TRIHE.
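The angle and triangle features above (ANGNE, ANGAP, TRIAR, TRIHE) can be illustrated with a minimal Python sketch. It is not the implementation used in this study. It assumes a bi-objective PFA given as a list of (f1, f2) tuples sorted by the first objective, so that the first and last entries are the extreme points. The function names, the NaN value returned for extreme points, and the in_zint flag (standing in for the zone-of-interest test of feature ZINT, which is not reproduced here) are all hypothetical.

```python
import math

def angle_at(p, a, b):
    """Angle in radians at vertex p between the segments p->a and p->b."""
    v1 = (a[0] - p[0], a[1] - p[1])
    v2 = (b[0] - p[0], b[1] - p[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return float("nan")  # assumption: undefined for coincident points
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos_t)))  # clamp rounding errors

def mean_point(points):
    """Average of the coordinates of a non-empty list of points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def angne(pfa, i):
    """ANGNE: angle at pfa[i] between its left and right closest neighbors."""
    if i == 0 or i == len(pfa) - 1:
        return float("nan")  # assumption: extreme points lack one neighbor
    return angle_at(pfa[i], pfa[i - 1], pfa[i + 1])

def angap(pfa, i):
    """ANGAP: as ANGNE, but using the mean of all points on each side."""
    if i == 0 or i == len(pfa) - 1:
        return float("nan")  # assumption: same convention as in angne
    return angle_at(pfa[i], mean_point(pfa[:i]), mean_point(pfa[i + 1:]))

def triar(pfa, i, in_zint=True):
    """TRIAR: Heron's formula for the triangle (extreme, extreme, pfa[i])."""
    a, b, p = pfa[0], pfa[-1], pfa[i]
    s1, s2, s3 = math.dist(a, b), math.dist(a, p), math.dist(b, p)
    s = (s1 + s2 + s3) / 2.0
    area = math.sqrt(max(0.0, s * (s - s1) * (s - s2) * (s - s3)))
    return area if in_zint else -area  # negated outside the zone of interest

def trihe(pfa, i, in_zint=True):
    """TRIHE: height of the same triangle over the base joining the extremes."""
    base = math.dist(pfa[0], pfa[-1])
    height = 2.0 * triar(pfa, i) / base
    return height if in_zint else -height

pfa = [(0.0, 1.0), (0.25, 0.25), (1.0, 0.0)]
print(angne(pfa, 1), angap(pfa, 1), triar(pfa, 1), trihe(pfa, 1))
```

When a point has exactly one neighbor on each side, the two angle features coincide, which makes the "smoothing" role of ANGAP explicit: the two only differ once several PFA members lie on one side.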

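The neighborhood-based features of this category (DEXT1, DEXT2, RADIUS, CROWD, NEIGK, PERCK) can be sketched in the same way. The following is an illustrative sketch under stated assumptions rather than the code used in this study: the PFA is a list of normalized (f1, f2) tuples sorted by the first objective, ks is a parallel list holding the number of clusters k of each member, the point itself is excluded from the RADIUS count, and it is included in the PERCK percentage. The crowding distance follows the usual NSGA-II formulation, with boundary solutions assigned an infinite value. All function names are hypothetical.

```python
import math

def dext(pfa, i):
    """DEXT1, DEXT2: Euclidean distances from pfa[i] to the two extreme points."""
    return math.dist(pfa[i], pfa[0]), math.dist(pfa[i], pfa[-1])

def radius_count(pfa, i, r=0.1):
    """RADIUS: number of other PFA members within distance r of pfa[i]."""
    return sum(1 for j, q in enumerate(pfa)
               if j != i and math.dist(pfa[i], q) <= r)

def crowding_distances(pfa):
    """CROWD: NSGA-II crowding distance for every member of the PFA."""
    n = len(pfa)
    dist = [0.0] * n
    for m in range(2):  # one pass per objective
        order = sorted(range(n), key=lambda j: pfa[j][m])
        lo, hi = pfa[order[0]][m], pfa[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = math.inf  # boundary solutions
        if hi == lo:
            continue  # degenerate objective: no spread to measure
        for pos in range(1, n - 1):
            gap = pfa[order[pos + 1]][m] - pfa[order[pos - 1]][m]
            dist[order[pos]] += gap / (hi - lo)
    return dist

def neigk(ks, i):
    """NEIGK: consecutive neighbors, on both sides, sharing the k of member i."""
    count = 0
    j = i - 1
    while j >= 0 and ks[j] == ks[i]:
        count, j = count + 1, j - 1
    j = i + 1
    while j < len(ks) and ks[j] == ks[i]:
        count, j = count + 1, j + 1
    return count

def perck(ks, i):
    """PERCK: percentage of PFA members with the same k as member i."""
    return 100.0 * ks.count(ks[i]) / len(ks)

pfa = [(0.0, 1.0), (0.05, 0.95), (0.5, 0.5), (1.0, 0.0)]
ks = [2, 2, 3, 2]
print(dext(pfa, 2), radius_count(pfa, 0), crowding_distances(pfa))
print(neigk(ks, 0), perck(ks, 0))
```

Note that NEIGK and PERCK can disagree: in the example, the last member shares its k value with most of the PFA but has no adjacent neighbor with that value.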
D. CATEGORY 4: FEATURES DESCRIBING GLOBAL ASPECTS OF THE PFA
Features in the fourth category capture aspects of the PFA as a whole. That is, their computation involves the full set of solutions in the PFA. Consequently, the value of these features is identical across the feature vectors extracted for all solutions that are members of the same PFA (unlike features in the three previous categories). The following 22 features are defined within this category:
• Cardinality of the PFA (CARD). As the name of this feature suggests, it refers to the total number of candidate clusterings in the PFA. In the specific case of this study, the maximum cardinality is 100, which corresponds to the population size used by algorithm Δ-MOCK during PFA generation. However, given that the PFA consists of nondominated solutions only, it may contain fewer solutions in some cases.
• Minimum objective values (MIN1, MIN2). Features MIN1 and MIN2 are given by the minimum (best) value scored for the first and second objectives, respectively, considering all PFA members.
• Average normalized objective values (NAVG1, NAVG2). These features are defined equivalently to the AVG1 and AVG2 features described above. However, NAVG1 and NAVG2 are computed after normalizing the PFA to range [0, 1] independently in each dimension.
• Hypervolume of the PFA (HVOL). The hypervolume is one of the most widely used indicators to assess the performance of evolutionary multiobjective optimizers [53]. It simultaneously evaluates both the convergence and the diversity of the PFAs produced by these algorithms. Briefly, this indicator is defined as the volume of the region of the objective space, delimited by a reference point, which is dominated by the solutions in the PFA (see Figure 38). Feature HVOL is given by the value of this indicator, which in this study is computed for the normalized PFA and always using a fixed reference point, namely, (1.01, 1.01).
• Minimum and maximum hypervolume contribution (MINHV, MAXHV). In Appendix A-C, we describe
• Area and perimeter of convex hull (ACVX, PCVX). These features, as illustrated in Figure 39, are computed as the area and the perimeter of the convex hull of the PFA, respectively. Note, however, that the computation of the convex hull disregards PFA members outside the zone of interest (see feature ZINT in Appendix A-A).
• Cardinality of the convex hull (CCVX). This feature is given by the total number of vertices in the convex hull. As discussed before, convex hull computation considers only PFA members inside the zone of interest (refer to the description of feature INCVX in Appendix A-C).
• Minimum, maximum, and average value of k (MINK, MAXK, AVGK). Considering that the PFA may contain candidate partitions with a range of values for k, features MINK, MAXK, and AVGK refer to the minimum, maximum, and average values of this parameter across the full set of PFA members, respectively.
• Mode of the k values in the PFA (MODK). Adhering to the definition of mode in statistics, feature MODK is computed as the k value that appears most frequently across the clustering solutions in the PFA.
• Number of unique k values in the PFA (UNIK). This feature simply reflects the total number of distinct values for parameter k in the PFA's candidate partitions.
• Percentage of solutions favoring a particular objective (PTEND1, PTEND2). As explained in Appendix A-A and Figure 21 for feature TEND, a PFA member may exhibit a tendency towards favoring one objective over the other depending on its location. Features PTEND1 and PTEND2 indicate the percentage of the PFA members showing a tendency towards the first and second objective functions, respectively.
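The hypervolume-related features (HVOL, CHVOL, MINHV, MAXHV) can be sketched as follows for the bi-objective case. This is a minimal sketch, not the implementation used in this study. It assumes that both objectives are minimized, that the PFA has been normalized to [0, 1], and that the reference point is (1.01, 1.01) as stated for feature HVOL. It further assumes that MINHV and MAXHV are the smallest and largest individual contributions (CHVOL) over all PFA members, and that a contribution is the loss in hypervolume when a member is removed. The function names are hypothetical.

```python
def hypervolume_2d(pfa, ref=(1.01, 1.01)):
    """Area dominated by the PFA and bounded by ref (both objectives minimized)."""
    hv = 0.0
    prev_f2 = ref[1]
    # Sweep from the best f1 to the worst; each nondominated point adds a
    # rectangle spanning from its f1 to ref[0] and from its f2 to the
    # previous (larger) f2.
    for f1, f2 in sorted(pfa):
        if f2 < prev_f2:  # points failing this test are dominated and skipped
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

def hv_contributions(pfa, ref=(1.01, 1.01)):
    """CHVOL for each member: hypervolume lost when that member is removed."""
    pfa = list(pfa)
    total = hypervolume_2d(pfa, ref)
    return [total - hypervolume_2d(pfa[:i] + pfa[i + 1:], ref)
            for i in range(len(pfa))]

pfa = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
hvol = hypervolume_2d(pfa)
chvol = hv_contributions(pfa)
minhv, maxhv = min(chvol), max(chvol)
print(hvol, chvol, minhv, maxhv)
```

Recomputing the hypervolume once per member is quadratic in the PFA size, which is acceptable for PFAs of at most 100 solutions; a single sweep over the sorted front would yield the same contributions more efficiently.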

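Several of the remaining global features reduce to simple summary statistics. The sketch below illustrates the per-dimension normalization behind NAVG1 and NAVG2, together with CARD and the k-based features (MINK, MAXK, AVGK, MODK, UNIK). It is an illustration under stated assumptions, not the code used in this study: the PFA is a list of (f1, f2) tuples, ks is a parallel list of k values, a degenerate objective with no spread is mapped to 0, and ties for the mode are resolved in favor of the value encountered first. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean

def normalize(pfa):
    """Min-max normalization to [0, 1], independently in each dimension."""
    lo = [min(p[m] for p in pfa) for m in range(2)]
    hi = [max(p[m] for p in pfa) for m in range(2)]
    return [
        tuple((p[m] - lo[m]) / (hi[m] - lo[m]) if hi[m] > lo[m] else 0.0
              for m in range(2))
        for p in pfa
    ]

def navg(pfa):
    """NAVG1, NAVG2: average objective values of the normalized PFA."""
    norm = normalize(pfa)
    return mean(p[0] for p in norm), mean(p[1] for p in norm)

def k_features(ks):
    """CARD and the k-based global features of a PFA."""
    counts = Counter(ks)
    return {
        "CARD": len(ks),                      # number of PFA members
        "MINK": min(ks),
        "MAXK": max(ks),
        "AVGK": mean(ks),
        "MODK": counts.most_common(1)[0][0],  # most frequent k value
        "UNIK": len(counts),                  # number of distinct k values
    }

pfa = [(10.0, 4.0), (20.0, 2.0), (40.0, 1.0)]
ks = [2, 2, 3]
print(navg(pfa))
print(k_features(ks))
```

Because these values depend only on the PFA as a whole, they would be computed once per PFA and copied into the feature vector of each of its members.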
E. CATEGORY 5: FEATURES DESCRIBING THE CLUSTERING PROBLEM BEING SOLVED
In this last category, we consider features that reflect aspects of the particular clustering problem at hand. Similar to the features in the fourth category (Appendix A-D), the value of these features is fixed for all members of the given PFA (it is, indeed, fixed for all solutions in all PFAs generated for the same problem). Two features are included in this category:
• Size of the dataset (DATA). This feature refers to the number of samples in the dataset under consideration.
• Dimensionality of the dataset (DIME). Feature DIME captures the total number of dimensions (i.e., features or variables) in the clustering problem.

APPENDIX B DETAILED RESULT TABLES
This appendix includes detailed results for the main experiments of this study. Tables 2, 3, and 4 present the results for individual problems and the findings of the statistical significance analysis for Experiments 1, 2, and 3, respectively.