R-Metric: Evaluating the Performance of Preference-Based Evolutionary Multiobjective Optimization Using Reference Points

Measuring the performance of an algorithm for solving multiobjective optimization problem has always been challenging simply due to two conflicting goals, i.e., convergence and diversity of obtained tradeoff solutions. There are a number of metrics for evaluating the performance of a multiobjective optimizer that approximates the whole Pareto-optimal front. However, for evaluating the quality of a preferred subset of the whole front, the existing metrics are inadequate. In this paper, we suggest a systematic way to adapt the existing metrics to quantitatively evaluate the performance of a preference-based evolutionary multiobjective optimization algorithm using reference points. The basic idea is to preprocess the preferred solution set according to a multicriterion decision making approach before using a regular metric for performance assessment. Extensive experiments on several artificial scenarios, and benchmark problems fully demonstrate its effectiveness in evaluating the quality of different preferred solution sets with regard to various reference points supplied by a decision maker.


I. INTRODUCTION
M OST real-world problem solving tasks usually involve multiple incommensurable and conflicting objectives which need to be considered simultaneously.Such problems are termed as multi-objective optimization problems (MOPs) that have earned considerable attention in engineering design, modelling, and operations research.Instead of a single solution that optimizes all objectives simultaneously, in multi-objective optimization, we often look for a set of Pareto-optimal solutions none of which can be considered better than another when all objectives are of importance.
Over the past two decades and beyond, evolutionary algorithms (EAs) have been widely accepted as a major approach for multi-objective optimization.Many efforts have been devoted to developing evolutionary multi-objective optimization (EMO) algorithms, such as elitist non-dominated sorting genetic algorithm (NSGA-II) [1]- [3], indicator-based EA (IBEA) [4]- [6] and multi-objective EA based on decomposition (MOEA/D) [7]- [9].These algorithms, without any additional preference information (or intervention) from a decision maker (DM), are usually designed to obtain a set of solutions that approximate the whole Pareto-optimal set.However, the ultimate goal of multi-objective optimization is to help the DM find solutions that meet his/her own preference information.To facilitate the decision making process, it is desirable to integrate the DM's preference information into the search process of EMO for the following reasons: 1) Supplying a DM with a large amount of trade-off points not only increases his/her workload, but also provides many irrelevant or even noisy information to the decision making process.Rather than the whole Pareto-optimal front (PF), the DM is usually interested in only a small set of trade-off points most relevant to him/her.A biased search, according to the DM's preference information, is able to provide more acceptable alternatives.2) Due to the curse of dimensionality, the number of points used to accurately represent the whole PF increases exponentially with the number of objectives.This not only severely increases the computational burden of an EMO algorithm, but also causes extra difficulties for the DM to comprehend the obtained solutions and then to make decisions.Therefore, it is more practical to search for a fine-grained resolution of a preferred region of the PF by incorporating the DM's preference information.3) In a high-dimensional space, the mushrooming of nondominated solutions, even for a randomly generated population, renders the traditional Pareto dominance based selection useless [10].However, by considering the DM's preference information, we can expect a necessary selection pressure additional to Pareto dominance [11].
In the past decade, there have been a number of studies on the preference-based EMO.Generally speaking, their ideas can be divided into four categories.The first one modifies the original Pareto dominance by classifying objectives into different levels and priorities (e.g., [12]- [14]) or expresses the DM's preference information by fuzzy linguistic terms according to different aspiration levels (e.g., [15]- [17]).The second sort modifies the diversity management module so that the density of Pareto-optimal solutions can be biased towards the region of interest (ROI) (e.g., [18]- [20]).The third approach combines the classical reference point based method [21] with EMO (e.g., [22]- [24]).The last category, as a recent trend, combines the DM's preference information with performance metrics (e.g., weight hypervolume [25], R2-indicator [26] and averaged Hausdorff distance [27]) in algorithm design.In this paper, our discussion focuses on the reference point based method, which has been recognized as one of most popular methods in this literature [28].
Despite the progress in algorithm design, few have been done on evaluating the quality of preferred solutions obtained by a preference-based EMO algorithm.Although a number of performance metrics have been suggested to evaluate the quality of solutions that approximate the whole PF, including metrics for evaluating convergence (e.g., [29]- [31]) and diversity (e.g., [32]- [34]) separately, and metrics that evaluate both aspects simultaneously (e.g., [35]- [37]), none of them can be directly applicable when only a partial PF is considered.Some attempts to adapt the regular metrics to serve the purpose of assessing the quality of a preferred solution set have been reported in [38]- [41].However, they are ad-hoc and oversimplified which could make the assessments misleading.Possibly due to the lack of reliable metrics, many studies, if not all, on the preference-based EMO heavily rely on the visual plot in performance comparisons.These methods are rather subjective, and how to visualize data in a high-dimensional space is itself an open problem.This paper presents a systematic way, denoted as R-metric, to quantitatively evaluate the quality of preferred solutions obtained by a preference-based EMO algorithm using reference points.Our basic idea is to use a multi-criterion decision making (MCDM) approach to preprocess the obtained solutions, according to their satisfaction to the DM's preference information, before using a regular metric for performance assessment.It is so simple and general that any existing metric can be adapted with little modification.Note that it is a comprehensive extension and generalization of the metric developed by the second author and his collaborators [42], while the major differences are listed in Section I of the supplementary file.
The rest of this paper is organized as follows.Section II gives some preliminary concepts related to this paper.In Section III, the motivations of this paper are delineated and discussed.Section IV is devoted to the description of the proposed method.Section V and Section VI present the empirical studies on several artificial scenarios and a series of benchmark problems respectively.Finally, Section VII concludes this paper and provides some future directions.

II. PRELIMINARY CONCEPTS
This paper considers the following continuous MOP with box constraints: where + constitutes of m real-valued objective functions, and R m + is called the objective space.The attainable objective set is defined as Θ = {F(x)|x ∈ Ω}.
Definition 1: x 1 is said to Pareto dominate x 2 , denoted as x 1 x 2 , if and only if: Definition 2: x * ∈ Ω is said to be Pareto-optimal if there is no other x ∈ Ω such that x x * .Definition 3: The set of all Pareto-optimal solutions is called the Pareto-optimal set (PS).The set of all Pareto-optimal objective vectors, P F = {F(x)|x ∈ P S}, is called the PF.
For the ease of later discussion, we briefly introduce two widely used performance metrics in the EMO literature.
1) Inverted Generational Distance (IGD) metric [35]: Let P * be a set of points uniformly sampled along the PF, and S be the set of solutions obtained by an EMO algorithm.The IGD value of S is calculated as: where dist(x * , S) is the Euclidean distance between the point x * ∈ P * and its nearest neighbor of S in the objective space, and |P * | is the cardinality of P * .2) Hypervolume (HV) metric [31]: Let z w = (z w 1 , . . ., z w m ) T be a worst point in the objective space that is dominated by all Pareto-optimal objective vectors.HV metric measures the size of the objective space dominated by solutions in S and bounded by z w .
where VOL(•) indicates the Lebesgue measure.Both IGD and HV metrics are able to give a comprehensive information, including the convergence and diversity, of S simultaneously.The lower is the IGD value (or the larger is the HV value), the better is the quality of S for approximating the whole PF.

III. MOTIVATION
This section first discusses the shortcomings of some existing metrics for evaluating the partial PF.Then, we develop the motivation for our proposed R-metric from the perspective of the achievement scalarization function (ASF) in MCDM.

A. Shortcomings of Regular Metrics
Let us start from the regular IGD and HV metrics, which are used as the baseline of R-metric, for assessing the partial PF.Note that although IGD and HV are not originally designed to assess the partial PF, some studies [43], [44] use them for comparing preference-based EMO algorithms.The example PF considered here is a line (i.e., f 2 = 1 − f 1 ) having an intercept of one with each objective axis.The DM's preference information is specified as a reference point z r = (0.16, 0.9) T in the objective space.Points focusing on the region closest to z r are most relevant to the DM's preference information.To calculate the IGD values, we sample 670 evenly distributed points along the PF; and we set the worst point as z w = (1.1,1.1) T for calculating the HV value.obtains the second worst metric values, whereas S 5 and S 6 , far away from the ROI, obtain the best metric values.In summary, neither IGD nor HV metric is reliable for evaluating the quality of a preferred solution set.A solution set with additional but unwanted points may obtain a better metric value, thereby making the IGD and HV metrics unsuitable for performance assessment in the toy example shown in Fig. 1(a).On the other hand, even for different point sets having the same spread along the PF, their IGD and HV values depend on their positions and the PF's geometric property.This makes the IGD and HV metrics unsuitable for performance assessment in the toy example shown in Fig. 1(b).

B. Overviews of Existing Preference-Based Metrics
In [41], a modified IGD was proposed to assess the closeness of an approximation set S toward multiple reference points.It calculates the average of the shortest distance between a solution in S and the point in P * closest to a reference point.This metric degenerates to the shortest distance measure when only considering one reference point.Analogously, [27] proposed a preference-based EMO algorithm that tries to minimize the averaged Hausdorff distance between solutions and the DM supplied aspiration set, i.e., a set of reference points.Although this idea is not directly proposed for performance assessment, it can be used to assess the quality of preferred solutions with respect to the DM's preference information.[38] and [39] adapted the regular HV metric for the preferencebased EMO.Their basic ideas are similar.At first, they merge solutions obtained by all considered algorithms into a composite set.Then, they specify a preferred region within the composite set.Finally, only solutions falling within this preferred region are considered for performance assessment.The major difference between [38] and [39] is the setting of the preferred region.As shown in Fig. 2(a), [38] uses the closest point to the origin as the center of the preferred region; while, as shown in Fig. 2(b), [39] uses the closest point to the DM supplied reference point as the center.Both these two metrics do not require any prior knowledge of the PF, and they work for some simple examples.However, they have some flaws that make them misleading: 1) It is obvious that [38] does not take the DM's preference information into consideration.For the example in Fig. 2(a), S1 is obviously preferable over S 2 considering the given reference point z r .However, S 1 and S 2 are distant from each other, and the origin is closer to the points in S 2 .Therefore, S 1 will be wrongly excluded from the preferred region for performance assessment.2) On the other hand, although [39] considers the DM's preference information in computation, it treats points outside the preferred region equally redundant, e.g., in Fig. 2(b), no point in S 2 will be considered in performance assessment.Considering the example in Fig. 1(b), all ten point sets, except S 2 , cannot get any meaningful metric value.This gives the DM a wrong information that S 1 to S 10 , except S 2 , are equally bad.

C. Intuitions of MCDM Approach
In the MCDM literature, there exists a number of methods for finding a preferred solution set according to the DM supplied reference point.In this subsection, we describe the basic idea of the ASF method [45], which is the foundation of our proposed R-metric, in detail.In particular, The ASF  considered in this paper is formulated as follows: where z r is the reference point that represents the DM's aspiration level for each objective, and w is the weight vector that implies the relative importance of objectives.Based on the ASF, each objective vector has a projection, called iso-ASF point, on the reference line originated from z r and along w, as shown in Fig. 3. Specifically, for a point a ∈ Θ, its corresponding iso-ASF point a l is calculated as: where δ = max 1≤i≤m ai−z r i wi .This iso-ASF point gives us an information about the closeness of a to z r along the preferred direction w.Note that not all Pareto-optimal solutions are equally important when considering the DM's preference information.Based on the supplied reference point and a preferred direction, ASF is able to rank all Pareto-optimal solutions.As shown in Fig. 3, for a, any point on its ASF contour line (e.g., point b) has the same ASF value, i.e., they are equally good and have the identical rank.For another point c, its iso-ASF point is c l .Comparing to a l , c l is closer to z r along the preferred direction.Thus, c should have a better rank than a and b.Another nice property of this ASF-based ranking concept is its scalability to many-objective problems.

IV. R-METRIC CALCULATION PRINCIPLE
The basic idea of our proposed method, denoted as Rmetric, is to use an MCDM approach to pre-process the preferred solution set according to the DM supplied preference information.Thereafter, regular metrics, e.g., IGD and HV, can be applied for performance assessment.Note that the R-metric is specifically designed for evaluating the performance of a preference-based EMO algorithm using one or more reference points.In particular, we assume that the DM prefers the solutions lying toward the preferred direction, represented as a direct objective-wise weighting information or a worst point.In the R-metric calculation, the DM is required to provide three parameters relating to his/her preference information: i) a reference point z r that represents his/her aspiration level or desired value for each objective, ii) a worst point z w or a weight vector w that specifies the relative importance of each objective, and iii) a relative extent of the ROI, denoted as ∆ (0 < ∆ ≤ 1).Note that most of these parameters are used to elicit the DM's preference information and to help the preference-based optimization procedure find a set of trade-off solutions in the ROI.Our proposed R-metric calculation is simple in principle and its high level flowchart is given in Fig. 4. In the following paragraphs, we first describe each step in detail.Then, we provide some further comments followed by a time complexity analysis.

A. Descriptions of Each
Step 1) Prescreening Procedure: In multi-objective optimization, only the non-dominated solutions are of interest to the DMs and are meaningful for performance assessment.Assume that there are L (L ≥ 1) preferred solution sets (denoted as S 1 , • • • , S L ), obtained by L different preference-based EMO algorithms, at hand.We at first merge these L preferred solution sets into a composite set S c .For each S i , i ∈ {1, • • • , L}, only the non-dominated solutions, comparing to those in S c , are retained for the R-metric calculation.The pseudo-code of this prescreening procedure is given in Algorithm 1.

Algorithm 1: Prescreening Procedure
2) Pivot Point Identification: As the name suggests, the pivot point (denoted as z p ) of a given set of preferred solutions (denoted as S) is used as the representative that reflects the overall satisfaction of S with respect to the DM supplied preference information.In this paper, we use the best solution with respect to (4) to serve this purpose and thus z p is: 3) Trimming Procedure: Instead of the whole PF, the ROI is a bounded region, i.e., a part of the PF, given the DM's preference information.Only solutions located in the ROI are of interest to the DM.This paper defines the ROI approximated by S as the cubic that is centered at the pivot point and is with a side length ∆.Only solutions located in this approximated ROI are valid for performance assessment.The pseudo-code of this trimming procedure is given in Algorithm 2.

Algorithm 2: Trimming Procedure
Input: Preferred solution set S, ROI's relative extent ∆ Output: 6 return S 4) Solution Transfer: This step is the main crux of our R-metric by which the trimmed points are transferred to a virtual position.Then, we can assess their closeness to the ROI along the preferred direction.To this end, we first compute the iso-ASF point of z p (denoted as z l ) on the reference line connecting z r and z w .According to equation (5), this requires to identify the objective k that contributes to the ASF value: Then, we can compute z l as: where i ∈ {1, • • • , m}.Thereafter, all trimmed points are transferred, along the direction vector z l −z p with the distance z l − z p , to a virtual position.The pseudo-code of this solution transfer procedure is given in Algorithm 3.

5) R-metric Calculation:
In this paper, we choose the IGD and HV as the baseline metrics to evaluate the quality of a preferred solution set.The resulting R-metric is thus denoted as R-IGD or R-HV depending on the chosen baseline.For the R-HV, we simply compute the hypervolume of the transferred points with respect to the worst point z w .For the R-IGD, we need to pre-process P * beforehand.More specifically, we first use the method developed in Section IV-A2 to identify Algorithm 3: Solution Transfer Input: Preferred solution set S Output: Processed S Shift S(i) along the direction vector z l − z p ; 6 return S the pivot point of P * .Then, we use the trimming procedure suggested in Section IV-A3 to trim the points outside the ROI along the PF.In the end, the remaining points form the new P * for the R-IGD calculation.

B. Further Comments
1) In this paper, we set z w = z r + 2.0 × w for proof of principle studies, where w is an unit vector.This w setting implies that all objectives are equally important.However, in practice, different objectives might have various importance to the DM.For example, if we set T , the first objective is assumed to be twice less important than the second one.In particular, the importance is in the inverse order of the weights.More detailed discussions on an unequal weight setting can be found from Section II of the supplementary file.
2) In practice, the DM has no idea about the range of the whole PF, not to mention the extent of ROI.Thus, ∆ plays as an approximate expectation of the relative extent of ROI comparing to the whole PF.Note that the objective space is assumed to be normalized to [0, 1].
3) The trimming procedure penalizes the solution set, in term of the diversity, for having an excessive extent or deviating from the ROI.For example, in Fig. 5(a From the DM's perspective (given ∆ = 0.3), S 2 is preferable over S 1 .After the trimming procedure, |S 1 | reduces to 5 while |S 2 | is still the same.Accordingly, S 1 will sacrifice its diversity when calculating the Rmetric value.As for another example shown in Fig. 6(a) and Fig. 6(b), S 1 and S 2 again have the same cardinality.Since S 2 deviates from the ROI, its pivot point is identified as one of its extreme.After the trimming procedure, solutions far away from the ROI are excluded from further consideration.4) Given the DM's preference information, the convergence is not only the closeness to the PF, but also the closeness  of the transferred points to the ROI along the preferred direction.This re-definition of convergence is fulfilled by transfering points to a virtual position along the iso-ASF line between z p and its iso-ASF point z l .Let us consider the example shown in Fig. 6 again.The pivot point of S 1 is exactly its iso-ASF point, since this pivot point lies exactly on the reference line connecting z r and z w .In this case, solutions of S 1 stay in their original positions after the transfering procedure.In contrast, since the pivot point of S2 deviates from z r , the remaining solutions of the processed S 2 after the trimming procedure are shifted, along the direction z l −z p , to a virtual position away from the PF, as shown in Fig. 6(c).

C. Time Complexity Analysis
We now analyze the time complexity of R-metric calculation.The prescreening procedure requires at most O(L 2 N 2 ) dominance comparisons.The identification of a pivot point z p from N trade-off points requires O(N ) computations.The trimming procedure can be achieved in O(N ) computations.Transfering at most N points towards the reference line also requires O(N ) computations.The complexity of R-IGD or R-HV computation in last step is the same as the regular IGD or HV.In summary, the time complexity of R-IGD is max{O(N M ), O(L 2 N 2 )}, where M is size of the newly formed P * after the filtering procedure; the time complexity of R-HV is max{O(N m−1 ), O(L 2 N 2 )} by using the method developed in [31].

V. EMPIRICAL STUDIES ON ARTIFICIAL SCENARIOS
In this section, we verify the proposed R-metrics on several artificial scenarios.First of all, the effectiveness of R-metrics is investigated on the two toy examples introduced in Section III.Afterwards, four popular benchmark problems with different population distributions and different reference point settings are used to further examine R-metrics.Next, we investigate the applicability of R-metrics on a problem with disconnected PFs.Finally, R-metrics are used to evaluate the quality of a preferred solution set with respect to multiple reference points.

A. Investigations on Toy Examples
For the first example discussed in Section III, as shown in Fig. 7(b), S 1 and S 2 are rearranged according to our proposed R-metric calculation procedure.Since S 2 does not fully meet the DM's preference, points therein are penalized in terms of convergence and diversity.More specifically, after the filtering procedure, only three points have been left in the processed S 2 .Then, the processed S 2 has been shifted to an area away from the PF in the solution transfer step.Thereafter, the Rmetric values of S 1 should be better than that of S 2 , i.e., R-IGD(S 1 ) = 0.0222 and R-IGD(S 2 ) = 0.0452; R-HV(S 1 ) = 1.1550 and R-HV(S 2 ) = 1.1108.As for the second example discussed in Section III, now shown in Fig. 7(c), all ten point sets have been shifted to their corresponding virtual positions along the preferred direction (the dashed arrow denotes the transfer direction for each point set).Now R-IGD and R-HV successfully figure out the best point set S 2 .This is because the processed S 2 , denoted as P 2 , is closest to the ROI.In addition, S 1 and S 3 , which have a similar distance to the ROI after data pre-processing, achieve similar R-IGD and R-HV values.As for S 4 to S 10 , their R-metric values become worse with their increasing distances to the ROI.

B. Investigations on Benchmark Problems
In this subsection, we investigate the effectiveness of Rmetrics on four classic benchmark problems, including twoobjective ZDT1 and ZDT2 [46], and three-objective DTLZ1 and DTLZ2 [47] 2 .For ZDT1 and ZDT2, ten sets of points, S 1 to S 10 , are sampled from different regions of their PFs.Each set contains 40 evenly distributed points.For DTLZ1 and DTLZ2, we sample 21 sets of points from different regions of their PFs, where each set has 25 evenly distributed points.For R-IGD calculation, we first sample 10,000 evenly distributed points from the corresponding PFs.Only those located in the ROI (∆ is set as 0.2) are chosen to form P * in the R-IGD calculation.
1) Two-objective cases: As shown in Fig. 8(a), we investigate three kinds of reference points for ZDT1.
• Unattainable reference point z r1 = (0.2, 0.5) T : From the results shown in Fig. 8(b), we find that both R-IGD and R-HV are able to make a reasonable assessment on the quality of a point set with respect to the DM supplied preference information.For example, S 3 resides in the ROI with respect to z r1 .As shown in Fig. 8 Fig. 6: Illustration of the R-metric calculation principle.Note that the solution transfer is according to iso-ASF lines for the ASF with z r as the reference point and z w − z r as the weight vector.Fig. 8: Variations of R-IGD and R-HV with respect to an unattainable reference point z r1 = (0.2, 0.5) T , an attainable reference point z r2 = (0.6, 0.3) T and a outside reference point z r3 = (1.1,−0.1) T on ZDT1 problem.
2) Three-objective cases: As shown in Fig. 9, we investigate two kinds of reference points for DTLZ1.
• Unattainable reference point z r1 = (0.05, 0.05, 0.2) T : Different from the two-objective case, in a threedimensional space, points are distributed in an ambient space where the neighboring points can be in various directions.This explains the significant fluctuations of R-IGD and R-HV curves shown in Fig. 9(c).Nevertheless, the point set most relevant to the DM supplied preference information, i.e., S 17 , obtains the best R-IGD and R-HV values.Furthermore, we also notice that the point sets close to the reference point obtain similar R-metric values, e.g., S 9 , S 13 , S 14 , S 16 and S 18 in DTLZ1 get similar R-IGD and R-HV values as shown in Fig. 9(d).
• Attainable reference point z r2 = (0.3, 0.3, 0.2) T : From the results shown in Fig. 9(d), we find that R-metrics are able to provide a reliable assessment on different point sets.S 9 which is closest to this reference point, obtains the best R-IGD and R-HV values.

C. Investigations on Problems with Disconnected PFs
Although the investigations in Section V-B1 and Section V-B2 are based on problems with continuous PFs, our proposed R-metrics are also effective for problems with disconnected PFs.For proof of principle purpose, this section chooses ZDT3, whose PF consists of five disconnected segments, for investigation.Five point sets, S 1 to S 5 as shown in Fig. 10(a), are respectively sampled from each segment and z r = (0.5, 0.0) T .In addition, we also plot the transferred point sets, denoted as green circles, for illustration (the dashed arrow denotes the transfer direction for each point set).From the results shown in Fig. 10(b), we find that the R-metrics are able to provide a reasonable assessment for problems with disconnected PFs.In particular, S 3 , which is closest to z r , obtains the best R-metric values.Note that the transferred points of S 3 still lie on S 3 and is indeed closest to z r .From Fig. 10(a), we find that S 4 is also very close to z r , the same for its transferred points.This explains its similar R-metric values with respect to S 3 .On the other hand, S 1 , farthest away from z r , obtains the worst R-metric values.

D. Investigations on Multiple Reference Points
In practice, the DM might not be sure about his/her exact preference beforehand.The DM would like to simultaneously explore several ROIs by supplying T (1 < T ≤ |S|) reference points at the same time.Accordingly, a small modification is required to adapt the R-metrics to this circumstance.Generally speaking, when there are more than one reference point, the pre-processing procedure should take each reference point into consideration separately.At first, we use the prescreening procedure introduced in Section IV-A1 to remove those dominated solutions from further R-metric calculation.Afterwards, we apply the k-means [48] algorithm to divide the remaining solutions into T clusters.Then, each cluster is associated with a reference point closest to its centroid.For each cluster, we use Step 2 to Step 4 introduced in Section IV-A to pre-process the points and transfer them to a virtual position.Finally, we combine all pre-processed points together and evaluate their R-metric values as a whole.In particular, the worst point for R-HV calculation is chosen as the nadir point of the worst point for each reference point; for R-IGD calculation, P * needs to be pre-processed for each reference point separately according to the method introduced in Section IV-A.
To validate the effectiveness of our strategy, we take the example in Fig. 11(a) for investigation.Here we set two reference points z r1 = (0.2, 0.5) T and z r2 = (0.6, 0.3) T , simultaneously, in the objective space.Five point set combinations, i.e., (S 3 , S 6 ), (S 1 , S 2 ), (S 4 , S 5 ), (S 3 , S 4 ), (S 6 , S 7 ), are chosen as the candidates for performance assessment.From the results shown in Fig. 11(b), we find that (S 3 , S 6 ) obtains the best R-IGD and R-HV values.From Fig. 11(a), we can see that S 3 and S 6 are in the corresponding ROI of z r1 and z r2 , respectively.In contrast, (S 1 , S 2 ) obtains the worst Rmetric values.From Fig. 11(a), we find that both S 1 and S 2 are close to z r1 , but are far away from z r2 .Therefore, the R-metric values with respect to z r1 can be acceptable, whereas the R-metric values with respect to z r2 should be significantly bad.This makes its final R-metric values become the worst.Notice that sometimes DMs tend to use multiple reference points to discretely approximate the ROI.Therefore, the regions between these supplied reference points are also very important.From Fig. 11(b), we found that the R-IGD and R-HV values obtained by (S 4 , S 5 ) and (S 3 , S 4 ) are similar and are only inferior to (S 3 , S 6 ).In contrast, although (S 6 , S 7 ) has some part locating in the ROI of z r2 , S 7 is far away from z r1 .This makes its R-metric values not as good as (S 4 , S 5 ) and (S 3 , S 4 ).In summary, due to the existence of multiple reference points, a good point set should have a promising satisfaction for every reference point.In Section III of the supplementary file, we provide some further discussions on the "scalability" issue of R-metric.

VI. EMPIRICAL STUDIES ON PREFERENCE-BASED EMO ALGORITHMS
In this section, we apply our proposed R-metrics to evaluate the performance of the following four preference-based EMO algorithms.Notice that all multi-objective optimizers use reference points to articulate the DM's preference information, and all of them, except g-NSGA-II, are capable of handling more than one reference point.Here we choose the classic ZDT and DTLZ test suites as benchmark problems.For R-IGD calculation, similar to Section V, we at first sample 10,000 points from the corresponding PF.Then, points located in the ROI (∆ is set as 0.2) are used to form P * .1) r-MOEA/D-STM [49]- [51]: It is an extension of our recently proposed MOEA/D variant based on stable matching model [52].Different from the original MOEA/D, where the selection of next parents is merely determined by the ASF value of a solution, MOEA/D-STM treats subproblems and solutions as two sets of agents and considers their mutual-preferences simultaneously.In particular, the preference of a subproblem over a solution measures the convergence issue, while the preference of a solution over a subproblem measures the diversity issue.Since the stable matching achieves an equilibrium between the mutual-preferences between subproblems and solutions, MOEA/D-STM strikes a balance between convergence and diversity of the search process.In order to incorporate the DM's preference information into the search process, we need to specify a population of weight vectors spread around reference points.2) R-NSGA-II [22]: It is a variant of NSGA-II which modifies the crowding operator based on the idea of classic reference point based method.More specifically, solutions close to reference points have a larger chance to survive in the selection procedure.In addition, R-NSGA-II employs an -clearing idea to control the spread of the final obtained solutions in the ROI. 3) g-NSGA-II [53]: It modifies NSGA-II by replacing the Pareto dominance with a new dominance relation, called g-dominance.More specifically, g-dominance uses a reference vector to represent DM's desired value for each objective, i.e., aspiration levels.Solutions either satisfying all aspiration levels or fulfilling none of the aspiration levels are preferable over those merely satisfying some aspiration levels.4) r-NSGA-II [23]: It uses a new dominance relation, called r-dominance, to replace the Pareto dominance in NSGA-II.When two solutions are non-dominated in terms of Pareto dominance, the one closer to the reference point is preferable.Moreover, its search behavior is adjusted by two parameters: one is the non-r-dominance threshold δ that controls the spread of the obtained solutions; the other is the ASF weight vector that controls the relative importance of different objectives.All multi-objective optimizers use simulated binary crossover (SBX) [54] and polynomial mutation [55] as the reproduction operators.For proof of principle purpose, all four algorithms assume all objectives are equally important in our empirical studies.Note that although r-MOEA/D-STM, and r-NSGA-II are able to control the spread of obtained solutions, there is no specific guideline to set the corresponding parameter.Due to the page limit, the parameter settings and the specifications of reference points for different test instances are given in Section V of the supplementary file.

A. Empirical Studies on Benchmark Problems
Each algorithm is performed 31 independent runs, and the R-metric values for two different reference point settings are respectively given in Table I and Table II.In particular, the best mean metric values are highlighted in bold face with gray background, and the Wilcoxon's rank sum test at a 0.05 significance level is used to compare the statistical significance of the difference between the best mean metric value and the others.To have a visual comparison, we also plot the final solutions obtained by different algorithms having the best R-IGD value.Due to the page limit, they are presented in Section VI of the supplementary file.In the following paragraphs, we will separately discuss the effectiveness of the R-metrics for evaluating the performance of different algorithms on problems with continuous and disconnected PFs.
1) Problems with continuous PFs: ZDT1 and ZDT2 are two relatively simple test instances, where all four algorithms do not have too much difficulty in finding solutions around the DM supplied reference points.However, as shown in Fig. 6(d) and Fig. 7(d) of the supplementary file, the convergence of solutions found by r-NSGA-II is not satisfactory.This makes most of its obtained solutions be trimmed during the prescreening step of the R-metric calculation.Accordingly, its R-IGD and R-HV values are the worst among all four algorithms.As for r-MOEA/D-STM and g-NSGA-II, their performance is visually similar on finding preferred solutions for ZDT1 and ZDT2.However, the R-IGD and R-HV values obtained by g-NSGA-II are better than r-MOEA/D-STM in 6 out of 8 comparisons.Let us look at Fig. 6(a) and Fig.
of the supplementary file, for the unattainable reference point, all solutions found by r-MOEA/D-STM well converge  to the PF whereas some solutions found by g-NSGA-II are not fully converged.In this case, the R-metric values obtained by r-MOEA/D-STM are better than g-NSGA-II.For the other three cases (i.e., ZDT1 with an attainable reference point and ZDT2 with both unattainable and attainable reference points), although solutions found by r-MOEA/D-STM well converge to the PF, their overall distributions deviate from the ROI a bit.In contrast, solutions found by g-NSGA-II not only converge to the PF, but also have a well concentration on the ROI.Therefore, it should be preferable and our proposed R-metrics also make a reasonable assessment.ZDT4 has the same PF shape as ZDT1, but it is more difficult due to the presence of many local optima.All algorithms, except r-MOEA/D-STM, cannot find solutions that fully converge to the PF.Accordingly, r-MOEA/D-STM obtains the best Rmetric values among all four algorithms.r-NSGA-II obtains the worst R-metric values since it only finds solutions close to the unattainable reference point.ZDT6 has a concave PF shape and a biased distribution in the search space.It is interesting to note that although the solutions found by r-MOEA/D-STM not only well converge to the PF but also have a uniform distribution, the R-metric values obtained by r-MOEA/D-STM are not as good as R-NSGA-II and g-NSGA-II.This might be explained as the representative point found by r-MOEA/D-STM is inferior to that found by R-NSGA-II and g-NGSA-II.In this case, the solution transfer step of the R-metric calculation transfer the solutions found by r-MOEA/D-STM to a farther position.Due to the poor convergence property, the R-metric values obtained by r-NSGA-II are still the worst.
The PF of DTLZ1 is a simplex having an intercept of 0.5 at each coordinate.Due to the presence of 11 5 − 1 local PFs, DTLZ1 causes difficulties for an EMO algorithm in reaching the global PF.From Fig. 10 of the supplementary file, only r-MOEA/D-STM well approximates the ROIs.Accordingly, it obtains the best R-metric values.It is interesting to note that solutions found by R-NSGA-II seem to have a nice concentration on the DM supplied reference points, but their spreads are too narrow.In contrast, neither g-NSGA-II nor r-NSGA-II finds any reasonable solution.DTLZ2 is a relatively simple test instance, where all four algorithms do not have much difficulty in finding solutions close to the ROIs.As shown in Fig. 11 of the supplementary file, it is clear that solutions found by r-MOEA/D-STM are preferable over the other candidates.Accordingly, its obtained R-metric values are also the best.Although solutions found by R-NSGA-II well converge to the PF, their spreads are too narrow.This explains its unsatisfied R-metric values.DTLZ3 has the same PF shape as DTLZ2, but is with 3 10 − 1 local PFs.From Fig. 12 of the supplementary file, it is obvious that g-NSGA-II and r-NSGA-II have some difficulties in converging to the PF.Thus their Rmetric values are extremely poor.In contrast, solutions found by r-MOEA/D-STM and R-NSGA-II are similar to those in DTLZ2.DTLZ4 also has the same PF shape as DTLZ2, but it has a strong bias towards f 3 − f 1 plane.From Fig. 13 of the supplementary file, we can see that solutions found by r-MOEA/D-STM are significantly better than the other three algorithms.Accordingly, its R-IGD and R-HV values are the best.DTLZ5 and DTLZ6 are two degenerate problems, where the latter one has a strong bias away from the PF.From Fig. 14 of the supplementary file, we find that solutions obtained by r-MOEA/D-STM are preferable since they converge well to the PF and have a good approximation to the ROIs.Accordingly, its obtained R-metric values are also the best among four algorithms.Although solutions found by R-NSGA-II converge well to the PF, their spreads are too narrow.For DTLZ6, all four algorithms have difficulties in converging to the PF.Solutions found by r-MOEA/D-STM seem to be closer to the PF and have a wide spread.Accordingly, it obtains the best R-metric values.
2) Problems with disconnected PFs: After the empirical studies on problems with continuous PFs, this subsection investigates the effectiveness of our proposed R-metrics on two problems with disconnected PFs.For ZDT3, as shown in Fig. 16 of the supplementary file, all four algorithms are able to find solutions close to the reference points, but those found by g-NSGA-II are visually better where most solutions converge to the PF and the spread is satisfactory.Accordingly, the R-metric values obtained by g-NSGA-II are better than the other three algorithms.For DTLZ7, as shown in Fig. 17 of the supplementary file, although solutions found by r-NSGA-II have a well focus on the ROIs, they are away from the PF.This explains its poorest R-metric values.All the other three algorithms find some solutions on the PF segments outside the ROIs.In particular, solutions found by r-MOEA/D-STM spread over all four PF segments, while those found by g-NSGA-II are the visual best as shown in Fig. 17 of the supplementary file.Accordingly, g-NSGA-II obtains the best R-IGD and R-HV values on both unattainable and attainable reference points.

B. Empirical Studies on Many-objective Problems
Recently, problems with more than three objectives have become one of the hottest topics in EMO.Due to the expansion of the objective space in size, many-objective problems cause several challenges to the traditional EMO algorithm design.For example, the mushrooming of non-dominated solutions in a population significantly weaken the selection pressure of Pareto-based EMO methods, and the sparse distribution of a number of solutions in a high-dimensional space makes the density estimation and diversity management become even more difficult than the two-and three-objective cases.As discussed in [56], instead of searching for the whole PF, finding a preferred subregion satisfying the DM's preference information is more practical in many-objective optimization.
In this section, we investigate the scalability of R-metrics for quantitatively assessing the quality of a preferred solution set in a high-dimensional space.DTLZ2 with 5 and 10 objectives are chosen as the test instances.The reference point is set as z r = (0.1, 0.3, 0.2, 0.4, 0.2) T for the 5-objective test instance, and z r = (0.3, 0.3, 0.3, 0.1, 0.3, 0.55, 0.35, 0.35, 0.25, 0.45) T for the 10-objective case.For the R-IGD calculation, we employ the method suggested in [56] to sample 101,270 points from DTLZ2's PF in the 5-objective case, and 3,124,550 points in the 10-objective case.Moreover, ∆ is increased to 0.5 in the R-metric calculation due to the sparse distribution of solutions in a high-dimensional space.For preferencebased EMO algorithms, all parameters are kept the same as Section VI, except the number of function evaluations.Specifically, it is set as 80,000 and 150,000 for the 5-and 10objective case respectively.Table III shows the comparisons of R-metric values and the parallel coordinates of the populations with the medium R-IGD value are plotted in Fig. 18 and Fig. 19 of the supplementary file.In particular, the red dotted line represents the reference point.From the results shown in these two figures, it is clear that g-NSGA-II and r-NSGA-II are the worst optimizers.This observation is also confirmed by their worst R-IGD and R-HV values.Both r-MOEA/D-STM and R-NSGA-II are able to find solutions around the ROIs.However, solutions found by R-NSGA-II almost concentrate on the reference point, while those found by r-MOEA/D-STM have a wider spread.This explains the better R-HV values obtained by r-MOEA/D-STM in these two test instances.

C. Further Investigations
In this section, we further investigate some other interesting properties of R-metrics.ZDT1 and DTLZ2 are chosen as the test instances, since all four EMO algorithms have no difficulty on solving them.For each test instance, we keep a record of R-IGD and R-HV values of an intermediate population every 10 consecutive generations.Fig. 20 of the supplementary file plots the variations of R-IGD and R-HV values versus the number of generations on ZDT1 with z r = (0.3, 0.4) T .From this figure, we find that all algorithms, except r-NSGA-II, converge to their optimal R-IGD and R-HV values within a few generations.In contrast, the R-metric trajectories of r-NSGA-II grow slowly with the number of generations.Fig. 21 of the supplementary file plots some intermediate populations for different algorithms.From these four subfigures, we find that r-NSGA-II approximates the preferred region in a layerwise manner while the other three algorithms converge to the preferred region rapidly (with around 100 generations).These observations are in accord with the corresponding Rmetric trajectories.For DTLZ2, we have a similar observation.As shown in Fig. 22 of the supplementary file, the R-metric trajectories of r-MOEA/D-STM and R-NSGA-II converge to a stable value within a few generations.These observations are also validated by the plots of intermediate populations in Fig. 23(a) and Fig. 23(b) of the supplementary file.As for g-NSGA-II, we also notice some fluctuations in its Rmetric trajectories.This observation is also in line with the fluctuations of the evolutionary population as shown in Fig. 23(c) of the supplementary file.The R-metric trajectories of r-NSGA-II are rather rugged.From Fig. 23(d) of the supplementary file, we find that the intermediate populations of r-NSGA-II vibrate significantly during the search process.
From the above experiments, we have another interesting observation that some algorithms do not need the predefined number of generations to find preferred solutions.For example, as shown in Fig. 21(a) and Fig. 23(a) of the supplementary file, r-MOEA/D-STM only uses around 100 generations to converge to the preferred region for ZDT1 and around 80 generations for DTLZ2.Furthermore, an algorithm almost converges to the preferred region when the R-metric trajectories become stable.Based on these observations, we keep a record of the standard deviations of the R-IGD and R-HV values obtained by different algorithms for every 10 and 25 consecutive generations.In addition, we set two thresholds τ = 0.1 and τ = 0.01 and to see how many generations   an algorithm needs to have a R-metric's standard deviation less than τ .From the empirical results shown in Table V of the supplementary file, we observe that the standard deviation of R-IGD can be reduced to the given thresholds when the time window is set to 10 generations.More interestingly, the number of generations that makes the standard deviation of R-IGD reduce to 0.01 is similar to the required budgets of the corresponding algorithm converges to the preferred region.Moreover, we also notice that the standard deviation of R-HV cannot always be reduced to the given thresholds on DTLZ2.However, if we extend the time window to 25 generations, even the standard deviation of R-IGD cannot drop down to the expected thresholds in many cases.From this experiment, we find that our proposed R-metrics are not only reliable metrics to evaluate the performance of a preference-based EMO algorithm, more interestingly, the variation of a certain R-metric (e.g., R-IGD with a time window of 10 generations and standard deviation's threshold τ = 0.1) can be used as a stopping criterion in searching for a preferred solution set.

D. Influence of ∆
As described in Section IV, ∆ represents the DM's expectation of the ROI's relative extent comparing to the whole PF.It is also used to trim the irrelevant solutions for the R-metric calculation.A large ∆ means that the DM prefers solutions having a wide spread, while a small ∆ indicates that the the DM prefers solutions having a good concentration on the ROI.This section takes ZDT1 as an example to investigate the influence of ∆ on R-metric values, where ∆ varies from 0.1 to 1.0 with an increment of 0.1.Fig. 12 shows the variations of R-metric values obtained by four preference-based EMO algorithms for different ∆.
Let us start from z r = (0.3, 0.4) T .As shown in Fig. 12(a), the R-IGD value obtained by R-NSGA-II is worse than r-MOEA/D-STM and g-NSGA-II when ∆ is small.But it becomes the best in case ∆ is larger than 0.2.Moreover, the R-IGD value obtained by r-NSGA-II is the worst when ∆ is small.However, when ∆ is larger than 0.4, the R-IGD values obtained by r-MOEA/D-STM, g-NSGA-II and r-NSGA-II are almost the same.Let us refer to Fig. 6 of the supplementary file, as for z r = (0.3, 0.4) T , solutions found by R-NSGA-II and r-NSGA-II have a wider spread than those of r-MOEA/D-STM and g-NSGA-II.If the DM expects the ROI to be concentrated on his/her provided reference point, i.e., ∆ is set to be small, solutions found by r-MOEA/D-STM and g-NSGA-II are preferable.Accordingly, the R-metric values obtained by the previous two algorithms should be better.On the flip side, if the DM expects the ROI to be widely spread, i.e., ∆ is set to be large, solutions found by r-MOEA/D-STM and g-NSGA-II are not satisfactory any longer.Even though the solutions found by R-NSGA-II and r-NSGA-II are not well converged, their wide spread meet the DM's expectation and provide him/her more choices.This explains their better Rmetric values when ∆ becomes large.As for z r = (0.65, 0.3), since solutions found by g-NSGA-II not only well converge to the PF, but also have a wide spread around the DM supplied reference point, its R-metric values are constantly better than the other competitors.In contrast, although solutions obtained by R-NSGA-II still have a wide spread along the PF, their convergence is poor.Therefore, the R-metric values of R-NSGA-II are worse than g-NSGA-II.
It is worth noting that although some preference-based EMO algorithms (e.g., [22], [23] and [49]) claim to be able to control the ROI's extent by setting an appropriate parameter, to the best of our knowledge, there is no rule-of-thumb to set the corresponding parameter.We believe that ∆, used in the Rmetric calculation, is able to provide a general guideline to tune the corresponding parameters in a posterior manner.

VII. CONCLUSIONS AND FUTURE WORKS
Given the DM's preference information, approximating a partial and preferred PF, rather than the whole PF, has been one of the most important topics in the modern EMO research.Besides developing effective algorithms that drive solutions towards the ROI, how to evaluate the quality of a set of preferred trade-off solutions is of the same importance but has rarely been studied in this literature.In this paper, we presented a systematic way to evaluate the quality of a set of preferred solutions obtained by a preference-based EMO using reference points.More specifically, we pre-process the preferred solutions, according to a MCDM approach, before using a regular metrics for performance assessment.In particular, according to the DM's expectation of the ROI's extent, our proposed R-metric has a trimming procedure that penalizes the population diversity of a preferred solution set having an excessive extent.Furthermore, inspired by the ASF-based ranking from the MCDM literature, our proposed R-metric has a transferring procedure that transfers the preferred trade-off solutions to a virtual position according to their satisfaction degree to the DM supplied preference information.Extensive experiments on several artificial scenarios and benchmark problems fully demonstrate the efficacy of our proposed Rmetrics for evaluating the quality of a preferred solution set according to the DM supplied preference information.
This work is a very first attempt to systematically and quantitatively evaluate the quality of a preferred solution set.Much more attention and effort should be required on this topic.In future, we want to explore the following issues: 1) Note that there is no single way of expressing the DM's preference information, so we may not be able to expect a universal way for performance assessment.This paper assumes that the DM's preference information is expressed in terms of a reference point.However, there exist other types of preferences to which our proposed R-metric may not be directly useful.To solve this drawback, one may consider the method presented in [57] to adapt the R-metric to other types of preferences.2) As described in Section IV-B, the setting of ∆ assumes the objective space is normalized to [0, 1].However, in practice, this assumption might not always hold.It is interesting to investigate other method to specify the DM's expectation of the relative extent of ROI with respect to objectives in different scales.3) Other than the weighted Chebyshev function used in our proposed R-metric, there have been some other ASF formats [58], [59] proposed in the multi-objective optimization field.Different ASFs have distinct contour lines and characteristics.It is interesting to study the influences and effects of using different ASFs.4) To avoid wasting computational resources and to examine the formal convergence and optimality achieved by an EMO algorithm, the research on the online stopping criteria (OSC) has obtained increasing popularity.Empirical studies in Section VI-C shows a simple application of our proposed R-metrics for designing OSC.Nevertheless, more sophisticated and advanced techniques [60] are worth being studied in future.5) In addition to the empirical studies, it is of importance to have a rigorous analysis of the optimal archive with respect to the R-metric.This is not only useful for better understanding the behaviour of R-metric itself, but also for providing foundations in the case of using the Rmetric to design an indicator-based algorithm in future.
IGD and HV values for point sets in Fig. 1(b).

Fig. 1 :
Fig.1: Variations of IGD and HV values with respect to a DM specified reference point z r = (0.16, 0.9).

Fig. 3 :
Fig. 3: Illustration of the ASF ranking method.Considering the DM supplied reference point z r , c is preferable over a and b; while a and b are equally important.
(b), the R-IGD and R-HV values obtained by S 3 are indeed the best.For the other point sets, the farther away from the ROI, the worse the R-metric values are.•Attainable reference point z r2 = (0.6, 0.3) T : Similar to the observations in the above scenario, the point set closest to the reference point, i.e., S 6 , obtain the best R-IGD and R-HV values.And the R-metric values also depend on the distance to the ROI.• Extreme reference point z r3 = (1.1,−0.1) T : Since this reference point lies on one extreme, it is expected that the point set at the respective extreme boundary, i.e., S 10 , is desirable.From the results shown in Fig.8(d), we find that our proposed R-metrics are able to capture this fact and their trajectories show a monotone property.
R-IGD and R-HV values for each set.

Fig. 7 :
Fig. 7: Illustration of R-IGD and R-HV values with respect to a DM supplied reference point z r = (0.16, 0.9).

Fig. 10 :
Fig. 10: Variations of R-IGD and R-HV with respect to a reference point z r = (0.5, 0.0) T on ZDT3 problem.

TABLE I :
Comparison results of R-IGD and R-HV values on unattainable reference point.
† † denotes the best mean metric value is significantly better than the others according to the Wilcoxon's rank sum test at a 0.05 significance level.indicatesall obtained solutions are dominated by the other counterparts, and thus no useful solution can be used for R-metric computation.

TABLE II :
Comparison results of R-IGD values on attainable reference point.
† denotes the best mean metric value is significantly better than the others according to the Wilcoxon's rank sum test at a 0.05 significance level.indicatesall obtained solutions are dominated by the other counterparts, and thus no useful solution can be used for R-metric computation.

TABLE III :
Comparisons of R-IGD and R-HV values on DTLZ2 with 5 and 10 objectives.
† † denotes the best mean metric value is significantly better than the others according to the Wilcoxon's rank sum test at a 0.05 significance level.