Mitigating Unfairness via Evolutionary Multi-objective Ensemble Learning

In the literature on mitigating unfairness in machine learning, many fairness measures have been designed to evaluate the predictions of learning models and to guide the training of fair models. It has been shown theoretically and empirically that there exist conflicts and inconsistencies among accuracy and multiple fairness measures: optimising one or several fairness measures may sacrifice or deteriorate other measures. Two key questions therefore arise: how to simultaneously optimise accuracy and multiple fairness measures, and how to optimise all the considered fairness measures more effectively. In this paper, we view the problem of mitigating unfairness as a multi-objective learning problem, considering the conflicts among fairness measures. A multi-objective evolutionary learning framework is used to simultaneously optimise several metrics (including accuracy and multiple fairness measures) of machine learning models. Then, ensembles are constructed from the learned models in order to automatically balance different metrics. Empirical results on eight well-known datasets demonstrate that, compared with the state-of-the-art approaches for mitigating unfairness, our proposed algorithm can provide decision-makers with better tradeoffs among accuracy and multiple fairness metrics. Furthermore, the high-quality models generated by the framework can be used to construct an ensemble that automatically achieves a better tradeoff among all the considered fairness metrics than other ensemble methods. Our code is publicly available at https://github.com/qingquan63/FairEMOL.


I. INTRODUCTION
Artificial intelligence ethics, including fairness, has become an important research topic [1], [2], [3]. Fairness is a significant element of artificial intelligence ethics and refers to the "absence of any prejudice or favoritism toward an individual or a group based on their inherent or acquired characteristics" [4] in the context of decision making. Due to the potential bias and discrimination in training data and algorithms [2], unfair data-driven models may be learned and unfair decisions may be made.
For over 50 years, myriad types of fairness quantification have emerged from many disciplines [1], e.g., employment, education, and finance, aiming to define and evaluate (un)fairness. Since fairness in different contexts can be translated into different quantitative definitions that emphasize different perspectives, there is a lack of consensus among different measures, and no single measure has been accepted as a universal notion of fairness quantification [1], [2], [5]. Research has shown that different fairness measures often conflict [1], [2], [5]. In other words, if the performance of prediction results on a certain fairness measure is improved, the predictions may perform worse on at least one other fairness measure.
Many approaches have been proposed to mitigate unfairness in machine learning (ML) [1], [2], [5]. Since data-driven models affected by unfair data produce unfair predictions, mitigating bias may sacrifice accuracy. Therefore, the main challenge of mitigating unfairness is how to achieve better tradeoffs between the accuracy and fairness of learning models. Various mechanisms focus on one or more fairness metrics to mitigate bias. Two dilemmas exist for mitigating unfairness in the context of ML [2]: 1) the conflict between accuracy and fairness and 2) the conflict among multiple fairness measures, both of which have been shown theoretically and empirically [2], [6].
Existing work [2], [7], [8] often performs a weighted average of metrics (including accuracy and one or more fairness metrics) to deal with the two dilemmas. For the first dilemma, due to biased and unfair training data, improving the fairness of models may degrade accuracy. Some approaches for mitigating unfairness consider a single fairness measure as a regularization term [9] or a constraint [10] to obtain a tradeoff between model accuracy and fairness. Regarding the latter dilemma, due to the incompatibility and complementarity among different fairness measures, such as individual fairness and group fairness [11], the weighted sum approach has also been used to combine two metrics into one [2].
However, two challenges arise for methods using the weighted sum approach. First, the weights of the different objectives (accuracy or fairness metrics) are difficult to determine, and a slight difference in the weights may lead to a large difference in the performance of the ML model. Second, the weighted sum approach provides only one ML model with one specific tradeoff among the conflicting metrics.
A group of diverse models with different tradeoffs among accuracy and multiple fairness measures is therefore needed by decision makers. Such a diverse set of fair models not only helps decision makers make a more informed choice but also facilitates the formation of an ensemble of fair ML models.
A few studies [12], [13], [14], [15] view the unfairness mitigation problem as a multiobjective optimization problem. For example, the studies [12] and [14] proposed to convert the gradients of multiple objectives (fairness measures) into one loss to train a model, where the weights among the objectives are adaptively determined. In [13], we proposed a framework based on multiobjective evolutionary learning and revealed that it can simultaneously optimize accuracy and multiple fairness measures and obtain a group of diverse models. In this article, we study this framework in depth. First, we evaluate it on a broader range of datasets. Second, we investigate a more appropriate and representative set of metrics to use as objectives. Third, we develop ensembles of fair models to improve both accuracy and fairness. Our study is organized around the following four research questions. (Q1) Can multiobjective learning simultaneously optimize several fairness measures without sacrificing accuracy? (Q2) Can we obtain a group of diverse models by applying multiobjective learning? (Q3) Can multiobjective learning improve all fairness measures, including those not used in model training? (Q4) Can multiobjective learning generate an ensemble model combined from base models to balance accuracy and multiple fairness measures?
The novel contributions beyond [13] are as follows. 1) We apply a multiobjective evolutionary learning framework [13] to train fairer ML models, in which model accuracy and multiple fairness measures are considered simultaneously during training. We have implemented our framework in two scenarios. Empirical results and comprehensive analyses on eight well-known benchmark datasets reveal that our framework can simultaneously optimize the accuracy and multiple fairness objectives (up to eight different fairness metrics).
2) The obtained models can act as good candidates for human decision makers with different preferences and can be fully utilized to create an ensemble with a good tradeoff between accuracy and fairness.
3) Our framework can improve fairness according to a broad range of fairness metrics, including those not used in our multiobjective learning algorithms. The robustness of our proposed multiobjective learning approach has been shown in the following sense: our learned models performed well not only on the accuracy and the eight fairness metrics used in training but also on eight other fairness metrics that were never used in training. In other words, our models are robust to the choice of fairness metrics used to assess them.

The remainder of this article is organized as follows. Section II presents the background. The framework and the algorithms designed based on it are presented in Section III. Section IV presents and discusses the empirical study. Section V concludes this article.

II. BACKGROUND
In this section, we first introduce the definitions of different fairness metrics and the relationships among them. Then, existing approaches to mitigating unfairness, including multiobjective learning and ensemble methods, are presented.

A. Measuring Fairness in Machine Learning
Many fairness measures have been defined to quantify (un)fairness [2] from the perspective of ethics. While some measures are positively correlated [16], others conflict with each other [2]. So far, no single fairness measure has been accepted as a universal notion of fairness quantification, since different perspectives on fairness can be translated into different quantitative definitions [1], [2], [4], [5], [17]. Generally speaking, existing fairness measures can be divided into two conflicting but complementary categories [2], [11]: 1) individual fairness and 2) group fairness. Individual fairness means that similar individuals should be treated similarly, while group fairness considers the treatment of different groups defined by sensitive attributes. Typically, sensitive (also called protected) attributes are traits considered discriminative by law, such as gender, race, and age.
From the perspective of economic and social welfare, the work [11] proposed to quantify individual unfairness (f_I) and group unfairness (f_G) using inequality indices, namely, generalized entropy indices. Specifically, each prediction of an algorithm is mapped to a benefit, and the degree of inequality of all the benefits is measured at the individual and group levels, respectively, formulated as [11]

f_I = \frac{1}{n\alpha(\alpha-1)} \sum_{i=1}^{n} \left[\left(\frac{b_i}{\mu}\right)^{\alpha} - 1\right] \quad (1)

f_G = \frac{1}{n\alpha(\alpha-1)} \sum_{g=1}^{|G|} n_g \left[\left(\frac{\mu_g}{\mu}\right)^{\alpha} - 1\right] \quad (2)

where |G| is the number of groups, n_g refers to the size of group g (e.g., male, female), n is the number of observations (i.e., n = \sum_g n_g), and α is a positive constant. In (1) and (2), μ is the mean value of all b_i, whereas μ_g is the mean value of b_i in group g. Speicher et al. [11] quantified the benefit as b_i = θ(x_i) − y_i + 1, where y_i, θ, and x_i denote the true label, the parameters of the model, and the input data, respectively. The aim of introducing b_i is to map the algorithmic outcome θ(x_i) to a scalar value based on the true label y_i, which captures the desirability [11] of the predictive outcome for input x_i. In other words, b_i indicates how much benefit the data point x_i receives from the algorithmic outcome. Then, f_I (1) and f_G (2) quantify the inequality (unfairness) of the benefits b_i over all the considered data: f_I captures the degree of inequality across individuals, whereas f_G captures the potential inequality among subgroups. The study [11] has empirically and theoretically shown that f_I and f_G are conflicting but complementary in real-world problems. Many good properties of f_I and f_G in quantifying unfairness, such as anonymity, population invariance, the transfer (Pigou-Dalton) principle, zero-normalization, and subgroup decomposability, were also investigated [11].
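For illustration, the following minimal sketch computes f_I and f_G from prediction results, assuming α = 2 (a common choice in [11]) and NumPy arrays as inputs; the function names are ours, not from [11].

```python
import numpy as np

def generalized_entropy(b, alpha=2.0):
    """Generalized entropy index of a benefit vector b, as in (1)."""
    n, mu = len(b), np.mean(b)
    return np.sum((b / mu) ** alpha - 1.0) / (n * alpha * (alpha - 1.0))

def individual_unfairness(y_pred, y_true, alpha=2.0):
    """f_I: inequality of benefits b_i = theta(x_i) - y_i + 1 over individuals."""
    b = y_pred - y_true + 1.0
    return generalized_entropy(b, alpha)

def group_unfairness(y_pred, y_true, groups, alpha=2.0):
    """f_G: between-group component of the generalized entropy index, as in (2)."""
    b = y_pred - y_true + 1.0
    n, mu = len(b), np.mean(b)
    fg = 0.0
    for g in np.unique(groups):
        b_g = b[groups == g]                       # benefits of group g
        fg += len(b_g) * ((np.mean(b_g) / mu) ** alpha - 1.0)
    return fg / (n * alpha * (alpha - 1.0))
```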
There is a variety of group fairness metrics, including parity-based metrics [2], calibration-based metrics, score-based metrics, and confusion-matrix-based metrics. Parity-based metrics are usually concerned with the predicted positive rates over each group, such as statistical parity. Both calibration-based and score-based metrics consider a predictive probability or score rather than predicted values. Confusion-matrix-based metrics have attracted much attention recently. Table I summarizes 16 metrics belonging to this category, where G, y, and ŷ denote the sensitive attributes, true labels, and predicted labels obtained by learning models, respectively. The work [16] analyzed the correlations among the Fair1-Fair16 metrics based on the prediction results of learning models on four datasets and concluded from the obtained correlations that Fair1-Fair8 are representative and can represent Fair1-Fair16 [16]. More specifically, Fair4 can represent Fair10, Fair13, Fair14, and Fair15; Fair9, Fair11, and Fair12 can be represented by Fair2; and Fair3 represents Fair16.
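Table I itself is not reproduced here, but the sketch below illustrates the general recipe behind confusion-matrix-based metrics: compute a rate per sensitive group and take the absolute difference between two groups. The function names are illustrative examples (statistical parity and equal opportunity), not the Fair1-Fair16 labels of Table I, and a binary sensitive attribute is assumed.

```python
import numpy as np

def rate_per_group(y_true, y_pred, groups, rate):
    """Evaluate a confusion-matrix-based rate separately for each sensitive group."""
    return {g: rate(y_true[groups == g], y_pred[groups == g])
            for g in np.unique(groups)}

def statistical_parity_diff(y_true, y_pred, groups):
    """Absolute difference in predicted-positive rates between two groups."""
    r = rate_per_group(y_true, y_pred, groups, lambda yt, yp: np.mean(yp == 1))
    a, b = r.values()
    return abs(a - b)

def equal_opportunity_diff(y_true, y_pred, groups):
    """Absolute difference in true positive rates between two groups."""
    tpr = lambda yt, yp: np.mean(yp[yt == 1] == 1)
    r = rate_per_group(y_true, y_pred, groups, tpr)
    a, b = r.values()
    return abs(a - b)
```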

B. Mitigating Unfairness in Machine Learning
In the literature, many methods have been proposed to mitigate unfairness in the model training process [2], [4], [18]. When the optimized fairness metrics are nondifferentiable, many algorithms construct proxies for these fairness metrics [10], [19], [20], [21]. For example, regarding the equalized odds fairness metric, the studies in [10] used a proxy as a constraint on the objective function. The study [20] applied the ramp loss in a nonconvex constrained optimization formulation to optimize the disparate impact fairness metric. The work [21], focusing on statistical parity, equalized odds, or equality of opportunity, constructed an adversarial model to detect and mitigate the unfairness of the predictor model through an adversarial learning strategy. There are also other types of algorithms, such as bandits [22] and causal inference [23].
Fairness metrics can also be treated directly as objectives, so that the problem of mitigating multiple unfairness metrics becomes a multiobjective optimization problem [12], [13], [24]. For example, Geden and Andrews [24] focused on three hiring problems and investigated the performance of different many-objective evolutionary optimization methods for fair, interpretable, and legally compliant hiring. Wu et al. [12] used a weighted sum approach to combine several objectives into one, so only one model with a predefined set of weights was obtained. The study [13] proposed a framework based on multiobjective evolutionary learning to balance accuracy and multiple fairness metrics and verified that the obtained model set had good performance in terms of diversity and convergence.
Ensemble methods have also been used in the context of mitigating unfairness [25], [26], [27]. Iosifidis et al. [25] proposed an ensemble framework with both pre- and post-processing interventions to tackle discrimination in class-imbalanced ML tasks. The study [26] claimed that an ensemble consisting of randomly selected classifiers is able to behave more fairly than a single classifier in many cases. The recent work [27] trained different base classifiers to maximize accuracy, determined the weights of the base classifiers based on their performance on accuracy and fairness metrics, and manually set the weights among accuracy and fairness metrics. Predicted outcomes are then produced through weighted majority voting [27]. Empirical results showed that its weight assignment method can perform better than [26]. However, both the random selection of [26] and the weight assignment of [27] have a limited ability to enhance ensemble diversity. Moreover, neither [26] nor [27] considers fairness metrics during base model training, resulting in lower diversity of fairness metrics among ensemble members. Multiobjective evolutionary learning has the potential to overcome the challenges that the works [26] and [27] face.

III. PROPOSED FRAMEWORK AND ALGORITHMS

Algorithm 1 Multiobjective Learning Framework for Fairer ML
Input: Initial models M = M_1, ..., M_λ; set of model evaluation criteria E; training dataset D_train; validation dataset D_validation; multi-objective optimiser π
Output: Final model set M
1: M_i ← Partially train [28], [29] M_i on D_train, for each i ∈ {1, ..., λ}
2: for i ∈ {1, ..., λ} do
3:     Evaluate M_i on D_validation according to E
4: end for
5: while the termination criterion is not met do
6:     P ← Select μ promising models from M_1, ..., M_λ with "best" 1, ..., μ according to π
7:     M′ ← Generate φ new models M′_1, ..., M′_φ from P according to π
8:     for i ∈ {1, ..., φ} do
9:         M′_i ← Partially train [28], [29] M′_i on D_train
10:        Evaluate M′_i on D_validation according to E
11:    end for
12:    M ← Select λ models from M ∪ M′ as the new population according to the survival selection of π
13: end while
14: return M

A. Multiobjective Learning Framework for Fairer ML
Our general framework is presented in Algorithm 1; it aims to evolve a population of learning models. Every individual of the population is a fair learning model, e.g., an artificial neural network (ANN). During the evolution, we expect the population to gradually achieve better tradeoffs among accuracy and multiple fairness measures.
The inputs to our framework include a number of initial models M as a population, a set of model evaluation criteria E, a training dataset D_train, a validation dataset D_validation, and a multiobjective optimizer π. More specifically, the criteria E are used to calculate the optimized objective values (e.g., accuracy and the fairness metrics in Table I) according to the predictions of the models M on the validation data D_validation. The training data D_train are used by local search strategies (e.g., partial training [28], [29]) to update the parameters of the models M.
A multiobjective optimizer π mainly contains three strategies: reproduction, mating selection, and survival selection. In our framework, every time a new model is initialized or generated (lines 1 and 9 in Algorithm 1), partial training [28], [29] is applied on D_train. The objective values of each model are obtained through the criteria E. In the main loop, first, the mating selection strategy of π is applied to select μ promising models as the parent models P (line 6 in Algorithm 1). Next, φ new models M′ are created with the aim of inheriting information from P (line 7 in Algorithm 1) through the reproduction strategy of π. Specifically, the reproduction strategy generates new models as offspring by modifying the parameters of the parent models, where crossover and mutation are two widely used operators. After partial training and model evaluation (lines 9 and 10 in Algorithm 1), λ candidate models are selected from the combination of M and M′ by the survival selection of π as the new population M for the next generation (line 12 in Algorithm 1). The above steps repeat until a termination criterion is reached.
The core steps of our framework are the model evaluation based on multiple criteria and the generation of new models. Multiobjective evolutionary algorithms (MOEAs) [30] are ideal choices for π to generate a set of learning models with good convergence and diversity. The output models can be further selected by decision makers or used as an ensemble [31], [32], [33].
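A compact Python sketch of this loop is given below. It is schematic: `partial_train`, `evaluate`, and the optimizer's three strategies are passed in as callables and stand in for the concrete choices described in Section III-C; the line-number comments refer to Algorithm 1.

```python
def evolve_fair_models(models, partial_train, evaluate, optimiser,
                       generations=200):
    """Schematic version of Algorithm 1: evolve a population of models."""
    for m in models:                                        # line 1
        partial_train(m)
    objs = [evaluate(m) for m in models]                    # lines 2-4
    for _ in range(generations):                            # line 5
        parents = optimiser.mating_selection(models, objs)  # line 6
        offspring = optimiser.reproduce(parents)            # line 7
        for child in offspring:                             # lines 8-11
            partial_train(child)                            # line 9
        objs_off = [evaluate(c) for c in offspring]         # line 10
        models, objs = optimiser.survival_selection(        # line 12
            models + offspring, objs + objs_off)
    return models, objs
```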

B. Multiobjective Ensemble Learning Framework for Fairer ML
When adopting ensemble learning, the final model set obtained by Algorithm 1 can be used to construct an ensemble. Algorithm 2 shows our proposed multiobjective ensemble learning framework. The inputs of the ensemble framework include the trained models M_1, ..., M_λ obtained by Algorithm 1, a set of model evaluation criteria E, an ensemble training dataset D_ensemble, and a multiobjective ensemble selection strategy π_ens. First, the objective values of M_1, ..., M_λ are computed through the model evaluation criteria E on the ensemble training dataset D_ensemble. Then, a model selection strategy is applied to select a subset of models from M_1, ..., M_λ, denoted as M′_1, ..., M′_e, according to π_ens and the obtained objective values.
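As a minimal illustration, the ensemble construction can be sketched as follows; `pi_ens` is any callable implementing one of the selection strategies of Section III-C, and the use of `predict_proba` together with simple probability averaging is our assumption (it matches the averaged-prediction description in Section IV-C3).

```python
import numpy as np

def build_ensemble(models, evaluate, pi_ens):
    """Schematic version of Algorithm 2: select a model subset for the ensemble."""
    objs = np.array([evaluate(m) for m in models])  # objectives on D_ensemble
    return pi_ens(models, objs)                     # subset M'_1, ..., M'_e

def ensemble_predict(selected, X):
    """Average the predicted probabilities of the selected base models."""
    return np.mean([m.predict_proba(X) for m in selected], axis=0)
```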

C. Proposed Algorithms Based on Our Framework
The choices of the model set, evaluation criteria, multiobjective optimization algorithm, and ensemble selection strategy in our proposed framework can vary according to the prediction tasks and actual preferences. We designed algorithms based on our framework using the following core ingredients.
1) Model Set: Various ML models can be used. In this work, a set of ANNs with an identical architecture is used as the individuals. The weights and biases of each ANN are encoded as a real-valued vector and represented as an individual [28].
2) Evaluation Criteria: In this work, we consider 11 measures in total: accuracy, individual unfairness f_I (1), group unfairness f_G (2), and Fair1-Fair8 (Table I). Cross entropy (CE) is widely used to measure the accuracy of classifiers and is minimized. f_I and f_G are also minimized [11]. The measures among Fair1-Fair8 that are based on absolute differences are minimized directly. The measures based on ratios (taking Fair3 as an example) are transformed so that their optimum becomes zero, e.g., by taking the absolute deviation of the ratio from its ideal value of one, and the transformed value is minimized. In this article, when observing the values of Fair9-Fair16, the same transformation is applied to these fairness measures. Thus, the optimal values of Fair1-Fair16 are all zeros.
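Under the transformation just described (our illustrative choice, since the exact formula is given only at a high level here), the minimized objectives can be written as:

```python
def ratio_objective(ratio):
    """Minimized objective for a ratio-based fairness measure (optimum 1 -> 0)."""
    return abs(1.0 - ratio)

def difference_objective(diff):
    """Minimized objective for a difference-based measure (already 0-optimal)."""
    return abs(diff)
```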
3) Multiobjective Optimizer: Since some criteria from E can be directly used as loss functions to update the ANNs, denoted as E_loss, such as f_I and f_G, we design an effective strategy including mating selection and reproduction strategies, shown in Algorithm 3. The survival selection of π can be adopted from any MOEA. In our algorithm, we choose the survival selection of the stochastic ranking algorithm (SRA) [34], a well-known multi-indicator-based MOEA. SRA uses stochastic ranking [35] to balance the different search biases of different indicators and has achieved superior performance in dealing with many-objective optimization problems. The proposed strategy aims to better balance exploration (lines 4-11 in Algorithm 3) and exploitation (lines 12-19 in Algorithm 3).
For the exploration improvement part, the best model according to each criterion in E_loss is selected from M_1, ..., M_λ, yielding m models denoted as S_best. After the reproduction strategy is applied to S_best, partial training [28], [29] is performed K times per selected model, where the loss is the corresponding criterion E_loss^i (line 8 in Algorithm 3). All the generated models are stored in the set S. For the exploitation improvement part, κ promising models are selected from M_1, ..., M_λ based on SRA's mating selection strategy and denoted as S_π, where κ = λ − m·K so that λ offspring are produced in total. Then, κ new models are generated by applying the reproduction strategy to S_π, and partial training is performed on each of them, where the loss is randomly selected from the criteria E_loss.
Both crossover and mutation are applied in the reproduction strategy of our algorithms. For mutation, an isotropic Gaussian perturbation [36] is performed, formulated as r′_i = r_i + δ_i, where r_i is the ith weight of an ANN, δ_i ∼ N(0, σ²) is the isotropic Gaussian perturbation, and σ is the mutation strength. Given parents p and q, a variant of weight crossover [37] is applied, defined as r_i^{o_1} = β_i r_i^p + (1 − β_i) r_i^q and r_i^{o_2} = (1 − β_i) r_i^p + β_i r_i^q, where β_i is sampled uniformly from [0, 1], and r_i^p, r_i^q, r_i^{o_1}, and r_i^{o_2} are the ith weights of parent p, parent q, offspring o_1, and offspring o_2, respectively.
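The two operators translate directly into code. The sketch below applies them to ANN weights flattened into real-valued vectors (Section III-C1); the per-weight uniform sampling of β is our assumption, consistent with the blend form written above.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mutation(r, sigma):
    """Isotropic Gaussian perturbation: r'_i = r_i + delta_i, delta_i ~ N(0, sigma^2)."""
    return r + rng.normal(0.0, sigma, size=r.shape)

def weight_crossover(p, q):
    """Blend crossover of two flattened weight vectors (beta_i assumed U(0,1))."""
    beta = rng.uniform(0.0, 1.0, size=p.shape)
    o1 = beta * p + (1.0 - beta) * q
    o2 = (1.0 - beta) * p + beta * q
    return o1, o2
```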

Algorithm 3 Selection and Reproduction Strategy
Input: Current population M_1, ..., M_λ; set of model evaluation criteria E; training dataset D_train; validation dataset D_validation; multi-objective optimiser π; parameter K for extreme models in partial training [28], [29]
Output: An offspring model set S
1: E_loss ← Select from E the criteria that can be directly used as loss functions to update the parameters of models
2: m ← |E_loss|
3: S ← ∅
4: for i ∈ {1, ..., m} do                                  ▷ exploration
5:     M_best ← Select the best model from M_1, ..., M_λ according to E_loss^i
6:     for k ∈ {1, ..., K} do
7:         M′ ← Generate a new model from M_best according to the reproduction strategy of π
8:         M′ ← Partially train [28], [29] M′ on D_train, where the loss is calculated based on the corresponding criterion E_loss^i
9:         Evaluate M′ on D_validation according to E; S ← S ∪ {M′}
10:    end for
11: end for
12: κ ← λ − m·K                                            ▷ exploitation
13: S_π ← Select κ promising models from M_1, ..., M_λ with "best" 1, ..., κ according to the mating selection of π
14: S̄ ← Generate κ new models from S_π according to the reproduction strategy of π
15: for i ∈ {1, ..., κ} do
16:     M̄_i ← Partially train [28], [29] the i-th model of S̄ on D_train, where the loss is randomly selected from the criteria E_loss
17:     Evaluate M̄_i on D_validation according to E
18: end for
19: S ← S ∪ S̄
20: return S

4) Ensemble Selection Strategies: Four multiobjective ensemble selection strategies π_ens are used in this work. In EnsAll, all the nondominated models are selected [39], while in EnsBest, only the best models from the nondominated models are selected according to their performance [39]. In EnsKnee, a knee point (model) subset of the nondominated models is selected according to the strategy for finding a knee point subset in [40]. In EnsDiv, a diverse model subset of the nondominated models is selected according to the selection method for an overflowed diversity archive in the work of Wang et al. [41].

IV. EXPERIMENTAL STUDIES
In this section, two studies are used to answer Q1-Q4 through extensive experiments. First, an overview of the two studies, including their motivation and details, is given in Section IV-A. Then, Q1-Q2 and Q3-Q4 are answered in Sections IV-B and IV-C, respectively.

A. Overview of the Two Studies
To adequately answer Q1-Q4, two studies are conducted, formulated as tri- and 9-objective optimization problems, respectively.
To achieve a comprehensive investigation of Q1 and Q2, we compare the performance of methods that are based on our framework but consider different measures. For convenience, we consider only two unfairness metrics, but the conclusions can be generalized to cases with more than two unfairness measures. To answer Q1 and Q2, the triobjective case is considered, where the evaluation criteria are the cross entropy, individual unfairness [11], and group unfairness [11] introduced in Section II-A.
To answer Q3 and Q4, the 9-objective case that simultaneously optimizes the cross entropy and Fair1-Fair8 (Table I) is considered. The conclusion of [16] that Fair1-Fair8 can well represent Fair1-Fair16 can be directly utilized in answering Q3. Then, we make full use of the final population to construct an ensemble to answer Q4.

B. Answering Q1 and Q2
In this section, the experimental results of the triobjective case are used to answer both Q1 and Q2.
1) Compared Methods: We denote the methods based on our framework as F_*, where the subscript * indicates the optimized objectives. The triobjective case is thus F_EIG, where E, I, and G refer to the cross entropy CE, individual unfairness f_I, and group unfairness f_G, respectively. Three ablation studies are performed, which are all based on our framework F_* but consider only one or two of the measures. Moreover, we compare F_EIG with the state-of-the-art algorithm Multi-FR [12], which does not use an MOEA. More specifically, four variants of Multi-FR, differing in the gradient normalization method, are considered. Although the work [12] pointed out that gradient normalization is optional, the normalization method can affect the performance of Multi-FR significantly. Three widely used normalization methods [42] are applied in our experiments.
The seven compared methods are summarized as follows. F_EI is a biobjective case that considers both CE and individual unfairness f_I based on our framework. F_EG is another biobjective case that considers both CE and group unfairness f_G. F_E is a single-objective case that considers only CE. Multi-FR-l2, Multi-FR-loss, and Multi-FR-loss+ are variants of the Multi-FR approach [12] using l2 normalization, loss normalization, and loss+ normalization, respectively. Multi-FR-no-norm refers to the Multi-FR approach without gradient normalization.
2) Datasets: Eight well-known benchmark datasets widely used in the literature on algorithmic fairness [43], namely, Student [44], [45], German [46], COMPAS [47], LSAT [48], Default [49], Adult [50], Bank [51], and Dutch [52], are used in our experimental study. Table II summarizes these datasets. The preprocessing of the German, COMPAS, and Adult datasets is the same as in [6]. Each dataset is randomly split into three partitions, with a ratio of 6:2:2, as the training, validation, and test sets. The sensitive features of each dataset in Table II are all considered in calculating f_G. The difficulty of optimizing f_G increases with the value of |G|.

3) Parameter Setting: All ANN models are fully connected with one hidden layer of 64 nodes. The weights are initialized as in [53], which is a common choice. The learning rate is set to 0.004 for all experiments. The gradient descent optimizer is SGD [38]. For the experiments with the algorithms F_*, μ and λ in Algorithm 1 are set to 100. K in Algorithm 3 is set to 10. The set E_loss in Algorithm 3 contains CE, f_I, and f_G, since all three objectives are differentiable and can be directly used as losses. The settings of the mutation strength and batch size on the different datasets are shown in Table III, where the parameter values were determined through grid search. Since F_E considers a single objective, the top λ models in terms of CE are directly selected as the new population for the next generation. The probabilities of crossover and mutation are both set to 1. The termination condition is a maximum number of 200 generations. The four variants of Multi-FR use the same batch sizes as in Table III. Five-fold cross-validation is applied. For each compared method, 30 independent trials are performed.

4) Performance Measures:
Considering that a set of models is generated by each F_* but only one model is created by Multi-FR, two groups of performance measures are introduced for fair and comprehensive comparisons.
When comparing the population-based algorithms F_E, F_EI, F_EG, and F_EIG, two popular indicators [54], hypervolume (HV) [55] and coverage over Pareto front (CPF) [56], are used to evaluate the solution sets. A larger HV value indicates better overall performance. CPF places more emphasis on diversity, and a larger value indicates better diversity [54], [56]. In this work, since the true Pareto front is unknown, when calculating HV and CPF, all the nondominated solutions found in all the experimental trials considering the same objectives on the same dataset are collected as a pseudo Pareto front. After normalizing the solution set into [0, 1] × · · · × [0, 1] based on the pseudo Pareto front, (1.1, ..., 1.1) is set as the nadir point in HV.
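In practice, HV can be computed with an off-the-shelf indicator implementation. The sketch below uses pymoo (our choice; any HV implementation would do) after the normalization described above; the function name is ours.

```python
import numpy as np
from pymoo.indicators.hv import HV

def normalized_hv(objs, pseudo_front):
    """HV of a solution set after normalizing by the pseudo Pareto front."""
    lo = pseudo_front.min(axis=0)
    hi = pseudo_front.max(axis=0)
    norm = (objs - lo) / np.maximum(hi - lo, 1e-12)  # map into [0, 1]^k
    ref = np.full(objs.shape[1], 1.1)                # nadir point (1.1, ..., 1.1)
    return HV(ref_point=ref)(norm)
```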
To compare F_EIG with Multi-FR, three metrics based on the dominance relation are used [54]. Given two solutions p and q, p ≺ q denotes that p dominates q, and p ⊀ q denotes that p does not dominate q [54].
Given a solution s generated by Multi-FR in one trial and the set P of model sets obtained by F_EIG in all 30 trials, the metric Dominate records in how many trials F_EIG generates solutions that dominate the solution obtained by Multi-FR:

\mathrm{Dominate}(P, s) = \frac{1}{30} \sum_{i=1}^{30} \mathrm{sign}\left[\exists\, P_i^j \in P_i : P_i^j \prec s\right]

where P_i is the model set of the ith trial and P_i^j is the jth model in P_i. sign[·] equals 1 if [·] is true and 0 otherwise. A larger value means that there are more trials in which s is dominated by some solutions of the model set obtained by F_EIG.
The metric Incomparable calculates the average proportion of solutions obtained by F_EIG that are incomparable with s over all 30 trials of F_EIG:

\mathrm{Incomparable}(P, s) = \frac{1}{30} \sum_{i=1}^{30} \frac{1}{|P_i|} \left|\left\{ P_i^j \in P_i : P_i^j \nprec s \ \wedge\ s \nprec P_i^j \right\}\right|

A larger Incomparable value means that more models obtained by F_EIG are incomparable with s. The third metric, Dominated, calculates the average proportion of solutions obtained by F_EIG that are dominated by s over all 30 trials of F_EIG:

\mathrm{Dominated}(P, s) = \frac{1}{30} \sum_{i=1}^{30} \frac{1}{|P_i|} \left|\left\{ P_i^j \in P_i : s \prec P_i^j \right\}\right|

A larger Dominated value means that more models obtained by F_EIG are dominated by the model s.
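The three metrics follow directly from the definitions above. The sketch below assumes each trial's model set is given as an array of objective vectors under minimization; the function names are ours.

```python
import numpy as np

def dominates(p, q):
    """p dominates q (minimization): no worse in all objectives, better in one."""
    return np.all(p <= q) and np.any(p < q)

def dominance_metrics(fronts, s):
    """Dominate, Incomparable and Dominated of solution s w.r.t. the trial sets."""
    dom, inc, ded = 0.0, 0.0, 0.0
    for P in fronts:  # one model set (array of objective vectors) per trial
        dom += any(dominates(p, s) for p in P)
        inc += np.mean([not dominates(p, s) and not dominates(s, p) for p in P])
        ded += np.mean([dominates(s, p) for p in P])
    t = len(fronts)
    return dom / t, inc / t, ded / t
```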

(Q1) Can Multiobjective Learning Simultaneously Optimize Several Fairness Measures?
We answer Q1 from four perspectives on the test set: 1) visualization of the optimization process; 2) convergence curves of HV values; 3) HV performance of the final generation; and 4) comparison with the state-of-the-art algorithm Multi-FR [12] according to Dominate, Incomparable, and Dominated. Fig. 1 illustrates the optimization process of arbitrarily selected trials of F_EI, F_EG, and F_EIG on the test sets, where the nondominated solutions of each generation are drawn with colors darkening as the evolution progresses. It clearly shows that the model error and one or two unfairness measures converge simultaneously toward the Pareto fronts (green stars).
The convergence curves of the HV values, which quantify the optimization process over all the independently repeated trials, are shown in Fig. 2. Specifically, for each dataset, the pseudo Pareto front based on the three objectives is determined considering all the solutions from all the generations of the 30 trials of F_E, F_EI, F_EG, and F_EIG. Then, based on the pseudo Pareto front, we record the average HV values of every ten generations over all 30 trials. It is worth noting that CE, f_I, and f_G are all involved in the calculation of HV, which makes the HVs of the different algorithms comparable with each other even though their optimized objectives differ. Therefore, the HV results represent the performance on CE, f_I, and f_G in terms of convergence and diversity.
As shown in Fig. 2, in all three experiments considering two or three objectives, the HV values increase along with the evolution, implying that the diversity and convergence become significantly better and that the model error and the individual and group unfairness decrease during the evolution. Since F_E only optimizes CE, the improvement in CE may lead to worse f_I and f_G, which results in little improvement in HV. The HV values of the final generations are summarized in Table IV, where "+/≈/−" indicates that the averaged HV values of the corresponding algorithms are statistically better/similar/worse than those of F_EIG according to the Wilcoxon rank sum test at a 0.05 significance level. Table IV indicates that, except that F_EIG and F_EI perform equally well on the German dataset, F_EIG is statistically superior to the others in terms of HV values, which shows that F_EIG can optimize the accuracy, individual, and group unfairness measures simultaneously without sacrificing any of them.
Finally, we compare the state-of-the-art Multi-FR with F_EIG. The Dominate, Incomparable, and Dominated values of the models obtained by Multi-FR with l2, loss, loss+, or no normalization on the test data over 30 trials are recorded in Table V. The Dominate values of Multi-FR without normalization are always 1 on all datasets, which implies that, in every trial, some models generated by F_EIG have better performance in terms of all three objectives (CE, f_I, and f_G) than the models obtained by Multi-FR without normalization. On the Student, German, COMPAS, Default, and Bank datasets, the Dominate values of Multi-FR with either loss or loss+ are larger than 0.65, which indicates that F_EIG has a high probability of generating models that are better than Multi-FR in terms of CE, f_I, and f_G. On the LSAT, Adult, and Dutch datasets, although the Dominate values of Multi-FR with l2, loss, or loss+ are not high, their Dominated values are high and their Incomparable values are low, which means that the single model obtained by Multi-FR on these datasets dominates a considerable proportion of the models obtained by F_EIG.

(Q2) Can We Obtain a Group of Diverse Models by Applying Multiobjective Learning?

To answer Q2, three perspectives are considered on the test set: 1) visualization of the Pareto fronts generated by F_EIG; 2) evaluation based on the diversity indicator CPF; and 3) comparison with the state-of-the-art algorithm Multi-FR [12].
As shown in Fig. 3, we select the nondominated solutions of the models from the whole evolution process of F_EIG in all trials as the Pareto fronts (black points). We also plot the nondominated solutions of the models in the last generation of F_EIG in one arbitrary trial (green triangles). Fig. 3 clearly shows the tradeoffs among CE, f_I, and f_G through the obtained Pareto fronts, which helps decision makers understand the different behaviors of the three objectives and to what extent sacrificing one metric can improve the others. The diversity of the models can be observed clearly from Fig. 3, except for the Student dataset, whose diverse models might be better observed from Fig. 2.
We also quantify the diversity of the models obtained by F_E, F_EI, F_EG, and F_EIG. Since the HV indicator evaluates both the convergence and diversity of a model set, the results in Table IV indicate that F_EIG can provide a more diverse model set than the compared methods. In addition, the "extreme" models in the final generation generated by F_EIG are selected for analysis. More specifically, the best models according to the individual metrics (accuracy, f_I, and f_G) are selected for further analysis. The accuracy, f_I, and f_G values of those selected models are averaged over 30 trials and reported in Table VII, where the values are the differences between the metric values and the best values of the corresponding metric. Table VII clearly shows the extreme tradeoffs among these metrics. Taking Default as an example, the best performance in accuracy, f_I, and f_G for F_EIG is 0.82332, 0.02724, and 0.00001, respectively. It is possible for the model with the best accuracy to improve f_I from 0.06753 to 0.02724, but it must then sacrifice about 6.022e-1 of its high accuracy performance, which helps decision makers clearly understand the tradeoffs. Thanks to the diverse tradeoff set, decision makers can make an appropriate decision depending on their real-life demands. The average accuracy, f_I, and f_G values of the models obtained by the Multi-FR variants using different normalization methods are also recorded in Table VII. For f_I and f_G, F_EIG statistically achieves better performance than all four types of Multi-FR. Fig. 4 visualizes the model set of F_EIG in one arbitrary trial and the models of the four Multi-FR variants in all trials. Although the weights of the gradients of accuracy, f_I, and f_G are adaptively determined by the Frank-Wolfe solver during the training process, the models obtained by Multi-FR are only distributed in a subregion of those obtained by F_EIG, which implies that F_EIG is able to explore more diverse solutions in both the f_I and f_G metrics.
In summary, F_EIG using multiobjective learning can obtain a group of diverse models and outperforms the state of the art.

C. Answering Q3 and Q4
In this section, we implement the 9-objective case to answer Q3 and Q4. Three metric sets are considered according to whether the fairness measures belong to the representative measure subset of Table I: Metric Set I contains the 9 directly optimized objectives (cross entropy and Fair1-Fair8), Metric Set II contains the remaining measures Fair9-Fair16, and Metric Set III covers all of Fair1-Fair16. Our 9-objective algorithm is denoted as F_Repre, and the ensemble models obtained from F_Repre are denoted as "Ens*."

1) Compared Methods:
To verify the effectiveness of our ensemble methods in balancing accuracy and multiple fairness metrics, our ensemble learning framework (Algorithm 2) is implemented with four different multiobjective ensemble selection strategies π_ens, called EnsBest, EnsAll, EnsKnee, and EnsDiv (described in Section III-C4), to optimize the measures in Metric Set I.
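For reference, the nondominated filtering underlying EnsAll can be written in a few lines. The sketch below is a straightforward O(n²) dominance filter under minimization; the function name is ours, not from [39].

```python
import numpy as np

def nondominated(objs):
    """Indices of nondominated solutions (minimization), as used by EnsAll."""
    idx = []
    for i, p in enumerate(objs):
        dominated = any(np.all(q <= p) and np.any(q < p)
                        for j, q in enumerate(objs) if j != i)
        if not dominated:
            idx.append(i)
    return idx
```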
The ensemble method of [27] is used as a baseline owing to its outstanding performance compared with [26], as reported in [27]. In [27], different types of classifiers are considered as base models, including linear regression (LR), linear discriminant analysis, and others; the resulting ensemble variants are referred to as KCR, KCS, LrKSCR, LrKLSCR, and KCSRN. Thus, a total of nine algorithms are compared in this work. Fair1-Fair8 are directly optimized based on the formulations in Table I. Cross entropy is used to measure the accuracy.
2) Datasets: Besides the eight datasets in Table II, seven new datasets are considered to answer Q3 and Q4 and to verify the effectiveness of our ensemble method: Academics [57], Heart [58], Diabetes [59], Performance [60], IBM [61], Drug [62], and Patient [63]. These additional datasets are added to facilitate comparisons with the existing work [27]. In order to construct ensemble models, each dataset is randomly split into four partitions, with a ratio of 5:1.25:1.25:2.5, as the training, validation, ensemble, and test sets; this setting is the same as in [64]. The validation set is used in the same way as D_validation in Algorithm 1, whilst the ensemble sets are only used by the ensemble strategies on the 15 datasets. The sensitive attributes of German, Adult, COMPAS, and Bank are gender and age, gender, race, and age, respectively, the same as in [16]. For the remaining datasets, gender is used.
3) Parameter Settings: According to the sizes of the datasets, we use two settings for the number of hidden nodes and the learning rate. For Student, German, COMPAS, LSAT, Default, Adult, Bank, Dutch, and Patient, the individuals of each Ens* are ANNs that are fully connected with one hidden layer of 64 nodes, and the learning rate is set to 0.004. For the remaining datasets, the number of hidden nodes is 32 and the learning rate is 0.001. The initialization method is the same as in [53]. μ and λ in Algorithm 1 are set to 300. The set E_loss in Algorithm 3 contains only CE, since Fair1-Fair16 cannot be directly used as losses. A larger mutation strength is applied in the 9-objective case to improve the performance on the Fair1-Fair8 objectives: the mutation strength is 0.1 and the batch size is set to 1000. K in Algorithm 3 is set to 10. The probabilities of crossover and mutation are both set to 1. The termination condition is a maximum number of 100 generations. When dealing with the test dataset D_test, the four ensemble methods output the arithmetic average of the prediction values over the selected model subset.
For EnsDiv, 50 base classifiers are selected from the nondominated solutions in the final population of F_Repre using the diversity update strategy of Two_Arch2 [41] according to the objectives on the ensemble data. For EnsKnee, 50 base classifiers are chosen based on the knee point selection of [40]. For all Ens* methods, the ensemble prediction is the arithmetic average of the predictions of the selected base ANNs.
For KCR, KCS, LrKSCR, LrKLSCR, and KCSRN, the settings of the original study [27] are used. All the base models are implemented with scikit-learn [65]. The weights among the base classifiers are determined by two types of weights on the ensemble data: manual-based weights e and metric-based weights t. For the manual-based weights, all the objectives are treated equally (e = 1/9), which is the same as the original setting [27]. Soft majority voting is applied to determine the metric-based weights among the base classifiers, since the experimental results of [27] show that soft majority voting balances different metrics better than hard majority voting. Four-fold cross-validation is applied. For each compared method, 30 independent trials are performed.

4) Performance Measures: HV is used to evaluate the overall performance of a model set in terms of convergence and diversity, as introduced in Section IV-B4. The geometric mean (G-mean) [66] is used to measure the performance of an ensemble in terms of accuracy and Fair1-Fair8, since G-mean can assess a solution on multiple objectives with different units and is widely used in many applications [66], [67]. In the calculation of G-mean, the "Accuracy" metric is taken as (1 − accuracy) so that all the measures among accuracy and Fair1-Fair8 are to be minimized. A smaller G-mean value indicates better performance.
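As a minimal sketch (the helper name and the small eps guard against zero-valued measures are our assumptions), the G-mean of the nine minimized measures can be computed as follows.

```python
import numpy as np

def g_mean(measures, eps=1e-12):
    """Geometric mean of minimized measures: (1 - accuracy) and Fair1-Fair8."""
    m = np.asarray(measures, dtype=float)
    return float(np.exp(np.mean(np.log(m + eps))))  # eps guards against zeros
```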

(Q3) Can Multiobjective Learning Improve All Fairness Measures Including Those Not Used in Model Training?
To answer Q3, we plot and analyze the convergence curves of the HV values of the 9-objective optimization algorithm F_Repre on the test data of the 15 datasets.
We use F_Repre to directly optimize Metric Set I and then calculate the average HV values of every ten generations for Metric Set I, Metric Set II, and Metric Set III, respectively, on the test data over 30 trials. The way of determining the pseudo Pareto front described in Section IV-B4 is used, except that the considered objectives are those of Metric Set I, Metric Set II, and Metric Set III, respectively.

(Q4) Can Multiobjective Learning Generate an Ensemble Model Combined From Base Models to Balance Accuracy and Multiple Fairness Measures?

To answer Q4, three perspectives are considered: evaluation of the base models, evaluation based on G-mean, and comparison according to accuracy and multiple fairness measures. The quality of the base models in terms of accuracy and Fair1-Fair8 on the test set is analyzed from two perspectives: 1) visualization of the base models and 2) HV values of the base model sets. We record the best performance of the whole base model set in terms of accuracy and Fair1-Fair8 for every trial. Then, each measure value averaged over the 30 trials is plotted in Fig. 6 for EnsBest, KCR, KCS, LrKSCR, LrKLSCR, and KCSRN. Since EnsBest has the fewest base models among the Ens* methods, we only plot the performance of the base model sets of EnsBest to clearly demonstrate the advantages of using multiobjective learning. It can be observed that EnsBest is better than KCR, KCS, LrKSCR, LrKLSCR, and KCSRN in terms of fairness. For EnsBest, which trains base models considering accuracy and the fairness measures through multiobjective learning, base models with better or even optimal performance in terms of Fair1-Fair8 can be found, since most values of the fairness measures are close to 0, as shown in Fig. 6.
Furthermore, HV is used to evaluate the overall performance of the base model sets obtained by the different algorithms in terms of Accuracy and Fair1-Fair8, where the value 1 − Accuracy is used as the accuracy objective so that all nine objectives are minimized in the HV calculation. Specifically, for each dataset, the pseudo Pareto front based on the Accuracy and Fair1-Fair8 objectives is obtained from all the compared algorithms over the 30 trials; the resulting HV values are reported in Table VIII. The results indicate that the base models obtained by multiobjective learning have better quality than those of the other approaches in terms of fairness measures, which contributes to generating an ensemble with better performance on Accuracy and Fair1-Fair8. Table VIII also shows that EnsBest serves as a baseline among our ensembles, because EnsDiv, EnsKnee, and EnsAll are all better than EnsBest in all cases in terms of HV values; all of them are better than KCR, KCS, LrKSCR, LrKLSCR, and KCSRN.
As to the overall performance of the ensemble models considering Accuracy and Fair1-Fair8, Table IX gives the G-mean values on the 15 datasets averaged over 30 trials. The last row of Table IX also shows the overall rankings of the nine algorithms on the 15 datasets. "+/≈/−" indicates that the average G-mean value of the corresponding algorithm (specified by the column header) is statistically better/similar/worse than that of EnsBest according to the Wilcoxon rank sum test at a 0.05 significance level. It can be observed that EnsBest has the best average ranking of 3.47 among all the compared algorithms. EnsBest outperforms KCR, KCS, LrKSCR, LrKLSCR, and KCSRN on 11 out of 15 datasets, the exceptions being Adult, Bank, Heart, and Diabetes. A closer examination of the datasets and the fairness measures reveals that the poor performance of EnsBest on Adult and Bank is partially caused by the imbalanced data distributions [68], [69] in these datasets.
We take a closer look at the performance of the nine algorithms in terms of each measure among Accuracy and Fair1-Fair16. Fig. 7 ranks the nine algorithms according to the averaged values of each measure on the 15 datasets, where a smaller ranking value means better performance. Several observations can be made from Fig. 7. First, no method is the best across all metrics, which is expected because of the inherent conflicts among metrics: if a method excels at one metric, it is very likely to achieve suboptimal values for other, conflicting metrics. Second, considering the overall performance across all 17 metrics, our EnsBest is the best; it achieves the best ranking on the largest number of metrics and also achieves good rankings on the remaining metrics. In fact, all four of our ensemble methods, Ens*, achieve better overall performance than the others. Third, examining the individual objectives, i.e., the accuracy and the 16 fairness metrics, separately, Fig. 7 shows that our methods Ens* achieve the best performance on fairness metrics 1, 4, 6-9, and 11-15, while they do not perform as well as the others on the accuracy and fairness metrics 2, 3, 5, 10, and 16. In short, our methods performed the best on 11 out of 17 metrics, keeping in mind that they also have the best overall performance according to G-mean. Note that the eight fairness metrics Fair1-Fair8 were used to answer Q3 and were not specifically chosen to address the balance between group and individual fairness. By considering group fairness metrics only (Fair1-Fair8), we compromise our performance on individual fairness metrics, as shown by other papers [4], [11], [18]. There is an inherent conflict between group and individual fairness, which is also evident from our previous work [13] and from our experimental results related to Q1 and Q2. It is important to maintain the balance among different metrics. If all possible fairness metrics are to be considered, representative metrics from different categories, e.g., group and individual fairness, should be selected as objectives in our multiobjective ensemble learning framework.

V. CONCLUSION
To deal with the conflicts among accuracy and different fairness measures, this article applied a novel multiobjective evolutionary learning framework to mitigate unfairness. Two studies of our proposed framework, the tri- and the 9-objective optimization algorithms, were carried out, focusing on the four research questions raised in Section I. In particular, we have demonstrated through extensive experimental studies that our multiobjective learning framework is able to learn fair models with high accuracy and outperforms the state of the art. We have shown that we are able to find a more diverse set of fair models than the state of the art. Such a diverse set has enabled us to develop an ensemble of learning models with good performance, outperforming existing ensemble approaches. We have analyzed our experimental results from different perspectives in order to understand them in depth. We have also shown that our multiobjective learning framework is able to improve a broad range of fairness measures, even those not used in model training. This indicates that the fair models we have found are robust against different fairness metrics, rather than "overfitting" to a specific fairness measure.
In the future, we plan to improve upon several aspects of our work. First, we would like to carry out a deeper analysis of various fairness measures in order to reduce the number of objectives used in multiobjective learning. Second, we plan to investigate the impact of training data distribution on our multiobjective learning framework. Third, new ensemble formulation methods will be investigated. Fourth, since our framework is a population-based approach and each individual in the population is an ML model, if the model size becomes very large, the computation time will increase. Our framework is a good option when multiple objectives are considered or when loss functions are nondifferentiable or nonconvex. Our framework is better at providing a model set to balance multiple objectives. Our future work will study how to improve the efficiency of the proposed framework on very large models through parallel processing [70], [71], [72].