Travel Demand Forecasting: A Fair AI Approach

Artificial Intelligence (AI) and machine learning have been increasingly adopted for travel demand forecasting. Although AI-based travel demand forecasting models can generate accurate predictions, they may produce biased predictions and raise fairness issues. Using such biased models for decision-making may lead to transportation policies that exacerbate social inequalities. However, few studies have focused on addressing the fairness issues of these models. Therefore, in this study, we propose a novel methodology to develop fairness-aware, highly accurate travel demand forecasting models. In particular, the proposed methodology can enhance the fairness of AI models for multiple protected attributes (such as race and income) simultaneously. Specifically, we introduce a new fairness regularization term, which is explicitly designed to measure the correlation between prediction accuracy and multiple protected attributes, into the loss function of the travel demand forecasting model. We conduct two case studies to evaluate the performance of the proposed methodology using real-world ridesourcing-trip data from Chicago, IL and Austin, TX. Results highlight that our proposed methodology can effectively enhance fairness for multiple protected attributes while preserving prediction accuracy. Additionally, we compare our methodology with three state-of-the-art methods that adopt the regularization-term approach, and the results demonstrate that our approach significantly outperforms them in both preserving prediction accuracy and enhancing fairness. This study offers transportation professionals a new tool for achieving fair and accurate travel demand forecasting.


Introduction
In recent years, Artificial Intelligence (AI) has been increasingly used in travel demand forecasting due to its powerful prediction capability (Chu et al., 2019; Xu et al., 2022). However, a growing number of studies have reported that AI has evident fairness issues (Angwin et al., 2016; Baker and Hawn, 2021; Barabas et al., 2018; Beutel et al., 2019; Buolamwini and Gebru, 2018; Obermeyer et al., 2019; Prates et al., 2020), making worse predictions for disadvantaged population groups (e.g., racial and ethnic minorities, low-income individuals, and women) than for advantaged groups. For example, facial recognition systems have higher error rates when classifying darker-skinned individuals and females (Buolamwini and Gebru, 2018). Studies in the transportation domain report similar findings. For example, recent research has shown that AI algorithms could underestimate the actual travel demand of disadvantaged groups (Yan and Howe, 2020) and deliver much lower prediction accuracy for disadvantaged groups than for advantaged groups (Zheng et al., 2021). Unfair predictions may negatively impact transportation policies and decision-making (e.g., vehicle rebalancing and traffic control), leading to unintended consequences for transportation equity. Therefore, AI-based travel demand forecasting models should account for both prediction accuracy and fairness (Yan, 2021).
Recently, some researchers have started to develop fairness-aware AI methods in travel behavior modeling, e.g., travel mode choice modeling (Zheng et al., 2021) and travel demand forecasting (Yan and Howe, 2020). However, research on this important topic, especially for travel demand forecasting, is still lacking. For instance, although various methods have been developed to mitigate unfairness, very few can be flexibly adopted by different types of models (e.g., linear models, deep learning models with different architectures, etc.). In other words, there is still no systematic framework that addresses a model's fairness issues in a model-agnostic manner (i.e., one independent of the model). Also, it remains largely unsolved how to prioritize model fairness while preserving prediction accuracy, both of which are critical to ensure the trustworthiness of AI (Kaur et al., 2022; Li et al., 2023). Additionally, previous studies have primarily focused on correcting the unfairness of a single protected attribute. In real-world datasets, however, the debiased model and results could vary across different protected attributes, potentially causing confusion and hindering adoption by end-users. For example, one study found that mitigating the unfairness of one protected attribute (i.e., race) could increase the prediction disparities of another protected attribute (i.e., income) (Zheng et al., 2021). This suggests that a model that is fair for one protected attribute could still be unfair for other attributes (Wan et al., 2023). However, few prior studies have been devoted to simultaneously tackling fairness issues for multiple protected attributes (Bose and Hamilton, 2019; Wan et al., 2023).
To address these research gaps, we aim to develop a new methodology to enhance fairness in AI-based travel demand forecasting models. More specifically, we first define Fairness as the Equality of Prediction Accuracy, i.e., the prediction accuracy is equal for advantaged and disadvantaged population groups. Next, we examine the potential unfairness (i.e., prediction accuracy disparity) existing among several state-of-the-art deep learning and statistical models for travel demand forecasting, using real-world ridesourcing-trip data from Chicago, IL and Austin, TX. We propose a novel absolute correlation regularization method to simultaneously correct the detected unfairness across multiple protected attributes (e.g., race, education, etc.). We further compare the proposed methodology with other state-of-the-art regularization terms to show its effectiveness in both preserving accuracy and correcting unfairness. The unique contributions of this study are as follows:
• This study is one of the first to examine the fairness issues of travel demand forecasting models from an algorithmic perspective. We extend the literature on this topic by detecting the unfairness issues of several commonly used deep learning and statistical models and proposing a methodology to correct the unfairness.
• We introduce a novel absolute correlation regularization term to address a model's unfairness arising from multiple protected attributes. This regularization term is explicitly designed to penalize models that produce unfair predictions, which offers notable transparency. Moreover, the proposed regularization term is model-agnostic and can be flexibly incorporated into the loss function of any type of model architecture.
• We propose to use an interactive weight coefficient for both the accuracy loss and the fairness regularization term. This weight coefficient is tuned simultaneously with other key hyperparameters of an AI model (e.g., the number of hidden layers, the number of hidden neurons, and the learning rate of a multilayer perceptron model). Therefore, the fairness-aware travel demand forecasting models can optimally improve fairness while preserving prediction accuracy.
The remainder of the paper is structured as follows: Section 2 reviews related studies. Section 3 introduces the fairness definitions, metrics, and the unfairness correction method. We introduce the empirical case studies in Section 4. The modeling results are presented in Section 5. Section 6 discusses the merits of the proposed methodology, echoes the critical findings, proposes some policy implications, and lists several future research directions. Finally, Section 7 concludes the study.

AI fairness issues
In recent years, AI methods have been deployed in a broad array of real-world applications due to their outstanding strength in producing highly accurate predictions. However, there has been a growing recognition that, despite their predictive superiority, AI and machine learning techniques are accompanied by increasing fairness concerns (Angwin et al., 2016). Studies from multiple fields have reported that AI algorithms can discriminate against disadvantaged population groups in various applications, including healthcare, criminal justice, credit assessment, and translation, among many others (Angwin et al., 2016; Baker and Hawn, 2021; Barabas et al., 2018; Dressel and Farid, 2018; Obermeyer et al., 2019; Prates et al., 2020). For example, healthcare systems could underestimate the health conditions of Black patients relative to white patients, even when they have the same health risk score (Obermeyer et al., 2019). If these inherent biases are not addressed, using such AI systems to assist decision-making will worsen existing social disparities (Mehrabi et al., 2021).

Taxonomy of fairness notions
Numerous fairness notions and corresponding mathematical formulations have been proposed for different downstream learning tasks (Mehrabi et al., 2021). These fairness notions span various dimensions, including classification vs. regression, group vs. individual, and disparate treatment (Berk et al., 2017). In classification, multiple fairness notions are created to mitigate "disparate impact," i.e., when practices or policies have disproportionately adverse effects on different groups (Barocas and Selbst, 2016); examples include statistical parity (Dwork et al., 2012), equality of odds, and equality of opportunity (Hardt et al., 2016). In regression, notions such as the individual/region-based fairness gap (Yan and Howe, 2020), cross-pair loss (Berk et al., 2017), and equal means (Calders et al., 2013) are introduced to address real-world regression applications that raise fairness concerns. Fairness notions also branch along the axis of individual vs. group. Individual fairness requires similar individuals to be treated similarly, while group fairness equalizes outcomes among all groups (Dwork et al., 2012). Another way to classify fairness notions is by whether disparate treatment is allowed. Disparate treatment measures fairness through treatment rather than outcomes. It addresses both formal classification and intentional discrimination (Barocas and Selbst, 2016), and includes notions such as counterfactual fairness (Kusner et al., 2017) and fairness through unawareness (Dwork et al., 2012). These fairness notions have laid a solid foundation for defining and measuring fairness in real-world problems.

Correcting unfairness for multiple protected attributes
There are three possible ways to achieve the aforementioned fairness, i.e., to correct unfairness. First, pre-processing: transforming the data (e.g., resampling or reweighting) to remove bias before training the models (e.g., Calmon et al., 2017; Kamiran and Calders, 2012). Second, in-processing: modifying the algorithms, such as including a fairness penalty in the loss function (Berk et al., 2017; Yan and Howe, 2020) or incorporating constraints (Agarwal et al., 2019). Third, post-processing: correcting unfairness by adjusting the learned models (Hardt et al., 2016; Johnson et al., 2016). In this study, we selected in-processing techniques due to their transparency (i.e., directly incorporating fairness into model optimization), their strong capability to achieve fairness even when confronted with biased data (Caton and Haas, 2020), and their effectiveness in mitigating bias amplification problems (i.e., trained models amplifying the biases in the training data) (Wang and Russakovsky, 2021).
In-processing methods fall into two categories: implicit and explicit methods (Wan et al., 2023). Implicit methods debias models by implicitly removing bias from the latent representations. They usually hypothesize that if the latent representations are less biased, the predictions produced from those representations will also be less biased. Implicit methods are commonly used in adversarial learning (Xu et al., 2019; Yan and Howe, 2021; Yang et al., 2023), contrastive learning (Cheng et al., 2021), etc. However, these methods (1) are usually less transparent, since we can hardly interpret how the produced latent representations mitigate (or even remove) unfairness (Du et al., 2020; Quadrianto et al., 2019), and (2) usually come with specific model architectures (Yan and Howe, 2021). Explicit methods focus on explicitly modifying the objective function while keeping the model structure intact, for example, by adding fairness-related regularization terms or constraints.
Therefore, explicit methods usually afford greater flexibility and can be applied to a wide range of models. Existing explicit methods include the absolute correlation regularization term (Beutel et al., 2019), pairwise fairness loss (Berk et al., 2017), and equal means (Calders et al., 2013), among others. This study adopts the explicit approach by integrating a fairness-related regularization term into the loss function to jointly account for accuracy and fairness.
Achieving multi-attribute fairness has long been an enduring challenge in using in-processing techniques to mitigate unfairness (Wan et al., 2023). To date, most of the existing literature has focused purely on correcting the unfairness of a single protected attribute (Agarwal et al., 2019; Berk et al., 2017; Kamishima et al., 2011; Yang et al., 2023). However, mitigating the unfairness of one attribute may increase the unfairness of another attribute (Zheng et al., 2021). This unexpected outcome may confuse end-users (e.g., travel demand modelers) and thus hinder the adoption of fairness-aware models. To tackle this issue, Yan and Howe (2020) proposed to explicitly correct the unfairness of multiple attributes by simply adding multiple regularization terms (one for each attribute, with a corresponding weight) to the loss function. However, when the protected attributes are correlated with each other (which is the case for most travel demand forecasting problems), it can be challenging to determine the appropriate weight for each protected attribute in order to reach the optimal solution that minimizes the unfairness for the combination of selected protected attributes. Other related methods include learning fair graph embeddings via adversarial learning (Bose and Hamilton, 2019), disentangled representation learning (Kim et al., 2021), and adding fairness constraints for each protected attribute and achieving fairness via constrained optimization (Kearns et al., 2018, 2019). However, as discussed above, these methods are often less transparent and come with specific model architectures, which hinders their adaptability. As of now, there is a pressing need to develop transparent, effective, and flexible methods that can simultaneously account for fairness across multiple protected attributes and can be applied to any model class.

Addressing AI fairness issues in travel demand forecasting
Recently, transportation researchers have also started to examine and address the fairness concerns of travel demand forecasting models, e.g., Yan and Howe (2020) and Yan and Howe (2021). Specifically, Yan and Howe (2020) treated fairness as equal mean per capita travel demand across groups over a period of time and evaluated the fairness issues of several AI methods for demand prediction in ridesourcing services and bike-share systems. Results showed that machine learning spontaneously underestimated the travel demand of disadvantaged people. They also proposed two fairness regularization terms and a corresponding fairness-aware demand prediction model to correct the unfairness. Yan and Howe (2021) proposed to use an implicit method, which relies on fair representations (i.e., EquiTensors) learned by adversarial learning, to forecast bike-share demand. These fairness-aware models offer transportation professionals new insights into transportation resource allocation and a novel instrument for designing a fairer transportation ecosystem.
However, two critical knowledge gaps have yet to be addressed. First, prior research has primarily concentrated on equalizing per capita travel demand among different population groups, but travel demand disparities may have already been introduced during the data creation process, which is often beyond our control (Chouldechova and Roth, 2018; Zheng et al., 2021). For example, multiple studies found that rich people are more likely to use ridesourcing services than the poor (Yan et al., 2020; Zhang and Zhao, 2022). That means this behavioral bias among different population groups may naturally exist (Olteanu et al., 2019). However, to date, no study has investigated how to appropriately account for this type of bias, especially for travel demand forecasting models. Second, existing fairness-aware travel demand forecasting methods necessitate particular model structures, which severely limits their adaptability. Thus, developing a model-agnostic method (i.e., one independent of the model structure) that can be flexibly adopted by different types of AI models is promising. To date, however, a systematic, model-agnostic method to address fairness issues, especially for travel demand forecasting problems, is still lacking.

Methodology
The methodological framework is outlined as follows. The travel demand forecasting problem is mathematically defined in Section 3.1. In Section 3.2, we introduce the fairness metrics used in the proposed methodology, followed by the unfairness correction approach for multiple attributes (Section 3.3). The notations are summarized in Table 1.

Travel demand forecasting problem
The goal of travel demand forecasting is to predict the future travel demand for each area (or other spatial unit, such as traffic segments) given previously observed time-series data. We consider the transportation network as a weighted directed graph $G = (V, E, W)$, where $V$ is a set of nodes (i.e., areas or traffic segments) with $|V| = N$; $E$ is a set of edges representing the connectivity between two nodes; and $W \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix representing the nodes' proximity (e.g., distance or functional similarity). Given the weighted directed graph $G$ with $N$ nodes, we assume time $t \in T$ is a discrete variable, where $T$ is the set of all possible timestamps. Let $x_t = (x_t^i, i \in I)$ represent the travel demand at time $t$, where $I$ is the index set of nodes and $x_t^i$ is the travel demand of node $i \in I$ at time $t$, and let $X_t = [x_{t-K+1}, \ldots, x_t] \in \mathbb{R}^{N \times K}$ denote the $K$ historical travel demand observations up to time $t$. The travel demand forecasting problem can then be formulated as learning a function $h(\cdot): \mathbb{R}^{N \times K} \rightarrow \mathbb{R}^{N \times M}$ that maps the $K$ historical travel demand observations to the travel demand over the next $M$ time intervals for all nodes in the given graph $G$. Let $\hat{Y}_t = [\hat{x}_t, \ldots, \hat{x}_{t+M-1}]$ denote the predicted travel demand for the next $M$ time intervals starting from timestamp $t$, where $\hat{x}_t = (\hat{x}_t^i, i \in I)$ refers to the predicted travel demand at timestamp $t$ for all nodes. Then we can mathematically write:
$$\hat{Y}_t = [\hat{x}_t, \ldots, \hat{x}_{t+M-1}] = h(X_t \mid G). \quad (1)$$
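The input and output shapes of this formulation can be sketched with a toy forecaster; the historical-mean model and all names below are illustrative stand-ins, not the paper's method.

```python
import numpy as np

# Illustrative shapes only: X_t is (N, K) -- K historical observations for N nodes;
# h(.) maps it to (N, M) -- predicted demand over the next M time intervals.
N, K, M = 5, 12, 3
rng = np.random.default_rng(0)
X_t = rng.poisson(lam=20.0, size=(N, K)).astype(float)

def h(X: np.ndarray, M: int) -> np.ndarray:
    """Toy stand-in for a trained model: repeat each node's historical mean M times."""
    return np.repeat(X.mean(axis=1, keepdims=True), M, axis=1)

Y_hat = h(X_t, M)  # shape (N, M), i.e., the predicted demand matrix
```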

Fairness in travel demand forecasting models
This study defines Fairness as the equality of prediction accuracy. Intuitively, we assume that travel demand prediction accuracy should be independent of the protected attributes. Taking racial composition as an example, equality of prediction accuracy means that the prediction accuracy for any racial group should be equal. In this study, we use the Absolute Percentage Error (APE) to measure predictive accuracy instead of the Mean Absolute Error (MAE) or Root Mean Square Error (RMSE). We believe the magnitude of travel demand (especially for emerging mobility services) in an advantaged community (e.g., a high-income community) is naturally greater than in a disadvantaged community (Brown, 2019). This type of behavioral bias is largely introduced during the data creation process rather than by applying the algorithm (Chouldechova and Roth, 2018; Olteanu et al., 2019). If we quantify the equality of prediction accuracy with MAE or RMSE, which are sensitive to the magnitude of the forecasting outcome, machine learning may replicate, or even reinforce and exacerbate, existing biases. Instead, APE scales with the magnitude and cancels out the behavioral bias that is already embedded in the data.
Recall from the previous section that a travel demand forecasting model learns a function that takes the historical travel demand as input and predicts the travel demand for the next $M$ time intervals starting from time $t$, i.e., $\hat{Y}_t$. We define $e_t = (e_t^i, i \in I)$ to indicate the prediction accuracy (i.e., APE) at time $t$, where $e_t^i$ is the prediction accuracy of node $i$ at time $t$. Specifically,
$$e_t^i = \frac{|x_t^i - \hat{x}_t^i|}{x_t^i}, \quad (2)$$
where $x_t^i$ and $\hat{x}_t^i$ are the ground truth and predicted values of node $i$ at time $t$, respectively, and $e_t^i$ is the absolute percentage error for node $i$ at time $t$. The lower the value of $e_t^i$, the better the predictive performance.
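As a minimal sketch of the per-node APE, the following computes the error vector for one timestamp; the small `eps` guard against zero ground-truth demand is our addition, not part of the paper's formula.

```python
import numpy as np

def ape(x_true, x_pred, eps=1e-12):
    """Per-node absolute percentage error: e_t^i = |x_t^i - xhat_t^i| / x_t^i.
    eps (our addition) guards against division by zero when demand is zero."""
    x_true = np.asarray(x_true, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    return np.abs(x_true - x_pred) / (x_true + eps)

e_t = ape([100.0, 50.0, 10.0], [90.0, 55.0, 12.0])  # -> approx [0.1, 0.1, 0.2]
```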
Suppose $Z = [z_j, j \in J]$ is the matrix of protected attributes of interest, where $J = \{1, 2, \ldots, Q\}$ is the index set of attributes and $Q$ is the total number of protected attributes; $z_j = (z_j^i, i \in I)$ represents protected attribute $j$, $z_j^i$ denotes protected attribute $j$ at node $i$, and $I$ is the index set of nodes. Denote $p_j^i$ as a binary indicator of whether node $i$ belongs to the advantaged (i.e., $p_j^i = 1$) or disadvantaged (i.e., $p_j^i = 0$) group for protected attribute $j$, and accordingly let $I_j^+ = \{i : p_j^i = 1\}$ and $I_j^- = \{i : p_j^i = 0\}$ represent the sets of advantaged and disadvantaged node indices for demographic attribute $j$, with sizes $|I_j^+|$ and $|I_j^-|$, respectively. We note that assigning a value to $p_j^i$, i.e., determining whether each node should be labeled as advantaged or disadvantaged, is context-specific. This determination could be guided by criteria or statistics defined by the local government (Yan and Howe, 2020). Subsequently, Equality of Prediction Accuracy is defined as:
$$\mathbb{E}[e_t \mid p_j^i = 1] = \mathbb{E}[e_t \mid p_j^i = 0], \quad \forall j \in J,$$
where $\mathbb{E}[e_t \mid p_j^i = 1]$ and $\mathbb{E}[e_t \mid p_j^i = 0]$ are the conditional expectations of the prediction accuracy $e_t$ given $p_j^i = 1$ and $p_j^i = 0$, representing the mean APE for the advantaged and disadvantaged groups, respectively. That is, for any protected attribute $j$, a fair model should have equal prediction accuracy for different groups. Moreover, once a forecasting model is built, we can measure model fairness by quantifying prediction accuracy disparities, especially between nodes with different labels, for instance, low-income communities and high-income communities.
In this study, we introduce the Prediction Accuracy Gap (PAG) as a fairness metric to measure the prediction accuracy disparity and whether fairness is achieved. Define:
$$\mathrm{PAG}_j = \left| \mathbb{E}[e_t \mid p_j^i = 1] - \mathbb{E}[e_t \mid p_j^i = 0] \right|.$$
Intuitively, PAG directly measures the prediction accuracy disparity between these two types of nodes. A high value of PAG indicates that the machine learning model delivers inconsistent predictive performance across nodes; in most cases, the performance is worse for disadvantaged nodes.
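A small sketch of the PAG metric under the definitions above; the per-node APE values and group labels are made up for illustration.

```python
import numpy as np

def prediction_accuracy_gap(e, p):
    """PAG for one protected attribute: absolute difference between the mean APE
    of advantaged nodes (p == 1) and disadvantaged nodes (p == 0)."""
    e, p = np.asarray(e, dtype=float), np.asarray(p)
    return abs(float(e[p == 1].mean()) - float(e[p == 0].mean()))

e = np.array([0.10, 0.12, 0.30, 0.28])  # per-node APE (illustrative values)
p = np.array([1, 1, 0, 0])              # 1 = advantaged, 0 = disadvantaged
gap = prediction_accuracy_gap(e, p)     # -> 0.18
```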
In this study, we also use the Correlation Coefficient as another fairness metric. The correlation coefficient naturally measures the extent to which the predictions are biased with respect to specific protected groups. Intuitively, if fairness is achieved, the correlation between prediction accuracy and any protected attribute should be zero. By using the correlation coefficient as a measure of fairness, we assume that the target variable (i.e., prediction accuracy) is linearly correlated with the independent variable (i.e., the protected attribute).
Recall from the discussion above that $e_t$ is the prediction accuracy (APE) at time $t$, and $z_j$ refers to protected attribute $j$ for all nodes. Then, the correlation between prediction accuracy $e_t$ and protected attribute $z_j$ across all nodes is denoted by $r(e_t, z_j)$. Define:
$$r(e_t, z_j) = \frac{\sum_{i \in I} (e_t^i - \bar{e}_t)(z_j^i - \bar{z}_j)}{\sqrt{\sum_{i \in I} (e_t^i - \bar{e}_t)^2} \sqrt{\sum_{i \in I} (z_j^i - \bar{z}_j)^2} + \epsilon}, \quad (3)$$
where $\bar{e}_t = \mathbb{E}(e_t)$ and $\bar{z}_j = \mathbb{E}(z_j)$. In our experiments, we add a small $\epsilon = e^{-20}$ to the denominator to keep it strictly positive. Although the correlation coefficient does not require a label for each region, we cannot directly read the prediction accuracy disparity from it.
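The fairness correlation of Eq. (3) can be sketched as follows; it mirrors the Pearson correlation with the small eps = e^{-20} added to the denominator.

```python
import numpy as np

def fairness_corr(e, z, eps=np.exp(-20)):
    """r(e_t, z_j): correlation between the per-node APE vector e and a protected
    attribute z, with eps in the denominator to keep it strictly positive."""
    e, z = np.asarray(e, dtype=float), np.asarray(z, dtype=float)
    de, dz = e - e.mean(), z - z.mean()
    return float((de * dz).sum() / (np.linalg.norm(de) * np.linalg.norm(dz) + eps))

# Perfectly linearly related accuracy and attribute -> correlation ~ 1.
r = fairness_corr([0.1, 0.2, 0.3, 0.4], [10.0, 20.0, 30.0, 40.0])
```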

Unfairness correction method for travel demand forecasting models
In this study, we introduce an absolute correlation regularization approach, which adapts the work of Beutel et al. (2019), to mitigate the prediction accuracy disparities existing among groups. Beutel et al. (2019) applied this approach to a classification problem by minimizing the false positive rate (FPR) gap between groups. We generalize this approach to a regression setting (i.e., the travel demand forecasting problem) by minimizing the prediction accuracy disparities among different communities.
More importantly, most previous studies, including Beutel et al. (2019), have primarily focused on correcting the unfairness of a single attribute. In real-world datasets, however, the debiased model and results could differ across protected attributes. Also, a model that is fair for one protected attribute could still be unfair for other attributes (Wan et al., 2023; Zheng et al., 2021). One feasible solution to this issue is to consider multiple attributes at the same time when correcting the unfairness of the models. We expect that a fair model should produce fair predictions for all attributes of interest instead of focusing solely on one.
Therefore, we propose a methodology that can correct unfairness for multiple protected attributes. More specifically, we propose to use the Multiple Correlation Coefficient (Bai and Krishnaiah, 2003), denoted as $R$, to measure the correlation between the target variable, i.e., prediction accuracy, and a set of protected attributes (including race, education, age, and income). A larger $R$ suggests a stronger dependence between the target variable and the explanatory variables. We expect a fair prediction to yield $R = 0$, or at least a small value. Accordingly, we use $R$ as the regularization term in the loss function to account for the fairness loss. We note that the underlying linear model may encounter potential multicollinearity concerns. However, there is no need to address them, since the goal of the linear model is forecasting rather than estimating coefficients (Shmueli, 2010).
Recall from the previous subsections that we use the prediction accuracy $e_t$ as the target variable and $Z = [z_j, j \in J]$ to represent the matrix of multiple protected attributes of interest, and that $r(e_t, z_j)$ denotes the correlation between prediction accuracy $e_t$ and protected attribute $z_j$ across all nodes. Given these notations, we can write the vector of correlations between each protected attribute $z_j$ and the prediction accuracy $e_t$ as $c = (r(e_t, z_1), r(e_t, z_2), \ldots, r(e_t, z_Q))^\top$, and the correlation matrix among each pair of protected attributes as $\Omega$, with entries $\Omega_{jk} = r(z_j, z_k)$. Consequently, the multiple correlation coefficient between $e_t$ and $Z$, i.e., $R(e_t, Z)$, which is the square root of the coefficient of determination (i.e., $R^2$) of the linear model (Allison, 1999), can be written as:
$$R(e_t, Z) = \sqrt{c^\top \Omega^{-1} c}, \quad (4)$$
where $c^\top$ is the transpose of $c$ and $\Omega^{-1}$ is the inverse of $\Omega$.
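A sketch of the multiple correlation coefficient R = sqrt(c^T Ω^{-1} c) under these definitions, reusing the eps-guarded pairwise correlation; the clamp to zero before the square root is our numerical safeguard, not part of the formula.

```python
import numpy as np

def multiple_corr(e, Z, eps=np.exp(-20)):
    """R(e_t, Z): multiple correlation between the APE vector e (length N)
    and Q protected attributes Z (shape N x Q)."""
    e, Z = np.asarray(e, dtype=float), np.asarray(Z, dtype=float)

    def corr(a, b):  # pairwise correlation with the eps-guarded denominator
        da, db = a - a.mean(), b - b.mean()
        return (da * db).sum() / (np.linalg.norm(da) * np.linalg.norm(db) + eps)

    Q = Z.shape[1]
    c = np.array([corr(e, Z[:, j]) for j in range(Q)])
    Omega = np.array([[corr(Z[:, j], Z[:, k]) for k in range(Q)] for j in range(Q)])
    # Clamp tiny negative values (floating-point noise) before the square root.
    return float(np.sqrt(max(c @ np.linalg.solve(Omega, c), 0.0)))

# With a single attribute (Q = 1), R reduces to |r(e_t, z_1)|.
R = multiple_corr(np.array([1.0, 2.0, 3.0, 4.0]), np.array([[1.0], [2.0], [3.0], [4.0]]))
```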
Accordingly, given graph $G$ and a forecasting model $\hat{Y}_t = h(X_t \mid G)$, we add the multiple correlation coefficient $R$ to the loss function, denoted as $L(X_t, Z \mid G)$, as shown in Eq. (5). In this way, the model simultaneously accounts for the unfairness issues arising from multiple protected attributes. Let $Y_t = [x_t, \ldots, x_{t+M-1}]$ denote the ground truth travel demand for the next $M$ time intervals starting from $t$. Mathematically, the loss function of the forecasting model to be minimized, i.e., $L(X_t, Z \mid G)$, is written as:
$$L(X_t, Z \mid G) = (1 - \lambda)\, l(\hat{Y}_t, Y_t) + \lambda\, R(e_t, Z), \quad (5)$$
and
$$l(\hat{Y}_t, Y_t) = \frac{1}{N M} \sum_{s=t}^{t+M-1} \sum_{i \in I} (x_s^i - \hat{x}_s^i)^2.$$
In the above equations, $x_t^i$ and $\hat{x}_t^i$ refer to the ground truth and predicted travel demand for node $i$ at time $t$, respectively; $l$ is the primary loss function of the forecasting model, for which we use the mean squared error (MSE) in this study; and $\lambda$ is the interactive weight coefficient that controls the balance between the prediction loss and the fairness regularization term. When $\lambda = 0$, the model is unaware of fairness; when $\lambda = 1$, the model focuses entirely on correcting unfairness. We can directly treat $\lambda$ as a hyperparameter to find the optimal model that effectively addresses fairness while preserving accuracy. The prediction accuracy disparity is captured and mitigated by the correlation regularization term in Eq. (4). The regularization term shrinks the potential prediction accuracy disparity among groups toward zero. Incorporating it into the loss function enables the machine learning model to automatically keep track of fairness during training.
Note that when there is only one protected attribute of interest, the multiple correlation coefficient in Eq. (4) reduces to Eq. (3).
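Assuming the (1 − λ)/λ weighting implied by the λ = 0 and λ = 1 limits described above, the combined loss can be sketched in NumPy; a real implementation would use a differentiable framework (e.g., PyTorch) so gradients flow through the regularizer.

```python
import numpy as np

def fairness_aware_loss(y_true, y_pred, Z, lam, eps=np.exp(-20)):
    """Sketch of the combined loss: (1 - lam) * MSE + lam * R(e, Z).
    y_true, y_pred: demand vectors of length N; Z: (N, Q) protected attributes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = float(((y_true - y_pred) ** 2).mean())
    e = np.abs(y_true - y_pred) / (y_true + eps)  # per-node APE

    def corr(a, b):
        da, db = a - a.mean(), b - b.mean()
        return (da * db).sum() / (np.linalg.norm(da) * np.linalg.norm(db) + eps)

    Q = Z.shape[1]
    c = np.array([corr(e, Z[:, j]) for j in range(Q)])
    Omega = np.array([[corr(Z[:, j], Z[:, k]) for k in range(Q)] for j in range(Q)])
    R = float(np.sqrt(max(c @ np.linalg.solve(Omega, c), 0.0)))
    return (1.0 - lam) * mse + lam * R

# lam = 0 recovers the plain MSE; lam = 1 optimizes fairness alone.
Z = np.array([[0.1, 0.5], [0.4, 0.2], [0.7, 0.9], [0.9, 0.3]])
loss0 = fairness_aware_loss([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0], Z, lam=0.0)
```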

Case Study
In this section, we describe the two real-world ridesourcing-trip datasets and the seven commonly used travel demand forecasting models used for the case studies. Section 4.1 and Section 4.2 present the data collection and processing. Table 2 presents the descriptive statistics of all input variables. Fig. A.1 in Appendix A displays the spatial distribution of the average ridesourcing demand per hour. We briefly introduce the selected deep learning and statistical models for unfairness detection and correction in Section 4.3.

Chicago ridesourcing-trip data
In this study, we collected publicly available ridesourcing-trip data from the Chicago Data Portal for the case study. The data span November 1, 2018 to March 31, 2019, containing 45,338,599 trips. The dataset includes many attributes, but only pick-up locations and timestamps are considered in this research. Since we focus on trip generation (i.e., origin demand) forecasting, all trips are aggregated at the census-tract level and counted hourly. We prepared the data for modeling in the same way as previous studies (Zhang and Zhao, 2022) to account for missing data and outliers. The data preparation process produced trip generation data for 711 census tracts. We used the first 70% of the data for training, the following 10% for validation, and the remainder for testing. The census-tract-level demographic data (i.e., protected attributes) were collected from the American Community Survey 2013-2017 5-year estimates, including the percentage of white residents, the percentage of low-income households, the percentage of the population with a bachelor's degree or above, and the percentage of young population (aged 18-44).
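The chronological 70/10/20 split described above can be sketched as follows; the array dimensions are illustrative placeholders, not the exact Chicago dataset dimensions.

```python
import numpy as np

# Hourly demand matrix: T timestamps x N census tracts (shapes are illustrative).
demand = np.zeros((3624, 711))
T = demand.shape[0]
t_train, t_val = int(0.7 * T), int(0.8 * T)
# Chronological split preserves temporal order: no future data leaks into training.
train, val, test = demand[:t_train], demand[t_train:t_val], demand[t_val:]
```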

Austin ridesourcing-trip dataset
This study also collected ridesourcing-trip data from RideAustin for a case study. The data range from October 1, 2016 to April 13, 2017, including 1,259,574 trips in total. Similar to the Chicago case study, we only retained pick-up locations and the corresponding timestamps from the dataset for the empirical analysis. All ridesourcing trips were aggregated at the census-tract level on an hourly basis. The prepared dataset includes 191 census tracts. The first 70% of the dataset was used for model training, followed by 10% for validation and 20% for testing. Four protected attributes, namely the percentage of white residents, the percentage of low-income households, the percentage of the population with a bachelor's degree or above, and the percentage of young population (aged 18-44), were also collected from the American Community Survey 2013-2017 5-year estimates.

Models
In this study, we applied seven models as the major baseline models to measure the fairness metrics and perform bias mitigation. We also compared their performance with the historical average method. All models are detailed as follows:
• Historical Average (HA): We calculate the historical average travel demand as the mean of all observations in the input sequence.
• Multivariate Linear Regression (MLR): MLR is frequently used in machine learning studies as a benchmark model. This study treats the observations at each timestamp t as a covariate.
• Autoregressive Integrated Moving Average (ARIMA): ARIMA is one of the most fundamental statistical models for forecasting time-series data (Makridakis and Hibon, 1997). ARIMA consists of three basic parts: the auto-regressive, first-differencing, and moving-average parts. The order of the auto-regressive (p) and moving-average (q) parts and the degree of first-differencing (d) must be prespecified before building the model. In this study, we established an ARIMA model to predict the travel demand for all areas at once.
• Multilayer Perceptron (MLP): MLP is a commonly used deep neural network model. In this study, the architecture is set as one hidden layer with 300 linear neurons. A dropout layer with rate 0.01 is placed after the hidden layer to avoid overfitting.
• Gated Recurrent Unit (GRU): GRU is a widely adopted Recurrent Neural Network (RNN) model with gated hidden neurons (Cho et al., 2014). GRU generates the predicted travel demand x_{i,t+1} from the hidden state at timestamp t − 1 and the travel demand x_{i,t} at timestamp t. In this way, GRU can dynamically capture the travel demand at the current timestamp while maintaining the historical demand trend. We use the GRU model to forecast the travel demand for all nodes at once.
• Temporal Graph Convolutional Network (T-GCN): T-GCN captures spatial dependency and temporal information at the same time (Zhao et al., 2019). Specifically, the spatial dependency is encoded by the spatial adjacency graph G_adj, where 1 indicates that two nodes are spatially adjacent and 0 otherwise. T-GCN takes the hidden state at timestamp t − 1 and the graph-convolution-processed travel demand at timestamp t as input. Therefore, T-GCN can effectively handle data with strong spatial dependency, such as traffic speed data.
• Convolutional Long Short-Term Memory (ConvLSTM): ConvLSTM is a widely used approach for spatio-temporal forecasting problems (Shi et al., 2015).
ConvLSTM has a convolutional structure in both the input-to-state and state-to-state transitions; it determines a cell's future state by considering the inputs and past states of its local neighbors. This characteristic makes it powerful in handling spatio-temporal correlations. In this study, the convolutional kernel size of the ConvLSTM is set to 5.
• Spatio-Temporal Graph Convolutional Network (STGCN): STGCN is an effective approach for spatio-temporal traffic flow forecasting (Yu et al., 2017). STGCN consists of several spatio-temporal convolution (ST-Conv) blocks. Each block has a "sandwich"-like structure: two gated sequential convolution layers with one spatial graph convolution layer in between. This allows STGCN to distill the most useful spatial features and capture the most essential temporal features collectively. In this study, we set the number of ST-Conv blocks to 2. Let d_{i,j} denote the distance between node i and node j; the element of the weighted adjacency matrix, w_{i,j} ∈ W, is given by w_{i,j} = exp(−d_{i,j}²/σ²) if i ≠ j and exp(−d_{i,j}²/σ²) ≥ α, and w_{i,j} = 0 otherwise, where σ² and α, assigned as 10⁴ and 0.5, are thresholds that control the sparsity of W.
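The weighted adjacency construction for STGCN can be sketched in a few lines. This follows our reading of the Gaussian-kernel thresholding described above (the form used in Yu et al., 2017); the distance values are made up for illustration.

```python
import numpy as np

def weighted_adjacency(dist, sigma2=1e4, alpha=0.5):
    """Build W from pairwise distances: w_ij = exp(-d_ij^2 / sigma2),
    kept only when i != j and the kernel value is at least alpha."""
    w = np.exp(-dist ** 2 / sigma2)
    mask = (w >= alpha) & ~np.eye(len(dist), dtype=bool)
    return np.where(mask, w, 0.0)

# Toy symmetric distance matrix for three nodes (units arbitrary).
dist = np.array([[0.0, 50.0, 300.0],
                 [50.0, 0.0, 120.0],
                 [300.0, 120.0, 0.0]])
W = weighted_adjacency(dist)
```

With these thresholds, nearby node pairs keep a kernel weight close to 1, distant pairs are zeroed out, and the diagonal is excluded, which keeps W sparse.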

Results
This section sequentially reports the modeling results of all benchmark models, the evaluation of their underlying fairness issues, and the results after applying the proposed unfairness correction approach. We conducted empirical experiments using real-world ridesourcing-trip data from Chicago, IL and Austin, TX. The spatial unit of analysis is the census tract. We incorporate the regularization term into the loss function of all models. All experiments were run in a PyTorch environment on an Ampere A100 GPU. Under each fairness weight λ, we tuned hyperparameters such as batch size and sequence length using grid search. We trained our models with the Adam optimizer (Kingma and Ba, 2014). Early stopping is also applied to avoid overfitting. In this study, we use the 60th and 40th percentiles of each protected attribute as thresholds to determine the label (i.e., p_i^j) of each node (e.g., census tract). For instance, the 60th percentile of the white population percentage is 62.35%, so nodes with a white population percentage over 62.35% are labeled as advantaged.
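The percentile-based labeling rule can be sketched as follows. This is an assumed implementation for illustration: tracts above the 60th percentile of an attribute receive one label and tracts below the 40th percentile the other; how the middle band is handled is our assumption, not stated in the text.

```python
import numpy as np

def label_nodes(attr, hi_pct=60, lo_pct=40):
    """Label nodes by the 60th/40th percentiles of a protected attribute."""
    hi, lo = np.percentile(attr, [hi_pct, lo_pct])
    labels = np.full(len(attr), "excluded", dtype=object)
    labels[attr > hi] = "advantaged"
    labels[attr < lo] = "disadvantaged"
    return labels

# Toy white-population percentages for ten census tracts.
pct_white = np.array([10., 20., 30., 40., 50., 60., 70., 80., 90., 95.])
labels = label_nodes(pct_white)
```

For an attribute where a higher value indicates disadvantage (e.g., the percentage of low-income households), the labels would be swapped accordingly.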

Unfairness detection
The predictive performance and two fairness metrics (i.e., correlation [Corr] and prediction accuracy gap [PAG]) of all models with respect to the four protected attributes are presented in Table 3 and Table 4, which report the results for the Chicago ridesourcing-trip data (Table 3) and the Austin ridesourcing-trip data (Table 4), respectively.
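For concreteness, the two fairness metrics can be sketched as follows. This is our reading of the metrics (the exact definitions live in the paper's methods section): Corr as the Pearson correlation between per-tract prediction error and a protected attribute, and PAG as the mean-error gap between disadvantaged and advantaged tracts.

```python
import numpy as np

def corr_metric(errors, attr):
    """Pearson correlation between per-tract errors and an attribute."""
    return np.corrcoef(errors, attr)[0, 1]

def pag_metric(errors, is_advantaged):
    """Mean-error gap: disadvantaged minus advantaged."""
    return errors[~is_advantaged].mean() - errors[is_advantaged].mean()

# Toy values: errors rise as the white-population share falls.
errors = np.array([1.0, 1.2, 2.5, 3.0])     # per-tract prediction error
pct_white = np.array([90., 80., 20., 10.])  # protected attribute
adv = pct_white > 50.                       # advantaged label
```

In this toy example, Corr is negative (majority-white tracts have lower error) and PAG is positive (disadvantaged tracts carry more error), matching the sign conventions discussed below.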
Regarding prediction accuracy, all benchmark models show a similar trend across the two case studies. The performance ranking is ConvLSTM ≈ STGCN > GRU > T-GCN > MLP > ARIMA > HA, indicating that prediction accuracy gradually increases as the model becomes more complex. The two convolutional models, STGCN and ConvLSTM, perform best among all models. Both can incorporate spatial and temporal information through their convolution blocks, which enhances their predictive power. Between the two RNN-based models, GRU outperformed T-GCN in both MAE and RMSE. MLP, due to its simple architecture, underperformed all other neural network-based models. Compared with deep neural networks, the traditional statistical models, MLR and ARIMA, have relatively low prediction accuracy; however, they still significantly outperformed HA. MLR and ARIMA both have a prespecified (linear) model structure and cannot capture the nonlinearity between the inputs and the target variable, which restricts their predictive capability.
Regarding fairness issues, for the Chicago ridesourcing-trip data, Table 3 shows that HA exhibits completely inverse relationships in correlation and gap compared with the other models. Since HA has the worst predictive performance, its fairness metrics could be unreliable. The results illustrate that both statistical and deep learning models have evident fairness issues. Three protected attributes (race, education and age) are negatively correlated with the prediction error, meaning that communities with a higher proportion of white population, higher educational attainment and more young people enjoy higher prediction accuracy. Income level is positively correlated with the prediction error, indicating that communities with more low-income households may have higher prediction error. In terms of magnitude, education and age have the largest correlations, followed by income and race. Although the magnitudes vary, the signs are consistent for all protected attributes across all models except HA. In addition to correlations, we also examined the PAG between the advantaged and disadvantaged groups. Table 3 shows that all gaps are positive (except for HA), indicating that the prediction error for disadvantaged groups is higher than for advantaged groups. Additionally, the prediction accuracy disparity is more pronounced for education and age than for race and income.
For the Austin ridesourcing-trip data, all benchmark models demonstrate similar performance (in both trend and direction of association) compared with the Chicago dataset. However, the fairness issues are relatively subdued in the Austin dataset; in other words, the extent of unfairness (as shown by the correlation coefficient and PAG) is notably smaller than in the Chicago dataset. Notably, Table 4 shows that prediction accuracy is less biased with respect to race and income. The two best-performing models (STGCN and ConvLSTM) may already produce satisfactorily fair predictions. For example, the correlation between prediction accuracy and race delivered by ConvLSTM is 0.000, and the PAG regarding race is only -0.391%. This evidence indicates that unfairness in prediction accuracy should be of little concern for this protected attribute.

Unfairness correction
We tuned a set of values of λ (i.e., the weight of the fairness loss) by grid search to validate the effectiveness of the proposed unfairness correction method. Table 5 and Table 6 present the results of simultaneously mitigating the unfairness for multiple protected attributes across the two case studies. We only present the best λ (i.e., the one that significantly improves fairness while largely preserving prediction accuracy) from the empirical experiments. We also add the experimental results of correcting unfairness for only a single attribute at the bottom of each table for comparison. For the sensitivity analysis of λ, please refer to Section 5.3. As discussed in the previous section, only very limited prediction accuracy disparities are detected for race (percentage of white population) and income (percentage of low-income households) in the Austin case study (as shown in Table 4). Thus, we only correct the unfairness manifested in education (percentage of bachelor's degree holders) and age (percentage of young population) in that case. There are several key findings to highlight. First, the results of the multi-attribute scenario show great consistency across the two datasets. Table 5 and Table 6 show that in almost all trials, incorporating a small fairness weight can significantly reduce the absolute values of the correlation and PAG across all protected attributes. For example, in the Chicago dataset, incorporating a fairness weight of only 0.050 for GRU reduces the absolute values of the PAG by 85.142%, 86.386%, 92.004% and 94.927% for race, education, age and income, respectively. Meanwhile, the correlations between prediction accuracy and the protected attributes also improved by more than 60%, while RMSE only increased by 2.062%. In the Austin dataset, setting λ to 0.025 for ConvLSTM yields 68.973% and 88.496% PAG shrinkage for education and age at the cost of only a 5.661% increase in RMSE (from 3.162 to 3.341).
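The λ-weighted objective can be sketched as follows. This is a conceptual sketch using NumPy arithmetic for brevity (in the actual models the same computation would be written with PyTorch tensors so that it backpropagates); the fairness term shown is the single-attribute absolute correlation between per-node error and a protected attribute, and the input values are made up.

```python
import numpy as np

def total_loss(y_pred, y_true, attr, lam=0.05):
    """Prediction loss plus lambda times an absolute-correlation fairness term."""
    err = np.abs(y_pred - y_true)
    mse = np.mean((y_pred - y_true) ** 2)
    fairness = abs(np.corrcoef(err, attr)[0, 1])  # |corr(error, attribute)|
    return mse + lam * fairness

# Toy per-tract predictions, observations, and a protected attribute.
y_pred = np.array([1.0, 2.0, 4.0, 8.0])
y_true = np.array([1.5, 2.0, 3.0, 6.0])
attr = np.array([10.0, 20.0, 30.0, 40.0])  # e.g., % white population per tract
```

Setting λ = 0 recovers the ordinary prediction loss, which is why a grid over small λ values traces out the accuracy-fairness interaction reported in the tables.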
Second, the effects of the proposed unfairness correction method vary across models and protected attributes. For example, Table 5 shows that when mitigating the income bias, setting λ to 0.025 reduces the absolute value of the PAG by only 54.923% for STGCN, while for ConvLSTM the same setting leads to a 95.081% reduction. In addition, the Chicago case study reveals that, compared with education and age, the absolute values of the PAG for race and income are more likely to be largely reduced (i.e., to less than 1%).
Third, by choosing an appropriate λ, fairness and accuracy can be improved at the same time. Taking the Chicago dataset as an example, adding a 0.025 fairness weight to STGCN simultaneously reduces the absolute values of the PAG and correlations for all protected attributes while even reducing RMSE by 1.347%.
Moreover, we found that MLR and ARIMA show limited capability in mitigating unfairness. In the Chicago dataset, the prediction accuracy disparities for education and age (as shown in the change of PAG) for MLR and ARIMA even increased after debiasing multiple protected attributes. Also, our examination of the Austin dataset indicates that, after incorporating the proposed fairness regularization term, although the PAG for MLR and ARIMA decreased, the magnitude of this reduction was comparatively modest relative to the other models. These two models are less flexible than the deep learning models since they have a prespecified model structure; we believe this inherent limitation hinders their effectiveness in addressing fairness concerns.
Lastly, in most cases, the proposed multi-attribute unfairness correction method performs better in reducing prediction disparities and preserving accuracy than debiasing a single attribute, especially for the more complex deep learning models (e.g., GRU, T-GCN, STGCN and ConvLSTM). For example, Table 5 shows that when considering multiple attributes together, GRU and ConvLSTM can close more than 94% of the absolute PAG for income, while in the single-attribute scenario the PAG is only reduced by around 60%. However, we also observed that in certain cases single-attribute unfairness correction can produce fairer performance; for example, GRU is more effective in reducing the PAG when only debiasing age in the Austin dataset. To more comprehensively demonstrate the efficacy of the proposed multi-attribute unfairness correction approach and to pinpoint potential shortcomings of the single-attribute bias correction method, we conducted a comparative analysis of the unfairness correction outcomes achieved by debiasing the age variable alone versus debiasing multiple attributes simultaneously. We chose the top-performing model, ConvLSTM, for this demonstration. The resulting findings can be found in Table 7.
We found that correcting unfairness for one attribute might even create more bias for other protected attributes, which aligns with a previous study (Zheng et al., 2021). This finding highlights the importance of considering multiple protected attributes at once. Specifically, the results show that, compared with the original model that purely focuses on prediction accuracy, solely correcting the unfairness of the age variable can indeed reduce the absolute value of its PAG. However, by considering only age, the PAG for other variables, especially race and income, even increases. For example, in the Austin dataset, debiasing only age shrank the PAG for age from 4.642% to 0.771% while significantly worsening the PAG for income from 0.888% to 8.238%. This unexpected outcome further suggests that transportation resource allocations intended to be fair across age groups could nonetheless remain unfair across communities with different income levels. Notably, the results show that the proposed multi-attribute unfairness correction method can effectively debias multiple protected attributes: in almost all cases the absolute value of the PAG dropped significantly compared with the original model, without sacrificing much prediction accuracy.

Sensitivity analysis of fairness weight
We also explored the influence of the fairness weight λ in shaping the interaction between accuracy and fairness, based on the predictive performance of the seven models with four protected attributes. Fig. C.1 in Appendix C illustrates the sensitivity analysis of λ with respect to accuracy and fairness. The x-axis is the value of λ and the y-axis shows the performance metrics (RMSE, correlation coefficient and PAG). Generally, the accuracy of all models decreases as λ increases. In terms of RMSE, the marginal effect of λ on the more complex models is relatively small. The figures show that as λ grows, the correlation first changes drastically and then remains flat. Notably, a small weight (λ ≤ 0.1) can make the correlation drop to around 0. The PAG shows a decreasing trend as λ increases, but in most cases the gap may be over-corrected when λ is greater than 0.1. Consistent with the tables in Section 5.2, a suitable fairness weight likely lies between 0 and 0.1. This finding further reinforces the effectiveness of the proposed unfairness correction approach: incorporating only a small fairness weight can lead to a significant improvement in producing fair predictions. We also found that increasing the fairness weight may not monotonically reduce the PAG. This echoes the results in Zheng et al. (2021), who showed that increasing the fairness weight might even widen the PAG. Our computational experiments show that this scenario frequently occurs for the traditional statistical models. This finding also suggests the need for more fine-grained search ranges of λ when tuning hyperparameters. Overall, the sensitivity of the effects of λ shows great consistency across the two case studies. Finally, we noticed that in the Austin case, setting the fairness weight to 0.4 for GRU led to a substantial increase in RMSE and PAG. One possible reason is that this combination of hyperparameters might cause exploding gradients and thus numerical instability.

Comparison with benchmark fairness regularization methods
This study compares the performance of the proposed unfairness correction approach (i.e., the absolute correlation regularization term) with three state-of-the-art benchmark regularizers: Equal Mean (EM) (Calders et al., 2013), Region-based Fairness Gap (RFG) and Individual-based Fairness Gap (IFG) (Yan and Howe, 2020). For these experiments, we only consider the single-attribute scenario, as the three benchmark regularizers are explicitly designed to address unfairness for a single protected attribute. For the Chicago ridesourcing dataset, we select race (percentage of white population) for model debiasing; for the RideAustin dataset, the education variable (percentage of bachelor's degree holders) is chosen. All benchmark regularizers are run with the best-performing λ yielded by our proposed method. Table 8 presents the comparative analysis between the proposed method (i.e., the absolute correlation regularizer) and the three benchmark regularizers. The results show that the proposed method evidently outperforms the others in preserving prediction accuracy as well as improving fairness. Among all regularizers, EM delivers the worst performance. This is expected, since EM focuses on balancing the target variable (i.e., ridesourcing demand) of the disadvantaged and advantaged groups instead of the prediction accuracy. This objective could be questionable, since variations in ridesourcing usage between population groups may naturally exist due to socioeconomic and demographic disparities (Brown, 2019). RFG and IFG tend to yield better outcomes than EM in terms of both accuracy and fairness. Moreover, in certain scenarios, their performance (especially for correlation and RMSE) surpasses that of the proposed method. We attribute this to their ability to effectively reduce variations in per capita travel demand within each population group, as indicated in Yan and Howe (2020). However, these two metrics may still not fully account for the inherent disparities of different population groups in generating travel demand (Zhang and Zhao, 2022). In most cases, especially for deep learning models with more complex architectures, the proposed method significantly reduces the PAG between disadvantaged and advantaged groups while keeping the prediction error lowest. Although in some cases the proposed method is not the best performing in terms of accuracy, the resulting RMSE remains within an acceptable range.
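The contrast with EM can be made concrete with a short sketch. As described above, EM penalizes the gap in mean predicted demand between groups rather than the gap in prediction accuracy; this is our illustrative reading of Calders et al. (2013), with made-up values.

```python
import numpy as np

def em_penalty(y_pred, is_advantaged):
    """Equal Mean style penalty: gap in mean *predicted demand* between
    advantaged and disadvantaged groups (not the gap in accuracy)."""
    return abs(y_pred[is_advantaged].mean() - y_pred[~is_advantaged].mean())

# Toy predicted demand for four tracts, two per group.
y_pred = np.array([10.0, 12.0, 3.0, 5.0])
adv = np.array([True, True, False, False])
```

Because the penalty targets the predictions themselves, it can be large even when the model is equally accurate for both groups, which is exactly why EM can be questionable when demand genuinely differs across groups.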

Discussion
The above sections demonstrate the modeling results of our proposed unfairness correction method. In this section, we discuss the merits of the unfairness correction method, policy implications, and the limitations of the work and future research directions.

Merit of the unfairness correction method
The merit of the proposed unfairness correction method is threefold. First, it provides a new regularizer to simultaneously debias multiple protected attributes. The current literature rarely discusses how to effectively address fairness issues for multiple protected attributes, yet a method that can accommodate various fairness needs is necessary for real-world applications (Wan et al., 2023). This study addresses this gap by using the multiple correlation coefficient (i.e., the R of a linear model) as a regularization term incorporated into the loss function. The multiple correlation coefficient directly measures the correlation between the target variable (i.e., prediction accuracy) and a set of protected demographic variables (i.e., race, age, education and income). By minimizing it, AI models can simultaneously debias multiple sensitive attributes. Unlike adding multiple regularization terms, one for each attribute (Yan and Howe, 2020), this approach is straightforward and easy to implement, and there is no need to fine-tune a separate fairness weight for each attribute (one is enough). Also, this approach raises little concern about multicollinearity among the protected attributes (as shown in Appendix B), since the goal of the linear model is to use the protected attributes to forecast the prediction errors rather than to estimate and interpret the beta coefficients (Shmueli, 2010). Overall, the proposed unfairness correction method enables future studies to flexibly debias single or multiple protected attributes of interest.
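The multiple-correlation regularizer described above can be sketched as follows. This is our reading, not the authors' code: regress the per-node prediction error on the protected attributes and take the correlation between the fitted and actual errors, which equals R; minimizing it debiases all attributes at once. (In training, `torch.linalg.lstsq` stays differentiable for a full-rank design, so gradients flow back to the model.)

```python
import torch

def multiple_correlation(err, attrs):
    """err: (n,) per-node prediction errors; attrs: (n, k) protected attributes.
    Returns the multiple correlation coefficient R between err and attrs."""
    X = torch.cat([torch.ones(err.shape[0], 1), attrs], dim=1)  # add intercept
    beta = torch.linalg.lstsq(X, err.unsqueeze(1)).solution     # OLS fit
    fitted = (X @ beta).squeeze(1)
    e_c = err - err.mean()
    f_c = fitted - fitted.mean()
    # Correlation between actual and fitted errors; small epsilon for stability.
    return (e_c * f_c).sum() / (e_c.norm() * f_c.norm() + 1e-8)

# Errors perfectly explained by the first attribute -> R close to 1.
attrs = torch.tensor([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
err = 2.0 * attrs[:, 0] + 0.5
R = multiple_correlation(err, attrs)
```

A single scalar R covers all k attributes, which is why only one fairness weight λ needs tuning, in contrast to one term (and one weight) per attribute.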
Second, flexibility and transparency. The proposed unfairness correction method is model-agnostic and may generalize to different applications and data modalities. We implemented it on both statistical and deep learning models, and the results jointly demonstrate that, in general, the approach mitigates unfairness while only slightly reducing overall accuracy. Specifically, we correct unfairness by incorporating an explicitly designed absolute correlation regularization term into the loss function without modifying the model structure. This gives the method great flexibility, as it is independent of the underlying model: scholars can adopt any model they want when addressing fairness issues. The method also enjoys great transparency, since end-users (e.g., stakeholders) can easily understand how fairness is taken into account and improved (through the fairness regularization term). Moreover, the method is transferable to other forecasting applications. Besides travel demand forecasting, other important problems, including traffic count forecasting, pedestrian activity forecasting and crash frequency forecasting, may also harbor hidden fairness problems. Researchers can apply the proposed method to address these issues and support fair decision-making. This study only examined the proposed method using time-series (panel) data; however, we believe it can be readily generalized to other data modalities. For example, transportation-planning models, which usually use cross-sectional data, should also be examined with fairness analysis. Our unfairness correction method can be flexibly adopted by planning models (e.g., Zhang and Zhao, 2022) to inform the fair design of transportation ecosystems. Flexibility is also reflected in the fact that, once the models are trained, access to the protected attributes is no longer required. Unlike post-processing techniques that always require access to the protected attribute (Agarwal et al., 2019; Hardt et al., 2016), our approach lifts this restriction and can be flexibly adapted for future forecasting tasks.
Third, effectiveness in achieving fairness while preserving prediction accuracy. Multiple studies have reported a trade-off between accuracy and fairness in machine learning (e.g., Agarwal et al., 2019; Berk et al., 2017), i.e., reducing unfairness inevitably triggers an accuracy drop. Our scheme addresses this trade-off by incorporating an interactive weight coefficient (i.e., λ) into the loss function. We treat λ as a hyperparameter of the learning task and tune it together with the other hyperparameters. In this way, the model automatically finds the hyperparameter combination that best improves fairness while maintaining prediction accuracy. Most of our experiments reveal that this approach can significantly reduce unfairness at only a small expense in accuracy; in some cases, it can even significantly improve fairness while slightly improving prediction accuracy.

Policy implications
Dynamically balancing supply and demand in transportation systems is important for improving cost-effectiveness and efficiency, and this balance relies heavily on accurate predictions (Chu et al., 2019). Although machine learning greatly improves predictions, it may simultaneously introduce bias. Overall satisfactory predictions may hide a large prediction accuracy gap across areas of a city or underrepresented groups of residents (Yan and Howe, 2020; Zheng et al., 2021). Our study confirms this finding. Specifically, Table 3 shows that both machine learning and statistical models can produce lower prediction accuracy for disadvantaged communities (i.e., the non-white-majority, lower-educational-attainment, older and low-income communities) than for advantaged communities. This predictive disparity implies that if transportation planners naively use such travel demand forecasting models without accounting for fairness, the modeling results will lead to ineffective transportation resource allocation, impede the mobility of disadvantaged communities, and possibly further exacerbate the existing operational biases of ridesourcing services, e.g., higher trip-cancellation rates, longer waiting times and higher per-mile fees for disadvantaged communities (Brown, 2022; Brown et al., 2019; Brown and Williams, 2021; Yang et al., 2021).
Our proposed method can help mitigate the unfairness of current ridesourcing operations to better serve disadvantaged communities. We believe ridesourcing policymakers should consider incorporating the proposed method into the travel demand modeling framework to inform fairer ridesourcing resource allocation and operations. Additionally, the two fairness metrics can be used by city governments to evaluate and regulate ridesourcing operations. Moreover, the fairness measurements and the unfairness correction method can be adopted to facilitate the effective operation of other travel modes such as public transit and shared micromobility. For example, an accurate and fair demand forecasting model will enable transit authorities to provide more personalized transit services that balance operational efficiency and effectiveness (Ermagun and Tilahun, 2020). Also, a fairness-aware travel demand forecasting model will help micromobility (e.g., bikeshare and e-scooter) operators better rebalance vehicles and ensure a fair distribution of service availability throughout the day (Yan et al., 2021).

Limitations and future research directions
This study has some limitations that warrant follow-up investigation. For example, we only evaluated the proposed methodology using two fairness metrics (i.e., the prediction accuracy gap and the correlation coefficient). Future work may consider a wider range of fairness metrics for a more comprehensive evaluation. Moreover, by using correlation techniques, we assume the prediction accuracy is linearly correlated with the protected attributes; future studies may explore whether this association is nonlinear and develop corresponding methods. Another widely debated topic is the connection between accuracy and fairness. Several previous studies have shown that the accuracy-fairness trade-off exists across datasets and applications (Berk et al., 2017; Chouldechova and Roth, 2018), while others have shown that improvements in accuracy and fairness can co-occur (Yan, 2021). Future investigations may shed further light on this relationship, for example by identifying scenarios in which fairness and accuracy can both be enhanced or in which the trade-off is prominent. Finally, this study only examined one travel mode (i.e., ridesourcing). A more comprehensive analysis covering various travel modes (e.g., transit, car-sharing and shared micromobility) and diverse contexts (e.g., different locations) should be conducted to test the generalizability and robustness of the unfairness correction method.

Conclusion
This study examines the fairness issues in travel demand forecasting models and develops a new methodology to enhance their fairness while preserving the prediction accuracy.
By leveraging two real-world ridesourcing-trip datasets from Chicago, IL and Austin, TX, we evaluate the unfairness issues of seven state-of-the-art AI-based travel demand forecasting models. A novel and transparent in-processing method, based on an absolute correlation regularization term, is proposed to simultaneously address the unfairness arising from multiple protected attributes. We also compare the performance (in both fairness and accuracy) of the proposed unfairness correction method with three state-of-the-art unfairness correction methods to show its effectiveness.
The results highlight that both statistical and machine learning models have pronounced fairness issues, i.e., the prediction accuracy for advantaged groups is notably higher than for disadvantaged groups. The proposed unfairness correction method can effectively enhance fairness for multiple protected attributes while preserving prediction accuracy. The comparative study reveals that the proposed method significantly outperforms other methods in both fairness and accuracy. Beyond performance, the proposed method has remarkable flexibility: it is model-agnostic and can be adapted to different applications and data modalities. In summary, this study advances our understanding of fairness issues in travel demand forecasting and equips transportation researchers with a powerful tool to foster fairness within the transportation ecosystem.

Notes:
Corr represents correlation. PAG refers to the prediction accuracy gap. The value inside each bracket is the percentage change of the metric in absolute value, computed as (|o| − |m|) × 100%/|o|, where o denotes the initial value from the fairness-unaware model and m the final value from the fairness-aware model. A positive value indicates improvement and a negative value indicates deterioration.

Figure C. 1 :
Figure C.1: Sensitivity analysis of λ across two case studies

Table 1 :
A list of symbols and notations

Table 3 :
Modeling results of the benchmarks (Chicago)

Table 4 :
Modeling results of the benchmarks (Austin). Notes: Corr represents correlation. All correlations are statistically significant at the 1% level.

Table 5 :
Multi-attribute unfairness correction in Chicago.

Table 6 :
Multi-attribute unfairness correction in Austin.

Table 7 :
Performance comparison between only debiasing Age and simultaneously debiasing multiple attributes using ConvLSTM.

Table 8 :
Comparison with state-of-the-art benchmark regularizers