Travel Mode Choice Prediction Using Imbalanced Machine Learning

Travel mode choice prediction is critical for travel demand prediction, which influences transport resource allocation and transport policies. Travel modes are often characterised by severe class imbalance and inequality, which leads to inferior predictive performance for minority modes and bias in travel demand prediction. In existing studies, the class imbalance in travel mode prediction has not been addressed with a general approach. Basic resampling methods were adopted without much investigation, and the performance was assessed by commonly used metrics (e.g., accuracy), which are not suitable for predicting highly imbalanced modes. To this end, this paper proposes an evaluation framework to systematically investigate the combination of six over/undersampling techniques and three prediction methods. In a case study using the London Passenger Mode Choice dataset, results show that applying over/undersampling techniques to travel mode data substantially improves the F1 score (i.e., the harmonic mean of precision and recall) of minority classes, without considerably degrading the overall prediction performance or model interpretation. These findings suggest that combining over/undersampling techniques and statistical/machine-learning methods is appropriate for predicting travel mode, as it effectively mitigates the influence of class imbalance while achieving high predictive accuracy and model interpretation. In addition, the combination of over/undersampling techniques and prediction methods enriches the model options for predicting mode choice, which would better support transport planning.

including the multinomial logit model and its variants [1], [2]. Recently, there has been growing interest in using machine learning methods for modelling travel mode choice, including support vector machine (SVM), deep neural network (DNN), and extreme gradient boosting (XGB). It is reported that XGB and DNN methods have higher predictive power than discrete choice models in predicting travel mode [3], [4], [5], [6].
In travel mode prediction, the class imbalance between modes has become a common and prominent issue, which leads to the underestimation of the minority class. Due to different levels of transport service provision and transport policies, travel mode choice data are often highly imbalanced [7], i.e., some modes are used much more frequently than others. Table I shows the class imbalance of travel mode choice in literature. The degree of class imbalance is measured by the ratio of the number of trips of the minority mode to the majority mode [8]. In most datasets, this degree is less than 0.1, which indicates a high level of imbalance. The class imbalance would severely compromise the model estimation and predictive performance, as the model tends to focus on the majority class whilst ignoring the minority class [9]. Nevertheless, to the best of our knowledge, class imbalance in travel mode choice prediction has not received adequate attention and has not been well tackled. This study aims to deepen the understanding of whether and how mode choice imbalance can be tackled.
The rest of the paper is structured as follows. Section II surveys the methods used in previous studies to tackle class imbalance in travel mode prediction and the evaluation metrics for assessing the performance of travel mode prediction; Section III first specifies the workflow of this paper, then briefly introduces the selected travel mode prediction models and over/undersampling (OUS) techniques to be combined, followed by a comprehensive evaluation framework proposed to assess prediction on highly imbalanced datasets; Section IV describes the London Passenger Mode Choice dataset and variables used for prediction, as well as the setup of experiments; Section V evaluates the model performance using the framework proposed in Section III and discusses the suitability of various combinations. Finally, Section VI concludes this paper and proposes future research directions.

II. LITERATURE REVIEW
A. Methods Used to Tackle Class Imbalance in Travel Mode Prediction
Although class imbalance is a common challenge in predicting travel mode choices, which may cause inferior performance for predicting modes with smaller shares [6], [7], [10], [11], [12], [13], limited efforts have been made to solve this problem. Hagenauer and Helbich [10] tried to deal with the class imbalance in travel modes by randomly oversampling the minority class and undersampling the majority class when pre-processing data. However, whether data over/undersampling improves the prediction is poorly understood because the prediction performance on the original and processed datasets was not compared. Pirra and Diana [11] adopted a modified SVM method that assigns different weights to different classes in the decision function of SVM, and found that this method outperformed the plain SVM. Qian et al. [14] introduced adjusted kernel scaling in developing an SVM model and found it improved the accuracy of the minority class classification in some cases. However, both methods are specifically designed for SVM and do not generalise to other machine learning models. Kim [12] used a class-specific weighting scheme in which each instance is assigned a weight that is inversely proportional to the frequency of its class. However, this approach treats all instances of a class as equally important to the classification. None of these studies proposed a general approach for addressing mode class imbalance, and there is a lack of systematic investigation into how and to what extent class imbalance can be tackled.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

B. Evaluation Metrics to Assess the Performance of Travel Mode Prediction
Most studies on travel mode prediction adopt only one or a few overall performance metrics, such as accuracy [15], [16], [17], recall (or sensitivity) [10], and log-loss [18]. These metrics are insufficient for highly imbalanced mode distributions because they ignore class-specific performance. Specifically, when the data is highly imbalanced, high values of these overall metrics can be achieved by a trivial classifier that always predicts the most frequent class. Furthermore, most studies only use metrics based on discretising the classification by assigning each prediction to the class with the highest probability. This is inadequate for imbalanced data because it is highly likely to result in non-representative mode shares. Therefore, an evaluation framework that includes metrics representing overall and mode-specific, aggregate and disaggregate performance of travel mode prediction is imperative. Rezaei et al. [19] tried to evaluate the impact of resampling techniques on the performance of logit models. However, machine learning models were not considered, and the research only investigated the sign and magnitude of the coefficients when carrying out behavioural analysis.
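The weakness of overall metrics on imbalanced data can be illustrated with a small sketch. The 90/10 split between two modes below is hypothetical and is not taken from any dataset in this paper; the point is only that a classifier which always predicts the most frequent mode obtains a high accuracy while never identifying a single minority trip.

```python
from collections import Counter

# Hypothetical, highly imbalanced sample: 90 driving trips and 10 cycling trips.
y_true = ["drive"] * 90 + ["cycle"] * 10

# Trivial classifier: always predict the most frequent mode.
majority_mode = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_mode] * len(y_true)

# Overall accuracy looks respectable...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ...but the recall of the minority mode is zero.
cycle_hits = sum(t == p == "cycle" for t, p in zip(y_true, y_pred))
cycle_recall = cycle_hits / y_true.count("cycle")

print(f"accuracy = {accuracy:.2f}, cycling recall = {cycle_recall:.2f}")
```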
In summary, the tackling of the class imbalance in travel mode prediction remains unexplored. In the machine learning community, several techniques have been proposed to tackle class imbalance: over/undersampling the original dataset [20], [21], [22], [23], cost-sensitive learning [24], [25], [26], [27], [28], active learning [29], [30], [31], [32], and kernel-based methods [33], [34]. Among them, over/undersampling (OUS) is a straightforward and effective method for the imbalance problem and can be applied to a wide range of classifiers. This study investigates whether OUS techniques can enhance travel mode choice prediction by testing and comparing various combinations of OUS techniques and statistical or machine-learning methods with a comprehensive evaluation framework.
This study contributes to the literature on travel mode prediction as follows. Firstly, it proposes a comprehensive and multifaceted evaluation framework for travel mode prediction, which entails overall model performance, mode-specific performance, and model interpretation. Secondly, it presents a systematic investigation of over/undersampling techniques for tackling class imbalance in travel mode prediction. Thirdly, it verifies that it is viable and efficient to combine over/undersampling techniques and statistical/machine-learning models for predicting travel mode, which mitigates the influence of mode imbalance while achieving high predictive accuracy and model interpretation. This approach can inform transport planning and effectively avoid bias in travel demand prediction.
III. METHODOLOGY
This paper aims to investigate the impact of over/undersampling techniques on travel mode prediction. We first introduced three prediction methods and six over/undersampling (OUS) techniques to be investigated. The prediction methods include one traditional discrete choice model and two advanced machine learning models. Then we proposed a comprehensive evaluation framework for assessing the model performance of travel mode prediction on highly imbalanced travel datasets. Different combinations of prediction models and OUS techniques were evaluated and the best-performing combinations were selected and discussed, as shown in Fig. 1. Table II presents the list of abbreviations and acronyms used in this paper.
A. Travel Mode Choice Models
1) Logit Models: Logit models assume that passengers choose a mode from a set of alternatives to maximise their utility. Under random utility theory [35], each mode has a certain level of utility that consists of two components: a component representing the effects of observed explanatory variables (e.g., travel time, cost) and a random error reflecting the effects of unobserved variables. The utility of choosing mode i is:

$$U_{ni} = V_{ni} + \varepsilon_{ni} = x_{ni}\beta + \varepsilon_{ni}, \quad \forall i \in M_n,\; n = 1, \dots, N \tag{1}$$

where $M_n$ is the set of available modes for trip n; $N$ is the total number of trips; $U_{ni}$ is the utility of alternative travel mode i for trip n; $V_{ni}$ is the representative utility of alternative travel mode i for trip n; $x_{ni}$ is a 1×K vector of explanatory variables of alternative mode i for trip n; $\beta$ is a K×1 vector of coefficients representing the weights attached to the explanatory variables; and $\varepsilon_{ni}$ is the random error of travel mode i for trip n. Different types of logit models are developed by specifying different types of random errors and choices of coefficients of explanatory variables. Most notably, the MNL model is formed when the error terms are independently and identically Gumbel-distributed. In the MNL, the probability that trip n chooses travel mode i is given by (2):

$$P_{ni} = \frac{\exp(V_{ni})}{\sum_{j \in M_n} \exp(V_{nj})} \tag{2}$$

The coefficients of the MNL can be estimated using the maximum likelihood method.
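The MNL choice probability can be sketched in a few lines. This is a minimal illustration rather than the estimation code used in this study: the function name `mnl_probabilities`, the trip attributes and the coefficient values are all hypothetical, and alternative-specific constants are omitted for brevity. The representative utility of each mode is the product of its attribute vector and the coefficient vector, and the probabilities follow from the standard logit (softmax) form.

```python
import math

def mnl_probabilities(x_n, beta):
    """Return MNL choice probabilities for one trip.

    x_n  : dict mapping each available mode to its list of K attributes
    beta : list of K coefficients shared across modes (hypothetical values)
    """
    # Representative utility V_ni = x_ni * beta for every available mode.
    v = {mode: sum(b * x for b, x in zip(beta, attrs)) for mode, attrs in x_n.items()}
    # Subtracting the maximum utility avoids overflow and leaves the result unchanged.
    v_max = max(v.values())
    exp_v = {mode: math.exp(val - v_max) for mode, val in v.items()}
    total = sum(exp_v.values())
    return {mode: e / total for mode, e in exp_v.items()}

# Hypothetical trip: attributes are [duration in hours, cost in GBP].
trip = {
    "walk": [1.0, 0.0],
    "cycle": [0.4, 0.0],
    "pt": [0.5, 2.4],
    "drive": [0.3, 3.0],
}
beta = [-2.0, -0.3]  # illustrative coefficients, not estimated from data

probs = mnl_probabilities(trip, beta)
for mode, p in probs.items():
    print(f"{mode}: {p:.3f}")
```

With negative coefficients on both duration and cost, the mode with the highest representative utility (here cycling) receives the largest probability, and the probabilities sum to one.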
2) Machine Learning Models: Machine learning models consider mode choice prediction as a classification problem, i.e., given input variables, predicting the most likely mode and/or the probability of all alternatives. The objective is to learn a target function that maps input variables to the output target. A range of machine learning models have been used to predict travel mode choice, which include tree-based models, Naïve Bayes, support vector machine, and neural network [36]. Notably, the tree-based ensemble model (represented by extreme gradient boosting) and DNN have been attracting interest because of their high predictive power and capability of estimating choice probability [3], [4], [5].
Extreme gradient boosting (XGBoost) [37] is an efficient and scalable ensemble approach that uses decision trees as base predictors. XGBoost is trained in an additive manner: starting from an initial low-accuracy prediction, it iteratively adds trees to minimise a regularised loss function. In each iteration, the new tree is fitted to the gradient of the loss with respect to the current ensemble prediction, so that it concentrates on the instances that the existing trees predict poorly. The final prediction of XGBoost is the sum of the outputs of all trees, each scaled by a learning rate. XGBoost has proved suitable for mode prediction, due to its high predictive accuracy, robustness, interpretability, and ability to derive well-calibrated choice probabilities [4], [38].
Deep neural network (DNN) is an Artificial Neural Network (ANN) with multiple layers between the input and output layers. The DNN can model complex non-linear relationships between variables as the data goes through the weighted connections between DNN layers. The output of the DNN consists of k units corresponding to k classes of mode choice. Moreover, DNNs can reveal utility functions and behavioural patterns when applied to mode choice analysis [39]. Because of their extraordinary predictive power and satisfactory interpretability, DNNs have been adopted in transportation studies, including predicting travel mode, route choice, and automobile ownership.

B. Oversampling and Undersampling Techniques
Oversampling and undersampling techniques adjust the class distribution by replicating or synthesising samples in minority classes or by removing samples in majority classes. These techniques can be combined with various prediction methods and are likely to tackle imbalance in travel mode prediction. Note that in prediction tasks involving a training and testing set, a good practice is to apply oversampling and undersampling to only the training set, not the testing set. This guarantees fair and unbiased model evaluation on the testing set. It is noteworthy that oversampling and undersampling fundamentally differ from sampling or resampling in statistics. Statistical sampling refers to extracting a subset of individuals from the population to infer characteristics of the whole population, and the extracted sample is expected to follow the distribution of the population.
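The practice of resampling only the training set can be sketched as follows. This is an illustrative example with hypothetical data and a hypothetical helper `random_resample`, which uses random removal and random duplication as the simplest ways of adjusting the class distribution; it is not the implementation used in the experiments. The split is made first, and the testing set is left with its original mode distribution.

```python
import random
from collections import Counter

def random_resample(rows, labels, target_per_class, seed=0):
    """Resample every class to exactly target_per_class instances.

    Classes above the target are randomly undersampled (without replacement);
    classes below it are randomly oversampled (duplicates drawn with replacement).
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target_per_class:
            chosen = rng.sample(members, target_per_class)
        else:
            chosen = members + rng.choices(members, k=target_per_class - len(members))
        out_rows.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_rows, out_labels

# Hypothetical data: 100 trips in chronological order, 80% driving and 20% cycling.
rows = [[float(i)] for i in range(100)]
labels = (["drive"] * 4 + ["cycle"]) * 20

# 1) Split first: the earliest 70 trips train the model, the latest 30 test it.
train_rows, train_labels = rows[:70], labels[:70]
test_rows, test_labels = rows[70:], labels[70:]

# 2) Resample the training set only; the testing set is never touched.
balanced_rows, balanced_labels = random_resample(train_rows, train_labels, target_per_class=30)

print("train before:", Counter(train_labels))
print("train after: ", Counter(balanced_labels))
print("test (as is):", Counter(test_labels))
```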
In this study, six OUS techniques were selected and compared, as these methods represent the state-of-the-art sampling-based solutions for imbalanced data [40], [41]. They comprise two basic methods (RUS and ROS) and four advanced methods (OSS, NCR, SMOTE-NC, and ADASYN), the latter chosen because of their good performance in existing studies.
1) Undersampling Methods: Random undersampling method (RUS) works by randomly removing instances in major classes until the predefined class balance is achieved. This method is straightforward and efficient, with no assumptions about the data distribution. However, its major drawback is that potentially useful instances can be removed. In order to tackle this problem, new undersampling techniques have been proposed that identify and remove redundant, noisy and/or borderline instances from majority classes. Specifically, redundant instances are points that add little information about the majority classes, while noisy instances represent randomness in the data. Borderline instances are close to the boundary between classes and are unreliable as small changes to borderline instances' attributes would lead to considerable shifting of the decision boundary [42].
As one of the advanced undersampling approaches, One-Sided Selection (OSS) [20] combines Condensed Nearest Neighbour (CNN) (for removing redundant instances) and Tomek Links (for removing borderline/noisy instances). In Step 1, let S be the original data; a subset C is generated that contains all instances of the minority classes and one randomly selected majority instance. Then, each instance in S is classified using its nearest neighbour in C, and the misclassified instances are added to C. In this way, C excludes redundant instances, i.e., those that are already correctly classified by their nearest neighbours. In Step 2, majority class instances that belong to Tomek Links are removed from C, so that all minority class instances are retained. Tomek Links [43] can be briefly explained as follows: a pair of instances a and b is a Tomek Link if three criteria are met: (i) a and b belong to different classes, (ii) a's nearest neighbour is b, and (iii) b's nearest neighbour is a. By definition, instances that belong to Tomek Links are either noisy or boundary instances. The resulting set C is the output of OSS.
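The three Tomek Links criteria can be checked directly with a brute-force nearest-neighbour search. The sketch below is illustrative only: the function names, the one-dimensional points and the class labels are hypothetical, and which member of each link is dropped is left to the caller through the `drop_class` argument.

```python
import math

def nearest_neighbour(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and belong to different classes."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbour(j, points) == i:
            links.append((i, j))
    return links

def members_to_drop(links, labels, drop_class):
    """Indices of link members that belong to drop_class."""
    return sorted({k for pair in links for k in pair if labels[k] == drop_class})

# Hypothetical one-dimensional data: two classes meeting around x = 1.1.
points = [(0.0,), (1.0,), (1.2,), (3.0,), (3.1,)]
labels = ["maj", "maj", "min", "min", "min"]

links = tomek_links(points, labels)
print("Tomek links:", links)
print("majority members:", members_to_drop(links, labels, drop_class="maj"))
```

Points 1 and 2 are each other's nearest neighbour and differ in class, so they form the only link; points 3 and 4 are mutual neighbours but share a class and are therefore not a link.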
Another undersampling approach is the Neighbourhood Cleaning Rule (NCR) [44], which adopts the rule of Wilson's Edited Nearest Neighbours (ENN) [45] to eliminate noisy/borderline majority class instances. In NCR, the three nearest neighbours of each instance a are computed and used to classify a. If a is from the majority classes and is misclassified by its three nearest neighbours, a is removed as it is considered a borderline/noisy instance. If a is a minority class instance and is misclassified by its nearest neighbours, then the majority class instances among a's nearest neighbours are removed.
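The two removal rules can be sketched as below. This is a simplified illustration with hypothetical names and toy data, using a brute-force three-nearest-neighbour vote; refinements found in published implementations (for example, restricting which majority classes may be cleaned) are deliberately left out.

```python
import math
from collections import Counter

def knn_vote(i, points, labels, k=3):
    """Classify points[i] by majority vote of its k nearest neighbours."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
    neighbours = others[:k]
    predicted = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
    return predicted, neighbours

def ncr_removals(points, labels, majority_classes, k=3):
    """Indices of majority instances flagged for removal by the two rules."""
    to_remove = set()
    for i in range(len(points)):
        predicted, neighbours = knn_vote(i, points, labels, k)
        if predicted == labels[i]:
            continue  # correctly classified: nothing to clean
        if labels[i] in majority_classes:
            to_remove.add(i)  # rule 1: misclassified majority instance
        else:
            # rule 2: majority neighbours of a misclassified minority instance
            to_remove.update(j for j in neighbours if labels[j] in majority_classes)
    return to_remove

# Hypothetical one-dimensional data.
points = [(0.0,), (0.1,), (0.2,), (0.3,),   # majority cluster (indices 0-3)
          (5.05,),                          # majority point inside the minority cluster
          (5.0,), (5.1,), (5.2,),           # minority cluster (indices 5-7)
          (0.16,)]                          # minority point inside the majority cluster
labels = ["maj"] * 5 + ["min"] * 4

removed = ncr_removals(points, labels, majority_classes={"maj"})
print("majority instances removed:", sorted(removed))
```

Index 4 is removed by the first rule, and indices 1, 2 and 3 (the three nearest neighbours of the misclassified minority point at 0.16) are removed by the second rule; no minority instance is ever removed.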
2) Oversampling Methods: In random oversampling (ROS), minority class instances are randomly selected and duplicated in the data until a balanced class distribution is obtained. Because it adds exact copies, ROS is prone to overfitting on the training data and may therefore fail to generalise to unseen data. To avoid overfitting, more advanced oversampling approaches have been proposed that create synthetic instances of minority classes. Among these approaches are the Synthetic Minority Over-sampling Technique (SMOTE) and the Adaptive Synthetic Sampling approach (ADASYN).
SMOTE [46] works by first selecting a minority class instance a at random and finding its k nearest minority class neighbours, denoted S. An instance b is randomly picked from S. Then, a synthetic instance is generated at a randomly chosen point on the line segment between a and b, i.e., as a convex combination of a and b. This approach is plausible as new synthetic instances are generated from pairs of minority class instances that are sufficiently close.
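The generation of one synthetic instance can be sketched as follows. The function name `smote_sample` and the minority points are hypothetical, and the sketch handles continuous variables only; it is meant to illustrate the interpolation step, not to replace a library implementation.

```python
import math
import random

def smote_sample(minority, k=3, rng=None):
    """Generate one synthetic minority instance by interpolation."""
    rng = rng or random.Random(0)
    # Pick a minority instance a at random.
    a_idx = rng.randrange(len(minority))
    a = minority[a_idx]
    # Find its k nearest minority neighbours (brute force).
    others = sorted((j for j in range(len(minority)) if j != a_idx),
                    key=lambda j: math.dist(a, minority[j]))
    b = minority[rng.choice(others[:k])]
    # Interpolate at a random position between a and b.
    gap = rng.random()
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

# Hypothetical minority instances: [duration in hours, cost in GBP].
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]

rng = random.Random(1)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
for s in synthetic:
    print([round(v, 3) for v in s])
```

Because each synthetic point lies on a segment between two existing minority instances, every coordinate stays within the range spanned by the original minority data.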
Although SMOTE proves effective for continuous data, it is not applicable for data consisting of continuous and nominal variables. One such example is the travel survey data which include continuous variables of travel duration and cost, as well as nominal variables of gender, trip purpose, and trip mode. To deal with a mixture of continuous and nominal attributes, we use a variant of SMOTE, SMOTE-Nominal Continuous (SMOTE-NC for short). It differs from the SMOTE in two aspects. Firstly, the distance between two instances a and b consists of two components, namely the difference of continuous variables and the penalty term of differing nominal variables. The penalty term is defined as the median of standard deviations of all continuous variables across the minority class. Secondly, in the generation of synthetic instances, while the continuous variables are interpolated using the same procedure as SMOTE, each nominal variable is given the most frequent value in the k-nearest neighbours.
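The two modifications introduced by SMOTE-NC can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the population standard deviation is used for the penalty term, and the penalty enters the Euclidean distance as one squared term per differing nominal variable. Library implementations may differ in these details.

```python
import math
import statistics
from collections import Counter

def nominal_penalty(minority_continuous):
    """Median of the standard deviations of the continuous variables,
    computed over the minority class (population std is an assumption)."""
    stds = [statistics.pstdev(column) for column in zip(*minority_continuous)]
    return statistics.median(stds)

def smote_nc_distance(a_cont, a_nom, b_cont, b_nom, penalty):
    """Euclidean distance on continuous variables plus one squared penalty
    for every nominal variable on which a and b differ."""
    continuous_part = sum((x - y) ** 2 for x, y in zip(a_cont, b_cont))
    n_differing = sum(x != y for x, y in zip(a_nom, b_nom))
    return math.sqrt(continuous_part + n_differing * penalty ** 2)

def synthetic_nominal(neighbour_nominals):
    """Most frequent value of each nominal variable among the neighbours."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*neighbour_nominals)]

# Hypothetical minority instances with three continuous variables.
penalty = nominal_penalty([[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]])

# Same nominal values: plain Euclidean distance (3-4-5 triangle).
d_same = smote_nc_distance([0.0, 0.0], ["work"], [3.0, 4.0], ["work"], penalty=12.0)
# One differing nominal value: the penalty is added under the square root.
d_diff = smote_nc_distance([0.0, 0.0], ["work"], [3.0, 4.0], ["leisure"], penalty=12.0)

# Nominal values of a synthetic instance: [gender, trip purpose].
nominal = synthetic_nominal([["male", "work"], ["female", "work"], ["male", "leisure"]])
print(penalty, d_same, d_diff, nominal)
```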
Alternatively, ADASYN [8] adaptively generates minority class instances based on the difficulty level of classifying original minority class instances. Specifically, it tends to generate synthetic instances close to the original instances that are incorrectly classified by a k-nearest neighbours classifier. It uses the same procedure as SMOTE-NC to generate synthetic instances by interpolation. Through this strategy, ADASYN increases the density of hard-to-classify minority class instances close to the borderline and then improves the classification performance.
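The adaptive allocation can be sketched as follows. In the original ADASYN formulation, the difficulty of a minority instance is measured by the share of majority instances among its k nearest neighbours, and the number of synthetic instances generated around it is proportional to that share. The function name and the one-dimensional data are hypothetical, and only the weighting step is shown.

```python
import math

def adasyn_weights(minority, majority, k=5):
    """Normalised difficulty weights, one per minority instance.

    The weight of an instance is proportional to the fraction of majority
    instances among its k nearest neighbours in the full dataset.
    """
    points = list(minority) + list(majority)
    is_majority = [False] * len(minority) + [True] * len(majority)
    ratios = []
    for i in range(len(minority)):
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: math.dist(points[i], points[j]))[:k]
        ratios.append(sum(is_majority[j] for j in neighbours) / k)
    total = sum(ratios)
    if total == 0:
        # No minority instance has majority neighbours: fall back to equal weights.
        return [1 / len(minority)] * len(minority)
    return [r / total for r in ratios]

# Hypothetical data: three "easy" minority points and one surrounded by majority points.
minority = [(0.0,), (0.1,), (0.2,), (3.0,)]
majority = [(2.9,), (3.1,), (3.2,), (6.0,)]

weights = adasyn_weights(minority, majority, k=2)
print(weights)
```

Multiplying each weight by the total number of synthetic instances required gives the number to generate around each minority instance; here all of them would be placed near the borderline point at 3.0.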

C. Evaluation Framework
We proposed a comprehensive evaluation framework to systematically assess the performance of mode choice prediction from three aspects: overall model performance, mode-specific performance, and model economic interpretation, as shown in Part 2 of Fig. 1. The metrics of this framework not only assess how well the market share is predicted, but also how accurately the modes of individual trips are predicted. Furthermore, the prediction performance of each mode was discussed to explicitly show the impact of imbalance and how OUS techniques tackle this problem. Using economic interpretation metrics, we validated that applying OUS techniques will not distort travellers' behaviour patterns in mode choice prediction. This framework is applicable to both logit models and machine learning models.
1) Overall Model Performance: The overall performance refers to the model's predictive power for the entire dataset that consists of multiple travel modes. Specifically, the overall performance includes three aspects, namely aggregate prediction performance, disaggregate prediction performance, and weighted performance.
Aggregate prediction performance of a model concerns the model's capability to reproduce and predict the aggregate choice distribution of each mode, i.e., the market mode share. This performance can be assessed using the mean absolute deviation of market share (MADMS), which is defined as:

$$\mathrm{MADMS} = \frac{1}{|M|}\sum_{i \in M}\left|\frac{1}{|N|}\sum_{n \in N} P_{ni} - P_i\right|, \qquad P_i = \frac{|N_i|}{|N|} \tag{3}$$

where $M$ is the set of travel modes; $N$ is the set of all trips; $N_i$ is the set of trips that choose travel mode i; $|\cdot|$ is the cardinality function that outputs the number of elements in a set; $P_{ni}$ is the predicted choice probability of travel mode i for trip n; and $P_i$ is the actual market share of travel mode i. The MADMS metric is similar to the L1-norm error for mode share prediction [36], which is defined as the sum of the absolute differences between the predicted and actual market shares.
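A minimal sketch of the MADMS computation is given below, assuming that the predicted market share of a mode is the average of its predicted choice probabilities over all trips and that the deviations are averaged over modes. The function name and the toy data are hypothetical.

```python
def madms(pred_probs, chosen, modes):
    """Mean absolute deviation between predicted and actual market shares.

    pred_probs : list of per-trip probability lists, ordered as in `modes`
    chosen     : list of the modes actually chosen, one per trip
    modes      : list of mode names
    """
    n_trips = len(chosen)
    deviations = []
    for i, mode in enumerate(modes):
        predicted_share = sum(p[i] for p in pred_probs) / n_trips
        actual_share = sum(c == mode for c in chosen) / n_trips
        deviations.append(abs(predicted_share - actual_share))
    return sum(deviations) / len(modes)

modes = ["walk", "drive"]
chosen = ["walk", "drive", "drive", "drive"]          # actual shares: 0.25 / 0.75

uninformed = [[0.5, 0.5]] * 4                          # predicted shares: 0.50 / 0.50
calibrated = [[0.25, 0.75]] * 4                        # predicted shares: 0.25 / 0.75

print(madms(uninformed, chosen, modes))   # deviation of 0.25 for each mode
print(madms(calibrated, chosen, modes))   # shares reproduced exactly
```

The second model reproduces the market shares exactly even though it cannot distinguish individual trips, which is why MADMS is complemented by disaggregate metrics.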
On the other hand, disaggregate prediction performance concerns the model's ability to accurately predict the mode of each trip record. In literature, this performance has been evaluated by a range of metrics, including accuracy, precision, recall, and F1 score. Herein, we mainly use accuracy and macro-average F1 score to evaluate the disaggregate prediction performance.
The metrics of accuracy, F1 score, and others are based on a summary of individual predictions. Given the predicted and actual labels, the prediction of each class can be summarised by a confusion matrix (see Fig. 2). The accuracy of travel mode prediction is defined as the proportion of accurately predicted trip records to the total number of records, as given in (4), where N is the set of all trips; M is the set of travel modes; $TP_i$ and $TN_i$ are the frequencies of True Positive and True Negative instances of travel mode i, respectively. Precision and recall are two common metrics to measure the predictive performance of each class, and both range from 0 to 1. Specifically, precision is the proportion of true positive predictions among all positive predictions, while recall is the proportion of actual positive instances that are correctly identified. They are defined as:

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \tag{5}$$

$$\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i} \tag{6}$$

where $FP_i$ and $FN_i$ are the frequencies of False Positive and False Negative instances of travel mode i, respectively. There is often a trade-off between precision and recall, meaning that improving one metric tends to reduce the other. For this reason, the F1 score has been proposed to reconcile both metrics, which is defined as the harmonic mean of precision and recall. The F1 score of travel mode i is expressed as:

$$F1_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i} \tag{7}$$

In this study, the F1 score is used in two ways. First, we use the F1 score of each mode to describe the mode-specific predictive performance. A higher F1 score indicates a better predictive performance of the mode. Second, we use the macro-average F1 score to describe the overall disaggregate predictive performance, which is the average F1 score across all modes, as shown in (8):

$$\mathrm{Macro\ F1} = \frac{1}{|M|}\sum_{i \in M} F1_i \tag{8}$$

Likewise, the higher the macro F1 score, the better the overall predictive performance.
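The mode-specific metrics and the macro-average F1 score can be derived from a confusion matrix as sketched below. The function names and the two-mode confusion matrix are hypothetical, and accuracy is computed here simply as the share of correctly predicted trips, following the verbal definition above.

```python
def per_mode_metrics(confusion, modes):
    """Precision, recall and F1 for each mode.

    confusion[r][c] is the number of trips whose actual mode is modes[r]
    and whose predicted mode is modes[c].
    """
    metrics = {}
    for i, mode in enumerate(modes):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(len(modes))) - tp
        fn = sum(confusion[i]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[mode] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

def macro_f1(metrics):
    """Unweighted average of the per-mode F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)

def accuracy(confusion):
    """Share of trips on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)

# Hypothetical confusion matrix: rows are actual modes, columns are predicted modes.
modes = ["cycle", "drive"]
confusion = [[2, 8],     # 10 cycling trips, only 2 predicted correctly
             [2, 88]]    # 90 driving trips, 88 predicted correctly

metrics = per_mode_metrics(confusion, modes)
print("accuracy :", accuracy(confusion))
print("F1 cycle :", round(metrics["cycle"]["f1"], 4))
print("F1 drive :", round(metrics["drive"]["f1"], 4))
print("macro F1 :", round(macro_f1(metrics), 4))
```

In this example the accuracy is dominated by the majority mode, whereas the macro-average F1 score is pulled down by the poorly predicted minority mode.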
A main challenge in model evaluation and comparison is the conflict between different metrics: it is often impossible to achieve the best performance in all metrics simultaneously. Therefore, we propose a weighted method to combine the metrics into an overall score. All three metrics are first standardised by min-max scaling so that they share the same scale, and each standardised metric is then multiplied by a weight. Each weight ranges between zero and one, and the three weights sum to one. The overall score is defined as the sum of the weighted standardised metrics in Equation (9), where S(·) represents the standardisation procedure.
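The weighted scoring and a simple sensitivity test over weights can be sketched as follows. Several details are assumptions made for illustration: the weights are ordered as (MADMS, accuracy, macro F1), the standardised MADMS is inverted so that a lower deviation yields a higher score, and the metric values of the three candidate combinations are invented.

```python
def min_max(values):
    """Scale values linearly to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def overall_scores(madms, accuracy, macro_f1, weights):
    """Weighted sum of standardised metrics for each candidate combination."""
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    # ASSUMPTION: MADMS is an error measure, so its scaled value is inverted.
    s_madms = [1.0 - s for s in min_max(madms)]
    s_acc = min_max(accuracy)
    s_f1 = min_max(macro_f1)
    return [w1 * a + w2 * b + w3 * c for a, b, c in zip(s_madms, s_acc, s_f1)]

# Hypothetical results for three model/OUS combinations.
names = ["combo_A", "combo_B", "combo_C"]
madms = [0.01, 0.03, 0.05]
accuracy = [0.74, 0.72, 0.70]
macro_f1 = [0.50, 0.57, 0.55]

# Sensitivity test: which combination wins under different weightings?
best_by_weights = {}
for weights in [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.4, 0.3, 0.3)]:
    scores = overall_scores(madms, accuracy, macro_f1, weights)
    best_by_weights[weights] = names[scores.index(max(scores))]
    print(weights, "->", best_by_weights[weights])
```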
There are different approaches to determining the weights. First, the weights can be selected based on expert knowledge or the need for real-world applications. Second, if there is limited prior knowledge of the weights, it is recommended to try a range of weight values in a sensitivity test.
2) Mode-Specific Performance: The mode-specific performance refers to the model's predictive power for each travel mode. The mode-specific performance may vary significantly across modes, especially when the mode frequency is highly imbalanced. Many studies paid much attention to the overall model performance while ignoring the impact of imbalance on each travel mode. Here, we measured the mode-specific performance by the F1 score of each mode (as discussed above), which provides a detailed understanding of the model's capability and would reveal whether OUS improves the prediction of each mode.
3) Model Economic Interpretation: A well-performing prediction model for travel mode should have not only high predictive power but also accurate and reliable economic information regarding travel behaviours. In this context, the economic information includes the marginal effect and elasticity of travel modes regarding input variables, the value-of-time in different modes, and the substitution pattern of alternatives [39]. Interpreting the economic information of logit models is straightforward. Recent studies show that machine learning models can provide economic information that is as reliable as that of logit models [39].
In this study, the focus of the model interpretation is whether using OUS on the data would alter the economic information in the prediction models. To achieve this, we calculate the average elasticities of four travel modes with regard to travel duration or cost. Elasticity (also known as the standardised derivative) measures the percentage change in the choice probability of a mode as a result of a one per cent change in an input variable. Mathematically, it is defined as:

$$E_{ik} = \frac{1}{N}\sum_{n=1}^{N}\frac{\partial P_{ni}}{\partial x_{nik}}\cdot\frac{x_{nik}}{P_{ni}} \tag{10}$$

where $E_{ik}$ is the average elasticity of travel mode i with regard to the k-th variable; $N$ is the total number of trips; $P_{ni}$ is the predicted choice probability of trip n of choosing travel mode i; and $x_{nik}$ is the k-th variable of travel mode i for trip n. A positive elasticity means that an increase in the input variable leads to an increase in the choice probability of the given mode, while a negative value means an increase in the variable causes a decrease in the choice probability. We note that there are other metrics for economic information in travel behaviours. A comprehensive discussion of behaviour analysis in mode choice is available in Wang et al. [39].
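The average elasticity can be approximated numerically for any model that returns choice probabilities, which is one way to treat logit and machine learning models alike. The sketch below is an illustration under stated assumptions: the derivative is estimated by a central finite difference, the "model" is a toy two-mode logit with a single duration coefficient, and all names and values are hypothetical.

```python
import math

def choice_probs(durations, beta=-2.0):
    """Toy two-mode logit: utility of each mode is beta * duration."""
    exp_v = [math.exp(beta * d) for d in durations]
    total = sum(exp_v)
    return [e / total for e in exp_v]

def average_elasticity(trips, mode, predict, eps=1e-6):
    """Average elasticity of P(mode) with respect to the mode's own variable.

    trips   : list of per-trip variable lists (one value per mode)
    mode    : index of the mode of interest
    predict : function mapping a trip's variables to choice probabilities
    """
    total = 0.0
    for x in trips:
        up, down = list(x), list(x)
        up[mode] += eps
        down[mode] -= eps
        # Central finite difference approximates dP/dx.
        dp_dx = (predict(up)[mode] - predict(down)[mode]) / (2 * eps)
        total += dp_dx * x[mode] / predict(x)[mode]
    return total / len(trips)

# Hypothetical trips: durations in hours for [public transport, driving].
trips = [[0.5, 0.3], [1.0, 0.4], [0.2, 0.6]]

elasticity_pt = average_elasticity(trips, mode=0, predict=choice_probs)
print(round(elasticity_pt, 4))
```

For this toy logit the numerical result can be checked against the closed-form own elasticity, beta * x * (1 - P), averaged over trips; the negative sign indicates that a longer public transport journey lowers the probability of choosing public transport.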

IV. DATA AND SETUP OF EXPERIMENTS
A. Data
The dataset of London Passenger Mode Choice (LPMC) from April 2012 to March 2015 [4] was used in this study. This dataset was derived from the London Travel Demand Survey (LTDS), an annual survey that captures a detailed snapshot of journeys made by every over-five-year-old member of the selected household on a selected day. The key steps that generate the LPMC dataset from LTDS include: (1) removing the trips that had the same postcode in origin and destination; (2) assigning each trip to one of the four travel modes; (3) simplifying the trip purposes to five main purposes; (4) adding travel time and cost information of four modes to LTDS by utilising Google Map API and Oyster cards. The resultant LPMC dataset contains 81,086 trips generated by 31,954 individuals across 17,616 households. The four main travel modes accounting for 99.5% of trips are walking, cycling, public transport, and driving (including car passenger, taxi, van and motorbike). Table III indicates that the mode shares are considerably stable between 2012 and 2015. Driving accounts for more than 40% of total trips, followed by public transport accounting for 35% of trips. In contrast, less than 3% of trips use cycling. The large difference between the major and minor modes reveals the severe class imbalance in mode choice, which is consistent with the mode choice data mentioned above.
The LPMC dataset contains a wide range of variables about the household (e.g., household members, car ownership), individual (e.g., gender, age, ticket types) and trip (e.g., trip purpose, departure time, travel mode, trip duration and cost of alternatives). Compared with LTDS, which provides only the trip information of the chosen mode, one big improvement of LPMC is that it provides the trip cost and duration of all four alternative modes, which are estimated using an online directions service. We selected 14 variables for this study (see Table IV), which are in line with Wang et al. [39]. As the cost of walking and cycling is zero for all trips, these two cost variables were not included in the list.
The duration and cost of travel modes are used differently in discrete choice and machine learning models. In discrete choice models, the duration and cost of a mode are only used in the corresponding utility function. In contrast, in machine learning, the duration and cost of all modes are fed into the algorithm, and then the algorithm automatically determines the variables for building models.

B. Setup of Experiments
To gain insight into the impact of class imbalance and different OUS techniques, we tested 18 combinations of three mode prediction models and six OUS techniques, as introduced in Section III. These combinations were compared with the models using the original dataset. The computation was conducted on a Windows 10 desktop (Intel i7 CPU, 3.1 GHz, with 15 GB of memory). The logit and machine-learning models were constructed and trained in Python using the packages listed in Table V. There are two sources of randomness in the mode prediction: first, applying OUS techniques introduces randomness; second, the model training of DNN and XGB involves randomness, as the training may converge to local minima rather than global minima, which is known as the model non-identification challenge of machine learning. Therefore, each combination of models and OUS techniques is assessed 100 times and the average metric is used. Specifically, the OUS is applied ten times, which generates ten datasets; for each dataset, the training and evaluation of the prediction model are repeated ten times.
We use holdout sample testing in order to emulate the real-world application of predicting future trips and to avoid data leakage. The dataset is split into a training set (April 2012-March 2014, totalling 54,766 instances) and a hold-out testing set (April 2014-March 2015, totalling 26,320 instances), which is consistent with the data splitting in Hillel et al. [4]. While the training set is used for model optimisation and final model training, the testing set provides an unbiased performance evaluation of final models.
Regarding model optimisation, we used the optimum hyperparameters of the Opt-DNNs in Wang et al. [39] without further tuning. This is reasonable as both studies use the London dataset provided by Hillel et al. [4]. On the other hand, we tuned the hyperparameters of XGB using the sequential model-based optimisation algorithm (a form of Bayesian optimisation) via the hyperopt library (using 100 iterations). The XGB hyperparameters were optimised on the original dataset without OUS techniques. The optimal hyperparameters of DNN and XGB are shown in Table VI.

A. Overall Model Performance
Figs. 3-5 show how the three metrics (MADMS, accuracy and Macro F1 score) varied for each combination of travel mode prediction models and datasets. The details of the three metrics are available in Appendices A and B. Overall, machine learning models outperform the MNL models, with the DNN models showing the best aggregate predictive performance while XGB models have the best disaggregate predictive performance.
When aggregate predictive performance is considered, all models achieved better performance on the original dataset than on the resampled data. The DNN model had the lowest MADMS (0.0040), followed by the MNL and XGB models with comparable performance. The advanced undersampling techniques (i.e., OSS and NCR) could keep the MADMS at a low level. On the contrary, RUS and all the oversampling techniques should not be used if MADMS is the only criterion of mode prediction.
In terms of the metrics indicating disaggregate prediction performance, the accuracy showed a similar pattern to MADMS. All the models achieved their highest accuracy when the original dataset was used. In contrast, the Macro F1 score demonstrates a different trend. Although the MNL models still performed best when the original dataset was used, both machine learning methods achieved their best Macro F1 score when SMOTENC was used, followed by ADASYN. The highest Macro F1 score was 0.5703 when the XGB method and SMOTENC-oversampled dataset were used. Thus, the oversampling techniques showed the capability of improving the Macro F1 score for XGB and DNN models. The trade-off between the three metrics is illustrated in Fig. 6.
The optimal combination of the prediction model and OUS technique depends on the relative importance of these metrics. To this end, we designed ten scenarios with differing weights and relative importance in the three metrics. Table VII presents the optimal and sub-optimal combinations in each scenario. The machine learning methods achieve a more balanced performance across the three metrics, as 19 out of 20 optimal or sub-optimal combinations were XGB or DNN models. The MNL model was the sub-optimal model only when accuracy and Macro F1 score were neglected in the evaluation (i.e., W1 = 1.0). In addition, the original dataset was among the optimal or sub-optimal combinations when the weights were similar or when the dominant metric was MADMS or Macro F1 score. XGB models combined with SMOTENC or ROS datasets were the optimal or sub-optimal combinations when Macro F1 score was the major metric (i.e., W3 ≥ 0.8). These results indicate that XGB models with oversampling techniques achieved better disaggregate prediction performance at the cost of inferior performance in the MADMS. Meanwhile, the models with the original dataset had high accuracy and lower MADMS but did not perform well on the Macro F1 score.
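The scenario-based ranking can be illustrated with a small weighted-sum sketch. Several elements here are assumptions rather than the paper's procedure: the metrics are min-max normalised across candidates, MADMS is inverted so that higher is better, W1, W2 and W3 are taken to weight MADMS, accuracy and Macro F1 score respectively, and the metric values of the three candidates are hypothetical numbers chosen only for illustration.

```python
def min_max(values):
    """Scale values to [0, 1]; if all values are equal, return 1.0 for each."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def overall_scores(candidates, w1, w2, w3):
    """Weighted sum of normalised metrics (assumed form of the overall score)."""
    names = list(candidates)
    # MADMS is lower-is-better, so negate it before normalising.
    madms_n = min_max([-candidates[n]["madms"] for n in names])
    acc_n = min_max([candidates[n]["accuracy"] for n in names])
    f1_n = min_max([candidates[n]["macro_f1"] for n in names])
    return {
        n: w1 * m + w2 * a + w3 * f
        for n, m, a, f in zip(names, madms_n, acc_n, f1_n)
    }


# Hypothetical metric values, not results from the paper.
candidates = {
    "DNN + original": {"madms": 0.004, "accuracy": 0.74, "macro_f1": 0.52},
    "XGB + SMOTENC": {"madms": 0.030, "accuracy": 0.72, "macro_f1": 0.57},
    "MNL + original": {"madms": 0.010, "accuracy": 0.70, "macro_f1": 0.48},
}

# Scenario in which only MADMS matters.
scores_madms_only = overall_scores(candidates, w1=1.0, w2=0.0, w3=0.0)
best_madms_only = max(scores_madms_only, key=scores_madms_only.get)

# Scenario in which Macro F1 score dominates.
scores_f1_major = overall_scores(candidates, w1=0.1, w2=0.1, w3=0.8)
best_f1_major = max(scores_f1_major, key=scores_f1_major.get)
```

With these illustrative inputs the preferred combination changes as the weight moves from MADMS to Macro F1 score, mirroring the kind of scenario dependence reported in Table VII.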
The OUS techniques add more flexibility to model selection for predicting travel mode. While machine learning models combined with the original dataset had the best overall performance in most scenarios, the combinations of XGB and oversampling techniques are the best choices if Macro F1 score is the focus of the prediction task.

B. Mode-Specific Prediction
The mode-specific F1 scores provide a better understanding of how different OUS techniques improve mode prediction. Fig. 7 shows that the mode-specific F1 scores of all the travel modes ranged from 0.50 to 0.78, except for cycling. Both machine learning models had higher F1 scores compared with the MNL model, especially for public transport and driving. XGB models performed best not only in Macro F1 score, but also in mode-specific F1 scores.
Notably, the F1 score of cycling is (nearly) zero for the models using the original dataset and the OSS and NCR datasets. This is because very few or no cycling records are correctly predicted. Given that cycling accounts for 3% of the total trips, the severe underprediction of cycling is problematic and unacceptable. This implies that we should be cautious about overall performance metrics (e.g., Macro F1 score), which might hide the underprediction of minority modes. Evaluating the prediction performance only at the overall level may therefore be misleading. To avoid this, it is essential to add mode-specific performance into our evaluation framework, enabling a closer look at the impact of imbalance on each mode and at how OUS techniques tackle this issue.
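The following sketch shows how overall metrics can mask a minority mode that is never predicted. The function names and the toy labels are assumptions made for illustration; only the 3% cycling share is taken from the text, and the remaining shares are hypothetical.

```python
def per_class_f1(y_true, y_pred, classes):
    """F1 score of each class, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)


classes = ["walk", "cycle", "pt", "drive"]

# Toy sample of 100 trips with a 3% cycling share.
y_true = ["cycle"] * 3 + ["walk"] * 17 + ["pt"] * 35 + ["drive"] * 45
# Every cycling trip is predicted as driving; all other trips are correct.
y_pred = ["drive"] * 3 + ["walk"] * 17 + ["pt"] * 35 + ["drive"] * 45

f1_by_mode = per_class_f1(y_true, y_pred, classes)
macro = macro_f1(y_true, y_pred, classes)
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the accuracy is 0.97 and the Macro F1 score is about 0.74, yet the F1 score of cycling is exactly zero, which is visible only in the mode-specific scores.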
Another thing to note is that RUS exhibited good prediction performance for cycling, which is different from the other undersampling methods (OSS and NCR). This is because RUS reduces the modes with a higher share and leads to a dataset with equal share of different modes (or with no class imbalance). Similarly, the oversampling techniques (i.e., ROS, SMOTENC, and ADASYN) could mitigate the imbalance in the original dataset. Thus, the issue of underprediction for the minority class was substantially alleviated by using RUS and oversampling techniques, in which the F1 score of cycling is markedly improved in comparison with the original dataset. The implication is that using appropriate OUS techniques could lead to better predictive performance of the minority class (i.e., cycling in this study) without degrading the predictive performance of the other classes.
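The balancing effect of RUS and ROS can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study: it assumes that both techniques resample every class to a common size (the smallest class for RUS, the largest for ROS), and the record layout, class counts and function names are hypothetical. SMOTENC and ADASYN generate synthetic records instead of duplicates and are not covered here.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(records):
    """Group (features, label) records by their label."""
    groups = defaultdict(list)
    for features, label in records:
        groups[label].append((features, label))
    return groups


def random_undersample(records, seed=0):
    """RUS sketch: sample every class down to the size of the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(records)
    target = min(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(rng.sample(g, target))  # without replacement
    return out


def random_oversample(records, seed=0):
    """ROS sketch: duplicate records until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(records)
    target = max(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g)  # keep all original records
        out.extend(rng.choices(g, k=target - len(g)))  # add random duplicates
    return out


# Hypothetical imbalanced training records: (features, label).
class_counts = {"walk": 17, "cycle": 3, "pt": 35, "drive": 45}
train = [
    ({"trip_id": f"{mode}-{i}"}, mode)
    for mode, n in class_counts.items()
    for i in range(n)
]

rus_counts = Counter(label for _, label in random_undersample(train))
ros_counts = Counter(label for _, label in random_oversample(train))
```

After RUS each mode has 3 records, and after ROS each mode has 45, so in both cases the resampled training set has equal mode shares.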

C. Model Interpretation
This section interprets the behavioural pattern and economic information in the constructed models by computing the elasticities of four travel modes regarding input variables. Specifically, for the seven models that achieved a high overall score (as recommended in Table VII), we calculated the elasticities for trip-related variables, including mode-specific duration, cost, and the number of interchanges in transit, as shown in Table VIII. Figures presenting the elasticities of each mode for recommended combinations are available in Appendix C.
In each panel, each entry represents the average elasticity over all respondents in the testing set, which indicates the percentage change in the choice probability of a mode resulting from a one per cent change in the corresponding variable. The elasticities of mode choices regarding their mode-specific variables are highlighted in Table VIII. The average elasticities in the selected models were largely reasonable in terms of signs. The highlighted entries in Table VIII were mostly negative, which is aligned with common sense, as higher travel cost and duration reduce the probability of selecting the corresponding mode. However, a few highlighted positive values did exist. For example, the elasticities of the duration of cycling were positive for the mode of cycling in Panels 2, 5 and 6. This can be attributed to the local irregularity of DNN or model non-identification of XGB and DNN models [39]. Local irregularity means that DNN models have locally irregular patterns (i.e., exploding gradients, the lack of monotonicity) such that certain choice behaviours revealed by DNNs are not realistic. Model non-identification of machine-learning models, on the other hand, means that the objective function of XGB or DNN is not globally convex and that the optimisation of XGB or DNN models may identify local minima or saddle points rather than global minima. In addition, it is worth noting that all highlighted entries in Panel 1 were negative except the number of interchanges in public transit (represented by pt_n_interchanges), which indicates the MNL model had a good performance in travel behaviour analysis.
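The notion of an average elasticity can be illustrated with a finite-difference sketch. The approach below is an assumption made for illustration and may differ from the computation used in the paper: each variable is increased by 1%, the relative change in the predicted probability is recorded, and the result is averaged over respondents. The toy MNL model, its coefficient `beta`, and the sample values are hypothetical.

```python
import math


def mnl_probs(durations, beta=-2.0):
    """Toy MNL: the utility of each mode is beta * duration (hours); hypothetical."""
    utils = {m: beta * d for m, d in durations.items()}
    mx = max(utils.values())  # subtract the maximum for numerical stability
    exps = {m: math.exp(u - mx) for m, u in utils.items()}
    z = sum(exps.values())
    return {m: e / z for m, e in exps.items()}


def elasticity(predict, x, var, mode, rel_step=0.01):
    """Finite-difference elasticity: % change in P(mode) per 1% change in x[var]."""
    base = predict(x)[mode]
    bumped = dict(x)
    bumped[var] = x[var] * (1 + rel_step)
    new = predict(bumped)[mode]
    return ((new - base) / base) / rel_step


def average_elasticity(predict, sample, var, mode):
    """Mean of the individual elasticities over all respondents in the sample."""
    return sum(elasticity(predict, x, var, mode) for x in sample) / len(sample)


# Hypothetical respondents: mode-specific durations in hours.
sample = [
    {"walk": 0.50, "cycle": 0.30, "pt": 0.40, "drive": 0.25},
    {"walk": 1.20, "cycle": 0.60, "pt": 0.70, "drive": 0.50},
]

own = average_elasticity(mnl_probs, sample, var="cycle", mode="cycle")
cross = average_elasticity(mnl_probs, sample, var="cycle", mode="drive")
```

For this toy MNL model the own-duration elasticity of cycling is negative and the cross elasticity of driving with respect to cycling duration is positive, which is the sign pattern expected of the highlighted and non-highlighted entries respectively. The same `elasticity` function can be applied to any model that returns class probabilities, which is how sign irregularities in more flexible models could be inspected.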
The magnitudes of average elasticities in the models were mostly valid and consistent with existing studies. Wang et al. [39] reported the elasticities of travel modes in the same dataset using DNN and MNL models, which are very similar to Panels 1 and 5 in Table VIII. Notably, the elasticities of XGBs were much smaller in magnitude than those of MNLs and DNNs, although the relative magnitudes of the elasticity coefficients in XGBs were similar to those of MNLs and DNNs. For example, Panel 5 indicates that a 1% increase in access time, in-vehicle travel time and interchange time of public transit leads to decreases of 0.55%, 0.34%, and 0.13%, respectively, in the probability of using public transit, while in Panel 2 the corresponding decreases are 0.18%, 0.08%, and 0.05%. Although it is challenging to assess the validity of these results due to the lack of ground truth for elasticity coefficients, the results appear reasonable in relative magnitude. This implies a need for further machine-learning-based mode choice studies that focus on validating the behavioural outputs [16].
The average elasticities in the models with OUS techniques were consistent with those using original datasets. If we compare the results of XGB models in Panels 2, 3, and 4, the corresponding average elasticities were quite close. Moreover, the elasticities of DNN models in Panels 5 and 6 were largely aligned with those reported by Wang et al. [39]. We can conclude that a combination of OUS techniques and machine learning models leads to models with valid and intuitive travel behaviours and economic information.

D. Limitation
In the LPMC dataset, each trip is labelled as one of the four modes: walking, cycling, public transport, and driving (which includes car passenger, taxi, van, and motorbike). For journeys consisting of multiple modes, the assigned mode is the one that covers the longest distance. This labelling rule introduces bias into the recorded travel modes. The mixed-mode prediction can be formulated as a multi-label classification [47], which predicts one or multiple labels (from a given label set) for unseen journeys. Another approach is to create more label classes by combining the current four modes (e.g., 'walking-public-transport'); however, the combination would result in 16 classes of travel modes, which is challenging for classification. We expect that the class imbalance issue will exist for both approaches, and therefore the OUS techniques are likely valid for the mixed-mode prediction.

VI. CONCLUSION
Class imbalance is a common and prominent problem in travel mode data, which leads to the underprediction of the minority class in travel mode prediction and causes biases in transport planning and policy-making. Although machine learning methods have obtained a high predictive accuracy in predicting travel modes, the problem of class imbalance has not been adequately discussed and addressed. This paper fills this research gap by proposing an evaluation framework for assessing the performance of travel mode prediction methods and OUS techniques. The contribution of the framework consists of at least two aspects. First, it examines not only the overall performance of prediction with both aggregate and disaggregate metrics, but also the mode-specific performance that highlights the potential underprediction of minority modes. This framework also incorporates economic interpretation that examines whether the prediction provides valid travel behaviours. Second, because of the conflict between the aggregate and disaggregate metrics, we propose the overall score (i.e., the weighted sum of these metrics) that enables the performance comparison of travel mode prediction in different scenarios.
Using this framework, we conducted a systematic investigation of the combinations of statistical/machine-learning methods (i.e., MNL, DNN, and XGB) and six OUS techniques. It is found that although prediction models with the original dataset had better aggregate prediction performance, most OUS techniques could help improve the disaggregate prediction performance of machine learning models. RUS and oversampling techniques substantially improve the prediction of minority modes whilst keeping the overall prediction performance and model interpretation. On the other hand, the undersampling techniques of OSS and NCR fail to accurately predict the minority mode. Researchers should be careful about the selection of OUS techniques based on the purpose of travel mode choice prediction.
This research suggests that combining OUS techniques and statistical/machine-learning methods is appropriate for predicting travel mode, because it can effectively mitigate the influence of class imbalance while achieving high predictive accuracy and model interpretation. This methodology can effectively avoid bias in travel demand prediction and inform transport policy. For example, cycling is a healthy travel mode and has been advocated by many countries to improve micro-mobility and reduce carbon emissions [48], [49], [50], [51]. Since the outbreak of COVID-19, cycling has become more popular in many countries by substituting public transport in short and medium-distance journeys while keeping social distancing. However, as cycling is much less popular than driving or buses, the travel demand for cycling is often underestimated, which causes further problems in transport resource allocation and policy. While the minority class differs from area to area, a general principle is that no mode should be disadvantaged in prediction because each transport mode benefits some population groups while excluding others [52]. The methodology proposed in this research makes it possible to mitigate class imbalance in travel mode prediction. Moreover, this methodology enriches the model options for predicting mode choice, thereby providing greater flexibility of models for decision-making in transport planning.
The proposed methodology is generalisable to other classification-based transport studies that are subject to class imbalance. Some examples are driving safety risk prediction and driver sleepiness detection [53], [54], [55], where the frequency of incidents and sleepiness is very low and the data distribution is highly imbalanced. The proposed combination of OUS techniques and prediction methods is likely to mitigate class imbalance and improve the prediction for the minority classes. The evaluation framework proposed in this paper can serve to assess whether the class imbalance is addressed.
This research sheds light on several topics that are worth further investigation to improve mode choice prediction. One is the prediction of mixed-mode journeys, as this application is realistic and relevant. Another topic is preference heterogeneity in machine-learning mode choice prediction. While the DNN and XGBoost models in this paper are based on the average effects of mode choice, it would be interesting to look beyond the average effects in order to create models with better performance. A third topic is combining machine learning and causal inference in travel mode choice. Because most machine learning models (e.g., Random Forest and XGBoost) are based on associational relations between variables, they are subject to spurious correlation and might have limitations in model generalisation. Emerging methods that integrate machine learning with causal inference (e.g., causal forest) [56] might lead to an accurate and robust model for travel mode prediction, which is yet to be developed.

A. Further Details of Mean Absolute Deviations of Market Share
See

B. Details of Accuracy and Macro F1 Score
See Tables X and XI shown in the Supplementary Material.

C. Elasticities of Each Mode for Recommended Combinations
See Fig. 8 shown in the Supplementary Material.

ACKNOWLEDGMENT
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.