A Bayesian Optimization Approach for Calibrating Large-Scale Activity-Based Transport Models

Addressing complexity in transportation, in cases such as disruptive trends or disaggregated management strategies, has become increasingly important. This, in turn, is driving the adoption of Agent-Based and Activity-Based modeling. Still, their broad adoption is hindered by high complexity and computational requirements. For example, hundreds of parameters are involved in the calibration of Activity-Based models grounded in behavioral theory, so as to properly frame the required detailed socioeconomic characteristics. To address this challenge, this paper presents a novel Bayesian Optimization approach that incorporates a surrogate model, defined as an improved Random Forest, to automate the calibration of the behavioral parameters. The presented solution calibrates the largest set of parameters in the literature to date by combining state-of-the-art methods. To the best of the authors' knowledge, this is the first work in which such a high dimensionality is tackled in sequential model-based algorithm configuration theory. The proposed method is tested on the city of Tallinn, Estonia, for which the calibration of 477 behavioral parameters is carried out. The calibration process results in a satisfactory performance for all the major indicators: the average OD matrix mismatch is equal to 15.92 vehicles per day, while the error on the overall number of trips is equal to 4%.


I. INTRODUCTION
LARGE-SCALE transportation problems have always been prone to high complexity, approximations, and the lack of a univocal mathematical formulation [1]. This is especially the case for models involving human behavior and the numerous factors ruling over mobility choices, which have been applied to narrow scopes (e.g., modal choices) or have been designed as aggregated (e.g., four-step models). Still, current and future transportation challenges (e.g., urbanization, population growth, and congestion [2], [3], but also disruptive events such as pandemics [4] or the climate crisis) require solutions able to frame transport demand through the lens of individual choices on a large scale. Future innovations on the transport supply spectrum [5] are also foreseen to have disruptive effects on mobility demand. To evaluate said innovations, tools able to frame changed mobility choices and travel habits are needed. Currently, agent-based modeling (ABM) is the most promising solution due to its ability to frame both demand and supply at the agent level (the agent being either the individual or the single vehicle) in a disaggregate fashion, allowing the investigation of emergent behaviors [6]. Specifically, activity-based models are a particular kind of ABM that considers each individual in the population as an agent, while modeling their behavioral choices in a disaggregated fashion. Activity-based and ABM models have already been successfully exploited to carry out policy analyses [7], accessibility studies [8], and forecasting experiments [9]-[11] involving automated mobility. Automated vehicles are a perfect example of the potential of a disaggregate approach, in which aspects such as the fleet size and the routing algorithm need to be assessed based on a realistic demand, rooted in behavioral models rather than in historical patterns. Another ABM application addressing the reported long-term challenges and strongly benefiting from a properly calibrated large-scale activity-based model is the modeling of remote and hybrid working patterns during public health crises [12].
Still, despite these needs, the state of the art concerning the activity-based models behind most ABMs is limited, especially when it comes to the underlying calibration. The different structures of the available activity-based tools and the different magnitude of parameters to be calibrated each time are still a barrier against wider adoption. A unified calibration approach, not dependent on specific software and able to include hundreds of parameters, is still needed to fill this research gap and foster the usage of behavioral activity-based models [13], [14]. This paper tries to accomplish the above, enabling wider adoption of activity-based models by proposing a method for calibration via efficient global optimization, namely Bayesian optimization (BO). BO [15], [16] seeks the global optimum through a sequential sampling design and the approximation of an underlying likelihood function using a surrogate model and surrogate (response) surface. The evaluation of the surrogate model requires significantly less computational time, making it suitable for extensive utilization in guiding the search process and balancing the exploration-exploitation trade-off. As such, the method is designed for optimization problems that feature "expensive" functions in terms of computational time, which are approximated through the surrogate surface. This study focuses on the development of a high-dimensional BO method that converges within a given computational budget and avoids the necessity to master the implementation of the underlying large-scale ABM. The proposed algorithm and the modifications to the BO approach are designed to be transferable to other large-scale problems involving dozens of parameters and fit to be analyzed through a surrogate model.
In Section II, a review of the current literature is carried out, concerning the calibration of large-scale transport models and BO; Section III elaborates on the proposed methodology; Section IV introduces the case study, while Section V reports the results obtained by applying the proposed methodology to the case study; Section VI discusses the results and highlights the main conclusions for this work.

II. LITERATURE REVIEW

A. Calibration approaches in transport modeling
The calibration of activity-based models in transport has received far less attention than the calibration of other ABMs, and existing approaches mainly rely on heuristics, which, in turn, hinders the potential of these tools.
The work presented in [13] describes a gradient- and simulation-based optimization procedure designed to calibrate 28 parameters in a utility-based nested logit system. Similarly, [28] exploits the WSPSA algorithm to calibrate 94 behavioral parameters on the demand side; still, it does not exploit a surrogate model and thus needs multiple computationally expensive runs. Besides, the WSPSA requires the definition of a weight matrix, which becomes difficult to define as the number of parameters increases. In [29], [30], an iterative black-box approach is adopted to calibrate an activity-based transport model, with the former applied to a small network (24 zones) and the latter calibrating only 9 behavioral parameters. Paper [31] succeeds in calibrating 25 behavioral parameters and, similarly to [32]-[34], jointly calibrates an activity-based model with a traffic assignment tool. Papers [32], [33] calibrate both the MATSim software and CEMDAP [35], with MATSim receiving the final calibration based on traffic counts. Still, in these works a calibrated traffic assignment model is needed, which increases the overall calibration effort by increasing the number of factors, but also by somehow putting the two modules (activity-based and traffic assignment) "against" each other in an iterative fashion, whereas a univocal optimization criterion for the two modules is not defined, with convergence being decided by supply-side metrics. This issue is tackled in [36], where MATSim is integrated with a multinomial discrete choice model considering 12 behavioral parameters. Results show that the choice of behavioral parameters becomes a key element in the simulation pipeline, without which it is not possible to reach a good integration with an ABM while avoiding convergence issues. To address this limitation, [37] decouples the traffic assignment from the behavioral components; however, the resulting agents' features and the related behavioral constants in a logit model remain uncalibrated, which limits the transferability or even the usability of the calibrated model. Overall, MATSim applications, while generally more widespread, are more limited in their ability to predict new technologies and disruptive scenarios while also relying on a smaller number of parameters to be calibrated [38].
Based on the above, it is possible to state the following limitations concerning the state of the art: 1) Existing literature tackling the calibration of behavioral parameters for activity-based models on large-scale scenarios is scarce, and only a handful of works try to solve the problem without resorting to heuristics. 2) Most of the calibration methods aim to reproduce outputs of the activity-based models matching the desired supply-side measurements, rather than to calibrate the underlying behavioral parameters. Thus, the calibration of the supply overrules the calibration of the demand. 3) Even the works trying to formalize a rigorous methodology do not consider more than a few dozen parameters (the maximum number being 98 in [28]).

B. Bayesian optimization
BO finds applications in various scientific and industrial domains, e.g., machine learning for hyperparameter optimization [17], [18], modeling of population genetics [19], spreading of pathogens [20], atomic structure of materials [21], [22], as well as cosmology [23], and is established as a state-of-the-art method in lower-dimensional problems [15], [24]. However, the BO performance and its computational efficiency decline as the dimensionality of a problem increases [17], [25]-[27], which is the case with the calibration of large-scale ABMs that feature a large number of behavioral parameters to be tuned.
BO exhibits state-of-the-art performance using Gaussian processes (GP) as a prior distribution to model the surrogate surface and approximate the posterior distribution of the parameters [15], [16]. The probabilistic nature of the GP enables the quantification of the prediction uncertainty based on the spatial vicinity of new samples to the known regions in mathematical spaces. Such quantified uncertainty allows for an efficient trade-off that guides the selection of new samples, which helps the BO achieve state-of-the-art performance. However, the GP comes with a computational bottleneck when applied to high-dimensional problems, constituting a setback for its broader adoption in settings of complex parameter spaces [17], [25]-[27]. Therefore, the straightforward adoption of the BO method in the calibration of activity-based models is hampered by the large number of parameters to be tuned [13], [39].
For that purpose, various methods for dimensionality reduction are adopted [29], [30], [40], or transformations (including partitioning) of the parameter space are applied [41]-[47], but neither fully circumvents the obstacle. Namely, the former requires an increased number of runs, while the latter relies on strong assumptions of low intrinsic dimensionality and compounding effects. The works in [29], [30] invest a significant amount of effort to introduce BO in the field of transportation modeling, but fail to do so without an expensive dimensionality reduction using a deep learning methodology, i.e., auto-encoders. The deep learning methodology has been shown in numerous applications to be a valuable approach in high-dimensional spaces, but it is known to be data-intensive [48], which in the context of transportation activity-based models means a greater number of executions of the models.
Approaches based on transforming the parameter space and decomposing the optimization problem into sub-problems, each mapped to a lower-dimensional space, depend on space properties, among which the intrinsic dimensionality is the most important. In that context, previous works explore a latent space where the function is decomposable, either by latent structures [42] or additive structures [49], [50]. Another approach to dimensionality reduction involves random projections into a latent space [41], [44]-[46] or low-rank matrix approximation [51]. Alternatively, a cylindrical transformation of the parameter space [43] and sequential optimization along a subset of dimensions [52] have been adopted in recent studies. However, the authors of these studies report the performance on benchmark optimization functions, showing that the methods perform well on problems with low intrinsic dimensionality, but fail to depart significantly from the initial points in optimization problems with high intrinsic dimensionality.
Without prior knowledge of the intrinsic dimensionality of large-scale activity-based models and the corresponding calibration process, we consider a variation of the BO method that uses a dimensionality-wise more robust method to approximate the posterior of the parameters, i.e., Random forests [53]. Previously, the Random forest method has been used in a study with low-dimensional optimization problems, featuring discrete mathematical spaces [54].

III. METHODOLOGY

A. Activity-based transport models
An activity-based ABM aims at describing the behavior of potentially millions of agents, each representing a traveler (and/or a vehicle), each capable of multiple choices throughout the simulation horizon. The decision process for each agent is typically modeled via a nested tree, where each node represents a choice, which is in turn defined via utility maximization, solved by evaluating multiple (utility) functions, each described by several parameters and, crucially, the corresponding weights. Utility functions are of the form

U_choice = β_choice · V, (1)

where V is a set of (known) parameters characterizing the agent, which could include, e.g., the socioeconomic features of each agent (defined while creating the synthetic population), and β_choice are the weights defining how much these parameters contribute to the utility function. The hundreds of possible combinations among weights result in a problem for which it is impossible to calculate an analytical solution, while the stochastic nature of the utility-maximizing theory makes employing a numerical solution a necessity. In the following, all weights β_choice present in all utility functions defined in a decision tree are grouped in θ, while the observations are measurements resulting from the emergent behavior of the agents.
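To make the utility structure in (1) concrete, the following is a minimal Python sketch; all names and values are hypothetical, and the actual Preday utilities involve many more terms.

```python
import numpy as np

def utility(beta: np.ndarray, V: np.ndarray) -> float:
    """Linear-in-parameters utility for one choice alternative: U = beta . V."""
    return float(beta @ V)

# Hypothetical example: a 5-category age attribute, one-hot encoded.
beta_age = np.array([0.2, -0.1, 0.4, 0.0, -0.3])  # weights to be calibrated
V_agent = np.array([0, 0, 1, 0, 0])               # agent in the third age category
u = utility(beta_age, V_agent)                    # -> 0.4
```

In the calibration problem, the vectors such as beta_age are the entries of θ that the optimizer perturbs, while V stays fixed per agent.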

B. The conceptual design
BO [15], [16] is a method that seeks the global optimum through a sequential sampling design and the approximation of an underlying likelihood function using a surrogate model over a surrogate (response) surface. The surrogate surface is defined by a parameter space and the discrepancy between observed and simulated outputs. The sequential sampling design is an iterative approach through which new samples (parameters' values) are acquired that maximize the expected (acquisition) utility. The utility is based on an acquisition function, which balances the trade-off between exploration and exploitation of the search space and efficiently guides the search towards the global optimum (Fig. 1).
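The iterative loop can be sketched as follows. This is a toy Python illustration only: the expensive simulator is replaced by a cheap stand-in, and the surrogate-fitting and acquisition-maximization steps are replaced by a naive exploitation-only placeholder; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_simulation(theta):
    # Stand-in for an activity-based model run: any costly black-box cost function.
    return float(np.sum((theta - 0.3) ** 2))

def bayesian_optimization(n_init=5, n_iter=20, dim=2):
    # 1) Initial design: random samples of the parameter space.
    thetas = list(rng.uniform(0, 1, size=(n_init, dim)))
    costs = [expensive_simulation(t) for t in thetas]
    for _ in range(n_iter):
        # 2) Fit a surrogate on the (theta, cost) pairs -- omitted in this toy sketch.
        # 3) Maximize an acquisition utility over candidate points; here a crude
        #    exploitation-only proxy (distance to the incumbent best) stands in.
        candidates = rng.uniform(0, 1, size=(64, dim))
        best = thetas[int(np.argmin(costs))]
        acq = -np.linalg.norm(candidates - best, axis=1)
        new_theta = candidates[int(np.argmax(acq))]
        # 4) Evaluate the expensive model only at the acquired sample.
        thetas.append(new_theta)
        costs.append(expensive_simulation(new_theta))
    return thetas[int(np.argmin(costs))], min(costs)

theta_star, f_star = bayesian_optimization()
```

The surrogate model and acquisition function that replace steps 2) and 3) in the actual method are formalized in the following subsections.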
Formally, a simulation model is a generative stochastic process, and its calibration corresponds to a statistical inference of a finite number of parameters θ ∈ R^d based on a set of observations Y_o:

p(θ|Y_o) = p(Y_o|θ) p(θ) / p(Y_o), (2)

where p(θ) is our prior belief on the distribution of parameter values and p(Y_o|θ) is the likelihood of the observations, given the parameters, derived from a known function L(θ). Since the analytical form of L(θ) is unknown in the underlying challenge, we use the notation L̂_N(θ) for the approximation of L(θ) over a set of N samples. The notation is simplified if the marginal distribution p(Y_o) is omitted, because it does not depend on θ:

p(θ|Y_o) ∝ L(θ) p(θ), (3)

where L(θ) is approximated over a finite sample set (L̂_N(θ)) and reconstructed as the number of samples increases:

lim_{N→∞} L̂_N(θ) = L(θ). (4)

C. The surrogate function and the sampling design
The approximation (L̂_N(θ)) of the likelihood function (L(θ)) is the formal task we address with the BO methodology, using a surrogate function and the sequential sampling design [16]. To model the surrogate function, we use a regressor such as Random forests (RF) [53], which is used to estimate the acquisition utility of newly sampled parameter values through the Expected Improvement (EI) as an acquisition function [64]:

EI(θ) = (f* − µ(θ)) Φ(z) + σ(θ) φ(z), z = (f* − µ(θ)) / σ(θ), (5)

where σ(θ) and µ(θ) are statistics of the inferred posterior distribution, f* is the active optimum discovered in the previous iterations, and Φ and φ are the cumulative distribution and probability density functions of the standard normal distribution, respectively. The expected improvement EI(θ) = 0 if σ(θ) = 0. The structure of (5) reveals the exploration-exploitation trade-off, which favors samples with larger uncertainties that are close to the latest discovered optimal region(s).
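A sketch of the EI computation in (5) for a minimization problem, using only the Python standard library; µ and σ are assumed to come from the surrogate posterior.

```python
import math

def expected_improvement(mu: float, sigma: float, f_best: float) -> float:
    """EI for minimization: EI = (f*-mu)*Phi(z) + sigma*phi(z), z = (f*-mu)/sigma."""
    if sigma == 0.0:
        return 0.0                       # no uncertainty -> no expected improvement
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi
```

Note that a point predicted slightly worse than the incumbent but with high uncertainty can still score a higher EI than a certain point predicted equal to the incumbent, which is exactly the exploration behavior discussed above.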
The RF [53] is an ensemble method composed of C regression trees. Regression trees follow the decision tree concept, with a structure of binary decision nodes built iteratively in a top-down fashion. Each regression tree is built over a subspace of the parameter space, designed by random subsets of both the features (dimensions) and bootstrap samples. Therefore, given a dataset, each regression tree predicts the target for a specific region in the defined space. The prediction of the ensemble, on the other hand, is an aggregation (average) of the outcomes of all C tree base predictors:

f̂(θ) = (1/C) Σ_{i=1}^{C} τ_i(θ), (6)

τ_i(θ) = T_i(θ; Θ_i, Y_i), Θ_i ⊂ Θ, Y_i ⊂ Y, (7)

where C is the number of tree predictors, Θ_i and Y_i are the training datasets of the i-th regression tree T_i that provides a prediction τ_i, while Θ and Y are the global training dataset and the corresponding label set, respectively.
The RF method has a small number of hyper-parameters that can significantly influence the outcome, among which the most commonly tuned are the number of tree components C, the minimum number of samples in a terminating node that controls the structure growth and over-fitting of the individual tree components, and the number of features used to design a sub-space or partition. Additionally, the RF method has shown excellent robustness over high-dimensional data, which limits the bias of the overall predictions by maximizing the variance between base predictors [55].
However, in the context of BO, RF models lack: (i) uncertainty quantification of the predictions (non-probabilistic output); and (ii) the ability to predict values outside of the observed range. Thus, as a standalone surrogate model, the RF greatly limits the efficiency of a probabilistic acquisition function (e.g., EI) in acquiring new promising samples [15], [54].

D. The improved Random Forest
In order to comply with the expected probabilistic output, the RF method is adapted so that it approximates a parametric (normal) probability distribution of the evaluated input parameter values (θ), through the empirically derived mean (µ_θ) and standard deviation (σ_θ):

µ_θ = (1/C) Σ_{i=1}^{C} τ_i(θ), (8)

σ_θ = sqrt( (1/C) Σ_{i=1}^{C} (τ_i(θ) − µ_θ)² ), (9)

where τ_i corresponds to the prediction of a single decision tree model in the RF, as described in (7).
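The empirical moments in (8)-(9) reduce to simple statistics over the per-tree predictions at a query point; a sketch follows (the scikit-learn access pattern mentioned in the comment is one common way to obtain the per-tree predictions, not part of the paper).

```python
import numpy as np

def rf_posterior(tree_predictions: np.ndarray) -> tuple[float, float]:
    """Empirical normal approximation of the RF output at one query point.

    tree_predictions -- shape (C,), one prediction tau_i per tree in the ensemble.
    Returns (mu, sigma) as consumed by the acquisition function.
    """
    mu = float(np.mean(tree_predictions))
    sigma = float(np.std(tree_predictions))   # population std over the C trees
    return mu, sigma

# With scikit-learn, per-tree predictions at a point x come from the fitted ensemble:
#   taus = np.array([t.predict(x.reshape(1, -1))[0] for t in rf.estimators_])
taus = np.array([1.0, 1.2, 0.8, 1.0])
mu, sigma = rf_posterior(taus)   # mu = 1.0, sigma ~ 0.141
```

A degenerate region where all trees agree yields sigma = 0, which (5) then maps to zero expected improvement.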
Having adapted the RF into a compatible method for modeling the surrogate function, which enables the estimation of the acquisition utility for each newly sampled parameter value, the component of BO yet to be formalized in our calibration framework is the optimization of the acquisition utility at each iteration of the iterative optimization process. This optimization corresponds to finding the set of parameter values (θ*) that maximizes the utility:

θ* = argmax_θ EI(θ; θ*_prev), (10)

where θ*_prev is the optimal set of parameter values obtained in previous iterations of the process.
Finding the set of parameter values that maximizes the acquisition utility (i.e., the set estimated to perform best) can be performed in various ways, including Random search, Thompson sampling, gradient-based, or population-based (evolutionary) optimization methods [15], [16], [18]. In this study, we examined the performance of the Random search, gradient-based, and population-based methods, observing that the gradient-based method outperforms the rest. Therefore, for our case study, we adopt the gradient-based Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints (L-BFGS-B) [56], [57].
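A minimal sketch of the inner optimization with SciPy's L-BFGS-B implementation; the quadratic stand-in below replaces the negative acquisition utility (L-BFGS-B minimizes, so the utility is negated), and the box bounds mimic the constraints on the βs. All names and values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def neg_acquisition(theta: np.ndarray) -> float:
    # Stand-in for -EI(theta); a smooth function with its optimum at theta = 0.7.
    return float(np.sum((theta - 0.7) ** 2))

dim = 3
bounds = [(0.0, 1.0)] * dim            # box constraints on each parameter
theta0 = np.full(dim, 0.1)             # restart point, e.g. the previous optimum
res = minimize(neg_acquisition, theta0, method="L-BFGS-B", bounds=bounds)
theta_star = res.x                     # next sample to evaluate in the simulator
```

In practice the surrogate-based acquisition surface is multimodal, so the maximization is typically restarted from several points (including θ*_prev) and the best restart is kept.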

E. The selection of the best-performing simulation
The stochastic nature of the proposed method requires that the optimization is performed multiple times, by which the possibility of settling in a locally optimal solution is reduced. However, depending on the complexity of the high-dimensional problem at hand and the definition of the optimization objectives (cost), it may not be feasible to completely avoid locally optimal regions of the problem space, as there may be several of them. Therefore, an ex post Pareto analysis of the simulated candidates according to a broader set of criteria is recommended. These criteria shall represent important benchmarks or margins of the considered case study and have to be defined accordingly. They differ from the optimization objective (performance measure) that is exploited to guide the algorithm through the learning process, and they are used to filter out non-robust simulations in accordance with the Pareto objectives.
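The ex post filtering can be implemented as a standard non-dominated sort; a sketch follows, assuming all criteria are cast as costs (lower is better), which is our convention rather than the paper's.

```python
import numpy as np

def pareto_front(costs: np.ndarray) -> np.ndarray:
    """Indices of non-dominated rows.

    costs -- shape (n_simulations, n_criteria); each row scores one simulation,
             each column one ex-post criterion, lower is better.
    """
    n = costs.shape[0]
    keep = []
    for i in range(n):
        dominated = any(
            np.all(costs[j] <= costs[i]) and np.any(costs[j] < costs[i])
            for j in range(n) if j != i
        )
        if not dominated:
            keep.append(i)
    return np.array(keep)

# Three candidate simulations scored on two hypothetical ex-post criteria:
scores = np.array([[1.0, 3.0],
                   [2.0, 2.0],
                   [3.0, 3.5]])        # the third row is dominated by the first
front = pareto_front(scores)           # -> indices [0, 1]
```

The simulations on the front are then inspected manually against the case-study benchmarks before one is selected.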

IV. CASE STUDY
The proposed methodology is tested for modeling the city of Tallinn (Estonia). The reference year is 2015, as the mobility survey used both for the generation of the database population and for the calibration was carried out in said year (Fig. 2). SimMobility MT - Preday [58] is employed as the activity-based model, which takes as input a Postgres database containing the following features: 1) A population of ∼400,000 synthetic individuals, matching the city's whole population, where each individual is characterized by socioeconomic features such as age, gender, income, workplace, etc. 2) A spatial map of 616 zones, each 500×500 m wide (Fig. 2). 3) Skim tables detailing the costs of the different travel modes, including waiting time, time on board, etc., among the different zones. The spatial resolution has been set to 500 m to realistically capture the Walking mode of transport, while the skim matrix for public transport has been extracted through Open Trip Planner [59]. More details about the synthetic population, its generation, and assignment may be found in [60].
A run of Preday results in a set of Daily Activity Schedules, one for each leg of a tour.An extract is provided in Table I for a randomly chosen individual (ID 107).
The main strength of Preday lies in the behavioral models used, i.e., a series of nested logit functions allowing the simulation of travel demand based on an established methodology [8], [58], [61]. This feature of the model allows simulating future scenarios for which ground data (e.g., traffic counts) are not yet available, which is done by computing utilities at different levels (binary choice to leave the residence, type and number of tours, modes and destinations, time of the day, and stops). These choices are interrelated, an aspect that is accounted for through the computation of logsums (defined in [8] as the log of the denominator of a logit choice probability).
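As a concrete illustration of the logsum that links the levels of the nested tree (the utilities below are hypothetical; a numerically stabilized implementation would subtract the maximum utility before exponentiating):

```python
import numpy as np

def logsum(utilities: np.ndarray) -> float:
    """Log of the denominator of a logit choice probability over one nest.

    This equals the expected maximum utility of the nest and is fed into
    the utility of the parent node, linking choices across tree levels.
    """
    return float(np.log(np.sum(np.exp(utilities))))

u_modes = np.array([0.5, 0.2, -0.1])   # hypothetical utilities of three modes
ls = logsum(u_modes)                   # ~ 1.328, passed up to the parent choice
```

Because the logsum grows with both the level and the spread of the lower-nest utilities, a change in any β at the bottom of the tree propagates upward, which is why the levels cannot be calibrated in isolation.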
In the following, the formulations of the utility for the binary choice to leave or not the residence (top of the nested logit tree) and for the mode choice (bottom of the nested logit tree) are reported in (11) and (12) to highlight the high number of behavioral parameters involved (and, crucially, to be calibrated).
Note that each variable in (11) is a vector of βs, including as many behavioral variables as categories considered. For example, the β for the age category is a vector of 5 βs, since 5 age categories are considered. Overall, the binary choice alone includes 25 βs to be calibrated. For further information about the nested logit tree, please refer to [58].
On the other side of the tree, the utility related to the bus mode is computed as in (12). Also in this case, the above is a simplified version for presentation purposes, and the number of βs required to compute U_bus is 18. When four modes and four levels of the nested logit tree are considered, the resulting total number of behavioral parameters (βs) to be calibrated is 477. The whole set of βs represents the θ parameters described in Section III; they constitute the search space explored by the algorithm. The full list of βs is publicly available¹. The calibration is carried out against the following baseline data: 1) The OD matrix at subdistrict level. Tallinn has 82 subdistricts, and the movements across them at each time of the day are extracted and upscaled from a mobility survey obtained from Taltech University [62]. 2) Statistical margins concerning workplace distributions and totals at the cell level (500×500 m). The methodology behind these is detailed in [60]. A similar method, albeit simplified, was applied for the margins concerning school institutions. 3) The modal share, as detailed in [63].
These represent the observations Y_o introduced in (2), and the achieved matches are detailed in the calibration and results section.

V. CALIBRATION AND RESULTS

A. The objective function
The design of the calibration process with the proposed methodology features a) a custom objective function, b) an iteration process for the calibration runs, and c) the definition of hyperparameters for both optimization methods, i.e., the global objective and the inner acquisition function.
Formally, the calibration of a simulation model is an optimization task that aims to minimize the discrepancy between a set of simulated and observed outputs. The discrepancy is measured via an objective function and, within this study, we compile a custom function that comprises three components: (i) the Origin-Destination (OD) matrix; (ii) the share of transportation modes; and (iii) the coverage of the employed individuals with scheduled work tours. All three components are adjusted to result in similar ranges and, as such, share equal contributions to the final objective function:

f_global(θ) = (1/n²) Σ_{i,j} |od_ij − õd_ij| + Σ_{i∈M} |M_i − M̃_i| + (1 − w_assign / w_total), (13)

where n is the number of districts covered within the OD matrix, and od_ij and õd_ij are the numbers of observed and simulated tour legs for a given element of the OD matrix, respectively, where i is the origin and j is the destination. M_i and M̃_i are the shares of observed and simulated transportation modes, respectively, for both of which Σ_{i∈M} M_i = 1 and Σ_{i∈M} M̃_i = 1 hold. Regarding the workers component, w_assign corresponds to the number of employed individuals with at least one scheduled work-based tour, and w_total is the total number of employed individuals in the population.
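A sketch of the three-component objective in Python; the exact normalization of each term is our assumption, as the text only states that the components are scaled to similar ranges, and all input values are hypothetical.

```python
import numpy as np

def calibration_objective(od_obs, od_sim, modes_obs, modes_sim,
                          w_assign, w_total):
    """Three-component discrepancy between simulated and observed outputs.

    od_*     -- n x n OD matrices of tour legs
    modes_*  -- modal-share vectors, each summing to 1
    w_assign -- employed individuals with at least one scheduled work tour
    w_total  -- total employed individuals in the population
    """
    n = od_obs.shape[0]
    od_term = np.abs(od_obs - od_sim).sum() / (n * n)  # mean absolute OD mismatch
    mode_term = np.abs(modes_obs - modes_sim).sum()    # L1 modal-share gap
    worker_term = 1.0 - w_assign / w_total             # share of uncovered workers
    return float(od_term + mode_term + worker_term)

od_o = np.array([[0., 10.], [8., 0.]])
od_s = np.array([[0., 12.], [6., 0.]])
f = calibration_objective(od_o, od_s,
                          np.array([0.6, 0.4]), np.array([0.5, 0.5]),
                          w_assign=90, w_total=100)    # -> 1.0 + 0.2 + 0.1 = 1.3
```

Each simulator run produces one such scalar, which is the label the surrogate model is trained on.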

B. The iterative algorithm
In order to account for the stochasticity of the Bayesian optimization and produce stable results, the calibration process is repeated multiple times (five in our experiments), with different random seeds and initial datasets. The starting values of the behavioral parameters have been set within a realistic range, but this pre-adjustment does not amount to a pre-calibration, as the results from the first simulation reported in Figs. 3 and 4 show. The reported results are summarized across all runs, with the selected optimal solution (run no. 2) outperforming all other runs. Each independent run is performed with the same hyperparameters for the optimization methods. A summary of all hyperparameters used in this study is presented in Table II. In Fig. 3, the progression of the performance measure for the 5 sets of iterations is reported.
In order to investigate broader aspects of the solutions and confirm their robustness, we examine the Pareto front of all potential solutions through six additional criteria: i) the modal share (M_i − M̃_i < 10%), the spatial distribution of ii) work and iii) education trips, iv) the absolute number of workers, v) the total legs, and vi) the spatially distributed legs (note that the last two items are directly derived from the OD matrix). Hence, the defined cost function guides the algorithm in its optimization process, but the final results are analyzed and a selection is made against a broader set of criteria. The post analysis confirms that run number 2 outperforms the others after slightly more than 150 iterations and reaches a performance value of ≈ 1.06.

C. Calibration against the baseline
The results reported in the following are the ones arising from run 2. The overall improvement is evident in Figs. 3 and 4. Fig. 4 shows the best-performing simulation in run 2, namely simulation 182. It is compared with the initial simulation, which instead satisfies none of the benchmarks and shows high errors across all the measurements. The second and third benchmarks (legs by rType and educational spatial, not perfectly met in simulation 182) reflect instead absolute quantities and, while the latter error is quite small and can be considered a match, the former will be further commented on in the following. We would like to stress that these six criteria differ from the metrics selected for the performance measure. In fact, while the latter guides the algorithm and the learning process, the former are exploited to filter the best simulations after each iteration. While Fig. 4 and Table III report only the error in the total number of legs, their distribution across the 82 subdistricts of Tallinn has also been compared against the baseline (i.e., the mobility survey). The algorithm is remarkably able to reproduce correct spatial distributions, as shown in Fig. 5.
It should be stressed that these matrices cover a period of 24 hours. This means that, as in Fig. 5, the absolute error of private vehicle trips outside the diagonal averages at 15.92 vehicles. If one considers that the magnitude of simulated trips by private vehicles settles around 50,000 trips (each simulation runs over 10% of the population), the match between the two matrices is impressively close. The diagonal shows instead a somewhat higher variance (albeit still within a reasonable margin, an absolute average of 267.82 trips), and all the cells appear to underestimate the number of intrazonal trips. Still, intrazonal trips are commonly more difficult to frame, so a higher error was expected. The algorithm also succeeds in framing the tour types and the spatial distribution of work and education trips, which are somewhat trickier because they are mandatory and thus subject to stricter correspondence against the benchmark. Table III reports a comparison of the totals, while Fig. 6 reports the spatial distributions of school and work trips.
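The reported mismatch statistics correspond to simple means over the off-diagonal and diagonal entries of the absolute OD difference; a sketch with hypothetical numbers:

```python
import numpy as np

def od_errors(od_obs: np.ndarray, od_sim: np.ndarray) -> tuple[float, float]:
    """Mean absolute OD mismatch, reported separately off- and on-diagonal.

    Off-diagonal entries are interzonal trips; the diagonal holds the
    (harder to frame) intrazonal trips.
    """
    diff = np.abs(od_obs - od_sim)
    off = ~np.eye(od_obs.shape[0], dtype=bool)
    return float(diff[off].mean()), float(np.diag(diff).mean())

od_o = np.array([[100., 10.], [12., 80.]])
od_s = np.array([[ 90., 12.], [11., 95.]])
off_err, diag_err = od_errors(od_o, od_s)   # off-diagonal 1.5, diagonal 12.5
```

Splitting the two statistics makes the pattern in Fig. 5 explicit: a tight interzonal match can coexist with a larger, systematic intrazonal error.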
As can be seen, the algorithm faithfully frames both education tours and the totals, while slightly overestimating work tours. The overestimation has been considered less of a problem than an underestimation, since it was considered mandatory that all the work trips would be framed (every employed person goes to the workplace). An overestimation of work trips instead mistakes only the aim of a tour; this is probably caused by the "emphasis" that the performance measure puts on simulating all the employees going to their workplace.
The distribution of work and education destinations modeled has been checked against the workplaces and education anchor points previously assigned to each eligible individual in the synthetic population [60].Fig. 6 shows a good match between the two, which implies that the calibrated utility parameters do indeed result in the right number of trips and spatial distribution for these two categories.Modal share also reaches a reasonable level of precision against the share recorded in [63], as shown in Fig. 7.
Another important result of the calibration is the time distribution of the stops in each tour throughout the day. This is reported in Fig. 8. As mentioned, the case study has 2015 as the reference year (before the COVID-19 pandemic). This means that the usual travel pattern showing two peaks (one in the morning and one in the afternoon) was to be expected and is coherent with the case study. Besides, the model clearly captures different dynamics, such as an education peak in the early afternoon or a spike in leisure trips in the evening (after work).
While most of the tours do fall in a realistic pattern, the calibrated behavioral parameters result in a small percentage of tours (4%) allocated in the last available time slot. This is due to how SimMobility Preday models the time of the day, allocating at the very end all the tours that could not be fitted throughout the day.

D. Analysis of the calibrated parameters
The set of results provided in this section, together with their comparison against the baseline values, should clarify how the algorithm reaches an acceptable solution in a completely automated way. It is important to stress how this calibration process differs from the traditional one for Preday, which is carried out manually by tuning the various parameters (see https://github.com/smart-fm/simmobility-prod/wiki/Mid-term-calibration), and how it improves on its results. Besides, the literature review reported above highlighted how other, more complex methods do not encompass as many βs as the proposed algorithm. Fig. 9 provides a snapshot of the behavioral parameters and their original and final values for one level of the nested logit tree. For the complete list of βs, please refer to the publicly available repository. One fundamental aspect should be highlighted: the algorithm makes it possible to expand the set of parameters being calibrated, as is apparent in the tme branch of the logit tree (the branch addressing the modal choice for education trips) plotted in Fig. 9, where many parameters have non-null values only in the calibrated results. This is because the manually defined starting values could not encompass such a large set of behavioral parameters, which were therefore set to 0.
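The mechanism can be sketched as follows: in a linear-in-parameters utility, a β initialized to 0 contributes nothing until the automated calibration assigns it a non-null value. The β names and values below are illustrative assumptions loosely echoing Fig. 9, not the actual SimMobility Preday specification:

```python
# Manually defined starting point: betas outside the hand-tuned set are 0.
STARTING_BETAS = {
    "beta_cost": -0.05,
    "beta_travel_time": -0.3,
    "beta_female": 0.0,    # not covered manually -> initialized to 0
    "beta_income": 0.0,    # idem; only the automated calibration tunes it
}

def utility(attributes: dict, betas: dict) -> float:
    """Linear-in-parameters systematic utility: V = sum_k beta_k * x_k."""
    return sum(betas[k] * attributes.get(k.replace("beta_", ""), 0.0)
               for k in betas)

person = {"cost": 2.0, "travel_time": 25.0, "female": 1.0, "income": 1200.0}
v_start = utility(person, STARTING_BETAS)   # zero betas contribute nothing

# After calibration, the previously zero betas enter the utility as well
# (values here are purely illustrative):
calibrated = dict(STARTING_BETAS, beta_female=0.12, beta_income=0.0001)
v_cal = utility(person, calibrated)
```

The socio-demographic terms only start shaping the choice probabilities once the calibration moves them away from 0, which is exactly the expansion visible in the tme branch of Fig. 9.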

VI. DISCUSSION AND CONCLUSIONS
The paper presented a new algorithm that calibrates a large number of parameters by exploiting a surrogate model and BO techniques, and applied it to a case study to prove its effectiveness in calibrating hundreds of behavioral parameters for an activity-based model. The results show a satisfactory match between the modeled outputs and the baseline, which was built from an available mobility survey and aggregate data. By calibrating the model through the presented algorithm, it was possible to tune a wider set of behavioral parameters than would have been possible manually or through heuristics. As shown in the literature review, no other work has succeeded in calibrating as many as 477 parameters, although avoiding doing so would strongly reduce the effectiveness of a nested logit model detailed enough to consider different socio-demographic features. The algorithm searches for the best-performing solution by perturbing all 477 behavioral parameters (referred to as βs throughout the paper), a task that could hardly be performed by hand.
By automating the process and exploiting a surrogate model, the algorithm bypasses the need to set up and run multiple runs of the activity-based model. To put the required computational effort into perspective, a single run of SimMobility Preday for a population of ∼400,000 individuals takes around 6-8 hours on a laptop (Intel Core i5, 1.60 GHz × 8). The proposed algorithm may be used in other scenarios and case studies, improving the feasibility of large-scale activity-based modeling rooted in behavioral science and thus fostering similar studies and tools. The database, the software, and the resulting behavioral parameters are available as open source. Furthermore, the proposed algorithm is not limited to activity-based modeling and transportation problems, which greatly broadens its applicability.
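The acquisition step at the core of the loop in Fig. 1, scoring candidate parameter vectors by Expected Improvement computed from the mean and standard deviation of the surrogate's per-tree predictions, can be sketched as follows. The per-tree predictions and objective value are mocked for illustration, not taken from actual Preday runs:

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu: float, sigma: float, f_best: float) -> float:
    """EI for minimization, given the surrogate's predictive mean (mu) and
    standard deviation (sigma) at a candidate point."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))        # standard normal CDF
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)      # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi

def mean_std(preds):
    """Empirical mean and std of the per-tree predictions."""
    m = sum(preds) / len(preds)
    return m, sqrt(sum((p - m) ** 2 for p in preds) / len(preds))

# One acquisition step over two candidate parameter vectors; each row holds
# the mocked per-tree predictions of the Random Forest surrogate.
tree_preds = [[0.9, 1.1, 1.0],   # candidate A: predicted worse than f_best
              [0.4, 0.6, 0.5]]   # candidate B: predicted improvement
f_best = 0.8                     # best objective value observed so far
ei = [expected_improvement(*mean_std(p), f_best) for p in tree_preds]
next_candidate = max(range(len(ei)), key=ei.__getitem__)
```

Only the candidate maximizing EI is then evaluated with a full activity-based model run, which is what keeps the number of expensive 6-8 hour simulations low.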
Finally, it is worth noting that the study has some limitations that may be addressed in future research. Since the calibration has been carried out against aggregate benchmarks (e.g., the modal share of the whole population), future developments may apply the calibration algorithm to a more disaggregate set of data (calibrating, for example, the modal share per type of tour or type of individual), reducing local discrepancies in the modal share. Moreover, the empirically quantified uncertainty, i.e., the standard deviation of the Random Forest base predictions, tends to collapse to 0 in the regions of the space that are distant from the observed points. This implies similar predictions by all base models and hence an inaccurately estimated uncertainty. This phenomenon is most visible at the initial stage of the BO, when the space is very scarcely sampled; future works may therefore limit the exploration of the distant regions of the parameter space.
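This collapse can be illustrated with a toy bootstrap ensemble of depth-1 regression stumps standing in for the Random Forest surrogate (all data here are synthetic): because trees extrapolate constant values, the spread of the per-tree predictions at a far-away query point equals the spread at the edge of the data rather than growing with distance, as a Gaussian-process posterior would.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Depth-1 regression tree: split x at the midpoint of its range and
    predict the mean of y on each side (a crude stand-in for forest trees)."""
    s = (min(xs) + max(xs)) / 2.0
    left = [y for x, y in zip(xs, ys) if x <= s]
    right = [y for x, y in zip(xs, ys) if x > s]
    m_all = statistics.mean(ys)
    m_left = statistics.mean(left) if left else m_all
    m_right = statistics.mean(right) if right else m_all
    return lambda q: m_left if q <= s else m_right

random.seed(0)
xs = [i / 19 for i in range(20)]            # observations confined to [0, 1]
ys = [x * x for x in xs]
forest = []
for _ in range(25):                          # bootstrap "forest"
    idx = [random.randrange(len(xs)) for _ in range(len(xs))]
    forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

def pred_std(q):
    """Empirical uncertainty: std of the per-tree predictions at q."""
    return statistics.pstdev([tree(q) for tree in forest])

# pred_std is identical at the data boundary and arbitrarily far beyond it,
# so the surrogate reports no extra uncertainty in unexplored regions.
```

Because every tree extrapolates the same constant beyond the sampled range, EI sees artificially confident predictions there, which is why constraining exploration of distant regions is a plausible remedy.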

Fig. 1. Conceptual design of the iterative Bayesian optimization method with a Random Forest as the surrogate model and Expected Improvement (EI) as the acquisition function.

Fig. 4. Comparison between the starting point (1) and the best simulation in run 2 (182): benchmarks against the baseline and residual error. The y-axis represents the percentage for worker coverage and mode distribution, and the number of legs for all the other benchmarks.

Fig. 5. Difference between the number of trips for each OD pair, calculated between the simulated trips and the total obtained by upscaling the mobility survey to the whole population. The axis labels include every third district.

Fig. 6. Comparison of attendances at anchor points for education-related and work-related reasons; the x-axis represents the number of trips to the anchor point in the calibrated simulation, and the y-axis the number of trips to the anchor point in the baseline.

Fig. 7. Comparison of the simulated modal share against the share recorded in [63].

Fig. 8. Temporal distribution of tours throughout the day for the best-performing simulation 182 in run 2.

Fig. 9. Comparison of the parameters, starting (red) and final (green) values, in the logit branch modeling the mode choice for trips to school or other educational institutions (β study, β female travel, β age category, β children in household, β income, β missing income, β work at home, β number of cars in household, β dptour logsum, β employment status).

TABLE II. HYPERPARAMETERS USED FOR THE ALGORITHMS UTILIZED IN THIS STUDY. FOR THE HYPERPARAMETERS NOT INCLUDED IN THIS TABLE, DEFAULT VALUES ARE USED.

TABLE III. NUMERICAL COMPARISON BETWEEN THE BEST-PERFORMING SIMULATION 182 IN RUN 2 AND THE BASELINE.