Optimizing Transportation Dynamics at a City-Scale Using a Reinforcement Learning Framework

Urban planners, authorities, and numerous other stakeholders have to deal with challenges related to rapid urbanization and its effect on human mobility and transport dynamics. Hence, optimizing transportation systems represents a unique opportunity for municipalities. Indeed, the quality of transport is linked to economic growth, and decreasing traffic congestion substantially improves the quality of life of the inhabitants. Most state-of-the-art solutions optimize traffic in specific and small zones of cities (e.g., single intersections) and cannot be used to gather insights for an entire city. Moreover, evaluating such optimized policies in a realistic way that is convincing for policy-makers can be extremely expensive. In our work, we propose a reinforcement learning framework to overcome these two limitations. In particular, we use human mobility data to optimize the transport dynamics of three real-world cities (i.e., Berlin, Santiago de Chile, Dakar) and a synthesized one (i.e., SynthTown). To this end, we transform the transportation dynamics simulator MATSim into a realistic reinforcement learning environment able to optimize and evaluate transportation policies using agents that perform realistic daily activities and trips. In this way, we can assess transportation policies in a manner that is convincing for policy-makers. Finally, we develop a model-based reinforcement learning algorithm that approximates MATSim dynamics with a Partially Observable Discrete Event Decision Process (PODEDP) and that, compared with other state-of-the-art policy optimization techniques, can scale to big transportation data and find optimal policies at city scale.


I. INTRODUCTION
Organizing and managing rapidly growing cities will be one of the most critical challenges of the next decade. Urbanization is a global challenge that involves all the continents, and it poses questions on how to make the future of mega-cities more livable and sustainable from many perspectives such as ecology, water, and energy management [1]. Transportation systems also play an essential role in attaining this purpose, and thus redesigning and modernizing urban mobility remains a pivotal factor for our metropolitan landscapes.
The associate editor coordinating the review of this manuscript and approving it for publication was Nikhil Padhi.
Moreover, several studies have provided evidence that the quality of transportation systems is related to the economic well-being of both small [2] and large areas [3]. Factors such as traffic jams affect the quality of life, the vitality, and the health of citizens [4]-[6]. Hence, transportation engineers are working to overcome the aforementioned concerns by using innovative solutions known as Intelligent Transportation Systems (ITS). The primary purpose of ITS is to enhance decision-making processes for transportation-related tasks, and data can provide considerable support. Nowadays, data are more accessible than ever before, thanks to the rise of ubiquitous computing, including the usage of sensors by citizens and in our cities. For instance, GPS devices installed on cars and mobile phones are extensively used to gain an up-to-date overview of the mobility patterns in a city with a remarkably high spatial granularity [7]-[9].
The availability of such knowledge provides insights of paramount importance for the management of transportation dynamics and the related decision-making processes. For example, it is possible to determine the traffic state in a city [10], while other techniques allow us to forecast the volume of traffic within a specific period [11]. Another example is the possibility of forecasting the travel speed on a specific link [12], [13]. Despite the merits of many state-of-the-art approaches, ITS still poses several challenges for researchers. For example, the most recent algorithms and models often focus on solving transportation network dynamics only on specific parts of a city (see Section II).
Significant progress in transportation research has been due to simulators, namely, environments in which engineers can experiment with their models and gain enlightening results. Two widely used examples of simulators are (i) SUMO (Simulation of Urban MObility) [14] and (ii) MATSim [15]. The former is an open-source traffic simulator that provides APIs and a graphical interface to model road networks. It was used, for example, to define an environment in which a deep reinforcement learning agent performs adaptive traffic signal control [16] and, in [17], to extend traffic optimization to a multi-agent problem in which each agent controls a traffic light. MATSim, recently proposed by Horni et al. [15], is instead an open-source framework for simulating transportation dynamics in a multi-agent environment that offers a wide variety of modules for demand modeling, traffic flow simulation, and re-planning. Interestingly, this simulation framework can be extended to cover specific needs by integrating models and source code.
In this work, we adopt the state-of-the-art high-fidelity multi-agent transportation simulator MATSim. We transform it into a reinforcement learning environment to obtain a scenario in which agents act realistically. In this way, we define a framework in which we can test optimized transportation policies in ways that are convincing for policy-makers, serving as an alternative to real-world tests that are extremely expensive and, in some cases, dangerous. Moreover, thanks to the realistic agent behaviors offered by MATSim, we can test the performance of a recently proposed Partially Observable Discrete Event Decision Process (PODEDP) algorithm [18], a policy optimization technique that assumes the environment is only partially observable, and of other state-of-the-art algorithms in a context in which state transitions are not given a priori. Finally, we show that PODEDP can scale and optimize policies for large areas such as an entire city.
The paper is organized as follows. Section II discusses previous works on modeling transportation dynamics using reinforcement and deep learning approaches, also highlighting their limitations. In Section III, we provide an overview of reinforcement learning concepts and the Discrete Event Decision Process (DEDP) [19] algorithm as the basis of PODEDP.
In Section IV, we first discuss how we turn MATSim into a reinforcement learning-based environment and then we explore in detail the PODEDP technique. Finally, we briefly introduce the other policy optimization techniques that are benchmarked against PODEDP. In Section V, we outline the experimental setup and describe the synthetic and real-world datasets used for the evaluation. The results of the experiments are presented and discussed in Section VI and, finally, in Section VII we draw some conclusions and suggest possible future directions.

II. LITERATURE REVIEW
Previous work on optimizing transportation dynamics includes controlling the schedule of traffic lights to reduce traffic jams through dynamic programming and reinforcement learning at the scale of several road intersections, and identifying transportation-related policies to optimize transportation network operations through simulation of a city. To the best of our knowledge, work on optimizing city-scale transportation dynamics based on reinforcement learning and human mobility data is rare, because both the algorithms to cope with complex interaction dynamics and the data have become available only in recent years.
In the domain of controlling traffic lights' schedule, the most traditional methods are based on dynamic programming to solve an optimization problem according to precise manually-described assumptions, with actuated and adaptive methods. The actuated methods take into account the provided distribution of the traffic on the roads and set the traffic lights to reduce traffic jams. A significant weakness of models such as [20], [21] is that the optimal time threshold rigidly depends on the volume of the traffic and this parameter has to be set manually. Thus, it is nearly impossible to use such models in modern mega-cities where transportation patterns constantly change, and traffic volume depends on a very large number of parameters [22]. The adaptive methods, such as Split Cycle Offset Optimisation Technique (SCOOT) and Sydney Coordinated Adaptive Traffic System (SCATS) [23], [24], choose among a set of signal-cycle plans to optimize the assumed traffic distributions. These two models handle traffic congestion inefficiently due to limited adaptability in handling non-recurrent traffic jams in real-time scenarios. Other examples of adaptive methods are Real-Time Hierarchical Optimized Distributed Effective System (RHODES) [25], Optimized Policies for Adaptive Control (OPAC) [26], and Adaptive Control Software Lite (ACS-Lite) [27]. Such models are exponentially complex to deploy at scale.
Recurrent and non-recurrent big traffic jams reveal the limits of adaptive traffic control methods. In recent years, several works have used a Reinforcement Learning (RL) framework to optimize the management of traffic signals and traffic lights [28], [29] in a data-driven fashion. In RL approaches, one or more agents control the status of the signals according to certain conditions and a reward function. The agent's behavior is usually learned through a substantial number of simulations to maximize the future expected reward. The key components of an RL algorithm are the state features, the actions, and the reward function. The state features include the length of the queue to represent the state of the traffic congestion [30]-[32] and the position of the vehicles with respect to the stop line and their relative speed [29], [33]. The reward function for the agent to optimize could be formed from the length of the queue [31], [34], the average delay [28], [35], the cumulative delay [33], [36], and the vehicle staying time [29], or a weighted sum of the length of the queue, the vehicle delays, the average waiting time, the light switcher state (0 if it does not change, 1 otherwise), the number of vehicles and, finally, the total travel time of each car [34].
Using RL algorithms, researchers have recently obtained significant results in adaptive control of traffic signals. Works such as [29], [33], [34], [36] focused on the optimization of single intersections and others took into account multiple crossroads [16], [17], [35], [37]. In [30], a single intersection scenario is used to train an agent and the obtained insights are used to extend this scenario to the management of multiple intersections. Only a couple of these works are extended to a district level: [32] in downtown Toronto and [37] in a restricted area of Barcelona. Regarding the adopted algorithms, most of the mentioned works use either Q-Learning [38], [39] and Deep Q-Learning [40] algorithms to learn to assess the expected reward function of state features, or Policy Gradient [41], [42] algorithms to learn to take actions directly according to the state features. A different method proposed by El-Tantawy et al. [32] is based on a Multi-Agent Reinforcement Learning (MARL) framework to coordinate the potentially conflicting goals of the agents.
In the field of transportation policy research [15] and urban planning [43], simulation is a widely-used approach to evaluate transportation policies and identify optimal solutions at the city scale. The utilities [44] to optimize generally involve less traveling time, less uncertainty in arrival time, and less impact of traffic fluctuation on planned activity duration, and the control variables generally involve when to end the current activity and what route to select in the current trip [45], [46]. Experiments in the real world are often costly, dangerous, and infeasible. Simulators provide a solution for evaluating hypotheses and methodologies in silico where certain aspects of the behavior faithfully mirror the real world. Transportation simulators [14], [15], [47] generally identify the optimal policy as open-loop control, where the current state of a complex system does not affect the control variables. An open-loop system is simple to implement. Yet it is inaccurate and unreliable when the system is noisy.
Although all the proposed solutions represent a valuable improvement in traffic optimization, to the best of our knowledge, there is no solution focused on the optimization of complex systems such as transportation dynamics at a city scale based on high-fidelity simulation [15], big mobility data [48], and reinforcement learning [49]. In this work, we fill the gap by introducing a framework based on the state-of-the-art transportation simulator MATSim and a reinforcement learning approach.

III. BACKGROUND
In this section, we provide the background knowledge for MATSim, Reinforcement Learning, Discrete Event Decision Processes (DEDPs), and Partially Observable Discrete Event Decision Processes (PODEDPs).

A. SIMULATION AND MATSIM
Simulations are extensively used in transportation engineering and policy research as well as in other fields such as urban planning [43], robotics [50], epidemiology [51], gaming [52], and so on. Experiments in the real world are often costly, dangerous, and infeasible. Thus, simulators provide a solution for evaluating hypotheses and methodologies in silico where certain aspects of the behavior faithfully mirror the real world.
MATSim [15] is a state-of-the-art large-scale multi-agent transportation simulator: it reproduces a realistic behavior of how people travel and perform activities such as spending time at work, attending schools, shopping, and visiting friends, and how roads with different parameters respond to travel demands. Specifically, the simulator's workflow to identify the typical travel behavior is organized as follows. First of all, the road network is specified in the simulator to match the real world, and initial travel plans (i.e., activities and trips connecting the activities on an individual basis) are created from where people live and work according to census and surveys. Second, the plans for the individuals are executed in the road network through high-fidelity simulations and scored according to their economic values. Next, the simulator re-plans, re-executes, and re-scores the plans for the individuals using a co-evolutionary algorithm until nobody can unilaterally improve their trips. At that point, the system has an equilibrium, and we can inspect the typical behaviors of the individuals.
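The execute, score, and re-plan loop described above can be illustrated with a toy sketch. The snippet below is not MATSim code: the two-route network, the linear congestion cost, and the names `travel_time`, `can_improve`, and `run_replanning` are assumptions introduced only for illustration. Each agent holds a plan (a route), plans are scored by travel time, and one agent at a time switches route when it can unilaterally improve, until nobody can.

```python
import random

# Hypothetical two-route network: travel time grows linearly with load.
FREE_FLOW = {"A": 10.0, "B": 15.0}  # minutes on an empty route (assumed)
SLOPE = {"A": 1.0, "B": 0.5}        # extra minutes per vehicle (assumed)


def travel_time(route, load):
    """Score of a plan: travel time on `route` when `load` agents use it."""
    return FREE_FLOW[route] + SLOPE[route] * load


def can_improve(route, loads):
    """True if an agent on `route` would be strictly faster on the other route."""
    other = "B" if route == "A" else "A"
    return travel_time(other, loads[other] + 1) < travel_time(route, loads[route])


def run_replanning(num_agents=30, max_iterations=1000, seed=0):
    """Re-plan one agent per iteration until no agent can unilaterally improve."""
    rng = random.Random(seed)
    plans = [rng.choice("AB") for _ in range(num_agents)]
    for _ in range(max_iterations):
        # "Execute" the plans: here this only means counting the load per route.
        loads = {r: plans.count(r) for r in "AB"}
        improvable = [i for i, r in enumerate(plans) if can_improve(r, loads)]
        if not improvable:
            break  # equilibrium reached
        i = rng.choice(improvable)
        plans[i] = "B" if plans[i] == "A" else "A"
    return plans
```

In this toy setting the loop stops at the split of agents for which neither route is strictly faster for a switching agent, which mirrors the equilibrium condition stated in the text on a much smaller scale.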

B. REINFORCEMENT LEARNING
A reinforcement learning system consists of four essential parts: (i) a policy, (ii) a reward function, (iii) a value function, and optionally (iv) a model of the environment [49]. A policy is a mapping between the state perceived by the agent and the actions to take. The reward function determines the goal of the agent and thus what is desirable in the immediate term. Similarly, the value function of a system determines what is valuable in the long term. The presence of a formal model of the environment distinguishes between model-based and model-free RL approaches [49]. More precisely, a model, given a specific state and action, can be used to predict the next state and the next reward. In other terms, model-based RL approaches can investigate prospective scenarios and situations before they are encountered. On the contrary, model-free methods learn directly from their experiences [49].
Usually, models are employed to mimic the conditions in which the RL algorithm will act and to provide simulated experience. The latter can then be used for planning the subsequent actions. In particular, planning can be carried out in two ways: state-space planning and plan-space planning. In the former, the purpose is to examine all the possible states to determine an optimal policy. In the latter, the search for an optimal plan is carried out over the space of plans. In some model-based RL settings, an optimal policy is difficult to achieve because of the difficulty of modeling high-dimensional state-spaces. This is the case, for example, of transportation systems. Additionally, representing real-world problems as fully observed models is prohibitive in the vast majority of cases.
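As a minimal illustration of state-space planning with a model, the following sketch runs value iteration on a hypothetical four-state corridor. The environment, the reward of 1 for reaching the last state, and the names `model`, `value_iteration`, and `greedy_policy` are assumptions made for this example and are unrelated to MATSim.

```python
# Hypothetical corridor with states 0..3; state 3 is terminal.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GAMMA = 0.9  # discount factor


def model(state, action):
    """Model of the environment: predicts (next_state, reward)."""
    if state == 3:
        return 3, 0.0
    nxt = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    reward = 1.0 if nxt == 3 else 0.0
    return nxt, reward


def value_iteration(tol=1e-9):
    """Sweep over all states until the value function stops changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            outcomes = [model(s, a) for a in ACTIONS]
            best = max(r + GAMMA * values[n] for n, r in outcomes)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(values):
    """Policy: map each state to the action with the best one-step lookahead."""
    def lookahead(s, a):
        nxt, reward = model(s, a)
        return reward + GAMMA * values[nxt]
    return {s: max(ACTIONS, key=lambda a: lookahead(s, a)) for s in STATES}
```

The sweep visits every state, which is exactly what becomes infeasible when the state-space is as large as that of a city-scale transportation system.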

C. DISCRETE EVENT DECISION PROCESS AND PARTIALLY OBSERVABLE DISCRETE EVENT DECISION PROCESS
A Discrete Event Decision Process (DEDP) [19] and a Partially Observable Discrete Event Decision Process (PODEDP) [18] have been proposed for optimal control in transportation systems.
The most significant limitation of the Discrete Event Decision Process (DEDP) [19] is that, by assuming full observability of the environment, the policy optimization algorithm never performs an information gain check. A DEDP can be seen as a tuple $\langle S, A, V, C, P, R, \gamma \rangle$ where $S$ is the state space, $A$ is the action space, $V$ is the set of possible events, $C$ is a mapping between actions and event coefficients, $P$ is the transition kernel, $R$ is the reward function and, finally, $\gamma \in [0, 1]$ is a discount factor. The transition kernel is the probability of an event happening given a state and an action, $P(s_{t+1}, v_t \mid s_t, a_t) = p(v_t \mid s_t, a_t)\,\delta_{s_{t+1} = s_t + \Delta_{v_t}}$, where $\Delta_{v_t}$ is the change in state produced by event $v_t$. The immediate reward function $R$ is a function of the state at a given time $t$ and corresponds to the sum of the rewards of each component: $R(s_t) = \sum_{m=1}^{M} R_t^{(m)}(s_t^{(m)})$. A policy $\pi$ can then be defined as a mapping between a state $s_t$ and a distribution over actions $a_t$ parametrized by $\theta$: $\pi = p(a_t \mid s_t; \theta)$. The goal of DEDP is to optimize the policy $\pi$ or, in other terms, to maximize the expected future reward: $\theta \leftarrow \arg\max_{\theta} \mathbb{E}_{\theta} \sum_{t} \gamma^{t} R(s_t)$.
The algorithm proposed in [19] (Algorithm 1) optimizes the DEDP policy by repeating policy evaluation and policy improvement steps until convergence; in the policy improvement step, $\theta$ is updated with Eq. 3 to maximize the expected future reward function. DEDP policy evaluation through dynamic programming is intractable because the size of the state space is exponential in the number of state and control variables. In [19], the Bethe entropy approximation [53] is applied to identify a variational lower bound $L(\theta)$ of the log expected future reward. In particular, given a deterministic policy $a_t = \mu(s_t; \theta)$ and the generative model of the (intractable) DEDP process $P(\xi_T; \pi) = p(s_0) \prod_{t=1}^{T} p(v_t \mid s_t; \theta)\,\delta_{s_{t+1} = s_t + \Delta_{v_t}}$, the authors introduced a tractable auxiliary process $q(T, m, \dots)$ to approximate the assignment of future expected reward among the future states and actions, and solved the variational lower bound with the method of Lagrange multipliers. The result is a forward-backward algorithm to estimate the state statistics $\alpha$.
PODEDP is an approach that aims to optimize a policy starting from a partially observable environment. In general, we can consider it an extension of DEDP in which the decision-making process is not fully observable. Therefore, we do not need to assume perfect a priori knowledge. Moreover, by removing this assumption, the optimizer is forced to perform an information gain step during the policy optimization process. This makes it possible to model complex systems such as transportation in a more precise and realistic way. It is known that the majority of real-world systems are partially observable and, despite the excellent results achieved in [19] using DEDP, we expect to gain better and more realistic results using PODEDP. In this paper, we integrate PODEDP in an RL environment as a policy optimizer and we connect its dynamics to the simulator MATSim, as described in Section IV-A, to evaluate its performance in a realistic city-scale scenario.
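The event-driven transition of a DEDP, in which a sampled event changes the state by a fixed amount, can be sketched as follows. This is only an illustration under stated assumptions: the two road links, the event names, the per-event state changes in `DELTAS`, and the rule that an event has probability equal to its rate coefficient when feasible (and zero otherwise) are hypothetical choices, not the formulation of [19].

```python
import random

# Hypothetical events on two road links; each event has a fixed state change.
DELTAS = {
    "link1->link2": {"link1": -1, "link2": +1},
    "link2->link1": {"link1": +1, "link2": -1},
}


def event_probabilities(state, rates):
    """p(v | s, a): the rate coefficient if the event is feasible, else 0 (assumed form)."""
    probs = {}
    for event, delta in DELTAS.items():
        feasible = all(state[k] + d >= 0 for k, d in delta.items())
        probs[event] = rates[event] if feasible else 0.0
    probs[None] = 1.0 - sum(probs.values())  # probability that no event happens
    return probs


def step(state, rates, rng):
    """Sample an event v_t, then apply s_{t+1} = s_t + (state change of v_t)."""
    probs = event_probabilities(state, rates)
    events = list(probs)
    event = rng.choices(events, weights=[probs[e] for e in events])[0]
    new_state = dict(state)
    for k, d in DELTAS.get(event, {}).items():
        new_state[k] += d
    return new_state, event
```

In this sketch the action enters only through `rates`: changing the rate coefficients changes which events are likely, which is how a policy steers the dynamics in the description above.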

IV. METHODOLOGY
Here, we describe how we integrate a reinforcement learning approach into the MATSim simulator (see Section IV-A). Then, in Section IV-B, we present the Partially Observable Discrete Event Decision Process (PODEDP) algorithm and, in Section IV-C, we discuss some additional policy optimization algorithms that can be adopted in the context of transportation systems.

A. TRANSFORMING MATSIM INTO A REALISTIC REINFORCEMENT LEARNING-BASED ENVIRONMENT
In this work, we transform MATSim, a high-fidelity multi-agent transportation simulator, into a realistic reinforcement learning environment to optimize and evaluate transportation policies. This reinforcement learning environment enables us to assess transportation policies and convince policy-makers by simulating with high fidelity how agents perform daily activities and trips. With the reinforcement learning environment, we develop a model-based reinforcement learning algorithm to approximate MATSim dynamics with a Partially Observable Discrete Event Decision Process and identify the optimal policy through variational inference. The model-based reinforcement learning algorithm is benchmarked against three state-of-the-art algorithms, Policy Gradient, Actor-Critic, and Guided Policy Search, in the proposed reinforcement learning environment on four different scenarios: a fully synthetic one provided by MATSim (SynthTown) and three real-world datasets involving the cities of Berlin, Santiago de Chile and Dakar. Of the four algorithms, only the model-based reinforcement learning algorithm converges with a reasonable computational effort. An implementation of the entire framework is available at github.com/LuckysonKhaidem/matsim-code-examples.
One of the challenges we face while translating MATSim into a reinforcement learning environment is keeping track of immediate rewards. This is particularly difficult because the MATSim scoring function generates scores of plans at the end of every iteration. Instead, in our reinforcement learning paradigm, we define immediate rewards as the difference between rewards accumulated between two consecutive actions that an agent takes. To calculate this difference, we need to keep track of rewards that an agent accumulates throughout the course of simulation every minute of the day.
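The bookkeeping described above can be sketched in a few lines. The class below is a hypothetical helper, not part of MATSim or of the released framework: `RewardTracker`, `add_score`, and `on_action` are names assumed for this illustration. It accumulates the score as the simulated day proceeds and returns, at each action, the score accumulated since the previous action.

```python
class RewardTracker:
    """Turns a running cumulative score into per-action immediate rewards."""

    def __init__(self):
        self.cumulative = 0.0      # score accumulated so far during the day
        self.at_last_action = 0.0  # cumulative score when the agent last acted

    def add_score(self, amount):
        """Called by a (hypothetical) simulator hook, e.g. once per simulated minute."""
        self.cumulative += amount

    def on_action(self):
        """Return the reward accumulated between the previous action and this one."""
        reward = self.cumulative - self.at_last_action
        self.at_last_action = self.cumulative
        return reward
```

With one tracker per agent, the immediate rewards returned over a day sum to the agent's total accumulated score, so the end-of-iteration score is redistributed over the actions rather than changed.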
In sum, by turning MATSim into a reinforcement learning environment with realistic behavior of agents and roads, we offer a way to evaluate a transportation policy in a realistic scenario. In the following, we describe in detail the solution adopted (i.e., Partially Observable Discrete Event Decision Process) to characterize the decision-making process of a transportation system.

B. REINFORCEMENT LEARNING IN MATSIM WITH A PODEDP
Modeling transportation dynamics on a city-scale is a formidable hurdle due to the massive dimensions of the state-space and the complexity of the transitions. Furthermore, transportation systems are only partially observable and, consequently, it is hard to derive a model that can accurately trace the related dynamics. Another challenge stems from the complex interactions in the system, with the consequence that short-term choices may have massive implications in long-term scenarios. To address those challenges, we built on the PODEDP optimal control algorithm recently proposed by Yang et al. [18] and developed a model-based reinforcement learning algorithm. The mathematical notation used in this Section is described in Table 1.
TABLE 1. Summary of the mathematical notation used in this Section with the associated meaning.
As described by Yang et al. [18], a PODEDP is a tuple $\langle S, A, \Omega, V, C, P, O, R, \gamma \rangle$ where $S$ is the state space, $A$ the action space, $\Omega$ the observation space and $O$ is the related observation function. $V$ is the set of events, $C$ is a mapping between actions and event rate coefficients, $P$ corresponds to the transition kernel, $R$ is the reward function and, finally, $\gamma$ is a discount factor that varies between 0 and 1.
Given this definition, the two model parameters are $\phi_c$, governing state transition dynamics, and $\phi_R$, governing the immediate reward to be received. The probability for an event $v_t \in \{1, \dots, V\}$ to happen is conditioned on state $s_t$ and action $a_t$: action $a_t$ controls the dynamics of the discrete event decision process by changing the event rate coefficients $c = (c_1, \dots, c_V) = C(a_t; \phi_c)$. The probability of no event is thus the complement of the total probability of the $V$ events. Similarly, the policy parameters are $\theta$. Finally, given an action $a_t = (a_t^{(1)}, \dots, a_t^{(D)})$, a policy is defined deterministically as $a_t = \mu(s_t; \theta)$ or stochastically as $\pi = p(a_t \mid s_t; \theta)$.
The idea behind the integration of PODEDP in an RL environment is to start from the model parameters ($\phi_c$, $\phi_R$) and the policy parameters $\theta$ and to iterate through three steps until convergence. Regarding point (1), we optimize city-scale road traffic flows by dispatching vehicles to downstream road links in the correct proportion and by recommending that people adjust the place and duration of their future activities, according to our estimate of traffic jams from isolated measurements of vehicle trips.
Specifically, the action variables $a_t$ are the probabilities of choosing a downstream road link after completing the current road link, and the event rate coefficients of leaving or entering buildings. We implement the action variables within MATSim by selecting, with rejection sampling, from the alternative plans identified by the MATSim re-planning module and by using the within-day re-planning interface.
We use the PODEDP model to approximate the complex dynamics of MATSim (and the real world) analytically by defining a set of microscopic events. In the PODEDP model, the states are vectors $s_t = (s_t^{(1)}, \dots)$ and the events include vehicle movements of the form $p \cdot m_1 \to p \cdot m_2$, where the event rate coefficient $c_{m_1 m_2}$ is the probability for the vehicle to make the movement. The action variables $a_t$ change the event rate coefficients to make road usage more efficient. We implement the state transition $p(s_{t+1}, v_t \mid s_t, a_t)$ following the traffic flow diagram to best match MATSim traffic dynamics, including traffic congestion. Then, the reward function $R(s_t)$ is implemented to best approximate the Charypar-Nagel scoring function and its score coefficients. Finally, we implement the deterministic policy as a function of the states through a neural network $a_t = \mu(s_t) = \mathrm{NN}(s_t; \theta)$ parameterized by the policy parameter $\theta$.
Concerning point (2), we used a simplified version of the concepts introduced in [54], [55] to train the two parameters of the model ($\phi_c$, $\phi_R$) for each dataset $D$ introduced in Section V-A. In particular, these works propose a way of capturing interactions in social dynamics and of coping with the exponential growth of the state-space with respect to the number of elements in the system. In such settings, exact inference is impractical, and so Xu et al. [54] proposed an approximate inference method. We draw on their solution because the complexity of transportation dynamics closely matches the obstacle that Xu et al. [54] addressed. Notably, as the state-space is remarkably large and the environment is partially observable, it is better to estimate the parameters of the model with an approximate inference technique. To this end, Xu et al. [54] consider $i^{(1)}, \dots, i^{(M)}$ as a set of individuals, $v^{(1)}, \dots, v^{(T)}$ as a set of events, $y^{(1)}, \dots, y^{(T)}$ as a set of observations and, finally, $x^{(1)}, \dots, x^{(T)}$ as a sequence of hidden states. The likelihood of the sequence, $P(x^{(1)}, \dots, x^{(T)}, y^{(1)}, \dots, y^{(T)}, v^{(1)}, \dots, v^{(T)})$, is then estimated using the transition kernel $P(x_t, v_t \mid x_{t-1})$. In the policy evaluation step, we use the method introduced by Xu et al. [54] to handle the exponential growth of the state-space and to reduce it to a tractable one with a mean-field approximation of the state. In this way, we can define a projected marginal kernel that can be further developed into a forward-backward algorithm built from a forward step and a backward step. To learn the aforementioned parameters, we maximize the expected log-likelihood. In our experiments, we use a simplified version of this method to estimate the parameters $\phi_c$ and $\phi_R$.
Finally, for point (3), we need to optimize the policy parameter $\theta$ using the estimated values of $\phi_c$ and $\phi_R$. In other terms, $\theta \leftarrow \arg\max_{\theta} \mathbb{E}_{\theta, \phi_c, \phi_R} \sum_{t} \gamma^{t} R(s_t)$. To accomplish this, we execute the algorithm stated in [18]. The policy improvement step performs $\theta \leftarrow \theta + \gamma \frac{\partial \log V^{\pi}(r)}{\partial \theta}$, where $\gamma$ here is the learning rate and the gradient is described in Equation 3. Given the difficulty of solving a PODEDP problem due to the size of the belief state-space, and as recommended in the literature [56], we train the model in a fully observable environment that can be constructed using DEDP. Adopting such an approach leads to a computationally simpler solution of the partially observable environment. In other words, we reduce the optimization of a PODEDP problem to a policy optimization in a Specially designed DEDP (SDEDP). Thanks to the full observability assumed in a DEDP model, the optimized value function of a DEDP system is an upper bound of that of the related PODEDP optimizer. Maximizing this upper bound alone is not enough to solve the PODEDP model, because computing the bound does not guarantee the performance of the target function. This is where SDEDP plays a crucial role. SDEDP is a tuple $\langle S, A, V, C, P, R, \gamma \rangle$. Given a PODEDP model, SDEDP is the corresponding fully observable model if the two share the same $S, A, V, C, P, R, \gamma$ and the same initial state distribution.
Algorithm 2 summarizes how the local policy associated with the corresponding SDEDP instance can be optimized. Once this is done, we are able to estimate the belief state $b_t(s_t) = p(s_t \mid o_{0:t})$ by inferring the state distribution in the original PODEDP environment from the observations up to the current one. To obtain an optimized policy in the PODEDP environment, we first apply Algorithm 2 on $\mu^*(s_t)$, then we estimate $b_t(s_t)$ as shown in Equation 6, and finally we generate the optimal policy as indicated in Equation 7.
The steps to follow are further highlighted in Algorithm 3.

Algorithm 3 Optimization of a PODEDP
Input: PODEDP $\langle S, A, \Omega, V, C, P, O, R, \gamma \rangle$
Output: The optimized policy: $\pi^*(a_t \mid b_t)$
Procedure:
1) Run Algorithm 2 on $\mu^*(s_t)$
2) Estimate $b_t(s_t)$ with Equation 6
3) Generate the optimal policy as in Equation 7

As mentioned, working in a model-based reinforcement learning setting is a unique opportunity to investigate scenarios and learn from simulated experience. At the same time, modeling a complex real-world system is challenging, and the risk of training agents on an inadequately designed model is significant. Model-free RL also brings substantial limitations. In such systems, agents cannot hold an explicit model of how environmental dynamics affect the system, especially in response to an action taken earlier. Therefore, it would be complicated to collect insights that can be applied to real-world situations. The solution specified in this Section is a model that embodies a good compromise between the observation that in a model-based environment we can gather valuable insights and the admission that the world may be only partially observable. In this sense, we argue that by using such a model, we can capture complex real-world dynamics more accurately and succinctly than with other models. Despite the good results obtained in model-based scenarios that assume the full observability of the environment [19], by removing such strong assumptions [18] we can outperform methods such as DEDP, Guided Policy Search (GPS) [57], Policy Gradient (PG) [58] and Actor-Critic (AC) [59]. The outcomes are discussed and investigated in Section VI.
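To make the notion of a belief state $b_t(s_t) = p(s_t \mid o_{0:t})$ concrete, the sketch below implements a generic discrete Bayes filter. It is not Equation 6 of this paper: the function `belief_update`, the two traffic states, and all probabilities in the usage lines are assumptions chosen only to show how a belief is propagated through a transition model and corrected by an observation.

```python
def belief_update(belief, transition, observation_likelihood, observation):
    """One step of a discrete Bayes filter.

    belief[s]                     : p(s_{t-1} | o_{0:t-1})
    transition[s1][s2]            : p(s_t = s2 | s_{t-1} = s1)
    observation_likelihood[s][o]  : p(o_t = o | s_t = s)
    Returns p(s_t | o_{0:t}) as a dictionary over states.
    """
    states = list(belief)
    # Prediction: push the previous belief through the transition model.
    predicted = {
        s2: sum(belief[s1] * transition[s1][s2] for s1 in states) for s2 in states
    }
    # Correction: weight by the likelihood of the new observation, then normalize.
    unnormalized = {
        s: observation_likelihood[s][observation] * predicted[s] for s in states
    }
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state example: a road link is either "free" or in a "jam".
transition = {"free": {"free": 0.8, "jam": 0.2}, "jam": {"free": 0.3, "jam": 0.7}}
likelihood = {"free": {"slow": 0.1, "fast": 0.9}, "jam": {"slow": 0.8, "fast": 0.2}}
prior = {"free": 0.5, "jam": 0.5}
posterior = belief_update(prior, transition, likelihood, "slow")
```

After observing a slow vehicle, the belief shifts toward the jammed state even though the state itself is never observed directly, which is the kind of information a policy of the form $\pi^*(a_t \mid b_t)$ conditions on.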

C. POLICY OPTIMIZATION TECHNIQUES
In this Section, we focus on three policy optimization algorithms. In particular, we analyze two simulation-based algorithms, namely Policy Gradient (PG) [58] and Actor-Critic (AC) [59], and an analytical method, Guided Policy Search (GPS) [57]. The main difference between simulation-based and analytical methods is that the former sample the states and the actions to reproduce population flows, while the latter approximate the transition dynamics with differential equations and solve for local policies. All the mentioned techniques implement the policy as a neural network that takes the historical inputs as observations. The performance of these state-of-the-art algorithms is compared with our proposed approach based on PODEDP [18] in Section VI.
Regarding PG [58], the authors observe that in many problems the loss function to be optimized is an expectation over a set of random variables that may be part of a probabilistic environment, so standard gradient-based algorithms cannot be applied directly. To overcome this issue, Schulman et al. [58] show how an unbiased estimator of the gradient of the loss function can be derived from a stochastic computation graph: a directed acyclic graph that encodes the dependency structure of the computation to be performed. Four types of nodes can be found in the graph: (i) input nodes, containing the parameters with respect to which we want to compute the derivative, (ii) deterministic nodes, which correspond to deterministic functions of their parents, (iii) stochastic nodes, which specify a random variable through a conditional distribution given their parents, and finally (iv) constant nodes, namely a subset of deterministic nodes corresponding to real numbers. Schulman et al. [58] derive a gradient estimator for an expected sum of costs in a stochastic computation graph and show that this estimator can be computed efficiently. In their paper, the authors propose two ways to calculate the gradient. One is to use the backpropagation algorithm with one of the surrogate loss functions. Alternatively, the algorithm proposed in [58] can be used.

Another policy optimization algorithm that we compare with PODEDP is proposed by Lillicrap et al. [59], who introduce a model-free Actor-Critic algorithm that also works in continuous action spaces. Q-learning cannot be applied directly to continuous action spaces because of the computational complexity of finding the greedy policy. Thus, the authors propose an Actor-Critic algorithm based on the deterministic policy gradient [60].
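As an illustration of estimating the gradient of an expectation from samples, the sketch below implements the score-function (likelihood-ratio) estimator for a single Bernoulli random variable, the simplest case of the kind of estimator that stochastic computation graphs generalize. It is not the implementation used in [58] or in our experiments; the cost function, the parameter value, and all function names are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score_function_gradient(theta, cost, num_samples, rng):
    """Monte Carlo estimate of d/dtheta E_{x ~ Bernoulli(sigmoid(theta))}[cost(x)]."""
    p = sigmoid(theta)
    total = 0.0
    for _ in range(num_samples):
        x = 1 if rng.random() < p else 0
        # For a Bernoulli with logit theta, d/dtheta log p(x; theta) = x - p.
        total += cost(x) * (x - p)
    return total / num_samples


def exact_gradient(theta, cost):
    """Closed form for comparison: d/dtheta [p * cost(1) + (1 - p) * cost(0)]."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (cost(1) - cost(0))


def cost(x):
    # Hypothetical cost: action 1 is cheaper than action 0.
    return 1.0 if x == 1 else 3.0


rng = random.Random(0)
theta = 0.4
estimate = score_function_gradient(theta, cost, num_samples=100_000, rng=rng)
exact = exact_gradient(theta, cost)
```

The estimate is unbiased but noisy. Its variance grows with the number of random variables involved, which is the practical difficulty that sampling-based policy optimization faces in large state spaces.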
An additional solution to policy optimization is described in [57]. Many guided policy search methods can be used to solve non-linear problems, and Montgomery and Levine [57] propose a new Guided Policy Search (GPS) algorithm that provides two main contributions. First, their algorithm guarantees convergence in simplified convex and linear settings. Second, they show that, in the more general nonlinear setting, the error in the projection step can be bounded.

V. EXPERIMENTAL SETUP AND DATA
In this Section, we present the structure of the experiments we perform as well as the synthetic and real-world datasets used to evaluate our approach. Overall, our purpose is to compute an optimal policy for managing transportation dynamics so that vehicles (i) spend as little time as possible on the roads, (ii) arrive at a specific Point of Interest (POI) on time, and (iii) stay at a given POI for a sufficient amount of time. Some of the features needed to run the PODEDP optimizer are taken from the datasets outlined in Section V-A, while others are measured from MATSim. As previously said, a PODEDP can be seen as a tuple S, A, , V, C, P, O, R, γ, and the mapping used to run our experiments is as follows. First of all, the set of states $S$ is composed of elements such as $s_t = (s$ [...] $(s_{t+1}, v_t \mid s_t, a_t)$. In other terms, $c_{m_1,m_2}$ is the probability of p of moving from $m_1$ to $m_2$ at $t$. Similarly, each action $a_t$ can be represented as the event rate coefficient of (i) choosing the downstream link, (ii) leaving a POI, and (iii) entering a POI and staying there for a sufficient time period. Concerning the traffic dynamics, we follow the traffic flow diagram proposed by MATSim, while the reward function emulates the Charypar-Nagel scoring function illustrated in [15]. Finally, we implement the policy as $p(a_t \mid b_t; \theta) = \sum_{s_t} b_t(s_t)\,\delta_{a_t = \mu(s_t;\theta)}$, where $\mu(s_t; \theta)$ is a neural network with weights $\theta$.
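One way to read the policy definition is as a belief-weighted mixture: each candidate state votes for the action that the state-level policy $\mu$ would choose in it, with a weight equal to its belief. The sketch below illustrates this reading for a small discrete case. The lookup table standing in for the neural network $\mu(s_t; \theta)$, the state names, and the action names are all hypothetical, and the actions in our experiments are event rate coefficients rather than discrete labels.

```python
from collections import defaultdict


def action_distribution(belief, mu):
    """p(a | b) = sum_s b(s) * [a == mu(s)] for a deterministic state policy mu."""
    probs = defaultdict(float)
    for state, weight in belief.items():
        probs[mu(state)] += weight
    return dict(probs)


# Hypothetical lookup table standing in for the neural network mu(s; theta).
state_policy = {"free_flow": "link_A", "light_queue": "link_A", "congested": "link_B"}
# Assumed belief over the three hypothetical states.
belief = {"free_flow": 0.5, "light_queue": 0.2, "congested": 0.3}

dist = action_distribution(belief, state_policy.get)
```

Here the two states that map to `link_A` pool their belief mass, so the resulting distribution over actions can be far less uncertain than the belief over states.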
A model-based reinforcement learning algorithm is data-efficient. To demonstrate this, we compare our proposed approach against other analytical and simulation-based methods. For the simulation-based methods, we focus on Policy Gradient (PG) [58] and Actor-Critic (AC) [59], which sample the actions and next states to reproduce the population flow. For the analytical method, we focus on Guided Policy Search (GPS) [57], which approximates the transition dynamics with differential equations and solves the local policies analytically. All these algorithms implement the policy as a neural network with the historical inputs as observations. We run each algorithm on each scenario 10 times and plot the average and standard deviation of the achieved utilities across training epochs.
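The aggregation over repeated runs can be sketched as follows. The utilities are made-up numbers, the function name is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not specify which variant is plotted.

```python
import statistics


def summarize_runs(curves):
    """curves[r][e] = utility of run r at epoch e; returns per-epoch (means, stds)."""
    per_epoch = list(zip(*curves))  # transpose to epoch-major order
    means = [statistics.fmean(values) for values in per_epoch]
    stds = [statistics.stdev(values) for values in per_epoch]  # sample std (n - 1)
    return means, stds


# Made-up utilities for 3 runs over 4 epochs, for illustration only.
curves = [
    [1.0, 2.0, 3.0, 4.0],
    [1.5, 2.5, 2.5, 4.5],
    [0.5, 1.5, 3.5, 3.5],
]
means, stds = summarize_runs(curves)
```

The per-epoch mean gives the learning curve and the per-epoch standard deviation gives the width of the shaded band around it.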

A. DATA
We evaluate the performance of our proposed framework on four datasets of human mobility: (i) SynthTown, (ii) Berlin, (iii) Santiago de Chile, and (iv) Dakar. A summary of the relevant characteristics of each dataset is given in Table 2. The SynthTown dataset contains a synthesized network of one home location, one work location, and 23 single-direction road links. It represents the synthesized travel demand of 50 agents going to work in the morning and returning home in the afternoon [15]. We uniformly sample 10% of these agents as probe agents, which are observable on link 1 and link 20. The goal of the optimization problem is to maximize the total utility over all agents by setting the rate at which agents start the home-to-work and work-to-home trips, and by distributing agents to the available downstream links according to the time of day and the numbers of probe agents on link 1 and link 20. If each agent maximizes his/her utility greedily, traffic congestion will occur and the overall utility will be sub-optimal. The numbers of probe agents on link 1 and link 20, on the other hand, provide information that the controller can use to optimize agent behavior. This centralized control is what happens when a transportation agency and a web-mapping service provider control traffic signals, provide traffic information, and diversify the routes to optimize road network operation.
The Berlin dataset comprises a network of approximately 2400 single-direction road links, derived from OpenStreetMap, in the metropolitan area of Berlin bounded by the Berlin beltway. Moreover, this dataset contains the trips of 17 thousand synthesized vehicles representing the travel behaviors of three million inhabitants [61]. The daily trips in the Berlin dataset were synthesized from (i) the commuter data provided by the German Federal Employment Agency, containing the home and workplace municipalities of the working population, (ii) an activity-based demand model [64] to sample a sequence of activities (staying at home, working, attending school, shopping, going to a restaurant, etc.) and the corresponding travels throughout a day, (iii) a physical simulation [15] to repeatedly modify the sampled activity-travel sequences to match the capacity of the transportation network, and finally (iv) a Bayesian sampling procedure [65] to match the daily activity-travel sequences from the previous step with hourly traffic count values from over 300 count stations. The synthesized daily trips have been validated against extensive, regularly conducted travel surveys and constitute a quality representation of road transport demand.
The third dataset, the Santiago de Chile dataset, comprises a network of 6000 single-direction links derived from OpenStreetMap and the trips of 65 thousand synthesized vehicles representing the travel behaviors of six million inhabitants traveling by car, on foot, and by public transportation [62]. The daily trips in the Santiago de Chile dataset were initialized by cloning the sequences of activities (starting time and duration of time spent at home, work, school, shopping, leisure activities, visits, and health facilities) and the travel modes of 60 thousand individuals (from 18 thousand households) taken from publicly accessible travel diaries. These sequences of activities are modified through physical simulation and a co-evolutionary algorithm (using MATSim) to maximize the overall utility of the system. The resulting daily trips are compatible with the distributions of travel modes and with the traffic counts observed at count stations.
Finally, the Dakar dataset contains a network of 8000 single-direction road links derived from OpenStreetMap and 12 thousand real-world vehicle trips derived from the "Data for Development (D4D)" challenge datasets, which are based on the Call Detail Records (CDR) of over 9 million Sonatel customers in Senegal throughout the year 2013 [63]. A Call Detail Record is a data record that details a telecommunication transaction for the purpose of generating phone bills, and it has been widely used in academic research for modeling human mobility patterns [48]. More precisely, the D4D-Senegal datasets contain hourly site-to-site voice/SMS traffic among 1666 sites (dataset 1), the mobility of 300 thousand users randomly sampled every 2 weeks at site level (dataset 2), and the mobility of 150 thousand randomly sampled users for one year at the level of 123 arrondissements (dataset 3). From dataset 2, we identify the work/school and home locations of each user as locations randomly picked among the most frequently visited sites in the periods 7am-7pm and 7pm-7am, respectively. Then, we sample an activity-trip sequence for each user to match her/his sequence of mobility records from a Markov chain model describing how s/he performed various activities (staying at home, staying at work, attending school, shopping, etc.).
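The home and work/school identification step can be sketched as below. This is a simplified illustration rather than the processing pipeline used for the D4D data: it assumes that night-time records (7pm-7am) indicate the home site and daytime records (7am-7pm) indicate the work/school site, and the record format, function names, and sample values are hypothetical.

```python
import random
from collections import Counter


def most_frequent_site(sites, rng):
    """Return one of the most frequently seen sites, breaking ties at random."""
    counts = Counter(sites)
    top = max(counts.values())
    candidates = sorted(site for site, c in counts.items() if c == top)
    return rng.choice(candidates)


def home_and_work(records, rng):
    """records: iterable of (hour_of_day, site_id) pairs for a single user."""
    day = [site for hour, site in records if 7 <= hour < 19]
    night = [site for hour, site in records if hour < 7 or hour >= 19]
    home = most_frequent_site(night, rng) if night else None
    work = most_frequent_site(day, rng) if day else None
    return home, work


# Made-up records for one user: (hour, site id).
records = [(8, 12), (10, 12), (14, 12), (16, 40), (21, 7), (23, 7), (2, 7), (6, 33)]
home, work = home_and_work(records, random.Random(0))
```

For this user the night-time records are dominated by site 7 and the daytime records by site 12, so these become the home and work/school sites.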
In MATSim, immediate rewards and penalties are assigned to an agent according to its current location. For example, performing an activity results in an immediate reward that decreases over time (diminishing returns), while traveling results in a constant immediate penalty over time that depends on the transportation mode, plus an upfront penalty at the beginning of a trip. Early arrival and departure result in no reward or penalty. Additionally, traffic dynamics in MATSim are modeled mesoscopically as a queuing network. A road link is characterized by its maximum speed, maximum flow, and maximum capacity. When the traffic flow (i.e., the number of vehicles moving out of the road link per unit time) is less than the maximum traffic flow, vehicles move out of the link at maximum speed. Otherwise, vehicles are queued up in the road link until all cars in front of them have moved out of the link at maximum traffic flow. When the number of cars in the link reaches the maximum capacity, no other vehicles are allowed to move into the link. Both the utility function and the dynamics reflect the complex behavior of real-world road traffic.
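The queue-based link behavior described above can be illustrated with a toy model. The sketch below is a deliberately simplified stand-in and not MATSim's implementation: time is discretized into steps, and the class name, parameters, and example values are all hypothetical.

```python
from collections import deque


class QueueLink:
    """Toy mesoscopic link: FIFO queue with free-flow time, flow and storage limits."""

    def __init__(self, free_flow_steps, max_flow_per_step, max_vehicles):
        self.free_flow_steps = free_flow_steps      # travel time at maximum speed
        self.max_flow_per_step = max_flow_per_step  # vehicles allowed out per step
        self.max_vehicles = max_vehicles            # storage capacity of the link
        self.queue = deque()                        # (vehicle_id, earliest_exit_step)

    def has_space(self):
        return len(self.queue) < self.max_vehicles

    def enter(self, vehicle_id, now):
        if not self.has_space():
            return False  # link is full: the vehicle must wait upstream
        self.queue.append((vehicle_id, now + self.free_flow_steps))
        return True

    def step(self, now):
        """Release up to max_flow_per_step vehicles whose free-flow time has elapsed."""
        released = []
        while (self.queue
               and len(released) < self.max_flow_per_step
               and self.queue[0][1] <= now):
            released.append(self.queue.popleft()[0])
        return released


link = QueueLink(free_flow_steps=2, max_flow_per_step=1, max_vehicles=3)
accepted = [link.enter(v, now=0) for v in ("a", "b", "c", "d")]
exits = {t: link.step(t) for t in range(1, 6)}
```

In this example the fourth vehicle is refused because the link is full, and the three accepted vehicles leave one per step once their free-flow time has elapsed, which is how a flow limit turns simultaneous arrivals into a queue.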

VI. RESULTS
In this Section, we compare the performance of Guided Policy Search, Actor-Critic, Policy Gradient, and model-based reinforcement learning with the Partially Observed Discrete Event Decision Process algorithm on the four scenarios described in Section V-A. Since the performance loss of a model-based reinforcement learning algorithm comes from both the inaccuracy of the model representing the real world and the approximation to the optimal solution within the model, we also compare the PODEDP dynamics with the MATSim dynamics they represent.

A. COMPARING GPS, AC, PG, AND PODEDP MODEL-BASED REINFORCEMENT LEARNING
In a reinforcement learning environment, the final aim is to optimize a policy by maximizing the rewards. In Figure 1, we show the learning curves of PODEDP (in grey), GPS (in red), AC (in green), and PG (in blue) when we optimize transportation dynamics in the SynthTown, Berlin, Santiago, and Dakar scenarios respectively. Moreover, in Figure 1 we highlight the variance of the algorithms, which is represented by the colored area around the curves.
From the panels, it emerges that the PODEDP-based reinforcement learning algorithm quickly achieves higher utilities with less variance. This is because the algorithm uses the data gathered from executing the learned PODEDP policy to refine the PODEDP model with supervised learning. Then, it uses the refined PODEDP model to improve the policy with variational inference, thus achieving data efficiency. At the same time, we have not been able to bring GPS, AC, and PG to convergence with a reasonable computational effort. The analytical method (GPS) obtains better results than the simulation-based methods; only in the Berlin scenario is the performance of GPS close to that of AC. In general, in a road traffic network, high rewards require individuals to perform the correct activities (staying at home, working, and so on) at the right time, and to spend less time on the roads. Analytically modeling the transition dynamics of a complex system as a Markov Decision Process using a Taylor approximation introduces modeling errors, as is the case for the GPS policy optimization technique. The PG algorithm uses a similar technique based on Monte Carlo integration. When this algorithm deals with high-dimensional state spaces, its variance increases drastically, and a small perturbation of the policy may result in significant changes to the immediate reward and to the value in later steps. This makes it difficult to estimate the correct value from sampled trajectories and, as a result, to compute the correct value gradient. The AC algorithm faces the same problem.
To illustrate how the learned behavior of individual agents causes the overall performance differences, we present the average number of vehicles over ten runs at each location of SynthTown for each algorithm using the learned policy (Figure 2). As shown in this Figure, PODEDP leads to the smallest number of vehicles on the roads, the largest number of vehicles at work during working hours (9am-5pm), and the largest number of vehicles at home during rest hours (all other hours), which indicates that the policy learned by the PODEDP algorithm best satisfies the agents' needs. By combining the accurate modeling of PODEDP with a tractable solution based on variational inference, our method achieves the best performance. By contrast, the analytical GPS method introduces modeling error when approximating the state transitions with differential equations, while PG and AC introduce high variance through sampling.
To further highlight the performance of the algorithms under analysis, we identify some key performance metrics (Table 3). In particular, we measure the average time vehicles spend on the roads, the number of cars at workplace locations during working hours (9am-5pm), and the number of vehicles at home locations during non-working hours. These values are collected for each algorithm in ten runs, using the policy that is learned at that point. Such metrics can be used as indicators of how well a learned policy performs. Consistent with the higher rewards it obtains, PODEDP is the algorithm that best minimizes the average time vehicles spend on the roads and maximizes the number of vehicles both at workplaces during working hours and at home locations in the other periods.

B. COMPARISON OF MATSim AND PODEDP DYNAMICS
To emphasize the PODEDP solution's correctness, we compare its traffic dynamics with those of MATSim. Specifically, we run a 24-hour MATSim simulation of the Berlin environment by executing a predefined set of plans. We collect the location of and the reward received by each agent at every minute of the simulation time. We also record the same metrics for the PODEDP solution, run using the learned policy for the Berlin environment. Using the data obtained from both experimental runs, we can draw various insights into how similar the traffic and reward dynamics of the two simulations are. Since MATSim provides a realistic simulation of the actual traffic dynamics, we need to ensure that our solution is effective in an environment whose dynamics are close to those of MATSim. Figure 3 compares the average time spent on each link between the two simulations. For each link (i.e., road segment/building), we calculate the average time spent in minutes, and then sample 10,000 links and plot their corresponding average durations. Each data point represents a link: its x-coordinate is the average time spent in MATSim and its y-coordinate is the average time spent in the PODEDP run. The strong diagonal implies strong similarities in the temporal dynamics of the two simulations.
Similarly, Figure 4 compares the average reward received on each link for both simulations. The average reward values are calculated in the same manner as for the previous plot. Here too, we observe a substantial similarity in the reward dynamics, as implied by the diagonal pattern of the plot. We can also observe that the rewards for PODEDP tend to be slightly higher than those of MATSim on some links. Figure 5 compares the average time spent, in minutes, on different modes of transport (legs) and activities. This plot digs deeper into how the average time spent is distributed across different types of legs and activities. For the two simulations, the corresponding time spent is almost the same. Note that for PODEDP, the overall time spent on trips is slightly less than that of its MATSim counterpart. Similarly, Figure 6 compares the average reward received by the agents of the simulations across various activities and legs. For the legs, where MATSim awards negative rewards, the PODEDP model performs significantly better than the MATSim model. For the activities, however, the rewards earned by the PODEDP-based model do not surpass, but are very close to, those of the MATSim-based model. Based on rewards, we can say that the PODEDP model, if not better, is on par with the MATSim-based model. Figure 7 shows the quantile-quantile plot of the average trip duration, for every hour and for both models. The average time taken by the PODEDP-based model is slightly less than the one taken by the MATSim model.

FIGURE 7. A quantile-quantile plot to compare the average trip duration for both working and non-working hours (in minutes). On the x-axis, we have the minutes taken for a trip in MATSim and on the y-axis the time spent with PODEDP. In general, PODEDP performs slightly better than MATSim, meaning that, after a PODEDP policy optimization, travelers spend less time on the roads.
This reinforces what we have seen in Figure 3, where the average time spent by the PODEDP model over activities is slightly less than that of the MATSim model. Finally, Figure 8 shows the quantile-quantile plot of the average activity duration, for each hour and for both models. Here we observe that for all the activities, with the exception of staying at home, the two models spend almost the same time. This again confirms our observations from Figure 3: both models spend the same average amount of time on all activities except home, where the PODEDP-based model spends slightly less time. This again shows that our PODEDP-based model is comparable to MATSim in terms of performance.
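For readers unfamiliar with quantile-quantile plots, the sketch below shows how the plotted points can be obtained from two samples of durations. It uses made-up numbers rather than our simulation outputs, and the function name and the choice of ten quantile bins are assumptions.

```python
import statistics


def qq_points(sample_x, sample_y, num_quantiles=10):
    """Pair up matching quantiles of two samples (the points of a Q-Q plot)."""
    qx = statistics.quantiles(sample_x, n=num_quantiles, method="inclusive")
    qy = statistics.quantiles(sample_y, n=num_quantiles, method="inclusive")
    return list(zip(qx, qy))


# Made-up trip durations in minutes, for illustration only.
matsim_minutes = [10, 12, 15, 18, 20, 22, 25, 30, 35, 40, 45]
podedp_minutes = [m - 1 for m in matsim_minutes]  # uniformly one minute shorter

points = qq_points(matsim_minutes, podedp_minutes)
share_below_diagonal = sum(y < x for x, y in points) / len(points)
```

Points on the diagonal mean the two distributions agree at that quantile, and points below it mean the sample on the y-axis has shorter durations there. In this toy example every point lies below the diagonal because the second sample was constructed to be one minute shorter throughout.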

VII. DISCUSSION AND CONCLUSION
Redesigning and modernizing urban mobility has a pivotal role in our metropolitan landscapes. Yet, to the best of our knowledge, there is no city-scale solution for optimizing transportation dynamics based on big mobility data and reinforcement learning. In this work, we transformed MATSim, a high-fidelity multi-agent transportation simulator, into a realistic reinforcement learning environment for optimizing and evaluating transportation policies. This environment enables us to assess transportation policies in a way that is convincing for policy-makers, by simulating with high fidelity how agents perform daily activities and trips. Within this environment, we developed a model-based reinforcement learning algorithm that approximates MATSim dynamics with a partially observed discrete-event decision process and identifies the optimal policy through variational inference. The model-based reinforcement learning algorithm is benchmarked against three state-of-the-art algorithms (Policy Gradient, Actor-Critic, and Guided Policy Search) in the proposed reinforcement learning environment on four different scenarios: a fully synthetic one provided by MATSim (SynthTown) and three real-world datasets involving the cities of Berlin, Santiago de Chile, and Dakar. Of the four algorithms, only the model-based reinforcement learning algorithm converges with a reasonable computational effort. A detailed comparison between the PODEDP dynamics and the corresponding MATSim dynamics in the Berlin scenario demonstrates that PODEDP indeed captures complex system dynamics well. Thus, model-based reinforcement learning is sample-efficient, using variational inference to search the solution space efficiently.
While we have evaluated the proposed model-based reinforcement learning approach in a state-of-the-art multi-agent transportation simulator that emulates real-world road network dynamics highly realistically, we did not evaluate it in the real world. This is a limitation of the evaluation. However, such a restriction is typical in complex systems and policy research, because experiments in the real world are often costly, dangerous, or infeasible. Transforming state-of-the-art simulators into realistic reinforcement learning environments for policy optimization and evaluation, and applying modern reinforcement learning techniques to learn the control of complex systems from big data, represent a new direction for introducing human mobility data and reinforcement learning into policy research. Through the integration of state-of-the-art simulation models and big data with machine learning, we can effectively turn the real world into a living lab, where theories are evaluated on the data commons, results are quantifiable, and approaches are repeatable. In the future, we expect to see applications of semantically richer models such as multi-agent reinforcement learning and mean-field games, additional data sources of a heterogeneous nature, and eventually real-world evaluations.

(Luckyson Khaidem, Massimiliano Luca, Fan Yang, Ankit Anand, Bruno Lepri, and Wen Dong contributed equally to this work.)
VOLUME 8, 2020
LUCKYSON KHAIDEM received the bachelor's degree in computer science and engineering from the PES Institute of Technology. He is currently pursuing the master's degree in computer science from the State University of New York at Buffalo. His research interests include application of machine learning in financial domain, multiobjective optimization, and game theory in astrophysics (exoplanet habitability estimation).
MASSIMILIANO LUCA is currently pursuing the Ph.D. degree with the Mobile and Social Computing Laboratory, Fondazione Bruno Kessler, Italy, and the Faculty of Computer Science, University of Bozen-Bolzano. His main research interests include analysis of human-mobility dynamics and design and development of machine learning methods to tackle mobility challenges.
FAN YANG (Member, IEEE) is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, State University of New York at Buffalo, USA. His research interests include reinforcement learning, imitation learning, modeling, generative models, and probabilistic graphical models.
ANKIT ANAND received the bachelor's degree in electronics and communication engineering from the Netaji Subhas Institute of Technology. He is currently pursuing the master's degree in computer science with the State University of New York at Buffalo. His research interests include application of machine learning, deep learning, and reinforcement learning. He is also the Head of the Research with Data-Pop Alliance, the first think-tank on big data and development co-created by Harvard Humanitarian Initiative, the MIT Media Laboratory, and the Overseas Development Institute. His research has received attention from several international press outlets. His research interests include computational social science, urban computing, network science, machine learning, new models for personal data management, and monetization. He received three years Marie Curie Postdoctoral Fellowship in 2010, the James Chen Annual Award for the Best UMUAI Paper, and the Best Paper Award from ACM Ubicomp in 2014.
WEN DONG (Member, IEEE) received the Ph.D. degree from the MIT Media Laboratory. He is currently an Assistant Professor of computer science and engineering (Joint Appointment) with the Institute for Sustainable Transportation and Logistics and the State University of New York at Buffalo. His research interests include developing machine learning and signal processing tools to study the dynamics of large-social systems in situ.