A Distributed Assignment Method for Dynamic Traffic Assignment Using Heterogeneous-Adviser Based Multi-Agent Reinforcement Learning

Dynamic Traffic Assignment (DTA) is an important measure for alleviating urban network traffic congestion. Congestion is usually caused by stochastic real-time travel demands, which in the real world cannot be reassigned along the time dimension but are assumed to be assignable in existing DTA methods. In this paper, a distributed DTA method is proposed that prevents urban network traffic congestion caused by stochastic real-time travel demands by improving Multi-Agent Reinforcement Learning (MARL). A team structure consisting of decision-makers and advisers is designed to learn in parallel in realistic DTA tasks. To reduce the size of the solution space adaptively, the dynamic critical values advised by adviser agents are adopted as constraints on the strategy space of the decision-makers (i.e. main agents). A collaborative heterogeneous-adviser mechanism is designed to avoid deviation of guidance. To enhance the adaptability of DTA to a changing external environment, the mixed strategy concept is introduced to improve the decision-making process of main agents. Respective mapping mechanisms are designed to define adaptive learning rates that improve the sensitivity of MARL. The Sioux Falls (SF) network is established as a test platform via Dynamic Network Loading (DNL), and the effectiveness of the suggested DTA method is assessed through numerical simulations on the SF network. Under a scenario with stochastic real-time travel demands, the results show that the proposed method achieves better network throughput and better average individual travel time over the whole network. The ability of the proposed method to respond rapidly to the external environment is also demonstrated. The suggested method can improve the state of the art in dynamically assigning stochastic real-time travel demands and in avoiding potential traffic congestion.


I. INTRODUCTION
Traffic congestion in urban networks and its derivative effects are problems faced by many cities [1]. Congestion arises because existing traffic resources cannot be assigned to meet the rapidly increasing travel demand. Among existing solutions, Traffic Assignment (TA) provides administrations with a macro-perspective for eliminating or mitigating traffic congestion.
Modeling and solution techniques for TA have been extensively studied since Wardrop's Principles (W-Ps) [2]. Existing traffic assignment problems studied by researchers can be classified into two categories: (a) Static Traffic Assignment (STA) [3]; (b) Dynamic Traffic Assignment (DTA) [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Bohui Wang.
STA is too macroscopic to cope with the changeable conditions of a real Traffic Network (TN). Although STA has been improved by some scholars by considering travel time reliability [5], it still has limitations. Therefore, DTA has gradually gained the favor of many scholars.
The current DTA methods can be divided into several types depending on different perspectives.
However, the above existing DTA methods all assign the traffic demand generated in the network to achieve the expected objective (e.g. DUE, DSO, etc.) over the spatial-temporal scale. This is an ideal state that is hardly ever reached in the real world, where traffic demands are usually stochastic and instantaneous. Such demands need to be dealt with immediately; in other words, they are unassignable in the time dimension. Therefore, the DTA system should be able to assign traffic demands as they occur. Furthermore, the above DTA methods solve TN problems from a top-level, centralized perspective, which requires considerable time for information transmission and solution calculation.
All of the above challenge the real-time performance of a DTA system. Hence, real-time performance must be considered by engineers and scholars when applying DTA in a realistic TN, and the real-time performance of existing DTA methods needs to be improved.
Recently, Multi-Agent Systems (MAS) and Reinforcement Learning (RL) have been integrated and applied to the field of traffic management, such as traffic control [29], [30] and route planning [31], [32]. Combining the advantages of MAS and RL, Multi-Agent Reinforcement Learning (MARL) was introduced for TA [33]. MARL can solve the problem from a distributed perspective because travelers decide which route to adopt and learn from their trips. MARL thus provides a distributed structure for the DTA problem, which supports the portability and universality of MARL-based DTA methods in complex and changeable traffic networks. However, the TA investigated and discussed in [33] is STA. Furthermore, to the authors' knowledge, there is a lack of research on the application of MARL to DTA. Therefore, it is necessary to investigate and modify MARL for DTA in a complex and dynamic environment.
Since the dimension and complexity of DTA exceed those of STA, the state and action spaces of MARL must be suitably defined for DTA. Furthermore, the equilibrium solution in DTA is time-varying, so the convergence point in DTA is time-varying as well. MARL must therefore remain sensitive to a dynamic environment and be ready to explore unfamiliar knowledge at any time. This capacity cannot be established with a decaying learning rate or a decaying greedy search strategy, both of which are commonly used in MARL. Therefore, three components of MARL for DTA require further study: (a) the state and action spaces, which affect the computational speed of MARL; (b) the decision-making strategy, which balances exploration and exploitation; (c) the learning rate, which affects the sensitivity to a time-varying convergence point.
In this article, we suggest an SB-DTA framework based on concepts and mechanisms inspired by MARL, namely Heterogeneous-Adviser based Multi-Agent Reinforcement Learning (HAB-MARL). This framework is established on a multi-agent architecture with multiple teams. In this architecture, the agents can be divided into two categories: (a) main agents, which learn a strategy for selecting routes in the TN, and (b) adviser agents (including two sub-types), which update the recommended values (i.e. critical time and critical size) of the action space for main agents. Each main agent independently updates its experience without considering the effects caused by other agents. Each adviser agent provides recommendations on the action space to the corresponding main agent and updates its experience from the external environment. Additionally, agents increase their adaptive capacity for DTA by employing different MARL algorithms. We particularly focus on flexible and directional guidance to expedite the convergence of MARL, and on enhancing the ability of MARL to adapt quickly to fluctuations in demand and supply for DTA.
The main contributions of this study to the advancement of the state of the art are summarized as follows: (1) To improve MARL's capacity for DTA, a multiple-team architecture is established. In this architecture, two agent types are separated according to the learning task. The main agents are responsible for realizing DTA and receive variable decision support from the adviser agents.
(2) The varying critical values from adviser agents are integrated to modify the space restriction that is used to reduce the computational complexity of MARL. The improved MARL framework can adjust the decision space adaptively, which captures the direction of the optimal assignment for DTA and accelerates the convergence process of MARL.
(3) The mixed strategy concept is introduced to improve the decision-making process of MARL so that main agents update their experience in parallel. It makes MARL explore potential solutions efficiently.
(4) Three independent mapping mechanisms are defined to model the learning rate of MARL in the corresponding agents. These mechanisms promote MARL's self-adaptive ability when facing a new state in which quantitative transitions of demand and supply occur in the TN. In other words, the sensitivity of MARL to a dynamic external environment is enhanced.
The rest of this article is organized as follows. Related works on DTA and MARL are introduced in Section II. The technical background and definition of the HAB-MARL method are given in Sections III and IV, respectively. Numerical simulation experiments and the analysis of results are presented in Section V. Finally, conclusions and future expectations of this work are given in Section VI.

II. RELATED WORKS
A. DYNAMIC TRAFFIC ASSIGNMENT (DTA)
The fundamental rules of DTA are established on W-Ps [2], which consist of two optimization principles: (a) Wardrop's first principle, namely User Equilibrium (UE); and (b) Wardrop's second principle, namely System Optimal (SO). Additionally, there are other important components in DTA, including the travel cost function, route choice, etc.
On the aspect of algorithms, early studies on DTA were developed on the basic technology of STA, such as the application of the Frank-Wolfe (FW) algorithm [34]-[37]. To avoid the slow convergence rate of FW, Column Generation (CG) [38] and Simplicial Decomposition (SD) [39] have been widely employed as alternative approaches. With the generalization of DTA, the Sub-gradient (SG) method was adopted [40]. The above studies can be regarded as AB-DTA.
Although the uniqueness and existence of solutions are guaranteed and determined in advance, the difficulty of capturing traffic dynamics still restricts the application of the analytic approach. With its capacity for modeling the dynamic characteristics of traffic flow, SB-DTA was presented to model DTA in a realistic environment.
In SB-DTA, scholars focus on which kind of traffic flow model can be selected to reflect the real world. This model is the basis for Dynamic Network Loading (DNL), which is a fundamental module in the architecture of DTA. Kuwahara et al. [41] extended the point queue model to deal with physical queues and then proposed a corresponding solution algorithm for UO-DTA. The form most widely used in SB-DTA is CTM. Lo et al. solved UO-DTA on the CTM by employing the alternating direction method [42] or by transforming the NCP to an equivalent MP [43]. Szeto and Lo [22] developed a CTM framework to solve route and departure time choice simultaneously under flexible demand scenarios for UO-DTA. Szeto and Lo [24] modeled SUO-DTA as a Fixed Point (FP) problem and employed CTM to capture the effect of the random evolution of traffic states.
In addition to being classified as AB-DTA and SB-DTA, DTA can also be divided according to the inflow form in which decision variables are assigned, namely PB-DTA, LB-DTA, and IB-DTA. In PB-DTA, diverging or merging flow can be modeled thanks to its abundant information on path flow. Therefore, PB-DTA has been examined in various studies, such as Chen and Feng [12], Lim and Heydecker [25], and Meng et al. [44]. Acquiring the abundant information of PB-DTA relies on the enumeration of feasible paths, whose complexity increases exponentially with the number of OD pairs and the scale of the TN. Two approaches are usually adopted to avoid or mitigate this disadvantage. One is to reduce the enumerative complexity of paths and accelerate computation by embedding a path generation algorithm in PB-DTA, which has grown into a research sub-field of DTA [45]. The other completely abandons the requirement for path information to avoid enumeration altogether, and is known as LB-DTA. The previously mentioned studies in [14] and [15] can be regarded as research on LB-DTA. To integrate the advantages of PB-DTA and LB-DTA, Long et al. proposed [28] and further generalized [1] IB-DTA. This approach transforms IB-DTA into an FP problem that seeks the stabilized flow proportions at diverges and merges in the TN.
However, as discussed in Section I, the above DTA methods assume that all traffic demands are counted and assigned over the space-time scale. This kind of assignment process is in fact top-level and centralized, and the corresponding algorithm repeatedly searches for a stable equilibrium point. It is difficult and unrealistic to search for an evolving equilibrium point caused by stochastic and real-time travel demands. Therefore, it is necessary to develop an effective approach that adapts immediately to a time-varying equilibrium.

B. MULTI-AGENT REINFORCEMENT LEARNING (MARL)
In recent years, Artificial Intelligence (AI) technology has been greatly developed [46]-[48]. As a sub-field of AI, MAS has excellent portability. Therefore, MAS has been widely adopted to model systems with multiple participants, which are widespread in numerous fields including communication protocols [49], cooperative control [50], [51], fault-tolerant control [52], electrical power systems [53], sensor deployment [54], transportation [55]-[57], etc. MAS researchers focus on the structural mechanisms of complex systems. These mechanisms include the internal learning mechanism of each agent and the external interaction mechanisms between agents and between agents and the environment. For instance, MAS was employed to establish a real-world traffic simulation [58], [59]. With the development of Machine Learning (ML), RL, with its advantages of being model-free and self-learning, has been proposed and exploited. MARL combines RL with MAS and promotes the development of theory and technology in transportation.
In transportation, MARL is extensively applied to many tasks, such as traffic control and route choice [33]. However, among these tasks only route choice is associated with DTA, and only weakly.
Bazzan and Grunitzki [60] considered DTA with individual drivers as independent and autonomous agents, which employ QL to select suitable routes step by step. According to the objective function of DTA, Grunitzki et al. [61] proposed two improved algorithms based on QL to maximize the utility of the agent and of the system, respectively. Bazzan and Klugl [59] argued that the improvement and deterioration of the whole system are affected by the combinations of actions among agents. Zolfpour-Arokhlo et al. [32] combined MARL with Q-value based Dynamic Programming (QVDP) to provide a priority route plan for vehicles. MARL in the above literature focuses on the definition of the reward function. By defining rational and precise reward functions, authors can achieve their objectives, such as minimizing individual travel time, maximizing system utility, etc.
VOLUME 8, 2020
Grunitzki et al. [33] suggested two MARL methods for TA and compared them with three classic TA methods. The concept of a restricted search space was integrated into MARL to expedite convergence, and it was shown that convergence can be accelerated by suitably reducing the action space. Nevertheless, the size of the restricted action space in [33] is constant and was obtained by experiment. For DTA, in the actual situation, different drivers travel from their origin to their destination with different degrees of spatial awareness. Moreover, the set of potentially optimal routes varies with time because the situation in the TN is time-dependent. Adaptively restricting the set of recommended paths can approximately model this situation.
As mentioned earlier, the equilibrium in DTA is evolutionary. Solving DTA is essentially an online closed-loop optimization process. It requires that MARL has a high convergence rate and the ability to adjust the learning process adaptively. With respect to the state of the art, the motivated mechanism [62] and Equilibrium Transfer (ET) [63] have been proposed to accelerate convergence. The motivated mechanism supplements the reward function to accelerate MARL convergence; this method was defined as Motivated Reinforcement Learning (MRL). Nevertheless, MRL may lead to a deviation from the target because the reward function is closely connected with the objective of DTA. Therefore, defining the reward function reasonably is more suitable for DTA, as shown in the above literature [32], [60], [61]. As for ET, its effectiveness depends on similar environments or historical experiences encountered by agents. Such a scenario is uncommon in DTA, which leads to poor applicability of ET. Additionally, the traditional assignment-loading reciprocating pattern of DTA impedes further increases in computing speed. Within the framework of MARL, a feasible approach to accelerating convergence is the assignment-loading parallel pattern, in which the assignment is divided into multiple units over time and executed synchronously with loading. Moreover, the decision-making mechanism of MARL can be adjusted appropriately to improve the efficiency of exploration and thus accelerate convergence.
In addition to the above discussion, the sensitivity of MARL is important to DTA. The time-dependent equilibrium in DTA requires MARL to recognize the dynamic state and update its knowledge as soon as possible. Hence, the frequently used decaying learning rate, which lacks the ability to re-learn after convergence, is unsuitable for MARL in DTA. It is necessary to explore a mapping mechanism that can self-adaptively regulate the learning rate according to the state of the MAS and the TN.
In conclusion, it is difficult to apply and popularize MARL in real-world DTA under the existing technological environment. To address each of the above difficulties, the present work proposes a Heterogeneous-Adviser based Multi-Agent Reinforcement Learning (HAB-MARL) architecture to deal with DTA.

III. BASIC BACKGROUND TECHNOLOGY OF HETEROGENEOUS-ADVISER BASED MULTI-AGENT REINFORCEMENT LEARNING (HAB-MARL)
In this section, we introduce the basic background technologies of the Heterogeneous-Adviser based Multi-Agent Reinforcement Learning (HAB-MARL) architecture for DTA. In Section III.A, DTA is introduced as the application background of HAB-MARL. A CTM-based DNL is then described in Section III.B to establish the numerical simulation. Finally, the common form and related concepts of MARL are given in Section III.C. All parameters relating to DTA, DNL, and MARL are listed in TABLES 1-3, respectively.

A. DYNAMIC TRAFFIC ASSIGNMENT (DTA) PROBLEM
In this section, we will first introduce some basic components of TA and briefly describe a Variational Inequality (VI), which is the familiar generalized solution form for DTA.
Definition 1: A traffic network $G$ is a tuple $\langle V, E \rangle$. The elements of the tuple are described in TABLE 1.
In $G(V, E)$, flow is the control subject of DTA, which aims to minimize travel costs. It can be defined in the form of a VI.
Definition 2: For a specific flow $f_b^t$, the corresponding travel cost function is $c_b(t, f_b^t)$. For DTA, the following generalized discrete VI exists:

$$\sum_{t \in T} \sum_{b \in B} c_b\left(t, f_b^{t*}\right) \left(f_b^{t} - f_b^{t*}\right) \ge 0 \quad (1)$$

In formula (1), $f_b^{t*}$ denotes the solution flow and $f_b^t$ any feasible flow.
For the travel cost function $c_b$, there are two frequently used forms: travel time and marginal travel time, which respectively express the implicit objectives of UE-DTA and SO-DTA. The travel cost function may also include further components, such as pricing tactics, pollutant emission, capacity constraints, congestion delay, etc. In that case, $c_b$ is usually referred to as the generalized cost function.
Moreover, the existence and uniqueness of the solution in VI for DTA have been proven in some cases.

Theorem 3: Formula (1) has a unique solution if and only if the travel cost function satisfies some specific mathematical conditions: continuity and monotonicity.
In Theorem 3, continuity and monotonicity are emphasized to guarantee that the equivalent mathematical programming function of the VI is strictly convex. The existence of the optimal solution in the finite solution space is thereby also guaranteed. A similar proof is discussed in full in [65].

B. DYNAMIC NETWORK LOADING (DNL) FOR DTA
Among the numerous DNL methods, those based on CTM, which captures traffic dynamics well, have been widely studied and take various forms, such as [20], [66], [67]. For more details, please refer to [20], [64] and [68]. In this section, a CTM-based approach is introduced to construct the DNL for DTA.
In CTM-based DNL, the update of traffic flow on a link of the TN is subject to the generalized rules given in formulas (2)-(9), which cover ordinary edges ($l \in E_{non}$) and the demand and supply of flow transmission. The parameters in formulas (2)-(9) are described in TABLE 2. Formulas (2)-(4) describe the discrete flow variation in link transmission. The potential demand and supply of links are represented by formulas (5) and (6), respectively. In formulas (5) and (6), the three components on the right side represent density flow, current flow/allowance, and link capacity, in that order. These formulas unify multiple possible link scenarios, such as the entrance and the exit of the network. Inequality (7) covers the two states faced by connectors: the ordinary state and the branch state. In the ordinary state, the inequality reduces to an equality. In the branch state, formula (7) expresses the possible schemes in connectors. Stricter constraints for connectors at diverging and merging scenes are given in formulas (8) and (9), respectively.
With the advantages of simple structure and linearization, the above CTM-based DNL can be easily established for DTA in numerical simulation.
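To make the demand/supply logic concrete, the following is a minimal sketch of one update step for a chain of cells in the textbook cell transmission model. It is not a transcription of formulas (2)-(9), which are not reproduced here; the `Cell` fields, the `delta` wave-speed ratio, and the unrestricted network exit are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    n: float            # vehicles currently in the cell
    n_max: float        # holding capacity of the cell (assumed name)
    q_max: float        # maximum flow per time step (assumed name)
    delta: float = 1.0  # backward-wave / free-flow speed ratio (assumed)


def sending(cell: Cell) -> float:
    """Demand: vehicles the cell can send downstream in one step."""
    return min(cell.n, cell.q_max)


def receiving(cell: Cell) -> float:
    """Supply: vehicles the cell can accept from upstream in one step."""
    return min(cell.q_max, cell.delta * (cell.n_max - cell.n))


def step(cells: List[Cell], inflow: float) -> List[float]:
    """Advance a chain of cells by one time step and return new occupancies."""
    # Flow across each boundary: entrance, interior boundaries, exit.
    flows = [min(inflow, receiving(cells[0]))]
    for up, down in zip(cells, cells[1:]):
        flows.append(min(sending(up), receiving(down)))
    flows.append(sending(cells[-1]))  # exit assumed unrestricted
    # Conservation: new occupancy = old occupancy + inflow - outflow.
    return [c.n + flows[i] - flows[i + 1] for i, c in enumerate(cells)]
```

In this sketch the flow across a boundary is the minimum of upstream demand and downstream supply, which is the usual way an ordinary (non-branching) connector is handled; diverges and merges would need additional splitting rules.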

C. MULTI-AGENT REINFORCEMENT LEARNING (MARL)
A classical and widely applied model-free MARL method is Q-Learning (QL). In QL, the value functions are learned and stored in the form of Q-factors.
Definition 4: For Q-learning, given a tuple that includes $S$, $A$, and $R$, the Q-values are updated according to the following equation:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A} Q(s', a') \right] \quad (10)$$

The parameters in Definition 4 are described in TABLE 3. For more detail on the proof of Lemma 5, see [69].
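The Q-value update can be illustrated with a minimal tabular sketch of the standard Q-learning rule. The function name `q_update`, the dictionary-based Q-table, and the default values of `alpha` and `gamma` are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning update and return the new Q-value.

    q is a mapping from (state, action) pairs to values; unseen pairs
    default to 0.0 when q is a defaultdict(float).
    """
    # Greedy estimate of the value of the next state.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    # Blend the old estimate with the bootstrapped target.
    target = reward + gamma * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * target
    return q[(state, action)]


q_table = defaultdict(float)
q_table[("s1", "a")] = 2.0
new_value = q_update(q_table, "s0", "a", reward=1.0, next_state="s1",
                     next_actions=["a", "b"], alpha=0.5, gamma=0.9)
```

With a constant `alpha`, old experience is forgotten at a fixed rate; this is the parameter that later sections replace with a state-dependent mapping.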
However, the complexity of applying a MARL method such as QL to TA increases with the scale of the TN. To deal with this limitation, the concept of critical time was introduced in [33] to divide the strategy space into two subsets: the valid strategy space and the invalid strategy space. The relationship between the subsets is as follows:

$$A^{valid} \cap A^{invalid} = \emptyset \quad (11)$$

$$A^{valid} \cup A^{invalid} = A \quad (12)$$

The parameters in formulas (11)-(12) are given in TABLE 3. Formula (11) indicates that the two subsets share no elements. Formula (12) indicates that the union of the two subsets is the whole strategy space, which includes all possible actions for TA. This improved QL method has been named Edge-based QL (EB-QL).
Additionally, another improved method named Route-based QL (RB-QL) was also presented in [33]. In RB-QL, agents receive a precomputed recommended strategy space before the learning process begins. In contrast with the feasible strategy space, the recommended strategy space is enumerable because the precomputed size is small.
Both EB-QL and RB-QL essentially reduce the complexity of MARL by cutting down the size of the strategy space. The critical time $t_{critical}$ and the precomputed size above were configured as constants determined by experiment. The selection of the critical time in EB-QL is important. Smaller values lose potentially optimal solutions, because traffic flow can only be assigned to a minority of paths, and excessive travel costs then occur on overloaded links. Conversely, higher values admit too many inefficient strategies into the valid strategy set, which degrades learning performance. For RB-QL, a similar discussion about the precomputed size was presented in [33]. Neither EB-QL nor RB-QL is suitable for DTA, because $t_{critical}$ and the precomputed size are time-dependent as DTA evolves in the TN. For the application of MARL to DTA, the dynamic characteristics of DTA, such as time-varying demands, equilibrium transformation, the variation of travel cost, etc., must be considered.
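The effect of the critical time on the strategy space can be sketched as follows. This is only an illustration: the rule that a strategy is valid when its travel time does not exceed the critical time, the function name `split_strategy_space`, and the route labels are assumptions, not the exact criterion of [33].

```python
def split_strategy_space(travel_times, t_critical):
    """Partition strategies into valid and invalid subsets.

    travel_times maps a strategy label to its current travel time;
    a strategy is treated as valid when that time is <= t_critical.
    """
    valid = {s for s, t in travel_times.items() if t <= t_critical}
    invalid = set(travel_times) - valid
    return valid, invalid


travel_times = {"r1": 10.0, "r2": 14.0, "r3": 25.0}

# A moderate critical time keeps the two faster routes.
valid, invalid = split_strategy_space(travel_times, t_critical=15.0)

# A very small critical time leaves nothing to choose from, while a
# very large one keeps every route and gives no reduction at all.
too_small, _ = split_strategy_space(travel_times, t_critical=5.0)
too_large, _ = split_strategy_space(travel_times, t_critical=100.0)
```

Because travel times change as the network loads, a fixed `t_critical` can drift into either extreme, which is the motivation for learning it adaptively.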

IV. FRAMEWORK OF HETEROGENEOUS-ADVISER BASED MULTI-AGENT REINFORCEMENT LEARNING (HAB-MARL)
In this section, we establish Heterogeneous-Adviser based Multi-Agent Reinforcement Learning (HAB-MARL) architecture for DTA. The complete formulation of HAB-MARL will be described in detail in Section IV.A. The corresponding algorithm will be subsequently summarized in Section IV.B. All parameters relating to HAB-MARL are listed in TABLE 4.

A. HETEROGENEOUS-ADVISER BASED MULTI-AGENT REINFORCEMENT LEARNING (HAB-MARL)
In this section, the MARL theory described in Section III.C is improved to design a distributed assignment method for DTA. We suggest a decentralized framework in which two categories of agents are associated with Origin-Destination (OD) pairs and strategy space support, respectively. In this architecture, the members of the main agent group independently decide how to assign the corresponding flow into the TN. The members of the other agent group are responsible for assisting the corresponding main agents. Communication occurs only between the auxiliary agents (advisers) and the main agents (deciders), for the transfer of support information. FIGURE 1 illustrates the suggested HAB-MARL distributed assignment framework for DTA, which has four modules (A-D). Module A is the kernel of HAB-MARL and includes the external interactions and internal learning mechanisms of the agents playing different roles. Because the architecture of HAB-MARL is multi-group and decentralized, only one agent team for one OD pair is shown in module A as the basic standard team structure. For DTA, this team structure can be replicated according to the size of the network. The details of module A are described in the rest of this section. Module B, introduced in Section III.B, can be employed to model traffic flow in the real world. In this paper, the exchange between modules A and B can be regarded simply as the interaction of HAB-MARL with the external environment. Module C has two layers, which represent the realistic and mathematical structures of road networks, respectively, and is the indispensable foundation for establishing module B. Module D contains the information about network demand. In DTA, these demands, which usually vary and can be defined according to scenarios (such as those defined in Section V.A.3), require the services of module A. In addition, the flow of information of HAB-MARL in DTA tasks is also abstracted among modules A-D.
In the following sub-sections, the various components of the HAB-MARL architecture shown in FIGURE 1 (A) are defined. To simplify the expressions, the iteration index is omitted. Moreover, the different agent groups in the MAS each have their own Q-functions. It is therefore necessary to define the components of the Q-function for each agent in the different groups.

1) ADVISER AGENT
For adviser agents, their interaction process, in which actions affect the external environment indirectly, is slightly different from that of main agents. Furthermore, the reward is perceived by the main agents from the external environment. However, to keep cognitive consistency about the outside world, the state of adviser agents must be consistent with that of the main agents.

a: TASK
The task of adviser agents is to learn a critical value that restricts the action set provided to main agents. In this paper, the critical value takes two forms: the expectant critical time and the expectant size of the strategy set.

b: STATE SPACE
As mentioned, for both adviser agents and main agents, the state is perceived from the external environment. In DTA, the state is employed to describe the scene in the TN. Additionally, the evolution of traffic in DTA is modeled in this paper as a network of densities (see Section III.B). The state of each agent is the vector of states of all links (edges) in the TN. It is also worth rationalizing the size of the state space to reduce computational complexity [30]. Fortunately, the state of the TN only needs to be detected once to be accessible by all agents. Inspired by the concept of distributed expression in Deep Learning (DL) [70], we formally define the state of the TN as follows.
Definition 6: For a traffic network $G(N, E)$, the state for each agent can be expressed as a vector (distributed expression): $s = (s_1, \cdots, s_l, \cdots, s_{n_e})$. Each element of $s$ represents the state of an edge in the TN, which is defined by combining formulas (13) and (14). The related parameters are described in TABLE 4.
In formula (13), $\psi_l$ is an evaluation index composed of vehicle proportions on link $l$. It is further defined by formula (14). In this paper, 0.5 and 0.8 are adopted as the values of $\phi_{free}$ and $\phi_{jam}$, respectively.
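A possible reading of this state definition is sketched below. Formulas (13) and (14) are not reproduced here, so the sketch assumes that $\psi_l$ is the ratio of vehicles on a link to its holding capacity and that the two thresholds split it into three discrete levels (free, intermediate, jammed); the function names and the level encoding 0/1/2 are hypothetical.

```python
PHI_FREE = 0.5  # threshold value stated in the text
PHI_JAM = 0.8   # threshold value stated in the text


def link_state(psi):
    """Map a link's evaluation index to a discrete level (assumed encoding)."""
    if psi < PHI_FREE:
        return 0  # free flow
    if psi < PHI_JAM:
        return 1  # intermediate
    return 2      # jammed


def network_state(vehicles, capacities):
    """Build the state vector s = (s_1, ..., s_ne) over all links."""
    return tuple(link_state(n / cap) for n, cap in zip(vehicles, capacities))


state = network_state(vehicles=[2, 6, 9], capacities=[10, 10, 10])
```

Returning a tuple makes the state hashable, so it can be used directly as a key in a tabular Q-function shared by all agents of a team.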

c: ACTION SPACE AND SELECTION
To reduce the complexity, we construct the action of each adviser agent using a form similar to that adopted in modeling the state space (see FIGURE 1(a)).
To balance exploration (gaining knowledge) and exploitation (using knowledge), adviser agents select their actions according to the ε-greedy strategy widely adopted in QL.
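The ε-greedy rule can be sketched in a few lines. The function name `epsilon_greedy`, the dictionary of Q-values, and the injectable `rng` argument are illustrative assumptions; the value of ε used in the paper is not stated in this section.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict mapping actions to Q-values.

    With probability epsilon a uniformly random action is explored;
    otherwise the action with the highest Q-value is exploited.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


q_values = {"shrink": 0.1, "keep": 0.7, "expand": 0.3}
greedy_choice = epsilon_greedy(q_values, epsilon=0.0)
explored_choice = epsilon_greedy(q_values, epsilon=1.0)
```

Setting `epsilon` to 0 gives pure exploitation and 1 gives pure exploration; passing a seeded `random.Random` instance as `rng` makes runs reproducible.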

d: REWARD FUNCTION
The reward function is used to represent the effects of agent action, which is also used to establish a closed feedback loop. It needs to be defined according to the objective of agents. For adviser agents, the reward function is employed to assess the effect of recommendation.
Definition 8: The reward function of adviser agent $i$ is given by formula (15). Formula (15) is defined to keep the reward within the range [−1, 1], which avoids the increase in convergence time caused by sharp fluctuations of the Q-value. Formula (15) is further defined by formulas (16)-(18). Formula (16) represents the change in average travel cost on link/path $b$. The average travel cost of $b$ is defined in formula (17). The coefficient $\eta_i^{improve}$ defined in formula (18) determines the sign of formula (15). The parameters in formulas (15)-(18) are described in TABLE 4.
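As a rough illustration of a reward that stays within [−1, 1] and whose sign reflects improvement, consider the sketch below. It is not formula (15): the normalization by the larger of the two costs, the function name `adviser_reward`, and the rule that a cost decrease counts as improvement are all assumptions.

```python
def adviser_reward(cost_before, cost_after):
    """Bounded reward from the change in average travel cost.

    The magnitude is the relative change in cost, and the sign is
    positive when the cost did not increase (assumed convention).
    """
    if cost_before <= 0 or cost_after <= 0:
        raise ValueError("average travel costs must be positive")
    # Relative change, normalized so that it lies in [0, 1).
    change = abs(cost_after - cost_before) / max(cost_before, cost_after)
    # Sign coefficient: +1 for improvement or no change, -1 otherwise.
    eta = 1.0 if cost_after <= cost_before else -1.0
    return eta * change


improved = adviser_reward(cost_before=20.0, cost_after=15.0)
worsened = adviser_reward(cost_before=15.0, cost_after=20.0)
```

Normalizing by the larger cost keeps the reward bounded regardless of the absolute scale of travel times, which is the property the definition emphasizes.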

e: OTHER RELEVANT PARAMETERS
The discount rate in QL reflects the agents' preference between long-term and short-term rewards. Adviser agents aim for the advice received by the matching main agent to be perfect every time. Therefore, the discount rates of adviser agents are uniformly set to 0.9.
The learning rate affects the cumulative effect of knowledge. It determines the degree of retention and replacement of historical experience. Additionally, considering the application in DTA and role in HAB-MARL, the learning rate of adviser agents needs to be adaptive to dynamic goals. Therefore, self-adaptive forms of learning rate should be defined for two adviser agent types according to differentiated task contents.
For critical time adviser agents, the best critical time should be the travel time corresponding to the optimal solution of DTA. However, in the process of moving towards the DTA optimal solution, the specific value of the optimal solution is unknown and uncertain. Hence, the critical value should be determined according to the current assignment effect, namely the corresponding actual travel time.
Definition 9: The learning rate of critical time adviser agent i, α i can be expressed as a function of the difference between critical time and the actual average travel time: Formula (19) can be further expanded by introducing formulas (20) and (21).
For formula (19), the nonnegativity of its input value is guaranteed by formula (20). Furthermore, the value of formula (20) is far less than infinity. The above are the key elements to ensure that the learning rate can controlled within a reasonable range [0, 1).
Formula (20) represents the difference between the recommended value $t^{critical}_{i,k}$ and the desired value $\bar{t}^{k}_{i}$. The recommended value $t^{critical}_{i,k}$ is further defined in formula (21).
For strategy set size adviser agents, the corresponding optimal expected size of the strategy set should be the number of paths adopted in DTA at the equilibrium point. Nevertheless, the traversal of paths is dynamic because it tracks the time-varying equilibrium point. Under this circumstance, a feasible approach is to estimate the optimal expected size by assessing the current path traversal situation. Accordingly, the learning rate of strategy set size adviser agents can be defined as follows.
Definition 10: The learning rate of strategy set size adviser agent i, α i can be expressed as a function of the difference between expectant size and the estimated optimal size: Formula (22) can be further expanded by introducing formulas (23) and (24).
Formula (24) returns the logic value of the necessity judgment on path traversal. The logic value is set to 1 when the currently traversed path p is inefficient, i.e. when p belongs to the set of inefficient paths. This set is a specific space whose sub-set for agent i meets specified requirements: for adviser agent i, the sub-set contains the corresponding inefficient paths. Moreover, similar to formula (19), formula (22) is constructed so that its input value, given by formula (23), is non-negative. The range of $\alpha_i$ derived from formula (22) is thus restricted to [0, 1).
Formula (23) represents the difference between the recommended size of the strategy set (the critical size at step k) and its ideal size.
In these ways, the sensitivity of adviser agents to dynamic, differentiated targets is preserved while convergence is guaranteed. The parameters in this sub-section (i.e. Definitions 9 and 10 and their complementary formulas) are listed in TABLE 4.

2) MAIN AGENT

a: TASK
For application in DTA, main agents are responsible for assigning traffic flow on the TN synchronously to avoid the emergence of traffic congestion in local regions. The assignment process obeys W-Ps (see Section II.A).

b: STATE SPACE
The state space of the main agents is the same as that of adviser agents. The reason has been discussed in Section III.D.1.(b).

c: ACTION SPACE AND SELECTION
The action space of main agent i, $A_i^{\dagger}(s_i)$, is a subset of the unabridged strategy space $A_i(s_i)$. It is obtained by jointly taking into account the suggestions provided by the corresponding adviser agents.

Definition 11: For main agent i, its action space $A_i^{\dagger}(s_i)$ is the intersection of two recommended strategy spaces:
$$A_i^{\dagger}(s_i) = A_i^{\dagger time}(s_i) \cap A_i^{\dagger size}(s_i)$$
In Definition 11, the recommended strategy spaces $A_i^{\dagger time}(s_i)$ and $A_i^{\dagger size}(s_i)$ are extracted from $A_i(s_i)$ on the basis of the critical time $t^{critical}$ and the critical size advised by the corresponding adviser agents, respectively.
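The construction in Definition 11 can be sketched as a set intersection. In the following minimal Python sketch, the two filtering rules are assumptions made for illustration: the time-based space keeps paths whose travel time does not exceed the critical time, and the size-based space keeps the fastest paths up to the critical size. All function and variable names are hypothetical.

```python
def recommend_by_time(path_times, t_critical):
    """Paths whose travel time does not exceed the critical time (assumed rule)."""
    return {path for path, time in path_times.items() if time <= t_critical}


def recommend_by_size(path_times, size_critical):
    """The size_critical fastest paths (assumed rule for the size constraint)."""
    ranked = sorted(path_times, key=path_times.get)
    return set(ranked[:size_critical])


def restricted_action_space(path_times, t_critical, size_critical):
    """Intersection of the two recommended strategy spaces (Definition 11)."""
    return (recommend_by_time(path_times, t_critical)
            & recommend_by_size(path_times, size_critical))
```

For example, with travel times `{"p1": 0.5, "p2": 0.8, "p3": 1.4, "p4": 2.0}`, a critical time of 1.5 and a critical size of 2, only `p1` and `p2` remain available to the main agent.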
To pursue the objective of DTA, a proper way for main agents to carry out their actions must be selected.
Definition 12: The action of main agent i is executed in a mixed-strategy manner, namely $\sigma_i$, which is a probability distribution on $A_i^{\dagger}(s_i)$ computed based on partial history.
In Definition 12, the mixed strategy $\sigma_i$ can also be treated as a convex combination of all feasible sub-actions $a_i^{\dagger j}$.
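To make the notion of a mixed strategy concrete, the sketch below builds a probability distribution over the restricted action space and splits a demand accordingly. The softmax rule over Q values and the `temperature` parameter are assumptions chosen for illustration; Definition 12 only requires a probability distribution computed from partial history.

```python
import math


def mixed_strategy(q_values, temperature=1.0):
    """Return a probability distribution over sub-actions (paths).

    q_values maps each feasible sub-action to its current Q value.
    The softmax form is a hypothetical choice, not the paper's formula.
    """
    if not q_values:
        raise ValueError("the restricted action space must not be empty")
    top = max(q_values.values())  # subtract the maximum for numerical stability
    weights = {a: math.exp((q - top) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def split_flow(demand, sigma):
    """Assign the demand to paths as the convex combination given by sigma."""
    return {a: demand * p for a, p in sigma.items()}
```

Because the probabilities are non-negative and sum to one, the assigned path flows always add up to the original demand.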

d: REWARD FUNCTION
In QL, the update of the value function is usually based on evaluating the effect of an independent action. However, the main agent's action takes the form of a mixed strategy, so the Q-value of each sub-action of the mixed strategy needs to be updated independently. It is therefore necessary to define a special form of reward function for main agents.
Definition 13: For main agent i, the reward for its sub-action $a_i^{\dagger j}$, $r_i(s_i, a_i^{\dagger j})$, is defined in formula (26).
Similar to formula (15), the range of formula (26) is restricted to [−1, 1]. Formula (27) shows that the travel cost input of a sub-action has two components: the sub-action effect estimation $c^{base\dagger j}$, defined by formula (28), and the weighted passive-impact supplement estimation of the main agent, $c^{slack\dagger j}$, defined by integrating equations (29) and (30). Formula (28) models the utility on the links/paths $b^{\dagger j}$ selected by sub-action $a_i^{\dagger j}$. Formula (29) models the utility on the links/paths $b^{\dagger}$ not selected by $\sigma_i$; this is the external utility provided by $\sigma_i$. The various mean values in (28) and (29) can be defined in a way similar to (17). The weight of the external utility assigned to sub-action $a_i^{\dagger j}$ is defined in formula (30). Similar to formula (18), formula (31) describes a coefficient $\eta^{\dagger j}$ that ensures the correct sign (plus or minus) of formula (26).
Based on the above reward function, the Q functions of the main agents can be updated synchronously with a series of rewards.
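The independent update of each sub-action can be sketched as a batch step over one row of the Q table. This is a hedged illustration in Python: the reward values are taken as given (formulas (26)-(31) are not reimplemented), and the function name, the `best_next` argument, and the range check are assumptions.

```python
def batch_q_update(q_row, rewards, best_next, alpha, gamma=0.9):
    """Update the Q value of every sub-action that received a reward.

    q_row maps sub-actions to Q values for the current state.
    rewards maps the sub-actions of the executed mixed strategy to their
    individual rewards, each expected to lie in [-1, 1].
    best_next is the best Q value of the next state (assumed to be given).
    """
    for action, reward in rewards.items():
        if not -1.0 <= reward <= 1.0:
            raise ValueError("rewards are expected to lie in [-1, 1]")
        old_value = q_row.get(action, 0.0)
        q_row[action] = old_value + alpha * (reward + gamma * best_next - old_value)
    return q_row
```

One executed mixed strategy therefore refreshes several Q values at once, instead of the single value updated after a pure action.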

e: OTHER RELEVANT PARAMETERS
For the discount factor, a similar reason has been discussed in Section IV.A.1.(e). It is suitable to set the discount factor of the main agents uniformly to 0.9.
Nevertheless, because of their different mission objectives, the learning rate of the main agents is defined differently from that of adviser agents. Moreover, the sensitivity of MARL also needs to be considered among the main agents. In other words, each main agent should be able to adapt to dynamic and changeable external factors and to learn the new sub-goals that arise from its task in new surroundings. Therefore, the learning rate can be defined as follows.
Definition 14: The learning rate of main agent i, $\alpha_i$, is a function of the variance of individual travel costs. It is expressed in formula (32).
In formula (32), the parameter in the right-hand term can be further described via equation (33).
The individual average travel cost $\bar{c}_i$ in formula (33) can be calculated in a manner similar to formula (17). Since the variance of individual travel costs is non-negative and finite, the range of the corresponding function (i.e. formula (32)) is also non-negative and lies within [0, 1). In formula (33), $f_b (c_b - \bar{c}_i)^2$ represents the total deviation at b, and the global deviation input is constructed from these terms. Under this condition, it is feasible to adopt this definition as the learning rate of the main agent. Employing the above mapping mechanism to define the learning rate enhances the sensitivity of MARL among main agents. Additionally, the parameters used in this sub-section (i.e. the section on the main agent) are listed in TABLE 4.
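The mapping from cost variance to learning rate can be sketched as follows. The flow-weighted squared deviation follows the term described above for formula (33); the normalization by total flow and the squashing function v / (1 + v) are assumptions standing in for the exact forms of formulas (32) and (33), and the function names are hypothetical.

```python
def flow_weighted_variance(flows, costs):
    """Flow-weighted variance of travel costs over links/paths b.

    flows[k] is the flow f_b and costs[k] the travel cost c_b of item k.
    Dividing by the total flow is an assumed normalization.
    """
    total_flow = sum(flows)
    if total_flow == 0:
        return 0.0
    mean_cost = sum(f * c for f, c in zip(flows, costs)) / total_flow
    return sum(f * (c - mean_cost) ** 2 for f, c in zip(flows, costs)) / total_flow


def main_agent_learning_rate(flows, costs):
    """Map the non-negative, finite variance to [0, 1) (hypothetical mapping)."""
    variance = flow_weighted_variance(flows, costs)
    return variance / (1.0 + variance)
```

When all used paths have equal cost, the variance and hence the learning rate are zero, so the agent keeps its experience; unequal costs raise the rate and trigger re-learning.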

B. ALGORITHM OF HAB-MARL
According to the definitions in Section IV.A, the HAB-MARL algorithm performed by each agent is summarized in Algorithm 1.

V. NUMERICAL SIMULATION EXPERIMENT
This section presents the results of the numerical simulation that evaluates the performance of the proposed distributed assignment method based on HAB-MARL. The traffic network topology, the contrast methods, and the corresponding experimental scenarios are described in Section V.A. Then, in Section V.B, the framework of the numerical simulation is given in pseudo-code. Finally, details of the results are discussed in Section V.C. The simulation and evaluation are implemented on an evaluation testbed based on MATLAB. Moreover, the parameters that were not mentioned above but are adopted in this section are listed in TABLE 5.

Algorithm 1 Pseudo-Code of HAB-MARL
Input: the discount rate $\gamma^{HAB}$; the set of states S; the space of actions A; the set of agents $I_i$; simulation time T; greedy coefficient ε
Initialize: the simulation time t ← 0; the step of simulation k ← 0
for i ∈ $I_m$ / $I_{at}$ / $I_{as}$ do (main agents / critical time adviser agents / critical size adviser agents)

, [33] and [68]. Compared to commonly adopted conceptual grid networks, which have only a few OD pairs and one-way roads or paths, the SF network describes a situation that is closer to the real world. In addition, compared with the Anaheim and Chicago networks in [68], the SF network has a relatively simple structure, which is convenient for numerical calculation. The basic information of the SF network is listed in Tables 8 and 9. Following the description in [71], the well-known Bureau of Public Roads (BPR) function (i.e. equation (34)) is employed to model the travel cost on each edge of the SF network.
The parameters in formula (34) are described in TABLE 5. The coefficients τ and β take the values 0.15 and 4, respectively. These calibration values are commonly used in BPR-type functions that describe the travel cost of a general road, for instance in [71], [60], and [72].
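For reference, the standard BPR form with these calibration values can be sketched in Python as below. The argument names `free_flow_time`, `flow`, and `capacity` are illustrative assumptions; the symbols actually used in formula (34) are those of TABLE 5.

```python
def bpr_travel_cost(free_flow_time, flow, capacity, tau=0.15, beta=4):
    """Standard BPR link cost: t0 * (1 + tau * (flow / capacity) ** beta).

    tau = 0.15 and beta = 4 are the calibration values stated in the text.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return free_flow_time * (1.0 + tau * (flow / capacity) ** beta)
```

At zero flow the cost equals the free-flow time, at capacity it is 15% higher, and beyond capacity it grows with the fourth power of the volume-to-capacity ratio.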

2) CONTRAST METHODS
Considering that the traffic assignment process cannot be performed repeatedly in real traffic situations, many iterative methods that require reassignment are impractical. Therefore, to evaluate and compare the proposed assignment approach based on HAB-MARL, we select two agent-based methods named EB-MARL and RB-MARL, which have been suggested in [33] and briefly mentioned in Section III.C. A more detailed description of EB-MARL and RB-MARL can be found in [33].
For EB-MARL and RB-MARL, the recommended parameters have been verified to be optimal for TA in [33]. Since both methods are tested in the SF network, the best combination of parameters for EB-MARL is $\alpha^{EB} = 0.9$, $\gamma^{EB} = 0.99$, and $t^{EB}_{critical} = 9229$. Under the same experimental network, the recommended optimal parameters of RB-MARL are $\alpha^{RB} = 0.9$, $\gamma^{RB} = 0.99$, and RB precomputed = 10. We also construct two extra contrast methods by removing components of HAB-MARL. The first, named HAB-MARL-NAT for short, has no critical time advisers. The other, without critical size advisers, is called HAB-MARL-NAS. The settings of HAB-MARL-NAT and HAB-MARL-NAS follow the corresponding components of HAB-MARL (see Section IV.A).

3) SCENARIO
To verify the performance of the proposed method and the contrast methods, we designed a scenario with time-dependent flow in the SF network. In this scenario, the traffic demand of each OD pair varies over time. To describe the traffic demand of each OD pair succinctly and efficiently, the mean of a Poisson distribution is adopted to represent the corresponding fluctuating traffic demand. The scenario is visualized in FIGURE 3, and the travel demand configuration of each OD pair is listed in Table 7.
In FIGURE 3, sub-graph (a) describes the stochasticity of demand (the statistical average value) among different OD pairs. The volatility of demand (the actual value) over time for each OD pair is shown in sub-graph (b).
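A demand process of this kind can be sketched as follows: each OD pair has a mean demand, and the actual demand at each step is drawn from a Poisson distribution with that mean. The sampler uses Knuth's multiplication method from the standard library only; the OD labels, the mean values in the usage note, and the function names are made-up illustrations, not the values of Table 7.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Suitable for moderate means only, since exp(-mean) underflows for
    very large means.
    """
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_demand(mean_demand, steps, seed=0):
    """Return, per OD pair, a list of stochastic demands over the given steps."""
    rng = random.Random(seed)  # fixed seed for reproducible scenarios
    return {od: [sample_poisson(mean, rng) for _ in range(steps)]
            for od, mean in mean_demand.items()}
```

For instance, `simulate_demand({("N4", "N22"): 5.0}, steps=10)` yields ten integer demands that fluctuate around 5, mirroring the relation between sub-graphs (a) and (b).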

B. FRAMEWORK OF NUMERICAL SIMULATION
For ease of understanding, the framework of the numerical simulation constructed in this paper is described briefly in the form of pseudo-code (i.e. FRAMEWORK 1).
The parameters in FRAMEWORK 1 are described in TABLES 4 and 5. This framework can be regarded as an all-purpose platform based on the DNL model.

C. ANALYSIS AND EVALUATION OF RESULTS
In this sub-section, the effectiveness and superiority of the proposed method are demonstrated from three aspects: (a) the throughput of the network, (b) the average travel cost of each vehicle, and (c) the situation faced by agents and the result of the corresponding assignment.

1) THROUGHPUT
The throughput of the network reflects its service level. The higher the throughput, the smoother the road network and the more easily travelers can reach their destinations. The effect of each method on the throughput of the network is shown in FIGURE 4, where the suggested method clearly performs better than the other comparative approaches (see relational expressions (35)-(38)). The increasing throughput indicates that HAB-MARL continuously improves the smoothness of the network. The throughput improvement of HAB-MARL over the other methods is recorded in TABLE 6. To simplify the description and analysis, we use $OTF_X$ to denote the throughput obtained with method X. Combining FIGURE 4 and TABLE 6, the performance of the methods follows the relationships stated in expressions (35)-(38). According to these expressions, RB-MARL performs worst. This indicates that a fixed critical size (RB-MARL) results in a performance loss compared with an adaptive critical size (HAB-MARL and HAB-MARL-NAT). For RB-MARL, the loss can be attributed to flow being assigned to unsuitable paths, or to some suitable paths being used at a low utilization rate. Both situations decrease the throughput of the network.
However, the opposite effect is observed in the analogous comparison for the critical time (i.e. expression (36)). This can be attributed to inappropriate paths being used when the adaptive critical time exceeds the fixed critical time. In other words, the low efficiency arises because the adaptive critical time deviates from the optimal travel time.
In addition, advice on the critical time contributes more to the performance improvement than advice on the critical size. The significant difference in performance stems from the different extent to which the potential feasible path set is exploited when some sections of the network face high demand. The fact that potential paths are not utilized effectively indicates that the critical size constrains the strategy space of MARL more tightly than the critical time does.
This is also supported to some extent by the fact that HAB-MARL is superior to HAB-MARL-NAS: the supplementary constraint from the critical size reduces the exploration of bad paths, and the corresponding loss of performance is thus avoided.
In terms of throughput, the performance of EB-MARL is second only to, and close to, that of HAB-MARL. The fixed critical time of EB-MARL is the optimal travel time, so infeasible paths are excluded. Nevertheless, a fixed critical time is still not suitable for dynamic systems because the optimal travel time varies over time. Besides, the optimal travel time holds only for the specified scenario and needs to be re-optimized once the scenario changes. This explains the poor portability of EB-MARL.
In summary, the proposed method, HAB-MARL, is effective and achieves the highest throughput among the compared methods.

2) TRAVEL COST
In this paper, the cost of individual travel in the network is expressed as the average travel time (ATT). The lower the ATT, the more efficient the trip, i.e. the sooner travelers reach their destinations. FIGURE 5 displays the effect of each approach on the travel cost of individual travelers in the network. The ATT improvement of HAB-MARL over the other approaches is listed in TABLE 6. As in the throughput analysis, we use $ATT_X$ to denote the individual travel cost obtained with method X, and similar relational expressions can be obtained. These expressions show that the ranking of the tested methods with respect to individual travel time is consistent with that for throughput. The reasons for the differences in the reduction of individual average travel time fall into two categories: (a) excessive utilization of non-optimal paths (as in RB-MARL and HAB-MARL-NAS), and (b) idle spatiotemporal resources of feasible paths (as in EB-MARL, RB-MARL, and HAB-MARL-NAT). The detailed analysis parallels the throughput analysis (see Section V.C.1), so we do not repeat it here. To sum up, HAB-MARL performs well in terms of the individual average travel time over the whole network.
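The improvement figures of the kind reported in TABLE 6 can be computed as relative differences. The sketch below assumes one plausible definition (relative to the contrast method, with the sign chosen so that a positive value means HAB-MARL is better); the definition, the function names, and the numbers in the usage note are assumptions, not values from this paper.

```python
def throughput_improvement(otf_hab, otf_other):
    """Percentage gain in throughput of HAB-MARL over another method.

    Higher throughput is better, so the gain is (HAB - other) / other.
    """
    return 100.0 * (otf_hab - otf_other) / otf_other


def att_improvement(att_hab, att_other):
    """Percentage reduction in average travel time achieved by HAB-MARL.

    Lower ATT is better, so the gain is (other - HAB) / other.
    """
    return 100.0 * (att_other - att_hab) / att_other
```

With made-up inputs, a throughput of 110 against 100 gives an improvement of 10%, and an ATT of 0.9 h against 1.0 h likewise gives an improvement of about 10%.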
In addition to the above analysis, the influence of the suggested method on the individual average travel time of each OD pair is shown in FIGURE 6. In FIGURE 6, we mark four OD pairs (red circles) whose ATT improvement appears to be small and name them IE1 to IE4 (red font). However, their actual improvements are considerable: even the worst improvement is over 10%. Moreover, IE1 has the worst improvement among all OD pairs statistically. This indicates that HAB-MARL effectively improves the individual average travel time for every OD pair.

3) SITUATION WITH THE CORRESPONDING ASSIGNMENT RESULT
To illustrate the properties of DTA based on HAB-MARL, we select five arbitrary OD pairs in the SF network. For these OD pairs, the evolution of the effective travel time of each path and the corresponding assignment situations are shown in FIGURES 7-11.
The marks 'a' (the fluctuation range of the effective travel time is kept within 0.3194 h/veh to 2.2917 h/veh) and 'a'' (flow is reassigned from path 12 (i.e. N4→N5→N9→N10→N15→N22) to path 6 (i.e. N4→N3→N12→N13→N24→N23→N22)) in FIGURE 8(a)-(b) verify once again that HAB-MARL migrates flow between paths effectively. Additionally, the brief early-stage drop of the blue curve for path 12 of N4-N22 (i.e. N4→N5→N9→N10→N15→N22; FIGURE 8(b), from 17.67 veh/h to 3.9812 veh/h and then to 62.4286 veh/h) reflects the responsiveness of HAB-MARL to the complex and changeable external environment.
For N6-N14, the effective travel times of the paths start at low values in the early stage (between 0.4167 h/veh and 0.7917 h/veh). However, they increase rapidly in the following process (to between 5.4167 h/veh and 6.0833 h/veh). This process is reflected by the areas marked with blue dotted lines in FIGURE 9(a). In this situation, no path has a conspicuously lower effective travel time at the later stage (the deviation between the maximum and the minimum is 11.03%). In FIGURE 9(b), with the support of HAB-MARL, DTA shifts the distribution pattern of flow from an early uniform distribution among all paths to a later concentrated distribution among a few paths (path 10 (i.e. N6→N8→N7→N18→N20→N19→N15→N14) with an effective travel time of 5.4861 h/veh and path 14 (i.e. N6→N8→N9→N10→N15→N14) with an effective travel time of 5.4444 h/veh). Combining FIGURE 9(a) and 9(b), it can be seen that HAB-MARL assigns flow to the feasible paths (i.e. the paths with the lowest effective travel time) in the simulation (i.e. the processes 'b' and 'b'' marked in green in FIGURE 9(a) and 9(b)).
This phenomenon conforms to the notion of DTA. It shows that HAB-MARL can implement DTA effectively.
The situation in FIGURE 11 (the fluctuation range of the effective travel time is kept within 0.2917 h/veh to 2.4861 h/veh; flow is reassigned from path 1 (i.e. N20→N18→N7→N8→N6→N2→N1→N3→N12) and path 20 (i.e. N20→N22→N21→N24→N13→N12) to path 13 (i.e. N20→N19→N15→N10→N11→N12)) can be categorized together with those in FIGURES 7 and 8. Additionally, the emergence of this phenomenon in FIGURE 11 indicates that HAB-MARL can recognize a small change in the effective travel time of a path and adjust the assignment action immediately.
This is further evidence that HAB-MARL can accomplish DTA tasks dynamically and efficiently.
In this sub-section, all five situations above confirm that HAB-MARL complies with W-Ps when completing DTA tasks.

VI. CONCLUSION
Dynamic traffic demand that is unassignable on the time scale is the general state of the modern urban network. This actual state is usually not considered in the assumptions of existing idealized DTA methods. For DTA under this state, a decentralized method based on MARL has been presented in this article. A flat hierarchical architecture of agent-teams is employed, in which each agent-team is associated with an OD pair and interacts with the environment. To restrict the strategy space of MARL adaptively and rapidly in response to the dynamic equilibrium objective, the basic structure of an agent-team is designed as a combination of one main agent (responsible for TA) and two adviser agents (responsible for the critical time and the critical size, respectively). To achieve DTA, the decision mechanism of main agents is defined as a mixed strategy, a concept from Game Theory (GT). In this way, the experience of main agents can be updated in batches. To maintain the ability to re-learn at any time for tracking the shifting equilibrium point, learning rates are designed specifically according to the different tasks of the different agents. The global search ability of MARL on DTA tasks is thereby enhanced. With this improvement, the DTA system can respond rapidly and effectively to dynamic traffic demands that are unassignable in the temporal dimension. Thus, DTA can achieve its desired objective, namely minimizing the individual average travel time in an urban network. Meanwhile, the traffic congestion or potential congestion risk caused by stochastic traffic demand can be mitigated or eliminated at its origin.
To assess the proposed method (i.e. HAB-MARL), a comparative performance analysis is designed. In the SF network with fluctuating and unbalanced OD demands, HAB-MARL is benchmarked against two existing methods (i.e. EB-MARL and RB-MARL) and two ablated methods (i.e. HAB-MARL-NAT and HAB-MARL-NAS) in terms of the overall network throughput and the individual average travel time in the network. By visualizing the path flow distribution and the effective individual travel time from the perspective of OD pairs, the ability of HAB-MARL to quickly track variable equilibrium points in different OD pairs is further analyzed. The results show that HAB-MARL outperforms the contrast approaches in terms of both network throughput and individual average travel time. Although the performance of EB-MARL with a preset optimal value is close to that of HAB-MARL, EB-MARL lacks HAB-MARL's ability to self-learn without optimal preconditions, which gives HAB-MARL superior portability. Moreover, HAB-MARL is shown to adjust the flow distribution among paths promptly and effectively in response to the rise and fall of the effective individual travel time.
The method proposed in this paper optimizes the algorithm framework for scenarios with dynamic traffic demand that is unassignable in the time dimension. It enhances the adaptive learning ability of MARL for DTA tasks from the perspective of algorithmic theory. Hence, HAB-MARL is a basic method that can serve as a compatible framework. For example, existing path generation methods can be embedded in the search of HAB-MARL's strategy space to make the DTA system more efficient. Exploring possible extension patterns is part of our future work.
Additionally, the framework of HAB-MARL can be reorganized for other similar tasks that need to be solved by MARL with an action space restricted by cooperative constraints. This means that the architecture of HAB-MARL is general and adaptable.

APPENDIX
In the appendix, the configuration of travel demands for each Origin-Destination (OD) pair and the basic attributes of the Sioux Falls (SF) network are provided in Tables 7-9, respectively.

ZHAOTIAN PAN received the B.Sc. degree in traffic engineering from Jilin University, Changchun, China, in 2016, where he is currently pursuing the Ph.D. degree in traffic information engineering and control. His research interests include intelligent transportation systems, traffic flow theory, traffic simulation, dynamic traffic assignment, route planning, application of game theory, and multiagent systems.
ZHAOWEI QU received the Ph.D. degree in traffic information engineering and control from Jilin University, Changchun, China. He is currently a Professor and a Ph.D. Supervisor with the School of Transportation, Jilin University. His main current research interests include traffic control theory, traffic organization, traffic big data, and intelligent transportation systems.
YONGHENG CHEN received the B.S. degree in civil engineering and the Ph.D. degree in traffic information engineering and control from Jilin University, Changchun, China. He is currently an Associate Professor with the School of Transportation, Jilin University. His research interests include traffic control, traffic flow theory, and traffic organization.
HAITAO LI received the B.Sc. degree in traffic engineering from Jilin University, Changchun, China, in 2016, where he is currently pursuing the Ph.D. degree in traffic information engineering and control. His research interests include pattern recognition, traffic flow prediction, and traffic information forecasting.
XIN WANG received the B.Sc. degree in traffic engineering from Jilin University, Changchun, China, in 2016, where she is currently pursuing the Ph.D. degree in traffic information engineering and control. Her research interests include traffic organization and traffic data analysis and its applications related approach. VOLUME 8, 2020