Traffic Signal Control Under Mixed Traffic With Connected and Automated Vehicles: A Transfer-Based Deep Reinforcement Learning Approach

Backgrounds: The traffic signal control (TSC) system could be more intelligently controlled by deep reinforcement learning (DRL) and information provided by connected and automated vehicles (CAVs). However, the direct training procedure of the DRL is time-consuming and hard to converge. Methods: This study improves the training efficiency of the deep Q network (DQN) by transferring the well-trained action policy of a previous DQN model into a target model under similar traffic scenarios. Different reward parameters, exploration rates, and action step lengths are tested. The performance of the transfer-based DQN-TSC is analyzed by considering different traffic demands and market penetration rates (MPRs) of CAVs. The information level requirements of the DQN-TSC are also investigated. Results: Compared to directly trained DQN, transfer-based models could improve both the training efficiency and model performance. In high traffic scenarios with a 100% MPR of CAVs, the total waiting time, CO2 emission, and fuel consumption in the transfer-based TSC decrease about 38%, 34%, and 34% compared to pre-timed signal schemes. Also, the transfer-based TSC system requires more than 20% to 40% MPRs of CAVs under different traffic demands to perform better than pre-timed signal schemes. Conclusions: The proposed model could improve both the traffic performance of the TSC system and the training efficiency of the DQN model. The insights of this study should be helpful to planners and engineers in designing intelligent signal intersections and providing guidance for engineering applications of the DQN TSC systems.


I. INTRODUCTION
With the rapid development of learning-based artificial intelligence technologies, combining the management of transportation systems with reinforcement learning (RL) technologies provides a new potential solution to improve the efficiency, safety, and sustainability of intelligent transportation systems. Also, the emerging development of the vehicle to infrastructure (V2I) communication technology enables connected vehicles (CVs) or connected and autonomous vehicles (CAVs) to transmit real-time information on vehicles to the traffic signal control (TSC) system. All these technologies make it feasible to control an intelligent TSC system by RL technologies.
The associate editor coordinating the review of this manuscript and approving it for publication was Michail Makridis.
Several studies optimized TSC strategies by assuming a 100% market penetration rate (MPR) of CAVs so that the TSC system could obtain information from all vehicles [1], [2]. The MPR of CAVs determines the information level of vehicles that can be obtained by the TSC system. There is still a long transition time to achieve high MPRs of CAVs [3]. Hence, it remains an open question what MPR of CAVs is sufficient to train a relatively good RL-controlled TSC system. To investigate the validity and information-level requirement of the RL-controlled TSC system, it is important to study the impacts of different MPRs of CAVs on the RL-based TSC system.
Moreover, recent studies combined RL with the deep learning methods to approximate highly nonlinear functions from complex datasets, and this deep reinforcement learning (DRL) framework could provide a better TSC performance compared to RL-only methods [4]. However, the DRL-controlled TSC system still has many shortcomings. First, the training procedure of the DRL-controlled TSC system takes a long time to converge [5]. Secondly, the DRL training procedure requires lots of samples. Thirdly, the traffic flow at the intersection rapidly changes across time and space. It is extremely hard to train a model that could accommodate several traffic scenarios for real-world applications. When implementing this method in city networks with many intersections, training a specific DRL model for each intersection is extremely cumbersome. Thus, reusing or adjusting previous models under similar traffic scenarios provides a potential and feasible solution for engineering applications.
Currently, transfer learning enables the reuse of a previously trained action policy developed from a similar task to initialize the learning of a target task, and it is expected to improve the training efficiency, sample efficiency, and training performance [5], [6]. It is expected that a model trained with a higher information level could also obtain a better solution compared to models trained with only partial information [7]. Several studies found that the proposed intersection control system performs better than traditional signal schemes after certain MPRs of CAVs are reached [8]-[10]. When transferring a prior model with a higher information level into a scenario with a lower information level, the transfer-based DRL TSC is expected to outperform the directly trained model as it could take a better action, which is given by the prior policy of the pre-trained model. Also, the training efficiency is expected to be improved by using the transferred prior policy as a starting point in the target task. As few research efforts have been made on transfer-based DRL TSC systems, it is meaningful to reuse pre-trained models and test the model performances in scenarios with similar traffic demands and/or MPRs of CAVs.
This paper aims to explore the performance and the transferability of the transfer-based DRL technologies in TSC systems and to bridge the research gap in terms of training efficiency and validity of the transfer-based DRL model. The transfer-based DRL TSC is tested at an isolated intersection with different traffic demands and MPRs of CAVs. The rest of the paper is organized as follows: Section II summarizes the state-of-the-art DRL TSC and studies on the impact of mixed traffic in the intersection control system. Section III describes the methodology of the transfer-based DRL TSC system. Section IV introduces the simulation scenarios and model settings. Section V presents the results and findings. Finally, the article is summarized in Section VI.

II. LITERATURE REVIEW

A. REVIEW OF REINFORCEMENT LEARNING FOR TRAFFIC SIGNAL CONTROL
Optimizing traffic signal control with a reinforcement learning method has received great attention in previous studies [2], [11], [12]. The TSC agent is trained to learn an optimum policy for developing the signal phase or timing plan based on the information gathered from the traffic environment. With regard to the number of RL agents, these studies could be classified into centralized TSC with a single-agent RL (for an isolated intersection or the entire intersection network) or decentralized TSC with multi-agent RL (for a network of intersections). The state of vehicles (numbers, locations, speeds, or other traffic performance criteria) is usually represented in an image-like format (i.e., discrete traffic state encoding) or as feature-based state vectors [2]. The actions are commonly defined as binary action sets (whether or not to prolong the green time) or multi-phase sets (usually four or eight green phases). Due to the large scale of the state and action representation, many recent TSC studies employed deep learning (neural networks) to approximate Q-values, which are returns for taking an action A at a state S [7]. Based on the target estimated by the deep learning, deep reinforcement learning (DRL) could be classified into value-based methods (estimating the Q value), policy-based methods (estimating the action policy probability), and state-value-based methods (estimating both the Q value and the action policy probability, such as the advantage actor-critic (A2C) framework). Table 1 summarizes several deep reinforcement learning studies for traffic signal control systems. One of the earliest neural-network-based RL models for TSC was proposed in [13]. However, it differed from the typical deep Q network (DQN) algorithm because it lacked experience replay and a target network. After that, Genders and Razavi [14] implemented a convolutional neural network (CNN) to approximate the Q values for a single intersection with four green phases.
The simulation in SUMO showed a better result compared to that using a single-layer neural network Q-learning approach. Wei et al. [15] introduced a DQN-based TSC, called IntelliLight, and utilized a CNN to extract traffic features from real-world camera data collected in China. IntelliLight was also selected as a benchmark in [5], which introduced a transfer learning framework with source task selection and batch learning. Results based on the real-world data from China indicated a quicker model convergence and better traffic performance compared to non-transfer models. Shi and Chen [4] also utilized transfer learning to speed up the training procedure of multi-agent DRL TSC with long short-term memory (LSTM) layers (a type of recurrent neural network, RNN) for Q-value approximation. The results on 2-by-2 grids of intersections indicated lower average delay compared to Q-learning and fixed-time signals under both low and high traffic demands. Moreover, Zhang et al. [7] trained a DQN for TSC with partial detection of vehicles. Results indicated that the DQN-controlled TSC could efficiently reduce the average waiting time even with a low detection rate.

B. REVIEW OF STUDIES ON MIXED TRAFFIC AT THE INTERSECTION
As shown in Table 2, most simulation-based studies indicated a positive effect of the mixed traffic flow of human driving vehicles (HDVs) co-existing with CVs/AVs/CAVs at the intersection. Shladover et al. [19] found that a 40% MPR of the Cooperative Adaptive Cruise Control (CACC) vehicle is a critical threshold to achieve a 10% improvement of the highway capacity based on field experiment data. Yang et al. [20] also found that a 50% information level for CVs could significantly decrease the delay and stops by maximizing the speed entering the intersection. However, several studies also found that the intersection performance only improved after certain MPRs of CVs/CAVs are reached [10], [21], [22]. Moreover, several studies found that models trained with only partial information and the interaction between CAVs and HDVs could result in a negative impact on the intersection system performance [8], [9]. For TSC with DRL technologies, the MPR of the CAVs determines the information levels of the training inputs for the DRL system. According to the above studies, it is important to study the impacts of different MPRs of CAVs as they have a high potential to impact the performance, validity, and transferability of the transfer-based DQN TSC system.
In summary, the DRL-controlled TSC systems could have a better performance than traditional signal schemes. However, the training procedure of the DRL-controlled TSC system takes a long time and requires a lot of samples to accommodate different traffic conditions. Hence, reusing or modifying previous models under similar traffic scenarios provides a potential solution to improve the training efficiency and performance. Moreover, to bridge the research gap in terms of training efficiency and validity of the transfer-based DRL model, the impacts of different MPRs of CAVs on the DRL-TSC system still need to be further investigated.

III. METHODOLOGY
In this study, the traffic light at an isolated intersection is controlled by a single-agent DRL that interacts with the simulation environment. With the V2I technology, the TSC agent could choose an action $a_t$ based on the state $s_t$ and reward $r_t$ transmitted by the CAVs in the timestep $t$. In this case, the information levels of vehicles within the simulation system are determined by the MPRs of CAVs. The Deep Q Network (DQN), which is a benchmark DRL method, is implemented in this research to train the TSC system. The action set $A_t$ includes green phases for traffic movements. The state $s_t$ is the traffic volume/state in each inlet segment of the intersection and is transmitted by the V2I technology of CAVs. The detailed definitions and settings for the action and state are given in the empirical settings part. The framework of the transfer-based DRL TSC system is shown in Fig. 1.
The reward $r_t$ denotes the feedback after the agent chooses an action $a_t$. Several traffic performance criteria are utilized as the reward in TSC systems, such as the queue length, throughput, and total waiting time [2], [12]. The total waiting time is the sum of the time for vehicles when the vehicle speed is less than 0.1 m/s. In comparison to the queue length and the throughput, the total waiting time considers both traffic volume and stopping time. Hence, the total waiting time is selected to describe the reward in this paper. Also, according to [12], a hyperparameter is added in the reward function to improve the training efficiency. The reward function is defined as:
$$r_t = \delta \cdot \mathit{twt}_{t-1} - \mathit{twt}_t \tag{1}$$
where $\mathit{twt}_t$ denotes the total waiting time at the time step $t$, and $\delta$ ($\delta \le 1$) could increase the magnitude of the reward value and is supposed to improve the training efficiency. When $\delta = 1$, the reward function changes to a commonly used reward function. A positive reward $r_t$ denotes a better performance as the current action decreases $\mathit{twt}_t$.
For value-based RL, Q-learning is a benchmark model-free reinforcement learning technology [2]. The Q value denotes all rewards the agent could obtain when taking an action $a_t$ in state $s_t$, and it could be approximated by selecting the action $a_{t+1}$ that obtains the maximum Q value $Q'$:
$$Q(s_t, a_t) = r_{t+1} + \gamma \cdot \max_{a_{t+1}} Q'(s_{t+1}, a_{t+1}) \tag{2}$$
where $Q'(s_{t+1}, a_{t+1})$ is the Q value for taking an action $a_{t+1}$ in the state $s_{t+1}$, and $\gamma$ is the discount rate that penalizes the future reward relative to the immediate reward $r_{t+1}$. $\gamma$ is set as 0.25 according to the test results in [12].
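The reward and the Q-value target described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the reward takes the form $r_t = \delta \cdot \mathit{twt}_{t-1} - \mathit{twt}_t$ and the target takes the form $r_{t+1} + \gamma \cdot \max Q'(s_{t+1}, a_{t+1})$, and the function names `reward` and `q_target` are hypothetical.

```python
def reward(twt_prev, twt_now, delta=1.0):
    """Assumed reward form: delta * twt_{t-1} - twt_t.

    The value is positive when the total waiting time decreased
    after the last action (for delta = 1).
    """
    return delta * twt_prev - twt_now


def q_target(r_next, next_q_values, gamma=0.25):
    """Assumed Q-learning target: r_{t+1} + gamma * max_a Q'(s_{t+1}, a).

    next_q_values holds one predicted Q' value per candidate action.
    gamma = 0.25 is the discount rate reported in the paper.
    """
    return r_next + gamma * max(next_q_values)


if __name__ == "__main__":
    r = reward(twt_prev=120.0, twt_now=100.0)    # waiting time dropped by 20 s
    print(r, q_target(r, [1.0, 4.0, -2.0]))
```

With $\delta < 1$ the same drop in waiting time yields a smaller (or more negative) reward, which is the magnitude effect the text attributes to the hyperparameter.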
With the help of deep learning, a deep neural network is implemented to approximate the Q value. Experience replay is implemented to store and extract a batch of samples from the replay memory database. The random selection of the samples could mitigate correlations in samples and improve the utilization rate of the samples. As shown in Fig. 1, the deep Q network (DQN) contains two neural networks to improve the stability of the training results. The $Q'$ value is the predicted value from the neural network (NN) based on a given input sample. In the DQN framework, two NNs are used for Q value prediction (one from the base NN model and the other from the target NN model). The difference between those two $Q'$ values composes a loss function, which is used to update the weights of the base NN. After each episode of updating the base NN, the weights of the base NN are copied to the target NN (i.e., the synchronizing process).
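The experience replay step can be illustrated with a small buffer that stores transitions and returns a random batch. This is a sketch under stated assumptions: the class name `ReplayMemory`, the tuple layout `(state, action, reward, next_state)`, and the fixed capacity are illustrative choices that the paper does not specify.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) silently drops the oldest sample when full
        self._samples = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._samples.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal correlation
        # between consecutive transitions.
        n = min(batch_size, len(self._samples))
        return self._rng.sample(list(self._samples), n)

    def __len__(self):
        return len(self._samples)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=3, seed=0)
    for i in range(5):
        memory.add(state=i, action=0, reward=-1.0, next_state=i + 1)
    print(len(memory), memory.sample(2))
```

Each training iteration would draw one such batch (100 samples in the paper's settings) and fit the base network on the corresponding targets.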
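The loss function $L(w)$ is defined as the squared error between the $Q'$ value predicted by the base NN and the target value obtained from the target NN:
$$L(w) = \left( r_{t+1} + \gamma \cdot \max_{a_{t+1}} Q'_{\text{target}}(s_{t+1}, a_{t+1}) - Q'_{\text{base}}(s_t, a_t; w) \right)^2 \tag{3}$$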
To minimize the loss function $L(w)$, Adaptive Moment Estimation (Adam), a stochastic gradient descent method, is implemented. The weights $w$ in the neural network are updated with the learning rate $\alpha$ as follows (written in the basic gradient form; Adam additionally rescales each step with moment estimates of the gradient):
$$w \leftarrow w - \alpha \cdot \nabla_w L(w) \tag{4}$$
Moreover, the epsilon-greedy method is used to explore possible actions at the beginning of the training stages. The agent randomly chooses an action with a probability of $\varepsilon_h$. Otherwise, the agent chooses the action $a_{t+1}$ that obtains the maximum Q value predicted from the training neural network:
$$a_{t+1} = \arg\max_{a} Q'(s_{t+1}, a) \tag{5}$$
$$\varepsilon_h = 1 - \frac{h}{H} \tag{6}$$
where $h$ is the current episode number and $H$ is the total number of simulation episodes.
For two similar traffic scenarios, the trained policy for the action selection in one model is supposed to be useful and could be treated as an initial policy for another model [5], [6]. Transfer learning enables the reuse of a previously trained model between similar tasks. As the training procedure of the DRL is cumbersome and time-consuming, it is expected that transfer learning could improve the training efficiency and performance (when transferring models with higher information levels of the vehicles). In this paper, the neural network weights $w$ in a prior task are transferred into a target task in scenarios with similar traffic demands or traffic information levels (determined by the market penetration rates of CAVs). The detailed algorithms of the DQN with the experience replay and transfer procedure are shown in Table 3.
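The epsilon-greedy rule can be sketched as follows, assuming the exploration rate decays linearly with the episode number, $\varepsilon_h = 1 - h/H$, which matches the statement elsewhere in the paper that $\varepsilon$ changes from 1 to 0. The function names are hypothetical.

```python
import random


def exploration_rate(episode, total_episodes):
    """Assumed linear decay: epsilon_h = 1 - h / H."""
    return 1.0 - episode / total_episodes


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values holds one predicted Q' value per green phase.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


if __name__ == "__main__":
    H = 800  # total number of episodes used for direct training in the paper
    for h in (0, 400, 800):
        print(h, exploration_rate(h, H))
    print(choose_action([0.1, 0.9, 0.3], epsilon=0.0))
```

A transfer-based run could start from a smaller initial $\varepsilon$, since the transferred weights already encode a usable policy; the paper tests this in its exploration-rate experiment.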
In this paper, the total waiting time, CO2 emission, and fuel consumption are utilized to investigate the traffic performance of the intersection system. All three criteria are retrieved from the SUMO software. The waiting time of a vehicle/lane is calculated by accumulating the time during which the vehicle speed is below 0.1 m/s. The waiting time is reset to 0 after the vehicle moves. The emission and fuel consumption models of the gasoline-driven passenger car (Euro norm 4) are developed and calculated by the HBEFA3 (version 3.1). The details of the calculation procedure and emission/fuel consumption factors can be found in the HBEFA3 documentation [23].
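The waiting-time rule described above (accumulate while the speed is below 0.1 m/s, reset once the vehicle moves) can be written as a small update function. This is an illustrative sketch, not SUMO's internal code; the function names and the 1-s step used in the example are assumptions chosen to keep the arithmetic exact.

```python
WAITING_SPEED_THRESHOLD = 0.1  # m/s, threshold stated in the paper


def update_waiting_time(current_wait, speed, step_length):
    """Accumulate waiting time while stopped; reset to 0 once moving."""
    if speed < WAITING_SPEED_THRESHOLD:
        return current_wait + step_length
    return 0.0


def waiting_time_trace(speeds, step_length=1.0):
    """Return the waiting time after each simulation step."""
    wait = 0.0
    trace = []
    for v in speeds:
        wait = update_waiting_time(wait, v, step_length)
        trace.append(wait)
    return trace


if __name__ == "__main__":
    # Vehicle approaches, stops for three steps, moves off, stops again.
    print(waiting_time_trace([5.0, 0.05, 0.0, 0.0, 2.0, 0.0]))
```

The total waiting time used in the reward would then be the sum of this per-vehicle value over the vehicles the controller can observe.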

IV. EMPIRICAL SETTINGS

A. SIMULATION SCENARIOS
A typical four-way intersection with four lanes per approach is selected for the simulation. As shown in Fig. 2, the vehicle-based state array (the number of vehicles in each grid/segment) is determined by the discrete traffic state encoding (DTSE) method and is set as the input for the DQN model. Eight green phases are set as possible actions for the intersection. A 4-s yellow and all-red interval is added if the traffic light changes its phase. The speed limit is set at 35 mph (i.e., 15.6 m/s). The peak-hour traffic, which is the main reason for the congestion at the intersection, is generated according to a Weibull distribution with a shape parameter equal to 2. Meanwhile, the random seed, which equals the episode number, is utilized to generate heterogeneous traffic for each training episode. The saturated traffic demand of the intersection is determined by the simulated maximum throughput under a pre-timed signal scheme (100-s cycle length, two 30-s phases for the through and right-turn traffic, two 14-s phases for the left-turn traffic). As shown in Fig. 3, with the increase of the traffic demand, the maximum throughput of the intersection increases to 4800 vehicles/hour in the simulation, and this saturated traffic is set as the high traffic demand scenario. The low, medium, and medium-high traffic demands are set at 20%, 40%, and 60% of the high traffic demand, respectively. The detailed traffic demand for each movement is presented in Table 4.
All simulation scenarios are processed in the Simulation of Urban MObility (SUMO) by the TraCI-Python interface. Each training episode of the simulation is set as 3600 s with a 0.1-s time step to accommodate the distribution of the peak-hour traffic volume. The Intelligent Driver Model (IDM) is implemented for human driving vehicles (HDVs) according to [24]. The IDM has a simple model structure, accident-free logic, and a continuous acceleration control function that can be used to describe the longitudinal movements of the HDVs/AVs [25]. A recent study by Adil [26] also indicated that the IDM could have a better speed and acceleration control accuracy compared to the Krauss model or the Wiedemann 99 model (a default car-following model adopted in VISSIM software). The Cooperative Adaptive Cruise Control (CACC) system is utilized for CAV simulation according to previous research [27], [28]. The default lane change model "LC2013" in SUMO is employed for all vehicles. Both HDVs and CAVs are assumed to have the same ability for acceleration (2 m/s^2) and deceleration (−4 m/s^2). The desired headways for HDVs and CAVs are 1.6 s and 0.7 s, respectively. To model heterogeneous driving behaviors of the human drivers, the maximum speed for the HDV follows a normal distribution N(1.2, 0.1) with respect to the speed limit. Also, other parameters for CACC-controlled CAVs are set according to previous research [28]-[31].
In this study, all vehicles are set as CAVs at first. A direct training procedure with 800 episodes is employed under the low traffic demand scenario. Then, the trained model is transferred to the next scenario with a higher traffic demand (from low to medium, medium to medium-high, and medium-high to high). After that, this paper tests the validity of the transfer-based DQN signal system by considering different information levels of the vehicles. For scenarios with the same traffic demand, the MPR of CAVs decreases from 100% to 20% by 20% per step. The trained model with the higher MPR of CAVs is transferred into the subsequent scenario with the lower MPR of CAVs.
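The demand levels and the Weibull-based arrival pattern can be sketched with the Python standard library. The paper states only the shape parameter (2), the episode-based seed, and the demand percentages; the min-max rescaling of the draws onto the 3600-s episode and all names below are assumptions made for illustration.

```python
import random

HIGH_DEMAND_VPH = 4800  # saturated throughput reported in the paper
DEMAND_SHARE_PERCENT = {"low": 20, "medium": 40, "medium-high": 60, "high": 100}


def vehicles_per_hour(level):
    """Hourly demand for a named demand level."""
    return HIGH_DEMAND_VPH * DEMAND_SHARE_PERCENT[level] // 100


def generate_departure_times(n_vehicles, episode, horizon_s=3600.0, shape=2.0):
    """Draw Weibull(shape) samples and rescale them onto [0, horizon_s].

    The rescaling is an assumption; the seed equals the episode number so
    each episode sees different but reproducible traffic.
    """
    if n_vehicles <= 0:
        return []
    rng = random.Random(episode)
    draws = sorted(rng.weibullvariate(1.0, shape) for _ in range(n_vehicles))
    lo, hi = draws[0], draws[-1]
    span = (hi - lo) or 1.0  # avoid division by zero for a single vehicle
    return [horizon_s * (d - lo) / span for d in draws]


if __name__ == "__main__":
    n = vehicles_per_hour("low")
    times = generate_departure_times(n, episode=1)
    print(n, round(times[n // 2], 1))
```

A shape of 2 concentrates departures around a peak inside the hour instead of spreading them uniformly, which mimics peak-hour build-up and dissipation.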
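The transfer order described above can be enumerated as a list of (source, target) scenario pairs. This is a simplified sketch: the scenario labels and the function name are hypothetical, and it ignores the paper's remark that some low-MPR scenarios were trained directly.

```python
DEMANDS = ["low", "medium", "medium-high", "high"]
MPRS = [100, 80, 60, 40, 20]  # percent of CAVs, decreasing by 20 per step


def transfer_schedule(demands=DEMANDS, mprs=MPRS):
    """Return (source, target) pairs; source None means direct training."""
    schedule = []
    previous = None
    # Stage 1: at the highest MPR, transfer along increasing traffic demand.
    for demand in demands:
        target = (demand, mprs[0])
        schedule.append((previous, target))
        previous = target
    # Stage 2: within each demand level, transfer along decreasing MPR.
    for demand in demands:
        for higher, lower in zip(mprs, mprs[1:]):
            schedule.append(((demand, higher), (demand, lower)))
    return schedule


if __name__ == "__main__":
    for source, target in transfer_schedule()[:6]:
        print(source, "->", target)
```

In each pair, the target model would be initialized with the trained network weights of the source model before training continues in the target scenario.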

B. MODEL SETTINGS
A medium-sized fully connected neural network (4 hidden layers with 400 neurons per layer) is implemented in this study. This size of the NN is recommended as it could obtain good training results and save a lot of training time according to [12]. The NN is built in TensorFlow 2.0. The Rectified Linear Unit (ReLU) activation function is implemented for all hidden layers and the linear activation function is used for the output layer. The Adam optimization algorithm is implemented for training NN models. The discount factor of the Q-learning equation is set at 0.25. The training iterations of the neural network weights execute 800 times with a 0.01 learning rate, and each iteration retrieves 100 samples from the experience replay memory [12]. The total number of training iterations for neural networks in one simulation episode is determined by the action step length and total simulation steps. The test results are output after the convergence of the cumulative reward values. The following parts test the reward function parameter and the action step length.
As introduced in [12], the revision of the parameter (δ = 0.9) in the reward function could increase the magnitude of the reward value and improve the training efficiency. As shown in Fig. 4, this paper compares the results between the general reward parameter (δ = 1) and the revised reward parameter (δ = 0.9) for the transfer-based DQN procedure under the scenario of medium traffic and a 100% MPR of CAVs. The reward curves indicate that the revised reward parameter (δ = 0.9) does not always improve the training efficiency and could result in more variations in action choices. Hence, the general reward function (δ = 1) is utilized in this paper.
The ε-greedy exploration rate introduced in (6) is utilized to strike a balance between the exploration and exploitation of the actions. In general, the DQN training procedure is expected to explore more possible actions at the beginning and then exploit more when the action policy is well trained. As the transfer-based learning procedure could obtain a prior action policy from previous scenarios, the training procedure might reach the converged value without exploring all possible actions. To confirm this assumption, different ε-greedy exploration rates are tested, and the results are illustrated in Fig. 5. It is found that even without full exploration (ε changing from 1 to 0), the transfer-based models could obtain a similar stable reward, which indicates the validity of transferring models from similar scenarios.
When the current action $a_t$ is different from the previous action $a_{t-1}$, a phase that includes 3-s yellow and 1-s all-red time is added. If the agent selects the same action (green phase) and that green phase exceeds a maximum cumulative green time (60 s in this paper), the agent stops the current phase and changes to the next green phase. As shown in Table 5, the model performances with different action step lengths (green time durations) are tested under a low traffic demand scenario. The start value of the action step length is set as the minimum green time (5 s). It is noted that a significant increase in the total waiting time, CO2 emission, and fuel consumption is observed when the green time exceeds 10 s. Also, the frequent change of the green phase would add more red/yellow time to the total time, and this would result in more green time loss. Hence, this paper sets a 10-s green time for each action and 60 s for the maximum green time duration.
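The phase-switching rule can be sketched as a single step function. Several details are assumptions made for illustration: the "next green phase" is taken to be the next phase index in cyclic order, the maximum green check is applied before extending the current phase, and the names are hypothetical.

```python
YELLOW_S = 3        # yellow time added on a phase change
ALL_RED_S = 1       # all-red time added on a phase change
GREEN_STEP_S = 10   # green time granted per action
MAX_GREEN_S = 60    # maximum cumulative green time of one phase
N_PHASES = 8        # eight green phases at the intersection


def apply_action(prev_phase, green_elapsed, action):
    """Return (phase, cumulative_green, seconds_consumed) for one action."""
    # Forced switch: extending the current phase would exceed the maximum.
    if action == prev_phase and green_elapsed + GREEN_STEP_S > MAX_GREEN_S:
        action = (prev_phase + 1) % N_PHASES  # assumed cyclic order
    if action != prev_phase:
        # Phase change: insert yellow and all-red, then start a new green.
        return action, GREEN_STEP_S, YELLOW_S + ALL_RED_S + GREEN_STEP_S
    # Same phase: simply extend the green.
    return action, green_elapsed + GREEN_STEP_S, GREEN_STEP_S


if __name__ == "__main__":
    print(apply_action(prev_phase=0, green_elapsed=10, action=0))
    print(apply_action(prev_phase=0, green_elapsed=10, action=3))
    print(apply_action(prev_phase=0, green_elapsed=60, action=0))
```

Every phase change costs 4 s without green, which is why shorter action steps with frequent switching lose more green time, as the text notes.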

V. RESULTS AND DISCUSSIONS

A. COMPARISON BETWEEN DIRECT AND TRANSFER-BASED LEARNING
To test the efficiency of the transfer-based DQN approach, a comparison between direct training and transfer-based training with full exploration (ε-greedy from 1 to 0) is made under a scenario with medium traffic and a 100% MPR of CAVs. The cumulative negative reward curves in Fig. 6 demonstrate that the transfer-based method reaches the stable maximum value with fewer training episodes than the direct training procedure. This result further supports that the prior action policy (neural network weights) provided by the previous model could be utilized in target models under similar traffic scenarios and promote the training efficiency with fewer adjustments of the pre-trained model. It is noted that the direct training procedure for an intersection with different traffic demands is very time-consuming. For example, in Fig. 6, the direct training and the transfer-based training take about 54.2 hours and 20.1 hours, respectively, on a computer with a GTX-1050 GPU (for neural network training) and an i5-7300 2.5 GHz CPU. The significant decrease in the training time suggests a possible engineering application of the transfer-based DQN TSC system at intersections with similar traffic demands.

B. IMPACTS UNDER DIFFERENT TRAFFIC DEMANDS AND MPRS OF CAVs
With the V2I communication technology, the TSC system could obtain state information (i.e., traffic volume, speed, waiting time, etc.) from the CAVs approaching the intersection. However, a long transition period is expected during which human driving vehicles and intelligent vehicles will coexist [3]. This paper also tests the impacts of information levels of the mixed traffic on the transfer-based DQN TSC system. Fig. 7 presents cumulative negative reward curves for scenarios with different traffic demands and MPRs of CAVs. First, with a 100% MPR of CAVs, the NN weights of the trained DQN model are transferred from scenarios with lower traffic demands to scenarios with higher traffic demands (i.e., from low to medium, medium to medium-high, and medium-high to high). After that, for scenarios with the same traffic demands, the impacts of information levels of the vehicles on the DQN TSC system are investigated by transferring models with high MPRs of CAVs into models with low MPRs of CAVs (decreasing from 100% to 20% by 20% per step). For example, the direct training model for the scenario with a 100% MPR of CAVs is transferred to the scenario with an 80% MPR of CAVs. It is also noted that direct training procedures are utilized in some low-MPR scenarios to obtain more stable reward values at the end of the training procedure.
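How the MPR limits the information level can be illustrated by building the state array from CAVs only. This is a hedged sketch: the per-vehicle random CAV assignment, the cell indexing, and the function names are assumptions for illustration, not the paper's encoding.

```python
import random


def assign_cav_flags(n_vehicles, mpr, seed=0):
    """Mark each vehicle as a CAV with probability mpr (0.0 to 1.0)."""
    rng = random.Random(seed)
    return [rng.random() < mpr for _ in range(n_vehicles)]


def observed_state(vehicle_cells, is_cav, n_cells):
    """Count vehicles per cell, using only vehicles that report via V2I.

    vehicle_cells[i] is the index of the grid cell occupied by vehicle i.
    HDVs (is_cav[i] is False) are invisible to the signal controller.
    """
    state = [0] * n_cells
    for cell, cav in zip(vehicle_cells, is_cav):
        if cav:
            state[cell] += 1
    return state


if __name__ == "__main__":
    cells = [0, 0, 2, 3]
    print(observed_state(cells, [True, False, True, False], n_cells=4))
    print(observed_state(cells, assign_cav_flags(4, mpr=1.0), n_cells=4))
```

At a low MPR the observed counts, and the waiting-time rewards derived from them, understate the true traffic, which is consistent with the lower cumulative reward magnitudes reported for low-MPR scenarios.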
An interesting finding is that the reward values of the transfer-based model overlap with the reward values in models with higher MPRs of CAVs (i.e., higher information level). For example, in Fig. 7(b), the transfer-based curve with 60% and 80% MPRs of CAVs overlaps with the directly trained curve with 20% and 40% MPRs of CAVs. This overlapping is scarcely observed in directly trained models as the TSC system is trained based on partial traffic information. For scenarios with lower MPRs of CAVs, the cumulative negative reward values are also lower because some rewards from HDVs are missing (the TSC could not obtain state values from HDVs). In this case, the final stable reward value would not overlap with others in directly trained models. The overlapping indicates that transfer-based models could obtain larger reward values than the directly trained models. The traffic light controller could not select a better choice if the system only gets limited/biased information on the vehicle states and system rewards. Table 6 also indicates that, in the same scenarios, the transfer-based model could improve the traffic performance compared to the directly trained models. Table 7 to Table 9 present the test performance (total waiting time, total CO2 emission, and total fuel consumption) of the proposed DQN TSC under different traffic demands and MPRs of the CAVs. Compared to the scenario with fixed signal schemes, a decrease in the total waiting time, CO2 emission, and fuel consumption could be observed in scenarios with more than a 40% MPR of CAVs. Meanwhile, a decrease in indicator values (i.e., better system performance) could be observed with the increase of MPRs of CAVs. The DQN-controlled signal system could get worse traffic performance with a 20% MPR of CAVs under low-, medium-, and high-traffic scenarios. Also, the performance indicator values decrease significantly when the MPRs of the CAVs increase from 20% to 40%.
These results indicate that the proposed transfer-based DQN signal controller needs a certain information level of the vehicles, and the critical value of the information level is between 20% and 40% depending on the traffic demand. Moreover, with a 100% MPR of CAVs in a medium traffic demand scenario, the DQN TSC system indicates a decrease of 58% of the total waiting time, which is the best performance in total waiting time. For scenarios with high traffic demand, fixed signal schemes indicate significant congestion as all performance values almost doubled compared to medium-high traffic scenarios. However, for DQN TSC with a 100% MPR of CAVs, the total waiting time, CO2 emission, and fuel consumption still decrease by about 38%, 34%, and 34%, respectively.

VI. CONCLUSION
This paper presents a transfer-based DQN traffic light control system to improve the training efficiency of the deep reinforcement learning procedure. Different model settings (reward parameter, exploration rate, and action step length) are tested and discussed. Different traffic demands are determined according to the simulated maximum throughput of the intersection. The impacts of traffic demands and information levels of the vehicles on the transfer-based model are investigated. The trained DQN models are first transferred from scenarios with low traffic demands into scenarios with higher traffic demands (from low to medium, medium to medium-high, and medium-high to high). For scenarios with the same traffic demand, models are then transferred from scenarios with high MPRs of CAVs into scenarios with low MPRs of CAVs (decreasing from 100% to 20% by 20% per step).
The comparison between the transfer-based training procedure and the direct training procedure indicates that the prior action policy of the DQN TSC model could be utilized in models with similar traffic demands or information levels of vehicles. The training efficiency is improved significantly in transfer-based models. Also, this paper tests the validity of the transfer-based DQN method by considering different information levels of vehicles. In this paper, the information level is determined by the MPR of CAVs and is transmitted to the TSC system. With the increase of MPRs of CAVs, a decrease in the total waiting time, CO2 emission, and fuel consumption could be observed in transfer-based DQN TSC systems. Compared with pre-timed signal schemes, the transfer-based DQN TSC systems perform better when the MPRs of CAVs are more than 20% under the medium-high traffic scenario and more than 40% under low, medium, and high traffic scenarios. Moreover, the transfer-based models could choose actions given by previous models with a higher information level. Hence, the transfer-based model could choose actions with better performance than the model directly trained at the same information level.
The good performances in efficiency, validity, and transferability of the transfer-based DQN TSC method indicate a possible engineering application of this method in scenarios with similar traffic demands or information levels. With the rapid development of vehicles with V2I communication technologies, the information level requirement (between 20% and 40%) for this transfer-based DQN TSC system is expected to be met in the near future. These findings should be valuable to transportation researchers, decision-makers, and traffic engineers to improve intersection efficiency, design intelligent intersections, promote the technologies of V2I, and implement DQN-controlled traffic signals. Please note that adjacent intersections are more commonly seen in corridors or road networks. The coordination between adjacent intersection signals requires a multiagent DRL framework which is different from the single DQN framework used for the isolated intersection. Future research efforts could be focused on modeling multiagent signal controllers for adjacent intersections or intersection networks. Also, the size of the neural networks of the DQN model deserves further investigation.