Intelligent Resource Allocation for Train-to-Train Communication: A Multi-Agent Deep Reinforcement Learning Approach

The application of train-to-train (T2T) communication in urban rail transit is expected to simplify system structure, reduce maintenance costs, and improve operational efficiency. In particular, train-to-wayside (T2W) communication coexists with T2T communication in a train control system based on T2T communication. To make full use of limited spectrum resources, frequency reuse is adopted as an efficient technique, but it introduces co-channel interference, which affects the quality of service (QoS) for T2T and T2W users. In this paper, we propose a multi-agent deep reinforcement learning (MADRL) based autonomous channel selection and transmission power selection algorithm for T2T communication to reduce the co-channel interference. Specifically, each agent interacts with the environment and selects actions independently to implement a distributed resource allocation mechanism, adopting asynchronous updates to avoid different agents choosing the same sub-band. Simulation results show the superiority of our proposed algorithm: compared with the existing resource allocation schemes for T2T communication, the system throughput and the successful transmission probability of T2T links are greatly improved.


I. INTRODUCTION
With the continuous expansion of urban scale and the increasing pressure on rail transit, efficient and safe rail transit is highly valued [1]. In the past decade, the communication-based train control (CBTC) system has been widely used for its punctuality and higher operational efficiency [2], [3]. However, key functions such as train routing and safety protection are based on a bidirectional train-to-wayside (T2W) communication structure in CBTC systems, which brings many problems such as multiple configuration equipment and a complicated system structure [4], [5]. Reliable direct train-to-train (T2T) communication can significantly improve efficiency [...] T2T communication, and the authors studied the alignment of narrow beams between trains in turning scenes. The authors of [17] studied the switch control function of the CBTC system based on T2T communication.
Although the T2T communication based CBTC system has many advantages, wayside equipment is still necessary in the system. While two adjacent trains acquire each other's position and status information through the T2T link, each train also needs to communicate with the wayside equipment. By multiplexing the frequency resources of the T2W uplink in the T2T link, spectrum utilization can be improved effectively. However, this also produces co-channel interference in the system. Therefore, an effective resource allocation scheme is required to manage the interference [18]-[20]. The authors of [21] proposed a bio-inspired algorithm to achieve distributed channel allocation, which could effectively increase system throughput and reduce communication delay. To improve channel utilization and system performance, the authors of [22], [23] proposed a novel distributed channel allocation algorithm and an evolutionary scheme (named E-MAC) to achieve collision-free transmissions. The authors of [24] proposed a power control algorithm based on statistical features, which could reduce the average D2D transmit power and increase the energy efficiency of D2D communications in cellular networks. The authors of [25] designed a mean-field game (MFG) theoretic framework and achieved a novel distributed power control scheme within the MFG framework. Note that none of the above works involved machine learning algorithms.
Reinforcement learning (RL) based resource allocation schemes have been widely applied to device-to-device (D2D) communication. In [26], a Q-learning based power control algorithm was proposed which decorrelated the actions selected by users and expanded the solution space, and it achieved a higher quality of service (QoS) than schemes based on correlated Q-learning. In [27], two RL based power control methods were proposed, i.e., a centralized method and a distributed method. The simulation results showed that the distributed method had better system performance. In [28], a distributed learning based spectrum allocation scheme was proposed, which could maximize system throughput and spectral efficiency. However, in the above schemes, power control and channel selection were realized separately. In [29] and [30], new methods were proposed to address this shortcoming. In [29], an actor-critic RL method based on policy gradient was proposed to improve D2D throughput and system throughput. In [30], a novel Bayesian RL model was proposed, and Bayesian RL-based coalition formation algorithms were implemented in a long-term evolution advanced network.
Recently, multi-agent RL has been gradually applied to wireless networks for its excellent performance and efficient implementation of distributed mechanisms. In [31], a collaborative multi-agent RL anti-jamming algorithm was proposed to solve the problem of external malicious jamming and mutual interference among users. An autonomous channel selection scheme based on multi-agent RL was proposed in [32], which could accelerate the convergence speed of the algorithm as well as improve the throughput of the system. In [33], a multi-agent deep reinforcement learning (MADRL) based distributed dynamic power allocation scheme was proposed, which achieved near-optimal power allocation. In [34], a MADRL method was adopted to realize cooperative spectrum sensing. Compared with traditional RL methods, the proposed algorithm had advantages in both the convergence speed and the reward performance.
However, machine learning based resource allocation schemes for T2T communication are still scarce. In [35], a Stackelberg game was proposed for power control, and weight factors based on proportional fairness were introduced for channel selection, which realized resource allocation in the T2T scenario. The scheme can improve the throughput of the system and ensure the stability of the T2T communication. However, the system model in this scheme has some disadvantages, e.g., the resource of one T2W uplink can only be multiplexed by one T2T link.
In this paper, we design a novel CBTC system structure based on T2T communication with Long Term Evolution for Metro (LTE-M), since the Beijing Yanfang urban railway has already adopted LTE-M to transmit CBTC traffic [9]. Then, MADRL is applied to the T2T scenario for the first time, and a novel distributed resource management scheme is realized. Specifically, in the proposed scheme, each T2T transmitter is regarded as an agent. Through interaction between the agents and the environment, each agent obtains state information, including resource block (RB) reuse and channel state. According to the policy, each agent chooses actions, including power selection and RB selection. Compared with the random allocation scheme and the existing resource allocation scheme for T2T communication, the proposed scheme can effectively improve the throughput of the system and the successful transmission probability of T2T links within the specified time.
The remainder of this paper is organized as follows. Section II briefly introduces the system model and formulates the resource allocation problem in the T2T scenario. In Section III, we describe the MADRL based resource allocation algorithm for T2T communication. The performances of the proposed schemes are simulated and compared in Section IV. Finally, we conclude the paper in Section V.
Notation: $\mathbf{a}$, $a$ and $\mathcal{A}$ represent a vector, a scalar and a set, respectively; $|\mathcal{A}|$ denotes the size of set $\mathcal{A}$; $\mathbb{R}^n$ stands for the set of n-dimensional real vectors; $\mathbb{E}[\cdot]$ denotes the expectation.
[...] integrated into trains and trackside controllers. In this novel CBTC system, the train becomes more "core" and "intelligent". The automatic train supervision (ATS) system sends the routing plan to the vehicle on-board controller (VOBC) by T2W communication, and the VOBC can then directly control the rotation and opening of the turnout according to the routing plan [3]. By adopting D2D communication technology, adjacent trains can directly perform T2T communication and exchange key information such as train position and speed with each other. Based on this key information, the train can generate an updated movement authority (MA) in a timely manner without the assistance of the ZC or other equipment. There is a transmitter with an environmental sensor at the front and rear of the train to better support the T2T communication. The environmental sensor can obtain "environmental state information" such as instantaneous channel state information (CSI) and interference power. The main function of the environmental sensor will be introduced in Section III.
The novel design can not only effectively reduce the communication processes between trains and improve the performance of the entire system, but also simplify the system structure.

B. PROBLEM FORMULATION
The T2T communication scenario in a single cell is shown in Fig. 2. In this scenario, we assume that the total number of RBs in the system is M, where M is the maximum number of trains in the area covered by the cell. The RBs are orthogonal to each other. Moreover, N trains need to establish N T2W uplinks to communicate from train to wayside equipment, and each link is denoted by n ∈ W = {1, 2, . . . , N}. Each T2W link uses one RB, and the RBs are different from each other. Furthermore, each train needs to interact with its two adjacent trains to obtain their location and state information. So there are K T2T links denoted by T = {1, 2, . . . , K}, and K is twice the number of trains, i.e., K = 2N. In particular, the first and last trains each have only one adjacent train, yet they still establish two T2T links with that train for redundant transmission. The anti-interference ability at the BS is stronger than that at the train, and the available spectrum resources for wireless communication are limited. Therefore, each T2T link reuses the orthogonal spectrum resource of the T2W uplink, and the same RB can be reused by multiple T2T links in the same time slot. When different links in the system use the same RB, co-channel interference (i.e., collision transmissions) occurs. The co-channel interference affects the throughput and performance of the system.
The signal-to-interference-plus-noise ratio (SINR) of the nth T2W user at the BS can be expressed as
$$\mathrm{SINR}_n^{W} = \frac{P_n^{W} h_n^{W}}{\sigma^2 + \sum_{k \in \mathcal{T}} \rho_k[n] P_k^{T} h_k^{W}}, \quad (1)$$
where $P_n^{W}$ is the transmission power of the nth T2W user, $h_n^{W}$ is the channel gain of the useful signal corresponding to the nth T2W user, and $\sigma^2$ is the noise power. $\rho_k[n] = 1$ when the kth T2T user has reused the frequency resource of the nth T2W user and $\rho_k[n] = 0$ otherwise. $P_k^{T}$ and $h_k^{W}$ are the transmission power of the kth T2T user and the channel gain of its interference to the BS, respectively. According to the Shannon theorem, the throughput of the nth T2W user can be formulated as
$$C_n^{W} = B \log_2\left(1 + \mathrm{SINR}_n^{W}\right), \quad (2)$$
where B is the bandwidth. The co-channel interference caused by reusing the same frequency resource between the T2T user and the T2W user is
$$I_k^{W} = \sum_{n \in \mathcal{W}} \rho_k[n] P_n^{W} h_{n,k}^{T}, \quad (3)$$
and the co-channel interference among all T2T users which use the same RB is
$$I_k^{T} = \sum_{n \in \mathcal{W}} \rho_k[n] \sum_{k' \in \mathcal{T},\, k' \neq k} \rho_{k'}[n] P_{k'}^{T} h_{k',k}^{T}. \quad (4)$$
Hence, the SINR of the kth T2T user can be expressed as
$$\mathrm{SINR}_k^{T} = \frac{P_k^{T} h_k^{T}}{\sigma^2 + I_k^{W} + I_k^{T}}, \quad (5)$$
where $h_k^{T}$ is the channel gain of the useful signal corresponding to the kth T2T user, and $h_{n,k}^{T}$ is the channel gain of the interference from the nth T2W or T2T user to the kth T2T user. The throughput of the kth T2T user can be expressed as
$$C_k^{T} = B \log_2\left(1 + \mathrm{SINR}_k^{T}\right). \quad (6)$$
In this system, each T2T transmitter is regarded as an agent. Each agent chooses its transmission power and RB by interacting with the environment. By designing an appropriate reward function, our proposed scheme can maximize system throughput and improve the reliability of information transmission in each T2T link. To ensure the safe operation of trains, the position and status of trains need to be transmitted periodically between adjacent trains, so the reliability of the T2T link is particularly important. To evaluate the reliability of the T2T links, we define the successful transmission probability of T2T links. The information transmission is considered unsuccessful if the T2T link fails to transmit the required information within the specified time. More details will be discussed in Section III.
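The SINR and throughput definitions above can be illustrated numerically. The following is a minimal sketch, not the authors' simulator: the function names, the array layout (a K x N reuse matrix `rho`, an N x K matrix `g_w2t` of T2W-to-T2T interference gains, a K x K matrix `g_t2t` of T2T-to-T2T interference gains) and the toy values are all assumptions made for illustration.

```python
import numpy as np

def t2w_sinr(p_w, h_w, rho, p_t, h_t2bs, noise):
    """SINR of each T2W uplink at the BS.

    p_w, h_w : (N,) T2W transmit powers and useful-signal channel gains
    rho      : (K, N) reuse indicators, rho[k, n] = 1 if T2T link k reuses RB n
    p_t      : (K,) T2T transmit powers
    h_t2bs   : (K,) interference gains from T2T transmitters to the BS
    """
    interference = rho.T @ (p_t * h_t2bs)  # per-RB sum over reusing T2T links
    return p_w * h_w / (noise + interference)

def t2t_sinr(p_t, h_t, rho, p_w, g_w2t, g_t2t, noise):
    """SINR of each T2T link (hypothetical layout: g_w2t[n, k], g_t2t[j, k])."""
    # Interference from the T2W user whose RB is reused by link k.
    i_w = np.sum(rho.T * (p_w[:, None] * g_w2t), axis=0)
    # share[k, j] = 1 when T2T links k and j occupy the same RB.
    share = rho @ rho.T
    np.fill_diagonal(share, 0)  # a link does not interfere with itself
    i_t = np.sum(share * (p_t[:, None] * g_t2t).T, axis=1)
    return p_t * h_t / (noise + i_w + i_t)

def throughput(bandwidth, sinr):
    """Shannon throughput B * log2(1 + SINR)."""
    return bandwidth * np.log2(1.0 + sinr)

# Toy case: two T2W uplinks, two T2T links that both reuse RB 0.
rho = np.array([[1.0, 0.0], [1.0, 0.0]])
ones = np.ones(2)
sinr_w = t2w_sinr(ones, ones, rho, ones, 0.5 * ones, noise=1.0)
sinr_t = t2t_sinr(ones, ones, rho, ones,
                  g_w2t=np.full((2, 2), 0.5),
                  g_t2t=np.array([[0.0, 0.5], [0.5, 0.0]]),
                  noise=1.0)
print(sinr_w, sinr_t, throughput(1.0, sinr_t))
```

In this toy case the uplink on RB 0 suffers interference from both T2T links, while the uplink on RB 1 is interference-free, which is the effect the reuse indicator is meant to capture.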

III. MULTI-AGENT DEEP REINFORCEMENT LEARNING FOR T2T RESOURCE ALLOCATION
MADRL can effectively implement a distributed resource allocation mechanism. Deep RL is a combination of deep learning and RL [36]. Deep learning is used to model the value function and the policy, while RL is used to define the problem and the optimization goal. This section is divided into two parts. The first part introduces RL. For resource management in the T2T scenario, the basic elements of the RL model are designed, including the state space S, action space A, policy π and reward function R. The second part introduces the deep Q-network (DQN) and multi-agent deep Q-network (MADQN) algorithms, which learn the mapping between observations and value functions and find the optimal policy.

A. REINFORCEMENT LEARNING
As shown in Fig. 3, the RL framework is composed of two parts, the agent and the environment, which interact with each other. Through this interaction, the agent learns continuously and ultimately completes the learning task. The key elements of the RL model are designed as follows:
• States: For T2T resource allocation management, the agent senses the external environment and generates its state $s_t$ based on the onboard environment sensor [37]. The state $s_t$ consists of six parts:
$$s_t = \{\mathbf{G}_t, \mathbf{H}_t, \mathbf{I}_{t-1}, \mathbf{D}_{t-1}, E_t, F_t\}, \quad (7)$$
where $\mathbf{G}_t \in \mathbb{R}^M$ and $\mathbf{H}_t \in \mathbb{R}^M$ are the channel gains of the T2T links and the T2W links at the current time slot t, respectively, and $\mathbf{I}_{t-1} \in \mathbb{R}^M$ and $\mathbf{D}_{t-1} \in \mathbb{R}^M$ are the interference power and the number of times each RB was reused by adjacent agents at the previous time slot, respectively. $E_t$ and $F_t$ denote the transmission duration and the load quantity. The states observed by the agent differ from one time slot to another, and all possible states constitute the state space S.
• Actions: At time slot t, the agent observes the state $s_t$ from the environment and selects the action $a_t \in A$ according to the policy π. The policy π is a mapping from the state space S to the action space A, which determines the action selection in state $s_t$. The action $a_t$ includes the selection of the RB to reuse and the transmit power level, which can be expressed as
$$a_t = \{RB_t, P_t\}. \quad (8)$$
As mentioned in Section II, the total number of RBs in the system is M; hence, $RB_t \in \{1, 2, \ldots, M\}$. Considering both the complexity of the DQN and the T2T user requirements, three levels of transmission power are adopted, and $P_t \in \{P_1, P_2, P_3\}$. Therefore, the size of the action space |A| (i.e., the number of different actions) can be formulated as
$$|A| = 3M. \quad (9)$$
After the agent takes action $a_t$, the action acts on the environment. The state of the environment changes from $s_t$ to $s_{t+1}$, and an instant reward $r_{t+1}$ is fed back to the agent. Such interaction proceeds as follows:
$$s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots \quad (10)$$
• Reward Function: To capture the impact of the selected action on the whole system, we define the reward function as the weighted total throughput of the T2T and T2W links, rather than the throughput of the link related to the agent alone. The instant reward $r_t$ is a weighted sum of the total T2W throughput and the total T2T throughput, where λ ∈ [0, 1] is the weight factor. In RL, besides the instant reward, a total reward should be considered to ensure stable long-term performance of the system. In the T2T scenario, the environment has no termination state, so an undiscounted total reward could grow without bound.
To solve this problem, the discount rate γ is introduced to control the weight of the long-term reward, and the discounted reward $R_t$ is defined as
$$R_t = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i+1}, \quad (12)$$
where γ ∈ [0, 1]. The agent is more concerned about long-term rewards when γ approaches 1, and the current reward becomes more important when γ approaches 0. The target of RL is to learn a policy that maximizes the expected discounted reward, which can be defined as
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}\left[R_t \mid \pi\right]. \quad (13)$$
The performance of the system is controlled by designing the reward function.
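The state, action, and reward elements described above can be sketched in a few lines. This is an illustrative sketch only: the helper names are hypothetical, the flat action indexing is one possible encoding of an (RB, power level) pair, and the assignment of λ to the T2W term and 1 − λ to the T2T term is an assumption, since the text only states that the reward is a weighted total throughput.

```python
import numpy as np

NUM_POWER_LEVELS = 3  # three transmit power levels, as stated in the text

def build_state(g_t2t, h_t2w, interference_prev, reuse_prev, duration, load):
    """Concatenate four length-M vectors and two scalars into one state vector."""
    return np.concatenate([g_t2t, h_t2w, interference_prev, reuse_prev,
                           [duration, load]])

def encode_action(rb, power_level):
    """Map zero-based (RB index, power-level index) to a flat action index."""
    return rb * NUM_POWER_LEVELS + power_level

def decode_action(action):
    """Inverse of encode_action: returns (RB index, power-level index)."""
    return divmod(action, NUM_POWER_LEVELS)

def weighted_reward(t2w_throughputs, t2t_throughputs, lam):
    """Weighted total throughput.

    ASSUMPTION: lam weights the T2W sum and (1 - lam) the T2T sum; the
    source does not say which term each weight multiplies.
    """
    return lam * sum(t2w_throughputs) + (1.0 - lam) * sum(t2t_throughputs)

def discounted_return(rewards, gamma):
    """Finite-horizon version of the discounted reward: sum_i gamma**i * r_i."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

M = 4
state = build_state(np.ones(M), np.ones(M), np.zeros(M), np.zeros(M), 0.5, 3.0)
print(state.shape)                       # 4M + 2 entries
print(encode_action(2, 1), decode_action(7))
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With M RBs and three power levels, the flat index ranges over 3M values, which matches the stated size of the action space.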

B. MULTI-AGENT DEEP Q-LEARNING
Many effective algorithms have been proposed to achieve the target of RL, and Q-learning is one of the most commonly used. For a policy π, Q-learning optimizes π through the Q-value. The Q-value is closely related to the state $s_t$ and the selected action $a_t$, and is denoted as $Q(s_t, a_t)$. It can be interpreted as the expected total reward obtained when the agent selects the action $a_t$ in the state $s_t$. The action with the highest Q-value is selected to update the policy π, then the Q-value is updated with the new policy, and this process is repeated until the Q-value converges to the optimal Q-value, $Q^{*}$. Once $Q^{*}$ is obtained, the optimal policy $\pi^{*}$ can be found. The iteration formula of the Q-value is as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right], \quad (14)$$
where α is the learning rate. In Q-learning, the Q-value is stored in the Q-table, and the size of the Q-table is $|A| \times |S|$. As the state-action space increases, the size of the Q-table increases dramatically. In resource allocation for T2T communication, the state space S is large and uncertain, so classic Q-learning cannot be applied. This problem can be solved well by using a neural network. As shown in Fig. 4, the observed state is regarded as the input of the neural network, and the neural network outputs the Q-value of each action. The Q-table is thus replaced by the neural network, which is called the Q-network. In the resource allocation problem that we consider, the action $a_t \in A$ is discrete and finite. The output of the Q-network can be expressed as
$$Q_{\varphi}(s_t) = \begin{bmatrix} Q_{\varphi}(s_t, a_1) \\ \vdots \\ Q_{\varphi}(s_t, a_{|A|}) \end{bmatrix}, \quad (15)$$
where φ denotes the weights of the Q-network, which are learned so that $Q_{\varphi}(s_t)$ is close to the real Q-value. There are two problems in the learning process: one is that the target is unstable, because the goal of parameter learning depends on the parameters themselves; the other is that there is a strong correlation between the samples. To solve these two problems, DQN was proposed. The DQN takes two measures: the first is freezing the target network, i.e., the parameters of the target network are fixed for a period to stabilize the learning target; the second is experience replay, in which an experience pool is built to remove data dependencies [38]. To solve the T2T resource allocation problem considered in this paper, we adopt the MADQN algorithm, i.e., there are multiple agents which select actions with DQN independently. The learning process of the MADQN is described in Algorithm 1.

Algorithm 1 MADQN for T2T Resource Allocation
Input: State space S, action space A, discount rate γ, learning rate α
Output: Multi-agent deep Q-network
1 Initialize replay memory D to capacity N;
2 Initialize $Q_k(s, a)$ for each k ∈ T;
3 Randomly initialize the weights φ of the Q-network;
4 Initialize the weights of the target Q-network as $\hat{\varphi} = \varphi$;
5 for episode = 1 : j do
6     Initialize state $s_k$ for each k ∈ T;
7     for step = 1 : i do
8         for k ∈ T do
9             In state $s_k$, select action $a_k$ with policy π;
10            Take action $a_k$, observe the reward $r_k$ and a new state $s'_k$;
11            Save $(s_k, a_k, r_k, s'_k)$ into D;
12            Sample $(ss, aa, rr, ss')$ from D;
13            $y = rr + \gamma \max_{a'} Q_{\hat{\varphi}}(ss', a')$;
14            Train the multi-agent deep Q-network with the loss function $\mathrm{Loss}(\varphi) = (y - Q_{\varphi}(ss, aa))^2$;
15            $s_k \leftarrow s'_k$;
16            Every C steps $\hat{\varphi} \leftarrow \varphi$;
17 return: Multi-agent deep Q-network with weights φ.
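The Q-value iteration and the target and loss computed in steps 13 and 14 of Algorithm 1 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, the Q-table is a small dense array, and the loss is averaged over a sampled batch, which the algorithm listing does not specify.

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])
    return q_table

def dqn_targets(rewards, next_q_target, gamma):
    """y = r + gamma * max_a' Q_target(s', a') for each sample in a batch.

    next_q_target : (batch, |A|) outputs of the frozen target network
    """
    return rewards + gamma * np.max(next_q_target, axis=1)

def dqn_loss(q_online, actions, targets):
    """Squared TD error, averaged over the batch (averaging is an assumption)."""
    chosen = q_online[np.arange(len(actions)), actions]
    return np.mean((targets - chosen) ** 2)

# Tabular example with two states and two actions.
q = np.zeros((2, 2))
q[1] = [1.0, 2.0]
q = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(q[0, 1])

# Batch example for the deep variant.
y = dqn_targets(np.array([1.0, 0.0]), np.array([[1.0, 2.0], [3.0, 0.0]]), gamma=0.5)
print(y, dqn_loss(np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([1, 0]), y))
```

Because the targets are computed from a separate, periodically copied set of outputs, the regression target does not shift with every gradient step, which is the purpose of the frozen target network.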
In the process of MADQN training, to make the agent explore the environment sufficiently, the ε-greedy method is adopted, i.e., the agent selects the action with the largest Q-value with probability 1 − ε and randomly selects an action from A with probability ε.
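The exploration rule described above can be written compactly. The sketch below assumes a NumPy random generator and a hypothetical function name; it is an illustration of the standard rule, not code from the paper.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform over actions
    return int(np.argmax(q_values))              # exploit: largest Q-value

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3])
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # always greedy
print(epsilon_greedy(q_values, epsilon=1.0, rng=rng))  # always random
```

Setting ε to zero recovers the purely greedy selection used in the testing stage.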
Once training is complete and the Q-value has converged, the learning effect of the MADQN is tested. Unlike the training process, the testing stage does not use the ε-greedy method. The action with the largest Q-value is selected directly to maximize the total reward and improve the performance of the system. In the distributed resource allocation scheme, each agent cannot know the actions selected by other agents in the current time slot, so multiple agents may reuse the same RB, thereby generating large interference, reducing the reward and failing to achieve higher system performance. To solve this problem, only a few agents update their actions in each time slot, i.e., the synchronous update is replaced by an asynchronous update. Because agents act at different times, the impact of the action selected by one agent on the environment can be observed by the other agents before they act. In pursuit of higher rewards, the reuse of the same RB by adjacent or multiple agents at the same time is then reduced or even avoided.
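One way to realize such an asynchronous update is a round-robin schedule, sketched below. The text does not specify how the updating agents are chosen in each time slot, so the round-robin rule, the `group_size` parameter and both function names are assumptions made purely for illustration.

```python
def agents_updating(slot, num_agents, group_size):
    """Round-robin choice of the agents allowed to change action in this slot."""
    start = (slot * group_size) % num_agents
    return [(start + i) % num_agents for i in range(group_size)]

def apply_async(prev_actions, proposed_actions, slot, group_size):
    """Only the scheduled agents adopt their proposed action; others keep theirs."""
    active = set(agents_updating(slot, len(prev_actions), group_size))
    return [proposed_actions[k] if k in active else prev_actions[k]
            for k in range(len(prev_actions))]

prev = [0, 0, 0, 0]
proposed = [5, 6, 7, 8]
print(agents_updating(1, 4, 2))            # agents 2 and 3 act in slot 1
print(apply_async(prev, proposed, 1, 2))   # agents 0 and 1 keep their actions
```

Under this schedule, the agents that wait observe the interference produced by the agents that just acted before choosing their own RB.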
In summary, the MADQN algorithm is proposed in this paper, which can solve the resource management problem in the T2T communication scenario. The specific performance analysis is given in Section IV.

IV. SIMULATION ANALYSIS
In this section, detailed simulation parameters are given, and simulations are conducted to evaluate the performance of our proposed scheme.

A. SIMULATION PARAMETERS
For the MADQN, considering the number of inputs and outputs, a three-layer fully connected neural network is adopted, which consists of an input layer, a hidden layer, and an output layer, where the number of neurons in the hidden layer is set to 90. The MADQN input size $n_{in}$ is 4M + 2 according to Equation (7). The output size $n_{out}$ is equal to |A|, as given in Equation (9). The number of neurons in the input layer and the output layer can be set once $n_{in}$ and $n_{out}$ are determined.
In the training stage, the ε-greedy method with variable ε is adopted. At the beginning, the agent randomly selects actions with a high probability to explore the environment and accumulate experience. As the number of training steps increases, ε gradually decreases, which effectively balances exploration and exploitation. More specifically, ε is defined as a decreasing function of the number of training steps x, parameterized by a constant b. As the value of b increases, the agent spends a longer time exploring the environment. In this paper, we take the value of b as 5500.
Fig. 5 shows the effect of the training process, i.e., the variation of the average reward in the system as the number of training steps increases. At the beginning, the agent is still in the stage of environmental exploration, i.e., the agent chooses actions randomly with a high probability, so that actions with low reward are also selected. Therefore, the average reward fluctuates constantly. When the number of training steps reaches about 7700, the probability of selecting the action with the largest Q-value already exceeds the probability of selecting an action at random, so exploration tapers off and the average reward begins to rise continuously. The detailed parameters are shown in Table 1. The path loss models of the T2T links and T2W links are based on real data from the Beijing Yanfang subway line [9].
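The layer sizes described above can be checked with a small forward-pass sketch. Only the sizes (4M + 2 inputs, 90 hidden neurons, |A| outputs) come from the text; the ReLU activation, the Gaussian weight initialization, the function names, and the use of M RBs times three power levels for the output size are assumptions for illustration.

```python
import numpy as np

def build_q_network(num_rbs, hidden=90, num_power_levels=3, seed=0):
    """Create weights for a network with one hidden layer."""
    rng = np.random.default_rng(seed)
    n_in = 4 * num_rbs + 2                # state size
    n_out = num_rbs * num_power_levels    # one Q-value per (RB, power) action
    return {
        "w1": rng.normal(0.0, 0.1, (n_in, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def q_values(params, state):
    """Forward pass: state -> hidden layer (ReLU, assumed) -> Q-value per action."""
    hidden = np.maximum(0.0, state @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]

params = build_q_network(num_rbs=4)
state = np.ones(4 * 4 + 2)
print(params["w1"].shape, q_values(params, state).shape)
```

For M = 4 this gives an 18-dimensional input and 12 outputs, so the index of the largest output directly identifies an RB and power level pair.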

B. PERFORMANCE ANALYSIS
To evaluate our proposed scheme, we compare it with the scheme proposed in [35] and the random allocation scheme. In the first scheme, the channel selection is based on the weighting factor of proportional fairness, and the power control is performed by a Stackelberg game. For simplicity, this scheme is called scheme I. The second scheme randomly chooses an action for each agent.
Fig. 6 shows the relationship between the average throughput of the T2T links and the number of trains. It can be seen that the proposed scheme effectively reduces the interference of the T2T link, and its throughput is higher than that of the other two schemes for every number of trains considered. As the number of trains grows, more T2T links are established. Due to the limited number of available RBs, the interference from the T2W links to the T2T links and among the different T2T links increases, which leads to a reduction in the average throughput of the T2T link. In detail, when the number of trains increases from 2 to 3, the average T2T link throughput decreases significantly. The specific reasons are as follows. Ideally, the co-channel interference could be eliminated when the number of trains is 2, because the total number of links in the system is less than the number of available RBs. However, when the number of trains increases to 3, the total number of links exceeds the number of available RBs, and co-channel interference becomes inevitable.
Fig. 7 illustrates the total throughput of the system with respect to the number of trains in the system. From the simulation results, we can see that the proposed scheme effectively increases the total throughput of the system, and its advantage over scheme I becomes more obvious as the number of trains increases. As the number of trains increases, the total throughput of the system also increases. However, the increased numbers of T2T links and T2W links bring more co-channel interference, which slows the growth of the system throughput.
Fig. 8 shows the successful transmission probability of T2T links as a function of the number of trains. It can be seen that our scheme has the highest transmission success rate, and that its success rate decreases the least as the number of trains increases, which effectively guarantees the reliability of T2T links. This is because the agent can obtain the state of transmission while interacting with the environment, which increases the throughput of the T2T links while improving their successful transmission probability.
Fig. 9 illustrates the probability with which the agents choose each power level. After training, the maximum transmission power is selected with the highest probability. Combined with Fig. 6 and Fig. 7, it can be seen that, in order to obtain more reward, the agent learns to select the maximum transmission power to improve the throughput of the T2T link and learns to reduce the co-channel interference in the system effectively.
VOLUME 8, 2020

V. CONCLUSION
T2T communication has been proposed for the next generation train control system, and a resource allocation problem arises because the T2T links multiplex the spectrum resources of the T2W uplinks. In this paper, we propose a distributed resource allocation scheme based on MADRL. Simulation results demonstrate that our scheme can effectively reduce the interference in the system. It can improve the throughput of the T2T links and of the system, and improve the successful transmission probability of the T2T links within the specified time. Our scheme can play an important role in resource allocation for T2T communication.