Modular Reinforcement Learning for Playing the Game of Tron

Tron is a simultaneous-move two-player game in which each agent leaves a wall along its path, and the agent that first crashes into a wall is defeated. Because the same action may lead to different outcomes (non-stationarity), basic reinforcement learning approaches are difficult to apply. In this paper, we present a modular reinforcement learning approach to the game of Tron that decomposes the game into two phases, where the first phase is non-stationary and the second phase is stationary. We evaluate the performance of our algorithm by comparing it with previous algorithms, including the state-of-the-art algorithm for the game of Tron (called a1k0n), on different grid sizes.


I. INTRODUCTION
Since the advent of deep Q-networks (DQN) [1], the field of deep reinforcement learning (RL) has developed rapidly through numerous approaches to improving the efficiency and stability of learning and the performance of the trained RL agents.
As the popularity of RL continues to grow, the target applications of RL have also expanded rapidly from board games and classic video games with simple rules and actions to more complicated and actionable games [2], multi-agent games [3], [4], and NP-hard graph problems [5], [6].
Since RL researchers started to tackle more complex and difficult problems, there have been many attempts to break complex and difficult tasks down into less complex and easier sub-tasks. This approach is called modular reinforcement learning (MRL) [7], [8], and is based on the simple idea that it is more efficient to decompose a task into several sub-tasks and train an agent for each sub-task.
Sun et al. [9] applied a decomposition unit to the critic network for efficient learning of multi-agent games with multiple objects. In other words, sub-tasks were defined based on discrete objects. Simpkins et al. [7] proposed a method to improve learning performance with a task-specific reward scale for a single-agent game with multiple goals. In this case, sub-tasks are defined based on the objectives that should be considered at the same time. Andreas et al. [10] applied a method to maximize the total reward by applying different policies to multiple sequential tasks. As such, there are various criteria and methods for defining tasks in task decomposition and MRL-related studies.
In this paper, we consider the game of Tron, a two-player simultaneous-move game played on a discrete square grid. Tron is a competitive game in which a wall is created along the path of each of the two simultaneously moving agents, and the agent that first crashes into a wall is defeated. It is a multi-agent game while the agents share the same space. However, once the agents are separated from each other, each agent's goal is to survive alone as long as possible because the agents can no longer interfere with each other's moves [11], [12].
Many strong heuristics have been proposed to play the game of Tron [11], [13]-[18], including the Monte-Carlo Tree Search (MCTS) algorithm and the minimax algorithm. Recently, Knegt et al. [13], [15] utilized a deep neural network as a Q-function approximator. Perick et al. [16] applied MCTS to the game of Tron and compared different node selection strategies for the MCTS algorithm. In order to exploit the non-stationary property of the game of Tron, Knegt et al. [13], [15] proposed a concept of opponent modeling to predict the opponent's next move, which is subsequently used in MCTS.
We tackle the game of Tron by decomposing the game into two sequential sub-tasks. The first part of the game is played until the two agents are separated on the grid so that there is no possibility of colliding with each other. Since the next state of an agent cannot be determined only by the agent's own action, we consider these states to be non-stationary. In the second part of the game, the agent needs to move through the empty cells for as long as possible. Since the agent at this point does not need to care about the opponent's moves, we can consider these states to be stationary. In summary, we decompose the game of Tron into two phases, where the first phase is inherently non-stationary as the opponent's move can change the environment and reward regardless of the chosen action, and the second phase can be considered stationary as we can simply ignore the opponent's region of the grid. In this paper, we demonstrate that the modular RL approach to the game of Tron is robust compared to previous approaches, including the a1k0n algorithm [12], which won the Google Tron AI Challenge in 2010. In particular, we show that our algorithm is considerably faster than the previous heuristics as we do not perform an exhaustive tree search. Considering that Tron is a real-time game, our algorithm can be employed in time-sensitive scenarios. We verify the performance of our algorithm through comparisons with other algorithms on boards of different sizes.
FIGURE 1. Two Tron agents that started the game from a point-symmetric location on a 6×6 game board. A large circle and a large diamond are the current locations of the agents. The longest path of the circle-shaped agent is 16, and the diamond-shaped agent's is 7.

A. REINFORCEMENT LEARNING
There are two well-recognized criteria for classifying the research directions of RL. First, RL algorithms can be classified as either model-free or model-based. Model-free RL algorithms (such as DQN [1], DDPG [19], etc.) try to estimate the optimal policy without using or estimating the transition and reward functions of the environment. On the other hand, model-based RL algorithms (such as AlphaZero [20], MuZero [21], etc.) employ the transition function and the reward function to estimate the optimal policy.
Second, we can also classify RL algorithms into two classes, namely policy-based and value-based. In policy-based RL, we directly learn a policy that maps states to actions (mapping π : s → a) without using a value function. On the contrary, in value-based RL, we only use a value function that estimates the values of the states and derive the policy from the value function by picking the action with the best value.
Policy-based and value-based RL are implicitly mixed throughout the field of RL and explicitly combined in the advantage actor-critic algorithms (such as A3C [22], A2C, ACKTR [23], etc.). The actor-critic algorithm consists of two networks called the actor and the critic. The actor network (approximating the policy function) chooses an action at each step and the critic network (approximating the value function) evaluates the quality of the given state. As the critic network learns which state is better or worse, the actor relies on the critic to choose the states leading to better future rewards.
Reinforcement learning is a machine learning method that trains neural networks for decision-making processes based on the Markov Decision Process (MDP) [24]. In an MDP, a state transition $t(s_{t+1} \mid s_t, a_t)$ occurs when a particular action $a_t$ is selected according to the policy $\pi(a_t \mid s_t)$ in a given state $s_t$, and the agent receives a reward $R_t$. To know the exact sum of future rewards, we would have to make a choice at each time step and add up all the resulting rewards. However, as the search space increases, it is practically impossible to find the correct sum for all possible situations. Therefore, we need an expression that can approximate the total future reward with only the current information, which is called the Q-value. As reinforcement learning is based on the Bellman equation [25], we aim to train an agent that selects the action maximizing the Q-value defined as follows:
$$Q(s_t, a_t) = \mathbb{E}\left[ R_t + \gamma \max_{a} Q(s_{t+1}, a) \right],$$
where $\gamma$ is the decay factor. Unlike present rewards, which have a clear value, future rewards have an unclear value. Therefore, when evaluating future rewards at the present time, a penalty is applied by multiplying by $\gamma$.
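To make the role of the decay factor concrete, the following minimal Python sketch computes a discounted sum of rewards and a one-step Bellman-style target. The function names and the toy numbers are illustrative assumptions and are not taken from the implementation used in this paper.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is scaled by gamma**k."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = R_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def bellman_target(reward, next_q_values, gamma, terminal=False):
    """One-step target R_t + gamma * max_a Q(s_{t+1}, a).

    next_q_values is a (hypothetical) list of Q estimates for the next state;
    terminal states do not bootstrap from the next state.
    """
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
example_target = bellman_target(1.0, [0.0, 2.0], gamma=0.9)
```

A smaller decay factor makes the agent weigh near-term rewards more heavily, which is the "penalty" on uncertain future rewards described above.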

B. ACTOR-CRITIC ALGORITHMS
The Actor-Critic [22], [23], [26] algorithm employs two neural networks: an actor network that determines an action from a state, and a critic network that estimates the value of a state. The objective function for training the actor network is defined using the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, which is the difference between the actual Q-value and the value $V(s_t)$ estimated by the critic network. Then, we optimize the actor network by maximizing the probability of actions with higher advantages. In other words, the objective function for the actor network is
$$\mathbb{E}\left[ \log \pi(a_t \mid s_t) \, A(s_t, a_t) \right].$$
On the other hand, we train the critic network by minimizing $A(s_t, a_t)^2$, the squared difference between the Q-value and the value estimated by the critic network. Finally, we add the following term, called the entropy loss, to the objective function to encourage exploration and discourage premature convergence to a sub-optimal policy: $-\sum_{a} \pi(a \mid s) \log \pi(a \mid s)$.
In order to increase the diversity of training data, the asynchronous advantage actor-critic (A3C) [22] algorithm executes a set of environments in parallel and the parallel agents update the global network asynchronously. The advantage actor-critic (A2C) is a synchronous version of A3C introduced by OpenAI in their published baselines. In A2C, all of the updates by the parallel agents are collected to update the global network. The actor-critic using Kronecker-factored trust region (ACKTR) [23] extends A2C. ACKTR improves sample efficiency by using the Fisher information matrix, which corresponds to the second derivative (Hessian) of the Kullback-Leibler divergence, in the optimization process of A2C. To reduce computational costs, ACKTR approximates this matrix using basic properties of the Kronecker product instead of computing second-order derivatives.
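The actor, critic, and entropy terms described in this subsection can be illustrated for a single transition with the short sketch below. It is only a scalar illustration under stated assumptions: the function name `a2c_losses` and the weight `entropy_coef` are hypothetical, and in a real implementation the advantage is treated as a constant when differentiating the actor term.

```python
import math


def a2c_losses(probs, action, q_value, state_value, entropy_coef=0.01):
    """Per-sample advantage actor-critic terms for one transition.

    probs        : policy probabilities pi(a|s), assumed positive for `action`
    action       : index of the action actually taken
    q_value      : observed (or bootstrapped) Q estimate for (s, a)
    state_value  : critic estimate V(s)
    entropy_coef : hypothetical weight of the entropy bonus
    """
    advantage = q_value - state_value
    # Maximizing log pi(a|s) * A is written as minimizing its negative.
    actor_loss = -math.log(probs[action]) * advantage
    # The critic minimizes the squared advantage.
    critic_loss = advantage ** 2
    # Entropy of the policy; higher entropy means more exploration.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    total_loss = actor_loss + critic_loss - entropy_coef * entropy
    return actor_loss, critic_loss, entropy, total_loss


actor, critic, entropy, total = a2c_losses([0.25, 0.25, 0.25, 0.25], 0, 1.0, 0.5)
```

With a uniform policy over four actions the entropy takes its maximum value, log 4, so the entropy term rewards exactly the kind of undecided policy that premature convergence would eliminate.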

C. RL IN STATIONARY AND NON-STATIONARY ENVIRONMENTS
An environment is said to be stationary [9], [27] if the next state $s_{t+1}$ is determined by the agent's current action $a_t$ and the current state $s_t$. This is also known as the Markov property [24]. In general, most reinforcement learning algorithms assume a stationary environment where there is no stochastic element involved in determining the next state. See Figure ?? for a pictorial description of stationary and non-stationary environments.
On the contrary, an environment is called non-stationary if the next state $s_{t+1}$ cannot be determined solely from the agent's current action $a_t$ and the current state $s_t$. In this case, additional factors $\varepsilon_t$ help determine the next state, such as other agents' behavior, randomness, and past behaviors and states. It is widely known that basic reinforcement learning approaches are very unstable in non-stationary environments.

D. MINIMAX ALGORITHM
Minimax [28] is a well-known recursive algorithm that selects the most favorable action through valuation within a game tree. Each node in the game tree has a selection value, which is computed by means of a position evaluation function. When an agent explores its action, the minimax algorithm selects a node with the highest selection value.
Similarly, the opponent is expected to make the most favorable choice from the opponent's point of view. Therefore, when exploring the opponent's action, the agent selects the node with the lowest selection value. Because the computational cost of the minimax algorithm grows quickly with the search depth, we can improve its performance by using alpha-beta pruning [29]. If an agent finds an action by the minimax algorithm from a game tree of depth n, we can ensure that the chosen action is the most favorable action for the agent after n steps.
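As a minimal illustration of minimax with alpha-beta pruning, the sketch below searches a small hand-written game tree. The tree, the leaf values, and the helper names (`children`, `evaluate`) are toy assumptions chosen for the example and are unrelated to the Tron implementation.

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Depth-limited minimax value of `node` with alpha-beta pruning."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        best = float("-inf")
        for child in kids:
            value = alphabeta(child, depth - 1, alpha, beta, False, children, evaluate)
            best = max(best, value)
            alpha = max(alpha, best)
            if alpha >= beta:  # the minimizing player will never allow this branch
                break
        return best
    best = float("inf")
    for child in kids:
        value = alphabeta(child, depth - 1, alpha, beta, True, children, evaluate)
        best = min(best, value)
        beta = min(beta, best)
        if alpha >= beta:  # the maximizing player already has a better option
            break
    return best


# Toy tree: the agent moves at "root", the opponent moves at "a" and "b".
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaf_values = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}
evaluated = []


def evaluate(node):
    evaluated.append(node)  # record which positions were actually evaluated
    return leaf_values.get(node, 0)


best_value = alphabeta("root", 2, float("-inf"), float("inf"), True,
                       lambda n: tree.get(n, []), evaluate)
```

In this toy tree the leaf `b2` is never evaluated: once `b1` shows that branch `b` is worth at most 2 to the agent, it cannot beat branch `a`, whose value is 3.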
The Voronoi diagram [30] is used to partition a plane into regions close to each of a given set of points. Figure 2 left shows an example of a Voronoi diagram computed on a grid of Tron. Considering that an agent can only move in four directions in the game of Tron (up, down, left, and right), the Voronoi region of an agent is the set of cells that are closer to that agent than to the opponent in L1 distance. Cells with the same L1 distance from both agents [31] belong to a neutral region.
We implement a minimax agent playing the game of Tron by relying on the computation of the Voronoi diagram at each turn [11]. More specifically, the selection value of a node for agent 1 is defined as $R_1 - R_2$, where $R_i$ is the area of the Voronoi region of agent i. In other words, an agent always selects the node that makes the area of its Voronoi region exceed the area of the opponent's Voronoi region by as much as possible. Similarly, the opponent always selects the node with the minimum selection value. If there are multiple nodes with the same maximum value, the agent randomly chooses one of them. In our implementation, we set the exploration depth of the minimax algorithm to two due to time constraints.
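The Voronoi-based selection value can be sketched with two breadth-first searches, one per agent. This is a hedged illustration rather than the evaluated implementation: the board is assumed to be a list of lists of booleans (`True` for a wall, including the cells currently occupied by the agents), and distances are measured as shortest paths around walls, which coincide with the L1 distance on an empty board.

```python
from collections import deque


def bfs_distances(walls, start):
    """Shortest 4-neighbour distances from `start` to every reachable empty cell."""
    rows, cols = len(walls), len(walls[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not walls[nr][nc] and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def voronoi_score(walls, pos1, pos2):
    """Selection value for agent 1: size of its region minus the opponent's."""
    d1, d2 = bfs_distances(walls, pos1), bfs_distances(walls, pos2)
    inf = float("inf")
    r1 = r2 = 0
    for cell in set(d1) | set(d2):
        if cell in (pos1, pos2):
            continue  # the agents' own cells are not part of either region
        a, b = d1.get(cell, inf), d2.get(cell, inf)
        if a < b:
            r1 += 1
        elif b < a:
            r2 += 1
        # cells with equal distance are neutral and count for neither agent
    return r1 - r2


board = [[False] * 3 for _ in range(3)]
board[1][1] = board[0][0] = True  # agent 1 in the centre, agent 2 in a corner
centre_vs_corner = voronoi_score(board, (1, 1), (0, 0))
```

On this 3×3 toy board the centre agent reaches three cells strictly first and the corner agent none, which matches the intuition that central positions control more space.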

E. TREE OF CHAMBERS HEURISTIC: A1K0N
However, the Voronoi heuristic has a clear limitation in situations such as the one depicted in Figure 2 right. While the area of the Voronoi region of the diamond-shaped agent is larger than that of the circle-shaped agent, the region will be divided into two regions if the diamond-shaped agent moves toward the wall and chooses one of the directions. If the circle-shaped agent blocks the area selected by the diamond-shaped agent, the resulting remaining area of the circle-shaped agent will be larger. Therefore, if an agent can take one of two paths and can only reach one of two areas as a result of the choice of path, the area estimated by the Voronoi heuristic can be very different from the area the agent can actually reach. For these reasons, a1k0n [12], the Google Tron AI Challenge champion, uses the biconnected component algorithm with the articulation points computed from the Voronoi region of the agent.
If two agents are separated, then the a1k0n algorithm relies on the simple wall-following heuristic. Note that the original a1k0n algorithm utilizes the iterative deepening search with runtime constraint but we re-implemented the tree search algorithm as the simple depth-limited search with depth one. VOLUME 4, 2016

III. PROPOSED METHOD
Before explaining our methods more specifically, let us briefly recall how the game of Tron proceeds. The initial position of agent 1 is randomly determined and the initial position of agent 2 is automatically set so that the initial grid possesses central symmetry. At each time step, both agents simultaneously move one cell up, down, left, or right. The cells that the agents have passed through are filled with walls, and the first agent that crashes into any kind of wall is defeated. If both agents crash at the same time (or crash into each other), the result of the game is a draw.
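The rules recalled above can be written down as a single simultaneous-move step function. The sketch below is an assumption-laden illustration: the board is a list of lists of booleans (`True` for a wall, including the agents' current cells), leaving the board is treated as crashing into a side wall, and the names `step` and `MOVES` are hypothetical.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(walls, pos1, pos2, move1, move2):
    """Apply one simultaneous move for both agents.

    Returns (new_pos1, new_pos2, result) where result is None while the game
    continues, or "agent1" / "agent2" / "draw" when it ends. The board is
    updated in place only if the game continues.
    """
    rows, cols = len(walls), len(walls[0])

    def target(pos, move):
        dr, dc = MOVES[move]
        return pos[0] + dr, pos[1] + dc

    def crashes(cell):
        r, c = cell
        return not (0 <= r < rows and 0 <= c < cols) or walls[r][c]

    new1, new2 = target(pos1, move1), target(pos2, move2)
    crash1, crash2 = crashes(new1), crashes(new2)
    if new1 == new2:  # both agents enter the same cell: they crash into each other
        crash1 = crash2 = True
    if crash1 and crash2:
        return new1, new2, "draw"
    if crash1:
        return new1, new2, "agent2"
    if crash2:
        return new1, new2, "agent1"
    walls[new1[0]][new1[1]] = True  # the trail behind each agent becomes a wall
    walls[new2[0]][new2[1]] = True
    return new1, new2, None


board = [[False] * 3 for _ in range(3)]
board[0][0] = board[2][2] = True  # point-symmetric start positions
p1, p2, outcome = step(board, (0, 0), (2, 2), "right", "left")
```

Because the agents' current cells are already marked as walls, two agents that try to swap places both crash, which the function reports as a draw.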

A. RL AGENT FOR STATIONARY ENVIRONMENTS
In Tron, a state in which the two agents are in independent spaces and cannot influence each other is considered to be a stationary environment [9], [27], as shown in Figure 1. After beginning the game in a non-stationary environment, the game may transition to a stationary environment if the two agents become isolated on the grid. In the stationary environment, the agents only need to survive as long as possible. Note that the problem of finding the longest path in the grid is equivalent to the longest path problem in the grid graph [32], which is known to be NP-complete. Therefore, there is no polynomial-time algorithm that finds the optimal path for agents in the stationary state unless P = NP. We therefore aim at solving this NP-complete problem using a neural network, as neural networks have been shown in many cases to handle intractable problems effectively compared to previous successful heuristics [5], [6]. In order to train an agent playing in stationary states to find the longest survival path, we randomly generate stationary environments that the agent may encounter during real games. The algorithm for generating the random stationary environment is explained in Algorithm 1.
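Both notions used in this subsection, separation of the agents and the longest survival path, can be illustrated on a tiny board. The sketch below is an assumption: it uses a plain flood fill to test separation and an exhaustive backtracking search for the longest path, which takes exponential time and is therefore only practical on very small boards. The board format (`True` for a wall, including the agents' cells) and the function names are hypothetical.

```python
def neighbors(walls, cell):
    """Empty 4-neighbours of `cell` that lie inside the board."""
    rows, cols = len(walls), len(walls[0])
    r, c = cell
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and not walls[nr][nc]:
            yield nr, nc


def reachable(walls, start):
    """Set of empty cells an agent standing on `start` can still reach."""
    seen = set()
    stack = list(neighbors(walls, start))
    while stack:
        cell = stack.pop()
        if cell not in seen:
            seen.add(cell)
            stack.extend(neighbors(walls, cell))
    return seen


def is_separated(walls, pos1, pos2):
    """True if no empty cell can be reached by both agents (stationary phase)."""
    return reachable(walls, pos1).isdisjoint(reachable(walls, pos2))


def longest_path(walls, start, visited=None):
    """Exact number of moves an agent can survive, found by backtracking."""
    visited = set() if visited is None else visited
    best = 0
    for cell in neighbors(walls, start):
        if cell not in visited:
            visited.add(cell)
            best = max(best, 1 + longest_path(walls, cell, visited))
            visited.remove(cell)
    return best


# A 3x3 board split by a wall in the middle column; agents in the top corners.
split_board = [[False, True, False] for _ in range(3)]
split_board[0][0] = split_board[0][2] = True
separated = is_separated(split_board, (0, 0), (0, 2))
survival = longest_path(split_board, (0, 0))
```

The exponential cost of `longest_path` is the practical reason for replacing the exact search with a learned approximation on larger boards.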

B. RL AGENT FOR NON-STATIONARY ENVIRONMENTS
We distinguish between the non-stationary and stationary environments and finish the game as soon as it transitions from a non-stationary environment to a stationary environment. The winner is then determined by the remaining longest path of each agent. The game is a draw if the two agents crash at the same time in a non-stationary environment, or if their remaining longest paths are equal in a stationary environment. The whole training procedure of the agent playing in non-stationary states is explained in Algorithm 2.
Therefore, the agent that plays in non-stationary environments is trained with the goal of making its remaining longest path longer than the opponent's. We call this agent the P agent.

C. REWARD FUNCTION
Algorithm 1 Random Stationary Environment Generation
Input: An empty game map
Output: A random stationary environment (map)
1: Randomly select one of the four sides;
2: Place a cursor on one of the cells of the selected side;
3: while cursor is not on one of the three unchosen sides do
4:     Make the list of directions except for the direction where the cursor started from;
5:     if cursor moved in the same direction (grid width−3) times in a row then
6:         Remove the direction from the list;
7:     end if
8:     Remove the opposite direction of the previous cursor move from the list;
9:     Randomly select one direction from the list;
10:    Move the cursor in the selected direction and place a wall behind;
11: end while
12: Choose a side from two separated regions;
13: Fill walls to the unchosen side;
14: return A randomly generated map

Algorithm 2 Training Agent for Non-stationary Environment
1: Pretrain a stationary agent on randomly generated stationary environments;
2: Initialize a non-stationary agent;
3: while training is not finished do
4:     Start self-play of same neural network agents in the non-stationary environment;
5:     if game is changed to the stationary environment then
6:         Compute the approximate length of the longest path using the pretrained agent;
7:         Determine the result of the game by comparing the approximate lengths of two agents;
8:     else
9:         Determine the result of the game in the non-stationary environment by detecting crash of agents;
10:    end if
11:    Give reward to the non-stationary agent based on the result;
12: end while

The reward function should be designed to guide agents' behaviours in reinforcement learning. The most intuitive way to design a reward function in the grid world is to give a positive reward for each step so that the agent tries to survive as long as possible. However, we find this inappropriate for the game of Tron as the positive reward for each step results in an unexpected side effect [33].
Since we train agents through self-play, an agent rewarded for each step tries to avoid getting close to the opponent during the self-play games. As a result, the agent trained with a positive reward for each step tends to survive more steps but exhibits poorer performance against the other agents compared to the agent trained with a negative reward for each step. We notice that a negative step reward results in a more aggressive agent that plays better against the other agents on average, while a positive step reward results in a more defensive agent that plays worse against the other agents. For this reason, we opt to use a negative reward for each step to train our neural-network agents in our experiments.
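The reward design discussed in this subsection can be sketched as a per-step penalty combined with a terminal reward that depends on the game result. All numeric values below (`step_reward=-0.01` and the win, loss, and draw rewards) are hypothetical placeholders chosen for illustration; the actual values are not specified in the text.

```python
def game_result(own_path_length, opponent_path_length):
    """Result from one agent's view once the agents are separated."""
    if own_path_length > opponent_path_length:
        return "win"
    if own_path_length < opponent_path_length:
        return "loss"
    return "draw"


def episode_rewards(num_steps, result, step_reward=-0.01, terminal=None):
    """Per-step rewards for one episode.

    Every step receives `step_reward` (negative here, which discourages
    stalling), and the final step additionally receives the terminal reward
    for the game result. All values are hypothetical.
    """
    terminal = terminal or {"win": 1.0, "loss": -1.0, "draw": 0.0}
    rewards = [step_reward] * num_steps
    if rewards:
        rewards[-1] += terminal[result]
    return rewards


result = game_result(16, 7)
rewards = episode_rewards(3, result)
```

Switching the sign of `step_reward` reproduces the alternative design described above, in which the agent is paid for every step it survives.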

D. NEURAL NETWORK ARCHITECTURE
Our neural network consists of seven convolutional layers followed by two linear (affine transformation) layers with residual connections as shown in Figure 3. We use the same neural network for both non-stationary and stationary agents. Finally, a policy network and a value network consist of two and three linear layers, respectively, with the Mish activation function [34] in between. The linear layer right after the average pooling layer has a size of 256 when the grid size is 6 × 6, and a size of 576 when the grid size is 8 × 8 or 10 × 10. The numbers in the convolutional layers in Figure 3 are the kernel size, the number of input channels, and the number of output channels. Note that the input for a stationary environment consists of two channels (due to the absence of the opponent) instead of the three channels used for a non-stationary environment. The padding size of all convolutional layers and pooling layers is $\lfloor \text{kernel size}/2 \rfloor$.
The matrices provided in Figure 4 show how we encode states of the game as matrices to feed the neural network. An input matrix for a non-stationary environment consists of three channels, where the first channel encodes the locations of side walls, the second encodes the current and previous (already filled with walls) locations of the current agent, and the last encodes the current and previous locations of the opponent. On the other hand, an input matrix for a stationary environment consists of only two channels as there is no need to encode any information about the opponent. In the input matrices, a wall is represented by one, an empty cell by zero, and the location occupied by an agent by ten.
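The effect of padding by half the kernel size can be checked with the standard output-size formula for a convolution. The helper below is a small illustrative sketch with a hypothetical name; it assumes a stride of one unless stated otherwise and does not reproduce the actual network.

```python
def conv_output_size(size, kernel, stride=1, padding=None):
    """Spatial output size of a convolution or pooling layer.

    Uses floor((size + 2 * padding - kernel) / stride) + 1, with the padding
    defaulting to kernel // 2.
    """
    if padding is None:
        padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


# With odd kernels and stride one, the grid size is preserved.
preserved = [conv_output_size(n, k) for n in (6, 8, 10) for k in (1, 3, 5)]
```

For odd kernel sizes this choice of padding keeps the spatial size of the board unchanged from layer to layer, which is what makes residual connections between convolutional layers straightforward.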
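The channel encoding described above can be sketched as follows. Several details are assumptions made for the example: the board is surrounded by a one-cell border of side walls, coordinates are zero-based (row, column) pairs inside the playable area, and the function name `encode_state` is hypothetical.

```python
WALL, EMPTY, AGENT = 1, 0, 10


def encode_state(size, own_trail, own_head, opp_trail=None, opp_head=None):
    """Encode a game state as a list of channels (nested lists of integers).

    Returns three channels for a non-stationary state, or two channels when
    no opponent is given (stationary state).
    """
    n = size + 2  # assumed one-cell border of side walls around the board

    def blank():
        return [[EMPTY] * n for _ in range(n)]

    side_walls = blank()
    for i in range(n):
        side_walls[0][i] = side_walls[n - 1][i] = WALL
        side_walls[i][0] = side_walls[i][n - 1] = WALL

    def agent_channel(trail, head):
        channel = blank()
        for r, c in trail:  # previously visited cells are walls
            channel[r + 1][c + 1] = WALL
        channel[head[0] + 1][head[1] + 1] = AGENT  # current location
        return channel

    channels = [side_walls, agent_channel(own_trail, own_head)]
    if opp_head is not None:
        channels.append(agent_channel(opp_trail or [], opp_head))
    return channels


non_stationary = encode_state(6, [(0, 0)], (0, 1), [(5, 5)], (5, 4))
stationary = encode_state(6, [(0, 0)], (0, 1))
```

Dropping the opponent channel in the stationary case is what allows the stationary agent to ignore the opponent's region entirely.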

IV. EXPERIMENTS
In this section, we first describe the experimental setup and present the experimental results and discussion for various experiments conducted to verify the proposed idea.

A. IMPLEMENTATION DETAILS
The experiments are performed on a computer with a 6-core Intel Core i7-8700K 3.70GHz CPU, 48GB RAM, and an NVIDIA GeForce GTX 1080 Ti GPU. We use Python 3.7.6, PyTorch 1.7.1, CUDA 11.1 running on Ubuntu 18.04.5 LTS.

B. BASELINES
We measure the performance of the proposed algorithm (named P agent) against several baseline approaches including the state-of-the-art Tron algorithm.
Section II-D. The depth of minimax search is fixed to two. In stationary environments, the length of the longest path is approximated by a depth-limited search (DLS) that maximizes the remaining area. The depth of search is limited to one.
5) a1k0n agent and wall-following heuristic (named A agent): How the a1k0n agent plays in non-stationary environments is described in Section II-E. The depth of minimax search is fixed to two. In stationary environments, the longest path is approximated by DLS with a wall-following bonus added. If the maximum remaining area is the same for several possible directions, the agent chooses the direction with the most walls around it. The depth of search is limited to one.
6) Neural network agents based on the area of the agent's region (named RP agent and RG agent): The non-stationary agent uses a neural network trained with the remaining region as the measurement, without any path searching process. Therefore, this agent is trained with values greater than or equal to the actual longest possible path. This causes no problem in training, but it is unfair to the other agents during evaluation. Therefore, we use the pretrained approximation (RP agent) and the greedy approximation (RG agent) in evaluation.
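The depth-one search with a wall-following tie-break described for the A agent in stationary environments can be sketched as below. This is an illustrative reading of the description, not the evaluated code: the board is a list of lists of booleans (`True` for a wall, including the agent's own cell), leaving the board counts as a wall, and the function names are hypothetical.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def blocked(walls, cell):
    """True if `cell` is outside the board or already a wall."""
    r, c = cell
    return not (0 <= r < len(walls) and 0 <= c < len(walls[0])) or walls[r][c]


def flood_area(walls, start):
    """Number of empty cells connected to the empty cell `start`."""
    seen, stack = {start}, [start]
    while stack:
        r, c = stack.pop()
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            if nxt not in seen and not blocked(walls, nxt):
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)


def wall_following_move(walls, pos):
    """Pick the move with the largest remaining area; break ties by the
    number of walls adjacent to the destination cell. Returns None if the
    agent is trapped."""
    best_move, best_key = None, None
    for name, (dr, dc) in MOVES.items():
        cell = (pos[0] + dr, pos[1] + dc)
        if blocked(walls, cell):
            continue
        area = flood_area(walls, cell)
        adjacent_walls = sum(blocked(walls, (cell[0] + a, cell[1] + b))
                             for a, b in MOVES.values())
        key = (area, adjacent_walls)
        if best_key is None or key > best_key:
            best_move, best_key = name, key
    return best_move


# The agent at (1, 1) chooses between a 3-cell region and a 2-cell region.
board = [[False, True, False],
         [False, True, False],
         [False, True, True]]
chosen = wall_following_move(board, (1, 1))
```

Preferring cells with more adjacent walls keeps the agent close to the boundary of its region, which tends to avoid cutting the remaining area into pieces.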

C. EXPERIMENTAL METHODS
We measure how well the six methods other than region approximation approximate the ratio of the longest path to the area on the game boards of stationary environments generated by the method described in Algorithm 1. The backtracking method is measured only in 6 × 6 games due to its exponential computation time. The same experiment is also carried out in stationary environments generated during real plays. In this case, all games are self-plays of agents with the same approximation scheme (e.g., the ratio of longest path to area of the pretrained approximation is measured in P agents' self-plays, and that of the greedy approximation is measured in G agents' self-plays). Each match consists of 10,000 games, and the average is obtained by dividing the total longest path by the total area. The performance of each agent in non-stationary environments is measured in a round-robin tournament [11], [16], excluding matches between identical agents. Each agent plays 1,000 games per match, including the match against the B agent in 6 × 6 games.
In randomly generated stationary environments, the ratio of the approximated longest path to the area for each agent is shown in Figure 5. First of all, we can see that the algorithm closest to the backtracking search results is the wall-following heuristic.
The pretrained approximation is also close to backtracking, and we can see that neural networks approximate the longest path problem, which is an NP-hard problem, well enough. Interestingly, the greedy approximation, which has lower time complexity than the other methods, showed an approximation ratio of 0.9 or higher on all board sizes. The other approximation methods show relatively lower performance. However, the stationary environments generated during actual game play resulted in different outcomes. The wall-following approximation had a higher rate of filling regions than the backtracking agents. This reflects the distribution difference between the stationary environments generated by A agents and those generated by B agents. In other words, even though the stationary environments generated by the B agent have a lower ratio of longest path to area than those of the A agent, the actual remaining area is wider and the remaining longest path is longer. Additionally, the single approximation results were much higher than those of Fig. 6. In 8 × 8 games, the approximation ratio was higher than even that of the pretrained approximation, as the S agent was well trained on stationary environments generated by real games, but never learned about randomly generated stationary environments. From this experiment, we can see that the pretrained neural network trained through the random stationary environment generation algorithm shows relatively good performance. However, the degradation caused by the distribution difference from stationary environments generated in real games leaves room for improvement.

b: Results of the round-robin tournament
The results of the round-robin tournament with the proposed algorithm and the baseline algorithms are provided in Tables 1, 2, 3 and 4. Each agent plays against each of the other agents 1,000 times. The number of wins of each agent during the tournament is presented. Note that the number of draws can be obtained by subtracting the numbers of wins of the two agents from 1,000. The P agent recorded more wins than losses against all agents in 6 × 6 and 8 × 8 games, and the A agent recorded more wins than losses in 10 × 10 games. The G agent achieved slightly lower performance than the P and A agents. The RP and RG agents lost more often than they won in all matches. The P agent was defeated by the A agent in 10 × 10 games because almost the same model was used for all board sizes. If the P agent has a deeper model structure and can be trained from the approximate results of an improved pretrained model, it is expected to be able to win against the A agent, which is a strong baseline. We also compare the performance of the considered agents against the backtracking agent, which is trained with the ground-truth longest survival path only on the 6 × 6 grid due to its intrinsic complexity. The experimental results are shown in Table 5. The winning percentage is calculated without taking draws into account. The proposed P agent showed the highest winning percentage, and the A agent showed a higher winning percentage than the other agents except the P agent. The G agent achieved a slightly lower winning percentage than the A agent. On the other hand, the RP and RG agents showed the lowest winning percentages. This means that the non-stationary agent certainly plays better when trained with a more accurate longest path approximation heuristic.
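The two bookkeeping rules used in this subsection, recovering the number of draws from the win counts and computing a winning percentage that ignores draws, amount to the following small helpers. The function names are hypothetical, and the counts in the example are made-up placeholders rather than results from the tables.

```python
def draws(total_games, wins_a, wins_b):
    """Number of drawn games in a match between agents a and b."""
    return total_games - wins_a - wins_b


def winning_percentage(wins, losses):
    """Winning percentage without taking draws into account.

    Returns None when every game was a draw, since the ratio is undefined.
    """
    decided = wins + losses
    if decided == 0:
        return None
    return 100.0 * wins / decided


# Placeholder counts for illustration only.
example_draws = draws(1000, 300, 100)
example_percentage = winning_percentage(300, 100)
```

Excluding draws makes the percentages of the two agents in a match sum to 100, so a value above 50 directly indicates which agent won more decided games.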

d: Runtime complexity of algorithms
Since Tron is a real-time game, the runtime complexity of algorithms playing the game of Tron is a very important factor. In order to analyze and compare the runtime complexity of the algorithms, we present the time each agent takes to determine an action in Figure 7. It is readily verifiable that the neural network agents and the greedy approximation are more time-efficient than the other algorithms. In non-stationary states, agents using neural networks spent almost the same amount of time to determine actions regardless of grid size. In addition, the elapsed time increases only slightly even when the grid size increases to 10 × 10, thanks to the parallel computation of the GPU. On the other hand, the time that the A agent and the M agent take to determine an action increases noticeably, particularly in non-stationary environments. Even in stationary environments, the elapsed time of the wall-following heuristic increases much more than that of the pretrained approximation. The fastest longest path approximation heuristic is the greedy approximation, whose time complexity is linearly proportional to the number of cells. Although the wall-following heuristic shows the best approximation performance, especially on larger grids, the pretrained or greedy approximation can be a competitive method considering the time complexity of the algorithms. Considering that Tron is normally implemented on much larger grid sizes, the scalability of our algorithm can be a huge advantage when deployed in a real-time game environment.

V. CONCLUSIONS
In this work, we propose a modular RL approach to the game of Tron that decomposes a game into a first, non-stationary phase and a second, stationary phase. We train two separate agents playing in the two phases to reduce the training complexity of the game. We find that the agent trained to play in stationary states exhibits a competent approximation performance for the longest path problem on the grid graph, which is known to be NP-complete, compared to the other agents using various longest path approximation heuristics. In particular, our approach demonstrates better performance than the a1k0n algorithm, the Google Tron AI Challenge winner, on 6×6 and 8×8 grids. On the 10×10 grid, our approach does not reach the level of the a1k0n algorithm but shows potential in terms of computational cost. We expect that the modular RL approach can be effectively applied to many complicated real-life problems or games that can be seen as combinations of multiple tasks to achieve the ultimate goals. We also believe that our algorithm can be improved by training the stationary agent on more realistic random stationary states, as we suspect that the main reason for the performance degradation is the (observed) difference between randomly generated stationary states and stationary states encountered in real game plays. We leave the problem of generating more realistic stationary states using generative networks such as generative adversarial networks (GANs) or variational autoencoders (VAEs) to train the stationary agent better as a follow-up research idea.