A New Open-Source Off-road Environment for Benchmark Generalization of Autonomous Driving

Recently, deep neural networks have greatly improved autonomous driving. However, because a great deal of training data is required, most studies rely on simulators. Generalization of driving systems is key to safety, but existing simulated environments feature only small variations under favorable conditions and thus cannot be used to benchmark generalization. We therefore developed a new open-source (OpenAI Gym-like) off-road environment featuring differently structured forests, plateaus, deserts, and snowfields. The dynamic topographical structures make the off-road environment a very challenging generalization problem, so it can precisely evaluate the generalization ability of autonomous driving. Additionally, we propose an evaluation method based on the success rate of driving tasks, enabling effective measurement of driving ability. Furthermore, we evaluate existing end-to-end driving methods in our off-road environment. The results show that these methods lack generalization ability and fail in unseen environments. Our off-road environment can help autonomous driving researchers develop better, generalizable driving systems. Unreal Engine-level assets and code are available at github.com, and we briefly introduce our model at https://www.youtube.com/watch?v=YIKd5bcHcGA.


I. INTRODUCTION
Recently, deep neural networks (DNNs) [1] have greatly improved autonomous driving [2]. DNNs approximate complex functions over high-dimensional sensory data such as images. Autonomous driving research has developed in two directions: modular pipelines, which process data sequentially [3], [4], and end-to-end methods, which directly produce motion controls [5]. A modular pipeline features an NN-based perception module, a rule-based planner, and a maneuver controller; end-to-end methods are trained via imitation learning (IL) or reinforcement learning (RL).
However, autonomous driving systems must generalize to adverse conditions and wide terrain/weather variations. It is important to distinguish valuable features in sensory inputs under various conditions because generalization directly affects safety. Additionally, adversarial attacks may slightly change the inputs, confusing the predictions [6], [7]. Such attacks craft samples very similar to the original samples that nevertheless change the (correct) predictive labels of the trained model. Several approaches are used to test the robustness of autonomous vehicles to such attacks: adversarial test cases can be generated via simulation [8] or by modifying samples [9]. In addition, many researchers have studied training autonomous driving in simulation and transferring it to the real world [10], [11]. This task requires generalization ability because it is impossible to build simulators that exactly match the real world.
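As an illustration of the kind of small, targeted input change discussed above, the following sketch implements the fast gradient sign method (FGSM); it is our illustrative example, not a method from this paper, and the epsilon value is an assumption.

```python
import numpy as np

def fgsm_perturb(image, grad, eps=2.0 / 255.0):
    """FGSM-style perturbation: move each pixel a tiny step eps in the
    direction that increases the model's loss.  `image` is a float
    array in [0, 1]; `grad` is the loss gradient w.r.t. the image."""
    adv = image + eps * np.sign(grad)
    return np.clip(adv, 0.0, 1.0)  # keep the result a valid image
```

The perturbed image is visually near-identical to the original, yet such steps are often enough to flip a trained model's prediction.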
Although simulators are widely used for training and evaluating autonomous driving, open-source simulators are of limited use for benchmarking generalization. Several open-source simulators are available for car racing, such as TORCS [12], and for high-fidelity autonomous driving [13], [14], [15]. These simulators provide many environments on racing tracks and in urban cities, but the environments are similar to each other and lack adverse conditions and terrain/weather variety. Training and evaluating driving systems under such limited conditions may cause overfitting, so the systems generalize poorly to unseen environments.
Benchmarks are available that measure the generalization of RL. The poor generalization of RL was recently highlighted in several studies [16]; RL agents often fail to transfer even between very similar environments. The authors of [17] quantified RL generalization through procedurally generated environments, and the authors of [18] developed the ProcGen environment to benchmark RL generalization. However, ProcGen is unsuitable for benchmarking generalization in autonomous driving: it consists of simple two-dimensional video game tasks that differ greatly from driving. It is more natural to evaluate driving systems with driving tasks.
We present an open-source integrated framework for benchmarking generalization in autonomous driving. Our framework supports the end-to-end process of driving experiments: training, evaluating, analyzing, and visualizing. Fig. 1 represents the framework's components. Our contributions are as follows.
First, to effectively benchmark generalization, we utilize off-road environments that provide diverse and adverse conditions. The off-road maps feature forests, plateaus, deserts, snowfields, and combinations thereof. Each offers visual diversity and varied structures. This graphical and structural diversity makes it difficult for agents to generalize to an unseen environment, making the benchmark an effective measure of generalization.
Second, for accurate evaluation and analysis, we define tasks according to the traveled distance and propose a method for evaluating autonomous driving performance based on the success rate of these tasks. In addition, the framework offers visualization tools and rich information for detailed analysis: activation-map visualization and bird's-eye views are available, and the user can analyze an experiment through various measurements and driving errors.
Third, we show that current end-to-end driving methods generalize poorly to unseen environments, and we analyze the reasons for the failures through driving errors and activation-map visualization. For training driving systems, the framework provides various baselines for building end-to-end driving, such as IL, RL, and LBC (learning by cheating). We evaluate the generalization of these end-to-end driving methods in our environments; all of them failed to generalize to unseen off-road environments.

A. END-TO-END DRIVING
Importantly, an end-to-end driving system does not require complex, sequential data processing when steering, affording fast and intuitive decision-making [19]. Conditional IL (CIL) has been employed [20] to address a fundamental problem of IL: without additional input, the user cannot direct the agent, rendering such learning impractical. A CIL method is based on high-level commands that cause agents to turn at certain intersections. The authors of [21] developed an autonomous, vision-based obstacle avoidance system for off-road environments. Human driver data was employed for IL; the agents thus trained were exposed to various conditions to enhance generality, although only a few training sets were used. Agents engaged in online IL perform well in real-world experiments. Off-road environments are also considered in our paper, but we focus on measuring generalization ability.
The authors of [22] proposed an end-to-end method for race driving. They trained and tested their method on varied road structures, graphics, and physics, and showed that it generalizes well to unseen tracks. However, their experiments used the World Rally Championship 6 (WRC6) environment, which is neither intended for research use nor open source; thus, it is unsuitable for benchmarking generalization. Using multimodal inputs can improve end-to-end driving [23]: the authors trained their end-to-end agent with RGB images, depth images, and measurements, so multimodality is an important factor in driving simulators. In addition, uncertainty-aware driving through uncertainty prediction has been studied in end-to-end driving [24]; it measures the reliability of the agent's decisions, which is important for autonomous driving safety. Our environment provides varied graphical features and thus can also serve as a test bed for uncertainty-aware driving algorithms.

B. GENERALIZATION IN RL
RL (a form of machine learning) learns from rewards obtained by interacting with the environment and is frequently used for end-to-end driving. RL training commonly employs simulated environments; however, RL affords only poor generalization. Generalization problems can be divided into three categories: visual changes, differences in dynamics, and differences in content. Visual changes include changes in texture, background color, and object colors. The authors of [16] showed that slight changes in input images triggered catastrophic collapse in RL agents. To measure the generalization of such agents, the authors of [18] proposed a benchmark that uses procedural generation to create content; generalization ability depends on network architecture, stochasticity, and regularization. To improve visual generalization, many data-augmentation-based methods have been proposed. The authors of [25] developed a random data-augmentation network with good empirical results: a convolutional network processed both original and augmented images via a feature-matching loss, addressing the high-variance problem of domain randomization techniques. The authors of [26] applied various image augmentation techniques to deep RL (DRL); their augmented data method (RAD) was optimal in terms of sample efficiency and generalization. Simple DRL regularization using augmented images was used to build general RL agents [27]: the value function was regularized by averaging the value functions of various augmented images. Importantly, such data-augmentation-based methods involve no auxiliary losses and do not require world-model approximation.

A. OFF-ROAD ENVIRONMENT
We utilized CARLA, an open-source driving simulator for autonomous driving systems. CARLA provides various sensory inputs and a suite of functions for training autonomous vehicle systems. An overview of our off-road maps is shown in Fig. 2; the raw and rendered images are shown in Fig. 5. Additionally, various weather conditions are available (Fig. 4). We created eight off-road environments using Unreal Engine. Quixel Megascans assets (three-dimensional trees and rocks) were employed to draw the off-road maps (which are not city-like grids), and the maps were ported to a CARLA-compatible environment. No lanes or guardrails mark the road boundaries. The maps feature diverse landscapes and geographic features (Fig. 2). The details follow.
• Structures. The maps contain different lane layouts. Mountains 1-4 have no forks, so only lane-keeping can be measured; the desert, snowfield, plateau, and combined maps feature forks, so conditional policies can be trained and tested. Unlike city roads, off-roads are not standardized: they vary, as do their surroundings. Most city roads are gridded, but off-roads are not, so the agent must adjust its steering more precisely than in a city.
• Dynamic elevation. City roads rise or fall slowly; off-road slopes may change very rapidly. The off-road elevation is visualized in Fig. 3. In cities, throttle adjustments are necessary only when there are obstacles in the road or when turning; off-road, speed must be adjusted constantly. Additionally, off-road slopes affect the viewing angle, and thus the visible areas of sky and ground and the peripheral vision. The agent must be robust to such changes.
• Visual diversity. A real-world off-road environment involves many visual changes. The surrounding elements in deserts, snowfields, plateaus, and mountains are very different, and the colors are diverse. Additionally, the angles of the principal light sources and the illumination intensities vary. Mountains 1-4 feature forests, grasses of different textures, and roads. In the desert, most colors are red and yellow. The snowfield is entirely white; the trees and cliffs are snow-covered. The highlands (a plateau) feature large and small roadside rocks similar in texture and color to the bare ground. The combined field features forests, deserts, snowfields, and plateaus together. The agent must recognize visually different roads and obstacles, and weather changes add further diversity.
• Real-world similarity. The off-road benchmark is very similar to real off-road environments. Most autonomous driving research focuses on cities; however, off-road autonomous driving is required, for example, in rally driving.
Real off-road driving is dynamic and dangerous, with high risks of injury or damage. Therefore, driving learned in a simulator must transfer to real-world off-road conditions.
The measurements are listed in Table 1; these include CARLA server data and the waypoints. Agents were trained using these data, which are also included in the bird's-eye views showing the location of the agent, the waypoints, and other current data. A sample bird's-eye image is shown in Fig. 6: the blue points are all waypoints (left/middle/right), the orange points are previous waypoints, and the green points are the next waypoints the agent should pass. The brown line represents the current heading of the car, and the pink line represents the waypoint heading, i.e., the angle of the segment between the previous and next waypoints. Further details are in Table 1. Additionally, Table 2 lists the road lengths in each environment, and Table 3 shows the environmental parameters used in this paper.
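The waypoint heading described above can be sketched as follows; the function names and the (x, y) tuple representation are our illustrative assumptions, not the framework's API.

```python
import math

def waypoint_heading(prev_wp, next_wp):
    """Angle (radians) of the segment from the previous waypoint to the
    next waypoint, measured in the world frame.  Waypoints are (x, y)
    tuples; names are illustrative, not from the framework."""
    dx = next_wp[0] - prev_wp[0]
    dy = next_wp[1] - prev_wp[1]
    return math.atan2(dy, dx)

def heading_error(car_yaw, prev_wp, next_wp):
    """Signed difference between the car's heading (the brown line) and
    the waypoint heading (the pink line), wrapped to [-pi, pi]."""
    err = waypoint_heading(prev_wp, next_wp) - car_yaw
    return math.atan2(math.sin(err), math.cos(err))
```

A steering controller can then drive this heading error toward zero as the agent passes each green waypoint.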

B. DRIVING BENCHMARK
We propose a performance evaluation benchmark to accurately measure driving-agent performance. Generalization ability is measured using various weather conditions and environments. Of the eight off-road environments and six weather conditions, the agents were trained using only "mountain 1" under "clear noon" weather. For simplicity, some available CARLA weather conditions were removed; the new (unseen) weather conditions are "clear sunset", "midrainy sunset", "midrainy noon", "wet cloudy noon", and "wet cloudy sunset". For each condition, 10 trials were performed; thus, 3,360 trials were run for each algorithm. In addition, we defined lane-keeping tasks, which require driving a certain distance while keeping the lane. The tasks increase in difficulty, with target mileages ranging from low to high. The seven tasks are: 100 m, 200 m, 300 m, 400 m, 500 m, 600 m, and a whole lap; the whole-lap task requires driving the entire map. The algorithms are evaluated through their success rates on these tasks. "Success" is defined as traveling the target distance from the starting point without failure; exiting the lane and moving too slowly naturally count as failures. Collisions can also cause failure in off-road environments: because there are many obstacles near the lane, the agent can collide with an obstacle before even leaving the lane. The start point is randomly selected for every episode. The evaluation procedure is given in Algorithm 1.
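The evaluation loop above can be sketched as nested iteration over maps, weather conditions, tasks, and trials, recording one result tuple per episode; the names and the `run_episode` callback signature are our assumptions, not the framework's API.

```python
import random

MAPS = ["mountain1", "mountain2", "mountain3", "mountain4",
        "desert", "snowfield", "plateau", "combined"]
WEATHERS = ["clear noon", "clear sunset", "midrainy noon",
            "midrainy sunset", "wet cloudy noon", "wet cloudy sunset"]
TASKS = [100, 200, 300, 400, 500, 600, "whole lap"]  # target distances (m)
TRIALS = 10

def evaluate(run_episode):
    """run_episode(map, weather, task, start) -> failure mode or None.
    Returns all result tuples; 8 * 6 * 7 * 10 = 3,360 per agent."""
    results = []
    for m in MAPS:
        for w in WEATHERS:
            for t in TASKS:
                for _ in range(TRIALS):
                    start = random.random()  # random start point per episode
                    failure = run_episode(m, w, t, start)
                    results.append((m, w, t, failure))
    return results
```

Success rates per task are then simply the fraction of tuples with no failure mode.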

IV. AUTONOMOUS DRIVING EXPERIMENTS
IL, RL, and LBC were tested on the off-road benchmark; these methods have been used to build end-to-end models in recent autonomous driving studies. Previously, such models were tested on navigation or obstacle avoidance; as we focus on generalization, the agents were evaluated with a simple lane-keeping task. The agent received RGB images from a front-facing camera together with speed information, and it could steer, throttle, and brake. To learn rapidly, the throttle range was set to -1 to 1; if the value was negative, the brake was applied and the throttle was set to zero. The IL and RL training details follow. The network structures were identical, using the IMPALA [28] convolutional architecture, which affords both good generalization and computational efficiency [18]. The Adam optimizer was employed to optimize the weights. The input images were normalized with per-channel means (0.485, 0.456, 0.406) and standard deviations (0.229, 0.224, 0.225).
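A minimal sketch of the input normalization and the throttle/brake mapping described above; the normalization constants come from the text, while the function names and the exact brake mapping are our reading of the paper's scheme.

```python
import numpy as np

# Per-channel normalization constants given in the paper (ImageNet-style).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(image):
    """image: HxWx3 uint8 RGB -> normalized float32 array."""
    x = image.astype(np.float32) / 255.0
    return (x - MEAN) / STD

def split_throttle(value):
    """Map a single [-1, 1] network output to (throttle, brake):
    a negative value brakes with its magnitude and zeroes the throttle
    (our interpretation of the paper's description)."""
    if value >= 0.0:
        return float(value), 0.0
    return 0.0, float(-value)
```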

A. IMITATION LEARNING
IL builds an agent that imitates the inputs and actions of an expert [29], [30]; this is rapid if the data are sufficient. Many studies have applied imitation learning to driving tasks, and researchers have developed conditional imitation learning, which controls direction via high-level commands [31], [20]. A hard-coded expert was used to collect data. Two types of IL agents were trained: vanilla IL and randomized IL (RIL). The RIL was trained with various image augmentation methods for generalization. A total of 515,382 data tuples (RGB images, speeds, and actions) were collected and used as training data. Given the generally straight lane structures, most steer values were near zero; to resolve this imbalance, the steer-label distribution was made uniform by resampling. The IL and RIL models were trained with a learning rate of 10^-4 and a batch size of 60, for only one epoch each. Both were trained with the same loss function, the mean absolute error (MAE): L(θ) = (1/n) Σ_{i=1}^{n} |f(x_i; θ) − y_i|, where θ denotes the network weights, y_i the label (steer and throttle), x_i the input image, and n the number of samples.
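The steer-label balancing step can be sketched as histogram-based resampling; the bin count and oversampling strategy here are assumptions, since the paper does not specify them.

```python
import numpy as np

def resample_uniform_steer(steers, n_bins=21, seed=0):
    """Return indices that resample the dataset so the steering-label
    histogram is approximately uniform across `n_bins` bins in [-1, 1].
    Each non-empty bin is oversampled (with replacement) to the size of
    the largest bin.  A sketch; the bin count is an assumption."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    ids = np.digitize(steers, edges[1:-1])          # bin index per sample
    per_bin = [np.flatnonzero(ids == b) for b in range(n_bins)]
    per_bin = [p for p in per_bin if len(p) > 0]    # drop empty bins
    target = max(len(p) for p in per_bin)
    out = [rng.choice(p, size=target, replace=True) for p in per_bin]
    return np.concatenate(out)
```

The returned index array is then used to draw training batches, so near-zero steering values no longer dominate the gradient.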

B. REINFORCEMENT LEARNING
RL approximates an optimal policy using a DNN trained from rewards [32]. Recently, RL has achieved dramatic improvements in various domains, such as Go [33], StarCraft [34], and DOTA [35], and deep RL is often used to build autonomous driving systems [36], [37], [22]. The Asynchronous Advantage Actor-Critic (A3C) algorithm [38], a policy-based algorithm frequently employed for continuous control problems, was used to build the RL agent, which was trained for 250,000 steps. Various environmental measurements were utilized to build the reward function, which is a weighted sum of five terms: the distance to the next waypoint d (m), the distance to the center of the lane c (m), the angle of the car a (rad), the angle between the next and previous waypoints w (rad), and the speed v (m/s). The reward was clipped to [-1, 1].
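One plausible reading of this reward can be sketched as follows. The paper gives neither the weights nor the exact combination, so the weights `k`, the folding of the car angle a and waypoint angle w into a single heading-error term, and the sign conventions are all our assumptions.

```python
import math

def reward(d, c, a, w, v, k=(0.05, 0.5, 0.5, 0.02)):
    """Illustrative weighted-sum reward over the paper's five inputs:
    penalize distance to the next waypoint d and to the lane center c,
    penalize the mismatch between car heading a and waypoint heading w,
    and reward speed v.  Weights k are assumptions, not the paper's."""
    heading_err = math.atan2(math.sin(a - w), math.cos(a - w))
    r = -k[0] * d - k[1] * c - k[2] * abs(heading_err) + k[3] * v
    return max(-1.0, min(1.0, r))  # clipped to [-1, 1] as in the paper
```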
The hyperparameters of the RL model are shown in Table 4. Because the RL agent performed poorly with color image inputs and did not converge, it was trained with grayscale images. The RL loss function was as follows:

L_value = (R_t − V(s_t))²
L_entropy = −Σ_a π(a|s_t) log π(a|s_t)
L_RL = L_policy + c · L_value + d · L_entropy

where c and d are hyperparameters, the value-loss and entropy coefficients, respectively.
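The three loss terms can be sketched numerically as below. This is a generic A3C sketch, not the paper's code: the value loss is written as the squared error (the usual A3C form), and in `total_loss` the entropy enters as a bonus with a minus sign, the standard convention.

```python
import numpy as np

def a3c_losses(probs, values, returns, actions):
    """Batch A3C loss terms.  probs: (T, A) action probabilities
    pi(a|s_t); values, returns: length-T arrays V(s_t) and R_t;
    actions: taken action indices."""
    adv = returns - values
    logp = np.log(probs[np.arange(len(actions)), actions])
    l_policy = float(-(logp * adv).mean())       # adv treated as constant
    l_value = float((adv ** 2).mean())           # (R_t - V(s_t))^2
    entropy = float(-(probs * np.log(probs)).sum(axis=1).mean())
    return l_policy, l_value, entropy

def total_loss(l_policy, l_value, entropy, c=0.5, d=0.01):
    # Entropy acts as an exploration bonus, hence the minus sign;
    # c and d are the value-loss and entropy coefficients.
    return l_policy + c * l_value - d * entropy
```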

C. LEARNING BY CHEATING
Learning by cheating (LBC) [39] is a state-of-the-art method for building an end-to-end driving agent through two steps.
First, a privileged agent is trained through "cheating": it learns from privileged information provided by the environment. Second, a sensorimotor agent imitates the privileged agent. Since we already have a hard-coded expert, it is used in place of the trained privileged agent. As with IL and RIL, two types of LBC agents were trained: vanilla LBC and randomized LBC (RLBC), the latter with various image augmentation methods for generalization. LBC and RLBC were trained for 50 epochs with a learning rate of 0.0001.
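The randomized variants (RIL and RLBC) rely on random image augmentation; the paper does not list the exact operations, so the following sketch uses illustrative choices (brightness/contrast jitter, color shift, cutout) that are common in this line of work.

```python
import numpy as np

def random_augment(image, rng):
    """Apply a random photometric jitter plus cutout to an HxWx3 uint8
    RGB image.  The specific operations and ranges are assumptions."""
    out = image.astype(np.float32)
    # random brightness / contrast jitter
    out = out * rng.uniform(0.7, 1.3) + rng.uniform(-20.0, 20.0)
    # random per-channel color shift
    out = out + rng.uniform(-10.0, 10.0, size=3)
    # random cutout: zero a small rectangle
    h, w = out.shape[:2]
    y, x = rng.integers(0, h), rng.integers(0, w)
    out[y:y + h // 8, x:x + w // 8] = 0.0
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applying such transforms per training sample forces the policy to rely on structure rather than exact colors or textures, which is the intended generalization effect.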

V. RESULTS AND DISCUSSION
The results were analyzed in four categories: learning performance, generalization, Grad-CAM visualization, and driving errors. First, we focus on the learning performance of the algorithms and their evaluation results under the training conditions. Second, the results are analyzed in terms of generalization from three perspectives: new weather, new off-roads, and new off-roads with new weather. Finally, we analyze the reasons for generalization failure in more detail through Grad-CAM visualization and driving errors.
Learning performance. Fig. 7 shows the learning performance of the algorithms. The four IL-based algorithms (IL, RIL, LBC, RLBC) performed almost perfectly under the training conditions, whereas RL performed poorly. The cause appears to be sharp changes in the steering value: the IL-based methods maintained stable steering, but the RL agent changed the steering value very sharply. It steered aggressively to avoid negative rewards and approach the center of the lane, but the resulting overcorrection moved the car to the other side of the lane, again incurring negative rewards. Interestingly, LBC and RLBC learned very rapidly, achieving a very high reward (0.6) early in training. RIL learned more slowly than IL because of data augmentation. Notably, RLBC generally failed faster than LBC: both were trained for the same number of episodes, but RLBC experienced nearly 70,000 steps while LBC experienced nearly 90,000. An episode consists of a sequence of steps and ends when a driving error occurs, so an agent that was trained for the same number of episodes but experienced fewer steps failed faster and made more mistakes. This, too, can be explained by the effect of data augmentation.
Generalization. Table 5 presents the success rates under the various weather conditions; RIL generalized to the new weather conditions better than IL did, and the same phenomenon was observed for LBC and RLBC. Thus, input-image randomization improved generalization. The success rates by off-road environment are shown in Table 6. All agents performed well on the mountains but usually failed in the desert, snowfield, highlands, and combined field, because mountains 2-4 are more similar to the training environment than the other four environments are. Although mountains 3-4 are visually very similar to mountain 1, most agents failed to generalize to them in the whole-lap task: unlike mountain 1, mountains 3 and 4 have large hills, which make it difficult for the agent to control its speed.
Grad-CAM visualization. We prepared CNN activation maps with the aid of Grad-CAM [40], a gradient-based attention visualization method; the results are shown in Fig. 8. Under the training conditions, all algorithms usually focused on the road, RL most strongly. In the new environments, all algorithms seemed to focus on random points and failed to pick out important visual features such as the lane route. Surprisingly, all agents focused not only on the road but also on the sky. This may be explained by the structure of mountain 1, the training environment: there, most roads are surrounded by trees, so the lane route can be inferred from the shape of the sky. In terms of generalization, however, making decisions based on the shape of the sky is ineffective, because the sky's outline carries no information when no trees or other obstacles surround the road. Indeed, in the desert, snowfield, highlands, and combined field, the shape of the sky cannot be used to estimate the lane route, so generalization failure in these environments can be explained by inference based on the shape of the sky.
Driving errors. Driving errors are shown in Tables 7 and 8. Under the training conditions, IL, LBC, and RLBC succeeded in all episodes without any failure. In new off-roads, and in new off-roads with new weather, the typical failure was an out-of-lane episode (a steering-control failure). The failures on mountains 3-4 were mostly low-speed episodes. Speed is a matter of throttle control rather than steering control, and in practice simply increasing the throttle value can solve the speed problem, although too high a speed causes unstable driving. Additionally, in the desert, snowfield, highlands, and combined field, most episodes ended out of lane for all algorithms, meaning that generalization failure caused incorrect steering control.

VI. CONCLUSIONS AND FUTURE WORKS
Generalization is essential for safe autonomous driving. We developed an off-road benchmark to assess generalization, using different off-road maps that provide visually diverse graphics and varied structures. The results of our autonomous driving experiments indicate that the off-road environment precisely measures the generalization ability of autonomous driving and that current end-to-end driving methods lack generalization ability: the agents successfully generalize to unseen weather but completely fail on some off-roads. Further improvements in autonomous driving can be evaluated in this environment in terms of generalization, and the environment can also be used to build autonomous driving systems for off-road driving. To the best of our knowledge, this paper is the first to present a simulated off-road environment for autonomous driving; previous simulated environments were usually urban cities or racing tracks. The environment provides various realistic off-road settings, such as mountains, deserts, snowfields, and highlands. In the future, as RL is promising for end-to-end driving, various generalization methods for DRL will be employed. Generalization across different dynamics is also future work: a benchmark is required because dynamics can change in the real world. For example, a car may skid in the rain and move in an unintended direction, and autonomous driving systems must handle such situations.