Deep Reinforcement Learning for Autonomous Map-less Navigation of a Flying Robot

Flying robots are expected to be used in many tasks, such as aerial delivery, inspection inside dangerous areas, and rescue. However, their deployment in unstructured and highly dynamic environments has been limited. This paper proposes a novel approach for enabling a Micro Aerial Vehicle (MAV) equipped with a laser range finder and a depth sensor to autonomously navigate and explore an unknown indoor or outdoor environment. We built a modular Deep Q-Network architecture to fuse information from multiple sensors mounted onboard the vehicle. The developed algorithm performs collision-free flight in the real world while being trained entirely in a 3D simulator. Our method requires neither prior expert demonstration nor 3D mapping and path planning. It transforms the fused sensory data into a velocity control input for the robot through an end-to-end Convolutional Neural Network (CNN). The obtained policy was compared in simulation with the conventional potential field method. By simulating realistic sensor data, our approach attains zero-shot transfer from simulation to real-world environments never experienced during training. Several intensive experiments were conducted to show our system's effectiveness in flying safely in dynamic outdoor and indoor environments. The supplementary videos of the actual flight tests can be accessed at the following link: http://bit.ly/2SEw8dQ.


I. INTRODUCTION
Micro Aerial Vehicles (MAVs) are widely used in different applications in both military and civilian domains. Due to their agility and maneuverability, multirotors can hover and move freely in 3D space, making them suitable for many applications such as surveillance and rescue, aerial photography, and precision agriculture [1].
Autonomous navigation and collision avoidance are fundamental requirements for robotic aerial systems that must operate in unstructured and unknown open-world environments. Designing an autonomous navigation system with obstacle avoidance capability for MAVs is a long-established research problem [2]. It is difficult for an under-actuated aerial robot to autonomously perform specific missions in an unstructured and unknown environment without colliding with obstacles. Classical approaches are mainly based on 3D mapping, relative state estimation [3], [4], trajectory optimization, and path planning [5], [6]. Nevertheless, conventional methods have significant limitations and are prone to failure, especially in unstructured, dynamic environments where reliable state feedback is unavailable to aerial robots. In addition, building an accurate 3D map of the environment takes considerable computational power. To address these challenges, researchers have proposed Deep Learning (DL) based solutions that can be realized effectively and efficiently on current computer systems and Graphics Processing Units (GPUs) [7]. Deep learning's success in solving problems in artificial intelligence [8]–[10] motivates researchers in control and robotics to apply recent algorithms to common aerial robotics problems such as flight control [11] and obstacle avoidance. End-to-end Convolutional Neural Network (CNN) approaches have been proposed to directly generate control inputs from raw sensory data [12], reducing the complexity of classical methods. However, most applied DL algorithms are supervised and use structured data sets to train the models. These approaches require a large amount of data and manual labeling, which is time-consuming. To address these limitations, Reinforcement Learning (RL) has been merged with DL in the past few years, leading to a new research area called Deep Reinforcement Learning (DRL).

FIGURE 1: The developed MAV platform.
Examples include the recent DeepMind algorithms: the Deep Q-Network (DQN) and its variants such as Double DQN and Dueling DQN, and, more recently, policy-gradient algorithms such as Proximal Policy Optimization (PPO) and the Asynchronous Advantage Actor-Critic (A3C) for continuous action spaces [13]–[15].
These algorithms have been successfully tested in gaming environments and have shown better-than-human performance. DRL algorithms are a powerful and promising tool for automatically mapping high-dimensional sensor information to robot motion commands without ground-truth references. They require only a scalar reward function to motivate the learning agent, which finds the best action for each given state through trial-and-error interaction with the environment. This paper presents a complete learning system for autonomous MAV navigation and obstacle avoidance in real indoor and outdoor environments. The system uses a DQN with a new CNN architecture to control the MAV's heading and forward motion. The collision-free policy was learned in a simulation environment designed in the Gazebo simulator and then deployed directly on a real MAV without any tuning. Our system uses a Pixhawk autopilot for flight control, with two-dimensional Lidar data and a depth camera as inputs to the algorithm. It allows the MAV to follow predefined target points in an outdoor environment and avoid obstacles by switching between mission-flight and DQN modes. All algorithms run in real time onboard the MAV on an NVIDIA Jetson TX2 GPU. Testing was performed in indoor and outdoor environments, both in simulation and in the real world.
The remainder of this paper is organized as follows. Section II introduces related work. Section III describes the aerial robot platform and software architecture. Section IV describes the developed algorithms and the modular architecture. Section V presents the simulation results. Section VI presents the real-time experiments, and Section VII concludes the paper.

II. RELATED WORK
Techniques for autonomous navigation can be classified into map-based and map-less methods. Map-based methods require a global or local map of the environment during flight to make navigation decisions (e.g., [16]–[19]). Map-less navigation concentrates on autonomous navigation without building a map of the environment. It uses computer vision techniques such as optical flow detection and feature matching on an input image, possibly fused with other sensors like Lidar (e.g., [20]–[22]). A survey of visual navigation methods can be found in [23]. Our methodology is map-less. Unlike other methods, it needs neither a prior map of the environment nor supervised training or human demonstration [24]–[26].
Deep Neural Networks (DNNs) have shown promising results on complex autonomous navigation problems with obstacle avoidance, using collected data sets to train the networks. DNNs have been successfully applied to autonomously following trails in an unstructured forest environment using a monocular camera [27], [28]. A stereo vision algorithm was presented for high-speed navigation in cluttered environments [29]. A CNN was used to learn an end-to-end navigation policy by imitating expert demonstrations [30], [31]. Autonomous indoor flight for MAVs was achieved from a single image by classifying the environment using deep learning modules and then estimating the desired flight direction (left, right, center) [32], [33].
However, these works need a vast data set (data-driven approaches) or human expert demonstrations to obtain the desired policy for the predefined task. Moreover, the complexity of acquiring such data sets keeps their adoption in aerial robotics limited. Researchers merged deep learning with reinforcement learning to overcome these problems, leading to DRL, in which interest is growing since it shows excellent results in many video games [34]. It needs neither a data set nor human demonstrations; the agent tries to maximize the accumulated reward from interaction with the environment by trial and error.
Many studies have applied DRL to autonomous navigation tasks on different robotics platforms [35]–[37]. Lie [38] showed that a mobile robot could be trained end-to-end through an asynchronous DRL map-less motion planner. Zhang [39] used the A3C algorithm [40] and semantic segmentation to close the gap between simulation and the real world; the obtained control policy was tested successfully on a ground robot. Mirowski [41] formulated goal-driven navigation, with auxiliary depth prediction and loop-closure classification tasks, as a reinforcement learning problem; his approach showed that the robot could avoid obstacles in a complex 3D maze environment. Another study [42] addressed autonomous map-less navigation in a crowded scene using a policy-gradient-based reinforcement learning algorithm; the obtained collision-free policy was tested on different mobile robotics platforms to perform collision-free navigation in the real world.
There are very few works on applications of DRL to aerial robotics. One study [43] addressed the autonomous landing of a UAV on a moving object. Other studies [44], [45] presented a method to control a quadrotor with a neural network trained using reinforcement learning techniques. Abhik Singla [46] presented a deep recurrent Q-network with temporal attention for obstacle avoidance of a UAV in an indoor environment. End-to-end DRL was merged with expert data for obstacle avoidance based on monocular images [47]. The work in [48] presents an actor-critic method that allows UAVs to execute navigation tasks in large-scale complex environments. As the method was tested only in simulation environments, there is no guarantee that it will work in the real world.
Recent research works such as [49], [50] are still limited and only work under specific conditions, such as certain lighting conditions (day/night), or they need data collected by an expert pilot. These limitations motivated us to develop a new system that can work in different environments. We believe this is the first time sensor-fusion information has been used to learn an autonomous navigation strategy for an aerial robot through end-to-end DRL.

III. ROBOT PLATFORM AND SYSTEMS DESCRIPTION
In this work, a quadrotor MAV was used as an experimental platform to validate our algorithms. This system is subject to many physical effects, such as aerodynamics, gravity, gyroscopic effects, friction, and moments of inertia. The configuration of the quadrotor is shown in figure 1.
A description of the system's main equations of motion is required to understand the MAV's dynamics. The derivation of these equations relies on the following hypotheses [51]:
1) The quadrotor body is rigid and symmetric.
2) The propellers are rigid.
3) Thrust and drag forces are proportional to the square of the rotor speeds.
4) The center of mass and the origin of the quadrotor's coordinate system coincide.
Two reference frames were used: the earth reference frame R_E, defined by axes x_E, y_E, z_E with the z_E axis pointing upward, and the body-fixed frame R_b, defined by axes x_b, y_b, z_b. With roll, pitch, and yaw angles φ, θ, ψ, the rotation matrix from the body frame to the earth frame is

R = | cψcθ    cψsθsφ − sψcφ    cψsθcφ + sψsφ |
    | sψcθ    sψsθsφ + cψcφ    sψsθcφ − cψsφ |
    | −sθ     cθsφ             cθcφ          |     (1)

where s(.) and c(.) stand for the sin(.) and cos(.) functions, respectively. Based on the Newton-Euler formulation, the translational and rotational dynamics are written in equation 2:

ẍ = (cφsθcψ + sφsψ) U_1/m
ÿ = (cφsθsψ − sφcψ) U_1/m
z̈ = −g + (cφcθ) U_1/m
φ̈ = [(I_yy − I_zz) θ̇ψ̇ + U_2]/I_xx
θ̈ = [(I_zz − I_xx) φ̇ψ̇ + U_3]/I_yy
ψ̈ = [(I_xx − I_yy) φ̇θ̇ + U_4]/I_zz     (2)

where U_i (i = 1, 2, 3, 4) are the altitude, roll, pitch, and yaw control inputs, m is the mass, g is the gravitational acceleration, and I_xx, I_yy, I_zz are the moments of inertia about each axis. The complete architecture of the control system is presented in figure 2. The deep Q-network plays the key role in intelligent guidance and decision making by sending the suitable heading rate v_yaw and forward velocity v_x to the flight controller. To guarantee a flexible and safe hardware platform, we designed the system, including all the sensory systems, using 3D CAD software; this reduces time and cost and increases the design's accuracy and reliability. Our hardware configuration is shown in figure 1. The developed algorithm uses a forward-facing Intel RealSense R200 USB RGB-D camera, which provides depth information up to 4 m at 30 Hz, and a Hokuyo UST-10LX scanning laser range finder with a 270-degree field of view, a maximum detection distance of 30 m, and an update rate of 30 Hz. The proposed algorithm runs on an Nvidia Jetson TX2.
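As an illustration of the rotational part of equation 2, the following sketch integrates the body angular rates one explicit-Euler step at a time. The inertia values and time step are illustrative placeholders, not the actual platform parameters.

```python
# Minimal explicit-Euler sketch of the rotational dynamics in equation 2.
# Inertia values below are placeholders, not the real platform's parameters.
IXX, IYY, IZZ = 0.0082, 0.0082, 0.0149  # moments of inertia [kg m^2]

def rotational_step(phi_dot, theta_dot, psi_dot, U2, U3, U4, dt=0.001):
    """Advance the body angular rates (roll, pitch, yaw) by one step dt."""
    phi_ddot = ((IYY - IZZ) * theta_dot * psi_dot + U2) / IXX
    theta_ddot = ((IZZ - IXX) * phi_dot * psi_dot + U3) / IYY
    psi_ddot = ((IXX - IYY) * phi_dot * theta_dot + U4) / IZZ
    return (phi_dot + dt * phi_ddot,
            theta_dot + dt * theta_ddot,
            psi_dot + dt * psi_ddot)
```

With zero inputs and zero rates the state stays at rest, while a pure roll torque U_2 only accelerates the roll rate, matching the decoupled structure of equation 2 at hover.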
The developed MAV can be operated in indoor and outdoor environments. It is equipped with a forward-facing fisheye Intel RealSense Tracking Camera T265 module that provides a V-SLAM algorithm for robust state estimation: the ground's visible features are used to determine the MAV's position, ground velocity, and orientation, which are sent to an Extended Kalman Filter (EKF) running on the Pixhawk to be fused with other sensor data (e.g., IMU data). For accurate altitude feedback in indoor environments, we used a Lidar-Lite V3 laser range finder. The Jetson TX2 onboard companion computer was flashed with JetPack 3.1 (the Jetson SDK), and the Robot Operating System (ROS Kinetic) [52] was installed for easy hardware interfacing. We adopted Keras as the deep learning library with TensorFlow as the backend; OpenCV was used for depth-image processing. Our system uses the following ROS drivers: the RealSense camera package for interfacing the R200 and T265 USB cameras, the URG node for reading Hokuyo Lidar data, and MAVROS for communicating with the autopilot.
The human pilot can select between two flight modes, auto and manual, using a radio transmitter switch. If auto mode is selected and there is no obstacle within 2 m of the MAV in an outdoor environment, mission flight mode is activated to reach specific target points. If an obstacle comes within 2 m of the MAV, the DQN flight mode is activated. This mode uses fused sensor data from the depth camera and Lidar measurements, making the drone fully autonomous and able to avoid obstacles.
After the obstacle avoidance maneuver finishes, mission flight mode is activated again to correct the MAV's path and head it towards the target points. In case of an emergency, the pilot can intervene at any time by switching to manual flight mode, as shown in figure 3. Simulation environments were developed using the Gazebo simulator to train the agent realistically. All the sensors were simulated using their real specifications. In addition, PX4 provides a Software In The Loop (SITL) simulation [53], which we used for training.
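The mode-switching logic described above can be sketched as a small arbiter. The function and mode names here are hypothetical; only the 2 m threshold and the pilot-override behavior come from the text.

```python
# Hypothetical flight-mode arbiter mirroring the switching logic above.
SAFE_DISTANCE = 2.0  # metres, from the text

def select_flight_mode(pilot_switch, min_obstacle_distance):
    """Return the active flight mode given the pilot switch and sensor data."""
    if pilot_switch == "manual":
        return "MANUAL"           # the pilot always has priority
    if min_obstacle_distance < SAFE_DISTANCE:
        return "DQN"              # reactive obstacle avoidance mode
    return "MISSION"              # head toward the next target point
```

In operation, this arbiter would be evaluated at every control cycle, so the MAV falls back to mission mode as soon as the avoidance maneuver clears the 2 m range.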

IV. PROPOSED METHOD
Many classical autonomous navigation methods require prior knowledge of the obstacles' locations or a map of the environment. Our algorithm does not require mapping for navigation. This section describes a map-less autonomous navigation technique for a MAV. The goal is to find an optimal control input U_t = π_cf(s_t), where s_t is the observation from the fused sensory data at each time step t. The control input U_t allows the MAV to avoid obstacles during flight by following the optimal policy π_cf.

A. PROBLEM FORMULATION
The MAV navigation problem was formulated as a Markov Decision Process (MDP), where the MAV interacts with the environment using an RGB-D camera and a Lidar range finder. A DQN with a new CNN architecture was used to solve this problem. The proposed modular architecture contains a Collision Awareness Module (CAM) and a Collision-Free Control Policy Module (CFCPM). The CAM is responsible for sensor fusion and generates the robot observation s_t. The CFCPM takes the observation s_t as input and chooses an action a ∈ A according to the collision-free policy π_cf.

B. COLLISION AWARENESS MODULE
The main objective of the CAM is to process and fuse the sensory data. It then generates the observation s_t, which is passed to the CFCPM, as presented in figure 4. To reduce the processing time, the field of view of the Lidar was limited to 90 degrees, which corresponds to 360 laser rays. Using these rays, a binary image of size (90, 90) pixels is created by concatenating the rays vertically. This image contains useful 2D distance information about obstacles within the selected field of view. The depth camera detects small obstacles from which the Lidar rays do not reflect. The raw depth image was resized to (90, 90) pixels and then converted to a binary image to close the gap between simulation and the real world. By combining both sensors, both small and distant obstacles can be detected. The obtained images were fused to generate the observation s_t by concatenating two consecutive images from the Lidar and two consecutive processed depth images. Finally, the total observation s_t, of size (90, 90, 4), is passed to the CFCPM.
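A minimal sketch of the CAM preprocessing might look as follows. The exact tiling of the Lidar rays into the (90, 90) image, the binarisation threshold, and the nearest-neighbour resize (standing in for the OpenCV pipeline mentioned earlier) are assumptions.

```python
import numpy as np

IMG = 90            # side length of each channel image
RANGE_THRESH = 2.0  # assumed binarisation distance [m]

def lidar_to_image(ranges):
    """Turn 360 range readings into a (90, 90) binary image.
    Downsampling to 90 columns and tiling the rows is one plausible
    reading of 'concatenating the rays vertically'."""
    hits = (np.asarray(ranges) < RANGE_THRESH).astype(np.uint8)
    row = hits[::4]                    # 360 rays -> 90 columns
    return np.tile(row, (IMG, 1))      # repeat the row to fill 90 rows

def depth_to_image(depth):
    """Nearest-neighbour resize of a depth map to (90, 90), then binarise."""
    h, w = depth.shape
    ri = np.arange(IMG) * h // IMG
    ci = np.arange(IMG) * w // IMG
    small = depth[np.ix_(ri, ci)]
    return (small < RANGE_THRESH).astype(np.uint8)

def build_observation(lidar_t, lidar_t1, depth_t, depth_t1):
    """Stack two consecutive Lidar and two depth images -> (90, 90, 4)."""
    return np.stack([lidar_to_image(lidar_t), lidar_to_image(lidar_t1),
                     depth_to_image(depth_t), depth_to_image(depth_t1)],
                    axis=-1)
```

Obstacles closer than the threshold appear as white (1) pixels, empty space as black (0), matching the binarized images shown later in the real-flight figures.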

C. COLLISION-FREE CONTROL POLICY MODULE
The CFCPM uses the deep Q-network algorithm to find the optimal collision-free policy π*_cf, which allows the MAV to select the best action a ∈ A for a given observation s_t ∈ S to maximize the total future reward R_t = Σ_{τ=t}^{T} γ^{τ−t} r_τ, where γ is the discount factor.
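The discounted return defined above can be computed directly as a quick numerical check; the discount value used in the example is arbitrary.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum over tau of gamma^(tau - t) * r_tau, for t = 0."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```

For example, three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75, showing how later rewards are weighted down geometrically.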
The interaction between the MAV and the environment consists of a series of actions a, comprising three moving commands (right, left, and forward), and observed rewards r at times t = 1, 2, . . . , T. During the learning process, the MAV collects information about the environment and learns the optimal collision-free policy π*_cf. These interactions can be represented by a tuple (s_1, a_1, r_2, s_2, a_2, . . . , s_T), where s_T is the terminal state. In our case, the MAV reaches the terminal state when the distance from an obstacle is less than 2 m or the maximum number of steps is exceeded. The DQN algorithm approximates the action-value function (Q-value) with a deep CNN, as shown in figure 4. It contains two convolution layers with a filter size of (3, 3, 16), a stride of (2, 2), and zero padding to avoid data loss. A ReLU activation function follows each convolution layer, and the feature maps are then passed to a max-pooling layer to down-sample them and extract the essential features. The output is forwarded to a fully connected layer that generates three velocity commands in the body frame R_b of the MAV. Given the collision-free policy π_cf(s_t) = a_t, the Q-value can be represented as

Q^π_cf(s, a) = E[R_t | s_t = s, a_t = a; π_cf]     (3)

The main objective is to maximize the total future reward R_t, which we can do by maximizing the Q-value function:

Q*(s, a) = max_{π_cf} E[R_t | s_t = s, a_t = a; π_cf]     (4)

The optimal Q-value can be decomposed into a Bellman equation as follows:

Q*(s_t, a_t) = E[r_{t+1} + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) | s_t, a_t]     (5)

The collision awareness module provides a large observation space of (90, 90, 4), which is used to approximate the Q-value; the iterative approximation method of equation 5 is not feasible in this case. To overcome this problem, the Q-value was approximated with a deep CNN so that Q(s_t, a_t; θ) ≈ Q*(s_t, a_t), where θ denotes the network weights, updated using the reward feedback r from the environment. The designed reward function, based on the Lidar measurements, is presented in Algorithm 1; it ensures safety and fast learning.
It also motivates the MAV to move forward when there is no obstacle in front. If the minimum distance from any obstacle exceeds 2 m and the selected action is moving forward, a positive reward encourages the MAV to move forward rather than rotating left or right; a small negative reward is given if moving forward is not selected. If the minimum Lidar distance falls below 2 m, a large negative reward is assigned, the training episode finishes, and the MAV position is reinitialized. The obtained reward r is used to calculate the target value. The neural network parameters θ are updated by performing stochastic gradient descent on the DQN based on the loss function

L(θ) = E[(Y_t − Q(s_t, a_t; θ))²]     (6)

where Y_t = r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; θ_T) is the target value. The gradient of the loss function is

∇_θ L(θ) = E[(Y_t − Q(s_t, a_t; θ)) ∇_θ Q(s_t, a_t; θ)]     (7)

The total workflow of our approach is presented in Algorithm 2. Experience replay was used: the memory contains state transitions (s_t, a_t, s_{t+1}, r_{t+1}), from which random samples with a batch size of 32 are drawn. Experience replay provides batch updating in an online learning fashion as follows.
• Save the tuple (s_t, a_t, s_{t+1}, r_{t+1}) in a list of a specific length (64).
• Once the replay memory is filled, randomly select a sample of size 32.
• Calculate the value update for each sample.
The ε-greedy training strategy was used, selecting a random action with a certain probability; this exploits the best actions most of the time while still exploring other actions from time to time. In addition, a separate target network was used as follows.
• Initialize the Q-network with parameters (weights) θ.
• Initialize the target network as a copy of the Q-network with separate parameters θ_T.
• Use the ε-greedy strategy with the Q-network's Q-values to select an action a.
• Get the reward r_{t+1} and the new observation s_{t+1}.
• Set the target network's Q-value to r_{t+1} if the episode has just terminated, or to r_{t+1} + γ max_a Q_{θ_T}(s_{t+1}, a) otherwise.
• Backpropagate the target network's Q-value through the Q-network.
• Every C iterations, set θ_T = θ.

Algorithm 2: DQN for the Autonomous Map-less Navigation of a MAV
Initialization:
- Initialize the CFCPM and the CAM
- Initialize replay memory D with size 10000
- Initialize the Q-value Q(s, a; θ) with random weights θ
- Initialize the target Q-value Y_t with weights θ_T = θ
- Set the minimal distance from any obstacle MD = 2 m
- Arm the drone
- Switch to manual flight mode
- Take off to a fixed altitude of 1.5 m
- Switch back to auto mode
for episode = 1, M do
  - Set the MAV initial position
  - Predict the distance from obstacles d
  while d > MD do
    - Get the current observation s_t using the collision awareness module (CAM)
    - With probability ε select a random action a_t; otherwise select a_t = argmax_a Q(s_t, a; θ)
    - Perform the action a_t; get the next observation s_{t+1} and the reward r
    - Store the transition (s_t, a_t, r, s_{t+1}) in D
    - Sample a random minibatch of transitions (s_k, a_k, r_k, s_{k+1}) of size 32 from D
    - If d ≤ MD then Y_k = r_k, else Y_k = r_k + γ max_{a_{k+1}} Q(s_{k+1}, a_{k+1}; θ_T)
    - Update the neural network weights θ using stochastic gradient descent
    - Update the target network weights θ_T every X steps
  end while
  - Save the neural network weights θ
end for
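The core pieces of Algorithms 1 and 2, namely the reward shaping, the ε-greedy action selection, the replay memory, and the target-value computation, can be sketched as follows. The reward magnitudes, the discount value, and the helper names are assumptions; the 2 m threshold, the three discrete actions, the memory size, and the batch size come from the text.

```python
import random
from collections import deque

# Sizes and thresholds from the text / Algorithm 2; reward magnitudes and
# GAMMA are illustrative placeholders.
ACTIONS = ["left", "right", "forward"]
MD = 2.0                        # minimal distance from any obstacle [m]
GAMMA = 0.99                    # discount factor (assumed value)
MEMORY_SIZE, BATCH_SIZE = 10000, 32

memory = deque(maxlen=MEMORY_SIZE)

def reward_fn(min_lidar_distance, action):
    """Reward shaping of Algorithm 1: encourage forward motion in open
    space, slightly penalise turning, and terminate near obstacles."""
    if min_lidar_distance <= MD:
        return -10.0, True            # crash penalty, episode ends
    if action == "forward":
        return 1.0, False
    return -0.1, False                # left / right

def epsilon_greedy(q_values, epsilon):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_values[a])

def target_value(r, next_q_target, done):
    """Y_k = r_k if terminal, else r_k + gamma * max_a Q_target(s_{k+1}, a)."""
    return r if done else r + GAMMA * max(next_q_target)

def store_and_sample(transition):
    """Store (s_t, a_t, r, s_{t+1}, done); return a random minibatch once
    enough transitions are available, otherwise None."""
    memory.append(transition)
    if len(memory) < BATCH_SIZE:
        return None
    return random.sample(memory, BATCH_SIZE)
```

The Q-network itself is omitted here; in the full algorithm, the sampled minibatch and the targets from `target_value` drive the stochastic-gradient update of θ, while θ_T is refreshed every X steps.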

V. SIMULATION RESULTS

A. TRAINING PLATFORM
Several intensive experiments were performed in simulated and real-world environments to evaluate our algorithm's feasibility. Four simulation environments were built, two mimicking indoor scenarios and two mimicking outdoor scenes, based on the Gazebo simulator and SITL, as presented in figure 5. Environments A and B are for training; C and D are for testing.
The outdoor environments contain randomly placed trees with high density and walls to limit the training area. The indoor one is a corridor with varying distances between the walls. Training was performed in environments A and B using a desktop computer with an Nvidia GTX 1080 GPU, an Intel i7-6700 3.40 GHz x 8 CPU, and 16 GB of RAM. A low altitude of 1.5 m and a forward speed of 1 m/s were used in environment A (outdoor). For the indoor environment B, the forward speed was 0.7 m/s and the angular velocity was 0.7 rad/s. All models were trained with the Keras framework, CUDA 8, and CuDNN 6. Table 1 shows the training parameters in detail. During the training process, only the DQN-flight mode is activated. We trained for 2000 episodes within 168 hours. The accumulated discounted reward curves for both scenarios (A, B) are shown in figure 6, plotted versus the training episodes. Each episode contains up to 10,000 training steps. Figure 6 shows that the two curves converge to different values. In the outdoor environment, the total discounted reward converges to a higher value, around 73, within 250 episodes, while in the indoor scenario, the overall score converges to around 49 within 400 episodes. This indicates that the MAV performs better outdoors than indoors in terms of obstacle avoidance and navigation because of the larger open space. During the training process, both the convergence time and the total discounted reward depend on the CNN architecture, the observation size, and the MAV's speed. Different architectures can lead to different performances, and training at low speeds allows stable policies to be learned. The frame stacking also has a significant impact on performance. The modular architecture was designed to balance learning speed and success rate. After training finished, the obtained models were saved for testing in the same environments and in new, unseen environments.

B. SIM-TO-SIM: TRAINING RESULTS VERIFICATION
During the testing phase, only the DQN-flight mode is activated. Different environments were built to check the generalization capability of our algorithm to new, inexperienced scenarios; the same desktop used for training was used for testing. First, we tested the obtained models in an indoor scenario with two environments: the same one used for training (B) and a different one with a high obstacle density (D). After that, we tested them in outdoor environments; as in the indoor case, two scenarios were used: the training environment (A) and a new one (C). Throughout the testing phase, 10 episodes with 10,000 steps were used. The total discounted reward for the different cases is presented in figure 7a. In the outdoor cases (A, C), the obtained reward is higher than in the indoor scenarios (B, D). In the outdoor training environment (A), the average total score was positive, about 95, indicating that the MAV performs well at obstacle avoidance because all the features were already seen and learned in the training phase. In the new, unseen outdoor environment (C), the average total score was about 85. The average overall score in the indoor environments was about 60, and the MAV avoided the obstacles. To verify our algorithm's generalization to new environments, more detailed simulations were performed and compared with the traditional potential field method, which uses only laser scan data as input [54]. Figures 7b and 7c show the simulated environments and the flight paths taken by the MAV. Scenario 1, presented in figure 7b, shows that the potential field method fails: the MAV collided with the obstacle, as marked by the black circle. Our method shows more robustness in obstacle avoidance and smoother motion, which is not the case for the potential field approach.
Scenario 2 was presented to confirm the superiority of our algorithm. Figure 7c shows a simulated corridor with obstacles placed randomly. The MAV starts from an initial position (0, 0) after take-off and begins to move forward and explore the unknown environment at a fixed altitude. Both algorithms perform well at the beginning, when there are no obstacles in the middle of the corridor; then, at the point (7, 7), the potential field method fails due to the obstacle's central location. Our algorithm, however, was able to guide the MAV to the end of the corridor. Another simulation, in a forest environment, was also performed; figure 8 presents the obtained results. The red trajectory shows that the potential field algorithm fails in an open-space scenario with no solid walls, whereas our approach, using the same models tested in scenarios 1 and 2, guided the MAV autonomously, as shown by the blue trajectory.
Finally, fully autonomous mission scenarios were performed. The main objective is to make the MAV reach a predefined target point while avoiding collisions with obstacles. Figure 9 shows the first simulated outdoor scenario, where the MAV takes off from an initial position (0, 0) and tries to find a collision-free path to the target point indicated by a blue circle. Both the DQN-flight mode and the mission-flight mode are running. When there is no obstacle within 2 m in front, the mission mode makes the MAV head toward the target point and move forward until it detects an obstacle within the predefined range; the DQN mode then executes the obstacle avoidance maneuver. This behavior guarantees full autonomy without crashes, as the blue path shows. The second scenario simulates an indoor corridor with randomly placed pillars, as figure 10 presents. The aim of this scenario is to reach multiple target points (T_1, T_2, T_3). The path taken by the MAV is shown as the blue line, starting from the initial position (0, 0). The MAV reaches the first point, T_1, without colliding; it then heads towards points T_2 and T_3. The reader is encouraged to watch the supplementary videos for a fuller understanding of the simulated scenarios.

VI. SIM-TO-REAL: REAL TIME EXPERIMENTAL VERIFICATION
Several real-time tests were performed to validate our system by deploying the models trained in simulation on a real MAV. First, only the DQN-flight mode was used to autonomously explore unknown indoor corridors; two case scenarios were investigated (straight and L-shaped corridors). Second, an experiment was performed in an outdoor environment to validate a fully autonomous mission across a forest trail in the presence of dynamic and static obstacles. Accompanying videos of the real-time experiments are available.

A. REAL TIME INDOOR VERIFICATION FOR THE DQN-FLIGHT MODE
The trained DQN model was initially tested in an indoor straight corridor, as shown in figure 11. Figure 12 shows the controlled yaw rates right and left (R, L), the linear forward speed (F), sample depth and Lidar images, and the corresponding RGB images from the actual test. In the binarized depth and Lidar images, the walls are detected as obstacles in white, and the empty space is black. The images obtained in simulation are similar to the real images. The accumulated score during this test is presented in figure 13. The total score increased slowly over the first 100 steps, while the MAV was still landed. After taking off, the drone started to move forward to explore the unknown corridor; the overall score stayed around 10 from step 200 to 700 because the environment was very narrow.

A second experimental scenario was performed in an L-shaped corridor-like environment, as presented in figure 14. After takeoff, the MAV was ordered to autonomously explore the corridor at a forward speed of 0.6 m/s. Figure 15 shows the path taken by the MAV while avoiding the obstacles. The controlled forward velocity and heading rate are shown in figure 16 (first row: the desired and current yaw angles; second row: the desired yaw rate output from the DQN mode and the current one; last row: the desired forward velocity commanded by the DQN mode and the current one). In the second row, the desired yaw rate, in red, switches between two values (+0.8 rad/s and −0.8 rad/s), which correspond to the left and right discrete actions; the blue curve presents the current heading rate, which tracks the desired one. For additional clarity, the yaw angle is plotted in the first row: the MAV follows the desired yaw angle properly, which allows it to avoid the obstacles. The last row presents the desired forward linear velocity generated by the DQN mode, corresponding to the moving-forward discrete action (0.6 m/s); the actual velocity is shown in blue. The MAV tracks the desired forward speed with an overshoot of about 0.2 m/s.
To verify the capabilities of our algorithm, a flight mission was conducted in a forest environment; see figure 17. Different trees are placed randomly, acting as static obstacles.

B. REAL TIME FULLY AUTONOMOUS MISSION
In figure 17a, the black line represents the mission's shortest path connecting the starting point to the desired target point across different obstacles. This path requires the least time and power to finish the mission; however, it contains many obstacles. After take-off, the MAV starts to head toward the target point at a low altitude of 1.5 m. If any static or dynamic obstacle appears in front, the MAV automatically switches to DQN-flight mode, which allows it to avoid the obstacle; after the avoidance maneuver finishes, the MAV heads back toward the target using mission-flight mode, which keeps the path short. For safety reasons, a human pilot can always intervene in an emergency by toggling a switch button. Figure 18 presents the path taken by the MAV during the whole mission. The MAV reached the target point safely without colliding with the trees or the walking person; after that, it returned to the starting point. The whole mission took around 115 seconds. The controlled heading rate and forward velocity outputs are presented in figure 19 (first row: the desired and current yaw angles; second row: the desired and current yaw rate; last row: the desired and current forward velocity). After the auto mode is activated, the mission starts, and the MAV's heading and forward velocity track the desired values generated by the auto mode. Figure 20 shows sample images collected during the flight test: the trees are detected as obstacles in white, and the images contain strong features of the surrounding environment. After the depth and Lidar images are concatenated, they are passed to the neural network model.

VII. CONCLUSION
This paper presents a novel approach for integrated autonomous navigation and obstacle avoidance for a quadrotor MAV using a modular Deep Q-Network architecture. The proposed algorithm takes Lidar range-finder distance measurements and a depth image as inputs and fuses the sensory data through an end-to-end CNN. To find the optimal collision-free policy, we built different virtual indoor and outdoor simulation environments with realistic simulations of the sensor data. The collision-free policy was trained entirely in simulation and then deployed on a real MAV platform. The simulation and real-world experiments show that the proposed method significantly improves the generalization capability of the trained DRL policies and achieves outstanding performance in terms of success rate and collision avoidance. Future work will address applying this approach to the navigation problem in full 3D space.