Perception-Action-Aware Autonomous Drone Racing in a Photorealistic Environment

The development of autonomous, fast, and agile small Unmanned Aerial Vehicles (UAVs) raises fundamental challenges: dynamic environments, fast and agile maneuvers, unreliable state estimation, imperfect sensing, and the coupling of action and perception in real time under severe resource constraints. Autonomous drone racing sits at the intersection of computer vision, planning, state estimation, and control, and therefore exercises all of these challenges at once. To bridge this gap, we propose an approach to autonomous, perception-action-aware, vision-based drone racing in a photorealistic environment. Our approach integrates a deep convolutional neural network (CNN) with state-of-the-art path-planning, state-estimation, and control algorithms. The deep learning component uses computer vision to detect the gates and estimate the flyable area. The planner and controller then use this information to generate a short, minimum-snap trajectory segment and send the corresponding motor commands to reach the desired goal. We evaluated the proposed methodology thoroughly in the Gazebo and FlightGoggles (photorealistic sensor) environments. Extensive experiments demonstrate that the proposed approach outperforms state-of-the-art methods and flies the drone more consistently than many human pilots. Moreover, the proposed system successfully guided the drone through the tight race courses of the 2019 AlphaPilot Challenge, reaching speeds of up to 7 m/s.


I. INTRODUCTION

A. MOTIVATION
In recent years, autonomous Unmanned Aerial Vehicles (UAVs), or drones, deployed in diverse scenarios have gained much attention. UAVs are attracting increased interest across communities such as defense, emergency response, disaster relief, healthcare, agriculture, mining, infrastructure development, sports, and education [1]. One of the most challenging tasks for UAVs is competitive drone racing. In a drone race, each UAV is controlled by a human pilot who receives a first-person-view live stream from an onboard camera and flies the drone by sending commands via a radio transmitter. Human drone pilots need years of training to master advanced navigation, and crashes remain frequent. Even so, the maneuverability and speed of these human pilots still outperform existing control algorithms for autonomous UAVs. The performance of autonomous racing drones is still far from human pilots' speed, versatility, and robustness; hence, substantial research and innovation are needed to fill this gap. Developing fully autonomous racing UAVs that achieve optimal performance is arduous because of the difficulty of dynamic modeling, onboard perception, trajectory generation, localization, and mapping. With the miniaturization of onboard electronics and continued technological advancement, we expect fully autonomous racing drones to soon compete effectively and efficiently against expert human pilots.
(The associate editor coordinating the review of this manuscript and approving it for publication was Zhenbao Liu.)
In 2019, Lockheed Martin and the Drone Racing League launched the AlphaPilot Challenge [2], [3], an open innovation challenge with a grand prize of $1 million, to accelerate AI-enabled autonomous drone racing. Current autonomous drones have very little onboard decision-making; they almost always follow specific human commands and rarely accomplish higher-level tasks. Artificial Intelligence algorithms are mainly developed for simulation, and transferring them from simulation to real-world flight is a major challenge. Another challenge of autonomous drone flight is camera-based perception: the faster the drone maneuvers, the more blurred the image gets.
The main motivation and goal of this research work is to develop fully autonomous UAVs that navigate through a racecourse using AI, computer vision, and control theory. This research work and the AlphaPilot challenge aim to one day beat the best human pilot by pushing the limits of speed and course size to advance the state of the art, whereas other autonomous drone races [4], [5] focus on complex navigation. Because of its high speed, autonomous drone racing raises fundamental challenges in real-time state estimation, perception, planning, and control. Limited computational power and complex visual conditions (e.g., motion blur and low light) make autonomous drone racing especially demanding.
This paper is organized as follows. Section I presents the introduction, in which we discuss the motivation behind this research work, the latest related work in this area, and our contribution. Section II provides the drone race format, the mathematical modeling of the UAV, and the drone model and classification. The complete methodology is presented in Section III, which comprises gate detection (the perception system), the control system, and the drone interface. Section IV presents the performance of the experiments in simulated and real scenarios, along with their results. Finally, Section V concludes the paper and points towards future research directions.

B. RELATED WORK
This section puts our proposed methodology into context, focusing on the most closely related work. The cumulative complexity of the subproblems, i.e., object detection and recognition [6], non-linear control [1], and path planning [7], makes drone racing an exciting and challenging problem. The rapid development of AI has brought many novel technologies, such as deep learning [8] and reinforcement learning [9], into broader use in the robotics community. With recent advances in AI and progress in powerful computer chips, deep convolutional neural networks (CNNs) have become the standard approach for AI and computer vision applications [10], [11].
In 2013, Ross et al. [12] presented learning-based monocular reactive UAV control in cluttered natural environments, in which the task was collision-free flight in a forest. The network input was imagery from a forward-facing camera, the network output was the desired lateral speed, and the training methodology was supervised learning on data recorded from a human pilot. After initial training on the expert data, the policy was refined using dataset aggregation. The limitations of that work are that vehicle dynamics were not taken into account, motion was restricted to 2D sideways maneuvers (roll left or right), and the constant linear speed was below 1.5 m/s. In 2017, Smolyanskiy et al. [13] captured images from a forward-facing camera following a forest trail and provided them as the network input. The network output was the heading angle and lateral offset relative to the trail; the training methodology was supervised learning on hand-recorded data, and the limitations were 2D unicycle motion, vehicle dynamics not taken into account, and a constant linear speed below 2 m/s. In 2018, Jung et al. [8] addressed the task of navigating through a set of gates. The network input was the image from a forward-facing camera; the network output was a segmentation of the gate, from which velocity commands were computed to align the optical center with the gate center, with network inference performed onboard. The limitations are again 2D unicycle motion and vehicle dynamics not being considered. In 2019, Muller et al. [13] demonstrated a UAV navigating through a set of gates as fast as possible. The network input was an image from a forward-facing camera together with the platform state, and the network output was low-level control commands (body rates and thrust). The network was trained using an ensemble of classical controllers; each classical controller was evaluated, and the best one was chosen for imitation. The limitations are that perfect knowledge of the system state is assumed and the approach works only in simulation.
Despite this progress in AI and control, the optimal output representation for learning-based algorithms that couple perception and control remains an open question. This paper presents a complete system for solving the missions designed in the AlphaPilot drone racing 2020 competition in a fully AI-enabled autonomous manner, and develops generic and versatile approaches that can be utilized in many other applications.

C. CONTRIBUTION
The combination of robotics and AI will shape the future, and AI-enabled autonomous technologies will help researchers, scientists, astronauts, military personnel, and first responders perform dangerous jobs more safely and efficiently. The main contributions of this work are as follows: • The proposed hybrid system, based on machine learning (for perception) and models (for control), is more robust than systems based exclusively on one of the two. It also does not require any explicit map of the environment and runs fully onboard.
• Several computer vision methods were used to improve the network's ability to generalize and to enlarge the training dataset, including image shifting, flipping, warping, and color manipulation.
• Two approaches were considered for detecting the gate and estimating the flyable area in real time. The first approach was a detection and classification network such as YoloV3 [14]. The second approach was to use a pixel-wise segmentation network.
• After the network was trained in an end-to-end fashion, the predictions were thresholded to create a binary mask, to detect the inner and outer corners of the predicted gate, and to return their pixel coordinates.
• Gates with occlusion posed another challenge. Our approach is significantly more robust to occlusions of the gate.
• The optimization-based VIO method, the VINS-Mono, was chosen to utilize the pre-integration of IMU jointly with feature observations from cameras.
• Computed a time-optimal trajectory through multiple gates given full knowledge of the track.
• The planner then uses the above perception information to generate a short, minimum-snap trajectory segment and corresponding motor commands to reach the desired goal.
• Achieved robustness to track uncertainties while accounting for nonlinear system dynamics and actuator constraints.

II. DRONE RACE FORMAT AND UAV'S MODEL

A. DRONE RACE FORMAT
We designed a complete track system with gates at different heights above the ground in the Gazebo environment, as shown in Fig. 1, in which a UAV can fly and navigate a simple 3D track marked with square gates (dimensions 0.5 m × 0.5 m); the UAV enters with a similar pose and must complete a full lap at best speed without colliding with the gates. The square, or parallelogram, gates are marked with unique visual fiducial markers (ArUco tags with a 4 × 4 dictionary) that help estimate the pose and size of the gate, as shown in the attached video. 1 The drone race course format was provided by the AlphaPilot team [2], [15], in which the UAV's starting location is in the upper right corner, as shown in Fig. 2. In this challenge, the UAV must traverse the gates in the order marked in red along the course (Gates 10, 21, 2, 13, 9, 14, 1, 22, 15, 23, 6) and must finish by passing through the gate in the lower-left corner of the figure (Gate 6). In Fig. 2, the IDs of gates along the race path are marked in yellow, while the gate IDs in blue are not part of the racecourse.
In an autonomous drone race, an effective dynamic model of the system is needed to design a controller that captures the non-linearities of UAV flight dynamics. In this paper, an X-configuration-based six-DOF underactuated UAV (Intel Aero RTF UAV [16]) is considered. Six DOF comprises three translational motions and three rotational motions around three axes. The UAV's configuration consists of four propellers divided into two pairs, clockwise (CW) and counterclockwise (CCW), mounted on four motors in two orthogonal directions to compensate for the effect of reactive torque. The front (F1) and rear (F3) propellers rotate CCW, while the lateral propellers (F2 and F4) rotate CW, to produce the lift force and balance the required yaw torque. Roll rotation is achieved by changing the speed of the F2 and F4 motors. Pitch rotation is realized by adjusting the speed of the F1 and F3 motors. To determine the model dynamics of the UAV, the following assumptions are established:
Assumption 1: The structure of the UAV is symmetrical.
Assumption 2: The body of the UAV and its propellers form a rigid body with six DOF whose dynamics can be derived using the Newton-Euler equations.
Assumption 3: The drag and thrust forces are proportional to the square of the propeller's rotational speed.

1) THE REFERENCE FRAMES AND TRANSFORMATION
The mathematical model of the UAV is derived under Assumptions 1-3. Let E = {O_e, X_e, Y_e, Z_e} represent the inertial (earth-fixed) frame. The rotation matrix from the body-fixed coordinates (X_b, Y_b, Z_b) to the earth-fixed coordinates (X_e, Y_e, Z_e) is given in (1), as shown at the bottom of the next page, where φ, θ, ψ represent the roll, pitch, and yaw angles, respectively. The position of the UAV in the inertial (earth-fixed) frame is ξ = [x, y, z]^T, its attitude is described by the Euler angles η = [φ, θ, ψ]^T obtained by successive rotations through yaw ψ, pitch θ, and roll φ, and its translational velocities are υ = [υ_x, υ_y, υ_z]^T in the x, y, and z directions. The angular velocities in the body-fixed frame are ω = [p, q, r]^T, where p(t) is the roll rate about the x-axis, q(t) the pitch rate about the y-axis, and r(t) the yaw rate about the z-axis. The UAV kinematic equations [17], [18] can then be expressed as

ξ̇ = R(η) υ,    η̇ = W(η) ω.

If for all t the pitch angle θ(t) ∈ (−π/2, π/2), the roll angle φ(t) ∈ (−π/2, π/2), and the yaw angle ψ(t) ∈ (−π, π), then the derivatives of the Euler angles are related to the body-fixed angular velocities by

W(η) = [ 1   sin φ tan θ    cos φ tan θ ;
         0   cos φ         −sin φ       ;
         0   sin φ sec θ    cos φ sec θ ],

and det(W) = sec(θ). Therefore, as long as the pitch angle satisfies θ ≠ (2κ − 1)π/2, (κ ∈ Z), the matrix W(η) is invertible.
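The transformation W(η) and its determinant can be checked numerically. The following sketch (with arbitrary test angles, not values from the paper) verifies the det(W) = sec(θ) identity:

```python
import numpy as np

def euler_rate_matrix(phi, theta):
    """W(eta): maps body angular rates [p, q, r] to Euler angle rates
    [phi_dot, theta_dot, psi_dot] (ZYX convention)."""
    return np.array([
        [1.0, np.sin(phi) * np.tan(theta),  np.cos(phi) * np.tan(theta)],
        [0.0, np.cos(phi),                 -np.sin(phi)],
        [0.0, np.sin(phi) / np.cos(theta),  np.cos(phi) / np.cos(theta)],
    ])

phi, theta = 0.1, 0.2                       # arbitrary angles within (-pi/2, pi/2)
W = euler_rate_matrix(phi, theta)
# det(W) = sec(theta): W is singular only when theta = (2k - 1) * pi / 2
assert np.isclose(np.linalg.det(W), 1.0 / np.cos(theta))
```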

2) FORCE EQUATIONS AND PROPELLER DYNAMICS
By the Newton-Euler equations, the product of mass and inertial acceleration equals the sum of all forces exerted on the UAV, i.e., the total thrust T, the weight mg, and the drag F_w, where m is the UAV mass and g is the gravitational acceleration. Each propeller i generates a thrust T_i = k Ω_i² along the body (−z) direction, so the total thrust produced by the four propellers is T = Σ_{i=1}^{4} k Ω_i², and the reaction torque caused by air drag is Q_i = b Ω_i², where Ω_i is the speed of propeller i, k > 0 is the thrust constant, and b > 0 is the drag factor of the propeller.

3) MOMENT EQUATIONS
The time derivative of the UAV angular momentum H = I_f ω equals the total external moment τ_total on the system, including the gyroscopic moments and the torques τ. The moment equations are therefore

I_f ω̇ = −ω × (I_f ω) + τ_gyro + τ,

where the total inertia matrix I_f ∈ R^{3×3} is, under Assumption 1, a positive definite symmetric matrix expressed in the body-fixed frame. The gyroscopic torque is

τ_gyro = −Σ_{i=1}^{4} I_r (ω × e_z)(−1)^{i+1} Ω_i,

where I_r is the moment of inertia of the motor axis, e_z is the body z-axis unit vector, and τ = [τ_φ, τ_θ, τ_ψ]^T denotes the external propeller torques.
Let d be the distance between the UAV's center of mass and the center of each propeller. For the motor layout described above (F1/F3 along the pitch axis, F2/F4 along the roll axis), the thrust and control torques produced by the four propellers are

T   = k (Ω_1² + Ω_2² + Ω_3² + Ω_4²),
τ_φ = d k (Ω_4² − Ω_2²),
τ_θ = d k (Ω_1² − Ω_3²),
τ_ψ = b (Ω_2² + Ω_4² − Ω_1² − Ω_3²).

Since the controller determines the control inputs (T, τ_φ, τ_θ, and τ_ψ), these inputs are converted into the desired angular velocities of the propellers Ω_i (i = 1, 2, 3, 4) by inverting the above equations. This transformation is typically called motor mixing.
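Motor mixing can be sketched as the inversion of a 4 × 4 mixing matrix. The constants and sign conventions below are illustrative, not the paper's values; the layout follows the F1/F3 CCW, F2/F4 CW description above:

```python
import numpy as np

# Illustrative constants (not from the paper): thrust coeff k, drag coeff b,
# arm length d.
k, b, d = 8.5e-6, 1.1e-7, 0.12

# Mixing matrix: [T, tau_phi, tau_theta, tau_psi]^T = M @ [Omega_i^2].
# Signs follow one possible convention for the layout described above.
M = np.array([
    [k,      k,      k,      k    ],   # total thrust T
    [0.0,   -d * k,  0.0,    d * k],   # roll  tau_phi   (F4 vs F2)
    [d * k,  0.0,   -d * k,  0.0  ],   # pitch tau_theta (F1 vs F3)
    [-b,     b,     -b,      b    ],   # yaw   tau_psi   (CW vs CCW reaction)
])

def motor_mixing(T, tau_phi, tau_theta, tau_psi):
    """Invert the mixing matrix to recover squared propeller speeds,
    clipping at zero for physical realizability before the square root."""
    omega_sq = np.linalg.solve(M, np.array([T, tau_phi, tau_theta, tau_psi]))
    return np.sqrt(np.clip(omega_sq, 0.0, None))

# Hover check: pure thrust and zero torques yield four equal motor speeds
omega = motor_mixing(9.81 * 1.0, 0.0, 0.0, 0.0)
assert np.allclose(omega, omega[0])
```

Because the four rows of M are mutually orthogonal for this layout, the matrix is always invertible and the mapping is one-to-one.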

III. METHODOLOGY
The methodology comprises six main subsystems: a sensor interface, perception, state estimation, path planning, control, and a drone interface, which together address the fast, agile, and robust flight of quadrotors in a drone racing arena. The sensor interface pipeline reads the quadrotor's cameras, IMU, and other sensors. The information from the sensor interface feeds the perception module, which does two things. First, a gate detection component uses computer vision and deep learning to detect the gates and predict the goal direction in local image coordinates from the forward-facing camera. Second, VINS-Mono provides information about the motion of the quadrotor. This information is then used by the state estimation module, which fuses the gate corners and the odometry data to compute a consistent global map of the gate layout and the state of the quadrotor; through this, we know where the quadrotor is, how it is oriented in the world, and where it is relative to the gates. The path planning module then takes this information and generates a minimum-snap trajectory using the minimum-snap trajectory generation method [19]. The output is a reference trajectory fed to the control component of the pipeline. The control system uses the reference trajectory to compute the actual rotor commands of the quadrotor platform so that it follows the trajectory as accurately as possible. The computed commands are finally passed to the drone interface; the complete system architecture is shown in Fig. 3.
The following section discusses the methodology of our proposed perception, state estimation, path planning, and control systems.

A. PERCEPTION SYSTEM
We proposed and developed a deep learning approach for detecting the gates and estimating the flyable area in the perception system. The perception pipeline is divided into two parts, gate detection and Visual Inertial Odometry (VIO); for gate detection, we built a modular algorithm consisting of three pipelines: preprocessing, detection, and post-processing. This agile architecture enabled us to rapidly test different methods for detection and implement the fastest and most accurate one.

1) PIPELINE 1
This is the preprocessing pipeline, in which we curated the dataset list and filtered out entries with missing polygons. The biggest challenge was dataset annotation. Polygon annotation is a precise way to annotate objects by selecting a series of x and y coordinates along their edges. The first polygon annotation was used to identify and train on the flyable-area features of the gates, while the second polygon annotation was used to identify and train the classifier on the gate features. Using the provided inner polygons of the gate frame alone was insufficient to train a robust classifier that generalizes to unseen frames during the race. Therefore, we added another polygon annotation to the dataset that helped us identify and train the classifier on gate features rather than the flyable-area features. Our hypothesis is that if we can locate the gate, we can easily find the flyable area using a post-processing pipeline. Consequently, we created masks for training.
The preprocessing pipeline also included a camera calibration step that used the provided gate images to remove the deformation caused by the lens used to take the pictures in the dataset. Several data augmentation methods were introduced to improve the network's ability to generalize and to enlarge the training dataset. The augmentation included image shifting, flipping, warping, and color manipulation. The pseudocode of the image augmentation is provided in Algorithm 1.

2) PIPELINE 2
Two approaches were considered for detecting the gate and estimating the flyable area in real time. The first approach was a detection and classification network such as YoloV3 [14]. The second approach was to use a pixel-wise segmentation network. We decided to use real-time pixel-wise segmentation to improve the accuracy of the flyable area. We trained, tested, and evaluated four segmentation networks: ENet [20], DeepLabv3 [21] with Xception and ResNet backbones, and U-Net [22]. The DeepLabV3 network was chosen for the final pipeline based on its gate segmentation accuracy and speed. DeepLabV3 uses dilated convolution, Multi-Grid, and Atrous Spatial Pyramid Pooling (ASPP) [21]. We trained the network with the ResNet backbone on an RTX 2080 Ti GPU.
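The shift/flip/color steps of the augmentation described in Algorithm 1 can be sketched as follows; this is a simplified NumPy-only illustration (the paper's pipeline also includes warping), with parameter ranges chosen for illustration only:

```python
import numpy as np

def augment(image, mask, rng):
    """Apply one random augmentation pass identically to image and mask:
    horizontal flip, integer shift, and (image only) brightness jitter."""
    if rng.random() < 0.5:                       # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    dy, dx = rng.integers(-20, 21, size=2)       # random shift in pixels
    image = np.roll(image, (dy, dx), axis=(0, 1))
    mask = np.roll(mask, (dy, dx), axis=(0, 1))  # mask must shift with image
    gain = rng.uniform(0.8, 1.2)                 # brightness jitter
    image = np.clip(image.astype(np.float64) * gain, 0, 255).astype(np.uint8)
    return image, mask

rng = np.random.default_rng(0)
img = np.zeros((64, 64, 3), np.uint8)
msk = np.zeros((64, 64), np.uint8)
aug_img, aug_msk = augment(img, msk, rng)
assert aug_img.shape == img.shape and aug_msk.shape == msk.shape
```

The key design point is that geometric transforms are applied identically to the image and its segmentation mask, while photometric transforms touch the image only.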

3) PIPELINE 3
After the network was trained in an end-to-end fashion, the predictions were thresholded to create a binary mask. This post-processing algorithm consists of several computer vision methods to filter and estimate the flyable area. The second step in the algorithm detects the outer and inner corners of the predicted gate and returns their pixel coordinates.
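One lightweight way to extract the four corner coordinates from such a binary mask is to take the pixels extremal along the two diagonal directions. This is a simplified NumPy-only stand-in for the contour-based post-processing, not the paper's exact implementation:

```python
import numpy as np

def quad_corners(mask):
    """Approximate the four corners of a roughly quadrilateral binary mask
    as the pixels extremal along the two diagonal directions, returned as
    (x, y) pixel coordinates in clockwise order from the top-left."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1)
    s, d = xs + ys, xs - ys
    return np.array([
        pts[np.argmin(s)],  # top-left     (min x + y)
        pts[np.argmax(d)],  # top-right    (max x - y)
        pts[np.argmax(s)],  # bottom-right (max x + y)
        pts[np.argmin(d)],  # bottom-left  (min x - y)
    ])

mask = np.zeros((100, 100), np.uint8)
mask[20:80, 30:90] = 1                  # axis-aligned "gate" blob
corners = quad_corners(mask)
assert (corners == [[30, 20], [89, 20], [89, 79], [30, 79]]).all()
```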

4) DATASET COLLECTION AND DISTRIBUTION
The network is trained on a training dataset that contains roughly 9,300 images (2.8 GB in total); the ground-truth labels for this dataset are available here. 2 All images were taken with a Canon DSLR with an 18 mm lens, and the square size of the images is 19×19 mm. The test dataset 3 contains roughly 1,000 images totalling 360 MB and should be used to evaluate how well the algorithms perform on unseen images.

B. OPTIMIZATION BASED VIO METHOD
The optimization-based VIO method VINS-Mono [23] was chosen for its high accuracy and robustness under high-speed motion. VINS-Mono is a graph-optimization-based solution to the non-linear visual-inertial problem. It uses the pre-integration of IMU measurements jointly with feature observations from cameras, and minimizes a residual consisting of the sum of the re-projection errors of the camera feature observations and the pre-integrated IMU measurements. The equation below is the objective function of VINS-Mono, which features a tightly coupled, sliding-window-based formulation.
For brevity, we omit the explanation of (12) here and refer the reader to [23] for a detailed definition and derivation of the variables and symbols in the function. Equation (12) is linearized and solved iteratively using Ceres [24], a non-linear least-squares solver. The key to the robustness of VINS-Mono lies in its on-the-fly estimator initialization and loop-closure detection. The working pipeline of VINS-Mono is given in Fig. 4.

C. STATE ESTIMATION
The perception outputs, i.e., the gate corners from gate detection and the odometry data from VINS-Mono, are then used by the state estimation module of the system. Here we use an Extended Kalman Filter (EKF) to fuse the information about the gate corners and the odometry data into a consistent global map of the gate layout and the state of the quadrotor, so we know where the drone is, how it is oriented in the world, and where it is relative to the gates. The EKF is the extension of the Kalman filter to the nonlinear case, where we use the Jacobians G_t of the transition function and H_t of the observation function to compute a point-wise linear estimate and then perform the updates. We define the EKF (Algorithm 2) following [25], refactored to include separate Predict and Update methods and to use our notation: u_t is the control input, x_t is the state, μ_t and Σ_t are the mean and covariance, and Q_t is the process noise covariance.
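The Predict/Update structure of Algorithm 2 can be sketched as below. This is a generic EKF skeleton, not the paper's gate-fusion filter; the transition and observation models (and their Jacobians) in the toy example are illustrative:

```python
import numpy as np

class EKF:
    """Minimal EKF with separate Predict and Update steps, using the
    transition Jacobian G_t and observation Jacobian H_t."""
    def __init__(self, mu0, Sigma0, Q, R):
        self.mu, self.Sigma, self.Q, self.R = mu0, Sigma0, Q, R

    def predict(self, g, G, u):
        self.mu = g(self.mu, u)                     # nonlinear propagation
        self.Sigma = G @ self.Sigma @ G.T + self.Q  # linearized covariance
        return self.mu

    def update(self, h, H, z):
        S = H @ self.Sigma @ H.T + self.R           # innovation covariance
        K = self.Sigma @ H.T @ np.linalg.inv(S)     # Kalman gain
        self.mu = self.mu + K @ (z - h(self.mu))
        self.Sigma = (np.eye(len(self.mu)) - K @ H) @ self.Sigma
        return self.mu

# 1-D constant-velocity toy example: state [position, velocity]
dt = 0.1
G = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])                          # measure position only
ekf = EKF(np.zeros(2), np.eye(2), 0.01 * np.eye(2), np.array([[0.5]]))
ekf.predict(lambda x, u: G @ x, G, None)
mu = ekf.update(lambda x: H @ x, H, np.array([1.0]))
assert mu[0] > 0.0  # estimate is pulled toward the measurement
```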

D. PATH PLANNING
The controller is designed to deal with deviations from the desired path and disturbances that occur during path execution, and we need to guarantee that the planned trajectory is feasible for the drone to follow and does not diverge from the desired path. Minimum-snap trajectory generation [26], which minimizes the fourth derivative of position (snap), is used to compute a global reference trajectory that passes through all track gates. We therefore optimize the trajectory so that the integral of the squared fourth derivative of the position x is minimized; the cost can be expressed as

J = ∫ ‖x⁽⁴⁾(t)‖² dt.

The path is represented as a piecewise polynomial between the waypoints, and the Euler-Lagrange condition that minimizes this cost is x⁽⁸⁾ = 0. Thus, in practice, a 7th-order polynomial is used to generate each path segment,
where n is the index of the spline segment containing t, and t_n is the time associated with the start of segment n.
To optimize the snap of a segment, we minimize the integral over [0, d_n] of the squared 4th derivative of the polynomial, where d_n = t_{n+1} − t_n is the duration of segment n. In one dimension, the snap cost of segment n is

s_n = ∫₀^{d_n} ( d⁴x_n / dt⁴ )² dt,

and the total snap s over a given dimension is

s = Σ_{n=1}^{N} s_n,   (16)

where N is the total number of segments in the trajectory.
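A single segment of such a 7th-order spline can be obtained by imposing eight boundary conditions (position, velocity, acceleration, and jerk at both endpoints), which is the boundary-value form of the x⁽⁸⁾ = 0 condition. The sketch below is an illustrative single-segment solver, not the paper's multi-segment optimizer:

```python
import math
import numpy as np

def snap_segment(d, start, end):
    """Coefficients a_0..a_7 of the 7th-order polynomial
    x(t) = sum_i a_i t^i on [0, d], matching pos/vel/acc/jerk at both
    endpoints. `start` and `end` are [pos, vel, acc, jerk]."""
    def rows(t):
        # k-th row holds the k-th derivative of [1, t, ..., t^7] at time t
        A = np.zeros((4, 8))
        for k in range(4):
            for i in range(k, 8):
                A[k, i] = math.factorial(i) // math.factorial(i - k) * t ** (i - k)
        return A
    A = np.vstack([rows(0.0), rows(float(d))])   # 8 conditions, 8 unknowns
    return np.linalg.solve(A, np.concatenate([start, end]))

# Rest-to-rest segment: move from x = 0 to x = 1 in 2 s
a = snap_segment(2.0, [0, 0, 0, 0], [1, 0, 0, 0])
assert np.isclose(np.polyval(a[::-1], 0.0), 0.0)  # starts at 0
assert np.isclose(np.polyval(a[::-1], 2.0), 1.0)  # ends at 1
```

Chaining segments with shared boundary derivatives at the waypoints is what produces the continuity of velocity, acceleration, jerk, and snap discussed later.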

E. CONTROL SYSTEM
A cascaded PID controller computes the appropriate thrust and angular velocities, which are sent to the built-in low-level angular-speed controller. The outer loop of the controller controls the position and yaw of the drone along a given trajectory, while the inner loop tracks the desired altitude and attitude. The dynamics of the drone presented in (11) are simplified by assuming that the provided low-level controller regulates the angular velocities perfectly. The proposed control strategy consists of an altitude controller, in which z̈_ff is the feed-forward component; a lateral position controller; a roll-pitch controller with gains k_p−roll and k_p−pitch, where the entries indexed ''R'' come from the rotation matrix given in (1) and c = −F/m, in which F is the lifting force; a 'P' controller for the UAV's yaw; and a body-rate controller, which takes the moments of inertia and the body rates to compute the roll-pitch commands, for both simulation and hardware.
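The cascade described above can be sketched as an outer position loop feeding attitude set-points to an inner attitude loop. The gains, mass, and small-angle attitude mapping below are illustrative assumptions, not the paper's tuned controller:

```python
import numpy as np

# Illustrative gains (not the paper's tuned values)
KP_POS, KD_POS = np.array([4.0, 4.0, 8.0]), np.array([2.5, 2.5, 4.0])
KP_ATT = np.array([6.0, 6.0, 2.5])          # roll, pitch, yaw
M, G = 1.0, 9.81                            # mass [kg], gravity [m/s^2]

def outer_loop(pos, vel, pos_ref, vel_ref, acc_ff, yaw_ref):
    """Position (outer) loop: PD + feed-forward acceleration, converted to
    a collective thrust and roll/pitch set-points via small angles."""
    acc_cmd = KP_POS * (pos_ref - pos) + KD_POS * (vel_ref - vel) + acc_ff
    thrust = M * (acc_cmd[2] + G)           # gravity compensation
    phi_ref = (acc_cmd[0] * np.sin(yaw_ref) - acc_cmd[1] * np.cos(yaw_ref)) / G
    theta_ref = (acc_cmd[0] * np.cos(yaw_ref) + acc_cmd[1] * np.sin(yaw_ref)) / G
    return thrust, np.array([phi_ref, theta_ref, yaw_ref])

def inner_loop(att, att_ref):
    """Attitude (inner) loop: P control on attitude error yields body-rate
    commands for the low-level angular-speed controller."""
    return KP_ATT * (att_ref - att)

# Hover with zero error: commands reduce to pure gravity compensation
thrust, att_ref = outer_loop(np.zeros(3), np.zeros(3), np.zeros(3),
                             np.zeros(3), np.zeros(3), 0.0)
assert np.isclose(thrust, M * G) and np.allclose(att_ref, 0.0)
```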

IV. EXPERIMENTS AND RESULTS

A. GATE DETECTION: PERCEPTION SYSTEM
When designing the perception and computer vision tests, rapidity, computational efficiency, and robustness were our core objectives. There were around 9,000 images for training the gate detection, with poor labeling. We therefore divided the gate detection part into three stages, preprocessing, detection, and post-processing, as discussed in Section III.A. In the preprocessing pipeline we curated the dataset list and filtered out entries with missing polygons, and consequently created masks for training. Sample mask annotations and augmented images are shown in Fig. 5. DeepLabV3 with the Xception and ResNet backbones provides the most accurate results, as shown in Fig. 6, compared to the others. The final accuracy of our masks was 95.6% mIoU on our validation dataset and 96.1% mIoU on our manually annotated images.
During gate detection, another technical challenge arose from occlusion, as shown in Fig. 7, as well as from multiple colors and blurred gates. To tackle these issues, convex hulls were used to close the gaps and approximate the occluded prediction to a quadrilateral shape with four corners. Depending on the size of the occluding pillar, a gap estimation and selection strategy was employed: if the gap was too big, two full shapes were approximated, and the smaller one was chosen for the flyable-area prediction.

B. OPTIMIZATION BASED VIO METHOD
We applied the state-of-the-art feature-based, non-linear-optimization-based VINS-Mono algorithm to obtain information about the drone's motion in the FlightGoggles simulator, shown in Fig. 9, which is photorealistic and reflects the UAV's real dynamics. VINS-Mono also provides online temporal calibration, which solves the sensor synchronization problem. The Root Mean Square Error (RMSE) between the ground truth and the estimated states for VINS-Mono is around 1.7215 m, while the average frame rate is 14.9 FPS. Because VINS-Mono solves a non-linear optimization to estimate the states, it uses the most memory, but it is accurate. Its online temporal calibration means that poor calibration can be tolerated and hardware synchronization is unnecessary. In addition, it supports loop closure and multi-sensor fusion. However, the frame rate was too low, and CPU usage too high, for onboard use. Results of the experiment can be seen in video clips here. 4
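The position RMSE reported above can be computed from synchronized trajectory samples as follows; the arrays in the example are synthetic, for illustration only:

```python
import numpy as np

def trajectory_rmse(estimated, ground_truth):
    """Position RMSE between estimated and ground-truth trajectories,
    both given as N x 3 arrays of synchronized position samples."""
    err = np.linalg.norm(estimated - ground_truth, axis=1)  # per-sample error [m]
    return float(np.sqrt(np.mean(err ** 2)))

gt = np.zeros((100, 3))
est = gt + np.array([1.0, 0.0, 0.0])      # constant 1 m offset
assert np.isclose(trajectory_rmse(est, gt), 1.0)
```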

C. STATE ESTIMATION
The EKF estimates a global map of the gates and, since the gates are stationary, the gate detections align the VINS-Mono estimate with the global gate map, i.e., compensate for VINS-Mono drift. The EKF significantly improves the state estimation accuracy relative to the gates. The proposed EKF is not constrained to use only the next gate; it can work with any gate detection and even profits from multiple detections in one image. The filtering result is shown in Fig. 10. The EKF runs the state prediction and measurement update loop to produce the estimated state. The error distributions between the estimated states and the planned path or ground-truth states are shown on the right side of Fig. 10. Fig. 13 also shows the estimated trajectory together with the nominal and performed trajectories of the quadrotor in the FlightGoggles drone racing arena.

D. PATH PLANNING
A typical drone race involves flying through a series of checkpoints/gates and finishing the course in minimum time. We tested the minimum-snap trajectory generation process in 2D and 3D environments, as shown in Fig. 11.
The use of a minimal set of waypoints yields paths typically composed of natural high-speed arcs in unconstrained regions of the environment, while slowing in tight spaces to minimize snap around sharp corners. This process guarantees asymptotic convergence to a globally optimal solution, as provided by sampling-based approaches, but returns superior paths in much shorter running times than a purely sampling-based approach. Minimum-snap polynomial splines have proven very effective as quadrotor trajectories, since the vehicle's motor commands and attitude accelerations are proportional to the snap, or fourth derivative, of the path. We show that this minimum-snap technique can be coupled with a cascaded controller to generate fast, graceful flight paths in cluttered environments while accounting for collisions of the resulting polynomial trajectories. In Fig. 12, we also show the velocity (ẋ = v), acceleration (ẍ = a), jerk (x⁽³⁾ = j), and snap (x⁽⁴⁾ = s) of the quadrotor to verify that they are continuous in time.

E. CONTROL SYSTEM
To successfully navigate a race track, a drone has to continually sense and interpret its environment and be robust to cluttered and possibly dynamic track layouts. It needs precise control and estimation to support the aggressive maneuvers required to traverse a track at high speed. We address these problems using cascaded PID control for position and attitude: the cascaded PID controller outputs the appropriate thrust and angular rates to the built-in low-level angular-speed controller. Fig. 13 shows the results of the proposed control algorithms; the drone completes the track of 11 gates at a speed of 7 m/s. The methods compared with our proposed approach are given in Table 1. We compared the results in terms of deep learning (detection), estimation, path planning, and control, as well as speed and race track time. Table 1 and Fig. 13 show that the proposed method successfully guided the drone through tight racecourses, reaching speeds of up to 7 m/s, and completed the race track in 18.5 seconds. Moreover, as noted in Table 1, YOLOX with DeepSORT and a Model Predictive Controller (MPC) will be explored in future work.

V. CONCLUSION
This research addresses fundamental challenges in perception, state estimation, planning, and control. We proposed an approach to autonomous, perception-action-aware, vision-based drone racing in a photorealistic environment. Our approach combines a deep convolutional neural network (CNN) with state-of-the-art path planning, state estimation, and control systems. In our work, a novel and computationally efficient gate detection method and VINS-Mono are implemented to detect the corners of the gates and obtain information about the motion of the quadrotor. An efficient EKF then fuses this information to obtain the gate map and the quadrotor's state. The planner and controller use this information to generate a short, minimum-snap trajectory segment and send the corresponding motor commands to reach the desired goal. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods and flies more consistently than many human pilots. Moreover, we showed that our proposed system successfully guided the drone through the tight race courses of the 2019 AlphaPilot Challenge, reaching speeds of up to 7 m/s. There are multiple directions for future work.
While our current set of experiments was conducted in drone racing, we believe that the presented approach could have broader implications for building robust robot navigation systems that need to be able to act in a highly dynamic world.