UAVStereo: A Multiple Resolution Dataset for Stereo Matching in UAV Scenarios

Stereo matching is a fundamental task for 3D scene reconstruction. Recently, deep learning based methods have proven effective on some benchmark datasets, such as KITTI and Scene Flow. UAVs (Unmanned Aerial Vehicles) are commonly utilized for surface observation, and their captured images are frequently used for detailed 3D reconstruction due to high resolution and low-altitude acquisition. At present, the mainstream supervised learning network requires a significant amount of training data with ground-truth labels to learn model parameters. However, due to the scarcity of UAV stereo matching datasets, the learning-based network cannot be applied to UAV images. To facilitate further research, this paper proposes a novel pipeline to generate accurate and dense disparity maps using detailed meshes reconstructed by UAV images and LiDAR point clouds. Through the proposed pipeline, this paper constructs a multi-resolution UAV scenario dataset, called UAVStereo, with over 34k stereo image pairs covering 3 typical scenes. As far as we know, UAVStereo is the first stereo matching dataset of UAV low-altitude scenarios. The dataset includes synthetic and real stereo pairs to enable generalization from the synthetic domain to the real domain. Furthermore, our UAVStereo dataset provides multi-resolution and multi-scene images pairs to accommodate a variety of sensors and environments. In this paper, we evaluate traditional and state-of-the-art deep learning methods, highlighting their limitations in addressing challenges in UAV scenarios and offering suggestions for future research. The dataset is available at https://github.com/rebecca0011/UAVStereo.git


I. INTRODUCTION
O NE of the most active areas of research in photogram- metry and computer vision is three-dimensional (3D) reconstruction of the environment via dense matching, which can be performed in stereo (in two views) [1] or multi-view stereo (MVS) [2].Among image-based approaches, stereo matching [3], where expected correspondences are on the epipolar lines, is arguably the most popular and intensively researched technique.Significant progress in this field has been made in terms of accuracy and cross-domain performance.Thanks to stereo benchmarks [4] [5][6] [7], researchers have achieved high accuracy on benchmarks of driving scenarios and indoor environments.Furthermore, some aerial stereo datasets paved the way for deep learning to succeed also in aerial stereo images [8][9] [10].However, lack of large-scale datasets hinders research on cross-domain performance and application of the stereo matching algorithm to UAV images.
UAV introduces a low-cost alternative to classical aerial photogrammetry for large-scale topographic mapping or detailed 3D recording of ground information and is a valid complementary solution to terrestrial observation.With such UAV images, current networks face at least three challenges: • Larger disparity search space: The conversion between disparity and depth can be written as disparity = Bf Depth , with baseline B and focal length in pixels f .Focal length in pixels f can be further expressed as Wpx×fmm W CCD , in which W px , f mm , W CCD represents image width in pixels, focal length in mm, CCD width in mm respectively.With the advancement of digital cameras, the resolution of acquired images is increasing, resulting in a larger disparity search space, which places additional performance requirements on the algorithm.
• Bigger possibility of ill-areas: UAV are often used to capture information about the ground surface, such as forests and grasslands, where features are harder to match.In addition to the low-altitude acquisition characteristics, it is easier for UAVs to acquire images containing illareas such as textureless and repetitive textures, which is extremely challenging and may in many cases be an inherently ill-posed problem.• More varied disparity distribution: the UAV's lightweight and adaptable characteristics allow it to collect images from a variety of heights, where disparity distribution differs greatly from driving and indoor scenarios and is significantly more variable.The majority of current algorithms are applied to datasets with the same disparity distribution, such as driving and indoor, which presents an additional challenge.
In addition to the above properties of UAV images, the literature [5] points out that the existing algorithm has a huge gap between the synthetic domain and the real domain.To shorten this gap, synthetic image pairs are used in advance for pretraining and a small amount of real images are used therewith for finetuning, which can significantly improve the pretrained models' capacity on real data.Obviously, current single synthetic dataset or the real dataset cannot match the requirements.It is necessary to combine synthetic and real data in one dataset.
Towards this goal, we propose a novel UAV scenarios dataset that co-exists with synthetic and real data.For synthetic data, we propose a large-scale stereo dataset with sufficient variation, realism, and size to successfully train large networks in UAV scenarios.For real data, the acquired images are provided after four-step processing (including initial disparity maps generation, epipolar images generation, epipolar disparity maps generation, and post-processing).In addition, the original resolution stereo pairs and corresponding disparity maps are supplied for high-resolution network evaluation.The main contributions of our paper are: • We propose a pipeline (detail in III-B and III-C) capable of generating dense disparity for both synthetic and real images from UAV-obtained images and point clouds • We construct a new UAVStereo, which consists of image pairs and dense disparity maps for three representative UAV scenes.To shorten the gap between synthetic and real domain, we construct both synthetic and real data to increase the availability of in real scenarios and decrease the quantity demand for real data.In order to adapt to the imaging characteristics and disparity distribution in UAV scenarios, we published multi-resolution images and corresponding disparity maps.• We evaluate traditional and state-of-the-art deep techniques on our dataset.Experimental results across different datasets and stereo methods demonstrate that our dataset is more suitable for the UAV scenario.Our dataset also presents challenges to current algorithms in terms of resolution, disparity search range and geospatial feature matching.

A. Stereo Matching Methods
For many years, most algorithms have been solving stereo matching problem following a typical four-step pipeline [1]: Matching cost computation, cost aggregation, disparity computation, and disparity refinement.Among the vast literature on traditional algorithms [11] [12] [13] [14] [15], Semi-Global Matching (SGM) [16] is the most popular, which is a reference approach combining mutual information and dynamic programming optimization on several directions [17].
With the development of deep learning, the first research efforts focused on replacing the individual steps of the conventional pipeline with deep learning counterparts.For instance, 2D convolutional neural networks (CNN) prove effective in feature extraction [18].SGM-Net uses a CNN network to provide learned penalties for SGM [19].
Then, end-to-end deep stereo networks rapidly gained the main stage [20][21] [22].Inspired by FCN used in semantic segmentation [23][24], DispNet [21] adopts an encoderdecoder architecture to enable end-to-end disparity regression, where the matching cost can be directly integrated to encoder volumes.GC-Net [20] combines contextual information by 3D convolutions over a cost volume.PSMNet [25] integrates global context information using spatial pyramid pooling and regularizes cost volume using stacked multiple hourglass 3D convolutional networks.To tackle the high memory consumption for high-resolution image matching, Deepprunner [26] proposes to prune the 3D cost volume with a differential patch match method.STTR [27] revisits the problem from a sequence-to-sequence correspondence perspective and replaces cost volume construction with dense pixel matching using position information and attention.
While these methods were developed by the computer vision community on the indoor or driving dataset, numerous researchers introduced stereo matching networks into geospatial aerial images and proved to be effective [28] [29].
Compared with the traditional method, learning-based stereo matching networks have shown excellent feature matching capabilities in many scenarios and have been applied in aerial image stereo matching due to their superior cross-domain generation capabilities.However, the capability of the network in UAV imagery is not validated due to the lack of data.

B. Stereo benchmarks
Among the factors behind the rapid development of stereo matching, the growing availability of datasets plays a crucial role.Table I lists some datasets for stereo matching proposed by the aerial photogrammetry and computer vision communities.Some are for driving scenes: the KITTI datasets in two versions, KITTI 2012 [4] and KITTI 2015 [30]; the large-scale stereo DrivingStereo [5] and AppolloScape [31].Then Middlebury 2014 [7] framing indoor environments, and ETH3D [6] including both indoor and outdoor scenes are also popular and widely utilized.In aerial photogrammetry, SatStereo [32], [33] and ISPRS2019Benchmark [8] dataset is frequently used for dense matching evaluation.Using advanced computer graphics, The Scene Flow [21], Virtual KITTI [34] , and MPI Sintel [35] datasets synthesize dense disparity maps, yet it remains a huge gap between the synthetic domain and the real world.
UAVs are commonly used for earth observation because of their mobility, flexibility and low cost.However, UAV images processing requires high time and computational memory, due to its high resolution.We are attempting to reduce the processing time by incorporating stereo matching network into the it.To this goal, large-scale UAV scenarios datasets are required to train the network.In this paper, we propose the UAVStereo dataset, which contains a large amount of image pairs with dense disparity to facilitate the training and testing of stereo matching networks.

III. UAVSTEREO DATASET
This section introduces the UAVStereo data production process.The data acquisition system and covering areas in section III-A.The data production pipelines for synthesis and real data are described in III-B and III-C respectively.
The horizontal accuracy and vertical Accuracy of L1 radar are 10 cm and 5 cm respectively.The maximum range of DJI L1 is 190m at 10%, 100klx and 450m at 80%, 0klx.The point clouds acquired by Zenmuse L1 LiDAR is first converted into the standard las format by DJI Terra1 .Then 3D digital surface models with OBJ format were reconstructed using Daspatial GET3D Cluster software2 from a substantial amount of images and point clouds.To make the surface model more accurate to the actual scene, we manually corrected the photos with severe rotation and the points with large errors in the aerotriangulation results.
Three different scenarios are included in our UAVStereo, including residential land, forest and mining area.As shown in Fig. 2, the residential area contains dense and regular tall buildings, flat roadways and other urban scene features, covering about 700 × 1200 m 2 .This area provides an urban scene with disparity saltation like building.The forest area contains high-coverage trees, several houses and other field scene components, covering about 1350 × 1500 m 2 .This region has textureless and repeated textured images, which presents a difficulty for stereo matching algorithms.The mining zone is made up of around 700 × 700 m 2 of agriculture, low structures, and bare ground, which contains continuous variation of disparity.These three areas are representative areas for UAV earth observation and can represent different disparity distribution.

B. Synthetic Dataset
The model trained on synthesized data can provide a good initial pretrained model for the application in real UAV images.Therefore, we generate multi-resolution and multi-scene UAV data to adapt to the application of different sensors and different scenes.Similar to Scene Flow [21], we used the open-source 3D creation suite Blender3 to simulate the flight path of drones and render the results into tens of thousands of frames.As shown in fig 1, We rendered textured 3D models into color images and the corresponding ground truth disparity maps, generating a synthetic dataset subset consisting of both low and high resolution subsets.Given the intrinsic camera parameters (focal length f , principal point (x 0 , y 0 )), the render settings (image size W, H, and sensor size and format) and the exterior orientation (camera center (Xs, Y s, Zs) and three rotational angles (ϕ, ω, κ), baseline B ), Blender can directly retrieved the depth of each pixel from imported OBJ models.We generated stereo images using Blender's Stereoscopy following the designed drone flight path.According the formula disparity = Bf Depth , we adjusted the render settings and converted the depth to disparity using the known configuration of the virtual stereo rig, generating the corresponding disparity maps Adjusting the external orientation elements, the synthetic images and corresponding disparity maps are acquired at 100 -300 m above the model with high overlap.For all frames and views, we provide 8-bit RGB images and disparity maps with lossless pfm format.We rendered all image data using a virtual focal length of 35 mm on a 36 mm wide simulated sensor.We released the high and low resolution subsets at 960 × 540 px (same as Scene Flow) and 8192 × 5460 px (same as Zenmuse P1).At the same time, we also resize highresolution images to 3840 × 2160 px and 1920 × 1080 px for multi-resolution evaluation.The baseline is set to 1 m to 15 m for the low-resolution subset due to the image size limitation, while 15 m to 35 m for the high-resolution subset.The image size, camera center and baseline length of these two subsets are greatly different, which can not only provide multi-resolution images, but also test the robust performance of stereo matching algorithms.
In Tab.II, we present the sample data in Synthetic subset with a baseline length of 15 m.

C. Real Dataset
For real data, we make use of the images collected by P1, POS (position and orientation system) data containing image position and orientation, and the georeferenced OBJ models.The collected data generates epipolar stereo image pairs and corresponding disparity maps in a four-step pipeline: initial disparity maps generation, epipolar image generation, epipolar disparity maps generation, and post-processing.
After aligning the camera with the model using the position (Xs, Y s, Zs) and orientation (ϕ, ω, κ) in POS, an initial disparity map corresponding to the image can be rendered in the same way as the synthetic data using Blender.
The second step of the processing pipeline is to create epipolar image pairs from adjacent images with sufficient overlap.Stereo rectification can be implemented by feature extraction and matching, homography matrix calculation, and interpolation resampling, which Off-the-shelf functions in the OpenCV library can implement.The corresponding orientation parameters are generated to facilitate subsequent disparity map processing.
Then, in order to retain the same transformation between pictures and disparity maps, we deal with disparity maps by applying the orientation parameters from the previous step.
The well-chosen photos and the corresponding disparity maps are then croped to 960 × 540 px, which can be applicable to most networks.
The result image pairs and coeresponding disparity maps are shown in Tab.III IV.EXPERIENCES

A. Dataset Overview
The above process creates a sizable stereo matching dataset containing real and synthetic data using various scene photographs and LiDAR data collected by the UAV platform.We divided the training set and testing set in a ratio of approximately 8:2 for each scenario, spliting the total 38781 stereo samples into 31024 and 7757 for training and testing proposes, detailed in Tab.IV.In Tab.IV, we also list key parameters and additional details of UAVStereo dataset.
The contributions of UAVStereo can be summarized as follow: • First-ever UAV scenario stereo dataset.Different from autonomous driving, aerial, and indoor datasets, we propose a noval pipeline to generate images and disparity maps

B. Experience Setup
After generating the epipolar images and the corresponding ground truth disparity maps, we evaluate several traditional and learning-based methods on our UAVStereo dataset.Comparative results illustrate the significance and challenge of our dataset.
We implemented the above deep neural network in PyTorch.All networks are trained end-to-end, given the images as input The traditional SGM is available in OpenCV and easy to implement with C++.In SGM, we used census to calculate the matching cost, aggregated the matching cost on 8 paths.Postprocessing such as consistency check, uniqueness constraint and culling of small connected regions were adopted to the completeness and consistency of output.
To assess the accuracy of stereo algorithms and networks, we conducted quantitative statistics EPE (End Point Error) and NPE (N-pixel Error) as evaluation metrics.EPE is the absolute mean of the difference between the estimated disparity map and the ground truth; NPE is the percentage of pixels having error larger than a threshold N: m is the total number of pixels, D pred is the output predicted disparity map, D gt is the ground truth disparity map.As initially our ground-truth disparity maps are inferred at 960 × 540 px, we assumed 3 pixels as the lowest threshold.Then, given the much larger disparity in real subset, we computed error rates up to 30PE and 100PE.

C. Evaluation on Synthetic subset
In the evaluation of the synthetic subset, we trained the network on all training data instead of a single dataset to avoid overfitting on a particular subset.The pre-trained models was verified on different scenarios.Tab.V compares the predicted disparity maps with its corresponding disparity ground-truth, collecting the outcome of this evaluation.In the evaluation of low-resolution(the upper portion of Tab.V), the experimental results indicate that the traditional SGM method has considerable errors with UAV image pairs, whereas end-to-end networks significantly improve the stereo matching accuracy.Applying the SGM algorithm to UAV images, the output disparity maps are incomplete and discontinuous, resulting in large error metrics.Comparing the performance of SGM on Middlebury, we demonstrate that the SGM algorithm is unsuitable for processing UAV geographic images containing ill-areas, because this method actually uses a matching window of limited size, which is incapable of obtaining and utilizing global information.Among leaning-based methods, PSMNet achieves the best results with EPE and 3PE values of 3.443 px and 11.634 %, respectively.The result demonstrates that our UAVStereo dataset can be applied for stereo matching networks, with precision result comparable to that of blend-edMVS [40] when converted into depth.Despite the fact that the stereo matching network has grown rapidly since 2018, we found that PSMNet performed best on low-resolution images for challenging data, such as low-altitude drone images, possibly due to the use of global information through spatial pyramid pooling and 3D convolution strategies.Meanwhile, we point out that the latest network EAIStereo network's loss function converges to the large loss value, leading to large errors on the testing set.Consequently, EAIStereo may not be suitable for UAV images.
We also list representive disparity maps predicted by above algorithms in VII.It can be seen that the disparity maps generated by SGM exist quite a few invalid values, resulting in large error metrics.While learning-based models can inference complete and continuous disparity maps in challenging regions like textureless ground.This result shows the capability superiority of deep learning-based stereo matching on UAV images.Among the learning-based algorithms, PSMNet and RAFT-Stereo perform better in the dataset, since accurate disparity maps can be obtained in all three scenarios.
The disparity searching range increases as images resolution does.As shown in Tab.II and III, the disparity searching range of 1920 × 1080px and 3840 × 2160 px should be set to 768 px and 960 px, causing an increase in computing memory usage.Since networks inferences are performed on a single GPU, most algorithms can run only at 960 × 540 px because of memory constraints.Consequently, their predicted disparities are upsampled with bilinear interpolation in order to perform the comparison with higher resolution ground-truth maps, with predicted disparities scaled by the upsampling factor itself.In the bottom portion of Tab.V, we list the evaluation results at 1920 × 1080 px and 3840 × 2160 px.Due to the incompatibility of SGM and EAIStereo with drone imagery, we did not evaluate it on larger resolution.We can notice how all methods struggle at achieving good results at such high resolution, with RAFT-Stereo achieving best results.This result was expected because RAFT-Stereo achieves top-rank on Middlebury.Since all error metrics on low resolution images are still very far from those on existing benchmarks [4] [21] [7], we point out how they are still have large great challenge in geospatial UAV images due to the large presence of ground objects.By comparing the results of low and high resolution, we confirm that resolution is undoubtedly a challenge in our benchmark.
Whatsmore, comparing the test results on R, F and M subsets, the error of PSMNet in forest area is higher than that in residential area and mining area, whereas the other methods are just opposite.This suggests that DSMNet, CFNet, and RAFTStereo are more suitable to deal with disparity estimation of repetitive textures, like woods.

D. Evaluation on Real subset
In order to demonstrate the capacity of real subset in UAVStereo, we select the top-performing RAFT-Stereo network from previous evaluation, which is capable of handling huge disparity estimation.We conducted two experiments on the UAVStereo real subset: training the network directly with real training set and finetuning the synthetic pre-trained model using real data.In the latter experiment, we finetuned the pre-training model using only a quarter of the synthesized data.Take disparity range into account, we set the maximum disparity search range in the training stage of real data to 1920 px.Due to the large disparity value in the real scene,  It has been demonstrated that the real subset of UAVStereo can be used to train stereo matching models, though there is a large gap with the results on low resolution synthetic datasets.This may be related to the large disparity search range.In addition, we found that, although utilizing less data, the finetuned results were comparable to the training results.This confirms both our claims on the challenges in deep stereo networks as well as the significance of our dataset.

V. CONCLUSIONS
The quantity and quality of the dataset are critical to the performance of stereo matching algorithms.In this paper, a novel pipeline is proposed for generating image pairs and dense disparity for UAV scenes using images and point clouds.With the proposed pipeline, we have constructed the UAVStereo, a novel stereo dataset in UAV Scenarios -containing synthetic and real image pairs -featuring large disparity searching range and covering geospatial information, which is extremely challenging for the existing learning-based networks.Compared to available stereo datasets targeting autonomous driving, indoor, and aerial, UAVStereo is the first stereo matching dataset in UAV Scenarios including a large number of image pairs and dense label, which is supposed to speed up the process of 3D reconstruction.
UAVStereo dataset can represent to some extent the characteristics of UAV stereo matching data with large disparity search space, bigger possibility of ill-areas and more varied disparity distribution.Many experiments demonstrate that our dataset can be used for stereo matching of traditional algorithms and deep learning algorithms, and through experiments we find that deep learning-based methods have significant advantages over traditional algorithms and have great potential for development.Using a small amount of real data to finetune the pretrained model can achieve the comparable accuracy as the real data, and this strategy can effectively reduce the amount of real data required.
Our experiments show that UAVStereo unveils some of the most intriguing challenges in deep stereo and provides hints on promising research directions.In particular, followup work fostered by UAVStereo may be devoted to 1) investigating on the ability of deep models processing on large disparity searching range.2) enhancing the capability of process geospatial data covering ground objects, which is great different from driving indoor scenes.3) improve the generalization ability of the network between the synthetic domain and the real domain.
Therefore, we hope that UAVStereo holds the potential to improve future research in UAV scenario stereo matching and 3D reconstruction.

Fig. 1 :
Fig. 1: Pipeline of data generation.The red box shows the real data generation process.The green box shows the synthetic data rendering process.
UAVStereo Synthetic data.Left: Residential land.Center: Forest area.Right: Mining area.In distribution histogram, the blue, purple, green and orange line respectively represent the disparity distribution of 960 × 540 px, 1920 × 1080 px, 3840 × 2160 px, 8192 × 5460 px disparity maps.using UAV imagery and LiDAR point clouds in UAV scenarios.• Large disparity range.For the changes in imaging sensor and exploring areas, our dataset contains multiple resolution images in representative scenes to adapt to the changes of payload sensors and environments.• Containing both synthetic and real data.Compared with the existing single synthetic or real dataset, UAVStereo contains both synthetic and real datasets to bridge the gap between real and synthetic domains.• High diversity.Our dataset provides a variety of representative scenarios and multiple flight paths, making it possible to account for most situations from the straightdown perspective of a drone.
III: UAVStereo Real data.Left: Residential land.Center: Forest area.Right: Mining area.and the disparity maps.We used the same learning rate, optimizer and loss function as the original network.Since deep networks inferences are performed on a NVIDIA 3090 RTX GPU, we set the batch size to 1 and crop the image to 256×512 px in all network training phase.The training process continues until the loss function is no longer changing.The last epoch model was used for evaluation.

TABLE I :
Comparison of available stereo datasets.

TABLE IV :
Key parameters of stereo data in UAVStereo.

TABLE V :
Results on the UAVStereo Synthetic subset.We trained model on 960×540 px and evaluated the model on multiple resolution ground-truth maps.Best scores in bold.R: Residential land testing subset, F: Forest areas testing subset, M: Mining areas testing subset, A: All synthetic testing set.

TABLE VI :
Results on the UAVStereo Real subset.

TABLE VII :
Comparison of disparity maps estimated by different methods.From top to bottom: left image, ground truth disparity map, SGM, PSMNet, DSMNet, CFNet, RAFT-Stereo