Object Detection in Large-Scale Remote Sensing Images With a Distributed Deep Learning Framework

Abstract—With the accumulation and storage of remote sensing images in various satellite data centers, the rapid detection of objects of interest from large-scale remote sensing images is a current research focus and application requirement. Although some cutting-edge object detection algorithms for remote sensing images perform well in terms of accuracy, their inference is slow and their hardware requirements are high, making them unsuitable for real-time object detection in large-scale remote sensing images. To address this issue, we propose a fast inference framework for object detection in large-scale remote sensing images. On the one hand, we introduce the α-IoU loss into the YWCSL model to implement adaptive weighting of the loss and gradient, achieving 64.62% and 79.54% mAP on the DIOR-R and DOTA test sets, respectively. More importantly, the inference speed of the YWCSL model reaches 60.74 FPS on a single NVIDIA GeForce RTX 3080 Ti, 2.87 times faster than the current state-of-the-art one-stage detector S²A-Net. On the other hand, we build a distributed inference framework to enable fast inference on large-scale remote sensing images. Specifically, we store the images on HDFS for distributed storage and deploy the YWCSL model to a Spark cluster. When using five nodes, the speedup of the cluster reaches 9.54, which is 90.80% higher than the theoretical linear speedup (5.00). Our distributed inference framework significantly reduces the dependence of object detection on expensive hardware resources, which is of great significance for the wide application of object detection in remote sensing images.

I. INTRODUCTION
Object detection has always been a research hotspot in computer vision. It is frequently used in practical problems such as pedestrian detection, industrial detection, and building detection. With the continuous development of remote sensing technology, large-scale and high-resolution remote sensing image datasets have emerged, such as DOTA [1], DIOR-R [2], and HRSC2016 [3]. These datasets are rich in ground objects, including common categories such as airplanes, ships, vehicles, and basketball courts. Classifying [4], [5], segmenting, and detecting [6] these remote sensing images is of great significance. Object detection in remote sensing images [7] has been widely used in land planning, maritime fisheries, military reconnaissance, deforestation detection, and other fields [8], [9]. However, in practical applications, object detection in remote sensing images is more challenging than in natural scenes [7], mainly for the following reasons:

a) Greater Detection Difficulty: Compared with natural scene images, the background of remote sensing images is more complex and contains many small objects. Objects in remote sensing images have arbitrary orientations, scale variations, extremely uneven distributions, and large aspect ratios, all of which make detection harder.

b) Higher Inference Costs: Object detection in remote sensing images often involves large-scale datasets (TB or even PB level) [10] in practical applications, which demand high data processing speed. In addition, remote sensing images have high resolutions [11], and feeding them directly into the model for inference consumes a great deal of memory. Usually, the original image must be cut into slices before being fed into the model, which further increases the time overhead of detection. More importantly, current state-of-the-art object detection models for remote sensing images generally rely on expensive GPU hardware during inference to achieve a tolerable speed, which significantly increases their practical application cost and is one of the biggest obstacles to their widespread deployment in practical tasks.
In recent years, many researchers have made outstanding achievements in the field of object detection in remote sensing images. For example, cutting-edge algorithms such as RetinaNet OBB [12], Cascade Mask R-CNN [13], and RoI Transformer [14] have achieved high accuracy on the DOTA-v1.0 dataset.
However, most of the algorithms above are two-stage detectors, which achieve better detection accuracy than one-stage detectors but are generally slower at inference. More importantly, these algorithms require a powerful hardware environment, which increases the hardware cost of object detection models in practical applications. In contrast, the combination of YOLOv5 with the circular smooth label (CSL) [15] not only has excellent detection accuracy [16] (slightly lower than ReDet [17]) but also far exceeds the inference speed of most models (see Fig. 1), making it more suitable for object detection tasks in large-scale remote sensing images. We therefore use "YOLOv5 with CSL" [16] as the baseline for our follow-up work and refer to it as YWCSL for convenience. Our contributions can be summarized as follows:

1) We introduce the α-IoU loss function [18] for bounding box regression on top of the YWCSL model and analyze in detail how it improves the performance of the model. Our improvements enable the YWCSL model to achieve 64.62%, 79.54%, and 76.28% mAP on the DIOR-R, DOTA-v1.0, and DOTA-v1.5 test sets, respectively; when 0 < α < 1, the YWCSL model outperforms the baseline on categories of small and medium pixel sizes. More importantly, its detection accuracy on the DOTA-v1.0 test set is 0.12% higher than that of the current state-of-the-art one-stage detector S²A-Net.

2) We demonstrate the excellent inference speed of YWCSL by benchmarking several current state-of-the-art algorithms on the DOTA-v1.0 test set with a single NVIDIA GeForce RTX 3080 Ti; as shown in Table IV, YWCSL reaches 60.74 FPS.

3) We build a distributed inference framework for object detection in large-scale remote sensing images. When tested with five nodes, its speedup ratio reaches 190.8% of the theoretical linear speedup ratio, which greatly reduces the dependence of object detection in remote sensing images on expensive GPU resources [19] and lowers the hardware cost of practical applications. It plays a crucial role in promoting the wide application of object detection in remote sensing images.

The remainder of this article is organized as follows. Section II reviews recent object detection algorithms for remote sensing images. Section III introduces our improvements to the YOLOv5 model and details the steps to deploy the model on a Spark cluster for distributed inference. In Section IV, we compare the detection accuracy and inference speed of YWCSL with other state-of-the-art models and analyze the impact of different numbers of nodes and different dataset sizes on the speedup ratio of the distributed inference framework. Finally, we select remote sensing images of a 13.53 km² area near the Optics Valley Plaza in Wuhan, China, to test the detection performance of our distributed inference cluster in a practical application and compare the detection details of our model and the baseline on small objects.

II. RELATED WORK

A. Object Detection in Remote Sensing Images
Object detection has always been a research hotspot in computer vision. Mainstream object detection models are mainly divided into anchor-based and anchor-free detectors. Anchor-based detectors include one-stage detectors represented by the YOLO series [20] and two-stage detectors represented by Faster R-CNN [21]. Two-stage detectors usually rely on a Region Proposal Network (RPN) to generate high-quality regions of interest (RoIs), which is disadvantageous for real-time detection tasks. One-stage detectors extract features directly in the convolutional neural network to predict object classification and location, achieving a good balance between detection accuracy and inference speed. Therefore, the one-stage detectors of the YOLO series are especially widely used in real-time detection tasks. The other type is the anchor-free model, represented by CornerNet [22], ExtremeNet [23], etc.
Nowadays, with the rapid development of object detection technology, a series of object detection models suitable for remote sensing images have been proposed, including R2CNN [24], RRPN [25], and SCRDet [26], all of which are improved versions of Faster R-CNN. In addition, there are more advanced detection algorithms, such as CSL [15], DCL [27], RoI Transformer [14], and ReDet [17]. They all show good performance on various remote sensing datasets, such as DOTA and DIOR-R. Unlike traditional object detection tasks, objects in remote sensing images are more densely distributed and appear in arbitrary orientations. To align objects more accurately, the oriented bounding box (OBB) is often used instead of the horizontal bounding box (HBB) to enclose the objects in remote sensing images.
For oriented object detection, Ding et al. [14] proposed the Rotated Region of Interest (RRoI) learner module to learn RRoIs from the feature maps of horizontal regions of interest. In addition, they proposed a Rotated Position Sensitive RoI Alignment module to extract rotation-invariant object features and finally achieved 69.56% mAP on the DOTA-v1.0 test set; their work provides essential conceptual support for the ReDet algorithm.
Yang et al. [15] discretized the regression of the OBB angle into a classification task and solved the periodicity of angle (PoA) problem using a periodic window function. Their FPN-CSL-based model improves mAP to 76.17% on the DOTA-v1.0 test set.
On the basis of previous work, Han et al. [17] proposed the Rotation-invariant RoI Align (RiRoI Align) module, which adaptively extracts rotation-invariant features from rotation-equivariant features according to the orientation of the RoI, improving mAP to 80.10% on the DOTA-v1.0 test set while significantly reducing the size of the model.
Most of the cutting-edge algorithms mentioned above are two-stage models. Although they bring a considerable improvement in accuracy, they also incur substantial inference time overhead. Aiming at the balance between inference speed and accuracy, Han et al. [28] proposed a high-performance one-stage detection model, S²A-Net. They proposed a feature alignment module (FAM) to generate high-quality anchors and an oriented detection module (ODM) to alleviate the inconsistency between classification scores and localization accuracy, achieving 79.42% mAP on the DOTA-v1.0 test set at an inference speed of 16.0 FPS (tested on a single NVIDIA Tesla V100 GPU). In addition, Wen et al. [29] used the CSL module to improve the YOLOv5 model (see Fig. 2) and employed data augmentation methods such as Mosaic during training. Their improved YOLOv5 model achieves 58.2% mAP on the DOTA-v2.0 test-dev set, which is an excellent detection performance. The open-source model in [16] uses a similar approach to combine the YOLOv5 model with CSL (YWCSL) and exhibits 77.30% and 73.19% mAP on the DOTA-v1.0 and DOTA-v1.5 test sets, respectively. Moreover, YWCSL retains the excellent inference speed of YOLOv5, which makes it very suitable for object detection tasks in large-scale remote sensing images.

B. Distributed Deep Learning Computing Framework
As a mainstream Big Data computing framework, Apache Spark [30] provides efficient Big Data processing capabilities. Unlike Hadoop [31], Spark stores the intermediate data generated during computation directly in memory, reducing the time spent interacting with disks. More importantly, Spark internally generates a directed acyclic graph (DAG) based on the program's execution logic, which significantly speeds up distributed parallel computing. Therefore, Spark's computing speed is usually much higher than Hadoop's.
Nowadays, machine learning has been integrated into all walks of life. For example, Liu et al. [32] used KNN-XGBoost to predict missing values, which played an essential role in detecting transmission line risks in smart grids. Song et al. [33] proposed Bi-CLKT to track students' learning, which helps education departments formulate systematic learning plans for students. However, most such work is based on the traditional single-machine computing mode, which handles Big Data poorly as the amount of data in real tasks grows rapidly. Instead, Spark's distributed computing model, which combines a Big Data processing framework with machine learning algorithms, has been widely adopted across industries. Spark's MLlib module provides a wealth of machine learning algorithms, such as K-means, ALS, and SVM, and users can easily invoke them through simple interfaces. In addition, a more efficient deep learning framework suitable for Big Data processing has become one of the current research hotspots. In recent years, many well-known distributed deep learning computing frameworks have been proposed, such as Google's DistBelief [34], Baidu's DeepImage [35], SparkNet [36], TensorFlowOnSpark [37], and BigDL [38].
1) TensorFlowOnSpark: TensorFlowOnSpark is a distributed deep learning computing framework developed by Yahoo that supports all functions in TensorFlow, including model parallelism, data parallelism, and synchronous or asynchronous training and inference. It uses direct server-to-server communication to speed up operation, supports Hadoop and Spark clusters, and implements distributed deep learning on GPU and CPU clusters. TensorFlowOnSpark provides distributed TensorFlow training and inference for Spark clusters but currently does not support other mainstream deep learning frameworks such as PyTorch.
2) BigDL: Dai et al. [38] point out that the mainstream distributed deep learning frameworks (CaffeOnSpark [39], TensorFlowOnSpark [37], etc.) all adopt a "connector approach," developing dedicated interfaces within an integrated workflow to connect different data processing and deep learning components. In practice, adapting other frameworks brings substantial overhead (such as data serialization, persistence, and interprocess communication). More importantly, this mode suffers from impedance mismatches [40]: Big Data systems and deep learning systems have very different execution models. Tasks in Big Data systems are parallel and independent, while tasks in deep learning systems are coordinated and interdependent. When a worker in Spark fails, its tasks are restarted, which may cause the entire workflow of the system to block indefinitely. Therefore, BigDL takes a different approach and implements distributed deep learning support directly in the Big Data system, eliminating the impedance mismatch problem. BigDL supports mainstream frameworks such as PyTorch and TensorFlow, which significantly improves the development and operating efficiency of Big Data deep learning applications.
However, TensorFlowOnSpark, BigDL, and other mainstream distributed deep learning frameworks all still face many problems during the training phase [38], [40]. For example, TensorFlowOnSpark and similar frameworks using the "connector approach" training mode face the impedance mismatch problem described above. BigDL does not support training on distributed GPU clusters; moreover, with the frequent updates of deep learning frameworks such as TensorFlow, new versions of BigDL must be maintained frequently. At present, the BigDL community is not maintained actively enough and its users are not numerous, which shows that the technology of applying distributed deep learning frameworks to the training phase is not yet mature. Of course, we do not dismiss this kind of distributed training mode; on the contrary, we firmly believe that once perfected, this distributed deep learning training technology will receive widespread attention and greatly promote the development of deep learning.
Given the impedance mismatches and other issues associated with training deep learning models on Big Data clusters, training in the traditional stand-alone mode and performing only inference on the Big Data cluster seems to be the better choice. Practice has proved that this is a convenient and low-cost way to fully exploit the performance of object detection models on remote sensing images.
In this work, we encapsulate model inference into a generic function that each node of the cluster executes independently and in parallel, with execution resources uniformly managed by YARN. As a result, we implement a lightweight, high-performance distributed inference framework that is very suitable for object detection in large-scale remote sensing images. In addition, to improve the detection accuracy of the model on remote sensing images, we introduce the α-IoU loss function [18], which improves the detection of small objects. We introduce our work in detail in the next section.

III. METHOD
In this section, we first introduce the nature of α-IoU and analyze the effect of different α values on the loss and gradient of the training process in Section III-A. Next, we detail the principles of the distributed inference framework, including data storage, resource management, and model inference steps, in Section III-B. We use cheap CPU cluster resources to build a lightweight and high-performance distributed inference framework suitable for object detection in large-scale remote sensing images, which significantly reduces the dependence of the model on expensive GPU resources in practical applications and has essential research significance for the practical application of object detection in remote sensing images.
A. α-IoU [18]

In anchor-based detectors, bounding box regression and object classification are usually learned as two subtasks. In the bounding box regression subtask, the localization loss is generally calculated from the Intersection over Union (IoU) between the predicted bounding boxes and the ground truths. Since the traditional IoU loss suffers from vanishing gradients when the predicted box and the ground truth do not overlap, different loss functions have successively been proposed to solve this problem, such as the generalized IoU (GIoU) [41], distance-IoU (DIoU), and complete IoU (CIoU) [42] losses.
In GIoU, Rezatofighi et al. [41] introduced the minimum enclosing rectangle of the predicted bounding box and the ground truth to reflect their degree of coincidence, solving the vanishing-gradient problem that arises when the two boxes do not overlap. DIoU [42] additionally considers the central point distance, overlap rate, and scale between the predicted bounding box and the ground truth, which makes bounding box regression more stable and converge faster than with GIoU. CIoU builds on DIoU by further taking the aspect ratio between bounding boxes into account.
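For reference, these three losses have the following standard forms from [41] and [42], where $B$ and $B^{gt}$ are the predicted box and the ground truth, $C$ is their minimum enclosing box, $b$ and $b^{gt}$ are the box centers, $\rho(\cdot)$ is the Euclidean distance, $c$ is the diagonal length of $C$, and $v$ measures aspect-ratio consistency (we write the trade-off coefficient as $\beta$ to avoid clashing with the $\alpha$ of α-IoU):

$\mathcal{L}_{\text{GIoU}} = 1 - \text{IoU} + \dfrac{|C \setminus (B \cup B^{gt})|}{|C|}$

$\mathcal{L}_{\text{DIoU}} = 1 - \text{IoU} + \dfrac{\rho^{2}(b, b^{gt})}{c^{2}}$

$\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \dfrac{\rho^{2}(b, b^{gt})}{c^{2}} + \beta v, \quad v = \dfrac{4}{\pi^{2}}\left(\arctan\dfrac{w^{gt}}{h^{gt}} - \arctan\dfrac{w}{h}\right)^{2}, \quad \beta = \dfrac{v}{(1 - \text{IoU}) + v}$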
The α-IoU loss used in this article covers all of the above IoU loss functions [18]. In addition, it can adaptively weight the loss and gradient of high-IoU and low-IoU objects to further improve the performance of the model. α-IoU is defined as

$\mathcal{L}_{\alpha\text{-IoU}} = 1 - \text{IoU}^{\alpha}, \quad \alpha > 0$,

and applying the same power transformation to the penalty terms of CIoU yields the α-CIoU loss used in our experiments [18]. This formulation has three important properties [18]:

1) Order Preservingness

Like GIoU, DIoU, and the other IoU-based loss functions, the α-IoU loss decreases monotonically as IoU increases. When α = 1.00, the α-IoU loss reduces to the traditional IoU loss.

2) Relative Loss Reweighting
The α-IoU loss can be regarded as the IoU loss multiplied by a weighting factor, defined as

$w_{r}^{\mathcal{L}} = \mathcal{L}_{\alpha\text{-IoU}} / \mathcal{L}_{\text{IoU}} = \dfrac{1 - \text{IoU}^{\alpha}}{1 - \text{IoU}}$,

from which $\lim_{\text{IoU} \to 0} w_{r}^{\mathcal{L}} = 1$ and $\lim_{\text{IoU} \to 1} w_{r}^{\mathcal{L}} = \alpha$. When α > 1, $w_{r}^{\mathcal{L}}$ increases the loss weight of high-IoU objects as IoU grows, making the model pay more attention to objects with high IoU (see Fig. 3). Conversely, when 0 < α < 1, $w_{r}^{\mathcal{L}}$ reduces the loss weight of high-IoU objects as IoU grows, and the model naturally pays more attention to regressing the bounding boxes of low-IoU objects.

3) Relative Gradient Reweighting
Similar to the second property, α-IoU can also adaptively reweight the gradient. The weighting factor is defined as

$w_{r}^{\nabla} = \left| \nabla_{\text{IoU}} \mathcal{L}_{\alpha\text{-IoU}} \right| / \left| \nabla_{\text{IoU}} \mathcal{L}_{\text{IoU}} \right| = \alpha \, \text{IoU}^{\alpha - 1}$.

When 0 < α < 1, the weighting factor decreases as IoU increases, and when α > 1, it increases as IoU increases. In other words, whenever α ≠ 1, the α-IoU loss reweights the gradient of each object according to its IoU and thus adjusts the learning speed on different objects. To verify this analysis, we report the performance of the YWCSL model under different α values through experiments in Section IV.
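To make the two reweighting properties concrete, the following minimal NumPy sketch (purely illustrative; the function and variable names are ours, not part of the YWCSL code) evaluates both weighting factors for several α values:

    import numpy as np

    def loss_weight(iou, alpha):
        # Relative loss weight: w_r^L = (1 - IoU^a) / (1 - IoU).
        return (1.0 - iou ** alpha) / (1.0 - iou)

    def grad_weight(iou, alpha):
        # Relative gradient weight: w_r^grad = a * IoU^(a - 1).
        return alpha * iou ** (alpha - 1.0)

    iou = np.array([0.1, 0.5, 0.9])
    for alpha in (0.5, 0.75, 1.0, 3.0):
        print(alpha, loss_weight(iou, alpha).round(3), grad_weight(iou, alpha).round(3))

For α = 3.00 both weights grow with IoU (the model focuses on high-IoU boxes), for α = 0.75 they shrink with IoU (the model focuses on low-IoU boxes), and for α = 1.00 both are identically 1, i.e., no reweighting.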

B. Distributed Inference Framework for Object Detection in Large-Scale Remote Sensing Images
The distributed inference framework for object detection in large-scale remote sensing images built in this article mainly includes a data storage layer, data loading and preprocessing layer, distributed computing layer, and visualization layer (see Fig. 4).
In the data storage layer, we store large-scale remote sensing image data in HDFS to achieve distributed storage and to ensure data reliability and high fault tolerance. In the data loading layer, we use the interfaces of RasterFrames [43] and PyHDFS to interact with HDFS, reading and writing conventional remote sensing image formats such as TIFF, PNG, and JPEG.
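As an illustration of the loading step, a minimal PyHDFS read might look as follows (the NameNode address, user name, and directory layout are placeholders for our cluster configuration, not values fixed by the framework):

    import io
    import pyhdfs
    from PIL import Image

    # Connect to the HDFS NameNode over WebHDFS (address is a placeholder).
    fs = pyhdfs.HdfsClient(hosts="namenode:9870", user_name="spark")

    # List the sliced 1024x1024 tiles stored under a dataset directory.
    tile_paths = ["/rs_images/tiles/" + name for name in fs.listdir("/rs_images/tiles")]

    # Read one tile into memory and decode it with Pillow.
    data = fs.open(tile_paths[0]).read()
    img = Image.open(io.BytesIO(data)).convert("RGB")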
For the distributed computing layer, we choose Apache Spark [30] as the underlying computing engine to build a distributed fast inference framework suitable for Big Data processing. We deploy the pretrained YWCSL model to every worker node of the Spark cluster. When the program starts to execute, the Driver process assigns each task to a worker according to data locality to minimize network transmission overhead. Each worker loads the pretrained YWCSL model into memory and processes its RDD partition data (see Algorithm 1). The number of tasks is related to the number of partitions of the RDD, which determines the program's parallelism and affects its running time. We adopt a custom partitioner (RankPartition), based on taking a self-incrementing primary key modulo the partition count, to distribute the remote sensing image data as evenly as possible across partitions. As shown in Fig. 11, we built a three-node Spark cluster and compared the cluster's inference speed under two partitioners, RankPartition and HashPartition. To demonstrate the advantage of RankPartition clearly, we normalize the inference speed of the cluster with HashPartition to 1.0 and report the relative speedup obtained with RankPartition. RankPartition is significantly more efficient than the default HashPartition, and with the same number of CPU cores, the larger the amount of data, the more pronounced the improvement in inference speed brought by RankPartition.
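Such a partitioner can be expressed in a few lines of PySpark. The sketch below is our illustrative reconstruction, not the exact RankPartition source: it keys each record by a self-incrementing index (reusing tile_paths from the loading sketch above) and partitions by modulo, so all partitions differ in size by at most one record regardless of how the path strings hash:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rank-partition-demo").getOrCreate()
    num_partitions = 12  # typically executors * cores per executor

    # zipWithIndex assigns a self-incrementing primary key to each record.
    paths = spark.sparkContext.parallelize(tile_paths)
    keyed = paths.zipWithIndex().map(lambda kv: (kv[1], kv[0]))  # (index, path)

    # Modulo partitioning: record i goes to partition i % num_partitions.
    balanced = keyed.partitionBy(num_partitions, lambda key: key % num_partitions)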
For the visualization layer, we store the inference results of the model in HBase [44] to support fast querying of massive unstructured data. Users can query specified remote sensing images in real time by attributes such as area name and geographic location and visualize the inference results on the front end for further data analysis.
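For instance, writing one detection record with the HappyBase client might look as follows (an assumption on our part: the article does not name the HBase client library, and the table schema, row-key layout, and column names are placeholders):

    import happybase

    conn = happybase.Connection("hbase-master")  # Thrift server address (placeholder)
    table = conn.table("rs_detections")

    # Row key combines area name and tile id so regional queries become range scans.
    table.put(b"wuhan_optics_valley:tile_000042", {
        b"info:category":   b"ship",
        b"info:confidence": b"0.93",
        b"info:obb":        b"512,300,600,310,595,360,507,350",  # oriented box corners
    })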
The inference steps of the distributed inference framework for object detection in large-scale remote sensing images implemented in this article are listed below; the recovered core of Algorithm 1 and a PySpark sketch of the per-Worker loop follow the list.

1) First, split the original image into 1024 × 1024 segments, and store these segmented images and the segmentation information in a specified path on HDFS.
2) The Driver reads the path information of the image segments, builds a DataFrame, and repartitions it with the RankPartition partitioner. The model inference process is encapsulated as a UDF comprising an image loading module, a data processing module, a model inference module, and a non-maximum suppression (NMS) module; its input is a batch of path information, and its output is the model inference result (predicted bounding box coordinates, category, confidence, and other information).
3) YARN is responsible for cluster resource management and assigns running resources to each Worker; each Worker pulls its tasks and calls the above UDF. The results are fed back to the Driver.
4) The Driver gathers the inference results of all segments and merges them according to the segmentation information. The Workers can perform this merging operation in parallel.
5) The merged results are written directly to HBase after a final NMS.

The above execution process is highly fault tolerant: when an exception occurs on a Worker, the Driver receives the exception feedback and calls another available Worker to restart the failed task.

Algorithm 1 (excerpt): Distributed inference on the Spark cluster.
    ...
    send image path information for each task id;
    end for
    for each Worker ∈ Spark cluster do
        Worker ← CPU cores, memory, etc.;
        Worker pulls tasks and loads the model;
        for each batch of image paths ∈ Worker's tasks do
            images ← getImage(image_paths);
            images ← Processing(images);
            results ← model(images);
            if the images were split then
                send the results to the Driver for the subsequent merge operation;
    ...
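The per-Worker loop of Algorithm 1 maps naturally onto Spark's mapPartitions, which lets each Worker load the model once per partition rather than once per image. A simplified sketch, continuing from the partitioned RDD balanced built above (load_ywcsl and detect are hypothetical helpers standing in for the YWCSL loading, preprocessing, inference, and NMS modules):

    BATCH = 16
    WEIGHTS = "/opt/models/ywcsl.pt"  # weights deployed to every Worker (illustrative path)

    def infer_partition(rows):
        # Runs once per partition on a Worker; rows are (index, hdfs_path) pairs.
        model = load_ywcsl(WEIGHTS)  # hypothetical helper: load the model once per partition
        batch = []
        for _, path in rows:
            batch.append(path)
            if len(batch) == BATCH:
                yield from detect(model, batch)  # hypothetical helper: read, preprocess, infer, NMS
                batch = []
        if batch:
            yield from detect(model, batch)

    # Driver side: trigger parallel inference and collect per-tile results for merging.
    results = balanced.mapPartitions(infer_partition).collect()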

IV. EXPERIMENTS

A. Datasets
To validate our proposed method, we conduct experiments using two large-scale remote sensing datasets, DIOR-R [2] and DOTA [1]. For DOTA-v1.0, the train, validation, and test sets contain 1411, 458, and 937 images, respectively. DOTA-v1.5 uses the same images as DOTA-v1.0 but adds many small objects (less than 10 pixels) and a new category, container crane (CC), for a total of 402 089 instances; its training/validation/test split is identical to that of DOTA-v1.0. Compared with DOTA-v1.5, DOTA-v2.0 adds two more categories, airport (Air) and helipad (Heli), and comprises 11 268 images and 1 793 658 instances; its training, validation, test-challenge, and test-dev sets contain 1830, 593, 6053, and 2792 images with 268 627, 81 048, 1 090 637, and 353 346 instances, respectively. The size of each category in the two datasets is shown in Fig. 5.

B. Implementation Details
For the DIOR-R dataset, we directly use the original images of the trainval set to train for 100 epochs and then use the original test set to validate the performance of the model. Image sizes in the DOTA dataset range from 800×800 to 20 000×20 000; considering the large memory footprint of feeding such images directly into the network, we slice each original image into a series of 1024×1024 tiles with a stride of 824. For multiscale testing, we first scale the original test images by three factors (0.5, 1.0, and 1.5) before slicing. All models are trained on a single NVIDIA RTX 3080 Ti with a batch size of 8. We employ a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005. We apply random flips, Mosaic [46], and Mixup [47] augmentation during training and no augmentation during testing.
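The slicing described above is a simple sliding window; a minimal sketch using this section's 1024-pixel window and 824-pixel stride (our illustration, with the last window clamped to the image border so no pixels are missed) is:

    from PIL import Image

    def slice_image(path, window=1024, stride=824):
        # Yield (x, y, tile) crops covering the whole image.
        img = Image.open(path)
        w, h = img.size
        xs = list(range(0, max(w - window, 0) + 1, stride))
        ys = list(range(0, max(h - window, 0) + 1, stride))
        # Clamp the final window to the right/bottom edge.
        if xs[-1] + window < w:
            xs.append(w - window)
        if ys[-1] + window < h:
            ys.append(h - window)
        for y in ys:
            for x in xs:
                # Pillow zero-pads crops that extend past the border
                # (only possible when the image is smaller than the window).
                yield x, y, img.crop((x, y, x + window, y + window))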

C. Comparison With State-of-the-Arts
a) Results on DIOR-R: As shown in Table I, when α = 0.75, the YWCSL model achieves 64.62% mAP on the DIOR-R test set, 0.21% higher than AOPG. Compared with the state-of-the-art algorithm DODet, the YWCSL model has higher detection accuracy on small and medium-sized objects such as APL, APO, BF, BR, STO, ETS, WM, VE, and CH. On the extremely small objects APL, BF, STO, and WM in particular, the detection accuracy of the YWCSL model is 8.21%, 7.11%, 6.62%, and 6.22% higher than that of DODet, respectively. Partial inference results of YWCSL on the DIOR-R test set are shown in Fig. 6.

b) Results on DOTA: As shown in Tables II and III, the YWCSL model achieves up to 79.54% and 76.60% mAP on the DOTA-v1.0 and DOTA-v1.5 test sets, respectively. Notably, when α = 0.5, YWCSL reaches the state of the art among one-stage detectors, 0.12% higher than S²A-Net, and remains competitive with advanced algorithms such as Oriented R-CNN and DODet (partial inference results are shown in Fig. 7).

c) Comparison of Model Inference Speed: As shown in Table IV, the biggest advantage of the YWCSL model is its fast inference speed. We test state-of-the-art algorithms on the DOTA-v1.0 test set with a single NVIDIA GeForce RTX 3080 Ti. The inference speed of the YWCSL model reaches 60.74 FPS, 3.83 times that of the current state-of-the-art model Oriented R-CNN, while still maintaining a detection accuracy higher than that of the state-of-the-art one-stage algorithm S²A-Net. This shows that the YWCSL model is very well suited to real-time detection in large-scale remote sensing images.

D. Ablation Experiments

a) Effect of α Value on YWCSL Performance: We conduct ablation experiments on the DOTA and DIOR-R datasets, training for 100 epochs with the conventional CIoU loss and with the α-CIoU loss, respectively. As shown in Table V, for single-scale testing, the YWCSL model trained with α-CIoU loss is 0.52% and 0.46% mAP higher than the model trained with CIoU loss on the DOTA-v1.0 and DOTA-v1.5 test sets, respectively; with the multiscale expanded test sets, the former is 0.77% and 0.04% mAP higher than the latter. Fig. 9(a) shows the detection accuracy of YWCSL at each epoch for α = 0.5/0.75/1.00/3.00. The results show that when 0 < α < 1, the detection accuracy of the YWCSL model improves faster.
According to Tables I-III, when 0 < α < 1, the detection accuracy of the YWCSL model is usually higher than that of the baseline. Fig. 8 shows the bounding box regression in the first five epochs of YWCSL training with different α values; more and higher-quality bounding boxes are produced when α = 0.75.

[Figure caption (detection comparison on DOTA-v1.5): In the first row, CIoU loss is used as the bounding box loss function, trained for 100 epochs on the DOTA-v1.5 training set; in the second row, α-CIoU loss (α = 0.5) is used as the bounding box regression loss and trained for the same number of epochs. Yellow dotted boxes mark areas where the detection results differ, showing that α-CIoU loss (α = 0.5) effectively improves the detection of small objects.]

b) Effect of Training Set Size on YWCSL Performance: Fig. 9(b) shows the effect of different amounts of training data on the performance of the YWCSL model. We subsampled each category of the DIOR-R training set to 50%, 70%, 90%, and 100%, leaving the test set untouched. To reduce the effect of chance, we trained each configuration three times and took the average mAP as the experimental result. The results show that α-CIoU only shows an advantage when the training set reaches 90% of the original data and performs poorly when the amount of data is small. In summary, α should be set to around 0.75 when the training data are plentiful and to 1.00 when the training data are scarce. The impact of different data sizes on the training speed of the YWCSL model is shown in Table VI; training one epoch on the full dataset takes about 400 s.

E. Spark Cluster Performance Comparison
To demonstrate the advantages of the distributed inference framework built in this article, we compare the inference speed in three different modes, as shown in Table VII: single-machine CPU, single-machine GPU, and an 8-node Spark cluster. The comparison results are shown in Fig. 10. The average speedup ratio (11.785) of the 8-node Spark cluster is much larger than the linear speedup ratio (8.000). Moreover, as the amount of data or the number of nodes increases, the speedup brought by the Spark cluster grows (the Spark cluster partitioner below is RankPartition by default). Compared with the single-GPU mode, our Spark cluster inference framework achieves 80.61% of the single-GPU detection speed using only equivalent CPU resources (same memory and CPU cores), and as the amount of data increases, the speeds of the two inference modes become closer.
In addition, to investigate the influence of the number of nodes on Spark cluster inference speed, we compared inference times on the validation set (5297 images, 6.77 GB, image size 1024×1024) from a traditional single CPU up to eight nodes. As shown in Fig. 10(a), as the number of nodes increases, the Spark cluster inference speed increases, and the actual speedup ratio far exceeds the theoretical linear speedup ratio. However, after the number of nodes reaches 5, the speedup ratio of the Spark cluster gradually degenerates toward a linear speedup ratio.

F. Practical Application of Distributed Inference Framework
To test the detection performance of our distributed inference framework for object detection in large-scale remote sensing images in a practical application, we deployed both the baseline from [16] and our improved model to the Spark cluster. The test data were extracted in ArcGIS from a 13.53 km² area near the Optics Valley Plaza in Wuhan, China. The image resolution is 17 520×11 712, with a size of 211 MB. As in Section IV-B, we first scale the original image by three factors (0.5, 1.0, and 1.5) and then cut it into 1024×1024 slices with a stride of 512. After cutting, there are 2669 images totaling 3.3 GB.
As shown in Fig. 12, when α = 0.5, the YWCSL model outperforms the baseline on small objects such as SV and SH. In addition, we compared the single-CPU inference mode with the Spark cluster inference mode: the former took 2883.98 s, while our 8-node Spark cluster took only 268.53 s, a 10.74× speedup in inference.

V. CONCLUSION
This article proposes a distributed inference framework for large-scale remote sensing images that achieves rapid object detection using relatively cheap CPU cluster resources. Notably, the actual speedup ratio of our framework far exceeds the theoretical linear speedup ratio, and it grows as the amount of data or the number of nodes increases (see Fig. 10). With only eight nodes and a data volume of 84.97 GB, the speedup ratio in our experiments reaches 12.28, which is 53.5% higher than the linear speedup ratio of 8.00. This is of crucial significance for the practical application of object detection in large-scale remote sensing images.
To balance model inference speed and accuracy, we further improved the accuracy and inference speed of the model based on [16]. Our improved model achieves 64.62%, 79.54%, and 76.28% mAP on the DIOR-R, DOTA-v1.0, and DOTA-v1.5 test sets, respectively, outperforming the current state-of-the-art one-stage detector S²A-Net. The distributed inference framework established in this article can also support distributed GPU inference; however, we did not conduct experiments on GPU clusters for lack of sufficient GPU machines. In addition, our framework supports flexible switching of models: users can deploy their own models directly on the distributed cluster without attending to the underlying distributed inference details.
There are still some shortcomings in our work. As analyzed in Section IV-D, the YWCSL model performs poorly when the training data are insufficient after the introduction of α-IoU. More importantly, the detection performance of the YWCSL model is sensitive to the value of α. In general, if a dataset contains many difficult examples, α can be set between 0.50 and 1.00; otherwise, α can be set between 1.00 and 3.00. According to the analysis in this article, the model learns better on bounding boxes with low IoU when 0 < α < 1 and performs better on bounding boxes with high IoU when α > 1. In the early stage of training, most bounding boxes have low IoU, so the former performs better early on, while in the later stage of training, a moderate increase of α helps the model learn better. Our future work will investigate an adaptive α-IoU loss to ensure that the YWCSL model can be trained more stably and efficiently.