Apache Spark Accelerated Deep Learning Inference for Large Scale Satellite Image Analytics

The shear volumes of data generated from earth observation and remote sensing technologies continue to make major impact; leaping key geospatial applications into the dual data and compute intensive era. As a consequence, this rapid advancement poses new computational and data processing challenges. We implement a novel remote sensing data flow (RESFlow) for advanced machine learning and computing with massive amounts of remotely sensed imagery. The core contribution is partitioning massive amount of data based on the spectral and semantic characteristics for distributed imagery analysis. RESFlow takes advantage of both a unified analytics engine for large-scale data processing and the availability of modern computing hardware to harness the acceleration of deep learning inference on expansive remote sensing imagery. The framework incorporates a strategy to optimize resource utilization across multiple executors assigned to a single worker. We showcase its deployment across computationally and data-intensive on pixel-level labeling workloads. The pipeline invokes deep learning inference at three stages; during deep feature extraction, deep metric mapping, and deep semantic segmentation. The tasks impose compute intensive and GPU resource sharing challenges motivating for a parallelized pipeline for all execution steps. By taking advantage of Apache Spark, Nvidia DGX1, and DGX2 computing platforms, we demonstrate unprecedented compute speed-ups for deep learning inference on pixel labeling workloads; processing 21,028~Terrabytes of imagery data and delivering an output maps at area rate of 5.245sq.km/sec, amounting to 453,168 sq.km/day - reducing a 28 day workload to 21~hours.


I. Introduction
Earth observation and remote-sensing are both fields that have undergone a renaissance recently, making major impacts in key geospatial applications including land cover mapping, infrastructure mapping, damage assessment, and population distribution studies [1]- [4]. Multiple factors are contributing to this change, including significant improvements and rapid deployment of satellite technologies that are enabling the acquisition of vast volumes of high resolution imagery at high velocity rates. As such, remote sensing applications have leaped into a data and compute intensive era presenting challenges and opportunities for new advanced machine learning and computer vision workflows. Examples of such applications include providing possibilities to study sustainability outcomes at scale [5], and identifying urban environments over large contexts using abundant satellite imagery and breakthroughs in deep learning based image classification [6]. To achieve greater impact with machine learning on data and compute intense workloads, new advanced workflows are required for efficient utilization of high performance computing resources.
Modern advances in computing hardware are enabling new opportunities for the manner in which large volumes of imagery data are processed. Over the past decade, Hadoop has emerged as an early experimental testbed for several big data applications due to its excellent large-scale datahandling capability, high fault tolerance, reliability, and low cost of operation [7]. Hadoop provides distributed data storage and analysis solutions which previously have been exploited for implementing large scale mean-shift based image segmentation algorithms [8]. In [9], an optimization effort on the Hadoop file storage system was studied to elicit better performance for large scale computing with image data. The authors of [10] study a Hadoop and MapReduce [11] based implementation of the parallel Kmeans algorithm to reduce the computational time taken for executing parallel data clustering on a large number of satellite images. Pursuing content mining on digital images, the authors in [12] introduced an approach for large-scale scene retrieval on massive image databases.
While MapReduce enables large-scale distributed computing for imagery when used in conjunction with Hadoop, a limitation is seen in its heavy usage of disk inputoutput (I/O) operations and network resources to store intermediate steps during processing. Deterred by this computational cost, Huang et al. [13] study Apache Spark [14] to take advantage of its resilient distributed datasets (RDDs) [15]. Spark has been shown to accelerate several other remote sensing imagery workloads. For example, in applications where trans-regional remote sensing images are key, the frequent data I/O requirements for mosaicking were shown to benefit from a parallel algorithm implemented with Spark [14], [16]. The work of Sun et al. [17] also demonstrates this performance increase, wherein the authors implement an iterative singular value decomposition algorithm to process massive amounts of remote sensing data. In concurrence, high performance computing environments are enabling targeted computing with extremely large earth observation data and the sharing of data in parallel across hundreds of nodes [18], [19].
Taking advantage of the high processing power, large memory capacity, and Infiniband-enabled interconnects between nodes in Summit, [20] proposed an exascale ready workflow and software stack for extracting signals for extreme weather patterns using variants of deep neural networks. The work scaled up to 27'360 V100 GPUs and sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. Another growing technology, even though focused on horizontal scaling resources is cloud computing. The technology is suited for targeting embarrassingly parallel tasks. Leveraging deep feature learning and parallel computing power, in [21] an efficient and scalable framework for pan-sharpening of remotely sensed big data in cloud computing environments was proposed. The framework characterizes a remote sensing application as a directed acyclic graph for cloud computing with Apache Spark [14] and TensorflowonSpark [22]. Several copies of data set are distributed via RDDs across virtual computing nodes to perform achieve distributed computing at scale.
Herein, we present RESFlow, an end-to-end workflow to address three challenges in large-scale machine learning for geospatial applications: (1) data complexity, (2) computational complexity, and (3) human labor requirements for collection of ground-truth data. On the data intensive challenge: The shear volumes of remote sensing imagery are increasingly becoming heterogeneous and challenging the efficacy of current machine learning workflows. With imagery data acquired as signals from varying system configurations and environmental conditions, efforts to analyze such diversity at scale are immediately thwarted by the lack of generalizable workflows. RESFlow seeks to stratify imagery into homogeneous distributions from which levels of diversity are contained to inform the reuse of models for inference tasks on imagery partitions with similar characteristics but originating from mixed geographies.
On the compute intensive challenge: Pixel labeling algorithms are compute intensive, necessitating efficient use of high processing computing resources on more than one computing node, either for feature extractor training or model inferencing. To address computational aspect of the problem, we propose to formulate the application as constituting a set of sub-tasks with interdependence but which are each executable in an embarrassingly parallel fashion. Given that sub-tasks have dependency on each other, this imposes an order of precedence on their execution, creating a task scheduling problem and resource contention that we handle via a novel GPU ticketing system.
On the labor intensive challenge: Current techniques from machine learning, especially deep learning, continue to demonstrate near-human performance. However, such methods are heavily dependent on large amounts of annotated data sets. When small amounts of high-quality labeled data are available, models tend to achieve poor generalization capability. The endeavor to obtain high quality training data can be tedious, especially for pixel labeling, and is characterized by multiple attributes that include; availability of domain experts, stratification of data into diverse representative samples, mitigation of human sampling bias, and accurate labeling of data samples. By seeking automated mapping into homogeneous partitions, our goal is to create data buckets from which to sample representative images with similar characteristics, mitigating the need to stratify large and diverse training data.
This paper seeks to address the above mentioned challenges from a generic perspective amenable for use in other large scale object mapping applications using remote sensing imagery. The technical contributions of this paper are: (1) We propose a novel and efficient single pipeline with multiple deep neural networks for multi-pass distributed image data analysis performed in an embarrassingly parallel fashion. (2) We present an unprecedented homogeneous partitioning of massive amount of data based on the semantics and spectral characteristics for parallel and distributed imagery analysis. (3) We leverage a deep metric space to create unique binary codes for efficient indexing 10s of models and 1000s of image patches for distributed pixel labeling. (4) We take advantage of Apache Spark to provide, for a single large image scene, a fast parallel inference functionality achieving tremendous speed-up with area pixel labeling rate of 5.245sq.km/sec, amounting to 453,168 sq.km/day -reducing a 28 day workload to 21 hours. (5) We present a containerized workflow for Apache Spark operations coordinated with GPUs for deep learning inference best practices, e.g. efficient GPU usage and ticketing across multiple workers, for large deep learning workloads deployed on GPU clusters.
Although presented for a pixel labeling task on satellite imagery, the workflow can easily be deployed for bio-medical and climate image based applications.

II. Satellite Image Analytics with Deep Neural
Networks Given the prevalent nature of high resolution remote sensing instruments, it is now conceivable to pursue computer vision methods for large scale object segmentation. Veryhigh-resolution remote sensing imagery, which now supports ground spatial resolutions of less than 50cm, is enabling new capability to exploit subtle and yet expressive spatial features for fitting highly complex objective functions for structured predictions with computer vision and machine learning methods. Deep convolutional neural networks have become the dominant machine learning technique for visual recognition, achieving state-of-the-art results on a number of problems that seek dense semantic labelling of image pixels. Early attempts on this problem include work in [23] where an atrous method to expand the support of filters and reduce the down-sampling for input feature maps to achieve dense labeling was used. In [24], an efficient and precise biomedical image segmentation convolutional neural network (U-net) was proposed. Improving on the architectural design to reconstruct the original input resolution, [25] proposed a semantic pixel-wise segmentation method using a fully-convolutional neural network (Seg-Net), which uses decoder-deconvolutional layers to map the low-resolution encoder feature maps to the full input resolution feature maps. The use of deep convolutional neural networks extends to other applications including big data mining for search and retrieval tasks. Designed to seek expressive spatial and visual content representational features, a deep hashing framework demonstrated capability for large-scale image retrieval in [26]. In another image retrieval task, pre-trained networks were used in order to extract intermediate image representation as input for metric and hash-code learning [27].
In general, to perform complex tasks at the level of humans, deep learning methods heavily depend upon the availability of enormous amounts of high quality annotated data. Despite the fact that remote sensing instruments are acquiring data in substantial volumes and the robust computing power needed to efficiently process it is available; such massive data sets are not simple to annotate. The process of gathering labeled training data is mired by inconsistencies, poor selection of representative samples and the annotation is often prohibitively expensive. It is labor intensive requiring a huge number of worker hours, making it challenging to train a single high performing deep network model for use on wide area geographical coverage. We therefore feel it is appropriate to seek automated workflows which support representative training data selections e.g. avoid under-representation, enable localized model for capturing homogeneous data distributions or fit models on diverse yet equally sampled image characteristics.
Moreover, with a growing demand for geospatial applications to deliver imagery products on data that scales over 10s of Terrabytes (TB) of high-resolution outputs per given geographic region, current applications are gradually becoming immense in terms of both data and compute requirements. To be specific, typical workloads for semantic labeling often entail processing imagery acquired across an average country of land-area size 783, 562sq.km. The accompanying imagery coverage would span 3 × 783, 562sq.km to account for scene overlapping and lack of cloud free data. Compare this volume of data against the largest known computer vision dataset ImageNet [28]. ImageNet has a total of 14,197,122 images, each at 224 × 224 pixels thus 50, 176 pixels per image, totaling 712, 354, 793, 472 pixels. When sampled at ground-sampling distance of 50cm, a mosaicking of all imageNet totals 356, 177sq.km, slightly less the size of Montana, US. In contrast, for an average country size, at 50cm ground sampling distance each RGB image scene spans about 40, 000 × 35, 000 pixels and carries ≈ 13 GB of data for a total of 3000 scenes (covering equivalent of 3×783, 562sq.km (7×ImageNet) land-area and totalling ≈ 39TB of data). Using current serial processing pipelines a single image scene takes 35-minutes to process a pixel-labeling task on a single computing node with one 16GB GPU card. Considering the demands to process multiple country scale products, it is imperative that object segmentation and semantic labeling tasks are deployed through parallel and distributed inferencing pipelines to reduce such computational complexity. With this motivation we identify advanced remote sensing dataflows (including  RESFlow) to be located in the top two quadrants of Figure 1 and continue to develop core computational modules that can match the demands of such applications.

III. Proposed RESFlow Framework
The RESFlow architecture seeks to present itself as an intelligent big data engine where end-to-end inference tasks are efficiently executed while exploiting the geometry of the data and being agnostic to sensor variations as well as geographic constraints. To this end, it is formulated to contain several integrated computational modules and algorithms; providing a common data pipeline which is shared across multiple inference tasks and geospatial applications. At the core of RESFlow is the concept of data distribution partitioning which is performed via efficient geometric based clustering and metric learning. Both techniques play key roles in the overall mapping of data to a partitioned image gallery for indexing -a strategy often lacking in traditional learning workflows. Here we believe that data partitioning is key to mitigating bias in training data collection and labelling (even though not in scope of the current study). Figure 2 illustrates the partitioning of an embedding representation for image patches extracted from a single large satellite image scene. From this result, the concept of data partitioning can be observed as the grouping and extraction of homogeneous image spaces for further exploitation during inference of the large image scene.
As noted above, computer vision and machine learning algorithms are proving superior in providing automated means to describe the distinctive nature of objects in remotely-sensed image data [1], [5], [6], [29]. However, the deployment of such algorithms remains a significant challenge when considered on large geographic areas covered by hundreds of thousands of images [29].
As is tradition with data/compute intense applications, success depends upon scarce and expensive hardware resources. It is therefore not surprising that use of hybrid CPU/GPU technology stacks is emerging as the means to address such deployment challenges. For example, deep learning functionality for analysis can be developed as user defined functions (UDFs) and used within Apache Spark clusters for inference deployment on GPU and CPU servers in a resourceful manner. Motivated by such potential we combine salient features from the deep learning frameworks (e.g. TensorFlow and PyTorch) and big-data capabilities from Apache Spark, to implement accelerated and parallelized inference modules for use on both CPU and GPU servers.
The central means to achieving such capabilities is the idea that both remotely-sensed image data, and deep learning models, can be mapped to and paired within local regions in which the extreme diversity induced by sensor characteristics and scene content is constrained. As depicted in Figure 3, this procedure is initially facilitated using a learned functional mapping which partitions high dimensional data embeddings into several buckets of similar semantic and spectral content and stored within an image gallery. The bucket partitions then provide a basis to train and update associated indexable models, stored within a model gallery, which can be tailored for specific inference tasks as defined by the application domain.
We briefly describe the RESFlow architecture several of its modular components in the following sections.

A. Clustering and embedding module
The first step to enable remote sensing imagery partitioning is utilizing the Clustering and Embedding module (CEM) in RESFlow The responsibilities of the CEM are two-fold. First, as the initial step within the coalescing of imagery to appropriate partitions of the Image Gallery, the module maps each input image as a datapoint using a learned feature extractor to an intermediate representation in which other datapoints characterized by similar acquisition conditions, spectral and semantic content share a close proximity. This intermediate mapping, or network embedding, is important as it provides a basis for an appropriate metric space to be learned in a data-driven manner. Second, during RESFlow's initialization, the CEM is used to assign labels via clustering as a means to assist learning of the metric space projection function.

B. Image-Bucket Assignment
After clustering the partitioned images, we seek to construct buckets that uniquely represent those clustered images. The Unique binary representation generated for each image plays a significant role in two ways: (1) they provide compact binary bitstrings that preserve semantically similar content for the partitioned image chips and (2) the binary bitstrings provide an efficient image-model indexing mechanism during large scale inference. This dual benefit is achieved by learning a hashing metric space whose properties include 1) generating a unique hash-map associated with each distinct image chip, 2) providing a compact representation which is smaller than the original input dimensions, and 3) a metric space from which the distance property for binary bitstrings can be used to relate image scenes whose geographies and image characteristics are also similar. The resulting binary bitstrings represent a desirable format with which to both efficiently index pre-trained models and the respective buckets proving an intuitive gathering space for localized training data.
The assignment of images to buckets is achieved by first computing the centroid binary bitstring for each bucket. Each bucket centroid offers a unique binary bitstring that's reused to identify each bucket model. Secondly the centroid gets reused within a hamming space to identify which bucket each image chip be assigned to. Figure 4 illustrates the main concept upon which RESFlow operates with hash-mapped buckets created via clustering. Clustering provides the key soft-labels needed to learn the semantic structure of the large satellite imagery archive. As shown, the reconstruction of the similar content structure, without emphasizing cluster-bucket correspondence, is carried out by the hash-mapping function. Furthermore, the distinct buckets become completely indexable within the image gallery shown in Figure ??.
To illustrate the hash-mapping module we observe the ability of a convolutional network based model to reconstruct the CEM generated embedding space while varying the number of initial clusters. Clusters for generating the soft-labels are chosen based on observing the number of clusters with an average smallest variance per cluster while varying the number of clusters. The hash-mapping network is evaluated using the mean Average Precision (mAP) metric which assesses the average value of the maximum precision for different recall levels while reconstructing the structure of initial clusters. Using a color-coding scheme, Figure 4 shows the relative changes between the agglomerative based clusters and the convolutional neural network based hash-mapped buckets.

C. Image gallery
Following their creation via the metric space projection function, the binary bitstrings generated from input imagery are partitioned by similarity relative to the distinct buckets shown in Figure 4. These partitions form a powerful abstraction, independently characterizing homogeneous image acquisition characteristics and spectral content, from which both training and inference data can be sourced for a related Model Gallery network. Further associated with each binary bitstring entry within this gallery is additional meta-data detailing attributes such as geoinformation, acquisition conditions, storage location and image sub-coordinates. Combined with the efficient binary representation of each datapoint for rapid image search capability, the inclusion of these attributes enable rich insights to be gained through exploratory inspection of the Gallery content. These insights could not only benefit performance evaluation of models but also help explain complex data characteristics and their bias.

D. Model gallery
Once populated, the Image Gallery presents a collection of homogeneous partitions. As previously motivated, this collection, with constrained diversity, is a desirable property when sampling for both training and inference data with a given deep learning algorithm. The Model Gallery will provide a direct mapping between a bucket partition within the Image Gallery and a paired and trained network model which is either fine-tuned during training, or applied during inference. Following this procedure can prove as a highly efficient alternative to standard model training and inference practices, as the localized constraints placed on the new data which each gallery model sees mitigates the need for continual retraining.

E. Accelerated inference
To this point, each of the discussed modules are seen to play a role in overcoming the extreme variance characteristic within imagery with the core concept of partitioning remote sensing imagery. However, the issue of analyzing this data at sufficient scale to meet the demands of global-size applications remains a prohibitive concern. In addressing this problem, it is further observed that each of the modules within RESFlow utilize elements of deep learning to perform different tasks, presenting a computationally heavy workload that requires the same GPU hardware resources to be reused in a single inference run. However, RESFlow's building blocks and their functionality are amenable to massive parallelization across the partitioned image space. In recognition of this condition, we exploit Apache Spark as a fabric for distributing and coordinating the framework's computations at scale. Learning from libraries such as TensorflowOnSpark [22] and Tensorframes [30], we leverage Spark's big data capability to ingest large quantities of input data and process these using deep learning frameworks complimentary to Spark in a fault-tolerant and highly parallel manner. Figure 7 shows RESFlow tile inference and reconstruction illustration. The tile partitioning strategy injects a key property that allows for spatially non-contiguous image tiles to be processed by  a single bucket model -enabling consistent inference over wide geographic conditions.

F. Application space
By design, the modules presented thus far within the RESFlow framework have been agnostic to any specific application or use case within the realm of remote-sensing imagery. In this manner, one could think of the Model Gallery partitioned to fulfill multiple tasks such as object detection, neighborhood and settlement mapping, or temporal change detection, which are only a subset of potential applications. Based on this premise, within the Application Space multiple copies of the Model Gallery are formed, each containing trained models which are uniformly purposed to perform a given task.
G. Image analytics via parallel computing 1) Satellite Imagery RDDs: Spark currently does not support extended image file types such as .tiff or .dicom to be serialized into byte arrays encapsulated within its RDD objects 1 . As a consequence of this limitation, a design 1 see: https://issues.apache.org/jira/browse/SPARK-21866 2) User Defined Functions: We implement several User Defined Functions (UDFs) within Spark in order to realize RESFlow's operation. Represented in Figure 5, these functions encapsulate tasks such as getting the extent of a given scene tile, or performing inference across a partition-based bucket. We utilise external libraries such as Tensorflow [31], Pytorch [32] and GDAL [33] to assist in performing these operations which act upon one or more rows records within a given RDD partition. 3) GPU Allocation Heuristic: An important limitation in using Spark for RESFlow's implementation is its lack of support for GPU-based resources. Here, first-class status is given to cluster-based resources such as CPU cores and RAM; allowing a fixed quantity of these resources to be allocated to instantiate a spark-executor. On the other hand, Spark contains no functionality to enable GPUs to be exclusively associated with a given executor in the same manner. When considered in the light of how Spark runs tasks which belong to the same processing stage, maintaining independence and concurrency, this lack of GPU resource allocation can become problematic. For example, during the embedding and hashing step of the workflow each executor is required to process the data within its associated partitions via GPU computation. With no robust mechanism to reserve a GPU for a given executor, this step may result in several executors utilizing the same GPU -causing the stage to fail as the limited amount of GPU RAM is quickly exhausted by the concurrent executor tasks.
As a workaround to this problem, we implemented a simplistic 'GPU checkout' routine which allows for the soft-assignment of GPU resources to take place between the executors on a given worker node. We accomplish this by creating a ticketing folder within the executors' shared work-space, into which we place one or more 'tickets' per physical GPU within the node. Using this ticket-system metaphor, when an executor requires the use of a GPU, it scans the ticketing folder for available instances -removing a ticket if one is available, or returning a ticket if the GPU resource is no longer required.

IV. Experiments and Results
Systems specifications:The computing environments for all experiments consist of various Spark cluster configurations on GPU optimized platforms. All Spark clusters are configured to take advantage of the NVIDIA DGX Systems as they deliver an integrated hardware and software solution thatâĂŹs been optimized to deliver faster time-to-solution with latest GPU resources. Nvidia-DGX1 systems: We first consider an environment with 3xNvidia-DGX1 machines each capable of a total of 80 threads from 40 cores via two Xeon CPU E5-2698 processors operating at 2.20GHz with 512G DDR4 of system RAM. Each machine has eight 16GB Volta GPUs achieving a maximum graphics clock of 1530Mhz and a maximum memory clock speed of 877Mhz with a 300Watt power cap on each GPU. The all SSD based local storage has 500G for the OS with a additional 7TB on Raid 0 for data storage. Each machine is connected to Network File System (NFS) storage via a single 10Gb Ethernet network connection. In total we have three such machines that are interconnected via 4x100Gb EDR IB ports (2 GPUs per IB connection). Nvidia-DGX2 systems: For the second environment we consider cluster nodes setup on NVIDIA DGX2 systems which each combine 16 GPUs fully interconnected via NVLink. The first node, a Nvidia-DGX2, can run 96 threads from 48 cores via two Xeon Platinum 8168 processors operating at 2.7Ghz with 1.5TB of DDR4 system RAM. The machine has sixteen 32GB Volta 350watt GPUs with max clocks of 1597/958 (Graphics/Memory). Local storage consist of a 1TB NVME Raid 1 boot partition and a 30TB NVME Raid0 data partitoin offering maximum transfer speeds of approx 20GBps. The machine has 8x100Gb EDR IB Ports (2 GPUs per IB port). The second node, Nvidia-DGX2-H,  Colors denote inference deployment of different gallery models as assigned by the image-bucket module. The tile partitioning strategy injects a key property that allows for spatially noncontiguous image tiles to be processed by a single bucket model -enabling consistent inference over wide geographic conditions. similarly has 96 threads with 48 cores, however with two Xeon Platinum 8174 operating at 3.1Ghz with 1.5TB RAM DDR4 2666. The machine has sixteen 32GB OCed Volta GPUs running at 450watt maximum power consumption with max clocks of 1702/1107 (Graphics/Memory), local storage consists of a 1TB NVME Raid 1 boot partition and a 30TB NVME Raid0 data partition offering maximum transfer speeds of approx 20GBps. The machine equally has 8x100Gb EDR IB Ports (2 GPUs per IB port).
Experimental evaluations are conducted by first setting up a single-Nvidia-DGX1 Spark cluster as follows: a configuration of one master node and eight workers. The cluster is instantiated from a single Singularity containers as depicted in Figure 6. Each worker is allocated 30GB memory and 9 cores and is responsible for launching a single executor. At execution the master node runs the driver process whose allocated memory is 10GB. Extending this configuration to a three-node Nvidia-DGX1 Spark cluster we instantiate a Singularity container with single master node and eight workers on one Nvidia-DGX1 machine and add two more Singularity instances on two more additional machines each equipped to support eight workers with allocation of 30GB memory and 9 cores per worker. Two more clusters, single-node Nvidia-DGX2-H Spark cluster and two-node Nvidia-DGX2 Spark cluster, are setup in a similar manner however each worker is allocated 80GB of memory and 6 cores per worker.

A. Case Study: Building Footprint Mapping
Country scale building footprint mapping fits the scope of both a data and compute intensive application. We envision current deep learning algorithms for solving this case study benefiting from a hybrid combination of the in-memory computing capability of Apache Spark and high performance computing hardware platforms.
Our experimental dataset consists of (.3 − .7)-meter resolution satellite imagery acquired by Digital Globe constellations i.e. WorldView-2, and WorldView-3. These constellations provide high resolution imagery that is suitable for pixel level semantic segmentation objects. As summarized in Table I, training and validation image data (using a 90%:10% split) is collected from several countries Ethiopia, South Sudan, Zambia. Each sample is a 500 × 500 pixel RGB image with an associated label mask. Importantly, we further explore out-of-country/outof-sample testing regions from other different geographical areas.These areas consist of New Mexico of the United States, Puerto Rico, Alabama, and Arizona, and are used to demonstrate both the deployment and generalization performance of RESFlow. These testing sets also represent a more realistic use-case of the framework once placed in production where rapid inferencing is much needed. Table I summarizes the distribution of testing samples that are used from each of the out-of-sample locations.

B. Workflow initialization
To initialize the RESFlow ensemble with all available training samples, we selected an optimal count of six buckets for the image and model galleries based-off the average intra-cluster variance measured over a range of cluster numbers used for the clustering and embedding module. For each of these buckets, we trained four different convolutional neural networks for the building mapping task where training and validation data are from its corresponding image gallery bucket. We picked these CNNs, namely ResNet50-FCN, [34], U-net [24], Seg-Net [35], and DeepLab [36] without multiple feature fusion, based on various sizes of model (number of parameters to train) as well as their superior performance on semantic segmentation tasks. The number of parameters for these models are listed in Table II. This also showcases the flexibility of the modularized application space in RESFlow where researchers can easily set up the preferred algorithms to test the model gallery. During testing (performing inferencing) on the out-of-sample data, each of these trained models becomes responsible for independently performing building extraction on testing samples, which are assigned to its membership via the same process of image gallery mapping. When reporting results, we refer to this combined quorum of models as RESFlow. All of the CNNs were concurrently trained from scratch, without using any pre-trained models. Hyper-parameters are held constant through all experiments so that we limit number of variables in the experimental runs.

C. Performance and computational efficiency
Among the main contributions of the paper, improving workload computational efficiency is vital in processing large volumes of data. RESFlow seeks to achieve this by combining algorithmic innovations as well as accelerated deep learning computing capable system architectures for remote sensing data analytics. More specific, its capability to partition data in the metric space offers desirable properties for large scale accelerated training and inferencing tasks. We observe and assess the system behavior under varying workloads and different computing environments to identify computational trade-offs between constrained resources and need to process large volumes of imagery. The deep learning fully convolutional network of choice is the U-Net architecture [24], merely selected for its fast training convergence across the 60% F1-score on validation of 300 samples as shown in Figure 10.
To provide an appropriate comparison of performance we additionally train a single U-Net model, again using all available training samples with the identical fully convolutional network architecture. We refer to this monolithic network as the Mono model, and measure its performance across the entire out-of-sample test set. As metrics, we utilize the Intersection over Union (IoU) in addition to F1 scores in order to follow the accepted community standard [37] and report performance results in Table IV. To quantitatively assert the viability of our proposed framework, Table IV presents the performance of both the Mono and RESFlow models on the held-out set of test data. Here RESFlow is seen to perform very similarly to the Mono model for two of the three test regions. This is considerable, as each model from the RESFlow Image Gallery sees considerably less data compared to its Mono model counterpart during training, and yet is able to generalize to a similar degree. This is not true for the New Mexico region, however, with a large deficit in performance being observed. Furthermore, visual maps to illustrate the performance of RESFlow ensemble of models over large geographic extents is shown in Figure ??. Table III and Table VI illustrate the computational performance of deploying RESFlow on an Apache Spark cluster equipped with deep learning modules based on PyTorch and Tensorflow. The baseline is recorded to average 35 minutes for pixel-level semantic segmentation inferencing time for a single image scene of size 40,000x35,000 pixels and data volume of 11GB processed on a single Nvidia DGX1 V100 GPU. Figure 9 shows the speedup as a function of both the number of GPUs and data sizes in GigaBytes(computed from number of image scenes). The speedup is calculated as the ratio of the baseline execution time to the normalized compute time for given number of GPUs. For all different size workloads, we observe a tremendous speed up ranging from 9x to over 400x across all Spark cluster environments. Overall, Table III to Table V and Figure 9 demonstrate the speedup factors of the various cluster configurations.
While we report on the strong and weak scaling of the segmentation inference module. The embedding and hashing modules are embarrassingly parallel with operations performed at image patch level to completion and always benefiting from additional numbers of GPU workers. To establish the strong scaling aspect of the workflow, Figure 9 shows speed-up performance over 6 workloads of data (between 1-12 image scenes) while varying the number of spark-based workers. The inference module is parallelizable while the merging of image patches to reconstruct the large scene is performed in a serial fashion; perhaps introducing an upper limit on the compute speedup. An immediate observation is that compute speedup does not assume a linear increase with additional GPUs -parallelization efficiency decreases for each workload as the amount of GPU workers are increased. The presence of the increasing and decreasing trends on the performance curves could be explained by highlighting two aspects. The first is due to algorithmic implementation of our workflow. As aforementioned the most important contribution of our workflow is its ability to partition or tile large imagery and enable computing at scale. During inference execution on tiled partitions, worker processors do not communicate with one another, however on completion, we perform a reduceByKey operation within Spark to group all tiles of the same image scene ID for merging into the large extent output mask. The merging operation for output mask reconstruction is performed by a single worker. As illustrated from Figure 9, the performance bottleneck is more pronounced for workers executing on 6 GPUs or more. In addition, inferencing of fewer image scene appears to incur the largest compromise on speed up. The second aspect significantly influencing the observed performance bottleneck is attributed to increased I/O read activities. The computational efficiency introduced by RESFlow benefits tremendously from the lazy evaluation of Spark RDD data transformation. Throughout the pipeline, image data is only read from disk at inference time for each corresponding image tile. As a result, an increase in workers executing on more GPUs concurrently, thereby adding to the I/O read count, considerably compounds and causes a decrease in compute performance. Table III to  Table VI further illustrate computational efficiency across a range imagery data sizes (or as measured in land area square kilometers). To establish weak scaling, we increase the workload on each GPU processor and also observe an improved weak scaling aspect of the workflow. Each row shows the compute time obtained for different area sizes for a fixed number of GPUs. The results indicate a desirable scaling factor, i.e. computational efficiency appears to increase as the land area to map (calculated for number of image scenes) correspondingly increases. For example, in Figure 9, for 12 GPU-workers, the compute speed for an area of 313.97 sq.km (or a single image coverage with data size of 15.03GB) is a 6.91x improvement over the baseline of 35minutes to process a single image scene. However, for an increased workload or land area of 4141.16 sq.km(or 12 image scenes with data size of 197.13GB), the speed-up reaches 750x over the baseline. This performance increase appears to be due to fact that overall communication and system bottlenecks are more dependent on the number of GPUs than on the land area.
We also evaluate the overall inference performance for varying input image size to the U-Net network architecture. Figure 8 provides results for RGB image input of size 500 × 500 × 3, 800 × 800 × 3,1000 × 1000 × 3, and 1500 × 1500 × 3. Inference is done in batches of size 12, 8, 5, and 2 image scenes, respectively. Compute efficiency is observed to vary with network completing inference at rate 1220 per sec (or throughput of 3.7GB/sec) for image tiles of size 500 × 500 × 3, which has an equivalent area rate of 23.45sq.km/sec. Given the smaller size in input image, I/O reads are increased putting a burden to the NFS file system. By considering much larger input images not only do we reduce the amount of I/O reads per image scene but we also increase both the throughput and the equivalent land area mapped. For example, with 1000 × 1000 × 3 input images the network reaches a peak inference throughput of 6.59GB/sec (processing a total of 564 images/sec) and equivalence of mapping 50.78sq.km/sec. At this rate we posit that the GPU compute time dominates the I/O read. However when for much larger input image sizes we note a throughput degradation to level of 4.69GB/sec (processing a total of 178 images/sec) for 1500 × 1500 × 3 tiles. Even though larger tiles increase the GPU utilization time they also require allocation of larger memory footprint for the inference results.
V. Large scale experiments a) Training:: We first evaluate several aspects of the training phase in the SPARK-enabled GPU clusters. Taking 3.02m * 4.52m * 6.77m * 8.92m * 12.21m * 12.12m * * Indicates a result for which soft gpu-executor assignment is needed to prevent run failure.  advantage of the partitioned image gallery, there is no interaction between buckets, turning the model training into an embarrassingly parallel task. As shown in Table II, the training speed scales with the complexity of the model (i.e. number of trainable parameters). Note that also the total time needed to achieve the model convergence is also linear to the number of training samples (total training time = number of training samples × secs/ sample)). In Figure 10, we demonstrate the similar model training performance achieved by a regular DGX-1 and a SPARK-enabled DGX-1. Because of different network initialization conditions (as we trained the models from scratch), the learning curves are not identical under these two computing environments. Nevertheless, with the fixed training hyper-parameters, we can see the best F-1 scores obtained from both computing platforms for these four models are similar: Seg-Net and U-Net both deliver F-1 scores slightly above 0.7, DeepLab has F-1 scores close to 0.7, and ResNet50-FCN achieves F-1 scores close to 0.65. In addition, the training improvements become trivial after ∼25 epochs for both computing environments except for ResNet50-FCN, which requires longer training epochs (∼60 epochs) to get optimized parameters. b) Inference:: We here present the large scale inferencing results produced by the RESFlow model gallery on three states/countries: Puerto Rico, New Mexico of the United States and South Sudan. Figure 11 visualizes output building semantic segmentation maps across varying geographies that encompass the above three test sites. Finally, having evaluated the different components for RESFlow e.g. impact of varying number of workers for different number of image to process in a single batch (Figure 9), establishing the throughput bounds as a function of input image size (Figure 8), we assess the scalability and applicability of deploying the workflow as depicted in Figure II on 14TB of imagery data covering the State of New Mexico. We select the Two-node Nvidia-DGX2 Spark cluster to execute the task with 28GPUs and batch size of 12 for image scenes. The pipeline execution entails computing three deep learning tasks for each image tile: deep feature extraction stage, a deep hashing stage and a deep semantic segmentation inference. The main goal for the large scale deployment was to assess the performance of our pipeline when deployed for production task. Therefore we account for both read and write (I/O) bottlenecks in addition to the compute time. Table VII presents the throughput for  this workload. This end-to-end inferencing workflow demonstrates unprecedented processing of vast amounts of satellite imagery. Averaging a throughput of .243GB/sec (or 5.245sq.km/sec) inference of the entire set of 1440 image scene completed in 21hrs -a tremendous achievement over a previous serial based inference workflow that would have taken over 28days to complete.

VI. Conclusions and Future Work
With a novel remote sensing data partitioning concept, this paper presented a parallel inferencing workflow based on accelerated AI deployment hardware. Demonstrated on the pixel labeling challenge, we extended herein, Apache Spark based satellite imagery RDDs for processing with deep learning at scale and demonstrated compute efficiency across Terrabytes of high resolution data covering land area 3.78m 4.52m 5.59m * 6.94m * 6.34m * 7.73m * 16GPUs 4.75m * 5.71m * 6.69m * 6.82m * 7.50m * 9.32m * * Indicates a result for which soft gpu-executor assignment is needed to prevent run failure.  equivalent of 787 300 sq.km. We achieve unprecedented pixel labeling area rates of 5.245sq.km/sec, amounting to 453,168 sq.km/day (or a daily capacity processing of 21,028Terrabytes) demonstrating a reduction of a 28 day workload to 21hours. In order to leverage Apache Spark to support deep feature learning and accommodate the problem of model generalization, the remote sensing data partitioning, the central concept in RESFlow, is realized from a combined set of four modules. RESFlow ignites optimism about a number of other relevant, and perhaps such new research directions, including enabling faster search for retrieval of images with similar content, combining human geography and machine learning e.g. establishing limiting bounds and performance of models learned from finer abstract and deconflated geographical spaces, establishing robustness and stability of learning models for objects that are rare in some geographies and common in others. These directions involve studying varying levels of imagery data complexity and their impact on training and inference tasks, and can potentially benefit from sensor and geographic agnostic workflows as compared to spatially constrained approaches. Other future directions: understanding the staging of geospatial workloads and deploying containerized workflows on supercomputing platforms presents a future research direction for our work. In addition, the work can be extended by exploring methodologies to enable better architectural design of computers specific to geospatial processing to benefit more applications.