GeoFlink: An Efficient and Scalable Spatial Data Stream Management System

This era is witnessing an exponential growth in spatial data due to the increase in GPS-enabled devices. Spatial data can be of extreme use to commercial businesses, governments and NGOs if processed in a timely manner. Spatial data is voluminous and is usually generated as a continuous data stream, for instance, vehicle tracking data, mobile location data, etc. To process such huge data streams, highly scalable systems are needed. Apache Spark Streaming, Apache Flink, and Apache Samza are among the state-of-the-art scalable stream processing platforms; however, they lack support for spatial objects, indexes, and queries. Besides them, other scalable spatial data processing platforms, including GeoSpark and Spatial Hadoop, do not support streaming workloads and can only handle static or batch data. To fill this gap, we present GeoFlink, which extends Apache Flink to support spatial objects, indexes and continuous queries over spatial data streams. A grid-based index is introduced to support efficient spatial query processing and effective data distribution across distributed cluster nodes. GeoFlink supports spatial range, spatial $k$NN and spatial join queries on Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon spatial objects. Besides, GeoFlink supports data streams in GeoJSON, WKT, and CSV data formats. A detailed experimental study on real and synthetic spatial data streams shows that GeoFlink achieves significantly higher query throughput than the existing state-of-the-art streaming platforms.


I. INTRODUCTION
With the increase in the use of GPS-enabled devices, spatial data is omnipresent. Many applications require real-time processing of spatial data, for instance, to provide route guidance in disaster evacuation, to track patients to prevent the spread of serious diseases, to support smooth voice and data services to mobile subscribers in all areas, for road traffic monitoring and management, etc. Such applications entail real-time processing of millions of tuples per second. Existing spatial data processing frameworks, for instance, PostGIS [1] and QGIS [2], are not scalable to handle such huge data and throughput requirements, while scalable platforms like Apache Spark Streaming [3], Apache Flink [4], Apache Samza [5], etc., do not natively support spatial data processing, i.e., they lack support for spatial data objects, indexes, and queries, resulting in increased spatial querying cost. Besides, there exist a few solutions to handle large-scale spatial data, for instance, Hadoop GIS [6], Spatial Hadoop [7], GeoSpark [8], etc. However, they cannot handle real-time spatial streams. To fill this gap, we present GeoFlink, which extends Apache Flink to support spatial objects, indexes and continuous queries over spatial data streams.

(The associate editor coordinating the review of this manuscript and approving it for publication was Lei Shu.)
Indexes are indispensable for efficient query processing and pruning. Spatial data indexes can be broadly divided into two types: 1) tree-based, and 2) grid-based. Unlike static data, stream tuples arrive and expire at a high velocity. Thus, we need an index with low or zero maintenance cost. Tree-based indexes are extensively used for static spatial data processing due to their low data retrieval cost. However, tree-based indexes do not perform well under high rates of insertions and deletions, as these operations may trigger index restructuring, which is a heavy operation. Thus, tree-based indexes are not suitable for data streams due to their high data arrival velocity, which results in high index maintenance cost [9]. On the other hand, given a data boundary, grid indexes have a fixed structure and do not require maintenance as new data stream tuples arrive or depart. Therefore, to enable real-time processing of spatial data streams, a lightweight logical grid index is introduced in this work. GeoFlink assigns grid-cell IDs to the incoming stream tuples, based on which the objects are processed, pruned and/or distributed dynamically across the cluster nodes. GeoFlink supports spatial range, spatial kNN and spatial join queries on Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon spatial objects. Besides, GeoFlink supports data streams in GeoJSON, WKT, and CSV data formats. It provides a user-friendly Java/Scala API to register spatial continuous queries (CQs). GeoFlink is an open source project and is available at GitHub.¹

Code 1. A GeoFlink (Java) code for a spatial join query.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Example 1 (Use case: Patient Tracking):
A city administration is interested in monitoring the movement of a number of their high-risk patients. Particularly, the administration is interested in knowing and notifying, in real time, all the residents whom a patient happens to pass within a certain distance r. Let S1 and S2 denote the real-time location streams of ordinary residents and patients, respectively, obtained through their smart-phones. This query is a real-time join of S1 and S2, such that it outputs all the p ∈ S1 that lie within r distance of any q ∈ S2. Code 1 shows the implementation of this real-time CQ using GeoFlink's spatial join. The details of each statement in the code are discussed in the following sections.

¹ GeoFlink @ GitHub: https://github.com/aistairc/SpatialFlink
This paper is an extension of our previous work [10]. The main contributions of this extension are summarized below:
• Grid index extension for LineString and Polygon objects.
• Real-time and window-based spatial range, kNN and join queries for Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon spatial objects.
• Grid-based pruning for efficient spatial queries processing.
• Extensive experimental study on synthetic and real data streams.
The rest of the paper is organized as follows: Section II presents related work. Section III discusses the essential concepts useful in understanding the later discussion. Section IV defines spatial objects and spatial streams. Section V discusses spatial distance and r-neighbors computation. Section VI elaborates on spatial indexing and presents GeoFlink's grid index. Section VII presents the supported spatial queries along with their algorithms. Section VIII presents the GeoFlink architecture. Section IX presents a detailed experimental study, while Section X concludes the paper and highlights a few future directions.

II. RELATED WORK
Existing spatial data processing frameworks like ESRI ArcGIS [11], PostGIS [1] and QGIS [2] are built on relational DBMSs and are therefore not scalable to handle huge data and throughput requirements. Besides, scalable spatial data processing frameworks, like Hadoop GIS [6], Spatial Hadoop [7], GeoSpark [8], Parallel Secondo [12] and GeoMesa [13], cannot handle real-time processing of spatial data streams. One can find a number of extensions of Spark to support spatial data processing. SpatialSpark [14] leverages Apache Spark's broadcast mechanism for spatial partitioning. GeoSpark [8] processes spatial data by extending Spark's native Resilient Distributed Dataset (RDD) to create the Spatial RDD (SRDD), along with a Spatial Query Processing layer on top of the Spark API to run spatial queries on these SRDDs. For efficient spatial query processing, GeoSpark creates a local spatial index (grid, R-tree) per RDD partition rather than a single global index. For re-usability, the created index can be cached in main memory and can also be persisted on secondary storage for later use. However, due to the immutable nature of RDDs, an index once created cannot be updated and must be recreated to reflect any change in the dataset. LocationSpark [15], GeoMesa [13] and Spark GIS [16] are a few other spatial data processing frameworks developed on top of Apache Spark. Like GeoSpark, none of these frameworks supports real-time stream processing. Apache Spark Streaming [3], Apache Flink [4] and similar distributed and horizontally scalable platforms support large-scale, real-time processing of data streams. However, they do not natively support spatial data processing, i.e., they lack spatial data objects, indexes, and queries support, resulting in increased spatial querying cost.
For real-time queries, Apache Spark introduces Spark Streaming, which relies on micro-batches to address latency concerns and mimic streaming computations. Latency is proportional to the batch size; however, the experimental evaluation in [17] shows that as the batch size is made very small to mimic real-time streams, Apache Spark Streaming becomes prone to system crashes and exhibits lower throughput and fault tolerance. Furthermore, even with the micro-batching technique, Spark Streaming only approaches near real-time results at best, as data buffering latency, however minuscule, still exists. Indeed, F. Zhang et al. [8] mention that the architecture of their spatial querying framework could demonstrate a higher throughput if implemented on Apache Flink. Other distributed streaming platforms worth considering are Apache Samza [5] and Apache Storm [18]. A performance comparison by Fakrudeen et al. [19] revealed that both Samza and Storm demonstrate lower throughput and reliability than Apache Flink [4]. Thus, we extend Apache Flink, a distributed and scalable stream processing engine, to support real-time spatial stream processing.
To support real-time processing of spatial data streams, Zhang et al. [20] extended Apache Storm. They proposed and implemented distributed spatial indexing for continuous spatial query processing. A tuple in their work consists of two pairs of coordinates, corresponding to the new and old coordinates. Thus, if an object moves more than some threshold distance δ, the tuple coordinates are updated. For query processing, a hybrid index is introduced, consisting of a primary grid index and a secondary index, where the secondary index corresponds to the objects in a grid cell. The secondary index is either tree- or hash-based and physically stores the moving object tuples. The secondary index is mainly maintained to improve query performance and must be updated as objects move or are updated. In contrast, the grid index in GeoFlink is logical, in the sense that it only assigns a grid ID to the incoming streaming tuples or moving objects, and based on the grid ID the objects are processed, pruned and distributed across the cluster nodes. We do not physically store the objects in any data structure or memory; hence no update is required when we receive an updated object location. This enables GeoFlink to achieve higher query throughput.
Another related work is on systems for spatio-textual stream processing [21]-[23]. They deal with streaming tuples which contain geolocations and textual contents. Tornado [21], [22] is a distributed system developed on top of Storm for real-time processing of spatio-textual queries and streams. SSTD [23] supports a wider variety of queries, including snapshot queries as well as continuous ones. These approaches employ sophisticated global indexes and cost models to achieve dynamic load balancing of workers in a multiple query and stream processing context. In contrast, GeoFlink utilizes a maintenance-free logical grid index and the key-based (hash-based) data partitioning scheme inherent in Flink for data distribution. As shown by the experiments in Section IX, GeoFlink can achieve load balance in processing multiple queries and streams with skews and hotspots if the grid size is chosen carefully.

FIGURE 1. The number of Kafka topic partitions is less than the number of Flink's task slots. This results in a 1-to-1 mapping between the topic partitions and the task slots and causes the remaining task slots to sit idle during distributed query execution, wasting valuable system resources.

III. ESSENTIAL CONCEPTS
In this section, we present the Flink programming model and the Flink-Kafka pipeline, which are essential to understanding the later discussion.

A. APACHE FLINK
To support continuous query processing over spatial data streams, our platform needs to support features like aggregation, join, windowing, state management, etc. The choice boils down to Apache Spark Streaming and Apache Flink, with both architectures providing basic frameworks to implement the above features. However, as the target of this work is streaming data and continuous queries rather than batch data and one-time queries, Apache Flink is a natural fit, as it is inherently designed for streaming applications. In the following, we present a quick overview of Apache Flink.

1) DATA COLLECTIONS AND PROGRAM
Apache Flink uses two data collections to represent data in a program: 1) DataSet: a static and bounded collection of tuples; 2) DataStream: a continuous and unbounded collection of tuples. A Flink program consists of three building blocks: 1) Source, 2) Transformation(s)/Operator(s) (in the following, transformations and operators are used interchangeably), and 3) Sink. When executed, Flink programs are mapped to streaming dataflows, consisting of streams and transformations.

FIGURE 2. The number of Kafka topic partitions is greater than the number of Flink's task slots. This results in the assignment of multiple topic partitions to a task slot, causing non-uniform data transfer that overburdens the small number of consumer task slots.

Flink comes with a number of basic transformations
such as filter, map, reduce, keyBy, join, aggregations, window, etc. Each dataflow starts with one or more data sources and ends in one or more data sinks. The dataflows resemble arbitrary directed acyclic graphs (DAGs). Flink provides seamless connectivity with data sources and sinks like Apache Kafka (source/sink), Apache Cassandra (sink), etc. [4]. In the following, we will discuss Apache Flink from the DataStream perspective, which is the focus of this work.

2) WINDOW TRANSFORMATION
Window is one of the most important transformations and is regarded as the heart of stream processing. Therefore, we discuss this transformation separately. Transformations can be classified broadly into blocking and non-blocking.
Definition 1 (Blocking Operators): Blocking operators require processing of the entire input before an output can be generated. For example, sort, aggregation, join, etc.
Definition 2 (Non-Blocking Operators): Non-blocking operators generate output as they receive input tuples, without the need to wait for other tuples. For example, filter, map, etc.
Window splits the infinite stream into buckets of finite size over which computations can be applied. Windows are mainly used for blocking operators; however, depending upon application requirements, they can also be used with non-blocking operators. For instance, filter is a non-blocking operator, but one can implement a window-based filter operator to generate periodic output.
Flink supports four types of windows: tumbling window, sliding window, session window and global window.
1) Tumbling window: When using this window, a stream is divided into non-overlapping partitions and the stream data is kept only for the current partition. The partition size is specified using a user-defined window size parameter.
2) Sliding window: In this case, a stream is divided into possibly overlapping partitions of the newest events. Similar to the tumbling window, the size of a sliding window is specified using the window size parameter. An additional window slide parameter controls the output frequency. A sliding window is overlapping if window slide < window size, and non-overlapping otherwise.
3) Session window: The session window groups tuples by sessions of activity. Session windows do not overlap and do not have a fixed start and end time. Instead, a session window closes when it does not receive tuples for a certain period of time, i.e., when a gap of inactivity occurs.
4) Global window: A global window assigns all tuples with the same key to the same single global window.
Without loss of generality, only tumbling and sliding windows are used in this work.
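To make the window semantics concrete, the following plain-Java sketch computes which tumbling or sliding windows a given event timestamp falls into (class and method names are illustrative; this is not the Flink API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a sliding window assigns an event timestamp to window
// start times; tumbling is the special case slide == size.
public class SlidingWindows {
    // Returns the start timestamps of every window that contains `ts`.
    public static List<Long> assignWindows(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide); // newest window covering ts
        for (long s = lastStart; s > ts - size; s -= slide) {
            starts.add(s);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Window size 10, slide 5: timestamp 12 is in windows [10,20) and [5,15).
        System.out.println(assignWindows(12, 10, 5));  // [10, 5]
        // Tumbling (slide == size): each timestamp belongs to exactly one window.
        System.out.println(assignWindows(12, 10, 10)); // [10]
    }
}
```

Note that with window slide < window size, every tuple contributes to size/slide windows, which is exactly why a sliding window produces overlapping output.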

3) PARALLEL AND DISTRIBUTED PROCESSING
Programs in Flink are inherently parallel and distributed. During execution, a transformation/operator is divided into one or more subtasks which are independent of one another and execute in different threads, possibly on different machines or containers. Flink's parallelism depends on the number of available task slots, which is typically equal to the number of CPU cores. In Flink, keys are responsible for data distribution across operator instances. All the tuples with the same key are guaranteed to be processed by a single operator instance. In addition, many of Flink's core data transformations like join, groupBy, reduce and windowing require the data to be grouped on keys. Intelligent key assignment ensures near-uniform data distribution among operator instances and hence leverages the performance offered by parallelism. Often, the data source is a bottleneck in leveraging the performance offered by Apache Flink. Hence, we provide a quick overview of the Flink-Kafka pipeline in Section III-B, Apache Kafka being one of the most common streaming sources used with Apache Flink.
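The key-based distribution described above can be sketched as follows. Flink's actual scheme hashes keys into key groups before mapping them to operator instances; the simple hash-modulo below is only meant to illustrate that tuples sharing a key (e.g., a grid-cell ID) always land on the same instance:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of key-based data distribution: every tuple with the
// same key lands on the same operator instance.
public class KeyPartitioning {
    public static int operatorFor(String key, int parallelism) {
        // floorMod keeps the result non-negative for any hashCode.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        Map<Integer, Integer> load = new HashMap<>();
        // Grid-cell IDs as keys: tuples sharing a cell go to one instance.
        for (String cellId : new String[]{"0101", "0110", "0101", "1001", "0110"}) {
            load.merge(operatorFor(cellId, parallelism), 1, Integer::sum);
        }
        System.out.println(load); // same-key tuples always count toward one slot
    }
}
```

With near-uniform keys (such as grid-cell IDs over a well-chosen grid), this mapping spreads tuples evenly over the available task slots.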

4) WORKLOAD FLUCTUATION AND SCALING
Streaming jobs usually run for several days or even longer and may experience workload fluctuation during their lifetime. Although some of these fluctuations are more predictable than others, in all cases there is a change in job resource demand that needs to be addressed if we want to ensure the same quality of service. To address the change in job resource demand, Flink supports the following two scaling approaches [24]:
• Manual: Manually rescaling a Flink job has been possible since Flink 1.2 introduced rescalable state, which allows users to stop-and-restore a job with a different parallelism. For example, if our job is running with a parallelism of p = 100 and our load increases, we can restart it with p = 200 from the savepoint created during shutdown to cope with the additional data.
• Reactive: The reactive mode was introduced in Flink 1.13. In this mode, the end user can add or remove resources at will and Flink does the rest, i.e., manages the rescalable state. In reactive mode, the JobManager tries to use all available TaskManager resources.
Since GeoFlink is built on top of Flink, it supports all the rescaling approaches supported by Flink.

B. FLINK-KAFKA PIPELINE
Apache Flink [4] is mainly used for stream data processing and analytics. In real applications, mostly if not always, the data streams are ingested from Apache Kafka [25], a system that provides durability and pub/sub functionality for data streams. A typical Flink-Kafka pipeline consists of data streams being pushed to Kafka, which are then consumed by Flink programs. The outputs of these programs are fed back to Kafka for consumption by other queries, applications or services, written out to distributed file systems, or sent to web frontends [26].
Apache Kafka provides distributed data streams for efficient processing of Flink jobs. This is achieved by an abstraction in Kafka called a topic. A topic is a handle to a logical stream of data, consisting of many partitions. Topic partitioning is important in providing distributed data streams to its consumers. Since partitions are assigned to Flink's parallel task instances, when there are more Flink tasks than Kafka partitions, some of the Flink consumers will just sit idle and not read any data, as shown in Figure 1. On the other hand, when there are more Kafka partitions than Flink tasks, Flink consumer instances will subscribe to multiple partitions at the same time, as shown in Figure 2. Besides, data stream velocity is unpredictable. One may need to rescale a Flink job by increasing or decreasing its parallelism to handle the changing velocity of the incoming data stream. Thus, it is very important to partition a topic keeping in view the maximum available Flink task slots and possible rescaling, to avoid wastage of resources. In general, it is recommended to have more partitions than the number of Flink task slots for higher throughput and to avoid task slot idleness. However, too many partitions can increase latency. A detailed discussion on Kafka topic partitioning is beyond the scope of this work. One may refer to the article by Paul Brebner [27] for an in-depth discussion of intelligent topic partitioning for higher throughput.
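The partition-to-slot relationship of Figures 1 and 2 can be simulated with a simple round-robin assignment (a sketch only; the real Kafka connector's assignment logic differs in its details):

```java
import java.util.ArrayList;
import java.util.List;

// Round-robin assignment of Kafka topic partitions to Flink task slots.
// With fewer partitions than slots, some slots stay idle (Figure 1);
// with more partitions than slots, slots read several partitions (Figure 2).
public class PartitionAssignment {
    // Returns, for each slot, the list of partitions it consumes.
    public static List<List<Integer>> assign(int numPartitions, int numSlots) {
        List<List<Integer>> slots = new ArrayList<>();
        for (int s = 0; s < numSlots; s++) slots.add(new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) slots.get(p % numSlots).add(p);
        return slots;
    }

    public static void main(String[] args) {
        System.out.println(assign(2, 4)); // [[0], [1], [], []] -> two idle slots
        System.out.println(assign(6, 4)); // [[0, 4], [1, 5], [2], [3]] -> uneven load
    }
}
```

The two printed cases correspond directly to the resource waste and the overburdened consumers discussed above.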

IV. SPATIAL OBJECTS AND SPATIAL STREAMS
This work proposes GeoFlink, an efficient and scalable spatial data stream management system. GeoFlink supports the following types of spatial objects in 2D space: Point, MultiPoint, LineString, MultiLineString, Polygon, and MultiPolygon. Logically, we can deal with multi-dimensional objects in nD space; we focus on 2D space in this paper. Figure 3 shows the spatial objects listed above. A Point is the simplest spatial object, consisting of two positional attributes. In a geographical coordinate system, the positional attributes are referred to as longitude and latitude. A MultiPoint is a set consisting of multiple Points. A LineString is a connected series of line segments, also referred to as a polygonal chain in geometry, where a line segment is a line bounded by two distinct end points. A MultiLineString consists of multiple disconnected LineStrings. A Polygon is also a connected series of line segments, where the first end point of the first line segment coincides with the second end point of the last line segment, forming a closed shape. A Polygon may also contain one or more holes within it. Like a MultiLineString, a MultiPolygon consists of multiple disconnected Polygons. In this work, we use ψ to denote a spatial object.
A spatial stream consists of one or more types of spatial objects. Formally, a spatial stream can be defined as follows:
Definition 3 (Spatial Stream S): A spatial stream S is a sequence of spatial tuples ordered by their timestamps, where each tuple consists of at least two attributes: a spatial object (ψ) and a timestamp.
To keep the discussion simple and without loss of generality, we assume that a spatial stream consists of only one type of spatial object. Thus, a Point stream consists of only Points, a Polygon stream consists of only Polygons, and so on. Furthermore, each stream tuple contains an entire spatial object and a timestamp, and may contain other attributes, like an object identifier, object bounding box, etc.

V. DISTANCE AND NEIGHBORS
In this section, we discuss distance computation between different spatial objects and r-neighbors computation, required for the proposed spatial queries discussed in Section VII.

A. SPATIAL DISTANCE
Spatial queries proposed in this work require extensive distance computation between spatial objects. Since GeoFlink supports six different types of spatial objects (Figure 3), distance computation between them is required for the execution of the proposed queries. In GeoFlink, Point, LineString and Polygon are treated as special cases of MultiPoint, MultiLineString and MultiPolygon consisting of a single Point, LineString or Polygon, respectively. Namely, we require the following distance computations.
• Point to Point
• Point to LineString
• Point to Polygon
• LineString to LineString
• LineString to Polygon
• Polygon to Polygon
Distance computation between geometrical shapes, also known as a similarity measure, is a well-studied topic. There exist a number of distance functions in the literature, for example, the Hausdorff metric [28] for Point sets and Polygons and the Frechet distance [29] for LineStrings, along with their computation techniques [30], [31]; however, their detailed discussion is outside the scope of this work. GeoFlink uses the distance functions provided by the JTS Topology Suite [32]. The JTS Topology Suite is a Java API for modeling and manipulating 2-dimensional linear geometry. JTS uses the implicit coordinate system of the input data. The only assumption it makes is that the coordinate system is infinite, planar and Euclidean (i.e., rectilinear and obeying the standard Euclidean distance metric). In the same way, JTS does not specify any particular units for coordinates and geometries; instead, the units are implicitly defined by the input data. JTS provides numerous geometric predicates and functions. In this work, the Euclidean distance is used for distance computation between spatial objects, although users can write their own distance functions by altering the methods in GeoFlink's DistanceFunctions.java class.
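As an illustration of the Euclidean distances listed above, the following standalone Java sketch computes the distance from a Point to a LineString by projecting the point onto each segment. GeoFlink delegates such computations to JTS; this code is only illustrative:

```java
// Euclidean distance from a Point to a LineString, a building block for
// the object-to-object distances listed above.
public class EuclideanDistance {
    // Distance from point (px,py) to the segment (ax,ay)-(bx,by).
    public static double pointToSegment(double px, double py,
                                        double ax, double ay,
                                        double bx, double by) {
        double dx = bx - ax, dy = by - ay;
        double len2 = dx * dx + dy * dy;
        // Project the point onto the segment, clamping to its end points.
        double t = len2 == 0 ? 0 : Math.max(0, Math.min(1,
                ((px - ax) * dx + (py - ay) * dy) / len2));
        double cx = ax + t * dx, cy = ay + t * dy;
        return Math.hypot(px - cx, py - cy);
    }

    // Distance from a point to a polyline given as [x0,y0, x1,y1, ...].
    public static double pointToLineString(double px, double py, double[] coords) {
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 3 < coords.length; i += 2) {
            best = Math.min(best, pointToSegment(px, py,
                    coords[i], coords[i + 1], coords[i + 2], coords[i + 3]));
        }
        return best;
    }

    public static void main(String[] args) {
        // Point (0,1) above the segment (0,0)-(2,0): distance 1.
        System.out.println(pointToSegment(0, 1, 0, 0, 2, 0));              // 1.0
        // Point (3,0) beyond the segment end: closest point is (2,0).
        System.out.println(pointToLineString(3, 0, new double[]{0, 0, 2, 0})); // 1.0
    }
}
```

LineString-to-LineString and Polygon distances reduce to repeated applications of this segment distance, which is why the point-to-segment projection is the core primitive.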

B. r -NEIGHBORS
All of GeoFlink's spatial queries require neighborhood computation. We define the r-neighbor region (ψ) and r-neighbors(ψ) given a spatial object ψ and a distance r.

1) POINT AS A QUERY OBJECT
Given a Point ψ as the query object, Figure 5 shows the r-neighbors(ψ) computation for the spatial objects Point, LineString, and Polygon. r-neighbors(ψ) is defined as the objects which lie within r distance of ψ; namely, all the spatial objects overlapping the r-neighbor region (ψ) are r-neighbors(ψ). In Figure 5, the r-neighbor region (ψ) is given by a circle around ψ due to the use of the 2D Euclidean distance. In Figure 5(a), Points o4, o5, o7, o8, and o9; in Figure 5(b), LineStrings o4, o5, and o9; whereas in Figure 5(c), Polygons o4, o5, o8, and o9 are r-neighbors(ψ).
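A minimal sketch of r-neighbors(ψ) for a Point query object: every object within Euclidean distance r of ψ is reported. Point candidates are used here; other object types would plug in the corresponding distance function:

```java
import java.util.ArrayList;
import java.util.List;

// r-neighbors(ψ) for a Point query object: the objects whose distance
// to ψ is at most r, i.e., those inside the circular r-neighbor region.
public class RNeighbors {
    public static List<double[]> rNeighbors(double qx, double qy, double r,
                                            List<double[]> points) {
        List<double[]> out = new ArrayList<>();
        for (double[] p : points) {
            if (Math.hypot(p[0] - qx, p[1] - qy) <= r) out.add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[]> pts = List.of(
                new double[]{1, 0}, new double[]{0, 3}, new double[]{2, 2});
        // Query point (0,0), r = 2: only (1,0) falls inside the circle.
        System.out.println(rNeighbors(0, 0, 2, pts).size()); // 1
    }
}
```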

VI. SPATIAL STREAM INDEXING
Spatial data index structures can be classified into two broad categories: 1) tree-based, and 2) grid-based. Tree-based spatial indexes like the R-tree, Quad-tree and KDB-tree can significantly speed up spatial query processing; however, their maintenance/restructuring cost is high, especially in the presence of heavy updates like insertions and deletions [33]. On the other hand, grid-based indexes enable fast updates, but they cannot answer queries as efficiently as tree-based indexes [34], [35].
Since GeoFlink is a distributed spatial data stream management system, an index structure which can work efficiently in a distributed environment in the presence of heavy updates is desirable. Considering the distributed processing in GeoFlink, where all the nodes work independently, the index must be maintained on each cluster node corresponding to the data it receives. A tree index structure is data-dependent, resulting in a different maintenance cost for each cluster node depending on the data it receives, whereas a grid index does not require any maintenance. To this end, a grid-based index is a natural choice for GeoFlink.
One problem with the grid-based index is that it requires data boundaries for its construction. Since a stream is dynamic and infinite, its boundaries cannot be known in advance; however, they can be estimated from the geographical locations of the sensors generating the data stream in the case of fixed sensors, or from the loose bounds of the area or city of the objects if the data stream is generated by moving objects. For instance, assuming that we are aware of the city the stream source(s) originate from, we can use its geographical boundaries for the grid-index creation.

A. GeoFlink GRID INDEX
The grid index used in this work is partially borrowed from our previous work [10], where it was mainly used for spatial Point objects. However, in this work, we extend the grid index to the other spatial objects discussed in Section IV. For brevity, we summarize it here.
The grid index [36] in this work is aimed at filtering or pruning objects during spatial query execution. Furthermore, it helps in the near-uniform distribution of spatial objects across distributed cluster nodes. In GeoFlink, the grid index (G) is constructed by partitioning a 2D rectangular space, given by (MinX, MinY), (MaxX, MaxY) (MaxX − MinX = MaxY − MinY), into square-shaped cells of length l, as shown in Figure 8. Here we assume that G's boundary is known, which can be estimated through the data source location. Let c_{x,y} ∈ G denote a grid cell with indices x and y, respectively. We will use c instead of c_{x,y} when it is clear from the context. Each cell c ∈ G is identified by its unique cell ID (key), obtained by concatenating its x and y binary indices. Figure 8 shows a grid structure with a cell c_{x,y} and its unique key. The Guaranteed cells (ψ) and Candidate cells (ψ) in Figure 8 are discussed in Section VI-B.
Within GeoFlink, each stream object is assigned one or more cell IDs (keys) on its arrival, depending on the G cells it belongs to. A spatial object ψ belongs to a set of cells C if it overlaps partially or completely with each c ∈ C. In this work, we assume that a spatial Point belongs to only one cell, whereas other spatial objects can belong to multiple cells depending on their sizes and positions. Hence, a single key is assigned to a Point, whereas a set of keys may need to be assigned to other spatial objects. The grid-based index used in this work is logical, that is, it only assigns keys to the incoming streaming objects. Besides, no physical data structure is needed; hence, no update is required when an object expires or an updated object location is received. This makes our grid index fast and memory efficient.
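A minimal sketch of the logical grid index for Point objects: the cell key is obtained by concatenating the binary x and y cell indices, as described above. Boundary values and class/method names are illustrative:

```java
// Logical grid index: map a coordinate to its cell indices and build the
// cell key by concatenating the binary x and y indices. Because nothing
// is stored, assigning keys is the only indexing cost.
public class GridIndex {
    public final double minX, minY, cellLen;
    public final int numCellsPerAxis, bits;

    public GridIndex(double minX, double minY, double sideLen, int numCellsPerAxis) {
        this.minX = minX;
        this.minY = minY;
        this.cellLen = sideLen / numCellsPerAxis;
        this.numCellsPerAxis = numCellsPerAxis;
        // Enough bits to encode the largest index on one axis.
        this.bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(numCellsPerAxis - 1));
    }

    // Key of the cell containing a Point (Points belong to exactly one cell).
    public String cellKey(double x, double y) {
        int cx = Math.min((int) ((x - minX) / cellLen), numCellsPerAxis - 1);
        int cy = Math.min((int) ((y - minY) / cellLen), numCellsPerAxis - 1);
        String fmt = "%" + bits + "s";
        return (String.format(fmt, Integer.toBinaryString(cx))
                + String.format(fmt, Integer.toBinaryString(cy))).replace(' ', '0');
    }

    public static void main(String[] args) {
        GridIndex g = new GridIndex(0, 0, 100, 4); // 4x4 grid, cells of length 25
        System.out.println(g.cellKey(30, 80)); // x index 1, y index 3 -> "0111"
    }
}
```

In GeoFlink this key doubles as the partitioning key (Section III), so tuples falling into the same cell are processed by the same operator instance.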

1) COST ANALYSIS
GeoFlink's grid index is logical, i.e., no structure is maintained in memory. Thus, there is no insert, search or delete cost. Moreover, being logical, the index does not occupy any memory space. In GeoFlink, each stream tuple is assigned a grid cell key or a set of keys based on its coordinates, which is used for its neighborhood computation, pruning and distribution across the cluster nodes. Key computation using stream tuple coordinates, discussed above, is the only grid indexing and maintenance cost. Section VI-B goes into the details of the use of neighborhood computation.

B. GRID INDEX AND r -NEIGHBORS COMPUTATION
This section discusses how the proposed grid index is used for efficient query computation in GeoFlink given a spatial query object ψ, particularly for r-neighbors(ψ) computation. One traditional and effective approach to reduce query computation cost is to prune out the objects which cannot be part of the query output without performing distance computation, because distance computation can be costly, especially for complex spatial objects. In the following, we discuss how our grid index is used to identify r-neighbors(ψ) and prune non-r-neighbors(ψ).
Firstly, we need to identify C_ψ (the set of cells containing the spatial object ψ). This is done by obtaining C_{bbox_ψ} (the set of cells containing bbox_ψ, the bounding box of ψ) and then computing C_ψ by identifying the cells in C_{bbox_ψ} overlapping ψ. The set C_{bbox_ψ} is obtained by fetching ψ's extreme coordinates to get its bounding box (bbox_ψ) and using it to identify the cells which overlap bbox_ψ. Figure 9(a) shows a query Polygon ψ with its bbox_ψ and C_{bbox_ψ} (the gray cells). After obtaining C_ψ ⊂ G, the grid is divided into the non-overlapping regions Guaranteed cells (ψ), Candidate cells (ψ), and Non-neighboring cells (ψ), as follows:
Definition 7 (Guaranteed cells (ψ)): A set of cells such that the objects in them are guaranteed to be r-neighbors(ψ).
Definition 8 (Candidate cells (ψ)): A set of cells such that the objects in them may or may not be r-neighbors(ψ) and hence require distance evaluation.

Definition 9 (Non-neighboring cells (ψ)): A set of cells such that the objects in them cannot be r-neighbor(ψ).
Non-neighboring cells (ψ) = {c_{x,y} | c_{x,y} ∉ Guaranteed cells (ψ) ∧ c_{x,y} ∉ Candidate cells (ψ)}

Algorithm 1 shows the steps to compute Guaranteed cells (ψ) and Candidate cells (ψ) given ψ, r, and G. Given Guaranteed cells (ψ), Candidate cells (ψ), and Non-neighboring cells (ψ), the Guaranteed-, Candidate-, and Non-neighbors of a spatial object ψ are the objects lying in the respective cell sets.

Algorithm 1: Guaranteed cells (ψ) and Candidate cells (ψ) Computation
Require: ψ: spatial object, r: query distance, G: spatial grid index
Ensure: Guaranteed cells (ψ) and Candidate cells (ψ)
1: Obtain bbox_ψ from ψ's extreme coordinates
2: Compute C_{bbox_ψ}, the cells overlapping bbox_ψ. Since the grid cells' boundaries are known, C_{bbox_ψ} can be obtained by fetching the grid cells overlapping bbox_ψ.
3: Compute C_ψ; C_ψ ⊂ C_{bbox_ψ} and is obtained by identifying the cells in C_{bbox_ψ} overlapping ψ.
4: for each c_{x,y} ∈ C_ψ do
5: …

Example 2: Let ψ denote a spatial Polygon object (red colored) in 2D space, as shown in Figure 9(a). Given a distance r and assuming that ψ denotes a query object, the r-neighbor region (ψ) is given by the green boundary around ψ. In addition to ψ, there exist 11 other spatial objects, o1 ∼ o11. Furthermore, Figure 9(a) shows the grid index, where the gray cells denote C_{bbox_ψ}. The query object cells are denoted by C_ψ; namely, C_ψ is the set of cells which overlap ψ. Using c ∈ C_ψ, we compute Guaranteed cells (ψ), Candidate cells (ψ), and Non-neighboring cells (ψ) as shown in Figure 9(b). Given the guaranteed, candidate, and non-neighboring cells, objects o6 and o11 are classified as guaranteed, o2, o5, o7, o8, and o10 as candidate, whereas o1, o3, o4, and o9 as non-r-neighbors(ψ). Figures 10, 11, and 12 show the Guaranteed cells (ψ), Candidate cells (ψ) and Non-neighboring cells (ψ) for the query objects Point, LineString and Polygon, respectively.
In the figures, query objects are represented by ψ, while other objects are represented by o1, o2, . . . , o10. In the case of LineString and Polygon query objects, their bounding boxes are used to identify the cells containing them. The blue cells represent Guaranteed cells (ψ), the yellow cells represent Candidate cells (ψ), while the white cells represent Non-neighboring cells (ψ). Please note that all c ∈ Guaranteed cells (ψ) overlap the r distance of the query geometry ψ; thus, all the objects in Guaranteed cells (ψ) are r-neighbors(ψ). Objects which overlap Candidate cells (ψ) may or may not be in r-neighbors(ψ) and consequently require an actual distance computation. Objects which lie entirely in Non-neighboring cells (ψ) can be safely pruned.
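For a Point query object on a uniform grid, the three cell classes above can be sketched with simple geometry (a simplified, self-contained illustration, not GeoFlink's actual implementation; the unit-square grid and names are assumptions): a cell whose farthest corner lies within r of the query point is a guaranteed cell, a cell whose nearest point lies within r is a candidate cell, and every other cell is non-neighboring.

```java
public class CellClassifier {
    // Classify cell (cx, cy) of a g x g uniform grid over [0,1]x[0,1]
    // with respect to a Point query (qx, qy) and query distance r.
    // GUARANTEED: even the farthest corner of the cell is within r, so every
    //   object in the cell is an r-neighbor without any distance check.
    // CANDIDATE: the cell overlaps the r-region, so its objects need an
    //   exact distance computation.
    // NON_NEIGHBORING: the cell lies entirely outside distance r.
    public static String classify(int g, int cx, int cy, double qx, double qy, double r) {
        double cell = 1.0 / g;
        double minX = cx * cell, maxX = minX + cell;
        double minY = cy * cell, maxY = minY + cell;
        // Distance from the query point to the nearest point of the cell
        double dx = Math.max(Math.max(minX - qx, 0), qx - maxX);
        double dy = Math.max(Math.max(minY - qy, 0), qy - maxY);
        double minDist = Math.hypot(dx, dy);
        // Distance from the query point to the farthest corner of the cell
        double fx = Math.max(Math.abs(minX - qx), Math.abs(maxX - qx));
        double fy = Math.max(Math.abs(minY - qy), Math.abs(maxY - qy));
        double maxDist = Math.hypot(fx, fy);
        if (maxDist <= r) return "GUARANTEED";
        if (minDist <= r) return "CANDIDATE";
        return "NON_NEIGHBORING";
    }

    public static void main(String[] args) {
        // Query point near the grid centre, r covering roughly one cell width
        System.out.println(classify(10, 5, 5, 0.55, 0.55, 0.08)); // GUARANTEED
        System.out.println(classify(10, 6, 5, 0.55, 0.55, 0.08)); // CANDIDATE
        System.out.println(classify(10, 0, 0, 0.55, 0.55, 0.08)); // NON_NEIGHBORING
    }
}
```

For LineString and Polygon query objects, the same min/max-distance test would be applied against the object's bounding box first, as in Algorithm 1.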

VII. SPATIAL CONTINUOUS QUERIES
This section presents the three basic spatial continuous operators/queries proposed in this work, i.e., spatial range, spatial kNN and spatial join. These queries are required by most spatial data processing and analysis applications. The spatial continuous queries are executed on continuous data streams. Each GeoFlink query has two variants: real-time and window-based.
• Real-time: A real-time query is triggered with the arrival of new stream tuples. Precisely, as a new tuple is received by GeoFlink, it is processed by the real-time query and a corresponding output (if any) is generated.
• Window-based: Triggering of a window-based query is based on the window size and window slide step. The window-based query performs computation on all the spatial objects in the window and generates output (if any) corresponding to the whole window contents. The query output is generated every slide step (W_s) in the case of a sliding window, or every window size (W_n) in the case of a tumbling window. In GeoFlink, a query can be configured as real-time or window-based by utilizing the QueryConfiguration class and providing it appropriate parameters. In the following code snippets, we will use realTimeConf and windowBasedConf for the real-time and window-based query configurations, respectively.
Code 2. A GeoFlink (Java) code for query type configuration.

Table 1 lists the notations used in the rest of the manuscript.
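As a rough sketch of the two configuration modes described above (the class below is a simplified, self-contained stand-in with assumed names, not GeoFlink's actual QueryConfiguration API), a configuration object captures the query type and, for window-based queries, the parameters W_n and W_s:

```java
public class QueryConfigSketch {
    enum QueryType { REAL_TIME, WINDOW_BASED }

    final QueryType type;
    int windowSize = 0; // W_n in seconds (window-based only)
    int slideStep = 0;  // W_s in seconds (window-based only)

    QueryConfigSketch(QueryType type) { this.type = type; }

    QueryConfigSketch setWindowSize(int wn) { this.windowSize = wn; return this; }
    QueryConfigSketch setSlideStep(int ws) { this.slideStep = ws; return this; }

    boolean isTumbling() {
        // W_s == W_n means consecutive windows do not overlap
        return type == QueryType.WINDOW_BASED && slideStep == windowSize;
    }

    public static void main(String[] args) {
        // Real-time configuration: the query is triggered per arriving tuple
        QueryConfigSketch realTimeConf = new QueryConfigSketch(QueryType.REAL_TIME);
        // Window-based configuration: W_n = 100s, W_s = 50s (a sliding window)
        QueryConfigSketch windowBasedConf = new QueryConfigSketch(QueryType.WINDOW_BASED)
                .setWindowSize(100).setSlideStep(50);
        System.out.println(realTimeConf.type);           // REAL_TIME
        System.out.println(windowBasedConf.isTumbling()); // false: windows overlap by 50s
    }
}
```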

A. SPATIAL RANGE QUERY
This query comes in handy when a user wants to fetch spatial objects within a certain distance of query objects.
Definition 10 (Spatial Range Query): Given S, ψ, and r, range query returns the r-neighbors(ψ) in S continuously.
Spatial range query returns all the s ∈ S that lie within r distance of ψ.
The spatial range query makes use of the grid index for efficient computation. Using the proposed approach, the s ∈ S objects which cannot be part of the query result are pruned out at an early stage, reducing the number of distance computations and the query processing cost. Algorithm 2 shows the key phases of the distributed spatial range query: 1) Filter, 2) Shuffle, and 3) Refine. The algorithm is executed in a distributed fashion on a cluster of nodes. Each node receives a data stream and executes the Filter Phase (Lines 3 ∼ 7) independently. This phase directs guaranteed-neighbors to the output, candidate-neighbors to the next phase, and prunes out non-neighbors based on their keys. For spatial tuples containing a set of keys with size > 1, if any of their keys belongs to Guaranteed cells (ψ), the tuple is a guaranteed-neighbor and is directed to the output. If none of its keys belongs to Guaranteed cells (ψ), it is checked for candidate neighborhood in a similar fashion. The filter phase is followed by the Shuffle Phase, which logically re-distributes the filtered stream tuples across the cluster nodes based on their keys. All tuples with the same key are assigned to the same partition at this stage.
Finally, the Refine Phase (Line 8) evaluates the candidate stream tuples using the spatial distance function and directs the r-neighbors(ψ) to the output. Please note that the refine phase is executed in a distributed fashion, where each node receives one or more logical stream partitions from the shuffle phase.

Algorithm 2 Spatial Range Query: Real-Time
Require: S_in: Spatial data stream, ψ: Query object, r: Query distance, G: Spatial grid index
Ensure: S_out: Continuous spatial stream of r-neighbors(ψ)
1: Compute Guaranteed cells (ψ) and Candidate cells (ψ) using ψ, r and G (Algorithm 1)
{Filter Phase}
2: if (s ∈ S_in) lies within or overlaps the Guaranteed cells (ψ) then
3: s is a guaranteed-neighbor and is added to S_out
4: else if (s ∈ S_in) lies within or overlaps the Candidate cells (ψ) then
5: s is a candidate-neighbor and is added to S_c {S_c denotes a filtered stream containing only candidate neighbors}
6: end if
{Shuffle Phase}
7: Shuffle s ∈ S_c based on s.key to balance the load across the distributed cluster nodes
{Refine Phase}
8: if (s ∈ S_c) is in r-neighbors(ψ) then
9: Add s to S_out {r-neighbors(ψ) is computed using the distance function}
10: end if
11: return S_out
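The filter and refine phases of Algorithm 2 can be sketched as follows for Point tuples (a simplified single-process illustration with assumed cell keys; in GeoFlink these phases run as distributed Flink operators). Tuples keyed to guaranteed cells bypass the distance computation entirely; only candidate-cell tuples are refined.

```java
import java.util.*;

public class RangeQuerySketch {
    // A stream tuple: a point with the grid-cell key assigned at ingestion.
    record Tuple(String cellKey, double x, double y) {}

    // Filter + refine: guaranteed-cell tuples go straight to the output,
    // candidate-cell tuples are kept only if within r of the query point,
    // everything else is pruned without any distance computation.
    static List<Tuple> rangeQuery(List<Tuple> stream,
                                  Set<String> guaranteedCells,
                                  Set<String> candidateCells,
                                  double qx, double qy, double r) {
        List<Tuple> out = new ArrayList<>();
        for (Tuple s : stream) {
            if (guaranteedCells.contains(s.cellKey())) {
                out.add(s);                                  // no distance check needed
            } else if (candidateCells.contains(s.cellKey())) {
                if (Math.hypot(s.x() - qx, s.y() - qy) <= r) // refine phase
                    out.add(s);
            }                                                // non-neighboring: pruned
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tuple> stream = List.of(
                new Tuple("c55", 0.55, 0.55),  // in a guaranteed cell
                new Tuple("c65", 0.60, 0.55),  // candidate cell, within r
                new Tuple("c65", 0.69, 0.59),  // candidate cell, outside r
                new Tuple("c00", 0.05, 0.05)); // non-neighboring: pruned by key alone
        List<Tuple> result = rangeQuery(stream, Set.of("c55"), Set.of("c65"),
                0.55, 0.55, 0.08);
        System.out.println(result.size()); // 2
    }
}
```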
Algorithm 3 presents the window-based version of the range query. The only difference between the real-time and window-based range query algorithms is the refine phase. In the real-time range query, the refine phase generates output continuously, whereas in the window-based range query, the output is generated periodically, every W_s time units, corresponding to the window size W_n.
To execute a real-time or window-based range query via GeoFlink, the run method of one of the classes listed in Table 2 is used with the appropriate query configuration discussed in Code 2.
Code snippet 3 shows the registration of a range query on a Point data stream with a Point query object.

B. SPATIAL kNN QUERY
This query is useful in fetching the k nearest spatial objects of a query object.

Algorithm 3 Spatial Range Query: Window-Based (refine phase excerpt)
10: if (s ∈ S_c) is in r-neighbors(ψ) then
11: Add s to S_out {r-neighbors(ψ) is computed using the distance function}
12: end if
13: end for
14: return S_out

Code 3. A GeoFlink (Java) class for spatial range query.
Definition 11 (Spatial kNN Query): Given S, ψ, r and a positive integer k, kNN query returns the k nearest r-neighbors(ψ) in S. If less than k neighbors exist, all the r-neighbors(ψ) are returned.
In addition to the parameters discussed in Definition 11, the real-time kNN query takes an additional parameter, recency duration (ω). The real-time kNN query continuously outputs the k nearest r-neighbors(ψ) in S, where s ∈ S must have arrived in the last ω duration. In contrast, the window-based kNN query takes two additional parameters, i.e., window size (W_n) and window slide step (W_s), and periodically outputs the k nearest r-neighbors(ψ) in S, every W_s time units, corresponding to a window of size W_n. In practice, ω is much smaller than W_n and results in near real-time kNN.
Just like the range query, the spatial kNN query makes use of the grid index for efficient computation. Using the proposed approach, s ∈ S objects which cannot be part of the query result are pruned out at an early stage, thus reducing the number of distance computations and the query processing cost. Algorithm 4 shows the key phases of the distributed spatial kNN query: 1) Filter, 2) Shuffle, 3) Refine, and 4) Merge. The algorithm is executed in a distributed fashion on a cluster of nodes. Each node receives a data stream and executes the Filter Phase (Lines 3 ∼ 5) independently. This phase prunes out non-neighbors and directs candidate-neighbors in the guaranteed and candidate neighboring layers to the next phase. This phase is followed by the Shuffle Phase, which logically re-distributes the filtered stream tuples across the cluster nodes based on their keys. Namely, all tuples with the same key are assigned to the same partition (Line 6). The Refine Phase evaluates the candidate stream tuples that arrived in the last ω duration for kNN using the distance function (Lines 7 ∼ 12). Since the refine phase is computed independently on the distributed nodes, the Merge Phase is needed to combine the kNNs from all the distributed nodes to obtain the true kNNs.

Algorithm 4 Spatial kNN Query: Real-Time (refine and merge phases excerpt)
8: if (s ∈ S_c) is in r-neighbors(ψ) ∧ (s ∈ S_c) is a kNN then
9: Add s to S_knn {S_knn denotes the stream of k nearest neighbors}
10: Update S_knn to keep the size of S_knn at most k
11: end if
12: end for
{Merge Phase}
13: Merge and sort s ∈ S_knn from all the distributed nodes to obtain an integrated kNN and add the result to S_out
14: return S_out
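The per-node refine phase and the final merge of Algorithm 4 can be sketched as follows (a simplified single-process illustration operating on pre-computed distances; in GeoFlink each node maintains its own top-k list and the partial lists are merged downstream):

```java
import java.util.*;

public class KnnMergeSketch {
    // Refine phase on one node: keep at most the k smallest distances
    // among that node's candidate neighbors.
    static List<Double> localTopK(List<Double> distances, int k) {
        List<Double> sorted = new ArrayList<>(distances);
        Collections.sort(sorted);
        return sorted.subList(0, Math.min(k, sorted.size()));
    }

    // Merge phase: combine the partial top-k lists from all nodes and
    // re-sort to obtain the integrated (global) kNN.
    static List<Double> merge(List<List<Double>> partialLists, int k) {
        List<Double> all = new ArrayList<>();
        for (List<Double> p : partialLists) all.addAll(p);
        Collections.sort(all);
        return all.subList(0, Math.min(k, all.size()));
    }

    public static void main(String[] args) {
        int k = 3;
        // Distances of candidate neighbors to the query object ψ on two nodes
        List<Double> node1 = localTopK(List.of(0.9, 0.1, 0.5, 0.7), k); // [0.1, 0.5, 0.7]
        List<Double> node2 = localTopK(List.of(0.2, 0.8, 0.3), k);      // [0.2, 0.3, 0.8]
        System.out.println(merge(List.of(node1, node2), k));            // [0.1, 0.2, 0.3]
    }
}
```

Note that a node's local top-k is a superset filter: the global kNN is always contained in the union of the per-node lists, which is why the merge step recovers the true result.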
Since the real-time and window-based kNN algorithms are very similar, the window-based kNN algorithm has been omitted. The main difference between the real-time and window-based kNN algorithms is the refine phase, where the window parameters W_n and W_s are used instead of the recency parameter (ω) for the window size. In practice, ω is much smaller than W_n.
To execute a real-time or window-based kNN query via GeoFlink, the run method of one of the classes listed in Table 3 is used with the appropriate query configuration discussed in Code 2. Code snippet 4 shows the registration of a kNN query on a Point data stream with a Point query object.
Please note that the output of the kNN query is a stream of lists sorted with respect to the distance from ψ, where each list consists of the kNNs corresponding to the last ω duration or the window size W_n.

C. SPATIAL JOIN QUERY
Spatial join is a useful operator where one stream is joined with another based on some query distance. Join is an expensive operator as it involves a Cartesian product between the two streams.
Definition 12 (Spatial Join Query): Given two spatial streams S1 (ordinary stream) and S2 (query stream) and a distance r, a join query returns all pairs of objects (ψ_i, ψ_j) such that ψ_i ∈ S1 is in r-neighbors(ψ_j), where ψ_j ∈ S2.
In addition to the parameters discussed in Definition 12, the real-time join query takes an extra parameter, recency duration (ω). The real-time join query continuously outputs the spatial join pairs (ψ_i, ψ_j), where ψ_i ∈ S1 and ψ_j ∈ S2 must have arrived in the last ω duration. In contrast, the window-based spatial join takes two additional parameters, i.e., window size (W_n) and window slide step (W_s), and periodically (every W_s time units) outputs the spatial join pairs (ψ_i, ψ_j), where ψ_i ∈ S1 and ψ_j ∈ S2 belong to a window of size W_n. In practice, ω is much smaller than W_n and thus results in near real-time spatial join.
Spatial join is an expensive operation, where each tuple of the query stream must be checked against every tuple of the ordinary stream, which involves a large number of distance computations. Hence, we propose an efficient grid-index-based spatial join. According to the spatial join definition (Definition 12), given a query stream object ψ_j, only the objects which lie within r-neighbors(ψ_j) can be part of the join output. Thus, a query stream object needs to be joined only with the objects which lie within r distance of it, rather than with all the objects. This requires a pruning strategy. To achieve this in a distributed environment using the grid index, where the objects belonging to the same key (cell) land on the same logical computing node for join processing, we replicate each query stream object based on its Guaranteed cells (ψ) and Candidate cells (ψ). We then join the replicated query stream with the ordinary stream using a key-based join. This causes the ordinary and query stream objects corresponding to a particular cell to land on the same logical join operator and prunes out the objects with no matching keys (i.e., cell IDs), resulting in an efficient grid-based spatial stream join.
Algorithm 5 shows the distributed spatial join algorithm consisting of the following three phases: 1) Replication phase, 2) Join phase, and 3) Filter phase. Let S1 and S2 denote an ordinary and a query stream, respectively. Assuming that C_ψj denotes the set of cells containing a query object ψ_j, then given r, the Replication Phase computes Guaranteed cells (ψ_j) and Candidate cells (ψ_j) for each ψ_j ∈ S2. Next, the ψ_j ∈ S2 are replicated in such a way that each replicated object is assigned the key of one of the cells in Guaranteed cells (ψ_j) and Candidate cells (ψ_j). We denote the replicated query stream by S2′. Next, in the Join Phase, we make use of Apache Flink's key-based join transformation to join the two streams S1 and S2′. Flink's key-based join enables the tuples from the two streams with the same key to land on the same operator instance. This causes the join to be evaluated only between q ∈ S2′ and p ∈ S1 belonging to the cells in Guaranteed cells (ψ_j) and Candidate cells (ψ_j), while filtering out the non-neighbors of ψ_j. Since the ψ_i ∈ S1 corresponding to Guaranteed cells (ψ_j) are guaranteed to be part of the join output, they are sent to the output directly without distance evaluation. However, for ψ_i ∈ S1 corresponding to Candidate cells (ψ_j), a distance evaluation is done to find whether ψ_i ∈ S1 is in r-neighbors(ψ_j), where ψ_j ∈ S2′. The Filter Phase filters the output stream to remove duplicate pairs, if any.
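The replication and key-based join described above can be sketched as follows (a simplified hash-map illustration with assumed IDs; the candidate-cell distance refinement and duplicate filtering are omitted for brevity, and in GeoFlink the grouping is performed by Flink's distributed keyed join):

```java
import java.util.*;

public class JoinSketch {
    // An ordinary stream tuple with its grid-cell key.
    record Obj(String id, String cellKey) {}

    // Replication phase: emit one copy of each query object per
    // guaranteed/candidate cell key. Join phase: an ordinary tuple is only
    // paired with query objects that share its cell key; tuples with
    // non-matching keys never meet and are thus pruned.
    static List<String> keyJoin(List<Obj> ordinary,
                                Map<String, Set<String>> queryCells) {
        // Build the replicated query stream S2', grouped by cell key
        Map<String, List<String>> replicated = new HashMap<>();
        for (Map.Entry<String, Set<String>> q : queryCells.entrySet())
            for (String cell : q.getValue())
                replicated.computeIfAbsent(cell, c -> new ArrayList<>()).add(q.getKey());
        // Join phase: pair tuples sharing a cell key
        List<String> pairs = new ArrayList<>();
        for (Obj p : ordinary)
            for (String q : replicated.getOrDefault(p.cellKey(), List.of()))
                pairs.add(p.id() + "-" + q);
        return pairs;
    }

    public static void main(String[] args) {
        // Query object q1 overlaps cells c1 and c2 (its guaranteed/candidate cells)
        Map<String, Set<String>> queryCells = Map.of("q1", Set.of("c1", "c2"));
        List<Obj> ordinary = List.of(new Obj("p1", "c1"),
                                     new Obj("p2", "c2"),
                                     new Obj("p3", "c9")); // no matching key: pruned
        System.out.println(keyJoin(ordinary, queryCells)); // [p1-q1, p2-q1]
    }
}
```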
Since the real-time and window-based join algorithms are very similar, the window-based join algorithm has been omitted. The main difference between the real-time and window-based join algorithms is the join phase, where the window parameters W_n and W_s are used instead of the recency parameter (ω) for the window size. In practice, ω is much smaller than W_n.
To execute a real-time or window-based join query via GeoFlink, the run method of one of the classes listed in Table 4 is used with the appropriate query configuration discussed in Code 2. Code snippet 5 shows the registration of a join query on a Point data stream and a Point query stream. Please note that the output of the join query is a stream of pairs (ψ_i, ψ_j), where ψ_i ∈ S1 and ψ_j ∈ S2 such that ψ_i is in r-neighbors(ψ_j).

Algorithm 5 Spatial Join Query: Real-Time
Require: S1: Spatial stream 1, S2: Spatial stream 2, r: Query distance, ω: Recency duration, G: Spatial grid index
Ensure: S_out: Continuous spatial stream of pairs (ψ_i, ψ_j), where ψ_i ∈ S1 and ψ_j ∈ S2 such that ψ_i is an r-neighbor(ψ_j)
{Distributed S2 Replication Phase}
1: Compute C_ψj for each ψ_j ∈ S2 {C_ψj denotes the cells containing ψ_j}
2: Compute Guaranteed cells (ψ_j) and Candidate cells (ψ_j) using ψ_j, r and G (Algorithm 1)
3: for each c ∈ Guaranteed cells (ψ_j) ∨ c ∈ Candidate cells (ψ_j) do
4: Add ψ_j ∈ S2 to S2′, such that ψ_j.key is set to c.key
5: end for
{Distributed Join Phase}
6: Perform a distributed key-based join between S1 and S2′, which directs all the tuples from S1 and S2′ with the same key to the same logical operator instance. This step filters out the S1 and S2′ tuples with non-matching keys.
7: for each (ψ_i ∈ S1) in W_ω ∧ (ψ_j ∈ S2′) in W_ω do {W_ω denotes a window of size ω}
8: if ψ_i is in r-neighbors(ψ_j) then
9: Add (ψ_i, ψ_j) to S_c {r-neighbors(ψ_j) is computed using the distance function}
10: end if
11: end for
{Distributed Filter Phase}
12: Filter (ψ_i, ψ_j) ∈ S_c to remove duplicate pairs and add the result to S_out
13: return S_out

GeoFlink supports six types of spatial objects. Thus, we provide overloaded methods of each query to support these spatial objects as input stream and query object/stream. Table 5 lists the geometry combinations supported by GeoFlink for each operator.

VIII. GeoFlink ARCHITECTURE
Figure 13 shows the GeoFlink architecture. Users can register queries to GeoFlink through a Java/Scala API, and the output is available via a variety of sinks provided by Apache Flink. The GeoFlink architecture has two important layers: 1) Spatial Stream Layer and 2) Real-time Spatial Query Processing Layer.

A. SPATIAL STREAM LAYER
This layer is responsible for converting incoming data stream(s) into spatial data stream(s). Apache Flink treats a spatial data stream as an ordinary text stream, which may lead to its inefficient processing. GeoFlink converts it into a spatial data stream of spatial objects. GeoFlink supports all the commonly used spatial objects, including Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon. Furthermore, this layer assigns grid cell IDs (keys) to the spatial objects for their efficient distribution and processing. GeoFlink's grid index is discussed in Section VI-A.

1) SPATIAL OBJECTS SUPPORT
GeoFlink supports GeoJSON, CSV and TSV input stream formats from Apache Kafka. A GeoFlink user needs to make an appropriate Apache Kafka connection by specifying the Kafka topic name and bootstrap server(s). Once the connection is established, users can construct a spatial stream from the input streams by utilizing the PointStream, LineStringStream or PolygonStream methods of GeoFlink's Deserialization class.

2) SPATIAL STREAM PARTITIONING
Uniform partitioning of data across distributed cluster nodes plays a vital role in efficient query processing. As discussed in Section III-A, Apache Flink keyBy transformation logically partitions a stream into disjoint partitions in such a way that all the tuples with the same key are assigned to the same partition or to the same operator instance.
To enable uniform data partitioning in GeoFlink that takes spatial data proximity into account, the grid index is used. GeoFlink assigns a grid cell ID (key) to each incoming stream tuple based on its spatial location. Since all the spatially close tuples belong to a single grid cell, they are assigned the same key, which is used by Flink's keyBy operator for stream distribution. It is good to have the number of keys greater than or equal to the degree of parallelism, to enable Flink to distribute data uniformly.
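The key assignment described above can be sketched for Point tuples as follows (a simplified uniform grid over an assumed bounding box; GeoFlink's actual grid index is described in Section VI-A): spatially close points map to the same cell ID, which Flink's keyBy then uses for partitioning.

```java
public class GridKeySketch {
    // Map a (lon, lat) point to the cell ID of a g x g uniform grid laid
    // over the bounding box [minLon, maxLon] x [minLat, maxLat].
    static String cellKey(double lon, double lat,
                          double minLon, double maxLon,
                          double minLat, double maxLat, int g) {
        // Clamp to g-1 so points on the max boundary fall in the last cell
        int cx = Math.min(g - 1, (int) ((lon - minLon) / (maxLon - minLon) * g));
        int cy = Math.min(g - 1, (int) ((lat - minLat) / (maxLat - minLat) * g));
        return cx + "," + cy;
    }

    public static void main(String[] args) {
        // Two nearby points (e.g., taxis in Beijing, with an assumed bounding
        // box) land in the same cell; a distant one gets a different key.
        String a = cellKey(116.400, 39.900, 115.5, 117.5, 39.4, 41.1, 200);
        String b = cellKey(116.401, 39.901, 115.5, 117.5, 39.4, 41.1, 200);
        String c = cellKey(117.100, 40.600, 115.5, 117.5, 39.4, 41.1, 200);
        System.out.println(a.equals(b)); // true: same key, hence same partition
        System.out.println(a.equals(c)); // false
    }
}
```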
It is worth mentioning that GeoFlink receives distributed data streams from a distributed messaging system, for instance, Apache Kafka [25]. To enable uniform distribution of the incoming data stream across the GeoFlink cluster nodes, the right configuration is needed, which is discussed in Section III-B.

B. REAL-TIME SPATIAL QUERY PROCESSING LAYER
This layer provides support for a number of basic spatial operators required by most of the spatial data processing and analysis applications. The supported operators include spatial range, kNN, and join over spatial data streams. The queries are continuous in nature, i.e., they generate continuous results on continuous spatial stream(s). Users can use Java or Scala to write the spatial queries or custom applications. This layer makes extensive use of the grid index for efficient query execution.

IX. EXPERIMENTAL EVALUATION
This section presents detailed experimental evaluation of GeoFlink. In the following, we discuss data streams and experimentation environment in Section IX-A and evaluation results in Section IX-B.

A. DATA STREAMS AND ENVIRONMENT
For GeoFlink evaluation, three real and four synthetic datasets are used. Microsoft TDrive (TDrive) [37], ATC shopping mall (ATC) [38], and NYC Buildings (NYCBuildings) [39] are real, whereas Gaussian, Random, Hotspot, and Multi-hotspot are synthetic datasets. Table 6 shows a quick summary of the datasets and Figure 14 shows the datasets' distribution.
TDrive is a trajectory dataset consisting of 10,357 taxi trajectories in the Beijing city of China. The dataset was collected during February 2 to 8, 2008. It contains 17 million data points, with the trajectories covering a distance of around 9 million kilometres. Each data point consists of a taxi ID, timestamp, longitude and latitude [37].
ATC is a pedestrian tracking dataset collected within a shopping center in the Osaka city of Japan. The dataset was collected between October 24, 2012 and November 29, 2013. The data collection was done every week on Wednesday and Sunday, from morning until evening [38]. Table 6 shows a summary of the two datasets.
The NYCBuildings dataset is obtained from NYC Open Data [39] and contains New York City's building footprints. The building footprints represent the full perimeter outline of each building as viewed from directly above. Besides the perimeter, other useful attributes of this dataset include the ground elevation at the building base, the roof height above ground elevation, the construction year, and the feature type. The dataset consists of information on more than 1 million NYC buildings, including residential, commercial and government buildings. This dataset is used as a source of polygon and linestring data streams. Originally, the dataset consists of polygon objects; the last point of each outer polygon is removed to convert it into a linestring.
The Gaussian, Random, Hotspot and Multi-hotspot datasets consist of streams of m distinct objects, where m is a user-defined parameter. Unless specified, m is set to 100 in the experiments. As for the Gaussian data stream, given a user-defined number of points (vertices), the first m distinct objects are generated randomly at random locations. The next m objects are also generated randomly; however, their locations are Gaussian distributed, where the mean is the last position or centroid of the corresponding object and the variance is a user-defined parameter, which is set to 0.5 for the dataset generation. The Random data stream follows a random distribution. The synthetic Gaussian dataset generator can also be used to generate spatial trajectories and is available as an open source project [40].
For the experiments, a 4-node Apache Flink cluster with GeoFlink (1 Job Manager and 3 Task Managers with a total of 30 task slots) and a 3-node Apache Kafka cluster (1 Zookeeper and 2 Broker nodes) are used. The clusters are deployed on the AIST AAIC cloud [41], where each VM has 128 GB memory and 20 CPU cores (Intel Skylake 1800 MHz processor). All the VMs run Ubuntu 16.04. The datasets are loaded into Apache Kafka [25] and are supplied as a distributed stream to the GeoFlink cluster.

B. EVALUATION
This section presents detailed evaluation results. In particular, we evaluated GeoFlink's throughput and latency for the three spatial queries discussed in this manuscript. Unless otherwise specified, the following default parameter values are used in the experiments: r = 0.005 (approx. 500 meters), W_n = 100 seconds, W_s = 50 seconds, g = 200, k = 50, ω = 1 second, query stream arrival rate = 10 tuples per second, and number of query objects = 50 (range query) and 1 (kNN query).
• Throughput: The number of stream tuples processed by the system per second. It is obtained by dividing the total number of input stream tuples processed by the system by the total processing time.
• Latency: The amount of time a stream tuple takes to be processed by the system after ingestion. It is calculated by assigning each stream object the system time on its arrival and subtracting this from the system time at its output generation.
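As a minimal sketch of the arithmetic behind the two metrics (illustrative only, not GeoFlink's actual instrumentation code):

```java
public class MetricsSketch {
    // Throughput: total tuples processed divided by total processing time.
    static double throughput(long tupleCount, double totalSeconds) {
        return tupleCount / totalSeconds;
    }

    // Latency of one tuple: output-generation time minus arrival time.
    static long latencyMillis(long arrivalMillis, long outputMillis) {
        return outputMillis - arrivalMillis;
    }

    public static void main(String[] args) {
        // 1 million tuples processed in 50 seconds
        System.out.println(throughput(1_000_000, 50.0)); // 20000.0 tuples/s
        // A tuple arriving at t=1000 ms and emitted at t=1042 ms
        System.out.println(latencyMillis(1000, 1042));   // 42 ms
    }
}
```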

1) THROUGHPUT COMPARISON: GeoFlink, APACHE FLINK AND APACHE SPARK STREAMING
In this section, we compare GeoFlink's throughput with Apache Flink and Apache Spark Streaming. In particular, we compare the throughput of the real-time range (Point stream, a set of Polygon query objects), kNN (Point stream, a Polygon query object), and join (ordinary Point stream, query Point stream) queries. In contrast to Apache Flink and Apache Spark Streaming, GeoFlink utilizes a grid index; however, to keep the comparison fair, efforts are made to distribute the query processing uniformly across the cluster nodes. Apache Spark is primarily developed for batch processing; thus, its internal architecture is different from that of Apache Flink. To process data streams, tuples are bundled into micro-batches, in contrast to Flink, which processes each stream tuple separately. Therefore, we need to provide an additional batch size parameter for Spark Streaming. The appropriate batch size depends on the application. Larger batch sizes can improve throughput but increase latency, which is not desirable in many real-time streaming applications. Since the target of this section is to compare throughput, we performed preliminary experiments on Apache Spark Streaming with different batch sizes to identify the ideal batch size. For the preliminary experiments, the TDrive dataset is used. Figure 15 compares the batch sizes, throughput and execution cost per batch for the real-time range, kNN and join queries. As can be observed, larger batch sizes can improve the query throughput but result in a higher per-batch processing cost. A higher batch processing cost causes the streaming tuples in the next batch to suffer from latency or processing delays. Furthermore, from the experiments we identify that for a very large batch size, i.e., 1 million tuples in the case of the join query (Figure 15(c)), the throughput actually decreases instead of increasing. Furthermore, the execution cost per batch also increases exponentially.
This is because the join query involves a Cartesian product of the two stream batches, which is an expensive operation. The larger the batch size, the larger the Cartesian product and the higher the per-batch processing cost. Keeping in view the preliminary experiment results on Apache Spark Streaming, we identified 100,000 tuples as an ideal batch size for our experiments. Thus, a batch size of 100,000 tuples (Spark Streaming only) is used for the throughput comparison experiments.
Next, we present the throughput comparison for the real-time range, kNN and join queries in Figures 17, 18, and 19, respectively. Figures 17(a), 17(b) and 17(c) show the throughput comparison of the real-time range query by varying the number of query Polygons for the TDrive, ATC and Synthetic datasets, respectively. From the figures, it is evident that GeoFlink performs better than Apache Flink and Apache Spark Streaming, resulting in higher throughput. This is due to the strong pruning capability of GeoFlink. Another thing to note in the figures is that the throughput decreases slightly with the increase in the number of query Polygons, which is expected, as with the increase in the number of query Polygons the required number of distance evaluations increases. Figures 18(a), 18(b) and 18(c) show the same comparison for the kNN query by varying the query distance. Here again GeoFlink maximizes throughput, showing the advantage of GeoFlink's pruning. Compared to the range query in Figure 17, Apache Spark Streaming results in lower throughput than Apache Flink. We observed that as the query gets more complex, the throughput of Apache Spark Streaming decreases. Furthermore, the throughput of all the approaches decreases with the increase in the query distance, as the query needs to check more of the query object's neighbors for the kNN computation. In Figures 19(a), 19(b) and 19(c), we compare the throughputs of the real-time join query by varying the query stream arrival rate for the TDrive, ATC and Synthetic datasets, respectively. From the figures, the throughputs of Apache Flink and Apache Spark Streaming are significantly lower than GeoFlink's because join is an expensive operation and, in the absence of indexes, a Cartesian product is computed between the two streams, resulting in quadratic distance computations and reduced throughput.
On the other hand, GeoFlink's grid index helps in pruning out the streaming tuples which cannot be part of the join output, as can be observed from Figure 16, resulting in higher throughput for GeoFlink. Another point to note in the figures is that the throughput decreases with the increase in the query stream arrival rate. This is because the higher the query stream rate, the larger the Cartesian product, resulting in reduced throughput. However, the decrease is not significant in the case of GeoFlink, showing the power of GeoFlink's indexing.

2) SKEWED DATA AND GRID INDEX
In this section, we evaluate the performance of the grid index on uniform and skewed datasets. For this purpose, three Point datasets are used in this subsection (Sec. IX-B2): Random (Fig. 14(e)), Hotspot (Fig. 14(f)), and Multi-hotspot (Fig. 14(g)).
Figures 20 and 21 present the effect of the grid size (g) on Random and skewed (Hotspot and Multi-hotspot) data distributions for the real-time range, kNN and join queries.
We observe low throughput for smaller g, i.e., 5 and 10, for all the queries. The throughput tends to normalize for grid sizes 50 and above. There are two reasons for the lower throughput: 1) a smaller g leads to ineffective grid-based pruning, and 2) a smaller g results in imbalanced data distribution, especially for skewed datasets. Figure 22 shows the task managers' (TMs) data load for the real-time range query. The data load is computed at the last operator of the range query, after grid-based pruning and shuffling (hashing), at the refine phase of Algorithm 2. It is computed after the processing of all the stream tuples of the respective datasets. One can observe an imbalanced TM load for smaller g, i.e., 5, 10, 25 and 50, for the skewed datasets. The TM load becomes uniform for higher (finer) g because a finer grid leads to a smaller grid cell size and a large number of keys, and thus near-uniform hashing even for skewed datasets. It can also be noticed in Figure 22 that the TM load decreases gradually from grid sizes 5 to 50 and then becomes steady. This is because a smaller g leads to ineffective grid-based pruning, resulting in a higher data load at the refine phase of Algorithm 2. From this we can conclude that the grid index performs well for uniform and skewed data distributions if g is chosen carefully.
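The load-balance effect of g can be illustrated by hashing cell keys to task slots (a simplified sketch assuming hash partitioning, as Flink's keyBy does; the 30 task slots match the cluster described in Section IX-A): with only a few keys, some slots inevitably receive no load, while g^2 keys spread the load nearly uniformly.

```java
import java.util.*;

public class LoadBalanceSketch {
    // Distribute the g*g cell keys of a grid over `workers` partitions by
    // hashing, and return how many keys each partition receives.
    static int[] keysPerWorker(int g, int workers) {
        int[] load = new int[workers];
        for (int x = 0; x < g; x++)
            for (int y = 0; y < g; y++) {
                String key = x + "," + y;
                load[Math.floorMod(key.hashCode(), workers)]++;
            }
        return load;
    }

    public static void main(String[] args) {
        // g = 5 gives only 25 keys for 30 task slots: some slots stay idle
        System.out.println(Arrays.toString(keysPerWorker(5, 30)));
        // g = 200 gives 40,000 keys: every slot gets roughly 40000/30 keys
        System.out.println(Arrays.toString(keysPerWorker(200, 30)));
    }
}
```

Per-key load also matters: with a skewed dataset, a coarse grid concentrates many tuples in a few hot cells, so even a balanced key count can hide an imbalanced tuple count.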

3) GeoFlink THROUGHPUT
This subsection (Section IX-B3) presents a detailed throughput evaluation of GeoFlink queries. For these experiments, the TDrive and NYCBuildings datasets are used for Point and LineString/Polygon objects, respectively. Figure 23 shows the throughput evaluation of the range query by varying the grid size (g) parameter from 100 to 500. In Figure 23(a), the query is evaluated for a Point stream with Point, Polygon and LineString query objects. The throughput decreases slightly with the increase in grid size. This is because as g increases, the number of grid cells increases quadratically. This quadratic growth in the number of cells results in increased processing (hashing) cost and reduced throughput. Notably, the window-based range query throughput is slightly lower than that of its real-time counterpart. This is because the window-based query buffers stream tuples for a fixed time duration, processes them as a batch and generates output corresponding to the whole window contents. In contrast, the real-time range query generates output as soon as it gets an input tuple. The slightly lower throughput for the window-based query is due to the extra processing cost required in maintaining and processing the window operator. Figures 23(b) and 23(c) show the range query evaluation for the Polygon and LineString streams, respectively. The trend in the figures is the same as that of Figure 23(a). However, the throughput for the Polygon and LineString streams is lower than that of the Point stream. This is expected, as Polygon and LineString are more complex spatial objects than Point and their distance computation is much more expensive, resulting in lower throughput.
Next, Figure 24 evaluates the kNN query by varying the grid size (g) parameter. In Figure 24(a), the query is evaluated for a Point stream with Point, Polygon and LineString query objects. The throughput decreases slightly with the increase in grid size due to the reason discussed above. Figures 24(b) and 24(c) show the kNN query evaluation for the Polygon and LineString streams, respectively. The trend in the figures is the same as that of Figure 24(a).
In Figure 25, the join query is evaluated by varying the grid size (g) parameter. In Figures 25(a), 25(b) and 25(c), the query is evaluated for the Point, Polygon and LineString streams, respectively. The evaluations follow the same trend as that of the range and kNN queries; however, the throughput of the join query is much lower. This is due to the complex nature of the join query. Figure 26 shows the throughput evaluation of the range query by varying the query radius (r) parameter from 50 (0.0005) meters to 50,000 (0.5) meters. In Figure 26(a), the query is evaluated for a Point stream with Point, Polygon and LineString query objects. Based on the figure, the throughput decreases with the increase in r. This is because as r increases, pruning decreases and the output size increases, resulting in increased computational complexity and decreased throughput. Similar trends can be observed in Figures 26(b) and 26(c).
In Figure 27, the kNN query is evaluated by varying the query radius parameter (r). In Figures 27(a), 27(b), and 27(c), the query is evaluated for the Point, Polygon and LineString streams, respectively. Again, the throughput decreases with the increase in the query radius for the reason discussed above. From the figures, it is clear that the throughput is highest for the Point stream, followed by the LineString and Polygon streams. This is due to the complexity of the geometrical shapes: as the complexity of a shape increases, the combined cost of its distance computations increases, resulting in lower throughput. Point, being the simplest geometrical shape, yields the highest throughput.
For the join query, we varied the query radius parameter (r) from 50 (0.0005) to 5,000 (0.05) meters, as shown in Figures 28(a), 28(b) and 28(c). The query follows the same trend as the range and kNN queries, i.e., the throughput decreases with the increase in the query radius for the reason discussed earlier. In Figure 29, we perform experiments by varying the window size parameter W_n from 50 to 250 seconds. Figures 29(a), 29(b), and 29(c) show the range query evaluation for the Point, Polygon and LineString streams, respectively. The figures do not show any significant change in throughput with the increase in W_n, because the range query is computationally light and its processing is not much affected by the window/batch size. The only difference observed in Figures 29(a), 29(b), and 29(c) is that the throughput of the range query on the Point stream is higher than on the Polygon and LineString streams. Similar trends hold for the kNN query in Figures 30(a), 30(b), and 30(c). Figures 31(a), 31(b), and 31(c) show the join query evaluation for the Point, Polygon and LineString streams, respectively, for the parameter W_n. Here, the throughput decreases slightly with the increase in W_n: as W_n increases, the number of tuples to be joined with the query stream increases, resulting in a sharp increase in the number of tuples produced by the Cartesian product between the ordinary and query streams, and hence a slight decrease in throughput.
Next, experiments are performed by varying the window slide step parameter W_s from 25 to 125 seconds: as shown in Figure 32, the larger W_s is, the smaller the overlap between consecutive windows and the lower the processing overhead. Figures 32(a), 32(b), and 32(c) show the range query evaluation for the Point, Polygon and LineString streams, respectively. As the figures show, the throughput does not change significantly with the increase in W_s, since the range query is computationally light. Similar trends can be observed for the kNN query in Figures 33(a), 33(b) and 33(c). Figures 34(a), 34(b) and 34(c) show the join query evaluation for the Point, Polygon and LineString streams, respectively, for the parameter W_s. Here, in contrast to W_n, the throughput increases slightly with the increase in W_s, because the computation overlap decreases. More precisely, if W_s < W_n, consecutive windows overlap, i.e., tuples arriving within a W_n − W_s span appear in consecutive windows, whereas at W_s = W_n the windows are non-overlapping, also called tumbling windows. Thus, the larger W_s is, the smaller the overlap and the higher the throughput.
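The overlap behavior described above can be made concrete with a small sketch of sliding-window assignment. This mirrors the usual sliding-window semantics (as in Flink's window assigners) but is our own simplified model, not GeoFlink code:

```python
# Sketch (assumed semantics): an event at time t belongs to every window
# [start, start + wn) whose start is a multiple of ws and contains t.
# With ws < wn, consecutive windows overlap by wn - ws; with ws == wn
# the window "tumbles" and each event belongs to exactly one window.

def assigned_windows(t, wn, ws):
    """Return start times of all sliding windows containing event time t."""
    last_start = (t // ws) * ws          # latest window starting at or before t
    starts = []
    start = last_start
    while start > t - wn:
        starts.append(start)
        start -= ws
    return sorted(starts)
```

For example, with W_n = 250 and W_s = 125 every tuple is processed in two windows, doubling the join work per tuple, whereas with W_s = W_n each tuple is processed exactly once. This is why the join throughput rises as W_s grows toward W_n.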
In Figure 35, we performed the throughput evaluation by varying the parameter k from 25 to 125 for the kNN query. In Figures 35(a), 35(b) and 35(c), we varied k for the Point, Polygon and LineString streams, respectively. The figures show no change in throughput across the different values of k, because the kNN query is bounded by the query radius r and remains largely unaffected by varying k as long as r is constant. The throughput trend is the same in all the figures, except that the throughput of the Polygon and LineString streams is lower than that of the Point stream for the reasons discussed above.
In the final throughput evaluation, Figures 36(a), 36(b) and 36(c) show the effect of varying the query stream arrival rate from 1 to 10 for the join query. No significant change is observed in the throughput for the different query stream arrival rates, owing to the strong pruning and data distribution capability of GeoFlink. In contrast, Figure 19 shows that the throughput of Apache Flink and Apache Spark Streaming deteriorates significantly with the increase in query stream arrival rate compared to GeoFlink.

4) GeoFlink LATENCY
This subsection (Section IX-B4) evaluates the latency of the real-time and window-based queries. In particular, we evaluated the real-time and window-based range, kNN and join queries in Figures 37, 38 and 39, respectively. The latency is measured in seconds (sec). For the latency evaluation, W_n = 10 seconds, W_s = 5 seconds and ω = 1 second are used. Figures 37(a) and 37(b) show the latency graphs of the real-time and window-based range queries. The latency of the real-time range query is low and does not change much, as GeoFlink processes each incoming tuple as it arrives. In contrast, the latency of the window-based range query varies between 0 and 10 seconds because W_n = 10 seconds: a tuple that arrives at the very start of the window has a latency of W_n, i.e., 10 seconds, because it has to wait for the whole window duration to be processed, and the latency decreases for tuples arriving later in the window, as can be observed in Figure 37(b).
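The sawtooth-shaped latency of window-based queries follows directly from when results are emitted. The following is a simplified tumbling-window model of our own (not GeoFlink code; the sliding case with W_s < W_n behaves analogously per window): results fire when the window closes, so a tuple's wait depends on where in the window it arrives.

```python
# Minimal latency model for a window of length wn: output is produced when
# the window closes, so a tuple arriving at time t waits until the end of
# its window.

def window_latency(t, wn):
    """Seconds a tuple arriving at time t waits before its window fires."""
    window_end = (t // wn) * wn + wn
    return window_end - t

# A tuple at the very start of a 10 s window waits the full 10 s, while a
# tuple arriving just before the window closes waits almost nothing,
# yielding latencies that sweep from wn down to ~0 within each window.
```

This matches the observed range of 0 to 10 seconds for the window-based queries with W_n = 10 seconds.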
In Figures 38(a) and 38(b), the latency of the real-time and window-based kNN queries is evaluated. The latency of the real-time kNN query follows the trend of the real-time range query; however, the latency is much higher given the kNN query's complexity, i.e., it takes considerably longer to process a tuple with the kNN query than with the range query. Furthermore, the real-time kNN query utilizes the recency parameter ω for the kNN computation, adding ω to the latency. The latency of the window-based kNN query varies between 0 and 11 seconds, as can be observed in Figure 38(b), for the reason discussed above. Figures 39(a) and 39(b) show the latency graphs of the real-time and window-based join queries. Given ω = 1 second, the latency of the real-time join query varies mainly between 0 and 1 second. The latency of the window-based join query follows the trend of the window-based range and kNN queries, i.e., it varies between 0 and 10 seconds due to W_n = 10 seconds.

X. CONCLUSION AND FUTURE WORK
This work presents GeoFlink, which extends Apache Flink to support spatial data types, indexes and continuous queries. A grid-based index is introduced to enable efficient processing of continuous spatial queries and effective data distribution among the Flink cluster nodes. The grid index enables the pruning of spatial objects that cannot be part of a spatial query result, guaranteeing efficient query processing, and helps achieve uniform data distribution across the distributed cluster nodes. GeoFlink supports spatial range, spatial kNN and spatial join queries on Point, LineString, Polygon, MultiPoint, MultiLineString and MultiPolygon spatial objects. Additionally, GeoFlink supports incoming data streams in GeoJSON, CSV and WKT formats. An extensive experimental study shows that GeoFlink achieves higher throughput than Apache Flink and Apache Spark Streaming. Moreover, our experiments show that the grid index performs well for both uniform and skewed data distributions if the grid size is chosen carefully. As future work, we are extending GeoFlink to support complex spatial query operators involving joins between streaming and static data. Furthermore, we are investigating other efficient spatial index structures for spatial stream processing. Fuzzy and adaptive complex event processing and nested pattern query processing over data streams are also interesting directions for extending this work.