cBiK: A Space-Efficient Data Structure for Spatial Keyword Queries

A vast amount of geo-referenced data is being generated by mobile devices and other sensors, increasing the importance of spatio-textual analyses on such data. Due to the large volume of data, the use of indexes to speed up the queries that facilitate such analyses is imperative. Many disk-resident indexes have been proposed for different types of spatial keyword queries, but their efficiency is harmed by their high I/O costs. In this work, we propose $c\text{B}i\text{K}$, the first spatio-textual index that uses compact data structures to reduce the size of the structure, hence facilitating its usage in main memory. Our experimental evaluation shows that this approach needs half the space and is more than one order of magnitude faster than a disk-resident state-of-the-art index. We also show that our approach is competitive even in a scenario where the disk-resident data structure is warmed up to fit in main memory.


I. INTRODUCTION
The boom of mobile devices has made it possible to live experiences and to access services that were inconceivable a few years ago. Location-based applications are a good example. Although they are ubiquitous nowadays, a decade ago it was hard to imagine that a mobile phone would allow us to look for the nearest "tapas" restaurant or to obtain a list of pet-friendly hotels located less than 3 km away. These applications have emerged due to the development of positioning technologies such as GPS (Global Positioning System), but also due to the consolidation of geo-tagged datasets, which provide textual descriptions (usually as lists of keywords) for geo-positioned objects. SPOI¹ (Smart POI dataset) is a good example of a geo-tagged dataset. It contains more than 27 million points of interest around the world, including cafes, pubs, restaurants, or hotels, which are described using keywords about their specialties or the amenities they provide. Picture datasets or collections of microposts from social networks can also be considered geo-tagged datasets. The former provide large collections of geo-positioned pictures that are described using particular keywords (e.g. an image of the "Grand Canyon" has its GPS coordinates, and also a set of keywords like "Natural park", "USA", or "Ancestral Puebloans", among others). On the other hand, collections from social networks expose large amounts of microposts that contain the location from which they were published, and lists of hashtags, as keywords.
¹ https://sdi4apps.eu/spoi/
The associate editor coordinating the review of this manuscript and approving it for publication was Waleed Alsabhan.
In other domains, such as the Web, where just a portion of the content is geo-tagged, there are some attempts to automatically detect locations from web resources [1]. Such results would enable the development of general location-aware search engines, in which data structures such as the one proposed in our work, are essential for an efficient information retrieval as an extension of ubiquitous inverted indexes.
Geo-tagged datasets are increasingly larger, hence managing and querying them is becoming more challenging. Many and varied approaches have been described in the state of the art for such purposes, but all of them share a main feature: they are designed over disk-resident data structures (indexes), which involve an important overhead in query resolution time due to I/O operations. More recently, compact data structures [2] have emerged as an efficient approach for managing large datasets in main memory. These structures store data in a compact manner, and are able to query them with no prior decompression. This approach avoids costly I/O operations, at the price of performing more computations to access the data. In other words, compact data structures are usually slower than traditional approaches when running at the same level of the memory hierarchy, but they are more likely to fit in higher levels, drastically improving query-time performance. Although compact data structures have been successfully used for different applications, including spatial [3] or keyword-based search [4], they have not been evaluated yet for queries that involve spatial and keyword-based predicates, to the best of our knowledge.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we propose a novel compact data structure (called cBiK) that stores geo-tagged data in compact space and resolves three different spatial keyword queries: Boolean Top-k Spatial Keyword Query (BkSKQ), Ranked Top-k Spatial Keyword Query (RkSKQ) and Boolean Range Searching Spatial Keyword Query (bRS-SKQ). cBiK indexes spatial objects using an implicit KD-tree, with no pointers, and encodes their descriptive keywords using compressed bitmaps. Our experiments report that cBiK uses only 35-40% of the space required by a state-of-the-art index, while answering the corresponding queries up to 2 orders of magnitude faster for a selected testbed, which includes different real-world datasets. When both indexes reside in main memory, query times are comparable, which is not a surprising result when using compact data structures, as explained above. Furthermore, it is worth noting that cBiK is a static index; i.e. it must be rebuilt from scratch to add new data or to update the descriptions of spatio-textual objects. This is common when using compact data structures due to the cost of updating bitmaps [2].
A preliminary partial version of this research was presented in [5]. In this article, we describe an improved version of our data structure, which is also able to answer RkSKQ and bRS-SKQ queries. In addition, we provide a more comprehensive experimental evaluation, including some new experiments to thoroughly evaluate the performance of our data structure and algorithms.
The rest of the paper is organized as follows. First, Section II formally states the problem of spatio-textual indexing and its variants. Then, Section III describes common approaches for dealing with spatial and textual data independently, and provides some basic concepts about compact data structures. This section can be skipped by readers with previous knowledge of such topics. Section IV summarizes state-of-the-art approaches to combine both dimensions in spatio-textual indexes. Then, Section V describes our data structure, and Section VI explains the three algorithms proposed to implement the corresponding spatial keyword queries. A comprehensive experimental evaluation is presented in Section VII, where cBiK is compared to a reference approach. Finally, Section VIII presents our conclusions and outlines future lines of research.

II. PROBLEM DEFINITION
Before delving into the background of the paper, two basic definitions need to be established:
Definition 1 (Spatio-Textual Object): A spatio-textual object o is a tuple $\langle l, t \rangle$, where o.l provides the corresponding spatial position, in a two-dimensional plane, and o.t is the list of keywords that describe the object.
Definition 2 (Geo-Tagged Dataset): A geo-tagged dataset D is a collection of n spatio-textual objects, which are tagged using a set T of m different keywords. Fig. 1 illustrates a simple geo-tagged dataset that contains n = 7 spatio-textual objects, each one describing a hotel and the set of different services it offers. Note that m = 10 different services are offered, and the corresponding set of keywords is T = {Air-conditioning, Bar, Buffet, Laundry, Parking, Pets, Pool, Room-service, Smoke-free, Wi-Fi}.

A. SPATIAL KEYWORD QUERIES
A Spatial Keyword Query (SKQ) takes a user location and user-supplied keywords as arguments and returns objects that are spatially and textually relevant to these arguments [6]. Although different SKQs have been proposed in the state of the art, we focus on the three types studied in [7], which are explained below.

1) BOOLEAN TOP-k SPATIAL KEYWORD QUERY (BkSKQ)
BkSKQ retrieves the k objects closest to a given location, which also satisfy the requested keywords. More formally, BkSKQ is defined as a query $q = \langle l, t, k \rangle$, where q.l is the location of the query point (latitude and longitude), q.t is the list of requested keywords, and 1 ≤ q.k ≤ n is the maximum number of objects to be retrieved.
BkSKQ returns a result set that contains, at most, q.k objects: $\{o_1, o_2, \ldots, o_{q.k}\}$, which are the q.k objects closest to q.l, ordered by Euclidean distance (ascending order), that also satisfy $q.t \subseteq o_i.t$; that is, each returned object contains all the requested keywords.
An example of a BkSKQ query is looking for the 2 closest hotels to our location² that provide Wi-Fi and Parking services: $q = \langle (x_i, y_i), \{\text{Wi-Fi}, \text{Parking}\}, 2 \rangle$. As can be seen in Fig. 1, $p_3$ and $p_1$ are the closest hotels to our location that provide the requested services, so they are returned in that order.
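The semantics of BkSKQ can be sketched with a brute-force reference implementation that scans the whole dataset. This is only an illustration of the query definition, not the indexed algorithm of this paper; the function name `bkskq_bruteforce`, the dictionary layout, and the hotel coordinates are hypothetical and are not taken from Fig. 1.

```python
import math

def bkskq_bruteforce(dataset, q_loc, q_kw, k):
    """Return the IDs of the k nearest objects that contain every keyword in q_kw."""
    matches = [
        (math.dist(loc, q_loc), oid)
        for oid, (loc, kws) in dataset.items()
        if q_kw <= kws  # Boolean predicate: all requested keywords are present
    ]
    matches.sort()  # ascending Euclidean distance
    return [oid for _, oid in matches[:k]]

# Hypothetical dataset: object ID -> (location, set of keywords).
hotels = {
    "p1": ((6.0, 5.0), {"Wi-Fi", "Parking", "Pool"}),
    "p2": ((1.0, 1.0), {"Parking", "Bar"}),
    "p3": ((4.0, 3.0), {"Wi-Fi", "Parking"}),
    "p4": ((9.0, 9.0), {"Wi-Fi", "Parking"}),
    "p5": ((3.0, 4.5), {"Wi-Fi"}),
}

closest = bkskq_bruteforce(hotels, (3.0, 3.0), {"Wi-Fi", "Parking"}, 2)
```

With this toy data, `p5` is nearer to the query point than `p1`, but it is discarded because it lacks Parking, which mirrors the Boolean nature of the query.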

2) RANKED TOP-k SPATIAL KEYWORD QUERY (RkSKQ)
RkSKQ uses a scoring function to retrieve the best-rated objects according to their proximity to the desired location and their textual relevance to the requested keywords. Specifically, the function to obtain the score of a spatio-textual object o for a query q is defined as:
$$\tau(o, q) = \alpha \cdot \delta(o.l, q.l) + (1 - \alpha) \cdot \theta(o.t, q.t) \quad (1)$$
Note that δ(o.l, q.l) corresponds to the spatial proximity between o and q, and θ(o.t, q.t) measures the textual relevance between o and q. Spatial proximity and textual relevance are weighted by a preference parameter α ∈ [0, 1]. Thus, if α = 1, τ(o, q) only considers spatial proximity, while if α = 0, only textual relevance is used to rank the best candidates.
More formally, RkSKQ is defined as a query $q = \langle l, t, k, \alpha \rangle$, where q.l, q.t, and q.k have the same meaning as in BkSKQ, and α ∈ [0, 1] weights the importance of spatial proximity and textual relevance in the final result. Thus, RkSKQ returns a ranked list that includes the best-rated spatial points (in descending order of score) for the given query.

a: SPATIAL PROXIMITY
The spatial proximity δ(o.l, q.l) is calculated using the normalized Euclidean distance, as defined in (2), where dist(o.l, q.l) is the Euclidean distance between o and q, and $dist_{max}$ is the maximum Euclidean distance between two objects in D:
$$\delta(o.l, q.l) = 1 - \frac{dist(o.l, q.l)}{dist_{max}} \quad (2)$$

b: TEXTUAL RELEVANCE
Different information retrieval models, like Language Model [8], Cosine Similarity [9] or BM25 [10], have been used to obtain the textual relevance of a point to a given query. In our case, we use a simple model that assigns the same relevance to all the keywords requested in a query, so the textual relevance of a point (to a given query) is proportional to the number of requested keywords that it contains:
$$\theta(o.t, q.t) = \frac{|o.t \cap q.t|}{|q.t|} \quad (3)$$
An example RkSKQ query in which proximity is considered more relevant than the services provided by the hotel (e.g. proximity is weighted as 0.75) is looking for the 2 best hotels that provide Wi-Fi and Parking services: $q = \langle (x_i, y_i), \{\text{Wi-Fi}, \text{Parking}\}, 2, 0.75 \rangle$. In this case, $p_3$ is returned because it provides both services and it is close to our location. However, $p_5$ is returned (instead of $p_1$) although it does not provide Parking, because it is closer to our location, and proximity is more relevant for the given query.
² We assume that "our location" is defined by coordinates $(x_i, y_i)$.
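The ranking described above can be illustrated with a small brute-force sketch. It assumes that the spatial proximity is one minus the distance normalized by the maximum distance, and that the textual relevance is the fraction of requested keywords that the object contains; both forms are assumptions consistent with the prose, and the names `score` and `rkskq_bruteforce`, the precomputed `dist_max` argument, and the sample data are hypothetical.

```python
import math

def score(o_loc, o_kw, q_loc, q_kw, alpha, dist_max):
    """Weighted combination of spatial proximity and textual relevance."""
    delta = 1.0 - math.dist(o_loc, q_loc) / dist_max  # assumed form of the proximity
    theta = len(o_kw & q_kw) / len(q_kw)              # assumed form of the relevance
    return alpha * delta + (1.0 - alpha) * theta

def rkskq_bruteforce(dataset, q_loc, q_kw, k, alpha, dist_max):
    """Return the IDs of the k best-scored objects, in descending score order."""
    ranked = sorted(
        dataset.items(),
        key=lambda item: score(item[1][0], item[1][1], q_loc, q_kw, alpha, dist_max),
        reverse=True,
    )
    return [oid for oid, _ in ranked[:k]]

# Hypothetical dataset: object ID -> (location, set of keywords).
hotels = {
    "a": ((3.0, 4.0), {"Wi-Fi", "Parking"}),
    "b": ((0.0, 2.0), {"Wi-Fi"}),
    "c": ((6.0, 8.0), {"Wi-Fi", "Parking"}),
}
query_kw = {"Wi-Fi", "Parking"}
top2 = rkskq_bruteforce(hotels, (0.0, 0.0), query_kw, 2, 0.75, 10.0)
```

In this toy example, object `b` lacks Parking but is ranked first because proximity carries a weight of 0.75, which is the same effect described for the hotel that is returned despite not offering Parking.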

3) BOOLEAN RANGE SEARCHING SPATIAL KEYWORD QUERY (bRS-SKQ)
bRS-SKQ retrieves all objects located within a determined query region that also contain all the requested keywords. More formally, $q = \langle r, t \rangle$, where q.r indicates a (rectangular shaped) query region and q.t, as in the previous queries, provides the list of requested keywords.
The result of bRS-SKQ is a collection of objects: $\{o \in D \mid o.l \in q.r \wedge q.t \subseteq o.t\}$.
An example of bRS-SKQ is looking for hotels that provide Wi-Fi in our region³, which is represented by the red-line square in Fig. 1: $q = \langle (x_{bl}, y_{bl}; x_{tr}, y_{tr}), \{\text{Wi-Fi}\} \rangle$. In this case, only $p_3$ and $p_5$ are returned, because the other candidates with Wi-Fi are outside the given region.
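As with the previous queries, the definition of bRS-SKQ can be sketched as a linear scan. The function name `brs_skq_bruteforce`, the choice of treating the rectangle borders as inclusive, and the sample coordinates are assumptions made only for illustration.

```python
def brs_skq_bruteforce(dataset, region, q_kw):
    """Return the IDs of all objects inside the rectangle that contain every keyword."""
    x_bl, y_bl, x_tr, y_tr = region  # bottom-left and top-right corners
    return sorted(
        oid
        for oid, ((x, y), kws) in dataset.items()
        if x_bl <= x <= x_tr and y_bl <= y <= y_tr and q_kw <= kws
    )

# Hypothetical dataset: object ID -> (location, set of keywords).
hotels = {
    "p1": ((6.0, 5.0), {"Wi-Fi", "Parking", "Pool"}),
    "p2": ((1.0, 1.0), {"Parking", "Bar"}),
    "p3": ((4.0, 3.0), {"Wi-Fi", "Parking"}),
    "p4": ((9.0, 9.0), {"Wi-Fi", "Parking"}),
    "p5": ((3.0, 4.5), {"Wi-Fi"}),
}

in_region = brs_skq_bruteforce(hotels, (2.0, 2.0, 5.0, 5.0), {"Wi-Fi"})
```

Unlike the top-k queries, the result size is not bounded by a parameter: every object that satisfies both predicates is reported.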

III. PREVIOUS CONCEPTS
In this section, we summarize the basic knowledge about spatial indexes, textual indexes and compact data structures that is necessary to follow our work. The reader with previous knowledge about such topics may safely skip the section.

A. SPATIAL INDEXES
A spatial index is a data structure designed to answer different types of spatial queries. Here we review three of the most used spatial indexes, which are also components of the spatio-textual indexes described in Section IV.

1) R-TREE
The R-Tree is one of the most studied spatial access methods and also one of the most widely used in practice, as it has been adopted by several Database Management Systems (DBMS) such as Oracle and Postgres. It was proposed by Guttman [11] to dynamically index multidimensional information (points, polylines and regions in space). It performs a recursive partition of the space where spatial objects are located and organizes such partitions into a balanced tree.
In general, leaf nodes of the R-tree store a simplification of the actual objects called Minimum Bounding Rectangle (MBR). Leaves also store a reference ref to the actual object.
In the particular case of points, the leaves of the tree can store the actual objects. Internal nodes, on the other hand, correspond to a disk block and contain entries of the form $\langle MBR_i, ref_i \rangle$, where $ref_i$ is a pointer to the corresponding child node, and $MBR_i$ is the smallest rectangle that covers all the MBRs associated with the child nodes. The key idea behind the R-Tree is to store spatially-close objects in the same block.
Although the R-Tree was initially proposed to solve range queries, different algorithms have been proposed to solve varied geometric problems on top of it. Examples include finding the k > 0 nearest neighbors to a given point q [12], finding the k > 1 pairs of nearest neighbors between two sets of points [13], or computing the Hausdorff distance between two sets of points [14], to name a few.

2) QUADTREE
A Quadtree [15], [16] is a multidimensional data structure used to represent and index point-type objects in a d-dimensional space. Like the R-Tree, it recursively decomposes the space by means of iso-oriented hyperplanes, representing the subspaces using a tree data structure whose root represents the entire space.
In a space of d dimensions, each internal node of the quadtree has $2^d$ children, which represent the $2^d$ subspaces of the space associated with the node. The recursive decomposition continues until the number of objects is below a threshold. A quadtree is not necessarily balanced because the denser subspaces (i.e. those containing more objects) can generate nodes much deeper than less dense subspaces [16]. This is because, unlike the R-Tree, the space partitioning is oblivious of the actual objects and depends only on the universe (i.e. the space in which the objects are represented).

3) KD-TREE
The KD-Tree [17] was proposed to index points in k-dimensions. It also decomposes the space recursively but, unlike the Quadtree, the division does not have to generate two partitions of equal size. An orthogonal hyperplane aligned with the axes of coordinates is used in the partitioning. In each level, the axes alternate; that is, if working with two dimensions, the plane is first cut by the x axis, then by the y axis, and in the next level again by the x axis.
The structure is represented as a binary tree in which each internal node represents a cut in space. To balance the tree, the points that are lower than the current cut-off point, which is the median of the subspace in the order of the corresponding dimension, are stored in the left sub-tree and the others in the right sub-tree. The leaf nodes represent the points themselves when there are no more subspaces to divide.

B. TEXT INDEXES
A text index is a data structure particularly designed to answer text-based queries. Here, we introduce text indexes that are frequently used to perform efficient keyword-based queries.

1) INVERTED INDEX
An inverted index (also referred to as inverted file) [18] is the reference index for information retrieval purposes. It is built over the set of m different words used in a data collection. An inverted index is composed of two main components: (i) a vocabulary, which encodes the set of m words, and (ii) a set of m inverted lists (or posting lists), where each one stores references to all occurrences of a particular word.
Posting lists may encode occurrences information at different levels of granularity. According to the scope of this paper, we consider that each list stores the IDs of the spatio-textual objects that are described by a given keyword.
Inverted indexes excel at word-based query resolution. They basically look for the searched word in the vocabulary, and the corresponding inverted list is retrieved. Boolean AND/OR queries, like the ones required by SKQs, are also resolved efficiently by manipulating the inverted lists of each requested keyword.
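A minimal sketch of this scheme, assuming that each posting list stores sorted object IDs as described above, is shown next. The function names and the toy collection are hypothetical; the AND query intersects posting lists with a linear merge, and the OR query computes their union.

```python
from collections import defaultdict

def build_inverted_index(objects):
    """Map each keyword to the sorted list of IDs of the objects it describes."""
    index = defaultdict(list)
    for oid in sorted(objects):
        for kw in objects[oid]:
            index[kw].append(oid)
    return dict(index)

def intersect(a, b):
    """Merge-intersect two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def query_and(index, keywords):
    """IDs of the objects described by all the keywords."""
    lists = sorted((index.get(kw, []) for kw in keywords), key=len)
    if not lists:
        return []
    result = lists[0]  # start from the shortest list to keep intermediate results small
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result

def query_or(index, keywords):
    """IDs of the objects described by at least one keyword."""
    return sorted(set().union(*(index.get(kw, []) for kw in keywords)))

# Hypothetical collection: object ID -> set of keywords.
objects = {
    1: {"Wi-Fi", "Parking"},
    2: {"Wi-Fi"},
    3: {"Parking", "Pool"},
    4: {"Wi-Fi", "Parking", "Pool"},
}
index = build_inverted_index(objects)
```

Processing the shortest list first is a common heuristic for conjunctive queries; it is a design choice of this sketch rather than something prescribed by the text.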

2) SIGNATURE FILE
A signature file is a text indexing approach [19] that divides a text collection into logical blocks of D distinct words. Each word has its signature, a bitstring of size F, with m bits activated. F and m are parameters of the file. To encode a block, the signatures of its words are superimposed (bit OR-ed) to obtain a single F-bit pattern. The whole text collection is encoded by concatenating block signatures.
The process is replicated for query resolution. That is, the signatures of the requested words are first superimposed and the resulting signature is searched.
In comparison with inverted indexes, signature files require much less storage space but, on the other hand, have a high processing cost to perform a query due to the I/O operations [20].
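The superimposed coding idea can be sketched as follows. The way in which the m bits of each word signature are chosen is not specified in the text, so this example assumes a hash-based choice (SHA-256 with a counter); the parameter values F = 64 and m = 3 and all function names are likewise hypothetical.

```python
import hashlib

F = 64  # signature length in bits (assumed value)
M = 3   # number of bits activated per word (assumed value)

def word_signature(word, f=F, m=M):
    """Build an f-bit signature with exactly m distinct bits set, chosen by hashing."""
    sig, counter, nbits = 0, 0, 0
    while nbits < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        bit = int.from_bytes(digest[:4], "big") % f
        if not (sig >> bit) & 1:  # skip bits that are already set
            sig |= 1 << bit
            nbits += 1
        counter += 1
    return sig

def block_signature(words):
    """Superimpose (bitwise OR) the signatures of all the words in a block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(block_sig, query_words):
    """True if every bit of the query signature is set in the block signature."""
    q = block_signature(query_words)
    return (block_sig & q) == q

block = block_signature(["wifi", "parking", "pool"])
```

A positive answer from `may_contain` only means that the block is a candidate: different words may activate the same bits, so false positives are possible and candidates must be verified, whereas a negative answer is always correct.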

C. COMPACT DATA STRUCTURES
Compact data structures store data using small space while allowing them to be queried with no prior decompression [2]. These structures drastically reduce the memory footprint, making it possible to manage large data collections at the top levels of the memory hierarchy and avoiding costly I/O operations. Regardless of the type of structure being implemented, most compact data structures are built on top of particular configurations of bitvectors. A bitvector is a bit array B[1, n] that provides three main operations:
• $access(B, i)$ returns the bit stored at position i of B, for any 1 ≤ i ≤ n. This operation is similar to that provided by traditional bitmap indexes.
• $rank_v(B, i)$ counts the number of times that the bit v ∈ {0, 1} occurs in B[1, i], for any 0 ≤ i ≤ n (note that $rank_v(B, 0) = 0$).
• $select_v(B, j)$ returns the position of the j-th occurrence of the bit v ∈ {0, 1} in B, for any j ≥ 0 (note that $select_v(B, 0) = 0$).
There are several implementations that provide a space-time trade-off to support such operations. To ensure constant time performance for these operations, it is necessary to enhance bitvectors with some additional structures that add extra bits on top of the bit array. In this paper, we will use the bit_vector implementation from SDSL [21], which uses $64\lceil n/64 + 1 \rceil$ bits. Some other approaches leverage bit redundancy to compress bitvectors, ensuring efficient performance. In this paper, we use an implementation⁴ of the sparse array (SD-array) [22] to exploit the sparseness of the cBiK bitvectors. This structure excels when the proportion of 1s in the bitvector is below 5%, providing (very) efficient $select_1(B, j)$ in constant time, and $rank_1(B, i)$ in O(log(n/m)), where m is the number of 1-bits.
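The behavior of these operations can be illustrated with a deliberately naive Python class. It is not the SDSL implementation and it is not compact: it stores a full array of prefix counts, answers rank in constant time and select by binary search in O(log n). The class name `BitVector` and the 1-based positions, which follow the B[1, n] convention used above, are choices of this sketch.

```python
class BitVector:
    """Plain bitvector with rank/select support (illustrative, not space-efficient)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # ones[i] = number of 1-bits in B[1, i]; ones[0] = 0
        self.ones = [0]
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)

    def access(self, i):
        """Bit stored at position i (1-based)."""
        return self.bits[i - 1]

    def rank(self, v, i):
        """Number of occurrences of bit v in B[1, i], for 0 <= i <= n."""
        return self.ones[i] if v == 1 else i - self.ones[i]

    def select(self, v, j):
        """Position of the j-th occurrence of bit v; 0 when j == 0."""
        if j == 0:
            return 0
        lo, hi = 1, len(self.bits)
        if self.rank(v, hi) < j:
            raise ValueError("fewer than j occurrences of the requested bit")
        while lo < hi:  # smallest position whose rank reaches j
            mid = (lo + hi) // 2
            if self.rank(v, mid) >= j:
                hi = mid
            else:
                lo = mid + 1
        return lo

B = BitVector([1, 0, 0, 1, 1, 0, 1])
```

Note that rank and select are inverses in the sense that `B.rank(1, B.select(1, j)) == j` for any valid j, which is the property that the compact representations also guarantee.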

IV. RELATED WORK
As shown above, different indexes have been proposed to solve spatial and textual queries independently, but solving spatio-textual queries implies the combination of both types of indexes. In this section, we review the main approaches in the literature to solve the three types of queries described in Section II-A.
Most of the proposed indexes focus on efficiently processing the corresponding queries, but do not consider the large storage requirements needed to achieve such a goal. It is worth noting that all approaches described in the following are based on data structures that reside in secondary memory and have large space requirements.
The first approaches to solve bRS-SKQ kept the spatial index apart from the textual index. For example, [23] proposes two approaches using a Quadtree and an Inverted File. As the indexes are independent, one is used to filter the candidate results, and the other one to refine the result. The main advantage of this approach is its simplicity, but the efficiency is lower when compared with hybrid approaches.
One of the first hybrid approaches was proposed in [24], which combines an R*-tree [25] and an Inverted File in different ways. The best results were obtained by the IF-R*, which builds an Inverted File in which each term is also associated with an R*-Tree containing all the corresponding points. To perform a query, the Inverted File is used to filter the keywords and then spatial range queries are performed in each candidate R*-tree, which implies additional disk costs due to the access to independent trees. Later, the KR*-tree (Keyword R*-tree) [26] reduced the I/O costs thanks to an efficient pruning of the tree traversal.
De Felipe et al. [27] described the first approach for BkSKQ queries: the IR²-Tree, which stores a signature file [19] in each node of the tree to summarize the presence/absence of keywords. Our proposal of using compact bitmaps to represent the keywords that are present in each subspace is inspired by that work. However, there are several important differences, such as the use of compact bitmaps (instead of signature files), the use of an implicit KD-tree, and the support for the three types of queries described in Section II-A.
Later, [8] proposed a new kind of top-k query that takes into account both location proximity and text relevancy (RkSKQ), together with the IR-tree, which combines an Inverted File and an R-Tree in which each node is augmented with a summary of the textual information. Similar proposals can be found in [28]-[30].
Rocha-Junior and Nørvåg [31] proposed the Spatial Inverted Index (S2I) to solve RkSKQ queries using an R-tree and an Inverted File. S2I uses different strategies to index frequent and infrequent keywords. Hence, S2I maps each keyword t to an aggregated R-tree (aR-tree) or to a block storing spatio-textual objects that contain t. The most frequent keywords are stored in the aR-tree using a tree for each keyword. The less frequent ones are stored in blocks in a file, one block per keyword (similar to the Inverted File). In the aR-Tree, each node stores an aggregate value that indicates the maximum impact (in terms of the textual relevance score) of the keyword in the objects of the tree. The aR-Tree can be viewed as an IR-tree [8], [29] for a single keyword. For queries with a single keyword, only a small tree or a block is accessed, in general. For queries with several keywords, some nodes of a small set of trees or blocks are accessed, leading to an efficient execution of queries.
The potential of S2I is demonstrated in the experimental evaluation of [7], which compares the best 12 spatio-textual indexes to date for the three types of queries described in Section II-A. Although S2I originally answers only RkSKQ, it is easily modifiable to answer the other types of queries. The experimental results show that it is consistently among the best indexes for the three types of queries. Taking S2I as a basic component, some other indexes have been proposed in [32]-[34].
If we classify spatio-textual indexes regarding the text index they use, two main categories arise. On the one hand, [8]-[10], [23], [24], [26], [28], [35], [36] use an Inverted File [18] that orders the objects by their IDs. On the other hand, bitmaps [19] are also used to index textual information contained in the sub-trees. In this case, each bit represents the presence or absence of a keyword in the object. Note that, unlike the ones used in our approach, these are plain bitmaps without support for rank/select operations. The use of sparse bitmaps with support for rank/select operations is a distinguishing idea of our work.
All the approaches described above are based on disk resident data structures and therefore are always penalized by I/O costs. A recent work in [37] evaluates solutions for main memory. However, it focuses on the streaming model. Hence, it adapts the data structures to support data arriving at high speed rates. Also, the temporal dimension plays a more important role in such a model, defining different types of queries from the ones studied in our work.

V. cBiK - COMPACT DATA STRUCTURE FOR SPATIO-TEXTUAL DATA
In this section we introduce the data structure cBiK (compact Bitmaps and implicit KD-Tree) to represent spatio-textual datasets. The main components of this data structure are: i) a balanced implicit KD-Tree (iKD-Tree) and ii) two compact bitmaps that represent the keywords. We describe both components below and the query algorithms in Section VI.

A. iKD-TREE
The iKD-Tree is based on the KD-Tree proposed in [38] for a static set of points of size n. The KD-Tree is a balanced binary tree that uses pointers to link its nodes. Unlike the KD-Tree, the iKD-Tree does not use pointers and stores the points directly in an array, which we call nodes. Like the KD-Tree, the iKD-Tree is a balanced tree and its height is bounded by O(log n).
To construct the iKD-Tree, the spatial objects are first sorted in each of the dimensions (two in this case). Then, according to the coordinate selected to perform the partition, the element in the middle of the array is the root of the tree. For example, in Fig. 2(a) if x is considered as the partitioning coordinate, the point (5, 3) is stored in the middle of the array (see Fig. 2(c)). Then we proceed recursively with each subarray by alternating the coordinate that guides the subspace partition. Fig. 2(b) shows the partitions of the space generated by the iKD-Tree in Fig. 2(a). The construction of the iKD-Tree takes time O(dn log n), with d the number of dimensions [38]. Associated with the iKD-Tree, the bitmap BM of size 2n is used. The first n bits in Fig. 3 indicate whether a node of the iKD-Tree is internal (1) or a leaf (0). The second half of BM indicates if the internal node has an explicit summary (1 if it does, and 0 otherwise) of the T keywords in each of the subspaces generated by the node (see Section V-B2 for more details).
The functional support of the bitmap BM is critical since it helps to build part of the cBiK structure and also facilitates many calculations used in the query algorithms. For example, by using BM it is possible to know, using the operation $rank_1(BM, i)$, how many internal nodes of the iKD-Tree exist up to a certain position i.
It is possible to traverse the iKD-Tree through the variables start and end, which store the position in the nodes array of the first and last point, respectively, in a subspace generated by the iKD-Tree. If start = 0 and end = n − 1, then the algorithm is considering the whole space. The position j of the root node of the subspace delimited by start and end can be obtained as $j = \lfloor (start + end)/2 \rfloor$.
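The pointer-free layout and the start/end navigation can be sketched as follows. This is an illustrative construction under two assumptions: the root of a range is placed at the integer midpoint (floor) of start and end, and each subarray is simply re-sorted at every level, which is easier to read but slower than the presorting strategy of [38]. The function names and the sample points are hypothetical.

```python
def build_ikd(points):
    """Return (nodes, is_internal): the implicit KD-tree array and the first half of BM."""
    nodes = list(points)
    is_internal = [0] * len(nodes)

    def rec(start, end, depth):
        if start > end:
            return
        mid = (start + end) // 2
        if start < end:  # more than one point: the median becomes an internal node
            axis = depth % 2  # alternate x (0) and y (1)
            nodes[start:end + 1] = sorted(nodes[start:end + 1], key=lambda p: p[axis])
            is_internal[mid] = 1
        rec(start, mid - 1, depth + 1)
        rec(mid + 1, end, depth + 1)

    rec(0, len(nodes) - 1, 0)
    return nodes, is_internal

def children(start, end):
    """Root position of the range and the (start, end) ranges of its two subspaces."""
    mid = (start + end) // 2
    return mid, (start, mid - 1), (mid + 1, end)

# Hypothetical two-dimensional points.
points = [(2, 4), (5, 3), (9, 6), (4, 7), (8, 1), (7, 2), (2, 7)]
nodes, is_internal = build_ikd(points)
```

No child pointers are stored: the two subspaces of any node are recovered from start and end alone, which is what allows the tree to live in a plain array.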

B. KEYWORDS REPRESENTATION
Our structure uses two bitmaps, one to represent the keywords of each point and the other to indicate the keywords belonging to the left and right subspaces of the internal nodes of the iKD-tree.

1) BITMAP KEYWORDS (BK)
BK is a bitmap of size n · m used to indicate the keywords that correspond to each of the nodes or points of the iKD-Tree (see the bitmaps inside the nodes in Fig. 4). If the i-th keyword is present at a point j, with 0 ≤ i < m, the i-th bit of j is set to 1, and to 0 otherwise. For 0 ≤ j < n, the keywords of node j are represented by the bits [j · m . . . (j + 1) · m − 1] of BK.
Bitmaps BK are encoded using the SD-array [22] to exploit their sparseness. Thus, each bitmap is encoded in space proportional to the number of keywords that describe a particular point, and not to the number of different keywords used in the collection. This ensures the effectiveness of cBiK even for collections that use a large set of keywords.
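The layout of BK can be sketched with a plain Python list. The sketch below is uncompressed; the helper `ones_positions` only hints at why a sparse encoding pays off (it lists the positions of the 1-bits, which is the information a sparse representation needs to keep) and is not an SD-array. All names and the toy vocabulary are hypothetical.

```python
def build_bk(node_keywords, vocabulary):
    """Concatenate one m-bit keyword bitmap per node of the iKD-Tree array."""
    m = len(vocabulary)
    kw_id = {kw: i for i, kw in enumerate(vocabulary)}
    bk = [0] * (len(node_keywords) * m)
    for j, kws in enumerate(node_keywords):
        for kw in kws:
            bk[j * m + kw_id[kw]] = 1
    return bk

def node_bits(bk, m, j):
    """Keyword bitmap of node j, i.e. bits [j*m .. (j+1)*m - 1] of BK."""
    return bk[j * m:(j + 1) * m]

def ones_positions(bk):
    """Positions of the 1-bits in BK."""
    return [p for p, b in enumerate(bk) if b]

vocabulary = ["Bar", "Parking", "Pool", "Wi-Fi"]
node_keywords = [{"Wi-Fi"}, {"Parking", "Wi-Fi"}, {"Pool"}]
bk = build_bk(node_keywords, vocabulary)
```

Here only 4 of the 12 bits are set; with a realistic vocabulary of thousands of keywords and a handful of keywords per object, the fraction of 1-bits becomes very small.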

2) BITMAP SUMMARY (BR)
This bitmap represents the keywords that are in any of the points of each subspace generated by the iKD-Tree (see the bitmaps next to each internal node in Fig. 4). Conceptually, each internal node p is augmented with a bitmap of size 2 · m called local summary (RL). The first m bits of a bitmap RL represent the keywords that are located in any of the points belonging to the subspace to the left of p (in our algorithms we refer to these conceptual bitmaps as LS), and the m remaining bits represent the keywords of the points belonging to the subspace to the right (we refer to them as RS).
We distinguish two types of internal nodes: i) nodes whose children are leaves, and ii) nodes whose children are internal nodes. Let p be an internal node with left child l and right child r, if p is a node of type i), then its RL is calculated by concatenating the bits [l · m . . . (l + 1) · m − 1] and [r ·m . . . (r +1)·m−1] of BK . On the other hand, if p is of type ii) its construction can be described as an inorder traversal of the sub-tree induced by p that recursively accumulates the positions of the keywords in such sub-tree as follows: The first m bits of RL of p are obtained by performing the bitwise or operation (∨) between the first m bits of the bitmap RL associated to l with the m bits of the bitmap RL associated to r. The remaining m bits are obtained in a similar way. In BR, only the RL's of type ii) nodes are stored (in a compact form), as shown in Fig. 5. The RL's of type i) nodes are computed online when necessary.
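The content of a local summary can be sketched directly from its conceptual definition: LS is the bitwise OR of the keyword bitmaps of all the points in the left subspace, and RS is the same for the right subspace. The sketch below recomputes both by scanning the ranges, which is not the incremental construction described above; it encodes each m-bit bitmap as a Python integer, assumes the integer midpoint as the root of a range, and uses hypothetical names and data.

```python
def summaries(node_masks, start, end):
    """Return (LS, RS) for the root of the subspace nodes[start..end]."""
    mid = (start + end) // 2
    ls = 0
    for j in range(start, mid):        # points to the left of the root
        ls |= node_masks[j]
    rs = 0
    for j in range(mid + 1, end + 1):  # points to the right of the root
        rs |= node_masks[j]
    return ls, rs

def subspace_may_match(summary, query_mask):
    """True if every requested keyword appears somewhere in the subspace."""
    return (summary & query_mask) == query_mask

# Hypothetical keyword bitmaps (m = 4) of seven nodes in iKD-Tree array order.
node_masks = [0b0001, 0b0101, 0b0010, 0b1000, 0b0100, 0b0001, 0b1000]
root_ls, root_rs = summaries(node_masks, 0, 6)
```

If a requested keyword is missing from a summary, no point in that subspace can contain all the requested keywords, so the whole branch can be skipped regardless of how close its points are.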
The procedure to retrieve the bitmap RL of an internal node p indexed by j in the array nodes is as follows. First, we need to know the type of node p. Let $i_1 = access(BM, j)$ and $i_2 = access(BM, n + j)$. If $i_1 = i_2 = 1$, then p is of type ii) and its RL is located as follows. First, we obtain the number of internal nodes of the iKD-Tree as $ni = rank_1(BM, n)$, and then we count the number of type ii) nodes that precede p as $k = rank_1(BM, n + j) − ni$. Finally, the RL of node p is in the range of positions [2 · k · m . . . 2 · m · (k + 1) − 1] of bitmap BR, with k ≥ 0. If $i_1 = 1$ and $i_2 = 0$, then node p is of type i) and its associated RL is obtained as explained above.
Note that the BR bitmap plays an important role in the processing of spatio-textual queries since it allows the index cBiK to prune entire branches using both spatial and textual dimensions at the same time. Like BK, BR is also implemented using sparse arrays [22] to optimize the number of bits required to encode each effectively used keyword.
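The arithmetic of this lookup can be sketched as follows. The sketch assumes 0-based positions and a rank operation that counts the 1-bits strictly before the given position, which is the convention under which the first type ii) node obtains k = 0; the naive `rank1` helper, the function `locate_summary`, and the 15-node example are hypothetical.

```python
def rank1(bits, i):
    """Number of 1-bits in bits[0 .. i-1] (naive; a compact bitvector does this in O(1))."""
    return sum(bits[:i])

def locate_summary(bm, n, m, j):
    """Range of BR that stores the RL of node j, or None if the RL is not explicit."""
    i1, i2 = bm[j], bm[n + j]
    if i1 == 1 and i2 == 1:          # type ii): explicit summary stored in BR
        ni = rank1(bm, n)            # number of internal nodes
        k = rank1(bm, n + j) - ni    # type ii) nodes that precede node j
        return (2 * k * m, 2 * m * (k + 1) - 1)
    return None                      # leaf, or type i) node computed online from BK

# Hypothetical balanced iKD-Tree with n = 15 nodes and m = 4 keywords:
# internal nodes sit at odd positions; nodes 3, 7 and 11 have internal children.
n, m = 15, 4
first_half = [1 if j % 2 == 1 else 0 for j in range(n)]
second_half = [1 if j in (3, 7, 11) else 0 for j in range(n)]
bm = first_half + second_half
```

For node 7, three summaries of 2 · m = 8 bits each are stored in BR, and the one for node 7 is the second of them, so it occupies positions 8 to 15.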

VI. ALGORITHMS FOR SPATIAL KEYWORD QUERIES
Let $q = \langle l, t \rangle$ be an SKQ query. Hereinafter, all the algorithms assume a folklore hash-based mapping approach to obtain a numerical identifier for each keyword t, which is used to access bitmaps BK and BR. Also, all the algorithms assume that the array nodes and the bitmaps BM, BK and BR are global variables.
The traditional approach to find the k nearest neighbors in a KD-Tree is used in our solution. That is, from the root of the tree, the algorithm descends to the leaves alternating the axis used for the partition of the subspaces. Both in the descent (leaf nodes) and in the return of the recursion (internal nodes), the points associated with the nodes are used to improve the solution; for each candidate point, it then suffices to determine whether it contains the searched keywords. The algorithms take the following parameters: q is the query, which has a spatial attribute q.l, a textual attribute q.t, and an attribute q.k that indicates the number of requested results; start and end define the range in nodes where the traversal of the iKD-Tree continues (initially start = 0 and end = n − 1); and depth indicates the depth in the iKD-Tree and determines the direction of the subspace partitions.
FIGURE 6. BkSKQ query example given q and the search path on the conceptual tree of the structure cBiK. Each node contains the spatial information, the keywords of the point, and the summary bitmap for an efficient pruning.

A. EVALUATION OF BkSKQ QUERIES
The general idea of the algorithm is to traverse the tree from the root to the leaves creating and adjusting the solution by simultaneously taking into account the euclidean distance and the searched keywords. To do that, the summary bitmaps are used to determine the subspaces that the algorithm should keep exploring in the traversal. In this way, some complete subspaces can be pruned when the algorithm detects that a searched keyword is not present in the subspace even when there are some points inside it that are close to the query location. Recall that all the results to this type of query must contain all the keywords.
To illustrate the procedure of BkSKQ, we use Fig. 6 with a query q that has three parameters: the query point (2,8), the keywords that are represented by the bitmap 0100, and the value of k = 1.
The query algorithm starts from the root and evaluates the partition in the direction of the x axis. Since the x coordinate of the query point is less than or equal to that of the evaluated point (2 ≤ 5), the left path, to node (2, 4), is followed, because its RL (the first half of bits) indicates that the searched keyword is present in that subspace. The traversal continues to the point (2, 7). As the value of the query coordinate is less than or equal to that of the point (2 ≤ 2), the algorithm should follow the left side. However, the summary of the point indicates that the keyword is not present in any subsequent subspace; therefore, the traversal is completed and the point is evaluated as a candidate solution.
In the return of recursion, the points visited are still evaluated to improve the solution, if appropriate.
Algorithm 1 receives five parameters: the four described at the beginning of the section (q, start, end, and depth) and the variable heap, which corresponds to a Max-Heap of size k that is used to maintain the candidate points of the solution. At the end of the algorithm, the solution is also stored in heap. Note that it may be impossible to obtain k objects that satisfy all the keywords defined in q.t and, therefore, the number of objects contained in heap may be less than k.
The algorithm begins by obtaining the position in the array nodes of the root of the sub-tree that is between start and end (line 1). In lines 3-26, the case of the internal nodes is solved using a depth-first search (DFS) traversal. Lines 4 and 5 retrieve the coordinate values corresponding to the direction of the partition according to the depth in the iKD-Tree, both for the point q.l and for the point p of the node. In lines 6-15, the algorithm processes the case in which the point q.l is to the left of the point p, since the value of c_q is less than or equal to that of c_p. Then, function checkLS(., ., ., .) (line 7) verifies if the bitmap BR associated with the left subspace of the node contains all the keywords indicated in q.t, that is, if q.t is a subset of the union of the keywords of all the points located in the left subspace. If so, the traversal continues by recursively accessing the first half (sub-tree) of the subarray of nodes between start and end (line 8). If it does not contain them, the other branch of the iKD-Tree is processed (lines 12-14). Then, function existsRight(., .) (explained below) verifies if it is necessary to explore the points of the subspace to the right of p, with the goal of finding out if they may improve the best solution achieved so far. Analogously, lines 16-25 process the case when the path must continue on the right side of p.
Lines 27-29 process the case in which the depth traversal has reached a leaf node, which must be revised, or when the algorithm has returned from recursion and the solution must be improved. In such cases, the function checkKeywords(., .) (explained below) verifies if the point contains all the keywords specified by q.t. If so, the point is considered a candidate and it is used to update the heap using the function updateCandidate(., ., ., .).
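Both the subspace check (checkLS) and the point check (checkKeywords) described above reduce to a subset test between bitmaps. The following sketch illustrates that test in Python, assuming for simplicity that a bitmap fits in a single integer; cBiK stores BK and BR as compressed sparse bitmaps, so the helper contains_all and the example values are hypothetical.

```python
def contains_all(bitmap: int, query: int) -> bool:
    """True if every keyword bit set in query is also set in bitmap."""
    return (bitmap & query) == query

# The query of the running example asks for the keyword encoded as 0100.
query = 0b0100

# Hypothetical summaries of two subspaces.
left_summary = 0b0110    # keyword 0100 appears somewhere in this subspace
right_summary = 0b1001   # keyword 0100 does not appear: the branch is pruned

print(contains_all(left_summary, query))
print(contains_all(right_summary, query))
```

With several query keywords the same test applies unchanged, since all the bits of q.t must be present for a subspace or a point to qualify.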
The algorithm uses the following non-trivial functions:
• updateCandidate(heap, p, q.l, q.k): is used to add point p to the candidates heap. It is assumed that p is a candidate, that is, it contains all the keywords of the query. The function decides whether or not to insert p into heap. To do this, it verifies whether the size of the heap is less than k; if so, p is inserted. Otherwise, it verifies whether the Euclidean distance between p and q.l is less than the distance between q.l and the point at the root of heap. If so, the point at the root is removed and p is inserted.
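The behavior of updateCandidate can be sketched with Python's heapq module. Since heapq only provides a min-heap, this sketch emulates the Max-Heap by storing negated distances, which is an implementation choice of the example and not something stated in the paper; the function name update_candidate is likewise hypothetical.

```python
import heapq
import math

def update_candidate(heap, p, ql, k):
    """Keep in heap the (at most) k candidates closest to the location ql."""
    d = math.dist(p, ql)
    if len(heap) < k:
        heapq.heappush(heap, (-d, p))          # heap not full: always insert
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, p))       # closer than the current worst

heap = []
for p in [(5, 5), (2, 7), (3, 7), (6, 8)]:
    update_candidate(heap, p, (2, 8), 2)
closest = sorted(q for _, q in heap)
print(closest)
```

Keeping the farthest candidate at the root means that a single comparison decides whether a new point enters the solution.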

B. EVALUATION OF RkSKQ QUERIES
In this section, we describe an algorithm to compute the RkSKQ queries defined in Section II-A2. In a nutshell, the strategy consists of searching for the spatio-textual objects that are closer to the query point, ranking the results by the weighted sum (the score, Eq. (1)) of the spatial proximity (Euclidean distance) and the textual relevance (number of keywords associated with the object).
The algorithm is based on branch and prune, and it is similar to classical algorithms for k nearest neighbors. The main difference is that we use the score instead of just the Euclidean distance or any other spatial distance. When branching, the algorithm evaluates the score of each subspace by using the conceptual bitmaps LS and RS, and the boundaries of the subspace associated with the visited node. Then, it proceeds to the node with the highest score. The same score is also used to prune the traversal: if the score of a subspace does not improve the best known solution so far, the subspace is discarded. A particular case arises when none of the searched keywords is present in the subspace. In such a case, the score is zero and the subspace is discarded.
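To make the role of the weight α concrete, the following Python sketch combines a spatial and a textual component into a single score. The exact definitions of Eqs. (1)-(3) are not reproduced in this section, so the normalizations used here are assumptions: the spatial proximity is taken as 1 − d/d_max for a hypothetical maximum distance d_max, and the textual relevance as the fraction of query keywords matched. Only the weighting by α and 1 − α, and the zero score when no keyword matches, follow the description above.

```python
import math

def total_score(p, p_keywords, q_loc, q_keywords, alpha, d_max):
    """Weighted sum of spatial proximity and textual relevance (assumed forms)."""
    matched = len(p_keywords & q_keywords)
    if matched == 0:
        return 0.0                                   # nothing matches: discard
    d = min(math.dist(p, q_loc), d_max)
    spatial = 1.0 - d / d_max                        # assumed normalization
    textual = matched / len(q_keywords)              # assumed normalization
    return alpha * spatial + (1.0 - alpha) * textual

q_loc, q_kw = (2, 8), {"a", "b"}
near_partial = total_score((2, 7), {"a"}, q_loc, q_kw, alpha=0.3, d_max=10.0)
far_full = total_score((6, 8), {"a", "b"}, q_loc, q_kw, alpha=0.3, d_max=10.0)
print(near_partial, far_full)
```

With α = 0.3 the textual component dominates, so in this toy setting the farther point that matches both keywords obtains a higher score than the closer point that matches only one.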
An example of the RkSKQ algorithm can be seen in the conceptual tree of Fig. 7. For a better understanding of each step, we provide Table 1. The figure shows a query q with four parameters: the query point (2, 8), the keywords that are represented as a bitmap, the weight α = 0.3 that indicates that the total score is 30% due to spatial proximity and 70% due to textual relevance, and the value of k = 1.

TABLE 1. Step by step execution of the RkSKQ query in Fig. 7.
Note that the traversal path is determined by the classification score of the summaries. At iteration 4, the point (1, 3) corresponds to a leaf node (i.e. it does not have a summary) so the process is completed and the point is considered as a candidate solution. Then, for each point on the recursion backtrack, the algorithm tries to improve the solution. Note that when backtracking to the point (2,4), the other subspace is revised since RS estimates that there may be better candidates having a score higher than the current best and, indeed, the best solution becomes the point (2, 7).
Since the summary score of the root node is equal on both sides, the algorithm keeps looking for candidate points. In step 9, the final solution is reached at point (6, 8), with a total score of 0.894. Note that point (2, 7) is the closest to the query point q, but as the textual component has been given greater importance (α = 0.3), the final solution corresponds to a point a little farther away that contains a better match of the searched keywords.
This procedure is synthesized in Algorithm 2, which receives five parameters: q, start, end, which were described at the beginning of the section, together with a variable h that corresponds to a Min-Heap of size k used to keep the candidate points ranked by score from lowest to highest, and the variable α that weights the importance of the spatial and textual scores.
The algorithm first obtains the median of start and end (line 1), corresponding to the position of the point to be evaluated in the array nodes. Then in lines 2-4, the spatialS, the textualS, and the totalScore of the accessed point with respect to the query point are computed.
Then, lines 5-23 correspond to the case when the accessed point is an internal node of the tree, and it is necessary to determine in which subspace the traversal should continue. Line 6 obtains the textual scores of the left (textualScores_LS) and right (textualScores_RS) summaries, which are then used to obtain the total scores (score_LS and score_RS). The traversal uses this information to decide which subspace to visit next.
Lines 9-14 are related to the case in which there are more keyword matches in the left subspace than in the right subspace (score_LS > score_RS); hence, the traversal processes that subspace in line 10. In the backtrack of the recursion, function goToR(., ., .) determines if it is necessary to access the other subspace in order to improve the solution. If necessary, it is accessed in line 12. Analogously, lines 14-19 are related to the opposite case, in which the right subspace has a greater score. In lines 19-22, the case when both subspaces have the same score is processed by recursively traversing both of them.
Finally, lines 24-26 are related to the case in which the traversal has either reached a point corresponding to a leaf node, or it is returning from the recursion and the current point has to be considered as a candidate to improve the solution.
The non-trivial functions used in Algorithm 2 are explained as follows:
• summaryTextualScores(q.t, pos, start, end): returns an object with two attributes, the textual scores (Eq. (3)) of the left and right summaries, i.e., textualScores_LS and textualScores_RS, respectively.
• updateCandidate(heap, p, totalScore, q.k): is similar to the function used in Alg. 1, and it decides the insertion of p into the heap depending on its score. As heap is a Min-Heap, the points with higher scores will be kept in the heap.
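The score-based variant of updateCandidate can be sketched as follows, again with Python's heapq (a min-heap by default). The sketch assumes that each candidate arrives with its total score already computed; the function name and the sample scores are hypothetical.

```python
import heapq

def update_candidate(heap, p, total_score, k):
    """Keep in heap the (at most) k candidates with the highest scores."""
    if len(heap) < k:
        heapq.heappush(heap, (total_score, p))
    elif total_score > heap[0][0]:
        # The root holds the lowest score kept so far; replace it.
        heapq.heapreplace(heap, (total_score, p))

heap = []
for p, s in [((2, 7), 0.62), ((1, 3), 0.30), ((6, 8), 0.88)]:
    update_candidate(heap, p, s, 2)
kept = sorted(heap)
print(kept)
```

Because the lowest kept score sits at the root, it also serves as the threshold that a subspace score has to exceed for the subspace to be worth visiting.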

C. EVALUATION OF bRS-SKQ QUERIES
The general idea is to traverse down the tree until the specified range is located. The corresponding child is accessed as long as the RL reports that all the searched keywords are present in the corresponding subspace; otherwise, the traversal stops. The navigation ends when a leaf node is reached. Then, in the return of the recursion, the algorithm checks whether the visited nodes are contained in the query region and whether they contain an exact match of the requested keywords; if so, the point is added to the result list. Fig. 8 illustrates this procedure with a query q that has three parameters: the two endpoints of the spatial region in Fig. 8(a), and the searched keyword represented by its bitmap 0100. The traversal is shown in Fig. 8(b), starting from the root node and continuing until point (2, 7), whose partition intersects the query region. Hence, candidate points may exist in both subspaces and they must be visited. However, as RL indicates that the searched keyword is not present in any of the children, the traversal backtracks and checks if the visited points are contained in the query region. The final result only contains the point (2, 7). Note that point (3, 7) is contained in the range, but it does not contain the searched keyword.
Lines 3-23 evaluate the case when the revised point is an internal node. First, in lines 4-6, the value of the coordinate according to the partition is revised: c_m and c_M correspond to the lower left and upper right corners of the region, and c_p is the actual value of the evaluated point. In order to determine in which subspace the search should continue, the following three cases are evaluated:
• Case 1: When c_M ≤ c_p, the partition of the point is located to the right of (or above) the query region (line 7), and the search continues in the left subspace.
• Case 2: When c_m > c_p, the partition of the point is located to the left of (or under) the query region (line 11), and the search continues in the right subspace.
• Case 3: Otherwise, the partition of the point intersects the query region, and the search continues in both subspaces.
Note that in all cases, in order to make the recursive call, it is necessary to verify that the subspace contains the requested keywords using the functions checkLS(., .) and checkRS(., .), according to the case. If the subspace contains them, the traversal continues, as there may be candidate points within the range of queried positions.
Finally, lines 24-26 evaluate if the point is contained in the query region using function containsPoint(q, p).
Moreover, function checkKeywords(q.t, mid) determines if the searched keywords completely match the keywords of the point. In such case, the point is added to the list result.
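The spatial decisions of the range query can be sketched in a few lines of Python. The first two branches follow Cases 1 and 2 as stated above; the third branch, where the partition crosses the query region and both subspaces are visited, is inferred from the example of Fig. 8. The function names and the sample region are hypothetical, and the keyword checks are omitted.

```python
def subspaces_to_visit(c_m, c_M, c_p):
    """Decide which subspaces of a node may intersect the query region.

    c_m, c_M: region bounds on the current partition axis; c_p: node value.
    """
    if c_M <= c_p:
        return ("left",)            # region lies entirely on the left side
    if c_m > c_p:
        return ("right",)           # region lies entirely on the right side
    return ("left", "right")        # the partition crosses the region

def contains_point(region, p):
    """True if p lies inside the axis-aligned region (corners included)."""
    (x_min, y_min), (x_max, y_max) = region
    return x_min <= p[0] <= x_max and y_min <= p[1] <= y_max

region = ((1, 6), (3, 9))           # lower left and upper right corners
print(subspaces_to_visit(1, 3, 5))
print(subspaces_to_visit(1, 3, 2))
print(contains_point(region, (2, 7)))
```

A point is reported only if it passes contains_point and also matches all the query keywords, which corresponds to the final step of the algorithm.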

D. ANALYSIS OF THE TEMPORAL/SPATIAL COMPLEXITY OF THE ALGORITHMS
In this section we explain the temporal complexity of the algorithms to find the nearest neighbors (BkSKQ and RkSKQ) and to perform range searching (bRS-SKQ). In addition, the spatial complexity is analyzed to determine the amount of memory (RAM) required by the structure of cBiK.
For all algorithms, we consider the construction of cBiK with n nodes and m keywords. The worst case happens when all the n points have all the m existing keywords and, therefore, the memory used is given by (4), which includes the substructures used in cBiK. It is important to emphasize that this is a worst-case analysis and, as can be seen in Section VII, in practice the space is much lower due to the effectiveness of the SD-array to encode sparse bitmaps such as BK and BR:
Storage = S_points + S_keys + S_summary + S_map + S_hashing (4)
• S_points: Corresponds to the totality of the spatial points (coordinates x and y), which are stored with variables of type double (8 bytes). Hence, the memory consumption is given by (5).
S_points = n · (16 bytes) (5)
• S_keys: Refers to the storage of Bitmap Keywords (BK), which represents the keywords associated with each point. As in the worst case all n points have m bits, the memory used is shown in (6).
• S_summary: Equation (7) shows the bytes used by the Bitmap Summary (BR). As only the explicit summaries are stored, the nodes of the last two levels of the tree have to be omitted. The remaining nodes are counted by the summation $\sum_{i=0}^{h-3} 2^i$, with h the height of the tree.
• S_map: Represents the storage of the Bitmap Map (BM), which considers twice the number of nodes, as shown in (8).
• S_hashing: Equation (9) corresponds to the hash-based mapping structure used to transform keywords into their corresponding numeric identifiers, where l_i (with i > 0) is the length of the keyword and size(l_i) is the length of the string in bytes.
For all the above, the spatial complexity corresponds to O(n · m) in the worst case.
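As a rough illustration of how the terms of (4) add up, the following Python sketch estimates the worst-case size from the quantities stated in the text. Equations (6)-(9) are not reproduced in this section, so several choices are assumptions of the sketch: bitmaps are counted uncompressed at 8 bits per byte, each explicit summary takes 2 · m bits (the length of the RL range in BR), and the hashing term is left as a caller-supplied parameter. The function name worst_case_bytes is hypothetical.

```python
def worst_case_bytes(n, m, h, hashing_bytes=0):
    """Worst-case size estimate for n points, m keywords, and tree height h."""
    s_points = 16 * n                                  # two doubles per point, Eq. (5)
    s_keys = n * m / 8                                 # BK: m bits per point (assumed uncompressed)
    explicit = sum(2 ** i for i in range(h - 2))       # levels 0 .. h-3
    s_summary = explicit * 2 * m / 8                   # BR: 2m bits per explicit summary (assumed)
    s_map = 2 * n / 8                                  # BM: two bits per node (assumed uncompressed)
    return s_points + s_keys + s_summary + s_map + hashing_bytes

# Toy instance: 7 points, 8 keywords, height 3 (a single explicit summary).
print(worst_case_bytes(7, 8, 3))
```

The n · m term of BK dominates as the vocabulary grows, which is consistent with the O(n · m) worst-case bound and explains why sparse bitmap encodings matter in practice.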
Regarding the temporal complexity of the algorithms, we can review the following two cases:

1) THE k NEAREST NEIGHBORS (kNN)
If k = 1 and k ≪ n, the worst case to find the nearest neighbor (1NN) happens if all the points contain all the searched keywords and, at the same time, they are located on a circle centered at the query point q. This layout forces the traversal of all the nodes of the tree, as all the distances to the point q are equal. Hence, the temporal complexity is O(n).
On the other hand, if the points are uniformly distributed, finding the 1NN takes O(log n) time on average.
If k > 1 and k ≪ n, the operations on the heap used to deliver the results, which has size at most k, also have to be considered. The resulting worst-case complexity is given by (10).
O(n · (log k + |q.t|)) (10)

2) RANGE SEARCHING
For this type of query, the worst case occurs when all the points inside the query region R contain all the requested keywords. To obtain the temporal complexity for the balanced iKD-Tree, with height h = log_2 n, we must determine the maximum number of subspaces Q(n) intersected in the KD-Tree when cutting the plane with a line l. For this, the key idea is to consider two levels of the tree at the same time. If we consider first a vertical cut and then a horizontal one, 4 subspaces are obtained, each with n/4 points. Then, the line will intersect two subspaces and the remaining subspaces will be outside or completely within R. This is expressed by the recurrence Q(n) = 2 · Q(n/4) + O(1), with Q(1) = O(1), which resolves to Q(n) = O(√n). Also, reporting all the P points contained in the region R contributes O(P) time. Finally, the temporal complexity for the range query in the worst case is O(√n + P).
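The O(√n) bound can be checked by unrolling the recurrence. The derivation below assumes the standard form of the recurrence for this two-level argument, Q(n) = 2Q(n/4) + c for a constant c, with Q(1) constant; the constant c is an assumption of this sketch.

```latex
\begin{align*}
Q(n) &= 2\,Q(n/4) + c \\
     &= 2^{2}\,Q(n/4^{2}) + c\,(2 + 1) \\
     &= 2^{j}\,Q(n/4^{j}) + c\,(2^{j} - 1) \\
     &= 2^{\log_4 n}\,Q(1) + c\,(2^{\log_4 n} - 1) && (j = \log_4 n) \\
     &= O(\sqrt{n}), \qquad \text{since } 2^{\log_4 n} = n^{\log_4 2} = n^{1/2}.
\end{align*}
```

Intuitively, the number of intersected subspaces doubles every two levels, while the number of points per subspace is divided by four.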

VII. EXPERIMENTAL EVALUATION
This section provides a comprehensive evaluation of cBiK that analyzes its space consumption and its performance to resolve the three different queries proposed in this paper: BkSKQ, RkSKQ, and bRS-SKQ. Our prototype 5 is coded in C++, and uses different bitmap implementations available in the SDSL [39] to build the corresponding bitmaps of cBiK.
We compare cBiK to S2I [9], one of the most competitive approaches in the state of the art [7] for the current scenario. S2I is a disk-oriented index, in contrast to cBiK, which resides in main memory. Although this fact penalizes S2I, which always pays I/O costs, the following experimentation enables disk and memory-based solutions to be effectively compared. Note that I/O is one of the main bottlenecks of geo-textual indexes [7], and our approach tries to overcome it by using succinct data structures that are more likely to fit in main memory. The S2I prototype 6 is coded in Java.

a: DATASETS
We consider two different datasets 7 for our experiments, both used in [40]:
• The POI's dataset contains 1.1 million real-world points of interest (POI) obtained from the social network Foursquare. Each POI provides its geographical coordinates and a short description.
• The Twitter dataset is a collection of 20 million geo-tagged tweets, each one containing a short text and a geographical description of the location from which the tweet was published.
It is worth noting that S2I is not able to index the whole Twitter dataset, forcing us to work with smaller data. We divide the Twitter dataset into four smaller ones, namely Twitter1M, Twitter3M, Twitter5M and Twitter10M, that contain 1, 3, 5, and 10 million tweets, respectively. Table 2 summarizes the most relevant features of all these datasets. Note that the average number of words per object is quite similar in all cases (between 4 and 4.7), so the total number of words is proportional to the dataset size. On the other hand, the number of different words also increases with the dataset size, ranging between 261,212 and 1,364,787 for POI's and Twitter10M, respectively. Note that the number of different words is an important metric for cBiK because it determines the size of the bitmaps.

b: QUERIES
We designed a testbed of randomly generated queries for each dataset in our setup. For BkSKQ and RkSKQ, we generated five sets of 1,000 queries, each one providing a point (x, y) and a list of k keywords (1 ≤ k ≤ 5). Latitude and longitude values were randomly generated at the [−90, 90] and [−180, 180] intervals, respectively, while the keywords were also randomly chosen from the collection of different words used in each dataset. This decision ensures that all queries return, at least, one result.
Queries for bRS-SKQ follow a similar pattern, but regions were queried instead of points. Each region is defined around an existing point in the dataset, which is randomly chosen. The corresponding region is the (squared) bounding box that has the selected point as its center.
We consider regions of five different sizes according to the length of the diagonal of the bounding box: 1 km, 2 km, 5 km, 10 km, and 20 km. The POI's dataset describes points that cover almost half of the Earth (note that the distance between its farthest points is 18,909.6 kilometers), so regions with a 1 km diagonal are just 0.005% of the whole area of the dataset, while regions of 20 km cover 0.106% of this area. Twitter datasets cover smaller areas, so the query regions are proportionally larger in these cases.
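The construction of a square query region from its center and diagonal can be sketched as follows. The sketch works in planar coordinates expressed in the same unit as the diagonal (for instance, kilometers), which is an assumption: the datasets use latitude and longitude, and the projection needed to convert between the two is not shown. The function name square_from_diagonal is hypothetical.

```python
import math

def square_from_diagonal(center, diagonal):
    """Corners of the axis-aligned square centered at center with the given diagonal."""
    half_side = diagonal / (2 * math.sqrt(2))     # side = diagonal / sqrt(2)
    cx, cy = center
    lower_left = (cx - half_side, cy - half_side)
    upper_right = (cx + half_side, cy + half_side)
    return lower_left, upper_right

# A region with a 10 km diagonal around a hypothetical planar point.
lo, hi = square_from_diagonal((100.0, 200.0), 10.0)
print(lo, hi)
```

The two returned corners correspond to the lower left and upper right endpoints that a bRS-SKQ query receives as its spatial parameter.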
The performance of our solution is benchmarked in the following sections. Note that all experiments were run on an Intel Core i5 @ 3.4 GHz (4 cores) with 8 GB of RAM and a 1 TB SATA disk, running Ubuntu 16.04 LTS (64 bits).

A. CONSTRUCTION TIME AND STORAGE USAGE
Table 3 shows the construction time in seconds of both the baseline, S2I, and our proposal, cBiK, for all the datasets used in the experiments. From these results, we can conclude that our proposal clearly outperforms the baseline by several orders of magnitude. Taking the dataset Twitter5M as an example, S2I requires about a week to be built, whereas our proposal can be built in a couple of minutes. For the largest dataset, we were not able to build the baseline, as the process exceeded the maximum secondary memory available (1 TB) after three weeks of computation. Even for this dataset, our proposal can be built in about 20 minutes. This is an important result for the scalability of our proposal.
The space usage measures the amount of storage used by each index. As shown in Table 4, S2I uses more space than cBiK to index each dataset (2.61 times more space, on average). In quantitative terms, cBiK saves 841.9 MB to index Twitter5M with respect to S2I, and this improvement remains proportional for the other datasets.
Although space savings are expected due to the use of succinct data structures in cBiK, this improvement is noticeable in practical terms because it enables larger datasets to be managed in main memory, avoiding costly I/O operations. Thus, managing smaller indexes also improves query performance, as shown below.
An interesting result regarding our structure is that most of the space is used by bitmap BR. Hence, a simple technique to reduce the space even further is to store these bitmaps not in all the levels but only every x levels. This technique obviously has an impact on time performance, so parameter x would provide a time-space trade-off. Exploring this idea further is left as an open problem.

B. QUERY PERFORMANCE
This section presents and discusses performance figures for the three different queries considered in our setup: BkSKQ, RkSKQ, and bRS-SKQ. In all cases, averaged query times are reported; i.e., the corresponding query set (of 1,000 queries) is executed, and the total running time is divided by 1,000. All times are reported in milliseconds per query.

1) BOOLEAN TOP-k SPATIAL KEYWORD QUERY
BkSKQ is a boolean query that retrieves the top-k closest locations (to the query point) that match all required keywords. Thus, the performance of this query is affected by the number of requested results and by the number of keywords provided by the query. Fig. 9 reports query times as a function of the number of keywords provided by the query, from 1 to 5 keywords. Note that all of these queries ask for the corresponding top-5. cBiK is more than 1 order of magnitude faster than S2I, and the improvement is almost 2 orders of magnitude for the smallest datasets: POI's and Twitter1M. S2I reports query times between 10 and 68 ms per query, and cBiK between ≈ 0.08 and 1.5 ms per query. Note that query times increase with the number of keywords because more comparisons must be done to evaluate each candidate. However, S2I performs better for 5 than for 4 keywords in the POI's dataset. This is because many objects are described with few keywords (note that each point in POI's has 4 keywords on average), and the S2I pruning algorithm is able to exploit this fact in this particular case.
Varying k barely affects query times, as shown in Fig. 10. In this case, all queries provide 3 different keywords and ask for the best 1, 5, 10, 15, and 20 results. Note that query times remain stable for all datasets and all top-k queries. As in the previous case, cBiK is more than one order of magnitude faster than S2I, but the difference is larger for the smallest datasets.

2) RANKED TOP-k SPATIAL KEYWORD QUERY
RkSKQ performance is also measured according to the number of keywords and the requested results, but it also considers the value of α.
We fixed k = 5 and α = 0.3 to measure the effect of the number of keywords (we consider queries including from 1 to 5 keywords). This means that the score algorithm weights text relevance with 0.7 and spatial proximity with 0.3 in the corresponding top-5 queries. Fig. 11 reports query times for this experiment. Although cBiK is still faster than S2I, the difference decreases with the number of keywords. S2I reports numbers similar to those for BkSKQ, but cBiK pays an important overhead due to its pruning algorithm, which is less effective when not all the requested keywords must be present in the point description, so more candidates must be checked to obtain the best k results.
The experiment varying the number k of requested results draws conclusions similar to those for BkSKQ. As shown in Fig. 12, query times remain quite stable from k = 5 to k = 20, although cBiK reports a small increase from k = 1 to k = 5. In any case, cBiK is 1 order of magnitude faster for all datasets, reporting times of ≈ 0.5 to 6.2 milliseconds per query.
Finally, the effect of α is evaluated. This experiment measures the impact of combining spatial and text filters in the query. We consider five different values, α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, to analyze whether query performance varies when one of the components is more relevant than the other. The case of α = 0.5 is particularly interesting because it weighs the spatial proximity and the text relevance equally. All these queries ask for the corresponding top-5 and provide three different keywords. Fig. 13 shows that cBiK and S2I report very stable numbers for all datasets. Thus, we can conclude that α barely affects query performance, and cBiK is again one order of magnitude faster than S2I, reporting times from 1.3 to 5.5 milliseconds per query, while S2I performs at the level of dozens of milliseconds per query.

3) BOOLEAN RANGE SEARCHING SPATIAL KEYWORD QUERY
Finally, the bRS-SKQ performance is analyzed. In this case, we consider two parameters: the number of keywords and the size of the requested region. Fig. 14 reports query times as a function of the number of keywords. In this case, we fix query regions of 10 km and evaluate queries providing from 1 to 5 keywords. S2I reports figures similar to those for BkSKQ (see Fig. 9), so it behaves similarly whether a reference point or a region is queried. On the other hand, cBiK reports competitive numbers, from ≈ 0.017 to 0.29 milliseconds per query. Note that its performance runs parallel to that of S2I, but it is two orders of magnitude faster in all cases. This shows that cBiK is very efficient at resolving range-based queries. Fig. 15 analyzes the effect of querying regions of different lengths (1, 2, 5, 10, and 20 km of diagonal length), fixing 3 different keywords per query. Query times remain stable for all region lengths and all datasets, both for S2I and cBiK. Assuming that the longer the region, the greater the number of points matching the query, this result means that bRS-SKQ performance does not depend on the number of retrieved points. cBiK also outperforms S2I in this scenario, and the improvement is higher, reaching up to 3 orders of magnitude for Twitter1M.
Finally, Fig. 16 shows the scalability of the data structures with respect to the set size, defined as the number of spatio-textual objects. Both data structures present a linear behaviour with respect to the set size. However, cBiK keeps its advantage of one to three orders of magnitude. The curve for S2I is not shown for the 10M set, because it was not possible to build such an index with the available resources.

C. PLACING THE S2I IN MAIN MEMORY
In this section we show complementary experiments that compare our solution with the S2I when both solutions run in main memory. To do that, we warm up the S2I. The idea of this technique is to run each query at least twice and report only the time of the second execution, which ensures that the part of the data structure involved in the running query is already in main memory, hence avoiding I/O operations. Although this kind of comparison is not completely fair (because it also favours cache behaviour, making the data structure even faster than a main-memory resident one), it provides a rough estimate of the performance of the structure in main memory.
The results of these experiments are shown in Figs. 17, 18 and 19. In these figures, we label the S2I that uses the warm-up technique as S2I-WU. First, it is important to observe the impact of the warm-up technique on the S2I. As can be observed, this technique reduces query times by about one or two orders of magnitude, depending on the type of query. Second, Figs. 17 and 19 show that, even in this scenario, cBiK is faster than S2I-WU for the queries of type BkSKQ and bRS-SKQ. On the other hand, Fig. 18 shows that S2I-WU is faster than cBiK for RkSKQ queries. This is not a surprising result because: i) S2I is a data structure designed ad hoc to efficiently solve this specific type of query, and ii) it is a well-known result that compact data structures are usually slower than classical data structures when running in the same level of the memory hierarchy. Overall, we can conclude that cBiK is a scalable solution to efficiently solve the three types of queries studied in this work.
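The warm-up measurement can be sketched as a small timing harness. This is a generic Python illustration under stated assumptions, not the code used in the experiments: run_query stands for any callable that executes one query against an index, and both function names are hypothetical.

```python
import time

def warmed_up_time_ms(run_query, query):
    """Run the query twice and return only the time of the second run (ms)."""
    run_query(query)                          # first run brings the data into memory
    t0 = time.perf_counter()
    run_query(query)                          # second run is the one that is timed
    return (time.perf_counter() - t0) * 1000.0

def average_time_ms(run_query, queries):
    """Average warmed-up time per query over a query set."""
    return sum(warmed_up_time_ms(run_query, q) for q in queries) / len(queries)

# Toy usage with a stand-in for a real index lookup.
calls = []
avg = average_time_ms(calls.append, ["q1", "q2", "q3"])
print(len(calls), avg >= 0.0)
```

Because the second run may also hit CPU caches, the reported time can be optimistic with respect to a structure that merely resides in main memory, which is the caveat noted above.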

VIII. CONCLUSION AND FUTURE WORK
This paper proposes cBiK, the first compact data structure that manages geo-tagged datasets and allows spatial keyword queries to be resolved in main memory. Although this type of index performs more computations than traditional disk-based ones, it is more efficient because it avoids costly I/O operations.
Our experimentation verifies this fact using a selected testbed of real-world datasets, including points of interest descriptions and geo-located microposts from Twitter. cBiK is able to compact these datasets to 35-40% of the space used by a state-of-the-art index (S2I), while solving the three types of SKQs up to two orders of magnitude faster than S2I. When warming up the S2I, to simulate another main-memory resident solution, both approaches are comparable in query times. Regarding construction time, our approach also scales much better with the number of objects to index. These numbers endorse our approach and, consequently, an emergent line of research focused on the use of compact data structures for managing and querying big geo-tagged datasets.
We plan to enhance cBiK to support additional search capabilities as part of our future work. On the one hand, we can replace the current mapping, which transforms keywords into integer identifiers, with a powerful compressed string dictionary that allows inexact text queries. More concretely, we plan to use the compressed FM-index dictionary [41] because it can be tuned to perform approximate string matching, with different string similarity measures [42], on the Burrows-Wheeler transform [43]. On the other hand, another interesting line of work is to provide semantic spatial keyword queries in cBiK. Based on the experiences of Tekli et al. [44], [45], we can add a compressed semantic index [46] that allows semantic relationships between keywords to be efficiently navigated. In both types of searches, the additional indexes will be queried first to obtain the corresponding set of queries, which are then evaluated from the spatial perspective.
From a more applied perspective, the application of this approach in low-memory devices, such as smartphones, is promising. In such a scenario, not just the space usage, but also the battery consumption must be reduced.
MIGUEL A. MARTÍNEZ-PRIETO received the Ph.D. degree in computer science from the University of Valladolid (UVa), in 2010. He held a Postdoctoral position with the University of Chile, from 2010 to 2012. His scientific experience extends over the last 13 years. He is currently an Associate Professor with the Department of Computer Science, UVa. His main research interests include data engineering challenges, more concretely data compression and indexing, semantic web, and big data.
DIEGO SECO received the M.Sc. and Ph.D. degrees in computer science from the University of A Coruña, in 2006 and 2009, respectively. He is currently an Associate Professor with the Department of Informatics Engineering and Computer Science, Faculty of Engineering, Universidad de Concepción, Chile. He also participates with the Millennium Institute for Foundational Research on Data. His research interests include geographic information retrieval, geographic information systems, compressed data structures and algorithms for textual and geographic data, and bioinformatics.