Reverse Spatial Visual Top-k Query



I. INTRODUCTION
With the wide application of mobile Internet techniques and location-based services (LBS), massive multimedia data with geo-tags (geo-multimedia for short) is generated and collected by smartphones and tablets with local sensors, and then uploaded and stored on the Internet. On the one hand, multimedia sharing platforms and online social networks provide geo-multimedia storage and sharing services. For example, more than 95 million photos with location information captured by smartphones and digital cameras are stored on Flickr,¹ one of the largest picture sharing platforms. More than 140 million Twitter² users post 400 million tweets in the form of text and images with geo-location information (referred to as geo-text and geo-image). In China, many users of WeChat,³ the most popular mobile application, share texts, images and short videos with geo-tags every day. On the other hand, geo-multimedia data are used in many location-based services. For instance, Dianping⁴ provides rating and review services for finding restaurants, hotels, gyms, cinemas, etc. via the geo-texts and geo-images uploaded by users. Another LBS application is Foursquare,⁵ which helps users share the places they have visited and find the best places nearby via geo-multimedia data. Geo-multimedia data is a fusion of multimedia content [1], [2] and geo-location information [3], which enables queries that consider geographical proximity and multimedia content similarity simultaneously. Spatial keyword query [4] is one of the significant problems that has attracted much attention in the spatial database and information retrieval communities. This query aims to find spatial objects by taking into account both spatial proximity and keyword relevance.

The associate editor coordinating the review of this manuscript and approving it for publication was Jiachen Yang.
¹ http://www.flickr.com/
² http://www.twitter.com/
Several types of spatial keyword search, i.e., collective spatial keyword query [5], m-closest keywords search [6], best keyword cover search [7], group top-k spatial keyword query [8], etc., have been studied in depth and applied widely in many scenarios to provide efficient spatial keyword querying.

A. MOTIVATION
Reverse spatial keyword query [9] is another important search problem, which finds a set of geo-objects that have the query as one of their most relevant objects in terms of both geographical proximity and textual similarity. Many studies [10]-[13] propose efficient algorithms to speed up reverse search in Euclidean space and road network space. However, previous works focus only on keyword search and are thus not suitable for unstructured data such as geo-images. In other words, these techniques cannot be applied directly to reverse spatial queries over geo-multimedia data. To this end, this paper considers geo-images, the most common type of geo-multimedia. We propose a novel type of reverse top-k query, named reverse spatial visual top-k query (RSVQk for short), which takes into account both geo-location proximity and visual similarity between images. In other words, users can submit a reverse query with geo-images rather than keywords. To the best of our knowledge, this is the first work to investigate the RSVQk problem. To introduce this problem more intuitively, we provide an example of a reverse spatial visual top-k query as follows:
Example 1: As shown in Fig. 1, the manager of a steak house wants to know the preferences of consumers nearby so as to carry out more accurate advertising. She submits a reverse spatial visual top-k query by taking a picture of a steak with a smartphone in the steak house. The system returns the users who have this steak house as one of their k most desirable restaurants in terms of both geographical proximity and the visual similarity between their posted images and the query image.

B. OUR METHOD
To overcome this challenge, this paper first formally defines the reverse spatial visual top-k query and introduces the relevant notions, i.e., the geographical proximity measurement and the visual similarity measurement. As far as we know, this is the first time the definition of RSVQk has been proposed, and no existing approach addresses this problem. Thus, a baseline that uses the R-Tree and the threshold algorithm [14] is proposed. To organize the geo-image data more efficiently, we carefully design a novel hybrid index, named VR²-Tree, which is an integration of the visual representation of geo-images and the R-Tree. The visual representation of an image in this work is a vector of visual words. Two operations on visual word vectors, namely Weight OR and Weight AND, are proposed to support the generation of the non-leaf nodes of the VR²-Tree. Besides, an extension of the VR²-Tree, named CVR²-Tree, is developed to enhance the pruning power via a specific entry in each tree node, namely the CEntry set. Furthermore, we discuss the calculation of lower and upper bounds via the CVR²-Tree, and then introduce an optimization technique to tighten the bounds. In addition, the CVR²-Tree based query processing algorithm with this optimization is introduced.

C. CONTRIBUTIONS
The main contributions of this work are summarized as follows: • We propose the definition of the reverse spatial visual top-k query and the relevant notions. Besides, a baseline for reverse spatial visual search is introduced. To the best of our knowledge, this work is the first to study the RSVQk problem.
• We present a novel hybrid index, named VR²-Tree, which is a combination of the visual representations of geo-images and the R-Tree. In addition, an extension of VR²-Tree, called CVR²-Tree, is designed, which further improves the pruning power during the reverse search.
• We carefully develop an efficient RSVQk algorithm, which utilizes an optimization technique via CVR²-Tree to enhance the search performance significantly.
• We have conducted extensive performance evaluations on four geo-image datasets. Experimental results demonstrate that the proposed approach achieves high search performance.

D. ROADMAP
In the remainder of this paper, we review previous studies related to this work in Section II. In Section III, we propose the definition of the reverse spatial visual top-k query and the related notions; a baseline is also introduced in this section.
In Section IV, a novel hybrid index named VR²-Tree and its extension, CVR²-Tree, are proposed. Furthermore, an efficient reverse spatial visual search algorithm, named RSVQk, is carefully designed. In Section V, we evaluate the proposed algorithms on four geo-image datasets. Finally, we conclude this paper in Section VI.

II. RELATED WORK
In this section, we review previous studies on image retrieval, spatial keyword query, and reverse query processing, which are related to our work. To the best of our knowledge, we are the first to study the reverse spatial visual top-k query problem.

A. IMAGE RETRIEVAL
Image retrieval is one of the classical problems in the multimedia and computer vision communities, and it can be applied in versatile big data applications [15]-[23]. Many approaches have been proposed to address this challenge. As two powerful visual feature representation tools, Scale-Invariant Feature Transform (SIFT) [24], [25] and Bag-of-Visual-Words (BoVW) [26] are widely utilized. For example, Ke et al. [27] proposed an effective PCA-based local feature representation method called PCA-SIFT to improve accuracy and efficiency. Mortensen et al. [28] proposed to augment the original SIFT descriptor by combining the SIFT feature with a global context vector to enhance the matching rate. Li and Ma [29] improved the SIFT descriptor by integrating color and global information, which provides powerfully distinguishable information. Dimitrovski et al. [30] improved the BoVW model by using predictive clustering trees to construct the codebook, reducing the number of local descriptors. More recently, with the rise of deep learning [31]-[33], many researchers have employed more powerful tools such as CNNs [34], RNNs [35] and LSTMs [36] to greatly improve image retrieval accuracy [37], [38]. In 2012, AlexNet [39] markedly improved image classification accuracy and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Matsuo et al. [40] proposed a CNN-based style vector that is transformed from the style matrix with PCA dimension reduction. Gordo et al. [41] proposed a CNN-based global fixed-length representation for image retrieval, which is generated by a ranking framework. Tan et al. [42] utilized different CNN models to extract multiple visual features that are fused into a weighted average feature. Liu et al. [17] introduced a method that fuses high-level features from CNNs with low-level features to generate two-layer codebook features. Seddati et al. [43] combined multi-scale and multi-layer feature extraction from improved R-MAC approaches, which generates short descriptors and achieves better performance without the need for CNN fine-tuning. Yang et al. [44] introduced a method in which a dynamic match kernel is constructed by calculating the matching thresholds between the query and candidates.
It is obvious that deep learning based methods perform much better than traditional hand-crafted feature based methods. In our previous works [45], [46], we proposed to combine spatial search techniques and visual feature representations to solve the geo-multimedia retrieval problem. However, as far as we know, there is no existing image retrieval approach suitable for the reverse spatial visual query (RSVQ) problem. In this work, we design an efficient index structure and algorithm for the RSVQ problem.

B. SPATIAL KEYWORD QUERY
Spatial keyword query [47], [48] is a significant problem in the spatial database community [49]-[51] and has been well studied in recent years. It aims to return spatial-textual objects that are spatially and textually relevant to the query. Several spatial indexing structures, such as the R-Tree [52], R*-Tree [53], IR-Tree [54], [58], KR*-Tree [55], IL-Quadtree [56], etc., have been proposed to support spatial keyword search efficiently.
Felipe et al. [57] addressed top-k spatial keyword queries using a novel index called the Information Retrieval R-Tree (IR²-Tree), which is a combination of an R-Tree and superimposed text signatures. Cong et al. [58] introduced a new indexing framework in which an inverted file is employed for text retrieval and an R-tree for spatial proximity search. Rocha-Junior et al. [59] proposed a novel index named Spatial Inverted Index (S2I), which maps each distinct term to a set of objects containing the term. Zhang et al. [60] developed I³, an inverted index integrated with a quadtree that partitions the data space into cells in a hierarchical manner. In another work [61], they modeled the top-k distance-sensitive spatial keyword query as a top-k aggregation problem, and proposed an extension of the CA algorithm, called the Rank-aware CA algorithm, to enhance the search.
Unfortunately, these approaches, whether in Euclidean space or road network space, can only be applied to structured data, e.g., keywords. That means they are not suitable for spatial unstructured data such as geo-images. To the best of our knowledge, this paper is the first to develop effective and efficient techniques for the geo-image search task.

C. REVERSE QUERY PROCESSING
Reverse query [11], [62], [63] is another significant problem in the area of spatial-textual search, which takes the perspective of a point of interest (POI), e.g., restaurant, supermarket, store, tourist attraction, etc., rather than that of users. More specifically, it aims to retrieve the users for which the query object is one of their top preferences, e.g., in terms of geographical proximity [64]-[67]. Reverse queries can be applied in many applications, e.g., advertising, recommendation, marketing, etc.
Many approaches have been proposed to address this challenge in the last decade. Vlachou et al. [64] proposed the reverse top-k query and introduced two versions, namely monochromatic and bichromatic. In another work [68], they proposed the distance-based reverse top-k query problem, which can be applied in mobile environments. For the reverse k nearest neighbor (RkNN) problem, Cheema et al. [69] proposed a novel notion named influence zone, the area such that every point inside it is a result of the RkNN query and every point outside it is not. Yu et al. [70] studied reverse top-k search using random walk with restart in large graphs. In road network space, Wang et al. [71] investigated continuous monitoring of RkNN queries, utilizing the influence zone to boost the search.
Beyond the proximity of spatial distance, textual similarity has also been considered in reverse queries. For example, Lin et al. [9] first proposed the reverse keyword search for spatio-textual top-k queries (RSTQ) and developed a novel hybrid index, called KcR-tree, to store and summarize the spatial and textual information of objects. Yang et al. [11] extended half-space-based pruning to solve spatial reverse top-k queries and introduced a novel region-based pruning algorithm based on SLICE [72], a region-based pruning algorithm for reverse k nearest neighbor queries, to improve efficiency. Moving beyond Euclidean space, Luo et al. [73] investigated reverse spatial and textual k nearest neighbor queries on road networks and proposed several spatial keyword pruning techniques to speed up the search. Gao et al. [10] introduced another novel query paradigm, called reverse top-k Boolean spatial keyword (RkBSK) retrieval on road networks, which considers both spatial and textual information. To boost performance significantly, they developed a new data structure named count tree to overcome the drawbacks of the count list.
However, these solutions for reverse queries cannot be extended to the geo-image query problem since they are not suitable for unstructured data such as images. To overcome this limitation, in this work we address the reverse spatial visual top-k query problem, which takes into account both visual similarity and geographical proximity simultaneously. To the best of our knowledge, we are the first to propose this query paradigm and to solve it effectively and efficiently.

III. PRELIMINARY
In this section, we formulate the definition of the reverse spatial visual top-k query problem for the first time and introduce the relevant notions. Then we propose a baseline to address this problem. Table 1 summarizes the notations frequently used throughout this paper to facilitate the discussion.

A. PROBLEM DEFINITION
Before defining the reverse spatial visual top-k query problem, we introduce the notion of a geo-image, which contains two aspects of information, i.e., geo-location and visual content.
Let I = {I1, I2, . . . , I|I|} be a geo-image dataset. Each geo-image I ∈ I is represented by a tuple ⟨I.λ, I.ν⟩, where I.λ is the geo-location descriptor, a 2-dimensional vector representing the geographical information in the form of longitude X and latitude Y, i.e., I.λ = (X, Y). I.ν is the visual descriptor, a γ-dimensional vector representing the visual features of the image, i.e., I.ν = (ν^(1), ν^(2), . . . , ν^(γ)). In this paper, we employ the BoVW [26] model to construct the visual descriptor, thus each item ν^(i) represents a visual word.
Definition 1 (Reverse Spatial Visual Top-k Query): Given a geo-image dataset I and a query Q = ⟨Q.λ, Q.ν⟩, a reverse spatial visual top-k query (RSVQk) aims to retrieve all the geo-images in I that consider the query Q as one of the top-k most relevant geo-images in terms of both geo-location and visual content. Formally, it is described as follows:

RSVQk(Q, k, I) = {I ∈ I | Q ∈ SVQk(I, I, k)},    (1)

where SVQk(I, I, k) represents the spatial visual top-k query, which returns the k geo-images most similar to a query image I, considering geographical proximity and visual similarity simultaneously. It is formulated as follows:

SVQk(I, I, k) ⊆ I, |SVQk(I, I, k)| = k, and ∀Î ∈ SVQk(I, I, k), ∀I′ ∈ I \ SVQk(I, I, k): Sim(Î, I) ≥ Sim(I′, I),    (2)

where Sim(Î, I) is the similarity function measuring both geographical proximity and visual similarity between Î and I. We define it formally as follows:

Sim(Î, I) = μ · GeoSim(Î.λ, I.λ) + (1 − μ) · VisSim(Î.ν, I.ν),    (3)

where μ ∈ [0, 1] is a parameter to balance the proportion between geographical proximity and visual similarity, i.e., GeoSim(Î.λ, I.λ) and VisSim(Î.ν, I.ν). If μ = 1 (or μ = 0), the query considers only geographical proximity (or visual similarity). In our solution, users are allowed to set this parameter according to their query preferences.

FIGURE 2. An example of reverse spatial visual top-k query (RSVQk): ten geo-images I1, I2, . . . , I10 and a query Q containing ten different visual words. The left part shows the spatial distribution of the geo-images; the table on the right gives their geo-location and visual descriptors.
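The μ-weighted combination of Eq. 3 can be sketched in a few lines; the function name `sim` is illustrative, and the geographical and visual similarities are assumed to be already normalized into [0, 1]:

```python
def sim(geo_sim: float, vis_sim: float, mu: float) -> float:
    """Combined similarity: mu balances geographical proximity
    against visual similarity (mu = 1 is purely spatial,
    mu = 0 is purely visual)."""
    assert 0.0 <= mu <= 1.0
    return mu * geo_sim + (1.0 - mu) * vis_sim
```

Setting μ at the two extremes reduces the query to a purely spatial or purely visual reverse search, matching the remark after Eq. 3.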
Next, we formulate the geographical proximity and visual similarity measurements and introduce how to implement GeoSim(Î.λ, I.λ) and VisSim(Î.ν, I.ν).
Definition 2 (Geographical Proximity Measurement): Given a geo-image dataset I, let I, Î ∈ I be two geo-images. The geographical proximity between Î and I is measured by the following function:

GeoSim(Î.λ, I.λ) = 1 − EucliDst(Î.λ, I.λ) / MaxDst(I),    (4)

where EucliDst(Î.λ, I.λ) is the Euclidean distance between Î.λ and I.λ:

EucliDst(Î.λ, I.λ) = √((X̂ − X)² + (Ŷ − Y)²).    (5)

The function MaxDst(I) in Eq. 4 is the maximum Euclidean distance between any two geo-locations in the dataset I, which normalizes the Euclidean distance into [0, 1]:

MaxDst(I) = Max({EucliDst(Ii.λ, Ij.λ) | Ii, Ij ∈ I}),    (6)

where Max(·) returns the largest element of the input collection.
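The geographical proximity measurement can be sketched as follows; the function names are illustrative, and `max_dst` uses the naive quadratic pairwise scan implied by Eq. 6:

```python
import math
from itertools import combinations

def eucli_dst(p, q):
    """Euclidean distance between two (X, Y) geo-locations (Eq. 5)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def max_dst(locations):
    """MaxDst(I): largest pairwise distance in the dataset (Eq. 6)."""
    return max(eucli_dst(p, q) for p, q in combinations(locations, 2))

def geo_sim(p, q, m):
    """GeoSim: distance normalized into [0, 1] and inverted (Eq. 4)."""
    return 1.0 - eucli_dst(p, q) / m
```

In practice MaxDst(I) would be pre-computed once per dataset rather than per query.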

Definition 3 (Visual Similarity Measurement): Given a geo-image dataset I, let I, Î ∈ I be two geo-images. The visual similarity between these two geo-images is measured by the following function:

VisSim(Î.ν, I.ν) = ExJacc(Î.ν, I.ν) / MaxVisSim(I),    (7)

where ExJacc(Î.ν, I.ν) is the extended Jaccard similarity:

ExJacc(Î.ν, I.ν) = Σᵢ W(ν̂^(i))·W(ν^(i)) / (Σᵢ W(ν̂^(i))² + Σᵢ W(ν^(i))² − Σᵢ W(ν̂^(i))·W(ν^(i))).    (8)

To simplify the description, we use ν̂^(i) and ν^(i) to denote the i-th visual words of Î.ν and I.ν, i.e., ν̂^(i) ∈ Î.ν and ν^(i) ∈ I.ν. The function W(·) in Eq. 8 calculates the weight of a visual word by TF-IDF [74]. Similar to the role of MaxDst(I) in Eq. 4, the function MaxVisSim(I) in Eq. 7 returns the maximum visual similarity in the dataset, i.e.,

MaxVisSim(I) = Max({ExJacc(Ii.ν, Ij.ν) | Ii, Ij ∈ I, Ii ≠ Ij}).    (9)
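A minimal sketch of the visual similarity measurement, assuming the standard extended Jaccard (Tanimoto) form for ExJacc over already TF-IDF-weighted vectors; function names are illustrative:

```python
def ex_jacc(a, b):
    """Extended Jaccard (Tanimoto) similarity between two weighted
    visual-word vectors: dot / (|a|^2 + |b|^2 - dot)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

def vis_sim(a, b, max_vis):
    """VisSim: extended Jaccard normalized by MaxVisSim(I) (Eq. 7)."""
    return ex_jacc(a, b) / max_vis
```

Identical vectors score 1 and vectors with disjoint support score 0, so the normalization in Eq. 7 keeps VisSim within [0, 1].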
In the following, we give a simple example to present the RSVQk problem and how to find its results, in comparison with the conventional reverse top-k query.
Example 2: As shown in Fig. 2, ten geo-images I1, I2, . . . , I10, illustrated by black spots, are distributed in a region represented by longitude X and latitude Y. N1, N2, . . . , N7 are the minimum bounding rectangles (MBRs) that describe their approximate locations. The visual dictionary is the collection of visual words contained in these geo-images. The table on the right shows the geographical information and the weight of each visual word contained in each geo-image. Given a query Q (the red spot) with geo-location Q.λ, for the conventional reverse top-k query that considers only geographical proximity (Euclidean distance), with k = 2, the result set is {I2, I5, I8}. However, for RSVQk with μ = 0.5, the result set is {I8, I6, I9}, because Q is more similar to I6 and I9 in terms of visual content.

B. BASELINE INTRODUCTION
As far as we know, there is no study focusing on the RSVQk problem and no baseline has been proposed. Obviously, the existing reverse spatial textual search methods cannot be directly applied to RSVQk because of the necessity of visual representation and similarity measurement. According to Eq. 3, both geographical proximity and visual similarity should be considered simultaneously during the search. Thus, it is not feasible to perform reverse spatial search and reverse visual search separately and then combine their results to answer an RSVQk query.
In this work, we propose a baseline for RSVQk, named RSVQk-R. A pre-computation step calculates the geographical proximity and visual similarity between the query Q and all the geo-images in the dataset I, and the results are stored in two lists. The threshold algorithm [14] is employed to retrieve the top-k geo-images with the highest similarity computed by Eq. 3 over these two lists. During the computation, if the similarity between the query Q and a geo-image Ii is larger than that of the current k-th geo-image Ik, then Ii becomes the new k-th geo-image by replacing Ik.
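The threshold-algorithm step of the baseline can be sketched as follows; the list layout, function names, and stopping rule follow the classic TA of [14] and are assumptions rather than the paper's exact implementation:

```python
import heapq

def ta_topk(geo_list, vis_list, mu, k):
    """Threshold-algorithm sketch over two pre-sorted lists of equal
    length. geo_list / vis_list: [(score, image_id)] sorted descending.
    Stops once the k-th best combined score reaches the threshold."""
    geo = {i: s for s, i in geo_list}   # random-access score tables
    vis = {i: s for s, i in vis_list}
    best = {}                           # image_id -> combined score
    for (gs, gi), (vs, vi) in zip(geo_list, vis_list):
        for i in (gi, vi):              # look up the other list's score
            best[i] = mu * geo[i] + (1 - mu) * vis[i]
        threshold = mu * gs + (1 - mu) * vs
        top = heapq.nlargest(k, best.values())
        if len(top) == k and top[-1] >= threshold:
            break                       # no unseen item can beat the top-k
    return set(heapq.nlargest(k, best, key=best.get))
```

The threshold combines the heads of both sorted lists, so sequential access can stop early without scanning the whole dataset.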
For the visual representation of geo-images, we utilize hand-crafted features, namely the SIFT descriptor, combined with the BoVW model to encode the visual content, which is a conventional approach used in many image search tasks [46], [75], [76]. Specifically, the visual features are extracted by SIFT and clustered by the k-means method to generate the visual dictionary. Each geo-image is represented by a visual word vector in which each element is the weight of the visual word measured by TF-IDF. The spatial index employed in RSVQk-R is the R-Tree.
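The TF-IDF weighting of visual words can be sketched as follows, assuming the standard tf·idf variant (the paper's exact weighting formula may differ); each image is modeled as a list of visual-word ids produced by quantizing its local descriptors:

```python
import math
from collections import Counter

def tfidf_vectors(docs, dictionary):
    """TF-IDF weights for BoVW histograms.
    docs: one visual-word list per image; dictionary: all visual words.
    Returns one weight vector per image, aligned to the dictionary."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([
            (tf[w] / len(d)) * math.log(n / df[w]) if w in tf else 0.0
            for w in dictionary
        ])
    return vectors
```

A word that appears in every image gets idf = 0 and therefore weight 0, which is why ubiquitous visual words carry no discriminative power in the similarity of Eq. 8.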

IV. THE PROPOSED APPROACH
In this section, we propose an effective approach to overcome the challenge of RSVQk. Firstly, a novel hybrid index named VR²-Tree is introduced in subsection IV-A, which organizes the geo-images efficiently in terms of both geographical distribution and visual representation. In subsection IV-B we analyze the lower and upper bounds of the search in theory. Then we develop a VR²-Tree based algorithm to speed up the search markedly.
A. VR²-TREE

1) THE STRUCTURE
To efficiently organize the geo-images, we integrate the visual representation of geo-images and the R-Tree to construct a novel hybrid index, named Visual Representation R-Tree (VR²-Tree). As shown in Fig. 3, the VR²-Tree is a balanced tree built on a geo-image database I. Each leaf node contains several tuples in the form T = ⟨I.λ, I.ν, PTR(I)⟩, I ∈ I. As defined in Section III, I.λ = (X, Y) is the geo-location descriptor and I.ν = (ν^(1), ν^(2), . . . , ν^(γ)) is the visual descriptor modeled by the BoVW technique. PTR(I) is the pointer to the geo-image I in the database. Each non-leaf node contains quadruples in the form ⟨MBR, ANDOR, NUM, PTR(Child)⟩, where MBR represents the minimum bounding rectangle of the child node, and ANDOR refers to two visual vectors, namely the visual word weight AND vector (AND-vector for short) and the visual word weight OR vector (OR-vector for short), which are generated by two novel operations on weighted visual word vectors; their definitions are given below. NUM is the total number of geo-images in the leaf nodes belonging to the subtree of this non-leaf node. PTR(Child) is the pointer to the child node.
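The two node layouts described above can be sketched as plain data structures; all field names here are illustrative stand-ins for the paper's notation (e.g. `ptr` for PTR(I), `and_vec`/`or_vec` for the ANDOR pair):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LeafEntry:
    """Leaf tuple T = <I.lambda, I.nu, PTR(I)>."""
    loc: Tuple[float, float]   # geo-location descriptor (X, Y)
    nu: List[float]            # weighted visual-word vector I.nu
    ptr: int                   # pointer to the geo-image in the database

@dataclass
class NonLeafEntry:
    """Non-leaf quadruple <MBR, ANDOR, NUM, PTR(Child)>."""
    mbr: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    and_vec: List[float]       # elementwise-min (Weight AND) of the subtree
    or_vec: List[float]        # elementwise-max (Weight OR) of the subtree
    num: int                   # number of geo-images in this subtree
    children: list = field(default_factory=list)  # child nodes
```

Keeping both the AND- and OR-vectors in each internal entry is what later allows the lower and upper visual-similarity bounds to be computed without visiting the leaves.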
Definition 4 (Weight AND): Given two γ-dimensional visual word vectors ν1 = (W(v1^(1)), W(v1^(2)), . . . , W(v1^(γ))) and ν2 = (W(v2^(1)), W(v2^(2)), . . . , W(v2^(γ))), where W(·) is the visual word weight function, the weight AND operation on ν1 and ν2, denoted as ν1 ⊓ ν2, chooses the minimum value of the corresponding elements of ν1 and ν2, namely:

ν1 ⊓ ν2 = (Min(W(v1^(1)), W(v2^(1))), . . . , Min(W(v1^(γ)), W(v2^(γ)))),

where Min(·, ·) returns the minimum of the two inputs.
Definition 5 (Weight OR): Given two γ-dimensional visual word vectors ν1 and ν2 as above, the weight OR operation on ν1 and ν2, denoted as ν1 ⊔ ν2, chooses the maximum value of the corresponding elements of ν1 and ν2, namely:

ν1 ⊔ ν2 = (Max(W(v1^(1)), W(v2^(1))), . . . , Max(W(v1^(γ)), W(v2^(γ)))),

where Max(·, ·) returns the maximum of the two inputs. For a non-leaf node N of a VR²-Tree, assume that the geo-images contained in its subtree are {I1, I2, . . . , Im}; the visual word weight AND vector of a quadruple in N is AND(I1, I2, . . . , Im) = I1.ν ⊓ I2.ν ⊓ . . . ⊓ Im.ν. Similarly, the visual word weight OR vector is OR(I1, I2, . . . , Im) = I1.ν ⊔ I2.ν ⊔ . . . ⊔ Im.ν. According to Definitions 4 and 5, we calculate the visual word weight AND vector and OR vector of the non-leaf nodes N5, N6, N7 in Example 2, as shown in Fig. 4. For example, I1, I2, I3 are contained in the left subtree of N5, and I4, I5 in the right subtree. Thus, for the non-leaf node N5, the weight AND vectors of the two quadruples are AND(I1, I2, I3) and AND(I4, I5), respectively. Likewise, the weight OR vectors are OR(I1, I2, I3) and OR(I4, I5).
Visual Representation. Instead of hand-crafted visual features, we propose to utilize deep CNN features to represent each geo-image, since CNN features are powerful in capturing semantic concept information. Specifically, AlexNet [39] is employed to extract the visual features from each geo-image in I, using the output of the 5-th convolutional layer as the feature. We still use the BoVW model to generate the visual word vector as the visual representation. Similar to the conventional manner, the k-means technique is exploited to construct a CNN visual word dictionary containing γ different words. Then each geo-image Ii is encoded into a γ-dimensional visual word vector weighted by TF-IDF. In the following discussion, we denote the weighted visual word vector by Ii.ν.

2) THE CONSTRUCTION ALGORITHM
Inspired by the insert operation of the R-Tree [52], we develop a similar insertion algorithm based on the heuristic of minimizing the MBR to construct the VR²-Tree, as described in Algorithm 1. What is slightly different is that, instead of the form (W(v^(j))), we propose to store the visual representation vector in a node N in the new form ((h_1, W(v^(1))), . . . , (h_α, W(v^(α)))), where h_j is a code hashed from the visual word v^(j) and α is the total number of visual words in N. To implement the hashing operation, we employ the technique proposed in [77], namely order preserving minimal perfect hashing.
Specifically, the procedure OPMP-HASH(I.ν) in Line 5 generates the hash codes by order preserving minimal perfect hashing from the original visual word vector and produces the new representation vector. The procedure ChooseLeaf(MBR) in Line 6 chooses the leaf node according to the MBR, similar to the implementation in the R-Tree [52]. The tree adjustment after insertion proceeds as follows (excerpt of Algorithm 1):

10: if N is the root node then
11:     M.AddNode(O, P);
12:     SetRoot(M);
13: else
14:     AdjustTree(N.Parent, O, P);
15: end if
16: else if N is not the root node then
17:     AdjustTree(N.Parent, N, null);
18: end if

Different from the AdjustTree algorithm of the R-Tree, the procedure AdjustTree(·) invoked in Lines 14 and 17 is modified for better compatibility with the visual representations.

3) THE EXTENSION OF VR²-TREE
There is a limitation of the VR²-Tree: although it can organize geo-images according to geographical proximity (using MBRs) as effectively as the R-Tree, it ignores visual similarity during tree construction. In other words, it may well happen that the visual similarity between geo-images that are geographically close to each other is very small. This phenomenon is common in real environments. For example, on a commercial street, the facilities usually fall into different categories, e.g., restaurant, clothing shop, cafe, cinema, etc. This leads to low visual similarity between the geo-images collected at these different facilities.
To overcome this limitation, we extend the VR²-Tree by exploiting visual content clustering to modify the structure of the non-leaf nodes, and we call this extension the Clustering based VR²-Tree (CVR²-Tree). Specifically, before the construction of the tree, we use the k-means method to partition the geo-image dataset I into k clusters according to visual similarity, i.e., {C1, C2, . . . , Ck} = KMEANS(I).
Different from the VR²-Tree, the tuples T in the non-leaf nodes of the CVR²-Tree, as shown in Fig. 5, contain a novel entry named the CEntry set S_C = {E_C}. Each CEntry E_C corresponds to a cluster and has the form E_C : ⟨C_id, I_num⟩, where C_id is the id of the cluster and I_num is the total number of geo-images belonging to this cluster. For a non-leaf node, its CEntry set is the superposition of all the CEntry sets in its child nodes. To describe this clearly, we propose a novel operation, named CEntry set sum, to define this calculation formally, as shown in the following.
Definition 6 (CEntry Set Sum): Given two CEntry sets S_C1 and S_C2, the sum of these two CEntry sets, S_C1 ⊕ S_C2, is defined as follows: for each cluster id that appears in both sets, the two corresponding I_num values are added into a single CEntry; the CEntries whose cluster ids appear in only one of the two sets (obtained via the set union operator ∪ and the set minus operator \) are included unchanged. Therefore, for a non-leaf node N, its CEntry set N.S_C is the sum of all the CEntry sets of its child nodes, i.e., N.S_C = ⊕_{i=1}^{L} ChildNode(N)_i.S_C, where ChildNode(N)_i represents the i-th child node of N and L is the total number of children. For example, consider all the geo-images {I1, I2, . . . , I10} in Example 2; according to visual similarity we cluster them into 4 clusters: C1 = {I1, I2, I5}, C2 = {I3, I4}, C3 = {I6, I7, I8, I9} and C4 = {I10}. Like the AND-vector and OR-vector in the nodes of the VR²-Tree, we can calculate a CAND-vector and a COR-vector for each cluster. Specifically, the CAND-vector contains the minimal weight of each visual word included in the cluster, and the COR-vector contains the maximal weight. For the four clusters C1, C2, C3, C4 mentioned above, the CAND-vectors and COR-vectors are shown in Fig. 6.
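The CEntry set sum amounts to merging per-cluster counts; a minimal sketch modeling a CEntry set as a `{cluster_id: image_count}` mapping (an assumed representation, not the paper's exact layout):

```python
from collections import Counter

def centry_sum(s1, s2):
    """CEntry set sum (Def. 6): entries with the same cluster id are
    merged by adding their geo-image counts; entries whose id appears
    in only one set are kept unchanged."""
    return dict(Counter(s1) + Counter(s2))
```

Folding `centry_sum` over a node's children yields its own CEntry set, mirroring N.S_C = ⊕ ChildNode(N)_i.S_C.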

B. RSVQK ALGORITHM
Based on the CVR²-Tree, we carefully design a novel algorithm to solve the RSVQk problem efficiently. Before introducing the details of this algorithm in Section IV-B.3, we discuss how to compute the lower and upper bounds of similarity in Section IV-B.1.

FIGURE 6. The visual word weight CAND and COR vectors of clusters C1, C2, C3 and C4 in Example 2. Similar to the AND and OR operations in the VR²-Tree, the CAND-vector contains the minimal weight of each visual word included in the cluster, and the COR-vector contains the maximal weight.

1) LOWER BOUND AND UPPER BOUND
To explain the computation of the lower and upper bounds, we first present the notions of minimal similarity and maximal similarity between two tuples in a CVR²-Tree, and then introduce the lower bound and upper bound determinant queues.
Given a CVR²-Tree T and a tuple T ∈ T, the lower and upper bounds of the similarity between the tuple T and its k-th most similar geo-image are denoted as T̲ and T̄, respectively. The γ-dimensional visual word weight AND vector and OR vector of T are denoted as T.A = (a^(1), a^(2), . . . , a^(γ)) and T.O = (o^(1), o^(2), . . . , o^(γ)), respectively. We define the minimal similarity between two tuples in a CVR²-Tree as follows.
Definition 7 (Minimal Similarity (MinSim)): Let T1, T2 ∈ T be two tuples. The minimal similarity between T1 and T2 is denoted as MinSim(T1, T2), computed by the following equation:

MinSim(T1, T2) = μ · (1 − tMaxGeoSim(T1, T2)/MaxDst(I)) + (1 − μ) · MinVisSim(T1, T2),

where tMaxGeoSim(T1, T2), proposed in [78], is a tighter maximal Euclidean distance measurement between T1.MBR and T2.MBR than MaxGeoSim(T1, T2), the maximal Euclidean distance between the two MBRs, and MinVisSim(T1, T2) is the minimal visual similarity between T1 and T2, computed from the weight AND vectors T1.A and T2.A, where T1.W(i) denotes the weight of the i-th visual word.

Property 1: Given a CVR²-Tree T, for any two tuples T1, T2 ∈ T and any geo-images Î in the subtree of T1 and I in the subtree of T2, MinSim(T1, T2) ≤ Sim(Î, I) ≤ MaxSim(T1, T2).

Definition 8 (Maximal Similarity (MaxSim)): Let T1, T2 ∈ T be two tuples. The maximal similarity between T1 and T2 is denoted as MaxSim(T1, T2), computed by the following equation:

MaxSim(T1, T2) = μ · (1 − MinGeoSim(T1, T2)/MaxDst(I)) + (1 − μ) · MaxVisSim(T1, T2),

where MinGeoSim(T1, T2) is the minimal Euclidean distance between the two MBRs of T1 and T2, and MaxVisSim(T1, T2) is the maximal visual similarity between T1 and T2, computed from the weight OR vectors T1.O and T2.O.

According to the definitions of minimal and maximal similarity between two tuples in a VR²-Tree or CVR²-Tree, we propose two further notions, namely the Lower Bound Determinant Queue and the Upper Bound Determinant Queue, which are used to reduce the candidate set effectively.

Property 2: Given a query Q and a tuple T ∈ T, if the lower bound T̲ exceeds MaxSim(T, Q), the subtree of T can be pruned safely.
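The geographical components of MinSim and MaxSim rest on the minimal and maximal Euclidean distances between two MBRs. A minimal sketch follows; the rectangle layout (x_min, y_min, x_max, y_max) and function names are assumptions, and the tighter tMaxGeoSim of [78] is not reproduced here:

```python
def min_dist(r1, r2):
    """Minimal Euclidean distance between two MBRs.
    Zero when the rectangles overlap."""
    dx = max(r1[0] - r2[2], r2[0] - r1[2], 0.0)
    dy = max(r1[1] - r2[3], r2[1] - r1[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def max_dist(r1, r2):
    """Maximal Euclidean distance between two MBRs
    (attained at a pair of opposite corners)."""
    dx = max(r1[2] - r2[0], r2[2] - r1[0])
    dy = max(r1[3] - r2[1], r2[3] - r1[1])
    return (dx * dx + dy * dy) ** 0.5
```

Since GeoSim decreases with distance, the maximal MBR distance gives the minimal geographical similarity used in MinSim, and the minimal MBR distance gives the maximal one used in MaxSim.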

Property 3: Given a lower bound determinant queue of a tuple T, its α-th entry ψ(α) determines the lower bound of T.
According to Definition 9 and Property 3, the candidate set can be reduced by pruning the tuples that are not similar enough to the query. Therefore, the lower bound LB(T) can be assigned the value ψ(α).ξα.
Similar to the lower bound determinant queue, the upper bound determinant queue has an important property, formulated as follows.
Property 4: Given an upper bound determinant queue, if UB(T) < MinSim(T, Q), then Q is one of the k most similar geo-images for every geo-image in T.
It is easy to see from Property 4 that the number of geo-images whose similarity to any geo-image in the tuple T is greater than or equal to MinSim(T, Q) is at most k − 1. Therefore, the upper bound UB(T) can be assigned the value ψ(β).ξβ.
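The role of the two determinant queues can be sketched as follows. This is our own simplification: a queue that retains the k largest similarity values seen so far, whose k-th entry serves as the bound, in the spirit of reading ψ(α).ξα above; the class and method names are illustrative:

```python
import heapq

# Sketch: a bound determinant queue that keeps the k largest similarity
# values observed so far; the k-th largest serves as a bound for the
# tuple. Simplified from the paper's structure for illustration only.

class DeterminantQueue:
    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap holding the k largest similarities

    def update(self, similarity):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, similarity)
        elif similarity > self.heap[0]:
            heapq.heapreplace(self.heap, similarity)

    def bound(self):
        """k-th largest similarity seen so far, or 0.0 if fewer than k."""
        return self.heap[0] if len(self.heap) == self.k else 0.0

q = DeterminantQueue(k=3)
for s in [0.2, 0.9, 0.5, 0.7, 0.1]:
    q.update(s)
print(q.bound())  # 0.5: the third largest of the five similarities
```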

2) OPTIMIZATION: TIGHTER BOUND VIA CVR2-TREE
To improve search performance, we propose an optimization method via the CVR2-Tree to obtain tighter bounds. Based on the cluster ids, this method identifies outliers among the tuples in the CVR2-Tree; these outliers are separated from the normal geo-images and their bounds are calculated individually. Thus, the bounds of the normal tuples can be made tighter.
Outliers can be identified in the following two situations.
Situation-1: For a tuple T, most geo-images in the subtree of T can be pruned, but a few geo-images cannot, and we treat them as outliers. Obviously, these outliers prevent the tuple T and its subtree from being pruned.
where the threshold is a parameter. The geo-images that are in T but not in Sub1({C}) are treated as outliers.
Situation-2: For a tuple T, most geo-images in the subtree of T can be treated as results, but a few geo-images cannot; therefore, the tuple T as a whole cannot be treated as a result tuple. Formally, for a query Q and a tuple T, if MinSim(T, Q) < UB(T) < MaxSim(T, Q) and there exists a subset Sub2({C}), where the threshold is a parameter, the geo-images that are in T but not in Sub2({C}) are treated as outliers.
According to the above two situations, we can determine whether the subtree of a tuple can be pruned or treated as results. The implementation of this optimization method is shown in the next part.

Algorithm 2 RSVQk Algorithm
1: INPUT: the tree root of a CVR2-Tree T.Root, a reverse spatial visual top-k query Q.
2: OUTPUT: all the geo-images I, s.t., I ∈ RSVQk(Q, k, I).
3: Initializing: a max-priority queue P ← null;
4: Initializing: a candidate geo-image list LC ← null;
5: Initializing: a pruned tuples list LP ← null;
6: Initializing: a results list LR ← null;
7: EnQueue(P, T.Root);
8: while IsNotEmpty(P) do
9:   TP ← DeQueue(P);
10:  for each child tuple T of TP do
11:   Inherit the lower bound determinant list of TP;
12:   Inherit the upper bound determinant list of TP;
13:   if ¬IsResultOrPruned(T, Q, LR) then
14:    for each tuple T′ ∈ LC ∪ LR ∪ P do
15:     Update(T, T′);
16:     if IsResultOrPruned(T, Q, LR) then
17:      Remove(T′, LC ∪ LR ∪ P);
18:     end if
19:    end for
20:    if ¬IsResultOrPruned(T, Q, LR) then
21:     if IsIndexNode(T) then
22:      if T is Situation-1 or Situation-2 then
23:       for each T′ ∈ Subtree(T) do
24:        if CT′ ⊂ Sub1({C}) then
25:         Prune(T′);
26:        else if CT′ ⊂ Sub2({C}) then
27:         LR.Add(T′);
28:      else if IsIndexNode(T) then
29:       EnQueue(P, T);
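The Situation-1/Situation-2 test applied to an index node (Line 22 of Algorithm 2) can be sketched as follows. This is our own illustrative check: `status` labels the clusters of a subtree, and `theta` is a hypothetical fraction threshold standing in for the paper's parameter; none of these names appear in the paper:

```python
# Sketch of the Situation-1/Situation-2 test on a tuple's subtree.
# `status` maps each cluster id in the subtree to "pruned", "result" or
# "undecided" for its geo-images; `theta` is a hypothetical fraction
# threshold standing in for the paper's parameter (our names throughout).

def situation(status, theta=0.9):
    n = len(status)
    pruned = {c for c, s in status.items() if s == "pruned"}
    result = {c for c, s in status.items() if s == "result"}
    if n > len(pruned) >= theta * n:
        return "Situation-1", set(status) - pruned   # outlier clusters
    if n > len(result) >= theta * n:
        return "Situation-2", set(status) - result   # outlier clusters
    return None, set()

flags = {c: "pruned" for c in range(1, 10)}  # nine prunable clusters
flags[10] = "undecided"                      # one outlier cluster
print(situation(flags))  # ('Situation-1', {10})
```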

3) TOP-k SEARCH ALGORITHM
Based on the CVR2-Tree and the notions of the lower bound and upper bound, we carefully develop an efficient search algorithm for the RSVQk task, which is shown in Algorithm 2.
Specifically, the inputs of the RSVQk algorithm are the tree root of a CVR2-Tree and a query Q. The algorithm traverses the CVR2-Tree T from top to bottom and computes the lower bound LB(T) and upper bound UB(T) step by step for each T ∈ T. Then, according to LB(T) and UB(T), the algorithm determines whether a tuple T should be pruned or whether the geo-images in it are results. At the beginning, a max-priority queue P and three lists are initialized, i.e., a candidate geo-image list LC containing the geo-images that need to be checked, a pruned tuples list LP containing the tuples that will not be results, and a results list LR. The first step is to put the tree root into the queue P by invoking the procedure EnQueue(P, T.Root) (Line 7). Then, while the queue P is not empty, the tuple with the highest priority, denoted by TP, is dequeued from P (Lines 8-9). After that, each child T of TP inherits the lower bound determinant list and upper bound determinant list from TP (Lines 11-12). Based on L(T) and U(T), the procedure IsResultOrPruned(T, Q, LR) (Algorithm 3) is invoked to determine whether T is a result or needs to be pruned (Line 13). As shown in Algorithm 3, if LB(T) ≥ MaxSim(T, Q), T can be pruned and we put it into the list LP; if UB(T) < MinSim(T, Q) and T is the rightmost child, T can be treated as a result and we put it into the results list LR; if T belongs to neither case, we tighten the lower bound and upper bound by invoking the procedure Update(T, T′) for each T′ ∈ LC ∪ LR ∪ P (Lines 14-15).

Algorithm 5 Verify(LC, LP, LR, Q)
1: while IsNotEmpty(LC) do
2:  Choose the tuple T ∈ LP with the lowest level;
3:  LP = LP − {T};
4:  for each geo-image I ∈ LC do
5:   Update(I, T);
6:   if IsResultOrPruned(I, Q) then
7:    Remove(I, LC);
8:   end if
9:  end for
10: for each child tuple T′ of T do
11:  LP = LP ∪ {T′};
12: end for
13: end while
In Lines 16-17, the algorithm invokes the procedure IsResultOrPruned again to determine whether T is pruned or treated as a result; if so, the algorithm removes T′ from P or LC. In Lines 20-35, if T is neither a result nor pruned and is an index node (Lines 20-21), we identify whether the tuple T belongs to Situation-1 or Situation-2. If so, the algorithm checks whether the tuples in the subtree of T are results based on the relation between CT and the cluster sets Sub1({C}) and Sub2({C}); if not, the algorithm puts T into the queue P. Finally, in Line 43, the procedure Verify is invoked to decide whether the geo-images in the list LC are results.
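The prune/result decision at the heart of IsResultOrPruned can be sketched as follows (a simplification in our own notation; `lb` and `ub` denote the tuple's lower and upper bound on the similarity of its k-th most similar geo-image):

```python
# Sketch of the decision inside IsResultOrPruned: a tuple is pruned when
# its lower bound already meets MaxSim(T, Q), and accepted as a result
# when its upper bound stays below MinSim(T, Q). Simplified from the
# paper's Algorithm 3; the names are ours.

def classify(lb, ub, min_sim, max_sim):
    if lb >= max_sim:
        return "pruned"      # Q cannot enter the top-k of any geo-image in T
    if ub < min_sim:
        return "result"      # Q is in the top-k of every geo-image in T
    return "undecided"       # the bounds must be tightened further

print(classify(lb=0.8, ub=0.9, min_sim=0.2, max_sim=0.6))  # pruned
print(classify(lb=0.1, ub=0.3, min_sim=0.5, max_sim=0.9))  # result
print(classify(lb=0.3, ub=0.7, min_sim=0.4, max_sim=0.8))  # undecided
```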
The pseudo-code of the procedure Verify is shown in Algorithm 5, which checks the effect of the tuples in LP on each tuple in LC. First, in Lines 1-2, the procedure chooses the tuple from the list LP with the lowest level in the CVR2-Tree. The reason is that tuples at lower levels generally have tighter bounds, which makes them more likely to identify whether a candidate is a result. In Lines 4-9, the tuple T is used to update the determinant queue of each geo-image contained in LC, and each geo-image is then checked to see whether it can be dropped from LC. In Lines 10-12, the algorithm adds the child tuples of T to the list LP, since they may also affect the candidates in LC.

V. EXPERIMENTS
In this section, comprehensive experiments on four datasets are presented to evaluate the performance of the proposed approach. The datasets and workload are introduced in Section V-A, and the evaluations are discussed in Section V-B.

A. DATASETS AND WORKLOAD 1) DATASETS
In our experiments, four synthetic geo-image datasets are used to evaluate the performance of the various approaches. Two commonly used image datasets, i.e., Flickr and ImageNet, serve as the sources of the synthetic geo-image datasets. The following four datasets are deployed in the experiments: • Flickr-RP. The synthetic dataset Flickr-RP is produced by obtaining geographical locations from the corresponding spatial datasets at Rtree-Portal 6 and randomly geo-tagging the images in Flickr, 7 the most popular photo-sharing platform. That is, we do not use the original geo-tags of these images. To evaluate the scalability of the proposed approach, the dataset size varies from 200K to 1000K.
• Flickr-US. The synthetic dataset Flickr-US is produced by obtaining geographical locations from the US Board on Geographic Names. 8 As with Flickr-RP, we use this geographical location information to generate new geo-tags for the images in Flickr.
• ImageNet-RP. The synthetic dataset ImageNet-RP is generated by obtaining geographical locations from Rtree-Portal 9 and randomly geo-tagging images obtained from the largest image dataset, ImageNet. 10 ImageNet is widely used in image processing and computer vision; it includes 14,197,122 images, 1.2 million of which have SIFT features. Like the Flickr datasets, we generate ImageNet datasets with sizes varying from 200K to 1000K.
• ImageNet-US. The synthetic dataset ImageNet-US is generated by obtaining geographical locations from the US Board on Geographic Names 11 and randomly geo-tagging the images in ImageNet. Some samples of the Flickr and ImageNet datasets are shown in Fig. 7.
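The random geo-tagging used to build all four synthetic datasets can be sketched as follows (an illustrative sketch; the location list and image ids below are placeholders, not the actual Rtree-Portal or Board on Geographic Names data):

```python
import random

# Sketch of the synthetic dataset construction: each image is assigned a
# random location drawn from a real spatial dataset, replacing any
# original geo-tag. Locations and image ids below are placeholders.

def geo_tag(image_ids, locations, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {img: rng.choice(locations) for img in image_ids}

images = ["img_%03d" % i for i in range(5)]
locs = [(40.71, -74.00), (34.05, -118.24), (41.88, -87.63)]
tagged = geo_tag(images, locs)
print(len(tagged))                          # 5
print(set(tagged.values()) <= set(locs))    # True
```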

2) WORKLOAD
A workload for the reverse spatial visual top-k query experiments includes 100 input queries. The query locations are randomly selected from the locations of the underlying geo-objects. By default, the number of final (top-k) results is k = 3; the image dataset size is 600K and varies from 200K to 1000K; the parameter µ is set to 0.7; and the number of query visual words is set to 100 and varies from 25 to 150. We report the average response time of the 100 queries. The details of these parameters are presented in Table 2. All experiments are run on a workstation with an Intel(R) Xeon 2.60GHz CPU, 16GB memory and an NVIDIA GeForce GTX 1080 GPU, running the Ubuntu 16.04 LTS operating system. All query algorithms in the experiments are implemented in Java.
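The workload construction can be sketched as follows (our own simplification; the parameter defaults follow the text above, while the object locations are placeholders):

```python
import random

# Sketch of the experimental workload: 100 queries whose locations are
# sampled from the locations of the underlying geo-objects; defaults
# follow the text (k = 3, mu = 0.7, 100 query visual words). The object
# locations below are placeholders.

def build_workload(object_locations, n_queries=100, k=3, mu=0.7,
                   n_visual_words=100, seed=7):
    rng = random.Random(seed)
    return [{"location": rng.choice(object_locations),
             "k": k, "mu": mu, "n_visual_words": n_visual_words}
            for _ in range(n_queries)]

objects = [(x / 10.0, x / 5.0) for x in range(50)]
workload = build_workload(objects)
print(len(workload))        # 100
print(workload[0]["k"])     # 3
```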
To the best of our knowledge, this work is the first to investigate the problem of reverse spatial visual top-k query; in other words, no existing method addresses this challenge. We compare the performance of the following approaches: • RSVQk-R. RSVQk-R is the baseline introduced in Section III-B, which employs the R-Tree as the spatial index.
• RSVQk-VR2. RSVQk-VR2 is the proposed method introduced in Section IV-A.1, which employs the VR2-Tree as the spatial index.
• RSVQk-CVR2. RSVQk-CVR2 is the proposed method which employs the CVR2-Tree as the spatial index without the optimization.
• RSVQk-OptCVR2. RSVQk-OptCVR2 is the proposed method which uses the CVR2-Tree with the optimization method introduced in Section IV-B.2.
As discussed above, the visual word generation technique used in the baseline is SIFT+BoVW. We utilize the SIFT technique to extract local visual features from the samples in the geo-image datasets, and then encode them into visual word vectors with a pre-learned vocabulary tree. The number of local visual features per sample ranges from 1 to 300. For the proposed approaches, a pre-trained CNN model, i.e., AlexNet, is used to learn the visual features. We fine-tune AlexNet on the two geo-image datasets with the stochastic gradient descent (SGD) algorithm; the momentum is set to 0.9 and the weight decay to 0.0005. To prevent over-fitting, each layer is followed by a drop-out operation with a drop-out ratio of 0.5. After fine-tuning, the outputs of the first two fully-connected layers are used as the deep visual features, which are used to generate deep visual word vectors.
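The SIFT+BoVW encoding used in the baseline can be sketched as follows. This is a flat nearest-centroid quantizer standing in for the pre-learned vocabulary tree, with toy 2-D descriptors instead of real 128-D SIFT descriptors; all names are ours:

```python
# Sketch of bag-of-visual-words encoding for the baseline: each local
# descriptor is assigned to its nearest visual word (a flat codebook
# stands in for the pre-learned vocabulary tree), and the counts form
# the visual word vector. Descriptors here are toy 2-D points, not
# real 128-dimensional SIFT descriptors.

def nearest_word(descriptor, codebook):
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(descriptor, codebook[i])))

def bovw_vector(descriptors, codebook):
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    return counts

codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]  # 3 visual words
descriptors = [(0.1, 0.1), (0.9, 1.1), (1.9, 0.2), (0.2, 0.0)]
print(bovw_vector(descriptors, codebook))  # [2, 1, 1]
```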

B. PERFORMANCE EVALUATIONS
In this section, we evaluate the reverse search performance of the proposed approaches, i.e., RSVQk-VR2, RSVQk-CVR2 and RSVQk-OptCVR2, and compare them with the baseline RSVQk-R on geo-image datasets of different sizes. Some search results of the proposed approaches are shown in Fig. 8. The images in green rectangles are correct results and the failed cases are in red rectangles.

1) EVALUATION ON THE SIZE OF DATASETS
We evaluate the effect of varying the size of the geo-image dataset on Flickr-RP, Flickr-US, ImageNet-RP and ImageNet-US, as shown in Fig. 9 using a log scale. The proposed algorithms clearly outperform the baseline on all four datasets. In particular, as the dataset size increases, the efficiency of RSVQk-R declines dramatically because all geo-images have to be considered for the spatial visual top-k search. By comparison, the performance of RSVQk-VR2, RSVQk-CVR2 and RSVQk-OptCVR2 drops relatively slowly thanks to the efficient spatial index and search algorithm.
To clearly demonstrate the trends of the proposed approaches, we plot the experimental data of RSVQk-VR2, RSVQk-CVR2 and RSVQk-OptCVR2 on a linear scale, as shown in Fig. 10. On all four datasets, the performance of RSVQk-VR2 is the lowest. Specifically, its response time fluctuates upward in the interval [200K, 800K] and then grows rapidly. By using the more efficient index, i.e., the CVR2-Tree, the algorithm RSVQk-CVR2 outperforms the former; similarly, its response time rises markedly when the dataset size exceeds 800K. Benefiting from the optimization technique, RSVQk-OptCVR2 is the most efficient algorithm and also has the lowest growth rate of response time. On Flickr-RP, its response time increases from 1.8 at 200K to nearly 3.2, which is similar to the situations on the other three datasets.

2) EVALUATION ON THE NUMBER OF RESULTS k
We evaluate the effect of varying the number of results k on Flickr-RP, Flickr-US, ImageNet-RP and ImageNet-US, as shown in Fig. 11. Owing to the huge performance gap between the baseline and the three proposed algorithms, we do not plot the experimental data of RSVQk-R; instead, we show only the differences between RSVQk-VR2, RSVQk-CVR2 and RSVQk-OptCVR2. As expected, the response time of all three algorithms increases gradually as k rises. Thanks to the optimization method, RSVQk-OptCVR2 outperforms the others on all four datasets. By comparison, without the optimization, the performance of RSVQk-CVR2 is worse, showing an upward trend with fluctuation. The response time of RSVQk-VR2 is clearly the highest, since the efficiency improvement provided by the VR2-Tree is smaller than that of the CVR2-Tree, especially when the optimization technique is applied.

3) EVALUATION ON THE BALANCE PARAMETER µ
We evaluate the effect of varying the balance parameter µ in the similarity measurement on the four datasets. As in the above experiments, we do not plot the data of RSVQk-R due to the enormous efficiency gap. On the Flickr-RP dataset, shown in Fig. 12(a), we can see clearly that the efficiency of RSVQk-VR2, RSVQk-CVR2 and RSVQk-OptCVR2 is not obviously affected by changing µ within the interval [0, 0.9]; the curves move up and down only slightly. However, when µ = 1, the time cost of these algorithms drops noticeably because the visual similarity is ignored entirely. As expected, RSVQk-OptCVR2 wins this comparison by applying the optimization via the CVR2-Tree. On Flickr-US, the runtimes of these algorithms are slightly lower than on Flickr-RP, but their trends are very similar: they decline rapidly at µ = 1. As expected, the situations on ImageNet-RP (Fig. 12(c)) and ImageNet-US (Fig. 12(d)) are very similar to the former two.

4) EVALUATION ON THE NUMBER OF QUERY VISUAL WORDS
In the last set of experiments, we evaluate the effect of varying the number of query visual words on the four datasets. The experimental results are illustrated in Fig. 13. By the same token, we omit the baseline and show only the differences between RSVQk-VR2, RSVQk-CVR2 and RSVQk-OptCVR2. It is evident that the runtime of these algorithms decreases gradually as the number of query visual words increases. In particular, their rates of change in the interval [25, 75] are slightly larger than in [100, 150]. The reason is that more visual words may enhance pruning by diminishing the average visual similarity between the query and the geo-images. As in the above sets of experiments, RSVQk-OptCVR2 achieves the highest efficiency on all datasets. In summary, these experimental results demonstrate that the proposed spatial index VR2-Tree, and especially the CVR2-Tree with the optimization method, can substantially improve the performance of reverse spatial visual search. The proposed search algorithm shows an obvious superiority in comparison to the baseline.

VI. CONCLUSION
This paper investigates a novel search problem named the RSVQk query, which aims to retrieve a set of geo-image objects that have the query image as one of their most relevant images in terms of both geographical proximity and visual similarity. To improve search efficiency, a new hybrid index named the VR2-Tree and its extension are presented, which combine the visual representation of geo-images with the R-Tree. Besides, an optimization method to tighten the bounds via the CVR2-Tree is introduced. In addition, an efficient CVR2-Tree-based algorithm, named the RSVQk algorithm, is carefully developed, which speeds up the reverse search significantly. Comprehensive experiments are conducted on four geo-image datasets, and the results demonstrate that the proposed approach addresses the RSVQk problem effectively and efficiently.