SVS-JOIN: Efficient Spatial Visual Similarity Join for Geo-Multimedia

In the big data era, massive amounts of multimedia data with geo-tags have been generated and collected by smart devices equipped with mobile communication modules and position sensor modules. This trend places higher demands on large-scale geo-multimedia retrieval. Spatial similarity join is one of the significant problems in the area of spatial databases. Previous works focused on the spatial textual document search problem rather than geo-multimedia retrieval. In this paper, we investigate a novel geo-multimedia retrieval paradigm named spatial visual similarity join (SVS-JOIN for short), which aims to search for geo-image pairs that are similar in both geo-location and visual content. Firstly, the definition of SVS-JOIN is proposed, and then we present the geographical similarity and visual similarity measurements. Inspired by the approach for textual similarity join, we develop an algorithm named SVS-JOIN$_{B}$ by combining the PPJOIN algorithm and visual similarity. Besides, an extension of it named SVS-JOIN$_{G}$ is developed, which utilizes a spatial grid strategy to improve the search efficiency. To further speed up the search, a novel approach called SVS-JOIN$_{Q}$ is designed, in which a quadtree and a global inverted index are employed. Comprehensive experiments are conducted on two geo-image datasets, and the results demonstrate that our solution can address the SVS-JOIN problem effectively and efficiently.


I. INTRODUCTION
In the big data era, online social networking services, search engines and multimedia sharing services are rapidly growing in popularity, and they generate, collect and store large-scale multimedia data [1]-[4], e.g., texts, images, audio and videos. For example, we use online social networking services such as Facebook, Twitter, Linkedin, Weibo, etc. to make friends and to share hobbies and work information by posting texts and uploading images or short videos. On multimedia data [5] sharing platforms such as Flickr, more than 3.5 million new images were posted online every day in March 2013. Every minute, 100 hours of video are uploaded to YouTube, and more than 2 billion videos in total were stored on this platform by the end of 2013. In China, IQIYI is the largest video sharing web site, and the total monthly watch time of this online video service has exceeded 42 billion minutes. These online multimedia services not only provide great convenience for us, but also create possibilities for the generation, collection, storage and sharing of large-scale multimedia data [6], [7]. Moreover, this trend poses greater challenges for massive multimedia data retrieval [8], [9].
Smartphones and tablets equipped with communication modules (e.g., WiFi and 4G modules) and position sensor modules (e.g., GPS modules) collect huge amounts of multimedia data [10]-[12] with geo-tags. For example, users can take photos or videos [13], [14] that carry geo-location information. Besides, many mobile applications such as WeChat, Twitter and Instagram support posting, storing and sharing geo-multimedia data. Other location-based services such as Google Places, Yahoo!Local, and Dianping provide convenient query services by taking into account both geographical proximity and multimedia content similarity.
Motivation: Due to the wide application of smart devices and location-based services, the spatial textual search problem has become a hot spot in the areas of spatial databases and information retrieval. Many spatial indexing techniques have been proposed to support efficient queries, such as R-Tree [15], R$^+$-Tree [16], R$^*$-Tree [17], KR$^*$-Tree [18], IR$^2$-Tree [19], etc. More recently, Deng et al. [20] studied a generic version of closest keywords search called best keyword cover. Cao et al. [21] proposed the problem of collective spatial keyword querying. Fan et al. [22] studied the problem of spatio-textual similarity search for regions-of-interest queries. Zhang et al. [23] proposed IL-Quadtree to address the top-k spatial keyword search problem efficiently. However, these studies consider only textual data such as keywords; they do not take into account the content of multimedia data, e.g., images. One of the significant geo-textual search problems, spatial textual similarity join, is to find the spatial textual object pairs that are similar in both geo-location and textual content simultaneously. It has attracted wide attention, e.g., [24]-[29]. Nevertheless, no existing work addresses geo-multimedia data for this task. In this paper, we investigate a novel paradigm named Spatial Visual Similarity JOIN (SVS-JOIN) and develop an efficient solution to overcome the challenge of geo-multimedia query. Fig. 1 gives a simple but intuitive example of this problem.
Example 1: As illustrated in Fig. 1, the spatial visual similarity join can be applied in friend recommendation on online social networking services. According to the geo-images posted by the users, the social networking system can find the geo-images that are similar in both geo-location and visual content. It is easy to understand that two people may become friends if they have the same hobbies and their positions are very close. Four similar geo-image pairs are found in Fig. 1. For pair 1, shown in the magenta rectangle, two users who took photos of basketball at two nearby places are very likely to become good friends.
To the best of our knowledge, this paper is the first to study the SVS-JOIN problem. We formally define the spatial visual similarity join and present the relevant notions. Besides, we discuss how to measure geographical similarity and visual similarity to find the similar geo-image pairs. To measure visual similarity accurately, we employ two types of visual features to generate visual representations of geo-images: (1) the traditional hand-crafted visual features named Scale-Invariant Feature Transform (SIFT for short) and (2) deep visual features extracted by convolutional neural networks (CNN for short). The former representation is named SIFT-BoVW and the latter is called Deep-BoVW. To address this challenge effectively and efficiently, an algorithm called SVS-JOIN$_{B}$, inspired by the techniques used in textual similarity join, is introduced. Based on it, we develop an extension of SVS-JOIN$_{B}$ called SVS-JOIN$_{G}$, which uses a spatial grid partition strategy to improve the efficiency. In order to further improve the search performance, a novel method named SVS-JOIN$_{Q}$ is designed, which is based on a quadtree and a global inverted index to speed up the search.
Contributions: Our main contributions can be summarized as follows:
• To the best of our knowledge, this work is the first to study the problem of spatial visual similarity join. We propose the definitions of geo-image, SVS-JOIN and relevant notions. The visual similarity function and geographical similarity function are designed for similar geo-image pair search.
• For the visual representation of geo-images, we propose to employ a CNN to extract deep visual features for visual word generation, rather than hand-crafted visual features. We call this method Deep-BoVW; it is a combination of deep CNN techniques, Bag-of-Visual-Words and the k-means clustering method. As far as we know, there is no existing research that uses the combination of CNN and BoVW to address the image JOIN problem.
• We introduce an algorithm named SVS-JOIN$_{B}$ inspired by the techniques used for the problem of textual similarity join. An extension of SVS-JOIN$_{B}$ called SVS-JOIN$_{G}$ is developed, which utilizes a spatial grid partition technique to speed up the search. To further improve the search performance, we present a novel method named SVS-JOIN$_{Q}$ that is based on a quadtree partition technique and a global inverted index.
• We have conducted comprehensive experiments on real geo-image datasets. Experimental results demonstrate that our approach addresses the SVS-JOIN problem effectively and efficiently.
Roadmap: The remainder of this paper is organized as follows. In Section II, we review previous research on content-based image retrieval, spatial textual search and set similarity joins, which are related to this work. In Section III, we propose the definition of spatial visual similarity join and relevant concepts. In Section IV, two visual representations named SIFT-BoVW and Deep-BoVW are presented. Besides, we introduce a baseline and an extension named SVS-JOIN$_{B}$ and SVS-JOIN$_{G}$ respectively. In addition, a novel algorithm named SVS-JOIN$_{Q}$ is proposed, in which a combination of a quadtree and a global inverted indexing structure is employed to solve the SVS-JOIN problem more efficiently. In Section V, we present the experimental results. Finally, we conclude the paper in Section VI.

II. RELATED WORK
In this section, we introduce the previous studies of content-based image retrieval, spatial textual search and set similarity joins, which are relevant to this work. To the best of our knowledge, there is no prior work on the SVS-JOIN problem.
A. CONTENT-BASED IMAGE RETRIEVAL

1) CBIR VIA SIFT
As one of the most important problems, content-based image retrieval (CBIR for short) [30]-[32] has gained the attention of many researchers in the multimedia area [33]-[35]. Scale-Invariant Feature Transform (SIFT for short) [36], [37] is one of the conventional methods for visual feature extraction. It transforms an image into a collection of local feature vectors. These features are invariant to translation, scaling and rotation, and partially invariant to illumination changes. In recent years, many works have used SIFT to overcome CBIR challenges. For example, Mortensen et al. [39] proposed a feature descriptor which augments SIFT with a global context vector that adds curvilinear shape information from a much larger neighborhood. Ke et al. [40] proposed a SIFT and PCA based method to encode the salient aspects of the image gradient in the neighborhood of a feature point. Su et al. [41] presented a horizontal or vertical mirror reflection invariant binary descriptor named MBR-SIFT to solve the problem of image matching. To gain sufficient distinctiveness and robustness for the task of feature matching, Li and Ma [42] designed a novel SIFT based feature descriptor by integrating color and global information. Zhu et al. [43] proposed an image registration algorithm called BP-SIFT by using belief propagation, which significantly improves keypoint matching.

2) CBIR VIA BoVW
Originating from text retrieval and mining, BoVW is an important visual representation method in multimedia retrieval and computer vision [5], [11], [44], [45]. BoVW [46] is a conventional image representation model that markedly improves the performance of image feature matching. For the image retrieval problem, it generates visual words by utilizing the k-means method to cluster SIFT features. Escalante et al. [47] presented an evolutionary algorithm that automatically learns weighting schemes of this model for computer vision tasks. Dimitrovski et al. [48] proposed to use predictive clustering trees (PCTs) to improve BoVW image retrieval in large-scale image databases. Mandal et al. [49] proposed a patch-based framework using the SIFT descriptor and the BoVW model to improve the performance of handwritten signature detection. Based on the S-BoVW paradigm, dos Santos et al. [50] proposed a novel method that considers texture information to generate textual signatures of image blocks. For the task of medical image retrieval, Zhang et al. [51] proposed a BoVW based approach named PD-LST retrieval to identify discriminative characteristics between different medical images with a pruned dictionary. Karakasis et al. [52] proposed a BoVW based framework for image retrieval, which uses affine image moment invariants as descriptors of local image areas.

VOLUME 7, 2019

3) CBIR VIA DEEP LEARNING
As a powerful tool, deep learning [53]-[55] is widely used to solve image retrieval [56]-[58] and computer vision problems. In 2012, AlexNet [59], proposed by Krizhevsky et al., significantly improved the accuracy of image retrieval. More recently, many deep learning based methods have been proposed for the CBIR task. Gordo et al. [60] proposed to generate compact global signatures via CNN for image retrieval. Fu et al. [61] utilized CNN to generate visual features and employed SVM for classification. Tzelepi et al. [62] proposed a novel CNN based approach that exploits the data label information to generate better descriptors for image retrieval. For the style based image retrieval task, Matsuo and Yanai [63] proposed to use a style vector that is transformed from a CNN based style matrix. Zhou et al. [64] proposed a CNN-based match kernel to encode CNN features and SIFT features to improve the accuracy. Liu et al. [65] combined high-level CNN features and low-level DDBTC features to generate two-layer codebook features for performance boosting. Seddati et al. [66] proposed to improve the Regional Maximal Activation (RMAC) approach by combining multi-scale and multi-layer feature extraction of different RMAC extensions. Yang et al. [67] utilized a dynamic match kernel with deep CNN features to search for images with different content details but similar semantics. Shimoda and Yanai [68] tested simple, siamese and triplet CNNs to generate good visual features for food image retrieval. For specific applications, Nakazawa and Kulkarni [69] presented a CNN based image classification method to solve the wafer map defect pattern recognition issue. Sarraf and Tofighi [70] employed CNN to distinguish fMRI images of Alzheimer's brains from those of normal healthy brains.
These solutions improve the performance of image retrieval and visual feature matching significantly. However, they cannot solve the problem of geo-multimedia data retrieval, as they provide no effective processing for geographical distance measurement.

B. SPATIAL TEXTUAL SEARCH

1) SPATIAL TEXTUAL QUERY
Due to the collection and storage of large-scale spatial textual data, there has been increasing interest in the spatial textual search problem [71], [72]. Spatial textual search [23], [73], [74] aims to retrieve textual objects or documents with geo-tags by textual similarity and geographical proximity. For top-k spatial keyword queries, Rocha-Junior et al. [75] proposed a novel spatial index named Spatial Inverted Index (S2I for short) to enhance the efficiency of search. Li et al. [76] proposed an efficient indexing structure named IR-tree, which enables spatial pruning and textual filtering to be performed simultaneously. Zhang et al. [77] presented a scalable integrated inverted index called I$^3$ that uses the quadtree to hierarchically partition the data space into cells. Zhang et al. [78] proposed an efficient index named inverted linear quadtree (IL-Quadtree for short) and designed a novel algorithm to improve query performance. Li et al. [79] presented BR-tree to solve the problem of keyword-based k-nearest neighbor queries. They utilized an R-tree to maintain the spatial information of objects and exploited a B-tree to maintain the terms in the objects. Fan et al. [22] proposed grid-based signatures and threshold-aware pruning techniques to address the spatio-textual similarity search problem. Zhang et al. [80] proposed to model the spatial keyword search problem as a top-k aggregation problem. They developed a rank-aware CA algorithm that works well on inverted lists sorted by textual relevance and spatial curving order. Wang et al. [81] proposed an efficient technique named AP-Tree to solve the problem of continuous spatial-keyword queries over streaming data. Zhang et al. [82] introduced the m-closest keywords (mCK for short) query that aims to find the spatially closest tuples which match m user-specified keywords. Guo et al. [83] proposed another solution to the mCK search problem. They devised a novel greedy algorithm named SKEC that has an approximation ratio of 2 and, in addition, developed two approximation algorithms called SKECa and SKECa+ to improve the efficiency.

2) SET SIMILARITY JOINS
In recent years, many researchers have paid attention to the problem of spatial textual similarity join [24], [84], [85]. A spatial similarity join of two spatial databases aims to find pairs of objects that are simultaneously similar in both textual and spatial aspects. Ballesteros et al. [25] proposed an algorithm based on the MapReduce parallel programming model to solve this problem on large-scale spatial databases. Efstathiades et al. [26] proposed the problem of Spatio-Textual Point-Set Join query and extended the existing methods to solve the spatial-textual join problem for point sets. Hu et al. [27] introduced a signature-based join framework that prunes large numbers of dissimilar pairs to enhance the search efficiency. To overcome the issue of large numbers of duplicates, Rong et al. [28] introduced a novel duplicate-free framework with three filtering methods to prune dissimilar string pairs without computing their similarity scores. Shang et al. [29] presented a knowledge hierarchy based filter-and-verification framework to efficiently identify the similar pairs and thereby address the knowledge-aware similarity join problem.
These spatial textual search and similarity join approaches consider only textual and spatial information. Consequently, they cannot be directly applied to the geo-image join problem, even though they raise search efficiency substantially. Thus, this paper proposes to combine geographical information and visual representations of geo-images to construct efficient search algorithms for the spatial visual similarity join problem.

III. PRELIMINARIES
In this section, we propose the definition of spatial visual similarity join (SVS-JOIN) for the first time, and then present the geographical and visual similarity measurements. Besides, we briefly introduce the SIFT and CNN techniques, which are the basis of our work. Table 1 summarizes the notations frequently used throughout this paper to facilitate the discussion.

A. PROBLEM DEFINITION
Definition 1 (Geo-Image): Let $D_I = \{I_1, I_2, \ldots, I_{|D_I|}\}$ be a geo-image dataset, where $|D_I|$ denotes the size of $D_I$. A geo-image $I_i \in D_I$ is defined as a tuple $I_i = \langle I_i.G, I_i.V \rangle$, where $I_i.G$ is the geographical information component that is generated from the geo-tag of this image. More specifically, it consists of longitude $X$ and latitude $Y$, i.e., $I_i.G = (X, Y)$. $I_i.V$ is the visual information component, i.e., the set of visual words that describes the visual content of the image.

Consider two geo-image datasets $R_I = \{I^r_1, I^r_2, \ldots, I^r_{|R_I|}\}$ and $S_I = \{I^s_1, I^s_2, \ldots, I^s_{|S_I|}\}$. Similar to spatial textual similarity join, a spatial visual similarity join aims to retrieve all pairs of geo-images, one from $R_I$ and one from $S_I$, that are similar enough in both geo-location and visual content. We introduce two thresholds, i.e., the geographical similarity threshold and the visual similarity threshold, to measure these two similarities. Specifically, for each returned pair, the geographical similarity and the visual similarity of the two geo-images are no less than the geographical similarity threshold and the visual similarity threshold, respectively. To state our work more clearly, we define the spatial visual similarity join as follows.

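Definition 2 (Spatial Visual Similarity Join (SVS-JOIN)): Given two geo-image datasets $R_I = \{I^r_1, I^r_2, \ldots, I^r_{|R_I|}\}$ and $S_I = \{I^s_1, I^s_2, \ldots, I^s_{|S_I|}\}$, a geographical similarity threshold $\epsilon_G$ and a visual similarity threshold $\epsilon_V$, a spatial visual similarity join denoted as SVS-JOIN$(R, S, \epsilon_G, \epsilon_V)$ returns a set of geo-image pairs $P \subseteq R \times S$, in which each pair contains two highly similar geo-images in both geo-location and visual content, i.e.,

$$P = \{(I^r_i, I^s_j) \mid \mathrm{GeoSim}(I^r_i, I^s_j) \ge \epsilon_G,\ \mathrm{VisSim}(I^r_i, I^s_j) \ge \epsilon_V,\ I^r_i \in R,\ I^s_j \in S\},$$

where $\mathrm{GeoSim}(I^r_i, I^s_j)$ and $\mathrm{VisSim}(I^r_i, I^s_j)$ are the geographical similarity function and the visual similarity function respectively.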
To measure these two similarities quantitatively, we utilize Euclidean distance measurement and Jaccard distance measurement to implement these two functions, shown as follows.
Definition 3 (Geographical Similarity Function): Given two geo-image datasets $R_I = \{I^r_1, I^r_2, \ldots, I^r_{|R_I|}\}$ and $S_I = \{I^s_1, I^s_2, \ldots, I^s_{|S_I|}\}$, $\forall I^r_i \in R_I$, $I^s_j \in S_I$, the geographical similarity between $I^r_i$ and $I^s_j$ is measured by the following similarity function:

$$\mathrm{GeoSim}(I^r_i, I^s_j) = 1 - \frac{\mathrm{EucDst}(I^r_i, I^s_j)}{\mathrm{MaxDis}(R, S)},$$

where $\mathrm{EucDst}(I^r_i, I^s_j)$ is the Euclidean distance between $I^r_i$ and $I^s_j$, which is measured by the following function:

$$\mathrm{EucDst}(I^r_i, I^s_j) = \sqrt{(I^r_i.G.X - I^s_j.G.X)^2 + (I^r_i.G.Y - I^s_j.G.Y)^2}.$$

The function $\mathrm{MaxDis}(R, S)$ returns the maximum Euclidean distance between any two geo-images from $R$ and $S$ respectively, which is formally described as follows:

$$\mathrm{MaxDis}(R, S) = \max(\{\mathrm{EucDst}(I^r_i, I^s_j) \mid I^r_i \in R,\ I^s_j \in S\}),$$

where the function $\max(\cdot)$ returns the maximum element of a set.

Definition 4 (Visual Similarity Function): Given two geo-image datasets $R_I = \{I^r_1, I^r_2, \ldots, I^r_{|R_I|}\}$ and $S_I = \{I^s_1, I^s_2, \ldots, I^s_{|S_I|}\}$, $\forall I^r_i \in R_I$, $I^s_j \in S_I$, the visual similarity between $I^r_i$ and $I^s_j$ is measured by the following similarity function:

$$\mathrm{VisSim}(I^r_i, I^s_j) = \frac{\sum_{v \in I^r_i.V \cap I^s_j.V} w(v)}{\sum_{v \in I^r_i.V \cup I^s_j.V} w(v)},$$

where $w(v)$ represents the weight of the visual word $v$.
In this work, we measure the weight of a visual word by term frequency-inverse document frequency (tf-idf) [86].

Assumption: For ease of discussion, here we assume that $R = S$. Our approach can also be applied to the case of $R \neq S$. Therefore, for a geo-image dataset $R$, we denote a spatial visual similarity join as SVS-JOIN$(R, \epsilon_G, \epsilon_V)$.
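The definitions above can be illustrated with a minimal Python sketch of the two similarity functions and an exhaustive self-join. It is only an illustration under stated assumptions: geo-images are modelled as dictionaries with hypothetical keys `id`, `G` and `V`, the names `eps_g` and `eps_v` stand for the two thresholds, all visual words are given weight 1.0, and the quadratic loop is a reference implementation rather than any of the indexed algorithms developed in this paper.

```python
import math
from itertools import combinations

def euc_dst(a, b):
    """Euclidean distance between two (x, y) locations."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def geo_sim(a, b, max_dis):
    """GeoSim = 1 - EucDst / MaxDis (assumed to be 1.0 if all points coincide)."""
    return 1.0 if max_dis == 0 else 1.0 - euc_dst(a, b) / max_dis

def vis_sim(va, vb, weight):
    """Weighted Jaccard similarity between two sets of visual words."""
    union = va | vb
    if not union:
        return 0.0
    return sum(weight[v] for v in va & vb) / sum(weight[v] for v in union)

def naive_svs_join(images, eps_g, eps_v, weight):
    """Self-join (R = S) by exhaustive pairwise comparison, O(n^2)."""
    max_dis = max(
        (euc_dst(a["G"], b["G"]) for a, b in combinations(images, 2)),
        default=0.0,
    )
    result = []
    for a, b in combinations(images, 2):
        if (geo_sim(a["G"], b["G"], max_dis) >= eps_g
                and vis_sim(a["V"], b["V"], weight) >= eps_v):
            result.append((a["id"], b["id"]))
    return result

# Toy data: images 1 and 2 are close and share two of four visual words.
images = [
    {"id": 1, "G": (0.0, 0.0), "V": {"v1", "v2", "v3"}},
    {"id": 2, "G": (1.0, 0.0), "V": {"v1", "v2", "v4"}},
    {"id": 3, "G": (10.0, 0.0), "V": {"v1", "v2", "v3"}},
]
weight = {"v1": 1.0, "v2": 1.0, "v3": 1.0, "v4": 1.0}
pairs = naive_svs_join(images, eps_g=0.8, eps_v=0.5, weight=weight)
print(pairs)
```

In this toy run, image 3 has the same visual words as image 1 but lies at the maximum distance, so only the pair (1, 2) satisfies both thresholds.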

B. SCALE-INVARIANT FEATURES TRANSFORM
Our first visual representation scheme uses SIFT [36], [37]. This conventional technique aims to transform an image into a large set of local feature vectors, which are invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. It has four main phases:

1) SCALE-SPACE EXTREMA DETECTION
The first phase is called scale-space extrema detection. It searches over all scales and image locations and uses a difference-of-Gaussian (DoG) function to identify potential points of interest that are invariant to scale and orientation.

2) KEYPOINT LOCALIZATION
The second phase is named keypoint localization, which selects and localizes the keypoints according to their stability. At each candidate location, a detailed model is fitted to determine the location and scale.

3) ORIENTATION ASSIGNMENT
In the orientation assignment phase, each keypoint is assigned one or more orientations according to the local gradient directions of the image. All subsequent operations are performed on image data that has been transformed relative to the assigned orientation, scale and location of each keypoint, which provides invariance of the features to these transformations.

4) KEYPOINT DESCRIPTOR
In the last phase, the local gradients of the image are measured around each feature point at the selected scale. These gradients are then transformed into a representation that tolerates significant local shape distortion and illumination change.

C. CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Network (CNN for short) was first proposed by Yann LeCun in 1998 [38]. A typical CNN, shown in Fig. 3, consists of several convolutional layers, pooling layers and fully-connected layers. The convolutional layers and pooling layers cooperate to form multiple convolution groups that extract visual features from low level to high level, layer by layer, and classification is finally completed by several fully-connected layers. A CNN discriminates features through convolution operations and reduces the number of model parameters through weight sharing and pooling. The superiority of the CNN originates from four key ideas [53]: (1) local connections, (2) shared weights, (3) pooling and (4) the use of many layers.
This powerful technique has been applied successfully in many tasks, e.g., image retrieval, visual understanding, pattern classification, etc. In this work, we employ CNN as the second method of visual representation.

IV. METHODOLOGY
In this section, we introduce a novel framework to solve the SVS-JOIN problem. This framework supports two schemes for visual word generation: one utilizes hand-crafted visual features, namely SIFT, in a conventional manner; the other produces visual words by generating deep visual representations via a CNN, which is a better method to capture high-level semantic concepts from inputs. In addition, inspired by the algorithms for textual similarity joins, we introduce a baseline named SVS-JOIN$_{B}$ and propose a spatial grid partition based algorithm for the SVS-JOIN task called SVS-JOIN$_{G}$. As an alternative to SVS-JOIN$_{G}$, a novel quadtree based global index method named SVS-JOIN$_{Q}$ is designed, which can speed up the search significantly.

Fig. 4 illustrates the proposed framework for the SVS-JOIN problem. As discussed above, SVS-JOIN is a geo-image-oriented search problem, which means that the input of the system is a geo-image database. Therefore, the first priority is to generate the representations of geo-images. Two visual representation schemes are proposed in this framework. The first, called SIFT-BoVW, utilizes hand-crafted visual features generated via the SIFT descriptor in a conventional manner and then uses the BoVW model to produce visual words for each geo-image. The second produces visual words from deep visual representations generated by a CNN model. Similar to the first scheme, we build the deep visual dictionary based on the feature representations generated by the CNN and represent all the geo-images by deep visual words. We call this scheme Deep-BoVW. Both visual representation schemes are based on the BoVW model, which is the basis of our geo-image index technique. In this work, two geo-image index structures are designed: the first is a combination of spatial grid partition and inverted index, and the second is a quadtree partition based inverted index. Based on these two indexes and the similarity measurements $\mathrm{GeoSim}(I^r_i, I^s_j)$ and $\mathrm{VisSim}(I^r_i, I^s_j)$, we develop two efficient SVS-JOIN algorithms, namely SVS-JOIN$_{G}$ and SVS-JOIN$_{Q}$.

B. VISUAL REPRESENTATION SCHEMES
In this subsection, we introduce the two visual representation schemes in detail, namely SIFT-BoVW and Deep-BoVW. Both of them are based on the BoVW model. To represent a geo-image as a collection of visual words, we propose to use two different methods to generate the visual word representation, namely SIFT and deep CNN.

1) SIFT-BoVW
In this scheme, the Dense-SIFT technique, an extension of SIFT, is employed to extract visual features from geo-images. In other words, it maps each geo-image into a 128-dimensional feature vector. After that, we utilize the k-means clustering method to construct the SIFT visual dictionary by converting feature vectors into visual words. Let $\{I_1, I_2, \ldots, I_n\}$ be a set of geo-images; their feature vectors are denoted by $\{x_1, x_2, \ldots, x_n\} = \mathrm{SIFT}(\{I_1, I_2, \ldots, I_n\})$, where $\mathrm{SIFT}(\cdot)$ is the SIFT feature extractor. According to the distance between these SIFT feature vectors, the k-means method groups them into $k$ clusters $G = \{g_1, g_2, \ldots, g_k\}$, which can be formulated by the following objective function:

$$\arg\min_{G} \sum_{i=1}^{k} \sum_{x \in g_i} \| x - \chi_i \|^2,$$

where $\chi_i$ is the mean vector of the cluster $g_i$, namely, $\chi_i = \frac{1}{|g_i|} \sum_{x \in g_i} x$, and the $\ell_2$ norm is used to measure the distance between the mean vector and each visual feature vector. After the clustering, the SIFT visual dictionary with $k$ visual words has been constructed, namely

$$\{v_1, v_2, \ldots, v_k\} = \mathrm{KMEANS}(\{x_1, x_2, \ldots, x_n\}),$$

where $\mathrm{KMEANS}(\cdot)$ is the k-means algorithm. The SIFT visual dictionary is used to encode each geo-image as a $k$-dimensional vector that holds the statistics of each visual word. As mentioned in Definition 4, we measure the weight of a visual word by tf-idf, namely,

$$w(v) = \frac{\eta(v, I)}{\sum_{v'} \eta(v', I)} \cdot \log \frac{|D_I|}{|\{I' \in D_I : \eta(v, I') > 0\}|},$$

where $\eta(v, I)$ denotes the number of occurrences of the visual word $v$ in the image $I$.

FIGURE 4. The framework to solve the SVS-JOIN problem. Best viewed in color. This framework supports two schemes for visual word generation: one utilizes hand-crafted visual features, namely SIFT, in a conventional manner; the other produces visual words by generating deep visual representations via a CNN, which is a better method to capture high-level semantic concepts from inputs. Besides, based on the visual word representations and geographical information, two geo-visual index structures are integrated in this framework to organize geo-images efficiently: the first is a combination of spatial grid partition and inverted index, and the second is a quadtree partition based inverted index. Based on these two index techniques and the geographical and visual similarity measurements $\mathrm{GeoSim}(I^r_i, I^s_j)$ and $\mathrm{VisSim}(I^r_i, I^s_j)$, two efficient SVS-JOIN algorithms are developed, namely SVS-JOIN$_{G}$ and SVS-JOIN$_{Q}$.
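The dictionary construction and encoding steps described for SIFT-BoVW can be sketched in a few lines of plain Python. This is a hedged illustration, not the implementation used in the paper: the 2-dimensional toy descriptors stand in for 128-dimensional SIFT vectors, each image is assumed to contribute several local descriptors, the helper names (`nearest`, `kmeans`, `encode`, `tf_idf`) are hypothetical, k-means is seeded with the first `k` vectors to stay deterministic, and the term frequency is normalised by the descriptor count, which is one common tf-idf variant.

```python
import math
from collections import Counter

def nearest(v, centroids):
    """Index of the centroid closest to v under the Euclidean (l2) norm."""
    return min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))

def kmeans(vectors, k, iters=20):
    """Plain Lloyd's k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def encode(descriptors, centroids):
    """Count how often each visual word (centroid index) occurs in one image."""
    return Counter(nearest(d, centroids) for d in descriptors)

def tf_idf(counts_per_image):
    """tf-idf weight of every visual word in every image."""
    n = len(counts_per_image)
    df = Counter(w for counts in counts_per_image for w in counts)
    weights = []
    for counts in counts_per_image:
        total = sum(counts.values())
        weights.append(
            {w: (c / total) * math.log(n / df[w]) for w, c in counts.items()}
        )
    return weights

# Toy descriptors forming two obvious clusters, near (0, 0) and near (10, 10).
descriptors = {
    "A": [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
    "B": [(0.0, 0.5), (1.0, 0.0)],
    "C": [(10.0, 11.0), (11.0, 10.0)],
}
pooled = [d for ds in descriptors.values() for d in ds]
dictionary = kmeans(pooled, k=2)
counts = [encode(ds, dictionary) for ds in descriptors.values()]
weights = tf_idf(counts)
print(counts, weights)
```

Image A ends up with two occurrences of visual word 0 and one of visual word 1, while B and C each contain a single visual word. A production system would more likely rely on an optimised k-means implementation with better seeding.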

2) Deep-BoVW
As the second and more powerful scheme, we propose to integrate a deep CNN and the BoVW model to generate the deep visual word representation named Deep-BoVW. Compared with the SIFT based method, feature extraction in a deep convolutional manner can capture rich high-level semantic concepts, which makes it more powerful than conventional hand-crafted features that carry little semantic information. The process is quite similar to SIFT-BoVW: a deep CNN extracts visual features from low level to high level, layer by layer, and then a deep visual dictionary is built on these visual features by the k-means algorithm and used to encode geo-images.
Specifically, we employ a pre-trained deep CNN model, namely AlexNet [59], for the task of visual feature extraction. AlexNet consists of five convolutional layers, some of which are followed by max-pooling layers, three fully-connected layers and a 1000-way softmax layer that is used for classification. In this work, the input images are resized to 227 × 227 pixels, and we use the feature representations of the fifth convolutional layer, with the size 13 × 13 × 256, to generate the deep visual dictionary. For a geo-image set $\{I_1, I_2, \ldots, I_n\}$, the deep visual representation set is denoted by $\{\zeta_1, \zeta_2, \ldots, \zeta_n\} = \mathrm{CONV}(\{I_1, I_2, \ldots, I_n\}; \theta)$, wherein $\mathrm{CONV}(\cdot; \theta)$ is the deep convolutional feature extractor, $\theta$ denotes the network parameters, and $\forall \zeta_i \in \{\zeta_1, \zeta_2, \ldots, \zeta_n\}$, $\zeta_i = (\zeta^{(1)}_i, \zeta^{(2)}_i, \ldots)$.

AlexNet is not the only choice for feature extraction. In the experiments, we also employ two other off-the-shelf CNN models, i.e., VGGNet-16 [87] and GoogLeNet [88], for performance evaluation. VGGNet is a powerful deep convolutional neural network developed by the Oxford University Computer Vision group and DeepMind researchers in 2014, and GoogLeNet is another outstanding deep CNN model from the same year, which utilizes a novel structure named Inception. Both of them are very powerful for computer vision tasks. To facilitate the discussion, we name these CNN based schemes AlexNet-BoVW, VGGNet-BoVW and GoogLeNet-BoVW respectively in the comparative experiments.

C. THE BASELINE FOR SPATIAL VISUAL SIMILARITY JOINS
In this section, we propose the baseline for the SVS-JOIN problem. Firstly, we introduce the state-of-the-art algorithm for textual similarity joins named PPJOIN [90], which is utilized in our baseline. Then we present our baseline, named SVS-JOIN$_{B}$, in detail.

1) THE METHOD FOR TEXTUAL SIMILARITY JOINS

a: INVERTED INDEX BASED METHOD
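The traditional way to solve textual similarity joins efficiently is to build an inverted index for the target object dataset $R$, which associates each word $w$ in the global word set $W$, built beforehand, with an inverted list $L_w$ of the objects that contain it. Objects that share at least one word form candidate pairs, and the join returns the pairs whose textual similarity reaches the threshold, i.e.,

$$P = \{(o, \hat{o}) \mid o, \hat{o} \in R,\ \mathrm{TxtSim}(o, \hat{o}) \ge \theta_t\},$$

where $\mathrm{TxtSim}(o, \hat{o})$ is the textual similarity function and $\theta_t$ is the textual similarity threshold.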
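As a rough illustration of the inverted index based method, the following Python sketch probes the index for previously seen objects that share a word with the current object and verifies each candidate. It assumes that TxtSim is the (unweighted) Jaccard similarity, which is an assumption made only for this example; the names `jaccard` and `inverted_index_join` are hypothetical.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity of two keyword sets (assumed form of TxtSim)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def inverted_index_join(objects, theta_t):
    """Self-join: probe the index for earlier objects sharing a word, then verify."""
    index = defaultdict(list)  # word -> ids of objects that contain it
    result = []
    for oid, words in objects.items():
        # Candidate generation: every indexed object that shares a word.
        candidates = set()
        for w in words:
            candidates.update(index[w])
        # Verification: compute the actual similarity of each candidate pair.
        for cid in sorted(candidates):
            if jaccard(words, objects[cid]) >= theta_t:
                result.append((cid, oid))
        # Index the current object so that later objects can find it.
        for w in words:
            index[w].append(oid)
    return result

objects = {
    1: {"a", "b", "c"},
    2: {"a", "b", "d"},
    3: {"x", "y"},
    4: {"a", "b", "c", "d"},
}
pairs = inverted_index_join(objects, theta_t=0.6)
print(pairs)
```

Objects 1 and 2 are generated as a candidate pair because they share two words, but their similarity of 0.5 fails verification. With frequent words, many such candidates must be verified, which is the cost that prefix filtering aims to reduce.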

b: PREFIX FILTERING PRINCIPLE
When we use the inverted index based method, the inverted list $L_w$ will be quite long if the word $w$ is very frequent in the dataset. This becomes a major challenge, since a lot of candidate pairs have to be generated in this situation. To reduce the size of the candidate set, an efficient method called the prefix filtering principle was devised in [89]. According to this technique, we generate a global word ordering $\mathcal{O}$ that sorts keywords by word frequency in reverse order, and then, for all objects $o \in R$, order the keywords in $o.V$ by $\mathcal{O}$. After the ordering, the prefix of $o.V$ is denoted as $Pf(o.V)$ and its length is denoted as $|Pf(o.V)|$, which is measured by the following equation:

$$|Pf(o.V)| = |o.V| - \lceil \theta_t |o.V| \rceil + 1,$$

where $|o.V|$ represents the number of keywords in $o.V$ and $\theta_t$ is the textual similarity threshold. The length of the prefix of an object is thus determined by the number of keywords contained in this object and the similarity threshold given in advance. According to this principle, we can get the following theorem:

Theorem 1: Given two objects $o$ and $\hat{o}$ whose keywords are ordered by $\mathcal{O}$, if $\mathrm{TxtSim}(o, \hat{o}) \ge \theta_t$, then $Pf(o.V) \cap Pf(\hat{o}.V) \neq \emptyset$.

The basic idea of Theorem 1 is that if the textual similarity between two objects reaches the threshold, their prefixes must share at least one keyword. Therefore, this theorem can be used to prune the candidate pair set effectively. Specifically, for each object $o$, we only need to search the keywords contained in the prefix of $o$.

Algorithm 1 PPJOIN Algorithm
1: INPUT: $R$ is an objects dataset sorted by a global ordering $\mathcal{O}$, $\theta_t$ is a textual similarity threshold.
2: OUTPUT: $P$ is the result pairs set.
3: for each $w \in W$ do
4:     $L_w \leftarrow \emptyset$;
5: end for
6: for each $o \in R$ do
7:     $|Pf(o.V)|_S \leftarrow |o.V| - \lceil \theta_t |o.V| \rceil + 1$;
8:     $|Pf(o.V)|_I \leftarrow |o.V| - \lceil \frac{2\theta_t}{\theta_t + 1} |o.V| \rceil + 1$;
9:     for $i = 1$ to $|Pf(o.V)|_S$ do
10:         $w \leftarrow$ the $i$-th keyword in $o.V$;
11:         for $e(\hat{o}, i_{\hat{o}}) \in L_w$ and $|\hat{o}.V| \ge \theta_t |o.V|$ do
12:             if …
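The prefix length formula and the pruning rule of Theorem 1 can be tried out with the small Python sketch below. It assumes Jaccard similarity and orders words by ascending document frequency with alphabetical tie-breaking, which is one common choice of global ordering and an assumption of this example; the function names are hypothetical, and the small constant subtracted inside `ceil` only guards against floating-point noise.

```python
import math
from collections import Counter

def global_order(objects):
    """Rank words by ascending document frequency (ties broken alphabetically)."""
    df = Counter(w for words in objects.values() for w in words)
    return {w: r for r, w in enumerate(sorted(df, key=lambda w: (df[w], w)))}

def prefix(words, rank, theta_t):
    """First |o.V| - ceil(theta_t * |o.V|) + 1 words of o.V under the global order."""
    ordered = sorted(words, key=rank.__getitem__)
    length = len(ordered) - math.ceil(theta_t * len(ordered) - 1e-9) + 1
    return ordered[:length]

def prefix_candidates(objects, theta_t):
    """Keep only the pairs whose prefixes share at least one word."""
    rank = global_order(objects)
    prefixes = {oid: set(prefix(ws, rank, theta_t)) for oid, ws in objects.items()}
    ids = sorted(objects)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if prefixes[a] & prefixes[b]
    ]

objects = {
    1: {"a", "b", "c", "d", "e"},
    2: {"a", "b", "c", "d", "e", "f"},
    3: {"e", "f", "g", "h", "i"},
}
rank = global_order(objects)
print(prefix(objects[1], rank, 0.8), prefix(objects[3], rank, 0.8))
print(prefix_candidates(objects, 0.8))
```

With a threshold of 0.8, every prefix has length 2. Object 3 shares the words "e" and "f" with the other objects, but neither appears in its prefix, so both pairs involving object 3 are pruned without any similarity computation.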

c: THE PPJOIN ALGORITHM
PPJOIN, developed by Xiao et al. [90], is one of the efficient algorithms for the textual similarity join problem. It combines positional filtering with prefix filtering.
Algorithm 1 presents the pseudo-code of the PPJOIN algorithm. The input of this algorithm is a textual similarity threshold θ_t and an object dataset R that is sorted in ascending order of object size. First, in Lines 3 to 5, it initializes the inverted list L_w for each word in the global word set. Then, from Line 6 to Line 23, the algorithm traverses every object in the input dataset R and finds the similar pairs of objects. Specifically, for each object o ∈ R, the probe prefix length and the index prefix length are calculated (Lines 7-8). After that, the filter condition |ô.V| ≥ θ_t · |o.V| is used to filter the candidate pairs. The positional and suffix filters are applied by calling the two procedures QualifyPosFilter(o, i_o, ô, i_ô) and QualifySufFilter(o, i_o, ô, i_ô) (Lines 11-17). The overlap is added if the pair passes these filters. After that, in Lines 18-20, the inverted list L_w of each visual word is extended by indexing both geo-objects and geo-locations. Finally, the algorithm generates the result set P by executing the verification procedure Verify(o, o, P), which checks the actual overlap between o and the current candidates.
VOLUME 7, 2019
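The overall probe, filter, index and verify loop can be sketched as follows. This is a simplified illustration rather than the PPJOIN listing itself: it keeps prefix filtering and the size filter but omits the positional and suffix filters. It also assumes Jaccard similarity as TxtSim and assumes each record is a list of distinct tokens already sorted by one global ordering. The name `prefix_filter_join` is hypothetical.

```python
import math
from collections import defaultdict

EPS = 1e-9  # tolerance for floating point comparisons

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_filter_join(records, theta_t):
    """Return all id pairs (i, j), i < j, whose Jaccard similarity is >= theta_t."""
    # Objects are processed in ascending order of size, as Algorithm 1 assumes.
    order = sorted(range(len(records)), key=lambda r: len(records[r]))
    index = defaultdict(list)  # token -> ids of already indexed records
    result = set()
    for rid in order:
        rec = records[rid]
        n = len(rec)
        if n == 0:
            continue
        probe_len = n - math.ceil(theta_t * n - EPS) + 1
        index_len = n - math.ceil(2 * theta_t / (theta_t + 1) * n - EPS) + 1
        # Probe the inverted lists of the prefix tokens and apply the size filter.
        candidates = set()
        for token in rec[:probe_len]:
            for other in index[token]:
                if len(records[other]) >= theta_t * n - EPS:
                    candidates.add(other)
        # Verify every surviving candidate with the exact similarity.
        for other in candidates:
            if jaccard(rec, records[other]) >= theta_t - EPS:
                result.add((min(rid, other), max(rid, other)))
        # Index the current record under its index prefix tokens.
        for token in rec[:index_len]:
            index[token].append(rid)
    return result

records = [[1, 2, 3, 4], [1, 2, 3, 5], [2, 3, 4, 5, 6], [7, 8, 9]]
print(prefix_filter_join(records, 0.6))
```

Because the positional and suffix filters are left out, this sketch verifies more candidates than PPJOIN would, but it returns the same result pairs.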

2) THE BASELINES FOR SVS-JOIN
In this subsection, we introduce the baseline approach. Inspired by the prefix filtering principle and the PPJOIN algorithm, we propose a baseline called SVS-JOIN B for the SVS-JOIN problem. Unlike textual similarity joins, our method considers two aspects of information, i.e., geographical information and visual information. We set two thresholds, G and V, for the geographical similarity and the visual similarity respectively. According to the definition of SVS-JOIN, we implement two procedures, GeoSim(I, Î) and VisSim(I, Î), to measure these two similarities.

a: SVS-JOIN B ALGORITHM
Algorithm 2 presents the computing process in the form of pseudo-code. The input is a geo-image dataset sorted by a global ordering and two thresholds G and V. At the beginning of the process, the inverted list L_w is initialized for each visual word. All the objects in R are accessed iteratively from Line 4 to Line 21. For each object, the probe and index prefix lengths are calculated (Lines 5-6). The positional filter and suffix filter procedures are invoked in the same way as in PPJOIN. Different from Algorithm 1, in Line 9, in addition to the filter condition |ô.V| ≥ V · |o.V|, GeoSim(I, Î) is called as a geographical similarity filter to prune the geo-image pairs whose spatial distance is not short enough. In Line 22, the procedure Verify(I, I, P) generates the final result set from the candidate set P based on I.
Although the SVS-JOIN B algorithm can effectively deal with the SVS-JOIN problem, its efficiency can still be improved significantly. Only the geo-images Î that satisfy the filter condition GeoSim(I, Î) ≤ G need to be considered. However, the SVS-JOIN B algorithm considers all the geo-images Î ∈ L_w for each visual word w contained in Pf(I.V), which is its main limitation. To overcome this limitation, in the following we present a grid based spatial partition strategy and develop a more efficient algorithm named SVS-JOIN G that extends SVS-JOIN B.
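As a reference point, the join predicate that the baseline has to evaluate can be checked by brute force. The sketch below makes two assumptions that this section does not fix: GeoSim is taken to be the Euclidean distance between locations, and VisSim is taken to be the Jaccard similarity of visual word sets. The names `geo_dist`, `vis_sim` and `naive_svs_join` are hypothetical.

```python
import math

def geo_dist(p, q):
    """Assumed GeoSim: Euclidean distance between two (x, y) locations."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def vis_sim(a, b):
    """Assumed VisSim: Jaccard similarity of two visual word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def naive_svs_join(images, g, v):
    """images: list of (location, visual_words); returns the qualifying index pairs."""
    pairs = []
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            (loc_i, words_i), (loc_j, words_j) = images[i], images[j]
            if geo_dist(loc_i, loc_j) > g:  # cheap geographical filter first
                continue
            if vis_sim(words_i, words_j) >= v:
                pairs.append((i, j))
    return pairs

images = [
    ((0.10, 0.10), {1, 2, 3, 4}),
    ((0.11, 0.10), {1, 2, 3, 5}),
    ((0.90, 0.90), {1, 2, 3, 4}),
    ((0.10, 0.12), {7, 8, 9}),
]
print(naive_svs_join(images, 0.05, 0.6))
```

This quadratic scan is only useful for validating a faster algorithm on small inputs; the prefix, grid and quadtree techniques exist precisely to avoid comparing every pair.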

b: SPATIAL GRID
Algorithm 2 SVS-JOIN B Algorithm
…
22: Verify(I, I, P);
23: end for
24: return P;

We propose a grid based spatial partition strategy named spatial grid to improve the performance of the algorithm. This strategy models the two-dimensional spatial area of dataset R as a grid, denoted as G(R), that contains several cells whose side length in each dimension equals the geographical similarity threshold G. Thus, the area of each cell equals G^2. Clearly, the spatial grid is determined by a spatial visual similarity join with dataset R and threshold G. In other words, for a given dataset R, the grid G(R) does not need to be pre-computed. Fig. 5(a) shows how to generate candidate pairs by using the spatial grid. The number in a cell is the cell id. We assume that a geo-image is located in cell 57, colored yellow and denoted C_57. To retrieve the candidate pairs (I, Î), only C_57 and its eight neighbor cells, colored light yellow, need to be accessed because of the restriction imposed by the geographical similarity threshold. Therefore, for one geo-image, we only need to check nine cells in total to find its partner to form a candidate pair. If the currently accessed cell is on the edge of the grid, such as C_2, only six cells need to be checked. Thus, the search space is reduced significantly by this strategy. We then apply the spatial similarity filter to find the results within these cells.
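The cell lookup can be sketched as below. The row-major, zero-based cell numbering and the 10 × 10 grid are assumptions made for illustration, since the numbering used in Fig. 5(a) is not reproduced here. `cell_of` and `neighbor_cells` are hypothetical helper names.

```python
def cell_of(x, y, min_x, min_y, side, cols):
    """Map a location to its cell id; `side` plays the role of the threshold G."""
    col = int((x - min_x) // side)
    row = int((y - min_y) // side)
    return row * cols + col

def neighbor_cells(cell_id, rows, cols):
    """Ids of the cell itself and of its adjacent cells inside the grid."""
    row, col = divmod(cell_id, cols)
    cells = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            r, c = row + d_row, col + d_col
            if 0 <= r < rows and 0 <= c < cols:
                cells.append(r * cols + c)
    return cells

print(neighbor_cells(57, 10, 10))  # interior cell: nine cells
print(neighbor_cells(2, 10, 10))   # cell on the edge: six cells
```

Under this numbering a corner cell has only four cells to check, which is the remaining case besides the interior and edge cells discussed above.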
Algorithm 3 SVS-JOIN G Algorithm
…
end for
9: end for
10: return P;

In SVS-JOIN G, a spatial grid is constructed for the input dataset R as the basic spatial data structure. The geo-images in R are then accessed in ascending order of their cell id. For each cell C_i, the algorithm obtains a set of cells denoted as M[C_i], whose geo-images will be joined with all of the geo-images in C_i. In M[C_i], the neighbor cells of C_i have smaller ids than C_i itself.
There are some differences between SVS-JOIN B and SVS-JOIN G. For example, the SVS-JOIN G algorithm builds an inverted index for each cell in the grid, rather than a global index. Therefore, for each visual word w in the global visual dictionary, every cell C_i has its own inverted list C_i.L_w.
Algorithm 3 presents the process of the SVS-JOIN G algorithm. Similar to the SVS-JOIN B algorithm, the input consists of a geo-image dataset R sorted by a global ordering, a visual similarity threshold V and a geographical similarity threshold G. The first step is to build a spatial grid G(R) for R, shown in Line 3. The geo-images are ordered according to cell id and |I.V|. After this step, the algorithm traverses G(R) and performs the join cell by cell. For each cell C_i, the procedure GetJoinCell(G(R), C_i) is executed to get the cell set M[C_i]. For each cell C_j ∈ M[C_i], the algorithm executes SVS-JOIN B(C_i, C_j, G, V) to return the final result set. It is worth noting that the geo-images located in each cell C_i are checked several times, which means that more buffers need to be created to store the cells for later processing.
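The cell-by-cell traversal can be sketched as follows, restricted to the geographical part of the join. The sketch assumes that M[C_i] holds the cell itself plus the four adjacent cells with a smaller row-major id, so that every pair of adjacent cells is joined exactly once. It also replaces the call to SVS-JOIN B with a plain Euclidean distance check. `get_join_cells` and `grid_geo_join` are hypothetical names.

```python
import math
from collections import defaultdict

def get_join_cells(row, col):
    """The cell itself plus the four adjacent cells with a smaller row-major id."""
    return [(row, col), (row, col - 1),
            (row - 1, col - 1), (row - 1, col), (row - 1, col + 1)]

def grid_geo_join(points, g):
    """Return all index pairs of points whose Euclidean distance is at most g."""
    grid = defaultdict(list)  # (row, col) -> ids of the points in that cell
    for pid, (x, y) in enumerate(points):
        grid[(int(y // g), int(x // g))].append(pid)
    pairs = set()
    for (row, col) in sorted(grid):  # ascending order of cell id
        for other in get_join_cells(row, col):
            if other not in grid:
                continue
            for a in grid[(row, col)]:
                for b in grid[other]:
                    if other == (row, col) and a >= b:
                        continue  # avoid self pairs and duplicates inside one cell
                    (ax, ay), (bx, by) = points[a], points[b]
                    if math.hypot(ax - bx, ay - by) <= g:
                        pairs.add((min(a, b), max(a, b)))
    return pairs

points = [(0.05, 0.05), (0.12, 0.06), (0.95, 0.95), (0.21, 0.06), (0.12, 0.14)]
print(sorted(grid_geo_join(points, 0.1)))
```

Visiting only the smaller-id neighbors halves the number of cell pairs compared with checking all eight neighbors of every cell.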

D. THE QUADTREE BASED GLOBAL INDEX METHOD
To further improve the search efficiency, in this section we propose a novel method to solve the problem of SVS-JOIN based on a global inverted index and quadtree partition strategy.

1) QUADTREE PARTITION AND GLOBAL INDEX
a: QUADTREE PARTITION
The quadtree is one of the popular spatial indexing structures used in many applications. It partitions a 2-dimensional spatial region into 4 subregions in a recursive manner. Fig. 6(a) illustrates an example of a quadtree that partitions the spatial region into L levels. At the l-th level, the region is split into 4^l equal subregions. Each node of the quadtree corresponds to a subregion. The root node of the quadtree is located on the 0-th level and represents the whole spatial region. The four subnodes on the 1st level are partitioned from the root node on the 0-th level, and the subnodes on the 3rd level are split from the nodes on the 2nd level in the same manner. From Fig. 6(a) we can see that there are three colors of nodes. Specifically, the light gray nodes are the root node and the intermediate nodes. The dark gray nodes, which can appear on any level of the quadtree, are the leaf nodes according to the split condition. Each leaf node holds a list of geo-images. In general, the whole spatial region is partitioned into several nodes and the geo-images are distributed among these nodes. Fig. 6(b) shows the partition of Example 2 by a quadtree. The red number in the quadtree is the node id. The 9 geo-images are distributed among the subregions. Node 1, denoted as N_1, contains two geo-images, I_1 and I_2. As the number of geo-images in R is small, each of the other nodes contains at most one geo-image.
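A minimal sketch of the recursive partition is given below. The split condition is not specified in this section, so the sketch assumes that a leaf splits when it holds more than `capacity` geo-images, with a `max_level` bound as a safeguard. The class name `QuadNode` and both parameters are hypothetical.

```python
class QuadNode:
    def __init__(self, x0, y0, x1, y1, level=0):
        self.box = (x0, y0, x1, y1)
        self.level = level
        self.images = []      # geo-images stored in a leaf node
        self.children = None  # None for a leaf, otherwise four sub-nodes

    def _child_for(self, image):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        x, y = image[0], image[1]
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

    def insert(self, image, capacity=2, max_level=8):
        if self.children is not None:
            self._child_for(image).insert(image, capacity, max_level)
            return
        self.images.append(image)
        # Assumed split condition: too many geo-images in one leaf.
        if len(self.images) > capacity and self.level < max_level:
            x0, y0, x1, y1 = self.box
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            nxt = self.level + 1
            self.children = [QuadNode(x0, y0, mx, my, nxt), QuadNode(mx, y0, x1, my, nxt),
                             QuadNode(x0, my, mx, y1, nxt), QuadNode(mx, my, x1, y1, nxt)]
            pending, self.images = self.images, []
            for item in pending:
                self._child_for(item).insert(item, capacity, max_level)

    def leaves(self):
        if self.children is None:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()

root = QuadNode(0.0, 0.0, 1.0, 1.0)
for point in [(0.1, 0.1), (0.2, 0.2), (0.8, 0.8), (0.9, 0.1), (0.15, 0.3)]:
    root.insert(point)
for leaf in root.leaves():
    print(leaf.level, leaf.box, leaf.images)
```

Dense areas end up with deeper leaves while sparse areas stay coarse, which is the property that distinguishes this partition from the uniform grid.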

b: Z-ORDER CURVE
In this paper, we utilize the Z-order curve [91] to encode each node of the quadtree according to its partition sequence. As a typical space-filling curve technique, the Z-order curve maps multi-dimensional data to one dimension while preserving the spatial locality of the data. There is a direct relationship between the Z-order curve and the quadtree: the Z-order can be used to sort the data during the quadtree construction, so the path of a node in the quadtree can be represented by its Z-order code. Once sorted, the spatial data can be stored in a binary search tree and used directly [92]. Fig. 7(a) demonstrates how to generate the Morton code of a subregion based on the spatial partition sequence in a region. According to the Z-order curve, we number these 16 subregions from 0 to 15 in decimal, or from 0000 to 1111 in binary. Fig. 7(b) illustrates the Morton codes in the quadtree partition of Example 2. In this way, the 2-dimensional spatial data are mapped to a 1-dimensional space. In our solution, we use the binary code as the node id.

SVS-JOIN Q is designed to solve the spatial visual similarity join problem efficiently. Algorithm 4 shows the pseudo-code of this algorithm. Algorithm 5 and Algorithm 6 present two key procedures applied in SVS-JOIN Q. The first step of SVS-JOIN Q is to construct a quadtree by performing the procedure QuadtreeConstructor in Line 3 to partition the whole spatial region of the input dataset R. In Line 4, the sorting function AscSortZ(R) is invoked to sort the data in ascending Z-order.
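The Morton encoding and the ascending Z-order sort can be sketched by interleaving the bits of a cell's column and row indices. Which axis supplies the lower bit of each pair is a convention; the sketch assumes the column does, which may differ from the numbering drawn in Fig. 7. `morton_code` and `morton_string` are hypothetical names.

```python
def morton_code(col, row, level):
    """Interleave the bits of (col, row) for a cell on the given quadtree level."""
    code = 0
    for bit in range(level):
        code |= ((col >> bit) & 1) << (2 * bit)      # column bit, lower position
        code |= ((row >> bit) & 1) << (2 * bit + 1)  # row bit, higher position
    return code

def morton_string(col, row, level):
    """Binary node id with two bits per level, e.g. '0110' on level 2."""
    return format(morton_code(col, row, level), "0{}b".format(2 * level))

# Sort the 16 cells of a level-2 partition in ascending Z-order.
cells = [(col, row) for row in range(4) for col in range(4)]
cells.sort(key=lambda cell: morton_code(cell[0], cell[1], 2))
print([morton_string(col, row, 2) for col, row in cells])
```

Dropping the last two bits of a code gives the code of the parent node, which is why the binary code can double as the path of the node in the quadtree.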
After sorting, the procedure GlobalIndexConstructor(R, V) is invoked in Line 5 to build the global inverted index for each visual word according to the visual similarity threshold V. When building the inverted index lists for a geo-image I, only the geo-images Î in the neighbor nodes or in the same node need to be considered. Then, for each inverted list, the algorithm records the start position and end position of each node and the exact position of the geo-images for searching. JoinSearch(I, I_G, G, V) in Line 7 is invoked to measure the geographical similarity and visual similarity and then retrieve all the similar geo-image pairs to generate the result set P.

Algorithm 5 GlobalIndexConstructor
…
    β ← set of geo-images in I.node or I.neighbors;
11:   Denote the maximum of I.V[i] for all I ∈ r as maxweight_i(r);
12:   for each I.V[i] > 0 in ascending order of i do
13:     …
14:     if S_V > V then
15:       InvertedIndexConstructor(ι_i);
16:     end if
17:   end for
18: end for
19: for each ι_j ∈ I_G do
20:   Record p_start and p_end of each node in ι_j;
21:   Record the p_I_i in ι_j
22: end for
23: return I_G;

Algorithm 6 JoinSearch
…
    for each node N ∈ I.N ∪ I.neighbors do
8:      …
9:      if N ∈ I.neighbors then
10:       p_start = N.p_start;
11:       p_end = N.p_end;
12:     else
13:       p_start = GetPosition(ι_i, I);
14:       p_end = N.p_end;
15:     end if
16:     for each Î ∈ ι_i[p_start, p_end] do
17:       if I equals Î then
18:         Continue;
19:       end if
20:       if Sim[Î] = 0 || score ≤ G then
21:         …
22:       end if
23:       score ← score − I.V[i] * maxweight_i(R);
24:     end for
25:   end for
26: end for
27: Verify(I, Î, S_V, P);
28: return P;
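The role of the recorded start and end positions can be illustrated with a small sketch. It assumes the geo-images arrive already sorted by node code, so the images of one node occupy a contiguous range in every inverted list. The weight-based pruning with the threshold V is left out. `build_global_index` and `images_in_node` are hypothetical names.

```python
from collections import defaultdict

def build_global_index(images):
    """images: (image_id, node_code, visual_words) tuples sorted by node_code."""
    index = defaultdict(list)   # visual word -> image ids in Z-order
    ranges = defaultdict(dict)  # visual word -> {node_code: (p_start, p_end)}
    for image_id, node, words in images:
        for w in sorted(set(words)):
            pos = len(index[w])
            index[w].append(image_id)
            start = ranges[w][node][0] if node in ranges[w] else pos
            ranges[w][node] = (start, pos)  # inclusive range for this node
    return index, ranges

def images_in_node(index, ranges, word, node):
    """Slice one inverted list down to the images of a single quadtree node."""
    if node not in ranges.get(word, {}):
        return []
    start, end = ranges[word][node]
    return index[word][start:end + 1]

images = [
    (0, "00", ["a", "b"]),
    (1, "00", ["a"]),
    (2, "01", ["a", "c"]),
    (3, "11", ["b"]),
]
index, ranges = build_global_index(images)
print(images_in_node(index, ranges, "a", "00"))
```

With these ranges, a search for a geo-image only has to scan the slices that belong to its own node and its neighbor nodes instead of the whole inverted list.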

V. EXPERIMENTS
In this section, we present results of a comprehensive performance evaluation on real and synthetic geo-image datasets to evaluate the accuracy, efficiency and scalability of the proposed approaches. Firstly, we introduce the details of dataset and workload in subsection V-A. Then in subsection V-B we discuss the results of experiments on two different datasets.

A. DATASET AND WORKLOAD
1) DATASETS
Performance of the proposed methods is evaluated on both real and synthetic spatial and image datasets. The following two datasets are deployed in our experiments.
• Flickr. The real image dataset Flickr is obtained by crawling millions of images from the popular photo-sharing platform Flickr (http://www.flickr.com/). To evaluate the scalability of our proposed algorithms, the dataset size varies from 100K to 500K. The geo-location information is obtained from the geo-tag of each image.
• ImageNet. The synthetic dataset ImageNet is obtained from the largest image dataset ImageNet, which is widely used in image processing and computer vision. It includes 14,197,122 images and 1.2 million images with SIFT features. We generate the ImageNet dataset with size varying from 100K to 500K. The geographical information of the images is randomly generated from the spatial datasets of Rtree-Portal (http://www.rtreeportal.org).
Fig. 8 shows some example images from these two datasets. Some images selected from Flickr are shown in Fig. 8(a), such as photos of outdoor sports, guitar playing, dogs, etc. The images from the ImageNet dataset are shown in Fig. 8(b); they belong to many different categories, e.g., fast food, fish, dog, car, snake, flower, etc.

2) WORKLOAD
The geo-image dataset size increases from 100K to 500K; the number of visual words contained in a geo-image grows from 20 to 100; the geographical similarity threshold G and the visual similarity threshold V vary from 0.02 to 0.10 and from 0.5 to 0.9 respectively. By default, the image dataset size, the number of visual words, the geographical similarity threshold and the visual similarity threshold are set to 300K, 60, 0.006 and 0.7 respectively. The default visual representation scheme is AlexNet-BoVW. All

B. PERFORMANCE EVALUATION
1) COMPARISON BETWEEN VISUAL REPRESENTATION SCHEMES
We compare the search accuracy of the four proposed visual representation schemes: SIFT-BoVW, AlexNet-BoVW, VGGNet-BoVW and GoogLeNet-BoVW. As this is the first work to solve the SVS-JOIN problem, we only compare our own methods on the Flickr and ImageNet datasets.
To evaluate how the dictionary size affects the search accuracy, we set the size of the SIFT/deep visual dictionaries of these four schemes to 500, 1000, 2000, 5000 and 10000. The conventional feature representation method, SIFT-BoVW, is treated as a baseline, in which the patch size of a geo-image is set to 16 × 16 pixels. For the AlexNet-BoVW method, as mentioned above, we use the features of the fifth convolutional layer with size 13 × 13 × 256 to generate the visual dictionary. For the other two deep feature representation schemes, VGGNet-BoVW and GoogLeNet-BoVW, we utilize the features of the Conv5_3 layer with size 14 × 14 × 512 and the features of the inception 4(e) layer with size 14 × 14 × 832 respectively to construct the dictionaries. Besides, we choose two training-testing settings for these evaluations, namely (1) training ratio of 80%: the dataset is split into 80% for training and 20% for testing; (2) training ratio of 90%: the dataset is split into 90% for training and 10% for testing.
Fig. 9 illustrates the comparison of SIFT-BoVW, AlexNet-BoVW, VGGNet-BoVW and GoogLeNet-BoVW on the Flickr dataset under the training ratios of 80% and 90% respectively. We can see from Fig. 9(a) that the accuracy of all the methods creeps up as the size of the visual dictionary increases, because the larger the visual dictionary, the more details can be represented by the visual word model, which directly improves the search accuracy. Owing to the superiority of VGGNet-16 and GoogLeNet in image recognition, the accuracy of VGGNet-BoVW and GoogLeNet-BoVW is higher than that of AlexNet-BoVW and SIFT-BoVW. Since they capture more semantic concept information, these CNN-based approaches clearly outperform the conventional approach for the task of SVS-JOIN search. The accuracy of VGGNet-BoVW is a little higher than that of GoogLeNet-BoVW, reaching nearly 92% at a dictionary size of 10000. When the training ratio is increased to 90%, the performance of these four methods is a little better than before, because enlarging the training set improves the feature representation. However, VGGNet-BoVW still performs best, and its accuracy rises gradually with the growth of the dictionary size. As before, the accuracy of GoogLeNet-BoVW and AlexNet-BoVW ranks second and third respectively, both clearly higher than that of the traditional approach SIFT-BoVW.
Fig. 10 demonstrates the results of the same experiment with SIFT-BoVW, AlexNet-BoVW, VGGNet-BoVW and GoogLeNet-BoVW on the ImageNet dataset under the training ratios of 80% and 90% respectively. Under the training ratio of 80%, shown in Fig. 10(a), all of the methods show a fluctuating growth as the size of the visual dictionary grows. As on the Flickr dataset, VGGNet-BoVW is superior to the other methods, and its accuracy rises slowly in the interval [500, 5000]. GoogLeNet-BoVW ranks second; its accuracy is a bit lower than that of VGGNet-BoVW and much higher than that of AlexNet-BoVW and SIFT-BoVW. Fig. 10(b) shows that the deep CNN based methods clearly outperform the traditional approach, SIFT-BoVW, exactly as before. The accuracy of GoogLeNet-BoVW is very close to that of VGGNet-BoVW, and AlexNet-BoVW does not surpass them. Its accuracy gradually increases to about 76% at a dictionary size of 10000.

2) COMPARISON BETWEEN DIFFERENT SVS-JOIN ALGORITHMS
In the following, we evaluate the search efficiency of the SVS-JOIN algorithms on the Flickr and ImageNet datasets and discuss how the dataset size, the number of visual words, and the geographical and visual similarity thresholds affect the system performance. As this work is the first to evaluate SVS-JOIN algorithms, we compare the performance of the following methods:
• SVS-JOIN B. SVS-JOIN B is the technique introduced in subsection IV-C.
• SVS-JOIN G. SVS-JOIN G is the technique introduced in subsection IV-C.
• SVS-JOIN Q. SVS-JOIN Q is the technique introduced in subsection IV-D.
• SVS-JOIN S. SVS-JOIN S is the technique extended from the signature-based algorithm in [93]. We modify this existing algorithm by replacing the textual Jaccard measurement with our visual similarity measurement. Besides, we use the visual word representation to generate the signatures.
• SVS-JOIN A. SVS-JOIN A is a combination of the All-Pairs algorithm proposed in [94] and the grid partition technique over the dataset. Likewise, we replace the textual similarity measurement with the proposed visual similarity function.
As mentioned above, the default visual representation scheme used in all these approaches is Deep-BoVW (AlexNet-BoVW) in this experiment.
FIGURE 11. Evaluation on various dataset size on Flickr and ImageNet.

a: EVALUATION ON THE SIZE OF DATASET
We evaluate the effect of the dataset size on Flickr and ImageNet, as shown in Fig. 11. Fig. 11(a) shows that the response time of SVS-JOIN B, SVS-JOIN G, SVS-JOIN Q, SVS-JOIN S and SVS-JOIN A increases gradually. Specifically, SVS-JOIN S performs worse than SVS-JOIN B because the search algorithm used in SVS-JOIN B is more efficient; the response time of SVS-JOIN S is nearly 30 seconds when the dataset size is enlarged to 500K. The time cost of SVS-JOIN G varies from about 14 seconds to 23 seconds, which is higher than that of SVS-JOIN Q because the quadtree and global inverted index based solution is more efficient. However, SVS-JOIN G is more efficient than SVS-JOIN A due to the use of the PPJOIN algorithm. Fig. 11(b) illustrates the evaluation on the ImageNet dataset. Similar to the situation on the Flickr dataset, SVS-JOIN Q performs best owing to the high efficiency of the quadtree partition strategy. However, as the dataset size rises from 100K to 500K, the time cost of SVS-JOIN Q grows a bit faster than on Flickr, which might be due to the variety of images. On the other hand, SVS-JOIN B is still slower than SVS-JOIN G and SVS-JOIN Q since it has no spatial index comparable to theirs. SVS-JOIN B again outperforms SVS-JOIN S.

b: EVALUATION ON THE NUMBER OF VISUAL WORDS
We evaluate the effect of the number of visual words on the Flickr and ImageNet datasets, as shown in Fig. 12. We can see from Fig. 12(a) that the response time of all five methods grows step by step as the number of visual words increases. Similar to the situation above, the least efficient approach is SVS-JOIN S, which is slower than every other method. For SVS-JOIN B, the response time grows a bit faster once the number of visual words exceeds 40, and it remains high, lower only than that of SVS-JOIN S. SVS-JOIN Q is the most efficient algorithm on this dataset owing to the quadtree index. As the same visual word representation is utilized in these five methods, increasing the number of visual words affects them in the same way, which is reflected in their similar trends. The evaluation on the ImageNet dataset is shown in Fig. 12(b). Once again, without a spatial partition technique and an advanced search strategy, SVS-JOIN S performs worst. In the interval [60, 100], the response time of SVS-JOIN B and SVS-JOIN A grows a bit faster. However, this does not happen for SVS-JOIN G and SVS-JOIN Q. SVS-JOIN Q again performs best, as in the evaluations above. Thus, the results confirm once more that the proposed quadtree partition strategy is better than the grid partition for the SVS-JOIN problem.

c: EVALUATION ON THE GEOGRAPHICAL SIMILARITY THRESHOLD
We evaluate the effect of the geographical similarity threshold on the Flickr and ImageNet datasets, as shown in Fig. 13. In Fig. 13(a), as the geographical similarity threshold increases, the response time of all five algorithms grows relatively slowly. As expected, the SVS-JOIN S approach has the lowest search efficiency throughout. The response time of SVS-JOIN B fluctuates slightly and is always higher than that of SVS-JOIN A, SVS-JOIN G and SVS-JOIN Q because no advanced spatial index technique is used to boost its efficiency. As explained above, it only applies the filter condition GeoSim(I, Î) ≤ G during the search. For the SVS-JOIN Q algorithm, the range of fluctuation is very small, and this method has the lowest response time. This mainly benefits from our spatial search strategy, i.e., the grid-based and quadtree-based search algorithms with global index are not very sensitive to the change of the geographical similarity threshold. On the other hand, the trend of SVS-JOIN G is similar to that of SVS-JOIN Q, although its efficiency is lower. It still outperforms SVS-JOIN A. For the comparison on the ImageNet dataset, Fig. 13(b) shows that the trends of SVS-JOIN S and SVS-JOIN B are slightly different from those on Flickr: the time cost grows a bit faster. Specifically, when the threshold G increases to 0.008, the response time of SVS-JOIN S increases from 24.5 to 28, and that of the proposed baseline SVS-JOIN B rises from 22 to 25. However, the other two algorithms, SVS-JOIN G and SVS-JOIN Q, are not much affected by the increase of G. Besides, the latter performs much better than the former, which benefits from the usage of an efficient spatial index, namely the quadtree with Z-order and the global inverted index.

d: EVALUATION ON THE VISUAL SIMILARITY THRESHOLD
We evaluate the effect of the visual similarity threshold on the Flickr and ImageNet datasets, as shown in Fig. 14. We can see from Fig. 14(a) that as the visual similarity threshold rises, the search efficiency of these five algorithms improves gradually. This is mainly because more geo-images are considered dissimilar when the threshold is large, so more candidates are pruned as the visual similarity threshold grows. As in the comparisons above, the response time of SVS-JOIN S is the highest since it uses no efficient search strategy. With the grid partition technique, SVS-JOIN A and SVS-JOIN G outperform the two methods that do not utilize a spatial partition strategy. The efficiency of SVS-JOIN Q is higher than that of SVS-JOIN A and SVS-JOIN G because the quadtree and global index technique clearly speeds up the spatial search. In Fig. 14(b), the response time of all the methods declines gradually, but the decrease is a bit slower than on the Flickr dataset. As above, SVS-JOIN Q is clearly superior to the other approaches.

VI. CONCLUSION
In this paper, we study a novel geo-image retrieval paradigm named the SVS-JOIN problem. Given a set of geo-images that contain geographical information and visual content information, SVS-JOIN aims to find all the geo-image pairs in the dataset that are similar to each other in terms of both geographical similarity and visual similarity. We formally define the SVS-JOIN problem for the first time and then propose the geographical and visual similarity functions. An algorithm named SVS-JOIN B is developed, which is inspired by the approaches applied to spatial similarity joins. To improve the search efficiency, we extend this algorithm to a novel algorithm called SVS-JOIN G that utilizes the spatial grid strategy to enhance the performance of spatial retrieval. Besides, we introduce an alternative algorithm named SVS-JOIN Q that employs the quadtree technique and a global inverted indexing structure, which further speeds up the search. The experimental evaluation on real and synthetic geo-multimedia datasets shows that our methods solve the SVS-JOIN problem efficiently, with SVS-JOIN Q achieving the lowest response time in all the evaluations.
LEI ZHU was born in Changsha, China, in June 1988. He received the M.Sc. degree from Central South University, China, in 2014, where he is currently pursuing the Ph.D. degree in computer science and technology with the School of Computer Science and Engineering. His research interests include machine learning, deep learning, computer vision, and spatio-temporal data retrieval.
WEIREN YU received the Ph.D. degree from the School of Computer Science and Engineering, University of New South Wales. He is currently an Assistant Professor of Computer Science with the University of Warwick, and also an Honorary Fellow with Imperial College. Before joining Warwick, he held a postdoctoral position with Imperial College. He has published more than 30 articles in DB and IR. His research interests include web search and information retrieval, graph data management, and streams data mining. He received three Best Paper Awards, two CiSRA Best Paper Awards, a One of the Best Papers of ICDE 2013, and the Best Student Paper Award. He has served on various editorial boards, and as a PC and an active Reviewer for journals, such as the IEEE TKDE, The VLDB Journal, IEEE TIFS, ACM TKDD, WWWJ, Sensors and conferences, such as SIGIR, SIGMOD, VLDB, ICDE, EDBT, CIKM.
CHENGYUAN ZHANG was born in Hunan, China. He received the B.S. degree from Sun Yat-sen University, in 2008, and the master's and Ph.D. degrees in computer science from the University of New South Wales, in 2011 and 2015, respectively. He is currently a Lecturer with the School of Computer Science and Engineering, Central South University, China. His main research interests include information retrieval and query processing on spatial data and multimedia data.
ZUPING ZHANG received the B.S. degree from the Foundation of Mathematics, Hunan Normal University, in 1989, the M.S. degree from the Foundation of Mathematics, Jilin University, in 1992, and the Ph.D. degree in computer application technology from Central South University, Changsha, China, in 2005. He is currently a Professor with the School of Information Science and Engineering, Central South University. His current research interests include information fusion and information systems, big data technology and application, parameter computing, and biology computing.
FANG HUANG was born in Changsha, China. She received the Ph.D. degree in traffic information engineering and control from Central South University, China, in 2007. She is currently a Professor with the School of Information Science and Engineering, Central South University. Her main research interests include social network mining and analysis, data mining and knowledge discovery, and big data analysis.
HAO YU was born in Shangrao, China, in December 1994. He received the M.Sc. degree from Guangxi Normal University, China, in 2018. He is currently pursuing the Ph.D. degree in computer science and technology with Central South University. His research interests include image retrieval, machine learning, computer vision, and crowdsourcing learning.