LayerLSH: Rebuilding Locality-Sensitive Hashing Indices by Exploring Density of Hash Values

Locality-sensitive hashing (LSH) has attracted extensive research efforts for approximate nearest neighbors (NN) search. However, most LSH-based index structures fail to take data distribution into account. They perform well when the data are uniformly distributed but exhibit unstable performance when the data are skewed. Most real-life data are skewed, which degrades LSH performance. In this paper, we observe that the skewness of hash values resulting from skewed data is a potential reason for this performance degradation. To address this problem, we propose to rebuild LSH indices by exploring the density of hash values. The hash values in dense and sparse ranges are reorganized using a multi-layered structure, so that more effort is put into indexing the dense hash values. We further discuss the benefit in distributed computing. Extensive experiments are conducted to show the effectiveness and efficiency of the reconstructed LSH indices.


I. INTRODUCTION
The nearest neighbors (NN) search problem aims to find objects that are close to a given query, and it is a basic and important problem in a wide range of applications [1]-[3]. Because efficient exact NN search in high-dimensional space is difficult, many researchers have focused on approximate nearest neighbors search as an alternative. Locality-sensitive hashing (LSH) [4] is known as one of the most promising indexing methods for approximate nearest neighbors search. A locality-sensitive hash function has the property that points that are closer to each other have a higher probability of colliding than points that are farther apart. To improve the hashing effect, m randomly chosen LSH functions are used together to generate a compound hash key for each object o. To compensate for the candidate points lost by using compound hash keys, a large number l of hash tables are constructed.
Many LSH variants have been proposed in recent years. E2LSH [5] is a typical table-based LSH which proposes a hash function for Euclidean space. The data points with the same hash value or the same concatenated hash values are placed in the same bucket, which implies that they are close to each other. The approximate NN search is achieved by returning the data in the bucket that the query falls in.
However, most of these LSH index structures fail to take data distribution into account. They perform well when the data are uniformly distributed, but exhibit unstable performance when the data are skewed. Most real-life data are skewed, which makes LSH suffer from poor search quality. Based on our observation, a skewed data distribution leads to skewed hash values and, as a result, to a skewed index structure. This is a potential reason for the performance degradation.
The Euclidean-based LSH function [5] projects high-dimensional data points onto the real number line and partitions the line into fixed-length intervals. As a result, the hash values exhibit a skewed distribution whenever the original data are skewed. The points with the same hash values are assigned to the same bucket, so the bucket sizes vary greatly. Figure 1a shows the skewed bucket size distribution for the KDD dataset (Table 1). Intuitively, LSH-based kNN search displays higher accuracy for dense queries and lower accuracy for sparse queries. This is because the distances from a query q to its kNNs are small (or large) in a dense (or sparse) region, so that they are likely (or unlikely) to be hashed to the same bucket. On the other hand, due to the large bucket size, it incurs higher cost (measured by the number of distance computations) for dense queries than for sparse queries. This phenomenon is illustrated in Figure 1b. If we randomly return k points from the bucket rather than the exact k nearest points, the accuracy may be much lower. Specifically, for kNN applications with a small k (in the extreme case, a top-1 query), randomly returning one point from a dense bucket amplifies the error and significantly impacts the user experience.
In this paper, we propose to rebuild LSH index structures by exploring the density of hash values. The hash values in dense ranges are rehashed to distribute them more evenly, so as to reduce the query cost. The hash values in sparse ranges are merged so that they are returned together during query processing, so as to improve the search quality. The rebuilt LSH indices thus become more targeted with respect to the data distribution, and a multi-layered structure is constructed. Compared with a simple rehashing method, the multi-layered approach still guarantees the search quality by carefully choosing the number of groups and hash functions, which is a desirable property for applications with strict accuracy requirements.
Difference to Data Sensitive Hashing. The recently proposed data sensitive hashing techniques, e.g., DSH [6] and selective hashing [7], also leverage data distributions. DSH [6] chooses the most suitable hashing family by learning data distributions. Selective hashing [7] creates multiple LSH indices with different granularities (i.e., radii) and places each point in only one suitable LSH index according to data densities. These data sensitive hashing techniques learn the appropriate hash families from the data, and are accordingly able to create relatively balanced indexing structures. Our approach is orthogonal to them since we rely on the density of hash values and directly rebuild the existing index structures. Moreover, our rebuilding scheme can also be used as a postprocessing step for these data sensitive hashing techniques to further improve performance. We rebuild a DSH index to illustrate this possibility.
Contributions. We list our contributions as follows.
• We rebuild the basic LSH structure and design LayerLSH (Section III). The points in an overloaded bucket are recursively rehashed into multiple groups of smaller buckets, forming a multi-layered index structure. Queries thus become more efficient because fewer but more accurate NN candidates are returned. Further, by carefully choosing the new set of rehashing LSH parameters, the collision probability can be guaranteed. We also propose a stream processing approach to handle streaming data.
• We demonstrate the benefits of our approach in distributed computing (Section IV). We also present a use case on supporting distributed all-pairs computation, i.e., point density evaluation.
• We conduct extensive experiments on real datasets to verify the effectiveness and efficiency of the proposed multi-layered structures. LayerLSH can reach the same search quality as LSH with only 5%-20% of the query cost. LayerLSH also exhibits much better performance on distributed point density approximation (Section V).
We survey the related work in Section VI. Finally, we conclude our work in Section VII.

A. PROBLEM SETTING
The problem of nearest neighbors search refers to finding objects that are similar to the query object. The typical kNN search problem is formally defined as follows.
Definition 1: (kNN) Given an object q, a dataset O and an integer k (k < |O|), the kNN query returns a set $KNN(q, O) \subseteq O$ of k objects such that $|q, o| \le |q, o'|$ for every $o \in KNN(q, O)$ and every $o' \in O \setminus KNN(q, O)$, where |·, ·| denotes the distance between two objects.
In this paper, we focus on answering approximate kNN queries for high-dimensional data in the Euclidean space. That is, we aim to find k objects whose distances are within a small factor (1 + ϵ) of the exact k-nearest neighbors' distances and minimize ϵ. Our goal is to design an indexing scheme for approximate kNN queries with both high search quality and high efficiency.
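As a point of reference for the definitions above, the following minimal Python sketch computes the exact kNN set by brute force and measures how far an approximate answer deviates from it. The function names (`exact_knn`, `approximation_ratio`) are illustrative choices and do not come from the paper; the ratio reported corresponds to the factor (1 + ϵ).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(q, objects, k):
    """Return the k objects closest to q (ties are broken by input order)."""
    if not 0 < k < len(objects):
        raise ValueError("k must satisfy 0 < k < |O|")
    return sorted(objects, key=lambda o: euclidean(q, o))[:k]

def approximation_ratio(q, approx, exact):
    """Largest ratio between the i-th approximate and i-th exact neighbor distance."""
    d_approx = sorted(euclidean(q, o) for o in approx)
    d_exact = sorted(euclidean(q, o) for o in exact)
    ratios = [da / de for da, de in zip(d_approx, d_exact) if de > 0]
    return max(ratios, default=1.0)
```

A ratio of 1 means the approximate answer is as good as the exact one; an index that trades accuracy for speed aims to keep this ratio close to 1.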

B. LOCALITY-SENSITIVE HASHING
A Locality-Sensitive Hashing (LSH) function has the property that points that are closer to each other have a higher probability of colliding than points that are farther apart [4]. Let O be the dataset of n data objects in d-dimensional Euclidean space $\mathbb{R}^d$ and let $\|o_1, o_2\|$ denote the Euclidean distance between two objects $o_1, o_2 \in O$. LSH is formally defined as follows.
Definition 2: (Locality Sensitive Hashing) Given a distance r, an approximation ratio c and two probability values $P_1$ and $P_2$, a hash function $h : \mathbb{R}^d \to U$ is called $(r, cr, P_1, P_2)$-sensitive if, for any $o_1, o_2 \in O$, (i) $\|o_1, o_2\| \le r$ implies $\Pr[h(o_1) = h(o_2)] \ge P_1$, and (ii) $\|o_1, o_2\| > cr$ implies $\Pr[h(o_1) = h(o_2)] \le P_2$.
We pick c > 1 and $P_1 \ge P_2$. With these choices, nearby objects (i.e., those within distance r) have a greater chance of being hashed to the same value than objects that are far apart (i.e., those at a distance greater than cr).
The commonly used LSH family for Euclidean distance consists of LSH functions of the following form [5]:
$$h(o) = \left\lfloor \frac{a \cdot o + b}{w} \right\rfloor \qquad (1)$$
where a is a d-dimensional random vector, each entry of which is chosen independently from the standard Gaussian distribution N(0, 1) [8], b is a real number chosen uniformly from [0, w], and w is a real number representing the partition width of the LSH function. For two data objects $o_1$ and $o_2$, let $s = \|o_1, o_2\|$. The probability that $o_1$ and $o_2$ collide under a randomly chosen hash function h, denoted as p(s, w), can be computed as follows [5]:
$$p(s, w) = \int_0^w \frac{1}{s} f_2\!\left(\frac{t}{s}\right)\left(1 - \frac{t}{w}\right) dt = 2\,\mathrm{norm}(w/s) - 1 - \frac{2}{\sqrt{2\pi}\,(w/s)}\left(1 - e^{-\frac{w^2}{2s^2}}\right) \qquad (2)$$
where $f_2(x)$ is the density function of the absolute value of a standard Gaussian random variable [5], i.e., $f_2(x) = \frac{2}{\sqrt{2\pi}} e^{-x^2/2}$, and norm(·) is the cumulative distribution function of the standard Gaussian distribution. The collision probability p(s, w) decreases monotonically as s increases and grows monotonically as w increases.
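To make the hash function and its collision probability concrete, here is a minimal Python sketch. It assumes the floor-based form $h(o) = \lfloor (a \cdot o + b)/w \rfloor$ (the same form that appears in Lemma 1) and the standard closed form of p(s, w) for the Gaussian family from [5]. The helper names `make_hash`, `norm_cdf` and `collision_probability` are hypothetical.

```python
import math
import random

def make_hash(d, w, rng):
    """Draw one Euclidean LSH function h(o) = floor((a . o + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]   # entries drawn from N(0, 1)
    b = rng.uniform(0.0, w)                       # offset drawn from [0, w]
    def h(o):
        return math.floor((sum(ai * oi for ai, oi in zip(a, o)) + b) / w)
    return h

def norm_cdf(x):
    """CDF of the standard Gaussian distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def collision_probability(s, w):
    """Closed form of p(s, w) for two objects at distance s and width w."""
    if s == 0:
        return 1.0                                # identical points always collide
    c = w / s
    tail = 2.0 / (math.sqrt(2.0 * math.pi) * c) * (1.0 - math.exp(-c * c / 2.0))
    return 2.0 * norm_cdf(c) - 1.0 - tail
```

Evaluating `collision_probability` for a few values of s and w is a quick way to see the monotonic behavior described above before committing to a partition width.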
The locality-preserving property of LSH allows us to partition the set of objects based on their hash values. If two points $o_1$ and $o_2$ are hashed to the same bucket, they are close to each other with a certain confidence. However, it is possible that two distant points happen to be hashed to the same bucket according to Equation (1). To reduce such false positives, a group of m hash functions $G(\cdot) = \{h_1(\cdot), h_2(\cdot), \ldots, h_m(\cdot)\}$ is employed. That is, only points sharing all m hash values are placed in the same bucket. Thus, each object o is labeled with a compound hash key, which is used as the bucket key. The probability that two objects collide is reduced as shown in Equation (3):
$$\Pr[G(o_1) = G(o_2)] = p(s, w)^m \qquad (3)$$
However, the probability $p(s, w)^m$ may be very small when m is large, which may lead to a large number of false negatives. To reduce false negatives, multiple hash tables are used. That is, a set of l hash groups $\{G_1(\cdot), G_2(\cdot), \ldots, G_l(\cdot)\}$ is employed and l hash tables are constructed (i.e., each object has l copies in l hash tables), in the hope that close points collide in at least one hash table. The final collision probability P is shown in Equation (4):
$$P = 1 - \left(1 - p(s, w)^m\right)^l \qquad (4)$$
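The interplay between m and l can be explored numerically. The sketch below assumes the usual forms $p^m$ for a compound key and $1 - (1 - p^m)^l$ for l tables; `tables_needed` is a hypothetical helper that inverts the latter to find the smallest l reaching a target collision probability.

```python
import math

def compound_collision(p, m):
    """Probability that two objects agree on all m hash values."""
    return p ** m

def final_collision(p, m, l):
    """Probability that two objects collide in at least one of l tables."""
    return 1.0 - (1.0 - p ** m) ** l

def tables_needed(p, m, target):
    """Smallest l with 1 - (1 - p^m)^l >= target, for 0 < p < 1 and 0 < target < 1."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    l = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p ** m)))
    while final_collision(p, m, l) < target:      # guard against rounding
        l += 1
    return l
```

For instance, with p = 0.8 and m = 4 a single table collides with probability of roughly 0.41, so several tables are needed to push the overall probability above 0.9.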

III. LAYERLSH: REBUILD BASIC LSH
As illustrated in Section I, a query falling in a dense bucket tends to incur high cost, while a query falling in a sparse bucket tends to yield low accuracy. Our idea is to split the dense buckets and merge the sparse buckets, which is simple but empirically shown to be effective (Section V). Suppose we have a set of 2-D data objects distributed as shown in the top of Figure 2. They are hashed to different buckets in two hash tables. Some of the buckets are lightly loaded, while some are heavily loaded. LayerLSH rehashes the objects residing in an overloaded bucket into a new set of hash tables, such that the overloaded bucket is split into multiple groups of smaller buckets, where each group corresponds to a new hash table. The overloaded buckets are rehashed recursively until no overloaded bucket remains. Meanwhile, an underloaded bucket is not processed further but only marked. When a query falls into an underloaded bucket, the query algorithm simply expands the search scope and searches the "nearby" buckets to improve the accuracy.
Note that limiting the bucket size reduces the accuracy from a probabilistic point of view. The objects in an overloaded bucket should therefore be copied to more than one hash table to compensate for the reduced accuracy. This sustains the expected accuracy P depicted in Equation (4), as described in detail in Section III-A. Since the overloaded buckets are rehashed recursively, multiple layers of LSH tables are constructed. The root level (level 0) of LayerLSH is exactly the same as the original LSH. The hash tables in higher levels are the newly generated LSH tables for the rehashed buckets. Figure 2 shows an illustrative example of the multi-layered tree-like structure.

A. BUILDING LAYERLSH
We rebuild the original hash tables in terms of two factors, the user-specified expected recall and precision. Let KNN(q, O) denote the set of kNNs of q. Given a query q and a set of objects O, an approximate kNN query algorithm returns a set of candidates C. We have the recall $\alpha = \frac{|C \cap KNN(q,O)|}{|KNN(q,O)|}$, which reflects the accuracy, and the precision $\beta = \frac{|C \cap KNN(q,O)|}{|C|}$, which reflects the efficiency. Given α and β, we study the lower and upper bounds on the size of each bucket.
Proposition 1: (Bucket Size Constraints) When using LSH with l hash tables to answer a kNN query, with an expected recall α and an expected precision β, the bucket size S of each hash table has the following constraints:
$$k \cdot \left(1 - \sqrt[l]{1 - \alpha}\right) \le S \le \frac{k}{\beta \cdot l}$$
Proof 1: Suppose the LSH parameters are {l, m, w}. In each of the l hash tables, the expected size of the bucket that q falls in is $S = \sum_{o \in O} p(|q, o|, w)^m$, where $p(s, w)^m$ is defined in Equations (2) and (3). Let $s_{(q,k)}$ denote the distance from q to its kth NN. Then we have
$$S = \sum_{o \in O} p(|q, o|, w)^m \ge \sum_{o \in KNN(q,O)} p(|q, o|, w)^m \ge k \cdot p(s_{(q,k)}, w)^m \ge k \cdot \left(1 - \sqrt[l]{1 - \alpha}\right).$$
The first "≥" holds because the range of the summation is reduced from O to KNN(q, O). The second "≥" holds because for any $|q, o| \le s_{(q,k)}$ we have $p(|q, o|, w) \ge p(s_{(q,k)}, w)$ according to Equation (2). The third "≥" holds because, in order to achieve the overall recall α over l hash tables, the probability that q and any of its kNNs collide in a specific hash table should be no less than $1 - \sqrt[l]{1 - \alpha}$ according to Equation (4). If we can make $p(s_{(q,k)}, w)^m \ge 1 - \sqrt[l]{1 - \alpha}$ for q's kth NN, the same holds for all of its other kNNs. Thus, the bucket size S should satisfy $S \ge k \cdot (1 - \sqrt[l]{1 - \alpha})$.
When all of q's kNNs are contained in the returned candidate set, i.e., $KNN(q, O) \subseteq C$, |C| can be as large as $\frac{k}{\beta}$ while still satisfying the expected precision; otherwise |C| has to be smaller. Thus, $\frac{k}{\beta}$ is the upper bound of |C|. Since there are l hash tables, the candidates are retrieved from l buckets. Let us assume the candidates are collected evenly from the l hash tables. Thus, the bucket size S should satisfy $S \le \frac{k}{\beta \cdot l}$.
Note that, satisfying the above constraints does not necessarily guarantee the expected recall and precision but helps us identify the underloaded and overloaded buckets.
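The two thresholds are cheap to compute. The following Python sketch derives the per-table lower and upper bucket sizes from k, l, the expected recall and the expected precision, and classifies a bucket accordingly. It assumes the bounds $k \cdot (1 - \sqrt[l]{1 - \alpha})$ and $k / (\beta \cdot l)$; the function names are illustrative.

```python
def bucket_size_bounds(k, l, recall, precision):
    """Return (T_l, T_u): lower and upper bucket size for one hash table."""
    r_prime = 1.0 - (1.0 - recall) ** (1.0 / l)   # per-table (relaxed) recall
    p_prime = precision * l                        # per-table (tightened) precision
    return k * r_prime, k / p_prime

def classify(size, lower, upper):
    """Label a bucket by comparing its size to the two thresholds."""
    if size > upper:
        return "overloaded"
    if size < lower:
        return "underloaded"
    return "ok"
```

Note that for demanding settings (for example, a high precision with many tables) the lower bound can exceed the upper bound, which mirrors the possible conflict between recall and precision.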
Given the constraints on the bucket size, we propose Algorithm 1 to recursively rehash the overloaded buckets and build the LayerLSH index. The input includes the original LSH tables (i.e., level-0 hash tables), the LSH parameter set {l, m, w}, the number of returned NNs k, the expected recall R = α, and the expected precision P = β. With respect to each hash table, the recall is relaxed to $R' = 1 - \sqrt[l]{1 - R}$, and the precision is tightened to $P' = P \cdot l$ (Line 2). Then, we have the lower bound size ($T_l = k \cdot R'$) and the upper bound size ($T_u = k / P'$) for each bucket (Line 3). We check the size of each bucket of each hash table. For an overloaded bucket that contains more than $T_u$ objects (Line 7), we first determine a new set of child LSH parameters (Line 8), based on which the objects in that bucket are rehashed into a new set of hash tables (Line 9). The overloaded buckets are rehashed by recursively invoking this process until no bucket is overloaded (Line 10). For an underloaded bucket that contains fewer than $T_l$ objects (Line 11), we mark it for future use in query processing (Line 12).
In Algorithm 1, we refer to the to-be-rehashed bucket as the parent bucket and the new LSH for this bucket as the child LSH. The core of bucket rehashing is to choose a proper new set of child LSH parameters, such that the bucket size is reduced (for efficiency) while the probability that a query and its kNNs collide in the same bucket is not reduced (for accuracy). We fix the LSH width parameter w (the reason is explained at the end of this subsection). Let $\{l_p, m_p, w\}$ denote the set of parent LSH parameters, and $\{l_c, m_c, w\}$ denote the set of child LSH parameters. We use the following propositions to guide the selection of child LSH parameters.
Proposition 2: (For Accuracy) Let $s_{(q,k)}$ denote the distance from a query q to its kth NN. Suppose we can find an $r^*$ such that $r^* \ge s_{(q,k)}$ for any q. In order to guarantee the expected recall, the child LSH parameters $\{l_c, m_c\}$ should be chosen to satisfy the accuracy constraint given as Equation (7), where $p = p(r^*, w)$ is defined in Equation (2).
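The recursive rehashing can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's Algorithm 1: the index is held in plain dictionaries, `child_params` is a hypothetical callback standing in for the child parameter selection, and `max_depth` is an added safety guard so that recursion terminates even when points are identical.

```python
import math
import random
from collections import defaultdict

def make_group(d, m, w, rng):
    """Draw one compound hash G = (h_1, ..., h_m) as a list of (a, b) pairs."""
    return [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
            for _ in range(m)]

def compound_key(group, w, o):
    """Compound hash key of object o under the given group."""
    return tuple(math.floor((sum(x * y for x, y in zip(a, o)) + b) / w)
                 for a, b in group)

def build_layer(points, l, m, w, t_low, t_up, child_params, rng,
                depth=0, max_depth=5):
    """Build l hash tables over points and recursively rehash overloaded buckets."""
    d = len(points[0])
    tables = []
    for _ in range(l):
        group = make_group(d, m, w, rng)
        grouped = defaultdict(list)
        for o in points:
            grouped[compound_key(group, w, o)].append(o)
        buckets = {}
        for key, objs in grouped.items():
            if len(objs) > t_up and depth < max_depth:
                l_c, m_c = child_params(len(objs))   # hypothetical policy hook
                children = build_layer(objs, l_c, m_c, w, t_low, t_up,
                                       child_params, rng, depth + 1, max_depth)
                buckets[key] = {"kind": "pointer", "children": children}
            else:
                # Underloaded buckets are only marked, not processed further.
                buckets[key] = {"kind": "data", "objects": objs,
                                "underloaded": len(objs) < t_low}
        tables.append({"group": group, "buckets": buckets})
    return tables

def data_buckets(tables, depth=0):
    """Yield (depth, bucket) for every data bucket in the layered index."""
    for table in tables:
        for bucket in table["buckets"].values():
            if bucket["kind"] == "data":
                yield depth, bucket
            else:
                yield from data_buckets(bucket["children"], depth + 1)
```

Keeping w fixed across levels, as the sketch does, means a child table only adds new random projections; the bucket shrinks because more hash values must agree.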

VOLUME 4, 2016
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.
Proof 2: Given the definition of $r^*$, we have $\forall q,\ r^* \ge s_{(q,k)}$ and therefore $p(r^*, w)^{m_p} \le p(s_{(q,k)}, w)^{m_p}$. That is, the probability that any query q and its kth NN collide in the same bucket is no less than $p(r^*, w)^{m_p}$. If the probability $p(r^*, w)^{m_p}$ can be sustained after bucket rehashing, the probability that q and any of its kNNs fall in the same bucket will not decrease, so that the expected recall α is guaranteed. Accordingly, in terms of Equation (4), the child LSH parameters $\{l_c, m_c, w\}$ should be chosen such that the constraint of Proposition 2 holds. Further, since all the objects in the child hash tables originate from the parent bucket, the collision probability in the child hash tables will never be greater than $p(r^*, w)^{m_p}$. Therefore, we aim to keep the collision probability after rehashing as close to $p(r^*, w)^{m_p}$ as possible.
In practice, $r^*$ can be approximately estimated by sampling. A number of sample objects are randomly selected and their exact kNNs are calculated. The median of their kth NN distances is used to estimate $r^*$.
Proposition 3: (For Efficiency) Let S denote the size of the overloaded bucket that q falls in. In order to approximately satisfy the precision constraint, the child LSH parameters $m_c$ and $l_c$ should be chosen as follows:
$$m_c = \left\lceil \log_p \frac{k}{S \cdot \beta \cdot l_p} \right\rceil, \qquad l_c \le \left\lfloor \frac{k}{S_c^* \cdot \beta \cdot l_p} \right\rfloor \qquad (8)$$
where $p = p(r^*, w)$ is defined in Equation (2), β is the expected precision, and $S_c^*$ is the biggest bucket size among all child hash tables.
Proof 3: The expected size of the bucket that q falls in is $S = \sum_{o \in O} p(|q, o|, w)^{m_p}$. From this equation, we learn that the actual bucket size relates to two factors: 1) the number of objects "close" to q, and 2) the probability that these "close" objects fall in q's bucket (determined by $m_p$ and w). The more objects are "close" to q, the larger S tends to be. The larger $m_p$ is, the smaller S tends to be.
Let $X = \{o : |q, o| \le r^*\}$ denote the set of objects within a range of $r^*$, which are considered as "close" objects. In other words, |X| reflects q's density. We assume that the bucket size S is proportional to |X|, i.e., $S \propto |X|$. On the other hand, to simplify the analysis and obtain an approximate answer, we assume that the probability of any object in X falling in q's bucket is proportional to $p(r^*, w)^{m_p}$, which is fixed for all objects in X, i.e., $S \propto p^{m_p}$ where $p = p(r^*, w)$. Thus, we have $S \propto |X| \cdot p^{m_p}$.
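In order to satisfy the efficiency constraint $\frac{k}{S \cdot l_p} \ge \beta$, the bucket size S should be reduced to at most $\frac{k}{\beta \cdot l_p}$. Since $S \propto |X| \cdot p^{m_p}$, rehashing with $m_c$ additional hash functions reduces $|X| \cdot p^{m_p}$ to $|X| \cdot p^{m_p + m_c}$, i.e., the bucket size shrinks by a factor of $p^{m_c}$. Therefore, we should choose $m_c$ to satisfy
$$S \cdot p^{m_c} \le \frac{k}{\beta \cdot l_p}, \quad \text{i.e.,} \quad m_c \ge \log_p \frac{k}{S \cdot \beta \cdot l_p},$$
where the inequality flips because p < 1. Since $m_c$ should be an integer, we choose the result of the ceiling function, i.e., $m_c = \left\lceil \log_p \frac{k}{S \cdot \beta \cdot l_p} \right\rceil$, to sustain the inequality of the precision constraint.
On the other hand, after rehashing we also need to limit the number of child hash tables in order to satisfy the efficiency constraint. Suppose $S_{(q,i)}$ is the size of a query q's bucket (or of multiple child buckets if it points to child hash tables) in child hash table i. We need to make sure that $\sum_{i=1}^{l_c} S_{(q,i)} \le \frac{k}{\beta \cdot l_p}$. Since $S_{(q,i)} \le S_c^*$ for every i, by choosing $l_c \le \left\lfloor \frac{k}{S_c^* \cdot \beta \cdot l_p} \right\rfloor$ (since $l_c$ should be an integer), we can satisfy the efficiency requirement.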
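A small numeric sketch helps to see how the efficiency-side parameters behave. The Python functions below assume that the bucket size shrinks by a factor of $p^{m_c}$ after rehashing and that the total candidate budget per parent table is $k / (\beta \cdot l_p)$; the bound on $l_c$ in particular is an assumption of this sketch, and both function names are hypothetical.

```python
import math

def child_m(p, k, bucket_size, precision, l_parent):
    """Smallest integer m_c with bucket_size * p**m_c <= k / (precision * l_parent)."""
    target = k / (bucket_size * precision * l_parent)
    if target >= 1.0:
        return 0                      # bucket already within the size budget
    return math.ceil(math.log(target) / math.log(p))

def child_l_upper(k, max_child_bucket, precision, l_parent):
    """Largest l_c such that l_c child buckets of the maximal size fit the budget."""
    return math.floor(k / (max_child_bucket * precision * l_parent))
```

For example, with p = 0.8, k = 10, S = 1000, β = 0.1 and $l_p$ = 4, the size budget is 25, and 17 additional hash functions are needed before $S \cdot p^{m_c}$ drops below it.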
By combining Proposition 2 and Proposition 3, we can obtain the feasible child LSH parameters $m_c$ and $l_c$, which are used during bucket rehashing (Line 8 in Algorithm 1). However, there may be no solution, since the recall and precision requirements cannot always be satisfied at the same time (i.e., $l_c$'s lower bound may be greater than its upper bound). Furthermore, $S_c^*$ in Equation (8) is unknown before bucket rehashing. In such cases, we first choose $m_c$ according to Equation (8), ignore the upper bound of $l_c$, and generate enough child hash tables to sustain the accuracy constraint according to Equation (7). The efficiency constraint is then enforced during query processing.
In the original LSH, the user is required to set {l, m, w}, while in LayerLSH we use the expected recall R and the expected precision P in place of {l, m, w}. R and P are more user-friendly, since the effectiveness of real LSH applications is usually evaluated by recall and precision. In addition, the reason why we perform the analysis with a fixed w is as follows. Given a query q, the probability that a point collides with q (Equation (2)) depends on its distance s to q and on the partition width w. If w were allowed to change, the collision probability of each point would change after rehashing, and the new probability p(s, w) would depend on s, which differs from point to point. The analysis in Propositions 2 and 3 would then also have to consider s, the distance from a point to the query. However, s is unknown a priori because the query is unknown a priori, which makes the accuracy and efficiency analysis much harder. Therefore, we propose to fix w.

B. QUERY PROCESSING
There are two kinds of buckets in LayerLSH, which should be differentiated during query processing. Buckets of the first kind contain similar data objects and are referred to as data buckets. Buckets of the second kind contain pointers to child hash tables and are referred to as pointer buckets.
To answer a kNN query, we use Algorithm 2 to retrieve the candidate set from the multi-layered hash tables. We first read the LayerLSH parameters {l, m, w} from the input LayerLSH tables (Line 2). With respect to each hash table, the recall is relaxed to $R'$, and the precision is tightened to $P'$ (Line 3). Then, we have the lower bound size ($T_l = k \cdot R'$) and the upper bound size ($T_u = k / P'$) for each bucket (Line 4).
Given a query q, we first compute its compound hash keys and project q to a bucket in each hash table (Line 6). If the located bucket is a pointer bucket, the query q is rehashed in the child hash tables that the bucket points to, along with the bucket-based $R'$ and $P'$ (Line 8), and the query algorithm is invoked recursively (Line 9).
If the located bucket is a data bucket and this bucket is underloaded (Line 11), we expand the search scope and search the "nearby" buckets whose compound hash keys are slightly different. The search scope is expanded to more and more buckets until enough objects ($T_l$) are returned (Lines 12-16). The idea of merging "nearby" sparse buckets is similar to multi-probe LSH [9]. Given the property of LSH, if an object is close to a query q but not hashed to the same bucket, it is likely to be in a bucket that is "close by" (i.e., the hash keys of the two buckets differ only slightly). LayerLSH designates the "close by" buckets by applying a hash perturbation vector $\Delta = \{\delta_1, \delta_2, \ldots, \delta_m\}$ (e.g., {+1, 0, . . . , 0} or {0, −1, . . . , 0}) to the original compound hash key $G(q) = \{h_1(q), h_2(q), \ldots, h_m(q)\}$ and obtains the nearby bucket $G(q) + \Delta$.
If the data bucket is not underloaded, the objects in that bucket are conditionally put into the candidate set C (Line 17). Recall that the specified recall and precision may conflict with each other. Satisfying the recall requirement requires returning all candidates from all hash tables, whereas satisfying the precision requirement requires returning at most $T_u$ candidates from only a few hash tables. LayerLSH lets users specify one primary choice, either the expected recall α or the expected precision β. The query processing algorithm then either includes all the objects in the bucket to satisfy the recall requirement (Line 18) or limits the number of returned candidates to satisfy the precision requirement (Line 19). If both or neither of recall and precision is selected as primary, LayerLSH balances these two factors by capping the number of returned candidates.
Note that the query might expand to more and more buckets as it goes deeper in the LayerLSH tree. However, a large number of checked buckets does not necessarily lead to a large number of candidates, since the checked buckets are much smaller. For dense buckets, LayerLSH narrows the search scope, so the search is more efficient. Rather than using a large number of hash tables to achieve high search quality, we can achieve the same search quality with a smaller number of level-0 hash tables. More hash tables are created only for the dense buckets. The hashing is thus more targeted with respect to the data distribution.
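The nearby-bucket expansion can be illustrated with a short Python sketch. It assumes the simplest perturbation set, in which a single coordinate of the compound key is shifted by +1 or −1, and probes the resulting keys in a fixed order; multi-probe LSH [9] orders perturbations more carefully, which this sketch does not attempt. The names `one_step_neighbors` and `probe_until` are hypothetical.

```python
def one_step_neighbors(key):
    """Yield compound keys that differ from key by +1 or -1 in one coordinate."""
    for i in range(len(key)):
        for delta in (+1, -1):
            yield key[:i] + (key[i] + delta,) + key[i + 1:]

def probe_until(buckets, key, t_low):
    """Collect objects from the query bucket, then from nearby buckets,
    stopping once at least t_low objects have been gathered."""
    found = list(buckets.get(key, []))
    for neighbor in one_step_neighbors(key):
        if len(found) >= t_low:
            break
        found.extend(buckets.get(neighbor, []))
    return found
```

Here `buckets` is assumed to be a dictionary from compound keys (tuples of integers) to lists of objects, so a missing key simply contributes no candidates.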

C. STREAM DATA PROCESSING
To deal with dynamic data, LayerLSH needs to support continuous point insertions and deletions. A naive implementation would let LayerLSH check the bucket size after each insertion or deletion. However, this may result in too many unnecessary bucket splits. For instance, an insertion into a nearly full bucket followed by multiple deletions may trigger an unnecessary bucket split. To alleviate this problem, we introduce the concept of a time window and buffer the insertions and deletions within a time window. Within a time window, a bucket is not split as long as its size does not exceed a predefined maximum tolerance $(1 + \epsilon_m) \cdot T_u$, where $\epsilon_m > 0$ and $T_u$ is the original upper bound on the bucket size. Otherwise, it is split immediately. The maximum tolerance of the bucket size determines the effect of buffering: the bigger $\epsilon_m$ is, the more insertions can be buffered. Note that if the overloaded bucket already resides in a child hash table, we re-split the child hash tables based on the updated bucket size instead of recursively splitting the overloaded bucket. This is because, if recursive splitting were used, continuous insertions might result in a very deep search path in LayerLSH's tree-like structure, which could degrade the query performance.
At the end of each time window, all the buckets are evaluated to determine whether they should be split, by comparing their sizes to a caching tolerance $(1 + \epsilon_c) \cdot T_u$, where $0 < \epsilon_c < \epsilon_m$. We introduce the caching tolerance to avoid unnecessary bucket splits after each time window.

By introducing time-window-based buffering, a large number of unnecessary bucket splits can be avoided, so the processing throughput is expected to be higher, at the expense of a potentially increased query cost due to the delayed bucket splits. The tradeoff between processing throughput and query cost can be adjusted by tuning the maximum tolerance parameter $\epsilon_m$ and the caching tolerance parameter $\epsilon_c$.
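The two tolerances can be captured by a small policy object. The Python sketch below is an illustration under stated assumptions rather than the paper's streaming algorithm: time is passed in explicitly as a number, bucket sizes are supplied as a dictionary, and the class and method names are hypothetical.

```python
class WindowedSplitPolicy:
    """Decide when a bucket is split under time-window buffering."""

    def __init__(self, t_up, eps_m, eps_c, window, start=0.0):
        if not 0 < eps_c < eps_m:
            raise ValueError("expected 0 < eps_c < eps_m")
        self.max_size = (1 + eps_m) * t_up      # maximum tolerance
        self.cache_size = (1 + eps_c) * t_up    # caching tolerance
        self.window = window                     # window length W
        self.last_check = start                  # time of the last window end

    def must_split_now(self, bucket_size):
        """Inside a window, split only when the maximum tolerance is exceeded."""
        return bucket_size > self.max_size

    def end_of_window(self, bucket_sizes, now):
        """Return the keys to split if the window has elapsed, else an empty list."""
        if now - self.last_check < self.window:
            return []
        self.last_check = now
        return sorted(key for key, size in bucket_sizes.items()
                      if size > self.cache_size)
```

A larger gap between the two tolerances buffers more updates inside a window, while a smaller caching tolerance brings bucket sizes back toward $T_u$ at each window boundary.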

D. INDEX MAINTENANCE
In addition, LayerLSH can be implemented as a disk-based index for maintaining large data sets. Since the basic structure in LayerLSH is tree-like, it is straightforward to store the index using a tree structure. The internal nodes storing the child LSH parameters as well as the pointers to child buckets are maintained in an index file, and the leaf nodes storing the data points are maintained in a data file. Note that, a leaf node that stores a large bucket is written to multiple file blocks, and multiple leaf nodes storing multiple small buckets are written to a single file block to save space. Similar buckets are stored continuously in a file block to support "nearby" bucket search.
When answering queries, the internal nodes maintained in the index file are loaded into memory for fast access; for a large index, only part of them is loaded. After the data buckets are located, the file blocks storing the candidate buckets are loaded into memory for distance computations, followed by returning the approximate kNNs. To support insertion of a point, we first locate the file blocks that store the hashed buckets and then append the point data. Note that a new file block may need to be requested if the located file block is full. To support deletion of a point, we first locate the file blocks and mark the point as invalid. A periodic recycling process is executed offline to reclaim the file blocks that contain no valid data.

A. DISTRIBUTED LSH
The approach of supporting distributed NN search with both LSH and LayerLSH is straightforward. We use Hadoop MapReduce to implement distributed LayerLSH. The map() function invoked on a point emits intermediate key-value pairs keyed by the point's compound hash keys. According to these intermediate key-value pairs, we obtain l different partition results (see Section II-B), i.e., l hash tables. We then send the buckets in these hash tables to the corresponding reducers, so that each reduce() invocation receives a subset of points (i.e., a bucket) under a specific LSH partition layout. Next, we check whether the size of the bucket exceeds the threshold. If so, the reducer recursively splits the bucket until the predefined conditions are met. After the map/reduce operations, we obtain the LayerLSH index. During the query phase, for each query point, we retrieve its l approximate kNN candidate sets from multiple distributed reducers. We then aggregate these candidate sets in a second MapReduce job, and finally select the top k as its k nearest neighbors at a single reducer.
Suppose we have n workers. A bucket with key bk is assigned to worker $H(bk) \bmod n$, where H(·) is any hash function that maps a bucket key to an integer. As a query q arrives, its hash values $G_1(q), G_2(q), \ldots, G_l(q)$ corresponding to the l hash tables are first calculated. The query is then sent to the workers $H(G_j(q)) \bmod n$, $1 \le j \le l$, for local computation. The candidates obtained by local NN search are refined before being merged into a global candidate list, where the refining method could be extracting only the top k nearest neighbors. Due to the skewed data distribution, hot spots might exist in distributed LSH, while LayerLSH has the advantage of alleviating hot spot contention.
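The routing and merging steps can be sketched without any cluster framework. The Python sketch below assumes that buckets are assigned to workers by a hash of the bucket key taken modulo the number of workers, and it uses CRC32 over the key's textual form as one possible stable choice of H(·). The function names and the `(distance, object_id)` result format are assumptions of this sketch.

```python
import zlib

def worker_for(bucket_key, n_workers):
    """Map a bucket key to a worker index via a stable hash modulo n_workers."""
    data = repr(tuple(bucket_key)).encode("utf-8")
    return zlib.crc32(data) % n_workers

def route_query(compound_keys, n_workers):
    """Return, for each hash table j, the worker holding the query's bucket."""
    return {j: worker_for(key, n_workers) for j, key in enumerate(compound_keys)}

def merge_topk(local_results, k):
    """Merge per-worker lists of (distance, object_id) into the global top k,
    keeping the smallest distance seen for each object."""
    best = {}
    for results in local_results:
        for dist, oid in results:
            if oid not in best or dist < best[oid]:
                best[oid] = dist
    return sorted((dist, oid) for oid, dist in best.items())[:k]
```

A stable hash such as CRC32 is used instead of Python's built-in `hash` so that all processes agree on the assignment; refining each local list to its own top k before merging keeps the data sent to the final reducer small.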

B. ALL-PAIRS COMPUTATION
All-pairs computation is a common preprocessing step in many applications, e.g., retrieving a similarity matrix for learning data correlations [10], pruning distant neighbors for abstracting a graph structure [11], evaluating implicit properties for each data point [12], and so on. All-pairs computation is known to be computation-intensive, requiring $N^2$ distance computations. This is extremely costly for large-volume and high-dimensional data. Since the all-pairs computation in these applications is often performed only over the nearest neighbors, LSH is an ideal approximation method for optimizing all-pairs computation. Furthermore, using distributed machines can further speed up this computation-intensive task. The LSH buckets are distributed among multiple workers, where the all-pairs computation is performed locally within each bucket.
However, distributed LSH-based all-pairs computation suffers from the drawback of skewed bucket size distribution. The workers with dense buckets can be the stragglers, which can significantly slow down the whole process. Fortunately, LayerLSH can alleviate this impact by bounding the bucket size, while at the same time guaranteeing the accuracy. Moreover, we merge similar small buckets in order to not only improve the accuracy but also reduce the number of distributed tasks.

C. CASE STUDY: POINT DENSITY EVALUATION
We take point density evaluation as a use case for illustration. The density $\rho_i$ of a point $p_i$ is defined as the number of neighbors within a radius $R$, i.e., $\rho_i = |\{p_j \mid \forall j, |p_i, p_j| \le R\}|$, where $|p_i, p_j|$ denotes the distance between $p_i$ and $p_j$. In this problem, the computation of $\rho_i$ depends only on the neighbors whose distance to $p_i$ is within $R$. Let $\hat{\rho}_i$ denote the approximated density. By using LSH, the probability $\Pr(\hat{\rho}_i = \rho_i)$ can be studied as follows.

Lemma 1: Given a point $p_i$ and an LSH function $h(p_i) = \left\lfloor \frac{a \cdot p_i + b}{w} \right\rfloor$, the probability that $p_i$ and its nearest neighbors set $\{p_j \mid \forall j, |p_i, p_j| \le R\}$ are hashed to the same bucket is at least

$$1 - \frac{4R}{\sqrt{2\pi}\, w}.$$

According to the definition of p-stable distribution [5], given a $d$-dimensional random vector $a$, each entry of which is chosen independently from a standard gaussian distribution $N(0, 1)$, for two points $p_i$ and $p_j$, the distance between their projections $|a \cdot p_i - a \cdot p_j|$ (here $|\cdot|$ means the absolute value) is distributed as $|p_i, p_j| \cdot x$, where $x$ is the absolute value of a standard gaussian random variable. Therefore, writing $y_i = a \cdot p_i + b$, for any $p_j$ where $|p_i, p_j| < R$, we have $\max_j |y_i - y_j| = \max_j |a \cdot p_i - a \cdot p_j| < R \cdot x$.
Moreover, $y_i = a \cdot p_i + b$ is uniformly distributed within its slot. To ensure that $y_i$ and all its R-length neighbors are in the same slot, $y_i$ has to be located in the interval $[\alpha w + Rx, (\alpha + 1)w - Rx)$ for some integer $\alpha$. The probability that $y_i$ resides in such an interval is $\frac{w - 2Rx}{w} = 1 - \frac{2Rx}{w}$. It is worth noting that the random vector $a$ used for mapping $p_i$ is the same as the one used for mapping all its R-length neighbors. The probability density function of the absolute value of the standard gaussian distribution is $f(x) = \frac{2}{\sqrt{2\pi}} e^{-x^2/2}$ for $x \ge 0$, whose mean is $\frac{2}{\sqrt{2\pi}}$, and a further calculation shows that the probability is $1 - \frac{4R}{\sqrt{2\pi}\, w}$.

By applying the LSH properties described in Equation (3) and Equation (4), we have the following theorem.

Theorem 1: With $l$ groups of $m$ hash functions, the probability is finally enlarged as

$$\Pr(\hat{\rho}_i = \rho_i) \ge 1 - \left(1 - \left(1 - \frac{4R}{\sqrt{2\pi}\, w}\right)^{m}\right)^{l}.$$

Proof 5: After applying $l$ groups of $m$ hash functions, we obtain $l$ values $\hat{\rho}_i^g$ ($1 \le g \le l$). According to the definition of $\rho_i$, we have $\hat{\rho}_i^g \le \max_g \hat{\rho}_i^g \le \rho_i$, so the largest of the $l$ values is used as the approximation $\hat{\rho}_i$. From Lemma 1, under a single hash function the probability that $p_i$ and all its R-length neighbors are hashed to the same bucket is at least $1 - \frac{4R}{\sqrt{2\pi}\, w}$. With a group of $m$ LSH functions $G = (h_1, h_2, \ldots, h_m)$ applied on each point, only points sharing all the $m$ hash values are placed in the same partition. Suppose $\hat{\rho}_i^g$ is the approximated density value for a specific hash function group $G_g(p_i)$. Because each LSH function is independently and randomly selected, we have

$$\Pr(\hat{\rho}_i^g = \rho_i) \ge \left(1 - \frac{4R}{\sqrt{2\pi}\, w}\right)^{m}.$$

Further, since the $l$ groups of hash functions $G_g$ ($1 \le g \le l$) are independently and randomly generated, we have

$$\Pr\left(\max_g \hat{\rho}_i^g = \rho_i\right) \ge 1 - \left(1 - \left(1 - \frac{4R}{\sqrt{2\pi}\, w}\right)^{m}\right)^{l}.$$

Therefore, users are allowed to specify an expected accuracy in density approximation. However, the unbalanced bucket allocation causes trouble in distributed computing. As shown in Section V-F, one or two stragglers significantly slow down the whole process. LayerLSH rehashes the overloaded buckets to alleviate this problem.
Meanwhile, the theoretical accuracy can be guaranteed by choosing child LSH parameters in terms of Proposition 2.
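As a worked illustration of how a user-specified accuracy translates into a bucket width, the sketch below inverts the bound in closed form. It assumes that the compound bound takes the usual form $1 - (1 - p^m)^l$, with $p$ the single-function bound from Lemma 1; the function names are hypothetical.

```python
import math

def single_hash_bound(R, w):
    """Lemma 1 lower bound for one hash function: 1 - 4R / (sqrt(2*pi) * w)."""
    return 1.0 - 4.0 * R / (math.sqrt(2.0 * math.pi) * w)

def accuracy_bound(R, w, m, l):
    """Assumed compound bound 1 - (1 - p^m)^l for l groups of m functions."""
    p = max(0.0, single_hash_bound(R, w))  # the bound is vacuous when negative
    return 1.0 - (1.0 - p ** m) ** l

def width_for_accuracy(R, m, l, target):
    """Smallest w whose bound reaches `target` (0 < target < 1)."""
    # Invert 1 - (1 - p^m)^l = target for p, then invert Lemma 1 for w.
    p_needed = (1.0 - (1.0 - target) ** (1.0 / l)) ** (1.0 / m)
    return 4.0 * R / (math.sqrt(2.0 * math.pi) * (1.0 - p_needed))
```

For example, `width_for_accuracy(R, 3, 3, 0.95)` gives the width at which the assumed bound equals 0.95 for l = 3 and m = 3; any larger w satisfies the target at the price of larger buckets.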

V. EXPERIMENTS
The experiments were performed on an Ubuntu system equipped with one Intel(R) Xeon(R) 2.60GHz CPU and 32GB of memory.
Datasets and Queries
We evaluate our approach using five real datasets: KDD, Forest, Color, Audio, and Mnist. Properties of these five datasets are summarized in Table 1. We also generate five sets of queries from each dataset. We first evaluate the density of each point, which is the number of neighbors within a given radius, then extract the 2% of points with the highest density as dense queries, the 2% of points with the lowest density as sparse queries, and a random sample of 2% of the points as random queries.
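The query-set construction can be sketched as follows. This is a brute-force illustration with hypothetical function names, not the procedure used in the experiments: density is computed by an O(N^2) scan, and a point is not counted as its own neighbor (this shifts all densities by one and does not change the ranking).

```python
import math
import random

def point_densities(points, radius):
    """Brute-force density: number of other points within `radius` of each point."""
    dens = []
    for i, p in enumerate(points):
        dens.append(sum(1 for j, q in enumerate(points)
                        if j != i and math.dist(p, q) <= radius))
    return dens

def build_query_sets(points, radius, fraction=0.02, seed=0):
    """Return index lists for the dense, sparse, and random query sets."""
    dens = point_densities(points, radius)
    order = sorted(range(len(points)), key=lambda i: dens[i])  # ascending density
    n = max(1, round(len(points) * fraction))
    rng = random.Random(seed)
    return {
        "dense": order[-n:],    # highest-density points
        "sparse": order[:n],    # lowest-density points
        "random": rng.sample(range(len(points)), n),
    }
```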
A small error ratio thus implies high query accuracy. The average of the error ratios over all queries is used for evaluation.
Query Cost
We evaluate the query cost in terms of the number of candidates that have to be checked by distance measurement.
Parameter Settings
The original LSH indices for these datasets are first built with parameters l = 3, m = 3 (the effect of different l and m is shown in Section V-D). w is set to satisfy a predefined accuracy, which differs across experiments. The number of returned NNs k is set to 20 unless otherwise mentioned. r* is estimated by sampling 1% of the data points as discussed in Section III-A, and it varies with k and the dataset. The LayerLSH parameters are set as α = 0.9, β = 0.005 unless otherwise mentioned. Neither of these two parameters is designated as the primary one, so the algorithm processes queries by "averaging" the two requirements, as discussed in Section III-B.
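The two metrics can be computed as in the sketch below. Note the assumption: the formula for the error ratio is not reproduced in this section, so the code uses the overall-ratio definition common in the LSH literature (the mean, over ranks 1..k, of the returned neighbor's distance divided by the true neighbor's distance), which may differ in detail from the definition used here. The names `answer_fn` and `exact_fn` are hypothetical callbacks.

```python
import math

def error_ratio(query, returned, exact):
    """Mean of dist(returned_i, q) / dist(exact_i, q) over the ranks i = 1..k.

    `returned` and `exact` are lists of points sorted by distance to `query`.
    A value of 1.0 means the returned neighbors are as close as the true ones.
    """
    ratios = []
    for o, o_star in zip(returned, exact):
        d_star = math.dist(o_star, query)
        if d_star > 0:  # skip ranks where the true neighbor coincides with q
            ratios.append(math.dist(o, query) / d_star)
    return sum(ratios) / len(ratios) if ratios else 1.0

def evaluate(queries, answer_fn, exact_fn):
    """Average error ratio and average query cost (candidates checked) over queries."""
    total_ratio = 0.0
    total_cost = 0.0
    for q in queries:
        returned, n_candidates = answer_fn(q)  # index under test
        total_ratio += error_ratio(q, returned, exact_fn(q))
        total_cost += n_candidates
    return total_ratio / len(queries), total_cost / len(queries)
```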

A. OVERALL PERFORMANCE
Generally speaking, a higher query cost yields higher accuracy, and vice versa. Comparing query cost or query accuracy in isolation is therefore not very meaningful, so we show the query cost and accuracy results in the same figure. The question we try to answer is which method achieves the highest accuracy at the same query cost, or which method requires the lowest cost to achieve the same accuracy. We first build multiple LSH indices for these datasets by using different accuracy requirement parameters. With different LSH indices, we expect to obtain various (cost, accuracy) pairs when answering kNN queries, which correspond to multiple points in an (x, y)-plot. We then reconstruct the LSH indices and build their layered versions. By tuning the expected recall α and the expected precision β, we can also create multiple layered indices with various (query cost, query accuracy) pairs. We show the results of LSH and LayerLSH when answering different types of queries in Figure 3. All the results are obtained by averaging 3 trials. As expected, the error ratio decreases as the query cost increases. We can see that the query accuracy (reflected by the error ratio) and the query cost (reflected by the number of candidates or distance measurements) vary considerably across query types. For sparse queries, less accurate kNNs are returned and fewer distance measurements are required. For dense queries, more accurate kNNs are returned and a higher query cost is required. This is true for both LSH and LayerLSH. Figure 3 also compares LSH and LayerLSH. Considering both query accuracy and query cost, a curve that corresponds to an LSH index and a specific query type exhibits better performance when it is closer to the bottom left corner. We can see that, for all types of queries, LayerLSH requires much less query cost (roughly 5%-20% of that of LSH) to achieve the same error ratio.
In addition, to further verify the superiority of our proposed method in average query performance, we show the average query time and query cost of LSH and LayerLSH in Figure 4. We vary the parameters to obtain multiple <ratio, query cost> pairs and <ratio, query time> pairs for LSH and LayerLSH, and draw four curves, corresponding to LSH-cost, LSH-time, LayerLSH-cost, and LayerLSH-time. A curve exhibits better performance when it is closer to the bottom left corner. As can be seen from these figures, LayerLSH always achieves the same error ratio with less time and cost than LSH.

B. SPACE CONSUMPTION
The space consumption is the size of the index file that stores the index. The space consumption of the basic LSH index and the LayerLSH index is listed in Table 2, where l = 3, m = 3, α = 0.9 for both LSH and LayerLSH. Since the dense buckets are rehashed into extra hash tables and more copies of dense buckets exist, LayerLSH needs more space to store the extra indices. In LSH, using fewer hash tables takes up less space, but this does not necessarily hold for LayerLSH, because more overloaded buckets and much denser buckets can exist when a small number of LSH tables is used.
In LayerLSH, the space consumption highly depends on the expected precision parameter β, which determines the threshold for overloaded buckets. The larger β is, the smaller the preferred bucket size. Thus, a larger number of small buckets and more child hash tables are expected to be created, so more space is required for indexing these buckets and hash tables. As shown in Table 2, the increase in index size is moderate and acceptable. As will be seen in Section V-D, good accuracy and efficiency can be achieved by setting β = 0.5%.

C. REBUILDING TIME
Rebuilding LSH indices is not free: extra time is needed to build the LayerLSH index. We measure the rebuilding time for LayerLSH in this experiment. The amount of rebuilding effort is closely related to the parameter β, which determines the threshold for overloaded buckets. A smaller β implies that a higher query cost is tolerable, so possibly only a small number of highly overloaded buckets are rehashed. In contrast, a larger β implies that a large number of buckets are likely to be recognized as overloaded, and more effort is spent on rehashing. Accordingly, the rebuilding time increases as β increases. We show the rebuilding time for β = 0.1%, 0.2%, 0.5% in Table 3. The LSH building time includes the data loading time and the index building time. The LayerLSH rebuilding time includes the LSH index loading time and the index rebuilding time. As can be seen, compared with the original LSH indexing time, LayerLSH needs a reasonable amount of extra time to rebuild the index for most datasets when β is small, and a longer rebuilding time is expected when β is larger. However, as shown in the next experiment, setting β = 0.5% already yields good accuracy and efficiency.
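The rebuilding step can be pictured with the following minimal sketch, which rehashes every overloaded bucket into a child table. It is a simplification under several assumptions: the bucket size limit is passed in directly rather than derived from β, each overloaded bucket gets a single child table, and the child width is simply halved instead of being chosen according to Proposition 2. All names are hypothetical.

```python
import math
import random

def make_hash_group(dim, m, w, rng):
    """m hash functions h(p) = floor((a.p + b) / w) with gaussian a and uniform b."""
    params = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
              for _ in range(m)]

    def compound_key(p):
        return tuple(math.floor((sum(ai * xi for ai, xi in zip(a, p)) + b) / w)
                     for a, b in params)
    return compound_key

def build_table(points, ids, dim, m, w, limit, rng, depth=0, max_depth=4):
    """Hash `ids` into buckets and replace overloaded buckets by child tables."""
    key_fn = make_hash_group(dim, m, w, rng)
    buckets = {}
    for i in ids:
        buckets.setdefault(key_fn(points[i]), []).append(i)
    for key, members in buckets.items():
        if len(members) > limit and depth < max_depth:
            # The bucket becomes a pointer to a child table with a narrower width.
            buckets[key] = build_table(points, members, dim, m, w / 2.0,
                                       limit, rng, depth + 1, max_depth)
    return {"key_fn": key_fn, "buckets": buckets}

def lookup(table, q):
    """Descend from the root table to the data bucket that q falls into."""
    node = table
    while isinstance(node, dict):
        node = node["buckets"].get(node["key_fn"](q), [])
    return node
```

In this sketch a lower `limit` marks more buckets as overloaded, so more child tables are built, which mirrors the growth in rebuilding time and index size discussed above.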

D. PARAMETER STUDIES
In this study, we investigate the parameters that potentially affect the performance of LayerLSH. These parameters include k, the expected recall α, which reflects the accuracy, and the expected precision β, which reflects the efficiency. The experiments are run on all datasets. For each dataset, we show the error ratio and query cost when varying the parameters. We first study the effect of k by varying k from 5 to 100 and fixing α = 0.9, β = 0.005. Figure 5a, Figure 6a, Figure 7a, Figure 8a, and Figure 9a show the results on different datasets. In general, the error ratio drops slightly and the query cost increases when k increases. This is expected, since we fix the expected precision β = 0.005 and the query cost therefore grows with k.
We next study the effect of α by varying α from 0.1 to 0.95 and fixing k = 20, β = 0.005. Figure 5b, Figure 6b, Figure 7b, Figure 8b, and Figure 9b show the results on different datasets. The error ratio drops significantly, and the query cost increases slightly when α increases.
We also study the effect of β by varying β from 0.0001 to 0.1 and fixing k = 20, α = 0.9. Figure 5c, Figure 6c, Figure 7c, Figure 8c, and Figure 9c show the results on different datasets. The query cost drops dramatically when β increases. At the same time, the error ratio increases, as expected. This suggests that setting β too small in pursuit of a lower error ratio is not worthwhile, because the query cost becomes significant.
In addition, α is known as the user-specified expected recall. In order to see its effect on the real recall, we set α as the primary parameter (see Section III-B) and measure the recall rates when varying α. The results are shown in Figure 5d.
In addition, we evaluate the effect of different l and m when comparing LSH and LayerLSH on the KDD dataset. The error ratio (abbreviated as r) and query cost (abbreviated as c) results are shown in Table 4. We can see that LayerLSH consistently outperforms LSH on both query accuracy and query cost when varying l and m. After bucket splitting, a large number of sparse buckets appear, which triggers more nearby bucket search operations. This helps reduce the error ratio considerably.

E. HANDLING STREAM DATA
To illustrate LayerLSH's ability to handle stream data, we prepare a sequence of points from the KDD and Forest datasets and process these points one by one. The initial dataset is randomly sampled from the original dataset and contains 25% of its points. The maximum tolerance parameter is set as $\epsilon_m = 0.6$ and the caching tolerance parameter as $\epsilon_c = 0.2$. The time window W is set as 0.5s. We measure the processing throughput every 50ms. Note that overloaded buckets will be split during this process, which can affect the throughput. At the same time, we use another thread to query a specific point's kNNs after each insertion and record the number of returned candidates for distance measurement, which is taken as the query cost.
FIGURE 10. The processing throughput and query cost for a sequence of insertions.
The processing throughput and query cost results are shown in Figure 10, where Throughput w.DBS and Cost w.DBS are the recorded throughput and query cost. We can see that the processing throughput drops drastically after each time window (0.5s), since many overloaded buckets are split at that time. Note that the throughput may also drop within a time window, since some buckets might be seriously overloaded (i.e., the bucket size is larger than $(1 + \epsilon_m) \cdot T_u$) and cannot wait for the time window to end. The query cost for a specific point increases continuously until some of its host buckets are split. In this figure, the cost drops after the first time window (0.5s), probably because one of the query's host buckets is split. In addition, the throughput should be affected by $\epsilon_m$ and $\epsilon_c$. We run a series of experiments and see that the throughput increases with $\epsilon_m$. For $\epsilon_m$ = {0.2, 0.4, 0.6, 0.8, 1}, the average throughput results on the KDD dataset are {1.25, 1.49, 1.66, 1.77, 2.28} $\times 10^3$ pts/s.
To verify the effect of the delayed bucket split optimization, we turn off this optimization and show the results (i.e., Throughput w/o.DBS and Cost w/o.DBS) for comparison in Figure 10. As can be seen from these two figures, the query cost without the delayed bucket split optimization is reduced, but the throughput drops much more significantly and even reaches 0 during a few periods, e.g., the 0.5-0.65 second period for the KDD dataset.
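The delayed bucket split policy evaluated above can be sketched as a small insertion handler. This is a toy model under assumptions: `t_u` plays the role of the bucket size threshold T_u, a bucket larger than (1 + eps_c) * T_u is assumed to be queued for splitting at the end of the current time window, a bucket larger than (1 + eps_m) * T_u is split immediately, and `_split` is a placeholder that halves the bucket instead of rehashing it.

```python
class DelayedSplitIndex:
    """Toy index: buckets are lists keyed by an arbitrary bucket key."""

    def __init__(self, t_u, eps_c=0.2, eps_m=0.6, window=0.5):
        self.t_u, self.eps_c, self.eps_m, self.window = t_u, eps_c, eps_m, window
        self.buckets = {}
        self.pending = set()        # overloaded buckets waiting for the window end
        self.window_end = window
        self.splits = 0

    def _split(self, key):
        """Placeholder for rehashing: halve the bucket under two new keys."""
        members = self.buckets.pop(key)
        half = len(members) // 2
        self.buckets[(key, 0)] = members[:half]
        self.buckets[(key, 1)] = members[half:]
        self.pending.discard(key)
        self.splits += 1

    def insert(self, key, point, now):
        if now >= self.window_end:              # window elapsed: run deferred splits
            for k in list(self.pending):
                self._split(k)
            while self.window_end <= now:
                self.window_end += self.window
        bucket = self.buckets.setdefault(key, [])
        bucket.append(point)
        if len(bucket) > (1 + self.eps_m) * self.t_u:
            self._split(key)                    # seriously overloaded: cannot wait
        elif len(bucket) > (1 + self.eps_c) * self.t_u:
            self.pending.add(key)               # overloaded: defer to the window end
```

Deferring splits batches the expensive work at window boundaries, which is consistent with the throughput dips observed after each 0.5s window.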

F. DISTRIBUTED ALL-PAIRS COMPUTATION
In Section IV, we introduce a use case of distributed all-pairs computation, i.e., point density evaluation. We conduct experiments to show the benefit of our proposed multi-layered LSH structure in distributed computing. The experiment is performed in a large distributed cluster that contains 64 m1.medium Amazon EC2 instances. Each instance is equipped with 1 vCPU, 3.75GB memory, and 410GB disk. We utilize LSH and LayerLSH to partition the BigCross dataset, which contains 11,620,300 instances with 57 attributes each, and compute the approximate point densities by using Hadoop MapReduce. The MapReduce implementation involves two jobs. The first job performs the LSH partition and the all-pairs computation locally. The second job aggregates the local results. The LSH parameters are set as l = 3, m = 3. The expected accuracy is set to 0.95 such that w can be computed based on Equation (4), since the radius for evaluating point density is given. In LayerLSH, the bucket size limit is set as 10000 rather than being set according to the precision rate, since this is not a kNN query application. In addition, similar sparse buckets are merged to further improve accuracy and to reduce the number of partitions. The lower bound of bucket size is set as 1000. The number of reduce tasks is set as 256.
The bucket size distributions obtained with LSH and LayerLSH are depicted in Figure 11. The dense buckets are rehashed and the sparse buckets are merged in LayerLSH, so that the skewed buckets are balanced. As is well known, workload balance is crucial for distributed computing and can bring significant performance gains, especially in a very large-scale distributed environment.
The runtime of the reduce tasks is shown in Figure 12. Each colored bar represents a reduce task, which performs local all-pairs computation. Each worker is assigned multiple reduce tasks. The skewed distribution of bucket sizes leads to skewed runtimes of the reduce tasks. We can see that the runtime of reduce tasks is seriously skewed when using LSH, while it is more balanced in LayerLSH. Accordingly, LSH-based point density evaluation requires a much longer runtime than the LayerLSH-based approach (20h58m2s vs. 1h36m38s). Since the rehashing strategy of LayerLSH does not reduce accuracy and the sparse buckets are merged, the accuracy is even higher with LayerLSH.
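The two-job structure can be mimicked in a single process as follows. This is a sketch rather than the Hadoop implementation: the key functions are hypothetical stand-ins for the l compound LSH functions, bucket bounding and merging are omitted, and the second job is assumed to keep the maximum local estimate per point (each local count can only miss neighbors, so the largest one is the closest to the true density).

```python
import math
from collections import defaultdict

def job1_local_densities(points, key_fns, radius):
    """Job 1: group point ids by (table, bucket key), then run all-pairs per bucket."""
    groups = defaultdict(list)
    for pid, p in enumerate(points):
        for g, key_fn in enumerate(key_fns):
            groups[(g, key_fn(p))].append(pid)
    for members in groups.values():
        for i in members:
            # Local density: neighbors within `radius` found inside this bucket only.
            count = sum(1 for j in members
                        if j != i and math.dist(points[i], points[j]) <= radius)
            yield i, count

def job2_aggregate(local_records):
    """Job 2: keep, for each point, the largest local estimate over all tables."""
    best = defaultdict(int)
    for pid, count in local_records:
        best[pid] = max(best[pid], count)
    return dict(best)
```

In this model the cost of a reduce task is quadratic in its bucket size, which is why a handful of dense buckets dominates the runtime when plain LSH is used for partitioning.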

VI. RELATED WORK
LSH Variants
The LSH functions based on Euclidean space were proposed by Datar et al. [5]. Since then, a large number of LSH variants have been proposed for improving accuracy and reducing I/O cost, including table-based LSH such as multi-probe LSH [9], entropy LSH [13], and C2LSH [14], and tree-based LSH such as LSB-tree [15], LSH Forest [16], and SK-LSH [17]. Besides these, numerous excellent works have been proposed in recent years. LazyLSH [18] is able to answer approximate NN queries for multiple $l_p$ metrics. It uses a single base index to support the computations in multiple $l_p$ spaces. QALSH [19] introduces a novel concept of query-aware bucket partition, which uses a given query as the anchor for bucket partition. SRS [20] requires only a single tiny index to answer approximate NN queries with theoretical guarantees. I-LSH [21] proposes an I/O-efficient random hash based method, which obtains a good trade-off between search accuracy and I/O efficiency by using an incremental, rather than exponentially expanding, search strategy. SL-ALSH and S2-ALSH [22] support efficient approximate nearest neighbor search for multiple weighted distance functions with respect to the $l_2$ distance. Each of them introduces an asymmetric LSH family on top of E2LSH [5], and the asymmetric LSH families allow them to process each nearest neighbor query flexibly according to the weight vector attached to the query. PM-LSH [23] uses PM-Trees to index the data and improves the query processing time by using a tunable confidence interval to offer a higher accuracy of the results. Instead of using one-dimensional projections, R2LSH [24] uses two-dimensional projections and indexes the data by using B+-Trees. In the query processing phase, R2LSH uses a query-centric ball to search the neighboring areas of the query and saves I/O costs. VHP [25] introduces the concept of virtual hypersphere partitioning and combines it with a B-tree index to improve the search efficiency of unbounded and irregular space.
LCCS-LSH [26] proposes a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework, which supports c-ANNS with different distance metrics. The LCCS search framework makes data objects that are closer together have a longer LCCS than those that are far apart. It is worth mentioning another work called Layered LSH [27], which aims to distribute the hash buckets such that the search is likely to be performed on the same physical machine (hence improving network efficiency). Despite the similar name, its goal differs from ours.
Distributed LSH
A set of research works focuses on designing efficient distributed LSH indices for supporting large data sets [28]-[30]. Zhang et al. utilize the locality preserving property of z-values and perform z-value based partition join in MapReduce to approximate the kNN joins [31]. Haghani et al. propose mappings from the multi-dimensional LSH bucket space to the linearly ordered set of peers that jointly maintain the indexed data, so that buckets likely to hold similar data are stored on the same or neighboring peers in a P2P system [32]. Bahmani et al. propose a distributed Entropy LSH implementation and prove that it exponentially decreases the network cost, while maintaining a good load balance between different machines [27]. PLSH [33] is a parallel LSH that supports high-throughput streaming of new data, which exploits an insert-optimized hash table structure and an efficient data expiration algorithm for streaming data.
Data Dependent Hashing
As discussed in Section I, the recently proposed data sensitive hashing methods, e.g., DSH [6], selective hashing [7], and ANN softmax [34], leverage data distributions. However, rather than learning the optimal hash functions from the skewed data, our approach relies on postprocessing and leverages the density of hash values to reorganize the existing structures. It is orthogonal to data sensitive hashing. HashFile [35] also proposes to recursively partition the dense buckets. In contrast, we use a multi-layered structure to organize the points as a general strategy that also benefits tree-like LSH indices. OSimJoin [36] also proposes a recursive partitioning strategy, but its main purpose is to use locality-sensitive hashing to minimize the number of I/O operations between external memory and internal memory, whereas our goal is to improve the performance of locality-sensitive hashing itself by exploring the density of hash values. In addition, OSimJoin needs to rehash each bucket to split the problem into sub-problems that fit into internal memory, while LayerLSH only rehashes dense buckets for dealing with skewed data. Moreover, OSimJoin only uses one hash function (m = 1) to rehash each bucket, while LayerLSH uses multiple hash functions to rehash a dense bucket multiple times (the number of rehash times is not fixed but adapts to the data distribution) to ensure accuracy. NSH [37] and our work share the same intuition that the limited hash bits should be used to better distinguish nearby items instead of capturing the distances among far-apart items. However, NSH aims to devise a new hashing mechanism to achieve this goal, while we propose to reconstruct the existing LSH index structures as a postprocessing step.
"Learning to hash" [38]-[40] has recently attracted many research efforts. It uses machine learning techniques to learn hash functions from a specific dataset so that the nearest neighbor search result in the hash coding space is as close as possible to the search result in the original space. Our LayerLSH is also significantly different from LSH Forest [16]: (1) LSH Forest contains a set of LSH trees with different {l, m} parameters, and each one is a logical prefix tree for the set of all labels, with each leaf corresponding to a point, whereas LayerLSH is maintained in a single tree, where each node is either a data bucket containing a set of points or a pointer bucket containing the pointers to child hash tables. (2) Different from LSH Forest, which contains multiple trees each with different {l, m} parameters, LayerLSH is a single tree where the child nodes with the same parent node are a set of LSH buckets with different {l, m} parameters.

VII. CONCLUSION
In this paper, we present the layered version of LSH variants by exploring the density of hash values. The dense buckets are rehashed and the sparse buckets are merged in order to make the hashing more targeted in terms of data distribution. We also discuss the possibilities of rebuilding other LSH variants and demonstrate the benefit in distributed computing. The experiment results have shown their effectiveness and efficiency. Specifically, LayerLSH can reach the same search quality as LSH with only 5%-20% query cost.
JIWEN DING received the Bachelor and Master degrees in computer science from Northeastern University, China. He is currently a Ph.D. student at Northeastern University, China. His research consists of big data management and data mining.
ZHUOJIN LIU is currently a Master student at Northeastern University, China. Her research consists of data mining, data management, and GPU acceleration.
YANFENG ZHANG received the Ph.D. degree in computer science from Northeastern University, China, in 2012. He is currently an associate professor with Northeastern University, China. His research consists of distributed systems and big data processing. He has published many papers in the above areas. His paper in SoCC 2011 was honored with "Paper of Distinction".
SHUFENG GONG received the Ph.D. degree in computer science from Northeastern University, China. He is a lecturer at Northeastern University, China. His research consists of distributed systems and graph processing.
GE YU received the Ph.D. degree in computer science from the Kyushu University of Japan, in 1996. He is now a professor with Northeastern University, China. His current research interests include distributed and parallel systems, cloud computing and big data management, and blockchain techniques and systems. He has published more than 200 papers in refereed journals and conferences. He is a CCF fellow, an IEEE senior member, and an ACM member.