Optimized Signature Selection for Efficient String Similarity Search

In this paper, we study the problem of string similarity search, which retrieves from a database all strings similar to a query string within a given threshold. To measure the similarity between strings, we use edit distance. Many algorithms have been proposed under a filtering-and-verification framework to solve the problem. To reduce the overhead of edit distance verification, it is crucial to efficiently generate a small number of candidates in the filtering phase. Recently, an index structure named HSTree has been proposed for efficiently generating candidate strings. To generate candidates, the existing technique selects and uses the HSTree nodes at a specific level calculated from a given threshold. In this paper, we observe that there are many alternative ways to select HSTree nodes, and based on this observation we propose a novel technique that selects HSTree nodes in an optimized way. We also propose a modified HSTree, named a threaded HSTree, which connects the inverted lists of an HSTree node to the inverted lists of its child nodes. With a threaded HSTree, we can reduce the overhead of index lookups in HSTree nodes while selecting optimal tree nodes. Experimental results show that the proposed technique significantly outperforms the existing technique using the HSTree.


I. INTRODUCTION
Finding similar objects is essential in data analytics, and many similarity measures have been developed for different types of data. For example, SimRank [13] and its variants [26], [37], [47], [49], [50], [52] have been proposed to measure the similarity between objects in an information network; common subgraphs [5], [36], missing edges and features [51], [53], and graph edit distance [15], [32], [34] have been developed to quantify the similarity between complex objects that are represented by graph models; Jaccard [12], Cosine, and Dice [9] similarities are commonly used for set data; and LSA [19] has been developed to measure similarity between documents through corpus analysis.
In this paper, we focus on syntactic similarity between unstructured text data. Because text data are abundant, and typographical errors and different representations of text data are unavoidable, finding syntactically similar strings is a fundamental operation required in a wide range of applications including data cleaning [8], query relaxation [33], DNA read mapping [17], [18], and near duplicate detection [45]. To measure the similarity between two strings, we use edit distance [11], [27], [28], [38], which is the minimum number of edit operations (insertion, deletion, and substitution) to transform one string to the other. Edit distance has the following advantages over alternative measures: it reflects the ordering of characters in the string and it allows non-trivial alignment. These properties enable edit distance to capture typographical errors in text documents, and to capture similarities between homologous proteins or genes [29], [43].
The associate editor coordinating the review of this manuscript and approving it for publication was Qichun Zhang.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The problem of string similarity search studied in this paper is to retrieve all strings in a string database whose edit distance to a query string is within a given threshold. This is a challenging problem, because edit distance computation is costly and a scan-based approach that computes the edit distance for each string in the database would incur a prohibitive $O(N \cdot n^2)$ cost for a large database, where N is the number of strings in the database and n is the average length of a string. To address the performance challenge, there has been a rich literature on this problem [3], [4], [6], [7], [16], [18], [20]-[23], [29], [39], [41], [46], [48]. All existing techniques adopt a filtering-and-verification framework, with a main focus on the filtering phase to reduce the overhead of edit distance computation in the verification phase. To effectively generate candidate strings by filtering out strings dissimilar to a query string, they utilize signatures extracted from data strings. The most widely used signature is the q-gram, which is a substring of length q. Given two strings and an edit distance threshold, a necessary condition to meet the threshold is established on the minimum number of q-grams contained in both strings. To efficiently generate candidate strings using the q-gram signature scheme, all existing algorithms utilize an inverted index built on data strings. They extract q-grams from data strings, and make an inverted list for each q-gram, which is a list of ids of strings that contain the q-gram. Then, they build an index that maps each q-gram to its corresponding inverted list. Early work (e.g., [20], [21]) extracts overlapping q-grams from a query string and, using an inverted index, generates those data strings that share a sufficient number of q-grams with the query string.
Later work (e.g., [14], [17]) selects nonoverlapping q-grams from a query string to reduce the number of candidate strings.
A drawback of the q-gram signature scheme is that there can be many strings that share a q-gram, because q is usually chosen to be a small value (e.g., from 2 to 4) to support various thresholds. As a result, a large number of candidates can be generated in the filtering phase. To overcome the limitation, the partition signature scheme has been proposed [22], [23]. The partition signature scheme establishes a filtering condition using the pigeonhole principle as follows. Given two strings r and s with a threshold τ, if we decompose r into τ + c partitions, i.e., disjoint substrings, at least c partitions should be contained in s to meet the threshold. Since we can use large signatures (especially when c = 1), the partition signature scheme has been found to be much more efficient than the q-gram scheme. However, an offline index cannot be built on partitions because the number of partitions is determined by the threshold, which is specified only when a query is issued. Therefore, the scheme is not suitable for the string similarity search problem. It was proposed to solve the string similarity join problem, where an index is built on-the-fly during join processing.
Recently, a hierarchical index structure, named HSTree, has been proposed to apply the partition signature scheme to the string similarity search problem [39], [48]. The HSTree is a full binary tree. At the i-th level of the tree, each data string is decomposed into $2^i$ partitions, where the j-th partition is indexed in the j-th node of the i-th level (see Section II-C for the details of the HSTree index). Given a query string with a threshold τ, it selects the lowest level (i.e., the one closest to the root) having at least τ + 1 nodes (or partitions) to use the pigeonhole principle. As HSTree can use the partition signature scheme in the search problem, it exhibits good performance. It is also easily used to support top-k similarity queries.
Although we can use the partition signature scheme with HSTree, this approach has the following limitation. We only use the tree at a specific level, which is determined by the threshold. In a partition-based approach, we need at least τ + 1 partitions to use the pigeonhole principle. Since HSTree selects all nodes at a specific level, the number of partitions used for a query is not consistently determined.
Example 1: Suppose we have a string "SIMILARITY" in our string database. Figure 1 depicts how the string is partitioned into each HSTree node. For a query with a threshold τ = 2, the tree nodes at the second level are selected, and τ + 2 = 4 partitions of the string are used to check if at least c = 2 partitions are contained in the query string. When τ = 4, the tree nodes at the third level are selected, and τ + 4 = 8 partitions of the string are used to check if at least c = 4 partitions are contained in the query string.
In Example 1, we have to select τ + 2 partitions for τ = 2, while we have to select τ + 4 partitions for τ = 4. When using the pigeonhole principle with τ + c partitions, there is a trade-off between filtering and verification costs for different c values (see Section III-A for the details), but HSTree cannot balance the trade-off because the c value is determined by the threshold τ, as shown in the example above.
To address the limitation, we propose a partition selection algorithm that selects τ + c partitions, i.e., HSTree nodes, for a fixed value c regardless of τ. We observe that we can select partitions from different levels of the HSTree. For example, suppose we are given a fixed c = 1. When τ = 2, we can select τ + c = 3 partitions: "SIMIL" at level 1, and "AR" and "ITY" at level 2 in Figure 1. If τ = 4, we can select τ + c = 5 partitions: "SI", "MIL", and "AR" at level 2, and "I" and "TY" at level 3. Interestingly, there are many alternative ways to select a given number of partitions. When τ = 2, for example, we can select alternative combinations of τ + c = 3 partitions: {"SIMIL", "AR", "ITY"} or {"SI", "MIL", "ARITY"}. Based on this observation, we propose a novel dynamic programming algorithm that selects an optimal combination of a given number of partitions, i.e., the combination that generates the minimum number of candidates.
The contributions of the paper are summarized as follows.
• We show that there are many alternative combinations of HSTree nodes to evaluate a query, and develop a novel dynamic programming algorithm that selects an optimal combination of nodes.
• We propose an enhanced HSTree, named a threaded HSTree, that connects inverted lists of an HSTree node to inverted lists of its child nodes. With a threaded HSTree, we can reduce the overhead of index lookups in HSTree nodes while selecting optimal tree nodes.
• We implement the proposed algorithm and conduct extensive experiments on real datasets. Our experimental results show that the proposed algorithm significantly outperforms the existing algorithm that uses the HSTree.
The rest of this paper is organized as follows. In Section II, we formulate the problem of string similarity search, describe the candidate generation method based on the partition signature scheme, and review the HSTree index structure. In Section III, we propose a novel dynamic programming algorithm to select an optimal combination of tree nodes. In Section IV, we enhance an HSTree to reduce the overhead of index lookups. In Section V, we report our experimental results on real datasets. We briefly review related work in Section VI and conclude the paper in Section VII.

II. PROBLEM FORMULATION AND PRIOR WORK

A. PROBLEM FORMULATION
The edit distance between two strings r and s, denoted by ed(r, s), is the minimum number of edit operations to transform r into s or vice versa. An edit operation is insertion, deletion, or substitution of a single character. For example, ed("string", "strong") is 1 because "string" can be transformed into "strong" by substituting one character 'i' with 'o'.
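As an illustration of the definition above, edit distance can be computed with the standard dynamic programming formulation. This is a minimal sketch for checking small examples, not the verification routine used in the paper; the function name `edit_distance` is illustrative.

```python
def edit_distance(r: str, s: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    # prev[j] is the distance between the processed prefix of r and s[:j].
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]  # distance between r[:i] and the empty string
        for j, sc in enumerate(s, start=1):
            cost = 0 if rc == sc else 1
            curr.append(min(prev[j] + 1,          # delete rc
                            curr[j - 1] + 1,      # insert sc
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("string", "strong"))
```

The sketch takes time proportional to the product of the two string lengths, which is the per-pair cost that the filtering phase tries to avoid paying for most of the database.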
Definition 1: Given a string database D, and a query string q with an edit distance threshold τ , the problem of string similarity search is to retrieve from D all strings s such that ed(q, s) ≤ τ .
Example 2: For the strings in Table 1, consider a string database $D = \{s_1, s_2, \ldots, s_8\}$. Given a query string q = "string" with a threshold 1, the result of the similarity search is $\{s_1, s_2, s_3\}$.

B. DISJOINT SIGNATURE-BASED APPROACH FOR GENERATING CANDIDATES
We can establish a necessary condition between two strings to meet a threshold using the pigeonhole principle. The following definition and lemma formally state the necessary condition. Disjoint substrings in the definition above do not share any character in a common position. For a string "abcde", for example, "abc" and "de" are disjoint, but "abc" and "cd" are not disjoint.
Lemma 1: Given two strings r and s and a threshold τ , if we extract τ + c disjoint segments from r, where c is a constant, s should contain at least c segments of r to meet the threshold.
A disjoint segment of r contained in s is called a matching segment. The lemma above states that we need at least c matching segments to meet the threshold. The intuition behind the lemma is that a mismatching segment causes at least one edit operation, and edit operations caused by different segments are independent. If the number of matching segments is less than c, the number of mismatching segments is greater than τ , and thus r and s cannot meet the threshold τ .
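The necessary condition of Lemma 1 can be sketched as a simple filter. The function names are illustrative, and the string "alignment" in the usage line is a hypothetical stand-in for a data string, since Table 1 is not reproduced here.

```python
def count_matching_segments(segments, s):
    """Number of segments of r that occur somewhere in s as substrings."""
    return sum(1 for w in segments if w in s)


def passes_pigeonhole_filter(segments, s, tau):
    """Necessary condition of Lemma 1 for tau + c disjoint segments of r."""
    c = len(segments) - tau
    if c < 1:
        raise ValueError("at least tau + 1 segments are required")
    # Fewer than c matching segments means more than tau mismatching ones.
    return count_matching_segments(segments, s) >= c


# tau = 2 and four segments, so c = 2 matching segments are required.
print(passes_pigeonhole_filter(["al", "on", "ene", "ss"], "alignment", 2))
```

Because containment is tested without any position constraint, the filter can only over-count matches, so it never discards a string that actually meets the threshold.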
Example 3: Consider two strings $s_5$ and $s_6$ in Table 1. Given a threshold τ = 2, if we extract τ + 2 disjoint segments, "al", "on", "ene", and "ss" from $s_5$, only one of the segments, i.e., "al", is contained in $s_6$. Hence, $s_5$ and $s_6$ cannot meet the threshold by Lemma 1.
In the q-gram signature scheme, existing techniques select non-overlapping q-grams to generate candidates using Lemma 1. However, those techniques cannot utilize string segments between the selected q-grams, and filtering power is limited because the signature size (i.e., q) is small. PassJoin [22], [23] has been proposed to find similar strings using partition-based signatures for Lemma 1. PassJoin decomposes a string s into τ + c disjoint segments $w_1, w_2, \ldots, w_{\tau+c}$, such that $s = w_1 \cdot w_2 \cdot \ldots \cdot w_{\tau+c}$, where $w_i \cdot w_{i+1}$ denotes the concatenation of $w_i$ and $w_{i+1}$. We call $w_i$ a partition of s. By using a partition-based signature, which is longer than a q-gram signature, the number of candidates can be reduced, since the longer a signature is, the fewer strings share it. In the remainder of this subsection, we briefly introduce the technique in PassJoin.
Given a string database D and a threshold τ, an index is built on the strings in $D_l = \{s \mid s \in D \wedge |s| = l\}$ as follows. Each string in $D_l$ is partitioned into τ + c segments. For the j-th segments of the strings in $D_l$, we make an inverted index $L_l^j$ that maps each distinct j-th segment w to $L_l^j(w)$, which is a list of ids of strings that have w as their j-th segment. In this way, an index $I_l = \{L_l^1, \ldots, L_l^{\tau+c}\}$ is built for each $D_l$.
Example 4: For the string collection D in Table 1, consider that τ = 1 and c = 1 are given. To make $I_6$, we decompose each string in $D_6 = \{s_1, s_2, s_3, s_4\}$ into two partitioned segments with the same length 3. We can construct $I_9$ for $D_9$ in a similar way. Figure 2 shows $I_6 = \{L_6^1, L_6^2\}$ and $I_9 = \{L_9^1, L_9^2\}$. Therefore, the index for D is $I = \{I_6, I_9\}$.
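The index construction described above can be sketched as follows. The even-partition rule (segment lengths differ by at most one, with the shorter segments first) is an assumption about the partitioning scheme, the four strings in the usage lines are hypothetical rather than the contents of Table 1, and the function names are illustrative.

```python
from collections import defaultdict


def even_partition(s, m):
    """Split s into m contiguous segments; returns (1-based position, segment).

    Assumes len(s) >= m so that no segment is empty.
    """
    base, extra = divmod(len(s), m)
    out, pos = [], 0
    for j in range(m):
        # The last `extra` segments are one character longer (assumption).
        length = base + (1 if j >= m - extra else 0)
        out.append((pos + 1, s[pos:pos + length]))
        pos += length
    return out


def build_index(strings, tau, c=1):
    """index[l][j][w] is the list of ids of length-l strings whose j-th segment is w."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings, start=1):
        for j, (_, w) in enumerate(even_partition(s, tau + c), start=1):
            index[len(s)][j][w].append(sid)
    return index


index = build_index(["string", "strong", "spring", "bridge"], tau=1)
print(index[6][1]["str"], index[6][2]["ing"])
```

Note that the number of segments is fixed by `tau` at build time, which is exactly why this index cannot serve queries with varying thresholds.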
Before we describe how to evaluate a query using the index, we present an obvious necessary condition between two strings to meet a threshold. The following lemma states a condition on the size difference of two strings.
Lemma 2: Given two strings r and s and a threshold τ, if ed(r, s) ≤ τ, then ||r| − |s|| ≤ τ.
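The size-difference condition can be sketched as follows, assuming it takes the standard form that two strings within edit distance τ differ in length by at most τ. The function names are illustrative.

```python
def candidate_lengths(query_len, tau):
    """String lengths l that satisfy | query_len - l | <= tau."""
    return range(max(0, query_len - tau), query_len + tau + 1)


def length_filter(query, strings, tau):
    """Keep only strings whose length is within tau of the query length."""
    return [s for s in strings if abs(len(s) - len(query)) <= tau]


print(list(candidate_lengths(9, 1)))
```

For a query of length 9 with τ = 1, only the per-length indices for lengths 8, 9, and 10 need to be searched.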
Once candidate strings are generated, each candidate is verified by computing its edit distance to the query string. Intuitively, the smaller the size of $W(q, L_l^j)$, the smaller the number of candidates. By utilizing various conditions (e.g., a condition on the position difference between segments), the number of segments in $W(q, L_l^j)$ can be substantially reduced (see Section II-C).

C. HSTree
The partition-based approach described in Section II-B can build an index only with a given static threshold. Therefore, it is hard to apply the approach to the search problem, where a threshold can vary from query to query. To address the problem, HSTree [39], [48] has been proposed, which maintains alternative partitioning results of each data string.
For each distinct string length l, an HSTree $H_l$ is built on $D_l$ as follows. At the level i of the tree, $H_l$ partitions each data string $s \in D_l$ into $2^i$ segments, and the j-th segment of s is indexed in the inverted index of the j-th node, denoted by $L_l^{i,j}$, just like the partition-based approach in the previous section. For the purpose of presentation, we use the inverted index in a node interchangeably with the node itself in the rest of the paper. Similar to PassJoin, HSTree also uses an even-partition scheme. Specifically, each segment w at the level i is partitioned into two disjoint segments at the (i + 1)-th level, such that the first segment is the prefix of w of length $\lfloor |w|/2 \rfloor$ and the second segment is the suffix of w of length $\lceil |w|/2 \rceil$. For example, Figure 3 shows an HSTree $H_9$ for the string collection in Table 1.
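The hierarchical partitioning can be sketched as follows, assuming the prefix of each segment receives the floor of half its length, which agrees with the "SIMILARITY" partitions of Example 1. The dictionary layout keyed by (level, j) and the second string in the usage lines are illustrative assumptions, not the paper's data structure or data.

```python
def hstree_segments(s, level):
    """Segments of s indexed at the given level (the root is level 0)."""
    segs = [s]
    for _ in range(level):
        nxt = []
        for w in segs:
            half = len(w) // 2  # the prefix gets floor(|w| / 2) characters
            nxt.extend([w[:half], w[half:]])
        segs = nxt
    return segs


def build_hstree(strings):
    """Map (level, j) to {segment: [string ids]} for strings of one length."""
    l = len(strings[0])
    max_level = l.bit_length() - 1  # floor(log2(l)), so no segment is empty
    tree = {}
    for i in range(max_level + 1):
        for sid, s in enumerate(strings, start=1):
            for j, w in enumerate(hstree_segments(s, i), start=1):
                tree.setdefault((i, j), {}).setdefault(w, []).append(sid)
    return tree


print(hstree_segments("SIMILARITY", 2))
tree = build_hstree(["SIMILARITY", "SIMILARITE"])
print(tree[(2, 3)]["AR"])
```

Every level stores a complete partitioning of each string, so a query can pick whichever level (or mix of levels) suits its threshold.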
Given a query q with a threshold τ, we can evaluate the query using the HSTrees between $H_{|q|-\tau}$ and $H_{|q|+\tau}$ by Lemma 2. To generate candidate strings, we need at least τ + 1 partitions of strings by Lemma 1. For each $H_l$ ($|q| - \tau \le l \le |q| + \tau$), we select the lowest level having at least τ + 1 nodes. Therefore, $i = \lceil \log_2(\tau + 1) \rceil$. Given $2^i$ nodes, i.e., $L_l^{i,j}$ ($1 \le j \le 2^i$), the query can be evaluated in a way similar to that introduced in the previous section. A segment set of the query string q for searching an index $L_l^{i,j}$, denoted by $W(q, L_l^{i,j})$, is computed using the following position lower bound and upper bound (see PassJoin [22], [23] for the details).
where $\ell(L_l^{i,j})$ and $p(L_l^{i,j})$ denote the length and the position of the segments in $L_l^{i,j}$, respectively, and $\Delta = |q| - l$. In the formulas above, $j - 1$ and $\tau + c - j$ indicate the relative locations (see footnote 4) of a partition from the left-side and the right-side perspectives, respectively. Given $LB_T$ and $UB_T$, the segment set is $W(q, L_l^{i,j}) = \{\, q[p, \ell(L_l^{i,j})] \mid LB_T \le p \le UB_T \,\}$, where $q[p, \ell(L_l^{i,j})]$ denotes the substring of q starting at position p with length $\ell(L_l^{i,j})$. Since the level i has $2^i$ nodes, $\tau + c = 2^i$ and $c = 2^i - \tau$ in Lemma 1. Hence, we generate those strings whose segments are selected from at least $2^i - \tau$ nodes at the level i, while searching the index with $W(q, L_l^{i,j})$.
Example 6: Consider a query string q = "alignment" with a threshold τ = 1 for the string collection in Table 1. Since |q| = 9, we need to search $H_8$, $H_9$, and $H_{10}$. As $H_8 = H_{10} = \emptyset$, we search $H_9$ depicted in Figure 3. In $H_9$, we select the level $\lceil \log_2(\tau + 1) \rceil = 1$ and search $L_9^{1,1}$ and $L_9^{1,2}$. To search $L_9^{1,1}$, we compute $W(q, L_9^{1,1})$ = {alig} ($LB_T = 1$ and $UB_T = 1$ because $p(L_9^{1,1}) = 1$, $\Delta = 0$, and $\ell(L_9^{1,1}) = 4$). Since $L_9^{1,1}$ does not contain alig, we generate no candidate string in this node. Next, we search $L_9^{1,2}$ with $W(q, L_9^{1,2})$ = {nment} ($LB_T = 5$ and $UB_T = 5$ because $p(L_9^{1,2}) = 5$, $\Delta = 0$, and $\ell(L_9^{1,2}) = 5$). $L_9^{1,2}$ does not contain nment, and we generate no candidate in $L_9^{1,2}$ either. Hence, the result is $\emptyset$.
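Example 6 can be reproduced with a small sketch. The bound formulas are assumed here to take the multi-match-aware form of PassJoin [22], [23], built from the terms j − 1 and τ + c − j mentioned above; the function name and argument order are illustrative.

```python
def segment_window(q, l, tau, c, j, p, seg_len):
    """Return (LB_T, UB_T, W) for the j-th of tau + c partitions.

    q: query string, l: length of the indexed data strings,
    p: 1-based start position of the partition, seg_len: its length.
    Assumed bounds (PassJoin-style, not taken from this section):
      LB_T = max(1, p - (j - 1), p + delta - (tau + c - j))
      UB_T = min(|q| - seg_len + 1, p + (j - 1), p + delta + (tau + c - j))
    """
    delta = len(q) - l
    lb = max(1, p - (j - 1), p + delta - (tau + c - j))
    ub = min(len(q) - seg_len + 1, p + (j - 1), p + delta + (tau + c - j))
    # Substrings of q of length seg_len starting at 1-based positions lb..ub.
    segs = [q[start - 1:start - 1 + seg_len] for start in range(lb, ub + 1)]
    return lb, ub, segs


print(segment_window("alignment", 9, 1, 1, 1, 1, 4))
print(segment_window("alignment", 9, 1, 1, 2, 5, 5))
```

Under these assumed bounds, the two calls yield the single-segment windows {alig} and {nment} with the same LB_T and UB_T values stated in Example 6.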

III. OPTIMIZED HSTree NODE SELECTION

A. MOTIVATION OF OUR WORK
In general, the quality of a partition signature for generating candidates is assumed to be proportional to the size of the signature. By choosing c = 1 in Lemma 1, we can maximize the size of each partitioned segment, and thus expect that the number of candidates generated from each partition is minimized. For this reason, PassJoin [22], [23] uses the τ + 1 scheme. If we use a larger c value, the size of each partition signature is reduced, and thus the inverted list for the signature becomes longer. Nevertheless, we have a stricter filtering condition, since a candidate is required to have at least c partitioned segments contained in a query. To generate candidates, however, we need to scan and merge more and longer inverted lists, which can degrade the performance of the search. Therefore, c in Lemma 1 can also be used as a tunable parameter (e.g., [14], [17], [18]) to balance filtering and verification costs.
In HSTree, the c value is dependent on τ, that is, $c = 2^i - \tau$ where $i = \lceil \log_2(\tau + 1) \rceil$. For example, we have to use the τ + 1 scheme for τ = 3, while we have to use the τ + 4 scheme for τ = 4. This is because we select nodes at a specific level of the tree based on a given threshold. Since the c value is determined by τ, we cannot expect consistent performance for different τ values, and we have no chance of improving performance by adjusting c. To address the problem, for a given fixed c value, we propose a novel technique that selects an optimal set of τ + c disjoint nodes across different levels in an HSTree, where two nodes are disjoint if and only if they do not lie on the same root-to-leaf path of the tree. Note that if two HSTree nodes are disjoint, the substrings of a data string corresponding to the nodes are also disjoint. Given τ + c disjoint nodes, therefore, we can apply Lemma 1 to generate candidate strings.
Footnote 4: The position of a segment starts from 1, while the relative location of a partition starts from 0.
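The dependence of c on τ under level-based selection can be checked with a few lines. The function names are illustrative; the level is taken to be the smallest i with at least τ + 1 nodes, as described above.

```python
def hstree_level(tau):
    """Smallest level i with 2**i >= tau + 1, i.e., ceil(log2(tau + 1))."""
    # For a non-negative integer tau, this equals the bit length of tau.
    return tau.bit_length()


def implied_c(tau):
    """The c value forced by selecting all 2**i nodes at that level."""
    return 2 ** hstree_level(tau) - tau


for tau in range(1, 9):
    print(tau, hstree_level(tau), implied_c(tau))
```

The implied c jumps back up each time τ crosses a power of two, which is the inconsistency that a fixed-c selection across levels is meant to remove.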
As shown in the example above, there can be multiple combinations of τ + c disjoint nodes. Among all possible combinations, in this paper, we develop a novel dynamic programming algorithm that selects an optimal combination that minimizes the number of candidates. We remark that any combination of τ + c disjoint nodes generates candidate strings containing all result strings by Lemma 1. Thus, our optimization technique does not affect the accuracy of similarity search, i.e., it does not miss any result string. In the following subsections, we use c = 1 for simplicity, and we will discuss the effect of different c values by treating c as a tunable parameter in Section V-B.

B. OPTIMIZED NODE SELECTION ALGORITHM
Given a query string q with a threshold τ, we search from $H_{|q|-\tau}$ to $H_{|q|+\tau}$ to generate candidates, as discussed earlier. Because each HSTree is searched independently and generates candidates independently, in this section we restrict our discussion to strings of length l and consider optimized node selection for the HSTree $H_l$. Given τ + 1 nodes $\{N_1, N_2, \ldots, N_{\tau+1}\}$ of $H_l$, candidate strings are generated as follows.
$$\bigcup_{i=1}^{\tau+1} \; \bigcup_{w \in W(q, N_i)} N_i(w)$$
where q is the query string and w is each segment in $W(q, N_i)$. Note that $N_i$ also denotes the inverted index of the i-th node. Like all other string similarity search techniques that utilize signatures to generate candidates, we assume each partition signature independently generates candidate strings. Therefore, the number of candidates can be estimated as:
$$\sum_{i=1}^{\tau+1} \; \sum_{w \in W(q, N_i)} |N_i(w)|$$
where $|N_i(w)|$ denotes the size of the inverted list $N_i(w)$. An optimal combination of τ + 1 nodes can be obtained by enumerating all possible combinations of τ + 1 disjoint nodes, computing the number of candidates generated by each combination, and selecting a combination having the minimum number of candidates. The following lemma states that we can prune certain combinations of τ + 1 disjoint nodes.
Lemma 3: Given a combination of τ + 1 disjoint nodes $S = \{N_1, \ldots, N_{\tau+1}\}$, if $\sum_{k=1}^{\tau+1} \ell(N_k) < l$, there exists another combination that generates no more candidates than the initial combination S, where $\ell(N_k)$ denotes the length of the segments indexed in $N_k$.
Proof: If $\sum_{k=1}^{\tau+1} \ell(N_k) < l$, there must be at least one node $N \in S$ such that no node in the subtree rooted at the sibling of N is included in S. By replacing N with its parent, we obtain another combination that generates no more candidates than S, because the candidates generated by the parent are a subset of those generated by N.
Example 8: In Example 7, we may select τ + 1 nodes $\{L_9^{2,1}, L_9^{3,3}, L_9^{3,4}, L_9^{3,6}, L_9^{2,4}\}$. In this case, we can replace $L_9^{3,6}$ with its parent $L_9^{2,3}$, and the candidates generated by $L_9^{2,3}$ are a subset of the candidates generated by $L_9^{3,6}$.
The following recurrence calculates the minimum number of candidates, denoted by $NC(L_l^{i,j}, n)$, when we select n disjoint nodes from the subtree rooted at $L_l^{i,j}$, which is the j-th node at the level i of $H_l$ for those data strings of length l.
$$NC(L_l^{i,j}, n) = \begin{cases} \sum_{w \in W(q, L_l^{i,j})} |L_l^{i,j}(w)| & \text{if } n = 1 \\ \min_{k} \bigl\{ NC(L_l^{i+1,2j-1}, k) + NC(L_l^{i+1,2j}, n - k) \bigr\} & \text{otherwise} \end{cases}$$
In case n = 1, $L_l^{i,j}$ generates the minimum number of candidates, thus we select $L_l^{i,j}$. When n > 1, we select k disjoint nodes from the subtree rooted at the left child $L_l^{i+1,2j-1}$, and the remaining n − k disjoint nodes from the subtree rooted at the right child $L_l^{i+1,2j}$. Among all possible k values, which are discussed in Lemma 4, we select an optimal combination of n disjoint nodes that has the minimum number of candidates. Note that the recurrence considers only those combinations of disjoint nodes $\{N_1, \ldots, N_n\}$ such that $\sum_{k=1}^{n} \ell(N_k) = \ell(L_l^{i,j})$ by the following lemma.
Lemma 4: In the recurrence above, the range of all possible k values is as follows.
$$\max\bigl(1,\; n - 2^{\mathit{maxL} - (i+1)}\bigr) \le k \le \min\bigl(n - 1,\; 2^{\mathit{maxL} - (i+1)}\bigr)$$
where maxL denotes the maximum level of the tree, i.e., the leaf level $\lfloor \log_2 l \rfloor$.
Proof: It is obvious that 1 ≤ k ≤ n − 1. Since an HSTree is a full binary tree, the maximum number of disjoint nodes (i.e., the number of nodes in the leaf level) in the subtree rooted at $L_l^{i+1,2j-1}$ (or $L_l^{i+1,2j}$) is $2^{\mathit{maxL}-(i+1)}$. Therefore, $k \le 2^{\mathit{maxL}-(i+1)}$ and $n - k \le 2^{\mathit{maxL}-(i+1)}$, that is, $\max(1, n - 2^{\mathit{maxL}-(i+1)}) \le k \le \min(n - 1, 2^{\mathit{maxL}-(i+1)})$.
[...] if $L_l^{i,j}$ is a leaf node, it clearly returns the correct number of candidates by Equation (7). Therefore, by induction, $NC(L_l^{i,j}, n)$ correctly computes the minimum number of candidates.
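The bounds argued in the proof (k lies between 1 and n − 1, and each child subtree can supply at most $2^{\mathit{maxL}-(i+1)}$ disjoint nodes) can be sketched as a helper. The function name `k_range` is illustrative.

```python
def k_range(n, i, max_level):
    """Feasible numbers k of nodes taken from the left child of a level-i node.

    n: number of disjoint nodes to select in the subtree (n > 1),
    max_level: the leaf level of the tree.
    """
    cap = 2 ** (max_level - (i + 1))  # leaves under each child
    return range(max(1, n - cap), min(n - 1, cap) + 1)


print(list(k_range(5, 0, 3)))  # root of a depth-3 tree, five nodes wanted
print(list(k_range(5, 1, 3)))  # a level-1 subtree has only four leaves
```

An empty range signals that the requested number of disjoint nodes does not fit into the subtree, so the corresponding slot never needs to be evaluated.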
A minor difficulty with the recurrence is that we can identify the relative locations of partitions only after selecting optimal partitions, while $LB_T$ and $UB_T$ for $W(q, L_l^{i,j})$ require the relative location of the partition $L_l^{i,j}$. Notice that j in $L_l^{i,j}$ is no longer the relative location of the partition, since we select partitions at different levels of the tree. We solve the problem by using the following $LB_L$ and $UB_L$ for $W(q, L_l^{i,j})$ in our recurrence (see PassJoin [22], [23] for the details of $LB_L$ and $UB_L$).
After selecting optimal partitions, we use $LB_T$ and $UB_T$ to generate candidates with the selected partitions. It is worth noting that we do not need to look up indices again for the inverted lists used in generating candidates; we can select the required inverted lists from those retrieved during partition selection, because $[LB_T, UB_T] \subseteq [LB_L, UB_L]$ [22], [23].
It is obviously inefficient to compute the minimum number of candidates by recursively enumerating all possible combinations of τ + 1 disjoint nodes. Instead, we develop a dynamic programming algorithm based on the recurrence above as follows. There are $|H_l| = 2^{\mathit{maxL}+1} - 1$ nodes in $H_l$, where maxL is the leaf level. We make an array A having $|H_l|$ elements, where the node L [...] For example, Figure 4 shows an initial DP array for an HSTree with depth 3 when τ = 4.
Given an initialized DP array A, Algorithm 1 outlines our dynamic programming algorithm that computes the minimum number of candidates in $H_l$ for a query string q with a threshold τ. We assume that q, τ, and A are globally visible in the algorithm. In the algorithm, each slot of the DP array A has three values: n_cands is the minimum number of candidates, and left and right are links to its children cells, which are used to keep track of the optimal combination of nodes. The algorithm computes the minimum number of candidates only when it is not already computed (Line 1). If the number of nodes to be selected is 1, then it saves the sum of the sizes of the inverted lists for $W(q, L_l^{i,j})$ of the current node (Line 4). In this case, the child links are set to nil, which indicates that this node is a terminal node (Line 5). If the number of nodes to be selected is greater than 1, the algorithm selects an optimal combination of nodes in the subtree rooted at the current node [...] $-1 + 2j - 1$ and $2x + 1 = 2^{i+1} - 1 + 2j$, respectively. The algorithm finally returns the minimum number of candidates (Line 13).
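The recurrence and the selection it drives can be sketched with memoized recursion. This is a simplified stand-in under stated assumptions, not Algorithm 1 itself: it replaces the DP array and its left/right links with a cache that returns the chosen nodes directly, and the per-node costs in the usage lines are hypothetical values standing for the sums of inverted-list sizes.

```python
from functools import lru_cache


def select_nodes(cost, max_level, n_select):
    """Return (minimum estimated candidates, selected (level, j) nodes).

    cost[(i, j)]: sum of inverted-list sizes for the j-th node (1-based)
    at level i; every node up to max_level must have an entry.
    """
    inf = float("inf")

    @lru_cache(maxsize=None)
    def best(i, j, n):
        if n == 1:
            return cost[(i, j)], ((i, j),)
        if i == max_level or n > 2 ** (max_level - i):
            return inf, ()  # n disjoint nodes do not fit in this subtree
        cap = 2 ** (max_level - (i + 1))
        result = (inf, ())
        for k in range(max(1, n - cap), min(n - 1, cap) + 1):
            left_cost, left_nodes = best(i + 1, 2 * j - 1, k)
            right_cost, right_nodes = best(i + 1, 2 * j, n - k)
            if left_cost + right_cost < result[0]:
                result = (left_cost + right_cost, left_nodes + right_nodes)
        return result

    return best(0, 1, n_select)


# Hypothetical costs for a tree with leaf level 2.
cost = {(0, 1): 1, (1, 1): 5, (1, 2): 4,
        (2, 1): 10, (2, 2): 3, (2, 3): 2, (2, 4): 9}
print(select_nodes(cost, 2, 3))
```

With these costs, selecting three nodes compares keeping the left level-1 node (5 + 2 + 9) against keeping the right one (10 + 3 + 4) and picks the former.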
Example 9: Given a query string q with a threshold τ = 4, consider an HSTree $H_l$ shown in Figure 5(a). In the figure, the number in each node denotes the sum of the lengths of the inverted lists selected by segments of q (i.e., $\sum_{w \in W(q, L_l^{i,j})} |L_l^{i,j}(w)|$). We can find an optimal combination of disjoint nodes by calling DPSelect($L_l^{0,1}$, τ + 1 = 5). To find an optimal combination, DPSelect constructs a DP array depicted in Figure 5(b). It recursively fills each slot in the array while keeping links to its children slots. Once it fills $A[L_l^{0,1}][5]$, we can find an optimal combination by following the links of $A[L_l^{0,1}][5]$, which are depicted in red lines in the figure. The optimal combination of disjoint nodes selected by the algorithm is indicated by the grayed slots.

Lemma 5: The time complexity of the algorithm is $O(l \cdot \tau \cdot C_I)$, where l is the length of strings indexed in an HSTree $H_l$, τ is a threshold for a query, and $C_I$ is the average cost for index lookups.
Proof: It can be seen that the number of segments in $W(q, L_l^{i,j})$ is at most 2τ. The algorithm requires $O(l \cdot \tau \cdot C_I)$ to fill the first row of the DP array, i.e., A[1][*], since there are at most l slots in the first row. In the n-th row of the DP array, there are at most $l / 2^{\lceil \log_2 n \rceil} \le l/n$ slots. A slot in the n-th row requires at most 2(n − 1) lookups of the DP array slots (see Line 6 of the algorithm). Hence, the algorithm requires O(l) to fill the n-th row (n > 1), and it requires $O(l \cdot \tau)$ to fill all the rows except the first row. Consequently, filling the first row dominates the time complexity of the algorithm, which is $O(l \cdot \tau \cdot C_I)$.

C. REDUCING INDEX LOOKUP OVERHEAD
A drawback of the proposed algorithm is that it looks up indices in all tree nodes. To reduce the overhead of index lookups, we limit the maximum level maxL to $\lceil \log_2(\tau + 1) \rceil$, i.e., the minimum level having at least τ + 1 nodes. An interesting observation is that we can ignore HSTree nodes below a certain level to select τ + 1 disjoint nodes for a query. Lemma 6 formally states the observation.

IV. THREADED HSTree
To further reduce the index lookup cost, in this section, we develop a threaded HSTree structure. Consider a segment w indexed in an HSTree node N and let the inverted list for w be $I_w$. The first half of w is indexed in the left child of N and the second half of w is indexed in the right child of N. Let the inverted lists of the first and second halves be $I_{w_1}$ and $I_{w_2}$, respectively. We connect $I_w$ with $I_{w_1}$ and $I_{w_2}$ with pointers, which are called threads. If we look up the inverted index in N to find $I_w$ for w, we can directly locate $I_{w_1}$ and $I_{w_2}$ by following the threads. Figure 6 shows this modification of the HSTree in Figure 3, where the dashed blue lines denote the threads that connect inverted lists.
We use a threaded HSTree with our algorithm as follows. Given a query, we first look up the inverted indices of the nodes at the minimum level minL, and we keep the retrieved inverted lists. For each node N in the next level, we first locate its parent node, and follow the threads of the inverted lists kept in the parent node. It can be easily seen that the inverted lists identified by the parent's threads are a subset of the inverted lists required in N. Hence, we look up the inverted index in N to retrieve the unidentified inverted lists only. In this way, we can retrieve all inverted lists for the nodes at the levels between minL and maxL. Then, we fill the first row of the DP array using the sizes of the retrieved inverted lists. Finally, we apply our algorithm to compute the remaining slots of the DP array. We remark that the condition in Line 3 of the algorithm is always false, since we initialize the first row of the DP array before calling the algorithm.
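The thread structure can be sketched with inverted-list objects that point to the lists of their two halves. The class and function names, the dictionary layout keyed by (level, j), and the two strings in the usage lines are illustrative assumptions, not the paper's implementation.

```python
class InvertedList:
    """Inverted list of one segment, with threads to the lists of its halves."""

    def __init__(self, segment):
        self.segment = segment
        self.ids = []
        self.left = None   # thread to the list of the first half (left child)
        self.right = None  # thread to the list of the second half (right child)


def build_threaded(strings):
    """Map (level, j) to {segment: InvertedList} for strings of one length."""
    max_level = len(strings[0]).bit_length() - 1  # floor(log2(l))
    nodes = {}

    def insert(i, j, w, sid):
        lst = nodes.setdefault((i, j), {}).setdefault(w, InvertedList(w))
        lst.ids.append(sid)
        if i < max_level:
            half = len(w) // 2
            # The halves of w always land in the same child lists, so the
            # threads can be set (or re-set to the same object) on each insert.
            lst.left = insert(i + 1, 2 * j - 1, w[:half], sid)
            lst.right = insert(i + 1, 2 * j, w[half:], sid)
        return lst

    for sid, s in enumerate(strings, start=1):
        insert(0, 1, s, sid)
    return nodes


nodes = build_threaded(["SIMILARITY", "SIMILARITE"])
parent = nodes[(1, 2)]["ARITY"]          # one index lookup in the parent node
print(parent.left.segment, parent.left.ids)    # reached without a child lookup
print(parent.right.segment, parent.right.ids)
```

Following a thread replaces a hash or tree lookup in the child node with a pointer dereference, which is where the saving in index lookups comes from.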
Algorithm 2 encapsulates the initialization of the DP array using a threaded HSTree. We assume that q, τ, the HSTree $H_l$, and the DP array A are visible in the algorithm. The algorithm first initializes an array of sets of inverted lists $S_L$, which keeps the inverted lists corresponding to $W(q, L_l^{i,j})$ for each node $L_l^{i,j}$ (Line 1). Then, it retrieves the inverted lists for the query in each node at the level minL (Line 2). Recall that an HSTree node N also denotes the inverted index in N. For simplicity, we use $S_L[N]$ to denote the element of $S_L$ corresponding to the node N. After looking up inverted indices in the nodes at the level minL and retrieving inverted lists for the query, it uses the retrieved inverted lists to construct inverted lists of the nodes at the levels above [...]

V. EXPERIMENTS

A. EXPERIMENTAL SETTINGS
In experiments, we used four widely used real-world datasets, IMDB Actor and Movie (http://www.imdb.com), and Web Corpus (http://www.ldc.upenn.edu/Catalog, number LDC2006T13). Some important statistics of the datasets are presented in Table 2. We chose the datasets to compare performance for short and long strings. Our algorithm was implemented in GNU C++ and compiled with the -O3 option. All experiments were conducted on a machine with 32GB of main memory and an Intel i7 CPU running an Ubuntu operating system. Datasets and indices were held in main memory.
We randomly extracted 2000 query strings from each dataset and ran them as queries. We evaluated the proposed technique in terms of query processing time. The results reported in this section are response times aggregated over the 2000 queries. Since the HSTree technique consistently outperformed other existing techniques, as reported in [39], [48], we compared our technique with the original HSTree technique [39], [48]. In the remaining sections, we denote our algorithm by OptSearch and the original HSTree search algorithm by HSSearch.

B. EXPERIMENTS ON DIFFERENT C VALUES
In this subsection, we evaluate our algorithm while varying the value of c. Even when c > 1, as shown in [17], we can still estimate an upper bound on the number of candidates by the sum of the sizes of the inverted lists for a query string. Therefore, we can still use our algorithm to obtain an optimal combination of τ + c disjoint nodes. In this case, as discussed in Section III-C, we only need to change maxL to $\lceil \log_2(\tau + c) \rceil$ and minL to $\lceil \log_2(2^{maxL} / (2^{maxL} - (\tau + c - 1))) \rceil$ in our algorithm. Figure 7 shows query response times for alternative c values on three different datasets. If we use a c value larger than 1, an underflow of the number of partitions may occur. In this case, we re-evaluated the query using τ + 1 partitions. For all datasets and thresholds, we observed that c = 2 exhibited the best performance. This can be explained by the query time ratios shown in Figure 8. There is a trade-off between filtering (i.e., merging inverted lists) and verification (i.e., edit distance computation) times for different c values. To obtain the best performance, we need to balance this trade-off. As shown in the figure, the time difference between filtering and verification was minimized when c was 2. We remark that the cost of selecting an optimal combination is negligible, and it is included in Indexlookup in Figure 8. These results justify the motivation of our work described in Section III-A. Based on the results in this subsection, we used c = 2 in our algorithm in the following subsections.
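To make the level bounds and the selection step concrete, the following sketch computes minL and maxL for given τ and c and then selects τ + c pairwise disjoint nodes with the minimum total inverted-list size. It rests on several assumptions: both level expressions are rounded up to integers, a node is disjoint from another if neither is an ancestor of the other, and the selection is a generic tree dynamic program written for illustration rather than the algorithm proposed in this paper. The names `level_bounds` and `select_disjoint_nodes` and the input `cost` (a map from a node to the total size of its retrieved inverted lists) are hypothetical.

```python
import math
from functools import lru_cache

def level_bounds(tau, c=1):
    """Return (minL, maxL) for tau + c partitions, assuming ceilings."""
    n = tau + c
    max_level = (n - 1).bit_length()        # ceil(log2(n)) for n >= 1
    d = 2 ** max_level - (n - 1)
    min_level = 0
    while (2 ** min_level) * d < 2 ** max_level:
        min_level += 1                      # ceil(log2(2^maxL / d))
    return min_level, max_level

def select_disjoint_nodes(cost, min_level, max_level, k):
    """Pick k disjoint nodes (level, j) with minimum total cost.

    cost maps every node at levels min_level..max_level to the summed
    size of its inverted lists. Returns (total, nodes) or None."""
    @lru_cache(maxsize=None)
    def best(level, j, m):
        # Cheapest way to pick m disjoint nodes inside the subtree of (level, j).
        if m == 0:
            return 0, ()
        best_cost, best_nodes = math.inf, ()
        if m == 1 and min_level <= level <= max_level:
            best_cost, best_nodes = cost[(level, j)], ((level, j),)
        if level < max_level:
            for a in range(m + 1):          # a nodes on the left, m - a on the right
                lc, ln = best(level + 1, 2 * j, a)
                rc, rn = best(level + 1, 2 * j + 1, m - a)
                if lc + rc < best_cost:
                    best_cost, best_nodes = lc + rc, ln + rn
        return best_cost, best_nodes

    total, nodes = best(0, 0, k)
    return (total, list(nodes)) if total < math.inf else None
```

For instance, with τ = 2 and c = 1 the bounds are minL = 1 and maxL = 2, and a selection may mix one node at level 1 with two nodes at level 2 when that combination has a smaller total list size than any three nodes at level 2.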

C. EXPERIMENTS ON INDEX LOOKUP TIME
In this subsection, we evaluate the effects on index lookup time of using a threaded HSTree and of restricting the maximum level maxL. Figure 9 shows the results. When we used a threaded HSTree, index lookup time was reduced by a factor of 1.5 on average. When we applied the restriction of maxL (i.e., maxL = $\lceil \log_2(\tau + c) \rceil$) along with a threaded HSTree, index lookup time was reduced by a factor of 2 on average. As shown in Figure 8, index lookup time did not affect the query time when the threshold was large (e.g., τ ≥ 3). For a low threshold (e.g., τ = 1), however, index lookup time was very important to the overall performance because merging inverted lists and verifying candidate strings were very fast. Since the proposed technique looks up inverted indices in all HSTree nodes to select an optimized combination of nodes, it is crucial to reduce index lookup time for low thresholds. As shown in the experiments in this subsection, the threaded HSTree structure and the maxL restriction effectively reduced index lookup time.

D. COMPARISONS WITH HSSearch
In this subsection, we compare our algorithm with HSSearch. Figure 10 shows the results. On each dataset, OptSearch outperformed HSSearch by up to about 3 times. For low thresholds (e.g., τ ≤ 2), the performance of OptSearch was comparable to that of HSSearch. This is because the inverted lists for partitions are very short, so list merging and edit distance computation can be done very quickly at a low threshold. Since OptSearch requires more index lookups, its benefit is reduced by the overhead of index lookups. As the threshold increased, however, OptSearch significantly outperformed HSSearch because of the optimal partition selection and a good balance between filtering and verification. For example, OptSearch was about 3 times faster than HSSearch when τ = 4 on the Actor dataset (Figure 10(a)).
Since the Corpus dataset contains 10 million strings, Figure 10(c) shows that the proposed technique scales well to a large dataset.

VI. RELATED WORK

A. SIMILARITY MEASURES
The problem of quantifying similarity between objects has witnessed growing interest over the past decades. To measure the similarity between text data, character-based similarity functions and token set-based similarity functions are widely used. The most representative character-based similarity function is edit distance [11], [27], [28], [38], which is also known as Levenshtein distance. Since it reflects the ordering of characters in strings and allows non-trivial alignment, it is widely adopted in many applications. For example, it is used in practical applications such as the diff and patch commands in Linux systems, source code management in version control systems such as GitHub, and DNA read mappers (e.g., [17], [18]) for analyzing genomic data. Set-based similarity functions such as the Jaccard coefficient [12], Dice [9], and Cosine similarity are also used to measure the similarity between text data by tokenizing each string into a set of words or q-grams. Since set-based similarity functions consider only exact matches between tokens, they might not correctly measure similarity when the granularity of a token is large. Fast-Join [40] and MF-Join [42] address this problem by considering the similarity between tokens using edit distance before computing set-based similarity. LSA [19] and its variants (e.g., [24], [25]) have also been developed to measure similarity between documents through corpus analysis.
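As a brief illustration of the two families of functions discussed above, the following sketch computes edit distance with the standard dynamic program and the Jaccard coefficient over q-gram sets. The function names and the default q = 2 are chosen here for illustration only and are not taken from the cited works.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows of the standard DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def qgrams(s, q=2):
    """Set of all substrings of length q in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b, q=2):
    """Jaccard coefficient of the q-gram sets of a and b."""
    x, y = qgrams(a, q), qgrams(b, q)
    union = x | y
    return len(x & y) / len(union) if union else 1.0
```

For the strings "night" and "nacht", the edit distance is 2, whereas the 2-gram sets share only "ht", which shows how a set-based function can penalize a small character-level difference more heavily.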
There are also similarity functions that are used for structured data. SimRank [13] and many variants [26], [37], [47], [49], [50], [52] have been proposed to consider semantic similarity information between objects in information networks. The intuition behind SimRank is that similar objects are linked by similar objects. Based on this intuition, it quantifies node similarity based on the compound similarity of their neighbors. Graph edit distance [32], [34] and feature-based similarity functions [5], [36], [51], [53] have been proposed to quantify the similarity between complex objects represented by graph models. Similar to edit distance, graph edit distance measures the distance between two graphs using the minimum number of edit operations needed to make the graphs isomorphic. In contrast to edit distance, however, graph edit distance computation is NP-hard, and many techniques have been proposed to compute it efficiently (e.g., [15], [32]).

B. STRING SIMILARITY QUERY PROCESSING
Set similarity join is the problem of finding similar pairs of records from two collections of records, which is essential in many applications including data cleaning [8] and near duplicate detection [45]. Many algorithms have been developed for the problem of set similarity join [1], [2], [6], [8], [10], [30], [35], [40], [41], [43]-[45]. Some of these algorithms (e.g., [2], [31]) solve only join problems, but most of them can be applied to the search problem in their original form or with slight modification. Many algorithms and inverted index structures have been proposed for the similarity search problem [3], [4], [6], [7], [16], [20], [21], [29], [41], [46]. The technique called VGRAM [21], [46] was proposed to use variable-length grams in an inverted index to improve similarity search performance and reduce the index size. A disk-based inverted index structure [4] was proposed for supporting similarity search on large datasets. In [16], a dataset partitioning algorithm was proposed to reduce the number of candidates by exploiting document frequency orderings. Recent techniques exploit partitioning of data strings to establish a filtering condition based on the pigeonhole principle. PassJoin [22], [23] decomposes data strings into τ + 1 partitions to perform efficient string similarity join. HSTree [39], [48] proposes a hierarchical index structure that considers multiple partitionings of a string to support string similarity search using the partition-based approach of PassJoin.
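The pigeonhole-based filtering condition can be sketched as follows: if a data string is split into τ + 1 disjoint partitions and its edit distance to the query is at most τ, then at least one partition must appear unchanged in the query. The sketch below is a simplified illustration with hypothetical function names; it uses near-even partitions and checks for a match at any position in the query, whereas the cited techniques additionally restrict the positions at which a partition may match.

```python
def partitions(s, tau):
    """Split s into tau + 1 disjoint, near-even partitions."""
    n = tau + 1
    bounds = [len(s) * i // n for i in range(n + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(n)]

def is_candidate(s, q, tau):
    """Filter: can s be within edit distance tau of the query q?

    Returning False is safe (no false negatives); returning True only
    means that s must still be verified by computing the edit distance."""
    if abs(len(s) - len(q)) > tau:   # length filter
        return False
    # Pigeonhole: tau edits cannot touch all tau + 1 partitions.
    return any(p in q for p in partitions(s, tau))
```

For example, with τ = 1 the string "similarity" is split into "simil" and "arity"; the query "simlarity" contains "arity", so the string remains a candidate and is passed on to verification.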

VII. CONCLUSION
In this paper, we proposed an optimal partition selection algorithm to improve the performance of string similarity search using the HSTree indexing technique. We observed that there can be multiple combinations of τ + c disjoint nodes in an HSTree, and proposed a novel dynamic programming algorithm that selects an optimal combination of HSTree nodes. To reduce the overhead of selecting an optimal combination, we also proposed a threaded HSTree, which is an enhanced HSTree structure. We evaluated the proposed technique using real datasets and showed that it outperformed the existing technique HSSearch.