Introduction
Keyword search over relational databases enables naive or casual users to retrieve information from relational databases (DBs) without any knowledge of schema details or query languages. The success of search engines shows that untrained users are at ease using keyword search to find information of interest.
However, this can be challenging because the information sought frequently spans multiple relations and attributes, depending on the schema design of the underlying DB. Therefore, Relational Keyword Search (R-KwS) systems must automatically determine which pieces of information to retrieve from the database and how to connect them to provide a relevant answer to the user.
In general, keywords may refer to both database values in tuples and schema elements, such as relation and attribute names. For instance, consider the query “will smith movies”, in which the keywords “will” and “smith” refer to values of attributes, while the keyword “movies” refers to the name of a relation.
Handling keywords that refer to schema elements makes R-KwS significantly more challenging than the usual setting, in which only keywords referring to attribute values are considered. Firstly, it increases the complexity of the search process by requiring an understanding of the underlying database schema and its structure. Secondly, keywords referring to the schema introduce semantic ambiguity, making it difficult to distinguish between schema references and attribute values. This ambiguity further complicates the search process and can lead to incorrect or incomplete results. Furthermore, integrating schema knowledge into the search process becomes crucial when handling schema references: understanding PK/FK relationships and connecting relevant information adds an extra layer of complexity to the problem. Finally, ranking and relevance determination become more challenging when schema elements are involved, since existing systems may prioritize attribute values even when they do not provide useful answers; accurately assessing relevance requires considering both attribute values and schema references. These challenges call for techniques and algorithms specifically designed to handle schema references effectively.
In this work, we study new techniques for supporting schema references in keyword queries over relational databases. Specifically, we propose Lathe, a new R-KwS system to generate a suitable SQL query from a keyword query, considering that keywords refer either to instance values or to schema elements. Lathe follows the Schema Graph approach for R-KwS systems [2], [3]. Given a keyword query, this approach consists of generating relational algebra expressions called Candidate Joining Networks (CJNs), which are likely to express the user intent when formulating the original query. The generated CJNs are then evaluated, that is, they are translated into SQL queries and executed by a DBMS, resulting in several Joining Networks of Tuples (JNTs), which are collected and supplied to the user.
In the literature, the best-known algorithm for CJN generation is CNGen, which was first presented in the system DISCOVER [4] but was adopted by most R-KwS systems [5], [6], [7], [8]. Despite the possibly large number of CJNs, most works in the literature focused instead on improving CJN evaluation and the ranking of JNTs. Specifically, DISCOVER-II [6], SPARK [7], and CD [8] used information retrieval (IR) style score functions to rank the top-K JNTs. KwS-F [9] imposed a time limit for CJN evaluation, returning potentially partial results as well as a summary of the CJNs that had yet to be evaluated. Later, CNRank [10] introduced a CJN ranking, requiring only the top-ranked CJNs to be evaluated. MatCNGen [2], [11] proposed a novel method for generating CJNs that efficiently enumerates the possible matches for the query in the DB. These Query Matches (QMs) are then used to guide the CJN generation process, greatly decreasing the number of generated CJNs and improving the performance of CJN evaluation.
Among the methods based on the Schema Graph approach, Lathe is, to the best of our knowledge, the first method to address the problem of generating and ranking CJNs considering queries with keywords that can refer to either schema elements or attribute values. We revisited and generalized concepts introduced in previous approaches [2], [4], [10], [11], such as tuple-sets, QMs, and the CJNs themselves, to support schema references. In addition, we proposed a more effective approach to CJN generation that includes two major innovations: QM ranking and eager CJN evaluation. Roughly, Lathe first matches keywords to the values of attributes or to schema elements. Next, the system combines the keyword matches into QMs that cover all the keywords from the query. The QMs are ranked, and only the most relevant ones are used to generate CJNs. The CJN generation explores primary key/foreign key relationships to connect all the elements of the QMs. In addition, Lathe employs an eager CJN evaluation strategy, which ensures that all generated CJNs yield non-empty results when evaluated. The CJNs are then ranked and evaluated. Finally, the CJN evaluation results are delivered to the user. Unlike previous methods, Lathe provides the user with the most relevant answers without relying on JNT rankings. This is due to the effective rankings of QMs and CJNs that we propose, which are absent in the majority of previous work.
We performed several experiments to assess the effectiveness and efficiency of Lathe. First, we compared the results with those obtained with several previous R-KwS systems, including the state-of-the-art QUEST [12] system, using a benchmark proposed by Coffman & Weaver [3]. Second, we assessed the quality of our ranking of QMs. The ranking of CJNs was then evaluated by comparing different configurations in terms of the number of QMs, the number of CJNs generated per QM, and the use of the eager evaluation strategy. Finally, we assessed the performance of each phase of Lathe, as well as the trade-off between quality and performance of various system configurations. Lathe achieved better results than all of the R-KwS systems tested in our experiments. Also, our results indicate that the ranking of QMs and the eager CJN evaluation greatly improved the quality of the CJN generation.
Our key contributions are: (i) a novel method for generating and ranking CJNs with support for keywords referring to schema elements; (ii) a novel algorithm for ranking QMs, which avoids the processing of less likely answers to a keyword query; (iii) an eager CJN evaluation for discarding spurious CJNs; (iv) a simple and yet effective ranking of CJNs which exploits the ranking of QMs.
The remainder of this paper is organized as follows: Section II reviews the related literature on relational keyword search systems based on schema graphs and on support for schema references. Section III formally states the problem we address. Section IV summarizes all of the phases of our method, which are discussed in detail in Sections V–VII. Section VIII summarizes the findings of the experiments we conducted. Finally, Section IX presents our conclusions and outlines our plans for future work.
Background and Related Work
In this section, we discuss the background and related work on keyword search systems over relational databases and on supporting schema references in such systems. For a more comprehensive view of the state-of-the-art in keyword-based and natural language queries over databases, we refer the interested reader to a recent survey [13].
A. Relational Keyword Search Systems
Current R-KwS systems fall into one of two distinct categories: systems based on Schema Graphs and systems based on Instance Graphs. Systems in the first category are based on the concept of Candidate Joining Networks (CJNs), which are networks of joined relations that are used to generate SQL queries and whose evaluation returns several Joining Networks of Tuples (JNTs), which are collected and supplied to the user. This method was proposed in DISCOVER [4] and DBXplorer [5], and it was later adopted by several other systems, including DISCOVER-II [6], SPARK [7], CD [8], KwS-F [9], CNRank [10], and MatCNGen [2], [11]. Systems in this category make use of the underlying basic functionality of the RDBMS by generating appropriate SQL queries to retrieve answers to keyword queries posed by users.
Systems in the second category are based on a structure called an Instance Graph, whose nodes represent tuples associated with the keywords they contain, and whose edges connect these tuples based on referential integrity constraints. BANKS [14], BANKS-II [15], BLINKS [16], and Effective [17] use this approach to compute keyword query results by finding subtrees in a data graph that minimize the distance between nodes matching the given keywords. These systems typically generate the query answer in a single phase that combines the tuple retrieval task and the answer schema extraction. However, the Instance Graph approach requires a materialization of the DB and incurs a higher computational cost to deliver answers to the user. Furthermore, the important structural information provided by the database schema is ignored once the data graph has been built.
B. R-KwS Systems Based on Schema Graphs
In our research, we focus on systems based on Schema Graphs, since we assume that the data we want to query are stored in a relational database and we want to use an RDBMS capable of processing SQL queries. Also, our work expands on the concepts and terminology introduced in DISCOVER [4], [6] and expanded in CNRank [10] and MatCNGen [2], [11]. This formal framework is used and expanded to handle keyword queries that may refer to attribute values or to database schema elements. As a result, we can inherit and maintain all guarantees regarding the generation of minimal, complete, sound, and meaningful CJNs.
The best-known algorithm for CJN Generation is CNGen, which was introduced in DISCOVER [4] but was later adopted as a default in most of the R-KwS systems proposed in the literature [5], [6], [7], [8]. To generate a complete, non-redundant set of CJNs, this algorithm employs a Breadth-First Search approach [18]. As a result, CNGen frequently generates a large number of CJNs, resulting in a costly CJN generation and evaluation process.
Initially, most of the subsequent work focused on the CJN evaluation only. Specifically, because CNGen generates many CJNs whose evaluation produces an even larger number of JNTs, systems such as DISCOVER-II [6], SPARK [7], and CD [8] introduced algorithms for ranking JNTs using IR-style score functions.
KwS-F [9] addressed the efficiency and scalability problems in CJN evaluation in a different way. Their approach consists of two steps. First, a limit is imposed on the time the system spends evaluating CJNs. After this limit is reached, the system must return the (possibly partial) top-K JNTs. Second, if there are any CJNs that have yet to be evaluated, they are presented to the user in the form of query forms, from which the user can choose one and the system will evaluate the corresponding CJN.
CNRank [10] proposed a method for lowering the cost of CJN evaluation by ranking them based on the likelihood that they will provide relevant answers to the user. Specifically, CNRank presented a probabilistic ranking model that uses a Bayesian Belief Network [19] to estimate the relevance of a CJN given the current state of the underlying database. A score is assigned to each generated CJN, so that only a few CJNs with the highest scores need to be evaluated.
MatCNGen [2], [11] introduced a match-based approach for generating CJNs. The system enumerates beforehand the possible ways in which the query keywords can be matched in the DB to generate query answers. MatCNGen then generates a single CJN for each of these QMs, drastically reducing the time required to generate CJNs. Furthermore, because the system assumes that answers must contain all of the query keywords, each keyword must appear in at least one element of a CJN. Since the generation process avoids enumerating too many combinations of keyword occurrences, a smaller but better set of CJNs is generated.
Lastly, Coffman & Weaver [3] proposed a framework for evaluating R-KwS systems and reported experimental results over three representative standardized datasets they built, namely MONDIAL, IMDb, and Wikipedia, along with their respective query workloads. The authors compare nine R-KwS systems, assessing their effectiveness and performance in a variety of ways. The resources of this framework were also used in the experiments of several other studies on R-KwS systems [2], [7], [10], [11], [20].
C. Support to Schema References in R-KwS
Overall, there are few systems in the literature that support schema references in keyword queries. One of the first such systems was BANKS [21], an R-KwS system based on Instance Graphs. In this system, however, the evaluation of queries with keywords matching metadata can be relatively slow, since a large number of tuples may be deemed relevant to such keywords.
Support for schema references in keyword queries was extensively addressed by Bergamaschi et al. in Keymantic [1], KEYRY [22], and QUEST [12]. All these systems can be classified as schema-based since they aim at generating a suitable SQL query given an input keyword query. They do not, however, rely on the concept of CJNs, as Lathe and all DISCOVER-based systems do. Keymantic [1] and KEYRY [22] consider a scenario in which data instances are not accessible, such as in databases on the hidden web and sources hidden behind wrappers in data integration settings, where typically only metadata is made available. Both systems rely on similarity techniques based on structural and lexical knowledge that can be extracted from the available metadata, e.g., names of attributes and tables, attribute domains, regular expressions, or from other external sources, such as ontologies, vocabularies, domain terminologies, etc. The two systems mainly differ in the way they rank the possible interpretations they generate for an input query. While Keymantic relies on an extension of the Hungarian algorithm proposed by the authors, KEYRY is based on the Hidden Markov Model, a probabilistic sequence model, adapted for keyword query modeling. QUEST [12] can be thought of as an extension of KEYRY because it uses a similar strategy to rank the mappings from keywords to database elements. QUEST, on the other hand, considers the database instance to be accessible and, in contrast to KEYRY, includes features derived from it for ranking interpretations.
Of these systems, QUEST is the one most similar to Lathe. However, it is difficult to draw a direct comparison between the two systems, as QUEST does not rely on the formal framework of CJN-related previous work [2], [4], [6], [10], [11] and it also resolves a smaller set of keyword queries than Lathe. QUEST, in particular, does not support keyword queries whose resolution requires SQL queries with self-joins. As a result, when comparing QUEST to other approaches, the authors limited the experimentation to 35 queries rather than the 50 included in the original benchmark [3], [12]. Lathe, on the other hand, supports all 50 queries.
Finally, there are systems that propose going beyond the retrieval of tuples that fulfill a query expressed using keywords and try to provide functionality closer to that of structured query languages. This is the case of SQAK [23], which allows users to specify aggregation functions over schema elements. Such an approach was later expanded in systems such as SODA [24] and SQUIRREL [25], which aim to handle not only aggregation functions but also keywords that represent predicates, groupings, orderings, and so on. To support such features, these systems rely on a variety of resources that are not part of the database schema or instances, among them conceptual schemas, generic and domain-specific ontologies, lists of reserved keywords, and user-defined metadata patterns. We see such systems as being closer to natural language query systems [13]. In contrast, Lathe, like any typical R-KwS system, aims at retrieving sets of JNTs that fulfill the query, rather than computing aggregated results over the tuples. In addition, it does not rely on any external resources.
Problem Statement
Given a database and a keyword query, the problem we address is to generate the relational algebra expressions that most likely express the intent of the user when formulating the keyword query.
We represent these expressions with Candidate Joining Networks, where the nodes comprise selections or projections over relations, and the edges represent join operations. That is,
each node $u$ of a CJN has one of the following forms:\begin{align*} u &= \sigma _{a \ni k}(R_{u}),\\ u &= \pi _{a}(R_{u}), \text{ where } k=a, \text{ or}\\ u &= \sigma (R_{u}), \text{ where } k=R_{u}\end{align*}
The first condition indicates that a keyword matches the value of an attribute, while the second and third verify whether the keyword matches an attribute name or a relation name, respectively.
For notational simplicity, we assume that the attributes of a primary-to-foreign-key relationship have the same name, so we can freely join relations using natural joins. The generalization of the problem and of the solution when these assumptions do not hold is trivial.
Also, for each edge connecting two nodes of a CJN, there must be a referential integrity constraint between the corresponding relations, so that every edge represents a join based on a PK/FK relationship.
To ensure the connectivity of the answers, a CJN must form a single connected network, so that the information retrieved by its nodes can be joined together.
Lathe Overview
In this section we present an overview of Lathe. We begin by presenting a simple example of the task carried out by our system. For this, we illustrate in Figure 1 a simplified excerpt from the well-known IMDb.
Let the input be the keyword query “will smith movies”. In this query, the keywords “will” and “smith” refer to values of attributes, whereas the keyword “movies” refers to the name of a relation.
As with other methods previously proposed in the literature, such as CNGen [4] and MatCNGen [2], [11], the main goal of Lathe is, given a query such as this, to automatically generate an SQL query that correctly expresses the intent behind the keyword query.
For this query, Figure 2 presents examples of SQL queries generated by Lathe along with their returned results.
SQL queries generated for the keyword query “will smith movies” and their returned results.
As this example indicates, there may be several plausible SQL queries related to a given keyword query. Therefore, it is necessary to decide which alternative is more likely to fulfill the user intent. This task is also carried out by Lathe.
Next, we present an overview of the components and the functioning of Lathe.
A. System Architecture
In this section, we present the overall architecture of Lathe. We base our discussion on Figure 3, which illustrates the main phases that comprise the operation of the method.
The process begins with an input keyword query posed by the user. The system then attempts to associate each of the keywords from the query with a database schema element, such as a relation or an attribute. The system relies on the DB schema, i.e., the names of relations and attributes, or on the DB instance, i.e., on the values of the attributes, for this. This phase, called Keyword Matching ①, generates sets of Value-Keyword Matches (VKMs), which associate keywords with sets of tuples whose attribute values contain these keywords, and Schema-Keyword Matches (SKMs), which associate keywords with names of relations or attributes deemed as similar to these keywords.
In Table 1 we show possible matches between keywords in the input query and the database elements. For example, the keywords “will smith” are found together in values of the attribute name of the relation PERSON, and the keyword “movies” is deemed similar to the name of the relation MOVIE.
In the next phase, Query Matching ②, Lathe generates combinations of VKMs and SKMs. In these combinations, we consider that all keywords in the query must be matched; in other words, the combination must be total. Furthermore, we also consider that all pairs of keywords and attributes are “useful”; that is, if we remove any of the pairs, this would result in a non-total combination. All combinations that satisfy both criteria are called Query Matches (QMs). In Figure 4 we present all possible QMs of the KMs illustrated in Table 1.
Although the Query Matching phase may generate a large number of QMs due to its combinatorial nature, only a few of them are useful in producing plausible answers to the user. As a result, we propose the first algorithm for Ranking Query Matches in the literature. This ranking assigns a score to QMs based on their likelihood of satisfying the needs of the user when formulating the keyword query. Thus, the system only outputs a few top-ranked QMs to the next phases. By doing so, it avoids having to process less likely QMs. We present the details on QMs, their generation, and ranking in Section VI.
Lastly, in the Candidate Joining Network Generation ③ phase, the system searches for interpretations of the keyword query. That is, the system tries to connect all the keyword matches of the QMs through CJNs, which are based on the schema graph. CJNs can be thought of as relational algebra joining expressions that can be directly translated into SQL queries.
For instance, both the QMs shown in Figure 4 (a) and (b) can be connected using the CASTING relation, which has referential integrity constraints with both PERSON and MOVIE.
Also, the system performs a Candidate Joining Network Ranking, which takes advantage of the previous QM ranking but also favors CJNs that are more concise in terms of the number of relations they employ. Once we have identified the most likely CJNs, they can be translated into SQL queries, executed by a DBMS, and their results delivered to the users. We notice that some of the generated CJNs may return empty results when they are evaluated. Thus, Lathe can alternatively evaluate CJNs before ranking them and prune such void CJNs. We call this process instance-based pruning.
During the whole process of generating CJNs, Lathe uses two data structures which are created in a Preprocessing stage: the Value Index and the Schema Index.
The Value Index is an inverted index that stores keyword occurrences in the database, indicating the relations, attributes, and tuples where a keyword appears. These occurrences are retrieved to generate VKMs. Furthermore, the Value Index is used to calculate term frequencies for the QM and CJN rankings. The Schema Index is an inverted index that stores database schema information, as well as statistics about relations and attributes. While database schema information, such as PK/FK relationships, is used for the generation of CJNs, the statistics about attributes, such as norms and inverse frequencies, are used for the rankings of QMs and CJNs.
In the following sections we present each of the phases of Figure 3, describing the steps, definitions, data structures, and algorithms we used.
Keyword Matching
In this section, we present the details of keyword matches and their generation. Their role in our work is to associate each keyword from the query with some attribute or relation in the database schema. Initially, we classify them as either VKMs or SKMs, according to the type of association they represent. Later, we provide a generalization of keyword matches and introduce the concept of Keyword-Free Matches, which will be used in the next phases of our method.
A. Value-Keyword Matching
Using value-keyword matches, we may associate the keywords from the query with some attribute in the database schema, based on the values of this attribute in the tuples that contain these keywords, according to Definition 1.
Definition 1:
Let $R$ be a relation, $Q$ be a keyword query, and $W(t[A_{i}])$ denote the set of words occurring in the value of attribute $A_{i}$ in tuple $t$. A Value-Keyword Match (VKM) of $R$ is given by:\begin{equation*}R^{V}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}] = \{ t|t \in R \wedge \forall A_{i}: W(t[A_{i}]) \cap Q = K_{i}\}\end{equation*}
Notice that each tuple from the database can be a member of only one value-keyword match. Therefore, the VKMs of a given query are disjoint sets of tuples.
Throughout our discussion, for the sake of compactness in the notation, we often omit mappings of attributes to empty keyword sets in the representation of a VKM. For instance, we use the notation $PERSON^{V}[name^{\{will,smith\}}]$ instead of explicitly mapping every other attribute of PERSON to an empty keyword set.
Example 1:
Consider the database instance of Figure 1. The following VKMs can be generated for the query “will smith films”.\begin{align*} PERSON^{V}[name^{ \{will,smith\} }] &= \{t_{1}\}\\ PERSON^{V}[name^{ \{will\} }] &= \{t_{2}\}\\ PERSON^{V}[name^{ \{smith\} }] &= \{t_{3}\}\end{align*}
Example 2:
Consider the query “lord rings 2001”, whose intent is to find which Lord of the Rings movie was released in 2001. We can represent it with the following value-keyword match:\begin{equation*} MOVIE^{V}[title^{ \{lord,rings\} }, year^{ \{2001\} }] = \{t_{17}\}\end{equation*}
The generation of VKMs uses a structure we call the Value Index. This index stores the occurrences of keywords in the database, indicating the relations and tuples in which a keyword appears and which attributes are mapped to the keyword. Lathe creates the Value Index during a preprocessing phase that scans all target relations only once. This phase comes before the query processing and is not expected to be repeated frequently. As a result, keyword matches are generated for each query without further interaction with the DBMS. The Value Index has the following structure, which is illustrated in Example 3:\begin{equation*}I_{V}=\{term:\{relation:\{attribute:\{tuples\}\}\}\}\end{equation*}
Example 3:
The VKMs presented in Example 1 are based on the following keyword occurrences:\begin{align*} I_{V}[will]&=\{PERSON:\{name:\{t_{1},t_{2}\}\}\} \\ I_{V}[smith]&=\{PERSON:\{name:\{t_{1},t_{3}\}\}\} \\ I_{V}[smith][PERSON]&=\{name:\{t_{1},t_{3}\}\} \\ I_{V}[smith][PERSON][name]&=\{t_{1},t_{3}\}\end{align*}
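To make this structure concrete, the following Python sketch builds such a nested inverted index over a toy excerpt of the PERSON relation; the tokenizer, the in-memory representation of the relations, and the tuple values other than “Will Smith” are simplifying assumptions of ours, not part of Lathe's actual implementation.

```python
from collections import defaultdict
import re

def tokenize(text):
    # naive tokenizer: lowercase alphanumeric words (a simplifying
    # assumption; Lathe's actual text preprocessing may differ)
    return re.findall(r"[a-z0-9]+", text.lower())

def build_value_index(relations):
    # relations: {relation: {attribute: {tuple_id: value}}}
    # returns I_V = {term: {relation: {attribute: {tuple_ids}}}}
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for relation, attributes in relations.items():
        for attribute, tuples in attributes.items():
            for tuple_id, value in tuples.items():
                for term in tokenize(value):
                    index[term][relation][attribute].add(tuple_id)
    return index

# toy excerpt inspired by the PERSON relation of Figure 1
I_V = build_value_index({"PERSON": {"name": {"t1": "Will Smith",
                                             "t2": "Will Theakston",
                                             "t3": "Maggie Smith"}}})
print(I_V["smith"]["PERSON"]["name"])  # {'t1', 't3'}
```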
B. Schema-Keyword Matching
We may associate the keywords from the query to some attribute or relation in the database schema based on the name of the attribute or relation using Schema-Keyword Matches, according to Definition 2. Specifically, our method matches keywords to the names of relations and attributes using similarity metrics.
Definition 2:
Let $R$ be a relation with schema $\{A_{1},\ldots,A_{m}\}$, $Q$ be a keyword query, $sim$ be a similarity function, and $\varepsilon$ be a similarity threshold. A Schema-Keyword Match (SKM) of $R$ is given by:\begin{equation*}R^{S}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}]= \{ t|t \in R \wedge \forall k \in K_{i}: sim(A_{i}, k) \geq \varepsilon \}\end{equation*}
In this representation, we use the artificial attribute self to indicate keywords that match the name of the relation itself rather than the name of one of its attributes.
Example 4:
The following schema-keyword matches are created for the query “will smith films”, considering a similarity threshold $\varepsilon$:\begin{align*} MOVIE^{S}[self^{ \{films\} }] &= \{t_{14},t_{15},t_{16},t_{17},t_{18},t_{19}\}\\ MOVIE^{S}[title^{ \{will\} }] &= \{t_{14},t_{15},t_{16},t_{17},t_{18},t_{19}\}\\ PERSON^{S}[name^{ \{smith\} }] &= \{t_{1},t_{2},t_{3},t_{4},t_{5}\}\end{align*}
Despite their similarity to VKMs, the schema-keyword matches serve a different purpose in our method, ensuring that the attributes of a relation appear in the query results. As a result, they do not “filter” any of the tuples from the database, implying that they do not represent any selection operation over database relations.
1) Similarity Metrics
For the matching of keywords to schema elements, we used two similarity metrics based on the lexical database WordNet: the Path similarity [26], [27] and the Wu-Palmer similarity [27], [28]. We introduce the WordNet database and the two similarity metrics below.
2) WordNet Database
WordNet [26] is a large lexical database that resembles a thesaurus, as it groups words based on their meanings. One use of WordNet is to measure similarity between words based on the relatedness of their senses, the many different meanings that words can have [29]. For example, the word “film” can refer to a movie, as well as to the act of recording or to photographic film, and each of these senses has a different relation to the sense of “show”. WordNet represents sense relationships, such as synonymy, hyponymy, and hypernymy, to measure similarity between words. Synonyms are two word senses that share the same meaning. In addition, we say that a sense $c_{1}$ is a hyponym of a sense $c_{2}$ if $c_{1}$ is more specific than $c_{2}$; conversely, $c_{2}$ is a hypernym of $c_{1}$.
3) Path Similarity
The Path similarity [26], [27] exploits the structure and content of the WordNet database. The relatedness score is inversely proportional to the number of nodes along the shortest path between the senses of two words. If the two senses are synonyms, the path between them has length 1. The relatedness score is calculated as follows:\begin{align*}sim_{path}(w_{1},w_{2})=\max _{\substack {c_{1} \in senses(w_{1})\\ c _{2} \in senses(w_{2})}}\left [{\frac {1}{|shortest\_{}path(c_{1},c_{2})|}}\right]\end{align*}
4) Wu-Palmer Similarity
The Wu-Palmer measure (WUP) [27], [28] calculates relatedness by considering the depths of the two synsets $c_{1}$ and $c_{2}$ in the WordNet taxonomy, along with the depth of their least common subsumer (LCS). The relatedness score is calculated as follows:\begin{align*}sim_{wup}(w_{1},w_{2})=\max _{\substack {c_{1} \in senses(w_{1})\\ c _{2} \in senses(w_{2})}}\left [{2\times \frac {depth(lcs(c_{1},c_{2}))}{depth(c_{1})+depth(c_{2})}}\right]\end{align*}
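Both metrics are readily available through the WordNet interface of the NLTK library. The sketch below is a minimal illustration rather than Lathe's actual code: it takes the maximum score over all pairs of senses, mirroring the two formulas above, and assumes the WordNet corpus has been downloaded via nltk.download.

```python
from itertools import product
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def max_wordnet_similarity(word1, word2, metric="path"):
    # maximum relatedness over all pairs of senses of the two words,
    # mirroring the max over senses(w1) x senses(w2) in both formulas
    best = 0.0
    for c1, c2 in product(wn.synsets(word1), wn.synsets(word2)):
        score = (c1.path_similarity(c2) if metric == "path"
                 else c1.wup_similarity(c2))
        if score is not None and score > best:
            best = score
    return best

print(max_wordnet_similarity("film", "movie"))        # 1.0 (synonym senses)
print(max_wordnet_similarity("film", "show", "wup"))  # a value in (0, 1]
```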
As in the case of VKMs, we detail the SKMGen algorithm used in Lathe in Appendix C.
C. Generalization of Keyword Matches
Initially, we presented Definitions 1 and 2 which, respectively, introduce VKMs and SKMs. We chose to explain the specificity of these concepts separately for didactic purposes. They are, however, both components of a broader concept, Keyword Match (KM), which we define in Definition 3. In the following phases, this generalization will be useful when merging VKMs and SKMs.
Definition 3:
Let $V\!K\!M$ and $S\!K\!M$ be, respectively, a value-keyword match and a schema-keyword match over the same relation $R$. A Keyword Match (KM) over $R$ is given by:\begin{equation*} R^{S}[A_{1}^{K_{1}^{S}}, \ldots, A_{m}^{K_{m}^{S}}]^{V}[A_{1}^{K_{1}^{V}}, \ldots, A_{m}^{K_{m}^{V}}] = V\!K\!M \cap S\!K\!M\end{equation*}
Notice that SKMs and VKMs are simply special cases of KMs in which all keyword sets of one of the two types are empty:\begin{align*}R^{S}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}] &= R^{S}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}]^{V}[A_{1}^{ \{\} }, \ldots, A_{m}^{\{\}}] \\ R ^{V}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}] &= R^{S}[A_{1}^{ \{\} }, \ldots, A_{m}^{\{\}}]^{V}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}]\end{align*}
Another concept required for the generation of QMs and CJNs is that of keyword-free matches, which we describe in Definition 4. They are KMs that are not associated with any keyword but are used as auxiliary structures, such as intermediate nodes in CJNs.
Definition 4:
We say that a keyword match is a keyword-free match if all of its keyword sets are empty, that is, $K_{i}^{S} = K_{i}^{V} = \emptyset$ for every attribute $A_{i}$ in:\begin{equation*}K\!M =R^{S}[A_{1}^{K_{1}^{S}}, \ldots, A_{m}^{K_{m}^{S}}]^{V}[A_{1}^{K_{1}^{V}}, \ldots, A_{m}^{K_{m}^{V}}]\end{equation*}
For the sake of simplifying the notation, we represent a keyword-free match simply by the name of its relation, e.g., CASTING.
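To make the sketches in the following sections concrete, a KM can be represented as a small immutable structure that pairs each attribute with its schema and value keyword sets. This representation is illustrative only and does not reflect Lathe's internal data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeywordMatch:
    relation: str
    # (attribute, keywords) pairs matched against attribute names,
    # e.g. (("self", ("films",)),) for MOVIE^S[self^{films}]
    schema: tuple = ()
    # (attribute, keywords) pairs matched against attribute values,
    # e.g. (("name", ("will", "smith")),)
    value: tuple = ()

    def keywords(self):
        # all keywords covered by this KM, schema- or value-based
        return {k for _, ks in self.schema + self.value for k in ks}

    def is_free(self):
        # keyword-free matches carry no keywords at all (Definition 4)
        return not self.keywords()

person = KeywordMatch("PERSON", value=(("name", ("will", "smith")),))
casting = KeywordMatch("CASTING")
print(person.keywords(), casting.is_free())  # {'will', 'smith'} True
```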
Query Matching
In this section, we describe the processes of generating and ranking QMs, which are combinations of the keyword matches generated in the previous phases that comprise every keyword from the keyword query.
A. Query Matches Generation
We combine the associations present in the KMs to form total and non-redundant answers for the user. In other words, Lathe looks for KM combinations that satisfy two conditions: (i) every keyword from the query must appear in at least one of the KMs; and (ii) if any KM is removed from the combination, the combination no longer meets the first condition. These combinations, called Query Matches (QMs), are described in Definition 5.
Definition 5:
Let $Q$ be a keyword query. A set $M=\{K\!M_{1},\ldots,K\!M_{n}\}$ of keyword matches is a Query Match (QM) for $Q$ if the keyword sets of its elements together cover all keywords of $Q$ (totality) and no proper subset of $M$ does (minimality), where each keyword match has the form:\begin{equation*} K\!M_{i}=R_{i}^{S}[A_{i,1}^{K^{S}_{i,1}},\ldots,A_{i,m_{i}}^{K^{S}_{i,m_{i}}}]^{V}[A_{i,1}^{K^{V}_{i,1}},\ldots,A_{i,m_{i}}^{K^{V}_{i,m_{i}}}]\end{equation*}
Notice that a QM cannot contain any keyword-free match, as it would not be minimal anymore. Example 5 presents combinations of KMs which are or are not QMs.
Example 5:
Considering the KMs from Examples 1 and 4, only some of the following sets are QMs for the query “will smith films”:\begin{align*} M_{1} &= \{PERSON^{V}[name^{ \{will,smith\} }],MOVIE^{S}[self^{ \{films\} }]\}\\ M_{2} &= \{PERSON^{V}[name^{ \{will\} }],PERSON^{V}[name^{ \{smith\} }],\\ &\qquad MOVIE^{S}[self^{ \{films\} }]\}\\ M_{3} &= \{PERSON^{V}[name^{ \{will\} }], PERSON^{V}[name^{ \{smith\} }]\}\\ M_{4} &= \{PERSON^{V}[name^{ \{will,smith\} }],MOVIE^{S}[self^{ \{films\} }],\\ &\qquad CHARACTER\}\\ M_{5} &= \{PERSON^{V}[name^{ \{will,smith\} }], MOVIE^{S}[self^{ \{films\} }],\\ &\qquad PERSON^{V}[name^{ \{smith\} }]\}\end{align*} Here, only $M_{1}$ and $M_{2}$ are QMs: $M_{3}$ is not total, since no KM covers the keyword “films”, whereas $M_{4}$ and $M_{5}$ are not minimal, as the keyword-free match CHARACTER and the redundant match $PERSON^{V}[name^{\{smith\}}]$ can be removed without breaking totality.
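Using the hypothetical KeywordMatch structure sketched in Section V, the two conditions of Definition 5 can be tested directly, as shown below. Lathe's actual QMGen algorithm (Appendix D) enumerates candidate combinations more efficiently; this sketch only illustrates the totality and minimality checks.

```python
def covered(match_set):
    # union of the keywords covered by a set of KMs
    keywords = set()
    for km in match_set:
        keywords |= km.keywords()
    return keywords

def is_query_match(match_set, query):
    # (i) totality: every query keyword appears in at least one KM
    if covered(match_set) != query:
        return False
    # (ii) minimality: removing any KM breaks totality
    return all(covered(match_set - {km}) != query for km in match_set)

query = {"will", "smith", "films"}
m1 = {KeywordMatch("PERSON", value=(("name", ("will", "smith")),)),
      KeywordMatch("MOVIE", schema=(("self", ("films",)),))}
m3 = {KeywordMatch("PERSON", value=(("name", ("will",)),)),
      KeywordMatch("PERSON", value=(("name", ("smith",)),))}
print(is_query_match(m1, query), is_query_match(m3, query))  # True False
```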
We present the QMGen algorithm for generating QMs in Appendix D.
B. Query Matches Ranking
As described in Section IV, Lathe performs a ranking of the QMs generated in the previous step. This ranking is necessary because frequently many QMs are generated, yet, only a few of them are useful to produce plausible answers to the user.
Lathe estimates the relevance of QMs based on a Bayesian Belief Network model of the current state of the underlying database. In practice, this model assesses two types of relevance when ranking query matches. The value-based score is calculated using the TF-IDF model, which adapts the traditional Vector space model to the context of relational databases, as done in LABRADOR [30] and CNRank [10]. The schema-based score, on the other hand, is calculated by estimating the similarity between keywords and the names of schema elements.
In Lathe, only the top-k QMs in the ranking are considered in the succeeding phases. By doing so, we avoid generating CJNs that are less likely to properly interpret the keyword query.
Bayesian Belief Network:
We adopt the Bayesian framework proposed by [31] and [19] for modeling distinct IR problems. This framework is simple and allows for the incorporation of features from distinct models into the same representational scheme. Other keyword search systems, such as LABRADOR [30] and CNRank [10], have also used it.
In our model, we interpret the QMs as documents, which are ranked for the keyword query. Figure 5 illustrates an example of the adopted Bayesian Network. The nodes that represent the keyword query are located at the top of the network, on the Query Side. The Database Side, located at the bottom of the network, contains the nodes that represent the QM to be scored. The center of the network is present on both sides and is made up of two sets of terms: the set $v$ of terms occurring in attribute values and the set $s$ of terms occurring in the names of schema elements.
In our Bayesian Network, we rank QMs based on their similarities with the keyword query. This similarity is interpreted as the probability of observing a query match $QM$ given the keyword query $Q$, that is, $P(QM|Q)$, which is proportional to $P(QM \wedge Q)$.
Initially, we define a binary random variable associated with each keyword from the sets $v$ and $s$.
As all the possible instantiations of these variables need not be considered, we evaluate the network only for the instantiation determined by the keywords of the query.
The instantiation of the root nodes of the network separates the query match nodes from the query nodes, making them mutually independent. Therefore:\begin{equation*}P(QM \wedge Q) = P(Q|v,s)P(QM|v,s)P(v)P(s)\end{equation*}
The probability of the keyword query $Q$ given the instantiated sets $v$ and $s$ is estimated as the product of the probabilities of its keywords:\begin{equation*}P(Q|v,s) = \prod _{1 \leq i \leq |Q|} P(q_{i}|v,s)\end{equation*}
where the probability of each keyword is given by:\begin{equation*}P(q_{i}|v,s) = (q_{i} \in v) \veebar (\exists k \in s: sim(q_{i},k) \geq \varepsilon)\end{equation*}
Similarly, in our network, the probability of a query match $QM$ is estimated as the product of the probabilities of its keyword matches:\begin{equation*}P(QM|v,s)= \prod _{1 \leq i \leq |QM|} P(K\!M_{i}|v,s)\end{equation*}
We compute the probability of KMs using two different metrics: a schema score based on the same similarities used in the generation of SKMs; and a value score based on a Vector model [33], [34] using the cosine similarity.\begin{align*} P(K\!M_{i}|v,s) &= \prod _{ \substack { 1 \leq j \leq m_{i} \\ K^{V}_{i,j} \neq \emptyset } } cos\left({\overrightarrow {A_{i,j}}, \overrightarrow {v \cap K_{i,j}^{V}}}\right) \\ &\times \prod _{ \substack { 1 \leq j \leq m_{i} \\ K^{S}_{i,j} \neq \emptyset } } \frac { \sum _{t \in s \cap K_{i,j}^{S}} sim(A_{i,j},t) }{|s \cap K_{i,j}^{S}|}\end{align*}
It is important to distinguish the documents from the Bayesian Network model and the Vector Model. The documents of the Bayesian Network are QMs, and the query is the keyword query itself, whereas the documents of the Vector model are database attributes, and the query is the set of keywords associated with the KM.
Once we know the document and the query of the Vector model, we can calculate the cosine similarity by taking the inner product of the document and query vectors. The cosine similarity formula is given as follows:\begin{align*} cos\left({\overrightarrow {A_{i,j}}, \overrightarrow {v \cap K_{i,j}^{V}}}\right) &= \left({\overrightarrow {A_{i,j}^{V}} \boldsymbol {\cdot } \overrightarrow {v \cap K_{i,j}^{V}}}\right)/\left({| \overrightarrow {A_{i,j}}|\times | \overrightarrow {v \cap K_{i,j}^{V}}|}\right) \\ &= \alpha \times \frac {\displaystyle \sum _{t \in V} w\left({\overrightarrow {A_{i,j}},t}\right) \times w\left({\overrightarrow {v \cap K_{i,j}^{V}},t}\right)} {\displaystyle \sqrt {\sum _{t \in V} w\left({\overrightarrow {A_{i,j}},t}\right)^{2}}}\end{align*}
The weights for each term are calculated using the TF-IDF measure, which is based on the frequency of the term and on its specificity in the collection. We use the raw frequency and the inverse frequency, which are the most recommended forms of TF-IDF weights [33]:\begin{equation*} w\left({\overrightarrow {X},t}\right) = freq_{X,t} \times \log {\frac {N_{A}}{n_{t}}}\end{equation*} where $freq_{X,t}$ is the frequency of the term $t$ in $X$, $N_{A}$ is the total number of attributes, and $n_{t}$ is the number of attributes whose values contain $t$.
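As an illustration of the value-based score, the sketch below computes the cosine similarity defined above with raw-frequency TF-IDF weights, dropping the constant α factor; the attribute statistics are toy values of ours, not taken from any of the datasets used in this paper.

```python
import math
from collections import Counter

def tfidf(freq, term, n_attributes, n_t):
    # w(X, t) = freq_{X,t} * log(N_A / n_t)
    return freq[term] * math.log(n_attributes / n_t[term])

def value_score(attribute_terms, keywords, n_attributes, n_t):
    # cosine between the attribute vector and the matched-keyword
    # vector, omitting the constant query-norm factor (alpha)
    doc, qry = Counter(attribute_terms), Counter(keywords)
    num = sum(tfidf(doc, t, n_attributes, n_t) *
              tfidf(qry, t, n_attributes, n_t)
              for t in set(keywords) if t in doc)
    den = math.sqrt(sum(tfidf(doc, t, n_attributes, n_t) ** 2
                        for t in doc))
    return num / den if den else 0.0

# n_t[t]: number of attributes whose values contain t (toy statistics)
n_t = {"will": 2, "smith": 1, "lord": 1}
print(value_score(["will", "smith", "will", "lord"],  # PERSON.name terms
                  ["will", "smith"], 10, n_t))
```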
We present the algorithm for ranking QMs in Appendix E.
Candidate Joining Networks
In this section we present the details on our method for generating and ranking Candidate Joining Networks (CJNs), which represent different interpretations of the keyword query. We recall that our definition of CJNs expands on the definition presented in [4] to support keywords referring to schema elements.
The generation of CJNs uses a structure we call a Schema Graph. In this graph, there is a node representing each relation in the database and the edges correspond to the referential integrity constraints (RIC) in the database schema. In practice, this graph is built in a preprocessing phase based on information gathered from the database schema.
Definition 6:
Let $\mathcal{R}$ be the set of relations of the database and $R\!I\!C(R_{a},R_{b})$ denote the number of referential integrity constraints in which $R_{a}$ references $R_{b}$. The schema graph is a directed graph $G_{S}=\langle \mathcal{R}, E \rangle$ whose set of edges is given by:\begin{equation*}E=\{ \langle R_{a},R_{b} \rangle | \langle R_{a},R_{b} \rangle \in \mathcal {R}^{2} \wedge R_{a} \neq R_{b} \wedge R\!I\!C(R_{a},R_{b}) \geq 1 \}\end{equation*}
Example 6:
Considering the sample movie database introduced in Figure 1, our method generates the schema graph below.\begin{align*} &G_{S}= < \{PERSON,MOVIE,CASTING,\\ &\qquad \qquad CHARACTER,ROLE\},\\ &\qquad \qquad \{\langle CASTING,PERSON\rangle,\langle CASTING,MOVIE\rangle,\\ &\qquad \qquad \langle CASTING,CHARACTER\rangle,\langle CASTING,ROLE\rangle \}>\end{align*}
In Figure 6, we present a graphical illustration of $G_{S}$.
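In practice, the schema graph can be assembled directly from the foreign-key metadata of the database. The sketch below, a simplified illustration rather than Lathe's actual code, builds the graph of Example 6 as an adjacency structure that also counts parallel RICs, which the soundness test presented later in this section relies on.

```python
from collections import defaultdict

def build_schema_graph(rics):
    # rics: one (referencing_relation, referenced_relation) pair per
    # referential integrity constraint; parallel RICs are counted
    graph = defaultdict(lambda: defaultdict(int))
    for r_from, r_to in rics:
        graph[r_from][r_to] += 1  # RIC(r_from, r_to)
    return graph

# RICs of the sample movie database (Example 6)
G_S = build_schema_graph([("CASTING", "PERSON"), ("CASTING", "MOVIE"),
                          ("CASTING", "CHARACTER"), ("CASTING", "ROLE")])
print(dict(G_S["CASTING"]))
# {'PERSON': 1, 'MOVIE': 1, 'CHARACTER': 1, 'ROLE': 1}
```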
Once we have defined the schema graph, we can introduce an important concept, the Joining Network of Keyword Matches (JNKM). Intuitively, a joining network of keyword matches is a connected graph whose nodes are keyword matches and whose edges connect matches only if their corresponding relations are connected in the schema graph, as stated in Definition 7.
Definition 7:
Let $M$ be a query match and $F$ be a set of keyword-free matches. A Joining Network of Keyword Matches (JNKM) is a connected graph $J=\langle \mathcal{V}, E \rangle$ such that:\begin{align*} &i)\ \mathcal {V} = M \cup F\\ &ii)\ \forall \langle K\!M_{a},K\!M_{b} \rangle \in E \implies \exists \langle R_{a}, R_{b} \rangle \in G_{S}\end{align*}
Example 7:
Considering the query match $M_{1}$ from Example 5, the following JNKMs can be generated:\begin{align*} J_{1} &= PERSON^{V}[name^{\{will,smith\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}]\\ J_{2} &= CHARACTER \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}],\\ &\qquad \text {with } CASTING \rightarrow PERSON^{V}[name^{\{will,smith\}}]\end{align*}
The JNKMs $J_{1}$ and $J_{2}$ both satisfy Definition 7: they connect all the keyword matches of $M_{1}$, and every edge corresponds to an edge of the schema graph.
Notice that a JNKM might carry information that is unnecessary for the keyword query, which is the case of $J_{2}$, whose keyword-free match CHARACTER contributes nothing to the interpretation of the query. To rule out such networks, we require JNKMs to be minimal, according to Definition 8.
Definition 8:
Let $J=\langle \mathcal{V}, E \rangle$ be a JNKM. We say that $J$ is minimal if none of its leaf nodes is a keyword-free match, that is:\begin{equation*}\forall K\!M_{i} \in \mathcal {V}: (\exists ! \langle K\!M_{a},K\!M_{b} \rangle \in E \mid i \in \{a,b\}) \implies K\!M_{i} \neq R_{i}^{S}[\,]^{V}[\,]\end{equation*}
Example 8:
Considering the query match $M_{2}$ from Example 5, the following minimal JNKM can be generated:\begin{align*} J_{3} &= PERSON^{V}[name^{\{will\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}],\\ &\qquad \text {with } CASTING \rightarrow PERSON^{V}[name^{\{smith\}}]\end{align*} Notice that $J_{3}$ is minimal even though it contains the keyword-free match CASTING, since CASTING appears as an internal node.
Another issue that a JNKM might have is representing an inconsistent interpretation. For instance, it is impossible for $J_{3}$ to yield any result: each tuple of CASTING references a single tuple of PERSON, so it cannot simultaneously join a person named “will” and a different person named “smith”. The theorem below characterizes the JNKMs that are sound, that is, free from such inconsistencies.
Theorem:
Let $J=\langle \mathcal{V}, E_{J} \rangle$ be a JNKM. $J$ is sound if and only if, for every keyword match $K\!M_{a} \in \mathcal{V}$ and every relation $R_{b}$, the number of neighbors of $K\!M_{a}$ that are based on $R_{b}$ does not exceed the number of RICs between the two relations:\begin{equation*} R\!I\!C(R_{a},R_{b}) \geq |\{K\!M_{c}| \langle K\!M_{a}, K\!M_{c} \rangle \in E_{J} \wedge R_{c}=R_{b} \}|\end{equation*}
Example 9 presents a JNKM that is sound, although it would be deemed not sound by previous approaches [2], [4].
Example 9:
Consider a simplified excerpt from the MONDIAL database [35], presented in Figure 7. As there exist two RICs from the relation BORDER to the relation COUNTRY, the following JNKM is sound:\begin{align*}J_{4}=COUNTRY^{V}[name^{\{colombia\}}] \leftarrow BORDER \rightarrow COUNTRY^{V}[name^{\{brazil\}}]\end{align*}
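The condition of the theorem can be checked locally on each node of a JNKM by counting its neighbors per relation and comparing the counts against the available RICs. A sketch follows, reusing the RIC-counting schema graph built earlier and our own hypothetical edge representation:

```python
from collections import Counter

def is_sound(jnkm_edges, relation_of, schema_graph):
    # jnkm_edges: directed edges (km_a, km_b) between node identifiers;
    # relation_of: node identifier -> relation name;
    # schema_graph: the RIC-count structure built by build_schema_graph
    neighbors = {}
    for km_a, km_b in jnkm_edges:
        neighbors.setdefault(km_a, Counter())[relation_of[km_b]] += 1
    return all(count <= schema_graph[relation_of[km_a]][r_b]
               for km_a, per_relation in neighbors.items()
               for r_b, count in per_relation.items())

# J3 (Example 8) joins one CASTING node to two PERSON matches, but
# CASTING holds a single RIC to PERSON, so J3 is not sound
edges = [("cast", "p_will"), ("cast", "p_smith"), ("cast", "movie")]
rel = {"cast": "CASTING", "p_will": "PERSON",
       "p_smith": "PERSON", "movie": "MOVIE"}
print(is_sound(edges, rel, G_S))  # False
```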
Definition 9:
Let $M$ be a query match for a keyword query $Q$. A Candidate Joining Network (CJN) for $M$ is a JNKM over $M$ that is minimal and sound.
Example 10:
Considering the query match $M_{2}$ from Example 5, the following CJN can be generated:\begin{align*} C\!J\!N_{1} &= PERSON^{V}[name^{\{will\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}]\\ &\qquad \leftarrow CASTING \rightarrow PERSON^{V}[name^{\{smith\}}]\end{align*} Notice that $C\!J\!N_{1}$ uses two distinct CASTING keyword-free matches, one for each person, which makes it sound.
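To evaluate a CJN such as $C\!J\!N_{1}$, it must be translated into SQL. The sketch below illustrates one possible translation, assuming hypothetical join attributes (movie_id, person_id) and using aliases to disambiguate the two CASTING nodes; it is a simplified illustration under these assumptions, not Lathe's actual translator.

```python
def cjn_to_sql(nodes, edges, predicates):
    # nodes: alias -> relation; aliases disambiguate repeated relations,
    # such as the two CASTING nodes of CJN_1.
    # edges: (known_alias, new_alias, join_attribute) triples, listed in
    # an order that grows the join tree one node at a time.
    # predicates: selection conditions contributed by the VKMs.
    root, *_ = nodes
    sql = [f"SELECT * FROM {nodes[root]} {root}"]
    for a, b, attr in edges:
        sql.append(f"JOIN {nodes[b]} {b} ON {a}.{attr} = {b}.{attr}")
    if predicates:
        sql.append("WHERE " + " AND ".join(predicates))
    return "\n".join(sql)

# CJN_1: two CASTING nodes join MOVIE with the two PERSON matches
print(cjn_to_sql(
    {"m": "MOVIE", "c1": "CASTING", "c2": "CASTING",
     "p1": "PERSON", "p2": "PERSON"},
    [("m", "c1", "movie_id"), ("c1", "p1", "person_id"),
     ("m", "c2", "movie_id"), ("c2", "p2", "person_id")],
    ["p1.name LIKE '%will%'", "p2.name LIKE '%smith%'"]))
```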
The details on how we generate CJNs in Lathe are described by the CNKMGen Algorithm in Appendix G.
A. Candidate Joining Network Ranking
In this section, we present a novel ranking of CJNs based on the ranking of QMs. This ranking is necessary because often many CJNs are generated, yet, only a few of them are indeed useful to produce relevant answers.
We present in Section VI-B a QM ranking that incorporates the majority of the features present in the CJN rankings of previously proposed systems, such as CNRank [10]. Thus, we can exploit the scores of the QMs to rank the CJNs. For this reason, our CJN ranking strategy is straightforward yet effective. Roughly, it uses the ranking of QMs, adding a penalization for large CJNs. Therefore, the score of a candidate joining network $C\!J\!N_{M}$ generated from a query match $M$ is given by:\begin{equation*} score(C\!J\!N_{M}) = score(M) \times \frac {1}{|C\!J\!N_{M}|}\end{equation*} where $|C\!J\!N_{M}|$ denotes the number of keyword matches in $C\!J\!N_{M}$.
To ensure that CJNs with the same score are placed in the same order in which they were generated, we use a stable sorting algorithm [18].
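The ranking step itself then reduces to a stable sort over the generated CJNs, as the sketch below illustrates with hypothetical QM scores; Python's sorted is stable, preserving generation order among ties.

```python
def rank_cjns(cjns, qm_scores):
    # score(CJN_M) = score(M) / |CJN_M|; sorted() is stable, so CJNs
    # with equal scores keep the order in which they were generated
    return sorted(cjns,
                  key=lambda c: qm_scores[c["qm"]] / len(c["nodes"]),
                  reverse=True)

cjns = [{"id": "CJN_1", "qm": "M2", "nodes": ["P", "C", "M", "C", "P"]},
        {"id": "CJN_2", "qm": "M1", "nodes": ["P", "C", "M"]}]
qm_scores = {"M1": 0.9, "M2": 0.6}  # hypothetical QM scores
print([c["id"] for c in rank_cjns(cjns, qm_scores)])  # ['CJN_2', 'CJN_1']
```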
B. Candidate Joining Network Pruning
In this section we present an eager evaluation strategy for pruning CJNs. Even though CJNs contain valid interpretations of the keyword query, some of them may fail to produce any JNTs as a result. Thus, we can improve the results of our CJN generation and ranking by pruning what we call void CJNs, that is, CJNs whose evaluation yields no JNTs.
Example 11:
Considering the database instance of Figure 1 and the keyword query “will smith films”, the following CJNs can be generated:\begin{align*} C\!J\!N_{2} &= PERSON^{V}[name^{\{will\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}]^{V}[title^{\{smith\}}]\\ C\!J\!N_{3} &= PERSON^{V}[name^{\{will\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}]\\ &\qquad \leftarrow CASTING \rightarrow CHARACTER^{V}[name^{\{smith\}}]\end{align*}
The interpretation of $C\!J\!N_{2}$ is a film whose title contains “smith” and in which a person named “will” appears, whereas $C\!J\!N_{3}$ refers to a person named “will” who played a character named “smith” in some film. Although both are plausible interpretations, neither produces any JNT when evaluated over the database instance of Figure 1; they are thus void CJNs and can be pruned.
As most of the previous work does not rank CJNs but only evaluates them and ranks their resulting JNTs instead, the pruning of void CJNs had never been addressed before. Lathe employs a pruning strategy that evaluates CJNs as soon as they are generated, pruning the void ones. This strategy, as demonstrated in our experiments, can significantly improve the quality of the CJN generation process, particularly in scenarios where the schema graph contains a large number of nodes and edges.
For instance, one of the datasets we use in our experiments, the MONDIAL database, contains a large number of relations and referential integrity constraints (RICs). This results in a schema graph with several nodes and edges, which, intuitively, incurs a large number of possible CJNs for a single QM. Indeed, we found that such schema graphs are prone to produce a large number of void CJNs. In particular, while approximately 20% of the keyword queries used in our experiments required us to consider 9 CJNs per QM, the eager evaluation strategy reduced this value to 2 CJNs per QM.
Notice, however, that to find out whether a CJN is void, we must execute it as an SQL query in the DBMS, which incurs an additional cost and an increase in the CJN generation time. Despite that, we observed in our experiments that the eager evaluation strategy does not necessarily hinder the performance of an R-KwS system. In fact, reducing the number of CJNs per QM alone improves the system efficiency, because this parameter influences the CJN generation process. Furthermore, the eager evaluation merely brings forward the CJN evaluation, which is already a required step in the majority of R-KwS systems in the related work. Lastly, we can set a maximum number of CJNs to probe during the eager evaluation, which limits the increase in CJN generation time.
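Since a probe only needs to know whether a CJN returns at least one JNT, it suffices to issue the translated SQL query with a LIMIT 1 clause. The sketch below illustrates this idea with Python's DB-API over an in-memory SQLite database; the policy of keeping unprobed CJNs once the probe budget is exhausted is our own assumption, not necessarily Lathe's.

```python
import sqlite3

def prune_void_cjns(ranked_cjns, to_sql, conn, max_probes=9):
    # eagerly evaluate each CJN with LIMIT 1 and keep the non-void ones;
    # once the probe budget is exhausted, the remaining CJNs are kept
    # unprobed to bound the extra cost (one possible policy)
    kept, probes = [], 0
    for cjn in ranked_cjns:
        if probes >= max_probes:
            kept.append(cjn)
            continue
        probes += 1
        if conn.execute(to_sql(cjn) + " LIMIT 1").fetchone() is not None:
            kept.append(cjn)  # at least one JNT: not void
    return kept

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT)")
conn.execute("INSERT INTO person VALUES ('Will Smith')")
cjns = [{"sql": "SELECT * FROM person WHERE name LIKE '%will%'"},
        {"sql": "SELECT * FROM person WHERE name LIKE '%xyz%'"}]
print(len(prune_void_cjns(cjns, lambda c: c["sql"], conn)))  # 1
```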
Experiments
In this section, we report a set of experiments performed using datasets and query sets previously used in similar experiments reported in the literature. Our goal is to evaluate the quality of the CJN ranking, the quality of the QM ranking, and how our eager evaluation strategy can improve the CJN generation.
A. Experimental Setup
1) System Details
We ran the experiments on a Linux machine running Artix Linux (64-bit, 32 GB RAM, AMD Ryzen processor).
2) Datasets
For all the experiments, we used three datasets, IMDb, MONDIAL, and Yelp, which were used for the experiments performed with previous R-KwS systems and methods [2], [3], [7], [10], [11], [20], [36]. The IMDb dataset is a subset of the well-known Internet Movie Database (IMDb), which comprises information related to films, television shows, and home videos, including actors, characters, etc. The MONDIAL dataset [35] comprises geographical and demographic information from the well-known CIA World Factbook, the International Atlas, the TERRA database, and other web sources.
The Yelp dataset is a subset of Yelp, which comprises information about businesses, reviews, and user data. The three datasets have distinct characteristics. The IMDb dataset has a simple schema, but query keywords often occur in several relations. Although the MONDIAL dataset is smaller, its schema is more complex or dense, with more relations and referential integrity constraints (RICs). The Yelp dataset has the highest number of tuples, but its schema is simple. Table 2 summarizes the details of each dataset.
3) Query Sets
We used the query sets provided by the Coffman & Weaver [3] benchmark for the IMDb and MONDIAL datasets. The query set for Yelp was obtained from SQLizer [37] and consists of 28 queries formulated in natural language. We adapted these queries to our experiments by extracting only their keyword terms.
However, we noticed that several queries from the IMDb and MONDIAL query sets do not have a clear intent, compromising the proper evaluation of the results, for instance, the ranking of CJNs. Therefore, for the sake of a fairer evaluation, we generated an additional query set for each original one, replacing queries that we considered unclear with equivalent queries with added schema references. As an example, consider the query “Saint Kitts Cambodia” for the MONDIAL dataset, where Saint Kitts and Cambodia are the names of two countries. There exist several interpretations of this keyword query, each of them with a distinct way of connecting the tuples corresponding to these countries. For example, one might look for shared religions, languages, or ethnic groups between the two countries. While all these interpretations are valid in theory, the relevant interpretation defined by Coffman & Weaver [3] in their golden standard indicates that the query searches for organizations in which both countries are members. In this case, we replaced it in the new query set with the query “Saint Kitts Cambodia Organizations”.
Table 3 presents the query sets we used in our experiments, along with some of their features. Query sets whose names include the suffix “-DI” correspond to those in which we have replaced ambiguous queries as explained above. Thus, these query sets have no ambiguous queries and a higher number of schema references.
4) Golden Standards
The benchmark from Coffman & Weaver [3] provides the relevant interpretation and its relevant SQL results for each query over the IMDb and MONDIAL datasets. In the case of the Yelp dataset, SQLizer [37] provides the relevant SQL queries for the natural language queries. Since we derived keyword queries from the latter, we also adapted the SQL queries to reflect this change. We then manually generated the golden standards for CJNs and QMs using the relevant SQL queries provided by Coffman & Weaver and by SQLizer.
5) Metrics
We evaluate the rankings of CJNs and QMs using three metrics: Precision at ranking position 1 ($P@1$), Recall at position $K$ ($R@K$), and Mean Reciprocal Rank (MRR).
Precision at 1 ($P@1$) indicates whether the relevant answer is placed at the first position of the ranking, while Recall at $K$ ($R@K$) indicates whether the relevant answer appears among the top $K$ positions.
The Mean Reciprocal Rank (MRR) value indicates how close the correct CJN is to the first position of the ranking. Given a keyword query $Q_{i}$, its reciprocal rank is $1/rank_{i}$, where $rank_{i}$ is the position of the relevant answer in the ranking produced for $Q_{i}$; MRR is the average of the reciprocal ranks over all queries in a query set.
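Stated as a formula over a query set $\mathcal{Q}$ (assuming, as is conventional, a reciprocal rank of zero when the relevant answer is absent from the ranking):\begin{equation*}M\!R\!R = \frac{1}{|\mathcal{Q}|} \sum _{i=1}^{|\mathcal{Q}|} \frac{1}{rank_{i}}\end{equation*}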
6) Lathe Setup
For the experiments we report here, we set a maximum size for QMs and CJNs of 3 and 5, respectively. Also, we consider three important parameters for running Lathe: the number of top-ranked QMs considered, the number of CJNs generated per QM, and the maximum number of CJNs probed by the eager evaluation, where 0 means that the eager evaluation is disabled. We denote a configuration by these three values; for instance, the configuration 8/1/9 takes the top 8 QMs, generates 1 CJN per QM, and probes up to 9 CJNs.
All the resources, including source code, query sets, datasets and golden standards used in our experiments are available at https://github.com/pr3martins/Lathe.
B. Preliminary Results
We present in this section some statistics about the CJN generation process. Table 4 shows the maximum and average numbers of KMs, QMs, and CJNs generated for each query set. The last two columns refer to the ratio of the number of CJNs to the number of QMs. Notice that we removed the maximum caps for the number of CJNs and CJNs per QM in the experiment reported here. However, we maintained the limit sizes of 3 and 5 for the QMs and CJNs, respectively.
Overall, the query sets for both IMDb and Yelp datasets achieved higher maximum and average numbers of KMs and QMs. This result is due to a higher number of tuples and the keywords being present in multiple relations or combinations. For example, in the IMDb dataset, several persons, characters, and even movies share the same name or part of it. In the case of Yelp, for instance, the keyword “texas” can match a state, a restaurant name, or a username. On the other hand, in MONDIAL, the keywords often match a few attributes only. For example, a city name probably does not overlap with the names of countries, continents, etc. Consequently, the system produces a low number of KMs and QMs for the query sets of this dataset.
Regarding the CJN generation, the query sets for IMDb and Yelp achieved high numbers of CJNs because of their already high numbers of QMs, but a low ratio of CJNs to QMs due to their simple schema graphs. As for the query sets for the MONDIAL dataset, they achieved opposite results due to their complex schema graph.
C. Comparison With Other R-KwS Systems
In this experiment, we first compare Lathe with QUEST [12], the current state-of-the-art R-KwS system with support for schema references, and then we also compare Lathe with several other R-KwS systems. Here, we used the default Lathe setup, that is, 8/1/9. We compare our results to those published by the authors, which refer to the MONDIAL dataset, because we were unable to run QUEST due to the lack of available code and of details sufficient for implementing it. Figure 8 depicts the results for the 35 queries supported by QUEST out of the 50 queries provided in the original query set. The graphs show the recall and P@1 values for the ranking produced by each system considering the golden standard supplied by Coffman & Weaver [3].
Both systems achieved perfect recall; that is, all the correct solutions for the given keyword queries were retrieved. Concerning P@1, Lathe obtained better results than QUEST, with an average of 0.97 with a standard error of 0.03, which indicates that, in most cases, the correct solution was the one corresponding to the CJN ranked as the first by Lathe.
Next, we compare the results obtained for Lathe with those published in the comprehensive evaluation by Coffman & Weaver [3] for the systems BANKS [14], DISCOVER [4], DISCOVER-II [6], BANKS-II [15], DPBF [38], BLINKS [16], and STAR [39]. Because this comparison uses all 50 keyword queries from the MONDIAL dataset, we did not include QUEST in the comparison. Figure 9 shows the recall and P@1 values for the ranking produced by each system when the golden standard provided by Coffman & Weaver [3] is taken into account.
Overall, Lathe achieved the best results in recall and P@1. The only systems that achieved similar recall, DPBF and BLINKS, are based on data graphs and thus require a materialization of the database. The difference between the recall values of Lathe and those of DISCOVER and DISCOVER-II is mainly due to the latter not supporting schema references. Regarding P@1, Lathe obtained a value of 0.96 with a standard error of 0.03, which is significantly higher than the results of the other systems. This difference in P@1, especially compared with DISCOVER and DISCOVER-II, is due to the novel ranking of QMs as well as an improved ranking of CJNs.
D. Evaluation of Query Matches Ranking
In this experiment, we evaluate the quality of the QM ranking according to the metrics MRR and $R@K$.
For all query sets, in most cases, the correct QM appears within the first eight ranking positions. In MONDIAL and MONDIAL-DI, the relevant QM appears within the first three positions for all queries. The Yelp query set also placed the relevant QM in the top positions for most queries.
Regarding MRR, Lathe obtained 0.75 for both IMDb and IMDb-DI, 0.83 for Yelp, and 0.96 and 0.95 for MONDIAL and MONDIAL-DI, respectively. This result indicates that the relevant QM is often in the top positions of the ranking. Notice that the QM ranking indirectly impacts the generation and ranking of CJNs. In practice, a high $R@K$ for QMs is a prerequisite for generating the relevant CJN, since only the top-ranked QMs are passed on to the CJN generation phase.
E. Evaluation of the Candidate Joining Network Ranking
In this experiment, we evaluate the quality of our approach for CJN generation and ranking. We used the metrics MRR and $R@K$, comparing configurations that differ in the number of QMs, the number of CJNs generated per QM, and the use of the eager evaluation strategy.
Figure 11 shows the results for the IMDb and IMDb-DI query sets. As can be seen, regardless of the configuration, our method was able to place the relevant CJNs in the top positions of the ranking, and the results are very similar for both the IMDb and IMDb-DI query sets. This shows that, in these query sets, our method was able to disambiguate the queries properly, even without the addition of schema references. It is worth noting that the values of MRR and $R@K$ remained nearly the same across all configurations.
Figure 12 shows the results for MONDIAL and MONDIAL-DI. In these query sets, the configurations with the eager evaluation achieved significantly better results. The configurations 8/1/0 and 8/2/0 could not generate the relevant CJN for around 20% of the queries due to a low number of CJNs per QM; therefore, their results were capped at an MRR and $R@K$ of approximately 0.8.
Figure 13 shows the results for the Yelp query set. Overall, the eager CJN evaluation did not affect the results for this query set, probably because the database schema graph is simple and the ways of connecting the query matches are straightforward. Configurations 8/1/0 and 8/1/9 achieved the best results, obtaining the highest MRR and $R@K$ values for this query set.
Regardless of the datasets and configurations, our method achieved an MRR value above 0.7, which indicates that, on average, the relevant CJN is found between the first and the second rank positions. In the IMDb dataset, the lower MRR values are mainly explained by the higher ambiguity of its keyword queries, whose terms often match several relations and attributes.
The eager CJN evaluation inherently affects the performance of the CJN generation process. Therefore it is important to look at the trade-off between the effectiveness and the efficiency in each configuration. We examine this trade-off in the next section.
F. Performance Evaluation
In this experiment, we aim at evaluating the time spent to obtain the CJNs given a keyword query, and at analyzing the trade-offs between the efficiency and the efficacy of the different configurations used in Lathe.
Figure 14 summarizes the average execution time for each phase of the process: Keyword Matching, Query Matching, and Candidate Joining Network Generation. In this first experiment, we used the configuration 8/1/0. Lathe obtained better total execution times for the IMDb dataset, followed by the Yelp dataset. In addition, the disambiguated (-DI) variants of the query sets yield slower execution times in comparison with their original counterparts. Also, it is worth noting that the execution times for each query set are related to the numbers of KMs, QMs, and CJNs shown in Table 4.
Regarding keyword matching, the Yelp dataset yielded the worst execution times, with 167ms, probably because of its higher number of attributes and tuples. Although the MONDIAL dataset has fewer tuples than IMDb, its higher number of schema elements (28 relations and 48 attributes) results in a higher execution time than IMDb.
Due to the combinatorial nature of QM generation, the execution times for the Query Matching phase are directly related to the number of QMs. While the execution times for the IMDb and IMDb-DI query sets, which produced a high number of QMs, are 247 and 256 milliseconds, respectively, the results for MONDIAL and MONDIAL-DI are around 190 and 202 microseconds. The Yelp dataset achieved 121 milliseconds.
Concerning the CJN phase, the execution times for MONDIAL are significantly higher in comparison with the execution times for IMDb and Yelp, despite the lower number of CJNs for the MONDIAL. Because the CJN generation algorithm is based on a Breadth-First Search, the greater the number of vertices and edges in the schema graph of the MONDIAL dataset, the greater the number of iterations and, consequently, the slower the execution times. This behavior persists throughout different configurations, an issue we further analyze below.
G. Quality Versus Performance
Figure 15 presents an evaluation of the CJN generation performance, comparing the same configurations used in the experiment of Section VIII-E. We present the results for the IMDb, MONDIAL, and Yelp datasets on different scales because they differ by orders of magnitude. Overall, execution times increase as the number of CJNs taken per QM increases, a pattern that is more pronounced in the MONDIAL dataset. Also, eager CJN evaluation incurs an unavoidable increase in CJN generation time, as the system has to probe the CJNs by running queries against the database.
As the configurations impact both the quality of the CJN ranking and the performance, it is important to examine the trade-off between effectiveness and efficiency. Configurations 8/1/0 and 8/2/0 achieved the best execution times, due to their low number of CJNs per QM and because they do not rely on database accesses. However, these configurations did not achieve the highest MRR and R@K values for the IMDb and MONDIAL datasets. Therefore, they are recommended only if one must prioritize efficiency.
Configuration 8/1/9 obtained better results than configurations 8/8/0 and 8/9/0 for the IMDb and MONDIAL datasets, and better results than 8/2/9 for all datasets. Although this configuration is slower than 8/1/0 and 8/2/0, its significantly better MRR and R@K results make it the recommended choice when effectiveness is the priority.
We do not recommend configurations 8/2/0, 8/8/0, 8/9/0, and 8/2/9, because their MRR and R@K results are matched or surpassed by configurations with similar or better execution times.
It is interesting to note that, although the configurations with eager CJN evaluation spend time probing CJNs by sending queries to the DBMS, they also generate a smaller set of CJNs, so their overall performance is not hindered in comparison with the configurations without it.
Conclusion
In this paper, we have proposed Lathe, a new relational keyword search (R-KwS) system for generating a suitable SQL query from a given keyword query. Lathe is the first to address the problem of generating and ranking Candidate Joining Networks (CJNs) based on queries with keywords that can refer to either instance values or database schema elements, such as relations and attributes.
In addition, Lathe improves the quality of the generated CJNs by introducing two major innovations: a ranking for selecting better Query Matches (QMs) in advance, yielding fewer but better CJNs, and an eager evaluation strategy for pruning void CJNs, that is, CJNs whose evaluation returns no results.
We present a comprehensive set of experiments performed with query sets and datasets used in evaluations of previous state-of-the-art R-KwS systems and methods. Our experiments indicate that Lathe can handle a wider variety of keyword queries while remaining highly effective, even for large databases with intricate schemas.
Also, a full implementation of Lathe is publicly available at https://github.com/pr3martins/Lathe as a Python library for keyword search over relational databases called PyLatheDB [40]. This library allows developers to easily run Lathe or to incorporate its features, such as keyword matching, into their own applications.
Our experience in the development of Lathe raised several ideas for future work. First, an important issue in our method is correctly matching keywords from the input query to the corresponding database elements. To address this issue, we plan to investigate alternative similarity functions. We are particularly interested in word-embedding-based functions, such as the well-known Word Mover's Distance (WMD) [41]. We also plan to investigate methods based on Neural Language Models (NLMs), particularly transformers and attention-based models [42], [43], [44], which have proved promising for several text-based Information Retrieval problems. For example, we believe that an interesting approach to the QM ranking problem is to interpret it as a variant of the Table Retrieval task [45], [46]: given a keyword query and a table corpus, this task consists of returning a ranked list of the tables that are relevant to the query.
Second, data exploration techniques have recently gained popularity because they allow for the extraction of knowledge from data even when the user is unsure of what to look for [47]. We believe that keyword-based queries can be an interesting tool for data exploration because they allow one to retrieve interesting portions of a database without knowing the details of its schema and semantics.
Third, we believe that pruning strategies can improve QM generation by reducing the search space of keyword matches. For this, we plan to exploit the relationship between QM generation and the discovery of matching dependencies [48].
Fourth, although we have focused on relational databases in this paper, the ideas discussed here can be extended to other types of databases as well. Currently, we are extending them to so-called document stores, such as the very popular MongoDB engine. Our preliminary findings [36] suggest that, because queries over such stores are frequently more complex than queries over relational databases, the simplicity of keyword queries may be even more advantageous in this context.
Finally, we anticipate that keyword queries will be useful as a tool for the seamless integration of data from heterogeneous sources, as in the so-called polystore systems and data lakes, which have become increasingly popular in recent years. Research proposals in this direction already exist [49], and we believe that the schema graph approach we adopt in our work can help achieve this goal.
Appendix A: Acronyms

Abbreviation | Expansion
R-KwS | Relational Keyword Search
CJN | Candidate Joining Network
QM | Query Match
DB | Database
SQL | Structured Query Language
IR | Information Retrieval
DBMS | Database Management System
JNT | Joining Network of Tuples
IMDb | Internet Movie Database
KM | Keyword Match
VKM | Value-Keyword Match
SKM | Schema-Keyword Match
PKFK | Primary Key/Foreign Key
WUP | Wu-Palmer Measure
LCS | Least Common Subsumer
TF-IDF | Term Frequency – Inverse Document Frequency
RIC | Relational Integrity Constraint
JNKM | Joining Network of Keyword Matches
MJNKM | Minimal Joining Network of Keyword Matches
R@K | Recall at K
P@1 | Precision at 1
MRR | Mean Reciprocal Rank
WMD | Word Mover's Distance
NLM | Neural Language Model
ECLAT | Equivalence Class Clustering and bottom-up Lattice Traversal
Appendix B: VKMGen Algorithm
As shown in Algorithm 1, Lathe retrieves the tuples from the database in which the keywords occur and uses them to generate value-keyword matches. Initially, the VKMGen algorithm takes the occurrences of each keyword from the Value Index and forms partial value-keyword matches, which are not yet guaranteed to be disjoint sets (Lines 3-8). The pool of partial VKMs is kept in a hash table.
Next, Lathe ensures that the VKMs are disjoint sets through Algorithm 2, VKMInter, which is based on the ECLAT algorithm [50] for finding frequent itemsets. VKMInter looks for non-empty intersections of the partial value-keyword matches recursively until all of them are disjoint sets, and thus proper VKMs. These intersections are calculated as follows:\begin{align*} KM_{1} \cap KM_{2} = \begin{cases} \emptyset, & \text {if } R_{a}\neq R_{b}\\ R^{V}_{ab}[A_{ab,1}^{K_{ab,1}},\ldots,A_{ab,m}^{K_{ab,m}}], & \text {if } R_{a}=R_{b}\end{cases}\end{align*}
VKMInter organizes the partial value-keyword matches and their intersections in three hash tables, which it defines before computing the intersections recursively.
After the execution of VKMInter, in Line 9 of VKMGen, we obtain the value-keyword matches and their tuples. As the sets of tuples are only required for the generation of the VKMs, VKMGen generates and outputs the set of value-keyword matches, discarding the tuples from this point on.
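To illustrate the intersection step, the following is a minimal Python sketch of an ECLAT-style splitting of partial VKMs into disjoint sets; the class and function names are illustrative, not Lathe's actual implementation.

class PartialVKM:
    """A partial VKM: the relation, the keywords matched, and the
    set of tuple ids in which all of those keywords occur."""
    def __init__(self, relation, keywords, tuples):
        self.relation = relation
        self.keywords = frozenset(keywords)
        self.tuples = set(tuples)

def vkm_inter(partial_vkms):
    """Split partial VKMs over the same relation until their tuple
    sets are pairwise disjoint (iterative ECLAT-style splitting)."""
    result, pending = [], list(partial_vkms)
    while pending:
        vkm = pending.pop()
        for other in list(pending):
            if vkm.relation != other.relation:
                continue  # KM1 ∩ KM2 = ∅ when the relations differ
            common = vkm.tuples & other.tuples
            if common:
                pending.remove(other)
                # The intersection matches both keyword sets at once.
                pending.append(PartialVKM(vkm.relation,
                                          vkm.keywords | other.keywords,
                                          common))
                for old in (vkm, other):  # leftovers keep their old keywords
                    rest = old.tuples - common
                    if rest:
                        pending.append(PartialVKM(old.relation, old.keywords, rest))
                break
        else:
            result.append(vkm)  # disjoint from everything still pending
    return result

Each split strictly decreases the total number of tuple occurrences across the pool, so the procedure terminates with pairwise-disjoint VKMs.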
Appendix C: SKMGen Algorithm
The generation of schema-keyword matches uses a structure we call the Schema Index, which is created in a preprocessing phase alongside the Value Index. This index stores information about the database schema, as well as statistics about attributes that are used for the ranking of QMs, as explained in Section VI. The stored information follows the structure below:\begin{equation*}I_{S}=\{relation:\{attribute:\{(norm,maxfrequency)\}\}\}\end{equation*}
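For concreteness, a Schema Index following this structure could be represented in Python as the nested dictionary below; the relations, attributes, and statistics shown are hypothetical.

# Hypothetical Schema Index mirroring I_S; norm and max frequency
# are per-attribute statistics computed during preprocessing.
schema_index = {
    "movie": {
        "title": (0.82, 3),   # (norm, max_frequency)
        "year": (1.00, 1),
    },
    "person": {
        "name": (0.91, 2),
    },
}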
The generation of SKMs is carried out by Algorithm 3, SKMGen. First, the algorithm iterates over the relations and attributes from the Schema Index. Then, SKMGen calculates the similarity between each keyword and each schema element, keeping only the pairs whose similarity is above a given threshold.
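The sketch below illustrates this filtering loop in Python. The similarity function shown (a plain string-similarity ratio) is only a stand-in for the measure Lathe actually uses, and the "self" marker for relation-name matches is an assumption for illustration.

from difflib import SequenceMatcher

def similarity(keyword, term):
    # Stand-in lexical similarity; Lathe's measure also considers
    # semantic similarity (e.g., the Wu-Palmer measure).
    return SequenceMatcher(None, keyword.lower(), term.lower()).ratio()

def skm_gen(keywords, schema_index, threshold=0.8):
    """Generate schema-keyword matches whose similarity clears the threshold."""
    skms = []
    for relation, attributes in schema_index.items():
        for keyword in keywords:
            if similarity(keyword, relation) >= threshold:
                skms.append((relation, "self", keyword))  # relation-name match
            for attribute in attributes:
                if similarity(keyword, attribute) >= threshold:
                    skms.append((relation, attribute, keyword))
    return skms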
Appendix D: QMGen Algorithm
The generation of query matches is carried out by Algorithm 4, QMGen, which preserves the ideas proposed in MatCNGen [2], adapting them to keyword matches instead of tuple-sets.
It is easy to see that the time complexity of Algorithm 4 is combinatorial in the number of keyword matches, since it enumerates their combinations.
Also, as the QM ranking presented in Section VI-B penalizes QMs with a large number of KMs, we can define a maximum QM size to bound the combinations that need to be considered.
The algorithm MinimalCover iterates through the KMs of each combination, checking that the combination covers all the query keywords and that no KM is redundant. If a KM matches no keyword beyond those already covered by the others, the combination is not a minimal cover and is discarded. After merging all the possible elements from the query match, the resulting QMs are output, as sketched below.
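As a sketch of these conditions, assuming each KM exposes the set of query keywords it matches (an illustrative interface, not the paper's pseudocode):

def is_total_and_minimal(kms, query_keywords):
    """A combination of KMs can form a query match only if it covers
    every query keyword (total) with no redundant KM (minimal)."""
    covered = set().union(*(km.keywords for km in kms))
    if covered != set(query_keywords):
        return False  # total cover fails: some keyword is unmatched
    for km in kms:
        others = set().union(*(o.keywords for o in kms if o is not km))
        if others == covered:
            return False  # minimal cover fails: km adds nothing
    return True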
Appendix E: QMRank Algorithm
The ranking of Query Matches is carried out by Algorithm 7, QMRank. Notice that, intuitively, the process of ranking QMs advances part of the relevance assessment of the CJNs, which was first proposed in CNRank [10]. This yields an effective ranking of QMs and a simpler ranking of CJNs. QMRank uses a value score and a schema score, which are related, respectively, to the VKMs and SKMs that compose the QM.
The algorithm first iterates over each query match, assigning 1 to both the value score and the schema score before accumulating the contributions of the VKMs and SKMs of the match.
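A condensed sketch of this scoring scheme, assuming each QM carries its VKMs and SKMs with precomputed scores (the attribute names tf_idf and similarity are illustrative):

def qm_rank(query_matches):
    """Rank QMs by the product of their value and schema scores."""
    ranked = []
    for qm in query_matches:
        value_score, schema_score = 1.0, 1.0
        for vkm in qm.vkms:
            value_score *= vkm.tf_idf        # IR-style score of value matches
        for skm in qm.skms:
            schema_score *= skm.similarity   # schema-name similarity
        ranked.append((value_score * schema_score, qm))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [qm for _, qm in ranked]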
Appendix F: Sound Theorem
Theorem 1: Let \(KM_{a}\) be a keyword match from relation \(R_{a}\) in a joining network of keyword matches with edges \(E_{J}\), and let \(RIC(R_{a},R_{b})\) denote the number of relational integrity constraints between relations \(R_{a}\) and \(R_{b}\). The network is sound only if, for every relation \(R_{b}\), the number of keyword matches from \(R_{b}\) adjacent to \(KM_{a}\) does not exceed the number of RICs between \(R_{a}\) and \(R_{b}\):\begin{equation*} RIC(R_{a},R_{b}) \geq |\{KM_{c} \mid \langle KM_{a}, KM_{c} \rangle \in E_{J} \wedge R_{c}=R_{b} \}|\end{equation*}
Proof: Consider a network whose evaluation corresponds to the sequence of joins\begin{align*} T_{1} &= R_{1} \\ T_{2} &= T_{1} \bowtie _{f_{1,2}=k_{2}} R_{2} \\ T_{3} &= T_{2} \bowtie _{f_{1,3}=k_{3}} R_{3} \\ &\;\;\vdots \\ T_{n+1} &= T_{n} \bowtie _{f_{1,n+1}=k_{n+1}} R_{n+1} \\ T_{n+2} &= T_{n+1} \bowtie _{f_{1,x}=k_{n+2}} R_{n+2}, \quad \text {where } x \in \{2,\ldots,n+1\}\end{align*}
Suppose \(x = 2\), so that the two joins\begin{align*} T_{2} &= T_{1} \bowtie _{f_{1,2}=k_{2}} R_{2} \\ T_{n+2} &= T_{n+1} \bowtie _{f_{1,2}=k_{n+2}} R_{n+2}\end{align*}
use the same foreign key, imposing\begin{equation*} f_{1,2}=k_{2} \wedge f_{1,2}=k_{n+2}\end{equation*}
that is, \(k_{2}=k_{n+2}\), so the two joined tuples coincide and the two matches cannot be distinct. Hence each adjacent keyword match from the same relation must be reached through a distinct RIC and, in general, each join\begin{equation*} T_{m+1} = T_{m} \bowtie _{f_{1,x}=k_{m+1}} R_{m+1}\end{equation*}
consumes one of the available RICs, which yields the bound of the theorem.
Appendix G: CJNGen Algorithm
The generation and ranking of CJNs is carried out by Algorithm 8, CJNGen, which uses a breadth-first search approach [18] to expand JNKMs until they contain all the elements of a query match.
Despite being based on the MatCNGen algorithm [2], CJNGen supports generating CJNs in which more than one RIC exists between a pair of database relations, owing to the definition of soundness presented in Theorem 1. Also, CJNGen does not require an intermediate structure such as the Match Graph of the MatCNGen system.
We describe CJNGen in Algorithm 8. For each query match, CJNGen generates the candidate joining networks using an internal algorithm called CJNInter, which we describe in the remainder of this section.
In Algorithm 9, we present CJNInter. This algorithm takes as input a query match, along with the database schema graph.
Next, CJNInter initializes a queue of partial JNKMs to be expanded. The expansion proceeds breadth-first: at each step, a partial JNKM is dequeued and extended with keyword matches or keyword-free matches that are adjacent in the schema graph. If an expanded JNKM contains all the elements of the query match and satisfies the minimality and soundness conditions, it is output as a candidate joining network; otherwise, it is enqueued for further expansion, as sketched below.
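The sketch below captures the breadth-first expansion in simplified form: JNKMs are reduced to sets of matches, the minimality and soundness checks are omitted, and adjacent is an assumed helper that yields the matches joinable with a given match through PK/FK constraints.

from collections import deque

def cjn_inter(query_match, adjacent, max_size):
    """Expand JNKMs breadth-first until they contain every element
    of the query match (simplified: networks as sets of matches)."""
    qm = frozenset(query_match)
    start = frozenset([next(iter(qm))])  # seed with one keyword match
    queue, seen, cjns = deque([start]), {start}, []
    while queue:
        jnkm = queue.popleft()
        if qm <= jnkm:
            cjns.append(jnkm)  # covers the whole QM: a candidate CJN
            continue
        if len(jnkm) >= max_size:
            continue           # prune: would exceed the maximum CJN size
        for node in jnkm:
            for neighbor in adjacent(node):
                expanded = jnkm | {neighbor}
                if expanded not in seen:
                    seen.add(expanded)
                    queue.append(expanded)
    return cjns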
Note that the complexity of CJN generation stems mainly from two factors: (1) there can be multiple KMs for each subset of keywords, so there may be several ways of combining these KMs into QMs such that all keywords are covered; and (2) given a QM, there can be many distinct ways of connecting its elements through PK/FK constraints and keyword-free matches.
Lathe implements some basic CJN pruning strategies to help reduce the candidate space, based on the following parameters: the top-k CJNs, the top-k CJNs per QM, and the maximum CJN size. The algorithm also prunes JNKMs that are not minimal or not sound, relying on the maximum node degree, the maximum number of keyword-free matches, and the distinct foreign keys.
Maximum Node Degree
As the leaves of a CJN must be keyword matches from the query match, a CJN can have at most as many leaves as the query match has keyword matches; since a node of degree d in a tree implies at least d leaves, no node in a CJN may have a degree larger than the size of the query match, and JNKMs violating this bound can be pruned (see the sketch after the next subsection).
Maximum Number of Keyword-Free Matches
The size of a CJN is determined by the size of the query match and the number of keyword-free matches; that is, the size of a candidate joining network equals the number of keyword matches in its query match plus the number of keyword-free matches it contains. Consequently, the maximum CJN size also imposes a maximum number of keyword-free matches.
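Both the degree bound and the size bound reduce to simple checks on a (partial) network, as in this sketch; the counters passed in are assumptions for illustration.

def within_pruning_limits(num_keyword_free, max_degree_seen, qm_size, max_cjn_size):
    """Check a JNKM against the keyword-free and node-degree pruning rules."""
    # CJN size = |QM| + number of keyword-free matches, so a maximum
    # CJN size implies a maximum number of keyword-free matches.
    if qm_size + num_keyword_free > max_cjn_size:
        return False
    # Every leaf is a keyword match and, in a tree, a node of degree d
    # forces at least d leaves; hence no node may exceed degree |QM|.
    if max_degree_seen > qm_size:
        return False
    return True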
The number of generated CJNs can be further reduced by pruning and ranking them. In Section VII-A, we present a ranking of the candidate joining networks returned by CJNGen. In Section VII-B, we present pruning techniques for the generation of the candidate joining networks in CJNGen and CJNInter.