
Supporting Schema References in Keyword Queries Over Relational Databases



Abstract:

Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without knowing schema details or query languages. They take a keyword query, locate the corresponding elements in the target database, and connect them using information on PK/FK constraints. Although there are many such systems in the literature, most of them only support queries with keywords referring to the contents of the database, and just a very few support queries with keywords referring to the database schema. We propose Lathe, a novel R-KwS system that supports such queries. To this end, we first generalize the well-known concepts of Candidate Joining Networks (CJNs) and Query Matches (QMs) to handle keywords referring to schema elements and propose new algorithms to generate them. Then, we introduce two major innovations: a ranking algorithm for selecting better QMs, yielding the generation of fewer but better CJNs, and an eager evaluation strategy for pruning useless void CJNs. We present experiments performed with query sets and datasets previously used to evaluate state-of-the-art R-KwS systems. Our results indicate that Lathe can handle a wider variety of queries while remaining highly effective, even for databases with intricate schemas.
Published in: IEEE Access ( Volume: 11)
Page(s): 92365 - 92390
Date of Publication: 25 August 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Keyword search over relational databases enables naive/informal users to retrieve information from relational databases (DBs) without any knowledge about schema details or query languages. The success of search engines shows that untrained users are at ease using keyword search to find information of interest.

However, this can be challenging because the information sought frequently spans multiple relations and attributes, depending on the schema design of the underlying DB. Therefore, Relational Keyword Search (R-KwS) systems must automatically determine which pieces of information to retrieve from the database and how to connect them to provide a relevant answer to the user.

In general, keywords may refer to both database values in tuples and schema elements, such as relation and attribute names. For instance, consider the query “will smith films” over a database on movies. The keywords “will” and “smith” may refer to values of person names. The keyword “films”, on the other hand, is more likely to refer to the name of a relation about movies. Although a significant number of query keywords correspond to schema references [1], the majority of previous work on R-KwS does not support references to the schema.

Handling keywords that refer to schema elements makes R-KwS significantly more challenging than the usual setting, where only keywords referring to attribute values are considered. Firstly, it increases the complexity of the search process by requiring an understanding of the underlying database schema and its structure. Secondly, keywords referring to the schema introduce semantic ambiguity, making it difficult to disambiguate between schema references and attribute values. This ambiguity further complicates the search process and can lead to incorrect or incomplete results. Furthermore, integrating schema knowledge into the search process becomes crucial when handling schema references. Understanding PK/FK relationships and connecting relevant information adds an extra layer of complexity to the problem. Finally, ranking and relevance determination become more challenging when schema elements are involved. Existing systems may prioritize attribute values even if they do not provide useful answers. Accurately assessing relevance requires considering both attribute values and schema references. These challenges require dedicated techniques and algorithms specifically designed to handle schema references effectively.

In this work, we study new techniques for supporting schema references in keyword queries over relational databases. Specifically, we propose Lathe, a new R-KwS system to generate a suitable SQL query from a keyword query, considering that keywords refer either to instance values or to schema elements. Lathe follows the Schema Graph approach for R-KwS systems [2], [3]. Given a keyword query, this approach consists of generating relational algebra expressions called Candidate Joining Networks (CJNs), which are likely to express the user intent when formulating the original query. The generated CJNs are evaluated, that is, they are translated into SQL queries and executed by a DBMS, resulting in several Joining Networks of Tuples (JNTs), which are collected and supplied to the user.

In the literature, the best-known algorithm for CJN Generation is CNGen, which was first presented in the system DISCOVER [4] but was adopted by most R-KwS systems [5], [6], [7], [8]. Despite the possibly large number of CJNs, most works in the literature focused on improving CJN Evaluation and the ranking of JNTs instead. Specifically, DISCOVER-II [6], SPARK [7], and CD [8] used information retrieval (IR) style score functions to rank the top-K JNTs. KwS-F [9] imposed a time limit on CJN evaluation, returning potentially partial results as well as a summary of the CJNs that had yet to be evaluated. Later, CNRank [10] introduced a CJN ranking, requiring only the top-ranked CJNs to be evaluated. MatCNGen [2], [11] proposed a novel method for generating CJNs that efficiently enumerates the possible matches for the query in the DB. These Query Matches (QMs) are then used to guide the CJN generation process, greatly decreasing the number of generated CJNs and improving the performance of CJN evaluation.

Among the methods based on the Schema Graph approach, Lathe is, to the best of our knowledge, the first to address the problem of generating and ranking CJNs considering queries with keywords that can refer to either schema elements or attribute values. We revisited and generalized concepts introduced in previous approaches [2], [4], [10], [11], such as tuple-sets, QMs, and the CJNs themselves, to enable schema references. In addition, we proposed a more effective approach to CJN Generation that includes two major innovations: QM ranking and Eager CJN Evaluation. Roughly, Lathe matches keywords to the values of attributes or to schema elements. Next, the system combines the keyword matches into QMs that cover all the keywords from the query. The QMs are ranked, and only the most relevant ones are used to generate CJNs. The CJN generation explores the primary key/foreign key relationships to connect all the elements of the QMs. In addition, Lathe employs an eager CJN evaluation strategy, which ensures that all generated CJNs yield non-empty results when evaluated. The CJNs are then ranked and evaluated. Finally, the CJN evaluation results are delivered to the user. Unlike previous methods, Lathe provides the user with the most relevant answer without relying on JNT rankings. This is due to the effective rankings of QMs and CJNs that we propose, which are absent in the majority of previous work.

We performed several experiments to assess the effectiveness and efficiency of Lathe. First, we compared the results with those obtained with several previous R-KwS systems, including the state-of-the-art QUEST [12] system using a benchmark proposed by Coffman & Weaver [3]. Second, we assessed the quality of our ranking of QMs. The ranking of CJNs was then evaluated by comparing different configurations in terms of the number of QMs, the number of CJNs generated per QM, and the use of the eager evaluation strategy. Finally, we assessed the performance of each phase of Lathe, as well as the trade-off between quality and performance of various system configurations. Lathe achieved better results than all of the R-KwS systems tested in our experiments. Also, our results indicate that the ranking of QMs and the eager CJN evaluation greatly improved the quality of the CJN generation.

Our key contributions are: (i) a novel method for generating and ranking CJNs with support for keywords referring to schema elements; (ii) a novel algorithm for ranking QMs, which avoids the processing of less likely answers to a keyword query; (iii) an eager CJN evaluation for discarding spurious CJNs; (iv) a simple and yet effective ranking of CJNs which exploits the ranking of QMs.

The remainder of this paper is organized as follows: Section II reviews the related literature on relational keyword search systems based on schema graphs and on support for schema references. Section III formally states the problem we address. Section IV summarizes the phases of our method, which are discussed in detail in Sections V–VII. Section VIII summarizes the findings of the experiments we conducted. Finally, Section IX concludes the paper and outlines our plans for future work.

SECTION II.

Background and Related Work

In this section, we discuss the background and related work on keyword search systems over relational databases and on supporting schema references in such systems. For a more comprehensive view of the state-of-the-art in keyword-based and natural language queries over databases, we refer the interested reader to a recent survey [13].

A. Relational Keyword Search Systems

Current R-KwS systems fall into one of two distinct categories: systems based on Schema Graphs and systems based on Instance Graphs. Systems in the first category are based on the concept of Candidate Joining Networks (CJNs), which are networks of joined relations that are used to generate SQL queries and whose evaluation returns several Joining Networks of Tuples (JNTs), which are collected and supplied to the user. This method was proposed in DISCOVER [4] and DBXplorer [5], and it was later adopted by several other systems, including DISCOVER-II [6], SPARK [7], CD [8], KwS-F [9], CNRank [10], and MatCNGen [2], [11]. Systems in this category make use of the underlying basic functionality of the RDBMS by generating appropriate SQL queries to retrieve answers to the keyword queries posed by users.

Systems in the second category are based on a structure called an Instance Graph, whose nodes represent tuples associated with the keywords they contain and whose edges connect these tuples based on referential integrity constraints. BANKS [14], BANKS-II [15], BLINKS [16], and Effective [17] use this approach to compute keyword query results by finding subtrees in a data graph that minimize the distance between nodes matching the given keywords. These systems typically generate the query answer in a single phase that combines the tuple retrieval task and the answer schema extraction. However, the Instance Graph approach requires a materialization of the DB and incurs a higher computational cost to deliver answers to the user. Furthermore, the important structural information provided by the database schema is ignored once the data graph has been built.

B. R-KwS Systems Based on Schema Graphs

In our research, we focus on systems based on Schema Graphs, since we assume that the data we want to query are stored in a relational database and we want to use an RDBMS capable of processing SQL queries. Also, our work builds on the concepts and terminology introduced in DISCOVER [4], [6] and extended in CNRank [10] and MatCNGen [2], [11]. This formal framework is used and expanded to handle keyword queries that may refer to attribute values or to database schema elements. As a result, we can inherit and maintain all guarantees regarding the generation of minimal, complete, sound, and meaningful CJNs.

The best-known algorithm for CJN Generation is CNGen, which was introduced in DISCOVER [4] but was later adopted as a default in most of the R-KwS systems proposed in the literature [5], [6], [7], [8]. To generate a complete, non-redundant set of CJNs, this algorithm employs a Breadth-First Search approach [18]. As a result, CNGen frequently generates a large number of CJNs, resulting in a costly CJN generation and evaluation process.

Initially, most of the subsequent work focused on CJN evaluation only. Specifically, since CNGen generated many CJNs to be evaluated, producing a large number of JNTs, systems such as DISCOVER-II [6], SPARK [7], and CD [8] introduced algorithms for ranking JNTs using IR-style score functions.

KwS-F [9] addressed the efficiency and scalability problems in CJN evaluation in a different way. Their approach consists of two steps. First, a limit is imposed on the time the system spends evaluating CJNs. After this limit is reached, the system must return the (possibly partial) top-K JNTs. Second, if there are any CJNs that have yet to be evaluated, they are presented to the user in the form of query forms, from which the user can choose one and the system will evaluate the corresponding CJN.

CNRank [10] proposed a method for lowering the cost of CJN evaluation by ranking them based on the likelihood that they will provide relevant answers to the user. Specifically, CNRank presented a probabilistic ranking model that uses a Bayesian Belief Network [19] to estimate the relevance of a CJN given the current state of the underlying database. A score is assigned to each generated CJN, so that only a few CJNs with the highest scores need to be evaluated.

MatCNGen [2], [11] introduced a match-based approach for generating CJNs. To generate query answers, the system enumerates beforehand the possible ways in which the query keywords can be matched in the DB. MatCNGen then generates a single CJN for each of these QMs, drastically reducing the time required to generate CJNs. Furthermore, because the system assumes that answers must contain all of the query keywords, each keyword must appear in at least one element of a CJN. As the generation process avoids generating too many keyword occurrence combinations, a smaller but better set of CJNs is generated.

Lastly, Coffman & Weaver [3] proposed a framework for evaluating R-KwS systems and reported experimental results over three representative standardized datasets they built, namely MONDIAL, IMDb, and Wikipedia, along with their respective query workloads. The authors compare nine R-KwS systems, assessing their effectiveness and performance in a variety of ways. The resources of this framework were also used in the experiments of several other studies on R-KwS systems [2], [7], [10], [11], [20].

C. Support to Schema References in R-KwS

Overall, there are few systems in the literature that support schema references in keyword queries. One of the first such systems was BANKS [21], an R-KwS system based on Instance Graphs. In BANKS, however, query evaluation with keywords matching metadata can be relatively slow, since a large number of tuples may be deemed relevant to the keyword.

Support for schema references in keyword queries was extensively addressed by Bergamaschi et al. in Keymantic [1], KEYRY [22], and QUEST [12]. All these systems can be classified as schema-based, since they aim at generating a suitable SQL query given an input keyword query. They do not, however, rely on the concept of CJNs, as Lathe and all DISCOVER-based systems do. Keymantic [1] and KEYRY [22] consider a scenario in which data instances are not accessible, such as databases on the hidden web and sources hidden behind wrappers in data integration settings, where typically only metadata is made available. Both systems rely on similarity techniques based on structural and lexical knowledge that can be extracted from the available metadata, e.g., names of attributes and tables, attribute domains, and regular expressions, or from other external sources, such as ontologies, vocabularies, and domain terminologies. The two systems mainly differ in the way they rank the possible interpretations they generate for an input query. While Keymantic relies on an extension the authors proposed for the Hungarian algorithm, KEYRY is based on the Hidden Markov Model, a probabilistic sequence model, adapted for keyword query modeling. QUEST [12] can be thought of as an extension of KEYRY, because it uses a similar strategy to rank the mappings from keywords to database elements. In contrast to KEYRY, however, QUEST considers the database instance to be accessible and includes features derived from it for ranking interpretations.

Of these systems, QUEST is the one most similar to Lathe. However, it is difficult to draw a direct comparison between the two systems, as QUEST does not rely on the formal framework from CJN-related previous work [2], [4], [6], [10], [11] and also resolves a smaller set of keyword queries than Lathe. QUEST, in particular, does not support keyword queries whose resolution necessitates SQL queries with self-joins. As a result, when comparing QUEST to other approaches, the authors limited the experimentation to 35 queries rather than the 50 included in the original benchmark [3], [12]. Lathe, on the other hand, supports all 50 queries.

Finally, there are systems that propose going beyond the retrieval of tuples that fulfill a query expressed using keywords and try to provide functionality closer to structured query languages. This is the case of SQAK [23], which allows users to specify aggregation functions over schema elements. Such an approach was later expanded in systems such as SODA [24] and SQUIRREL [25], which aim to handle not only aggregation functions but also keywords that represent predicates, groupings, orderings, and so on. To support such features, these systems rely on a variety of resources that are not part of the database schema or instances. Among these are conceptual schemas, generic and domain-specific ontologies, lists of reserved keywords, and user-defined metadata patterns. We see such useful systems as being closer to natural language query systems [13]. In contrast, Lathe, like any typical R-KwS system, aims at retrieving sets of JNTs that fulfill the query, not at computing aggregated results over the tuples. In addition, it does not rely on any external resources.

SECTION III.

Problem Statement

Consider a database with $n$ relations $R_{1}, \ldots, R_{n}$ , where each relation $R_{i}$ has $m_{i}$ attributes $a^{i}_{1}, \ldots, a^{i}_{m_{i}}$ . Let a keyword query be a set of keywords $Q = \{k_{1}, k_{2}, \ldots, k_{q}\}$ . Answering a keyword query over the database means finding a set of relational algebra expressions that match the query, that is, that match each keyword to at least one database element, which can be the name of a relation, an attribute name, or a value of an attribute.

We represent these expressions with Candidate Joining Networks, where the nodes comprise selections or projections over relations and the edges represent join operations. That is, $C$ is a candidate joining network for $Q$ if, for each $k \in Q$ , there exists at least one node $u$ in $C$ such that one of the following is true:

  • $u = \sigma _{a \ni k}(R_{u})$

  • $u = \pi _{a}(R_{u})$ , where $k=a$

  • $u = \sigma (R_{u})$ , where $k=R_{u}$

The first condition indicates whether a keyword matches the value of an attribute, while the second and third verify whether the keyword matches an attribute name or a relation name, respectively.

For notational simplicity, we assume that the attributes of a primary to foreign key relationship have the same name, so we can freely join relations using natural joins. The generalization of the problem and the solution when these assumptions do not hold is trivial.

Also, for each edge $u{\rightarrow }v$ in $C$ , there exists a primary to foreign key relationship from $R_{u}$ to $R_{v}$ , so that we can join $u{\Join }v$ .

To ensure connectivity, $C$ may also have some nodes that are not associated with any keyword; they act as intermediate tables for the join operations. An intermediate node $u = R_{u}$ cannot be a leaf in $C$ , that is, its degree must be greater than 1.

SECTION IV.

Lathe Overview

In this section we present an overview of Lathe. We begin by presenting a simple example of the task carried out by our system. For this, we illustrate in Figure 1 a simplified excerpt from the well-known IMDb.

FIGURE 1. A simplified excerpt from IMDb.

Let $Q$ be the keyword query $Q =$ “will smith films”, where the user wants the system to list the movies in which Will Smith appears. Notice that, informally, the terms “will” and “smith” are likely to match the contents of a relation from the DB, while the term “films” is likely to match the name of a relation or attribute.

As with other methods previously proposed in the literature, such as CNGen [4] and MatCNGen [2], [11], the main goal of Lathe is, given a query such as $Q$ , generating a SQL query that, when executed, fulfills the information need of the user. The difference between Lathe and these previous methods is that they are not able to handle references to schema elements, such as “films” in $Q$ .

For query $Q$ , two of the possible SQL queries that would be generated are presented in Figures 2 (a) ($S_{1}$ ) and (b) ($S_{2}$ ), whose respective results for the database of Figure 1 are presented in Figures 2(c) and (d). In the query $S_{1}$ , the keywords “will” and “smith” match the value of a single tuple of relation PERSON, while the keyword “films” matches the name of the relation MOVIE. As a result, $S_{1}$ retrieves the movies which the person Will Smith was in, and thus, satisfies the original user intent. As for query $S_{2}$ , the keywords “will” and “smith” match values of two different tuples in relation PERSON, that is, they refer to two different persons. The keyword “films” matches the name of the relation MOVIE again. Therefore, $S_{2}$ retrieves movies in which two different persons, whose names respectively include the terms “will” and “smith”, participated in. In these case, the persons are Will Theakston and Maggie Smith.

FIGURE 2. SQL queries generated for the keyword query “will smith films” and their returned results.
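The content of Figure 2 is not reproduced here. The sketch below gives a plausible reconstruction of $S_{1}$ and $S_{2}$ as described above; the table and column names (person.id, person.name, casting.person_id, casting.movie_id) are assumptions based on Figure 1, not the paper's exact output.

```python
# Hypothetical reconstructions of the two SQL interpretations of
# Q = "will smith films"; schema and column names are assumed.

# S1: "will" and "smith" match a single PERSON tuple; "films" matches
# the MOVIE relation, reached through CASTING.
S1 = """
SELECT m.*
FROM person p
JOIN casting c ON c.person_id = p.id
JOIN movie m   ON m.id = c.movie_id
WHERE p.name LIKE '%will%' AND p.name LIKE '%smith%';
"""

# S2: "will" and "smith" match two different PERSON tuples connected to
# the same MOVIE through two copies of CASTING (hence the self-join).
S2 = """
SELECT m.*
FROM person p1
JOIN casting c1 ON c1.person_id = p1.id
JOIN movie m    ON m.id = c1.movie_id
JOIN casting c2 ON c2.movie_id = m.id
JOIN person p2  ON p2.id = c2.person_id
WHERE p1.name LIKE '%will%' AND p2.name LIKE '%smith%';
"""
```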

As this example indicates, there may be several plausible SQL queries related to a given keyword query. Therefore, it is necessary to decide which alternative is more likely to fulfill the user intent. This task is also carried out by Lathe.

Next, we present an overview of the components and the functioning of Lathe.

A. System Architecture

In this section, we present the overall architecture of Lathe. We base our discussion on Figure 3, which illustrates the main phases that comprise the operation of the method.

FIGURE 3. Main phases and architecture of Lathe.

The process begins with an input keyword query posed by the user. The system then attempts to associate each of the keywords from the query with a database schema element, such as a relation or an attribute. For this, the system relies on the DB schema, i.e., the names of relations and attributes, or on the DB instance, i.e., the values of the attributes. This phase, called Keyword Matching ①, generates sets of Value-Keyword Matches (VKMs), which associate keywords with sets of tuples whose attribute values contain these keywords, and Schema-Keyword Matches (SKMs), which associate keywords with names of relations or attributes deemed similar to these keywords.

In Table 1 we show possible matches between keywords in the input query and the database elements. For example, the keywords “will smith” are found together in the values of the attribute name of the PERSON relation. The keyword “will” is also found alone in the values of PERSON.name, which is the case of the person Will Theakston present in the instance shown in Figure 1. The term “smith” can refer to either the name of a person, the name of a character, or even the title of a movie, in this case “Mr. & Mrs. Smith”. Since these keywords are part of attribute values, these matches are considered VKMs. In the case of the keyword “films”, it actually matches the name of the Movie relation, which is why in Table 1 the keyword “films” matches MOVIE.self. Thus, this match is considered an SKM. The Keyword Matching phase is detailed in Section V.

TABLE 1. Keyword Matches for the Query “Will Smith Films”.

In the next phase, Query Matching ②, Lathe generates combinations of VKMs and SKMs. In these combinations, we consider that all keywords in the query must be matched; in other words, the combination must be total. Furthermore, we also consider that all pairs of keywords and attributes are “useful”; that is, removing any of the pairs would result in a non-total combination. All combinations that satisfy both criteria are called Query Matches (QMs). In Figure 4 we present all possible QMs for the KMs illustrated in Table 1.

FIGURE 4. Examples of combinations of keywords matched.

Although the Query Matching phase may generate a large number of QMs due to its combinatorial nature, only a few of them are useful in producing plausible answers to the user. As a result, we propose the first algorithm for Ranking Query Matches in the literature. This ranking assigns a score to QMs based on their likelihood of satisfying the needs of the user when formulating the keyword query. Thus, the system only outputs a few top-ranked QMs to the next phases. By doing so, it avoids having to process less likely QMs. We present the details on QMs, their generation, and ranking in Section VI.

Lastly, in the Candidate Joining Network Generation ③ phase, the system searches for interpretations of the keyword query. That is, the system tries to connect all the keyword matches from the QMs through CJNs, which are based on the schema graph. CJNs can be thought of as relational algebra joining expressions that can be directly translated into SQL queries.

For instance, both the QMs shown in Figure 4 (a) and (b) can be connected using the CASTING relation, resulting in CJNs whose SQL translation is presented in Figure 2 (a) and (b), respectively.

Also, the system performs a Candidate Joining Network Ranking, which takes advantage of the previous QM ranking but also favors CJNs that are more concise in terms of the number of relations they employ. Once we have identified the most likely CJNs, they can be evaluated, that is, translated into SQL queries executed by a DBMS, and the results delivered to the users. We notice that some of the generated CJNs may return empty results when they are evaluated. Thus, Lathe can alternatively evaluate CJNs before ranking them and prune such void CJNs. We call this process instance-based pruning.

During the whole process of generating CJNs, Lathe uses two data structures which are created in a Preprocessing stage ⓪: the Value Index and the Schema Index.

The Value Index is an inverted index that stores keyword occurrences in the database, indicating the relations, attributes, and tuples where a keyword appears. These occurrences are retrieved to generate VKMs. Furthermore, the Value Index is used to calculate term frequencies for the QMs and CJNs Rankings. The Schema Index is an inverted index that stores database schema information, as well as statistics about relations and attributes. While database schema information, such as PK/FK relationships, are used for the generation of CJNs, the statistics about attributes, such as norm and inverted frequency, are used for rankings of QMs and CJNs.
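As a rough illustration of these two structures, the sketch below shows the shapes the indexes could take in memory. The field names and all numeric statistics are assumptions for illustration, not the actual implementation.

```python
# A minimal sketch of the two preprocessed structures (assumed shapes).

# Value Index: term -> relation -> attribute -> set of tuple ids,
# following the I_V structure formalized in Section V.
value_index = {
    "will":  {"PERSON": {"name": {1, 2}}},
    "smith": {"PERSON": {"name": {1, 3}}},
}

# Schema Index: relation -> attributes with the statistics used by the
# rankings (norm, inverse attribute frequency), plus outgoing RICs.
# The "__fks__" key and the numeric values are illustrative only.
schema_index = {
    "PERSON":  {"__fks__": [],
                "name": {"norm": 12.7, "iaf": 1.3}},
    "CASTING": {"__fks__": [("person_id", "PERSON"),
                            ("movie_id", "MOVIE")]},
}
```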

In the following sections we present each of the phases of Figure 3, describing the steps, definitions, data structures, and algorithms we used.

SECTION V.

Keyword Matching

In this section, we present the details on keyword matches and their generation. Their role in our work is to associate each keyword from the query with some attribute or relation in the database schema. Initially, we classify them as either VKMs or SKMs, according to the type of association they represent. Later, we provide a generalization of the keyword matches and introduce the concept of Keyword-Free Matches, which will be used in the following phases of our method.

A. Value-Keyword Matching

Using value-keyword matches, we may associate keywords from the query with some attribute in the database schema based on the values of this attribute in the tuples that contain these keywords, according to Definition 1.

Definition 1:

Let $Q$ be a keyword query and $R$ be a relation state over the relation schema $R(A_{1},\ldots,A_{m})$ . A value-keyword match from $R$ over $Q$ is given by:\begin{equation*}R^{V}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}] = \{ t \mid t \in R \wedge \forall A_{i}: W(t[A_{i}]) \cap Q = K_{i}\}\end{equation*} where $K_{i}$ is the set of keywords from $Q$ that are associated to the attribute $A_{i}$ , $W(t[A_{i}])$ returns the set of words in $t$ for attribute $A_{i}$ , and $V$ denotes a match of keywords to the database values.

Notice that each tuple from the database can be a member of only one value-keyword match. Therefore, the VKMs of a given query are disjoint sets of tuples.

Throughout our discussion, for the sake of compactness in the notation, we often omit mappings of attributes to empty keyword sets in the representation of a VKM. For instance, we use the notation $R^{V}[A_{1}^{K_{1}}]$ to represent $R^{V}[A_{1}^{K_{1}}, A_{2}^{ \{\} }, \ldots, A_{n}^{ \{\} }]$ .

Example 1:

Consider the database instance of Figure 1. The following VKMs can be generated for the query “will smith films”:\begin{align*} PERSON^{V}[name^{ \{will,smith\} }] &= \{t_{1}\}\\ PERSON^{V}[name^{ \{will\} }] &= \{t_{2}\}\\ PERSON^{V}[name^{ \{smith\} }] &= \{t_{3}\}\end{align*}

VKMs play a similar role to the tuple-sets from the related literature [2], [4]. They are, however, more expressive because they specify which attribute is associated with each keyword. Previous R-KwS systems based on the DISCOVER system, on the other hand, are unable to create tuple-sets that span multiple attributes [4], [6], [10]. Example 2 shows a keyword query that includes more than one attribute.

Example 2:

Consider the query “lord rings 2001”, whose intent is to return which Lord of the Rings movie was released in 2001. We can represent it with the following value-keyword match:\begin{equation*} MOVIE^{V}[title^{ \{lord,rings\} }, year^{ \{2001\} }] = \{t_{17}\}\end{equation*}

The generation of VKMs uses a structure we call the Value Index. This index stores the occurrences of keywords in the database, indicating the relations and tuples in which a keyword appears and which attributes are mapped to it. Lathe creates the Value Index during a preprocessing phase that scans all target relations only once. This phase precedes query processing and is not expected to be repeated frequently. As a result, answers are generated for each query without further interaction with the DBMS. The Value Index has the following structure, which is illustrated in Example 3:\begin{equation*}I_{V}=\{term:\{relation:\{attribute:\{tuples\}\}\}\}\end{equation*}

Example 3:

The VKMs presented in Example 1 are based on the following keyword occurrences:\begin{align*} I_{V}[will]&=\{PERSON:\{name:\{t_{1},t_{2}\}\}\} \\ I_{V}[smith]&=\{PERSON:\{name:\{t_{1},t_{3}\}\}\} \\ I_{V}[smith][PERSON]&=\{name:\{t_{1},t_{3}\}\} \\ I_{V}[smith][PERSON][name]&=\{t_{1},t_{3}\}\end{align*}

In Lathe, the generation of VKMs is carried out by the VKMGen algorithm, presented in detail in Appendix B.
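To make the structure of $I_{V}$ concrete, the following sketch builds a value index in a single scan over in-memory relations. This is only an illustration of the index that VKMGen relies on; the input format (lists of tuple-id/attribute-dict pairs) is an assumption standing in for a scan of the DBMS tables.

```python
from collections import defaultdict

def build_value_index(relations):
    """Build I_V = {term: {relation: {attribute: {tuple ids}}}} in one scan.

    `relations` maps a relation name to a list of (tuple_id, {attr: value})
    pairs -- an assumed in-memory stand-in for the target relations.
    """
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for rel, tuples in relations.items():
        for tid, attrs in tuples:
            for attr, value in attrs.items():
                for term in str(value).lower().split():
                    index[term][rel][attr].add(tid)
    return index

# Toy data mirroring Example 1 (tuple ids follow Figure 1).
relations = {"PERSON": [(1, {"name": "Will Smith"}),
                        (2, {"name": "Will Theakston"}),
                        (3, {"name": "Maggie Smith"})]}
I_V = build_value_index(relations)
assert I_V["smith"]["PERSON"]["name"] == {1, 3}
```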

B. Schema-Keyword Matching

We may associate the keywords from the query to some attribute or relation in the database schema based on the name of the attribute or relation using Schema-Keyword Matches, according to Definition 2. Specifically, our method matches keywords to the names of relations and attributes using similarity metrics.

Definition 2:

Let $k \in Q$ be a keyword from the query and $R(A_{1},\ldots,A_{m})$ be a relation schema. A schema-keyword match from $R$ over $Q$ is given by:\begin{equation*}R^{S}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}]= \{ t \mid t \in R \wedge \forall k \in K_{i}: sim(A_{i}, k) \geq \varepsilon \}\end{equation*} where $1 \leq i \leq m$ , $K_{i}$ is the set of keywords from $Q$ that are associated with the schema element $A_{i}$ , $sim(A_{i}, k)$ gives the similarity between the name of a schema element $A_{i}$ and the keyword $k$ , which must be above a threshold $\varepsilon $ , and $S$ denotes a match of keywords to the database schema.

In this representation, we use the artificial attribute $self$ when we match a keyword to the name of a relation. Example 4 shows an instance of a schema-keyword match wherein the keyword “films” is matched to the relation $MOVIE$ .

Example 4:

The following schema-keyword matches are created for the query “will smith films”, considering a threshold $\varepsilon =0.6$ :\begin{align*} MOVIE^{S}[self^{ \{films\} }] &= \{t_{14},t_{15},t_{16},t_{17},t_{18},t_{19}\}\\ MOVIE^{S}[title^{ \{will\} }] &= \{t_{14},t_{15},t_{16},t_{17},t_{18},t_{19}\}\\ PERSON^{S}[name^{ \{smith\} }] &= \{t_{1},t_{2},t_{3},t_{4},t_{5}\}\end{align*} where $sim(a,b)$ gives the similarity between the schema element $a$ and the keyword $b$ : $sim(movie, films)=1.00$ , $sim(title, will)=0.87$ , and $sim(name, smith)=0.63$ .

Despite their similarity to VKMs, the schema-keyword matches serve a different purpose in our method, ensuring that the attributes of a relation appear in the query results. As a result, they do not “filter” any of the tuples from the database, implying that they do not represent any selection operation over database relations.

1) Similarity Metrics

For the matching of keywords to schema elements, we used two similarity metrics based on the lexical database WordNet: the Path similarity [26], [27] and the Wu-Palmer similarity [27], [28]. We introduce the WordNet database and the two similarity metrics below.

2) WordNet Database

WordNet [26] is a large lexical database that resembles a thesaurus, as it groups words based on their meanings. One use of WordNet is to measure similarity between words based on the relatedness of their senses, the many different meanings that words can have [29]. For example, the word “film” can refer to a movie, as well as to the act of recording or to plastic film. Each of these senses has a different relation to the sense of a “show”. WordNet represents sense relationships, such as synonymy, hyponymy, and hypernymy, to measure similarity between words. Synonyms are two word senses that share the same meaning. In addition, we say that a sense $c_{1}$ is a hyponym of a sense $c_{2}$ if $c_{1}$ is more specific, denoting a subclass of $c_{2}$ . For instance, “protagonist” is a hyponym of “character”; “actor” is a hyponym of “person”; and “movie” is a hyponym of “show”. Hypernymy is the opposite of the hyponymy relation; thus, $c_{2}$ is a hypernym of $c_{1}$ .

3) Path Similarity

The Path similarity [26], [27] exploits the structure and content of the WordNet database. The relatedness score is inversely proportional to the number of nodes along the shortest path between the senses of two words. If the two senses are synonyms, the path between them has length 1. The relatedness score is calculated as follows:\begin{align*}sim_{path}(w_{1},w_{2})=\max _{\substack {c_{1} \in senses(w_{1})\\ c_{2} \in senses(w_{2})}}\left [{\frac {1}{|shortest\_path(c_{1},c_{2})|}}\right]\end{align*}

4) Wu-Palmer Similarity

The Wu-Palmer measure (WUP) [27], [28] calculates relatedness by considering the depths of the two synsets $c_{1}$ and $c_{2}$ in the WordNet taxonomies, along with the depth of their Least Common Subsumer (LCS), the most specific synset $c_{3}$ that is an ancestor of both $c_{1}$ and $c_{2}$ . Because the depth of the LCS is never zero (the depth of the root of a taxonomy is one), the score can never be zero. Also, the score is 1 if the two input synsets are the same. The WUP similarity for two words $w_{1}$ and $w_{2}$ is given by:\begin{align*}sim_{wup}(w_{1},w_{2})=\max _{\substack {c_{1} \in senses(w_{1})\\ c_{2} \in senses(w_{2})}}\left [{2\times \frac {depth(lcs(c_{1},c_{2}))}{depth(c_{1})+depth(c_{2})}}\right]\end{align*}
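Both metrics are available through NLTK's WordNet interface, so a keyword-to-schema-element similarity like the $sim(a,b)$ of Example 4 can be sketched as below. This assumes `nltk` with the WordNet corpus downloaded; the max-over-senses strategy follows the formulas above, though the paper's exact sense handling may differ.

```python
# Sketch of sim(a, b) over WordNet senses (requires nltk.download("wordnet")).
from itertools import product
from nltk.corpus import wordnet as wn

def sim(word_a, word_b, metric="path"):
    """Max Path or Wu-Palmer similarity over all sense pairs of two words."""
    best = 0.0
    for c1, c2 in product(wn.synsets(word_a), wn.synsets(word_b)):
        s = (c1.path_similarity(c2) if metric == "path"
             else c1.wup_similarity(c2))
        if s is not None:          # similarity is undefined across some POS
            best = max(best, s)
    return best

print(sim("movie", "films"))        # high: "films" lemmatizes to "movie"'s domain
print(sim("name", "smith", "wup"))  # lower, but may still pass the threshold
```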

As in the case of VKMs, we detail the SKMGen algorithm used in Lathe in Appendix C.

C. Generalization of Keyword Matches

Initially, we presented Definitions 1 and 2 which, respectively, introduce VKMs and SKMs. We chose to explain the specificity of these concepts separately for didactic purposes. They are, however, both components of a broader concept, Keyword Match (KM), which we define in Definition 3. In the following phases, this generalization will be useful when merging VKMs and SKMs.

Definition 3:

Let $Q$ be a keyword query and $R$ be a relation state over the relation schema $R(A_{1},\ldots,A_{m})$ . Let $V\!K\!M=R^{V}[A_{1}^{K_{1}^{V}}, \ldots, A_{m}^{K_{m}^{V}}]$ be a value-keyword match from $R$ over $Q$ , and let $S\!K\!M=R^{S}[A_{1}^{K_{1}^{S}}, \ldots, A_{m}^{K_{m}^{S}}]$ be a schema-keyword match from $R$ over $Q$ . A general keyword match from $R$ over $Q$ is given by:\begin{equation*} R^{S}[A_{1}^{K_{1}^{S}}, \ldots, A_{m}^{K_{m}^{S}}]^{V}[A_{1}^{K_{1}^{V}}, \ldots, A_{m}^{K_{m}^{V}}] = V\!K\!M \cap S\!K\!M\end{equation*}

The representations of VKMs and SKMs in the general notation are given as follows:\begin{align*}R^{S}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}] &= R^{S}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}]^{V}[A_{1}^{ \{\} }, \ldots, A_{m}^{\{\}}] \\ R^{V}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}] &= R^{S}[A_{1}^{ \{\} }, \ldots, A_{m}^{\{\}}]^{V}[A_{1}^{K_{1}}, \ldots, A_{m}^{K_{m}}]\end{align*}

Another concept required for the generation of QMs and CJNs is keyword-free matches, which we describe in Definition 4. They are KMs that are not associated with any keyword but are used as auxiliary structures, such as intermediate nodes in CJNs.

Definition 4:

We say that a keyword match $K\!M$ given by:\begin{equation*}K\!M =R^{S}[A_{1}^{K_{1}^{S}}, \ldots, A_{m}^{K_{m}^{S}}]^{V}[A_{1}^{K_{1}^{V}}, \ldots, A_{m}^{K_{m}^{V}}]\end{equation*} is a keyword-free match if, and only if, $K_{i}^{S} = \{\} \wedge K_{i}^{V} = \{\}$ for all $1 \leq i \leq m$ .

For the sake of simplifying the notation, we will represent a keyword-free match as $R^{S}[ ]^{V}[ ]$ or simply by $R$ .

SECTION VI.

Query Matching

In this section, we describe the processes of generating and ranking QMs, which are combinations of the keyword matches generated in the previous phases that comprise every keyword from the keyword query.

A. Query Matches Generation

We combine the associations present in the KMs to form total and non-redundant answers for the user. In other words, Lathe looks for KM combinations that satisfy two conditions: (i) every keyword from the query must appear in at least one of the KMs, and (ii) if any KM is removed from the combination, the combination no longer meets the first condition. These combinations, called Query Matches (QMs), are described in Definition 5.

Definition 5:

Let $Q$ be a keyword query. Let $M=\{K\!M_{1}, \ldots, K\!M_{n}\}$ be a set of keyword matches for $Q$ in a certain database instance $I$ , where:\begin{equation*} K\!M_{i}=R_{i}^{S}[A_{i,1}^{K^{S}_{i,1}},\ldots,A_{i,m_{i}}^{K^{S}_{i,m_{i}}}]^{V}[A_{i,1}^{K^{V}_{i,1}},\ldots,A_{i,m_{i}}^{K^{V}_{i,m_{i}}}]\end{equation*} Also, let $C_{K\!M_{i}} = \bigcup_{1 \leq j \leq m_{i},\, X \in \{S,V\}} K_{i,j}^{X}$ and $C_{M} = \bigcup_{1 \leq i \leq n} C_{K\!M_{i}}$ be the sets of all keywords associated with $K\!M_{i}$ and with $M$ , respectively. We say that $M$ is a query match for $Q$ if, and only if, $C_{M}$ forms a minimal set cover of the keywords in $Q$ . That is, $C_{M}= Q$ and $C_{M}{\setminus }C_{K\!M_{i}} \neq Q$ , $\forall K\!M_{i} \in M$ .

Notice that a QM cannot contain any keyword-free match, as it would not be minimal anymore. Example 5 presents combinations of KMs which are or are not QMs.

Example 5:

Considering the KMs from Examples 1 and 4, only some of the following sets are considered QMs for the query “will smith films”:\begin{align*} M_{1} &= \{PERSON^{V}[name^{ \{will,smith\} }],MOVIE^{S}[self^{ \{films\} }]\}\\ M_{2} &= \{PERSON^{V}[name^{ \{will\} }],PERSON^{V}[name^{ \{smith\} }],\\ &\qquad MOVIE^{S}[self^{ \{films\} }]\}\\ M_{3} &= \{PERSON^{V}[name^{ \{will\} }], PERSON^{V}[name^{ \{smith\} }]\}\\ M_{4} &= \{PERSON^{V}[name^{ \{will,smith\} }],MOVIE^{S}[self^{ \{films\} }],\\ &\qquad CHARACTER\}\\ M_{5} &= \{PERSON^{V}[name^{ \{will,smith\} }], MOVIE^{S}[self^{ \{films\} }],\\ &\qquad PERSON^{V}[name^{ \{smith\} }]\}\end{align*}

The sets $M_{1}$ and $M_{2}$ are considered QMs. In contrast, the sets of keyword matches $M_{3}$ , $M_{4}$ , and $M_{5}$ are not QMs. While $M_{3}$ does not include all query keywords, $M_{4}$ and $M_{5}$ are not minimal, that is, they have unnecessary KMs.

We present the QMGen algorithm for generating QMs in Appendix D.
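For illustration, a brute-force version of QM generation can be written directly from Definition 5: enumerate subsets of KMs and keep those whose keyword sets form a minimal cover of the query. In this sketch, KMs are reduced to name/keyword-set pairs; it checks all subsets and is therefore exponential, unlike the pruned QMGen algorithm of Appendix D.

```python
from itertools import combinations

def query_matches(kms, query):
    """Enumerate query matches: minimal covers of `query` by keyword matches."""
    query = set(query)
    for r in range(1, len(kms) + 1):
        for combo in combinations(kms, r):
            covered = set().union(*(kws for _, kws in combo))
            if covered != query:
                continue
            # Minimality: removing any single KM must break the cover.
            if all(set().union(*(k for n, k in combo if n != name)) != query
                   for name, _ in combo):
                yield [name for name, _ in combo]

# KMs from Examples 1 and 4 (names are just labels for this sketch).
kms = [("PERSON.name{will,smith}", {"will", "smith"}),
       ("PERSON.name{will}", {"will"}),
       ("PERSON.name{smith}", {"smith"}),
       ("MOVIE.self{films}", {"films"})]
for qm in query_matches(kms, {"will", "smith", "films"}):
    print(qm)   # prints M_1 and M_2 from Example 5
```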

B. Query Matches Ranking

As described in Section IV, Lathe performs a ranking of the QMs generated in the previous step. This ranking is necessary because many QMs are frequently generated, yet only a few of them are useful to produce plausible answers to the user.

Lathe estimates the relevance of QMs based on a Bayesian Belief Network model for the current state of the underlying database. In practice, this model assesses two types of relevance when ranking query matches. The value-based score is calculated using the TF-IDF model, which adapts the traditional Vector space model to the context of relational databases, as done in LABRADOR [30] and CNRank [10]. The schema-based score, on the other hand, is calculated by estimating the similarity between keywords and the names of schema elements.

In Lathe, only the top-k QMs in the ranking are considered in the succeeding phases. By doing so, we avoid generating CJNs that are less likely to properly interpret the keyword query.

Bayesian Belief Network:

We adopt the Bayesian framework proposed by [31] and [19] for modeling distinct IR problems. This framework is simple and allows for the incorporation of features from distinct models into the same representational scheme. Other keyword search systems, such as LABRADOR [30] and CNRank [10], have also used it.

In our model, we interpret the QMs as documents, which are ranked for the keyword query. Figure 5 illustrates an example of the adopted Bayesian Network. The nodes that represent the keyword query are located at the top of the network, on the Query Side. The Database Side, located at the bottom of the network, contains the nodes that represent the QM to be scored. The center of the network, shared by both sides, is made up of two sets of keywords: the set $V$ of all terms present in the values of the database and the set $S$ of all schema element names.

FIGURE 5. Bayesian network corresponding to the query “will smith films”.

In our Bayesian Network, we rank QMs based on their similarities with the keyword query. This similarity is interpreted as the probability of observing a query match $QM$ given the keyword query $Q$ , that is, $P(QM|Q)= \mu P(QM \wedge Q)$ , where $\mu = 1/P(Q)$ is a normalizing constant, as used in [32].

Initially, we define a random binary variable associated with each keyword from the sets $V$ and $S$ , which indicates whether the keyword was observed in the keyword query. As these random variables are the root nodes of our Bayesian Network, all of the probabilities of the other nodes are dependent on them. Therefore, if we consider $v \subseteq V$ and $s \subseteq S$ as the sets of keywords observed, we can derive the probability of any non-root node $x$ as follows: $P(x)=P(x|v,s)\times P(v)\times P(s)$ .

As all the possibilities of $v$ and $s$ are equally likely a priori, we can calculate them as $P(v)=(1/2)^{|V|}$ and $P(s)=(1/2)^{|S|}$ , respectively.

The instantiation of the root nodes of the network separates the query match nodes from the query nodes, making them mutually independent. Therefore:\begin{equation*}P(QM \wedge Q) = P(Q|v,s)P(QM|v,s)P(v)P(s)\end{equation*}

The probability of the keyword query $Q=\{q_{1},\ldots,q_{|Q|}\}$ is split between the probability of each of its keywords:\begin{equation*}P(Q|v,s) = \prod _{1 \leq i \leq |Q|} P(q_{i}|v,s)\end{equation*} A keyword $q_{i}$ from the query is observed, given the sets $s$ and $v$ , either if $q_{i}$ occurs in the values of the database or if $q_{i}$ has a similarity above a threshold $\varepsilon $ with a schema element:\begin{equation*}P(q_{i}|v,s) = (q_{i} \in v) \veebar (\exists k \in s: sim(q_{i},k) \geq \varepsilon)\end{equation*}

Similarly, in our network, the probability of a query match $QM$ is split between the probability of each of its KMs:\begin{equation*}P(QM|v,s)= \prod _{1 \leq i \leq |QM|} P(K\!M_{i}|v,s)\end{equation*}

We compute the probability of KMs using two different metrics: a schema score based on the same similarities used in the generation of SKMs, and a value score based on a Vector model [33], [34] using the cosine similarity:\begin{align*} P(K\!M_{i}|v,s) &= \prod _{ \substack { 1 \leq j \leq m_{i} \\ K^{V}_{i,j} \neq \emptyset } } cos\left({\overrightarrow {A_{i,j}}, \overrightarrow {v \cap K_{i,j}^{V}}}\right) \\ &\quad \times \prod _{ \substack { 1 \leq j \leq m_{i} \\ K^{S}_{i,j} \neq \emptyset } } \frac { \sum _{t \in s \cap K_{i,j}^{S}} sim(A_{i,j},t) }{|s \cap K_{i,j}^{S}|}\end{align*} where $K\!M_{i}=R_{i}^{S}[A_{i,1}^{K^{S}_{i,1}},\ldots,A_{i,m_{i}}^{K^{S}_{i,m_{i}}}]^{V}[A_{i,1}^{K^{V}_{i,1}},\ldots,A_{i,m_{i}}^{K^{V}_{i,m_{i}}}]$ .

It is important to distinguish the documents from the Bayesian Network model and the Vector Model. The documents of the Bayesian Network are QMs, and the query is the keyword query itself, whereas the documents of the Vector model are database attributes, and the query is the set of keywords associated with the KM.

Once we know the document and the query of the Vector model, we can calculate the cosine similarity by taking the inner product of the document and the query. The cosine similarity formula is given as follows:\begin{align*} cos\left({\overrightarrow {A_{i,j}}, \overrightarrow {v \cap K_{i,j}^{V}}}\right) &= \left({\overrightarrow {A_{i,j}} \boldsymbol {\cdot } \overrightarrow {v \cap K_{i,j}^{V}}}\right)/\left({| \overrightarrow {A_{i,j}}|\times | \overrightarrow {v \cap K_{i,j}^{V}}|}\right) \\ &= \alpha \times \frac {\displaystyle \sum _{t \in V} w\left({\overrightarrow {A_{i,j}},t}\right) \times w\left({\overrightarrow {v \cap K_{i,j}^{V}},t}\right)} {\displaystyle \sqrt {\sum _{t \in V} w\left({\overrightarrow {A_{i,j}},t}\right)^{2}}}\end{align*} where $\alpha =1 / \left({\sum _{t \in V}w\left({\overrightarrow {v \cap K_{i,j}^{V}},t}\right)^{2}}\right)^{ {1}/{2}}$ is a constant that represents the norm of the query, which is not necessary for the ranking.

The weights for each term are calculated using the TF-IDF measure, which is based on the term frequency and on the specificity of the term in the collection. We use the raw frequency and the inverse frequency, which are the most recommended forms of TF-IDF weights [33]:\begin{equation*} w\left({\overrightarrow {X},t}\right) = freq_{X,t} \times \log {\frac {N_{A}}{n_{t}}}\end{equation*} where $\overrightarrow {X}\in \left\{{ \overrightarrow {A_{i,j}}, \overrightarrow {v \cap K_{i,j}^{V}}}\right\}$ can be either the document or the query, $N_{A}$ is the number of attributes in the database, and $n_{t}$ is the number of attributes mapped to occurrences of the term $t$ . When $\overrightarrow {X}$ is the query, $freq_{X,t}$ gives the number of occurrences of the term $t$ in the keyword query, which is generally 1. When $\overrightarrow {X}$ is an attribute (document), $freq_{X,t}$ gives the number of occurrences of the term $t$ in the attribute, which is obtained from the Value Index.
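Putting the last two formulas together, the value-based score of a KM against one attribute can be sketched as a TF-IDF cosine, with the query norm $\alpha$ dropped since it does not affect the ranking. The input dictionaries stand in for statistics that Lathe would read from the Value and Schema Indexes; all numbers below are illustrative only.

```python
import math

def tfidf(freq, n_attrs_total, n_attrs_with_term):
    # raw frequency x inverse attribute frequency, as in w(X, t) above
    return freq * math.log(n_attrs_total / n_attrs_with_term)

def value_score(attr_term_freqs, km_keywords, n_attrs_total, df):
    """Cosine between an attribute 'document' and the KM's keyword 'query'.

    attr_term_freqs: {term: frequency of the term in the attribute's values}
    df:              {term: number of attributes containing the term}
    The query norm (alpha) is omitted, as it does not change the ranking.
    """
    dot = sum(tfidf(attr_term_freqs.get(t, 0), n_attrs_total, df[t])
              * tfidf(1, n_attrs_total, df[t])      # query frequency ~ 1
              for t in km_keywords)
    doc_norm = math.sqrt(sum(tfidf(f, n_attrs_total, df[t]) ** 2
                             for t, f in attr_term_freqs.items()))
    return dot / doc_norm if doc_norm else 0.0

# Illustrative: PERSON.name scored against the keywords {"will", "smith"}.
print(value_score({"will": 2, "smith": 2, "maggie": 1},
                  {"will", "smith"}, n_attrs_total=10,
                  df={"will": 2, "smith": 3, "maggie": 1}))
```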

We present the algorithm for ranking QMs in Appendix E.

SECTION VII.

Candidate Joining Networks

In this section we present the details on our method for generating and ranking Candidate Joining Networks (CJNs), which represent different interpretations of the keyword query. We recall that our definition of CJNs expands on the definition presented in [4] to support keywords referring to schema elements.

The generation of CJNs uses a structure we call a Schema Graph. In this graph, there is a node representing each relation in the database and the edges correspond to the referential integrity constraints (RIC) in the database schema. In practice, this graph is built in a preprocessing phase based on information gathered from the database schema.

Definition 6:

Let $\mathcal {R} = \{R_{1},\ldots,R_{n}\}$ be a set of relation schemas from the database. Let $E$ be a subset of the ordered pairs from $\mathcal {R}^{2}$ given by:\begin{equation*}E=\{ \langle R_{a},R_{b} \rangle \mid \langle R_{a},R_{b} \rangle \in \mathcal {R}^{2} \wedge R_{a} \neq R_{b} \wedge R\!I\!C(R_{a},R_{b}) \geq 1 \}\end{equation*} where $R\!I\!C(R_{a},R_{b})$ gives the number of Referential Integrity Constraints from a relation $R_{a}$ to a relation $R_{b}$ . We say that a schema graph is an ordered pair $G_{S}=\langle \mathcal {R},E \rangle $ , where $\mathcal {R}$ is the set of vertices (nodes) of $G_{S}$ and $E$ is the set of edges of $G_{S}$ .

Example 6:

Considering the sample movie database introduced in Figure 1, our method generates the schema graph below:\begin{align*} G_{S}= \langle &\{PERSON, MOVIE, CASTING, CHARACTER, ROLE\},\\ &\{\langle CASTING,PERSON\rangle,\langle CASTING,MOVIE\rangle,\\ &\langle CASTING,CHARACTER\rangle,\langle CASTING,ROLE\rangle \}\rangle\end{align*}

Figure 6 presents a graphical illustration of $G_{S}$ .

FIGURE 6. A schema graph for the sample movie database of Figure 1.
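Following Definition 6, building $G_{S}$ during preprocessing amounts to one pass over the RIC counts gathered from the database catalog. A minimal sketch, assuming the counts are already available as a dictionary:

```python
def schema_graph(rics):
    """Build G_S = <R, E> from a RIC count table, per Definition 6.

    `rics` maps (R_a, R_b) to the number of referential integrity
    constraints from R_a to R_b; relations with no RICs at all would
    need to be added to the node set separately.
    """
    nodes = {r for pair in rics for r in pair}
    edges = {(a, b) for (a, b), n in rics.items() if a != b and n >= 1}
    return nodes, edges

# The sample movie database of Figure 1 (one RIC per edge).
rics = {("CASTING", "PERSON"): 1, ("CASTING", "MOVIE"): 1,
        ("CASTING", "CHARACTER"): 1, ("CASTING", "ROLE"): 1}
print(schema_graph(rics))
```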

Once we have defined the schema graph, we can introduce an important concept, the Joining Network of Keyword Matches (JNKM). Intuitively, a joining network of keyword matches $J$ contains every KM from a query match $M$ . $J$ may also contain some keyword-free matches for the sake of connectivity. Finally, $J$ is a connected graph that is structured according to the schema graph $G_{S}$ . The definition of a joining network of keyword matches is given as follows:

Definition 7:

Let $M$ be a query match for a keyword query $Q$ . Let $G_{S}$ be a schema graph. Let $F$ be a set of keyword-free matches from the relations of $G_{S}$ . Consider a connected graph of keyword matches $J = \langle \mathcal {V}, E\rangle $ , where $\mathcal {V}$ and $E$ are the vertices and edges of $J$ . We say that $J$ is a joining network of keyword matches from $M$ over $G_{S}$ if the following conditions hold:\begin{align*} &i)\ \mathcal {V} = M \cup F\\ &ii)\ \forall \langle K\!M_{a},K\!M_{b} \rangle \in E \implies \exists \langle R_{a}, R_{b} \rangle \in G_{S}\end{align*} For the sake of simplifying the notation, we will use a graphical illustration to represent JNKMs, as shown in Example 7.

Example 7:

Considering the query match $M_{1}$ previously generated in Example 5, the following JNKMs can be generated:\begin{align*} J_{1} &= PERSON^{V}[name^{\{will,smith\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}]\\ J_{2} &= CHARACTER \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}],\\ &\quad\, CASTING \rightarrow PERSON^{V}[name^{\{will,smith\}}]\end{align*}

The JNKMs $J_{1}$ and $J_{2}$ cover the query match $M_{1}$ . The interpretation of $J_{1}$ looks for the movies of the person Will Smith. $J_{2}$ looks for the movies of the person Will Smith and the characters Will Smith played in these movies.

Notice that a JNKM might have unnecessary information for the keyword query, which was the case of $J_{2}$ presented in Example 7. One approach to avoid generating unnecessary information is to generate Minimal Joining Networks of Keyword Matches (MJNKM), which are addressed in Definition 8. Roughly, a MJNKM cannot have any keyword-free match as a leaf, that is, a keyword-free match incident to a single edge.

Definition 8:

Let $G_{S}$ be a schema graph. Let $M$ be a query match for a query $Q$ . We say that $J=\langle \mathcal {V}, E \rangle $ from $M$ over $G_{S}$ is a minimal joining network of keyword matches (MJNKM) if, and only if, the following condition holds:\begin{equation*} \forall K\!M_{i} \in \mathcal {V}: (\exists !\, \langle K\!M_{a},K\!M_{b} \rangle \in E \mid i \in \{a,b\}) \implies K\!M_{i} \neq R_{i}^{S}[\,]^{V}[\,]\end{equation*}

Example 8:

Considering the query match $M_{2}$ previously generated in Example 5, the following MJNKM can be generated:\begin{align*} J_{3} &= CASTING \rightarrow MOVIE^{S}[self^{\{films\}}],\\ &\quad\, CASTING \rightarrow PERSON^{V}[name^{\{will\}}],\\ &\quad\, CASTING \rightarrow PERSON^{V}[name^{\{smith\}}]\end{align*}

Another issue that a JNKM might have is representing an inconsistent interpretation. For instance, it is impossible for $J_{3}$ , presented in Example 8, to return any results from the database. By Definition 1, the VKMs $PERSON^{V}[name^{ \{will\} }]$ and $PERSON^{V}[name^{ \{smith\} }]$ are disjoint. However, a tuple from $CASTING$ cannot refer to two different tuples of $PERSON$ . Thus, $J_{3}$ is inconsistent. We notice that previous work in the literature on CJN generation addressed this kind of inconsistency [2], [4]. It did not, however, consider the situation in which there exists more than one RIC from one relation to another. In contrast, based on the theorems and definitions presented in [4], Lathe proposes a novel approach for checking consistency in CJNs that supports such scenarios. Theorem 1 presents a criterion that determines when a JNKM is sound, that is, when it can only produce JNTs that do not contain more than one occurrence of a tuple. The proof of Theorem 1 is presented in Appendix F.

Theorem 1:

Let $G_{S} = \langle \mathcal{R}, E_{G} \rangle$ be a schema graph. Let $J =\langle \mathcal{V}, E_{J} \rangle$ be a joining network of keyword matches. We say that $J$ is sound, that is, it does not produce more than one occurrence of the same tuple for any instance of the database if, and only if, the following condition holds for all $K\!M_{a} \in \mathcal{V}$ and all $\langle R_{a}, R_{b} \rangle \in E_{G}$:\begin{equation*} R\!I\!C(R_{a},R_{b}) \geq |\{K\!M_{c} \mid \langle K\!M_{a}, K\!M_{c} \rangle \in E_{J} \wedge R_{c}=R_{b} \}|\end{equation*} where $R\!I\!C(R_{a},R_{b})$ is the number of Referential Integrity Constraints from relation $R_{a}$ to relation $R_{b}$, and $R_{c}$ is the relation of $K\!M_{c}$.
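To make the condition concrete, the following Python sketch checks it for a JNKM represented as a set of directed edges between keyword matches; all names here are illustrative, not part of Lathe's actual code. The example at the end mirrors the scenario of Example 9 below.

```python
from collections import Counter

def is_sound(jnkm_edges, relation_of, ric_count):
    """Check the soundness condition of Theorem 1.

    jnkm_edges  -- list of directed edges (km_a, km_c) between keyword matches
    relation_of -- dict mapping each keyword match to its database relation
    ric_count   -- dict mapping (R_a, R_b) to the number of RICs from R_a to R_b
    """
    # For each keyword match, count its outgoing neighbors per target relation.
    neighbors_per_relation = Counter(
        (km_a, relation_of[km_c]) for km_a, km_c in jnkm_edges
    )
    # Sound iff no keyword match joins more neighbors over a relation R_b
    # than there are RICs from its own relation to R_b.
    return all(
        ric_count.get((relation_of[km_a], r_b), 0) >= n
        for (km_a, r_b), n in neighbors_per_relation.items()
    )

# BORDER has two RICs to COUNTRY, so joining one BORDER node
# to two distinct COUNTRY matches is sound.
relation_of = {"colombia": "COUNTRY", "brazil": "COUNTRY", "b1": "BORDER"}
ric = {("BORDER", "COUNTRY"): 2}
print(is_sound([("b1", "colombia"), ("b1", "brazil")], relation_of, ric))  # True
```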

Example 9 presents a JNKM that is sound, although it would be deemed not sound by previous approaches [2], [4].

Example 9:

Consider a simplified excerpt from the MONDIAL database [35], presented in Figure 7. As there exist two RICs from the relation $BORDER$ to $COUNTRY$, represented by the attributes Ctry1_Code and Ctry2_Code, a tuple from $BORDER$ can be joined to at most two distinct tuples from $COUNTRY$, which is the case of $t_{35}\bowtie t_{38}\bowtie t_{36}$. Thus, the following MJNKM is sound:

$J_{4}$: $COUNTRY^{V}[name^{\{colombia\}}] \leftarrow BORDER \rightarrow COUNTRY^{V}[name^{\{brazil\}}]$

Finally, Definition 9 describes a Candidate Joining Network (CJN), which is, roughly, a sound minimal joining network of keyword matches.

FIGURE 7. A simplified excerpt from MONDIAL.

Definition 9:

Let $M$ be a query match for the keyword query $Q$ . Let $G_{S}$ be a schema graph. Let $C\!J\!N$ be a joining network of keyword matches from $M$ over $G_{S}$ given by $C\!J\!N=\langle \mathcal {V}, E \rangle $ . We say that $C\!J\!N$ is a candidate joining network if, and only if, $C\!J\!N$ is minimal and sound.

Example 10:

Considering the query match $M_{2}$ previously generated in Example 5, the following CJN can be generated:

$C\!J\!N_{1}$: $PERSON^{V}[name^{\{will\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}] \leftarrow CASTING \rightarrow PERSON^{V}[name^{\{smith\}}]$

The candidate joining network $C\!J\!N_{1}$ covers the query match $M_{2}$. $C\!J\!N_{1}$ is a minimal and sound JNKM. Its interpretation searches for the movies in which both persons "will" (e.g., Will Theakston) and "smith" (e.g., Maggie Smith) participate. The two keyword-free matches from the $CASTING$ relation are treated as distinct nodes in $C\!J\!N_{1}$.

The details on how we generate CJNs in Lathe are given by the CJNGen algorithm in Appendix G.

A. Candidate Joining Network Ranking

In this section, we present a novel ranking of CJNs based on the ranking of QMs. This ranking is necessary because many CJNs are often generated, yet only a few of them are useful to produce relevant answers.

We present in Section VI-B a QM ranking that anticipates most of the features present in the CJN rankings of other systems, such as CNRank [10]. Thus, we can exploit the scores of the QMs to rank the CJNs, and our CJN ranking strategy is straightforward yet effective. Roughly, it uses the ranking of QMs and adds a penalization for large CJNs. The score of a candidate joining network $C\!J\!N_{M}$ generated from a query match $M$ is given by:\begin{equation*} score(C\!J\!N_{M}) = score(M) \times \frac{1}{|C\!J\!N_{M}|}\end{equation*}

To ensure that CJNs with the same score are placed in the same order in which they were generated, we use a stable sorting algorithm [18].
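A minimal sketch of this ranking in Python follows; the attribute names qm_score and size are assumptions for illustration. Python's built-in sort is stable, which gives exactly the tie-breaking behavior described above.

```python
def rank_cjns(cjns):
    """Rank CJNs by score(M) * 1/|CJN|, keeping generation order on ties.

    Each CJN is assumed to expose the score of its query match (qm_score)
    and its total number of keyword matches (size).
    """
    # sorted() is stable: CJNs with equal scores keep their generation order.
    return sorted(cjns, key=lambda c: c.qm_score / c.size, reverse=True)
```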

B. Candidate Joining Network Pruning

In this section, we present an eager evaluation strategy for pruning CJNs. Even though CJNs represent valid interpretations of the keyword query, some of them may fail to produce any JNTs as a result. Thus, we can improve the results of our CJN generation and ranking by pruning what we call void CJNs, that is, CJNs whose results contain no JNTs.

Example 11:

Considering the database instance of Figure 1 and the keyword query "will smith films", the following CJNs can be generated:

$C\!J\!N_{2}$: $PERSON^{V}[name^{\{will\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}]^{V}[name^{\{smith\}}]$

$C\!J\!N_{3}$: $PERSON^{V}[name^{\{will\}}] \leftarrow CASTING \rightarrow MOVIE^{S}[self^{\{films\}}] \leftarrow CASTING \rightarrow CHARACTER^{V}[name^{\{smith\}}]$

The interpretation of $C\!J\!N_{2}$ looks for movies whose name contains the keyword "smith" (e.g., "Mr. & Mrs. Smith") and in which a person whose name contains "will" (e.g., "Will Theakston") participates. The interpretation of $C\!J\!N_{3}$ looks for movies where a person whose name contains "will" (e.g., "Will Theakston") played a character whose name contains "smith" (e.g., "Jane Smith"). Notice that although $C\!J\!N_{2}$ and $C\!J\!N_{3}$ both provide valid interpretations for the keyword query, neither produces any tuples as a result in the given database instance.

As most of the previous work does not rank CJNs, but only evaluates them and ranks their resulting JNTs instead, the pruning of void CJNs had never been addressed before. Lathe employs a pruning strategy that evaluates CJNs as soon as they are generated, pruning the void ones. As demonstrated in our experiments, this strategy can significantly improve the quality of the CJN generation process, particularly when the schema graph contains a large number of nodes and edges.

For instance, one of the datasets we use in our experiments, the MONDIAL database, contains a large number of relations and relational integrity constraints (RICs). This results in a schema graph with many nodes and edges, which, intuitively, incurs a large number of possible CJNs for a single QM. Indeed, we found that such schema graphs are prone to produce a large number of void CJNs. In particular, while approximately 20% of the keyword queries used in our experiments required us to consider 9 CJNs per QM, the eager evaluation strategy reduced this value to 2 CJNs per QM.

Notice, however, that to find out whether a CJN is void, we must execute it as an SQL query in the DBMS, which incurs an additional cost and increases the CJN generation time. Despite that, our experiments show that the eager evaluation strategy does not necessarily hinder the performance of an R-KwS system. In fact, reducing the number of CJNs per QM alone improves the system's efficiency because this parameter influences the CJN generation process. Furthermore, the eager evaluation advances the CJN evaluation, which is already a required step in the majority of R-KwS systems in the related work. Lastly, we can set a maximum number of CJNs to probe during the eager evaluation, which limits the increase in CJN generation time.
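The sketch below illustrates the eager evaluation with psycopg2 against PostgreSQL, the RDBMS used in our experiments; the wrapping of a CJN's SQL translation with LIMIT 1 and the helper names are our own illustration, not Lathe's exact code.

```python
def is_void(conn, cjn_sql):
    """Probe a CJN: execute its SQL translation asking for a single row."""
    with conn.cursor() as cur:
        # LIMIT 1 suffices: we only need to know whether any JNT exists.
        cur.execute(f"SELECT 1 FROM ({cjn_sql}) AS probe LIMIT 1")
        return cur.fetchone() is None

def eager_filter(conn, ranked_cjn_sqls, max_probes):
    """Keep only non-void CJNs, probing at most max_probes of them."""
    return [sql for sql in ranked_cjn_sqls[:max_probes]
            if not is_void(conn, sql)]
```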

SECTION VIII.

Experiments

In this section, we report a set of experiments performed using datasets and query sets previously used in similar experiments reported in the literature. Our goal is to evaluate the quality of the CJN ranking, the quality of the QM ranking, and how our eager evaluation strategy can improve the CJN generation.

A. Experimental Setup

1) System Details

We ran the experiments on a Linux machine running Artix Linux (64-bit, 32GB RAM, AMD Ryzen™ 5 5600X CPU @ 3.7GHz). We used PostgreSQL as the underlying RDBMS with its default configuration. All implementations were made in Python 3.

2) Datasets

For all the experiments, we used three datasets, IMDb, MONDIAL, and Yelp, which were used in the experiments performed with previous R-KwS systems and methods [2], [3], [7], [10], [11], [20], [36]. The IMDb dataset is a subset of the well-known Internet Movie Database (IMDb), which comprises information related to films, television shows, and home videos, including actors, characters, etc. The MONDIAL dataset [35] comprises geographical and demographic information from the well-known CIA World Factbook, the International Atlas, the TERRA database, and other web sources.

The Yelp dataset is a subset of Yelp, which comprises information about businesses, reviews, and users. The three datasets have distinct characteristics. The IMDb dataset has a simple schema, but query keywords often occur in several relations. Although the MONDIAL dataset is smaller, its schema is more complex and dense, with more relations and relational integrity constraints (RICs). The Yelp dataset has the highest number of tuples, but its schema is simple. Table 2 summarizes the details of each dataset.

TABLE 2. Datasets We Used in Our Experiments.

3) Query Sets

We used the query sets provided by the Coffman & Weaver [3] benchmark for the IMDb and MONDIAL datasets. The query set for Yelp was obtained from SQLizer [37] and consists of 28 queries formulated in natural language. We adapted these queries to our experiments by extracting only their keyword terms.

However, we noticed that several queries from the IMDb and MONDIAL query sets do not have a clear intent, compromising the proper evaluation of the results, for instance, the ranking of CJNs. Therefore, for the sake of a fairer evaluation, we generated an additional query set for each original one, replacing queries that we consider unclear with equivalent queries that include schema references. As an example, consider the query "Saint Kitts Cambodia" for the MONDIAL dataset, where Saint Kitts and Cambodia are names of countries. There exist several interpretations of this keyword query, each with a distinct way of connecting the tuples corresponding to these countries. For example, one might look for shared religions, languages, or ethnic groups between the two countries. While all these interpretations are valid in theory, the relevant interpretation defined by Coffman & Weaver [3] in their golden standard indicates that the query searches for organizations in which both countries are members. In this case, we replaced the query in the new query set with "Saint Kitts Cambodia Organizations".

Table 3 presents the query sets we used in our experiments, along with some of their features. Query sets whose names include the suffix "-DI" correspond to those in which we replaced ambiguous queries as explained above. Thus, these query sets have no ambiguous queries and a higher number of schema references.

TABLE 3. Query Sets We Used in Our Experiments.

4) Golden Standards

The benchmark from Coffman & Weaver [3] provides the relevant interpretation and its relevant SQL results for each query over the IMDb and MONDIAL datasets. In the case of the Yelp dataset, SQLizer [37] provides the relevant SQL queries for the natural language queries. Since we derived keyword queries from the latter, we also adapted the SQL queries to reflect this change. We then manually generated the golden standards for CJNs and QMs using the relevant SQL queries provided by Coffman & Weaver and by SQLizer.

5) Metrics

We evaluate the ranking of CJNs and QMs using four metrics: Precision at ranking position 1 ($P\text{@}1$), Recall, Recall at ranking position $K$ ($R\text{@}K$), and Mean Reciprocal Rank (MRR).

Precision at 1 ($P\text{@}1$) is the ratio of queries for which a relevant result is found in the first ranking position to the total number of queries. Recall is the ratio of relevant results retrieved to the total number of relevant results. Recall at $K$ ($R\text{@}K$) is the mean recall across multiple queries considering only the first $K$ results. If a system retrieves fewer than $K$ results, we calculate the recall value at the last result. For instance, if the system returns the relevant CJN within the first three ranking positions for 35 out of 50 queries, it obtains an $R\text{@}3$ of 0.7.

The Mean Reciprocal Rank (MRR) value indicates how close the relevant CJN is to the first position of the ranking. Given a keyword query $Q$, the reciprocal rank for $Q$ is given by $RR_{Q} = \frac{1}{K}$, where $K$ is the rank position of the relevant result. The MRR obtained for a query set is the average of $RR_{Q}$ over all queries $Q$ in the query set.
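These metrics can be computed as in the sketch below, a simplified version that assumes a single relevant answer per query, given the 1-based rank at which each query's relevant result was returned (None when it was not returned at all).

```python
def p_at_1(ranks):
    return sum(1 for r in ranks if r == 1) / len(ranks)

def r_at_k(ranks, k):
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr(ranks):
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Relevant answer ranked 1st, 1st, 3rd, and 1st in four queries:
ranks = [1, 1, 3, 1]
print(p_at_1(ranks), r_at_k(ranks, 3), mrr(ranks))  # 0.75 1.0 0.8333...
```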

6) Lathe Setup

For the experiments we report here, we set maximum sizes for QMs and CJNs of 3 and 5, respectively. Also, we consider three important parameters for running Lathe: $N_{QM}$, the maximum number of QMs taken from the QM ranking; $N_{CJN}$, the maximum number of CJNs considered from each QM; and $P_{CJN}$, the number of CJNs probed per QM by the eager evaluation. In this context, a setup for Lathe is a triple $N_{QM}/N_{CJN}/P_{CJN}$. The most common setup we used in our experiments is 8/1/9, in which we take the top-8 QMs in the ranking, generate and probe up to 9 CJNs for each QM, and take only the first non-empty CJN, if any, from each QM. We call this the default setup. Later in this section, we discuss how these parameters affect the effectiveness and the performance of Lathe, as well as why we use this default configuration.

All the resources, including source code, query sets, datasets and golden standards used in our experiments are available at https://github.com/pr3martins/Lathe.

B. Preliminary Results

We present in this section some statistics about the CJN generation process. Table 4 shows the maximum and average numbers of KMs, QMs, and CJNs generated for each query set. The last two columns refer to the ratio of the number of CJNs to the number of QMs. Notice that we removed the maximum caps for the number of CJNs and CJNs per QM in the experiment reported here. However, we maintained the limit sizes of 3 and 5 for the QMs and CJNs, respectively.

TABLE 4. Statistics for the CJN Process of Each Query Set.

Overall, the query sets for both IMDb and Yelp datasets achieved higher maximum and average numbers of KMs and QMs. This result is due to a higher number of tuples and the keywords being present in multiple relations or combinations. For example, in the IMDb dataset, several persons, characters, and even movies share the same name or part of it. In the case of Yelp, for instance, the keyword “texas” can match a state, a restaurant name, or a username. On the other hand, in MONDIAL, the keywords often match a few attributes only. For example, a city name probably does not overlap with the names of countries, continents, etc. Consequently, the system produces a low number of KMs and QMs for the query sets of this dataset.

Regarding the CJN generation, the query sets for IMDb and Yelp achieved high numbers of CJNs because of their already high numbers of QMs, but a low ratio of CJNs to QMs due to their simple schema graphs. As for the query sets for the MONDIAL dataset, they achieved opposite results due to their complex schema graph.

C. Comparison With Other R-KwS Systems

In this experiment, we first compare Lathe with QUEST [12], the current state-of-the-art R-KwS system with support for schema references, and then we compare Lathe with several other R-KwS systems. Here, we used the default Lathe setup, that is, 8/1/9. We compare our results to those published by the authors, which refer to the MONDIAL dataset, because we were unable to run QUEST due to the lack of code and of enough details for implementing it. Figure 8 depicts the results for the 35 queries supported by QUEST out of the 50 queries provided in the original query set. The graphs show the recall and P@1 values for the ranking produced by each system considering the golden standard supplied by Coffman & Weaver [3].

FIGURE 8. Comparison of Lathe with the QUEST system.

Both systems achieved perfect recall; that is, all the correct solutions for the given keyword queries were retrieved. Concerning P@1, Lathe obtained better results than QUEST, with an average of 0.97 and a standard error of 0.03, which indicates that, in most cases, the correct solution was the one corresponding to the CJN ranked first by Lathe.

Next, we compare the results obtained for Lathe with those published in the comprehensive evaluation by Coffman & Weaver [3] for the systems BANKS [14], DISCOVER [4], DISCOVER-II [6], BANKS-II [15], DPBF [38], BLINKS [16], and STAR [39]. Because this comparison uses all 50 keyword queries from the MONDIAL dataset, we did not include QUEST in it. Figure 9 shows the recall and P@1 values for the ranking produced by each system when the golden standard provided by Coffman & Weaver [3] is taken into account.

FIGURE 9. Comparison with other approaches using Recall and P@1 metrics.

Overall, Lathe achieved the best results in recall and P@1. Notice that the only systems that achieved a similar recall, DPBF and BLINKS, are based on data graphs and thus require a materialization of the database. The difference between the recall values of Lathe and those of DISCOVER and DISCOVER-II is mainly due to the latter not supporting schema references. Regarding P@1, Lathe obtained a value of 0.96 with a standard error of 0.03, which is significantly higher than the results of the other systems. This difference in P@1, especially compared with DISCOVER and DISCOVER-II, is due to the novel ranking of QMs as well as an improved ranking of CJNs.

D. Evaluation of Query Matches Ranking

In this experiment, we evaluate the quality of the QM ranking according to the MRR and $R\text{@}K$ metrics. As shown by the results in Section VIII-B, there can be many QMs depending on the query. Thus, we want to verify how effective the QMRank algorithm is at placing the correct QM among the top positions of the ranking. Figure 10 shows the results obtained with $R\text{@}K$ up to the tenth ranking position and with the MRR metric.

FIGURE 10. Evaluation of Query Matches.

For all query sets, in most cases, the correct QM appears within the first eight ranking positions. In MONDIAL and MONDIAL-DI, the relevant QM appears within the first three positions for all queries. Yelp obtained an $R\text{@}8$ of 1 and an $R\text{@}3$ of 0.93, which indicates that the system returns the relevant QM by the eighth position and, in most cases, by the third position. There is one query for the IMDb dataset whose relevant QM is not minimal. As QMs must be minimal by Definition 5, Lathe does not support this query. Consequently, the query sets for the IMDb dataset can obtain an $R\text{@}K$ value of at most 0.98. IMDb and IMDb-DI achieved this value at position 5.

Regarding MRR, Lathe obtained 0.75 for both IMDb and IMDb-DI, 0.83 for Yelp, and 0.96 and 0.95 for MONDIAL and MONDIAL-DI, respectively. This result indicates that the relevant QM is often in the top positions of the ranking. Notice that the QM ranking indirectly impacts the generation and ranking of CJNs: in practice, a high $R\text{@}K$ value with a low $K$ allows us to generate fewer CJNs without compromising the quality of the CJN ranking. Based on these results, we set the parameter $N_{QM}$ to 8, which means Lathe generates CJNs only for the top-8 query matches.

E. Evaluation of the Candidate Joining Network Ranking

In this experiment, we evaluate the quality of our approach for CJN generation and ranking. We used the MRR and $R\text{@}K$ metrics for $K$ up to the tenth rank position. We tested several different setups, but to save space we report here only those with representative, distinct results. Specifically, we report the results of four setups without the eager evaluation, that is, 8/1/0, 8/2/0, 8/8/0, and 8/9/0, and two setups with the eager evaluation, that is, 8/1/9 and 8/2/9.

Figure 11 shows the results for the IMDb and IMDb-DI query sets. As can be seen, regardless of the configuration, our method was able to place the relevant CJNs in the top positions of the ranking, and the results are very similar for both query sets. This shows that, in this dataset, our method was able to disambiguate the queries properly, even without the addition of schema references. It is worth noting that the $R\text{@}1$ values in both query sets show that the configurations with the eager evaluation achieved better results because they place the relevant CJNs in the first ranking position more frequently. The $R\text{@}K$ metric also shows that the quality of the ranking decreases as the number of CJNs per QM increases, especially for $K$ in the range $2\leq K \leq 6$.

FIGURE 11. Ranking of Candidate Joining Networks - IMDb (top) and IMDb-DI (bottom).

Figure 12 shows the results for MONDIAL and MONDIAL-DI. In these query sets, the configurations with the eager evaluation achieved significantly better results. The configurations 8/1/0 and 8/2/0 could not generate the relevant CJN for around 20% of the queries due to a low number of CJNs per QM; therefore, their results were capped at approximately 0.8 for both MRR and $R\text{@}K$. The configurations 8/8/0 and 8/9/0 were able to generate the relevant CJN in most cases, although the large number of CJNs per QM negatively affected the ranking of CJNs. Finally, the configurations 8/1/9 and 8/2/9 produced the best results because the pruning enables us to generate the relevant CJN with a low number of CJNs per QM while also placing it in higher rank positions. Notice that the disambiguation of queries in the MONDIAL-DI query set allowed configurations 8/8/0 and 8/9/0 to achieve better results, especially for the $R\text{@}K$ metric for $K$ above 7. The eager evaluation configurations were able to disambiguate the queries without relying on the addition of schema references; therefore, their results were consistent across the MONDIAL and MONDIAL-DI query sets.

FIGURE 12. Ranking of Candidate Joining Networks - MONDIAL (top) and MONDIAL-DI (bottom).

Figure 13 shows the results for the Yelp query set. Overall, the eager CJN evaluation did not affect the results for this query set, probably because the database schema graph is simple and the ways of connecting the query matches are straightforward. Configurations 8/1/0 and 8/1/9 achieved the best results, obtaining an MRR of 0.85 and an $R\text{@}2$ of 0.92 for the CJN generation. This indicates that the relevant CJNs are often found by the second ranking position, with the exception of two queries, whose relevant CJNs were found in positions 5 and 7, respectively. The other configurations obtained slightly worse results, with an MRR of 0.84 and an $R\text{@}2$ of approximately 0.89.

FIGURE 13. Ranking of Candidate Joining Networks - Yelp.

Regardless of the dataset and configuration, our method achieved an MRR value above 0.7, which indicates that, on average, the relevant CJN is found between the first and second rank positions. In the IMDb dataset, the decrease of the $R\text{@}K$ values with the number of CJNs taken per QM is also reflected in the MRR metric. However, in the MONDIAL dataset, the improvement of the $R\text{@}K$ values due to the disambiguation of queries is not reflected in the MRR value, as this improvement only happens in low ranking positions ($K \leq 8$).

The eager CJN evaluation inherently affects the performance of the CJN generation process. Therefore, it is important to look at the trade-off between effectiveness and efficiency in each configuration. We examine this trade-off in the next section.

F. Performance Evaluation

In this experiment, we aim to evaluate the time spent obtaining CJNs given a keyword query and to analyze the trade-offs between the efficiency and the efficacy of the different configurations used in Lathe.


Figure 14 summarizes the average execution time for each phase of the process: Keyword Matching, Query Matching, and Candidate Joining Network Generation. In this first experiment, we used the configuration 8/1/0. Lathe obtained better total execution times for the IMDb dataset, followed by the Yelp dataset. In addition, the disambiguated variants of the query sets yielded slower execution times in comparison with their original counterparts. Also, it is worth noting that the execution times for each query set are related to the numbers of KMs, QMs, and CJNs shown in Table 4.

FIGURE 14. Average Execution Times for each phase of Lathe.

Regarding keyword matching, the Yelp dataset yielded the worst execution times, 167 ms, probably because of its higher number of attributes and tuples. Although the MONDIAL dataset has fewer tuples than IMDb, its higher number of schema elements (28 relations and 48 attributes) results in a higher execution time than IMDb's.

Due to the combinatorial nature of QM generation, the execution times for the Query Matching phase are directly related to the number of QMs. While the execution times for the IMDb and IMDb-DI query sets, which produced a high number of QMs, are 247 and 256 milliseconds, respectively, the results for MONDIAL and MONDIAL-DI are around 190 and 202 microseconds. The Yelp dataset achieved 121 milliseconds.

Concerning the CJN generation phase, the execution times for MONDIAL are significantly higher than those for IMDb and Yelp, despite the lower number of CJNs for MONDIAL. Because the CJN generation algorithm is based on a Breadth-First Search, the greater the number of vertices and edges in the schema graph of the MONDIAL dataset, the greater the number of iterations and, consequently, the slower the execution times. This behavior persists throughout different configurations, an issue we further analyze below.

G. Quality Versus Performance

Figure 15 presents an evaluation of the CJN generation performance, comparing the same configurations used in the experiment of Section VIII-E. We present the results for the IMDb, MONDIAL, and Yelp datasets on different scales because they differ by orders of magnitude. Overall, execution times increase as the number of CJNs taken per QM increases. This pattern is more pronounced in the MONDIAL dataset. Also, the eager CJN evaluation incurs an unavoidable increase in the CJN generation time, as the system has to probe the CJNs by running queries against the database.

FIGURE 15. Performance Evaluation of the CJN Generating phase.

As the configurations have an impact on both the quality of the CJN ranking and the performance, it is important to examine the trade-off between effectiveness and efficiency. Configurations 8/1/0 and 8/2/0 achieved the best execution times due to their low number of CJNs per QM and to not relying on database accesses. However, these configurations did not achieve the highest MRR and $R\text{@}K$ values for the IMDb and MONDIAL datasets. Therefore, they are recommended if one must prioritize efficiency.

Configuration 8/1/9 obtained better results than configurations 8/8/0 and 8/9/0 for the IMDb and MONDIAL datasets, and better results than 8/2/9 for all datasets. Although this configuration is slower than 8/1/0 and 8/2/0, its significantly better MRR and $R\text{@}K$ values for the MONDIAL and IMDb datasets make 8/1/9 an overall recommended option, especially if one must prioritize effectiveness.

We do not recommend configurations 8/2/0, 8/8/0, 8/9/0, and 8/2/9 because their MRR and $R\text{@}K$ values do not justify the increase in execution times. Although 8/2/9 obtained the best MRR and $R\text{@}K$ values for the MONDIAL dataset, it is 37%-80% slower than 8/1/9. Configurations 8/8/0 and 8/9/0 achieved a slight increase in the $R\text{@}K$ metric for the MONDIAL dataset, for $K \leq 8$; however, they obtained lower MRR and $R\text{@}K$ values for the IMDb and Yelp datasets.

It is interesting to note that the configurations with eager CJN evaluation spend time probing CJNs by sending queries to the DBMS. However, as they generate a smaller set of CJNs, their overall performance is not hindered in comparison with the configurations without it.

SECTION IX.

Conclusion

In this paper, we have proposed Lathe, a new relational keyword search (R-KwS) system for generating a suitable SQL query from a given keyword query. Lathe is the first to address the problem of generating and ranking Candidate Joining Networks (CJNs) based on queries with keywords that can refer to either instance values or database schema elements, such as relations and attributes.

In addition, Lathe improves the quality of the generated CJNs by introducing two major innovations: a ranking for selecting better Query Matches (QMs) in advance, yielding the generation of fewer but better CJNs, and an eager evaluation strategy for pruning void, useless CJNs.

We presented a comprehensive set of experiments performed with query sets and datasets previously used in experiments with state-of-the-art R-KwS systems and methods. Our experiments indicate that Lathe can handle a wider variety of keyword queries while remaining highly effective, even for large databases with intricate schemas.

Also, a full implementation of Lathe is publicly available at https://github.com/pr3martins/Lathe as a Python library for Keyword Search over Relational Databases called PyLatheDB [40]. This library is ready for developers to easily run Lathe or incorporate its features, such as keyword matching, into their own applications.

Our experience in the development of Lathe raised several ideas for future work. First, one important issue in our method is correctly matching keywords from the input query to the corresponding database elements. To address this issue, we plan to investigate alternative similarity functions. We are particularly interested in word-embedding-based functions, such as the well-known Word Mover's Distance (WMD) [41]. We also consider investigating methods based on Neural Language Models (NLMs), particularly transformer and attention-based models [42], [43], [44], which have proved promising for several text-based Information Retrieval problems. For example, we believe that an interesting approach to the QM ranking problem is to interpret it as a variant of the Table Retrieval task [45], [46]: given a keyword query and a table corpus, this task consists of returning a ranked list of the tables that are relevant to the query.

Second, data exploration techniques have recently gained popularity because they allow for the extraction of knowledge from data even when the user is unsure of what to look for [47]. Keyword-based queries, we believe, can be used as an interesting tool for data exploration because they allow one to retrieve interesting portions of a database without knowing the details of the schema and its semantics.

Third, we believe that pruning strategies can improve the QM generation by reducing the search space of keyword matches. For this, we plan to exploit the relationship of the QM generation and the discovery of matching dependencies [48].

Fourth, although we have focused on relational databases in this paper, the ideas discussed here can be extended to other types of databases as well. Currently, we are extending them to the so-called document stores, such as the very popular MongoDB engine. Our preliminary findings [36] suggest that, because queries over such databases are frequently more complex than queries over relational databases, the simplicity of keyword queries may bring even more advantages in this context.

Finally, we anticipate that keyword queries will be useful as a tool for the seamless integration of data from heterogeneous sources, as in the so-called polystore systems and data lakes, which have become increasingly popular in recent years. There already exist research proposals in this direction [49], and we believe that the schema graph approach we adopt in our work can be helpful in achieving this goal.

Appendix A

Acronyms

R-KwS: Relational Keyword Search.
CJN: Candidate Joining Network.
QM: Query Match.
DB: Database.
SQL: Structured Query Language.
IR: Information Retrieval.
DBMS: Database Management System.
JNT: Joining Network of Tuples.
IMDb: Internet Movie Database.
KM: Keyword Match.
VKM: Value-Keyword Match.
SKM: Schema-Keyword Match.
PKFK: Primary Key/Foreign Key.
WUP: Wu-Palmer Measure.
LCS: Least Common Subsumer.
TF-IDF: Term Frequency-Inverse Document Frequency.
RIC: Relational Integrity Constraint.
JNKM: Joining Network of Keyword Matches.
MJNKM: Minimal Joining Network of Keyword Matches.
R@K: Recall at K.
P@1: Precision at 1.
MRR: Mean Reciprocal Rank.
WMD: Word Mover's Distance.
NLM: Neural Language Model.
ECLAT: Equivalence Class Clustering and bottom-up Lattice Traversal.

Appendix B

VKMGen Algorithm

As shown in Algorithm 1, Lathe retrieves the tuples of the database in which the keywords occur and uses them to generate value-keyword matches. Initially, the VKMGen algorithm takes the occurrences of each keyword from the Value Index and forms partial value-keyword matches, which are not yet guaranteed to be disjoint sets (Lines 3-8). The pool of VKMs is represented by the hash table $P$, whose keys are KMs and whose values are sets of tuple IDs.

Algorithm 1: VKMGen($Q$)

Next, Lathe ensures that VKMs are disjoint sets through Algorithm 2, VKMInter, which is based on the ECLAT algorithm [50] for finding frequent itemsets. VKMInter looks for non-empty intersections of the partial value-keyword matches recursively until all of them are disjoint sets, and thus proper VKMs. These intersections are calculated as follows:\begin{align*} K\!M_{a} \cap K\!M_{b} = \begin{cases} \emptyset & \text{if } R_{a}\neq R_{b}\\ R^{V}_{ab}[A_{ab,1}^{K_{ab,1}},\ldots,A_{ab,m}^{K_{ab,m}}] & \text{if } R_{a}=R_{b}\end{cases}\end{align*} where $K\!M_{x} = R^{V}_{x}[A_{x,1}^{K_{x,1}},\ldots,A_{x,m}^{K_{x,m}}]$ for $x \in \{a,b\}$, and $K_{ab,i} = K_{a,i}\cup K_{b,i}$.

Algorithm 2: VKMInter($P$)

VKMInter uses three hash tables: $P$ , $P_{next}$ and $R$ . The pool $P$ contains the partial VKMs of the current iteration. The pool $P_{next}$ contains the partial VKMs for the next iteration. The pool $R$ stores the tuple IDs to be removed from the VKMs of $P$ at the end of the current iteration, turning the partial VKMs into proper value-keyword matches.

VKMInter first defines the hash tables $P_{next}$ and $R$, then initializes $R$ with empty sets (Lines 1-4). Next, the algorithm iterates over all pairs $\{K\!M_{a},K\!M_{b}\}$ of VKMs in $P$ and tries to create a new keyword match $K\!M_{ab}$, which is the intersection of $K\!M_{a}$ and $K\!M_{b}$ (Lines 5-11). If $K\!M_{ab}$ is valid, that is, if $K\!M_{a}$ and $K\!M_{b}$ are VKMs over the same database relation and the tuples $T_{ab}$ within $K\!M_{ab}$ are not empty, then we add $K\!M_{ab}$ to the next iteration pool $P_{next}$ and add the tuples $T_{ab}$ to $R$ for removal after the iteration (Lines 8-11). After all the possible intersections are processed, VKMInter iterates over $R$ and removes the tuples from each VKM of the pool $P$, making them proper disjoint keyword matches (Lines 12-16). Lastly, VKMInter recursively processes the pool $P_{next}$ for the next iteration, then it updates and returns the current pool $P$ (Lines 18-19).
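The following is a simplified Python sketch of this recursion; here a partial VKM is keyed by its relation and keyword set and maps to its set of tuple IDs. It follows the structure described above but omits the attribute-level bookkeeping of the actual algorithm.

```python
def vkm_inter(pool):
    """Recursively intersect partial VKMs until they are pairwise disjoint.

    pool: dict mapping (relation, frozenset_of_keywords) -> set of tuple IDs.
    """
    items = list(pool.items())
    next_pool = {}                            # P_next: new intersections
    to_remove = {km: set() for km in pool}    # R: tuples to strip from P
    for i, (km_a, t_a) in enumerate(items):
        for km_b, t_b in items[i + 1:]:
            if km_a[0] != km_b[0]:            # different relations: empty
                continue
            t_ab = t_a & t_b
            if t_ab:                          # valid new keyword match
                km_ab = (km_a[0], km_a[1] | km_b[1])
                next_pool[km_ab] = next_pool.get(km_ab, set()) | t_ab
                to_remove[km_a] |= t_ab
                to_remove[km_b] |= t_ab
    for km, tuples in to_remove.items():      # make current VKMs disjoint
        pool[km] -= tuples
    if next_pool:
        pool.update(vkm_inter(next_pool))     # recurse on the next pool
    return {km: t for km, t in pool.items() if t}
```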

After the execution of VKMInter, in Line 9 of VKMGen, we obtain the value-keyword matches and their tuples. As the sets of tuples are only required for the generation of VKMs, VKMGen generates and outputs the set of value-keyword matches, ignoring the tuples from $P$ (Lines 10-11). From this point on, Lathe does not need to manipulate the database tuples or their IDs.

Appendix C

SKMGen Algorithm

The generation of schema-keyword matches uses a structure we call the Schema Index, which is created in a preprocessing phase alongside the Value Index. This index stores information about the database schema and statistics about attributes, which are used for the ranking of QMs, as explained in Section VI. The stored information follows the structure below:\begin{equation*}I_{S}=\{relation:\{attribute:\{(norm,maxfrequency)\}\}\}\end{equation*}
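For instance, a Schema Index for the movie database of our examples could look like the nested dictionaries below; the statistics shown are illustrative values only.

```python
# Illustrative Schema Index: each attribute maps to precomputed
# (norm, max_frequency) statistics used later by the QM ranking.
schema_index = {
    "person":  {"name":  (153.7, 12)},
    "movie":   {"title": (201.4, 9)},
    "casting": {},  # no searchable text attributes
}
norm, max_frequency = schema_index["person"]["name"]
```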

The generation of SKMs is carried out by Algorithm 3, SKMGen. First, the algorithm iterates over the relations and attributes from the Schema Index. Then, SKMGen calculates the similarity between each keyword and each schema element, and considers only the pairs whose similarity is above a threshold $\varepsilon$ (Line 8), which are used to generate SKMs (Line 3).

Algorithm 3: SKMGen($Q$)

Appendix D

QMGen Algorithm

The generation of query matches is carried out by Algorithm 4, QMGen, which preserves the ideas proposed in MatCNGen [2], adapting them to keyword matches instead of tuple-sets. Let $V\!K$ and $S\!K$ be the sets of value-keyword matches and schema-keyword matches previously generated, respectively. The algorithm looks for combinations of keyword matches in $P{=}V\!K{\cup }S\!K$ that form minimal covers of the query $Q$. At first glance, this may suggest that we need to generate the whole power set of $P$ to obtain the complete set of QMs. However, it can be shown that any minimal cover of a set of $n$ elements has at most $n$ subsets [51]. Therefore, no match for a query $Q$ can be formed by more than $|Q|$ keyword matches.

Algorithm 4: QMGen($Q, V\!K, S\!K$)

It is easy to see that Algorithm 4 examines at most $\sum ^{|Q|}_{i=1}{\binom{|P|}{i}}$ combinations, which gives an upper bound on the number of query matches that must be generated for a query. It shows that the running time depends on two important factors: the size of the query and the size of the set of keyword matches $P$. Regarding the first factor, the size of a query is usually small, e.g., less than two keywords on average, and queries with more than four keywords are rare [2]. In such cases, this summation turns out to be a low-degree polynomial. The second factor, $|P|$, also depends on the query size, but the main issue to observe is how termsets are distributed among relations. This factor is harder to predict, but usually very few subsets of query terms are frequent in many relations; in fact, larger subsets are increasingly less frequent. Thus, in practice, just a few query matches need to be generated.
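As a sanity check on this bound, the snippet below computes it for a hypothetical 3-keyword query over 10 keyword matches.

```python
from math import comb

def qm_upper_bound(query_size, pool_size):
    """Upper bound on the combinations examined by QMGen:
    sum of C(|P|, i) for i = 1..|Q|."""
    return sum(comb(pool_size, i) for i in range(1, query_size + 1))

print(qm_upper_bound(3, 10))  # 10 + 45 + 120 = 175
```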

Also, as the QM ranking presented in Section VI-B penalizes QMs with a large number of KMs, we can define a maximum QM size $t{\leq }|Q|$ to prune QMs that are less likely to be relevant. For this reason, QMGen iterates over all the subsets of $P$ whose size is less than or equal to the maximum QM size $t$, which we set to $t=3$ in our experiments (Lines 3-4). Next, QMGen checks whether the combination $M$ of keyword matches forms a minimal cover of the query. This check is carried out by Algorithm 5.

Algorithm 5: MinimalCover($Q,M$)

The MinimalCover algorithm iterates through the KMs of the combination $M$, generating a set $C_{M}$ that comprises all keywords covered by $M$ (Lines 1-5). Next, the algorithm checks whether $M$ is total, that is, whether $C_{M}{=}Q$. Notice that, since KMs can only associate an attribute or relation of the database schema with a keyword from the query $Q$, that is, $C_{M}{\subseteq }Q$, it follows that $C_{M}{=}Q$ if, and only if, $|C_{M}|{=}|Q|$ (Line 6). Next, MinimalCover checks whether $M$ is minimal, that is, whether removing any keyword match from $M$ makes it no longer total. For this, MinimalCover iterates again through the KMs and, for each one, generates a set $C_{KM}$ that comprises all keywords covered by $K\!M$. Then, the algorithm checks whether the set difference $C_{M}{\setminus }C_{KM}$ is still equal to $Q$, which can be done by comparing $|C_{M}{\setminus }C_{KM}|$ with $|Q|$.
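A simplified Python sketch of this check follows; it tests minimality by recomputing the coverage without each keyword match, which is one way to implement the comparison described above. The function and parameter names are our own.

```python
def minimal_cover(query, kms, covered_by):
    """Return True if the combination of KMs is a minimal cover of the query.

    query: set of query keywords; kms: the combination M of keyword matches;
    covered_by: dict mapping each KM to the set of keywords it covers.
    """
    covered = set()
    for km in kms:
        covered |= covered_by[km]
    if covered != query:                 # totality: every keyword is covered
        return False
    for km in kms:                       # minimality: every KM is necessary
        rest = set()
        for other in kms:
            if other != km:
                rest |= covered_by[other]
        if rest == query:                # M without km would still be total
            return False
    return True
```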

If $M$ forms a minimal cover of $Q$, then $M$ is considered a query match. However, $M$ may contain keyword matches that can be merged, especially SKMs. The merging of KMs from $M$ is carried out by Algorithm 6. Notice that we cannot merge two VKMs, since they are disjoint sets; however, we can merge a schema-keyword match with either an SKM or a VKM. The MergeKeywordMatches algorithm uses two hash tables, $P_{V\!K}$ and $P_{S\!K}$, to store, respectively, the VKMs and SKMs according to the relation they are built upon (Lines 1-12). Next, the algorithm iterates through the relations present in $P_{SK}$ and tries to merge all possible KMs from each relation, resulting in a keyword match $K\!M_{merged}$. $K\!M_{merged}$ starts as a keyword-free match and is merged with all existing SKMs (Lines 14-17); then it is merged with an arbitrary value-keyword match $V\!K\!M$, if one exists (Lines 18-21). Lastly, $K\!M_{merged}$ and all value-keyword matches except $V\!K\!M$ are added to the query match $M'$, which is returned at the end of MergeKeywordMatches (Lines 22-23).

Algorithm 6: MergeKeywordMatches($Q,M$)

After merging all the possible elements from the query match $M$ , QMGen adds $M$ to the set of query matches $Q\!M$ , which is returned at the end of the algorithm.

Appendix E

QMRank Algorithm

The ranking of query matches is carried out by Algorithm 7, QMRank. Notice that, intuitively, the process of ranking QMs advances part of the relevance assessment of the CJNs, which was first proposed in CNRank [10]. This yields an effective ranking of QMs and a simpler ranking of CJNs. QMRank uses a value score and a schema score, which are respectively related to the VKMs and SKMs that compose the QM.

Algorithm 7: QMRank($QM$)

The algorithm first iterates over each query match, assigning 1 to both $value\_{}score$ and $schema\_{}score$. Next, QMRank goes through each keyword match of the QM. In the case of a KM matching the values of an attribute, the algorithm updates the $value\_{}score$ based on the cosine similarity using TF-IDF weights. QMRank retrieves the term frequency and the inverse attribute frequency from the Value Index, and the norm of an attribute from the Schema Index, all of which are calculated in the preprocessing phase (see Section IV-A). In the case of a KM matching the name of a schema element, the algorithm updates the $schema\_{}score$ with the average similarity of the keywords to the schema elements, based on the similarity functions presented in Section V. Once the algorithm aggregates the scores of the KMs to generate the score of each QM, the final step is to sort the QMs in descending order of score.
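The sketch below mirrors this scoring loop; the keyword-match interface (is_value_match, cosine_tfidf, avg_name_similarity) and the multiplicative aggregation of the two partial scores are assumptions made for illustration, not Lathe's exact code.

```python
def qm_rank(query_matches):
    """Score each QM from its KMs and sort in descending order of score."""
    scored = []
    for qm in query_matches:
        value_score, schema_score = 1.0, 1.0
        for km in qm:
            if km.is_value_match:
                # cosine similarity with TF-IDF weights from the indexes
                value_score *= km.cosine_tfidf()
            else:
                # average keyword-to-schema-element name similarity
                schema_score *= km.avg_name_similarity()
        scored.append((value_score * schema_score, qm))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [qm for _, qm in scored]
```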

Appendix F

Sound Theorem

Theorem 1:

Let $G_{S} = \langle \mathcal{R}, E_{G} \rangle$ be a schema graph. Let $J =\langle \mathcal{V}, E_{J} \rangle$ be a joining network of keyword matches. We say that $J$ is sound, that is, it does not produce more than one occurrence of the same tuple for any instance of the database if, and only if, the following condition holds for all $K\!M_{a} \in \mathcal{V}$ and all $\langle R_{a}, R_{b} \rangle \in E_{G}$:\begin{equation*} R\!I\!C(R_{a},R_{b}) \geq |\{K\!M_{c} \mid \langle K\!M_{a}, K\!M_{c} \rangle \in E_{J} \wedge R_{c}=R_{b} \}|\end{equation*} where $R\!I\!C(R_{a},R_{b})$ is the number of Referential Integrity Constraints from relation $R_{a}$ to relation $R_{b}$, and $R_{c}$ is the relation of $K\!M_{c}$.

Proof:

Let $R_{a}$ and $R_{b}$ be database relations such that there exist $n$ Referential Integrity Constraints (RICs) from $R_{a}$ to $R_{b}$. Intuitively, a tuple from $R_{a}$ may refer to at most $n$ tuples from $R_{b}$. Consider a joining network of keyword matches $J$ wherein a keyword match over $R_{a}$ is adjacent to $m$ keyword matches over $R_{b}$, that is, $J= \langle \mathcal{V}, E \rangle$, where $\mathcal{V}=\{K\!M_{1},\ldots,K\!M_{m+1}\}$, $E = \{ \langle K\!M_{1},K\!M_{i} \rangle \mid 2 \leq i \leq m+1\}$, and $R_{1} = R_{a} \wedge R_{i}=R_{b}$ for $2 \leq i \leq m+1$. We can translate $J$ into a relational algebra expression wherein the edges are join operations using RICs and the keyword matches are selection operations over relations. For didactic purposes, we assume, without loss of generality, that all the KMs of $J$ are keyword-free matches. Let $k_{j}$ be a key attribute from $R_{j}$ and $f_{i,j}$ be the attribute from $R_{i}$ that references $k_{j}$. The SQL translation of $J$ can be represented by $T_{m+1}$, which expands a join operation in each iteration:\begin{align*} T_{1} &= R_{1} \\ T_{2} &= T_{1} \bowtie_{f_{1,2}=k_{2}} R_{2} \\ T_{3} &= T_{2} \bowtie_{f_{1,3}=k_{3}} R_{3} \\ &\;\;\vdots \\ T_{n+1} &= T_{n} \bowtie_{f_{1,n+1}=k_{n+1}} R_{n+1} \\ T_{n+2} &= T_{n+1} \bowtie_{f_{1,x}=k_{n+2}} R_{n+2}, \text{ where } x \in \{2,\ldots,n+1\}\end{align*} Notice that, by iteration $n+2$, all $n$ RICs from $R_{a}$ to $R_{b}$ have already been used once. Therefore, if $m > n$, the expansion requires that we use one of the RICs twice, which leads to redundancy. For instance, if we assume $x=2$, without loss of generality, then:\begin{align*} T_{2} &= T_{1} \bowtie_{f_{1,2}=k_{2}} R_{2} \\ T_{n+2} &= T_{n+1} \bowtie_{f_{1,2}=k_{n+2}} R_{n+2}\end{align*} As the join conditions are stacked in each iteration, we have:\begin{equation*} f_{1,2}=k_{2} \wedge f_{1,2}=k_{n+2}\end{equation*} which implies that $k_{2} = k_{n+2}$ and, thus, all the resulting JNTs would have more than one occurrence of the same tuple for every instance of the database.

Appendix G

CJNGen Algorithm

The generation and ranking of CJNs is carried out by Algorithm 8, CJNGen, which uses a Breadth-First Search approach [18] to expand JNKMs until they comprise all elements of a query match.

Algorithm 8: CJNGen($RQM, G_{S}$)

Despite being based on the MatCNGen algorithm [2], CJNGen supports the generation of CJNs wherein there exists more than one RIC from one database relation to another, due to the definition of soundness presented in Theorem 1. Also, CJNGen does not require an intermediate structure such as the Match Graph used in the MatCNGen system.

We describe CJNGen in Algorithm 8. For each query match, CJNGen generates the candidate joining networks for this query match using an internal algorithm called CJNInter, which we will focus on describing in the remainder of this section.

In Algorithm 9, we present CJNInter. This algorithm takes as input a query match $M$ and the schema graph $G_{S}$. It chooses a KM from the QM as a starting point, resulting in a unitary graph (Lines 3-4). If the query match $M$ has only one element, we have already generated the only possible candidate joining network (Line 6).

Algorithm 9: CJNInter($G_{S}$, $M$, $score_{M}$)

Next, CJNInter initializes a queue $D$, which is used to store the JNKMs that are not yet CJNs (Lines 7-8). In Loop 9-27, CJNInter takes one JNKM $J$ from the queue and tries to expand it with KMs. Notice that $J$ can be expanded with both incoming and outgoing neighbors, therefore the algorithm uses an undirected version of the schema graph, $G_{S}^{U}$ (Line 14). Also, the elements of $M$ can be added only once to a JNKM, whereas keyword-free matches can be added several times.

The expansion of $J$ results in a JNKM $J'$ (Lines 18-19). Then, CJNInter verifies whether $J'$ was already generated and whether it is sound, as required by Definition 9. If $J'$ fails to meet these conditions, it is pruned (Line 20).

If $J'$ is not pruned, CJNInter checks whether $J'$ covers the query match $M$. If it does, $J'$ is a candidate joining network and is added to the list $C\!J\!N$. If $J'$ does not cover $M$, it is added to the queue $D$ (Lines 21-24). At the end of the procedure, CJNInter returns the set $C\!J\!N$ of candidate joining networks for the query match $M$ (Line 28).
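The following Python sketch captures the overall structure of CJNInter; the JNKM interface (nodes, expand, is_sound, covers) and the single_node_jnkm and candidate_expansions helpers are hypothetical stand-ins for the corresponding steps of Algorithm 9.

```python
from collections import deque

def cjn_inter(schema_graph, qm, max_size):
    """BFS expansion of JNKMs until they cover the query match."""
    start = single_node_jnkm(next(iter(qm)))  # pick a KM as starting point
    if len(qm) == 1:
        return [start]                        # the only possible CJN
    cjns, queue, seen = [], deque([start]), set()
    while queue:
        j = queue.popleft()
        for node in j.nodes():
            # expansion follows the undirected schema graph, so a JNKM can
            # grow through incoming and outgoing RICs alike
            for km in candidate_expansions(schema_graph, qm, node):
                j2 = j.expand(node, km)
                if j2 in seen or not j2.is_sound() or len(j2) > max_size:
                    continue                  # prune repeated or unsound JNKMs
                seen.add(j2)
                if j2.covers(qm):
                    cjns.append(j2)           # a candidate joining network
                else:
                    queue.append(j2)
    return cjns
```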

Note that the complexity of the CJN generation is mainly due to two factors: (1) there can be multiple KMs for each subset of keywords, so there may be several ways of combining these KMs into QMs such that all keywords are covered; and (2) given a QM, there can be many distinct ways of connecting its elements through PK/FK constraints and keyword-free matches.

Lathe implements some basic CJN pruning strategies to help decrease the candidate space, which are based on the following parameters: the top-$k$ CJNs, the top-$k$ CJNs per QM, and the maximum CJN size. In addition to pruning JNKMs that are not minimal or not sound, the algorithm implements strategies based on the maximum node degree, the maximum number of keyword-free matches, and the distinct foreign keys.

SECTION A.

Maximum Node Degree

As the leaves of a CJN must be keyword matches from the query match, a CJN has at most $|Q\!M|$ leaves. Also, considering that the maximum node degree in a tree is less than or equal to its number of leaves, we can safely prune the JNKMs that contain a node with a degree greater than $|Q\!M|$.

SECTION B.

Maximum Number of Keyword-Free Matches

The size of a CJN depends on the size of the query match and the number of keyword-free matches, that is, the size of a candidate joining network $C\!J\!N_{M}$ for a query match $M$ is given by $|C\!J\!N_{M}| {=} |M|{+}|F|$, where $F$ is the set of keyword-free matches. Thus, if we consider a maximum CJN size $T_{max}$, we can also set a maximum number of keyword-free matches for a CJN, given by $|F| {\leq } T_{max}{-}|M|$. Therefore, we can prune all JNKMs that contain more keyword-free matches than this maximum. Both pruning rules reduce to simple inequalities, as shown in the sketch below.
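In the following sketch, jnkm_degrees and the other parameter names are illustrative; each function implements one of the two pruning rules just described.

```python
def degree_ok(jnkm_degrees, qm_size):
    """A tree's maximum degree cannot exceed its number of leaves,
    which for a CJN is at most |QM|."""
    return max(jnkm_degrees) <= qm_size

def free_matches_ok(num_free_matches, m_size, t_max):
    """|F| <= T_max - |M| for a maximum CJN size T_max."""
    return num_free_matches <= t_max - m_size
```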

The number of CJNs generated can be further reduced by pruning and ranking them. In Section VII-A, we presented the ranking of the candidate joining networks returned by CJNGen. In Section VII-B, we presented pruning techniques for the generation of candidate joining networks by CJNGen and CJNInter.
