Journals & Magazines >IEEE Access >Volume: 11

Supporting Schema References in Keyword Queries Over Relational Databases

Lathe generates suitable SQL queries for database exploration from input keyword queries. First, the system matches keywords to schema elements. Then, it combines the key...

Abstract:

Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without knowing schema details or quer...Show More

Metadata

Abstract:

Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without knowing schema details or query languages. They take a keyword query, locate their corresponding elements in the target database, and connect them using information on PK/FK constraints. Although there are many such systems in the literature, most of them only support queries with keywords referring to the contents of the database and just very few support queries with keywords refering the database schema. We propose Lathe, a novel R-KwS that supports such queries. To this end, we first generalize the well-known concepts of Candidate Joining Networks (CJNs) and Query Matches (QMs) to handle keywords referring to schema elements and propose new algorithms to generate them. Then, we introduce two major innovations: a ranking algorithm for selecting better QMs, yielding the generation of fewer but better CJNs, and an eager evaluation strategy for pruning void useless CJNs. We present experiments performed with query sets and datasets previously experimented with state-of-theart R-KwS systems. Our results indicate that Lathe can handle a wider variety of queries while remaining highly effective, even for databases with intricate schemas.

Lathe generates suitable SQL queries for database exploration from input keyword queries. First, the system matches keywords to schema elements. Then, it combines the key...

Published in: IEEE Access ( Volume: 11)

Page(s): 92365 - 92390

Date of Publication: 25 August 2023

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2023.3308908

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Keyword search over relational databases enables naive/informal users to retrieve information from relational databases (DBs) without any knowledge about schema details or query languages. The success of search engines shows that untrained users are at ease using keyword search to find information of interest.

However, this can be challenging because the information sought frequently spans multiple relations and attributes, depending on the schema design of the underlying DB. Therefore, Relational Keyword Search (R-KwS1) systems must automatically determine which pieces of information to retrieve from the database and how to connect them to provide a relevant answer to the user.

In general, keywords may refer to both database values in tuples and schema elements, such as relation and attribute names. For instance, consider the query $``will~smith~films''$ over a database on movies. The keywords $``will''$ and $``smith''$ may refer to values of person names. The keyword $``films''$ on the other hand is more likely to refer to the name of a relation about movies. Although a significant number of query keywords correspond to schema references [1], the majority of previous work on R-KwS does not support references to the schema.

Handling keywords that refer to schema elements makes R-KwS significantly more challenging than the usual setting where only keywords referring to attribute values are considered. Firstly, it increases the complexity of the search process by requiring an understanding of the underlying database schema and its structure. Secondly, keywords referring to the schema introduce semantic ambiguity, making it difficult to disambiguate between schema references and attribute values. This ambiguity further complicates the search process and can lead to incorrect or incomplete results. Furthermore,integrating schema knowledge into the search process becomes crucial when handling schema references. Understanding PK/FK relationships and connecting relevant information adds an extra layer of complexity to the problem. Finally, ranking and relevance determination become more challenging when schema elements are involved. Existing systems may prioritize attribute values even if they do not provide useful answers. Accurately assessing relevance requires considering both attribute values and schema references. These challenges require dedicated techniques and algorithms specifically designed to handle schema references effectively.

In this work, we study new techniques for supporting schema references in keyword queries over relational databases. Specifically, we propose Lathe2, a new R-KwS system to generate a suitable SQL query from a keyword query, considering that keywords refer either to instance values or schema elements. Lathe follows the Schema Graph approach for R-KwS systems [2], [3]. Given a keyword query, this approach consists of generating relational algebra expressions called Candidate Joining Networks3 (CJNs), which are likely to express user intent when formulating the original query. The generated CJNs are evaluated, that is, they are translated into SQL queries and executed by a DBMS, resulting in several Joining Networks of Tuples (JNTs) which are collected and supplied to the user.

In the literature, the most well-known algorithm for CJN Generation is CNGen, which was first presented in the system DISCOVER [4], but was adopted by most R-KwS systems [5], [6], [7], [8]. Despite the possibly large number of CJNs, most works in the literature focused on improving CJN Evaluation and ranking of JNTs instead. Specifically, DISCOVER-II [6], SPARK [7], and CD [8] used information retrieval (IR) style score functions to rank the top-K JNTs. KwS-F [9] imposed a time limit for CJN evaluation, returning potentially partial results as well as a summary of the CJNs that have yet to be evaluated. Later, CNRank [10] introduces a CJN ranking, requiring only the top-ranked CJNs to be evaluated. MatCNGen [2], [11] proposed a novel method for generating CJNs that efficiently enumerated the possible matches for the query in the DB. These Query Matches (QMs) are then used to guide the CJN generation process, greatly decreasing the number of generated CJNs and improving the performance of CJN evaluation.

Among the methods based on the Schema Graph approach, Lathe is, to the best of our knowledge, the first method to address the problem of generating and ranking CJNs considering queries with keywords that can refer to either schema elements or attribute values. We revisited and generalized concepts introduced in previous approaches [2], [4], [10], [11], such as tuples-sets, QMs, and the CJNs themselves, to enable schema references. In addition, we proposed a more effective approach to CJN Generation that included two major innovations: QM ranking and Eager CJN Evaluation. Lathe roughly matches keywords to the values of the attributes or to schema elements. Next, the system combines the keyword matches into QMs that cover all the keywords from the query. The QMs are ranked and only the most relevant ones are used to generate CJNs. The CJN generation explores the primary key/foreign key relationships to connect all the elements of the QMs. In addition, Lathe employs an eager CJN evaluation strategy, which ensures that all CJNs generated will yield non-empty results when evaluated. The CJNs are then ranked and evaluated. Finally, the CJN evaluation results are delivered to the user. Unlike the previous methods, Lathe provides the user with the most relevant answer without relying on JNTs rankings. This is due to the effective rankings of QMs and CJNs that we propose, which are absent in the majority of previous work.

We performed several experiments to assess the effectiveness and efficiency of Lathe. First, we compared the results with those obtained with several previous R-KwS systems, including the state-of-the-art QUEST [12] system using a benchmark proposed by Coffman & Weaver [3]. Second, we assessed the quality of our ranking of QMs. The ranking of CJNs was then evaluated by comparing different configurations in terms of the number of QMs, the number of CJNs generated per QM, and the use of the eager evaluation strategy. Finally, we assessed the performance of each phase of Lathe, as well as the trade-off between quality and performance of various system configurations. Lathe achieved better results than all of the R-KwS systems tested in our experiments. Also, our results indicate that the ranking of QMs and the eager CJN evaluation greatly improved the quality of the CJN generation.

Our key contributions are: (i) a novel method for generating and ranking CJNs with support for keywords referring to schema elements; (ii) a novel algorithm for ranking QMs, which avoids the processing of less likely answers to a keyword query; (iii) an eager CJN evaluation for discarding spurious CJNs; (iv) a simple and yet effective ranking of CJNs which exploits the ranking of QMs.

The remainder of this paper is organized as follows: Section II reviews the related literature on relational keywords search systems based on schema graphs and support to schema references. Section IV summarizes all of the phases of our method, which are discussed in detail in Sections V –VII. Section VIII summarizes the findings of the experiments we conducted. Finally, Section IX summarizes the findings and outlines our plans for the future.

SECTION II.

Background and Related Work

In this section, we discuss the background and related work on keyword search systems over relational databases and on supporting schema references in such systems. For a more comprehensive view of the state-of-the-art in keyword-based and natural language queries over databases, we refer the interested reader to a recent survey [13].

A. Relational Keyword Search Systems

Current R-KwS systems fall in one of two distinct categories: systems based on Schema Graphs and systems based on Instance Graphs. Systems in the first category are based on the concept of Candidate Joining Networks (CJNs), which are networks of joined relations that are used to generate SQL queries and whose evaluation return several Joining Networks of Tuples (JNTs) which are collected and supplied to the user. This method was proposed in DISCOVER [4] and DBXplorer [5], and it was later adopted by several other systems, including DISCOVER-II [6], SPARK [7], CD [8], KwS-F [9], CNRank [10], and MatCNGen [2], [11]. Systems in this category make use of the underlying basic functionality of the RDBMS by generating appropriate SQL queries to retrieve answers to keyword queries posed by users.

Systems in the second category are based on a structure called Instance Graph, whose nodes represent tuples associated with the keywords they contain, and the edges connect these tuples based on referential integrity constraints. BANKS [14], BANKS-II [15], BLINKS [16] and, Effective [17] use this approach to compute keyword queries results by finding subtrees in a data graph that minimizes the distance between nodes matching the given keywords. These systems typically generate the query answer in a single phase that combines the tuple retrieval task and the answer schema extraction. However, the Instance Graph approach requires a materialization of the DB and requests a higher computational cost to deliver answers to the user. Furthermore, the important structural information provided by the database schema is ignored, once the data graph has been built.

B. R-KwS Systems Based on Schema Graphs

In our research, we focus on systems based on Schema Graphs, since we assume that the data we want to query are stored in a relational database and we want to use an RDBMS capable of processing SQL queries. Also, our work expands on the concepts and terminology introduced in DISCOVER [4], [6] and expanded in CNRank [10] and MatCNGen [2], [11]. This formal framework is used and expanded to handle keyword queries that may refer to attribute values or to database schema elements. As a result, we can inherit and maintain all guarantees regarding the generation of minimal, complete, sound, and meaningful CJNs.

The best-known algorithm for CJN Generation is CNGen, which was introduced in DISCOVER [4] but was later adopted as a default in most of the R-KwS systems proposed in the literature [5], [6], [7], [8]. To generate a complete, non-redundant set of CJNs, this algorithm employs a Breadth-First Search approach [18]. As a result, CNGen frequently generates a large number of CJNs, resulting in a costly CJN generation and evaluation process.

Initially, most of the subsequent work focused on the CJN evaluation only. Specifically, as many CJNs were generated by CNGen that should be evaluated, producing a larger number of JNTs, such systems as DISCOVER-II [6], SPARK [7], and CD [8] introduced algorithms for ranking JNTs using IR style score functions.

KwS-F [9] addressed the efficiency and scalability problems in CJN evaluation in a different way. Their approach consists of two steps. First, a limit is imposed on the time the system spends evaluating CJNs. After this limit is reached, the system must return the (possibly partial) top-K JNTs. Second, if there are any CJNs that have yet to be evaluated, they are presented to the user in the form of query forms, from which the user can choose one and the system will evaluate the corresponding CJN.

CNRank [10] proposed a method for lowering the cost of CJN evaluation by ranking them based on the likelihood that they will provide relevant answers to the user. Specifically, CNRank presented a probabilistic ranking model that uses a Bayesian Belief Network [19] to estimate the relevance of a CJN given the current state of the underlying database. A score is assigned to each generated CJN, so that only a few CJNs with the highest scores need to be evaluated.

MatCNGen [2], [11] introduced a match-based approach for generating CJNs. The system enumerates the possible ways which the query keywords can be matched in the DB beforehand, to generate query answers. MatCNGen then generates a single CJN, for each of these QMs, drastically reducing the time required to generate CJNs. Furthermore, because the system assumes that answers must contain all of the query keywords, each keyword must appear in at least one element of a CJN. As a result of the generation process avoiding generating too many keyword occurrence combinations, a smaller but better set of CJNs is generated.

Lastly, Coffman & Weaver [3] proposed a framework for evaluating R-KwS systems and reported experimental results over three representative standardized datasets they built, namely MONDIAL, IMDb, and Wikipedia, along with their respective query workloads. The authors compare nine R-KwS systems, assessing their effectiveness and performance in a variety of ways. The resources of this framework were also used in the experiments of several other studies on R-KwS systems [2], [7], [10], [11], [20].

C. Support to Schema References in R-KwS

Overall there are few systems in the literature that support schema references in keywords queries. One of the first such systems was BANKS [21], a R-KwS system based on Instance Graphs. However, hence the query evaluation with keywords matching metadata can be relatively slow, since a large number of tuples may be defined to be relevant to the keyword.

Support for schema references in keyword queries was extensively addressed by Bergamaschi et al. in Keymantic [1], KEYRY [22], and QUEST [12]. All these systems can be classified as schema-based since they aim at generating a suitable SQL query given an input keyword query. They do not, however, rely on the concept of CJNs, as Lathe and all DISCOVER-based systems do. Keymantic [1] and KEYRY [22] consider a scenario in which data instances are not acessible, such as in databases on the hidden web and sources hidden behind wrappers in data integration settings, where typically only metadata is made available. Both systems rely on similarity techniques based on structural and lexical knowledge that can be extracted from the available metadata, e.g., names of attributes and tables, attribute domains, regular expressions, or from other external sources, such as ontologies, vocabularies, domain terminologies, etc. The two systems mainly differ in the way they rank the possible interpretations they generate for an input query. While Keymantic relies on an extension the authors proposed for the Hungarian algorithm, KEYRY is based on the Hidden Markov Model, a probabilistic sequence model, adapted for keyword query modeling. QUEST [12] can be thought of as an extension of KEYRY because it uses a similar strategy to rank the mappings from keywords to database elements. QUEST, on the other hand, considers the database instance to be accessible and includes features derived from it for ranking interpretations, in contrast to KEYRY.

From these systems, QUEST is the one most similar to Lathe. However, it is difficult to draw a direct comparison between the two systems as QUEST does not rely on the formal framework from CJN-related previous work [2], [4], [6], [10], [11] and it also resolves a smaller set of keyword queries then Lathe. QUEST, in particular, does not support keyword queries whose resolution necessitates SQL queries with self-joins. As a result, when comparing QUEST to other approaches, the authors limited the experimentation to 35 queries rather then the 50 included in the original benchmark [3], [12]. Lathe, on the other hand, supports all 50 queries.

Finally, there are systems that propose going beyond the retrieval of tuples that fulfill a query expressed using keywords and try to provide a functionality close to structured query languages. This is the case of SQAK [23] that allows users to specify aggregation functions over schema elements. Such an approach was later expanded in systems such as SODA [24] and SQUIRREL [25], which aim to handle not only aggregation functions, but also keywords that represent predicates, groupings, orderings and so on. To support such features, these systems rely on a variety of resources that are not part of the database schema or instances. Among these are conceptual schemas, generic and domain-specific ontologies, lists of reserved keywords, and user-defined metadata patterns. We see such useful systems as being closer to natural language query systems [13]. In contrast, Lathe, like any typical R-KwS system, aims at retrieving sets of JNTs that fulfill the query, and not computing results with the tuples. In addition, it does not rely on any external resources.

SECTION III.

Problem Statement

Given a database that has $n$ relations $R_{1}, \ldots, R_{n}$ , where each relation has $m_{i}$ attributes $a^{i}_{1}, \ldots, a^{i}_{m_{i}}$ . Let a keyword query be a set of keywords $k_{1}, k_{2}, \ldots, k_{n}$ . Answering a keyword query over the database means finding a set of relational algebra expressions that match the query, that is, they match each keyword to at least one database element, which can be the name of a relation, an attribute name, or a value of an attribute.

We represent these expressions with Candidate Joining Networks, where the nodes comprise selections or projections over relations, and the edges represent join operations. That is, $C$ is a candidate joining network for $Q$ if, for each $k \in Q$ , exists at least one node $u$ in $C$ so that one of the following is true:

$u = \sigma _{a \ni k}(R_{u})$
$u = \pi _{a}(R_{u})$ , where $k=a$
$u = \sigma (R_{u})$ , where $k=R_{u}$

The first condition indicates whether a keyword matches the value of an attribute, while the second and third verifies if the keyword matches to an attribute name or a relation name, respectively.

For notational simplicity, we assume that the attributes of a primary to foreign key relationship have the same name, so we can freely join relations using natural joins. The generalization of the problem and the solution when these assumptions do not hold is trivial.

Also, for each edge $u{\rightarrow }v$ in $C$ , there exists a primary to foreign key relationship from $R_{u}$ to $R_{v}$ , so that we can join $u{\Join }v$ .

To ensure the connectivity, $C$ may also have some nodes which are not associated with any keyword, but they act as intermediate tables for the join operations. An intermediate node $u = R_{u}$ cannot be a leaf in $C$ , that is, its degree must be greater than 1.

SECTION IV.

Lathe Overview

In this section we present an overview of Lathe. We begin by presenting a simple example of the task carried out by our system. For this, we illustrate in Figure 1 a simplified excerpt from the well-known IMDb.4

FIGURE 1.

A simplified excerpt from IMDb.

R-KwS	Relational Keyword Search.
CJN	Candidate Joining Network.
QM	Query Match.
DB	Database.
SQL	Structured Query Language.
IR	Information Retrieval.
DBMS	Database Management System.
JNT	Joining Network of Tuples.
IMDb	Internet Movie Database.
KM	Keyword Match.
VKM	Value-Keyword Match.
SKM	Schema-Keyword Match.
PKFK	Primary Key/Foreign Key.
WUP	Wu-Palmer Measure.
LCS	Least Common Subsumer.
TF-IDF	Term Frequency – Inverse Document Frequency.
RIC	Relational Integrity Constraint.
JNKM	Joining Network of Keyword Matches.
MJNKM	Minimal Joining Networks of Keyword Matches.
R@K	Recall at K.
P@1	Precision at 1.
MRR	Mean Reciprocal Rank.
WMD	Word Mover’s Distance.
NLM	Neural Language Model.
ECLAT	Equivalence Class Clustering and bottom-up Lattice Traversal.

Supporting Schema References in Keyword Queries Over Relational Databases

Alerts

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction

Background and Related Work

A. Relational Keyword Search Systems

B. R-KwS Systems Based on Schema Graphs

C. Support to Schema References in R-KwS

Problem Statement

Lathe Overview

A. System Architecture

Keyword Matching

A. Value-Keyword Matching

Definition 1:

Example 1:

Example 2:

Example 3:

B. Schema-Keyword Matching

Definition 2:

Example 4:

1) Similarity Metrics

2) WordNet Database

3) Path Similarity

4) Wu-Palmer Similarity

C. Generalization of Keyword Matches

Definition 3:

Definition 4:

Query Matching

A. Query Matches Generation

Definition 5:

Example 5:

B. Query Matches Ranking

Candidate Joining Networks

Definition 6:

Example 6:

Definition 7:

Example 7:

Definition 8:

Example 8:

Theorem:

Example 9:

Definition 9:

Example 10:

A. Candidate Joining Network Ranking

B. Candidate Joining Network Pruning

Example 11:

Experiments

A. Experimental Setup

1) System Details

2) Datasets

3) Query Sets

4) Golden Standards

5) Metrics

6) Lathe Setup

B. Preliminary Results

C. Comparison With Other R-KwS Systems

D. Evaluation of Query Matches Ranking

E. Evaluation of the Candidate Joining Network Ranking

F. Performance Evaluation

G. Quality Versus Performance

Conclusion

Appendix AAcronyms

Acronyms

Appendix BVKMGen Algorithm

VKMGen Algorithm

Appendix CSKMGen Algorithm

SKMGen Algorithm

Appendix DQMGen Algorithm

QMGen Algorithm

Appendix EQMRank Algorithm

QMRank Algorithm

Appendix FSound Theorem

Sound Theorem

Theorem 1:

Proof:

Appendix GCJNGen Algorithm

CJNGen Algorithm

Appendix A
Acronyms

Appendix B
VKMGen Algorithm

Appendix C
SKMGen Algorithm

Appendix D
QMGen Algorithm

Appendix E
QMRank Algorithm

Appendix F
Sound Theorem

Appendix G
CJNGen Algorithm