Enhanced Natural Language Interface for Web-Based Information Retrieval

Database application is at the core of most web application systems such as web-based email, source codes repository management, public scientific data repository management, news portals, and publication repository of various fields. However, the usage of these database systems for data and information retrieval is severely limited because of lacking support for processing search queries expressed in a natural language (NL). Most web interfaces for databases today only take search queries entered in some form of logical combination of keywords or text strings, which restrict the scope and depth of what a web user really wants to search for, even though natural language based data or information retrieval has made significant advances in recent years. To overcome or at least to alleviate such limitation in web information services, we propose in this article an improved neural model based on an existing framework IRNet for NL query of databases, in which a representation of Gated Graph Neural Network (GGNN) is introduced to encode the database entities and relations. We also represent and use the database values in the prediction model to identify and match table and column names for automatic synthesize a correct SQL statement from a query expressed in a NL sentence. Experiments with a public dataset demonstrates the promising potential of our approach.


I. INTRODUCTION
Nowadays database (DB) application is the backbone of most web-based information services such as web-based email, source codes repository management, public scientific data repository management, news portals, and publication repositories of various fields [1]- [3]. Fig. 1 is a snapshot of a website Web of Science [4] that provides an interface for searching scientific publications from its database of scientific citations. Like most similar web database interfaces [5]- [7], it supports search method of using keywords as well as complex formulas with identifier, Boolean operators, and brackets. Such search method is more advanced and powerful than simple keywords-based search [8]- [10]. But its actual search power may be limited by its complexity The associate editor coordinating the review of this manuscript and approving it for publication was Long Wang . which requires some level of knowledge about the database content and the expertise of using the search tool. To most users without such knowledge and expertise, most likely they will not be able to take full advantage of the search tool for their data or information needs. Such limitation can only be overcome or at least alleviated by a natural language interface with the support of NL query to SQL query (NL-SQL) or text to SQL (TTS) capabilities. Please note ''NL-SQL'' and ''TTS'' will be used interchangeably in this article.
In research of TTS, deep learning (DL) has been the primary choice for the last few years. Many variations of DL algorithms have been developed and tested using WikiSQL dataset [11] a first large-scale Text-to-SQL (TTS) dataset. Zhong et al. [11] proposed Seq2SQL, a model based on Seq2Seq structure and utilizing reinforcement learning to generate SQL queries; Xu et al. [12] proposed SQLNet based on Seq2SQL, which uses a sequence-to-set model and a col- VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ umn attention mechanism to substantially improve accuracy without the use of reinforcement learning; Yu et al. [13] proposed TypeSQL, which views the TTS problem as a slot filling task and uses type information to better understand rare entities and numbers in the natural language questions; Dong et al. [14] proposed a structure-aware neural network that decomposes the semantic parsing process into two stages: first, generating its sketch based on the input sentences, and then, filling in missing details by considering natural language and the sketch itself, later, with the advent of more and better pre-trained language models, increasing number of researchers applied them to improve their TTS algorithms on WikiSQL, such as SQLova [15] which incorporates the BERT model [16]. WikiSQL dataset is limited for it only contains databases of single table. Yu et al. [17] proposed a large-scale, complex, and cross-domain Text-to-SQL dataset Spider containing databases of multiple tables. Spider dataset supports investigations of sophisticated NLP and DL models for predicting a large number of complex SQL queries. Some recent studies using Spider have developed more advanced methods and algorithms achieving impressive results. SyntaxSQLNet [18] is the first model developed for the Spider task using a syntax tree representing the features of the SQL queries. It also proposed a method for generating cross-domain training data to enhance model performance with data augmentation. Lee [19] proposed a SQL clause-wise decoding neural architecture with a self-attention based database schema encoder for the Spider task. Wang et al. [20] presented a unified framework, called RAT-SQL, based on the relationaware self-attention mechanism, to address schema encoding, schema linking, and feature representation within a text-to-SQL encoder. Bogin et al. [21] presented an encoder-decoder semantic parser, where the structure of the DB schema is encoded with a graph neural network (GNN) [22].
Guo et al. [23] propose a very interesting deep neural network based approach IRNet to tackle complex and cross-domain Text-to-SQL problems using Spider dataset. By decomposing an TTS tasks into three phases and using an intermediate representation [24]- [28], IRNet not only provides an effective alternative approach to addressing the mismatch problems and difficulties of predicting columns caused by the large number of out-of-domain words, but also presents ''break-point'' for intermediate performance analysis which is what really interests us most. For instance, in IRNet an intermediate representation called SemQL is designed to bridge NL and SQL. An SQL query is inferred from the synthesized SemQL of the query with domain knowledge. Such capability allows us to perform certain analyses of intermediate results to gain better understanding of parts of its DL algorithm, so that targeted revision or algorithm improvement can be made. Because the rich expressiveness of this representation, we will adopt it in our extended implementation of IRNet [23].
In analyzing IRNet performance on Spider dataset, we found database values, as an important part of the database, can provide valuable information for database field prediction. In this article we will introduce, in the framework of IRNet, database value into the prediction model to identify and match table and column names in natural language queries.
To investigate and understand what we can do to further improve the performance of IRNet, we have conducted a series of experiments, and performed some in-depth analyses of causes of mismatches. We identify two causes: (1) mismatches due to lack of the representations of relations between tables and (2) mismatches due to lack of the representations of database values. To address these two issues, we propose two extensions to the IRNet: (1) add Gated Graph Neural Network (GGNN) [21], [22] to IRNet to encode the database structure; (2) represent database values in the prediction model.

II. METHODS
Database values can often provide valuable information or clues about the correspondences between words in a NL query and database fields, which can be used to improve the accuracy of identification of targeted table and column. Schema of database structures i.e. tables, columns and their relations are informatively invaluable for predicting SQL queries. How to represent and use such information in a deep neural network-based model has profound impact to its performance.

A. BASE MODEL
We adopt basic IRNet [23] framework ( Fig. 2(a)) in our implementation, an overview of which is shown in (Fig. 2(b)). An TTS task is carried out within the framework in three phases or subtasks. In the first phase, words or tokens in an NL query are paired with database fields by Schema Linking. Subsequently, target columns and tables are determined by leveraging database values together with database schema in the prediction model. We introduced a representation of Gated Graph Neural Network (GGNN) [21], [22] to encode the DB schema replacing the original IRNet representation of DB schema. Relations between DB tables and relations between tables and columns missing in IRNet are now fully represented and used in the model. An attention mechanism [29] is implemented to compute the Value Attention Embedding to be added in the column Embedding, which helps the model to identify potential table and column names in the NL query. An intermediate representation (IR) for the query is obtained and represented in a domain-specific language, called SemQL [23]. A syntax-based neural network model is used to synthesize SemQL query. Finally, A SQL query is generated based on SemQL and domain knowledge [23]. Transforming SemQL to SQL is done by traversing the SemQL tree from its root to leaf nodes.

B. EXTRACTING TABLE DATA FROM WEB PAGES
To obtain data on a web page and organize it in a formalized way using database, we analyze the tags corresponding to the   table in the HTML code of the webpage. In general, web pages contain a lot of information. Each web page may be composed of text, tables, pictures, links, etc. Besides these, there are also a large number of HTML tags and CSS [30] styles that are used to control the layout and display of the web. The target data that we really need to extract is scattered among the above-mentioned different forms of HTML components. Therefore, it is necessary to locate the tables from the web page [31], [32].
After locating the table on the webpage, we must further identify the validity of the table [33], [34]. In addition to displaying data and information in tabular form, tables in web pages can also be used to generate layouts and show effects. Because of the multiple uses of HTML table tags, the table area usually contains some invalid information such as web page layout and advertisement. Thus, before extracting table data, invalid tables must be removed to keep the data that is really needed.
We use a web crawler [35], [36] to get the table data. The general crawler routine is nothing more than the steps of sending a request, obtaining a response, parsing a web page, extracting data, and saving data. For a complete crawler, VOLUME 9, 2021 FIGURE 3. The process of obtaining tables on the web. We use python tools to extract tables from HTML source code.
the amount of code can range from dozens of lines to hundreds of lines, and the cost is relatively high, so we take the read_html function of the Pandas tool [37] in Python as our approach. As shown in Fig. 3, first, we send a request to the specified website URL and get the source code of the page, then, for the purpose of locating the table in the HTML page and filter out the redundant information, we apply XPath combined with manual features to preprocess. After that, we use read_html function in Pandas to parse the potential data, this function can directly capture the table in the web page. Finally, we store the parsed tabular data in a csv file and import the csv file into the SQL database.  of length 1-6 in a NL query sentence, and then compute the similarity of these n-grams to the database values: where words in (1) denotes the n-grams, and content denotes the database value. If the similarity exceeds a certain threshold, the corresponding database value is taken as selected. An attention mechanism [38] (Fig. 4) is then applied to the value and the corresponding natural language question, so that the NL query sentence can carry potentially valuable information of the database value for column prediction.
where Embedding in (2) is a function that converts a string to a corresponding embedded matrix, λ is the similarity threshold. (3) is the Embedding of the selected database value, E q is the Embedding of the natural language question, W v is the parameter matrix, w v represents the attentional weight of the database value on the natural language question, and E q|v in (4) is the Embedding of the natural language question that combines the database value. Finally, the E q|v is added to the IRNet column Representation (column Embedding) as a Value Attention Embedding.

D. ENCODING DB SCHEMA WITH GRAPH NEURAL NETWORK
As shown in Fig. 5, Gated Graph Neural Network (GGNN) [21], [22] is used in IRNet to encode database structure, where tables and columns are represented as nodes. Affiliation relationships between columns and tables as well as the primary foreign key relationship are represented as edges. In order to include more information about the relationships between tables and columns, column Embedding and table Embedding are computed from the GGNN instead of the initial embedding in IRNet's Schema Encoder.
The x in (5) denotes node features, which is filled with 0 if there are not enough dimensions, and h (0) v is used as the initial state of the node. The →, ↔} in (6) denotes the two types of edges, and ε type is the set of edges, in which each node recalculates its representation in each step according to the representation of its neighbors in the previous step. (7) denotes the final representation of each node computed by GRU [39]. In the GGNN approach, each node represents a table or a column, and the final representation contains a global schema structure.
The calculation process is as follows: Embedding. At each step, each node re-computes its representation based on the representation of its neighbors in the previous step.

III. EXPERIMENTS
Since the SQL used when querying the web-based information is relatively simple, it may not contain complex SQL components, such as Join, Group by, Union etc., so we extracted some simple samples from the Spider data set that match the difficulty of the web query to test our model performance. The purpose of our experiments is to demonstrate VOLUME 9, 2021 how much performance improvement can be gained by our proposed extensions to the IRNet and its ability to query the web-based information. As such the experiments were designed to only test the original IRNet implementation and our proposed extended implementation, based on which a preliminary comparative evaluation will be made.

A. DATASET
In our experiments, Spider [17] dataset is used, which contains over 10,000 natural language questions and their corresponding SQL statements from 200 databases of 138 domains with multiple tables, each database contains 5.1 tables on average. In the Spider data set, each query sample has a label, and the difficulty level is determined according to the number of SQL components, selections, and conditions. There are 4 levels of difficulty, for example, a SQL query containing only SELECT, FROM, and WHERE components will be marked as simple. And if a query contains more than two SELECT columns, two or more WHERE conditions and GROUP BY two columns, or contains EXCEPT or nested queries, it is considered difficult.

B. IMPLEMENTATION
We implement our TTS system with PyTorch [40]. Following IRNet, parameters for the neural network are chosen empirically as shown in Table 1. The dimension of hidden vector is set to 300. The dimensions of action embedding and node type are 128 and 64. Word embedding is initialized using Glove [41] and shared between the NL encoder and the Schema encoder. Adam [42] is selected as the optimizer, and the hyper parameters are all default settings. Dropout rate [43] and Batch size are set to 0.3 and 64 respectively. The iterations of GGNN is 3.

C. RESULTS
When the user expresses a natural language query for webbased information, it is often not too complicated, they tend to ask a simple question like ''How many acting statuses are there?'' rather than a more sophisticated one ''Show the name and number of employees for the departments managed by heads whose temporary acting value is 'Yes'?''. In general, SQL statements corresponding to complex natural language queries are also more difficult to parse. To adapt to the features of retrieving web-based information, we simplified the Spider data set to make the samples more similar to those when retrieving network information to verify the performance of the model during this work. Table 2 shows the model's performance on samples similar to those often appear in web-based information retrieval.    Table 3 shows the exact matching accuracy of IRNet and IRNet with our improvements. By adding database value to the model and using GGNN to encode the database structure, our methods improve the IRNet model by 2.2% overall. The table also shows the breakdown of prediction accuracy in groups of different difficulty level. Our approach clearly improves the accuracy of column and table prediction for easy, medium, hard and extra hard groups. To examine how each technique contributes to the performance, we conduct an ablation analysis (Table 4) of two aspects: 1) without GGNN, 2) without DB Value. Without GGNN, the model's exact matching accuracy decreases by 1.3%, and Without DB Value, the model's exact matching accuracy decreases by 1.1%. This means that using GGNN to encode DB Schema can effectively incorporate relationships between database structures into the model, and adding database value on top of the model improves the exact matching accuracy.
To study the performance of our methods in detail, we measure the average F1 score on different SQL components. We compare between IRNet and Ours. As shown in Fig. 6, Our methods outperform IRNet on SELECT, WHERE, GROUP BY and ORDER BY, but has a slightly decreased for AND/OR. Fig. 7 presents an example to illustrate intuitively how our method rectify the error of column prediction with database value. As is shown, the column 'Language' is never referred in the natural language question, thus the original model generated the SQL with no column named 'Language' but a wrong column 'Hight_definition_TV'. With our approach of combining database value, the word ''English'' in the NL query is identified to be a potential database value. Then model then uses it to seek for the column it belongs to, and finally produces a correct SQL query statement. In contrast, an incorrect SQL query is generated without using our method.

D. DISCUSSION
To show the value of maximizing the use of information embedded in relational databases in order to improve the prediction performance of a TTS system, we have described following two new algorithmic components as extensions to the IRNet neural model: 1) Introducing database values into the model, computing the similarity between natural language or textual questions or queries and the database values, and establishing correlations between database values and column names through an Attention mechanism. 2) Using Gated Graph Neural Network (GGNN) to encode complete database schema, not only including tables and columns but also their relations; The preliminary experimental results have demonstrated the improvement over the original IRNet implementation, and the potential value of our approach to enhance web-based information retrieval capabilities.
Some of the limitations of our approach are also clearly observed in our analysis and evaluation of the experimental results.

1) TABLE AND COLUMN PREDICTION
Our method cannot predict the correct column in approximately 50% of all analyzed errors. In about 20% of errors, it selects a column from another table, so the table's prediction is also wrong. The main reason for these errors is that the columns in different tables have similar names, so it is difficult to distinguish. Examples of such column names often appear in multiple tables. Incorporating more appropriate schema linking methods (for example, embedding-based methods rather than string-based methods used in IRNet) may help reduce such errors.

2) SQL SKETCH PREDICTION
In about 33% of the cases, we found errors in the SQL sketch. However, it is worth noting that the majority (69%) of these errors occurred in queries classified as Hard or Extra Hard in the Spider dataset. Some difficulties and special situations require advanced common sense, which is difficult to incorporate into the model. However, some examples of errors with lower difficulty may be easily solved with domain knowledge. VOLUME 9, 2021

IV. CONCLUSION
In this article, we have discussed two improvements to the NL-SQL model in IRNet for web-based data and information retrieval. First, we introduced a representation of Gated Graph Neural Network to encode the database structure. Second, we include database values in our prediction model to compute the correlation between database values and column names, in order to alleviate the difficulties in matching column names due to lack of sufficient details in natural language queries. We train the revised model with the Spider dataset. The experimental results with the testing dataset empirically validate the merits of our model, and demonstrate its potential to gain performance improvement for web-based data and information retrieval. The method and algorithms discussed in this paper can be also applied to other types of DB based application systems. For future work, we will conduct further investigation of using natural language model(s) combined with application domain knowledge or semantic in developing TTS models and algorithms of higher performance.