Efficient SPARQL Query Generator for Question Answering Systems

Much like traditional database querying, the question answering process in a Question Answering (QA) system involves converting a user's question into query grammar, querying the knowledge base through that grammar, and finally returning the query result (i.e., the answer) to the user. The accuracy of query grammar generation is therefore critical in determining whether a QA system can produce a correct answer; generally speaking, an incorrect query grammar will never find the right answer. SPARQL is the most frequently used query language in question answering systems. In the past, SPARQL was generated from graph structures such as dependency trees and syntax trees. However, the query cost of generating SPARQL this way is high, which leads to long processing times when answering questions. To reduce the query cost, this work proposes a low-cost SPARQL generator named Light-QAWizard, which integrates multi-label classification into a recurrent neural network (RNN), builds a template classifier, and generates the corresponding query grammars based on the results of the template classifier. Light-QAWizard reduces the query frequency to DBpedia by aggregating multiple outputs into a single output through multi-label classification. In the experiments, Light-QAWizard's Precision, Recall, and F-measure were evaluated on the QALD-7, QALD-8, and QALD-9 datasets. Not only did Light-QAWizard outperform all other models, but its query cost was also nearly half that of QAWizard.

triples. An RDF triple has the form (Subject, Predicate, Object), as shown in Figure 2, and can be read as an entity (Subject), an attribute (Predicate), and a value (Object). For example, the statement ''the capital city of England is London'' consists of a subject (''England''), a predicate (''the capital city''), and an object (''London''). Note that the subject and the predicate can only be described by a URI, whereas the object can be expressed by either a URI or a literal.

The three main approaches to SPARQL generation are Semantic Query Graph (SQG) searches [2], [3], [8], template designs [1], and machine learning solutions [7], [9]. Based on an SQG generated from the dependency tree structure of a question, Ochieng [8] proposed a framework to translate natural language to SPARQL. gAnswer2 [3] uses the knowledge graph structure to recursively search subgraphs and find all possible RDF triples. gAnswer2 considers all possible SPARQL queries to increase the probability of finding an accurate answer, but it spends a lot of time sending unnecessary queries to DBpedia, increasing query costs. The query cost here refers to the frequency of queries issued to DBpedia (a single such query is illustrated in the sketch after this paragraph). WDAqua [2] uses N-grams in entity mapping, which also increases the frequency of queries to DBpedia. In addition, WDAqua only considers the semantic graph in the knowledge base and does not consider syntactic issues, resulting in low precision of query results. The experiments compare the performance of our proposed method to those of gAnswer2 [3] and related systems [12]; the precision, recall, and F-measure of the proposed QA system were higher than those achieved with [2], [3], [13].
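To make the notion of a query to DBpedia concrete, the following minimal sketch (our illustration, not part of the paper's implementation) issues a single SPARQL query for the running example triple (England, capital, London) against the public DBpedia endpoint; each such request counts once toward the query cost. It assumes the SPARQLWrapper Python package and that dbo:capital links dbr:England to dbr:London in the current DBpedia release.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Query the public DBpedia endpoint for the object of (England, capital, ?).
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?capital WHERE { dbr:England dbo:capital ?capital }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["capital"]["value"])  # expected: the URI of dbr:London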

The rest of the paper is organized as follows. Section 2 introduces the label powerset method and reviews related work.

This section introduces the concept of the label powerset (LP) method [14] in subsection II-A. We also briefly review the differences between our work and that in [2], [3], and [13]. As illustrated in Figure 3(b), the advantage of LP is that the classifier predicts all labels in a single output (a code sketch of this transformation follows this paragraph). Ochieng [8] builds a dependency tree for the question and searches the tree from the root node through its children to the leaf nodes, forming relational phrases by depth-first search (DFS). The entities of DBpedia are then checked for matches against each relational phrase, and the dependency tree is used to create an SQG. Each subgraph of the SQG is assigned a score, and the scores are sorted to select the top-k subgraphs as candidates for generating SPARQL queries. WDAqua [2] uses N-grams to find all related entities in DBpedia: each mapped entity is treated as a root from which DBpedia is explored by breadth-first search (BFS) to depth 2, and SPARQL queries are generated according to the calculated distances between entities. The query results are ranked, and the first-ranked answer is returned to the user as the final answer.
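As a concrete illustration of LP, the following sketch (our own code, not the paper's) maps each example's combination of template labels to a single merged class, so that one classifier prediction recovers all labels at once. The label combinations are hypothetical, but the merged IDs mirror the style of the paper's template IDs (e.g., ''BB'').

    # Label powerset (LP): every distinct combination of labels becomes one
    # class, so a multi-label task is handled by a multi-class classifier.
    def to_powerset_classes(label_seqs):
        class_of = {}                          # merged label string -> class index
        y = []
        for labels in label_seqs:
            key = "".join(labels)              # e.g. ("B", "B") -> "BB"
            class_of.setdefault(key, len(class_of))
            y.append(class_of[key])
        return y, class_of

    # Hypothetical examples: template label combinations for four questions.
    y, classes = to_powerset_classes([("A",), ("B", "B"), ("A", "C"), ("B", "B")])
    print(y)        # [0, 1, 2, 1]
    print(classes)  # {'A': 0, 'BB': 1, 'AC': 2}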

In contrast to the aforementioned approaches, Xser [1] represents a different branch, designing structured perceptrons to detect entity types. Based on the entity types, a dependency graph in the form of a semantic DAG is constructed to represent the relationships between entities. According to these relationships, pre-defined rules are used to generate SPARQL queries that satisfy the query intent. After the SPARQL queries are executed, the answers are evaluated against the ranking results. Building on a maximum-entropy Markov model (MEMM), our past work QAWizard [13] proposed two stages, namely a training stage and a query stage. The training stage learns entity types and RDF labels from the questions defined in the QALD dataset. In the query stage, answering an input question involves several steps: preprocessing, entity type tagging, entity mapping, RDF tagging, SPARQL generation, evaluation, and answer filtering. The processing steps of QAWizard are similar to those in Light-QAWizard. The main differences are that in the latter, (1) RDF tagging is directly integrated into the subsequent SPARQL generation; (2) an RNN is used to answer questions while considering the context of the question; (3) the multi-label problem is treated as a multi-class problem using LP to reduce the query cost of QAWizard; and (4) a bidirectional LSTM-CRF (conditional random field), referred to as BiLSTM-CRF [20], is used to label the entities of the input question in order to improve the accuracy of entity tagging; a minimal tagger sketch is given after this paragraph.
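The sketch below illustrates difference (4), assuming TensorFlow/Keras (which the paper uses for its other models); the CRF transition layer of BiLSTM-CRF [20] is replaced here by a per-token softmax for brevity, so this is an approximation rather than the paper's exact tagger. All sizes are illustrative.

    import tensorflow as tf

    VOCAB, TAGS, EMB, UNITS = 10000, 8, 100, 128   # illustrative sizes

    # BiLSTM tagger: predicts one entity-type tag (e.g., E, R, C) per token.
    tagger = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB, EMB, mask_zero=True),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(UNITS, return_sequences=True)),
        tf.keras.layers.Dense(TAGS, activation="softmax"),
    ])
    tagger.compile(optimizer="adam", loss="sparse_categorical_crossentropy")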

The system architecture shown in Figure 4 consists of two stages: a training stage and a query stage. In the QALD training data, each ''question'' entry includes the language encoding, the question text, and the keywords in the question; ''query'' is the SPARQL query used to request answers from DBpedia; and ''answer'' holds the expected answers.

The pruned results are shown in Table 3, where the midpoint variable ?x is represented by lowercase letters (such as IDs a, b, and c), and ?ans is represented by uppercase letters (such as IDs A, B, and C). After pruning unnecessary templates, the filtered templates for two RDF triples and three RDF triples are as shown in Table 4. The template IDs shown in Table 3 are the same as the labels defined in the LP. The labels are merged to transform the problem into multi-class classification via the LP method. For example, a merged label appears in Table 4 in the ID field; it is the output of the template classifier and is later used to generate the corresponding SPARQL queries. The number of labels is thereby reduced from 259 to 32. The statistical results are shown in Table 5. As an example, the generated SPARQL query for the sentence ''How many pages are there in War and Peace?'' is shown below.
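A plausible form of the generated single-triple query (a reconstruction based on the DBpedia ontology; the paper's exact variable names and template instantiation may differ) is:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?ans WHERE { dbr:War_and_Peace dbo:numberOfPages ?ans }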

After the user enters a question, the question is parsed and processed in six steps, as shown in Figure 1, including preprocessing, entity type tagging, entity mapping, answer retrieval, and answer type filtering. The preprocessing step parses the input question and includes tokenization, lemmatization, and part-of-speech (POS) tagging. In the entity type tagging step, the trained entity tagger assigns an entity type to each token. Next, in the entity mapping step, each tagged token is matched to a named entity. The answer retrieval step generates the corresponding SPARQL query and evaluates the query results. Finally, in the answer type filtering step, the query results are filtered according to the answer type, and the final answer is returned to the user. Table 6 shows the question ''List all musicals with music by Elton John'' after the preprocessing step; a preprocessing sketch follows.
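The following sketch illustrates the three preprocessing operations with NLTK; the paper does not name its NLP toolkit, so the library choice is an assumption for illustration.

    import nltk
    from nltk.stem import WordNetLemmatizer

    # One-time resource downloads for the tokenizer, POS tagger, and lemmatizer.
    for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
        nltk.download(pkg, quiet=True)

    question = "List all musicals with music by Elton John"
    tokens = nltk.word_tokenize(question)       # tokenization
    pos_tags = nltk.pos_tag(tokens)             # POS tagging, e.g. ('musicals', 'NNS')
    lemmas = [WordNetLemmatizer().lemmatize(t.lower()) for t in tokens]  # lemmatization
    print(pos_tags)
    print(lemmas)   # e.g. 'musicals' -> 'musical'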

Based on the sentence processed in the preprocessing step, we use the trained entity type tagger model to tag the sentence; an illustrative tagging is sketched below.
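An illustrative tagging of the running example (the token segmentation and the O tag for non-entity tokens are our assumptions; only the E, C, and R tags are taken from the paper):

    # Token-level entity types produced by the tagger (illustrative only).
    tagged = [("List", "O"), ("all", "O"), ("musicals", "C"),
              ("with", "O"), ("music", "R"), ("by", "R"),
              ("Elton", "E"), ("John", "E")]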

Only tokens tagged E (entity), C (class), or R (relation) are considered in the entity mapping step. As for E, the token is matched against DBRDict-A and DBRDict-B [21], which hold the designations and abbreviations, respectively, used in the DBpedia repository, as shown in Table 7 and Table 8. In the same way, we build DBRDict-P for relation entities, an example of which is shown in Table 10; a dictionary-lookup sketch follows.
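The sketch below illustrates dictionary-based entity mapping; the dictionary entries are illustrative stand-ins for the contents of Tables 7, 8, and 10, not the paper's actual data.

    # Surface forms mapped to DBpedia identifiers (illustrative entries).
    DBRDICT_A = {"elton john": "dbr:Elton_John"}   # designations
    DBRDICT_B = {"nyc": "dbr:New_York_City"}       # abbreviations
    DBRDICT_P = {"music by": "dbo:musicBy"}        # relation phrases

    def map_token(surface, tag):
        key = surface.lower()
        if tag == "R":                             # relations use DBRDict-P
            return DBRDICT_P.get(key)
        # entities: designation dictionary first, then abbreviations
        return DBRDICT_A.get(key) or DBRDICT_B.get(key)

    print(map_token("Elton John", "E"))  # dbr:Elton_John
    print(map_token("music by", "R"))    # dbo:musicBy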

According to the template classifier model, the input question is classified into one of the templates listed in Table 4. Algorithm 1 is the main function that calls the corresponding sub-functions to generate the SPARQL queries based on the classification result. For example, if the delivered result is BB, function B is called twice. The entity type E is placed in the S or O position, R is placed in the P position, and C is placed in the O position with P designated as rdf:type. A simplified sketch of this dispatch is shown below.
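In the sketch below, the sub-function bodies and the treatment of midpoint variables are assumptions for illustration, but the per-character dispatch and the S/P/O placements follow the description above.

    def triple_B(e, r, var="?ans"):
        # Template B: entity E in S, relation R in P, answer variable in O.
        return f"{e} {r} {var} ."

    def triple_C(c, var="?ans"):
        # Class constraint: P is designated rdf:type and class C is placed in O.
        return f"{var} rdf:type {c} ."

    def generate_sparql(template_id, e, r, c):
        patterns = []
        for label in template_id:        # e.g. "BB" calls sub-function B twice
            if label == "B":
                patterns.append(triple_B(e, r))
            elif label == "C":
                patterns.append(triple_C(c))
            # ... remaining template letters dispatch to their own sub-functions
        return "SELECT ?ans WHERE { " + " ".join(patterns) + " }"

    print(generate_sparql("B", "dbr:War_and_Peace", "dbo:numberOfPages", None))
    # SELECT ?ans WHERE { dbr:War_and_Peace dbo:numberOfPages ?ans . }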

The returned answers include Billy_Elliot_the_Musical and Lestat_(musical). The answer type filtering step then filters these answers according to the expected answer type.

Python was used to implement the system, with a TensorFlow [23] network architecture for the RNN deep learning model. The parameters used are listed in Table 12. The study compares the LSTM, GRU, and Bi-LSTM models, with the LSTM Layer and LSTM Unit parameters set to 1, 2, 3 and 64, 128, 256, respectively. BERT, proposed by Devlin et al. [24], could be used to resolve the synonym problem according to the context shared between sentences. The POS tag embedding was trained on the Treebank dataset [25] to classify POS tags, which helps capture the semantics of natural language and increases the precision of the trained model. The embedding size was set to 30 because the number of part-of-speech tag labels was 36. Window size is the parameter for training the POS tag embedding; on average, 7 words needed to be analyzed per question in QALD-7, QALD-8, and QALD-9, so the window size was set to 5. Training time was set to 100 epochs to reduce the loss. Batch size refers to the size of each batch of data. The Adam optimizer was applied to adjust the weights and biases to minimize the loss. We compared performance at learning rates of 0.01, 0.05, and 0.001; the loss for each learning rate is shown in Figure 7. The results show that the loss is stable when the learning rate is 0.001. Dropout was used to handle the overfitting problem. A configuration sketch reflecting these settings is given below.
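The following sketch builds the template classifier with the reported settings (POS-tag embedding of size 30 over 36 tags, 32 merged LP classes, Adam at learning rate 0.001, dropout, 100 epochs); the exact layer wiring, dropout rate, and batch size are our assumptions, not values confirmed by the paper.

    import tensorflow as tf

    POS_TAGS, EMB_DIM, UNITS, CLASSES = 36, 30, 128, 32  # 32 merged LP labels

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(POS_TAGS + 1, EMB_DIM, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(UNITS)),
        tf.keras.layers.Dropout(0.5),                         # handles overfitting
        tf.keras.layers.Dense(CLASSES, activation="softmax"), # one LP class per template ID
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    # model.fit(x_train, y_train, epochs=100, batch_size=32)  # sizes per Table 12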
To give further detail on the tagging results for the different QALD datasets, Table 14 shows the precision on QALD-7, QALD-8, and QALD-9 at 73.91%, 84.84%, and 70.73%, respectively, when Bi-LSTM is adopted. Table 15 compares the models across datasets in terms of average precision (Precision), average recall (Recall), and average F-measure (F-measure). The precision scores of Light-QAWizard were 0.565, 0.462, and 0.398, respectively, the best precision compared to QAWizard, gAnswer2, and WDAqua. The average recall scores were also better than those of gAnswer2 and WDAqua on the three datasets. The F-measures of Light-QAWizard were 0.594, 0.457, and 0.406 on the QALD-7, QALD-8, and QALD-9 datasets, respectively, outperforming those of QAWizard, gAnswer2, and WDAqua.

In the query cost comparison, the numbers of entities, attribute entities, and class entities are denoted by e, r, and c. Note that w is the number of SPARQL queries generated with a distance condition by WDAqua. gAnswer2 [3] generates a dependency tree for a natural language question, converts it into a query graph that contains semantic information, finds subgraphs in the graph through the knowledge base, and uses the subgraphs to generate the relative query syntax; its query cost is e × r × c. WDAqua [2] uses N-grams to perform entity comparisons with DBpedia for each word in the question. Each entity is treated as a starting point, a breadth-first search (BFS) of depth 2 is started in DBpedia, and its distance is calculated to generate SPARQL queries; the query cost of WDAqua is (e + r + c)² × 2 + w. QAWizard [7] contains two stages, entity type tagging and RDF type tagging, and uses pre-designed templates to generate the SPARQL queries; its query cost is e × r × c × 2 on average. Based on the query cost calculation in Table 17, if the query cost of QAWizard is n, then the query cost of gAnswer2 is n/2 and the query cost of WDAqua is 5n. Our method uses multi-label classification to reduce the query cost of QAWizard to n/2 (these costs are collected in the display after the following list). The advantages of Light-QAWizard are summarized below:
1) Light-QAWizard outperforms QAWizard, gAnswer2, and WDAqua in terms of average precision, recall, and F-measure.
2) SPARQL query templates are trained on the QALD-7, QALD-8, and QALD-9 datasets, so only the necessary SPARQL queries are kept, reducing query costs. Light-QAWizard achieves the lowest query cost compared to QAWizard and WDAqua.
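For reference, the per-question query costs stated above can be written compactly as

    \begin{aligned}
    C_{\text{gAnswer2}}       &= e \cdot r \cdot c,\\
    C_{\text{WDAqua}}         &= 2\,(e + r + c)^{2} + w,\\
    C_{\text{QAWizard}}       &= 2 \cdot e \cdot r \cdot c,\\
    C_{\text{Light-QAWizard}} &= e \cdot r \cdot c = \tfrac{1}{2}\,C_{\text{QAWizard}},
    \end{aligned}

so that with n = C_QAWizard, gAnswer2 and Light-QAWizard each cost n/2, matching the figures above.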

540
The goal of a QA system is to accurately answer users' questions. SPARQL query generation largely drives the query cost, which reflects the frequency of queries sent to DBpedia, so issuing only the necessary queries is key to answering questions efficiently. This paper proposes a classification model that integrates an RNN to learn, from experience, how to pick out suitable SPARQL queries. To reduce query costs, LP is adopted to combine labels when generating the SPARQL queries. The accuracies on QALD-7, QALD-8, and QALD-9 are 73.91%, 84.84%, and 70.73%, respectively. Its performance on metrics including precision, recall, F-measure, and query cost surpasses that of all other systems evaluated on the same test sets.

Although the proposed system achieves superior performance, further work should be considered to improve the quality of answers. For example, other multi-label classification algorithms, such as binary relevance, classifier chains, and pairwise methods, could be used to answer complex questions that include comparatives and superlatives. Moreover, the experimental dataset, QALD, is still small, and is therefore limited in its ability to train a model that covers almost all types of questions. LC-QuAD [26], a larger dataset, could be used for training to increase the accuracy of the model.