Smart Information Retrieval: Domain Knowledge Centric Optimization Approach

In the age of the Internet of Things, online data have witnessed a significant growth in terms of volume and diversity, and research into information retrieval has become one of the important research themes in the Internet-oriented data science research. This paper introduces a novel domain knowledge centric methodology aimed at improving the accuracy of using machine learning methods for relation extraction from text data, which is critical to the accuracy and efficiency of information retrieval-based applications, including recommender systems and sentiment analysis. The proposed methodology makes a significant contribution to the processes of domain knowledge-based relation extraction including interrogating Linked Open Datasets to generate the relation classification training data, addressing the imbalanced classification in the training datasets, determining the probability threshold of the best learning algorithm, and establishing the optimum parameters for genetic algorithms, which were utilized to optimize the feature selection for the learning algorithms. The experimental evaluation of the proposed methodology reveals that the adopted machine-learning algorithms exhibit higher precision and recall in relation extraction in the reduced feature space optimized by our implementation. The considered machine learning includes support vector machine, perceptron algorithm uneven margin, and K-nearest neighbors. The outcome is verified by comparing against the random mutation hill-climbing optimization algorithm using Wilcoxon signed-rank statistical analysis.


I. INTRODUCTION
The Internet of Things (IoT) paradigm is increasing the amount of data being made available online [1], [2]. This is due to the integration of the Internet with many heterogeneous areas, such as the Internet of Healthcare Things (IoHT) in medicine, the Internet of Vehicles (IoV) in transport, and the Internet of Industrial Things (IIoT) in industry [3], [4]. The growing online data can be analyzed to satisfy the information needs of a variety of intelligent or smart applications and services, including advising financial investors about a potential business risk, informing the music industry about an emerging consumer trend, and alerting drivers using traffic predictions [5]. However, the online-published data is diverse in terms of volume and complexity, largely unstructured and written in natural human languages, which makes its manual exploitation infeasible. Therefore, Information Extraction (IE) techniques are needed to automate the interpretation of data written in natural language text.
Named entity recognition and relation extraction are the two fundamental processes of IE. Extracting the relations between the named entities, such as that between an organization and an employee, is critical to the identification of the problem domain's key events, and is therefore key to the majority of IE applications such as semantic search, question answering, knowledge harvesting, sentiment analysis and recommender systems [6].
There are two main approaches to relation extraction: Rule-based and Machine Learning (ML) approaches. While Rule-based approaches rely on transforming the linguistic feature space into lexical and syntactic patterns that are applied to natural language texts in order to extract relations, ML approaches do not require deep linguistic skills and use trained classifiers to extract relations from unstructured text [6]. Similar to the work of Minard et al. in [7], our relation extraction method adopts a hybrid approach that integrates both Rule-based and ML techniques. Our approach relies on Rule-based techniques for recognizing named entities, extracting relation instances and generating feature vectors; Supervised ML techniques are then utilized for Relation Extraction based on the named entities' relation instances and their feature vectors. For Named Entity Recognition we used the Rule-based ANNIE (A Nearly-New Information Extraction) pipeline system in GATE's NLP engine [8]. With respect to relation extraction, we implemented and evaluated three ML classifiers that are commonly adopted for relation extraction from unstructured text: Support Vector Machine (SVM), Perceptron Algorithm Uneven Margin (PAUM) and K-Nearest Neighbor (KNN).
The success of supervised ML is affected by two factors. The first factor is the quality of the training datasets, i.e. the quality and representation of the class instances in the training datasets. If the training datasets contain significant irrelevant, unreliable, noisy or redundant information, then creating accurate classification models during the training phase will be more difficult [9]. The second factor is the relevance of the feature vectors that represent distinctive characteristics of the classes in the training datasets. The process of identifying and removing the undesirable features is called feature selection, which reduces the dimensionality of the data and increases the speed and efficiency of the classifiers' operations [10]. Several feature selection approaches have been proposed with different selection techniques such as heuristic methods and Evolutionary Algorithms (EAs). A popular feature selection technique uses Genetic Algorithms (GAs) as a wrapper approach, where the candidate feature subsets are evaluated by using the classifier to detect the possible interaction between features. GAs are widely and successfully used to solve the feature selection problem [11], [12]. However, to the best of our knowledge, no work has been published so far on the use of GAs for feature selection in the relation classification process. In this effort, we aim to employ GAs as a wrapper approach for feature selection to improve the accuracy of relation classifiers. With respect to the quality of the training datasets, we intend to exploit knowledge about the target domain, in particular the taxonomy of its key concepts and the likely relations between them, to aid the process of detecting the candidate relations in the training dataset as well as extracting an extended set (lexical, syntactic, Named Entity) of training features. Semantic Web Technologies (SWTs) will be utilized as the modeling tool for domain knowledge as they facilitate the organization of information into a highly structured knowledgebase that can be comprehended and processed by software agents.
This paper presents a novel methodology for integrating domain knowledge with supervised ML to improve the processes of Relation Extraction from unstructured text.
We utilize semantic modeling for constructing the domain knowledge and GAs for optimizing the learning algorithms' feature subset. Our proposed approach makes several contributions to the methods of knowledge-based relation extraction including: 1) Interrogating Linked Open Data (LOD) datasets to efficiently generate the relation classification training data; 2) Reducing the training data True Negative/Positive imbalance; 3) Setting the best-fit learning algorithms' probability threshold; 4) Establishing the optimum GA parameters.
The findings of our research also make a valuable contribution to the understanding of the impact of specific feature types (lexical, syntactic, Named Entity) and feature grouping on the accuracy of the relation classification process for the target application domain.
Our experimental evaluation revealed that all the adopted relation classifiers perform significantly better, in terms of relation extraction precision and recall, in the reduced feature space optimized by GAs. Moreover, using the Wilcoxon statistical analysis test, we verified that our implementation of GAs represents an appropriate choice for optimizing the process of feature selection for the relation classification problem by comparing it against a space search algorithm that has similar operational dynamics, Random Mutation Hill-Climbing (RMHC).
This paper is organized as follows. Section II summarizes the related works on relation extraction and feature selection. The main processes of our proposed domain-specific approach to relation extraction are described in Section III. The ML-based relation classification tasks are introduced in Section IV. The feature selection task and its optimization are explained in Section V. Section VI evaluates the performance of the GA-optimized ML classification, which is further analyzed in Section VII by contrasting it with optimization based on the Random Mutation Hill-Climbing algorithm. Section VIII summarizes the findings of the paper and Section IX presents the conclusions and our plans for further work.

II. BACKGROUND AND RELATED WORKS
The focus of this paper is on optimizing the ML relation classification process of our hybrid rule-based/supervised ML relation extraction approach. There are two key processes in the supervised ML pipeline that can significantly impact the classification accuracy: class instance labeling and feature vector generation. Both processes can benefit from formalized knowledge of the problem domain, which can play an important role in understanding the syntactic and semantic characteristics of the problem domain's text and subsequently in improving the Natural Language Processing tasks associated with automating or semi-automating the instance labeling process. For instance, in our implementation of Machine Learning based relation classification, domain-specific knowledge is used to compile some of our training datasets by drawing on relation mentions that feature as ground facts in public datasets such as DBpedia and Freebase. This alleviates the manual annotation effort for relation extraction, which can be a time-consuming and cumbersome task [13].
The second key process in the supervised ML pipeline is feature vector generation. ML classification tasks require assigning feature vectors to a finite set of classes in their training datasets. Searching for an optimal feature subset can be computationally expensive, especially when the feature vector is high-dimensional. Several methods have been developed for generating feature subsets, such as sequential search, which includes forward and backward search, complete search, which includes exhaustive search, and the more common random search, where all operators generate and select feature subsets randomly. Examples of random search implementations include evolutionary algorithms, simulated annealing and random mutation hill-climbing.
After feature subsets are generated, they are evaluated against a certain criterion to measure the improvement in the accuracy of the targeted classification model. Based on the evaluation criteria, feature selection approaches can be classified into two categories, the Filter approach and the Wrapper approach [12]. The Filter approach assesses the relevance of features by describing a dataset from the perspective of consistency, dependency and distance metrics. All the features are scored and ranked based on certain statistical criteria; the features with the highest-ranking values are selected and the low-scoring features are removed. The feature subset is selected independently of the classifier, because the Filter approach ignores the targeted classification model's performance on the reduced feature set. On the other hand, the Wrapper approach embeds the targeted classification model's performance to assess the relevance of the features. After a search procedure in the space of possible feature subsets is defined and various subsets of features are generated, the evaluation of a specific subset of features is obtained by training and testing the targeted classification model. To search the space of all feature subsets, a search algorithm is wrapped around the classification model [14], [15].
Several studies have compared the Filter and Wrapper evaluation criteria. All these studies agree that the Filter approach requires fewer computational resources than the Wrapper approach because it does not involve the targeted classification model's performance in assessing the selected feature subsets. They also agree that the Wrapper approach is more accurate than the Filter approach, as it selects the best feature subset by directly involving the targeted classification model's performance in the accuracy measures to ensure that it is improved [12], [14].
Considering that the ML model performance can be affected by an individual feature as well as by combinations of two or more features in a feature set, this research investigates the application of automatic search techniques, in particular Genetic Algorithms as a wrapper approach, to improve the process of feature subset selection. Although this technique is computationally more demanding than Filter-based feature selection, we argue that the computational overhead is not critical to the performance of our Information Extraction system, as the feature selection optimization is applied as a one-off process to optimize the performance of the machine learning classifiers for each target problem domain.
Genetic Algorithms as a Wrapper approach have been used to solve the feature selection optimization problem in diverse areas of Machine Learning based classification problems ranging from Named Entities Recognition [16] to diagnosis and treatment of heart conditions [17].

III. DOMAIN-SPECIFIC RELATION EXTRACTION FROM UNSTRUCTURED DOCUMENTS
Our approach integrates domain knowledge with ML classification to improve the fundamental information retrieval tasks of Named Entity Recognition and Relation Classification. The approach is based on comprehensive analysis of the key concepts and relations of the targeted domain, which are modeled, using Semantic Web technologies, into a formal ontology that is used to semantically tag the entities and interrelations extracted from relevant Web documents. This effectively transforms the initial 'conceptual' domain knowledge into an enriched knowledgebase that can be intelligently explored by means of sophisticated interrogation of the integral and inferred facts within a single document or a set of interrelated documents [18]. The tasks of our approach are implemented in three main phases, as depicted in Fig. 1:
1) Phase one: Domain analysis and constructing the knowledge map and then translating it into a formal semantic model, ontology.
2) Phase two: Natural Language pre-processing tasks for Named Entity Recognition including, relation detection, features extraction and training datasets generation.
3) Phase three: Relation classification including feature selection by utilising supervised ML and then inserting the semantically annotated information into the semantic ontology.
The unstructured data source of this research is online financial news articles. They are retrieved by using the Rich Site Summary (RSS) feeds of sources including BBC, Reuters and Yahoo Finance. For the purpose of training dataset generation, we retrieved 6135 documents from the online news RSS feeds. Table 1 presents some examples of those news RSS feed links.
Building the domain's knowledge map aims to create a prearranged vocabulary and semantic structure for exchanging information about that domain. We modeled the domain knowledge in terms of the problem (use case) domain's key concepts, their interrelations and the characteristics of the data, as well as the interaction with the target beneficiary groups. Then, the knowledge map is translated into a formal semantic model, an ontology. The ontology can be utilized to source knowledge from publicly available datasets that are published using the same standardized formalism. Moreover, ontology reasoning can infer more information about knowledge facts in different contexts [18]. As shown in Fig. 2, the target domain knowledge is structured as a map of interrelated concepts that can be easily revised and improved by both the domain experts and knowledge engineers.
In the problem domain model in Fig. 2, the central concepts are Organization, Location and Person, from which other economic domain super-concepts and sub-concepts, such as organization type and stock index, are derived. We then defined the properties of the concepts and the relations between them.
The following subsections describe in detail the preprocessing tasks for our proposed hybrid relation classification approach.

A. RELATION DETECTION
Our relation extraction approach is implemented at the sentence level. Every entity pair for a targeted relation that appears in a sentence in unstructured data is identified and annotated as a relation instance and is assumed to represent one relation type. Relation detection grammar rules are encoded using GATE's pattern matching language JAPE (Java Annotation Patterns Engine) [19]. The number of detected sentences and relation instances of the targeted relations in this work is shown in Table 2. These relation instances will be used to compile the relation classification's training datasets.

B. FEATURE EXTRACTION
We argue that domain knowledge can assist in selecting the relation classifiers' feature vector. Therefore, we exploit the semantic knowledge of the problem domain to extract new features that expand on the feature set used in traditional ML relation classification efforts such as that by Mintz et al. in [20]; for instance, we added dependency paths and entity description features. As the dependency path (grammatical relation) between the related entities is not always apparent, we took into consideration the dependency paths of all words in the sentence including the candidate relation entities. The entity description features include the entity's Part of Speech annotation, the entity string and the number of words in the entity.
The features are grouped into three categories: lexical features, syntactic features and Named Entity features, as illustrated in Table 3. These features are extracted by using JAPE rules in the GATE Embedded framework and added to every relation instance in the unstructured data.

IV. ML-BASED RELATION CLASSIFICATION
Selecting an appropriate ML algorithm depends on the problem specification and the nature of the data [21]. We implemented and evaluated three different supervised ML relation classifiers: Support Vector Machine (SVM), Perceptron Algorithm Uneven Margin (PAUM) and K-Nearest Neighbor (KNN). The works of Li et al. in [22], Piskorski and Yangarber in [23] and Witten et al. in [24] reveal that these algorithms are used in IE tasks with adequate results.
SVM is a supervised ML algorithm that has proved effective for a diversity of classification tasks including many IE tasks. The most important parameters of this implementation are the SVM cost (C, the cost associated with allowing training errors, i.e. the soft margin) and the uneven margins parameter (τ or tau, which sets the value of the uneven margins of the SVM) [22], [25].
PAUM is a simple and effective learning algorithm, especially for large training datasets. It has been successfully used for document classification and IE. It has three parameters: the positive (p) and negative (n) margins, which allow PAUM to handle imbalanced datasets better, and the modification of the bias term parameter (optB) [26].
This work uses the GATE implementation for the three ML algorithms above as explained in the work of Cunningham et al. in [8].
[Algorithm listing fragment: GA feature selection pseudo-code; only the following steps are recoverable]
Apply Roulette Wheel technique to select two parent chromosomes, C_j and C_k, where 0 ≤ j, k < N and j ≠ k
9: Generating new chromosomes
10: Apply two-point crossover operation on chromosomes C_j and C_k with probability Pc
11: Apply all-points mutation operation on chromosomes C_j and C_k with probability Pm
12: Let new chromosomes be C
The algorithms above can implement both binary and multi-class classifiers. Multi-class classification is usually solved in terms of multiple binary classifications by using simple "one-vs-others" or "one-vs-another" models [22]. Rifkin and Klautau [28] argue that the "one-vs-others" approach is simple and robust, that the accuracy of its results is better than or similar to that of other approaches such as the single machine and error-correcting coding approaches, and that it requires fewer models. For these reasons, a number of studies have employed this multi-class approach; for example, the work of Archibald and Fann [29] and the work of Chandrashekar and Sahin [10]. Hence, we adopted the "one-vs-others" method to transform the multi-class classifier into multiple binary classifiers.
The key elements affecting the accuracy of supervised ML algorithms are the training datasets, the feature vector and the learning model parameters. The next subsections present how we generated the training datasets, tuned the algorithms' parameters and selected the best feature subsets for relation classification.
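The "one-vs-others" transformation described above can be illustrated with a minimal Python sketch. The relation names and the scores are hypothetical and serve only as an illustration; the paper itself relies on the GATE implementations, so this is not the authors' code.

```python
from typing import Dict, Hashable, List, Sequence


def one_vs_others_labels(labels: Sequence[Hashable], positive: Hashable) -> List[int]:
    """Relabel a multi-class target as binary: 1 for `positive`, 0 for every other class."""
    return [1 if label == positive else 0 for label in labels]


def predict_one_vs_others(scores: Dict[Hashable, float]) -> Hashable:
    """Return the class whose binary classifier produced the highest score."""
    return max(scores, key=scores.get)


# Hypothetical relation labels for four relation instances.
labels = ["employs", "located_in", "none", "employs"]

# One binary training target per class.
binary_tasks = {c: one_vs_others_labels(labels, c) for c in sorted(set(labels))}

# Hypothetical per-class scores for one unseen instance.
predicted = predict_one_vs_others({"employs": 0.2, "located_in": 0.7, "none": 0.1})
```

Each entry of `binary_tasks` would be used to train one binary classifier, so a problem with n classes needs n models.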

A. GENERATING THE TRAINING DATASETS
We adopted two methods to generate the labelled instances for the training datasets: manual annotation, and automatic extraction of ground facts from existing public datasets.

1) GENERATING TRAINING DATASETS FROM ONLINE STRUCTURED DATASETS
We have employed Semantic Web technologies to model our problem domain knowledge and subscribe the retrieved data to it using the Resource Description Framework (RDF) standard. The same standardized metadata is used in public datasets in the Linked Open Data (LOD) Cloud to publish ground facts that are relevant to various problem domains. These ground facts can be used to compile training datasets for relation classification and to enrich the resulting knowledgebase. Hence, we adopted a knowledge-driven distant supervision ML approach to extract common entity pairs' relations by utilizing two existing knowledge datasets as a distant supervision source for ML relation classification. These datasets are DBpedia and Freebase. At the time of writing this document, DBpedia contained more than 4.5 million entities and more than 3 billion RDF triples for a diversity of languages. The Freebase dataset contained approximately 47.5 million topics and 2.9 billion facts in the English language.
The training datasets were built by retrieving the relations between any two entities in a single sentence in the unstructured document that are mentioned in Freebase or DBpedia as ground facts. Each of these relations is assumed to be a class instance, or true positive, in the training datasets. The relations mentioned in the semantic datasets were extracted by using JENA's SPARQL engine. JENA is a free and open source Java framework for building Semantic Web and Linked Data applications, and SPARQL is an RDF query language recommended by the W3C for interrogating semantic stores. The complete implementation details of this task were published in our previous paper [18].
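The distant supervision labelling step can be sketched as follows. This is only an illustrative Python sketch: the paper uses JENA's SPARQL engine in Java, the example URIs and the chosen predicate are assumptions, and a small in-memory set of triples stands in for the remote knowledge base so that no query is actually sent.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI)


def build_ask_query(subject_uri: str, predicate_uri: str, object_uri: str) -> str:
    """Build a SPARQL ASK query testing whether one triple exists as a ground fact."""
    return f"ASK {{ <{subject_uri}> <{predicate_uri}> <{object_uri}> }}"


def label_instance(candidate: Triple, ground_facts: Set[Triple]) -> int:
    """Label a candidate relation instance 1 (true positive) if it is a known ground fact."""
    return 1 if candidate in ground_facts else 0


# Illustrative URIs only (assumed, not taken from the paper).
candidate = (
    "http://dbpedia.org/resource/Microsoft",
    "http://dbpedia.org/ontology/keyPerson",
    "http://dbpedia.org/resource/Satya_Nadella",
)
query = build_ask_query(*candidate)

# Stand-in for the knowledge base: a local set of known triples.
ground_facts = {candidate}
label = label_instance(candidate, ground_facts)
```

In a real pipeline the ASK query would be executed against the SPARQL endpoint and its Boolean answer would replace the local set lookup.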

2) GENERATING TRAINING DATASETS MANUALLY
Although manual annotation of ML relation instances is a labour-intensive task, it is generally considered to be more precise than automatic annotation. In this research, we applied manual annotation to generate training datasets for extracting uncommon relations between entity pairs that could not be found in the existing semantic datasets, DBpedia and Freebase. We employed GATE annotation tools to extract the training instances for ML. Table 4 shows the three training datasets that were collected manually.

B. PARAMETERS OPTIMIZATION
The optimization of the ML algorithms' parameters is the problem of choosing a set of parameter values that improves the ML classifiers' performance.
Lorena and De Carvalho [30] report that there are generally three methods to find the optimal ML algorithm parameters: using the default values, defining the values by grid search, and automatic search through optimization techniques such as GAs. Grid-based search is commonly used to perform parameter optimization, where the default values for the ML algorithms' parameters are evaluated against the other values in the grid. In this work, we adopted grid-based search to perform parameter tuning as it is sufficient to satisfy the requirements of the deployed ML techniques and is simple to implement in comparison with the computationally expensive automatic optimization techniques [31].
In practice, grid search starts with a finite set of reasonable values for each parameter. These values are selected manually in accordance with the specifications of each algorithm. Then, the selected grid sets are used to train the ML algorithms and evaluate their performance against the ground truth in a k-fold validation process. Finally, the parameters that achieve the highest model performance are chosen [31], [32]. In this work, the finite sets of parameter values for SVM and KNN (parameters C and tau for SVM, K for KNN) were heuristically selected by studying the specifications and recommendations of those algorithms. However, for the PAUM algorithm parameters (p, n and optB), we relied on the parameter values recommended in the work of Li et al. [33]. The parameter values selected by grid search compared favorably with the traditionally accepted default values for the SVM, PAUM and KNN algorithms. Table 5 shows the parameters of SVM, PAUM and KNN that were selected using the grid search experiments.
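The grid search procedure described above can be sketched in Python as follows. The grid values and the scoring function are hypothetical placeholders (the values actually selected are those reported in Table 5); in practice `evaluate` would train the classifier with the given parameters and return its k-fold cross-validated score.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def grid_search(
    param_grid: Dict[str, List[float]],
    evaluate: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Evaluate every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_cv_score(params: Dict[str, float]) -> float:
    """Stand-in for a k-fold F1 score; it peaks at C = 0.7 and tau = 0.4 by construction."""
    return 1.0 - abs(params["C"] - 0.7) - abs(params["tau"] - 0.4)


# Hypothetical candidate values for the SVM parameters C and tau.
grid = {"C": [0.1, 0.7, 1.0], "tau": [0.0, 0.4, 1.0]}
best_params, best_score = grid_search(grid, toy_cv_score)
```

The cost grows with the product of the grid sizes, which is why the candidate sets are kept small and chosen manually.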

V. OPTIMIZING FEATURE SELECTION USING GENETIC ALGORITHMS
The features in the solution space for Relation Classification are loosely related, which makes the utilization of manual search techniques difficult. Hence, we automate the feature selection process by applying Genetic Algorithm search in a wrapper approach. In the wrapper approach, the classifier model itself is employed to measure the fitness of a feature set; in other words, the features selected depend on the classifier model used.
We have adopted the conventional implementation of GAs, which generally comprises the initialization of the solution space population, population reproduction, crossover and mutation operations, and the definition of the fitness function for evaluation. However, several techniques can be deployed to implement the aforementioned operations; for instance, there are two techniques for population reproduction, steady-state and generational, and there are several methods for population initialization such as randomness, compositionality and non-compositionality. Similarly, parent selection can be performed using Stochastic Universal Sampling (SUS) or Roulette Wheel Selection (RWS), and parent replacement can be based on the replacement of the worst parent or the replacement of random parents. The crossover operation can be applied at one or two crossover points in the chromosome, and the mutation operation can be applied to one or more genes in the chromosome [34]-[36]. We conducted a series of experiments to heuristically determine the techniques that represent a better fit for our feature selection problem.
In our implementation, the genetic information, or chromosome, is represented by a binary string of 1's and 0's (genes) that operates as a feature filter, where every bit or gene in the chromosome represents a certain feature (see Fig. 3). If the bit value equals one, its feature is selected to participate in constructing the classifier model; otherwise the feature is removed. The size of the feature vector in this work is 20, which means that the size of the chromosome is 20 bits.
For the purpose of using GAs as a wrapper approach, the ML classifiers are utilized to assess feature subsets according to their classification performance. In detail, we define the fitness function using the classification F1 score, which is computed by evaluating the relation classification model using k-fold Cross Validation. The fitness values are computed as follows:
1) By filtering a specified chromosome, a feature subset is generated to train the relation classification model.
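The chromosome encoding can be illustrated with a short Python sketch. The feature names `f0` to `f19` are placeholders, since the actual 20 features are those listed in Table 3.

```python
from typing import List, Sequence


def apply_chromosome(chromosome: Sequence[int], feature_names: Sequence[str]) -> List[str]:
    """Keep the features whose gene is 1 and drop those whose gene is 0."""
    if len(chromosome) != len(feature_names):
        raise ValueError("chromosome and feature vector must have the same length")
    return [name for gene, name in zip(chromosome, feature_names) if gene == 1]


# Placeholder names for the 20 features (hypothetical).
feature_names = [f"f{i}" for i in range(20)]

# A 20-bit chromosome that selects the first, third and last feature.
chromosome = [1, 0, 1] + [0] * 16 + [1]
selected = apply_chromosome(chromosome, feature_names)
```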
2) The generated feature subset is evaluated by applying k-fold Cross Validation on the classification models with the targeted training dataset and feature subset as an input.
3) The resulting F1-score is assumed to be the fitness function value for the specified chromosome or feature subset.
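The three steps above can be sketched as a Python fitness function. This is a minimal sketch under stated assumptions: `train_and_predict` is a hypothetical callback standing in for the GATE classifier, the fold assignment is a simple interleaved split, and the toy data and toy classifier exist only to make the example runnable.

```python
from typing import Callable, List, Sequence

Rows = List[List[int]]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fitness(
    chromosome: Sequence[int],
    X: Rows,
    y: Sequence[int],
    train_and_predict: Callable[[Rows, List[int], Rows], List[int]],
    k: int = 10,
) -> float:
    """Mean k-fold F1 of the classifier restricted to the features the chromosome selects."""
    selected = [i for i, gene in enumerate(chromosome) if gene == 1]
    if not selected:
        return 0.0  # an empty feature subset cannot train a model
    Xs = [[row[i] for i in selected] for row in X]  # step 1: filter the features
    scores = []
    for fold in range(k):  # step 2: k-fold cross validation
        test_idx = list(range(fold, len(Xs), k))
        held_out = set(test_idx)
        train_idx = [i for i in range(len(Xs)) if i not in held_out]
        preds = train_and_predict(
            [Xs[i] for i in train_idx], [y[i] for i in train_idx], [Xs[i] for i in test_idx]
        )
        scores.append(f1_score([y[i] for i in test_idx], preds))
    return sum(scores) / len(scores)  # step 3: the mean F1 is the fitness value


# Toy data: feature 0 equals the label, feature 1 carries no information.
y = [1, 0, 1, 0, 0, 1, 0, 1]
X = [[label, 0] for label in y]


def toy_classifier(X_train: Rows, y_train: List[int], X_test: Rows) -> List[int]:
    """Toy stand-in that ignores training and predicts the first selected feature."""
    return [row[0] for row in X_test]


good = fitness([1, 0], X, y, toy_classifier, k=2)
bad = fitness([0, 1], X, y, toy_classifier, k=2)
```

Because every fitness call trains and tests the classifier k times, the number of fitness calls dominates the cost of the wrapper approach.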
Fig. 4 illustrates the workflow of the feature selection process as a wrapper approach. By means of experimentation, we heuristically selected the Roulette Wheel technique for parent string selection and adopted two-point and all-points operations for crossover and mutation respectively. For population initialization, we adopted random initialization. Of the two techniques for population reproduction, steady-state and generational, we adopted the steady-state technique with the unconditional replacement of the worst chromosome as the parent replacement strategy, because it is commonly used to assist in improving the performance of GAs. The steady-state technique is less computationally intensive than the generational technique; for instance, for a population size of 20, two-parent selection and 50 iterations, it requires 120 fitness calls instead of 1100 fitness calls for the generational technique.
GAs have their own parameters that require more experimentation to find the best fit for a specific optimization problem. These parameters are the initial population size, the number of generations, the crossover rate and the mutation rate. The parameter values should be adjusted for each problem because they are related to the characteristics of the problem. A small population might not provide a sufficient sample of the search space to reach an optimum solution. On the other hand, a large population requires more evaluations per generation, which can result in a slow rate of convergence. The crossover rate controls the frequency of applying the crossover operator to the selected parents to generate offspring. The higher the crossover rate, the more quickly new solutions are introduced into the population. If the crossover rate is too low, the search might stagnate due to the lower exploration rate. Similarly, the mutation rate controls the frequency of applying the mutation operator to the selected parents after the crossover operator, in order to increase the variability of the population. A low mutation rate serves to prevent any given gene position in the chromosome from converging to a single value in the entire population. A high mutation rate yields an essentially random search. Lastly, we needed to determine the optimal number of generations, as it is directly related to the number of evaluations or fitness function calls and hence impacts the efficiency of the GA implementation. By means of experimentation, we heuristically established the parameters that represent the best fit for our feature selection problem. The values of the parameters are shown in Table 6.
The steps of our Genetic Algorithm implementation for selecting the best feature subset are listed in the accompanying pseudo-code. The output of the GA is the chromosome that has the best fitness value in the population. The selected features of this chromosome are considered to be the best for the targeted classifier model. More details about our evaluation results are presented in the ensuing section.
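The GA configuration described in this section (random initialization, Roulette Wheel selection of two distinct parents, two-point crossover, all-points mutation, and steady-state reproduction with unconditional replacement of the worst chromosome) can be sketched in Python as follows. This is not the authors' implementation: the crossover and mutation rates are placeholder values rather than those of Table 6, "all-points mutation" is interpreted here as flipping every gene independently with probability `pm`, and the toy fitness merely counts selected bits instead of computing a cross-validated F1.

```python
import random
from typing import Callable, List, Optional, Tuple

Chromosome = List[int]


def roulette_pick(scores: List[float], rng: random.Random, exclude: Optional[int] = None) -> int:
    """Pick an index with probability proportional to its fitness, skipping `exclude`."""
    candidates = [i for i in range(len(scores)) if i != exclude]
    weights = [scores[i] for i in candidates]
    if sum(weights) <= 0:
        return rng.choice(candidates)  # fall back to uniform choice if all fitness is zero
    return rng.choices(candidates, weights=weights, k=1)[0]


def two_point_crossover(a: Chromosome, b: Chromosome, rng: random.Random) -> Tuple[Chromosome, Chromosome]:
    """Swap the segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(c: Chromosome, pm: float, rng: random.Random) -> Chromosome:
    """Flip each gene independently with probability pm (assumed reading of all-points mutation)."""
    return [1 - gene if rng.random() < pm else gene for gene in c]


def steady_state_ga(
    fitness: Callable[[Chromosome], float],
    n_genes: int = 20,
    pop_size: int = 20,
    iterations: int = 50,
    pc: float = 0.8,   # placeholder crossover rate
    pm: float = 0.05,  # placeholder mutation rate
    seed: int = 0,
) -> Tuple[Chromosome, float, int]:
    """Return the best chromosome, its fitness, and the number of fitness calls made."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    scores = [fitness(c) for c in population]
    calls = pop_size
    for _ in range(iterations):
        j = roulette_pick(scores, rng)
        k = roulette_pick(scores, rng, exclude=j)  # guarantees j != k
        child_a, child_b = population[j][:], population[k][:]
        if rng.random() < pc:
            child_a, child_b = two_point_crossover(child_a, child_b, rng)
        for child in (mutate(child_a, pm, rng), mutate(child_b, pm, rng)):
            worst = min(range(pop_size), key=scores.__getitem__)
            population[worst] = child  # unconditional replacement of the worst chromosome
            scores[worst] = fitness(child)
            calls += 1
    best = max(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best], calls


def toy_fitness(c: Chromosome) -> float:
    """Toy stand-in for the cross-validated F1: the fraction of selected genes."""
    return sum(c) / len(c)


best_chromosome, best_fitness, fitness_calls = steady_state_ga(toy_fitness)
```

With a population of 20 and 50 iterations that each evaluate two offspring, this sketch makes 20 + 2 × 50 = 120 fitness calls, which matches the steady-state figure quoted in the text.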

VI. EVALUATION RESULTS AND DISCUSSION
There are two commonly used evaluation methods for ML algorithms, K-fold cross-validation and holdout test.In K-fold cross-validation, the corpus is split into K equal size partitions of documents.The evaluation run is repeated K times (folds).Each partition is used as test dataset and all the remaining partitions as a training dataset for all K folds.The overall Recall, Precision and F1-measure result of this method is the average of the all folds' results.In contrast, in holdout test, a number of documents in the training datasets are randomly selected according to a specified ratio, the default is 66%.All other documents are assumed to be testing dataset [8], [37].In this work, we used cross validation K-Fold with K=10, which is empirically found to be the best method in practical ML evaluations as reported by Witten et al. [24].
There are two different options for computing precision, recall and F1-measure over a corpus: micro averaging and macro averaging. In micro averaging, the corpus is treated as one large document, where True Positives, False Positives and False Negatives are counted across the entire corpus, and precision, recall and F1-measure are calculated accordingly. On the other hand, macro averaging computes precision, recall and F1-measure by counting True Positives, False Positives and False Negatives on every single document and then averages the results over the entire corpus [8]. Macro averaging is more appropriate for our problem domain since the sourced financial news articles represent independent documents.
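The difference between the two averaging schemes can be made concrete with a small Python sketch. The function names and the per-document counts are hypothetical, and the sketch assumes that a measure with a zero denominator is reported as 0.0.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 on empty denominators)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_average(doc_counts):
    # Pool the counts over the corpus, then compute the measures once.
    tp = sum(c[0] for c in doc_counts)
    fp = sum(c[1] for c in doc_counts)
    fn = sum(c[2] for c in doc_counts)
    return prf(tp, fp, fn)

def macro_average(doc_counts):
    # Compute the measures per document, then average them.
    per_doc = [prf(*c) for c in doc_counts]
    return tuple(sum(m[i] for m in per_doc) / len(per_doc) for i in range(3))

# Two documents given as (TP, FP, FN): one large and easy, one small and hard.
counts = [(8, 2, 0), (1, 0, 3)]
print(micro_average(counts))  # recall is 9 / 12 = 0.75
print(macro_average(counts))  # recall is (1.0 + 0.25) / 2 = 0.625
```

Macro averaging gives the small document the same weight as the large one, which matches the view of news articles as independent documents.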
According to Witten et al. [24], there is more than one method to plot the evaluation results of ML algorithms' performance. These methods depend on the target domain. The probability threshold value is an important factor for the best classification results in the majority of Machine Learning classifiers. In these classifiers, an instance is assigned to a class if its probability of class membership is greater than a probability threshold ρ, where 0 ≤ ρ ≤ 1. For example, with the default probability threshold value of 0.5, the predicted probability of an instance belonging to a certain class must be greater than 0.5 for it to be assigned to that class [38]. However, Freeman and Moisen [39] have asserted that the accuracy of classification models is affected by the value of the threshold. They added that the default threshold value of 0.5 does not necessarily produce the highest prediction accuracy, particularly when the datasets are highly imbalanced. To the best of the authors' knowledge, however, the impact of probability threshold values on relation classification accuracy has not been given great attention in the previous Relation Extraction studies reported in the open literature. This motivated us to investigate the impact of the probability threshold on relation classification by means of experimentation. We heuristically selected the best threshold value for all classification models on all training datasets by drawing on the correlation between the classification probability threshold and the F1-measure.
As presented in section 4 and Table 4, we generated seven different training datasets that cover different relations between different entity concepts in the financial and economic news domain. The sources of the unstructured documents are RSS feeds (see Table 1).
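The threshold selection described above amounts to sweeping candidate values of ρ and keeping the one with the highest F1-measure. The following Python sketch illustrates the idea; the function names, the candidate grid (0.1 to 0.9 in steps of 0.1) and the toy probabilities are assumptions made for illustration, not values taken from the experiments.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 of the positive class when predicting 1 for prob > threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, labels, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 10 for i in range(1, 10)]
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))

# Toy scores: two true relations fall just below the default threshold of 0.5.
probs = [0.95, 0.80, 0.45, 0.42, 0.35, 0.20]
labels = [1, 1, 1, 1, 0, 0]
print(f1_at_threshold(probs, labels, 0.5))
print(best_threshold(probs, labels))
```

On imbalanced data, positive instances often receive probabilities below 0.5, so a lower threshold can recover recall without a large loss in precision.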
In the seven training datasets, all the named entities are automatically annotated; however, the classes' relation instances are automatically annotated in four training datasets and manually annotated in the other three training datasets.
The ML relation classification models have been created by using the training datasets with the feature vectors. These models should be evaluated before applying them to extract relations from unstructured data. Initially, the training datasets were configured by reducing their class imbalance to reach the optimum results. Then, a series of experiments was conducted in order to select the best feature subsets for improving the accuracy of the relation classification models and to choose between the ML algorithms SVM, PAUM and KNN.

A. CONFIGURING THE TRAINING DATASETS
Generally, classification models tend to favor the majority classes while incorrectly classifying the instances from the minority classes. According to Asif-Ur-Rahman et al. [4], a training dataset is considered imbalanced if the number of instances of one class is much larger than that of the other classes. In our training datasets, specifically those generated using public distant supervision sources (DBpedia and Freebase), the number of negative relation instances is large. This is attributed to the fact that some relations in our unstructured data are incorrectly assumed to be negative instances because they are not included as ground facts in the sourced public datasets. We believe that these negative relation instances can disrupt the balance between the True Positive and True Negative instances of the classes in the training datasets.
The first set of experiments attempts to alleviate the class imbalance in terms of True Positive and True Negative numbers in order to improve the accuracy of the classification model and to speed up ML processing. In these experiments, we heuristically measure the impact of reducing the number of negative relation instances on the models' accuracy by reducing or removing the relation instances in the documents that are not mentioned in the distant supervision sources. We also explicitly add some negative relation instances to the training datasets of one relation class in order to decrease the true positive rate while maintaining a low false positive rate, as recommended by Mohamed et al. [40]. Table 7 shows the impact of reducing the number of negative relation instances on the ML models' accuracy in terms of Precision, Recall and F1-measure.
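One simple way to reduce the number of negative relation instances is to keep every instance that is backed by a ground fact and only a fraction of the rest. The Python sketch below illustrates this under several assumptions: the function name `reduce_negatives`, the dictionary layout with a `label` key (where `None` marks an instance not found in the distant supervision sources) and the random sampling are hypothetical and may differ from the reduction procedure used in the experiments.

```python
import random

def reduce_negatives(instances, keep_fraction, seed=0):
    """Keep all labelled instances and a random fraction of the unlabelled ones."""
    rng = random.Random(seed)
    positives = [inst for inst in instances if inst["label"] is not None]
    negatives = [inst for inst in instances if inst["label"] is None]
    n_keep = round(len(negatives) * keep_fraction)
    return positives + rng.sample(negatives, n_keep)

# Toy dataset: 3 instances matched by a ground fact and 10 unmatched instances.
data = [{"pair": ("org%d" % i, "loc%d" % i), "label": "locatedIn"} for i in range(3)]
data += [{"pair": ("org%d" % i, "loc%d" % i), "label": None} for i in range(3, 13)]
reduced = reduce_negatives(data, keep_fraction=0.3)
print(len(reduced))  # 3 positives + 3 sampled negatives = 6
```

Repeating the evaluation for several values of `keep_fraction` would give a curve of model accuracy against the number of negative instances.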
Mintz et al. [20] utilized multi-class logistic classification for relation extraction and reported that the negative relation instances had a minor effect on the performance of their classifier. However, for the implemented SVM classification, it is evident from Fig. 5 that the SVM model accuracy clearly improves as we reduce the number of True Negative relation instances. This is because the class distribution in the training datasets plays a major role in the performance of most classification algorithms, as highlighted by Asif-Ur-Rahman et al. [4].

B. FEATURE SELECTION
The second set of experiments concerns feature selection by using GAs in a wrapper approach. First, we find the best subset of features by using our implementation of GAs, and then evaluate the relation classification models using the selected feature subset.

TABLE 7. Impact of reducing the number of negative relation instances on ML models' accuracy in terms of precision, recall and F1-measure.

1) FEATURE SELECTION RESULTS
Using the parameters listed in Table 6, we executed our implementation of the GA. The results in Fig. 6 illustrate the number of GA iterations required by SVM, PAUM and KNN to reach an optimal fitness function value (F1-measure): 57, 54 and 69 iterations respectively. We conclude that the three ML algorithms require approximately the same number of iterations to reach the optimal fitness value and that 100 iterations are quite sufficient for the GA to achieve that goal.
Table 8 shows the number of selected features in every subset for every classifier (SVM, PAUM and KNN) in all training datasets. The table also shows the features in every subset, classified into three categories: Lexical, Syntactic and Named Entity.
From the data in Table 8, it is apparent that the features of the Named Entity category are more important than the features of the Lexical and Syntactic categories in the majority of the training datasets. These results are consistent with the findings of Wang et al. [41], who noted that the entity features lead to an improvement in performance because the mentioned relation between two entities is closely related to the entity types.

2) EVALUATING THE RELATION CLASSIFICATION MODELS BY USING THE SELECTED FEATURE SUBSETS
The selected feature subsets in the training datasets are employed to create the relation classification models. These models are evaluated by using 10-fold cross-validation. Table 9 compares the F1-measure results of the three relation classification models (SVM, PAUM and KNN) when they use all feature vectors and when they use the selected feature subsets. The table also indicates the best F1-measure in terms of the best probability threshold. Fig. 7 illustrates the impact of the probability threshold on the F1-measure of SVM relation classification when using all the classification features and when using the feature subsets selected by our implementation of GA. It is clear that the F1-measure peaks at a probability threshold of 0.4.
All of the classifiers that we studied, SVM, PAUM and KNN, performed significantly better in the reduced feature space optimised by the GA. As evident in Table 9, our implementation of GAs has improved the accuracy of the ML algorithms in all training datasets. It can also be noticed that the improvements registered for SVM and PAUM are more evident compared to KNN. KNN is more sensitive to the irrelevant features, which is corroborated by Imandoust and Bolandraftar [42], while Wang et al. [41] assert that the mechanism of SVM learning makes the irrelevant features have little impact on the performance of the SVM algorithm.

TABLE 9. Comparing the classifiers' results in terms of F1 score before and after applying GAs (Thr=probability threshold, ALL=F1 when all features are used, GA=F1 when features are selected by GA).
Our experiments have also indicated that the accuracy of the classification models is affected by the value of the probability threshold. The best threshold values for all classification models on all training datasets were empirically selected to deliver better classification accuracy compared to the default threshold value of 0.5, as evidenced in Table 9.
It can be observed from Table 8 that our implementation of GA selects features from the Named Entity category more frequently than from the Lexical and Syntactic categories for the majority of the training datasets. Consequently, we decided to conduct further research to investigate the impact of the feature categories on the classifiers' performance.
With respect to the performance of the SVM, PAUM and KNN relation classifiers, the data in Table 9 indicates that the SVM classifier outperforms PAUM and KNN for most of the training datasets, namely the Person-Organization, Person-Location, StockIndex-Organization and Organization-Percent training datasets. The recorded results are consistent with the findings of other studies that utilize ML in relation classification; for example, the study by Li et al. [43] found that SVM may perform better than PAUM on small training datasets and that the two have a close performance on large training datasets. Also, the work of Hmeidi et al. [27] reveals that SVM has better F1-measure results than KNN. We believe that PAUM and KNN exhibit better performance than SVM in some training datasets because PAUM is appropriate for imbalanced training datasets and KNN performs better with a small number of features.

C. FEATURES CATEGORY SELECTION
This section evaluates the effect of the features of a single category (Lexical, Syntactic or Named Entity) on the accuracy of the relation classification models. We created the models by using training datasets with the features of each category individually and with feature combinations of all categories.
The models' evaluation results are compared in Table 10. The data in the table indicates that the best Precision, Recall and F1-measure values are produced when the features of the Named Entity category are included in the training.
The results of these experiments illustrate that the models created using the Named Entity category combined with lexical and/or syntactic features exhibit better accuracy than the models created without the Named Entity category. This is true for all training datasets and all ML classifiers except the training dataset of the relation between Stock Symbol and Organization entities when using the SVM and PAUM classifiers. This is attributed to the fact that the relation instance correlating Stock Symbol and Organization is short in terms of the number of words (sometimes there are no words between the entity pairs) compared to other relations (with more than two words between the entity pairs). This reduces the effectiveness of certain features; for instance, the features that represent the number of tokens between the entities in the relation instances and the features that represent the POS of the words between the entities (see Table 11).
In general, the classification accuracy of the ML models has improved as a result of deploying our GA for optimizing the feature selection process. In section 7, we further support this claim by comparing our GA against another solution search method for feature selection.

VII. CONTRASTING OUR IMPLEMENTATION OF GA OPTIMIZATION TO RANDOM MUTATION HILL-CLIMBING
In this section, we attempt to verify that GAs are an appropriate choice for optimizing the process of feature selection for the relation classification problem. Hence, we decided to compare our implementation of GAs with Random Mutation Hill-Climbing (RMHC), as their operational dynamics are very similar. Our choice of Hill-Climbing (HC) to compare against GAs for the feature selection optimization problem is consistent with numerous studies that have compared the two algorithms, for a variety of problems, since their early conception. One of the earliest investigations was carried out by Mitchell et al. [44], who attempted to answer the question: when will a GA outperform Hill-Climbing? They claim that understanding the mechanism of GAs and the characteristics of the fitness landscape of the problem is crucial for deciding when GAs will be most useful. Another study by MacFarlane et al. [45] compared GAs to several types of HC algorithms, including RMHC. The algorithms were applied to solve a term selection problem for an information filtering task. Although they observed that both Genetic and Hill-Climbing algorithms appear to be able to improve the accuracy of term selection, they did not find evidence that their implementation of GA performs better than their Hill-Climbing algorithm. A recent study by Sakamoto et al. [46] compared GAs and HC in a completely different problem domain, namely simulating the node placement problem for achieving network connectivity and user coverage.
RMHC can be considered as a GA without the crossover operation and the initial population. The neighbor (new) solution in RMHC can be generated by applying a mutation similar to that in GAs, which can make jumps of varying sizes through the search space [36]. The other reason for choosing RMHC is to contrast the complexity of GAs with the simplicity of RMHC and to answer the question: do we need the computational complexity of GA operations?
In our RMHC implementation, we adopted a similar configuration to that used by Sakamoto et al. [46]; the implementation is specified as pseudo-code. In order to fairly compare the performance of our implementation of GAs and RMHC for the feature selection problem, the experiments should be run under the same computational conditions, in particular with respect to the fitness evaluation calls, as they represent the most critical operational step of search algorithms. It is clear that one run of GAs is more expensive than one run of RMHC in terms of fitness function calls [47]. As a result, we should run both algorithms with an equal number of fitness function calls.
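For comparison with the GA, the RMHC loop can be sketched in a few lines of Python. This is an illustrative sketch rather than the configuration used in the experiments: the function name `run_rmhc`, the bit-flip mutation, the forced single flip when no bit changes and the default mutation rate are assumptions. The sketch also assumes that the evaluation of the initial solution counts towards the budget of fitness function calls, so that the budget can be matched exactly to that of a GA run.

```python
import random

def run_rmhc(fitness, n_features, max_calls=120, mutation_rate=0.05, seed=0):
    """Random Mutation Hill-Climbing over bit masks (1 = feature selected)."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_features)]
    best, best_score = current[:], fitness(current)
    calls = 1                                  # initial solution evaluated once
    while calls < max_calls:
        # Mutate the current solution, flipping each bit independently.
        candidate = [1 - g if rng.random() < mutation_rate else g
                     for g in current]
        if candidate == current:               # force at least one flip
            pos = rng.randrange(n_features)
            candidate[pos] = 1 - candidate[pos]
        score = fitness(candidate)
        calls += 1
        if score > best_score:                 # remember the best solution seen
            best, best_score = candidate[:], score
        current = candidate                    # always move to the new solution
    return best, best_score, calls

# Toy fitness (hypothetical): number of positions matching a target mask.
target = [1, 0, 1, 1, 0, 0, 1, 0]
toy_fitness = lambda mask: sum(m == t for m, t in zip(mask, target))
mask, score, calls = run_rmhc(toy_fitness, len(target))
print(calls)  # 120
```

Because there is no population and no crossover, each iteration costs a single fitness evaluation, which is what makes RMHC a useful low-complexity baseline.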
Because we adopted the steady-state technique for population reproduction in our implementation of GAs, the number of fitness function calls is equal to I × 2 + P, where I is the number of iterations of the GA's operations and P is the population size. In contrast, the number of fitness function calls in RMHC is equal to the number of iterations of its operations because our implementation of RMHC does not have an initial population. Consequently, the number of iterations in the RMHC experiments should be equal to the number of fitness function calls made by our GA.
For the purpose of this experimental comparison, we evaluate optimizing the accuracy of the SVM relation classifier for only one training dataset (Location-Organization). The number of iterations in our implementation of the GAs is 50; thus, the algorithm makes 50 × 2 + 20 = 120 fitness function calls for a population size of 20. Consequently, the Random Mutation Hill-Climbing algorithm should run for 120 iterations in order to subject it to the same computational effort in terms of fitness evaluations. Each algorithm is executed 30 times, which gives 30 sample runs per algorithm.

The comparison between our implementation of Genetic Algorithms and the Random Mutation Hill-Climbing algorithm is highlighted in Table 12 in terms of the fitness (F1-measure) of the sample runs. The results in the table indicate that the Random Mutation Hill-Climbing algorithm outperforms our implementation of Genetic Algorithms in only 4 of the 30 sample runs.
From the data in Table 12, it is apparent that our implementation of Genetic Algorithms outperforms the Random Mutation Hill-Climbing algorithm in most of the sample runs, as the sample runs of our implementation of Genetic Algorithms rank higher than those of the Random Mutation Hill-Climbing algorithm. Nevertheless, in order to further examine any significant difference in the performance of our implementation of Genetic Algorithms and the Random Mutation Hill-Climbing algorithm, we applied a statistical test to compare their performance in the feature subset selection problem. We used the Wilcoxon signed-rank test to perform a pairwise comparison between the two algorithms' sample runs. The Wilcoxon test is a non-parametric statistical procedure for examining the median differences in observations for two samples. It aims to detect whether there is a significant difference between the behaviour of the samples of the two algorithms' results. Before applying the Wilcoxon test, we rank the absolute differences of the sample pairs. First, the difference between each sample pair is computed. Then, the absolute differences are ordered from the smallest to the largest. The rank of each pair is the position of its absolute difference in the ordered list [48]. Table 12 shows the fitness values for the sample runs of the Genetic and Random Mutation Hill-Climbing algorithms, together with their paired differences and the ranks and total ranks of their absolute differences.
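The ranking procedure described above can be written out as a short Python sketch that computes the sum of the ranks of the positive differences. The function name `signed_rank_statistic` and the toy scores are hypothetical; the sketch assumes the common conventions of discarding zero differences and giving tied absolute differences their average rank, and it does not compute a p-value.

```python
def signed_rank_statistic(sample_a, sample_b):
    """Sum of the ranks of the positive paired differences (a - b)."""
    # Zero differences carry no sign and are dropped (a common convention).
    diffs = [a - b for a, b in zip(sample_a, sample_b) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1         # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    return sum(r for d, r in zip(ordered, ranks) if d > 0)

# Toy integer scores (for example F1 in percent) for five paired runs.
ga_runs = [85, 90, 78, 92, 88]
rmhc_runs = [80, 91, 70, 89, 88]
print(signed_rank_statistic(ga_runs, rmhc_runs))  # 2 + 3 + 4 = 9.0
```

Integer scores are used in the example so that ties between absolute differences are detected exactly; with floating-point scores the differences would need rounding before comparison.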
The Wilcoxon signed-rank statistical analysis was applied by using the R package on the sample runs of our implementation of Genetic Algorithms and the Random Mutation Hill-Climbing algorithm, under the null hypothesis and at the 0.05 significance level (α). The Wilcoxon test results reported by R are as follows:

data: GA and RMHC
V = 419, p-value = 0.00003453
alternative hypothesis: true location shift is not equal to 0

where V is the sum of the positive ranks (GA results ranks) and the p-value is a probability that measures the evidence against the null hypothesis. Lower probabilities provide stronger evidence against the null hypothesis.
It is clear that the p-value (0.00003453) is considerably less than the significance level (0.05). This result shows that there is a significant difference between our implementation of GAs and the Random Mutation Hill-Climbing algorithm, and the null hypothesis is rejected. The statistical test result provides further evidence that our implementation of GAs for feature selection outperforms the Random Mutation Hill-Climbing algorithm in terms of improving the relation classifiers' accuracy.

VIII. FINDINGS SUMMARY: A METHODOLOGY FOR KNOWLEDGE-ASSISTED ML-BASED RELATION EXTRACTION
Our research into extracting relations from domain-specific documents resulted in a comprehensive methodology for integrating domain knowledge with supervised ML techniques to improve the Information Extraction process from unstructured data.
The preliminary stage of our proposed methodology, which comprised knowledge map construction and the NLP tasks (NER, Relation Detection, feature extraction), was documented in detail in an earlier publication [18]. This paper documents how our methodology integrates domain knowledge with ML techniques in order to improve the process of Information Extraction from unstructured data. In this stage, we developed innovative techniques to optimise the process of ML classifiers for Relation Extraction; this includes employing distant supervision for compiling the ML training datasets and using GA for feature selection. Supported by a series of experiments, our research reports on the favourable knowledge-assisted implementation and configuration of the ML classifiers and GAs.
We have employed public LOD datasets (DBpedia and Freebase) as distant supervision sources for our ML algorithms because, similar to our knowledge modeling approach, these datasets use the same standardised semantic formalism to publish ground facts that are relevant to our problem domain. The ground facts were used to compile training datasets for relation classification.
After building the relation classification models by using the configured training datasets and the best selected feature vectors, we apply these models to the pre-processed unlabelled online financial news documents to extract new relations between the targeted annotated entities. The output of this step is an annotated document with entities and their interrelations, which are incrementally populated into the resultant semantic knowledgebase. The extracted relations have a confidence score based on the probability of the correctness of the entity pairs' relation. These scores could be used to rank the extracted relations to generate a list of the most confident relations [20].
The methodology described above is applicable to other domains and only requires a one-off effort in constructing the semantic model of the domain knowledge, i.e. engineering the semantic ontology that conceptualises the domain's key terms and relations and identifying public datasets that provide ground facts about the domain's key events.

IX. CONCLUSIONS AND FURTHER WORK
Harnessing insights from the prolific online information resources requires the computerised processing of unstructured text in order to satisfy the information need of particular applications such as recommender systems and sentiment analysis. The research reported in this paper contributes to the efforts of information extraction by proposing a novel methodology that integrates domain knowledge with supervised Machine Learning (ML) to improve the processes of Relation Extraction from unstructured text.
Considering that the success of supervised Machine Learning is affected by the quality of the training datasets and the relevance of the feature vectors, we utilized distant supervision techniques, informed by Linked Open Data datasets, to aid in the compilation of the input training data, and then deployed evolutionary algorithms (Genetic Algorithms) to optimise the process of feature selection in order to reduce the dimensionality of the data and subsequently increase the efficiency and accuracy of the classifiers' operations. Our research also makes several contributions to the methods of configuring the GA-optimised machine learning for relation classification, including the reduction of the True Negative/Positive imbalance in the training data, setting the probability threshold of the best-fit learning algorithms and establishing the optimum GA parameters. In addition, the findings of our research contribute to the understanding of the impact of specific feature types (lexical, syntactic, Named Entity) and feature groupings on the accuracy of the relation classification process for the target application domain.
The experimental evaluation evidenced that the developed knowledge-assisted relation classification model, which was further boosted by our implementation of GAs to reduce the feature space, has resulted in a significant improvement in the process of relation extraction. The experimental results also indicate that, amongst the implemented ML algorithms, SVM exhibited the best relation classification accuracy in the majority of the training datasets while retaining acceptable levels of accuracy in the remaining training datasets.
Finally, we verified that GAs represent an appropriate choice for optimizing the process of feature selection for the relation classification problem by comparing them against a space search algorithm that has similar operational dynamics, Random Mutation Hill-Climbing (RMHC). To further examine any significant difference in the performance of our implementation of GAs and the Random Mutation Hill-Climbing algorithm, we used a non-parametric statistical procedure, the Wilcoxon test, to detect whether there is a significant difference between the behaviour of the sample runs of the two algorithms' implementations. The findings demonstrated that our implementation of GAs for feature selection outperforms the Random Mutation Hill-Climbing algorithm in terms of improving the relation classifiers' accuracy.
Our plans for further work include investigating whether the relation classification results can be further enhanced by utilising GAs to solve the multi-objective optimization problem that combines parameter optimization of the ML algorithms with feature selection in relation classification. More broadly, our future work aims to develop the reasoning capabilities of the underlying semantic knowledgebase for the benefit of target user groups such as journalists or financial investors. Hence, we will investigate the application of reasoning techniques such as the first-order classification rules that can be hard-wired into the knowledgebase's semantic model and the explicit Semantic Web Rule Language (SWRL) to classify events and facts that might be of interest to the end users. The planned research will also investigate techniques for interpreting Natural Language queries into SPARQL queries that can efficiently interrogate the domain knowledgebase.

FIGURE 1. The three phases of the general framework.

FIGURE 2. The concept map of this work.

Algorithm 1 Genetic Algorithm Implementation
Start
  N is the size of the population
  Pc is the crossover rate and Pm is the mutation rate
  Let the best solution S* and its fitness F*(S*) equal 0
  Generate N initial chromosomes C_i for the initial population, where i ∈ [0, 1, ..., N)
  Evaluate the initial chromosomes C_i to obtain their fitnesses F(C_i)
  repeat
    Select parent chromosomes and apply the crossover (rate Pc) and mutation (rate Pm) operators to generate C_j and C_k, the children's chromosomes
    Evaluate C_j and C_k; the fitnesses of the children's chromosomes are F(C_j) and F(C_k)
    Unconditionally replace the worst chromosomes in the population with the children's chromosomes C_j and C_k
    Find the best chromosome C_b with the best fitness F(C_b) in the current population, where 0 ≤ b < N
    Let the current solution S equal the best chromosome C_b and the current fitness F equal F(C_b)
    if F > F* then
      Update the best solution and the best fitness
    end if
  until (stopping condition is met)
  Return S*, F*
End

KNN uses simple techniques and its accuracy is often enhanced when the number of features is small; the KNN implementation used in this work has only one parameter, K [27].

Start
  Generate an initial solution S_0
  Evaluate the initial solution S_0, F(S_0)
  Let the current solution S equal the initial solution S_0
  Let the best solution S* equal the initial solution S_0
  Let the best fitness value F* equal the fitness of the initial solution F(S_0)
  repeat
    Mutate the current solution S to generate a new solution S'
    Evaluate the new solution, F(S')
    if F(S') > F(S*) then
      Update the best solution and the best fitness
    end if
    Update the current solution S = S'
  until (stopping condition is met)
  Return S*, F*
End

FIGURE 4. GA feature subsets selection as wrapper approach.
For instance, the marketing domain uses the lift chart, which plots the True Positive rate versus the training subset size; the communication domain uses the Receiver Operating Characteristic (ROC) curve, which plots the True Positive rate versus the False Positive rate; and the Information Retrieval domain uses the Recall versus Precision curve. This research computes the evaluation results of ML models in relation classification by drawing the relation between recall and precision in terms of the confidence threshold for classification (the classification probability threshold), as this is commonly accepted as the standard in the Information Extraction field.

FIGURE 5. SVM model accuracy in terms of the number of non-relevant relation instances in the Location-Organization pair training dataset.

FIGURE 6. The genetic algorithm iterations to select the best feature subset for the stock index and percentage increase or decrease training dataset.

FIGURE 7. Impact of threshold on SVM relation classifiers' accuracy.
Table 11 illustrates the difference in POS features for StockSymbol-Organization and Organization-Percent relation instances. The number of POS tokens between the entity pairs in the relation instance of the StockSymbol-Organization training dataset is only one, whereas the number of POS tokens between the entity pairs in the relation instance of the Organization-Percent training dataset is 12. It is clear that the features related to the tokens between the entity pairs in the StockSymbol-Organization training dataset are not sufficient to indicate the syntactic relation between an organization and its stock symbol within the context.
A. BOOTSTRAPPING THE TRAINING DATASETS WITH DISTANT SUPERVISION SOURCES

ABDULADEM ALJAMEL received the master's degree in Internet and enterprise computing and the Ph.D. degree in knowledge-based information extraction and exploration from Nottingham Trent University. He is currently a Lecturer with the Libyan Higher Institution of Science and Technology. His research interests include information extraction, and knowledge representation and exploration.

TAHA OSMAN is currently a Principal Lecturer with Nottingham Trent University (NTU) and the Manager of postgraduate studies with the Department of Computing and Technology, NTU. Based on extensive research in intelligent multi-agent systems, his research expanded to knowledge-based systems, where he developed specialist expertise in utilizing semantic Web technologies for intelligent retrieval and the exploration of information. His research interests include the development of innovative solutions for numerous problem domains, including financial recommender systems and Arabic sentiment analysis, and resulted in commercial R&D collaborations in the fields of digital media search engines and mineral deposits prospectivity analysis.

GIOVANNI ACAMPORA was a Hoofddocent Tenure Track in process intelligence with the School of Industrial Engineering, Information Systems, Eindhoven University of Technology, Eindhoven, The Netherlands, from 2011 to 2012. He was a Reader in computational intelligence with the School of Science and Technology, Nottingham Trent University, Nottingham, U.K., from 2012 to 2016. He is currently an Associate Professor in artificial intelligence with the University of Naples Federico II, where he is also a Scientific Advisor with the Quantum Computing and Smart Systems Laboratory, Department of Physics. His main research interests include computational intelligence, fuzzy modeling, evolutionary computation, and ambient intelligence. He is a Senior Member of the IEEE. He is a member of the Scientific Board of the Interdepartmental Center for Advanced Robotics in Surgery. He is the Chair of IEEE-SA 1855WG, the working group that has published the first IEEE standard in the area of fuzzy logic. As a result of this activity, he received the prestigious IEEE-SA Emerging Technology Award. He serves as the Editor-in-Chief for Quantum Machine Intelligence (Springer), an Associate Editor for Soft Computing (Springer), and an Editorial Board Member for Memetic Computing (Springer), Heliyon (Elsevier), the International Journal of Autonomous and Adaptive Communication Systems (Inderscience), and the IEEE TRANSACTIONS ON FUZZY SYSTEMS. In 2017, he acted as a General Chair of the IEEE International Conference on Fuzzy Systems, the top leading conference in the area of fuzzy logic.

TABLE 1. Example of RSS feeds.

TABLE 2. Sentences and number of pairs of relation instances.

TABLE 4. The summary of the collected training datasets (RI=all relation instances, RC=relation classes, Doc=documents, P=person, O=organization, L=location, S=stock symbol, I=stock index, C=percentage).

TABLE 5. The grid search results of optimum ML algorithms parameters.

TABLE 6. Our implementation of GAs parameters.

TABLE 8. The feature subsets that are selected by using GAs (Lex=lexical, Syn=syntactic, Ent=entity).

TABLE 11. Examples of the POS feature of tokens between entity pairs.

TABLE 12. GA and RMHC F1-measure sample runs and their absolute differences ranks.
In terms of selecting the best features for relation classification, the research findings indicate that the models that are created using the Named Entity category combined with lexical and/or syntactic features exhibit better accuracy. The exception for our target domain is the Stock Symbol and Organization relation, as it is characterised by short relation mentions (instances) in terms of the number of words.
[...] True Positive and True Negative instances of the classes in the training datasets. Hence, we conduct a number of experiments to heuristically reduce the number of resulting negative instances, and we also explicitly introduce some negative relation instances into the training datasets of one relation class in order to decrease the true positive rate while maintaining a low false positive rate. The experimental results evidenced that our approach has a positive impact on the models' accuracy.