A Semantic Approach to Ranking Techniques: Improving Web Page Searches for Educational Purposes

The Web offers an unprecedented number of resources and has become the most popular source of information for students shaping their understanding of a new topic, and for instructors selecting relevant material for learning and teaching activities. Even though search engines are the most widely used tools for searching for educational content, the realities of the learning and teaching processes make the retrieval and evaluation of educational resources more complex than they are for other goods or services. The lack of recourse to educational metadata in web pages, as well as the size of the Web itself, call for specific techniques to be adopted for a more effective ranking of educational content. In this study, we propose an innovative approach based on semantic technologies. The SemanticSearch approach described in this paper leverages a knowledge graph representation of teaching contexts and proposes a new ranking method for rating educational web content. In the literature we find an Educational Ranking Principle that ranks web pages for a specific teaching context. In this study, we integrate the Educational Ranking Principle with semantic data to extend the experimentation and analyse performance further. We undertake an evaluation involving university teachers, considering more than 70 queries to measure the SemanticSearch performance against the Educational Ranking Principle in addition to two state-of-the-art methodologies: Tf-Idf and BM25F. Paired t-tests on four accuracy measures provide statistical evidence that SemanticSearch evaluates web pages more accurately than the three baselines.


I. INTRODUCTION
The Web has become a primary source of information for a growing number of people, and it affects many aspects of our everyday life, education included. More and more frequently, the Web is the primary resource that students exploit to shape their understanding of a new topic. Teachers also take advantage of the myriad of resources available on the Web to create their lessons and to design learning activities. Other than web resources (courses, lessons) specifically created for educational purposes by numerous on-line educational providers, there is a vast number of web pages suitable for educational contexts, even though they have not been specifically designed for didactic purposes.
In this scenario, the retrieval of suitable resources that match specific educational (or didactic) needs is crucial.
As demonstrated by [1], retrieving and recommending educational resources cannot be directly compared to performing the same operations for other goods or services: information about the user's preferences or interests alone is not sufficient, because the data needed has to comply with educational requirements and learning objectives which go beyond the individual's profile. Such educational aspects must also be fully considered during the various phases of learning and teaching in order to arrive at a suitable recommendation of educational resources [2]. Moreover, it is important to note that most education-oriented research into information retrieval and recommender systems considers students as the main users, while only a few studies focus on instructors [3] as the prime requesters of educational resources. Even in recent publications on emerging technologies for intelligent e-learning systems (as reported in the chapter "Technology in education" [4]), attention is mainly given to teaching methodologies (e.g., the "flipped classroom" approach) rather than to instructors' problems in retrieving and reusing teaching materials. Research into supporting learning and teaching design has been mostly oriented towards hybrid recommendation systems based on pedagogical patterns, metadata-based models for exploring materials, and intelligent systems to help the teacher save time when performing highly repetitive tasks [5]. However, teachers also need support (a) during the process of activity design, and (b) in the choice of relevant resources for learning and teaching [6]. In this scenario we envisage a web app for teachers that implements our method for searching educational materials on the Web. To our knowledge, no such app exists yet; there are only digital archives dedicated to educational content.
An analysis of the problem of an effective retrieval of learning resources from an instructor's perspective reveals specific requirements. The need to explore educational resources is probably at its peak when an instructor starts to teach a new course which is not related to previous classes; this may involve new topics or a different teaching context. Previously used resources might therefore not meet the instructor's needs. Moreover, even when the same instructor teaches in different contexts, the resources delivered in one context may not match (or may have little in common with) those needed in another [2], [6]. In general, we can say that when instructors are looking for new resources, it is very likely that their queries and requests will not conform to their previous teaching, either because the topic is new or because the educational aims of the course have changed.
The instructor's educational context is critical in how information about their present teaching needs should be presented [2], [7]. We can deduce this context partly by looking at the structure of the course [8]. The course structure is, essentially, the representation of the knowledge and skills that the course aims to deliver. Such a structure, also known as the educational plan, can be visualised as the course concept map [9]. Moreover, an educational resource may be relevant when teaching a concept in just one specific context. For this reason, the retrieval of learning resources that match the educational context (contextual retrieval) is more important than suggesting resources based solely on an individual teacher or on past selections (user or content-based retrieval). For example, let us assume that a recommender system suggests a resource suited to the instructor's teaching style, but offers a concept that is no longer part of the educational plan. It is clear that the recommendation is not wrong in itself since the resource is actually of interest to the instructor. However, it is rendered useless given the context of the current teaching situation and so the recommendation is unsatisfactory from an educational point of view.
Our research focuses on the ranking of web pages according to their compliance with the educational context of the instructor. This focus on web pages introduces specific challenges regarding the following aspects:
• Metadata related to educational context are not widely adopted [10]. The use of Learning Resource Metadata Initiative (LRMI) markup has increased over time, but the distribution and quality of learning resource markup remain limited. Therefore, while generic search engines leverage metadata to provide additional information matching a web search for products or services (e.g., prices for products), no comparable educational annotation of resources is available.
• No standards are applied to the structure of learning material published on the Web. The metadata describing learning resources in dedicated learning object repositories are of poor quality, and many incompatible standards are used for these metadata. Therefore, the majority of educators use common search engines when looking for educational resources [11].
• There is no specific educational rating for the content of web pages. No mechanism exists to evaluate how suitable a piece of content is for use in an educational context.
To support the search of learning materials, common search engines are still the predominant tools, even though they are not specifically oriented to educational content [11].
We thus propose a semantic search methodology that aims to give a higher ranking to those web pages that i) are useful for the instructor and ii) match their current teaching requirements. This search methodology exploits information from the Teaching Context (TC) of the instructor as defined in a teacher model presented in [12]. That definition of TC includes the title, concept map, education level, teaching objectives and difficulty of the course. The same work [12] also proposes a method called Educational Ranking Principle (ERP) that leverages the TC information to improve the retrieval of educational resources for a specific TC. Our new SemanticSearch method goes further, integrating semantic data extracted from the DBpedia knowledge graph with the TC information. In this way, it is no longer necessary to have suitable metadata content, dedicated standards or ratings for educational content to retrieve didactic material. Knowledge graphs are widely adopted for organizing information about entities in a structured and semantically meaningful way. Essentially, when a teacher submits a query to retrieve didactic material from the Web, our SemanticSearch produces i) a Resource Category Graph and ii) a Query Category Graph. Our method combines the information from these two graphs and scores each web page according to the similarity between its Resource Category Graph and the Query Category Graph. The higher the similarity score, the more likely the resource is suited to the teaching context of the educator.
Based on the information available in the TC, we tested our SemanticSearch methodology on 16 different queries, created automatically as a combination of the components of the teaching context (as reported in Section III, Table 1). Queries are enriched through a Named Entity Recognition process with semantic entities from the DBpedia Knowledge Graph. Our methodology leverages specific properties and attributes of the semantic entities to improve ranking performance. The results of this semantic-based search are compared against two baselines: Tf-Idf and BM25F. In our findings, SemanticSearch is more reliable than Tf-Idf and BM25F in ranking resources according to their suitability in a teaching context, and it also improves on the already good results of ERP. In particular, some aspects of the TC in combination with semantic data help significant semantic entities to be detected more effectively, resulting in a better educational scoring of the web pages. In the following Section II we discuss related work in the field, while in Section III and Section IV we present the methodology and the evaluation of our SemanticSearch proposal respectively. Results are laid out in Section V, while Section VI presents final remarks.

II. LITERATURE AND BACKGROUND
Our work investigates a ranking technique that can search for educational content throughout the entire Web. Discovering educational resources on the Web is a widely debated issue. In this respect, categorizing online content for didactic purposes is one of the main difficulties, mainly because the lack of semantic relationships between Web resources prevents the effective automatic retrieval of teaching and learning material on the Internet. For instance, the absence of this semantic element makes it challenging to define relations between areas and subjects. Consequently, tasks such as finding associations between topics and listing recommendations about educational resources are still extremely difficult to perform automatically. Much research, particularly from the Technology Enhanced Learning (TEL) community, focuses on addressing these issues. In the remainder of this section we present some approaches that are closely related to our proposal.
An interesting work which exploits Linked Data available on the Web presents a process of supporting the semi-automatic classification of Open Educational Resources (OERs) [13]. This method categorises OERs based on the data enrichment capabilities that can be provided by social open vocabularies and archives. The authors demonstrate that it is possible to create a link between a formal knowledge organization system and a knowledge source. While ontology, a popular means of representing knowledge, can be used to categorise web pages with reference to conceptual schemes (e.g., thesauruses), a social categorisation using existing knowledge graphs, like DBpedia, can be more beneficial. Moreover, DBpedia resources are linked to ontologies such as YAGO and WordNet, thus providing more semantic information in the form of typeOf and suchAs relations.
The authors in [14] compare eight techniques for ranking semantic associations, highlighting two of them: size and entity homogeneity. The results prove that small semantic associations between entities with similar types and semantic associations with uniform relations are, in practice, effective for human experts.
The importance of semantics is also highlighted in [15], where the authors demonstrate that using a semantic model to represent content is particularly effective in the retrieval of scientific articles. In [16], the authors present a novel solution for searching for and ranking OERs, which integrates the ranking of search results from an existing search engine with the rankings of OERs found via term clustering. The mixed ranking is then used to re-calculate the order of retrieved terms. Their goal is to improve the results of an existing search engine by increasing the number of relevant OERs relating to the search terms. A retrieval framework for vocabulary learning that is optimized for learning outcomes rather than general relevance is illustrated in [17]. The framework takes into account students' prior knowledge and optimizes the retrieval process by defining the learning and effort functions. Then, by combining them it is possible to obtain the final function for that optimization problem. In our case, we see the learning problem from a different point of view, namely, from the teacher's perspective. Our goal is to give stronger support to the teacher in arranging and retrieving the most appropriate learning material from the Web. In contrast to the approaches cited above, the focus of our research is the elicitation of educational content carried out by the actor of the search: the teacher. Our starting point is the Educational Ranking Principle (ERP) [12], an education-based ranking principle that estimates the usefulness of a web page for a teaching context. The teaching context, included in the Instructor Profile, plays a fundamental role in the ranking. The formalisation of an instructor profile that is based on contextual information about the teaching is useful in addressing many problems when analysing web pages for educational purposes [18]-[21].
Leveraging the same TC information used for the ERP, studies in TEL have been able to address some important problems such as i) the automatic discovery of the prerequisites of educational resources [22], ii) potential new recommendation methods for instructors [23], [24], and iii) the comparison of the performance of recommender systems in TEL [25]. Recent research in this field also demonstrates that semantic entities of knowledge graphs are useful in enhancing the description of teaching resources [19], [26]. In this respect, our study also suggests a semantic-driven approach to further improve the scoring of web pages for a teaching context.

III. METHODOLOGY
ERP aims to rate web pages according to specific aspects of an instructor's teaching practice. The rating reflects how well the web page matches the concept to teach as well as the appropriate context. Google and other Information Retrieval (IR) methods already rank web pages according to topic, and they perform remarkably well. The main problem is ranking a set of web pages according to their suitability for teaching in a certain context; this is where ERP comes to the fore. It is expected that if a web page is to be considered an educational resource, it not only needs to explain a concept, but also to refer to certain fundamental knowledge associated with it (e.g., prerequisite knowledge). It must also be appropriate for the target students. The Teaching Context is a key element in accurately ranking web pages according to the actual needs of the instructor [2], [12]. ERP is based on a specific structure of Teaching Context consisting of the following attributes:
• Concept Name (CN);
• Course Title (CT);
• Prerequisite Knowledge (PK);
• Difficulty (DIFF);
• Education Level (EL).
When teaching a concept, its name (CN), course title (CT) and prerequisite knowledge (PK) contextualise the scope of the teaching. A concept name may be too ambiguous, so a course title can be of help in describing the domain.
The prerequisite knowledge defines what the learners are expected to know already before acquiring the concept CN. Any teaching material that refers to its relative PK, as specified in the Teaching Context, will better suit the learning process. Let us suppose that an instructor is looking for educational resources for the concept c in a concept map. Let us also assume a simplified concept map, where the edges between concepts represent only one type of semantic relationship: the prerequisite relationship. The PK attribute is the set of concepts directly connected to c (i.e., with outgoing edges to c). Prerequisite Knowledge has been valuable in addressing various problems in the retrieval and recommendation of resources in TEL [3], [22], [27], including the ranking of web pages [12]. Finally, the difficulty (DIFF) and the education level (EL) of a concept tailor the search to those materials that the target audience can acquire appropriately.
These five attributes are sufficient in order to provide a higher score for those web pages that are most appropriate for the Teaching Context [12].
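The definition of the PK attribute can be illustrated with a minimal sketch. It assumes a simplified concept map stored as a dictionary of directed prerequisite edges; the concept names, the `ConceptMap` type and the `prerequisite_knowledge` helper are hypothetical and serve only as an illustration.

```python
from typing import Dict, Set

# Hypothetical representation of a simplified concept map: each key is a
# concept and each value is the set of concepts it is a direct prerequisite
# for (directed edge: prerequisite -> concept).
ConceptMap = Dict[str, Set[str]]


def prerequisite_knowledge(concept_map: ConceptMap, c: str) -> Set[str]:
    """Return the concepts that have an edge pointing directly to c."""
    return {source for source, targets in concept_map.items() if c in targets}


# Toy concept map for a programming course (illustrative data only).
concept_map: ConceptMap = {
    "Variables": {"Operators"},
    "Data types": {"Operators", "Variables"},
    "Operators": {"Control flow"},
    "Control flow": set(),
}

pk = prerequisite_knowledge(concept_map, "Operators")
print(sorted(pk))
```

Only direct predecessors are returned, which mirrors the "directly connected" wording above; transitive prerequisites are deliberately not followed.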
ERP analyses the content and structure of the web pages with respect to the Teaching Context. In brief, ERP is a structured scoring method that analyses the Tf-Idf score of terms of the teaching context in specific sections of the web pages. These scores are weighted according to the degree of expectancy in finding an attribute of the teaching context in a section of the web page. This mechanism is designed to assign a high score to web pages that present high Tf-Idf scores for an element of the teaching context in a significant section of the web page. When scoring a set of 10 web pages for a web search, the ERP scoring is more accurate and reliable than traditional IR methods such as Tf-Idf and BM25F. Performing the same test over a larger set of web pages (a more detailed account is given in Section IV-E), we discovered that ERP continues to perform better than the Tf-Idf baseline. The same improvement is also recorded when comparing ERP to BM25F, even though we were not able to obtain statistically significant results for this test due to the different textual structure of newly injected web pages. Over a large set of resources, the term frequencies suggest a good match between a web page and the teaching context because of the high frequency of the terms in common. However, the resources may be very different on a conceptual level; this is where semantic techniques can assist IR methods [28], [29].
At this point we propose SemanticSearch, a revised ERP based on a semantic methodology rather than on pure term frequency. Using the same formulation of the teaching context, we introduce a potential semantic web search methodology to overcome the weaknesses of ERP. Figure 1 illustrates how SemanticSearch incorporates semantic techniques in the scoring phase of ERP. The main steps of the entire process are detailed in the following workflow: Step 0: As a preliminary, the instructor wishes to create a course which comes with the following basic information: title (CT), educational level (EL), difficulty (DIFF) and a concept map which represents the organization of the course concepts based on their prerequisite relationships (PK).
Step 1: From the course information, the system generates the search query by combining the five educational attributes of the TC. In this study we aim to investigate which attributes of the TC lead to a better educational ranking of the web pages. In this regard, there are many possible query structures that can combine the attributes of the TC in different ways, as shown in Section IV-A (Query Structures). In this study, we evaluate all of these and identify the one that is most relevant to our proposal. Once the system creates the query from the teaching context information, two different workflows emerge as Fig. 1 illustrates: (a) the creation of the semantic graph associated with the query, and (b) the creation of the semantic graph associated with the retrieved resources.
Step a.1 Extraction of semantic entities from the query using the Dandelion Named Entity Recognition framework [30].
Step a.2 Creation of the semantic graph of the DBpedia categories associated with the query entities [31] by using the dct:subject property of each DBpedia entity extracted at Step a.1 (see Fig. 2).
Step b.1 Query submission. The query is sent to a search engine (e.g., Google) or ranking principle to retrieve the list of resources.
Step b.2 Extraction of semantic entities from the content of each resource by using the Dandelion Named Entity Recognition framework. Semantically enriched entities extend the representation of entities from a mere sequence of terms to additional information about specific attributes. In our case the categories associated with each entity are leveraged for the ranking process.
Step b.3 Filtering of the semantic entities extracted so that only the most promising entities are retained, ranked according to the specific scoring method (presented in Section IV-A, Scoring Methods).
Step b.4 Creation, for each resource, of the semantic graph from the DBpedia categories associated with the resource entities extracted during Step b.2.
Step b.5 In order to reduce noise and refine the overall category graph, we apply Dijkstra's algorithm and the Spreading Activation technique to the semantic graph created in the previous step. This step aims to remove categories that are likely to be irrelevant [31]. Specifically, after the extraction of the category tree from DBpedia, Dijkstra's algorithm is applied to find the shortest paths between the categories and the root of the category tree; each edge belonging to these paths is assigned a weight equal to 1; the Spreading Activation algorithm is then applied to deduce the most important top-level categories for an entity. The activation is spread throughout the category graph against the direction of the edges (from "child" to "parent").
Step 6 Computation of the similarity measure between the semantic graphs of the web pages and the semantic graph associated with the query. In this study we investigate which similarity measure, among those presented in Section IV-A (Similarity Measures), is the most effective in ranking learning materials.
Step 7 Ordering of the resources by descending similarity score, as obtained from the Graph Similarity Comparator during Step 6.
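As an illustration of the noise-reduction idea in Step b.5, the following sketch combines shortest paths to the root with a simple child-to-parent spreading activation. It is a simplified reading of the step, not the procedure of [31]: the decay factor, the equal split of activation among parents, the function names and the toy category names are all assumptions introduced here.

```python
import heapq
from collections import defaultdict


def distances_to_root(broader, root):
    """Dijkstra with unit edge weights: hop distance of every category to root.

    `broader` maps a category to its parent categories (child -> parents).
    """
    children = defaultdict(set)
    for child, parents in broader.items():
        for parent in parents:
            children[parent].add(child)
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for child in children[node]:
            if d + 1 < dist.get(child, float("inf")):
                dist[child] = d + 1
                heapq.heappush(heap, (d + 1, child))
    return dist


def spread_activation(broader, start_categories, root, decay=0.8):
    """Spread activation from child to parent along shortest-path edges only.

    Returns the activation collected by the top-level categories, i.e. the
    categories whose parent is the root. `decay` is an assumed parameter.
    """
    dist = distances_to_root(broader, root)
    activation = defaultdict(float)
    for category in start_categories:
        if category in dist:
            activation[category] += 1.0
    # Visit the deepest categories first so activation flows upwards once.
    for node in sorted(dist, key=dist.get, reverse=True):
        if node == root or activation[node] == 0.0:
            continue
        parents = [p for p in broader.get(node, ())
                   if p != root and dist.get(p) == dist[node] - 1]
        for parent in parents:
            activation[parent] += decay * activation[node] / len(parents)
    return {n: a for n, a in activation.items() if dist.get(n) == 1}


# Toy category hierarchy (hypothetical names, for illustration only).
ROOT = "Main_topic_classifications"
broader = {
    "A": ["B"],
    "B": ["Business"],
    "C": ["Science", "B"],
    "Business": [ROOT],
    "Science": [ROOT],
}
top_level = spread_activation(broader, {"A", "C"}, ROOT)
print(max(top_level, key=top_level.get))
```

In this toy hierarchy the edge from C to B is ignored because it does not lie on a shortest path to the root, so C contributes only to Science. Categories that collect little activation would then be candidates for removal.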

IV. EVALUATION
The workflow in Figure 1 shows how the SemanticSearch methodology leverages semantic technologies for ranking instructional materials according to a teaching context. According to this methodology, the retrieval of web pages and the named entity recognition process rely on external services (Google Custom Search API and Dandelion NER respectively). However, i) the composition of the query structure, ii) the entity filtering and iii) the similarity of the query and resource graph are three key steps in the methodology that must be correctly tuned in order to improve ranking performance. In this work we evaluate several SemanticSearch measures determined by the combination of the parameters involved in the three key elements of the workflow. The performance of these SemanticSearch measures is compared with the baselines Tf-Idf and BM25F using the Mean Average Precision (MAP) and top-N precision score as accuracy measures. The statistical t-test is used to confirm that the accuracy of our method is higher than that of the baselines. To produce a ground truth dataset of web pages rated within a teaching context, we asked 66 assessors to rate the relevance of web pages with respect to their teaching context. This dataset was used to compare the performance of our method against the baselines according to the MAP and top-N accuracy measures. In the following sections the elements of our evaluation methodology are presented in detail.
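For reference, the two accuracy measures named above can be sketched as follows. The sketch assumes binary relevance judgements (a page is either relevant or not), which is a simplification of the assessors' ratings; the function names and the page identifiers are hypothetical.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked pages that are relevant."""
    return sum(1 for page in ranked[:n] if page in relevant) / n


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant page occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, page in enumerate(ranked, start=1):
        if page in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked pages, relevant pages) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Illustrative data: two queries with hypothetical page identifiers.
runs = [
    (["p1", "p2", "p3", "p4"], {"p1", "p3"}),
    (["p5", "p6"], {"p6"}),
]
print(precision_at_n(runs[0][0], runs[0][1], 2))
print(mean_average_precision(runs))
```

A paired t-test would then compare the per-query scores of two ranking methods over the same set of queries.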

A. MEASURES
The SemanticSearch measures are obtained by combining the following three elements: Query structure, Scoring method and Similarity measure. The query structure is used in the Query Composition step to extract the attributes of the TC that will be used to generate the query to be submitted to the search engine. In the Entity Filtering step the scoring method is adopted to set the threshold to filter the entities that are associated with the resources (web pages) obtained from the search engine as results. Finally, the similarity measure intervenes in the Graph Similarity Comparator stage to determine the degree of similarity between the query and resource graphs. The following sections describe the three components in detail. Then, in Section V we evaluate which combination of these elements is the most beneficial for ranking the learning materials.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3186356

a: Query Structures
When the system is asked to search for teaching material for a particular concept CN, the system can generate several queries by combining the attributes of the TC. Since the order of the attributes does not influence the scoring phase, Table 1 reports the 16 query structures which represent all the combinations of the TC attributes. One aim of the present study is to identify which query structure, out of the 16 we can generate, is the best for the SemanticSearch methodology.
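One way to arrive at 16 order-independent query structures is sketched below. It rests on an assumption that is not stated explicitly in the text: the concept name CN is always part of the query, and each of the other four attributes is either included or omitted, giving 2^4 = 16 combinations. The enumeration order does not claim to match the numbering of Table 1, and the PK and DIFF values in the example are hypothetical.

```python
from itertools import combinations

# Assumption: CN is mandatory; the remaining four attributes are optional.
OPTIONAL_ATTRIBUTES = ["CT", "PK", "DIFF", "EL"]


def query_structures():
    """Enumerate every subset of the optional attributes, each prefixed by CN."""
    structures = []
    for size in range(len(OPTIONAL_ATTRIBUTES) + 1):
        for subset in combinations(OPTIONAL_ATTRIBUTES, size):
            structures.append(("CN",) + subset)
    return structures


def build_query(structure, teaching_context):
    """Join the values of the selected attributes into a plain-text query."""
    return " ".join(teaching_context[attribute] for attribute in structure)


# Teaching context loosely based on the "Operators" / "Java programming"
# example; the PK and DIFF values are made up for illustration.
teaching_context = {
    "CN": "Operators",
    "CT": "Java programming",
    "PK": "Variables",
    "DIFF": "beginner",
    "EL": "undergraduate",
}

structures = query_structures()
print(len(structures))
print(build_query(("CN", "CT", "PK"), teaching_context))
```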
Following the approach already presented in the literature [31], we produced a graph representation of the query using semantic entities extracted by means of the Dandelion API. In particular, the queries, written according to the query structures in Table 1, are elaborated with the Dandelion NER framework to obtain the corresponding DBpedia entities. Then, the dct:subject property indicates the related DBpedia categories. The hierarchical structure of the DBpedia categories is, in turn, used to construct the category graph of each entity.
For example, let us imagine that a teacher is searching for learning resources relating to "Strategic information system" (CN) for a "Business strategy" (CT) undergraduate course (EL) for beginners (DIFF). We also assume that the teacher specifies the concept "Organisational strategy" (PK) as a prerequisite for the Strategic information system. Figure 2 shows the category graph for the query "Business strategy, Strategic information system, Organisational strategy" automatically generated using the query structure Q8 in Table 1 with the attributes CN, CT, PK:
• Strategic information system (CN)
• Business strategy (CT)
• Organisational strategy (PK)
The Dandelion NER framework returns the DBpedia entities for this query: dbr:Strategic_information_system, dbr:Strategic_management, dbr:Organization. The categories of these three entities are extracted using the dct:subject property of each entity. Subsequently, the category graph is created by considering the skos:broader relationships of the DBpedia categories. In the resulting graph, as shown in Figure 2, each node of the graph is a DBpedia category, and the top-level categories (Business and Science in the figure) have the dbc:Main_topic_classifications category as skos:broader. The graph clearly shows that the user is asking for resources about strategic systems in the field of business. Generally speaking, the higher the number of terms in the query, the more complex the structure of the resulting query graph will be (i.e., with a higher number of nodes and edges), given that the entity extractor tool is likely to find a larger quantity of entities. This semantic representation of the query allows us to leverage semantic approaches in the overall ranking process.
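The construction of a category graph from dct:subject and skos:broader links can be sketched with in-memory dictionaries standing in for DBpedia. The three entity identifiers come from the example above, but the category assignments and the hierarchy below are toy data invented for illustration, and `build_category_graph` is a hypothetical helper rather than part of any library.

```python
from collections import deque

ROOT = "dbc:Main_topic_classifications"


def build_category_graph(entities, subjects, broader, root=ROOT):
    """Collect category nodes and (child, parent) edges reachable from entities.

    `subjects` plays the role of dct:subject (entity -> categories) and
    `broader` the role of skos:broader (category -> parent categories).
    """
    nodes = set()
    edges = set()
    queue = deque()
    for entity in entities:
        for category in subjects.get(entity, ()):
            if category not in nodes:
                nodes.add(category)
                queue.append(category)
    while queue:
        category = queue.popleft()
        if category == root:
            continue  # do not expand beyond the root category
        for parent in broader.get(category, ()):
            edges.add((category, parent))
            if parent not in nodes:
                nodes.add(parent)
                queue.append(parent)
    return nodes, edges


# Toy stand-in for DBpedia (hypothetical category links).
subjects = {
    "dbr:Strategic_information_system": ["dbc:Information_systems"],
    "dbr:Strategic_management": ["dbc:Strategic_management"],
    "dbr:Organization": ["dbc:Organizations"],
}
broader = {
    "dbc:Information_systems": ["dbc:Science"],
    "dbc:Strategic_management": ["dbc:Business"],
    "dbc:Organizations": ["dbc:Business"],
    "dbc:Business": [ROOT],
    "dbc:Science": [ROOT],
}

nodes, edges = build_category_graph(list(subjects), subjects, broader)
print(len(nodes), len(edges))
```

In a real setting the two dictionaries would be filled from DBpedia, for example through its SPARQL endpoint; that step is omitted here to keep the sketch self-contained.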

b: Scoring Methods
For each resource (web page) analysed in our method, we extract the semantic entities and then create a category graph following the same approach used for the query. The textual content of a resource may be much more detailed than a query, even just considering the number of words used. As previously stated, with increasing text length the graph structure displays a larger quantity of nodes and edges. However, the extractor may find entities that are poorly related to the resource as a whole. To reduce this noise and improve the performance of our method in the subsequent phases, we filter the entities mined by Dandelion, retaining only the most relevant ones. We consider the following three possible approaches to filter the semantic entities:
• Ef-Irf: this refers to the calculation of an adapted version of the Tf-Idf (Term Frequency-Inverse Document Frequency) usually adopted in linguistic analysis. The Ef-Irf (Entity Frequency-Inverse Resource Frequency) measure considers entities rather than terms, and the resources to which the entities refer rather than documents. This measure has been used successfully to enrich educational materials with semantic entities [32].
• Confidence: this is the value of confidence assigned by Dandelion to each extracted entity. The higher the confidence, the better the association mined by the tool. Thus, the entities extracted with greater confidence are likely to be the most significant ones.
• Ef-Irf * Confidence: by multiplying the two scores obtained from the previous approaches we generate a third score that takes into account both frequency and semantic aspects.
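The three filtering scores can be sketched as follows. The exact Ef-Irf formula is not given in this section, so the sketch assumes a direct analogue of Tf-Idf with a smoothed inverse resource frequency, 1 + log(N / (rf + 1)); the top-k cut-off, the function names, the entity identifiers and the confidence values are likewise assumptions made for illustration.

```python
import math
from collections import Counter


def ef_irf(resources):
    """Ef-Irf score of every entity in every resource.

    `resources` maps a resource id to the list of entities extracted from it
    (with repetitions). Assumed formula: ef * (1 + log(N / (rf + 1))).
    """
    n_resources = len(resources)
    resource_freq = Counter()
    for entities in resources.values():
        resource_freq.update(set(entities))
    scores = {}
    for rid, entities in resources.items():
        entity_freq = Counter(entities)
        scores[rid] = {
            e: entity_freq[e] * (1.0 + math.log(n_resources / (resource_freq[e] + 1)))
            for e in entity_freq
        }
    return scores


def combined_scores(ef_irf_scores, confidence):
    """Multiply each Ef-Irf score by the extractor's confidence for the entity."""
    return {
        rid: {e: s * confidence[rid][e] for e, s in entity_scores.items()}
        for rid, entity_scores in ef_irf_scores.items()
    }


def top_entities(entity_scores, k):
    """Keep the k highest-scoring entities (assumed filtering rule)."""
    return sorted(entity_scores, key=entity_scores.get, reverse=True)[:k]


# Illustrative data with hypothetical entities and confidence values.
resources = {"r1": ["dbr:A", "dbr:A", "dbr:B"], "r2": ["dbr:B"]}
confidence = {"r1": {"dbr:A": 0.5, "dbr:B": 0.9}, "r2": {"dbr:B": 0.7}}

scores = ef_irf(resources)
combined = combined_scores(scores, confidence)
print(top_entities(combined["r1"], 1))
```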

c: Similarity Measures
The final step is to produce a resource category graph from the semantic entities extracted from a resource (web page) and filtered as described above. The categories are linked to the entities through the dct:subject property and, in addition, the categories are linked to one another through the skos:broader property. Finally, each resource category graph is compared with the query category graph by the Graph Similarity Comparator. The similarity between two category graphs indicates whether two documents, in our case the query and a resource (web page), address similar topics. There are different ways to compute the similarity between two category graphs. It is not the purpose of this study to investigate and compare such methods; here we discuss just the two approaches that we find most relevant to our study:
• Overlapping Degree (OD): this similarity measure addresses the different arrangement of common concepts between two graphs. It takes into account both the commonality of nodes and the structural-semantic correspondence of the placement of the common nodes in the two graphs [33].
• Jaccard: this is a well-known measure that computes the intersection over union of the node sets of the two graphs. This measure does not consider the relations between nodes in the graph.
After this final step, resources are scored based on their relevance to the query. The semantic ranking is achieved by ordering these resources from the highest to the lowest score.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

TABLE 1. The set of query structures used for the evaluation of the SemanticSearch methodology. The Query Example column shows an example of the query derived from the concept "Operators" for a "Java programming" undergraduate course.

FIGURE 2. An example of a semantic graph produced from the categories of the entities extracted from the query "Business strategy Strategic information system Organisational strategy".
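The Jaccard comparison and the final ordering of resources described in this subsection can be sketched as follows. This is only an illustration: the category names and page identifiers are invented, and the Overlapping Degree measure is not reproduced because its definition is given in [33] rather than in this section.

```python
def jaccard(nodes_a, nodes_b):
    """Intersection over union of two node sets; edges are ignored."""
    a, b = set(nodes_a), set(nodes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_resources(query_nodes, resource_nodes):
    """Return (resource, score) pairs ordered from highest to lowest score."""
    scored = [(name, jaccard(query_nodes, nodes))
              for name, nodes in resource_nodes.items()]
    # Ties are broken by name so that the ordering is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical category nodes for a query graph and three resource graphs.
query_graph = {"Strategic_management", "Information_systems", "Business"}
resource_graphs = {
    "page_a": {"Strategic_management", "Business", "Marketing"},
    "page_b": {"Shopping", "Retail"},
    "page_c": {"Strategic_management", "Information_systems", "Business"},
}

for name, score in rank_resources(query_graph, resource_graphs):
    print(name, score)
```

Because only node sets are compared, two graphs with the same categories but different skos:broader links receive the same Jaccard score, which is the limitation noted in the text.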

B. BASELINE METHODS
The SemanticSearch methodology is an IR methodology. For this reason, its accuracy performance is compared to Tf-Idf and BM25F as they represent unstructured and structured scoring methods respectively. We do not need to dwell on the importance of Tf-Idf and BM25F methods, which are the bases of modern IR approaches [34], [35] and are still used as the baseline for new IR methods [34]. These methods can be configured in different ways to reflect the specific scoring application better [34].

a: Tf-Idf
This measure is an unstructured scoring method that analyses the body text of a web page as a whole. Tf-Idf computes two coefficients for each term of a query, namely the Term Frequency (Tf) in a document and the Inverse Document Frequency (Idf) in the collection of documents. The higher the Tf-Idf value for a document, the more relevant it is. Tf is the number of occurrences of a term in a document, while Idf decreases as the number of documents in the dataset that contain the term increases. When scoring documents for a non-binary query, a popular way of using Tf-Idf relies on a Vector Space Model (VSM) representation of both the query text and the document [35]. We thus build two vectors of Tf-Idf scores from the terms in the query: one for the body text of the web page and the other for the query text. The dimension of the vectors is equal to the number of unique terms in the query.
In this study, we use the Apache Lucene Practical Scoring Function after removing the normalisation and boosting factors. Given a term t of a query q and a text text, the Tf-Idf score combines the frequency of t in text with idf(t), which is defined as follows:

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\text{total number of documents}}{\mathrm{docFreq}(t) + 1}\right)$$

Hence, the relevance score of a web page w for a query q is computed from V_q and V_w, the Tf-Idf vectors of the query q and the web page w respectively.
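To make the vector construction concrete, the following sketch builds the two Tf-Idf vectors over the unique query terms and compares them. The idf definition follows the formula given in the text. Two points are assumptions rather than the setup used in the study: the term frequency is taken as the raw count, and the vectors are compared with cosine similarity, a common choice in the Vector Space Model. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def idf(term, documents):
    """idf(t) = 1 + log(total number of documents / (docFreq(t) + 1))."""
    doc_freq = sum(1 for doc in documents if term in doc)
    return 1.0 + math.log(len(documents) / (doc_freq + 1))


def tfidf_vector(terms, tokens, documents):
    """One Tf-Idf score per query term; tf is the raw count (an assumption)."""
    counts = Counter(tokens)
    return [counts[t] * idf(t, documents) for t in terms]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance(query_tokens, page_tokens, documents):
    # The vector dimension equals the number of unique terms in the query.
    terms = sorted(set(query_tokens))
    v_q = tfidf_vector(terms, query_tokens, documents)
    v_w = tfidf_vector(terms, page_tokens, documents)
    return cosine(v_q, v_w)


# Toy collection of tokenised body texts (invented for illustration).
docs = [["java", "operators", "tutorial"],
        ["java", "coffee", "shop"],
        ["garden", "tools"]]
query = ["java", "operators"]

for doc in docs:
    print(doc, round(relevance(query, doc, docs), 3))
```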

b: BM25F
Unlike Tf-Idf, BM25F analyses the matching of a query with different parts or sections of a document. In the case of web pages, it is a straightforward process of extracting different kinds of information from the HTML tree of the pages. BM25F basically applies a BM25 function to score the relevance of diverse sections of a web page to a query. It then combines these scores using certain parameters to produce the relevance score. Section IV-E will provide an in-depth discussion of which parts of the web pages we consider for our purposes. However, it is useful to mention here that we consider these sections: s ∈ {title, body, links, highlights}.
In our experiments, the setting of BM25F follows traditional methods [34], where we find that b_title = 0.4, b_body = 0.3 and b_links = 0.4. For b_highlights, we have assigned a value of 0.5, given that this section is expected to represent fundamental concepts regarding the content of the web page. We set K_1 = 1.7 as per the original reference [34]. While the optimisation of boost factors may lead to better results for this method, we unfortunately do not have sufficient web searches for this purpose. Even the original reference, with a much larger dataset, could not optimise the parameters [34].
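The following sketch shows how the parameters above could enter a BM25F score. It uses a common formulation of BM25F (per-section length normalisation, a weighted sum of section frequencies, then saturation with K_1) and should not be read as the exact implementation used in the experiments. The section boosts are set to 1.0 and the idf variant is the non-negative one popularised by Lucene; both are assumptions, as are the function name and the sample pages.

```python
import math

# Length-normalisation parameters and K_1 as reported in the text.
B = {"title": 0.4, "body": 0.3, "links": 0.4, "highlights": 0.5}
K1 = 1.7
# Section boosts are not reported in this section: 1.0 is an assumption.
BOOST = {section: 1.0 for section in B}


def bm25f_score(query_terms, page, avg_len, doc_freq, n_docs):
    """page and avg_len map a section name to its tokens / average length."""
    score = 0.0
    for term in set(query_terms):
        weighted_tf = 0.0
        for section, b in B.items():
            tokens = page.get(section, [])
            if not tokens:
                continue
            # Section-level length normalisation.
            norm = 1.0 - b + b * len(tokens) / avg_len[section]
            weighted_tf += BOOST[section] * tokens.count(term) / norm
        df = doc_freq.get(term, 0)
        # Non-negative idf variant (an assumption, not taken from the paper).
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturation: repeated occurrences bring diminishing returns.
        score += idf * weighted_tf / (K1 + weighted_tf)
    return score


# Invented example data.
avg_len = {"title": 5, "body": 100, "links": 10, "highlights": 5}
doc_freq = {"java": 2, "operators": 1}
query = ["java", "operators"]
page_a = {"title": ["java", "operators"],
          "body": ["java"] * 3 + ["filler"] * 97,
          "highlights": ["operators"]}
page_b = {"title": ["garden", "tools"], "body": ["plants"] * 100}

print(bm25f_score(query, page_a, avg_len, doc_freq, 10))
print(bm25f_score(query, page_b, avg_len, doc_freq, 10))
```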

C. ACCURACY MEASURES
When comparing the performance of IR methods, the main reference point is accuracy [36]. In this study, we analyse the SemanticSearch methodology following Mean Average Precision (MAP) and top-N precision scores.

a: Mean Average Precision
This is one of the most popular evaluation metrics in IR [36]. MAP makes it possible to compare the performance of two or more IR techniques over a set of queries. MAP expresses overall accuracy by computing the mean of the Average Precision (AP) scores of all the queries. For each query, the method under evaluation produces a ranking of the items. Given such a ranking, the AP score is the average of the precision values computed at the rank position of each relevant item. In this study, we compare the MAP of our methodology against the baselines. To strengthen our findings further, we also comment on the paired t-tests of the AP scores relating to each query.
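As an illustration of the definition above, the sketch below computes AP for a single ranking and MAP over a set of rankings. It assumes that each ranked item has already been marked as relevant or not; how the 5-point usefulness ratings are turned into a binary label is not specified in this section, so that step is left out. The function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: booleans for one ranking, best-ranked item first."""
    hits = 0
    precisions = []
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision of the list truncated at this relevant item.
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of the AP scores over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two invented rankings: relevant items at ranks 1 and 3, then at rank 2.
rankings = [[True, False, True], [False, True]]
print([average_precision(r) for r in rankings])
print(mean_average_precision(rankings))
```

For the first ranking the precision values at the relevant positions are 1/1 and 2/3, so AP is 5/6; for the second it is 1/2.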

b: Top-N Precision
MAP already provides an analysis of the accuracy of the methods in terms of precision scores. In Information Retrieval, Precision is computed as the fraction of the retrieved documents that are relevant. However, further insight into the quality and practical benefit of a scoring method is gained by also considering where the relevant documents appear in an ordered results list. Consider a search engine that, for a query, splits the resulting documents into pages of 10 documents each. Normally, a Web user tends to regard the web pages that appear on the first page, and in the highest ranking positions, as the most relevant. Therefore, when evaluating a ranking methodology it is important to take into account the rank position of a relevant document: even when a relevant resource is correctly retrieved, the overall precision of the method is low if the methodology does not rank it in a top position. In contrast, a methodology that is able both to retrieve relevant resources and to rank them as top results yields high precision. For this reason, and to carry out a more practical appraisal of the methods, we compute three different values for the Precision measure in our evaluation, considering the relevant items in positions of high interest for the user: Precision at top-1 (denoted as P@1), top-3 (P@3) and top-5 (P@5).
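The three cut-offs can be computed with a few lines of code. The sketch below assumes binary relevance labels for a ranked list, with the best-ranked item first; the function name and the two example rankings are invented.

```python
def precision_at(ranked_relevance, n):
    """Fraction of the top-n ranked items that are relevant."""
    return sum(1 for relevant in ranked_relevance[:n] if relevant) / n


# Both rankings contain three relevant items, placed at different positions.
well_ranked = [True, True, False, True, False, False]
badly_ranked = [False, False, False, True, True, True]

for label, ranking in (("well_ranked", well_ranked), ("badly_ranked", badly_ranked)):
    print(label, [precision_at(ranking, n) for n in (1, 3, 5)])
```

The second ranking retrieves the same relevant items as the first, yet its P@1 and P@3 are zero, which is the situation the text describes.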

D. STATISTICAL TESTING
The t distribution can be used for statistical hypothesis testing when only a small sample of the entire population has participated in the study [37]. In such a scenario, the t distribution is found to be more reliable than the z distribution for hypothesis testing and constructing a confidence interval for the population mean [37]. The t distribution should be used when the underlying population is assumed to be normally distributed, and it is considered to be very accurate if the sample size is sufficiently large [37]. This applies to our case study.
The goal of our t-test is to show that our SemanticSearch methodology produces more accurate rankings than current practice. For each baseline approach, we perform a paired t-test with our proposal for the AP, P@1, P@3 and P@5 measures. In each test, the null hypothesis H_0 states that the mean accuracy of our method is not higher than that of the baseline. Following our data quality assurance procedure, we recorded a total of 61 valid searches during the data collection phase. Given this sample size, we set the threshold for the t-value to 2.000 to reject the null hypothesis at the .05 significance level with 60 degrees of freedom.
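A paired t-test of this kind can be sketched with the standard library alone. The code below computes the t statistic from the per-search differences and compares it with the 2.000 threshold quoted above. The scores are invented solely to exercise the function and do not reproduce any result of the study; the function name is hypothetical.

```python
import math
import statistics

T_THRESHOLD = 2.000  # threshold quoted in the text for 60 degrees of freedom


def paired_t(ours, baseline):
    """Return (t statistic, degrees of freedom) for paired per-search scores."""
    if len(ours) != len(baseline):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Invented AP scores for 61 searches (not data from the study).
baseline_ap = [0.50] * 61
ours_ap = [0.50 + 0.01 * (i % 5) for i in range(61)]

t_value, dof = paired_t(ours_ap, baseline_ap)
print(dof, round(t_value, 3), t_value > T_THRESHOLD)
```

With 61 paired observations the test has 60 degrees of freedom, which matches the setting described in the text.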

E. DATASET
To run a comparison of the methods, we need a dataset of web pages with a relevance label for a teaching context. To the best of our knowledge, no such dataset exists, so we proceed with the collection of such data. During the data collection phase, we have to be sure that the external assessors rate the usefulness of web pages for teaching rather than merely their relevance to the query: a small but significant difference for a proper experiment [38]. For a reliable evaluation of the usefulness of a web page, external assessors must rate items with the highest level of knowledge and awareness of the purpose of the web search [38]. To fulfil these requirements, we designed and implemented an online survey that allows instructors to define a teaching context which is of interest to them, and which includes a concept map. Users then formulate a query for retrieving web pages relating to a concept in their concept map. Our online survey system submits the query to Google and presents the assessors with the first 10 items as ranked by Google. We only select the top 10 items to keep the survey short, and because users tend to stop at the first page of items retrieved by a search engine (10 items in the case of Google with a standard configuration). To avoid any bias due to the system presentation order, the survey shows the 10 items in a random order, and assessors are made aware of this. The items are presented directly in a neutral environment so that the users do not know that Google is processing the query. Finally, the assessors rate the retrieved web pages according to their usefulness for teaching a concept in the teaching context. Assessors evaluate the web pages by means of a 5-point Likert scale, with each web page rated with a score from 1 to 5 (1 = not useful, 5 = extremely useful). Following this protocol, instructors are the external assessors of the web pages, and as such they define the contextual information, formulate the query, and rate the web pages.
We believe that this data-collection protocol responds to issues highlighted by [38] in obtaining a true assessment of the usefulness, and not just the relevance, of items in a web search. After the data collection phase, the dataset includes 77 web searches, each of them with 10 web pages rated by the assessors.
To enable structured scoring of the web pages, we identify four main parts of a web page which are of interest for our application. Table 2 details the four sections that the methods can process, together with the HTML tags that we extract from a web page and whose text makes up each section. There is also the issue of eliminating text that is not relevant and that may generate noise in the ranking task (e.g., menu bars). This problem has no easy solution and is beyond the scope of the present research. Usually, when proposing a ranking principle, such an issue can be resolved subsequently, once the new approach has proved to work efficiently [34]. For our research, it is important that both the benchmark and our methods use the same texts during the experiments. The presence of texts taken from noisy HTML tags affects all the scoring methods, so we can assess the potential improvement of the new SemanticSearch measures under these conditions as well. We only perform some general cleaning of the texts (removal of stop words and stemming [39]) before we run any scoring methods. For all the queries and the corresponding pages retrieved by the search engine, we created a semantic representation with the corresponding query category graph and resource category graphs.

a: Additional web pages to rank
In real-world scenarios, traditional IR scoring functions process thousands of web pages in a web search, with only a few of them being useful for the purpose of the search. To make our evaluation similar to that scenario, we want to apply our scoring methods to a very large set of resources. However, it is not feasible to ask assessors to evaluate over 1,000 items for each web search, and neither do we have the resources to collect such data from existing systems. Hence, on top of the 10 pages we collected during the data collection phase, we added at least 1,000 non-relevant web pages extracted from the DMOZ (now Curlie) web directory. These pages are randomly selected from the different categories included in the directory. Because DMOZ contains dead links, each search in the dataset may have a slightly different number of additional noisy items. On average, we added 1,070 noisy items, ending up with more than a thousand web pages to be ranked for each web search. Even though assessors have not rated the web pages from DMOZ, we know that the latter are not relevant as they come from non-educational categories of DMOZ (e.g., Shopping). We can assert that they are not useful for teaching in general (we do not make assumptions on their relevance to the topic of the query), so we label all the web pages from DMOZ with a rating of 1 (the lowest rating of usefulness).
For each DMOZ item that is added to the web-searches in our dataset, we create the semantic representation by i) extracting semantic entities, ii) filtering out irrelevant semantic entities, and iii) building the category graph according to the DBpedia categories of the entities.
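The construction of the enlarged pool for one web search can be sketched as follows. The sketch only covers the sampling and labelling step: the entity extraction, filtering and graph building mentioned above depend on external services and are not reproduced. The function name, the URLs and the fixed random seed are hypothetical.

```python
import random

NOISE_RATING = 1  # lowest usefulness rating, assigned to every directory page


def build_search_pool(rated_pages, noise_urls, n_noise, seed=0):
    """Merge assessor-rated pages with randomly sampled, unrated noise pages.

    rated_pages: dict mapping URL to a 1-5 usefulness rating.
    noise_urls: list of URLs taken from non-educational directory categories.
    """
    rng = random.Random(seed)
    pool = dict(rated_pages)
    for url in rng.sample(noise_urls, n_noise):
        # A page rated by an assessor keeps its rating if sampled again.
        pool.setdefault(url, NOISE_RATING)
    return pool


# Invented data: 10 rated pages and 1,200 candidate noise pages.
rated = {f"https://example.org/rated/{i}": 1 + i % 5 for i in range(10)}
noise = [f"https://example.org/shopping/{i}" for i in range(1200)]

pool = build_search_pool(rated, noise, n_noise=1070)
print(len(pool))
```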

V. RESULTS
This section reports the outcome of the evaluation described in Section IV. The discussion compares the performances of our SemanticSearch and ERP measures against the baselines Tf-Idf and BM25F.
Since Tf-Idf and BM25F do not benefit from the educational information, while ERP and our SemanticSearch measures do, in order to make a fair evaluation we consider the accuracy of Tf-Idf and BM25F with both the queries formulated by the assessors and the queries built from the TC attributes (TC-based queries).
Because there are 16 possible TC-based queries (see Table 1), for the sake of brevity we compare our methods with the TC-based query that shows the best results for Tf-Idf and BM25F. The goal of this analysis is to identify the best combination of query structure, scoring method and similarity for our SemanticSearch methodology. We then compare the most accurate SemanticSearch measure with ERP and the baselines to appreciate the performance of our semantic methodology compared to existing practice. We will refer to each possible combination of the three components of the SemanticSearch using the following notation: <query_id> <scoring_method> <similarity_measure>. For instance, the Q1 query structure (see Table 1) with the Ef-Irf scoring method and the OD similarity measure is given the name Q1EfIrfOD.
However, we must consider that the query-graphs are automatically generated using the semantic entities taken from the query texts. For some of them, the Dandelion API could not mine any semantic entity even when the query contained academic or scientific terms. This issue is due to the nature of the queries that may not have enough terms that match with the DBpedia knowledge base. Since new semantic entities are continuously added in DBpedia, we expect this limitation of the proposed approach to have less impact in future versions of the knowledge base. To apply statistical analysis of our SemanticSearch measures against the baselines, we need to evaluate them over a set of at least 50 web searches. For this reason, we will only report the results of those methods that are able to produce the category graphs for at least 70% of the queries in our dataset. We decided on this threshold so each SemanticSearch measure could be applied to at least 54 web searches. This number is a compromise between having at least 50 web searches and enough SemanticSearch measures to compare.
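The coverage threshold can be checked with a short calculation: 70% of the 77 web searches is 53.9, which rounds up to the 54 searches mentioned above. The sketch below applies this cut-off to a set of measures. The measure names and coverage counts in the example are invented and do not correspond to the values obtained in the study.

```python
import math
from fractions import Fraction

TOTAL_SEARCHES = 77
MIN_COVERAGE = Fraction(70, 100)


def min_searches(total=TOTAL_SEARCHES, coverage=MIN_COVERAGE):
    """Smallest number of searches that reaches the coverage threshold."""
    return math.ceil(total * coverage)


def passing_measures(graphs_built, total=TOTAL_SEARCHES):
    """Names of the measures whose category graphs cover enough searches."""
    needed = min_searches(total)
    return sorted(name for name, count in graphs_built.items() if count >= needed)


# Hypothetical coverage counts for three measures.
coverage = {"measure_a": 66, "measure_b": 54, "measure_c": 40}
print(min_searches())
print(passing_measures(coverage))
```

Using exact fractions avoids any floating-point rounding when the threshold is multiplied by the number of searches.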
The only SemanticSearch measures that were able to pass that threshold are: q2EfIrfConfidenceOD, q5EfIrfOD, q8EfIrfJaccard, q10EfIrfOD, and q8EfIrfConfidenceJaccard. Consequently, we evaluate here the performance of just these five measures.
Before going into the performance of the SemanticSearch measures, we first observe how the ERP performs against the baselines.

A. ERP VERSUS THE BASELINES
While we know that ERP works well with a small number of resources per search [12], its performance drops significantly when we test the method over a large set of items per search. This is why we propose involving semantic approaches in the ranking process. Table 3 reports the paired t-tests of ERP against Tf-Idf and BM25F. While ERP shows a strong improvement over Tf-Idf, BM25F is a much more challenging baseline. This result confirms that a structured scoring of the web pages should be the focus of an educational ranking principle. Overall, ERP performs better than BM25F over the 77 searches of this study, but this is not enough to generalise the results on a large scale. We expect our SemanticSearch measures to achieve stronger results than ERP.

B. SEMANTICSEARCH VERSUS THE BASELINES
This section presents and discusses the accuracy of our SemanticSearch method against the accuracy of the baselines, considering the five measures that were able to pass the threshold illustrated above. Table 4 reports the results of the paired t-tests between the five SemanticSearch measures and Tf-Idf. Since the p-values are lower than 0.05 for all of our methods, we conclude that the SemanticSearch methods performed significantly better than Tf-Idf. However, ERP is already able to outperform Tf-Idf. The most challenging baseline is BM25F. When comparing the SemanticSearch methodology against BM25F, the outcome is less clear-cut than with the previous results. Table 5 shows that, overall, only q5EfIrfOD performs better than the baseline, although there are some borderline situations. While AP and P@5 leave no doubt as to the better ranking produced by the q5EfIrfOD method (p < 0.05, t > 2.000, conf. diff. > 0.02), P@1 and P@3 are more critical. We record an increase of 11.861% in P@3 when we compare q5EfIrfOD to the average P@3 value of BM25F. While the t-value (t = 1.75) does not reach the 2.000 threshold, the p-value is lower than 0.05, suggesting a better P@3 performance for q5EfIrfOD.
Overall, we find that the q5EfIrfOD approach is more reliable than traditional IR methods. Even though BM25F benefits from a formulation of queries, the statistical tests confirm that our q5EfIrfOD is still able to generate better rankings. Hence, we have been able to improve on ERP using semantic data in order to make it more robust when ranking an extremely large set of resources for a web search.

VI. CONCLUSIONS AND FUTURE WORK
This study has addressed the critical issue of ranking educational material for educational applications. Web searches imply various challenges owing to the extremely large number of web pages involved, most of which are not relevant for the purposes of the search [40]. This situation is even more problematic when educational issues are involved. Not all web pages extracted by Google or other search engines are suitable for use in an educational context. Consequently, instructors are obliged to spend a lot of time and effort checking the many results from Google or other search engines before they can select the appropriate resources for their educational needs. An initial study of this topic suggests adopting ERP, which is a new structured method for scoring web pages according to teaching context [12]. However, this approach has some limitations when scoring a large number of web pages, i.e., not just the top 10 pages retrieved by Google.
To overcome this drawback, we have integrated ERP with semantic data. The proposed methodology combines Information Retrieval and Semantic Web methodologies, leveraging DBpedia knowledge graphs to describe the content of a query based on terms used in the instructor's teaching plan. While we find that ERP struggles when ranking hundreds of resources according to teaching context, the semantic data address this issue. The experiments carried out support this statement. The dataset provides more than 1,000 noisy items in addition to the 10 Google results rated by assessors for their usefulness. Following the methodology illustrated in Section IV, we established that the optimal composition of the query based on teaching context is the one that considers CN and PK. As a filtering method we used Ef-Irf, while as a measure of the similarity between graphs we used the Overlapping Degree. This combination (q5EfIrfOD) performs very well not only in comparisons with the Tf-Idf baseline (Tab. 4), but also in comparisons with BM25F (Tab. 5). The paired t-tests regarding average precision against the baselines are positive. Moreover, the P@5 and P@3 metrics have a significant p-value (p < 0.05).
This study fulfils the promise of an educational ranking principle that is able to score a large number of web pages simultaneously in order to facilitate the retrieval of instructional materials. This is thanks to its semantic-based approach.
However, there are some limitations, with ERP and SemanticSearch being useful in different applications. The term-based ERP is efficient and requires less data processing because it does not need to conduct a semantic analysis of the teaching context and the web page content. On the other hand, these semantic data are crucial for ranking hundreds of web pages at once. The issue with semantic data extraction is not only computational, but also concerns feasibility in certain contexts. During the study, we were able to apply the semantic approach to a smaller number of web searches than the original ERP, namely 66 instead of 77 (86% of the total). The reason for this is that it was not possible to extract a sufficient number of entities for some teaching contexts. Therefore, ERP is better suited to web searches where the set of resources has already been pruned by another system, such as Google. In contrast, Semantic ERP works well on its own and does not necessarily have to be used in combination with another system. Since our methodology explains how semantic data can be used for the scoring of educational web pages, we expect a combination of textual and semantic analysis to provide a more scalable and accurate ERP. In conclusion, this study aims to minimise the time instructors have to spend looking for teaching material by automatically performing the time-consuming activity of evaluating the educational suitability of a web resource for a specific educational context. We also believe that current and future systems in TEL can benefit from ERP and SemanticSearch in order to exploit the huge amount of knowledge hosted on the Web efficiently. Such a development is likely to enhance the effectiveness of the overall teaching experience.
DAVIDE TAIBI is a senior researcher at Institute of Educational Technology of the National Research Council of Italy and part-time lecturer at the Department of Computer Science, University of Palermo. His main research areas are related to pedagogical applications to mobile learning, learning analytics and social media in education. He is coordinating two European funded projects on the development of data literacy competencies for university and business sectors.
14 VOLUME 4, 2016 This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3186356