ELAD: An Entity Linking Based Affiliation Disambiguation Framework

The number of papers has exploded as more people and more types of institutions participate in scientific research. At the same time, institution name disambiguation (IND), which is critical for research assessment, scholar alignment, etc., is becoming more difficult. Previous knowledge-based and rule-based methods require knowledge and rules prepared in advance, so they cannot cope with growing and changing data, especially data covering a long period and drawn from abundant sources. This paper proposes an automatic learning framework, ELAD, to solve the problem, which is based on entity linking and consists of entity type recognition, candidate generation, and result selection. Experiments show that its precision and recall are much higher than those of the traditional method, that ELAD learns more knowledge from the knowledge graph, and that it can deal with ever-changing and ever-increasing data. Moreover, it solves several problems that traditional methods cannot: connecting institution entities, correcting mistakes, and reducing manual effort and pre-prepared knowledge. Finally, as case studies, we develop two applications based on ELAD, which further demonstrate its reliability.


I. INTRODUCTION
The primary purpose of IND in academic big data is to identify and map the top-level institutions in affiliations to real-world institutional entities. IND is essential for research assessment [1], institution cooperation networks [25], scholar name disambiguation [27], scholar trajectories [18], talent flow [13], management of scientific papers [10], assessment of institutional research performance [17], institution ranking [4], etc. With the progress of modern science and technology, the number of scientific research papers is surging. Statistics show that the average global growth rate has been sustained at about 15% for both papers and patents in recent years [28]. On the one hand, the number of newly published papers is increasing exponentially. On the other hand, the backward data storage technology of past years has caused many non-traceable problems. What is more, to meet the needs of the rapid development of modern technology, all kinds of joint research laboratories, enterprise research institutes, new specialized research academies, and other new research institutions emerge in an endless stream. The historical technical problems, the explosion of paper data, and the emergence of new research institutions make the background of the IND problem rather complicated.
At the same time, the causes of the problem are complex: different translation methods, different spelling languages, institutional transitions, spelling errors, institutions versus divisions [8], different writing styles, typographical or OCR mistakes, abbreviations, translation mistakes [22], omissions [6], different institutions sharing the same name, etc. Moreover, it is often hard to tell which cause produces the disambiguation error for a specific affiliation. Although traditional methods can solve problems within a finite set with strict rules and knowledge, the issue becomes complicated when faced with a large number of growing data sets. This paper focuses on addressing the IND problem based on entity linking, which reduces the reliance on predefined rules and pre-prepared knowledge. What is more, the method can deal with the growing and changing types of institutions in affiliations with the help of a regularly updated knowledge graph.

II. RELATED WORK
Previous work on this problem can be divided into three main categories: rule-based methods, knowledge-based methods, and combinations of the two. The main characteristics of rule-based methods include: (1) a productive system of rules based on common instances, usually implemented with regular expressions; (2) a rich additional keyword library, e.g., city/state/country names, keywords for organizations, etc.; (3) a similarity matching or clustering algorithm. Knowledge-based methods usually have the following properties: (1) pre-prepared knowledge bases, e.g., institutions and their introductions; (2) manual handling.
De Bru and Moed [7] formulated the simplest IND rules. NEMO is a typical rule-based system that performs multilayered rule matching with multiple dictionaries [11]. Its normalization process involves clustering based on weighted local sequence alignment metrics to address synonymy at the word level, and local learning based on finding connected components to address synonymy [11]. Jiang et al. propose a clustering method based on normalized compression distance to group affiliations that denote the same institute [10]. Cuxac et al. [6] apply both supervised and semi-supervised approaches to affiliation clustering. Huang et al. propose a rule-based algorithm whose precision is high but whose recall is low [8]. Sun et al. build an affiliation knowledge base named Authority File for Affiliations (AFA) based on ontology principles [20]. They also established seven name matching rules based on the regions, types, and naming characteristics of institutions [21].
Aumueller et al. determine the similarity of affiliations based on how the URLs in the result sets of affiliation web searches overlap [3]. Based on that, they also proposed an approximate string metric that handles acronyms and abbreviations [2]. Morillo et al. propose a semi-automatic method to standardize or codify addresses, which requires a large amount of hand-coded data [14]. NooJ is a corpus processing system with large-coverage multilingual dictionaries and grammars [19]. Taşkın et al. use NooJ for institution standardization [23]. The system sCooL is based on mappings collected from different curated and non-curated sources [9], for which knowledge from CareerBuilder (CB) and Wikipedia, along with manual mappings, is critical. Google Scholar Citations (GSC) provides an institutional affiliation link via institution name and institutional e-mail web domain [16].
Pre-prepared rules and knowledge are valid for a given data set, which can be fully explored by statistical methods. However, when faced with large-scale data, the rules become very complicated, and the knowledge needs to be extremely rich. For data sets that are not pre-defined (the number of affiliations keeps increasing, and the types and distribution of institutions change over time), traditional approaches are inevitably limited by pre-prepared rules and knowledge.
All in all, the main limitations of the traditional methods are that they are (1) difficult to scale up to increasing and changing affiliations; (2) hard pressed to find new institutions; and (3) unable to explore the relationships between institutions. However, the development of knowledge graph technology provides a new way to solve this problem. Our work focuses on solving the IND problem by entity linking with the help of an existing knowledge graph.

III. PRELIMINARIES
Before introducing the framework, we present the preliminaries to systematically define the problems to be solved by our framework.

A. XLore & XLink
XLore 1 is a large-scale multi-lingual knowledge graph built by structuring and integrating Chinese Wikipedia, English Wikipedia, French Wikipedia, and Baidu Baike; it contains 16,284,901 instances, 2,466,956 concepts, and 446,236 properties [24]. XLink is a large-scale English-Chinese bilingual knowledge graph based on XLore, which provides external knowledge that can help readers understand ambiguous and obscure entities [26]. The API of XLore 2 allows us to access the data conveniently.
The most important concepts in a knowledge graph are entities, attributes, and relations. For XLore, all of them are prepared and easy to use. In this work, we mainly use instances, classes, and entities in XLore. The outlines of an Instance and a Class are shown in Figure 1 and Figure 2.
For each instance and class, there is a corresponding URI, which gives the details of the instance or class (as shown in Figure 3).
An API query returns a set of instances and classes, based on fuzzy matching in XLore. The right part of Figure 5 sketches their relationships (see the API for their definitions and structures).

B. PROBLEM DEFINITION
First of all, we define the mapping relationship between an external query w and internal entities as h, where h(w) = I_w ∪ P_w, I_w is a collection of Instances, and P_w is a collection of Classes. The goal of the entity linking based IND problem can then be described as ''find the exact top-level institution name from a set of entities''.
Let a ∈ A be an affiliation string extracted from a scholarly paper, let o ∈ O denote an unambiguous top-level institution entity in the real world, and let C denote the candidate set of a, where C ≠ ∅. According to our survey, the main points of the work include: (1) making sure the result entity is in the candidate set C; (2) making the selection algorithm accurate.
The ultimate goal of the entity linking based method is to establish the mapping relationship M : A → O between A and O. In this work, the framework addresses two significant parts of the IND problem: (1) Candidate Generation: generating a reliable candidate set C, taking all the possible IND factors mentioned above into consideration to ensure that the candidate set covers the result; (2) Result Selection: a model to select the most likely result o from C, which ensures the accuracy of the linking between an affiliation and the real-world top-level institution. In general, the goal of our framework is to find the corresponding entities of affiliations by applying a knowledge graph. Meanwhile, the framework can cope with ever-changing and ever-increasing paper data covering all kinds of institutions.
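The objective can be written compactly in standard notation (a minimal formalization consistent with the definitions in this section; the notation M and the arg max form are ours):

```latex
% Entity linking based IND: a mapping from affiliation strings A
% to real-world top-level institution entities O
M : A \rightarrow O, \qquad
M(a) = \operatorname*{arg\,max}_{c \in C} P(c \mid a),
% where C \neq \emptyset is the candidate set generated for affiliation a
```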

IV. FRAMEWORK
To deal with the above problems, we present an institution name disambiguation framework based on entity linking. In this section, we specify the technical details of our framework.

A. FRAMEWORK OVERVIEW
As illustrated in Figure 4, the framework consists of four parts: Preprocessing, Candidate Generation, Result Selection, and Application.
Generally, the preprocessing of affiliations in papers can be roughly divided into three steps: paper structuring, affiliation extraction, and affiliation normalization. The principal aim of these steps is to group identical expressions together to reduce processing time. What is more, the framework supports a variety of applications; in Section VII, we present two case studies based on our framework.
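The grouping step of affiliation normalization can be sketched as follows (a minimal illustration, not the paper's exact implementation; the normalization rules shown here are our assumptions):

```python
from collections import defaultdict

def normalize(affiliation: str) -> str:
    """Illustrative normalization: lowercase, collapse whitespace,
    strip trailing punctuation."""
    return " ".join(affiliation.lower().split()).rstrip(".,; ")

def group_affiliations(affiliations):
    """Group raw affiliation strings that share one normalized form,
    so each distinct expression is disambiguated only once."""
    groups = defaultdict(list)
    for a in affiliations:
        groups[normalize(a)].append(a)
    return dict(groups)

raw = [
    "Dept. of Computer Science, Tsinghua University, Beijing, China.",
    "dept. of computer science,  Tsinghua University, Beijing, China",
]
groups = group_affiliations(raw)  # both raw strings fall into one group
```

Grouping duplicates before linking means each distinct expression triggers only one pass through candidate generation and result selection.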
In the process, we make the most of the similarity between external strings and entities in the knowledge graph as well as the relationship between entities. To sum up, the framework mainly solves two problems: Candidate Generation and Result Selection, which are the most critical parts of the framework.

B. CANDIDATE GENERATION
Previous studies show that a considerable number of errors in IND come from a lack of essential information in the affiliation string. To overcome this difficulty, we use the knowledge graph to learn related expressions and add the related entities into C for selection.
The following aspects are mainly considered in the entity linking based Candidate Generation: (1) institutions in C must be top-level and verifiable; (2) all possible results must be included, including those not in the knowledge graph; (3) any c in C must be standard and accurate. According to the above aspects, we propose a candidate generation algorithm based on entity linking supported by XLore. Before introducing the algorithm, we define two crucial functions: 1. the affiliation cleaning function f_1: the primary purpose of f_1 is to remove the parts that do not contain any expression of an institution, similar to traditional rule-based methods; we use s = f_1(a) to express the process, and define the inverse process of f_1 as f_1^-1, which finds the original affiliation string a of s. 2. the query string generation function f_2: f_2 generates the contiguous sequences of n items from a given s using an n-gram language model. The process is illustrated in the left part of Figure 5.
In an n-gram language model, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. In more detail, the number of elements in W (as the left part of Figure 5 shows, s contains m words) can be expressed as follows:

|W| = Σ_{n=N}^{m} (m − n + 1) = (m − N + 1)(m − N + 2) / 2

where N denotes the lower limit of n in the n-gram language model. Figure 5 depicts the process of linking an affiliation string a to a knowledge graph G in XLore. It is worth noting that an entity caching module is used to cache the query results from the knowledge graph, which effectively reduces the number of knowledge graph requests. Learning the latent expressions in an affiliation from the knowledge graph not only gives us more background knowledge but also provides more relationships between entities and classes to better understand a semantically. In this way, we generate a reliable candidate set C with the candidate generation algorithm.
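The query-string generation f_2 can be sketched as follows (the function name is ours); the final assertion checks that the number of generated query strings for a string of m words with lower limit N equals (m − N + 1)(m − N + 2)/2, since there are m − n + 1 n-grams for each n:

```python
def generate_query_strings(s: str, n_min: int = 2):
    """f_2: all contiguous word n-grams of s, for n from n_min up to m,
    used as query strings against the knowledge graph."""
    words = s.split()
    m = len(words)
    return [" ".join(words[i:i + n])
            for n in range(n_min, m + 1)   # n-gram sizes N..m
            for i in range(m - n + 1)]     # m - n + 1 grams of size n

s = "University of California San Francisco"
W = generate_query_strings(s, n_min=2)
m, N = 5, 2
# closed form for the number of n-grams: (m - N + 1)(m - N + 2) / 2
assert len(W) == (m - N + 1) * (m - N + 2) // 2  # 10 query strings here
```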
The generation process of the candidate set C of a is described in Algorithm 1, where ''top-level institution identification'' mainly depends on the judgment of an entity's ''Hypernyms'' and on labeling the official name of the entity. In this way, we extend C and link the entities by applying the knowledge graph.
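The candidate generation loop can be sketched in Python as follows. The knowledge graph query is a local stub standing in for the XLore API, and the hypernym set and all names here are illustrative assumptions; the cached query mirrors the entity caching module described above:

```python
from functools import lru_cache

# Stub standing in for the XLore API: maps a query string to matching
# (entity, hypernym) pairs. In the real framework this is a remote call.
FAKE_KG = {
    "tsinghua university": [("Tsinghua University", "university")],
    "computer science": [("Computer science", "academic discipline")],
}

@lru_cache(maxsize=None)  # entity caching: avoid repeated KG requests
def query_kg(w: str):
    return tuple(FAKE_KG.get(w.lower(), ()))

def is_top_level_institution(entity: str, hypernym: str) -> bool:
    """Top-level institution identification via hypernyms (simplified)."""
    return hypernym in {"university", "institute", "company"}

def generate_candidates(query_strings):
    """For each query string w, add the linked top-level entities to C."""
    C = set()
    for w in query_strings:
        for entity, hypernym in query_kg(w):
            if is_top_level_institution(entity, hypernym):
                C.add(entity)  # entity carries its official name
    return C

C = generate_candidates(["Tsinghua University", "Computer Science", "Beijing"])
```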

C. RESULT SELECTION
The goal of Result Selection is to find the most probable institution entity in C given the institution context in a. In this paper, we define the most probable institution entity as r = arg max P(c|a). According to the definition of f_1, we can assume that P(c|a) = P(c|s), where s = f_1(a). Moreover, we define a function clean() to reduce the interference of noise from a. The function clean() is limited to simple processing on strings, such as removing special word symbols, unifying case, etc. In this way, we get the equation P(c|a) = P(c'|s'), where c' = clean(c) and s' = clean(s). This assumption is reasonable in practice.
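A clean() function of the kind described can be sketched as follows (a minimal illustration; the exact character classes kept are our assumption):

```python
import re

def clean(text: str) -> str:
    """Simple noise reduction as described above: unify case and remove
    special word symbols, keeping only letters, digits, and spaces."""
    text = text.lower()                        # unify case
    text = re.sub(r"[^a-z0-9 ]+", " ", text)   # drop special symbols
    return " ".join(text.split())              # collapse whitespace

assert clean("Univ. of  California, San-Francisco!") == "univ of california san francisco"
```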

Algorithm 1 Candidate Generation Algorithm
Input: a: affiliation string of a paper
Output: C: candidate set of institutions
1: Initialize C ← {}
2: s = f_1(a)
3: W = f_2(s)
4: for w in W do
5:     add w into C

Based on this assumption and a series of experiments, we propose a probability formula for entity linking, which takes full advantage of the entities in the knowledge graph and the affiliation string. The probability formula can be expressed as follows:

P(c'|s') = |cls(c', s')| / (|s'| + med(c', s'))

where cls() is a function that finds the Longest Common Subsequence (LCS) [5] of two strings, med() is a function that calculates the Minimum Edit Distance (MED) [15] of two strings, C' denotes the mapping result of C under the clean() function, and the absolute value denotes the length of a string. It is worth mentioning that LCS is NP-hard in the general case of an arbitrary number of sequences; for two strings, we solve it with the dynamic programming approach [12].
In this paper, we use P_c to describe the probability that s and c have the same meaning. Thus, we get the following formula:

P_c = P(c'|s')

Furthermore, the result of the selection can be defined as follows:

R = arg max_{c ∈ C} P_c

where R denotes the result set of the entity linking. R may well have more than one element; in that case, we simply choose the longest element r' and map r' back to r through the clean() function. In this way, we obtain the most probable candidate from arg max P(c'|s'). The model states that the most probable candidate c given a depends on the text similarity between the cleaned affiliation and the cleaned top-level institution names. The advantages of this method are: (1) it combines the advantages of traditional methods; (2) it makes full use of the entities in the knowledge graph; (3) it has low computational complexity, so it can deal with large-scale data.
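The two string measures used in Result Selection can be implemented with standard dynamic programming. The combination in score() is our assumed reading of the probability formula (LCS length rewarded, edit distance penalized), not necessarily the paper's exact definition:

```python
def lcs_len(x: str, y: str) -> int:
    """Length of the Longest Common Subsequence, O(|x||y|) DP."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xc in enumerate(x, 1):
        for j, yc in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xc == yc else max(dp[i-1][j], dp[i][j-1])
    return dp[len(x)][len(y)]

def med(x: str, y: str) -> int:
    """Minimum Edit Distance (Levenshtein), dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        dp[i][0] = i
    for j in range(len(y) + 1):
        dp[0][j] = j
    for i, xc in enumerate(x, 1):
        for j, yc in enumerate(y, 1):
            cost = 0 if xc == yc else 1
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    return dp[len(x)][len(y)]

def score(c: str, s: str) -> float:
    """Assumed similarity: equals 1.0 when c == s, decreases with edits."""
    return lcs_len(c, s) / (len(s) + med(c, s))
```

An exact match scores 1.0 (full-length LCS, zero edit distance), and each discrepancy both shrinks the numerator and grows the denominator, so ranking candidates by this score matches the arg max selection above.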

V. EXPERIMENT
In the experiments, we verify three main aspects: (1) the framework achieves considerably high accuracy; (2) the results carry more knowledge than those of traditional methods; (3) it can deal with ever-changing and ever-increasing data.
We designed a real-world experiment to verify the reliability and usability of our framework. To reproduce a realistic application scenario, we randomly selected papers from AMiner 3 spanning a relatively long period. The details of the data set are as follows: For comparison, we also chose a state-of-the-art method. For rule-based and knowledge-based methods, the number of rules and the amount of knowledge determine the accuracy. Considering the completeness and generality of knowledge and rules, we deem the work of Huang et al. [8] the most representative of all the papers mentioned above. Thus, we replicated their method on our data set and compared it with ELAD. In this paper, we call it ''Huang's Method'' for short.
To be more specific, the data involved in ELAD are illustrated as follows: In calculating Number per affiliation, we use the number of regularized affiliations. As Table 2 shows, our framework provides a large number of candidates to select from, which makes it possible to solve problems such as spelling mistakes, missing words, etc. In terms of the completeness of candidates, ELAD offers a higher possibility of improving accuracy.
Finally, we designed a rigorous test method to verify the accuracy of our approach. In more detail, we label only the official name of an institution in this experiment, and only an exactly matched string is considered correct. In addition, the usability of the framework enables us to run the evaluation at a very low cost.

VI. RESULTS
We applied these data to both ELAD and Huang's method. Huang's method clustered the data into 10,714 categories, each of which represents an institution. With an average of 3.14 affiliations clustered into one institution, 33,628 of the affiliations fall into clusters, accounting for 39.57% of the whole. That is to say, 40.23% of affiliations were not involved, and the coverage is much lower on our real-world data set than on their data set.

A. ACCURACY
From the perspective of accuracy, ELAD demonstrates clear superiority in this experiment. As Table 3 shows, the accuracy and recall of ELAD are much higher than those of the traditional method; in Table 3, Return Result means the ratio of affiliations for which the method returns a result.

B. MORE KNOWLEDGE
What's more, with entity linking, ELAD can handle the IND problem at both the university and university-system levels. For example, our framework deems the top institution of ''Medical Anthropology Program,University of California,San Francisco,U.S.A.'' to be ''University of California,San Francisco'', not ''University of California''. Furthermore, some affiliations with a small number of misspellings can also get the right result.
However, affiliations severely lacking essential and correct information cannot be resolved, e.g., ''Mochrum, Newton Stewart, Wigtownshire,Scotland''. It is also difficult to find that the official name of ''Inorganic Chemistry 2, Chemical Center, University of Lund,Lund,Sweden'' is ''Lund University''. Nonetheless, our framework has significantly improved recall and precision.

C. FOR EVER-CHANGING AND EVER-INCREASING DATA
During the preparation of the experiment data, we found that the number of entities and URIs did not increase linearly as the amount of data grew; the growth rate is decreasing. This phenomenon demonstrates that institutions share many common attributes and that ELAD is suitable for processing ever-changing and ever-increasing data. Besides, the source code of ELAD, the data set, and the experiment results are available at GitHub. 4

VII. CASE STUDY
To validate the proposed framework, we conduct two case studies. In the first case study, we find the most productive institutions among the AI 2000 Data Mining Most Influential Scholars. 5 We link the affiliation strings to their knowledge graph entities by applying ELAD. The top 10 most productive institutions are listed in Table 4. To the best of our knowledge, the result is consistent with common-sense understanding, which in turn verifies the reliability of ELAD.
In the second case study, we mine the collaboration relationships among the top 10 institutions from the first case study. The collaboration frequency between institutions reflects their relationships in academic cooperation and competition, and the institutions at the centre of the collaboration network are often the most influential in the field. We visualize the collaboration network of the top 10 institutions in Figure 6.
The nodes denote institutions, the edges imply collaboration, and the edge values indicate the collaboration frequencies between institutions. The network shows the close cooperation between top research institutions.
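Counting collaboration frequencies from disambiguated affiliations can be sketched as follows (illustrative data and names, not the paper's code; each paper is represented by its set of linked top-level institutions):

```python
from collections import Counter
from itertools import combinations

def collaboration_network(papers):
    """Count pairwise collaboration frequencies: every pair of distinct
    institutions co-appearing on one paper contributes one edge count."""
    edges = Counter()
    for institutions in papers:
        for a, b in combinations(sorted(set(institutions)), 2):
            edges[(a, b)] += 1  # sorted pair keys avoid (a,b)/(b,a) splits
    return edges

papers = [
    {"Tsinghua University", "Stanford University"},
    {"Tsinghua University", "Stanford University", "MIT"},
]
edges = collaboration_network(papers)
```

The resulting edge counts are exactly the values drawn on the collaboration network's edges.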

VIII. CONCLUSION
This paper proposes an automatic framework to solve the IND problem for research papers in a scenario where the amount of data keeps increasing and the time span is very long. Our framework makes full use of the relationships between the entities in affiliations and their hypernyms and hyponyms, which overcomes the limitations of methods based on pre-prepared rules and knowledge. Experiments show that its recall is much higher than that of traditional methods while its precision remains at the same level. At the same time, our framework reduces human involvement by using more external knowledge, which helps establish links between entities and give standardized names.
The framework makes it convenient for us to build a knowledge graph of research institutions, which includes the relationship between top institutions as well as the relationship between top-level institutions and sub-level institutions. Nevertheless, the most significant limitation is the speed of querying the knowledge graph data, which makes the framework more suitable for incremental processing. Meanwhile, the entity caching module in our framework can significantly reduce the frequency of queries.