Measure the Impact of Institution and Paper Via Institution-Citation Network

This paper investigates the impact of institutions and papers over time based on the heterogeneous institution-citation network. A new model, IPRank, is introduced to measure the impact of institutions and papers simultaneously. The model uses a heterogeneous structural measure to unveil the impact of institutions and papers, reflecting the effects of citation, institution, and structural measure. To evaluate its performance, the model first constructs a heterogeneous institution-citation network based on the American Physical Society (APS) dataset. Subsequently, PageRank is used to quantify the impact of institutions and papers. Finally, the impacts of the same institution are merged, and the rankings of institutions and papers are calculated. Experimental results show that the IPRank model better identifies universities that host Nobel Prize laureates, suggesting that the proposed technique reflects impactful research well.


I. INTRODUCTION
Scientific impact is evaluated at different levels, ranging from the high level of national and institutional scales to the low level of researcher and paper scales [1]-[3]. Many studies focus on scientific impact measurement, scholarly network analysis, and the success of science [4]-[8]. While many of these studies explore scientific impact within a particular timeframe, there is growing interest in understanding the evolution of scientific impact in the "science of science" [9], [10]. For scientific impact measurement, the citation network is an often-used technique [11], [12], whereas the heterogeneous scholarly network has attracted growing attention recently [13], [14]. Quantifying scientific impact in the heterogeneous scholarly network is closely related to structural measures, citation analysis, and behavioral complexity. A subset of heterogeneous scholarly networks is the evolving network of institutions and papers over time, which forms the structural foundation for advancing scientific discoveries, gauging scientists' performance, ranking universities, and allocating funding. A heterogeneous scholarly network is shown in Figure 1, where $I_1$-$I_{10}$ represent research institutes and $P_1$-$P_9$ represent papers. In Figure 1, paper $P_1$ cites two papers, $P_2$ and $P_3$, and each link between two papers points from the citing paper to its reference. The signed institutions of paper $P_1$ are $I_1$ and $I_2$; the bi-directional links represent the relationship between paper and institution, indicating that the institution publishes the paper and the paper belongs to the institution.
Quantifying paper impact is a longstanding point of research [3], [15]-[18]. Previous studies have mainly focused on unstructured measures or structured measures [19]. Unstructured measures rely on citations of scholarly papers or Altmetrics, including downloads, views, shares, and citations [20]. Citations attracted by a scholarly paper can be correlated with its age, which favors older publications. Altmetrics are suitable for quantifying the impact of a paper in the early stage of publication. However, both metrics are easily manipulated by scholars who can artificially increase the number of citations. Compared to unstructured metrics, structured metrics more adequately quantify the impact of a paper. The most representative structured measures are the PageRank and HITS algorithms [21]-[24]. The PageRank algorithm is often used in homogeneous networks such as the citation network and the coauthor network [25]. The HITS algorithm is used in heterogeneous scholarly networks such as the paper-author network and the paper-journal network [12].
Quantifying institution impact has always been a focus of scientific researchers [9], [10], [26]-[30]. Currently, quantifying institutional impact is limited to unstructured metrics and homogeneous structured metrics. Several unstructured methods are widely recognized, such as the Academic Ranking of World Universities (ARWU), the QS World University Rankings (QS), the Times Higher Education World University Rankings (THE), and the Performance Ranking of Scientific Papers for World Universities (NTU) [10], [31]. However, unstructured metrics rely heavily on counts of bibliometric indicators. To develop a structured quantitative method to measure institutional impact, Massucci et al. [11] integrated PageRank into the citation network of institutions. However, despite these significant efforts, the correlation between institutional impact and paper impact in the heterogeneous scholarly network remains unclear. Possible reasons include the following: institution impact evaluation is still moving from unstructured to structured measures, and evaluating institution impact in a heterogeneous network is a more complicated task than evaluating it in a homogeneous network.
Therefore, we develop a quantitative model, IPRank, to improve the understanding of institution and paper impact in the heterogeneous scholarly network. With the unprecedented expansion of publications and the availability of large-scale datasets on publications, institutions, and citations, the analysis of the institution and paper network and its quantification in a heterogeneous network are now possible. In this paper, we address two main questions. First, we construct a heterogeneous institution-citation network and derive the statistical model of the institution-citation network, making it possible to simultaneously quantify the impact of institutions and scholarly papers. Second, we develop a structured measurement based on the institution-citation network by utilizing PageRank to quantify the impact of institutions and scholarly papers.
The rest of this paper is organized as follows. Section II summarizes recent work on the evaluation of institution and paper impact. Section III introduces the proposed IPRank model framework in detail. The experimental results are shown and discussed in Section IV. Section V draws concluding remarks of the study.

II. RELATED WORK
Quantifying the impact of scholarly papers has been extensively investigated. Early studies are mostly based on the number of citations. Garfield proposed using citation counts as the measure of scholarly paper impact [32], and he also developed the Journal Impact Factor (JIF) as the measure of journal impact [33]. However, the citation-based approach has certain limitations; for example, impact factors cannot be unified across different disciplines. Citations as a metric of paper impact have been controversial, especially due to the existence of questionable citations [34].
To resolve this problem, ongoing research has explored structured metrics to quantify paper impact [14], [21], [35]-[37]. These studies are mostly based on scholarly networks, including homogeneous networks (citation network of papers, citation network of institutions, and coauthor network) and heterogeneous networks (paper-author network, paper-venue network, and author-venue network). Chen et al. [21] found scientific gems with Google's PageRank algorithm via the citation network. The rationale is that important papers attract more citations, including citations from papers of high importance, which increases the importance of the cited papers. On the basis of this work, Jiang et al. [38] integrated mutual reinforcement relationships based on the three homogeneous networks and the three heterogeneous networks by applying the PageRank and HITS algorithms. Subsequently, Wang et al. [12] measured the impact of papers by exploiting citations, authors, journals, and time information via the homogeneous and heterogeneous scholarly networks mentioned above. Compared to the work of Jiang et al., Wang et al. [12] introduced a time feature to evaluate the impact of papers and assigned higher scores to recent scholarly papers. Inspired by the work of Wang et al. [12] and Ioannidis [39], Bai et al. [2] proposed COIRank to measure the impact of papers by identifying anomalous citation patterns to adjust citation weights. Liang et al. [14] proposed a novel mutual ranking algorithm based on the heterogeneous academic hypernetwork by employing the mutual reinforcement relationship. Bai et al. [40] developed a higher-order weighted quantum PageRank algorithm based on the behaviour of multiple-step citation flow. The citation dynamics with higher-order dependencies reveal the actual impact and better distinguish the impact from self-citation.
Compared to the evaluation of paper impact, quantification of institutional impact is more complicated [26], [41]-[43]. Previous metrics are mainly based on statistics of features, including researcher-based features (staff winning Nobel Prizes, number of highly cited researchers, international collaboration), paper-based features (articles published in Nature and Science, article index, number of publications, high-quality publications, normalized impact, excellence rate, co-publications), institution-based features (university-industry co-publications), and other features such as availability of research funding and graduation rates [28], [44], [45]. These features are relatively easy to obtain, and they reflect the impact of an institution. However, these quantitative indicators have certain drawbacks. Therefore, structured metrics have been investigated to quantify the impact of institutions [11], [46]. Bai et al. [46] first explored conflict of interest (COI) relationships to discover negative citations and weaken the associated citation strength. Furthermore, the PageRank and HITS algorithms were utilized to measure the impact of papers based on the citation network, the paper-author network, and the paper-journal network. Finally, the institutional impact was calculated from the impact of all publications of the institution. Massucci et al. [11] studied the citation patterns among universities and used the PageRank algorithm on the citation network between institutions. In their study, the citation relationships between papers are converted into citation relationships between the signed institutions of the papers. However, the citation relationship between any two papers is one-to-one, and since a paper can be signed by multiple institutions, the citation relationships between institutions are more complicated.

A. DATA SOURCES AND DATA PRE-PROCESSING
Our experiments are based on the American Physical Society (APS) dataset, which consists of all papers published in Physical Review from 1894 to 2013, spanning the following journals: Physical Review A, B, C, D, E, I, L, ST and Reviews of Modern Physics. This dataset includes the title of each paper, the authors' names, the authors' affiliations, the date of publication, and a list of cited papers.
In this study, we consider only papers and institutions that meet five pre-processing criteria covering record completeness, institution assignment, and institution name unification.
After this pre-processing, a summary of the basic statistics of the APS dataset from 1894 to 2013 is given in Table 1. The entire APS dataset from 1894 to 2013 is used to quantify the long-term impact of institutions and papers. Correspondingly, to examine the short-term impact of institutions and papers, we summarize the APS dataset over different time periods, also shown in Table 1. We choose a five-year period to quantify the impact of institutions and papers, following the Global Ranking of Academic Subjects (ARWU-GRAS) for ranking institutions [11]. In addition to counting the number of papers, the number of institutions, the number of links between papers, and the number of links between papers and institutions, we count the references of the papers published in each period, including cited papers published at any time from 1894 to 2013. These references are also used to quantify the short-term impact of institutions and papers. The reason is that literature cited during a period, whenever it was published, is attributed to the impact of institutions during that period. For instance, to quantify the impact of an institution from 2009 to 2013, we need to construct an institution-citation network that contains the papers published during this time period, the references of these papers, and the related institutions. A detailed introduction to the institution-citation network is given in the next section.
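The short-term network construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `window_network`, the dictionary-based inputs, and the toy records are hypothetical, and the sketch assumes that "related institutions" covers the institutions of both the papers in the window and the papers they cite.

```python
def window_network(pub_year, cites, affiliations, start, end):
    """Select the sub-network used to score a time window [start, end].

    pub_year:     dict mapping paper id -> publication year
    cites:        iterable of (citing_paper, cited_paper) pairs
    affiliations: dict mapping paper id -> set of institution names
    Returns (papers, citation_edges, paper_institution_links).
    """
    # Papers published inside the window.
    core = {p for p, year in pub_year.items() if start <= year <= end}
    # Citations made by those papers; the cited paper may be of any year.
    edges = [(a, b) for a, b in cites if a in core]
    # Window papers plus their references.
    papers = core | {b for _, b in edges}
    # Institutions attached to every retained paper (assumption of this sketch).
    links = [(p, inst)
             for p in sorted(papers)
             for inst in sorted(affiliations.get(p, ()))]
    return papers, edges, links


# Hypothetical toy records, for illustration only.
pub_year = {"P1": 2010, "P2": 1995, "P3": 2011, "P4": 2001}
cites = [("P1", "P2"), ("P3", "P1"), ("P4", "P2")]
affiliations = {"P1": {"I1"}, "P2": {"I2"}, "P3": {"I1", "I3"}, "P4": {"I4"}}

papers, edges, links = window_network(pub_year, cites, affiliations, 2009, 2013)
print(sorted(papers))
print(edges)
print(links)
```

Here the 1995 paper is retained because a paper in the window cites it, while the 2001 paper and its citation are dropped.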

B. IPRANK MODEL FRAMEWORK
In this section, we introduce the IPRank model (see Figure 2), which is a PageRank-based model for quantifying the impact of institutions and scholarly papers. The framework first constructs the institution-citation network. The PageRank algorithm is then used to quantify the impact of institutions and papers. Finally, we merge the impact of institutions, and rank institutions and papers.

1) Constructing the institution-citation network
There is a good deal of literature in information science dealing with the citation network between papers [2], [21] and the citation network between institutions [11] to quantify the impact of papers and the impact of institutions. However, to our knowledge, no detailed construction of an actual institution-citation network has been attempted in the past. In this paper, the institution-citation network is a heterogeneous and directed scholarly network consisting of two categories of nodes: institutions and papers. In addition, there are two types of links: one is the citation link between scholarly papers, and the other is the link between institution and paper. Given a set of institutions $I = \{I_1, I_2, \ldots, I_m\}$ and a set of scholarly papers $P = \{P_1, P_2, \ldots, P_n\}$, let $E_{PP}$ denote the citations between scholarly papers and $E_{PI}$ denote the relationships between papers and institutions. The heterogeneous institution-citation network can be represented as a graph $G = (I \cup P, E_{PP} \cup E_{PI})$. For an institution-citation network with $m$ institutions and $n$ papers, graph $G$ can be represented by the adjacency matrix $A$:
$$A = \begin{pmatrix} A_{PP} & A_{PI} \\ A_{IP} & 0 \end{pmatrix} \qquad (1)$$
where $A_{PP}$ represents the citation matrix between papers, and $A_{PI}$ and $A_{IP}$ represent the links between institutions and papers. $A_{PI} = A_{IP}^{T}$, since the links between institutions and papers are symmetric.
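As a concrete illustration of the block structure of $A$, the sketch below assembles the adjacency matrix from a paper-paper citation matrix and a paper-institution matrix. It is a minimal example under stated assumptions: the function name `build_adjacency`, the ordering of nodes (papers first, then institutions), and the small three-paper, four-institution network are chosen here for illustration and are not taken from the APS data.

```python
def build_adjacency(a_pp, a_pi):
    """Assemble the (n+m) x (n+m) adjacency matrix of the network.

    a_pp[i][j] = 1 if paper i cites paper j            (n x n)
    a_pi[i][k] = 1 if paper i is signed by institution k (n x m)
    Nodes are ordered papers first, then institutions (an assumption).
    """
    n = len(a_pp)
    if any(len(row) != n for row in a_pp) or len(a_pi) != n:
        raise ValueError("a_pp must be n x n and a_pi must have n rows")
    m = len(a_pi[0]) if n else 0
    size = n + m
    a = [[0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            a[i][j] = a_pp[i][j]          # citation block A_PP
        for k in range(m):
            a[i][n + k] = a_pi[i][k]      # paper -> institution block A_PI
            a[n + k][i] = a_pi[i][k]      # institution -> paper block A_IP
    return a                              # institution-institution block stays 0


# Illustrative network: P1 cites P2 and P3;
# P1 is signed by I1, I2; P2 by I2, I3; P3 by I4.
A_PP = [[0, 1, 1],
        [0, 0, 0],
        [0, 0, 0]]
A_PI = [[1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 1]]

A = build_adjacency(A_PP, A_PI)
for row in A:
    print(row)
```

Writing the institution-to-paper block as the transpose of the paper-to-institution block is what makes those links bi-directional, while the citation block stays directed.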

2) IPRank Model
The motivation of our method is described as follows:
(1) If a scholarly paper is cited by many other publications, the paper has high importance.
(2) If a scholarly paper with high importance is linked to other papers, the importance of the linked papers will increase accordingly.
(3) If an institution publishes many papers and these papers are cited by many other papers, the institution has high importance.
(4) If a scholarly paper with high importance is linked to an institution, the importance of the linked institution will increase accordingly.
Figure 2 illustrates the IPRank model framework by examining a simple situation: given three papers $P_1$, $P_2$ and $P_3$, paper $P_1$ has two institutions, $I_1$ and $I_2$; paper $P_2$ has two institutions, $I_2$ and $I_3$; and paper $P_3$ has one institution, $I_4$. Paper $P_1$ cites paper $P_2$ and paper $P_3$; therefore, a simple citation network can be constructed, which is an unweighted directed graph. According to the relationships between $P_1$, $P_2$, $P_3$ and $I_1$, $I_2$, $I_3$, $I_4$, the links between them can be added to the citation network; thus, a simple institution-citation network (graph $G$) is constructed.
Let $A$ denote the adjacency matrix of $G$, and let $B$ denote the transition probability matrix of $A$. The institution-citation network can be represented by a stochastic matrix $PR$. For a source $i$, the PageRank vector $PR$ is defined as the unique solution of Equation (2), in which $PR(i)$ represents the importance of node $i$ in the institution-citation network, and $\alpha$ (the teleport probability) is a constant between 0 and 1, set to 0.85 in our experiments. The value of the $\alpha$ parameter follows the original Google PageRank algorithm [21]. $N$ represents the number of nodes in the institution-citation network. $j$ is an adjacent node of $i$, and $j \in IN(i)$ indicates that node $j$ is an in-neighbor of node $i$, that is, a node with a link pointing to $i$. $PR(j)$ represents the importance of node $j$. The linear algebraic definition of PageRank is equivalent to simulating a random walk: starting from the source $i$, the walk skips to a randomly chosen neighbor of the current node with probability $(1-\alpha)$, or stops at the current node with probability $\alpha$. According to Equation (2), we finally obtain the prestige scores of institutions and papers in the heterogeneous network. The pseudocode of the IPRank model is listed in ALGORITHM 1. The importance of an institution and the importance of a scholarly paper are their $PR$ values in the institution-citation network.
As expected, papers $P_2$ and $P_3$ are both cited by paper $P_1$. Paper $P_3$ belongs only to institution $I_4$, whereas paper $P_2$ belongs to two institutions, $I_2$ and $I_3$; therefore, $I_4$ is the most influential of the four institutions. Paper $P_1$ is the only one of the three papers that is not cited by the others; therefore, its prestige score is the lowest of the three. Since institution $I_1$ links only to paper $P_1$, which has a low prestige score, the score of $I_1$ is the lowest among the four institutions. Papers $P_1$ and $P_2$ each belong to two institutions, and they share institution $I_2$. Since paper $P_1$ cites paper $P_2$, the importance of paper $P_2$ is higher than that of paper $P_1$. Similarly, papers $P_2$ and $P_3$ are both cited by paper $P_1$; since paper $P_2$ is signed by two institutions, $I_2$ and $I_3$, while paper $P_3$ is signed by only one institution, $I_4$, the importance of institution $I_4$ is higher than that of institution $I_3$.
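The ranking argument for the three-paper example can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard PageRank update $PR(i) = \frac{1-\alpha}{N} + \alpha \sum_{j \in IN(i)} \frac{PR(j)}{OUT(j)}$ with $\alpha = 0.85$ acting as the damping factor, where $OUT(j)$ (the out-degree of node $j$) is a symbol introduced here for illustration. The function name `pagerank`, the node ordering, and the uniform redistribution of dangling-node mass are likewise assumptions.

```python
def pagerank(adj, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a dense adjacency matrix.

    adj[j][i] > 0 means there is a link from node j to node i.
    """
    n = len(adj)
    out_deg = [sum(row) for row in adj]
    pr = [1.0 / n] * n
    for _ in range(max_iter):
        # Mass held by nodes without out-links is spread uniformly
        # (an assumption; the example network has no such nodes).
        dangling = sum(pr[j] for j in range(n) if out_deg[j] == 0)
        new = [(1.0 - alpha) / n + alpha * dangling / n] * n
        for j in range(n):
            if out_deg[j] == 0:
                continue
            for i in range(n):
                if adj[j][i]:
                    new[i] += alpha * pr[j] * adj[j][i] / out_deg[j]
        delta = sum(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if delta < tol:
            break
    return pr


names = ["P1", "P2", "P3", "I1", "I2", "I3", "I4"]
index = {name: k for k, name in enumerate(names)}
cites = [("P1", "P2"), ("P1", "P3")]                      # citing -> cited
signed = [("P1", "I1"), ("P1", "I2"), ("P2", "I2"),
          ("P2", "I3"), ("P3", "I4")]                     # paper <-> institution

adj = [[0] * len(names) for _ in names]
for a, b in cites:
    adj[index[a]][index[b]] = 1
for p, inst in signed:
    adj[index[p]][index[inst]] = 1
    adj[index[inst]][index[p]] = 1

scores = dict(zip(names, pagerank(adj)))
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 4))
```

Under these assumptions the ordering agrees with the discussion: $I_4$ scores highest and $I_1$ lowest among the institutions, and $P_1$ scores lowest among the papers.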

IV. RESULTS
We compare the similarity of institution ranking between IPRank and IRank [11]. Both algorithms can be classified as structured metrics; however, IPRank is based on the heterogeneous institution-citation network, whereas IRank is based on the homogeneous citation network between institutions. Table 2 shows the Spearman correlation coefficient between IPRank and IRank.
According to Table 2, we observe a high correlation between IPRank and IRank for top 10 to top 100 institutions. In terms of the long-term impact of the institutions, the Spearman correlation coefficient between IPRank and IRank ranges from 0.73 to 0.88 for top 10 to top 100 ranked institutions. In particular, for the top 10 institutions, the Spearman correlation coefficient between IPRank and IRank is the highest, reaching 0.88. In terms of the short-term impact of the institutions, the Spearman correlation coefficient of the two algorithms changes relatively little between 1994 and 1998, ranging from 0.87 to 0.93. Compared to the period from 1994 to 1998, the correlation coefficient varies more widely in the two periods from 1999 to 2003 and from 2004 to 2008, ranging from 0.62 to 0.92 and from 0.71 to 0.93, respectively. In the five years between 2009 and 2013, the Spearman correlation coefficient between IPRank and IRank is the highest for the top 10 institutions, reaching 0.99, and the lowest for the top 90 institutions, at 0.84.
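For readers who want to reproduce this kind of comparison, the sketch below computes a Spearman coefficient for the top N items of two rankings. It is an illustration under assumptions: the text does not state how the top N sets are matched, so the sketch takes the top N items of the first ranking and re-ranks the same items by their order in the second ranking. The function names and the single-letter item labels are hypothetical.

```python
def spearman(ranks_a, ranks_b):
    """Spearman's rho for two rank vectors without ties."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rank vectors of equal length >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def top_n_spearman(order_a, order_b, n):
    """Correlation of the top n items of order_a with their order in order_b.

    order_a and order_b list the same items, best first (an assumption).
    """
    position_b = {item: r for r, item in enumerate(order_b, start=1)}
    top = order_a[:n]
    # Re-rank the selected items 1..n by their position in order_b so that
    # both rank vectors are permutations of 1..n, as the formula requires.
    by_b = sorted(top, key=lambda item: position_b[item])
    rank_in_b = {item: r for r, item in enumerate(by_b, start=1)}
    ranks_a = list(range(1, len(top) + 1))
    ranks_b = [rank_in_b[item] for item in top]
    return spearman(ranks_a, ranks_b)


ranking_1 = ["A", "B", "C", "D"]   # hypothetical ranking by one algorithm
ranking_2 = ["A", "C", "B", "D"]   # hypothetical ranking by another
print(top_n_spearman(ranking_1, ranking_2, 4))
```

Other matching rules, such as taking the union of both top N lists, would give different values, so the rule should be fixed before comparing numbers across periods.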
We also compare the similarity of paper ranking between the IPRank algorithm and the IRank algorithm (see Table 3). In terms of long-term paper impact, the correlation coefficient between the two algorithms generally rises for top 10 to top 100 papers, and ranges from -0.30 to 0.79. During the period from 1994 to 1998, the correlation coefficients are all higher than 0.58 and all positive. Between 1999 and 2003, for top 10 to top 50 papers, the correlation coefficients between the IPRank and IRank algorithms are positive and higher than 0.68. Table 4 indicates that the IPRank model has a higher correlation with outstanding impact.
Similarly, we check the rankings of the Nobel Prize institutions between 1930 and 2013, which are derived from Nobel Prize papers. Table 5 shows the rankings of ten Nobel Prize institutions based on the IPRank and IRank algorithms. It should be noted that since 1952 the University of California has been an administrative system separate from the University of California, Berkeley, rather than a single university. Therefore, we also renamed the institution entry University of California to the University of California, Berkeley. According to Table 5, several institutions have the same ranking order, and several others have slightly different rankings. The reason is that the importance of an institution is related to the importance of its published scholarly papers. At the same time, the importance of an institution increases if papers published by the institution are cited by other papers. In general, each institution has a large number of linked papers, and the number of linked papers differs between institutions. Therefore, the difference between the IPRank and IRank algorithms is small for institution ranking. In contrast, the ranking of a paper depends on the impact of its citing papers and institutions. Therefore, the rankings of papers produced by the IPRank and PageRank algorithms are quite different.
IPRank yields higher recall rates than IRank, indicating that the IPRank algorithm better reflects the impact of Nobel Prize institutions. Figure 5 compares IPRank and IRank in terms of the precision of retrieving Nobel Prize universities among the top N universities, that is, the proportion of Nobel Prize universities among the top N. From top 1 to top 8 universities, the precision of IPRank is 1. For top 9 and top 10 universities, the precision of IPRank is less than 1, namely 0.88 and 0.90, respectively. In contrast, the precision of IRank fluctuates greatly and ranges from 0.80 to 0.89. The precision of the IPRank algorithm is greater than or equal to that of the IRank algorithm.
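The recall and precision figures discussed above follow the usual top-N retrieval definitions, which can be sketched as follows. This is an illustrative example only: the function name `precision_recall_at_n` and the placeholder university labels are hypothetical and do not reproduce the rankings in Table 5.

```python
def precision_recall_at_n(ranking, relevant, n):
    """Precision and recall of the top n items of a ranking.

    ranking:  list of items, best first
    relevant: set of items counted as hits (e.g. Nobel Prize universities)
    """
    top = ranking[:n]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Placeholder data, for illustration only.
ranking = ["U1", "U2", "U3", "U4", "U5"]
nobel = {"U1", "U3", "U9"}

for n in (2, 5):
    precision, recall = precision_recall_at_n(ranking, nobel, n)
    print(n, round(precision, 2), round(recall, 2))
```

With this definition, eight hits among the top nine entries give a precision of 8/9, about 0.89, which is close to the value reported for the top 9 universities.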

V. CONCLUSION
This paper investigated a data-driven method to quantify the impact of institutions and papers from the heterogeneous institution-citation network. Unlike most prior studies, which used the citation network to measure the impact of either institutions or papers, this paper proposed IPRank to quantify the impact of institutions and papers simultaneously in a heterogeneous scholarly network. Experimental results showed that the IPRank model was more representative of the outstanding impact of institutions and papers. In a comparison of the rankings produced by the IPRank and PageRank algorithms for Nobel Prize papers and institutions, the IPRank model produced higher rankings in most cases, making it a suitable tool for institutional impact assessment.

FIGURE 1: An example of heterogeneous institution-citation network.
Pre-processing criteria for papers and institutions:
(1) Paper and institution details are complete and in the right format.
(2) At least one institution is found for a paper.
(3) The first institution associated with each author is retained.
(4) Each institution is reduced to its first-level unit. For example, we retain Sloane Physics Laboratory, Yale university as Yale university.
(5) Institutions with the same name are merged. For example, University of California at Berkeley and California University at Berkeley are merged into University of California, Berkeley. It is worth mentioning that before 1952, the University of California at Berkeley was called the University of California. Therefore, in our research, these two names were unified as the University of California, Berkeley.
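Criteria (4) and (5) amount to a name normalization step, which might look like the sketch below. It is a rough illustration under assumptions, not the procedure actually applied to the APS data: the alias table contains only the examples named above, and the rule "take the first comma-separated segment that mentions a university or institute" is a hypothetical heuristic for finding the first-level unit.

```python
# Alias table built only from the examples given in the criteria.
ALIASES = {
    "university of california at berkeley": "University of California, Berkeley",
    "california university at berkeley": "University of California, Berkeley",
    "university of california": "University of California, Berkeley",
}


def normalize_institution(affiliation):
    """Map a raw affiliation string to a unified first-level institution."""
    cleaned = " ".join(affiliation.split())        # collapse whitespace
    if cleaned.lower() in ALIASES:
        return ALIASES[cleaned.lower()]
    parts = [p.strip() for p in cleaned.split(",") if p.strip()]
    # Known aliases win over the generic heuristic.
    for part in parts:
        if part.lower() in ALIASES:
            return ALIASES[part.lower()]
    # Hypothetical heuristic for the first-level unit.
    for part in parts:
        if any(word in part.lower() for word in ("university", "institute")):
            return part
    return parts[-1] if parts else ""


print(normalize_institution("Sloane Physics Laboratory, Yale university"))
print(normalize_institution("California University at Berkeley"))
print(normalize_institution("University of California, Berkeley"))
```

A real pipeline would need a much larger alias table and manual checks, since affiliation strings vary far more than these examples suggest.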

ALGORITHM 1: Rank institution and paper
Input: Matrix $A_{PP} \in R^{n \times n}$, Matrix $A_{PI} \in R^{n \times m}$, Matrix $A_{IP} \in R^{m \times n}$
Output: Scores of PR(i)
Initialize Matrix A;
Compute transition probability matrix B;
Initialize scores of PR(i);
for node i in institution-citation network do
    step 1: Calculate scores of PR(i) according to Eq. (2);
    step 2: Update scores of PR(i);
end
Iterate step 1 and step 2 until convergence;
Return scores of PR(i);

During the same period (1999 to 2003), for top 60 to top 100 papers, the correlation coefficient between the two algorithms is low, ranging from 0.35 to 0.49. Between 2004 and 2008, the correlation coefficient between the IPRank and IRank algorithms shows an upward trend, ranging from -0.18 to 0.73 for top 10 to top 100 papers. Between 2009 and 2013, the correlation coefficient is less than or equal to 0.5. It can be seen that the correlation coefficient does not follow a regular pattern across periods. To test whether the IPRank model correlates with outstanding impact, we rank 35 Nobel Prize papers from 1930 to 2013 on the basis of IPRank and PageRank. To validate the IPRank model, we compare the rankings based on IPRank and PageRank. Experimental results indicate that 80% of Nobel Prize papers rank higher by IPRank than by PageRank. The top-ranked Nobel Prize papers are shown in Table 4.

Figure 3 compares IPRank and PageRank in terms of the recall rates of retrieving 35 Nobel Prize papers among top N papers. It is observed that the IPRank algorithm consistently

FIGURE 3: Recall performance for retrieving Nobel Prize papers among top N papers.

FIGURE 4: Recall performance for retrieving Nobel Prize universities among top N universities.

FIGURE 5: Precision performance for retrieving Nobel Prize universities among top N universities.

TABLE 1: Statistical summary of the APS dataset for different time periods.

TABLE 2: Spearman correlation coefficient between IPRank and IRank for top N institutions.

TABLE 3: Spearman correlation coefficient between IPRank and IRank for top N papers.

TABLE 4: Comparing the ranking of IPRank and PageRank algorithms for ten Nobel Prize papers.

TABLE 5: Comparing the ranking of IPRank and PageRank algorithms for ten Nobel Prize institutions.

Figure 4 compares IPRank and IRank in terms of the recall rates of identifying Nobel Prize universities among top N universities. For top 1 to top 3, top 6 and top 9 universities, both IPRank and IRank contain the same number of Nobel Prize universities. For top 4, top 5, top 7, top 8 and top 10 universities, IPRank consistently yields higher recall.