Identification High Influential Articles by Considering the Topic Characteristics of Articles

The topic of one article reflects its main semantic content, which is also the main guidance for researchers to choose reference literature. In order to explore whether the topic of an article will affect its citation trend in future, this paper establishes a machine learning framework to study the role of topic characteristics in the prediction of future high influential articles. Articles from four different disciplines are collected as experimental samples to verify whether the framework proposed in this paper can be applied to the prediction task in different disciplines. The Latent Dirichlet Allocation (LDA) is used to determine the topic characteristics of sample articles. LDA can map sample articles to current hot topics and generate the mapping probability of sample articles under different hot topics. The maximum mapping probability of the sample article under the hot topics is extracted as the topic feature of the article. Then the feature space for the prediction task is constructed by combining the topic feature and some bibliometrics indices of articles. Three feature selection algorithms, Fisher Score, Relief-F and Spectral Feature Selection (SPEC), are taken to select the important features in the feature space. The prediction performance of these features is finally tested by three classifiers, SVM, KNN and Bagging. The experimental results show that the topic characteristics of article, the early citation characteristics of article, and the reputation of the author are the key factors that determine whether an article can grow into a highly influential one. The important value of topic characteristics in articles’ citation activities shows that the content of the article is an important factor in attracting more citations.


I. INTRODUCTION
The scientific research article is an important materialized form of scientific research achievement and serves as important guidance in evaluating the research status of different academic entities, such as countries [1]-[4], institutions [5]-[7] and individuals [8]-[14]. Researchers tend to pay more attention to articles with high academic influence, and try to explore whether there are characteristics that can help identify influential articles in advance [15]-[27], which makes the prediction of potential high influential articles an important issue in the field of bibliometrics. Because it is difficult to define the influence of articles quantitatively, researchers usually use articles' citation counts to represent their citation influence [21], [22], [28]-[30]. Therefore, the prediction of future high influential articles is naturally transformed into the prediction of highly-cited articles in advance.
The associate editor coordinating the review of this manuscript and approving it for publication was Long Wang.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Tahamtan et al. [20] gave a comprehensive overview of the work on predicting articles' future citation counts. They identified 198 relevant papers and summarized that indices from authors [31]-[36], papers [29], [36]-[39], journals [40]-[45], and some alternative metrics [30], [46]-[51] are related to the number of citations. However, most of these indicators are external bibliometric features of the paper and do not involve the paper's content. In fact, the content of a paper is important guidance for researchers choosing related reference literature. When generating their reference list, researchers first investigate the correlation between the topics of the literature and their own research interests.
If the topic of a piece of literature is consistent with the interests of researchers, it is more likely to attract their attention, thus increasing its chances of being cited. Some researchers extracted the topics of articles and analyzed the relationship between topic and citation counts [52]-[54]. They pointed out that some topics are more attractive to researchers, and articles related to these attractive research topics are more easily cited. However, in these studies the topic of an article is mainly determined by the subjective classification of researchers or by existing thesauri [52], which does not characterize the theme of the article very accurately. With the development of natural language processing and semantic analysis technology, it has become easier to obtain the deep semantic information of an article. The Latent Dirichlet Allocation (LDA) model is a widely used method for mining text topics [55]. At present, the LDA model has been applied in various topic mining tasks, such as topic detection [56]-[58], hotspot discovery [61], [62] and other areas [58], [59]. In this paper, the LDA model is introduced to extract and generate topic characteristics of articles. On this basis, a machine learning framework is established to analyze the value of topic characteristics in predicting high influential articles. Figure 1 shows a sketch of the proposed prediction framework. The work of this paper is divided into three parts: (1) Constructing the feature space for the prediction task: by combining the topic characteristics trained by the LDA model with the articles' traditional bibliometric features, a feature space is constructed to perform the prediction task; (2) Extracting key features from the feature space: three feature selection algorithms, Fisher Score, Relief-F and Spectral Feature Selection (SPEC), are used to extract key features that can better distinguish articles with different influence; (3) Predicting high influential articles: three classification algorithms, SVM, KNN and Bagging, are used to verify the performance of the selected key features in identifying future high influential articles.

II. RELATED WORK
Since Garfield regarded the citation counts as the basic index of bibliometrics, the citation counts provides an important basis for researchers to quantitatively evaluate the impact of scientific research papers [36], [63]- [71], academic institutions [5] and academic individuals [8], [9], [12], [72]. Although many studies have shown that it is difficult to fully reflect the impact of research literature only with its total number of citations, considering the universality and availability of indicators, researchers still tend to use the number of citations to express the influence of an article in the academic community [21], [22], [47], [48], [73]. In recent years, more and more research began to focus on what kind of articles can grow into influential ones [74]- [77]. Researchers extracted features from the authors of the article [74], [76], [78]- [82], the bibliometrics indices of the article [37], [39], [83]- [85], the journal where the article was published [44], [86]- [88], the attention [89] and the download [51], [90], [91] of the article in website and other web usage indexes [24], [48], [92], [93], so as to identify papers which will obtain high academic influence in future.
Among the author related characteristics, researchers pointed out that the author's reputation [21], [22], [24], [94]- [96], the number of authors [17], [84], [97]- [99] and the academic ranking of authors [100] have important impacts on the citation behavior of articles. In addition, the author's gender [31] and whether there is international cooperation among authors [6], [17], [84], [88], [96], [98], [101], [102] have also been widely studied. Among the indexes related to journals, journal impact factor is the most widely used feature for prediction [17], [84], [88]. The open access [40], [103], the accessibility of periodicals [40], [96], the ranking of periodicals [98], [100], [104], the reputation of periodicals [87], [99] and other factors [76], [105] also affect the citation activity of articles to a certain extent. Researchers also discussed the relationship between the indicators of the article itself and the citation activity. These indicators include the length of the research article [87], [97], [106], the type of the article [107], the length of the title and abstract [108], the number and diversity of the references in article [17], [19], [84], the time characteristics of the article [75], [109], [110], and the web usage statistic indicators of the article [24], [30], [91], [111]. They pointed out that the scientific research work in developed countries [109], the relatively short title and abstract of article [112], the diversity of the references [108] play important roles in attracting more citations. In our previous studies, we discussed the influence of articles' early citation behavior on their subsequent citations [21]- [23]. It is found that if the paper can be cited by more countries, institutions, journals and disciplines in a short period of time after publication, the paper will grow into a highly influential paper to a large extent. 
Although the above studies widely discuss the factors that affect citations from different perspectives, most of these factors relate to the external characteristics of the article and do not involve its content. In recent years, some researchers have discussed the role of article content in citation behavior, but these studies mainly analyzed the popularity [100], variety [101], novelty and diversity [100], [113] of the article, and the semantic content of the article remains insufficiently explored. The semantic content embodies the research theme of the article, which is of great significance when researchers choose references. The content of the article determines whether it can arouse the interest of researchers and thereby stimulate citations.
In recent years, with the development of natural language processing and deep learning technology, researchers have proposed many techniques and models that can effectively extract the characteristics of text content [100], [113], [114]. Among them, the LDA model is one of the most widely used methods for exploring the themes of documents [55]. LDA transforms the text into a three-tier ''Document-Topic-Word'' structure. By extracting the ''Document-Topic matrix'' from this structure, the topic distribution of the text can be obtained. In the field of citation analysis, the LDA model has been widely used in topic detection [56]-[58], hotspot discovery [59], [60], automatic document construction [61] and article recommendation [62].
In this paper, the LDA model is introduced to extract and generate the topic features of the article. On the basis of the extracted topic features and some commonly used bibliometrics features, this paper establishes a machine learning framework to test the value of articles' topic characteristics in predicting their citation trend in future.

III. DATA SET
This paper selects articles published in 2010 as experimental samples, and uses the total citation counts these articles received from publication to 2018 to represent their future citation influence. In order to ensure the generality of the experimental results, this paper collects article data from four different fields: Physics, Chemistry, Oncology and Neurology. One journal from each field is selected as the representative of the corresponding field. Table 1 shows the information on the selected articles. A total of 421 articles, involving 1792 authors, are used in our experiments.
These four fields were chosen for the following reasons. Physics and Chemistry have strong interdisciplinary characteristics with other disciplines, while Oncology and Neurology are relatively specialized disciplines with relatively low interdisciplinary characteristics. Choosing articles from these four fields makes it possible to better explore the common characteristics that affect the citation behavior of articles. In order to meet the needs of the classification task, the selected articles from each field are divided into three classes according to their total citations: highly-cited papers (HCPs), medium-cited papers (MCPs) and low-cited papers (LCPs).
1) HCPs: Articles whose total citation count is at least 7 times the average citation count of articles in the same journal are selected as HCPs. The multiple of seven is chosen mainly to ensure that an appropriate number of articles are selected as HCPs. According to this standard, the selected HCPs in each journal happen to be the top 1% of articles in the journal by total number of citations, which conforms to the definition of highly-cited papers in the Web of Science (WoS). In Web of Science, papers that received enough citations to place them in the top 1% of the same subject area and the same publication year are classified as highly-cited papers.
2) LCPs: Articles whose total citation count is less than the average citation count of articles in the same journal are taken as LCPs.
3) MCPs: Articles whose citation counts lie between those of HCPs and LCPs in the same journal are taken as MCPs.
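The three-way labeling rule can be sketched as follows. This is a minimal illustration, assuming the journal average is computed over the same list of articles that is being labeled; the function name `label_articles` and the citation counts are hypothetical and are not data from the paper.

```python
from statistics import mean


def label_articles(citations, hcp_multiple=7):
    """Label each article of one journal as HCP, MCP or LCP.

    citations: total citation counts of the articles in one journal.
    hcp_multiple: multiple of the journal average required for an HCP
                  (7 in the text).
    """
    avg = mean(citations)
    labels = []
    for count in citations:
        if count >= hcp_multiple * avg:
            labels.append("HCP")  # at least 7 times the journal average
        elif count < avg:
            labels.append("LCP")  # below the journal average
        else:
            labels.append("MCP")  # everything in between
    return labels


# Hypothetical citation counts for one journal (average is 10).
counts = [2] * 18 + [12, 152]
labels = label_articles(counts)
```

With a journal average of 10, the article with 152 citations exceeds the threshold of 70 and becomes an HCP, the article with 12 citations is an MCP, and the remaining articles are LCPs.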
Due to the great differences in the topics involved in each article, it is difficult to establish a relationship between a specific topic and the future citation trend of an article, and it is also difficult to draw general conclusions across different articles. Therefore, this paper does not examine the specific theme of the article itself, but generates the topic characteristics of a sample article by examining the relationship between its theme and the current research hotspots. To do this, the research hotspots in the field of the sample articles must be extracted before the topic characteristics of the sample articles are determined. This paper selects highly-cited papers published in 2005-2009 from the representative journal in each field, and extracts the research topics reflected by these highly-cited papers through the LDA model; these topics are regarded as the current research hotspots in the domain of the journal. These highly-cited papers are again defined as papers whose citation count is at least 7 times the average citation count of papers in the journal. After the current research hotspots in each field are determined, the LDA model maps the sample articles to the research hotspots and obtains the mapping probability of each sample article under the different research hotspots. The topic characteristics of the sample articles are then generated from these mapping probabilities.

A. CONSTRUCTING FEATURE SPACE FOR PREDICTING FUTURE HIGH INFLUENTIAL ARTICLES
Previous studies have shown that some commonly used bibliometrics features have an important impact on the citation behavior of articles. This paper first extracted these commonly used bibliometric features from sample articles. At the same time, by using LDA model, this paper extracts the topic feature of sample articles. Combined with the topic feature and the extracted bibliometrics features, the feature space for predicting high influential articles is established. Each feature in the feature space is normalized to the range of [0.01, 0.99] to eliminate the influence of the value range of different features on classification.
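The normalization step can be illustrated with a minimal sketch. It assumes plain linear (min-max) scaling, which the text does not state explicitly; the function name `scale_feature` and the treatment of constant features are assumptions made only for illustration.

```python
def scale_feature(values, low=0.01, high=0.99):
    """Linearly rescale one feature column to the range [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Assumption: a constant feature is mapped to the middle of the range.
        return [(low + high) / 2.0 for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


# Hypothetical raw values of one bibliometric feature.
scaled = scale_feature([0, 5, 10])
```

The smallest value maps to 0.01 and the largest to 0.99, so features with large raw ranges (for example citation counts) no longer dominate distance-based classifiers such as KNN.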

1) BIBLIOMETRIC FEATURES
Twenty bibliometric indicators are collected for the sample articles. These bibliometric indicators do not involve the content of the article, but all belong to the external factor indicators collected from the statistical point of view. Table 2 shows the information on these 20 bibliometric indicators.
Indices $\{x_1, \ldots, x_6\}$ describe the citation characteristics in the two years after publication. These characteristics give the early feedback of the scientific community on the paper. By using the functions of ''Create Citation Report'' and ''Analyze the Indexing Results'' in Web of Science, it is easy to collect these bibliometric indices.

2) TOPIC FEATURE
In order to extract the topic feature of articles, this paper first uses LDA to train the highly-cited articles published in 2005-2009 in the representative journal of each field. In order to determine the number of topics, the topic coherence score is calculated to examine the coherence of the model under different numbers of topics. Topic coherence captures the semantic interpretability of a topic based on its corresponding descriptor terms (for example, the first n topic words with the largest probability in the topic estimated by LDA) [115]. It was shown that there is a good correlation between topic coherence and human expert labeling [116]. The higher the topic coherence score, the higher the semantic interpretability of the documents assigned to the same topic, and the more reasonable the resulting topic division. The coherence score is calculated as follows:

$$\mathrm{coherence}(V) = \sum_{(v_i, v_j) \in V} \log \frac{D(v_i, v_j) + \epsilon}{D(v_j)}$$

where $V$ is a topic described by a set of words, $\{v_i\}$ represents the topic words, $D(v_i)$ counts the number of documents containing word $v_i$, $D(v_i, v_j)$ counts the number of documents containing both words $v_i$ and $v_j$, and $\epsilon$ is a small smoothing constant that avoids taking the logarithm of zero. We calculate the coherence score of each topic and average the scores to get the average coherence score of all topics. By changing the number of topics, we obtain the average coherence score under different numbers of topics, and the number of topics with the highest score is selected. Figure 2 shows the average coherence score of the model under different numbers of topics. When the number of topics is 10, the average coherence score reaches a small peak, and then tends to flatten as the number of topics increases. Therefore, this paper chooses 10 as the number of topics in the LDA model. Figure 3 shows the probability distribution of some sample articles under the 10-dimensional research hotspots.
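The coherence computation can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses the UMass-style form, summing $\log((D(v_i, v_j) + \epsilon) / D(v_j))$ over word pairs with $v_j$ taken as the higher-ranked word of each pair, and it sets $\epsilon = 1$; neither choice is confirmed by the text. The function names and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def topic_coherence(topic_words, documents, eps=1.0):
    """UMass-style coherence of one topic.

    topic_words: top words of the topic, ordered by decreasing probability.
    documents: iterable of token lists.
    eps: smoothing constant (assumed to be 1 here).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for higher, lower in combinations(topic_words, 2):
        d_higher = doc_count(higher)
        if d_higher == 0:
            continue  # word never occurs; the pair carries no evidence
        score += math.log((doc_count(higher, lower) + eps) / d_higher)
    return score


def average_coherence(topics, documents, eps=1.0):
    """Mean coherence over all topics of one LDA model."""
    return sum(topic_coherence(t, documents, eps) for t in topics) / len(topics)


# Toy corpus (hypothetical): "a" and "b" co-occur, "a" and "c" never do.
docs = [["a", "b"], ["a", "b"], ["a"], ["c"]]
coherent = topic_coherence(["a", "b"], docs)
incoherent = topic_coherence(["a", "c"], docs)
```

Repeating `average_coherence` for models trained with different numbers of topics gives the curve from which the number of topics is chosen.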
This paper mainly discusses whether the relationship between the topic of the article and the current research hotspots will affect the future citation behavior of the article. Therefore, we take the maximum membership degree of each sample article under the 10 dimensional research hotspots as the topic feature of it, which is labeled as Topic, and added it to the feature space.
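As a minimal sketch of this step, the topic feature is simply the largest entry in each article's row of the document-topic matrix. The matrix below is hypothetical and uses four topics instead of ten for brevity; in practice such rows could come from an LDA implementation (for example the output of `LatentDirichletAllocation.transform` in scikit-learn), which is an assumption and not the toolchain named in the paper.

```python
def topic_feature(doc_topic_row):
    """Return the maximum membership degree of one article over all hotspots."""
    return max(doc_topic_row)


# Hypothetical document-topic probabilities (each row sums to 1).
doc_topic = [
    [0.05, 0.70, 0.05, 0.20],  # article concentrated on one hotspot
    [0.25, 0.25, 0.25, 0.25],  # article spread evenly over all hotspots
]
topic_values = [topic_feature(row) for row in doc_topic]
```

An article that maps strongly onto a single hotspot receives a high Topic value, while an article spread evenly across hotspots receives a low one.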

B. FEATURE SELECTION
In order to overcome the ''curse of dimensionality'' [117] that may exist in the feature space, a feature selection process is applied to find the subset of features that maintains the essential characteristics of the dataset. Considering that each feature selection technique may be biased toward some features because of its underlying mechanism, three different selection techniques, Fisher Score, Relief-F and SPEC, are used to assess the significance of each feature and to verify whether the core feature subset can achieve good results under different classification algorithms.

1) FISHER SCORE
Fisher Score uses the variance of samples within and between categories to determine the contribution of a feature to the classification. If a feature has a strong ability to distinguish categories, its variance among samples of the same category should be as small as possible, and its variance between samples of different categories should be as large as possible. The Fisher Score of the i-th feature is

$$F(f_i) = \frac{\sum_{k=1}^{c} n_k \left( \mu^{k}_{f_i} - \mu_{f_i} \right)^2}{\sum_{k=1}^{c} \sum_{j \in \text{category } k} \left( f_{j,i} - \mu^{k}_{f_i} \right)^2} \quad (3)$$

where $c$ is the number of categories, $\mu_{f_i}$ is the mean of the i-th feature $f_i$ over all samples, $\mu^{k}_{f_i}$ is the mean of the i-th feature $f_i$ over the samples in the k-th category, $n_k$ is the number of samples in the k-th category, and $f_{j,i}$ is the value of the i-th feature in the j-th sample.
Based on Equation (3), we calculate the value of Fisher Score for each feature to represent the contribution of each feature to classification. All the features are ranked by their Fisher Score in descending order.
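A minimal sketch of this ranking criterion is given below. It assumes the common form of the Fisher Score, that is, the class-size-weighted squared distance of the class means from the overall mean, divided by the summed within-class squared deviations; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def fisher_score(values, labels):
    """Fisher Score of one feature column given the class label of each sample."""
    overall_mean = sum(values) / len(values)
    by_class = defaultdict(list)
    for value, label in zip(values, labels):
        by_class[label].append(value)

    between = 0.0  # spread of the class means around the overall mean
    within = 0.0   # spread of the samples around their own class mean
    for class_values in by_class.values():
        n_k = len(class_values)
        mu_k = sum(class_values) / n_k
        between += n_k * (mu_k - overall_mean) ** 2
        within += sum((v - mu_k) ** 2 for v in class_values)
    return between / within if within > 0 else float("inf")


labels = [0, 0, 0, 1, 1, 1]
separating = fisher_score([1, 2, 3, 7, 8, 9], labels)  # classes far apart
noisy = fisher_score([1, 8, 3, 7, 2, 9], labels)       # classes overlap
```

The feature whose values separate the two classes obtains a much larger score than the overlapping one, so it would be ranked first.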

2) RELIEF-F
Relief-F [118] weights features by calculating their ability to distinguish different types of samples. Relief-F randomly selects one sample from the sample set, and then finds its k nearest neighbor samples in the same category and each different category as the random sample. It calculates the ability of each feature to distinguish random samples and their neighborhood samples, and gives more weight to the features with higher ability to distinguish.
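To be specific, for a randomly selected sample $x_i$, Relief-F first finds the $k$ nearest neighbor samples $h_j$ from the sample set of the same class $C$, and then finds the $k$ neighboring samples $m_j$ in each sample set $S$ that belongs to a different class from $x_i$. Finally, Relief-F updates the weight of feature $A$ as

$$W(A) = W(A) - \sum_{j=1}^{k} \frac{\mathrm{diff}(A, x_i, h_j)}{Tk} + \sum_{S \neq C} \left[ \frac{p(S)}{1 - p(C)} \sum_{j=1}^{k} \frac{\mathrm{diff}(A, x_i, m_j(S))}{Tk} \right]$$

where $p(C)$ is the probability of class $C$, $p(S)$ is the probability of another class $S$ besides $C$, and $T$ is the number of randomly selected samples. The function diff measures the difference between two samples $x_i$ and $x_j$ on feature $A$:

$$\mathrm{diff}(A, x_i, x_j) = \begin{cases} \dfrac{|x_i[A] - x_j[A]|}{\max(A) - \min(A)} & A \text{ is continuous} \\ 0 & A \text{ is discrete and } x_i[A] = x_j[A] \\ 1 & A \text{ is discrete and } x_i[A] \neq x_j[A] \end{cases} \quad (4)$$

Taking the order of weights from large to small, we can get all the features ranked.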
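The weighting procedure can be sketched as follows. This is a simplified illustration, not the implementation used in the paper: it handles continuous features only, visits every sample once instead of drawing random samples, and measures neighbor distance as the sum of per-feature diff values. All names and the toy data are hypothetical.

```python
from collections import Counter


def diff(a, b, lo, hi, discrete=False):
    """Difference of two samples on one feature with observed range [lo, hi]."""
    if discrete:
        return 0.0 if a == b else 1.0
    return abs(a - b) / (hi - lo) if hi > lo else 0.0


def relief_f(X, y, k=1):
    """Relief-F style weights for continuous features."""
    n, m = len(X), len(X[0])
    lo = [min(row[f] for row in X) for f in range(m)]
    hi = [max(row[f] for row in X) for f in range(m)]
    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    weights = [0.0] * m

    def distance(i, j):
        return sum(diff(X[i][f], X[j][f], lo[f], hi[f]) for f in range(m))

    for i in range(n):
        for c in prior:
            # k nearest neighbours of sample i inside class c (excluding i).
            members = [j for j in range(n) if y[j] == c and j != i]
            nearest = sorted(members, key=lambda j: distance(i, j))[:k]
            if not nearest:
                continue
            if c == y[i]:
                scale = -1.0  # same class: differences lower the weight
            else:
                scale = prior[c] / (1.0 - prior[y[i]])  # other class: raise it
            for f in range(m):
                avg = sum(diff(X[i][f], X[j][f], lo[f], hi[f])
                          for j in nearest) / len(nearest)
                weights[f] += scale * avg / n
    return weights


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief_f(X, y, k=1)
```

Feature 0 ends with a positive weight because neighbors from the other class differ on it far more than neighbors from the same class, while the uninformative feature 1 ends with a negative weight.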

3) SPEC
SPEC selects features based on ''spectrum theory'', and uses mutual information to calculate the correlation between features and categories [119]. The basic idea of SPEC is that a good feature should have a structure similar to that of the graph built from the initial data. SPEC evaluates the relevance of each initial feature by measuring its consistency with the spectrum of the matrix derived from the similarity matrix between samples. For the data set $X = (x_1, x_2, \ldots, x_n)$ of $n$ samples, $\{F_1, F_2, \ldots, F_m\}$ are the $m$ features of the data set and $\{f_1, f_2, \ldots, f_m\}$ are the corresponding $m$ feature vectors. The weights of all features are computed by the following procedure:
Step 1: Calculate the similarity matrix $S$ of $X$.
Step 2: Obtain the graph representation $G$ of $S$, and calculate the adjacency matrix $W$ of $G$.
Step 3: Compute the degree matrix $D$ of $G$.
Step 4: Compute the feature weight for each feature vector $f_i$.
Step 5: Sort the features using the ranking function $\varphi(x)$.
Through the above steps, we can get the feature set arranged in the order of feature weight from large to small.
According to the ranking results in each of the feature selection algorithm, we take the intersection of the top 10 features in each algorithm as the key features. These features are of great value for identifying future high influential articles.
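Taking the intersection of the top-ranked features can be sketched as below. The rankings are hypothetical placeholders (they are not the contents of Table 3), `top_n` is set to 3 only to keep the example short, and the function name `key_features` is an assumption.

```python
def key_features(rankings, top_n=10):
    """Features that appear in the top_n of every ranking.

    rankings: one list per feature selection algorithm, best feature first.
    The result keeps the order of the first ranking.
    """
    top_sets = [set(ranking[:top_n]) for ranking in rankings]
    common = set.intersection(*top_sets)
    return [name for name in rankings[0][:top_n] if name in common]


# Hypothetical rankings from three feature selection algorithms.
fisher_rank = ["x1", "Topic", "x7", "x9"]
relief_rank = ["Topic", "x1", "x9", "x7"]
spec_rank = ["x1", "x12", "Topic", "x7"]
selected = key_features([fisher_rank, relief_rank, spec_rank], top_n=3)
```

Only features ranked highly by all three algorithms survive, which reduces the bias that any single selection criterion may introduce.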

C. CLASSIFICATION MODELS
Three classification models of support vector machine (SVM), k-nearest neighbor classifier (KNN) and decision tree-based Bagging classifier are used to test the classification performance of the selected key features.
SVM seeks the best separating hyperplane in the feature space, so as to maximize the margin between positive and negative samples in the training set [120]. The classical SVM algorithm is a supervised binary learning algorithm. Multi-class problems can be solved by combining several SVM classifiers [121]. In this paper, the Gaussian function is used as the kernel function of SVM. The hyper-parameters are set to gamma = 9 and c = 5.
KNN first determines the k nearest neighbors of the sample to be classified in the feature space, and assigns to that sample the category held by the majority of its k nearest neighbors [122]. In this paper, the parameter k of the KNN classifier is set to 12.
Bagging is a classifier based on decision trees. It is a parallel ensemble learning method, which uses multiple trees for training and prediction, and outputs a prediction that combines the results of the individual trees [123]. In this paper, ten decision trees are combined in Bagging.
In ''Supplementary Material'', we give a detailed discussion on determining the optimal parameters for the three classifiers.

V. EXPERIMENTAL RESULTS AND DISCUSSION
Table 3 shows the top 10 features selected by the three feature selection algorithms. It can be seen that the topic feature Topic and the seven bibliometric features $\{x_1, x_2, x_3, x_4, x_5, x_7, x_8\}$ appear in the feature subsets obtained by all three feature selection algorithms, which shows that these 8 features are more important than the other features in determining the future citation trend of an article. Among them, the topic feature Topic reflects the semantic content of the article. The features $\{x_1, x_2, x_3, x_4, x_5\}$ give the early citation characteristics of the article, and are denoted as Early-Citation features in this paper. The features $x_7$ and $x_8$ give the h-index of the author to represent the author's academic reputation, and are denoted as Author-Reputation features in this paper.
In addition, the sample articles are selected from different fields, which have different interdisciplinary characteristics. Some of the sample articles come from Physics and Chemistry, two areas that intersect strongly with other areas. The rest of the articles are from Oncology and Neurology, two highly specialized fields with little interdisciplinary overlap. In order to explore the applicability of the proposed framework to fields with different interdisciplinary characteristics, the above three feature selection algorithms are applied separately to the articles from the two kinds of fields to identify the key features in each. Under each algorithm, the top 10 features are selected according to the order of the feature weights calculated by the algorithm. Table 4 and Table 5 show the results of feature selection in the two kinds of fields respectively. The results show that for articles in fields with different interdisciplinary characteristics, the topic characteristics, the Early-Citation features and the Author-Reputation features are still the key features that affect the citation trend. The features from these three aspects are not affected by the domain characteristics, and are of great value to the citation behavior of the article.
In order to verify the value of the selected key features in identifying high influential articles, this paper uses three classifiers, SVM, KNN and Bagging, to test the classification performance of these features. The classification performance is determined by 10-fold cross-validation. In this study, the area under the ROC curve (AUC) is used to evaluate the classification model. The ROC curve is drawn with the FPR (False Positive Rate) as the horizontal axis and the TPR (True Positive Rate) as the vertical axis [124]. The AUC value gives the area under the ROC curve [125]. The larger the AUC value, the better the classification performance of the model. Table 6 shows the confusion matrix used to calculate FPR and TPR. Table 7 shows the classification performance of the three classifiers in each journal. The last column in Table 7 shows the average classification performance of the three classifiers for each journal. The last row gives the average classification performance of each classifier over the articles in different journals. The results in Table 7 are also presented more intuitively in the form of a histogram. It can be seen that the key features extracted in this paper achieve good classification performance for sample articles in different fields. Under each journal and each classifier, the average classification performance represented by AUC is above 0.9, which shows that these features can effectively identify and distinguish articles with different citation influence.
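The evaluation quantities can be sketched for the two-class case as follows. The text does not specify how the three classes (HCPs, MCPs, LCPs) are reduced to binary ROC curves, so the sketch assumes a single positive class. It computes AUC through its rank interpretation, the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, which equals the area under the ROC curve. Function names, counts and scores are hypothetical.

```python
def tpr_fpr(tp, fn, fp, tn):
    """True and false positive rates from the four confusion-matrix cells."""
    return tp / (tp + fn), fp / (fp + tn)


def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical confusion-matrix counts and classifier scores.
rates = tpr_fpr(tp=8, fn=2, fp=1, tn=9)
area = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

In a 10-fold setting, such an AUC value would be computed on each held-out fold and the ten values averaged.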
Among the three classifiers used in this paper, SVM achieves the best classification results. We believe that there are two reasons. Firstly, SVM can map the data to a high-dimensional space and construct the optimal separating hyperplane in that high-dimensional feature space, so it can separate data that are not separable in the original space. Secondly, by adjusting the kernel function and selecting appropriate parameters, SVM remains robust even if there is a certain deviation in the training samples. These two points are likely the reason why SVM is superior to the other classifiers. In order to further verify the role of the three kinds of key features in the recognition of high influential articles, this paper removes these three types of features in turn, and uses the best classifier, SVM, to test the classification performance of the remaining feature subset. Table 8 shows the classification performance after excluding features. In Table 8, the minus sign ''-'' in front of a feature indicates the deletion of that feature, and the values in the table give the classification performance after the corresponding feature is deleted.
It can be seen that after deleting any one of the three kinds of key features, the classification performance for the four journals is greatly reduced. The performance decreases the most after excluding the Early-Citation features, with the AUC value of the four journals decreasing by 10%-20%. The AUC values of the two journals APL and LN even decrease from 0.94 and 0.97 to 0.74 and 0.75, respectively. This shows that the Early-Citation features of an article are very important in determining whether it has a high influence in the future. The role of the Topic feature is second only to that of the Early-Citation features. The AUC value decreases significantly after the Topic feature is deleted. In journal LN in particular, the classification performance decreases from 0.97 to 0.77, which shows that the topic feature is also an important feature that affects the article's future citation trend. After removing the Author-Reputation features from the key feature subset, the classification performance for the four journals also decreases significantly, which indicates that the Author-Reputation features also have a great impact on the citation behavior of articles.
Therefore, the topic feature, the early citation features and the author's reputation are the key factors that affect the future citations of an article. Based on these three indicators, we can effectively identify the articles that will have high influence in the future.
The topic characteristics of the article reflect the semantic content of the article. The important value of topic feature on the citation behavior shows that, in addition to some external bibliometrics characteristics from the statistical perspective, the content of the article also has a great impact on the future citation trajectory of the article. To some extent, this conclusion reflects the researchers' reading and research habits. When researchers make research directions and choose reading literature, they usually first understand the current research hot spots, and then choose articles that meet their own research interests to read. The role of topic feature in citation behavior reflects researchers' preference for current research hotspots.
The early citation features of the article represent the academic feedback on the article in a short time after publication. If an article can be cited by more countries, institutions, disciplines, journals, etc. in a short time after publication, it means that the knowledge carried by the article is widely recognized and accepted by the academic community. The wide spread of knowledge brought by the citation activity will bring more opportunities for the article to be cited, and make the article more likely to grow into a highly influential document in the future [69]. The conclusion of this paper is consistent with that of some previous researches [21], [23], [94]- [96], [126].
The author's h-index was proposed by Hirsch in 2005 to assess the academic level of scientists [127]. The index has proved quite useful and is widely used in quantifying an individual's scientific output [128]. A high h-index indicates that a researcher has a high academic level and a high academic reputation. The important influence of the author's h-index on articles' citation activity shows that articles published by authors with a high h-index receive more attention, which brings many potential opportunities for citation. This conclusion is consistent with the results of some previous work [21]-[23], [94]-[96].

VI. CONCLUSIONS
The purpose of this paper is to explore which characteristics of an article are important for its future citation behavior. Compared with previous work on determining the factors that influence citation, this paper focuses on the influence of an article's topic characteristics on its citation trend.
This paper makes three main contributions. 1) The LDA model is introduced to generate the topic feature of the sample articles. By training on the currently highly-cited papers together with the sample articles, the LDA model maps the content of the sample articles to the current research hotspots determined by the highly-cited articles. Based on the probability distribution of each sample article's content over the research hotspots, the maximum membership degree of the article under the research hotspots is extracted as its topic feature, which lays the foundation for the subsequent recognition of high influential articles. 2) On the basis of the extracted topic feature and some commonly used bibliometric features, a feature space is constructed to identify future high influential articles. Three feature selection algorithms are used to extract the key features that are most valuable for distinguishing articles with different citation influences. 3) Three classifiers are used to test the classification performance of the key features, in order to verify whether these features, especially the topic feature, are valuable for identifying future high influential articles.
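The first two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the per-article topic distributions are hypothetical values standing in for LDA output (no LDA model is trained here), the names `topic_feature` and `fisher_score` are illustrative, and the Fisher Score is written in its common form (between-class scatter of a feature divided by its within-class scatter), which may differ in detail from the variant used in the experiments.

```python
def topic_feature(topic_probs):
    """Topic feature of one article: its maximum probability over the hot topics."""
    if not topic_probs:
        raise ValueError("topic_probs must contain at least one probability")
    return max(topic_probs)


def fisher_score(values, labels):
    """Fisher Score of a single feature (common formulation, assumed here).

    score = sum_k n_k * (mean_k - mean)^2 / sum_k n_k * var_k
    where k runs over the classes found in `labels`.
    """
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        mean = sum(group) / len(group)
        var = sum((v - mean) ** 2 for v in group) / len(group)
        between += len(group) * (mean - overall_mean) ** 2
        within += len(group) * var
    if within == 0:
        # Degenerate case: no spread inside any class.
        return float("inf") if between > 0 else 0.0
    return between / within


# Hypothetical document-topic distributions (one row per sample article).
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
]
# Hypothetical labels: 1 = highly cited later, 0 = not highly cited.
labels = [1, 0, 1, 0]

topic_features = [topic_feature(row) for row in doc_topic]
topic_score = fisher_score(topic_features, labels)
```

A larger score means the feature separates the two classes more clearly relative to its spread inside each class; in a full pipeline the same score would be computed for every bibliometric feature and the features ranked by it.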
To obtain more reliable experimental results, this paper selects articles from different fields as experimental samples. The experimental results show that early citation characteristics, topic characteristics and the authors' reputation are the key features that affect an article's citations. The topic characteristics of an article help attract more citations and are also important for identifying future high influential papers. Because the topic of an article reflects its content, this shows that, in addition to external bibliometric characteristics, the content of the article also has a strong impact on its citation behavior and, to a certain extent, determines whether the article can grow into an influential one in the future.
This paper studies the influence of an article's content on its future citation behavior. Although the results show that the content characteristics of an article do have an important impact on its future citation behavior, the conclusion remains coarse because the content of the article is represented only by its topic. In future work, we will continue to explore how to better quantify the content characteristics of an article in order to further investigate the value of content in citation behavior.
JIAQI ZHANG received the B.S. degree from Northeast Forestry University, China, in 2017, where she is currently pursuing the M.S. degree with the College of Information and Computer Engineering. Her main areas of interest are text mining, scientometrics, and machine learning.
XIANGRONG ZHANG received the Ph.D. degree from the Harbin Institute of Technology, China, in 2016. She is currently an Associate Professor with the School of Management, Heilongjiang Institute of Technology. Her current research interests include text mining, natural language processing, and deep learning.
NA ZHU received the master's degree from the School of Information Management, Heilongjiang University, China, in 2010, where she is currently pursuing the Ph.D. degree in literature informatics. She is also an Associate Research Librarian with the Library of Harbin University. Her research interests include bibliometrics and data policy. VOLUME 8, 2020