The Quality Assist: A Technology-Assisted Peer Review Based on Citation Functions to Predict the Paper Quality

This study aims to develop a prediction model for paper quality assessment to support technology-assisted peer review. The prediction technique is intended to reduce the review burden, which is becoming a critical issue in today’s paper submission process. However, most existing works on this topic rely on the reviewers’ comments as input, which is unfair and inapplicable when the goal is to reduce the review burden. To address this issue, our prediction method relies only on features extracted from the paper. The method covers three tasks: two classification tasks and one regression task. The classification tasks predict the final review decision (accepted-rejected) and estimate the paper quality (good-poor), while the regression task predicts the review scores. The tasks are implemented using three main feature groups: citing sentence features developed based on the labeling scheme of citation functions, regular sentence features created by applying the labels of citation functions to non-citation text, and reference-based features constructed by identifying the source of citations. Classification experiments on a dataset obtained from the International Conference on Learning Representations 2017–2020 showed that our method is more effective in the good-poor task than in the accepted-rejected task, with best accuracies of 0.75 and 0.73, respectively. Moreover, using only the citing sentence features, we reached a recall of 0.99 in the good-poor task, which matters when the aim is to retrieve as many good papers as possible. Our regression experiments indicate that the method predicts the average review score better than the individual review scores, with best Root Mean Square Errors (RMSE) of 1.34 and 1.71, respectively.


I. INTRODUCTION
Peer review aims to ensure the quality of scientific works. It is used not only in journal publishing but also in conference submissions, grant proposal evaluations, and academic monograph submissions [1]. However, completing all stages of the peer review, from accepting the manuscript to the final review decision, is time-consuming and requires extensive human effort. Peer review can be challenging in the journal submission process due to the massive number of published research papers. The STM report 2018 [2] states that 33,100 peer-reviewed English-language journals and 9,400 non-English-language journals collectively publish more than 3 million articles annually. Another study reported that the yearly review of previously rejected manuscripts consumes 15 million hours [3]. Moreover, EasyChair, a web-based conference management system, has managed around 100,000 conference events since 2002. The situation worsens due to the uneven geographical distribution of review experts, which leaves the peer-review process over-burdened [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
Another limitation of the peer-review process is that, because it is based only on human expertise, it unavoidably tends to be biased and subjective due to factors such as the expert's academic background, experience, emotion, and health [5]. Other challenges identified by [6] include inadequate training on how to perform the peer review or respond to the review [7], the relationship between the journal and peer-review quality [8], and standard core competencies for editors [9]. Additionally, Jana [10] explained further limitations of the traditional peer-review system, such as high cost and publication delays, harsh comments due to reviewers' anonymity, author-recommended reviewers, and reviewers who fail to complete the review process responsibly. This situation creates an opportunity for technology-assisted peer review (TAPR), an automated screening method, to reduce the massive burden of the peer-review process.
The development of TAPR to reduce the review burden has gained much attention. Existing works on TAPR have addressed three principal tasks: predicting the paper quality, final review decisions, and review scores. Existing TAPR methods were developed for various purposes, ranging from predicting the three-class outcome of accepted, borderline, and rejected [11] to predicting the two-class outcome of accepted and rejected as the majority-targeted classes, as in [12]. However, existing works have two main drawbacks. The first drawback is that, due to inconsistency among review results, review scores, and final decisions, as stated by [13], directly predicting the final review decision leads to a bias in determining the paper's quality. For example, when reviewers agreed not to reject the manuscripts, the editor still rejected 20% of them, and when the reviewers agreed to reject the manuscripts, the editor rejected 80%. To resolve this issue, this paper proposes two prediction tasks: paper quality prediction, which determines whether a manuscript is good or poor and is a more reasonable target, and review score prediction, which estimates the review scores. For comparison, we also predict whether the final review decision on a submitted manuscript is accepted or rejected. The second drawback is that most existing studies employed the review comments as prediction features. This approach is unfair and inapplicable when the main aim is to reduce the review burden. A technique for reducing the human cost of the peer-review process should not depend on features, such as review comments, that require human work.
This study develops a prediction method to address two classification tasks and a regression task for assessing paper quality; these tasks do not depend on the review comments. While the classification tasks predict the final review decision (accepted-rejected) and paper quality (good-poor), the regression task forecasts the average and the individual review scores. Additionally, the prediction tasks are accomplished using predictors with several prediction features.
Here, we use citation functions, which represent why the author of a research paper cited previous works, as the main predictor. This choice is motivated by the fact that citation functions can represent the paper's quality [14] [15], show the position of the proposed research within the literature [16], indicate the novelty of the proposed research [17], provide a broad view of the research topics of the paper [18], and support examination of the map of science [19]. Additionally, using citation functions-based features brings another advantage: it explores the rarely touched field of citation functions-based recommendation systems, especially for preparing a research manuscript. Hence, the role of citation functions in estimating the paper's quality is worth discussing.
The prediction method for the tasks proposed in this paper can be summarized as follows. The main predictor is the citation functions obtained by categorizing the citing sentences (sentences containing citation marks). Notably, the citation functions applied in this paper were developed in our previous research [20]. Since the author's intention during manuscript writing cannot be captured using only citing sentences, this paper proposes an additional predictor, called the regular sentence predictor, which involves the non-citation sentences. Another predictor implemented here is the reference-based predictor, which represents the references cited in the manuscript. Finally, we merge the mentioned predictors into a combination predictor to investigate the impact of the prediction features when combined. The prediction model is created using several Machine Learning (ML) and Feature Selection (FS) methods. To evaluate the prediction performance, this study uses a dataset from the International Conference on Learning Representations (ICLR) 2017-2020, well parsed by [21].
The contributions of this paper are as follows:
• This paper proposed a method to predict the final review decision, paper quality, and review scores comprising four predictors: citing sentence, regular sentence, reference-based, and combination. Our prediction method does not use review comments as features and depends only on the paper.
• To evaluate the impact of citation functions in the classification tasks, this paper demonstrated accuracies of 0.67 and 0.72 in the accepted-rejected and good-poor tasks, respectively. However, the best accuracies, 0.73 and 0.75 in the accepted-rejected and good-poor tasks, respectively, were achieved with the combination predictor.
• At the cost of a lower accuracy of 0.72 in the good-poor task, a recall of 0.99 was achieved using only the citing sentence predictor.
• Analyzing the top 10 most important features in the combination predictor of the classification tasks reveals that a feature called citing_paper_dominant, which represents a paper that outperforms the performance of previous works, is significant to the prediction results; however, this feature has few instances in the dataset.
• Regarding average review score prediction, the best RMSE and Mean Absolute Error (MAE) were 1.34 and 1.07, respectively. In the individual review score prediction, the best RMSE and MAE, 1.71 and 1.38, respectively, were attained when predicting individual score 1.
• When obtaining the best performance, the citation functions-based predictors (citing sentence and regular sentence predictor) are more impactful than the reference-based predictor.
• Finally, our prediction method is more effective in predicting the paper quality than the final review decision in the classification tasks. In the regression task, our method estimates the average review score better than the individual ones.
This paper is organized as follows: Section II presents a brief review of related works on predicting the paper's final review decision, paper quality, and review score. Section III introduces our proposed method to handle three prediction tasks, i.e., accepted-rejected, good-poor, and review scores. Next, Section IV describes how the prediction features are constructed. In Section V, we report the experimental results of the prediction tasks. Finally, Section VI presents the conclusion and future research plan.

II. RELATED WORK
This section presents existing works on three research focuses: (a) existing TAPR platforms developed by publishers and technology vendors, (b) paper acceptance prediction covering two subtasks, namely classification tasks, comprising the final review decision and the paper quality prediction, and regression tasks for predicting review scores, and (c) limitations of existing prediction methods. Finally, we illustrate the contribution of this paper to this research area.

A. EXISTING TAPR PLATFORMS
TAPR tools have been developed by both publishers and technology vendors for different purposes. For example, Frontiers has developed the Artificial Intelligence Review Assistant (AIRA), which addresses several tasks such as reducing reviewer fatigue, editor-article matching, and connecting with funders. Another tool is UNSILO Evaluate Technical Check, which evaluates how well a submitted manuscript follows the submission guidelines. SciScore offers a service that analyzes the method section of a paper based on several reporting standards, such as those of the National Institutes of Health (NIH), Materials Design Analysis Reporting (MDAR), and Animal Research: Reporting of In Vivo Experiments (ARRIVE), and then provides scores for every submission. Scholastica optimizes peer review by integrating it with production and journal hosting software. Elsevier has released an editorial tool called EVISE for several tasks, including plagiarism detection and reviewer matching. All these tools show that the peer-review system needs technological support to address the issue of review burden.

B. PAPER ACCEPTANCE PREDICTION
ICLR is the most widely adopted source of datasets used to make predictions. This is because ICLR provides both accepted and rejected papers accompanied by peer-review information, such as review comments and review scores. In this study area, the dataset published by [22] is the most cited work; it compiles numerous peer-review datasets comprising ICLR, arXiv, Association for Computational Linguistics (ACL), and Conference on Computational Natural Language Learning (CoNLL). Only two works used non-ICLR datasets: [15], which used 94 Related Work sections from the ACL dataset, and [23], which used paper collections obtained from the Artificial Intelligence (AI) Conference (2013 and 2019) and Robotics (2015 and 2019).
In the classification tasks, two major categories of classification features are used in the existing works. The first category comprises features developed based on the manuscript's content, ranging from lexical features to word representation methods. The second category comprises features that employ the review comments (most existing works fall into this category). Additionally, most existing works treated the prediction as a binary accepted-rejected classification task. Only a few studies proposed more than two classes, as in [11], which used two and three labels for accepted-rejected and accepted-borderline-rejected, respectively, and in [15], with three classes of good-average-poor. In the regression task, most existing studies predicted the aspect review scores, a structured summary reflecting the manuscript's strengths and weaknesses. These aspect review scores can cover several points, e.g., impact, recommendation, substance, and clarity, as stated in [22]. Additionally, two existing studies predicted the final review scores, as in [23] and [24].

C. LIMITATION OF EXISTING PREDICTION METHODS
The literature review reveals some limitations in most existing publications. First, the crucial role of citation functions in assessing the paper's quality has not been addressed. Second, existing studies did not identify which aspects or sections of the manuscript are important for predicting its quality. Third, using review comments as prediction features is unfair, and using accuracy as the only metric is biased toward the majority class. Fourth, predicting only accepted-rejected is biased because the final review decision relies on multiple factors. To resolve these challenges, this paper develops a prediction method that depends only on the manuscript's content, particularly the citation functions obtained from citing sentences. We propose two additional groups of prediction features: regular sentence and reference-based features. The paper mainly aims to predict the paper quality (good-poor) and the review scores. The final review decision is covered as well for comparison purposes. We also address the limitation of determining the most influential part of the manuscript for predicting its quality using several ML and FS methods.
Interestingly, the study by [11] conducted experiments on the three classes of accepted, borderline, and rejected, and on the two classes of accepted and rejected by eliminating the borderline papers. Although eliminating the borderline papers improved the prediction performance, it is inapplicable in the actual peer-review process. Additionally, when one reviewer judges a paper as borderline, the other two reviewers do not necessarily judge it the same way, since each submitted manuscript is reviewed by three reviewers and receives three review scores that may differ. For this reason, we prefer to use the average review score to determine whether a paper is good or poor (further explanation is presented in the subsequent section). In a study with the same three-class structure, Casey et al. [15] proposed good, average, and poor as final quality decisions, in which the labels are determined by the annotator and not by conference reviewers or editors. Tables 1 and 2 show the details of the existing studies.

III. PREDICTION METHOD
This section briefly describes the stages used to build the prediction method proposed in this paper, as shown in Figure 1. The prediction method follows several stages. In the first stage, we discuss the data source of the research papers, which is a paper acceptance dataset. The second stage explains three predictors that provide classification and regression features, since the system is treated as both a classification and a regression problem. These predictors are the citing sentence predictor, developed based on the labeling scheme of citation functions; the regular sentence predictor, created by applying the labels of citation functions to non-citation text; and the reference-based predictor, constructed by identifying the source of citations. The final stage explains the proposed prediction scenarios and evaluations. For consistency, we define several terms used throughout the paper: citing paper, the author's own work; cited paper, a previous work cited by the citing paper; citing sentence, a sentence containing citation marks; and regular sentence, a sentence that does not contain citation marks. We also use the term predictor for a group of classification features. This section explains the three types of predictors: citing sentence, regular sentence, and reference-based predictors. The other parts of the proposed method are explained in the next section.

A. CITING SENTENCE PREDICTOR
The citing sentence predictor is the first and main predictor proposed for all prediction tasks. This predictor is developed based on the citation functions, which explain why the author of a research paper cited previous works. We use the labeling scheme of citation functions developed in our previous study [20], comprising 5 coarse and 21 fine-grained labels. The scheme was developed using a research paper dataset from [38], containing 90,278 parsed papers from arXiv Computer Science (CS) from January 1993 to December 31, 2017. The coarse labels represent the general idea of the citation functions, and the fine-grained labels provide a detailed version of them. All these labels are applied as features, and we include one more feature to represent the number of citing sentences in each paper. The features are built by classifying all citing sentences in the ICLR dataset using ML and counting the labels contained in each paper. Finally, we denote the features as c0 to c19 for encoding purposes, as shown in Table 3.
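As a rough illustration of how per-paper features could be derived from classified citing sentences, the sketch below counts predicted labels for one paper and appends the total number of citing sentences. The label names are hypothetical placeholders (the actual scheme is defined in [20] and Table 3), and the use of raw counts rather than normalized values is an assumption.

```python
from collections import Counter

# Hypothetical label names for illustration only; the real scheme is in [20].
FINE_LABELS = ["background", "uses", "compares", "extends"]

def citing_sentence_features(predicted_labels, label_names=FINE_LABELS):
    """Return one count per label, followed by the number of citing sentences."""
    counts = Counter(predicted_labels)
    features = [counts.get(name, 0) for name in label_names]
    features.append(len(predicted_labels))  # extra feature: total citing sentences
    return features

# Labels assumed to come from a sentence classifier applied to one paper.
paper_labels = ["uses", "background", "uses", "compares"]
print(citing_sentence_features(paper_labels))
```

The same function could be reused for the regular sentence predictor by passing the labels predicted for non-citation sentences.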

B. REGULAR SENTENCE PREDICTOR
The regular sentence predictor is the first additional predictor proposed in this paper. It is motivated by the observation that not all of an author's reasons for citing can be captured using only citing sentences; authors often provide detailed explanations in the sentences that follow a citation. This predictor is designed by applying the scheme of citation functions to regular sentences. That is, we categorize all regular sentences extracted from each paper of the ICLR dataset using the same ML model used to classify the citing sentences. Therefore, this predictor has the same labels as the citing sentence predictor, and we denote them r0 to r19.

C. REFERENCE-BASED PREDICTOR
The second additional predictor proposed in this paper is the reference-based predictor. This predictor comprises 24 labels covering generic, preprint, and journal sources. These labels are generated by manually reviewing the reference sections of the papers in our dataset. The review covers two aspects: first, identifying well-known publication venues, both conferences and journals, in AI, ML, Natural Language Processing, and Data Mining, among others; and second, checking whether these venues appear in the reference sections of the ICLR papers in our dataset. Additionally, the review shows that the papers frequently cite preprint repositories and references published within the last 3 years. We encode the labels as ref0 to ref23 and use all of them as prediction features. Table 4 presents the detailed features of this predictor.

D. COMBINATION PREDICTOR
Here, we include one more predictor comprising all the mentioned predictors. This combination predictor is proposed to examine whether the combined features of all predictors can yield better prediction performance than the features of a single predictor. We denote the features in this predictor as comb0 to comb63 for encoding purposes.
VOLUME 10, 2022

IV. BUILDING PREDICTION FEATURES
This section discusses the prediction features for classification and regression tasks comprising several parts. Firstly, the beginning of this section describes the paper acceptance dataset as the primary data source employed in this paper. Secondly, this section discusses the creation of prediction features and their distribution. Lastly, this section describes how the experiment scenarios are planned and executed.

A. THE DATASET OF PAPER ACCEPTANCE
This paper applies the dataset from [21], which provides a well-parsed paper collection from ICLR 2017-2020 and the corresponding final review decisions and review scores. The final review decision on whether a submitted paper is accepted or rejected is determined by the editor of the conference. The review scores are assigned by three reviewers and range from 1 to 10, where a review score <4 is labeled as ''rejected,'' a score >7 is labeled as ''accepted,'' and scores of 5 and 6 are labeled as ''marginally below'' and ''marginally above,'' respectively. These review scores are provided by the OpenReview platform in the review process. Notably, a paper with marginal review scores can still be labeled as ''accepted.'' Therefore, this study uses the average of the three review scores to determine whether the paper is good or poor. A submitted paper is labeled as poor when the average review score is ≤4 and good when the average review score is >4. We assigned papers with 4<average review score<5 to the good category for several reasons. First, an average in this range implies that at least one reviewer provided a review score of 5 or more; second, a paper in this category can still be accepted by the editor; and third, the review guide shows that scores of 4 or below lead to rejection, while there is no rule to directly reject the borderline scores of 5 and 6. Since the review scores are the focus, we do not consider whether an accepted paper is presented as an oral, poster, or workshop paper. The assumption in using the review score as the quality indicator is that the reviewers have already considered several review aspects, such as originality, novelty, clarity, and impact, as common guidance when doing the review. This paper selected 5,156 papers out of the 5,192 papers in the dataset. The remaining 36 papers were excluded because we could not determine their corresponding final review decisions or review scores.
Finally, the paper acceptance dataset for the final experiment comprises 1,722 accepted and 3,434 rejected papers. We also identified 3,575 good and 1,581 poor papers within the same dataset. Table 5 shows the detailed dataset distribution.
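The good-poor labeling rule described above can be sketched in a few lines. This is a minimal illustration, assuming each paper comes with a list of numeric reviewer scores; the function and constant names are hypothetical and not taken from the dataset of [21].

```python
from statistics import mean

POOR_THRESHOLD = 4  # average score <= 4 is "poor"; anything above is "good"

def quality_label(review_scores, threshold=POOR_THRESHOLD):
    """Label a paper as good or poor from the average of its review scores."""
    average = mean(review_scores)
    return "poor" if average <= threshold else "good"

print(quality_label([3, 4, 5]))  # average is exactly 4
print(quality_label([4, 4, 5]))  # average is slightly above 4
```

An average just above 4, as in the second call, can only occur when at least one reviewer gives a score of 5 or more, which is the first reason given for placing that range in the good category.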

B. BUILDING THE CLASSIFICATION FEATURES
The classification features are created by gathering each feature (label) of all predictors in the paper. To do so, we extract all citing sentences, regular sentences, and references from all papers in the dataset. For the first two predictors, i.e., citing and regular sentences, the extracted sentences are categorized into fine-grained labels using the ML model based on SciBERT [39] that we developed in our previous study [20]. Our SciBERT model achieved an accuracy of 0.83 and an f1 score of 0.84. We obtained this performance with the following hyperparameter settings: a learning rate of 3e-5, a batch size of 32, and a dataset balanced using class weights. Notably, SciBERT was applied with the ktrain Python package. For the reference-based predictor, we employed a keyword matching approach to estimate each label in all papers. To create the combination predictor, we simply merge the features of all predictors to obtain 64 features (atr0 to atr63). The final features are accompanied by the target labels of accepted-rejected and good-poor.
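The keyword matching step for the reference-based predictor is not specified in detail, so the following is only a sketch of one plausible form: each reference string is lowercased and checked against a small keyword list per venue. The venue keys, keyword lists, and placeholder references are all hypothetical; the actual 24 labels are those listed in Table 4.

```python
import re

# Hypothetical keyword lists for three venues; not the paper's actual label set.
VENUE_KEYWORDS = {
    "arxiv": ["arxiv"],
    "iclr": ["iclr", "international conference on learning representations"],
    "neurips": ["neurips", "nips", "neural information processing systems"],
}

def reference_features(references, venue_keywords=VENUE_KEYWORDS):
    """Count how many reference strings mention each venue (whole-word match)."""
    counts = {venue: 0 for venue in venue_keywords}
    for reference in references:
        text = reference.lower()
        for venue, keywords in venue_keywords.items():
            if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
                counts[venue] += 1
    return counts

# Placeholder reference strings, not real citations.
refs = [
    "A. Author. Title one. arXiv preprint, 2016.",
    "B. Author. Title two. In ICLR, 2017.",
    "C. Author. Title three. In Advances in Neural Information Processing Systems, 2015.",
    "D. Author. Title four. Some Journal, 2014.",
]
print(reference_features(refs))
```

Whole-word matching is used so that a short keyword such as "nips" does not fire inside unrelated words; whether the original implementation made the same choice is not stated.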

C. BUILDING THE REGRESSION FEATURES
The review score prediction applies the same features as the classification tasks. The difference is that the review score prediction is treated as a regression problem comprising two tasks, i.e., average and individual review score predictions. The average review score is the mean of the review scores given by the three reviewers. In contrast, the individual review score prediction targets the score given by each reviewer. Here, we treat the average and the individual review score predictions as single- and multi-output regressions, respectively. Both regression tasks follow similar experiment settings.
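To make the distinction between the two regression targets concrete, the sketch below builds both from the same per-paper scores. It assumes each paper has exactly three reviewer scores stored as a list; the function name and the sample scores are made up for illustration.

```python
from statistics import mean

def build_regression_targets(score_rows):
    """Build targets from per-paper review scores.

    score_rows: list of [score_1, score_2, score_3], one row per paper.
    Returns (y_average, y_individual).
    """
    y_individual = [list(row) for row in score_rows]  # multi-output: 3 targets per paper
    y_average = [mean(row) for row in score_rows]     # single-output: 1 target per paper
    return y_average, y_individual

scores = [[6, 7, 5], [3, 4, 4]]  # made-up scores for two papers
y_average, y_individual = build_regression_targets(scores)
print(y_average)
print(y_individual)
```

A single-output regressor would be fitted on `y_average`, while a multi-output regressor would be fitted on `y_individual`, with the same feature matrix in both cases.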

D. THE DISTRIBUTION OF CREATED PREDICTION FEATURES
This section presents the distribution of the prediction features developed in the preceding section to provide a clear view of our method.
Here, we discuss the instance distribution of all predictors. Table 6 shows the yearly distribution, and Figure 2 depicts the distribution over all years. Figure 2.1 and Figure 2.2 show that the label distribution in the citing sentence predictor differs significantly from that in the regular sentence predictor. This trend arises because the labels used in the regular sentence predictor are adopted from the citing sentence predictor. In Figure 2.3, the labels in the reference-based predictor are dominated by the number of references from the last 3 years (NUM-REF2YEARS), followed by preprint source (arXiv), ICLR, NeurIPS, and ICML. The other labels in this predictor have relatively equal distributions. Figure 3 compares the mean distribution of all predictors. Notably, the citing sentence and reference-based predictors have relatively equal distributions. In general, the distribution of the regular sentence predictor is expected to be significantly higher than those of the other two predictors.

E. EXPERIMENT SCENARIO
Here, the accepted-rejected and good-poor predictions are treated as classification problems. Both prediction tasks apply the same experimental settings: we propose four experiment scenarios, one for each type of predictor. Specifically, the experiments on the citing sentence, regular sentence, reference-based, and combination predictors adopt features c0 to c19, r0 to r19, ref0 to ref23, and comb0 to comb63, respectively. We apply XGBoost as the ML algorithm for all experiments and three FS methods to show the most influential features. The FS methods employed here are Chi-square (Chi2), Recursive Feature Elimination (RFE), and Sequential Feature Selector (SFS) Forward. Notably, the FS methods are implemented using the Python scikit-learn library. The FS experiment is conducted by observing the classification performance as a function of the number of selected features, from a single feature to the maximum number of features. We also evaluate the impact of data balancing on the classification performance using a Synthetic Minority Over-sampling Technique (SMOTE)-based method from the imbalanced-learn library (https://imbalanced-learn.org/stable/). For the regression experiment, this paper uses five regression algorithms and one FS method. The regression algorithms are Random Forest Regression (RFR), Gradient Boosting Regression (GBR), Support Vector Regression (SVR), Extreme Gradient Boosting Regression (XGBR), and Decision Tree Regression (DTR). The FS method is SelectKBest from the scikit-learn Python library. In each experiment, the FS observes the regression performance from a single feature to the maximum number of features. This study uses MAE and RMSE as performance metrics. Notably, all the regression algorithms and the FS method are implemented using the scikit-learn Python library.
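The procedure of observing performance from a single feature up to the maximum number of features can be sketched independently of any particular library. The code below is a simplified illustration: it assumes the FS method yields a ranked feature list and that the caller supplies an `evaluate` function (hypothetical) that trains a model on a feature subset and returns a score. MAE and RMSE are written out explicitly; the toy predictions are made up.

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def sweep_feature_counts(ranked_features, evaluate, lower_is_better=True):
    """Score the top-k features for k = 1..len(ranked_features).

    Returns (best_k, best_score, all_scores).
    """
    scores = [evaluate(ranked_features[:k]) for k in range(1, len(ranked_features) + 1)]
    pick = min if lower_is_better else max
    best_index = pick(range(len(scores)), key=scores.__getitem__)
    return best_index + 1, scores[best_index], scores

# Toy stand-in for "train and predict with k features"; values are made up.
y_true = [6.0, 4.0, 5.0]
toy_predictions = {1: [5.0, 5.0, 5.0], 2: [6.0, 4.5, 5.0], 3: [6.5, 4.5, 5.5]}
evaluate = lambda subset: rmse(y_true, toy_predictions[len(subset)])

best_k, best_score, all_scores = sweep_feature_counts(["c3", "r7", "ref1"], evaluate)
print(best_k, round(best_score, 3))
```

For accuracy or recall in the classification experiments, the same sweep would be called with `lower_is_better=False`. Treating the selected subsets as prefixes of one ranking is a simplification; a wrapper method may return a different subset for each requested size.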

V. PREDICTION EXPERIMENT RESULTS
This section describes the experiment results for predicting paper quality, which is classified into three parts, i.e., the results of the accepted-rejected, the good-poor, and the review scores tasks, respectively. Furthermore, the results cover prediction performances measured by several metrics and the most influential features to achieve the best performances. Moreover, this section also provides an analysis of the performances against the real review scores, the phenomenon of meaning shifts of regular sentence predictors, and the performance comparison between our study and previous studies.

A. PERFORMANCE OF CLASSIFICATION TASKS
Tables 7 and 8 present the best results of all scenarios in the accepted-rejected and good-poor tasks, respectively. Instead of using accuracy as the single metric, this study uses additional metrics, namely precision, recall, AUC, and f1, for two reasons. First, accuracy can be biased toward the majority class in an imbalanced setting. Second, recall, with accepted or good papers set as the positive label, is a more suitable metric in this study. This is because retrieving as many positive instances as possible is preferable to wrongly predicting positive instances as negative.
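The bias of accuracy toward the majority class can be illustrated with the reported good-poor distribution (3,575 good and 1,581 poor papers). The sketch below computes the metrics for a trivial baseline that labels every paper as good. It uses the full reported distribution rather than the paper's actual test split, so the numbers are illustrative only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and f1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Baseline that predicts "good" for every paper: all 3,575 good papers are
# true positives and all 1,581 poor papers are false positives.
baseline = classification_metrics(tp=3575, fp=1581, fn=0, tn=0)
for name, value in baseline.items():
    print(name, round(value, 3))
```

Such a baseline reaches an accuracy of roughly 0.69 and a perfect recall without learning anything, which is why accuracy or recall alone is hard to interpret and the results are reported with several metrics side by side.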
In the accepted-rejected task, the best accuracy was 0.73, achieved using the combination feature, SFS Forward, and 15 features in the balanced setting. This scenario was also considered the best setting since it achieved 0.50 recall (second best result), 0.61 precision (best result), 0.72 AUC (one of the best results), and 0.55 f1 (one of the best results). Another notable result is that the same accuracy of 0.71 was obtained by applying the combination feature with the two other FS approaches, Chi2 and RFE, in the balanced setting. In the imbalanced setting, the reference-based and combination features reached accuracies of 0.71 and 0.70, respectively, slightly lower than the best result in the balanced setting. Generally, the imbalanced setting produced lower performance in all metrics than the balanced setting. Considering the overall performance, the proposed classification approach is less effective for determining paper acceptance, even though it reached reasonable accuracies of more than 0.70.
In the good-poor task, the highest accuracy was 0.75, achieved in the balanced setting using the combination features with any of the three FS methods: Chi2 (55 features), SFS Forward (45 features), or RFE (21 features). Although all FS methods in this setting showed similar accuracies, Chi2 was slightly better than the others, with a recall of 0.94. In the imbalanced setting, the achieved accuracy of 0.74 was slightly lower than in the balanced setting. However, the performance metrics in the imbalanced setting were generally better than those in the balanced setting. For example, the minimum accuracy, recall, and f1 in the imbalanced setting are 0.72, 0.92, and 0.82, respectively, while in the balanced setting they are 0.62, 0.66, and 0.71, respectively. Additionally, the imbalanced setting required fewer than 10 features for most settings, and only a single feature (using Chi2 applied to the reference-based and combination features), to achieve reasonable accuracies of 0.72 in several settings.
Focusing on obtaining as many positive instances as possible through recall can provide a broader performance measurement. In the accepted-rejected task, the best recalls in the imbalanced and balanced settings were 0.37 and 0.63, respectively, which were considered ineffective. In the good-poor task, the highest recall was 0.99, obtained using the citing sentence predictor with all FS methods in the imbalanced setting. Interestingly, this recall was achieved using fewer than 10 features: 8 features (Chi2), 8 features (SFS Forward), and 7 features (RFE). Conversely, in the balanced setting, the best recall was 0.94, achieved using the combination feature and Chi2. Notably, the balanced setting was consistent, since the same experimental configuration produced the best results for both accuracy and recall. Altogether, these performances show that the citation functions are quite representative in predicting whether the quality of a manuscript is good or poor.
The impact of citation functions in the classification tasks is analyzed through the following two aspects: the classification performances and the number of features needed to achieve the best performance. The impact of citation functions is more dominant in the good-poor task than in the accepted-rejected task, particularly in the imbalanced scenario. For example, the best recalls were obtained using the citation functions-based predictors, with 0.99 (citing sentences predictor) and 0.98 (regular sentences predictor). As mentioned above, attaining the highest possible recall is important for retrieving as many good papers as possible, which is more reasonable and applicable for assisting the editor in filtering the submitted manuscripts. Additionally, this highest recall was obtained with the fewest features, 7, when combining the citing sentences predictor with RFE.
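The metrics discussed above can be illustrated with a minimal sketch that derives accuracy, precision, recall, and f1 from confusion-matrix counts. The function name and the counts are hypothetical and chosen only to show how a recall of 0.99 can coexist with a moderate accuracy; they are not the paper's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and f1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (assumption, not from the paper): 99 of 100 good papers
# are retrieved, while 30 of 40 poor papers are also predicted as good.
example = classification_metrics(tp=99, fp=30, tn=10, fn=1)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

In a screening scenario of this kind, a high recall means that almost no good paper is filtered out, at the cost of some poor papers still reaching the reviewers.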

1) ANALYSIS OF THE MOST IMPORTANT FEATURES OF CLASSIFICATION EXPERIMENTS
This section reports the analysis of the selected features obtained using the FS methods, particularly the top 10 most important features adopted by the combination predictor (this predictor achieved the best performances in both prediction tasks). The most important features presented here encompass both imbalanced and balanced settings, with 60 selected features in each prediction task. Tables 9 and 10 show the distribution of selected features categorized by predictor and by coarse label of citation functions, respectively. The distributions in these two tables are derived from Tables 11 and 12, which show the detailed selected features in the accepted-rejected and good-poor tasks, respectively.
Notably, the top 10 most important features were dominated by features belonging to the regular sentence predictor, which had the highest frequencies of 28 and 26 in the accepted-rejected and good-poor tasks, respectively. These results are strongly influenced by the fact that this predictor has the highest number of instances compared with the other predictors (see Table 5). The second-highest frequency was obtained by features belonging to the reference-based predictor, with a frequency of 20 in both prediction tasks. The citing sentence predictor had the lowest frequencies, 12 and 14 in the accepted-rejected and good-poor tasks, respectively.
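The frequency counts above can be reproduced with a simple tally. The following is a minimal sketch, assuming that the 60 selected features per task come from the top 10 lists of three FS methods under two settings (3 x 2 x 10 = 60), and that each selected feature is stored as a (predictor, feature) pair. The function name, the data layout, and the toy lists are hypothetical; only the feature names are taken from the paper.

```python
from collections import Counter


def tally_by_predictor(top_lists):
    """Count how often each predictor appears across several top-k feature lists."""
    counts = Counter()
    for top in top_lists:
        for predictor, _feature in top:
            counts[predictor] += 1
    return counts


# Toy input (assumption): two runs with top-3 lists instead of six runs with
# top-10 lists, to keep the example short.
runs = [
    [("regular", "citing_paper_use"),
     ("reference", "num_ref"),
     ("citing", "citing_paper_future")],
    [("regular", "citing_paper_dominant"),
     ("regular", "citing_paper_use"),
     ("reference", "num_ref_3years")],
]

counts = tally_by_predictor(runs)
for predictor, freq in counts.most_common():
    print(predictor, freq)
```

With the real lists, the three counts per task would sum to 60, as the reported frequencies do (28 + 20 + 12 and 26 + 20 + 14).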
Further investigation of the top 10 most important features reveals other notable findings. The highest frequency is shown by fine-grained features belonging to citing paper work, with 17 and 14 in the accepted-rejected and good-poor tasks, respectively. These fine-grained features were citing_paper_use, citing_paper_future, citing_paper_dominant, and citing_paper_corroboration. The second-highest frequency belonged to the number of citing sentences or number of regular sentences, with 8 and 12 in the accepted-rejected and good-poor tasks, respectively. A slightly lower frequency is shown by background, with 7 and 8 in the accepted-rejected and good-poor tasks, respectively. Fine-grained features belonging to cited paper had only low frequencies, and those related to compare and contrast had zero frequency. The zero frequency of compare and contrast is caused by its low instance distribution in the dataset. Notably, citing_paper_dominant had high frequencies, although it has few instances in the dataset (see Figure 2.1 and Figure 2.2).
Among the features of the reference-based predictor, the highest frequencies were obtained by the generic reference group containing two features, i.e., num_ref and num_ref_3years, with values of 8 and 10 in the accepted-rejected and good-poor tasks, respectively. The features belonging to the conference venue had the same or slightly lower frequencies, with 8 and 6 in the accepted-rejected and good-poor tasks, respectively. The journal venue had a low frequency of 4 in both prediction tasks; however, the preprint (arXiv) had zero frequency despite its significant instance distribution in the dataset (see Figure 2.3).
Another interesting finding in our experiments is that the citation functions-based predictors (citing and regular sentence predictors) are more influential than the reference-based predictor. Two experimental results support this observation. First, features belonging to the regular sentences predictor are the most numerous in the experiment using the combination predictor in both prediction tasks (Table 9). This trend implies that this predictor contributes more to the prediction results. Second, using only a few features, the citing sentences predictor obtained the highest recall in the good-poor task. This result is one of the most important findings, since obtaining as many good papers as possible is crucial in the review process. Finally, although the reference-based predictor reached slightly higher accuracy in the accepted-rejected task under the imbalanced setting, its accuracy was the same as or lower than that of the citation functions-based predictors in the balanced setting for the same task and in both settings for the good-poor task. Altogether, the reference-based predictor still contributes to forming the combination predictor, although the citation functions-based predictors have more impact in obtaining the best results.

2) ANALYSIS TOWARD THE REAL REVIEW SCORES OF CLASSIFICATION EXPERIMENTS
It is worth discussing why our models were more effective in the good-poor task than in the accepted-rejected task. Accordingly, we depict the review scores of ICLR 2017-2020 in Figure 4 and the mean and variance of review scores of the best results in both classification tasks in Figure 5. The boundaries between TP (True Positive) versus TN (True Negative) and FP (False Positive) versus FN (False Negative) in the mean of review scores are clearly separated in the good-poor task but unclear in the accepted-rejected task. However, the two classification tasks show a similar pattern in the distribution of variances. The only prominent difference is that TP contains the most papers in the good-poor task, whereas TN contains the most papers in the accepted-rejected task. This variation occurs because the achieved recall in the good-poor task is greater than in the accepted-rejected task. In summary, our proposed classification features are more effective at categorizing whether a paper is good or poor than at predicting its acceptance.
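The grouping behind this error analysis can be sketched as follows. This is a minimal illustration, assuming that each paper carries a true label, a predicted label, and a list of reviewer scores, and that the mean and variance are computed per paper before being summarized per outcome; the exact procedure used for Figure 5 is not stated in this section. All names and the toy records are hypothetical.

```python
from statistics import mean, pvariance


def outcome(y_true, y_pred):
    """Map a (true label, predicted label) pair to TP, FN, FP, or TN."""
    if y_true == 1:
        return "TP" if y_pred == 1 else "FN"
    return "FP" if y_pred == 1 else "TN"


def score_stats_by_outcome(records):
    """records: iterable of (y_true, y_pred, review_scores) tuples.

    Returns, per outcome, the number of papers, the mean of per-paper mean
    scores, and the mean of per-paper (population) score variances.
    """
    groups = {}
    for y_true, y_pred, scores in records:
        groups.setdefault(outcome(y_true, y_pred), []).append(scores)
    stats = {}
    for name, papers in groups.items():
        stats[name] = {
            "n": len(papers),
            "mean_of_means": mean(mean(s) for s in papers),
            "mean_of_variances": mean(pvariance(s) for s in papers),
        }
    return stats


# Toy records (assumption): label 1 = good, label 0 = poor.
records = [
    (1, 1, [7, 8, 6]),  # TP
    (1, 0, [6, 6, 7]),  # FN
    (0, 1, [5, 4, 6]),  # FP
    (0, 0, [3, 4, 2]),  # TN
]
stats = score_stats_by_outcome(records)
```

A clear gap between the mean scores of the TP and TN groups, as in the toy data, corresponds to the well-separated boundaries described for the good-poor task.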

3) THE MEANING SHIFT OF REGULAR SENTENCE PREDICTOR
Since the attributes of the citing sentence predictor are designed for citing sentences, their suitability for regular sentences must be checked. The check is performed by randomly selecting 1,000 samples from the labeled sentences and evaluating the label of each sentence. This procedure reveals that, while the meanings of several labels shifted, other labels remain consistent with the original definition adopted from the citing sentence. The shift occurs because the ML models struggle to find clear indications of whether a regular sentence describes the citing paper or a cited paper. For example, the coarse label background does not experience a meaning shift, unlike the coarse label compare and contrast, which mainly discusses the similarities and differences between the citing paper and the cited paper. Although the meanings of several attributes shifted, they still retain the same idea as the original attributes. Table 13 presents a detailed explanation of this phenomenon.
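The sampling step of this check can be sketched as follows. This is a minimal illustration, assuming that the labeled regular sentences are available as (sentence, label) pairs and that a human judge marks each sampled label as shifted or not. The function names, the seed, and the toy corpus are hypothetical; only the sample size of 1,000 and the two coarse labels come from the paper.

```python
import random


def sample_for_check(labeled_sentences, k=1000, seed=0):
    """Draw a reproducible random sample of k items without replacement."""
    rng = random.Random(seed)
    k = min(k, len(labeled_sentences))
    return rng.sample(labeled_sentences, k)


def shift_rate_by_label(judged):
    """judged: iterable of (label, is_shifted) pairs from the manual check.

    Returns the fraction of checked sentences per label whose meaning shifted.
    """
    totals, shifted = {}, {}
    for label, is_shifted in judged:
        totals[label] = totals.get(label, 0) + 1
        shifted[label] = shifted.get(label, 0) + (1 if is_shifted else 0)
    return {label: shifted[label] / totals[label] for label in totals}


# Toy corpus (assumption): 5,000 labeled regular sentences.
corpus = [
    (f"sentence {i}", "background" if i % 2 == 0 else "compare_and_contrast")
    for i in range(5000)
]
picked = sample_for_check(corpus)

# Toy manual judgments (assumption), not the paper's findings.
judged = [
    ("background", False),
    ("background", False),
    ("compare_and_contrast", True),
    ("compare_and_contrast", False),
]
rates = shift_rate_by_label(judged)
```

Fixing the seed keeps the sample identical across runs, so the manual judgments can be revisited later against the same sentences.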

4) PERFORMANCE COMPARISON OF CLASSIFICATION EXPERIMENTS IN THIS PAPER WITH PREVIOUS WORKS
Here, the performance comparison cannot be conducted on the same dataset. This is because there is no single standard benchmark dataset that provides the final review decision and review scores as comprehensively as ICLR. For example, some works use datasets built from arXiv that support only the prediction of the final review decision with two classes: accepted vs probably-rejected. Since directly predicting the final decision is problematic, we propose predicting not only the final decision but also the paper quality and review scores. Therefore, the comparison in our paper is presented to show that the performance of our method is competitive with previous works even though it does not use the reviewers' comments.
Generally, several existing works used accuracy as the only performance metric. Two studies employed alternative metrics, such as [31] using the f1, and [32] which employed the AUC. The other three studies employed more than one metric such as [27] which used accuracy, precision, recall, and f1, [28] which used accuracy and AUC, and [23] which used accuracy, recall, and f1. Here, we applied five metrics, i.e., accuracy, precision, recall, f1, and AUC (see Tables 7 and 8). Table 14 shows the detailed comparison.
The best performance was achieved by [30], with an accuracy of 0.85 on a relatively small ICLR 2017 dataset. However, these results have the following limitations. First, no other metrics were used to show the performance under imbalanced situations. Second, accuracy is biased toward the majority class. Third, since this work applied pre-defined (handcrafted) features, the results are less insightful for helping the peer-review process. Other promising results were those of [27] and [28], which achieved accuracies of 0.83 and 0.81, respectively. These two works used the arXiv dataset proposed by [22], in which paper acceptance is determined using two labels, i.e., accepted or "probably-rejected." Therefore, there is an issue regarding the confidence level of the achieved accuracies. Several works obtained other competitive results, with accuracies of more than 0.75. However, most of these studies used part of the review results as classification features. This approach is considered unfair since the acceptance prediction should be based on the manuscript. Our work achieved an accuracy of 0.73. Considering the abovementioned issues, this result is competitive, since our model was developed using 15 classification features taken from the paper manuscript. From the perspective of paper quality, the good-poor task achieved a best accuracy of 0.75, which is slightly better than our best accuracy in the accepted-rejected task. Moreover, the good-poor task obtained a high recall of 0.94 and a competitive f1 of 0.84 using the same experimental setting.
Another interesting comparison can be made between our study and that of [15], since both developed a predictor based on a labeling scheme of the author's intentions to predict the paper quality. The difference is that while [15] used the author's intentions in the Related Work section, which may cover both citing and regular sentences, our study used the author's intentions through citation functions represented by citing sentences in the entire paper. Although the comparison cannot be performed directly because of the differences in the dataset and the target classes, the labeling scheme of citation functions (citing sentence predictor) used here achieved a best accuracy and recall of 0.72 and 0.99, respectively, in the good-poor task, whereas [15] reported a best accuracy of 0.7 in the poor-average-good task. These findings suggest that our citation functions labeling scheme is more effective than the intention labels proposed in [15]. Additionally, they suggest that covering the author's intentions across the entire paper, rather than only in the Related Work section, is crucial for assessing the paper's quality.

B. PERFORMANCE OF REGRESSION TASKS
This section presents the regression task experiment results for predicting the average review score (Table 15), the individual review score (Table 16), and the top 10 most influential features in both regression tasks (Table 17). The experiments show that the combination predictor achieved the best performances in both regression tasks, with the lowest RMSE and MAE results. For example, in the average review score prediction, the lowest RMSE was 1.34, which was reached by RFR, GBR, and XGBR. RFR and XGBR also achieved the lowest MAE of 1.07. DTR's best results required only a single feature in this regression task.
Conversely, the overall performances were worse in the individual review score prediction than in the average review score prediction. The best results in the individual review score prediction were 1.71 for RMSE and 1.38 for MAE. These results were produced by combining the combination predictor with RFR for RMSE and with SVR for MAE. Interestingly, all the best performances demonstrated by DTR required only a single feature, as in the average review score prediction task.
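For reference, the two error measures reported above can be computed as in the following minimal sketch. The function names and the toy scores are hypothetical and are not the paper's data.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error between true and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean Absolute Error between true and predicted scores."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Toy review scores (assumption): a predictor that always outputs 5.0.
true_scores = [6.0, 4.0, 7.0, 3.0]
pred_scores = [5.0, 5.0, 5.0, 5.0]

print(f"RMSE: {rmse(true_scores, pred_scores):.2f}")
print(f"MAE:  {mae(true_scores, pred_scores):.2f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and penalizes large deviations more heavily, which is consistent with every reported RMSE exceeding its corresponding MAE.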
The impact of a predictor on the regression performances can be explained by comparing the performances (RMSE, MAE) and the number of features needed to obtain the best results. The citation functions-based predictors (citing sentence and regular sentence predictors) obtained slightly lower performances than the reference-based and combination predictors in both the average and individual score prediction. However, the citation functions-based predictors require fewer features to achieve their best performances.
Several observations are worth noting. First, the features representing the number of instances belonging to each feature or predictor were the most important in each predictor. For example, the rank-1 features were the number of citing sentences and the number of regular sentences in the citing sentence predictor and the regular sentence predictor, respectively. Furthermore, the reference-based predictor and the combination predictor shared the same rank-1 feature, num_ref_3years. Second, in the combination predictor, the rank-1, rank-2, and rank-3 features were the rank-1 features of the reference-based predictor, the citing sentence predictor, and the regular sentence predictor, respectively. This trend shows a consistent contribution of these rank-1 features in the regression tasks. Third, the feature citing_paper_dominant was in the top 10 most important features in the citing sentence and regular sentence predictors, although the feature's distribution in the dataset is minimal. This trend corresponds with the phenomenon that occurs in the classification experiments.
Furthermore, evaluating the impact of features on achieving the best performance with the combination predictor shows that the features belonging to the citation functions-based predictors dominated the distribution. Specifically, the numbers of features from the citing sentence predictor, regular sentence predictor, and reference-based predictor in the top 10 most important selected features are 4, 4, and 2, respectively. Therefore, as previously mentioned for the classification tasks, the reference-based predictor contributes less to achieving the best performances when using the combination predictor.
We compare the best results of the regression tasks in this paper with those of existing studies. Note that the comparison cannot be performed with all previous studies since most focused on predicting the aspect review scores (based on review comments) rather than the final review score. Therefore, the comparison can only be performed with the regression results from [23], which were developed based on review comments and achieved a best RMSE and MAE of 1.28 and 1.05, respectively. These errors are slightly lower, and thus slightly better, than ours. However, our best performances (RMSE: 1.34, MAE: 1.07) are considered competitive since the regression method was developed based on the paper without review comments.

VI. CONCLUSION AND FUTURE WORK
This paper developed a method for predicting paper quality to reduce the review burden that depends only on features extracted from the paper. This method is intended to handle the drawbacks of most existing studies involving the review comments for making the prediction. Our prediction method encompasses three tasks where two are classification tasks, and the other is a regression task. The classification tasks primarily predict the paper quality to judge whether the submitted manuscripts are good or poor; however, the task of predicting the final review decision of accepted or rejected is also included for comparison purposes. Conversely, the regression task can predict the average and individual review scores.
Furthermore, the experiments on the classification tasks demonstrate notable findings. First, predicting the paper quality based on the good-poor task is more effective than the accepted-rejected task. This is supported by the achieved performances and by the error analysis, which showed that the boundaries between TP and TN and between FP and FN are clearly separated in the good-poor task but unclear in the accepted-rejected task. Second, the citing sentences predictor obtained a satisfactory performance, with a recall of 0.99 in the good-poor task. This result supports our hypothesis concerning the crucial role of citation functions in the manuscript.
Regarding the regression experiments on the average and individual review scores, the combination predictor demonstrated its superiority over the other predictors. However, the citing sentence predictor showed a competitive performance using fewer features. These results increase our confidence in predicting the review scores by relying only on the paper.
Several points remain to be addressed in further developments. First, it is worth applying our method to other domains, e.g., broader CS and medicine, among others. Second, we intend to explore further the use of citation functions to predict the review aspect scores (clarity, originality, impact, etc.) and the review score, which the assigned reviewers determine. With these steps, we hope to be one step closer to incorporating TAPR into the entire peer-review process.
Besides the benefits of using the proposed methods for TAPR, we identified several limitations. First, the proposed method may promote a specific style of paper writing aimed at convincing the automatic prediction system rather than producing articles of sufficient quality. Second, since the citation functions are based on the Computer Science domain, the prediction method for paper quality only works for that domain. Third, the Feature Selection techniques used to analyze the top 10 most important features for predicting the paper quality cannot explain why these features were selected. These issues pose new challenges for our future research in this domain.