Integration of Prediction Scores From Various Automated Essay Scoring Models Using Item Response Theory

In automated essay scoring (AES), essays are automatically graded without human raters. Many AES models based on various manually designed features or various architectures of deep neural networks (DNNs) have been proposed over the past few decades. Each AES model has unique advantages and characteristics. Therefore, rather than using a single-AES model, appropriate integration of predictions from various AES models is expected to achieve higher scoring accuracy. In this article, we propose a method that uses item response theory to integrate prediction scores from various AES models while taking into account differences in the characteristics of scoring behavior among models. It is found that the proposed method achieves higher accuracy than that of individual AES models and conventional score-integration methods. Furthermore, the proposed method facilitates interpreting each AES model's scoring characteristics and score-integration mechanism.


Integration of Prediction Scores From Various Automated Essay Scoring Models Using Item
Response Theory Masaki Uto , Itsuki Aomi , Emiko Tsutsumi , and Maomi Ueno , Member, IEEE Abstract-In automated essay scoring (AES), essays are automatically graded without human raters.Many AES models based on various manually designed features or various architectures of deep neural networks (DNNs) have been proposed over the past few decades.Each AES model has unique advantages and characteristics.Therefore, rather than using a single-AES model, appropriate integration of predictions from various AES models is expected to achieve higher scoring accuracy.In this article, we propose a method that uses item response theory to integrate prediction scores from various AES models while taking into account differences in the characteristics of scoring behavior among models.It is found that the proposed method achieves higher accuracy than that of individual AES models and conventional score-integration methods.Furthermore, the proposed method facilitates interpreting each AES model's scoring characteristics and score-integration mechanism.

I. INTRODUCTION
E SSAY-WRITING tests have been used in various as- sessment situations to measure examinees' practical and higher-order abilities, including logical thinking, critical reasoning, and creative thinking [1], [2], [3], [4], [5].Essay-writing tests require grading by human raters for essays that are written by examinees concerning a given topic.However, essay grading is an expensive and time-consuming task, especially for large-scale tests [5], [6].To resolve this problem, various studies have examined automated essay scoring (AES), in which natural language processing (NLP) and machine learning are used to grade essays automatically as an alternative to human grading.AES is also important in the context of teaching writing in an educational setting.To efficiently cultivate students' writing skills, immediate and accurate feedback on their writing is required [7], [8].Particularly, the accuracy of the feedback is critical because erroneous feedback might lead to misconceptions and biases in students' knowledge and understanding [7].Furthermore, many formative feedback systems and analytical scoring systems for students' writing, which are tools for supporting writing education, are strongly related to AES technologies (e.g., [8], [9], [10], [11], [12]).These facts indicate that realizing accurate AES is crucial for both teaching and assessing writing in an educational setting.
Feature-engineering approach models are advantageous in terms of interpretability and explainability, but they generally require careful feature design and selection to achieve high accuracy.The automatic feature extraction approach has become popular to eliminate the need for feature engineering.
These DNN-AES models predict a score from the sequence of words in the essay, meaning that no manually designed features are required.However, some recent studies have proposed hybrid models that incorporate manually designed features into DNN-AES models [10], [25], [42], [44], [46] and have reported that such a hybrid approach is effective for improving scoring accuracy.
Such conventional AES models have different characteristics of scoring behavior because each model employs different features or different DNN architectures.Therefore, rather than using a single-AES model, integrating predictions from various AES models appropriately is expected to achieve higher scoring accuracy.A simple score-integration strategy is to calculate the average scores or majority vote scores.However, such simple methods might be inaccurate because they ignore differences in characteristics of scoring behavior among AES models.Another score-integration strategy is stacking, a popular ensemble learning approach [47].A stacking-based AES can be designed as a supervised regression model, such as linear regression, support vector regression (SVR), and regression tree, which receives multiple AES scores as input and outputs integrated scores.However, the stacking method using such popular regression models has the following drawbacks.
1) Those regression models are necessarily not efficient in modeling the scoring behaviors of human raters and AES models because they are not specialized in the essay scoring domain.The inefficient modeling may prevent maximizing the scoring accuracy.2) Those models do not provide a clear meaning for their parameters, making it difficult to analyze the scoring behaviors of human raters and AES models in detail.The lack of this interpretability hinders our understanding of the score-integration mechanism and the characteristics of individual AES models.To resolve these problems, this article proposes a method to integrate scores from various AES models using item response theory (IRT) models incorporating raters' characteristic parameters [48], [49], [50], [51], [52], [53], [54], [55], [56], [57].Those IRT models have long been studied in the educational measurement field to realize accurate scoring while taking into account differences in rater characteristics, such as severity and consistency.Those models have been applied to various performance assessments, including essay writing tests, and have demonstrated their effectiveness in realizing accurate scoring and detailed analysis of rater characteristics [56], [57], [58], [59], [60].This study applies such IRT models by regarding AES models as human raters to obtain integrated essay scores.Our experiments using an AES benchmark dataset demonstrate the effectiveness of the proposed method.
Note that another AES method that uses IRT incorporating rater parameters was recently proposed [61], [62].However, the objective of that study was to obtain accurate gold-standard scores for essays, which are then used for AES model training.Gold-standard scores for training data are generally created by sharing the essay grading task among many human raters, although scores from some raters may be inaccurate and unreliable [63], [64], [65], [66].Thus, that study proposed the use of IRT to remove the effects of such unreliable raters from training data, indicating that the objectives and method are completely different from those in our study.
It should also be noted that, although this study focuses on AES, the proposed method can also be used for automated shortanswer grading (ASAG) and other text-scoring tasks, for which many models with different characteristics have been developed.For example, there are many ASAG models [67], [68]; some are similar to AES models, but others are different.Representative models similar to AES models include c-rater [69], which is a representative feature-engineering approach model, and CNN-RNN-based and BERT-based DNN models [70], [71], [72], [73].Major differences between AES and ASAG models are 1) the importance of coherence is often emphasized in AES but not necessarily in ASAG and 2) reference answers are often used for ASAG [74], [75], [76] but not for AES.Although some similar models and task-dependent models exist for different scoring tasks, as explained above, the proposed method is applicable to those tasks for which many different scoring models exist.

II. RESEARCH CHALLENGES
This study provides a theoretical contribution beyond the simple engineering application of the improved essay scoring system.The purpose of this study is to clarify the effectiveness of IRT models with rater characteristic parameters in order to integrate predictions from various AES models.To this end, we present the following two research challenges.
1) The proposed IRT-based score-integration method can integrate scores from various AES models while considering the characteristics of their scoring behavior, which are parameterized appropriately based on extensive research in the educational measurement domain.Owing to the sophisticated modeling of scoring behavior, the proposed method is expected to provide higher scoring accuracy compared with individual AES models and other score-integration methods, including the general stacking method.Accordingly, our first research challenge is to examine how effectively the proposed method improves scoring accuracy.2) IRT models provide explicit meaning for the model parameters, helping us to understand the score-integration mechanism and the characteristics of individual AES models.Thus, our second research challenge is to show how to analyze the score integration mechanism and the characteristics of AES models based on the proposed method.Although IRT models that incorporate rater-characteristic parameters have been widely used in various educational assessment studies (e.g., [49], [50], [51], [52], [53], [54], [55], [56], [57]), no previous study has used such IRT models to integrate predictions from various AES models.Therefore, it remains unclear how those IRT models might be applied to realize AES integration and how the method would be beneficial.The fact that our study answers these questions confirms its theoretical contribution.

III. AES MODELS
This section presents a brief review of conventional AES models based on the feature-engineering and automatic feature extraction approaches.

A. Feature-Engineering Approach
In the feature-engineering approach, models predict essay scores based on textual features, which human experts must design manually.Typical features are essay length and number of grammatical and spelling errors.In this approach, such textual features are first calculated from a target essay text; then, the feature vector is input into a regression or classification model, and a score is output.
A representative model is e-rater [14], which was developed by ETS and has been used in the Test of English as a Foreign Language and the Graduate Record Examination.E-rater v.2 [15] uses 12 features and predicts essay scores based on a linear regression model with empirically determined weight parameters.The Enhanced AI Scoring Engine (EASE) [16] 1 is another model that has recently come into widespread use and achieved high performance in the Automated Student Assessment Prize (ASAP) competition on Kaggle. 2 EASE uses several types of features, including length-based features, part-of-speech-based features, prompt-relevant features, and bag-of-words-based features.There are many other models that incorporate various types of features, such as word topicality [17], bag-of-superword embedding [18], argument features (e.g., number of claims and number of supporting relations) created by argument-mining techniques [19], a sentence semantic similarity defined using a graph-based text analysis method [20], and semantic features that are specific to the Chinese language [21].
One of the most popular DNN-AES models is the CNN-RNNbased model [27].This model calculates a score for a targeted essay, which is defined as a sequence of words, through five DNN layers, namely, the lookup table layer, the convolution layer, the recurrent layer, the pooling layer, and the linear layer with sigmoid activation.See Appendix A for details on the whole architecture.There are many variants of this model, such as those employing different pooling methods in the pooling layer [28], [30], those using a different word embedding in the lookup table layer [28], and those consisting of a word-level CNN and a sentence-level CNN [29].
One limitation of the CNN-RNN-based models is that they cannot directly consider textual coherence, which represents the semantic connection and consistency of the whole text.Coherence is an important factor for determining the quality of essays.Thus, some DNN-AES models with a function to capture textual coherence explicitly have been proposed [31], [32], [33], [34], [35], [44].A representative model is the SkipFlow model [32], an extension of the CNN-RNN-based model that incorporates a neural tensor layer, which explicitly captures textual coherence.See Appendix B for details on the SkipFlow model.Another model tries to capture the coherence based on the continuity in the semantics between the adjacent two sentences [33].
While the above-introduced models used CNNs and RNNs, some recent models have used attention-based DNN architectures [37], [38], [39], [40], [41], [42].A popular attention-based DNN is a transformer network [36] that consists of stacked self-attention and fully connected layers.Transformer networks are known to capture long-distance dependence between words in a text with more accuracy than that of RNNs and CNNs.
Transformer-based DNN-AES models typically use pretrained models.A representative pretrained model is BERT [45], which was released by the Google AI Language team.BERT is pretrained on massive amounts of unlabeled text data for two tasks, called masked language modeling and next-sentence prediction.BERT can be used for various NLP tasks, including AES, by applying a fine-tuning (model retraining) based on a task-specific supervised dataset.See Appendix C for details on the BERT-based AES.The BERT-based AES also has various extensions, such as those incorporating architectures to capture the textual coherence [43], [44], those extended toward multitask learning [39], [77], and those using the DistilBERT [78], a variant of BERT [79].

C. Hybrid Approach
The feature-engineering and DNN-based automatic feature extraction approaches can be viewed as complementary rather than competing [6] because they have different advantages and drawbacks.Thus, some hybrid models that integrate the two approaches have recently been proposed [10], [25], [42], [44], [46].
Hybrid models are generally formulated as DNN-AES models incorporating manually designed features.Specifically, they concatenate a feature vector to either a predicted score of a DNN-AES model or a hidden vector obtained from an intermediate layer of a model.Then, the concatenated vector is mapped to a score value through a regression layer, such as a linear layer with sigmoid activation.The DNN-AES models used in the hybrid models include variants of the CNN-RNNbased model [10], [25] and the BERT-based model [42].As an example, Appendix C introduces details on the BERT-based hybrid AES model.
Other hybrid models [44], [46] consist of two types of DNNs.One processes word sequences in the same way as the conventional DNN-AES model, and the other processes manually designed features.

IV. ITEM RESPONSE THEORY
The conventional AES models discussed above have different scoring behaviors because they employ different features or different DNN architectures.The purpose of this study was to integrate prediction scores from various AES models using IRT while considering differences in the characteristics of their scoring behaviors.
IRT [80] is a test theory based on mathematical models.IRT uses probabilistic models, called IRT models, to estimate examinees' abilities from testing data, which generally consist of binary or polytomous scores that the examinees received on test items.IRT offers the following benefits.
1) Examinee ability can be estimated in the context of test item characteristics, including item difficulty and discrimination.2) Abilities of examinees who take different tests can be estimated on the same scale.3) Missing data can be applied easily.Traditional IRT models are applicable to data consisting of scores that examinees receive on test items.Examples include the Rasch model, the two-parameter logistic model, the graded response model [81], and the generalized partial credit model [82].However, this study applied IRT to other data consisting of scores for each examinee's essay provided by multiple raters, including human raters and AES models.IRT models incorporating rater characteristic parameters can be applied to such data [49], [50], [51], [52], [53], [54], [55], [56], [57].
The most popular model is the many-facet Rasch model (MFRM) [51].The MFRM, however, relies on some strong assumptions that do not hold in practice.Various extensions of the models have been proposed to relax these assumptions, including hierarchical rater models [52], [53], rater bundle models [54], and trifactor models [55].The present study employs one of the latest extension models, which is called generalized MFRM (GMFRM) [57].

A. Generalized Many-Facet Rasch Model
In the GMFRM, the probability that rater r assigns score k to the essay of examinee j for test item i (which means an essay task or a prompt) is defined as ) where θ j represents the latent ability of examinee j, α i and β i represent the respective discrimination power and difficulty of item i, α r and β r represent the respective consistency and severity of rater r, d rm represents the severity of rater r against rating category m, and K indicates the number of categories.D = 1.7 is the scaling constant used to minimize the difference between the normal and logistic distribution functions.Here, I i=1 log α i = 0, I i=1 β i = 0, d r1 = 0, and K k=2 d rk = 0 are given for model identification.
Note that this study applies the GMFRM to each item independently by removing the item parameters for the following two reasons.
1) To appropriately estimate the item parameters based on the original GMFRM while ensuring parameter linking, we require a scored essay dataset in which some examinees answered all the items [83], [84].However, almost none of the existing datasets that are used for AES studies include such examinees.Therefore, it would be very difficult to estimate the item parameter appropriately based on the existing AES datasets.2) Our objective is to estimate the integrated essay scores by using the GMFRM while considering the scoring behavior of each individual AES model, which is represented by the rater parameters in the model.Thus, the main interest in our use of IRT is the rater parameters, not the item parameters.Although the item parameters are typically a major interest when using IRT, omitting them in this study is reasonable and does not impair the main feature of the IRT models that incorporate rater parameters, for the above reasons.When the item parameters are omitted, the GMFRM equation can be rewritten as follows: In this form of GMFRM, θ j represents not only the ability of examinee j, but also a latent score of the examinee's essay estimated from multiple raters' scores, because there is only a single essay for each examinee.
1) Consistency: The degree to which a rater assigns similar ratings to essays of similar quality.2) Severity: The tendency of a rater to give consistently lower ratings is justified by the quality of the essays.3) Range restriction: The tendency to overuse a few rating categories.To show how these characteristics are represented, Fig. 1 shows item response curves (IRCs) of the GMFRM, which are drawn by plotting the probability P jrk in (2), for four raters for the parameters presented in Table I.In the figure, the horizontal axis shows the latent score θ j and the vertical axis shows the probability P jrk .These IRCs show that essays with higher θ j tend to obtain higher scores.
In the GMFRM, rater consistency is represented by α r , with lower values indicating smaller differences in response probabilities between rating categories.This can be confirmed in Fig. 1, which compares raters 1 and 2, who have different consistency levels.This figure suggests that scores given by a rater with a lower consistency parameter will be unreliable because the rater tends to assign different ratings to essays with similar qualities.
Rater severity is represented by β r .The IRC shifts to the right as this parameter value increases, indicating that raters with high β r values have a tendency to consistently assign low scores.In Fig. 1, the IRC for rater 3 with a high β r value shifts to the right overall.
The GMFRM represents the range restriction characteristic as d rm .The closer d r(m+1) and d rm become, the lower the overall probability of responding with category m.Conversely, the higher the difference d r(m+1) − d rm becomes, the higher the Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.The GMFRM can estimate latent scores θ j while taking into account differences in these characteristics among raters while earlier popular IRT models with rater parameters, including MFRM, cannot consider all the above rater characteristics simultaneously.Thus, this model can realize an accurate score estimation compared with the other IRT models when raters with various characteristics exist [57], [58].This is why we chose the GMFRM.

V. PROPOSED METHOD
This study proposes a method to integrate scores from various AES models using the GMFRM.Specifically, the proposed method applies the GMFRM by regarding AES models as human raters.The proposed method consists of three steps, namely, AES model training, IRT parameter estimation, and integrated score prediction.The detailed procedure for integrating these steps is as follows.
1) AES model training: First, we train multiple AES models individually using training data consisting of essays with gold-standard human scores.This is the same as the procedure required to train any conventional AES model.2) IRT parameter estimation: This step estimates the characteristic parameters of the AES models based on the GMFRM.The parameter estimation is conducted using another dataset consisting of essays with gold-standard human scores, such as development data.The detailed procedure for this is as follows: a) Generate prediction scores for essays in the data using each trained AES model.b) Estimate the GMFRM parameters using those AES scores and the gold-standard human scores.Fig. 2 illustrates the outline of this procedure.Through this procedure, we can obtain the GMFRM parameters for the AES models and the human rater who created gold-standard scores.This study uses a Bayesian estimation based on a Markov-chain Fig. 2. Outline of IRT parameter estimation in the proposed method.Note that X r,j represents the score for the essay of the jth examinee provided by the rth AES model or human (where r = 0 represents the human rater and r ≥ 1 corresponds to AES models).
Monte Carlo (MCMC) algorithm for the IRT parameter estimation, as we detail in Section VI-B.3) Integrated score prediction: Using the trained AES models and their GMFRM parameters, this step predicts integrated scores for new essays.The outline of this procedure is illustrated in Fig. 3.As shown in the figure, we first generate prediction scores for the essays from the trained AES models individually.Then, the predicted scores are used to estimate the latent score θ j for each essay based on the GMFRM.In this estimation, characteristic parameters of the AES models are given.Finally, the estimated latent scores θ j are projected to an original rating scale on human rater criteria.Specifically, letting r = 0 be the human rater, the rescaled score y j , which corresponds to the expected value of the human rater's score, is calculated as follows: Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.Require: given θ j and human rater parameters in ξ. 12: end for the above-explained IRT parameter estimation procedure.Also, line 6 and subsequent lines correspond to the integrated score prediction procedure.
Through the above procedures, the proposed method can output scores that integrate prediction scores from multiple AES models while considering the characteristics of their scoring behavior, and the output scores are projected onto the rating scale of the human rater.
Note that we can design a similar method based on factor analysis (FA) because FA and IRT are closely related [90], [91].For example, some IRT models are known to be equivalent to some confirmatory categorical FA models with one factor [90].Major differences between them are the purpose and domain [91].The primary purposes of the IRT, which specializes in the educational and psychometric measurement domain, are scoring and test analysis, whereas the primary purpose of the FA, which is used in various contexts, is to investigate the construction of latent factors behind observed multivariate data.We used the IRT because its purpose and domain fit our study well, making the interpretation of the model parameters easy and natural.
Principal component analysis (PCA), which intends to estimate a latent factor behind observed data, would also be regarded as a similar approach to the IRT and FA.However, PCA cannot realize the score integration that we realized in the proposed method.Here, suppose we construct a PCA model using a dataset consisting of scores from multiple AES models and the gold-standard human rater, as in the IRT parameter estimation step.In that case, the constructed PCA model cannot calculate integrated scores for new essays because we have no gold-standard human scores for such essays.Furthermore, when we construct a PCA model using only scores from multiple AES models, the scale of the model scores might not be consistent with the rating scale of the gold-standard human rater.Thus, we removed the PCA from the candidate pool for the proposed method.

VI. EXPERIMENTS
In this section, we present evaluation results for the effectiveness of the proposed method based on experiments with actual data.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

A. Dataset
Our experiments used the ASAP dataset, which has been published for Kaggle competitions.This dataset, which has commonly been used in various AES studies as benchmark data, consists of essays for eight prompts, written by students from grades 7 to 10.Each essay has one gold-standard score from a human rater.Scores are provided based on ordered categories with different value ranges.Each prompt corresponds to one of the three essay types: 1) argumentative; 2) source-dependent; and 3) narrative [92].The argumentative type asks students to discuss and justify their opinion on a specific topic, whereas the source-dependent type asks students to respond to a question about a given text.The narrative type asks students to narrate a story about a specific topic.See Table II for detailed statistics and types for each prompt.
The AES models are generally trained and evaluated for each prompt individually in many AES studies, so our experiments also follow this procedure.

B. Setup
Using the ASAP dataset, we evaluated scoring accuracy in each prompt based on fivefold cross-validation.In each partition, 60% of the data were used as the training data, 20% as the development data, and 20% as the test data.As the evaluation metric, we used the quadratic-weighted Kappa (QWK), which is a metric showing agreement between predicted scores and ground truth.The QWK is the common evaluation metric in the ASAP competition.
Our experiments used the following six AES models.1) EASE (SVR), EASE (BLRR): As a recently introduced popular feature-engineering approach model, we used the EASE model described in Section III-A.EASE typically uses Bayesian linear ridge regression (BLRR) and SVR as the regression models.We thus examined both variants of EASE.We implemented the models using scikit-learn [93] following the method in [16].2) XGBoost: As another feature-engineering approach model, we used an XGBoost model with the manually designed features proposed in [25].A unique characteristic of this model is the use of parse-tree-based features, which are not used in EASE.We used CoreNLP [94] to generate parse trees and implemented the XGBoost model following the method in [95].
3) RNN: As the most traditional DNN-based automatic feature extraction approach model, we used the CNN-RNNbased model detailed in Appendix A. Note that we omitted the optional convolution layer in our implementation.We implemented this model using PyTorch.3 4) SkipFlow: We also used the SkipFlow model detailed in Appendix B as another DNN-AES model with different characteristics.This is the most popular model incorporating a function that directly captures textual coherence, as explained in Section III-B.We used PyTorch to implement this model.5) Hybrid-BERT: We used the fine-tuned BERT model incorporating manually designed features [42], detailed in Appendix C, as a hybrid model.We used the uncased pretrained BERT-base model and PyTorch for implementation.We tokenized the essays using the NLTK tokenizer. 4Other details, including hyperparameter settings, were the same as those used in the original studies.
We compared the proposed method with three common scoreintegration methods.
1) MEAN: Arithmetic averaging of multiple AES scores.
2) VOTING: Hard voting for multiple AES scores.
3) STACKING: We examined four stacking models using a linear regression model, a Ridge regression model, an SVR, and a boosting model.We designed these models to receive multiple AES scores as input and predict a gold-standard human score.We used the scikitlearn library to implement these models.We trained these models using the development data in the same way as in the IRT parameter estimation of the proposed method.Note that these three integration methods encompass most of the popular ensemble methods that integrate outputs from multiple models.This can be confirmed by the fact that conventional ensemble methods are commonly categorized as weightingbased methods or meta-learning methods, where the most popular weighting-based methods are majority voting and output averaging and the most popular meta-learning method is stacking [47].
We also examined some variants of the proposed method by changing the employed IRT models.As explained in Section IV-B, the GMFRM can represent the three common rater characteristics, namely, consistency, severity, and range restriction.Some GMFRM variants in which some rater parameters are restricted can be regarded as models equivalent to some earlier IRT models with rater parameters, including MFRM.For this reason, we examined some restricted versions of the GMFRM, including MFRM, as detailed as follows.
1) Consistency-fixed GMFRM: A GMFRM in which α r is restricted to 1 for all raters r ∈ R, meaning all raters share the same consistency level.This model is equivalent to the variant of MFRM shown in [48] and [86].2) Severity-fixed GMFRM: A GMFRM in which β r is restricted to 0 for all raters r ∈ R, meaning all raters share the same severity level.3) Threshold-fixed GMFRM: A GMFRM in which d rm is changed to d m for all raters r ∈ R, meaning no difference in range restriction characteristics exists among raters.4) MFRM: The most popular IRT model that incorporates rater parameters.MFRM is equivalent to a GMFRM in which α r is restricted to 1 and d rm is changed to d m for all raters.Although the severity-fixed GMFRM and the threshold-fixed GMFRM have no corresponding earlier IRT models, we examined them to investigate the effects of each rater parameter.
To estimate IRT parameters in the EstIrtP aram() function shown in Algorithm 1, we applied an expected a posteriori (EAP) estimation, a type of Bayesian estimation that is known to provide accurate estimations for complex IRT models [56], [96], using an MCMC algorithm.As the MCMC algorithm, we used the No-U-Turn sampler [97] based on the Hamiltonian Monte Carlo approach [98].The estimation program was implemented in RStan [99], [100].Following the original GMFRM paper [57], we calculated the EAP estimates using parameter samples obtained from 2000 to 4000 periods within three independent MCMC chains.Furthermore, the prior distributions were also the same as those used in [57], namely where N (μ, σ) indicates the normal distribution with mean μ and standard deviation σ.In the EstIrtScore() function of Algorithm 1, we calculated the IRT scores through an EAP estimation using the Hermite-Gauss quadrature [101], which has been widely used in various IRT studies.Specifically, given rater parameter estimates, the score estimates are calculable as where θ h is the hth integral point and H is the number of such points.We created the integration points by setting H = 40 and dividing the value range [−4, 4] with an equal interval.
In addition, L(X j , θ h ) is the likelihood conditional on θ h for X j , which consists of observed scores for the jth essay.g(θ h ) indicates the prior probability for θ h .We assumed the standard normal distribution as the prior distribution.
We used a Tesla V100-SXM2 GPU to train DNN-AES models, whereas we used an Intel Xeon 2.00 GHz CPU for training other AES models and score-integration methods, including the proposed method.

C. Results
Table III presents the experimental results, with bold text indicating maximum QWK values and underlined text representing the second-highest values for each prompt.In the table, the Avg.column shows the average QWK value for each method, and the p-value column shows the results of the one-tailed paired t-test between the proposed method using GMFRM and the other respective method.
According to Table III, the average QWK values of the individual AES models are around 0.7 in almost all models.Recent AES studies that used the ASAP dataset have generally reported average QWK values ranging from 0.7 to 0.8 [102], which are almost consistent with our results.Note that QWK values reported in different studies are not necessarily directly comparable, even when the same model and the same dataset are used, because they might employ different hyperparameter settings and methods for splitting data during cross-validation.
Comparison of the proposed method using GMFRM with the individual AES models shows that the proposed method is superior in all cases except for only one case (Hybrid-BERT in prompt 3) and shows a significantly higher average accuracy.The other score-integration methods also outperform the individual AES models in many cases, suggesting that the integration of prediction scores from various AES models is effective.
Furthermore, compared with the conventional scoreintegration methods, the proposed method with GMFRM shows higher accuracy in almost all the cases and its average accuracy is significantly higher.This result indicates the effectiveness of Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.PROMPT 1 the GMFRM-based score integration considering differences in characteristics of scoring behavior among the respective AES models.This result also suggests that the proposed method is expected to be effective with various datasets because the proposed method is superior in many prompts with different characteristics.

TABLE IV WAIC VALUES FOR THE GMFRM AND COMPARATIVE MODELS TABLE V RATER PARAMETER ESTIMATES FOR
Among the proposed methods using different IRT models, the GMFRM provided the highest average accuracy at a significance level of 0.05.Thus, it would be reasonable to use the GMFRM in general.This result also suggests that all the rater parameters in the GMFRM are effective in improving the accuracy.To further examine the effectiveness of the GMFRM, we conducted a model comparison experiment using an information criterion.As the information criterion, we used the widely applicable information criterion (WAIC) [103], which is suitable for Bayesian estimation using MCMC.The WAIC was calculated for each cross-validation set, and these values were averaged for each prompt.Table IV shows the results.We highlighted the minimum scores in the table as bold text because the model minimizing the WAIC is regarded as the optimal model.According to the results, the GMFRM shows the best performance in all cases.The results suggest that the three rater characteristics (i.e., severity, consistency, and range restriction) vary among the AES models and the human rater, and thus the GMFRM is suitable compared to the other simpler models.

D. Analyzing the Characteristics of Scoring Behavior
Besides the improvement in scoring accuracy, a unique feature of the proposed method is its high interpretability, as explained in Section I.This section provides an interpretation of the scoring characteristics of each AES model based on the rater parameter values obtained from the GMFRM.As an example, Table V shows the rater parameter estimates of the AES models and the human rater for prompt 1.Note that, here, we estimated the GMFRM parameter using predicted scores of the AES models and the gold-standard human scores for all the essays for prompt 1.The AES prediction scores for all the essays can be obtained from the fivefold cross-validation explained in the previous section.Furthermore, based on the parameter values in Table V, we illustrate the IRCs for the AES models and the human rater in Figs. 4 and 5, respectively.In the figures, the horizontal axis shows the latent score θ j , and the vertical axis shows the response probability for each category.
According to the table and figures, we can interpret the following characteristics.
1) EASE (SVR) has an extremely low severity, reflecting the strong tendency to output the highest score (k = 12) overall.2) RNN and XGBoost show relatively low probabilities for categories 3, 4, and 5, suggesting the existence of a range restriction that avoids these categories.Moreover, XGBoost has another range restriction tendency to prefer category 8 slightly.3) EASE (BLRR), SkipFlow, and Hybrid-BERT have relatively high consistency.Furthermore, in the IRCs for these models, the curves for some categories [i.e., k = 3 and 4 in EASE (BLRR) and k = 3 in SkipFlow and Hybrid-BERT] are not displayed because these probabilities are extremely low, meaning that they have an extremely strong range restriction.4) The human rater tended to prefer categories 6, 8, and 10, and rarely used categories 3, 4, and 5, indicating the existence of a strong range restriction.As explained earlier, XGBoost shows a relatively similar range restriction, indicating that XGBoost imitates the human rater most precisely in prompt 1. 5) Another interesting observation is that the AES models show higher consistency than the human rater overall.One motivation of AES research is to realize consistent scoring, and this result demonstrates that AES can achieve it.This analysis shows that the AES models have different scoring characteristics, indicating that the integration of multiple AES models considering such characteristic differences is effective.Furthermore, the above discussion shows that the human rater who created the gold-standard scores used different scoring criteria compared with the AES models.This result indicates that the projection of the GMFRM-based latent scores θ j into the human rater's rating scale is important to achieve high scoring accuracy.Fig. 4. IRCs of the AES models for prompt 1.Note that, in the EASE (BLRR), SkipFlow, and Hybrid-BERT models, the curves for some categories [k = 3 and 4 in EASE (BLRR), and k = 3 in SkipFlow and Hybrid-BERT] are not displayed because they have extremely small probabilities.Fig. 5. IRC of the human rater for prompt 1.

E. Relation Between Predicted Scores of Individual AES Models and Integrated Scores of Proposed Method
To further examine the characteristics of individual AES models, this section describes the relations between the prediction scores of individual models and the integrated scores of the proposed method.Fig. 6 illustrates the relations in prompt 1.In each figure, the horizontal axis indicates the integrated scores of the proposed method using the GMFRM, and the vertical axis indicates the predicted scores of each AES model.The size of each bubble represents the appearance frequency of each data point, where a larger bubble represents a higher frequency.
According to Fig. 6, EASE (SVR) shows an extremely different tendency compared with the other models.Specifically, EASE (SVR) tends to overuse the high scores because it is extremely lenient, as described in the previous section.Also, EASE (SVR) cannot distinguish essays with medium or above qualities due to its extreme leniency.Thus, within the middle or above score range, its prediction scores substantially differ from the integrated scores.EASE (BLRR), SkipFlow, and Hybrid-BERT tend to avoid several low-score categories, such as 2, 3, and 4, which is consistent with the fact that their IRCs represent extremely low probabilities for some of these categories, as explained above.The figures for these models also show that the proposed method can predict scores that some models do not produce at all.
XGBoost, which has characteristics most similar to those of a human rater, predicts scores that agree well with the integrated Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.scores overall.The RNN also shows a relatively high agreement with the integrated scores.However, within the middle score range, that is, 7-9, the RNN shows a slightly larger disagreement with the integrated scores than XGBoost does.The reason is that XGBoost captured the tendency that the human rater prefers category 8, as explained in the previous section, whereas the RNN could not do that properly.
The above discussions demonstrate that the proposed method calculated the integrated scores while considering characteristics of scoring behavior in each AES model and their similarity to that of the human rater.

F. Relation Between Proposed Method Effectiveness and AES Characteristic Diversity
We can expect that the effectiveness of the proposed method will increase when the characteristic difference among the individual models becomes large.This section examines this hypothesis.
For this analysis, we quantified the characteristic differences among AES models using the mean absolute differences in IRCs, which have been used for IRT equating [104].The difference metric for two AES models r and r is defined as follows: where P jrk (θ) indicates the GMFRM-based probability calculated in (2) given the ability value θ, {θ h |h ∈ {1, . . ., H}} is a collection of integration points, and H is the number of points.
We created the integration points by setting H = 40 and dividing the value range [−4, 4] with an equal interval.We calculated the distance metrics δ(r, r ) for all the pairs of AES models and for all the pairs between the human rater and the AES models.The results are shown in Table VI.
First, focusing on the results for prompt 1, we can confirm that the metric between XGBoost and the human rater, which have similar characteristics of IRCs as explained earlier, shows a small value.Furthermore, the metric values between EASE (SVR), which has an extremely different IRC, as shown in Fig. 4, and the other models tend to be high.These results suggest that this metric reflects the scoring characteristic differences appropriately.
Next, focusing on the average row in Table VI, we can confirm that average metric values are relatively large in prompts 7 and 8. Furthermore, according to Table III, the proposed method using GMFRM shows large improvements in these two prompts compared with conventional integration methods, such as MEAN and VOTE.Here, Fig. 7 shows the relation between the average δ(r, r ) values and the difference in the QWK values between the proposed method and the MEAN method.In the figure, the horizontal axis indicates the difference in the QWK values between the proposed method using GMFRM and the MEAN method, the vertical axis indicates the average values of the IRC difference metric δ(r, r ), each plot indicates the results for a prompt, and the dotted line indicates the regression line.The figure shows a strong correlation.In particular, the Pearson correlation coefficient was 0.856, and it was significant (p < 0.01).A similar result was obtained between the proposed method and the VOTE method.Specifically, the correlation  was 0.840, and it was also significant (p < 0.01).From these results, we can conclude that the effectiveness of the proposed method tends to increase with increasing differences between the characteristics of the scoring behavior among the AES models.

G. Relation Between Proposed Method Effectiveness and Prompt Characteristics
This section examines the relationship between the prompt characteristics and the proposed method effectiveness to investigate what prompt-dependent factors affect the performance of the proposed method.In this analysis, we regarded the difference in the QWK values between the proposed method and the MEAN method as the proposed method effectiveness, in the same way as the analysis of Section VI-F.Fig. 8 shows the relation between the proposed method effectiveness and the prompt-dependent factors, namely, the average essay length, the score ranges, and the essay types.In the figure, the left panel shows the relation with the average essay length, the center panel shows that with the score range, and the right panel shows that with the essay type.
The figure shows that the effectiveness of the proposed method tends to increase with increasing the essay length and the score ranges.In addition, the effectiveness of the proposed method is relatively high for the narrative-type prompts compared to the other types.These results might indicate that these prompt-dependent factors affect the performance of the proposed method.However, it should be noted that, in the ASAP dataset, the essays for the narrative-type prompts are longer and have greater numbers of score categories than those for the other prompts, which might emphasize the characteristics difference among the individual models.As discussed in Section VI-F, the characteristics difference among the individual models affects the effectiveness of the proposed method, meaning that these prompt-dependent factors might have no or small direct impact on increasing the effectiveness of the proposed method.A further analysis based on large-scale experiments with various datasets is required to investigate the factors affecting the effectiveness of the proposed method in more detail, and this task remains as future work.

H. Computational Costs
This section investigates the computational cost for each method.Specifically, we calculated the time for training each AES model and the score-integration models.We calculated the times for each partition in the fivefold cross-validation.
Table VII shows the average times.Here, the time for data preprocessing and computing manually designed features was excluded.According to Table VII, the feature-engineering approach models can be trained in less than 1 s, whereas DNN-AES models take a few minutes even though they use a GPU.The total time to complete the training of all individual AES models is about 5 to 10 min.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.The conventional score-integration methods can also be trained in less than 1 s.However, the proposed method requires up to about 20 min, which is a relatively long time compared with the others.The main reason is that we used the MCMC for the IRT parameter estimation because it is expected to provide high estimation accuracy.If the faster estimation is required, other estimation algorithms are available, including the marginal maximum likelihood estimation and the maximum a posteriori estimation using the Newton-Raphson method and EAP estimation with variational Bayesian methods.However, the total training time for the proposed method, including the time to train individual AES models, is about 30 min at most, which will generally be acceptable in practical use.
We also computed the time for scoring a new essay using each trained model.Consequently, the time to compute the score of a single essay was less than 0.1 s in all models, which is also sufficient for practical applications.

VII. CONCLUSION
In this study, we proposed a method that uses IRT to integrate prediction scores from various AES models while taking into account differences in scoring behavior characteristics.Specifically, we proposed the use of IRT incorporating rater characteristic parameters by regarding AES models as raters.We performed experiments with a benchmark dataset to demonstrate that the proposed method with the latest IRT model, GMFRM, provided significantly higher accuracy than individual AES models and conventional integration methods.We also showed that the scoring characteristics could be interpreted for each AES model based on the IRT parameters and confirmed a large characteristic variety among AES models and between a human rater and AES models.Furthermore, we demonstrated that the effectiveness of the proposed method tends to increase as this characteristic variety increases.

VIII. LIMITATIONS AND FUTURE WORK
We will evaluate the effectiveness of the proposed method by adding more distinctive models because our method is expected to improve accuracy with the addition of various AES models, as demonstrated in Section VI-E.Furthermore, although this study employed multiple models with entirely different architectures as base models, input manipulation, an ensemble learning method in which multiple models are constructed using different training subsets, as in AdaBoost and Bagging, may also be suitable for preparing multiple different models.Examining the proposed method using an input manipulation method is our future work.Another extension of the proposed method based on other meta-learning methods, such as mixture of experts, may also be a possible future direction.We also need a further analysis based on large-scale experiments with various datasets to elucidate the factors affecting the effectiveness of the proposed method, as discussed in Section VI-G.
As shown in Section VI-F, the proposed method provides information representing the characteristics of scoring behavior for AES models.Such information would be helpful not only to understand the scoring characteristics of each model but also to consider an improvement or extension of each model.We thus plan to examine how individual AES models can be improved based on the information obtained from the proposed method.
Another future direction is to consider the use of the proposed method for enhancing collaboration between AES models and human raters because the use of AES to support human raters is also a recent popular research topic [105], [106].
Moreover, in this study, we assumed that the gold-standard scores in training data are given by a single human rater.These scores, however, are often created by aggregating multiple scores given by multiple human raters, as pointed out in some previous studies [61], [63], [66].The proposed method can be easily applied to data in which each essay has multiple human rater scores.In future studies, we plan to apply the proposed method to such data and evaluate its effectiveness.
This study focused on a prompt-specific scoring task, the most common AES task, in which an AES model is trained for a prompt and the trained model is used to evaluate essays for the same prompt.Another important AES task is a cross-prompt scoring task, in which no or few training data for a target prompt exist, but data for other prompts are available.Cross-prompt scorings are often realized using domain adaptation or transfer learning techniques, which are studied widely in AI and machine learning domains.However, the number of papers dealing with this task remains limited [13].We will examine an extension of the proposed method for such tasks in future work.
Another future direction relates to the use of the IRT.An extension of the proposed method using cognitive diagnosis models (CDMs), including DINA (deterministic inputs, noisy, and gate) [107] and NeuralCD (neural cognitive diagnosis) [108] models, might contribute to improving the interpretability of the attributes (e.g., knowledge or skills) measured by the target essay writing test.However, most CDMs cannot be directly applied to our framework because they require a Q-matrix, which defines the required attributes for each test item, and it is not generally included in most existing datasets for AES.We would like to investigate the possibility of integrating CDMs into our framework in the future.
Furthermore, analyzing the differences in the item characteristics between the essay writing test items and the other types of items, such as multiple choice questions, is an important issue in the field of educational measurement, although this is outside the scope of this study.A fusion of IRT and AES, as in the proposed method, might be helpful for realizing a detailed analysis of this aspect, which is also a future work.In addition, although this study focused on the AES context, applying the proposed method to other text-scoring tasks, including ASAG, is also future work.

APPENDIX
This appendix describes the detailed architectures of the DNN-based automatic feature extraction approach models and the hybrid model, which were used in our experiments.

A. CNN-RNN-Based Model
Fig. 9 shows the architecture of the CNN-RNN-based model [27].This model calculates a score for a targeted essay, which is defined as a sequence of words {w 1 , . . ., w n } (where w t is the tth word in the essay and n is the number of words), through five DNN layers.
1) The first layer is the lookup table layer that transforms each word into a D-dimensional word-embedding representation, in which words with similar meanings have similar vector representations.
2) The second layer, the convolution layer, uses a CNN to extract N-gram-level features from the sequence of word-embedding vectors.In this layer, each word vector is transformed into another vector representation that reflects dependencies among N-adjacent words.This layer is often omitted in some extension models.
3) The third layer, the recurrent layer, uses an RNN to transform each output vector from the convolution layer into another vector representation that reflects the context of the essay.A long short-term memory network is generally used as the RNN.4) The fourth layer is a pooling layer that transforms the output vector sequence of the recurrent layer into an aggregated fixed-length hidden vector by averaging the vector sequence.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.5) The last layer, the linear layer with sigmoid activation, projects the output vector of the pooling layer to a scalar value by using a sigmoid function.

B. SkipFlow Model
The SkipFlow model [32] is a representative DNN-AES model with a function that captures textual coherence directly, as explained in Section III-B.The architecture of this model is shown in Fig. 10.The SkipFlow model consists of almost the same components as those used in the CNN-RNN-based model.Specifically, the model uses the lookup table layer, the recurrent layer, the pooling layer, and the linear layer with sigmoid activation.The main difference is the incorporation of the neural tensor layer.The neural tensor layer takes two positional outputs of the recurrent layer that are collected from different time steps and computes the similarity between each pair of positional outputs.The similarity score is regarded as a neural coherence feature between the two selected positions.The list of the similarity scores is concatenated with the pooling layer output, and the concatenated vector is used to predict an essay score.

C. BERT-Based Hybrid AES Model
Fig. 11 shows the architecture of the BERT-based hybrid AES model proposed in [42].This model concatenates manually designed essay-level features to the distributed essay representation, which is the input vector for the last linear layer in the BERT-based AES model.
Note that, as explained in Section III-B, the BERT is a transformer-based model pretrained on massive amounts of unlabeled text data for two tasks, namely, masked language modeling and next-sentence prediction.Masked language modeling involves predicting the words that are masked out of the input text, whereas the next-sequence prediction involves predicting whether two given sentences are adjacent.AES using the pretrained BERT can be realized by the following procedure.
1) Add a special classification (CLS) token to the beginning of each essay text.2) Add a linear layer with sigmoid activation over the output corresponding to this token because the BERT output for this token is known to be a distributed representation for specific input text.3) Fine-tune this BERT-based AES model using a training dataset that consists of essays and corresponding scores.To implement the BERT-based hybrid AES model, manually designed essay-level features are concatenated with the BERT output for the CLS token in step 2 of the above procedure.

Fig. 1 .
Fig. 1.IRCs of four raters for the parameters presented in Table I. response probability for category m.In Fig. 1, rater 4 has smaller d r3 − d r2 and d r5 − d r4 values and relatively larger d r4 − d r3 and d r6 − d r5 values.Thus, in the IRC, response probabilities for categories 2 and 4 decrease, whereas those for categories 3 and 5 increase, representing a range restriction characteristic with overuse of categories 3 and 5 while avoiding categories 2 and 4.The GMFRM can estimate latent scores θ j while taking into account differences in these characteristics among raters while earlier popular IRT models with rater parameters, including MFRM, cannot consider all the above rater characteristics simultaneously.Thus, this model can realize an accurate score estimation compared with the other IRT models when raters with various characteristics exist[57],[58].This is why we chose the GMFRM.

Fig. 3 . 1 :
Fig. 3. Outline of integrated score prediction in the proposed method.Note that P j0k is calculable based on(2) given the estimated latent scores θ j and the human rater's parameters calibrated in the IRT parameter estimation step.This score rescaling is required for the following two reasons.a) The latent scores θ j are estimated on the logit scale [-∞, ∞], which differs from the original categorical score ranges.b) The main goal of AES is to predict scores on the rating scale of the gold-standard human rater.Algorithm 1 shows the detailed process of the proposed method.Here, we assume that training data and development data, which are composed of essays and gold-standard human scores, are available in the training phase.Furthermore, we assume the execution of AES for essays within test data.In Algorithm 1, E train , E dev , and E test represent essays in the training data, development data, and test data, respectively.Furthermore, X train and X dev represent the gold-standard human scores in training data and development data, respectively, and R indicates the number of candidate AES models.The process in each function is as follows.1)T rainAES(r, E train , X train , E dev , X dev ) trains the rth AES model using training data (E train , X train ) and returns trained model M r .Some models may use development data (E dev , X dev ) for early stopping or hyperparameter tuning in the training phase.2) P redAES(M r , E) predicts scores for given essays E using trained model M r and returns the prediction scores X r .3) EstIrtP aram(X) runs the GMFRM parameter estimation using given score data X and returns estimated rater parameters ξ, consisting of α r , β r , and d rm for each AES model and human rater.4) EstIrtScore(j, ξ, X) computes the latent score θ j from data X based on the GMFRM with the rater parameter estimates ξ.In Algorithm 1, line 2 corresponds to the AES model training procedure explained above.Lines 3 and 5 correspond to

Fig. 6 .
Fig. 6.Relation between predicted scores of individual AES and integrated scores of the proposed method in Prompt 1.

Fig. 7 .
Fig. 7. Relation between characteristic diversity among AES models and the proposed method effectiveness.

Fig. 8 .
Fig. 8. Relation between prompt-dependent factors and the proposed method effectiveness.

Fig. 11 .
Fig. 11.Architecture of BERT-based hybrid model (F indicates a manually designed feature vector).

TABLE I PARAMETERS
FOR FOUR RATERS WITH DIFFERENT CHARACTERISTICS

TABLE VI IRC
DIFFERENCE METRIC VALUES AMONG AES MODELS AND THOSE AMONG THE HUMAN RATER AND AES MODELS

TABLE VII COMPUTATIONAL
TIMES FOR MODEL TRAINING (SECONDS)