A Systematic Review of the Effects of Automatic Scoring and Automatic Feedback in Educational Settings

Automatic scoring and feedback tools have become critical components in the proliferation of online learning. These tools range from scoring multiple-choice questions to grading essays using machine learning (ML). Learning environments such as massive open online courses (MOOCs) would not be possible without them. The use of these tools has opened many exciting areas of study, from the design of questions to the precision and accuracy of ML grading tools. This paper analyzes the findings of 125 studies published in journals and proceedings between 2016 and 2020 on the use of automatic scoring and feedback as a learning tool. This analysis gives an overview of the trends, challenges, and open questions in this research area. The results indicate that automatic scoring and feedback have many advantages. The most important benefits include scaling the number of students without adding a proportional number of instructors, improving the student experience by reducing the time between submission and the delivery of grades and feedback, and removing bias in scoring. On the other hand, these technologies have some drawbacks. The main problem is that they create a disincentive to develop innovative answers that do not match the expected one or were not considered when the problem was prepared. Another drawback is the potential to train students to answer the question instead of learning the concepts. In addition, because a correct answer exists, that answer could be leaked to the internet, making it easier for students to avoid solving the problem. Overall, each of these drawbacks presents an opportunity to improve these technologies so that they provide a better learning experience to students.


I. INTRODUCTION
Automatic scoring and feedback consist of calculating grades on students' work and providing personalized feedback using technological tools that do not require human participation [1]. These tools play a significant role in online learning. Many new learning environments, such as massive open online courses (MOOCs), would not be possible without them [2]. Automatic scoring is not new: multiple-choice tests have been available for a long time, and large-scale multiple-choice tests have been possible since the introduction of the Scantron. This tool continues to be used today [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Dongxiao Yu.
With the rapid growth of technology and internet access, the use of automatic scoring and feedback has accelerated [4], [5]. The benefits of these tools for institutions and instructors are apparent: they can increase the number of students per instructor and provide fast and consistent results [6]. However, these advantages come with potential drawbacks.
Multiple areas of study can use automatic scoring. An early implementation of immediate automatic feedback is multiple-choice questions. Generating the questions in multiple-choice format simplifies any issues related to interpreting the answers [7]. Unit testing evaluation of programming assessments was also an early entry into the area, given its simplicity of usage and implementation [8]- [10]. With the broader availability of machine learning in education [11], the field expanded to include the grading of short essays [12] and long essays [13]. It also started to address other more complex problems, such as grading code correctness [14].
With the expansion of automatic scoring and feedback as a tool, several issues have emerged. From a technical perspective, tools based on machine learning need data, potentially in substantial amounts, to be accurate [15]. From an educational perspective, authors, including Bancroft [16], affirm that automatically scored tests, for example, multiple-choice tests, "do not test anything more than just straight recall of facts." Given these potential issues, studies on automatic feedback, problem setup, and their effects on students' education and experience are still being produced [17].
This review attempts to expand the literature on the effects of using automatic scoring and feedback as a learning tool, emphasizing its impact on the students' learning experience. To achieve that goal, it focuses on these research questions:
-RQ1 What types of automatic scoring and automatic feedback are in use?
-RQ2 What are the positive effects on education goals of using automatic feedback and automatic scoring?
-RQ3 What are the positive effects on the student experience of using automatic feedback and automatic scoring?
-RQ4 What are the adverse effects on educational goals and student experience using automatic feedback and automatic scoring?
-RQ5 What type of evaluation was carried out to measure the effect of automatic scoring and feedback on student academic performance?
-RQ6 What improvements can be made to mitigate the adverse effects in RQ4?
Below, we list most tools and application fields to present the current state of automatic scoring and feedback (RQ1). We present most tools currently being used, even though some are on their way to obsolescence, to show the field's evolution. Concerning the positive effects (RQ2), we look at the experiences and opportunities these technologies have enabled, focusing on the student (RQ3). We then shift focus to the problems these tools may introduce from a learning perspective and a student experience perspective (RQ4). We also analyze the evaluations conducted to examine the effect of these tools on the academic performance of students (RQ5). Having studied the adverse effects, we look at possible improvements that mitigate the findings on RQ4 (RQ6).
The following ten sections of this paper continue with a discussion of previous studies, the process used to carry out the review, the most relevant findings, an interpretation of those findings, and potential avenues for future research.

II. RELATED WORK
Multiple papers have investigated the state of automatic scoring and feedback. These papers tend to focus on ways to use the tools and the quality of their output. Few of them concentrate on the educational effects of using the technologies. Table 1 provides an overview of some of the studies and their findings. Among their conclusions is that automatic feedback is being used more in structured questions that require well-defined answers. These questions include multiple-choice [18], fill-in-the-blank [19], or those with a solution presented in a structured language, i.e., a mathematical formula [20] or a program [21], [22]. The main positive effects of automatic feedback include the students using the feedback for improvement [23], increased student engagement [24], [25], and reduction of instructor bias [26]. Despite its benefits, automatic scoring is only one of the potential uses for machine learning and can be expanded to encompass others, including performance prediction, material curation, and course adaptability [27]. However, automatic feedback has some drawbacks, including the complexity of measuring the feedback quality compared to a manual grader [28].
This work reviews a broader set of papers than those in Table 1, focusing on examining the effect of feedback on the student experience and identifying opportunities to improve the automatic feedback mechanisms from a student experience perspective.
The previous reviews do not determine the effect of automatic scoring and feedback on students' performance. In this paper, we address this issue.

IV. PLANNING
This part of the work included creating a strategy to select the most relevant results that helped address the research questions. We performed an iterative search on Web of Science, chosen for the quality of its database and its iterative filtering capabilities [31].
We searched for the terms "'automatic scoring' AND education," "'automatic grading' AND education," "'automatic feedback' AND education," and "'machine learning' AND education." The search included only works published from 2016 to mid-2020. The first query returned 15 papers, the second 19, the third 27, and the fourth 1233. We restricted the last one by refining the search results to those containing "scoring" or "feedback." After all refinements, the combined result included 256 papers, which were then manually filtered by title using the inclusion criteria.

V. INCLUSION CRITERIA
The works selected for review met the following criteria:
1-The study focuses on the use of technologies in education.
2-The study helps answer at least one of the research questions.
3-The study was published after peer review.
After the criteria were applied, the list of works was reduced to 125.

VI. CONDUCTING THE REVIEW AND REPORTING
After completing the planning, a content analysis was carried out. As mentioned by [32], content analysis allows one to find research trends by analyzing the articles' content and grouping them according to shared characteristics. We created a collection form to code the information relevant to answering the research questions. The columns included in that form are shown in Table 2. Each paper was thoroughly reviewed by three of the authors using a shared Excel file. For the categorical values, a simple majority (2 votes) was needed for a value to be selected. For the open-ended questions, an answer was likewise kept only if it received two votes or appeared in two reviews. This analysis was used to group the papers, and the groupings were used to answer the research questions.
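The two-vote rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure the authors ran in their shared spreadsheet: the function names `majority_value` and `kept_answers`, and the example category labels, are hypothetical.

```python
from collections import Counter


def majority_value(votes, min_votes=2):
    """Return the categorical value chosen by at least `min_votes`
    reviewers, or None when no value reaches that threshold."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= min_votes else None


def kept_answers(reviewer_answers, min_votes=2):
    """Keep the open-ended answers that appear in the lists of at
    least `min_votes` different reviewers."""
    counts = Counter()
    for answers in reviewer_answers:
        # A set ensures each reviewer counts once per answer.
        counts.update(set(answers))
    return sorted(a for a, c in counts.items() if c >= min_votes)


# Hypothetical coding of one paper by three reviewers.
mechanism = majority_value(["static", "static", "dynamic"])
benefits = kept_answers([
    ["bias reduction", "scaling"],
    ["scaling"],
    ["bias reduction", "engagement"],
])
```

With three reviewers and a threshold of two, at most one categorical value can win, so no tie-breaking rule is needed in this sketch.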

VII. FINDINGS
This section shows the current trends in automatic feedback and scoring. First, we present general findings, followed by the analysis of each research question.

A. GENERAL FINDINGS
Of the selected papers, 12% are from 2016, 16% from 2017, 27% from 2018, 28% from 2019, and 17% from the early part of 2020. This trend suggests growing interest in the subject. Figure 1 shows this behavior.

B. EDUCATIONAL LEVEL
According to the International Standard Classification of Education [33], most of the work reviewed was at the bachelor's or equivalent level (92% of papers), with small numbers at the early education level (2% of papers), such as Saha's study of automatic grading of explanatory answers in middle school [34], and at the secondary education level (6% of papers), such as Anohah's analysis of high school computing science courses [35].
This distribution is consistent with the fact that technologies used for automatic grading and feedback require information technology and a moderate understanding of language and mathematics. Despite this, some of the works dealt with teaching topics in early education, including handwriting [36] and basic math [20]. See Figure 2.

C. FIELD OF EDUCATION
Many of the papers addressed work with effects across disciplines, e.g., using student data to predict performance [26]. For the discipline-specific ones, and using the International Standard Classification of Education (ISCED) [33], most of the work (47% of papers) fell into the categories of the sciences [37], including areas like geology [38], mathematics [20], computer science [24], and computer networking [39]. This is followed by cross-disciplinary applications (32% of papers) and by arts and humanities (21% of papers). This set is completed by applications in medicine, where virtual reality and other technologies are being used to support immersive practical experiences such as virtual artificial intelligent assistants [40], surgical skill assessments [41], [42], physiotherapy training [43], and clinical skills [44]. A couple of papers target areas such as music, where immediate feedback is also used to improve the student experience in general musical learning [45] and instrument learning [46]. The following sections describe the answers to each research question.

D. RQ1 WHAT TYPES OF AUTOMATIC SCORING AND AUTOMATIC FEEDBACK ARE IN USE?
The types of automatic scoring and feedback can be divided into two dimensions. The first is the input-form, and the second is the mechanism used for auto-grading and generating feedback.
From the input perspective, the primary forms are structured. These include mathematics [47], code [22], and controlled environments such as simulations [40]. Other inputs include short free form (e.g., a short sentence [12]) and long free form (e.g., an essay [13]). The main mechanisms are static or dynamic. Static ones include comparing the answer to a key or set of keys [16] or running a fixed set of unit test cases [48]. Dynamic ones include comparing the answer to other student answers or using machine learning to learn expected grades from past answers.
The tools used to produce grades and feedback dynamically include ontologies [7], neural networks to identify possible solutions [42], [47], machine learning used to identify learning paths [49], [50], and machine learning used to determine student's risk of falling [51]. Table 3 summarizes the tools and techniques in each area identified in the works analyzed.
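The distinction between static and dynamic mechanisms described in this section can be made concrete with a small Python sketch. It is a simplified illustration under stated assumptions, not an implementation taken from any of the reviewed tools: the function names are hypothetical, answers are compared after basic whitespace and case normalization, and the peer-agreement score is only one naive way to "compare the answer to other student answers."

```python
def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def grade_against_key(answer, accepted_keys):
    """Static: full credit if the answer matches any key in the set."""
    keys = {_normalize(k) for k in accepted_keys}
    return 1.0 if _normalize(answer) in keys else 0.0


def grade_with_unit_tests(func, test_cases):
    """Static: fraction of a fixed set of (args, expected) cases passed."""
    if not test_cases:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a submission that raises fails that case
    return passed / len(test_cases)


def grade_by_peer_agreement(answer, peer_answers):
    """Dynamic (naive): share of peers who gave the same answer."""
    if not peer_answers:
        return 0.0
    target = _normalize(answer)
    matches = sum(1 for p in peer_answers if _normalize(p) == target)
    return matches / len(peer_answers)


# Hypothetical student submission with a bug for negative inputs.
def student_add(a, b):
    return a + b if a >= 0 else 0


cases = [((1, 2), 3), ((0, 0), 0), ((2, 2), 4), ((-3, 1), -2)]
key_score = grade_against_key("  Paris ", ["paris"])
test_score = grade_with_unit_tests(student_add, cases)
peer_score = grade_by_peer_agreement("4", ["4", "4", "5", "four"])
```

The static graders depend only on material prepared in advance (the key or the test cases), whereas the peer-agreement score changes as more student answers arrive, which is what makes it dynamic in the sense used here.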

E. RQ2 WHAT ARE THE POSITIVE EFFECTS ON EDUCATION GOALS OF USING AUTOMATIC FEEDBACK AND AUTOMATIC SCORING?
Most of the analyzed papers concluded that there were educational advantages to automatic feedback and scoring. The most commonly mentioned advantages included bias reduction [52] and grading consistency [26], the ability for the instructor to shift focus away from grading into other activities [49], and allowing more students to participate in the learning experiences [53]. Table 4 summarizes the benefits.

F. RQ3 WHAT ARE THE POSITIVE EFFECTS ON THE STUDENT EXPERIENCE OF USING AUTOMATIC FEEDBACK AND AUTOMATIC SCORING?
Very few of the works focused on the student experience, instead emphasizing learning experiences. They primarily highlighted students' positive reception of features such as immediate grading with the allowance of multiple submissions [54]. Together with this, several works focused on the ability to create custom learning paths based on student performance [55], [26], [53]- [58], and the ability to flag students at high risk of not succeeding [51], [59]- [63], [61]. Table 5 summarizes the benefits found in the review.

G. RQ4 WHAT ARE THE ADVERSE EFFECTS ON EDUCATIONAL GOALS AND STUDENT EXPERIENCE OF USING AUTOMATIC FEEDBACK AND AUTOMATIC SCORING?
Most of the papers focused on the usability of the tools and techniques they work with and evaluated ways to replace current practices with equivalent or better methods, reporting very few adverse side effects. When detected, these side effects included students losing the social aspect of learning [64], which was replaced by human-computer interactions, and students learning to work within the system (e.g., creating multiple accounts in a MOOC to gain access to the answers [65]).
Very few studies assessed the adverse effects on student experience, although some included sections on potential issues that need to be further studied. These issues included students learning to solve the assessment questions without understanding the underlying concepts [66]. This phenomenon is not exclusive to automatic feedback, as studies have shown that the number of past tests studied is a strong indicator of performance on future tests [67]. Other potential adverse effects included loss of human interaction and lack of interpersonal skills while solving problems [64], and lack of personalized feedback that could help outlier students [68], especially struggling students [69].

H. RQ5 WHAT TYPE OF EVALUATION WAS USED TO MEASURE THE EFFECT OF AUTOMATIC SCORING AND FEEDBACK ON STUDENT ACADEMIC PERFORMANCE?
As shown in Table 6, some of the papers (36%) contained experiments related to the quality of the tool or the algorithm; examples include [70]- [74]. Others included a one-group experiment (34%) or an individual case study (29%). These results show the need for further experiments designed to measure the actual effects of automatic feedback and scoring on student academic performance.

I. RQ6 WHAT IMPROVEMENTS ARE BEING MADE TO MITIGATE THE ADVERSE EFFECTS IN RQ4?
Some of the effects cannot be easily mitigated, e.g., bringing back the student-professor interaction [29]. Automatic systems are becoming popular, and large MOOCs would be impossible to sustain without them [66]. Chatbot technologies could eventually help in this area by providing a more personalized experience [75], [76]. Learning that is specific to the assessment questions can be mitigated by generating dynamic problems unique to each student [77], [78]. From a student perspective, work can be done to improve the design and delivery of the automatic feedback, including finding a way to personalize the feedback [79]. Finally, it is essential to mention the need for more long-term studies to understand the impact of feedback on the students' experience.
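As an illustration of generating dynamic problems for each student, the following Python sketch derives a reproducible problem variant from a student identifier. It is a minimal sketch under stated assumptions, not the approach of [77] or [78]: the function names, the identifier format, the linear-equation template, and the parameter ranges are all hypothetical.

```python
import hashlib
import random


def generate_problem(student_id, assignment_id):
    """Build a linear-equation variant that is reproducible for a given
    (assignment, student) pair, so it can be regenerated at grading time."""
    material = f"{assignment_id}:{student_id}".encode("utf-8")
    # A stable hash gives the same seed across runs and machines.
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = random.Random(seed)
    a = rng.randint(2, 20)
    b = rng.randint(2, 20)
    x = rng.randint(1, 10)
    c = a * x + b  # chosen so the equation has an integer solution
    return {
        "prompt": f"Solve for x: {a}x + {b} = {c}",
        "a": a, "b": b, "c": c,
        "answer": x,
    }


def check_answer(problem, submitted):
    """Score a submission against the variant generated for that student."""
    return submitted == problem["answer"]


# Hypothetical identifiers, used only for illustration.
problem = generate_problem("student-001", "hw1")
is_correct = check_answer(problem, problem["answer"])
```

Because the parameter space in this sketch is small, two students can occasionally receive the same variant; a real deployment would likely need a larger space of templates and values so that a leaked answer is of little use to other students.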

VIII. DISCUSSION
The results of this review suggest that automatic scoring and feedback is an area undergoing constant improvement as technology evolves and data becomes available. The use of automatic scoring and feedback has led to the expansion of MOOCs and online courses and the ability to support large numbers of students in the same program [53]. Automated scoring and feedback are present not only in MOOCs [80] and other systems where the scale requires it but also in smaller settings as a tool to support learning, including introductory programming classes [21], [81]- [83]. This new capability has led universities to open their programs to more applicants and allowed more students to go through those programs.
The most common uses of automatic scoring and feedback are in three areas: 1) programming problems through mechanisms such as assisting the student with the coding [84], [85], analyzing coding patterns [86], [87], automatic grading [88]- [92], and customized feedback [84], [86], [85]; 2) short essays [93]; and 3) extended essays [94]. Programming problems are the easiest to use as input for this technology, as they appear in a structured language that computers can understand [95]. Short essays can also be handled, as their complexity tends to be low [96], while long essays pose the greatest challenge for this technology [97]. With this in mind, automatic scoring and feedback are being employed broadly in computer science, mathematics [98]- [101], [47], and similarly analytical courses, together with language-learning areas [102]- [105], [36], [97].
With the expansion of these technologies, we expect to see the benefits presented in the works surveyed in this paper materialize beyond the furthering of access, particularly an improvement in grading consistency [106] and the freeing of instructor time to dedicate to other activities [107]. An analysis of the effects of this time re-allocation would provide more information on the end effects of this benefit. We also expect to see an increase in student engagement driven by the ability to solve problems in a more interactive way given the feedback [108]- [110]. Looking at the effects of this engagement on learning is another area of future study. Similarly, a more personalized student experience would be expected to lead to better matching between the learning experience and the student's learning style [111]- [115].
On the other hand, the potential adverse effects of using automatic feedback cannot be ignored. Students can solve problems without learning the underlying concepts. Using the feedback as a trial-and-error exercise or accessing answers using the internet can cause a very detrimental effect on learning [116]. Similarly, where solutions are not easily defined or grouped, subjects will have a more challenging time implementing these technologies as they are less developed in those areas.

IX. LIMITATIONS
This study addresses only some relevant questions when analyzing the extensive use of automatic scoring and feedback.
There are fundamental questions about the quality of content and student privacy [117], for example, which are not considered in the study. The study also does not reveal funding and other possible biases affecting the underlying studies and does not focus on the specific tools used to implement the technologies. The research does not focus on features and requirements for automatic scoring and feedback tools or possible solutions to many challenges.

X. CONCLUSION AND FUTURE WORK
This work presents a systematic review of the literature with an analysis of 125 studies focused on using automatic scoring and feedback. Results indicate that these technologies play an essential role in expanding access to education and are still evolving. The use of these technologies is also growing both in large and small classes in multiple areas. The number of application areas, tools being used, and published works in this area are increasing. This trend is most likely related to a combination of technological advances and the need to serve more students.
This review shows the current state of automatic scoring and feedback and identifies areas of potential improvement and further analysis. Among these areas, the study of the effects on educational quality and student experience is highly relevant.
MARCELO GUERRA HAHN (Senior Member, IEEE) received the bachelor's and master's degrees in computer science from the Universidad de la Republica, Uruguay. He is currently pursuing the Ph.D. degree with the Universidad Internacional de la Rioja. He is studying technologies associated with automatic assignment feedback and their effects on learning achievements and experiences. He is also a Guest Lecturer with the University of Washington and the Director of engineering at Sound Commerce.
SILVIA MARGARITA BALDIRIS NAVARRO received the bachelor's degree in systems and industrial engineering from the Industrial University of Santander (UIS), Colombia, and the master's degree in industrial informatics and automation and the Ph.D. degree in technologies from the University of Girona. She is currently an Associate Professor with the Universidad Internacional de La Rioja, Spain, and Fundación Universitaria Tecnológico Comfenalco, Colombia. Since her early twenties, she has been interested in research on how technologies can facilitate all students' inclusion in the educational systems. She has coordinated and participated in international projects and initiatives in Europe and North/South America, including serving on the editorial boards of high-impact scientific journals.
He also works as the Director of the Research Institute for Innovation and Technology in Education (UNIR iTED, http://ited.unir.net). He is also a Professor at An-Najah National University, Palestine, an Adjunct Professor at the Universidad Nacional de Colombia (UNAL), Colombia, an Extraordinary Professor at North-West University, South Africa, and a Visiting Professor at Coventry University, U.K. He is or has been involved in more than 60 European and worldwide research and development projects. He works as a Consultant for the United Nations (UNECE), the European Commission and Parliament, and the Russian Academy of Science.