K-NCT: Korean Neural Grammatical Error Correction Gold-Standard Test Set Using Novel Error Type Classification Criteria

Recently, active research has been conducted on Korean grammatical error correction based on machine translation (MT) and automatic noise generation. However, there is no gold-standard test set for objective and official comparative analysis. A significant limitation is that performance is measured in an ill-defined way, because the error types used in the training set are also included in the test set. Additionally, the error types used for qualitative analysis are defined differently in each study, with no explicit guidelines. This study proposes a gold-standard test set called the Korean Neural Grammatical Correction Test set (K-NCT) for Korean grammatical error correction, built using a new error type classification guideline. To ensure the factuality and reliability of the proposal, we conduct a quantitative analysis using commercialization systems and a human evaluation. Experimental results demonstrate that the proposed grammatical error correction test set is well balanced and diverse and follows a precise guideline. Our dataset is available at https://github.com/seonminkoo/K-NCT


I. INTRODUCTION
Grammatical error correction is the task of detecting errors in a given sentence and correcting them. In Korean in particular, many kinds of grammatical errors occur owing to the language's morphological richness and agglutinative characteristics [1], [2], [3].
The most intuitive solution for Korean grammatical error correction is a rule-based approach, in which error types and their corresponding correction rules are predefined [4]. This method is effective because spelling and grammatical errors are amended without destroying the original sentence structure; hence, it is still in use. However, it is time consuming and requires human resources to establish the correction rules. Furthermore, it revises sentences strictly according to the predefined rules and cannot correct other types of errors.
The associate editor coordinating the review of this manuscript and approving it for publication was Juan Wang.
To address these problems, statistical approaches have been proposed. They remove the need to construct resource-intensive correction rules and instead judge errors based on probabilities estimated from a given corpus [5], [6]. However, they also demand a sufficiently large corpus to attain decent performance.
Recently, deep learning-based grammatical correction algorithms that can effectively alleviate the above limitations have come into use. Several methods that construct an error correction model without parallel data have been proposed, particularly for the Korean language [1], [7], [8]. These are mostly based on an automatic noise injection process that generates a pseudo-parallel corpus from an unlabelled mono corpus [1]. Generally, a sentence from the mono corpus is regarded as the target sentence, and a corresponding source sentence is generated through the noising process to train a sequence-to-sequence correction model.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, despite the significant improvement brought by the deep learning-based approach, it has the following two limitations. First is the absence of an official grammatical error correction dataset, which leads to inconsistent evaluation of error correction models. Because public test data does not exist, researchers construct their own test sets by arbitrarily sampling their original corpus. As the test data differs from study to study, the reported performance may not reflect the objective and reliable performance of each model [9], [10].
Second, the unstandardized error types used in each study reduce the reliability of the performance assessment. Because a precise standard for error types has not been established [2], a high score may not reflect a model's actual effectiveness; performance may be underestimated or overestimated depending on the test set used.
We propose error type classification standards for Korean grammatical error correction research and release the corresponding gold-standard test set, K-NCT (Korean Neural grammatical Correction Test set). The proposed types comprise four major categories: spacing, punctuation, numerical, and spelling and grammatical errors. These are divided into 23 subcategories, and the test set is built with balance, diversity, and factuality in mind. To secure the factuality and reliability of K-NCT, we carry out a human evaluation by consulting linguistic experts and a quantitative analysis using publicly released commercialization systems.
Section III.B describes the three reliability features (balance, diversity, and factuality) considered for K-NCT construction. Subsequently, we propose a set of error type criteria reflecting the characteristics of Korean in Section III.C. Section III.D describes the data selection, pre-processing, post-processing, and error injection processes used to construct K-NCT. Finally, Section III.E presents an overview of the completed K-NCT. Section IV describes the experiments and results.
The contributions of this study are as follows.
• It identifies the limitations of the performance evaluation methods used in previous studies and suggests ''New Error Type Classification Criteria.'' Presumably, this is the first such error type standard.
• Based on the error type criteria, we release K-NCT, the first gold-standard test set for Korean grammatical error correction, to promote active research on the task.
• An in-depth quantitative analysis is performed by applying K-NCT to commercialization systems, and K-NCT is verified objectively and reliably through human evaluation.

II. RELATED WORKS
Several deep learning-based grammatical error correction models approach the task from the point of view of machine translation. From this perspective, grammatical error correction is ''translating'' an error sentence into a correct sentence. Correction is mainly performed through noising encoder and denoising decoder structures based on the sequence-to-sequence model [11], [12], [13].
Recently, research on grammatical error correction for high-resource languages has been active. A sequence tagging model including a transformer-based encoder is proposed for error detection and error correction [14]. From an educational point of view, interpretability is improved by presenting language learners with text and examples that explain the reason for each correction [15]. Self-Supervised Curriculum Learning is applied to measure data difficulty through training loss and to train the model for higher performance [16]. Applying a contrastive learning approach to the GEC model improves performance for low error density domains [17]. With regard to inference efficiency, decoding many tokens at once through aggressive decoding improves the model's speed [18].
However, applying these methodologies requires a parallel corpus composed of pairs of error sentences and correct sentences, and no such Korean data is publicly available. Various studies are therefore underway to construct a pseudo-parallel corpus without human labor by applying automatic noise generation. This technique designs a noise function and applies it to a mono corpus to generate a parallel corpus automatically. Grapheme-to-phoneme, spacing, punctuation, and pronunciation errors, among others, are artificially injected as noise. However, the types of errors defined in each study are different, and the training set and test set are both split from the same pseudo-generated data set. As a result, the error types used for training are highly likely to be included in the test set, rendering objective research difficult.
Therefore, this study points out the limitations of data construction and performance evaluation in existing Korean grammatical correction research and proposes an error type classification system for the first time. Based on this system, we build what is presumably the first gold-standard test set for Korean grammatical error correction, called K-NCT.
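The noising idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of any cited study: the function names `inject_spacing_noise` and `build_pseudo_parallel` and the `drop_prob` parameter are hypothetical, and only one error family (missing spaces) is simulated.

```python
import random


def inject_spacing_noise(sentence, drop_prob=0.3, seed=None):
    """Hypothetical noise function: randomly delete spaces from a clean sentence."""
    rng = random.Random(seed)
    kept = []
    for ch in sentence:
        # Drop each space with probability drop_prob; keep every other character.
        if ch == " " and rng.random() < drop_prob:
            continue
        kept.append(ch)
    return "".join(kept)


def build_pseudo_parallel(mono_corpus, noise_fn):
    """Pair each clean target sentence with a noised source sentence."""
    return [(noise_fn(target), target) for target in mono_corpus]


if __name__ == "__main__":
    corpus = ["이제 곧 갑니다", "오늘은 날씨가 좋다"]
    for source, target in build_pseudo_parallel(
        corpus, lambda s: inject_spacing_noise(s, drop_prob=0.5, seed=0)
    ):
        print(repr(source), "->", repr(target))
```

Because both the training and test portions of such a corpus come from the same noise function, they share the same error types, which is the evaluation problem the text points out.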

III. K-NCT
A. WHY K-NCT?
A gold-standard test set has not been used in previous studies on Korean grammatical error correction, which causes several limitations.
First, accurate performance measurement is impossible with a generated pseudo-parallel corpus that includes only certain error types. Second, the error types used in training are included in the test set because the generated corpus is divided arbitrarily. Third, it is difficult to analyze model development in detail because there is no systematic error type guideline. Fourth, some datasets use a single language and sentences whose domains and lengths are not diverse; therefore, objective comparative studies are difficult.
To solve the problems above, we propose the K-NCT set, which was constructed by considering the error type classification criteria and various reliability factors. The K-NCT set not only systematizes error types and applies them to actual sentences; it also considers different domains, methodologies, and numbers of syllables. It is a 100% human-constructed, high-quality dataset.

B. DESIGNING K-NCT
K-NCT is built around three features that can be objectively verified. First, the dataset is designed for balance, so that its composition is fair and the data are unbiased. Second, it is organized for diversity, so that it covers numerous features. Third, it is created to reflect factuality: an objective gold-standard test set should contain little to no unrealistic data or unnatural error types, as judged by humans.

a: CONSIDERING THE BALANCE
K-NCT includes well-balanced error types, syllable lengths, text styles, and domains. The balance is achieved by fixing the proportion of each error type and, for spelling and grammatical errors, the proportion of each detailed type. The test set contains 500 spacing, 500 punctuation, and 500 numerical error sentences. The detailed types of spelling and grammatical errors consist of 1312 monolingual, 200 multilingual, 800 spelling, 411 syntax, 200 semantic, and 100 neologism errors. Syllable length is balanced by dividing it into specific ranges and setting a ratio for each range. For text style, sentences of various styles are included in uniform proportions, considering the characteristics of Korean.

b: CONSIDERING THE DIVERSITY
For the diversity of K-NCT, we consider not just a single text style but written, spoken, and dialog styles. Several previous datasets and pseudo-parallel corpora contain only a single text style. Therefore, we include various text styles when constructing the dataset to allow a more accurate and objective comparative analysis.
The dataset covers syllable-length ranges of 2∼20, 21∼29, 30∼50, and 51 or more syllables to ensure diversity. Furthermore, we propose various error types and construct the dataset so that all of them are covered. With multiple error types, it is possible to determine which error types a model handles well and which it handles poorly.

c: CONSIDERING THE FACTUALITY
We conduct a human evaluation to verify the factuality of the proposed guidelines and generated sentences (see Section IV.C). Because factuality, that is, whether a sentence is natural and true to life, must be judged by a person, it is assessed through human evaluation. We present the human evaluation criteria used to determine factuality and the corresponding scores.

C. ERROR TYPE CLASSIFICATION CRITERIA FOR K-NCT
Accurate performance evaluation and analysis of Korean grammatical error correction models requires diverse and systematic error types. To this end, a more detailed and specific error type system is proposed. Table 1 shows the error type classification criteria used in K-NCT construction. It is classified into 23 detailed error types based on four major categories (spacing, punctuation, numerical, and spelling and grammatical errors). Particularly, spelling and grammatical error is divided into primary and secondary errors for detailed analysis.

a: SPACING ERROR
This violates the Korean spacing rules. Similar to deep learning models, humans also commit these errors because of their fast typing speed or habits.

b: PUNCTUATION ERROR
This occurs when punctuation marks are missing from the sentence or misplaced. When a deep learning model generates a sentence, such errors readily appear because of unregistered words. The intent of a sentence may differ depending on its punctuation marks.

c: NUMERICAL ERROR
This occurs when a number is written in the wrong form, for instance when cardinal numbers (indicating quantity) and ordinal numbers (indicating order) are confused. For example, the correct expression 'han si il bun' (romanized) is incorrectly written as 'hana si il bun' or 'ilsi ilbun'.

d: SPELLING AND GRAMMATICAL ERROR
This violates Korean spelling and grammar rules. As the most frequent case in Korean, it is divided into primary and secondary errors. Primary and secondary errors can be nested; therefore, the primary error is classified first and the secondary error is then sub-classified. Primary errors are classified into monolingual errors occurring in Korean and multilingual errors occurring in other languages. The subtypes are as follows:
• Remove Error: This occurs when some words are not recognized, or the endings or postpositions are omitted. This is one of the common mistakes Koreans commit, and it is classified as an error type.
• Addition Error: This occurs when the same word is repeated, postpositions are not used, or when endings are added. These mistakes are committed frequently owing to sentence typing speed and incorrect grammatical knowledge.
• Replace Error: It is subdivided into word replacement, in which another word replaces a word, and rotation replacement, in which the order of syllables changes within one phrase. These errors occur at a fast typing speed and are primarily related to spelling errors around the intended spelling position.
• Separation Error: This occurs when consonants and vowels of a character are separated. This frequently occurs in Korean because the space key is usually used to separate words with spaces.
• Typing language Error: This occurs when typing while the keyboard is not in Korean mode, typically because the language change key was pressed while manipulating the keyboard. For example, '안녕' (annyeong) is incorrectly typed as 'dkssud'.
• Foreign word conversion Error: The spelling differs from the standard notation of a foreign word. Korean has a standard foreign word notation, and this error is the case where it is not followed; for example, 'supeu' (romanized) is incorrectly spelled 'seupeu'.
Secondary errors are classified into four types (Spelling, Syntax, Semantic, and Neologism Errors). The subtypes are as follows:
• Consonant vowel conversion Error: Spelling error in non-speaking alphabet units; for example, 'ije god gabnida.' (romanized) is incorrectly written as 'ije kon gabnida.'
• Grapheme-to-phoneme (G2P) Error: Writing spellings according to pronunciation; for example, 'ije god gabnida.' (romanized) is incorrectly written as 'ije goj gabnida.'
• Element Error: Korean sentence elements are not in place or do not fit the order of words. Korean has a fixed sentence structure (subject, object, verb), and there are transitive verbs that require an object.
• Tense Error: Using a verb that does not match the tense, e.g., 'Did in the future.'.
• Postposition Error: Using a postposition that does not conform to the grammar. Because Korean is an agglutinative language, the use of postpositions is important. Furthermore, there are various types of postpositions, such as case and auxiliary postpositions.
• Suffix Error: Using endings that do not conform to spelling. Suffix Error is classified as an error type because the original verb is modified according to each situation.
• Auxiliary predicate Error: Using auxiliary verbs that do not conform to grammar. This occurs because Korean uses auxiliary verbs to construct sentences.
• Dialect Error: Writing in non-standard language. Korean has a variety of dialects. The criterion for determining dialect errors is the speaker's or author's intention. When the model fails to create the intended dialect, it is judged as an error.
• Polite speech Error: Polite speech expression that does not fit the subject. This error type reflects Korean cultural characteristics.
• Behavioral Error: An expression that the subject cannot perform, for example, 'the apple eats the banana'.
• Coreference Error: Invalid entity reference. Misrecognition of an object can create a sentence different from the intended one.
• Discourse context Error: Contradicting the context of the previous discourse. This error encourages the generation of a wrong sentence with different information from the previous sentences.
• Neologism Error: Using a spelling or a new word that is not in the existing grammar system. As with dialect errors, the speaker's or author's intention is used as the criterion for judging neologism errors.
By establishing this guideline, we built K-NCT, a test set consisting of sentences that reflect a detailed error type classification system. This allows accurate performance measurements and clear comparative studies. In this study, we applied the guideline to actual Korean sentences to generate sentences that include various error types. We then conducted experiments based on the guideline to evaluate performance and verify the reliability of K-NCT as a gold test set.
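The 'Typing language Error' subtype listed above is mechanical enough to reproduce in code. The sketch below assumes the standard Korean 2-set (Dubeolsik) keyboard layout and only handles precomposed Hangul syllables; the function name `hangul_to_qwerty` is hypothetical and is not part of K-NCT.

```python
# Key sequences on a QWERTY keyboard for each jamo under the 2-set (Dubeolsik)
# layout, listed in Unicode order of initial, medial, and final jamo.
CHO = "r R s e E f a q Q t T d w W c z x v g".split()
JUNG = "k o i O j p u P h hk ho hl y n nj np nl b m ml l".split()
JONG = [""] + "r R rt s sw sg e f fr fa fq ft fx fv fg a q qt t T d w c z x v g".split()


def hangul_to_qwerty(text):
    """Simulate typing Korean text while the keyboard is in English mode."""
    typed = []
    for ch in text:
        code = ord(ch) - 0xAC00  # offset within the Hangul Syllables block
        if 0 <= code < 11172:
            # Each syllable encodes (initial, medial, final) as 588*i + 28*m + f.
            cho, rest = divmod(code, 588)
            jung, jong = divmod(rest, 28)
            typed.append(CHO[cho] + JUNG[jung] + JONG[jong])
        else:
            typed.append(ch)  # leave non-Hangul characters unchanged
    return "".join(typed)


if __name__ == "__main__":
    print(hangul_to_qwerty("안녕"))  # dkssud
```

The same decomposition into initial, medial, and final jamo could also serve as a starting point for simulating the Separation Error subtype, although that is not shown here.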

D. CONSTRUCTION PROCESS
a: DATA SELECTION
All original sentences are extracted from the Korean-English translation (parallel) corpus of AI hub (see footnote 1) [19]. AI hub is a data platform operated at the national level in Korea. It provides natural language datasets, such as machine translation and document summarization data, as well as datasets in various other fields such as images, autonomous driving, and healthcare. In summary, AI hub builds high-quality, large-capacity datasets in the Korean language and contributes to AI research by disclosing those datasets to the public.
The Korean-English translation (parallel) corpus includes files in written, spoken, and dialog styles. The written style is a formal language mainly used on formal occasions, whereas the spoken and dialog styles are relatively informal and natural. The dataset covers various domains and syllable lengths.
We integrate the individual documents by style to create three files consisting of 1,100,000 written style sentences, 400,000 spoken style sentences, and 100,000 dialog sentences. Considering balance and diversity, we randomly sample 1,000 sentences from each of the three style files, yielding 3,000 sentences in total. This dataset is used as the raw data for the gold-standard test set.
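The per-style sampling step can be sketched as follows. This is a minimal illustration that assumes the three style files have already been loaded into lists of sentences; the function name `sample_per_style`, the dictionary keys, and the fixed seed are hypothetical choices, not details given in the paper.

```python
import random


def sample_per_style(sentences_by_style, n_per_style=1000, seed=0):
    """Draw the same number of sentences, without replacement, from each style."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for style, sentences in sentences_by_style.items():
        if len(sentences) < n_per_style:
            raise ValueError(f"not enough sentences for style {style!r}")
        sampled[style] = rng.sample(sentences, n_per_style)
    return sampled


if __name__ == "__main__":
    # Toy stand-ins for the written, spoken, and dialog files.
    toy = {
        "written": [f"written sentence {i}" for i in range(50)],
        "spoken": [f"spoken sentence {i}" for i in range(30)],
        "dialog": [f"dialog sentence {i}" for i in range(20)],
    }
    raw = sample_per_style(toy, n_per_style=10)
    print({style: len(sents) for style, sents in raw.items()})
```

Sampling an equal count per style, rather than proportionally to file size, is what keeps the three styles balanced despite the very different sizes of the source files.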

b: PRE-PROCESSING
Pre-processing is the first step to conduct error injection in correct sentences. Accordingly, we performed sentence alignment and error tagging.
First, sentence alignment is performed, that is, sentences are aligned according to their number of syllables. The sentences are sorted into the corresponding syllable range based on the index: indices 0∼749 correspond to 2∼20 syllables, 750∼1799 to 21∼29 syllables, 1800∼2399 to 30∼50 syllables, and 2400∼2999 to 51 or more syllables. Because the average number of syllables in the dataset differs for each domain, it is difficult to control the number of syllables per domain, which would hinder reliability. For factuality, we therefore do not consider the domain when sorting the sentences by syllable range.
1 https://bit.ly/3QwW9IT
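The index-to-range mapping above can be written as a small lookup. The sketch below uses only the boundaries stated in the text; the function names and the range labels are hypothetical.

```python
import bisect

# First index of the second, third, and fourth syllable ranges, as stated above.
INDEX_BOUNDS = [750, 1800, 2400]
RANGE_LABELS = ["2-20", "21-29", "30-50", "51+"]


def range_for_index(index):
    """Return the syllable range assigned to a sentence index (0..2999)."""
    if not 0 <= index <= 2999:
        raise ValueError("index must be between 0 and 2999")
    return RANGE_LABELS[bisect.bisect_right(INDEX_BOUNDS, index)]


def range_for_length(n_syllables):
    """Return the syllable range that a sentence of the given length falls into."""
    if n_syllables < 2:
        raise ValueError("sentences shorter than 2 syllables are not covered")
    if n_syllables <= 20:
        return "2-20"
    if n_syllables <= 29:
        return "21-29"
    if n_syllables <= 50:
        return "30-50"
    return "51+"


if __name__ == "__main__":
    for i in (0, 749, 750, 1799, 1800, 2399, 2400, 2999):
        print(i, range_for_index(i))
```

Under these boundaries the four ranges hold 750, 1050, 600, and 600 sentences respectively, which sums to the 3,000 sentences of the test set.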
Second, error type tagging is performed, that is, tagging each sentence with an error type. We randomly tagged the spacing, punctuation, numerical, monolingual, multilingual, spelling, syntax, semantic, and neologism errors on the sentence of the dataset by a predefined ratio.
Some error types can only be injected if the sentence meets an essential condition. For example, a numerical error requires a numerical expression in the sentence. Likewise, dialect or neologism errors require a dialect or neologism expression in the sentence. Therefore, if the essential condition of the tagged error type is not satisfied, we switch the alignment position of the error type with another sentence that satisfies the condition. Although this method is quite simple, it cannot be applied to all error types, so post-processing is required.

c: POST-PROCESSING
Post-processing is performed after pre-processing and targets correct sentences that do not satisfy the conditions of their tagged error type. We modify these correct sentences so that they meet the error type generation conditions.
Even after the switching method of pre-processing, some sentences still have no conditions for generating errors. Therefore, we form high-quality data by modifying the correct sentences such that the essential conditions for error type insertion are satisfied while keeping Korean orthography. For example, there are cases where the extracted original sentences do not contain dialects or neologisms. In such cases, we revise the sentences to include a dialect or neologism. In addition, if the numerical expressions do not match the proposed format, they are corrected. Using post-processing, we generate high-quality data that satisfy these conditions.

d: ERROR INJECTION
The error types assigned to the correct sentences were sorted through three steps (data selection, pre-processing, and post-processing). Error injection generates an error sentence (incorrect sentence) by introducing the sorted error type into the correct sentence. Annotators generate the incorrect sentences by performing error injection according to the given correct sentence and error type. The guideline for error injection is as follows.
• Define the error type and present actual examples.
• Keep the number of syllables within the range assigned to the index, allowing limited freedom.
• Restrict changes in the style of the sentence.
• Write numerical expressions as numbers only for statistical or date expressions; otherwise, write them in letters.
• When altering spelling, use only characters within a keyboard editing distance of two.
• Indicate the span and the corresponding error type where the error occurs.
Annotators must have a certain understanding of the error types to generate high-quality data. Therefore, definitions of the error types and practical examples are given to the annotators. Annotators generate error sentences within the syllable range of the specified index and keep the text style constant, because the sentences were pre-processed before being provided to them.
The guideline specifies which numerical expressions can carry a numerical error. Statistical or date expressions are written as numbers, and all other cases are written in letters. For example, 'November 2021' is presented as a number and 'there are two apples' is presented in letters. To reflect the factuality of the error sentence, the candidates for a changed character are limited to characters whose keyboard editing distance is two or less.
Error types are randomly tagged to sentences by designating a balanced number for each error type. The annotators take the correct sentence tagged with the corresponding error type and create an error sentence. Within the guideline presented above, a certain degree of freedom allows the creation of a more realistic dataset. In addition, annotators directly inspect all data to ensure the accuracy and reliability of K-NCT.
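The keyboard-distance constraint can be illustrated with a short sketch. The paper does not define 'keyboard editing distance' precisely, so the code below makes a loudly stated assumption: keys sit on an unstaggered QWERTY grid and distance is the larger of the row and column offsets (Chebyshev distance). The names `key_distance` and `replacement_candidates` are hypothetical.

```python
# Assumed layout: the three letter rows of a QWERTY keyboard, treated as a grid.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POSITION = {key: (r, c) for r, row in enumerate(ROWS) for c, key in enumerate(row)}


def key_distance(a, b):
    """Assumed distance between two keys: max of row and column offsets."""
    (r1, c1), (r2, c2) = POSITION[a], POSITION[b]
    return max(abs(r1 - r2), abs(c1 - c2))


def replacement_candidates(key, max_distance=2):
    """Keys that may replace `key` when injecting a realistic typing error."""
    return sorted(
        other
        for other in POSITION
        if other != key and key_distance(key, other) <= max_distance
    )


if __name__ == "__main__":
    print(replacement_candidates("q"))
```

Since Korean is typed on the same physical keys, the same idea would apply to jamo after mapping them to key positions; a different distance definition would change the candidate sets but not the overall approach.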

E. FINAL GOLD TEST SET
Three thousand high-quality sentences are constructed through data selection, pre-processing (sentence alignment and error type tagging), post-processing, and error injection. Six people (see footnote 2) created and inspected the guidelines and built and evaluated the data. To create high-quality data, each annotator completed more than three hours of training on detailed descriptions of the guidelines. K-NCT is publicly released as a corpus in a json format file. Table 2 shows an actual example of the final gold-standard test set. K-NCT contains error sentences and correct sentences, applicable domains, phrases, the number of syllables, the number of errors, and the error types. Error sentences mark a span at the location of each error, together with the error type at that location. Because errors may occur in multiple locations in one sentence, each span is marked around the phrase where the error occurred, as in '<e1> error </e1>', so that several error locations and types can be displayed, as shown in the second row of Table 2.
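Reading the span markup from a record might look like the sketch below. The exact tag syntax and JSON field names are assumptions: the markup is taken to be `<eN> ... </eN>` with a numeric id, and the keys `error_sentence`, `correct_sentence`, and `error_types` are hypothetical placeholders rather than the released schema.

```python
import json
import re

# Assumed span markup: <e1> ... </e1>, <e2> ... </e2>, and so on.
SPAN = re.compile(r"<e(\d+)>(.*?)</e\1>")


def extract_spans(tagged_sentence):
    """Return (span id, erroneous text) pairs in order of appearance."""
    return [(int(m.group(1)), m.group(2).strip()) for m in SPAN.finditer(tagged_sentence)]


def strip_tags(tagged_sentence):
    """Remove the span markup, keeping the erroneous text itself."""
    return SPAN.sub(lambda m: m.group(2), tagged_sentence)


if __name__ == "__main__":
    # Hypothetical record; the field names are illustrative only.
    record = json.loads(
        '{"error_sentence": "I <e1>hav</e1> two <e2>aples</e2>.",'
        ' "correct_sentence": "I have two apples.",'
        ' "error_types": {"1": "Remove Error", "2": "Remove Error"}}'
    )
    for span_id, text in extract_spans(record["error_sentence"]):
        print(span_id, text, record["error_types"][str(span_id)])
    print(strip_tags(record["error_sentence"]))
```

The plain sentence returned by `strip_tags` is what would be fed to a correction system, while the spans allow scores to be broken down by error type.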

IV. EXPERIMENTS AND RESULTS
We conducted experiments and validation to demonstrate the reliability and objectivity of K-NCT. Three aspects (statistical, quantitative, and human evaluation) were used to verify the quality of K-NCT as a gold-standard test set.
2 The annotators consist of ordinary people with no background knowledge of the task and are trained before starting the assessment.

A. STATISTICAL ANALYSIS
First, we conducted a basic statistical analysis, as shown in Table 3. K-NCT consists of 3000 sentences. We define the error sentences as the source and the correct sentences as the target. On average, error sentences have a length of 43.27, with 10.39 words and 9.39 spaces. On average, correct sentences have a length of 43.29, with 10.57 words and 9.57 spaces. The statistical similarity between the error sentences and the correct sentences suggests that the generated error sentences contain realistic errors that still leave the original sentence understandable.
Error sentences contain 93,794 K-tokens, 904 E-tokens, and 4,785 S-tokens. Correct sentences contain 93,856 K-tokens, 860 E-tokens, and 4,423 S-tokens. As in real-world text, English is mixed into the Korean sentences. Through punctuation error injection, we generated error sentences covering various punctuation marks.
Second, we analyzed the reliability of K-NCT. Figures 1 and 2 show the distributions for each text style. Figure 1 shows the amount of data for each text style (written, spoken, dialog), reflecting Korean characteristics. Each style consists of 1000 sentences, indicating that K-NCT is balanced across styles. Figure 2 shows the syllable distribution for each style. Written style sentences are concentrated at 50∼70 syllables, spoken style sentences at 15∼25 syllables, and dialog style sentences at 20∼30 syllables.
This shows that K-NCT is composed of realistic sentences. Overall, the sentences of K-NCT are concentrated at 15∼30 syllables, with the remainder spread over other lengths. Sentence length varies depending on the situation and style in which the vocabulary is used. Therefore, K-NCT is constructed considering the diversity of style. Figure 3 shows the distribution of error type tagging. Considering balance and diversity, 1500 sentences are tagged with spelling and grammatical errors to allow for the detailed error types, and each of the remaining types is tagged in 500 sentences.
Monolingual error is tagged in 1312 sentences, and spelling error is tagged in 800 sentences. Multilingual and semantic errors are each tagged in 200 sentences, syntax error is tagged in 411 sentences, and neologism error is tagged in 100 sentences. This shows that the various error types are well balanced. Therefore, K-NCT enables detailed measurements through high-quality sentences generated by considering reliability features (e.g., style, syllable, and error type).
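Averages of this kind are straightforward to recompute. The sketch below assumes that 'length' means the number of characters including spaces and that words are whitespace-separated tokens; these definitions and the function name `corpus_stats` are assumptions, since the paper does not spell them out.

```python
def corpus_stats(sentences):
    """Average length (characters), word count, and space count per sentence."""
    if not sentences:
        raise ValueError("need at least one sentence")
    n = len(sentences)
    return {
        "avg_length": sum(len(s) for s in sentences) / n,
        "avg_words": sum(len(s.split()) for s in sentences) / n,
        "avg_spaces": sum(s.count(" ") for s in sentences) / n,
    }


if __name__ == "__main__":
    source = ["이제곧 갑니다", "오늘은 날씨가좋다"]    # toy error sentences
    target = ["이제 곧 갑니다", "오늘은 날씨가 좋다"]  # toy correct sentences
    print("source:", corpus_stats(source))
    print("target:", corpus_stats(target))
```

For sentences with single spaces and no leading or trailing whitespace, the word count is the space count plus one, which matches the reported averages (10.39 versus 9.39 and 10.57 versus 9.57).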

B. QUANTITATIVE ANALYSIS
Next, we conducted a quantitative analysis. Based on K-NCT, we compared the proofreading performance of Naver, Daum, and Pusan, which are the most representative Korean grammatical correction commercialization systems. We chose commercialization systems for comparison because they are established systems used by several researchers and apply the latest deep learning-based grammatical correction methodology; hence, we consider them the most objective and reliable basis for accurate analysis.
The error sentences of K-NCT are given as input to the three commercialization systems, and the performance of each corrector is measured using the BLEU score [20] and GLEU score [21], which are used as evaluation indicators in various deep learning-based grammatical correction studies [1], [22], [23]. The experimental results are shown in Table 4.
The experimental results show that, based on the GLEU and BLEU scores, performance is highest for Pusan, followed by Daum and Naver. However, for each subtype the systems fall in a fairly similar range, and the differences are not large. This suggests that K-NCT is a gold-standard test set that can objectively measure performance without being biased toward any one system.
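To make the n-gram overlap idea behind these metrics concrete, the sketch below implements one sentence-level GLEU variant: the minimum of n-gram precision and recall over 1- to 4-grams. This is an assumption on our part; it may differ from the exact GLEU definition of [21] and from the tokenization used in the paper, and it returns values in [0, 1] rather than on the 0 to 100 scale of the tables.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all n-grams of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(reference, hypothesis, max_n=4):
    """min(precision, recall) of n-gram matches; whitespace tokenization assumed."""
    ref = ngram_counts(reference.split(), max_n)
    hyp = ngram_counts(hypothesis.split(), max_n)
    total_ref, total_hyp = sum(ref.values()), sum(hyp.values())
    if total_ref == 0 or total_hyp == 0:
        return 0.0
    matches = sum((ref & hyp).values())  # clipped n-gram matches
    return min(matches / total_hyp, matches / total_ref)


if __name__ == "__main__":
    gold = "이제 곧 갑니다"
    print(sentence_gleu(gold, "이제 곧 갑니다"))
    print(sentence_gleu(gold, "이제 곶 갑니다"))
```

In an evaluation loop, the hypothesis would be a system's output for a K-NCT error sentence and the reference would be the corresponding correct sentence.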
As an additional experiment, the strengths and weaknesses of each commercialization system are analyzed based on the error type classification criteria designed in this study. The experimental results are displayed in Table 5.
We analyze each commercialization system in detail using K-NCT. First, Pusan, which shows the best overall performance, is overwhelmingly better than the other models at spacing. The Pusan model achieves a high performance of 87.13 in BLEU score and 86.07 in GLEU score, whereas Naver and Daum show significantly lower performance, with BLEU scores of 60.9 and 56.30 and GLEU scores of 44.02 and 40.83, respectively. The Pusan model is thus the most robust model for correcting spacing.
Second, for punctuation, Daum shows the best performance, with BLEU and GLEU scores of 69.16 and 50.61, respectively. For numerical errors, Daum performs best based on the BLEU score and Pusan based on the GLEU score; however, there is no significant difference among the three models. Finally, for spelling and grammatical errors, which is the most important category, the Pusan model shows the best performance, with a BLEU score of 75.07 and a GLEU score of 70.66.
As shown in the analysis above, because the error type for each sentence pair is labeled in K-NCT, the strengths and weaknesses of the corresponding grammatical corrector can be clearly analyzed.
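A per-error-type breakdown of this kind amounts to grouping sentence-level scores by their error type label. The sketch below assumes scores have already been computed for each sentence; the function name and the toy numbers are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def mean_score_by_error_type(scored_sentences):
    """Average sentence-level scores within each error type.

    scored_sentences: iterable of (error_type, score) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # error type -> [sum, count]
    for error_type, score in scored_sentences:
        totals[error_type][0] += score
        totals[error_type][1] += 1
    return {etype: total / count for etype, (total, count) in totals.items()}


if __name__ == "__main__":
    # Toy scores for illustration only.
    toy = [("spacing", 80.0), ("spacing", 90.0), ("punctuation", 50.0)]
    for etype, mean in mean_score_by_error_type(toy).items():
        print(f"{etype}: {mean:.2f}")
```

Note that corpus-level BLEU is not simply the mean of sentence-level scores, so this grouping is only an approximation of how the figures in the tables would be obtained.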

C. HUMAN EVALUATION
In order to evaluate the diversity and effectiveness of the generated dataset, a total of 200 sentences are randomly sampled for 10 error types, excluding Secondary Spelling and Grammatical Errors. Since Secondary Spelling and Grammatical Errors are a subset of Primary Errors, we do not sample them separately.
We employ five human evaluators with bachelor's degrees. Each evaluator performed the evaluation after being instructed on K-NCT and the evaluation method. Quality is evaluated on five items, each scored zero for poor, one for normal, and two for excellent. The evaluation items are as follows [24]. (1) It contains all given error types (Compositionality). (2) The relationship/use between error types is natural (Association). (3) It is easy to grasp the original meaning/intention of the sentences (Fluency). (4) It is easy to recognize which type of error the error sentence contains (Factuality). (5) It is a common error type (Typicality). The experimental results are shown in Table 6.
The average score for each of the five evaluation items is high, in the middle of the one-point range on the zero-to-two scale. Notably, the proportion of excellent ratings is high for most items. The human evaluation thus supports both the realism of the error types and the high quality of the data.
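Aggregating the 0/1/2 ratings for one evaluation item reduces to a mean and a per-label proportion. The sketch below is a minimal illustration; the function name `summarize_ratings` and the toy ratings are hypothetical and do not reproduce Table 6.

```python
LABELS = ("poor", "normal", "excellent")  # scores 0, 1, and 2


def summarize_ratings(ratings):
    """Return the mean score and the share of each label for one evaluation item."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("ratings must be 0, 1, or 2")
    mean = sum(ratings) / len(ratings)
    shares = {label: ratings.count(score) / len(ratings) for score, label in enumerate(LABELS)}
    return mean, shares


if __name__ == "__main__":
    # Toy ratings for a single item such as Fluency.
    mean, shares = summarize_ratings([2, 2, 1, 0, 2, 1])
    print(f"mean={mean:.2f}", shares)
```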

V. CONCLUSION
Recently, many studies on grammatical error correction based on machine translation and automatic noise generation have been conducted. However, in the case of Korean grammatical error correction, an objective comparative study is difficult because, owing to the absence of a training dataset and a gold test set, a pseudo corpus including only specific error types is generated and used. To solve this problem, new error type classification criteria are proposed, based on which K-NCT, a gold test set for Korean grammatical correction research, is built for the first time. In addition, the reliability of the proposed K-NCT is verified through statistical analysis, quantitative analysis, and human evaluation, and the data is fully accessible to the public. Our dataset can be applied not only to formal text such as news articles, but also to dialogue or spoken Korean spelling correction tasks.
A limitation of this work is that Coreference and Discourse context Errors cannot be determined at the sentence level. Therefore, these error types are not included in the test dataset. In the future, to deal with errors that occur at the paragraph level, the data will be expanded to paragraph units and to various language pairs. Based on document or conversation resources, we will construct and publish datasets that include the types of errors not included in this version of the dataset.