How to Progressively Build Thai Spelling Correction Systems?

Neural-based sequence-to-sequence (Seq2Seq) methods have proven to be highly effective for context-sensitive Thai spelling correction. However, they also inherit the drawbacks of Seq2Seq, such as a fixed vocabulary and large data requirements. Dictionary-based methods, on the other hand, are insufficiently robust in their typical applications to produce corrections that reduce the error rate. These drawbacks inhibit the application of these methods in a broader range of use cases. In this paper, we provide a practical guide on how to build correction systems progressively and efficiently with three main contributions. First, we present a process for efficiently and progressively producing training data for both neural-based and dictionary-based methods. Our annotation process enables existing methods to be trained with only two percent of the data hand annotated. Second, we propose the Extendable Neural Contextual Corrector (XNCC), a novel text correction approach that decouples the dictionary from the neural model. This enables the dictionary to be extended post-training. Finally, we compare text correction systems with various configurations to demonstrate how these systems can be effectively used to produce corrections. Our experiments show that 1) minor changes to dictionary-based methods can significantly improve correction performance, 2) neural-based correction systems can be trained using a fraction of the data, and 3) XNCC can have its dictionary extended to generalize to new data without re-training. Lastly, we provide recommendations for progressively building text correction systems at multiple levels of implementation effort based on our findings.


I. INTRODUCTION
With the ubiquity of the Internet and smart devices, text-based communication has increased to an unprecedented scale. Accurate spelling correction systems have become essential for businesses when conducting critical written communications (e.g., emails, customer support chats, and social media presences). In addition, proper application of spelling correction on user-generated social text has been shown to improve accuracy in downstream Natural Language Processing tasks [1]. Nevertheless, the development of such systems remains difficult. The complexity of the Thai language, with its ambiguous word boundaries and multiple valid alternative words for minor character-level modifications, poses challenges for accurate spelling correction. Existing dictionary-based methods lack the necessary robustness without human assistance [2]. On the other hand, state-of-the-art approaches like Seq2Seq models rely on expensive human-annotated corpora of erroneous and corrected text.
The associate editor coordinating the review of this manuscript and approving it for publication was Long Xu.
A significant hurdle in spelling correction is handling out-of-vocabulary (OOV) tokens [2], [3], [4]. While dictionary-based methods can be easily extended by the user, Seq2Seq models are limited by the initial vocabulary. As a result, special structures are employed to enable correctors to handle OOV tokens. However, these techniques only allow the model to leave OOV tokens untouched; they do not allow it to produce corrections outside the initial vocabulary.
This brings us to the more general issue that different use cases for correctors will have different target vocabularies. Despite the existence of official Thai dictionaries [5], [6], they are rarely used in isolation by dictionary-based correctors since new words, borrowed words, slang, and domain-specific words are a significant part of written communication. Off-the-shelf correctors often use dictionaries built from the target text corpus [7] or add external dictionaries to alleviate this issue. Although these general-purpose dictionaries are far from perfect, they can be extended to fit the needs of the users.
In this study, we demonstrate how to build text correction systems progressively and effectively. First, we introduce our data annotation pipeline that efficiently produces data that is applicable to a variety of existing text correction methods. Our approach to annotation allows us to produce data for both dictionary-based and neural-based text correction systems. Second, we introduce the Extendable Neural Contextual Corrector (XNCC), a neural-based text correction method that can be extended with new vocabulary post-training. Our corrector decouples the internal dictionary from the neural text embeddings, allowing extensions to the dictionary at inference time like traditional dictionary-based methods while producing corrections based on the context provided by the surrounding text. Finally, we outline steps implementors can take to progressively build effective text correctors at different levels of effort and resources.
This paper is structured as follows. Section II outlines work related to Thai text correction. Section III details our recommended annotation pipeline for efficient data annotation. Section IV lays out our proposed text correction method, the Extendable Neural Contextual Corrector (XNCC). Section V describes our experimental setup to evaluate various text correction methods at various stages of development. Section VI discusses the results of the experiments. Section VII concludes with the trade-offs of various text correction systems and provides recommendations for progressively building effective text correctors.

II. RELATED WORKS
The aim of spell correction is to correct erroneous words into the words originally intended. Spell correctors can be viewed as noisy channel models which aim to produce the most probable correction. In noisy channel modeling [8], the corrector models the signal (language modeling) and the channel (i.e., errors being introduced into the text) to produce corrections. In addition to the corrector, correction systems often feature a detector to improve both speed and accuracy. In this section, we will examine the various approaches to spell correction by formulating how each method models the noisy channel problem.
Dictionary-based correctors have the simplest modeling. Error detection is straightforward: a token is considered erroneous if it does not exist within the dictionary [9]. For languages without explicit token boundaries (e.g., Thai), a word tokenizer is required for preprocessing. These correctors are publicly available as free open-source software (e.g., Peter Norvig [10], SymSpell [9], Hunspell) and often come with ready-to-use dictionaries. Dictionary-based correctors (e.g., Peter Norvig, SymSpell) use lexical methods to model the channel. For example, Peter Norvig and SymSpell use Damerau-Levenshtein distance; that is, correction candidates with a higher distance are less preferred. Simplistic forms of language modeling such as word frequency and word n-grams are used for tie breaking.
To better model the corrections, methods have been proposed to augment both the channel model and the language model. Character confusion [11] and multi-character edits [12] have been proposed for error correction in Thai optical-character-recognition systems. References [13] and [14] have proposed Soundex with multi-model reranking for error correction in Thai search engine queries.
Since tokenizers are not accurate on noisy text, errors propagated from tokenization are problematic for token-based correctors. Specialized correctors such as [3] and [11] for Thai and [15] for Chinese operate at the character level and utilize a pruned search algorithm.
Modern Thai correction adopted neural-based sequence-to-sequence (Seq2Seq) methods from the English grammatical error correction literature. Seq2Seq methods model both the channel and the language in an end-to-end manner. This exploits the ability of neural networks to learn features automatically from erroneous-correct text pairs. In addition, modern correctors meant for automatic correction have structures to handle OOV tokens (e.g., names, new words). Copy-Augmented Transformer [4] can copy tokens from the source text, allowing the model to produce corrections with unaltered OOV tokens. The two-stage corrector [2] uses a detection stage to mask out non-erroneous OOV tokens prior to correction. However, these only allow leaving OOV tokens uncorrected. On the other hand, dictionary-based methods can produce corrections with new tokens once those tokens are added to the dictionary. Since the internal vocabulary of Seq2Seq models is tied to the learnt neural embeddings, expanding the dictionary requires model re-training and additional text containing the new tokens.
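The dictionary-based scheme described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any of the cited tools: the function names, the `max_distance` cutoff, and the English toy dictionary are assumptions made for the example, and the distance is the restricted (optimal string alignment) variant of Damerau-Levenshtein.

```python
from collections import Counter


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def correct_token(token: str, word_freq: Counter, max_distance: int = 2) -> str:
    """Keep in-dictionary tokens; otherwise pick the closest dictionary word."""
    if token in word_freq:
        return token  # detection: in-dictionary tokens are treated as correct
    candidates = [
        (damerau_levenshtein(token, word), -freq, word)
        for word, freq in word_freq.items()
    ]
    candidates = [c for c in candidates if c[0] <= max_distance]
    if not candidates:
        return token  # nothing close enough; leave the token unchanged
    # Lowest distance first; word frequency breaks ties.
    return min(candidates)[2]
```

Note that a real-word error such as ''sea'' passes through unchanged whenever ''sea'' is itself in the dictionary, which is the robustness gap that context-sensitive methods aim to close.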

III. ANNOTATION
Our proposed annotation routine aims to address two issues: expanding compatibility with text correctors and reducing the human effort required to annotate data.
Producing annotation data that is compatible with a variety of text correctors allows implementors to pick the text correction methods that best suit their needs without locking them into a specific class of models in the future. We annotate at the character level instead of the word level. Corrections are marked with the goal of one annotation covering one word instead of a contiguous segment of errors. For example, the erroneous text ''I sea u'' would be annotated with two separate corrections (''sea'' to ''see'' and ''u'' to ''you'') instead of a single correction spanning several words (e.g., ''sea u'' to ''see you''). This lets us derive a dictionary of individual correct and erroneous tokens required for dictionary-based methods. Unlike previous methods [2], [21] where the absence of corrective annotation denotes correct tokens, annotators explicitly annotate both correct and erroneous tokens. We found this method to be highly effective at preventing erroneous tokens from being introduced to the correct-tokens vocabulary, which is important for dictionary-based text correctors.
VOLUME 11, 2023
To reduce the effort required to build enough data to develop text correction systems, we utilize a dictionary-based maximum-matching tokenizer to dynamically produce automatic annotations. This exploits the fact that most errors are non-word errors (erroneous tokens outside of the dictionary), as shown in Table 7, and thus can be detected by dictionary-based methods. Detected erroneous tokens are automatically annotated with their most common correction. As a result, texts that feature tokens from previous annotations are automatically annotated and only require verification from the human operators. Unannotated data are queued up for annotation by the number of meaningful characters not covered by the automatic annotations. To prevent train-test leakage, annotations from the test-sets must not be used by the automatic annotator when developing the training-sets and the development-sets. All development-sets and test-sets are fully annotated to ensure accurate evaluation. As a result, for each data source we recommend annotating the development-set first, followed by the training-set, and then the test-set. This ordering maximizes vocabulary coverage since the annotators do not need to label tokens already present in the development-set. Fig. 1 shows the annotation coverage of our Channel A data (see Section V-B) as we annotate the data. Automatic annotation using only data from the development-set achieves 89.4% character coverage and 89.8% token coverage on the training-set.
For our experiments in Section V, we explicitly instruct our annotators to verify every automatic annotation for every line they annotated. However, our preliminary experiment on similar data only required our annotators to fill in the gaps left by the automatic annotations (see Appendix B). Both methods have their own pros and cons. Annotating only the gaps left by the automatic annotations reduces the effort required for each data entry. This approach can lead to more unique non-word errors being discovered given the same annotation effort, but at the expense of real-word errors being left unannotated.
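The automatic annotation loop can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' tooling: `max_match`, `auto_annotate`, and `uncovered_chars` are hypothetical names, ''meaningful characters'' is approximated as non-whitespace characters, and an English toy vocabulary stands in for Thai.

```python
from collections import Counter


def max_match(text, vocab):
    """Greedy longest-match tokenizer; unmatched characters become 1-char tokens."""
    longest = max(map(len, vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                break
        else:
            piece = text[i]  # no dictionary entry starts here
        tokens.append(piece)
        i += len(piece)
    return tokens


def auto_annotate(text, correct_vocab, corrections):
    """Pre-annotate `text` from earlier annotations.

    corrections maps an error token to a Counter of the corrections
    annotators chose for it; the most common correction is proposed.
    """
    vocab = set(correct_vocab) | set(corrections)
    annotated = []
    for token in max_match(text, vocab):
        if token in corrections:
            proposal = corrections[token].most_common(1)[0][0]
            annotated.append((token, proposal, "error"))
        elif token in correct_vocab:
            annotated.append((token, token, "correct"))
        else:
            annotated.append((token, None, "uncovered"))
    return annotated


def uncovered_chars(annotated):
    """Queueing key: non-whitespace characters left for the human annotator."""
    return sum(
        len(tok) for tok, _, status in annotated
        if status == "uncovered" and not tok.isspace()
    )
```

Sorting the unannotated entries by `uncovered_chars` gives one possible queue order; the paper does not state whether entries with more or fewer uncovered characters are served first.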
At training time, unannotated data are annotated using the same method for producing automatic annotations. Although some of the automatic annotations are mislabeled, they are still helpful when either in-domain data is abundant, or annotation capacity is limited. Nevertheless, our results (in Section VI) demonstrate that performant neural correctors can be built using this practice.

IV. EXTENDABLE NEURAL CONTEXTUAL CORRECTOR (XNCC)
Our proposed method, XNCC, consists of four modules: text normalization, tokenization & masking (TokM), error detection, and correction. The overall structure is shown in Fig. 2. The primary contribution of XNCC is the novel correction module, which utilizes neural-based generation techniques to separately model the error channel and the language. The input text is passed to the text normalization module (detailed in Appendix A). The normalized text is tokenized, with tokens outside our scope masked out by predefined rules (see Section IV-B). Concurrently, a neural-based error detector scans the normalized text for errors (see Section IV-C). The error ranges produced by the detection module are projected onto the tokenized and masked text. The unmasked tokens that overlap with error ranges are merged into erroneous segments (requiring correction). The tokens, along with the erroneous segments, are passed to the correction module (see Section IV-D), which produces the final corrected sequence.
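The projection of error ranges onto tokens can be illustrated with a short sketch. The function name, the representation of error ranges as half-open character offsets, and the choice to merge only directly adjacent erroneous tokens are assumptions made for this example.

```python
def merge_error_segments(tokens, masked, error_ranges):
    """Project character-level error ranges onto tokens.

    tokens:       list of strings that concatenate to the normalized text
    masked:       list of bools, True where TokM masked the token
    error_ranges: list of (start, end) character offsets, end exclusive
    Returns a list of (text, needs_correction) pairs in which adjacent
    erroneous tokens are merged into a single segment.
    """
    segments = []
    pos = 0
    for token, is_masked in zip(tokens, masked):
        start, end = pos, pos + len(token)
        pos = end
        overlaps = any(s < end and start < e for s, e in error_ranges)
        erroneous = overlaps and not is_masked  # masked tokens are never corrected
        if erroneous and segments and segments[-1][1]:
            segments[-1] = (segments[-1][0] + token, True)
        else:
            segments.append((token, erroneous))
    return segments
```

For the input ''I sea u #U'', a tokenizer that does not know ''sea'' may split it into ''se'' and ''a''; merging restores one erroneous segment ''sea'', while the masked hashtag is left untouched even if the detector flags it.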

A. EXTENDABLE DICTIONARIES
In addition to the static vocabularies used by neural models, XNCC also features two extendable dictionaries: the error dictionary (error-dict) and the correct dictionary (correct-dict). This allows recognition and correction of new words without model retraining.
The error-dict is a many-to-many mapping from misspellings (error-tokens) to their possible corrections (correct-tokens). For example, the erroneous token '' '' is mapped to two correct tokens: '' '' and '' ''. The error-dict can be produced from the individual annotations that contain a correction. Moreover, frequently misspelled words can also be directly added to the error-dict. The correct-dict is a many-to-one mapping from correct-tokens to a token in the neural-based Language Model vocabulary (LM-vocab). Since both the correct-dict and the LM-vocab are built from the correct-tokens present in the training data, the tokens in the correct-dict are typically mapped to the LM-token of the same spelling. On the other hand, rare correct-tokens that are cut off from the LM-vocab are mapped to the UNK token. When a correct-token is added to the correct-dict, it should be mapped to an LM-token that is a synonym. If a synonym does not exist, it is mapped to the UNK token.
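The two dictionaries can be pictured as plain mappings that sit in front of a fixed LM vocabulary. The class below is a hypothetical sketch: the class name, method names, and the `<UNK>` spelling are illustrative and not taken from the paper.

```python
UNK = "<UNK>"  # placeholder spelling for the LM's unknown token


class ExtendableDicts:
    """Error-dict and correct-dict layered over a frozen LM vocabulary."""

    def __init__(self, lm_vocab):
        self.lm_vocab = set(lm_vocab) | {UNK}  # fixed after LM training
        self.correct_dict = {}  # correct-token -> LM-token (many-to-one)
        self.error_dict = {}    # error-token -> set of correct-tokens

    def add_correct(self, token, synonym=None):
        """Register a correct token, mapping it onto an existing LM-token."""
        if token in self.lm_vocab:
            target = token        # same spelling already known to the LM
        elif synonym is not None and synonym in self.lm_vocab:
            target = synonym      # borrow the statistics of a synonym
        else:
            target = UNK          # no usable LM-token
        self.correct_dict[token] = target

    def add_error(self, error, correction):
        """Register a misspelling together with one of its corrections."""
        if correction not in self.correct_dict:
            self.add_correct(correction)
        self.error_dict.setdefault(error, set()).add(correction)
```

Because neither mapping touches the neural embeddings, entries can be added after training; the LM only ever sees the LM-token that a correct-token maps to.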
Manual additions to the extendable dictionaries are highly task specific. Our preliminary experiment of directly adding the whole official Thai dictionary [5] has resulted in unsatisfactory performance since archaic words are not present in our data.

B. TOKENIZATION & MASKING (TokM)
Our TokM module enables the model to recognize tokens present in the correct-dict, while masking out some portions of the text to prevent correction. The TokM module consists of four stages: multi-mask-tokens, dictionary-tokens, single-mask-tokens, and special-tokens. Each stage extracts tokens in positions not previously marked by the prior stages. Since each stage also masks the stages after it, the stages are ordered according to their size. For our task, we have categorized six types of mask-tokens: URLs, hashtags, numbers, alphanumeric codes, English text, and text from other languages. First, the multi-mask-tokens stage is responsible for extracting URLs and hashtags, which can contain other regular tokens. Second, the dictionary-tokens stage uses the maximum-matching tokenizer with the correct-tokens dictionary to extract regular tokens. Third, the single-mask-tokens stage extracts and masks out numbers, alphanumeric codes, English text, and text from other languages. Lastly, the special-tokens stage collects the remaining text, which consists of symbols (e.g., space, '−', ';', ':', '!', '?', '.', '$', ',') and single-character regular tokens (i.e., '' '', '' '', '' '').
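A compact sketch of the four-stage ordering is given below. The regular expressions are simplified stand-ins (the actual masking rules are not listed in the text and masking of other languages is omitted), the function names are hypothetical, and the two Thai words are used only as a toy correct-dict.

```python
import re

# Simplified stand-ins for the mask categories; the real rules are not given.
MULTI_MASK = re.compile(r"https?://\S+|#\w+")   # URLs and hashtags
SINGLE_MASK = re.compile(r"[A-Za-z0-9]+")       # numbers, codes, English text


def regex_stage(segments, pattern, kind):
    """Tag every match of `pattern` inside still-unmarked segments."""
    out = []
    for text, seg_kind in segments:
        if seg_kind is not None:
            out.append((text, seg_kind))
            continue
        pos = 0
        for match in pattern.finditer(text):
            if match.start() > pos:
                out.append((text[pos:match.start()], None))
            out.append((match.group(), kind))
            pos = match.end()
        if pos < len(text):
            out.append((text[pos:], None))
    return out


def dictionary_stage(segments, correct_dict):
    """Maximum matching against the correct-dict inside unmarked segments."""
    longest = max(map(len, correct_dict), default=1)
    out = []
    for text, seg_kind in segments:
        if seg_kind is not None:
            out.append((text, seg_kind))
            continue
        i = gap = 0
        while i < len(text):
            sizes = range(min(longest, len(text) - i), 0, -1)
            word = next((text[i:i + n] for n in sizes if text[i:i + n] in correct_dict), None)
            if word is None:
                i += 1
                continue
            if gap < i:
                out.append((text[gap:i], None))
            out.append((word, "dict"))
            i += len(word)
            gap = i
        if gap < len(text):
            out.append((text[gap:], None))
    return out


def tokenize_and_mask(text, correct_dict):
    """Run the stages in order; leftovers become single-character special tokens."""
    segments = [(text, None)]
    segments = regex_stage(segments, MULTI_MASK, "mask")
    segments = dictionary_stage(segments, correct_dict)
    segments = regex_stage(segments, SINGLE_MASK, "mask")
    tokens = []
    for seg_text, kind in segments:
        if kind is None:
            tokens.extend((ch, "special") for ch in seg_text)
        else:
            tokens.append((seg_text, kind))
    return tokens
```

Running the multi-mask stage first keeps a URL or hashtag in one piece even when it contains dictionary words, which is the reason the stages are ordered from largest to smallest.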

C. ERROR DETECTION
Our error detection module features the same error detector as [2], which is a multi-layer Bidirectional-LSTM sequence tagger with Bi-LSTM character encoding [22], [23].
The input text is tokenized into a sequence of tokens $w_1, \ldots, w_n$. This specific tokenizer uses the detector vocabulary as its dictionary. The error detector produces a label of either Error or Correct for each input token $w_i$. The vocabulary is derived from the source tokens in the training data. Therefore, it includes both correct and erroneous tokens. The normalized text from our annotation routine (see Section III) is tokenized with this vocabulary. The word-level tokens are labeled as Error if they overlap with any corrective annotation. The model is trained with gradient descent. The hyperparameters are listed in Appendix C.

D. ERROR CORRECTION
The corrector searches for the sequence of corrective operations that produces the lowest correction cost. Corrective operations are split into character-operations (char-ops) and token-operations (token-ops). There are three types of char-ops (i.e., char-acceptance $CA_*$, char-deletion $CD_*$, char-insertion $CI_*$) and three types of token-ops (i.e., sequence-begin $T_{BEGIN}$, token-acceptance $TA_*$, sequence-terminate $T_{END}$). Intuitively, the char-ops dictate how input characters are accepted or modified, while the token-ops dictate how the resulting characters are decoded into tokens. Fig. 3 shows a correction of an example input ''I sea u #U'', which produces the target sequence ''I see you #U'', alongside the corrective operations. Calculation of the correction cost is detailed in Section IV-D1. Section IV-D2 outlines the rules for generating corrective operations and implementation details on how to keep track of the search state. Section IV-D3 explains how the corrector optimizes for the corrective operations with the lowest cost.

1) CORRECTION COST
The corrector uses the Edit Model and the Language Model to compute the correction cost of possible output sequences. Given a possible output sequence $Tokens$, the correction cost is defined as the sum of the fluency cost and the edit cost for each token $t_i$ (see Eq 1). Intuitively, the fluency cost penalizes improbable output sequences while the edit cost penalizes improbable edits.
The fluency cost $Flu_{t_i}$ is computed from the negative log-probability of each token $t_i$ in $Tokens$, as estimated by the Language Model, minus the token-reward constant $\alpha$, clipped to a minimum value of zero (see Eq 2). The token-reward constant $\alpha$ is used to alleviate the preference for short sequences when decoding [24], [25].
If the token $t_i$ was generated using modifying char-ops, such as $CD_*$ or $CI_*$, the edit cost $Edit_{t_i}$ is the sum of the edit-cost constant $\gamma$ and the negative log-probabilities associated with each modifying char-op ($e_j \in Edits$) used to produce the token $t_i$, scaled by $\beta$ (see Eq 3). However, if the token $t_i$ was produced solely through char-acceptance operations ($CA_*$), the edit cost depends on the dictionary from which the token was decoded: it is zero ($Edit_{t_i} = 0$) if the token was decoded from the correct dictionary and the map-cost constant ($Edit_{t_i} = \delta$) if it was decoded from the error dictionary.
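For readers without access to the typeset equations, the prose definitions above can be written out as follows. This is a reconstruction from the surrounding text only, not the original typesetting: the symbols $P_{LM}$ and $P_{EM}$, the conditioning on preceding tokens and edits, and the placement of $\beta$ outside the whole bracket are assumptions.

```latex
% Correction cost of an output sequence (the quantity referred to as Eq 1)
\mathrm{Cost}(Tokens) = \sum_{t_i \in Tokens} \left( Flu_{t_i} + Edit_{t_i} \right)

% Fluency cost with token reward \alpha, clipped at zero (Eq 2)
Flu_{t_i} = \max\left( 0,\; -\log P_{LM}(t_i \mid t_{<i}) - \alpha \right)

% Edit cost (Eq 3 and the two char-acceptance cases)
Edit_{t_i} =
\begin{cases}
\beta \left( \gamma + \sum_{e_j \in Edits} -\log P_{EM}(e_j \mid e_{<j}) \right)
  & \text{if } t_i \text{ was produced with } CD_* \text{ or } CI_* \\
0 & \text{if } t_i \text{ was decoded from the correct-dict using only } CA_* \\
\delta & \text{if } t_i \text{ was decoded from the error-dict using only } CA_*
\end{cases}
```

Under this reading, a token that is accepted verbatim from the correct-dict contributes only its fluency cost, so the corrector changes the input only when the gain in fluency outweighs the edit cost.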

2) SEQUENCE GENERATION
The corrector produces a target sequence by performing corrective operations. The specific rules used for generating operations are outlined below. Tokens from the correct-dict, the error-dict, and special-tokens (from Section IV-A) are populated into a trie. The trie is used to constrain the search space by preventing the exploration of char-ops that would not lead to a valid token. Tokens from the error-dict and special-tokens in the trie are unreachable if $CD_*$ or $CI_*$ was performed since the last $TA_*$.
For every possible sequence being produced, the corrector keeps track of the following as the search state: a pointer to the trie representing the current token being produced, a cursor on the input text, the set of characters that have been deleted since the last $CA_*$ (the delete-set), and the correction cost of the sequence.
The root search state starts with the initial operation $T_{BEGIN}$. The pointer points to the root trie node, the cursor is set to the beginning of the input, and the cost is set to zero.
Character-acceptance ($CA_*$) can be performed if the cursor is not at the end, the current trie node has an edge labeled with the character at the cursor, and that character is not in the delete-set. When performed, the pointer advances to the node of the same character, the cursor is moved by one character, and the delete-set is cleared.
Character-deletion ($CD_*$) can be performed if the cursor is not at the end and the previous op is either $CA_*$, $CD_*$, or $T_{BEGIN}$. When performed, the edit cost for deleting the character at the cursor is added, that character is added to the delete-set, and the cursor is moved by one character.
Character-insertion ($CI_*$) can be performed for any character that labels an edge from the current trie node and is not in the delete-set. When performed, the pointer advances to the node of the same character and the edit cost for inserting the character is added.
Token-acceptance ($TA_*$) can be performed if the trie node is a token, and either the cursor is at the end or the cursor character is not in the delete-set. When performed, the pointer resets to the root trie node and the fluency cost for producing the corresponding correct token is added. If the token is from the error-dict, the edit cost is the map-cost $\delta$. If all character ops since the last $TA_*$ are $CA_*$, the edit cost is zero. Otherwise, the edit cost is the cost of terminating the edit sequence.
Terminate ($T_{END}$) can be performed if the cursor is at the end and the pointer is at the root trie node. When performed, the fluency cost for producing an END token is added. This operation marks the end of the sequence generation.
When the cursor is operating on portions of text that are not marked for correction, only $TA_*$ ops are allowed. The generation of $TA_*$ is based on the token boundaries produced by TokM (see Section IV-B).
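The preconditions above can be collected into a single function that lists the operations available in a given search state. This is a sketch under simplifying assumptions: the trie pointer is represented by the prefix string spelled so far, costs and the error-dict reachability rule are left out, and `SearchState`, `build_children`, and `allowed_ops` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchState:
    cursor: int = 0                   # position in the input text
    prefix: str = ""                  # stands in for the trie pointer
    deleted: frozenset = frozenset()  # characters deleted since the last CA
    last_op: str = "T_BEGIN"          # kind of the previous operation


def build_children(tokens):
    """Map every token prefix to the set of characters that may follow it."""
    children = defaultdict(set)
    for token in tokens:
        for k in range(len(token)):
            children[token[:k]].add(token[k])
    return children


def allowed_ops(state, text, tokens, children):
    """List the operations whose preconditions hold in `state`."""
    ops = []
    at_end = state.cursor == len(text)
    char = None if at_end else text[state.cursor]
    next_chars = children.get(state.prefix, set())
    if not at_end and char in next_chars and char not in state.deleted:
        ops.append(f"CA_{char}")
    if not at_end and state.last_op in {"CA", "CD", "T_BEGIN"}:
        ops.append(f"CD_{char}")
    ops.extend(f"CI_{c}" for c in sorted(next_chars - state.deleted))
    if state.prefix in tokens and (at_end or char not in state.deleted):
        ops.append(f"TA_{state.prefix}")
    if at_end and state.prefix == "":
        ops.append("T_END")
    return ops
```

With the vocabulary {''you''} and the input ''yu'', deleting ''y'' leaves only the deletion of ''u'' available: re-inserting the character that was just deleted is blocked by the delete-set, which removes redundant delete-then-insert paths from the search.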

3) OPTIMIZATION
Our approach for optimizing the target sequence uses Dijkstra's algorithm [26] to find the lowest-cost path from the initial state $T_{BEGIN}$ to the terminal state $T_{END}$. This method was inspired by the use of search for token decoding in [11] and [15]. We have also incorporated a modified beam search to reduce the search space. Beam search is a greedy algorithm that restricts the number of paths to explore at each level [27]. Our preliminary experiments revealed that defining the beam depth as the output length leads to the incorrect insertion of new characters for the sake of increasing the number of tokens. Therefore, we propose a modification to the traditional implementation of beam search in which the depth is defined as the number of input characters consumed (i.e., the cursor position, see Section IV-D2), so that input consumption is prioritized instead.
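A heavily simplified version of this search is sketched below to make the idea concrete. It is not the XNCC decoder: fixed per-token costs stand in for the Language Model, constant insertion and deletion costs stand in for the Edit Model, the delete-set and error-dict rules are omitted, and the beam is approximated by capping how many states are expanded per cursor position. All names and default values are assumptions.

```python
import heapq
from collections import Counter, defaultdict


def build_children(tokens):
    """Map every token prefix to the set of characters that may follow it."""
    children = defaultdict(set)
    for token in tokens:
        for k in range(len(token)):
            children[token[:k]].add(token[k])
    return children


def search(text, token_cost, ins_cost=2.0, del_cost=2.0, beam_width=64):
    """Lowest-cost decoding of `text` into dictionary tokens.

    token_cost maps each correct token to a fixed fluency cost.
    Returns (tokens, cost), or None if the beam pruned every path.
    """
    children = build_children(token_cost)
    heap = [(0.0, 0, "", ())]  # (cost, cursor, trie prefix, tokens so far)
    settled = set()
    expanded = Counter()       # states expanded per cursor position
    while heap:
        cost, cursor, prefix, tokens = heapq.heappop(heap)
        if cursor == len(text) and prefix == "":
            return list(tokens), cost  # terminal state reached
        if (cursor, prefix) in settled or expanded[cursor] >= beam_width:
            continue
        settled.add((cursor, prefix))
        expanded[cursor] += 1
        if prefix in token_cost:  # token-acceptance: emit token, back to root
            heapq.heappush(heap, (cost + token_cost[prefix], cursor, "", tokens + (prefix,)))
        if cursor < len(text):
            char = text[cursor]
            if char in children[prefix]:  # char-acceptance
                heapq.heappush(heap, (cost, cursor + 1, prefix + char, tokens))
            # char-deletion: skip the input character
            heapq.heappush(heap, (cost + del_cost, cursor + 1, prefix, tokens))
        for char in children[prefix]:     # char-insertion
            heapq.heappush(heap, (cost + ins_cost, cursor, prefix + char, tokens))
    return None
```

Keying the cap on the cursor position rather than on the number of emitted tokens means that paths compete with others that have consumed the same amount of input, which is the property the modified beam search is meant to provide.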

4) LANGUAGE MODEL (LM)
The Language Model (LM) is a simple autoregressive recurrent neural network with a two-layer Gated Recurrent Unit (GRU) [28] and shared weights between the embeddings and the output layer. The architecture is shown in Fig. 4. Tokens are embedded by the projection layer. The embeddings are encoded by the two-layer bi-directional GRU. The encodings are projected back to the token space by the projection layer. Effectively, the model performs cosine similarity between each encoding and the token embeddings.
The vocabulary is derived from the corrected tokens in the training data plus three special tokens: BEGIN, END, and UNK. The LM is trained with gradient descent to model sequences of correct tokens. The hyperparameters are listed in Appendix C.

5) EDIT MODEL (EM)
The Edit Model (EM) shares the same architecture as the Language Model, as shown in Fig. 4. However, the EM models a sequence of edit operations instead of a sequence of word-tokens. The vocabulary consists of three variants of edits for each character in the character set and two special tokens: BEGIN and END. The three variants correspond to the char-ops (detailed in Section IV-D2). Vocabulary coverage of the input is ensured by text normalization (see Appendix A) and token masking (see Section IV-B).
The EM is trained on sequences of edit operations produced from error-correct word pairs derived from the annotations. The frequency distribution of the error-correct pairs in the data is maintained during training. The hyperparameters are listed in Appendix C.
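One way to derive such training sequences from an error-correct pair is to align the two strings and emit one char-op per character. The sketch below uses `difflib.SequenceMatcher` from the Python standard library for the alignment; the paper does not say which alignment procedure is used, so this choice, the function name, and the `CA_x`/`CD_x`/`CI_x` spelling of the ops are assumptions.

```python
from difflib import SequenceMatcher


def edit_ops(error: str, correct: str) -> list:
    """Turn an error-correct pair into a sequence of char-op symbols."""
    ops = ["BEGIN"]
    matcher = SequenceMatcher(a=error, b=correct, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops += [f"CA_{c}" for c in error[i1:i2]]    # accept unchanged chars
        else:
            # "delete", "insert" and "replace" all reduce to deletions of the
            # source span followed by insertions of the target span.
            ops += [f"CD_{c}" for c in error[i1:i2]]
            ops += [f"CI_{c}" for c in correct[j1:j2]]
    ops.append("END")
    return ops
```

For the pair (''u'', ''you'') this yields two insertions followed by one acceptance. Emitting deletions before insertions inside a replaced span matches the ordering implied by the delete-set rules in Section IV-D2.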

V. EXPERIMENT SETUP
Our experiment setup aims to evaluate three scenarios: 1) building text correctors from scratch with data annotation, 2) transferring text correctors to another domain with additional data annotation, and 3) transferring text correctors to another domain without data annotation. Our experiment consists of three data channels, which correspond to the three scenarios. The data from each channel are described in Section V-A. Data entries from each channel are shuffled and split into three sets: training-set, development-set, and test-set. Models are built or trained using the training-set and the development-set unless explicitly stated otherwise, whereas the test-set is used for evaluation. For each of the three scenarios, the models are evaluated in the various configurations detailed in Section V-C.

A. DATA SOURCES
Our data is collected from three automated text-based chatbot channels, which are named A, B, and C in this paper. Data is only collected from the client side (the chatbot's automated responses are not included). Channels A and B are used by separate groups of our customers, whereas channel C is used by our internal staff. Due to the short nature of chat messages (as shown in Fig. 5, 6, and 7), many duplicate entries exist across multiple users and channels. The text data is normalized (detailed in Appendix A) with the duplicates removed on a per-channel basis. A data entry is a unique normalized text from some channel. The overlap of data entries between different channels is shown in Fig. 8 and 9.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Overall, texts from channels A and B are shorter and contain more errors than those from channel C. This is apparent when observing the length distribution (shown in Fig. 5) and the base error rate of the data (shown in the ''Do nothing'' row of Tables 3, 4, and 5).

B. ANNOTATION INFORMATION
Our data is annotated in the following order: A development-set, A training-set, A test-set, B development-set, B test-set, and C test-set. Our annotation routine is detailed in Section III. The training-set of channel A is partially annotated until the automatic annotations cover ∼98% of the data. The training-set of channel B is annotated purely with automatic annotations. Details of each data split are shown in Table 1. Fig. 10 shows the annotation speed over an hour of continuous annotation. A single annotator is able to annotate 1,899 tokens in an hour, averaging 31.7 tokens per minute. Annotations include both confirming tokens from automatic annotations and annotating new tokens.

C. CONFIGURATIONS
Given one of the three scenarios, an implementor has multiple options to utilize the available data.
First scenario: the implementor is developing a new text corrector without pre-existing annotated data of their target domain (channel A). When starting from scratch, the implementor would start experimenting with off-the-shelf solutions. We start off with evaluating off-the-shelf methods on channel A, requiring only annotating the test-set. Then the implementor might start annotating their data for model training. We experimented with utilizing different amounts of the annotated training-set (i.e., none, half, and all) alongside a fully annotated development-set.
Second scenario: the implementor is developing a text corrector for a new domain (channel B) that is similar to an existing domain (channel A). We experimented with directly utilizing the existing text correctors to see how they generalize to the new domain. Then, we evaluate how each model performs when given additional in-domain data for training (also known as transfer learning for neural-based methods).
Third scenario: the implementor is developing a text corrector for a new domain (channel C) that is significantly different from the annotated data they already have. Like the second scenario, existing text correctors were evaluated on the new domain. However, instead of continuing experimenting by annotating more data, we focus on methods that can easily (retraining not required) have their vocabularies extended to evaluate how each text corrector performs when new words are added by the users.

D. EVALUATION CRITERIA
Word-error-rate (WER) and Generalized Language Evaluation Understanding (GLEU) were chosen for their use in prior Thai text correction research [2], [3]. WER is the ratio of the number of errors in a given sequence to the length of the reference sequence. Errors are defined as the minimum number of insertions, deletions, and substitutions needed to correct the given sequence to match the reference. GLEU, at a high level, compares the given sequence against the reference at the n-gram level instead of the level of individual word-tokens. Given two sequences of the same WER, GLEU will penalize sequences with a sparse distribution of errors, which is found to have a higher correlation with human judgment [29].
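The WER definition above corresponds to a token-level Levenshtein distance divided by the reference length. A minimal sketch, with a hypothetical function name and an English toy example:

```python
def word_error_rate(hypothesis, reference):
    """Token-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the distance between the hypothesis prefix processed so
    # far and the first j reference tokens.
    prev = list(range(len(reference) + 1))
    for i, hyp_token in enumerate(hypothesis, 1):
        curr = [i]
        for j, ref_token in enumerate(reference, 1):
            curr.append(min(
                prev[j] + 1,                            # drop a hypothesis token
                curr[j - 1] + 1,                        # add a reference token
                prev[j - 1] + (hyp_token != ref_token),  # substitute or match
            ))
        prev = curr
    return prev[-1] / len(reference)
```

Scoring the uncorrected input against the reference in this way gives the base error rate that the ''Do nothing'' rows of the result tables report; for example, ''I sea u'' against ''I see you'' has two substitutions over three reference tokens.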

VI. RESULTS & DISCUSSION
This section shows and discusses the results of the experiments outlined in Section V. Ablation study of XNCC is detailed in Appendix B-A.
For the first scenario, we experiment with developing text correction systems from scratch for channel A. Evaluations of all models and configurations are shown in Tables 2 and 3. Publicly available dictionary-based methods along with their built-in dictionaries were unable to reduce the overall errors. However, the combination of Hunspell with a clean dictionary produced with our annotation routine and an error-aware tokenizer was able to reduce the overall number of errors. Fig. 11 and 12 show the results at varying amounts of training data. The two-stage contextual attention (2-stage Ctx-Attn) performed the best in reducing the overall number of errors at all amounts of annotated data. XNCC, although significantly better than Hunspell, is not as accurate as the 2-stage Ctx-Attn.

FIGURE 12. Word-error-rate on channel A with varying amount of training data.

FIGURE 13. XNCC GLEU scores on channel C with varying amount of test-set tokens added to the dictionary. The 0% and 100% results are the same as the ones reported in Table 5.

FIGURE 14. XNCC WER on channel C with varying amount of test-set tokens added to the dictionary. The 0% and 100% results are the same as the ones reported in Table 5.

For the second scenario, we experiment with developing a text correction system for another domain (i.e., channel B). Results show that XNCC and 2-stage Ctx-Attn perform similarly, with around 60% error rate reduction (Table 4). Transferring the annotated data from channel B back to channel A results in higher correction performance for all models across the board (Table 3, Fig. 11 and 12).
For the third scenario, we experiment with repurposing existing text correction systems in a significantly different and more challenging domain (i.e., channel C, which has a significantly lower base error rate). Without modifications, all methods were unable to produce corrections that further reduce the error rate. However, XNCC was able to further reduce the error rate by 38% and 45% when its dictionary was extended with correct tokens and error-correct token pairs, respectively (Table 5). Fig. 13 and 14 show the GLEU scores and WER of XNCC at varying amounts of extensions to the dictionary. The correct tokens and the corresponding error-correct token pairs are added in descending order of their frequency.

VII. CONCLUSION
This paper evaluated how a multitude of text correction approaches perform at various stages of development.
When starting from scratch in some domain (channel A), off-the-shelf systems alone are unsuitable for performing automatic text correction since they introduce more errors than they correct. While prior work has not found success in utilizing dictionary-based methods [2], we have found that, given a clean dictionary and a tokenizer that is aware of erroneous tokens, dictionary-based systems can produce corrections that reduce the overall error rate. Since Hunspell and maximal matching tokenization implementations are publicly available, they serve as a good starting point for implementors. Given access to computational resources and engineering effort, the two-stage contextual attention (2-stage Ctx-Attn) [2] performed the best at all amounts of annotated data for correcting in-domain text. Our proposed correction method significantly outperformed Hunspell but is not as accurate as 2-stage Ctx-Attn.
When adapting existing resources to produce text correction systems for a new domain of similar nature (channel B), directly utilizing effective systems developed for the existing domain (Hunspell, XNCC, and 2-stage Ctx-Attn trained on channel A) proved robust at reducing the total number of errors. However, more accurate corrections can be achieved with little annotation (only annotating the development-set). When jointly using annotated data from both domains, XNCC and 2-stage Ctx-Attn performed comparably on the new domain. Given the additional annotations from the new domain, adapting the data back to the original domain also provides an uplift in correction performance for all three effective methods.
When utilizing existing correctors on a significantly different and more challenging domain (one with a lower base error rate), XNCC is recommended. XNCC can have its dictionary easily extended with tokens for the new domain and produce corrections that further reduce the error rate. Since XNCC was specifically developed for Thai correction, it can be adapted to other languages without explicit token boundaries (e.g., Chinese, Japanese) or provide correction on inputs with incorrect boundary markers.

APPENDIX A TEXT NORMALIZATION
Our text normalization routine has three primary objectives: produce stable normalization (i.e., consistent output across multiple passes), conform to industry-standard NFKC-based normalization, and provide obvious non-destructive corrections.
The Thai character-set consists of consonants (C), vowel characters, and tone marks (T i.e., , , , ). One or more vowel characters are used to write actual vowels in the Thai language. There are four types of vowel characters used in modern text: leading vowel (L i.e., , , , , ), hanging vowel (H i.e., , , , , , , , ), following vowel (F i.e., , , ). This normalization attempts to produce text with the following pattern: "LCHTF". Thus, specific patterns of characters can be reordered non-destructively.
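One plausible reading of this routine is sketched below. It is not the routine used in the paper. It applies Python's standard `unicodedata.normalize` with NFKC and then moves any hanging vowel that was typed after a tone mark in front of it, so that H precedes T. The character ranges are assumptions taken from the Unicode Thai block rather than from the lists above, and the ordering of L and F characters is not handled.

```python
import re
import unicodedata

# Assumed character classes (Unicode Thai block, not the paper's own lists):
# tone marks U+0E48..U+0E4B; above/below vowel signs U+0E31, U+0E34..U+0E39.
TONE = "\u0E48-\u0E4B"
HANGING = "\u0E31\u0E34-\u0E39"

# A tone mark typed before a hanging vowel violates the "LCHTF" order.
_TONE_BEFORE_HANGING = re.compile(f"([{TONE}])([{HANGING}])")


def normalize(text):
    """Apply NFKC, then swap tone/hanging-vowel pairs until none remain."""
    text = unicodedata.normalize("NFKC", text)
    while True:
        reordered = _TONE_BEFORE_HANGING.sub(r"\2\1", text)
        if reordered == text:
            return text  # fixed point reached: a further pass changes nothing
        text = reordered


# KO KAI + MAI THO + SARA I (tone mark typed before the vowel).
messy = "\u0E01\u0E49\u0E34"
clean = normalize(messy)
```

Repeating the swap until a fixed point is reached is what makes the sketch stable: a second call to `normalize` on its own output returns the same string.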

APPENDIX B PRELIMINARY EXPERIMENTATION
Prior to the experiments in Section V, we carried out preliminary experimentation on an older version of the channel A data. There are three differences between the preliminary data and the final data. First, the development-set was not considered a separate data split that is fully annotated like the test-set. Instead, the development-set was part of the larger training-set and, as a result, mostly comprised automatic annotations. Second, the annotators were not instructed to confirm every automatic annotation. As a result, the annotators effectively performed partial annotations to fill in the gaps between the automatic annotations. Third, the data from other channels were combined into an out-of-domain set. Information about our preliminary data is shown in Tables 6 and 7. Results for the preliminary experiments are shown in Table 8 and are in line with the results in Table 3 from Section VI.

A. ABLATION STUDY
During our preliminary experimentation with XNCC, we conducted an ablation study to analyze the effect of each XNCC Corrector sub-module on end-to-end performance. We start with a bare corrector with only the correct-dict. In the absence of the LM and EM, the fluency cost Flu t i is zero, and the edit-cost of character-edits is γ. The results of various configurations are shown in Table 9. The mapping from error-dict to correct-dict was first added since it has the most significant impact on performance. This order aims to underscore that the mapping does not supersede other modules.
Overall, the results show that all three sub-modules in the XNCC Corrector module and the fine-tuning routine play a part in improving the final correction performance. In addition, the strong correction performance of entry #2 also suggests that a dictionary of misspelled tokens might be the missing trick to improve dictionary-based text correctors.
Lastly, we also experiment with extending the dictionary post-training. Entries #7 and #8 demonstrate the ideal case of extending the dictionary with relevant tokens.
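To make the bare configuration and the post-training extension concrete, the sketch below shows a toy dictionary corrector. It is not XNCC itself. The class name `BareCorrector`, the `max_cost` threshold, and the nearest-entry search are hypothetical. Only the ideas of a correct-dict, an error-to-correct mapping, and a constant cost γ per character edit come from the text.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (unit-cost edits)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


class BareCorrector:
    """Toy corrector: dictionary lookups plus a gamma-weighted edit cost."""

    def __init__(self, correct_dict, error_map=None, gamma=1.0, max_cost=2.0):
        self.correct_dict = set(correct_dict)
        self.error_map = dict(error_map or {})
        self.gamma = gamma        # cost of one character edit
        self.max_cost = max_cost  # assumed cut-off; not from the paper

    def extend(self, tokens=(), pairs=None):
        """Extend the dictionaries after training; no model update needed."""
        self.correct_dict.update(tokens)
        self.error_map.update(pairs or {})

    def correct(self, token):
        if token in self.correct_dict:
            return token
        if token in self.error_map:
            return self.error_map[token]
        # Fall back to the cheapest dictionary entry (ties: alphabetical).
        best = min(self.correct_dict,
                   key=lambda w: (edit_distance(token, w), w), default=None)
        if best is not None and self.gamma * edit_distance(token, best) <= self.max_cost:
            return best
        return token  # too costly to edit: leave the token unchanged


# English placeholders stand in for Thai tokens.
corrector = BareCorrector({"hello", "world"})
before = corrector.correct("xyzzy")       # no cheap candidate: unchanged
corrector.extend(pairs={"xyzzy": "hello"})
after = corrector.correct("xyzzy")        # now resolved via the error map
```

The point of the design is that `extend` touches only the dictionaries, which mirrors how extending the dictionary post-training avoids re-training.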

APPENDIX C HYPERPARAMETERS
Hyperparameters for all modules in XNCC are shown in Table 11. Reference to hyperparameters of the Error Detection module follows the same naming convention as [2]. The rest of the hyperparameters are named as per Section IV.

APPENDIX D ERROR ANALYSIS
We sampled and analyzed 30 corrected lines from the test-split of channel A produced by XNCC and 2-stage Ctx-Attn. The results are shown in Table 12. Of the 30 lines sampled, 6 did not require any correction. The remaining 24 lines contained 33 errors, of which 23 were non-word errors and 10 were real-word errors. For all 6 lines that did not require correction, both methods correctly left the lines unchanged. XNCC and 2-stage Ctx-Attn share the same set of corrected and uncorrected real-word errors. For the non-word errors, the errors that 2-stage Ctx-Attn corrected are a superset of those corrected by XNCC.
Overall, both correctors are very conservative with their corrections. Of the 30 lines analyzed, both methods made only one mis-correction, and neither introduced errors into existing correct tokens.

APPENDIX E INFERENCE TIME
This section discusses XNCC inference time and optimization opportunities. The contextless nature of XNCC error modeling enables caching of token-level edit-cost. Table 10 shows the inference time of XNCC on a separate dataset consisting of 300 lines (2,483 tokens) with a base word-error-rate of 16.96%, executed on a single CPU core. XNCC is evaluated in three configurations: without caching, with a cold cache, and with a hot cache. Special thanks to Atthakorn Petchsod for optimizing XNCC and running the experiments.
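The caching idea can be illustrated with Python's standard `functools.lru_cache`. The sketch below is an assumption-laden stand-in, not the optimized XNCC code: the function `edit_cost` and its γ-weighted Levenshtein cost are hypothetical. Because the cost depends only on the token and the candidate, not on the surrounding context, it is safe to memoize.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_cost(token, candidate, gamma=1.0):
    """Contextless cost: gamma times the Levenshtein distance (assumed form)."""
    prev = list(range(len(candidate) + 1))
    for i, ct in enumerate(token, start=1):
        curr = [i]
        for j, cc in enumerate(candidate, start=1):
            curr.append(min(prev[j] + 1,                # delete ct
                            curr[j - 1] + 1,            # insert cc
                            prev[j - 1] + (ct != cc)))  # substitute or match
        prev = curr
    return gamma * prev[-1]


edit_cost.cache_clear()
cold = edit_cost("helo", "hello")  # cold cache: the cost is computed
hot = edit_cost("helo", "hello")   # hot cache: the stored cost is returned
info = edit_cost.cache_info()      # records one miss followed by one hit
```

With a cold cache, the first occurrence of each token pair still pays the full cost. With a hot cache, repeated pairs are served from memory.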