Korean Erroneous Sentence Classification With Integrated Eojeol Embedding

This paper addresses the problem of Korean sentence classification, the task of classifying an input sentence into predefined categories. Spelling or spacing errors in the input sentence cause problems in morphological analysis and tokenization. This paper proposes a novel approach, Integrated Eojeol (Korean syntactic word delimited by spaces) Embedding, to reduce the effect of poorly analyzed morphemes on sentence classification. The paper also proposes two noise insertion methods that further improve classification performance. Our evaluation results indicate that applying the proposed methods to existing sentence classifiers increases sentence classification accuracy on erroneous sentences by 8% to 15%.


Introduction
Recently, many service platforms such as Amazon Alexa help third-party developers create a wide range of natural-language-based chatbots. Customers can access those customized chatbots either through a voice interface (employing a speech recognition system) or through a textual interface (using chat applications or a custom-built textual interface).
Natural Language Understanding for chatbot mainly consists of two components: (1) intent classification module that analyzes the user intent of an input sentence, and (2) entity extraction module that extracts entities from the input sentence.
Table 1 shows how to analyze a Korean input sentence. The first step is to classify the user intent of the input sentence, which is "play music" in this example; chatbot developers should predefine the set of possible intents. Once the intent is correctly classified, it is relatively easy to extract entities from the input sentence, since the classified intent helps determine the possible entity types for the input sentence.

Input: 모짜르트 클래식 틀어줘 (Play Mozart's classical music)
Intent: play music
Entities: genre: 클래식 (classic), composer: 모짜르트 (Mozart)
When users access a chatbot through the textual interface, input sentences may occasionally contain spelling or spacing errors. For example, users may confuse the spellings of similarly pronounced words, omit required spaces between words, or simply make typos.
Since Korean is an agglutinative language, a single word can contain two or more morphemes that together determine its meaning. Thus, various researchers first tokenize Korean input sentences into morphemes with a morphological analyzer, and the resultant morphemes are then fed into a classification system (Choi et al., 2018; Park et al., 2018a; Oh et al., 2017). Such approaches based on morpheme embedding may work well on grammatically correct sentences but show poor performance on erroneous inputs, as shown in Section 4.3. Errors in input sentences are likely to cause problems in the process of morphological analysis. Occasionally, morphemes with significant meaning are miscategorized as meaningless morphemes due to spelling errors. Table 2 briefly illustrates such a case: an important clue "파케 스트", a typo of the Korean morpheme "팟케스트 (podcast)", is separated into two meaningless morphemes due to the spelling error.
In this paper, a novel approach called Integrated Eojeol Embedding (IEE) is proposed. The main idea of IEE is to feed Eojeol embedding vectors into the sentence classification network, instead of morpheme or other subword unit embedding vectors. For an Eojeol w, subword unit-based Eojeol embedding vectors are first calculated from different subword units of w, and the resultant vectors are integrated to form a single IEE vector. By doing so, the algorithm can significantly reduce the effect of incorrect subword unit tokenization caused by spelling or other errors, while maintaining the benefits of pretrained subword unit embedding vectors such as GloVe (Pennington et al., 2014) or BPEmb (Heinzerling and Strube, 2018).
Also, two noise insertion methods, called Jamo dropout and space-missing sentence generation, are proposed to automatically insert noisy data into the training corpus and further enhance the performance of the proposed IEE approach. The proposed system outperforms the baseline system by over 18%p on the erroneous sentence classification task in terms of sentence accuracy.
In Section 2, related works are briefly reviewed. Section 3 describes the IEE approach and the two proposed noise insertion methods in more detail. Section 4 presents evaluation results of the proposed system. Finally, conclusions and future work are given in Section 5.

Related Works
There exists a wide range of previous work on English sentence classification. Kim (2014) employed a CNN with max-pooling for sentence classification, Bowman et al. (2015) used BiLSTMs to obtain sentence embeddings for natural language inference tasks, and Zhou et al. (2015) combined an LSTM with a CNN. Recent works such as Im and Cho (2017) or Yoon et al. (2018) explored the self-attention mechanism for sentence encoding. However, to the best of our knowledge, the classification of erroneous sentences has received much less attention.
This paper mainly focuses on integrating multiple embeddings. Yin and Schütze (2015) also considered the idea of integrating multiple embeddings; they treated different embeddings as multiple input channels and extended the algorithm of Kim (2014) to handle multi-channel inputs. This paper differs from their work in that we attempt to generalize their approach: unlike theirs, IEE does not require the input embeddings to share the same subword tokenization while integrating various embeddings.
Besides, Choi et al. (2018) proposed morpheme-based Korean GloVe word embedding vectors along with a Korean word analogy corpus to evaluate them, where GloVe (Pennington et al., 2014) is a pre-trained embedding vector set built from unstructured text. Choi et al. (2018) also used the trained Korean GloVe embedding vectors with the algorithm of Kim (2014) to train a Korean sentence classifier for Korean chatbots. Park et al. (2018b) focused on the sentence classification problem for sentences with spacing errors, but other types of errors such as typos are ignored.
In this paper, the proposed IEE vectors are fed into the network proposed in Choi et al. (2018). The overall system performance is compared against the original sentence classification systems of Choi et al. (2018) and Kim (2014) to clarify the effect of the IEE vectors.

Classifying Erroneous Korean Sentences
In this section, an algorithm to correctly classify erroneous Korean sentences is proposed.

Brief Introduction to Korean Word and Syntactic Structures
In this subsection, the structures of Korean words and sentences are briefly described to help understand this paper. An Eojeol is defined as a sequence of Korean characters delimited by spaces. A given input sentence s is tokenized on spaces to obtain its constituent Eojeol list {w_1, w_2, ..., w_s}. An Eojeol contains one or more morphemes, which can be extracted using a Korean morphological analyzer. Table 3 shows the Eojeols and morphemes of the example sentence from Table 2.
A Korean character consists of two or three Korean letters, or Jamos: a consonant called the initial (choseong in Korean), a vowel called the medial (jungseong), and optionally another consonant called the final (jongseong). 19 consonants are used as the initial, 21 vowels as the medial, and 27 consonants as the final. Table 4 shows the list of possible Jamo candidates for each placement. Theoretically, there can be 11,172 (= 19 × 21 × (27 + 1), counting the cases without a final) different Korean characters in total, but only 2,000 to 3,000 characters are used in practice.
Table 5 shows a few examples of Korean characters and their constituent Jamos. The first example has no final but still forms a valid Korean character. There exist 51 Korean Jamos in total (30 consonants and 21 vowels), excluding duplicates.
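To make the character structure concrete, the following sketch (ours, not code from the paper) decomposes precomposed Hangul syllables into their constituent Jamos using standard Unicode arithmetic; the function name to_jamos is our own.

```python
# Precomposed Hangul syllables occupy U+AC00..U+D7A3 in blocks of
# 21 * 28 = 588 code points per initial consonant.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                 # 19 initials
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")            # 21 medials
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals (+ none)

def to_jamos(text):
    """Return the Jamo list L_J(w) for an Eojeol; non-syllable characters pass through."""
    jamos = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                          # precomposed Hangul syllable
            jamos.append(CHOSEONG[code // 588])        # initial consonant
            jamos.append(JUNGSEONG[(code % 588) // 28])  # medial vowel
            if code % 28:                              # optional final consonant
                jamos.append(JONGSEONG[code % 28])
        else:
            jamos.append(ch)                           # already a bare Jamo, digit, etc.
    return jamos
```

For example, to_jamos("한") yields the three Jamos ㅎ, ㅏ, ㄴ, while a character without a final such as "가" yields only two.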

Integrated Eojeol Embedding
Figure 1 illustrates the network architecture to calculate an IEE vector for a given Eojeol w.
For an Eojeol w, t types of its subword unit lists are generated first. In this paper, four types of subword unit lists are considered: the Jamo list L_J(w) = {j_1, ..., j_{l_j(w)}}, the character list L_C(w) = {c_1, ..., c_{l_c(w)}}, the byte-pair encoding (BPE) subunit list L_B(w) = {b_1, ..., b_{l_b(w)}}, and the morpheme list L_M(w) = {m_1, ..., m_{l_m(w)}}, where l_j(w), l_c(w), l_b(w), and l_m(w) are the lengths of the lists. Table 6 shows the subword unit lists of the Eojeol "팟케스트로".
Each subword unit list is then fed into the subword unit merge (SUM) network. For a subword unit list L_S(w), the SUM network first converts each list item into its corresponding embedding vector to obtain an embedding matrix E_S(w). Afterward, one-dimensional depthwise separable convolutions (Chollet, 2017) with kernel sizes k = 2, 3, 4, 5 and filter size F are applied on E_S(w), and the results are followed by max-pooling and layer normalization (Ba et al., 2016). The output of the SUM network is the subword unit-based Eojeol embedding vector e_S(w) ∈ R^{4F}. The Integrated Eojeol Embedding vector e_i(w) is then calculated by integrating all the subword unit-based Eojeol embedding vectors. For subword unit types T = {s_1, ..., s_t}, three different algorithms to construct the IEE are proposed.
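Before turning to the integration step, the SUM computation just described can be sketched in numpy. This is a minimal illustration, not the authors' implementation: plain 1-D convolutions with random stand-in weights replace the paper's trained depthwise separable convolutions, and dimensions follow the text (kernel sizes 2-5, filter size F).

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sum_network(E, F=8, kernel_sizes=(2, 3, 4, 5)):
    """E: embedding matrix E_S(w) of shape (list_len, d) for one subword unit list.
    Returns the subword unit-based Eojeol embedding e_S(w) of shape (4F,)."""
    l, d = E.shape
    pooled = []
    for k in kernel_sizes:
        W = rng.normal(scale=0.1, size=(k, d, F))        # stand-in conv weights
        Ek = np.pad(E, ((0, max(0, k - l)), (0, 0)))     # pad short lists
        windows = np.stack([Ek[i:i + k] for i in range(Ek.shape[0] - k + 1)])
        conv = np.einsum("wkd,kdf->wf", windows, W)      # (num_windows, F)
        pooled.append(conv.max(axis=0))                  # max-pool over positions
    return layer_norm(np.concatenate(pooled))            # shape (4F,)

e_S = sum_network(rng.normal(size=(3, 16)))              # e.g. 3 Jamos, d = 16
```

With F = 8, the four pooled feature maps concatenate into a 32-dimensional vector, matching the 4F output dimension stated above.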
• IEE by Concatenation. Concatenate the subword unit-based Eojeol embedding vectors: e_i(w) = [e_{s_1}(w); ...; e_{s_t}(w)] ∈ R^{4tF}.
• IEE by Weighted Sum. Compute the IEE as a weighted sum of the subword unit-based Eojeol embedding vectors: e_i(w) ∈ R^{4F}.
• IEE by Max Pooling. Set the j-th element of the IEE to the maximal j-th element of the subword unit-based Eojeol embedding vectors: e_i(w) = max[e_{s_1}(w), ..., e_{s_t}(w)] ∈ R^{4F}.

IEE-based Sentence Classification

Figure 2 illustrates the network architecture for sentence classification using the proposed IEE approach. A given Korean sentence s is considered as a list of Eojeols {w_1, w_2, ..., w_s}. For each Eojeol w_i, e_i(w_i) is first calculated with the IEE network proposed in Section 3.2 to form an IEE matrix E(s) = [e_i(w_1), e_i(w_2), ..., e_i(w_s)] ∈ R^{s×d(I)}. Here, d(I) is the dimension of the IEE vector: 4tF for IEE by Concatenation and 4F otherwise.
Once E(s) is calculated, depthwise separable convolutions with kernel sizes k = 1, 2, 3, 4, 5 are applied on the matrix, and the results are followed by max-pooling and layer normalization. Finally, two fully connected feed-forward layers are applied to obtain the final output score vector O ∈ R^c, where c is the number of possible classes. The overall network architecture is a variation of the architecture proposed in Choi et al. (2018), with its input replaced by the newly proposed IEE vectors.
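The three IEE integration variants can be illustrated as follows. This is our sketch; in particular, the softmax weighting in the weighted-sum variant is an assumption, since the paper does not spell out how the weights are obtained.

```python
import numpy as np

# Toy setting: t = 3 subword unit types, 4F = 8.
rng = np.random.default_rng(1)
vectors = [rng.normal(size=8) for _ in range(3)]     # e_{s_1}(w) .. e_{s_t}(w)

# IEE by Concatenation: dimension 4tF = 24.
iee_concat = np.concatenate(vectors)

# IEE by Max Pooling: element-wise maximum over the t vectors, dimension 4F.
iee_max = np.max(np.stack(vectors), axis=0)

# IEE by Weighted Sum: one weight per unit type (softmax weights assumed),
# dimension 4F.
logits = rng.normal(size=3)
weights = np.exp(logits) / np.exp(logits).sum()
iee_wsum = np.einsum("t,td->d", weights, np.stack(vectors))
```

The dimensions match the text: concatenation yields a 4tF-dimensional vector, while max pooling and weighted sum stay in R^{4F}.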

Noise Insertion Methods to Improve the Integrated Eojeol Embedding
In this subsection, two noise insertion methods that further improve the performance of the IEE vectors are proposed. The first method, called Jamo dropout, masks Jamos during the training phase with the Jamo dropout probability jdp. Instead of masking some elements of the input embedding vector as regular dropout (Srivastava et al., 2014) does, Jamo dropout masks whole Jamo embedding vectors. Also, if any subword unit of a different type contains a masked Jamo, the embedding vector of that subword unit is masked as well. The method has two expected roles. First, it introduces additional noise during the training phase, making the trained classifier work better on noisy inputs. Second, masking the subword units that contain masked Jamos allows the system to learn to focus on the Jamo embeddings when an input subword unit is not in the trained subword unit vocabulary.
Compared to the morpheme-embedding-based approach, the Eojeol-embedding-based approach is expected to perform poorly on sentences without spaces, since Eojeols are obtained by tokenizing the input sentence on spaces. Therefore, a second noise insertion method called space-missing sentence generation is introduced to resolve this impediment. This method helps the system process input sentences lacking necessary spaces by automatically adding such sentences to the training corpus. More precisely, sentences are randomly chosen from the training corpus with probability msp (missing space probability), and all spaces in the chosen sentences are removed. The space-removed sentences are then inserted into the training corpus. The space-missing sentence generation method is applied only once, before the training phase.
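The two noise insertion methods can be sketched as follows. This is our illustration, not the authors' code: the MASK token stands in for the embedding-level masking performed in the real system, and the function names are hypothetical.

```python
import random

random.seed(7)
MASK = "<mask>"

def jamo_dropout(jamos, jdp=0.05):
    """Mask each Jamo with probability jdp; in the full method, any larger
    subword unit containing a masked Jamo would also have its embedding masked."""
    return [MASK if random.random() < jdp else j for j in jamos]

def space_missing_generation(corpus, msp=0.4):
    """Pick each training sentence with probability msp, strip all its spaces,
    and append the result to the corpus (applied once, before training)."""
    extra = [s.replace(" ", "") for s in corpus if random.random() < msp]
    return corpus + extra

corpus = ["조명 꺼줘", "음악 틀어줘", "오늘 날씨 어때"]
augmented = space_missing_generation(corpus, msp=0.5)
```

Note that space-missing generation keeps the original sentences and only appends space-stripped copies, matching the description above.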

Experiments
In this section, experimental settings and evaluation results are described.

Corpus
The intent classification corpus proposed in Choi et al. (2018) contains 127,322 manually annotated, grammatically correct Korean sentences with 48 intents. Examples are weather (e.g., 오늘 날씨 어때 How is the weather today), fortune (e.g., 내일 운세 알려줘 Tell me tomorrow's fortune), music (e.g., 음악 틀어줘 Play music), and podcast (e.g., 팟캐스트 틀어줘 Play podcast). Sentences for each intent are randomly divided in an 8:1:1 ratio into training, validation, and test datasets. The test dataset is called the WF (Well-Formed) corpus throughout this paper and consists of 12,711 sentences.
The KM (Korean Misspelling) corpus is manually annotated to measure system performance on erroneous input sentences. It is annotated with 46 intents, two fewer than the WF corpus. The two removed intents are OOD and Common; OOD is the intent for meaningless sentences, and Common is the intent for short answers such as "응 (yes)". Sentences for the Common intent are too short to recover their meaning from errors.
For the 46 intents of the KM corpus, two annotators are each asked to create sentences for 23 intents. For each intent, the annotator first creates 45 sentences without errors and then inserts errors into them. The error insertion guideline is given in Table 7. As a result, the KM corpus contains 2,070 erroneous sentences in total, and it is used only for testing.
RULE 1. For each sentence, insert one or more errors. Some recommendations are:
- Remove or duplicate a Jamo.
- Swap the order of two or more Jamos.
- Replace a Jamo with a similarly pronounced one.
- Replace a Jamo with one located nearby on the keyboard.
RULE 2. The erroneous sentences should remain understandable.
- Each annotator tries to classify the other's work; misclassified sentences are reworked.
RULE 3. Of the 45 sentences for each intent, 25 should contain only valid characters, and 20 should contain one or more invalid characters.
- "Invalid" Korean characters are defined as those without an initial or a medial.

Table 7: KM corpus annotation guidelines.

Also, a space-missing (SM) test corpus is generated based on the WF and KM corpora to evaluate system performance on sentences without necessary spaces. The corpus consists of sentences from the other test corpora, but each space between words is removed with probability 0.5, and at least one space is removed from each original sentence. The SM corpus contains 14,781 sentences and is also used only for testing.
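The SM test-corpus generation described above can be sketched as follows (our illustration; make_sm_sentence is a hypothetical helper name):

```python
import random

random.seed(3)

def make_sm_sentence(sentence, p=0.5):
    """Remove each space with probability p, guaranteeing that at least
    one space is removed from any sentence that contains one."""
    idx = [i for i, ch in enumerate(sentence) if ch == " "]
    drop = {i for i in idx if random.random() < p}
    if idx and not drop:
        drop = {random.choice(idx)}              # force at least one removal
    return "".join(ch for i, ch in enumerate(sentence) if i not in drop)

sm = make_sm_sentence("오늘 날씨 어때")
```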

Experimental Setup
Sentence accuracy is used as the criterion to measure the performance of the sentence classification systems on each corpus. The sentence accuracy (SA) is defined as the ratio of correctly classified sentences to the total number of test sentences:

SA = (number of correctly classified sentences) / (total number of test sentences)

Throughout the experiments, the value of jdp is set to 0.05, and msp is set to 0.4. These values are chosen through a grid search over jdp ∈ {0.00, 0.05, 0.10, 0.15, 0.20} and msp ∈ {0.0, 0.1, 0.2, 0.3, 0.4}. For each experiment configuration, three training and test runs are carried out, and the average of the three test results on each test corpus is reported as the final system performance.
The ADAM optimizer (Kingma and Ba, 2015) with a learning rate warm-up scheme is applied. The learning rate increases from 0.0 to 0.1 during the first three epochs, and exponential learning rate decay with a decay rate of 0.75 is applied after five epochs of training. After each epoch, the trained classifier is evaluated on the validation dataset, and training stops when the validation accuracy has not improved for four consecutive training epochs.
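The learning-rate schedule can be sketched as follows; the linear interpolation during warm-up and the plateau between epochs three and five are our assumptions, since the paper only specifies the endpoints and the decay rate.

```python
def learning_rate(epoch, peak=0.1, warmup=3, decay_start=5, decay=0.75):
    """Per-epoch learning rate: linear warm-up from 0 to peak over the first
    `warmup` epochs, held at peak, then exponential decay after `decay_start`."""
    if epoch < warmup:
        return peak * (epoch + 1) / warmup       # linear warm-up
    if epoch < decay_start:
        return peak                              # hold at peak
    return peak * decay ** (epoch - decay_start + 1)
```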

Evaluation Results
In this subsection, system configurations are notated as follows: IEE_C represents IEE by Concatenation, IEE_W represents IEE by Weighted Sum, and IEE_M represents IEE by Max Pooling, as described in Section 3.2. The applied subword units are listed in square brackets: m for morpheme, b for BPE unit, j for Jamo, and c for character. The noise insertion methods used are indicated after a + sign: SG represents space-missing corpus generation, JD represents Jamo dropout, and ALL means both.
Table 8 compares the performance of the proposed system and the baseline systems. The first and second baseline systems are based on the algorithm proposed by Kim (2014); the source code is retrieved from the author's repository and modified to accept Korean sentences as input. Baseline Kim-1 takes the Eojeol list of an input sentence as its input, while baseline Kim-2 instead takes the morpheme list of the given sentence. The third baseline system is the system proposed in Choi et al. (2018); it also receives the list of Korean morphemes as input. The fourth baseline system, M-BERT, is the multilingual version of BERT (Devlin et al., 2019); we downloaded the pretrained model from the authors' repository and fine-tuned it for Korean sentence classification. For the proposed approach, the performances of three selected system configurations are presented.
As can be observed from the table, performance on the KM corpus improves by over 17%p compared to the baseline systems. The proposed system also outperforms the baseline systems on the WF and SM corpora by integrating information from different types of subword units.
Another set of experiments is carried out to examine the effect of each noise insertion method and to compare the three proposed IEE approaches. All four subword unit types are used throughout these experiments, and the evaluation results are presented in Table 9. As the table shows, using the IEE approaches with the two proposed noise insertion methods dramatically improves system performance on the erroneous sentence classification task: performance on the KM corpus increases by about 14 to 15%p for every proposed IEE approach.
As expected in Section 3.4, the IEE approaches show relatively low performance on the SM corpus without the application of SG, compared to the baseline systems presented in Table 8. However, the evaluation results suggest that this decrease in performance can be effectively handled by applying the SG method. The performance of the IEE_C[mbjc]+SG system on the SM corpus reaches 89.40%, about 2%p higher than the SM corpus performance of the baseline systems. Among the three proposed IEE approaches, IEE_C performs better than the other two in most cases. IEE_M performs slightly better than IEE_C on the KM corpus but shows low performance on the WF corpus. The performance of IEE_W on the KM corpus is much lower than that of the other two approaches.
To distinguish the contribution of the Eojeol-based approach from the contributions of the two noise insertion methods, two subword unit-based integrated embedding approaches are newly defined. The integrated morpheme embedding (IME) approach creates an integrated morpheme embedding vector for each morpheme by integrating the Jamo-based and character-based morpheme embedding vectors with the pre-trained morpheme embedding vector. The integrated BPE unit embedding (IBE) approach creates an integrated BPE embedding vector for each BPE unit in the same manner. The two approaches are the same as the IEE approach, except that IME and IBE feed the morpheme embedding vector and the BPE unit embedding vector, respectively, into the network in Figure 2, while IEE feeds the Eojeol embedding vector. The [mjc] subword unit setting is used to compare IME and IEE, since BPE embeddings cannot be integrated into IME; to compare IBE and IEE, [bjc] is used for the same reason. For the integration method, only concatenation is considered in these experiments.
Finally, experiments are conducted on the IEE_C system with various subword unit configurations to figure out the effect of each subword unit on system performance. Table 11 shows the comparison results. Only the SG noise insertion method is applied to the systems without Jamo subword units, while both noise insertion methods are applied in the other cases. The evaluation results are compared between two groups: one with the Jamo subword unit and one without it.
Several interesting facts can be observed from the table. First, using the Jamo subword unit dramatically improves system performance on the KM corpus. The performance of the IEE_C[j]+ALL system on the KM corpus reaches 69.16%, the best performance on that corpus. However, using the Jamo subword unit alone results in a relatively low performance of 95.19% on the WF corpus due to the lack of pre-trained embeddings. By using the Jamo subword unit together with other subword units such as morphemes or BPE units, the system achieves excellent performance on the WF corpus as well. Additionally, IEE_C[mbj]+ALL achieves the best WF corpus performance of 97.47% by combining the morpheme embedding vectors with the BPE embedding vectors. This result suggests that further integrating different types of pre-trained subword embeddings can yield additional performance improvements. The evaluation results also show that the Jamo subword unit is more effective than the character subword unit in terms of KM corpus performance. Considering that Korean users type in Jamos rather than characters, this result is quite understandable.

Error Analysis
Several examples are examined to figure out why the proposed Eojeol-based approach works better than the existing morpheme-based approaches. These examples are presented in Table 12; important clues for sentence classification are marked in bold.
In Case 1, the vital clue "불 (light)" is not extracted as a morpheme due to a spelling error; however, the clue is successfully recovered in the BPE subword units. In Case 2, the typo can be handled by considering the Jamo subword units: the Jamo subword unit lists of the correct and erroneous sentences are identical, while their morpheme and BPE subword unit lists differ. As these examples show, by integrating multiple types of subword unit embeddings, the system can recover a clue from other subword unit types when one type fails to capture it.
The proposed algorithm is still weak on redundantly spaced sentences, exemplified by the sentence "ㅈ ㅗ명 꺼줘", a misspelling of "조명 꺼줘 (Turn off the light)". In this example, the vital clue "조명 (light)" is separated into two different Eojeols, "ㅈ" and "ㅗ명", by the misplaced space. Since the clue is split across two Eojeols, the proposed Eojeol-based algorithm fails to recover it and to produce the correct classification result.

Conclusion
In this paper, a novel approach called Integrated Eojeol Embedding is proposed to handle the Korean erroneous sentence classification task. Two noise insertion methods are additionally proposed to overcome the weakness of the Eojeol-embedding-based approach and to add noise to the training data automatically. The proposed system is evaluated on an intent classification corpus for Korean chatbots. The evaluation results show over 18%p improvement on the erroneous sentence classification task and 0.5%p improvement on the grammatically correct sentence classification task, compared to the baseline systems.
Although the proposed algorithm is tested only on the Korean chatbot intent classification task, it can be applied to other types of sentence classification, such as sentiment analysis or comment categorization. Moreover, its application need not be restricted to Korean text; for example, it can be applied to English text to integrate English GloVe embedding vectors with BPE unit embedding vectors.
Our future work is to further investigate the performance of the proposed algorithm and to expand it to cover other languages.

Table 1: An example of analyzing a chatbot input sentence.

Table 2: An example of miscategorized morphemes due to a typo in the sentence.

Table 3: An example of Eojeols and morphemes of the sentence from Table 2.

Table 4: Possible candidate Jamos for each Korean character component.

Table 5: Examples of Korean characters and their constituent Jamos.

Figure 1: Network architecture to get the proposed Integrated Eojeol Embedding.

Figure 2: Network architecture for sentence classification.

Table 8: Comparison of the baseline systems and the proposed IEE approach. Minibatch size is set to 128, and dropout (Srivastava et al., 2014) is applied.

Table 10 shows the comparison results. The two noise insertion methods work well on IME_C and IBE_C; however, on the KM corpus, the performance of the Eojeol-based embedding approach is 3%p higher than that of the morpheme-based or BPE-based approaches. This result shows that the Eojeol-based embedding approach handles subword unit analysis errors efficiently, compared to subword unit-based integrated embedding approaches such as IME or IBE.

Table 11: Experiment results on the effect of each subword unit embedding.

Table 12: Examples for which the Eojeol-based approach works better than the morpheme-based approaches. In the first column, T denotes a sentence with a typo and C the corrected sentence; in the second column, S, M, B, and J denote the sentence, morpheme subunits, BPE subunits, and Jamo subunits, respectively.