Diacritics restoration is the process of recovering the original script from diacritic-free text by correctly inserting diacritics. In this paper, the problem is cast as a sequential tagging task in which each term is tagged with its own accents. We carefully evaluated two methods, conditional random fields (CRFs) and support vector machines (SVMs), on three domains of Vietnamese: written language, spoken language, and literature, and achieved promising results. We also investigated two lexical levels: learning from letters and learning from syllables. Although the former performs worse than the latter, it gives stable results across all three language domains. The letter-level approach is therefore more useful when dealing with unknown words, or when words in a sentence are reordered and repeated for stylistic and artistic effect.
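As a rough illustration of the tagging formulation described above, the sketch below shows how training pairs might be derived from accented Vietnamese text at the two lexical levels. It is not the authors' feature design: the function names (`strip_diacritics`, `letter_examples`, `syllable_examples`) are hypothetical, the tag is assumed to be simply the accented form of each unit, and the CRF and SVM training steps are not shown.

```python
import unicodedata

# Assumption: "đ"/"Đ" have no Unicode decomposition, so they are mapped by hand.
_NON_DECOMPOSABLE = {"đ": "d", "Đ": "D"}


def strip_diacritics(text):
    """Return text with Vietnamese diacritics removed (the tagger's input side)."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # drop combining marks (tones, circumflex, breve, horn)
        out.append(_NON_DECOMPOSABLE.get(ch, ch))
    return "".join(out)


def letter_examples(syllable):
    """Letter level: one (base letter, accented letter) pair per letter."""
    composed = unicodedata.normalize("NFC", syllable)
    return [(strip_diacritics(ch), ch) for ch in composed]


def syllable_examples(sentence):
    """Syllable level: one (bare syllable, accented syllable) pair per syllable."""
    composed = unicodedata.normalize("NFC", sentence)
    return [(strip_diacritics(s), s) for s in composed.split()]


if __name__ == "__main__":
    print(syllable_examples("tiếng Việt"))
    print(letter_examples("tiếng"))
```

In this sketch the letter-level tag set is bounded by the number of accented letters, while the syllable-level tag set grows with the vocabulary. That is one plausible reading of why a letter-level model could cope better with unknown words, though the abstract itself does not state the mechanism.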
Date of Conference: Feb. 27 2012-March 1 2012