Mlphon: A Multifunctional Grapheme-Phoneme Conversion Tool Using Finite State Transducers

In this article we present the design and development of a knowledge-based computational linguistic tool, Mlphon, for the Malayalam language. Mlphon computationally models linguistic rules using finite state transducers and performs multiple functions, including grapheme to phoneme (g2p) and phoneme to grapheme (p2g) conversion, syllabification, phonetic feature analysis and script grammar check. This open source software tool, released under the MIT license, is developed as a one-stop solution for speech-related text processing tasks in automatic speech recognition and text to speech synthesis, and for non-speech natural language processing tasks including syllable subword based language modeling, phoneme diversity analysis and text sanity check. The tool is evaluated on a manually crafted gold standard lexicon. Mlphon performs orthographic syllabification with 99% accuracy and a syllable error rate of 0.62% on the gold standard lexicon. For the grapheme to phoneme conversion task, an overall phoneme recognition accuracy of 99% with a phoneme error rate of 0.55% is obtained on the gold standard lexicon. Additionally, an extrinsic evaluation of Mlphon is performed by employing the pronunciation lexicon created using Mlphon in a Malayalam automatic speech recognition (ASR) task. Performance analysis in terms of the computation time of the lexicon creation process and the word error rate (WER) on the ASR task is presented, along with a comparison with other automated tools for lexicon creation. Pronunciation lexicons with more than 100k commonly used Malayalam words in phonemised and syllabified forms are created and published as open language resources along with this work. We also demonstrate the usage of Mlphon in different natural language processing applications: syllable subword ASR, assisted pronunciation learning, phoneme diversity analysis and text sanity check. Being a knowledge-based solution with open source code, Mlphon can be adapted to other languages of similar script nature.


I. INTRODUCTION
Precise text processing that takes care of intricate linguistic details is a prerequisite for many downstream natural language processing (NLP) tasks. This applied research work presents the motivation and steps involved in the development of a knowledge-based computational linguistic tool, Mlphon, that can solve multiple text processing problems closely associated with speech-related and some general purpose NLP tasks. Mlphon is built on finite state transducers (FSTs) to perform multiple functions including grapheme to phoneme (g2p) and phoneme to grapheme (p2g) conversions, syllabification on graphemes as well as phonemes, phonetic feature analysis, and script grammar check for Malayalam. The features of Mlphon are accessible through a programmable Python API that can be integrated with the development process of automatic speech recognition (ASR) and text to speech synthesis (TTS).

The associate editor coordinating the review of this manuscript and approving it for publication was Orazio Gambino.

A grapheme is the smallest functional unit of the writing system of a language and a phoneme is the smallest distinguishable sound unit of a language [1], [2]. The correspondence between the two largely depends on the nature

Morphologically complex languages with a very large number of rare words are challenging for machine translation and ASR tasks due to their high out of vocabulary (OOV) rate. Malayalam is known to demonstrate a higher level of morphological complexity than many other Indian and European languages in terms of type-token ratio and type-token growth rate [14], [15]. For languages with very little transcribed audio data available for speech-related tasks, precise grapheme to phoneme conversion can ensure better acoustic modeling, even in end-to-end [16] ASR systems.
Segmenting words into syllables has applications in machine translation and speech to text systems, especially in the context of morphologically complex languages where subword level units improve system performance [17], [18]. In many languages, g2p correspondence depends on the relative position of a grapheme within a word and a syllable, which makes syllable boundary identification even more important for phoneme level analysis. In the era of large language models being built on web crawled text corpora [19], it is necessary to ensure the sanity of text. Checking the linguistic validity of character sequences can guarantee this to a large extent.

A ready to use pronunciation lexicon is an essential linguistic resource for ASR (DNN/HMM pipeline model) and TTS tasks. There are machine readable pronunciation dictionaries available for various world languages. CMUDict is an open source machine readable pronunciation lexicon for North American English that contains over 134k words and their pronunciations [20]. Similar efforts for creating pronunciation lexicons for different world languages are reported in the literature, namely: GlobalPhone, providing pronunciation lexicons of 20 world languages [21]; the LC-STAR phonetic lexica of 13 different languages [22]; an Arabic speech recognition pronunciation lexicon with two million pronunciation entries for 526k Modern Standard Arabic words [23]; an ASR oriented Indian English pronunciation lexicon [24]; and a manually curated Bangla phonetic lexicon of 65k lexical entries prepared for TTS [25], to mention a few. However, an openly available large vocabulary pronunciation lexicon has not been reported for Malayalam to date.
The reported works on Malayalam pronunciation lexicons have mostly been done manually or semi-automatically with a small or medium vocabulary for ASR tasks [26], [27]. Agricultural speech and text corpora for Malayalam with 4k manually transcribed phonetic lexicon entries have been reported by Lekshmi et al. [28]. Considering the agglutinative nature of Malayalam and its practically infinite vocabulary, a manually curated, small sized pronunciation lexicon would be inadequate for general domain speech tasks [14]. There could also be a need to expand the vocabulary of the lexicon as new words get added to the language in the form of proper nouns and loan words. These lexicons can serve as high quality annotated datasets for bootstrapping data driven g2p training.

The need to perform precise grapheme to phoneme conversion on demand, to perform syllabification on graphemes as well as phonemes, and to create a programmable API for integrating these functionalities into downstream NLP tasks prompted us to develop the multifunctional tool Mlphon.

97556 VOLUME 10, 2022

and visarga have their properties as tabulated in Table 5. Virama removes the inherent vowel from the consonant preceding it. The virama that occurs at word ends, apart from removing the inherent vowel, adds the mid-central vowel schwa to native Malayalam words. Dot reph is an alternate sign representation for the consonant clusters that begin with or . Anuswara is a sign common in Malayalam. Its phonemic representation is /m/ and it always marks syllable endings. The visarga sign is common in Sanskrit derived words and introduces a slight pronunciation change similar to an aspirated glottal stop.

Apart from the basic characters, Malayalam script has hundreds of complex graphemes representing consonant clusters. A consonant cluster is a sequence of consonants with no intervening vowels. The inherent vowel is removed from the conjoining consonants by the addition of a virama sign. A consonant cluster often forms a complex grapheme through one or more of stacking, changing and merging the shapes of the constituent characters. The hundreds of possible complex graphemes in Malayalam are not individually encoded in Unicode; instead they are composed from basic characters. Table 6 lists some examples of consonant clusters in Malayalam and their constituents.

clusters in onset and coda positions as shown in Fig. 1.

The sequence of characters and signs that constitute a valid syllable in Malayalam can be summarized as [47], [48]:

A sequence of characters that does not belong to any of the classes listed above will not form a valid syllable and cannot be accepted for pronunciation analysis. A vowel sign following an independent vowel ( ), a word beginning with a virama ( ), an independent vowel after a

An FST representing a simple pronunciation mapping that accepts two words and . The states are represented as circles and marked with their unique number. The initial state is represented by a bold circle and final states by double circles. An input symbol i and an output symbol o are marked on the corresponding directed arc as i:o. ε is a special symbol that indicates the generation of an output corresponding to an empty input string. Here the inherent vowel a is inserted at the transition from state q2 to q3.

Table 7.

FSTs are closed under inversion and composition. According to the composition property, if transducer T1 maps from input symbols I1 to output symbols O1 and transducer T2 maps from O1 to O2, then the composition T1 || T2 maps from I1 to O2 [49]. The composition of a series of transducers performs the mapping from an input string to an output string, passing through the states defined by the constituent transducers. The inversion T⁻¹ of a transducer T swaps its input and output symbols. This inversion property has enabled the development of Mlphon as a bidirectional g2p converter.
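The composition and inversion properties can be illustrated with a minimal Python sketch. It is not how SFST implements transducers: each transducer is modelled here as a plain set of (input, output) string pairs, and the names `compose`, `invert`, `transduce`, and the toy tables `g2p_raw` and `fix` are hypothetical, with Latin placeholders instead of Malayalam graphemes.

```python
# A transducer is modelled as a finite relation: a set of (input, output) pairs.
# States and arcs are ignored; only the algebra of the relations is shown.

def compose(t1, t2):
    """Return T1 || T2: pairs (i, o2) where T1 maps i to o1 and T2 maps o1 to o2."""
    return {(i, o2) for (i, o1) in t1 for (m, o2) in t2 if o1 == m}

def invert(t):
    """Return the inverse relation by swapping input and output of every pair."""
    return {(o, i) for (i, o) in t}

def transduce(t, s):
    """Return all outputs that relation t produces for input s, sorted."""
    return sorted(o for (i, o) in t if i == s)

# Hypothetical toy data: stage 1 gives a preliminary mapping,
# stage 2 rewrites one preliminary result and keeps the other.
g2p_raw = {("ka", "k+a"), ("pha", "ph+a")}
fix = {("k+a", "ka"), ("ph+a", "fa")}

g2p = compose(g2p_raw, fix)   # grapheme side -> phoneme side
p2g = invert(g2p)             # phoneme side -> grapheme side

print(transduce(g2p, "pha"))  # ['fa']
print(transduce(p2g, "fa"))   # ['pha']
```

Because inversion is defined on the composed relation, the reverse direction comes for free once the forward rules are written, which is the property the text relies on for bidirectional conversion.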

Mlphon, the tool we introduce, is developed using SFST. SFST is a programming language for FSTs, implemented in C++ [12]. It has a user-friendly Python API, freely available under the GNU public license. SFST provides efficient mechanisms for defining the input and output symbol sets for FSTs and the rules for contextually mapping an input string to an output string. SFST has been employed in the development of state of the art morphological analysers for Turkish [11], German [51], Latin [52] and Malayalam [10].

The ruleset of Mlphon can be adapted, with the necessary script modifications, to other Dravidian languages with a similar script nature. The rulesets and graphemes must be adjusted to fit the target language. To enable this, we have made sure the source code is accessible, well-documented, and freely licensed to allow for adaptations. In a code switching context, a language detector may be needed to separate the text and route it to language-specific g2p systems.

A. ARCHITECTURAL DESCRIPTION

The system architecture of Mlphon is described in Fig. 3. We follow a modular approach in the design of Mlphon. The mapping from Malayalam script to IPA is carried out in eleven steps, where each step is represented by an FST. In Mlphon, FST parameters are not defined directly; they are compiled from SFST programs. An SFST program is essentially a regular expression that represents context sensitive rewrite rules. When the programs are compiled, we get the eleven transducers shown in the architectural diagram in Fig. 3.

For transducers that carry out complex tasks, the expressions might be quite complicated. In order to create complex expressions from simpler ones, variables are defined [53]. The SFST program is structured as a combination of (i) one-to-one and one-to-many mappings from input symbols to

This FST accepts all Malayalam characters and invisible zero width characters. Characters that do not require normalisation are self mapped. Character sequences that essentially represent the same grapheme are normalised to a standard form.

, zwj is normalised to a common form of single atomic character, , by passing through states q2, q3, q4 and q5. If the word is already in normalised form, that character is self mapped as indicated in the other transitions.

This FST accepts all Malayalam characters. The token passed to Mlphon for analysis is considered as a word. This FST adds the tags <BoW> and <EoW>, written in angle brackets, to indicate the beginning and the end of the word respectively, and returns the tagged word at the output. The procedural description is provided in Algorithm 1.

This FST accepts all Malayalam characters, along with word boundary tags. As discussed in section V, some character sequences are invalid according to Malayalam script grammar. The syllabifier FST checks the validity of character sequences to form syllables. An invalid sequence of Malayalam characters will not find a path from the start state of this FST to the end state and will be rejected. On valid input strings, it inserts the tags <BoS> and <EoS> at appropriate positions to indicate the beginning and end of the syllables. The syllable boundary tags are essential

    return chillunorm_fst ntanorm_fst    (it is the union of two predefined FSTs)
  end procedure
  procedure Word Boundary Tagging
    return <BoW>+token+<EoW> ← token    (insert boundary tags to the input word token)
  end procedure
  procedure Syllable Boundary Tagging
    c_v ← consonant + virama
    syl_end is a variable that can take any value in the list
    Four types of character sequences that constitute a syllable are defined in the following lines

defined by this FST maps every grapheme to phonemes as per Tables 2-4, along with phonetic or graphemic feature tags. The preliminary mapping carried out by this FST will be modified by subsequent FSTs based on context. The boundary tags are self mapped, so that they are retained as such in the output. An example of mapping the graphemes and to their phonemes with phonetic features is described in Algorithm 2.

    . . .
    . . .    (basic g2p mappings)
    return g2p_1 || g2p_1 . . .
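The accept-or-reject behaviour of the boundary tagging steps can be sketched with a toy state machine in Python. This is an illustration under strong assumptions, not the Mlphon rule set: the alphabet is a Latin stand-in (consonants `k r l m`, vowels `a e i`), the only syllable shape allowed is an optional consonant followed by one vowel, and the function name `tag_syllables` is hypothetical.

```python
CONSONANTS = set("krlm")   # hypothetical stand-ins for consonant graphemes
VOWELS = set("aei")        # hypothetical stand-ins for vowel graphemes

def tag_syllables(token):
    """Wrap a token in word and syllable boundary tags, or return None.

    Toy grammar: a word is one or more syllables, each an optional consonant
    followed by exactly one vowel. A string with no accepting path (two
    consonants in a row, an unknown symbol, a dangling consonant) is rejected.
    """
    out = ["<BoW>"]
    state = "boundary"     # "boundary": between syllables, "onset": after a consonant
    for ch in token:
        if ch in CONSONANTS and state == "boundary":
            out.append("<BoS>" + ch)
            state = "onset"
        elif ch in VOWELS and state == "onset":
            out.append(ch + "<EoS>")
            state = "boundary"
        elif ch in VOWELS and state == "boundary":
            out.append("<BoS>" + ch + "<EoS>")
        else:
            return None    # no transition defined for this symbol: reject
    if state != "boundary" or len(out) == 1:
        return None        # dangling consonant or empty token
    out.append("<EoW>")
    return "".join(out)

print(tag_syllables("kerala"))  # <BoW><BoS>ke<EoS><BoS>ra<EoS><BoS>la<EoS><EoW>
print(tag_syllables("krla"))    # None
```

Returning `None` for a string without an accepting path mirrors how an invalid character sequence fails to reach a final state of the transducer, which is what makes the same machinery usable as a script grammar check.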

The inherent vowel /a/ is added after a consonant phoneme if it is at the end of a syllable, or if it is followed by the anuswara, visarga, or a chillu, as described in Algorithm 3.

The most common alveolar consonant clusters in Malayalam, and , are constituted from the consonants dental nasal and alveolar trill , the pronunciations of which are strikingly different.

    nta_fst: <BoS> <tags> <tags> ← <BoS>+n+<tags>+r+<tags>
    return tta_fst || nta_fst    (composition of two FSTs)
  end procedure
  procedure Reph Sign Correction
    return +<tags>+<virama> <flapped>+<reph> ← <tags>+<virama>+r+<trill>+<reph>
  end procedure

mapping is done by an FST that checks the context and remaps these phonemes as indicated in Algorithm 3.

The unvoiced aspirated labial plosive grapheme is used to represent the labiodental fricative /f/ in non-native words. On analysing a corpus of the 100k most frequent Malayalam words [19], only 6% of the words that contained the letter were native. All those native words had the letter either preceded by the letter or followed by . This graphemic context is used as the parameter to determine the word origin and remap the fricative to a plosive, as described in Algorithm 4.

The tag-removal FST removes the boundary tags and phonetic feature tags by mapping them to the null symbol ε. It leaves just the IPA symbols at the output.
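The effect of tag removal can be mimicked on a plain string with a regular expression. The sketch below is only an analogy to the FST, which maps tags to ε on transducer arcs; the tag names in the sample string and the function name `remove_tags` are assumptions made for illustration.

```python
import re

TAG = re.compile(r"<[^<>]*>")   # any tag written in angle brackets

def remove_tags(analysed):
    """Delete every <...> tag, keeping only the symbols between the tags."""
    return TAG.sub("", analysed)

# Hypothetical analysed string; the feature tag names are placeholders.
tagged = "<BoW><BoS>k<plosive>a<v_sign><EoS><EoW>"
print(remove_tags(tagged))  # ka
```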

The composition of the series of FSTs from VI-A1 to VI-A3 results in a very useful module that performs syllabification of Malayalam text. We compose these FSTs to get the Syllabifier FST and provide programmable access to it in the Mlphon Python library. This module has interesting applications, such as developing subword level language models for ASR, as described in section X. An illustration of this module accepting Malayalam text as input and generating output with syllable boundary tags is shown in Fig. 7.

If the token passed to the syllabifier is , it returns the syllabified string . The Python interface to the FST for syllabification parses the boundary tags and returns the sequence of syllables.

Phoneme analyser FST is compiled as a composition of 10 FSTs described in sections VI-A1 to VI-A10 and indicated in Fig. 3

    Return the composition of three FSTs
  end procedure

is needed to convert graphemes to phonemes and vice versa. Fig. 10 demonstrates an input and output symbol sequence of the G-P Converter FST.
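The step in which the Python interface parses the syllable boundary tags into a sequence of syllables can be sketched as follows. This is a minimal sketch and not the code of the Mlphon library: the function name `split_tagged` is hypothetical, and the tagged string uses Latin placeholders in place of the Malayalam output of the Syllabifier FST.

```python
import re

SYLLABLE = re.compile(r"<BoS>(.*?)<EoS>")   # text between one tag pair

def split_tagged(tagged):
    """Return the list of syllables enclosed by <BoS> ... <EoS> pairs."""
    return SYLLABLE.findall(tagged)

# Hypothetical syllabifier output with Latin placeholders.
tagged = "<BoS>ke<EoS><BoS>ra<EoS><BoS>lam<EoS>"
print(split_tagged(tagged))  # ['ke', 'ra', 'lam']
```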

This FST parses the words and in analysis mode, as shown in Fig. 11 (i). When operated in generate mode, it converts a valid phoneme sequence into graphemes. For example, in generate mode, it can parse the inputs and as shown in Fig. 11 (ii).

The core functionalities of Mlphon are written in SFST and compiled into different finite state transducers. SFST compiles the rules to form minimized FSTs, which are highly memory optimized [54]. The Python binding of SFST provides access to these transducers for high level programming. The Mlphon Python library is very compact, with a total file size of 21 kB.

One of the major motivations behind this work is to provide a pronunciation lexicon for integrating with ASR

phonemes or sequence of syllables separated by spaces. Additionally, the function phonemise accepts the delimiters defined by the user to separate phonemes and syllables.

The Mlphon library also provides a command line utility for the tasks of syllabification, phoneme analysis and conversion between graphemes and phonemes. See Listing 1 for its usage and the list of optional arguments. The entire development process was guided by a set of unit tests to ensure the expected functionalities.

Evaluating a script analysis toolkit like Mlphon is not straightforward due to the absence of any baseline ground truth linguistic resource. A gold standard with manually annotated data, which can serve as a reference, is an important part of any quantifiable evaluation [11]. A gold standard for g2p conversion contains a list of words annotated with their true phoneme transcription. A gold standard for a syllabifier is annotated as a sequence of syllables. If a word has multiple possible annotations, all of them should be present in the gold standard lexicon. Before we explain the evaluation, the following section presents the design of the gold standard lexicon. It follows a similar procedure and number of entries as in [11], used for creating gold standard annotations for a Turkish morphological analyser.

The gold standard lexicon covers many regular words, loan words, proper nouns and abbreviations as per the distribution illustrated in Fig. 13.

A phoneme diversity analysis of the gold standard lexicon was performed and plotted in Fig. 14. The relative frequency of phonemes in the gold standard lexicon follows the same pattern as previously reported values of phoneme diversity in Malayalam speech corpora [55].

Table 9.

Comparing the true phonemes in the gold standard lexicon to the transcription provided by Mlphon, we present the phoneme transcription accuracy in the form of a confusion matrix in Fig. 16. For all phonemes other than those listed in Table 10, the accuracy, precision, recall, and F1 scores were computed to be 100%.

Except for the disambiguation rules, all contextual rule sets operate without a single error when evaluated on the gold standard lexicon. The unintentional insertion of samvruthokaram into non-native proper names and abbreviations transliterated from English was the cause of all the insertion errors. Insertion is mapped to the empty symbol '#' in the gold standard transcription. The top row of Fig. 16 shows the insertion of . Since the mostly ambiguous grapheme was g2p mapped with 100% accuracy on the gold standard lexicon, we increased the evaluation space to include

2) PHONEME ERROR RATE

As an alternate metric to measure the phoneme transcription

We performed a detailed analysis of g2p errors on different types of words in the gold standard lexicon. 1.4% of regular words and 1.3% of loan words had substitution errors. About 23% of proper nouns and 15% of abbreviations had insertion errors due to unintended samvruthokaram at word ends. All the erroneous words together account for 2.6% of the total words in the gold standard lexicon, as illustrated in Fig. 18.

The correction of substitution and insertion errors involves morphologically analysing the words, which is currently beyond the scope of this work. Even with these limitations, the PER on the gold standard lexicon, which covers about 26% of the words from 167 million tokens, is only 0.55%.
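A phoneme error rate of this kind can be computed as in the following minimal Python sketch. It assumes the commonly used definition, PER = (substitutions + deletions + insertions) / number of reference phonemes, obtained from the Levenshtein distance between the reference and the predicted phoneme sequences; the function names and the toy transcriptions are hypothetical and are not taken from the gold standard lexicon.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(pairs):
    """PER in percent over (reference, hypothesis) pairs of phoneme lists."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / total

# Hypothetical toy data with Latin placeholders.
pairs = [
    (["k", "a", "r", "a"], ["k", "a", "r", "a"]),   # exact match
    (["b", "a", "s"], ["b", "a", "s", "ə"]),        # one insertion at the word end
    (["f", "a", "n"], ["ph", "a", "n"]),            # one substitution
]
print(phoneme_error_rate(pairs))  # 20.0
```

In the toy data, two errors over ten reference phonemes give 20%. The second pair imitates the word-final vowel insertion described above, and the third a fricative and plosive confusion.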

Sample entries from the pronunciation lexicons created using these tools are presented in Table 11. On analysing these lexicons, the following observations can be made:

The Kaldi toolkit [57] is used for our experiments on ASR.

Table 12. This amounts to 19 hours of speech data for training and 2 hours of speech data for testing. Apart from the transcripts of speech, which amount to 7924 unique sentences, we have utilized the curated collection of text corpus published by SMC [61], amounting to 205k unique sentences, for language modeling. After combining these, we explicitly removed all sentences that are present in our test audio transcripts. A bigram language model is prepared on this language modeling corpus using the SRILM toolkit [62]. The vocabulary of our ASR is 69k words and the lexicons are prepared using Unified Parser, Espeak and Mlphon.

The speech from different sources is converted to a sampling frequency of 16 kHz prior to feature extraction. As the acoustic features, we have used standard Mel frequency cepstral coefficients (MFCCs) with delta and double delta coefficients computed over a Hamming window of 25 ms with an overlap of 10 ms for GMM-HMM monophone and triphone models. The acoustic modeling begins with a flat start monophone model followed by context dependent triphone acoustic modeling. Then speaker independent linear discriminant analysis (LDA), to reduce the feature space dimensionality, and maximum likelihood linear transform (MLLT) are performed. This is followed by triphone speaker adaptive training (SAT).

Phone alignments from the final triphone model are used for Kaldi chain acoustic modeling. It is implemented using time delay neural networks (TDNNs) [63]. The acoustic features used in TDNN training are: (i) 40-dimensional high-resolution MFCCs extracted from frames of 25 ms length and 10 ms shift and (ii) 100-dimensional i-vectors [64] computed from chunks of 150 consecutive frames. Three consecutive MFCC vectors and the i-vector corresponding to a chunk are concatenated, obtaining a 220-dimensional feature vector for a frame. The neural acoustic model is trained on a single NVIDIA Tesla T4 GPU.

one was tested using private datasets described in respective papers. The lexicon creation process was not explicitly explained. Additionally, some of these works did not mention the sizes of the pronunciation lexicon and OOV rates, which have a significant impact on the WER. Nevertheless, we present a comparison of these previously reported WERs with ours. It is observed that, on two different test datasets with OOV rates of 14% and 1%, the proposed ASR with the Mlphon lexicon provides similar or better WERs when compared with previously reported WERs, as listed in Table 14.
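The 220-dimensional TDNN input described above follows from 3 × 40 + 100. The following Python sketch only checks this arithmetic on dummy data. The choice of frames t-1, t and t+1 as the three consecutive vectors, the repetition of edge frames, and the function name `splice` are assumptions made for illustration; they do not describe how Kaldi performs splicing internally.

```python
MFCC_DIM = 40       # high-resolution MFCC dimension, from the text
IVECTOR_DIM = 100   # i-vector dimension, from the text
CONTEXT = 3         # number of consecutive MFCC vectors that are concatenated
EXPECTED_DIM = CONTEXT * MFCC_DIM + IVECTOR_DIM   # 3 * 40 + 100 = 220

def splice(mfcc_frames, ivector, t):
    """Concatenate frames t-1, t, t+1 (edge frames repeated) with the i-vector."""
    last = len(mfcc_frames) - 1
    window = [mfcc_frames[min(max(k, 0), last)] for k in (t - 1, t, t + 1)]
    return [x for frame in window for x in frame] + list(ivector)

# Dummy data: five frames of zeros and a zero i-vector, used only to check sizes.
frames = [[0.0] * MFCC_DIM for _ in range(5)]
ivec = [0.0] * IVECTOR_DIM
print(len(splice(frames, ivec, 0)), EXPECTED_DIM)  # 220 220
```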

Apart from small pronunciation lexicons created manually or semi-automatically for some specific experiments, as discussed in section II, there exist no openly available pronunciation lexicons for Malayalam. To bridge this gap we publish a large vocabulary pronunciation lexicon for Malayalam, automatically created using Mlphon.

These lexicons consist of different categories of words as described in Table 15. The tokens in the common words pronunciation lexicon are extracted from a general domain text corpus of 167 million types covering the fields of business, entertainment, sports, technology etc., as described in the Indic NLP dataset [19]. The rest of the categories are curated word lists from the Malayalam morphology analyser, Mlmorph [10]. Since Mlphon fails to syllabify and phoneme map abbreviations that contain word medial vowels, a workaround script has been written to split such words at the position of vowels and obtain the right g2p results.

These pronunciation lexicons are published in two separate formats: one with phoneme level transcription, where the pronunciation is described as a sequence of phonemes, and the other with syllable level transcription, where the pronunciation is described as a sequence of syllables. The elements of a sequence are separated by a blank space. The lexicons are published in a two column, tab separated values (TSV) format.

minimised FSTs [54], upon which Mlphon is built. Unified Parser is prohibitively slow due to the additional memory management requirement. The measurement of grapheme-to-phoneme conversion speed was performed on a PC workstation with 2 × AMD CPU @ 2.250 GHz and 4 GB of RAM.
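A two column TSV lexicon of the kind described above can be loaded with the Python standard library, as in the sketch below. It assumes that the first column holds the word and the second holds the space separated phonemes or syllables; the column order, the function name `read_lexicon` and the sample entries (Latin placeholders) are assumptions, not a description of the published files.

```python
import csv
import io

def read_lexicon(handle):
    """Read a two column TSV lexicon into {word: [list of unit sequences]}."""
    lexicon = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue                      # skip blank or malformed lines
        word, pronunciation = row
        # A word may appear on several lines if it has several pronunciations.
        lexicon.setdefault(word, []).append(pronunciation.split(" "))
    return lexicon

# Hypothetical entries; an in-memory string stands in for a lexicon file.
sample = "kerala\tk e r a l a\nmala\tm a l a\n"
lexicon = read_lexicon(io.StringIO(sample))
print(lexicon["mala"])  # [['m', 'a', 'l', 'a']]
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.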

In this section we describe some potential applications of

As an example, we demonstrate the usage of syllable based lexicons and language models on an ASR task. We use the same experimental setup as described in section VIII. Evaluation is done on the OpenSLR test set, where the OOV rate is higher. To evaluate syllable based language models and lexicons, we use word based lexicons and language models as the baseline. Fig. 19 shows how the WER of syllable based lexicons and language models is consistently better than that of word based ones as the vocabulary size is incrementally increased. Each subword lexicon is built by including all the syllables present in the corresponding word lexicon. For example, the first subword lexicon has 3.5k syllables as entries, obtained by syllabifying every entry in the corresponding word lexicon with 25k entries. It is observed that syllable based ASR performed much better than word based ASR, as it recovered many OOV words by concatenating syllables.

is the dental plosive in Indic TTS [59] corpus while it is

ified forms published along this work is the first of its kind in Malayalam. Mlphon, which takes care of the script specific contextual rules for phonemic analysis, serves as a useful resource for various NLP tasks including ASR, TTS, syllabification for language modeling, phonemic diversity analysis, assisted pronunciation learning and text sanity check, as demonstrated in this article.
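The construction of a subword lexicon from a word lexicon, and the way it can cover out of vocabulary words, can be sketched in Python as follows. The syllabified entries are hypothetical Latin placeholders and the function names are illustrative; in practice the syllable sequences would come from the Mlphon syllabifier.

```python
def build_syllable_lexicon(word_lexicon):
    """Collect the distinct syllables used by a syllabified word lexicon."""
    return {syl for syllables in word_lexicon.values() for syl in syllables}

def can_reconstruct(syllables, syllable_lexicon):
    """True if every syllable of a word is covered by the subword lexicon."""
    return all(syl in syllable_lexicon for syl in syllables)

# Hypothetical syllabified word lexicon.
word_lexicon = {
    "kerala": ["ke", "ra", "la"],
    "mala": ["ma", "la"],
}
syllable_lexicon = build_syllable_lexicon(word_lexicon)

# "rama" is out of vocabulary at the word level, but its syllables are known.
print("rama" in word_lexicon)                            # False
print(can_reconstruct(["ra", "ma"], syllable_lexicon))   # True
print(can_reconstruct(["ra", "ti"], syllable_lexicon))   # False
```

The subword lexicon is smaller than the word lexicon it is derived from (four syllables for six syllable tokens here), in line with the 3.5k syllables obtained from 25k words reported above.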