Skip to Main Content
Most Thai text-to-speech systems on personal computers can synthesize sound in real time with acceptable quality. However, when porting the Thai TTS systems to limited-resource systems such as mobile devices, computational time has to be reduced. Hence, the quality of synthesized sound is decreased. Even though Flite_Thai, a unit concatenation synthesizer for Thai, can reduce the computational time into a real time system, the output sound is quite unintelligible. In this paper, we aim at selecting the appropriate speech unit for Flite_Thai in order to improve its intelligibility. We design a new speech corpus that consists of three different speech units: demi-syllable, diphone and a new speech unit called hybrid diphone. We use a non-sense carrier sentence technique for recording this corpus since we focus more on clear articulation of each speech unit. Our carrier sentence contains a speech unit or a set of similar speech units per sentence without concerning the meaning. We compare the quality of speech synthesized using four types of speech units, a diphone from the TsynC corpus recorded with natural sentences, and the three types of units from the new corpus recorded with non-sense carrier sentences. In terms of intelligibility, all of the speech units from the new corpus achieved higher MOS (Mean Opinion Score) than the existing Flite_Thai system which uses speech units from TsynC. Among the three unit types in the news corpus, demi-syllable obtained the highest score. Although hybrid diphone obtained higher MOS than the existing system and the diphone, it still suffers from a similar problem which is unsmooth joints between units.