By Topic

Phoneme Based Representation for Vietnamese Web Page Classification

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Giang-Son Nguyen ; Sch. of Eng. & Comput. Sci., Victoria Univ. of Wellington, Wellington, New Zealand ; Xiaoying Gao ; Andreae, P.

This paper proposes a novel text representation for Web pages written in Vietnamese. This representation is based on an analysis of Vietnamese documents at phonetic level in which each document will be represented as a bag of phonemes. It is designed to capture sound-based information in documents and to be helpful for resolving some non-topic text classification problems including automatic Vietnamese language identification of a document, ancient Vietnamese document detection, author identification, and poem identification. We apply some typical machine learning methods including NB, KNN and SVMs to build text classifiers. The experimental results show a significant improvement in terms of effectiveness and efficiency compared to the traditional syllable based representation in most cases.

Published in:

Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on  (Volume:1 )

Date of Conference:

22-27 Aug. 2011