In this paper, we propose a method to classify Web documents by genre (not by topic) based on features of words and HTML tags. For classification, we use an SVM (support vector machine) and Naïve Bayes. To improve classification accuracy, we calculate the discriminant efficiency of each pair of a word and an HTML tag to identify the HTML tags that are effective in classification. The experimental results show that our method using discriminant efficiencies achieves an 8% increase in classification accuracy.
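As a rough illustration of what a "pair of a word and an HTML tag" feature might look like, the sketch below counts each word together with its innermost enclosing tag. This is only an assumption about the feature extraction: the abstract does not specify how words are paired with tags, and the names `TagWordExtractor`, `tag_word_features`, and `VOID_TAGS` are hypothetical. It does not compute discriminant efficiencies, whose definition is not given here.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Tags that never enclose text; skipped so later words keep their real parent.
# (Illustrative subset, not a complete list.)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagWordExtractor(HTMLParser):
    """Counts (tag, word) pairs, pairing each word with its innermost open tag.

    Assumption: "innermost enclosing tag" is one plausible pairing rule;
    the abstract does not state which rule the authors used.
    """

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.pairs = Counter()   # (tag, word) -> occurrence count

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else "#root"
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.pairs[(tag, word)] += 1


def tag_word_features(html):
    """Return a Counter mapping (tag, word) pairs to counts for one document."""
    parser = TagWordExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.pairs


if __name__ == "__main__":
    page = "<html><body><h1>Price list</h1><p>Buy now<br>price</p></body></html>"
    for pair, count in sorted(tag_word_features(page).items()):
        print(pair, count)
```

Under this pairing rule, "price" inside `<h1>` and "price" inside `<p>` become distinct features, which is what would let a classifier weight the same word differently depending on the tag that contains it. The resulting counts could then be vectorized and passed to an SVM or Naïve Bayes classifier.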