Skip to Main Content
This paper presents an automatic document classification system, WebDoc, which classifies Web documents according to the Library of Congress classification scheme. WebDoc constructs a knowledge base from the training data and then classifies the documents based on information in the knowledge base. One of the classification algorithms used in WebDoc is based on Bayes' theorem from probability theory. This paper focuses upon three aspects of this approach: different event models for the naive Bayes method, different probability smoothing methods, and different feature selection methods. In this paper, we report the performance of each method in terms of recall, precision, and F-measures. Experimental results show that the WebDoc system can classify Web documents effectively and efficiently.