With the rapid growth of the World Wide Web, there is an increasing need for automated web page classification techniques. Web page classification is an important task in web mining and is used in many other areas of research as well. The general practice in classification is to use lexical terms as features. In this paper we investigate the effect of using named entities as features in web page classification. We conducted tests in five different domains (baseball, football, health, politics and science) with web pages collected from online news providers. Our results show that incorporating named entities can yield slight gains in classifier performance for narrow domains, but this does not hold for all domains. The results also show that classification based only on named entities can perform well for certain domains (e.g., baseball), but its performance is still lower than that of the lexical-term-based representation.
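To make the two feature representations concrete, the following is a minimal sketch of how lexical-term features and named-entity features might be extracted and combined for one page. It is not the paper's implementation: the `GAZETTEER` dictionary is a hypothetical stand-in for a real named-entity recognizer, and the function names and feature-key format are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical gazetteer standing in for a real named-entity recognizer.
# The entries and labels are illustrative assumptions, not data from the paper.
GAZETTEER = {
    "new york yankees": "ORG",
    "babe ruth": "PERSON",
    "world health organization": "ORG",
}


def _normalize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_features(text):
    """Bag-of-words counts, one feature per lexical term."""
    return Counter(f"term={tok}" for tok in _normalize(text))


def entity_features(text):
    """Counts of gazetteer entities found in the text, keyed by label and name."""
    normalized = " ".join(_normalize(text))
    feats = Counter()
    for name, label in GAZETTEER.items():
        hits = len(re.findall(r"\b" + re.escape(name) + r"\b", normalized))
        if hits:
            feats[f"entity={label}:{name}"] += hits
    return feats


def combined_features(text):
    """Lexical terms plus named entities in a single feature map."""
    return lexical_features(text) + entity_features(text)


page = "Babe Ruth played for the New York Yankees."
features = combined_features(page)
```

Either feature map (lexical only, entities only, or combined) could then be passed to any standard text classifier, which is one way to set up the kind of comparison the abstract describes.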