Skip to Main Content
Anti-spam has become one of the top challenges for the Web search. In this paper, we explore the Web spam detection as a binary classification problem. Based on the fact that reputable pages are more easy to be obtained than spam ones on the Web, an ensemble under-sampling classification strategy is adopted, which exploits the information involved in the large number of reputable Websites to full advantage. The strategy is based on the predicted spamicity of every sub-classifiers, in which both content-based and link-based features are taken into account. The experiments on standard WEBSPAM-UK2006 benchmark showed that the ensemble strategy can improve the web spam detection performance effectively.