In recent years, much research has focused on developing effective spam email filters, because spam has become a serious problem. Among existing solutions, studies have shown the Naive Bayesian spam email filter to be the best, since it achieves the highest accuracy in filtering out English spam emails. Filtering Chinese spam emails, however, remains an open problem because Chinese sentences are difficult to segment correctly. This paper presents a Web-Search-Results (WSR) based Genetic Algorithm (GA) Chinese sentence tokenizer that segments Chinese sentences automatically. We also propose a fuzzy-splitting algorithm that helps the GA handle longer sentences. In addition, we describe the implementation details of this tokenizer together with a standard Naive Bayesian email filter, and then introduce the training and evaluation process. Evaluations on a real-world spam email dataset, "CCERT Data Sets of Chinese Emails" (CDSCE), show that our approach effectively improves the accuracy of identifying Chinese spam emails.
Date of Conference: 18-22 July 2011