This paper presents an anti-spam filtering approach based on the support vector machine (SVM). First, we adopt a trigram language model to perform word segmentation on Chinese email, and apply the absolute discount smoothing algorithm to overcome the sparse-data problem. Second, the different types of factoid words are identified by an automaton, so as to capture their approximate syntactic and semantic usage in the anti-spam filtering task. Third, we apply the SVM to filter spam, where emails may be written in more than one language, including Chinese and English. Experiments on large-scale cross-language corpora show that the SVM generalizes better than Naive Bayes (smoothed by the Lidstone algorithm), with 4.09% higher precision, and achieves 8.18% higher precision than the maximum entropy model.