Skip to Main Content
One of the ways criminals steal identity in the cyberspace is using phishing. Attackers host phishing websites that resemble a legitimate website and entice users to click on hyperlinks which directs them to these fake websites. Attackers use these fake sites to capture personal information such as login, passwords and social security numbers from innocent victims, which they later use to commit crimes. We propose here a robust methodology to detect phishing websites that employs for semantic analysis a topic modeling technique, Latent Dirichlet Allocation, and for classification, AdaBoost. The methodology developed is a content driven approach that is device independent and language neutral. The website content of mobile and desktop clients are collected by employing an intelligent web crawler. The website contents that are not in English are translated to English using Google's language translator. Topic model is built using the translated contents of desktop and mobile clients. The phishing website classifier is built using (i) distribution probabilities for the topics found as features using Latent Dirichlet Allocation and (ii) AdaBoost voting technique. Experiments were conducted using one of the large public corpus of website data containing 47500 phishing websites and 52500 good websites. Results show that our method achieves a F-measure of 99%.