Content-based phishing detection extracts keywords from a target Web page, uses these keywords to retrieve the corresponding legitimate site, and flags the target as phishing when its domain does not match that of the retrieved site. However, it often misidentifies a legitimate target site as a phishing site because the extracted keywords do not characterize the legitimate site with sufficient accuracy. Two methods are described for extracting keywords: domain keyword extraction, which extracts keywords not only from the page in the browser but also from pages linked from it, and time-invariant keyword extraction, which extracts keywords from the page and its previous versions. Experiments using 172 legitimate sites demonstrated a reduction in the false detection rate from 14.0% to 7.6%, while experiments using 172 phishing sites demonstrated no change in the rate of overlooking phishing pages.
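The detection flow and the two extraction ideas can be illustrated with a minimal Python sketch. It is not the authors' implementation: the ranking rule (number of texts containing a word, then total count), the stopword list, the function names, and the `search` callable that stands in for a search engine are all assumptions made for illustration. Domains are compared by hostname with a leading `www.` removed, which is a simplification of real registered-domain matching.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed, deliberately tiny stopword list (illustrative only).
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "your", "is", "on"}


def tokenize(text):
    """Return lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_keywords(texts, k=None):
    """Rank words by how many texts contain them, then by total count.

    This ranking is an assumption; the abstract does not specify one.
    """
    doc_freq, total = Counter(), Counter()
    for text in texts:
        tokens = tokenize(text)
        total.update(tokens)
        doc_freq.update(set(tokens))
    ranked = sorted(total, key=lambda w: (-doc_freq[w], -total[w], w))
    return ranked[:k]


def domain_keywords(page_text, linked_texts, k=3):
    """Keywords drawn from the target page and the pages it links to."""
    return top_keywords([page_text, *linked_texts], k)


def time_invariant_keywords(current_text, previous_texts, k=3):
    """Keywords that appear in the current page and every previous version."""
    versions = [current_text, *previous_texts]
    common = set.intersection(*(set(tokenize(v)) for v in versions))
    return [w for w in top_keywords(versions) if w in common][:k]


def host(url):
    """Hostname without a leading 'www.' (simplified domain comparison)."""
    name = urlparse(url).hostname or ""
    return name[4:] if name.startswith("www.") else name


def looks_like_phishing(target_url, keywords, search):
    """Flag the target when its domain differs from the retrieved site's.

    `search` is a hypothetical callable: keywords -> URL of the top result.
    """
    return host(target_url) != host(search(keywords))
```

In use, `search` would wrap a real search engine query; in the sketch it can be replaced by a stub such as `lambda kw: "https://www.example-bank.test/"` to exercise the domain comparison. A false detection in this setting corresponds to `search` returning some other site for a legitimate page, which is the case that better keywords are meant to reduce.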