Phishing Web pages are a previously underused source of information for determining the provenance of phishing attacks. Phishing Web pages aim to impersonate a legitimate Web site in order to trick potential victims into revealing confidential data, such as usernames and passwords. However, different phishing Web pages often contain small differences, and these differences can provide a great deal of evidence on the provenance of phishing attacks. A phishing Web page typically contains a large amount of 'redundant' information, because much of the original, impersonated Web site is reproduced in it, which makes phishing Web sites from different attacks very similar. To overcome this issue, a diff can be used that takes the phishing and original Web sites as input and returns the differences between the two. These differences present a view of the data that was previously unused, and offer a novel way to improve the ability of clustering algorithms to find good, distinct and well-separated clusters within the data. The research presented here outlines this diff process and shows that, for the data used, comparable results were obtained while the dimensionality of the dataset was reduced. This reduction in dimensionality allows clustering algorithms to complete faster.
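
As a rough illustration of the diff step described above, the following minimal sketch compares an original page with a phishing copy and keeps only the lines that differ. It assumes a simple line-based diff using Python's standard `difflib` module, which is a stand-in and not necessarily the diff method used in this research. The function name `page_diff` and the example pages and URLs are hypothetical.

```python
import difflib
from typing import List


def page_diff(original_html: str, phishing_html: str) -> List[str]:
    """Return only the lines that differ between the two pages.

    Lines present only in the original are prefixed with '-',
    lines present only in the phishing page with '+'.
    This is an illustrative line-based diff, not the paper's method.
    """
    # Normalise whitespace and drop blank lines so layout noise is ignored.
    original_lines = [ln.strip() for ln in original_html.splitlines() if ln.strip()]
    phishing_lines = [ln.strip() for ln in phishing_html.splitlines() if ln.strip()]

    changes = []
    for line in difflib.ndiff(original_lines, phishing_lines):
        # ndiff marks unchanged lines with '  ' and hint lines with '? ';
        # only additions and removals are kept.
        if line.startswith(("+ ", "- ")):
            changes.append(line[0] + line[2:])
    return changes


# Hypothetical example pages: the phishing copy only changes the form target.
original = """<html>
<form action="https://bank.example/login">
<input name="user">
</form>
</html>"""

phishing = """<html>
<form action="http://evil.example/collect.php">
<input name="user">
</form>
</html>"""

changes = page_diff(original, phishing)
for change in changes:
    print(change)
```

In this sketch the shared markup is discarded and only the changed form target remains, which is the kind of reduced representation that could then be passed to a clustering algorithm in place of the full page.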