Phishing URL Detection: A Real-Case Scenario Through Login URLs

Phishing is a social engineering cyberattack where criminals deceive users to obtain their credentials through a login form that submits the data to a malicious server. In this paper, we compare machine learning and deep learning techniques to present a method capable of detecting phishing websites through URL analysis. In most current state-of-the-art solutions dealing with phishing detection, the legitimate class is made up of homepages without including login forms. On the contrary, we use URLs from the login page in both classes because we consider it is much more representative of a real case scenario and we demonstrate that existing techniques obtain a high false-positive rate when tested with URLs from legitimate login pages. Additionally, we use datasets from different years to show how models decrease their accuracy over time by training a base model with old datasets and testing it with recent URLs. Also, we perform a frequency analysis over current phishing domains to identify different techniques carried out by phishers in their campaigns. To prove these statements, we have created a new dataset named Phishing Index Login URL (PILU-90K), which is composed of 60K legitimate URLs, including index and login websites, and 30K phishing URLs. Finally, we present a Logistic Regression model which, combined with Term Frequency - Inverse Document Frequency (TF-IDF) feature extraction, obtains 96.50% accuracy on the introduced login URL dataset.


I. INTRODUCTION
In recent years, the use of web services has grown drastically due to the ongoing digital transformation. Companies drive this change by providing their services online, such as e-banking, e-commerce or SaaS (Software as a Service) [1]. Due to the COVID-19 pandemic, restrictions have spread the work-from-home model, which implies millions of additional workers, students, and teachers carrying out their activities remotely [2], leading to a substantial additional workload for services such as email, student platforms, VPNs or company portals. Therefore, there are even more potential targets exposed to phishing attacks, where phishers try to mimic legitimate websites to steal users' credentials or payment information [3], [4]. Recent studies [5], [6] concluded that phishing is one of the most significant attacks based on social engineering during the COVID-19 pandemic, together with spam emails and websites to execute these attacks.
The associate editor coordinating the review of this manuscript and approving it for publication was Senthil Kumar.
Identifying phishing sites by their use of the HTTP protocol is no longer a valid rule. In the third quarter of 2017 [7], the Anti-Phishing Working Group (APWG) reported that less than 25% of phishing websites were hosted under the HTTPS protocol, whilst this share increased to 83% in the first quarter of 2021 [8]. These websites provide secure end-to-end communication, which gives users a false impression of safety while making an online transaction [9]. Furthermore, the APWG [10] has reported a significant increase in phishing attacks, from 165,772 websites in the first quarter of 2020 to 611,877 in the first quarter of 2021. A reason behind this increase might be that people have resorted, and are still resorting, to online services during the COVID-19 pandemic.
One of the most popular solutions for phishing detection is the list-based approach, which checks the requested URL against a phishing database [11]. Some examples of this solution are Google Safe Browsing, PhishTank, OpenPhish or SmartScreen. If a requested URL matches any record, the request is blocked, and a warning is displayed to the user before visiting the website. However, despite its capabilities, the list-based approach fails if the phishing URL was not previously reported [12]-[14], and it requires a continuous effort to update the database with newer phishing data. Bell and Komisarczuk [11] observed that many phishing URLs were removed from PhishTank after five days, while OpenPhish removed all URLs seven days after they were reported. This issue allows attackers to reuse the same URL once it is removed from the different lists. Due to these drawbacks of blacklist-based methods, automatic detection of phishing URLs based on machine learning has attracted attention in research [15], [16]. These approaches can be grouped into four classes according to the type of data used for detection: the text of the URL, the page content, the visual features and networking information [17]. Methods based on the page content and visual features require visiting the website to collect the source code and render it, which is a time-consuming task. Other availability limitations can be found in studies that rely on networking and third-party information such as WHOIS or search engine rankings. To overcome these limitations, we focus on phishing detection through URLs, since it offers advantages such as fast computation (because no websites are loaded) and independence from third parties and language, since features are extracted only from the URLs.
Existing URL datasets use the homepage URLs of well-known websites as the legitimate class [18], [19]. However, we think that the challenge is to determine whether a login form of a website is legitimate or phishing. From our perspective, and to the best of our knowledge, publicly available datasets do not reflect the conditions that phishing URL detection faces in real-world problems. Fig. 1 displays the differences between a homepage, a login page and a phishing website. Furthermore, recent machine learning proposals obtained high accuracy using outdated datasets, i.e., typically containing URLs collected from 2009 to 2017. We demonstrate that models trained with old URLs decrease their performance when they are tested with URLs coming from recent phishing pages. This paper presents a phishing URL dataset whose legitimate URLs are obtained from legitimate login pages. Then, we evaluate machine learning and deep learning techniques in order to recommend the method with the highest accuracy. Next, we show how models trained with legitimate homepages struggle to classify legitimate login URLs, supporting our hypothesis about phishing detection and legitimate login URLs. Additionally, we show how accuracy decreases over time for models trained with datasets from 2016 and evaluated on data collected in 2020. Finally, we provide an overview of current phishing encounters, explaining attacker tricks and approaches.
The main contributions of the paper can be summarized as follows:
• We extended our previous dataset PILU-60K (Phishing Index Login URL) [20] from 60K to 90K URLs, equally distributed among three classes: phishing, legitimate homepage, and legitimate login. We make this extended dataset, PILU-90K, publicly available for research purposes.
• Using PILU-90K, we implemented and evaluated three pipelines for URL phishing detection: (i) the 38 handcrafted feature descriptors proposed by Sahingoz et al. [21], used to train eight supervised machine learning classifiers; (ii) automatic feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) at the character N-gram level combined with the Logistic Regression (LR) algorithm; and (iii) a Convolutional Neural Network (CNN), also at the character level.
• We demonstrated empirically how a URL phishing detection model struggles to classify login URLs when it is trained on phishing URLs and legitimate homepage URLs.
• We evaluated the robustness of the proposed phishing detection over time. We trained the model on a dataset collected between March 2016 and April 2016, and we evaluated the model on other datasets collected between 2017 and 2020.
• Phishing websites were analyzed using domain frequency. We found six different types of phishing domains depending on the service hired by the attacker.
The organization of the paper is as follows: Section II reviews the literature on phishing detection. Next, Section III describes the proposed dataset and its content. Then, we explain the features used and the proposed classifiers in Section IV. The experiments carried out are covered in Section V. Section VI presents and discusses the obtained results. Finally, the main conclusions are drawn in Section VII, where we also point to our future work.

II. STATE OF THE ART
In the literature, researchers have focused on phishing detection following three main approaches: list-based detection, and automatic detection using either machine learning or deep learning techniques.

A. LIST-BASED
The list-based approach, well-known for detecting phishing URLs [22]-[24], can be based on whitelists or blacklists, depending on whether they store legitimate or phishing URLs, respectively. Jain and Gupta [24] developed a whitelist-based system that blocks all websites which are not on that list. Conversely, blacklist-based systems, like Google Safe Browsing or PhishNet [23], are more common as they provide a zero false-positive rate, i.e. no legitimate website is classified as phishing. However, they can be evaded if an attacker makes changes to a blacklisted URL. Besides, they depend heavily on the update rate of the system's records. Therefore, a list-based approach is not a robust solution due to the high volume of new phishing websites introduced daily and their short lifespan, which is estimated to be 21 days on average [12].
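The evasion issue described above can be illustrated with a minimal sketch of an exact-match blacklist lookup. The blacklist content, the URLs and the helper name `is_blacklisted` are hypothetical and only serve the illustration; real services use more elaborate matching than this.

```python
# Hypothetical blacklist of previously reported phishing URLs (illustrative values only).
BLACKLIST = {
    "http://secure-login.example.test/account/verify",
}


def is_blacklisted(url: str, blacklist: set = BLACKLIST) -> bool:
    """Return True only when the URL exactly matches a stored record."""
    return url.strip().lower() in blacklist


reported = "http://secure-login.example.test/account/verify"
# The attacker appends a throwaway query string to the same page.
modified = reported + "?session=1"

print(is_blacklisted(reported))   # True: exact match with a stored record
print(is_blacklisted(modified))   # False: the lookup misses the altered URL
```

Under these assumptions, a trivial change to the URL is enough to bypass the lookup, which is the weakness that motivates the learning-based methods reviewed next.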

B. MACHINE LEARNING METHODS
To overcome blacklist disadvantages, researchers have developed machine learning models to detect unreported phishing encounters. Depending on their input data, these approaches can be classified into two categories: URL-based and content-based.

1) URL-BASED
Buber et al. [25] implemented a URL detection system composed of two sets of features. The first was a 209-word vector, obtained with the ''StringToWordVector'' tool from Weka. The second comprised 17 NLP (Natural Language Processing) handcrafted features such as the number of sub-domains, random words, digits, special characters and length measurements over the URL words. Combining both feature sets, they obtained 97.20% accuracy with Weka's RFC (Random Forest Classifier) on a 10% sub-sample of the Ebbu2017 dataset. In a follow-up study, Sahingoz et al. [21] defined three different feature sets: word vectors, NLP and a hybrid set combining both. They obtained 97.98% accuracy with Random Forest (RF) using only 38 NLP features on the Ebbu2017 [25] dataset. In this work, we used the NLP features from Sahingoz et al. [21], since they reported state-of-the-art performance in the latest studies.
Jain and Gupta [26] built an anti-phishing system using 14 handcrafted URL descriptors, including some obtained using third-party services like WHOIS registers or DNS lookups. They obtained an accuracy of 76.87% and 91.28% with Naïve Bayes (NB) and Support Vector Machine (SVM) classifiers, respectively, on a private dataset with 35,491 samples.
Banik and Sarma [27] implemented a lexical feature selection from URL to optimize the number of features and the accuracy of their model. They started with a set of 17 descriptors and removed the less significant ones until they reached an optimal performance. Using 9 features and a Random Forest (RF) classifier they obtained 98.57% accuracy on an extension of PWD2016 [18] dataset.

2) CONTENT-BASED
Content-based works use features extracted mainly from the websites' source code. However, most of the current works combine these with URLs and other third-party services such as WHOIS [28], [29].
One of the first content-based works was CANTINA [30], which consists of a heuristic system based on TF-IDF. CANTINA extracts five words from each website using TF-IDF and submits them to the Google search engine. If the domain was within the first n results, the page was considered legitimate, and phishing otherwise. They obtained an accuracy of 95% with a threshold of n = 30 Google search results. Due to the use of external services like WHOIS and the high false-positive rate, the authors proposed CANTINA+ [31]. Their new proposal achieved a 99.61% F1-Score by including two filters: (i) a comparison of hashed HTML tags with known phishing structures and (ii) discarding websites with no form.
Moghimi and Vorjani [32] proposed a system independent from third-party services like Google Page Rank or WHOIS. They used two handcrafted feature sets, extracted from the URL and the Document Object Model (DOM) of the website. The first set has nine legacy features including a set of keywords, while the second has eight novel features which indicate whether the website's resources are loaded using the SSL protocol or not. They used the Levenshtein distance [33] to detect typo-squatting by comparing the URLs of the website and of its resources. These features were used to train an SVM classifier, which obtained an accuracy of 98.65% on their banking websites dataset.
Adebowale et al. [34] created a browser extension to protect users by extracting features from the URL, the source code, the images, and third-party services like WHOIS. Those features were introduced into an Adaptive Neuro-Fuzzy Inference System (ANFIS) and combined with the Scale-Invariant Feature Transform (SIFT) algorithm, obtaining an accuracy of 98.30% on the dataset of Rami et al. [35].
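To make the typo-squatting check more concrete, the following is a minimal sketch of the Levenshtein distance and of how it could flag near-identical names. It is not the implementation of [32]: the helper `looks_typosquatted`, its `max_distance` threshold and the example names are assumptions introduced for illustration, and the sketch compares a domain against a brand name rather than website and resource URLs.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def looks_typosquatted(domain: str, brand: str, max_distance: int = 2) -> bool:
    """Flag names that are close to, but not identical to, a known name."""
    distance = levenshtein(domain.lower(), brand.lower())
    return 0 < distance <= max_distance


print(levenshtein("examp1ebank", "examplebank"))         # 1
print(looks_typosquatted("examp1ebank", "examplebank"))  # True
```

A small non-zero distance suggests a look-alike name, whereas a distance of zero is the legitimate name itself and is therefore not flagged.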
Rao and Pais [28] developed a phishing website classifier using the URL, the hyperlinks on the HTML code and third-party services including the age of the domain and the page rank on Alexa. They reached 99.31% accuracy with a Random Forest classifier.
Yang et al. [36] proposed an Extreme Learning Machine (ELM) model and established three different groups of features: (i) Surface features, composed of 12 URL handcrafted and 4 Domain Name System (DNS) features related to the registration date and the DNS records for that domain; (ii) 28 Topological features that are related to the structure of the website and (iii) 12 deep features related to the text and image similarity. Combining these sets of features and the ELM classifier, they obtained 97.5% accuracy.
Sadique et al. [37] presented a framework for real-time phishing detection using four sets of URL features: (i) lexical features related to the number of characters, dots and symbols found in different parts of the URL, (ii) host-based features that are related to the host, (iii) WHOIS features that are related to the registration date and (iv) GeoIP-based features like the Autonomous System Number (ASN). A total of 142 individual features were evaluated using 98,000 samples from PhishTank, where legitimate samples are also picked from false positives collected at PhishTank. They obtained 90.51% accuracy with a Random Forest classifier using the proposed descriptors.
Li et al. [29] presented a stacking model which was the combination of three models: Gradient Boost Decision Tree (GBDT), eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Model (LGBM). This stacking model was fed with a set of features from different sources: eight from the URL, 11 from the HTML and HTML string embeddings inspired by the Word2Vec model [38]. They obtained 97.30% accuracy using a dataset of 49,947 samples.

C. DEEP LEARNING
Regarding the methods based on deep learning, Somesha et al. [39] proposed a model based on Long Short-Term Memory (LSTM) to classify phishing URLs using ten handcrafted features from Rao and Pais [28]. Those features are three URL features based on the number of dots, the length of the URL, and the presence of HTTPS, and six features extracted from the HTML, including the internal links and images, the ratio of broken links and the presence of anchor links in the HTML body. Finally, one third-party numeric feature was obtained from Alexa's Page Rank. These features were extracted from a dataset of 3,526 samples and introduced into the LSTM model to obtain 99.57% accuracy.
Aljofey et al. [40] presented an RCNN model to classify phishing URLs. They used the URL as input for a tokenizer and then used a one-hot encoding to represent the URL as a matrix at the character level. The last step is to set a fixed length of 200 characters for the model input. If the URL is under that threshold, the remaining positions are filled with zeros. Otherwise, the characters above the limit are trimmed. Finally, they used a dataset of 310,642 URLs to feed an RCNN model, which obtained 95.02% using the aforementioned character-level embedding features.
Al-Alyan and Al-Ahmadi [41] proposed a modified Convolutional Neural Network (CNN). First, they omitted the URL protocol and then cropped URLs longer than 256 characters. They used a 69-character alphabet with lower-case letters, numbers and some symbols to obtain a 128-dimensional embedding vector. Then, a one-dimensional CNN was applied to obtain 95.78% accuracy on a dataset of 2,307,800 URLs.
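The fixed-length, character-level one-hot representation described above can be sketched as follows. This is an illustration under stated assumptions, not the code of [40]: the alphabet `ALPHABET` is a guess (the source does not list the characters used), and `one_hot_url` is a hypothetical helper name. Only the length of 200, the zero padding and the trimming come from the description.

```python
import string

# Assumed alphabet; the characters actually used in [40] are not given in the text.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 200  # fixed input length described in the text


def one_hot_url(url: str, max_len: int = MAX_LEN) -> list:
    """Encode a URL as a max_len x len(ALPHABET) matrix of 0/1 values."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for position, ch in enumerate(url.lower()[:max_len]):  # trim long URLs
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:  # characters outside the alphabet stay all-zero
            matrix[position][index] = 1
    return matrix  # rows beyond the URL length remain zero (padding)


encoded = one_hot_url("http://a.test")
print(len(encoded), len(encoded[0]))  # 200 rows, one column per alphabet character
```

Every URL therefore maps to a matrix of the same shape, which is what allows a batch of URLs to be fed to a convolutional or recurrent model.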
Zhao et al. [42] presented a Gated Recurrent Neural Network (GRU) capable of learning sequences and patterns within the URLs. They compared this approach against a set of 21 handcrafted features combined with an RF classifier. Results showed how automatic feature extraction combined with GRUs outperformed RF, reaching 98.5% and 96.4% respectively.

III. DATASET: PHISHING INDEX LOGIN URLs (PILU-90K)
Phishers use login forms to retrieve and steal users' data. As far as we know, the legitimate class in most phishing datasets is represented by URLs from their homepages [18], [19]. However, most websites have their login form in a different location, which biases models trained with such public datasets, since the URLs of homepages tend to be shorter and simpler than others. An example of this is depicted in Figure 2.
In this paper, we present an extended version of the Phishing Index Login URL (PILU-60K) dataset [20] and we name it PILU-90K. PILU-90K contains 90K URLs divided into three classes (see Figure 2): 30K legitimate URLs of homepages, 30K legitimate login URLs and 30K phishing URLs. We collected the legitimate URLs from the Top Million Quantcast website, which provides the most visited domains from the United States. The list provided on that website only contains the domain names, so we visited them to extract the complete URL. To reach the login page from a website, we used the Selenium web driver and Python, checking buttons or links that could lead to the login form web page. Once we found the presumptive login page, we inspected whether the form had a password field in order to confirm that it was a login form. Otherwise, it was not added to the dataset. We collected reported phishing URLs from PhishTank [21], [36], [39], between November 2019 and February 2020.
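The password-field check used to confirm a login form can be sketched as below. Note the substitution: the collection described above used the Selenium web driver on live pages, whereas this sketch uses Python's standard `html.parser` on an HTML string so that it stays self-contained. The class `PasswordFieldFinder` and the function `is_login_page` are hypothetical names.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Records whether an <input type="password"> appears inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            # Attribute values can be None, hence the fallback to "".
            if (dict(attrs).get("type") or "").lower() == "password":
                self.has_password_field = True

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def is_login_page(html: str) -> bool:
    """True when the HTML contains a form with a password field."""
    finder = PasswordFieldFinder()
    finder.feed(html)
    return finder.has_password_field


page = '<form><input type="text" name="user"><input type="password" name="pw"></form>'
print(is_login_page(page))  # True
```

A static parser of this kind would miss forms rendered by JavaScript, which is one reason a browser-driving tool such as Selenium is a reasonable choice for the actual crawl.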
In this work, we have built two subsets from the PILU-90K dataset to conduct the proposed experiments. The first one, named PIU-60K (Phishing Index URLs), is built using the URLs of both the homepages of the legitimate samples and the phishing ones, following the configuration of most of the current state-of-the-art approaches. The second one, PLU-60K (Phishing Login URLs), follows our strategy, i.e. it contains URLs of both legitimate login pages and phishing ones. Table 1 shows the distribution of the available URLs into each subset.
To the best of our knowledge, none of the works in the state-of-the-art use legitimate login URLs specifically. By using legitimate login URLs, our work not only reflects the real-world scenario but also shapes an unbiased dataset in terms of URL length. Table 2 includes examples of URLs of each class in PILU-90K, where differences are noticeable between the legitimate index URLs and the other two classes. Specifically, the length of the different parts of the URLs and the usage of keywords like login, signin or secure are the most remarkable ones. Figure 3 provides an overview of the distribution of URL lengths in the proposed subsets, where PLU-60K displays a more similar distribution between classes than the PIU-60K subset.
Apart from the number of features, PILU-90K recreates a challenging scenario for URL phishing detection. On the one hand, a quarter of the legitimate login forms URLs do not have a path, i.e. login forms were located on the homepages, matching its URL structure with the homepage samples. On the other hand, one out of seven samples from the phishing class does not have a path, so they will also match with the legitimate homepage samples, increasing the classification challenge, even for skilled humans.

IV. METHODOLOGY
In this paper, we compare the performance of machine learning and deep learning methods for URL phishing classification. Regarding ML techniques, we used two feature extraction approaches: (i) the handcrafted features proposed by Sahingoz et al. [21] and (ii) statistical features using Term Frequency-Inverse Document Frequency (TF-IDF) combined with character N-grams. Concerning the DL techniques, we adopted the CNN models of Zhang et al. [43] and Kim [44].

A. MACHINE LEARNING TECHNIQUES
Text classification based on supervised machine learning consists of three main stages: text preprocessing, text representation to convert the input text into a vector of features, and a classifier. In this section, we explain the two techniques we used to extract features along with the evaluated classifiers.
For the handcrafted features, URLs were parsed using the tldextract library (https://pypi.org/project/tldextract/). Then, raw words are extracted from the different parts of the URL by splitting the string using a set of symbols (specifically, '/', '-', '.', '@', '?', '&', '=', '_'). After preprocessing, we extracted the 38 features proposed by Sahingoz et al. [21] using URL rules and NLP features: the frequency of the aforementioned symbols, and the number of digits in the domain, subdomain and path (see Figure 2) along with their lengths. Other features are evaluated, such as the number of subdomains, domain randomness using the Markov Chain Model, whether it has a common TLD (Top Level Domain), and whether 'www' or 'com' appear in places other than the TLD. From the raw words, the following metrics are extracted: maximum, minimum, average and standard deviation of the word length, number of words, compound words, words equal or similar to famous brands or to a keyword like 'secure' or 'login', consecutive characters in the URL and the presence of Punycode.
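A reduced sketch of this kind of feature extraction is shown below. It computes only a handful of descriptors, not the 38 features of [21], and it swaps tldextract for the standard `urllib.parse` module to stay self-contained. The function name `handcrafted_features`, the dictionary keys and the short keyword list are assumptions made for the example.

```python
import re
import statistics
from urllib.parse import urlsplit

SPLIT_PATTERN = re.compile(r"[/\-.@?&=_]")  # the splitting symbols listed in the text
KEYWORDS = {"secure", "login", "signin"}    # illustrative keyword list


def handcrafted_features(url: str) -> dict:
    """Compute a few URL descriptors; not the full 38-feature set."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    rest = parts.path + ("?" + parts.query if parts.query else "")
    words = [w for w in SPLIT_PATTERN.split(host + rest) if w]
    lengths = [len(w) for w in words] or [0]
    return {
        "host_length": len(host),
        "path_length": len(parts.path),
        "digits_in_host": sum(ch.isdigit() for ch in host),
        "dot_count": url.count("."),
        "word_count": len(words),
        "max_word_length": max(lengths),
        "min_word_length": min(lengths),
        "mean_word_length": statistics.mean(lengths),
        "std_word_length": statistics.pstdev(lengths),
        "keyword_count": sum(w.lower() in KEYWORDS for w in words),
        "has_punycode": "xn--" in host,
    }


features = handcrafted_features("https://login.example-bank.test/secure/signin.php?id=42")
print(features["word_count"], features["keyword_count"])  # 9 3
```

Descriptors such as the lengths and the keyword count are the ones the later experiments identify as sensitive to whether the legitimate class contains homepages or login pages.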
In NLP, another popular feature extraction technique is the TF-IDF algorithm [49], a statistical approach that gives more or less weight to a term depending on how many documents the term occurs in, i.e. the higher the number of URLs a term occurs in, the lower the weight, and vice versa. A term in the TF-IDF algorithm can be either a word or a character N-gram. Given that the URLs might not have word terms in common, we resorted to character N-grams. Therefore, TF-IDF operates at the character N-gram level to find patterns of N consecutive characters in a given URL. Following the work of Al-Nabki et al. [50], we extracted N-grams of two to five characters, i.e. N ∈ [2, 5]. The text preprocessing was limited to converting the text to lower case. The extracted features were used to train an LR classifier given its good performance on similar noisy text tasks, such as File Name Classification [50], [51].
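The weighting idea can be sketched in plain Python as follows. This is a simplified illustration that uses the textbook weight tf · log(N / df) without smoothing or normalization, so its values will differ from those of a library implementation. In scikit-learn, a comparable configuration would be `TfidfVectorizer(analyzer="char", ngram_range=(2, 5), lowercase=True)` followed by `LogisticRegression`. The function names and example URLs are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(url: str, n_min: int = 2, n_max: int = 5) -> list:
    """All character N-grams of the lower-cased URL for N in [n_min, n_max]."""
    text = url.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf(urls: list) -> list:
    """Return one {ngram: weight} dict per URL using tf * log(N / df)."""
    counts = [Counter(char_ngrams(u)) for u in urls]
    document_frequency = Counter()
    for c in counts:
        document_frequency.update(c.keys())  # each URL counts once per N-gram
    n_docs = len(urls)
    return [
        {g: tf * math.log(n_docs / document_frequency[g]) for g, tf in c.items()}
        for c in counts
    ]


urls = ["http://a.test/login", "http://b.test/login", "http://c.test/xyz"]
weights = tfidf(urls)
# "ht" occurs in every URL, "lo" in two of them and "xy" in only one.
print(weights[0]["ht"], weights[0]["lo"], weights[2]["xy"])
```

N-grams shared by every URL, such as those from the scheme, receive a weight of zero, while rarer N-grams receive larger weights, which matches the behavior described above.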

B. DEEP LEARNING TECHNIQUES
Besides the machine learning approaches, we explored the use of CNN to classify URLs [19], [41]. We selected the architectures of Zhang et al. [43] and Kim et al. [44], which operate at a character level.
The model of Kim et al. was originally built to function as a character-based language model. To use the model for URL classification, we replaced the subsequent recurrent layers with a dense layer that performs a softmax operation over the classes. In contrast, the model of Zhang et al. did not require modifications to its architecture as it was intended for text classification. It is worth mentioning that, for both models, we did not carry out any text preprocessing step.

A. DATASETS
To test the model robustness against URLs collected in different periods, we used the five phishing datasets shown in Table 3.
These datasets are grouped into two different categories depending on their collection strategy: (i) category A: PWD2016, 1M-PD and PIU-60K collected legitimate samples by inspecting the top-visited domains and (ii) category B: Ebbu2017 and PLU-60K visited those websites and performed further actions: in the case of Ebbu2017, its authors retrieved the inner URLs and, in the case of PLU-60K, we looked for the login form page. Therefore, most of the URLs in category B include a path. Table 4 shows the distribution of sample structure within the datasets.
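The structural difference between the two categories, namely whether a URL includes a path, can be measured with a short sketch like the one below. The helper names `has_path` and `share_with_path` and the example URLs are hypothetical, and treating a bare "/" as "no path" is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def has_path(url: str) -> bool:
    """True when the URL has a non-empty path other than '/'."""
    return urlsplit(url).path not in ("", "/")


def share_with_path(urls: list) -> float:
    """Fraction of URLs in the list that include a path."""
    return sum(has_path(u) for u in urls) / len(urls) if urls else 0.0


sample = [
    "https://example.test/",
    "https://example.test",
    "https://shop.example.test/account/login",
    "https://mail.example.test/",
]
print(share_with_path(sample))  # 0.25
```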

B. EXPERIMENTAL SETTINGS
Experiments were executed on an Intel Core i3 9100F at 3.6 GHz with 16 GB of DDR4 RAM. We used scikit-learn and Python 3 for the implementation of the different experiments.
TP denotes the true positives, i.e., how many phishing websites were correctly classified. FP refers to the false positives and represents the number of legitimate samples wrongly classified as phishing. TN (i.e., the true negatives) denotes the number of legitimate samples correctly classified. Finally, FN represents the false negatives that represent the number of phishing websites misclassified as legitimate ones.
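As a reminder of how these four counts combine, the sketch below computes the usual evaluation metrics with phishing as the positive class. It assumes the standard textbook definitions of accuracy, precision, recall, F1-score and false-positive rate; the function name and the example counts are hypothetical and are not results from this work.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics with phishing as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up counts used only to exercise the function.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(metrics["accuracy"], metrics["precision"])  # 0.85 0.9
```

When a test set contains only legitimate samples, TP and FN are zero, so accuracy reduces to TN / (TN + FP), i.e. one minus the false-positive rate.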
Regarding the clustering experiments, we used the same approach of Al-Nabki et al. [50] for text representation, as explained in Section IV-A and, for the clustering, we used the Agglomerative Hierarchical Clustering (AHC) [52]. The clustering process is repeated four times, and each time we initialized the AHC with the number n of the desired clusters, i.e. n ∈ {4, 5, 6, 7}.

A. MACHINE LEARNING AND DEEP LEARNING APPROACHES
In the following, we report the result of the designed machine learning classifiers using both handcrafted and automatic feature extraction techniques. Then, deep learning approaches are presented and compared with the previous ones. Finally, we proved the impact of using legitimate login URLs against the current state-of-the-art approach.

1) HANDCRAFTED FEATURE EXTRACTION
In this configuration, we extracted handcrafted features and benchmarked several classifiers, as explained in Section IV-A. Each model was trained and tested on each subset of the PILU-90K dataset. Table 6 reports the performance of each classifier. It can be seen that XGBoost, LightGBM and RF outperform the rest of the classifiers on both subsets, obtaining 93.22%, 93.12% and 92.91% accuracy on PLU-60K, respectively. For the PIU-60K subset, they obtained 94.63%, 94.67% and 94.42% accuracy, respectively. Results for the eight machine learning algorithms showed that the descriptors of Sahingoz et al. [21] achieve better performance on PIU-60K. Length-based features, the number of words and the presence of keywords enhance the performance when the difference between legitimate and phishing URLs is significant. Using the PLU-60K subset, such descriptors decrease their performance since their values are similar between classes.
VOLUME 10, 2022

2) AUTOMATIC FEATURE EXTRACTION
In this experiment, we evaluate the classification pipeline that uses TF-IDF and character N-gram for feature extraction and LR for classification, as explained in Section IV-A. For each subset of the PILU-90K dataset, we trained a classification model and reported its performance. Automatic feature extraction methods have outperformed all the other methods in the F1-score, including those based on Deep Learning. For the PIU-60K, the classifier obtained an accuracy of 96.93%, while for the PLU-60K, accuracy was 96.50%. Hence, this model outperforms the benchmarked classifiers that depend on handcrafted features (see Table 6).

3) EVALUATION OF DEEP LEARNING-BASED PHISHING DETECTION MODELS
Similarly, we trained and evaluated the proposed character-based CNN models on both subsets of the PILU-90K dataset. We found that the model of Zhang et al. [43] has an accuracy of 95.22% on the PIU-60K subset and 94.10% on the PLU-60K one. The model of Kim [44] has a slightly better result, with an average accuracy of 96.43% on PIU-60K and 96.00% on PLU-60K (see Table 6). Compared to the machine learning algorithms, both CNN models obtained better results than handcrafted features, but TF-IDF combined with N-grams [50] remains the best classifier for the two proposed subsets.

4) IMPACT OF THE REPRESENTATION OF THE LEGITIMATE CLASS ON THE CLASSIFICATION
We assessed the impact on URL phishing classifiers when they are trained with samples where the legitimate class is represented with homepage URLs, e.g. PIU-60K. We trained 11 classifiers and reported their accuracy, as shown in Figure 4. Then, these models classified 30,000 legitimate login URLs and their accuracy was reported again. It can be seen that all the models suffered a significant decrease in accuracy. The model of Al-Nabki et al. [50] lost 27% accuracy and was the most resilient, with 69.50% accuracy. SVM lost 39.12% accuracy and obtained the worst result, 54.46% accuracy. The CNN models of Zhang et al. [43] and Kim [44] obtained an accuracy of 65.13% and 63.50%, respectively. Furthermore, models based on handcrafted features obtained the lowest accuracy, probably due to the length-based features.
We observed that all models, including those trained with automatic features, misclassified more than 30% of the legitimate login URLs. These results can hinder the deployment of such models in real-world applications, since they present a high false-positive rate. We argue that our TF-IDF and N-gram approach trained with PLU-60K can solve this issue, since it can classify legitimate login samples with high accuracy, as seen in Table 6. It should be noted that this capability comes at the cost of lower overall accuracy, in exchange for fewer false positives when users visit login pages.

B. ANALYSIS OF THE PERFORMANCE OF PHISHING MODELS OVER TIME
Recent machine learning proposals have reported good performance trained with PWD2016 and Ebbu2017 data sets. Since phishing attacks and, as a consequence, phishing websites' URLs get more and more sophisticated over time, we hypothesize that models trained with outdated datasets may decrease their performance when analyzing recent URLs.
TABLE 6. Performance of the assessed algorithms on the subsets of PILU-90K datasets. The eight first rows correspond to handcrafted feature extraction methods, whereas the ninth one corresponds to automatic feature extraction methods. The last two columns depict the results for the assessed deep learning models. All the results are given in %.
To prove if this hypothesis is correct, we used PWD2016 and Ebbu2017 and the features from Sahingoz et al. [21] to train eight machine learning models (see Table 7) and test them using URLs from recent years. These datasets are 1M-PD from 2017, PIU-60K from 2020 and PLU-60K also from 2020. Among the proposed datasets we found two categories (see Section III). Datasets in category A were built using legitimate homepage URLs with no path, whereas in category B they include the path. For each category, we created a pipeline to avoid biased results. The first pipeline was focused on classifying URLs with no path, and we used category A datasets: PWD2016, 1M-PD and PIU-60K containing URLs collected in 2016, 2017 and 2020, respectively. In this pipeline, PWD2016 was used to train the eight machine learning algorithms and then it was evaluated using 1M-PD and PIU-60K. The second pipeline focused on classifying URLs with a path and, in this case, we used the datasets from category B: Ebbu2017 and PLU-60K, which contain URLs collected in 2017 and 2020, respectively. In this case, Ebbu2017 was used to train the proposed algorithms and then PLU-60K was utilized to test its performance.
From the experimental results shown in Table 7, all models struggled to endure over time and their performance decreased when tested on the following years' datasets. The model LightGBM obtained the best accuracy on both pipelines, but its results were the most affected over time, losing 10.42% and 30.69% accuracy on the first and second pipelines, respectively. On the other hand, SVM obtained the best results on recent datasets for the first pipeline, achieving 89.04% on the PIU-60K test, a 6.24% less than with the PWD2016 dataset used for training.
Overall, the results for the first pipeline showed that models trained with a four-year-old dataset could not reach 90% accuracy on recent URLs, even though they obtained high performance on the base dataset. The second pipeline, which classifies URLs with paths, also struggled to maintain performance on recent URLs.

C. CLUSTERING PHISHING URLs
In this experiment, we attempted to cluster the phishing URLs in search of patterns. Analyzing the obtained clusters, we did not identify significant relations among samples, regardless of the number of clusters we tried. When n = 7, we noticed some associations between URLs, but a further manual inspection of the clusters led to inconclusive results. URLs were grouped by similarities between parts of the URL, i.e. similar domain or subdomain names fell in the same cluster, but no further conclusions could be extracted.
To look for phishing categories, we performed a term frequency analysis over the domain names of the URLs. First, we parsed each URL and obtained its domain using the tldextract Python library. Then, we sorted the results according to domain frequency. We observed that the phishing class holds 12,980 unique domain names, of which 3,543 were repeated with a different subdomain or path. In order to identify the different categories, we performed a manual analysis of the 35 most common domains: we visited those domains and evaluated the services provided on each one, resulting in the six categories reflected in Table 8.
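The domain frequency step can be sketched as follows. This is an illustrative approximation rather than the exact procedure: to stay within the Python standard library, the hypothetical helper `naive_registered_domain` takes the last two host labels as the domain, whereas a public-suffix-aware parser such as tldextract also handles suffixes like "co.uk". The URLs are toy values on reserved example domains.

```python
from collections import Counter
from urllib.parse import urlsplit


def naive_registered_domain(url: str) -> str:
    """Approximate the registered domain as the last two host labels.

    Assumption: multi-label public suffixes (e.g. "co.uk") and IP address
    hosts are not handled; a public-suffix-aware parser would be needed.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def domain_frequencies(urls):
    """Return (domain, count) pairs sorted from most to least frequent."""
    return Counter(naive_registered_domain(u) for u in urls).most_common()


# Toy phishing URLs (illustrative only).
urls = [
    "https://login-bank.freehost.example/verify",
    "https://secure-pay.freehost.example/index.html",
    "http://forms.cloud.example/d/abc",
    "http://standalone.example/login",
]

freq = domain_frequencies(urls)
# Domains that appear more than once, i.e. reused with another subdomain or path.
repeated = [domain for domain, count in freq if count > 1]

for domain, count in freq:
    print(domain, count)
```

Sorting by frequency in this way puts shared hosting domains at the top of the list, which is what makes a manual review of only the most common domains practical.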
The first group is related to free subdomains, i.e. services that allow phishers to host their fake websites and make them accessible to the public. Typically, these services allow attackers to create a custom subdomain name for their website, which helps them deceive users by including popular company names or by using typosquatting and combosquatting techniques [53]. The main advantage of these hosting services is their price, as they have free plans for which phishers only provide an email address, with no identity confirmation. Another advantage is the free SSL certificate they offer. The only disadvantage may be the limited resources offered by the free plans in terms of bandwidth, storage and computation.
The second group comprises cloud services. In this approach, phishers rent resources on different cloud platforms, such as Google or Azure, to host their phishing website with an SSL certificate. Some of these services provide fixed or random subdomains, and only the path can be edited. The main disadvantages of this strategy are the price and the fact that phishers have to provide payment information to rent the service.
Fake forms are another common phishing method. In these attacks, phishers use form platforms from Google, Microsoft or Typeform to look legitimate, using logos and messages to encourage users to enter their credentials. Companies have detected these issues and advise users not to enter personal information or credentials in such forms.
Social media and malware blog posts are reported on PhishTank to warn users against visiting those sites. These domains usually offer recent films that users can download and watch for free. The downloaded files are detected as malware by many commercial antivirus systems, such as Avast.
Finally, most of the dataset samples are related to standalone domains bought or compromised by phishers to host their websites. Within this category, some domains are used to host different phishing campaigns over time. They go online during active campaigns and offline when such campaigns have finished or when they have been reported to blacklists.

VII. CONCLUSION
Phishing detection mechanisms aim to improve current blacklist methods, protecting users from malicious login forms. Our work provides an updated dataset, PILU-90K, for researchers to train and test their approaches. This dataset includes legitimate login URLs, which are the most representative scenario for real-world phishing detection.
We explored several URL-based detection models using deep learning and machine learning solutions trained with phishing and legitimate login URLs. The main advantage of our approach is the low false-positive rate when classifying this type of URL. Among the different evaluated models, TF-IDF combined with N-gram and the LR algorithm obtained the best results, with 96.50% accuracy. In comparison with the current state of the art, reviewed in Section II, our approach presents three main advantages:
No dependence on external services. A limitation of methods that use features such as WHOIS domain age, page ranking on Google or Alexa, or online blacklists is their dependence on those services. Network slowdowns and service outages can negatively impact analysis time, making real-time execution infeasible. Since phishing websites have a short lifespan [12], low detection times are required to warn users before they access phishing websites.
Login website detection. Unlike other methods, which are trained with homepage URLs as representatives of the legitimate class, our model was trained with legitimate login websites. This ensures the correct classification of those websites. Therefore, our approach can be applied to the real-case scenario where users have to determine whether a login form page is legitimate or phishing.
Updated and real-world dataset. PLU-60K is focused on using updated legitimate login URLs. As demonstrated, models trained with old datasets were not able to maintain their performance over time. We provide an updated phishing URL dataset so that models can learn from current phishing URLs and trends, which are crucial for real-world performance.
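To make the TF-IDF, N-gram and LR combination named above more concrete, the following standard-library Python sketch weights character trigrams with smoothed TF-IDF and fits a logistic regression by stochastic gradient descent. It is an illustration under stated assumptions, not the evaluated implementation: the N-gram unit and size, the smoothing, the optimizer settings, all function names and the toy URLs are assumptions.

```python
import math
from collections import Counter


def char_ngrams(url, n=3):
    """All overlapping character n-grams of the URL (n = 3 is an assumption)."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


def fit_idf(urls, n=3):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    doc_freq = Counter()
    for url in urls:
        doc_freq.update(set(char_ngrams(url, n)))
    total = len(urls)
    return {g: math.log((1 + total) / (1 + df)) + 1.0
            for g, df in doc_freq.items()}


def tfidf_vector(url, idf, n=3):
    """Sparse, L2-normalised TF-IDF vector; unseen n-grams are ignored."""
    counts = Counter(g for g in char_ngrams(url, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}


def predict_proba(weights, bias, vec):
    """Logistic regression probability that the URL is phishing."""
    z = bias + sum(weights.get(g, 0.0) * v for g, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(vectors, labels, epochs=300, lr=0.5):
    """Plain stochastic gradient descent on the log-loss, no regularisation."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            error = predict_proba(weights, bias, vec) - y
            for g, v in vec.items():
                weights[g] = weights.get(g, 0.0) - lr * error * v
            bias -= lr * error
    return weights, bias


# Toy training set (illustrative only): 1 = phishing, 0 = legitimate login.
train = [
    ("http://secure-login.example.net/verify/account", 1),
    ("http://account-update.example.org/signin.php", 1),
    ("https://www.example.com/login", 0),
    ("https://shop.example.com/account/login", 0),
]
urls = [u for u, _ in train]
labels = [y for _, y in train]

idf = fit_idf(urls)
vectors = [tfidf_vector(u, idf) for u in urls]
weights, bias = train_logistic_regression(vectors, labels)

for url, vec in zip(urls, vectors):
    print(f"{predict_proba(weights, bias, vec):.3f}  {url}")
```

Because the features are computed from the URL string alone, a model of this kind needs no WHOIS, ranking or blacklist lookups at prediction time, which is the property highlighted in the first advantage above.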
We demonstrated that phishing URL detection systems trained with legitimate landing page URLs fail to classify legitimate login URLs correctly. The best models tested could only classify 69.50% of these URLs correctly, which implies a high false-positive rate. For this reason, we recommend that a phishing detector intended for use in a real situation be trained using legitimate login websites (such as PLU-60K) instead of homepages. The main drawback of using login websites for training is that, due to the similarity between phishing and legitimate samples, overall accuracy is slightly reduced. We consider this tradeoff acceptable given the high false-positive rate of state-of-the-art methods on legitimate login URLs.
Different categories of current phishing attacks were identified through a domain frequency analysis. While standalone and compromised domains were the most common approach, free hosting services, cloud web servers and malware blog posts account for many current phishing attacks due to their cost and effectiveness for phishing campaigns.
Finally, we demonstrated that machine learning models using handcrafted URL features lost performance over time, by up to 10.42% accuracy in the case of the LightGBM algorithm between 2016 and 2020. For this reason, machine learning methods should be trained with recent URLs to prevent substantial ageing after their release.
In the future, we will add more information about the samples to the analysis, such as the source code of the website and a screenshot of its content, which could be useful to increase phishing detection performance. In addition, we will enlarge our dataset to include such information. Finally, observing that deep learning techniques and automatic feature extraction obtained promising results compared with traditional feature extraction, we intend to explore different URL codifications to improve detection performance.