PhishHaven—An Efficient Real-Time AI Phishing URLs Detection System

Different machine learning and deep learning-based approaches have been proposed for designing defensive mechanisms against various phishing attacks. Recently, researchers showed that phishing attacks can be performed by employing a deep neural network-based phishing URL generating system called DeepPhish. To prevent this kind of attack, we design an ensemble machine learning-based detection system called PhishHaven to identify AI-generated as well as human-crafted phishing URLs. To the best of our knowledge, this is the first study to consider detecting phishing attacks by both AI and human attackers. PhishHaven employs lexical analysis for feature extraction. To further enhance lexical analysis, we introduce URL HTML Encoding to classify URL on-the-fly and proactively compare with some of the existing methods. We also introduce a URL Hit approach to deal with tiny URLs, which is an open problem yet to be solved. Moreover, the final classification of URLs is made on an unbiased voting mechanism in PhishHaven, which aims to avoid misclassification when the number of votes is equal. To speed up the ensemble-based machine learning models, PhishHaven employs a multi-threading approach to execute the classification in parallel, leading to real-time detection. Theoretical analysis of our solution shows that (1) it can always detect tiny URLs, and (2) it can detect future AI-generated Phishing URLs based on our selected lexical features with 100% accuracy. Through experiments, we analyze our solution with a benchmark dataset of 100,000 phishing and normal URLs. The results show that PhishHaven can achieve 98.00% accuracy, outperforming the existing lexical-based human-crafted phishing URLs detection systems.


I. INTRODUCTION
The distinctive characteristics of machine learning, ranging from detecting and extrapolating patterns to adapting to a new environment, make it a crucial part of technological systems such as nuclear power plant monitoring, cyber and homeland security, computer vision, and the IoT (Internet of Things), to name a few. In [2], the authors demonstrated that machine learning is effective in providing security for IoT-based systems. With the increasing demand for security, machine learning-based systems usually outperform traditional human-based security monitoring systems. Today, when the world relies heavily on electronic communications, connected devices lead to a variety of online threats and cyber attacks every day. In [3], the authors discussed in detail how cyber attacks on smart grids can be carried out in different phases and forms. In [4], the authors highlighted how cyber attacks on load forecasting can affect the crucial operational decisions needed for electricity delivery. In [5], the authors focused on FDI (false data injection) attacks and how to mitigate them. In [6], the authors investigated the effects of cyber attacks on power grids.
The associate editor coordinating the review of this manuscript and approving it for publication was Fuhui Zhou.
Among a wide range of online threats and cyber attacks, phishing is the most common one. A phishing attack is any fraudulent attempt in which the attacker poses as a trustworthy party to obtain sensitive information. Phishing attacks can be of different types, including e-mail spoofing, website forging, social engineering, etc. One of the subtle yet deceptive methods of performing phishing attacks is the use of phishing URLs. Phishing URLs are URLs which are specially crafted by phishing attackers. Their common characteristic is that they appear to be legitimate URLs but redirect users to the attackers' websites.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
According to the report published by the APWG (Anti-Phishing Working Group) on November 4, 2019 [7], the number of phishing attacks has risen to a level not seen since late 2016. The report covers the phishing attacks carried out during the first three quarters of 2019. Several studies have been conducted to prevent, mitigate and even correct phishing attacks. The majority of these studies focus on using different machine learning models, deep learning models and/or combinations of these models. In [8], the authors studied in detail how to build potential cybersecurity systems using machine learning. Researchers and security analysts tend to improve phishing URL detection systems through machine learning and deep learning models. In [9], the authors studied the optimization of phishing URL detection systems through genetic algorithms. In [10], the authors also designed a phishing URL detection system. Over time, however, phishing adversaries have also expanded their horizons (i.e., targeting different end-devices) and enhanced their attack strategies. In [11], the authors conducted a study to highlight the phishing attacks performed on mobile devices along with defence mechanisms and existing challenges. In [1], the authors proposed a model named ''DeepPhish'', which is specially designed to generate AI phishing URLs. DeepPhish [1] takes Simple Phishing URLs, i.e., human-crafted phishing URLs, as its input and generates new phishing URLs. The majority of these newly generated phishing URLs, i.e., AI-generated Phishing URLs, are capable of bypassing prevalent phishing detection systems. Given this, it is easy to foresee a near future in which AI is used to carry out highly sophisticated malicious attacks, known as ''Offensive AI''.
A report by DARKTRACE [12] showed how a new paradigm of cybersecurity threats will emerge with AI-driven attacks, enabling attackers to combine the characteristics of AI, such as impersonating trusted users, mimicking users' behaviors, autonomous decision making, etc., with existing sophisticated attacks and malware. In [13], the authors critically examined ''machine ethics'' and concluded that machine ethics is not an appropriate technological fix for the social problems arising from AI applications.
Furthermore, although machine learning and deep learning models are statistical models crafted to perform specific tasks effectively without any external instructions, they still lack accuracy in performing those tasks, resulting in misclassifications. There can be multiple reasons for this lack of accuracy, e.g., mislabeled data, inappropriate feature reduction or selection, and over-fitting or under-fitting. One of the most important reasons is the restriction imposed by the models' architecture. That is, the internal structure of a model restricts the types of features it can manipulate and analyze. For example, Linear Regression models perform very well for features or patterns having linear relationships among them, but perform poorly when the relationships are non-linear. In [14], the author demonstrated various limitations of three different types of Boltzmann machine learning procedures. Due to this, even if we a) provide models with ample data, b) perform a proper feature reduction or selection process, and c) avoid over-fitting or under-fitting, models may still fall short of generating accurate results because they are unable to handle different types of features.
To address the above-mentioned problems, we propose PhishHaven, an efficient real-time AI-generated Phishing URLs detection system. Our study of the relevant literature shows that PhishHaven is the first phishing detection system designed to detect AI-generated Phishing URLs. PhishHaven is specially designed to detect phishing URLs generated by DeepPhish [1]. Our proposed system uses lexical features-based extraction and analysis techniques. To proactively detect and classify a URL on-the-fly, we additionally introduce URL HTML Encoding as a lexical feature to further boost PhishHaven. In addition, we introduce URL Hit, an approach to effectively detect tiny URLs. Furthermore, we design a new paradigm for executing ensemble-based machine learning for PhishHaven. This paradigm executes the ensemble machine learning models in parallel, using a multi-threading approach, in both the training and testing phases. PhishHaven also employs an unbiased voting concept in its decision-making process to assign final labels (i.e., either phishing or normal) to the URL(s).
The main contributions of this study can be summarized as follows:
1) We propose the first AI-generated Phishing URLs detection system which is capable of detecting phishing URLs generated by DeepPhish [1] with high precision, accuracy and F1-measure of 98%. Even for Simple Phishing URLs, it outperforms the other existing detection systems with higher precision.
2) We introduce URL HTML Encoding as an additional lexical feature to classify URLs proactively and on-the-fly.
3) We introduce a URL Hit approach which can detect any tiny URL, whether it belongs to the phishing or the normal category. Our approach is completely independent of any URL shortening software, algorithms or methodologies, and does not require any prior knowledge in this regard.
4) We propose a new paradigm of execution for ensemble machine learning, which consists of the parallel execution of ensemble-based machine learning models through multi-threading. Parallel execution in the training and testing phases speeds up processing and hence allows phishing URLs to be detected in real time.
5) The proposed detection system boasts various desirable features. First, it is independent of any third-party services (i.e., WHOIS, Team Cymru, etc.) because all the procedures, including feature extraction from a URL and examination and classification of a URL, are performed within our detection system. Second, it is independent of languages because it analyzes URLs only. Third, it is capable of detecting zero-day attacks because our detection system analyzes a URL based on the URL's lexical features.
The rest of the paper is organized as follows: Section II discusses our main motivation. Section III provides a brief overview of different approaches used for designing different phishing URLs. Section IV presents our proposed solution along with its methodology and time complexity analysis in detail. Section V highlights the experiments and evaluations of our proposed solution. Section VI describes some related work. Finally, Section VII concludes this paper along with future work and directions.

II. MOTIVATION
We examine and analyze Simple (human-crafted) Phishing URLs and AI-generated Phishing URLs through lexical features (a process of converting URLs into a sequence of characters) and word clouds (a visual representation of words which are frequently used in URLs) based analyses. We aim to analyze the differences in the formation and behaviour of Simple Phishing URLs and AI-generated Phishing URLs.

A. LEXICAL FEATURES-BASED ANALYSIS
To analyze the lexical features-based behaviour of, and the differences between, Simple Phishing URLs and AI-generated Phishing URLs, we plot the features' counts against the number of URLs for Simple Phishing URLs and AI-generated Phishing URLs, as shown in Figure 1 and Figure 2 respectively, and compare the two figures. Firstly, some Simple Phishing URLs contain more than one colon(:), i.e., they contain ports, while AI-generated Phishing URLs never contain ports. Secondly, a few Simple Phishing URLs contain more than one double forward slash(//), i.e., besides the segment part they incorporate double forward slashes in the path section of the URL, while in AI-generated Phishing URLs double forward slashes are only used for the segment part. From the plots, we can see that there is a high variance for dots(.) in Simple Phishing URLs. Sometimes they use few dots and sometimes up to 15 dots, i.e., sometimes Simple Phishing URLs incorporate other details in the URLs apart from the domain name, SLDs(second-level domains) and TLDs(top-level domains). However, in AI-generated Phishing URLs, there is almost no variance. This shows that AI-generated Phishing URLs usually use dots to add more information besides domain name, SLDs and TLDs. Similarly, there is a high variance for the virgule(/) feature in Simple Phishing URLs. This means that Simple Phishing URLs sometimes incorporate different depths of hierarchical tree paths, while AI-generated Phishing URLs generally include longer depths of hierarchical tree paths. It can be seen that Simple Phishing URLs may sometimes use the question mark(?) for other purposes, whereas AI-generated Phishing URLs use this feature only to mark the query part of the URL.
In the plots, we can see that AI-generated Phishing URLs use the equals(=) sign fairly often and with great diversity, i.e., they make many assignments of values, IDs, etc. Simple Phishing URLs, however, rarely use the equals sign. It can also be seen that there is a great variation in the word combinations used in Simple Phishing URLs. On the contrary, there is only a subtle amount of variation in the word combinations of AI-generated Phishing URLs. For the hyphen(-) feature, there is a great variation of word separations in AI-generated Phishing URLs. Simple Phishing URLs, in contrast, have almost no variation in word separations.
Since hash(#) and exclamation(!) are used in combination to mark the fragment part, we consider them together. Very few Simple Phishing URLs contain both of them, and no AI-generated Phishing URLs do. The plot for the ampersand(&) feature shows that AI-generated Phishing URLs have a great diversity, i.e., they mostly incorporate different numbers of queries in URLs. Simple Phishing URLs, on the other hand, rarely use the ampersand feature. Some Simple Phishing URLs contain @, i.e., they incorporate a user information part. Conversely, none of the AI-generated Phishing URLs do. Simple Phishing URLs use the percentage(%) feature fairly often, whereas AI-generated Phishing URLs use it only in a subtle amount. Hence, Simple Phishing URLs use URL HTML Encoding more frequently than AI-generated Phishing URLs. Likewise, Simple Phishing URLs use digits([0-9]) quite frequently compared to AI-generated Phishing URLs. We conclude that Simple Phishing URLs use IDs, URL HTML Encodings and alphanumeric characters more frequently than AI-generated Phishing URLs. Simple Phishing URLs rarely use the plus(+) sign, but whenever they do, they use it twice per URL. AI-generated Phishing URLs use the plus sign somewhat more often than Simple Phishing URLs; DeepPhish [1] usually includes the plus feature once per URL. Neither Simple Phishing URLs nor AI-generated Phishing URLs contain the semicolon(;). Finally, some Simple Phishing URLs include the tilde(~) to specify a home directory, whereas AI-generated Phishing URLs do not use this feature. Table 1 highlights the key comparative analysis of Figure 1 and Figure 2.
The reason for performing the behavioural analysis using the specific lexical features is discussed in the subsequent section.

B. WORD CLOUDS-BASED ANALYSIS
To further investigate the behavior of AI-generated Phishing URLs, we employ Word Clouds approach. With Word Clouds approach, we are able to identify terms that are frequently used by DeepPhish [1] to generate phishing URLs.
From Figure 3, it can be observed that Simple Phishing URLs usually consist of five different parts, i.e., segment, netloc, path, query and fragment. From Figure 3a, we can deduce that Simple Phishing URLs usually use ''http'' and ''https'' as communication protocols. For the subdomain, Simple Phishing URLs use ''www'' services. In the case of the SLD, they have a length of three characters and use ''com'', but they have no TLDs in them. For the given data, the frequently used term for the domain name is ''naylorantiques''. In Figure 3c, terms including ''php, html, index, upload, identificacao, accesso, weblinks, internetBanking, do, login, views etc.'' are frequently used for the path. These terms show that they have been used to fool users and to access users' information. In Figure 3d, on the other hand, mostly different values have been used with terms like ''id, passo, plugin, default, cliente''. This means that attackers tried to redirect users to their pages or locations by assigning their locations' positions and IDs. For the fragment, we can see that Simple Phishing URLs consist of gibberish combinations of alphanumeric characters, used just to give Internet users the visual illusion that the phishing URLs are complete URLs.
Figure 4 shows that AI-generated Phishing URLs are made of four different parts, i.e., segment, netloc, path and query. Figure 4a depicts that ''http'' is generally used as the communication protocol in AI-generated Phishing URLs. Similar to the Simple Phishing URLs, AI-generated Phishing URLs also use ''www'' services as a subdomain and ''com'' as an SLD, and also have no TLDs in them. Given the dataset, we can see that AI-generated Phishing URLs used two different subdomains, i.e., ''naylorantiques'' and ''netshelldemos''.
In the path, AI-generated Phishing URLs include word combinations like ''identificacao, naylorantiques, com, home, co, docs, menu, etc.'' to hide their malicious sites, to deploy malicious code by downloading different files onto users' systems, or to trick users in order to access their credentials. For the query part, DeepPhish [1] used a combination of bogus terms like ''lnms, q, X, AGN, isch, sa, ORIGEM, tbm, espv, source, CTA, conta, etc.'' with numbers to complicate the query part. With such over-complicated queries, users are unable to understand things correctly and can be easily deceived by attackers.
Therefore, based on our exploratory data analysis, we can conclude that AI-generated Phishing URLs are usually made up of features that are somewhat similar to those of Normal URLs. Due to this, AI-generated Phishing URLs generally tend to look similar to Normal URLs. Thus, most of the time AI-generated Phishing URLs easily bypass simple phishing detection systems.
This poses a need for a detection system which is able to detect even AI-generated Phishing URLs efficiently and effectively.

III. PHISHING URLs' DESIGN APPROACHES
URLs are the first thing that can be used to analyze and classify any website as phishing or normal, and phishing URLs always have some distinctive features. Those features can be incorporated in the phishing URLs through the following main approaches.

1) HIDDEN LINKS
One way of persuading a victim to click on the phishing link is through Hidden Links. Hidden Link is a type of technique through which attackers hide the phishing URLs through:
1) Using some keywords, e.g., ''CLICK HERE'', ''DOWNLOAD'', ''SUBSCRIBE'', etc.
2) Replacing with IP addresses
3) Appending other domain names (mostly different brand names)
This technique helps attackers to easily deceive a victim and launch a phishing attack without displaying a phishing URL.

2) COMBOSQUATTING
It is also commonly known as ''Cybersquatting'' or ''Domain squatting''. In this technique, attackers either register or use the domain name for phishing purposes.
Combosquatting can be carried out in either of the following ways:
1) Correctly spelled domain name but appending a string to it that appears legitimate and anyone can register, e.g., microsoft-login.com
2) Omitting a period, also known as ''Doppelganger'' domain. Doppelganger is a type of domain that has identical spelling to that of a legitimate FQDN(Fully Qualified Domain Name) but has missing dots between subdomain and domain, e.g., enwikipedia.org instead of en.wikipedia.org
3) Adding an extra period, e.g., air.france.com instead of airfrance.com

3) TINY URLs
URL shortening is an approach that shortens expanded or lengthy URLs, resulting in tiny URLs. URL shortening is desirable for various reasons, including but not limited to making URLs aesthetically pleasing, hiding underlying confidential addresses, and making URLs easier to remember.

4) TYPOSQUATTING
It is also known as ''URL Hijacking'', a ''Sting Site'', or a ''Fake URL''. It is a form of cybersquatting. It relies solely on typos, either made by Internet users while entering website addresses in web browsers or present as typographical errors which are hard to notice when reading quickly. Hence, any typo may lead to a phishing page. This also includes brandjacking.
Typosquatting is usually carried out in one of the following five ways:
1) Simply misspelled, e.g., applle.com instead of apple.com
2) Typo based misspelled, e.g., appel.com instead of apple.com
3) Different domain name, e.g., apples.com instead of apple.com
4) Different TLD, e.g., apple.org instead of apple.com
5) ccTLD(Country Code top-level domain), e.g., apple.cm, apple.co etc. instead of apple.com
We take into account all four of these types of techniques. Our proposed solution, methodology, and approaches are capable of dealing with all of the above-mentioned phishing URL techniques.

IV. PROPOSED SOLUTION: PhishHaven
We propose PhishHaven, a novel phishing URL detection system. It works as a browser plugin, as shown in Figure 5. The main novelty of PhishHaven lies in its detection, i.e., it is specially crafted to detect AI-generated Phishing URLs. Furthermore, PhishHaven is capable of dealing with tiny URLs through our URL Hit approach. In addition, it is unique in terms of its parallel execution of ensemble machine learning models along with unbiased voting-based classification.

A. DESIGN PRELIMINARIES
To understand how our proposed system works, we need to understand the following preliminaries.

1) URL HIT
With the motive of obtaining sensitive information such as pass codes, personal credentials, etc., phishing attackers generally use the tiny URL approach. Therefore, to detect tiny URLs efficiently, it is important to understand how tiny URLs work.
URL shortening works because of browsers' redirect functionality. When a user clicks or types a tiny (shortened) URL in the browser, the browser sends an HTTP request to the server directing it to fetch the requested page, after which the server sends a redirect response [29].
Some adversaries have taken advantage of the URL shortening approach to fulfill their adversarial goals. Being a tiny URL is also one of the characteristics of phishing URLs. Through tiny URLs, attackers can easily hide the paths of malicious pages or deploy a malicious piece of code. Thus, detection of tiny URLs is also essential. However, the characteristics of tiny URLs make it difficult for detection systems to detect them.
Although the authors of [21] and [23] used a rich list of lexical features and employed various detection approaches, their models were unable to detect tiny URLs.
With the main aim to detect AI-generated Phishing URLs, we also introduce a URL Hit approach to efficiently deal with tiny URLs. It is incorporated into our detection system in such a way that whenever a user clicks a URL, say I_URL, the URL is first redirected to our plugin, PhishHaven. The URL I_URL is then hit by our plugin. We then fetch the response of the hit URL I_URL in terms of an extended URL, i.e., it returns an actual URL, say A_URL, after which the very first component of our detection system, i.e., Features Extractor, extracts features from the URL A_URL.

2) FEATURES EXTRACTOR
In order to detect AI-generated Phishing URLs, we employ an adversarial learning approach. Through this approach, we extract various features that are found specifically in AI-generated Phishing URLs. Thus, this approach enables PhishHaven to detect AI-generated Phishing URLs more accurately.

a: FEATURES SELECTION
There can be a wide variety of lexical features for classifying URLs. In our study, we specifically focus on AI-generated Phishing URLs. Therefore, we analyzed how similar to, or distinguishable from, Simple Phishing and Normal URLs the AI-generated Phishing URLs are. In this study, we only analyze URLs generated by DeepPhish [1], which is, to the best of our knowledge, the only AI model designed for generating phishing URLs.
In the future, there may be many other AI models designed for generating phishing URLs. Therefore, we study how Google deals with different URLs which are designed for different purposes. With this technique, our detection system, PhishHaven, is able to train on various yet most significant lexical features.
The two main categories of special characters used to create a full-fledged URL are: properly within the URLs can be easily misunderstood for various reasons. Based on these two major categories, we select the list of lexical features deliberately as mentioned in Table 2 along with their reasons.
URL HTML Encoding, also known as character encoding, is essential as it is required to process non-ASCII characters of the URL. Also, to improve the page-loading time, especially on slow connections, HTML Encoding is indicated in the first 1024 bytes of the document.
Therefore, an attacker can play around with character encoding to hide malicious information embedded in the URL. Hence, we consider URL HTML Encoding as an additional lexical feature for phishing URL detection.
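To illustrate why the encoding feature can matter, the following minimal sketch counts percent-encoded bytes in a URL and shows the decoded form. It assumes that URL HTML Encoding corresponds to percent-encoding, in line with the percentage(%) feature discussed in Section II-A. The function name `encoding_feature` and the example URL are hypothetical and are not part of PhishHaven.

```python
import re
from urllib.parse import unquote

# Matches one percent-encoded byte such as %2F or %6c.
PERCENT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def encoding_feature(url: str) -> dict:
    """Count percent-encoded bytes and return the decoded URL (illustrative helper)."""
    return {
        "encoded_count": len(PERCENT_ENCODED.findall(url)),
        "decoded": unquote(url),
    }


# Hypothetical example: a brand name hidden behind percent-encoding.
example = encoding_feature("http://example.com/%70%61%79%70%61%6c/login")
print(example["encoded_count"])  # number of encoded bytes found
print(example["decoded"])        # URL with the encoded bytes decoded
```

In this example the encoded path segment decodes to a readable word, which is the kind of hidden information that a count of encoded bytes can flag without loading the page.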
Since the Features Extractor subcomponent is responsible for extracting lexical features from an extended URL, we develop an approach for this component. We extract features in two parts.

1) Overall count
In this part, we first take the URL as a whole. Then we extract all the selected lexical features mentioned in Table 2 one by one from the whole URL.

2) Individual count
On the other hand, in this part, we first divide the URL into the following five major components, i.e., segment, netloc, path, query and fragment.
a) Segment is also known as ''scheme''. It determines the type of protocol used for accessing the resources on the Internet, e.g., https, http, ftp, etc.
b) Netloc is also known as ''hostname''. It determines the registered name, an IP address and user information. It is further divided into different subcomponents which are subdomain, domain name, top-level domain, second-level domain and port.
   i) subdomain determines the type of service used for accessing the resources, e.g., www, video, etc.
   ii) domain name determines registered entity, e.g., google.
   iii) top-level domain or TLD determines country code, e.g., pk, uk, etc.
   iv) second-level domain or SLD determines the type of a registered entity, e.g., org, edu, com, etc.
   v) port determines which port is used for communication between client and server.
c) Path determines the location of a file on the web server, e.g., directory/folder/file.
d) Query determines which type of request(s) are made by a user to the web server, e.g., docid=-12548987, etc.
e) Fragment is also known as ''anchor''. It determines internal page or internal section references, e.g., category=blue, etc.
After dividing a URL into its respective five components, we then extract features from features list in Table 2 from each component, as shown in Figure 6.
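The two-part extraction described above (overall count and individual count) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 2 or Algorithm 3: the feature list `FEATURE_CHARS` is an assumed subset taken from the characters discussed in Section II-A rather than from Table 2, the helper names are hypothetical, and Python's standard `urllib.parse.urlsplit` stands in for the component split.

```python
import string
from urllib.parse import urlsplit

# Assumed feature subset based on Section II-A; not the authors' Table 2.
FEATURE_CHARS = [":", "//", ".", "/", "?", "=", "-", "#", "!",
                 "&", "@", "%", "+", ";", "~"]


def count_features(text: str) -> dict:
    """Count each feature string in the text, plus the number of ASCII digits."""
    counts = {feature: text.count(feature) for feature in FEATURE_CHARS}
    # Note: the "/" count also includes the slashes that form "//".
    counts["digits"] = sum(ch in string.digits for ch in text)
    return counts


def split_components(url: str) -> dict:
    """Split a URL into the five components named in the text."""
    parts = urlsplit(url)
    return {
        "segment": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def extract_features(url: str) -> dict:
    """Overall count on the whole URL, then an individual count per component."""
    features = {"overall": count_features(url)}
    for name, value in split_components(url).items():
        features[name] = count_features(value)
    return features


features = extract_features("http://www.example.com:8080/a/b.php?id=1&x=2#top")
print(features["overall"]["."], features["query"]["="])
```

Counting per component keeps information that the overall count loses, for example whether a colon belongs to the segment separator or to a port in the netloc.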

3) MODELICS
Modern operating systems are capable of executing multiple tasks concurrently using a multi-threading approach.
Therefore, we leverage multi-threading to detect AI-generated Phishing URLs efficiently in real time. We design this subcomponent so that it acts as a single process composed of multiple threads. These threads are the machine learning models, all running (learning or predicting) simultaneously but independently of each other.
Each model (thread) takes the features extracted by the previous subcomponent, Features Extractor, as its input and then sends its predicted result individually and independently to the Decision Maker, the next subcomponent of our detection system.
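The thread-per-model arrangement can be sketched with Python's standard thread pool. This is an illustrative sketch only: the function `run_models_in_parallel` and the two stand-in classifiers are hypothetical placeholders for trained models, and the thresholds they use are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def run_models_in_parallel(models: dict, features: dict) -> dict:
    """Run every model's predict callable in its own thread and collect the labels."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        # Submit all models first so that they run concurrently.
        futures = {name: pool.submit(predict, features)
                   for name, predict in models.items()}
        # Each result is gathered independently of the other models.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for trained classifiers (arbitrary thresholds).
def many_equals(features: dict) -> str:
    return "phishing" if features.get("=", 0) > 3 else "normal"


def has_port(features: dict) -> str:
    return "phishing" if features.get(":", 0) > 1 else "normal"


predictions = run_models_in_parallel(
    {"many_equals": many_equals, "has_port": has_port},
    {"=": 5, ":": 1},
)
print(predictions)
```

In CPython, threads overlap only where the underlying code releases the global interpreter lock, so the actual speed-up of such a design depends on the models and libraries used.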

4) DECISION MAKER
To avoid the simple majority-based voting concept, which suffers from the limitation that each value or state may receive an equal number of votes, we borrow the voting concept from the fault-tolerance mechanism used by distributed systems to reach the necessary agreement on a single value or state by a majority of 2/3 (about 67%), and design our own mechanism, named Voting.
Our Voting mechanism takes the simultaneously generated prediction results (individual and independent) of each machine learning model and then makes a final decision about the class of a URL based on 67% of the classifiers voting for one of the classes, i.e., phishing or normal.
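A two-thirds vote of this kind can be sketched as below. The sketch makes two assumptions that the text does not specify: the threshold is treated as "at least 2/3 of the votes", and `None` is returned when no class reaches it. The function name `vote` is hypothetical.

```python
from collections import Counter


def vote(predictions, numerator=2, denominator=3):
    """Return the label backed by at least numerator/denominator of the votes, else None."""
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    # Integer comparison avoids floating-point rounding at the threshold.
    if count * denominator >= numerator * len(predictions):
        return label
    return None  # Assumed behaviour: no class reached the threshold.


print(vote(["phishing"] * 7 + ["normal"] * 3))
print(vote(["phishing"] * 5 + ["normal"] * 5))
```

With ten models, seven matching votes are needed under this reading, so a five-to-five split never produces a label by chance.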

Algorithm 1 Algorithm for URL Hit
Output: Expanded_URL
Set variable: Tiny_URL
Expanded_URL = requests.get(Tiny_URLs[index])

In a nutshell, our PhishHaven takes a URL from a browser. It then employs our URL Hit approach on it and fetches an extended URL. Thereafter, Features Extractor extracts features from an extended URL, after which the extracted features are sent to the Modelics subcomponent. And at last, the Decision Maker subcomponent takes all the outputs from the Modelics subcomponent, decides and assigns a Final Label, i.e., phishing or normal to the initially entered URL.

B. METHODOLOGY
With the aim of designing an efficient real-time phishing URL detection system, we design PhishHaven, which detects and classifies a URL using four subcomponents. The first subcomponent is URL Hit, which extracts extended URLs from tiny URLs. The second subcomponent is Features Extractor, which extracts the selected lexical features from the extended URLs. The third subcomponent, Modelics, executes the ensemble-based machine learning models in parallel and collects the classification results. Lastly, the Decision Maker subcomponent assigns a final class, i.e., phishing or normal, to the URL.

1) URL HIT
When a user enters a URL, the URL is first redirected to our detection system. Our detection system then hits the URL, requesting a response in terms of a URL. The response can be either an expanded URL, in the case of a tiny URL, or the same URL, in the case of an already expanded URL, as shown in Algorithm 1. The response is then passed to the next subcomponent, i.e., Features Extractor.
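A minimal sketch of the URL Hit step is given below. It is an illustration under assumptions rather than the authors' Algorithm 1: it uses the standard library's `urllib.request` instead of `requests`, sends a HEAD request, and exposes a hypothetical `opener` parameter so that the function can be exercised without network access.

```python
from urllib.request import Request, urlopen


def expand_url(tiny_url: str, opener=urlopen, timeout: float = 5.0) -> str:
    """Hit the URL, follow any redirects, and return the final (expanded) URL.

    For a URL that does not redirect, the same URL is returned.
    """
    request = Request(tiny_url, method="HEAD")
    # urlopen follows HTTP redirects; the response records the final URL.
    with opener(request, timeout=timeout) as response:
        return response.url
```

A call such as `expand_url("https://tiny.example/abc")`, where the address is a placeholder, would return the address the shortener redirects to. A HEAD request is used here to avoid downloading the page body, which is a design choice of this sketch, and some servers may answer HEAD differently from GET.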

2) FEATURES EXTRACTOR
To extract features from extended URL(s), this subcomponent first extracts features from URL(s) as a whole using Algorithm 3. Then it divides the expanded URLs into their respective five parts, i.e., segment, netloc, path, query and fragment as shown in Algorithm 2. After this, the respective parts of URL(s) also undergo the process of features extraction as shown in Algorithm 3. We use regular expressions to extract different types of features after which the extracted features become the output of this subcomponent.

Algorithm 2 Algorithm for Components Extraction From URLs
Input:

3) MODELICS
For this subcomponent, different machine learning models are set in parallel as individual and independent threads. The set of features extracted by the previous subcomponent, Features Extractor, becomes the input to this subcomponent. This subcomponent, Modelics, consists of ten machine learning models. We group the machine learning models into three categories, i.e., a boosting-based approach, a non-learning-based approach and a learning-based approach, because we want to introduce variance in the decision-making and feature selection processes.

• Boosting-based approach
Classifiers in this category employ voting-based decisions and focus on making weak learners strong. In this category, we consider the AdaBoost and Gradient Boosting classifiers for the following reasons:
- AdaBoost Classifier alters the distribution of the samples in the training dataset to increase the weights of the training samples which are difficult to classify. Moreover, it focuses on making weak learners strong by assigning weights to the weak learners based on data misclassifications. It also makes a final prediction based on the majority vote of the weak learners' predictions, weighted by their individual accuracy [31], [32].
- Gradient Boosting Classifier minimizes the overall error of strong learners through a gradient optimization process on each weak learner. It also minimizes the loss function of the strong learner in order to focus on misclassified samples in the training dataset [33].

• Non-learning-based approach
Classifiers in this category are less prone to over-fitting. For this category, Decision Trees, Random Forest, Extra Tree classifier, Bagging classifier and K-Nearest Neighbour are considered for the following reasons:
- Decision Trees considers all the features in the entire dataset at a time while constructing a model. It has a high variance [34].
- Random Forest considers random features from the entire dataset at a time while constructing a model. The selection of features is carried out on the basis of best split and with replacement (bootstrapping). It has a medium variance [35].
- Extra Tree Classifier also considers random features from the entire dataset at a time while constructing a model, but the selection of features is carried out on the basis of random split and without replacement. It has a low variance [36].

• Learning-based approach
This category's classifiers perform classification through optimization and drawing decision boundaries. We choose three classifiers from this category, i.e., Logistic Regression, Support Vector Machines and Neural Networks, for the following reasons:
- Logistic Regression makes a binary classification. It is a probabilistic approach and assigns samples to a class based on class probability [34], [38].
- Support Vector Machines draws a decision boundary based on features from the dataset in the hyperplane. It is a deterministic approach and requires optimization and regularization [39].
- Neural Networks learns the distribution of the samples in the training dataset and performs classification based on optimized weights and biases. It also performs multi-layer optimization by itself [38], [40].

Each machine learning model takes a set of features and performs learning (in the learning case) and prediction (in the prediction case). Every machine learning model then predicts the class label and sends its prediction result to the next subcomponent. Each model produces its output individually and independently of the others.
We can express this subcomponent mathematically as an ensemble E. Here, E is the multi-threading of Ensemble Machine Learning Models, where every model is an individual and independent thread. AB is an AdaBoost Classifier having X as a set of extracted features along with α, the weights for the classifiers, and T, the number of classifiers. B is a Bagging Classifier consisting of the frequency of the label (f_i), a set of extracted features (X) and the number of classifiers (T). Next is the Decision Tree (DT), made up of C, the number of unique labels, and the frequency of the label (f_i), whereas XT, an Extra Tree Classifier, consists of X, a set of extracted features, and the frequency of the label (f_i). The Gradient Boosting Classifier (GB) has a set of extracted features (X), m iterations, T classifiers, a step size α and a pseudo residual (R). KNN, K-Nearest Neighbour, has a probability P, a feature (x), a set of extracted features (X), k neighbours, y clusters, a set of points close to x (A) and a new sample (j). The Logistic Regression (LR) model uses the coefficients of the extracted features (W), a set of extracted features (X) and a slope (B). For NN, Neural Networks, the parameters include a bias (b), n inputs from the incoming layer, the input to a neuron (X_i) and weights (w_i). The parameters that Random Forest (RF) utilizes include B, the number of bagging rounds, a regression tree (RF_i) and X', the root feature of RF_i. For SVM, Support Vector Machines, X is a set of extracted features, w is the representation of the hyperplane in terms of a line and b is the representation of the slope in terms of a line.
Algorithm 4 also shows the working of the Modelics subcomponent.
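The parallel execution of independent models can be sketched in Python as follows. This is not the paper's Algorithm 4: the function `run_modelics` and the three rule-based `toy_models` are hypothetical stand-ins for the ten trained classifiers, and the sketch uses the standard-library `concurrent.futures` thread pool rather than the `threading` module directly.

```python
from concurrent.futures import ThreadPoolExecutor


def run_modelics(models, features):
    """Run every model on the same features in its own thread.

    `models` maps a model name to a callable that returns a class label.
    Returns a dictionary {model name: predicted label}.
    """
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # Submit all models first so that they execute concurrently.
        futures = {name: pool.submit(predict, features)
                   for name, predict in models.items()}
        # Then collect each prediction; the models do not share any state.
        return {name: future.result() for name, future in futures.items()}


# Toy stand-ins for trained classifiers (assumption: illustrative rules only).
toy_models = {
    "long_url": lambda f: "phishing" if f["length"] > 75 else "normal",
    "many_dots": lambda f: "phishing" if f["dots"] > 4 else "normal",
    "has_at": lambda f: "phishing" if f["has_at"] else "normal",
}
```

The returned dictionary of per-model labels is the kind of input the next subcomponent needs in order to count votes.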

4) DECISION MAKER
Finally, the set of prediction results from the previous subcomponent, Modelics, is the input to this subcomponent. This subcomponent is responsible for collecting all the results and deciding the final class of the URL that was initially input to the Features Extractor subcomponent.
where NoPL is the number of phishing labels and NoNL is the number of normal labels. Phishing attacks result in severe damage (e.g., theft of highly confidential information). Therefore, to guard against every possibility of a phishing attack, we use a threshold of 67% of the classifiers voting for a single class.
Hence, when both classes (i.e., phishing and normal) receive an equal number of votes, the Modelics operation is performed again. When 67% of the classifiers classify a URL as phishing, it is classified as phishing, i.e., Case-I. Otherwise, it is classified as normal, i.e., Case-II. Algorithm 5 shows the working of the Decision Maker subcomponent.
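The voting rule can be sketched in Python as follows. This is one reading of the rule described above rather than the paper's Algorithm 5: the function name `decide`, the `"rerun"` return value for a tie, and the interpretation of the 67% threshold as a share of all votes are assumptions made for illustration.

```python
from collections import Counter


def decide(labels, threshold=0.67):
    """Assign the final class from the per-model labels.

    Returns "phishing" (Case-I), "normal" (Case-II), or "rerun" when the
    two classes tie and the Modelics step should be executed again.
    """
    votes = Counter(labels)
    nopl = votes["phishing"]  # NoPL: number of phishing labels
    nonl = votes["normal"]    # NoNL: number of normal labels
    if nopl == nonl:
        return "rerun"
    # Assumption: the threshold is applied to the share of phishing votes.
    if nopl / (nopl + nonl) >= threshold:
        return "phishing"
    return "normal"
```

With ten models, this reading requires at least seven phishing votes for Case-I, while a five-to-five split triggers another Modelics run.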

C. TIME COMPLEXITY ANALYSIS
We perform the time complexity analysis of our five proposed algorithms, as shown in Table 3. In the worst-case scenario, PhishHaven extracts five components from N expanded URLs, leading to a time complexity of O(N). PhishHaven then employs Algorithm 3 to extract seventeen lexical features from both the URLs as a whole and their components, which takes constant time, O(1). Next, for the Modelics algorithm, PhishHaven runs the machine learning algorithms in parallel but independently of each other through multi-threading. We choose ten machine learning algorithms, as discussed under the Modelics subcomponent.

A. SIMULATION ENVIRONMENT
To perform our experiments, we use Ubuntu Linux 16.04. The detection system is developed in Python 3.5.4. For feature extraction and data analysis, we use Python libraries including nltk, re, numpy, matplotlib, wordcloud, counter, plotly, urlparse and dnstwist [41]. To build the Modelics subcomponent, we use the pandas, sklearn, threading and seaborn libraries.

B. DATA PREPARATION
To achieve our main goal, i.e., to detect AI-generated Phishing URLs, we used DeepPhish [1], an AI-based model developed for generating phishing URLs. From DeepPhish [1], we generated 50,000 AI-generated phishing URLs. For Normal URLs, we took 50,000 URLs from Alexa. To test PhishHaven against Simple Phishing URLs, we also took 50,000 Simple Phishing URLs from PhishTank.
To prepare our datasets according to recent phishing cases and the current URL format, we downloaded the datasets from both PhishTank and Alexa by September 4th, 2019.
To have a fair evaluation of PhishHaven against AI-generated Phishing URLs, we use a 5:5 split ratio for the training and testing datasets. To prepare our experimental datasets, we first combined the AI-generated Phishing URLs and Normal URLs and shuffled them randomly. We then applied the Hold-out method to split the dataset into training and testing datasets. Because the patterns of the features present in the URLs generated by DeepPhish [1] show limited variation, applying K-fold validation to evaluate our models' efficiency can result in a biased generalization. Hence, we choose the Hold-out method to introduce variation in the training set while limiting over-fitting or over-generalization. For Simple Phishing URLs, we followed the same evaluation procedure as for AI-generated Phishing URLs.
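The combine, shuffle and hold-out steps can be sketched in Python as follows. The function name `hold_out_split`, the label encoding (1 for phishing, 0 for normal) and the fixed random seed are assumptions made for illustration and are not taken from the paper.

```python
import random


def hold_out_split(phishing_urls, normal_urls, train_fraction=0.5, seed=0):
    """Combine, shuffle and split labelled URLs into train and test sets.

    Labels are an assumption of this sketch: 1 = phishing, 0 = normal.
    """
    labelled = [(url, 1) for url in phishing_urls]
    labelled += [(url, 0) for url in normal_urls]
    rng = random.Random(seed)  # fixed seed so that the split is repeatable
    rng.shuffle(labelled)
    cut = int(len(labelled) * train_fraction)
    return labelled[:cut], labelled[cut:]
```

With `train_fraction=0.5` the two halves have equal size, which corresponds to the 5:5 ratio described above; every URL ends up in exactly one of the two sets.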

C. EVALUATION METRICS
In the field of cybersecurity, security comes first. Keeping this in mind, we first need to choose the evaluation metrics, that is, which performance measures we want to increase or decrease, or which we care about most.
Since phishing attacks can cause severe damage and harm to end-users, we choose Sensitivity (also known as ''True Positive Rate'' or ''Hit Rate'') as one of our main evaluation metrics. TPR is the probability of correctly detecting H_0 (the null hypothesis):

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

Here we have considered ... Moreover, we also consider Fall-Out (also known as ''False Positive Rate''). FPR is the probability of falsely detecting H_0 (the null hypothesis):

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

To avoid any hindrance faced by users in availing services, we also evaluate our system through Specificity (also known as ''True Negative Rate''). TNR is the probability of correctly rejecting H_0 (the null hypothesis):

$$\mathrm{TNR} = \frac{TN}{TN + FP}$$

where TP = correctly identified, TN = correctly rejected, FP = incorrectly identified and FN = incorrectly rejected.
To evaluate how accurately PhishHaven works, we also choose ''Accuracy'', which is defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Our main objective behind the selection of TPR, FPR and TNR is to always protect a user from a phishing attack, irrespective of the level of loss or damage, while making sure that significantly fewer Normal URLs are misclassified.
Further, to evaluate the performance of our selected machine learning models, we also include Precision, Recall and F1-measure, defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
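The metrics named in this subsection can be computed from confusion-matrix counts as in the following Python sketch. It uses the standard textbook definitions of each metric; the function name `evaluation_metrics` is hypothetical, and the sketch assumes that no denominator is zero.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts.

    Assumption: every denominator is non-zero, i.e., each class appears
    in the test set and the system predicts at least one positive.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to Sensitivity (TPR)
    return {
        "tpr": recall,
        "fpr": fp / (fp + tn),
        "tnr": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

Since FPR and TNR share the denominator FP + TN, they always sum to one, so reporting both is a useful consistency check on a results table.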

D. PhishHaven ANALYSIS
We evaluate PhishHaven using theoretical analysis as well as experimental analysis. Through theoretical analysis, we examine the logical structure of concepts and statements of PhishHaven. Through experimental analysis, we experimentally prove the logical structures of concepts and statements of PhishHaven.

1) THEORETICAL ANALYSIS
The purpose of conducting the theoretical analysis is to demonstrate that there is a valid argument in favour of our proposed hypothesis, i.e.,

PhishHaven Can Detect AI-Generated Phishing URLs
To prove our hypothesis, we propose two propositions: Proposition 1, to prove that PhishHaven is capable of detecting every tiny URL; and Proposition 2, to prove that our selected lexical features are invariant, hence, leading to a proof that PhishHaven can detect AI-generated Phishing URLs.

PROPOSITION 1: URL HIT WORKS FOR EVERY TINY URL
Since PhishHaven is a plugin, i.e., it works as a middle party between users and servers, all the entered tiny URLs are first redirected to PhishHaven. PhishHaven then employs the URL Hit approach, i.e., it forwards the tiny URL(s) to the server. In response to the sent tiny URL(s), PhishHaven receives the extended URL(s), and the feature extraction procedure is then performed on the respective extended URL(s).
Hence, we can say that PhishHaven is a one-to-one correspondence (or bijection) function, i.e.:

$$\mathrm{PhishHaven}(I_{URL}) = A_{URL} \implies I_{URL} \rightarrow A_{URL}$$
where I_URL = initial URL (tiny URL) and A_URL = response URL from the server (always extended). Furthermore, our proposed URL Hit approach is comprised of three main postulates.

• Postulate 1: Default behaviour
We set URL Hit as the default behaviour of our detection system. By default behaviour, we mean that, irrespective of the URL form, i.e., either extended or shortened, a URL is always hit by our detection system with a request for a response in terms of an actual URL. With this approach, we are able to handle tiny URLs, as the classification is applied directly to the actual URL(s) (A_URL).

• Postulate 2: Independent of prior knowledge
Since we leverage the browser's redirection property, our URL Hit approach is independent of any URL shortening software, e.g., bitly, TinyURL, Polr, etc. URL Hit works for every URL, including URLs shortened by any algorithm, approach or methodology. Also, our approach does not require any prior information about the URL, the shortening software or the shortening algorithm.

• Postulate 3: Consistent
Our proposed approach is consistent, i.e., on a given input, we always get the same output. The results from ... Our assumption for this postulate is that the server is always active.

Therefore, this proves our Proposition 1 that our URL Hit approach works for every tiny URL.

2) Case 2b: (N ∩ S) ⊂ S
Since S contains essential lexical features, which are necessary to construct a full-fledged URL, it is ...

Based on three cases, we can prove that our selected features are invariant. Hence, it can be said that a feature from the set of our selected lexical features must be a part of every full-fledged URL.
Through Propositions 1 and 2, we proved our proposed hypothesis. Therefore, we can say that PhishHaven can detect future AI-generated phishing URLs with 100% accuracy based on URL Hit and our selected lexical features.

2) EXPERIMENTAL ANALYSIS
We perform experimental analysis of PhishHaven to support our theoretical analysis. In our experimental analysis, we perform experiments a) to evaluate the performances of the selected machine learning models executed in an ensemble manner using multi-threading, b) to evaluate PhishHaven against AI-generated and Simple Phishing URLs, and c) to compare PhishHaven against some of the existing lexical features-based simple phishing URLs detection systems.
Our approach of parallel training and testing of the ensemble machine learning models enables PhishHaven to speed up the process of URL classification. To evaluate the performance of our selected machine learning models, we evaluate them against Precision, Recall and F1-measure, as shown in Table 6.
Table 6 shows that, among the selected machine learning models, Support Vector Machines performs significantly well, with 97.68% Precision, 97.63% Recall and 97.64% F1-measure. Support Vector Machines work well in our case because they perform binary classification efficiently by drawing a decision boundary in the hyperplane between the two classes. Table 10 describes the parameter setup of our ten selected machine learning models in our experimental setup.
However, we proposed and employed ensemble-based machine learning through multi-threading for final classification. Therefore, Figure 10 illustrates how the performances of our selected machine learning models contribute altogether to the overall performance and efficiency of our detection system.
We test our detection system for two different types of URLs' detection, i.e., for AI-generated Phishing URLs and for Simple Phishing URLs. Our evaluation measures, as shown in Table 7 and Table 8, signify that our detection system is capable enough to detect efficiently not only AI-generated Phishing URLs but also Simple Phishing URLs and Normal URLs.
To evaluate the performance of our detection system, we choose the following measures: Precision, Recall, Accuracy and F1-measure. To analyze the efficiency of our detection system, we employ Sensitivity (TPR), Fall-out (FPR) and Specificity (TNR). To further evaluate the performance of PhishHaven in terms of Simple Phishing URL detection, we compare PhishHaven with some of the existing lexical-based simple phishing URL detection systems. Table 9 shows that PhishHaven outperforms the existing state-of-the-art detection systems designed for detecting simple phishing URLs. Table 9 also shows that the system in [27] reports slightly higher accuracy than PhishHaven; we attribute this to its smaller dataset, whereas our dataset covers a wider range of cases, including a variety of significant cases.

VI. RELATED WORK
Phishing detection systems usually utilize several techniques for detecting and predicting phishing sites. The main techniques are as follows:

A. LISTS AND HEURISTICS-BASED TECHNIQUES
In these techniques, detection mechanisms employ whitelists or blacklists and a set of rules to compare and classify a URL either as phishing or normal. A major drawback of these approaches is that they completely fail to detect newly generated phishing sites, called zero-day phishing sites. Furthermore, they require continuous updating of the lists and rules. Also, websites whose URL contents are similar to those in the blacklists or whitelists, and websites whose appearance is similar to the set heuristics, can easily be misclassified.
In [15], the authors presented a phishing URL detection approach based on string matching, which depends heavily on blacklists and incurs significant computation cost.
On the other hand, in [16], the authors proposed a solution to improve blacklisting-based phishing URL detection systems. Their approach was not effective against zero-day attacks. In addition, it used a significant amount of computation time for relatively little benefit.

B. CONTENT-BASED TECHNIQUES
This approach analyzes the content of a webpage to classify the page either as phishing or normal. Its main limitation, which makes it computationally inefficient and not viable in most scenarios, is that it requires either the source code or the entire content of the website, i.e., images or text, to perform feature extraction and analysis. In [17], the authors applied similarity in CSS (Cascading Style Sheets) features, i.e., visual features. Their proposed scheme consumed more time as a whole. In [18], the authors proposed a phishing webpage detection mechanism. Their mechanism performed noticeably well, but consumed a lot of time. Furthermore, their stacked-model approach, which uses GBDT (Gradient Boosting Decision Tree) for making both the initial and final predictions, added a bias to the final predictions.
The approach proposed in [19] utilized hyperlink features extracted from the pages' source code. Since their approach relied completely on the analysis of features extracted from the source code, the longer the source code, the higher the time complexity.

C. THIRD-PARTY-BASED TECHNIQUES
Some detection mechanisms use third-party-based features and services. The main drawback of this approach is a high error rate in terms of misclassification. The main reason for misclassification is that these mechanisms rely heavily on either the age of a domain or the number of occurrences in search results. With this approach, newly set up legitimate sites are likely to be misclassified as phishing sites. In addition, the third party can be biased or hijacked, e.g., through DNS (Domain Name System) spoofing.
In [20], the authors conducted a study on phishing URL detection using both lexical and external (third-party-based) features. Their proposed solution, named ''PhishDef'', demonstrated that it performed with significant accuracy using only lexical features.

D. LEXICAL FEATURES-BASED TECHNIQUES
This URL classification technique turns out to be a more promising approach. It relies entirely on URLs for feature extraction. These features can be count-based features, binary features, blacklisted words, etc. It is also computationally efficient, as it only takes a URL into consideration, i.e., a string composed of alphanumeric characters and symbols.
In [21], the authors designed a phishing detection system which yielded an accuracy of 94.91% employing lexical feature analysis. In [22], the authors enhanced the performance of a phishing URL detection system through lexical features. The model proposed in [23] achieved a noticeable accuracy with lexical features and consumed relatively less time, since it was independent of any third-party services and source code analysis. In [24], the authors conducted a study on improving the accuracy of phishing detection systems through feature selection and an ensemble learning methodology. Their experimental results showed an accuracy of 95%. The detection system implemented in [25] used seven different machine learning algorithms. Their approach and types of selected features overcame many issues, such as language and third-party dependency, as well as real-time and zero-day attack detection.

E. HYBRID FEATURES-BASED TECHNIQUES
In [26], the authors applied several techniques together, such as whitelists, external features (third-party services), page contents and TF-IDF techniques. Though the authors proposed the solution to improve the maintenance process of a blacklist, their solution inherited the limitations of the HTML page content, external feature and TF-IDF classification techniques. The solution proposed in [27] also employed a combination of various techniques. Their system relied heavily on source code analysis, verification against whitelists, third-party-based services and page similarity using screenshots for detection. This made their solution more time-consuming and less efficient. On the other hand, in [28], the authors employed lexical as well as host-based features to detect phishing URLs. Though the authors proposed a solution to complement blacklisting and heuristic-based detection systems, their solution used host-based properties and hence inherited various limitations of host-based techniques.
From the studies discussed above and the comparison shown in Table 11, we conclude that lexical features-based techniques are more efficient and have fewer limitations. Therefore, we design our detection system based on lexical feature extraction and analysis.

VII. CONCLUSION
In this paper, we proposed PhishHaven, the first AI-generated Phishing URLs detection system based on ensemble machine learning. Our proposed system is based on lexical feature analysis. We also introduced URL HTML Encoding as a lexical feature to boost our detection system in proactive and on-the-fly detection of URLs. We further introduced a URL Hit approach for detecting and classifying tiny URLs. In addition, we presented a new paradigm for executing ensemble-based machine learning models. This paradigm executes the models in parallel using a multi-threading technique and achieves real-time detection through a significant speed-up of the classification process. For final classification, we employed an unbiased voting method.
We evaluated our solution both theoretically and experimentally. In the theoretical analysis, we proved that our solution can detect tiny URLs as well as future AI-generated Phishing URLs based on our selected lexical features with 100% accuracy. Experimental analyses were conducted for two cases, i.e., AI-generated and Simple Phishing URLs. The dataset in the first case consists of AI-generated Phishing URLs and Normal URLs. The dataset in the second case consists of Simple Phishing URLs and Normal URLs. For the first case, the results showed a significantly high accuracy and F1-measure of 98%, while securing 97% TPR and 99.17% TNR with a noticeably low fall-out of 0.8% FPR. For the second case, the results also showed a significantly high accuracy and precision of 98%, outperforming the other existing simple phishing URL detection systems. Therefore, we can conclude that the proposed solution efficiently addresses the detection of AI-generated Phishing URLs in the forthcoming future as well as the Simple Phishing URLs prevalent these days.
Although PhishHaven has achieved a significant accuracy in classifying both AI-generated and human-crafted phishing URLs, it has a limitation: it can detect only those AI-generated Phishing URLs whose lexical features and patterns are similar to those of DeepPhish [1]. This is because, to the best of our knowledge, DeepPhish [1] is the only AI-based system designed to generate phishing URLs.
Our future work and directions are as follows:
1) PhishHaven can be further enhanced by incorporating unsupervised learning, i.e., deep learning models.
2) The efficiency of PhishHaven can be further improved in the following ways:
a) Based on our chosen metric, we can filter the best-performing models and use only those models for making predictions. With this kind of model reduction, we can reduce the computation cost in terms of multi-threading.
b) In continuation of the previous point, we can also extract the weights that the best-performing models assign to the extracted features and consider only those features. With this type of feature reduction, we can reduce the computation process and cost in the Features Extractor subcomponent.
3) By applying the multi-threading technique at the input unit (i.e., at the very initial point where PhishHaven takes a URL as an input), we can work on multiple URLs simultaneously, hence incorporating scalability into PhishHaven for classifying multiple URLs at a time.