Detecting Malicious URLs Using Machine Learning Techniques: Review and Research Directions

In recent years, the digital world has advanced significantly, particularly the Internet, which is critical given that many of our activities are now conducted online. As attackers devise increasingly inventive techniques, the risk of cyberattacks is rising rapidly. One of the most critical attacks is the malicious URL, which is intended to extract unsolicited information, mainly by tricking inexperienced end users, resulting in compromised user systems and losses of billions of dollars each year. As a result, securing websites is becoming more critical. In this paper, we provide an extensive literature review highlighting the main techniques used to detect malicious URLs that are based on machine learning models, taking into consideration the limitations in the literature, detection technologies, feature types, and the datasets used. Moreover, due to the lack of studies related to malicious Arabic website detection, we highlight the directions of studies in this context. Finally, based on our analysis of the selected studies, we present challenges that might degrade the quality of malicious URL detectors, along with possible solutions.


I. INTRODUCTION
As the Internet develops and grows, many of our activities are now conducted online, including e-commerce, business, social networking, and banking, raising the likelihood of online crime. So, securing the world wide web is becoming increasingly important. According to Internet World Stats [1], around 237,418,349 users used the Arabic language on the Internet in 2020. Attempts to bait users to click through to malicious uniform resource locators (URLs) lead to the system being hacked or access being gained to sensitive data. Consequently, it is becoming increasingly necessary to secure this side. Protocols and regulations secure the connection between the client and server, yet it is still vulnerable to those with malicious intent to attack it. The term "Malicious" is a general term for attack types that include phishing, spam and malware, and more.
Malicious URLs are used to extract unsolicited information and trick inexperienced end users into falling for a scam, which causes losses of billions of dollars each year.
To identify the threat from malicious sites, the online security community has created blacklisting services to help detect harmful websites. A blacklist is a database that contains a list of all URLs already known to be malicious. URL blacklisting has been shown to be effective in some cases [2]. However, attackers can easily evade a blacklist by changing one or more components of the URL string. Inevitably, many malicious sites are not blacklisted because they are either too new or were never or erroneously assessed.
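As a minimal illustration of this weakness, the following Python sketch contrasts an exact-match blacklist lookup with a host-level lookup. The blacklist entry, the example URLs, and both function names are hypothetical and are not drawn from any of the reviewed services.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist content, for illustration only.
BLACKLIST = {"http://malicious.example/login.php"}


def is_blacklisted(url: str) -> bool:
    """Exact string lookup, as a naive blacklist would do."""
    return url in BLACKLIST


def is_host_blacklisted(url: str) -> bool:
    """Slightly more robust lookup that compares only the host name."""
    blocked_hosts = {urlsplit(entry).hostname for entry in BLACKLIST}
    return urlsplit(url).hostname in blocked_hosts


known = "http://malicious.example/login.php"
variant = "http://malicious.example/login.php?session=1"  # query string added
new_host = "http://malicious2.example/login.php"          # host name changed

for candidate in (known, variant, new_host):
    print(candidate, is_blacklisted(candidate), is_host_blacklisted(candidate))
```

Under these assumptions, appending a query string is enough to defeat the exact-match lookup, and a newly registered host defeats both lookups, which is consistent with new sites going undetected.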
Another approach to identifying malicious sites is the heuristic method, which is an improved version of the blacklist method but based on signatures that are used to find the correlation between the new URL and the signature of an existing malicious URL. These approaches are adequate for identifying malicious and benign URLs. However, these methods have limitations: (a) the blacklist method fails to protect against zero-hour phishing attacks, as it classifies only 47-83% of new phishing URLs in a 12-hour period [3], and as a result, it cannot categorize new URLs [4]; and (b) these methods can be bypassed using obfuscation, such as algorithmically generating a huge number of URLs, which the blacklist and heuristic methods fail to handle because of the extensive lists involved, in which case the blacklist method cannot be used with rapidly changing technology. Despite these limitations, the blacklist method is used by many anti-phishing companies due to its simplicity.
The third approach to detecting these malicious sites is the use of artificial intelligence (AI) approaches, including machine learning (ML) and deep learning (DL). These approaches have been widely applied in different fields, including cybersecurity, healthcare, medical imaging analysis, e-commerce, and social media. In the cybersecurity field in particular, ML models can be designed to learn from their previous experience and thus improve without the need for human interaction. This is a significant advantage for large organizations, companies, banks, and others. Moreover, ML and DL techniques have proved their ability in many disciplines, and they are frequently used to detect malicious sites [5].
The use of ML for detecting malicious URLs has proved to be effective through detecting newly formed URLs and the automatic update of the model. Recent studies have explored DL models that automatically detect newly formed URLs and extract the features. In this way, researchers can extract many features from URLs that help ML algorithms categorize the URL as malicious or benign. The most common features extracted from the URLs are lexical, content-based, and network-based, as described next.
In this research, we reviewed 91 studies published from 2012 to 2021 that used ML or DL in the classification of malicious URLs. The contents of the websites were classified as either Arabic or English language. We provide a taxonomy of the reviewed studies on the detection of malicious URLs in terms of several aspects, including the language used, related URL features, ML detection techniques, and the datasets used. The primary contributions of this paper can be summarized as follows:
• Produces several taxonomies of malicious URL detection studies.
• Conducts many comparisons and discusses several techniques and properties related to malicious URL attacks and techniques for detecting them.
• Highlights several findings about the features of a URL that are used for detection, including the type of content, algorithms used for detection, and datasets used.
• Discusses several challenges that might impact the quality of ML detection techniques, including the size of the dataset, outliers, features selection, and the sustainability of the detectors.
The rest of this paper is organized as follows. Section 2 presents the background of URL feature types and the attack techniques. Section 3 provides the taxonomy of the works investigated in this study. Section 4 discusses and summarizes the ML studies about malicious URL attack detection. A discussion of the datasets used for evaluating detection techniques is presented in Section 5. Finally, Section 6 concludes the paper and presents future related work.

II. BACKGROUND
This section explains the common URL feature types and the possible types of attacks that can be used by attackers through URLs. The URL features discussed in this section include lexical, content, and network features. This section also discusses the common techniques of spam, phishing, malware, and defacement URL attacks.
The success of any ML model depends on the quality of training data and the quality of features fed into the model. Certain features must be available to analysts in order to create proactive models to identify malicious URLs. Simple URL strings can be used to extract these features, which can be lexical, content, or network [6], [7].
First, lexical features include the elements of the URL string. They are determined by how the URL looks or seems different in users' eyes and the URL's textual properties. These include statistical properties such as the length of the URL, length of the domain, number of special characters, and number of digits in the URL. Second, content features refer to the actual content on the page. These features are obtained upon opening or downloading the website, and they include the hypertext markup language (HTML) tag count, iframe count, hyperlink count, number of scripts, and count of suspicious JavaScript and other functions. Third, network features are a union of the domain name system (DNS), network, and host features. They include the resolved IP count, latency, redirection count, domain lookup time, number of DNS, connection speed, and the number of open ports.
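The statistical lexical properties listed above can be computed directly from the URL string. The sketch below is one possible reading, assuming that "special characters" means any non-alphanumeric character; the function name, the feature names, and the example URL are illustrative only.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few simple lexical statistics from a URL string."""
    host = urlsplit(url).hostname or ""
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "digit_count": sum(ch.isdigit() for ch in url),
        # Assumption: every non-alphanumeric character counts as "special".
        "special_char_count": sum(not ch.isalnum() for ch in url),
    }


features = lexical_features("http://www.facebo0k.com/login?id=42")
for name, value in features.items():
    print(f"{name}: {value}")
```

Content and network features, by contrast, require fetching the page or querying external services, so they cannot be derived from the string alone.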
The purpose of including these types of features is to enhance model performance to accurately detect malicious URLs. In general, it has been found that legitimate websites have more content than malicious websites. Moreover, network features can be useful in detecting malicious websites that tend to be hosted by less reputable service providers. Therefore, the DNS information can be used to detect malicious websites. Keywords extracted from the domain name can be compared to a list of commonly used keywords associated with malicious behavior. All of the mentioned features help in determining whether a web page is malicious [8].

B. URL ATTACK TECHNIQUES
Attack techniques are the methods or mechanisms used by attackers to illegally gain access to user data or cause damage to the attacked system. Attackers can use malicious URLs to perform those attacks. Malicious URLs can be classified as spam, phishing, malware, or defacement URLs.
The majority of cyberattacks happen when users click on malicious URLs. When URLs are exploited for purposes other than accessing legitimate resources on the Internet, they pose a threat to data integrity, confidentiality, and availability. The different kinds of malicious URLs are discussed below [9].

1) SPAM URL ATTACKS
These attacks occur when spammers create web pages in an attempt to fool the browser engine into perceiving they are legitimate when they are not. By illegally improving their rank, spammers want to deceive and attract more users to their spam websites [10]. Spammers send spam emails that contain spam URLs to harm and infect the systems of their victims using spyware and adware [11].

2) PHISHING URL ATTACKS
Attackers use phishing URLs to attract users to open a fake website, where access to the user's computer is attempted in order to steal a user's private information, such as credit card numbers. Attackers can easily fool non-expert users into clicking through to a phishing website by introducing barely noticeable misspellings in the URL, such as changing www.facebook.com to www.facebo0k.com, which makes user data more vulnerable [11].
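One simple way to flag such lookalike domains is to measure the edit distance between a candidate host name and a list of trusted names. This check is an illustrative assumption rather than a method taken from the reviewed studies; the trusted list, the threshold, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical list of names that the defender wants to protect.
TRUSTED = ["facebook.com", "google.com"]


def looks_like_trusted(host: str, max_distance: int = 1):
    """Return the trusted name that `host` closely imitates, or None."""
    for name in TRUSTED:
        if 0 < edit_distance(host, name) <= max_distance:
            return name
    return None


print(looks_like_trusted("facebo0k.com"))  # one substituted character
print(looks_like_trusted("facebook.com"))  # identical name, not an imitation
```

A distance of zero is deliberately excluded so that the genuine domain is not reported as an imitation of itself.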

3) MALWARE URL ATTACKS
These attacks direct users to a malicious website that typically installs malware on the user's device that can be exploited for file corruption, keystroke logging, and even identity theft. Malware is a type of malicious software that can steal someone's personal information and damage a computer. One example of malware is the drive-by download, defined as the unintentional download of malware caused by a user being tricked into visiting a malicious website [12]. More examples include ransomware, keyloggers, trojan horses, spyware, scareware, computer worms, and viruses [11].

4) DEFACEMENT URL ATTACKS
This type of attack redirects the user to a malicious website that has been altered by hackers in one or more aspects, such as its visual appearance or some of the site's contents. Hacktivists strive to take down a website for several reasons [13]. This form of action occurs when the attackers discover the vulnerabilities of the website and utilize those vulnerabilities to compromise the website and modify the content on the web page without the owner's authorization, which is technically known as penetrating a website [11]. The classification of malicious URL attacks by ML techniques can be binary, such as either malicious or benign. Conversely, multi-classification is not restricted to any number of classes except that it has more than two, such as benign, phishing, suspicious, malware, spam, and others.

III. TAXONOMY OF MALICIOUS URL DETECTION TECHNIQUES
The existing works investigated in this research encompassed studies conducted between 2012 and 2021. Figure 1 provides a complete view of the explored studies based on the detection of malicious URLs according to ML detection techniques, classification types, and the datasets used. We examined and summarized related work in the detection of malicious URLs on Arabic and English websites using ML algorithms. The type of classification for each study and the name of the classifications are exactly as written in the study. In general, most of the studies used binary classification of URLs. Overall, 81 studies used binary classification, and 10 studies used multi-classification. The datasets utilized to train and test the detection models in the examined studies came from a variety of sources, including open sources, those created by the study authors, those adapted from other authors, or a combination. The most common dataset sources were PhishTank [14] and Alexa, as well as datasets collected by the study authors. ML algorithms can be classified into three categories: supervised learning, unsupervised learning, and semi-supervised learning, which refer to labelled, unlabelled, and partially labelled training data, respectively.

IV. MALICIOUS URL DETECTION ON ARABIC AND ENGLISH STUDIES
This section reviews and summarizes the related work in terms of detecting Arabic and English malicious attack websites using ML algorithms. Many features can be extracted from the URL to help ML algorithms accurately detect malicious URLs. The features mentioned in the reviewed studies were sorted according to three main features in order to use unified terms in the present research. Those main features are lexical, content-based, and network-based, as discussed in Section 2. The summarized papers have been further linked based on the type of features they used and whether the website contents were in Arabic or English.

A. ENGLISH-BASED STUDIES
Many studies have been conducted investigating different features and algorithms for malicious attack detection on English content websites. This section presents these studies and categorizes them into five sections. The first section presents studies that focused on lexical, content, and network-based features. The second section presents studies that focused on lexical and content-based features. The third section presents studies that focused on lexical and network-based features; the fourth section presents studies that focused on lexical features only; and the fifth section presents studies that focused on content-based features only.

1) LEXICAL, CONTENT-BASED, AND NETWORK-BASED FEATURES STUDIES
The research conducted by Aldwairi et al. [15] was based on a lightweight self-learning scheme. The open-source datasets used were Alexa, which contains benign websites [16], and the PhishTank dataset, which contains malicious URLs [14]. The extracted features are lexical, network-based, and content-based, with a total of 31 features. The system achieved a precision of 87%.
Another study, which was conducted by Xuan et al. [17], proposed a malicious URL detection method using ML techniques. To classify URLs, they used 54 lexical, network-based, and content-based features. Their random forest (RF) algorithm resulted in high accuracy at 96.28%. However, Molah et al. [18] achieved better accuracy of 97.36% using RF in an intelligent system for detecting phishing websites using different ML techniques. Their dataset was adopted from the University of California Irvine Machine Learning Repository (UCI-ML) [19]. A total of 30 lexical, network, and content-based features were extracted to classify the URLs.
Yuan et al. [20] proposed a parallel neural joint model algorithm to analyse and detect malicious URLs by combining the technologies of a labelled capsule neural network (CapsNet) and independently recurrent neural network (IndRNN). The dataset used was collected from PhishTank [14], Malware Domain List [21], and Alexa [22]. They used lexical and other features, and their model included three parts: IndRNN, CapsNet, and attention. Their proposed method achieved an accuracy of 99.78%.
In addition, Yu [23] proposed a hybrid model that combined the advantages of a deep belief network (DBN) and support vector machine (SVM) for phishing website detection. The dataset was collected from PhishTank [14]. The features considered were lexical, content-based, and network-based. The model (DBN-SVM) achieved the highest accuracy of 99.96%.
Another study, by Zamir et al. [24], proposed a framework to detect phishing websites using a stacking model. The dataset used was from Kaggle [25]. They extracted 32 content, lexical, and network features. Two stacking models were formed based on the highest scoring classifiers: Stacking 1 (RF + neural network (NN) + bagging classifier (BC)) and Stacking 2 (K-nearest neighbor (KNN) + RF + BC). The highest achieved accuracy was 97.4% with the Stacking 1 model (RF + NN + BC).
A different approach was used by Alkhudair et al. [26], who applied a malicious URL detection method using four ML algorithms. They obtained their dataset from the Kaggle [27] and Urcuqui et al. [28] datasets and used 20 lexical, content-based, and network-based features. RF had the best result, with 95% accuracy.
However, Deebanchakkarawarthi et al. [29] achieved better accuracy of 97% in their ML methodology aiming to avoid database dependency, increase efficiency, and detect malicious URLs.
In order to detect and categorize malicious URLs, Selvaganapathy et al. [30] proposed a methodology based on a stacked restricted Boltzmann machine for feature selection with deep NN. The dataset was formed from MalwareDomainList; UCI-ML Repository: Spambase Dataset [31]; UCI-ML Repository: Phishing Dataset [19]; DMOZ [32]; and Alexa [16]. A total of 98 features were extracted. The highest accuracy was achieved by DBN (75%).
Similarly, Rao et al. [33] proposed a heuristic technique to detect phishing sites hosted on compromised servers. The dataset was collected from the PhishTank website [14] and Alexa [22]. They selected 6 lexical features, 1 network-based feature, and 10 content-based features. The highest accuracy of 98.05% was achieved by the twin SVM (TWSVM).
Patil et al. [37] proposed a methodology to detect malicious URLs and the type of attacks based on multi-class classification. The dataset was collected from the Alexa top sites and PhishTank [14], MalwareDomainList [21], and jwSpamSpy [38]. They extracted 65 lexical, 34 content-based, and 18 network-based features. The highest average accuracy of 98.44% in identifying the attack type was achieved by the confidence-weighted (CW) learning classifier. In the detection of malicious URLs, they achieved an accuracy of 99.86%. A limitation of their methodology is that it lacks the detection and analysis of obfuscated JavaScript on web pages.
Yang et al. [39] presented multidimensional feature phishing detection (MFPD) based on a DL detection method. They created a dataset by crawling PhishTank [14] and DMOZ [40], and the three types of URL features selected were lexical, content-based, and network-based. The MFPD algorithm achieved the best performance of 98.99% accuracy.
In addition, Mourtaji et al. [41] proposed a hybrid rule-based methodology to detect and control phishing websites. Their dataset was collected from PhishTank [14] and Alexa [16], and they extracted 37 lexical, network-based, and content-based features. The best accuracy was achieved by the convolutional NN (CNN) model at 97.945%.
Along the same lines, Chen et al. [42] proposed an ML model for the intelligent detection of malicious URLs. They collected their dataset from Alexa [22], urlquery.net, urlscan.io [43], and GitHub [44]. Their study provided 41 lexical, network-based, and content-based features. The 17 most significant features were identified using analysis of variance (ANOVA) and the extreme gradient boosting algorithm (XGBoost). They concluded that the XGBoost classifier had the best result with 99.98% detection accuracy.
A study conducted by Vundavalli et al. [45] aimed to distinguish between benign and malicious websites. They obtained their dataset from the Kaggle website [46]. The best result was achieved by naive Bayes (NB) with an accuracy of 91%.
Additionally, Crisan et al. [47] proposed a method with the goal of using a combination of word embeddings and network-based features that considered specific methods of addressing the class imbalance. Their dataset was provided by a security company. The classifier with the best performance was a multilayer perceptron (MLP), with an accuracy of 95.81%.

2) LEXICAL AND CONTENT-BASED FEATURES STUDIES
Cao et al. [48] proposed a model to detect malicious URLs in online social networks (OSNs) using seven lexical and content-based features. They collected the original messages from the largest OSN in China, Sina Weibo. The Bayesian network (BN) model achieved the best results with an accuracy of 84.74%. The limitations of this study included the lack of an expanded dataset, the collection of big data being a challenge for data mining, and the need for the evaluation to be more comprehensively compared with existing studies.
Humam et al. [49] evaluated various methods and offered rules-based applications for efficient phishing detection. The authors of this study built their dataset, and the detection methods were based on 13 lexical and content-based features. The experimental results showed that the decision tree (DT) had the highest accuracy, at 96.8%.
Similarly, Rao et al. [50] proposed a classification model based on lexical and content features to overcome the disadvantages of current anti-phishing techniques. Their dataset consisted of the Alexa PageRank system [22] and PhishTank [14]. Principal component analysis RF (PCA-RF) performed the best out of the oblique RF methods, with an accuracy of 99.55%.
A study conducted by Adewole et al. [51] proposed a hybrid rule induction algorithm capable of separating phishing websites from legitimate ones. The hybrid algorithm uses the strengths of both the rule induction algorithm (JRip) and the projective adaptive resonance theory (PART) algorithm to produce rule sets. Their dataset was collected from PhishTank [14], Yahoo, Alexa [16], CommonCrawl [52], and OpenPhish [36]. The total of extracted lexical, network-based, and content-based features was 40, with the proposed system returning the highest accuracy of 99.08%.
Liu et al. [56] designed a web spam detection method by extracting novel feature sets. They built their method based on the WEBSPAM-UK2007 dataset [57] and the UK-2011 [58]. In addition, they selected 28 content-based and lexical features. The highest accuracy was achieved by the RF at 93%.

3) LEXICAL AND NETWORK-BASED FEATURES STUDIES
Manjeri et al. [59] proposed a model to classify a URL by handling class imbalance using a public dataset [60]. The RF algorithm achieved the best accuracy at 96%. In contrast, Vanhoenshoven et al. [61] developed a model to detect malicious URLs and obtained a better accuracy of 98.26% with the same classifier. The dataset they used was adopted from the one presented by Ma et al. [62]. The URLs were obtained from a large webmail provider and Yahoo's directory listing. A total of 3.2 million lexical and network-based features were extracted in these studies.
Another study conducted by Rakotoasimbahoaka et al. [63] aimed to solve the over-fitting problem of the combination of ML and DL by using different laws in a majority voting system. The datasets used were from OpenPhish [36]. They extracted 12 lexical and network-based features. They found that the majority vote method solved the over-fit problem in a combination model (RF-CNN-LSTM), which reached 93% accuracy using the second dataset.
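The majority voting step itself is straightforward. The sketch below, which assumes that each base model outputs a single class label per URL, shows plain unweighted voting; the specific voting laws compared in [63] are not reproduced here, and the model outputs are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of base models.

    Ties are resolved in favour of the label that appears first in
    `predictions`, because Counter.most_common keeps first-seen order
    for equal counts.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical outputs of three base models (e.g. RF, CNN, LSTM) for one URL.
votes = ["malicious", "benign", "malicious"]
print(majority_vote(votes))
```

Using an odd number of base models for a binary task avoids ties altogether.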
A study conducted by Rao et al. [64] developed an ML model for classifying URLs using a dataset collected from Kaggle [60]. They used 19 lexical and network-based features. Finally, they implemented the system using XGBoost, which achieved an accuracy of 96.8%.
Rakotoasimbahoaka et al. [65] proposed a hybrid approach based on ML and DL methods. They used datasets from Kaggle and a combination of lexical and network-based features to get the best prediction. In the final experiment, they found that their proposed model CNN-LSTM-RF (96%) did not perform as well as CNN-LSTM (99%), but it detected URLs better.
In addition, a study conducted by Chiramdasu et al. [66] applied an ML approach to identify malicious URLs. The ML model was implemented using Logistic Regression (LR) with a dataset compiled from PhishTank [14], Kaggle, and GitHub public repositories [44]. To classify the URLs, network-based and lexical features were used, and the KNN model achieved the best accuracy with 93%.
Shi et al. [67] proposed an approach to detect malware domain names using Extreme Learning Machine (ELM). Their dataset was collected from DNS queries in the Network and Information Center of Shanghai Jiaotong University. In addition, they selected nine lexical and network-based features, and their detection method had a high detection rate with an accuracy of more than 95%.
Furthermore, Parekh et al. [68] introduced a model to detect phishing websites using URL detection. The dataset used was gathered from PhishTank [14]. The best accuracy came with the RF model, which reached around 95%.
Butnaru et al. [69] achieved a better result with an accuracy of 98.86% with their development of a phishing detection engine based on an ML model using nine features. Their dataset was formed from PhishTank [14] and Kaggle [70].
However, Shantanu et al. [71] achieved better accuracy of 99.7% with the same classifier in their comparison of the efficiency of several ML classifiers at detecting malicious URLs using 14 features. The dataset used was from the Kaggle repository [70]. All three studies extracted two types of features: network-based and lexical.
Another study that proposed an approach to detect malicious URLs was conducted by Astorino et al. [72]. They used two datasets: PhishTank [14] and the DMOZ Open Directory [73]. They selected seven lexical and network-based features. They obtained the best accuracy (86.3%) with a spherical separation methodology.
Peng et al. [74] proposed a new model based on the attention mechanism (JCLA) for detecting malicious URLs. The dataset was from PhishTank [14], and the URLs were identified using 98 lexical and network-based features with the SoftMax classifier. The JCLA achieved an accuracy of 98.26%.
Wadas [75] presented a model to detect phishing URLs using ML techniques. The datasets used to train the model were from PhishTank [14], and another dataset was adapted from the author's previous work [76]. A total of 14 lexical and network-based features were extracted in this model. The NN achieved the best results with an accuracy of 78.4%.
Sadique et al. [77] developed a framework for detecting phishing URLs with a dataset built from PhishTank [14]. They calculated the cost for each URL feature used. They noticed that lexical features require less time to extract than network-based features. As a result, they attempted to categorize URLs using the less expensive feature sets first before obtaining the more expensive ones. The experimental results showed that RF outperformed all other ML algorithms in terms of accuracy and time duration, with a score of 90.51%.
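The idea of trying inexpensive features first can be sketched as a two-stage cascade. The code below is only an illustration of that idea and is not the framework of [77]: the thresholds, the toy scoring functions, and the lookup table that stands in for a network query are all hypothetical.

```python
def cascade_classify(url, cheap_score, costly_score, low=0.2, high=0.8):
    """Classify with cheap features first; fall back to costly ones if unsure.

    Both scoring functions return a maliciousness score in [0, 1].
    Returns a (label, stage_used) pair.
    """
    score = cheap_score(url)
    if score <= low:
        return "benign", "lexical"
    if score >= high:
        return "malicious", "lexical"
    # The cheap stage was inconclusive, so pay for the expensive features.
    score = costly_score(url)
    return ("malicious" if score >= 0.5 else "benign"), "network"


def toy_lexical_score(url):
    """Toy stand-in for a lexical model: more digits means more suspicious."""
    digits = sum(ch.isdigit() for ch in url)
    return min(1.0, digits / 5)


# Toy stand-in for results that would normally require network lookups.
TOY_NETWORK_SCORES = {"http://shop1.example/p2": 0.9}


def toy_network_score(url):
    return TOY_NETWORK_SCORES.get(url, 0.5)


for u in ("http://example.com/",
          "http://192.168.10.25/a1b2",
          "http://shop1.example/p2"):
    print(u, cascade_classify(u, toy_lexical_score, toy_network_score))
```

In this sketch, only the third URL triggers the expensive stage, so the cost of network feature extraction is paid only when the lexical stage is inconclusive.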
However, Patgiri et al. [78] proposed a model to detect malicious URLs that obtained better accuracy (93.30%) with the same classifier. They divided the dataset into training and test data in 60:40, 70:30, and 80:20 ratios. They calculated the accuracy for several iterations for each split ratio. As a result, they concluded that the 80:20 split ratio was the best split.
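The three split ratios can be reproduced with a few lines of code. The sketch below assumes a simple shuffled hold-out split with a fixed seed; the placeholder URLs and the function name are illustrative, and the exact splitting procedure used in [78] is not described in this review.

```python
import random


def split_dataset(samples, train_fraction, seed=0):
    """Shuffle a copy of `samples` and split it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder data: 100 synthetic URLs.
urls = [f"http://site{i}.example/" for i in range(100)]

sizes = {}
for fraction in (0.6, 0.7, 0.8):
    train, test = split_dataset(urls, fraction)
    sizes[fraction] = (len(train), len(test))
    print(f"{fraction:.0%} train: {len(train)} train, {len(test)} test")
```

Repeating the split with different seeds and averaging the resulting accuracies gives a fairer comparison between ratios than a single run.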
In contrast, a study conducted by Chiramdasu et al. [79] using the same classifier detected malicious URLs with an ML technique using 13 features and achieved an accuracy of 99.61%. All three studies extracted lexical and network-based features of the URLs.
Prieto et al. [80] presented a novel knowledge-based system called domains classifier based on risky websites (DOCRIW). Five network-based features and one lexical feature were selected. The LR achieved the best accuracy of 89%. The DOCRIW framework had some limitations, such as an insufficient number of features and a limited data sample size.
Another study proposing a Chrome extension that acts as middleware between users and malicious websites was published by Desai et al. [81]. The dataset was obtained from the UCI-ML Repository [82], and 22 lexical and network-based features were utilized. The experimental results showed that the RF algorithm returned the best accuracy (96.11%).
Akour et al. [83] investigated the effectiveness of ML for phishing detection. They used a dataset proposed by Vrbancic et al. [84].
Along the same lines, He et al. [85] proposed a feature selection method based on RF. The dataset was collected from Alexa, MalwareDomainList [86], OpenPhish [36], Cybercrime-tracker [87], and 360.com [88]. There were originally 28 lexical and network-based features, and after the feature selection process, 18 remained. The best results were achieved by RF with an accuracy of 90.81%.
Ozcan et al. [89] proposed hybrid DL models that were combinations of deep NN (DNN)-LSTM and DNN-Bidirectional LSTM (BiLSTM). They used two datasets from Ebbu2017 [90] and PhishTank [14], along with a dataset from PhishStorm [91]. They extracted the network-based and lexical features. The highest accuracy was achieved by DNN-BiLSTM, with an accuracy of 99.21% using the second dataset.
Lee et al. [92] conducted a study to assess the efficiency of the ML approach in detecting and identifying malicious and benign URLs. They used a public dataset from Kaggle that contains malicious and benign URLs. For classification model construction, they used nine network-based and lexical URL features. They employed feature optimization techniques to select relevant URL features by utilizing a bio-inspired algorithm, which reduces the time for training and testing and simplifies the malicious URL detection system. Ultimately, both the NB and SVM models presented a performance of 99% accuracy, which was better than the other classifiers.

4) LEXICAL STUDIES
Raja et al. [93] proposed a method to detect malicious URLs. They extracted 27 lexical features but utilized only 20 of them, which reduced execution time and storage requirements. The study used the University of New Brunswick (UNB) dataset [94]. The classifier achieved the best result with an accuracy of 99%.
Another study conducted by Vanitha N et al. [95] had the goal of allowing computers to learn independently, without human intervention or support, and consequently regulate actions. The dataset was collected from GitHub [96]. They considered lexical and other features. The websites were classified as malicious or benevolent [96], and the best result was achieved by LR with an accuracy of 98.42%.
Aalla et al. [2] proposed a model that detects malicious URLs based on comparing the results between two algorithms using LR and DT. They used a dataset that labelled URLs as legitimate or malicious. The LR model achieved the best result with an accuracy of 97.5%.
A study conducted by Ateeq et al. [97] had the goal of introducing a method to classify URLs according to their type using NN. The dataset used in this study was CICANDMAL2017 [98], and they extracted eight lexical features from URLs. They used a feedforward NN (FFNN), which falls under DL algorithms, with multiple hidden layers to detect the URL type. The NN was able to successfully detect 98.48% of the URLs.
Another study using lexical features was conducted by Shivangi et al. [99], who proposed a tool deployed as a Chrome extension using DL techniques. The authors collected several URLs from various sources by web scraping. The dataset was obtained from search engines, PhishTank [14], and CommonCrawl [52]. Finally, the LSTM model achieved the best results with an accuracy of 96.89%.
Likewise, a study that used lexical features only was conducted by Pingle et al. [100]. Their goal was to provide a structure for detecting a harmful web page, and they found that the best classifier was ID3.
Lakshmanarao et al. [101] proposed a model that helps detect malicious websites using lexical features.The dataset used was from Kaggle [102].The best classifier was the RF with an addition to the hashing vectorizer (HV) technique, which achieved the highest accuracy of 97.5%.
Khan et al. [103] presented a model that used a majority voting classifier to combine numerous ML methods.The datasets used were obtained from UNB [94] and Kaggle [104].They extracted 47 lexical features with the help of feature scoring techniques to identify the most frequent significant features in both datasets.The voting classifier achieved the highest accuracy of 99.72%.
Another study that introduced a phishing detection technique was conducted by Abutaha et al. [105] using a dataset published by another author [106].The dataset was processed to produce 22 lexical features, and the best results were from SVM, with an accuracy of 99.896%.
Zhao et al. [107] focused on using ML techniques for multi-classification of malicious URLs.They used part of the dataset from [108] and [109], which were both derived from a Chinese Internet security company.The gated recurrent unit (GRU) NN model outperformed the RF model with an accuracy of 98.5%.
In addition, Hai and Hwang [110] presented a solution based on natural language processing (NLP) techniques to classify URLs as either benign or malicious. Their dataset was drawn from DMOZ [32], MalwareDomainList [21], Malc0de [111], and CleanMX [112], and they extracted lexical features of the URLs. The best accuracy (97.1%) was achieved by SVM.
Another study focused on building an efficient and fast phishing URL detection approach was conducted by Banik et al. [113]. The dataset was collected from PhishTank [14] and the DMOZ directory [114]. The performance was evaluated with different sizes of datasets using different numbers of features. A total of 18 lexical features were extracted from the URLs, and the 15 most frequently contributing features were selected. The proposed system that used the SVM model was able to detect phishing websites with an accuracy of 96.35%.
Sameen et al. [115] designed an ensemble ML-based system called PhishHaven to detect both AI-generated and human-crafted phishing URLs. They classified the URLs as phishing or normal using 17 lexical features. The dataset was collected from Alexa [16], PhishTank [14], and DeepPhish [116], and the results showed that PhishHaven was able to achieve 98% accuracy.
A study comparing the performance of traditional ML algorithms with popular DL framework models was conducted by Johnson et al. [11]. Two experiments were conducted using the ISCX-URL-2016 dataset from UNB [94], which contains five URL classes: benign, defacement, malware, phishing, and spam. To classify the URLs, 78 lexical features were used. The best results were achieved by RF with an accuracy of 96.99%.
A study by Liang et al. [117] proposed an algorithm based on deep bidirectional long short-term memory (DBLSTM). The researchers used open datasets from 360 NetLab [118] and Alexa [119]. Lexical features of URLs were chosen due to the simplicity of analysis and widespread application to any kind of domain generation algorithm (DGA) family. The precision of the proposed DBLSTM algorithm remained high at 93-95%, while that of conventional models such as LR and SVM dropped significantly, to lower than 71-73%.
In addition, a study by Joshi et al. [120] proposed a static lexical feature-based RF classification approach to classifying malicious and benign URLs. They used a dataset from various sources, including OpenPhish [36], Alexa whitelists [16], and internal FireEye sources. After they analysed several URLs, they found 23 different lexical features that could be used to classify malicious and benign URLs. Ultimately, the RF model was the best choice for classification, with the best accuracy (92%) of the compared models.
One study conducted by Ispahany et al. [121] proposed an ML classification technique for detecting malicious URLs related to the COVID-19 pandemic using five lexical features. The dataset was collected from DomainTools [122] and WhoisDS [123]. Their model achieved an accuracy of 99.2%. In the future, they plan to investigate the incongruence of entropy.
A study by Afzal et al. [124] introduced a hybrid DL approach named URLdeepDetect for time-of-click URL analysis and classification to detect malicious URLs using lexical features. This study used a dataset from PhishTank [14] and Kaggle [70]. The k-means model achieved the best results with an accuracy of 99.7%.
Another study conducted by Zeng [125] used 26 lexical features to detect malicious URLs in email content. The dataset used was from PhishTank [14] and DMOZ [32], and the experimental results showed that gradient boosting DT (GBDT) outperformed all other classifiers, achieving an accuracy of 90.71%.
Gupta et al. [126] developed an ML-based phishing detection system to help users check the legitimacy and maliciousness of a URL within a minimal time frame. This study used a dataset from the University of New Brunswick (UNB) [94] and nine lexical features. They achieved the best accuracy of 99.57% with the RF algorithm.
Another study, which was conducted by Banik [127], developed an ML-based phishing URL detection system using lexical features of URLs. The dataset was collected from PhishTank [14], the DMOZ directory [114], a dataset proposed by Chiew et al. [128], and a dataset collected by the authors [129]. A total of 17 lexical features were extracted from URLs. They achieved the best accuracy of 98.57% with the RF algorithm.
Sahingoz et al. [130] proposed a real-time anti-phishing system using lexical features. The dataset was collected from PhishTank [14] and Yandex Search [131]. The RF algorithm using only NLP-based features gave the best performance with an accuracy of 97.98%.
Another study, conducted by Bahnsen et al. [132], focused on the classification of sites as legitimate or phishing using ML techniques. The dataset used was extracted from PhishTank [14] and Common Crawl [52], and 14 features were selected based on lexical and statistical analyses of the URLs. The LSTM model achieved the best results with an accuracy of 98.7%.
Wei et al. [133] presented a method of detecting malicious URL addresses based only on URL lexical features using a DNN with convolutional layers. The dataset was collected from PhishTank [14], Alexa [16], and CommonCrawl [52]. Their model achieved an accuracy of 99.98%.
Yuan et al. [84] proposed a methodology that makes use of embedded representations of characters in URLs to detect phishing web pages. The character embedding achieved by the word2vec model does not depend on any network load or external knowledge. However, it does depend on lexical features (character embedding). The dataset used was collected from Alexa [119], the technical challenge of network security [134], PhishTank [14], and Reasonable Antiphishing [135]. The best performance was achieved by XGBoost with an accuracy of 99.69%.
Additionally, Yang et al. [136] proposed an integrated phishing website detection method based on RF and CNN. They used two datasets: the first dataset (DS1) was compiled from PhishTank [14] and Alexa [16], and the second dataset (DS2) was a benchmark dataset used by Sahingoz et al. [130] from PhishTank [14] and Yandex [137]. They extracted lexical features (character embedding features) using the CNN model and classified multilevel features using RF classifiers. Their model achieved an accuracy of 99.35% using DS1.
Yuan et al. [138] proposed a model based on the attention mechanism, a bi-directional independent recurrent neural network (Bi-IndRNN), and CapsNet, which together form a joint NN model for detecting malicious URLs. The dataset was collected from the Alexa website [22], hpHosts [139], and PhishTank [14]. The extracted features were lexical features and a texture fingerprint feature that converts the URLs into grayscale images. The proposed model achieved an accuracy of 99.78%.

5) CONTENT-BASED FEATURES STUDIES
Altay et al. [140] proposed classifying web pages using supervised ML techniques and a dataset collected from PhishTank [14] and Alexa [16]. The data were extracted from web pages using a keyword density extractor library designed by Comodo Group [141], and 8,000 content features were extracted. They achieved an accuracy of 98.24% with an SVM using a radial basis function kernel (SVM-RBF).
McGahagan et al. [142] proposed assessing whether additional webpage features would improve the detection of malicious websites. They collected their dataset from the Cisco Talos Intelligence Group [143] and the Alexa list [22] and selected 26 content-based features. The RF model achieved the best accuracy of 91.36% in the case of no sampling.
In addition, Jain et al. [144] provided a novel approach for identifying phishing threats by examining hyperlinks in the HTML source code of a website. They collected the dataset from PhishTank [14], Alexa top websites [22], Stuffgate Free Online Website Analyzer [145], and the online payment service providers list. Their proposed approach combined a variety of distinctive hyperlink-specific features to detect phishing attacks and divided them into 12 content-based feature categories. The best result was achieved by the LR model with an accuracy of 98.42%.

B. ARABIC STUDIES
Many studies have been conducted to investigate different features and algorithms to detect phishing attacks on Arabic content websites. This section presents these studies in two groups: the first covers studies that used both lexical and content-based features, and the second covers studies that used content-based features only.

1) LEXICAL AND CONTENT-BASED FEATURES STUDIES
Al-Kabi et al. [146] proposed an approach to detecting Arabic spammed web pages using content-based analysis. They built their dataset using a crawler developed by Alsmadi [147] and selected seven lexical and content-based features. The results showed the DT algorithm was the best, with an accuracy of 99.521%.
Another study focused on web spam detection was conducted by Al-Kabi et al. [148], who proposed an integrated online Arabic web spam detection system (OLAWSDS) that filters malicious pages from search engines. They extracted 18 lexical and content features. The best results achieved an accuracy of 99% using the trust rank model.
In addition, EL-Mohdy et al. [149] proposed web spam detection based on web mining. The spam web pages were collected manually using search engines with a spamming query, such as pages that support terrorism from Egypt's blocked websites list [150]. The non-spam web pages were collected from trusted sites such as governmental and news sites. In addition, they selected one lexical feature and three content-based features. The DT classifier achieved an accuracy of 97%.

2) CONTENT-BASED FEATURES STUDIES
The study conducted by Alsaleh et al. [10] showed how ineffective Google's anti-spamming methods are against web spam pages that contain non-English content. It provided a solution in the form of a browser anti-spam plug-in that detects Arabic spam pages. The dataset was collected by the authors themselves, and they selected seven content-based features. They also tested multiple variations of four ML algorithms to build their classifier. The random forest DT (RFT-S) showed the best detection rate, 87.13%.
Similarly, Wahsheh et al. [151] proposed a system to classify URLs as spam or not spam. The goal of the study was to build the first Arabic content or link web spam detection system using the rules of DT. The proposed system helps to clean a search engine results page (SERP) of all URLs referring to Arabic spam web pages. The proposed model achieved 93.1034% accuracy for Arabic links using 15 content-based features.
Another study, published by Al-Twairesh et al. [152], aimed to analyse the content of Saudi tweets to detect spam by developing both a rule-based approach and a supervised learning approach. They used the Twitter search application programming interface (API) to collect the dataset of spam and non-spam tweets. The NB classifier gave the best results with stemming, 91.6%, using four content features.
Likewise, Alorini [153] proposed discovering Arabic spam on Twitter using ML. The dataset was collected from Twitter using the Twitter streaming API and the Tweepy package for Python and then translated into English with the help of Arab annotators. Three content-based features were selected in this study. The highest accuracy of 91% was achieved by Bayesian reasoning (BR).
Additionally, Alkhair et al. [154] focused on investigating fake news content in the Arabic world through the information posted on YouTube. They collected comments that were classified as rumor or non-rumor using the YouTube API. The achieved performance varied depending on the rumor topic and the classifier used. Overall, for the dataset used, the best classifier was the SVM, which reached an accuracy of 95.35%.
Wahsheh et al. [155] proposed an approach to detect link-based spamming techniques used in Arabic spam web pages. The dataset was collected using Web Link Validator [156] to analyse the web pages by finding broken links, checking the HTML code's accuracy, and selecting six content-based features. The DT yielded the highest accuracy of 91.4706%.
Mataoui et al. [157] proposed a new supervised spam detection approach by defining a set of features in the Arabic language. The dataset was extracted from Facebook. In the pre-processing step, they extracted tokens using standard NLP techniques, such as tokenization, normalization, stop-word removal, and stemming. The normalization stage in the Arabic language serves to convert each letter to its prescribed standard form (for example, "أ", "إ", "آ", and "ا" are multiple forms of the letter alif).
Another study that used content-based features of Arabic tweets was proposed by Alharbi et al. [158]. They focused on classifying rogue and spam content in Arabic tweets using ML algorithms. They collected the dataset from spamming Twitter accounts. The 47 generated features were analysed, and the best features were selected. The performance results of the study showed that the RF classification algorithm with 16 features performed best, achieving accuracy rates greater than 90%.
Najadat et al. [159] proposed a keyword-based method to detect Arabic spam reviews. The dataset was extracted from different sections of Facebook pages using the Netvizz application [160]. The Facebook comments were classified based on content-based features, such as the keywords extracted from them. The best results were achieved by the DT model, with an accuracy of 92.63%.
In addition, Mubarak et al. [161] proposed a model to detect Arabic spam tweets and identified different properties of Spam and Ham tweets. They built their own dataset from Twitter, and they selected four content-based features. The highest result was achieved by the Arabic bidirectional encoder representations from transformers (Ara-BERT), with an accuracy of 99.7%.
Alsulami et al. [162] proposed a personalized filtering model called SentiFilter that aimed to provide each user with a personalized level of protection against what the user perceives as unwanted content. The dataset was collected from Twitter. The best results were achieved by the SVM classifier, with an average accuracy of 90.89%.
Wahsheh et al. [163] proposed an Arabic opinion spam detection system (SPAR). The goal was to detect spam opinions in the Yahoo!-Maktoob social network and categorize them as spam or non-spam opinions based on many features. A dataset of opinions (reviews) from Yahoo!-Maktoob News was collected and analysed by the authors. Each gathered opinion is pre-processed using the following procedures: 1) Delete the non-Arabic text. 2) Delete the punctuation. 3) Normalize similar Arabic letters. 4) Tokenize the Arabic opinion. They used SVM to evaluate the proposed SPAR system, and it achieved an accuracy of 97.5073%.
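The four pre-processing procedures listed for SPAR can be sketched as follows. This is a minimal illustration rather than the implementation from [163]: the function name `preprocess_opinion`, the Unicode ranges, and the alif normalization map are assumptions, and the removal of diacritics and tatweel is an extra assumed step added so that words are not split apart.

```python
import re
from typing import List

# Assumed extra step: diacritics (U+064B..U+0652) and tatweel (U+0640)
# are deleted first so that they do not break words into pieces.
DIACRITICS_AND_TATWEEL = re.compile(r"[\u064B-\u0652\u0640]")

# Anything that is not an Arabic letter (U+0621..U+064A) or whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Assumed normalization map: fold hamza/madda alif variants into bare alif.
ALIF_VARIANTS = str.maketrans(
    {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}
)


def preprocess_opinion(text: str) -> List[str]:
    """Return normalized Arabic tokens from a raw opinion string."""
    text = DIACRITICS_AND_TATWEEL.sub("", text)
    # Steps 1 and 2: replace non-Arabic text and punctuation with spaces.
    text = NON_ARABIC.sub(" ", text)
    # Step 3: normalize similar Arabic letters.
    text = text.translate(ALIF_VARIANTS)
    # Step 4: tokenize on whitespace.
    return text.split()
```

Whitespace tokenization is the simplest choice here; a study could substitute a dedicated Arabic tokenizer or add stemming without changing the earlier steps.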
Another study focused on Arabic tweets for the detection of suspicious messages was conducted by AlGhamdi et al. [164]. The goal was to develop a system to detect suspicious messages written in the Arabic language. They used the Twitter streaming API to get Arabic tweet data. The SVM model achieved the best results with an accuracy of 86.72%.

V. DISCUSSION AND ANALYSIS
This section summarizes the reviewed papers in terms of publication year, URL classification, dataset source and size, classifiers, and the highest results obtained, as presented in Table 1. In the English studies, the most frequently used features were lexical features, in 72 studies out of 91. These were followed by network-based features, which were used in 39 studies, while 26 studies used content-based features. The highest result was achieved by CNN with an accuracy of 99.98% [133] on a dataset of 21,208 URLs. In contrast, the Arabic studies mostly used content-based features, in 16 studies, while three studies used URL lexical features. Network-based features were not used in any study of Arabic content websites. Among the Arabic studies, the highest result, an accuracy of 99.521%, was achieved by DT [146] on a dataset of 15,000 Arabic spam web pages. The majority of English-based studies that used lexical, network-based, and content-based features achieved accuracies greater than or equal to 95%. On the other hand, none of the Arabic-based studies used all three types of features together. Notably, among the English-based studies that used all three types of features, the highest accuracy, 99.98%, was achieved by Chen et al. [42]. That study showed that the three most important features are network-based:
1. Whether the domain country code is among the top eleven most common malicious country codes.
2. The interval between the domain update time and the current time.
3. The interval between the contract expiration of the domain and the current time.
Unfortunately, few English-based studies have utilized these three features, and no Arabic-based study has used any network-based features. Therefore, combining the three types of features in an Arabic-based study could be a promising new research direction.
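As a rough illustration of the three network-based features attributed to Chen et al. [42], the sketch below computes them from registration metadata that is assumed to be already available (for example, from a WHOIS lookup). The function name, the dictionary keys, and especially the contents of `SUSPICIOUS_COUNTRY_CODES` are hypothetical; the actual eleven country codes are not listed in this review.

```python
from datetime import datetime
from typing import Dict

# Placeholder values only: the real top-eleven list is not given here.
SUSPICIOUS_COUNTRY_CODES = {"xx", "yy"}


def network_features(
    country_code: str,
    updated_at: datetime,
    expires_at: datetime,
    now: datetime,
) -> Dict[str, int]:
    """Compute the three network-based features from domain metadata."""
    return {
        # 1. Is the domain country code in the suspicious set?
        "cc_in_top_malicious": int(country_code.lower() in SUSPICIOUS_COUNTRY_CODES),
        # 2. Interval between the domain update time and the current time.
        "days_since_update": (now - updated_at).days,
        # 3. Interval between the domain expiration and the current time.
        "days_until_expiry": (expires_at - now).days,
    }
```

Passing `now` explicitly keeps the function deterministic, which makes the features reproducible when a dataset is re-processed later.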
Several studies combined two kinds of URL-based features: lexical with content-based, or lexical with network-based. The lexical and network-based combination was the most widely used, including by [71], which achieved the highest accuracy of 99.7% using the RF classifier. By comparison, [50] achieved an accuracy of 99.55% by combining lexical and content-based features and employing the PCA-RF classifier.
From our review, we found that the studies that used lexical features only employed between 5 and 47 features. Lexical features are the most used type of feature for the following reasons:
1. They can be extracted without the need for additional services, tools, or an Internet connection.
2. Most of their values are numeric (e.g., URL length and number of special characters), so they do not require any encoding.
3. Fast execution time.
Some of the commonly used features are URL length, length of the domain name, count of certain symbols (such as '@', '&', '#', and '/'), and count of digits. Furthermore, these special characters are considered suspicious, and they occur frequently in phishing URLs. Moreover, attackers tend to use long URLs to hide suspicious parts [110]. They also use the redirecting symbol "//" to redirect users to the websites containing the attack [110]. They may sometimes include suspicious word tokens within the URL (e.g., sign in, confirm, free) [110].
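The lexical features described above can be extracted with the standard library alone, which illustrates why they need no external services. The following is a minimal sketch, not a feature set taken from any reviewed study: the function name, the feature names, and the token list `SUSPICIOUS_TOKENS` are illustrative assumptions.

```python
from typing import Dict
from urllib.parse import urlsplit

SUSPICIOUS_SYMBOLS = "@&#/"
# Assumed token spellings, based on the examples "sign in", "confirm", "free".
SUSPICIOUS_TOKENS = ("signin", "confirm", "free")


def lexical_features(url: str) -> Dict[str, int]:
    """Compute simple numeric lexical features from a URL string."""
    host = urlsplit(url).hostname or ""
    # Ignore the "//" that belongs to the scheme separator.
    after_scheme = url.split("://", 1)[-1]
    lowered = url.lower()
    features = {
        "url_length": len(url),
        "domain_length": len(host),
        "digit_count": sum(ch.isdigit() for ch in url),
        "has_redirect_slashes": int("//" in after_scheme),
        "suspicious_token_count": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
    }
    for sym in SUSPICIOUS_SYMBOLS:
        features["count_" + sym] = url.count(sym)
    return features
```

Every value is an integer, so the resulting dictionary can be fed to a classifier without further encoding.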
Over the past two years, researchers have focused on deploying detectors as quickly as possible to detect malicious URLs associated with a certain trend, such as the COVID-19 pandemic. During the COVID-19 pandemic, UW Medicine made extensive use of telemedicine capabilities to provide patients with virtual care [165]. Staff members noted a dramatic increase in phishing emails that enticed employees to click on malicious links and download malware during this period [121]. Although the lexical technique is fast, it might not be sufficient to guarantee complete security if attackers attempt to hide dangerous information behind normal-looking URLs using benign tokens.
Content features used without any additional features are the least used in both English and Arabic studies. Some of the widely used content features are the number of words, the maximum number of words within certain HTML tags (<body>, <head>, etc.), and the count of certain HTML tags such as <meta> and <img>. Meta tags are typically used to specify the page description, keywords, and author of the document, so attackers exploit them to boost page rank through keyword stuffing. Phishing sites usually contain more images than benign ones. To extract the content features, researchers need to consider the HTML content of webpages and JavaScript (the <iframe> method, etc.). The content requires a set of pre-processing steps, including the removal of stop words and some special characters. Moreover, some languages, such as Arabic, require the removal of "Tashkeel" and "Tatweel". Studies that rely only on content features may face challenges such as customizing the detection model to handle different languages. However, the selection of the type of features usually depends on the URL dataset or the attack type, such as spam, phishing, drive-by downloads, and malware.
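A few of the content features mentioned above (word count and counts of selected HTML tags) can be sketched with Python's built-in HTML parser. The class and function names are hypothetical, and the sketch makes simplifying assumptions: it counts all text nodes, including any script or style text, and it strips Tashkeel and Tatweel with a single regular expression.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Dict

# Arabic diacritics (Tashkeel, U+064B..U+0652) and Tatweel (U+0640).
TASHKEEL_TATWEEL = re.compile(r"[\u064B-\u0652\u0640]")


class TagAndTextCounter(HTMLParser):
    """Collect start-tag counts and raw text chunks from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.tag_counts: Counter = Counter()
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1

    def handle_data(self, data):
        self.text_chunks.append(data)


def content_features(html: str) -> Dict[str, int]:
    """Compute a small set of content-based features from HTML source."""
    parser = TagAndTextCounter()
    parser.feed(html)
    parser.close()
    text = TASHKEEL_TATWEEL.sub("", " ".join(parser.text_chunks))
    return {
        "word_count": len(text.split()),
        "meta_count": parser.tag_counts["meta"],
        "img_count": parser.tag_counts["img"],
        "iframe_count": parser.tag_counts["iframe"],
    }
```

Stop-word removal is omitted because it depends on a language-specific word list, which is one reason content-based detectors are harder to port across languages than lexical ones.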
In terms of the classification algorithms, the following algorithms achieved the best performance, with accuracies of 99% and above: CNN, XGBoost, LSTM, SVM, CW, Majority Voting Classifier, RF, k-means, Ara-BERT, DT, and NB.
Even though CNN, XGB, and LSTM achieved the highest results, close to 100%, they are rarely used. The major disadvantages of XGB are that it is very sensitive to outliers and scales poorly [166]. CNN is mostly suitable for image data. Additionally, LSTM takes longer to train, requires more memory, and overfits easily. On the other hand, SVM, RF, DT, NB, and LR are the most used algorithms and achieved good performance, with accuracies of 98.42% and above. It should be noted that all of these algorithms are ML classifiers. ML classifiers work well on small and large datasets, whereas DL classifiers work well with large datasets [167].
It is important to note that the ensemble technique, which combines a set of algorithms, provides high accuracy of more than 90%, as shown in Table 3. In general, the ensemble method outperforms the individual models in terms of accuracy [168]. Furthermore, the algorithms with the lowest performance are BN, NN, and DBN, which achieved accuracies lower than 90%. The major drawback of NN and DBN is the need for a large amount of training data. Besides, the training process of the NN is the focal point of deciding the correct prediction of data patterns [169]. In addition, no algorithm is inherently good or bad, because performance depends on many factors, such as how clean the dataset is and how clear its patterns are, the size of the dataset, and the number of features.
In terms of the dataset, a total of 45 different dataset sources were used. The most common dataset source is PhishTank [14], which is available in multiple formats and is updated hourly. Datasets built by the studies' authors were the second most used and were collected by using crawling tools, special APIs, or manually. In the studies based on Arabic content websites, all authors built their own datasets, since there is a lack of datasets for Arabic content websites. The third most used dataset source is Alexa [16]. Other dataset sources were also used in more than one study, such as Kaggle, OpenPhish [36], and CommonCrawl [52]. However, compared to PhishTank [14] and Alexa [16], they are less popular. Some dataset sources were not used frequently, such as Ebbu2017 [90], CleanMX [112], DMOZ Open Directory [73], and the WEBSPAM-UK2007 dataset [57].

A. MACHINE LEARNING (ML) TECHNIQUES USED
The ML studies explored in this section are compared on several aspects: the algorithm's name, the number of studies that used it, and its highest accuracy. Table 2 shows the highest accuracy for each algorithm among all the studies that used it. It provides an overview of individual algorithms with the classification method and category of the algorithm. It is noteworthy that the highest accuracy (99.98%) was achieved by Wei et al. [133] and Chen et al. [42] using CNN and XGBoost, respectively. In addition, as demonstrated in Table 2, the SVM algorithm is one of the most frequently used ML algorithms in URL classification and was used in 47 studies. The SVM algorithm achieved an accuracy of 99.89%. However, the SVM algorithm does not handle large or noisy datasets well. Table 3 lists the combination algorithms, and Figure 2 shows the statistics of studies per algorithm.

B. DATASETS USED
As mentioned in Section IV, the reviewed articles used datasets from different sources to train and test their detection models. Some of the datasets are open source, built by the studies' authors, adopted from other authors, or a combination of those sources. However, the most common dataset sources are PhishTank [14] and Alexa [16]. PhishTank [14] was launched in 2006 by OpenDNS [171] and acquired by Cisco in 2015 [172]. PhishTank [14] is a free community site that enables anyone to submit, verify, track, and share phishing data. In contrast, Alexa [16], which was founded in 1996 by Brewster Kahle and Bruce Gilliat [173], was acquired by Amazon in 1999 [174]. Alexa [16] provides up-to-date global web rankings, traffic data, and other information on over 30 million websites [175]. Benign URLs are typically taken from Alexa, because it is an analytical tool that lists the top-ranked URLs around the world, or from datasets collected by the studies' authors. Most English-language content studies have used these two public datasets. Datasets built by the studies' authors were used about as frequently as the PhishTank dataset. In the Arabic-language content studies, all authors built their own datasets. Conversely, some sources were not often used despite representing further opportunities for research, including ArabicWeb16, which has 150 million Arabic web pages, making it the largest public Arabic web dataset. All the investigated studies had the same goal, which is to use these datasets to classify URLs. However, the authors differed in the next steps in terms of using the dataset with or without pre-processing, extracting more features, or further classifying the malicious URL as malware, phishing, or other. Figure 3 illustrates the investigated datasets along with the number of studies that used them.

VI. CHALLENGES AND RECOMMENDATIONS
Despite many significant improvements in malicious URL identification utilizing ML approaches over the previous decade, there are still many crucial and imperative unresolved problems and difficulties.
One of the main limitations of the reviewed papers was the data sample size [80], [162]. Therefore, we recommend evaluating and validating ML models for detecting malicious URLs using enough samples with an acceptable ratio between the normal and malicious URLs. Balancing techniques can be used to improve the detection rate while still retaining enough samples in the dataset. On the other hand, big data collection is a challenge for data mining [48] due to the required computation and processing time. Overcoming this issue requires a scalable environment, such as cloud computing models or servers with enough computation power.
Moreover, there are other limitations, including the lack of analysis and detection of obfuscated JavaScript in web pages [37]; outlier values [75], [83]; number of features selected [74], [80]; and feature effectiveness [74]. Issues related to the type and number of selected features can be resolved by applying selection techniques such as the following: (1) Filter feature selection methods, which apply a statistical measure to rank the importance of the features. These include the chi-square test, information gain, and correlation coefficient scores. (2) Wrapper methods, which treat feature selection as a search problem and use a search technique such as best-first search, random hill-climbing, or heuristic algorithms to evaluate combinations of features and rank them based on model accuracy. (3) Regularization methods, which treat feature selection as an optimization problem and apply regression techniques that shrink the model coefficients, thereby removing irrelevant features.
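To make the filter approach concrete, the sketch below ranks discrete features by information gain, one of the statistical measures named above. It is a plain-Python illustration under the assumption that feature values are categorical or already binned; the function names and the toy feature names in the usage note are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Reduction in label entropy after splitting on a discrete feature."""
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(
    columns: Dict[str, Sequence[Hashable]], labels: Sequence[Hashable]
) -> List[Tuple[str, float]]:
    """Return (feature name, information gain) pairs, best first."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, with labels `[1, 1, 0, 0]`, a feature `has_at = [1, 1, 0, 0]` separates the classes perfectly and scores 1 bit, while `has_digit = [1, 0, 1, 0]` carries no information and scores 0, so a filter method would keep the former and drop the latter.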
The continuous change of inclusive features that differentiate between legitimate and suspicious URLs is also a major challenge that could be addressed in future research. A possible solution in this regard is investigating the applicability of concept drift detection techniques for enhancing the performance of intelligent phishing URL detectors. However, this requires incorporating a method for detecting concept drift in order to warn the model designer about the necessity of building a new model. Yet, building a new model should not mean ignoring the old model, since the set of features that were inclusive at time t1 and became useless at time t2 might become inclusive again at time t3. In fact, completely ignoring the old model might result in what is called catastrophic forgetting. Nevertheless, catastrophic forgetting can be addressed using different techniques, including an ensemble approach that merges different models: each model produces a decision, and all decisions are then collected and processed to reach a final decision based on an improved voting technique that considers the quality of each model in the ensemble.
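One simple way to realize a voting scheme that considers the quality of each model is to weight each model's vote by its accuracy on recently labelled samples, so that older models are retained but count for less when they stop matching current data. The sketch below is an assumption-laden illustration, not a technique prescribed by the reviewed studies; the `Model` type and both function names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

# A model is any callable mapping a URL to a label such as "malicious".
Model = Callable[[str], str]


def accuracy(model: Model, samples: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (url, label) samples the model classifies correctly."""
    if not samples:
        return 0.0
    return sum(model(url) == label for url, label in samples) / len(samples)


def weighted_vote(models: Sequence[Model], weights: Sequence[float], url: str) -> str:
    """Return the label with the largest total weight across all models."""
    totals: Dict[str, float] = {}
    for model, weight in zip(models, weights):
        label = model(url)
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)
```

In use, the weights would be recomputed periodically with `accuracy` on a fresh labelled window, which lets a model built at time t1 regain influence at time t3 if its features become informative again.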
Further, it must be considered that the network traffic in a test environment differs from that in a real-world network. With the development of the Internet, types of malicious URLs have become more diverse. Therefore, there is a need to consider the sustainability of a phishing detector by validating the detection models with consideration of the evolution of attacks. This can be done by selecting the best features that allow detectors to capture the dynamic nature of attacks. To validate the suitability, an ML model must be trained with samples captured during a specific period, and testing must be conducted on samples collected in later periods that were not involved in training.
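The time-aware validation described above amounts to splitting the data by collection date instead of at random. A minimal sketch follows, assuming each sample carries the date on which the URL was first observed; the `Sample` layout and the function name are illustrative.

```python
from datetime import date
from typing import List, Sequence, Tuple

# (url, label, date the URL was first observed); the layout is an assumption.
Sample = Tuple[str, str, date]


def temporal_split(
    samples: Sequence[Sample], cutoff: date
) -> Tuple[List[Sample], List[Sample]]:
    """Train on samples seen before the cutoff, test on the cutoff and later."""
    ordered = sorted(samples, key=lambda s: s[2])
    train = [s for s in ordered if s[2] < cutoff]
    test = [s for s in ordered if s[2] >= cutoff]
    return train, test
```

Unlike a random split, this guarantees that no test sample predates any training sample, so the measured accuracy reflects how the detector copes with attacks that appeared after it was built.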

VII. CONCLUSIONS
This article reviewed and analysed several research studies that represent the latest research in the field of detecting malicious URLs. The papers were selected from a variety of articles obtained from reputable electronic sources. Primarily, this paper focused on reviewing studies about the detection of malicious URLs using ML algorithms, considering Arabic and non-Arabic content. The article presented several taxonomies and comparison results as a contribution to the field of malicious URL detection. Additionally, the article highlighted and discussed several findings: (1) Lexical features of the URL are the most frequently used features for detecting malicious URLs in non-Arabic content, whereas the studies on Arabic content mostly relied on content-based features. Moreover, the studies conducted on Arabic websites did not utilize network-based features. (2) Regarding the detection techniques, the most frequently used algorithms in the reviewed papers were SVM, RF, and NB. Furthermore, the CNN and XGBoost models achieved higher performance than other algorithms, with an accuracy of 99.98%. (3) Regarding the datasets used, we found that most studies on Arabic content generated their own datasets, whereas the studies of non-Arabic content used open-source datasets like PhishTank for malicious web pages and Alexa for benign web pages. Finally, we discussed several challenges that might impact the quality of the ML detection techniques, including the size of the dataset, outliers, feature selection, and the sustainability of the detectors. This article can be considered a starting point for future research since it highlights the recent advancements and possible research directions.

FIGURE 1. Taxonomy of malicious URL detection on Arabic and English websites using machine learning (ML).
that contained 111 lexical and network-based features. Ultimately, SVM was the best performing model and it achieved an accuracy of 96.30%.
the letter "ا" [alif]). The J48 model achieved the best results with an accuracy of 91.73%.

FIGURE 3. Datasets used in the reviewed studies.

FIGURE 2. The most frequently used ML algorithms in the reviewed studies.

TABLE 1. Studies explored on English-content websites.
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3222307. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/