Deep Learning for Phishing Detection: Taxonomy, Current Challenges and Future Directions

Phishing has become an increasing concern and has captured the attention of end-users as well as security experts. Despite decades of development and improvement, existing phishing detection techniques still suffer from low detection accuracy and an inability to detect unknown attacks. Motivated by these problems, many researchers in the cybersecurity domain have shifted their attention to phishing detection based on machine learning techniques. Deep learning, a branch of machine learning, has emerged as a promising solution for phishing detection in recent years. This study therefore proposes a taxonomy of deep learning algorithms for phishing detection by examining 81 selected papers using a systematic literature review approach. The paper first introduces the concepts of phishing and deep learning in the context of cybersecurity. Then, taxonomies of phishing detection and deep learning algorithms are provided to classify the existing literature into various categories. Next, taking the proposed taxonomy as a baseline, this study comprehensively reviews the state-of-the-art deep learning techniques and analyzes their advantages and disadvantages. Subsequently, the paper discusses various issues that deep learning faces in phishing detection and proposes future research directions to overcome these challenges. Finally, an empirical analysis is conducted to evaluate the performance of various deep learning techniques in a practical context and to highlight related issues that can motivate future work. The results of the empirical experiment show that the issues common to most state-of-the-art deep learning algorithms are manual parameter tuning, long training time, and deficient detection accuracy.


I. INTRODUCTION
Phishing detection based on machine learning (ML) has received tremendous attention and interest from researchers in the cybersecurity community over the past decade. Extensive research has been conducted to review the application of ML in various solutions to detect evolving phishing attacks [1]-[3]. Deep learning (DL), a subset of ML, has recently emerged as a potential alternative to traditional ML approaches. However, few studies discuss in depth the application of DL in phishing detection, its advantages and disadvantages, the current issues, and future research directions to address these challenges [4]-[6].
Notably, no study provides a comprehensive review of the current challenges and future directions for DL algorithms with regard to phishing detection using a systematic literature review (SLR) approach. To the best of our knowledge, this is the first study to discuss phishing detection and DL in a single SLR paper. TABLE 1 provides a comparison between our research and the related surveys on the topic of interest. The related studies were reviewed and compared from three perspectives: (i) proposing a taxonomy of phishing detection, ML, or DL; (ii) providing a detailed discussion of the current challenges facing DL in phishing detection; and (iii) offering recommendations for future research. Among these studies, some authors provided taxonomies of the related topics but did not discuss open issues and future research areas [1], [4]-[6]. In contrast, other authors lacked an exhaustive review and classification of phishing detection, yet included current challenges and future directions in their studies [2], [3]. The authors in [7] conducted an in-depth benchmarking and evaluation of phishing detection but focused primarily on the importance of the features used for learning. Even though all three perspectives above were considered in [8]-[10], those authors emphasized conventional ML techniques and did not provide a detailed analysis of DL for phishing detection.
Our research differs from the existing studies in that it provides an in-depth analysis of DL algorithms for phishing detection through an SLR approach. Moreover, our study also covers the state-of-the-art DL techniques and, most importantly, discusses the current challenges and future research directions for DL in the phishing detection domain. This study is intended to guide researchers and developers for whom DL and phishing detection are of primary concern. The in-depth analysis in this research has led to several key contributions.
• We adopted an SLR approach to analyze the relevant studies and selected a total of 81 articles based on several criteria to support this research.
• We proposed a taxonomy of phishing detection and DL by dividing them into several categories. In addition, we also surveyed numerous DL algorithms and discussed their strengths and weaknesses.
• We identified the current challenges and key issues related to DL in the field of phishing detection, and provided recommendations for future research areas.
• We conducted an empirical analysis of various DL architectures for phishing detection, and highlighted several issues previously discussed in the literature to identify possible gaps for future research directions.

The rest of this paper is organized as follows. Section II provides background knowledge of phishing attacks, DL, and the adopted SLR approach that leads to the selection of the 81 reviewed papers. Section III presents a taxonomy of phishing detection and DL to classify them into several categories. Section IV discusses current issues and challenges facing DL in the fight against phishing attacks. Section V identifies potential research gaps and recommends future research directions. An empirical analysis is included in Section VI to map current issues to existing research gaps. Finally, Section VII concludes the paper and proposes future works.

II. BACKGROUND
This section consists of two main sub-sections to provide a comprehensive understanding of the research topic. The first section provides the definition of phishing and DL, while the second section describes the SLR approach used in this paper.

A. DEFINITION
This sub-section provides a brief introduction to phishing attacks and DL algorithms. Basic knowledge of phishing and its operation will assist in understanding why DL has emerged as a promising solution to detect phishing activities.

1) PHISHING
Phishing is a type of digital theft in which attackers disguise themselves as legitimate or genuine sources to steal users' private and confidential information. It has become a popular attack approach in cyberspace, exploiting web applications' vulnerabilities and end-users' ignorance, and is a security issue that needs to be addressed [11].
The evolution of phishing attacks is illustrated in FIGURE 1 [12]. The term ''phishing'' was first introduced back in 1996, and phishing attacks slowly spread through various communication media over the years, starting with spam messages, mobile malware, and spear phishing, and moving to ''Man in the Middle'', Vishing, ''Chat in the Middle'', ''Tabnabbing'', ''Xbox Live'', and others. Phishing attacks became a serious issue and caught more attention among researchers when a major incident in 2014 caused a huge financial loss. With the growth of the Internet and the popularity of social media, the number of phishing attacks has increased rapidly since 2016 and continued on an upward trend. According to the latest statistics from the APWG (Anti-Phishing Working Group), the number of phishing attacks has grown tremendously since March 2020 and doubled over the course of the year [13].
Since phishing has become a serious security issue, understanding how it operates is of utmost importance for the detection and prevention of this cybersecurity threat. The life cycle of a typical phishing attack is shown in FIGURE 2 and consists of five phases [14]. The first phase is the reconnaissance or planning phase, in which the phishers choose the communication media, select the phishing vector, and identify potential victims [12], [15]. The second phase is the weaponization or preparation phase, in which phishers prepare the phishing materials to be propagated to their targeted victims [14]. The next stage is the distribution or phishing phase, in which phishers deploy the baits by delivering the phishing materials to the victims [16]. The following stage is the exploitation or penetration phase, in which phishers exploit victims' weaknesses by luring them into giving up their private and confidential information [17]. The final stage is the exfiltration or data acquisition phase. The phishing operation has succeeded at this point, and the phishers have obtained the information they intended to take when initially planning the attack. Phishers can then take further actions to gain financial benefits, or use the collected information for other purposes [12].

2) DEEP LEARNING (DL)
Phishing appears to be an effective vehicle for cybercrime because most users are unable to identify phishing websites or emails [18]. One of the current challenges in dealing with cyberthreats, especially phishing attacks, is the lack of effective cybersecurity solutions, and Artificial Intelligence (AI) is believed to be the next frontier in cybersecurity defense [19].
ML is a branch of AI that gives machines the ability to learn in a human-like manner. DL is a subset of ML derived from the neural network model (FIGURE 3). Traditional ML techniques refer to learning methods that require human expertise to perform feature extraction and selection [20]. In a classical ML model, feature selection is separated from the classification task, and the two processes cannot be combined to optimize the model's performance. DL fills this gap by integrating the two processes in a single phase to detect and classify phishing attacks effectively and efficiently [21]. Although traditional ML approaches provide high accuracy and low false-positive rates, they still require manual feature engineering and depend on third-party services [22]. In contrast, DL models can learn and extract features automatically without human intervention, which eliminates the need for manual feature engineering and the dependency on third-party services. Moreover, traditional ML with manual feature engineering fails to cope with the multi-dimensional and large-scale datasets of the big data era [23]. DL, however, can handle a significant amount of data and has become a powerful tool for phishing detection that deserves more attention in the cybersecurity community.

Despite the increasing attention given to these two domains, no study has combined DL and phishing detection in an SLR. Therefore, this paper describes a detailed process of selecting relevant studies to examine the current trends and patterns in existing research on DL for phishing detection. The primary purpose of conducting this SLR is to analyze the pros and cons of the state-of-the-art DL techniques, identify the current issues, highlight the research gaps, and recommend future research directions.

B. SYSTEMATIC LITERATURE REVIEW
This study adopted the approach suggested by Kitchenham [24] to conduct an SLR on the research topic. FIGURE 4 illustrates the process of selecting the relevant studies, consisting of four phases: research questions, search procedure, paper selection, and data synthesis. Five databases were searched for relevant papers published between 2018 and 2021: Web of Science (WoS), IEEE Xplore, Springer Link, Science Direct, and Google Scholar.

3) PHASE 3: PAPER SELECTION
This SLR applied a paper selection process based on the PRISMA guidelines [25], which consists of several stages, such as automatic search, duplicate removal, title and abstract screening, full-text selection, and snowballing [26]. Quality assessment (QA) is the next step after the paper selection process and aims to evaluate the quality of the selected papers. The scoring technique from [24] was adopted, in which one of three scores can be given to the answer to each QA question: ''1'' for ''Yes'', ''0.5'' for ''Partly'', and ''0'' for ''No''. Eighty-one (81) papers were selected for this study based on the sum of the scores across all five QA questions.
Appendix B shows the detailed scores of QA questions to ensure that the selected papers are the most relevant to the RQs and this SLR study.
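The scoring scheme above can be sketched as a short computation. The example answers below are hypothetical, and the actual inclusion threshold used for the 81 papers is not reproduced here:

```python
# Sketch of the QA scoring scheme: each paper receives "Yes" (1),
# "Partly" (0.5), or "No" (0) per question, and the total across all
# five questions decides inclusion. Example answers are invented.
SCORE = {"Yes": 1.0, "Partly": 0.5, "No": 0.0}

def qa_total(answers):
    """Sum the scores for one paper's five QA answers."""
    return sum(SCORE[a] for a in answers)

paper = ["Yes", "Yes", "Partly", "Yes", "No"]
print(qa_total(paper))  # 3.5
```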

4) PHASE 4: DATA EXTRACTION AND SYNTHESIS
A qualitative analysis software package (NVivo) was used in this study to extract data from the 81 selected papers. The extracted data comprised the authors' names, publication year, paper title, objective, methodology, findings, and future works. Other related fields, such as the publisher's name, quartile, impact factor, and citation count, were also included as quality indicators for the selected papers. The extracted data went through a data synthesis process to answer the RQs and were illustrated using visualization techniques such as tables, figures, and charts to present the findings.

5) THREATS TO VALIDITY (TTV)
Four common threats to validity were taken into consideration while carrying out this research: construct validity, internal validity, external validity, and conclusion validity [27]. Minimizing the risks of these TTVs helped to reduce the probability of missing relevant studies as much as possible and to ensure that the paper selection process was unbiased.
To sum up, 81 papers were selected for this research based on the three perspectives mentioned in Section I and according to several selection criteria from a systematic literature review. By adopting the approach proposed by Kitchenham [24], following the selection process from the PRISMA guidelines [25], applying the scoring technique adopted by previous authors [21], [24], and considering several threats to validity [27], we believe that the reviewed articles are among the most relevant studies in the research area and, more importantly, were selected based on objective criteria and without bias.

III. TAXONOMY
The selected studies were analyzed and classified into different categories to answer RQ1 and RQ2. Phishing detection was classified according to various media and methods, while DL was divided into several categories based on application areas, techniques, and datasets.
A. PHISHING DETECTION

1) CLASSIFICATION BY MEDIA
Cyber criminals carry out phishing attacks through various media, and social engineering is one of them [28]. Social engineering is a technique of deceiving users into giving up valuable and sensitive information such as usernames, passwords, or credit card numbers [17]. Instead of targeting systems, social engineering attacks aim at users, who are the weakest link in the security chain [10]. Common social engineering media for phishing attacks include Website, Email, Short Message Service (SMS), Voice over Internet Protocol (VoIP), Mobile Devices, Blogs and Forums, and Online Social Network (OSN) [8], as shown in FIGURE 5.

a: PHISHING THROUGH WEBSITE
Website phishing is the most common phishing attack in cyberspace, in which attackers build websites that look identical to genuine ones [29]. The attackers' primary goal is to trick users into believing that these websites are trustworthy, since they are replicas of well-known sources such as Google, eBay, Amazon, PayPal, etc. Attackers can thereby obtain personal and financial details from users by taking advantage of their ignorance and carelessness [12]. Since the phishers' target is the users and not their devices, website phishing is challenging to counter regardless of how robust a phishing detection system is. Both technical and psychological solutions are required for the prevention and mitigation of such phishing attacks [17].

b: PHISHING THROUGH EMAIL
Cyber criminals usually perform email phishing by sending online users emails claiming to be from trusted companies. They design the phishing emails to disguise themselves as legitimate organizations and urge end-users to visit a fake website through a hyperlink included in the email [28]. Users are often asked to update their information through this link, and when they do so, phishers steal their confidential information for financial gain or other illegal purposes. Email phishing can be further divided into two groups: spear phishing and whaling [17].
Spear phishing targets specific individuals, groups, or organizations rather than random users, with the final intention of obtaining confidential and sensitive information [16]. It is a well-planned attack in which phishers initially collect information about their targeted victims and then send emails pretending to be from a colleague, supervisor, or manager in the same organization [30]. Spear phishing has a higher success rate than other conventional methods because attackers disguise themselves as someone the victim knows and include content relevant to the victim in the email to avoid suspicion [15].
Whaling is similar to spear phishing except that its targets are high-profile executives such as corporate CEOs, government officials, or political leaders [16]. Phishers choose their victims based on their privileged access to information or the authority they hold within the organization [15]. Phishers invest relatively more time and effort in this type of attack to enhance the success rate, since the potential profit is significant.

c: PHISHING THROUGH SMS
SMS phishing, also known as smishing, is a popular attack carried out on mobile phones. Smishing attackers usually send mobile phone users text messages with a link embedded in them [12]. When users click this link, they are either redirected to a fake website or end up downloading and installing malicious software (malware) on their phones. With the advancement of mobile technology, individuals can now exchange short text messages at their fingertips [15]. Such convenience allows attackers to approach their victims easily in an attempt to steal their private information. Even though SMS has become less popular with the emergence of the Internet and other applications, smishing still poses a major threat in cybersecurity, since text messages are commonly used for online account verification [3].

d: PHISHING THROUGH VoIP
Besides SMS, voice is another medium for phishing attacks in the cyber environment. VoIP phishing, or vishing, is a type of phishing attack conducted over telephone or VoIP systems using voice technology [28]. Phishers often collect details about the victims prior to the conversation, such as name, address, phone number, and other personal information, to gain more trust from the victims and make the attack less suspicious. Vishing also has a high success rate because some people believe that communicating with another human is more reliable than communicating with a machine [15]. In addition, call receivers tend to make more mistakes during a phone call, since they do not have enough time to think before responding or answer without proper consideration, and thus accidentally reveal private and sensitive information to the phishers.

e: PHISHING THROUGH MOBILE DEVICES
Phishing through mobile phones has become more common recently, as more and more people rely on their phones to carry out their daily activities, from checking emails to paying bills, and from browsing the Internet to online shopping [3]. This makes mobile phone users easy targets for phishers planning phishing attacks. Users may fall victim to such attacks while browsing or downloading an application from untrusted websites [12]. Once the malicious software is installed, it collects the user's credentials and sends them to the phishers for financial gain. Users usually find it difficult to distinguish between phishing and legitimate websites due to the small screens of mobile phones, which limit the amount of information displayed on the user interface, and the lack of security indicators in applications [15].

f: PHISHING THROUGH OSN
Social networking has become an indispensable part of the Internet and of millions of people's lives around the world. Online social networks (OSN) such as Facebook, Twitter, and Instagram have become a new ground for phishers to perform their phishing activities [28]. Social network sites allow online users to interact, exchange, and share information with each other, making it easier for phishers to conduct their illegal acts. Phishers impersonate someone the users know on these online social platforms and exploit their trust to gain financial benefits by taking advantage of these sites' popularity [12].

2) CLASSIFICATION BY METHOD

a: LIST-BASED METHOD
List-based phishing detection differentiates between phishing and legitimate webpages based on a collected list of trusted and suspicious websites. The list-based approach can be divided into two groups: blacklist and whitelist [10]. A blacklist is a list of malicious or suspicious websites that users should not access. When users try to access any URL in the blacklist, they are warned of a potential phishing attack and prevented from accessing the website [31]. On the contrary, a whitelist is a collection of all legitimate and trusted websites, and any webpage not included in the whitelist is considered suspicious. Once users attempt to access webpages that are not listed as secure sites, they are alerted to the possible risk [12]. The blacklist-based approach is comparatively effective in phishing detection because it offers a low false-positive rate, simplicity in design, and ease of implementation [32]. However, its main drawback is the inability to classify new malicious websites and to recognize non-blacklisted or temporary phishing pages [31]. As a result, it cannot detect unknown or zero-day attacks. In addition, blacklists need to be updated frequently and require human intervention and verification; hence, they consume a great amount of resources and are prone to human error [33]. Due to these limitations, it is advisable to combine the list-based method with other approaches that can handle zero-day attacks while keeping the false-positive rate low.
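As a minimal illustration of the list-based logic described above (the hostnames and lists are invented for the sketch):

```python
# Minimal sketch of list-based filtering. The hosts below are fictitious.
BLACKLIST = {"evil-login.example.com"}
WHITELIST = {"www.paypal.com", "accounts.google.com"}

def check_url(host):
    """Return a verdict for a host using the blacklist, then the whitelist."""
    if host in BLACKLIST:
        return "block"   # known phishing site
    if host in WHITELIST:
        return "allow"   # known legitimate site
    return "warn"        # unlisted: undecidable, the zero-day gap

print(check_url("evil-login.example.com"))   # block
print(check_url("brand-new-phish.example"))  # warn
```

The `warn` branch makes the zero-day limitation concrete: any site absent from both lists cannot be classified by this method alone.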

b: HEURISTIC-BASED METHOD
Developed from the list-based method, the heuristic-based phishing detection approach depends on numerous features extracted from a webpage's structure to identify fake and untrusted sites. These features are fed into a classifier to build an effective phishing detection model [31]. Phishing site characteristics in a heuristic-based approach are derived from several hand-crafted features, such as URL-based features, webpage contents, etc. Phishing webpages are detected by evaluating, examining, and analyzing these manually selected components [22]. Unlike a blacklist, the heuristic-based approach can detect potential phishing attacks as soon as the webpages are loaded, even before their URLs are added to the blacklist. Since the heuristic method has better generalization capability, it can be used to detect new phishing attacks. Yet, such a method is limited to a number of common threats and is unable to recognize newly evolving attacks [9]. Besides, the heuristic-based method tends to have a higher false-positive rate than a blacklist [8]. Consequently, it can be combined with other approaches to mitigate the high false-positive rate.
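The hand-crafted, URL-based features mentioned above can be sketched as follows. The exact feature set varies across studies, so this is only an illustrative subset:

```python
from urllib.parse import urlparse
import re

def url_features(url):
    """Extract a few common hand-crafted URL features (illustrative subset)."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "length": len(url),                             # long URLs are suspicious
        "num_dots": host.count("."),                    # many subdomain levels
        "has_at": "@" in url,                           # '@' can hide the real host
        "has_ip": bool(re.fullmatch(r"[\d.]+", host)),  # raw IP instead of a domain
        "uses_https": parsed.scheme == "https",
    }

feats = url_features("http://192.168.0.1/paypal.login")
print(feats["has_ip"], feats["uses_https"])  # True False
```

In a heuristic-based system, vectors like this would then be fed to a classifier or scored against manually tuned thresholds.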

c: VISUAL SIMILARITY
Phishing webpages are detected in the visual similarity approach by checking and comparing the visual representation of the websites, rather than analyzing the source code behind them [17]. Malicious webpages can be identified by finding their resemblance to legitimate sites in page layout, page style, etc. Another method is to take a snapshot of the targeted website and compare it with the ones in the database using image processing technologies [34]. Phishing detection based on the visual features of a webpage's appearance relies on the assumption that phishing sites are similar to the legitimate ones [5], which might not always be the case. It also requires a higher computational cost, since storing snapshots of websites needs more space than storing their URLs. Similar to the heuristic-based method, phishing detection based on visual similarity has higher false-positive rates than the list-based approach [35].

d: MACHINE LEARNING (ML)
In the ML-based approach, features are extracted and classified using ML techniques. The accuracy of the classification depends on the selected algorithm [36], which is used to produce an accurate classifier model to differentiate between phishing and legitimate websites [31]. Examples of frequently used ML techniques include Naïve Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), k-Nearest Neighbor (kNN), J48, C4.5, etc. [3], [7], [10], [29]. Like the heuristic approach, the ML approach can detect zero-hour phishing attacks, which is an advantage over the blacklist method [1]. Moreover, it has additional advantages over the heuristic approach. For instance, ML techniques can construct their own classification models when a significant set of data is available, without the need to manually analyze the data to understand the complicated relationships within it. Unlike the heuristic method, ML can achieve a low false-positive rate [8].
ML classifiers can also evolve to adapt to the changes in phishing trends as the phishing tactics evolve.
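As an illustrative sketch of feature-based ML classification, the following implements a minimal pure-Python 1-nearest-neighbour rule (kNN being one of the techniques listed above) on invented toy feature vectors; real systems train library implementations on far larger datasets:

```python
# Minimal 1-nearest-neighbour classifier on toy URL feature vectors.
# Features and labels are invented for illustration only.
def knn_predict(train_X, train_y, x):
    """1-NN: return the label of the closest training vector."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[best]

# Toy feature vectors: [url_length, num_dots, has_at]
X = [[20, 1, 0], [25, 2, 0], [80, 5, 1], [60, 4, 1]]
y = [0, 0, 1, 1]  # 0 = legitimate, 1 = phishing

print(knn_predict(X, y, [75, 4, 1]))  # 1
```

The classifier "learns" the decision boundary purely from labeled examples, which is the property that lets ML methods adapt as phishing trends shift.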

e: DEEP LEARNING (DL)
DL architecture is built on neural networks with the ability to discover hidden information in complex data through level-by-level learning [37]. The DL approach has become increasingly popular in the phishing detection domain with the recent development of DL technologies [2]. Although DL requires a larger dataset and longer training time than traditional ML methods, it can extract features automatically from raw data without any prior knowledge [23]. Various DL-based techniques have recently been employed to enhance classification performance for phishing detection [22]. Popular algorithms based on DL architecture include Convolutional Neural Network (CNN) [22], [38]-[41], Deep Neural Network (DNN) [42]-[45], Recurrent Neural Network (RNN) [46], [47], Long Short-Term Memory (LSTM) [44], [48]-[50], Gated Recurrent Unit (GRU) [48], [51], [52], and Multi-Layer Perceptron (MLP) [53]-[55]. It is believed that DL algorithms will become a promising solution for phishing detection in the near future due to the wide range of benefits they offer [3].
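A key difference from the heuristic approach is that DL models typically consume raw input, such as URL characters, and learn the features themselves. The sketch below shows a hypothetical character-level integer encoding of a URL, of the kind usually fed to a CNN or LSTM embedding layer; the vocabulary and fixed length are assumptions for illustration:

```python
# Hypothetical character-level encoding of a URL for a CNN/LSTM input layer.
# The vocabulary and MAX_LEN are illustrative choices, not from any one paper.
VOCAB = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789./:-")}
MAX_LEN = 16  # fixed input length; real models often use 100-200

def encode_url(url):
    """Map characters to integer ids, pad/truncate to MAX_LEN (0 = padding)."""
    ids = [VOCAB.get(c, 0) for c in url.lower()[:MAX_LEN]]
    return ids + [0] * (MAX_LEN - len(ids))

print(encode_url("http://a.b"))
```

From such raw sequences, the network's early layers learn their own character and substring patterns, replacing the hand-crafted feature step entirely.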

f: HYBRID METHOD
The hybrid approach combines different classification techniques to achieve better performance in detecting malicious websites [22]. For instance, in a hybrid model combining two different algorithms, the dataset is trained using the first algorithm, and the result is then passed to the second algorithm for training [36]. The overall accuracy of the hybrid model is expected to be higher than that of each individual algorithm. Whenever new solutions are proposed to counter phishing attacks, cyber criminals take advantage of the solutions' vulnerabilities and come up with new methods and new attacks [56]. Therefore, hybrid models are recommended, since any single approach has its own drawbacks that need to be addressed. Hybrid models combine different classification techniques to merge their advantages and resolve their individual disadvantages. As a result, phishing detection using a hybrid algorithm offers higher accuracy and provides a more decisive classification of phishing [3].
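The two-stage idea described above, where the first model's output is passed on to the second, can be sketched with deliberately simple placeholder models. Both rules are invented for illustration; real hybrids chain trained classifiers:

```python
# Hypothetical two-stage hybrid: stage one's score is appended as an
# extra feature before stage two classifies. Both stages are placeholders.
def stage_one(x):
    """Placeholder first model: crude length-based phishing score (x[0] = URL length)."""
    return 1.0 if x[0] > 50 else 0.0

def stage_two(x):
    """Placeholder second model: combines a raw feature with stage one's score."""
    length, has_at, score_one = x
    return 1 if (score_one + has_at) >= 1 else 0

def hybrid_predict(features):
    enriched = features + [stage_one(features)]
    return stage_two(enriched)

print(hybrid_predict([80, 0]))  # 1 (long URL flagged by stage one)
print(hybrid_predict([20, 0]))  # 0
```

The design point is that stage two can overrule or confirm stage one, so the combined decision is stronger than either rule alone.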

B. DEEP LEARNING
Since DL is becoming increasingly popular as an effective phishing detection method, it is a topic of interest in this study. The following section classifies DL into several classes, including application areas, techniques, and datasets.
Intrusion detection is a technique to discover network security violations from both outsiders and insiders by monitoring and analyzing the traffic generated by various components in the network [62]. The primary purpose of an intrusion detection system (IDS) is to manage hosts and networks, monitor the behavior of computer systems, give warnings if suspicious behaviors are found, and take specific actions in response to illegal and unauthorized activities [63]. IDSs can be divided into three types: anomaly detection, misuse detection, and hybrid [59]. In anomaly detection, normal behavior is defined and used as a baseline, and abnormal behaviors are identified by comparing them to the normal ones. In misuse detection, also known as signature-based detection, suspicious behaviors are represented as signatures: a signature database is established, and network attacks are identified if they match these signatures. The hybrid type combines and leverages the advantages of both anomaly and misuse detection. Much research has been conducted to develop DL-based models for intrusion detection systems [23], [64], [65], since DL-based methods can detect unknown malicious attacks, reduce false alarm rates, and enhance detection accuracy.
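The contrast between misuse (signature-based) and anomaly detection can be sketched minimally as follows; the signatures, baseline statistics, and threshold are invented for illustration:

```python
# Minimal sketch of the two IDS styles described above.
# Signatures, baseline values, and the threshold k are illustrative.
SIGNATURES = {"port_scan", "sql_injection"}  # misuse/signature database

def misuse_detect(event_type):
    """Flag an event only if it matches a known attack signature."""
    return event_type in SIGNATURES

def anomaly_detect(value, baseline_mean, baseline_std, k=3):
    """Flag values more than k standard deviations from the normal baseline."""
    return abs(value - baseline_mean) > k * baseline_std

print(misuse_detect("port_scan"))                               # True
print(anomaly_detect(950, baseline_mean=100, baseline_std=20))  # True
```

Misuse detection misses anything without a signature, while anomaly detection can flag novel behavior, which is why hybrid IDS designs combine the two.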
Malware detection is a method to detect malicious software that aims to interrupt a system's normal operation, bypass authentication, collect personal information, or take control of the device without the user's knowledge. Examples of common malware include worms, viruses, Trojans, botnets, rootkits, adware, spyware, ransomware, etc. [66]. Malware has become a major concern among cybersecurity experts in recent years; thus, an effective and robust detection approach is crucial to handle rapidly evolving malware threats [61]. Malware detection methods can be categorized into two groups: PC-based and Android-based. Android malware detection appears to be more popular due to the increasing adoption of mobile devices running the Android operating system [59]. Since DL approaches have achieved successful results in different fields, they can also be applied to malware identification and classification. The utilization of DL for malware detection offers an effective solution to distinguish various malware families and their variants. In addition, DL improves model accuracy and reduces complexity in dimension, time, and computational resources [67].
Spam detection is an approach to identify unsolicited and unwanted messages sent electronically to a large number of recipients by senders they do not know [68]. Spam can be classified according to multiple communication media, namely email spam, SMS spam, and social spam. Email spam fills up users' mailboxes with undesired messages and unimportant emails, while SMS spam is usually distributed among mobile devices. Social spam has become increasingly common with the growth of the Internet and online social networks, affecting social media users [69]. Problems caused by spam messages can, however, be mitigated by spam classification and filtering, and DL techniques can improve the effectiveness of spam filtering in spam detection systems [59], [70].
Phishing detection is another cybersecurity domain in which DL has proved to be an effective solution [59], [61], [70], [71]. Similar to spam, phishing can be spread through several communication channels, such as email, SMS, websites, online social networks, etc. [8]. However, phishing has malicious intentions and is typically more dangerous than spam. Spam emails, for instance, are delivered to users regardless of their consent and are often used for advertising purposes; they consume users' time, device memory, and network bandwidth. Phishing emails, on the other hand, pose a higher risk, since they involve stealing sensitive information, which can lead to huge financial losses [72]. DL efforts toward phishing detection have become a primary focus of this study due to the severe damage that phishing can potentially cause and the benefits that DL offers to mitigate this damage.
Discriminative DL models are used for supervised learning to distinguish patterns for classification, prediction, or recognition tasks [23]. They work with labeled data to predict outputs from observed inputs [75]. Popular discriminative models include the Convolutional Neural Network (CNN) and Multilayer Perceptron (MLP) [74]. Generative DL models, in contrast, are used for unsupervised learning on unlabeled datasets [23]. Generative architectures leverage data synthesis and pattern analysis to model the input data and generate random samples similar to the existing ones. They can describe the correlation among the input data's properties to achieve better feature representation [59]. Examples of generative models include the Autoencoder (AE), Restricted Boltzmann Machine (RBM), and Deep Belief Network (DBN) [73], [74].
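As a minimal illustration of the two families, the sketch below uses shallow scikit-learn stand-ins (an assumption made for brevity; the surveyed papers use deep architectures): an MLPClassifier acts as a discriminative model learning from labels, while an MLPRegressor trained to reconstruct its own input plays the role of a small autoencoder whose bottleneck learns a feature representation without labels. Data and layer sizes are synthetic.

```python
# Hedged sketch: a discriminative classifier versus a generative-style
# reconstruction model. Dataset and layer sizes are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))              # 200 samples, 8 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic labels

# Discriminative: learn to predict y from x using labeled data.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X, y)

# Generative-style: learn to reconstruct x itself (no labels needed);
# the 3-unit bottleneck acts as a learned feature representation.
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
ae.fit(X, X)

print(clf.score(X, y))        # training accuracy of the classifier
print(ae.predict(X).shape)    # reconstructions have the input's shape
```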
The hybrid approach combines discriminative and generative models in a single architecture and therefore benefits from both [76]. Generative models serve as subcomponents in a hybrid DL architecture for two purposes: parameter learning through feature representations, or improved optimization that yields a better discriminative model. Ensemble deep learning (EDL) models, in turn, are constructed by organizing multiple individual DL algorithms in parallel or in sequence. There are two types of EDL architectures, homogeneous and heterogeneous [74]. A homogeneous EDL model combines DL techniques of the same genre (CNN-CNN, LSTM-LSTM, GRU-GRU, etc.), whereas a heterogeneous EDL model integrates techniques from different genres (CNN-LSTM, CNN-RNN-MLP, etc.). The rationale behind EDL is that each individual DL algorithm has its own pros and cons; EDL architectures combine their advantages, compensate for their disadvantages, provide better results, and prove more effective in phishing detection [70].
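The wiring of a heterogeneous ensemble can be sketched with scikit-learn's StackingClassifier. The two base learners below are shallow stand-ins (an assumption) for the CNN/LSTM members used in the surveyed EDL models, and a logistic-regression meta-learner combines their predictions; the data is synthetic.

```python
# Hedged sketch of a heterogeneous ensemble: different model families are
# trained on the same task and a meta-learner combines their outputs.
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - X[:, 2] > 0).astype(int)    # synthetic labels

ensemble = StackingClassifier(
    estimators=[
        ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=1)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=1)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner combines the two
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```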
Reinforcement learning is an adaptive learning approach for acquiring optimal behavior. Its basic concept involves an agent that performs actions by trial and error and interacts with an unknown environment, which returns feedback in the form of numerical rewards [77]. Current research shows growing interest in deep reinforcement learning (DRL) [77], and DRL is anticipated to become a promising direction in the near future, as it has not yet been fully explored and experimented with for designing phishing detection models [59]. Examples of DRL include Multi-task Reinforcement (MTR), Multi-agent Reinforcement (MAR), Asynchronous Reinforcement (AR), and Q-learning Reinforcement (QR) [71].
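The Q-learning idea described above, an agent acting by trial and error and updating value estimates from numerical rewards, can be sketched on a toy environment (a hypothetical 4-state chain, not a phishing task):

```python
# Minimal Q-learning sketch (stdlib only) on a toy 4-state chain: the agent
# moves left or right and is rewarded for reaching the terminal state.
import random

random.seed(0)
n_states, actions = 4, [0, 1]          # action 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection (trial and error)
        a = random.choice(actions) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # numerical reward feedback
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# After training, moving right should dominate in every non-terminal state.
print([q.index(max(q)) for q in Q[:-1]])
```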
Most of the existing literature classifies DL techniques into three main classes: discriminative, generative, and hybrid [23], [59], [75], [76], without covering ensemble DL and deep reinforcement learning. The taxonomy proposed in this study introduces these two additional categories since they play essential roles in solving various security issues, including phishing attack detection [70], yet their potential has not been fully exploited and needs further examination [23]. On the one hand, ensemble DL methods merge the advantages of individual DL algorithms, mitigate their disadvantages, and improve the overall performance of the phishing detection model. Ensemble DL differs from the hybrid approach: hybrid methods combine supervised and unsupervised learning, while ensemble models are formed by stacking different DL algorithms. For instance, DNN is a hybrid DL technique, but DNN-SAE is an ensemble DL model. On the other hand, deep reinforcement learning has been implemented in a wide range of applications, such as pattern recognition, autonomous navigation, air traffic control, and defense technologies [59]. As a result, it has opened a promising research direction in the cybersecurity domain [70], including the detection of phishing attacks.
Moreover, various frequently-used DL techniques for phishing detection were identified based on the analysis of 81 selected articles using the SLR approach, as shown in FIGURE 9. LSTM and BiLSTM are the most popular DL techniques with a share of 34%, followed closely by CNN (30%). DNN and MLP each contributed 8%, while only 1 out of 10 articles implemented GAN or DRL. LSTM and CNN have been widely used in previous research partly because of their numerous benefits. LSTM models solve the vanishing and exploding gradient issues of the traditional recurrent neural network and are suitable for handling time-series sequence data [21]. Meanwhile, CNN models are best suited for highly efficient and fast feature extraction from raw and complex data; CNN architectures provide more robust results because they reduce network complexity and speed up the learning process [61]. These benefits make LSTM and CNN well suited for phishing webpage detection, as phishing websites contain multi-dimensional data such as text, images, or both. In general, each DL algorithm has strengths that can be leveraged and weaknesses that need to be improved. Therefore, it is essential to analyze the pros and cons of each DL mechanism to build an effective phishing detection model. Appendix C lists the advantages and disadvantages of several DL algorithms used in previous studies.
Appendix D to Appendix Q provide details of the DL techniques used in the literature, classified according to their application, platform, and dataset. It is observed that DL has been used to detect website phishing or email phishing, and was utilized for either feature extraction or classification. Platforms used to design these DL models include Matlab, JavaScript, C++, Weka, Python, and RStudio. Last but not least, the datasets used to implement these DL algorithms were also analyzed to examine their performance in detecting phishing websites and emails, which is discussed in the next section. Although phishing spreads through several channels (email, SMS, website, online social network, etc. [8]), website and email are the most common phishing attack vectors in cyberspace. Among the reviewed articles selected for this study, most belong to the former group (47 articles), while a minority fit into the latter category (12 articles). In addition, different datasets are used for website and email phishing.

a: EMAIL PHISHING DATASET
Since emails typically hold private and confidential information, datasets for email phishing are limited, even among publicly available ones [70]. Email phishing datasets contain two types of email, namely ham and spam (or phishing) [78]- [80]. FIGURE 10 displays the distribution of email phishing datasets among the 81 selected papers. SpamAssassin and Enron are the most widely-used datasets for email phishing, each with a 19% share. SpamAssassin contains both ham and spam emails obtained from the SpamAssassin project [81], while Enron consists of more than 500 thousand emails generated by 158 employees of the Enron Corporation [80]. Other popular datasets come from the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP-2018) and the Nazario phishing corpus, each occupying 11% of the total email phishing datasets. The email corpus provided by the organizers of the IWSPA-AP-2018 competition covers two sub-tasks for building and training a classifier to distinguish phishing emails from legitimate (ham) ones: the first sub-task contains emails with only the body, while the second comprises emails with both body and header [56], [68].
The Nazario phishing corpus was created by Jose Nazario and contains only phishing emails [80]. Other datasets used for email phishing detection include CSDMC2010 SPAM, APWG, and UCI. A list of the most common datasets for phishing email detection is provided in TABLE 5.

b: WEBSITE PHISHING DATASET
Based on the analysis of the 81 selected papers, the most frequently-used datasets for website phishing detection include Phish Tank, Alexa, DMOZ, UCI, and Common Crawl. Phish Tank is the most popular repository providing phishing URLs for training a classifier to differentiate between malicious and genuine websites (FIGURE 11). A majority (34%) of the articles used Phish Tank as their source of phishing URLs, followed by Alexa and DMOZ (9% and 8%, respectively), two databases that provide legitimate URLs for training and testing purposes [75], [82]. UCI is another common repository containing both malicious and legitimate URLs for machine learning and phishing detection [42]. Meanwhile, Common Crawl is a corpus of web crawl data comprising only legitimate sites [48]. A list of the most popular datasets for website phishing detection is provided in TABLE 6.

IV. CURRENT CHALLENGES
This section analyzes the current issues found in the literature and proposes possible solutions to the identified challenges, answering RQ3.

A. FEATURE ENGINEERING
Traditional ML algorithms, as discussed in the previous section, require manual feature engineering to extract features for phishing detection [20]. The feature extraction and selection process is based on experimentation and professional knowledge, which is tedious, labor-intensive, and susceptible to human error [22]. Some researchers select features according to their own experience, while others examine different statistical techniques to determine the best reduced set of optimal features [21]. Handcrafted feature selection is done manually and requires considerable labor and domain expertise, limiting the performance of phishing detection.
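To make the handcrafted feature engineering described above concrete, the sketch below extracts a few URL features by hand using only the standard library; the feature list is an illustrative assumption chosen for demonstration, not a recommended set.

```python
# Illustrative sketch of manual URL feature engineering: each feature is
# hand-picked from domain knowledge, which is exactly the labor DL avoids.
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    return {
        "length": len(url),                    # long URLs are often suspicious
        "num_dots": url.count("."),            # many dots suggest nested subdomains
        "has_at": "@" in url,                  # '@' can obscure the real host
        "has_ip": parsed.hostname is not None and
                  parsed.hostname.replace(".", "").isdigit(),  # raw IP host
        "uses_https": parsed.scheme == "https",
    }

print(url_features("http://192.168.0.1/secure@login.example.com"))
```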

B. ZERO-DAY ATTACKS
Classical ML techniques still lack efficiency in detecting zero-day phishing attacks [10]. To handle these attacks effectively, the detection model must explore new behaviors and dynamically adapt to reflect changes in newly evolving phishing patterns. The majority of existing classification techniques are unable to explore these new behaviors and incapable of adapting to changes in the environment [92]; as a result, they fail to detect unknown or newly evolved phishing attacks. DL algorithms, however, can detect zero-day attacks more efficiently [73].

C. DL ALGORITHM
There are many different DL algorithms, each with characteristics suited to specific applications. For example, CNN architectures perform better on two-dimensional data with grid topologies, such as images and videos, due to the high correlation between neighboring pixels [21]. Meanwhile, RNN is more suitable for sequential data, natural language, and text processing [58]. In addition, most attention has been paid to supervised DL, yet the main disadvantage of discriminative learning is that it requires a massive amount of labeled data, which is costly and time-consuming to collect [9], [59]. Therefore, it is challenging to choose the algorithm best suited to a target application in the cybersecurity context. Selecting an inappropriate algorithm might produce unpredictable outputs, wasting effort and degrading the model's effectiveness and accuracy [70].

D. COMPUTATIONAL CONSTRAINTS
Each stage of the phishing detection model, such as data pre-processing, feature selection, and classification, adds computational complexity to the overall model, and this complexity grows as the neurons and layers of the deep neural network increase [58]. The use of Graphics Processing Units (GPUs) to perform maximum operations in a minimum amount of time makes DL models more expensive to build [105]. The problem is magnified when new data arrives and model retraining is required [9]. Thus, computational complexity is one of the major issues in DL, and building an effective phishing detection model with fewer computational resources remains a challenge for future researchers [106].
Recent research from MIT suggested that DL models' computational requirements have been growing significantly, exceeding what specialized hardware can handle. Further enhancement will soon be needed, since hardware development is slower than the growth in DL computing demand, which limits DL models' performance. Furthermore, complex DL models using GPUs and TPUs have notable effects on the environment and energy consumption: the carbon dioxide emitted from training such models is approximately five times an average car's lifetime emission. This suggests that future researchers should start looking for techniques that are more computationally efficient than DL [21].

E. DATASET
Dataset issues can be divided into four categories: availability, diversity, recency, and quality [10]. First, resources for phishing email datasets are limited, since some organizations hesitate to share their information due to privacy concerns [70]. Some publicly available phishing website datasets contain dead, duplicate, or incomplete links that users cannot access. Furthermore, some individuals or organizations have encountered or researched phishing attacks but did not submit their findings to crowd-sourcing sites; hence, new phishing emails or websites are not made publicly accessible [10]. As a result, researchers and developers have difficulty finding datasets to work with. This is a major obstacle because DL requires a significant amount of data to train deep neural networks [21].
Second, the diversity of datasets is an essential factor that can hinder the performance of DL models. If the features in a dataset are not extensive and representative enough, the DL model will not generalize well [7]. DL models trained on datasets that only contain patterns of known attacks generally will not perform well when facing new phishing patterns. They will fail to detect or classify such patterns correctly, especially when attackers contaminate the datasets with adversarial samples (adversarial attacks [9]) to deceive the model into learning phishing attacks as legitimate (active attackers [7], [10]). This affects the robustness of the underlying DL model [60].
Third, different datasets are publicly available to train DL models, but not all of them are up-to-date (a lack of recency). Moreover, limited resources also lead to model training and validation on old, obsolete data. DL models trained on such datasets might fail to detect modern phishing patterns and produce low detection accuracy [7].
Finally, the efficiency and effectiveness of DL models depend on the nature and characteristics of the input data (data quality). If the input data contain ambiguous, missing, or meaningless values and outliers, DL models might produce incorrect results. Non-representative, poor-quality, irrelevant features and imbalanced datasets can lead to low detection accuracy [7], [10]. Therefore, relevant, high-quality input data are crucial for better outcomes. In other words, one potential solution is to improve existing pre-processing techniques or propose new data preparation methods to enhance the effectiveness of DL models in the phishing detection domain [70]. Significant improvements in model performance can sometimes come from higher-quality data rather than more sophisticated algorithms. Even though the cybersecurity community has recognized DL as a promising approach for detecting phishing attacks, high-quality datasets in this field are still lacking [107].
To sum up, DL generally requires large datasets to achieve high detection accuracy. Data resources containing few instances, non-diverse data, or outdated or highly imbalanced samples might cause overfitting [59]. Similarly, datasets comprising old phishing attacks, not representing real attack scenarios and behaviors, or lacking real-time properties might not provide reliable performance results [23]. Models built on such datasets will lack efficiency, effectiveness, and accuracy in phishing detection.

F. PARAMETER OPTIMIZATION
The parameters of DL models include, but are not limited to, the number of hidden layers in the neural network, the number of neurons in each layer, the number of epochs, the type of activation function, the type of optimizer, the learning rate, and the dropout rate [59]. There is no standard guideline for an optimal set of parameters that produces the best performance accuracy. Researchers usually need to conduct a series of experiments to fine-tune these parameters [34], [39], [41], [42], [44], [75], [97], [108], a process that is time-consuming and requires much effort.

G. EVALUATION METRICS
A set of performance metrics must be measured after training the DL model to evaluate the effectiveness and efficiency of the underlying algorithm. The most common metrics are False Positive Rate (FPR), False Negative Rate (FNR), accuracy, precision, recall, etc. [1], [59], [73]. Sufficient evaluation metrics are crucial in assessing the performance of a phishing detection system: a single metric is not representative of a DL algorithm's performance, yet not all studies compute the full set of measures [59]. In addition, choosing appropriate evaluation metrics plays a vital role [7]. In the case of imbalanced datasets especially, accuracy and error rate are not entirely suitable for performance evaluation; other metrics, such as the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC), are more desirable [10].
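The metrics named above can all be derived from a confusion matrix. The sketch below computes them on a small, fabricated imbalanced sample (the predictions and scores are illustrative assumptions, not results from any surveyed model); the imbalance-oriented G-Mean and MCC are included as well.

```python
# Sketch computing common detection metrics from a toy confusion matrix.
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             roc_auc_score, matthews_corrcoef)

y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])   # imbalanced: 6 ham, 4 phishing
y_pred  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])   # fabricated predictions
y_score = np.array([.1, .2, .1, .3, .2, .6, .9, .8, .7, .4])  # model probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                                   # False Positive Rate
fnr = fn / (fn + tp)                                   # False Negative Rate
g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))  # Geometric Mean

print(f"FPR={fpr:.2f} FNR={fnr:.2f} G-Mean={g_mean:.2f}")
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))    # threshold-independent
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
```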

H. INFERENCE JUSTIFICATION
One of the main advantages of DL over ML is its ability to explore hidden correlations between features and to learn and make intelligent decisions on its own through complex multi-layer neural networks [21]. However, a major drawback of DL models is their inability to justify the inferences they make [42]. Since the underlying knowledge inside DL models is represented by numerical weights, it is not possible to explain the logic behind the assumptions, decisions, and conclusions a neural network reaches [10]. What DL models learn from data is not interpreted, and their internal operation is almost unknown, like a black box. Consequently, it is difficult to understand the correlation between the input features and the output results [59]. Inference justification becomes even more challenging when solving errors: when a DL model errs, it is extremely hard to diagnose and identify the main cause, since the output results are almost uninterpretable [58], [63]. Therefore, it is suggested that the causes of attacks be analyzed thoroughly to design an effective DL model for cybersecurity applications [71].
Neural networks are considered black boxes since their internal operations are unknown to humans [9]. A DL algorithm consists of multiple processing layers that learn data representations through multi-level abstraction; these layers of abstraction are not designed by human experts but are learned from the input data through a generic learning procedure [109]. Since it is not possible to give a reasonable justification for the relationship between inputs and outputs in neural networks, more attention should be paid to the mechanisms inside DL models, even though DL algorithms perform well in practice and have attracted much recent research interest [60].

I. BATCH LEARNING
Batch learning refers to learning algorithms in which the entire training dataset is obtained prior to model training. Batch learning is used in both traditional ML and DL techniques since it offers ease of use and implementation. Nevertheless, it has several drawbacks, such as expensive retraining, high memory and computational constraints, inability to detect newly evolving threats, and poor adaptation to concept drift [9]. Online learning, however, can solve the problems caused by batch learning and suggests a promising direction for future research in the phishing detection domain.
J. TIME COMPLEXITY
DL requires a significant amount of data and a substantial amount of time to train the model [103]. Datasets used for training neural networks often contain millions of samples [45], [46], [84], [94], [111]; as a result, models need a long time to train in order to obtain high performance accuracy.
Another factor that might delay the model's training time is limited processing and storage facilities [73].
Time complexity is an issue in threat detection, and phishing detection is no exception [73]. Existing detection techniques have been developed mainly for batch processing rather than real-time detection; as a result, traditional ML approaches lack efficiency in classifying phishing attacks in real-time scenarios. DL, on the other hand, can mitigate time complexity by using GPUs in its design and implementation [112]. In addition, big data technologies such as Apache Spark or Hadoop can help reduce time complexity, since they offer real-time processing capabilities [113].
Phishing webpages are short-lived; thus, there is a need for real-time detection of phishing websites [10]. Phishing attacks are normally deployed over a short duration, usually a few days or weeks, making them difficult for security experts to detect. Because the time-scale of phishing attacks is short, the detection mechanism needs to be fast to capture zero-day attacks. Therefore, real-time detection is a crucial part of a practical phishing detection system [7], [9].

K. BIG DATA CHALLENGES
The big data era imposes new challenges for phishing detection [9], especially since classical machine learning techniques cannot handle very large amounts of data. DL, in contrast, can overcome this issue: it can deal with big data and performs better as the dataset size grows. Combined with big data technologies, DL can manage and analyze a large amount of information in a short time. However, training DL models on such a tremendous amount of data with a single processor is not easy. Although GPUs and TPUs have been used to improve training speed and reduce training time, the overall process still consumes a significant amount of time and needs high data processing capabilities [109].
All of the problems mentioned above are mapped to the existing DL techniques and classified into three groups (solved, partly solved, and not yet solved), as shown in TABLE 7. For example, dimensional complexity is the major limitation of CNN models; this issue can be partly resolved by implementing dimensionality reduction techniques such as RBM, DBN, AE, or DAE [20], [23], [67], [70], [74]. In addition, the vanishing or exploding gradient is a well-known drawback of the RNN algorithm, which was overcome by its variants LSTM and GRU. Even though the vanishing gradient cannot be completely resolved, since it still occurs in long sequences, LSTM solves the problem of long-term dependencies and performs better than traditional RNN models [70], [107], [114]. In general, the problem of manual feature engineering is eliminated in DL, since DL algorithms extract features automatically from raw data without prior knowledge. Although DL has proved to be a promising solution for detecting zero-day phishing attacks, this issue is not completely resolved, as phishing tactics have evolved rapidly with recent technologies. Other common issues among DL algorithms are high computational cost and manual parameter optimization; the optimal set of parameters generating the highest detection accuracy is still debatable. Last but not least, all current DL architectures lack inference justification, as the internal operation of DL models has remained unexplainable until recently. The following section proposes possible directions for future research based on the identified research gaps to help solve some of these problems.

V. FUTURE DIRECTIONS
This section answers RQ3 by suggesting future research directions from the perspective of DL, acting as a guideline for researchers and developers to mitigate phishing attacks in cyberspace.

A. CHOOSING THE RIGHT APPROACH
Since manual feature engineering can introduce bias, DL algorithms become an alternative that can improve the efficiency of phishing detection. It has been shown that DL models without manual feature extraction can outperform traditional ML with feature extraction [76]. Moreover, classical ML methods are unable to explore the hidden correlations between features, whereas DL algorithms can extract information from a tremendous amount of data, find correlations in the extracted data, and handle feature selection autonomously [73]. DL algorithms appear to be a promising solution since they avoid handcrafted feature selection and third-party service dependency, reduce the false positive rate, and improve detection accuracy [66]. Despite all these advantages, DL has not been extensively studied in the phishing detection domain; therefore, more attention should be paid to DL as a potential research direction in the near future [10].

B. SELECTING AN APPROPRIATE DL MODEL
A variety of DL techniques is used to detect phishing attacks in the cyber environment, and choosing the right algorithm for a specific application is extremely important as it affects the final outcome. Researchers must therefore understand the reasons behind selecting a certain DL architecture, as failing to do so might result in an ineffective phishing detection model. For instance, unsupervised DL is expected to become increasingly popular in the near future [9]. Semi-supervised learning is another potential research direction, besides unsupervised learning, to handle the massive amount of unlabeled data in cyberspace. Most current DNN models use unsupervised layer-wise pre-training followed by supervised fine-tuning, which is computationally expensive. If supervised and unsupervised learning can instead be combined in a powerful semi-supervised DNN model, there will be no need for a separate layer-wise pre-training phase, increasing detection accuracy while minimizing computational cost [59].
In addition, the wrong choice of DL design or implementation, stemming from a low level of maturity in applying DL techniques, leads to biased classification results. DL approaches offer a wide range of possibilities that have not yet been fully exploited, and researchers and developers who overlook them fail to explore the full potential of DL architectures. For instance, DL models are capable of capitalizing on multimodal (heterogeneous) input data and handling multiple classification tasks in addition to single-modal and binary classification [115]. On the one hand, a multimodal DL-based classifier can automatically learn the hierarchical representation of all the available modalities in the input data, instead of performing manual feature engineering on a specific modality. On the other hand, a multitask DL approach can reduce computational overhead and limit redundancy by sharing part of the feature engineering procedure. This improves generalization and provides better classification results, which would help solve the task of phishing detection.

C. EMPLOYING OTHER COMPUTATIONALLY EFFICIENT TECHNIQUES
The training process of DL models is performed on a significant dataset and consumes many computational resources. Transfer learning can mitigate this problem when detecting phishing attacks with similar patterns: it utilizes pre-trained models to solve similar problems and trains only the fully connected layer for a new classification task, without building classifiers from scratch for each type of phishing attack. Transfer learning can be applied without bias from features; adequate data on the new attack is sufficient for the task [7], [106].
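The idea can be sketched as follows: a network trained on an old task is frozen and reused as a feature extractor, and only a new classification head is trained on the (small) new-attack sample. Shallow scikit-learn models and synthetic tasks stand in (an assumption made for brevity) for the deep pre-trained models the text refers to.

```python
# Hedged transfer-learning sketch: reuse a pre-trained hidden layer as a
# frozen feature extractor and retrain only a lightweight classifier head.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_old = rng.normal(size=(400, 12))
y_old = (X_old[:, 0] + X_old[:, 1] > 0).astype(int)     # "old" attack patterns

base = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                     max_iter=600, random_state=2).fit(X_old, y_old)

def frozen_features(X):
    # forward pass through the pre-trained (frozen) hidden layer only
    return np.maximum(0.0, X @ base.coefs_[0] + base.intercepts_[0])

X_new = rng.normal(size=(100, 12))                      # small new-attack sample
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)     # related pattern
head = LogisticRegression().fit(frozen_features(X_new), y_new)
print(head.score(frozen_features(X_new), y_new))
```

Only the logistic head is fitted on the new data; the hidden-layer weights are never updated, which is what saves the retraining cost.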
Besides transfer learning, lifelong learning and online learning can also be applied to solve the problem of computational constraints [9]. Online learning is a scalable learning approach that learns from data and makes updates and predictions sequentially. In online learning, data is treated as a stream of instances, making it more efficient than traditional batch learning.
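Online learning as described can be sketched with scikit-learn's partial_fit interface, which updates the model one mini-batch at a time instead of retraining on the full dataset; the stream below is a synthetic assumption.

```python
# Sketch of online (incremental) learning: the classifier is updated from a
# stream of mini-batches rather than a single batch-training pass.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)
model = SGDClassifier(random_state=3)

for step in range(50):                     # 50 mini-batches arriving in sequence
    X_batch = rng.normal(size=(20, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=[0, 1])  # incremental update

X_test = rng.normal(size=(200, 5))
print(model.score(X_test, (X_test[:, 0] > 0).astype(int)))
```

Because each update touches only the current batch, memory stays bounded and the model can keep adapting as new phishing samples arrive.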
Additionally, computational costs can be reduced by implementing distributed computing and distributed algorithms. Different jobs are distributed among several machines in the hybrid network to speed up the process and improve performance efficiency. Big data technologies, such as Apache Spark, can be applied to handle this task by utilizing the parallel computing capabilities to process the data with feasible computational resources [73].
One of the major limitations of phishing detection models is resource constraints, and DL algorithms are computationally expensive. It has been suggested that edge or fog nodes can offload the computation for effective phishing detection without increasing its cost [21]. Edge computing offers a more scalable platform for computational processing and power storage; leveraging it facilitates handling this problem by allocating the computation across several resources over the cloud [23].
Another approach to minimizing computational cost is to integrate neuromorphic computing with DL. Neuromorphic computing differs from deep neural networks in both structure and principle. In current deep neural networks, all neurons are activated by an activation function, for example, the Rectified Linear Unit (ReLU), Sigmoid, or Tanh. In neuromorphic computing, by contrast, not all neurons are activated every time, allowing the model to achieve higher efficiency and lower power consumption. Neuromorphic computing helps reduce the need for software and hardware development, increasing computational speed and decreasing computational complexity [107].

D. SELECTING, LABELLING AND TRAINING DATASET
The efficiency and effectiveness of phishing detection solutions depend on the selection, labelling, and training of a dataset. First, some datasets are unavailable, non-diverse, out of date, or highly imbalanced. Thus, it is essential to select a recent, balanced dataset containing various phishing patterns to detect newly evolving attacks in a live environment [21]. Second, supervised ML techniques require labelled data for training, yet the amount of labelled data is limited compared to all the data available on the web. Researchers can therefore apply active learning or crowdsourcing techniques, in which individuals and organizations label and share malicious URLs, to handle the difficulty of acquiring labelled data or to learn with a limited amount of it [9]. Third, pre-trained detection models might fail to handle new types of attacks once phishers modify the nature of malicious websites or URLs [10]. Hence, retraining on a more recent dataset is required to fight active attackers when the testing data differs in its characteristics from the training data [7]. Furthermore, adversarial training can be used to handle adversarial attacks by minimizing the negative influence of monotonous samples or polluted data on DL algorithms. Combining DL with reinforcement learning is another possible solution, although it is unlikely to completely prevent adversarial attacks [60].
In addition, researchers can either increase the sample data or reduce the data dimension to address the imbalanced dataset problem. On the one hand, small datasets can introduce bias and suffer from a lack of generalization to new phishing patterns; a dataset can be balanced by increasing the number of minority-class samples. On the other hand, training on very large datasets is a challenging and time-consuming process; in this case, dimensionality reduction techniques can improve performance accuracy and reduce computational complexity [117].
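A minimal sketch of both remedies, assuming a synthetic imbalanced feature matrix: random oversampling to balance the classes, followed by PCA via SVD to reduce the dimensionality (the sizes and component count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 100 legitimate (0) vs 10 phishing (1) samples
X = rng.normal(size=(110, 30))
y = np.array([0] * 100 + [1] * 10)

# 1) Random oversampling: duplicate minority samples until classes balance
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=100 - len(minority), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# 2) Dimensionality reduction via PCA (SVD on centred data): keep 10 components
Xc = X_bal - X_bal.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:10].T

print(np.bincount(y_bal))   # balanced classes: [100 100]
print(X_reduced.shape)      # (200, 10)
```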

E. FINE-TUNING HYPER-PARAMETERS
It is essential to fine-tune several parameters in the DL architecture to build a robust and competent model for phishing detection. Fine-tuning is the process of optimizing the performance of a training model by changing the number of hidden layers, neurons, epochs, the learning rate, etc., in the neural network. The aim is to obtain the combination of parameters that yields the best performance accuracy. Researchers can follow a set of pre-defined rules or formulas to calculate these values or to narrow down the range of possibilities for these parameters. Nevertheless, such rules are not always applicable or feasible in specific scenarios. In those circumstances, researchers need to examine as many different combinations of parameters as possible and choose the setting that gives the neural network the best output results. Besides, a self-organizing neural network is another option for fine-tuning parameters in DL models. This technique allows the network to learn incrementally by adding or removing neurons according to different criteria [59].
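An exhaustive search over parameter combinations can be sketched as follows; the toy single-neuron classifier and the grid values are hypothetical stand-ins for a full DL model and its hyper-parameters:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy, roughly separable data standing in for URL feature vectors
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def train_eval(lr, epochs):
    """Train a single-neuron classifier and return its training accuracy."""
    w = np.zeros(5)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)   # gradient step on log-loss
    return float((((X @ w) > 0) == (y == 1)).mean())

# Exhaustive search over the hyper-parameter grid
grid = {"lr": [0.01, 0.1, 1.0], "epochs": [10, 100]}
best = max(itertools.product(grid["lr"], grid["epochs"]),
           key=lambda cfg: train_eval(*cfg))
print("best (lr, epochs):", best)
```

In practice the inner call would train a full DL model, so techniques such as random search or early pruning are used to keep this search tractable.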

F. PICKING THE BEST MEASURES
Another concern that needs to be considered in future research is choosing appropriate metrics to evaluate the performance of phishing detection models. Researchers and developers must be careful in selecting performance metrics for model evaluation on highly imbalanced datasets. It might not be suitable to use Accuracy, Precision, Recall, and F1-Score to assess the effectiveness of phishing detection systems under class imbalance [10]. Conventional metrics like accuracy cannot capture the true performance of a detection classifier on imbalanced data. Instead, the confusion matrix and the Area Under the Curve (AUC) are more desirable. Other metrics designed for imbalanced datasets are the Geometric Mean (G-Mean), the Matthews Correlation Coefficient (MCC), and the balanced detection rate [7], [10].

G. EMPLOYING EXPLAINABLE NEURAL NETWORK
It is advisable to design and implement a DL expert system that generates knowledge automatically from training data, to overcome the lack of inner explanation in deep neural networks [99]. In such a hybrid model, refined rules are extracted from a trained neural network and used to populate the knowledge base of an expert system. The neural network becomes more convincing and reliable once its internal operations are explainable.
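The imbalance-aware metrics discussed above (G-Mean and MCC) can be computed directly from confusion-matrix counts; the sketch below also shows how a majority-class classifier can reach 95% accuracy while both metrics collapse to zero:

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """G-Mean and Matthews Correlation Coefficient from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    g_mean = math.sqrt(sensitivity * specificity)
    # Conventionally MCC is 0 when the denominator is 0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) or 1.0
    mcc = (tp * tn - fp * fn) / denom
    return g_mean, mcc

# A classifier that labels everything "legitimate" on a 95:5 imbalanced set
# scores 95% accuracy, yet both imbalance-aware metrics are zero:
g, m = imbalance_metrics(tp=0, fp=0, fn=5, tn=95)
print(g, m)  # → 0.0 0.0
```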
Several efforts have been made to help reveal the internal interpretation of DL algorithms [118], [119]. However, these techniques were applied in different research domains (discriminative image localization, depression recognition from facial images) and have not been employed for cybersecurity purposes. When applied, explainable neural networks can potentially assist security experts in determining the input conditions under which a given output is produced. In the cybersecurity domain especially, understanding the output of a cyber threat detection model would give security experts valuable insight into preventing and mitigating such threats.

H. INTEGRATING VARIOUS TECHNIQUES IN A HYBRID MODEL
Another future direction is to combine different DL techniques in a hybrid approach to achieve better performance accuracy in phishing detection. The rationale is that each individual DL algorithm has its own pros and cons. By integrating different DL techniques in a single approach, we can merge their advantages and offset their disadvantages, providing a more robust model for detecting phishing attacks [70].
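A minimal sketch of such a combination is soft voting, where the phishing probabilities of several base models are averaged before thresholding (the model names and scores below are hypothetical):

```python
import numpy as np

def soft_vote(prob_lists, threshold=0.5):
    """Average the phishing probabilities of several base models
    (e.g. CNN, LSTM, GRU) and threshold the mean score."""
    mean_probs = np.mean(prob_lists, axis=0)
    return (mean_probs >= threshold).astype(int)

# Hypothetical per-URL scores from three individual DL models
cnn  = np.array([0.9, 0.4, 0.2])
lstm = np.array([0.8, 0.6, 0.1])
gru  = np.array([0.7, 0.7, 0.3])
print(soft_vote([cnn, lstm, gru]))  # → [1 1 0]
```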

I. DEVELOPING A ROBUST, SCALABLE AND FLEXIBLE PHISHING DETECTION SYSTEM
Phishing attacks are continuously evolving with the advancement of information technologies, as phishers try to come up with a countermeasure for every new solution that security experts suggest. As a result, it is essential to have a robust detection system with a set of features that goes beyond the common attacks, and a diverse, recent, and high-quality dataset for model training [7]. Researchers should train the DL model on one dataset and test it on different data to ensure the robustness of phishing detection systems. This is also known as a generalization experiment, or cross-domain system testing, and verifies the performance of a phishing detection model in classifying various types of attacks [10]. Since phishers constantly change their tactics to bypass defense mechanisms, model retraining alone might not be sufficient to cope with newly emerging attacks. Therefore, a robust phishing detection system is one with high adaptability, which can adjust to reflect changes in the real-world environment, given the variety of phishing attacks, the newly evolving attack types, and the numerous scenarios in which such attacks can occur [9].
Besides adaptability, scalability is another requirement for future phishing detection models. In the big data era, a phishing detection system should be able to handle millions of instances in the training data. Researchers can employ more efficient and scalable learning algorithms, such as online learning or efficient stochastic optimization algorithms, to meet the scalability requirement [9]. Moreover, big data technologies like Apache Spark and Apache Flink can process data in-memory. In-memory processing allows data to be analyzed in real time, which is extremely important in detecting security threats. Incorporating DL and big data technologies will help improve the performance and efficiency of security analytics [73].
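Online learning, mentioned above as a scalable alternative to batch retraining, can be sketched as a classifier that takes one gradient step per arriving sample (a single-neuron stand-in for a full DL model, fed by a synthetic URL stream):

```python
import numpy as np

class OnlineClassifier:
    """Single-neuron online learner: one SGD step per arriving URL,
    so the model updates as the stream flows instead of retraining in batch."""
    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.lr = lr

    def partial_fit(self, x, label):
        p = 1.0 / (1.0 + np.exp(-(self.w @ x)))
        self.w -= self.lr * (p - label) * x   # log-loss gradient for one sample

    def predict(self, x):
        return int(self.w @ x > 0)

rng = np.random.default_rng(1)
clf = OnlineClassifier(n_features=4)
for _ in range(2000):                  # simulated URL stream
    x = rng.normal(size=4)
    clf.partial_fit(x, label=float(x[0] > 0))
print(clf.predict(np.array([3.0, 0.0, 0.0, 0.0])))  # → 1
```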
It is crucial for a phishing detection system to be flexible enough for easy design, implementation, improvement, and extension, considering the complexity of DL-based phishing webpage classification [9]. The flexibility requirements include quick model updates upon the arrival of new training data, ease of changing the classification model when needed, extensibility of model training to cope with new attack types, and the ability to interact with users when required.
An example of a robust, scalable and flexible phishing detection system is an anti-phishing framework or web browser plug-in that can perform multiple tasks, such as detecting, preventing, and reporting, once a suspicious website is found. The ability to quickly report phishing attacks to the organization from the user's end is an essential feature that can be added to existing phishing detection solutions. The time organizations lose on remediation after being attacked by cyber criminals can negatively impact the productivity and profitability of their businesses. Therefore, it is vital to provide a feasible model that can detect and report phishing attacks as automatically and quickly as possible so that they cannot cause any further damage. It is expected that, in the future, an all-inclusive phishing detection system can be implemented in such a way that it detects, reports, and prevents malicious websites without requiring the user's involvement. When users are asked for credentials or personal information, the framework or web browser plug-in should be able to check whether the website is legitimate and notify the users beforehand. Therefore, a scalable and robust phishing detection solution that performs website health checking during user browsing is needed in the near future [8].
To sum up, many solutions have been proposed to detect phishing attacks, but no single solution can detect all attack types in the vast space of the cyber environment. Whenever researchers develop a new solution to fight phishing attacks, phishers will exploit the vulnerabilities of the current solution and devise a new attacking strategy to deceive users. A list of current issues and challenges, together with recommendations and future research directions, is provided in TABLE 8, with the hope that it will contribute to the mitigation of the phishing attacks that have evolved rapidly in recent years.

VI. EMPIRICAL ANALYSIS
This section provides an empirical analysis of several DL algorithms to manifest some of the current issues discussed above. First, the dataset and the list of features used in the experiment are described. Then, the experiment setup is briefly explained. Finally, the existing problems that DL faces in phishing detection are highlighted from the experiment results.

A. DATASET
The dataset used for the experiment in this study was obtained from the University of California, Irvine (UCI) Machine Learning Repository, which has been widely used by various authors in their research [83], [97], [99], [101]. The dataset consists of 11,055 URLs, of which 6,157 are legitimate and 4,898 are phishing. It was divided into two parts, 80% for training and 20% for testing, and contains a total of 30 features. FIGURE 12 is a heatmap displaying the correlation matrix of the features. The correlations range from -0.6 to 1, where 1 is the highest positive correlation and -1 the lowest negative correlation. The closer the correlation is to 1, the more positively correlated the features are; in other words, as one increases, so does the other. In this dataset, the features Favicon and Using Popup Window are the only highly correlated pair. Moreover, some features are negatively correlated and others positively correlated; a negative correlation means that one feature marks the URL as phishing while the other does not [97].
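The correlation analysis behind FIGURE 12 can be reproduced in a few lines; the sketch below uses a small synthetic stand-in for the 30 UCI features, with one column duplicated to mimic a highly correlated pair such as Favicon and Using Popup Window:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the UCI features: ternary values in {-1, 0, 1}, as in the
# dataset; two columns made deliberately identical to mimic a correlated pair
features = rng.integers(-1, 2, size=(500, 4)).astype(float)
features[:, 3] = features[:, 2]              # perfectly correlated pair

corr = np.corrcoef(features, rowvar=False)   # 30x30 for the real dataset
print(np.round(corr[2, 3], 2))               # → 1.0
```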

B. EXPERIMENT SETUP
Various DL models were built in the experiment using the Python programming language with TensorFlow on Google Colaboratory. TensorFlow is an end-to-end open-source platform for machine learning that provides tools, libraries and resources for building, training, and deploying machine learning models. Google Colaboratory enables users to write and execute Python in the browser, providing an interactive environment in which executable code, text, images, HTML, etc., can be combined in a single document. Code is executed on Google's cloud servers, allowing users to leverage the power of Google's hardware. Several DL models were built in this empirical study, including DNN, MLP, CNN, RNN, LSTM, GRU, and AE. Parameter settings for these DL architectures are listed in TABLE 9.
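The 80/20 split described in the dataset subsection can be sketched as follows on a synthetic stand-in for the 11,055-sample, 30-feature dataset (the actual TABLE 9 parameter settings are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the 11,055-sample, 30-feature UCI dataset
X = rng.normal(size=(11055, 30))
y = rng.integers(0, 2, size=11055)

# Shuffled 80/20 train/test split, as used in the experiment
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(len(X_train), len(X_test))  # → 8844 2211
```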

C. RESULT AND DISCUSSION
It is essential to select a set of parameters with the best performance accuracy when building each DL model. These parameter settings can vary among different DL models, including the number of hidden layers in the neural networks, the number of neurons in each hidden layer, the number of epochs, batch size, type of optimizer, learning rate, type of activation function, etc. The same set of parameters was used in this research across all DL models just for the purpose of empirical analysis to highlight the current issues of DL in phishing detection. Fine-tuning will be added in future research to find the optimal set of parameters for each DL model that can produce the highest detection accuracy.
The loss and accuracy of the various DL models during training and validation are illustrated in FIGURE 13. The accuracy for each DL model is shown in the upper graph, while the loss function is displayed in the lower plot. As the number of epochs grows, the accuracy increases while the loss decreases. The training accuracy, or training loss, is represented by a blue line, whereas the validation result is displayed in orange. A large gap between training and validation results indicates the overfitting problem [96]. Overfitting usually occurs when the model performs well on the training set but poorly on the validation set, causing the training accuracy to be much higher than the validation accuracy. As a result, the smaller the gap between the blue and orange lines, the better the phishing detection model; in other words, the faster the training and validation curves converge, the more efficient the DL algorithm. Most of the time, issues caused by overfitting can be prevented with regularization techniques such as batch normalization, early stopping, or dropout [22], [23], [94], [101]. As can be seen from the graphs, the CNN, LSTM and GRU models are less prone to overfitting since they implement a dropout function. In contrast, the DNN and MLP algorithms might suffer from overfitting because no regularization technique was used in their implementation.
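Of the regularization techniques listed, early stopping is the simplest to sketch: training halts once the validation loss has failed to improve for a fixed number of epochs (the patience); the loss values below are illustrative:

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training should stop: the first epoch
    at which validation loss has not improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch          # stop: no improvement for `patience` epochs
    return len(val_losses) - 1    # ran to the end without triggering

# Validation loss drops, then rises as the model starts to overfit
losses = [0.9, 0.6, 0.5, 0.52, 0.55, 0.6, 0.7]
print(early_stopping(losses))  # → 5
```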
The results obtained from the experiments are consistent with what was discussed in the previous section: ensemble DL models combine the strengths and offset the weaknesses of individual models to achieve higher performance accuracy. It was also observed in the experiment that LSTM and GRU take longer to train than any other model. In addition, among the LSTM architectures, the ensemble LSTM models take longer to train than a single LSTM model. These results also accord with the previous literature, in which the more complex the DL architecture, the longer the training time. Therefore, besides having an effective DL model that produces high detection accuracy, it is also crucial to reduce the training duration, since longer training times require more computational resources.
In short, the empirical results obtained from the experiments on various DL models have manifested the following issues that need to be addressed. First, there is no specific guideline for an optimal set of parameters that yields the best performance accuracy in detecting phishing attacks; researchers need to fine-tune these parameters manually by conducting a tedious and time-consuming series of experiments. Second, individual DL models might produce lower accuracy than ensemble or hybrid models. As a result, it is recommended to combine different DL algorithms in a phishing detection model to obtain an effective and robust solution against phishing attacks. Last but not least, training duration is another factor that needs to be taken into consideration. Even though ensemble and hybrid DL models achieve higher accuracy, they might also take longer to train. This becomes a problem because a longer duration incurs higher computational cost, which reduces the model's efficiency.
This section has assessed the classification performance of different DL algorithms and discussed their related limitations by analyzing several DL models in a practical context. The empirical analysis was performed with a recently published, publicly available and commonly used dataset for benchmarking and evaluation in phishing detection. In addition, the performance of the various DL models was evaluated with a set of standard metrics frequently used for validation in the phishing detection domain. Altogether, the benchmarking dataset, the evaluation metrics, and the empirical results were discussed to highlight overlooked issues, along with perspectives that encourage researchers to explore DL and navigate future research directions for phishing detection.

VII. CONCLUSION AND FUTURE WORK
To sum up, DL has attracted much attention among researchers across numerous application domains. DL can handle complex data and extract raw features automatically without prior knowledge. With the advent of new technologies and the rapid growth of data in the big data era, DL has become one of the most popular topics in cybersecurity, especially in the phishing detection field. As a result, this study provided a comprehensive review of DL for phishing detection through an in-depth SLR approach. The paper also offered significant insight into the current issues and challenges that DL faces in detecting phishing attacks by analyzing the trends and patterns of 81 selected articles from various sources. This research has drawn a taxonomy for phishing detection and DL to classify the relevant studies into several classes based on a thorough analysis. The results obtained from the empirical experiments indicated that the most common issues among DL techniques are manual parameter tuning, long training time and deficient performance accuracy. These findings imply that further efforts are needed to improve the state-of-the-art DL algorithms in terms of fine-tuning, training duration and detection accuracy, to ensure a robust and effective system for detecting phishing attacks in cyberspace. These outcomes also suggest that, in addition to optimization techniques and ensemble methods, integrating DL with big data or cloud-based technologies in a hybrid approach is a new research direction for phishing detection. Based on the above analysis, we believe that this study will serve as a valuable reference for researchers and developers in the field of cybersecurity.
As for future work, we will conduct extensive experiments using different sets of parameters to obtain the highest possible detection accuracy. In addition, we plan to include other DL techniques not yet fully explored in phishing detection, such as GAN or DRL. Besides homogeneous architectures, we will implement heterogeneous ensemble DL models by integrating DL algorithms from different genres, for example, CNN-LSTM, DNN-AE, MLP-GRU, etc., to examine the effectiveness and efficiency of ensemble methods over individual techniques. Last but not least, instead of using a balanced dataset, we will use an imbalanced one in the experiment setup, owing to the fact that in real-life scenarios phishing is an imbalanced classification problem, where the number of legitimate instances is much higher than the number of phishing ones. See Tables 11-27.

He is a Visiting Professor with the University of Hradec Králové, Czech Republic, and the Kagoshima Institute of Technology, Japan. His research interests include data analytics, digital transformations, knowledge management in higher education, key performance indicators, cloud-based software engineering, software agents, information retrievals, pattern recognition, genetic algorithms, neural networks, and soft computing. He is also currently serving on the Editorial Boards of the international journal of

ENRIQUE HERRERA-VIEDMA (Fellow, IEEE) received the M.Sc. and Ph.D. degrees in computer science from the University of Granada, Granada, Spain, in 1993 and 1996, respectively. He is currently a Professor of computer science and AI and the Vice-President for research and knowledge transfer with the University of Granada. His H-index is 69 (more than 17 000 citations received in the Web of Science and 85 in Google Scholar), with more than 29 000 cites received.
He has been identified as one of the World's most influential researchers by the Shanghai Centre and Thomson Reuters/Clarivate Analytics in both the scientific categories of computer science and engineering, from 2014 to 2018. His current research interests include group decision making, consensus models, linguistic modeling, aggregation of information, information retrieval, bibliometric, digital libraries, web quality evaluation, recommender systems, blockchain, smart cities, and social media.

HAMIDO FUJITA (Life Senior Member, IEEE) received the Doctor Honoris Causa degrees from Óbuda University, Budapest, Hungary, in 2013, and from Politehnica University Timisoara, Timişoara, Romania, in 2018. He received the title of Honorary Professor from Óbuda University, in 2011. He is an Emeritus Professor with Iwate Prefectural University, Takizawa, Japan. He is currently the Executive Chairperson at i-SOMET Incorporated Association, Morioka, Japan. He is a Highly Cited Researcher in cross-field and in the field of computer science by Clarivate Analytics, in 2019 and 2020, respectively. He is a Distinguished Research Professor at the University of Granada and an Adjunct Professor with Stockholm University, Stockholm, Sweden; the University of Technology Sydney, Ultimo, NSW, Australia; and the National Taiwan Ocean University, Keelung, Taiwan. He has jointly supervised Ph.D. students at Laval University, Quebec City, QC, Canada; the University of Technology Sydney; Oregon State University, Corvallis, OR, USA; the University of Paris 1 Pantheon-Sorbonne, Paris, France; and the University of Genoa, Italy. He has four international patents in software systems and several research projects with Japanese industry and partners. He headed a number of projects, including the intelligent HCI, a project related to mental cloning for healthcare systems as an intelligent user interface between human users and computers, and the SCOPE project on virtual doctor systems for medical applications. He has collaborated with several research projects in Europe, and recently he has been collaborating in the OLIMPIA Project supported by the Tuscany region on therapeutic monitoring of Parkinson's disease. He has published more than 400 highly cited papers. He was the recipient of the Honorary Scholar Award from the University of Technology Sydney, in 2012. He is the Emeritus Editor-in-Chief of Knowledge-Based Systems and currently the Editor-in-Chief of Applied Intelligence (Springer).
VOLUME 10, 2022