Introduction
An ensemble self-tuning classification model dedicated to reducing the number of false positive instances is proposed. In this paper, we evaluate the proposed model in the context of phishing. However, the model may also be applied to other problems where minimizing false positive alerts is important, for instance fire alarm systems. The model uses aggregation functions to combine the results of the constituent models, and it works with time-series data divided into periods. Aggregation functions are regarded as a valuable tool in a multitude of application domains [1]. Diverse families of aggregation functions were applied to take advantage of their properties in merging a set of values into one final value.
Phishing attacks pose a serious threat to both individuals and organizations. They are most often carried out using fake URLs that are designed to deceive users and get them to divulge sensitive information. This can lead to financial losses, identity theft or data breaches. Therefore, there is a need to develop effective and efficient models for detecting phishing attacks [2], [3], [4]. In recent years, the classification of phishing messages has become an important research topic. An overview of classification methods related to phishing detection can be found in [4], [5], [6], [7], and [8]. Various approaches have been proposed to improve the effectiveness of phishing attack detection, such as feature selection methods [10], [11], ensemble learning techniques [12], [13] and deep learning [8]. However, the effectiveness of these classification methods often depends on the specific targets and datasets used, on feature extraction, and sometimes on third-party services [4]. Therefore, further analysis of the performance of available models, particularly classification methods, is needed to improve the detection of phishing messages. Detailed information about phishing datasets and phishing detection methods is provided in [7], [8] and in Section III of this contribution. Recent papers devoted to machine learning-based predictors are discussed in [14], [15], [16], [17], and [18].
The proposed model uses a portion of the learning set to find the optimal threshold. In addition, it uses incremental learning strategies [19], [20], [21]. This approach makes it possible to iteratively improve the model’s performance by training on new data while using previously acquired knowledge. Because of its continuity, incremental learning scales well and is more efficient in terms of time and computational resources [22], [23]. The aim of this contribution is to discuss the problem of obtaining a minimal FPR while simultaneously obtaining a maximal TPR. We applied the proposed ensemble model and neural network models adjusted to the incremental learning strategy (typical incremental learning models were used as base models). An overview of incremental learning methods may be found in [20], where the method applied in our contribution is called on-line incremental learning, which is also referred to simply as incremental learning. Throughout the paper we use the shorter name. A more detailed discussion of on-line learning and a list of popular models may be found in [24]. The presented approach is a substantially enhanced and extended version of the study presented in [25]. First of all, a much wider range of aggregation functions was applied than in [25], where only the arithmetic mean, minimum and maximum were used to merge the prediction values of the constituent classifiers. In the experiments presented in this contribution, we adopted 28 strategies for creating prediction models. The majority, namely 20 models, are based on aggregation functions. For comparability, the base incremental models (Passive Aggressive classifier, SGD classifier, Bernoulli NB classifier) are also examined. The five remaining models are neural network models, i.e. Convolutional Neural Networks, Feed Forward networks and Long Short-Term Memory (LSTM), a type of recurrent neural network aimed at mitigating the vanishing gradient problem. Furthermore, several performance metrics of the obtained models were compared, such as the True Positive Rate (TPR), the False Positive Rate (FPR) and the HMRS measure [26], which is a harmonic mean of the TPR and the False Negative Rate (FNR) dedicated to imbalanced datasets. The HMRS measure aggregates TPR and FNR and makes it possible to analyze the behavior of the model globally. Finally, statistical tests were performed to show the significance of the obtained results. According to the statistical tests (Kruskal-Wallis and Dunn’s multiple comparison test with the Holm-Bonferroni correction) performed for the HMRS measure, for some of the desired TPR levels the proposed model obtained results that were significantly better, by a few percentage points, than those of the neural network models. The advantages of the proposed model based on aggregation functions are:
the ability to obtain the required TPR level;
the ability to obtain relatively high values of true positives;
statistically better results of the HMRS measure for smaller levels of the desired TPR level.
The advantages of the neural networks models adjusted to the incremental learning strategy are:
the ability to obtain smaller values of false positives;
statistically better results of the HMRS measure for greater levels of the desired TPR level.
The motivation for this study arose from a real-life problem of the FreshMail company (FM) and the data it provided. This is why this study uses a dataset provided by FreshMail. FreshMail is a Polish company that specializes in the field of email marketing. The company facilitates the transmission of a considerable volume of emails on behalf of its clients on a daily basis. In order to maintain a professional image, the company strives to avoid sending any phishing-related content through its platforms. The proposed algorithm could serve as a phishing detector, reducing the risk of FM sending a phishing email.
It is important to note, however, that FM does not have confirmed examples of phishing. The prediction is based only on an established level of suspicion that an e-mail is phishing, calculated from its similarity to external phishing e-mails and to links from such e-mails. Nevertheless, in a real-world scenario there is never 100% certainty that an e-mail contains phishing, so our approach must be reliable.
The article is structured as follows. Section II presents the aggregation functions used. Next, Section III discusses the FreshMail dataset, and Section IV presents the methodology used in experiments on this dataset. Section V presents the results of the experiments and discusses the performance of the proposed model. Finally, the article concludes with a summary of the research results and plans for future work.
Aggregation Functions
First, we recall the definition of an aggregation function, which is used in the proposed ensemble learning model.
Definition 1 (cf.[27]):
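A function $A:[0,1]^{n} \to [0,1]$ is called an aggregation function if it is increasing in each variable, i.e. \begin{equation*} (\forall_{1 \leq i \leq n}\; x_{i} \leq y_{i}) \Rightarrow A(x_{1},\dots,x_{n}) \leq A(y_{1},\dots,y_{n}), \tag{1}\end{equation*} and satisfies the boundary conditions $A(0,\dots,0)=0$ and $A(1,\dots,1)=1$.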
Definition 2 ([27]):
Let $A:[0,1]^{n} \to [0,1]$ be an aggregation function. $A$ is called a mean if it is idempotent, i.e. \begin{equation*} \forall_{x \in [0,1]}\; A(x,\ldots,x) = x.\end{equation*}
Proposition 3:
For every mean A we have \begin{equation*} \forall_{x \in \mathbb{R}^{n}}\; \min_{1 \leq k \leq n} x_{k} \leq A(x_{1},\ldots,x_{n}) \leq \max_{1 \leq k \leq n} x_{k}. \tag{2}\end{equation*}
A. Convex Combination of Aggregation Functions
Let
maximum
\begin{equation*} A_{mx}(x_{1},\ldots,x_{n})=\max\limits_{1\leq k \leq n}x_{k},\end{equation*}
arithmetic-min average
\begin{equation*} A_{armn}^{(p)}(x_{1},\ldots,x_{n})=\frac{p}{n}\sum\limits_{k=1}^{n} x_{k}+ (1-p) \min\limits_{1\leq k \leq n}x_{k},\end{equation*}
geometric-min average
\begin{equation*} A_{gmmn}^{(p)}(x_{1},\ldots,x_{n})=\frac{p}{n}\sqrt[n]{\prod\limits_{k=1}^{n} x_{k}}+ (1-p) \min\limits_{1\leq k \leq n}x_{k},\end{equation*}
product-min aggregation
\begin{equation*} A_{prmn}^{(p)}(x_{1},\ldots,x_{n})=p\prod\limits_{k=1}^{n} x_{k} + (1-p) \min\limits_{1\leq k \leq n}x_{k}\end{equation*}
The family
The family
The chosen aggregation functions yielded the best results in our experiments. In most cases, their values are close to those of the minimum aggregation function.
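As an illustration of the formulas in this subsection, the following minimal Python sketch implements the maximum, the arithmetic-min average and the product-min aggregation exactly as written above. The function names and the sample scores are hypothetical and chosen only for this example; the geometric-min average is omitted here.

```python
from math import prod


def a_max(xs):
    """Maximum aggregation A_mx."""
    return max(xs)


def a_arith_min(xs, p):
    """Arithmetic-min average: p * mean(xs) + (1 - p) * min(xs)."""
    return p * sum(xs) / len(xs) + (1 - p) * min(xs)


def a_prod_min(xs, p):
    """Product-min aggregation: p * prod(xs) + (1 - p) * min(xs)."""
    return p * prod(xs) + (1 - p) * min(xs)


# Illustrative confidence scores of three constituent classifiers (made up).
scores = [0.9, 0.6, 0.8]
print(a_max(scores))
print(a_arith_min(scores, p=0.5))
print(a_prod_min(scores, p=0.5))
```

For p = 0 both convex combinations reduce to the minimum, which is consistent with the observation that their values tend to stay close to the minimum for small p.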
B. Uninorms
The aggregation functions mentioned in the previous subsection are, in most cases, averaging aggregation functions. As an example of aggregation functions that are, at least in most cases, not averaging, we can consider uninorms.
They first appeared under the name uninorm in [28] and were further studied in [29], with the idea of allowing a certain kind of aggregation operator that combines the maximum and the minimum, depending on an element
Definition 4 ([28]):
Operation
The general structure of the uninorm can be represented by the following theorems (cf. Figure 1).
Theorem 5 ([34]):
If a uninorm $U$ has a neutral element $e \in (0,1)$, then there exist a triangular norm $T$ and a triangular conorm $S$ such that \begin{align*} U(x,y)=\begin{cases} eT\left(\frac{x}{e},\frac{y}{e}\right) & \text{if } (x,y)\in [0,e]^{2}, \\ e+(1-e)S\left(\frac{x-e}{1-e},\frac{y-e}{1-e}\right) & \text{if } (x,y)\in [e,1]^{2}. \end{cases}\end{align*}
Moreover, \begin{equation*} \min \leq U \leq \max \text{ in } A(e)=[0,e)\times (e,1]\cup (e,1]\times [0,e).\end{equation*}
Due to the properties of uninorms, we can easily extend them to n-arguments. The most well-known classes of uninorms are listed below.
Uninorms in ${\mathcal{U}}_{\min}$ (respectively ${\mathcal{U}}_{\max}$), those given by minimum (respectively maximum) in $A(e)$, that were characterized in [34]. As an example, we can provide a family of uninorms of the form:
\begin{align*} U_{\min}(x_{1},\ldots,x_{n}) = \begin{cases} \max(x_{1},\ldots,x_{n}) & \text{if } x_{1},\ldots,x_{n}\geq e, \\ \min(x_{1},\ldots,x_{n}) & \text{otherwise.} \end{cases}\end{align*}
Idempotent uninorms, those such that $U(x,x)=x$ for all $x\in [0,1]$. They are characterized using a separating function in [35]. If we use a linear function as the separating function on $[0,e]$, we obtain the following family of idempotent uninorms:
\begin{align*} U_{id}(x_{1},\dots,x_{n}) = \begin{cases} \min\limits_{1\leq k \leq n}x_{k} & \text{if } \max\limits_{1\leq k \leq n}x_{k}-1 \leq \frac{e-1}{e}\min\limits_{1\leq k \leq n}x_{k}, \\ \max\limits_{1\leq k \leq n}x_{k} & \text{otherwise.} \end{cases}\end{align*}
Representable uninorms, those that have additive generators. They were introduced in [34] and next were characterized as those uninorms that are continuous in the domain $[0,1]^{2}\setminus \{(0,1),(1,0)\}$ (cf. [36]) and also as those uninorms that are strictly increasing and continuous in the open unit square (cf. [37]):
\begin{align*} U_{rep}(x_{1},\ldots,x_{n}) = \begin{cases} 0 & \text{if } x_{i}=0 \text{ for some } i, \\ \frac{\prod_{i=1}^{n} \left(\frac{1}{e}-1\right)x_{i}}{\prod_{i=1}^{n} \left(\frac{1}{e}-1\right)x_{i} + \left(\frac{1}{e}-1\right)\prod_{i=1}^{n} (1-x_{i})} & \text{else.} \end{cases}\end{align*}
Using the Łukasiewicz t-norm and its generator, we obtain the following operation \begin{equation*} U_{L}(x_{1},x_{2},\ldots,x_{n}) = \min\left(1,\max\left(0,\sum_{i=1}^{n} x_{i} -(n-1)e\right)\right).\end{equation*}
Descriptions of other classes can be found in [38], [39], [40], [41]. In the experiments, we applied the uninorms from the above examples for
Examples of uninorms and other aggregation functions applied in the experiments are listed in Table 2.
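The n-ary formulas for U_min, U_id and U_L given above can be sketched in Python as follows. This is only an illustration under the assumption that the neutral element e lies strictly between 0 and 1; the function names are hypothetical, and the representable uninorm is left out of the sketch.

```python
def u_min(xs, e):
    """U_min example: maximum if every argument is >= e, minimum otherwise."""
    return max(xs) if all(x >= e for x in xs) else min(xs)


def u_id(xs, e):
    """Idempotent uninorm with a linear separating function (assumes 0 < e < 1)."""
    lo, hi = min(xs), max(xs)
    return lo if hi - 1 <= (e - 1) / e * lo else hi


def u_luk(xs, e):
    """Lukasiewicz-based operation: min(1, max(0, sum(xs) - (n - 1) * e))."""
    n = len(xs)
    return min(1.0, max(0.0, sum(xs) - (n - 1) * e))


# With e = 0.5, scores on the same side of e reinforce each other.
print(u_min([0.6, 0.7], e=0.5))   # both arguments above e
print(u_min([0.6, 0.3], e=0.5))   # one argument below e
print(u_luk([0.9, 0.9, 0.9], e=0.5))
```

In each function, combining a value x with the neutral element e returns x itself, which is the defining property of a uninorm that the sketch is meant to make visible.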
Phishing URL Datasets
Several publicly available phishing URL datasets have been created to validate models for detecting phishing links. Because phishing URLs expire, or the content of the web pages they link to changes, the time of data collection is crucial: phishing URLs from old datasets may already be banned. Feature extraction strategies differ between datasets, so to allow a comparison of the different extraction methods we briefly summarize the datasets that are easiest to find on the internet.
A. Datasets in Literature
Phishing Dataset for Machine Learning: Feature Evaluation [42] contains data collected from January to May 2015 and from May to June 2017. The dataset contains 10,000 samples and is balanced, with a 1:1 ratio of phishing to non-phishing URLs. Each sample is described by 48 features. Half of the features are extracted from the URL and the other half from the website linked to by the URL. The dataset is ready to import into the Weka library, but it contains neither the URLs themselves nor the webpage content.
Phishing URL detection framework based on Similarity Index and Incremental Learning - PhiUSIIL dataset[3] is a comprehensive collection of phishing and legitimate URLs. It contains 34,850 legitimate (26%) and 100,945 phishing (74%) URLs, described by 53 features, including 49 numerical and 4 categorical attributes. These features were extracted from URL structures, domain characteristics, and web page content, with detailed algorithms for feature extraction outlined in [3]. Approximately 6% of the URLs represent subpages within larger domains, highlighting the dataset’s diversity.
This dataset does not include timestamps due to its aggregation from multiple sources lacking temporal metadata. While this limits its use in time-based analyses, it remains suitable for static evaluations, particularly in contexts where incremental learning models are applied. The absence of timestamps does, however, challenge the principle of adapting algorithms to evolving phishing URL characteristics over time.
The PhiUSIIL dataset has been utilized in related works, such as [25], where it supported the development of models for phishing URL detection. Its extensive size and rich feature set make it a valuable benchmark for such research.
The PhiUSIIL [3] algorithm uses features extracted from URLs, HTML code (by visiting websites and analyzing their content), and generates new features from existing data. The results presented in the paper [3] indicate that this algorithm improves phishing detection accuracy, especially when used in a pre-training approach. The PhiUSIIL algorithm in the fully incremental training procedure obtained an accuracy of 99.24% and 99.79% in the pre-training approach. In the paper [25], we used the PhiUSIIL dataset to study various issues, achieving impressive results. Our methodology was able to identify phishing links with 99.52% accuracy, indicating that the overwhelming majority of phishing links were correctly identified and flagged. Moreover, our methodology proved highly effective in minimizing the occurrence of false positives - only 0.42% of harmless links were misclassified as phishing links. This proves that our solution is very accurate and minimizes the risk of misclassifying legitimate links. In addition, our approach proved equally effective in identifying non-phishing links, achieving a classification accuracy of 99.58%. In general, the method has been demonstrated to be extremely effective in detecting phishing while simultaneously reducing the likelihood of false alarms.
The lack of clear timestamps in the PhiUSIIL dataset makes it possible only to mimic the way phishing link characteristics change over time, based on the assumption that records with lower indexes were collected earlier than those with higher indexes. However, it is hard to decide which links should belong to which part. For this reason, results achieved with this dataset are auxiliary. In this contribution, we study the FreshMail dataset only, since it is the only available dataset suitable for the presented approach.
Web page phishing detection [43] was created to benchmark phishing detection algorithms. The dataset is medium-sized, consisting of 11,430 URLs with 87 extracted features, and is balanced (1:1 phishing to non-phishing ratio). The features can be divided into 3 classes:
56 extracted from the structure and syntax of URL
24 extracted from the content of their correspondent pages
7 are extracted by querying external services
Phishing Site URLs [44] is a large unbalanced dataset. URLs in this dataset were collected from websites that offer malicious links. In total, the dataset consists of 507,195 URLs, most of which (72%) are good links, while approximately 28% are phishing links. No features are extracted: the dataset has only two columns, the first containing the URL and the second the label.
Datasets for Phishing Websites Detection [45] is unique in that it has a dedicated web application for creating and exporting phishing data. The authors provide two medium-sized versions of the dataset, which differ mainly in the number of non-phishing links: the smaller dataset contains 27,998 non-phishing links, while the larger contains 58,000. Both datasets contain 30,647 phishing URL instances. Each instance is described by 111 attributes. However, the dataset does not contain the URLs themselves; only numeric features are available.
PhishStorm dataset[46], introduced and evaluated in the paper titled “PhishStorm: Detecting Phishing with Streaming Analytics”, comprises 96,018 unique URLs. The dataset is balanced, containing 48,009 legitimate URLs and 48,009 phishing URLs. Each entry in the dataset is identified by a “domain” column, representing the URL, and a “label” column, indicating whether the URL is legitimate (0) or phishing (1). Additionally, for each URL, there is a set of 12 features introduced in the paper [46].
B. Freshmail Dataset
FreshMail, a Polish enterprise, focuses on offering extensive e-mail marketing support and operates a well-known e-mail marketing platform. Each month, countless marketers worldwide utilize this service to dispatch over a billion e-mails. As they advance the SendGuard initiative, FreshMail aims to harness intelligent data analysis to advise marketers, while crafting an e-mail campaign, on whether it should reach all recipients. Furthermore, within the SendGuard project, FreshMail is dedicated to devising strategies for forecasting phishing e-mails to automatically filter them before sending.
FreshMail has furnished us with a dataset comprising 2,564,973 instances and 19 features. Within this dataset, 2,383,902 instances correspond to non-phishing links, while the remaining 181,071 are phishing links. The features were selected on the basis of expert assessment and their impact on the classification of links as phishing. The feature selection process was an additional preparatory step for the research that is the main thread of this article.
At the beginning, we decided to generate as many features as possible describing the properties of links based on their shallow analysis. As the links were collected over many months, in-depth analysis was not possible, because in many cases the addresses in the links were already out of date. For the shallow analysis, the links were divided into parts: full_domain, path, port, query, scheme, subdomain and tld_domain, and attributes were generated for each part. In addition to typical statistical attributes such as the number of underscores, the number of digits, the number of capital letters, etc., we generated attributes indicating the presence of characters pretending to be letters of the Latin alphabet, and attributes containing information on the presence of company/domain names and some keywords. For the keywords and company/domain names, we searched for both exact matches and words at a Levenshtein distance of 1 or 2. Finally, each link was represented by 112 attributes.
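The kind of shallow analysis described above can be sketched in Python using only the standard library. The attribute names, the keyword list and the tokenization of the host name are assumptions made for this illustration; they are not the 112 attributes used in the study.

```python
from urllib.parse import urlsplit


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def shallow_features(url, keywords=("login", "paypal")):
    """Hypothetical subset of shallow attributes computed from the URL text only."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    tokens = [t for t in host.replace("-", ".").split(".") if t]
    feats = {
        "path_underscores": parts.path.count("_"),
        "host_digits": sum(ch.isdigit() for ch in host),
        "url_capitals": sum(ch.isupper() for ch in url),
        "host_non_ascii": int(any(ord(ch) > 127 for ch in host)),
    }
    for kw in keywords:
        best = min([levenshtein(tok, kw) for tok in tokens] or [len(kw)])
        feats[f"{kw}_exact"] = int(best == 0)       # exact keyword match
        feats[f"{kw}_near"] = int(1 <= best <= 2)   # differs by 1 or 2 edits
    return feats


print(shallow_features("http://paypa1-secure.example.com/Log_in?id=1"))
```

The example URL is invented; it shows how a look-alike token such as "paypa1" is flagged as a near match rather than an exact one.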
In the first stage of feature selection, features were analyzed for the variety of values they accepted. Features that took the same value for all or almost all links were rejected. The set of attributes was thus reduced to 77. For these attributes a correlation matrix was determined. Attributes that were highly correlated with others were removed from the set. The threshold for the correlation coefficient for highly correlated attributes was set at 0.75 (absolute value).
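A minimal sketch of this first selection stage is given below. It assumes a greedy rule (a feature is dropped if it is highly correlated with a feature already kept) and only removes strictly constant features, whereas the study also rejected features that were constant for almost all links; the function names and the toy columns are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def select_features(columns, threshold=0.75):
    """Drop constant columns, then greedily drop columns highly correlated with a kept one."""
    kept = []
    for name, values in columns.items():
        if len(set(values)) == 1:          # same value for all links
            continue
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy feature columns (made up): "b" nearly duplicates "a", "d" is constant.
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.1],
    "c": [1, 0, 1, 0],
    "d": [5, 5, 5, 5],
}
print(select_features(columns))
```

Because the rule is greedy, the result depends on the order in which features are visited; which member of a correlated pair was actually removed in the study is not stated in the text.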
Further reduction was carried out using link classifier models. Model learning and testing were carried out on a set of 500,000 links from 2020 (split: 80% training set, 20% testing set). In addition, model validation was performed on a set of 250,000 links from 2021. The aim of this approach was to show how classification accuracy changes for newer links, which may exhibit slightly different characteristics, e.g. a different link shortening method. The first approach used attribute standardization and random forest based models. For the reference model containing all 77 conditional attributes, an accuracy of 0.978 was obtained on the test set and 0.895 on the validation set. In subsequent steps, the set of conditional attributes was reduced to 50, 30, 25, and 19 attributes, respectively. For the last set, an accuracy of 0.975 was obtained on the test set and 0.861 on the validation set. The most important features are shown in Table 1.
The second approach dropped the standardization of attributes and focused on decision trees. We wanted to generate a model easily interpretable for humans. It was an attempt to answer the question: why can we classify links based on this particular set of 19 features?
Dropping attribute standardization already produced better results than before: an accuracy of 0.967 on the test set and 0.924 on the validation set. It is worth mentioning that the random forest without attribute standardization achieved an accuracy of 0.975 on the test set and 0.938 on the validation set (and an F1-score of 0.97 on the test set and 0.92 on the validation set). This showed a relatively small decrease in performance on newer links.
Unfortunately, even the simplest models in the form of decision trees (for example a tree with a maximum depth of 6 and using only 11 attributes) did not allow an easy interpretation of the unambiguous influence of individual features on the decision. There were cases where the key information deciding whether a link belonged to a particular class was the number of characters in the domain. Finally, the proposed set of 19 features was accepted by Freshmail experts and these features were used for further research.
The method of obtaining such features from the original textual URL is called the URL-based feature extraction method [4]. Additionally, each URL is accompanied by the date of its first occurrence within the system. The dataset spans a total duration of 120 days.
Among the evaluated datasets, the FreshMail dataset uniquely fulfills the criteria essential for our experiments. The principal factor is its capacity to accurately mirror the company’s operational processes, involving continuous analysis of email content. In practical terms, this implies that a classifier developed on historical data—such as data spanning several days—necessitates regular retraining, a process inherently linked to the availability of dates within the dataset. The absence of temporal data precludes the consideration of this dynamic aspect of the process, thereby hindering the faithful replication of the company’s operational conditions.
The data was organized by the date of URL occurrence and divided into 12 parts, each containing data from 10 days. The first part, covering the first 10 days, was divided into a training and validation set in a 60%-40% ratio. These sets were used to create the proposed model (see Alg. 2). The next 10 days (from day 10 to day 20) were used as a test set to evaluate the model. In the next iteration, the data from days 10–20 were split into training and validation sets, and the data from days 20–30 served as the test set. This process was repeated until day 120, with the last part (days 110-120) reserved exclusively for testing. The entire procedure is described in Alg. 1 and illustrated in the diagram in Fig. 2.
Fig. 2. Dataset preparation. Here d denotes the dataset of one period, which is divided into training (t) and validation (v) datasets (cf. [25]).
Algorithm 1 FreshMail Preprocessing
Input:
d: The dataset to be preprocessed.
f: The number of consecutive days included in each segment for model creation (e.g., 10).
k: The total number of segments into which the dataset will be divided (e.g., 11).
Output: A list containing the training, validation, and testing datasets.
Procedure:
Sort the dataset d by the chronological order of URL occurrences.
Partition the dataset d into
Initialize an empty list tvt to store the train, validation, and test datasets.
for
Randomly divide the i-th segment
Extract the subsequent segment
Append
end for
return tvt
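A possible reading of Algorithm 1 is sketched below in Python. It follows the prose description (12 ten-day parts, a 60%-40% training/validation split, and the next part used as the test set, giving 11 triples); the record layout, the parameter defaults and the function name are assumptions made for this illustration.

```python
import random
from datetime import date, timedelta


def freshmail_preprocess(records, f=10, k=12, train_frac=0.6, seed=0):
    """Return a list of (train, validation, test) triples over consecutive periods.

    records: list of (first_seen_date, features, label) tuples (layout assumed).
    f: number of consecutive days per segment; k: number of segments.
    """
    rng = random.Random(seed)
    records = sorted(records, key=lambda r: r[0])   # chronological order
    start = records[0][0]
    segments = [[] for _ in range(k)]
    for rec in records:
        idx = (rec[0] - start).days // f
        if idx < k:                                 # ignore data beyond k segments
            segments[idx].append(rec)
    tvt = []
    for i in range(k - 1):                          # the last segment is test-only
        seg = segments[i][:]
        rng.shuffle(seg)                            # random train/validation split
        cut = int(train_frac * len(seg))
        tvt.append((seg[:cut], seg[cut:], segments[i + 1]))
    return tvt


# Synthetic example: 120 days, 5 records per day (dates and labels are made up).
records = [(date(2021, 1, 1) + timedelta(days=d), None, d % 2)
           for d in range(120) for _ in range(5)]
tvt = freshmail_preprocess(records)
print(len(tvt), [len(part) for part in tvt[0]])
```

Each triple pairs the training and validation data of one period with the test data of the following period, so the model is always evaluated on links newer than those it was fitted on.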
Algorithm 2 Proposed Model - Training Procedure for Individual Period (cf. [25])
Input:
A: aggregation function (e.g. the arithmetic mean)
g: minimal true positive rate (e.g. 0.995)
C: classifiers list (e.g. [PassiveAggressiveClassifier, BernoulliNB, SGDClassifier])
Output:
t - optimal threshold
Procedure
Randomly split
for c in C do
end for
Create an empty matrix X, where each column will contain the confidence values predicted by a given classifier that an email is phishing
for c in C do
end for
while
if
return t
end if
if
end if
end while
return
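Since several steps of Algorithm 2 are not fully legible here, the following Python sketch shows only one plausible reading of its last part: the rows of the confidence matrix X are merged by an aggregation function A, and the threshold t is raised on a grid while the validation TPR stays at or above the required level g. Training of the base classifiers is omitted, and the grid step, the function names and the sample data are assumptions.

```python
def aggregate_confidences(X, aggregate):
    """Apply an aggregation function to each row of the confidence matrix X."""
    return [aggregate(row) for row in X]


def true_positive_rate(scores, labels, threshold):
    """Fraction of phishing samples (label 1) whose score reaches the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def find_threshold(scores, labels, min_tpr, step=0.01):
    """Largest grid threshold in [0, 1] whose validation TPR is still >= min_tpr."""
    best = 0.0
    n_steps = int(round(1 / step))
    for i in range(n_steps + 1):
        t = i * step
        if true_positive_rate(scores, labels, t) < min_tpr:
            break                      # TPR only decreases as t grows
        best = t
    return best


# Made-up validation data: one row per sample, one column per base classifier.
X = [[0.9, 0.8, 0.95], [0.7, 0.6, 0.9], [0.2, 0.1, 0.3],
     [0.4, 0.2, 0.1], [0.85, 0.9, 0.7]]
labels = [1, 1, 0, 0, 1]
scores = aggregate_confidences(X, lambda row: sum(row) / len(row))
print(find_threshold(scores, labels, min_tpr=1.0))
```

Choosing the largest admissible threshold is what ties the procedure to false positive reduction: among all thresholds that keep the required TPR, the highest one flags the fewest legitimate samples.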
Methodology
Based on the findings presented in [3] and the characteristics of the dataset, we have adopted an incremental learning approach in our research. This approach enables continuous updating of the classifier’s knowledge as new data becomes available. Initially, the model is trained using the
In addition, the models used in [3] were also mentioned in [6] as being used in the area of phishing detection. The Passive-Aggressive Classifier was first used for classifying links in [47]. Likewise, the SGDClassifier was first introduced to this problem in [48]. Various Naive Bayes classifiers were examined for phishing detection in [49]. The Passive-Aggressive Classifier is also included in [24] as a popular incremental model, while the SGDClassifier is included in the surveys [20], [24].
A. Base Models
The following models were utilized with the standard parameters from the scikit-learn library. While these parameters can be optimized to potentially enhance predictive performance, particular attention should be given to the BernoulliNB model. This model is designed to handle binary features, and its default binarization threshold is set to 0.0, which may require adjustment depending on the dataset characteristics.
1) PassiveAggressive (PA)
The Passive-Aggressive Classifier is a type of online learning algorithm that adjusts its model parameters in response to incoming data, making it particularly suited for scenarios where data arrives sequentially or in a streaming fashion. It is “passive” when the prediction is correct and within a predefined margin, avoiding unnecessary updates. Conversely, it becomes “aggressive” when the prediction is incorrect or falls outside the margin, updating its parameters to correct the mistake aggressively.
This algorithm is commonly used in text classification tasks, where it can handle large, sparse datasets efficiently. Its implementation focuses on minimizing classification loss while maintaining a simplicity that allows it to adapt dynamically. Passive-Aggressive Classifiers have demonstrated effectiveness in applications such as sentiment analysis, spam detection, and real-time recommendation systems.
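The passive and aggressive behavior described above can be illustrated with a minimal pure-Python learner. The sketch assumes the PA-I variant with hinge loss and labels in {-1, +1}; it is not the scikit-learn implementation used in the experiments, and the class name is hypothetical.

```python
class PassiveAggressiveSketch:
    """Minimal binary Passive-Aggressive (PA-I) learner; labels are -1 or +1."""

    def __init__(self, n_features, C=1.0):
        self.w = [0.0] * n_features
        self.C = C  # cap on the step size (aggressiveness)

    def decision(self, x):
        """Signed score; its sign is the predicted class."""
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def partial_fit(self, x, y):
        """Update the weights with a single example (x, y)."""
        loss = max(0.0, 1.0 - y * self.decision(x))   # hinge loss
        if loss == 0.0:
            return                                    # passive: margin satisfied
        norm_sq = sum(xi * xi for xi in x)
        if norm_sq == 0.0:
            return                                    # zero vector carries no signal
        tau = min(self.C, loss / norm_sq)             # aggressive step size
        self.w = [wi + tau * y * xi for wi, xi in zip(self.w, x)]


model = PassiveAggressiveSketch(n_features=2)
model.partial_fit([1.0, 0.0], 1)    # aggressive update
model.partial_fit([1.0, 0.0], 1)    # passive: no change
model.partial_fit([0.0, 2.0], -1)   # aggressive update
print(model.w)
```

Because every update uses one example at a time, the same method can be called again whenever a new period of data arrives, which is the property that makes the model suitable for incremental learning.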
2) BernoulliNB (BNB)
The Bernoulli Naive Bayes Classifier (BernoulliNB) is a variant of the Naive Bayes algorithm tailored for binary feature vectors, where each feature represents the presence or absence of a particular attribute. This method assumes that features are conditionally independent given the class label, simplifying the computation of posterior probabilities.
BernoulliNB is particularly effective in document classification tasks, such as spam filtering or sentiment analysis, where text data is represented in a binary “bag-of-words” format. Unlike Multinomial Naive Bayes, which focuses on feature frequency, BernoulliNB emphasizes feature presence, making it a better choice when the absence of features carries significant discriminative power.
3) SGDClassifier (SGDC)
The Stochastic Gradient Descent Classifier (SGDClassifier) is a highly versatile and scalable learning algorithm that optimizes linear models, such as Support Vector Machines (SVMs) or logistic regression, using stochastic gradient descent. By updating the model parameters based on each training example rather than the entire dataset, SGDClassifier achieves high computational efficiency, particularly for large-scale or high-dimensional datasets.
SGDClassifier supports various loss functions, such as hinge loss for linear SVMs or log loss for logistic regression, making it adaptable to diverse classification tasks. Its capability to handle sparse data efficiently and integrate with feature extraction pipelines has made it a standard choice for tasks like text categorization, image recognition, and recommendation systems.
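To make the per-example update concrete, the sketch below performs single stochastic gradient steps for logistic regression (log loss). It is a simplified illustration with a fixed learning rate and no regularization, so it should not be read as the behavior of scikit-learn's SGDClassifier; the function names and the toy data are hypothetical.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def sgd_log_loss_step(w, b, x, y, lr=0.1):
    """One stochastic gradient step for logistic regression; y is 0 or 1."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = p - y                                   # derivative of log loss w.r.t. the score
    w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    return w, b - lr * grad


# Toy stream of (features, label) pairs, processed one example at a time.
w, b = [0.0], 0.0
for _ in range(200):
    for x, y in [([0.0], 0), ([1.0], 1)]:
        w, b = sgd_log_loss_step(w, b, x, y)
print(w, b)
```

Replacing the log loss gradient with the hinge loss subgradient would give the linear SVM variant mentioned above; only the line computing grad would change.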
B. State of the Art Models
Recently, among machine learning algorithms, neural networks, especially CNNs, LSTMs and deep feed forward networks, have become a popular choice for phishing link detection [7], [8]. In [7], the CNN was reported as having the highest accuracy among various other machine learning approaches. For this reason, we considered neural networks as state of the art (SOTA) models.
In the research, feed forward, convolutional (CNN) and LSTM neural network models were utilized. Two feed forward models were examined. The first one, named ff_1, included a single hidden layer consisting of 200 neurons (Fig. 8a). The second one, labeled ff_2, included three hidden layers (each consisting of 200 neurons) and one dropout layer (Fig. 8b). In the case of the 2D convolutional neural network, abbreviated as cnn, one additional feed forward layer was used, to receive
The LSTM model (abbreviated as lstm_1) has 20 hidden dimensions (Fig. 7). The learning rate was set to 0.0001 in the first iteration and changed to 0.00005 in subsequent iterations. The deep Feed Forward model was an exception: its learning rate decreased by 0.00009 with each training iteration. The first Feed Forward model was trained for 300 epochs and the second one for 50 epochs, while the LSTM and convolutional models were trained for 100 epochs. The structures of the neural networks can be found in the Appendix. The loss function was binary cross-entropy. Each of the models was trained with the Adam optimizer.
The learning process of the neural networks was adapted to the incremental learning approach. In the case of online learning we trained models on
Neural network models can also be prone to overfitting. Various techniques have been proposed to prevent overfitting of neural networks, including early stopping, regularization, dropout, and network pruning [50], [51], [52], [53].
In this study, a Dropout layer was applied in one of the neural networks to prevent overfitting. Specifically, the ff_2 network included a Dropout layer with a value of 0.1. Additionally, a manual early stopping technique was used, which involved analyzing the loss function plots and adjusting the number of training epochs accordingly. Based on this analysis, the lstm_1, cnn_1d, and cnn_2d networks were trained for 100 epochs, while the ff_2 network was trained for 50 epochs. The ff_1 network achieved the best results when trained for 300 epochs.
C. Proposed Model
We propose a heterogeneous ensemble model that employs aggregation functions and uninorms. The fitting procedure requires training n base models, as described in Section IV-A. While we have used the models mentioned in that section, exploring other groups of models in future work is a promising direction.
For prediction, aggregation functions and uninorms play a central role. Each of the n models produces a confidence score indicating the likelihood that a given sample is phishing. For an individual sample, this results in a vector of confidence scores denoted as
D. Model Tuning
The dataset parts
1. Define the desired true positive rate, $\text{TPR}_{\text{target}}$, for example, $\text{TPR}_{\text{target}} = 0.9$.
2. Fit or update the model using the $t_{i}$ subset of the dataset.
3. Apply the model to the $v_{i}$ validation subset, which outputs a confidence level for each instance, indicating the likelihood that a given URL is a phishing URL.
4. Gradually increase the threshold value g, starting from 0. After each adjustment, evaluate the true positive rate for the validation subset.
5. Stop increasing g when the obtained TPR exceeds $\text{TPR}_{\text{target}}$, and save the corresponding value of g as the optimal threshold.
The algorithm finds the optimal threshold t by performing an exhaustive search of candidate values in the range of 0.01 to 0.99, with a step of 0.01. If the result obtained with the chosen strategy is greater than or equal to the chosen threshold t, the link is classified as phishing. Otherwise, it is classified as ordinary. Then, the resulting decisions are compared with the actual results from the validation set
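The exhaustive threshold search described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `true_positive_rate` and `tune_threshold` are hypothetical, labels are encoded as 1 (phishing) and 0 (ordinary), and the selection rule (keep the largest candidate threshold whose validation TPR still reaches the target) is our assumption, since the exact criterion is not spelled out in the text.

```python
def true_positive_rate(scores, labels, threshold):
    """TPR on a validation set: share of phishing items with score >= threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        return 0.0
    return sum(1 for s in positives if s >= threshold) / len(positives)


def tune_threshold(scores, labels, tpr_target=0.9):
    """Scan candidates 0.01, 0.02, ..., 0.99 and return the chosen threshold.

    Assumption: the largest threshold with TPR >= tpr_target is preferred,
    because a higher threshold flags fewer ordinary links as phishing.
    If no candidate reaches the target, the smallest candidate is returned.
    """
    candidates = [k / 100 for k in range(1, 100)]
    best = candidates[0]
    for t in candidates:
        if true_positive_rate(scores, labels, t) >= tpr_target:
            best = t
    return best


# Toy validation data: ten phishing confidences followed by three ordinary ones.
val_scores = [0.95, 0.91, 0.83, 0.77, 0.64, 0.58, 0.52, 0.47, 0.405, 0.12,
              0.30, 0.20, 0.62]
val_labels = [1] * 10 + [0] * 3
t_opt = tune_threshold(val_scores, val_labels, tpr_target=0.9)
```

In this toy example nine of the ten phishing confidences are at least 0.405, so every candidate up to 0.40 keeps the validation TPR at 0.9, and the next candidate, 0.41, drops it to 0.8.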
E. Experiments Description
Since the Freshmail dataset has more than 1,000 instances, stratified k-fold cross-validation is not recommended and hold-out validation is preferred [54]. The hold-out validation was complemented by a validation dataset used for threshold adjustment.
Experiments were performed as follows. We collected data from 10 days (
After training, we utilize the dataset
Using this labeled data, we evaluate the performance of our framework during this period by comparing the model’s predictions against the specialists’ decisions. We compute various classification quality measures to assess performance. With the evaluation of
We split
Over the next 10-day period, a new dataset,
The strategies employed by attackers can evolve over time; for instance, phishing links during the New Year may differ from those used at other times of the year. In addition, attackers are constantly developing new methods to circumvent filters that detect phishing links. Therefore, the detection model should be gradually improved, while maintaining knowledge of previous attacks. The process described in Alg. 2 is repeated by incrementally training the model using
Algorithm 3 Proposed Model - Predict Procedure (cf. [25])
Input:
A: a given aggregation function (used during fit procedure)
t: a given threshold value (calculated in training)
C: a given collection of classifiers (learned on
Output:
prediction list v for
Begin
Create empty prediction matrix X for
for c in C do
{Matrix where each row represents a collection of confidences (probabilities) that a given URL is phishing}
end for
Create an empty list v
for
if
else
end if
end for
return v
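The predict procedure can be sketched in Python as follows. The sketch makes several assumptions that are not part of Algorithm 3: each classifier is modelled as a plain callable that returns a confidence in [0, 1], the function name `predict_ensemble` and the toy classifiers are hypothetical, and the arithmetic mean and the minimum serve only as example aggregation functions. The decision rule (aggregated value greater than or equal to t means phishing) follows the description given in Section IV-D.

```python
def predict_ensemble(classifiers, aggregate, threshold, urls):
    """Return a list with 1 (phishing) or 0 (ordinary) for every URL."""
    # Each column holds the confidences produced by one classifier.
    columns = [[clf(url) for url in urls] for clf in classifiers]
    # Transpose so that each row of X describes one URL.
    X = [list(row) for row in zip(*columns)]
    v = []
    for row in X:
        v.append(1 if aggregate(row) >= threshold else 0)
    return v


def arithmetic_mean(values):
    return sum(values) / len(values)


# Hypothetical stand-ins for trained base models.
toy_classifiers = [
    lambda url: 0.9 if "login" in url else 0.1,
    lambda url: 0.6 if "verify" in url else 0.2,
]
toy_urls = ["http://example.test/login-verify", "http://example.test/news"]

mean_decisions = predict_ensemble(toy_classifiers, arithmetic_mean, 0.5, toy_urls)
min_decisions = predict_ensemble(toy_classifiers, min, 0.7, toy_urls)
```

With the minimum as the aggregation function, a single unconvinced classifier is enough to keep a URL out of the phishing class, which is why the second call classifies both URLs as ordinary.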
To evaluate the global classification quality for a given classifier with a fixed parameter t, we use a confusion matrix, which uses the following terminology:
TP (True Positives) - elements from the phishing class that were correctly classified as phishing.
TN (True Negatives) - elements from the normal class that have been correctly classified as normal.
FP (False Positives) - elements from the normal class that have been misclassified as phishing.
FN (False Negatives) - elements from the phishing class that have been misclassified as normal.
We can put the above information in the table, where the rows contain elements from the phishing class and normal class, respectively, while the columns contain elements classified by the classifier into the phishing class and normal class, respectively.
Each of these elements is crucial in evaluating the performance of the classifier, enabling further analysis of classification errors and optimization of the model in the context of minimizing false positives and improving phishing detection.
The accuracy calculated for the test objects from the phishing class is called sensitivity (TPR - true positive rate, or recall), and the accuracy calculated for the test objects from the normal class is called specificity (TNR - true negative rate). In addition, we consider FPR - the false positive rate. Using the above notation, these measures are calculated according to the following formulas:
\begin{equation*} TPR=\frac{TP}{P}=\frac{TP}{TP+FN}, \qquad TNR=\frac{TN}{N}=\frac{TN}{TN+FP}, \qquad FPR=\frac{FP}{N}=\frac{FP}{FP+TN} \end{equation*}
For all test datasets, the individual values of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP) were determined, and a global confusion matrix was subsequently constructed. Based on the confusion matrix, the global true positive rate (TPR), false positive rate (FPR), and true negative rate (TNR) were calculated. It should be noted that TNR is equal to 1 - FPR. This methodology makes it possible not only to determine a quality measure but also to keep track of the number of false positives (FPs).
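A minimal sketch of this computation is given below. The function name `global_rates`, the dictionary layout of the per-dataset counts, and the numbers themselves are assumptions used only for illustration; they are not results from the experiments.

```python
def global_rates(per_dataset_counts):
    """Sum the confusion-matrix counts over all test datasets and derive the rates."""
    tp = sum(c["TP"] for c in per_dataset_counts)
    tn = sum(c["TN"] for c in per_dataset_counts)
    fp = sum(c["FP"] for c in per_dataset_counts)
    fn = sum(c["FN"] for c in per_dataset_counts)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "TPR": tp / (tp + fn),   # sensitivity (recall)
        "TNR": tn / (tn + fp),   # specificity
        "FPR": fp / (fp + tn),   # false positive rate
    }


# Made-up counts for two test periods.
counts = [
    {"TP": 8, "FN": 2, "TN": 85, "FP": 5},
    {"TP": 9, "FN": 1, "TN": 80, "FP": 10},
]
rates = global_rates(counts)
```

Since TNR and FPR share the denominator N = TN + FP, the two values always sum to 1, while the absolute number of false positives stays available in the summed counts.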
To assess the obtained results we also used the HMRS measure [26], which is given by the following equation:
\begin{equation*} HMRS=\frac{HM\left(REC \cdot \frac{P}{M},\; SEL \cdot \frac{N}{M}\right)}{HM\left(\frac{P}{M},\; \frac{N}{M}\right)} \end{equation*}
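The formula can be evaluated directly from the confusion-matrix counts, as in the sketch below. Several readings are assumed here because they are not stated explicitly next to the equation: HM is taken to be the harmonic mean of two numbers, REC the recall (TPR), SEL the specificity (TNR), and M = P + N the total number of instances. The function names are hypothetical, and the counts are illustrative only.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def hmrs(tp, fn, tn, fp):
    """HMRS under the assumptions REC = TPR, SEL = TNR and M = P + N."""
    p = tp + fn          # number of phishing instances
    n = tn + fp          # number of normal instances
    m = p + n            # assumed total number of instances
    rec = tp / p
    sel = tn / n
    numerator = harmonic_mean(rec * p / m, sel * n / m)
    denominator = harmonic_mean(p / m, n / m)
    return numerator / denominator


example_value = hmrs(tp=17, fn=3, tn=165, fp=15)
perfect_value = hmrs(tp=20, fn=0, tn=180, fp=0)
```

Under these assumptions the denominator normalizes the measure, so that a classifier with REC = SEL = 1 obtains HMRS = 1 regardless of the class imbalance.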
F. Measuring the Potential of Model Overfitting
In the context of the proposed model, there is an assumption that potential for overfitting is greatest at the level of the underlying estimators. This is due to the fact that the proposed model does not adjust the parameters of the aggregations utilized, or the aggregations lack parameters that can be adjusted. The threshold tuning method presented in Algorithm 2 can be regarded as a tuning of the model’s hyperparameters, making it also susceptible to overfitting. However, it should be noted that the learning data is unbalanced, and for such data, threshold moving is a popular and recommended technique [55]. In addition, the proposed algorithm is used as a detector, which has some very specific application: it should be calibrated to generate a relatively small number of false positives. This is achieved through the implementation of a sliding threshold, which optimizes a preferred quality measure (min TPR).
For PassiveAggressiveClassifier and Stochastic Gradient Descent Classifier models it is possible to apply popular techniques to counteract overfitting, i.e. early stopping and regularization (cf. [51]). Bernoulli Naive Bayes is an estimator for which such techniques cannot be applied.
In order to ascertain the presence of overfitting at the level of the underlying models, the following experiments were conducted: each base classifier was trained individually, with the model first being taught using period
The outcomes of the experiments revealed that, on the HMRS metric there is mostly no significant reduction in quality on the test data, and at times, the quality even surpassed that of the training data (this means that the models do not adjust too much to the training data, because they retain the ability to separate decision classes). However, certain model parameters led to underfitting, characterized by premature termination of learning and a substantial decline in classification quality compared to other results.
In summary, we believe that sufficient techniques were used to counteract model overfitting, which was further confirmed empirically. The additional tests show that our model does not overfit to a significant degree. In our opinion, using the regularization techniques available as standard parameters of the scikit-learn library and checking the results on separate datasets effectively reduced the risk of overfitting.
Experimental Results
First, implementation details are provided; next, the results of the experiments are presented.
The experiments were conducted using Python 3.12, utilizing a collection of widely recognized libraries for data manipulation, analysis, and machine learning tasks. Specifically, the experimental framework relied on the following versions of key Python packages: Scikit-learn (version 1.4.1.post1), Pandas (version 2.2.1), Matplotlib (version 3.8.4), and NumPy (version 1.26.4). The study utilizes a GPU-accelerated environment with CUDA 12.1 and the deep learning framework PyTorch 2.5.1, along with TensorBoard 2.18.0 for visualization and TorchMetrics 1.5.1 for metrics calculations. NVIDIA CUDA libraries optimize tensor computations and multi-GPU communication.
Tables 4, 5, 6, 7, 8 gather the arithmetic mean values of the measures considered in the experiments, i.e. TPR, FPR, TP, FN, ACC (accuracy), and HMRS (the harmonic mean of TPR and TNR). The abbreviation mTPR stands for the arithmetic mean of TPR. Analogous abbreviations are used for the other measures. The best three results in each column are shown in green, while the worst three results in each column are shown in red.
First, our analysis focused on whether the desired true positive rate (TPR) obtained on the validation set is also met on the test set. We tested five desired TPR values g: 0.9, 0.95, 0.99, 0.995, and 0.999. Mean values of this measure are provided in Tables 4-8 in the column mTPR. Additionally, heatmaps for the results in folds can be found in the supplementary materials (cf. [9], Figures S1-S5). The heatmaps were similar across the tested values of g; therefore, Fig. 3 shows only the heatmap for g = 0.995. Looking at the base models, we can observe that the performance of SGDC drops drastically in time period d7. This behaviour is also noticeable for cnn_1D. Since our proposed model includes SGDC, specific aggregations tend to prioritize the bad decisions of SGDC, especially those functions whose results tend to be closer to the lower boundary (minimum) of the aggregated values.
Regarding the minimal required TPR level and the mean values determined over all folds, at the minTPR = 0.9 level only one type of neural network achieved a result above minTPR (cf. Table 4). All base models achieved the required minTPR = 0.9, and 8 out of 20 proposed models were able to achieve the desired level. For the remaining levels (cf. Tables 5, 6, 7, 8), none of the neural networks reached the required threshold. Starting from minTPR = 0.99, none of the base models (PassiveAggressive, BernoulliNB, SGDClassifier) is among the models that achieved the required minTPR level. The best performance was obtained by the proposed model
Next, we focused on false positive instances, i.e. URLs that our system treated as phishing, so that they were not sent and need to be analyzed by people. Mean values of this measure are provided in Tables 4-8 in the column mFP. Additionally, we created heatmaps for all levels of the desired TPR and the results in folds, which can be found in the supplementary materials (Figures S6-S10). We chose two heatmaps, for the desired TPR set to 0.9 and 0.999, which are presented in Figures 4 and 5. It can be seen that with rising g there are more FP instances. What we are focused about that time periods
When it comes to the mean values of the FPR measure (column mFPR in Tables 4-8), neural networks achieved the best results (lowest errors) at all levels except for the level minTPR = 0.999. For the minTPR = 0.995 level, the proposed model based on the uninorm
Accuracy is probably the best-known classification measure. Mean values of this measure are provided in Tables 4-8 in the column mAcc. Heatmaps for accuracy and all folds can be found in Figures S16-S20 in the supplementary materials. Analysis of these heatmaps shows that for most of the base models and SOTA models accuracy drops after
To consider the TPR and FPR measures jointly, we used the HMRS measure mentioned before. Mean values of this measure are provided in Tables 4-8 in the column mHMRS. Regarding the HMRS measure and the mean values determined over all folds, for lower minTPR levels it can be seen that models based on aggregations are in the best positions (and, in addition, they meet the requirement to achieve minTPR). These are models based on
Statistical tests were also performed to compare the obtained values of the performance measures for the 28 considered models. According to the Kruskal-Wallis test, in most cases (of the considered desired TPR levels and classification measures) there are significant differences between the compared groups (cf. Table 9).
For further analysis, special attention was paid to the HMRS measure, as it is an aggregation of TPR and TNR and therefore gives us more general information about the model. Descriptive statistics of the HMRS measure are provided in the supplementary materials in Tables S1-S5. These tables summarize key statistical descriptors for this measure, including the median values, which are useful when discussing the results of Dunn's test with the Holm-Bonferroni correction, used to find out which groups of models obtained statistically different results of the HMRS measure. According to the results of Dunn's test with the Holm-Bonferroni correction:
when the desired TPR was set to 0.9, there is a significant difference when we compare the HMRS median result of $A_{mx}$ [0.9637] with cnn [0.82645] or cnn\_1D [0.89078];
when the desired TPR was set to 0.95, there is a significant difference when we compare the HMRS median result of cnn [0.9029] with $A_{mx}$ [0.9637], $A_{armn}$ (0.05) [0.96466] or $A_{armn}$ (0.1) [0.96205];
when the desired TPR was set to 0.99, there are no significant differences when we compare the HMRS median results among groups;
when the desired TPR was set to 0.995, there is a significant difference when we compare the HMRS median result of lstm\_1 [0.99203] with $A_{armn}$ (0) [0.93405], $A_{prmn}$ (0.05) [0.93405], $A_{prmn}$ (0.1) [0.93405], $A_{prmn}$ (1) [0.93405], $U_{L}$ (e=0.95) [0.94033], $U_{min}$ (e=0.6) [0.93405], $U_{min}$ (e=0.8) [0.93405] or $U_{min}$ (e=0.95) [0.93405];
when the desired TPR was set to 0.999, there is a significant difference when we compare the HMRS median result of cnn [0.99753] with $A_{armn}$ (0) [0.93405], $A_{gmmn}$ (0.05) [0.93493], $A_{gmmn}$ (0.1) [0.9365], $A_{prmn}$ (0.05) [0.93405], $A_{prmn}$ (0.1) [0.93405], $A_{prmn}$ (1) [0.93405], $U_{L}$ (e=0.95) [0.94033], $U_{min}$ (e=0.6) [0.93405], $U_{min}$ (e=0.8) [0.93405], $U_{min}$ (e=0.95) [0.93405], and also of lstm\_1 [0.99836] with $A_{armn}$ (0) [0.93405], $A_{gmmn}$ (0.05) [0.93493], $A_{prmn}$ (0.05) [0.93405], $A_{prmn}$ (0.1) [0.93405], $A_{prmn}$ (1) [0.93405], $U_{L}$ (e=0.95) [0.94033], $U_{min}$ (e=0.6) [0.93405], $U_{min}$ (e=0.8) [0.93405] or $U_{min}$ (e=0.95) [0.93405].
Conclusion
The principal objective of the present study is to identify emails that should be checked by experts in order to determine whether the email should be sent. In the context of transactional emails, it is crucial to ensure that any e-mail blocked is indeed phishing. As such, the relationships discussed are particularly relevant in this context.
It is also noteworthy that the number of individual models included in an ensemble model used in incremental learning can have a significant impact on the complexity of the resulting classifier. It is therefore crucial to carefully consider both the number of models included in the ensemble and the choice of specific models for its construction.
One approach that can improve the effectiveness of the ensemble is to concentrate on a combination of models that are more confident about whether a particular URL is a phishing site, while other models remain uncertain in their decisions for those specific URLs. In addition, consider incorporating different learning algorithms into an ensemble model to better deal with the full spectrum of potential threats. The exploration of additional families of algorithms that facilitate this learning pattern may allow further improvements in the overall performance of the ensemble model, offering better protection against phishing attacks.
To sum up the results obtained by the neural networks and the proposed ensemble models based on aggregation functions, there is no clear winner with respect to all criteria. Models based on aggregation functions were more often able to reach the required minTPR level. Neural networks more often achieved better results with respect to the FPR measure. For the aggregated measure HMRS (taking into account TPR and TNR = 1-FPR), statistical tests showed, for 2 out of 5 considered minTPR levels, dominance of models based on aggregation functions (
It is also worth noting that training the proposed model should be more efficient than training a neural network using backpropagation with an optimizer such as stochastic gradient descent or Adam. This assumption is based on the fact that the examined neural networks have more than one layer, in which case the number of parameters to optimize could lead to long training times, especially on larger training data.
This aspect is especially important when the model is updated and the phishing detector in production is replaced with the updated one. If the learning process is fast, the maintenance break needed to replace the phishing detector can be limited to between a few minutes and a few hours. Otherwise, training the model could force longer maintenance breaks.
In future work, we would like to create ensemble models based on both neural networks and aggregation functions. This approach may combine the advantages of aggregation functions, used to combine models, with those of neural networks as machine learning models. However, we think that the aggregation functions for combining neural networks should be chosen more carefully, with respect to the properties of neural networks.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their valuable comments which helped to improve the final version of the article.