Eth-PSD: A Machine Learning-Based Phishing Scam Detection Approach in Ethereum

Recently, the rapid growth of blockchain technology in the financial field has drawn cybercriminals to launch blockchain-based attacks such as Ponzi schemes, scam wallets, and phishing scams. Currently, Ethereum is the most prominent blockchain-based platform and the first to support smart contracts. However, phishing scam accounts reportedly make up more than 50% of all cybercrimes in Ethereum. In response, this paper proposes a detection mechanism called Ethereum Phishing Scam Detection (Eth-PSD) that attempts to detect phishing scam-related transactions using a novel machine learning-based approach. Eth-PSD tackles some of the limitations of existing works, such as the use of imbalanced datasets, complex feature engineering, and lower detection accuracy. We also investigated the construction of a new, updated, balanced dataset that can be used to evaluate Eth-PSD effectively. Our experimental results indicate that Eth-PSD could efficiently detect phishing scams on Ethereum with a detection accuracy of 98.11% and a very low False Positive Rate of 0.01. Taken together, Eth-PSD showed an advantage over existing works in reducing the dimensionality of the dataset through feature engineering, and it improved the overall detection accuracy by at least 6% compared to the existing solutions from the related work.


I. INTRODUCTION
Blockchain is a distributed ledger that can efficiently record transactions among parties permanently and verifiably [1]. Recently, blockchain has attracted many researchers and investors because of the significant changes it brings to essential fields such as politics, finance, and science [2]. Nowadays, one of the most critical and challenging blockchain applications is cryptocurrency [3]. Bitcoin was the first practical implementation of cryptocurrencies [4]. Like Bitcoin, Ethereum is a successful large-scale blockchain-based application. Unlike Bitcoin, which supports only financial matters, Ethereum is the most extensive blockchain application that supports smart contracts [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Li Zhang.
Due to Ethereum blockchain characteristics such as immutability, Ethereum became a widely used platform for financial applications, bringing excellent development support. Along with this high-speed development, the lack of regulation has made Ethereum prone to many cybercrimes, such as phishing scams, scam wallets, and Ponzi schemes [6]. Another reason is the rapid growth of e-commerce, where most people trade goods or services online, which gives phishing scams more and more opportunities [7].
Phishing scam refers to forging official websites or contact information to steal private information, such as usernames, passwords, or addresses, for further gain [8]. In the traditional scenario, the victim receives an email that appears to come from an official website and instructs the victim to click on a specific link. After that, the victim modifies relevant information, and this information finally reaches the scammers through that fake webpage. These counterfeit pages can be spread publicly through emails, Google ads, chat apps, etc. [9]. In Ethereum, however, a phishing scam aims not only to obtain sensitive information of specific users but also to swindle money by spreading phishing addresses to victims through online chats, fake websites, and emails. Scammers made nearly $20 million in 2017, deceiving about 10,000 victims. A year later, scammers' earnings and the number of victims had both more than doubled. For a long time, phishing has been the most common scam, and phishing fraud is cyclical, meaning it occurs in cycles in response to price swings [10]. According to [11], phishing scam accounts make up more than 50% of all cybercrimes in Ethereum. This kind of scam has become a significant threat to Ethereum security. Therefore, there is an urgent need to efficiently detect phishing scams in Ethereum [12].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In Ethereum, phishing is one of the most widely used scams. It has become a critical issue that attracts increasing attention from researchers seeking efficient countermeasures. However, traditional phishing detection methods cannot efficiently detect phishing scams in Ethereum because only a tiny part of these scams is implemented through fake websites [13].
Due to the transparency of blockchain, victims can find out where their stolen funds went and then officially report the suspicious phishing addresses. Besides, all Ethereum records are accessible. Consequently, it is possible to distinguish the behavior of phishing addresses by mining the transaction records of Ethereum.
To the best of our knowledge, only a few works detect phishing scams in Ethereum; we found only seven related works, which are discussed in Section III. After exhaustively studying these works, we highlighted some limitations that our proposed approach addresses. This field therefore still requires more effort to reach better solutions.
In this paper, the main contributions are as follows:
• Propose a new approach, Eth-PSD, to detect phishing scams in Ethereum that achieves better results (an improvement of at least 6%) compared to the best related works.
• Construct a new balanced dataset that involves the latest phishing addresses and then publish it for other researchers.
• Propose a novel voting-based technique to select the most significant features using ranking methods.
• Empirically optimize the detection stage to reach the best performance by investigating many machine learning classifiers.

II. BACKGROUND
This section briefly introduces key background information about Ethereum, phishing scam, Intrusion Detection Systems as a countermeasure against phishing scam, and finally, the definition of the feature engineering methods used in Eth-PSD.

A. ETHEREUM IN A NUTSHELL
We can define Ethereum as a decentralized virtual machine that executes special programs called contracts [14]. Every contract has a permanent storage location for data and a set of functions that users or other contracts can call. Users can use the Ethereum network to send transactions. More precisely, users and contracts can hold a cryptocurrency (ether, or ETH for short) and transfer ether to and receive ether from one another. A user who sends a transaction can create a new contract, call a function of a contract, or transfer ether to other users or contracts [6], [15]. Each transaction sent by a user is called an external transaction and is recorded on the public blockchain. When a contract receives an external transaction, it might initiate internal transactions that are not explicitly recorded on the blockchain but still affect the balances of users and other contracts. Because transactions might involve financial resources, it is necessary to ensure that they are performed accurately. Ethereum has no central party; instead, each transaction is processed through a decentralized P2P network. To handle mismatches, whether due to failures or attacks, this network has a consensus protocol that is currently based on "proof-of-work" [16], [17].

B. PHISHING SCAM
Phishing is an online threat that involves imitating a legitimate company's website to obtain personal information such as usernames, passwords, and social security numbers [9]. Usually, the classic phishing scam begins with the target receiving an email that appears to come from a legitimate organization. Researchers have proposed many anti-phishing solutions to counter this threat. Most of the existing works on phishing detection, such as [9], [18], and [19], are based on inspecting web content and email (i.e., text recognition). In Ethereum, however, phishing scams can be conducted through more diverse methods than traditional ones. One of the most significant differences is that a phishing scam in Ethereum does not only steal sensitive information and money through phishing websites; it can also steal money directly by spreading phishing addresses among users or victims in many ways, such as emails, websites, or even online chats. For example, in the well-known phishing scam on the Bee Token ICO, the phishers sent an email to would-be buyers, inducing them to send money to a specific address before the startup's token sale. Notably, this phishing scam gathered about $1 million even without phishing websites [20]. Thus, the traditional detection solutions for phishing scams cannot be directly applied in Ethereum, because phishing websites are used in only a limited percentage of phishing frauds.
Another decisive difference is that phishers leverage blockchain characteristics or vulnerabilities in the Ethereum protocol. Phishers in Ethereum have many methods to steal cryptocurrency. For example, phishers propose a Decentralized Autonomous Organization (DAO) for voting on investment proposals. When most investors approve a proposal that they are confident in and on which their money can be spent, the approved money is moved to the proposer's account address.
In a real-life example from 2018, scammers stole at least USD 928,000 when BeeToken was attacked for the first time. The scammers posed as the BeeToken team to phish the investors and urged them to act quickly in order to gain a significantly higher return on investment. The phishers sent the would-be buyers an Ethereum address and directed them to send their funds to it [21].
As a result of blockchain openness and transparency, all transaction records are publicly accessible; each victim can still trace the funds lost to fraud and then report the suspicious phishing addresses. From this point, we find it possible to detect phishing addresses by mining historical records that include the phishing addresses, unlike the traditional solutions that only target phishing websites. In our work, we aim to detect phishing scams by identifying phishers' Ethereum addresses.

C. INTRUSION DETECTION SYSTEM (IDS)
An Intrusion Detection System (IDS) is considered one of the most efficient security defenses against cyber-attacks. We can define this defense as software that identifies attacks by distinguishing their intrusions, as abnormal traffic, from normal traffic. Strategically, IDSs are designed to detect attacks before or after they happen [22].
There are two types of IDSs: Signature-based IDS (SIDS) and Anomaly-based IDS (AIDS). The first type was used in early IDSs; it identifies attacks by comparing the observed behavior against predefined signatures of known attacks. This type cannot detect zero-day attacks, but it is faster and more accurate in detecting predefined signatures. The second type, also known as an Anomaly Detection System (ADS), bases its detection on deviation from the standard form, order, or rule. This type does not rely on predefined attack patterns but on a predefined threshold, i.e., it can differentiate between normal and abnormal patterns. In addition, unlike SIDS, this type can detect previously unseen attacks [23].
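The contrast between the two types can be illustrated with a small Python sketch. It is a toy example and not part of Eth-PSD: the signature set, the baseline measurements, and the threshold of three standard deviations are hypothetical values chosen only for illustration.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical signature database: fragments of known attacks (assumption).
KNOWN_SIGNATURES = {"drop table", "../../etc/passwd"}


def signature_alert(payload: str) -> bool:
    """SIDS-style check: flag only payloads that match a known signature."""
    text = payload.lower()
    return any(signature in text for signature in KNOWN_SIGNATURES)


def anomaly_alert(value: float, baseline: Sequence[float], k: float = 3.0) -> bool:
    """AIDS-style check: flag a value that deviates from the normal baseline
    by more than k standard deviations (k is an assumed threshold)."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > k * sigma


# Hypothetical measurements of normal behavior, e.g. requests per minute.
baseline = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0]
```

The signature check can only recognize what is already in its database, whereas the threshold check reacts to any sufficiently unusual value, including behavior that has never been seen before.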
In terms of deployment method, there are also two types: Host-based IDS (HIDS) and Network-based IDS (NIDS) [24]. The first type can detect internal changes by scanning the local system, i.e., it uses only data obtained from the system. In contrast, the second type monitors the data obtained from the network. Fig. 1 shows the taxonomy of IDS.
Learning-based techniques, whether Machine Learning (ML) or Deep Learning (DL), have proved their effectiveness in IDSs. These techniques allow systems to learn and improve from experience without being explicitly programmed [25]. Simply put, learning starts with observing the data and looking for patterns in the provided examples in order to make better predictions. The main goal is to let computers learn without human assistance or intervention. Then, the trained systems adjust their actions accordingly. There are three types of ML algorithms for IDS: Supervised, Semi-supervised, and Unsupervised algorithms [26].
In this paper, we apply eight machine learning classifiers that have not yet been applied to detect phishing scams in Ethereum, namely: J48 Consolidated, Fast Decision Tree, C4.5 decision tree, Naïve-Bayes Tree, PART decision list, JRip, K-nearest Neighbors, and OneR. Then, according to the experimental results, we set the best-performing classifier as the official classifier in the detection stage of Eth-PSD.

D. RANKING METHODS
This paper uses three ranking methods to select the significant features in the feature engineering stage. These methods were chosen because the CorrelationAttributeEval method treats each feature individually as an independent indicator to measure the impact of the feature on the class, the ClassifierAttributeEval method depends on a prespecified classifier to evaluate each feature individually (wrapper) [24], and the PairwiseCorrelationAttributeEval method evaluates the effect of sets of features on the class, i.e., it considers the correlation between the features themselves. Consequently, this combination guarantees that the worth of each feature is evaluated individually, jointly with other features in terms of influence on the class, and finally based on a prespecified classifier. In other words, this combination covers the direct impact of each feature on the class and the joint effect of a set of features on the class. By putting these methods together for the first time, we contribute a novel, efficient feature selection method.
Correlation Attribute Evaluation (CorrelationAttributeEval). This method evaluates each attribute as an independent indicator and estimates the worth of that attribute/feature by measuring the correlation between that feature and the class [27].
Classifier Attribute Evaluation (ClassifierAttributeEval). This method evaluates the worth of each attribute/feature by using a pre-specified classifier to estimate the impact of that feature on the class [28].
Pairwise Correlation Attribute Evaluation (PairwiseCorrelationAttributeEval). This method is inspired by the correlation-based feature selection method, but it evaluates sets of features instead of individual features. To determine the effectiveness of each set, this method evaluates how well the set predicts the class. Sets with a high correlation to the class then obtain higher ranking scores than sets that are poorly correlated with the class [29].
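As an illustration of the correlation-based ranking idea, the following Python sketch scores each feature by the absolute Pearson correlation between the feature and the class label and sorts the features by that score. It is a minimal sketch and not the evaluator used in this paper; the feature names and values are hypothetical, and the use of Pearson's coefficient is an assumption about how the correlation is measured.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(
    features: Dict[str, Sequence[float]], labels: Sequence[float]
) -> List[Tuple[str, float]]:
    """Rank features by |correlation with the class|, strongest first."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 = phishing, 0 = normal.
labels = [0, 0, 1, 1]
features = {
    "value": [0.1, 0.2, 0.8, 0.9],  # follows the label closely
    "noise": [0.5, 0.4, 0.5, 0.4],  # unrelated to the label
}
ranking = rank_by_correlation(features, labels)
```

In this toy data, the "value" column is ranked first because it separates the two classes, while the "noise" column receives a score close to zero.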

III. REQUIREMENTS AND RELATED WORKS
In recent years, the phishing detection topic has been extensively studied, and researchers have proposed many anti-phishing approaches. In contrast, only a few works address phishing detection in Ethereum.
In Ethereum, the unique characteristics of blockchain must be considered, unlike in traditional solutions. In traditional solutions, phishing detection is mainly based on the content and URL information of phishing websites. Therefore, they cannot be directly applied to detect phishing scams in Ethereum (see Section II.B). So far, we have found only seven works that detect phishing scams in Ethereum, from the launch of this blockchain network in 2014 until February 2022. In short, this section has three parts: the first subsection lists the requirements for a new IDS, the second subsection presents the related works, and the third subsection critically analyzes the related works.

A. REQUIREMENTS
In the following, the basic requirements to develop a robust, reliable, scalable, applicable, and effective IDS have been derived based on the functionalities of the mechanisms in related work and our opinion on how to improve them.

1) RELIABILITY AND BALANCING OF THE COLLECTED DATASET
One of the characteristics of a good intrusion detection system is that it is trained and evaluated using a reliable and balanced dataset. Reliability is crucial in training phishing scam detectors, since any falsification leads to inaccurate detection regardless of whether the countermeasure used is efficient. Due to the openness of the Ethereum blockchain, all the related works have accessed reliable transactional records through authoritative websites, namely XBlock.pro and Etherscan.io. Consequently, we rely on the same authoritative websites to meet the first requirement when constructing our dataset.
Additionally, the dataset used must be balanced to avoid overfitting in the machine learning classifier [30]. According to Google developers [31], a dataset can be considered imbalanced when the minority class makes up less than 40% of it. Hence, the lower the percentage, the worse the training. Table 1 shows the range from extreme to mild proportions of the minority classes.
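The 40% rule can be expressed as a small check on the class counts. The Python sketch below is illustrative only: the 40% boundary comes from the text, while the 20% and 1% boundaries between mild, moderate, and extreme imbalance are assumed from the commonly cited Google developers guidance and may differ from the exact ranges in Table 1. The label "balanced" is also a name introduced here.

```python
def imbalance_degree(minority_count: int, majority_count: int) -> str:
    """Classify a two-class dataset by the share of its minority class."""
    share = minority_count / (minority_count + majority_count)
    if share < 0.01:   # assumed boundary for extreme imbalance
        return "extreme"
    if share < 0.20:   # assumed boundary for moderate imbalance
        return "moderate"
    if share < 0.40:   # boundary stated in the text
        return "mild"
    return "balanced"
```

For example, a dataset with 300 phishing records and 700 normal records has a 30% minority share and would be treated as mildly imbalanced under these assumptions.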

2) SIMPLICITY AND SPEEDING UP OF THE SOLUTION VIA FEATURE ENGINEERING
Feature engineering plays a vital role in improving approaches, methods, models, and frameworks [30]. It is a crucial task that contributes directly to the simplicity and speed of the proposed solution. Moreover, it makes the targeted patterns easier for algorithms to detect (i.e., it improves their predictive power). In addition, it reduces the data dimensionality, making the proposed solution faster and smoother to run. It also provides better features and more flexibility, making it easy to select the most significant ones. Finally, it helps to explain the dataset comprehensively [32], [33].

3) DETECTABILITY
Detection acts as the judge because it is the last procedure of each approach; it identifies the data events or observations that deviate from normal behavior. Detection builds on the previous preparation of the dataset (in data collection) and on the feature engineering methods used to improve the overall detection performance. Some suitable detectors might fail to detect an anomaly adequately, not because of the detector itself but because of the preparation. That does not mean the detector plays no important role; on the contrary, it delivers the final judgment on whether the proposed approach is effective enough. Therefore, care should be taken when choosing the final detector or classifier. Still, as mentioned above, data preparation can help the detector understand the data better and discern the abnormality more easily.

4) ACCURACY, RECALL, AND PRECISION
There is no fixed set of metrics to evaluate an IDS. However, Accuracy, Precision, and Recall are considered the key metrics for evaluating an IDS on a balanced dataset. Some IDSs require different metrics depending on their circumstances. For instance, when the dataset is imbalanced, other evaluation metrics such as ROC or AUC evaluate the solution's performance better. Providing more evaluation metrics gives a more complete picture of performance.
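For reference, these metrics follow directly from the four entries of a binary confusion matrix. The Python sketch below uses the standard definitions, with phishing as the positive class; the function name and the example counts are hypothetical. It also computes the False Positive Rate, which is reported alongside accuracy in the abstract.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics from confusion-matrix counts.

    tp: phishing flagged as phishing    fp: normal flagged as phishing
    tn: normal left as normal           fn: phishing missed
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical counts on a balanced test set of 200 addresses.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these hypothetical counts, accuracy is 185/200, precision is 90/95, recall is 90/100, and the False Positive Rate is 5/100.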

B. RELATED WORKS
Only a few works have been found on phishing scam detection in Ethereum. As usual, each work has its advantages and disadvantages. Therefore, phishing scam detection in Ethereum is still an open issue that needs more attention and effort from researchers.
Reference [34] proposed a framework based on feature learning and a phishing hidden framework through inserting transaction records. The authors obtained the highest detection accuracy by extracting 25 new features in addition to the existing ones. Later, they applied three classifiers for detection purposes (SVM, K-NN, Adaboost). Despite the complexity of this solution, their framework achieved a highest detection accuracy of 92%.
Reference [10] proposed a method based on a graph convolutional network (GCN) and autoencoders to detect phishing scams in the Ethereum transaction network. The authors crawled phishing-labeled addresses through the API of etherscan.io. Then, they constructed a large-scale graph to represent the obtained data. After that, they considered only a few sampled subgraphs and achieved a detection accuracy of 58%.
Reference [13] proposed an approach to detect phishing scams in Ethereum after crawling the labeled phishing addresses from authorized websites (which the authors did not name) and constructing the transaction records used to train the approach. Then, they extracted features using Trans2vec. Finally, they adopted a Support Vector Machine (SVM) to classify the nodes into normal and phishing ones. The authors did not report the detection accuracy, but they achieved a Precision of 92% and a Recall of 89%.
Reference [35] proposed an approach to identify phishing scams from Ethereum transaction records. The authors used a graph-based cascade to extract features; then, they adopted a LightGBM-based dual-sampling ensemble algorithm to identify the phishing scams.
Yuan et al. [36] proposed a graph-based classification framework to detect phishing scams on Ethereum. The authors formed a set of subgraphs and then classified these subgraphs. The authors collected the phishing and normal addresses from etherscan.io.
Li et al. [37] proposed a graph-based network to detect phishing scams in Ethereum using the temporal transaction aggregation graph. The authors modeled the temporal relationship of historical transaction records among nodes in order to build the edge representation of the Ethereum transaction network. Later, they combined the structural and statistical features with the trading ones as a form of feature engineering that contributes to detecting phishing addresses in Ethereum. Eventually, the proposed approach achieved 92.8% detection accuracy.
Finally, Xia et al. [38] proposed a new attributed ego-graph embedding framework to detect phishing accounts in Ethereum. The authors represented each labeled account by extracting its ego-graph. Then, they used sampled non-linear substructures to learn representations for the ego-graphs. Eventually, the graph embeddings are fed to a Decision Tree classifier to identify the phishing accounts.

C. CRITICAL ANALYSIS
In the following, we provide a critical analysis of the related works in Section III.B according to the requirements laid out in Section III.A.

1) DATASET IMBALANCE
The imbalanced nature of datasets has been one of the biggest obstacles faced by phishing scam detection mechanisms on Ethereum. For detection mechanisms such as an IDS, the dataset is the most influential factor regarding the accuracy and effectiveness of the mechanism. A robust and balanced dataset paves the way for a fair and realistic evaluation or comparison of any IDS. For example, although the dataset used in [13] has not been presented in detail, the authors mentioned that only 1259 addresses are labeled as phishing addresses among millions of benign ones, which represents an imbalanced dataset (see Table 1). An imbalanced dataset can lead the detector to overfitting issues [30]. The dataset in [35] contains only 323 addresses labeled as phishing addresses, mixed with about 500,000 benign addresses. This is also an example of an imbalanced dataset. In another work, Xia et al. [38] evaluated the proposed framework using an imbalanced dataset of 12,834 non-phishing labels and 451 phishing labels. Li et al. [37] started with an imbalanced dataset and then balanced the data by resampling. Later, the authors randomly sampled the data into three subsets to train the model because the balanced data was very large. However, this random sampling does not guarantee the balance of each subset. Although they eventually relied on the average over those three subsets, that does not give an accurate representation.
As such, the resulting machine learning model may suffer in accuracy from being trained on fewer phishing samples.
Another issue is that most of the recent predictors used to detect phishing scams or other anomalies are based on machine learning techniques. Such techniques over-classify the large classes because of the increased prior probability of those classes. Thus, the classifiers typically misclassify smaller classes more often than larger ones. Table 2 summarizes the drawbacks of the related works pertaining to datasets.
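A short worked example shows why accuracy alone is misleading on such data. The Python sketch below evaluates a trivial detector that labels every address as non-phishing, using the class counts reported for the dataset in [38] (12,834 non-phishing and 451 phishing labels). The trivial detector is a hypothetical baseline introduced here for illustration, not a method from the related works.

```python
from typing import Tuple


def majority_baseline(n_majority: int, n_minority: int) -> Tuple[float, float]:
    """Accuracy and minority-class recall of a detector that always
    predicts the majority class (here: non-phishing)."""
    total = n_majority + n_minority
    accuracy = n_majority / total  # every majority sample is "correct"
    minority_recall = 0.0          # no phishing address is ever flagged
    return accuracy, minority_recall


accuracy, phishing_recall = majority_baseline(12834, 451)
print(f"accuracy = {accuracy:.4f}, phishing recall = {phishing_recall:.1f}")
```

Such a detector reaches roughly 96.6% accuracy while detecting no phishing address at all, because the phishing class makes up only about 3.4% of the data.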

2) FEATURE ENGINEERING
In the aforementioned related works, the authors often developed feature selection methods and added them to the proposed model/solution but did not evaluate their utility. For instance, the significance of the feature selection methods was not evaluated before they were added to the model. Moreover, the feature engineering method must take into account matters such as the simplicity, applicability, and efficiency of the proposed solution (see Section III.A). For example, the feature engineering in [10] still lacks simplicity, applicability, and efficiency compared to other feature extraction methods. At a minimum, an analysis of how the feature engineering method positively influenced the proposed solution would be required. Regarding applicability, the authors in [13] used the Skip-gram architecture, borrowed from the natural language processing research domain, to maximize the probability of the objective occurrence.
Additionally, five of the related works used a graphical representation in their feature engineering [10], [13], [36], [37], [38] to select significant features, except that [37] only represents the data (transactions), i.e., it does not reduce the data dimensionality, which is a purpose of feature engineering. This representation mainly contains non-informative variables, which might impact the general performance [39]. Moreover, these works did not compare against other graphical representations (except [37]) or against an evaluation performed before applying the proposed feature engineering solutions, except for Wu et al. [13]. The authors in [13] applied five embedding algorithms to select the one with the best performance.
Meanwhile, the remaining four works could have explained how the feature engineering methods improved their models/solutions. The authors of [34] extracted a new set of features, including 18 account features and seven network features, without focusing on reducing the number of original features. Feature engineering often attempts to select a smaller set that contains only critical informative features (also known as predictors) rather than keeping the whole large dataset [30]. In other words, adding more features increases the overall complexity. Therefore, to develop a good empirically driven solution, feature engineering needs to show how it improves the proposed solution.
According to Kuhn et al. [30], proposing feature selection as a complex approach to determining feature significance is not recommended. Moreover, inferential statistical methods are the preferable solution to evaluate the contribution of selected features/predictors to the proposed solution. We will handle these issues by proposing a statistical feature engineering technique to select the most significant features via a straightforward and applicable technique.
Finally, none of the related works shows the influence of its feature engineering methods. The positive influence of feature engineering can be shown, for example, by comparing the evaluation metrics before and after applying the methods in terms of complexity, time consumption, and even the detection rate. Therefore, this paper tackles this issue by showing the effectiveness of Eth-PSD after applying feature engineering methods, comparing the evaluation metrics before and after reducing the features (see Section VI).

3) DETECTION
So far, only a few classifiers have been used as detectors of phishing scams in Ethereum because of the limited number of related works. We cannot judge each classifier separately, since each has been used with a different approach and preparation. From our point of view, most of the detectors/classifiers in existing works achieved acceptable accuracy rates because their approaches/models mainly detect the majority class (i.e., the non-anomaly class 0) under the impact of overfitting. Therefore, we propose an empirically driven detector by applying many effective machine learning classifiers that have not yet been used for phishing scam detection in Ethereum. Another issue is that only some of the authors of the related works mentioned which testing approach was used in their solutions, whether a Supplied Test Set or a Cross-Validation approach. We tackle this issue by applying both testing approaches to eight machine learning classifiers for detection purposes and showing the difference.
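The difference between the two testing approaches can be sketched with index splits in Python. The sketch below is illustrative only: a random hold-out split stands in for a supplied test set, and the 30% test ratio, the ten folds, and the fixed random seed are assumptions that are not stated in this section.

```python
import random
from typing import List, Tuple


def holdout_split(
    n_samples: int, test_ratio: float = 0.3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Single split: the model is trained once and tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]  # (train, test)


def k_fold_splits(
    n_samples: int, k: int = 10, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """k splits: every sample is used for testing exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test_fold))
    return splits


train_idx, test_idx = holdout_split(100)
cv_splits = k_fold_splits(100, k=10)
```

With a single split, the reported metrics depend on which records happen to land in the test set, whereas cross-validation averages the metrics over all folds.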

IV. ETH-PSD
In this paper, we propose a new approach, Eth-PSD, to detect phishing scams in Ethereum. Eth-PSD meets the requirements and tackles the limitations of the related works derived in Section III.C. In general, Eth-PSD consists of four main stages, as illustrated in Fig. 2: Data Collection, Data Preprocessing, Feature Engineering, and Phishing Scams Detection, respectively.

A. DATA ACQUISITION
This stage is the basis for analyzing the transaction records in the following stages, whether for phishing scam detection or for a deeper understanding of the data. According to the Ethereum yellow paper [16], each client in the Ethereum network contains all the transaction history. Therefore, because of the openness of the Ethereum blockchain, researchers can access the Ethereum transaction records.

1) TRANSACTION DATA
The transaction records of normal addresses were crawled from an authorized block explorer and analytics platform (Etherscan.io). In this paper, we use the same method as [13], [40] to obtain the Ethereum transactions [36]. In this way, researchers can obtain the historical transaction data of Ethereum.

2) LABELLED PHISHING ADDRESSES
We collected the latest labeled addresses of phishing scams from an authorized academic blockchain data platform, XBlock.pro. The XBlock.pro API provides a validated sample set. We downloaded the dataset that includes officially reported phishing scam addresses from XBlock.pro [36].
Besides, this stage achieves one of the contributions of this work by constructing a new balanced dataset that involves the latest phishing scams and then publishing it for other researchers (as soon as this paper gets published). The constructed dataset will be well-prepared and balanced so that other researchers can use it directly in their experiments (i.e., a ready-to-use dataset). Fig. 3 illustrates the data mining process of the newly constructed dataset.

B. DATA PREPROCESSING
This stage works on preparing the collected dataset through a series of sub-stages. Moreover, this stage makes the collected dataset more understandable for methods and algorithms in the following stages. Three sub-stages are used to preprocess the collected dataset: Balancing, Numericalization, and Scaling of the dataset as follows.

1) BALANCING THE DATASET
Resampling is one of the most widely adopted techniques to tackle an imbalanced dataset. There are two types of resampling: under-sampling and over-sampling. The under-sampling technique removes samples of the majority classes (normal transactions). However, we cannot use this solution when the dataset is limited (not big data), because removing random records from the majority classes can cause the loss of important information. Thus, under-sampling can be used only if the dataset is sufficiently large [31].
Over-sampling, in contrast, duplicates random records of the minority classes (phishing scam) to avoid imbalanced proportions (see Table 1 for when a dataset might be considered imbalanced). Therefore, we tackle this issue by over-sampling the phishing scam samples of the collected, labeled addresses so that the minority class exceeds 40% of the dataset.
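The over-sampling step can be sketched in Python as follows. This is a minimal illustration of random duplication under stated assumptions, not the exact procedure used to build the dataset: the function name, the fixed random seed, and the toy records are hypothetical, and only the 40% target comes from the text.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def oversample_minority(
    majority: Sequence[T],
    minority: Sequence[T],
    target_share: float = 0.40,
    seed: int = 0,
) -> List[T]:
    """Duplicate random minority records until the minority class
    exceeds target_share of the combined dataset."""
    if not minority:
        raise ValueError("minority class must contain at least one record")
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    resampled = list(minority)
    while len(resampled) / (len(resampled) + len(majority)) <= target_share:
        resampled.append(rng.choice(minority))  # duplicate a random record
    return resampled


# Hypothetical toy data: 1000 normal records and 50 phishing records.
normal = [f"normal_{i}" for i in range(1000)]
phishing = [f"phishing_{i}" for i in range(50)]
balanced_phishing = oversample_minority(normal, phishing)
```

With 1000 majority records, the loop stops at 667 minority records, the smallest count whose share exceeds 40%. Because the added records are exact duplicates, they should be kept out of the test data when a separate test set is drawn afterward.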

2) NUMERICALIZATION
In Ethereum, an address is a 42-character hexadecimal string (e.g., 0xb794f5ea0ba39494ce839613fffba74279579268), which means that two features of the collected dataset contain letters mixed with numbers. Another feature is the Transaction Hash (TxHash), where each transaction has a unique TxHash. TxHash is calculated based on input details such as the source address, destination address, value, data, and a nonce [16]. Thus, this feature is also a hexadecimal value. Consequently, we numericalize these features by transforming them into numeric values to make them understandable to the following methods and algorithms.
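The text does not specify the exact transformation used for numericalization. One plausible mapping, shown here only as an assumption, is to interpret each hexadecimal string as an integer; the helper name `hex_to_int` is hypothetical.

```python
def hex_to_int(value: str) -> int:
    """Interpret a 0x-prefixed hexadecimal string (address or TxHash) as an integer.

    Assumption: this direct base-16 conversion stands in for whatever mapping
    the original pipeline used.
    """
    text = value.strip().lower()
    if text.startswith("0x"):
        text = text[2:]
    return int(text, 16)


address = "0xb794f5ea0ba39494ce839613fffba74279579268"
as_number = hex_to_int(address)  # an integer of at most 160 bits for an address
```

Python integers have arbitrary precision, so the conversion itself is lossless. Values of this size exceed the 53-bit precision of a 64-bit float, which is one practical reason to scale or otherwise compress them before passing them to a learning algorithm.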

3) SCALING
Scaling maps the original numerical values into a specified range, often 0 to 1. After numericalization, the resulting features span wide ranges across the records, and these wide ranges degrade the ranking of features. To tackle this issue, we scale (or normalize) the numericalized dataset with an operation named Scaling, calculated by (1). In addition, the methods that we use in feature engineering read binary values.
$$Z = \frac{z - \min}{\max - \min} \tag{1}$$
Z represents the new scaled result, z represents the value to be scaled, max is the maximum value of the feature, and min is the minimum value of the feature. Fig. 4 shows the sub-stages of this stage in sequence.
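As a brief illustration of the scaling in (1), the sketch below applies min-max scaling to one feature column. The function name `min_max_scale` and the handling of a constant column (mapped to 0.0) are assumptions made for the example.

```python
def min_max_scale(values):
    """Scale a feature column into [0, 1] using (z - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no ranking information,
        # so every value is mapped to 0.0 instead of dividing by zero.
        return [0.0 for _ in values]
    span = hi - lo
    return [(v - lo) / span for v in values]


scaled = min_max_scale([10, 20, 30])  # [0.0, 0.5, 1.0]
```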

C. FEATURE SELECTION
Recently, researchers have widely applied feature engineering because of its vital role in their proposed solutions. There are many feature engineering methods, such as feature selection. Feature selection reduces the data dimensionality, saving time and reducing the complexity of the proposed solutions [30]. Still, each method has its own characteristics, advantages, and shortcomings. Therefore, the effectiveness of a feature engineering method should be demonstrated empirically for the proposed solution.
In this paper, we propose a novel voting technique based on three ranking methods to produce a shortlist of only the significant features. The proposed technique selects the significant features based on two pivots. In the first pivot, we evaluate the worth of each feature with respect to the class individually, using the CorrelationAttributeEval method. In the second pivot, we evaluate the worth of sets of features rather than individual features, using PairwiseCorrelationAttributeEval. Consequently, we capture both the individual impact of each feature on the class and the collective impact of sets of features on the class. Finally, we use a wrapper that ranks the features based on a pre-specified classifier (see section II.D).
As a result, there are three lists of ranked features, which are fed into the voting technique. If a feature has a turnout of over 60% at one rank level, it passes into the final feature list. The 60% threshold follows from the use of three ranking methods: each method contributes one third of the vote, so a majority requires two of the three, i.e., two-thirds or approximately 66.7%. We therefore require the turnout to exceed 60%.
The voting is applied position by position across the three lists (each produced by one of the three methods). For example, let X, Y, and Z be the three lists of ranked features, where Xn, Yn, and Zn denote the features at position n of each list, with 1 <= n <= 9 (where 9 is the total number of features). Table 3 shows the possible outcomes of the voting technique for better understanding. Moreover, Fig. 5 shows the mechanism of the voting technique.
After this statistical technique, we obtain a new shortlist of the significant features. In the Evaluation section, we show how this feature engineering contributes to Eth-PSD in the implementation. Finally, the output of this stage is the input to the fourth stage.
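One way to read the voting rule described in this section is sketched below. It is an interpretation rather than the authors' code: at each rank position, the three lists each nominate a feature, and a feature passes when more than 60% of the lists (two of three) nominate it at that position. The function name `vote_features` and the placeholder feature names f1 to f6 are hypothetical.

```python
from collections import Counter


def vote_features(ranked_lists, threshold=0.60):
    """Select features nominated at the same rank by more than `threshold` of the lists."""
    n_lists = len(ranked_lists)
    length = len(ranked_lists[0])
    if any(len(ranking) != length for ranking in ranked_lists):
        raise ValueError("all rankings must have the same length")

    selected = []
    for position in range(length):
        # Count how many lists place each feature at this rank position.
        nominees = Counter(ranking[position] for ranking in ranked_lists)
        feature, votes = nominees.most_common(1)[0]
        if votes / n_lists > threshold and feature not in selected:
            selected.append(feature)
    return selected


# Placeholder rankings from three hypothetical ranking methods.
X = ["f1", "f2", "f3", "f4", "f5", "f6"]
Y = ["f1", "f3", "f2", "f5", "f6", "f4"]
Z = ["f2", "f1", "f3", "f6", "f4", "f5"]
shortlist = vote_features([X, Y, Z])
```

Here f1 wins position 1 (two of three lists) and f3 wins position 3, while the remaining positions have no majority, so the shortlist is `["f1", "f3"]`.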

D. PHISHING DETECTION STAGE
According to the experiments in the related works, the behavior of phishing scam addresses is distinguishable from that of normal addresses. Phishing scam detection in Ethereum can be modeled as a binary classification problem. In addition, we generated a balanced dataset with enough labeled addresses. However, the heterogeneous nature of Ethereum data makes the classification between phishing and normal addresses more complicated [13]. For that reason, typical supervised learning algorithms are applied experimentally to detect phishing scams, and the best performer is selected as the primary classifier in Eth-PSD.
In this paper, we experimentally apply eight classifiers that differ from the ones used in the previous related works. The classifiers are J48 Consolidated, Fast Decision Tree, C4.5 decision tree, Naïve-Bayes Tree, PART decision List, JRip, K-nearest Neighbors, and OneR. We apply eight classifiers to broaden the search for a well-performing detection stage. Then, we select the best classifier according to its scores and the time taken to build a model. After that, we set the chosen classifier as the official classifier in the detection stage of Eth-PSD.
Two testing approaches are used to evaluate each classifier's performance: Supplied Set Test and Cross-Validation. Further, the optimization continues to find the best settings of the best classifier for phishing scam detection. Experimentally, more than eight classifiers were applied to detect phishing scams in Eth-PSD. Still, only the top eight were finally chosen to show their results and how we go further to configure the optimized detection stage.

1) SUPPLIED SET TEST
This testing approach splits the dataset into two parts: 80% is allocated for training and 20% for testing.
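A minimal sketch of such an 80/20 split is given below. The shuffling step, the fixed seed, and the function name `split_80_20` are assumptions; the text does not state whether the split was random or stratified.

```python
import random


def split_80_20(records, test_share=0.20, seed=0):
    """Shuffle the records and return (train, test) with test_share held out."""
    rng = random.Random(seed)
    shuffled = list(records)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_80_20(range(100))
```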

2) CROSS-VALIDATION
In this testing approach, the dataset is divided into folds for cross-validation. We keep tuning the number of folds to optimize the detection stage. For example, we start with a certain number of folds and then increase it slightly; if the scores improve, we continue increasing the number of folds slightly.
More clearly, Fig. 6 shows the difference between Supplied Set Test and Cross-Validation.
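The cross-validation procedure and the fold-tuning loop described above can be sketched as follows. The function names, the step size, the upper bound on the fold count, and the stopping rule (stop at the first fold count that does not improve the score) are assumptions introduced for illustration.

```python
def k_fold_indices(n_records, k):
    """Return k (train_indices, test_indices) pairs covering every record once as test."""
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and the number of records")
    base, remainder = divmod(n_records, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds receive one extra record.
        size = base + (1 if i < remainder else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_records))
        folds.append((train_idx, test_idx))
        start += size
    return folds


def tune_fold_count(score_for_k, start_k=10, step=10, max_k=100):
    """Increase the fold count while the score keeps improving.

    `score_for_k` is a hypothetical callable that runs k-fold cross-validation
    and returns a single score (for example, mean accuracy).
    """
    best_k, best_score = start_k, score_for_k(start_k)
    k = start_k + step
    while k <= max_k:
        score = score_for_k(k)
        if score <= best_score:
            break  # no improvement, keep the previous fold count
        best_k, best_score = k, score
        k += step
    return best_k, best_score


folds = k_fold_indices(10, 3)
```

In practice `score_for_k` would train and evaluate the classifier on each pair returned by `k_fold_indices` and average the results.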

V. IMPLEMENTATION
This section describes Eth-PSD's design, implementation, and optimization to detect phishing scam in Ethereum. Eth-PSD has been thoroughly explained in the previous section. This section shows the experimental results of each stage, starting with preparing the dataset and ending with the detection stage.

A. DATA ACQUISITION AND PREPROCESSING
By crawling the two authoritative websites, Etherscan.io and XBlock, we obtained 84,664 records. The initial data consists of 79,216 normal addresses and 5,448 phishing addresses, according to the latest officially reported phishing scams as of Jan 2022. Compared to the related works discussed earlier, we have more labeled phishing addresses than any of them.
The generated dataset contains nine features, including the class as tabulated in Table 4.
As explained earlier in section IV.B, the preprocessing includes three substages. The first substage tackles the imbalanced dataset issue by randomly duplicating samples of phishing addresses. The new dataset then contains 117,359 records, including both phishing and non-phishing addresses, where the proportion of phishing scams exceeds 40% of the total dataset.
The balanced dataset is then processed with numericalization and normalization (see section IV.B). As a result, the balanced and normalized dataset helps us avoid the overfitting problem [41].
Finally, the resulting dataset is the input to the next stage, which applies three feature ranking methods.

B. FEATURE SELECTION
The normalized balanced dataset contains nine features for each record. This stage aims to reduce the data dimensionality by selecting the most significant features, which reflect the influence of the whole feature set, and dropping the rest, which might mislead the classifier in the next stage.
Each ranking method gives a different order of features compared to the initial order, so the resulting lists differ slightly in order from one another. The voting technique then produces a new shortlist of the most significant features. Table 5 shows the resulting feature order of each ranking method and the proportion of turnout.
As shown in Table 5, the new list contains only four features: Input, BlockHeight, TimeStamp, and From. Therefore, the dimensions of the resulting dataset are 4 × 117,360.

C. PHISHING SCAM DETECTION
After obtaining a shortlist of the most significant features using the voting technique, we use them as feature inputs to the phishing scam detection stage. Experimentally, we applied more than eight classifiers to cover a wider range of performance. However, we list only the results of the top eight classifiers in both testing approaches (Supplied Set Test, Cross-Validation). After that, we go further in setting the parameters of the best classifiers to optimize the detection stage in terms of high detection rate, low FPR, and the time taken to build a model. Overall, the Eth-PSD approach shows promising results by efficiently detecting phishing scams in Ethereum, with a detection accuracy better than that achieved in all related works. The results of both testing approaches are explained in the following.

1) SUPPLIED SET TEST
In this testing approach, the dataset is split into two sets: 80% and 20% of the dataset. The first is used to train the model, while the second is used to test it. The highest detection accuracy (AUC) is 97.76%, achieved by the K-nearest Neighbors classifier, with the lowest FPR of 0.01. In addition, K-nearest Neighbors took only 0.03s to build a model, the shortest time among the classifiers (J48 Consolidated, Fast Decision Tree, C4.5 decision tree, Naïve-Bayes Tree, PART decision List, JRip, and OneR).

2) CROSS-VALIDATION/10-FOLDS
In this testing approach, the dataset is divided into 10 folds for cross-validation, where nine folds are used for training and one fold for testing. The highest detection accuracy (AUC) is 98%, achieved by the K-nearest Neighbors classifier, with the lowest FPR of 0.01. In addition, K-nearest Neighbors took only 0.01s to build a model, the shortest time among the classifiers. Table 8 (Section VI.B) presents the AUC, Precision, Recall, F1-Score, FPR, ROC curve, and time taken to build a model with 10 folds for each classifier in the Cross-Validation testing approach, where the bold results are the best compared to the other classifiers (J48 Consolidated, Fast Decision Tree, C4.5 decision tree, Naïve-Bayes Tree, PART decision List, JRip, and OneR).

VI. EVALUATION
Evaluation differs slightly from one work to another. In general, primary evaluation metrics such as Accuracy, FPR, Precision, and Recall must be calculated in any work, as mentioned in section III.A. Further, some additional evaluation metrics provide a better view of the performance and can give one solution priority over others. In this paper, we cover all the evaluation metrics utilized in the related works: Accuracy, Precision, Recall, F1-score, FPR, ROC curve, and the time taken to build a model (s). Notably, no related work evaluated its solution with all these metrics, and some related works reported only the metrics in which their solution has an advantage. For fairness, we evaluate Eth-PSD with all the evaluation metrics in section A. Afterward, section B presents the results of both testing approaches, Supplied Set Test and Cross-Validation. Finally, section C comparatively analyzes the results of Eth-PSD with those of related works.

A. EVALUATION METRICS
This section explains the evaluation metrics used to measure the performance of the Eth-PSD methodology in detecting phishing scams on Ethereum.
The classification findings represent the efficiency of Eth-PSD in terms of detection Accuracy (AUC), Precision, Recall, F1-score, False Positive Rate (FPR), Receiver Operating Characteristic (ROC) curve, as well as the time taken to build a model (s).
The evaluation metrics and their mathematical equations are explained in the following. The detection Accuracy (AUC), also known as the classification rate, is the standard measurement of IDS performance in terms of how accurately Eth-PSD detects phishing scams as abnormal behavior. Equation (2) describes how to calculate the AUC. Meanwhile, FPR reflects the ratio between the normal addresses incorrectly classified as phishers and the total number of benign addresses, which can be calculated by (3).
The proportion of phishing incidents that could be correctly predicted relative to the predicted number of phishers is represented by precision (P) and calculated by (4).
Recall (R) represents the proportion of correctly predicted phishing incidents to the number of actual phishing scams. Recall (R) can be calculated by (5).
In statistical analysis of binary classification, the F1-score represents a combined measure for precision and recall. It can be calculated by (6).
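Equations (2) to (6) are referenced above but are not reproduced in this text. As a hedged reconstruction, the standard confusion-matrix forms that match the verbal definitions are listed below. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, and false negatives, with phishing as the positive class) are notation assumed here and may differ from the notation used in Table 6.

```latex
\begin{align}
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{2} \\
\mathrm{FPR} &= \frac{FP}{FP + TN} \tag{3} \\
P &= \frac{TP}{TP + FP} \tag{4} \\
R &= \frac{TP}{TP + FN} \tag{5} \\
F_1 &= \frac{2 \, P \, R}{P + R} \tag{6}
\end{align}
```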
Last but not least, the Receiver Operating Characteristic (ROC) curve plots the False Positive Rate (FPR) on the x-axis against the True Positive Rate (TPR) on the y-axis. The ROC curve shows the diagnostic ability of a binary classifier as its discrimination threshold is varied. Finally, the time taken to build the model is also used as an evaluation metric, since the speed of the solution plays a vital role. Table 6 presents the definitions and essentials of the evaluation metrics.
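To make the ROC construction concrete, the sketch below sweeps the decision threshold over a set of classifier scores and records one (FPR, TPR) point per threshold. The function name `roc_points`, the toy scores, and the label encoding (1 for phishing, 0 for normal) are assumptions for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        # Records scoring at or above the threshold are flagged as phishing.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


curve = roc_points(scores=[0.9, 0.8, 0.4, 0.2], labels=[1, 0, 1, 0])
```

For these four toy records the curve runs from (0.0, 0.0) through (0.0, 0.5), (0.5, 0.5), and (0.5, 1.0) to (1.0, 1.0).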

B. RESULTS
In both testing approaches, our experimental results showed superiority compared to the results of the related works. Tables 7 and 8 present AUC, Precision, Recall, F1-Score, FPR, ROC curve, and time taken to build a model of each classifier in both Supplied Set Test and Cross-Validation testing approaches, respectively. Take note that the best results are in bold.
To arrive at an optimized approach, we examined the findings of each classifier to identify the best performers. The four best-performing classifiers pass to the following tuning process, which searches for the optimized setting for this stage by slightly decreasing and increasing the number of folds, as explained in Table 9. Another reason is to avoid the overfitting problem [41].
Table 9 shows that we finally went further with the K-nearest Neighbors classifier, slightly increasing the number of folds up to 60 folds to reach the highest detection accuracy (AUC) of 98.11%, with the lowest FPR of 0.01. Moreover, the K-nearest Neighbors classifier took only 0.01s to build a fold, which is better than all the other applied classifiers. Based on these experimental results, we officially select the K-nearest Neighbors classifier as the best choice to optimize the detection in Eth-PSD.
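For readers unfamiliar with the selected classifier, a K-nearest Neighbors prediction can be sketched in a few lines. This is a minimal illustration and not the implementation used in the experiments: the Euclidean distance, the majority vote, the value of k, and the toy two-feature vectors are all assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training records."""
    # Pair each training record's distance to the query with its label.
    distances = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy, already-scaled feature vectors (illustrative values only).
train_x = [(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0)]
train_y = ["normal", "normal", "phishing", "phishing"]
prediction = knn_predict(train_x, train_y, query=(0.95, 0.9), k=3)
```

A nearest-neighbor model only stores the training records and defers the distance computations to prediction time, so a very short build time is expected for this family of classifiers.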

C. COMPARATIVE ANALYSIS
This section justifies why we developed a new approach although related works exist. The nature of academic research is to reach better solutions gradually. We first determined the requirements for constructing a robust IDS. After that, we exhaustively studied and analyzed the existing related works to identify the defects that need to be addressed to configure an efficient solution against phishing scams in Ethereum. There are three main pivots in IDSs: the dataset used, feature engineering, and detection quality. The comparative analysis is based on these three pivots.
Starting with the dataset construction, we addressed the issue of the imbalanced dataset, which had not been addressed in most of the previous works, to avoid overfitting of the proposed detector. Section III.C (Table 2) illustrates the significant gaps in the related works between the number of phishing addresses and the number of benign addresses. For this reason, we constructed a new balanced dataset. Another reason is that we could access only two datasets from the seven related works, and neither was sufficiently balanced, as explained in Table 2. Accordingly, we publish our dataset publicly and make it ready to use by other researchers. In addition, the preprocessing procedures were meant to avoid the overfitting problem, according to [41].
In terms of feature engineering, we applied Eth-PSD with and without the feature engineering methods to distinguish their influence. The experimental results showed a positive impact of the feature engineering methods used. Without applying the proposed feature selection method, we obtained a lower detection accuracy of 95% in Supplied Set Test and 96% in Cross-Validation (10 folds). Another reason is that, according to [41], adopting ensemble techniques before classification can assist in avoiding the overfitting problem.
The difference might seem slight, but without feature selection the time taken to build a model is more than three times longer in Supplied Set Test and more than six times longer in Cross-Validation. Therefore, the proposed feature selection technique positively influences the approach in terms of time consumption and reduced complexity. Additionally, we selected only four significant features. Compared to the related works, we used the fewest features and achieved the highest accuracy.
On further analysis, two of the features (From and Input) relate to the labeling itself (i.e., logically, the 'class label' is directly affected by these two features). We utilized a dataset that includes officially reported phishing addresses from XBlock.pro. 'From' refers to the sender, which indicates that our methods correctly select the feature with a high influence on the 'class label'. As for 'Input', phishers typically steal money only occasionally, when the time is right, which corresponds to the scarcity of reported phishing instances. The selection of the latter two features (BlockHeight and TimeStamp) is also logical, as they reflect the peak in the amount of Ether transferred during the theft.
Phishers do not transfer small amounts of stolen money but large ones (which affects BlockHeight). As mentioned above, phishers steal only occasionally and only at the right time (which affects TimeStamp). Taken together, these four features show tangible changes during phishing scam operations.
Although BlockHeight and TimeStamp are features of the transaction, they give an idea of the direct relationship between the node and the transaction. Our experimental results showed that these two features are correlated with the behaviour of the node. When a node behaves abnormally, these two features change. In other words, any abnormality in the node can be distinguished in the transaction itself. Furthermore, the features of the transaction give a wider understanding of intercommunication. As a consequence, including the transaction features can facilitate identifying the abnormalities of nodes. For further understanding of the Ethereum architecture, we recommend that readers consult [1]. Table 10 presents the number of features in the related works and our proposed technique.
Finally, we achieved the highest detection accuracy (AUC) for phishing scams in Ethereum compared to the related works. With both testing approaches used in the detection stage (Supplied Set Test, Cross-Validation), we achieved higher scores than the related works in AUC, Precision, Recall, and F1-score. We also provide FPR and ROC curve scores, which are reported in only some of the related works. Two of the seven related works [13], [36] used only Supplied Set Test as a testing approach. Besides, only one work [35] used cross-validation, with the number of folds set to 5. Four of the seven related works did not mention the testing approach they used. Table 11 tabulates the scores of the related works compared to ours. Note that the best results of each work are in bold.

VII. CONCLUSION AND FUTURE WORK
In conclusion, we proposed Eth-PSD to detect phishing scams in Ethereum. We started by deriving requirements from the limitations of the related works and from other effective IDSs in previous work. The derived limitations include dataset imbalance, complex feature engineering, and detection accuracy. In short, Eth-PSD consists of four main stages, and each stage tackles a certain limitation. The first and second stages collect and preprocess the dataset, respectively. Afterwards, the constructed balanced dataset is published publicly for other researchers. In the third stage, we proposed a novel feature engineering technique based on voting to select the most significant features. Finally, we evaluate phishing scam detection in the fourth stage. Eth-PSD achieved the highest detection accuracy compared to the related works. Moreover, we used only four features to detect phishing scams, which is the fewest number of selected features compared to the others. Taken together, we hope this work can attract more attention and effort from researchers and industry to this field. In the future, we will evaluate whether Eth-PSD can efficiently detect other illegal behaviors on Ethereum, such as gambling, Ponzi schemes, and money laundering. The more the proposed approach proves its detection effectiveness, the better the behaviour of illegal activities can be inferred later. Since there are still techniques worthy of new attempts in this research direction, such as deep learning, we will try to propose more robust detection approaches based on the limitations currently exposed. Furthermore, we will adopt unsupervised learning techniques to detect these illegal behaviors for further understanding.