VulHunter: Hunting Vulnerable Smart Contracts at EVM Bytecode-Level via Multiple Instance Learning

With the economic development of Ethereum, the frequent security incidents involving smart contracts running on this platform have caused billions of dollars in losses. Consequently, there is a pressing need to identify the vulnerabilities in contracts, while the state-of-the-art (SOTA) detection methods have been limited in this regard as they cannot overcome three challenges at the same time. (i) Meet the requirements of detecting the source code, bytecode, and opcode of contracts simultaneously; (ii) reduce the reliance on manual pre-defined rules/patterns and expert involvement; (iii) assist contract developers in completing the contract lifecycle more safely, e.g., vulnerability repair and abnormal monitoring. With the development of machine learning (ML), using it to detect the contract runtime execution sequences (called instances) has made it possible to address these challenges. However, the lack of datasets with fine-grained sequence labels poses a significant obstacle, given the unreadability of bytecode/opcode. To this end, we propose a method named VulHunter that extracts the instances by traversing the Control Flow Graph built from contract opcodes. Based on the hybrid attention and multi-instance learning mechanisms, VulHunter reasons the instance labels and designs an optional classifier to automatically capture the subtle features of both normal and defective contracts, thereby identifying the vulnerable instances. Then, it combines the symbolic execution to construct and solve symbolic constraints to validate their feasibility. Finally, we implement a prototype of VulHunter with 15K lines of code and compare it with 9 SOTA methods on five open source datasets including 52,042 source codes and 184,289 bytecodes. The results indicate that VulHunter can detect contract vulnerabilities more accurately (90.04% accuracy and 85.60% F1 score), efficiently (only 4.4 seconds per contract), and robustly (0% analysis failure rate) than SOTA methods. 
Also, it can focus on specific metrics such as precision and recall by employing different baseline models and hyperparameters to meet the various user requirements, e.g., vulnerability discovery and misreport mitigation. More importantly, compared with the previous ML-based arts, it can not only provide classification results, defective contract source code statements, key opcode fragments, and vulnerable execution paths, but also eliminate misreports and facilitate more operations such as vulnerability repair and attack simulation during the contract lifecycle.

Index Terms-Blockchain, smart contract, security analysis, multiple instance learning, symbolic execution.
Security issues of smart contracts. Currently, smart contracts from various fields hold more than $10 billion worth of virtual currencies. This gives malicious users a strong incentive to discover and exploit potential vulnerabilities in smart contracts. For example, in May 2019, Binance suffered a hacking attack that resulted in the theft of over 7,000 Bitcoins [6], and in October 2022 it lost a further $100 million of cryptocurrency due to a vulnerability in the cross-chain bridge contract.
In addition to the economic value carried by smart contracts, these attacks stem from several critical characteristics. First, smart contracts run in a permissionless network, which means hackers can inspect them freely and search for bugs. Second, the consensus protocol makes them immutable once deployed, requiring developers to anticipate all possible execution states, which is undoubtedly difficult. Therefore, effective vulnerability checkers are essential to ensure smart contracts are bug-free and well-designed before they are deployed to Ethereum.
Challenge 1: (Requirement) Support analysis of source code and bytecode/opcode simultaneously. The source code of smart contracts is usually developed using a high-level programming language, such as Solidity [31]. Owing to its legibility, it has been analyzed by many methods, such as pattern matching (e.g., NeuCheck [14] and Zeus [3]), code similarity matching (e.g., SMARTEMBED [15]), and machine learning (e.g., Peculiar [16] and DeeSCVHunter [17]). Nonetheless, according to the latest records, among the 1 million smart contracts running on Ethereum, fewer than 2% open their source code [18], [32], [33]. Also, contracts usually invoke others, and the called contracts may not open their source code for inspection. Since these works can only build intermediate representations/graphs based on the source code, they can hardly analyze the vast number of contracts on Ethereum.
Instead, the Ethereum Virtual Machine (EVM) bytecode is compiled from the contract source code and stored on each node of the Ethereum system. Everyone can inspect it and convert it to and from opcode (i.e., the code executed directly on the EVM) unconditionally and losslessly. Therefore, to meet developers' various requirements, a practicable contract checker should work with the bytecode/opcode, not just the source code. However, this is hard to implement for the following reasons.
(i) The bytecode loses some contract semantics. When a smart contract is compiled to bytecode, the source code is refined, which means some information is removed or optimized, so it is hard to recover the original semantics of the source code from the bytecode. For instance, detecting whether functions have return values is straightforward in the source code. However, this is difficult at the bytecode level, as the EVM will automatically add default values for functions without return values. Further details are discussed in Section V.B.1.
(ii) The contract introduces a lot of benign interference. A vulnerability generally involves few statements, yet vulnerable contracts hold many vulnerability-irrelevant statements, called noise code, which may confuse code matching.
Challenge 2: (Intelligent & Unmanned) Reduce the reliance on manually pre-defined rules/patterns and expert involvement. Even though there have been some studies on bytecode-based vulnerability detection, such as symbolic execution (e.g., teEther [25] and DefectChecker [20]) and fuzz testing (e.g., ContractFuzzer [10] and SMARTIAN [28]), there is still a growing need to detect and prevent more and more kinds of contract vulnerabilities. A main limitation of these methods is that they require specific vulnerability patterns/oracles or specification rules (collectively, patterns) defined by experts to construct vulnerability detectors and/or code inspectors. This hinders their application to Ethereum for the following reasons.
(i) The manually defined patterns depend on the knowledge of expensive contract experts and bear the risk of errors. Also, some complex vulnerabilities are non-trivial to cover completely. For example, it is difficult to manually describe consistency rules for the transaction order dependent (TOD) vulnerability, or to define bytecode-level patterns that cover all expressions of a vulnerability such as reentrancy. Crafty attackers may even use tricks to bypass fixed patterns.
(ii) The diversity of bytecode generation impedes the formation of vulnerability patterns. Currently, there are dozens of compiler versions, and different versions may generate different bytecode for the same code. As mentioned in [29], the bytecode similarity between the newer and older compilers is only 77.8% for the same contract. Therefore, bytecode-based patterns may be ineffective when vulnerable/defective code is compiled with different versions and thus contains distinct instructions.
(iii) With the race between attackers and defenders, it can be far too slow and costly to write new patterns/oracles in response to the emerging vulnerabilities created by attackers.
Challenge 3: (Practical) Help developers complete the contract lifecycle safely, including vulnerability identification, verification, repair, simulation, and monitoring. To mitigate the above limitations, ML-based methods (e.g., TMP [12] and ContractWard [30]) are used to learn contract vulnerability features automatically, thereby making full use of existing vulnerable (i.e., defective and optimizable) contracts to express vulnerabilities more completely. Nevertheless, these approaches are limited in scalability, generalizability, and interpretability, given their insufficient detection accuracy and running speed. More importantly, as shown in Table I, they only inspect whether the contract is vulnerable, which is not enough to help developers fix vulnerabilities, let alone support verification, simulation, and monitoring. Therefore, it is an emerging yet crucial issue to detect various vulnerabilities (e.g., reentrancy and timestamp) of contracts in an effective, efficient, and interpretable manner, while enabling developers to complete the contract lifecycle more safely in the real world. Specifically, a detector should (i) report the defective source code statements, key bytecode fragments, and possible repair suggestions during the contract development and deployment phases; (ii) support automated vulnerability verification to eliminate false positives and reduce the workload of manual review; (iii) provide possible inputs for invoking contracts to trigger the vulnerabilities; and (iv) monitor contract calls to judge abnormal behavior in the contract execution and destruction stages.
An insight into achieving these services is to combine ML with traditional methods instead of using the former alone, given the unique advantages of traditional methods; e.g., symbolic execution-based methods can obtain the contract inputs that trigger the vulnerabilities. This requires ML-based methods to focus on the critical slices of contract runtime execution sequences in bytecode/opcode form, rather than on the entire contract bytecode. However, since a contract generally contains multiple incomprehensible execution sequences, it is challenging to obtain sequence-level labels when only the contract-level vulnerability categories are known in the training dataset, so the ML classifiers for identifying sequences cannot be trained. This problem is also known as classification lacking fine-grained labels.
Contribution. To overcome the above challenges, we propose VulHunter, a method that can effectively detect vulnerable bytecode/opcode paths without manually pre-defined patterns. It extracts the contract execution sequences/paths based on the opcode, and completes the source code-to-bytecode-to-opcode conversion with contract compilers such as solc [34]. Then, it leverages multi-instance learning (MIL) to infer the fine-grained labels of contract execution sequences automatically, and employs a Bidirectional Long Short-Term Memory model based on bag-instance and self-model attentions (Bi$^2$-LSTM) to inspect them accurately and output the vulnerable sequences with attention vectors. Also, it extracts the key opcodes with large weights in sequences, and locates the defective contract source code statements by mapping from the assembly language source code file. Furthermore, based on symbolic execution, the reported vulnerable sequences are used to construct and solve symbolic constraints to validate their feasibility. Meanwhile, the solved parameters can be used to trigger vulnerabilities such as integer overflow, and abnormal contract calls can be identified by checking their inputs against the constraints. This paper mainly presents the design, implementation, and evaluation of VulHunter. In total, we make the following contributions:
• Comprehensive design requirements. We examine the numerous contract vulnerability types (c.f., Section II) and security analysis methods (c.f., Section VI), and further clarify the field demands (i.e., three challenges) to guide our designs. To the best of our knowledge, VulHunter is the most accurate and practical contract vulnerability detection method based on ML and symbolic execution.
• Novel detection approach. We design and develop VulHunter with six components, such as Vulnerability Learner, to detect vulnerabilities in contract source code or bytecode/opcode without expert involvement (c.f., Section III), thereby meeting challenges (i)-(ii). It employs the Bi$^2$-LSTM model to identify and output the vulnerable runtime execution sequences and defective source code statements. Also, it delivers an optional constraint-solving module to construct the constraints of vulnerable sequences and compute contract inputs automatically, thus addressing challenge (iii).
• Superior analytical performance. We evaluate the performance of VulHunter on five open source datasets (c.f., Section IV). Compared with SOTA methods based on various technologies such as pattern matching and symbolic execution, our solution can detect contract vulnerabilities more accurately, efficiently, and robustly. Also, VulHunter is flexible, given that it can be configured with various baseline models and hyperparameters to adapt to diverse user requirements. More importantly, compared with ML-based methods, it can produce classification results accurately while providing defective source code statements, key opcode fragments, and vulnerable runtime execution paths, benefiting automated validation and vulnerability repair. Finally, we theoretically demonstrate its effectiveness and discuss the limitations, improvements, and more application scenarios in Section V.

II. BACKGROUND AND MOTIVATION
In this section, we briefly introduce the background information about smart contracts, as well as their vulnerabilities.

A. Smart Contracts
Operation procedure or lifecycle. The lifecycle of a smart contract consists of four stages: development, deployment, execution, and destruction. Users can develop a contract in the Solidity language and deploy it to the Ethereum platform. Then, the contract source code is compiled to EVM bytecode and identified by a unique 160-bit hexadecimal hash called the contract address. The contract holds an amount of the virtual currency Ether (called its balance), and its execution depends on its code. It usually runs on a permissionless network, and anyone can invoke its methods through the ABI (Application Binary Interface) [31]. Specifically, the user at address $\alpha_U$ can call the contract by sending a transaction $T = \langle \alpha_U, \alpha_A, E, G, D, \ldots \rangle$ to the contract address $\alpha_A$, where $E$, $D$, and $G$ denote the input amount, call parameters, and execution cost, respectively. Finally, the contract owner can destroy the contract by invoking the "selfdestruct" function.
Ethereum development language. Solidity is an object-oriented and Turing-complete programming language for implementing contracts on various blockchain platforms, most notably Ethereum. Its grammar is similar to JavaScript, e.g., they both implement object-oriented features such as inheritance and complex user-defined types. Nevertheless, Solidity has some unique properties. For example, it provides keywords like payable to mark payment operations, thus making transfer operations easier. Moreover, although some alternative programming languages (e.g., Obsidian [35] and Vyper [36]) have been proposed, Solidity is still the most popular language in Ethereum. Notably, the diversity of contract languages is challenging for auditors. Fortunately, contracts developed in various languages on Ethereum are executed as EVM bytecode, so contract security can benefit more from bytecode-oriented analysis methods.
Ethereum Virtual Machine (EVM) bytecode and opcode. EVM is a stack-based machine that maintains a stack of uint256s to hold local variables, function arguments, etc. When a transaction needs to be executed, EVM splits the bytecode into bytes, each representing a unique instruction called an opcode. As of April 2022, there are 150+ opcodes [37]. As described in Table II, they can be divided into 9 types according to their function. Monitoring them can benefit vulnerability detection. For instance, the opcodes of the computing type (e.g., ADD and MUL) perform arithmetic operations and can be used to identify integer-overflow vulnerabilities.
Furthermore, EVM utilizes these opcodes to execute the task. Fig. 1 shows an example of contract transformation to illustrate the relationships among source code, bytecode, and opcode. The source code is compiled into bytecode 0x600060...5050 using compilers such as solc [34]. EVM splits this bytecode into bytes (0x60, 0x00, ..., 0x50), and executes the first byte 0x60, which refers to opcode PUSH1. PUSH1 pushes one byte of data (i.e., 0x00) onto the EVM stack. Then, EVM reads the next 0x60 and pushes 0x01 (i.e., number 1) onto the stack. Subsequently, it executes the remaining bytes, such as 0x80, 0x43, and 0x03, i.e., opcodes DUP1, NUMBER, and SUB. Among them, NUMBER obtains the block number. SUB takes the top two values from the stack, i.e., block.number and 0x01, and puts their subtraction result onto the stack.
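The byte-by-byte decoding described above can be illustrated with a minimal disassembler sketch. The opcode table below is deliberately partial, and the input string is a hypothetical fragment assembled from the bytes named in the text; it is not the elided bytecode of Fig. 1, and the function name `disassemble` is an assumption for illustration.

```python
# Partial opcode table (illustrative only; the real EVM defines 150+ opcodes).
OPCODES = {0x00: "STOP", 0x01: "ADD", 0x02: "MUL", 0x03: "SUB",
           0x43: "NUMBER", 0x50: "POP", 0x80: "DUP1"}


def disassemble(bytecode_hex: str) -> list[str]:
    """Turn a hex bytecode string into a list of opcode mnemonics."""
    raw = bytes.fromhex(bytecode_hex.removeprefix("0x"))
    out, pc = [], 0
    while pc < len(raw):
        op = raw[pc]
        if 0x60 <= op <= 0x7F:
            # PUSH1..PUSH32 are followed by 1..32 bytes of immediate data.
            width = op - 0x5F
            data = raw[pc + 1: pc + 1 + width]
            out.append(f"PUSH{width} 0x{data.hex()}")
            pc += 1 + width
        else:
            out.append(OPCODES.get(op, f"UNKNOWN_0x{op:02x}"))
            pc += 1
    return out


# Hypothetical fragment: PUSH1 0x00, PUSH1 0x01, DUP1, NUMBER, SUB, POP
print(disassemble("0x6000600180430350"))
```

Because the mapping between bytes and mnemonics is one-to-one, the same table can be inverted to assemble opcodes back into bytecode.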

B. Vulnerabilities in Smart Contracts
1) Definition of Impact Levels: A sound and reasonable vulnerability assessment scheme for contract bugs or code optimizations can help developers understand their contract security better. To this end, combined with CVSS 2.0 (Common Vulnerability Scoring System), the vulnerability severity of a smart contract can be rated as High, Medium, Low, Informational (Info), or Optimization (Opt) in terms of risk degree and utilization difficulty. The detailed partitioning is shown in Fig. 2. The risk degree refers to the impact of the vulnerability on blockchain systems, users, and other resources. According to the three impact dimensions of confidentiality (C), integrity (I), and availability (A), the harm degree is divided into High, Medium, Low, Info, and Opt. The utilization difficulty refers to the possibility of vulnerability occurrence. Based on the attack cost (e.g., money, time, and technology), utilization condition (i.e., the difficulty of attack utilization), and trigger probability (e.g., vulnerabilities can only be triggered by a few people), it is ranked as exactly, probably, or possibly.
2) Examples of Smart Contract Vulnerabilities: To explain contract vulnerability detection at the bytecode/opcode level, we use contract code to illustrate some simple examples of vulnerabilities (c.f., Table III) supported by VulHunter in terms of occurrence principle, severity, repair strategies, and insights in bytecode.
(i) Reentrancy with Ether (reentrancy-eth): Reentrancy is a classic vulnerability, which led to the loss of assets with a market value of nearly $60 million in 2016 [10]. It refers to reentry with the following features: reentrant calling, Ether sending, and reading before writing.
Severity: High severity. RE can cause massive assets to be overspent or stolen (High risk). Also, it requires some conditions to trigger (probably utilization). For instance, an auxiliary contract is needed to complete the attack.
Example: An attack scenario is depicted in Listing 1. Bob constructs an Attack contract and performs the "withdraw()" function by invoking the "attack()" function, which will trigger the fallback function. By this means, Bob implements multiple calls to "withdraw()". Since the "userBalance" variable has not changed before the secondary call, Bob obtains more than the amount he deposited into the contract.
Improvements to contracts: Put userBalance[msg.sender] = 0 before the call function. That is, the contracts should use the check-effects-interactions pattern to avoid this vulnerability.
Possible insight at bytecode-level: Ethereum provides three methods to transfer Ether, i.e., address.send(), address.transfer(), and address.call().value(). These methods all generate a CALL instruction, which reads seven values from the EVM stack. The first three values represent the gas limitation, recipient address, and transfer amount, respectively. A CALL instruction that meets the following conditions is almost certainly generated by call().value(): (i) the gas limitation does not contain the specific value "2300"; (ii) the transfer amount is greater than 0. In addition, the SLOAD instruction reads a key (named Slot ID) from the EVM stack and puts the value read from storage at that key back onto the stack [37]. If the conditional expression contains an SLOAD instruction and its Slot ID is written by an SSTORE instruction after the CALL instruction is executed, the CALL instruction can be executed again and cause reentrancy. Here, the SSTORE instruction saves data into storage, and it reads two values from the EVM stack, i.e., the Slot ID and the stored value.
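The bytecode-level insight for reentrancy can be sketched as a check over one already-resolved execution path. This is a simplified illustration, not VulHunter's detector: the `Step` record, its fields, and `looks_reentrant` are hypothetical, operands are assumed to be concrete, and "slot read in the conditional expression" is approximated by "slot read before the CALL".

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    """One instruction on a path with the operands relevant to the check (hypothetical)."""
    op: str
    slot: Optional[str] = None   # Slot ID for SLOAD/SSTORE
    gas: Optional[int] = None    # gas argument of CALL; None means "not a fixed constant"
    value: int = 0               # Ether amount sent by CALL


def looks_reentrant(path: list[Step]) -> bool:
    loaded = set()   # slots read so far on this path
    at_risk = set()  # slots read before a CALL that forwards gas and sends Ether
    for step in path:
        if step.op == "SLOAD":
            loaded.add(step.slot)
        elif step.op == "CALL" and step.gas != 2300 and step.value > 0:
            at_risk |= loaded
        elif step.op == "SSTORE" and step.slot in at_risk:
            return True  # state is updated only after the external call
    return False


vulnerable = [Step("SLOAD", slot="userBalance"), Step("CALL", value=1),
              Step("SSTORE", slot="userBalance")]
patched = [Step("SLOAD", slot="userBalance"), Step("SSTORE", slot="userBalance"),
           Step("CALL", value=1)]
print(looks_reentrant(vulnerable), looks_reentrant(patched))
```

The `patched` path mirrors the check-effects-interactions ordering recommended above: the balance is written before the external call, so the pattern no longer matches.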
(ii) Transfer replaces send (send-transfer): Both the send and transfer functions specify that the operation has a limit of 2300 gas, but a failure of the send function does not trigger the exception and can be reentered easily. Thus, developers are recommended to use the transfer function.
Severity: Opt severity. ST can help developers improve the security of transfer operations (Opt risk). Also, it can be triggered by executing vulnerable codes (exactly utilization).
Example: As shown in Listing 2, the "func" function utilizes "send" to extract amounts, which can be improved.
Improvements to contracts: It is suggested to replace "send" with the "transfer" to keep the funds secure.

Possible insight at bytecode-level: There exists a CALL instruction that meets the following conditions: (i) the gas limitation contains the specific value "2300"; (ii) the transfer amount is larger than 0.
Listing 2. The sample of send-transfer.

III. THE VULHUNTER APPROACH
In this section, we elaborate on the design principles and workflow of VulHunter, as well as its components.

A. Design Overview
Fig. 3 depicts an overview of the VulHunter architecture. VulHunter can take the Solidity source code, bytecode, or opcode of smart contracts as input, and eventually output contract vulnerabilities (e.g., RE, TO, and TS) with their severity as described in Section II-B, as well as the corresponding vulnerable runtime opcodes and their key fragments. In particular, it can highlight the defective contract statements when analyzing the source code. Also, the runtime opcodes can be used to build symbolic constraints and then execute secondary verification and utilization. Specifically, VulHunter contains six components. Contract Inputter is responsible for generating the contract opcodes, and CFG Builder constructs a CFG with three kinds of blocks. Then, Instance Builder performs a depth-first traversal to obtain the runtime opcode sequences (called instances). Vulnerability Learner captures vulnerability features automatically by training detectors on benign and malicious (vulnerable) contracts. Vulnerability Identifier employs the detectors to identify the vulnerable instances of contracts. Finally, Result Exporter locates the defective source code statements, validates the instance feasibility, and outputs the contract audit reports.

B. Contract Inputter
As shown in the left part of Fig. 3, Contract Inputter accepts the source code, bytecode, or opcode of contracts as input. Specifically, since CFG Builder works directly on opcodes, source code must first be compiled into bytecode and then disassembled into opcodes using the API of Geth [38]. Fig. 1 depicts this process. In particular, the compiler can resolve multiple associated contracts uniformly to generate their bytecode. Also, the assembly language source code file (ASM) can be obtained during contract compilation and further used to map bytecodes/opcodes to source code, as illustrated in Section III-G.

C. CFG Builder
CFG Builder constructs the CFG of contracts based on their opcodes to explore the state transitions during actual execution. This process is similar to methods such as DefectChecker [20] and EtherSolve [39], and includes block identification and edge inference. First, it splits the opcodes into several basic blocks. A basic block is a straight-line code sequence without branches except at the entry and the exit. Its type is determined by the exit instruction: if the exit instruction is a conditional jump (JUMPI), the block type is conditional; if it is a terminating instruction listed in Table II (e.g., STOP and RETURN), the block type is terminal. Blocks that fall into neither type are assigned as unconditional.
Then, CFG Builder performs symbolic inference on the instructions in each block to establish connections with neighboring blocks. Unlike in other stack-based machines (e.g., JVM), the jump positions of EVM opcodes need to be computed during instruction reasoning. Specifically, when operating an instruction, it reads several symbolic states from the top of the EVM stack and puts the computation result back on the stack. Through continuous reasoning operations, we can obtain the jump relations between blocks, and their types are consistent with the block types, i.e., conditional and unconditional.
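The block identification step can be sketched as follows. This is an illustrative approximation rather than VulHunter's implementation: the set of terminating opcodes stands in for the terminal entries of Table II and is an assumption, as are the names `split_blocks` and `block_type` and the rule that JUMPI closes a conditional block.

```python
# Assumed terminating opcodes; the source only names STOP and RETURN as examples.
TERMINAL = {"STOP", "RETURN", "REVERT", "SELFDESTRUCT", "INVALID"}


def block_type(exit_op: str) -> str:
    """Classify a block by its exit instruction."""
    if exit_op == "JUMPI":
        return "conditional"
    if exit_op in TERMINAL:
        return "terminal"
    return "unconditional"


def split_blocks(opcodes: list[str]) -> list[dict]:
    """Split a flat opcode list into typed basic blocks."""
    blocks, current = [], []
    for op in opcodes:
        name = op.split()[0]
        if name == "JUMPDEST" and current:
            # A jump target always opens a new block.
            blocks.append(current)
            current = []
        current.append(op)
        if name in TERMINAL or name in {"JUMP", "JUMPI"}:
            # Jumps and terminating instructions close the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return [{"ops": b, "type": block_type(b[-1].split()[0])} for b in blocks]


ops = ["JUMPDEST", "PUSH1 0x00", "DUP3", "EQ", "ISZERO", "PUSH1 0x17", "JUMPI",
       "PUSH1 0x01", "PUSH1 0x1b", "JUMP",
       "JUMPDEST", "PUSH1 0x00",
       "JUMPDEST", "STOP"]
for block in split_blocks(ops):
    print(block["type"], block["ops"])
```

Edge inference is not covered here: resolving where each JUMP or JUMPI leads requires the symbolic stack reasoning described above.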
As shown in the example in Fig. 4, there are 4 blocks, each containing several instructions. The instructions in block 1 represent the code if(val==0). Block 2 and block 3 put the value (1 or 0) onto the EVM stack, respectively. Block 4 returns the value (0 or 1) to the environment. The leftmost number in each line indicates the instruction index ID, and the middle part is the instruction that needs to be reasoned. All instructions are reasoned sequentially based on their index ID. A Program Counter (PC) records the ID being executed at the current time. Specifically, the PC starts from ID 10 in block 1. Before the EVM executes the JUMPDEST instruction, there is a symbol "val" in the EVM stack, which represents the input value of the "iszero" function. JUMPDEST marks a valid jump destination and does not read or push any values. Then the PC points to ID 11, and EVM pushes the value 1 onto the EVM stack. Next, "0" is pushed onto the EVM stack and the PC points to ID 13. DUP3 duplicates the 3rd stack item, i.e., the symbol "val" is pushed onto the EVM stack again. EQ reads two values from the EVM stack. If the two values are equal, the EVM pushes 1 onto the stack; otherwise, 0 is pushed. After that, ISZERO reads a value from the top of the EVM stack. If the value is equal to 0, 1 is pushed onto the stack; otherwise, 0 is pushed. JUMPI (ID 17) reads two values from the stack, i.e., the jump position and a conditional expression. Depending on the expression's result, the PC moves to either ID 23 or ID 18.
When the PC points to ID 23, it executes the instructions at IDs 23-26; otherwise, it executes those at IDs 18-22. Both paths eventually jump to block 4 unconditionally. When the first instruction of block 4 is performed, the EVM stack holds two values, i.e., val and 0/1. Eventually, block 4 (of type terminal) returns the value 0/1 and uses the STOP instruction to finish the execution. Notably, although destinations are complex to compute for jumps without immediate target offsets, ML models such as Bi$^2$-LSTM (c.f., Section III-E) can tolerate slight CFG biases, thus enabling VulHunter to deliver robust and accurate detection. Also, the feasibility validation (c.f., Section III-G) of paths may mitigate the impacts of imprecise/unsound CFGs, and multi-threaded operations can be used to improve the efficiency of CFG construction.
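The stack reasoning for block 1 can be replayed with a small symbolic interpreter. This is a sketch under stated assumptions: the instruction list is reconstructed from the prose (Fig. 4 itself is not reproduced here), the operand of the PUSH at ID 16 is assumed to be the jump position 23, and `run_block` is a hypothetical helper that supports only the opcodes in this example.

```python
def run_block(ops, stack):
    """Symbolically execute one block; return (stack, jump_target, condition)."""
    stack = list(stack)  # the top of the stack is the end of the list
    for op in ops:
        name, *arg = op.split()
        if name == "JUMPDEST":
            continue  # marks a jump destination, touches no stack values
        elif name.startswith("PUSH"):
            stack.append(int(arg[0], 0))
        elif name.startswith("DUP"):
            stack.append(stack[-int(name[3:])])  # DUPn copies the n-th item from the top
        elif name == "EQ":
            a, b = stack.pop(), stack.pop()
            stack.append(f"({a} == {b})")
        elif name == "ISZERO":
            stack.append(f"!{stack.pop()}")
        elif name == "JUMPI":
            target, cond = stack.pop(), stack.pop()
            return stack, target, cond
        else:
            raise NotImplementedError(name)
    return stack, None, None


# IDs 10-17 as described in the walkthrough; "PUSH1 23" is an assumed operand.
block1 = ["JUMPDEST", "PUSH1 1", "PUSH1 0", "DUP3", "EQ", "ISZERO", "PUSH1 23", "JUMPI"]
stack, target, cond = run_block(block1, ["val"])
print(target, cond)
```

The returned pair gives the CFG edge: the conditional successor is the block starting at the resolved target, and the fall-through successor is the block at the next index.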

D. Instance Builder
To enable VulHunter to discover vulnerable execution paths, the Instance Builder focuses on each opcode sequence (called an instance) that the contract actually runs, rather than a hodgepodge of all opcodes. The process of instance extraction is detailed in Algorithm 1. Specifically, CFG Builder constructs the contract CFG through the steps in Section III-C (lines 12-14). Then, Instance Builder obtains the instances by performing the procedure SEQDFS (lines 2-10) from the initial blocks according to the CFG (lines 16-19). SEQDFS leverages depth-first traversal to record in-block opcodes along the execution path. The search for the execution path stops when it meets one of three conditions: (i) $n_{cycle}$ consecutive blocks have already been searched, which is considered a cyclic execution path; (ii) the length of the path exceeds the limit $n_{block}$; (iii) the type of the last block is terminal, representing the end of a path.
Finally, due to sequence space explosion and performance limitations, the Instance Builder outputs $n_{seq}$ selected instances of the contract (lines 20-21). Note that it has various selection options, such as random selection and ordered assignment. In Section IV-G, we tested the performance of different selection schemes (e.g., longest, shortest, and interval) and different numbers of selected instances under otherwise identical settings, i.e., with controlled variables. Instances with longer lengths generally achieve better results, which may be attributed to the fact that they hold more opcodes and semantic information.
Algorithm 1 (fragment):
opseqs ∪= SEQDFS(∅, ∅, block, 0, ∅)
20: opseqs = sorted(opseqs, key=lambda d: len(d), reverse=True)
21: return choose(opseqs, n_seq) ▷ Choose the opcode sequences
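The instance extraction procedure can be sketched as a bounded depth-first traversal. This is an interpretation, not a transcription of Algorithm 1: the CFG layout (a dict of blocks with `ops`, `type`, and `succ`), the function names, the default limits, and the reading of stop condition (i) as "a block revisited more than n_cycle times on the current path" are all assumptions. Only the longest-first selection is shown.

```python
def seq_dfs(cfg, block, path, visits, n_cycle, n_block, out):
    """Depth-first traversal that records the opcodes along each execution path."""
    node = cfg[block]
    path = path + [block]
    visits = {**visits, block: visits.get(block, 0) + 1}
    stop = (node["type"] == "terminal"    # (iii) the path has ended
            or len(path) >= n_block       # (ii) path length limit reached
            or visits[block] > n_cycle    # (i) assumed cycle test
            or not node["succ"])          # no outgoing edge was resolved
    if stop:
        out.append([op for b in path for op in cfg[b]["ops"]])
        return
    for nxt in node["succ"]:
        seq_dfs(cfg, nxt, path, visits, n_cycle, n_block, out)


def build_instances(cfg, entry, n_seq, n_cycle=2, n_block=50):
    """Collect all bounded paths, then keep the n_seq longest opcode sequences."""
    out = []
    seq_dfs(cfg, entry, [], {}, n_cycle, n_block, out)
    out.sort(key=len, reverse=True)  # longest-first ordering
    return out[:n_seq]


# Toy CFG shaped like the four-block example of Section III-C.
cfg = {
    1: {"ops": ["JUMPDEST", "JUMPI"], "type": "conditional", "succ": [2, 3]},
    2: {"ops": ["PUSH1 0x01", "JUMP"], "type": "unconditional", "succ": [4]},
    3: {"ops": ["JUMPDEST", "PUSH1 0x00"], "type": "unconditional", "succ": [4]},
    4: {"ops": ["JUMPDEST", "STOP"], "type": "terminal", "succ": []},
}
instances = build_instances(cfg, entry=1, n_seq=2)
print(len(instances), instances[0])
```

Copying `path` and `visits` at each call keeps sibling branches independent, at the cost of extra allocation; an in-place version with explicit backtracking would be the usual optimization for large CFGs.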

E. Vulnerability Learner
To detect contract vulnerabilities at the bytecode level, Vulnerability Learner leverages the multi-instance learning mechanism to automatically identify malicious contract behaviors from the instances extracted by Instance Builder. Specifically, it trains a binary classification model for each vulnerability, and multiple models grant VulHunter the ability to analyze contracts comprehensively. In this part, taking one model as an example, we introduce the process of vulnerability feature learning and vulnerable instance detection, which consists of three parts: instance label initialization, model training/classification, and instance optimization, as depicted in Fig. 5. Its effectiveness/correctness is discussed in Section V.A.2.
(i) Instance label initialization. During the vulnerability feature learning stage, the training dataset includes the instances opseqs of multiple benign and malicious contracts, and each contract can be described as a bag. However, since it is challenging to understand the opcodes and label each instance of a contract, we can only determine whether the contract is malicious/vulnerable (i.e., contains the target vulnerability) and cannot obtain the specific labels of its instances. In other words, a malicious contract includes at least one vulnerable instance, yet the specific instance labels are unknown. This is known as the problem of missing fine-grained labels, which prevents the model from being trained on instances. To this end, we first initialize each instance label to coincide with that of its bag and integrate all instances into the training dataset $D_0$.
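The initialization step amounts to copying each bag label onto its instances. A minimal sketch follows, assuming a bag is represented as a pair of an instance list and a 0/1 label; the function name and the extra bag index kept in each row are illustrative choices.

```python
def init_instance_labels(bags):
    """Build D0: every instance inherits the label of the contract (bag) it came from.

    bags: list of (instances, bag_label) pairs, where bag_label is 1 for a
    vulnerable contract and 0 for a benign one.
    Returns rows of (instance, label, bag_id).
    """
    return [(inst, label, bag_id)
            for bag_id, (instances, label) in enumerate(bags)
            for inst in instances]


bags = [
    ([["PUSH1", "CALL", "SSTORE"], ["PUSH1", "STOP"]], 1),  # vulnerable contract
    ([["PUSH1", "STOP"]], 0),                               # benign contract
]
d0 = init_instance_labels(bags)
print([label for _, label, _ in d0])
```

Note that the benign-looking instance in the first bag starts out labeled 1. Such initial noise is expected under this scheme, and keeping the bag index makes it possible to revisit instance labels per bag in the later instance optimization step.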
(ii) Model training and classification.To distinguish between benign and malicious instances, we employ a basic Bidirectional LSTM (Bi-LSTM) model to focus on the underlying contextual relationships of the opcodes in instances.Next, the Bag-instance and self-model attentions based Bi-LSTM (called Bi 2 -LSTM) model is proposed to catch the salient instance fragments and consider both bag and instance learning effects.Notably, given the generality and extensibility framework of Vulnerability Learner, he can also employ other ML models (e.g., Random Forest) to further improve the detection performance, which is discussed in Section IV-F.This part details the components of Bi 2 -LSTM model.
Instance encoding.The instance consists of T opcodes opseq = {x 1 , • • • , x T }, and each opcode x i is converted into its bytecode e i by assembly.Then the vector is fed into the next layer as a real-valued vector Bi-LSTM network.For the opcode sequence modelling task, it is beneficial to consider future and past contexts.To this end, Bi-LSTM networks extend the unidirectional LSTM networks by introducing a second layer, where the hidden to hidden connections flow in the opposite temporal order.As shown in step 2 of Fig. 5, the Bi 2 -LSTM layer contains two sub-networks of LSTM units for the left and right sequence context, representing the forward and backward passes, respectively.In the forwardpass t th time step operation, the forget gate f t and input gate v t can be calculated as where − − → h t−1 is the current hidden state and e t denotes the t th input of the LSTM unit.Then the temporary memory cell Ct and the next cell state can be computed and updated by Finally, the output gate o t and the next hidden state − → h t can be obtained as follows.Particularly, The data flow of the backward-pass operations is similar to the above process, and the final output h t of the Bi-LSTM layer for the t th input is shown in the following equation: Self-model attention.Let H be a matrix consisting of output vectors [h 1 , h 2 , • • • , h T ] that the Bi-LSTM layer produced, where T is the sequence length.The sequence's representation r is formed by a weighted sum of these output vectors.
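The gate updates and the attention pooling referred to in this part follow well-known formulations. The block below is a sketch that writes them in the notation of this section, assuming the standard LSTM cell and the usual attention pooling for Bi-LSTM outputs. The weight matrices W and U, the biases b, the intermediate M, and the combination operator ⊕ (element-wise sum or concatenation) are assumed symbols, and the original equation numbering is not reproduced.

```latex
\begin{align*}
f_t &= \sigma\left(W_f e_t + U_f \overrightarrow{h}_{t-1} + b_f\right), &
v_t &= \sigma\left(W_v e_t + U_v \overrightarrow{h}_{t-1} + b_v\right), \\
\tilde{C}_t &= \tanh\left(W_c e_t + U_c \overrightarrow{h}_{t-1} + b_c\right), &
C_t &= f_t \odot C_{t-1} + v_t \odot \tilde{C}_t, \\
o_t &= \sigma\left(W_o e_t + U_o \overrightarrow{h}_{t-1} + b_o\right), &
\overrightarrow{h}_t &= o_t \odot \tanh\left(C_t\right), \\
h_t &= \overrightarrow{h}_t \oplus \overleftarrow{h}_t, \\
M &= \tanh(H), &
\alpha &= \operatorname{softmax}\left(w^{*} M\right), \\
r &= H \alpha^{*}.
\end{align*}
```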
where H ∈ R T , w is a trained parameter vector, and w * is its transpose.The dimensions of w, α, and r are 1, T , and 1, respectively.Notably, the weight α reflects the importance of the input x t in each time step t during model inference and can be utilized to calculate the key sequence fragments, as further detailed in Section III-G.Therefore, this mechanism enables VulHunter to discover defective contract fragments.Then, the final sequence-pair representation h * = tanh(r) is obtained.
Classifying.We use a softmax function to predict the label ŷ from a discrete class set Y = {0, 1} for a sequence opseq.
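To make the attention pooling and the softmax classification concrete, the following sketch runs them on scalar hidden states, matching the dimensions stated above (w and r of dimension 1, α of dimension T). The hidden states, the attention parameter, and the classifier weights and biases are hypothetical numbers rather than trained values.

```python
import math
from typing import List, Tuple


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(h: List[float], w: float) -> Tuple[List[float], float]:
    """alpha = softmax(w * tanh(h)); r = sum_t alpha_t * h_t; h* = tanh(r)."""
    alpha = softmax([w * math.tanh(v) for v in h])
    r = sum(a * v for a, v in zip(alpha, h))
    return alpha, math.tanh(r)


def classify(h_star: float, weights: Tuple[float, float],
             biases: Tuple[float, float]) -> Tuple[int, List[float]]:
    """Softmax over Y = {0, 1}; returns (predicted label, class probabilities)."""
    probs = softmax([weights[k] * h_star + biases[k] for k in range(2)])
    return probs.index(max(probs)), probs


h = [0.1, 0.9, -0.3, 0.4]            # hypothetical Bi-LSTM outputs h_1..h_T
alpha, h_star = attention_pool(h, w=2.0)
label, probs = classify(h_star, weights=(-1.0, 1.0), biases=(0.0, 0.0))
print(alpha, h_star, label, probs)
```

The vector `alpha` is the quantity later reused to locate key opcode fragments: the time step with the largest weight contributes most to the pooled representation.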
Training with Bag-instance hybrid attention.As shown in the model training part of Fig. 5, Vulnerability Learner utilizes Bi 2 -LSTM to classify each instance of bags and takes the cross entropy as the loss loss ins of the instance features learning.Specifically, the average loss ins of the bag C j can be calculated from Eq. ( 13), where p(x) represents the probability that an instance x with the tag y(x) (0 and 1 denoting benign and malicious, respectively) is predicted as malicious.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
More importantly, a key judgment is whether the bag is malicious, i.e., whether there are malicious instances in the bag. To this end, Vulnerability Learner takes the prediction for the instance most likely to be malicious (i.e., the instance with the largest p(x)) in the bag as the prediction result of the bag, and further calculates the cross-entropy loss loss_bag against the tag y_bag of the bag C_j. Then, the two losses are fused to constitute the Bag-instance hybrid attention mechanism, enabling the model to identify malicious instances in the bag guided by the bag/contract identification. This mechanism is the key to mitigating the impact of inevitable false instance labels during the training process, as detailed in Section V.A.1.
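The two loss terms can be sketched for a single bag as follows. This assumes binary cross-entropy for both terms and a max over p(x) for the bag prediction, as the prose describes; the weighting by `lam` corresponds to λ in Eq. (16), and the function names are illustrative.

```python
import math
from typing import List


def bce(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy of predicting probability p for label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def hybrid_loss(p: List[float], y_ins: List[int], y_bag: int,
                lam: float = 0.6) -> float:
    """loss = lam * loss_ins + loss_bag for a single bag C_j."""
    # Instance term: average cross-entropy over the instances of the bag.
    loss_ins = sum(bce(pi, yi) for pi, yi in zip(p, y_ins)) / len(p)
    # Bag term: the most malicious-looking instance stands for the bag.
    loss_bag = bce(max(p), y_bag)
    return lam * loss_ins + loss_bag


# A malicious bag whose three instances all inherited label 1.
p = [0.9, 0.2, 0.1]
loss = hybrid_loss(p, y_ins=[1, 1, 1], y_bag=1)
print(round(loss, 4))
```

In this toy bag the bag term stays small because one instance is confidently malicious, while the instance term is large because two inherited labels are likely wrong; weighting the instance term by λ is what limits the influence of such labels.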
loss = λ · loss_ins + loss_bag    (16)

where a larger value of the weight λ ∈ (0, 1], λ ∈ Q^+, indicates that the learner concentrates more on instances, and a smaller value that it concentrates more on bags.

(iii) Instance optimization. As shown in step 3 of Fig. 5, the training dataset is updated after the Bi²-LSTM training and then used for iterative training again. Specifically, the Vulnerability Learner employs the trained model to classify the original dataset D_0 and filters out some distinguishable instances to construct a new training dataset. The dataset consists of two parts. One comes from the instances of benign bags that are predicted as most benign, i.e., for a benign bag C_i, the instances are arranged in ascending order of p(x), and the proportion q_b of instances with the smallest p(x) is selected. The other comes from the instances of malicious bags that are predicted as most malicious, i.e., for the sorted bag C_j, the proportion q_m of instances with the largest p(x) is selected. The value of q_m enables the model to consider vulnerabilities triggered by multiple instances, such as obtaining contract permissions and then stealing the balance. The impact of these parameters and of the training epochs on feature learning is evaluated in Section IV-G, and their values are suggested by considering detection accuracy and overhead.
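The filtering step can be sketched as below. It interprets q_b and q_m as fractions of a bag, which the default values 0.8 and 0.2 suggest but the text does not state outright; the data layout and the rounding rule (at least one instance per bag) are likewise assumptions.

```python
from typing import Dict, List, Tuple


def select_instances(p_by_bag: Dict[str, List[float]],
                     bag_labels: Dict[str, int],
                     q_b: float = 0.8, q_m: float = 0.2
                     ) -> List[Tuple[str, int, int]]:
    """Return (bag id, instance index, label) triples for the next round."""
    selected = []
    for bag_id, probs in p_by_bag.items():
        label = bag_labels[bag_id]
        # Instance indices in ascending order of p(x).
        order = sorted(range(len(probs)), key=lambda i: probs[i])
        if label == 0:
            k = max(1, round(q_b * len(order)))
            keep = order[:k]       # most benign-looking instances
        else:
            k = max(1, round(q_m * len(order)))
            keep = order[-k:]      # most malicious-looking instances
        selected.extend((bag_id, i, label) for i in keep)
    return selected


p_by_bag = {"A": [0.1, 0.7, 0.3, 0.2, 0.05],   # benign bag
            "B": [0.2, 0.95, 0.4, 0.1, 0.3]}   # malicious bag
new_dataset = select_instances(p_by_bag, {"A": 0, "B": 1})
print(new_dataset)
```

With these toy scores, the benign bag drops only its most suspicious instance, while the malicious bag keeps only its single most suspicious one; raising q_m would keep more instances per malicious bag, which matters when a vulnerability spans several paths.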

F. Vulnerability Identifier
Vulnerability Identifier employs multiple Bi²-LSTM models to detect contract vulnerabilities. Specifically, for each vulnerability described in Section II-B, a detector that identifies contract instances with the corresponding vulnerability features can be trained on pre-collected datasets. Note that the instance extraction of contract samples is performed only once during the vulnerability detection. The models then perform fast inference, so that multiple detectors require little time overhead and outperform many SOTA methods, which is discussed in Section IV-D. Also, since different models are independent during inference, technologies such as parallel computing [40] can be used to improve performance. More importantly, given the scalability of Vulnerability Learner, detectors can be easily trained on datasets to identify new vulnerabilities.
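Because the per-vulnerability detectors are independent, they can share one set of extracted instances and run concurrently. The sketch below uses Python's standard thread pool; the two detectors are toy stand-in rules, not the trained Bi²-LSTM models, and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Instances = List[List[str]]
Detector = Callable[[Instances], bool]


def run_detectors(instances: Instances, detectors: Dict[str, Detector],
                  workers: int = 4) -> Dict[str, bool]:
    """Run every detector on the same instances and collect the verdicts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(det, instances)
                   for name, det in detectors.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Stand-in detectors: each flags the contract if any instance matches.
detectors = {
    "controlled-delegatecall": lambda ins: any("DELEGATECALL" in s for s in ins),
    "timestamp": lambda ins: any("TIMESTAMP" in s for s in ins),
}
instances = [["PUSH1", "TIMESTAMP", "JUMPI"], ["CALLVALUE", "STOP"]]
report = run_detectors(instances, detectors)
print(report)
```

Instances are extracted once and passed to all detectors, mirroring the point that extraction is performed only once per contract.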

G. Result Exporter
After the vulnerability detection, the Result Exporter generates a security analysis report for the contract, which consists of security conclusions and repair suggestions.⁶ Specifically, it outputs the "safety" conclusion and the corresponding evidence (i.e., analysis details) when there is no instance with vulnerability features. Otherwise, the contract vulnerabilities are indicated together with their severity (i.e., High, Medium, Low, Info, and Opt). More importantly, the vulnerable instances and their key fragments can be further used to perform symbolic constraint solving and defective source code mapping, thus granting VulHunter the ability to execute secondary verification and utilization. Also, it can output possible suggestions based on the above information to help developers fix the vulnerabilities.
Locating the defective source code statements. After the model prediction for contracts, Result Exporter obtains the contracts' instances with labels and their weight vectors α from Vulnerability Identifier. It computes the index vector L of the largest m weight values and extracts the corresponding opcodes ops_key, as stated in Eqs. (17) and (18). When the contract source code is provided, it can map the begin and end positions (i.e., lines and columns) of these opcodes in the source code from the ASM file extracted by Contract Inputter. Then, the defective contract statements are obtained by extracting the source code at these positions.
Symbolic constraint solving for instances. The identified vulnerable instances describe the state transitions during contract execution, as described in Section III-C. Inspired by the symbolic execution technique, Result Exporter presents an optional constraint-solving module. It employs these instances to build the accumulated constraints satisfied by the symbolic inputs, which can be further translated into the Satisfiability Modulo Theories Library (SMT-LIB) language. This process is mature and performed in tools like Oyente [4] and Manticore [24]. Then, based on an SMT solver [41], it reports that the instance is infeasible when its constraint condition is unsatisfiable. Otherwise, it outputs the feasible conclusion and the input values that meet the conditions. As stated in [20], there will be some dead code in bytecode. In this way, the infeasible instances will be corrected as benign, thus eliminating some false positives. Currently, this module supports several SMT solvers (including Z3 [42], Yices [43], and CVC4 [44]), and users can employ any of them or a combination [45], given their unique performance and built-in theories. For instance, Z3 can handle linear/non-linear operations on more data types such as bitvectors and arrays, while Yices can achieve a faster solving process [46]. Notably, compared with symbolic execution-based methods, VulHunter can be regarded as an automatic rule-maker for vulnerabilities and a filter for contract execution paths (it works like reinforcement learning [11]), which performs fast reasoning to prune normal paths. Furthermore, other feasible and misreported instances are discussed in Section V.C.1.
Secondary verification and utilization. In addition to allowing developers to quickly determine the audit correctness through instance feasibility detection and defective statement positioning, VulHunter can also support other services not provided by current ML-based methods. For instance, auditors can invoke the vulnerable contract with the parameters calculated from symbolic constraints to trigger vulnerabilities such as integer-overflow. Also, they can identify abnormal contract transactions by verifying vulnerability constraints. These applications are detailed in Section V-D.
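The key-opcode extraction can be sketched as a top-m selection over the attention weights. Returning the indices in execution order is an assumption, as is the function name; the opcodes and weights below are invented for illustration.

```python
from typing import List, Tuple


def key_opcodes(opseq: List[str], alpha: List[float],
                m: int) -> Tuple[List[int], List[str]]:
    """Indices L of the m largest weights and the opcodes ops_key at them."""
    top = sorted(range(len(alpha)), key=lambda i: alpha[i], reverse=True)[:m]
    index_vector = sorted(top)  # assumed: report positions in execution order
    return index_vector, [opseq[i] for i in index_vector]


opseq = ["PUSH1", "CALLVALUE", "CALL", "SSTORE", "STOP"]
alpha = [0.05, 0.10, 0.50, 0.30, 0.05]
L, ops_key = key_opcodes(opseq, alpha, m=2)
print(L, ops_key)
```

The positions in `L` are what would then be looked up in the compiler's opcode-to-source mapping to recover line and column ranges.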
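The translation of accumulated path constraints into SMT-LIB can be pictured with the sketch below, which only emits the solver script as text. The variable names and the two example constraints (an unsigned addition that wraps around) are hypothetical, and how constraints are derived from opcodes is outside the scope of this sketch.

```python
from typing import Dict, List


def to_smtlib(variables: Dict[str, int], constraints: List[str]) -> str:
    """Emit an SMT-LIB 2 script asserting all path constraints of one instance."""
    lines = ["(set-logic QF_BV)"]
    for name, width in variables.items():
        lines.append(f"(declare-const {name} (_ BitVec {width}))")
    lines += [f"(assert {c})" for c in constraints]
    lines += ["(check-sat)", "(get-model)"]
    return "\n".join(lines)


# Hypothetical path: amount exceeds balance, and balance + amount wraps around.
script = to_smtlib(
    {"amount": 256, "balance": 256},
    ["(bvugt amount balance)", "(bvult (bvadd balance amount) balance)"],
)
print(script)
```

The resulting text can be passed to any SMT-LIB 2 compliant solver. A `sat` answer comes with concrete input values, which corresponds to the feasible case, whereas `unsat` marks the instance as infeasible.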
3) Datasets: Table V shows the details of the five datasets in the experiment. Datasets_1~2 are both open-source datasets with partial labels, which are suitable for assessing the precision of the methods (exploring RQ1, RQ5, and RQ6). In order to ensure the correctness of the labels, we have manually checked and supplemented these labels based on the verification results of multiple methods such as SmartFast [6], Slither [8], and Oyente [4]. We make these datasets and labels public to enable cross-checks by other researchers and further guarantee their accuracy. Among them, Dataset_1 consists of 38,600 real-world Ethereum contracts with source code, excluding the empty and uncompiled contracts. It has been used in [5], [12], [17], [49]. According to the vulnerability severity mentioned in Table III, we can count the number of 30 kinds of vulnerabilities in Dataset_1 as High (13,149), Info (19,659), Opt (38,314), etc.⁹ Dataset_2 contains 579 contracts with only bytecode and is marked with 8 kinds of known vulnerabilities; it is the same dataset used in [20]. In order to obtain a representative picture of vulnerabilities in the production environment, we have downloaded the Solidity code of 13,413 contracts in actual use by invoking the Etherscan API [32]. These contracts make up Dataset_3 (284.3 MB in size). Also, we collected 183,710 contracts with runtime bytecode as Dataset_4, which were crawled and filtered from Ethereum by Chen et al. [50]. Note that the amount of bytecode far exceeds that of source code, reflecting the necessity of bytecode-level analysis. Both datasets are employed to discuss the number of vulnerabilities in the Ethereum blockchain (exploring RQ2~RQ4 and RQ7). In addition, Dataset_5 consists of the collected contract source code for 29 well-known vulnerability events, which was used to further evaluate the authenticity of VulHunter.

4) Metrics:
We define the discovery of vulnerable contracts as a problem. By comparing the methods' detection results with the previous vulnerability labels, we can measure whether the problem occurs, which can be regarded as a binary classification. In this way, all problems found by the methods are marked as true positive (TP), false positive (FP), true negative (TN), or false negative (FN). TP and TN indicate results that correctly predict a contract with and without a vulnerability, respectively. In contrast, FP and FN describe false detections. Furthermore, the accuracy (ACC), precision (P), recall (R), and F-Measure (F1) are calculated to evaluate each method, where #TP, #TN, #FP, and #FN refer to the number of contracts marked accordingly.
Parameter Settings: Unless otherwise stated, we report the performance of all models with the following empirical settings: n_cycle = 2, n_block = 32, n_seq = 10, T = 512, λ = 0.6, q_b = 0.8, q_m = 0.2, epoch = 50, n_neurons = 512, and the instances with the longer lengths. Also, the Bi²-LSTM is selected by default, and the baseline models and hyperparameters are evaluated in Sections IV-F and IV-G. For ML-based methods on Datasets_1~2, in order to explore the effect of different proportions between benign and malicious contracts, we randomly select 80% of the benign/malicious contracts as the training dataset and the other 20% as the testing dataset based on five random seeds (i.e., 42, 1234, 2345, 3456, and 4567), respectively, and report the averaged results with standard deviations. Furthermore, the Bi²-LSTM model trained on Datasets_1~2 was employed to identify the contracts in Datasets_3~5.
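The four metrics follow their standard definitions, which the sketch below computes from the confusion counts. Returning 0 when a denominator is zero is a convention chosen here for illustration, and the counts in the example are invented.

```python
from typing import Dict


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """ACC, P, R and F1 from the numbers of TP, FP, TN and FN contracts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"ACC": acc, "P": p, "R": r, "F1": f1}


m = metrics(tp=80, fp=10, tn=100, fn=20)
print({k: round(v, 4) for k, v in m.items()})
```

Because F1 is the harmonic mean of P and R, a method with high precision but low recall (or the reverse) is penalized, which is why F1 is reported alongside ACC.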
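The 80%/20% split over the five listed seeds can be sketched as follows. Splitting benign and malicious contracts separately keeps their proportion equal in both parts; the contract identifiers are placeholders, and the actual training and evaluation calls are omitted.

```python
import random
from typing import List, Sequence, Tuple

SEEDS = (42, 1234, 2345, 3456, 4567)


def split_80_20(benign: Sequence[str], malicious: Sequence[str],
                seed: int) -> Tuple[List[str], List[str]]:
    """Randomly assign 80% of each class to training and 20% to testing."""
    rng = random.Random(seed)
    train, test = [], []
    for group in (benign, malicious):
        items = list(group)
        rng.shuffle(items)
        cut = len(items) * 4 // 5  # integer arithmetic avoids float rounding
        train += items[:cut]
        test += items[cut:]
    return train, test


benign = [f"b{i}" for i in range(100)]     # placeholder contract ids
malicious = [f"m{i}" for i in range(50)]
for seed in SEEDS:
    train, test = split_80_20(benign, malicious, seed)
    print(seed, len(train), len(test))
```

Each seed yields one train/test pair; the reported means and standard deviations would then be taken over the five resulting runs.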

B. Precision of VulHunter (RQ1)
To answer the first research question, we compared the ability of VulHunter with SOTA methods based on traditional analysis and ML to detect contract source code in Dataset_1 and bytecode in Dataset_2. Specifically, (i) we executed the methods on these contracts.¹⁰ (ii) We extracted all vulnerabilities detected by the methods into JSON files. (iii) We mapped each vulnerability category detected by a method to a unified name, as described in Table IV. For instance, SmartCheck detects a vulnerability called SOLIDITY_TX_ORIGIN that we link to the tx-origin (TO) category. (iv) We used the true labels of contracts to calculate the metrics based on the detection results.
In order to balance the training dataset and evaluate the effect of different ratios between the benign and malicious contracts on VulHunter performance, we set two proportions of 2:1 and 5:1, considering the minimum requirements on the number of samples. The results are presented in Tables VI and VII, which illustrate the performance of each method.¹¹ Note that methods such as SmartCheck have no standard deviations given their deterministic detection. Both tables contain three parts: (i) metrics such as ACC and F1 for some vulnerabilities (e.g., reentrancy-eth); (ii) the total AVG and NAVG metrics of each vulnerability severity, where the NAVG only covers the vulnerability categories that the methods can identify; (iii) the AVG and NAVG metrics of each method.
First of all, in the 2:1 experiment, we can summarize from part (i) that VulHunter outperforms other methods for most vulnerabilities, such as controlled-delegatecall and timestamp. In contrast, the performance of traditional detection methods such as SmartCheck and Securify is restrained by their outdated rules. On the one hand, this reflects that the differences in contract expression caused by vulnerability evolution and compiler upgrades can bypass their fixed detection rules. Their experts therefore need to constantly develop and update the detection rules to stay current, which is quite cumbersome. On the other hand, it indicates that VulHunter can automatically capture effective vulnerability features from datasets through the Bi²-LSTM model, given its data representation/fitting ability and its ability to correlate the temporal context of contract instances. Nonetheless, for some vulnerabilities such as reentrancy-eth, VulHunter is slightly inferior to methods such as Slither. This is because these vulnerabilities can be deterministically described by pre-defined rules, which enables sophisticated detection by pattern matching. This inspires ML-based methods, including VulHunter, to incorporate some deterministic knowledge, similar to AME [49] and CGE [5]. Particularly, due to the variability and diversity of ML, we can enhance the performance of VulHunter by adjusting some variable parameters and baseline models, which is discussed in Sections IV-F and IV-G. For example, increasing the number of extracted instances lets the model examine more paths, facilitating vulnerability discovery. More importantly, VulHunter does not rely on manually pre-defined rules, making it easier to detect new vulnerabilities without the involvement of experts. This is one of the reasons why we used ML models rather than traditional detection logic. Notably, it identifies (almost) all vulnerabilities better than other ML-based methods such as TMP and ContractWard. This stems from its ability to focus on the runtime execution sequences (similar to symbolic execution-based methods such as DefectChecker) and accurately capture the subtle features of benign and malicious samples during the contract execution. In contrast, TMP and ContractWard are insensitive to vulnerability features because they observe the generalized contract bytecode globally, resulting in an inferior recall rate and F1 score. In addition, some vulnerabilities or defects such as uninitialized-state and unused-state are challenging to identify at the bytecode level, which is discussed in Section V.B.1.
Observing the severities from Opt to High (i.e., part (ii)), the detection performance of VulHunter is almost always the best compared to the others, while some traditional methods, such as Slither and DefectChecker, only work well for the vulnerabilities they support. For example, Slither identifies just two problems with Info severity and achieves ACC = 92.25% and P = 99.91%, indicating that these defects are adequately described by its rules. Nevertheless, it is difficult to develop well-established rules for complex vulnerabilities such as tod, as the rules cannot execute contracts and account for all situations. Also, manually developing and frequently updating massive detection rules for each vulnerability is time-consuming for experts, which restricts the vulnerabilities these methods can detect. Therefore, they have an inferior total AVG for each severity. Besides, detection rules based on the bytecode are more difficult to develop than those based on the source code, given their different readability. This is why source code detection methods such as Slither can identify more vulnerability types than those based on the bytecode, e.g., Securify. Conversely, VulHunter overcomes the bottleneck of bytecode-based rule-making with its keen feature observation and representation on massive datasets, thus matching or exceeding the detection performance of source code-based methods without expert involvement.
From part (iii), it is concluded that VulHunter can detect contracts more accurately, discover most vulnerabilities, and perform with an acceptable standard deviation that is lower than that of TMP, due to its effective information extraction and tailored multi-attention mechanism. Notably, VulHunter can choose suitable random seeds to achieve better results, which is impossible for traditional methods. Also, some contracts cannot be analyzed by other methods. Even Oyente and Mythril have failure rates of 16.26% and 21.99%, respectively. This may be a reason for their poor detection. Instead, VulHunter can analyze more contracts given its robust implementation and refined model.
From Table VII, we can draw similar conclusions as above. VulHunter can still detect more vulnerabilities and outperform other methods. Since the proportion of the training set grows to 5:1, i.e., the benign samples are expanded, the model is inclined to learn benign features, thereby improving the identification accuracy and reducing the standard deviation, i.e., the overall ACC = 92.57% (2.33%↑) and STD_ACC = 1.81% (0.08%↓). Nonetheless, the diluted malicious features bias the model convergence towards benign samples, reducing its recall rate and increasing the standard deviation. Although other methods may not require pre-training, this phenomenon also applies to them due to the dataset variation. The changes in the data distribution have even seriously affected the performance of ContractWard, such as its failure to identify the timestamp vulnerability (P = 0 and R = 0). This reflects that its robustness needs to be improved. Thus, auditors can sample in appropriate proportions based on their identification requirements and the contract distribution in the collected dataset, so as to achieve their specific goals. For example, to report as many suspected vulnerabilities as possible, they can set a small ratio between benign and malicious contracts to achieve a superior recall rate. Instead, they can set a large proportion to detect more contracts correctly, thereby relieving the pressure on manual review.
Second, we utilized 7 methods supporting bytecode analysis to detect the contracts in Dataset_2. As described in Table VIII, the results agree with the above conclusion that VulHunter can detect most vulnerabilities accurately. Also, although changes in individual detection results on small datasets can lead to significant differences, VulHunter still keeps acceptable standard deviations, demonstrating its stability. Notably, for vulnerabilities such as reentrancy-eth, symbolic execution-based methods (i.e., DefectChecker, Oyente, and Mythril) achieve superior precision by executing symbolic inference, and DefectChecker achieves the best performance. Nevertheless, its inferior recall rate may be affected by incomplete predefined rules, reflecting that it is difficult to cover all cases given their massive number and the unreadable bytecode. In contrast, VulHunter performs accurate model inference to automatically capture the semantic features of benign and vulnerable contracts, thus balancing precision and recall and holding a superior F1 score. Moreover, Securify and SMARTIAN failed to analyze most contracts due to procedural errors. They can improve the performance by optimizing their implementation code. The imperfect features of ContractWard lead to its limited analysis and zero fluctuations, suggesting that its model needs to be refined to focus on subtle runtime features.
In short, VulHunter achieves superior bytecode-based detection due to its actual execution path inputs, well-designed model, and meticulous prototype implementation.
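A 2:1 or 5:1 training set can be obtained by sampling benign contracts relative to the available malicious ones. The sketch below downsamples the benign side; whether the original experiments sampled in exactly this way is an assumption, and the contract identifiers are placeholders.

```python
import random
from typing import List, Sequence, Tuple


def sample_with_ratio(benign: Sequence[str], malicious: Sequence[str],
                      ratio: int, seed: int = 42
                      ) -> Tuple[List[str], List[str]]:
    """Downsample benign contracts so that benign:malicious is at most ratio:1."""
    want = min(len(benign), ratio * len(malicious))
    rng = random.Random(seed)
    return rng.sample(list(benign), want), list(malicious)


benign = [f"b{i}" for i in range(1000)]    # placeholder contract ids
malicious = [f"m{i}" for i in range(100)]
for ratio in (2, 5):
    b, mal = sample_with_ratio(benign, malicious, ratio)
    print(f"{ratio}:1 ->", len(b), "benign,", len(mal), "malicious")
```

A vulnerability with few malicious samples limits how many benign contracts can be used at a given ratio, which relates to the minimum sample requirements mentioned above.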

Answer to RQ1. What is the effectiveness of VulHunter in detecting contract vulnerabilities?
Compared with SOTA methods on multiple datasets, VulHunter has a superior contract detection performance and acceptable standard deviations (ACC = 90.04%, P = 87.92%, R = 83.41%, and F1 = 85.60%). Also, it can analyze more contracts without failure (Failed = 0%), which illustrates its robust implementation. More importantly, unlike traditional methods, it can analyze source code and bytecode accurately without manually pre-defined rules, thus easily expanding to identify new vulnerabilities and to handle the complex representations caused by vulnerability evolution and compiler updates. In turn, its identified defective source code statements and key opcode subsequences can help traditional methods maintain their detection rules to cover more cases and improve their performance.

C. Vulnerabilities in Production Smart Contracts (RQ2)
To answer the second research question, we ran the 10 methods on contract source code in Dataset_3 and bytecode in Dataset_4. The detailed results are given in Tables IX and X, which aim to show the frequency of each vulnerability on Ethereum. Specifically, Table IX shows the detection results of each method for the various severities.¹² Among them, VulHunter and TMP detected the most contract vulnerabilities. This can be attributed to the large number of vulnerability categories they detect and the comprehensive vulnerability features they learned, which would require laborious development by many contract experts for traditional methods. Specifically, VulHunter found 443 contracts without vulnerabilities above Info severity, and more than half of the contracts have multiple problems. Note that the higher numbers may be caused by a few vulnerabilities, e.g., integer-overflow accounts for 35.3% of Medium severity. Although, as mentioned in [51], [52], many discovered defects are difficult to exploit or even not exploitable in practice (i.e., the proportions of probably and possibly in each severity are High (67.2%), Medium (65.2%), and Low (100.0%), respectively), identifying these problems can draw the attention of contract developers to possible threats, thereby reducing the risk of contracts being attacked. Moreover, Oyente and SMARTIAN failed to analyze numerous contracts, which may be a significant factor restraining their vulnerability discovery.
As discussed in Sections IV-B, IV-F, and IV-G, auditors can adjust the contract proportion of the training dataset, the baseline models (e.g., Decision Tree and Random Forest), and the hyperparameters (e.g., q_m and q_b) to determine whether VulHunter focuses on vulnerability discovery or benign contract identification. In this section, we employed the settings of ratio = 2:1, the Bi²-LSTM model, q_b = 0.8, and q_m = 0.2 to detect as many contract security threats as possible. Nevertheless, this inevitably produces some false positives. To this end, as detailed in Section III-G, on the one hand, VulHunter outputs defective source code statements and key opcode fragments to facilitate manual secondary verification and quickly assess false reports. On the other hand, it can employ the optional constraint-solving module to automatically validate the feasibility of vulnerable instances and reduce the workload of manual verification, which is explored in Section IV-I. Note that the latter cannot be provided by other methods based on static analysis and ML. Besides, some contracts are discussed in Section IV-E to illustrate the validity of partial results. Overall, given the superior detection capabilities of VulHunter, the security of contracts on Ethereum needs to be taken seriously, and owners should develop contracts based on the specifications [53]. Due to the expensive time overhead of other methods such as Mythril and Oyente, we randomly sampled 10,000 bytecodes from Dataset_4 for them to detect, and only VulHunter analyzed all of the 183,710 bytecodes.¹³ Table X shows phenomena similar to those mentioned above. Since the number of failed contracts of methods such as Securify and SMARTIAN increases, their results are incomplete but still informative. Furthermore, VulHunter can also combine some relatively accurate detection patterns from symbolic execution-based bytecode-level methods (e.g., DefectChecker) to discover misreports, which is discussed in Section V.C.1.
Besides, the harm caused by a contract is related to both its number of vulnerabilities and its frequency of use. Inactive contracts with vulnerabilities alone may not cause damage to users. Thus, in order to evaluate the contract severity more reasonably, we have introduced the transaction number as the activity frequency of contracts. Fig. 6 presents the correlation between the number of vulnerabilities with various severities and the number of contract transactions. It shows the following phenomena. (i) A similar distribution holds for the different severities, indicating that contracts tend to have vulnerabilities of varying severity. (ii) The number of contract transactions is widely distributed, and most transactions involve vulnerable contracts, suggesting that the contracts on Ethereum can be further improved. (iii) The number of vulnerable contracts gradually decreases as their transactions increase, and this phenomenon is most prominent for Medium and High severity. This may be attributed to the importance of these contracts prompting the developers to review them in detail. Note that it is dangerous to have many vulnerabilities in active contracts involving numerous transactions, whereas the harm of vulnerable contracts with fewer transactions may not be serious.

Fig. 6. Correlation between the number of vulnerability types in Dataset_4 detected by VulHunter and the number of contract transactions. In the figure, the upper limit of the number of transactions is set to 20,000 (that is, counts above 20,000 are shown as 20,000).

Answer to RQ2. How many vulnerabilities are present in the Ethereum blockchain?
Most of the contract source code in Dataset_3 and contract bytecode in Dataset_4 were detected with vulnerabilities ranging from Low to High severity. Although these vulnerabilities are not easy to exploit and cause harm, identifying them allows contract owners to focus on risky code and then develop normative contracts. Also, many contracts in these datasets can be optimized to improve their operation status. More importantly, given its detection scalability, performance flexibility, and result verifiability, VulHunter can detect wild contracts better in the future based on the requirements of auditors.

D. Execution Overhead of VulHunter (RQ3)
In this section, we present the execution overhead of the methods for analyzing Ethereum contracts. First, we selected about 100 contracts with a size of about 121KB, such as the contract with address 0xce5b23f11c486be7f8be4fac3b4ee6372d7ee91e (3,049 lines). Then Oscillo [54] was employed to monitor the time and memory overhead of the 10 methods for detecting these contracts. As shown in Fig. 7(a), the ML-based (e.g., VulHunter and TMP) and pattern matching-based (e.g., SmartCheck and Slither) methods generally require less time than those based on symbolic execution (e.g., Oyente and Mythril) and fuzz testing (SMARTIAN). But there are exceptions. For instance, the time overhead of DefectChecker is similar to that of SmartCheck owing to its lightweight design. Nevertheless, VulHunter leverages ML to complete the detection within an average of 4.4 seconds. Also, most of the time overhead was used to extract instances (3.7s) and load the models (0.65s), which can be improved by parallelism. Notably, although the models learn numerous representations for vulnerabilities, they perform fast inference (30 models only require 0.05s), allowing them to be extended to identify more vulnerability types in imperceptible time.
As illustrated in Fig. 7(b), because thousands of search paths need to be traversed and executed in symbolic execution-based methods such as Oyente, they generally require more memory than ML-based and pattern matching-based methods. Also, the actual running of contracts based on numerous test cases introduces additional memory overhead in fuzz testing-based methods. Similar to pattern matching-based methods, the memory overhead of VulHunter mainly consists of three parts: extracted instances, pre-trained models, and intermediate variables during the inference process. The lower memory overhead reflects the main advantage of ML-based methods, i.e., contracts can be accurately analyzed on resource-limited devices. Furthermore, this phenomenon brings great development potential for VulHunter. For instance, the remaining space can be employed to deploy a private chain that executes vulnerable sequences, thereby ensuring that they can trigger the vulnerabilities.

Answer to RQ3. How much overhead does VulHunter require to analyze the smart contracts?
VulHunter takes an average of 4.4s and 81.3MB to analyze a 121KB Ethereum contract, which is only one-tenth (or even less) of the overhead of fuzz testing-based (e.g., SMARTIAN) and symbolic execution-based (e.g., Oyente) methods. Moreover, compared with the pattern matching-based (e.g., Slither) and ML-based (e.g., TMP) methods, VulHunter has superior performance in both time and memory overhead, which offers vast potential for future development.

E. Authenticity of VulHunter (RQ4)
To explore the superior performance of VulHunter, we examined its detection results on real-world contracts (including those deployed on Ethereum in Dataset_3 and those derived from vulnerability incidents in Dataset_5). Table XI shows some examples of contracts in Dataset_3 (more examples and the "Others" column are detailed in https://github.com/Secbrain/VulHunter/tree/main/Dataset3/Example_results).
Detection of reentrancy-eth vulnerabilities. VulHunter detected a reentrancy-eth vulnerability in the contract RedExchange (2.00E+16 Wei). The "payFund()" function (corresponding to line 11 of Listing 3) is declared as public. Although this function is guarded with the "onlyAdministrator" modifier, anyone can become an administrator by invoking the "RedExchange" function. Moreover, the gas specified by the call function is too large, while the secure gas is usually 2300 ( 40,000). Thus, attackers can construct an attacking contract and utilize the "setBondFundAddress()" function to set the withdrawal address "bondFundAddress" to the attacking contract address, thereby realizing a reentrancy attack. However, methods such as Oyente and DefectChecker missed this vulnerability. In particular, when "RedExchange" is not provided, attackers are not able to invoke "payFund" normally. However, Bi²-LSTM and other methods such as Slither will still identify this path that includes the reentrant call function as the invariable execution sequence. As detailed in Section V.B.2,
it belongs to a false positive of multi-instance collaborations. Notably, VulHunter can build symbolic constraints of vulnerable paths and verify their feasibility with solvers, thus eliminating misreports. Also, this problem can be mitigated by adding similar contracts to the training dataset, so that the model observes these cooperation features. In addition, there are reentrancy-eth vulnerabilities in line 1,060 of the VokenPublicSale contract and line 77 of the Acid contract, which cannot be identified by methods such as Oyente, Securify, and Mythril.
Detection of tod vulnerabilities. This vulnerability refers to inconsistent behavior caused by miners or nodes interfering with the transaction order. It covers three types: key storage variable, owner authentication, and approved tokens. For the Vesting contract (3.00E+10 Wei), the state variable "tokenReward" is assigned in the "createVestingPeriod" function and used in the "release" function (as shown in Lines 5 and 9 of Listing 4). When the transactions that invoke these two functions appear in the same block, the miner can alter the order of transactions to cause a transfer to an incorrect account. Similarly, the state variable "owner" in the Marketplace contract can be assigned twice in a single block, so that users may be confused about which owner is set. The "StandardBounties" contract is representative of the approved tokens type, and the vulnerability is caused by the "erc20-approve" function declared in line 80. When the authorizer changes the authorization, the user creates a consumption transaction that spends the original authorized tokens and sets more gas than the changed authorization transaction. In this way, the miners will prioritize the consumption transaction, so that the user can spend both the old and new authorized amounts. VulHunter successfully identified the above contracts, while methods such as Securify, TMP, and Oyente misreported them. It is noted that this vulnerability also belongs to the cooperation of multi-path executions, which
is discussed in Section V.B.2. In addition, we employ the contracts of well-known vulnerability incidents in Dataset_5 to further clarify the performance of VulHunter. Table XII describes the information of these contracts, including security incidents, vulnerability names, economic losses, and detection results.
The integer-overflow vulnerability incident for the SmartMesh contract. In April 2018, the transactions of the SmartMesh contract were suspended by various platforms such as Ethereum [55]. As shown in Listing 5, attackers can manipulate the input parameters of the "transferProxy" function to make _fee + _value = 0 (integer-overflow), so that the verification in line 3 is passed. In this way, attackers can obtain a large amount of tokens. Besides, contracts such as EMVC suffered economic losses owing to the ineffective arithmetic checks performed by a non-conforming SafeMath library. In particular, although these arithmetic variables are checked in the BEC contract (as shown in Line 4 of Listing 6), errors in its checking logic also caused an integer-overflow vulnerability (line 3). Thanks to its focus on contract runtime execution sequences and the model's ability to perceive their semantics, VulHunter successfully identified these vulnerable contracts, which were missed by methods such as Slither and SmartCheck.
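The wrap-around behind the SmartMesh incident can be reproduced with plain modular arithmetic. The sketch below is illustrative only: Listing 5 is not reproduced here, so the exact form of the guard (`balance >= fee + value`) is an assumption, and all function names are hypothetical.

```python
UINT256_MOD = 2 ** 256


def add_uint256(a, b):
    """Unchecked uint256 addition: the result wraps modulo 2**256."""
    return (a + b) % UINT256_MOD


def passes_balance_check(balance, fee, value):
    # Assumed form of the guard: proceed when balance >= fee + value,
    # where the sum is computed with unchecked uint256 arithmetic.
    return balance >= add_uint256(fee, value)


def safe_add_uint256(a, b):
    """Checked addition in the spirit of SafeMath: reject wrapped results."""
    total = a + b
    if total >= UINT256_MOD:
        raise OverflowError("uint256 addition overflow")
    return total


if __name__ == "__main__":
    fee = value = 2 ** 255
    print(add_uint256(fee, value))              # the sum wraps to 0
    print(passes_balance_check(0, fee, value))  # an empty account passes
```

With `fee = value = 2**255`, the unchecked sum is zero, so even an account with no balance satisfies the assumed guard, whereas the checked addition rejects the same inputs.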

F. The Performance of Various Baseline Models for Vulnerability Learner (RQ5)
Owing to the extensibility of the Vulnerability Learner, it can employ various ML models as its engines, giving VulHunter different detection abilities according to their distinct characteristics. To illustrate this, we ran VulHunter with 11 other supervised ML (i.e., DL and traditional ML) models on contracts in Dataset_1. Note that all models ran under the framework shown in Fig. 5 and used the Bag-instance hybrid attention described in Section III-E. Also, Bi²-LSTM and Bi²-GRU are equipped with the self-model attention on top of Bi-LSTM and Bi-GRU, respectively. Table XIII shows their partial results and total standard deviations. The results reflect the following observations.
(i) VulHunter is compatible with multiple supervised models, both DL and traditional ML, as all models achieved satisfactory results (e.g., ACC > 82%). Note that the Graph Neural Network (GNN) performs better in VulHunter than in TMP (81.84%), and the same holds for XGBoost (XGB) compared with ContractWard (70.06%). This reflects the effectiveness of VulHunter in terms of instance information extraction and its MIL detection framework. Also, DL models outperform most traditional ML models, as they have strong iterative learning, fitting, and generalization capabilities on massive datasets. As shown in Fig. 10, as the number of training epochs grows, they gradually learn the data features and steadily improve metrics such as accuracy and F1 score. This also demonstrates the correctness of the VulHunter framework. Notably, traditional ML models are easy to interpret given their adequate mathematical foundations, and most of them are so lightweight that they can be trained on small datasets.
(ii) Stronger models tend to achieve better results in VulHunter. For example, the performance ranking of DL models is Bi²-LSTM > Bi-LSTM > LSTM and Bi²-GRU > Bi-GRU > GRU. This can be attributed to their self-model attention and bidirectional semantic association properties. Also, LSTM-based models are slightly better than GRU-based ones (e.g., Bi-LSTM > Bi-GRU), as LSTM has one more gating unit than GRU, which helps it fit samples. The same holds for traditional ML models, e.g., Random Forest (RF) outperforms Decision Tree (DT) given its additional bootstrap sampling mechanism. Therefore, VulHunter can improve its performance by employing stronger models in the future.
(iii) Each model has its own characteristics and can better detect some vulnerabilities. For instance, DT achieves a high recall rate of 88.57% given its convenience and randomness detection, and RF achieves a high accuracy rate of 93.85% by employing multiple DTs for voting. Moreover, owing to their unique model structures and reasoning styles, other models deliver better results on some vulnerabilities. For example, for the reentrancy-eth detection, CNN achieved remarkable accuracy and precision rates (ACC = 94.47% and P = 100%), and RF obtained the greatest recall rate and F1 score (R = 90.59% and F1 = 94.48%). Nevertheless, thanks to its ability to model the preceding and following elements in sequences, Bi²-LSTM holds a superior overall performance, e.g., ACC = 90.04% and F1 = 85.60%. More importantly, its self-model attention is the key to locating the defective bytecode fragments and source code statements. In the future, visualization tools such as Captum [56] may help other models obtain the importance distribution of inputs.
(iv) The standard deviation of detection results in VulHunter is mainly related to the model structure/scale. Complex models containing large numbers of neurons have strong generalization/fitting capabilities while also introducing some randomness. For example, Bi-LSTM outperforms LSTM and has a higher standard deviation. Also, the voting mechanism of RF and the attention mechanism of Bi²-LSTM improve detection performance and reduce prediction fluctuations, i.e., their metrics are superior to those of DT and Bi-LSTM, respectively. Besides, the results of SVM and KNN remain constant across the training process given the determinism of their predictions, yielding zero standard deviations and inferior performance. Overall, the standard deviation of each model is within an acceptable range, reflecting the stability of the VulHunter detection framework.
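For reference, the metrics compared above (ACC, P, R, F1) and the spread across repeated runs can be computed as in the following minimal sketch. The confusion-matrix counts in the usage example are made up for illustration, and the use of the population standard deviation is an assumption, since the section does not state which estimator was used.

```python
import statistics


def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"ACC": acc, "P": precision, "R": recall, "F1": f1}


def mean_and_std(values):
    """Mean and population standard deviation of a metric over repeated runs."""
    return statistics.mean(values), statistics.pstdev(values)


if __name__ == "__main__":
    # Illustrative counts only; not taken from Table XIII.
    print(detection_metrics(tp=8, fp=2, tn=85, fn=5))
    # A deterministic model yields identical runs and hence zero deviation.
    print(mean_and_std([0.9, 0.9, 0.9]))
```

A model whose predictions do not change between runs produces identical metric values, which is why deterministic models report a zero standard deviation.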

Answer to RQ5. Can VulHunter support other baseline models for detection?
VulHunter can employ multiple DL (e.g., GRU and CNN) and traditional ML (e.g., RF and SVM) models as its detectors and achieve satisfactory performance with acceptable standard deviations. Also, the various models grant it diverse abilities. It can optimize or adopt stronger models to improve its detection metrics, and contract auditors can select different models to meet their goals based on time requirements and hardware constraints. For instance, they can combine VulHunter with DT to identify more vulnerabilities, with RF to reduce the workload of manual verification, and with Bi²-LSTM to balance both requirements as far as possible.

G. The Effect of Parameters on Performance (RQ6)
To evaluate VulHunter with different hyperparameters, we conducted experiments with varying model settings for detecting reentrancy-eth vulnerabilities based on the control variable principle. Table XIV reports the hyperparameter evaluation of instance production. The trends for $n_{block}$ and $n_{cycle}$ are similar: as their values grow, the effective length of instances gradually increases, so that the model can observe more useful information and improve metrics such as ACC and P. However, since the instance length available to the Bi²-LSTM model is limited by the parameter T = 512, the metrics stabilize when the length exceeds this value. Moreover, the metrics also improve with the number $n_{seq}$ of instances, as more contract execution paths are covered, which helps the classifier make decisions. Meanwhile, the probability of the model outputting malicious (i.e., 1) also increases, thereby improving the recall rate and slightly reducing the precision rate when $n_{seq}$ exceeds a critical value. For the instance selection strategy, it is more practical to choose the longest instances, as the classifier can then consider more semantic information. In this way, some vulnerable paths with shorter lengths can still be covered by increasing $n_{seq}$.
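The interaction between the number of instances, the longest-first selection strategy, and the length limit T can be sketched as follows. This is a simplified illustration rather than the prototype's code: `select_instances` is a hypothetical helper, instances are assumed to be lists of integer opcode values, and the padding value is an assumed convention.

```python
def select_instances(instances, n_seq, max_len, pad=0):
    """Keep the n_seq longest opcode sequences, then clip or pad each to max_len."""
    longest = sorted(instances, key=len, reverse=True)[:n_seq]
    fixed = []
    for seq in longest:
        clipped = list(seq[:max_len])                 # drop opcodes beyond the limit
        clipped += [pad] * (max_len - len(clipped))   # pad short sequences
        fixed.append(clipped)
    return fixed


if __name__ == "__main__":
    paths = [[1, 2, 3], [4], [5, 6, 7, 8, 9], [10, 11]]
    print(select_instances(paths, n_seq=2, max_len=4))
    # [[5, 6, 7, 8], [1, 2, 3, 0]]
```

In this sketch, opcodes beyond `max_len` never reach the classifier, which is consistent with the metrics levelling off once instance lengths exceed T, while raising `n_seq` lets shorter paths enter the selected set.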
We also tested VulHunter with diverse parameters for building the model. As described in Table XV, the metrics improve at T = 768 and degrade at T = 256, which can be attributed to the amount of information delivered to the classifier. Note that longer instances are not always better, as the numerous bytes may dilute valuable information and have the opposite effect. The parameter epoch shows a similar trend, that is, the model converges gradually as the number of training rounds increases. Nevertheless, excessive rounds may lead to overfitting. Moreover, a lower λ makes VulHunter focus on the overall contract labels and more inclined to improve ACC and P. Conversely, a higher λ biases it toward detecting individual instances, so that more vulnerabilities can be discovered at the cost of some FPs, thereby reducing ACC and P. Given the feasibility verification of the instances, VulHunter can correct some misreports, which is detailed in Sections III-G and V.B.2. Besides, $q_b$ and $q_m$ adjust the number of benign and malicious instances during the iteration process, and a larger $q_b$ enhances the learning of benign instances. Also, a larger $q_m$ improves the recall rate but reduces the precision rate, because benign instances in vulnerable contracts are mislabeled. As discussed in Section V.A.1, they can help VulHunter mitigate the impact of model overfitting and false instance labels. Finally, the number of neurons has no obvious effect on detection results, and an appropriate value can also alleviate overfitting.
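To make the role of λ concrete, the sketch below shows one plausible way to weight a contract-level (bag) loss against an instance-level loss. This is only an assumed form chosen to match the behavior described above (a lower λ emphasizes contract labels); the actual objective is the one defined in Section III-E, and the names `hybrid_loss` and `binary_cross_entropy` are hypothetical.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy between a predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def hybrid_loss(bag_prob, bag_label, inst_probs, inst_labels, lam):
    """Assumed weighting: (1 - lam) * bag loss + lam * mean instance loss."""
    bag_term = binary_cross_entropy(bag_prob, bag_label)
    inst_term = sum(
        binary_cross_entropy(p, y) for p, y in zip(inst_probs, inst_labels)
    ) / len(inst_probs)
    return (1.0 - lam) * bag_term + lam * inst_term


if __name__ == "__main__":
    for lam in (0.0, 0.5, 1.0):
        print(lam, hybrid_loss(0.9, 1, [0.8, 0.1], [1, 0], lam))
```

Under this assumed form, λ = 0 trains purely on contract labels and λ = 1 purely on the (possibly noisy) instance labels, which is the trade-off between precision and recall discussed above.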
Overall, the detection time and memory overhead in these experiments are stable and undemanding on devices, which allows VulHunter to be deployed in more scenarios. Therefore, serving as a path filter for symbolic executors, VulHunter may help them alleviate the path explosion problem given its accurate and fast inference capabilities. Also, as with the baseline models, we encourage developers and auditors to choose appropriate parameters based on their requirements (e.g., a bias towards precision or recall) and hardware limitations (e.g., CPU and memory). In the future, as hyperparameter optimization tools such as scikit-learn [57] mature, they can be used to tune parameters automatically for better performance.

Answer to RQ6. What is the effect of the variable parameters in VulHunter on its detection performance?
Similar to the baseline models, a variety of parameters also enables VulHunter to meet the various requirements of contract auditors. For example, auditors can discover as many vulnerabilities as possible while tolerating some FPs by increasing the value of $q_m$. Overall, most parameters have a positive and stable effect, which can guide VulHunter to analyze contract security well.

H. Assist in Vulnerability Analysis and Strategy Development at the Source Code and Bytecode Levels (RQ7)
The ultimate goal of contract inspection is to repair potential security threats, not just to discover them. However, the current ML-based methods ignore this intention and can only output whether the contracts hold vulnerabilities. To this end, VulHunter identifies vulnerabilities accurately and outputs the specific defective source code statements, key opcodes, and vulnerable execution sequences of contracts, as well as universal repair methods, delivering valuable insights for contract developers. In this section, we present two cases showing how they can guide the repair of vulnerabilities.
The DAO vulnerability incident. In June 2016, the reentrancy-eth vulnerability in the DAO contract caused a loss of $60 million in ETH. As shown in Listing 7, lines 3-6 describe the implementation of the "withdrawRewardFor" function. Thanks to its unique semantic association and fitting properties, VulHunter identified two sequences with reentrant features from ten contract runtime sequences, while methods like Mythril, SMARTIAN, and TMP cannot. Fig. 8(a) visualizes the attention weight distribution of the reentrancy-eth detector for instances, and VulHunter obtains the key opcode CALL with weights of 99.94% and 89.55% from the two defective instances, respectively. Then it locates the defective statement "msg.sender.call.value" in the contract source code by mapping the ASM file. That is, attackers can target this statement, and auditors can update this statement to repair the vulnerability. Moreover, from the bytecode perspective, auditors or attackers can obtain the defective intersection subsequence [PUSH1 0x14, SLOAD, PUSH1 0x16, SLOAD ... MLOAD ... GAS CALL ... 0x16 SSTORE ... 0x14 SSTORE] around the key opcode, which conveys useful information about the vulnerability. Here, 0x14 and 0x16 are the slot IDs of the state variables "balances" and "totalSupply", respectively. By analyzing these opcodes, three features of the reentrancy-eth vulnerability can be summarized: (i) the GAS consumed by the CALL instruction is not restricted, i.e., there is no specific value of 2300, indicating that it belongs to a call.value() function; (ii) the "reward" variable loaded by an MLOAD instruction is not a zero constant; (iii) the state variable with the same slot ID is read before invoking the call function and updated after executing the call function.
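Features (i) and (iii) above can be turned into a simple check over an opcode sequence, as in the following sketch. It is a deliberately naive illustration, not VulHunter's detector: the token format ("PUSH1 0x14", "SLOAD", ...) and the helper name `reentrancy_slots` are assumptions, and the slot of each SLOAD/SSTORE is taken from the immediately preceding PUSH instead of tracking the EVM stack.

```python
def _accessed_slots(segment, access):
    """Slots whose PUSH immediately precedes the given access opcode."""
    found = set()
    for prev, cur in zip(segment, segment[1:]):
        parts = prev.split()
        if cur == access and len(parts) == 2 and parts[0].startswith("PUSH"):
            found.add(parts[1])
    return found


def reentrancy_slots(ops):
    """Return slots read before an unrestricted CALL and written only after it."""
    if "CALL" not in ops:
        return set()
    call_idx = ops.index("CALL")
    # Feature (i): the remaining gas is forwarded (GAS right before CALL)
    # instead of a fixed stipend such as 2300.
    if call_idx == 0 or ops[call_idx - 1] != "GAS":
        return set()
    # Feature (iii): same slot loaded before the call and stored after it.
    read_before = _accessed_slots(ops[:call_idx], "SLOAD")
    written_after = _accessed_slots(ops[call_idx + 1:], "SSTORE")
    return read_before & written_after


if __name__ == "__main__":
    vulnerable = ["PUSH1 0x14", "SLOAD", "PUSH1 0x16", "SLOAD", "MLOAD",
                  "GAS", "CALL", "PUSH1 0x16", "SSTORE", "PUSH1 0x14", "SSTORE"]
    print(sorted(reentrancy_slots(vulnerable)))  # ['0x14', '0x16']
```

A sequence in which the SSTORE instructions precede the CALL yields an empty set, matching the repair direction of updating state before the external call.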
Similarly, equivalent source code features can be obtained from the located defective statements. Nonetheless, given that most contracts on Ethereum only have bytecode, information at the bytecode level may be slightly harder to understand but is more meaningful. Contract analysis tools based on pattern matching, symbolic execution, and fuzzy testing can further use the key bytecode or source code to formulate and optimize vulnerability detection rules/logic/oracles. Besides, an attacker can create an attacking contract DAOAttack as shown in Listing 7. It can trigger the fallback function by invoking the "ReAttack()" function to execute the "msg.sender.call.value()" statement in "splitDAO()" again and again, so that three variables are not updated in time (i.e., lines 6-8 are temporarily ineffective). Thus, the amount can be continuously withdrawn until the contract balance becomes zero or the GAS is exhausted.
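The repeated-withdrawal behavior can be mimicked outside the EVM with a toy model in which the external call is a callback that re-enters the withdrawal routine. The sketch below is an abstraction for illustration only, not the DAO or DAOAttack code: `Vault`, `run_attack`, and the balances are hypothetical, and the `update_first` flag switches between updating state after the call (vulnerable) and before it.

```python
class Vault:
    """Toy contract that pays out an account's balance via an external call."""

    def __init__(self, deposits, update_first):
        self.balances = dict(deposits)
        self.funds = sum(deposits.values())
        self.update_first = update_first

    def withdraw(self, account, on_receive):
        amount = self.balances.get(account, 0)
        if amount == 0 or self.funds < amount:
            return
        if self.update_first:
            self.balances[account] = 0   # state updated before the call
        self.funds -= amount
        on_receive(amount)               # external call; may re-enter withdraw
        if not self.update_first:
            self.balances[account] = 0   # state updated too late


def run_attack(update_first):
    """Return (amount received by the attacker, funds left in the vault)."""
    vault = Vault({"attacker": 10, "victim": 90}, update_first)
    received = []

    def fallback(amount):
        received.append(amount)
        vault.withdraw("attacker", fallback)  # re-enter on every payment

    vault.withdraw("attacker", fallback)
    return sum(received), vault.funds


if __name__ == "__main__":
    print(run_attack(update_first=False))  # (100, 0): vault drained
    print(run_attack(update_first=True))   # (10, 90): only own deposit
```

In the late-update variant the attacker's recorded balance is still non-zero on every re-entry, so withdrawals continue until the vault is empty, whereas updating the balance first stops the second call immediately.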
To fix this vulnerability, the developers can move the SSTORE instruction before the CALL instruction, preventing attackers from making secondary calls. Also, they are advised to use the "transfer()" function instead of "call.value()" in the source code, given that it limits the GAS overhead to 2300 and rolls back all transactions on transfer errors. Furthermore, the analysis reports made by VulHunter hold additional information such as vulnerability description, location, key bytecode distribution, and examples with generic repair methods, so as to help them develop customized repair strategies more conveniently.
Among them, 0x8 and 0x7 represent the addresses of the storage variables "trials" and "trial", respectively. Both of them need to be read from storage using the SLOAD instruction. Also, block-related instructions such as DIFFICULTY and NUMBER are employed to generate random numbers. Since these variables can be known in advance by miners and nodes, the random numbers can be inferred, thus destroying the fairness of the contract and draining its balance. Therefore, similar to the above case, VulHunter will report these defect positions in the source code and bytecode, and recommend that developers (i) use the business data as the seed of random number generators; (ii) select a combination of multiple pseudo-random data; (iii) employ online or offline random oracles.
In addition, VulHunter also detected the vulnerability in the Pets contract (2.50E+16 Wei), while methods such as SmartCheck and TMP missed it. The contract uses block.number to generate random numbers in the "getRandomNumber" function (line 81), which can also be predicted in advance by colluding with miners and nodes. To this end, contract owners need to avoid these problems during contract development. Notably, thanks to its focus on contract execution paths (i.e., opcode sequences or instances), VulHunter can employ the constraint-solving module to verify the feasibility of the vulnerable paths detected by models, thereby eliminating some false positives automatically. The details are illustrated in Section IV-I, and more applications of VulHunter are discussed in Section V-D.
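Why block-derived randomness is predictable can be seen from the fact that it is a deterministic function of values that miners and nodes already know. The sketch below illustrates this under stated assumptions: `block_based_random` is a hypothetical function, the inputs are made up, and `hashlib.sha3_256` is used only as a stand-in hash (the EVM uses Keccak-256, which differs from standardized SHA3-256).

```python
import hashlib


def block_based_random(block_number, difficulty, modulus):
    """Derive a 'random' number purely from publicly known block values."""
    seed = block_number.to_bytes(32, "big") + difficulty.to_bytes(32, "big")
    digest = hashlib.sha3_256(seed).digest()  # stand-in for Keccak-256
    return int.from_bytes(digest, "big") % modulus


if __name__ == "__main__":
    # The contract and an observer who knows the block values
    # compute exactly the same outcome.
    contract_draw = block_based_random(1_000_000, 2_500, modulus=6)
    observer_guess = block_based_random(1_000_000, 2_500, modulus=6)
    print(contract_draw == observer_guess)  # True
```

Because no secret input enters the computation, anyone who can observe or influence the block values can reproduce or bias the outcome, which motivates the recommendation to rely on additional seeds or external random oracles.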
Answer to RQ7. What are the advantages of VulHunter over other ML-based methods in vulnerability repair?
Compared with the existing ML-based SOTA methods, VulHunter delivers various services. For instance, it can complete the contract analysis accurately while giving the specific defective source code statements and key opcodes, as well as universal repair methods. This information enables developers to check and understand the occurrence mechanisms of vulnerabilities, which can guide them in formulating repair strategies. Also, VulHunter outputs the vulnerable execution sequences, which can be used to build symbolic constraints and compute inputs to eliminate some false positives and restore defective and feasible execution paths, thereby better assisting contract auditors.

I. The Capability of Constraint-Solving Module (RQ8)
As one of the main advantages of VulHunter, the symbolic constraints of vulnerable paths can be constructed and further solved by SMT solvers to verify their feasibility, as in symbolic execution-based methods such as Oyente [4] and Manticore [24]. As detailed in Section III-G, we have designed and implemented an initial constraint-solving module. To evaluate the performance of this module in contract vulnerability detection, we analyzed some benign and malicious contract examples holding feasible/infeasible paths. The experiment results are elaborated in Table XVI and reflect the following facts.
(i) The static analysis methods, such as pattern matching (e.g., Slither and SmartCheck) and pattern analysis (e.g., Securify), are insufficient to reason about the variable states during contract execution, resulting in some false positives, such as the arbitrary-send detection for contracts related to CIN. In particular, because DefectChecker omits constraint solving in its implementation, it also delivers false positives, e.g., the block-other-parameters detection for the Cloud_AFalse contract.
To this end, we employ the constraint-solving module to solve the constraints of vulnerable instances and correct the misreports. Note that three solvers, i.e., Z3, Yices, and CVC4, are used one by one to show that each is available and effective in the module. As the verification examples in Fig. 9 show, the CIN contract is embedded with breaking operations such as if-false and assert-false to construct feasible and infeasible vulnerable paths. This yields contracts with/without arbitrary-send vulnerabilities. Fig. 9(c) and 9(d) visualize the attention weight distribution of the arbitrary-send detector for the CIN_AOTrue and CIN_AOFalse contracts, respectively. VulHunter located the suspicious statement "msg.sender.send()" with a confidence above 90% in both contracts, given their almost identical paths that differ in only one opcode, LT (less-than) versus GT (greater-than). However, the latter vulnerable path is terminated at the assert function, so that the send operation cannot be executed, which makes it a false positive. Static analysis methods like Slither also misreport it, as they ignore whether the paths can actually be executed. To eliminate the false discoveries, VulHunter constructed and solved the constraints of the vulnerable paths. The three solvers obtained the same verification results, which are shown in Fig. 9(a) and 9(b). For feasible paths, the "True" conclusion and inputs that satisfy the constraints were output. Otherwise, the "False" decision was obtained.
(ii) Imperfect manual detection rules or oracles may cause traditional formal methods to miss some vulnerabilities. For example, SMARTIAN ignores most vulnerabilities such as arbitrary-send and integer-overflow. The same applies to ML-based methods given their coarse-grained observation. Furthermore, developing complete detection logic for vulnerabilities requires massive effort, making it challenging for traditional formal analysis methods to detect more types of vulnerabilities. For instance, DefectChecker cannot check defects such as send-transfer and low-level-calls. In contrast, VulHunter leverages ML technology to automatically learn the subtle detection logic and fully exploit information from existing contract datasets, thereby enabling AI-assisted contract analysis.
(iii) The constraint-solving module gives methods such as VulHunter and Oyente more capabilities, e.g., obtaining the inputs that restore the vulnerable execution paths. For instance, VulHunter can invoke the "bad" function of the Agent contract with the parameters solved from integer-overflow constraints to cause numerical overflow and get unexpected results. This helps verify the correctness of detection results and cannot be achieved by the current methods based on static analysis and ML. Also, ML-based methods such as TMP and ContractWard missed almost all defects, as they are insensitive to the subtle features of vulnerabilities. Inspired by the performance of VulHunter, they are encouraged to adjust the processed inputs and adopt well-designed models based on mechanisms such as attention, thus improving their detection capability.
Overall, this module has implemented constraint solving and feasibility verification for a single path at the bytecode/opcode level. Nevertheless, parameter solving for multi-path execution has not been completed, as discussed in Section V.B.2. Also, the challenges of solvers and usable solutions are discussed in Section V.C.2. Therefore, this module is currently available as an optional function in VulHunter, given its incomplete functionality and the time consumption of large-scale path verification. In the future, we will optimize the solvers and build a state pool to maintain the values of storage variables across consecutive runtime path executions, thus further considering the impact of different path executions.
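The single-path feasibility check described in this subsection can be illustrated with a small search over candidate inputs. The sketch below uses a bounded brute-force search as a stand-in for an SMT solver such as Z3, Yices, or CVC4, which reason over the full 256-bit domain; the constraints are hypothetical and only mimic two paths that differ in one comparison, and `solve_path` is an illustrative name.

```python
def solve_path(constraints, domain):
    """Return the first input satisfying every path constraint, else None."""
    for candidate in domain:
        if all(check(candidate) for check in constraints):
            return candidate
    return None


# Hypothetical path constraints on a single input x.
# Both paths take the same branch; they differ in the final comparison.
feasible_path = [lambda x: x < 10, lambda x: x < 20]
infeasible_path = [lambda x: x < 10, lambda x: x > 20]


if __name__ == "__main__":
    domain = range(256)  # bounded search space, unlike a real solver
    print(solve_path(feasible_path, domain))    # 0: a concrete witness input
    print(solve_path(infeasible_path, domain))  # None: no input reaches the send
```

A returned witness plays the role of the "True" verdict together with inputs that drive execution down the path, while `None` corresponds to the "False" verdict used to discard a false positive. A bounded search can only show infeasibility within the chosen domain, which is one reason real solvers are used instead.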

Answer to RQ8. What is the performance or capability of the constraint-solving module in VulHunter?
Currently, the constraint-solving module implements constraint construction, parameter solving, and feasibility verification for each vulnerable instance, while it still needs to be completed for multiple instance executions. Notably, compared with the existing SOTA methods based on static analysis and ML, VulHunter can not only correct false positives automatically, but also solve the invoking parameters to execute defective paths and trigger vulnerabilities such as integer-overflow.

A. The Rationality of VulHunter
1) The Impact of Imperfect Instance Labels: During the training process of VulHunter, it is inevitable to introduce false labels (called label noise) to instances, given the empirical initialization and optimization operations on their labels. To mitigate the impact of label noise, we propose the Bag-instance hybrid attention mechanism for the Bi²-LSTM model. It enables the model to correct labels of instances in misreported contracts, such as turning malicious instances of benign contracts into benign ones. Fig. 10 shows the detection performance of VulHunter with and without this mechanism. Fig. 10(a) depicts their detection accuracy for 30 types of vulnerabilities. It illustrates that the Bag-instance hybrid attention mechanism is indeed useful for VulHunter to correct misreports and improve its detection metrics and standard deviation. Besides, Fig. 10(b) and 10(f) describe the changes in model loss and metrics during the training process. The model without this mechanism only learns instance-level features, and its training loss drops rapidly and then stabilizes, i.e., it fails to converge further. This can be attributed to the fact that it introduced more false labels during the training process, which may lead to model overfitting. Correspondingly, its metrics (e.g., accuracy and precision) increase and then remain at an inferior value. Meanwhile, because benign instances are more numerous in the instance optimization step, it may mistake the features of malicious instances for benign ones and yield an insufficient recall rate.
In contrast, with the guidance of this mechanism, the model can adjust its prediction direction based on the true contract labels, so that it can be continuously optimized (i.e., with a declining loss) during the iterative optimization process to steadily improve the recall rate and maintain a superior accuracy rate. Therefore, the better performance of VulHunter benefits from the Bag-instance hybrid attention. Also, as described in Section IV-F, the model self-attention mechanism enhances the recognition effect and can help mitigate label noise. Furthermore, the ratio of benign and malicious contracts in the training dataset (cf. Section IV-B), as well as the model training parameters λ, $q_b$, and $q_m$ (cf. Section IV-G), can be adjusted to prevent model overfitting.
2) The Theoretical Analysis of Effectiveness: In this section, we illustrate the effectiveness of VulHunter in terms of instance extraction, model detection, and result validation with theoretical analysis, respectively.
Effective information coverage. To demonstrate the effective coverage of instance information, we develop a theoretical analysis framework (i.e., instance/path recording entropy model) to quantitatively evaluate the information preserved in the process of VulHunter. Specifically, given that the aperiodic irreducible discrete-time Markov chain (DTMC) can model sequence data [58], we leverage it to formalize an instance as a sequence of opcode operations (random variables). Let $G = \{V, E\}$ denote the state diagram of the DTMC, where $V$ is the set of states (i.e., the variable values) and $E$ denotes the edges. We define $s = |V|$ as the number of states and use $W = [w_{ij}]_{s \times s}$ to denote the weight matrix of $G$. The state transition matrix $P = [P_{ij}]_{s \times s}$ is constructed based on the weights, i.e., $P_{ij} = w_{ij} / w_i$. Similar to [58], we assume that the stationary state distribution $\mu = [\mu_i]_s$ ($\mu_i = w_i = \sum_{j=1}^{s} w_{ij}$) of the DTMC is a binomial distribution with the parameter $0.1 \le p \le 0.9$ to approach a Gaussian distribution with low skewness, and the length of instances obeys a geometric distribution with high skewness (the parameter $0.5 \le q \le 0.9$).
Furthermore, the framework adopts three metrics: (i) the amount of information, i.e., the average Shannon entropy obtained by recording one opcode operation; (ii) the scale of data, i.e., the space used to store the information; (iii) the density of information, i.e., the amount of information per unit of storage. Using this framework, we model the opcode sequence-based instance recording mode used by VulHunter, as well as four typical types of recording modes, including (i) the idealized mode that records and stores all information of instances; (ii) the event-based mode that records specific events for instances denoted by random variable sequences; (iii) the sampling/summary-based mode that records coarse-grained instance information, i.e., the sum of opcode values; and (iv) the statistics-based mode (e.g., S-gram [13], ContractWard [30], and DeeSCVHunter [17]) that records the statistical state information of instances via several counters. Then, we select the opcode value as the per-operation feature and perform numerical studies to compare the instance recording modes in real-world settings, i.e., measuring the parameters $|V|$ and $|E|$ based on instances of contracts (222,310 in total) in Dataset_1 and Dataset_4. The evaluation results with distribution parameters p and q are shown in Fig. 11, which depicts three key facts. (i) VulHunter maintains more information using the opcode sequences of instances than the sampling-based and statistics-based recording modes, e.g., it achieves at least 1.5 to 3 times the information entropy of the others. Note that the more long instances there are (i.e., q↓), the more obvious the gap. (ii) VulHunter maintains near-optimal information, as its information loss ranges from $5.60 \times 10^{-14}$ to $6.80 \times 10^{-4}$ nat. Also, the larger the model hyperparameter T, the smaller the loss per operation. (iii) VulHunter has a higher information density than other recording modes, especially compared with the idealized system. Note that the more long instances there are, the more obvious the advantage. This can be attributed to the fact that VulHunter restricts instance length and reduces the data scale while maintaining as many instance semantics as possible. In summary, VulHunter extracts high-fidelity and compact instance information, which ensures that the model observes enough semantics to identify malicious instance fragments.
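As a worked illustration of the "amount of information" metric, the sketch below computes the entropy rate of a small DTMC, i.e., the average Shannon entropy (in nats) gained per recorded operation when the full sequence is kept. It is a minimal sketch under stated assumptions, not the paper's numerical study: the weight matrices are made up, every state is assumed to have positive total weight, and the stationary distribution is taken proportional to the row sums $w_i$, which holds for symmetric weight matrices.

```python
import math


def transition_matrix(weights):
    """P_ij = w_ij / w_i, where w_i is the sum of row i."""
    return [[w / sum(row) for w in row] for row in weights]


def stationary_from_weights(weights):
    """mu_i proportional to w_i; valid for symmetric weight matrices."""
    row_sums = [sum(row) for row in weights]
    total = sum(row_sums)
    return [r / total for r in row_sums]


def entropy_rate(weights):
    """Entropy rate H = -sum_i mu_i sum_j P_ij ln P_ij, in nats per step."""
    P = transition_matrix(weights)
    mu = stationary_from_weights(weights)
    return -sum(
        mu[i] * p * math.log(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0
    )


if __name__ == "__main__":
    print(entropy_rate([[1, 1], [1, 1]]))  # ln 2: next state is a fair coin flip
    print(entropy_rate([[1, 0], [0, 1]]))  # zero: next state is fully determined
```

Dividing such an entropy value by the storage needed per recorded operation gives an information density in the sense of metric (iii).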
Accurate model inference. As tested in Section IV-F, VulHunter can employ DL networks (e.g., RNN and CNN) and traditional ML models (e.g., RF and SVM) to detect extracted instances. (i) For the former, the inference effectiveness depends on that of the neural networks and the hybrid attention mechanisms. Among them, a multi-layer neural network is essentially a composite function, and its fitting ability is established by the Universal Approximation Theorem, which is analogous to polynomial approximation [59]. On this basis, as detailed in Section III-E, RNNs (e.g., LSTM and GRU) connect neurons in the hidden layer through the hidden states $\overrightarrow{h}_t$ / $\overleftarrow{h}_t$ and gating mechanisms such as the forget gate, enabling them to consider forward/backward temporal relationships and better handle timing-related tasks such as contract instance analysis. Also, a CNN can be viewed as a cascade of linearly weighted filters and non-linear functions for scattering data, and its modeling capabilities were elucidated in [60] from a mathematical perspective.
(ii) VulHunter leverages the self-model attention to improve the perception and fitting ability of the model, enabling it to handle long instance inputs without vanishing gradients or overfitting. Also, as discussed in Section V.A.1, the Bag-instance hybrid attention allows the model to be continuously optimized and to learn critical instance features under the guidance of both contract and instance labels.
(iii) The effectiveness of traditional ML models is supported by their inherent interpretability, i.e., their well-established mathematical foundations. They can be divided into two categories, namely, mathematical theory-based and rule-based models. The former are designed and realized through a series of mathematical operations. For instance, SVM and KNN operate based on linear/nonlinear regression and similarity distance measurement (e.g., Euclidean and Manhattan), respectively. The rule-based algorithms, such as DT, RF, and XGBoost, can be regarded as rule collections in a tree-like structure, which are learned automatically from contract instance datasets and are easy to interpret. Also, they are built on statistical theories, e.g., information gain and the Gini index. Further explanations of their characteristics and the underlying reasons are given in Section IV-F.
(iv) Building on the above instance analysis by ML models, the MIL framework is used to detect contract defects. Its effectiveness has been demonstrated by a wide spectrum of applications, such as computer vision and natural language processing [61]. Also, some works [62], [63] analyze the effectiveness of MIL theoretically, and it mainly relies on the premise that a positive bag contains at least one positive instance, whereas a negative bag includes only negative instances. Nevertheless, for contract analysis, this premise has exceptional cases. That is, for contract defects caused by multiple instances, the absence of the cooperating malicious instances may cause false positives. To this end, as discussed above, the Bag-instance hybrid attention can be utilized to allow the model to observe this nuance, thereby delivering an accurate decision. Overall, the valuable instance information of contracts can be fitted and represented by the models, which finally enables VulHunter to identify vulnerable contracts under this framework.
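The MIL premise discussed in (iv) can be made concrete with a small sketch. The code below is a generic illustration, assuming per-instance vulnerability scores are already available; the max-pooling rule encodes the standard MIL assumption, and the softmax-weighted variant is a simplified stand-in for attention-based aggregation rather than the Bag-instance hybrid attention itself. All scores, logits, and the 0.5 threshold are hypothetical.

```python
import math

def max_pooling_bag_score(instance_scores):
    """Standard MIL assumption: a bag is positive if any instance is positive."""
    return max(instance_scores)

def attention_bag_score(instance_scores, attention_logits):
    """Aggregate instance scores with softmax attention weights.

    Returns the bag score and the weights, so that the most influential
    instance can be located afterwards.
    """
    m = max(attention_logits)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in attention_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    score = sum(w * s for w, s in zip(weights, instance_scores))
    return score, weights

# Hypothetical instance scores for two contracts (bags).
benign_bag = [0.05, 0.10, 0.02]
vulnerable_bag = [0.03, 0.97, 0.08]
THRESHOLD = 0.5  # assumed decision threshold

for name, bag in (("benign", benign_bag), ("vulnerable", vulnerable_bag)):
    label = max_pooling_bag_score(bag) >= THRESHOLD
    print(f"{name}: max score = {max_pooling_bag_score(bag):.2f}, flagged = {label}")

# Hypothetical attention logits that emphasise the second instance.
score, weights = attention_bag_score(vulnerable_bag, [0.1, 3.0, 0.2])
key_instance = weights.index(max(weights))
print(f"attention score = {score:.2f}, key instance index = {key_instance}")
```

Max pooling flags a contract as soon as one instance scores high, which is exactly where the exceptional cases above arise: a single high-scoring instance is reported even if the instances it must cooperate with are missing. An attention-style aggregation at least exposes how much each instance contributed to the decision.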
Reliable path verification. As shown experimentally in Sections IV-I and V.B.2, the optional constraint-solving module can validate the feasibility of identified instances and tolerate some false positives of the model. It guarantees the reliability of detection results while enhancing the overall interpretability of our approach, given that formal analysis is based on rigorous mathematical foundations (e.g., SMT).
In conclusion, the effectiveness of VulHunter is supported by evidence in terms of instance information extraction, ML-based model inference, overall contract detection, and result feasibility verification. Notably, the empirical results, including the performance evaluation and the example visualization studies in Section IV, illustrate the performance superiority of VulHunter, which also corroborates the above theoretical analysis and confirms its detection effectiveness.

B. The Limitation of VulHunter
1) Detect Vulnerabilities Without Bytecode-Level Features:
The key insight of VulHunter is to discover the vulnerable execution sequences of the contract bytecode automatically. However, some vulnerabilities cause contract compilation to fail, so that the bytecode cannot be generated and VulHunter cannot detect them. For example, contracts with multiple constructors (multiple-constructors, High severity) cannot be compiled. In fact, we can discover them by reviewing the compiler's error messages without using VulHunter.
Moreover, some semantics (e.g., the compiler version) are removed or optimized away after the contract compilation. Thus, it is hard to detect the vulnerabilities that depend on this information. For example, incorrect (solc-version, Info) or multiple (pragma, Info) developer-specified compiler versions will affect the contract compilation. These defects should be flagged before the contract deployment. However, the compilation information is not preserved in the bytecode, so they are missed. To this end, we must check the pragma [53] in combination with the source code. Also, the use of uninitialized state variables (uninitialized-state, High), storage variables (uninitialized-storage, High), local variables (uninitialized-local, Medium), and function pointers (uninitialized-fptr-cst, Low) will cause unpredictable bugs, such as storage slot 0 being overwritten. It is insufficient to detect them in the bytecode based on whether the contract modifies the value of a specific storage location, as it is unknown whether the operation was caused by the vulnerability. As described in Section IV-B, we detect uninitialized-state with an inferior ACC of 84.11%. Thus, we need to detect it in the source code by checking whether a value is pushed to the variables (e.g., arrays) before they are assigned a storage value.
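As a rough illustration of the source-level pragma check mentioned above, the following sketch scans Solidity source text for version pragmas. It is only a heuristic written for this discussion: the function name `check_pragmas`, the regular expression, and the notion of a "floating" version constraint are assumptions, and the sketch does not reproduce the rules of any existing detector.

```python
import re

# Matches directives such as: pragma solidity ^0.4.24;
PRAGMA_RE = re.compile(r"^\s*pragma\s+solidity\s+([^;]+);", re.MULTILINE)

def check_pragmas(source):
    """Return a list of human-readable findings about version pragmas."""
    versions = [v.strip() for v in PRAGMA_RE.findall(source)]
    findings = []
    if not versions:
        findings.append("no pragma solidity directive found")
    if len(set(versions)) > 1:
        findings.append("multiple pragma versions: " + ", ".join(sorted(set(versions))))
    for v in versions:
        # Treat range-like constraints as floating (assumed heuristic).
        if any(op in v for op in ("^", "~", ">", "<")):
            findings.append(f"floating version constraint: {v}")
    return findings

# Hypothetical flattened source with two different pragma directives.
example = """
pragma solidity ^0.4.24;
contract A {}
pragma solidity 0.5.0;
contract B {}
"""

for finding in check_pragmas(example):
    print(finding)
```

Because this information only exists in the source text, such a check complements bytecode-level detection rather than replacing it.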
In contrast, some complex vulnerabilities with many forms are difficult to detect at the bytecode level with methods such as symbolic execution, whereas ML techniques handle them easily. For example, after the execution of critical functions (events-access, Low) and arithmetic operations (events-maths, Low), emitting a log event is recommended to notify the caller whether the execution is successful. Such events help users track the operation states off-chain and reduce unnecessary errors, as well as wasted gas. However, many operations require such reminders, e.g., receiving Ether. To detect these vulnerabilities, symbolic executors or fuzzers must first summarize which kinds of functions need reminders, and then check them one by one. By contrast, VulHunter can detect the vulnerability easily by employing the Bi²-LSTM model to learn various features of vulnerabilities automatically.
2) Identify Multiple-Instance Collaboration Vulnerabilities: Since methods such as symbolic execution and ML do not actually run the contracts, it is challenging for them to simulate the cooperation of multiple execution sequences (i.e., instances), as in the tod vulnerability. Nonetheless, VulHunter addresses this as far as possible based on the following observations: (i) For the vulnerabilities triggered by multiple instances, there is a critical instance among the operations triggering the vulnerability, e.g., the last executed instance. In order to explore the key instance that enables VulHunter to identify the vulnerabilities, we performed ablation experiments. Table XVII shows the detection results of VulHunter and other methods, and the first example is detailed in Section IV-E. As another example depicted in Fig. 12, the StandardToken contract holds two functions, i.e., "modifierowner()" and "withdrawBalance()". Attackers can first invoke the former function to change the owner of the contract, and thus bypass the permission check of the latter function to execute the reentrancy attack. In fact, the latter function is the key to triggering the vulnerability, and the former is the prerequisite for executing the latter. Specifically, we detected the contract with/without the former and latter functions, respectively. Fig. 12(a) and 12(b) visualize the weight distribution of the reentrancy-eth detector for contracts with and without the former function.
Due to the path invariance, i.e., the deletion of the former instance does not change the latter instance, VulHunter accurately identified the CALL instruction with a confidence of 99.96% and located the defective source code "msg.sender.call.value()" in both contracts. In particular, VulHunter classified the contract without the latter function as benign. These observations hold for all examples in Table XVII, which suggests that the latter or last instance is most likely the basis on which VulHunter identifies multi-instance collaboration vulnerabilities.
As shown in Fig. 12(a), the solved input of the vulnerable instance is "msg.sender = 0", i.e., the initial value of the storage variable "owner". This can be attributed to the fact that the constraint-solving module does not yet solve the parameters associated with multiple instances. In order to discover the false positives caused by missing cooperation paths, e.g., the detection in Fig. 12(b), we can continue to refine the module to correct them, as detailed in Section IV-I. Also, they can be eliminated by checking the defective source code statements or key opcode subsequences with tools such as Manticore [24], which is further illustrated in Section V.C.1.
(ii) During the training process, multiple instances of the contract are consecutively trained under the bag-instance attention mechanism to update the state of hidden neurons, thereby capturing the potential connections between the numerous instances leading to vulnerabilities. For example, invoking multiple functions (i.e., instances) that operate on the same state variables simultaneously (e.g., read and assignment) may cause different results, that is, the tod vulnerability. This mechanism can prompt the model to adjust its judgments by feeding back a larger loss when it misses cooperative and malicious instances in vulnerable contracts, so that it learns to identify the problems. In this way, multiple malicious instances in some vulnerable contracts can be reported, which may cooperate to trigger vulnerabilities based on their constraint-solved parameters.

C. The Improvement of VulHunter
1) Correcting Misreports for Executable Benign Instances:
Relying on the well-established rules of symbolic execution-based tools such as DefectChecker [20], which support both source code and bytecode, we can employ them to check the vulnerable bytecode instances. Also, similar to Maian [21], ETHBMC [46], and Manticore [24], VulHunter can build a private chain or a simulated executor to create the transaction sequence that executes the suspicious function with the parameters solved from the symbolic constraints, and then observe whether the results are unexpected, thus discovering the false positives.
2) Available Optimizations for Symbolic Constraint Solvers: Nowadays, as core components of the constraint-solving module, SMT solvers (e.g., Z3 [42]) are widely used in symbolic execution-based methods such as Oyente [4] and Manticore [24]. Nevertheless, they still need to be improved for the verification of large-scale vulnerable contract paths, e.g., solving complex symbolic constraints with higher efficiency. Although some solvers such as Z3 can handle non-linear and floating-point arithmetic, these capabilities introduce additional time overhead, especially for longer constraints. Therefore, more efficient and effective solving algorithms are required for these operations in actual use. The available solutions can be divided into three parts according to their implementation stages. Specifically, (i) pre-processing path constraints before the constraint solving to reduce the constraint complexity, including independent constraint slicing, constraint simplifying, and redundant constraint elimination [64]. (ii) Optimizing solver operations during the solving process, e.g., fast unsatisfiability check, assertion stack optimization, and multi-solver ensemble [45]. (iii) Storing and reusing constraint results after solving, such as constraint storage and incremental solving [65].

D. The Application Prospect of VulHunter
Similar to tools such as Oyente, VulHunter inspects contracts based on source code or bytecode, and reports the defective source code statements with multiple vulnerabilities, as well as their opcode subsequences. Also, the symbolic constraints of contract execution paths can be built and solved to verify their feasibility, as detailed in Section IV-I. As another way of contract protection, run-time monitoring and validation are explored by some methods (e.g., Sereum [66] and ContractGuard [27]). They identify and prevent transactions related to vulnerabilities during the contract execution. Similarly, VulHunter can take the execution parameters of contract transactions as inputs, and identify abnormal transactions by checking them against the symbolic constraints constructed for vulnerabilities, thereby guaranteeing the security of contract operation.
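The run-time monitoring idea described above can be sketched as a check of transaction parameters against the constraints of known vulnerable paths. The example is purely illustrative: the constraint table, the field names `balance_to` and `amount`, and the `monitor` function are hypothetical, and a plain Python predicate stands in for a solved symbolic constraint.

```python
UINT256_MAX = 2**256 - 1

def overflow_path_constraint(tx):
    """Assumed constraint of a vulnerable path: the addition exceeds uint256."""
    return tx["balance_to"] + tx["amount"] > UINT256_MAX

# Hypothetical mapping from vulnerability name to the path constraint
# that the transaction inputs must satisfy to trigger it.
VULNERABLE_PATH_CONSTRAINTS = {
    "integer-overflow": overflow_path_constraint,
}

def monitor(tx):
    """Return the vulnerabilities whose path constraints the transaction meets."""
    return [name for name, constraint in VULNERABLE_PATH_CONSTRAINTS.items()
            if constraint(tx)]

# A transaction that would overflow the receiver's balance, and a normal one.
suspicious_tx = {"balance_to": UINT256_MAX, "amount": 1}
normal_tx = {"balance_to": 10, "amount": 1}

print(monitor(suspicious_tx))
print(monitor(normal_tx))
```

A monitor of this kind would only be as precise as the constraints it is given, so constraints for paths already shown to be infeasible should not be registered.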
Meanwhile, as mentioned in Section IV-I, it can drive the contract execution with inputs that meet the constraints of vulnerable instances to restore their paths and trigger vulnerabilities such as integer-overflow. Nevertheless, some vulnerabilities need to execute one path repeatedly or multiple paths simultaneously. For example, the reentrancy-eth vulnerability requires repeated execution of paths with the reentrant call function, and its invoking method needs to be selected by implementers based on the vulnerability knowledge, i.e., achieving the reentrancy through an auxiliary contract with a fallback function. Therefore, with the refinement of the constraint-solving module, VulHunter can protect the contract security throughout the entire lifecycle from development to deployment in the future.

VI. RELATED WORK
Source code based vulnerability detection. Numerous formal verification-based methods have attempted to model the Ethereum contract source code. Bhargavan [67] and Jiao [68] suggest translating a subset of Solidity to the F* and K frameworks for formal verification. Also, Trail of Bits developed a static analysis method called Slither with pre-defined rules to detect problematic source code [8]. Similarly, SmartCheck [7] was proposed by SmartDec, and it employs detection rules to determine contract vulnerabilities. Other works include ZEUS [3] and NeuCheck [14]. However, fewer than 2% of contracts on Ethereum publish their source code [18], [32], [33], which restricts the usage of these methods.
EVM bytecode/opcode based vulnerability detection. The contract bytecodes are visible to everyone, providing an opportunity to evaluate the security of Ethereum contracts.
Theorem proving based methods. Grishchenko [69] and Hildenbrandt [70] employed the F* and K frameworks to formalize EVM bytecode in formal tools. Park et al. [71] presented a deductive verification tool to check the contract bytecode. While these approaches enable formal machine-assisted proofs of various contract security properties, none of them can provide a fully automated analysis. As a result, other automated works based on symbolic execution, etc., have been proposed to ensure the contract correctness and security.
Other methods. Securify [19] and Securify2.0 [75], developed by the SRI System Laboratory (ETH Zurich), use semantic facts and predefined patterns based on EVM bytecode to detect contract vulnerabilities. Wang et al. [27] proposed ContractGuard to defend Ethereum contracts against intrusion attacks by matching the benign contract execution paths. In addition, TokenScope [33] defined bytecode rules to detect token behaviors that are inconsistent with token standards.
In summary, the above methods rely on expert-defined patterns, rules, or oracles to detect contract vulnerabilities. However, expert rules carry the risk of errors, and it is difficult to define bytecode-level patterns that cover complex vulnerabilities completely. Also, as the number of contracts increases rapidly, it is impossible for a few costly experts to design precise rules by reviewing all contracts.
AI exploration in vulnerability detection. Some works use ML to analyze contracts without expert knowledge, and they can be divided into two categories based on their input types.
Source code based machine learning. Zhuang et al. [12] proposed a temporal message propagation network (TMP) to detect vulnerabilities in contract source code. Similar methods include AME [49] and CGE [5]. Other works [13], [15], [16], [17] use ML models such as CNN to learn the characteristics of contract source code and then perform error detection. These methods are tailored to source code and cannot analyze the numerous contracts on Ethereum that only expose bytecode.
Bytecode based machine learning. Huang et al. [29] identified vulnerable contracts by measuring the bytecode vector similarity. Wang et al. [30] proposed a system called ContractWard for automated contract vulnerability detection with ML algorithms such as XGBoost and RF. Besides, Hara et al. [76] employed Word2Vec to identify honeypot contract bytecode.
Different from the aforementioned works, on the one hand, VulHunter employs the MIL mechanism and the Bi²-LSTM model to detect contract runtime bytecode paths using only contract labels, which allows it not only to discover various vulnerabilities (e.g., reentrancy-eth and timestamp) in an effective, efficient, and interpretable manner, but also to identify defective source code statements and vulnerable bytecode sequences. On the other hand, it integrates symbolic execution to build and solve path constraints, thus enabling developers to complete the contract lifecycle more safely.

VII. CONCLUSION AND FUTURE WORK
We presented VulHunter, a novel ML-assisted detection method for analyzing the source code and bytecode/opcode of Ethereum smart contracts without manually pre-defined rules. It leverages the MIL mechanism to address the problem of classification without fine-grained labels, and employs a self-designed Bi²-LSTM model to capture the subtle features of benign and malicious contracts for identifying vulnerable instances. Then, it automatically locates the defective source code statements by mapping the key opcodes to the ASM file, and validates their feasibility via SMT solvers. The experimental results on five datasets demonstrate that VulHunter can detect contract vulnerabilities more accurately, efficiently, robustly, and flexibly than SOTA methods. More importantly, compared with ML-based works, it can provide the defect positions and vulnerable instances while producing classification results, enabling developers to eliminate false positives and repair vulnerabilities more conveniently. In the future, refining the constraint-solving module will open VulHunter to further scenarios, such as contract vulnerability simulation and abnormal monitoring for multiple cooperative transactions.

Fig. 4. Example of symbolic reasoning by the CFG Builder.

Fig. 5. The architecture of Vulnerability Learner with multiple instance learning.
and $W_o$) and $b$ (i.e., $b_f$, $b_v$, $b_c$, and $b_o$) are learnable parameters.

Fig. 7. Comparison of program overhead in terms of time and memory. Each value in the figure refers to the average of 50 execution results.

Fig. 8. Case study of locating defective contract source code statements.

Fig. 9. The feasibility verification for vulnerable paths of contract examples with and without arbitrary-send. Among them, each source code comment corresponds to the contracts in Table XVI, respectively.

Fig. 11.The instance/path information retained by different recording modes on the feasible region of the parameters.

TABLE III. THE SEVERITY OF VULNERABILITIES (BUGS AND OPTS) SUPPORTED BY VULHUNTER

TABLE IV. THE KINDS OF VULNERABILITIES DETECTED BY THE METHODS. AMONG THEM, THE COLUMN "OTHERS" REPRESENTS THE NUMBER OF VULNERABILITIES WITH EACH SEVERITY OUTSIDE TABLE 2. MOREOVER, "∼" INDICATES THAT THE VULNERABILITY KINDS CAN BE EXTENDED EASILY

TABLE VI. COMPARATIVE RESULTS DETECTED BY EACH METHOD ON DATASET_1 (BENIGN:VULNERABLE = 2:1). THE FORMS OF BOLD AND UNDERLINES HIGHLIGHT THE BEST AND SECOND RESULTS FOR EACH PROJECT, RESPECTIVELY. NOTE THAT "SE." IS THE ABBREVIATION OF "SEVERITY", AND "-" MEANS THE PROJECT IS NOT SUPPORTED. ALSO, "TOTAL NAVG (NET AVERAGE)" REFERS TO THE AVERAGE RESULTS OF VULNERABILITIES SUPPORTED BY EACH METHOD, WHILE "TOTAL AVG" REPRESENTS THAT OF ALL VULNERABILITIES

TABLE VIII. COMPARATIVE RESULTS OF ACC, P, R, AND F1 DETECTED BY EACH TOOL ON DATASET_2 (BENIGN:VULNERABLE = 2:1)

TABLE XI. REAL-WORLD EXAMPLES OF SMART CONTRACTS ON ETHEREUM. 1ST COLUMN: DETECTED CONTRACT VULNERABILITIES. 2ND COLUMN: CONTRACT NAME. 3RD COLUMN: THE ETHEREUM ADDRESS OF THE CONTRACT. 4TH COLUMN: NUMBER OF CONTRACT TRANSACTIONS. 5TH COLUMN: CONTRACT BALANCE IN WEI (1 ETHER = 1.00E+18 WEI). 6TH COLUMN: THE LINE NUMBER OF THE VULNERABLE CODE. 7TH COLUMN: DETECTION RESULTS OF VULHUNTER, WHERE "TP" MEANS VULNERABILITY CAN BE DETECTED CORRECTLY. 8TH COLUMN: DETECTION RESULTS OF OTHER METHODS, WHERE FNS AND FPS REPRESENT FALSE NEGATIVES AND FALSE POSITIVES, RESPECTIVELY

TABLE XII. EXAMPLES OF MAJOR SMART CONTRACT SECURITY INCIDENTS. AMONG THEM, "-" INDICATES THAT THE INFORMATION IS UNKNOWN

to RQ4. Can VulHunter discover contracts with substantial and serious vulnerabilities in public chains such as Ethereum?
From the above examples, we conclude that VulHunter can indeed identify contracts that are misreported and underreported by other methods. This illustrates its detection effectiveness on Ethereum contracts. Also, the contract examples in Dataset_3 indirectly verify the authenticity of the detection results in Section IV-C. Notably, VulHunter is not only a method for error correction, but can also optimize contracts and reduce unnecessary costs.

TABLE XVI. THE VERIFICATION EXPERIMENTS FOR CONTRACTS WITH INFEASIBLE PATHS. AMONG THEM, THE SYMBOL "→" REFERS TO THE CONSTRAINT-SOLVING OPERATION

The contract Revolution (4.74E+13 Wei) is an active, in-the-wild contract on Ethereum (as described in Table XI). Part of its source code is shown in Listing 8.

TABLE XVII. THE RESULTS OF ABLATION EXPERIMENTS FOR DETECTING VULNERABILITIES THAT RELY ON MULTI-INSTANCE COOPERATION