EVMTracer: Dynamic Analysis of the Parallelization and Redundancy Potential in the Ethereum Virtual Machine

Ethereum is one of the first blockchains executing smart contracts, i.e., financial applications directly executed on the ledger using a virtual machine. High transaction volumes caused by financial applications, including decentralized finance and non-fungible tokens, slow down the Ethereum Virtual Machine. Hence, there is a need to detail the execution characteristics of the Ethereum Virtual Machine before its performance can be improved. This work introduces an off-line Ethereum virtual machine tracer called EVMTracer that produces runtime data dependence graphs from executed transactions as an alternative program representation. From the runtime dependence graphs, we can deduce valuable metrics about program execution characteristics, including the degree of parallelization and computational redundancies in smart contracts. Our experiments encompass all blocks up to 12 M on the Ethereum mainnet. We found a geometric mean of 1.90× theoretical maximum speedup when executing the smart contracts in parallel and identified 34.97 % of SLOAD instructions as redundant.

The associate editor coordinating the review of this manuscript and approving it for publication was Mueen Uddin.

I. INTRODUCTION
The Ethereum Virtual Machine (EVM) is the underlying, Turing-complete execution platform that processes the bytecode of smart contracts through transactions [24]. With the increasing complexity of blockchain applications, the efficiency of the virtual machine becomes paramount for smart contract execution. According to Ethereum inventor and co-founder Vitalik Buterin, scaling the transaction throughput from currently 15 tx s⁻¹ to 100 000 tx s⁻¹ is one of the most pressing issues of the Ethereum blockchain [25]. In comparison, Visa, a direct competitor of Ethereum in terms of total transaction volume [26], claims a throughput of up to 65 000 tx s⁻¹ [27].
Likewise, in an experiment that executed the initial 9 M blocks of the Ethereum mainnet, the EVM's throughput was below 4208 tx s⁻¹ [29]. Both experiments measured the net performance of the EVM itself, excluding the consensus protocol and world-state database updates. The reported figures are thus indicative of the bytecode interpretation overhead and are not mitigated by Ethereum's recent consensus protocol shift from proof-of-work (POW) to proof-of-stake (POS) [30].
It is necessary for the community to understand the execution characteristics of smart contracts to reduce the EVM's execution overhead and to design and implement more efficient virtual machines and environments for the ever-growing blockchain applications.
However, in the blockchain domain, no established benchmark suite exists that could be adopted for the performance evaluation of the EVM. This is in stark contrast to the ubiquity of standardized benchmark suites in general-purpose computing areas, including embedded, compute, JVM-based, edge, cloud, and HPC [31], [32], [33], [34], [35], [36], [37], [38], [39]. Although many attempts [25], [40], [41], [42], [43], [44] have already been taken to understand the performance characteristics of smart contracts, they relied on the availability of the source code, hence focused more on high-level language elements. Note that only 5.1 % of all smart contracts deployed on the Ethereum blockchain have been open-sourced [45], up from 2.2 % by 2018 [25]. Given the lack of available source code, we argue that performance analysis based on open-sourced smart contracts will not be representative of real-world blockchain workloads.
We thus turn our attention to bytecode, which is the program representation for the deployment of smart contracts on the blockchain. Programmers typically write smart contracts in a high-level, specialized language such as Solidity [46], [47] or Vyper [48], which is then compiled into an immutable bytecode representation for the EVM. Bytecode is deployed persistently on the blockchain and invoked for execution on the EVM through transactions. Because all deployed bytecode and the entire execution history of all transactions, including input data, is available on the chain, we base our workload analysis on smart contract bytecode.
To provide a scalable and representative benchmarking tool for blockchains, we introduce an offline dynamic tracing system called EVMTracer, which replays the transactions locally and collects the runtime dependence graph of the transactions. The runtime dependence graph is an alternative representation of the original program's execution trace containing data flow and control information. This information can be used to determine the execution characteristics of smart contracts.
EVMTracer is based on an existing efficient transaction replay system [29] that enables us to collect runtime information for millions of transactions without the substantial overhead experienced with client software [49]. We collect the runtime dependence graphs for all transactions in the initial 12 M blocks of Ethereum and use them to determine two performance characteristics of smart contracts. Two metrics are presented to showcase that EVMTracer can discover useful metrics from millions of blocks and help the community to get a better understanding of the overall runtime characteristics of transactions. The first metric is the degree of parallelism at the bytecode instruction level. This metric can help the community to understand the potential parallelism in smart contracts and determine whether developing a contract-level parallel execution model is beneficial for the EVM. The second metric is the number of redundant memory and storage computations in the smart contract runtime, in which we count the number of redundant instructions that the EVM executes. This metric reveals the potential for overlooked optimizations and the execution characteristics of a stack-based instruction set.
This paper makes the following contributions:
• An offline dynamic tracing system that produces runtime dependence graphs of smart contract executions for large volumes of transactions.
• At-scale investigation of contract-level parallelism on the Ethereum blockchain.
• At-scale investigation of redundant computations for memory and storage IO operations on the Ethereum blockchain.
We have released the EVMTracer framework as open source, as described in the paper's availability statement.
The remainder of this paper is organized as follows: In Section II, we provide the background for Ethereum and the EVM. In Section III, we discuss technical details about the system and the runtime dependence graph. In Sections IV and V, we explain how to obtain the contract-level parallelism and redundant computation metrics from the runtime dependence graph. Section VI contains the experimental results. We discuss the related work in Section VII and draw our conclusions in Section VIII.

II. BACKGROUND
Ethereum can be viewed as a transaction-based state machine that maintains a world state as shown in Figure 1. The world state comprises information about the accounts on the blockchain. Each account is referred to by its account address and contains the following information: (1) A nonce, which is a counter used to prevent replay attacks (double-spending), (2) an account balance representing the endowment of the account, (3) a code section with the smart contract bytecode (although an account is allowed to have an empty code section), and (4) the storage space of the account. The storage space provides smart contracts with a persistent state across transactions. Storage is represented as a key-value map where the key is the storage address, and the value is the actual data stored at the address. Each transaction results in a side effect that advances the state of Ethereum. Transactions are further collated into blocks, and each block's transactions are applied sequentially on the Ethereum blockchain. We classify transactions into three types. First, an asset transfer transaction takes an asset from the sender and transfers it to the recipient. Next, a contract creation transaction creates a new account on the chain and initializes and stores the smart contract bytecode in the storage space of the account. Finally, a message call transaction invokes a target contract on the blockchain.
Dynamic tracing [50] refers to the action of collecting and investigating runtime behavior of the target program by recording the executed instructions at program runtime. Hence, we enable dynamic tracing only on transactions that result in the execution of smart contracts, i.e., tracing is performed on message call transactions. Note that tracing depends on the program's input, unlike static program analysis [51] such as abstract interpretation [52], which is not concerned with a concrete input.
For our purpose, we need to understand four of the fields in a message call transaction: a from and a to address, which specify the sender (caller) and recipient (callee) addresses; a gas limit, which specifies the amount of gas that the sender is willing to pay to execute the contract; and a call data section, which is part of Ethereum's application binary interface (ABI) and specifies the entry function to call and its input arguments. Once the miners confirm a block, the EVM executes the block's transactions sequentially.
The EVM is a stack-based virtual machine that is specified in the Yellow Paper [24] and interprets (executes) the bytecode of smart contracts. The following components define the internal state of the EVM:
1) Stack: A stack data structure that stores 256 bit values and has a maximum depth of 1024 stack slots.
2) Memory: An unlimited, linear byte array that supports random access at runtime. Memory is accessed through the MLOAD and MSTORE instructions of the EVM.
3) Storage: The persistent key-value storage of the executing account described above. Storage is accessed through the SLOAD and SSTORE instructions of the EVM.
4) Program counter (PC): The PC points to the EVM's next instruction.
5) Gas: The remaining gas available for the current execution. The EVM will terminate a transaction that runs out of gas.
Figure 2 shows the interworking of the components that comprise the internal state of the EVM. The PC determines the next instruction to be executed. The executed instruction may have a side effect that modifies the stack, memory, or storage. The executed instruction consumes gas and updates the PC to point to the next instruction to be executed.
We present a smart contract in Solidity syntax in Figure 3 to familiarize readers with basic EVM operations. For the sake of demonstration, the depicted bytecode is a simplified version of the bytecode generated by the Solidity compiler. In line 2, the 32 B wide unsigned storage variable MyVar is declared. The contract contains a single public function setStorage, which sets the value of MyVar to input argument val. A message call transaction has a call data section (a byte array specifying the target function to be invoked and its input arguments). In the call data, the first four bytes represent the ID of the target function, which we assume to be the function setStorage in this example. After receiving the call data, the EVM identifies the target function, dispatches control to the specified function, and starts execution.
The right-hand side of Figure 3 shows the bytecode representation of function setStorage. The purpose of the first line is to push the value 0x4 onto the stack, which is the offset of the first argument in the call data. In the second line, the CALLDATALOAD instruction pops the top of the stack, uses it as the offset and pushes a 32 B value from call data onto the stack. (This effectively pushes the argument value val onto the stack.) Finally, the code in lines 3-4 stores the result at the address of variable MyVar in storage.
In Figure 1, the storage variables that belong to a particular account are located in the account's <Key, Value> map that is stored in the <Address, Account> map in the Ethereum world state. The SSTORE instruction implicitly takes the current contract as the execution environment. The Key of MyVar in our example is 0x0. Hence, the code in line 3 pushes 0x0 onto the stack. In line 4, SSTORE will pop the Key and Value from the stack and perform the write operation on the storage.
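The four bytecode lines discussed above can be mimicked with a very small stack machine. The following Python sketch is only an illustration under stated assumptions: the tuple encoding of the program, the function name run, and the placeholder function ID are hypothetical, and gas, memory, and function dispatch are left out.

```python
# Minimal sketch of a stack machine for the simplified setStorage bytecode.
# Opcode names follow the EVM; the (opcode, operand) encoding is an assumption.

def run(program, calldata, storage):
    """Execute a list of (opcode, operand) pairs and return the storage map."""
    stack = []
    for opcode, operand in program:
        if opcode == "PUSH":
            stack.append(operand)
        elif opcode == "CALLDATALOAD":
            offset = stack.pop()
            # Read 32 bytes from call data, zero-padded on the right.
            word = calldata[offset:offset + 32].ljust(32, b"\x00")
            stack.append(int.from_bytes(word, "big"))
        elif opcode == "SSTORE":
            key = stack.pop()    # top of the stack: storage key
            value = stack.pop()  # next item: value to store
            storage[key] = value
        else:
            raise ValueError(f"unsupported opcode {opcode}")
    return storage

# Call data: a 4-byte function ID (placeholder) followed by one 32-byte argument.
calldata = bytes.fromhex("deadbeef") + (42).to_bytes(32, "big")
program = [
    ("PUSH", 0x4),           # offset of the first argument in the call data
    ("CALLDATALOAD", None),  # pushes the argument val
    ("PUSH", 0x0),           # storage key of MyVar
    ("SSTORE", None),        # storage[0x0] = val
]
final_storage = run(program, calldata, {})
```

With the argument 42 encoded in the call data, the sketch leaves the value 42 at storage key 0x0, which corresponds to the assignment to MyVar.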

III. ETHEREUM DYNAMIC TRACING SYSTEM
The runtime dependence graph (RDG), or dynamic program dependence graph [53], [54], is a dynamic variation of the program dependence graph (PDG) [55], [56], [57], which encodes data and control dependencies among statements in a program. The PDG has been used in compiler optimizations [58] such as program parallelization [59], and program analyses such as slicing [60].
As a variation of the PDG, our RDG encodes the dynamic data and control dependencies for executing the EVM bytecode for a concrete state on the ledger. A node in the RDG represents an executed bytecode instruction of a smart contract execution, and edges represent data- and control-flow dependencies between executed EVM instructions. Our RDGs are acyclic, whereas the original graphs [53], [54] treated each statement as an individual node and produced cyclic multigraphs.
Our EVMTracer uses the RDGs to reveal the execution characteristics of smart contracts. Computations may involve EVM's stack, memory or storage operations. The data flow of the EVM is produced from the inputs and outputs of the computations and is captured in the RDG in the form of edges. The task of EVMTracer is to capture the data flow of the EVM faithfully. In addition to data flow, control flow must be captured as well. Control-flow dependencies are induced by jumps and calls while executing a smart contract.
EVMTracer extends the EVM interpreter to build RDGs as a side effect of the smart contract execution; we call this component the tracer. For each message-call transaction, EVMTracer invokes the tracer to replay the transaction, instruction by instruction. During replay, the tracer observes how run-time data is combined via the stack, memory and storage for each instruction. From these observations, RDGs are constructed. The sequence of executed instructions constitutes an instruction trace, or trace, for short. Figure 4 illustrates the internals of the tracer. The internal state of the EVM (cf. Section II) consists of five elements (i.e., PC, memory, storage, stack, and message buffers). During execution, these elements affect each other and produce data dependencies. To record the effect of an executed instruction, the state of the EVM is observed by the tracer and recorded into the shadow state. Among other elements, the shadow state contains the trace step, an integer variable i that starts from 1 and increments by one for each instruction executed until the contract terminates. We can then uniquely identify the state of the EVM by referring to a specific trace step. The shadow state also contains the Last Memory Modifier (LMM), which records the last instruction that extended the memory size of the EVM and is used for recording the dependency for the MSIZE instruction.
Our RDG consists of trace steps and edges representing data and control dependencies. Data dependencies describe the data flow of the instructions, such that the instruction at trace step i must use the result of another instruction at trace step j to compute the correct state of the EVM. The control dependencies describe the program's path through the control-flow graph. An instruction at trace step i is executed only if a certain path was taken from a branch statement at trace step j. Formally, the graph G(V, E) is defined by a set of nodes G.V representing the trace steps i, and a set of edges G.E ⊆ V × V × L where L indicates the type of dependency. We employ the following four dependence types from the literature [56], [57]: read-after-write (RaW), write-after-read (WaR), write-after-write (WaW), and control dependency. For each RDG G, we use a unique entry node Root and a unique exit node End. The Root node represents trace step 0 and is used for graph initialization. The End node represents the last trace step in the graph. Both nodes are used to cover corner cases that we will discuss later.

A. SHADOW STRUCTURES
Figure 5(a) depicts a source code example where we want to determine the dependency relations for the add operation in line 6. In the process of compiling this source code to assembly code of a register-based CPU, variables b and c have been assigned to registers r1 and r2, which are the input operands of the add instruction in line 3 of Figure 5(b). To reveal those data dependencies at runtime, a tracer would have to record for each CPU register the trace step of the most recent update (line 1 for register r1, and line 2 for register r2). When execution reaches line 3, the register operands induce the dependencies of the add instruction for trace steps 1 and 2. Thus, the RDG in Figure 5(c) contains the dependence edges (1, 3) and (2, 3).
However, the EVM uses a stack-based execution model, where instructions encode the operation only, while operands are implicitly consumed from the stack and results produced onto the stack. The ADD instruction in the EVM bytecode in Figure 5(d) thereby relies on the MLOAD instructions in lines 3 and 5 to push its operands on the stack. Additional bookkeeping is thus needed in the tracer to capture these relations. Figure 5(e) shows the EVM stack for the bytecode in Figure 5(d) before the execution of the ADD instruction. For each stack slot, the numbers in parentheses depict the trace step from which the stack value originated. The tracer maintains this bookkeeping information at runtime. Thereby it becomes evident that the top-most two stack slots that constitute the operands b and c of the ADD instruction originated from lines 3 and 5 in the executed bytecode, which is reflected in the dependence edges (3,6) and (5,6) in the RDG in Figure 5(f). Each MLOAD instruction depends on a prior PUSH instruction for the addresses of variables b and c, reflected in the dependence edges (2, 3) and (4,5).
Similar bookkeeping information is required for the memory of the EVM, and for the storage of an account. For this purpose, we introduce shadow structures. Instead of storing actual values from the EVM state, the shadow structures store a set of trace steps by observing the state of their corresponding parts in the EVM state. In this section, we explain the details of each shadow structure. We apply function ts_op(i) to map from trace step i to the opcode of the instruction executed at trace step i.

1) SHADOW STACK
Most of the intermediate results within the EVM are transferred through the stack. Therefore, dependencies occur when data from instruction a is transferred through the stack to instruction b. In this case, instruction a must strictly execute before instruction b, which constitutes a RaW dependency.
We implement a shadow stack named SSTACK to trace stack dependencies. The shadow stack stores trace steps and observes the effect that instructions executed by the EVM exhibit on the stack (we call the reader's attention to the subtle but vital distinction between the terms shadow stack and stack for the remainder of the exposition). We utilize the shadow stack and the following stack-related rules (s1) and (s2) to construct the dependence graph G.
(s1) When a value is pushed onto the stack at trace step i, the trace-step value i is pushed onto the shadow stack.
(s2) When a value is popped from the stack at trace step i, we pop the shadow stack's top-of-the-stack j and add the dependence edge (i, j, RaW) to the dependence graph G.
Intuitively, this means that when an instruction x pushes a value onto the stack, the subsequent instruction that uses that value has a RaW dependency on instruction x.
Two EVM instructions do not create values on the stack and need special treatment, i.e., SWAP and POP.
The EVM specification [24] employs a zero-indexing scheme for the stack such that stack[0] denotes the top-of-the-stack (the first stack item), stack[1] denotes the stack slot below the top (the second stack item), and stack[n] generally denotes the n-th stack slot below the top (stack item n + 1). We adopt this indexing scheme for the shadow stack.
Let a SWAPn instruction, which exchanges stack[0] and stack[n], be executed at trace step i. The tracer adds RaW dependencies from i to the trace steps stored in SSTACK[0] and SSTACK[n] and then sets both shadow-stack slots to i (cf. the SWAP1 step of the example in Section III-B). The POP instruction pops the top-of-the-stack and discards it, i.e., the discarded value is not needed for the computation. Therefore, neither data nor control dependencies are required for this instruction; it is ignored in the dependence graph.
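The stack rules can be illustrated with a few lines of Python. This is a minimal sketch, not the EVMTracer implementation: the class ShadowStack and its method names are hypothetical, and the SWAP handling follows the SWAP1 walk-through in Section III-B. An edge (i, j, "RaW") means that trace step i depends on trace step j.

```python
class ShadowStack:
    """Sketch of SSTACK: stores trace steps instead of stack values."""

    def __init__(self):
        self.slots = []     # the end of the list is the top of the stack
        self.edges = set()  # (i, j, type): trace step i depends on trace step j

    def push(self, step):
        # Rule (s1): record the trace step that produced the stack value.
        self.slots.append(step)

    def pop(self, step):
        # Rule (s2): the consumer depends on the producer of the popped value.
        producer = self.slots.pop()
        self.edges.add((step, producer, "RaW"))

    def discard(self):
        # POP opcode: the value is dropped and no dependency is recorded.
        self.slots.pop()

    def swap(self, step, n):
        # SWAPn: depend on both exchanged slots, then mark both as produced here.
        for index in (-1, -1 - n):
            self.edges.add((step, self.slots[index], "RaW"))
            self.slots[index] = step


# Trace steps 2-6 of a stack-based addition in the style of Figure 5(d).
sstack = ShadowStack()
sstack.push(2)                  # step 2: PUSH address of b
sstack.pop(3); sstack.push(3)   # step 3: MLOAD
sstack.push(4)                  # step 4: PUSH address of c
sstack.pop(5); sstack.push(5)   # step 5: MLOAD
sstack.pop(6); sstack.pop(6)    # step 6: ADD consumes both operands
sstack.push(6)                  #         and pushes the sum
```

After step 6, the recorded edges state that the ADD depends on the two MLOAD instructions (steps 3 and 5), each of which depends on its preceding PUSH.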

2) SHADOW MEMORY
The EVM provides a memory area representing a linear, zero-indexed sequence of bytes. Memory instructions like MLOAD and MSTORE can read from and write to this memory. Instructions like CALL and CALLCODE rely on memory to retrieve function arguments and provide a return value. The EVM automatically extends the memory size whenever necessary, and the MSIZE instruction returns the current memory size in bytes.
We apply an idea similar to the shadow stack by implementing a data structure called shadow memory (SMEM), which stores a linear sequence of trace steps. We use index operation SMEM[i] to represent the i-th value stored in the shadow memory, corresponding to the i-th byte of memory, i.e., mem[i]. The EVM initializes memory to zero. To comply with this convention, we initialize the shadow memory with our artificial Root node. Unlike the stack, memory is not restricted to last-in-first-out semantics, which allows WaR and WaW dependencies in addition to the RaW dependencies that occur on the stack.
If the instruction at trace step i interacts with memory at index n, we employ two memory-related rules, (m1) for reads and (m2) for writes, for constructing the dependence graph G. Rule (m1) models the standard RaW dependency we have discussed so far where an instruction depends on the result of a previous instruction. The reason for rule (m2) is that if an instruction at trace step i overwrites the result of trace step j, we need to ensure that all instructions use the result of step j before executing step i. Otherwise, the input of those instructions would be overwritten by step i. Note that WaR and WaW are both known as false dependencies because they can be eliminated by using different memory locations to store different values [57]. E.g., a WaR dependency between ts_op(i) and ts_op(j) can be eliminated if the program is rewritten such that ts_op(i) and ts_op(j) operate on different memory locations. However, such transformations are not in the scope of this paper.
Finally, the MSIZE instruction reports the current size of the memory at runtime. The EVM grows the memory on demand. E.g., if the current size of the memory is 32 B and a write operation occurs that writes to the 42nd byte in memory, the EVM will resize the memory to satisfy the request of the operation. Therefore, we record the last instruction's trace step that extended the memory size in variable Last Memory Modifier (LMM) of the shadow state, which is initialized to the Root node. Execution of instruction MSIZE entails the creation of a RaW dependency on the trace step stored in variable LMM.
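One plausible reading of the memory bookkeeping is sketched below in Python. It is an assumption-laden illustration rather than the authors' code: the class ShadowMemory, the per-byte reader sets, and the method names are hypothetical, a read is taken to add a RaW edge to the last writer of each byte, and a write is taken to add a WaW edge to the last writer plus WaR edges to every step that read that writer's result. For brevity the sketch grows memory to exactly the accessed range, whereas the EVM grows memory in 32-byte words.

```python
ROOT = 0  # artificial Root node (trace step 0)


class ShadowMemory:
    """Sketch of SMEM with the Last Memory Modifier (LMM)."""

    def __init__(self):
        self.last_writer = []  # per byte: trace step of the last write
        self.readers = []      # per byte: steps that read the last write
        self.lmm = ROOT        # last trace step that extended the memory
        self.edges = set()     # (i, j, type): trace step i depends on step j

    def _grow(self, step, end):
        # Extend the shadow memory on demand; new bytes start at Root.
        extra = end - len(self.last_writer)
        if extra > 0:
            self.last_writer.extend([ROOT] * extra)
            self.readers.extend(set() for _ in range(extra))
            self.lmm = step

    def read(self, step, offset, size):
        self._grow(step, offset + size)
        for n in range(offset, offset + size):
            self.edges.add((step, self.last_writer[n], "RaW"))
            self.readers[n].add(step)

    def write(self, step, offset, size):
        self._grow(step, offset + size)
        for n in range(offset, offset + size):
            self.edges.add((step, self.last_writer[n], "WaW"))
            for reader in self.readers[n]:
                self.edges.add((step, reader, "WaR"))
            self.last_writer[n] = step
            self.readers[n] = set()

    def msize(self, step):
        # MSIZE depends on the step that last changed the memory size.
        self.edges.add((step, self.lmm, "RaW"))


smem = ShadowMemory()
smem.write(1, 0, 32)  # step 1: MSTORE extends memory and writes bytes 0..31
smem.read(2, 0, 32)   # step 2: MLOAD reads the result of step 1
smem.write(3, 0, 32)  # step 3: MSTORE overwrites the same bytes
smem.msize(4)         # step 4: MSIZE
```

In this run, step 3 must wait for the reader at step 2 (WaR) and for the earlier writer at step 1 (WaW), and the MSIZE at step 4 depends on step 1, the only step that extended the memory.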

3) SHADOW STORAGE
Unlike linearly-indexed memory, access to a variable in EVM storage requires two key-value maps: an <Address, Account> map to retrieve the target account and a <Key, Value> map to retrieve the value of the targeted storage variable of the selected account. For the purpose of this paper, it is sufficient to regard storage as a single <Key, Value> map where the Key comprises the account address and the key of the variable. Storage accesses exhibit the same ordering constraints as memory, i.e., RaW, WaR, and WaW. Similar to our approach with memory, we create a shadow storage (named SSTORAGE) to express dependencies between executed storage instructions.
Like the EVM storage, our shadow storage is a key-value map. Like the shadow data structures introduced above, it stores trace steps instead of actual values for tracing dependencies. Conceptually, each value in the shadow storage is set to the Root node before the start of the tracer. In practice, because the index space of the key-value map is infinite and the actual keys are unknown ahead of the trace, the tracer starts with an empty key-value map. Whenever a non-existing entry is visited, it is initialized to the Root node to represent the initial state of the execution. The EVM instruction set contains only one storage write instruction, SSTORE, and one storage read instruction, SLOAD.
We collect dependencies similar to our approach for memory. In the following, let k be the storage key accessed at trace step i.
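The lazy initialization described above maps naturally onto a dictionary. The Python sketch below is a hedged illustration: the class ShadowStorage and its methods are hypothetical names, the composite (address, key) index follows the single-map view of storage taken in this section, and the read and write handling simply mirrors the memory case (RaW for SLOAD; WaW and WaR for SSTORE), which is an assumption about the storage rules.

```python
ROOT = 0  # artificial Root node (trace step 0)


class ShadowStorage:
    """Sketch of SSTORAGE: a key-value map that starts empty."""

    def __init__(self):
        self.last_writer = {}  # (address, key) -> trace step of the last SSTORE
        self.readers = {}      # (address, key) -> SLOAD steps since that SSTORE
        self.edges = set()     # (i, j, type): trace step i depends on step j

    def _touch(self, slot):
        # A slot that was never visited is initialized to Root on first access.
        self.last_writer.setdefault(slot, ROOT)
        self.readers.setdefault(slot, set())

    def sload(self, step, address, key):
        slot = (address, key)
        self._touch(slot)
        self.edges.add((step, self.last_writer[slot], "RaW"))
        self.readers[slot].add(step)

    def sstore(self, step, address, key):
        slot = (address, key)
        self._touch(slot)
        self.edges.add((step, self.last_writer[slot], "WaW"))
        self.edges.update((step, reader, "WaR") for reader in self.readers[slot])
        self.last_writer[slot] = step
        self.readers[slot] = set()


sstorage = ShadowStorage()
sstorage.sload(1, "0xA", 0)   # first access: depends on Root
sstorage.sload(2, "0xA", 0)   # same slot, still the Root value
sstorage.sstore(3, "0xA", 0)  # overwrites the slot after both reads
sstorage.sload(4, "0xA", 0)   # reads the value written at step 3
```

Only the single slot that the trace touches is ever materialized in the map, regardless of how large the key space is.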

4) CONTROL DEPENDENCIES
A control dependency is a constraint due to the control flow of a program [56]. E.g., the instruction following a branch instruction can only be executed after the branch target has been decided. For this purpose, we employ variable LB to record the trace step of the last branch instruction (variable LB is initialized to the Root node). For each instruction at trace step i, we add edge (i, LB, Ctrl) to graph G. This effectively creates a strict control dependency among all basic blocks, so each basic block must be executed in order. Finally, after the trace is completed, we create an artificial node End such that End is dependent on all leaf nodes in the dependence graph.
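The LB bookkeeping and the End node can be sketched as follows. The Python code is illustrative only: the function names are hypothetical, the set of branch opcodes is restricted to JUMP and JUMPI as an assumption (calls are left out), and the edge label used for the End node is likewise an assumption because the text does not assign it a type.

```python
ROOT = 0
BRANCH_OPCODES = {"JUMP", "JUMPI"}  # assumption: calls are ignored in this sketch


def control_edges(trace):
    """trace[i - 1] is the opcode executed at trace step i."""
    edges = set()
    last_branch = ROOT  # variable LB, initialized to the Root node
    for step, opcode in enumerate(trace, start=1):
        edges.add((step, last_branch, "Ctrl"))
        if opcode in BRANCH_OPCODES:
            last_branch = step  # later steps depend on this branch
    return edges


def add_end_node(edges, num_steps):
    """Make an artificial End node depend on every node without dependents."""
    end = num_steps + 1
    has_dependents = {j for (_, j, _) in edges}
    leaves = [s for s in range(1, num_steps + 1) if s not in has_dependents]
    # The label "End" is a placeholder chosen for this sketch.
    return edges | {(end, leaf, "End") for leaf in leaves}


trace = ["PUSH1", "JUMPI", "JUMPDEST", "STOP"]
ctrl = control_edges(trace)
graph = add_end_node(ctrl, len(trace))
```

Steps 1 and 2 depend on Root, steps 3 and 4 depend on the JUMPI at step 2, and the End node (step 5 here) depends on the leaves 1, 3, and 4.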

5) EFFICIENT DEPENDENCE GRAPH CONSTRUCTION
In this section, we provide the technical details of how dependence graphs are represented inside EVMTracer. In our implementation, shadow structures store references to graph nodes instead of raw trace steps. (We chose trace steps in the exposition of the paper to facilitate reading and highlight the correspondence between graph nodes and trace steps.) A node in the dependence graph contains references to its dependent nodes (i.e., outgoing edges), dependencies (i.e., incoming edges), dependency types, and the corresponding trace step. During tracing, whenever a new node is created, its dependencies are fetched from the shadow structures in the form of references to graph nodes. Therefore, construction can be done immediately by creating a new graph node and connecting it with its dependencies which become the new node's incoming edges. A WaR dependency represents a special case that requires the tracer to search one level deeper: instead of fetching the immediate dependency directly from a shadow structure, the outgoing edges of the dependency need to be searched. This pertains to the implementation of Rule (m2) of the shadow memory and Rule (p2) of the shadow storage as introduced above.
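A node-reference representation along these lines could look like the Python sketch below. All names (Node, connect, on_write) are hypothetical, and the way readers are found, by scanning the outgoing RaW edges of the last writer, is one interpretation of "searching one level deeper". For a multi-byte memory write this scan may conservatively include readers of neighboring bytes.

```python
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)  # identity semantics: the graph has back references
class Node:
    step: int
    incoming: list = field(default_factory=list)  # (dependency node, type)
    outgoing: list = field(default_factory=list)  # (dependent node, type)


def connect(node, dependency, dep_type):
    """Add one edge and keep both directions up to date."""
    node.incoming.append((dependency, dep_type))
    dependency.outgoing.append((node, dep_type))


def on_write(node, last_writer):
    """Connect a writing node, given the reference held by the shadow structure."""
    # WaR: look one level deeper, at the nodes that read the last writer's result.
    readers = [dep for dep, dep_type in last_writer.outgoing if dep_type == "RaW"]
    for reader in readers:
        connect(node, reader, "WaR")
    # WaW: the immediate dependency fetched from the shadow structure.
    connect(node, last_writer, "WaW")


root = Node(0)
w1 = Node(1)
connect(w1, root, "WaW")  # step 1 writes a location last set by Root
r2 = Node(2)
connect(r2, w1, "RaW")    # step 2 reads the value written at step 1
w3 = Node(3)
on_write(w3, w1)          # step 3 overwrites the value
```

Because the shadow structure hands out node references, the write at step 3 reaches its WaR dependency on step 2 through w1.outgoing without any lookup by trace step.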

B. RUNTIME DEPENDENCE GRAPH EXAMPLE
The left-hand side of Figure 6 depicts an example of a smart contract written in Solidity. It contains two storage variables storage1 and storage2 and a function reset that sets both storage variables to the value 42. The right-hand side of Figure 6 shows the EVM bytecode snippet corresponding to the reset function. We show the process of the RDG construction in Figure 7. The example contains only RaW dependencies. Because the bytecode contains no branch instruction, it generally holds that line i of the bytecode from Figure 6 is executed at trace step i. Figure 7 provides one subfigure (i.e., 7(a)-7(g)) per trace step (i.e., 1-7). For each subfigure, the caption specifies the trace step and the executed instruction.
In trace steps 1 and 2, the EVM interpreter pushes two values onto the stack. The shadow stack observes those effects and records the trace step that produced each stack value, i.e., trace step values 1 and 2 are pushed onto the shadow stack.
In trace step 3, DUP2 is executed, duplicating the stack's second item. Therefore, a RaW dependency is created in the RDG from trace step 3 to trace step 1, which is the trace step that produces the duplicated value. (Note that the RDG node labeled "R" represents the Root node. Trace steps 1 and 2 do not depend on any actual trace step. To keep the RDG connected for practical matters to be discussed later, trace steps without dependencies depend on the Root node.) In trace step 4, instruction SWAP1 swaps the first two values on the stack. As shown in the previous shadow stack in Figure 7(c), those stack values were created in trace steps 2 and 3. Thus trace step 4 has RaW dependencies on trace steps 2 and 3. The first two values of the shadow stack are updated to trace step 4, which produces the corresponding swapped stack values.
In trace step 5, the SSTORE instruction is executed, which consumes the first and second stack items. The first stack item constitutes an address in storage (for the sake of the example, we assume that the Solidity compiler has allocated storage variable storage1 to storage location 0). The second stack item constitutes the value stored at the given address (0x2A in this example). The dependence edge (4, 5) is added to the RDG to record the dependency of trace step 5. Meanwhile, the value of storage1 in the shadow storage is rewritten to trace step 5, reflecting the last trace step that modified the storage value.
Trace steps 6 and 7 follow along the same lines. The final state of the EVM's stack and storage and the generated RDG are shown in Figure 7(g). The RDG node labeled "E" represents the End node.
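The walk-through can be replayed with a compact Python sketch. The instruction sequence below is an assumption reconstructed from the description: steps 1 to 5 follow the text, while steps 6 and 7 are taken to be a PUSH of the key of storage2 followed by a second SSTORE. The function name trace and the tuple encoding are hypothetical. In line with the walk-through, only RaW dependencies are collected.

```python
ROOT = 0


def trace(program):
    """Return (RaW dependencies per trace step, shadow storage)."""
    stack = []     # entries are (value, trace step that produced the value)
    sstorage = {}  # storage key -> trace step of the last SSTORE
    deps = {}      # trace step -> sorted list of trace steps it depends on
    for step, (op, arg) in enumerate(program, start=1):
        used = set()
        if op == "PUSH":
            stack.append((arg, step))
        elif op == "DUP":    # DUPn copies the n-th stack item
            value, producer = stack[-arg]
            used.add(producer)
            stack.append((value, step))
        elif op == "SWAP":   # SWAPn exchanges stack items 1 and n + 1
            (v_top, p_top), (v_other, p_other) = stack[-1], stack[-1 - arg]
            used.update((p_top, p_other))
            stack[-1], stack[-1 - arg] = (v_other, step), (v_top, step)
        elif op == "SSTORE":  # pops the key first, then the value
            (key, p_key), (_, p_value) = stack.pop(), stack.pop()
            used.update((p_key, p_value))
            sstorage[key] = step
        # Steps without dependencies are attached to the Root node.
        deps[step] = sorted(used) or [ROOT]
    return deps, sstorage


reset_program = [
    ("PUSH", 0x2A), ("PUSH", 0x0), ("DUP", 2), ("SWAP", 1), ("SSTORE", None),
    ("PUSH", 0x1), ("SSTORE", None),  # assumed continuation for storage2
]
deps, sstorage = trace(reset_program)
```

Under these assumptions the sketch reproduces the dependencies named in the text: step 3 depends on step 1, step 4 on steps 2 and 3, and step 5 on step 4, with the shadow storage entry for key 0 set to trace step 5.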

C. EFFICIENT TRACING
Collecting the trace information of the whole chain requires the execution of all transactions. Therefore, EVMTracer is built on a transaction replay mechanism to replay transactions locally and collect metrics efficiently. With millions of blocks on the blockchain, it is crucial to have an efficient tracing system for rapid collection. However, blockchain programs are inherently sequential: to execute block n, the world state after block n − 1 is required as the input state. The Ethereum world state contains the data of all the accounts on the blockchain. Data is encoded with a recursive length prefix (RLP, [24, Appendix B]) and is associated with each account by key-value (KV) pairs. The KV pairs are then stored as Merkle Patricia tries (MPTs). The storage information can be accessed or updated at runtime by executing the SLOAD and SSTORE instructions. However, because of the size of the MPTs, the SLOAD and SSTORE instructions are prohibitively expensive in terms of performance [49].
There are several options when it comes to historical transaction replay. An Ethereum client can be configured as an Ethereum archive node to store the entire history of the Ethereum world state locally. A server like Geth then provides a JSON RPC server, where the client sends the request to the server containing the arguments and the ID of the block it wants to replay. The server then retrieves the world state at block n − 1, replays block n and sends the result back to the client. However, this approach suffers from scalability issues due to the overhead of the JSON RPC API and the large amount of disk space occupied by the world state database. It requires 14 TB of disk space [61] to support full chain replays up to block 14 M and can take on the order of several weeks to complete the replay [29].
An alternative approach configures the client as an Ethereum full node. A client can download the world state at block n − 1 and then replay and verify block n. However, to download a particular state, the client must be able to find a peer in the P2P network that can provide the requested world state. Such a peer may not be available, and a sought world state far behind the tip of the chain will likely not be provided. If such an absence is encountered, the client must compute all the local states to produce the state at block n − 1. The empirical data reported in [29] suggests that replay from full nodes is essentially faster than archive nodes but still suffers from severe performance and storage issues.
Finally, the above approaches all lack built-in multithreading features. Although it is possible to run multiple Geth JSON servers to increase replay throughput with archive nodes, the performance of the archive node is inherently slow. Adding more instances does not help much in practice. For full nodes with multiple instances, it is currently impossible to allow multiple full nodes to share the same copy of the world state. Therefore, each instance must obtain its copy of the database, further aggravating the storage issue.
We built our EVMTracer system on top of the Substate Replayer, a scalable and efficient replay system developed in [29]. By encoding the states of each transaction in their minimal form, the Substate Replayer can reproduce the transactions efficiently and in parallel. Figure 8 illustrates the overall replay system. To record a transaction, the Substate Replayer first executes the transaction on the EVM. It then records the minimal information necessary to reproduce the result faithfully. This includes the storage information, which is a map from <Address, Key> to Value for the entries that are accessed or modified during the execution, the transaction description, and the transaction result. This information is stored in the substate database. Most importantly, the replayer records only the set of KV pairs accessed or modified by a transaction. In practice, a transaction only involves a few account addresses. As a result, a substate's size is much smaller than the size of the complete world state at a given block height. The recording step is required only once. After all the substates are recorded, the replay can be performed off the chain. In comparison, the Substate Replayer can replay 2817 tx s −1 for blocks 0-9 M, being on average 4.54× faster than the full node replay from Geth and consuming 59 % less disk space [29]. More importantly, the Substate Replayer provides multi-threading support without disk-space overheads. Those characteristics make the Substate Replayer a viable candidate for collecting execution traces for EVMTracer.
For our work, we first obtained the recorded substates up to 12 M blocks from the Substate Replayer. Given the substate database, we replayed those transactions locally through the EVMTracer system. Because the analysis of each transaction is conducted off the chain and in isolation, EVMTracer parallelizes the tracing of transactions through the use of multiple threads. For each transaction, a tracer is invoked to record the runtime dependence graphs of the EVM. A tracer takes as input a substate, which contains the contract bytecode and the program input, and outputs the runtime dependence graphs. The dependence graphs are collected for subsequent analyses and thus constitute the input for our parallelization and redundant computation metrics. Figure 9 demonstrates the overall workflow of the tracer.
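The following minimal sketch illustrates why self-contained substates make parallel tracing straightforward. The `Substate` fields, the `trace` placeholder, and `trace_all` are hypothetical names chosen for illustration; they do not reflect the actual data layout or API of the Substate Replayer or EVMTracer.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Substate:
    """Hypothetical minimal record needed to replay one transaction in isolation."""
    tx_id: str
    bytecode: bytes
    calldata: bytes
    # (address, key) -> value, restricted to the slots the transaction touched.
    storage: dict = field(default_factory=dict)


def trace(substate):
    """Placeholder tracer: a real tracer would run the EVM on the substate
    and emit a runtime dependence graph instead of this summary."""
    return {"tx": substate.tx_id, "touched_slots": len(substate.storage)}


def trace_all(substates, workers=4):
    """Trace every substate with a pool of worker threads.

    Each substate carries its own input state, so the workers share no
    world state and need no locking.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the substates.
        return list(pool.map(trace, substates))
```

In CPython, a process pool would be the analogous choice for CPU-bound tracing, since threads do not execute Python bytecode concurrently.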

IV. CONTRACT-LEVEL PARALLELISM
Leveraging the parallelism inherent in computations through parallel programming [62], [63], [64], [65], [66], [67], [68] is an important aspect of high-performance systems. A parallel execution model calculates a schedule based on the dependencies between tasks to take advantage of the multiple computation units available on modern hardware to execute VOLUME 11, 2023  multiple tasks concurrently, thereby potentially increasing system performance.
In this section, we investigate the potential gain from contract-level parallelism, i.e., the performance improvement that can be obtained from scheduling the tasks of a message-call transaction so that the EVM can execute multiple instructions in parallel. The runtime dependence graph describes the dependence relations among tasks. E.g., from the final runtime dependence graph in Figure 7(g) we observe that the two SSTORE instructions at trace steps 5 and 7 do not have to be executed sequentially. Because SSTORE instructions generally incur a high execution-time overhead [49], executing them in parallel can substantially improve EVM performance.
The parallel execution model considers all dependence types, including false dependencies (WaR and WaW) and control dependencies. Although false dependencies can be avoided by replicating data in memory, extending the memory space is not free in Ethereum and consumes gas. Note that EVMTracer is an analysis tool and does not conduct program transformations in the manner of optimizing compilers. The model assumes zero communication cost among processors. The basic principle to calculate the theoretical speedup is first to obtain a cost function f c which returns the estimated cost for the instruction at trace step i. Then, based on the cost of each task, calculate an instruction schedule L, which determines the execution order of instructions. Finally, we simulate the execution on n processors and compute the performance gain T n .
We obtained our cost function from the empirical result of Baird et al. [49]. In their work, they measured the execution time of EVM instructions in segments of one million blocks.
The performance model varies with block height because of (1) implementation changes in the EVM client due to protocol updates, performance improvements, and bug fixes, and (2) the costs of storage instructions like SSTORE and SLOAD that increase with block height.
We then devise a list-scheduling algorithm to compute the schedule L for the tasks. The scheduling problem is known to be NP-complete even with the assumption of zero communication cost and uniform resource distribution [69]. However, Graham [70] showed that any valid list-scheduling heuristic would be within a factor of 2× of the optimal schedule. Even better, it has been shown empirically [71] that a greedy scheduling heuristic using the critical path (CP) is within 5 % of the optimal schedule in 90 % of all cases. Hence, for our experiment to obtain task schedules at scale, it is advantageous to employ greedy CP scheduling.
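The factor-of-two guarantee can be made explicit with a standard argument for list scheduling. The symbols $T_1$ (total cost of all instructions, i.e., the sequential time) and $T_n^{*}$ (optimal time on n processors) are notation introduced here for illustration; $T_n$ is the time of any list schedule on n processors and $T_\infty$ the cost of the critical path.

```latex
\begin{align}
T_n &\le \frac{T_1}{n} + \left(1 - \frac{1}{n}\right) T_\infty, \\
T_n^{*} &\ge \max\!\left(\frac{T_1}{n},\; T_\infty\right), \\
\frac{T_n}{T_n^{*}} &\le 2 - \frac{1}{n} \;<\; 2 .
\end{align}
```

The first line holds because a processor is idle only while some instruction on a dependence chain is running, so the total idle time is at most $(n-1)\,T_\infty$. The second line states that no schedule can beat either perfect load balancing or the critical path.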
The CP scheduling algorithm always picks the node v with the longest path to the artificial end node (i.e., End). Intuitively, the CP scheduling algorithm greedily picks the node that leads to the path with the highest workload. We first calculate the priority function pri() with the help of backward induction. Let N + (v) denote the out-neighborhood of node v, which is the set of nodes o adjacent from v, i.e., connected by an outgoing edge (v, o). Then, pri(v) denotes the cost from node v to the End node, i.e., the cost of v plus the largest priority among its out-neighbors: $\mathrm{pri}(v) = f_c(v) + \max_{o \in N^{+}(v)} \mathrm{pri}(o)$, where the maximum over the empty out-neighborhood of the End node is taken to be 0.
After obtaining the priority function, we simulate the bytecode execution on n processors to calculate its expected execution time. Our simulation, as stated in Algorithm 1, employs the dependence graph to identify instructions that have all dependencies met and are thus ready to execute. (For such a ready instruction, the corresponding node in the dependence graph has no incoming edges.) Ready instructions are scheduled for execution on the executor, a simulated n-way superscalar EVM interpreter that we employ to record the overall execution time of the bytecode instructions. When the executor has completed the execution of an instruction, the node of the corresponding trace step is removed from the dependence graph, which will render subsequent instructions ready for execution. The simulation terminates when all nodes of the dependence graph have been executed.
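The backward induction for the priority function can be sketched as follows. The sketch assumes that pri(v) is the cost of v plus the largest priority among its out-neighbors, with an empty maximum counted as 0; the function name and the dictionary-based graph encoding are illustrative choices, not EVMTracer's implementation. It relies on `graphlib`, which is available from Python 3.9.

```python
from graphlib import TopologicalSorter


def critical_path_priorities(succ, cost):
    """Return pri(v) for every node of a dependence DAG.

    succ maps each node to its out-neighbors N+(v);
    cost maps each node to its estimated execution cost.
    """
    pri = {}
    # TopologicalSorter expects node -> predecessors. Passing the successor
    # map instead yields every node after all of its out-neighbors, which is
    # exactly the visiting order needed for backward induction from End.
    for v in TopologicalSorter(succ).static_order():
        pri[v] = cost[v] + max((pri[o] for o in succ.get(v, ())), default=0)
    return pri
```

Because the traversal is iterative, long traces do not run into Python's recursion limit.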
Our simulator employs priority queues [72], which are sets of elements where each element has a key associated. Given a set S, operation Insert(S, x) will add element x to the set. A min-priority queue provides operation ExtractMin(S), which will remove and return the element with the smallest key from S. Likewise, a max-priority queue provides operation ExtractMax(S) to remove and return the element with the largest key.
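A minimal sketch of both queue flavors on top of Python's binary-heap module `heapq` is shown below. The class and method names mirror the operations named above but are otherwise illustrative. The max-priority queue is obtained by negating keys, and an insertion counter breaks ties so that the stored items never need to be comparable.

```python
import heapq


class MinPriorityQueue:
    """Set of items with keys; extract_min removes the smallest key."""

    def __init__(self):
        self._heap = []
        self._count = 0  # insertion counter used as a tie-breaker

    def insert(self, key, item):
        heapq.heappush(self._heap, (key, self._count, item))
        self._count += 1

    def extract_min(self):
        key, _, item = heapq.heappop(self._heap)
        return key, item

    def __len__(self):
        return len(self._heap)


class MaxPriorityQueue(MinPriorityQueue):
    """Same structure with negated keys; extract_max removes the largest key."""

    def insert(self, key, item):
        super().insert(-key, item)

    def extract_max(self):
        key, item = self.extract_min()
        return -key, item
```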
Algorithm 1 (lines 4-6 and 21-29):
4: Exec ← ∅ ▷ Executor min-priority queue
5: while G.V ≠ ∅ do
6:   ▷ Add nodes n with no incoming edges in G to priority queue S, according to node's priority pri(n): ◁
...
21:   ▷ Update costs of instructions in executor: ◁
22:   Temp ← ∅
23:   for each n ∈ Exec do
24:     n.key = n.key − t_i
25:     Insert(Temp, n)
26:   Exec ← Temp
27:   ▷ Update dependence graph G: ◁
28:   remove node i and its outgoing edges in G
29: return t
In lines 7-9 of Algorithm 1, the simulator inserts all nodes from the vertex set G.V of the dependence graph that have an in-degree of 0 into the ready-queue S. The nodes with an in-degree of 0 are those nodes which do not have incoming dependence edges and hence contain instructions that are ready to execute. The loop in lines 11-16 fills executor Exec with instructions from the ready queue up to the executor's maximum capacity of n instructions. Because S is a max-priority queue where the key is the priority of a node, the extraction operation in line 13 will always choose the ready instruction with the highest cost towards the End node. (As mandated by the CP scheduling algorithm.) The executor itself is a min-priority queue, where the key is the execution-time cost of instructions (see lines 15-16). The executor will thereby always select the shortest-running instruction for execution (line 18). The executor adds the instruction's execution time t_i to the accumulated time t (lines 19-20). Because execution is n-way parallel, the cost of each of the remaining instructions in the executor is reduced by t_i (lines 22-26). We update the dependence graph by removing node i and its outgoing dependencies (line 28). The simulator iterates until all the vertices in the dependence graph have been processed.
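The simulation described above can be approximated by the following sketch. It is not a transcription of Algorithm 1: instead of decrementing the remaining cost of every instruction in the executor, it keys the executor by absolute finish time, which yields the same accumulated time. The function name, the dictionary-based graph encoding, and the requirement that node identifiers be mutually comparable (for tie-breaking) are assumptions of this sketch.

```python
import heapq


def simulate(succ, cost, pri, n):
    """Simulated execution time of a dependence DAG on n execution units.

    succ: node -> list of out-neighbors (every node must appear as a key)
    cost: node -> execution-time cost
    pri:  node -> critical-path priority of the node
    """
    indeg = {v: 0 for v in succ}
    for v in succ:
        for o in succ[v]:
            indeg[o] += 1

    # Ready queue: max-priority queue on pri(v), realized with negated keys.
    ready = [(-pri[v], v) for v in succ if indeg[v] == 0]
    heapq.heapify(ready)
    running = []  # executor: min-priority queue keyed by finish time
    t = 0
    remaining = len(succ)

    while remaining:
        # Fill the executor with the highest-priority ready instructions.
        while ready and len(running) < n:
            _, v = heapq.heappop(ready)
            heapq.heappush(running, (t + cost[v], v))
        # Complete the instruction that finishes first and advance the clock.
        t, v = heapq.heappop(running)
        remaining -= 1
        # Removing v from the graph may make its out-neighbors ready.
        for o in succ[v]:
            indeg[o] -= 1
            if indeg[o] == 0:
                heapq.heappush(ready, (-pri[o], o))
    return t
```

For a graph Root → {a, b}, a → c, {b, c} → End with costs a = 2, b = 5, c = 1 and zero-cost Root and End nodes, the sketch yields 8 time units on one execution unit and 5 on two, the latter matching the critical-path cost.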
To find the theoretical maximum speedup that can be obtained from parallel execution of independent instructions, we determine the bytecode execution time for a superscalar EVM interpreter with an infinite number of parallel execution units. In this case, no simulation is needed because the execution time is equivalent to the length of the critical path in the dependence graph G, starting from the artificial root node Root to the artificial end node End. In what follows, we write T n for the bytecode execution time achieved by an n-way superscalar EVM interpreter, and T ∞ for the execution time when utilizing an infinite number of execution units. Note that T ∞ is equivalent to the cost of the critical path in G, which is computed during the scheduling step, i.e., pri(Root).
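The relation between these execution times and the reported speedups can be summarized as follows. The symbols $T_1$ and $S_n$ are introduced here for illustration, under the assumption that the sequential baseline is the sum of all instruction costs.

```latex
\begin{align}
T_1 &= \sum_{i} f_c(i), &
T_\infty &= \mathrm{pri}(\mathit{Root}), \\
S_n &= \frac{T_1}{T_n}, &
S_\infty &= \frac{T_1}{T_\infty}, \qquad
S_n \le \min\left(n,\; S_\infty\right).
\end{align}
```

The bound on $S_n$ follows from $T_n \ge T_1/n$ and $T_n \ge T_\infty$: adding execution units cannot help once the critical path dominates.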

V. REDUNDANT COMPUTATIONS
Let EVM i be the input state of the EVM at trace step i. The instruction executed at trace step i constitutes a side effect f i that produces an output state f i (EVM i ). The instruction at trace step i is considered redundant if there exists a previously-executed instruction of the same operation type with side effect f j that results in f i (EVM i ) = f j (EVM i ). Intuitively, this means that there is no need to compute the result of the instruction at trace step i if we can memoize and reuse the result of the instruction at trace step j.
For this work, we are interested in the number of redundant memory and storage instructions, i.e., MLOAD, MSTORE, SLOAD and SSTORE. Instructions MLOAD and MSTORE are used for accessing the EVM memory. They enable features like dynamic data structures and customized data types in high-level smart-contract languages. The SLOAD and SSTORE instructions are the only means for the user to interact with the storage of an Ethereum contract, and they are the most expensive instructions in terms of gas cost and execution-time overhead [49]. Therefore, we are focusing on those four instructions for investigating redundant computations on the Ethereum blockchain.
We use the runtime dependence graph to determine redundancy. This is done by finding instructions that have the same sequence of data dependencies in the dependence graph. For these metrics, we ignore WaR, WaW, and control dependencies because they do not concern the input data flow of an instruction, but only the order in which the instructions should be executed.
Our redundancy analysis focuses on each contract in isolation, that is, an external contract call will generate its own (separate) metrics. We accomplish this distinction by identifying contract call instructions (e.g., CALL, CALLCODE) in the dependence graph such that redundancy is only counted within the same contract call.
The dependence graph we build represents each trace step as an individual node. To find the instructions that share  the same data dependencies, we reduce the data dependence graph to its minimal form, i.e., we merge redundant instructions into a single node.
We classify the data input of instructions into two categories, either stateless or stateful. Stateless input constitutes input values that are always the same for each execution of a given instruction. E.g., the PUSH instruction pushes a constant onto the stack; the value being pushed is statically encoded in the bytecode format and hence stateless. Another example is the SLOAD instruction, where one of its input values is the address of the target contract. The context is the current contract address if invoked through the CALL instruction. If invoked through the DELEGATECALL instruction, the context is the caller's address. This value is predetermined before execution starts and is always the same within the same contract execution; it therefore constitutes a stateless input. Our dependence graph records the stateless inputs for each trace step. We use function f stateless (i) to return an ordered list of stateless inputs for the instruction at trace step i. Likewise, function f RaW (i) returns an ordered list of inputs from RaW dependencies at trace step i.
A stateful input is an input value that has been created by another instruction and that is dependent on the EVM state (i.e., stack, memory, or storage). Stateful inputs are described by the data dependence graph. For example, the second argument of SLOAD is the storage key to load from. This value is always read from the top of the stack.
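We can then find redundant instructions and merge them into the same node incrementally. Starting at trace step i = 1, for each trace step, an instruction is redundant if there exists another instruction at trace step j such that the following four conditions hold: (1) j < i; (2) trace steps i and j execute the same instruction; (3) f stateless (i) = f stateless (j); and (4) f RaW (i) = f RaW (j), where the RaW dependencies are compared after earlier redundant nodes have been merged.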
Intuitively, the instruction at trace step i is redundant if there exists a trace step j prior to step i in the trace, trace steps i and j contain the same instructions, and the instructions agree on the stateless inputs and on the RaW dependencies.
E.g., consider the example program and its corresponding bytecode in Figure 10. The example is similar to the one in Figure 6, with the only difference that the two storage values are reset to another storage value init. Figure 11(a) shows the final dependence graph produced by the tracer. All edges represent RaW dependencies. With each trace step i we depict the stateless input f stateless (i) next to it. E.g., trace step 1 is a PUSH instruction associated with the value 0x2, and the SLOAD and SSTORE instructions are associated with the contract address, which is represented by @ (they are all the same in this example). Starting at trace step 1, the first redundancy is found at trace step 5, where the two PUSH instructions 1:PUSH 0x2 and 5:PUSH 0x2 are associated with the same stateless input 0x2 and only depend on the Root node. After merging trace step 5 into 1, the next redundancy is found at trace step 6, which coincides with trace step 2: the two SLOAD instructions share the same data dependencies (trace step 1 and Root) and are associated with the same contract address. The remaining trace steps 7 and 8 do not yield further redundancies, showing that the dependence graph in Figure 11(c) is in its minimal form, and one redundant PUSH and one redundant SLOAD instruction have been identified.
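The incremental merging can be sketched as a lookup keyed by the instruction, its stateless inputs, and its already-merged RaW dependencies. The tuple-based trace encoding, the function name, and the use of 0 for the Root node are assumptions of this sketch; it also ignores the per-contract-call scoping described earlier. The eight-step trace used in the usage note is a hypothetical reconstruction in the spirit of Figure 10 (only steps 1, 2, 5 and 6 are described in the text).

```python
def merge_redundant(trace):
    """Merge redundant trace steps.

    trace: list of (step, opcode, stateless_inputs, raw_deps) in execution
           order, where raw_deps lists the producing trace steps (0 = Root).
    Returns (rep, redundant): rep maps every step to its representative
    node, redundant lists the steps merged into an earlier node.
    """
    rep = {0: 0}    # Root represents itself
    seen = {}       # signature -> first trace step with that signature
    redundant = []
    for step, opcode, stateless, raw_deps in trace:
        # Dependencies are compared after merging, via their representatives.
        signature = (opcode, tuple(stateless), tuple(rep[d] for d in raw_deps))
        if signature in seen:
            rep[step] = seen[signature]
            redundant.append(step)
        else:
            seen[signature] = step
            rep[step] = step
    return rep, redundant


# Hypothetical trace: two stores that both read storage slot 0x2 first.
EXAMPLE_TRACE = [
    (1, "PUSH", ["0x2"], [0]),
    (2, "SLOAD", ["@"], [1, 0]),
    (3, "PUSH", ["0x0"], [0]),
    (4, "SSTORE", ["@"], [3, 2, 0]),
    (5, "PUSH", ["0x2"], [0]),
    (6, "SLOAD", ["@"], [5, 0]),
    (7, "PUSH", ["0x1"], [0]),
    (8, "SSTORE", ["@"], [7, 6, 0]),
]
```

On this trace, steps 5 and 6 are reported as redundant and merged into steps 1 and 2, respectively, in line with the outcome described for Figure 11.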

VI. EVALUATION
The hardware specification of our evaluation platform is stated in Table 1. This platform is an on-premise bare metal server maintained by the authors of this study. To accommodate the substate database and the trace data, an array of five 6 TB Seagate HDDs was used.
Our tracing stage employed 64 worker threads and collected statistics for the initial 12 M blocks of the Ethereum mainnet. The details of the runtime dependence graphs for each segment of 1 M blocks are depicted in Table 2. Tracing finished within three days and resulted in 614 million runtime dependence graphs, which amount to 29.13 TB of data. The number of traces and the size of the dependence graphs increase with block height, which indicates that the number of transactions per block and the complexity of the computations inherent in transactions have been steadily increasing since the inception of Ethereum in 2015.

A. PARALLELISM METRIC
Regarding our parallelism metric, we investigate the following research questions: 1) How much parallelism can be leveraged at runtime? 2) How many execution units do we need to obtain a given speedup? 3) Is it worth developing contract-level parallelism for future smart contract execution models?
Table 3 depicts the average speedup for each segment of 1 M blocks. We write T ∞ to denote the theoretical maximum speedup from an infinite number of execution units, and T n for the speedup using n execution units. It follows from column ''T ∞ '' that the parallelism inherent in the bytecode is similar across all segments. The largest speedup of a factor of 2.29× is observed with the segment 2-3 M, but we note that in this segment, the Ethereum network was facing denial-of-service (DoS) attacks that exploited underpriced EVM instructions to slow down block processing [73]. The 2-3 M segment may thus not be the most representative with respect to the performance of main-stream smart contract workloads. Leaving out this segment, we obtain a geometric mean [74] for the theoretical maximum speedup T ∞ of 1.90×. As can be observed from columns ''T 2 ''-''T 8 '', the speedup from increasing the number of execution units levels off quickly. Four execution units already leverage a large part of the inherent parallelism, and the performance with eight execution units is very close to the theoretical maximum speedup. Figure 12 shows the distribution of the theoretical maximum speedup. We define the performance ratio as 1/T ∞.
It is instructive to draw a comparison of these performance figures from the EVM with the instruction-level parallelism (ILP) observed with CPU architectures. The computer architecture field has a 25-year history of research and development into ILP, which has produced comprehensive quantitative data about the performance of various workloads on contemporary, register-based microarchitectures [75], [76], [77], [78], [79].
The EVM, in comparison, is a novel, stack-based virtual machine architecture with smart contracts as its only workload.
Despite these inherent differences, we observe the following striking similarity: the reported speedup that ILP yields from leveraging the parallelism within basic blocks is on average 1.74×, median 1.5× [77]. In comparison, the median speedup obtained in our study is 1.79×. (Note that our use of control dependencies between basic blocks enforces in-order execution of basic blocks.) Although advances in computer architecture (e.g., speculative execution [75], register renaming [80], alias analysis [81], predicated execution [82]) and compilers (e.g., loop unrolling, software pipelining [56]) further increased ILP, the diminishing returns from those measures prevented modern CPU designs from exceeding 8-fold instruction issue [75,Fig. 3.46], and gains have been found diminishing already beyond 4-fold instruction issue [76].
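For reference, the geometric mean used to aggregate the per-segment speedups can be computed as in the following minimal sketch. The function is generic, and the values in the usage note are made up for illustration; they are not taken from Table 3.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values.

    The computation runs in log space so that long lists of ratios
    neither overflow nor underflow.
    """
    if not values:
        raise ValueError("geometric mean of an empty sequence")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

For example, `geometric_mean([2.0, 8.0])` evaluates to 4.0 up to floating-point rounding, whereas the arithmetic mean of the same values is 5.0. The geometric mean is the usual choice for averaging ratios such as speedups because it is less sensitive to a single outlying segment.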
TABLE 4. Redundant computations and wasted resources in the initial 12 M blocks of the Ethereum mainnet. For each instruction, we provide the total number of occurrences (non-redundant plus redundant), the number of redundant occurrences (out of the total), and the wasted resources in terms of Ethereum gas and USD. All quantities stated in units of million (M). Gas price and Ethereum price are averaged at 80.50 gwei per gas unit and 2453.65 USD per Ether, using the historical average price [84], [85] from Jan. 2021 until Nov. 2022 (the time of writing).
To assess the question as to what extent the ILP in EVM bytecode can be further leveraged by the before-mentioned computer-architectural measures, we note that the EVM meters the execution of bytecode instructions in units of gas, and that any measure that increases the number of executed bytecode instructions (including failed speculative execution across basic blocks due to branch misprediction) will inevitably increase the gas cost of transactions and thus will not be economical on the Ethereum platform. Likewise, the amount of storage used by a smart contract is metered, and measures that may increase the storage size of a smart contract (including duplication of storage locations to eliminate false dependencies) will not be economical. In contrast, compilation techniques such as software pipelining and loop unrolling can potentially be adopted for the economic constraints of the EVM's execution model. We consider this as an area of future research that exceeds the scope of this paper, which focuses on the bytecode deployed on the Ethereum mainnet. The contribution of EVMTracer for such work will be the provision of a suitable metric that can readily be computed for a smart contract deployed on a testnet.
To answer our third research question, our experiments suggest that the bytecode already deployed on the Ethereum mainnet contains a non-negligible amount of ILP (1.79×). To make its exploitation profitable, optimizing compilers may be required to further increase the inherent amount of ILP in bytecode. But optimizing-compiler support for the blockchain domain is reported to be immature and lacking optimizations for ILP [41]. Smart contracts have been found to contain a sufficient number of loops [83], which provides the potential for the before-mentioned software pipelining and loop unrolling optimizations, as long as the inherent economic constraints of the Ethereum gas cost model can be met. The Ethereum community has been found to adopt new language and compiler versions quickly [42], but already-deployed contracts are immutable and hence unaffected by future changes in the compiler.

B. REDUNDANT COMPUTATIONS
With our redundancy analysis, we focus on the MLOAD, MSTORE, SLOAD and SSTORE instructions to conduct a quantitative analysis to answer the following research questions: 1) What is the amount of redundant computations that smart contracts execute at runtime? 2) What is the cost overhead of redundant computations in terms of gas and USD?
Table 4 summarizes the redundant computations and wasted resources for the initial 12 M blocks of the Ethereum mainnet. Among the surveyed instructions, SSTORE exhibits the most efficient utilization with a redundancy rate below 0.1 % of all executed instructions. Next in terms of efficient utilization is the MSTORE instruction, with a redundancy rate of 14 %. Efficiency decreases further with the SLOAD and MLOAD instructions, of which we find that 34.97 % of all SLOAD instructions and 54.54 % of all MLOAD instructions are redundant. Consequently, the amount of wasted gas and its monetary equivalent of 253.79 million USD for SLOAD and 7.03 million USD for MLOAD are substantial. Connecting these findings with quantitative performance data reported for the Ethereum blockchain [49], we note that with the later blocks in the cited study (i.e., the range 5-8 M), the EVM on average had to spend more than 75 % of the overall bytecode interpretation time on SLOAD instructions. Therefore, the execution-time overhead of the SLOAD instructions that EVMTracer detected to be redundant can be expected to be non-negligible and must be tackled to reach the transaction throughput goals of Ethereum (cf. Section I).
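The conversion from wasted gas to USD can be sketched as follows, using the average gas price (80.50 gwei per gas unit) and Ether price (2453.65 USD) stated in the caption of Table 4 as defaults. The function name is illustrative, and the sketch assumes a single flat average rather than per-transaction prices.

```python
GWEI_PER_ETHER = 10**9  # 1 Ether = 10^9 gwei


def wasted_usd(gas_units, gas_price_gwei=80.50, usd_per_ether=2453.65):
    """Monetary equivalent of a gas amount under flat average prices."""
    ether = gas_units * gas_price_gwei / GWEI_PER_ETHER
    return ether * usd_per_ether
```

Under these averages, one million units of wasted gas correspond to roughly 197.52 USD.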
We attribute the low redundancy rate of SSTORE to its prohibitively high gas cost. The Ethereum specification as of Oct. 2022 [24] charges 2900 units of gas for an SSTORE instruction (category G sreset ), compared to 3 units of gas for MSTORE (G verylow ). Programmers of smart contracts thus always had a high incentive to eliminate redundant storage write operations already in the source code, e.g., by deferring updates to the storage until after the entire computation has been accomplished. Relatedly, a study from the year 2020 already found the Solidity compiler solc [46] to cache the most frequently accessed storage data in memory [41]. The historical data that EVMTracer collected for redundant instructions per transaction as depicted in Table 5 supports these facts: the overall number of SSTORE instructions (column ''SSTORE.Total'') per transaction does not significantly change across the 1-million segments of blocks on the Ethereum mainnet. However, we observe a steady increase in the number of MSTORE instructions (column ''MSTORE.Total'').
We identify two contributors to the high redundancy rate of MSTORE instructions. First, compilers of smart contracts will cache storage data in memory, as mentioned above, which will convert redundant SSTOREs into redundant MSTOREs. Second, it has been reported on the Solidity GitHub repository that the Solidity compiler will generate redundant MSTORE instructions under certain circumstances. In particular, we have found two issues related to redundant MSTOREs, and both can be detected by our analysis. Issue 12211 [86] mentions that redundant MSTORE instructions are issued when copying struct objects. Issue 10755 [87] outlines a possible improvement of the compiler so that MSTORE will no longer store zero values in locations that are known to be zero already.
The high number of redundant MLOAD instructions is expected, as it is related to the EVM's stack model of execution. Interpreters that employ a stack instead of registers to hold temporary results during the evaluation of a program are attractive for their compact bytecode representation, which spurred the network computing ecosystem of Java [88], Microsoft's .NET common language runtime [89], and several VM infrastructures for sensor networks [90], [91], [92], [93]. Stack-based interpreters found renewed interest with blockchains where the small code size of smart contracts reduces the on-chain storage requirements, e.g., with the EVM [24], TVM [94], and WebAssembly [95]. Stack code is compact because instructions only encode the operation (op-code), while operands are implicitly consumed from the stack and results produced on the stack. Operands in registers can be used several times, but a stack operation consumes its operands. To reuse a value computed on the stack at a later time, the compiler will have to generate code that spills the value into a temporary local variable in memory from where it can be later reloaded onto the stack (facilitated by the EVM's MSTORE and MLOAD instructions). Stack access is more efficient than access to local variables, which spurred research into optimizations that convert local variable accesses to stack accesses [96], [97], [98], [99], [100]. But their adoption for the Solidity compiler has not been confirmed yet [41], [46]. The EVM instruction set provides the bytecode instructions SWAP⟨n⟩ and DUP⟨n⟩, which allow direct addressing of the 16 top-most stack slots and hence facilitate such optimizations. However, the margin of profit is narrow: the MSTORE and MLOAD instructions are in the same gas-cost category as the before-mentioned stack manipulation instructions (G verylow ). Storing and later reloading a temporary value will thus cost six units of gas, and any optimization that caches the temporary value on the stack must stay below this limit to be profitable.
In our metric, if a value is referenced more than once in the program, subsequent loads will result in data redundancies in the trace. Therefore, the high redundancy observed with MLOAD instructions is an artifact caused by the EVM architecture. Even for a highly-optimized, register-based interpreter, the code size overhead has been found to be 26 % [101], which means that the stack-based architecture of the EVM is likely to stay. The before-mentioned optimizations, as well as design changes in the EVM instruction set that benefit such optimizations, are therefore necessary means to mitigate the runtime overhead. As mentioned with the parallelism metric, the contribution of EVMTracer can be the provision of a metric that is readily computable to guide the design and development effort.
Finally, it follows from Table 4 that 34.97 % of all SLOAD instructions have been identified as redundant. Unlike MLOAD, where the current infrastructure does not have a more economical solution to cache a repeatedly used value, storage values can indeed be cached in memory or on the stack to reduce gas consumption and improve performance, because storage instructions are much more expensive than memory and stack instructions. E.g., as of Oct. 2022 [24], loading a value from storage for the first time during contract execution costs 2100 units of gas (G coldsload ). For subsequent loads (G warmaccess ), the cost is 100 units of gas. Those costs are substantially higher than the costs of the MLOAD and MSTORE instructions and thus make caching a profitable target for further performance optimizations.
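A back-of-the-envelope comparison of the two strategies, using only the gas costs quoted above, can be sketched as follows. The model is deliberately simplified and is an assumption of this sketch: it counts one MSTORE to populate the cache and one MLOAD per later read, and it ignores memory-expansion costs and any additional stack-manipulation instructions.

```python
G_COLD_SLOAD = 2100   # first load of a storage slot during execution
G_WARM_ACCESS = 100   # each subsequent load of the same slot
G_VERYLOW = 3         # MLOAD and MSTORE


def gas_uncached(reads):
    """Gas for reading the same storage slot `reads` times via SLOAD."""
    if reads < 1:
        raise ValueError("at least one read is required")
    return G_COLD_SLOAD + (reads - 1) * G_WARM_ACCESS


def gas_cached_in_memory(reads):
    """Gas when the first SLOAD is cached in memory and later reads use MLOAD."""
    if reads < 1:
        raise ValueError("at least one read is required")
    return G_COLD_SLOAD + G_VERYLOW + (reads - 1) * G_VERYLOW
```

In this model, caching costs three extra units of gas for a single read (2103 versus 2100) but already pays off at the second read (2106 versus 2200), and the gap widens by 97 units of gas with every further read.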

VII. RELATED WORK

A. ILP IN COMPUTER ARCHITECTURE
Research in computer architecture has a 25-year history in ILP, which has produced comprehensive quantitative data about the performance of a variety of workloads on contemporary, register-based microarchitectures [75], [76], [77], [78], [79]. In comparison, the EVM is a stack-based virtual machine architecture implemented in software, with operations for persistent storage, and smart contracts as the sole workload. Our parallelism metric models all dependencies of the EVM runtime environment, including dependencies related to the EVM stack and to persistent storage. By restricting our metrics to EVM bytecode, they are applicable to all smart contracts deployed on the Ethereum mainnet, and they can be readily applied with testnet-based development environments. As discussed in Section VI-A, the cost model of the EVM charges a nominal gas fee per executed bytecode instruction, which renders many ILP mechanisms from computer architecture infeasible on the EVM. Prior work in the area of ILP in computer architecture does not support our redundancy metric.

B. CONCURRENCY CONTROL MECHANISMS IN SOFTWARE
A large body of work [102], [103], [104], [105], [106], [107], [108] focuses on execution schemes that facilitate the execution of multiple smart contract transactions in parallel. This is different from exploiting parallelism within the contract itself. Muchhala et al. [109] propose a system that allows multiple nodes to execute smart contracts using a MapReduce approach with a focus on Big Data applications. Other, more generic approaches in the virtual machine community enable parallelism by redesigning the execution model of Python interpreters [110], [111], adding specialized instructions to access data and compute resources in parallel for Java [112], enabling thread-safe built-in collections [113], and facilitating accelerator-assisted garbage collection [114], [115]. However, no work has been done yet in the context of smart contract virtual machines. Our study is the first to investigate the parallelism inherent in Ethereum transactions, and at scale.

C. REDUNDANT COMPUTATIONS
Our redundancy metric determines the number of redundant computations that occur in a smart contract at runtime. Many optimizations exist that aim at eliminating redundant computations. They can be categorized into static and dynamic approaches. Common subexpression elimination and partial redundancy elimination are static optimizations that are based on control flow analysis [51], [56], [116]. Memoization [117], [118], [119] is a dynamic redundancy elimination technique that trades memory space for performance. For each executed function call, the input argument values and computed results are cached in a lookup table. If the function is subsequently called with the same argument values, the cached results will be returned to avoid re-computation. For the stack model of execution, a large body of work has been conducted on compiler optimizations that replace local variable accesses by stack accesses [96], [97], [98], [99], [100]. In [120], dynamic instruction scheduling is performed to reduce the stack usage of a JVM. In the process of improving the compiler optimization pipeline, the Ethereum community has identified several optimization opportunities regarding redundant operations, as evident in issues 10690, 10755, 12211, 12460, 12735, and 12755 from the Solidity GitHub repository [121]. Those issues are open and waiting to be fixed.
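The memoization scheme described above can be illustrated with a minimal, generic sketch. The decorator below is illustrative only and assumes a side-effect-free function with hashable positional arguments; it is not taken from any of the cited systems.

```python
def memoize(fn):
    """Cache results of fn keyed by its positional arguments."""
    cache = {}

    def wrapper(*args):
        # Compute only on the first call with these arguments;
        # later calls are answered from the lookup table.
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]

    wrapper.cache = cache  # exposed for inspection
    return wrapper
```

The trade-off is the one named in the text: the lookup table grows with the number of distinct argument tuples, so memory is spent to avoid re-computation.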

D. TRACING
Tracing has found various applications, including coverage analysis [122], performance profiling [123], and validation [124]. On the topic of blockchain, Chen et al. [125] introduce Forerunner, which pre-executes transactions and generates constraint-based, highly optimized program representations. The technique is similar to a tracing just-in-time compiler [126] and relies heavily on tracing to collect runtime information for future optimization. Different from EVMTracer, their system is an optimizer and works in real time to speed up the transaction throughput through speculative execution. Ding et al. [127] introduce a prototype system, SCMon, for monitoring smart-contract runtime behavior such as function execution time, function call graphs, and gas consumption. The system uses an instrumented tracing technique that requires the source code of a smart contract. A similar shadow stack is implemented to track function calls. Their experimental results are restricted to a set of synthetic smart contracts. In contrast, our system does not require the smart-contract source code and is able to provide at-scale statistics of all historical transactions on the Ethereum blockchain. Because the metrics of EVMTracer require more fine-grained statistics to construct runtime dependence graphs, we had to implement a complete set of shadow data structures, including stack, memory, and storage.
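As a rough illustration of how a shadow stack can yield a runtime dependence graph, the following Python sketch traces a toy stack machine with only three hypothetical opcodes (`PUSH`, `ADD`, `MUL`). It is a simplified sketch under stated assumptions and not EVMTracer's implementation: shadow memory and shadow storage are omitted, and the ratio of instruction count to longest dependence chain is used here only as one common way to approximate available parallelism, which may differ from the definition used in this work.

```python
from typing import Dict, List, Tuple

Instruction = Tuple[str, int]  # (opcode, operand); the operand is ignored for ADD/MUL


def trace(program: List[Instruction]) -> Tuple[List[Tuple[int, int]], Dict[int, int]]:
    """Return dependence edges (producer, consumer) and the chain depth of each instruction."""
    shadow: List[int] = []             # shadow stack: index of the instruction that produced each slot
    edges: List[Tuple[int, int]] = []  # runtime data dependence graph
    depth: Dict[int, int] = {}         # length of the longest dependence chain ending at an instruction

    for idx, (op, _operand) in enumerate(program):
        if op == "PUSH":
            depth[idx] = 1             # no inputs, so the chain starts here
        elif op in ("ADD", "MUL"):
            rhs = shadow.pop()         # producer of the top stack slot
            lhs = shadow.pop()         # producer of the slot below it
            edges.append((lhs, idx))
            edges.append((rhs, idx))
            depth[idx] = 1 + max(depth[lhs], depth[rhs])
        else:
            raise ValueError(f"unsupported opcode: {op}")
        shadow.append(idx)             # this instruction produces the new top slot
    return edges, depth


def work_span_ratio(program: List[Instruction]) -> float:
    """Instruction count divided by the longest dependence chain (assumed metric)."""
    _, depth = trace(program)
    return len(program) / max(depth.values())


# (1 + 2) * (3 + 4): the two additions do not depend on each other.
program = [("PUSH", 1), ("PUSH", 2), ("ADD", 0),
           ("PUSH", 3), ("PUSH", 4), ("ADD", 0),
           ("MUL", 0)]
edges, depth = trace(program)
ratio = work_span_ratio(program)
```

In this example the seven instructions form a longest chain of three (a `PUSH`, an `ADD`, then the `MUL`), so the two additions could in principle be evaluated independently.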

VIII. CONCLUSION
We have introduced EVMTracer, an offline tracing framework to obtain runtime dependence graphs during transaction execution on the EVM. From the runtime dependence graphs, EVMTracer computes two valuable metrics: (1) contract-level parallelism and (2) redundant computations. We used EVMTracer to collect the runtime dependence graphs and compute both metrics for the initial 12 M blocks on the Ethereum mainnet. We found that Ethereum smart contracts include a non-negligible amount of contract parallelism, with a geometric mean of 1.90× theoretical maximum speedup. With respect to redundant computations, we found that transaction execution is affected by a high number of redundant MLOAD and SLOAD instructions. The redundant MLOAD instructions are related to the EVM's stack model of execution, and further work in compiler optimizations is required to mitigate this overhead. Redundant SLOAD instructions, owing to their high gas costs, have already caused significant economic damage and should be treated as a priority. Overall, we have shown that EVMTracer is capable of performing large, at-scale runtime tracing of Ethereum transactions, and we have used the obtained runtime dependence graphs to compute two metrics that shed light on the design of future blockchain engines and smart contract compilers.

AVAILABILITY
The source code of the EVMTracer infrastructure is publicly available at https://github.com/verovm/evmtracer.