Highly Available Blockchain Nodes With N-Version Design

Like all software, blockchain nodes are exposed to faults in their underlying execution stack. Unstable execution environments can disrupt the availability of blockchain node interfaces, resulting in downtime for users. This paper introduces the concept of N-version blockchain nodes. This new type of node relies on the simultaneous execution of different implementations of the same blockchain protocol, in line with Avizienis' N-version programming vision. We design and implement an N-version blockchain node prototype in the context of Ethereum, called N-ETH. We show that N-ETH is able to mitigate the effects of unstable execution environments and significantly enhance availability under environment faults. To simulate unstable execution environments, we perform fault injection at the system-call level. Our results show that existing Ethereum node implementations behave asymmetrically under identical instability scenarios. N-ETH leverages this asymmetric behavior available in the diverse implementations of Ethereum nodes to provide increased availability, even under our most aggressive fault-injection strategies. We are the first to validate the relevance of N-version design in the domain of blockchain infrastructure. From an industrial perspective, our results are of utmost importance for businesses operating blockchain nodes, including Google, ConsenSys, and many other major blockchain companies.


I. INTRODUCTION
Blockchain technology is fundamental to offering secure, reliable, and decentralized software services [1]. Blockchains enable the transaction of digital currencies [2], as well as the creation and execution of smart contracts [3]. Several businesses depend on the correct operation of blockchain networks to provide services to their clients [4]. Major actors such as banks or cryptocurrency exchanges with high volumes of end-users rely on trustworthy and uninterrupted access to said networks [5], [6], [7].
All those actors exercise their mission-critical business on top of blockchain nodes [8]. However, blockchain nodes are never without risk of failure, and blockchain outages have occurred several times, causing downstream service disruptions and loss of revenue [9], [10]. Software failures are often the consequence of operating system, network, or hardware problems, which cause unstable execution environments [11]. Therefore, there is a clear need for techniques that allow blockchain node operators to mitigate the effects of software failures.
N-Version programming is a proven approach to building fault-tolerant systems [12]. N-Version programming consists of creating several implementations, called versions, of a program based on the same specification. These versions are meant to be executed simultaneously with the same inputs, and the produced outputs are compared afterward as a means of fault detection and fault tolerance.
For some blockchains, multiple compatible implementations exist, given that blockchains are protocol-driven by design. For example, Ethereum's execution and consensus layers have respectively four and five major implementations [13]. In general, the usage of many versions of blockchain implementations is regarded as an important addition towards achieving systemic reliability [14], [15], [16].
In this paper, our key insight is to take advantage of diverse implementations of blockchain nodes to enhance dependability properties. This is realized as the novel concept of "N-Version blockchain node", which is an ensemble of diverse blockchain nodes that collectively provide improved services to external clients. We show that N-Version blockchain nodes are a valuable approach for users with high-availability requirements. To the best of our knowledge, taking advantage of N-Version design for blockchain is novel and has not been proposed before.
To implement our vision of an N-Version blockchain node, we carry out the following steps. First, we design a novel architecture that provides higher availability to external clients while encapsulating the inner complexity of N-Version software. This N-Version design includes strategies for request routing, error handling, and response comparison and ranking in the context of blockchain nodes. Second, we implement an N-Version blockchain node prototype for one of the most sophisticated, feature-rich, complex, and globally adopted blockchains: Ethereum. We deploy and coordinate several Ethereum client implementations behind a common interface, in production. Third, we set up an original experimental framework for measuring blockchain node availability. This includes characterizing response latency and correctness, and designing a comparison oracle in the domain's specific context. For these experiments, we synthesize realistic fault injection strategies that cause unstable execution environments. Last, we compute the availability rates of common blockchain nodes as well as the rates of our N-Version prototype.
To sum up, our contributions are:
• The concept of N-Version design for blockchain nodes. A blueprint of an N-Version blockchain node that leverages the natural diversity of blockchain implementations. To our knowledge, this is the first-ever realization of the N-Version design vision in the context of blockchain systems.
• An implementation of the diverse N-Version blueprint for the Ethereum blockchain using state-of-the-art Ethereum implementations.
• A sound methodology for studying availability of blockchain nodes in production using realistic fault injection, and the corresponding sound results demonstrating the strong advantages of N-Version for blockchain.

A. Blockchain Technology
Blockchains are distributed ledgers, where data is aggregated and stored in discrete units called blocks [2]. Blockchains are created by peer-to-peer networks, where each peer, or node, is a host that executes a blockchain client. A blockchain stored on disk is unlikely to crash until it runs.

Fig. 2: An external application may interact with a blockchain through any node that exposes its interface.

Blockchain clients implement a protocol, which describes how blocks are verified, agreed upon, and propagated throughout the peer-to-peer network. Decentralization is essential to blockchains; therefore, every participating node must be able to store a replica of the ledger and verify all incoming blocks before making them part of its own state. Ideally, every node also exposes an interface, allowing the blockchain's users to issue read/write queries.
Each blockchain has its own protocol; prime examples of these protocols are Bitcoin's [2] and Ethereum's [17]. All the client implementations for a given blockchain must comply with this blockchain's protocol, and no other restrictions are made on the implementation. This allows for the creation of several diverse client implementations for the most popular blockchains [18], [14].
In addition to peer-to-peer communication, blockchain nodes provide standardized outward-facing channels to interact with external applications. For example, Ethereum clients implement the JSON-RPC API specification [19]. This API allows external applications to connect and query the blockchain using a uniform set of methods. The available operations include, e.g., querying the status of the blockchain at an arbitrary point in the past or issuing new, state-altering transactions. External applications take advantage of the API to build various services, such as cryptocurrency exchanges, dApps, or games.
Figure 2 depicts an overview of both a blockchain network and an external application. Here, two types of interactions are highlighted: dotted lines show external applications' requests toward a target blockchain node through its outward-facing interface, and solid lines show the blockchain node sending and receiving data to and from its connected peers.
An example of data exchange between a blockchain node and an external application is shown in Listings 1a and 1b. Listing 1a shows a complete Ethereum JSON-RPC call. It contains the version of the interface (jsonrpc), the name of the method to be invoked (method), the parameters passed to the method (params), and a request identifier (id). Listing 1b contains the data returned for the previous request, which is specific to the "eth_getBlockByNumber" method. Among other information, it contains the number of the block (number), its size (size), a timestamp (timestamp), and a list of transactions (transactions).
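As a minimal sketch of the request structure described above, the following Python snippet builds a JSON-RPC 2.0 request body. It assumes the call targets the standard `eth_getBlockByNumber` method, whose response carries the block fields listed above; the helper name `build_request` and the chosen parameters are illustrative and are not taken from the paper's listings.

```python
import json


def build_request(method, params, request_id=1):
    """Serialize a JSON-RPC 2.0 request with the four fields named in the text."""
    return json.dumps({
        "jsonrpc": "2.0",    # version of the interface
        "method": method,    # name of the method to invoke
        "params": params,    # parameters passed to the method
        "id": request_id,    # request identifier, echoed back in the response
    })


# Hypothetical example: ask for the latest block, with transaction hashes only.
payload = build_request("eth_getBlockByNumber", ["latest", False])
```

The resulting string would be sent as the body of an HTTP POST to the node's JSON-RPC endpoint.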
The example in Figure 2 is greatly simplified compared to reality: a production blockchain network is composed of thousands of nodes spread in complex topologies all over the globe [20]. Each node serves numerous external applications, meaning that there are many more external applications than nodes.

B. N-Version Design
An N-Version software application relies on the simultaneous execution of N programs that are different implementations of the same specification [21]. This type of architecture is used to enhance a desired property of the application, such as reliability [15], [22], performance [23], or security [24], [25]. In most cases, this is achieved by comparing or matching the outputs of the N programs given the same input.
N-Version software applications are typically produced through N-Version programming. N-Version programming is defined as the independent development of the same specification by different teams [26]. In this context, each program developed by one team is called a version. Ideally, each version is implemented independently using competing designs, programming languages, and software stacks [12]. A high degree of diversity is key, as it lowers the probability of shared faults between versions [27], [28], [29].
Interestingly, there exist software specifications for which multiple isolated implementations surface naturally [30]. These implementations emerge with no coordinated effort to produce them. Instead, they emerge spontaneously due to market competition, the need for optimization, or opinionated design approaches [31]. Web browsers are an example of this pattern: they are developed independently of each other by competing actors, and yet they conform to the same standards [32].
When independent versions that comply with the same protocol emerge naturally, it is possible to harness them to build N-Version software applications [33]. This approach can be called natural N-Version design.
Definition: Natural N-Version design is the creation of N-Version software applications by harnessing, deploying, and simultaneously executing already-available implementations of a software specification.

III. HIGHLY AVAILABLE BLOCKCHAIN NODE

A. Blockchain Node Availability
The availability of a blockchain can be defined as the probability that it is functioning correctly at an arbitrary point in time [34]. For the purpose of this paper, we include in this definition the possibility of single blockchain nodes to participate in the network in a degraded state. Therefore, we characterize blockchain nodes' availability as a categorical variable, which can hold three values: Available: The API's responses are timely, compliant, and fresh; Degraded: The API's responses are not compliant or not fresh; and Unavailable: The API's responses are not timely or requests are denied.
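The properties of the responses are defined as follows. Timeliness refers to obtaining a response to a request within a time span $T$. For example, if $T$ is set to 100 ms, a response that takes longer than 100 ms is not timely.

We define the availability status $S$ of a blockchain node $n$ with response $r$ as follows:

$$
S(n, r) =
\begin{cases}
\text{AVAILABLE} & \text{if } t_r \le T \land c_r \land f_r \le F \\
\text{DEGRADED} & \text{if } t_r \le T \land (\lnot c_r \lor f_r > F) \\
\text{UNAVAILABLE} & \text{if } t_r > T \text{ or the request is denied}
\end{cases}
\tag{1}
$$

where $t_r$ is the response time, $c_r$ the compliance of the response, and $f_r$ the freshness of the response. We measure $t_r$ as time in milliseconds, $c_r$ as a boolean value, and $f_r$ as the block distance between $n$ and an external oracle. The upper bounds for response time and block distance are $T$ and $F$, respectively, and are set by the node's operator according to their own requirements.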
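The three-way classification described in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the prototype's code: the `Response` type, its field names, and the use of `None` to model a denied request are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    latency_ms: Optional[float]  # response time; None models a denied request (assumption)
    compliant: bool              # whether the response complies with the API
    block_distance: int          # freshness: block distance to an external oracle


def availability_status(r: Response, T: float, F: int) -> str:
    """Classify one response given operator-chosen bounds T (ms) and F (blocks)."""
    if r.latency_ms is None or r.latency_ms > T:
        return "UNAVAILABLE"     # not timely, or request denied
    if r.compliant and r.block_distance <= F:
        return "AVAILABLE"       # timely, compliant, and fresh
    return "DEGRADED"            # timely, but not compliant or not fresh


# Hypothetical example with T = 100 ms and F = 2 blocks.
print(availability_status(Response(40.0, True, 0), T=100, F=2))
```

Checking timeliness first mirrors the definition: a response that never arrives in time cannot be inspected for compliance or freshness.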
Given that an individual node response defines the status of the node at one point in time, we consider the node to hold that status until it sends a new response.

B. Architecture of an N-Version Blockchain Node
For some blockchains, there are several compatible client implementations [35]. For example, Ethereum's execution layer has four major implementations. Our goal is to build on this favorable property of blockchains, and apply N-Version design in this context, as shown in Figure 3. We call the resulting construct an N-Version blockchain node.
Definition: An N-Version blockchain node is an ensemble of N sub-nodes and a proxy, where a sub-node is a normal node executing a unique client implementation; and the proxy encapsulates the sub-nodes under a single interface.
Our study focuses on assessing the impact of N-Version nodes on availability. The goal is to demonstrate that an N-Version blockchain node provides higher availability than regular nodes in isolation, especially under unstable execution conditions.
1) Overview: Figure 3 shows a blueprint of the proposed N-Version blockchain node. First, it shows the components presented in subsection II-A: Blockchain network, external application, and peer-to-peer communication. The key novelty in our architecture is the proxy component. This component exposes an interface which encapsulates N blockchain sub-nodes, where each sub-node executes a different implementation of the blockchain protocol. Each request directed to the N-Version node must go through this proxy.
The proxy is responsible for the orchestration of the N-Version node, by routing requests to the sub-nodes depending on a dynamic priority policy and fail-retry mechanisms as explained in subsubsection III-B2. The proxy is also in charge of deciding which response to return to the caller in cases where several responses for a single request are produced (subsubsection III-B3).
2) Dispatching Policy: The presence of sub-nodes is an opportunity to have an adaptive dispatching policy based on their observed behavior. Such a policy allows the system to dynamically adjust to the effects of unstable execution environments, by prioritizing the most available sub-node. This is akin to dynamic load balancing, enabling us to achieve system-wide optimization using the global state of the system [36].
The policy works as follows. We keep an availability score for each sub-node. The score is the percentage of successful responses for all requests sent to that sub-node. The score is updated every time a response is received from a sub-node. With this score, we keep a ranking of the sub-nodes. Every time the scores are updated, the ranking is sorted in descending order of availability score. When the proxy receives a request, it is forwarded to the top sub-node in the ranking. If an AVAILABLE response is returned, said response is sent immediately to the requester. Otherwise, the proxy saves the response, and retries the request on the next sub-node of the ranking. This process is repeated until either one of the sub-nodes responds with an AVAILABLE response, or all of the sub-nodes have provided one response.
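The dispatching loop above can be sketched as follows. This is a minimal Python illustration, not the prototype's Go implementation. It assumes that each sub-node is modeled as a callable returning a `(status, payload)` pair, that only AVAILABLE responses count as successful, and that a sub-node with no history starts with a perfect score; the class and method names are hypothetical.

```python
class Dispatcher:
    """Route each request to sub-nodes in descending order of availability score."""

    def __init__(self, sub_nodes):
        # sub_nodes maps a name to a callable: request -> (status, payload).
        self.sub_nodes = sub_nodes
        self.sent = {name: 0 for name in sub_nodes}
        self.successes = {name: 0 for name in sub_nodes}

    def score(self, name):
        # Share of successful responses; untried sub-nodes score 1.0 (assumption).
        if self.sent[name] == 0:
            return 1.0
        return self.successes[name] / self.sent[name]

    def ranking(self):
        # sorted() is stable, so ties keep the original sub-node order.
        return sorted(self.sub_nodes, key=self.score, reverse=True)

    def dispatch(self, request):
        saved = []  # non-AVAILABLE responses kept for later comparison
        for name in self.ranking():
            status, payload = self.sub_nodes[name](request)
            self.sent[name] += 1
            if status == "AVAILABLE":
                self.successes[name] += 1
                return status, payload
            saved.append((name, status, payload))
        # Every sub-node answered once and none was AVAILABLE.
        return "FALLBACK", saved


# Hypothetical sub-nodes: one always stale, one healthy.
dispatcher = Dispatcher({
    "a": lambda request: ("DEGRADED", "stale block"),
    "b": lambda request: ("AVAILABLE", "fresh block"),
})
result = dispatcher.dispatch("latest block?")
```

After this single request, sub-node "b" moves to the top of the ranking because "a" has recorded one unsuccessful response.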
3) Comparison Oracle: When no sub-node is fully available, there is a need to select the best degraded response to be returned to the external application. For this, the system compares the sub-node responses and sends back the best one according to the following rules. Rule 1: A compliant response is better than a non-compliant response; and Rule 2: Among all compliant responses, the freshest one is the best. Compliance is used as a primary filter, because non-compliant responses can cause undefined downstream behavior. For instance, a response with an incomplete JSON object could trigger unhandled errors on the external caller.
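The two rules can be sketched as a small selection function. This Python snippet is an illustration only: the tuple layout `(compliant, block_distance, payload)` is a hypothetical representation, and the behavior when no response is compliant (returning the first saved response) is an assumption, since the rules above do not cover that case.

```python
def best_degraded(responses):
    """Pick the best response from a list of (compliant, block_distance, payload)."""
    if not responses:
        return None
    # Rule 1: compliant responses are preferred over non-compliant ones.
    compliant = [r for r in responses if r[0]]
    if compliant:
        # Rule 2: among compliant responses, the smallest block distance is freshest.
        return min(compliant, key=lambda r: r[1])
    # Assumption: with no compliant response, fall back to the first one saved.
    return responses[0]


# Hypothetical saved responses from three sub-nodes.
saved = [
    (False, 0, "truncated JSON"),
    (True, 7, "block N-7"),
    (True, 3, "block N-3"),
]
choice = best_degraded(saved)
```

Note that the non-compliant response is discarded even though its block distance is the smallest, which reflects the use of compliance as the primary filter.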

C. Implementation
We fully implement a prototype based on the blueprint architecture described in subsection III-B in the context of the Ethereum blockchain. We call this prototype N-ETH. We use the readily available implementations of Ethereum execution-layer clients GETH v1.12.2, BESU v23.7.0, ERIGON v2.48.1, and NETHERMIND v1.20.1 as the sub-nodes of the system. Each of the chosen implementations has an active community and is open source. In the rest of this paper, we refer to them as Ethereum node versions. The proxy component is written in Go, using networking components from the standard library.

A. Research Questions
To systematically evaluate our architecture and prototype, we propose the following research questions:

RQ1 What are the behavioral consequences of unstable execution environments for blockchain nodes?
Blockchain nodes may behave incorrectly due to unstable execution environments. We aim to identify the effects caused by this instability. To this end, we deploy four Ethereum nodes, each with system-call error injection in place. By varying the fault injection strategies, we observe and record a wide range of irregular behavior that may have an impact on availability.

RQ2 To what extent do different blockchain node versions exhibit different availability rates under unstable execution environments?
To establish a baseline for availability, we perform a quantitative analysis of the identified effects. To obtain this baseline, we deploy single Ethereum nodes, each with a corresponding system-call error injection module. The baseline consists of precise measurements of the availability state of the nodes under increasingly aggressive fault injection strategies. The availability state is recorded as defined in Equation 1.

RQ3 To what extent does an N-version blockchain node increase availability compared to a single-version node?
We argue that an N-Version blockchain node enhances availability properties of the node under unstable execution environments. To measure this improved availability, we deploy our N-Version Ethereum node prototype with attached system-call error injection modules. We measure its availability score with varying N and compare it against the baseline derived from single-version nodes.

B. Deployments
To answer the research questions, we deploy single-version Ethereum nodes and N-ETH nodes, on which we exert fault injection strategies and workloads. When selecting the versions that constitute the N-Version deployments, we can only select among existing Ethereum node versions. According to the Ethereum community, there are 4 implementations that make up virtually all participating nodes in the main network [14]. We use these 4 node implementations as the versions in our deployments.

Single-version deployment:
Figure 4a shows the scope of the single-version deployment. To realize it, we first deploy a particular Ethereum node version and configure it to synchronize with Ethereum's Mainnet. Once synchronized, we attach a fault injection module to the Ethereum node process, as explained below, in subsection IV-D. The data collected from this deployment provides insights into RQ1 and RQ2, i.e., to identify and measure the effects of unstable execution environments on the nodes' availability.

N-Version deployment:
To realize an N-Version deployment, we go through the following steps: First, we create N instances of Ethereum nodes, each coming from a different version. We configure them to synchronize with Ethereum's Mainnet. Second, we deploy an instance of the proxy and connect it with those synchronized nodes as sub-nodes. Third, we attach a fault injection module to each of the sub-nodes. We perform these three steps with N equal to 2, 3, and 4. Figure 4b shows the scope of N-ETH with N = 4: the proxy is connected to 4 sub-nodes, an instance of Geth, an instance of Besu, an instance of Erigon, and an instance of Nethermind. For each value of N, we consider all possible sub-node combinations. In total this adds up to 11 deployments (6 with N = 2, 4 with N = 3, and 1 with N = 4). The data collected from these deployments provides insights into RQ3, i.e., measuring the improved availability under unstable execution environments.
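The sub-node combinations considered for each value of N can be enumerated mechanically. The following Python sketch uses `itertools.combinations` from the standard library; the lowercase version names are placeholders for the four node versions used in this paper.

```python
from itertools import combinations

# Placeholder names for the four Ethereum node versions.
VERSIONS = ["geth", "besu", "erigon", "nethermind"]

# All unordered sub-node combinations for N = 2, 3, and 4:
# C(4,2) = 6 pairs, C(4,3) = 4 triples, and C(4,4) = 1 quadruple.
n_version_deployments = [
    combo
    for n in (2, 3, 4)
    for combo in combinations(VERSIONS, n)
]

for combo in n_version_deployments:
    print(len(combo), "+".join(combo))
```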
In both single-version and N-Version deployments, we record the availability state for each request, which can take three different values: AVAILABLE, DEGRADED, or UNAVAILABLE, as described in subsection III-A. Additionally, we measure the resource consumption of each deployment, to determine the tradeoff between N and any change of measured availability.
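As a hedged illustration of how per-request states can be turned into rates, the sketch below computes the share of each availability state over a list of recorded states. It assumes that a rate is simply the fraction of requests in a given state; the function name and the sample data are hypothetical.

```python
from collections import Counter

STATES = ("AVAILABLE", "DEGRADED", "UNAVAILABLE")


def availability_rates(states):
    """Return the fraction of requests observed in each availability state."""
    if not states:
        raise ValueError("at least one recorded state is required")
    counts = Counter(states)
    return {state: counts[state] / len(states) for state in STATES}


# Hypothetical record of ten requests.
recorded = ["AVAILABLE"] * 8 + ["DEGRADED"] + ["UNAVAILABLE"]
rates = availability_rates(recorded)
```

With evenly spaced requests, as in the workloads used here, the fraction of requests in a state approximates the fraction of time spent in that state.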

C. Workloads
Workloads are exerted on the deployments through a custom component, which acts as the 'external application' component depicted in Figure 4a and Figure 4b. The workloads consist of an arbitrary number and types of JSON-RPC method invocations targeting the deployment. To quantify the availability of a deployment, the workload component records the received responses' conformity, freshness, and latency. We devise two workloads:
Workload A consists of 360 000 JSON-RPC method invocations, where each invocation's method name and parameters are sampled from a pool. This pool contains 21 methods and corresponding parameters, which query both current and past states of the blockchain. The aim of this workload is to discover the widest possible range of availability-related issues induced by our fault injection strategies.
Workload B consists of 360 000 invocations of a single JSON-RPC method, which queries for the latest block available on the target deployment. The aim of this workload is to collect freshness information. It is important to note that while the requests are identical, the responses are expected to change regularly over time as more blocks are added to the chain.
Both workloads are configured to perform each request 5 milliseconds apart. This means that in total, the workloads will perform requests steadily for 30 minutes. Over this time span, the Ethereum blockchain adds 150 blocks to the chain. The selected workload duration and ensuing block distance allow us to observe the effects of unstable execution environments on our deployments.
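The duration and block count above follow from simple arithmetic, sketched here in Python. The 12-second block interval is an assumption based on Ethereum's slot time and is not stated in the text.

```python
REQUESTS = 360_000        # invocations per workload
INTERVAL_MS = 5           # spacing between consecutive requests
SLOT_SECONDS = 12         # assumption: one Ethereum block every 12 seconds

duration_seconds = REQUESTS * INTERVAL_MS // 1000   # total workload duration
duration_minutes = duration_seconds // 60
blocks_added = duration_seconds // SLOT_SECONDS      # blocks produced meanwhile

print(duration_minutes, "minutes,", blocks_added, "blocks")
```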

D. Unstable Environment Simulation
Blockchain nodes are executed on top of an operating system (OS). Consequently, blockchain nodes are susceptible to OS or hardware instability. In environments such as the cloud, single hardware or network faults can propagate to multiple virtual machines [37], affecting multiple blockchain client instances simultaneously. Such instability typically manifests downstream as system-call invocation errors [38]. For example, a read system-call may repeatedly fail with error code -EAGAIN due to a disk malfunction. Previous research shows that high-frequency system-call errors may cause unexpected behavior [39]. In the context of blockchains, it can result in disruption of block transmission and chain synchronization [40]. Additionally, permanent side effects and crashes can also be the result of system-call errors. Therefore, we consider system-call errors as the fault model under which we analyze the degradation of blockchain client APIs. Fault models that can make blockchain nodes unavailable with 100% certainty, such as power outages, are out of the scope of this study.
Since fault injection is regarded as an effective way to test N-Version systems [41], we devise several realistic fault injection strategies (FIs) to apply to blockchain nodes. Figure 5 shows the process used to craft realistic fault injection strategies. It consists of the following steps:
1) For all the Ethereum node versions used in our deployments, we perform a monitoring procedure, where we record all system-calls and system-call return codes. The return codes reveal system-call invocations which fail. Unsuccessful system-calls are frequent even during correct execution of processes, and may be caused, e.g., by temporarily unavailable resources or lost connections.
2) We produce a system-call error profile in the form of a set S of tuples with the form ⟨syscall, err, f⟩, where syscall is the name of a system-call, err is a system-call return code, and f is the frequency with which syscall returns with code err.
3) We aggregate the sets of each analyzed version into a single set, where no pair of syscall and err is repeated and f is the minimum value between any pair that was repeated. The aggregated set is then sorted in descending order based on f.
4) We create n subsets from the aggregated set following a top-n pattern, i.e., subset 1 contains the top-1 tuple from the aggregated set, subset 2 contains the top-2 tuples from the aggregated set, and so on.
5) In all tuples of the resulting sets, f is amplified with an arbitrary factor of 5%. We consider this factor to result in balanced scenarios where sporadic errors are likely to be observed, while the relative frequency of system-call errors is kept.
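The aggregation, sorting, top-n, and amplification steps can be sketched as follows. This Python snippet is an illustration under stated assumptions: each error profile is modeled as a dictionary from a hypothetical `(syscall, err)` pair to its frequency, and the amplification is exposed as a plain multiplier parameter because the text does not specify how the amplification factor is applied. Monitoring itself (step 1) is out of scope of the sketch.

```python
def synthesize_strategies(profiles, factor, n):
    """Build up to n top-k fault injection strategies from per-version error profiles."""
    # Step 3: merge profiles, keeping the minimum frequency for repeated pairs.
    merged = {}
    for profile in profiles:
        for pair, freq in profile.items():
            merged[pair] = min(freq, merged.get(pair, freq))
    # Step 3 (continued): sort by frequency, most frequent first.
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    # Steps 4 and 5: strategy k holds the top-k tuples, with amplified frequency.
    return [
        [(syscall, err, freq * factor) for (syscall, err), freq in ranked[:k]]
        for k in range(1, min(n, len(ranked)) + 1)
    ]


# Hypothetical profiles for two node versions (frequencies are made-up counts).
profile_one = {("read", "EAGAIN"): 40, ("futex", "ETIMEDOUT"): 10}
profile_two = {("read", "EAGAIN"): 20, ("epoll_wait", "EINTR"): 30}
strategies = synthesize_strategies([profile_one, profile_two], factor=2, n=3)
```

Because each strategy extends the previous one with the next most frequent error, the strategies are nested, which is what gives them a clear ordering by aggressiveness.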
After applying these steps, we obtain 20 fault injection strategies with the following attributes: (1) They are realistic, meaning that they only include system-call errors known to occur spontaneously in at least one of the analyzed blockchain nodes; (2) They generate faults uniformly and independently of the deployments' nodes or sub-nodes, and therefore allow comparing resilience between deployments; (3) There is a clear increase of aggressiveness from FI 1 to FI 20. FI 1 contains a single tuple, which is the one with the highest f. We consider the most frequent system-call errors to be handled with high certainty; therefore, we regard FI 1 as the least aggressive strategy. On the other hand, FI 20 contains all 20 observed error tuples. This means that FI 20 is the most aggressive strategy, which triggers the highest number of errors, including the most uncommon ones.
To monitor the Ethereum clients' system-calls and to perform fault injection, we rely on the tool ChaosETH [40].

Fig. 5: Fault injection strategy synthesis.

E. Running Experiments at Scale
For our experiments to be the closest to a real-world setting, we require them to be done over the main network of Ethereum, called "Mainnet". Consequently, all nodes must be fully synchronized with Ethereum's Mainnet. However, synchronizing an Ethereum node on Mainnet can be challenging [42].
First, it requires a significant amount of resources: The selected node versions require from 0.8 TB to 1.2 TB of space on fast storage devices, as well as between 8 GB and 16 GB of RAM. Second, it requires the parallel execution of another kind of blockchain node, known as a consensus layer node. Third, synchronizing an Ethereum node from scratch takes from 10+ hours to several days. The time largely depends on the hardware where the node is executed, and the available bandwidth.
We devise experiments as follows: Regarding single-version deployments, we perform one experiment per Ethereum node version (4), per fault injection strategy (20), per workload (2). Regarding the N-Version deployment, we perform one experiment per fault injection strategy (20), per value of N, and corresponding combinations. Setting up all experiments ultimately amounts to executing and synchronizing 720 Ethereum node instances: 4 × 20 × 2 = 160 for the single-version deployments, and 20 × (6 × 2 + 4 × 3 + 1 × 4) = 560 for the N-Version deployments.
To handle this scale, we implement a cloud pipeline that allows us to replicate nodes efficiently. The pipeline uses a cloud computing setup which provides access to n > 1 SSD devices. The first SSD is reserved for a node instance which is always kept up to date and where no fault injection is performed. We call this instance the source node, and it is synchronized from scratch.
The pipeline then continues by executing three asynchronous procedures, as detailed in Algorithm 1. MAIN starts the initial source node synchronization by calling SYNC_SOURCE, waits for the source node to be up-to-date, and finally calls one instance of RUN_EXPERIMENT for each fault injection strategy. We must keep a source node to later

Algorithm 1: Experiment pipeline

The experiments are executed in Microsoft Azure virtual machines of type L64s v3, each of which provides 64 vCPUs, 512 GB of RAM, and access to 8x 1.8 TB NVMe SSDs. This type of environment fits our use case perfectly, since each instance allows us to simultaneously execute up to 8 Ethereum nodes. The estimated total cost of performing the experiments on the Azure platform is 10 000 USD.

A. What are the behavioral consequences of unstable execution environments for blockchain nodes?
Table I shows all the error types received by an external application while the target blockchain nodes are under fault injection. Each row represents the sum of observed errors for all fault-injection strategies, for each single-version deployment, under workload A. We observe that unstable execution environments have different visible effects on blockchain nodes. Specifically, 17 different types of errors surface on the external application, and are related to either network issues, timeouts, or data corruption. Overall, the most frequent type of error is "connect: connection refused". We determine through log analysis that this error arises when a request is directed towards a crashed deployment. On the other hand, data corruption errors such as "malformed HTTP response", or "unexpected end of JSON", happen very rarely and represent only a small fraction of the total error count.
Regarding the distribution of errors, not all error types have the same frequency in every node version, i.e., there are errors which are common in some versions, but rare in others. For example, the error "Post: Client.Timeout while waiting headers" occurs with all versions; however, its absolute frequency in each deployment varies by orders of magnitude. Furthermore, the error "invalid character in response" is very frequent in the BESU deployment, but is never triggered in the rest of the deployments. Finally, we observe that within the same deployment, the frequency of errors is not distributed uniformly, and the absolute frequencies of errors range from zero to hundreds of thousands.
Table II shows the proportion of RPC calls which triggered an error in the external application. The first column contains the JSON-RPC method names, and from then on, each column presents the results for each deployment under three fault injection strategies and workload A. For instance, the cell at the intersection of "eth_blockNumber", GETH, and FI 18, indicates that 13.8% of the requests for this RPC-deployment-FI combination cause an error observed in the external application.
The bottom row of Table II shows the standard deviation (SD) for each column. Analyzing the SD values, we find that the effects of the fault injection strategies on the different methods of the APIs are highly uniform. The column with the highest SD corresponds to ERIGON under FI 20, with a value of 0.007. This means that the tested fault injection strategies do not have significantly varying effects depending on the measured API method, which is a good indicator of external validity. For simplicity, Table II presents only the results of three fault injection strategies for each node version, the ones with the largest SD. The complete information for all RPC-deployment-FI combinations is available at http://github.com/ASSERT-KTH/N-ETH

Answer to RQ1
Unstable execution environments disrupt the behavior of blockchain nodes in the form of connection issues or broken responses: resets, timeouts, invalid checksums, malformed HTTP or JSON data, etc. These effects depend on the node version: some effects are observed in all versions, while others are version-specific. This validates the core assumption of N-Version blockchain nodes: not all sub-nodes will fail in the same way at the same time in an unstable environment.

B. To what extent do different blockchain node versions exhibit different availability rates under unstable execution environments?
Table III shows the availability rate of all tested node versions while executing Workload B and under all fault injection strategies (FI).Each row corresponds to one FI and the resulting availability rates of each node, the columns correspond to the availability states described in subsection III-A.Regarding full availability, it can be observed that fault injection affects the blockchain nodes in different degrees.There is a pattern where the nodes can handle increasing aggressiveness of the FIs up to a certain point.This pattern is consistent with our way of constructing fault injection strategies by increasing aggressiveness.The first FIs contain the most common systemcall errors, and our observations confirm that they are also better handled.The last FIs use rare and potent systemcall errors and put more pressure on nodes' availability.Nonetheless, the first noticeable degradation varies between nodes: GETH is able to keep high availability under the first 16 FIs, and NETHERMIND and BESU under the first 6.ERIGON presents a different pattern where full availability is slightly disrupted even by the first FI.Regarding degraded availability, we identify that its main source is the disruption of the deployments' live synchronization under the most aggressive FIs.This results in responses that do not fulfill the freshness property.Regarding full unavailability, we do not identify any global pattern other than correlation to the aggressiveness of the FIs.Additionally, for all node versions, FI 17 causes a sharp increase in unavailability.
The results also capture a diversity of effects on the nodes given the same FI. For example, all clients show contrasting behavior under FI 15: while GETH is almost always available, BESU and NETHERMIND are mostly degraded, and ERIGON shows intermittently available behavior. Overall, the data shows that BESU has the highest availability rate on average, followed by GETH, ERIGON, and NETHERMIND, respectively. Nonetheless, GETH has the lowest average unavailability: on average, it replies to 98.1% of the requests with either available or degraded responses (only 1.9% unavailable). Furthermore, NETHERMIND significantly underperforms the rest of the nodes under the four most aggressive FIs when full and degraded availability are combined. The difference in how the fault injection strategies affect the nodes' availability can be explained by the distinct error handling paradigms of the underlying programming stacks.

Answer to RQ2
The availability of blockchain nodes deteriorates noticeably under unstable execution environments. The extent to which this happens varies depending on the node version and fault injection strategy. On average: GETH's availability drops to 0.8486; BESU's availability drops to 0.9113; ERIGON's availability drops to 0.5149; and NETHERMIND's availability drops to 0.3788. Every node version remains available under certain conditions where the others become unavailable. In other words, none of the tested fault injection strategies makes all nodes unavailable simultaneously. This gives us strong evidence that the available diverse blockchain node versions are suitable for our novel N-Version design.

C. To what extent does an N-version blockchain node increase availability compared to a single-version node?
Table IV shows the availability measurements of our N-Version blockchain node prototype N-ETH under unstable execution environments while executing Workload B. It shows the best performing combinations on average for N values of 2, 3, and 4. With N = 2, executing GETH and ERIGON, N-ETH is able to maintain 94.2% full availability, and 99.9978% combined full and degraded availability. With N = 3, executing GETH, BESU, and ERIGON, N-ETH is able to maintain 97.3% full availability, and 99.98% combined full and degraded availability. With N = 4, executing GETH, BESU, ERIGON, and NETHERMIND, N-ETH is able to maintain 98.5% full availability, and 99.9999% combined full and degraded availability. Similar to the results presented in subsection V-B, N-ETH is able to perform normally under the first 16 fault injection strategies. For FI 19, the only strategy with imperfect mitigation, only 0.02% of the requests result in an unavailable response.
Comparing N-ETH's rates with the single-version rates, we see that they are equal or improved under all fault injection strategies. That is, the 'available', 'degraded', and 'unavailable' rates are either better than or equal to those of the best single-version node for all fault injection strategies. More specifically, under the 4 most aggressive FIs, both the 'available' and 'unavailable' rates are strictly better. For instance, in the single-node deployments, BESU is the best at handling FI 20, with scores: 0.6254 available, 0.2887 degraded, and 0.0859 unavailable. In contrast, N-ETH with N = 4 handles the same fault injection strategy with significantly better scores: 0.9210 available, 0.0790 degraded, and 0 unavailable. This represents a difference in scores of: +0.2956 available, −0.2097 degraded, and −0.0859 unavailable. In summary, this is a relative increase of close to 50% in full availability.
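The comparison for FI 20 can be reproduced with a few lines of arithmetic. The sketch below uses only the scores quoted in this paragraph; the variable names are ours. It computes the per-state differences and the relative gain in full availability, which is what the "close to 50%" figure refers to.

```python
# Scores under FI 20, as quoted in the text.
besu = {"available": 0.6254, "degraded": 0.2887, "unavailable": 0.0859}
n_eth = {"available": 0.9210, "degraded": 0.0790, "unavailable": 0.0}

# Absolute difference per availability state (N-ETH minus best single node).
delta = {state: round(n_eth[state] - besu[state], 4) for state in besu}

# Relative gain in full availability with respect to BESU.
relative_gain = delta["available"] / besu["available"]

print(delta)
print(f"relative gain in full availability: {relative_gain:.1%}")
```

The relative gain evaluates to roughly 47%, while the absolute gain is about 29.6 percentage points.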
Table V shows the availability scores of all possible combinations of N-ETH given N = 2, 3, and 4. We observe that availability increases with N, and that increasing N also increases resource usage, as expected. These results show that there are different trade-offs in the combination space. For instance, the combination of GETH and BESU is significantly more available than the combination of GETH and NETHERMIND. For these two combinations, during the experiment using FI 20, the combined full and degraded availability rates are 99.98% and 92.47%, respectively. Table V also shows the variability of resource usage across the different combinations. For example, with N = 2 the combinations vary by an order of magnitude in terms of RAM. This information is useful for users who want to select sub-node combinations with N < 4 because of resource constraints or other reasons.
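To illustrate why some combinations outperform others, the sketch below estimates the availability of a combination from per-request states of its sub-nodes. It assumes that the combined node reports the best state reached by any sub-node for each request, and that the traces are aligned request by request. Both are simplifying assumptions: the scores in Table V come from measuring N-ETH directly, not from this calculation. The traces and node names are invented for illustration only.

```python
from itertools import combinations

# Higher rank means a better availability state.
RANK = {"unavailable": 0, "degraded": 1, "available": 2}


def combine(states):
    """Best state among the sub-nodes for one request (assumed rule)."""
    return max(states, key=RANK.__getitem__)


def combined_rates(traces, nodes):
    """Rates of each state for the given sub-node combination."""
    merged = [combine(s) for s in zip(*(traces[n] for n in nodes))]
    return {state: merged.count(state) / len(merged) for state in RANK}


def rank_combinations(traces, n):
    """All combinations of size n, best full availability first."""
    scored = [(combined_rates(traces, combo)["available"], combo)
              for combo in combinations(sorted(traces), n)]
    return sorted(scored, reverse=True)


# Invented per-request traces; not data from the experiments.
traces = {
    "node_a": ["available", "available", "unavailable", "degraded"],
    "node_b": ["degraded", "available", "available", "unavailable"],
    "node_c": ["unavailable", "unavailable", "degraded", "degraded"],
}
print(rank_combinations(traces, 2))
```

In this toy data, node_a and node_b fail on different requests, so pairing them yields the highest full availability. Pairing either with node_c helps less because node_c is rarely fully available.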
Overall, these experimental results demonstrate that N-Version blockchain nodes provide higher availability than single-version nodes under the same unstable execution environments.

Answer to RQ3
N-Version blockchain nodes have better availability compared to single-version blockchain nodes. Our data shows that the gains are significant, especially for aggressive fault injection scenarios. Compared to single-version blockchain nodes, we observe on average an increase in availability from 84.7% to 98.5%. Notably, the N = 4 deployment reduces full unavailability to a negligible amount. These results validate the overall usefulness of our novel use of N-Version design in the context of blockchains. Our results are of utmost importance for practitioners who either provide (e.g. Infura) or rely on blockchain nodes (e.g. exchanges, banks, and art platforms).

A. Overhead and trade-offs
The main trade-off of N-Version design is the enhancement of a desired property versus increased resource usage. In the case of N-ETH, availability under instability is enhanced significantly at the expense of an increase in computing resources, dictated by the number of versions N, as shown by our results. Yet, access to computing resources is not a major issue for large service providers, since they already run a sizable number of blockchain node instances [43]. Those providers would greatly benefit from higher availability and automatic resilience to hardware or software-related faults.

B. Threats to validity
Threats to internal validity: We identify two sources of noise that can have an effect on the produced data and corresponding findings. First, we use non-deterministic fault injection strategies, which can trigger system-call errors at any time during the experiment and during different stages of execution. Second, we use fully-synchronized blockchain nodes, which implies that every experiment is performed over a blockchain node that uses a different underlying state. We mitigate both sources of noise by using a large number of requests in each of the experiments.
Threats to external validity: We identify two points at which we generalize from the obtained data. First, we generalize the availability metrics obtained from one RPC method (Workload B) to the whole API. We argue that this is realistic, since the effects of unstable execution environments are uniform across the API. External validity would be improved by considering write methods of the API; we leave this as future work. Second, we choose to realize our prototype implementation in Ethereum, as it is a popular, actively supported, and mature blockchain. We argue that our results are generalizable to other blockchain technologies, as they generally follow the same design principles, yet this has to be verified empirically.

VII. RELATED WORK
A. Blockchain Dependability
Kolb et al. [44] survey open challenges regarding blockchain technology, including the need to enhance non-functional properties such as scalability and availability. Weber et al. [45] present a thorough analysis of the availability of major blockchains, and conclude that while read availability is typically high, write availability is low due to uncommitted transactions. While these studies address blockchain availability, they do not account for the effects of unstable execution environments.
In the field of Chaos Engineering, Ma et al. [46] present Phoenix, a system to detect resilience issues in blockchains using context-sensitive fault injection. In contrast to our work, the focus of Phoenix is to discover the causes of unrecoverable states in blockchain nodes.
The state of client diversity in Ethereum is described by Ranjan [47], and is continuously tracked by the Ethereum community [14]. In the area of dependability through diversity, Garcia et al. [48] present Lazarus, a tool for automatic management of diversity in Byzantine fault-tolerant systems. Breidenbach et al. [49] present the Hydra framework, whose goal is to enhance the security of smart contracts using N-Version programming. These works study dependability in closely related application domains; however, their specific focus is different. Lazarus [48] focuses on the systemic-level dependability of BFT systems, and Hydra on the dependability of smart contracts. In contrast, this work is focused on the external availability of nodes of blockchain systems.
Regarding security of blockchains, Chen et al. [50] outline the security of the Ethereum ecosystem, by detailing vulnerabilities, attacks, and defenses. Groce et al. [51] invited 23 professional stakeholders to audit Ethereum smart contracts using both tools and manual analysis, with 246 individual defects identified, categorized based on their severity and difficulty. Groce et al. [52] investigate weaknesses in the Bitcoin Core's fuzzing project. While these works address security, a fundamental attribute of blockchain dependability, availability is not their main consideration.
In the work of Li et al. [53], [43], the effects of certain DoS attacks are measured at the node level. Likewise, Yang et al. [15] and Kim et al. [54] perform differential testing on Ethereum nodes. Their effort led to the discovery of several bugs in the target nodes, which greatly contributed to strengthening the Ethereum blockchain. However, these works' focus is different from ours, namely security, consensus reliability, and response consistency.

B. Software Diversity
Multi-version approaches such as N-Version programming and N-variant systems have been extensively researched and proven to enhance security and fault tolerance [30]. Seminal work from Avizienis et al. [12] introduced N-Version programming, which highlights the opportunities of diverse computation for making fault-tolerant systems. Theoretical analyses and models of N-Version software [55], [56] agree that independence of behavior is crucial for achieving fault-tolerance goals. These works also agree that in practice this independence cannot be guaranteed, even if the versions are developed by distinct teams or using distinct methodologies [57]. Therefore, the applicability of N-Version software has been empirically studied over an expansive range of domains. Within this range we highlight the works of Xue [61], because of their use of natural diversity, as opposed to planned diversity in Avizienis' vision. Similar to N-ETH, the mentioned works leverage domain knowledge to achieve fault-tolerance, availability, and security. However, to the best of our knowledge, this is the first work to propose natural diversity and N-Version design as a means of hardening blockchain infrastructure.
N-Version programming traditionally relies on majority voting to select a response to a request; however, several alternatives have been presented. For example, Vouk et al. [62] propose consensus voting, where a response in an N-Version system is selected only after M < N/2 versions agree to accept a response. Going beyond voting, Gao et al. [63] describe the use of Hidden Markov Models to compute the low-level behavioral distance of versions for anomaly detection. In our work, we propose selecting a final response after only one timely, compliant, and fresh response is produced from any subnode. This approach allows N-ETH to provide high availability.
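The selection rule described above can be sketched as follows. This is a simplified illustration and not the N-ETH implementation: sub-nodes are modeled as plain callables, the compliance and freshness checks are supplied by the caller, and the function name select_response, its parameters, and the default timeout are hypothetical. Handling of degraded (stale) responses is omitted.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def select_response(subnodes, request, is_compliant, is_fresh, timeout_s=1.0):
    """Return the first timely, compliant, and fresh response, or None.

    The request is sent to every sub-node in parallel. No voting takes
    place: a single acceptable response is enough.
    """
    if not subnodes:
        return None
    pool = ThreadPoolExecutor(max_workers=len(subnodes))
    futures = [pool.submit(node, request) for node in subnodes]
    try:
        # as_completed yields futures in completion order, so the fastest
        # sub-node is inspected first.
        for future in as_completed(futures, timeout=timeout_s):
            try:
                response = future.result()
            except Exception:
                continue  # this sub-node failed; wait for the others
            if is_compliant(response) and is_fresh(response):
                return response
    except FuturesTimeout:
        pass  # no acceptable response arrived within the deadline
    finally:
        # Do not wait for slow sub-nodes once a decision is made.
        pool.shutdown(wait=False, cancel_futures=True)
    return None
```

Unlike majority voting, this rule needs only one healthy sub-node per request, which favors availability over cross-checking of responses. The cancel_futures argument requires Python 3.9 or later.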
A similar, but more security-focused concept is N-variant systems [64], [65]. In contrast to N-Version programming, "variants" are automatically generated. Koning and colleagues propose MVARMOR, an N-variant execution engine that exploits hardware virtualization to detect divergent behavior among program variants [66], where behavior divergence is observed at the level of system-calls. Voulimeneas et al. [67] show how N-variant execution can be based on the diversity of Instruction Set Architectures by running programs natively in different machines. Berger et al. [68] propose a runtime system to handle errors through diversified memory layouts. Polinsky et al. [69] propose an extension to N-M-variant systems, where M represents a number of replicas for each variant and guarantees a constant N throughout a period where a variant's instance might be unavailable. N-variant systems increase diversity at low levels in single software stacks. As such, they do not mitigate flaws originating from application design, dependencies, or programming languages, which is N-ETH's explicit goal. Nonetheless, the mentioned N-variant approaches and N-ETH are not mutually exclusive. These can potentially be used in combination, providing an even greater spectrum for resilience by diversity.
The proxy pattern and N-Version design are often studied in combination. For instance, Espinoza et al. [25] describe a design where N-Versioned microservices are placed in between proxies, allowing the system to compare both upstream and downstream request-response pairs. Similarly, Durieux et al. [70] leverage protocol diversity through an HTTP proxy to introduce self-healing for HTML and JavaScript code. These works show that diversity together with proxies can be used to augment targeted aspects of software, e.g. mitigating security concerns or providing automatic code repair. At the intersection of blockchains and N-Version design, proxies are addressed in smart contract resilience. For instance, the Hydra framework [49] describes the entry point of its N-Version smart contracts as a "generic proxy" which delegates incoming transactions to each version. Similarly, Péter et al. [71] propose a proxy between N-Versioned smart contracts and the underlying storage of the Hyperledger Fabric blockchain. These works show that the proxy pattern can be applied at distinct layers of blockchain systems. However, none suits N-ETH's problem statement: high availability in unstable environments. These previous works are different from N-ETH's original solution of RPC routing, response selection, and adaptive proxying tuned for blockchain nodes.

VIII. CONCLUSION AND FUTURE WORK
In this paper, we identify the potential of taking advantage of existing diversity of blockchain node implementations. We devise an architecture that aims to improve the availability of blockchain clients under suboptimal, unstable execution. We implement a prototype based on this design: N-ETH, and evaluate its availability against regular blockchain nodes. To simulate unstable execution environments, we use a system-call error injection tool.
Our findings show that: (1) External applications which consume blockchain nodes' APIs perceive erratic behavior when the target node is under unstable execution environments; (2) The availability of blockchain nodes is affected by the tested unstable execution environments. The severity of the effects on availability scales with the aggressiveness of the fault injection strategies used; and (3) The N-Version blockchain node prototype is able to stay in an available or degraded state under most of the tested unstable execution environments. Additionally, N-ETH presents a drastic reduction in unavailability when compared to common blockchain nodes, which present much larger unavailability windows under the same unstable execution environments. Ultimately, this is the benefit of relying on strong versions that have different weaknesses, as N-ETH mitigates the failures surfacing on specific combinations of node version and fault scenario.
In the presented architecture and prototype, we focus on blockchain node implementation diversity. However, we can identify two other dimensions where diversity is relevant and applicable. First, operating system diversity, where blockchain nodes are executed on top of diverse operating systems. This approach has the potential to enhance OS-related fault tolerance and security. Second, single node diversity, where different versions of the same node are used to detect regressions or errors introduced in newer versions.
In this paper, we focus on improving availability, which is business-critical to external clients and applications. Yet, we envision that N-Version blockchain nodes can enhance dependability attributes other than availability, such as reliability and security. Furthermore, they can be used to enhance performance metrics perceived by external clients, such as latency and throughput, given that different blockchain nodes are based on competing design principles.

FI
Overview of the experimental Ethereum deployments. Each node or sub-node takes 8-16 hours to synchronize with Ethereum's Mainnet. In both figures, the Ethereum logo represents Ethereum's Mainnet, which comprises tens of thousands of nodes.
Stopping the source node's synchronization is necessary, since the state of the node changes constantly with each added block. Keeping the source node constantly synchronized results in the experiments being carried out with the latest production state. Finally, RUN_EXPERIMENT copies the state from the source node's SSD to a newly provisioned SSD, and then starts a deployment which includes the blockchain node, fault injection module, and external application. Then, it starts the experiment's workload. Performing each experiment on a deployment with a clean copy of the deployment's state guarantees that no lingering effects from fault injection are carried from experiment to experiment. This novel pipeline is designed for parallelization, and enables us to carry out the experiments in an acceptable timeframe.
SYNC_SOURCE runs an infinite loop, which starts the source node's synchronization, and pauses it while state copying is in progress. This procedure also restarts the source node's synchronization if no copying actions are in progress.
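The coordination between SYNC_SOURCE and RUN_EXPERIMENT can be sketched as follows. This is a toy model under our own assumptions: no real node, disk, or deployment is involved, the SourceNode class and its methods are hypothetical, and the infinite loop is reduced to a single step function so that the behavior is easy to inspect. The copy, deployment, and workload actions are passed in as callables.

```python
import threading
from contextlib import contextmanager


class SourceNode:
    """Toy stand-in for the source node; it only tracks the sync flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active_copies = 0
        self.syncing = False

    @contextmanager
    def copying_state(self):
        """Pause synchronization for as long as a state copy is running."""
        with self._lock:
            self._active_copies += 1
            self.syncing = False  # the state must not change mid-copy
        try:
            yield
        finally:
            with self._lock:
                self._active_copies -= 1

    def sync_source_step(self):
        """One iteration of SYNC_SOURCE: sync only if nothing is copying."""
        with self._lock:
            self.syncing = self._active_copies == 0
            return self.syncing


def run_experiment(source, copy_state, start_deployment, run_workload):
    """Copy a clean state, start a fresh deployment, run the workload."""
    with source.copying_state():
        disk = copy_state()
    deployment = start_deployment(disk)
    return run_workload(deployment)
```

Counting active copies, rather than using a single flag, lets several experiments copy the state in parallel while synchronization resumes only after the last copy finishes.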

TABLE I
In total, 5 error types are triggered in only one node version. These observations suggest that the type and prevalence of errors are non-coincidental, meaning that the same injected fault triggers vastly different errors, depending on the node version. This is fully in line with the core N-Version design assumption: diverse implementations exhibit diverse errors.

TABLE II :
Error rate per method, client, and fault injection strategy. Workload A.

TABLE III :
Single-version deployments' availability rate for workload B, with varying fault injection (FI) strategies. The node version with the highest availability rate for the FI row is marked with ( ).

TABLE IV :
N-Version node availability for workload B, with varying N and fault injection (FI) strategies. The arrows represent the change of the rates compared to the best single-version node. The table shows only the combinations with the highest availability on average for each N. ( ) indicates the value where the highest gain in availability was achieved.

TABLE V :
N-ETH configurations and their respective Available + Degraded scores, and resource usage measurement. (GE.) GETH, (BE.) BESU, (ER.) ERIGON, (NE.) NETHERMIND. This table only presents the scores achieved under the 3 most aggressive FIs. Resource usage is measured under normal execution.