A Novel Methodology-Based Joint Hypergeometric Distribution to Analyze the Security of Sharded Blockchains

Cryptocurrencies (e.g., Bitcoin and Ethereum), which promise to become the future of money transactions, are mainly implemented with blockchain technology. However, blockchain suffers from scalability issues. Sharding is the leading solution for blockchain scalability. Sharding splits the blockchain network into sub-chains called shards/committees. Each shard processes a subset of transactions, rather than the entire network processing all transactions. This raises security issues for sharding-based blockchain protocols. In this paper, we propose a novel methodology to analyze the security of these protocols (e.g., OmniLedger and RapidChain). In particular, this methodology estimates the failure probability of one sharding round, taking into consideration the failure probabilities of all shards. To illustrate the effectiveness of the estimated failure probability, we conduct a numerical analysis of our methodology based on a large number of trials. Finally, we compute confidence intervals to accurately bound the estimated failure probability, and compare our methodology with existing approaches.


I. INTRODUCTION
In recent years, blockchain, the underlying technology behind digital cryptocurrencies such as Bitcoin [1] and Ethereum [2], has attracted considerable attention from both academia and industry. Blockchain plays a significant role in emerging fields such as the Internet of Things (IoT), healthcare, edge computing, artificial intelligence and the government sector. All these emerging fields benefit from blockchain's decentralization, immutability, robustness, security, transparency and peer-to-peer network that records digital transactions (e.g., cryptocurrency transfers). However, blockchain has a number of open issues, such as scalability [10]. Indeed, scalability is one of the key limitations and the main challenge of blockchain [13]; while traditional centralized payment systems (e.g., Visa [5]) can handle thousands of transactions per second (tx/s), Bitcoin and Ethereum process about 7 and 15 tx/s, respectively. Several solutions to the scalability issue have been proposed in the literature, such as sharding (e.g., Elastico [7], OmniLedger [8], RapidChain [9]), Directed Acyclic Graphs (e.g., [16]), Plasma [14], and the Lightning Network [15]. The most promising solutions to blockchain scalability in the literature make use of sharding [10]. Sharding splits the blockchain network into sub-chains called shards/committees. Each shard processes a subset of transactions, rather than the entire network processing all transactions. This increases the throughput (i.e., the number of transactions per second) of the network. However, sharding may compromise blockchain security. Indeed, for the blockchain to be secure, all shards need to satisfy the committee resiliency (i.e., the maximum percentage of malicious nodes that a shard can tolerate without being compromised); throughout the paper we will use the terms committee and shard interchangeably. In most networks, this resiliency is 33% (e.g., Elastico [7] and OmniLedger [8]); beyond that resiliency, a consensus instance is fundamentally insecure. The critical issue is that even if the whole network falls well under the total resiliency (i.e., the maximum percentage of malicious nodes that the blockchain network can tolerate without being compromised; this limit is 25% in most blockchain networks, e.g., Elastico [7] and OmniLedger [8]), a single shard could still be compromised.

The associate editor coordinating the review of this manuscript and approving it for publication was Jonghoon Kim.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Sharding divides the network into subsets (shards), so that a single shard, rather than the entire network, handles a given set of transactions. The figure depicts a single shard takeover attack (on shard 2 in this case).

Figure 1 shows a scenario in which a network of 20 nodes with 25% malicious nodes (i.e., 5 malicious nodes) is split evenly into 4 shards, and 3 malicious nodes end up in shard 2. This means that 60% of the nodes in shard 2 are malicious, which exceeds the committee resiliency (33%). This is known as a single shard takeover attack. In sharding-based blockchain protocols, the network is compromised if even one shard is compromised (i.e., the 1% attack). In this paper, we analyze the security of sharding-based blockchain protocols. In particular, we compute the failure probability of the whole network by taking into consideration the failure probability of each committee. The key contribution of this paper is a novel methodology that outperforms, in terms of computation accuracy, existing approaches [3], [4], [9]. The limitation of these approaches [3], [4], [9] comes from the fact that they assume that the failure probability of the first committee is indicative of the failure probability of any other committee; more specifically, they assume that the failure probability of one epoch (i.e., a fixed time period, e.g., one day) is the failure probability of the first committee times the number of committees [3], [4], [9]. However, when the sampling that partitions the network into shards is done without replacement, the samples are not independent; when we sample the first committee, the parameterization of the model changes (i.e., the number of nodes in the network, as well as the number of malicious nodes). Thus, the failure probability of the second committee differs from that of the first, the third differs from the first and the second, and so on.
In addition, the changes in the parameter values grow as the sampling proceeds (e.g., the values when sampling the fifth shard differ from the initial values more than the values when sampling the second or third shard do). This means that the inaccuracy of the failure probability estimate proposed in [3], [4], [9] grows with the number of committees. Our methodology computes the real failure probability of each committee, then computes the failure probability of the entire network in one sharding round (aka one epoch), taking into consideration the failure probabilities of all committees. The contributions of this paper can be summarized as follows:
• We develop a probabilistic methodology to analyze the security of sharding-based blockchain protocols. This methodology corrects and outperforms, in terms of accuracy, existing approaches;
• We estimate the failure probability and compute confidence intervals (CIs) in order to lower and upper bound the estimated failure probability;
• We compare the proposed methodology with existing approaches;
• We identify the parameters that impact the security of sharding-based blockchain protocols (e.g., the size of the committee, the number of sharding rounds in a predefined period of time and the number of nodes in the network).
The paper is organized as follows. Section II presents the definitions and notations used in the paper, as well as the details of the proposed methodology. Section III evaluates the proposed methodology. Finally, Section IV concludes the paper.

II. METHODOLOGY
In this section, we propose a methodology to estimate/compute the failure probability of one sharding round.  Table 1 shows the list of symbols/variables that are used to describe the proposed approach.

A. ABBREVIATIONS AND DEFINITIONS
Definition 2 (Committee Resiliency): The maximum percentage of malicious nodes that a committee can contain while still being secure.
Definition 3 (Total Resiliency): The maximum percentage of malicious nodes that the whole network can contain while still being secure.
Definition 4 (Failure Probability): The probability that the number of malicious nodes in the network/committee exceeds the malicious-node limit (i.e., the maximum percentage of nodes that can act in a malicious manner; e.g., in the case of RapidChain [9], this limit is 50% of the nodes in a committee and 33% in the network).

B. HYPERGEOMETRIC DISTRIBUTION
In sharding-based blockchain protocols, the process of assigning nodes to shards can be modeled as sampling without replacement, because the committees do not overlap. When the sampling is done without replacement, we make use of the hypergeometric distribution instead of the binomial distribution [4]; the hypergeometric distribution yields a better approximation than the binomial, especially when the sample size is bigger than 10% of the entire network [4], [12]. Let X_i denote the random variable corresponding to the number of malicious nodes in committee i, and let P(X_i = m_i) denote the probability that committee i contains m_i malicious nodes.
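As a concrete illustration, the single-committee hypergeometric probabilities can be computed directly with Python's standard library (a minimal sketch; the function names and parameter values are ours, not the paper's):

```python
from math import comb

def hypergeom_pmf(N, M, n, m):
    """P(X = m): probability that a committee of size n, drawn without
    replacement from N nodes of which M are malicious, contains exactly
    m malicious nodes."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)

def committee_failure_prob(N, M, n, r):
    """Probability that a single committee exceeds its resiliency r,
    i.e., contains more than floor(r * n) malicious nodes (the upper
    tail of the hypergeometric distribution)."""
    limit = int(r * n)  # maximum tolerable number of malicious nodes
    return sum(hypergeom_pmf(N, M, n, m)
               for m in range(limit + 1, min(n, M) + 1))

# Illustrative values: 1000 nodes, 250 malicious, committee of 100, r = 1/3.
p_fail_one = committee_failure_prob(1000, 250, 100, 1/3)
```

The same tail computation underlies both the RapidChain-style analysis and the per-committee factors of our joint model below.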
We assume that we have a network of N nodes where M nodes (M < N) are malicious. The probability that a node is malicious is p = M/N. We split the N nodes into committees, where each committee has size n = N/λ and λ is the number of committees. When we sample the first committee, the parameterization of the model changes (i.e., N and M); in particular, N changes to N − n and M changes to M − m_1, where m_1 is the number of malicious nodes sampled in committee 1. Then, when we sample the second committee, N − n changes to N − 2n and M − m_1 changes to M − m_1 − m_2, where m_2 is the number of malicious nodes sampled in committee 2. The third committee will have m_3 malicious nodes, and committee λ will have m_λ malicious nodes, such that m_1 + m_2 + . . . + m_λ = M (see Figure 1). The distribution of the first committee can be modeled by the hypergeometric distribution with parameters N, M and n as follows:

P(X_1 = m_1) = C(M, m_1) C(N − M, n − m_1) / C(N, n)

where C(a, b) denotes the binomial coefficient "a choose b". Similarly, the second committee can be modeled by the hypergeometric distribution with parameters N − n, M − m_1 and n as follows:

P(X_2 = m_2 | X_1 = m_1) = C(M − m_1, m_2) C(N − n − (M − m_1), n − m_2) / C(N − n, n)

And for the third committee we get:

P(X_3 = m_3 | X_1 = m_1, X_2 = m_2) = C(M − m_1 − m_2, m_3) C(N − 2n − (M − m_1 − m_2), n − m_3) / C(N − 2n, n)

Finally, the distribution of committee λ (the last committee) can be expressed analogously, with parameters N − (λ − 1)n, M − (m_1 + . . . + m_{λ−1}) and n. The probability density function of X = (X_1, X_2, . . . , X_λ) (i.e., the joint distribution) is the product of these conditional probabilities, given in (7):

P(X_1 = m_1, . . . , X_λ = m_λ) = ∏_{i=1}^{λ} C(M − Σ_{j<i} m_j, m_i) C(N − (i−1)n − (M − Σ_{j<i} m_j), n − m_i) / C(N − (i−1)n, n)    (7)
The distribution in (7) is difficult and complex to compute. Using Theorem 1 (see the proof in the Appendix), (7) can be rewritten as (8):

P(X_1 = m_1, . . . , X_λ = m_λ) = ∏_{i=1}^{λ} C(n, m_i) / C(N, M)    (8)

A simple way to see that the distribution in (7) equals the distribution in (8) is as follows. We have N nodes and we need to pick the M malicious nodes among them; thus, the total number of possibilities is C(N, M). For shard 1, the number of possibilities to place m_1 malicious nodes among n is C(n, m_1); for shard 2 it is C(n, m_2); and for shard λ it is C(n, m_λ). Consequently, the number of possibilities taking into account all shards is the product of C(n, m_i) for i ∈ {1, 2, . . . , λ}. To compute the required probability, we divide this product by C(N, M). To ensure that our methodology is sound, Lemma 1 (see the proof in the Appendix) shows that the probability distribution in (8) is a proper Probability Distribution Function (PDF).
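Theorem 1's rewriting of (7) as (8) can be checked numerically: the chain of conditional hypergeometric probabilities and the compact product form should coincide for any admissible placement of malicious nodes (a minimal sketch; the example values are ours):

```python
from math import comb, prod

def joint_via_chain(N, n, ms):
    """Joint probability as in (7): a chain of conditional hypergeometric
    probabilities, updating the node and malicious-node counts after each
    committee is sampled."""
    N_rem, M_rem = N, sum(ms)
    p = 1.0
    for m in ms:
        p *= comb(M_rem, m) * comb(N_rem - M_rem, n - m) / comb(N_rem, n)
        N_rem -= n   # nodes already assigned to committees
        M_rem -= m   # malicious nodes already assigned
    return p

def joint_compact(N, n, ms):
    """Joint probability as in (8): prod_i C(n, m_i) / C(N, M)."""
    return prod(comb(n, m) for m in ms) / comb(N, sum(ms))

# 4 committees of 5 nodes each; one admissible placement of 5 malicious nodes.
assignment = (3, 1, 0, 1)
p7 = joint_via_chain(20, 5, assignment)
p8 = joint_compact(20, 5, assignment)
```

Both functions return the same value, which is exactly the content of Theorem 1.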
Lemma 1: The probability in (8) is a proper PDF; this means that:

Σ_{m_1 + . . . + m_λ = M} ∏_{i=1}^{λ} C(n, m_i) / C(N, M) = 1

Note that the number of malicious nodes m_1 in shard 1 can assume any of the values n, n − 1, . . . , 0. Similarly, m_2 in shard 2 can assume any of the values n, n − 1, . . . , 0, and so on until the last shard. Thus, the distributions in (7) and (8) represent only one particular outcome. To consider all possible outcomes, we need to compute the joint hypergeometric distribution, which is expressed in (11).
Finally, the failure probability (the probability that at least one committee fails) can be expressed as follows:

f_p = P(∃ i ∈ {1, . . . , λ} : X_i > ⌊rn⌋) = 1 − P(X_1 ≤ ⌊rn⌋, . . . , X_λ ≤ ⌊rn⌋)    (12)

where r is the committee resiliency. Even after the simplification from (7) to (8), the probability in (12) is still complex and difficult to compute, especially when we consider a huge number of nodes. For this reason, in Section III, we estimate this probability instead of computing it exactly.
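For very small networks, the probability in (12) can still be evaluated exactly by enumerating every admissible outcome (m_1, . . . , m_λ) of the joint distribution in (8); a brute-force sketch (the parameter values are illustrative, and the enumeration is only feasible for tiny networks):

```python
from itertools import product
from math import comb, prod

def exact_failure_prob(N, M, n, r):
    """Eq. (12) by brute force: 1 minus the probability, under the joint
    distribution (8), that every committee stays at or below floor(r * n)
    malicious nodes. Assumes n divides N."""
    lam = N // n        # number of committees
    limit = int(r * n)  # per-committee malicious-node limit
    secure = 0.0
    for ms in product(range(n + 1), repeat=lam):
        # Keep only outcomes that place all M malicious nodes and
        # leave every committee within its resiliency.
        if sum(ms) == M and all(m <= limit for m in ms):
            secure += prod(comb(n, m) for m in ms) / comb(N, M)
    return 1.0 - secure

# Tiny illustrative network: 12 nodes, 3 malicious, 3 committees of 4, r = 1/3.
p_fail = exact_failure_prob(12, 3, 4, 1/3)
```

In this toy example, the only secure outcome is one malicious node per committee, so the exact failure probability is 1 − C(4,1)^3 / C(12,3) = 1 − 64/220. The combinatorial explosion of the outer loop is precisely why Section III resorts to Monte Carlo estimation.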

C. EXISTING APPROACHES
In this section, we present existing approaches devoted to analyzing the security of sharding-based blockchain protocols [3], [4], [9]. More specifically, we present Hoeffding's bound, since it is the best bound (in terms of accuracy) proposed in [3], [4], as well as the RapidChain methodology [9].

1) HOEFFDING's BOUND
We present Hoeffding's bound [11] in order to compare it with the proposed methodology. We choose Hoeffding's bound because it is the most accurate bound proposed in [3], [4]. This bound can be expressed as follows:

P(X ≥ m) ≤ exp(−2y²n)

where p = M/N and m = (p + y)n with y ≥ 0. Hence, taking y = r − p, we can bound the failure probability of one committee with resiliency r as follows:

P(X ≥ rn) ≤ exp(−2(r − p)²n)

Hafid et al. [3], [4] compute the epoch failure probability by multiplying the failure probability of one committee by the number of committees λ = N/n. In addition, it is possible to ignore the bootstrap probability (i.e., the probability that the committee election fails in the first epoch), since it is very small (e.g., for RapidChain [9], this probability is smaller than 2^(−26.36)). The epoch failure probability (p_e) can thus be bounded as follows:

p_e ≤ λ exp(−2(r − p)²n)
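The epoch-level bound can be written as a small helper (a sketch under the assumption, stated above, that the one-sided Hoeffding tail exp(−2y²n) with y = r − p is used per committee; the function name and example values are ours):

```python
from math import exp

def hoeffding_epoch_bound(N, n, p, r):
    """Epoch failure bound in the style of [3], [4]: the single-committee
    Hoeffding tail exp(-2 * (r - p)**2 * n), multiplied by the number of
    committees lambda = N / n."""
    lam = N / n
    return lam * exp(-2.0 * (r - p) ** 2 * n)

# Illustrative values: 1000 nodes, committees of 100, p = 0.25, r = 1/3.
bound = hoeffding_epoch_bound(1000, 100, 0.25, 1/3)
```

Note that, being λ times a tail bound, this quantity is not itself a probability and can exceed 1 for small committees, which is exactly the weakness analyzed in Section III.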

2) RapidChain METHODOLOGY
In this section, we present the RapidChain methodology [9] to analyze the security of sharding-based blockchain protocols. Unlike Ethereum sharding [19] and OmniLedger [8], which use the binomial distribution to analyze the security of their sharding-based blockchain protocols, the RapidChain methodology uses the hypergeometric distribution. Note that the binomial distribution does not model the sampling correctly [4]. However, the limitation of the RapidChain methodology comes from assuming that the failure probability of the first committee is the same as that of the other committees; this is because the RapidChain methodology assumes that the failure probability of one epoch (i.e., one sharding round) is the failure probability of the first committee times the number of committees. As reported in Section I, the parameterization of the model changes after each shard is sampled; thus, in practice, each shard has its own failure probability. The failure probability of a committee with resiliency r, using the cumulative hypergeometric distribution, is expressed as follows:

P(X > ⌊rn⌋) = Σ_{x=⌊rn⌋+1}^{n} C(M, x) C(N − M, n − x) / C(N, n)

D. CONFIDENCE INTERVALS
In this section, we investigate the reliability of simulations in estimating the failure probability. For this purpose, we compute confidence intervals in order to lower and upper bound the estimated failure probability. There are several methods to compute confidence intervals, including the Normal approximation interval, the Wilson score interval [17], [18], the Jeffreys interval [17], the Clopper-Pearson interval [17], and the Agresti-Coull interval [17]. A common and popular method is the Normal approximation interval. This method is based on the Central Limit Theorem (CLT); however, it is inaccurate and unreliable when the sample size is small or when the success probability (the failure probability in our case) is close to 0 or 1.
In this paper, we choose the Wilson score interval, since this method has been shown to be the most accurate and the most robust [17], [18]. The Agresti-Coull method also provides good accuracy for larger sample sizes [17].
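The Wilson score interval used in the remainder of the paper can be implemented in a few lines (a sketch; z = 1.96 corresponds to the 95% confidence level used in Section III):

```python
from math import sqrt

def wilson_interval(successes, trials, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (here: observed shard failures over Monte Carlo trials)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = (z / denom) * sqrt(p_hat * (1.0 - p_hat) / trials
                              + z * z / (4.0 * trials * trials))
    return center - half, center + half

# The example from Section III-A: 500 failures in 10,000 trials.
low, high = wilson_interval(500, 10_000)
```

Unlike the Normal approximation, this interval behaves well even when the estimated failure probability is close to 0, which is the regime of interest for secure configurations.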

E. YEARS TO FAIL
To measure the security of a given protocol, we propose to compute the average number of years to failure. To perform this computation, we need the failure probability per sharding round (i.e., per epoch), which is the probability that at least one committee fails. The average number of years to fail corresponding to the proposed methodology is given by:

YTF = 1 / (N_sy × f_p)

where N_sy is the number of sharding rounds per year and f_p is the (estimated) failure probability in (12). The average number of years to fail corresponding to Hoeffding's bound, as well as to the RapidChain methodology, is obtained in the same way, with f_p replaced by the corresponding epoch failure probability.
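Both averages can be computed with the same helper; only the per-epoch failure probability plugged in differs (a sketch, assuming each sharding round is modeled as an independent Bernoulli trial, so the expected number of rounds until failure is 1/f_p):

```python
def years_to_fail(epoch_failure_prob, rounds_per_year):
    """Average number of years until at least one epoch fails:
    expected rounds to failure (1 / f_p) divided by rounds per year."""
    return 1.0 / (epoch_failure_prob * rounds_per_year)

# Illustrative: estimated epoch failure probability 1e-4, 360 rounds/year.
ytf = years_to_fail(1e-4, 360)
```

Fewer sharding rounds per year, or a smaller failure probability, both increase the years to fail, matching the trends in Figures 3 and 6.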

III. RESULTS AND EVALUATION
In this section, we present a simulation-based evaluation of our methodology and compare it with existing contributions, namely Hoeffding's bound [3], [4] and the RapidChain methodology [9].

A. SIMULATION SETUP
To estimate the probability proposed by our methodology (i.e., the probability in (12)), we use the NumPy Python library, which offers mathematical functions, random number generators, etc. In particular, we use numpy.array() to set up an array of M malicious nodes and N − M honest nodes. We also use numpy.random.choice() to distribute these nodes randomly, and without replacement, across shards. Once the nodes are distributed without replacement across shards, we know the number of malicious nodes in each shard. If at least one shard exceeds the limit (committee resiliency), we record 1 (i.e., failure); otherwise, we record 0. Once this procedure is complete, we have one trial/simulation. To cover the possible outcomes (i.e., the possible numbers of malicious nodes in each shard), we need to repeat this trial a large number of times. After repeating this procedure, we sum the recorded values (i.e., 1s and 0s) and divide by the number of trials to get the estimated failure probability. For example, let us assume we executed N_t = 10000 trials and encountered at least one shard failure in each of 500 trials; in this case, the estimated failure probability is:

f̂_p = 500 / 10000 = 0.05

The relation between the exact failure probability (f_p) and the estimated failure probability (f̂_p) is that, by the law of large numbers, f̂_p converges to f_p as the number of trials N_t grows. Table 2 shows the values of the parameters used in the simulations. In Table 2, we assume that the number of malicious nodes in the network is the maximum number of malicious nodes that the network can support (for Elastico and OmniLedger [7], [8], this maximum should not exceed 25% of the entire network); this means that M is 25% of the entire network (M = R × N with R = 25%). Note that we could also assume values smaller than 25% of the entire network. For N, we assume different values of the network size for the purpose of analyzing how the size of the network impacts its security.
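The trial procedure described above can be sketched with NumPy as follows (shuffling the array of node labels is equivalent to the sampling-without-replacement step performed with numpy.random.choice in the text; the parameter values are illustrative and assume n divides N):

```python
import numpy as np

def estimate_failure_prob(N, M, n, r, n_trials, seed=0):
    """Monte Carlo estimate of the one-round failure probability:
    the fraction of trials in which at least one committee contains
    more than floor(r * n) malicious nodes."""
    rng = np.random.default_rng(seed)
    limit = int(r * n)                 # malicious-node limit per committee
    nodes = np.zeros(N, dtype=np.int8)
    nodes[:M] = 1                      # 1 = malicious, 0 = honest
    failures = 0
    for _ in range(n_trials):
        rng.shuffle(nodes)             # random assignment without replacement
        per_shard = nodes.reshape(-1, n).sum(axis=1)
        if (per_shard > limit).any():  # at least one shard compromised
            failures += 1
    return failures / n_trials

# Illustrative run: 1000 nodes, 25% malicious, committees of 100, r = 1/3.
f_hat = estimate_failure_prob(N=1000, M=250, n=100, r=1/3, n_trials=2000)
```

Each pass through the loop is one trial in the sense of the text; dividing the failure count by N_t yields the estimate f̂_p.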
Table 3 shows the estimated failure probability of the proposed methodology when varying the size of the committees (125, 200, 250) as well as the number of trials (10^4, 10^5, 10^6). Table 3 also shows the Wilson score confidence interval, used to compute lower and upper bounds on the estimated failure probability. The Wilson score confidence interval allows us to bound the failure probability with a high confidence level of 95% and a low error rate of 5%; this means that we are 95% confident that the failure probability lies between the lower and upper bounds of the Wilson score CI.

B. RESULTS AND ANALYSIS
In particular, Table 3 shows that as the size of the committee increases, the failure probability decreases. In addition, Table 3 shows that as the number of trials increases, the width of the Wilson score interval gets smaller; this means that, as the number of trials increases, we obtain tighter lower and upper bounds on the failure probability.
It is worth noting that we could not run a very large number of trials due to the limited performance of our personal computer. A fundamental question we need to answer is: ''How does the number of trials influence the estimated failure probability?''. To answer this question, we make use of confidence intervals. Table 3 shows lower and upper bounds on the estimated failure probability using the Wilson score confidence interval. For 1,000,000 trials on a regular computer (i7-2677M CPU, 1.80 GHz, 6 GB RAM), the execution time is 249.84 seconds, i.e., about 4.16 minutes. Table 3 shows that the width of the Wilson score interval gets smaller as the number of trials gets larger; that is, as the number of trials gets larger, we better bound the estimated failure probability. However, when the number of trials gets very large, we would need a supercomputer to estimate the failure probability in a reasonable time. It turns out that we have to make a trade-off between accuracy and computational overhead. Figure 2 compares the estimated failure probability computed using our methodology with that of Hoeffding's bound and RapidChain when varying the size of the committee (100-250) in a network of 1000 nodes. We observe (as expected) that the failure probability decreases as the size of the committee increases. As mentioned in Section I, Hoeffding's bound and the RapidChain methodology can produce ''false'' failure probabilities, since they estimate the failure probability of the first shard and multiply it by the number of shards to get the epoch failure probability. Let us consider an example to show that the existing approaches [3], [4], [9] produce inaccurate results. Let us assume a network that contains N = 1000 nodes, where each shard contains n = 25 nodes. The failure probability (using RapidChain's methodology) for one epoch (one sharding round) is 1.569, which is bigger than 1.
This means that RapidChain's methodology computes ''false'' probabilities. The failure probability (using Hoeffding's bound) for one epoch is 9.118, which is also bigger than 1. In contrast, the proposed methodology computes (with N_t = 100000) 0.99987, which is smaller than 1; it can never assume values greater than 1 because it is a proper probability distribution (see Lemma 1). Table 4 shows the failure probabilities computed by the three methods and the corresponding years to fail. It is worth noting that we deliberately considered a small committee size (i.e., n = 25) to show that the existing approaches compute probabilities bigger than 1 (which is not correct); indeed, the smaller the committee size, the bigger the failure probability. In this example, RapidChain (resp. the approaches in [3], [4]) computes a failure probability equal to 1.569 (resp. 9.118); thus, by decreasing the committee size, we can show that the failure probabilities computed by the existing approaches [3], [4], [9] exceed 1. Finally, it is worth noting that as the computed failure probability grows, the number of years to fail decreases; this means that computing ''false'' probabilities distorts the number of years to fail, which in turn misrepresents the security of the network. Figure 3 compares the estimated years to fail using our methodology with those of Hoeffding's bound and RapidChain's when varying the size of the committee (100-250) in a network of 1000 nodes. More specifically, Figure 3 illustrates that as the size of the committee increases, the number of years to fail increases; this is expected, since when the size of the committee increases, the failure probability decreases (Figure 2), leading to an increase in the number of years to fail. Figure 4 shows the number of trials (varying from 10000 to 1000000) versus the width of the Wilson score confidence interval.
We observe that as the number of trials increases, the width of the Wilson score interval gets smaller; we conclude that as the number of trials gets larger, we better bound the estimated failure probability (as expected). Figure 5 illustrates the number of trials (varying from 10000 to 1000000) versus the running time in seconds in a network of N = 1000 nodes. We observe that as the number of trials increases, the running time increases sharply, due to the limited performance of our machine. From Figures 4 and 5, we conclude that as the number of trials increases, we better estimate the failure probability, but the running time increases sharply. It turns out that we have to make a trade-off between accuracy and computational overhead. Figure 6 shows the number of years to fail for different numbers of sharding rounds per year (N_sy = 180, N_sy = 360, and N_sy = 730) when varying the size of the committee (100-250) in a network of N = 1000 nodes. We observe that as the number of sharding rounds per year decreases, the number of years to fail increases; this means that, as the number of sharding rounds per year decreases, the security of the network increases. We conclude that the number of sharding rounds impacts the security of sharding-based blockchain protocols. Figure 7 shows the failure probability of one sharding round for different network sizes (N = 1000, N = 2000, N = 4000) when varying the committee size (100-250). Specifically, this failure probability is calculated using the proposed methodology with N_t = 1000000 trials. We observe that as the network size increases, the estimated failure probability increases; this means that the size of the network affects its security.
Finally, we identify several factors that impact the security of the network: the size of the committee, the number of sharding rounds per year, and the size of the network. Now, let us determine the combination of values of these factors that achieves the best security. In practice, the network size is a given (i.e., an average), since the blockchain is public (i.e., users can leave/join at any time). However, we can increase/decrease the size of the committee and the number of sharding rounds per year in order to achieve a predefined level of security (i.e., a predefined number of years to fail).
Let us consider some 3D graphs to show the combination that gives us the best security (the biggest number of years to fail). Figure 8 shows the number of years to fail versus the size of the committee (n varying from 10 to 200) and the number of sharding rounds per year (N_sy varying from 45 to 730), with N_t = 100000, in a network of N = 1000 nodes. We observe that n = 192.421 and N_sy = 22.589 give the best combination, achieving the biggest number of years to fail, which is about 2.58026 years. Figure 9 shows the years to fail for different network sizes (N = 2000, N = 3500 and N = 5000) versus the size of the committee (n varying from 10 to 200) and the number of sharding rounds per year (N_sy varying from 45 to 730), with N_t = 10000. We observe three surfaces; the highest corresponds to N = 5000 nodes, followed by the surface that corresponds to N = 3500 nodes, and the lowest corresponds to N = 2000 nodes. We conclude that as the network size increases, the number of years to fail increases, which shows again that the size of the network impacts its security.

IV. CONCLUSION
In summary, this paper proposes a novel methodology to analyze the security of sharding-based blockchain protocols. This methodology corrects the existing approaches [3], [4], [9]. In particular, we estimate the failure probability of the entire network in one sharding round, taking into account the failure probability of each committee/shard. To validate that our methodology yields a better estimate, we compute confidence intervals using the Wilson score method, since it is the most accurate and robust. After estimating the failure probability, we can measure the security of the network by estimating the number of years to fail.

A. PROOF OF THEOREM 1
First, we prove the equality for λ = 2. For λ = 2, we have M = m_1 + m_2 and N = 2n. Let A_2 = P(X_1 = m_1, X_2 = m_2). We need to prove that:

A_2 = C(n, m_1) C(n, m_2) / C(N, M)

We have:

A_2 = P(X_1 = m_1) P(X_2 = m_2 | X_1 = m_1)
    = [C(M, m_1) C(N − M, n − m_1) / C(N, n)] × [C(M − m_1, m_2) C(N − n − (M − m_1), n − m_2) / C(N − n, n)]

Since M − m_1 = m_2 and N − n = n, the second factor equals C(m_2, m_2) C(n − m_2, n − m_2) / C(n, n) = 1. Thus:

A_2 = C(M, m_1) C(N − M, n − m_1) / C(N, n) = C(n, m_1) C(n, m_2) / C(N, M)

where the last equality follows from the symmetry of the hypergeometric distribution (exchanging the roles of the sampled nodes and the malicious nodes). The general case follows by induction on λ.

B. PROOF OF LEMMA 1
We need to prove that the sum of the probability in (8) over all possible outcomes equals 1. By the Vandermonde identity, we have:

Σ_{m_1 + . . . + m_λ = M} ∏_{i=1}^{λ} C(n, m_i) = C(λn, M) = C(N, M)

Thus:

Σ_{m_1 + . . . + m_λ = M} ∏_{i=1}^{λ} C(n, m_i) / C(N, M) = C(N, M) / C(N, M) = 1

This means that P is a proper probability distribution function.