PartitionChain: A Scalable and Reliable Data Storage Strategy for Permissioned Blockchain

Blockchain, a distributed database that maintains a tamper-resistant list of data records, has aroused wide interest and become a hot topic. Nevertheless, the increasingly heavy storage consumption brought by the full-replication data storage mechanism has become a bottleneck to system scalability. To address this problem, a reliable storage scheme named BFT-Store (Qi et al. 2020), integrating erasure coding with Byzantine Fault Tolerance (BFT), was proposed recently. However, three critical problems are still left open: (i) the complex re-initialization process of the blockchain when the number of nodes varies; (ii) the high computational overhead of downloading data; (iii) the massive communication on the network. This paper proposes a blockchain storage scheme with a better trade-off, termed PartitionChain, which addresses the above three problems while maintaining the merits of BFT-Store. First, our scheme allows the original nodes to merely update a single aggregate signature (e.g., 320 bits) when the number of nodes varies. Using aggregate signatures as the proof of the encoded data not only saves storage costs but also removes the trusted third party. Second, compared to BFT-Store, the computational complexity of retrieving data by decoding is reduced by a factor of about 2^18 on each node.
Third, the amount of transmitted data for recovering each block is reduced from O(n) (where n is the number of nodes) to O(1), by partitioning each block into smaller pieces and applying Reed-Solomon coding to each block. Furthermore, this paper also introduces a reputation ranking system in which the malicious behaviors of nodes can be detected and marked, enabling PartitionChain to check the credit of each node periodically and expel misbehaving nodes. Compared with BFT-Store, our scheme allows the blockchain system to suit dynamic networks with higher efficiency and scalability.


INTRODUCTION
Blockchain, first proposed in 2008 as a solution to the double-spending problem in a purely peer-to-peer version of electronic cash in [2], [3], has aroused interest in various fields in recent years. Generally, blockchain is known as a special shared database that integrates cryptography, peer-to-peer networking and consensus mechanisms, in order to maintain the integrity of data and prevent fraud [4], [5]. The blockchain technology has recently been applied in a wide range of popular scenarios, including the Internet of Things [6], [7] and electronic invoices [8], and accordingly the amount of data stored on the chain increases constantly. Based on the latest statistics shown in Fig. 1, the size of the Bitcoin blockchain, for instance, exceeded 340 gigabytes in June 2021 and is still growing rapidly. In order to ensure data consistency, most nodes in existing blockchain systems are required to store all the historical data. Under this full-replication storage strategy, the storage overhead per block is O(n), where n is the number of nodes in the system. Namely, the storage overhead of each block increases linearly with the number of nodes, which leads to a huge storage cost and may even exceed the storage capacity of a single node when the system is extended and more nodes join it. Apparently, the high-speed growth of the storage consumed to maintain the complete ledger [9] is becoming a barrier to system scalability.

Related Work
When the concept of blockchain was first formed in [2], blocks were assumed to be generated every 10 minutes. Apparently, the storage consumed by the data on the chain will increase continuously, which impedes system scalability. Thus, some solutions were offered along with the proposal of blockchain. One primary method is to reclaim disk space on the nodes, in which the previous transactions are discarded [2]. To achieve this without breaking the hash chain of blocks, Satoshi et al. [10] suggest hashing the transactions in a Merkle tree [11], [12], which allows each block to include merely the root hash of the tree in its block header. Through this method, the size of each block header is roughly decreased to 200 bytes; meanwhile, the nodes can still verify payments through the root hash. To further reduce the storage consumption, a new kind of node termed light-node is introduced in [13], [14]. In contrast to the traditional full-nodes, which preserve the complete blocks, the light-nodes merely store the block headers.
In addition to the above, other technologies and mechanisms have also been introduced recently to alleviate the storage bloating problem. Dimakis et al. [15] suggest applying erasure codes and network coding to distributed storage [16], [17]; thus, the intermediate nodes merely store the previously-received input data and generate the output data by computing certain functions of the input. Moreover, utilizing micropayment channels [18], [19] seems to be another feasible solution: a mass of micropayments can be offloaded from the chain to a trusted custodian, and only the latest results after these micropayments are stored on the chain. In addition, the recent works [20], [21] introduce the lightning network, which moves micropayment transactions off-chain, so that only the final results of the transactions are preserved on the chain. However, the aforementioned schemes do not fundamentally overturn the full-replication storage strategy. In order to avoid the adverse effects of the full-replication strategy while guaranteeing the availability of all the blocks, a reliable storage scheme named BFT-Store for permissioned blockchain was proposed recently in [1].
BFT-Store combines RS (Reed-Solomon) coding and the PBFT (Practical Byzantine Fault Tolerance) protocol, so as to reduce the storage consumption per block from O(n) to O(1) with respect to the number of nodes n, by enabling each node in the system to store just a part of the coded blocks (dubbed chunks) rather than the whole chain. BFT-Store adopts an (n - 2f, n)-RS schema, which encodes every n - 2f original blocks into n chunks by adding 2f redundant blocks (dubbed parities), and all the original blocks can be recovered from any n - 2f chunks. After encoding, these n chunks are distributed to the n corresponding nodes through specific algorithms. Since there are at most f Byzantine nodes and f crashed honest nodes in a system with n ≥ 3f + 1 nodes based on the precondition of PBFT [22], each node is guaranteed to receive at least n - 2f correct chunks while decoding, ensuring data availability.
BFT-Store also introduces an online re-encoding process. Assume that there are n nodes in the system and the blocks are encoded by (n - 2f, n)-RS. Once a new node joins the system, the number of nodes increases to N = n + 1, and the threshold of malicious nodes becomes F = ⌊(N - 1)/3⌋ accordingly. Therefore, the RS schema must be adjusted to (N - 2F, N)-RS, leading to an online re-encoding process: the whole system broadcasts and downloads chunks to recover the original blocks, and then re-encodes them under the updated schema. The re-encoding process for the removal of nodes is similar.
Undoubtedly, BFT-Store [1] provides an innovative storage partition, subverting the full-replication strategy for permissioned blockchain; however, three critical problems remain in this scheme: 1) Complex re-initialization of the blockchain. Each time the number of nodes n changes, all the nodes have to participate in a blockchain update process. To re-initialize the system, each node broadcasts and downloads all the chunks for recovery and then re-encodes the recovered blocks under the updated RS schema. 2) High computational complexity of coding. Before replying to a client with the data, a node in BFT-Store needs to decode the requested chunk first, and thus the computational complexity of decoding becomes a predominant factor influencing the response time and system throughput. In BFT-Store, the computational complexities of decoding and encoding per block are O(T^2 · n^3) and O(T^2 · n^2) respectively, where T is the assumed size of a block. Obviously, frequent decoding and encoding greatly increase the computational overheads and impair the performance of the nodes. 3) Massive communication on the network. BFT-Store encodes every n - 2f blocks into n chunks through RS code. Therefore, each time a block needs to be recovered, at least n - 2f blocks are transmitted through the network. Namely, the amount of communication over the whole network is O(n), which will impair the performance of the network due to the specifics of P2P networks [23] as more nodes join the system.

Our Contribution
In addition to the aforementioned implementation problems, BFT-Store also leaves two points open: 1) Reliance on a trusted third party. BFT-Store employs TS (Threshold Signature) to verify the correctness of each chunk [1]. However, most existing TS schemes depend on the attendance of a TTP (Trusted Third Party) or a center node termed the dealer [24], [25], which contradicts the decentralization of blockchain. Relying on a TTP will probably incur unexpected data damage and unavoidable extra costs [17], [26]. Additionally, it is troublesome in practical application scenarios to appoint a third party who is completely trusted by all the participants. 2) Lack of measures against malicious behaviors. In BFT-Store, the nodes merely check the correctness of received chunks, ignoring forged messages and dishonest nodes. That is, the nodes can exhibit a series of dishonest behaviors without any probable costs or punishment. Apparently, BFT-Store lacks an effective measure to detect malicious behaviors and remove dishonest nodes. To address the above, we propose a better storage partition scheme named PartitionChain, which optimizes the performance of the system and the re-initialization process while maintaining the other advantages of BFT-Store. To avoid reliance on a TTP, we adopt an aggregate signature scheme instead of Threshold Signature to guarantee the correctness of the encoded data. Moreover, we design an audit mechanism for detecting dishonest behaviors and punishing malicious nodes. The major advantages and functionalities of PartitionChain are listed as follows: 1) PartitionChain maintains the advantage of BFT-Store that it reduces the storage overhead of each block to O(1). Moreover, when the number of nodes changes, it relieves the system overload by optimizing the application of RS coding.
For example, when a fresh node joins the system, PartitionChain enables the original nodes merely to update their aggregate signatures, rather than requiring all the nodes to update both their chunks and signatures. Thus, the computation consumed by system re-initialization can be greatly reduced. 2) In order to reduce the computational complexity, PartitionChain partitions each block into smaller pieces, according to the size of a single block, before encoding. Through this method, the computational complexity of decoding per block declines from O(T^2 · n^3) to O(Tc · n^2), where c is the fixed size of each piece of an original block. Similarly, the computational complexity of encoding per block is reduced from O(T^2 · n^2) to O(Tc · n). 3) In PartitionChain, we apply the RS encoding polynomial to each block rather than to every n - 2f blocks, to reduce the amount of communicated data. Thus, when a block needs to be recovered, each node merely broadcasts the encoded pieces of the original block. The size of each piece is limited to c = 1024 bits in this paper. Therefore, PartitionChain decreases the amount of communicated data of the whole system from O(n · T) in [1] to O(T), assuming the size of each original block is T. 4) In contrast to the TS adopted in BFT-Store, PartitionChain adopts CLAS (Certificateless Aggregate Signature) [27] to verify the correctness of the stored information. After each participant obtains his partial private key from a KGC (Key Generation Center), the whole system can completely eliminate the dependence on any third party. In addition, even the KGC does not know the secret keys of the participants, and thus an adversary has no chance to steal the keys by attacking the KGC. 5) PartitionChain introduces an audit mechanism that detects malicious behaviors and expels dishonest nodes. PartitionChain checks the credits of the nodes periodically, and the nodes whose credits are below a preset lower bound are removed from the system.
The comparison of the performance and functionalities between PartitionChain and BFT-Store is summarized in Table 1.

Organization
The rest of this paper is organized as follows: Section 2 describes the assumptions of the model, the tools utilized in PartitionChain and the goals of this paper. Section 3 presents the functionalities and algorithms of PartitionChain. Section 4 analyzes the performance of PartitionChain and Section 5 presents the corresponding experimental evaluations. Section 6 concludes this paper.

Assumptions
PartitionChain is an optimized combination of PBFT and RS coding, applied in permissioned blockchains, where a group of identifiable participants have a common goal but do not fully trust each other [28], [29]. In view of the characteristics of permissioned blockchains and the preconditions of PBFT, we set two assumptions for PartitionChain as follows: 1) Malicious nodes in the system may randomly exhibit a series of bad behaviors (e.g., keeping silent, or sending conflicting messages to different nodes) in order to prevent the system from reaching consensus. Additionally, honest nodes may keep silent due to malfunctions. In order to tolerate these two kinds of failures, we assume that in a system containing n nodes, at most f malicious nodes may behave dishonestly and at most f honest nodes may fail to work at the same time, where f = ⌊(n - 1)/3⌋. 2) PBFT provides both safety and liveness [22]. It can provide safety in an asynchronous system, while it must rely on synchrony to provide liveness. Therefore, we build a partially synchronous model in which a sent message may be delayed for some reason but is bound to be received by its destination within a certain delay. In addition, all the channels in the model are assumed to be secure. This partially synchronous model is feasible in real systems, as long as network faults can eventually be repaired [22].

Tools
Reed-Solomon Coding. RS coding [30], [31] is a widely-applied EC (erasure coding). Assume that F_q is a finite field, and a_0, ..., a_{n-1} are n distinct elements (dubbed evaluation points) from F_q. For an original message m = (m_0, m_1, ..., m_{k-1}), (k, n)-RS generates a polynomial which is evaluated at the n distinct evaluation points respectively, where k ≤ n ≤ q. A participant can recover the original message so long as it receives any k of the n values. The corresponding RS polynomial f_m(X) of degree k - 1 is defined as

f_m(X) = m_0 + m_1·X + ... + m_{k-1}·X^{k-1}.

Accordingly, the original message fragments are mapped through RS coding to the n values f_m(a_0), f_m(a_1), ..., f_m(a_{n-1}). Each participant N_i in the system computes and then preserves its own value f_m(a_i), evaluated at the assigned evaluation point a_i. When receiving a request for the original message from a client, a participant just needs to collect k different values f_m(a_0), ..., f_m(a_{k-1}) from other participants for decoding. Substituting the k received values and the k corresponding evaluation points a_0, ..., a_{k-1} into the RS polynomial yields a k-element linear equation set:

m_0 + m_1·a_0 + ... + m_{k-1}·a_0^{k-1} = f_m(a_0)
m_0 + m_1·a_1 + ... + m_{k-1}·a_1^{k-1} = f_m(a_1)
...
m_0 + m_1·a_{k-1} + ... + m_{k-1}·a_{k-1}^{k-1} = f_m(a_{k-1}).

The participant solves this equation set, and the solutions (m_0, ..., m_{k-1}) are exactly the original message fragments. For a k-element linear equation set, the time complexity of Gaussian elimination [32] is O(l^2 · k^3), where l is the assumed size of each element in the equation set. Through RS coding, each participant merely needs to preserve its own value instead of the complete message, while guaranteeing the same reliability as the full-replication strategy.
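As a concrete illustration of the (k, n)-RS coding described above, the following Python sketch encodes k message fragments as the coefficients of f_m(X), evaluates the polynomial at n points, and recovers the fragments from any k evaluations. The prime p and the evaluation points 1..n are illustrative choices, not the parameters used in PartitionChain, and Lagrange interpolation is used instead of Gaussian elimination only for brevity (both solve the same k-element system).

```python
# A minimal sketch of (k, n)-RS coding over GF(p). The k message fragments
# are the coefficients of f_m(X); participant i stores only f_m(a_i); any k
# evaluations recover the fragments. p and the points are toy choices.

p = 2**13 - 1  # small illustrative prime; not the field used in the paper

def rs_encode(msg, n):
    """Evaluate f_m(X) = m_0 + m_1*X + ... + m_{k-1}*X^{k-1} at points 1..n."""
    return [sum(m * pow(x, j, p) for j, m in enumerate(msg)) % p
            for x in range(1, n + 1)]

def rs_decode(pairs, k):
    """Recover (m_0, ..., m_{k-1}) from any k (point, value) pairs by
    Lagrange interpolation, i.e., by solving the k-element linear system."""
    pairs = pairs[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(pairs):
        basis, denom = [1], 1              # l_i(X) as a coefficient list
        for j, (xj, _) in enumerate(pairs):
            if j == i:
                continue
            denom = denom * (xi - xj) % p
            new = [0] * (len(basis) + 1)
            for t, b in enumerate(basis):  # multiply basis by (X - x_j)
                new[t] = (new[t] - xj * b) % p
                new[t + 1] = (new[t + 1] + b) % p
            basis = new
        scale = yi * pow(denom, p - 2, p) % p  # Fermat inverse of denominator
        for t, b in enumerate(basis):
            coeffs[t] = (coeffs[t] + scale * b) % p
    return coeffs

# Any k = 3 of the n = 7 evaluations recover the original fragments:
fragments = [7, 11, 13]
pairs = list(zip(range(1, 8), rs_encode(fragments, 7)))
assert rs_decode(pairs[2:5], 3) == fragments
```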
Practical Byzantine Fault Tolerance Protocol. PBFT is widely adopted as a consensus protocol in permissioned blockchain systems and has been shown to perform well in [33]. Based on the precondition of PBFT [22], it can guarantee the consistency of information, provided that there are at most f = ⌊(n - 1)/3⌋ malicious nodes causing Byzantine failures in the system.
In PBFT, a newly-generated block is committed in a view which contains a primary node (selected in round-robin order [34] or by other methods) and several backup nodes. After a client sends transactions, which are packaged as a new block, consensus on this block is reached through three phases called Pre-prepare, Prepare and Commit respectively [1], [35]. In the Pre-prepare phase, the primary node proposes the new block and broadcasts a signed pre-prepare message to all the backups after receiving a request from a client. Once a backup node accepts the pre-prepare message, it enters the Prepare phase and broadcasts a signed prepare message to all the nodes in the view. If a node receives 2f different prepare messages that match its accepted pre-prepare message, it becomes prepared and broadcasts a signed commit message to the others. Again, if a node receives 2f + 1 different commit messages matching its accepted pre-prepare and prepare messages, it commits the newly-proposed block and replies to the client. The block is considered verified and can be appended to the blockchain as soon as the client receives f + 1 identical replies from different nodes in the view.
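The quorum sizes checked by the three phases above can be summarized in a small sketch; it tracks message counts only, with contents, signatures and view changes elided.

```python
# Quorum thresholds of the PBFT phases described above, for a system of
# n nodes tolerating f = floor((n - 1) / 3) Byzantine nodes.

def pbft_thresholds(n):
    f = (n - 1) // 3   # maximum number of tolerated Byzantine nodes
    return {
        "prepared":  2 * f,      # matching PREPARE messages to become prepared
        "committed": 2 * f + 1,  # matching COMMIT messages to commit a block
        "client":    f + 1,      # identical replies the client must collect
    }
```

For instance, with n = 100 nodes we get f = 33, so a node becomes prepared after 66 matching prepare messages, commits after 67 matching commit messages, and the client accepts after 34 identical replies.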
Certificateless Aggregate Signature. The CLAS scheme [27] is proposed on the basis of the Schnorr signature [36]. In contrast to identity-based public key cryptography (ID-PKC) [37] and certificateless public key cryptography (CL-PKC) [38], CLAS enables participants to generate their complete private keys and public keys by themselves. PartitionChain adopts CLAS to verify the correctness of the evaluations stored on the nodes, for the following reasons. First, instead of relying on a TTP to update the secret keys of the participants and the system parameters for signing each time the number of participants n varies, as in TS [1], [39], CLAS merely requires the KGC to provide a partial private key to each participant. Once the public key and complete private key of a participant are generated, they remain invariant whether n changes or not, achieving total decentralization [27]. Second, CLAS reduces the total signature length (e.g., the size of a CLAS is always under 320 bits) as well as the computational cost of signature verification [40].
In general, for n certain users u_1, u_2, ..., u_n whose identities are ID_1, ID_2, ..., ID_n, with n corresponding public keys pk_1, pk_2, ..., pk_n and some state information d, the CLAS scheme can compress the n signatures σ_1, σ_2, ..., σ_n on the given messages m_1, m_2, ..., m_n into a single aggregate signature σ [27]. CLAS also provides a verification algorithm, which takes the user identities, the related public keys, the original messages and the same state information d of the signing process as inputs, and then outputs whether the aggregate signature σ is valid.
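To illustrate the core idea of compressing many signatures into one short aggregate, the following is a simplified Schnorr-style multi-signature sketch over a toy-sized group. It is emphatically not the CLAS scheme of [27]: there is no KGC, no partial private key, and no state information d; it only shows how n Schnorr signatures (R_i, s_i) can be aggregated into a single scalar that verifies against all public keys at once. The group parameters and deterministic nonce are illustrative assumptions.

```python
# Simplified Schnorr-style signature aggregation (NOT the CLAS of [27]).
# Toy Schnorr group: p = 2q + 1 with q prime, g of prime order q mod p.
import hashlib

q = 1019                 # prime subgroup order (toy size)
p = 2 * q + 1            # safe prime 2039
g = 4                    # quadratic residue, hence of order q

def H(*parts):
    data = "|".join(str(x) for x in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

def sign(x, m):
    """Schnorr signature (R, s) on message m under secret key x."""
    r = H("nonce", x, m)             # deterministic nonce, for the sketch only
    R = pow(g, r, p)
    c = H(R, m)
    return R, (r + c * x) % q

def aggregate(sigs):
    """Compress the s-components into one scalar: s_agg = sum(s_i) mod q."""
    return sum(s for _, s in sigs) % q

def verify_aggregate(pks, msgs, Rs, s_agg):
    """Check g^s_agg == prod(R_i * pk_i^H(R_i, m_i)) mod p."""
    lhs = pow(g, s_agg, p)
    rhs = 1
    for y, m, R in zip(pks, msgs, Rs):
        rhs = rhs * R * pow(y, H(R, m), p) % p
    return lhs == rhs
```

The aggregate is a single element of Z_q regardless of how many signatures were combined, which is the property PartitionChain exploits when storing one constant-size proof per evaluation set.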

Goals
PartitionChain inherits the fundamental advantage of BFT-Store of reducing the storage consumption per block to O(1). Besides, this paper also aims to solve the open problems left by BFT-Store and to add the new functionalities illustrated in detail in Section 1. According to these requirements, PartitionChain is designed on the basis of the following aims: 1) Availability. So long as there are at most f = ⌊(n - 1)/3⌋ malicious nodes in the system, PartitionChain can guarantee that any original block can be recovered by all the non-faulty nodes (i.e., honest nodes which work properly without any malfunction) through decoding, so that a client can request data successfully at any time.
2) Scalability. The storage consumption per block of PartitionChain remains O(1) as the number of nodes n increases. In addition, we attempt to reduce the computational cost of system re-initialization when the number of nodes varies. 3) Efficiency. Due to the extra calculation and communication in data recovery compared to the traditional full-replication strategy, the response time of a client's request to read a block will certainly increase [1]. This paper aims to lower the computational complexity of decoding and encoding and to minimize the loss of read performance. 4) Validity. Byzantine nodes exist in permissioned blockchains, randomly behaving dishonestly to impede the system consensus. On the premise of achieving availability and scalability, the validity of the system should be ensured and the number of malicious nodes should be reduced to a certain extent.

Notations
For clarity of illustration, we first present in Table 2 the principal notations mentioned in the following sections and their descriptions.

Block Partition
In BFT-Store, at least n - 2f complete blocks are transmitted over the network whenever a node needs to recover an original block, which brings about massive communication over the network. In order to relieve the communication overheads, we partition each original block before encoding in PartitionChain. The number of partitioned pieces is determined by the block size together with the number of nodes in the system. Each node preserves encoded pieces of every original block rather than the entire chunk, hence realizing lower communication and storage consumption. The computational complexities of RS encoding and decoding are O(c^2 · n^2) and O(c^2 · n^3) respectively, where c is the assumed size of the pieces of an original block. Intuitively, the size of the pieces is a dominant factor influencing the overall throughput of the system. Thus, we limit the size of each piece to c = 1024 bits, to avoid heavy computational overloads on the nodes and improve system performance. Meanwhile, in order to guarantee the availability of the blocks, we apply (n - 2f, n)-RS to each original block based on the fault tolerance of PBFT [1]. In existing permissioned blockchain systems, the number of nodes n is generally under 100, and the size of a single block is about 1 MB; predictably, partitioning once is insufficient to bound the size of each piece to 1 Kb. Therefore, we partition each block twice before encoding in PartitionChain. For clarity, we term the pieces generated in the first partitioning primary-pieces and the pieces generated in the second partitioning second-level-pieces.
For an original block B(h) of size T, where h is the unique hash value of the block, we first determine the number of primary-pieces. According to PBFT, the number k of second-level-pieces partitioned from each primary-piece should equal the lower bound (i.e., n - 2f) of the number of honest nodes in the system. Assuming that we partition B(h) into r primary-pieces, r can be deduced from the following equation:

r = ⌈T / (k · c)⌉ = ⌈T / ((n - 2f) · c)⌉.

Here, we take T = 1 MB = 2^23 bits, f = 33 and n = 3f + 1 = 100 as an example, so k = 34 and r = ⌈2^23 / (34 · 2^10)⌉ = 241 ≈ 2^8. Then, as shown in Fig. 2, each block is partitioned by the following two steps in PartitionChain:

1) Partition an original block B(h) into r primary-pieces B(h) = {B_1(h), B_2(h), ..., B_r(h)}. In our example, r approximately equals 2^8, and thus the size of each primary-piece is 2^15 bits. To further reduce the size of the primary-pieces, we need to partition each of them again into smaller second-level-pieces.
2) Partition each primary-piece B_i(h), where 1 ≤ i ≤ r, into k = n - 2f smaller second-level-pieces, which are represented as the green ones in Fig. 2. In our example, k = 34 ≈ 2^5; thus, the size of each second-level-piece is reduced to about 2^10 bits.
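The partition arithmetic of the running example above can be sketched as follows; the ceiling-based rounding is an illustrative choice for handling sizes that do not divide evenly.

```python
# Partition parameters of the running example: T = 1 MB, n = 100, f = 33,
# with each second-level-piece bounded by c = 1024 bits.

T = 2**23                  # block size: 1 MB expressed in bits
n, f = 100, 33
k = n - 2 * f              # second-level-pieces per primary-piece (k = 34)
c = 1024                   # upper bound on a second-level-piece, in bits

r = -(-T // (k * c))                  # primary-pieces: ceil(T/(k*c)) = 241 ~ 2^8
primary_size = -(-T // r)             # bits per primary-piece, ~ 2^15
second_size = -(-primary_size // k)   # bits per second-level-piece, ~ 2^10
```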

RS Encoding
After partitioning, we select a sufficiently large prime p (e.g., p greater than 2^1024 in our example), and then convert every binary-stored second-level-piece b_{i,j}(h) (1 ≤ i ≤ r and 1 ≤ j ≤ k) into an element of GF(p).
In the field GF(p), each partitioned primary-piece B_i(h) = {b_{i,1}(h), b_{i,2}(h), ..., b_{i,k}(h)} is mapped by an (n - 2f, n)-RS code (in our example a (34, 100)-RS code) to a degree-(n - 2f - 1) polynomial f_{h,i}(X), defined as

f_{h,i}(X) = b_{i,1}(h) + b_{i,2}(h)·X + ... + b_{i,k}(h)·X^{k-1},

where X ∈ {a_0, ..., a_{n-1}}. As Fig. 2 presents, the k second-level-pieces of a primary-piece are mapped to n specific evaluations, obtained by substituting the corresponding evaluation points into the above polynomial, and these evaluations are then distributed to the n nodes in the system. In order to determine a distinct evaluation point for each node, we sort all the nodes in the system according to their public keys and take the position s of a node in this order as its corresponding evaluation point. That is, after encoding, the node N_s whose index is s preserves the specific evaluation f_{h,i}(s) for the primary-piece B_i(h). It can be seen from Fig. 2 that, for a block B(h) which is divided into r primary-pieces, each node N_s computes the r generated polynomials and preserves the corresponding evaluation set. For a chain of t blocks, t · r degree-(n - 2f - 1) polynomials are generated in total; thus, a node N_s in the system maintains its own t evaluation sets, which can also be presented as the t · r evaluations f_{h_1,1}(s), ..., f_{h_1,r}(s), ..., f_{h_t,1}(s), ..., f_{h_t,r}(s).
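The evaluation-point assignment described above can be sketched as follows; the public keys, the field prime and the piece values are all toy examples, with each node's rank in the public-key order serving as its evaluation point.

```python
# Nodes sorted by public key; rank s is the evaluation point, and node N_s
# stores f_{h,i}(s) for the primary-piece. All values here are toy examples.

p = 2**13 - 1   # toy field prime; the paper works in GF(p) with p > 2^1024

def evaluate(coeffs, x):
    """f_{h,i}(x) = b_{i,1} + b_{i,2}*x + ... + b_{i,k}*x^{k-1} in GF(p)."""
    return sum(b * pow(x, j, p) for j, b in enumerate(coeffs)) % p

pubkeys = {"N_a": 0x3F21, "N_b": 0x11AA, "N_c": 0x77DE}
rank = {node: s for s, node in enumerate(sorted(pubkeys, key=pubkeys.get), 1)}

piece = [9, 2, 5]   # one primary-piece as k = 3 second-level-pieces in GF(p)
stored = {node: evaluate(piece, s) for node, s in rank.items()}
```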

Certificateless Aggregate Signature
Considering the existence of Byzantine nodes [41], [42], which may behave arbitrarily and maliciously to violate the data integrity of the system, we adopt CLAS [27] to ensure the completeness and correctness of the evaluations stored on the nodes.
After a new block is generated and partitioned, each node is required to calculate and provisionally preserve all the evaluations of each encoded primary-piece for the subsequent verification. The verification and signing process is embedded in the Commit phase of PBFT. In this phase, each node N_s first signs all its computed evaluations f_{h,i}(s) (where 1 ≤ i ≤ r and 1 ≤ s ≤ n) for the block B(h), and then broadcasts its signatures and index s embedded in the commit messages.
When node N_s receives the commit message from another node N_q, it checks the validity of the evaluation f_{h,i}(s) computed by N_q by verifying the related signature. If the signature is valid, node N_s extracts the signature for the subsequent aggregation. Once node N_s has received sufficient commit messages from different nodes and extracted f + 1 valid signatures on the evaluation f_{h,i}(s), it aggregates them into a single signature σ_{h,i,s} of constant length (the length of a CLAS is always under 320 bits). This aggregate signature is regarded as the proof of the correctness of the evaluation. After that, each node preserves its own evaluations together with the related aggregate signatures.
Since CLAS can be performed incrementally [39], the signatures σ_{h,1,s}, σ_{h,2,s}, ..., σ_{h,r,s} on the encoded primary-pieces of B(h) stored on node N_s can be further aggregated into a single signature σ_{h,s}, which is regarded as the proof of the evaluation set F_h(s) for block B(h) stored on node N_s. Through this method, the nodes can verify the received evaluations by checking just a single aggregate signature instead of a set of signatures. After all the nodes have aggregated the signatures for the evaluations which they need to preserve, they can discard the evaluations calculated for other nodes. The evaluations and the related aggregate signatures that a node N_s needs to store for a series of blocks {B(h_1), B(h_2), ...} are shown in Fig. 3.
Algorithm 1 shows the details of the partitioning and encoding process for an original block B(h), as implemented on node N_s in PartitionChain.
First, the original block B(h) is partitioned into r primary-pieces B = {B_1(h), ..., B_r(h)} through the function Partition() (Line 1). Then, each primary-piece B_i(h), where i ranges from 1 to r, is partitioned into k = n - 2f second-level-pieces B_i = {b_{i,1}(h), ..., b_{i,k}(h)} based on the assumption of PBFT (Line 2-3). After that, each set of second-level-pieces partitioned from one primary-piece is encoded into n evaluations F_i = {f_{h,i}(1), ..., f_{h,i}(n)} by the function RSEncode() (Line 4-5). Subsequently, node N_s signs every computed evaluation (Line 6).

Algorithm 1. Partition and Encoding in PartitionChain
Input: node N_s, original block B(h)
1: B = {B_1(h), ..., B_r(h)} ← Partition(B(h), r)
2: for i from 1 to r do
3:   B_i = {b_{i,1}(h), ..., b_{i,k}(h)} ← Partition(B_i(h), k)
4:   for j from 1 to n do
5:     f_{h,i}(j) ← RSEncode(B_i, j)

After encoding and signing, node N_s packages the evaluations which it should preserve into F and the signatures into H respectively, and then it broadcasts the signature set H together with the commit message during the Commit phase of PBFT (Line 8-11). If node N_s receives f + 1 commit messages from different nodes, it aggregates the signatures on each evaluation which it needs to preserve (Line 12-18). In detail, it first extracts the f + 1 signature sets from the commit messages (Line 13). For each evaluation f_{h,i}(s), node N_s packages the f + 1 signatures into signs and then aggregates them into a single signature σ_{h,i,s} with constant length via Aggregate() (Line 14-17).
Having obtained the r aggregate signatures σ_{h,1,s}, ..., σ_{h,r,s}, node N_s aggregates them again into a single signature σ_{h,s}, which serves as the proof of the evaluation set F (Line 18). Finally, node N_s preserves its own evaluation set F along with the corresponding aggregate signature σ_{h,s} (Line 19).
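The control flow of Algorithm 1 can be sketched in Python as follows. The cryptographic primitives (RS encoding, signing, CLAS aggregation) are replaced with hash-based placeholders, and the f + 1 commit messages are simulated locally; only the partition, encode, sign, and aggregate structure is meaningful, not the stub outputs.

```python
# A structural sketch of Algorithm 1. The *_stub functions are placeholders,
# NOT real RS coding or CLAS; line comments map to the algorithm's lines.
import hashlib

def _h(*parts):
    """16-hex-char digest used below as a stand-in for real crypto."""
    m = hashlib.sha256()
    for x in parts:
        m.update(repr(x).encode())
    return m.hexdigest()[:16]

def partition(data, parts):
    """Split a byte string into at most `parts` nearly-equal pieces."""
    step = -(-len(data) // parts)
    return [data[i:i + step] for i in range(0, len(data), step)]

def rs_encode_stub(pieces, n):
    return [_h(pieces, s) for s in range(1, n + 1)]

def sign_stub(node, value):
    return _h("sig", node, value)

def aggregate_stub(sigs):
    return _h("agg", tuple(sigs))

def algorithm1(node_s, block, n, f, r):
    k = n - 2 * f
    primary = partition(block, r)                        # Line 1
    own_evals = {}
    for i, B_i in enumerate(primary, start=1):
        second = partition(B_i, k)                       # Lines 2-3
        F_i = rs_encode_stub(second, n)                  # Lines 4-5
        own_evals[i] = F_i[node_s - 1]                   # own evaluation (Line 6)
    # Lines 12-18: f + 1 commit messages (simulated here) carry signatures
    # on each stored evaluation; aggregate per piece, then across pieces.
    commits = [{i: sign_stub(q, own_evals[i]) for i in own_evals}
               for q in range(1, f + 2)]
    per_piece = [aggregate_stub([c[i] for c in commits]) for i in own_evals]
    sigma_h_s = aggregate_stub(per_piece)                # Line 18
    return own_evals, sigma_h_s                          # Line 19: preserve both
```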

Data Recovery
Through RS encoding, each node preserves its own evaluations rather than the whole historical data. According to the assumption of our model, there are at most f = ⌊(n - 1)/3⌋ malicious nodes in the system, and thus we can guarantee that at least n - f evaluations for each primary-piece B_i(h) are stored on honest nodes. Even in the worst case, where f honest nodes have crashed for some reason and do not reply to the request, it is assured that at least n - 2f valid evaluations for each primary-piece B_i(h) can be received for data recovery. Consequently, the data availability of the system is achieved through the (n - 2f, n)-RS schema.
PartitionChain nominates one node as the leader responsible for decoding, while the remaining nodes are termed followers. In order to ensure that the elected leader is most likely to be honest, PartitionChain selects it in round-robin order from the nodes whose credit is higher than the lower bound.
When the leader receives a request from a client to read a certain block $B(h)$, it launches the data recovery process for the target block. The details of the process for block $B(h)$ are presented in Fig. 4, which contains the following three steps: 1): The leader first broadcasts a decode message $\langle \text{DECODE}, h, n, n-2f \rangle$ (where $h$ is the unique hash value of the target block $B(h)$, and $n$ and $n-2f$ indicate the related RS scheme) to the followers. Since there exist at most $f$ malicious nodes and $f$ faulty nodes according to the assumption, it is guaranteed that the leader can receive at least $n-2f$ correct evaluation sets, which are sufficient for decoding. 2): Each follower replies with its preserved evaluation set and the corresponding aggregate signature, which the leader verifies. 3): The leader decodes the verified evaluations into the second-level-pieces $b_{i,j}(h)$ according to the index $j$ of each second-level-piece. As mentioned in previous sections, each primary-piece $B_i(h)$ corresponds to such a $k$-element equation set; thus, to recover all the primary-pieces, we need to solve altogether $r$ such $k$-equation sets. After all $r$ primary-pieces $B_i(h)$ are recovered in this way, the leader concatenates the primary-pieces by their indexes $i$ to recover the original block $B(h)$. Through the steps above, the original block can be recovered and sent by the leader to the client.
Algorithm 2 illustrates in detail how the leader recovers the data upon receiving a request for a target block $B(h)$ from a client. The leader $L$ first broadcasts a decode message $\langle \text{DECODE}, h, n, n-2f \rangle$ to all the followers (Line 2). Each follower $N_s$ replies with its preserved evaluation set $F_h(s)$, the corresponding aggregate signature $\sigma_{h,s}$, and its own index $s$ (Lines 13-17). After the leader receives $n-2f$ messages containing the evaluation sets and CLAS from different followers, it extracts the evaluations for each primary-piece and decodes them into the corresponding second-level-pieces via the function $\text{RSDecode}()$ sequentially (Lines 4-10). This process is performed $r$ times, once for each of the $r$ primary-pieces. After that, the leader concatenates the recovered second-level-pieces by their unique indexes through the function $\text{Concatenate}()$ (Line 10). Finally, it replies with the target block $B(h)$ to the client (Line 11).
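The leader's side of this flow can be sketched as follows. This is a schematic of ours, not the paper's implementation: signature verification is omitted, and `rs_decode` is a hypothetical callback that is assumed to return a decoded primary-piece as bytes (the toy stand-in below is a repetition code, not real RS decoding).

```python
def recover_block(eval_sets: dict[int, list[int]], n: int, f: int, r: int,
                  rs_decode) -> bytes:
    """Sketch of Algorithm 2: collect n-2f evaluation sets, decode each of the
    r primary-pieces, then concatenate them by index."""
    k = n - 2 * f
    if len(eval_sets) < k:
        raise ValueError("not enough evaluation sets to decode")
    chosen = dict(list(eval_sets.items())[:k])   # any k verified sets suffice
    primary_pieces = []
    for i in range(r):
        # evaluations of primary-piece i held by the chosen followers
        points = [(s, evals[i]) for s, evals in chosen.items()]
        primary_pieces.append(rs_decode(points, k))
    return b"".join(primary_pieces)              # Concatenate() by index i

# toy stand-in decoder: every evaluation simply equals the one-byte piece
stub = lambda points, k: bytes([points[0][1]])
block = recover_block({1: [65, 66], 2: [65, 66], 3: [65, 66]},
                      n=7, f=2, r=2, rs_decode=stub)
```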

System Re-Initialization Process
In the permissioned blockchain, the number of nodes n may change if a new node is accepted to join the system or an existing node decides to leave. The variations of nodes can be divided into two main categories which are outlined in Fig. 5.
When a new node is accepted to join the system, the updated number of nodes $N = n+1$ is greater than the previous number $n$, so the updated $(n-2f, n+1)$-RS schema can still guarantee data availability. If an existing node intends to quit the system, PartitionChain provides two different strategies. In the first scenario, after a node leaves, the current number of nodes $n'$ is still no smaller than the $n$ of the active schema, so the $(n-2f, n)$-RS schema still works and no update is necessary. In the second scenario, the current number of nodes $n'$ is smaller than $n$, which means the $(n-2f, n)$-RS schema can no longer ensure that the leader receives enough correct evaluations. Consequently, PartitionChain launches a system re-initialization with the updated $(n'-2f', n')$-RS schema, where $f' = \lfloor \frac{n'-1}{3} \rfloor$. In this paper, we assume that the $(n-2f, n)$-RS schema remains effective when a node leaves, because PartitionChain's dishonest-behavior audit mechanism keeps the number of malicious nodes low in practice; therefore, we principally focus on the first two scenarios.
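The departure rule above can be condensed into a small decision sketch; the function and the string labels are illustrative names of ours:

```python
def on_node_quit(n_current: int, n_schema: int) -> str:
    """Decide whether a departure forces re-initialization.
    n_schema is the n used by the active (n-2f, n)-RS schema."""
    if n_current >= n_schema:
        return "keep-schema"      # (n-2f, n)-RS still guarantees availability
    f_new = (n_current - 1) // 3  # updated Byzantine bound
    return f"re-init with ({n_current - 2 * f_new}, {n_current})-RS"

assert on_node_quit(10, 10) == "keep-schema"
```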
Joining of New Nodes. When a new node intends to join the system, the number of nodes changes from $n$ to $N = n+1$ and accordingly the bound on malicious nodes changes to $F = \lfloor \frac{N-1}{3} \rfloor$. To relieve the computational stress of the re-encoding process, we adopt the $(n-2f, n+1)$-RS schema, which still guarantees data availability since $n+1-2F \geq n-2f$, indicating that the leader remains capable of receiving $n-2f$ evaluations in decoding.
After the new node is approved to join the system, the re-initialization of the whole chain is launched. We present the details of the re-initialization of an original block $B(h)$ in Fig. 6 and Algorithm 3 (the details of re-encoding and decoding are omitted from Algorithm 3 since they are illustrated in Algorithms 1 and 2): 1): The new node $N_{n+1}$ broadcasts a decode message $\langle \text{DECODE}, h, n, n-2f \rangle$ for block $B(h)$ to all the original nodes (Line 2). 2): On receiving the decode message, each original node $N_s$ replies with its preserved evaluation set $F_h(s)$ of block $B(h)$, the aggregate signature $\sigma_{h,s}$, the hash $h$, and its index $s$ (Lines 14-17). To recover each second-level-piece, the new node needs at least $n-2f$ correct evaluation sets from different nodes (Line 4); it checks the validity of the received evaluations by verifying the corresponding signatures. 3): After decoding, the new node $N_{n+1}$ calculates and signs the evaluations $\mathcal{F}$ of all the other nodes (Lines 5-6). Then it broadcasts a re-initialize message $\langle \text{RE-INITIALIZE}, \mathcal{B}, \mathcal{H}_{n+1}, h, n+1, n-2f \rangle$ to all the original nodes, where $\mathcal{B}$ is the recovered set of second-level-pieces, $\mathcal{H}_{n+1}$ is the signature set on $\mathcal{F}$, and $n+1$ and $n-2f$ indicate the modified $(n-2f, n+1)$-RS scheme for the re-initialization process (Line 7). 4): Each original node $N_s$ calculates and signs the evaluations of the new node $N_{n+1}$ after receiving the re-initialize message (Lines 19-20). Afterwards, node $N_s$ sends its signatures on the evaluation set $F_h(n+1)$ to the new node $N_{n+1}$ (Line 21). The new node $N_{n+1}$ aggregates the signatures once it has received $F+1$ signature sets from different nodes (Lines 9-11). Meanwhile, when node $N_s$ receives the signature set from node $N_{n+1}$, it extracts the signatures of node $N_{n+1}$ on its own evaluations and compresses them, together with its previous aggregate signature, into a single signature (Lines 23-26).
If the verification succeeds, the new node $N_{n+1}$ preserves the aggregate signature $\sigma_{h,n+1}$ together with its own evaluation set $F_h(n+1)$ and discards the computed evaluations of the other nodes. Note that the re-initialization process can be launched in parallel if more than one new node joins the system at the same time: the first new node that launches the re-initialization is responsible for recovering and broadcasting the blocks, while the later nodes merely calculate the evaluations of all the other nodes and sign them for verification.
Quit of Original Nodes. Since the model of this paper is based on the PBFT assumption illustrated above, we only consider the case where the number of nodes $n'$ still satisfies $n' \geq 3f' + 1$ after an existing node quits.
In this case, the node's departure has no influence on the $(n-2f, n)$-RS schema or on the remaining nodes. Since the schema still works, the leaving node merely deletes its preserved data and quits the system; no additional action is required of the other nodes, and data availability is still guaranteed.

Malicious Behavior Audit
Each node in PartitionChain has its own credit based on the malicious behaviors it has performed, such as sending fake messages or remaining silent. To record the credits of the nodes in the system, each node maintains a dynamic array. Since nodes can verify the correctness of the evaluations received from other nodes through CLAS, misbehavior can be detected. A node checks the correctness of the received evaluations; if it detects that the evaluation from another node $N_s$ is forged, it broadcasts a message $\langle \text{DETECTED}, f_{h,j}(s), s \rangle$, where $f_{h,j}(s)$ is the forged evaluation from node $N_s$ and $s$ is the index of the malicious node. When a node receives more than $f+1$ messages reporting that $N_s$ has behaved dishonestly, it reduces the credit of node $N_s$ accordingly in the array. At initialization, PartitionChain sets a lower bound on credits according to the system status. PartitionChain checks the credit of each node periodically and expels the nodes whose credits fall below the bound. To relieve the load on the system, the removal of dishonest nodes only occurs when the current number of nodes $n'$ is greater than the $n$ used in the $(n-2f, n)$-RS schema, so that unnecessary update operations are avoided. Moreover, PartitionChain classifies the nodes into different security levels on the grounds of their credits, enabling clients to select more credible nodes to handle their requests.
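The accusation-counting logic can be sketched as follows; the class name, initial credit, and penalty value are illustrative assumptions, and signature verification of the DETECTED messages is omitted:

```python
from collections import defaultdict

class CreditTracker:
    """Per-node credit array with an f+1 accusation threshold (illustrative values)."""
    def __init__(self, n: int, initial_credit: int = 100, penalty: int = 10):
        self.f = (n - 1) // 3
        self.credits = [initial_credit] * n
        self.penalty = penalty
        self.accusations = defaultdict(set)  # accused index -> set of accuser indexes

    def on_detected(self, accuser: int, accused: int) -> None:
        """Handle a <DETECTED, f_{h,j}(s), s> message (signature checks omitted)."""
        self.accusations[accused].add(accuser)
        if len(self.accusations[accused]) == self.f + 1:  # accusation threshold met
            self.credits[accused] -= self.penalty

tracker = CreditTracker(n=7)          # f = 2, so 3 distinct accusers are needed
for accuser in (0, 1, 2):
    tracker.on_detected(accuser, accused=5)
```

Counting distinct accusers (a set, not a message count) prevents a single malicious node from repeatedly accusing an honest peer.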
Algorithm 4 presents the detailed process of the periodic inspection of malicious nodes in the system. PartitionChain nominates a leader $L$ to be responsible for this process. First, the leader $L$ broadcasts an inspect message $\langle \text{INSPECT}, t \rangle$, where $t$ is the specific time point at which the inspection process initiates (Lines 1-2). When a node $N_s$ receives the inspect message, it finds the indexes of the nodes whose credits are below the lower bound and signs the found indexes (Lines 10-15). Node $N_s$ then aggregates the signatures into a single signature to relieve communication pressure and sends it, together with the found indexes and the time point $t$, to the leader (Lines 16-18). After receiving the messages, the leader $L$ extracts all the index lists from the different nodes (Lines 3-4). As with the evaluations, the correctness of the indexes can be verified through the corresponding aggregate signatures. If a node's index appears in more than $f+1$ index lists, meaning that consensus that this node is malicious has been reached, the leader expels it from the system via the function $\text{Quit}()$ (Lines 5-7).

Storage Consumption

In this part, we prove that PartitionChain reduces the storage consumption per block to a constant complexity $O(1)$ with respect to the number of nodes $n$. As illustrated in Section 3, an original block $B(h)$ is first divided into $r$ primary-pieces, and each primary-piece is then divided again into $k = n-2f$ second-level-pieces. Assume that the size of each block is $T$; then the sizes of each primary-piece and each second-level-piece are $\frac{T}{r}$ and $\frac{T}{k \cdot r}$, respectively. After encoding, every $k$ second-level-pieces are mapped to $n$ evaluations through the corresponding RS polynomial, so each node stores $r$ evaluations of size $\frac{T}{k \cdot r}$ per block, i.e., $\frac{T}{n-2f}$ in total. Therefore, the storage consumption per block over the whole system is $n \cdot \frac{T}{n-2f} = \frac{3nT}{n+2} < 3T = O(1)$, on the basis of the precondition of PBFT that $n = 3f+1$.
Compared with the $O(n)$ storage consumption of the full-replication strategy, PartitionChain greatly reduces the storage overhead while maintaining the same reliability.
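The constant bound can be checked numerically: under $n = 3f+1$, the system-wide storage for one block never exceeds $3T$, regardless of $n$.

```python
def total_storage_per_block(T: float, n: int) -> float:
    """System-wide storage for one block of size T under (n-2f, n)-RS, n = 3f+1."""
    f = (n - 1) // 3
    k = n - 2 * f           # each node keeps T/k per block (r pieces of size T/(k*r))
    return n * T / k

# for n = 4, 13, 40 (all of the form 3f+1) the total stays below 3T
for n in (4, 13, 40):
    assert total_storage_per_block(1.0, n) < 3.0
```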

Computational Efficiency
We now prove that the computational complexities of encoding and decoding are greatly reduced to $O(\frac{T}{c} \cdot n)$ and $O(\frac{T}{c} \cdot n^2)$, respectively, with regard to the number of nodes $n$. Further, the computation consumption of coding in PartitionChain is reduced by about $2^{18}$ times compared with BFT-Store. In the $(n-2f, n)$-RS schema, the computational complexities of encoding and decoding can be expressed as $E_{RS}(n) = O(n(n-2f))$ and $D_{RS}(n) = O((n-2f)^3)$. First, we compare the computational complexity of decoding between PartitionChain and BFT-Store. Assuming that the size of a block is $T$, the computational complexity of decoding a block in BFT-Store follows from applying $D_{RS}(n)$ directly to the full-size block. Note that we simplify $(n-2f)$ to $n$ in the deductions, based on the precondition of PBFT [22] that $f = \lfloor \frac{n-1}{3} \rfloor$. In PartitionChain, we limit the size of each second-level-piece to $c = 2^{10}$ bits, and thus the number of primary-pieces is $r = \frac{T}{c(n-2f)}$, as deduced in detail in Section 3. To recover a single block, the leader needs to decode $r$ times in total, once for each primary-piece; therefore, the computational complexity of decoding a block in PartitionChain is $r \cdot D_{RS}(n) = O(\frac{T}{c} \cdot n^2)$. From Equation (5), where the size of an original block $T$ is $2^{10}$ and $n$ is under 100 in our model, the computational complexity of decoding in PartitionChain is decreased by approximately $2^{18}$ times compared with BFT-Store. Similarly, the computational complexity of encoding a block in PartitionChain is $r \cdot E_{RS}(n) = O(\frac{T}{c} \cdot n)$, which is likewise decreased by about $2^{18}$ times compared to BFT-Store. Undoubtedly, PartitionChain reduces the response time and improves the overall throughput of the system significantly.

Improvements of Re-Initialization Process
In BFT-Store, each time the number of nodes $n$ in the system varies, all the nodes are forced to participate in the re-initialization, decoding all the chunks and then re-encoding the recovered blocks with the updated RS schema. That is, the re-initialization of BFT-Store requires $O(n)$ operations on each node.
By contrast, in PartitionChain, when a new node joins the system, the original nodes merely need to calculate the evaluation set of the new node for verification and update their aggregate signature sets. That is, PartitionChain requires only $O(1)$ operations on each original node to accomplish the re-initialization. When a node quits the system, since the $(n-2f, n)$-RS schema still works, no additional operation is required, and thus the computational load on each node is likewise reduced to $O(1)$.
During the re-encoding process, the leader is required to deliver all the recovered second-level-pieces to the followers, which makes the leader's upload bandwidth a bottleneck. We therefore employ the Gossip protocol [43] in PartitionChain to relieve the communication pressure and improve the network fault tolerance. In a Gossip network, each node relays received messages to randomly chosen peers, so that all nodes receive a broadcast message within $O(\log n)$ rounds.
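A minimal push-gossip simulation illustrates the logarithmic spread; the fanout, the seed, and the slack factor in the bound are illustrative assumptions of ours, not parameters from the paper:

```python
import math, random

def gossip_rounds(n: int, fanout: int = 3, seed: int = 0) -> int:
    """Simulate push gossip: each informed node relays to `fanout` random peers per round."""
    rng = random.Random(seed)
    informed = {0}                 # the leader starts with the message
    rounds = 0
    while len(informed) < n:
        for node in list(informed):
            informed.update(rng.randrange(n) for _ in range(fanout))
        rounds += 1
    return rounds

n = 64
r = gossip_rounds(n)
# the informed set roughly quadruples each round, so r stays within O(log n)
```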

Read Performance
Since the original blocks are partitioned and encoded into their evaluations, each time a client requests a block, the system has to launch the data recovery process. Consequently, the response time of a client's read request increases compared to the traditional full-replication strategy. To minimize this read-performance loss, PartitionChain adopts caching for the most frequently requested blocks on a node. A Least Recently Used (LRU) cache [44] is proven to have good performance and a relatively high hit rate for hot data [45]. In PartitionChain, each node maintains a cache of the blocks most frequently requested during the recent period. After receiving the recovered original block for the client, the node stores it in its cache. Thus, when a client requests one of these blocks again, the node can reply with the target block directly, without data recovery, since it is stored locally.
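The block cache can be sketched with a standard `OrderedDict`-based LRU; the class name and capacity are illustrative, and in practice the cache would be keyed by the block hash $h$:

```python
from collections import OrderedDict

class BlockCache:
    """LRU cache of recovered blocks: serve hot blocks without re-running data recovery."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks: OrderedDict[str, bytes] = OrderedDict()

    def get(self, h: str):
        if h not in self.blocks:
            return None                   # miss: caller must run the recovery protocol
        self.blocks.move_to_end(h)        # mark as most recently used
        return self.blocks[h]

    def put(self, h: str, block: bytes) -> None:
        self.blocks[h] = block
        self.blocks.move_to_end(h)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block

cache = BlockCache(capacity=2)
cache.put("h1", b"block-1")
cache.put("h2", b"block-2")
cache.get("h1")                # h1 becomes most recently used
cache.put("h3", b"block-3")    # evicts h2, the least recently used
```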

EXPERIMENTAL EVALUATION
To compare the performance of the full-replication strategy, BFT-Store, and PartitionChain more clearly, we follow the basic preconditions and setups of the experiments in [1], [35]. According to the principal goals of this paper, our experimental evaluations focus on three main aspects of performance: storage consumption, computational efficiency, and system re-initialization. Note that, as the BFT-Store sample in our experiments, we select the case illustrated in [1] where the faulty factor of the system is $r = 3$ and the number of replicas of each block is 1.
Storage Consumption. To evaluate the storage consumption of PartitionChain, we also employ the full-replication strategy and BFT-Store for comparison. In this case, we fix the size of each block at 1 MB with merely one block in the blockchain, while the number of nodes ranges from 4 to 40. We assume that each node has 200 MB of storage space, meaning that it can store at most 200 blocks.
We can see from Fig. 7 that, in the full-replication strategy, the storage overhead per block is proportional to the number of nodes, while the storage consumption in both BFT-Store and PartitionChain remains approximately steady. Specifically, the storage consumption of BFT-Store and PartitionChain remains roughly the same. When the number of nodes increases to 40, the storage overhead per block in PartitionChain is only one eighth of that in the full-replication strategy. This demonstrates that PartitionChain successfully improves system scalability in practical applications.
Computational Efficiency. To evaluate the computational efficiency of PartitionChain, we observe the throughput of encoding and decoding. Note that coding is unnecessary in full-replication strategy, so we neglect it in this section. Now, we compare the throughput of encoding and decoding against the number of nodes between PartitionChain and BFT-Store.
As shown in Table 3, while the throughput of both PartitionChain and BFT-Store decreases as the number of nodes $n$ ranges from 4 to 40, the throughput of PartitionChain remains approximately $2^{18}$ times that of BFT-Store. The overall throughput of decoding and encoding in PartitionChain is about 10 to 12 MB/sec; thus it can be seen that computational efficiency is significantly improved in PartitionChain.
System Re-Initialization Process. In this case, we compare the average re-initialization time per node in BFT-Store and PartitionChain, both for the joining of a new node and for the quit of an existing node ($n' > n$). Note that the full-replication strategy is omitted here because, when the number of nodes varies, it requires no re-encoding operation.
It can be seen from Fig. 8 that the re-initialization time per node in BFT-Store increases as the number of nodes grows, while the re-encoding time per node in PartitionChain holds steady in both cases. The time consumption of re-initialization in PartitionChain for the joining of a new node and the quit of an original node is significantly reduced to 50 ms and 10 ms, respectively, less than a third of that in BFT-Store.

CONCLUSION
To relieve the heavy storage load imposed on nodes in blockchain systems by full-replication strategies, we propose an optimized storage partition strategy termed PartitionChain. The overall contributions of this paper are: (i) reducing the storage consumption per block from $O(n)$ to $O(1)$ while maintaining data availability; (ii) relieving the computational load of re-initialization when the number of nodes varies, so that the original nodes merely need to update signatures; (iii) reducing the computational complexity and communication overhead of decoding and encoding; (iv) adopting CLAS instead of TS as the proof of data, achieving full decentralization without relying on any trusted third party (TTP); (v) introducing an audit mechanism for dishonest node behavior to purify the network. Furthermore, PartitionChain enables blockchain systems to suit dynamic networks with higher efficiency and easier scalability.