Double SHA-256 Hardware Architecture With Compact Message Expander for Bitcoin Mining

In the Bitcoin network, computing double SHA-256 values consumes most of the network energy. Therefore, reducing the power consumption and increasing the processing rate for the double SHA-256 algorithm is currently an important research trend. In this paper, we propose a high-data-rate low-power hardware architecture named the compact message expander (CME) double SHA-256. The CME double SHA-256 architecture combines resource sharing and fully unrolled datapath technologies to achieve both a high data rate and low power consumption. Notably, the CME algorithm utilizes the double SHA-256 input data characteristics to further reduce the hardware cost and power consumption. A review of the literature shows that the CME algorithm eliminates at least 9.68% of the 32-bit XOR gates, 16.49% of the 32-bit adders, and 16.79% of the registers required to calculate double SHA-256. We synthesized and laid out the CME double SHA-256 using CMOS $0.18~\mu m$ technology. The hardware cost of the synthesized circuit is approximately 13.88% less than that of the conventional approach. The chip layout size is $5.9~mm \times 5.9~mm$, and the correctness of the circuit was verified on a real hardware platform (ZCU102). The throughput of the proposed architecture is 61.44 Gbps on an ASIC with the Rohm 180 nm CMOS standard cell library and 340 Gbps on a 16 nm FinFET FPGA (Zynq UltraScale+ MPSoC ZCU102).


I. INTRODUCTION
Bitcoin is the most popular cryptocurrency and was invented by Satoshi Nakamoto in 2008 [1], [2]. Leveraging blockchain technology, Bitcoin uses a distributed public ledger to record all transactions without any third party [3]. Each block added to the public distributed ledger is created by hashing a 1024-bit message that includes a version number, a hash of the previous block, a hash of the Merkle root, a timestamp, a target value, and a nonce. The nonce in the 1024-bit message is valid only if it produces a hash output smaller than the specified target value. Therefore, miners relentlessly seek valid nonces when adding new blocks. The process of finding a valid nonce is called Bitcoin mining [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Huang.
In Bitcoin mining, the double SHA-256 algorithm is used to compute the hash value of the bitcoin block header, which is a 1024-bit message. The use of double SHA-256 protects against the length extension attack [5]. Technically, SHA-256 consists of a message expander (ME) and a message compressor (MC). During the SHA-256 operation, the ME expands the 512-bit input message into 64 chunks of 32-bit data. The MC compresses these 64 32-bit data chunks into a 256-bit hashed output.
Most of the energy consumption required for maintaining the Bitcoin network stems from calculating double SHA-256 values. Therefore, reducing the hardware cost and energy consumption of the SHA-256 circuit is a popular research trend. In [6], the authors optimized the double SHA-256 operation for Bitcoin mining from an algorithmic perspective, but no hardware design was available to evaluate the power consumption. From a hardware perspective, [7]-[22] proposed solutions to improve SHA-256. For instance, the authors of [7] employed a carry-save adder to shorten the computation time of the critical path, which increased the maximum frequency and processing rate, while [8]-[12] used pipeline technology to improve the SHA-256 throughput. A cache memory technique was presented in [13] to reuse data, minimize the critical paths, and reduce the number of memory accesses for SHA-256 processing. The authors of [14] adopted the unfolding technique to reduce the computing latency for SHA-256. The authors of [15] proposed using a 7-3-2 array compressor to reduce the critical path delay for SHA-256. The carry-save adder technique is used in [16] to reduce the latency of additions in the SHA-256 algorithm. The authors of [17] combined techniques such as carry-save adders and pipelining to increase the performance of SHA-256. Pipelining and unrolling techniques are presented in [18] and [19] to increase the throughput of SHA-256. The authors of [20]-[22] presented SHA-256 implementations on an FPGA for performance evaluation, without any optimization technique. Despite providing improvements in terms of hardware cost and power consumption, the hardware circuits developed in [7]-[22] have low processing rates because they require several (up to 64) clock cycles to compute a single 256-bit hash value.
To be applicable to Bitcoin mining, a SHA-256 circuit needs not only low hardware and power costs but also a high processing rate. To reach a high processing rate, the authors in [23] proposed the fully unrolled SHA-256 datapath for Bitcoin mining hardware. Additionally, the fully unrolled SHA-256 datapath can be designed to run on an application-specific integrated circuit (ASIC) [24], which can reach even higher processing rates. However, because an ASIC implementation of a fully unrolled datapath has high power consumption and hardware costs, [25]-[28] proposed eliminating an 8-round unrolled datapath in the double SHA-256 architecture to reduce the chip area. Furthermore, several technical solutions, such as carry-save adders and optimized message compressor (MC) architectures, have been proposed and applied to reduce the hardware and power costs.
In this study, we propose a new approach for reducing the hardware cost and power consumption of high-processing-rate, fully unrolled SHA-256 architectures. We analyze the characteristics of the 1024-bit input data of double SHA-256 and propose compact message expander (CME) algorithms that significantly reduce the hardware cost required to compute the message expander (ME) process of SHA-256. In addition, we propose a CME double SHA-256 accelerator architecture that adopts the proposed CME algorithms to reduce the power consumption. Our architecture generates one 256-bit hash value per clock cycle. We implemented the proposed double SHA-256 accelerator architectures in ASIC CMOS 0.18 µm technology to demonstrate their energy efficiency. The Verilog code and synthesized results of the experiment are publicly available from GitHub.
The remainder of this paper is organized as follows. Section II presents a preliminary study. Section III describes our proposed CME double SHA-256 architecture and explains the CME algorithms and hardware circuits in detail. Section IV reports our evaluation in terms of theory, ASIC, and FPGA experiments. Finally, Section V concludes the paper.

II. PRELIMINARIES
A. DOUBLE SHA-256 ARCHITECTURE FOR BITCOIN MINING
Fig. 1 shows the overall architecture of double SHA-256 as applied to Bitcoin mining. The input to the double SHA-256 process is a 1024-bit message, which includes a 32-bit version, a 256-bit hash of the previous block, a 256-bit hash of the Merkle root, a 32-bit timestamp, a 32-bit target, a 32-bit nonce, and 384 bits of padding. The 1024-bit message is split into two 512-bit message parts; then SHA-256$_1$ calculates a hash value of the first 512-bit message, and SHA-256$_2$ computes a hash value of the final 512-bit message. Due to the double SHA-256 requirement, the 256-bit hash output from SHA-256$_2$ must be compressed into the final 256-bit hash by using SHA-256$_3$. In the Bitcoin mining process, the final 256-bit hash output from SHA-256$_3$ is compared to the target value. If the final hash is smaller than the target value, the 32-bit nonce is valid, and a new Bitcoin block is successfully created. Otherwise, the 32-bit nonce is increased by one and the double SHA-256 circuit recomputes to find a new hash value. This process is repeated until the 256-bit hash of SHA-256$_3$ meets the target requirement.
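The nonce search described above can be sketched in software as follows. This is a minimal illustration rather than the paper's circuit: it relies on Python's `hashlib`, which applies the SHA-256 padding internally, and the function names, the 76-byte all-zero header prefix, and the deliberately easy target are hypothetical. Interpreting the digest as a little-endian integer follows the usual Bitcoin convention and is an assumption here.

```python
import hashlib
import struct


def double_sha256(data: bytes) -> bytes:
    """Apply SHA-256 twice; hashlib pads the message internally."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(header_prefix: bytes, target: int, max_nonce: int = 2**32):
    """Try nonces 0, 1, 2, ... until the double hash is below the target.

    header_prefix is the 76-byte header without the nonce (hypothetical
    test data in the demo below). Returns (nonce, digest), or None if no
    nonce below max_nonce satisfies the target.
    """
    assert len(header_prefix) == 76
    for nonce in range(max_nonce):
        # 76 + 4 bytes = 640 bits; padding extends this to 1024 bits.
        header = header_prefix + struct.pack("<I", nonce)
        digest = double_sha256(header)
        # Assumption: compare as a little-endian integer.
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None


if __name__ == "__main__":
    easy_target = 1 << 244  # roughly one hash in 4096 succeeds
    print(mine(bytes(76), easy_target, max_nonce=1_000_000))
```

A real target is far smaller than the one used in this demo, which is why a hardware pipeline that tests one nonce per clock cycle is attractive.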
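Algorithm 1 shows the ME process, which expands the 512-bit input message into 64 chunks of 32-bit data $W_j$ ($0 \le j \le 63$). In the first 16 rounds, the ME parses the 512-bit message into 16 32-bit data chunks (denoted as $W_j$, $j = 0$ to $15$, where $j$ is the round index). In the final 48 rounds, the ME calculates 48 chunks of 32-bit data $W_j$ ($16 \le j \le 63$). Three 32-bit adders and two logical functions $\sigma_0(x)$ and $\sigma_1(x)$ are required to compute each of these values:

$$W_j = \sigma_1(W_{j-2}) + W_{j-7} + \sigma_0(W_{j-15}) + W_{j-16}, \quad 16 \le j \le 63$$

Fig. 2 shows the conventional circuit C required to calculate $W_j$ ($16 \le j \le 63$), in which the logical functions $\sigma_0(x)$ and $\sigma_1(x)$ are respectively defined as follows:

$$\sigma_0(x) = \mathrm{ROTR}^{7}(x) \oplus \mathrm{ROTR}^{18}(x) \oplus \mathrm{SHR}^{3}(x) \quad (1)$$

$$\sigma_1(x) = \mathrm{ROTR}^{17}(x) \oplus \mathrm{ROTR}^{19}(x) \oplus \mathrm{SHR}^{10}(x) \quad (2)$$

Here, $\mathrm{ROTR}^{n}$ denotes a 32-bit right rotation by $n$ bits, $\mathrm{SHR}^{n}$ denotes a right shift by $n$ bits, and all additions are performed modulo $2^{32}$.

VOLUME 8, 2020

The authors of [23] proposed an ASIC-based double SHA-256 accelerator that implemented the ME and MC processes in a fully unrolled datapath to achieve a high processing rate. Technically, the fully unrolled SHA-256 datapath enables the 64 rounds of ME and MC to run in parallel and be pipelined. Fig. 3 illustrates a prototype SHA-256 architecture with 64-round unrolled datapaths for the MC and ME processes. The unrolled ME datapath is denoted as Block$_j$ ($j = 0, \ldots, 63$), while the unrolled MC datapath is denoted as Loop$_j$ ($j = 0, \ldots, 63$).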
Because the goal of this study is to optimize the ME process, we focus specifically on a hardware implementation for ME. For the first 16 blocks, i.e., Block$_j$ ($j = 0, \ldots, 15$), each ME block requires a 512-bit register (or 16 32-bit registers) to pipeline and store the 16 $W_j$ ($j = 0, \ldots, 15$) values. For the last 48 blocks, i.e., Block$_j$ ($j = 16, \ldots, 63$), each block needs a 512-bit register (or 16 32-bit registers) and a C circuit (Fig. 2) to compute $W_j$ ($j = 16, \ldots, 63$). As shown in Fig. 1, the double SHA-256 accelerator for Bitcoin mining requires three individual SHA-256 circuits. This means that the accelerator must implement 48 × 3 = 144 C circuits (in the 16th to 63rd blocks of SHA-256$_1$, SHA-256$_2$, and SHA-256$_3$). Thus, it is necessary to both optimize the C circuit and reduce the number of C circuits required for double SHA-256.
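As a software reference for the ME process and the C circuit discussed above, the following minimal sketch expands one 512-bit block into the 64 words $W_j$ using the standard SHA-256 definitions of $\sigma_0$ and $\sigma_1$. The function names are illustrative only; the hardware computes the same recurrence with one C circuit per unrolled block.

```python
MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32


def rotr(x: int, n: int) -> int:
    """32-bit right rotation."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma0(x: int) -> int:
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma1(x: int) -> int:
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def c_circuit(w_m2: int, w_m7: int, w_m15: int, w_m16: int) -> int:
    """One step of the recurrence: two sigma functions, three additions."""
    return (sigma1(w_m2) + w_m7 + sigma0(w_m15) + w_m16) & MASK


def expand(block: bytes) -> list:
    """Expand a 64-byte block into the 64 message-schedule words."""
    assert len(block) == 64
    # Rounds 0..15: parse the block into sixteen big-endian 32-bit words.
    w = [int.from_bytes(block[4 * j:4 * j + 4], "big") for j in range(16)]
    # Rounds 16..63: each new word depends on four earlier words.
    for j in range(16, 64):
        w.append(c_circuit(w[j - 2], w[j - 7], w[j - 15], w[j - 16]))
    return w
```

Because each $W_j$ depends only on $W_{j-2}$, $W_{j-7}$, $W_{j-15}$, and $W_{j-16}$, any of these inputs that is known to be zero removes an adder (and possibly a $\sigma$ function) from the corresponding block, which is the property the CME algorithms exploit.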

C. THE OPTIMIZED DOUBLE SHA-256 ARCHITECTURE
The prototype double SHA-256 accelerator has high power consumption because the fully unrolled datapath results in a large chip area. To reduce the power consumption, [25]-[28] proposed the optimized double SHA-256 accelerator, in which a 64-round unrolled datapath is reduced to a 60-round unrolled datapath. Fig. 4 shows a schematic diagram of the 60-round unrolled ME datapath used in SHA-256$_2$ and SHA-256$_3$. In SHA-256$_2$, the 60-round unrolled ME datapath includes rounds 4 to 63 (denoted as Block$_j$ ($j = 4, \ldots, 63$)). In SHA-256$_3$, the 60-round unrolled ME datapath includes rounds 1 to 60 (denoted as Block$_j$ ($j = 1, \ldots, 60$)). Consequently, 8 ME blocks (four in each of the two circuits) are eliminated compared with the prototype architecture mentioned above.

III. THE PROPOSED CME DOUBLE SHA-256 ARCHITECTURE
A. ARCHITECTURAL OVERVIEW
In Bitcoin mining, the 512 bits of data input to SHA-256$_1$ do not change frequently because they do not include the 32-bit nonce field. Conversely, the 512 bits of data input to SHA-256$_2$ are updated frequently because of the changing value of the nonce field. Whenever the output of SHA-256$_2$ changes, SHA-256$_3$ also needs to be recomputed. Because the nonce field has 32 bits, each computation of SHA-256$_1$ requires SHA-256$_2$ and SHA-256$_3$ to recompute their values up to $2^{32}$ times. Therefore, we propose the CME double SHA-256 accelerator architecture, as shown in Fig. 5. To achieve a high processing rate as well as efficient hardware and power cost, we implement a resource-sharing architecture for SHA-256$_1$ and a fully unrolled datapath architecture for SHA-256$_2$ and SHA-256$_3$. SHA-256$_1$ has a single Block$_{0-63}$ circuit for calculating $W_j$ ($j = 0, \ldots, 63$) and a single Loop$_{0-63}$ circuit for calculating the internal hashes a, b, c, d, e, f, g, h in 64 clock cycles. Each clock cycle computes one $W_j$ value and updates the internal hash once.
Using pipelined and parallel operations, SHA-256$_2$ and SHA-256$_3$ can produce an output hash every clock cycle. However, the resource-sharing SHA-256$_1$ circuit produces one hash value every 64 clock cycles. The low processing rate of the SHA-256$_1$ circuit does not affect the final processing rate of the CME double SHA-256 accelerator because one SHA-256$_1$ output value can be used to calculate SHA-256$_2$ and SHA-256$_3$ up to $2^{32}$ times. The final processing rate of the CME-based double SHA-256 is one 256-bit hash value per clock cycle.
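The reuse of a single SHA-256$_1$ result across many nonces can be mimicked in software with the `copy()` method of `hashlib` hash objects. The sketch below is only an analogy to the architecture: the function name and the 12-byte tail argument are hypothetical, and the little-endian nonce encoding is an assumption.

```python
import hashlib
import struct


def hashes_with_midstate(first_block: bytes, tail12: bytes, nonces):
    """Yield (nonce, double hash) while hashing the first block only once.

    first_block: the first 64 bytes of the header (no nonce inside).
    tail12: the 12 header bytes that precede the nonce in the second block.
    """
    assert len(first_block) == 64 and len(tail12) == 12
    # Plays the role of SHA-256_1: computed once per block candidate.
    midstate = hashlib.sha256(first_block)
    for nonce in nonces:
        h = midstate.copy()  # reuse the state; the first block is not rehashed
        h.update(tail12 + struct.pack("<I", nonce))  # role of SHA-256_2
        yield nonce, hashlib.sha256(h.digest()).digest()  # role of SHA-256_3
```

With up to $2^{32}$ nonces sharing one first-block computation, the 64 cycles spent in SHA-256$_1$ amount to about $64 / 2^{32} \approx 1.5 \times 10^{-8}$ cycles per hash, which is why the slower resource-sharing circuit does not limit the overall rate.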
In the following subsections, we explain our proposed CME algorithms and the equivalent hardware designs.

B. COMPACT MESSAGE EXPANDER (CME) ALGORITHM
We propose the CME algorithms by analyzing the characteristics of the input data of SHA-256$_2$ and SHA-256$_3$.

1) CME FOR SHA-256$_2$
As seen in Fig. 1, the 512 bits of data input to SHA-256$_2$ include the last 32 bits of the Merkle root hash, a 32-bit timestamp, a 32-bit target, a 32-bit nonce, and a 384-bit padding+length field. It is worth noting that most of the content of the padding+length field consists of zeros (refer to Fig. 6a).
Assume that the 512 bits of data are separated into 16 32-bit words $M_j$ ($j = 0, \ldots, 15$). The CME operation for SHA-256$_2$ is illustrated in Algorithm 3. The algorithm processes the data in 64 loops. During the first 16 loops, $W_j$ ($j = 0, \ldots, 15$) are assigned the values $M_j$ ($j = 0, \ldots, 15$). The values of $W_j$ ($j = 5, \ldots, 14$) are all zero because they correspond to the zero values of the padding+length field. In addition, $W_4$ and $W_{15}$ are constants. During the last 48 loops, the CME calculates $W_j$ ($j = 16, \ldots, 63$) by using (7):

$$W_j = \sigma_1(W_{j-2}) + W_{j-7} + \sigma_0(W_{j-15}) + W_{j-16} \quad (7)$$

The logical functions $\sigma_0(x)$ and $\sigma_1(x)$ are shown in (1) and (2), respectively. Utilizing the zeros or constant values of $W_j$ ($j = 4, \ldots, 15$), we can optimize the calculation of (7). For example, the $W_{16}$ calculation can be analyzed as follows:

$$W_{16} = \sigma_1(W_{14}) + W_9 + \sigma_0(W_1) + W_0 = \sigma_0(W_1) + W_0 \quad (8)$$

Note that $W_{14} = 0$ and $W_9 = 0$. By comparing (7) with (8) for calculating $W_{16}$, it can be seen that the logical function $\sigma_1(x)$ and two 32-bit adders have been eliminated. The computations of $W_j$ ($j = 17, \ldots, 63$) are analyzed and optimized similarly. The final results are shown in Algorithm 3.
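The simplification illustrated for $W_{16}$ can be checked numerically. The sketch below derives shortened expressions for $W_{16}$ to $W_{21}$ directly from the recurrence and compares them with a generic expander. These expressions are worked out here for illustration and are not copied from Algorithm 3. The constants `PAD_W4 = 0x80000000` and `LEN_W15 = 0x00000280` are the standard SHA-256 padding word and the 640-bit length of an 80-byte header; the text above only states that $W_4$ and $W_{15}$ are constants, so treat these values as an assumption.

```python
MASK = 0xFFFFFFFF
PAD_W4 = 0x80000000   # assumed: first padding word (a 1 bit, then zeros)
LEN_W15 = 0x00000280  # assumed: message length of 640 bits


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def expand_generic(words16):
    """Reference expander: the full recurrence for every j >= 16."""
    w = list(words16)
    for j in range(16, 64):
        w.append((sigma1(w[j - 2]) + w[j - 7] + sigma0(w[j - 15]) + w[j - 16]) & MASK)
    return w


def cme2_first_words(w0, w1, w2, w3):
    """W16..W21 with the zero terms removed (W5..W14 are zero)."""
    w16 = (sigma0(w1) + w0) & MASK                       # W14 = W9 = 0
    w17 = (sigma1(LEN_W15) + sigma0(w2) + w1) & MASK     # W10 = 0
    w18 = (sigma1(w16) + sigma0(w3) + w2) & MASK         # W11 = 0
    w19 = (sigma1(w17) + sigma0(PAD_W4) + w3) & MASK     # W12 = 0
    w20 = (sigma1(w18) + PAD_W4) & MASK                  # W13 = W5 = 0
    w21 = sigma1(w19)                                    # W14 = W6 = W5 = 0
    return [w16, w17, w18, w19, w20, w21]
```

Terms such as `sigma1(LEN_W15)` and `sigma0(PAD_W4)` depend only on constants, so in hardware they can be precomputed instead of being evaluated by a $\sigma$ circuit.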
2) CME FOR SHA-256$_3$
The 512 bits of input data to SHA-256$_3$ include the 256-bit hash output from SHA-256$_2$ concatenated with a 256-bit padding+length field. The value of the first 32 bits of padding is 32'h80000000, while the value of the last 32 bits of padding+length is 32'h00000100. The remaining values are all zeros (refer to Fig. 6b).
Utilizing Algorithms 3 and 4, we can significantly reduce the number of 32-bit adders and the number of logical functions.

[Fig. 7: The proposed shortened computation circuits SC$_1$, SC$_2$, SC$_3$, and SC$_4$ for the CME process.]

C. CME HARDWARE CIRCUITS
From Algorithms 3 and 4, we propose four types of shortened computation (SC) circuits, as shown in Fig. 7. Compared with the traditional C circuit shown in Fig. 2, the proposed SC$_1$ eliminates two 32-bit adders and the logical function $\sigma_1(x)$; SC$_2$ eliminates one 32-bit adder; SC$_3$ eliminates two 32-bit adders and the logical function $\sigma_0(x)$; and SC$_4$ eliminates one 32-bit adder and the logical function $\sigma_0(x)$. Note that eliminating either $\sigma_0(x)$ or $\sigma_1(x)$ also eliminates two 32-bit rotations, one 32-bit shift, and two 32-bit XOR circuits. Based on the C circuit shown in Fig. 2 and the four types of SC circuits shown in Fig. 7, we develop hardware architectures for the CME processes of SHA-256$_2$ and SHA-256$_3$, as shown in Fig. 8 and Fig. 10, respectively.
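To see where shortened circuits can replace the C circuit, one can track which of the sixteen input words are zero, constant, or variable and count the additions that remain in each block. The following sketch does this under a deliberately simple model that is an assumption of this example: a term is dropped only when its input word is zero, constant terms still cost an adder, and the adders of the MC process are not counted. It is therefore not expected to reproduce the totals reported later in Table 1.

```python
def saved_adders(kinds16, last=63):
    """Return {j: adders saved in block j} relative to the 3-adder C circuit.

    kinds16 holds 16 labels, one per input word:
    "var" (data dependent), "const" (fixed nonzero), or "zero".
    """
    kinds = list(kinds16)
    saved = {}
    for j in range(16, last + 1):
        # Terms: sigma1(W[j-2]), W[j-7], sigma0(W[j-15]), W[j-16].
        terms = [kinds[j - 2], kinds[j - 7], kinds[j - 15], kinds[j - 16]]
        nonzero = [k for k in terms if k != "zero"]
        adders = max(len(nonzero) - 1, 0)
        saved[j] = 3 - adders
        # Classify the newly produced word for use by later blocks.
        if "var" in nonzero:
            kinds.append("var")
        elif nonzero:
            kinds.append("const")
        else:
            kinds.append("zero")
    return saved


# Input layouts described in the text (W4/W15 and W8/W15 are constants).
SHA2_KINDS = ["var"] * 4 + ["const"] + ["zero"] * 10 + ["const"]
SHA3_KINDS = ["var"] * 8 + ["const"] + ["zero"] * 6 + ["const"]

if __name__ == "__main__":
    for name, kinds in (("SHA-256_2", SHA2_KINDS), ("SHA-256_3", SHA3_KINDS)):
        s = saved_adders(kinds)
        print(name, {j: n for j, n in s.items() if n > 0})
```

Under this model no block beyond index 30 saves an adder in either circuit, which agrees with the use of the full C circuit in phase 3 of both CME datapaths.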
The proposed CME circuit for SHA-256$_2$ (Fig. 8) is divided into three phases. Phase 1 includes CME2$_4$ to CME2$_{19}$. Each operation requires a 128-bit register (or four 32-bit registers) to store and pipeline $W_0$ to $W_3$. In phase 1, instead of using the conventional C circuit in Fig. 2, the SC$_1$ and SC$_2$ circuits in Fig. 7 are implemented to reduce hardware costs. Phase 2 includes CME2$_{20}$ to CME2$_{30}$, for which the SC$_2$ and SC$_3$ circuits are implemented as appropriate (refer to Algorithm 3). Phase 3 includes CME2$_{31}$ to CME2$_{63}$, and the C circuit is implemented in all the blocks of this phase.
The three phases are classified based on the datapath bit-width. In phase 1, the datapath bit-width is constant (128 bits). The 384 bits of $W_4$ to $W_{15}$ are fixed constants; hence, phase 1 does not need to store and pipeline $W_4$ to $W_{15}$. In phase 2, $W_{20}$ to $W_{30}$ must be stored and pipelined. Thus, the datapath bit-width in phase 2 increases from 160 bits to 480 bits, i.e., by one 32-bit word per block. In phase 3, the datapath bit-width of CME2$_{31}$ to CME2$_{56}$ is 512 bits without optimization. To eliminate values of $W_j$ that are no longer needed in subsequent blocks, the datapath bit-width of CME2$_{57}$ to CME2$_{63}$ is reduced from 480 bits to 32 bits. To explain the datapath bit-width adjustment, we show the detailed data flow and computational circuit of the CME2 process in Fig. 9. In this figure, each number represents the index $j$ of $W_j$. For example, we need four 32-bit registers (equivalent to 128 bits) to store $W_0$ to $W_3$ in blocks CME2$_4$ to CME2$_{15}$. As another example, CME2$_{32}$ needs sixteen 32-bit registers (16 × 32 = 512 bits) to pipeline and store the 16 values of $W_j$ ($j = 16, 17, \ldots, 31$), which are required for the calculation of the following blocks.
Similarly, the proposed CME circuit for SHA-256$_3$ has three phases (Fig. 10). Phase 1 includes CME3$_1$ to CME3$_{23}$. Because the input data $W_8$ to $W_{15}$ are zeros or constants, all blocks of phase 1 have the same datapath width of only 256 bits (which is required to pipeline and store the eight 32-bit values $W_0$ to $W_7$). A large number of registers are thus eliminated. In this phase, circuits SC$_1$, SC$_2$, and C are implemented as appropriate (refer to Algorithm 4). Phase 2 includes blocks CME3$_{24}$ to CME3$_{30}$, in which circuits SC$_4$, SC$_3$, and SC$_2$ are implemented as appropriate (refer to Algorithm 4). Phase 3 includes blocks CME3$_{31}$ to CME3$_{60}$, and circuit C is implemented in all of them. We do not implement blocks CME3$_{61}$ to CME3$_{63}$ because we can detect early whether the final hash is smaller than the target value without waiting for their results.
The three phases are classified based on the datapath bit-width. In phase 1, the datapath bit-width is constant (256 bits). The 256 bits of $W_8$ to $W_{15}$ are fixed constants and do not need to be stored and pipelined in phase 1. In phase 2, $W_{24}$ to $W_{30}$ must be stored and pipelined. Therefore, the datapath bit-width of CME3$_{24}$ to CME3$_{30}$ increases from 288 bits to 480 bits, i.e., by one 32-bit word per block. In phase 3, the datapath bit-width of CME3$_{31}$ to CME3$_{53}$ is 512 bits without optimization. The datapath bit-width of CME3$_{54}$ to CME3$_{60}$ is reduced from 480 bits to 32 bits. To show that the datapath bit-width adjustment is appropriate, we present the detailed data flow and the computational circuit of the CME3 process in Fig. 11. In this figure, each number represents the index $j$ of $W_j$. For example, each block from CME3$_0$ to CME3$_{15}$ requires eight 32-bit registers (equivalent to 8 × 32 = 256 bits) to store $W_0$ to $W_7$. These values are required to calculate the blocks from CME3$_{16}$ to CME3$_{22}$. As another example, block CME3$_{59}$ requires five 32-bit registers (5 × 32 = 160 bits) to store $W_{44}$, $W_{45}$, $W_{53}$, $W_{58}$, and $W_{59}$, which are required for the CME3$_{60}$ calculation.
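The CME3$_{59}$ example can be reproduced with a small liveness computation: a word must be kept after block $j$ only if a later block still reads it. The sketch below makes two assumptions that are not taken from the paper: each block also registers its own output word, and constant or zero words are not treated specially. It is meant to illustrate the idea and is not claimed to match every width shown in Fig. 9 or Fig. 11.

```python
def live_words(j: int, last: int) -> list:
    """Indices of the W words that must be held after block j.

    last is the index of the final implemented block (60 for the
    SHA-256_3 datapath described above, 63 for SHA-256_2).
    """
    needed = {j}  # assumption: the block's own output is registered
    # Only blocks with index >= 16 read earlier words via the recurrence.
    for k in range(max(j + 1, 16), last + 1):
        for src in (k - 2, k - 7, k - 15, k - 16):
            if src <= j:
                needed.add(src)
    return sorted(needed)


def datapath_bits(j: int, last: int) -> int:
    """Register width after block j, at 32 bits per live word."""
    return 32 * len(live_words(j, last))


if __name__ == "__main__":
    print(live_words(59, 60), datapath_bits(59, 60))
```

For the final block the same computation leaves a single 32-bit word, in line with the narrowing of the datapath to 32 bits at the end of phase 3.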

IV. EVALUATION
In this section, we evaluate the efficiency of the CME method when it is applied in the CME double SHA-256 accelerator. We evaluate the performance from three aspects: theory, ASIC, and FPGA experimental results.

A. THEORETICAL REVIEW
For comparison purposes, we developed three hardware circuits, all of which follow the architecture proposed in Fig. 5. The three circuits differ only in how they implement the ME processes of SHA-256$_2$ and SHA-256$_3$. The first circuit (named Prototype double SHA-256) was proposed in [23] and is mentioned in Section II-B. The second circuit (named Optimized double SHA-256) was proposed in [25]-[28] and is mentioned in Section II-C. The last circuit is our proposed CME double SHA-256. Table 1 shows the theoretical hardware resources required by the three architectures in terms of the number of adders, XOR gates, rotations, shifts, and registers. In Table 1, SHA-256$_2$ and SHA-256$_3$ are the evaluation targets because they are the most hardware-intensive parts.
Compared to the prototype and optimized architectures, the proposed architecture respectively decreases the total number of 32-bit adders by approximately 19.1% and 16.49%, the total number of 32-bit XOR gates by approximately 12.5% and 9.68%, and the total number of 32-bit rotation operations by approximately 11% and 8.17%.
In addition, the proposed architecture reduces the total number of 32-bit shift operations by approximately 19.8% and 17.2% compared to the prototype and optimized architectures, respectively.
Notably, the proposed architecture eliminates 33.2% and 16.79% of the total number of registers compared to the prototype and optimized architectures, respectively.

B. ASIC EXPERIMENT
1) AREA AND POWER APPROACH
To ensure a fair comparison, the three double SHA-256 circuits were coded in Verilog and synthesized for an ASIC using the Synopsys Design Compiler with the Rohm 0.18 µm CMOS standard cell library [29]. Table 2 shows the synthesized area of the three architectures. Note that the total area is the sum of the combinational area, the non-combinational area (registers), and other types of circuits, including buffers/inverters, wires, etc. The total area of the proposed CME double SHA-256 is smaller by 17.6% and 13.9% compared to the prototype and optimized architectures, respectively. Fig. 12 summarizes the power consumption of the three architectures obtained from the ASIC synthesis results. In terms of cell internal power, the proposed double SHA-256 circuit consumes 133 mW, which is a reduction of 15.82% and 11.92% compared to the prototype and optimized architectures, respectively. In terms of net switching power, the proposed CME double SHA-256 circuit consumes 95 mW, which constitutes reductions of 12.04% and 9.52% compared to the prototype and optimized architectures, respectively. These power reductions are due to the smaller hardware circuit, which matches our expectations.
Based on the timing report of ASIC synthesis, the maximum frequency of the three architectures is 60 MHz. This means that the architectures achieve throughput of 1024 bits × 60 MHz = 61.44 Gbps.
In addition, we successfully laid out the proposed CME double SHA-256 circuit in ASIC technology with the Rohm 0.18µm CMOS standard cell library. Fig. 13 shows the chip layout, and Fig. 14 shows the chip energy distribution map. The size of the chip layout is 5.9 mm × 5.9 mm.

2) PROCESSING RATE AND HARDWARE EFFICIENCY APPROACH
In this experiment, we prove that the ASIC design of our proposed CME double SHA-256 architecture outperforms previous works in terms of processing rate and hardware efficiency. To ensure a fair comparison, we also synthesized our architecture in ASIC TSMC 0.18µm technology using the CMOS standard cell library. We then compare our results with the previous works in [15], [16], and [17].
The comparison is shown in Table 3. It is worth noting that the designs of [15], [16], and [17] are single SHA-256 circuits. To be applied to Bitcoin mining, these circuits must repeat their calculations three times to generate a double SHA-256 hash value from the 1024-bit input message. The number of cycles required to compute the double SHA-256 (denoted by $C_d$) is thus triple the number of cycles required to compute a single SHA-256 (denoted by $C_s$); refer to (9).

$$C_d = 3 \times C_s \quad (9)$$

Then, we calculate the processing rate for double SHA-256 ($R_d$) by using (10). The BlockSize is 1024 bits.
From the $R_d$ and area results, the hardware efficiency for double SHA-256 (denoted by $E_d$) is computed by (11). Table 3 summarizes the synthesized area results, the calculated processing rate, and the hardware efficiency. The processing rate ($R_d$) and hardware efficiency ($E_d$) of our proposed architecture are significantly improved compared to those of the works in [15], [16], and [17]. The numerical results are as follows.
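The metrics used in this comparison can be sketched as below. Equations (10) and (11) are not reproduced in the text here, so the forms used in the code are an assumed reading: the rate is taken as BlockSize × frequency / $C_d$, which is consistent with the 1024 bits × 60 MHz = 61.44 Gbps figure reported for the proposed design, and the efficiency is taken as rate divided by area. The function names and the area units are hypothetical.

```python
def cycles_double(cycles_single: int) -> int:
    """C_d: a single SHA-256 core must run three times per double hash."""
    return 3 * cycles_single


def processing_rate_bps(block_size_bits: int, f_hz: float, cycles: int) -> float:
    """Assumed form of R_d: bits processed per second."""
    return block_size_bits * f_hz / cycles


def hardware_efficiency(rate_bps: float, area: float) -> float:
    """Assumed form of E_d: rate per unit area (units follow the inputs)."""
    return rate_bps / area


if __name__ == "__main__":
    # Proposed design: one double hash per cycle at 60 MHz.
    print(processing_rate_bps(1024, 60e6, 1) / 1e9, "Gbps")
    # A hypothetical 64-cycle single SHA-256 core at the same clock.
    print(processing_rate_bps(1024, 60e6, cycles_double(64)) / 1e6, "Mbps")
```

The second line of the demo shows why iterative single-core designs trail a fully unrolled pipeline at equal clock frequency: the cycle count in the denominator grows from 1 to 192.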
The results are shown in Table 4. It is worth noting that the existing architectures in [18]-[23] are single SHA-256 architectures that must repeat the computation three times to generate a double SHA-256 hash value for Bitcoin mining. Thus, the number of clock cycles required to compute a double SHA-256 is tripled. We focus on evaluating the hardware efficiency (Mbps/LUT) of the single and double SHA-256 architectures in this subsection. In general, the proposed CME double SHA-256 outperforms the existing SHA-256 architectures in terms of hardware efficiency. The numerical results are as follows.
On the Artix 7 FPGA, the hardware efficiency of the proposed architecture is enhanced by 332% (2.94 vs. 0.68) and 7% (2.94 vs. 2.76) compared to the hardware efficiencies of the architectures in [22] and [23], respectively.
On Zynq UltraScale+ ZCU102 FPGA, the hardware efficiency of the proposed architecture is enhanced by 7% (8.32 vs. 7.8) compared to the hardware efficiency of the architecture in [23].

C. FPGA EXPERIMENT
1) FUNCTIONAL VERIFICATION ON A REAL SoC HARDWARE PLATFORM
To prove that the circuit operates correctly not only in the software simulation tool but also on real hardware, we built a System on Chip (SoC) platform to execute the proposed CME double SHA-256 circuit. The SoC platform overview is shown in Fig. 15.  The platform includes two primary components: a host PC and a Zynq UltraScale+ ZCU102 evaluation board. The host PC exchanges data with the ZCU102 board via JTAG and UART cables.
The ZCU102 board includes an ARMv8 microprocessor, programmable logic (PL), and a clock generator. Our developed circuit, CME double SHA-256, is embedded in the PL of the ZCU102. The PL also has block RAM (BRAM) and an integrated logic analyzer (ILA). We used the BRAM to store the valid nonce value for Bitcoin mining and the ILA to monitor the outputs of the CME double SHA-256 circuit. The maximum operating frequency of the ZCU102 board is 333 MHz.
The host PC runs Vivado, a Software Development Kit (SDK), and a Bitcoin Mining Verification (BMV) program. Vivado is a software suite for SoC development. We use the Vivado suite to design and load the SoC-based system onto the Zynq UltraScale+ ZCU102 board. Moreover, Vivado exports the outputs of the CME double SHA-256 circuit in the ZCU102 into an ILA result file for verification by the BMV program. The SDK is intended for the development of embedded software applications for SoC systems. We use the SDK to embed real block information from the Bitcoin blockchain network into our SoC-based system. The BMV is a C program that verifies the correctness of the embedded CME double SHA-256 circuit. The BMV executes a double SHA-256 on the host PC and compares the results with the outputs of the CME double SHA-256 circuit. The abovementioned SoC system has been used to thoroughly verify the correctness of the CME double SHA-256 circuit at different operating frequencies, such as 333 MHz (maximum frequency) and 200 MHz. All the cases result in 100% accuracy, which shows that the proposed CME double SHA-256 architecture works correctly on a real hardware platform. The maximum processing rate of the circuit on the ZCU102 board is 333 MHash/s (or 333 MHz × 1024 bits/cycle = 340.992 Gbps). Fig. 16 shows an image of the SoC evaluation platform, which includes a host PC (Toshiba Satellite B652/G, Core i5 3320M 2.6 GHz, 4 GB) and the UltraScale+ ZCU102 evaluation board.

2) PROCESSING-RATE EVALUATION ON A REAL HARDWARE PLATFORM
In this subsection, we evaluate the processing rate and power consumption of the proposed CME double SHA-256 on the real hardware platform ZCU102 to show that our architecture outperforms other high-performance platforms, including CPUs, GPUs, and the existing SHA-256 architectures. Table 5 shows the execution time of the double SHA-256 algorithm on several hardware platforms, including a CPU, GPU, and FPGA. To compute the same number of hashes (e.g., 500,000 hashes), the proposed architecture running on the FPGA ZCU102 requires only 1.5 ms, while the CPU i7-6950X, CPU XEON 6144, and GPU Tesla V100 require 770 ms, 740 ms, and 140 ms, respectively. In other words, the proposed architecture is faster by factors of 513, 493, and 93, respectively.
Table 6 summarizes the hash rate and power consumption from several studies that reported double SHA-256 results. As the table shows, the hash rate of our proposed architecture running on an FPGA is significantly higher than those of the works in [30] and [31]. Although [32] was executed on an ASIC and our architecture was executed on an FPGA, our architecture still achieves the same hash rate but consumes less power.
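The speed-up factors quoted from Table 5 follow directly from the reported execution times, as the short check below shows. The dictionary and function names are illustrative; the timings are the ones stated in the text.

```python
HASHES = 500_000
FPGA_MS = 1.5  # proposed architecture on the ZCU102
PLATFORM_MS = {
    "CPU i7-6950X": 770.0,
    "CPU XEON 6144": 740.0,
    "GPU Tesla V100": 140.0,
}


def speedups(platform_ms=PLATFORM_MS, fpga_ms=FPGA_MS):
    """Execution-time ratio of each platform to the FPGA design."""
    return {name: t / fpga_ms for name, t in platform_ms.items()}


def hash_rate_mhs(hashes: int, time_ms: float) -> float:
    """Hash rate in MHash/s implied by a hash count and a time in ms."""
    return hashes / (time_ms * 1e-3) / 1e6


if __name__ == "__main__":
    for name, ratio in speedups().items():
        print(f"{name}: {ratio:.1f}x")
    print(f"FPGA rate: {hash_rate_mhs(HASHES, FPGA_MS):.1f} MHash/s")
```

The implied FPGA rate of roughly 333 MHash/s is consistent with one hash per clock cycle at the 333 MHz operating frequency reported for the ZCU102.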

V. CONCLUSION
Bitcoin mining is an important process in keeping the Bitcoin network secure; however, it consumes massive amounts of energy. To reduce the power consumption and increase the processing rate of the Bitcoin mining process, we proposed a CME double SHA-256 hardware circuit in this paper. The architecture includes three SHA-256 circuits, in which the first circuit (SHA-256$_1$) is a resource-sharing architecture while the last two circuits (SHA-256$_2$ and SHA-256$_3$) are fully unrolled datapath architectures. The combination of these two types of architecture results in a high processing rate but low hardware costs. Specifically, we proposed several compact message expander (CME) algorithms and associated hardware architectures to further reduce the power consumption and hardware costs. Our proposed circuit generates one 256-bit hash value per clock cycle. We thoroughly verified and evaluated the proposed circuit on both ASIC and FPGA platforms. The experimental results showed that the proposed circuit outperforms other high-performance CPU and GPU platforms for computing double SHA-256 values. The proposed circuit also outperforms existing works with specific hardware circuits for computing double SHA-256 values. The double SHA-256 circuit was laid out on an ASIC with the Rohm 0.18 µm CMOS standard cell library, resulting in a chip size of 5.9 mm × 5.9 mm and a throughput of 61.44 Gbps. The circuit was also shown to work correctly on a real hardware platform (ZCU102), achieving a processing rate of 340.992 Gbps.
Blockchain is not limited to the Bitcoin network. Blockchain technology is growing in its potential to be applied in many fields of life, such as smart health care, autonomous cars, and supply chains. Other blockchain networks may employ not only SHA-256 but also other cryptographic hash functions, such as SHA-512 or SHA-3. Therefore, developing a flexible and programmable accelerator that can compute several hash functions is a future need. By developing a low-cost, low-power blockchain accelerator, we help to enhance the security and decentralized features of the blockchain network. We therefore believe that developing a blockchain accelerator that can compute multiple cryptographic hash functions at low cost and with low power consumption will be an important research trend in the near future.