A Case for Space Compaction of B-Tree Nodes on Flash Storage

It is well known that B-tree nodes are, on average, only two-thirds full. Such low space utilization is detrimental to flash storage in terms of both cost and performance. In particular, as the database grows, the logical space waste in B-tree nodes exacerbates the physical write amplification inside flash devices, degrading write performance and eventually transaction throughput. To address this problem, we propose two space compaction techniques for B-tree nodes: space reservation and data redistribution. The space reservation technique dynamically reserves free space per index to prevent node splits caused by future updates, while data redistribution moves records to a neighboring node before a split occurs. Experimental results with MySQL show that our approach reduces the in-use database space by up to 40% and doubles throughput. In addition, our approach outperforms MyRocks while using less space. Our experimental results confirm that space compaction techniques considered obsolete on hard disks are very rewarding on flash storage.


I. INTRODUCTION
B-tree is arguably the most popular index structure, used by many database engines. In some database systems such as MySQL and SQLite, B-trees are also used to store the tables themselves. The B-tree structure effectively supports both range and point queries and guarantees a predictable response time [1]. Unfortunately, B-tree has been criticized for its poor space utilization; on average, index nodes are only 2/3 full [2].
Meanwhile, flash storage has been replacing hard disks as database storage because of its fast I/O speed. With flash storage, however, the low space utilization of B-trees becomes a fatal drawback. This is because the $/GB of flash storage, though gradually dropping, is still quite high. In practice, low space utilization is detrimental to the operating costs of data centers and the profit of cloud service providers [2]. To make matters worse, the increased logical space amplification reduces the physical over-provisioning (OP) area in effect available to the FTL (Flash Translation Layer) module, which can in turn amplify physical writes during garbage collection. Consequently, the increased write amplification results in poor I/O performance and transaction throughput.
(The associate editor coordinating the review of this manuscript and approving it for publication was Cristian Zambelli.)
In this paper, we propose two space optimization techniques for B-tree nodes. The first is space reservation, which dynamically reserves a different amount of free space per index to accommodate future record growth and thus avoid node splits. The second is data redistribution, which, when an index node is full, redistributes its entries to a sibling node with free space instead of splitting the node. We prototype both techniques in MySQL/InnoDB, which uses B-trees to store tables as well as indexes [1]. Evaluation results confirm that our approach improves TPC-C throughput by 96% with 40% less storage space and cost than vanilla MySQL. Compacting B-tree nodes also increases the buffer hit ratio, which partly accounts for the performance improvement. Additionally, our space-optimized MySQL delivers 16% higher throughput than MyRocks while using less storage space.
To the best of our knowledge, this is the first work to show that compacting logical space in B-tree nodes on flash storage is beneficial for both storage cost savings and transaction throughput. Note that with hard disks, space compaction leads to neither better performance nor meaningful cost savings, so it has been considered mostly obsolete. In particular, the extra read I/O for visiting a sibling node during data redistribution is unbearable on hard disks with their slow read speed. As will be shown in Table 6, space compaction techniques degrade performance on hard disks. In contrast, with flash storage offering fast reads and slow writes, the extra reads for visiting siblings are worth trading for storage space savings and lower write amplification. In particular, the lowered WAF eventually boosts write speed in return for the extra reads. These benefits also hold for other storage engines using B-trees.

II. BACKGROUND
This section details the impact of space waste in B-tree nodes on flash storage in terms of performance and cost. We also review two node split mechanisms of MySQL and explain why index nodes tend to split heavily in OLTP workloads.

A. IMPLICATION OF SPACE AMPLIFICATION IN SSD
To understand how the storage space in use affects the performance of flash storage and of the application, we measure the write amplification factor (WAF), IOPS, and TPS while running the TPC-C benchmark on MySQL. We generate 1,500 warehouses (i.e., about 130 GB of initial database) to fill an SSD with 250 GB capacity. As shown in Fig. 1 (a), as the logical space in use by the database grows over time, the running WAF worsens as well. In other words, as the size of the physical OP area shrinks, the write amplification due to garbage collection intensifies, which is especially true for OLTP workloads with random writes [3]. The increased WAF, in turn, significantly lowers IOPS and TPS, as shown in Fig. 1 (b).
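For reference, WAF is simply the ratio of bytes physically written to flash to bytes logically written by the host. A small helper makes the definition concrete; the counter values below are made up for illustration:

```python
def waf(host_bytes: float, flash_bytes: float) -> float:
    """Write amplification factor: physical flash writes / logical host writes."""
    return flash_bytes / host_bytes

# e.g., the host issued 100 GB of writes, but garbage collection caused the
# device to write 230 GB of flash pages in total
print(waf(100, 230))  # -> 2.3
```

As the in-use logical space grows and the effective OP area shrinks, garbage collection must relocate more valid pages per erase, pushing this ratio up.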
We also run the TPC-C benchmark for 60 hours with 2,000 warehouses on SSD-A, SSD-B, and an HDD and compare TPS and database size at two different points in time. As presented in Table 1, TPS at the 60th hour is lower than at the 30th hour on both SSD-A and SSD-B. In contrast, on the HDD, TPS is barely affected by the increased space usage.
The results in Fig. 1 and Table 1 indicate that the space amplification of B-trees makes MySQL suboptimal on flash storage in terms of cost and performance. Recall that, on average, B-tree nodes are only 2/3 full, and the remaining space is empty. Since the physical footprint scales with the inverse of utilization, raising the average utilization to, say, 90% would let the same logical data fit in (2/3)/0.9 ≈ 74% of the pages, reducing the total in-use storage space by more than 20%. This means that considerable performance gains can be achieved by compacting the in-use space of the B-tree. In addition, given that the cost of flash storage (i.e., $/GB) is still considered high and operating costs are a crucial competitive edge for cloud service providers [2], space-efficient use of flash storage is a top priority. Hence, it is imperative to improve the space utilization of B-trees on flash storage devices.
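The footprint arithmetic above can be checked in a few lines; the utilization figures are illustrative:

```python
# The physical footprint of an index scales inversely with node utilization:
# storing L bytes of logical data at utilization u occupies L / u bytes.
def footprint(logical_bytes: float, utilization: float) -> float:
    return logical_bytes / utilization

before = footprint(100.0, 2 / 3)  # average B-tree fill factor (2/3)
after = footprint(100.0, 0.90)    # a hypothetical compacted fill factor
saving = 1 - after / before
print(f"{saving:.0%}")  # -> 26%
```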

B. TWO TYPES OF NODE SPLIT IN MySQL
When B-tree nodes are split, the physical space used by the database increases. If there is no room left on a page for an INSERT or UPDATE, the page has to be split in two. In most cases, a normal split is performed: as depicted in Fig. 2 (a), the page is split at its middle record, distributing half of the existing entries to each of the two pages. On the other hand, if data is added to the node in ascending order of the index key, the split point becomes the rightmost record, as illustrated in Fig. 2 (b). We call this type of split an appending split.
Now, let us explain how node splits work in MySQL. Nodes in a B-tree index are guaranteed to be at least half full and are on average 2/3 full, but the actual space occupancy varies with the index type and the index update pattern. For clustered indexes (i.e., primary indexes that store table data), MySQL intentionally leaves 1/16 of every node free to absorb future updates of existing records. In contrast, for secondary indexes, no free space is reserved. In realistic workloads such as TPC-C and LinkBench, records are inserted into tables either in key order or randomly. Except for append-only tables, such mixed insertion patterns yield frequent normal splits. Note that, in a normal split, the 1/16 of reserved free space on each node is not leveraged as intended but simply wasted.
Although we detailed the node split algorithm of MySQL, we also surveyed the B-tree implementations of other DBMSs, including PostgreSQL [4], MariaDB [5], Percona Server [6], and Oracle [7], by reading the documentation and/or B-tree source code of their latest versions, and found that they take basically the same approach as MySQL/InnoDB with regard to space reservation and node split. Thus, we believe that our space-saving schemes are also effective on other DBMSs.
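The two split policies described above can be sketched in a few lines (a simplified model with integer keys, not InnoDB's implementation; the function name is ours):

```python
# Toy model of the two split types: a full node splits at its middle record
# (normal split) unless the new key falls past the rightmost record, in which
# case the split point is the end of the node (appending split).
def split(keys: list[int], new_key: int) -> tuple[list[int], list[int], str]:
    """Split the full node `keys` while inserting `new_key`."""
    if new_key > keys[-1]:
        # ascending-order insert: the old node keeps everything, the new
        # node receives only the new key
        return keys[:], [new_key], "appending"
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    (left if new_key < right[0] else right).append(new_key)
    left.sort()
    right.sort()
    return left, right, "normal"

print(split([10, 20, 30, 40], 50)[2])  # -> appending
print(split([10, 20, 30, 40], 25)[2])  # -> normal
```

An appending split leaves the left node full, which is why sequential loads achieve high utilization, whereas a normal split leaves both nodes only about half full.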

C. INDEX STATISTICS IN OLTP WORKLOADS
In this section, we present the space utilization and split patterns of B-tree nodes while running TPC-C and LinkBench on MySQL. We then analyze the leading cause of low space utilization based on these two statistics. In particular, we observe that the existence of numerous half-full frozen pages further deteriorates B-tree space efficiency.

1) INDEX SPACE UTILIZATION
To understand how the space utilization of B-tree nodes changes depending on the access pattern and index type, we measure the space utilization of leaf nodes in primary and secondary indexes right after loading the TPC-C data. As shown in Table 2, the average space utilization of primary index pages is high (92%) in the data load phase (denoted as Sequential). This is because data is added in ascending order, so every leaf node undergoes an appending split; as a result, 15/16 of the page space, i.e., all but the reserved area, is filled. We also measure the index space utilization after running the TPC-C benchmark for 110 hours (denoted as Random). Interestingly, as shown in Table 2, the average space utilization of primary indexes drops from 92% to 60-85%. This decline is neither expected nor intuitive, given that data is still inserted into primary indexes in ascending key order. The reason is explained in detail below.
Meanwhile, the average space utilization of secondary indexes is only 60-70% regardless of the access pattern. This is because inserts into secondary indexes are usually random, so their leaf nodes are subject to normal splits.

2) SPLIT PATTERNS
To understand split patterns in B-trees, we collect split-type statistics by index type and SQL operation while loading and running the TPC-C benchmark on MySQL. As presented in Table 3, appending splits account for 43% of all splits in the Order-Line table. This is mainly because, in large tables such as Order-Line, records are inserted into the primary index in ascending key order. Another 19% of splits are normal splits in secondary indexes. This is expected, given that index entries are inserted into secondary indexes mostly in random key order, so secondary indexes experience normal splits.
The remaining 38% of all splits are normal splits on primary indexes. As previously discussed, MySQL intentionally reserves 1/16 of the page space to absorb future record growth and avoid further page splits. In many cases, however, the reserved space is not large enough to absorb the record growth caused by UPDATE statements. For example, in the Order-Line table of TPC-C, records are inserted in ascending key order by New-Order transactions, so the leaf nodes of the primary index undergo appending splits, leaving each node 15/16 full. However, as orders are delivered by Delivery transactions, the corresponding records in the Order-Line table are updated to record the delivery time, so their sizes grow. As more records on a node grow and its reserved space runs out, the node must perform a normal split. This is why, as indicated in the last row of Table 3, normal splits occurring when primary indexes are updated account for 38% of the total splits.
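A back-of-the-envelope calculation shows how quickly the fixed 1/16 reservation is exhausted; the record sizes used here are the illustrative 61-to-66-byte growth from Section III:

```python
PAGE = 16_384                  # 16 KB InnoDB page
reserved = PAGE // 16          # fixed 1/16 reservation: 1,024 bytes of slack
growth = 66 - 61               # each delivery UPDATE grows a record by 5 bytes
absorbed = reserved // growth  # updates absorbed before a normal split
print(absorbed)  # -> 204
```

Once roughly 200 records on a 15/16-full node have grown, the node must perform a normal split, after which both half-full halves may never be touched again.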
To make matters worse, after a primary index undergoes a normal split and all order-line records on both nodes are marked as delivered, the two newly split nodes and their records are no longer accessed. That is, the split nodes become frozen [8]. The problem is that they are only half full. As shown in Table 3, normal splits triggered by UPDATE statements account for 38% of the total splits, resulting in numerous frozen pages. This growing number of frozen pages worsens space utilization. Such normal splits and frozen pages, which degrade B-tree space utilization, are not confined to TPC-C but are ubiquitous in other OLTP workloads such as LinkBench [9].

III. SPACE-OPTIMIZING TECHNIQUES
This section proposes two space-optimization techniques for B-tree-based database storage engines. The first is space reservation: reserving free space per individual node to avoid normal splits due to record growth. Instead of giving all indexes the same amount of reserved space, our design adjusts the amount of reserved free space per index. The second is data redistribution: redistributing index node entries to a neighboring node to avoid normal splits during random INSERTs. Both optimizations aim to store records inside B-tree nodes more compactly, so that storage engines can keep the same amount of logical data in less space. Recall that, on flash storage, a smaller in-use footprint leads to a lower WAF and better I/O performance.
While MySQL supports zlib-based page compression as a way to minimize space amplification [1], it has two drawbacks. First, it incurs excessive CPU overhead for compressing and decompressing pages whenever the space for the modification log runs out [2]. Second, it causes memory amplification, since both the compressed and decompressed versions of each page must be cached in the data buffer [1]. In a separate experiment, we found that while the zlib-based MySQL version can save considerable space, its CPU overhead is too high, resulting in lower performance than the no-compression version. This result is consistent with a recent study [2].
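The space-versus-CPU tradeoff can be glimpsed even with a toy page; the page content below is synthetic, and real compression ratios depend on the actual row data:

```python
import time
import zlib

# Build a synthetic, highly repetitive 16 KB "page" and compress it the way
# zlib-based page compression would.
page = (b"order_line,42,delivered,2023-01-01;" * 500)[:16_384]

start = time.perf_counter()
compressed = zlib.compress(page, level=6)
cpu_ms = (time.perf_counter() - start) * 1_000
print(len(compressed), f"{cpu_ms:.3f} ms")  # compressed size, CPU time spent
```

The ratio can look attractive, but the CPU cost is paid on every page compression and decompression, which is what makes the scheme slower than no compression under OLTP load.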

A. PER-INDEX FREE SPACE RESERVATION
MySQL already has a mechanism to reserve free space: 1/16 of the space in clustered index pages is left free to avoid future normal splits. However, as exemplified in Section II-C, if the fraction of free space is too small for a primary index, that index undergoes excessive normal splits due to record growth, and the newly split pages become frozen while only half full. Meanwhile, for append-only tables whose records are inserted but never updated, the free space is simply wasted; in this case, the space would be better utilized by minimizing the reservation. In short, a fixed-size space reservation is too inflexible to prevent excessive node splits without wasting space.
To address such space inefficiencies, we suggest allocating a different amount of free space to each B-tree depending on the update pattern of the index; we call this technique space reservation. When an UPDATE query that grows the record size is detected, the reserved free space (RS) of the B-tree leaf nodes is updated according to (1):
RS = P - ⌊P / s'⌋ × s,  (1)
where P is the page size, s is the current (average) record size, and s' is the grown record size. This way, each node can contain as many records as possible while still avoiding normal splits. If the record sizes differ, we rely on the average record size.
Suppose records of size 61 bytes are appended to a primary index in ascending order of the index key. In vanilla MySQL with a 16 KB page size (16,384 bytes), a single node will contain 251 records, and once more than 205 of them are updated and grow to 66 bytes, a normal split occurs. With space reservation, in contrast, each node initially stores only 248 records, deliberately leaving 1,256 bytes of each index leaf node free. This enlarged free space absorbs the growing records without a node split; hence, the nodes generated by appending splits, which are mostly full of data, are unlikely to undergo normal splits due to record growth. Conversely, if an index receives no size-growing UPDATEs (e.g., append-only tables), there is no need to reserve extra space; in this case, we completely remove the reserved free space of the corresponding index's leaf nodes and store as many records as possible.
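The reservation rule and the worked example above can be condensed into a short sketch (a model of the policy, not the InnoDB patch itself; the function name is ours):

```python
def reserved_space(page: int, s_cur: int, s_grown: int) -> tuple[int, int]:
    """Store only as many records as still fit after they grow from s_cur to
    s_grown bytes, reserving the rest of the page as free space.
    Returns (records_to_store, bytes_reserved)."""
    capacity = page // s_grown              # records that fit after growth
    return capacity, page - capacity * s_cur

records, rs = reserved_space(16_384, 61, 66)
print(records, rs)  # -> 248 1256
```

All 248 records can then grow to 66 bytes (16,368 bytes in total) without triggering a split.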

B. DATA REDISTRIBUTION
To address the space inefficiency caused by normal splits, we propose data redistribution. The mechanism is detailed in Algorithm 1. If a leaf node NODE_cur does not have room to accommodate a new entry for a random insert, we first check whether its two sibling nodes have free space to store the new entry. If neither sibling has enough free space, NODE_cur has to perform a normal split. Otherwise, a subset of the records in NODE_cur is migrated to the sibling node NODE_target with more free space, so that NODE_target and NODE_cur end up with a similar amount of free space. While a node split moves records to a newly allocated node, data redistribution moves records to an existing adjacent node. Hence, redistribution does not allocate new nodes but stores the same amount of data in fewer nodes. By minimizing the in-use space, redistribution can improve I/O performance and transaction throughput through reduced physical write amplification on flash storage.
Jannink proposed a similar idea of redistributing data among neighbor nodes in 1995 [10]. However, our work differs in the target node and in when redistribution is invoked. The previous work lazily addresses the space utilization issue by shifting data from a neighbor node to the current node where a deletion is performed. The purpose of the move is to increase the chance of merging the current node and its neighbor into one, which is only possible if the two nodes are each less than half full. Though it may seem efficient, this can easily become a wasted effort if subsequent inserts split the merged node. In contrast, our data redistribution moves data from a split-target node to a neighbor node before the split is conducted.
Since our goal is to aggressively reduce the number of splits and maximize page space utilization, calling data redistribution before a split reduces the number of splits and alleviates space amplification more effectively than calling it during deletion.
Notably, the space compaction from data redistribution is more appealing on flash SSDs than on slow hard disks. To check whether redistribution is possible, we must visit both sibling nodes and may thus need to issue one or two read operations when the nodes are not cached in the buffer. On hard disks, the benefit of delaying node splits through redistribution is not large enough to justify the slow read operations. In contrast, on SSDs with fast reads, the cost of reading the sibling(s) is offset by the benefits of redistribution (i.e., reduced WAF and lower storage cost). Nevertheless, to minimize redistribution cost, we choose to move records only to the one sibling with more free space, not to both siblings as in the work by Maelbrancke and Olivié [11].
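The redistribution-before-split idea of Algorithm 1 can be sketched as follows. This toy model uses sorted lists of integer keys instead of InnoDB pages, considers only left siblings so that key order is easy to preserve, and omits the separator-key updates a real B-tree would make in the parent node:

```python
import bisect

CAPACITY = 4  # records per node, kept tiny for illustration

def insert(cur: list[int], left_sibs: list[list[int]], key: int) -> bool:
    """Insert `key` into node `cur`; return False when a normal split is
    unavoidable (i.e., no sibling has free space either)."""
    if len(cur) < CAPACITY:
        bisect.insort(cur, key)
        return True
    # pick the sibling with the most free space
    target = min(left_sibs, key=len, default=None)
    if target is None or len(target) >= CAPACITY:
        return False  # no sibling can help: fall back to a normal split
    # shift the smallest records leftward until free space is balanced
    while len(cur) > len(target) and len(target) < CAPACITY:
        target.append(cur.pop(0))
    bisect.insort(cur, key)
    return True

cur, sib = [10, 11, 12, 13], [1, 2]
print(insert(cur, [sib], 14), sib, cur)  # -> True [1, 2, 10] [11, 12, 13, 14]
```

In the example, the full node sheds its smallest record to the emptier left sibling instead of allocating a new node, so the insert of key 14 completes without a split; only when the siblings are also full does the function report that a normal split is required.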

IV. EVALUATION
We use a Linux platform equipped with an Intel Xeon Silver 4216 CPU at 2.10 GHz with 32 cores and 32 GB of main memory. To evaluate our techniques, we prototype the two space-optimization techniques in MySQL/InnoDB and measure various performance metrics, such as transaction throughput, IOPS, buffer pool hit ratio, WAF, and the space utilization of B-tree nodes, while running two OLTP benchmarks, TPC-C and LinkBench. As database storage, we use the two SSDs listed in Table 1.
A. RUN-TIME PERFORMANCE

1) TPC-C
To verify the effect of the proposed techniques, we measure three metrics (i.e., database size, WAF, and TPS) over time while running the same TPC-C benchmark used for Fig. 1 on SSD-A until 36 million transactions are completed. We repeat the same experiment with four MySQL versions: vanilla MySQL, MySQL with space reservation, MySQL with data redistribution, and MySQL with both techniques. The three metrics from the four experiments are plotted in Fig. 3 (a), (b), and (c), respectively. It is evident from the figures that the two techniques effectively reduce the total database size, thereby reducing WAF and ultimately improving transaction throughput. Because transactions are processed faster in the optimized version, it makes more sense to compare performance metrics when the same number of transactions has been executed in both versions. Hence, we compare database size and TPS at the point when 36 million transactions have been executed. As shown in the figures, our approach consumes 40% less space and improves throughput by 96%.
Fig. 3 (d) compares the space utilization of the B-trees for three ever-growing tables in TPC-C (i.e., Order-Line, History, and Order) after the TPC-C benchmark completes. The figures clearly show that our approach improves the space utilization of both primary and secondary indexes for all three tables. With the help of space reservation, the space utilization of the three primary indexes increased to up to 90%. Likewise, with the help of redistribution, the space utilization of all secondary indexes increased to up to 86%.
Table 4 compares performance metrics measured after executing 36 million transactions with vanilla and space-optimized MySQL, respectively. First, our approach reduces the space growth per transaction (bytes/tx) by 40% over vanilla. Second, the optimized MySQL shows 68% higher read IOPS and 86% higher write IOPS. This directly reflects the large gap in running WAF between the two MySQL versions in Fig. 3 (b).
Lastly and interestingly, space compaction improves the buffer hit ratio by 0.7 percentage points; in terms of the miss ratio, our approach reduces it by 20% (i.e., from 3.5% to 2.8%). This partly explains why the optimized version improves transaction throughput by 70% over vanilla.

2) LINKBENCH
To validate that our approach works for other workloads and different SSDs, we measure the same metrics as for TPC-C while running LinkBench on SSD-B until 2.5 billion operations are processed. For simplicity of presentation, Fig. 4 shows results only for two MySQL versions, vanilla and space-optimized MySQL. Comparing database size and OPS at the point when 2.5 billion operations have been executed, our approach improves OPS by 23% while using 14% less space and reduces WAF from 1.67 to 1.46. In addition, the optimized version improves the average space utilization from 66-75% to 80-86%. Table 5 compares other I/O metrics during the last hour of LinkBench between the two MySQL versions. Compared to vanilla, the optimized version reduces the space growth per transaction by 15% (the third column in Table 5), improves read IOPS by 60% and write IOPS by 66%, and reduces the miss ratio by 12% (i.e., from 3.2% to 2.8%), improving throughput by 12%.

3) COMPARISON WITH MyRocks
MyRocks provides a MySQL interface on top of the LSM (log-structured merge)-tree-based RocksDB storage engine, which is claimed to have better space and write efficiency than InnoDB [2], [12]. To verify whether space-optimized MySQL is comparable to MyRocks in terms of space efficiency and performance, we measure database size, WAF, and TPS while running TPC-C on SSD-B until 60 million transactions are executed. We compare vanilla MySQL, optimized MySQL, and MyRocks and plot the TPC-C results in Fig. 5.
While the database size of MyRocks is less than 50% of either MySQL version right after initial data loading (i.e., with 0 transactions processed), it grows rapidly until 30 million transactions are completed. As displayed in Fig. 5 (a), despite the immutability of LSM tree components, MyRocks eventually occupies even more space than optimized MySQL due to compaction and level growth in the LSM tree [2]. In addition, the amplified database size deteriorates MyRocks' performance in terms of TPS and write amplification, whereas optimized MySQL suffers less from space amplification thanks to our space-optimizing techniques. Adding to the claim that B-trees can achieve write amplification similar to that of LSM trees [13], the experimental results in Fig. 5 imply that B-tree-based storage engines can achieve even smaller space and write amplification than LSM trees.

4) EFFECT OF SPACE OPTIMIZATIONS ON HDD
We also run the TPC-C benchmark on vanilla and space-optimized MySQL using the hard disk listed in Table 1. Table 6 presents TPS and the numbers of reads, writes, and bytes written per transaction after executing 40,000 transactions. Rather than improving performance, the space optimization techniques worsen transaction throughput and IOPS, as shown in Table 6. With data redistribution, each transaction takes longer than in vanilla, since records are moved back and forth between sibling nodes and thus require additional disk I/O. Therefore, the space optimization techniques are not worth applying on hard disks, but they provide substantial benefits on flash storage devices.

V. CONCLUSION
Optimizing space amplification is paramount to using flash storage efficiently in terms of both economy and performance. B-tree-based storage engines are particularly prone to space amplification since, on average, only 2/3 of each page is filled. To address this problem, we propose two space-optimization techniques to improve the space utilization of the B-tree: per-index free space reservation and data redistribution. By applying these two techniques, the average space utilization of the B-tree increased to 80-91%, saving 40% of storage space in TPC-C and 14% in LinkBench. Mitigating space amplification, in turn, lowers WAF by securing the over-provisioning area and improves throughput by up to 96%.