TieredHM: Hotspot-Optimized Hash Indexing for Memory-Semantic SSD-Based Hybrid Memory

Memory-semantic solid-state drives (MS-SSDs) provide a promising opportunity to enable the hybrid memory architecture (HMA). The memory-semantic interface enables the CPUs to directly access structured data in SSDs and eliminates bulk data copy/swap between the memory and storage devices. However, existing hash indexings issue many random writes, resulting in two problems when directly deployed on MS-SSD-based HMA: 1) highly random traffic persisted to the underlying NAND flash of MS-SSDs incurs significant garbage collection (GC) overhead and 2) placing frequently updated memory pages of hash indexings in persistent memories (PMs), which is expected to reduce write latency, fails to work effectively due to the lack of skewness. To address the above problems, we propose a novel MS-SSD-friendly hash indexing scheme called TieredHM. It employs a multilayer structure and opportunistic data movement (ODM) to construct skewed writes. Hence, the MS-SSD can transform the writes into multistreamed writes, separating data with different update frequencies to reduce GC overhead. Besides, since the top layer is updated much more frequently (more skewed) than other layers, placing the top layer of TieredHM into PM can significantly reduce write latency. TieredHM further leverages a prefetch mechanism based on the internal parallelism of NAND flash to reduce the search overhead incurred by ODM. Experimental results show that TieredHM reduces the average write latency and GC overhead by up to 8.3× and 20.0×, respectively, compared to state-of-the-art hash indexings without sacrificing read performance.

and large storage capacity. On the one hand, hash tables have become the preferred indexing structure because, unlike tree-like indexing structures, they perform point queries, including lookups and insertions, at constant time complexity (O(1)) regardless of the amount of inserted data. For example, mainstream in-memory databases, such as Redis [1] and Memcached [2], employ hash indexing for fast data access. On the other hand, due to the growing conflict between extensive working data sets and the high cost of scaling main memory, embracing a hybrid memory architecture (HMA) to extend DRAM with more cost-effective memory-semantic solid-state drives (MS-SSDs) [3], [4], [5], [6], [7] is becoming a promising and practical method. HMA provides in-memory applications with extra benefits, such as extended memory capacity and persistent data storage, with only a few or no code changes. The emerging memory-semantic interface [3], [4], [5], [6], [7], [8], [9] of MS-SSD further bypasses the costly data copy/swap between main memory and SSDs by enabling direct access from the CPU, as shown in Fig. 1.
However, simply deploying hash indexing under MS-SSD-based HMA ignores the distinct features of the underlying MS-SSDs, thus incurring significant performance degradation. The main reason is that the access pattern of hash indexing is highly randomized, which is unfavorable for NAND flash SSDs. Because NAND flash is only page-addressable and follows a distinct erase-before-write discipline, random write traffic from hash indexing can overburden the garbage collection (GC) in SSDs. In addition, given the above limitation of NAND flash, MS-SSDs adopt internal DRAM to enable byte-addressability. However, written data cannot be efficiently cached and must be persisted to NAND flash immediately. This further exacerbates GC overhead and increases write latency.
Spotting and placing data with different update frequencies separately is the key to reducing GC overhead and write latency in an MS-SSD-based HMA. On the one hand, the multistream technology [10], [11], [12], [13] stores pages with different update frequencies into separate logging areas. Data in the same logging area are likely to be invalidated simultaneously, so GC for frequently updated pages will not touch infrequently updated pages and vice versa. On the other hand, the hotspot-aware page placement mechanism [5], [14], [15], [16], [17] detects and places frequently accessed pages in faster memories [e.g., DRAM and persistent memory (PM)] and infrequently accessed pages in slower memories (e.g., MS-SSD). Thus, a significant fraction of random hash writes can be absorbed by faster memories, reducing the write traffic to MS-SSDs and the average write latency. Nonetheless, as shown in Fig. 2, hash indexings do not exhibit such skewed writes since a given key is randomly mapped to the hash table by a set of different hash functions. Keys with varying update frequencies are likely to collocate in the same bucket. Hence, existing hash indexings can hardly leverage multistream or page placement technology to improve write efficiency, as corroborated by the experimental statistics in Section VI-D.
This article aims to develop an efficient hash indexing that renders skewed, MS-SSD-friendly write traffic without sacrificing point query performance. A naive solution is to borrow the hierarchical data movement used by many tree-based indexings to generate naturally skewed writes. However, several challenges arise.
1) Skewed writes do not come as a free lunch. We find that the computation and storage overhead of the corresponding data movement overshadows the benefits it brings. 2) Flushing and persisting data to MS-SSDs [3], [4] introduce significantly more latency. Moreover, existing hash indexings employ write-ahead logs (WALs) to survive system crashes while relocating KVs upon hash collisions, thus increasing writes and flushes [2]. 3) Although increasing the number of layers in a hierarchical indexing structure promotes skewed writes and increases write efficiency, relocating data to lower layers leads to longer search latency.
We propose TieredHM to overcome the high write overhead of traditional hash indexing on MS-SSD-extended main memory under HMA. First, we employ a hierarchical structure and provide the memory range of each layer to the SSD. By doing so, we can assign a dedicated stream ID to each layer, and place the top layer in PM and the remaining layers in MS-SSD. Second, we devise an opportunistic data movement (ODM) strategy based on the hierarchical layout to build skewed writes by moving data from upper to lower layers beforehand, enabling the hash indexing to leverage the multistream and page placement features. Third, we adopt a maximum one flush policy and in-cacheline crash consistency to merge multiple flushes into one, which resolves the dilemma that the data movement introduced by ODM could overshadow the benefits it brings. Finally, since moving data to lower layers increases the average length of the read path, we propose Parallelism-Aware Prefetching to achieve predictable read latency regardless of the number of layers in TieredHM.
Our experimental results show that, compared to state-of-the-art hash indexings (Level Hashing), TieredHM speeds up insertion/update by over 8.3×/2.7× and reduces the GC overhead by over 20.0×/4.5×. The results indicate that we can significantly reduce write latency by placing the top layer of TieredHM into a small (1/6-1/96 of the SSD capacity) piece of PM.
The remainder of this article is organized as follows. Section II describes the background and motivation. Sections III-V present the design. Section VI evaluates the performance, and Section VII concludes this article.

II. BACKGROUND AND MOTIVATION
This section discusses the background and motivation.

A. Memory-Semantic SSD-Based HMA
Using SSDs to extend main memory has become a cost-effective and practical solution [5]. For example, many widely deployed in-memory databases, such as MongoDB [18] and LMDB [19], leverage the memory-mapped interface (mmap()) [20] of the operating system (OS) to map data in the SSD into their virtual memory to allow CPU access. The traditional memory hierarchy treats the SSD as secondary storage, and application requests involve many software operations, including context switches, page fault handling, and I/O processing in the storage stack (file system, blk-mq layer, and NVMe driver), to copy the SSD data into main memory. With the emergence of ultralow-latency (ULL) SSDs (e.g., Z-NAND [21], XL-FLASH [22], and Optane SSD [23]), this software latency is around six times longer than the ULL-NAND read latency (3 µs), constituting a significant portion of the entire delay [6], [24]. Thus, several works enable a memory-semantic interface for ULL SSDs (MS-SSDs) to directly serve the load/store instructions from the CPU by leveraging advanced interconnections, such as CXL [4], [8], PCIe [5], [7], NVDIMM [6], [9], CCIX [25], or OpenCAPI [26]. Other works employ I/O coprocessors to access standard NVMe SSDs directly from the CPU [24] or GPU [27]. In addition, Samsung has announced the industry's first CXL-based MS-SSD [3]. The memory-semantic interface and significantly lower price per byte make MS-SSDs a compelling building block for HMA. However, the HMA constituted by DRAM, PM, and MS-SSD exhibits more diverse access latencies. As a result, the performance of HMA is sensitive to access skewness, as skewness enhances the effectiveness of hotspot-aware page placement in directing write traffic to faster memories.
Random writes to MS-SSD incur significant latency for HMA. Since most data resides in NAND flash and the internal caching mechanism becomes less effective under randomized hash workloads, the NAND access latency of MS-SSD can be directly exposed to the CPU. This is not much of a concern for hash queries since, with ULL technology, the latency of a NAND read is only 3 µs [21], and hash indexings have a limited read amplification factor (RAF); thus, read requests can be served within a few microseconds. However, hash writes must be persisted to the nonvolatile media (NAND flash) immediately to prevent data loss, and NAND write latency is as high as 100 µs, exacerbating GC overhead and write request delay. Therefore, we need to rethink the design of hash indexing to improve hash write efficiency for MS-SSD-based HMA.

B. Hotspot-Aware Page Placement for HMA
To scale memory capacity while preserving low access delay, modern hybrid memory systems employ hotspot-aware page placement mechanisms to place frequently accessed (hot) data in fast memories (e.g., DRAM and PM) and offload less frequently accessed (cold) data to cost-efficient devices (e.g., MS-SSDs). Recent works focus on optimizing the overhead and accuracy of page access frequency (temperature) detection [5], [14], [15], [16], [17], [28], [29]. Tracking and mining the page temperature for large memories via the access bit in page tables [30], [31] is expensive. Hardware-assisted hybrid memory systems therefore modify the translation lookaside buffer (TLB) and hardware page table walker [28], or the host bridge [5], to track page accesses and migrate pages accordingly. However, they cannot effectively distinguish and recognize high-level application requirements. Many recent works propose software-based approaches [14], [15], [16] to track detailed memory access behavior. G-swap [29] and SAP HANA [32] even employ machine learning to adapt to long-term behavior shifts of applications. However, the above page placement methods take limited effect under hashing workloads as they require user access patterns to be naturally skewed.

C. Multistream Technology
Multistream technology allows the application [11], [33] or system software [10], [12], [13] to assign a stream ID to data with a similar lifetime (the inverse of the update frequency) during writing. The SSD can then place the data with the same stream ID into one flash block to improve GC efficiency because data that belong to the same block are likely to be invalidated simultaneously. In this article, we let TieredHM assign stream IDs and inform the MS-SSD of the memory range of each stream, which is illustrated in Section IV-B3.
Besides advising the data lifetime, another essential aspect is how to identify and categorize the data's update frequency. To achieve this, researchers develop various technologies to identify data with different lifetimes [12] or redesign the software to write them separately [13]. However, multistream SSDs work more efficiently only when the application's writes are notably skewed. For example, in RocksDB [10], which employs a tree-based indexing, different layers in the log-structured merge tree (LSM) naturally have distinct update frequencies. Researchers thus seek ways to manually change applications to specify a stream ID in each write. Also, automatic stream management, such as FStream [13], AutoStream [12], and PCStream [10], makes stream allocation decisions transparently for the applications.
Note that both manual and automatic stream management require the data lifetime to be naturally skewed [10], [34]. However, this requirement does not hold for many in-memory applications. First, in-memory applications usually perform more random requests, especially with hash indexing. Second, the number of streams in commercial SSDs is typically limited to only 4 to 16 [10], [11], [12], [33], which further restricts the flexibility of memory allocation for in-memory applications.

D. Hash Indexing
In recent years, researchers have proposed several new hash schemes, such as CCEH [35], Level Hashing [36], and Path Hashing [37], that guarantee crash consistency for PMs based on conventional hash indexings, such as Cuckoo Hashing [38]. However, they focus more on NVMs, including ReRAM [39], PCM [40], SST-MRAM [41], and 3-D XPoint [42], which endure intensive random and in-place accesses better than flash. It is thus still essential to study how to develop MS-SSD-friendly hash indexing under HMA. Several studies propose SSD-oriented hashings [43], [44], [45], [46], [47] to reduce GC overhead. They employ log structures to record random writes sequentially. However, each read request must scan entire logs to find the up-to-date version, resulting in poor search performance. Merging the log structures also increases write traffic and latency. Moreover, they omit memory semantics and crash consistency guarantees; adapting them to HMA introduces costly memory flushes and WALs. In the following, we analyze different hash schemes to motivate our work.
Hash Collision and KV Relocation: One critical issue of hash indexing is handling hash collisions when multiple keys map to the same bucket. Existing hash indexings apply key-value relocation to solve hash collisions, and each relocation entails multiple costly flush operations, which overburdens the GC in the SSD. To address collisions, Cuckoo Hashing [38] allocates multiple slots in one bucket and employs two or more hash functions to map a given key to multiple buckets. If one of the hash functions cannot locate a bucket with an empty slot, another one is tried. In case none of the functions can identify such an open slot, it will iteratively attempt to evict an existing key-value pair from the corresponding buckets using alternative hash functions, known as cuckoo displacement. Level Hashing [36] also leverages two hash functions to increase the load factor but limits cuckoo displacement to only once. Level Hashing also adopts a sharing-based two-level structure to handle hash collisions. KVs are first inserted into the top level, and the bottom level is used to store evicted conflicting items. If all the above strategies fail, both Cuckoo and Level Hashing will resize by allocating a larger hash table/level and moving the corresponding data to the new one. Resizing can block data access and trigger more flushing. To reduce resizing overhead, Clevel Hashing [48], [49] offloads resizing to background threads. It shares a similar structure with Level Hashing but limits the size of a slot to 8 bytes for atomicity. Dynamic hashings, such as CCEH [35], Dash [50], and SEPH [51], revise extendible hashing [52] for PM, which resizes only the overflowed buckets of the hash table. However, dynamic hash indexings trigger resizing quite frequently, which raises many small random memory allocations. These allocations scatter data temperature uniformly across a broader address range, making them ineffective when deployed with the page placement and multistream features under HMA.

E. Motivation
1) Lack of Write Skewness:
The skewness represents the variance in update frequencies across different ranges of memory addresses, e.g., 4 kB. We conducted an experiment to analyze the distribution of update frequencies for existing hash indexing methods by inserting 680 million key-value pairs, amounting to 20 GB of data. The experiments were performed on a sheer MS-SSD for simplicity, and the update frequencies were collected at the 4-kB memory page level. In our experiment, we compared different hash tables by configuring their sizes such that each resizes twice. The access skewness of Linear, Cuckoo, and Level Hashing is illustrated in Fig. 2(a)-(c) by displaying their update frequencies over various memory addresses.
Results imply that existing hash indexings deliver only slight skewness. The random mapping from keys to buckets amortizes the update frequencies of 4-kB memory pages, resulting in a uniform distribution of update frequency across the entire memory space. The Level Hashing configuration allocates 2^25 buckets on the top level and 2^24 buckets on the bottom level.
2) Extra Flushing Overhead During KV Movement: Enabling key-value pairs' movement between multiple buckets is a primary way to reduce the necessary rehashing and build skewness. However, each move operation is accompanied by multiple flush operations in the critical execution path. Specifically, as shown in Fig. 2(d), a reasonable strategy is to ① first write to the log area (i.e., WAL) to trace the moved key (A). It then ② persists the insertion to the destination slot (A′) and ③ the deletion of the source data (A) to NAND flash sequentially. ④ Finally, it marks the KV relocation as completed and persists this mark to the underlying storage, such that the KV relocation will be considered successful in case of a sudden crash. The illustrated movement strategy results in a considerable amount of writes and flushes to the SSD, which may even overshadow the benefits brought by other hash optimizations. This motivates us to avoid eviction during insertion and to propose an ODM strategy (see Section IV-B) that shares the same flush with new writes.

III. LAYOUT OF TIEREDHM
This section describes the structure of TieredHM.
1) Logical Layout: Like many other contenders, TieredHM employs two hash functions to mitigate hash collisions. TieredHM then adopts a multilayered structure, which provides an opportunity to differentiate the update frequencies of the memory ranges corresponding to each layer, such that TieredHM can achieve noticeable skewness to leverage hotspot-aware page placement and multistream strategies. As shown in Fig. 3(a), each layer is 2^n times larger than the layer above (where n is an integer, 1 by default). Buckets are indexed using the most significant bits (MSBs) of the key's hash. For a given key, the candidate bucket in the next layer is located with n (1 by default) more bits of its hash value. For example, as shown in Fig. 3(a), the index for buckets in L0 is the leading 2 bits of H1(key), the index in L1 is the leading 3 bits, and so on. Each bucket contains four slots. Each slot stores a 16-byte key and a 15-byte value, which is large enough for most key-value pairs in Facebook's key-value store [53]. To support larger KV sizes, we would save the data in a separate log and keep a pointer to the log item in the slot, which is out of the scope of this article.
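To make the indexing rule concrete, the following C++ sketch (our illustration, not the authors' code; the function name and the 64-bit hash width are assumptions) derives the candidate bucket of a key in every layer from the MSBs of one hash value:

#include <cstdint>

// Candidate bucket of `key_hash` in `layer`, assuming L0 is indexed by
// `l0_bits` MSBs and every deeper layer uses one more bit, so layer i
// holds 2^(l0_bits + i) buckets and each parent bucket covers two children.
inline uint64_t bucket_index(uint64_t key_hash, unsigned l0_bits, unsigned layer) {
    unsigned bits = l0_bits + layer;      // index width for this layer
    return key_hash >> (64u - bits);      // keep the `bits` most significant bits
}

Because a child's index is just the parent's index extended by one more hash bit, all candidate buckets of a key are deterministic and can be precomputed at once, which Section V-D later exploits for prefetching.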
2) Optimization Opportunities and Compensation: The pyramid-like structure naturally divides the memory pages of TieredHM into multiple consecutive regions (layers). This provides an opportunity to distinguish the update frequency of each layer by allowing data to sink between them. However, a conventional hierarchically structured hash indexing provides limited skewness among layers, as shown in Fig. 2(c), which makes hotspot-aware page placement and multistream ineffective. We revamp the write strategies in Section IV to generate obvious write skewness among layers, enabling tree-index-like write efficiency. We mitigate search overhead in Section V to avoid reading iteratively from the top to the bottom layers, thus achieving search latency comparable to single-layered hashing.
IV. GENERATING MS-SSD-FRIENDLY WRITE
This section explains how TieredHM generates MS-SSD-friendly writes.

A. Insertion
Fig. 4(a) depicts how TieredHM inserts data. When a new KV (B) arrives, ① TieredHM will first try to find an empty slot in the bucket directed by the first hash function (H1()) in the top layer (L0). If that fails, TieredHM will try the alternative bucket in the same layer directed by the second hash function. If both buckets in L0 are full, ② TieredHM will try to find a free slot in the successive layer (i.e., L1) instead of evicting an existing KV (e.g., A or C) from L0 to L1, to avoid costly flushing due to KV relocation. The position of the candidate bucket in the next layer can be calculated using n more bits of the hash value (H1(B)). TieredHM depicted in Fig. 4(a) succeeds in this step, with only one flush needed to insert B. Suppose the candidate buckets in all layers are full. In that case, ③ TieredHM will try to evict one of the existing KVs (e.g., A) in the candidate buckets to its alternative bucket in the same layer, as shown in Fig. 4(a). Such an eviction is known as cuckoo displacement, which starts from L0 and proceeds to lower layers. Different from cuckoo hashing, for each successful insertion, TieredHM allows only one cuckoo displacement (i.e., without iterative eviction), and it must stay within the same layer. Resizing happens when TieredHM cannot resolve a collision at the existing layers.
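The insertion order above can be summarized by the toy model below; it is a simplified sketch under our own assumptions (placeholder layer count, slot count, and hash functions), with persistence, flags, and resizing omitted, and the displaced victim is simply sent to its second-hash bucket:

#include <array>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

struct KV { uint64_t key, value; };
constexpr int kSlots = 4, kLayers = 3, kL0Bits = 2;            // placeholder sizes
using Bucket = std::array<std::optional<KV>, kSlots>;
using Layer  = std::vector<Bucket>;

uint64_t h1(uint64_t k) { return std::hash<uint64_t>{}(k); }
uint64_t h2(uint64_t k) { return std::hash<uint64_t>{}(k ^ 0x9e3779b97f4a7c15ULL); }
uint64_t idx(uint64_t h, int layer) { return h >> (64 - (kL0Bits + layer)); }

std::vector<Layer> make_table() {                              // layer i has 2^(kL0Bits+i) buckets
    std::vector<Layer> t;
    for (int l = 0; l < kLayers; ++l) t.emplace_back(1ull << (kL0Bits + l));
    return t;
}

bool put_in(Bucket& b, const KV& kv) {
    for (auto& s : b) if (!s) { s = kv; return true; }         // take the first free slot
    return false;
}

// Steps 1-2: probe both hash paths from L0 downwards; step 3: at most one
// cuckoo displacement, restricted to the same layer; resizing is not shown.
bool insert(std::vector<Layer>& t, const KV& kv) {
    for (int l = 0; l < kLayers; ++l)
        if (put_in(t[l][idx(h1(kv.key), l)], kv) ||
            put_in(t[l][idx(h2(kv.key), l)], kv)) return true;
    for (int l = 0; l < kLayers; ++l) {
        Bucket& b = t[l][idx(h1(kv.key), l)];                  // a full candidate bucket
        for (auto& s : b) {
            Bucket& alt = t[l][idx(h2(s->key), l)];            // victim's alternative bucket
            if (&alt != &b && put_in(alt, *s)) { s = kv; return true; }
        }
    }
    return false;                                              // would trigger resizing
}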

B. Opportunistic Data Movement
To enhance the skewness among different layers, TieredHM adopts an ODM strategy. The insight is that if we can grab the opportunity to move data in batches among layers during a regular hash write (insertion, update, or deletion), we can eliminate the extra flushes due to KV relocation. In this section, we first describe the procedure of ODM and then verify that it can share the flush with regular writes. Finally, we illustrate how we implement the page placement and multistream strategies using the skewed writes.
1) Procedure of ODM: While TieredHM serves a hash write in lower layers, the ODM strategy scans the page where the newly written data is located. Whenever it encounters a bucket with writable slots, ODM moves down the upper layers' KV pairs whose hash paths direct to that bucket and marks their source slots as writable. Take insertion as an example, as shown in Fig. 4(b): when inserting B into L1, ① ODM will first scan page1. Given that both buckets in page1 have writable slots, ② ODM will then move A and C, whose hash paths direct to those buckets, from L0 to L1. ③ After persisting the modification of page1, ④ ODM will mark the source slots in L0 as writable [shadow in Fig. 4(b)]. Note that ODM will first try to move data from the top layer and then from lower layers until the destination buckets are full or the source buckets are vacant. Furthermore, steps ① and ② benefit from the CPU cache since they exhibit high access locality.
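The rule that decides which upper-layer pairs may sink can be stated compactly; the sketch below is our hypothetical illustration of that check (helper names are ours, assuming n = 1 extra index bit per layer):

#include <cstdint>

// A KV stored in the layer above may sink into `dest_bucket` of `layer`
// only if its own hash value indexes exactly that bucket; its source
// bucket is then the single upper-layer bucket covering the destination.
inline uint64_t parent_bucket(uint64_t dest_bucket, unsigned n = 1) {
    return dest_bucket >> n;
}
inline bool may_sink_to(uint64_t key_hash, unsigned l0_bits,
                        unsigned layer, uint64_t dest_bucket) {
    return (key_hash >> (64u - (l0_bits + layer))) == dest_bucket;   // same MSB prefix
}

Because every eligible source KV lands on the very page that ODM is already flushing, moving it down adds no extra flush, which is what the maximum one flush policy below relies on.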
ODM renders significant skewness among layers in two aspects: 1) for insertions of new data, it batches them at KV granularity in upper layers and then flushes them to lower layers together without extra flushes (see the maximum one flush policy) and 2) for data updated with different frequencies, less frequently updated data sink to lower layers during ODM. In contrast, frequently updated data will be reinserted into upper layers by the lazy update (see Section IV-C). Consequently, ODM helps TieredHM generate a highly skewed write distribution, which can benefit from the page placement and multistream technologies. ODM differs from compaction in the LSM-tree since it moves down and merges data opportunistically, which only happens along with regular hash writes and avoids evicting any existing KV in the destination layer.
2) Maximum One Flush Policy: Although ODM generates skewed writes between layers, the crucial problem is that such a policy incurs numerous costly flushing operations. Fortunately, we can share the flush with the insertion (we only need to flush page1, and only once) without sacrificing crash consistency. The critical insight is that the data flushing incurred by ODM can be carried out during the KV insertion operation. The rationale is as follows. First, the write traffic incurred by ODM can stay in the CPU cache and wait for write-back, eliminating the costly flushing illustrated in Fig. 2(d). Second, since data movements and queries follow the same single direction, and no data other than the inserted one is altered, there is always an up-to-date copy of opportunistically moved data. Therefore, not even a WAL is needed (details in Section IV-D). Third, as all the movement destinations in ODM always colocate with the page that is serving the update or insertion request (page1), TieredHM only needs to perform one flush operation toward the destination page after finishing all the writes.
3) Enabling Multistream and Page Placement: Leveraging the skewness built by ODM, we illustrate how to implement the two strategies.
Hotspot-Aware Page Placement: Fig. 3(b) depicts the physical layout and page placement of TieredHM. Since the multilayered structure naturally distinguishes the update frequency of each layer, statically mapping the top layer (L0) to PM and the rest to MS-SSD is enough to leverage PM to absorb significant hash writes. We adapt the size of L0 to fit into PM. Note that each layer is mapped to a contiguous memory address range exposed by PM or MS-SSD to avoid random writes. The signature array located in DRAM (Section V-B) is an auxiliary structure proposed to improve search performance.
Multistream: To enable the multistream feature, TieredHM assigns a stream ID and provides the memory range of each layer to the MS-SSD during initialization. The MS-SSD then maintains a range table, recording the start offset, length, and corresponding stream ID of each layer, as depicted in Fig. 7. The requests to MS-SSDs are a set of memory writes transferred via a memory interface. The MS-SSD firmware determines the corresponding stream ID by comparing the memory address of the write request with the range table. Dedicated hardware can accelerate this search. This design is feasible because the layers in TieredHM are relatively stable and limited in number.
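A minimal sketch of the firmware-side lookup is shown below; it is our illustration of the range table in Fig. 7 with hypothetical field names, and a real device would likely use dedicated comparator hardware instead of a loop:

#include <cstdint>
#include <vector>

struct RangeEntry { uint64_t start, length; uint8_t stream_id; };   // one entry per layer

// Return the stream ID of an incoming memory write, or a default stream
// if the address falls outside every registered layer.
uint8_t stream_of(const std::vector<RangeEntry>& table, uint64_t addr,
                  uint8_t default_stream = 0) {
    for (const auto& e : table)                 // the table is tiny and rarely changes
        if (addr >= e.start && addr < e.start + e.length) return e.stream_id;
    return default_stream;
}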

C. Lazy Update and Deletion
Since data are moved down during insertion, updating or deleting in place incurs more writes to lower layers, reducing the effectiveness of page placement and deteriorating write skewness. Also, TieredHM should avoid letting frequently updated data sink into lower layers, which reduces search performance and write skewness. To address these limitations, we propose lazy update and deletion to reinsert updated or deleted data into upper layers without compromising consistency. Fig. 4(c) and (d) shows how the lazy update works. The old copy of the updated data (A′) sits in the lower layer (L1) while there are open slots in the upper layer (L0). The lazy update scheme ① directly writes the new data (A″) into the upper buckets within the same hash path. ② Afterward, when a write request (insert, update, or delete) hits page1, ODM will compact A′ and A″, as shown in Fig. 4(d). One exception is when there is no available space to conduct ODM, so stale slots cannot be recycled. In this case, TieredHM will directly recycle the stale slot to insert the new data. Such a direct recycling procedure happens right before cuckoo displacement [step ③ in Fig. 4(a)] during a subsequent insertion. Therefore, lazy updates and deletions do not incur storage leakage.
Lazy deletion is similar to lazy update, except it recycles both the upper and lower slots occupied by the deleted KV during merging. TieredHM expresses a lazy deletion of a key using a deletion flag in the extra metadata area, depicted in Fig. 5. When there is no available slot in upper layers to conduct lazy update or deletion, TieredHM updates or deletes a KV in place.
1) Merge Overhead: Intuitively, merging between two layers is costly since TieredHM scatters duplicated keys by two hash functions. During a lazy update or deletion, TieredHM could write the newer copy to another hash path relative to the stale version of the same key. To identify duplicated keys, TieredHM has to search all lower-level candidate buckets for each KV in the upper-level buckets. Since it is unlikely for two different keys to collide under both hash functions, merging multiple keys in a bucket would result in random searches and updates across the lower layers, introducing significant read and write amplification. Moreover, such random access exhibits low cache efficiency, leading to higher GC overhead in the MS-SSD.
To reduce merge overhead, TieredHM limits merge operations within a single hash path of a given key by gathering duplicated keys in the same hash path. To this end, TieredHM always conducts lazy updates or deletions in the same hash path. Consequently, since the merged keys colocate within the range of ODM, as shown in Fig. 4(d), merging can be carried out along with ODM and benefit from the CPU cache.

D. In-Cacheline Crash Consistency
1) In-Cacheline Metadata Design:
To further reduce the flushing overhead for data consistency, we carefully design the placement of metadata (referred to as flags). The key insight is that if we can guarantee the write order between the data and the flags, we can eliminate logs and extra flush operations. Fortunately, when performing multiple writes to the same cacheline, the write order within the cacheline is equivalent to the order in which they reach the persistent memory [54]. Such an order can be guaranteed with release memory ordering supported in C++11 or a fence instruction on the x64 architecture, both of which incur no runtime overhead. Therefore, TieredHM substitutes the WAL with a flagging mechanism and places each slot and its metadata in the same cacheline to eliminate extra flushes while guaranteeing consistency when updating the metadata and slot. This is feasible because, for 15-byte keys and 16-byte values, each cacheline can hold two key-value pairs plus a 2-byte metadata area that stores 2-bit readable, 2-bit nonwritable, and 2-bit deletion flags. In comparison, existing hash indexings, such as Level Hashing, gather the metadata of all slots in the footer of each bucket, failing to guarantee that all slots and their corresponding flags are located within the same cacheline. Thus, Level Hashing still needs to conduct multiple flushes most of the time.
Fig. 5 shows the metadata structure for slots in TieredHM, while Table I lists the flag values corresponding to the three states of a slot. TieredHM employs two flags for each slot to indicate its "readable" and "nonwritable" states. Initially, both are false, which means the slot is unreadable and writable. When a slot is inserted, its nonwritable flag is set to 1. The readable flag is used to indicate a shadow slot, which is used to mitigate search overhead without compromising consistency (Section V-A). In addition, TieredHM employs deletion flags to enable lazy deletion.
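For illustration, the in-cacheline layout described above can be sketched as follows (our reconstruction from the text, using the 15-byte keys and 16-byte values stated in this subsection; the exact bit order inside the metadata area is an assumption):

#include <cstdint>

struct Slot {
    char key[15];
    char value[16];
};                                                  // 31 bytes, no padding

struct alignas(64) SlotPair {
    Slot     slot[2];                               // two key-value pairs (62 bytes)
    uint16_t meta;                                  // readable/nonwritable/deletion flags
};
static_assert(sizeof(SlotPair) == 64, "two slots plus metadata fill one cacheline");

// Hypothetical bit layout: bits [3*i, 3*i+2] hold the flags of slot i.
inline bool readable(uint16_t m, int i)    { return m & (1u << (3 * i + 0)); }
inline bool nonwritable(uint16_t m, int i) { return m & (1u << (3 * i + 1)); }
inline bool deleted(uint16_t m, int i)     { return m & (1u << (3 * i + 2)); }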
2) Crash Consistency and Recovery: We use multiple strategies to guarantee crash consistency with minimum flushing overhead under different scenarios.
First, when inserting data into an open slot, once we write the key-value pair, we can alter the readable flag of the slot and persist both the data and the flag using one flush to make the data ready for reading (see the sketch at the end of this subsection). This design does not require an extra WAL to guarantee consistency because the write to the flag is atomic.
Second, when writing data to a shadow slot (see Section V-A), we set the readable flag to 0 before writing the new data and reset both the readable and nonwritable flags to 1 afterward. Since all writes are in the same cacheline, we only need to protect the order in which we update the flags and values and perform one flush after all writes are finished.
Third, when updating a key-value pair, we first seek upper layers to find an open or shadow slot to place the data. If we find an available slot, we use the first or second strategy to insert the data. Otherwise, if we fail to find an available slot in the upper layers, we update the corresponding value in place. Because we support 16-byte values, which exceed the 8-byte maximum atomic update size, this is the only case in which we employ a WAL to protect the in-place value override.
Fourth, lazy deletion is similar to lazy update, except that we additionally set the corresponding deletion flag, indicating which slot and its duplicated keys in the lower layer should be recycled later. If there is no available space for lazy deletion, TieredHM marks the slot as open by setting both the readable and nonwritable flags to 0 without altering the data part. A flush then follows to guarantee the slot is successfully freed.
Finally, when migrating data during ODM, we simply use the above methods to write to the destination slot. After safely writing to the destination, we remove the data from the source by changing the corresponding flags. However, we avoid flushing the source slot after that. If the system crashes, we scan the table to search for duplicated KV pairs. If two layers contain identical KV pairs (key and value are both the same), the one in the upper layer can be safely removed. (If two layers contain duplicated keys with different values, it is a lazy update or deletion, and we leave it.) Since the path of data movement is short and its direction is fixed, the recovery only needs to scan a limited number of buckets and can be performed incrementally, during regular workloads, by checking all buckets on each layer directed by the same hash value.
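As a concrete illustration of the first strategy above, the following sketch (our code, not the authors'; the flag byte and bit are hypothetical, and only the ordering matters) shows how one cacheline flush can cover both the data and its flag:

#include <atomic>
#include <cstdint>
#include <cstring>
#include <immintrin.h>   // _mm_clwb / _mm_sfence; compile with -mclwb on x86-64

// Insert a KV into an open slot whose data and flag byte share one cacheline:
// write the data, order it before the one-byte flag store, then issue exactly
// one cacheline write-back followed by a fence to persist data and flag together.
void publish_slot(void* cacheline, void* slot, const void* kv, std::size_t kv_size,
                  volatile uint8_t* flag_byte, uint8_t readable_bit) {
    std::memcpy(slot, kv, kv_size);                       // 1. write key and value
    std::atomic_thread_fence(std::memory_order_release);  // 2. data before the flag
    *flag_byte |= readable_bit;                           // 3. atomic 1-byte flag update
    _mm_clwb(cacheline);                                  // 4. one flush covers data + flag
    _mm_sfence();                                         //    drain it to the media
}

A crash before step 3 leaves the slot unreadable and hence logically absent; a crash after step 3 exposes a complete KV, so no WAL is needed for this case.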

E. Resizing Scheme
Since TieredHM is deployed in HMA, resizing should avoid excessive data migration and memory allocation among different memory modules. Therefore, we resize L0 in PM and the remaining layers in SSD separately, and relocate only the KVs in L0 and L1 to reduce write traffic. Specifically, to resize L0, TieredHM allocates a new consecutive space in PM and relocates the existing KVs to it. To resize the layers in SSD, TieredHM adds a new bottom layer two times larger than the current bottom layer. Then, TieredHM relocates the data in L1 to the newly allocated layer and removes L1. Afterward, TieredHM communicates the updated memory ranges to the MS-SSD.
The resizing scheme keeps the most frequently updated data in the top layer. As for data in L1, TieredHM leverages lazy update and prefetching (Section V-D) to mitigate access latency.
V. MITIGATING SEARCH OVERHEAD
Since searching starts from the top layer to avoid getting stale values, migrating key-value pairs during insertion and updating reduces the upper layers' hit ratio. Therefore, more layers would be searched before a key is finally located, thus increasing the average search latency. In this section, we propose several schemes to mitigate the search overhead.

A. Shadow Reading
TieredHM develops shadow reading, which allows the source slot in upper layers to serve both queries and insertions after an ODM. A source slot can simultaneously serve queries and insertions because TieredHM keeps duplicated keys in both the source and destination slots after performing the ODM; thus, overwriting the source slot does not result in data loss. Specifically, as depicted in Fig. 4(b), we mark the source slot as a shadow slot rather than directly removing the source data during ODM. To enable shadow reading without compromising crash consistency, TieredHM leverages the readable and nonwritable flags to distinguish a shadow slot from an open or valid slot, as shown in Table I. As a result, pending reads can still access the source slot until another write overrides it.
However, shadow reading becomes less effective once the shadow slot is written. This limitation is exacerbated when the hash table becomes fuller. Thus, we seek other optimizations to mitigate the search overhead.

B. Signature Array
To bypass the extra lookup of the top layer when the requested key sits in a lower layer, we add a signature array in DRAM. As shown in Fig. 3(b), each entry in the signature array is a 16-bit integer corresponding to a slot in the top layer. We derive the signature by combining the last 8 bits of the key's two hash values to minimize calculation overhead. Since different keys may derive the same signature (B and C in Fig. 6), the signature array may falsely indicate that a key lies in the top layer. Such a case happens rarely and does not affect the correctness of a query. The signature array can bypass over 99.6% of the extra lookups of the top layer in our experiments. The footprint of the signature array is only 1/16 of the top layer, which is trivial when the top layer is sufficiently small (0.2% of SSD capacity when the ratio of PM/SSD is 1/24 by default).
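The signature derivation can be sketched as follows (our illustration; the order in which the two 8-bit halves are concatenated is an assumption):

#include <cstdint>

// 16-bit signature for an L0 slot: the low 8 bits of each of the key's two
// hash values, packed together. A mismatch proves the key is not in L0;
// a match still has to be confirmed against the full stored key.
inline uint16_t signature(uint64_t h1_val, uint64_t h2_val) {
    return static_cast<uint16_t>(((h1_val & 0xFFu) << 8) | (h2_val & 0xFFu));
}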

C. Prioritizing the First Hash Path
Another interesting finding is that the ratio of requests served by the first hash function during insertion is 73.5%; thus, we can prioritize searching the first hash path to reduce search latency. Fig. 6 depicts the optimized lookup procedure. When receiving a search request, TieredHM will first check the signature array ①. If it indicates that the requested key (e.g., A) is not in the top layer ②a, TieredHM will directly search the candidate buckets of the first hash path in the lower layers (i.e., the buckets indexed with the first hash function in L1 and L2). A shadow read ③a can further reduce read latency by reducing the number of layers that need to be searched. If the key is still not found, ④ TieredHM will then search the second hash path.
Otherwise, if the signature array indicates that the requested key (e.g., B) lies in the top layer, TieredHM will search the top layer ②b for the requested key by comparing it against the whole stored key (C). If the whole key mismatches ③b, TieredHM will continue to search the lower layers in the MS-SSD for the requested key. The rest of the steps are the same as in the above illustration.
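Putting the signature check and the first-hash-path priority together, the probe order of a lookup can be sketched as follows (a simplified illustration under our assumptions; early termination on a hit and shadow reads are omitted):

#include <utility>
#include <vector>

// Probe order of the optimized lookup: each element is (layer, hash_path).
// The first hash path is exhausted before the second, and L0 is skipped
// entirely when the signature array says the key cannot be there.
std::vector<std::pair<int, int>> probe_order(bool maybe_in_l0, int num_layers) {
    std::vector<std::pair<int, int>> order;
    for (int path = 0; path < 2; ++path)
        for (int layer = 0; layer < num_layers; ++layer)
            if (layer > 0 || maybe_in_l0)
                order.push_back({layer, path});
    return order;
}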

D. Parallelism-Aware Prefetching
Even with the above optimizations, when KVs reside in lower layers, the long search latency of iterating over each level in TieredHM is still undesirable. Fortunately, we find that the candidate buckets for a given key in each layer are deterministic and can be precalculated all at once. In addition, since modern SSDs provide sufficient internal parallelism, by carefully arranging the data layout in the NAND flash, we can load all candidate buckets from the NAND flash simultaneously to hide the NAND read latency. Fig. 7 depicts how we implement the prefetching. We modify the SSD's firmware to add a prefetching logic (prefetcher) and map different layers (streams) of data to separate parallel units (we choose channels for simplicity). When the MS-SSD receives a memory read request to page 2, the prefetcher ① first parses the memory address into a regular logical page number (LPN), stream ID, and bucket index, and ② looks up the internal cache. If the cache misses, the prefetcher precalculates the LPN of page 4 by adding the start offset in the range table and the candidate bucket index of the next layer. ③ Then, the LPNs of pages 2 and 4 are translated into NAND addresses with the aid of the flash translation layer (FTL), and ④ the corresponding flash transactions are issued in parallel to channels 0 and 1. ⑤ Finally, the NAND pages are fetched from NAND flash into the internal cache. Such a design only requires moderate modifications to the firmware, and since the logic is relatively simple, the runtime overhead is negligible. To reduce bandwidth occupation, we only prefetch candidate buckets in the same hash path (pages 3 and 5 are not prefetched), and we selectively enable the prefetching by letting TieredHM inform the SSD once the hash table is over 50% full. Our experiments reveal that prefetching incurs an average consumption of 436.9 MB/s of read bandwidth, which is trivial compared with modern SSDs' bandwidth (over 10 GB/s).
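The prefetcher's address computation can be illustrated by the sketch below (our reconstruction of Fig. 7 with hypothetical field names; picking the first child bucket of the current one is an assumption, since only buckets on the same hash path are prefetched):

#include <cstdint>
#include <vector>

struct LayerRange { uint64_t start_lpn; uint64_t buckets_per_page; };   // from the range table

// LPN to prefetch in the next layer: extend the current bucket index by one
// bit (n = 1) to stay on the same hash path, then map the resulting bucket
// onto a page using the next layer's start offset from the range table.
uint64_t prefetch_lpn(const std::vector<LayerRange>& ranges,
                      unsigned layer, uint64_t bucket_idx, unsigned n = 1) {
    uint64_t child_bucket = bucket_idx << n;
    const LayerRange& next = ranges[layer + 1];
    return next.start_lpn + child_bucket / next.buckets_per_page;
}

Because each layer (stream) is mapped to a different channel, the resulting flash transaction can be issued in parallel with the one that triggered it.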

VI. EVALUATION
A. Evaluation Methodology
Our evaluation uses an in-house SSD emulator similar to FlatFlash [5]. The emulator divides the host memory into three regions: the first region represents regular DRAM; the second region models the PM; and the third region simulates the NAND flash-based MS-SSD. To track the pages placed in the PM and to inject the memory access latencies of the PM and SSD regions, the emulator uses mprotect to control the protection bits in the page table.

Fig. 7. Prefetch. Suppose the requested KV resides in page 4 of L2. When the CPU searches page 2, the prefetcher locates the requested address with the range table and then prefetches page 4 simultaneously by leveraging internal parallelism.

TABLE II MS-SSD-BASED HMA PARAMETERS
This section seeks to answer the following questions. 1) Does TieredHM provide significant write advantages and comparable read performance to other hashing schemes under real workloads? (Sections VI-B and VI-C). 2) How much write efficiency does hotspot-aware page placement on hybrid memory provide for each hash indexing, compared to the preliminary physical layout [57] based on sheer MS-SSD? Does TieredHM benefit from the page placement and multistream strategies better than existing hash indexings? Do the advantages of TieredHM hold over a wide range of configurations? (Section VI-D). 3) How much improvement does TieredHM bring compared to the preliminary [57] search design? (Section VI-C). 4) How much performance improvement does each stand-alone design in TieredHM bring? (Section VI-E). To this end, we compare TieredHM with three representative hash indexings: 1) Cuckoo; 2) Linear; and 3) Level Hashing [36]. We also compare TieredHM with a preliminary search design (with only shadow reading), denoted as TH-Orig.
We configure the above hash indexings and TieredHM under an HMA consisting of PM (1/24 of SSD capacity by default) and MS-SSD. TieredHM is configured with three layers by default, and the size of L0 is adapted to the PM. We use an LRU algorithm to detect page temperature for Level, Cuckoo, and Linear Hashing under HMA and place the most frequently accessed pages in PM. Pages are initially cold and detected as hot once accessed. HMA then promotes these pages from MS-SSD to PM in the background instead of swapping. After copying a hot page into PM, HMA can direct store or load requests to PM. To understand whether existing hash indexings can benefit from hotspot-aware page placement, we compare against a physical layout based on sheer MS-SSD, denoted as Orig-LO. Level Hashing and TieredHM are enabled with the multistream feature by default, with each layer placed in MS-SSD mapped to a separate stream. To verify whether Level Hashing and TieredHM can benefit from multistream, we configure Level Hashing and TieredHM with and without the multistream feature. By varying the size of PM and the write/read ratio, we illustrate the range over which TieredHM gains an advantage over other schemes with the page placement and multistream strategies. To demonstrate the effectiveness of each stand-alone design in TieredHM, we compare the performance of TieredHM with and without it.
Since the WAL contributes nontrivial overhead to write latency and amplification, we enable two configurations for all hash schemes: a PM-based log (suffixed with -PL) and an SSD-based log (the default). Note that the insert and delete operations in Level Hashing and TieredHM are log-free. To gain deeper insight into the read and write performance of TieredHM compared to tree-based indexings, we also configure LevelDB, an LSM-tree-based KV engine, as one of the baselines. LevelDB is tailored for HMA to persist data via the memory interface, with its WAL persisted to PM to reflect its optimal write performance.
We employ real-world workloads, YCSB [58] with a Zipfian key distribution, in the following experiments to demonstrate the effectiveness of TieredHM. The maximum amount of data inserted, updated, and deleted is 20 GB (680 million items), which is the user-available capacity of the SSD and enough to trigger GC. Given that there is no extra design for the resizing scheme of TieredHM, we configure the capacity of each hash data structure identically, just enough to accommodate the total inserted data (95% full), thus avoiding resizing and preventing its complex interaction with internal GC.

B. Write Performance Analysis
In this experiment, we stress the SSD to evaluate the write performance of different hash indexings. We collect the average write latency and amplification of different hash schemes. Update and deletion are tested after inserting a certain amount of data. Since the performance changes with the fullness of the indexings, we vary the amount of data inserted [90%, 50%, and 10% of the maximum data amount (20 GB)]. Experimental results are shown in Fig. 8. To understand why TieredHM outperforms the others in write workloads, we collect the internal and external write amplification factors (WAFs) for different indexings. We denote the ratio of total flash write times to the number of user writes as the internal WAF, which reflects the SSD write amplification from both SSD GC (light-colored bar, measured by the average number of valid pages moved during GC, depicted as GC) and the indexing schemes (dark-colored bar, depicted as idx_w). In other words, the internal WAF reflects how effective the page placement and multistream strategies are. We also define the external WAF as the ratio of user-issued flushes to the number of user writes, which reflects the write amplification of the indexing schemes alone.
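In symbols (our notation), the two metrics are:

\[
\mathrm{WAF_{int}} = \frac{\text{total flash page writes (indexing + GC)}}{\text{user writes}}, \qquad
\mathrm{WAF_{ext}} = \frac{\text{flushes issued by the indexing}}{\text{user writes}}.
\]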
1) Insert Analysis: Fig. 8(a)-(c) shows the insert performance of different indexings. We find that when the WAL is persisted to the SSD, the average latencies of Cuckoo and Linear Hashing reach over 785.8 and 751.2 µs, respectively. The WAL also introduces many more flushes, as their external WAFs are as high as 4. Moreover, the increased write traffic increases the internal GC overhead. On the other hand, Level Hashing leverages flags instead of a WAL during insert, so the latency is reduced to 268 µs. When the WAL is persisted to PM, the latencies of Cuckoo and Linear drop to around 200 µs, which is even lower than that of Level Hashing. This is because the flags and data are separated into different cachelines, so Level Hashing needs to first flush the data and then the flags to ensure the persisting order, while Cuckoo-PL and Linear-PL only flush the data once to the MS-SSD. The above results imply that the MS-SSD dominates the latency under HMA, and the WAL should be persisted to PM for higher performance. However, we can see that their average number of writes to the SSD (depicted as idx_w in the internal WAF) is still higher than 1, which means that page placement with PM helps little to reduce writes to the SSD. In contrast, TieredHM eliminates both the WAL and extra flushes with the in-cacheline metadata design, and the external WAF is reduced to 1. TieredHM also significantly reduces the write traffic and latency by building write skewness, which can then benefit from page placement. As a result, TieredHM speeds up the insert of Linear, Cuckoo, Linear-PL, Cuckoo-PL, and Level Hashing by 23.3×, 24.4×, 6.4×, 6.8×, and 8.3×, with the GC overhead reduced by over 75.9×, 83.4×, 22.2×, 24.6×, and 20.0×, respectively. TieredHM even achieves a latency (32.2 µs) and amplification (0.22) similar to LevelDB (36.1 µs and 0.15).
2) Update Analysis: Fig. 8(d)-(f) shows the update performance of different indexings. The results are similar to those for insertion. TieredHM-PL speeds up the update of Linear-PL, Cuckoo-PL, and Level-PL by 2.1×, 2.3×, and 2.7×, with the GC overhead reduced by over 3.1×, 3.6×, and 4.5×, respectively. Besides, TieredHM only shows 14.0% higher latency than LevelDB. Notably, Level Hashing and TieredHM exhibit obvious variance under different indexing fullness. This is because both of them leverage an opportunistic log-free update strategy when there are available slots. However, Level Hashing updates a KV log-free only when the original bucket has free slots. The effectiveness of Level's log-free update decreases dramatically when the indexing fullness exceeds 50%. We can verify this with the external WAF. When the indexing fullness is 90%, Level Hashing exhibits an external WAF (5.44) similar to Cuckoo and Linear Hashing (6.00). In contrast, TieredHM leverages ODM to open up slots in upper layers. Its "lazy" scheme continues to work on both PM and SSD, keeping the latency and external write amplification below 97.3 µs and 2.25, respectively.
3) Delete Analysis: Fig. 8(g)-(i) shows the delete performance of different indexings. The deletions of Level Hashing and TieredHM are log-free, necessitating only one flush for the flags. However, Level's deletion is conducted in place, which lacks skewness when persisting to HMA. In contrast, TieredHM uses the "lazy" delete strategy to leverage the PM to insert a deleted copy, achieving much lower external and internal write amplification. To quantify, TieredHM speeds up the delete of Linear, Cuckoo, Linear-PL, Cuckoo-PL, and Level Hashing by 23.1×, 24.4×, 5.1×, 5.5×, and 4.9×, with a latency of less than 43.2 µs. The GC overhead is reduced by over 30.3×, 34.5×, 9.0×, 10.4×, and 8.5×, respectively. Note that the deletion of LevelDB is much faster than the other indexings in our experiments, with a latency lower than 5.6 µs. This is because LevelDB has higher write skewness among layers and benefits from page placement by appending new writes to a large log structure in the top layer, amortizing the write traffic to the lower layers in MS-SSD. The internal write amplifications of Linear-PL, Cuckoo-PL, Level, TieredHM, and LevelDB are 2.00, 2.15, 1.94, 0.24, and 0.02, respectively.

C. Search Performance
In this experiment, we evaluate the search performance of different indexings. We vary the fullness of the indexings by inserting different numbers of keys (90%-10%). Afterward, we perform YCSB-A, B, C, and D to test the performance under pure search (YCSB-C) and read-intensive hybrid operations. We define the RAF as the ratio of flash read times to the number of user reads. TH-Orig is the preliminary version [57] with only shadow reading.
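Analogously to the WAFs defined in Section VI-B, in our notation:

\[
\mathrm{RAF} = \frac{\text{total flash page reads}}{\text{user reads}}.
\]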
As shown in Fig. 9, the average search latency increases with the fullness for all indexings because the search path lengthens due to hash collisions. Linear Hashing responds the fastest, with its RAF and latency as low as 0.43 and 1.63 µs, respectively. This is because Linear Hashing uses only one hash function to locate a key and collocates collided KVs contiguously, so getting a key requires reading less than one NAND page on average. Cuckoo, Level, and TieredHM use two hash functions, among which Level and TieredHM further employ multilayered structures. However, Level Hashing (4.61 µs at 90% fullness) exhibits slightly lower search latency than Cuckoo Hashing (5.03 µs) since Level's two-level structure shows better locality than Cuckoo. TH-Orig exhibits, on average, 3.07×, 1.39×, and 1.28× higher latency than Linear, Cuckoo, and Level since it has to search multiple layers to get the requested key. In contrast, TieredHM effectively reduces the latency to less than 4.39 µs, which is even lower than Cuckoo and Level. TieredHM improves over TH-Orig by 22.5%-48.1% as the indexing fullness increases from 10% to 90%, with the RAF increased by only 0.32 on average. The overhead on the overall read bandwidth (around 436.9 MB/s) is trivial compared with modern SSDs' bandwidth (over 10 GB/s). The search latency of LevelDB is much higher than that of the hash indexings since it has to scan each log structure, consisting of many pages in multiple layers, to find the up-to-date key. Instead, the location in each layer is deterministic in TieredHM and can be precalculated.
D. Effectiveness of Page Placement and Multistream
1) Effectiveness of Page Placement: To verify the effectiveness of hotspot-aware page placement for each hash indexing, we compare the performance under a physical layout on pure MS-SSD (denoted as Orig-LO) and on hybrid memory built with various ratios of PM/SSD.
Results are shown in Fig. 11. Compared to Orig-LO, page placement under hybrid memory speeds up the insertion of Linear-PL, Cuckoo-PL, Level-PL, and TieredHM-PL by at most 1.01×, 1.01×, 1.44×, and 13.74×, respectively, and their internal WAFs are reduced by at most 1.04×, 1.04×, 1.52×, and 28.83×. During the update, page placement speeds up Linear-PL, Cuckoo-PL, Level-PL, and TieredHM-PL by at most 1.26×, 1.25×, 1.66×, and 4.76×, with the internal WAF reduced by at most 1.35×, 1.34×, 1.78×, and 6.62×. Note that the relative improvements without the PM-based WAL are similar. The above results imply that Cuckoo and Linear can hardly benefit from page placement due to the lack of skewness, and that Level Hashing shows more improvement since its multilayered structure provides better, but still insufficient, skewness, as corroborated by the results in Fig. 2(a)-(c). In contrast, TieredHM reduces the internal WAF and latency significantly by building significant skewness among layers, which can be verified by the results in the last column of Table III.
2) Effectiveness of Collaboration With Multistream: We verify the effectiveness of the collaboration with multistream for Level Hashing and TieredHM by varying the ratio of search/insertion. Fig. 12 compares the latency and GC overhead with and without the aid of the multistream feature. Across all five workloads, Level Hashing reduces the latency by 0.42% on average, with the GC overhead reduced by 0.76% on average. In contrast, TieredHM reduces the latency by 17.70% on average, with the GC overhead reduced by 54.41% on average. The results prove that the write-skewed TieredHM benefits from the multistream strategy to effectively reduce the GC overhead, while Level Hashing cannot leverage it to improve write efficiency.

E. Effectiveness of Stand-Alone Design
This section verifies the effectiveness of each stand-alone optimization for insert, search, update, and delete.
1) Effectiveness of Opportunistic Data Movement: To demonstrate the effectiveness of the ODM, we configure TieredHM without the corresponding design. As shown in Fig. 13, ODM reduces the insert latency and internal WAF by 86.09% and 91.00% on average. This proves that TieredHM improves write efficiency for MS-SSD-based HMA by building significant skewness. We also test the effectiveness of the maximum one flush (MOF) policy. When TieredHM is enabled with ODM but without MOF, the latency and internal WAF increase by 10.85× and 16.90×, which is even worse than the version without ODM. The results corroborate the design principle that data movement pays off only when its flushes are shared with regular writes.
2) Effectiveness of Read Optimizations: We use YCSB-C to quantify the effectiveness of each search optimization alone by configuring TieredHM with and without the corresponding designs. As shown in Fig. 14(a), under 90% indexing fullness, shadow reading, priority, the signature array, and prefetch improve the average search latency by 13.2%, 16.2%, 15.6%, and 36.5%, respectively. Overall, TieredHM accelerates TH-Orig by 48% with only a 5.6% RAF increase. Fig. 9 gives a more holistic analysis. The priority and prefetch strategies mainly take effect under high fullness, while the signature array takes effect regardless of the indexing fullness. This is because the search path in multilayered structures increases with the fullness of the indexings, enlarging the optimization space for priority and prefetching. Note that, since TieredHM cannot move up KVs after deletion, KVs reside in lower layers. Therefore, the case of 90% fullness represents the common scenario, where TH-Orig performs poorly.
3) Effectiveness of Lazy Update and Deletion: We analyze the effectiveness of lazy update and deletion by varying the fullness of the indexings. Fig. 14(b)-(d) demonstrates the results. With the PM-based WAL enabled, "lazy deletion" reduces the latency and internal WAF by 82.61% and 90.80% on average, and "lazy update" reduces the latency and internal WAF by 59.95% and 65.84% on average. Results are similar under either configuration of the WAL. Since the "lazy" strategy always tries to find a writable slot in the top layer, TieredHM benefits significantly from the PM by reducing writes to the MS-SSD. In addition, with the aid of ODM, the "lazy" strategy keeps taking effect as the indexing fullness increases.

4) Sensitivity Analysis of Multiple Layers:
To demonstrate the effectiveness of the multitier design of TieredHM, we evaluate TieredHM (TieredHM-X) with two, three, and four tiers of hash tables. All versions are configured with similar capacities and PM/SSD ratios (around 1/24). Note that we use a PM-based log to compare the best performance. Table III shows the load factor, read and write performance, and write skewness. We denote the fraction of user writes served by the corresponding layers during insertion as the write skewness. The fraction of user requests served by PM (L0) dominates the write performance. Results show that, with more layers configured, the write skewness toward PM slightly increases, rendering better insert, update, and delete performance. This is because more layers deliver more opportunities for ODM to open up slots in PM and to conduct lazy updates or deletions. The load factor also increases with the number of layers since relocation among layers provides more chances to resolve hash collisions. However, the drawback of more layers is higher read latency and amplification. Although prefetching can help to reduce the read overhead, the bandwidth and internal parallelism (reflected by the RAF) are limited on a given SSD. These results guide us to implement TieredHM with three layers to strike a good balance among load factor, write, and read performance.

VII. CONCLUSION
This article presents TieredHM, a multilayered hash indexing customized for emerging MS-SSDs under the hybrid-memory architecture. TieredHM employs an ODM scheme to generate a skewed write workload to improve write efficiency for MS-SSD-based HMA. We then develop the maximum one flush policy to mitigate the data movement overhead. Finally, we employ Parallelism-Aware Prefetching to achieve predictable search performance. Our experiments show that TieredHM delivers comparable search performance against other hash indexings, such as Level Hashing, and write efficiency similar to LSM-tree indexings, such as LevelDB.
APPENDIX
https://nnsslab.com/file/APPENDIX%20FILE.pdf

Fig. 2. (a)-(c) I/O characteristics of different hash indexings. The major difference is the update frequency distribution over memory addresses, as indicated by the y-axis. (d) Flushing overhead of KV relocation.

Fig. 3. (a) H1 and H2 are two hash functions. Buckets are indexed using the MSBs of the hash of keys. One bucket in the upper layer corresponds to 2^n (n = 1 by default) consecutive buckets in the successive layer. The hash path of a given key consists of all candidate buckets indexed by the same hash function from the top to the bottom layer. (b) Signature array (Section V-B).

Fig. 4. Insertion schemes of TieredHM. Only one candidate bucket in each layer is shown for simplicity. Assume a page contains two buckets, and a bucket contains two slots. Dashed arrows point to the middle of buckets. A-C collide into the same bucket in L0. In (b), B, A, and C are flushed together (Section IV-B2). Shadow slots are both writable and readable (Section V-A). (a) Regular insert scheme. (b) Opportunistic data move. (c) Lazy update. (d) Merge during ODM.

Fig. 6. Optimized lookup in TieredHM. Assume keys B and C have the same signature, depicted as SigB. Note that the signature array is built only for L0 to minimize the DRAM footprint.

Fig. 9. Search (YCSB-C) performance analysis. We vary the amount of data inserted (90%, 50%, and 10%) to test hash read performance under different fullness. The effectiveness of each stand-alone design is depicted as +X. (a) Search performance. (b) RAF in SSD.

Fig. 11. Effectiveness of page placement. We compare the insert and update performance of different hash schemes by varying the ratio of PM/SSD. Orig-LO on the x-axis stands for the preliminary layout [57] on sheer MS-SSD. (a) Insert perf. (b) Int. WAF of insert. (c) GC of insert. (d) Update perf. (e) Int. WAF of update. (f) GC of update.

TABLE I TRUTH TABLE FOR SLOT STATE

TABLE III TRADEOFF OF NUMBER OF LAYERS