Abstract:
Data deduplication stores a single physical instance of each piece of data and eliminates redundant copies, identified by matching strong data hashes (i.e., fingerprints), to save storage space. The adoption of deduplication for primary storage has been hampered by its complexities, such as random access patterns to data and the need for fast request response times. Most solutions designed for primary storage are offline and depend on locality. This paper proposes an inline deduplication system with a Machine Learning based cache eviction policy that reduces the metadata overhead of the deduplication process, eliminates redundant writes, and improves overall throughput for latency-sensitive storage workloads. Caching of fingerprints plays a vital role in deduplication performance. A novel Machine Learning model for cache eviction is built on the recency, frequency, Logical Block Address, and category of a data block. The experimental results show that 33% of redundant writes are eliminated, 54.5% of metadata overhead is reduced by exploiting block similarity, and the metadata cache hit rates with the Machine Learning model are 5.43% and 10.36% higher than those of systems with Least Recently Used and Least Frequently Used eviction policies, respectively. The workload-dependent Machine Learning based cache eviction policy achieves 14.4% better throughput than a system with a traditional cache eviction policy. The cache system learns from the I/O statistics of previously evicted blocks and refines itself when choosing an eviction candidate. The system was evaluated on real-world I/O traces.
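The inline write path described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the class `DedupCache`, the `CacheEntry` fields, and the hand-written `eviction_score` function with its weights are all hypothetical, and the scoring function only stands in for the learned model, whose form the abstract does not specify. SHA-256 is assumed as the strong hash. The sketch uses recency, frequency, and category in the score and records the Logical Block Address as a feature a trained model could consume.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CacheEntry:
    physical_addr: int  # where the single stored copy of the block lives
    last_access: int    # logical clock value of the latest access (recency)
    hits: int           # number of accesses so far (frequency)
    lba: int            # Logical Block Address of the latest write
    category: int       # hypothetical block category label


def eviction_score(entry, now):
    """Stand-in for a trained model: higher score = better eviction candidate.

    The weights are illustrative assumptions, not values from the paper.
    """
    age = now - entry.last_access
    return age - 2.0 * entry.hits - 1.0 * entry.category


class DedupCache:
    """Fingerprint cache for an inline deduplicating write path (sketch)."""

    def __init__(self, capacity, score_fn=eviction_score):
        self.capacity = capacity
        self.score_fn = score_fn
        self.entries = {}   # fingerprint -> CacheEntry
        self.lba_map = {}   # LBA -> physical address
        self.clock = 0
        self.next_physical = 0
        self.writes_eliminated = 0

    def write(self, lba, data, category=0):
        """Return True if the block was physically written, False if deduplicated."""
        self.clock += 1
        fingerprint = hashlib.sha256(data).hexdigest()
        entry = self.entries.get(fingerprint)
        if entry is not None:
            # Fingerprint hit: point the LBA at the existing copy, skip the write.
            entry.last_access = self.clock
            entry.hits += 1
            entry.lba = lba
            self.lba_map[lba] = entry.physical_addr
            self.writes_eliminated += 1
            return False
        if len(self.entries) >= self.capacity:
            self._evict()
        addr = self.next_physical
        self.next_physical += 1
        self.entries[fingerprint] = CacheEntry(addr, self.clock, 1, lba, category)
        self.lba_map[lba] = addr
        return True

    def _evict(self):
        # Drop the cached fingerprint that the scoring function ranks highest.
        # The stored block itself is untouched; only the cached metadata goes.
        victim = max(
            self.entries,
            key=lambda fp: self.score_fn(self.entries[fp], self.clock),
        )
        del self.entries[victim]
```

A short usage example: writing the same payload to two addresses stores it once, and a frequently hit fingerprint survives eviction under this placeholder score.

```python
cache = DedupCache(capacity=2)
cache.write(0, b"A")   # new block, physically written
cache.write(1, b"A")   # duplicate, write eliminated
cache.write(2, b"B")
cache.write(3, b"C")   # cache full: the fingerprint of b"B" is evicted
```

In this sketch a fingerprint evicted from the cache can no longer be matched, so a later duplicate of that block would be written again; a full system would typically back the cache with a persistent fingerprint index.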
Published in: 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM)
Date of Conference: 03-05 January 2022
Date Added to IEEE Xplore: 28 February 2022