2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)

Date: 16-20 April 2012

  • Integrating flash-based SSDs into the storage stack

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (3)

    Over the past few years, hybrid storage architectures that use high-performance SSDs in concert with high-density HDDs have received significant interest from both industry and academia, due to their capability to improve performance while reducing capital and operating costs. These hybrid architectures differ in their approach to integrating SSDs into the traditional HDD-based storage stack. Of several such possible integrations, two have seen widespread adoption: Caching and Dynamic Storage Tiering (DST). Although the effectiveness of these architectures under certain workloads is well understood, a systematic side-by-side analysis of these approaches remains difficult due to the range of design alternatives and configuration parameters involved. Such a study is required now more than ever to be able to design effective hybrid storage solutions for deployment in increasingly virtualized modern storage installations that blend several workloads into a single stream. In this paper, we first present our extensions to the Loris storage stack that transform it into a framework for designing hybrid storage systems. We then illustrate the flexibility of the framework by designing several Caching and DST-based hybrid systems. Following this, we present a systematic side-by-side analysis of these systems under a range of individual workload types and offer insights into the advantages and disadvantages of each architecture. Finally, we discuss the ramifications of our findings on the design of future hybrid storage systems in light of recent changes in the hardware landscape and application workloads.
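
    The sketch below (not part of the paper; the class and the hot-block threshold are hypothetical, not the Loris framework) illustrates the basic contrast between the two integrations: caching copies hot blocks to the SSD while the HDD remains authoritative, whereas tiering migrates blocks so each lives on exactly one device.

    ```python
    # Illustrative caching vs. tiering policies over a toy block store.
    class HybridStore:
        def __init__(self, hot_threshold=3):
            self.ssd, self.hdd, self.heat = {}, {}, {}
            self.hot_threshold = hot_threshold

        def write(self, block, data):
            self.hdd[block] = data              # HDD is the authoritative store

        def read_cached(self, block):
            """Caching: hot blocks are *copied* to the SSD; the HDD keeps its copy."""
            self.heat[block] = self.heat.get(block, 0) + 1
            if block in self.ssd:
                return self.ssd[block]
            data = self.hdd[block]
            if self.heat[block] >= self.hot_threshold:
                self.ssd[block] = data          # populate cache; HDD copy remains
            return data

        def read_tiered(self, block):
            """Tiering: hot blocks are *moved* to the SSD tier."""
            self.heat[block] = self.heat.get(block, 0) + 1
            if block in self.ssd:
                return self.ssd[block]
            data = self.hdd[block]
            if self.heat[block] >= self.hot_threshold:
                self.ssd[block] = self.hdd.pop(block)   # migrate, do not copy
            return data
    ```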

  • Active Flash: Out-of-core data analytics on flash storage

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (2)

    Next-generation science will increasingly come to rely on the ability to perform efficient, on-the-fly analytics of data generated by high-performance computing (HPC) simulations modeling complex physical phenomena. Scientific computing workflows are stymied by the traditional chaining of simulation and data analysis, creating multiple rounds of redundant reads and writes to the storage system, whose cost grows with the ever-increasing gap between compute and storage speeds in HPC clusters. Recent HPC acquisitions have introduced compute node-local flash storage as a means to alleviate this I/O bottleneck. We propose a novel approach, Active Flash, to expedite data analysis pipelines by migrating analysis to the location of the data: the flash device itself. We argue that Active Flash has the potential to enable true out-of-core data analytics by freeing up both the compute core and the associated main memory. By performing analysis locally, dependence on limited bandwidth to a central storage system is reduced, while allowing this analysis to proceed in parallel with the main application. In addition, offloading work from the host to the more power-efficient controller reduces peak system power usage, which is already in the megawatt range and poses a major barrier to HPC system scalability. We propose an architecture for Active Flash, explore energy and performance trade-offs in moving computation from host to storage, demonstrate the ability of appropriate embedded controllers to perform data analysis and reduction tasks at speeds sufficient for this application, and present a simulation study of Active Flash scheduling policies. These results show the viability of the Active Flash model, and its potential to have a transformative impact on scientific data analysis.
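
    As a hedged illustration of the energy/performance trade-off the paper explores, the helper below checks when offloading an analysis kernel to the flash controller is attractive: the controller must save energy, and its extra latency must hide inside host compute time. The function and all numbers are illustrative assumptions, not measurements from the paper.

    ```python
    def offload_is_attractive(data_mb, host_mbps, ctrl_mbps,
                              host_watts, ctrl_watts, host_slack_s):
        t_host = data_mb / host_mbps        # seconds on a host core
        t_ctrl = data_mb / ctrl_mbps        # seconds on the SSD controller
        e_host = t_host * host_watts        # joules consumed on the host
        e_ctrl = t_ctrl * ctrl_watts
        # Offload if it is cheaper in energy and its added latency fits
        # inside time the host would spend computing anyway.
        return e_ctrl < e_host and t_ctrl <= t_host + host_slack_s

    # 1 GB of analysis: a 2 W embedded controller at 100 MB/s vs. a 20 W
    # host core at 400 MB/s, with 10 s of overlappable host compute.
    print(offload_is_attractive(1024, 400, 100, 20, 2, 10.0))   # True
    ```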

  • Flashy prefetching for high-performance flash drives

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (1)

    While hard drives hold on to the capacity advantage, flash-based solid-state drives (SSDs) with high bandwidth and low latency have become good alternatives for I/O-intensive applications. Traditional data prefetching has been designed primarily to improve I/O performance on hard drives. The same techniques, if applied unchanged to flash drives, are likely either to fail to fully utilize SSDs or to interfere with application I/O requests, both of which could result in poor application performance. In this work, we demonstrate that data prefetching, when it effectively harnesses the high performance of SSDs, can provide significant performance benefits for a wide range of data-intensive applications. The new technique, flashy prefetching, consists of accurate prediction of application needs at runtime and adaptive, feedback-directed prefetching that scales with application needs while being considerate to the underlying storage devices. We have implemented a real system in Linux and evaluated it on four different SSDs. The results show 65-70% prefetching accuracy and an average 20% speedup on LFS, web search engine traces, BLAST, and TPC-H-like benchmarks across various storage drives.
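
    The controller below sketches the feedback idea in the abstract: prefetch depth grows while prefetched blocks are actually consumed and shrinks when they are wasted, keeping the prefetcher "considerate" to the device. The structure and thresholds are assumptions for illustration, not the authors' implementation.

    ```python
    class AdaptivePrefetcher:
        def __init__(self, max_depth=64):
            self.depth, self.max_depth = 1, max_depth
            self.issued, self.used = 0, 0

        def on_demand_read(self, block, was_prefetched):
            if was_prefetched:
                self.used += 1
            # Simple sequential prediction: prefetch the next `depth` blocks.
            prefetch = list(range(block + 1, block + 1 + self.depth))
            self.issued += len(prefetch)
            return prefetch

        def adapt(self):
            """Call periodically: scale depth with observed accuracy."""
            if self.issued == 0:
                return
            accuracy = self.used / self.issued
            if accuracy > 0.70:                   # predictions pay off
                self.depth = min(self.depth * 2, self.max_depth)
            elif accuracy < 0.30:                 # mostly wasted bandwidth
                self.depth = max(self.depth // 2, 1)
            self.issued = self.used = 0           # start a new window
    ```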

  • Mercury: Host-side flash caching for the data center

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (6) | Patents (1)

    The adoption of flash memory in high-volume consumer products such as cell phones, tablet computers, digital cameras, and portable music players has driven down flash costs and increased flash quality. This trend is pushing flash memory into new applications, including enterprise computing. In enterprise data centers, servers containing flash-based Solid-State Drives (SSDs) are becoming common. However, data center architects prefer to deploy shared storage over direct-attached storage (DAS). Shared storage offers superior manageability, availability, and scalability compared to DAS. For these reasons, system designers want to reap the benefits of direct-attached flash memory without decreasing the value of shared storage systems. Our solution is Mercury, a persistent, write-through host-side cache for flash memory. By designing Mercury as a hypervisor cache, we simplify integration and deployment into host environments. This paper presents our experience building a host-side flash cache, an architectural analysis of possible cache attachment points, and a performance evaluation using enterprise workloads. Our results show a 26% improvement in the bandwidth observed by the Jetstress benchmark and a 500% improvement in the I/O rate of an enterprise workload.
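
    A minimal sketch of a persistent write-through cache of the kind described above: every write completes on shared storage before it is cached, so losing the host-side flash never loses data. The class and the backend callbacks are hypothetical stand-ins, not Mercury's interfaces.

    ```python
    class WriteThroughCache:
        def __init__(self, backend_read, backend_write, capacity=1024):
            self.cache = {}                     # block -> data (flash-resident)
            self.capacity = capacity
            self.backend_read, self.backend_write = backend_read, backend_write

        def write(self, block, data):
            self.backend_write(block, data)     # complete on shared storage first
            self._insert(block, data)           # then cache; losing this is harmless

        def read(self, block):
            if block in self.cache:             # hit: served from host-side flash
                return self.cache[block]
            data = self.backend_read(block)
            self._insert(block, data)
            return data

        def _insert(self, block, data):
            if len(self.cache) >= self.capacity:
                self.cache.pop(next(iter(self.cache)))   # crude FIFO eviction
            self.cache[block] = data
    ```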

  • On the role of burst buffers in leadership-class storage systems

    Publication Year: 2012, Page(s): 1-11
    Cited by: Papers (16)

    The largest-scale high-performance computing (HPC) systems are stretching parallel file systems to their limits in terms of aggregate bandwidth and numbers of clients. To further sustain the scalability of these file systems, researchers and HPC storage architects are exploring various storage system designs. One proposed design integrates a tier of solid-state burst buffers into the storage system to absorb application I/O requests. In this paper, we simulate and explore this storage system design for use by large-scale HPC systems. First, we examine application I/O patterns on an existing large-scale HPC system to identify common burst patterns. Next, we describe enhancements to the CODES storage system simulator to enable our burst buffer simulations. These enhancements include the integration of a burst buffer model into the I/O forwarding layer of the simulator, the development of an I/O kernel description language and interpreter, the development of a suite of I/O kernels derived from observed I/O patterns, and fidelity improvements to the CODES models. We evaluate the I/O performance for a set of multi-application I/O workloads and burst buffer configurations. We show that burst buffers can accelerate the application-perceived throughput to the external storage system and can reduce the amount of external storage bandwidth required to meet a desired application-perceived throughput goal.
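
    The bandwidth argument can be seen with simple arithmetic: with a burst buffer, the external file system only has to drain a burst before the next one arrives, rather than absorb it at line rate. The numbers below are illustrative assumptions, not figures from the paper.

    ```python
    burst_gb        = 32.0     # checkpoint written by the application
    burst_seconds   = 10.0     # how fast compute nodes emit the burst
    compute_seconds = 600.0    # computation between bursts (drain window)

    bw_without_bb = burst_gb / burst_seconds                      # absorb at full speed
    bw_with_bb    = burst_gb / (burst_seconds + compute_seconds)  # drain between bursts

    print(f"external bandwidth, no burst buffer:   {bw_without_bb:.2f} GB/s")   # 3.20 GB/s
    print(f"external bandwidth, with burst buffer: {bw_with_bb:.3f} GB/s")      # 0.052 GB/s
    ```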

  • vPFS: Bandwidth virtualization of parallel storage systems

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (2)

    Existing parallel file systems are unable to differentiate I/O requests from concurrent applications and meet per-application bandwidth requirements. This limitation prevents applications from meeting their desired Quality of Service (QoS) as high-performance computing (HPC) systems continue to scale up. This paper presents vPFS, a new solution that addresses this challenge through a bandwidth virtualization layer for parallel file systems. vPFS employs user-level parallel file system proxies to interpose requests between native clients and servers and to schedule parallel I/Os from different applications based on configurable bandwidth management policies. vPFS is designed to be generic enough to support various scheduling algorithms and parallel file systems. Its utility and performance are studied with a prototype which virtualizes PVFS2, a widely used parallel file system. Enhanced proportional-sharing schedulers are enabled based on the unique characteristics (parallel striped I/Os) and requirement (high throughput) of parallel storage systems. The enhancements include new threshold- and layout-driven scheduling synchronization schemes which reduce global communication overhead while delivering total-service fairness. An experimental evaluation using typical HPC benchmarks (IOR, NPB BTIO) shows that the throughput overhead of vPFS is small (<3% for writes, <1% for reads). It also shows that vPFS can achieve good proportional bandwidth sharing (>96% of the target sharing ratio) for competing applications with diverse I/O patterns.
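
    One classic way to implement proportional sharing at a proxy is start-time fair queuing, sketched below; the paper's enhanced schedulers add threshold- and layout-driven synchronization across parallel servers on top of this style of tagging. The class is illustrative, not vPFS code.

    ```python
    import heapq

    class ProportionalShareProxy:
        def __init__(self, weights):             # e.g. {"appA": 3, "appB": 1}
            self.weights = weights
            self.virtual_time = 0.0
            self.last_finish = {app: 0.0 for app in weights}
            self.queue = []                      # (start_tag, seq, app, io_size)
            self.seq = 0

        def submit(self, app, io_size):
            # Start tag: resume from the app's last finish, or from now if idle.
            start = max(self.virtual_time, self.last_finish[app])
            self.last_finish[app] = start + io_size / self.weights[app]
            heapq.heappush(self.queue, (start, self.seq, app, io_size))
            self.seq += 1

        def dispatch(self):
            """Serve the request with the smallest start tag."""
            start, _, app, io_size = heapq.heappop(self.queue)
            self.virtual_time = start
            return app, io_size
    ```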

  • On the speedup of single-disk failure recovery in XOR-coded storage systems: Theory and practice

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (9)

    Modern storage systems stripe redundant data across multiple disks to provide availability guarantees against disk failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to providing failure tolerance, a storage system must also provide fast failure recovery to avoid data unavailability. We consider the problem of speeding up the recovery of a single-disk failure for arbitrary XOR-based erasure codes. We address this problem from both theoretical and practical perspectives. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further implement our replace recovery algorithm atop a parallelized architecture to justify its practicality. We evaluate our replace recovery algorithm and its parallelized implementation on a networked storage system testbed, and demonstrate that our replace recovery algorithm uses less recovery time than the conventional approach.
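
    The general shape of such a hill-climbing search is sketched below, with the code-specific details (the candidate parity equations and the read-cost function) left as hypothetical callbacks; this is only the greedy skeleton, not the paper's algorithm.

    ```python
    def hill_climb_recovery(initial_solution, candidate_equations, reads_needed):
        """Start from the conventional recovery solution; greedily replace
        one decoding choice at a time while the total number of blocks
        read from surviving disks keeps decreasing."""
        best = list(initial_solution)
        best_cost = reads_needed(best)
        improved = True
        while improved:
            improved = False
            for i, _ in enumerate(best):
                for alt in candidate_equations(i):      # alternative parity eqs.
                    trial = best[:i] + [alt] + best[i + 1:]
                    cost = reads_needed(trial)          # blocks read off disk
                    if cost < best_cost:                # greedy replace step
                        best, best_cost, improved = trial, cost, True
        return best, best_cost
    ```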

  • An active storage framework for object storage devices

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (2)

    In this paper, we present the design and implementation of an active storage framework for object storage devices. The framework is based on the use of virtual machines/execution engines to execute function code downloaded from client applications. We investigate the issues involved in supporting multiple execution engines. Allowing user-downloadable code fragments introduces potential safety and security considerations, and we study the effect of these considerations on these engines. In particular, we look at various remote procedure execution mechanisms and the efficiency and safety of these mechanisms. Finally, we present performance results of the active storage framework on a variety of applications.

  • A new high-performance, energy-efficient replication storage system with reliability guarantee

    Publication Year: 2012, Page(s): 1-6
    Cited by: Papers (2)

    In modern replication storage systems, where data carries two or more copies, a primary group of disks is always up to service incoming requests, while other disks are often spun down to sleep states to save energy during slack periods. However, since new writes cannot be immediately synchronized onto all disks, system reliability is degraded. This paper develops PERAID, a new high-performance, energy-efficient replication storage system, which aims to improve both performance and energy efficiency without compromising reliability. It employs a parity software RAID as a virtual write buffer disk at the front end to absorb new writes. Since the extra parity redundancy supplies two or more copies, PERAID guarantees reliability comparable to that of a replication storage system. In addition, PERAID offers better write performance compared to the replication system by avoiding the classical small-write problem of traditional parity RAID: it buffers many small random writes into a few large writes and writes them to storage in a parallel fashion. By evaluating our PERAID prototype using two benchmarks and two real-life traces, we found that PERAID significantly improves write performance and saves more energy than existing solutions such as GRAID and eRAID.

  • HRAID6ML: A hybrid RAID6 storage architecture with mirrored logging

    Publication Year: 2012, Page(s): 1-6
    Cited by: Papers (1)

    RAID6 provides high reliability through double parity updates, at the cost of a high write penalty. In this paper, we propose HRAID6ML, a new logging architecture for RAID6 systems offering enhanced energy efficiency, performance, and reliability. HRAID6ML employs a group of Solid State Drives (SSDs) and Hard Disk Drives (HDDs): two HDDs (parity disks) and several SSDs form the RAID6 array. The free space of the two parity disks is used as a mirrored log region for the whole system to absorb writes. The mirrored logging policy helps recover the system from a parity disk failure, and the mirrored logging operation does not introduce noticeable performance overhead. HRAID6ML eliminates additional hardware and energy costs, a potential single point of failure, and a performance bottleneck. Furthermore, HRAID6ML prolongs the life cycle of the SSDs and improves the system's energy efficiency by reducing the SSDs' write frequency. We have implemented the proposed HRAID6ML. Extensive trace-driven evaluations demonstrate the advantages of HRAID6ML over both traditional SSD-based and HDD-based RAID6 systems.

  • Write amplification due to ECC on flash memory or leave those bit errors alone

    Publication Year: 2012, Page(s): 1-6
    Cited by: Papers (4)

    While flash memory is receiving significant attention because of many attractive properties, concerns about write endurance delay its wider deployment. This paper analyzes the effectiveness of protection schemes designed for flash memory, such as ECC and scrubbing. The bit error rate of flash memory is a function of the number of program-erase cycles a cell has gone through, making reliability dependent on time and workload. Moreover, some of the protection schemes require additional write operations, which degrade flash memory's reliability. These issues make it more complex to reveal the relationship between the protection schemes and flash memory's lifetime. In this paper, a Markov model based analysis of the protection schemes is presented. Our model considers the time-varying reliability of flash memory as well as the write amplification of various protection schemes such as ECC. Our study shows that write amplification from these various sources can significantly affect the benefits of these schemes in improving lifetime. Based on the results of our analysis, we propose that bit errors within a page be left uncorrected until a threshold number of errors has accumulated. We show that such an approach can improve lifetimes by up to 40%.
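
    The proposed policy can be sketched as follows, with an illustrative ECC strength and threshold (assumptions, not the paper's parameters): correctable errors are tolerated until they approach the ECC limit, and only then is the page rewritten.

    ```python
    ECC_CORRECTABLE = 8          # bits the code can correct per page (assumed)
    SCRUB_THRESHOLD = 6          # act only when close to the ECC limit (assumed)

    def on_page_read(page_errors, rewrite_page):
        """page_errors: bit errors ECC just corrected on this read."""
        if page_errors >= ECC_CORRECTABLE:
            raise IOError("uncorrectable page")      # data loss without scrubbing
        if page_errors >= SCRUB_THRESHOLD:
            rewrite_page()       # one write now prevents an uncorrectable page
        # else: leave the bit errors alone; rewriting early would burn
        # program-erase cycles and amplify writes for no lifetime benefit
    ```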

  • Storage challenges at Los Alamos National Lab

    Publication Year: 2012, Page(s): 1-5
    Cited by: Papers (2)

    No truly parallel file systems yet exist. Those that make the claim fall short when it comes to providing adequate concurrent write performance at large scale. This limitation causes large usability headaches in HPC. Users need two major capabilities missing from current parallel file systems. One, they need low-latency interactivity. Two, they need high bandwidth for large parallel IO; this capability must be resistant to IO patterns and should not require tuning. No existing parallel file systems provide these features. Frighteningly, exascale renders these features even less attainable with currently available parallel file systems. Fortunately, there is a path forward.

  • Adaptive pipeline for deduplication

    Publication Year: 2012, Page(s): 1-6

    Deduplication has become one of the hottest topics in the field of data storage. Quite a few methods for reducing the disk I/O caused by deduplication have been proposed, and some methods have also been studied to accelerate the computational sub-tasks of deduplication. However, the order of the computational sub-tasks can affect overall deduplication throughput significantly, because the sub-tasks exhibit quite different workloads and concurrency in different orders and with different data sets. This paper proposes an adaptive pipelining model for the computational sub-tasks in deduplication. It takes both data type and hardware platform into account. Taking the compression ratio and the duplicate ratio of the data stream, and the compression speed and the fingerprinting speed on different processing units, as parameters, it determines the optimal order of the pipeline stages (computational sub-tasks) and assigns each stage to the processing unit that processes it fastest. That is, “adaptive” refers to both data-adaptive and hardware-adaptive. Experimental results show that the adaptive pipeline improves deduplication throughput by up to 50% compared with a plain fixed pipeline, which implies that it is suitable for simultaneous deduplication of various data types on modern heterogeneous multi-core systems.
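
    A toy version of the ordering decision, under an assumed per-byte cost model (not the paper's): with duplicate ratio d and compression ratio c, fingerprint-first compresses only the unique fraction, while compress-first fingerprints the smaller compressed stream (which implies deduplicating compressed chunks). Pick whichever costs less at the measured speeds.

    ```python
    def best_order(dup_ratio, comp_ratio, fp_mbps, comp_mbps):
        # seconds to push one MB through each arrangement
        fp_first   = 1 / fp_mbps + (1 - dup_ratio) / comp_mbps
        comp_first = 1 / comp_mbps + comp_ratio / fp_mbps
        return "fingerprint-first" if fp_first <= comp_first else "compress-first"

    # Highly duplicated backup stream, slow compressor: skip compressing
    # the 80% duplicates by fingerprinting first.
    print(best_order(dup_ratio=0.8, comp_ratio=0.5, fp_mbps=400, comp_mbps=100))
    # -> fingerprint-first
    ```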

  • Shortcut-JFS: A write efficient journaling file system for phase change memory

    Publication Year: 2012, Page(s): 1-6
    Cited by: Papers (3)

    Journaling file systems are widely used in modern computer systems as they provide high reliability with reasonable performance. However, existing journaling file systems are not efficient for emerging PCM (Phase Change Memory) storage. Specifically, the large volume of write operations performed by journaling incurs serious performance degradation on PCM storage, which has long write latency. In this paper, we present a new journaling file system for PCM, called Shortcut-JFS, that reduces the write volume of journaling by more than half by exploiting the byte-addressability of PCM. Specifically, Shortcut-JFS employs two novel schemes: 1) differential logging, which performs journaling only for modified bytes, and 2) in-place checkpointing, which removes unnecessary block copy overhead. We implemented Shortcut-JFS on Linux 2.6 and measured the performance of Shortcut-JFS against the legacy journaling schemes used in ext3. The results show that the performance improvement of Shortcut-JFS over ext3 is 40% on average.
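
    Differential logging is easy to sketch at byte granularity: journal only the modified runs of a block rather than the whole block, something byte-addressable PCM can apply directly. The record layout below is an illustrative assumption, not Shortcut-JFS's format.

    ```python
    def diff_log_records(old_block: bytes, new_block: bytes, block_no: int):
        """Yield (block_no, offset, changed_bytes) runs instead of the block."""
        i, n = 0, len(old_block)
        while i < n:
            if old_block[i] != new_block[i]:
                j = i
                while j < n and old_block[j] != new_block[j]:
                    j += 1
                yield (block_no, i, new_block[i:j])      # one modified run
                i = j
            else:
                i += 1

    old = b"\x00" * 4096
    new = bytearray(old); new[100:108] = b"ABCDEFGH"
    print(list(diff_log_records(old, bytes(new), block_no=7)))
    # -> [(7, 100, b'ABCDEFGH')]: 8 journaled bytes instead of 4096
    ```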

  • Deduplication in SSDs: Model and quantitative analysis

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (5)

    In NAND flash-based SSDs, deduplication can provide an effective resolution of three critical issues: cell lifetime, write performance, and garbage collection overhead. However, deduplication at the SSD device level differs in many respects from deduplication in enterprise storage systems; its success lies in properly exploiting the very limited underlying hardware resources and the workload characteristics of SSDs. In this paper, we develop a novel deduplication framework carefully tailored to SSDs. We first develop an analytical model that enables us to calculate the minimum duplication rate required to achieve a performance gain for a given deduplication overhead. Then, we explore a number of design choices for implementing deduplication components in hardware or software. As a result, we propose two acceleration techniques: sampling-based filtering and recency-based fingerprint management. The former selectively applies deduplication based upon sampling, and the latter effectively exploits limited controller memory while maximizing the deduplication ratio. We prototype the proposed deduplication framework on three physical hardware platforms and investigate deduplication efficiency according to various CPU capabilities and hardware/software alternatives. Experimental results show that we achieve a duplication rate ranging from 4% to 51%, with an average of 17%, for the nine workloads considered in this work. The response time of a write request can be improved by up to 48%, with an average of 15%, while the lifespan of SSDs is expected to increase by up to 4.1 times, with an average of 2.4 times.
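
    The break-even intuition behind such a model fits in two lines, with illustrative costs (assumptions, not the paper's model): skipping a duplicate write saves the page-program cost W, but every write pays the fingerprinting/lookup overhead O, so deduplication pays off once the duplication rate d satisfies d*W > O.

    ```python
    def min_duplication_rate(write_cost_us, dedup_overhead_us):
        """Smallest duplication rate at which dedup improves write latency:
        d * write_cost > dedup_overhead  =>  d > overhead / write_cost."""
        return dedup_overhead_us / write_cost_us

    # e.g. 1300 us to program a flash page vs. 90 us to hash + look up:
    print(f"{min_duplication_rate(1300, 90):.1%}")   # ~6.9% duplicates needed
    ```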

  • Design of an exact data deduplication cluster

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (2)

    Data deduplication is an important component of enterprise storage environments. The throughput and capacity limitations of single-node solutions have led to the development of clustered deduplication systems. Most deployed inline clustered solutions trade deduplication ratio for performance and are willing to miss opportunities to detect redundant data that a single-node system would detect. We present an inline deduplication cluster with a joint distributed chunk index, which is able to detect as much redundancy as a single-node solution. The use of locality and load-balancing paradigms enables the nodes to minimize information exchange. We are therefore able to show that, despite different claims in previous papers, it is possible to combine exact deduplication, small chunk sizes, and scalability within one environment using only a commodity GBit Ethernet interconnect. Additionally, we investigate the throughput and scalability limitations, with a special focus on intra-node communication.
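
    Exactness in a distributed chunk index comes from giving every fingerprint a single home node, so a duplicate entering at any node is looked up exactly where a single-node index would find it. The routing below is an illustrative assumption, not the paper's protocol.

    ```python
    import hashlib

    NODES = ["node0", "node1", "node2", "node3"]

    def home_node(fingerprint: bytes) -> str:
        """Deterministic owner of this chunk's index entry."""
        return NODES[int.from_bytes(fingerprint[:4], "big") % len(NODES)]

    def is_duplicate(chunk: bytes, index: dict) -> bool:
        fp = hashlib.sha256(chunk).digest()
        owner = home_node(fp)                 # one lookup, always the same node
        node_index = index.setdefault(owner, set())
        if fp in node_index:
            return True
        node_index.add(fp)                    # first copy: register it
        return False

    index = {}
    print(is_duplicate(b"x" * 4096, index))   # False (first sight)
    print(is_duplicate(b"x" * 4096, index))   # True  (exact, cluster-wide)
    ```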

  • Estimation of deduplication ratios in large data sets

    Publication Year: 2012, Page(s): 1-11
    Cited by: Papers (3)

    We study the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set. This turns out to be a challenging task: it has been shown both empirically and analytically that essentially all of the data at hand needs to be inspected in order to come up with an accurate estimate when deduplication is involved. Moreover, even when permitted to inspect all the data, there are challenges in devising an efficient yet accurate method. Efficiency in this case refers to the demanding CPU, memory, and disk usage associated with deduplication and compression. Our study focuses on what can be done when scanning the entire data set. We present a novel two-phased framework for such estimations. Our techniques are provably accurate, yet run with very low memory requirements and avoid the overheads associated with maintaining large deduplication tables. We give formal proofs of the correctness of our algorithm, compare it to existing techniques from the database and streaming literature, and evaluate our technique on a number of real-world workloads. For example, we estimate the data reduction ratio of a 7 TB data set with accuracy guarantees of at most a 1% relative error while using as little as 1 MB of RAM (and no additional disk access). In the interesting case of full-file deduplication, our framework readily accepts optimizations that allow estimation on a large data set without reading most of the actual data. For one of the workloads we used in this work, we achieved an accuracy guarantee of 2% relative error while reading only 27% of the data from disk. Our technique is practical, simple to implement, and useful for multiple scenarios, including estimating the number of disks to buy, choosing a deduplication technique, deciding whether or not to dedupe, and conducting large-scale academic studies related to deduplication ratios.
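
    For flavor, the sketch below shows the classic content-based sampling trick that makes low-memory dedup estimation possible (a chunk samples itself when its hash falls under a threshold); this is illustrative background only, not a reconstruction of the paper's two-phase algorithm or its accuracy guarantees.

    ```python
    import hashlib

    def estimate_unique_fraction(chunks, sample_rate=1 / 1024):
        threshold = int(sample_rate * 2**64)
        seen, sampled = set(), 0
        for chunk in chunks:
            h = int.from_bytes(hashlib.sha256(chunk).digest()[:8], "big")
            if h < threshold:                 # chunk selects itself by its hash
                sampled += 1
                seen.add(h)
        # unique / total over the sample estimates the full-data-set ratio
        return len(seen) / sampled if sampled else 1.0

    data = [b"A" * 4096, b"B" * 4096] * 500 + [bytes([i]) * 4096 for i in range(200)]
    print(f"estimated unique fraction: {estimate_unique_fraction(data, 1.0):.2f}")
    ```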

  • Jitter-free co-processing on a prototype exascale storage stack

    Publication Year: 2012, Page(s): 1-5
    Cited by: Papers (3)

    In the petascale era, the storage stack used by the extreme-scale high-performance computing community is fairly homogeneous across sites. On the compute edge of the stack, file system clients or IO forwarding services direct IO over an interconnect network to a relatively small set of IO nodes. These nodes forward the requests over a secondary storage network to a spindle-based parallel file system. Unfortunately, this architecture will become unviable in the exascale era. As the density growth of disks continues to outpace increases in their rotational speeds, disks are becoming increasingly cost-effective for capacity but decreasingly so for bandwidth. Fortunately, new storage media such as solid-state devices are filling this gap; although not cost-effective for capacity, they are so for performance. This suggests that the storage stack at exascale will incorporate solid-state storage between the compute nodes and the parallel file systems. There are three natural places at which to position this new storage layer: within the compute nodes, the IO nodes, or the parallel file system. In this paper, we argue that the IO nodes are the appropriate location for HPC workloads and show results from a prototype system that we have built accordingly. Running a pipeline of computational simulation and visualization, we show that our prototype system reduces total time to completion by up to 30%.

  • Enhancing shared RAID performance through online profiling

    Publication Year: 2012, Page(s): 1-6
    Cited by: Papers (1)

    Enterprise storage systems are generally shared by multiple servers in a SAN environment. Our experiments, as well as industry reports, have shown that disk arrays perform poorly when multiple servers share one RAID, due to resource contention as well as frequent disk head movements. We have studied the IO performance characteristics of several shared storage settings drawn from practical business operations. To avoid this IO contention, we propose a new dynamic data relocation technique for shared RAID storage, referred to as DROP: Dynamic data Relocation to Optimize Performance. DROP allocates and manages a group of cache data areas and relocates the portion of hot data to a predefined sub-array, a physical partition on top of the entire shared array. By analyzing profiling data so that each cache area is owned by one server, we are able to determine the optimal data relocation and partition of disks in the RAID, maximizing large sequential block accesses on individual disks while at the same time maximizing parallel accesses across disks in the array. As a result, DROP minimizes disk head movements in the array at run time, giving rise to high IO performance. A prototype of DROP has been implemented as a software module at the storage target controller. Extensive experiments have been carried out using real-world IO workloads to evaluate the performance of the DROP implementation. Experimental results show that DROP improves shared IO performance greatly: the improvements in average IO response time range from 20% to a factor of 2.5, at no additional hardware cost.

  • Exploiting superpages in a nonvolatile memory file system

    Publication Year: 2012, Page(s): 1-5

    Emerging nonvolatile memory technologies (sometimes referred to as Storage Class Memory (SCM)) are poised to close the enormous performance gap between persistent storage and main memory. SCM devices can be attached directly to the memory bus and accessed like normal DRAM. It then becomes possible to exploit memory management hardware resources to improve file system performance. However, in this case, SCM may share critical system resources such as the TLB and page table with DRAM, which can potentially impact SCM's performance. In this paper, we propose to solve this problem by employing superpages to reduce the pressure on memory management resources such as the TLB, further improving file system performance. We also analyze the space utilization efficiency of superpages. We improve the space efficiency of the file system by allocating normal pages (4 KB) for small files while allocating superpages (2 MB on x86) for large files. We show that it is possible to achieve better performance without losing space utilization efficiency of nonvolatile memory.
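
    The allocation policy reduces to a size test, sketched below with the constants from the abstract (4 KB pages, 2 MB superpages on x86); the helper itself is an illustrative assumption, not the paper's allocator.

    ```python
    PAGE = 4 * 1024              # normal page (4 KB)
    SUPERPAGE = 2 * 1024 * 1024  # x86 superpage (2 MB) = 512 normal pages

    def plan_allocation(file_bytes):
        """Return (superpages, normal_pages) backing a file of this size."""
        if file_bytes < SUPERPAGE:
            return 0, -(-file_bytes // PAGE)      # small file: 4 KB pages only
        supers, tail = divmod(file_bytes, SUPERPAGE)
        return supers, -(-tail // PAGE)           # big file: superpages + 4 KB tail

    print(plan_allocation(10 * 1024))         # (0, 3)   -- 12 KB of 4 KB pages
    print(plan_allocation(5 * 1024 * 1024))   # (2, 256) -- 2 superpages + 1 MB tail
    ```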

  • SLO-aware hybrid store

    Publication Year: 2012, Page(s): 1-6
    Cited by: Papers (1)

    In the past, storage vendors used different types of storage depending upon the type of workload; for example, they used Solid State Drives (SSDs) or FC hard disks (HDDs) for online transaction processing, and SATA disks for archival-type workloads. Recently, however, many storage vendors have been designing hybrid SSD/HDD systems that can satisfy multiple service level objectives (SLOs) of different workloads placed together in one storage box, at better cost points. The combination is achieved by using SSDs as a read-write cache and HDDs as the permanent store. In this paper we present an SLO-based resource management algorithm that controls the amount of SSD given to a particular workload. This algorithm solves the following problems: 1) it ensures that workloads do not interfere with each other; 2) it ensures that we do not overprovision (cost-wise) the amount of SSD allocated to a workload to satisfy its SLO (latency requirement); and 3) it dynamically adjusts the SSD allocation in light of changing workload characteristics (i.e., it provides only the required amount of SSD). We have implemented our algorithm in a prototype Hybrid Store and have tested its efficacy using many real workloads. Our algorithm satisfies latency SLOs almost always, utilizing close to the optimal amount of SSD and saving 6-50% of SSD space compared to a naïve algorithm.
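
    A control loop of the kind implied above can be sketched as follows; the step sizes and hysteresis band are illustrative assumptions, not the paper's algorithm.

    ```python
    def adjust_ssd_quota(quota_gb, observed_lat_ms, slo_lat_ms,
                         min_gb=1, max_gb=512, step_gb=4):
        if observed_lat_ms > slo_lat_ms:             # SLO violated: add flash
            return min(quota_gb + step_gb, max_gb)
        if observed_lat_ms < 0.8 * slo_lat_ms:       # comfortably under SLO:
            return max(quota_gb - step_gb, min_gb)   # reclaim overprovisioned SSD
        return quota_gb                              # inside the band: hold steady

    quota = 64
    for lat in [12.0, 11.5, 6.0, 5.5, 9.0]:          # observed ms, SLO = 10 ms
        quota = adjust_ssd_quota(quota, lat, 10.0)
        print(quota)                                 # 68, 72, 68, 64, 64
    ```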

  • A QoS aware non-work-conserving disk scheduler

    Publication Year: 2012, Page(s): 1-5
    Cited by: Papers (3)

    Disk schedulers should provide QoS guarantees to applications, thus sharing the storage resource proportionally and enforcing performance isolation. Disk schedulers must nevertheless execute requests in an efficient order, preventing poor disk usage. Non-work-conserving disk schedulers help to increase disk throughput by predicting the arrival of future requests and thereby exploiting disk spatial locality. Previous work is limited to either providing QoS guarantees or exploiting disk spatial locality. In this paper, we propose a new non-work-conserving disk scheduler called High-throughput Token Bucket Scheduler (HTBS), which can provide both QoS guarantees and high throughput by (a) assigning tags to requests in a fair-queuing-like fashion and (b) predicting the arrival of future requests. We show, through experiments with our Linux kernel implementation, that HTBS both outperforms the throughput of previous QoS-aware work-conserving disk schedulers and provides tight QoS guarantees, unlike other non-work-conserving algorithms.
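
    The two mechanisms named in the abstract can be sketched as follows: token-bucket-style tags space each stream's requests at its reserved rate, and dispatch may deliberately idle (non-work-conserving) for a short anticipation window rather than seek away from a stream issuing sequential I/O. Everything here is illustrative, not the HTBS implementation.

    ```python
    class TaggedStream:
        def __init__(self, rate_bps, now=0.0):
            self.rate = rate_bps                 # reserved bandwidth (bytes/s)
            self.next_tag = now                  # virtual time of next dispatch

        def tag_request(self, size_bytes, now):
            tag = max(self.next_tag, now)        # token-bucket-style spacing
            self.next_tag = tag + size_bytes / self.rate
            return tag

    def pick_next(pending, now, last_stream, idle_since, anticipation_s=0.003):
        """pending: list of (stream, tag). Serve the smallest tag, but after
        serving a stream, briefly prefer idling to seeking: anticipate its
        next (likely sequential) request instead of switching streams."""
        stream, _ = min(pending, key=lambda p: p[1])
        if stream is not last_stream and now - idle_since < anticipation_s:
            return None                          # non-work-conserving idle slot
        return stream
    ```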

  • Valmar: High-bandwidth real-time streaming data management

    Publication Year: 2012, Page(s): 1-6

    In applications ranging from radio telescopes to Internet traffic monitoring, our ability to generate data has outpaced our ability to effectively capture, mine, and manage it. These ultra-high-bandwidth data streams typically contain little useful information, and most of the data can be safely discarded. Periodically, however, an event of interest is observed and a large segment of the data must be preserved, including data preceding detection of the event. Doing so requires guaranteed data capture at source rates, line-speed filtering to detect events and data points of interest, and a TiVo-like ability to save past data once an event has been detected. We present Valmar, a system for guaranteed capture, indexing, and storage of ultra-high-bandwidth data streams. Our results show that Valmar performs at nearly full disk bandwidth, up to several orders of magnitude faster than flat file and database systems, works well with both small and large data elements, and allows concurrent read and search access without compromising data capture guarantees.
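
    The capture model reduces to a ring buffer plus an event trigger, sketched below with a hypothetical window size and filter: the stream normally overwrites itself, and detecting an event freezes the window that preceded it.

    ```python
    from collections import deque

    class StreamCapture:
        def __init__(self, window, is_event):
            self.ring = deque(maxlen=window)   # old records fall off the end
            self.is_event = is_event           # line-speed filter callback
            self.preserved = []

        def ingest(self, record):
            self.ring.append(record)
            if self.is_event(record):
                # TiVo-like save: keep the history leading up to the event
                self.preserved.append(list(self.ring))

    cap = StreamCapture(window=5, is_event=lambda r: r >= 100)
    for r in [1, 2, 3, 4, 5, 6, 150, 7]:
        cap.ingest(r)
    print(cap.preserved)    # [[3, 4, 5, 6, 150]]: event plus preceding data
    ```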

  • ADAPT: Efficient workload-sensitive flash management based on adaptation, prediction and aggregation

    Publication Year: 2012, Page(s): 1-12

    Solid-state drives (SSDs) made of flash memory are widely utilized in enterprise servers nowadays. Internally, the management of flash memory resources is performed by embedded software known as the flash translation layer (FTL). One important function of the FTL is to map logical addresses issued by the operating system to physical flash addresses. The efficiency of this address mapping directly impacts the performance of SSDs. In this paper, we propose a hybrid mapping FTL scheme called Aggregated Data movement Augmenting Predictive Transfers (ADAPT). ADAPT observes access behavior online to handle both sequential and random write requests efficiently. It also takes advantage of the locality revealed in the history of recent accesses to avoid unnecessary data movements in the required merge process. More importantly, through these mechanisms, ADAPT can adapt to various workloads to achieve good performance. Experimental results show that ADAPT is as much as 35.4%, 44.2%, and 23.5% faster than a state-of-the-art hybrid mapping scheme, a prevalent page-based mapping scheme, and a recent workload-adaptive mapping scheme, respectively, with only a small increase in space requirements.

  • NANDFlashSim: Intrinsic latency variation aware NAND flash memory system modeling and simulation at microarchitecture level

    Publication Year: 2012, Page(s): 1-12
    Cited by: Papers (7)

    As NAND flash memory becomes popular in diverse areas ranging from embedded systems to high-performance computing, exposing and understanding flash memory's performance, energy consumption, and reliability becomes increasingly important. Moreover, with an increasing trend towards multiple-die, multiple-plane architectures and high-speed interfaces, high-performance NAND flash memory systems are expected to continue to scale. This scaling should further reduce costs and thereby widen the proliferation of devices based on the technology. However, when designing NAND flash-based devices, making decisions about the optimal system configuration is non-trivial, because NAND flash is sensitive to a large number of parameters, and some parameters exhibit significant latency variations. Such parameters include the architecture (multi-die, multi-plane), node technology, and a host of factors that affect performance, energy consumption, and reliability. Unfortunately, no public-domain tools exist for high-fidelity, microarchitecture-level NAND flash memory simulation to assist with making such decisions. We therefore introduce NANDFlashSim, a latency-variation-aware, detailed, and highly configurable NAND flash simulation model. NANDFlashSim implements a detailed timing model for operations in sixteen state-of-the-art NAND flash operation mode combinations. In addition, NANDFlashSim models the energy and reliability of NAND flash memory based on statistics. From our comprehensive experiments using NANDFlashSim, we found that 1) most read cases were unable to leverage the highly parallel internal architecture of NAND flash, regardless of the NAND flash operation mode; 2) the main source of this performance bottleneck is I/O bus activity, not NAND flash activity itself; 3) multi-level-cell NAND flash experiences lower I/O bus resource contention than single-level-cell NAND flash, but the contention becomes a serious problem as the number of dies increases; and 4) employing many dies rather than many planes promises better performance in disk-friendly real workloads. The simulator can be downloaded from http://www.cse.psu.edu/~mqj5086/nfs.
