As the rate, scale and variety of data increase, so does the need for flexible applications that can process huge amounts of heterogeneous data quickly and cost-effectively. Such applications are data-intensive: in a typical scenario, they continuously acquire massive datasets (e.g. by crawling the Web or analyzing access logs) while performing computations over these changing datasets (e.g. building up-to-date search indexes). To achieve scalability and performance, data acquisition and computation need to be distributed at large scale over infrastructures comprising hundreds or thousands of machines. As these applications focus on data rather than on computation, a heavy burden is placed on the storage service that handles data management, because it must deal efficiently with massively parallel data accesses. To achieve this, a series of issues needs to be addressed properly: scalable aggregation of storage space from the participating nodes with minimal overhead, the ability to store huge data objects, efficient fine-grain access to data subsets, high throughput even under heavy access concurrency, versioning, as well as fault tolerance and a high quality of service for access throughput. This paper introduces BlobSeer, an efficient distributed data management service that addresses the issues presented above. In BlobSeer, long sequences of bytes representing unstructured data are called blobs (Binary Large OBjects).