Skip to Main Content
Scientific database federations are geographically distributed and network bound. Thus, they could benefit from proxy caching. However, existing caching techniques are not suitable for their workloads, which compare and join large data sets. Existing techniques reduce parallelism by conducting distributed queries in a single cache and lose the data reduction benefits of performing selections at each database. We develop the bypass-yield formulation of caching, which reduces network traffic in wide-area database federations, while preserving parallelism and data reduction. Bypass-yield caching is altruistic; caches minimize the overall network traffic generated by the federation, rather than focusing on local performance. We present an adaptive, workload-driven algorithm for managing a bypass-yield cache. We also develop on-line algorithms that make no assumptions about workload: a k-competitive deterministic algorithm and a randomized algorithm with minimal space complexity. We verify the efficacy of bypass-yield caching by running workload traces collected from the Sloan Digital Sky Survey through a prototype implementation.