When data resides on tertiary storage, clustering is the key to achieving high retrieval performance. However, a straightforward approach to clustering massive amounts of data on such storage requires considerable computational and storage resources that usually exceed the capabilities of even the best-equipped supercomputing centers. This paper develops a new approach to hierarchical storage management in data grid environments, which calls for two levels of data clustering on tertiary storage. Applying a mix of static and dynamic decisions, this approach achieves the benefits of clustering at reasonable cost. An effective realization of the approach in generic data grid environments, however, requires advances in the areas of indexing and clustering large scientific data collections on tertiary storage. The paper describes some novel indexing and clustering techniques that can cope well not only with the extremely large volumes but also with the very high dimensionality of scientific data. The basic principles of a new clustering technique for large volumes of multi-dimensional data are introduced in the paper for the first time.