To Store or Not to Store: a graph theoretical approach for Dataset Versioning | IEEE Conference Publication | IEEE Xplore


Abstract:

Dataset versioning is extremely important for ensuring the reproducibility of results, tracking data changes over time, maintaining quality measures, enabling collaboration, and ensuring legal compliance. In this work, we study the cost-efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph with datasets as nodes and edges capturing edit/delta information. One central variant we study is MINSUM RETRIEVAL (MSR), where the goal is to minimize the total retrieval cost while keeping the storage cost bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB’15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems.

We established, in the full version of this work, that the previous best heuristic, LMG (introduced in [VLDB’15]), can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to get an o(n)-approximation for MSR on general graphs even if we relax the storage constraints by an O(log n) factor. Similar hardness results are shown for other variants. Meanwhile, we propose poly-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs.

Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential "moves", to propose a new heuristic, LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs.
Secondly, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms ...
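
To make the MSR objective in the abstract concrete, the following is a minimal brute-force sketch on a toy version graph. It is not the paper's algorithm, nor LMG or LMG-All. The names `solve_msr_bruteforce`, `retrieval_cost`, `full_size`, `deltas`, and `ROOT` are hypothetical, and so are all cost values. The sketch assumes that a materialized version has retrieval cost 0 and that each version is either stored in full or rebuilt from exactly one stored delta.

```python
from itertools import product

ROOT = None  # hypothetical marker meaning "store this version in full"


def retrieval_cost(v, parent, deltas):
    """Sum delta retrieval costs from v back to a materialized version.

    Returns None if the parent pointers loop and never reach ROOT.
    """
    cost, seen = 0, set()
    while parent[v] is not ROOT:
        if v in seen:
            return None
        seen.add(v)
        cost += deltas[(parent[v], v)][1]
        v = parent[v]
    return cost


def solve_msr_bruteforce(full_size, deltas, budget):
    """Exhaustively search parent choices; exponential, toy inputs only.

    full_size: {version: storage cost of materializing it}
    deltas:    {(u, v): (storage cost, retrieval cost)} to build v from u
    Returns (total retrieval cost, parent map), or None if infeasible.
    """
    versions = list(full_size)
    # Each version is materialized (ROOT) or built from one in-neighbor.
    choices = [[ROOT] + [u for (u, w) in deltas if w == v] for v in versions]
    best = None
    for pick in product(*choices):
        parent = dict(zip(versions, pick))
        storage = sum(
            full_size[v] if p is ROOT else deltas[(p, v)][0]
            for v, p in parent.items()
        )
        if storage > budget:
            continue  # violates the storage bound
        costs = [retrieval_cost(v, parent, deltas) for v in versions]
        if any(c is None for c in costs):
            continue  # some version cannot be reconstructed
        total = sum(costs)
        if best is None or total < best[0]:
            best = (total, parent)
    return best


# Toy instance with made-up costs: three versions and three deltas.
full_size = {"A": 100, "B": 100, "C": 100}
deltas = {("A", "B"): (10, 10), ("B", "C"): (10, 10), ("A", "C"): (30, 15)}
```

On this toy instance, a budget of 120 forces the chain A → B → C (total retrieval 30), while a budget of 140 allows storing the larger delta A → C instead (total retrieval 25). This illustrates the storage/retrieval trade-off that MSR formalizes.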
Date of Conference: 27-31 May 2024
Date Added to IEEE Xplore: 08 July 2024
Conference Location: San Francisco, CA, USA

I. Introduction

The management and storage of data versions has become increasingly important. As an example, the increasing usage of online collaboration tools allows many collaborators to edit an original dataset simultaneously, producing multiple versions of datasets to be stored daily. Large numbers of dataset versions also occur frequently in industry data lakes [1], where huge tabular datasets like product catalogs might require a few records (or rows) to be modified periodically, resulting in a new version for each such modification. Furthermore, in deep learning pipelines, multiple versions are generated from the same original data for training and insight generation. At the scale of terabytes or even petabytes, storing and managing all the versions is extremely costly in the aforementioned situations [2]. Therefore, it is no surprise that data version control is emerging as a hot area in the industry [3]–[8], and even popular cloud solution providers like Databricks are now capturing data lineage information, which helps in effective data version management [9].
