I. Introduction
The management and storage of data versions has become increasingly important. As an example, the increasing usage of online collaboration tools allows many collaborators to edit an original dataset simultaneously, producing multiple versions of datasets to be stored daily. Large number of dataset versions also occur often in industry data lakes [1] where huge tabular datasets like product catalogs might require a few records (or rows) to be modified periodically, resulting in a new version for each such modification. Furthermore, in Deep Learning pipelines, multiple versions are generated from the same original data for training and insight generation. At the scale of terabytes or even petabytes, storing and managing all the versions is extremely costly in the aforementioned situations [2]. Therefore, it is no surprise that data version control is emerging as a hot area in the industry [3] –[8], and even popular cloud solution providers like Databricks are now capturing data lineage information, which helps in effective data version management [9].