I. Introduction
Several scientific domains are experiencing an explosion in the volume, variety, veracity, and velocity of data owing to increased automation, increased computational power, and faster, higher-resolution sensors and detectors in scientific instruments [1], [2]. At the same time, research is becoming ever more globalized, collaborative, and multidisciplinary, and there is an increasing need to publish the datasets that support research findings [3]. Furthermore, scientific discovery using data analytics techniques such as machine learning (ML) and artificial intelligence (AI) requires large volumes of high-quality, well-organized data. Prior research has shown that as much as 50–80% of the time in most scientific research projects is spent on data management and wrangling, and this share is expected to rise [4], [5]. These factors not only lower scientific productivity but also exacerbate the problem of poor reproducibility in science. The current state of practice therefore creates an urgent need to manage the full lifecycle of data with an effective Scientific Data Management System (SDMS) [6], and to use the SDMS as an essential component of the scientific process.