DWreck: A Data Wrecker Framework for Generating Unclean Datasets | IEEE Conference Publication | IEEE Xplore

DWreck: A Data Wrecker Framework for Generating Unclean Datasets


Abstract:

In this paper, we present DWreck, a data wrecker framework for generating unclean datasets by counterproductively applying different data quality dimensions. In a typical...Show More

Abstract:

In this paper, we present DWreck, a data wrecker framework for generating unclean datasets by counterproductively applying different data quality dimensions. In a typical data-analysis pipeline, data cleaning is the most cost-intensive, laborious, and time-consuming step. Unclean dataset or partially cleaned dataset can lead to incorrect training of machine learning models and result in wrong conclusions. Generally, data-scientists examine null, missing, or duplicate values, and the dataset is cleaned by removing the entire record or imputing the values. However, deleting the records, or imputing the values cannot be termed as comprehensive cleaning, as these cleaning techniques may result in a reduction in the population of data, and increased error in estimation due to biased values. For systematically cleaning an unclean dataset, it is necessary to comply with the data quality dimensions such as completeness, validity, consistency, accuracy, and conformity. The errors described as violations of expectations for completeness, accuracy, timeliness, consistency and other dimensions of data quality often impede the successful completion of information processing streams and consequently degrade the dependent business processes. Therefore, educating a data-scientist for comprehensively cleaning a raw-dataset acquired for analysis is an incremental learning process. Moreover, for extensive training on cleaning a dataset on different quality dimensions, it is necessary to provide a variety of datasets that are unclean on various data quality dimensions. Hence, in this paper, we present DWreck, a data wrecker framework for generating unclean datasets by counterproductively applying different data quality dimensions. The DWreck framework is designed on the principles of microservices architecture pattern. For allowing function-specific extensibility, the DWreck comprises four groups of microservices: (a) Dataset Profiling, (b) Data type Processing, (c) Counterproductive D...
Date of Conference: 03-06 August 2020
Date Added to IEEE Xplore: 28 August 2020
ISBN Information:
Conference Location: Oxford, UK

Contact IEEE to Subscribe

References

References is not available for this document.