Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization


Abstract:

Many applications need to clean data with a target accuracy, e.g., with at least 95% precision. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings, then asking a user to verify and clean the clusters until reaching 100% accuracy. This solution has significant limitations. It does not tell users how to verify and clean the clusters, so they often take ad hoc, suboptimal, or incorrect actions. Verifying and cleaning also often take a lot of time, e.g., days. Further, there is no effective way for multiple users to verify and clean collaboratively. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
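
To make the VN workflow described in the abstract concrete, the following is a minimal sketch of the cluster-then-replace baseline, not the paper's method. The function names, the `difflib`-based similarity measure, the 0.8 threshold, and the rule of picking the longest string as the canonical value are all assumptions made for illustration. The user verification step is not automated here; in practice a person would inspect and correct the clusters before the replacement is applied.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed measure, for illustration)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_strings(values, threshold=0.8):
    """Greedy clustering: put each string into the first cluster whose
    first member is similar enough, otherwise start a new cluster."""
    clusters = []
    for v in values:
        for c in clusters:
            if similarity(v, c[0]) >= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters


def normalize(values, clusters):
    """Replace every string with one canonical string per cluster.
    Hypothetical rule: the longest member is the canonical value."""
    canonical = {}
    for c in clusters:
        rep = max(c, key=len)
        for v in c:
            canonical[v] = rep
    return [canonical[v] for v in values]


values = ["IBM Corp", "IBM Corp.", "Microsoft", "Microsft", "ibm corp"]
clusters = cluster_strings(values)
# A user would verify and clean `clusters` here before normalizing.
normalized = normalize(values, clusters)
```

A greedy single pass like this is order-dependent and can both merge distinct entities and split identical ones, which is why the manual verification step is needed to reach a target accuracy.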
Date of Conference: 17-20 December 2022
Date Added to IEEE Xplore: 26 January 2023
Conference Location: Osaka, Japan
