
A review of similarity measurement for record duplication detection



Abstract:

Similarity measurement is a significant process to determine the degree of similarity between two records. This paper presents a comparative analysis of important similarity measurements which are utilised for the detection of duplicated records in databases. The work evaluates their strengths based on the efficiency of prevailing algorithms, the time required to process and identify duplications as well as performance accuracy. The analysis conducted found that among the most common similarity measurements, those based on the Jaro-Winkler algorithm significantly outperformed the other algorithms. This paper presents an enhanced strategy based on the Jaro-Winkler algorithm to improve the detection of similarity among database records. The ability to provide solutions to this problem will greatly enhance the quality of data used in decision-making.
Date of Conference: 25-27 November 2017
Date Added to IEEE Xplore: 12 March 2018
Electronic ISSN: 2155-6830
Conference Location: Langkawi, Malaysia

I. Introduction

The similarity measure plays a vital role in nearly every field of science and engineering. A similarity measure can be described as a process to determine the degree of similarity that exists between two objects [1], [2]. The identification of similar database records is an important entity-matching application. The term ‘duplicate record detection’ describes the process of recognising records that represent the same real-world entity in a specific database. The difficulty associated with duplication is that duplicated records may not share the same record key. Various methods have been employed to locate and cleanse erroneous duplicated records in a typical dataset. Duplicated or erroneous data can result from several factors, including data entry errors, such as typing a person's name “John” as “Jon”; missing validation checks or restrictions, such as an age value of 320; and multiple conventions, such as “22 E. 7th St” versus “22 East Seventh Street”. Additional problems may also result from structural differences between database sources [3].
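Because the comparison in this paper centres on the Jaro-Winkler measure, the following minimal Python sketch (a standard formulation of the algorithm, not code from the paper) illustrates how it scores typographical variants such as “John” and “Jon” much more highly than unrelated strings.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matching characters, penalised for transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1  # max distance at which characters still "match"
    s1_matches = [False] * len1
    s2_matches = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not s2_matches[j] and s2[j] == c:
                s1_matches[i] = s2_matches[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among the matched characters
    transpositions = 0
    j = 0
    for i in range(len1):
        if s1_matches[i]:
            while not s2_matches[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler: boost the Jaro score for strings sharing a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1.0 - score)


if __name__ == "__main__":
    # Typographical variants of the same name score highly; unrelated names do not.
    print(round(jaro_winkler("John", "Jon"), 3))      # ~0.933
    print(round(jaro_winkler("Martha", "Marhta"), 3))  # ~0.961
    print(round(jaro_winkler("John", "Mary"), 3))     # 0.0
```

The common-prefix boost is what makes this measure well suited to the name and address fields typical of duplicate record detection, since data entry errors tend to occur after the first few characters.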

