
A review of similarity measurement for record duplication detection



Abstract:

Similarity measurement is a significant process to determine the degree of similarity between two records. This paper presents a comparative analysis of important similarity measurements which are utilised for the detection of duplicated records in databases. The work evaluates their strengths based on the efficiency of prevailing algorithms, the time required to process and identify duplications as well as performance accuracy. The analysis conducted found that among the most common similarity measurements, those based on the Jaro-Winkler algorithm significantly outperformed the other algorithms. This paper presents an enhanced strategy based on the Jaro-Winkler algorithm to improve the detection of similarity among database records. The ability to provide solutions to this problem will greatly enhance the quality of data used in decision-making.
Date of Conference: 25-27 November 2017
Date Added to IEEE Xplore: 12 March 2018
Electronic ISSN: 2155-6830
Conference Location: Langkawi, Malaysia

I. Introduction

The similarity measure plays a vital role in nearly every field of science and engineering. A similarity measure can be described as a process to determine the degree of similarity that exists between two objects [1], [2]. The identification of similar database records is an important entity-matching application. The term ‘duplicate record detection’ describes the process of recognising records that represent the same real-world entity in a specific database. The difficulty associated with duplication is that duplicated records may not share the same record key. Various methods have been employed to locate and cleanse erroneous duplicated records in a typical dataset. Duplicated or erroneous data can result from several factors, including data entry errors, such as typing a person's name “John” as “Jon”; missing validation checks or restrictions, such as an age value of 320; and multiple conventions, such as “22 E. 7th St” versus “22 East Seventh Street”. Additional problems may also result from structural differences between database sources [3].
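Because the comparison in this paper centres on the Jaro-Winkler measure, the following minimal Python sketch (a standard formulation of the algorithm, not code from the paper) illustrates how it scores typographical variants such as “John” and “Jon” much more highly than unrelated strings.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matching characters, penalised for transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1  # max distance at which characters still "match"
    s1_matches = [False] * len1
    s2_matches = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not s2_matches[j] and s2[j] == c:
                s1_matches[i] = s2_matches[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among the matched characters
    transpositions = 0
    j = 0
    for i in range(len1):
        if s1_matches[i]:
            while not s2_matches[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler: boost the Jaro score for strings sharing a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1.0 - score)


if __name__ == "__main__":
    # Typographical variants of the same name score highly; unrelated names do not.
    print(round(jaro_winkler("John", "Jon"), 3))      # ~0.933
    print(round(jaro_winkler("Martha", "Marhta"), 3))  # ~0.961
    print(round(jaro_winkler("John", "Mary"), 3))     # 0.0
```

The common-prefix boost is what makes this measure well suited to the name and address fields typical of duplicate record detection, since data entry errors tend to occur after the first few characters.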

