A threshold-based similarity measure for duplicate detection

6 Author(s)
Ektefa, M. ; Dept. of CS, UPM, Serdang, Malaysia ; Sidi, F. ; Ibrahim, H. ; Jabar, M.A.

Extracting useful information and recognizing patterns in large volumes of data stored in different databases with different formats requires data integration. However, data integration may lead to duplication: because the data are available in different formats, some records may refer to the same entity. Duplicate detection, or record linkage, is a technique used to detect and match duplicate records generated during the data integration process. Most approaches concentrate on string similarity measures for comparing records, but they fail to identify records that share semantic information. This study therefore proposes a threshold-based method that takes into account both string and semantic similarity measures when comparing record pairs. The method is evaluated on a real-world dataset, namely Restaurant, and its effectiveness is measured using several standard evaluation metrics. The experimental results indicate that the proposed similarity method, which combines string and semantic similarity measures, outperforms the individual similarity measures, achieving an F-measure of 99.1% on the Restaurant dataset. Therefore, semantic similarity should be considered alongside string similarity in order to detect duplicate records more effectively.
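
The abstract does not name the specific string or semantic measures, nor how they are combined, so the following is only a minimal sketch of the general idea under stated assumptions. It uses Python's `difflib.SequenceMatcher` as a stand-in string measure, a small hypothetical synonym table (`SYNONYMS`) as a stand-in for a real lexical resource, takes the larger of the two scores per token pair, and flags a record pair as duplicate when the averaged score reaches a threshold. The field names, the threshold of 0.8, and the toy records are all illustrative and are not taken from the paper or from the Restaurant dataset.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups; a real system would likely consult a
# lexical resource instead of a hand-written table.
SYNONYMS = [
    {"diner", "eatery", "restaurant"},
    {"american", "usa"},
    {"street", "st"},
]

def string_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def semantic_sim(a: str, b: str) -> float:
    """1.0 if the tokens are equal or share a synonym group, else 0.0."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    return 1.0 if any(a in group and b in group for group in SYNONYMS) else 0.0

def _directed(tokens_a, tokens_b) -> float:
    """Average, over tokens_a, of the best combined match found in tokens_b."""
    best = [
        max(max(string_sim(x, y), semantic_sim(x, y)) for y in tokens_b)
        for x in tokens_a
    ]
    return sum(best) / len(best)

def field_sim(a: str, b: str) -> float:
    """Symmetric field similarity combining string and semantic scores."""
    ta, tb = a.split(), b.split()
    if not ta or not tb:
        return 0.0
    return (_directed(ta, tb) + _directed(tb, ta)) / 2

def record_sim(r1: dict, r2: dict) -> float:
    """Mean field similarity over the fields both records have."""
    fields = sorted(set(r1) & set(r2))
    if not fields:
        return 0.0
    return sum(field_sim(r1[f], r2[f]) for f in fields) / len(fields)

def is_duplicate(r1: dict, r2: dict, threshold: float = 0.8) -> bool:
    """Threshold decision: duplicate when similarity reaches the threshold."""
    return record_sim(r1, r2) >= threshold

def f_measure(predicted: set, actual: set) -> float:
    """Harmonic mean of precision and recall over sets of record-pair ids."""
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Toy records invented for illustration only.
    r1 = {"name": "Joe's Diner", "type": "american"}
    r2 = {"name": "Joe's Eatery", "type": "usa"}
    print(record_sim(r1, r2), is_duplicate(r1, r2))
```

In this toy pair, "american" and "usa" share almost no characters, so a purely string-based comparison would score that field low, while the synonym lookup treats the two values as equivalent. That is the kind of case the abstract describes as being missed by string-only measures.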

Published in:
Open Systems (ICOS), 2011 IEEE Conference on

Date of Conference: 25-28 Sept. 2011
