
Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics


Abstract:

Set similarity join, which finds similar pairs from two collections of sets, is an essential operation in big data analytics, e.g., in data integration and data cleaning. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed in recent years to perform similarity joins using MapReduce. These techniques, however, usually produce huge numbers of duplicates in order to process the data in parallel, because MapReduce is a shared-nothing framework. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly degrades performance. Moreover, these approaches do not provide a load-balancing guarantee, which results in skew and negatively affects their scalability. To address these problems, in this paper we propose a duplicate-free framework, called FS-Join, that performs set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, which demonstrates the good scalability and performance of the proposed technique.

Published in: 2017 IEEE 33rd International Conference on Data Engineering (ICDE)

Date of Conference: 19-22 April 2017

Date Added to IEEE Xplore: 18 May 2017

ISBN Information:

Electronic ISSN: 2375-026X

DOI: 10.1109/ICDE.2017.151

Conference Location: San Diego, CA, USA


Contents

I. Introduction

Similarity join is an essential operation that finds all pairs of records from two data collections whose similarity scores, computed with a similarity function such as Jaccard similarity [18], are no less than a given threshold. Similarity joins are widely used in a variety of applications, including data integration [6], data cleaning [7], duplicate detection [22], record linkage [20], and entity resolution [8].
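
To make the definition concrete, the following is a minimal sketch of a brute-force set similarity join under Jaccard similarity. It is not FS-Join and uses none of its partitioning or filtering techniques; it only illustrates which pairs a similarity join is expected to return. The function names, the sample data, the threshold of 0.7, and the convention that two empty sets score 0.0 are all assumptions made for illustration.

```python
from itertools import product


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|.

    Assumption: two empty sets are given a score of 0.0.
    """
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_similarity_join(left, right, threshold):
    """Return (i, j, score) for every pair whose score is no less than threshold.

    This compares all |left| * |right| pairs, which is exactly the cost that
    distributed, filter-based approaches try to avoid.
    """
    results = []
    for (i, r), (j, s) in product(enumerate(left), enumerate(right)):
        score = jaccard(r, s)
        if score >= threshold:
            results.append((i, j, score))
    return results


# Hypothetical sample collections (not taken from the paper).
left = [frozenset({"a", "b", "c"}), frozenset({"x", "y"})]
right = [frozenset({"a", "b", "c", "d"}), frozenset({"y", "z"})]

pairs = naive_similarity_join(left, right, threshold=0.7)
print(pairs)
```

Here only the first record of each collection qualifies: the two sets share three of four distinct tokens, giving a score of 0.75. The quadratic number of comparisons in this sketch is what motivates pruning dissimilar pairs before their similarity scores are computed.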
