Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics | IEEE Conference Publication | IEEE Xplore


Abstract:

Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed in recent years to perform similarity joins using MapReduce. Because MapReduce is a shared-nothing framework, however, these techniques usually produce huge amounts of duplicates in order to process the data in parallel. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly degrade performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects their scalability. To address these problems, in this paper, we propose a duplicate-free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, which demonstrates the good scalability and performance of the proposed technique.
Date of Conference: 19-22 April 2017
Date Added to IEEE Xplore: 18 May 2017
ISBN Information:
Electronic ISSN: 2375-026X
Conference Location: San Diego, CA, USA

I. Introduction

Similarity join is an essential operation that finds all pairs of records from two data collections whose similarity scores are no less than a given threshold using a similarity function, e.g., Jaccard similarity [18]. Similarity joins are widely used in a variety of applications including data integration [6], data cleaning [7], duplicate detection [22], record linkage [20] and entity resolution [8].
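
The definition above can be made concrete with a brute-force sketch. This is not FS-Join and uses none of its partitioning or filtering; it is only a baseline that compares every pair, assuming Jaccard similarity as the similarity function. The names `jaccard` and `naive_similarity_join`, the sample collections, and the threshold 0.75 are hypothetical choices for illustration.

```python
from itertools import product


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|.

    Assumption for this sketch: two empty sets score 0.0.
    """
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_similarity_join(r, s, threshold):
    """Return (i, j, score) for every pair (r[i], s[j]) whose Jaccard
    similarity is no less than `threshold`.

    Brute-force baseline: compares all |r| * |s| pairs.
    """
    results = []
    for (i, x), (j, y) in product(enumerate(r), enumerate(s)):
        score = jaccard(x, y)
        if score >= threshold:
            results.append((i, j, score))
    return results


# Hypothetical sample collections of sets.
R = [{"a", "b", "c"}, {"x", "y"}]
S = [{"a", "b", "c", "d"}, {"y", "z"}, {"x", "y"}]

pairs = naive_similarity_join(R, S, 0.75)
print(pairs)  # [(0, 0, 0.75), (1, 2, 1.0)]
```

The quadratic number of comparisons in this baseline is the cost that filtering and partitioning techniques aim to avoid at scale.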

Cites in Papers - IEEE (12)

1.
Xin Xiong, "GrassJoin: Distributed Set Similarity Join Based on Graph Partitioning Model", 2023 2nd International Conference on Sensing, Measurement, Communication and Internet of Things Technologies (SMC-IoT), pp.67-72, 2023.
2.
Guorui Xiao, Jin Wang, Chunbin Lin, Carlo Zaniolo, "Highly Efficient String Similarity Search and Join over Compressed Indexes", 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp.232-244, 2022.
3.
Hong-Ji Kim, Ki-Hoon Lee, "Semi-Stream Similarity Join Processing in a Distributed Environment", IEEE Access, vol.8, pp.130194-130204, 2020.
4.
Youzhong Ma, Ruiling Zhang, Zhanyou Cui, Chunjie Lin, "Projection Based Large Scale High-Dimensional Data Similarity Join Using MapReduce Framework", IEEE Access, vol.8, pp.121665-121677, 2020.
5.
Jianye Yang, Wenjie Zhang, Xiang Wang, Ying Zhang, Xuemin Lin, "Distributed Streaming Set Similarity Join", 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp.565-576, 2020.
6.
Jin Wang, Chunbin Lin, "Fast Error-tolerant Location-aware Query Autocompletion", 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp.1998-2001, 2020.
7.
Bo Yin, Xuetao Wei, Jin Wang, Naixue Xiong, Ke Gu, "An Industrial Dynamic Skyline Based Similarity Joins For Multidimensional Big Data Applications", IEEE Transactions on Industrial Informatics, vol.16, no.4, pp.2520-2532, 2020.
8.
Lei Zhu, Weiren Yu, Chengyuan Zhang, Zuping Zhang, Fang Huang, Hao Yu, "SVS-JOIN: Efficient Spatial Visual Similarity Join for Geo-Multimedia", IEEE Access, vol.7, pp.158389-158408, 2019.
9.
Jiacheng Wu, Yong Zhang, Jin Wang, Chunbin Lin, Yingjia Fu, Chunxiao Xing, "Scalable Metric Similarity Join Using MapReduce", 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp.1662-1665, 2019.
10.
Jin Wang, Chunbin Lin, Carlo Zaniolo, "MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering", 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp.386-397, 2019.
11.
Xiaoxia Wang, Decai Sun, "QJoin: A Q-Sample-Based Method for Large-Scale String Similarity Joins", 2018 11th International Symposium on Computational Intelligence and Design (ISCID), vol.01, pp.45-48, 2018.
12.
Lingli Li, Xiaodan Shang, Jinbao Li, Jin Hu, "Learning Distance Metrics for Entity Resolution", IEEE Access, vol.6, pp.54900-54909, 2018.

Cites in Papers - Other Publishers (19)

1.
Xilin Tang, Feng Zhang, Shuhao Zhang, Yani Liu, Bingsheng He, Xiaoyong Du, "Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality", Proceedings of the ACM on Management of Data, vol.2, no.4, pp.1, 2024.
2.
Azar Taufique, Ali Rizwan, Ali Imran, Kamran Arshad, Ahmed Zoha, Qammer H. Abbasi, Muhammad A. Imran, "Big Data Analytics for 5G Networks: Utilities, Frameworks, Challenges, and Opportunities", Wiley 5G Ref, pp.1, 2021.
3.
George Papadakis, Ekaterini Ioannou, Emanouil Thanos, Themis Palpanas, "The Four Generations of Entity Resolution", Synthesis Lectures on Data Management, vol.16, no.2, pp.1, 2021.
4.
Lei Yang, "Data acquisition and transmission of laboratory local area network based on fuzzy DEMATEL algorithm", Wireless Networks, 2021.
5.
Zhaokang Wang, Shen Wang, Junhong Li, Chunfeng Yuan, Rong Gu, Yihua Huang, "VSIM: Distributed local structural vertex similarity calculation on big graphs", Journal of Parallel and Distributed Computing, vol.158, pp.29, 2021.
6.
Chengcheng Yang, Dong Deng, Shuo Shang, Fan Zhu, Li Liu, Ling Shao, "Internal and external memory set containment join", The VLDB Journal, vol.30, no.3, pp.447, 2021.
7.
George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, Themis Palpanas, "Blocking and Filtering Techniques for Entity Resolution", ACM Computing Surveys, vol.53, no.2, pp.1, 2021.
8.
Chuitian Rong, Lili Chen, Yasin N. Silva, "Parallel time series join using spark", Concurrency and Computation: Practice and Experience, vol.32, no.9, 2020.
9.
Chuitian Rong, Ziliang Chen, Chunbin Lin, Jianming Wang, "Motif Discovery Using Similarity-Constraints Deep Neural Networks", Database Systems for Advanced Applications, vol.12112, pp.587, 2020.
10.
Sebastián Ferrada, Benjamin Bustos, Nora Reyes, "An efficient algorithm for approximated self-similarity joins in metric spaces", Information Systems, pp.101510, 2020.
11.
Erkang Zhu, Dong Deng, Fatemeh Nargesian, Renée J. Miller, "JOSIE", Proceedings of the 2019 International Conference on Management of Data, pp.847, 2019.
12.
Zhimin Chen, Yue Wang, Vivek Narasayya, Surajit Chaudhuri, "Customizable and scalable fuzzy join for big data", Proceedings of the VLDB Endowment, vol.12, no.12, pp.2106, 2019.
13.
Liu Zheng, "Human-autonomous devices for English network teaching system based on artificial intelligence and WBIETS system", Journal of Ambient Intelligence and Humanized Computing, 2019.
14.
Decai Sun, Xiaoxia Wang, Artificial Intelligence and Security, vol.11633, pp.251, 2019.
15.
Fabian Fier, Nikolaus Augsten, Panagiotis Bouros, Ulf Leser, Johann-Christoph Freytag, "Set similarity joins on mapreduce", Proceedings of the VLDB Endowment, vol.11, no.10, pp.1110, 2018.
16.
Jingya Hui, Lingli Li, Zhaogong Zhang, Data Science, vol.901, pp.101, 2018.
17.
Kenta Sugano, Toshiyuki Amagasa, Hiroyuki Kitagawa, Database and Expert Systems Applications, vol.11030, pp.214, 2018.
18.
Diego Junior do Carmo Oliveira, Felipe Ferreira Borges, Leonardo Andrade Ribeiro, Alfredo Cuzzocrea, Advances in Databases and Information Systems, vol.11019, pp.216, 2018.
19.
Rafael David Quirino, Sidney Ribeiro-Junior, Leonardo Andrade Ribeiro, Wellington Santos Martins, Enterprise Information Systems, vol.321, pp.74, 2018.
