Fast & scalable distributed set similarity joins for big data analytics

Chuitian Rong; Chunbin Lin; Yasin Silva; Jianguo Wang; Wei Lu; Xiaoyong Du

doi:10.1109/ICDE.2017.151

Arizona State University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

40 Scopus citations

Abstract

Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed to perform similarity joins using MapReduce in recent years. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully as MapReduce is a shared-nothing framework. The large number of duplicates incurs both a large shuffle cost and an unnecessary computation cost, which significantly decrease performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability properties of these techniques. To address these problems, in this paper, we propose a duplicate-free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, which demonstrates the good scalability and performance qualities of the proposed technique.
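
For readers unfamiliar with the operation, the following is a minimal single-machine sketch of what a set similarity join computes. It is not FS-Join and does not reproduce the paper's vertical partitioning or its three filters. The choice of Jaccard similarity, the threshold parameter, and the function names are assumptions made for illustration, since the abstract does not name a similarity function. The only pruning shown is the well-known length filter, which skips pairs whose sizes alone rule out reaching the threshold.

```python
from itertools import product


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_set_similarity_join(R, S, threshold):
    """Return (i, j, sim) for every pair with jaccard(R[i], S[j]) >= threshold.

    Hypothetical helper for illustration: a brute-force nested loop over
    both collections, with a length filter applied before the exact check.
    """
    results = []
    for (i, r), (j, s) in product(enumerate(R), enumerate(S)):
        # Length filter: jaccard(r, s) <= min(|r|, |s|) / max(|r|, |s|),
        # so a pair whose size ratio is below the threshold cannot qualify.
        small, large = sorted((len(r), len(s)))
        if large == 0 or small < threshold * large:
            continue
        sim = jaccard(r, s)
        if sim >= threshold:
            results.append((i, j, sim))
    return results


if __name__ == "__main__":
    R = [frozenset("abcd"), frozenset("xy")]
    S = [frozenset("abce"), frozenset("x"), frozenset("pqrs")]
    for i, j, sim in naive_set_similarity_join(R, S, threshold=0.5):
        print(i, j, sim)
```

The nested loop compares every pair, which is the quadratic cost that distributed approaches such as the one described in the abstract aim to avoid at scale.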

Original language	English (US)
Title of host publication	Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017
Publisher	IEEE Computer Society
Pages	1059-1070
Number of pages	12
ISBN (Electronic)	9781509065431
DOIs	https://doi.org/10.1109/ICDE.2017.151
State	Published - May 16 2017
Event	33rd IEEE International Conference on Data Engineering, ICDE 2017 - San Diego, United States Duration: Apr 19 2017 → Apr 22 2017

Publication series

Name	Proceedings - International Conference on Data Engineering
ISSN (Print)	1084-4627

Other

Other	33rd IEEE International Conference on Data Engineering, ICDE 2017
Country/Territory	United States
City	San Diego
Period	4/19/17 → 4/22/17

ASJC Scopus subject areas

Software
Signal Processing
Information Systems

Access to Document

10.1109/ICDE.2017.151

Cite this

Rong, C., Lin, C., Silva, Y., Wang, J., Lu, W., & Du, X. (2017). Fast & scalable distributed set similarity joins for big data analytics. In Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017 (pp. 1059-1070). Article 7930047 (Proceedings - International Conference on Data Engineering). IEEE Computer Society. https://doi.org/10.1109/ICDE.2017.151

Fast & scalable distributed set similarity joins for big data analytics. / Rong, Chuitian; Lin, Chunbin; Silva, Yasin et al.
Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017. IEEE Computer Society, 2017. p. 1059-1070 7930047 (Proceedings - International Conference on Data Engineering).

Rong, C, Lin, C, Silva, Y, Wang, J, Lu, W & Du, X 2017, Fast & scalable distributed set similarity joins for big data analytics. in Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017., 7930047, Proceedings - International Conference on Data Engineering, IEEE Computer Society, pp. 1059-1070, 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, United States, 4/19/17. https://doi.org/10.1109/ICDE.2017.151

@inproceedings{f149635263e64c5a90991a6e46714ec8,

title = "Fast \& scalable distributed set similarity joins for big data analytics",

abstract = "Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed to perform similarity joins using MapReduce in recent years. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully as MapReduce is a shared-nothing framework. The large number of duplicates incurs both a large shuffle cost and an unnecessary computation cost, which significantly decrease performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability properties of these techniques. To address these problems, in this paper, we propose a duplicate-free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, which demonstrates the good scalability and performance qualities of the proposed technique.",

author = "Chuitian Rong and Chunbin Lin and Yasin Silva and Jianguo Wang and Wei Lu and Xiaoyong Du",

note = "Funding Information: This material is based upon work supported by the National Natural Science Foundation of China under grant No.61402329 and No.61502504, and the China Scholarship Council. Publisher Copyright: {\textcopyright} 2017 IEEE.; 33rd IEEE International Conference on Data Engineering, ICDE 2017 ; Conference date: 19-04-2017 Through 22-04-2017",

year = "2017",

month = may,

day = "16",

doi = "10.1109/ICDE.2017.151",

language = "English (US)",

series = "Proceedings - International Conference on Data Engineering",

publisher = "IEEE Computer Society",

pages = "1059--1070",

booktitle = "Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017",

}

TY - GEN

T1 - Fast & scalable distributed set similarity joins for big data analytics

AU - Rong, Chuitian

AU - Lin, Chunbin

AU - Silva, Yasin

AU - Wang, Jianguo

AU - Lu, Wei

AU - Du, Xiaoyong

N1 - Funding Information: This material is based upon work supported by the National Natural Science Foundation of China under grant No.61402329 and No.61502504, and the China Scholarship Council. Publisher Copyright: © 2017 IEEE.

PY - 2017/5/16

Y1 - 2017/5/16

N2 - Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed to perform similarity joins using MapReduce in recent years. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully as MapReduce is a shared-nothing framework. The large number of duplicates incurs both a large shuffle cost and an unnecessary computation cost, which significantly decrease performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability properties of these techniques. To address these problems, in this paper, we propose a duplicate-free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, which demonstrates the good scalability and performance qualities of the proposed technique.

AB - Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed to perform similarity joins using MapReduce in recent years. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully as MapReduce is a shared-nothing framework. The large number of duplicates incurs both a large shuffle cost and an unnecessary computation cost, which significantly decrease performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability properties of these techniques. To address these problems, in this paper, we propose a duplicate-free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, which demonstrates the good scalability and performance qualities of the proposed technique.

UR - http://www.scopus.com/inward/record.url?scp=85021238525&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85021238525&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2017.151

DO - 10.1109/ICDE.2017.151

M3 - Conference contribution

AN - SCOPUS:85021238525

T3 - Proceedings - International Conference on Data Engineering

SP - 1059

EP - 1070

BT - Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017

PB - IEEE Computer Society

T2 - 33rd IEEE International Conference on Data Engineering, ICDE 2017

Y2 - 19 April 2017 through 22 April 2017

ER -
