TY - GEN
T1 - Fast & scalable distributed set similarity joins for big data analytics
AU - Rong, Chuitian
AU - Lin, Chunbin
AU - Silva, Yasin
AU - Wang, Jianguo
AU - Lu, Wei
AU - Du, Xiaoyong
N1 - Funding Information:
This material is based upon work supported by the National Natural Science Foundation of China under grant No.61402329 and No.61502504, and the China Scholarship Council.
Publisher Copyright:
© 2017 IEEE.
PY - 2017/5/16
Y1 - 2017/5/16
N2 - Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed to perform similarity joins using MapReduce in recent years. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully as MapReduce is a shared-nothing framework. The large number of duplicates incurs on both large shuffle cost and unnecessary computation cost, which significantly decrease the performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability properties of these techniques. To address these problems, in this paper, we propose a duplicate free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-Theart methods by one order of magnitude on average, which demonstrates the good scalability and performance qualities of the proposed technique.
AB - Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed to perform similarity joins using MapReduce in recent years. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully as MapReduce is a shared-nothing framework. The large number of duplicates incurs on both large shuffle cost and unnecessary computation cost, which significantly decrease the performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability properties of these techniques. To address these problems, in this paper, we propose a duplicate free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-Theart methods by one order of magnitude on average, which demonstrates the good scalability and performance qualities of the proposed technique.
UR - http://www.scopus.com/inward/record.url?scp=85021238525&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85021238525&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2017.151
DO - 10.1109/ICDE.2017.151
M3 - Conference contribution
AN - SCOPUS:85021238525
T3 - Proceedings - International Conference on Data Engineering
SP - 1059
EP - 1070
BT - Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017
PB - IEEE Computer Society
T2 - 33rd IEEE International Conference on Data Engineering, ICDE 2017
Y2 - 19 April 2017 through 22 April 2017
ER -