An experimental survey of MapReduce-based similarity joins

Yasin Silva, Jason Reed, Kyle Brown, Adelbert Wadsworth, Chuitian Rong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations


In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This was the case in part because some of these techniques were developed in parallel, while others were not implemented even as part of their original publications. Consequently, there is no clear understanding of how these techniques compare to each other or of which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification, and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.
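
To make the Similarity Join operation described in the abstract concrete, the following is a minimal single-machine sketch of its definition: return every pair (a, b) drawn from two datasets whose distance is within a threshold. It is not one of the MapReduce-based algorithms surveyed in the paper. The names `similarity_join`, `abs_distance`, and `epsilon`, as well as the choice of a distance-threshold formulation, are assumptions made for illustration only.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def similarity_join(
    r: Iterable[T],
    s: Iterable[T],
    distance: Callable[[T, T], float],
    epsilon: float,
) -> List[Tuple[T, T]]:
    """Naive nested-loop similarity join (illustrative, hypothetical helper).

    Returns all pairs (a, b) with a in r, b in s, and distance(a, b) <= epsilon.
    It compares every pair, so the cost is |r| * |s| distance computations.
    """
    s_list = list(s)  # materialize s so it can be scanned once per element of r
    return [(a, b) for a in r for b in s_list if distance(a, b) <= epsilon]


def abs_distance(a: float, b: float) -> float:
    """Assumed distance function for one-dimensional numeric data."""
    return abs(a - b)


if __name__ == "__main__":
    pairs = similarity_join([1.0, 5.0, 9.0], [1.5, 8.0, 20.0], abs_distance, 1.0)
    print(pairs)  # [(1.0, 1.5), (9.0, 8.0)]
```

The quadratic number of comparisons in this sketch is the kind of cost that distributed SJ techniques aim to spread across machines or avoid; how each surveyed algorithm does so is described in the paper itself, not here.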

Original language: English (US)
Title of host publication: Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings
Publisher: Springer Verlag
Number of pages: 15
Volume: 9939 LNCS
ISBN (Print): 9783319467580
State: Published - 2016
Event: 9th International Conference on Similarity Search and Applications, SISAP 2016 - Tokyo, Japan
Duration: Oct 24 2016 to Oct 26 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9939 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 9th International Conference on Similarity Search and Applications, SISAP 2016


Keywords

  • Big data systems
  • MapReduce
  • Performance evaluation
  • Similarity joins

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

