Pivot-based approximate k-NN similarity joins for big high-dimensional data

Přemysl Čech; J. Lokoč; Yasin N. Silva

doi:10.1016/j.is.2019.06.006

Mathematical and Natural Sciences, School of (SMNS)

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

Given an appropriate similarity model, the k-nearest neighbor similarity join represents a useful yet costly operator for data mining, data analysis and data exploration applications. The time to evaluate the operator depends on the size of datasets, data distribution and the dimensionality of data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision and execution time.

Original language	English (US)
Article number	101410
Journal	Information Systems
Volume	87
DOIs	https://doi.org/10.1016/j.is.2019.06.006
State	Published - Jan 2020

Keywords

Approximate similarity join
Hadoop
High-dimensional data
MapReduce
Spark
k-NN

ASJC Scopus subject areas

Software
Information Systems
Hardware and Architecture

Cite this

@article{909705ec6e414968962bb59aac86a45d,

title = "Pivot-based approximate k-NN similarity joins for big high-dimensional data",

abstract = "Given an appropriate similarity model, the k-nearest neighbor similarity join represents a useful yet costly operator for data mining, data analysis and data exploration applications. The time to evaluate the operator depends on the size of datasets, data distribution and the dimensionality of data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision and execution time.",

keywords = "Approximate similarity join, Hadoop, High-dimensional data, MapReduce, Spark, k-NN",

author = "P{\v r}emysl {\v C}ech and J. Loko{\v c} and Silva, {Yasin N.}",

note = "Funding Information: This project was supported by the Charles University in Prague grant GAUK 201515, the Czech Science Foundation (GA{\v C}R) project Nr. 17-22224S and partially by Charles University grant SVV-260451. Publisher Copyright: {\textcopyright} 2019 Elsevier Ltd",

year = "2020",

month = jan,

doi = "10.1016/j.is.2019.06.006",

language = "English (US)",

volume = "87",

journal = "Information Systems",

issn = "0306-4379",

publisher = "Elsevier Limited",

}

TY - JOUR

T1 - Pivot-based approximate k-NN similarity joins for big high-dimensional data

AU - Čech, Přemysl

AU - Lokoč, J.

AU - Silva, Yasin N.

N1 - Funding Information: This project was supported by the Charles University in Prague grant GAUK 201515, the Czech Science Foundation (GAČR) project Nr. 17-22224S and partially by Charles University grant SVV-260451. Publisher Copyright: © 2019 Elsevier Ltd

PY - 2020/1

Y1 - 2020/1

AB - Given an appropriate similarity model, the k-nearest neighbor similarity join represents a useful yet costly operator for data mining, data analysis and data exploration applications. The time to evaluate the operator depends on the size of datasets, data distribution and the dimensionality of data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision and execution time.

KW - Approximate similarity join

KW - Hadoop

KW - High-dimensional data

KW - MapReduce

KW - Spark

KW - k-NN

UR - http://www.scopus.com/inward/record.url?scp=85070216906&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85070216906&partnerID=8YFLogxK

U2 - 10.1016/j.is.2019.06.006

DO - 10.1016/j.is.2019.06.006

M3 - Article

AN - SCOPUS:85070216906

SN - 0306-4379

VL - 87

JO - Information Systems

JF - Information Systems

M1 - 101410

ER -
