Spatial data management in apache spark: the GeoSpark perspective and beyond

Jia Yu; Zongsi Zhang; Mohamed Elsayed

doi:10.1007/s10707-018-0330-9

Research output: Contribution to journal › Article › peer-review

106 Scopus citations

Abstract

The paper presents the details of designing and developing GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. The paper also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-tree, Quad-Tree, and KDB-Tree. The paper also shows how building local spatial indexes, e.g., R-Tree or Quad-Tree, on each Spark data partition can speed up the local computation and hence decrease the overall runtime of the spatial analytics program. Furthermore, the paper introduces a comprehensive experiment analysis that surveys and experimentally evaluates the performance of running de-facto spatial operations like spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.

Original language	English (US)
Pages (from-to)	37-78
Number of pages	42
Journal	GeoInformatica
Volume	23
Issue number	1
DOIs	https://doi.org/10.1007/s10707-018-0330-9
State	Published - Jan 15 2019

Keywords

Big geospatial data
Distributed computing
Spatial databases

ASJC Scopus subject areas

Information Systems
Geography, Planning and Development

Access to Document

10.1007/s10707-018-0330-9

Cite this

@article{d6663c12d8a844b3ad054b0c24ca744f,
title = "Spatial data management in apache spark: the GeoSpark perspective and beyond",
abstract = "The paper presents the details of designing and developing GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. The paper also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-tree, Quad-Tree, and KDB-Tree. The paper also shows how building local spatial indexes, e.g., R-Tree or Quad-Tree, on each Spark data partition can speed up the local computation and hence decrease the overall runtime of the spatial analytics program. Furthermore, the paper introduces a comprehensive experiment analysis that surveys and experimentally evaluates the performance of running de-facto spatial operations like spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.",
keywords = "Big geospatial data, Distributed computing, Spatial databases",
author = "Jia Yu and Zongsi Zhang and Mohamed Elsayed",
year = "2019",
month = jan,
day = "15",
doi = "10.1007/s10707-018-0330-9",
language = "English (US)",
volume = "23",
pages = "37--78",
journal = "GeoInformatica",
issn = "1384-6175",
publisher = "Kluwer Academic Publishers",
number = "1",
}

TY - JOUR
T1 - Spatial data management in apache spark
T2 - the GeoSpark perspective and beyond
AU - Yu, Jia
AU - Zhang, Zongsi
AU - Elsayed, Mohamed
PY - 2019/1/15
Y1 - 2019/1/15
N2 - The paper presents the details of designing and developing GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. The paper also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-tree, Quad-Tree, and KDB-Tree. The paper also shows how building local spatial indexes, e.g., R-Tree or Quad-Tree, on each Spark data partition can speed up the local computation and hence decrease the overall runtime of the spatial analytics program. Furthermore, the paper introduces a comprehensive experiment analysis that surveys and experimentally evaluates the performance of running de-facto spatial operations like spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.
AB - The paper presents the details of designing and developing GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. The paper also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-tree, Quad-Tree, and KDB-Tree. The paper also shows how building local spatial indexes, e.g., R-Tree or Quad-Tree, on each Spark data partition can speed up the local computation and hence decrease the overall runtime of the spatial analytics program. Furthermore, the paper introduces a comprehensive experiment analysis that surveys and experimentally evaluates the performance of running de-facto spatial operations like spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.
KW - Big geospatial data
KW - Distributed computing
KW - Spatial databases
UR - http://www.scopus.com/inward/record.url?scp=85055711759&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85055711759&partnerID=8YFLogxK
U2 - 10.1007/s10707-018-0330-9
DO - 10.1007/s10707-018-0330-9
M3 - Article
AN - SCOPUS:85055711759
SN - 1384-6175
VL - 23
SP - 37
EP - 78
JO - GeoInformatica
JF - GeoInformatica
IS - 1
ER -
