Pangea: Monolithic distributed storage for data analytics

Jia Zou, Arun Iyengar, Chris Jermaine

Research output: Contribution to journal › Conference article › peer-review

7 Scopus citations

Abstract

Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and nonshared execution data in separate systems such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper we propose a single system called Pangea that manages all data (both intermediate and long-lived data, along with their buffering/caching, data placement optimization, and failure recovery) in one monolithic distributed storage system, without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.

Original language: English (US)
Pages (from-to): 681-694
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 12
Issue number: 6
DOIs
State: Published - 2018
Externally published: Yes
Event: 45th International Conference on Very Large Data Bases, VLDB 2019 - Los Angeles, United States
Duration: Aug 26 2019 to Aug 30 2019

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)
