Abstract
Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and nonshared execution data in separate systems such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper we propose a single system called Pangea that can manage all data (both intermediate and long-lived data, together with their buffering and caching, data placement optimization, and failure recovery) in one monolithic distributed storage system, without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.
Original language | English (US) |
---|---|
Pages (from-to) | 681-694 |
Number of pages | 14 |
Journal | Proceedings of the VLDB Endowment |
Volume | 12 |
Issue number | 6 |
State | Published - 2018 |
Externally published | Yes |
Event | 45th International Conference on Very Large Data Bases, VLDB 2019 - Los Angeles, United States. Duration: Aug 26 2019 → Aug 30 2019 |
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Computer Science (all)