Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems

Saman Biookaghazadeh, Shujia Zhou, Ming Zhao

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Scopus citations

Abstract

Big-data systems are increasingly important for solving data-driven problems in many science domains, including the geosciences. However, existing big-data systems cannot efficiently process self-describing data formats such as NetCDF, which are commonly used by scientific communities for data distribution and sharing. This limitation presents a serious hurdle to the further adoption of big-data systems by science domains. This paper presents Kaleido, which addresses this problem by enabling big-data systems to efficiently store and process scientific data. Specifically, it enables Hadoop to store NetCDF data directly on HDFS and to process it in MapReduce using convenient APIs. It also enables Hive to support queries on NetCDF data, transparently to users. Moreover, it employs optimizations tailored to scientific data, particularly a dimension-aware layout that allows efficient execution of subset queries targeting any dimension of the multi-dimensional data. The paper presents a comprehensive evaluation of Kaleido using representative queries on a typical geoscientific dataset. The results show that Kaleido achieves substantial speedup and space savings compared to existing solutions for storing and processing NetCDF data on Hadoop, and that it also substantially outperforms state-of-the-art solutions for supporting subset queries on scientific data.

Original language: English (US)
Title of host publication: 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538634868
State: Published - Sep 6 2017
Event: 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Shenzhen, China
Duration: Aug 7 2017 to Aug 9 2017

Publication series

Name: 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings

Other

Other: 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017
Country/Territory: China
City: Shenzhen
Period: 8/7/17 to 8/9/17

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
