Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems

Saman Biookaghazadeh; Shujia Zhou; Ming Zhao

doi:10.1109/NAS.2017.8026864

School of Computing and Augmented Intelligence (IAFSE-SCAI)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Big-Data systems are increasingly important for solving the data-driven problems in many science domains including geosciences. However, existing big-data systems cannot support the efficient processing of self-describing data formats such as NetCDF which are commonly used by scientific communities for data distribution and sharing. This limitation presents a serious hurdle to the further adoption of big-data systems by science domains. This paper presents Kaleido, a solution to this problem by enabling big-data systems to efficiently store and process scientific data. Specifically, it enables Hadoop to directly store NetCDF data on HDFS, and process them in MapReduce using convenient APIs. It also enables Hive to support queries on NetCDF data, transparent to the users. Moreover, it employs optimizations tailored to scientific data, particularly dimension-aware layout which allows efficient execution of subset queries targeting any dimension of the multi-dimensional data. The paper presents a comprehensive evaluation of Kaleido using representative queries on a typical geoscientific dataset. The results show that Kaleido achieves substantial speedup and space saving compared to existing solutions for storing and processing NetCDF data on Hadoop, and it also substantially outperforms the state-of-the-art solutions for supporting subset queries on scientific data.
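The abstract does not spell out how the dimension-aware layout works, so the following is only an illustrative sketch of the underlying problem, not Kaleido's implementation. It counts how many contiguous byte runs a subset query touches when a small, hypothetical 3-D variable (time, lat, lon) is stored linearly in two different dimension orders. The function name `contiguous_runs`, the array shape, and the query are all assumptions made for the example.

```python
from itertools import product


def contiguous_runs(shape, order, subset):
    """Count contiguous runs of flat offsets touched by a subset query.

    shape:  size of each dimension, e.g. (time, lat, lon)
    order:  dimension indices from slowest- to fastest-varying in storage
    subset: one range per dimension, given in the original dimension order
    """
    # Stride of each dimension under the chosen storage order.
    strides = {}
    step = 1
    for dim in reversed(order):
        strides[dim] = step
        step *= shape[dim]

    # Flat offsets of every element selected by the query.
    offsets = sorted(
        sum(idx[d] * strides[d] for d in range(len(shape)))
        for idx in product(*subset)
    )

    # A new run starts wherever consecutive offsets are not adjacent.
    runs = 1 if offsets else 0
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:
            runs += 1
    return runs


shape = (4, 3, 5)  # hypothetical (time, lat, lon) sizes
# Query: every time step and latitude at a single longitude.
query = (range(4), range(3), range(2, 3))

time_major = contiguous_runs(shape, (0, 1, 2), query)  # lon varies fastest
lon_major = contiguous_runs(shape, (2, 0, 1), query)   # lon varies slowest
print(time_major, lon_major)
```

With longitude as the fastest-varying dimension, the query touches 12 separate runs; with longitude as the slowest-varying dimension, it reads one contiguous run. A single fixed layout therefore favors subset queries along some dimensions over others, which is the kind of asymmetry a layout that accounts for dimensions would aim to remove.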

Original language	English (US)
Title of host publication	2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9781538634868
DOIs	https://doi.org/10.1109/NAS.2017.8026864
State	Published - Sep 6 2017
Event	2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Shenzhen, China Duration: Aug 7 2017 → Aug 9 2017

Publication series

Name	2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings

Other

Other	2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017
Country/Territory	China
City	Shenzhen
Period	8/7/17 → 8/9/17

ASJC Scopus subject areas

Computer Networks and Communications
Hardware and Architecture

Cite this

Biookaghazadeh, S., Zhou, S., & Zhao, M. (2017). Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems. In 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings, Article 8026864 (2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/NAS.2017.8026864

Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems. / Biookaghazadeh, Saman; Zhou, Shujia; Zhao, Ming.
2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2017. 8026864 (2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings).

Biookaghazadeh, S, Zhou, S & Zhao, M 2017, Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems. in 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings, 8026864, 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings, Institute of Electrical and Electronics Engineers Inc., 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017, Shenzhen, China, 8/7/17. https://doi.org/10.1109/NAS.2017.8026864

Biookaghazadeh S, Zhou S, Zhao M. Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems. In 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2017. 8026864. (2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings). doi: 10.1109/NAS.2017.8026864

Biookaghazadeh, Saman; Zhou, Shujia; Zhao, Ming. / Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems. 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2017. (2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings).

@inproceedings{ef1262d5310d4e0da5c6f22064f85a8d,

title = "Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems",

abstract = "Big-Data systems are increasingly important for solving the data-driven problems in many science domains including geosciences. However, existing big-data systems cannot support the efficient processing of self-describing data formats such as NetCDF which are commonly used by scientific communities for data distribution and sharing. This limitation presents a serious hurdle to the further adoption of big-data systems by science domains. This paper presents Kaleido, a solution to this problem by enabling big-data systems to efficiently store and process scientific data. Specifically, it enables Hadoop to directly store NetCDF data on HDFS, and process them in MapReduce using convenient APIs. It also enables Hive to support queries on NetCDF data, transparent to the users. Moreover, it employs optimizations tailored to scientific data, particularly dimension-aware layout which allows efficient execution of subset queries targeting any dimension of the multi-dimensional data. The paper presents a comprehensive evaluation of Kaleido using representative queries on a typical geoscientific dataset. The results show that Kaleido achieves substantial speedup and space saving compared to existing solutions for storing and processing NetCDF data on Hadoop, and it also substantially outperforms the state-of-the-art solutions for supporting subset queries on scientific data.",

author = "Saman Biookaghazadeh and Shujia Zhou and Ming Zhao",

note = "Funding Information: This research is sponsored by National Science Foundation awards CNS-1562837, CNS-1629888, CMMI-1610282, and IIS-1633381, and CAREER award CNS-1253944. Publisher Copyright: {\textcopyright} 2017 IEEE.; 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 ; Conference date: 07-08-2017 Through 09-08-2017",

year = "2017",

month = sep,

day = "6",

doi = "10.1109/NAS.2017.8026864",

language = "English (US)",

series = "2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings",

}

TY - GEN

T1 - Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems

T2 - 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017

AU - Biookaghazadeh, Saman

AU - Zhou, Shujia

AU - Zhao, Ming

N1 - Funding Information: This research is sponsored by National Science Foundation awards CNS-1562837, CNS-1629888, CMMI-1610282, and IIS-1633381, and CAREER award CNS-1253944. Publisher Copyright: © 2017 IEEE.

PY - 2017/9/6

Y1 - 2017/9/6

AB - Big-Data systems are increasingly important for solving the data-driven problems in many science domains including geosciences. However, existing big-data systems cannot support the efficient processing of self-describing data formats such as NetCDF which are commonly used by scientific communities for data distribution and sharing. This limitation presents a serious hurdle to the further adoption of big-data systems by science domains. This paper presents Kaleido, a solution to this problem by enabling big-data systems to efficiently store and process scientific data. Specifically, it enables Hadoop to directly store NetCDF data on HDFS, and process them in MapReduce using convenient APIs. It also enables Hive to support queries on NetCDF data, transparent to the users. Moreover, it employs optimizations tailored to scientific data, particularly dimension-aware layout which allows efficient execution of subset queries targeting any dimension of the multi-dimensional data. The paper presents a comprehensive evaluation of Kaleido using representative queries on a typical geoscientific dataset. The results show that Kaleido achieves substantial speedup and space saving compared to existing solutions for storing and processing NetCDF data on Hadoop, and it also substantially outperforms the state-of-the-art solutions for supporting subset queries on scientific data.

UR - http://www.scopus.com/inward/record.url?scp=85015184100&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85015184100&partnerID=8YFLogxK

U2 - 10.1109/NAS.2017.8026864

DO - 10.1109/NAS.2017.8026864

M3 - Conference contribution

AN - SCOPUS:85015184100

T3 - 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings

BT - 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 7 August 2017 through 9 August 2017

ER -
