WATSON: A Workflow-based Data Storage Optimizer for Analytics

Jia Zou; Ming Zhao; Juwei Shi; Chen Wang

Research output: Contribution to conference › Paper › peer-review

Abstract

This paper studies the automatic optimization of data placement parameters for the inter-job write once read many (WORM) scenario, where data is first materialized to storage by a producer job and then accessed many times by one or more consumer jobs. Such a scenario is ubiquitous in Big Data analytics applications, but existing Big Data auto-tuning techniques often focus on single-job performance. To address the shortcomings of existing work, this paper investigates data placement parameters regarding blocking, partitioning, and replication, and models the trade-offs caused by different configurations of these parameters through a producer-consumer model. We then present a novel cross-layer solution, WATSON, which can automatically predict future workloads’ data access patterns and tune data placement parameters accordingly to optimize performance for an inter-job WORM scenario. WATSON can achieve up to an eightfold performance speedup on various analytics workloads.

Original language	English (US)
State	Published - 2020
Event	36th International Conference on Massive Storage Systems and Technology, MSST 2020 - Virtual, Online
Duration	Oct 29 2020 → Oct 30 2020

Keywords

auto-tuning
Big Data analytics
data placement
parameter optimization
storage

ASJC Scopus subject areas

Electrical and Electronic Engineering
Hardware and Architecture

Cite this

@conference{c2e8c5357252406ca40d168a3ee337d6,
  title = "WATSON: A Workflow-based Data Storage Optimizer for Analytics",
  abstract = "This paper studies the automatic optimization of data placement parameters for the inter-job write once read many (WORM) scenario where data is first materialized to storage by a producer job, and then accessed for many times by one or more consumer jobs. Such scenario is ubiquitous in Big Data analytics applications but existing Big Data auto-tuning techniques are often focused on single job performance. To address the shortcomings in existing works, this paper investigates data placement parameters regarding blocking, partitioning and replication and models the trade-offs caused by different configurations of these parameters through a producer-consumer model. We then present a novel cross-layer solution, WATSON, which can automatically predict future workloads{\textquoteright} data access patterns and tune data placement parameters accordingly to optimize the performance for an inter-job WORM scenario. WATSON can achieve up to eight times performance speedup on various analytics workloads.",
  keywords = "auto-tuning, Big Data analytics, data placement, parameter optimization, storage",
  author = "Jia Zou and Ming Zhao and Juwei Shi and Chen Wang",
  note = "Publisher Copyright: {\textcopyright} 2020 36th International Conference on Massive Storage Systems and Technology, MSST 2020. All Rights Reserved.; 36th International Conference on Massive Storage Systems and Technology, MSST 2020 ; Conference date: 29-10-2020 Through 30-10-2020",
  year = "2020",
  language = "English (US)",
}

TY - CONF

T1 - WATSON

T2 - 36th International Conference on Massive Storage Systems and Technology, MSST 2020

AU - Zou, Jia

AU - Zhao, Ming

AU - Shi, Juwei

AU - Wang, Chen

PY - 2020

Y1 - 2020

N2 - This paper studies the automatic optimization of data placement parameters for the inter-job write once read many (WORM) scenario where data is first materialized to storage by a producer job, and then accessed for many times by one or more consumer jobs. Such scenario is ubiquitous in Big Data analytics applications but existing Big Data auto-tuning techniques are often focused on single job performance. To address the shortcomings in existing works, this paper investigates data placement parameters regarding blocking, partitioning and replication and models the trade-offs caused by different configurations of these parameters through a producer-consumer model. We then present a novel cross-layer solution, WATSON, which can automatically predict future workloads’ data access patterns and tune data placement parameters accordingly to optimize the performance for an inter-job WORM scenario. WATSON can achieve up to eight times performance speedup on various analytics workloads.

AB - This paper studies the automatic optimization of data placement parameters for the inter-job write once read many (WORM) scenario where data is first materialized to storage by a producer job, and then accessed for many times by one or more consumer jobs. Such scenario is ubiquitous in Big Data analytics applications but existing Big Data auto-tuning techniques are often focused on single job performance. To address the shortcomings in existing works, this paper investigates data placement parameters regarding blocking, partitioning and replication and models the trade-offs caused by different configurations of these parameters through a producer-consumer model. We then present a novel cross-layer solution, WATSON, which can automatically predict future workloads’ data access patterns and tune data placement parameters accordingly to optimize the performance for an inter-job WORM scenario. WATSON can achieve up to eight times performance speedup on various analytics workloads.

KW - auto-tuning

KW - Big Data analytics

KW - data placement

KW - parameter optimization

KW - storage

UR - http://www.scopus.com/inward/record.url?scp=85115324625&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85115324625&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85115324625

Y2 - 29 October 2020 through 30 October 2020

ER -
