Exploiting common subexpressions for cloud query processing

Yasin Silva; Paul Ake Larson; Jingren Zhou

doi:10.1109/ICDE.2012.106

Arizona State University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

29 Scopus citations

Abstract

Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on column A and another partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.
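The cost-based trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's algorithm and does not reflect SCOPE's optimizer; the function name `best_shared_plan`, the flat cost numbers, and the assumption that a recomputed result arrives already partitioned as its consumer wants are all hypothetical simplifications.

```python
def best_shared_plan(compute_cost, repartition_cost, consumer_needs):
    """Pick the cheapest way to feed one subexpression to several consumers.

    consumer_needs lists the partitioning column each consumer requires.
    Toy assumption: recomputing per consumer yields the required
    partitioning at no extra cost, while sharing a single result forces
    one repartition for every consumer whose requirement differs.
    """
    # Baseline: the conventional plan executes the subexpression per consumer.
    best = (compute_cost * len(consumer_needs), "recompute per consumer")

    # Alternative: execute once with some partitioning, repartition the rest.
    for candidate in sorted(set(consumer_needs)):
        mismatched = sum(1 for need in consumer_needs if need != candidate)
        cost = compute_cost + repartition_cost * mismatched
        if cost < best[0]:
            best = (cost, f"share once, partitioned on {candidate}")
    return best


if __name__ == "__main__":
    # Cheap repartitioning favors sharing; expensive repartitioning does not.
    print(best_shared_plan(100, 30, ["A", "B"]))
    print(best_shared_plan(100, 150, ["A", "B"]))
```

Under these made-up costs, sharing wins whenever one execution plus the needed repartitions is cheaper than executing the subexpression once per consumer, which is the kind of comparison a cost-based optimizer has to make.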

Original language	English (US)
Title of host publication	Proceedings - International Conference on Data Engineering
Pages	1337-1348
Number of pages	12
DOIs	https://doi.org/10.1109/ICDE.2012.106
State	Published - 2012
Event	IEEE 28th International Conference on Data Engineering, ICDE 2012 - Arlington, VA, United States
Duration	Apr 1 2012 → Apr 5 2012

Other

Other	IEEE 28th International Conference on Data Engineering, ICDE 2012
Country/Territory	United States
City	Arlington, VA
Period	4/1/12 → 4/5/12

ASJC Scopus subject areas

Information Systems
Signal Processing
Software

Access to Document

10.1109/ICDE.2012.106

Cite this

@inproceedings{6b7a935471f249558028c456eb728b6a,

title = "Exploiting common subexpressions for cloud query processing",

abstract = "Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on column A and another partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57\% lower estimated costs.",

author = "Yasin Silva and Larson, {Paul Ake} and Jingren Zhou",

year = "2012",

doi = "10.1109/ICDE.2012.106",

language = "English (US)",

pages = "1337--1348",

booktitle = "Proceedings - International Conference on Data Engineering",

note = "IEEE 28th International Conference on Data Engineering, ICDE 2012 ; Conference date: 01-04-2012 Through 05-04-2012",

}

TY - GEN

T1 - Exploiting common subexpressions for cloud query processing

AU - Silva, Yasin

AU - Larson, Paul Ake

AU - Zhou, Jingren

PY - 2012

Y1 - 2012

N2 - Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on column A and another partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.

AB - Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on column A and another partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.

UR - http://www.scopus.com/inward/record.url?scp=84864252206&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84864252206&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2012.106

DO - 10.1109/ICDE.2012.106

M3 - Conference contribution

AN - SCOPUS:84864252206

SP - 1337

EP - 1348

BT - Proceedings - International Conference on Data Engineering

T2 - IEEE 28th International Conference on Data Engineering, ICDE 2012

Y2 - 1 April 2012 through 5 April 2012

ER -
