Abstract
Many companies now routinely run massive data analysis jobs, expressed in some scripting language, on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts produces plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on column A and another on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21% to 57% lower estimated costs.
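The trade-off described in the abstract can be illustrated with a toy cost comparison. This is only a minimal sketch under simplifying assumptions, not the algorithm implemented in SCOPE: the function names `plan_costs` and `best_plan`, the flat `compute_cost` and `repartition_cost` values, and the plan labels are all hypothetical. It compares recomputing a shared subexpression once per consumer against computing it once in a single partitioning and repartitioning for every consumer that needs a different column.

```python
def plan_costs(compute_cost, repartition_cost, consumers):
    """Return a dict of candidate plan name -> estimated cost.

    consumers: list of the partitioning column each consumer requires,
    e.g. ["A", "B"]. All costs are made-up abstract units (assumption).
    """
    plans = {}
    # Conventional plan: every consumer re-executes the subexpression,
    # producing it directly in the partitioning that consumer needs.
    plans["recompute per consumer"] = compute_cost * len(consumers)
    # Shared plans: execute the subexpression once, partitioned on one
    # candidate column, and repartition for each mismatched consumer.
    for col in sorted(set(consumers)):
        mismatched = sum(1 for c in consumers if c != col)
        plans[f"share, partitioned on {col}"] = (
            compute_cost + mismatched * repartition_cost
        )
    return plans


def best_plan(compute_cost, repartition_cost, consumers):
    """Pick the cheapest candidate plan (cost-based choice)."""
    plans = plan_costs(compute_cost, repartition_cost, consumers)
    name = min(plans, key=plans.get)
    return name, plans[name]


if __name__ == "__main__":
    # Expensive subexpression, cheap repartitioning: sharing wins.
    print(best_plan(100, 30, ["A", "B", "B"]))
    # Cheap subexpression, expensive repartitioning: recomputing wins.
    print(best_plan(10, 50, ["A", "B"]))
```

In this simplified model neither strategy is always better, which is why the choice has to be made by comparing estimated costs rather than by a fixed rule.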
Original language | English (US) |
---|---|
Title of host publication | Proceedings - International Conference on Data Engineering |
Pages | 1337-1348 |
Number of pages | 12 |
State | Published - 2012 |
Event | IEEE 28th International Conference on Data Engineering, ICDE 2012, Arlington, VA, United States (Apr 1 2012 → Apr 5 2012) |
ASJC Scopus subject areas
- Information Systems
- Signal Processing
- Software