TY - GEN
T1 - InstructExcel: A Benchmark for Natural Language Instruction in Excel
T2 - 2023 Findings of the Association for Computational Linguistics: EMNLP 2023
AU - Payan, Justin
AU - Mishra, Swaroop
AU - Singh, Mukul
AU - Negreanu, Carina
AU - Poelitz, Christian
AU - Baral, Chitta
AU - Roy, Subhro
AU - Chakravarthy, Rasika
AU - Van Durme, Benjamin
AU - Nouri, Elnaz
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - With the evolution of Large Language Models (LLMs), we can solve increasingly complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript API for executing many tasks in Excel) that solves Excel-specific tasks provided via natural language user instructions. To do so we introduce a new large-scale benchmark, INSTRUCTEXCEL, created by leveraging the 'Automate' feature in Excel to automatically generate OfficeScripts from users' actions. Our benchmark includes over 10k samples covering 170+ Excel operations across 2,000 publicly available Excel spreadsheets. Experiments across various zero-shot and few-shot settings show that INSTRUCTEXCEL is a hard benchmark for state-of-the-art models like GPT-4. We observe that (1) using GPT-4 over GPT-3.5, (2) providing more in-context examples, and (3) dynamic prompting can help improve performance on this benchmark.
UR - http://www.scopus.com/inward/record.url?scp=85183307568&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85183307568&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85183307568
T3 - Findings of the Association for Computational Linguistics: EMNLP 2023
SP - 4026
EP - 4043
BT - Findings of the Association for Computational Linguistics: EMNLP 2023
PB - Association for Computational Linguistics (ACL)
Y2 - 6 December 2023 through 10 December 2023
ER -