An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)

Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu

Research output: Contribution to journal › Conference article › peer-review

2 Scopus citations

Abstract

We study the performance of a commercially available large language model (LLM) known as ChatGPT on math word problems (MWPs) from the dataset DRAW-1K. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work: it fails 20% of the time when it provides work, compared with 84% when it does not. Further, several characteristics of an MWP, relating to the number of unknowns and the number of operations, lead to a higher probability of failure than the prior; in particular, across all experiments, the probability of failure increases linearly with the number of addition and subtraction operations. To support further work on characterizing LLM performance, we release the dataset of ChatGPT's responses to the MWPs and present baseline machine learning models that predict whether ChatGPT can correctly answer a given MWP.
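
The abstract mentions baseline machine learning models that predict whether ChatGPT answers an MWP correctly from structural properties such as the number of unknowns and operations. Below is a minimal sketch of such a baseline, assuming a logistic-regression classifier over hand-crafted features; the feature names, the toy data, and the choice of model are illustrative assumptions and not necessarily the paper's exact setup.

```python
# Minimal, illustrative sketch of a baseline "failure prediction" model.
# Assumption: each MWP is summarized by simple structural features such as
# the number of unknowns and the number of addition/subtraction operations.
# The feature set, toy data, and logistic-regression choice are assumptions,
# not necessarily the baselines used in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix: [num_unknowns, num_add_sub_ops, num_mul_div_ops]
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [2, 2, 1],
    [2, 3, 0],
    [3, 4, 1],
    [1, 1, 2],
])
# Hypothetical labels: 1 = ChatGPT failed the problem, 0 = answered correctly.
y = np.array([0, 0, 1, 1, 1, 0])

clf = LogisticRegression().fit(X, y)

# Estimated failure probability for a new (hypothetical) problem with
# 2 unknowns, 3 additions/subtractions, and 1 multiplication/division.
print(clf.predict_proba([[2, 3, 1]])[0, 1])
```

In this sketch, a higher count of addition and subtraction operations would push the predicted failure probability upward, mirroring the linear trend the abstract reports.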

Original language: English (US)
Journal: CEUR Workshop Proceedings
Volume: 3433
State: Published - 2023
Event: AAAI 2023 Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering, AAAI-MAKE 2023 - San Francisco, United States
Duration: Mar 27, 2023 - Mar 29, 2023

Keywords

  • ChatGPT
  • Large Language Models
  • Math Word Problems

ASJC Scopus subject areas

  • General Computer Science
