An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)

Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu

Research output: Contribution to journal › Conference article › peer-review

2 Scopus citations

Abstract

We study the performance of a commercially available large language model (LLM) known as ChatGPT on math word problems (MWPs) from the dataset DRAW-1K. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work: it fails 20% of the time when it provides work, compared with 84% when it does not. Further, several characteristics of an MWP, relating to the number of unknowns and the number of operations, lead to a higher probability of failure than the prior; in particular, across all experiments, the probability of failure increases linearly with the number of addition and subtraction operations. To support further work on characterizing LLM performance, we release the dataset of ChatGPT's responses to the MWPs and present baseline machine learning models that predict whether ChatGPT can correctly answer a given MWP.
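
The abstract mentions baseline machine learning models that predict whether ChatGPT answers an MWP correctly from structural properties such as the number of unknowns and operations. Below is a minimal sketch of such a baseline, assuming a logistic-regression classifier over hand-crafted features; the feature names, the toy data, and the choice of model are illustrative assumptions and not necessarily the paper's exact setup.

```python
# Minimal, illustrative sketch of a baseline "failure prediction" model.
# Assumption: each MWP is summarized by simple structural features such as
# the number of unknowns and the number of addition/subtraction operations.
# The feature set, toy data, and logistic-regression choice are assumptions,
# not necessarily the baselines used in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix: [num_unknowns, num_add_sub_ops, num_mul_div_ops]
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [2, 2, 1],
    [2, 3, 0],
    [3, 4, 1],
    [1, 1, 2],
])
# Hypothetical labels: 1 = ChatGPT failed the problem, 0 = answered correctly.
y = np.array([0, 0, 1, 1, 1, 0])

clf = LogisticRegression().fit(X, y)

# Estimated failure probability for a new (hypothetical) problem with
# 2 unknowns, 3 additions/subtractions, and 1 multiplication/division.
print(clf.predict_proba([[2, 3, 1]])[0, 1])
```

In this sketch, a higher count of addition and subtraction operations would push the predicted failure probability upward, mirroring the linear trend the abstract reports.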

Original language: English (US)
Journal: CEUR Workshop Proceedings
Volume: 3433
State: Published - 2023
Event: AAAI 2023 Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering, AAAI-MAKE 2023 - San Francisco, United States
Duration: Mar 27, 2023 - Mar 29, 2023

Keywords

  • ChatGPT
  • Large Language Models
  • Math Word Problems

ASJC Scopus subject areas

  • General Computer Science
