TY - GEN
T1 - John is 50 years old, can his son be 65? Evaluating NLP Models' Understanding of Feasibility
AU - Gupta, Himanshu
AU - Varshney, Neeraj
AU - Mishra, Swaroop
AU - Pal, Kuntal Kumar
AU - Sawant, Saurabh Arjun
AU - Scaria, Kevin
AU - Goyal, Siddharth
AU - Baral, Chitta
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - In current NLP research, large-scale language models and their abilities are widely being discussed. Some recent works have also found notable failures of these models. Often these failure examples involve complex reasoning abilities. This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible. To this end, we introduce FeasibilityQA, a question-answering dataset involving binary classification (BCQ) and multi-choice multi-correct questions (MCQ) that test understanding of feasibility. We show that even state-of-the-art models such as GPT-3, GPT-2, and T5 struggle to answer the feasibility questions correctly. Specifically, on MCQ and BCQ questions, GPT-3 achieves an accuracy of just (19%, 62%) and (25%, 64%) in zero-shot and few-shot settings, respectively. We also evaluate models by providing relevant knowledge statements required to answer the question. We find that the additional knowledge leads to a 7% gain in performance, but the overall performance still remains low. These results make one wonder how much commonsense knowledge about action feasibility is encoded in state-of-the-art models and how well they can reason about it.
UR - http://www.scopus.com/inward/record.url?scp=85159861070&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85159861070&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85159861070
T3 - EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
SP - 407
EP - 417
BT - EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Y2 - 2 May 2023 through 6 May 2023
ER -