CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images

Shailaja Keyur Sampat; Akshay Kumar; Yezhou Yang; Chitta Baral

Ira A. Fulton Schools of Engineering (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Scopus citations

Abstract

Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017a) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over image-text modality.

Original language	English (US)
Title of host publication	NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics
Subtitle of host publication	Human Language Technologies, Proceedings of the Conference
Publisher	Association for Computational Linguistics (ACL)
Pages	3692-3709
Number of pages	18
ISBN (Electronic)	9781954085466
State	Published - 2021
Event	2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 - Virtual, Online Duration: Jun 6 2021 → Jun 11 2021

Publication series

Name	NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

Conference

Conference	2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021
City	Virtual, Online
Period	6/6/21 → 6/11/21

ASJC Scopus subject areas

Computer Networks and Communications
Hardware and Architecture
Information Systems
Software

Cite this

Sampat, S. K., Kumar, A., Yang, Y., & Baral, C. (2021). CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images. In NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 3692-3709). (NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference). Association for Computational Linguistics (ACL).

CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images. / Sampat, Shailaja Keyur; Kumar, Akshay; Yang, Yezhou et al.
NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2021. p. 3692-3709 (NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference).

Sampat, SK, Kumar, A, Yang, Y & Baral, C 2021, CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images. in NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference. NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, Association for Computational Linguistics (ACL), pp. 3692-3709, 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Virtual, Online, 6/6/21.

Sampat SK, Kumar A, Yang Y, Baral C. CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images. In NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference. Association for Computational Linguistics (ACL). 2021. p. 3692-3709. (NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference).

Sampat, Shailaja Keyur ; Kumar, Akshay ; Yang, Yezhou et al. / CLEVR_HYP : A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images. NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2021. pp. 3692-3709 (NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference).

@inproceedings{7dfcf8a2624e42fcbc0309960b4a5674,

title = "CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images",

abstract = "Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017a) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over image-text modality.",

author = "Sampat, {Shailaja Keyur} and Akshay Kumar and Yezhou Yang and Chitta Baral",

note = "Funding Information: We are thankful to the anonymous reviewers for the constructive feedback. This work is partially supported by the grants NSF 1816039, DARPA W911NF2020006 and ONR N00014-20-1-2332. Publisher Copyright: {\textcopyright} 2021 Association for Computational Linguistics.; 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 ; Conference date: 06-06-2021 Through 11-06-2021",

year = "2021",

language = "English (US)",

series = "NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference",

publisher = "Association for Computational Linguistics (ACL)",

pages = "3692--3709",

booktitle = "NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics",

}

TY - GEN

T1 - CLEVR_HYP

T2 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021

AU - Sampat, Shailaja Keyur

AU - Kumar, Akshay

AU - Yang, Yezhou

AU - Baral, Chitta

N1 - Funding Information: We are thankful to the anonymous reviewers for the constructive feedback. This work is partially supported by the grants NSF 1816039, DARPA W911NF2020006 and ONR N00014-20-1-2332. Publisher Copyright: © 2021 Association for Computational Linguistics.

PY - 2021

Y1 - 2021

N2 - Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017a) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over image-text modality.

AB - Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017a) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over image-text modality.

UR - http://www.scopus.com/inward/record.url?scp=85129717055&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85129717055&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85129717055

T3 - NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

SP - 3692

EP - 3709

BT - NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics

PB - Association for Computational Linguistics (ACL)

Y2 - 6 June 2021 through 11 June 2021

ER -
