WeaQA: Weak Supervision via Captions for Visual Question Answering

Pratyay Banerjee; Tejas Gokhale; Yezhou Yang; Chitta Baral

Engineering, Ira A. Fulton Schools of (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Scopus citations

Abstract

Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.
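To make the spatial-pyramid idea in the abstract concrete, the following is a minimal sketch of how an image might be divided into patches at several grid resolutions, so that no object detector or bounding box annotation is needed. It is an illustration under stated assumptions, not the authors' implementation: the function name `spatial_pyramid_boxes` and the default grid sizes `(1, 2, 4)` are hypothetical choices, and the abstract does not specify the grid resolutions or the feature extractor applied to each patch.

```python
from typing import List, Tuple

# A patch is described by pixel coordinates (left, top, right, bottom).
Box = Tuple[int, int, int, int]


def spatial_pyramid_boxes(
    width: int,
    height: int,
    grids: Tuple[int, ...] = (1, 2, 4),  # assumed grid sizes, not from the paper
) -> List[Box]:
    """Return patch boxes for each g x g grid in `grids`, coarse to fine."""
    boxes: List[Box] = []
    for g in grids:
        for row in range(g):
            for col in range(g):
                # Integer division keeps adjacent patches contiguous and
                # makes the last patch end exactly at the image border.
                left = col * width // g
                right = (col + 1) * width // g
                top = row * height // g
                bottom = (row + 1) * height // g
                boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    patches = spatial_pyramid_boxes(640, 480)
    print(len(patches), "patches; first:", patches[0], "last:", patches[-1])
```

Each box could then be cropped and passed to an image encoder to obtain one feature vector per patch, taking the place of detector-derived region features. Which encoder to use is left open here.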

Original language	English (US)
Title of host publication	Findings of the Association for Computational Linguistics
Subtitle of host publication	ACL-IJCNLP 2021
Editors	Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Publisher	Association for Computational Linguistics (ACL)
Pages	3420-3435
Number of pages	16
ISBN (Electronic)	9781954085541
State	Published - 2021
Event	Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 - Virtual, Online
Duration	Aug 1 2021 → Aug 6 2021

Publication series

Name	Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Conference

Conference	Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
City	Virtual, Online
Period	8/1/21 → 8/6/21

ASJC Scopus subject areas

Language and Linguistics
Linguistics and Language

Cite this

Banerjee, P., Gokhale, T., Yang, Y., & Baral, C. (2021). WeaQA: Weak Supervision via Captions for Visual Question Answering. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 3420-3435). (Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021). Association for Computational Linguistics (ACL).

@inproceedings{2406a7f12a5048cebd6ba1b2c29aa262,
  title = "WeaQA: Weak Supervision via Captions for Visual Question Answering",
  abstract = "Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.",
  author = "Pratyay Banerjee and Tejas Gokhale and Yezhou Yang and Chitta Baral",
  note = "Funding Information: The authors acknowledge support from the DARPA SAIL-ON program W911NF2020006, ONR award N00014-20-1-2332, and NSF grant 1816039, and the anonymous reviewers for their insightful discussion. Publisher Copyright: {\textcopyright} 2021 Association for Computational Linguistics; Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 ; Conference date: 01-08-2021 Through 06-08-2021",
  year = "2021",
  language = "English (US)",
  series = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
  publisher = "Association for Computational Linguistics (ACL)",
  pages = "3420--3435",
  editor = "Chengqing Zong and Fei Xia and Wenjie Li and Roberto Navigli",
  booktitle = "Findings of the Association for Computational Linguistics",
}

TY - GEN
T1 - WeaQA
T2 - Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
AU - Banerjee, Pratyay
AU - Gokhale, Tejas
AU - Yang, Yezhou
AU - Baral, Chitta
N1 - Funding Information: The authors acknowledge support from the DARPA SAIL-ON program W911NF2020006, ONR award N00014-20-1-2332, and NSF grant 1816039, and the anonymous reviewers for their insightful discussion. Publisher Copyright: © 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
AB - Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.
UR - http://www.scopus.com/inward/record.url?scp=85115435681&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115435681&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85115435681
T3 - Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
SP - 3420
EP - 3435
BT - Findings of the Association for Computational Linguistics
A2 - Zong, Chengqing
A2 - Xia, Fei
A2 - Li, Wenjie
A2 - Navigli, Roberto
PB - Association for Computational Linguistics (ACL)
Y2 - 1 August 2021 through 6 August 2021
ER -
