Modularized textual grounding for counterfactual resilience

Zhiyuan Fang, Shu Kong, Charless Fowlkes, Yezhou Yang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

12 Scopus citations


Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries. To achieve high grounding precision, current textual grounding methods heavily rely on large-scale training data with manual annotations at the pixel level. Such annotations are expensive to obtain and thus severely narrow the model's scope of real-world applications. Moreover, most of these methods sacrifice interpretability, generalizability, and they neglect the importance of being resilient to counterfactual inputs. To address these issues, we propose a visual grounding system which is 1) end-to-end trainable in a weakly supervised fashion with only image-level annotations, and 2) counterfactually resilient owing to the modular design. Specifically, we decompose textual descriptions into three levels: entity, semantic attribute, color information, and perform compositional grounding progressively. We validate our model through a series of experiments and demonstrate its improvement over the state-of-the-art methods. In particular, our model's performance not only surpasses other weakly/un-supervised methods and even approaches the strongly supervised ones, but also is interpretable for decision making and performs much better in face of counterfactual classes than all the others.

Original languageEnglish (US)
Title of host publicationProceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
PublisherIEEE Computer Society
Number of pages11
ISBN (Electronic)9781728132938
StatePublished - Jun 2019
Event32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States
Duration: Jun 16 2019Jun 20 2019

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print)1063-6919


Conference32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
Country/TerritoryUnited States
CityLong Beach


  • Categorization
  • Recognition: Detection
  • Retrieval
  • Vision + Language
  • Vision Applications and Systems

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition


Dive into the research topics of 'Modularized textual grounding for counterfactual resilience'. Together they form a unique fingerprint.

Cite this