CAVAN: Commonsense Knowledge Anchored Video Captioning

Huiliang Shao, Zhiyuan Fang, Yezhou Yang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation


A video clip carries not merely an aggregation of static entities, but also a variety of interactions and relations among them. Challenges remain for a video captioning system to generate descriptions that focus on the prominent interests and align with latent aspects beyond direct observations. In this work, we present a Commonsense knowledge Anchored Video cAptioNing approach (dubbed CAVAN). CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, we acquire commonsense knowledge complementing each training caption by querying a generic knowledge atlas (ATOMIC [1]), and form a commonsense-caption entailment corpus. A BERT [2] based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model, penalizing it for generating semantically misaligned captions. Experimental results with ablations on the MSRVTT [3], V2C [4] and VATEX [5] datasets validate the effectiveness of CAVAN and show that the use of commonsense knowledge benefits video caption generation.
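The abstract describes training a captioning model against a commonsense entailment discriminator that penalizes misaligned captions. The following is a minimal illustrative sketch of that loss structure, not the authors' code: `entailment_score` is a hypothetical stand-in for the BERT-based discriminator (here a toy word-overlap scorer so the sketch stays self-contained), and `weight` is an assumed penalty coefficient.

```python
# Sketch of commonsense-anchored training loss (illustrative only).
# In CAVAN the discriminator is a BERT-based entailment model trained on a
# commonsense-caption corpus; here we substitute a toy overlap scorer.

def entailment_score(commonsense: str, caption: str) -> float:
    """Toy proxy for an entailment probability in [0, 1]
    (Jaccard word overlap between the two sentences)."""
    cs = set(commonsense.lower().split())
    cap = set(caption.lower().split())
    return len(cs & cap) / max(len(cs | cap), 1)

def anchored_loss(caption_loss: float, commonsense: str,
                  caption: str, weight: float = 0.5) -> float:
    """Standard captioning loss plus a penalty for captions the
    discriminator judges semantically misaligned with commonsense."""
    penalty = 1.0 - entailment_score(commonsense, caption)
    return caption_loss + weight * penalty

# A caption consistent with the commonsense inference incurs no penalty;
# a misaligned one is pushed toward a higher loss.
aligned = anchored_loss(2.0, "a person cuts vegetables to cook dinner",
                        "a person cuts vegetables to cook dinner")
misaligned = anchored_loss(2.0, "a person cuts vegetables to cook dinner",
                           "a dog runs on the beach")
```

In the actual approach the penalty term would be backpropagated through the caption generator (e.g. via a reward or soft relaxation), which this scalar sketch does not attempt to show.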

Original language: English (US)
Title of host publication: 2022 26th International Conference on Pattern Recognition, ICPR 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 8
ISBN (Electronic): 9781665490627
State: Published - 2022
Event: 26th International Conference on Pattern Recognition, ICPR 2022 - Montreal, Canada
Duration: Aug 21 2022 - Aug 25 2022

Publication series

Name: Proceedings - International Conference on Pattern Recognition
ISSN (Print): 1051-4651


Conference: 26th International Conference on Pattern Recognition, ICPR 2022

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition


