CAVAN: Commonsense Knowledge Anchored Video Captioning

Huiliang Shao; Zhiyuan Fang; Yezhou Yang

doi:10.1109/ICPR56361.2022.9956241

Ira A. Fulton Schools of Engineering (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

A video clip carries not merely an aggregation of static entities but also a variety of interactions and relations among those entities. It remains challenging for a video captioning system to generate descriptions that focus on the prominent points of interest and align with latent aspects beyond what is directly observed. In this work, we present a Commonsense knowledge Anchored Video cAptioNing (CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, we acquire commonsense knowledge that complements each training caption by querying a generic knowledge atlas (ATOMIC [1]) and form a commonsense-caption entailment corpus. A BERT [2]-based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model and penalizes the model for generating semantically misaligned captions. Experimental results with ablations on the MSRVTT [3], V2C [4], and VATEX [5] datasets validate the effectiveness of CAVAN and show that commonsense knowledge benefits video caption generation.
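The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy knowledge table stands in for ATOMIC, the negative-sampling rule (borrowing an inference from a different caption) and the additive penalty with a `weight` parameter are assumptions made for illustration, and all names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical stand-in for a commonsense knowledge source.
# The real ATOMIC resource is not queried here.
TOY_KNOWLEDGE: Dict[str, List[str]] = {
    "a man is cooking pasta": ["the man wants to eat dinner"],
    "a girl is riding a horse": ["the girl wants to have fun"],
}


def build_entailment_corpus(
    captions: List[str],
    knowledge: Dict[str, List[str]],
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Build (caption, inference, label) triples.

    Label 1: the inference was retrieved for this caption.
    Label 0: the inference was borrowed from another caption
             (an assumed way to create non-entailed pairs).
    """
    rng = random.Random(seed)
    corpus: List[Tuple[str, str, int]] = []
    for caption in captions:
        for inference in knowledge.get(caption, []):
            corpus.append((caption, inference, 1))
        others = [
            inf
            for other in captions
            if other != caption
            for inf in knowledge.get(other, [])
        ]
        if others:
            corpus.append((caption, rng.choice(others), 0))
    return corpus


def penalized_loss(caption_loss: float, entail_prob: float, weight: float = 1.0) -> float:
    """Add a penalty that grows as the discriminator's entailment
    probability for the generated caption drops (assumed additive form)."""
    return caption_loss + weight * (1.0 - entail_prob)


corpus = build_entailment_corpus(list(TOY_KNOWLEDGE), TOY_KNOWLEDGE)
```

In a real system the entailment probability would come from a trained entailment classifier rather than being passed in by hand; the sketch only shows how such a score could enter the captioning objective.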

Original language	English (US)
Title of host publication	2022 26th International Conference on Pattern Recognition, ICPR 2022
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	4095-4102
Number of pages	8
ISBN (Electronic)	9781665490627
DOIs	https://doi.org/10.1109/ICPR56361.2022.9956241
State	Published - 2022
Event	26th International Conference on Pattern Recognition, ICPR 2022 - Montreal, Canada; Duration: Aug 21 2022 → Aug 25 2022

Publication series

Name	Proceedings - International Conference on Pattern Recognition
Volume	2022-August
ISSN (Print)	1051-4651

Conference

Conference	26th International Conference on Pattern Recognition, ICPR 2022
Country/Territory	Canada
City	Montreal
Period	8/21/22 → 8/25/22

ASJC Scopus subject areas

Computer Vision and Pattern Recognition

Cite this

CAVAN: Commonsense Knowledge Anchored Video Captioning. / Shao, Huiliang; Fang, Zhiyuan; Yang, Yezhou.
2022 26th International Conference on Pattern Recognition, ICPR 2022. Institute of Electrical and Electronics Engineers Inc., 2022. p. 4095-4102 (Proceedings - International Conference on Pattern Recognition; Vol. 2022-August).

Shao, H, Fang, Z & Yang, Y 2022, CAVAN: Commonsense Knowledge Anchored Video Captioning. in 2022 26th International Conference on Pattern Recognition, ICPR 2022. Proceedings - International Conference on Pattern Recognition, vol. 2022-August, Institute of Electrical and Electronics Engineers Inc., pp. 4095-4102, 26th International Conference on Pattern Recognition, ICPR 2022, Montreal, Canada, 8/21/22. https://doi.org/10.1109/ICPR56361.2022.9956241

@inproceedings{dba69354a5ce479f855de4b502a863a1,

title = "CAVAN: Commonsense Knowledge Anchored Video Captioning",

abstract = "A video clip carries not merely an aggregation of static entities but also a variety of interactions and relations among those entities. It remains challenging for a video captioning system to generate descriptions that focus on the prominent points of interest and align with latent aspects beyond what is directly observed. In this work, we present a Commonsense knowledge Anchored Video cAptioNing (CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, we acquire commonsense knowledge that complements each training caption by querying a generic knowledge atlas (ATOMIC [1]) and form a commonsense-caption entailment corpus. A BERT [2]-based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model and penalizes the model for generating semantically misaligned captions. Experimental results with ablations on the MSRVTT [3], V2C [4], and VATEX [5] datasets validate the effectiveness of CAVAN and show that commonsense knowledge benefits video caption generation.",

author = "Huiliang Shao and Zhiyuan Fang and Yezhou Yang",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 26th International Conference on Pattern Recognition, ICPR 2022 ; Conference date: 21-08-2022 Through 25-08-2022",

year = "2022",

doi = "10.1109/ICPR56361.2022.9956241",

language = "English (US)",

series = "Proceedings - International Conference on Pattern Recognition",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "4095--4102",

booktitle = "2022 26th International Conference on Pattern Recognition, ICPR 2022",

}

TY - GEN

T1 - CAVAN: Commonsense Knowledge Anchored Video Captioning

T2 - 26th International Conference on Pattern Recognition, ICPR 2022

AU - Shao, Huiliang

AU - Fang, Zhiyuan

AU - Yang, Yezhou

PY - 2022

Y1 - 2022

N2 - A video clip carries not merely an aggregation of static entities but also a variety of interactions and relations among those entities. It remains challenging for a video captioning system to generate descriptions that focus on the prominent points of interest and align with latent aspects beyond what is directly observed. In this work, we present a Commonsense knowledge Anchored Video cAptioNing (CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, we acquire commonsense knowledge that complements each training caption by querying a generic knowledge atlas (ATOMIC [1]) and form a commonsense-caption entailment corpus. A BERT [2]-based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model and penalizes the model for generating semantically misaligned captions. Experimental results with ablations on the MSRVTT [3], V2C [4], and VATEX [5] datasets validate the effectiveness of CAVAN and show that commonsense knowledge benefits video caption generation.

UR - http://www.scopus.com/inward/record.url?scp=85143595985&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85143595985&partnerID=8YFLogxK

U2 - 10.1109/ICPR56361.2022.9956241

DO - 10.1109/ICPR56361.2022.9956241

M3 - Conference contribution

AN - SCOPUS:85143595985

T3 - Proceedings - International Conference on Pattern Recognition

SP - 4095

EP - 4102

BT - 2022 26th International Conference on Pattern Recognition, ICPR 2022

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 21 August 2022 through 25 August 2022

ER -
