Knowledge Distillation Across Vision and Language

Zhiyuan Fang, Yezhou Yang

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Recent years have witnessed the rapid development of Vision and Language (VL) learning with deep neural architectures, benefiting from large amounts of unlabeled or weakly labeled data in heterogeneous forms. A notable challenge arises when deploying these cross-modal models on edge devices, which typically have limited computational power. It is impractical for real-world applications to exploit the power of prevailing models under a constrained training/inference budget, especially when dealing with abundant multi-modal data that requires significant resources. As an important technique for deep model compression, Knowledge Distillation (KD) has been widely applied to various tasks with the aim of building a small yet powerful student model from a large teacher model. To obtain compact Vision-Language models that are practical for real-world deployment, it is therefore essential to exploit knowledge distillation techniques for VL learning. The combination of KD and VL is of great practical value but remains under-explored in prior literature. It motivates the study of KD across different modalities (vision, language) and learning paradigms (self-supervised learning, contrastive learning), and it plays a vital role in many sub-disciplines of cross-modal learning, such as image/video captioning, visual question answering, and image/video retrieval. This chapter summarizes and discusses this emerging area of research in depth. We first review existing KD algorithms comprehensively, focusing on their applications to Vision-Language tasks as well as recent works that leverage self-distillation techniques for training joint VL representations. The chapter also features recent work at the forefront of research on this topic that leverages KD for Vision-Language representation learning and its application to downstream VL tasks such as captioning and visual question answering. The chapter further discusses and studies variations of KD applied to Vision and Language-related topics.
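
To make the teacher-student setup mentioned in the abstract concrete, the following is a minimal sketch of classic response-based knowledge distillation, written in standard PyTorch. The temperature T, the weighting factor alpha, and the teacher/student model names are illustrative assumptions for this sketch, not specifics drawn from the chapter.

# Minimal sketch of response-based knowledge distillation.
# T, alpha, and the model names below are illustrative assumptions,
# not details taken from the chapter itself.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine a soft-target KL term (teacher -> student) with the usual hard-label loss."""
    # Soft targets: the student matches the teacher's temperature-smoothed distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage sketch: the large teacher is frozen; only the compact student is updated.
# teacher.eval()
# with torch.no_grad():
#     t_logits = teacher(images, text)
# s_logits = student(images, text)
# loss = distillation_loss(s_logits, t_logits, labels)
# loss.backward()

In a Vision-Language setting the same objective can be applied to the outputs of a large cross-modal teacher and a compact student, with the choice of which representations to match (logits, features, or attention maps) being one of the design variations the chapter surveys.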

Original language: English (US)
Title of host publication: Studies in Computational Intelligence
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 65-94
Number of pages: 30
DOIs
State: Published - 2023

Publication series

Name: Studies in Computational Intelligence
Volume: 1100
ISSN (Print): 1860-949X
ISSN (Electronic): 1860-9503

Keywords

  • Knowledge distillation
  • Representation learning
  • Vision and language learning

ASJC Scopus subject areas

  • Artificial Intelligence
