TY - CHAP
T1 - Knowledge Distillation Across Vision and Language
AU - Fang, Zhiyuan
AU - Yang, Yezhou
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - Recent years have witnessed the fast development of Vision and Language (VL) learning with deep neural architectures, benefiting from large amounts of unlabeled or weakly labeled data in heterogeneous forms. A notable challenge arises when deploying these cross-modal models on edge devices, which usually have limited computational power. It is impractical for real-world applications to exploit the power of prevailing models under a constrained training/inference budget, especially when dealing with abundant multi-modal data that requires significant resources. As an important technique for deep model compression, Knowledge Distillation (KD) has been widely applied to various tasks, aiming to build a small yet powerful student model from a large teacher model. To obtain compact Vision-Language models that are practical for real-world use, it is essential to exploit knowledge distillation techniques to improve VL learning on compact models. The combination of KD and VL is of great practical value but has been less explored in the previous literature. This motivates the study of KD across modalities (vision, language) and learning tasks (self-supervised learning, contrastive learning), and KD also serves a vital role in many sub-disciplines of cross-modal tasks, such as image/video captioning, visual question answering, and image/video retrieval. This chapter summarizes and discusses this emerging area of research in depth. We first review existing KD algorithms comprehensively, focusing on their applications to Vision-Language tasks as well as recent works that leverage the self-distillation technique for training joint VL representations. The chapter also features recent works at the forefront of research on this topic, leveraging KD for Vision-Language representation learning as well as its application to downstream VL tasks such as captioning, visual question answering, and so forth. The chapter further discusses and studies variations of KD applied to Vision- and Language-related topics.
AB - Recent years have witnessed the fast development of Vision and Language (VL) learning with deep neural architectures, benefiting from large amounts of unlabeled or weakly labeled data in heterogeneous forms. A notable challenge arises when deploying these cross-modal models on edge devices, which usually have limited computational power. It is impractical for real-world applications to exploit the power of prevailing models under a constrained training/inference budget, especially when dealing with abundant multi-modal data that requires significant resources. As an important technique for deep model compression, Knowledge Distillation (KD) has been widely applied to various tasks, aiming to build a small yet powerful student model from a large teacher model. To obtain compact Vision-Language models that are practical for real-world use, it is essential to exploit knowledge distillation techniques to improve VL learning on compact models. The combination of KD and VL is of great practical value but has been less explored in the previous literature. This motivates the study of KD across modalities (vision, language) and learning tasks (self-supervised learning, contrastive learning), and KD also serves a vital role in many sub-disciplines of cross-modal tasks, such as image/video captioning, visual question answering, and image/video retrieval. This chapter summarizes and discusses this emerging area of research in depth. We first review existing KD algorithms comprehensively, focusing on their applications to Vision-Language tasks as well as recent works that leverage the self-distillation technique for training joint VL representations. The chapter also features recent works at the forefront of research on this topic, leveraging KD for Vision-Language representation learning as well as its application to downstream VL tasks such as captioning, visual question answering, and so forth. The chapter further discusses and studies variations of KD applied to Vision- and Language-related topics.
KW - Knowledge distillation
KW - Representation learning
KW - Vision and language learning
UR - http://www.scopus.com/inward/record.url?scp=85163221409&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85163221409&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-32095-8_3
DO - 10.1007/978-3-031-32095-8_3
M3 - Chapter
AN - SCOPUS:85163221409
T3 - Studies in Computational Intelligence
SP - 65
EP - 94
BT - Studies in Computational Intelligence
PB - Springer Science and Business Media Deutschland GmbH
ER -