TY - CHAP
T1 - Knowledge Distillation Across Vision and Language
AU - Fang, Zhiyuan
AU - Yang, Yezhou
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - Recent years have witnessed the fast development of Vision and Language (VL) learning with deep neural architectures, benefiting from large amounts of unlabeled or weakly labeled data in heterogeneous forms. A notable challenge arises when deploying these cross-modal models on edge devices, which usually have limited computational power. It is impractical for real-world applications to exploit the power of prevailing models under a constrained training/inference budget, especially when dealing with abundant multi-modal data that requires significant resources. As an important technique for deep model compression, Knowledge Distillation (KD) has been widely applied to various tasks, aiming to build a small yet powerful student model from a large teacher model. To obtain compact Vision-Language models that are practical for real-world use, it is essential to exploit knowledge distillation techniques to improve VL learning on compact models. The combination of KD and VL is of great practical value but has been less explored in the previous literature. This motivates the study of KD across modalities (vision, language) and learning tasks (self-supervised learning, contrastive learning), and KD also serves a vital role in many sub-disciplines of cross-modal tasks, such as image/video captioning, visual question answering, and image/video retrieval. This chapter summarizes and discusses this emerging area of research in depth. We first review existing KD algorithms comprehensively, focusing on their applications to Vision-Language tasks as well as recent works that leverage the self-distillation technique for training joint VL representations. The chapter also features recent works at the forefront of research on this topic, leveraging KD for Vision-Language representation learning as well as its application to downstream VL tasks such as captioning, visual question answering, and so forth. The chapter further discusses and studies variations of KD applied to Vision- and Language-related topics.
AB - Recent years have witnessed the fast development of Vision and Language (VL) learning with deep neural architectures, benefiting from large amounts of unlabeled or weakly labeled data in heterogeneous forms. A notable challenge arises when deploying these cross-modal models on edge devices, which usually have limited computational power. It is impractical for real-world applications to exploit the power of prevailing models under a constrained training/inference budget, especially when dealing with abundant multi-modal data that requires significant resources. As an important technique for deep model compression, Knowledge Distillation (KD) has been widely applied to various tasks, aiming to build a small yet powerful student model from a large teacher model. To obtain compact Vision-Language models that are practical for real-world use, it is essential to exploit knowledge distillation techniques to improve VL learning on compact models. The combination of KD and VL is of great practical value but has been less explored in the previous literature. This motivates the study of KD across modalities (vision, language) and learning tasks (self-supervised learning, contrastive learning), and KD also serves a vital role in many sub-disciplines of cross-modal tasks, such as image/video captioning, visual question answering, and image/video retrieval. This chapter summarizes and discusses this emerging area of research in depth. We first review existing KD algorithms comprehensively, focusing on their applications to Vision-Language tasks as well as recent works that leverage the self-distillation technique for training joint VL representations. The chapter also features recent works at the forefront of research on this topic, leveraging KD for Vision-Language representation learning as well as its application to downstream VL tasks such as captioning, visual question answering, and so forth. The chapter further discusses and studies variations of KD applied to Vision- and Language-related topics.
KW - Knowledge distillation
KW - Representation learning
KW - Vision and language learning
UR - http://www.scopus.com/inward/record.url?scp=85163221409&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85163221409&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-32095-8_3
DO - 10.1007/978-3-031-32095-8_3
M3 - Chapter
AN - SCOPUS:85163221409
T3 - Studies in Computational Intelligence
SP - 65
EP - 94
BT - Studies in Computational Intelligence
PB - Springer Science and Business Media Deutschland GmbH
ER -