Knowledge Distillation (KD) lets a large model guide the learning of a small one, but the capacity gap between the two hinders knowledge transfer. Prior work has focused on training paradigms (early stopping [1]) and architectural adjustments (teacher assistants [2]), which either bring limited performance gains or incur excessive training cost. This paper instead takes the perspective of training-free neural architecture search (NAS) and searches for the student architecture that best fits the teacher, thereby alleviating the capacity gap. Training-free NAS is also referred to as a zero-cost proxy, a term that will reappear later in this article.

Paper title: DisWOT: Student Architecture Search for Distillation WithOut Training
Paper link:
https://arxiv.org/pdf/2303.15678.pdf
Code link:
https://github.com/lilujunai/DisWOT-CVPR2023
Method: DisWOT
The goal of DisWOT is, for a given teacher model and a set of constraints, to find the student architecture that best matches that teacher via training-free NAS; knowledge is then transferred to this student with a standard KD paradigm, making the transfer more efficient and improving the student's performance.
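To make this two-stage recipe concrete, below is a minimal sketch of a search-then-distill loop in PyTorch. The helper names `sample_candidates`, `score_without_training`, and `distill`, as well as the candidate count, are illustrative assumptions rather than the paper's actual interface.

```python
# A minimal sketch of a DisWOT-style two-stage pipeline, under assumed helpers:
# candidates are only scored at random initialization; full training (with KD)
# is spent on the single best-ranked student.
import torch
import torch.nn as nn

def search_then_distill(teacher: nn.Module,
                        sample_candidates,       # yields randomly initialized student nets (assumed)
                        score_without_training,  # training-free metric: higher = better fit (assumed)
                        distill,                 # any standard KD training routine (assumed)
                        num_candidates: int = 100):
    """Stage 1: rank untrained students with a training-free score.
    Stage 2: train only the top-ranked student via knowledge distillation."""
    teacher.eval()
    best_score, best_student = float("-inf"), None
    with torch.no_grad():
        for student in sample_candidates(num_candidates):
            score = score_without_training(teacher, student)
            if score > best_score:
                best_score, best_student = score, student
    # Only the selected architecture incurs the full training cost.
    return distill(teacher, best_student)
```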
2.1 Searching for the Optimal Student Network
Training-free NAS has three key ingredients: the scoring metric, the search space, and the search strategy. Figure 1 illustrates this stage: Figure 1(a) depicts the paper's two scoring metrics, and Figure 1(b) depicts the chosen search space and search strategy. This subsection walks through each of them in turn.
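As a concrete illustration of what a training-free teacher-student metric can look like, the sketch below scores a candidate student by comparing batch-wise sample-correlation matrices of teacher and student features at random initialization, in the spirit of relational KD [6]. The Frobenius-distance formulation and the assumed feature-map interface are illustrative, not the paper's exact definition of its two metrics.

```python
# A hedged sketch of a relation-style, training-free similarity score, assuming
# both networks are randomly initialized and we can grab an intermediate
# feature map for one mini-batch from each of them.
import torch
import torch.nn.functional as F

def sample_relation(feats: torch.Tensor) -> torch.Tensor:
    """Build a (batch x batch) sample-correlation matrix from flattened features."""
    flat = F.normalize(feats.flatten(start_dim=1), dim=1)
    return flat @ flat.t()

@torch.no_grad()
def relation_similarity(teacher_feats: torch.Tensor,
                        student_feats: torch.Tensor) -> float:
    """Higher is better: teacher and student induce similar sample relations."""
    rt = sample_relation(teacher_feats)
    rs = sample_relation(student_feats)
    return -torch.norm(rt - rs, p="fro").item()

# Assumed usage: feed one mini-batch through both untrained networks, hook the
# intermediate feature maps, and score the candidate student:
# score = relation_similarity(teacher_feature_map, student_feature_map)
```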
[1] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In ICCV, 2019.
[2] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In AAAI, 2020.
[3] Sihao Lin, Hongwei Xie, Bing Wang, Kaicheng Yu, Xiaojun Chang, Xiaodan Liang, and Gang Wang. Knowledge distillation via the target-aware transformer. In CVPR, 2022.
[4] Yun-Hao Cao and Jianxin Wu. A random CNN sees objects: One inductive bias of CNN and its applications. In AAAI, 2022.
[5] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
[6] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, 2019.