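Existing multi-label classification methods cannot effectively recognize the large number of unseen labels that never appear in the training set. To address this, Tencent YouTu Lab proposes MKT, a general open-vocabulary multi-label learning framework that transfers multi-modal knowledge. The work transfers the strong image-text matching ability of an image-text pre-trained model, introduces prompt learning and knowledge distillation to optimize the label embeddings and improve the consistency between image and label embeddings, and uses a two-stream module to capture local and global features at the same time, which improves the model's multi-label recognition ability. Experiments on two public datasets, NUS-WIDE and Open Images, show that the method effectively achieves open-vocabulary multi-label learning.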
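The idea of matching image features against label embeddings through a global and a local stream can be illustrated with a minimal sketch. It is not the authors' implementation: the helper `score_labels`, the top-k mean pooling over patch scores, the equal weighting of the two streams, and the toy 2-D vectors are all assumptions made for illustration. In a real system the label embeddings would come from the text encoder of an image-text pre-trained model, and the image features from its vision encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_labels(global_feat, patch_feats, label_embs, top_k=2):
    """Score every label for one image (hypothetical helper).

    global_feat: one vector summarizing the whole image (global stream).
    patch_feats: list of per-patch vectors (local stream).
    label_embs:  dict mapping label name -> label embedding.
    """
    scores = {}
    for label, emb in label_embs.items():
        global_score = cosine(global_feat, emb)
        # Assumption: pool the local stream by averaging the top-k patch scores.
        patch_scores = sorted((cosine(p, emb) for p in patch_feats), reverse=True)
        k = min(top_k, len(patch_scores))
        local_score = sum(patch_scores[:k]) / k if k else global_score
        # Assumption: the two streams are weighted equally.
        scores[label] = 0.5 * (global_score + local_score)
    return scores

# Toy data: "beach" stands in for a label never seen during training.
# Its embedding only has to exist at inference time.
label_embs = {"dog": [1.0, 0.0], "beach": [0.0, 1.0]}
global_feat = [1.0, 0.2]
patch_feats = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.2]]

scores = score_labels(global_feat, patch_feats, label_embs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because labels enter only through their embeddings, a new label can be scored by adding one entry to `label_embs`, with no change to the scoring code. In this toy example the local stream lifts the score of "beach" above what the global feature alone would give, since one patch matches it well.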