In a paper accepted at NeurIPS 2022, Sun Yat-sen University and Meta AI study dropout applied to attention and find that different attention positions contribute unevenly to overfitting: dropping the wrong positions can even accelerate overfitting. Based on this finding, we propose Attribution-Driven Dropout (AD-DROP), which selects high-attribution positions to drop, making dropout more targeted.

Paper title: AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Paper link: https://arxiv.org/abs/2210.05883
Code link: https://github.com/TaoYang225/AD-DROP
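The core idea above can be sketched as follows: score each attention position by a gradient-based attribution, mark the highest-scoring fraction as drop candidates, and drop each candidate with some probability. This is a minimal NumPy sketch, not the paper's implementation; the function name, the element-wise attention-times-gradient attribution, and the two knobs (candidate fraction `q`, drop probability `p`) are illustrative assumptions.

```python
import numpy as np

def ad_drop_mask(attention, grad, p=0.3, q=0.3, rng=None):
    """Sketch of an attribution-driven dropout mask (hypothetical helper).

    attention: (seq, seq) attention map for one head
    grad:      (seq, seq) gradient of the prediction w.r.t. that map
    q:         fraction of highest-attribution positions treated as candidates
    p:         probability of actually dropping each candidate
    Returns a 0/1 mask of the same shape to multiply into the attention map.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gradient-based attribution: element-wise product of attention and gradient.
    attribution = attention * grad
    # Mark the top-q fraction of positions as drop candidates.
    flat = attribution.ravel()
    k = int(q * flat.size)
    candidates = np.zeros(flat.size, dtype=bool)
    if k > 0:
        candidates[np.argsort(flat)[-k:]] = True
    # Drop each candidate independently with probability p; keep the rest.
    dropped = candidates & (rng.random(flat.size) < p)
    return (~dropped).astype(attention.dtype).reshape(attention.shape)
```

In practice such a mask would be applied to the attention logits or weights during fine-tuning only, with standard attention used at inference time.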
Effect of cross-tuning: within the search range [0.1, 0.9] for the two key hyperparameters p and q, we examine the distribution of validation-set results and find that cross-tuning significantly improves the effectiveness of AD-DROP across different hyperparameter settings.
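Cross-tuning alternates epochs of vanilla fine-tuning with epochs that apply AD-DROP. A minimal scheduling sketch follows; the `train_epoch` callback and the odd/even alternation are illustrative assumptions rather than the paper's exact schedule.

```python
def cross_tuning(num_epochs, train_epoch):
    """Alternate vanilla fine-tuning and AD-DROP epochs (cross-tuning sketch).

    train_epoch: callback taking a flag; True means train this epoch
                 with AD-DROP applied, False means vanilla fine-tuning.
    """
    for epoch in range(num_epochs):
        use_ad_drop = (epoch % 2 == 1)  # assumed: AD-DROP on odd epochs
        train_epoch(use_ad_drop)
```

The alternation keeps some epochs of ordinary training in the loop, which is one plausible reason the method becomes less sensitive to the choice of p and q.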
Hyperparameter sensitivity: within the search range, we normalize the gap between AD-DROP and vanilla fine-tuning and visualize it. AD-DROP turns out to be insensitive to the two introduced hyperparameters on BERT, working in most settings, whereas on RoBERTa it requires careful hyperparameter search. We conjecture this is because RoBERTa is pretrained more effectively and is therefore less prone to overfitting.
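The normalization step behind this visualization can be sketched as: take the grid of AD-DROP dev scores minus the vanilla fine-tuning baseline, then scale the differences into [-1, 1]. Scaling by the maximum absolute difference is an assumption for illustration; the paper may normalize differently.

```python
import numpy as np

def normalized_delta(ad_drop_scores, baseline):
    """Scale (AD-DROP - baseline) score gaps into [-1, 1] for a heatmap.

    ad_drop_scores: grid of dev-set scores over the (p, q) search range
    baseline:       the vanilla fine-tuning score (scalar)
    """
    delta = np.asarray(ad_drop_scores, dtype=float) - baseline
    m = np.abs(delta).max()
    # Divide by the largest absolute gap (assumed normalization scheme).
    return delta / m if m > 0 else delta
```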
In addition, we ran repetition experiments and studied the effect of training-data size, few-shot settings, larger models, and more. We will not go through each of these here; interested readers can refer to the original paper for details.