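As the target model Z is repeatedly optimized, we also keep adjusting the value function V so that its scoring of states becomes more and more accurate. Adjusting V is comparatively simple: its current estimate is pulled toward the new estimate by reducing the squared (L2) loss between the two. Here the correction term indicates how much the current sample raises or lowers that estimate. As Figure 13 shows, the target model Z and the value function V are optimized iteratively, one after the other, until they converge to a stable state.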
▲ Figure 13. Schematic of the iterative optimization of the target model Z and the value function V
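The value-function update described above can be illustrated with a minimal sketch. It assumes a tabular value function stored in a dictionary and a fixed set of new estimates. In practice V would typically be a neural network, and the new estimates would be recomputed from fresh samples as Z changes. The names `update_value_function`, `squared_loss`, and the state labels are hypothetical and do not come from the article.

```python
# Minimal sketch (assumptions: tabular V, fixed targets, plain gradient descent).
# It shows V being pulled toward new estimates by reducing the squared loss.

def update_value_function(values, targets, learning_rate=0.1):
    """Take one gradient step on 0.5 * (V(s) - target(s))**2 for each state."""
    updated = dict(values)
    for state, target in targets.items():
        current = updated.get(state, 0.0)
        # The gradient of 0.5 * (current - target)**2 w.r.t. current
        # is (current - target), so we step in the opposite direction.
        updated[state] = current - learning_rate * (current - target)
    return updated


def squared_loss(values, targets):
    """Mean squared difference between V(s) and its new estimate."""
    total = sum((values.get(s, 0.0) - t) ** 2 for s, t in targets.items())
    return total / len(targets)


# Hypothetical states and new estimates, chosen only for illustration.
values = {"s0": 0.0, "s1": 0.0}
targets = {"s0": 1.0, "s1": -0.5}

loss_before = squared_loss(values, targets)
for _ in range(50):
    values = update_value_function(values, targets)
loss_after = squared_loss(values, targets)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.6f}")
```

Each step shrinks the gap between V(s) and its target by a constant factor, so the loss falls steadily. In the real procedure the targets move as well, which is why Z and V have to be optimized in alternation.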
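The sections above covered the three main training steps of ChatGPT: language modeling, prompt tuning, and reinforcement learning. The third step, reinforcement learning, is comparatively hard to carry out. Reinforcement learning is itself sensitive to initialization and hyperparameter settings, and the process usually also requires several rounds of human annotation. The reason is that after the target model Z has been adjusted and optimized many times, the distribution of the content it generates has shifted, so the reward model obtained earlier is no longer suitable for evaluating that content. This step therefore still leaves considerable room for improvement.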
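About the author
Xiaoqing Zheng (郑骁庆)
Associate professor and doctoral advisor at the School of Computer Science, Fudan University; International Faculty Fellow at the Massachusetts Institute of Technology; visiting scholar at the University of California, Los Angeles. Main research interests are natural language processing and machine learning. Author of more than 50 papers in top international conferences and journals in natural language processing and artificial intelligence, including Computational Linguistics, NeurIPS, ICLR, ACL, AAAI, IJCAI, EMNLP, WWW, and T-ASL.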