To simplify the derivations that follow, we introduce two important quantities. To evaluate how good a state is overall, we introduce the state value function $V^{\pi}(s)$, defined as the expected cumulative future reward starting from state $s$; the larger this expectation, the more favorable the current state. We also introduce the state-action value function $Q^{\pi}(s,a)$, defined as the expected cumulative future reward after taking action $a$ in state $s$. The two quantities are related by $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot\mid s)}\left[Q^{\pi}(s,a)\right]$ and $Q^{\pi}(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\left[V^{\pi}(s')\right]$.
For the definitions of the Q and V values and the conversions between them, see Figure 2.
Figure 2: Definitions of the Q and V values and the conversions between them [6]
Accordingly, the optimization objective of the reinforcement learning model can be written as $\max_{\pi} \mathbb{E}\left[V^{\pi}(s_0)\right]$, where $s_0$ denotes the agent's initial state.
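As a concrete illustration of these relations, the following Python snippet is a minimal sketch (not part of the original report; the two-state MDP, the policy table, and the start state s0 are assumptions made purely for illustration) that evaluates $Q^{\pi}$ and $V^{\pi}$ for a fixed policy by iterating the two equations above, then reads off the objective at the start state.

    import numpy as np

    # Toy MDP (illustrative assumption): 2 states, 2 actions.
    # P[s, a, s'] = transition probability, R[s, a] = expected reward.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])
    R = np.array([[1.0, 0.0],
                  [0.5, 2.0]])
    pi = np.array([[0.7, 0.3],   # pi[s, a] = probability of action a in state s
                   [0.4, 0.6]])
    gamma = 0.9

    V = np.zeros(2)
    for _ in range(1000):             # iterative policy evaluation
        Q = R + gamma * P @ V         # Q(s,a) = R(s,a) + gamma * E_{s'}[V(s')]
        V = (pi * Q).sum(axis=1)      # V(s)   = E_{a~pi}[Q(s,a)]

    s0 = 0                            # assumed start state
    print("V:", V, "Q:", Q, "objective J =", V[s0])

After convergence, the printed J equals $V^{\pi}(s_0)$, which is exactly the quantity that the policy-based methods discussed later try to maximize.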
2.2 Model-Free vs Model-Based, Value-Based vs Policy-Based, Off-Policy vs On-Policy
“Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.”
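In the notation introduced above, this concept is made precise as follows (the formula follows the standard definition used in [5]; it is supplied here for completeness rather than recovered from the original text):

$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$

A positive advantage means that action $a$ is better than the policy's average behavior in state $s$, and a negative advantage means it is worse.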
3.7 Paper Summary

Building on TRPO, the PPO algorithm proposes two improved objectives, the clipped surrogate objective $L^{CLIP}$ and the adaptive KL-penalty objective $L^{KLPEN}$, and demonstrates their effectiveness through extensive experiments. Both variants target the main drawbacks of TRPO's constraint term: it restricts the size of each update step, makes the optimization cumbersome to solve, and makes the algorithm difficult to implement. Both turn the constraint into a penalty, eliminating the need to solve a constrained optimization problem. To keep the update step from becoming too large, $L^{CLIP}$ clips the probability ratio to the range $[1-\epsilon, 1+\epsilon]$ (a minimal sketch of this clipped objective is given after the reference list), while $L^{KLPEN}$ adaptively adjusts the penalty coefficient, achieving a similar step-limiting effect. Compared with TRPO, both handle the constraint in a simpler and more elegant way, are easier to implement, and deliver better performance.

4 Summary and Reflections

Since its introduction in 2017, PPO, thanks to its relatively simple implementation and strong empirical performance, has appeared extremely frequently in OpenAI's reinforcement learning work and has proven itself in practical applications such as games and robot control; it remains one of the most advanced policy-based reinforcement learning algorithms to date. Even though OpenAI has disbanded its robotics research group, PPO has unexpectedly flourished in NLP: its combination with GPT-3 produced the astonishing ChatGPT. This leads the author to ask what, under the RLHF (Reinforcement Learning from Human Feedback) paradigm, makes ChatGPT so powerful: the high-quality corpus used as human feedback, the policy optimization embodied in PPO, or the still under-exploited potential of a large model like GPT-3?

At present, high-quality data is increasingly becoming the exclusive asset of commercial companies, and China's large-model research lags behind that of the United States: models often lack follow-up maintenance after release, long-term commitment to a distinctive technical roadmap is rare, and research diversity receives too little attention. Meanwhile, several top research groups abroad have formed close ties with large companies such as OpenAI and hold a significant advantage in related research, and the arrival of ChatGPT has already had an enormous impact on many areas of NLP research. Facing U.S. restrictions on large models, China lacks large models led by commercial companies, so domestic research groups can only look for alternative paths; yet going from a large model to a result like ChatGPT requires long technical accumulation, and without immediate action the gap between China and the United States in large models may well continue to widen.

References

[1] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.
[2] OpenAI. Learning dexterity: a human-like robot hand to manipulate physical objects with unprecedented dexterity. Web page, 2018. Last accessed December 23, 2022.
[3] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[4] OpenAI. OpenAI Five defeats Dota 2 world champions. Web page, 2018. Last accessed December 23, 2022.
[5] OpenAI. OpenAI Spinning Up. Web page, 2018. Last accessed December 23, 2022.
[6] CSDN. 【强化学习笔记】强化学习中的 V 值和 Q 值 (in Chinese). Web page, 2022. Last accessed December 23, 2022.
[7] David Silver. Lectures on reinforcement learning. URL: https://www.davidsilver.uk/teaching/, 2015.
[8] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR, abs/2005.01643, 2020.
[9] OpenAI. OpenAI implementation code. Web page, 2018. Last accessed December 23, 2022.
[10] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889–1897, Lille, France, July 2015. PMLR.
[11] Lucian Buşoniu, Damien Ernst, Bart De Schutter, and Robert Babuška. Approximate reinforcement learning: An overview. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–8. IEEE, 2011.
[12] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
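Supplementary sketch for Section 3.7: the snippet below is a minimal NumPy reconstruction of the clipped surrogate objective $L^{CLIP}$ described there (the function name, the batch arrays, and the value $\epsilon = 0.2$ are illustrative assumptions, not code taken from the PPO paper or from this report).

    import numpy as np

    def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
        """Clipped surrogate objective L^CLIP, averaged over sampled (s, a) pairs."""
        ratio = np.exp(logp_new - logp_old)              # r_t = pi_theta(a|s) / pi_theta_old(a|s)
        clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)
        return np.mean(np.minimum(ratio * advantages,    # take the pessimistic (lower) bound
                                  clipped * advantages))

    # Illustrative call with made-up batch data.
    logp_new = np.array([-0.9, -1.2, -0.3])
    logp_old = np.array([-1.0, -1.0, -0.5])
    adv      = np.array([ 0.5, -0.2,  1.0])
    print(ppo_clip_objective(logp_new, logp_old, adv))

Because the ratio is clipped to $[1-\epsilon, 1+\epsilon]$ and the minimum of the clipped and unclipped terms is taken, the objective gives the policy no incentive to move far from the old policy in a single update, which is the step-limiting behavior discussed in Section 3.7.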