
GDI: Rethinking What Makes Reinforcement Learning Different from Supervised Learning

Abstract

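Deep Q Network (DQN) first opened the door to deep reinforcement learning (DRL) by combining deep learning (DL) with reinforcement learning (RL). DQN observed that the distribution of the data acquired during training changes dynamically. It recognized that this property can make training unstable, and proposed a set of effective mechanisms to mitigate its negative effects. Rather than focusing on the unfavorable side of this property, we find that what is critical for RL is narrowing the gap between the estimated data distribution and the ground-truth data distribution, something supervised learning (SL) cannot do. From this new perspective, we extend the basic paradigm of RL, Generalized Policy Iteration (GPI), into a more general framework called Generalized Data Distribution Iteration (GDI). We find that a large number of existing RL algorithms and techniques can be unified under the GDI framework, and that GPI can be regarded as a special case of GDI. We provide theoretical proofs of why GDI is better than GPI and of how it works. Building on GDI, we further propose several practical algorithms to verify its effectiveness and broad applicability. Extensive empirical experiments show that our method achieves state-of-the-art performance on the Arcade Learning Environment (ALE): using only 200M training frames, it reaches a mean human normalized score (HNS) of 9620.98% and a median HNS of 1146.39%, and achieves 22 human world record breakthroughs (HWRB). This work aims to move RL research toward surpassing human limits, seeking truly superhuman agents in terms of both performance and efficiency.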
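The abstract reports both a mean and a median human normalized score. As a minimal sketch of how such aggregates are commonly computed, the snippet below uses the standard definition HNS = 100 × (agent − random) / (human − random) per game. The game names and all raw scores are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the paper's exact baselines and evaluation protocol are not shown on this page.

```python
from statistics import mean, median


def human_normalized_score(agent, random, human):
    """Return the human normalized score in percent.

    0% corresponds to the random-policy baseline and 100% to the
    human baseline, following the commonly used definition.
    """
    if human == random:
        raise ValueError("human and random baselines must differ")
    return 100.0 * (agent - random) / (human - random)


# Hypothetical (agent, random, human) raw scores, for illustration only.
scores = {
    "game_a": (150.0, 10.0, 110.0),
    "game_b": (20.0, 0.0, 40.0),
    "game_c": (1000.0, 100.0, 200.0),
}

# Per-game HNS, then the two aggregates reported in the abstract.
hns = {game: human_normalized_score(*s) for game, s in scores.items()}
mean_hns = mean(hns.values())
median_hns = median(hns.values())

for game, value in hns.items():
    print(f"{game}: {value:.2f}%")
print(f"mean HNS:   {mean_hns:.2f}%")
print(f"median HNS: {median_hns:.2f}%")
```

In this toy data a single game with a very high normalized score pulls the mean far above the median, which is one reason the two statistics are usually reported together.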

Benchmarks

| Benchmark | Method | Metric |
| --- | --- | --- |
| atari-games-on-atari-2600-beam-rider | GDI-I3 | Score: 162100 |
| atari-games-on-atari-2600-berzerk | GDI-I3 | Score: 7607 |
| atari-games-on-atari-2600-bowling | GDI-I3 | Score: 201.9 |
| atari-games-on-atari-2600-boxing | GDI-H3 | Score: 100 |
| atari-games-on-atari-2600-centipede | GDI-I3 | Score: 155830 |
| atari-games-on-atari-2600-chopper-command | GDI-H3 | Score: 999999 |
| atari-games-on-atari-2600-crazy-climber | GDI-I3 | Score: 201000 |
| atari-games-on-atari-2600-defender | GDI-I3 | Score: 893110 |
| atari-games-on-atari-2600-demon-attack | GDI-I3 | Score: 675530 |
| atari-games-on-atari-2600-double-dunk | GDI-H3 | Score: 24 |
| atari-games-on-atari-2600-enduro | GDI-I3 | Score: 14330 |
| atari-games-on-atari-2600-fishing-derby | GDI-I3 | Score: 59 |
| atari-games-on-atari-2600-freeway | GDI-I3 | Score: 34 |
| atari-games-on-atari-2600-frostbite | GDI-I3 | Score: 10485 |
| atari-games-on-atari-2600-gravitar | GDI-I3 | Score: 5905 |
| atari-games-on-atari-2600-hero | GDI-I3 | Score: 38330 |
| atari-games-on-atari-2600-ice-hockey | GDI-I3 | Score: 44.94 |
| atari-games-on-atari-2600-james-bond | GDI-I3 | Score: 594500 |
| atari-games-on-atari-2600-kangaroo | GDI-I3 | Score: 14500 |
| atari-games-on-atari-2600-krull | GDI-I3 | Score: 97575 |
| atari-games-on-atari-2600-montezumas-revenge | GDI-I3 | Score: 3000 |
| atari-games-on-atari-2600-ms-pacman | GDI-I3 | Score: 11536 |
| atari-games-on-atari-2600-name-this-game | GDI-I3 | Score: 34434 |
| atari-games-on-atari-2600-phoenix | GDI-I3 | Score: 894460 |
| atari-games-on-atari-2600-pitfall | GDI-I3 | Score: 0 |
| atari-games-on-atari-2600-private-eye | GDI-I3 | Score: 15100 |
| atari-games-on-atari-2600-qbert | GDI-I3 | Score: 27800 |
| atari-games-on-atari-2600-road-runner | GDI-I3 | Score: 878600 |
| atari-games-on-atari-2600-robotank | GDI-I3 | Score: 108.2 |
| atari-games-on-atari-2600-seaquest | GDI-I3 | Score: 943910 |
| atari-games-on-atari-2600-skiing | GDI-I3 | Score: -6774 |
| atari-games-on-atari-2600-solaris | GDI-I3 | Score: 11074 |
| atari-games-on-atari-2600-space-invaders | GDI-I3 | Score: 140460 |
| atari-games-on-atari-2600-star-gunner | GDI-I3 | Score: 465750 |
| atari-games-on-atari-2600-surround | GDI-I3 | Score: -7.8 |
| atari-games-on-atari-2600-tennis | GDI-I3 | Score: 24 |
| atari-games-on-atari-2600-time-pilot | GDI-I3 | Score: 216770 |
| atari-games-on-atari-2600-tutankham | GDI-I3 | Score: 423.9 |
| atari-games-on-atari-2600-up-and-down | GDI-I3 | Score: 986440 |
| atari-games-on-atari-57 | GDI-H3 (200M frames) | Human World Record Breakthrough: 22; Mean Human Normalized Score: 9620.98% |
| atari-games-on-atari-57 | GDI-H3 | - |
