4 months ago

Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization

Abstract

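As the most successful variant and improvement of Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO) has been widely applied across many domains thanks to its efficient data utilization, easy implementation, and good parallelism. This paper proposes a first-order gradient reinforcement learning algorithm called Policy Optimization with Penalized Point Probability Distance (POP3D), whose penalty term, the point probability distance, is a lower bound on the square of the total variation divergence. First, we discuss the shortcomings of several commonly used algorithms, which partly motivate our method. Second, we show how POP3D overcomes these shortcomings. Third, we examine its mechanism from the perspective of the solution manifold. Finally, we make quantitative comparisons among several state-of-the-art algorithms on common benchmarks. Simulation results show that POP3D is highly competitive with PPO. Our code is released at https://github.com/paperwithcode/pop3d.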
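To make the lower-bound claim in the abstract concrete, the following is a minimal sketch for discrete action spaces. It assumes the point probability distance is the squared difference between the probabilities that the old and new policies assign to the sampled action, and that the per-sample objective subtracts this distance, scaled by a coefficient, from the importance-weighted advantage. The function names and the coefficient `beta` are hypothetical choices for illustration and are not taken from the official repository.

```python
import random


def point_probability_distance(p_old, p_new, action):
    """Squared difference of the probabilities given to the sampled action.

    Assumed form of the penalty; see the lead-in for the caveat.
    """
    return (p_old[action] - p_new[action]) ** 2


def total_variation(p_old, p_new):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_old, p_new))


def penalized_surrogate(p_old, p_new, action, advantage, beta):
    """Importance-weighted advantage minus the scaled penalty.

    beta is a hypothetical penalty coefficient, not a value from the paper.
    """
    ratio = p_new[action] / p_old[action]
    penalty = point_probability_distance(p_old, p_new, action)
    return ratio * advantage - beta * penalty


def random_distribution(n, rng):
    """Draw a random probability vector with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def bound_holds(trials=1000, n_actions=4, seed=0):
    """Check numerically that the penalty never exceeds squared total variation."""
    rng = random.Random(seed)
    for _ in range(trials):
        p_old = random_distribution(n_actions, rng)
        p_new = random_distribution(n_actions, rng)
        tv_squared = total_variation(p_old, p_new) ** 2
        for action in range(n_actions):
            distance = point_probability_distance(p_old, p_new, action)
            if distance > tv_squared + 1e-12:
                return False
    return True
```

The bound holds because both distributions sum to one, so the change in probability of any single action cannot exceed the total mass moved between the two distributions, which is the total variation distance. Calling `bound_holds()` checks this on random pairs of distributions; it is a sanity check, not a proof.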

Code Repositories

cxxgtxy/POP3D (official, tf, mentioned in GitHub)

Benchmarks

Benchmark                                      Method  Score
atari-games-on-atari-2600-alien                POP3D   1510.8
atari-games-on-atari-2600-amidar               POP3D   729.15
atari-games-on-atari-2600-assault              POP3D   5400.13
atari-games-on-atari-2600-asterix              POP3D   4310.67
atari-games-on-atari-2600-asteroids            POP3D   2488.1
atari-games-on-atari-2600-atlantis             POP3D   2193605.67
atari-games-on-atari-2600-bank-heist           POP3D   1212.23
atari-games-on-atari-2600-battle-zone          POP3D   15466.67
atari-games-on-atari-2600-beam-rider           POP3D   4549
atari-games-on-atari-2600-bowling              POP3D   38.99
atari-games-on-atari-2600-boxing               POP3D   97.23
atari-games-on-atari-2600-breakout             POP3D   458.41
atari-games-on-atari-2600-centipede            POP3D   3315.44
atari-games-on-atari-2600-chopper-command      POP3D   6308.33
atari-games-on-atari-2600-crazy-climber        POP3D   120247.33
atari-games-on-atari-2600-demon-attack         POP3D   61147.33
atari-games-on-atari-2600-double-dunk          POP3D   -7.89
atari-games-on-atari-2600-enduro               POP3D   459.85
atari-games-on-atari-2600-fishing-derby        POP3D   28.99
atari-games-on-atari-2600-freeway              POP3D   21.21
atari-games-on-atari-2600-frostbite            POP3D   316.87
atari-games-on-atari-2600-gopher               POP3D   6207
atari-games-on-atari-2600-gravitar             POP3D   557.17
atari-games-on-atari-2600-ice-hockey           POP3D   -4.12
atari-games-on-atari-2600-james-bond           POP3D   358.54
atari-games-on-atari-2600-kangaroo             POP3D   3891.67
atari-games-on-atari-2600-krull                POP3D   7715.68
atari-games-on-atari-2600-kung-fu-master       POP3D   33728
atari-games-on-atari-2600-montezumas-revenge   POP3D   0
atari-games-on-atari-2600-ms-pacman            POP3D   1683.87
atari-games-on-atari-2600-name-this-game       POP3D   6065.63
atari-games-on-atari-2600-pitfall              POP3D   0
atari-games-on-atari-2600-pong                 POP3D   20.5
atari-games-on-atari-2600-private-eye          POP3D   79.67
atari-games-on-atari-2600-qbert                POP3D   15396.67
atari-games-on-atari-2600-river-raid           POP3D   8052.23
atari-games-on-atari-2600-road-runner          POP3D   44679.67
atari-games-on-atari-2600-robotank             POP3D   4.6
atari-games-on-atari-2600-seaquest             POP3D   1807.47
atari-games-on-atari-2600-space-invaders       POP3D   1216.15
atari-games-on-atari-2600-star-gunner          POP3D   48984
atari-games-on-atari-2600-tennis               POP3D   -8.32
atari-games-on-atari-2600-time-pilot           POP3D   3770.33
atari-games-on-atari-2600-tutankham            POP3D   241.21
atari-games-on-atari-2600-up-and-down          POP3D   242701.51
atari-games-on-atari-2600-venture              POP3D   36.33
atari-games-on-atari-2600-video-pinball        POP3D   37780.7
atari-games-on-atari-2600-wizard-of-wor        POP3D   4704
atari-games-on-atari-2600-zaxxon               POP3D   9472
