Simon Schmitt, Matteo Hessel, Karen Simonyan

Abstract
We investigate the combination of actor-critic reinforcement learning algorithms with uniform large-scale experience replay and propose solutions for two challenges: (a) efficient actor-critic learning with experience replay, and (b) stability of off-policy learning where agents learn from other agents' behaviour. We employ those insights to accelerate hyper-parameter sweeps in which all participating agents run concurrently and share their experience via a common replay module. To this end, we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solution. We further show the benefits of this setup by demonstrating state-of-the-art data efficiency on Atari among agents trained for up to 200M environment frames.
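For readers unfamiliar with V-trace, the following is a minimal sketch of how V-trace value targets can be computed for a single trajectory, following the standard definition with truncated importance weights. It is not the trust region scheme proposed in the abstract, and the function name, argument names, and default values (`discount`, `rho_bar`, `c_bar`) are illustrative assumptions rather than anything taken from this page.

```python
import math


def vtrace_targets(log_rhos, rewards, values, bootstrap_value,
                   discount=0.99, rho_bar=1.0, c_bar=1.0):
    """Sketch of V-trace value targets for one trajectory (hypothetical helper).

    log_rhos[t]     : log(pi(a_t | x_t) / mu(a_t | x_t)), target vs. behaviour policy
    rewards[t]      : reward received after taking a_t
    values[t]       : current value estimate V(x_t)
    bootstrap_value : value estimate for the state after the last step
    """
    n = len(rewards)
    # Truncated importance weights: rho_t = min(rho_bar, pi/mu), c_t = min(c_bar, pi/mu).
    rhos = [min(rho_bar, math.exp(lr)) for lr in log_rhos]
    cs = [min(c_bar, math.exp(lr)) for lr in log_rhos]
    next_values = list(values[1:]) + [bootstrap_value]

    targets = [0.0] * n
    acc = 0.0  # holds v_{t+1} - V(x_{t+1}) while iterating backwards
    for t in reversed(range(n)):
        # Importance-weighted temporal-difference error at step t.
        delta = rhos[t] * (rewards[t] + discount * next_values[t] - values[t])
        # Backward recursion: v_t - V(x_t) = delta_t + discount * c_t * (v_{t+1} - V(x_{t+1})).
        acc = delta + discount * cs[t] * acc
        targets[t] = values[t] + acc
    return targets
```

When the behaviour and target policies coincide (all `log_rhos` equal to zero), the targets reduce to ordinary n-step bootstrapped returns. When the target policy assigns much lower probability to the replayed actions, the weights shrink toward zero and the targets stay close to the current value estimates, which illustrates the bias-variance tradeoff the abstract refers to.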
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| atari-games-on-atari-57 | LASER | Human World Record Breakthrough: 7; Mean Human Normalized Score: 1741.36% |
| atari-games-on-atari-games | LASER | Mean Human Normalized Score: 1741.36% |