
Abstract
Current reinforcement learning frameworks focus solely on performance, often at the expense of efficiency. Biological control systems, by contrast, achieve excellent performance while also optimizing computational energy expenditure and decision frequency. To address this gap, we propose the Decision Bounded Markov Decision Process (DB-MDP), a framework that places strict limits on the number of decisions an agent may make and the computational energy available to it in a reinforcement learning environment. Our experiments show that existing reinforcement learning algorithms perform poorly under these constraints, frequently failing or settling for suboptimal behavior. To meet this challenge, we introduce a biologically inspired Temporally Layered Architecture (TLA), which lets the agent manage its computational cost through two layers operating at different time scales and with different energy profiles. TLA achieves optimal performance in decision-bounded environments and matches state-of-the-art algorithms on continuous-control tasks while requiring only a fraction of the computation. Compared with existing reinforcement learning methods that optimize for performance alone, our approach substantially reduces computational energy consumption at no loss in performance. These results establish a new benchmark for energy- and time-aware control and point toward the design of efficient, sustainable intelligent decision-making systems.
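The abstract does not spell out how a decision bound is enforced, so the following is a minimal, hypothetical sketch of a DB-MDP-style environment wrapper. The class name `DecisionBudgetWrapper`, the parameter `max_decisions`, and the rule that repeating the previous action is "free" are illustrative assumptions, not the authors' implementation; the classic `gym` step API is assumed.

```python
import numpy as np
import gym


class DecisionBudgetWrapper(gym.Wrapper):
    """Illustrative DB-MDP-style wrapper: ends the episode once the agent
    has spent its decision budget. Repeating the previous action is treated
    as free, which is how a temporally layered agent could conserve decisions."""

    def __init__(self, env, max_decisions=500):
        super().__init__(env)
        self.max_decisions = max_decisions
        self._decisions = 0
        self._last_action = None

    def reset(self, **kwargs):
        self._decisions = 0
        self._last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        # Charge a decision only when the action differs from the previous one.
        if self._last_action is None or not np.array_equal(action, self._last_action):
            self._decisions += 1
        self._last_action = np.asarray(action).copy()

        obs, reward, done, info = self.env.step(action)
        info["decisions_used"] = self._decisions

        # Terminate once the decision budget is exhausted.
        if self._decisions >= self.max_decisions:
            done = True
        return obs, reward, done, info


# Example usage (environment name and budget chosen arbitrarily):
# env = DecisionBudgetWrapper(gym.make("Pendulum-v1"), max_decisions=100)
```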
Code Repository
Benchmarks
| Benchmark | Method | Action Repetition | Average Decisions | Mean Reward |
|---|---|---|---|---|
| openai-gym-on-ant-v2 | TLA | 0.1268 | 860.21 | 5163.54 |
| openai-gym-on-halfcheetah-v2 | TLA | 0.1805 | 831.42 | 9571.99 |
| openai-gym-on-hopper-v2 | TLA | 0.5722 | 423.91 | 3458.22 |
| openai-gym-on-inverteddoublependulum-v2 | TLA | 0.7522 | 247.76 | 9356.67 |
| openai-gym-on-invertedpendulum-v2 | TLA | 0.8882 | 111.79 | 1000 |
| openai-gym-on-mountaincarcontinuous-v0 | TLA | 0.914 | 10.6 | 93.88 |
| openai-gym-on-pendulum-v1 | TLA | 0.7032 | 62.31 | -154.92 |
| openai-gym-on-walker2d-v2 | TLA | 0.4745 | 513.12 | 3878.41 |
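The table does not define its metrics precisely, so the snippet below shows one plausible way to compute them from logged rollouts. It assumes "Action Repetition" is the fraction of steps that reuse the previous action and "Average Decisions" is the mean number of non-repeated actions per episode; both definitions are inferred for illustration, not taken from the paper, and the function names are hypothetical.

```python
import numpy as np


def episode_metrics(actions):
    """actions: sequence of per-step actions from a single episode."""
    actions = [np.asarray(a) for a in actions]
    repeats = sum(np.array_equal(a, b) for a, b in zip(actions[1:], actions[:-1]))
    total = len(actions)
    return {
        "action_repetition": repeats / total,   # fraction of repeated steps
        "decisions": total - repeats,           # steps that required a new decision
    }


def aggregate(episode_actions, episode_rewards):
    """Average the per-episode metrics, mirroring the table's columns."""
    per_ep = [episode_metrics(a) for a in episode_actions]
    return {
        "Action Repetition": float(np.mean([m["action_repetition"] for m in per_ep])),
        "Average Decisions": float(np.mean([m["decisions"] for m in per_ep])),
        "Mean Reward": float(np.mean(episode_rewards)),
    }
```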