17 days ago

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang

Abstract

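We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, and it demonstrates remarkable reasoning capabilities. During RL, numerous powerful and intriguing reasoning behaviors emerge naturally in DeepSeek-R1-Zero. However, the model also faces challenges such as poor readability and language mixing. To address these issues and further improve reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. On reasoning tasks, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) built on Qwen and Llama, all distilled from DeepSeek-R1.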

Code Repositories

deepseek-ai/deepseek-r1 (official; mentioned in GitHub)
vlm-rl/ocean-r1 (pytorch; mentioned in GitHub)
zhaoolee/garss (pytorch; mentioned in GitHub)

Benchmarks

Benchmark | Method | Metric
mathematical-reasoning-on-aime24 | DeepSeek-r1 | Acc: 79.8
multi-task-language-understanding-on-mmlu | ds-r1(671b) | Average (%): 87.5
question-answering-on-newsqa | deepseek-r1 | EM: 80.57, F1: 86.13
