Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin

Abstract
The rapid evolution of Large Language Models (LLMs) such as ChatGPT and GPT-4 has sparked widespread discussion about the imminent arrival of Artificial General Intelligence (AGI). However, reproducing such advances in open-source models remains challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, in long-context modeling, and in open-ended subjective evaluations, owing to innovative pre-training and optimization techniques. The pre-training process of InternLM2 is described in detail, highlighting the preparation of diverse data types including text, code, and long-context data. The model progressively extends its context length from an initial 4K tokens to 32K tokens across pre-training and fine-tuning, efficiently capturing long-term dependencies, and exhibits remarkable performance on the 200K "Needle-in-a-Haystack" test. To further improve alignment, InternLM2 is tuned with Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models at different training stages and model sizes, the paper provides the research community with insights into the model's evolution.
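To make the long-context evaluation concrete, below is a minimal sketch of a Needle-in-a-Haystack style probe: a short "needle" fact is buried inside repeated filler text and the model is asked to retrieve it. It assumes the Hugging Face checkpoint name `internlm/internlm2-chat-7b` and the standard transformers `generate()` API; the filler text, needle, and context size are illustrative and not the paper's actual evaluation harness.

```python
# Minimal Needle-in-a-Haystack style probe (sketch, not the paper's harness).
# Assumes the "internlm/internlm2-chat-7b" checkpoint is available via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "internlm/internlm2-chat-7b"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

def build_haystack(needle: str, filler: str, depth: float, n_repeats: int) -> str:
    """Repeat filler sentences and insert the needle at a relative depth in [0, 1]."""
    chunks = [filler] * n_repeats
    chunks.insert(int(depth * n_repeats), needle)
    return "\n".join(chunks)

needle = "The secret passphrase is 'azure-falcon-42'."
filler = "The quick brown fox jumps over the lazy dog."
context = build_haystack(needle, filler, depth=0.5, n_repeats=400)  # illustrative size

prompt = f"{context}\n\nQuestion: What is the secret passphrase?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode only the newly generated tokens and check whether the needle was recovered.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Sweeping the insertion depth and the number of filler repeats over increasing context lengths yields the grid of retrieval scores that a Needle-in-a-Haystack evaluation reports.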
Code Repositories
internlm/internlm
pytorch
Mentioned in GitHub
pwc-1/Paper-9/tree/main/internlm
mindspore
ruixiangcui/agieval
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics (score per context length) |
|---|---|---|
| long-context-understanding-on-ada-leval | InternLM2-7b | 1k: 58.6, 2k: 49.5, 4k: 33.9, 6k: 12.3, 8k: 13.4, 12k: 2.0, 16k: 0.8, 32k: 0.5, 64k: 0.5, 128k: 0.0 |
| long-context-understanding-on-ada-leval-tsort | InternLM2-7b | 2k: 5.1, 4k: 3.9, 8k: 5.1, 16k: 4.3 |