
GLM-130B: An Open Bilingual Pre-trained Model

Abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. The model is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of this scale can be successfully pre-trained. Over the course of this effort, we faced numerous unexpected technical and engineering challenges, particularly around loss spikes and divergence. This paper describes the training process of GLM-130B, including its design choices, the training strategies adopted for efficiency and stability, and the engineering efforts involved. The resulting GLM-130B significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, an advantage that is not observed for OPT-175B or BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without any post-training and with almost no performance loss, making it the first 100B-scale model to achieve this. More importantly, this allows effective inference on 4×RTX 3090 (24 GB) or 8×RTX 2080 Ti (11 GB) GPUs, the most affordable GPUs required for running 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkits, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
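The post-training-free INT4 claim above amounts to quantizing only the weights to 4 bits and dequantizing them on the fly at matmul time. Below is a minimal sketch of symmetric per-row absmax INT4 weight quantization in PyTorch; the function names and the absmax scheme are illustrative assumptions, not GLM-130B's actual implementation (see the linked repository for that).

```python
# Minimal sketch of weight-only INT4 quantization (assumed absmax scheme,
# not the GLM-130B codebase's actual routine).
import torch

def quantize_int4(weight: torch.Tensor):
    """Symmetric per-row absmax quantization of a 2-D weight matrix to INT4."""
    # One scale per output row; the symmetric INT4 range is [-7, 7].
    scale = (weight.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    # 4-bit integer values, stored in an int8 container for simplicity.
    q = torch.clamp(torch.round(weight / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate float16 weight for the matmul.
    return q.to(torch.float16) * scale.to(torch.float16)

if __name__ == "__main__":
    w = torch.randn(4096, 4096)
    q, s = quantize_int4(w)
    err = (dequantize_int4(q, s).float() - w).abs().mean()
    print(f"mean absolute quantization error: {err:.4f}")
    # Back-of-envelope memory: 130e9 params x 0.5 bytes (4 bits) ~ 65 GB,
    # which fits on 4x RTX 3090 (4x24 GB = 96 GB) with room for activations.
```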

Code Repositories

THUDM/GLM-130B: https://github.com/THUDM/GLM-130B/

Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| language-modelling-on-big-bench-lite | GLM-130B (0-shot) | Accuracy: 13.31 |
| language-modelling-on-big-bench-lite | GLM-130B (1-shot) | Accuracy: 14.91 |
| language-modelling-on-big-bench-lite | GLM-130B (3-shot) | Accuracy: 15.11 |
| language-modelling-on-clue-afqmc | ERNIE 3.0 Titan-260B | Accuracy: 69.0 |
| language-modelling-on-clue-afqmc | GLM-130B | Accuracy: 71.2 |
| language-modelling-on-clue-c3 | ERNIE 3.0 Titan-260B | Accuracy: 54.9 |
| language-modelling-on-clue-c3 | GLM-130B | Accuracy: 77.5 |
| language-modelling-on-clue-cmnli | ERNIE 3.0 Titan-260B | Accuracy: 51.7 |
| language-modelling-on-clue-cmnli | GLM-130B | Accuracy: 77.0 |
| language-modelling-on-clue-cmrc2018 | ERNIE 3.0 Titan-260B | Accuracy: 16.6 |
| language-modelling-on-clue-cmrc2018 | GLM-130B | Accuracy: 55.7 |
| language-modelling-on-clue-drcd | ERNIE 3.0 Titan-260B | Accuracy: 29.5 |
| language-modelling-on-clue-drcd | GLM-130B | Accuracy: 77.1 |
| language-modelling-on-clue-ocnli-50k | ERNIE 3.0 Titan-260B | Accuracy: 44.6 |
| language-modelling-on-clue-ocnli-50k | GLM-130B | Accuracy: 74.7 |
| language-modelling-on-clue-wsc1-1 | ERNIE 3.0 Titan-260B | Accuracy: 81.1 |
| language-modelling-on-clue-wsc1-1 | GLM-130B | Accuracy: 83.9 |
| language-modelling-on-fewclue-bustm | ERNIE 3.0 Titan-260B | Accuracy: 64.4 |
| language-modelling-on-fewclue-bustm | GLM-130B | Accuracy: 77.5 |
| language-modelling-on-fewclue-chid-fc | ERNIE 3.0 Titan-260B | Accuracy: 87.1 |
| language-modelling-on-fewclue-chid-fc | GLM-130B | Accuracy: 90.1 |
| language-modelling-on-fewclue-cluewsc-fc | ERNIE 3.0 Titan-260B | Accuracy: 53.5 |
| language-modelling-on-fewclue-cluewsc-fc | GLM-130B | Accuracy: 77.4 |
| language-modelling-on-fewclue-eprstmt | ERNIE 3.0 Titan-260B | Accuracy: 88.8 |
| language-modelling-on-fewclue-eprstmt | GLM-130B | Accuracy: 92.5 |
| language-modelling-on-fewclue-ocnli-fc | ERNIE 3.0 Titan-260B | Accuracy: 53.8 |
| language-modelling-on-fewclue-ocnli-fc | GLM-130B | Accuracy: 73.8 |
| language-modelling-on-lambada | GLM-130B (bidirectional attention) | Accuracy: 80.2 |
| language-modelling-on-the-pile | Jurassic-1 | Bits per byte: 0.65 |
| language-modelling-on-the-pile | GLM-130B | Bits per byte: 0.634 |
| language-modelling-on-the-pile | GPT-3 | Bits per byte: 0.742 |
| long-context-understanding-on-ada-leval | ChatGLM2-6b-32k | 1k: 31.2, 2k: 10.9, 4k: 4.5, 6k: 1.6, 8k: 1.6, 12k: 0.0, 16k: 0.3 |
| long-context-understanding-on-ada-leval | ChatGLM3-6b-32k | 1k: 39.8, 2k: 18.8, 4k: 9.0, 6k: 5.0, 8k: 3.4, 12k: 0.9, 16k: 0.5 |
| long-context-understanding-on-ada-leval-tsort | ChatGLM2-6b-32k | 2k: 0.9, 4k: 0.2, 8k: 0.7, 16k: 0.9 |
| long-context-understanding-on-ada-leval-tsort | ChatGLM3-6b-32k | 2k: 2.3, 4k: 2.4, 8k: 2.0, 16k: 0.7 |
| multi-task-language-understanding-on-mmlu | GLM-130B | Average (%): 44.8 |
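For the Pile rows above, "Bits per byte" normalizes the model's log-loss by the raw text length in bytes, which makes scores comparable across models with different tokenizers; lower is better, so the table puts GLM-130B (0.634) ahead of Jurassic-1 (0.65) and GPT-3 (0.742). Below is a minimal sketch of the computation, assuming a generic PyTorch causal LM and tokenizer; `model` and `tokenizer` are placeholder interfaces, not the actual evaluation harness.

```python
# Minimal sketch of the bits-per-byte metric for a causal language model.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def bits_per_byte(model, tokenizer, texts):
    """Sum next-token NLL (nats) over a corpus, normalize by UTF-8 bytes, convert to bits."""
    total_nll_nats, total_bytes = 0.0, 0
    for text in texts:
        ids = torch.tensor([tokenizer.encode(text)])
        logits = model(ids).logits  # shape: (1, seq_len, vocab); assumed HF-style output
        # NLL of each next token given its prefix (shift logits left, targets right).
        nll = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="sum")
        total_nll_nats += nll.item()
        total_bytes += len(text.encode("utf-8"))
    # Divide by ln(2) to convert nats to bits, then by the byte count.
    return total_nll_nats / (math.log(2) * total_bytes)
```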
