GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng; Xiao Liu; Zhengxiao Du; Zihan Wang; Hanyu Lai; Ming Ding; Zhuoyi Yang; Yifan Xu; Wendi Zheng; Xiao Xia; Weng Lam Tam; Zixuan Ma; Yufei Xue; Jidong Zhai; Wenguang Chen; Peng Zhang; Yuxiao Dong; Jie Tang

Abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we faced numerous unexpected technical and engineering challenges, particularly loss spikes and divergence. In this paper, we introduce the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resulting GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, an advantage not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training and with almost no performance loss, making it the first among 100B-scale models to do so and, more importantly, allowing effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
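The INT4 result described above is weight-only quantization applied directly to the trained checkpoint, with no post-training calibration. The snippet below is a minimal sketch of symmetric (absmax) 4-bit weight quantization for illustration only; the function names and the per-row granularity are assumptions made here, not the kernels shipped with GLM-130B.

```python
import torch

def quantize_weight_int4(weight: torch.Tensor):
    """Symmetric per-row (absmax) quantization of a weight matrix to 4-bit range.

    Illustrative sketch only (hypothetical helper): each output row is scaled so
    that its largest magnitude maps to the INT4 extreme value 7, then rounded.
    """
    scale = weight.abs().max(dim=-1, keepdim=True).values / 7.0
    scale = scale.clamp(min=1e-8)  # avoid division by zero for all-zero rows
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)  # int8 container for 4-bit values
    return q, scale

def dequantize_weight_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate FP16 weight for use in matmuls at inference time."""
    return q.to(torch.float16) * scale.to(torch.float16)

if __name__ == "__main__":
    w = torch.randn(4096, 4096)
    q, s = quantize_weight_int4(w)
    w_hat = dequantize_weight_int4(q, s)
    print("max abs error:", (w - w_hat.float()).abs().max().item())
```

Weight-only quantization at this granularity roughly quarters the memory footprint of the weights relative to FP16, which is what makes multi-consumer-GPU inference of a 130B-parameter model feasible.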

Code Repositories

thudm/glm-130b (official, PyTorch)
thudm/chatglm2-6b (PyTorch)
thudm/chatglm (PyTorch)
THUDM/GLM (PyTorch)
thudm/chatglm3 (PyTorch)
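For the smaller chat variants listed above (ChatGLM, ChatGLM2-6B, ChatGLM3), the repositories document a Hugging Face Transformers interface. The sketch below follows that documented pattern, assuming the `chatglm2-6b` checkpoint; exact method names and defaults may differ between checkpoint versions, so treat it as a starting point rather than a definitive recipe.

```python
# Minimal usage sketch, assuming the interface documented in the THUDM/chatglm2-6b
# repository: the checkpoint ships custom modeling code (trust_remote_code=True)
# and exposes a chat() helper on the loaded model.
from transformers import AutoModel, AutoTokenizer

model_id = "THUDM/chatglm2-6b"  # one of the repositories listed above
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).half().cuda().eval()

# chat() returns the reply plus the running conversation history.
response, history = model.chat(tokenizer, "What is GLM-130B?", history=[])
print(response)
```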

Benchmarks

Benchmark | Methodology | Metrics
language-modelling-on-big-bench-lite | GLM-130B (0-shot) | Accuracy: 13.31
language-modelling-on-big-bench-lite | GLM-130B (1-shot) | Accuracy: 14.91
language-modelling-on-big-bench-lite | GLM-130B (3-shot) | Accuracy: 15.11
language-modelling-on-clue-afqmc | ERNIE 3.0 Titan-260B | Accuracy: 69.0
language-modelling-on-clue-afqmc | GLM-130B | Accuracy: 71.2
language-modelling-on-clue-c3 | ERNIE 3.0 Titan-260B | Accuracy: 54.9
language-modelling-on-clue-c3 | GLM-130B | Accuracy: 77.5
language-modelling-on-clue-cmnli | ERNIE 3.0 Titan-260B | Accuracy: 51.7
language-modelling-on-clue-cmnli | GLM-130B | Accuracy: 77.0
language-modelling-on-clue-cmrc2018 | ERNIE 3.0 Titan-260B | Accuracy: 16.6
language-modelling-on-clue-cmrc2018 | GLM-130B | Accuracy: 55.7
language-modelling-on-clue-drcd | ERNIE 3.0 Titan-260B | Accuracy: 29.5
language-modelling-on-clue-drcd | GLM-130B | Accuracy: 77.1
language-modelling-on-clue-ocnli-50k | ERNIE 3.0 Titan-260B | Accuracy: 44.6
language-modelling-on-clue-ocnli-50k | GLM-130B | Accuracy: 74.7
language-modelling-on-clue-wsc1-1 | ERNIE 3.0 Titan-260B | Accuracy: 81.1
language-modelling-on-clue-wsc1-1 | GLM-130B | Accuracy: 83.9
language-modelling-on-fewclue-bustm | ERNIE 3.0 Titan-260B | Accuracy: 64.4
language-modelling-on-fewclue-bustm | GLM-130B | Accuracy: 77.5
language-modelling-on-fewclue-chid-fc | ERNIE 3.0 Titan-260B | Accuracy: 87.1
language-modelling-on-fewclue-chid-fc | GLM-130B | Accuracy: 90.1
language-modelling-on-fewclue-cluewsc-fc | ERNIE 3.0 Titan-260B | Accuracy: 53.5
language-modelling-on-fewclue-cluewsc-fc | GLM-130B | Accuracy: 77.4
language-modelling-on-fewclue-eprstmt | ERNIE 3.0 Titan-260B | Accuracy: 88.8
language-modelling-on-fewclue-eprstmt | GLM-130B | Accuracy: 92.5
language-modelling-on-fewclue-ocnli-fc | ERNIE 3.0 Titan-260B | Accuracy: 53.8
language-modelling-on-fewclue-ocnli-fc | GLM-130B | Accuracy: 73.8
language-modelling-on-lambada | GLM-130B (bidirectional attention) | Accuracy: 80.2
language-modelling-on-the-pile | GLM-130B | Bits per byte: 0.634
language-modelling-on-the-pile | Jurassic-1 | Bits per byte: 0.65
language-modelling-on-the-pile | GPT-3 | Bits per byte: 0.742
long-context-understanding-on-ada-leval | ChatGLM2-6b-32k | 1k: 31.2, 2k: 10.9, 4k: 4.5, 6k: 1.6, 8k: 1.6, 12k: 0.0, 16k: 0.3
long-context-understanding-on-ada-leval | ChatGLM3-6b-32k | 1k: 39.8, 2k: 18.8, 4k: 9.0, 6k: 5.0, 8k: 3.4, 12k: 0.9, 16k: 0.5
long-context-understanding-on-ada-leval-tsort | ChatGLM2-6b-32k | 2k: 0.9, 4k: 0.2, 8k: 0.7, 16k: 0.9
long-context-understanding-on-ada-leval-tsort | ChatGLM3-6b-32k | 2k: 2.3, 4k: 2.4, 8k: 2.0, 16k: 0.7
multi-task-language-understanding-on-mmlu | GLM-130B | Average (%): 44.8
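Accuracy entries are percentages on the respective benchmark, the Ada-LEval entries report scores at the listed context lengths (1k through 16k tokens), and The Pile entries use bits per byte (lower is better), which normalizes away tokenizer differences so models with different vocabularies can be compared on the same raw bytes. The helper below is a hypothetical illustration of that conversion from mean per-token cross-entropy to bits per byte; the token/byte ratio in the example is made up for illustration and is not the paper's measured value.

```python
import math

def bits_per_byte(mean_loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a language model's mean cross-entropy (nats/token) into bits per byte.

    Hypothetical helper for illustration: total nats are converted to bits and
    spread over the number of UTF-8 bytes in the evaluated text, so the result
    does not depend on how aggressively the tokenizer compresses the bytes.
    """
    total_bits = mean_loss_nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Illustrative numbers only: 250 tokens covering 1000 bytes at ~1.76 nats/token
# gives roughly 0.63 bits per byte, the same order as the table's Pile entries.
print(bits_per_byte(1.76, n_tokens=250, n_bytes=1000))
```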
