GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng; Xiao Liu; Zhengxiao Du; Zihan Wang; Hanyu Lai; Ming Ding; Zhuoyi Yang; Yifan Xu; Wendi Zheng; Xiao Xia; Weng Lam Tam; Zixuan Ma; Yufei Xue; Jidong Zhai; Wenguang Chen; Peng Zhang; Yuxiao Dong; Jie Tang

Abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we faced numerous unexpected technical and engineering challenges, particularly loss spikes and divergence. In this paper, we introduce the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resulting GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, an advantage not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training and with almost no performance loss, making it the first among 100B-scale models to do so and, more importantly, allowing effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
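The INT4 result described above is weight-only quantization applied directly to the trained checkpoint, with no post-training calibration. The snippet below is a minimal sketch of symmetric (absmax) 4-bit weight quantization for illustration only; the function names and the per-row granularity are assumptions made here, not the kernels shipped with GLM-130B.

```python
import torch

def quantize_weight_int4(weight: torch.Tensor):
    """Symmetric per-row (absmax) quantization of a weight matrix to 4-bit range.

    Illustrative sketch only (hypothetical helper): each output row is scaled so
    that its largest magnitude maps to the INT4 extreme value 7, then rounded.
    """
    scale = weight.abs().max(dim=-1, keepdim=True).values / 7.0
    scale = scale.clamp(min=1e-8)  # avoid division by zero for all-zero rows
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)  # int8 container for 4-bit values
    return q, scale

def dequantize_weight_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate FP16 weight for use in matmuls at inference time."""
    return q.to(torch.float16) * scale.to(torch.float16)

if __name__ == "__main__":
    w = torch.randn(4096, 4096)
    q, s = quantize_weight_int4(w)
    w_hat = dequantize_weight_int4(q, s)
    print("max abs error:", (w - w_hat.float()).abs().max().item())
```

Weight-only quantization at this granularity roughly quarters the memory footprint of the weights relative to FP16, which is what makes multi-consumer-GPU inference of a 130B-parameter model feasible.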

Code Repositories

thudm/glm-130b (official, PyTorch)
thudm/chatglm2-6b (PyTorch)
thudm/chatglm (PyTorch)
THUDM/GLM (PyTorch)
thudm/chatglm3 (PyTorch)
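For the smaller chat variants listed above (ChatGLM, ChatGLM2-6B, ChatGLM3), the repositories document a Hugging Face Transformers interface. The sketch below follows that documented pattern, assuming the `chatglm2-6b` checkpoint; exact method names and defaults may differ between checkpoint versions, so treat it as a starting point rather than a definitive recipe.

```python
# Minimal usage sketch, assuming the interface documented in the THUDM/chatglm2-6b
# repository: the checkpoint ships custom modeling code (trust_remote_code=True)
# and exposes a chat() helper on the loaded model.
from transformers import AutoModel, AutoTokenizer

model_id = "THUDM/chatglm2-6b"  # one of the repositories listed above
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).half().cuda().eval()

# chat() returns the reply plus the running conversation history.
response, history = model.chat(tokenizer, "What is GLM-130B?", history=[])
print(response)
```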

Benchmarks

Benchmark | Methodology | Metrics
language-modelling-on-big-bench-lite | GLM-130B (0-shot) | Accuracy: 13.31
language-modelling-on-big-bench-lite | GLM-130B (1-shot) | Accuracy: 14.91
language-modelling-on-big-bench-lite | GLM-130B (3-shot) | Accuracy: 15.11
language-modelling-on-clue-afqmc | ERNIE 3.0 Titan-260B | Accuracy: 69.0
language-modelling-on-clue-afqmc | GLM-130B | Accuracy: 71.2
language-modelling-on-clue-c3 | ERNIE 3.0 Titan-260B | Accuracy: 54.9
language-modelling-on-clue-c3 | GLM-130B | Accuracy: 77.5
language-modelling-on-clue-cmnli | ERNIE 3.0 Titan-260B | Accuracy: 51.7
language-modelling-on-clue-cmnli | GLM-130B | Accuracy: 77.0
language-modelling-on-clue-cmrc2018 | ERNIE 3.0 Titan-260B | Accuracy: 16.6
language-modelling-on-clue-cmrc2018 | GLM-130B | Accuracy: 55.7
language-modelling-on-clue-drcd | ERNIE 3.0 Titan-260B | Accuracy: 29.5
language-modelling-on-clue-drcd | GLM-130B | Accuracy: 77.1
language-modelling-on-clue-ocnli-50k | ERNIE 3.0 Titan-260B | Accuracy: 44.6
language-modelling-on-clue-ocnli-50k | GLM-130B | Accuracy: 74.7
language-modelling-on-clue-wsc1-1 | ERNIE 3.0 Titan-260B | Accuracy: 81.1
language-modelling-on-clue-wsc1-1 | GLM-130B | Accuracy: 83.9
language-modelling-on-fewclue-bustm | ERNIE 3.0 Titan-260B | Accuracy: 64.4
language-modelling-on-fewclue-bustm | GLM-130B | Accuracy: 77.5
language-modelling-on-fewclue-chid-fc | ERNIE 3.0 Titan-260B | Accuracy: 87.1
language-modelling-on-fewclue-chid-fc | GLM-130B | Accuracy: 90.1
language-modelling-on-fewclue-cluewsc-fc | ERNIE 3.0 Titan-260B | Accuracy: 53.5
language-modelling-on-fewclue-cluewsc-fc | GLM-130B | Accuracy: 77.4
language-modelling-on-fewclue-eprstmt | ERNIE 3.0 Titan-260B | Accuracy: 88.8
language-modelling-on-fewclue-eprstmt | GLM-130B | Accuracy: 92.5
language-modelling-on-fewclue-ocnli-fc | ERNIE 3.0 Titan-260B | Accuracy: 53.8
language-modelling-on-fewclue-ocnli-fc | GLM-130B | Accuracy: 73.8
language-modelling-on-lambada | GLM-130B (bidirectional attention) | Accuracy: 80.2
language-modelling-on-the-pile | GLM-130B | Bits per byte: 0.634
language-modelling-on-the-pile | Jurassic-1 | Bits per byte: 0.65
language-modelling-on-the-pile | GPT-3 | Bits per byte: 0.742
long-context-understanding-on-ada-leval | ChatGLM2-6b-32k | 1k: 31.2, 2k: 10.9, 4k: 4.5, 6k: 1.6, 8k: 1.6, 12k: 0.0, 16k: 0.3
long-context-understanding-on-ada-leval | ChatGLM3-6b-32k | 1k: 39.8, 2k: 18.8, 4k: 9.0, 6k: 5.0, 8k: 3.4, 12k: 0.9, 16k: 0.5
long-context-understanding-on-ada-leval-tsort | ChatGLM2-6b-32k | 2k: 0.9, 4k: 0.2, 8k: 0.7, 16k: 0.9
long-context-understanding-on-ada-leval-tsort | ChatGLM3-6b-32k | 2k: 2.3, 4k: 2.4, 8k: 2.0, 16k: 0.7
multi-task-language-understanding-on-mmlu | GLM-130B | Average (%): 44.8
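Accuracy entries are percentages on the respective benchmark, the Ada-LEval entries report scores at the listed context lengths (1k through 16k tokens), and The Pile entries use bits per byte (lower is better), which normalizes away tokenizer differences so models with different vocabularies can be compared on the same raw bytes. The helper below is a hypothetical illustration of that conversion from mean per-token cross-entropy to bits per byte; the token/byte ratio in the example is made up for illustration and is not the paper's measured value.

```python
import math

def bits_per_byte(mean_loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a language model's mean cross-entropy (nats/token) into bits per byte.

    Hypothetical helper for illustration: total nats are converted to bits and
    spread over the number of UTF-8 bytes in the evaluated text, so the result
    does not depend on how aggressively the tokenizer compresses the bytes.
    """
    total_bits = mean_loss_nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Illustrative numbers only: 250 tokens covering 1000 bytes at ~1.76 nats/token
# gives roughly 0.63 bits per byte, the same order as the table's Pile entries.
print(bits_per_byte(1.76, n_tokens=250, n_bytes=1000))
```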
