
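Abstract
The use of natural language processing (NLP) in financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large language models (LLMs) have proven effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature. In this work, we present BloombergGPT, a 50-billion-parameter language model trained on a wide range of financial data. We construct a 363-billion-token dataset from Bloomberg's extensive data sources, perhaps the largest domain-specific dataset to date, and augment it with 345 billion tokens from general-purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Training on the mixed dataset yields a model that significantly outperforms existing models on financial tasks without sacrificing performance on general LLM benchmarks. In addition, we explain our modeling choices, training process, and evaluation methodology in detail. We release the Training Chronicles (Appendix C), which document our experience in training BloombergGPT.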
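The two token counts in the abstract imply a roughly even split between financial and general-purpose data. The following is a minimal sketch of that arithmetic. The constant and function names are illustrative and do not come from the paper, and the calculation covers only the composition of the assembled dataset as stated in the abstract, not how many tokens were consumed during training.

```python
# Token counts as stated in the abstract, in billions of tokens.
FINANCIAL_TOKENS_B = 363  # built from Bloomberg's data sources
GENERAL_TOKENS_B = 345    # drawn from general-purpose datasets


def mixture_shares(financial_b, general_b):
    """Return (total, financial share, general share) for a two-part corpus."""
    total = financial_b + general_b
    return total, financial_b / total, general_b / total


total_b, fin_share, gen_share = mixture_shares(FINANCIAL_TOKENS_B, GENERAL_TOKENS_B)
print(f"total: {total_b}B tokens")            # total: 708B tokens
print(f"financial share: {fin_share:.1%}")    # financial share: 51.3%
print(f"general share: {gen_share:.1%}")      # general share: 48.7%
```

Under these figures the domain-specific portion is slightly larger than the general-purpose portion, which is consistent with the abstract's description of a mixed dataset.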
Code Repositories
yangletliu/finlora
pytorch
Mentioned on GitHub
open-finance-lab/finlora
pytorch
Mentioned on GitHub
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| common-sense-reasoning-on-arc-challenge | BLOOM 176B (1-shot) | Accuracy: 50.85 |
| common-sense-reasoning-on-arc-challenge | Bloomberg GPT 50B (1-shot) | Accuracy: 48.63 |
| common-sense-reasoning-on-arc-challenge | GPT-NeoX 20B (1-shot) | Accuracy: 45.39 |
| common-sense-reasoning-on-arc-challenge | OPT 66B (one-shot) | Accuracy: 44.54 |
| common-sense-reasoning-on-arc-easy | GPT-NeoX 20B (1-shot) | Accuracy: 70.79 |
| common-sense-reasoning-on-arc-easy | Bloomberg GPT 50B (1-shot) | Accuracy: 73.99 |
| common-sense-reasoning-on-arc-easy | OPT 66B (1-shot) | Accuracy: 71.25 |
| common-sense-reasoning-on-arc-easy | BLOOM 176B (1-shot) | Accuracy: 75.93 |
| common-sense-reasoning-on-big-bench | BLOOM 176B (few-shot, k=3) | Accuracy: 40.4 |
| common-sense-reasoning-on-big-bench | GPT-NeoX 20B (few-shot, k=3) | Accuracy: 40.8 |
| common-sense-reasoning-on-big-bench | Bloomberg GPT 50B (few-shot, k=3) | Accuracy: 34 |
| common-sense-reasoning-on-big-bench | PaLM 540B (few-shot, k=3) | Accuracy: 60.8 |
| common-sense-reasoning-on-big-bench | OPT 66B (few-shot, k=3) | Accuracy: 40.4 |
| common-sense-reasoning-on-big-bench-causal | GPT-NeoX 20B (few-shot, k=3) | Accuracy: 52.41 |
| common-sense-reasoning-on-big-bench-causal | BloombergGPT 50B (few-shot, k=3) | Accuracy: 49.73 |
| common-sense-reasoning-on-big-bench-causal | OPT 66B (few-shot, k=3) | Accuracy: 51.87 |
| common-sense-reasoning-on-big-bench-causal | PaLM 540B (few-shot, k=3) | Accuracy: 61.0 |
| common-sense-reasoning-on-big-bench-causal | BLOOM 176B (few-shot, k=3) | Accuracy: 51.87 |
| common-sense-reasoning-on-big-bench-date | GPT-NeoX 20B (few-shot, k=3) | Accuracy: 45.60 |
| common-sense-reasoning-on-big-bench-date | PaLM 540B (few-shot, k=3) | Accuracy: 53.6 |
| common-sense-reasoning-on-big-bench-date | OPT 66B (few-shot, k=3) | Accuracy: 49.60 |
| common-sense-reasoning-on-big-bench-date | Bloomberg GPT 50B (few-shot, k=3) | Accuracy: 54.8 |
| common-sense-reasoning-on-big-bench-date | BLOOM 176B (few-shot, k=3) | Accuracy: 50.00 |
| common-sense-reasoning-on-big-bench-sports | OPT 66B (few-shot, k=3) | Accuracy: 54.4 |
| common-sense-reasoning-on-big-bench-sports | GPT-NeoX (few-shot, k=3) | Accuracy: 53.2 |
| common-sense-reasoning-on-big-bench-sports | Bloomberg GPT (few-shot, k=3) | Accuracy: 62.8 |
| common-sense-reasoning-on-big-bench-sports | PaLM 540B (few-shot, k=3) | Accuracy: 80.4 |
| common-sense-reasoning-on-commonsenseqa | OPT 66B (1-shot) | Accuracy: 66.4 |
| common-sense-reasoning-on-commonsenseqa | BLOOM 176B (1-shot) | Accuracy: 64.2 |
| common-sense-reasoning-on-commonsenseqa | GPT-NeoX 20B (1-shot) | Accuracy: 60.4 |
| common-sense-reasoning-on-commonsenseqa | Bloomberg GPT 50B (1-shot) | Accuracy: 65.5 |
| common-sense-reasoning-on-record | OPT 66B (1-shot) | F1: 82.5 |
| common-sense-reasoning-on-record | Bloomberg GPT 50B (1-shot) | F1: 82.8 |
| common-sense-reasoning-on-record | GPT-NeoX 20B (1-shot) | F1: 67.9 |
| common-sense-reasoning-on-record | BLOOM 176B (1-shot) | F1: 78 |
| common-sense-reasoning-on-winogrande | OPT 66B (1-shot) | Accuracy: 66.1 |
| common-sense-reasoning-on-winogrande | Bloomberg GPT (one-shot) | Accuracy: 64.1 |
| common-sense-reasoning-on-winogrande | BLOOM 176B (1-shot) | Accuracy: 67 |
| common-sense-reasoning-on-winogrande | GPT-NeoX (one-shot) | Accuracy: 60.6 |
| logical-reasoning-on-big-bench-formal | PaLM 540B (few-shot, k=3) | Accuracy: 53.6 |
| logical-reasoning-on-big-bench-formal | GPT-NeoX 20B (few-shot, k=3) | Accuracy: 52.8 |
| logical-reasoning-on-big-bench-formal | OPT 66B (few-shot, k=3) | Accuracy: 54 |
| logical-reasoning-on-big-bench-formal | BLOOM 176B (few-shot, k=3) | Accuracy: 52.8 |
| logical-reasoning-on-big-bench-formal | Bloomberg GPT 50B (few-shot, k=3) | Accuracy: 50.8 |
| logical-reasoning-on-big-bench-penguins-in-a | GPT-NeoX (few-shot, k=3) | Accuracy: 33.56 |
| logical-reasoning-on-big-bench-penguins-in-a | OPT 66B (few-shot, k=3) | Accuracy: 28.08 |
| logical-reasoning-on-big-bench-penguins-in-a | BLOOM 176B (few-shot, k=3) | Accuracy: 40.41 |
| logical-reasoning-on-big-bench-penguins-in-a | Bloomberg GPT (few-shot, k=3) | Accuracy: 37.67 |
| logical-reasoning-on-big-bench-penguins-in-a | PaLM 540B (few-shot, k=3) | Accuracy: 44.5 |
| logical-reasoning-on-big-bench-reasoning | PaLM 540B (few-shot, k=3) | Accuracy: 38 |
| logical-reasoning-on-big-bench-reasoning | BLOOM 176B (few-shot, k=3) | Accuracy: 36.8 |
| logical-reasoning-on-big-bench-reasoning | GPT-NeoX (few-shot, k=3) | Accuracy: 26 |
| logical-reasoning-on-big-bench-reasoning | OPT 66B (few-shot, k=3) | Accuracy: 31.2 |
| logical-reasoning-on-big-bench-reasoning | Bloomberg GPT (few-shot, k=3) | Accuracy: 34.8 |
| logical-reasoning-on-big-bench-temporal | Bloomberg GPT (few-shot, k=3) | Accuracy: 29.2 |
| logical-reasoning-on-big-bench-temporal | OPT 66B (few-shot, k=3) | Accuracy: 23.6 |
| logical-reasoning-on-big-bench-temporal | PaLM 540B (few-shot, k=3) | Accuracy: 39.6 |
| logical-reasoning-on-big-bench-temporal | BLOOM 176B (few-shot, k=3) | Accuracy: 36.8 |
| logical-reasoning-on-big-bench-temporal | GPT-NeoX (few-shot, k=3) | Accuracy: 21.2 |
| multi-task-language-understanding-on-mmlu | Bloomberg GPT 50B (5-shot) | Average (%): 39.2 |
| multi-task-language-understanding-on-mmlu | BLOOM 176B (5-shot) | Average (%): 39.1 |
| multi-task-language-understanding-on-mmlu | OPT 66B (5-shot) | Average (%): 36 |
| multiple-choice-question-answering-mcqa-on-27 | BLOOM 176B (few-shot, k=3) | Accuracy: 92 |
| multiple-choice-question-answering-mcqa-on-27 | OPT 66B (few-shot, k=3) | Accuracy: 91.6 |
| multiple-choice-question-answering-mcqa-on-27 | Bloomberg GPT (few-shot, k=3) | Accuracy: 92 |
| multiple-choice-question-answering-mcqa-on-27 | GPT-NeoX (few-shot, k=3) | Accuracy: 92 |
| multiple-choice-question-answering-mcqa-on-27 | PaLM 540B (few-shot, k=3) | Accuracy: 70.8 |
| multiple-choice-question-answering-mcqa-on-28 | GPT-NeoX (few-shot, k=3) | Accuracy: 86.4 |
| multiple-choice-question-answering-mcqa-on-28 | OPT 66B (few-shot, k=3) | Accuracy: 91.2 |
| multiple-choice-question-answering-mcqa-on-28 | BLOOM 176B (few-shot, k=3) | Accuracy: 91.2 |
| multiple-choice-question-answering-mcqa-on-28 | Bloomberg GPT (few-shot, k=3) | Accuracy: 90.4 |
| multiple-choice-question-answering-mcqa-on-28 | PaLM 540B (few-shot, k=3) | Accuracy: 87.2 |
| multiple-choice-question-answering-mcqa-on-29 | Bloomberg GPT (few-shot, k=3) | Accuracy: 42 |
| multiple-choice-question-answering-mcqa-on-29 | BLOOM 176B (few-shot, k=3) | Accuracy: 50 |
| multiple-choice-question-answering-mcqa-on-29 | PaLM 540B (few-shot, k=3) | Accuracy: 62.4 |
| multiple-choice-question-answering-mcqa-on-29 | OPT 66B (few-shot, k=3) | Accuracy: 42 |
| multiple-choice-question-answering-mcqa-on-29 | GPT-NeoX (few-shot, k=3) | Accuracy: 45.2 |
| multiple-choice-question-answering-mcqa-on-30 | BLOOM 176B (few-shot, k=3) | Accuracy: 54.8 |
| multiple-choice-question-answering-mcqa-on-30 | Bloomberg GPT (few-shot, k=3) | Accuracy: 56 |
| multiple-choice-question-answering-mcqa-on-30 | GPT-NeoX (few-shot, k=3) | Accuracy: 54 |
| multiple-choice-question-answering-mcqa-on-30 | PaLM 540B (few-shot, k=3) | Accuracy: 76 |
| multiple-choice-question-answering-mcqa-on-30 | OPT 66B (few-shot, k=3) | Accuracy: 52.8 |
| natural-language-inference-on-anli-test | BLOOM 176B (one-shot) | A1: 33.6 A2: 33.8 A3: 35.17 |
| natural-language-inference-on-anli-test | OPT 66B (one-shot) | A1: 33.1 A2: 34.2 A3: 34.92 |
| natural-language-inference-on-anli-test | GPT-NeoX (one-shot) | A1: 32.6 A2: 33.8 A3: 36.17 |
| natural-language-inference-on-anli-test | Bloomberg GPT (one-shot) | A1: 32.9 A2: 34.4 A3: 37.33 |
| natural-language-inference-on-commitmentbank | OPT 66B (one-shot) | Accuracy: 44.64 |
| natural-language-inference-on-commitmentbank | GPT-NeoX (one-shot) | Accuracy: 48.21 |
| natural-language-inference-on-commitmentbank | BLOOM 176B (one-shot) | Accuracy: 48.21 |
| natural-language-inference-on-commitmentbank | Bloomberg GPT (one-shot) | Accuracy: 53.57 |
| natural-language-inference-on-rte | GPT-NeoX 20B (1-shot) | Accuracy: 53.8% |
| natural-language-inference-on-rte | Bloomberg GPT 50B (1-shot) | Accuracy: 69.3% |
| natural-language-inference-on-rte | OPT 66B (1-shot) | Accuracy: 54.9% |
| natural-language-inference-on-rte | BLOOM 176B (1-shot) | Accuracy: 57.4% |
| question-answering-on-boolq | Bloomberg GPT 50B (1-shot) | Accuracy: 74.6 |
| question-answering-on-boolq | GPT-NeoX 20B (1-shot) | Accuracy: 46.4 |
| question-answering-on-boolq | OPT 66B (1-shot) | Accuracy: 57.5 |
| question-answering-on-boolq | BLOOM 176B (1-shot) | Accuracy: 52.9 |
| question-answering-on-copa | BLOOM 176B (one-shot) | Accuracy: 84 |
| question-answering-on-copa | OPT 66B (one-shot) | Accuracy: 86 |
| question-answering-on-copa | GPT-NeoX (one-shot) | Accuracy: 88 |
| question-answering-on-copa | Bloomberg GPT (one-shot) | Accuracy: 86 |
| question-answering-on-multirc | BLOOM 176B (1-shot) | F1: 26.7 |
| question-answering-on-multirc | GPT-NeoX 20B (1-shot) | F1: 22.9 |
| question-answering-on-multirc | OPT 66B (1-shot) | F1: 18.8 |
| question-answering-on-multirc | Bloomberg GPT 50B (1-shot) | F1: 62.3 |
| question-answering-on-openbookqa | BLOOM 176B (2-shot) | Accuracy: 47.2 |
| question-answering-on-openbookqa | Bloomberg GPT 50B (1-shot) | Accuracy: 51.6 |
| question-answering-on-openbookqa | GPT-NeoX 20B (2-shot) | Accuracy: 44.2 |
| question-answering-on-openbookqa | OPT 66B (one-shot) | Accuracy: 58.0 |
| question-answering-on-piqa | OPT 66B (1-shot) | Accuracy: 77.6 |
| question-answering-on-piqa | GPT-NeoX 20B (1-shot) | Accuracy: 75.8 |
| question-answering-on-piqa | Bloomberg GPT 50B (1-shot) | Accuracy: 77.9 |
| question-answering-on-piqa | BLOOM 176B (1-shot) | Accuracy: 77 |
| reading-comprehension-on-race | BLOOM 176B (one-shot) | Accuracy (High): 39.14 Accuracy (Middle): 52.3 |
| reading-comprehension-on-race | GPT-NeoX (one-shot) | Accuracy (High): 34.33 Accuracy (Middle): 41.23 |
| reading-comprehension-on-race | OPT 66B (one-shot) | Accuracy (High): 37.02 Accuracy (Middle): 47.42 |
| reading-comprehension-on-race | Bloomberg GPT (one-shot) | Accuracy (High): 41.74 Accuracy (Middle): 54.32 |
| sarcasm-detection-on-big-bench-snarks | PaLM 540B (few-shot, k=3) | Accuracy: 78.1 |
| sarcasm-detection-on-big-bench-snarks | Bloomberg GPT (few-shot, k=3) | Accuracy: 69.66 |
| sarcasm-detection-on-big-bench-snarks | BLOOM 176B (few-shot, k=3) | Accuracy: 72.47 |
| sarcasm-detection-on-big-bench-snarks | GPT-NeoX (few-shot, k=3) | Accuracy: 62.36 |