Multi-Task Language Understanding on BBH-NLP
Evaluation Metric
Average (%)
Results
Performance of each model on this benchmark.
| Model | Average (%) | Paper Title | Repository |
|---|---|---|---|
| Qwen2.5-72B | 86.3 | - | - |
| Jiutian (large model) | 86.1 | - | - |
| Llama-3-405B | 85.9 | - | - |
| Jiutian-57B | 84.07 | - | - |
| Qwen2-72B | 82.4 | - | - |
| Llama-3-70B | 81.0 | - | - |
| Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | 78.4 | Scaling Instruction-Finetuned Language Models | - |
| PaLM 540B (CoT + self-consistency) | 78.2 | Scaling Instruction-Finetuned Language Models | - |
| code-davinci-002 175B (CoT) | 73.5 | Evaluating Large Language Models Trained on Code | - |
| Flan-PaLM 540B (3-shot, fine-tuned, CoT) | 72.4 | Scaling Instruction-Finetuned Language Models | - |
| PaLM 540B (CoT) | 71.2 | Scaling Instruction-Finetuned Language Models | - |
| Flan-PaLM 540B (5-shot, fine-tuned) | 70.0 | Scaling Instruction-Finetuned Language Models | - |
| PaLM 540B | 62.7 | Scaling Instruction-Finetuned Language Models | - |
| Orca 2-13B | 50.18 | Orca 2: Teaching Small Language Models How to Reason | - |
| Orca 2-7B | 45.93 | Orca 2: Teaching Small Language Models How to Reason | - |
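The Average (%) column is presumably the unweighted (macro) mean of per-task accuracy across the BBH NLP subtasks. A minimal sketch of that computation, with illustrative task names and placeholder scores that are not taken from this leaderboard:

```python
# Minimal sketch: macro-average accuracy over BBH subtasks.
# Task names and scores below are illustrative placeholders,
# not actual results from this leaderboard.
per_task_accuracy = {
    "snarks": 0.81,
    "sports_understanding": 0.88,
    "causal_judgement": 0.64,
}

# Unweighted mean across tasks, reported as a percentage.
average_pct = 100 * sum(per_task_accuracy.values()) / len(per_task_accuracy)
print(f"Average (%): {average_pct:.1f}")
```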