Common Sense Reasoning On Arc Challenge

Evaluation Metric

Accuracy
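The leaderboard's Accuracy metric for a multiple-choice benchmark such as ARC Challenge is simply the fraction of questions whose predicted answer key matches the gold key. A minimal sketch (the `preds`/`answers` lists are hypothetical example data, not drawn from the benchmark):

```python
def accuracy(predictions, gold):
    """Fraction of exact matches between predicted and gold answer keys."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical example: 3 of 4 answer keys match
preds = ["A", "C", "B", "D"]
answers = ["A", "C", "B", "A"]
print(accuracy(preds, answers))  # 0.75
```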

Evaluation Results

Performance of each model on this benchmark.

| Model | Accuracy | Paper Title | Repository |
|---|---|---|---|
| GPT-4 (few-shot, k=25) | 96.4 | GPT-4 Technical Report | - |
| PaLM 2 (few-shot, CoT, SC) | 95.1 | PaLM 2 Technical Report | - |
| Shivaay (4B, few-shot, k=8) | 91.04 | - | - |
| StupidLLM | 91.03 | - | - |
| Claude 2 (few-shot, k=5) | 91 | Model Card and Evaluations for Claude Models | - |
| Claude 1.3 (few-shot, k=5) | 90 | Model Card and Evaluations for Claude Models | - |
| PaLM 540B (Self Improvement, Self Consistency) | 89.8 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self Consistency) | 88.7 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self Improvement, CoT Prompting) | 88.3 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self Improvement, Standard-Prompting) | 87.2 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Standard-Prompting) | 87.1 | Large Language Models Can Self-Improve | - |
| ST-MoE-32B 269B (fine-tuned) | 86.5 | ST-MoE: Designing Stable and Transferable Sparse Expert Models | - |
| Claude Instant 1.1 (few-shot, k=5) | 85.7 | Model Card and Evaluations for Claude Models | - |
| PaLM 540B (CoT Prompting) | 85.2 | Large Language Models Can Self-Improve | - |
| GPT-3.5 (few-shot, k=25) | 85.2 | GPT-4 Technical Report | - |
| LLaMA 3 8B + MoSLoRA (fine-tuned) | 81.5 | Mixture-of-Subspaces in Low-Rank Adaptation | - |
| LLaMA-3 8B + MixLoRA | 79.9 | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | - |
| LLaMA-2 13B + MixLoRA | 69.9 | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | - |
| PaLM 2-L (1-shot) | 69.2 | PaLM 2 Technical Report | - |
| GAL 120B (zero-shot) | 67.9 | Galactica: A Large Language Model for Science | - |