Arithmetic Reasoning on GSM8K

Evaluation Metrics

Accuracy
Parameters (Billion)
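
Accuracy on GSM8K is usually computed as exact match on the final numeric answer. The following is a minimal sketch of that computation, not the scoring script used by any entry on this page. It assumes reference solutions end with a `#### <answer>` marker, as in the public GSM8K release, and that a prediction without the marker is scored on the last number it contains. The helper names `extract_final_answer` and `accuracy` are hypothetical.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*\.?\d*")

def extract_final_answer(text):
    """Return the final numeric answer in `text` as a float, or None.

    Assumption: if a '####' marker is present, the answer is the first
    number after the last marker; otherwise the last number in the text.
    """
    marker = text.rfind("####")
    tail = text[marker + 4:] if marker != -1 else text
    numbers = NUMBER.findall(tail)
    if not numbers:
        return None
    pick = numbers[0] if marker != -1 else numbers[-1]
    return float(pick.replace(",", ""))

def accuracy(predictions, references):
    """Percentage of predictions whose final answer matches the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_final_answer(pred), extract_final_answer(ref)
        if p is not None and r is not None and abs(p - r) < 1e-6:
            correct += 1
    return 100.0 * correct / len(references)
```

Individual papers may normalize answers differently (units, fractions, rounding), so scores produced this way are only roughly comparable with the reported numbers.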

Evaluation Results

Performance of each model on this benchmark

| Model | Accuracy | Parameters (Billion) | Paper Title | Repository |
| Claude 3.5 Sonnet (HPT) | 97.72 | - | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models | |
| Qwen2-Math-72B-Instruct (greedy) | 96.7 | 72 | Qwen2 Technical Report | |
| SFT-Mistral-7B (Metamath, OVM, Smart Ensemble) | 96.4 | 7 | - | - |
| OpenMath2-Llama3.1-70B (majority@256) | 96.0 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | |
| Jiutian-大模型 | 95.2 | 75 | - | - |
| DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | 95.1 | 7 | - | - |
| Claude 3 Opus (0-shot chain-of-thought) | 95 | - | The Claude 3 Model Family: Opus, Sonnet, Haiku | - |
| OpenMath2-Llama3.1-70B | 94.9 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | |
| GPT-4 (Teaching-Inspired) | 94.8 | - | Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | |
| SFT-Mistral-7B (Metamath + ovm + ensemble) | 94.13 | 7 | - | - |
| OpenMath2-Llama3.1-8B (majority@256) | 94.1 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | |
| Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | 94.0 | - | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | |
| DAMOMath-7B (MetaMath, OVM, Ensemble) | 93.2 | 7 | - | - |
| Claude 3 Sonnet (0-shot chain-of-thought) | 92.3 | - | The Claude 3 Model Family: Opus, Sonnet, Haiku | - |
| AlphaLLM (with MCTS) | 92 | 70 | Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | |
| OpenMath2-Llama3.1-8B | 91.7 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | |
| PaLM 2 (few-shot, k=8, SC) | 91.0 | - | PaLM 2 Technical Report | |
| GaC (Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 90.91 | - | Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | |
| OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 90.8 | 70 | OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | |
| DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 90.4 | 70 | DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | |
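
Several entries above are labeled "majority@256" or "SC" (self-consistency), meaning the reported answer is chosen by voting over multiple sampled solutions. The sketch below shows one plain way to do such a vote, assuming each sample has already been reduced to a final numeric answer (or `None` when no answer could be parsed). It is an illustration only; the page does not describe how any listed model implemented its voting, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Return the most frequent answer among the samples, or None.

    Assumptions: unparsable samples are passed as None and ignored;
    ties go to the answer that appeared first in the list.
    """
    votes = Counter(a for a in sampled_answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Example: four sampled solutions to one problem, one of them unparsable.
final_answer = majority_vote([18.0, 17.0, 18.0, None])
```

With k samples per problem, the voted answer is then scored against the reference in the same way as a single greedy answer.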