| Gemini 2.0 Flash Experimental | 89.7 | - | - |
| Qwen2.5-Math-72B-Instruct(TIR,Greedy) | 88.1 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
| Qwen2.5-Math-72B-Instruct(COT,Greedy) | 85.9 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
| Qwen2.5-Math-7B-Instruct(TIR,Greedy) | 85.2 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
| GPT-4-code model (CSV, w/ code, SC, k=16) | 84.3 | Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | |
| Qwen2-Math-72B-Instruct(greedy) | 84.0 | Qwen2 Technical Report | |
| Qwen2.5-Math-7B-Instruct(COT,Greedy) | 83.6 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
| Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) | 79.9 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
| Qwen2.5-Math-1.5B-Instruct(COT,Greedy) | 75.8 | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | - |
| CR (GPT-4-turbo model, w/ code) | 72.2 | Cumulative Reasoning with Large Language Models | |
| Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code) | 70.8 | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | |