Instruction Following on IFEval
Evaluation Metrics
Inst-level loose-accuracy
Inst-level strict-accuracy
Prompt-level loose-accuracy
Prompt-level strict-accuracy
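The four metrics differ along two axes. As commonly described for IFEval, prompt-level accuracy counts a prompt as correct only when every instruction in it is followed, while instruction-level accuracy counts each instruction separately. Strict and loose differ only in how a single instruction is judged (loose also accepts lightly normalized variants of the response). The sketch below illustrates the prompt-level versus instruction-level aggregation only; the function name `ifeval_accuracies` and the input layout are assumptions made for illustration, not the benchmark's official code.

```python
from typing import List, Tuple


def ifeval_accuracies(results: List[List[bool]]) -> Tuple[float, float]:
    """Aggregate per-instruction pass/fail flags into two accuracies.

    results[i][j] is True if instruction j of prompt i was judged as followed
    (under either the strict or the loose criterion; that choice is made
    upstream and is not modeled here).
    Returns (prompt_level_accuracy, instruction_level_accuracy).
    """
    total_prompts = len(results)
    total_instructions = sum(len(flags) for flags in results)

    # A prompt counts only if all of its instructions were followed.
    prompts_ok = sum(1 for flags in results if all(flags))
    # Each instruction counts on its own.
    instructions_ok = sum(sum(flags) for flags in results)

    return prompts_ok / total_prompts, instructions_ok / total_instructions


# Hypothetical judgments for three prompts with 2, 2, and 1 instructions.
example = [[True, True], [True, False], [True]]
prompt_acc, inst_acc = ifeval_accuracies(example)
```

With these hypothetical flags, two of three prompts pass fully while four of five instructions pass, which is why instruction-level scores are never lower than the corresponding prompt-level scores.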
Results
Performance of each model on this benchmark
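| Model | Inst-level loose-accuracy (%) | Inst-level strict-accuracy (%) | Prompt-level loose-accuracy (%) | Prompt-level strict-accuracy (%) | Paper Title | Repository |
|---|---|---|---|---|---|---|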
| AutoIF (Llama3 70B) | 90.4 | 86.7 | 85.6 | 80.2 | Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models | |
| AutoIF (Qwen2 72B) | 88 | 86.1 | 82.3 | 80.2 | Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models | |
| GPT-4 | 85.37 | 83.57 | 79.3 | 76.89 | Instruction-Following Evaluation for Large Language Models | |
| PaLM 2 S | 59.11 | 55.76 | 46.95 | 43.07 | Instruction-Following Evaluation for Large Language Models | |