Code Generation on HumanEval

Evaluation Metric

Pass@1
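
The page names the metric but does not define it or say how each entry computed its score. As a minimal sketch, assuming the commonly used unbiased pass@k estimator introduced with the HumanEval benchmark (for k = 1 it reduces to the fraction of sampled solutions that pass all unit tests), the calculation could look like the following. The function names `pass_at_k` and `benchmark_pass_at_k` and the sample data are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass every unit test
    k: sampling budget being evaluated (k = 1 for Pass@1)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    # 1 - C(n - c, k) / C(n, k): chance that k draws are not all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average per-problem pass@k over (n, c) pairs, as a percentage."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical run: one sample per problem on a 164-problem benchmark
# (the size of HumanEval), with 162 problems solved.
example = [(1, 1)] * 162 + [(1, 0)] * 2
print(round(benchmark_pass_at_k(example, k=1), 1))
```

With a single sample per problem, Pass@1 is simply the percentage of problems solved, which is why scores on a benchmark of this size move in steps of roughly 0.6 points.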

Evaluation Results

Performance of each model on this benchmark

| Model | Pass@1 | Paper Title | Repository |
| --- | --- | --- | --- |
| Llama-3 8B (HPT) | 100 | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models | |
| Claude 3.5 Sonnet (HPT) | 100 | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models | |
| LLMDebugger (OpenAI o1) | 99.4 | Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | |
| CodeSim (o3-mini) | 98.8 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| QualityFlow (Sonnet-3.5) | 98.8 | QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | - |
| Nexus (Claude 3.5 Sonnet) | 98.8 | Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | |
| LLMDebugger (GPT 4o) | 98.2 | Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | |
| LPW (GPT-4o) | 98.2 | Planning-Driven Programming: A Large Language Model Programming Workflow | |
| CodeSim (GPT-4o and LDB Debugger) | 97.6 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| MGDebugger (DeepSeek-Coder-V2-Lite) | 96.3 | From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | |
| AgentCoder (GPT-4) | 96.3 | AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | |
| CodeSim (GPT-4o) | 95.1 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| AFlow (GPT-4o-mini) | 94.7 | AFlow: Automating Agentic Workflow Generation | |
| MapCoder (GPT-4) | 93.9 | MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | |
| Claude 3.5 Sonnet (0-shot) | 92.0 | - | - |
| FractalResearch : Pioneer-SWO (GPT-4-turbo) | 91.65 | - | - |
| L2MAC (GPT-4) | 90.2 | L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | |
| GPT-4o (0-shot) | 90.2 | Claude 3.5 Sonnet Model Card Addendum | - |
| OctoCoder (GPT-4) | 86.6 | OctoPack: Instruction Tuning Code Large Language Models | |
| Spark_FP16_medium_v4.1.1 | 85.97 | - | - |