2 months ago

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
