4 months ago

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Abstract

Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that based on o1-preview, our code agent framework successfully solves only 21.3% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation -- especially in prior text-to-SQL benchmarks -- they require significant improvement in order to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents crucial steps towards developing intelligent, autonomous, code agents for real-world enterprise settings. Our code, baseline models, and data are available at https://spider2-sql.github.io

Benchmarks

Benchmark                   Methodology                         Metrics (Success Rate)
text-to-sql-on-spider-2-0   Spider-Agent + Claude-3.5-Sonnet    9.02
text-to-sql-on-spider-2-0   Spider-Agent + GPT-4o               10.13
text-to-sql-on-spider-2-0   Spider-Agent + DeepSeek-V2.5        5.22
text-to-sql-on-spider-2-0   Spider-Agent + Qwen2.5-72B          6.17
text-to-sql-on-spider-2-0   Spider-Agent + GPT-4                8.86
text-to-sql-on-spider-2-0   Spider-Agent + Gemini-Pro-1.5       2.53
text-to-sql-on-spider-2-0   Spider-Agent + Llama-3.1-405B       2.21
text-to-sql-on-spider-2-0   Spider-Agent + o1-preview           17.03
