3 months ago

WideSearch: Benchmarking Agentic Broad Info-Seeking

Abstract

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
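
To illustrate why success rates can sit near 0% even when most collected items are right, here is a minimal sketch of an all-or-nothing check over atomic cells. It is not the benchmark's released evaluation pipeline: the function name `score_task`, the `(row key, column)` cell representation, and the exact-match normalization are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical representation: one atomic fact per (row key, column name) cell.
Cell = Tuple[str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(value.strip().lower().split())


def score_task(predicted: Dict[Cell, str], truth: Dict[Cell, str]) -> dict:
    """Compare an agent's table against a ground-truth table, cell by cell.

    Assumption: a task counts as a success only if every ground-truth cell
    is present and correct and no extra cells are returned.
    """
    correct = sum(
        1
        for cell, value in truth.items()
        if cell in predicted and normalize(predicted[cell]) == normalize(value)
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    success = bool(truth) and correct == len(truth) and len(predicted) == len(truth)
    return {"precision": precision, "recall": recall, "f1": f1, "success": success}


# Toy data, invented for this example only.
truth = {
    ("Alpha", "year"): "2019",
    ("Alpha", "city"): "Oslo",
    ("Beta", "year"): "2021",
    ("Beta", "city"): "Lima",
}
# Three of four cells are right; one is wrong.
predicted = dict(truth)
predicted[("Beta", "city")] = "Quito"

result = score_task(predicted, truth)
```

Under these assumptions, a single wrong cell out of four leaves item-level precision and recall at 0.75 but makes the task a failure, which is the kind of gap between partial correctness and complete correctness that a strict success rate exposes.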
