6 months ago

Abstract

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching for and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans by more than 15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.

Mo Yu Tsz Ting Chung Chulun Zhou Tong Li Rui Lu Jiangnan Li Liyan Xu Haoshu Lu Ning Zhang Jing Li

PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
