SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu

Abstract
Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth.

We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
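The sketch below illustrates the core idea of conditioning a short chunk's embedding on its surrounding context, using an off-the-shelf BGE-M3 encoder through the sentence-transformers library. It is a minimal approximation, not the paper's method: the context-prepended template, the helper names (embed_plain, embed_situated), and the toy passage are illustrative assumptions, whereas SitEmb relies on a dedicated training paradigm to encode situated context.

# Minimal sketch of situated chunk embedding, assuming a vanilla
# BGE-M3 encoder via sentence-transformers (not the trained SitEmb model).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

def embed_plain(chunk):
    # Baseline: embed the short chunk in isolation.
    return model.encode(chunk, normalize_embeddings=True)

def embed_situated(chunk, context):
    # Situated variant: encode the chunk together with its broader
    # context window, so the vector reflects the chunk's meaning in situ.
    # The "[CHUNK]" template is a hypothetical format for illustration.
    return model.encode(f"{context}\n[CHUNK] {chunk}",
                        normalize_embeddings=True)

# Toy example: the chunk is ambiguous on its own ("the station"),
# and only the context resolves what it refers to.
context = ("Chapter 3. After the shipwreck, Marlow drifts for days "
           "before reaching the trading station upriver.")
chunk = "He finally reached the station."

query = model.encode("Where does Marlow arrive after the shipwreck?",
                     normalize_embeddings=True)
# Vectors are L2-normalized, so the dot product is cosine similarity.
print("plain   :", float(query @ embed_plain(chunk)))
print("situated:", float(query @ embed_situated(chunk, context)))

With an untrained encoder the situated score may or may not improve over the plain one; the paper's point is precisely that existing models handle such conditioning poorly, motivating the SitEmb training procedure.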