SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu

Abstract
Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth.

We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
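The sketch below illustrates the core idea of conditioning a short chunk's embedding on its surrounding context, using an off-the-shelf BGE-M3 encoder through the sentence-transformers library. It is a minimal approximation, not the paper's method: the context-prepended template, the helper names (embed_plain, embed_situated), and the toy passage are illustrative assumptions, whereas SitEmb relies on a dedicated training paradigm to encode situated context.

# Minimal sketch of situated chunk embedding, assuming a vanilla
# BGE-M3 encoder via sentence-transformers (not the trained SitEmb model).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

def embed_plain(chunk):
    # Baseline: embed the short chunk in isolation.
    return model.encode(chunk, normalize_embeddings=True)

def embed_situated(chunk, context):
    # Situated variant: encode the chunk together with its broader
    # context window, so the vector reflects the chunk's meaning in situ.
    # The "[CHUNK]" template is a hypothetical format for illustration.
    return model.encode(f"{context}\n[CHUNK] {chunk}",
                        normalize_embeddings=True)

# Toy example: the chunk is ambiguous on its own ("the station"),
# and only the context resolves what it refers to.
context = ("Chapter 3. After the shipwreck, Marlow drifts for days "
           "before reaching the trading station upriver.")
chunk = "He finally reached the station."

query = model.encode("Where does Marlow arrive after the shipwreck?",
                     normalize_embeddings=True)
# Vectors are L2-normalized, so the dot product is cosine similarity.
print("plain   :", float(query @ embed_plain(chunk)))
print("situated:", float(query @ embed_situated(chunk, context)))

With an untrained encoder the situated score may or may not improve over the plain one; the paper's point is precisely that existing models handle such conditioning poorly, motivating the SitEmb training procedure.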