Date

a month ago

Organization

Paper URL

2512.17495

License

Other

Tags

LLM

Multimodal

Natural Language Processing

GroundingME is a visual grounding evaluation dataset for multimodal large language models (MLLMs), released in 2025 by Tsinghua University in collaboration with Xiaomi and the University of Hong Kong, among other institutions. The associated paper is "GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation". The dataset aims to systematically evaluate a model's ability to accurately map natural language to visual targets in complex real-world scenarios, with particular attention to understanding and safety performance in situations involving ambiguous references, complex spatial relationships, small targets, occlusion, and descriptions that have no valid referent in the image.

This dataset contains 1,005 evaluation samples. The images are sourced from two high-quality datasets, SA-1B and HR-Bench, and only the original images were used to construct the tasks, to avoid data contamination. The samples cover four primary task categories: discriminative reference (204 samples, 20.3%), spatial relationship understanding (300 samples, 29.9%), restricted visibility scenes (300 samples, 29.9%), and non-referential rejection (201 samples, 20.0%). These are further subdivided into 12 secondary sub-tasks with a balanced overall distribution. The dataset involves 241 real-world object classes. A single image often contains many objects of the same class, and object instances usually occupy a small proportion of the image. The language descriptions are significantly longer than those in existing visual grounding datasets. Together, these properties make the visual grounding task harder along multiple dimensions.
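The category shares quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary keys are shorthand labels chosen here, not official field names from the dataset, and the counts are taken directly from the description.

```python
# Sample counts per primary category, as stated in the description.
# The key names are illustrative shorthand, not official dataset fields.
counts = {
    "discriminative": 204,
    "spatial": 300,
    "restricted_visibility": 300,
    "rejection": 201,
}

# Total number of evaluation samples across the four categories.
total = sum(counts.values())

# Share of each category as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:<22} {n:>4}  {shares[name]:.1f}%")
print(f"{'total':<22} {total:>4}")
```

Because each share is rounded independently, the four percentages add up to slightly more than 100.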
