Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe Satya Narayan Shukla Omid Poursaeed Michael S. Ryoo Tsung-Yu Lin

Abstract

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
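To make the idea of an image-space coordinate-based instruction more concrete, the sketch below shows one way a bounding box could be turned into text for an instruction/answer pair. It is only an illustration under stated assumptions: the abstract does not specify the coordinate format or prompt wording, so the normalized `[x1, y1, x2, y2]` string, the prompt template, and the helper names `normalize_box`, `box_to_text`, `location_sample`, and `left_or_right` are all hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: none of these helpers or formats come from the paper.

def normalize_box(box, width, height):
    """Scale a pixel-space (x1, y1, x2, y2) box to the [0, 1] range."""
    x1, y1, x2, y2 = box
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def box_to_text(box, width, height, digits=2):
    """Render a normalized box as a plain-text coordinate string (assumed format)."""
    coords = normalize_box(box, width, height)
    return "[" + ", ".join(f"{c:.{digits}f}" for c in coords) + "]"


def location_sample(label, box, width, height):
    """Build one instruction/answer pair asking where an object is (assumed template)."""
    return {
        "instruction": f"Where is the {label} in the image?",
        "answer": box_to_text(box, width, height),
    }


def left_or_right(box, width):
    """Answer a left-vs-right question from the horizontal centre of the box."""
    x1, _, x2, _ = box
    return "left" if (x1 + x2) / 2 < width / 2 else "right"


if __name__ == "__main__":
    sample = location_sample("dog", (64, 48, 320, 240), width=640, height=480)
    print(sample["instruction"], sample["answer"])
    print(left_or_right((64, 48, 320, 240), width=640))
```

Normalizing by image size keeps the text targets independent of resolution, which is one plausible reason to prefer it over raw pixel values; the abstract itself only states that the authors compared coordinate representations, not which one they selected.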

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| video-question-answering-on-activitynet-qa | LocVLM-Vid-B+ | Accuracy: 38.2 |
| video-question-answering-on-activitynet-qa | LocVLM-Vid-B | Accuracy: 37.4 |
| video-question-answering-on-msr-vtt | LocVLM-Vid-B | Accuracy: 51.2 |
| video-question-answering-on-msvd-qa | LocVLM-Vid-B | Accuracy: 66.1 |
| video-question-answering-on-tgif-qa | LocVLM-Vid-B | Accuracy: 51.8 |
| visual-question-answering-on-gqa-1 | LocVLM-L | Accuracy: 50.2 |
| visual-question-answering-on-vqa-v2-test-dev-1 | LocVLM-L | Accuracy: 56.2 |
| visual-question-answering-on-vqa-v2-val-1 | LocVLM-L | Accuracy: 55.9 |
