Visual Coreference Resolution in Visual Dialog using Neural Module Networks

Satwik Kottur; José M. F. Moura; Devi Parikh; Dhruv Batra; Marcus Rohrbach

Abstract

Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., 'it'), as the dialog agent must first link it to a previous coreference (e.g., 'boat'), and only then can rely on the visual grounding of the coreference 'boat' to reason about the pronoun 'it'. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question; and not explicitly at a phrase level of granularity. In this work, we propose a neural module network architecture for visual dialog by introducing two novel modules - Refer and Exclude - that perform explicit, grounded, coreference resolution at a finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-wise complex dataset, by achieving near perfect accuracy, and on VisDial, a large and challenging visual dialog dataset on real images, where our model outperforms other approaches, and is more interpretable, grounded, and consistent qualitatively.

Code Repositories

facebookresearch/corefnmn (framework: tf; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
common-sense-reasoning-on-visual-dialog-v0-9 | NMN [kottur2018visual] | 1 in 10 R@5: 80.1
visual-dialog-on-visdial-v09-val | CorefNMN | MRR: 63.6; Mean Rank: 4.53; R@1: 50.24; R@5: 79.81; R@10: 88.51
visual-dialog-on-visdial-v09-val | CorefNMN (ResNet-152) | MRR: 64.1; Mean Rank: 4.45; R@1: 50.92; R@5: 80.18; R@10: 88.81
visual-dialog-on-visual-dialog-v1-0-test-std | CorefNMN (ResNet-152) | MRR (x 100): 61.50; Mean: 4.40; NDCG (x 100): 54.70; R@1: 47.55; R@5: 78.10; R@10: 88.80
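
For readers unfamiliar with the retrieval metrics reported above, the following is a minimal sketch of how MRR, Mean Rank, and R@k are commonly derived from the rank that a model assigns to the ground-truth answer for each question. It is not taken from the authors' code: the function name `retrieval_metrics` and the sample ranks are hypothetical, and it assumes ranks are 1-based and that MRR and R@k are reported as percentages, in line with the "(x 100)" labels in the table. NDCG is left out because it additionally needs per-candidate relevance scores.

```python
from typing import Dict, Sequence


def retrieval_metrics(
    gt_ranks: Sequence[int], ks: Sequence[int] = (1, 5, 10)
) -> Dict[str, float]:
    """Summarise 1-based ranks of the ground-truth answer (one per question).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if not gt_ranks:
        raise ValueError("need at least one rank")
    if any(r < 1 for r in gt_ranks):
        raise ValueError("ranks must be 1-based (best rank is 1)")

    n = len(gt_ranks)
    metrics = {
        # Mean reciprocal rank, scaled to a percentage (assumption: x 100).
        "MRR": 100.0 * sum(1.0 / r for r in gt_ranks) / n,
        # Average position of the ground-truth answer; lower is better.
        "Mean Rank": sum(gt_ranks) / n,
    }
    for k in ks:
        # Recall@k: share of questions whose ground truth is in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in gt_ranks) / n
    return metrics


# Made-up ranks for four questions, purely to exercise the function.
example = retrieval_metrics([1, 2, 4, 20])
print(example)
```

Higher is better for MRR and R@k, while lower is better for Mean Rank, which is why a method can improve on one column of the table and regress slightly on another.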
