Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, Kai-Wei Chang

Abstract

Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are shared only locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than that for the Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.
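
The region-wise comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record layout `(region, predicted, gold)`, the helper names `accuracy_by_region` and `gap_to_reference`, and the definition of the gap as a region's accuracy minus the Western accuracy are all assumptions made for illustration, and the records below are toy values, not results from GD-VCR.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Return accuracy (in percent) per region.

    `records` is an iterable of (region, predicted_choice, gold_choice)
    tuples. This layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, predicted, gold in records:
        total[region] += 1
        correct[region] += int(predicted == gold)
    return {region: 100.0 * correct[region] / total[region] for region in total}


def gap_to_reference(accuracy, reference="West"):
    """Return each region's accuracy minus the reference region's accuracy.

    Assumed definition: a negative value means the region scores
    below the reference region.
    """
    ref = accuracy[reference]
    return {r: a - ref for r, a in accuracy.items() if r != reference}


# Toy multiple-choice predictions (answer indices 0-3); NOT real GD-VCR data.
toy_records = [
    ("West", 0, 0), ("West", 1, 1), ("West", 2, 2), ("West", 3, 0),
    ("East Asia", 0, 0), ("East Asia", 1, 2), ("East Asia", 3, 1), ("East Asia", 2, 0),
    ("Africa", 1, 1), ("Africa", 0, 3), ("Africa", 2, 2), ("Africa", 3, 1),
]

region_accuracy = accuracy_by_region(toy_records)
region_gap = gap_to_reference(region_accuracy)

for region, acc in region_accuracy.items():
    print(f"{region}: accuracy = {acc:.2f}")
for region, gap in region_gap.items():
    print(f"{region}: gap vs. West = {gap:+.2f}")
```

Grouping by region before computing accuracy keeps the comparison independent of how many examples each region contributes, which matters when region sizes differ.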

Code Repositories

wadeyin9712/gd-vcr (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Accuracy | Gap (West) |
| visual-commonsense-reasoning-on-gd-vcr | VisualBERT | 53.95 | -10.42 |
| visual-commonsense-reasoning-on-gd-vcr | Human | 88.84 | not listed |
| visual-commonsense-reasoning-on-gd-vcr | ViLBERT | 59.99 | -7.28 |
| visual-commonsense-reasoning-on-gd-vcr | Text-only BERT | 35.33 | not listed |