VinVL+L: Enriching Visual Representation with Location Context in VQA

Lukáš Picek, Jiří Vyskočil

Abstract

In this paper, we describe a novel method, VinVL+L, that enriches the visual representations (i.e., object tags and region features) of the state-of-the-art Vision and Language (VL) method VinVL with location information. To verify the importance of such metadata for VL models, we (i) trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features, both of which were made public to allow reproducibility and further experiments; (ii) updated the architecture of the existing VinVL method to include the new feature sets; and (iii) provide a qualitative and quantitative evaluation. By including just binary location metadata, VinVL+L provides an incremental improvement over the state-of-the-art VinVL in Visual Question Answering (VQA). VinVL+L achieved an accuracy of 64.85% on the GQA dataset, an improvement of 0.32 percentage points; the statistical significance of the new representations is verified via Approximate Randomization. The code and newly generated sets of features are available at https://github.com/vyskocj/VinVL-L.
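The abstract names Approximate Randomization as the significance test but does not give its details. The following is a minimal sketch of a generic paired, two-sided approximate randomization test on accuracy, assuming each model's output has been reduced to a per-question 0/1 correctness list. The function name `approximate_randomization` and its parameters are hypothetical and are not taken from the authors' repository.

```python
import random


def approximate_randomization(correct_a, correct_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the accuracy difference of two models.

    correct_a, correct_b: equal-length lists of 0/1 flags, one per question,
    marking whether each model answered that question correctly.
    (Hypothetical helper; a generic sketch, not the authors' implementation.)
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same questions")

    rng = random.Random(seed)

    # Observed difference in the number of correct answers. Dividing by the
    # number of questions would give the accuracy difference; the integer
    # count is compared instead to avoid floating-point ties.
    observed = abs(sum(correct_a) - sum(correct_b))

    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if a == b:
                continue  # swapping identical outcomes cannot change the difference
            # Under the null hypothesis the two labels are exchangeable,
            # so swap them with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Calling `approximate_randomization(baseline_flags, new_model_flags)` on two correctness lists (hypothetical variable names) returns an estimated p-value; a small value suggests the accuracy difference is unlikely to arise from randomly exchanging the two models' outputs.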

Benchmarks

Benchmark: visual-question-answering-on-gqa-test2019
Methodology: VinVL+L
Metrics:
Accuracy: 64.85
Binary: 82.59
Consistency: 94.0
Distribution: 4.59
Open: 49.19
Plausibility: 84.91
Validity: 96.62
