
VinVL+L: Enriching Visual Representation with Location Context in VQA

Lukáš Picek, Jiří Vyskočil

Abstract

In this paper, we describe a novel method, VinVL+L, that enriches the visual representations (i.e., object tags and region features) of the state-of-the-art Vision and Language (VL) method VinVL with location information. To verify the importance of such metadata for VL models, we (i) trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features, both of which were made public to allow reproducibility and further experiments, (ii) updated the architecture of the existing VinVL method to include the new feature sets, and (iii) provide a qualitative and quantitative evaluation. By including just binary location metadata, VinVL+L provides an incremental improvement over the state-of-the-art VinVL in Visual Question Answering (VQA). VinVL+L achieved an accuracy of 64.85% on the GQA dataset, improving accuracy by +0.32%; the statistical significance of the new representations is verified via Approximate Randomization. The code and newly generated sets of features are available at https://github.com/vyskocj/VinVL-L.
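The abstract does not spell out how the Approximate Randomization test was set up. As a rough illustration only, a paired version of the test for two VQA systems evaluated on the same questions might look like the sketch below. The function name, the 0/1 per-question correctness lists, the number of trials, and the two-sided accuracy-difference statistic are all assumptions made for this example, not details taken from the paper or its repository.

```python
import random


def approximate_randomization(correct_a, correct_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the accuracy difference of two systems.

    correct_a, correct_b: per-question 0/1 correctness for systems A and B,
    aligned so that index i refers to the same question (hypothetical inputs).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")

    rng = random.Random(seed)
    # Observed difference in the number of correct answers. Comparing counts
    # is equivalent to comparing accuracies because both share the same n.
    observed = abs(sum(correct_a) - sum(correct_b))

    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap the two outcomes for this question with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)


if __name__ == "__main__":
    # Toy data: system A answers 8 of 10 questions correctly, system B 5 of 10.
    system_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
    system_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    print(approximate_randomization(system_a, system_b, trials=2000))
```

A small p-value suggests that a gap as large as the observed one would rarely arise if the two systems were interchangeable. With a gain as small as +0.32%, a test of this kind is only informative on an evaluation set with many questions.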

Benchmarks

Benchmark	Methodology	Metrics
visual-question-answering-on-gqa-test2019	VinVL+L	Accuracy: 64.85; Binary: 82.59; Consistency: 94.0; Distribution: 4.59; Open: 49.19; Plausibility: 84.91; Validity: 96.62
