FLAVA: A Foundational Language And Vision Alignment Model

Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

Abstract

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to obtain good performance on a variety of downstream tasks. Such models are typically either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model, a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
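
As a rough illustration of the cross-modal (contrastive) objective mentioned in the abstract, the sketch below implements a CLIP-style symmetric image-text contrastive loss in PyTorch. The embedding dimension, batch size, and temperature are placeholder values for illustration, not FLAVA's actual configuration, and the random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: [batch, dim] projections from the image and text
    encoders; the pair at the same batch index is the positive, every other
    combination in the batch acts as a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine-similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs.
img = torch.randn(8, 768)
txt = torch.randn(8, 768)
print(contrastive_loss(img, txt).item())
```

A multi-modal (fusion) model would instead feed both modalities through a joint encoder before the task head; FLAVA's design combines both kinds of objective in a single model.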

Code Repositories

social-ai-studio/matk (PyTorch)
apsdehal/flava-tutorials
facebookresearch/multimodal (PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| image-retrieval-on-coco | FLAVA (zero-shot) | Recall@1: 38.38, Recall@5: 67.47 |
| image-retrieval-on-coco | CLIP (zero-shot) | Recall@1: 33.29, Recall@5: 62.47 |
| image-to-text-retrieval-on-coco | FLAVA (ViT-B, zero-shot) | Recall@1: 42.74, Recall@5: 76.76 |
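
For reference, Recall@k in the table counts how often the ground-truth caption (or image) appears among the top-k retrieved candidates. The minimal sketch below computes it from a precomputed similarity matrix; the matrix shape and variable names are illustrative and not taken from the FLAVA evaluation code.

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Recall@k for retrieval given a [num_queries, num_candidates] similarity
    matrix where the ground-truth match for query i is candidate i."""
    topk = similarity.topk(k, dim=-1).indices               # [num_queries, k]
    targets = torch.arange(similarity.size(0)).unsqueeze(-1)
    hits = (topk == targets).any(dim=-1).float()             # 1 if target is in the top-k
    return hits.mean().item()

# Example: 5 queries scored against 5 candidates (diagonal is ground truth).
sim = torch.randn(5, 5)
print(recall_at_k(sim, k=1), recall_at_k(sim, k=5))
```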
