FLAVA: A Foundational Language And Vision Alignment Model

Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

Abstract
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to obtain good performance on a variety of downstream tasks. Such models are generally either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model, as a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
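To make the cross-modal versus multi-modal distinction concrete, the PyTorch sketch below pairs unimodal image and text encoders, whose pooled and projected embeddings can be compared contrastively (cross-modal), with a fusion encoder over the concatenated unimodal sequences (multi-modal, early fusion). All module names, dimensions, and the pooling choice are illustrative assumptions, not FLAVA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_encoder(dim: int, num_heads: int, num_layers: int) -> nn.TransformerEncoder:
    # Plain transformer encoder; FLAVA's actual encoders differ in detail.
    layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers)


class UnifiedVisionLanguageModel(nn.Module):
    """Illustrative sketch of the architecture style the abstract describes:
    unimodal encoders usable contrastively plus a fusion encoder on top.
    Inputs are assumed to be pre-embedded patch/token sequences whose
    first position acts as a [CLS]-style summary token."""

    def __init__(self, dim: int = 768, proj_dim: int = 256,
                 num_heads: int = 8, num_layers: int = 4):
        super().__init__()
        self.image_encoder = make_encoder(dim, num_heads, num_layers)
        self.text_encoder = make_encoder(dim, num_heads, num_layers)
        self.fusion_encoder = make_encoder(dim, num_heads, num_layers)
        # Projections into a shared space for CLIP-style contrastive training.
        self.image_proj = nn.Linear(dim, proj_dim)
        self.text_proj = nn.Linear(dim, proj_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ log(1/0.07)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor):
        img = self.image_encoder(image_tokens)  # (B, Ni, D) patch states
        txt = self.text_encoder(text_tokens)    # (B, Nt, D) token states

        # Cross-modal path: pooled, projected, L2-normalized embeddings
        # compared across the batch with a learned temperature.
        img_emb = F.normalize(self.image_proj(img[:, 0]), dim=-1)
        txt_emb = F.normalize(self.text_proj(txt[:, 0]), dim=-1)
        contrastive_logits = self.logit_scale.exp() * img_emb @ txt_emb.t()

        # Multi-modal path: early fusion of both sequences, e.g. for
        # VQA-style heads that need joint reasoning over image and text.
        fused = self.fusion_encoder(torch.cat([img, txt], dim=1))
        return contrastive_logits, fused
```

The contrastive logits support retrieval-style (cross-modal) tasks, while the fused sequence feeds task heads that need both modalities at once; the sketch only shows this data flow, not FLAVA's pretraining objectives.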
Benchmarks
| Benchmark | Model | Recall@1 | Recall@5 |
|---|---|---|---|
| Image retrieval on COCO | FLAVA (zero-shot) | 38.38 | 67.47 |
| Image retrieval on COCO | CLIP (zero-shot) | 33.29 | 62.47 |
| Image-to-text retrieval on COCO | FLAVA (ViT-B, zero-shot) | 42.74 | 76.76 |
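The table reports zero-shot retrieval quality as Recall@K: the fraction of queries whose correct match appears among the top K retrieved items. Below is a minimal sketch of that metric, assuming one correct gallery item per query (COCO in practice pairs each image with several captions) and using random embeddings purely as placeholders; the function name and setup are illustrative, not the paper's evaluation code.

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Given an (N_queries, N_gallery) similarity matrix where the correct
    match for query i is gallery item i, return Recall@k in percent.
    Generic retrieval-metric sketch, not the paper's evaluation code."""
    topk = np.argsort(-similarity, axis=1)[:, :k]          # top-k gallery ids per query
    targets = np.arange(similarity.shape[0])[:, None]       # ground-truth id per query
    return 100.0 * (topk == targets).any(axis=1).mean()


# Example: text-to-image retrieval from unit-normalized placeholder embeddings.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((1000, 256))
image_emb = rng.standard_normal((1000, 256))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
sim = text_emb @ image_emb.T
print(recall_at_k(sim, 1), recall_at_k(sim, 5))
```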