Multimodal Few-Shot Learning with Frozen Language Models

Maria Tsimpoukelli Jacob Menick Serkan Cabi S. M. Ali Eslami Oriol Vinyals Felix Hill

Abstract

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
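The core idea in the abstract — train only a vision encoder that emits continuous "prefix" embeddings, and feed them into a frozen pre-trained language model so gradients from a captioning loss flow through the frozen LM back to the encoder — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the tiny convolutional backbone, and the toy transformer standing in for the pre-trained LM are all placeholders (the paper uses an NF-ResNet-50 encoder and a 7-billion-parameter autoregressive LM).

```python
import torch
import torch.nn as nn

# Illustrative dimensions only (the real model is far larger).
D_MODEL = 32    # LM embedding width
N_PREFIX = 2    # number of visual prefix tokens (Frozen uses 2)
VOCAB = 100

class VisionPrefixEncoder(nn.Module):
    """Maps an image to N_PREFIX continuous embeddings in the LM's input space."""
    def __init__(self):
        super().__init__()
        # Stand-in for the real vision backbone: tiny conv + pooling.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_prefix = nn.Linear(8, N_PREFIX * D_MODEL)

    def forward(self, images):
        feats = self.backbone(images)                      # (B, 8)
        return self.to_prefix(feats).view(-1, N_PREFIX, D_MODEL)

# Stand-in for the pre-trained LM; a real one is autoregressive (causally masked).
embed = nn.Embedding(VOCAB, D_MODEL)
lm_body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=1
)
lm_head = nn.Linear(D_MODEL, VOCAB)
for module in (embed, lm_body, lm_head):
    for p in module.parameters():
        p.requires_grad = False                            # LM weights stay frozen

encoder = VisionPrefixEncoder()                            # the only trainable part

images = torch.randn(4, 3, 16, 16)
caption_ids = torch.randint(0, VOCAB, (4, 5))
prefix = encoder(images)                                   # (4, N_PREFIX, D_MODEL)
tokens = embed(caption_ids)                                # (4, 5, D_MODEL)
seq = torch.cat([prefix, tokens], dim=1)                   # image prefix, then text
logits = lm_head(lm_body(seq))

# Next-token captioning loss on the text positions; gradients flow *through*
# the frozen LM and update only the vision encoder.
loss = nn.functional.cross_entropy(
    logits[:, N_PREFIX:-1].reshape(-1, VOCAB),
    caption_ids[:, 1:].reshape(-1),
)
loss.backward()
```

At few-shot inference time, nothing is trained: several (image-prefix, text) pairs are simply interleaved into one long sequence and the frozen LM completes it, which is what gives the system its multimodal in-context learning behaviour.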

Benchmarks

Benchmark                                   Method   Metric
visual-question-answering-on-ok-vqa         Frozen   Accuracy: 5.9
visual-question-answering-on-vqa-v2-val     Frozen   Accuracy: 29.5
