ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar; Alaaeldin El-Nouby; Zhuang Liu; Mannat Singh; Kalyan Vasudev Alwala; Armand Joulin; Ishan Misra

Abstract
We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities simply by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results that outperform prior work, and demonstrate that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
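To make the binding idea concrete, here is a minimal sketch of aligning one extra modality (audio) to the image embedding space with a standard InfoNCE contrastive objective over image-paired data only. The stand-in encoders, dimensions, and function names (`image_encoder`, `audio_encoder`, `infonce_loss`) are hypothetical placeholders, not the paper's actual model or code:

```python
# Sketch of ImageBind-style binding: align an audio tower to the image
# embedding space using only (image, audio) pairs and an InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 8  # toy size; the real model uses large Transformer encoders

# Hypothetical per-modality encoders standing in for the real towers.
image_encoder = nn.Linear(16, EMBED_DIM)
audio_encoder = nn.Linear(12, EMBED_DIM)

def embed(encoder, x):
    # L2-normalize so dot products are cosine similarities.
    return F.normalize(encoder(x), dim=-1)

def infonce_loss(q, k, temperature=0.07):
    # Symmetric InfoNCE: matched (image, audio) pairs are positives,
    # every other pairing in the batch is a negative.
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy batch of paired (image, audio) data.
images, audio = torch.randn(4, 16), torch.randn(4, 12)
loss = infonce_loss(embed(image_encoder, images), embed(audio_encoder, audio))
loss.backward()  # training the audio tower binds it to the image space
```

Once each non-image modality is aligned to the image space this way, modalities that were never observed together during training (e.g., audio and text) become directly comparable by cosine similarity, and their normalized embeddings can be added together, which is what enables the emergent retrieval and arithmetic-composition behavior described in the abstract.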
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| sound-prompted-semantic-segmentation-on | ImageBind | mAP: 19.7; mIoU: 20.5 |
| speech-prompted-semantic-segmentation-on | ImageBind | mAP: 20.2; mIoU: 19.7 |
| temporal-relation-extraction-on-vinoground | ImageBind | Group Score: 0.6; Text Score: 9.4; Video Score: 3.4 |
| zero-shot-video-retrieval-on-msr-vtt | ImageBind | text-to-video R@1: 36.8; R@5: 61.8; R@10: 70.0 |
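For reference, the retrieval numbers above are Recall@K: the fraction of text queries whose ground-truth video appears among the K nearest videos by cosine similarity. A small illustrative implementation, assuming row i of each embedding matrix corresponds to a matched text-video pair (the function name and toy data are hypothetical):

```python
# Compute text-to-video Recall@K from two aligned embedding matrices.
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor, video_emb: torch.Tensor, k: int) -> float:
    # Cosine similarity between every text query and every video.
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).t()
    # A query counts as correct if its true video is in the top-k results.
    topk = sims.topk(k, dim=-1).indices
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

text_emb, video_emb = torch.randn(100, 8), torch.randn(100, 8)
print(recall_at_k(text_emb, video_emb, k=5))  # fraction of queries hit in top 5
```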