ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar; Alaaeldin El-Nouby; Zhuang Liu; Mannat Singh; Kalyan Vasudev Alwala; Armand Joulin; Ishan Misra

Abstract
We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities simply by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results that outperform prior work, and demonstrate that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
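To make the binding idea concrete, here is a minimal sketch of aligning one extra modality (audio) to the image embedding space with a standard InfoNCE contrastive objective over image-paired data only. The stand-in encoders, dimensions, and function names (`image_encoder`, `audio_encoder`, `infonce_loss`) are hypothetical placeholders, not the paper's actual model or code:

```python
# Sketch of ImageBind-style binding: align an audio tower to the image
# embedding space using only (image, audio) pairs and an InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 8  # toy size; the real model uses large Transformer encoders

# Hypothetical per-modality encoders standing in for the real towers.
image_encoder = nn.Linear(16, EMBED_DIM)
audio_encoder = nn.Linear(12, EMBED_DIM)

def embed(encoder, x):
    # L2-normalize so dot products are cosine similarities.
    return F.normalize(encoder(x), dim=-1)

def infonce_loss(q, k, temperature=0.07):
    # Symmetric InfoNCE: matched (image, audio) pairs are positives,
    # every other pairing in the batch is a negative.
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy batch of paired (image, audio) data.
images, audio = torch.randn(4, 16), torch.randn(4, 12)
loss = infonce_loss(embed(image_encoder, images), embed(audio_encoder, audio))
loss.backward()  # training the audio tower binds it to the image space
```

Once each non-image modality is aligned to the image space this way, modalities that were never observed together during training (e.g., audio and text) become directly comparable by cosine similarity, and their normalized embeddings can be added together, which is what enables the emergent retrieval and arithmetic-composition behavior described in the abstract.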
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| sound-prompted-semantic-segmentation-on | ImageBind | mAP: 19.7; mIoU: 20.5 |
| speech-prompted-semantic-segmentation-on | ImageBind | mAP: 20.2; mIoU: 19.7 |
| temporal-relation-extraction-on-vinoground | ImageBind | Group Score: 0.6; Text Score: 9.4; Video Score: 3.4 |
| zero-shot-video-retrieval-on-msr-vtt | ImageBind | text-to-video R@1: 36.8; R@5: 61.8; R@10: 70.0 |
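For reference, the retrieval numbers above are Recall@K: the fraction of text queries whose ground-truth video appears among the K nearest videos by cosine similarity. A small illustrative implementation, assuming row i of each embedding matrix corresponds to a matched text-video pair (the function name and toy data are hypothetical):

```python
# Compute text-to-video Recall@K from two aligned embedding matrices.
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor, video_emb: torch.Tensor, k: int) -> float:
    # Cosine similarity between every text query and every video.
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).t()
    # A query counts as correct if its true video is in the top-k results.
    topk = sims.topk(k, dim=-1).indices
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

text_emb, video_emb = torch.randn(100, 8), torch.randn(100, 8)
print(recall_at_k(text_emb, video_emb, k=5))  # fraction of queries hit in top 5
```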