Omnivore: A Single Model for Many Visual Modalities

Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

Abstract

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs on par with or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
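
The abstract does not spell out how one set of parameters can consume three input types. The following is a minimal sketch of one plausible approach, not the authors' implementation. It assumes (the abstract does not state this) that every input is viewed as a video-like array of frames x height x width x channels, with an image treated as a single frame and single-view 3D data as RGB plus one depth channel, and that the array is then cut into fixed-size patches that become transformer tokens. The names `VisualInput`, `as_common_shape`, and `num_tokens`, and the patch sizes, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualInput:
    """Shape of one input, viewed as (frames, height, width, channels)."""
    frames: int
    height: int
    width: int
    channels: int


def as_common_shape(kind, height, width, frames=1):
    """Map a modality name to a shared 4D layout (illustrative assumption)."""
    if kind == "image":
        return VisualInput(1, height, width, 3)   # image = one-frame video
    if kind == "video":
        return VisualInput(frames, height, width, 3)
    if kind == "rgbd":
        return VisualInput(1, height, width, 4)   # RGB plus one depth channel
    raise ValueError(f"unknown modality: {kind}")


def num_tokens(x, patch_t=2, patch_hw=4):
    """Count patches, padding the frame axis up to a multiple of patch_t."""
    if x.height % patch_hw or x.width % patch_hw:
        raise ValueError("height and width must be multiples of patch_hw")
    t = -(-x.frames // patch_t)  # ceiling division
    return t * (x.height // patch_hw) * (x.width // patch_hw)


image = as_common_shape("image", 224, 224)
video = as_common_shape("video", 224, 224, frames=32)
depth = as_common_shape("rgbd", 224, 224)
print(num_tokens(image), num_tokens(video), num_tokens(depth))
```

Under these assumptions all three modalities reduce to a sequence of tokens that differ only in length, which is what lets a single transformer trunk process them with shared parameters.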

Code Repositories

facebookresearch/omnivore (official implementation, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | OMNIVORE (Swin-B) | Acc@1: 84.0; Acc@5: 96.2 |
| action-classification-on-kinetics-400 | OMNIVORE (Swin-L) | Acc@1: 84.1; Acc@5: 96.1 |
| action-recognition-in-videos-on-something | OMNIVORE (Swin-B, IN-21K + Kinetics400 pretrain) | Top-1 Accuracy: 71.4; Top-5 Accuracy: 93.5 |
| action-recognition-on-epic-kitchens-100 | OMNIVORE (Swin-B, finetuned) | Action@1: 49.9; Noun@1: 61.7; Verb@1: 69.5 |
| image-classification-on-imagenet | Omnivore (Swin-L) | Top 1 Accuracy: 86.0% |
| image-classification-on-imagenet | Omnivore (Swin-B) | Top 1 Accuracy: 85.3% |
| image-classification-on-inaturalist-2018 | OMNIVORE (Swin-L) | Top-1 Accuracy: 84.1% |
| scene-recognition-on-sun-rgbd | OMNIVORE (Swin-B) | Accuracy (%): 67.2 |
| semantic-segmentation-on-nyu-depth-v2 | OMNIVORE (Swin-B, finetuned) | Mean IoU: 55.1% |
| semantic-segmentation-on-nyu-depth-v2 | OMNIVORE (Swin-L, finetuned) | Mean IoU: 56.8% |
