OmniVec: Learning robust representations with cross modal sharing

Srivastava Siddharth ; Sharma Gaurav

Abstract

The majority of research in learning-based methods has been directed towards designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction, to learn multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, followed by task-specific prediction heads. We first pre-train it with self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g. visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, and this allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as on unseen datasets and tasks.
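The layout described in the abstract (task-specific encoders, one shared trunk, task-specific heads) can be illustrated with a small structural sketch. This is not the authors' implementation: the class name `JointNetwork`, the task names, the layer sizes, and the use of single random linear layers are all assumptions made for illustration, and the masked pre-training and sequential training stages are not shown.

```python
import random

def make_linear(n_in, n_out, rng):
    """Return a random weight matrix: n_out rows, each with n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, x):
    """Multiply a weight matrix by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

class JointNetwork:
    """Hypothetical sketch: per-task encoders -> one shared trunk -> per-task heads."""

    def __init__(self, input_dims, output_dims, dim=8, seed=0):
        rng = random.Random(seed)
        # One encoder per task maps that task's input into the shared width `dim`.
        self.encoders = {t: make_linear(n, dim, rng) for t, n in input_dims.items()}
        # A single trunk is shared by every task; this is where sharing happens.
        self.trunk = make_linear(dim, dim, rng)
        # One prediction head per task maps trunk features to task outputs.
        self.heads = {t: make_linear(dim, n, rng) for t, n in output_dims.items()}

    def forward(self, task, x):
        h = relu(apply_linear(self.encoders[task], x))
        h = relu(apply_linear(self.trunk, h))
        return apply_linear(self.heads[task], h)

# Assumed toy tasks: input sizes and class counts are arbitrary.
net = JointNetwork(
    input_dims={"image_cls": 12, "audio_cls": 6},
    output_dims={"image_cls": 5, "audio_cls": 3},
)
image_logits = net.forward("image_cls", [0.5] * 12)
audio_logits = net.forward("audio_cls", [0.5] * 6)
```

Both calls pass through the same `trunk` weights, so any update to the trunk during training on one task would also affect the others; only the encoders and heads differ per task.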

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| 3d-point-cloud-classification-on-modelnet40-c | OmniVec | Error Rate: 0.156 |
| 3d-point-cloud-classification-on-scanobjectnn | OmniVec | Overall Accuracy: 96.1 |
| action-classification-on-kinetics-400 | OmniVec | Acc@1: 91.1 |
| action-classification-on-mit | OmniVec | Top 1 Accuracy: 49.8 |
| action-classification-on-moments-in-time-2 | OmniVec | Top 1 Accuracy: 49.8 |
| action-recognition-in-videos-on-ucf101 | OmniVec | 3-fold Accuracy: 99.6 |
| audio-classification-on-audioset | OmniVec | Test mAP: 0.548 |
| audio-classification-on-esc-50 | OmniVec | Accuracy (5-fold): 98.4; PRE-TRAINING DATASET: Multiple; Top-1 Accuracy: 98.4 |
| fine-grained-image-classification-on-oxford-1 | OmniVec | Accuracy: 99.2 |
| image-classification-on-inaturalist-2018 | OmniVec | Top-1 Accuracy: 93.8 |
| image-classification-on-places365 | OmniVec(ViT) | Top 1 Accuracy: 63.5 |
| semantic-segmentation-on-nyu-depth-v2 | OmniVec | Mean IoU: 60.8 |
| semantic-segmentation-on-s3dis-area5 | OmniVec | mIoU: 75.9 |
| text-summarization-on-dialogsum | OmniVec | BertScore: 71.91; Rouge1: 46.91; Rouge2: 21.22; RougeL: 40.19 |
| video-retrieval-on-msr-vtt-1ka | OmniVec | text-to-video R@10: 89.4 |
| video-retrieval-on-msr-vtt-1ka | OmniVec (pretrained) | text-to-video R@10: 78.6 |
| video-retrieval-on-youcook2 | OmniVec (pretrained) | text-to-video R@10: 64.2 |
| video-retrieval-on-youcook2 | OmniVec | text-to-video R@10: 70.8 |
