5 months ago

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang; Kunchang Li; Yizhuo Li; Yinan He; Bingkun Huang; Zhiyu Zhao; Hongjie Zhang; Jilan Xu; Yi Liu; Zun Wang; Sen Xing; Guo Chen; Junting Pan; Jiashuo Yu; Yali Wang; Limin Wang; Yu Qiao

Abstract

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill this gap, we present InternVideo, a family of general video foundation models that take advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as its pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets spanning a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. Together, these results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
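
To make the discriminative objective named in the abstract more concrete, here is a minimal sketch of a symmetric video-language contrastive (InfoNCE-style) loss. It is a generic illustration, not InternVideo's implementation: the function name `symmetric_contrastive_loss`, the helper names, and the temperature of 0.07 are assumptions chosen for the example, and the embeddings are plain Python lists rather than encoder outputs.

```python
import math


def _dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    # Scale a vector to unit L2 norm so that dot products become cosine similarities.
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Hypothetical helper: video i and text i are assumed to be a matched pair.

    The temperature value is an assumption for illustration only.
    """
    v = [_normalize(x) for x in video_embs]
    t = [_normalize(x) for x in text_embs]
    # logits[i][j] = scaled cosine similarity between video i and text j
    logits = [[_dot(vi, tj) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(videos, aligned_texts))
    print(symmetric_contrastive_loss(videos, swapped_texts))
```

The loss is small when each video embedding is closest to its own caption and large when the pairs are mismatched, which is the behavior a contrastive pretraining objective rewards. The generative objective (masked video modeling) and the learnable coordination between the two models are not covered by this sketch.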

Code Repositories

opengvlab/internvideo (official, PyTorch, mentioned in GitHub)
yingsen1/unimd (PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
action-classification-on-kinetics-400 | InternVideo | Acc@1: 91.1
action-classification-on-kinetics-600 | InternVideo-T | Top-1 Accuracy: 91.3
action-classification-on-kinetics-700 | InternVideo-T | Top-1 Accuracy: 84.0
action-recognition-in-videos-on-something | InternVideo | Top-1 Accuracy: 77.2
action-recognition-in-videos-on-something-1 | InternVideo | Top-1 Accuracy: 70.0
action-recognition-on-ava-v2-2 | InternVideo | mAP: 41.01
open-set-action-recognition-on-ucf-hmdb | InternVideo | AUROC: 85.48
open-set-action-recognition-on-ucf101-mitv2 | InternVideo | AUROC: 91.85
spatio-temporal-action-localization-on-ava | InternVideo | val mAP: 41.01
temporal-action-localization-on-activitynet | InternVideo | mAP: 39.00
temporal-action-localization-on-fineaction | InternVideo | mAP: 17.57
temporal-action-localization-on-hacs | InternVideo | Average-mAP: 41.55
temporal-action-localization-on-thumos14 | ActionFormer (InternVideo features) | Avg mAP (0.3:0.7): 71.58
video-question-answering-on-situated | InternVideo | Average Accuracy: 58.7
video-retrieval-on-activitynet | InternVideo | text-to-video R@1: 62.2; video-to-text R@1: 62.8
video-retrieval-on-didemo | InternVideo | text-to-video R@1: 57.9; video-to-text R@1: 59.1
video-retrieval-on-lsmdc | InternVideo | text-to-video R@1: 34.0; video-to-text R@1: 34.9
video-retrieval-on-msr-vtt | InternVideo | text-to-video R@1: 55.2; video-to-text R@1: 57.9
video-retrieval-on-msvd | InternVideo | text-to-video R@1: 58.4; video-to-text R@1: 76.3
video-retrieval-on-vatex | InternVideo | text-to-video R@1: 71.1; video-to-text R@1: 87.2
visual-question-answering-on-msrvtt-qa-1 | InternVideo | Accuracy: 0.471
visual-question-answering-on-msvd-qa-1 | InternVideo | Accuracy: 0.555
visual-question-answering-on-tgif-qa | InternVideo | Accuracy: 0.722
zero-shot-video-question-answer-on-egoschema-1 | InternVideo | Accuracy: 32.1
zero-shot-video-question-answer-on-star | InternVideo | Accuracy: 41.6
zero-shot-video-question-answer-on-tvqa | InternVideo (no speech) | Accuracy: 35.9
zero-shot-video-retrieval-on-activitynet | InternVideo | text-to-video R@1: 30.7; video-to-text R@1: 31.4
zero-shot-video-retrieval-on-didemo | InternVideo | text-to-video R@1: 31.5, R@5: 57.6, R@10: 68.2; video-to-text R@1: 33.5, R@5: 60.3, R@10: 71.1
zero-shot-video-retrieval-on-lsmdc | InternVideo | text-to-video R@1: 17.6, R@5: 32.4, R@10: 40.2; video-to-text R@1: 13.2, R@5: 27.8, R@10: 34.9
zero-shot-video-retrieval-on-msr-vtt | InternVideo | text-to-video R@1: 40.7; video-to-text R@1: 39.6
zero-shot-video-retrieval-on-msvd | InternVideo | text-to-video R@1: 43.4; video-to-text R@1: 67.6
zero-shot-video-retrieval-on-vatex | InternVideo | text-to-video R@1: 49.5; video-to-text R@1: 69.5
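
The retrieval entries above report Recall@K (R@1, R@5, R@10) in both the text-to-video and video-to-text directions. As a rough sketch of how such a metric is commonly computed from a query-by-candidate similarity matrix, the example below assumes exactly one ground-truth candidate per query, located at the same index as the query. That is a simplifying assumption: the listed benchmarks may use other protocols, such as several captions per video. The function name `recall_at_k` and the toy similarity values are hypothetical.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose true match ranks within the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption for this sketch: the true match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the true match: how many candidates score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


def transpose(matrix):
    # Swap the query and candidate roles, e.g. text-to-video becomes video-to-text.
    return [list(col) for col in zip(*matrix)]


if __name__ == "__main__":
    # Toy text-to-video scores: rows are text queries, columns are videos.
    text_to_video = [
        [0.9, 0.1, 0.2],
        [0.3, 0.2, 0.8],
        [0.1, 0.4, 0.7],
    ]
    for k in (1, 2, 3):
        print(f"text-to-video R@{k}: {recall_at_k(text_to_video, k):.1f}")
        print(f"video-to-text R@{k}: {recall_at_k(transpose(text_to_video), k):.1f}")
```

Transposing the matrix is what separates the two reported directions, so the same model can score differently on text-to-video and video-to-text, as several rows of the table show.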