Zero-Shot Video Retrieval on YouCook2
Evaluation Metrics
text-to-video Median Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
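
As a rough illustration of how the metrics listed above are commonly computed, the sketch below derives R@K and Median Rank from a text-to-video similarity matrix. It assumes one ground-truth video per text query, stored on the diagonal; the function names and the toy scores are hypothetical and do not come from any of the listed papers.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the correct video for each text query.

    Assumption: similarity[i][j] scores query i against video j, and
    video i is the single correct match for query i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of videos scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video across all queries (lower is better)."""
    return median(ranks)


# Made-up 4x4 similarity matrix, for illustration only.
toy_similarity = [
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.5, 0.1, 0.2],
    [0.7, 0.6, 0.5, 0.1],
    [0.1, 0.2, 0.3, 0.9],
]

ranks = retrieval_ranks(toy_similarity)
print("R@1:", recall_at_k(ranks, 1))
print("R@5:", recall_at_k(ranks, 5))
print("Median Rank:", median_rank(ranks))
```

Higher R@K is better, while a lower Median Rank is better. Individual papers may handle ties or multiple captions per clip differently from this sketch.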
Evaluation Results
Performance of each model on this benchmark
| Model | text-to-video Median Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| OmniVec2 | - | 26.1 | 70.8 | 54.1 | OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | - |
| Norton | - | 24.2 | 64.1 | 51.9 | Multi-granularity Correspondence Learning from Long-term Noisy Videos | |
| VideoCLIP | - | 22.7 | 63.1 | 50.4 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | |
| TACo | - | 19.9 | 55.7 | 43.2 | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | - |
| VAST, HowToCaption-finetuned | 8 | 19.7 | 53.9 | 43.6 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | |
| VideoCoCa | - | 20.3 | 53.3 | 43.0 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| MIL-NCE | - | 15.1 | 51.2 | 38.0 | End-to-End Learning of Visual Representations from Uncurated Instructional Videos | |
| VATT-MBS | - | - | 45.5 | - | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | |
| HowToCaption | 15 | 13.4 | 44.1 | 33.1 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | |