
Abstract
We propose a simple method that turns a ViT encoder into an efficient video model which handles both image and video inputs seamlessly. By sparsely sampling the input data, the model can be trained on, and run inference over, both input types. The model scales easily and can be adapted to large-scale pre-trained ViTs without full fine-tuning. Experiments show it achieves state-of-the-art (SOTA) performance, and the code will be open-sourced.
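The core idea of the sparse sampling can be sketched as a bank of 3D-convolution "tubes", each with its own kernel shape and large strides, whose token outputs are concatenated and fed to a shared ViT encoder. The sketch below is a minimal illustration in PyTorch; the kernel and stride values, class name, and embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class SparseTubeTokenizer(nn.Module):
    """Illustrative sparse-tube tokenizer (hypothetical config).

    Each tube is a Conv3d over (T, H, W) whose strides are much larger
    than its kernel, so the video is sampled sparsely rather than densely.
    Tokens from all tubes are concatenated for a single ViT encoder.
    """

    def __init__(self, dim: int = 768):
        super().__init__()
        self.tubes = nn.ModuleList([
            nn.Conv3d(3, dim, kernel_size=(8, 8, 8), stride=(16, 32, 32)),
            nn.Conv3d(3, dim, kernel_size=(16, 4, 4), stride=(6, 32, 32)),
            nn.Conv3d(3, dim, kernel_size=(4, 12, 12), stride=(16, 32, 32)),
            # A tube with temporal extent 1 acts like 2D patching, which is
            # what allows the same tokenizer to also consume single images.
            nn.Conv3d(3, dim, kernel_size=(1, 16, 16), stride=(32, 16, 16)),
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, T, H, W)
        tokens = []
        for tube in self.tubes:
            t = tube(x)  # (B, dim, t', h', w')
            tokens.append(t.flatten(2).transpose(1, 2))  # (B, N_i, dim)
        return torch.cat(tokens, dim=1)  # (B, sum_i N_i, dim)


video = torch.randn(1, 3, 32, 224, 224)
tokens = SparseTubeTokenizer()(video)
print(tokens.shape)  # far fewer tokens than dense patching of 32 frames
```

Because the strides are large, the token count grows far more slowly with clip length than dense per-frame patching, which is what makes reusing a plain ViT encoder on video affordable.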
Code Repositories
daniel-code/TubeViT
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-charades | TubeViT-L | mAP: 66.2 |
| action-classification-on-kinetics-400 | TubeViT-L (ImageNet-1k) | Acc@1: 90.2, Acc@5: 98.6, FLOPs (G) x views: 95300x4x3, Parameters (M): 307 |
| action-classification-on-kinetics-400 | TubeViT-H (ImageNet-1k) | Acc@1: 90.9, Acc@5: 98.9, FLOPs (G) x views: 176400x4x3, Parameters (M): 632 |
| action-classification-on-kinetics-400 | TubeViT-B (ImageNet-1k) | Acc@1: 88.6, Acc@5: 97.6, FLOPs (G) x views: 8700x3x4, Parameters (M): 86 |
| action-classification-on-kinetics-600 | TubeViT-L | Top-1 Accuracy: 91.5, Top-5 Accuracy: 98.7 |
| action-classification-on-kinetics-600 | TubeViT-B | Top-1 Accuracy: 90.9, Top-5 Accuracy: 97.3 |
| action-classification-on-kinetics-600 | TubeViT-H | Top-1 Accuracy: 91.8, Top-5 Accuracy: 98.9 |
| action-classification-on-kinetics-700 | TubeViT-L | Top-1 Accuracy: 83.8, Top-5 Accuracy: 96.6 |
| action-recognition-in-videos-on-something | TubeViT-L | Top-1 Accuracy: 76.1, Top-5 Accuracy: 95.2 |