Action Classification On Charades

Evaluation Metric

mAP (mean average precision)
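As a rough illustration of what this metric measures: in a multi-label setting such as Charades, mAP is usually computed by ranking all videos by predicted score for each class, taking the average precision (AP) of that ranking, and then averaging AP over classes. The sketch below assumes binary label matrices and skips classes with no positive examples. The function names are illustrative, and the official Charades evaluation script may differ in details such as tie handling. Leaderboard values appear to be this quantity expressed as a percentage.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one class: mean of precision measured at the rank of each positive."""
    # Rank samples from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positives contributes AP = 0.0 here.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """mAP: average per-class AP over classes that have at least one positive."""
    num_classes = len(label_matrix[0])
    aps = []
    for c in range(num_classes):
        labels = [row[c] for row in label_matrix]
        if sum(labels) == 0:
            continue  # assumption: classes without positives are skipped
        scores = [row[c] for row in score_matrix]
        aps.append(average_precision(labels, scores))
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    # Toy data: 3 videos, 2 classes (rows are videos, columns are classes).
    y_true = [[1, 0], [0, 1], [1, 1]]
    y_score = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.1]]
    print(f"mAP = {100 * mean_average_precision(y_true, y_score):.2f}")
```

Averaging per class rather than over all predictions keeps rare action classes from being swamped by frequent ones, which matters for a dataset with many labels per video.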

Evaluation Results

Performance of each model on this benchmark

| Model | mAP | Paper Title | Repository |
| --- | --- | --- | --- |
| TokenLearner | 66.3 | TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | |
| TubeViT-L | 66.2 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | |
| MoViNet-A6 | 63.2 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| DEEP-HAL with ODF+SDF (AssembleNet++) | 62.29 | Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors | - |
| AssembleNet++ 50 | 59.8 | AssembleNet++: Assembling Modality Representations via Attention Connections | |
| AssembleNet-101 | 58.6 | AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures | |
| AssembleNet | 58.6 | AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures | |
| VicTR (ViT-L/14) | 57.6 | VicTR: Video-conditioned Text Representations for Activity Recognition | - |
| AssembleNet++ 50 without object | 54.98 | AssembleNet++: Assembling Modality Representations via Attention Connections | |
| BIKE | 50.7 | Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | |
| DEEP-HAL with ODF+SDF (I3D) | 50.16 | Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors | - |
| MoViNet-A4 | 48.5 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| AdaFocus (weak supervision, MViT-B-24, 32x3) | 47.8 | Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition | - |
| MViT-B-24, 32x3 (Kinetics-600 pretraining) | 47.7 | Multiscale Vision Transformers | |
| En-VidTr-L | 47.3 | VidTr: Video Transformer Without Convolutions | - |
| MViT-B, 32x3 (Kinetics-600 pretraining) | 47.1 | Multiscale Vision Transformers | |
| MViT-B-24, 32x3 (Kinetics-400 pretraining) | 46.3 | Multiscale Vision Transformers | |
| SlowFast (Kinetics-600 pretraining, NL) | 45.2 | SlowFast Networks for Video Recognition | |
| ActionCLIP (ViT-B/16) | 44.3 | ActionCLIP: A New Paradigm for Video Action Recognition | |
| MViT-B, 32x3 (Kinetics-400 pretraining) | 44.3 | Multiscale Vision Transformers | |