Action Recognition In Videos On Something 1

Evaluation Metrics

GFLOPs
Param.
Top 1 Accuracy
Top 5 Accuracy
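
For readers unfamiliar with the accuracy metrics listed above, the following is a minimal sketch of how top-k accuracy is conventionally computed: a sample counts as correct when its true label is among the k classes with the highest predicted scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions; the page does not describe the evaluation code behind the listed results.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 1,
) -> float:
    """Return the percentage of samples whose true label is among the
    k highest-scoring classes (hypothetical helper, not from the source)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example: 3 clips, 3 classes (made-up scores).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

print(f"Top 1: {top_k_accuracy(scores, labels, k=1):.1f}")
print(f"Top 2: {top_k_accuracy(scores, labels, k=2):.1f}")
```

With k=1 this reduces to ordinary classification accuracy; Top 5 Accuracy corresponds to k=5.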

Evaluation Results

Performance of each model on this benchmark

| Model | GFLOPs | Param. | Top 1 Accuracy | Top 5 Accuracy | Paper Title | Repository |
| InternVideo | - | - | 70.0 | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| VideoMAE V2-g | - | - | 68.7 | 91.9 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | |
| Side4Video (EVA ViT-E/14) | - | - | 67.3 | 88.8 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | |
| ATM | - | - | 65.6 | 88.6 | What Can Simple Arithmetic Operations Do for Temporal Modeling? | |
| TAdaFormer-L/14 | - | - | 63.7 | - | Temporally-Adaptive Models for Efficient Video Understanding | |
| TDS-CLIP-ViT-L/14(8frames) | - | - | 63.0 | 87.8 | TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning | |
| UniFormerV2-L | - | - | 62.7 | 88.0 | UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | - |
| StructVit-B-4-1 | - | - | 61.3 | - | Learning Correlation Structures for Vision Transformers | - |
| UniFormer-B (IN-1K + Kinetics400) | 259x3 | 50.1 | 60.9 | 87.3 | UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | - |
| TAdaConvNeXtV2-B | - | - | 60.7 | - | Temporally-Adaptive Models for Efficient Video Understanding | |
| TPS | - | - | 58.3 | - | Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | |
| MSMA (8+16frames) | - | - | 57.9 | - | Multi-scale Motion-Aware Module for Video Action Recognition | - |
| UniFormer-B (IN-1K + Kinetics600) | 41.8x3 | 21.4 | 57.6 | 84.9 | UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | - |
| SIFA | - | - | 57.3 | - | Stand-Alone Inter-Frame Attention in Video Models | |
| TCM (Ensemble) | - | - | 57.2 | - | Motion-driven Visual Tempo Learning for Video-based Action Recognition | |
| EAN ResNet50 (single clip, center crop, 8+16 ensemble, with sparse Transformer) | - | - | 57.2 | 83.9 | EAN: Event Adaptive Network for Enhanced Action Recognition | |
| BQNEn (ImageNet + K400 pretrained) | - | - | 57.1 | 84.2 | Busy-Quiet Video Disentangling for Video Classification | |
| TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | - | - | 56.8 | 84.1 | TDN: Temporal Difference Networks for Efficient Action Recognition | |
| CT-Net Ensemble (R50, 8+12+16+24) | - | - | 56.6 | - | CT-Net: Channel Tensorization Network for Video Classification | |
| MoDS (8+16frames) | - | - | 56.6 | - | Action Recognition With Motion Diversification and Dynamic Selection | - |