Action Recognition In Videos On Something

Evaluation Metrics

Top-1 Accuracy
Top-5 Accuracy
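
For readers unfamiliar with the two metrics, the following is a minimal sketch of how top-k accuracy is commonly computed from per-class scores. It is an illustration only: the function name `top_k_accuracy`, the toy scores, and the toy labels are hypothetical, and the benchmark's actual evaluation protocol (for example, how many clips or crops are averaged per video) is not stated on this page.

```python
def top_k_accuracy(scores, labels, k=1):
    """Return the percentage of samples whose true label is among the
    k highest-scoring classes.

    scores: list of per-sample lists of class scores (hypothetical input format)
    labels: list of true class indices, one per sample
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy data with 3 classes and 4 samples (made up for illustration).
scores = [
    [0.1, 0.7, 0.2],  # true label 1: hit at k=1
    [0.5, 0.3, 0.2],  # true label 1: miss at k=1, hit at k=2
    [0.2, 0.3, 0.5],  # true label 0: miss at k=1 and k=2
    [0.6, 0.1, 0.3],  # true label 0: hit at k=1
]
labels = [1, 1, 0, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.1f}%  top-2: {top2:.1f}%")
```

On the leaderboard, the same idea applies with k=1 and k=5 over the benchmark's full set of action classes; the toy example uses k=2 only because it has three classes.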

Results

Performance of each model on this benchmark

| Model | Top-1 Accuracy | Top-5 Accuracy | Paper Title | Repository |
| --- | --- | --- | --- | --- |
| MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 77.3 | 95.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| InternVideo | 77.2 | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| InternVideo2-1B | 77.1 | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| VideoMAE V2-g | 77.0 | 95.9 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | |
| MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 76.7 | 95.5 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| Hiera-L (no extra data) | 76.5 | - | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | |
| TubeViT-L | 76.1 | 95.2 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | |
| VideoMAE (no extra data, ViT-L, 32x2) | 75.4 | 95.2 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| Side4Video (EVA ViT-E/14) | 75.2 | 94.0 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | |
| MaskFeat (Kinetics600 pretrain, MViT-L) | 75.0 | 95.0 | Masked Feature Prediction for Self-Supervised Visual Pre-Training | |
| MAR (50% mask, ViT-L, 16x4) | 74.7 | 94.9 | MAR: Masked Autoencoders for Efficient Action Recognition | |
| ATM | 74.6 | 94.4 | What Can Simple Arithmetic Operations Do for Temporal Modeling? | |
| MAWS (ViT-L) | 74.4 | - | The effectiveness of MAE pre-pretraining for billion-scale pretraining | |
| VideoMAE (no extra data, ViT-L, 16frame) | 74.3 | 94.6 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| MAR (75% mask, ViT-L, 16x4) | 73.8 | 94.4 | MAR: Masked Autoencoders for Efficient Action Recognition | |
| ViC-MAE (ViT-L) | 73.7 | - | ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | |
| MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 73.7 | 94.0 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| TAdaFormer-L/14 | 73.6 | - | Temporally-Adaptive Models for Efficient Video Understanding | |
| TDS-CLIP-ViT-L/14(8frames) | 73.4 | 93.8 | TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning | |
| AMD(ViT-B/16) | 73.3 | 94.0 | Asymmetric Masked Distillation for Pre-Training Small Foundation Models | - |