Action Recognition On Epic Kitchens 100

Evaluation Metrics

Action@1
GFLOPs
Noun@1
Verb@1

Evaluation Results

Performance of each model on this benchmark

| Model | Action@1 | GFLOPs | Noun@1 | Verb@1 | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- | --- |
| Avion (ViT-L) | 54.4 | - | 65.4 | 73.0 | Training a Large Video Model on a Single Machine in a Day | |
| M&M (WTS 60M) | 53.6 | - | 66.3 | 72.0 | M&M Mix: A Multimodal Multiview Transformer Ensemble | - |
| LVMAE | 52.1 | - | 61.8 | 75.0 | Extending Video Masked Autoencoders to 128 frames | - |
| TAdaFormer-L/14 | 51.8 | - | 64.1 | 71.7 | Temporally-Adaptive Models for Efficient Video Understanding | |
| LaViLa (TimeSformer-L) | 51 | - | 62.9 | 72 | Learning Video Representations from Large Language Models | |
| MTV-B (WTS 60M) | 50.5 | - | 63.9 | 69.9 | Multiview Transformers for Video Recognition | |
| OMNIVORE (Swin-B, finetuned) | 49.9 | - | 61.7 | 69.5 | Omnivore: A Single Model for Many Visual Modalities | |
| CAST-B/16 | 49.3 | - | 60.9 | 72.5 | CAST: Cross-Attention in Space and Time for Video Action Recognition | |
| TAdaConvNeXtV2-S | 48.9 | - | 60.2 | 71.0 | Temporally-Adaptive Models for Efficient Video Understanding | |
| MeMViT-24 | 48.4 | - | 60.3 | 71.4 | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | |
| MMT | 47.8 | - | 61.0 | 70.1 | Multiscale Multimodal Transformer for Multimodal Action Recognition | - |
| MoViNet-A6 | 47.7 | 117x1 | 57.3 | 72.2 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| AVT | 47.2 | - | 59.3 | 70.4 | AVT: Audio-Video Transformer for Multimodal Action Recognition | - |
| ORViT Mformer-L (ORViT blocks) | 45.7 | - | 58.7 | 68.4 | Object-Region Video Transformers | |
| TempAgg | 45.26 | - | 53.35 | 66 | Technical Report: Temporal Aggregate Representations | |
| MoViNet-A5 | 44.5 | 74.9x1 | 55.1 | 69.1 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| Mformer-HR | 44.5 | - | 58.5 | 67.0 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | |
| GSF | 44.48 | - | 53.18 | 69.06 | Gate-Shift-Fuse for Video Action Recognition | |
| MoViNet-A4 | 44.4 | 42.2x1 | 56.2 | 68.8 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| Mformer-L | 44.1 | - | 57.6 | 67.1 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | |