Action Classification On Kinetics 400
Evaluation metric
Acc@1
Evaluation results
Performance of each model on this benchmark
| Model | Acc@1 | Paper Title | Repository |
|---|---|---|---|
| OmniVec2 | 93.6 | OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | - |
| FTP-UniFormerV2-L/14 | 93.4 | Enhancing Video Transformers for Action Understanding with VLM-aided Training | - |
| InternVideo2-6B | 92.1 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| InternVideo2-1B | 91.6 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| OmniVec | 91.1 | OmniVec: Learning robust representations with cross modal sharing | - |
| InternVideo | 91.1 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| TubeViT-H (ImageNet-1k) | 90.9 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | |
| UMT-L (ViT-L/16) | 90.6 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| Unmasked Teacher (ViT-L) | 90.6 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| TubeViT-L (ImageNet-1k) | 90.2 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | |
| UniFormerV2-L (ViT-L, 336) | 90.0 | UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | - |
| VideoMAE V2-g (64x266x266) | 90.0 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | |
| MTV-H (WTS 60M) | 89.9 | Multiview Transformers for Video Recognition | |
| TAdaFormer-L/14 | 89.9 | Temporally-Adaptive Models for Efficient Video Understanding | |
| EVA | 89.7 | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | |
| AM/12 ViT-B Dinov2 | 89.6 | AM Flow: Adapters for Temporal Processing in Action Recognition | - |
| ATM | 89.4 | What Can Simple Arithmetic Operations Do for Temporal Modeling? | |
| CoCa (finetuned) | 88.9 | CoCa: Contrastive Captioners are Image-Text Foundation Models | |
| ILA (ViT-L/14) | 88.7 | Implicit Temporal Modeling with Learnable Alignment for Video Recognition | |
| BIKE (CLIP ViT-L/14) | 88.7 | Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | |