
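Abstract
The scarcity of videos in current action classification datasets (such as UCF-101 and HMDB-51) has made it difficult to identify good video architectures, because most methods reach similar performance on the existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures on the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data than existing datasets, with 400 human action classes and more than 400 clips per class, collected from realistic and challenging YouTube videos. We analyze how current architectures perform on action classification on this dataset, and how much their performance on the smaller benchmark datasets improves after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: the filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, which makes it possible to learn seamless spatio-temporal feature extractors from video while reusing successful ImageNet architecture designs and their parameters. We show that, after pre-training on Kinetics, I3D models considerably outperform the previous state of the art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.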
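The 2D-to-3D inflation mentioned in the abstract can be illustrated with a small NumPy sketch. It assumes the bootstrapping scheme we understand the I3D paper to use: each 2D kernel is repeated along a new time axis and divided by the number of repeats, so that a video made of one image repeated in time produces the same filter response as the image does under the 2D kernel. The function name `inflate_2d_kernel`, the weight layout `(out, in, kh, kw)`, and the random toy weights are illustrative assumptions, not taken from any particular repository listed here.

```python
import numpy as np


def inflate_2d_kernel(w2d: np.ndarray, time_dim: int) -> np.ndarray:
    """Inflate 2D conv weights (out, in, kh, kw) to 3D (out, in, t, kh, kw).

    The 2D weights are copied `time_dim` times along a new temporal axis
    and rescaled by 1 / time_dim (hypothetical helper for illustration).
    """
    if time_dim < 1:
        raise ValueError("time_dim must be a positive integer")
    w3d = np.repeat(w2d[:, :, None, :, :], time_dim, axis=2)
    return w3d / time_dim


# Toy check on a "static video": one image patch repeated over time.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((4, 3, 3, 3))      # 4 filters, 3 channels, 3x3
w3d = inflate_2d_kernel(w2d, time_dim=3)     # 4 filters, 3 channels, 3x3x3

patch = rng.standard_normal((3, 3, 3))       # (in, kh, kw)
video_patch = np.repeat(patch[:, None, :, :], 3, axis=1)  # (in, t, kh, kw)

# Filter response at a single location: sum of elementwise products.
resp2d = np.tensordot(w2d, patch, axes=3)        # shape (4,)
resp3d = np.tensordot(w3d, video_patch, axes=4)  # shape (4,)

print(np.allclose(resp2d, resp3d))
```

The rescaling is what keeps the inflated filter's response on a static video equal to the 2D response, which is why pretrained ImageNet weights remain a sensible starting point after inflation.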
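Code repositories
| Repository | Framework | Note |
|---|---|---|
| `helloxy96/CS5242_Project2020` | PyTorch | Mentioned on GitHub |
| `2023-MindSpore-1/ms-code-24` | MindSpore | Mentioned on GitHub |
| `vijayvee/behavior_recognition` | TensorFlow | Mentioned on GitHub |
| `google-deepmind/kinetics-i3d` | TensorFlow | Mentioned on GitHub |
| `yaohungt/GSTEG_CVPR_2019` | PyTorch | Mentioned on GitHub |
| `prinshul/GWSDR` | TensorFlow | Mentioned on GitHub |
| `dlpbc/keras-kinetics-i3d` | TensorFlow | Mentioned on GitHub |
| `OanaIgnat/i3d_keras` | TensorFlow | Mentioned on GitHub |
| `ShobhitMaheshwari/sign-language1` | TensorFlow | Mentioned on GitHub |
| `anonymous-p/Flickering_Adversarial_Video` | PyTorch | Mentioned on GitHub |
| `LukasHedegaard/co3d` | PyTorch | Mentioned on GitHub |
| `ivanwilliammd/I3DR-Net-Transfer-Learning` | PyTorch | Mentioned on GitHub |
| `FrederikSchorr/sign-language` | TensorFlow | Mentioned on GitHub |
| `mHealthBuet/SegCodeNet` | PyTorch | Mentioned on GitHub |
| `KingGugu/I3D` | MindSpore | Mentioned on GitHub |
| `hassony2/kinetics_i3d_pytorch` | PyTorch | Mentioned on GitHub |
| `piergiaj/pytorch-i3d` | PyTorch | Mentioned on GitHub |
| `JeffCHEN2017/WSSTG` | PyTorch | Mentioned on GitHub |
| `MarkoLewis-Projects/Sign_language_detection` | TensorFlow | Mentioned on GitHub |
| `CMU-CREATE-Lab/deep-smoke-machine` | PyTorch | Mentioned on GitHub |
| `vijayvee/behavior-recognition` | TensorFlow | Mentioned on GitHub |
| `AbdurrahmanNadi/activity_recognition_web_service` | TensorFlow | Mentioned on GitHub |
| `aim3-ruc/youmakeup_challenge2022` | PyTorch | Mentioned on GitHub |
| `ahsaniqbal/Kinetics-FeatureExtractor` | TensorFlow | Mentioned on GitHub |
| `Alexyuda/action_recognition` | PyTorch | Mentioned on GitHub |
| `open-mmlab/mmaction2` | PyTorch | |
| `sebastiantiesmeyer/deeplabchop3d` | PyTorch | Mentioned on GitHub |
| `daniansan/i3d_mindspore` | MindSpore | Mentioned on GitHub |
| `PPPrior/i3d-pytorch` | PyTorch | Mentioned on GitHub |
| `deepmind/kinetics-i3d` | TensorFlow | Mentioned on GitHub |
| `StanfordVL/RubiksNet` | PyTorch | Mentioned on GitHub |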
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-charades | I3D | mAP: 32.9 |
| action-classification-on-kinetics-400 | I3D | Acc@1: 71.1; Acc@5: 89.3 |
| action-classification-on-moments-in-time | I3D | Top 1 Accuracy: 29.51%; Top 5 Accuracy: 56.06% |
| action-classification-on-toyota-smarthome | I3D | CS: 53.4; CV1: 34.9; CV2: 45.1 |
| action-recognition-in-videos-on-hmdb-51 | Flow-I3D (Kinetics pre-training) | Average accuracy of 3 splits: 77.3 |
| action-recognition-in-videos-on-hmdb-51 | Two-stream I3D | Average accuracy of 3 splits: 80.9 |
| action-recognition-in-videos-on-hmdb-51 | Two-Stream I3D (ImageNet+Kinetics pre-training) | Average accuracy of 3 splits: 80.7 |
| action-recognition-in-videos-on-hmdb-51 | RGB-I3D (Kinetics pre-training) | Average accuracy of 3 splits: 74.3 |
| action-recognition-in-videos-on-hmdb-51 | Flow-I3D (ImageNet+Kinetics pre-training) | Average accuracy of 3 splits: 77.1 |
| action-recognition-in-videos-on-hmdb-51 | RGB-I3D (ImageNet+Kinetics pre-training) | Average accuracy of 3 splits: 74.8 |
| action-recognition-in-videos-on-ucf101 | Two-Stream I3D (Kinetics pre-training) | 3-fold Accuracy: 97.8 |
| action-recognition-in-videos-on-ucf101 | Flow-I3D (ImageNet+Kinetics pre-training) | 3-fold Accuracy: 96.7 |
| action-recognition-in-videos-on-ucf101 | RGB-I3D (Kinetics pre-training) | 3-fold Accuracy: 95.1 |
| action-recognition-in-videos-on-ucf101 | Two-stream I3D | 3-fold Accuracy: 93.4 |
| action-recognition-in-videos-on-ucf101 | Two-Stream I3D (ImageNet+Kinetics pre-training) | 3-fold Accuracy: 98.0 |
| action-recognition-in-videos-on-ucf101 | RGB-I3D (ImageNet+Kinetics pre-training) | 3-fold Accuracy: 95.6 |
| action-recognition-in-videos-on-ucf101 | Flow-I3D (Kinetics pre-training) | 3-fold Accuracy: 96.5 |
| hand-gesture-recognition-on-egogesture-1 | I3D | Accuracy: 92.78 |
| hand-gesture-recognition-on-viva-hand-1 | I3D | Accuracy: 83.1 |
| skeleton-based-action-recognition-on-j-hmdb | I3D | Accuracy (RGB+pose): 84.1 |
| video-object-tracking-on-cater | I3D-50 + LSTM | L1: 1.2; Top 1 Accuracy: 60.2; Top 5 Accuracy: 81.8 |
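
The table reports separate RGB, Flow, and Two-Stream results, scored with top-1 and top-5 accuracy. The following is a minimal sketch of how such numbers could be computed, assuming the two streams are fused by averaging their per-class softmax probabilities (the exact fusion used for each leaderboard entry is not stated here). The helper names `two_stream_average` and `top_k_accuracy`, and the toy logits and labels, are hypothetical and carry no relation to the reported results.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def two_stream_average(rgb_logits: np.ndarray, flow_logits: np.ndarray) -> np.ndarray:
    """Fuse two streams by averaging class probabilities (assumed scheme)."""
    return 0.5 * (softmax(rgb_logits) + softmax(flow_logits))


def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of clips whose true label is among the k highest scores."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example: 3 clips, 4 classes (made-up numbers, for illustration only).
rgb_logits = np.array([[2.0, 1.0, 0.0, 0.0],
                       [0.0, 0.5, 2.0, 0.1],
                       [1.0, 0.9, 0.0, 0.0]])
flow_logits = np.array([[1.5, 0.5, 0.0, 0.0],
                        [0.0, 3.0, 1.0, 0.0],
                        [0.0, 2.0, 0.0, 0.0]])
labels = np.array([0, 2, 1])

fused = two_stream_average(rgb_logits, flow_logits)
top1 = top_k_accuracy(fused, labels, k=1)
top2 = top_k_accuracy(fused, labels, k=2)
print(top1, top2)
```

Averaging probabilities rather than raw logits keeps the two streams on a comparable scale; other fusion choices are possible and would change the scores.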