4 months ago

Towards Weakly Supervised End-to-End Learning for Long-Video Action Recognition


Abstract

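Developing end-to-end action recognition models on long videos is crucial for long-video action understanding. Because end-to-end training on entire long videos is prohibitively expensive, existing work generally trains models on short clips trimmed from long videos. However, this "trimming-then-training" practice requires action interval annotations to provide clip-level supervision, that is, knowing which actions fall into each trimmed clip. Collecting such annotations is very expensive, which prevents model training at scale. This work therefore aims to build a weakly supervised end-to-end framework for training recognition models on long videos using only video-level action category labels. Without knowing the precise temporal locations of actions in a long video, the proposed weakly supervised framework, AdaptFocus, estimates where actions are likely to occur and with what probability, and adaptively focuses on informative action clips for end-to-end training. The effectiveness of the AdaptFocus framework is validated on three long-video datasets. In addition, for downstream long-video tasks, AdaptFocus provides a weakly supervised feature extraction pipeline that yields more robust long-video features and significantly improves state-of-the-art methods on those tasks. We will release the code and models.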
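The abstract does not say how AdaptFocus estimates action locations or how it turns those estimates into training clips. As a rough illustration of the general idea only (focusing training on clips that are more likely to contain an action), the sketch below samples clip indices in proportion to per-clip scores. It is not the authors' method: the function names, the softmax weighting, the `temperature` parameter, and the example scores are all assumptions made for this example.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw per-clip scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_clips(scores, k, temperature=1.0, rng=None):
    """Sample up to k distinct clip indices, favoring higher-scoring clips.

    Hypothetical helper: `scores` stands in for whatever estimate of
    "how likely an action occurs in this clip" a framework produces.
    """
    rng = rng or random.Random()
    probs = softmax(scores, temperature)
    candidates = list(range(len(scores)))
    chosen = []
    for _ in range(min(k, len(candidates))):
        # Weighted draw among the clips that have not been picked yet.
        weights = [probs[i] for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return sorted(chosen)


# Made-up action-likelihood scores for 8 clips of one long video.
clip_scores = [0.1, 0.2, 2.5, 3.0, 0.3, 0.1, 1.8, 0.2]
selected = sample_clips(clip_scores, k=3, rng=random.Random(0))
print(selected)
```

In a training loop of this kind, only the selected clips would be passed through the backbone, and the loss would be computed against the video-level labels. A lower `temperature` concentrates sampling on the highest-scoring clips, while a higher one spreads it more evenly across the video.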

Benchmarks

| Benchmark | Method | Metrics |
| --- | --- | --- |
| action-classification-on-charades | AdaFocus (weak supervision, MViT-B-K400-pretrain, 16x4) | mAP: 41.4 |
| action-classification-on-charades | AdaFocus (weak supervision, MViT-B-24, 32x3) | mAP: 47.8 |
| action-classification-on-charades | AdaFocus (weak supervision, Slowfast-R50, 16x8) | mAP: 39.3 |
| action-classification-on-charades | AdaFocus (weak supervision, X3D-L, 32x3) | mAP: 41.2 |
| action-segmentation-on-breakfast-1 | AdaFocus (newly extracted I3D-features, LT-Context model) | Acc: 78.0; Average F1: 76.2; Edit: 78.3; F1@10%: 82.1; F1@25%: 79.0; F1@50%: 67.5 |
| long-video-activity-recognition-on-breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, GHRM) | mAP: 69.6 |
| long-video-activity-recognition-on-breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, Timeception) | mAP: 70.4 |
| long-video-activity-recognition-on-breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, Timeception) | mAP: 79.2 |
| long-video-activity-recognition-on-breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, GHRM) | mAP: 79.5 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model) | R1@0.5: 46.9; R1@0.7: 21.1; R5@0.5: 79.3; R5@0.7: 49.2 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model) | R1@0.5: 49.1; R1@0.7: 22.4; R5@0.5: 84.2; R5@0.7: 51.8 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model) | R1@0.5: 56.7; R1@0.7: 35.6; R5@0.5: 87.9; R5@0.7: 65.0 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model) | R1@0.5: 62.4; R1@0.7: 38.6; R5@0.5: 89.4; R5@0.7: 66.4 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model) | R1@0.5: 50.1; R1@0.7: 21.8; R5@0.5: 86.1; R5@0.7: 54.6 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model) | R1@0.5: 51.7; R1@0.7: 23.2; R5@0.5: 85.2; R5@0.7: 52.6 |
| weakly-supervised-action-segmentation-action | AdaFocus (newly extracted I3D-features, POC model) | Acc: 49.6 |
