
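Abstract
Developing end-to-end action recognition models on long videos is essential for long-video action understanding. Because end-to-end training over an entire long video is prohibitively expensive, existing work typically trains models on short clips trimmed from long videos. This "trim-then-train" approach, however, requires action interval annotations to provide clip-level supervision, that is, knowledge of which actions fall within each trimmed clip. Collecting such annotations is very costly, which hinders large-scale model training. This work therefore aims to build a weakly supervised end-to-end framework that trains recognition models on long videos using only video-level action category labels. Without knowing the exact temporal locations of actions in a long video, the proposed weakly supervised framework, AdaFocus, estimates where actions are likely to occur and how likely they are, and thereby adaptively focuses on informative action clips for end-to-end training. The effectiveness of AdaFocus is validated on three long-video datasets. In addition, for downstream long-video tasks, AdaFocus provides a weakly supervised feature extraction pipeline that yields more robust long-video features, significantly improving state-of-the-art methods on those tasks. We will release the code and models.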
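To make the idea of "focusing on informative clips" concrete, the following is a minimal sketch of likelihood-weighted clip sampling. It is not the AdaFocus algorithm, whose details the abstract does not give. The function names, the per-clip score list, and the `temperature` parameter are all hypothetical, and the scores are assumed to come from some estimator of how likely an action occurs in each clip.

```python
import math
import random


def clip_sampling_weights(clip_scores, temperature=1.0):
    """Turn hypothetical per-clip action scores into sampling probabilities.

    A numerically stable softmax; lower temperature sharpens the focus
    on high-scoring clips. This is an illustrative choice, not the
    paper's formulation.
    """
    if not clip_scores:
        raise ValueError("clip_scores must be non-empty")
    scaled = [s / temperature for s in clip_scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_focus_clips(clip_scores, num_clips, temperature=1.0, seed=None):
    """Sample clip indices (with replacement) in proportion to the weights."""
    rng = random.Random(seed)
    weights = clip_sampling_weights(clip_scores, temperature)
    return rng.choices(range(len(clip_scores)), weights=weights, k=num_clips)
```

Under these assumptions, clips with higher estimated action likelihood are drawn more often for training, while low-scoring clips still keep a nonzero chance of being visited.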
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-charades | AdaFocus (weak supervision, MViT-B-K400-pretrain, 16x4) | mAP: 41.4 |
| action-classification-on-charades | AdaFocus (weak supervision, MViT-B-24, 32x3) | mAP: 47.8 |
| action-classification-on-charades | AdaFocus (weak supervision, SlowFast-R50, 16x8) | mAP: 39.3 |
| action-classification-on-charades | AdaFocus (weak supervision, X3D-L, 32x3) | mAP: 41.2 |
| action-segmentation-on-breakfast-1 | AdaFocus (newly extracted I3D-features, LT-Context model) | Acc: 78.0; Average F1: 76.2; Edit: 78.3; F1@10%: 82.1; F1@25%: 79.0; F1@50%: 67.5 |
| long-video-activity-recognition-on-breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, GHRM) | mAP: 69.6 |
| long-video-activity-recognition-on-breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, Timeception) | mAP: 70.4 |
| long-video-activity-recognition-on-breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, Timeception) | mAP: 79.2 |
| long-video-activity-recognition-on-breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, GHRM) | mAP: 79.5 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model) | R1@0.5: 46.9; R1@0.7: 21.1; R5@0.5: 79.3; R5@0.7: 49.2 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model) | R1@0.5: 49.1; R1@0.7: 22.4; R5@0.5: 84.2; R5@0.7: 51.8 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model) | R1@0.5: 56.7; R1@0.7: 35.6; R5@0.5: 87.9; R5@0.7: 65.0 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model) | R1@0.5: 62.4; R1@0.7: 38.6; R5@0.5: 89.4; R5@0.7: 66.4 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model) | R1@0.5: 50.1; R1@0.7: 21.8; R5@0.5: 86.1; R5@0.7: 54.6 |
| temporal-sentence-grounding-on-charades-sta | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model) | R1@0.5: 51.7; R1@0.7: 23.2; R5@0.5: 85.2; R5@0.7: 52.6 |
| weakly-supervised-action-segmentation-action | AdaFocus (newly extracted I3D-features, POC model) | Acc: 49.6 |
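
For readers unfamiliar with the temporal grounding metrics in the table, the sketch below shows how a score such as R1@0.5 is commonly computed. It assumes the usual convention that R{k}@{m} is the percentage of queries for which at least one of the top-k predicted segments has temporal IoU of at least m with the ground truth. The function names are illustrative, and whether the comparison is `>=` or `>` varies between evaluation scripts.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked_preds, gts, k, iou_threshold):
    """Percentage of queries with a hit among the top-k ranked segments.

    ranked_preds: one list of (start, end) predictions per query, best first.
    gts: one ground-truth (start, end) segment per query.
    Assumption: a hit means IoU >= iou_threshold.
    """
    if not gts:
        raise ValueError("gts must be non-empty")
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(gts)


# Two toy queries: the first is only matched by its second-ranked segment.
preds = [[(0.0, 10.0), (4.0, 14.0)], [(0.0, 1.0)]]
gts = [(5.0, 15.0), (10.0, 20.0)]
r1 = recall_at_k(preds, gts, k=1, iou_threshold=0.5)
r5 = recall_at_k(preds, gts, k=5, iou_threshold=0.5)
```

In this toy case `r1` is 0.0 and `r5` is 50.0, because the first query's top-ranked segment overlaps the ground truth with IoU 1/3 while its second-ranked segment reaches IoU 9/11.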