Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition

Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang

Abstract

Developing end-to-end action recognition models on long videos is fundamental and crucial for long-video action understanding. Due to the unaffordable cost of end-to-end training on whole long videos, existing works generally train models on short clips trimmed from long videos. However, this "trimming-then-training" practice requires action interval annotations for clip-level supervision, i.e., knowing which actions are trimmed into the clips. Unfortunately, collecting such annotations is very expensive and prevents model training at scale. To this end, this work aims to build a weakly supervised end-to-end framework for training recognition models on long videos, with only video-level action category labels. Without knowing the precise temporal locations of actions in long videos, our proposed weakly supervised framework, namely AdaptFocus, estimates where and how likely the actions will occur to adaptively focus on informative action clips for end-to-end training. The effectiveness of the proposed AdaptFocus framework is demonstrated on three long-video datasets. Furthermore, for downstream long-video tasks, our AdaptFocus framework provides a weakly supervised feature extraction pipeline for extracting more robust long-video features, such that the state-of-the-art methods on downstream tasks are significantly advanced. We will release the code and models.
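
The abstract describes the mechanism only at a high level: a lightweight estimate of where actions are likely to occur selects a few informative clips per long video, only those clips are passed through the expensive backbone, and supervision comes from the video-level category labels alone. The sketch below is one plausible reading of that loop; it is not the authors' released code, and ClipScorer, train_step, the mean-pooling aggregation of clip logits, and all shapes and hyperparameters are assumptions. How the clip scorer itself is supervised is omitted, since the sampling step is non-differentiable.

```python
# Minimal illustrative sketch (not the AdaptFocus release) of weakly supervised
# end-to-end training on a long video: score candidate clips, focus the backbone
# on the clips most likely to contain actions, and supervise with video-level labels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClipScorer(nn.Module):
    """Lightweight scorer estimating how likely each candidate clip contains
    a labeled action (an assumed stand-in for the 'where and how likely' step)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (num_clips, feat_dim) cheap per-clip features
        return self.mlp(clip_feats).squeeze(-1)  # (num_clips,) likelihood logits


def train_step(backbone: nn.Module,
               scorer: ClipScorer,
               clips: torch.Tensor,          # (num_clips, C, T, H, W) raw clips of one long video
               clip_feats: torch.Tensor,     # (num_clips, feat_dim) cheap features for scoring
               video_labels: torch.Tensor,   # (num_classes,) video-level multi-label targets
               optimizer: torch.optim.Optimizer,
               k: int = 4) -> float:
    """One weakly supervised step: sample k informative clips, run the backbone
    end-to-end on them only, and train against the video-level labels."""
    scores = scorer(clip_feats)                                  # (num_clips,)
    probs = torch.softmax(scores, dim=0)
    # Focus on likely-action clips; sampling (rather than top-k) keeps some exploration.
    idx = torch.multinomial(probs, num_samples=min(k, clips.size(0)), replacement=False)

    clip_logits = backbone(clips[idx])                           # (k, num_classes)
    # Aggregate clip-level predictions into a video-level prediction (assumed: mean pooling).
    video_logits = clip_logits.mean(dim=0)                       # (num_classes,)

    loss = F.binary_cross_entropy_with_logits(video_logits, video_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```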

Benchmarks

Benchmark | Methodology | Metrics
action-classification-on-charades | AdaFocus (weak supervision, MViT-B-K400-pretrain, 16x4) | MAP: 41.4
action-classification-on-charades | AdaFocus (weak supervision, MViT-B-24, 32x3) | MAP: 47.8
action-classification-on-charades | AdaFocus (weak supervision, Slowfast-R50, 16x8) | MAP: 39.3
action-classification-on-charades | AdaFocus (weak supervision, X3D-L, 32x3) | MAP: 41.2
action-segmentation-on-breakfast-1 | AdaFocus (newly extracted I3D-features, LT-Context model) | Acc: 78.0, Average F1: 76.2, Edit: 78.3, F1@10%: 82.1, F1@25%: 79.0, F1@50%: 67.5
long-video-activity-recognition-on-breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, GHRM) | mAP: 69.6
long-video-activity-recognition-on-breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, Timeception) | mAP: 70.4
long-video-activity-recognition-on-breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, Timeception) | mAP: 79.2
long-video-activity-recognition-on-breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, GHRM) | mAP: 79.5
temporal-sentence-grounding-on-charades-sta | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model) | R1@0.5: 46.9, R1@0.7: 21.1, R5@0.5: 79.3, R5@0.7: 49.2
temporal-sentence-grounding-on-charades-sta | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model) | R1@0.5: 49.1, R1@0.7: 22.4, R5@0.5: 84.2, R5@0.7: 51.8
temporal-sentence-grounding-on-charades-sta | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model) | R1@0.5: 56.7, R1@0.7: 35.6, R5@0.5: 87.9, R5@0.7: 65.0
temporal-sentence-grounding-on-charades-sta | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model) | R1@0.5: 62.4, R1@0.7: 38.6, R5@0.5: 89.4, R5@0.7: 66.4
temporal-sentence-grounding-on-charades-sta | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model) | R1@0.5: 50.1, R1@0.7: 21.8, R5@0.5: 86.1, R5@0.7: 54.6
temporal-sentence-grounding-on-charades-sta | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model) | R1@0.5: 51.7, R1@0.7: 23.2, R5@0.5: 85.2, R5@0.7: 52.6
weakly-supervised-action-segmentation-action | AdaFocus (newly extracted I3D-features, POC model) | Acc: 49.6
