Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition

Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang

Abstract

Developing end-to-end action recognition models on long videos is fundamental and crucial for long-video action understanding. Due to the unaffordable cost of end-to-end training on whole long videos, existing works generally train models on short clips trimmed from long videos. However, this "trimming-then-training" practice requires action interval annotations for clip-level supervision, i.e., knowing which actions are trimmed into the clips. Unfortunately, collecting such annotations is very expensive and prevents model training at scale. To this end, this work aims to build a weakly supervised end-to-end framework for training recognition models on long videos, with only video-level action category labels. Without knowing the precise temporal locations of actions in long videos, our proposed weakly supervised framework, namely AdaptFocus, estimates where and how likely the actions will occur to adaptively focus on informative action clips for end-to-end training. The effectiveness of the proposed AdaptFocus framework is demonstrated on three long-video datasets. Furthermore, for downstream long-video tasks, our AdaptFocus framework provides a weakly supervised feature extraction pipeline for extracting more robust long-video features, such that the state-of-the-art methods on downstream tasks are significantly advanced. We will release the code and models.
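
The abstract describes the mechanism only at a high level: a lightweight estimate of where actions are likely to occur selects a few informative clips per long video, only those clips are passed through the expensive backbone, and supervision comes from the video-level category labels alone. The sketch below is one plausible reading of that loop; it is not the authors' released code, and ClipScorer, train_step, the mean-pooling aggregation of clip logits, and all shapes and hyperparameters are assumptions. How the clip scorer itself is supervised is omitted, since the sampling step is non-differentiable.

```python
# Minimal illustrative sketch (not the AdaptFocus release) of weakly supervised
# end-to-end training on a long video: score candidate clips, focus the backbone
# on the clips most likely to contain actions, and supervise with video-level labels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClipScorer(nn.Module):
    """Lightweight scorer estimating how likely each candidate clip contains
    a labeled action (an assumed stand-in for the 'where and how likely' step)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (num_clips, feat_dim) cheap per-clip features
        return self.mlp(clip_feats).squeeze(-1)  # (num_clips,) likelihood logits


def train_step(backbone: nn.Module,
               scorer: ClipScorer,
               clips: torch.Tensor,          # (num_clips, C, T, H, W) raw clips of one long video
               clip_feats: torch.Tensor,     # (num_clips, feat_dim) cheap features for scoring
               video_labels: torch.Tensor,   # (num_classes,) video-level multi-label targets
               optimizer: torch.optim.Optimizer,
               k: int = 4) -> float:
    """One weakly supervised step: sample k informative clips, run the backbone
    end-to-end on them only, and train against the video-level labels."""
    scores = scorer(clip_feats)                                  # (num_clips,)
    probs = torch.softmax(scores, dim=0)
    # Focus on likely-action clips; sampling (rather than top-k) keeps some exploration.
    idx = torch.multinomial(probs, num_samples=min(k, clips.size(0)), replacement=False)

    clip_logits = backbone(clips[idx])                           # (k, num_classes)
    # Aggregate clip-level predictions into a video-level prediction (assumed: mean pooling).
    video_logits = clip_logits.mean(dim=0)                       # (num_classes,)

    loss = F.binary_cross_entropy_with_logits(video_logits, video_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```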

Benchmarks

Benchmark | Methodology | Metrics
action-classification-on-charades | AdaFocus (weak supervision, MViT-B-K400-pretrain, 16x4) | MAP: 41.4
action-classification-on-charades | AdaFocus (weak supervision, MViT-B-24, 32x3) | MAP: 47.8
action-classification-on-charades | AdaFocus (weak supervision, Slowfast-R50, 16x8) | MAP: 39.3
action-classification-on-charades | AdaFocus (weak supervision, X3D-L, 32x3) | MAP: 41.2
action-segmentation-on-breakfast-1 | AdaFocus (newly extracted I3D-features, LT-Context model) | Acc: 78.0, Average F1: 76.2, Edit: 78.3, F1@10%: 82.1, F1@25%: 79.0, F1@50%: 67.5
long-video-activity-recognition-on-breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, GHRM) | mAP: 69.6
long-video-activity-recognition-on-breakfast | AdaFocus (I3D-Breakfast-Pretrain-feature, Timeception) | mAP: 70.4
long-video-activity-recognition-on-breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, Timeception) | mAP: 79.2
long-video-activity-recognition-on-breakfast | AdaFocus (MViT-Breakfast-Pretrain-feature, GHRM) | mAP: 79.5
temporal-sentence-grounding-on-charades-sta | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model) | R1@0.5: 46.9, R1@0.7: 21.1, R5@0.5: 79.3, R5@0.7: 49.2
temporal-sentence-grounding-on-charades-sta | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model) | R1@0.5: 49.1, R1@0.7: 22.4, R5@0.5: 84.2, R5@0.7: 51.8
temporal-sentence-grounding-on-charades-sta | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model) | R1@0.5: 56.7, R1@0.7: 35.6, R5@0.5: 87.9, R5@0.7: 65.0
temporal-sentence-grounding-on-charades-sta | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model) | R1@0.5: 62.4, R1@0.7: 38.6, R5@0.5: 89.4, R5@0.7: 66.4
temporal-sentence-grounding-on-charades-sta | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model) | R1@0.5: 50.1, R1@0.7: 21.8, R5@0.5: 86.1, R5@0.7: 54.6
temporal-sentence-grounding-on-charades-sta | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model) | R1@0.5: 51.7, R1@0.7: 23.2, R5@0.5: 85.2, R5@0.7: 52.6
weakly-supervised-action-segmentation-action | AdaFocus (newly extracted I3D-features, POC model) | Acc: 49.6
