5 months ago

Multi-granularity Correspondence Learning from Long-term Noisy Videos

Lin Yijie ; Zhang Jie ; Huang Zhenyu ; Liu Jia ; Wen Zujie ; Peng Xi

Abstract

Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the over-high computational cost of modeling long videos. To address this issue, one feasible solution is learning the correspondence between video clips and captions, which however inevitably encounters the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to the clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), hindering temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton) that addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in video-paragraph contrast, Norton filters out the irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address the fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits the potential faulty negative samples in clip-caption contrast by rectifying the alignment target with OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at https://lin-yijie.github.io/projects/Norton.
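
The three ingredients named in the abstract (a soft-maximum over frame-word similarities, an extra "bucket" that absorbs unalignable clips and captions, and an OT assignment between clips and captions) can be illustrated with a small sketch. This is not the authors' implementation: every function name, the log-sum-exp form of the soft-maximum, the uniform marginals, and the values of `alpha`, `eps`, and `p` are assumptions chosen for illustration, and the OT solver is a plain Sinkhorn iteration in pure Python.

```python
import math

def soft_max_similarity(frame_word_sim, alpha=10.0):
    """Assumed soft-maximum: log-sum-exp over words, averaged over frames.
    frame_word_sim[f][w] is the similarity of frame f and word w."""
    per_frame = []
    for row in frame_word_sim:
        m = max(alpha * s for s in row)  # shift for numerical stability
        lse = m + math.log(sum(math.exp(alpha * s - m) for s in row))
        per_frame.append(lse / alpha)
    return sum(per_frame) / len(per_frame)

def append_bucket(sim, p=0.3):
    """Add one extra row and column with constant similarity p (assumed value).
    Clips or captions whose best match scores below p tend to go there."""
    rows = [list(row) + [p] for row in sim]
    rows.append([p] * (len(sim[0]) + 1))
    return rows

def sinkhorn(sim, eps=0.1, n_iters=50):
    """Entropy-regularised OT with uniform marginals (an assumption).
    Returns the transport plan as a list of rows."""
    n, m = len(sim), len(sim[0])
    kernel = [[math.exp(sim[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternate row and column scaling to match the marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def align(sim, p=0.3):
    """Return, for each clip, the index of its best caption or None if the
    clip is mostly transported to the bucket (treated as irrelevant)."""
    plan = sinkhorn(append_bucket(sim, p))
    bucket = len(sim[0])
    result = []
    for row in plan[:-1]:  # skip the bucket row itself
        best = max(range(len(row)), key=lambda j: row[j])
        result.append(None if best == bucket else best)
    return result

# Toy clip-caption similarities: clip 2 matches no caption well.
clip_caption_sim = [
    [0.90, 0.10, 0.00],
    [0.20, 0.80, 0.10],
    [0.00, 0.10, 0.05],
]
assignment = align(clip_caption_sim)
```

In this toy setup the transport plan, rather than the raw diagonal, decides which clip-caption pairs count as positives, which is the role the abstract gives to the OT assignment; the threshold-like effect of `p` is what lets weakly matching clips drop out.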

Code Repositories

XLearning-SCU/2024-ICLR-Norton
pytorch
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
action-segmentation-on-coin | Norton | Frame accuracy: 69.8
long-video-retrieval-background-removed-on | Norton | Cap. Avg. R@1: 75.5; Cap. Avg. R@5: 95.0; Cap. Avg. R@10: 97.7; DTW R@1: 88.7; DTW R@5: 98.8; DTW R@10: 99.5; OTAM R@1: 88.9; OTAM R@5: 98.4; OTAM R@10: 99.5
video-question-answering-on-msrvtt-mc | Norton | Accuracy: 92.7
zero-shot-video-retrieval-on-msr-vtt | Norton | text-to-video R@1: 10.7; text-to-video R@5: 24.1
zero-shot-video-retrieval-on-youcook2 | Norton | text-to-video R@1: 24.2; text-to-video R@5: 51.9; text-to-video R@10: 64.1
