COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

Abstract
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext.
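As an illustration of the third component, below is a minimal sketch of a cross-modal cycle-consistency loss in PyTorch. It assumes `clip_emb` (n_clips × d) and `sent_emb` (n_sents × d) are embeddings in a shared space; the function names and the squared-index penalty are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_nearest_neighbor(query, keys):
    """Softmax-weighted nearest neighbor of `query` among `keys` (soft NN sketch)."""
    # query: (d,), keys: (n, d)
    sims = -torch.cdist(query.unsqueeze(0), keys).squeeze(0)  # negative distances as similarities
    weights = F.softmax(sims, dim=0)                          # (n,)
    return weights @ keys                                     # (d,)

def cycle_consistency_loss(clip_emb, sent_emb):
    """Sketch of a cross-modal cycle-consistency loss: clip -> soft nearest sentence
    -> soft nearest clip; penalize how far the cycled position drifts from the start."""
    n = clip_emb.shape[0]
    idx = torch.arange(n, dtype=clip_emb.dtype, device=clip_emb.device)
    loss = 0.0
    for i in range(n):
        # forward: clip i -> soft neighbor in the sentence embedding space
        sent_nn = soft_nearest_neighbor(clip_emb[i], sent_emb)
        # backward: soft assignment of that sentence neighbor over the clips
        sims = -torch.cdist(sent_nn.unsqueeze(0), clip_emb).squeeze(0)
        beta = F.softmax(sims, dim=0)                          # (n,)
        cycled_i = (beta * idx).sum()                          # expected clip index after the cycle
        loss = loss + (cycled_i - i) ** 2                      # penalize deviation from the start index
    return loss / n
```

A cycle that returns to its starting clip indicates that the clip and its semantically matching sentence are mutual nearest neighbors, which is the alignment the loss encourages.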
Code Repositories
https://github.com/gingsi/coot-videotext
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-captioning-on-activitynet-captions | COOT (ae-test split, appearance features only) | BLEU-3: 17.43, BLEU-4: 10.85, CIDEr: 28.19, METEOR: 15.99, ROUGE-L: 31.45 |
| video-captioning-on-youcook2 | COOT | BLEU-3: 17.97, BLEU-4: 11.30, CIDEr: 0.57, METEOR: 19.85, ROUGE-L: 37.94 |
| video-retrieval-on-youcook2 | COOT | text-to-video Median Rank: 9, text-to-video R@1: 16.7, text-to-video R@10: 52.3 |