COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

Abstract

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
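To make the third component more concrete, here is a minimal sketch of one way a cross-modal cycle-consistency loss can be computed. It assumes soft nearest neighbours under squared Euclidean distance and a squared penalty on the index the cycle returns to; the function names are hypothetical and the released code may formulate the loss differently.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    # Weighted average of candidates; closer candidates get larger weights.
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(sentences, clips):
    # Assumed formulation: sentence i -> soft nearest clip -> soft position
    # back among the sentences; penalise the squared distance between the
    # starting index i and that soft position.
    total = 0.0
    for i, sentence in enumerate(sentences):
        clip_nn = soft_nearest_neighbor(sentence, clips)
        back = softmax([-sq_dist(clip_nn, s) for s in sentences])
        soft_index = sum(w * k for k, w in enumerate(back))
        total += (i - soft_index) ** 2
    return total / len(sentences)


# Toy embeddings (illustrative values only, not from the paper).
aligned = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss_aligned = cycle_consistency_loss(aligned, aligned)
loss_collapsed = cycle_consistency_loss(aligned, collapsed)
```

In this toy setup the loss is close to zero when each sentence cycles back to itself through its matching clip, and it grows when the clip embeddings carry no information about which sentence they belong to.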

Code Repositories

gingsi/coot-videotext (official, PyTorch)

Benchmarks

Benchmark: video-captioning-on-activitynet-captions
Methodology: COOT (ae-test split) - Only Appearance features
Metrics: BLEU-3: 17.43; BLEU-4: 10.85; CIDEr: 28.19; METEOR: 15.99; ROUGE-L: 31.45

Benchmark: video-captioning-on-youcook2
Methodology: COOT
Metrics: BLEU-3: 17.97; BLEU-4: 11.30; CIDEr: 0.57; METEOR: 19.85; ROUGE-L: 37.94

Benchmark: video-retrieval-on-youcook2
Methodology: COOT
Metrics: text-to-video Median Rank: 9; text-to-video R@1: 16.7; text-to-video R@10: 52.3
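
For readers unfamiliar with the retrieval metrics above, the following sketch shows how R@K and median rank are commonly derived from a text-to-video similarity matrix. It assumes that query i has exactly one correct video, at index i; the function names and the toy scores are illustrative and not taken from the paper or its code.

```python
from statistics import median


def ranks_from_similarity(sim):
    # sim[i][j] is the similarity of text query i to video j.
    # Assumption: the correct video for query i is video i.
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank 1 means no other video scored higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    # Percentage of queries whose correct video is ranked in the top k.
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    # Median position of the correct video over all queries (lower is better).
    return median(ranks)


# Toy similarity matrix: 3 text queries against 3 videos.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6],
]
ranks = ranks_from_similarity(sim)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
med_r = median_rank(ranks)
```

Higher R@K is better, while a lower median rank is better.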
