Shruti Palaskar; Jindrich Libovický; Spandana Gella; Florian Metze

Abstract
In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is less to "compress" text information than to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compare various models trained with different modalities, and present pilot experiments on the How2 corpus of instructional videos. We also propose a new evaluation metric (Content F1) for the abstractive summarization task that measures the semantic adequacy of the summaries rather than their fluency, which is already covered by metrics like ROUGE and BLEU.
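To make the idea of a content-focused score concrete, the following is a minimal sketch of an F1 computed over content words only. It is an illustration under stated assumptions, not the paper's exact Content F1 definition: the abstract says only that the metric targets semantic adequacy rather than fluency. The stopword list, the whitespace tokenizer, exact-match overlap, and the names `content_words` and `content_f1` are all hypothetical.

```python
from collections import Counter

# Hypothetical stopword list for illustration only; the paper's own
# filtering of function words may differ.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "and", "is",
             "how", "this", "you", "on", "for", "with"}


def content_words(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    tokens = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]


def content_f1(hypothesis, reference):
    """F1 over exact-match content-word overlap (an assumed simplification)."""
    hyp = Counter(content_words(hypothesis))
    ref = Counter(content_words(reference))
    # Multiset intersection counts each shared word at most as often
    # as it appears in both summaries.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = content_f1("Learn how to cut the onion.",
                   "How to cut an onion in this video")
```

Because function words are discarded before matching, a summary cannot raise its score by reproducing fluent but uninformative phrasing, which is the contrast with ROUGE and BLEU that the abstract draws.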
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| text-summarization-on-how2 | Ground-truth transcript + Action with Hierarchical Attn | Content F1: 48.9; ROUGE-L: 54.9 |