Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Mingfei Han; Linjie Yang; Xiaojun Chang; Lina Yao; Heng Wang

Abstract
A short video clip may contain the progression of multiple events and an interesting storyline. A human needs to capture both the events in every shot and associate them to understand the story behind the clip. In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions, comprehensive video summaries, and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering. Preliminary experiments show that generating long and comprehensive summaries for multi-shot videos remains challenging. Nevertheless, the generated imperfect summaries already achieve competitive performance on existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.
Code Repositories
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-captioning-on-shot2story20k | Shot2Story | BLEU-4: 10.7 CIDEr: 37.4 METEOR: 16.2 ROUGE: 29.6 |
| video-narration-captioning-on-shot2story20k | Ours | BLEU-4: 18.8 CIDEr: 168.7 METEOR: 24.8 ROUGE: 39.0 |
| video-summarization-on-shot2story20k | SUM-shot | BLEU-4: 11.7 CIDEr: 8.6 METEOR: 19.7 ROUGE: 26.8 |
| zeroshot-video-question-answer-on-msrvtt-qa | SUM-shot+Vicuna | Accuracy: 56.8 |
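The captioning and summarization rows above are scored with standard n-gram metrics (BLEU-4, CIDEr, METEOR, ROUGE). As a rough illustration of what BLEU-4 measures, the sketch below implements sentence-level BLEU-4 for a single reference caption: clipped n-gram precisions for n = 1..4, combined by a geometric mean with a brevity penalty. This is a simplified, self-contained version for intuition only; the benchmark numbers come from the standard evaluation toolkits, which use corpus-level statistics and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 against a single reference (illustrative sketch)."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, 5):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped precision: each hypothesis n-gram is credited at most
        # as many times as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch: any zero precision zeroes the score
    # Brevity penalty discourages hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```

For example, `bleu4("a man opens the door", "a man opens the door")` returns 1.0, while a caption sharing no 4-gram with the reference scores 0.0 under this unsmoothed variant.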