GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
Wang Yuxuan ; Gao Difei ; Yu Licheng ; Lei Stan Weixian ; Feiszli Matt ; Shou Mike Zheng

Abstract
Cognitive science has shown that humans perceive videos in terms of events separated by the state changes of dominant subjects. State changes trigger new events and are one of the most useful among the large amount of redundant information perceived. However, previous research focuses on the overall understanding of segments without evaluating the fine-grained status changes inside. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170k boundaries associated with captions describing status changes in the generic events in 12K videos. Upon this new dataset, we propose three tasks supporting the development of a more fine-grained, robust, and human-like understanding of videos through status changes. We evaluate many representative baselines in our dataset, where we also design a new TPD (Temporal-based Pairwise Difference) Modeling method for visual difference and achieve significant performance improvements. Besides, the results show there are still formidable challenges for current methods in the utilization of different granularities, representation of visual difference, and the accurate localization of status changes. Further analysis shows that our dataset can drive developing more powerful methods to understand status changes and thus improve video level comprehension. The dataset including both videos and boundaries is available at https://yuxuan-w.github.io/GEB-plus/
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| boundary-captioning-on-kinetic-geb | ActBERT-revised | CIDEr: 74.71, ROUGE-L: 28.15, SPICE: 19.52 |
| boundary-grounding-on-kinetic-geb | FROZEN-revised | F1@0.1s: 4.28, F1@0.2s: 8.54, F1@0.5s: 18.33, F1@1.0s: 31.04, F1@1.5s: 40.48, F1@2.0s: 47.86, F1@2.5s: 54.81, F1@3.0s: 61.45, F1@Avg: 33.35 |
| text-to-video-retrieval-on-kinetic-geb | FROZEN-revised | mAP: 23.39 |
| text-to-video-retrieval-on-kinetic-geb | FROZEN-revised (two-stream) | text-to-video R@1: 12.8, text-to-video R@5: 34.81, text-to-video R@10: 45.66, text-to-video R@50: 68.1 |
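
The boundary-grounding row reports one F1 score per temporal threshold plus an F1@Avg value. The page does not define F1@Avg; the short check below assumes it is the unweighted mean of the eight listed thresholds, which is consistent with the reported numbers. The dictionary and the `mean_f1` helper are illustrative names, not part of any released evaluation code.

```python
# F1 scores (in percent) copied from the boundary-grounding row of the table.
# Keys are the temporal thresholds in seconds.
f1_by_threshold = {
    0.1: 4.28, 0.2: 8.54, 0.5: 18.33, 1.0: 31.04,
    1.5: 40.48, 2.0: 47.86, 2.5: 54.81, 3.0: 61.45,
}


def mean_f1(scores):
    """Unweighted mean of per-threshold F1 scores (assumed definition of F1@Avg)."""
    return sum(scores.values()) / len(scores)


avg = mean_f1(f1_by_threshold)
print(f"{avg:.2f}")  # compare against the reported F1@Avg of 33.35
```

Under this assumption the mean is 33.34875, which rounds to the reported 33.35.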