5 months ago

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Soldan Mattia ; Pardo Alejandro ; Alcázar Juan León ; Heilbron Fabian Caba ; Zhao Chen ; Giancola Silvio ; Ghanem Bernard

Abstract

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of videos and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours. We have released MAD's data and baselines code at https://github.com/Soldelli/MAD.

Code Repositories

Soldelli/MAD (official, PyTorch)

Benchmarks

Benchmark: natural-language-moment-retrieval-on-mad

Metric           | VLG-Net | CLIP  | Random Chance
-----------------|---------|-------|--------------
R@1, IoU=0.1     | 3.50    | 6.57  | 0.09
R@1, IoU=0.3     | 2.63    | 3.13  | 0.04
R@1, IoU=0.5     | 1.61    | 1.39  | 0.01
R@5, IoU=0.1     | 11.74   | 15.05 | 0.44
R@5, IoU=0.3     | 9.49    | 9.85  | 0.19
R@5, IoU=0.5     | 6.23    | 5.44  | 0.07
R@10, IoU=0.1    | 18.32   | 20.26 | 0.88
R@10, IoU=0.3    | 15.2    | 14.13 | 0.39
R@10, IoU=0.5    | 10.18   | 8.38  | 0.14
R@50, IoU=0.1    | 38.41   | 37.92 | 4.33
R@50, IoU=0.3    | 33.68   | 28.71 | 1.92
R@50, IoU=0.5    | 25.33   | 18.80 | 0.71
R@100, IoU=0.1   | 49.65   | 47.73 | 8.47
R@100, IoU=0.3   | 43.95   | 36.98 | 3.80
R@100, IoU=0.5   | 34.18   | 24.99 | 1.40
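
The benchmark figures are reported as R@K, IoU=θ. Under the usual reading of this metric in moment retrieval, a query counts as a hit when at least one of its top K ranked predictions overlaps the ground-truth moment with a temporal IoU of at least θ, and the score is the percentage of queries that are hits. The sketch below illustrates that definition only. The function names, the `(start, end)` tuple layout in seconds, and the `>=` comparison are assumptions made for illustration and are not taken from the MAD repository.

```python
from typing import List, Tuple

# A moment is (start_sec, end_sec); this layout is an assumption for the sketch.
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: List[List[Moment]],  # per query, predictions sorted best-first
    gts: List[Moment],                 # per query, one ground-truth moment
    k: int,
    iou_thr: float,
) -> float:
    """Percentage of queries with at least one top-k prediction at IoU >= iou_thr."""
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thr for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Two toy queries; in both, the second-ranked prediction overlaps the target.
    preds = [[(0.0, 10.0), (20.0, 30.0)], [(5.0, 6.0), (100.0, 110.0)]]
    gts = [(22.0, 30.0), (102.0, 112.0)]
    for k in (1, 2):
        print(f"R@{k}, IoU=0.5: {recall_at_k(preds, gts, k, 0.5):.2f}")
```

Because a target moment is typically seconds long while a video can last up to three hours, only a small fraction of candidate windows overlap the target at all, which is consistent with the very low Random Chance row.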
