5 months ago

PACS: A Dataset for Physical Audiovisual CommonSense Reasoning

Samuel Yu; Peter Wu; Paul Pu Liang; Ruslan Salakhutdinov; Louis-Philippe Morency

Abstract

In order for AI to be safely deployed in real-world scenarios such as hospitals, schools, and the workplace, it must be able to robustly reason about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is fundamentally a multi-sensory task, since physical properties are manifested through multiple modalities, two of which are vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance the research field of physical reasoning by making audio a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on our new challenging task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude the paper by demonstrating the importance of multimodal reasoning and providing possible avenues for future research.

Code Repositories

samuelyu2002/pacs (Official, jax, mentioned in GitHub)

Benchmarks

Benchmark: physical-commonsense-reasoning-on-physical

Methodology              With Audio (Acc %)   Without Audio (Acc %)
UNITER (Large)           not listed           60.6 ± 2.2
Human                    96.3 ± 2.1           90.5 ± 3.1
Merlot Reserve (Large)   70.1 ± 1.0           68.4 ± 0.7
Majority                 50.4                 50.4
CLIP/AudioCLIP           60.0 ± 0.9           56.3 ± 0.7
Late Fusion              55.0 ± 1.1           52.5 ± 1.6
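
To make the effect of audio in the listing above easier to read, the following is a minimal sketch that recomputes two derived numbers from the reported mean accuracies: the gain from adding audio, and the remaining gap to human performance with audio. The names `RESULTS`, `audio_gain`, and `gap_to_human` are illustrative and do not come from the paper or its repository. The sketch ignores the reported ± spreads, so small differences may not be meaningful.

```python
# Mean accuracies (%) copied from the benchmark listing as (with audio, without audio).
# None marks a value that the listing does not give.
RESULTS = {
    "UNITER (Large)": (None, 60.6),
    "Human": (96.3, 90.5),
    "Merlot Reserve (Large)": (70.1, 68.4),
    "Majority": (50.4, 50.4),
    "CLIP/AudioCLIP": (60.0, 56.3),
    "Late Fusion": (55.0, 52.5),
}


def audio_gain(name):
    """Accuracy points gained by adding audio, or None if not computable."""
    with_audio, without_audio = RESULTS[name]
    if with_audio is None or without_audio is None:
        return None
    return round(with_audio - without_audio, 1)


def gap_to_human(name):
    """Accuracy points between the human score and this entry, both with audio."""
    with_audio, _ = RESULTS[name]
    if with_audio is None:
        return None
    return round(RESULTS["Human"][0] - with_audio, 1)


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name}: audio gain = {audio_gain(name)}, gap to human = {gap_to_human(name)}")
```

Under these numbers, every listed model gains less from audio than the human annotators do, and the strongest model with audio remains well below the human score.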
