a month ago

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer Yuxuan Li Paul Vicol Shixiang Shane Gu Nick Matarese Kevin Swersky Been Kim Priyank Jaini Robert Geirhos

Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
