Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos

Abstract
The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.