5 months ago

Video Understanding

Video Captioning

Computer Vision

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos

Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

