UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

Abstract
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
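To make the dual-stream design concrete, the sketch below renders one plausible reading of the abstract in PyTorch: an instruction stream standing in for the MLLM produces condition embeddings, and a denoiser stream standing in for the MMDiT attends jointly over those embeddings and noisy video latent tokens. All class names (`InstructionStream`, `MMDiTBlock`, `DualStreamDenoiser`), dimensions, depths, and the concatenation-based joint-attention wiring are our own illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a dual-stream MLLM + MMDiT design, as described in
# the abstract. Every module, name, and dimension here is an illustrative
# assumption, not the paper's actual architecture.
import torch
import torch.nn as nn


class InstructionStream(nn.Module):
    """Stands in for the MLLM: maps instruction tokens to condition embeddings."""

    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):                    # (B, T_text)
        return self.encoder(self.embed(token_ids))   # (B, T_text, dim)


class MMDiTBlock(nn.Module):
    """One joint-attention block: condition tokens and video latent tokens
    attend to each other within a single shared sequence."""

    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class DualStreamDenoiser(nn.Module):
    """Concatenates MLLM condition tokens with noisy video latent tokens,
    runs joint MMDiT-style blocks, and returns the processed video tokens."""

    def __init__(self, dim=512, depth=4):
        super().__init__()
        self.instruction_stream = InstructionStream(dim=dim)
        self.blocks = nn.ModuleList(MMDiTBlock(dim) for _ in range(depth))

    def forward(self, video_latents, token_ids):     # (B, T_vid, dim), (B, T_text)
        cond = self.instruction_stream(token_ids)
        x = torch.cat([cond, video_latents], dim=1)  # shared sequence
        for block in self.blocks:
            x = block(x)
        return x[:, cond.shape[1]:]                  # keep only the video tokens


if __name__ == "__main__":
    model = DualStreamDenoiser()
    latents = torch.randn(2, 64, 512)                # 2 clips, 64 latent tokens
    tokens = torch.randint(0, 32000, (2, 16))        # 2 instructions, 16 tokens
    print(model(latents, tokens).shape)              # torch.Size([2, 64, 512])
```

Under this reading, letting the instruction and video tokens share one attention sequence is what would allow the denoiser to follow a complex multimodal instruction while keeping the generated video visually consistent with any reference content, which is the property the abstract attributes to the dual-stream design.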