UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

Abstract
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
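To make the dual-stream design concrete, the sketch below renders one plausible reading of the abstract in PyTorch: an instruction stream standing in for the MLLM produces condition embeddings, and a denoiser stream standing in for the MMDiT attends jointly over those embeddings and noisy video latent tokens. All class names (`InstructionStream`, `MMDiTBlock`, `DualStreamDenoiser`), dimensions, depths, and the concatenation-based joint-attention wiring are our own illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a dual-stream MLLM + MMDiT design, as described in
# the abstract. Every module, name, and dimension here is an illustrative
# assumption, not the paper's actual architecture.
import torch
import torch.nn as nn


class InstructionStream(nn.Module):
    """Stands in for the MLLM: maps instruction tokens to condition embeddings."""

    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):                    # (B, T_text)
        return self.encoder(self.embed(token_ids))   # (B, T_text, dim)


class MMDiTBlock(nn.Module):
    """One joint-attention block: condition tokens and video latent tokens
    attend to each other within a single shared sequence."""

    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class DualStreamDenoiser(nn.Module):
    """Concatenates MLLM condition tokens with noisy video latent tokens,
    runs joint MMDiT-style blocks, and returns the processed video tokens."""

    def __init__(self, dim=512, depth=4):
        super().__init__()
        self.instruction_stream = InstructionStream(dim=dim)
        self.blocks = nn.ModuleList(MMDiTBlock(dim) for _ in range(depth))

    def forward(self, video_latents, token_ids):     # (B, T_vid, dim), (B, T_text)
        cond = self.instruction_stream(token_ids)
        x = torch.cat([cond, video_latents], dim=1)  # shared sequence
        for block in self.blocks:
            x = block(x)
        return x[:, cond.shape[1]:]                  # keep only the video tokens


if __name__ == "__main__":
    model = DualStreamDenoiser()
    latents = torch.randn(2, 64, 512)                # 2 clips, 64 latent tokens
    tokens = torch.randint(0, 32000, (2, 16))        # 2 instructions, 16 tokens
    print(model(latents, tokens).shape)              # torch.Size([2, 64, 512])
```

Under this reading, letting the instruction and video tokens share one attention sequence is what would allow the denoiser to follow a complex multimodal instruction while keeping the generated video visually consistent with any reference content, which is the property the abstract attributes to the dual-stream design.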