Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang

Abstract
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary}.
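To illustrate the spatio-temporal factorization the abstract describes, the sketch below shows one way a spatio-temporal block can process latents so that spatial layers act on each frame independently while temporal layers mix information across frames, and so that the same block handles single-frame (image-text) and multi-frame (video-text) batches. This is a minimal illustrative sketch, not the official ModelScopeT2V implementation; all module names and shapes are assumptions.

```python
# Minimal sketch (assumed layout, not the official code) of a spatio-temporal
# block: spatial mixing per frame, temporal mixing along the frame axis.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial mixing: a 2D convolution applied to each frame independently.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal mixing: a 1D convolution along the frame axis.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(32, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width); frames may be 1 for
        # image-text data, so the same block trains on both dataset types.
        b, f, c, h, w = x.shape

        # Spatial pass: fold frames into the batch dimension.
        y = self.spatial(self.norm(x.reshape(b * f, c, h, w)))
        x = x + y.reshape(b, f, c, h, w)

        # Temporal pass: fold spatial positions into the batch dimension and
        # convolve along the frame axis (effectively a per-frame op when f == 1).
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        y = self.temporal(y).reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
        return x + y


if __name__ == "__main__":
    block = SpatioTemporalBlock(channels=64)
    video_latents = torch.randn(2, 16, 64, 32, 32)   # 16-frame clips
    image_latents = torch.randn(2, 1, 64, 32, 32)    # single frames
    print(block(video_latents).shape, block(image_latents).shape)
```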
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Text-to-Video Generation on MSR-VTT | ModelScopeT2V | CLIPSIM: 0.2930, FID: 11.09, FVD: 550 |
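For reference, the CLIPSIM metric in the table is commonly computed as the average CLIP cosine similarity between the text prompt and each generated frame. The sketch below shows one such computation; the specific CLIP checkpoint used for the reported 0.2930 is an assumption here, as is the helper name `clipsim`.

```python
# Hedged sketch of a typical CLIPSIM computation: cosine similarity between
# the CLIP embedding of the prompt and of each generated frame, averaged.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clipsim(prompt: str, frames: list[Image.Image]) -> float:
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Normalize, then average the per-frame text-image similarities.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).squeeze(-1).mean().item()
```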