Pusa-VidGen Video Generation Model Demo
Size
322.68 MB
License
Apache 2.0
GitHub
https://github.com/Yaofang-Liu/Pusa-VidGen
Paper URL
https://arxiv.org/abs/2507.16116
1. Tutorial Introduction

Pusa V1, proposed by Yaofang Liu's team on July 25, 2025, is a high-efficiency multimodal video generation model. Built on Vectorized Timestep Adaptation (VTA), it addresses the core problems of traditional video generation models: high training cost, low inference efficiency, and poor temporal consistency. Unlike traditional methods that rely on large-scale data and compute, Pusa V1 achieves its gains through a lightweight fine-tuning strategy on top of Wan2.1-T2V-14B: training costs only $500 (about 1/200th of comparable models), the dataset requires only 4K samples (about 1/2500th of comparable models), and training can be completed on eight 80 GB GPUs, significantly lowering the barrier to applying video generation technology.

The model is also strongly multi-task: it supports not only text-to-video (T2V) and image-to-video (I2V), but also zero-shot tasks such as video completion, start-and-end-frame generation, and cross-scene transitions, without additional training for specific scenarios. Generation performance is equally strong: with a few-step inference strategy (surpassing the baseline model in just 10 steps), it reaches a total score of 87.32% on the VBench-I2V benchmark, with excellent dynamic detail reproduction (e.g., limb movements and lighting changes) and temporal coherence. Furthermore, the non-destructive adaptation mechanism enabled by VTA injects temporal dynamics into the base model while preserving the original model's image generation quality, achieving a "1+1>2" effect. At the deployment level, its low inference latency covers needs ranging from rapid preview to high-definition output, making it suitable for creative design, short-video production, and similar scenarios. The related paper is PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation.
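To make the VTA idea above concrete, the sketch below contrasts the single scalar timestep of a conventional video diffusion model with a per-frame (vectorized) timestep vector. It is a minimal illustration only: the tensor shapes, the 21-frame length, and the timestep value 700 are assumptions, not Pusa-VidGen's actual code.

```python
# Illustrative sketch of vectorized timesteps (not Pusa-VidGen's actual code).
import torch

num_frames = 21
latents = torch.randn(1, num_frames, 16, 60, 104)  # assumed (batch, frames, C, H, W) layout

# Conventional video diffusion: one scalar timestep shared by every frame.
scalar_t = torch.tensor([700])

# Vectorized timesteps: each frame carries its own timestep, so a conditioning
# frame (e.g., the input image at frame 0) can stay nearly clean while the
# remaining frames are denoised from full noise.
vector_t = torch.full((1, num_frames), 700)
vector_t[:, 0] = int(700 * 0.2)  # frame 0 kept close to its input (noise multiplier 0.2)

print(scalar_t.shape, vector_t.shape)  # torch.Size([1]) torch.Size([1, 21])
```

This per-frame control is what makes the conditioning parameters described in the operation steps below (noise multipliers, conditioning positions) possible.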
This tutorial runs on a dual-card RTX A6000 resource.
2. Project Examples
1. Image-to-Video

2. Multi-Frames to Video

3. Video-to-Video

4. Text-to-Video

3. Operation Steps
1. After starting the container, click the API address to enter the Web interface

2. Usage Steps
If "Bad Gateway" is displayed, it means the model is initializing. Since the model is large, please wait about 2-3 minutes and refresh the page.
2.1 Image-to-Video

Parameter Description
- Generation Parameters
- Noise Multiplier: Adjustable from 0.0 to 1.0, default 0.2 (lower values are more faithful to the input image, higher values are more creative).
- LoRA Alpha: Adjustable from 0.1 to 5.0, default 1.4 (controls style consistency; too high looks stiff, too low loses coherence).
- Inference Steps: Adjustable from 1 to 50, default 10 (more steps give richer detail, but generation time grows roughly linearly). A sketch of how these settings fit together follows this list.
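As referenced above, here is a minimal sketch, under assumed names and a 21-frame clip length, of how these Image-to-Video settings might map onto a per-frame noise schedule in the spirit of VTA; it is not the demo's actual implementation.

```python
# Hedged sketch of an Image-to-Video noise schedule (illustrative only).
def i2v_noise_schedule(num_frames: int, noise_multiplier: float = 0.2) -> list[float]:
    """Frame 0 holds the input image (low noise = faithful to the image);
    the remaining frames start from full noise (1.0) and are generated freely."""
    if not 0.0 <= noise_multiplier <= 1.0:
        raise ValueError("noise multiplier must be in [0.0, 1.0]")
    return [noise_multiplier] + [1.0] * (num_frames - 1)

# Settings mirroring the UI defaults described above.
config = {
    "noise_multiplier": 0.2,    # lower = closer to the input image
    "lora_alpha": 1.4,          # style consistency
    "num_inference_steps": 10,  # runtime grows roughly linearly with steps
}
print(i2v_noise_schedule(21, config["noise_multiplier"])[:3])  # [0.2, 1.0, 1.0]
```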
2.2 Multi-Frames to Video

Parameter Description
- Conditioning Parameters
- Conditioning Positions: Comma-separated frame indices (e.g., "0,20"), defining the time points of the keyframes within the video.
- Noise Multipliers: Comma-separated values in 0.0-1.0 (e.g., "0.2,0.5"), one per keyframe, controlling its creative freedom (lower values stay more faithful to the frame, higher values allow more variation).
- Generation Parameters
- LoRA Alpha: Adjustable from 0.1 to 5.0, default 1.4 (controls style consistency; too high looks stiff, too low loses coherence).
- Inference Steps: Adjustable from 1 to 50, default 10 (more steps give richer detail, but generation time grows roughly linearly). See the sketch after this list for how the comma-separated fields might be parsed.
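The sketch below shows how the comma-separated conditioning fields might be parsed and validated into a per-frame noise-multiplier vector. The function name and the 21-frame length are illustrative assumptions, not the demo's actual code.

```python
# Hedged sketch of parsing the multi-frame conditioning fields (illustrative only).
def parse_multiframe(cond_positions: str, noise_multipliers: str, num_frames: int = 21) -> list[float]:
    positions = [int(p) for p in cond_positions.split(",")]
    multipliers = [float(m) for m in noise_multipliers.split(",")]
    if len(positions) != len(multipliers):
        raise ValueError("provide one noise multiplier per conditioning position")
    if any(p < 0 or p >= num_frames for p in positions):
        raise ValueError("conditioning positions must be valid frame indices")
    # Non-conditioned frames start from full noise; each keyframe keeps its
    # content to the degree set by its multiplier.
    schedule = [1.0] * num_frames
    for pos, mult in zip(positions, multipliers):
        schedule[pos] = mult
    return schedule

schedule = parse_multiframe("0,20", "0.2,0.5")
print(schedule[0], schedule[20], schedule[1:4])  # 0.2 0.5 [1.0, 1.0, 1.0]
```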
2.3 Video-to-Video

Parameter Description
- Conditioning Parameters
- Conditioning Positions: Comma-separated frame indices (e.g., "0,1,2,3"), specifying which frames of the source video constrain the generation (required).
- Noise Multipliers: Comma-separated values in 0.0-1.0 (e.g., "0.0,0.3"), one per conditioning frame, controlling how strongly it is enforced (lower values stay closer to the original frame, higher values allow more flexibility).
- Generation Parameters
- LoRA Alpha: Adjustable from 0.1 to 5.0, default 1.4 (controls style consistency; too high looks stiff, too low loses coherence).
- Inference Steps: Adjustable from 1 to 50, default 10 (more steps give richer detail, but generation time grows roughly linearly). A sketch of selecting conditioning frames follows this list.
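Below is a minimal sketch of selecting conditioning frames from the source clip at the given positions; the frame-loading step is stubbed with a random tensor, and the names and shapes are illustrative assumptions rather than the demo's actual code.

```python
# Hedged sketch of picking video-to-video conditioning frames (illustrative only).
import torch

def select_conditioning_frames(video: torch.Tensor, cond_positions: str) -> torch.Tensor:
    """video: (frames, C, H, W) tensor of the source clip."""
    positions = [int(p) for p in cond_positions.split(",")]
    return video[positions]  # only these frames constrain the generation

source = torch.rand(81, 3, 64, 64)  # stand-in for a decoded source clip
cond = select_conditioning_frames(source, "0,1,2,3")
print(cond.shape)  # torch.Size([4, 3, 64, 64])
```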
2.4 Text-to-Video

Parameter Description
- Generation Parameters
- LoRA Alpha: Adjustable from 0.1 to 5.0, default 1.4 (controls style consistency; too high looks stiff, too low loses coherence).
- Inference Steps: Adjustable from 1 to 50, default 10 (more steps give richer detail, but generation time grows roughly linearly). A short note on how text-to-video relates to VTA follows this list.
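As noted above, text-to-video uses no conditioning frames, which is why only LoRA Alpha and the step count remain as knobs; under VTA this simply means every frame starts from full noise. The snippet below is an illustrative note only, with an assumed frame count.

```python
# Illustrative note: text-to-video pins no frames, so all frames start from full noise.
num_frames = 21
t2v_schedule = [1.0] * num_frames
t2v_config = {"lora_alpha": 1.4, "num_inference_steps": 10}
print(t2v_schedule[:4], t2v_config)
```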
4. Discussion
🖌️ If you come across a high-quality project, feel free to leave us a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and note [SD Tutorial] to join the group, discuss technical issues, and share your results↓

Citation Information
The citation information for this project is as follows:
@article{liu2025pusa,
title={PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation},
author={Liu, Yaofang and Ren, Yumeng and Artola, Aitor and Hu, Yuxuan and Cun, Xiaodong and Zhao, Xiaotong and Zhao, Alan and Chan, Raymond H and Zhang, Suiyun and Liu, Rui and others},
journal={arXiv preprint arXiv:2507.16116},
year={2025}
}
@misc{Liu2025pusa,
title={Pusa: Thousands Timesteps Video Diffusion Model},
author={Yaofang Liu and Rui Liu},
year={2025},
url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
@article{liu2024redefining,
title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
journal={arXiv preprint arXiv:2410.03160},
year={2024}
}