Self-Forcing Real-Time Video Generation
1. Tutorial Introduction

Self Forcing, proposed by Xun Huang's team on June 9, 2025, is a novel training paradigm for autoregressive video diffusion models. It addresses the long-standing problem of exposure bias: models trained on ground-truth context must, at inference time, generate sequences conditioned on their own imperfect outputs. Unlike previous methods that denoise future frames conditioned on ground-truth context frames, Self Forcing performs an autoregressive rollout with key-value (KV) caching during training, so that each frame is generated conditioned on the model's own previously generated output. Training is supervised by a holistic, video-level loss that directly evaluates the quality of the entire generated sequence, rather than relying solely on a traditional frame-by-frame objective. To keep training efficient, the method uses a few-step diffusion model together with a stochastic gradient-truncation strategy, balancing computational cost and performance. A rolling key-value caching mechanism is further introduced for efficient autoregressive video extrapolation. Extensive experiments show that the method achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower, non-causal diffusion models. The related paper is Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion.
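To make the training paradigm concrete, the sketch below outlines one Self Forcing training step in PyTorch-style pseudocode. It is a minimal illustration of the ideas described above, not the authors' implementation: `model`, `few_step_denoise`, `video_level_loss`, and the KV-cache helpers are hypothetical stand-ins, and gradient truncation is simplified to keeping gradients only for the final denoising step.

```python
import torch

def self_forcing_training_step(model, prompt_emb, num_frames, optimizer, video_level_loss):
    """One training step: autoregressive rollout with KV caching, holistic video loss."""
    kv_cache = model.init_kv_cache()              # hypothetical: attention key/value cache
    generated_frames = []

    for _ in range(num_frames):
        # Autoregressive rollout: each frame is conditioned, via the KV cache,
        # on the model's OWN previously generated frames -- the same situation
        # the model faces at inference time, which is what removes exposure bias.
        noise = torch.randn_like(model.frame_template)   # hypothetical latent template

        # Few-step denoising keeps the rollout cheap. Stochastic gradient truncation
        # is simplified here: early steps run without gradients, only the last step
        # contributes to backpropagation.
        with torch.no_grad():
            latent = few_step_denoise(model, noise, prompt_emb, kv_cache, steps=3)
        frame = model.denoise_step(latent, prompt_emb, kv_cache)

        model.update_kv_cache(kv_cache, frame.detach())  # roll the cache forward
        generated_frames.append(frame)

    # Holistic, video-level supervision: the loss scores the entire generated clip,
    # not each frame against a teacher-forced target.
    video = torch.stack(generated_frames, dim=1)         # (batch, frames, ...)
    loss = video_level_loss(video)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the same KV cache can be operated as a rolling cache, dropping the oldest entries once a window size is exceeded, which is what enables video extrapolation beyond the training length.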
This tutorial runs on a single RTX 4090 GPU.
2. Project Examples

3. Operation Steps
1. After starting the container, click the API address to enter the web interface.

2. Usage Steps

Parameter Description
- Advanced Settings (see the sketch after this list for how they map onto plain PyTorch):
  - Seed: Random seed controlling the randomness of generation. A fixed seed reproduces the same result; -1 means a fresh random seed is drawn each run.
  - Target FPS: Target frame rate. The default is 6, i.e. the generated video plays back at 6 frames per second.
  - torch.compile: Enables PyTorch compilation to accelerate model inference (requires environment support).
  - FP8 Quantization: Enables 8-bit floating-point quantization, trading some numerical precision for faster generation (quality may be slightly affected).
  - TAEHV VAE: Specifies which variational autoencoder (VAE) is used (TAEHV), which may affect the details or style of the generated video.
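As a rough illustration of how these settings map onto plain PyTorch, the hedged sketch below wires them into a generation call. It is not the project's actual API: `load_self_forcing_pipeline` and the pipeline object are hypothetical placeholders, and FP8 quantization and the TAEHV VAE are applied inside the app rather than shown here.

```python
import torch

def generate_video(prompt: str, seed: int = -1, target_fps: int = 6,
                   use_torch_compile: bool = False):
    # Seed: -1 draws a fresh random seed; any other value makes runs reproducible.
    if seed == -1:
        seed = torch.seed()            # sets and returns a non-deterministic seed
    else:
        torch.manual_seed(seed)

    pipeline = load_self_forcing_pipeline()     # hypothetical loader for the demo's model

    # torch.compile: optional graph compilation of the denoising network.
    if use_torch_compile:
        pipeline.transformer = torch.compile(pipeline.transformer)

    frames = pipeline(prompt)                   # autoregressive rollout, streamed frame chunks
    # Target FPS only affects playback/encoding speed of the produced frames.
    return frames, target_fps
```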
4. Discussion
🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group: scan the QR code and note [SD Tutorial] to join, discuss technical issues, and share your results ↓

Citation Information
The citation information for this project is as follows:
@article{huang2025selfforcing,
title={Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion},
author={Huang, Xun and Li, Zhengqi and He, Guande and Zhou, Mingyuan and Shechtman, Eli},
journal={arXiv preprint arXiv:2506.08009},
year={2025}
}