
SemanticGen: Video Generation in Semantic Space

Abstract

State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.

One-sentence Summary

Researchers from Zhejiang University, Kuaishou Technology (Kling Team), and CUHK propose SemanticGen, a two-stage video generation framework operating in semantic space rather than VAE latents. It first creates compact semantic features for global video planning using diffusion, then generates VAE latents conditioned on these features, accelerating convergence and enabling efficient high-quality one-minute video synthesis while outperforming prior VAE-based methods.

Key Contributions

  • Current video generative models suffer from slow convergence and high computational costs for long videos due to modeling low-level video tokens with bidirectional attention in the VAE latent space, which becomes impractical for extended sequences.
  • SemanticGen introduces a two-stage diffusion framework that first generates compact semantic features for global video layout planning, then conditions VAE latent generation on these features to add high-frequency details, leveraging video redundancy for efficiency.
  • Experiments show this approach achieves faster convergence than VAE latent space generation and outperforms state-of-the-art methods in video quality while scaling effectively to long video generation scenarios.

Introduction

Video generation models face a critical trade-off: diffusion-based methods produce high-fidelity short videos but struggle with long sequences due to the computational limits of processing all frames simultaneously via bidirectional attention, while autoregressive approaches scale better for long videos yet deliver inferior quality. Hybrid methods attempt to bridge this gap but still underperform pure diffusion models. Separately, research shows that integrating semantic representations—like those from pre-trained encoders—accelerates training and boosts image generation quality compared to raw latent spaces. The authors address both challenges by proposing a novel diffusion-based video generation framework. They fine-tune diffusion models to learn compressed semantic representations, which are then mapped to the VAE latent space. This approach achieves faster convergence than direct VAE latent generation and effectively scales to long video synthesis, overcoming prior scalability barriers without sacrificing fidelity.

Dataset

The authors use two distinct datasets for short and long video generation:

  • Dataset composition and sources:
    The training data comprises an internal text-video pair dataset for short videos and a constructed long-video dataset sourced from movie and TV show clips.

  • Key subset details:

    • Short video subset: Uses internal text-video pairs (exact size unspecified).
    • Long video subset: Created by splitting movie/TV clips into 60-second segments, with text prompts auto-generated via an internal captioner. No explicit filtering rules are detailed.
  • Data usage in training:

    • Frames are sampled at 24 fps for VAE input and 1.6 fps for the semantic encoder during training.
    • The semantic encoder leverages the vision tower from Qwen2.5-VL-72B-Instruct.
    • Training focuses on text-to-video generation without specified mixture ratios or split proportions.
  • Processing specifics:

    • No cropping strategy is mentioned.
    • Metadata (text prompts) for long videos is synthetically constructed via the internal captioner.
    • Frame-rate differentiation (24 fps vs. 1.6 fps) tailors inputs to the VAE and semantic encoder workflows.
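To make the two sampling rates concrete, below is a minimal sketch (our illustration, not the authors' code) of how frames might be subsampled from a 24 fps clip for the 1.6 fps semantic-encoder stream; the stride computation and the helper name are assumptions.

```python
import torch

def sample_frame_indices(num_frames: int, src_fps: float = 24.0, tgt_fps: float = 1.6):
    """Pick frame indices for the semantic encoder from a 24 fps clip.

    The 24 fps stream feeds the VAE; a sparser 1.6 fps stream feeds the
    semantic encoder. The uniform stride below is an assumption about how
    the two rates might be reconciled, not the authors' exact pipeline.
    """
    stride = src_fps / tgt_fps                      # 24 / 1.6 = 15 -> one frame every 15
    idx = torch.arange(0, num_frames, stride).round().long()
    return idx.clamp(max=num_frames - 1)

# Example: a 60-second segment at 24 fps has 1440 frames; the semantic encoder
# would see roughly 96 of them (1440 / 15).
print(sample_frame_indices(1440).shape)             # torch.Size([96])
```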

Method

The authors leverage a two-stage diffusion-based framework called SemanticGen to generate videos by first modeling high-level semantic representations and then refining them into VAE latents. This design addresses the computational inefficiency and slow convergence of conventional methods that operate directly in the VAE latent space. The overall architecture is structured to decouple global scene planning from fine-grained detail synthesis, enabling scalable and efficient generation, particularly for long videos.
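One way to read this decoupling (our notation, not necessarily the paper's) is as a factorization of the conditional video distribution, with stage one modeling the semantic plan and stage two modeling the latents given that plan:

$$p(x \mid c) = \int p_{\theta_2}(x \mid z_{sem}, c)\, p_{\theta_1}(z_{sem} \mid c)\, \mathrm{d}z_{sem},$$

where $c$ is the text prompt, $z_{sem}$ the compact semantic representation generated in stage one, and $x$ the VAE latents refined into the final video in stage two.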

In the first stage, a semantic generator is trained to produce compact semantic representations conditioned on text prompts. These representations are derived from an off-the-shelf video semantic encoder, specifically the vision tower of Qwen2.5-VL, which processes input videos by downsampling frames and compressing spatial patches into a low-dimensional token sequence. To improve training stability and convergence, the authors introduce a lightweight MLP that compresses the high-dimensional semantic features $z_{sem}'$ into a lower-dimensional latent space $z_{sem}$, parameterized as a Gaussian distribution. The MLP outputs mean and variance vectors, and a KL divergence term is added to the loss to regularize the compressed space. This compressed representation is then used as conditioning for the second stage.
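A minimal sketch of this compression step is given below, assuming the encoder features have dimension 2048 and the compressed space has dimension 8 (values reported in the experiments); the hidden width, layer layout, and the KL weight being applied outside this module are our assumptions.

```python
import torch
import torch.nn as nn

class SemanticCompressor(nn.Module):
    """Sketch of the MLP mapping high-dimensional encoder features z'_sem
    to a compact Gaussian latent z_sem. Layer sizes are illustrative."""

    def __init__(self, in_dim: int = 2048, out_dim: int = 8, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 2 * out_dim))   # mean and log-variance

    def forward(self, z_sem_prime: torch.Tensor):
        mean, logvar = self.net(z_sem_prime).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        z_sem = mean + std * torch.randn_like(std)                 # reparameterization trick
        # KL(q(z_sem | z'_sem) || N(0, I)) regularizes the compressed space.
        kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar).mean()
        return z_sem, kl
```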

Refer to the framework diagram for the semantic generation pipeline. The semantic generator takes a text prompt as input and outputs the compressed semantic embedding $z_{sem}$, whose training targets are obtained by passing the encoder features $z_{sem}'$ through the MLP; this embedding is the conditioning signal for the second stage. This design ensures that the semantic space is both compact and amenable to diffusion sampling.

In the second stage, a video diffusion model generates VAE latents conditioned on the compressed semantic representation. During training, the model receives concatenated inputs: the noised VAE latent $z_t$ and the semantic embedding $z_{sem}$. This in-context conditioning allows the diffusion model to leverage high-level semantic cues, such as object motion and scene layout, while generating low-level visual details. The authors observe that this conditioning preserves global structure and dynamics while allowing variation in texture and color, as demonstrated in reference-based generation experiments.
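The in-context conditioning can be pictured as simple sequence concatenation; the sketch below is our reading of it, with the projection layer, concatenation order, and loss masking as assumptions.

```python
import torch

def build_incontext_input(z_t: torch.Tensor, z_sem: torch.Tensor, sem_proj):
    """Sketch of in-context conditioning: semantic tokens are projected to the
    transformer width and concatenated with the noised VAE latent tokens along
    the sequence axis.

    z_t:   [B, N_vae, d_model]  noised VAE latent tokens
    z_sem: [B, N_sem, d_sem]    compressed semantic tokens
    """
    sem_tokens = sem_proj(z_sem)                                 # [B, N_sem, d_model]
    tokens = torch.cat([sem_tokens, z_t], dim=1)                 # [B, N_sem + N_vae, d_model]
    # Assumption: only the VAE-latent positions receive the flow-matching loss;
    # the semantic positions serve purely as clean conditioning context.
    loss_mask = torch.cat(
        [torch.zeros(z_sem.shape[:2], dtype=torch.bool, device=z_t.device),
         torch.ones(z_t.shape[:2], dtype=torch.bool, device=z_t.device)], dim=1)
    return tokens, loss_mask
```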

As shown in the figure below, the VAE latent generator integrates the semantic encoder and MLP to extract and compress semantic features from input videos, which are then injected into the diffusion model alongside the noised VAE latents. During inference, the semantic generator produces $z_{sem}$ from text, and noise is added to it to bridge the training-inference gap, following practices from RAE.
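A possible form of this perturbation is a simple interpolation with Gaussian noise, sketched below; the interpolation form and noise level are assumptions, since the text only states that noise is added following RAE-style practice.

```python
import torch

def perturb_semantic_latents(z_sem: torch.Tensor, noise_level: float = 0.1):
    """Add a small amount of Gaussian noise to the generated semantic latents
    before conditioning stage 2, to bridge the train/inference gap.
    Both the blending scheme and the 0.1 level are illustrative guesses."""
    noise = torch.randn_like(z_sem)
    return (1.0 - noise_level) * z_sem + noise_level * noise
```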

For long video generation, SemanticGen employs a hybrid attention strategy to manage computational complexity. In the semantic space, full self-attention is applied to maintain global consistency across scenes and characters. When mapping to the VAE latent space, shifted-window attention is used to limit quadratic growth in compute cost. The authors interleave VAE latent tokens and semantic tokens within attention windows of size $T_w$, shifting the window by $T_w/2$ at odd layers. This design ensures that the semantic space, which is 16x more compressed than the VAE space, remains tractable while enabling high-fidelity reconstruction.
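The sketch below illustrates the shifted-window schedule at per-frame granularity, a simplification of the token-level interleaving described above; the window size and even/odd alternation follow the description, everything else is an assumption.

```python
import torch

def windowed_attention_mask(num_frames: int, window: int, shifted: bool):
    """Build a [num_frames, num_frames] boolean mask (True = may attend) that
    groups frames into temporal windows of size T_w; at odd layers the windows
    are shifted by T_w / 2 so information crosses window boundaries with depth."""
    offset = window // 2 if shifted else 0
    # Window index of each frame; frames in the same window attend to each other.
    win_id = (torch.arange(num_frames) + offset) // window
    return win_id.unsqueeze(0) == win_id.unsqueeze(1)

# Even layers use the unshifted mask, odd layers the shifted one. The semantic
# generator keeps full self-attention, since its token sequence is far shorter.
mask_even = windowed_attention_mask(num_frames=32, window=8, shifted=False)
mask_odd  = windowed_attention_mask(num_frames=32, window=8, shifted=True)
```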

The inference pipeline integrates both stages: the semantic generator produces $z_{sem}$ from text, which is then fed into the VAE latent generator to produce the final video. The entire process is illustrated in the overall framework diagram, which highlights the sequential nature of semantic-to-latent generation.
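Putting the pieces together, an end-to-end inference sketch might look as follows; the `.sample` interfaces, argument names, and noise level are assumptions rather than the authors' API.

```python
import torch

@torch.no_grad()
def semanticgen_inference(prompt_emb, semantic_generator, vae_latent_generator,
                          vae_decoder, sem_noise_level: float = 0.1):
    """End-to-end inference sketch under assumed interfaces: each generator
    exposes a .sample(...) method, and vae_decoder maps latents to pixels."""
    # Stage 1: generate the compact semantic plan from text.
    z_sem = semantic_generator.sample(cond=prompt_emb)
    # Bridge the train/inference gap by lightly noising the generated plan.
    z_sem = (1.0 - sem_noise_level) * z_sem + sem_noise_level * torch.randn_like(z_sem)
    # Stage 2: generate VAE latents conditioned on text and the semantic plan.
    z_vae = vae_latent_generator.sample(cond=prompt_emb, context=z_sem)
    # Decode latents to pixels.
    return vae_decoder(z_vae)
```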

The base diffusion model follows the Rectified Flow framework, with the forward process defined as $z_t = (1 - t)\, z_0 + t\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Training minimizes the Conditional Flow Matching loss $\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\, p_t(z \mid \epsilon),\, p(\epsilon)} \left\| v_\Theta(z_t, t) - u_t(z_t \mid \epsilon) \right\|_2^2$, and inference uses Euler discretization, $z_t = z_{t-1} + v_\Theta(z_{t-1}, t) \cdot \Delta t$. This foundation enables efficient sampling while preserving temporal coherence.
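For reference, a minimal implementation consistent with these formulas is sketched below; the model interface, step count, and sign convention (integrating the linear path from $t=1$ noise to $t=0$ data) are our choices, not the authors' code.

```python
import torch

def rectified_flow_loss(model, z0: torch.Tensor, cond: torch.Tensor):
    """Conditional Flow Matching loss for the linear path z_t = (1 - t) z_0 + t * eps,
    whose target velocity is eps - z_0. `model(z_t, t, cond)` predicting a velocity
    field is an assumed interface."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device).view(b, *([1] * (z0.dim() - 1)))
    eps = torch.randn_like(z0)
    z_t = (1.0 - t) * z0 + t * eps
    v_target = eps - z0
    return torch.mean((model(z_t, t.flatten(), cond) - v_target) ** 2)

@torch.no_grad()
def euler_sample(model, shape, cond, steps: int = 50, device: str = "cuda"):
    """Euler integration of the learned velocity field from noise (t=1) to data (t=0)."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        z = z - model(z, t, cond) * dt   # step toward t=0 along the predicted velocity
    return z
```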

The authors emphasize that SemanticGen is not tied to a specific semantic encoder; alternatives such as V-JEPA 2, VideoMAE 2, or 4DS can be substituted. The framework’s modularity and reliance on pre-trained semantic encoders make it adaptable to evolving video understanding models while maintaining efficiency and quality in generation.

Experiment

  • Validated semantic space compression: Using an 8-dimensional representation achieved the highest VBench scores, reducing artifacts and broken frames compared to higher-dimensional variants (64-dim and uncompressed).
  • Confirmed faster convergence in semantic space: Training on pre-trained semantic representations converged significantly faster than modeling compressed VAE latents, producing coherent videos within the same training steps where VAE-based approaches yielded only coarse color patches.
  • Outperformed state-of-the-art in long video generation: Surpassed all baselines (including MAGI-1, SkyReels-V2, and continued-training variants) in temporal consistency and reduced drifting issues, achieving lower $\Delta_{\text{drift}}^{M}$ scores and maintaining stable backgrounds without color shifts.
  • Matched SOTA in short video quality: Achieved comparable VBench metrics to leading methods while demonstrating superior text adherence (e.g., correctly generating specific motions like head turns or melting snowflakes missed by baselines).

The authors evaluate the impact of semantic space compression on video generation quality by comparing models trained with different output dimensions. Results show that compressing the semantic representation to lower dimensions (dim=8) improves consistency, smoothness, and aesthetic quality compared to no compression (dim=2048), with the dim=8 variant achieving the highest scores across most metrics. This indicates that dimensionality reduction enhances both convergence and output fidelity in the semantic generation pipeline.

The authors evaluate SemanticGen against baselines using quantitative metrics for long video generation, showing it achieves the highest subject consistency and motion smoothness while maintaining strong background consistency and minimal temporal flickering. SemanticGen outperforms Hunyuan-Video and Wan2.1-T2V-14B in subject consistency and matches or exceeds them in other stability metrics, though it scores slightly lower than Wan2.1-T2V-14B in imaging and aesthetic quality. Results confirm SemanticGen’s effectiveness in preserving long-term coherence without significant degradation in visual fidelity.

SemanticGen outperforms baseline methods across multiple video quality metrics, achieving the highest scores in subject consistency, background consistency, motion smoothness, and imaging quality. It also demonstrates the lowest drifting metric ($\Delta_{\text{drift}}^{M}$), indicating superior long-term temporal stability compared to alternatives like SkyReels-V2, Self-Forcing, LongLive, and Base-CT-Swin.

