Yume-1.5: A Text-Controlled Interactive World Generation Model
Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
Abstract
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges, including excessively large parameter counts, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We provide the codebase in the supplementary material.
One-sentence Summary
The authors from Shanghai AI Laboratory, Fudan University, and Shanghai Innovation Institute propose Yume-1.5, a lightweight framework enabling real-time, text- and image-controlled generation of persistent, explorable virtual worlds via continuous keyboard input, leveraging unified context compression, bidirectional attention distillation, and enhanced text embeddings to overcome limitations of prior diffusion-based methods in speed, memory, and controllability.
Key Contributions
- Yume-1.5 introduces Joint Temporal-Spatial-Channel Modeling (TSCM), a novel compression framework that enables stable, long-context video generation by jointly compressing historical frames along temporal, spatial, and channel dimensions, significantly reducing memory usage and maintaining consistent inference speed regardless of context length.
- The framework integrates a bidirectional attention distillation strategy with a Self-Forcing-like training paradigm, leveraging TSCM to replace the KV cache and reduce error accumulation, thereby accelerating inference while preserving visual quality and enabling real-time, continuous world exploration.
- Yume-1.5 achieves text-controlled generation of dynamic world events through a mixed-dataset training approach and architectural design, allowing users to edit and generate new events via text prompts (demonstrated on a re-annotated dataset with event-focused captions) while supporting keyboard-based navigation across image- and text-to-world generation modes.
Introduction
The authors leverage video diffusion models to enable interactive, persistent virtual world generation, in which users explore dynamic environments in real time, a capability critical for applications in immersive entertainment, simulation, and virtual embodiment. Prior work faces key limitations: poor generalizability beyond game-like settings, high latency due to slow diffusion inference, and a lack of text-based control for event generation, often relying only on mouse or keyboard inputs with limited expressiveness. To overcome these, the authors introduce Yume-1.5, which achieves real-time, autoregressive generation of infinite video worlds from a single image or text prompt. Their main contributions are threefold: (1) Joint Temporal-Spatial-Channel Modeling (TSCM), which compresses historical frames across time, space, and channels to maintain long-context coherence without memory explosion; (2) a Self-Forcing-inspired acceleration method that reduces sampling steps from 50 to 4 while minimizing error accumulation; and (3) text-controlled event generation via a mixed-dataset training strategy and architectural design, enabling dynamic scene evolution through natural language.
Dataset
- The dataset is composed of three main components: a Real-world Dataset, a Synthetic Dataset, and a specialized Event Dataset, each contributing to balanced performance in realistic motion control, general video quality, and event-specific generation.
- The Real-world Dataset is derived from Sekai-Real-HQ, a high-quality subset of the Sekai dataset featuring long walking video clips with detailed camera motion trajectories and semantic labels. The authors convert the camera trajectory data into discrete keyboard and mouse control signals using a method from [21], mapping them to action vocabularies for camera movement (e.g., →, ↑, ↘) and human-like camera motion (e.g., W, A, W+A); a hedged sketch of such a discretization appears after this list. Additionally, the dataset is re-annotated: original scene descriptions are preserved for Text-to-Video (T2V) training, while InternVL3-78B generates new event-focused captions for Image-to-Video (I2V) training to improve event-driven generation.
- The Synthetic Dataset is built from Openvid, with 80,000 diverse captions selected via similarity-based deduplication and random sampling. Using Wan 2.1 14B, 80,000 videos at 720p resolution are synthesized. Quality is assessed with VBench, and the top 50,000 videos are retained to prevent overfitting and maintain general video generation ability, primarily used for T2V training.
- The Event Dataset consists of 10,000 first-person images paired with user-generated descriptions across four categories: urban daily life, sci-fi, fantasy, and weather phenomena. These are used to synthesize 10,000 image-to-video sequences with Wan 2.2 14B-I2V. After manual screening, 4,000 high-fidelity videos are selected to enhance semantic alignment in complex, event-driven scenarios, and are used exclusively for T2V training.
- The model is trained using a mixture of these datasets: the Real-world and Event datasets are used for both T2V and I2V tasks, while the Synthetic Dataset supports T2V to preserve generalization. Training employs a decomposed event and action description pipeline, with adaptive history token downsampling and chunk-based autoregressive inference for efficient memory management.
Method
The authors leverage a diffusion-based framework for generating interactive and explorable worlds, built upon a DiT (Diffusion Transformer) backbone. The overall architecture supports both text-to-video and image-to-video generation, with a unified approach to conditioning. For text-to-video, the model takes a text embedding $c$ and a noise tensor $z \in \mathbb{R}^{C \times f_t \times h \times w}$ as input. For image-to-video, a conditional image or video $z_c$ is zero-padded to match the dimensions of $z$, and a binary mask $M_c$ identifies the preserved regions. The conditional input is fused with the noise via $M_c \cdot z_c + (1 - M_c) \cdot z$, which is then processed by the DiT backbone. The text encoding strategy is distinct from prior work: the caption is decomposed into an Event Description and an Action Description, which are processed separately by a T5 encoder and then concatenated. This design allows for efficient precomputation of action descriptions, reducing computational overhead during inference. The model is trained using the Rectified Flow loss.
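The conditioning and text-encoding scheme can be summarized with a minimal PyTorch-style sketch. The tensor shapes, the `t5_stub` encoder, and the function names are placeholders rather than the released implementation; the mask-based fusion follows the formula given in the text.

```python
import torch

def fuse_condition(z: torch.Tensor, z_c: torch.Tensor, m_c: torch.Tensor) -> torch.Tensor:
    """Fuse the zero-padded conditional latent z_c with the noise latent z.

    z, z_c, m_c: [C, f, h, w]; m_c is 1 where the condition is preserved.
    Implements M_c * z_c + (1 - M_c) * z from the text.
    """
    return m_c * z_c + (1.0 - m_c) * z

def build_text_condition(event_desc: str, action_desc: str, t5_encode) -> torch.Tensor:
    """Encode event and action descriptions separately, then concatenate.

    Because the action vocabulary is small and fixed, the action embeddings
    can be precomputed and cached, which is the efficiency benefit noted above.
    """
    e = t5_encode(event_desc)          # [L_e, D]
    a = t5_encode(action_desc)         # [L_a, D] (cacheable)
    return torch.cat([e, a], dim=0)    # [L_e + L_a, D]

# Toy usage with random tensors and a stub encoder (shapes are illustrative).
C, f, h, w, D = 16, 8, 44, 80, 512
z = torch.randn(C, f, h, w)
z_c = torch.zeros_like(z); z_c[:, :1] = torch.randn(C, 1, h, w)  # zero-padded first frame
m_c = torch.zeros_like(z); m_c[:, :1] = 1.0                      # preserve first frame only
z_in = fuse_condition(z, z_c, m_c)

t5_stub = lambda s: torch.randn(len(s.split()), D)
text_ctx = build_text_condition("rain begins to fall on the street", "W+A | →", t5_stub)
```

Caching the action-description embedding is what allows the action branch to be computed once per control vocabulary entry rather than at every streaming step.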

The framework incorporates a long-video generation method that addresses the challenges of large context and slow inference. This is achieved through a combination of temporal-spatial and channel compression. For temporal-spatial compression, the historical frames $z_c$ are downsampled using a multi-rate Patchify scheme. The compression rate varies with a frame's temporal distance from the current prediction: frames $t-1$ to $t-2$ are downsampled by $(1, 2, 2)$, frames $t-3$ to $t-6$ by $(1, 4, 4)$, and so on, with the initial frame also using $(1, 2, 2)$. This is implemented by interpolating the Patchify weights. The compressed representation $\hat{z}_c$ is then concatenated with the prediction frame $\hat{z}_d$, which is processed with a fixed $(1, 2, 2)$ downsampling rate, and the combined tensor is fed into the DiT block. For channel compression, the historical frames $z_c$ are passed through a Patchify with a compression rate of $(8, 4, 4)$ and reduced to 96 channels, yielding $z_{\text{linear}}$, which is also fed into the DiT block. After the video tokens $z^l$ pass through the cross-attention layer, they are processed by a fully connected layer for channel reduction. The predicted frames $z_p^l$ are then concatenated with $z_{\text{linear}}$, and the combined tokens $z_{\text{fus}}$ are fused via a linear attention layer to produce $z_{\text{fus}}^l$. This output is passed through another fully connected layer to restore the channel dimension and added element-wise to $z^l$ for feature fusion.
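The multi-rate schedule can be illustrated with a short sketch, assuming latent history frames stored as individual tensors. Average pooling stands in for the interpolated Patchify weights, and the $(1, 8, 8)$ rate for older frames is an assumed continuation of the "and so on" schedule, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def rate_for_offset(offset: int, is_initial: bool) -> tuple[int, int, int]:
    """(t, h, w) downsampling rate for a history frame `offset` steps back."""
    if is_initial or offset <= 2:
        return (1, 2, 2)
    if offset <= 6:
        return (1, 4, 4)
    return (1, 8, 8)  # assumed continuation of the schedule for older frames

def compress_history(history: list[torch.Tensor]) -> list[torch.Tensor]:
    """history[i] is the latent i+1 steps before the current prediction,
    shaped [C, 1, H, W]; the last element is treated as the initial frame.
    Pooling approximates the multi-rate Patchify for readability."""
    out = []
    for i, frame in enumerate(history):
        is_initial = (i == len(history) - 1)
        t, h, w = rate_for_offset(i + 1, is_initial)
        out.append(F.avg_pool3d(frame.unsqueeze(0), kernel_size=(t, h, w)).squeeze(0))
    return out

# Toy usage: 10 history latents of shape [C=16, 1, 64, 64].
hist = [torch.randn(16, 1, 64, 64) for _ in range(10)]
compressed = compress_history(hist)
print([tuple(x.shape) for x in compressed][:4])
# Recent frames keep 32x32 tokens; older ones shrink to 16x16 or 8x8.
```

The point of the schedule is that the token budget for history grows sublinearly with context length, which is why inference time per step stays flat as more blocks accumulate.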

To accelerate real-time inference, the authors employ a bidirectional attention distillation strategy. This method converts the multi-step diffusion model into a few-step generator by minimizing the expected KL divergence between the real and fake data distributions. The process involves a Real Model (teacher) and a Fake Model (student). The generator $G_\theta$ samples previous frames from its own distribution and uses them as context to generate new predicted frames, forming a clean video sequence $z_0$. The key innovation is that the Fake Model is optimized to match the trajectory of the Real Model via a distribution-matching gradient, using model-predicted data rather than real data as conditioning. This approach mitigates the train-inference discrepancy and error accumulation in long videos. The generator is initialized with weights from a foundation model trained on a mixed dataset using an alternating training strategy for text-to-video and image-to-video tasks. The resulting model is capable of generating realistic, interactive, and continuous worlds from a single image or text prompt, supporting keyboard-based exploration.
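A hedged sketch of the distribution-matching step, in the spirit of DMD-style distillation: `generator`, `real_model`, and `fake_model` are placeholder callables that predict the clean latent, and both the rectified-flow noising and the stop-gradient MSE trick are assumptions about how the KL-matching gradient is applied, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def dmd_generator_loss(generator, real_model, fake_model, noise, context, t):
    """Push the few-step generator toward the teacher ("real") distribution.

    The KL gradient between fake and real distributions is approximated by the
    difference of the two models' clean-latent predictions at a noised sample.
    """
    z0 = generator(noise, context)                 # few-step rollout, conditioned on its own history
    eps = torch.randn_like(z0)
    sigma = t.view(-1, *([1] * (z0.dim() - 1)))
    z_t = (1.0 - sigma) * z0 + sigma * eps         # rectified-flow style noising (assumption)

    with torch.no_grad():
        z0_real = real_model(z_t, t, context)      # frozen multi-step teacher
        z0_fake = fake_model(z_t, t, context)      # critic tracking the generator distribution
        grad = z0_fake - z0_real                   # distribution-matching direction

    # Stop-gradient MSE whose gradient w.r.t. the generator output equals `grad`.
    target = (z0 - grad).detach()
    return F.mse_loss(z0, target)
```

In a full training loop the Fake Model itself would typically also be updated with a denoising loss on the generator's own outputs; the sketch above shows only the generator-side gradient.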
Experiment
- Conducted foundation model training on Wan2.2-5B with 704×1280 resolution, 16 FPS, batch size 40, Adam optimizer (1e-5 LR), and 10,000 iterations on A100 GPUs; followed by 600 iterations of Self-Forcing with TSCM under identical settings.
- Evaluated on Yume-Bench with 544×960 resolution, 16 FPS, 96 frames, and 4 inference steps; used six metrics including instruction following, subject consistency, background consistency, motion smoothness, aesthetic quality, and imaging quality.
- Yume-1.5 achieved an instruction-following score of 0.836 on image-to-video generation, significantly outperforming Wan-2.1 and MatrixGame, which showed limited real-world controllability.
- In long-video generation, the model with Self-Forcing and TSCM maintained stable aesthetic (0.523) and image quality (0.601) scores across the final video segments, surpassing the baseline (aesthetic: 0.442, image quality: 0.542).
- TSCM improved instruction following while stabilizing inference time; beyond 8 video blocks, inference time per step remained constant, outperforming full-context input and spatial-compression methods.
- Yume-1.5 generated videos at 12 FPS at 540p resolution on a single A100 GPU.
- Limitations include artifacts in motion direction (e.g., backward-moving vehicles) and performance degradation in high-density scenes, attributed to the 5B model size; MoE architectures are proposed as a future solution.
Results show that Yume-1.5 outperforms Wan-2.1 and MatrixGame across multiple metrics, achieving the highest instruction-following score of 0.836 and markedly better aesthetic and imaging quality than the baseline models. It also delivers the fastest inference, completing generation in just 8 seconds, while maintaining strong subject consistency, background consistency, and motion smoothness.

Results show that the TSCM model achieves superior performance across multiple metrics compared to the Spatial Compression model. Specifically, TSCM scores 0.836 in Instruction Following, 0.945 in Background Consistency, and 0.985 in Motion Smoothness, outperforming the Spatial Compression model in all evaluated categories.
