12 days ago

Jiwen Liu Shujuan Li Zhixue Fang Xiaohan Li Yan Zhou Zijie Meng Zhimin Zhang Yawen Luo Guoxin Zhang Yu-Shen Liu

Table of Contents

Abstract

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffer from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on a million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.

One-sentence Summary

OmniDirector is a unified framework trained on million-scale camera grid-video pairs that enables general multi-shot camera cloning without cross-paired data by encoding cameras as grid motion videos and utilizing a novel hierarchical prompt expansion agent to coordinate characters, actions, and cameras to provide director-level control for multimodal diffusion transformers, with extensive experiments demonstrating superior performance and controllability.

Key Contributions

The paper introduces a general camera motion representation that encodes cameras as grid motion videos to support the integration of diverse trajectories for multi-shot video generation.
OmniDirector is presented as a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers.
A novel hierarchical prompt expansion agent is designed to harmoniously integrate different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of the framework.

Introduction

Camera motion is essential for shaping narrative depth in video generation, yet achieving precise control remains challenging. Existing parametric methods fail to handle multi-shot transitions and struggle with semantic alignment, while implicit approaches relying on cross-paired data suffer from scarcity and information leakage. To address these issues, the authors introduce OmniDirector, a unified framework designed for Multi-Modal Diffusion Transformers. They propose a camera grid representation that encodes camera parameters as motion videos in an empty 3D scene to decouple motion from content. This approach enables training on million-scale data without requiring cross-paired samples. Additionally, they design a hierarchical prompt expansion agent to seamlessly integrate camera motion with subject and object controls during inference.

Method

The authors propose OmniDirector, a framework designed to achieve general multi-shot camera cloning by decoupling camera motion from scene content. The core innovation is the Camera Grid representation, which abstracts complex camera movements into a visual signal that can be processed alongside standard video content.

Refer to the framework diagram below:

The method begins with Camera Grid Rendering. To simulate spatial transformations, the authors model the environment as an empty room containing only 3D grid lines. Specifically, they generate grid points on X-Z planes to represent the floor and ceiling, and construct vertical line segments to form a tubular boundary structure surrounding the camera trajectory. This creates a visual tunnel wall effect that clearly presents the geometric structure and motion path. For special effects like fisheye distortion or dolly zoom, the rendering scheme is modified using specific optical models, such as the Kannala–Brandt model, to reproduce characteristic visual distortions.

During the training phase, the architecture leverages a 3D Variational Autoencoder (3D VAE) to encode the inputs. The reference image $I$ , the camera grid $G$ , and the noisy video latent $Z_v$ are encoded into latent representations $z_I$ , $z_c$ , and $z_v$ . These modalities are concatenated along the frame dimension to form a unified spatiotemporal representation $z_{vis} = \text{Concat}(z_I, z_v, z_c)$ . This combined latent is then patchified into token sequences $Z_{vis}$ . Simultaneously, the text condition $T'$ is processed through a text encoder to produce text tokens $Z_t$ . The model utilizes a Multimodal Diffusion Transformer (MMDiT) backbone where visual and text tokens interact through joint attention mechanisms, allowing text semantics to modulate the visual generation process while maintaining alignment between visual signals. To ensure the model strictly adheres to the camera grid geometry, a self-reconstruction objective is incorporated into 30% of the training samples. In these cases, the target video is replaced by the camera grid itself, forcing the model to parse the geometric structure rather than relying on spurious correlations.

For inference, the system employs a Hierarchical Prompt Expansion Agent to harmonize heterogeneous control signals. First, a Camera Prompt Generator analyzes transition frames and camera poses to produce a specific camera motion description $T_c$ . This description is then fused with the user prompt $T_u$ and the reference image $I$ via a Semantic Fusion module to create a final prompt $T_f$ . This unified prompt guides the OmniDirector model to generate the output video. Additionally, an Adaptive Classifier-Free Guidance strategy is applied, where the visual unconditional input is set to a black background and the text prompt includes a completely static camera description. A coarse-to-fine denoising schedule is also used, injecting camera grid features during the high-noise regime to establish global structure before refining local details.

Experiment

OmniDirector is evaluated on a diverse validation set to assess camera control accuracy, shot transition consistency, and content leakage against state-of-the-art baselines. Comparative results demonstrate that the proposed method surpasses existing approaches by faithfully cloning complex trajectories and maintaining semantic coherence in multi-shot scenarios without significant content leakage. Ablation studies and emergent capability tests further validate that the design choices enhance signal fusion and enable generalization to diverse input signals like raw video without retraining.

The authors evaluate their proposed method, OmniDirector, against several state-of-the-art baselines across camera control, transition accuracy, and leakage rates. The results indicate that their approach consistently outperforms competitors, achieving the highest precision in camera pose estimation and shot transitions while minimizing unwanted reference video leakage. The proposed method achieves the best performance in camera accuracy metrics, showing the lowest rotation and translation errors among all compared approaches. OmniDirector demonstrates superior transition accuracy with the highest temporal and semantic precision, while baseline methods lack support for semantic evaluation. The approach significantly reduces leakage rates at both frame and shot levels, contrasting with other methods that exhibit substantial content leakage from the reference video.

The authors present a pairwise comparison evaluating camera control, visual quality, and narrative consistency against a baseline method. The results indicate superior performance across all dimensions, with the proposed approach achieving high preference rates and favorable outcome ratios particularly in visual quality and camera trajectory adherence. Visual quality received the highest proportion of favorable or equivalent ratings compared to the baseline. Camera control exhibited the strongest ratio of positive to negative outcomes among the evaluated categories. Narrative consistency demonstrated solid performance with high agreement rates, though with a lower favorable ratio than the other dimensions.

The ablation study evaluates the contributions of Semantic Fusion, Transition Prompt Engineering, and Adaptive CFG to the overall performance of the OmniDirector model. Results indicate that the complete configuration achieves superior camera accuracy, transition precision, and minimal content leakage compared to removing individual components. Specifically, the absence of inter-shot modeling severely impacts semantic consistency during transitions, while removing adaptive guidance degrades camera motion fidelity. The full model configuration yields the best camera accuracy and lowest leakage rates across all settings. Excluding transition prompt engineering causes a significant degradation in the semantic precision of shot transitions. Semantic fusion is essential for reducing reference video leakage and maintaining accurate camera pose estimation.

The authors evaluate OmniDirector against leading baselines using quantitative metrics and human pairwise comparisons to assess camera control, transition accuracy, visual quality, and narrative consistency. Comparative results indicate the proposed approach consistently outperforms competitors by achieving superior precision in camera pose estimation and shot transitions while significantly minimizing reference video leakage. Additionally, an ablation study confirms that the full model configuration is essential for maintaining semantic consistency and camera motion fidelity, as removing individual components leads to notable performance degradation.

Source PDF

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

12 days ago

Jiwen Liu Shujuan Li Zhixue Fang Xiaohan Li Yan Zhou Zijie Meng Zhimin Zhang Yawen Luo Guoxin Zhang Yu-Shen Liu

Table of Contents

Abstract

One-sentence Summary

Key Contributions

The paper introduces a general camera motion representation that encodes cameras as grid motion videos to support the integration of diverse trajectories for multi-shot video generation.
OmniDirector is presented as a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers.
A novel hierarchical prompt expansion agent is designed to harmoniously integrate different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of the framework.

Introduction

Method

Refer to the framework diagram below:

Experiment

Source PDF

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Command Palette

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Jiwen Liu Shujuan Li Zhixue Fang Xiaohan Li Yan Zhou Zijie Meng Zhimin Zhang Yawen Luo Guoxin Zhang Yu-Shen Liu1 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Jiwen Liu Shujuan Li Zhixue Fang Xiaohan Li Yan Zhou Zijie Meng Zhimin Zhang Yawen Luo Guoxin Zhang Yu-Shen Liu1 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Jiwen Liu Shujuan Li Zhixue Fang Xiaohan Li Yan Zhou Zijie Meng Zhimin Zhang Yawen Luo Guoxin Zhang Yu-Shen Liu1 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Build AI with AI

HyperAI Newsletters

Jiwen Liu Shujuan Li Zhixue Fang Xiaohan Li Yan Zhou Zijie Meng Zhimin Zhang Yawen Luo Guoxin Zhang Yu-Shen Liu

Jiwen Liu Shujuan Li Zhixue Fang Xiaohan Li Yan Zhou Zijie Meng Zhimin Zhang Yawen Luo Guoxin Zhang Yu-Shen Liu

Jiwen Liu Shujuan Li Zhixue Fang Xiaohan Li Yan Zhou Zijie Meng Zhimin Zhang Yawen Luo Guoxin Zhang Yu-Shen Liu