OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Abstract
Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline, InsertPipe, which automatically constructs diverse cross-pair data. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from the subjects and the source video. Meanwhile, we design a Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology that optimizes the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for this field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.