MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim

Abstract
Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Code and weights will be released.
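The abstract describes aligning attention in selected layers with multi-instance mask tracks but does not specify the loss. Below is a minimal sketch of what such an attention-mask alignment term could look like, assuming PyTorch; the function and tensor names (mask_alignment_loss, attn, masks) are hypothetical, and the authors' actual objective may differ.

```python
import torch


def mask_alignment_loss(attn: torch.Tensor, masks: torch.Tensor,
                        eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical attention-to-mask-track alignment regularizer.

    attn:  (B, Q, K) attention weights from an interaction-dominant
           layer; each query row sums to 1 over the K key tokens.
    masks: (B, Q, K) binary targets built from rasterized mask tracks;
           masks[b, q, k] = 1 when query token q and key token k belong
           to the same instance (possibly in different frames).
    Returns a scalar loss penalizing attention mass that falls outside
    each query's own instance region.
    """
    # Fraction of each query's attention landing inside its instance mask.
    in_mask = (attn * masks).sum(dim=-1)            # (B, Q)
    # Negative log-likelihood of the in-mask attention mass.
    return -in_mask.clamp(min=eps).log().mean()
```

Under this sketch, the binary targets would come from downsampling each frame's instance masks to the token grid of the chosen layers, so that video-to-video attention is encouraged to stay within (and track) each instance, matching the abstract's notion of semantic propagation; an analogous term over video-to-text attention would cover semantic grounding.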