15 days ago

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Abstract

Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
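The abstract describes the data pipeline only at a high level. Below is a minimal Python sketch of how such a generate-then-filter loop could be wired together, assuming every stage is a plain callable. All names here (`EditExample`, `build_examples`, `craft_instruction`, `edit_keyframe`, `propagate`, `enhance`, `accept`, `gpu_minutes_per_example`) are hypothetical, and editing a single keyframe before propagating it through the video generator is an assumption about how the image editor and video generator are combined, not a detail confirmed above. The sketch also turns the stated budget into a per-example figure: 12,000 GPU-days over one million kept examples is about 17.3 GPU-minutes each, a lower bound since the budget is "over" 12,000 GPU-days.

```python
"""Hypothetical sketch of a Ditto-style generate-then-filter data loop.

Every stage is a stand-in callable (an assumption), not the paper's code.
"""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EditExample:
    source_video: str
    instruction: str
    edited_video: str


def build_examples(
    videos: List[str],
    craft_instruction: Callable[[str], str],
    edit_keyframe: Callable[[str, str], str],
    propagate: Callable[[str, str, str], str],
    enhance: Callable[[str], str],
    accept: Callable[[EditExample], bool],
) -> List[EditExample]:
    kept: List[EditExample] = []
    for video in videos:
        instruction = craft_instruction(video)         # agent writes an edit instruction
        keyframe = edit_keyframe(video, instruction)   # image editor edits one frame (assumed)
        raw = propagate(video, keyframe, instruction)  # in-context video generator
        candidate = EditExample(video, instruction, enhance(raw))  # temporal enhancer pass
        if accept(candidate):                          # agent-based quality filter
            kept.append(candidate)
    return kept


def gpu_minutes_per_example(gpu_days: float, num_examples: int) -> float:
    # Convert GPU-days to GPU-minutes, then average over the kept examples.
    return gpu_days * 24 * 60 / num_examples


# Toy usage with string stand-ins for real videos and models (hypothetical).
demo = build_examples(
    ["clip_a", "clip_b"],
    craft_instruction=lambda v: f"make {v} look like a watercolor painting",
    edit_keyframe=lambda v, ins: f"{v}/frame0_edited",
    propagate=lambda v, kf, ins: f"{v}/edited_raw",
    enhance=lambda raw: raw.replace("_raw", "_smooth"),
    accept=lambda ex: ex.source_video != "clip_b",  # pretend the filter rejects clip_b
)
print(len(demo), demo[0].edited_video)  # 1 clip_a/edited_smooth
print(round(gpu_minutes_per_example(12_000, 1_000_000), 2))  # 17.28
```

Passing each stage in as a callable keeps the orchestration independent of any particular image editor, video generator, or filtering agent, so real models could be swapped in without changing the loop.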
