Seedream 4.0: Toward Next-generation Multimodal Image Generation

Abstract
We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that also considerably reduces the number of image tokens. This allows for efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable, large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multimodal post-training that trains T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as the PE model). Comprehensive evaluations reveal that Seedream 4.0 achieves state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning; it also supports multi-image reference and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creative and professional applications. Seedream 4.0 is now accessible at https://www.volcengine.com/experience/ark?launch=seedream.