LongCat-Image Technical Report
Abstract
We introduce LongCat-Image, a pioneering open-source, bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility that persist in current leading models. 1) We achieve this through rigorous data curation across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state of the art (SOTA), delivering superior text rendering, remarkable photorealism, and significantly enhanced aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering: by supporting even complex and rare characters, it surpasses both major open-source and commercial solutions in coverage while also achieving superior accuracy. 3) The model is remarkably efficient thanks to its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the roughly 20B or larger Mixture-of-Experts (MoE) architectures common in the field, ensuring minimal VRAM usage, rapid inference, and substantially lower deployment costs. Beyond generation, LongCat-Image also excels at image editing, achieving SOTA results on standard benchmarks with better editing consistency than other open-source work. 4) To fully empower the community, we establish the most comprehensive open-source ecosystem to date, releasing not only multiple model versions for text-to-image generation and image editing, including checkpoints after the mid-training and post-training stages, but also the entire training toolchain. We believe the openness of LongCat-Image will provide robust support for developers and researchers and push the frontiers of visual content creation.
One-sentence Summary
The Meituan LongCat Team proposes LongCat-Image, a 6B-parameter bilingual (Chinese-English) diffusion model that achieves state-of-the-art performance in multilingual text rendering, photorealism, and image editing through curated data pipelines and reward-guided training, outperforming larger MoE models in efficiency and Chinese character coverage while enabling low-cost deployment via a comprehensive open-source ecosystem.
Key Contributions
- LongCat-Image is a novel bilingual (Chinese-English) foundation model for image generation that achieves state-of-the-art performance in text rendering, photorealism, and aesthetic quality through rigorous data curation across the pre-training, mid-training, and SFT stages, along with curated reward models during the RL phase that guide the model toward real-world visual fidelity.
- It sets a new standard for Chinese character rendering by supporting complex and rare characters with superior coverage and accuracy, outperforming both open-source and commercial models, thanks to a dedicated data and training strategy tailored for multilingual text fidelity.
- The model achieves high efficiency with a compact 6B-parameter diffusion architecture—significantly smaller than typical 20B+ MoE models—enabling low VRAM usage, fast inference, and reduced deployment costs, while also delivering SOTA results in image editing with strong consistency on standard benchmarks.
Introduction
The authors introduce LongCat-Image, a lightweight, bilingual (Chinese-English) foundation model for image generation and editing, addressing key challenges in multilingual text rendering, photorealism, and deployment efficiency. Prior work has relied on large-scale models—often 20B or more, including MoE architectures—leading to high computational costs, slow inference, and limited accessibility, despite only marginal gains in quality. Additionally, existing models struggle with accurate rendering of complex Chinese characters and maintaining visual consistency in image editing. The authors’ main contribution is a 6B-parameter diffusion model that achieves state-of-the-art performance across benchmarks while significantly reducing VRAM usage and inference latency. This is enabled by a rigorous, multi-stage data curation pipeline that excludes AIGC-generated content during early training, uses hand-curated synthetic data in SFT, and incorporates an AIGC detection model as a reward signal in RL to boost realism. For Chinese text rendering, they employ character-level encoding and synthetically generated text-in-image data with clean backgrounds, achieving superior coverage and accuracy. In image editing, they use a separate model initialized from mid-training weights, combined with multi-task joint training and high-quality human-annotated data, to achieve SOTA consistency and instruction-following performance. Finally, they release a complete open-source ecosystem, including multiple checkpoints and full training code, to accelerate community-driven innovation.
Dataset
- The dataset comprises 1.2 billion samples curated from diverse sources, including open-source repositories, synthetic pipelines, video sequences, and web-scale interleaved corpora, forming a large-scale, high-quality training corpus for generative modeling.
- Key subsets include:
- Open-source datasets (OmniEdit, OmniGen2, NHREdit): rigorously cleaned and instruction-rewritten to produce high-fidelity source-target image pairs.
- Synthesized data: generated via expert models for specific editing tasks (e.g., object manipulation, style transfer, background alteration), with MLLMs crafting instructions and expert models producing target images.
- Interleaved web corpora: mined for implicit editing signals from natural image-text sequences, filtered and rewritten for training suitability, though scale remains limited due to resource intensity.
- Data is processed through a four-stage pipeline:
- Filtering: removes duplicates via MD5 hashing and SigLIP-based similarity; excludes low-resolution images (shortest edge < 384 px), images with aspect ratios outside the 0.25–4.0 range, watermarked content, and AI-generated content (AIGC) flagged by an internal detector; retains only images with LAION-Aesthetics scores ≥ 4.5 (a minimal filtering sketch is given at the end of this section).
- Metadata extraction: five attributes are derived—category, style, named entities, OCR text, and aesthetics—using VLMs and specialized models.
- Multi-granularity captioning: VLMs generate captions ranging from entity-level tags to detailed descriptions, informed by extracted metadata and prompt templates.
- Stratification: data is organized into a pyramid structure based on style quality and content diversity to support progressive, multi-stage training.
- Aesthetic quality is evaluated through a dual-dimensional framework: Quality (technical fidelity via signal statistics and deep reference-free metrics like MUSIQ and Q-Align) and Artistry (photographic merit via VLM-based analysis of composition, lighting, and tonality, complemented by Q-Align-Aesthetics).
- For image editing training, a separate high-fidelity SFT dataset is constructed from real photos, professional retouches, and synthetic sources, with strict human-in-the-loop filtering to ensure structural alignment between source and edited images—critical for maintaining generation stability and consistency.
- The model is trained using a mixture of T2I and editing data, with the editing training split incorporating diverse instruction types and task granularities, including reference-based generation, structural modification, and viewpoint transformation.
- Text rendering capabilities are enhanced through OCR integration, with extracted text incorporated into captions and validated via human-in-the-loop checks, especially for complex layouts like posters, menus, and signage.
- The dataset supports evaluation through CEdit-Bench, a new benchmark with 1,464 bilingual editing pairs across 15 fine-grained task categories, designed to overcome limitations in existing benchmarks by improving task coverage, granularity, and instruction diversity.
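To make the filtering rules above concrete, here is a minimal sketch of the per-image checks, assuming hypothetical callables (`is_watermarked`, `is_aigc`, `aesthetics_score`) standing in for the internal detectors and the LAION-Aesthetics scorer; SigLIP-based near-duplicate removal and the later metadata and captioning stages are omitted for brevity.

```python
import hashlib
from PIL import Image

# Thresholds as stated in the report; the detector callables are assumed interfaces,
# not part of any released API.
MIN_SHORT_EDGE = 384
ASPECT_RANGE = (0.25, 4.0)
MIN_AESTHETIC = 4.5

def keep_image(path, seen_md5, is_watermarked, is_aigc, aesthetics_score):
    """Return True if the image passes the basic filtering rules described above."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    if digest in seen_md5:                      # exact-duplicate removal via MD5
        return False
    seen_md5.add(digest)

    w, h = Image.open(path).size
    if min(w, h) < MIN_SHORT_EDGE:              # drop low-resolution images
        return False
    if not (ASPECT_RANGE[0] <= w / h <= ASPECT_RANGE[1]):   # drop extreme aspect ratios
        return False
    if is_watermarked(path) or is_aigc(path):   # internal watermark / AIGC detectors
        return False
    return aesthetics_score(path) >= MIN_AESTHETIC          # aesthetics threshold
```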
Method
The authors leverage a hybrid MM-DiT (Multimodal Diffusion Transformer) architecture, building upon the FLUX.1-dev framework, to achieve high-fidelity image generation. The core model is a DiT (Diffusion Transformer) that processes latent representations of images and text. The overall framework design is illustrated in the figure below, which shows the model's structure and data flow.

The model employs a double-stream attention mechanism in its initial layers, which processes text and image latents separately, before transitioning to a single-stream mechanism in the subsequent layers. This design maintains a parameter ratio of approximately 1:2 between double-stream and single-stream blocks. For the VAE component, the implementation from FLUX.1-dev is used: input images are compressed by a factor of 8× spatially and then passed through 2×2 token merging, so an H×W image yields an (H/16)×(W/16) token grid before entering the DiT module.
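As a quick sanity check on the shapes implied above, the following sketch (under the stated 8× VAE downsampling and 2×2 token merging; the helper name is illustrative) computes how many image tokens enter the DiT for a given input resolution.

```python
def dit_token_count(height: int, width: int, vae_factor: int = 8, merge: int = 2) -> int:
    """Number of image tokens entering the DiT for an H x W input."""
    grid_h = height // (vae_factor * merge)   # 8x VAE downsampling, then 2x2 token merging
    grid_w = width // (vae_factor * merge)
    return grid_h * grid_w

# A 1024 x 1024 image maps to a 64 x 64 token grid, i.e. 4096 image tokens.
assert dit_token_count(1024, 1024) == 4096
```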
The text encoder, responsible for embedding user prompts into continuous representations, is a unified Qwen2.5-VL-7B model. This choice enhances multilingual compatibility, particularly for Chinese. The design also forgoes the conventional injection of text embeddings into timestep embeddings for adaLN modulation, as empirical evidence suggests it provides negligible performance gains. For visual text rendering, a character-level tokenizer is used for content within quotation marks, which improves data efficiency and convergence without the overhead of specialized encoders.
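The quoted-text handling could look like the minimal sketch below, which splits a prompt into quoted and unquoted spans and tokenizes the quoted spans character by character; the function names and quote handling are assumptions for illustration, not the released implementation.

```python
import re

def split_quoted_spans(prompt: str):
    """Split a prompt into (is_quoted, text) chunks; quoted chunks are the
    strings the model must render verbatim in the image."""
    chunks, last = [], 0
    for m in re.finditer(r'"([^"]*)"|“([^”]*)”', prompt):  # ASCII and CJK-style quotes
        if m.start() > last:
            chunks.append((False, prompt[last:m.start()]))
        chunks.append((True, m.group(1) or m.group(2) or ""))
        last = m.end()
    if last < len(prompt):
        chunks.append((False, prompt[last:]))
    return chunks

def tokenize_prompt(prompt, tokenizer):
    """Ordinary subword tokens for normal text, one token per character for quoted text."""
    tokens = []
    for is_quoted, text in split_quoted_spans(prompt):
        if is_quoted:
            tokens.extend(list(text))                 # character-level for text to be rendered
        else:
            tokens.extend(tokenizer.tokenize(text))   # regular subword tokenization
    return tokens
```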
Positional embedding is handled by a 3D variant of Multimodal Rotary Position Embedding (MRoPE). The first dimension of this embedding is used for modality differentiation, distinguishing between noise latents, text latents, and reference image latents in editing tasks. The remaining two dimensions encode 2D spatial coordinates for images, while for text, both coordinates are set to the same value, analogous to 1D-RoPE, enabling flexible generation across arbitrary aspect ratios and seamless interaction with other modalities.
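A minimal sketch of how such 3D position indices might be assembled is shown below; the modality tag values and helper names are assumptions. The first axis distinguishes the token type, while the other two carry 2D grid coordinates for image tokens and a shared 1D index for text tokens.

```python
import torch

# Assumed modality tags for the first MRoPE axis (illustrative values only).
TEXT, NOISE_LATENT, REF_LATENT = 0, 1, 2

def text_position_ids(num_tokens: int, modality: int = TEXT) -> torch.Tensor:
    """Text tokens: both spatial axes carry the same 1D index (like 1D RoPE)."""
    idx = torch.arange(num_tokens)
    mod = torch.full((num_tokens,), modality)
    return torch.stack([mod, idx, idx], dim=-1)                # (N, 3)

def image_position_ids(grid_h: int, grid_w: int, modality: int = NOISE_LATENT) -> torch.Tensor:
    """Image tokens: the last two axes carry the 2D grid coordinates."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    mod = torch.full_like(ys, modality)
    return torch.stack([mod, ys, xs], dim=-1).reshape(-1, 3)   # (H*W, 3)

# Prompt tokens followed by a 64 x 64 grid of noise latents.
pos = torch.cat([text_position_ids(77), image_position_ids(64, 64)], dim=0)
```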
The model's training process is structured into a multi-stage pipeline, as illustrated in the figure below. This pipeline consists of three distinct phases: Pre-training, Mid-training, and Post-training. The pre-training phase uses a progressive multi-resolution strategy to efficiently acquire global semantic knowledge and refine high-frequency details. The mid-training phase acts as a crucial bridge, refining the model's baseline generation quality by training on a high-fidelity, curated dataset to elevate aesthetic priors and visual realism. The post-training phase focuses on alignment and stylization, comprising Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The SFT stage optimizes for photorealistic attributes and stylistic fidelity using a hybrid dataset of real and synthetic data, while the RL stage employs advanced alignment techniques like DPO and GRPO to enhance instruction adherence and quality.

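Purely for orientation, the staged pipeline above can be summarized as a schedule like the following; the field names and data descriptions are illustrative paraphrases of the text, not the released training configuration.

```python
# Illustrative staged-training schedule mirroring the pipeline described above.
TRAINING_STAGES = [
    {"name": "pre-training",      "resolution": "256 -> 512 -> 512-1024 (progressive)",
     "data": "stratified large-scale corpus (AIGC excluded)"},
    {"name": "mid-training",      "resolution": "high resolution",
     "data": "high-fidelity curated subset for aesthetic priors and realism"},
    {"name": "post-training SFT", "resolution": "high resolution",
     "data": "curated real + hand-picked synthetic images"},
    {"name": "post-training RL",  "resolution": "high resolution",
     "data": "preference data and curated reward models", "algorithms": ["DPO", "GRPO"]},
]
```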
For image editing, the model architecture is modified to include an image conditioning branch. Reference images are encoded into VAE latents and distinguished from noised latents by manipulating the first dimension of the 3D RoPE embeddings. These reference tokens are then concatenated with noised latents along the sequence dimension to serve as input for the diffusion visual stream. The source image and instructions are fed into the multimodal encoder (Qwen2.5-VL), and a distinct system prompt is used to differentiate editing tasks from standard text-to-image generation. The overall schematic of this model architecture is illustrated in the figure below.

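A minimal sketch of how the editing-mode visual stream might be assembled is given below, with the reference-image tokens and noised-latent tokens concatenated along the sequence dimension and disambiguated only through the first MRoPE axis; the tensor shapes, token order, and modality tag values are assumptions.

```python
import torch

NOISE_LATENT, REF_LATENT = 1, 2  # assumed modality tags for the first MRoPE axis

def grid_pos(h: int, w: int, modality: int) -> torch.Tensor:
    """3D position ids for an h x w token grid with a given modality tag."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    mod = torch.full_like(ys, modality)
    return torch.stack([mod, ys, xs], dim=-1).reshape(-1, 3)

def build_visual_stream(noise_tokens, ref_tokens, grid):
    """Concatenate reference-image tokens and noised-latent tokens along the sequence
    axis; both groups share 2D coordinates but differ in the modality index."""
    h, w = grid
    tokens = torch.cat([ref_tokens, noise_tokens], dim=1)          # (B, 2*H*W, C)
    pos = torch.cat([grid_pos(h, w, REF_LATENT),
                     grid_pos(h, w, NOISE_LATENT)], dim=0)         # (2*H*W, 3)
    return tokens, pos

# Example: a 64 x 64 latent grid with 64-dim channels.
noise = torch.randn(1, 64 * 64, 64)
ref = torch.randn(1, 64 * 64, 64)
tokens, pos = build_visual_stream(noise, ref, (64, 64))
```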
Experiment
- Stratification experiment validates a staged training approach: pre-training with limited artistic data (0.5%) prevents photorealistic generation collapse; mid-training gradually increases stylized data to 2.5% for style enhancement while preserving realism; SFT phase uses curated real and synthetic data to align with human aesthetic preferences, enabling rapid convergence.
- Progressive mixed-resolution training (256px → 512px → 512–1024px) with bucket sampling and dynamic sampling of synthetic text data improves training stability and text-rendering accuracy, especially for rare Chinese characters; the synthetic text data is phased out in the final stage of pre-training (a bucket-sampling sketch is given after this list).
- GenEval, DPG-Bench, and WISE evaluations show LongCat-Image achieves SOTA or competitive performance in text-image alignment, compositional reasoning, and world knowledge, demonstrating strong semantic understanding and fine-grained controllability.
- GlyphDraw2, CVTG-2K, ChineseWord, and Poster&SceneBench evaluations confirm LongCat-Image achieves SOTA results in complex Chinese and English text rendering, including multi-region layouts and long-tail character coverage, with superior performance in intricate character structures and real-world scene integration.
- Human evaluation (MOS) on 400 prompts shows LongCat-Image outperforms HunyuanImage 3.0 across all metrics, matches Qwen-Image in alignment and plausibility, exceeds Qwen-Image and Seedream 4.0 in visual realism, and delivers competitive aesthetic quality.
- CEdit-Bench, GEdit-Bench, and ImgEdit-Bench quantitative evaluations demonstrate strong performance in image editing tasks, with LongCat-Image-Edit achieving SOTA results on CEdit-Bench and GEdit-Bench, and outperforming Qwen-Image-Edit and FLUX.1 Kontext [Pro] in both comprehensive quality and consistency.
- Side-by-side human evaluation (SBS) shows LongCat-Image-Edit achieves higher win rates than Qwen-Image-Edit and FLUX.1 Kontext [Pro] in comprehensive quality and consistency, though a gap remains compared to commercial models Nano Banana and Seedream 4.0.
- Qualitative results confirm superior performance in multi-turn editing, portrait and human-centric editing, style transfer, object manipulation, scene text editing, and viewpoint transformation, with consistent preservation of identity, layout, and texture across complex, multi-step edits.
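For context on the bucket sampling mentioned above, the sketch below groups images into shared-shape buckets (typically by aspect ratio) so that every batch has a uniform resolution, which is the standard way mixed-resolution training is kept stable; the bucket list and grouping rule are assumptions, not the report's pipeline.

```python
import random
from collections import defaultdict

# Candidate bucket shapes around a 512-1024 px budget (illustrative, not the report's list).
BUCKETS = [(512, 512), (576, 1024), (640, 896), (768, 768), (896, 640), (1024, 576), (1024, 1024)]

def nearest_bucket(width: int, height: int):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ratio))

def bucket_batches(samples, batch_size, rng=random):
    """Group (path, width, height) samples by bucket and yield same-shape batches."""
    groups = defaultdict(list)
    for path, w, h in samples:
        groups[nearest_bucket(w, h)].append(path)
    batches = []
    for bucket, paths in groups.items():
        rng.shuffle(paths)
        batches += [(bucket, paths[i:i + batch_size]) for i in range(0, len(paths), batch_size)]
    rng.shuffle(batches)  # interleave buckets so each step still sees mixed content
    return batches
```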
Results show that LongCat-Image-Edit achieves the highest overall score of 4.50 on CEdit-Bench, outperforming all compared models, including FLUX.1 Kontext [Pro] and Qwen-Image-Edit. The model demonstrates strong performance across all editing categories, particularly excelling in action, style, and hybrid editing tasks, with notable improvements in consistency and comprehensive quality.

Results show that LongCat-Image achieves an average score of 0.95 across all evaluated benchmarks, outperforming HunyuanImage-3.0 and approaching the performance of Seedream 4.0, which achieves an average score of 0.97. The model demonstrates strong capabilities in both English and Chinese text rendering tasks, with particularly high scores on the Complex-en and Poster-zh subsets, indicating robust performance in handling complex textual content and bilingual scenarios.

Results show that LongCat-Image achieves the highest average word accuracy of 0.8658, outperforming Seedream 4.0 and Qwen-Image, while also achieving a competitive NED score of 0.9361 and a CLIPScore of 0.7859. The model demonstrates strong performance in multi-region text rendering, particularly in handling complex text layouts, though it lags behind Seedream 4.0 in NED and Qwen-Image in CLIPScore.

The authors use a comprehensive evaluation framework to assess LongCat-Image's performance across multiple dimensions, including text-image alignment, entity recognition, attribute binding, and relational reasoning. Results show that LongCat-Image achieves strong performance across all categories, with particularly high scores in entity and attribute tasks, and ranks among the top models in overall performance, demonstrating its robustness in handling complex compositional and semantic constraints.

Results show that LongCat-Image achieves an overall score of 0.87 on the GenEval benchmark, matching the best-performing model Qwen-Image and outperforming all other models listed. The model demonstrates strong capabilities in attribute binding, counting, and position understanding, with particularly high scores in single-object and two-object tasks, indicating robust fine-grained control over compositional elements.
