Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Abstract
Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where quantization errors can limit semantic expressiveness and degrade vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. To reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation, and editing. Empirically, we find that a unified continuous visual representation reconciles the competing requirements that understanding and generation place on the tokenizer, leading to state-of-the-art-level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
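To make the three-stage design concrete, below is a minimal PyTorch sketch of a continuous tokenizer with the structure the abstract describes: compact low-level codes for generation, an expanded high-dimensional semantic space for understanding, and a reconstruction head that closes the loop. All module names, layer choices, and dimensions here are illustrative assumptions, not the released MingTok implementation.

```python
import torch
import torch.nn as nn

class ContinuousTokenizerSketch(nn.Module):
    """Hypothetical three-stage continuous tokenizer:
    low-level encoding -> semantic expansion -> visual reconstruction.
    No vector quantization is applied, so the latent codes stay
    continuous and incur no discretization error."""

    def __init__(self, patch_dim=768, code_dim=64, sem_dim=1024):
        super().__init__()
        # Stage 1: compress image patches into compact low-level
        # continuous codes (the representation generation prefers).
        self.low_level_encoder = nn.Sequential(
            nn.Linear(patch_dim, 512), nn.GELU(), nn.Linear(512, code_dim)
        )
        # Stage 2: expand the compact codes into discriminative
        # high-dimensional semantic features (what understanding favors).
        self.semantic_expander = nn.Sequential(
            nn.Linear(code_dim, sem_dim), nn.GELU(),
            nn.TransformerEncoderLayer(d_model=sem_dim, nhead=8,
                                       batch_first=True),
        )
        # Stage 3: reconstruct the patch signal from the semantic
        # features, closing the loop for image generation.
        self.reconstructor = nn.Sequential(
            nn.Linear(sem_dim, 512), nn.GELU(), nn.Linear(512, patch_dim)
        )

    def forward(self, patches):
        codes = self.low_level_encoder(patches)    # compact continuous codes
        semantics = self.semantic_expander(codes)  # rich features for VL tasks
        recon = self.reconstructor(semantics)      # reconstruction target
        return codes, semantics, recon

# Usage: a batch of 2 images, each as 256 flattened patch embeddings.
x = torch.randn(2, 256, 768)
codes, semantics, recon = ContinuousTokenizerSketch()(x)
print(codes.shape, semantics.shape, recon.shape)
# torch.Size([2, 256, 64]) torch.Size([2, 256, 1024]) torch.Size([2, 256, 768])
```

Under this sketch, autoregressive generation would predict the next compact code in the shared continuous space, while understanding would read off the expanded semantic features, so one tokenizer serves both task families.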