a month ago

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Ziyuan Huang DanDan Zheng Cheng Zou Rui Liu Xiaolong Wang Kaixiang Ji Weilong Chai Jianxin Sun Libin Wang Yongjie Lv

Abstract

Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements placed on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
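
To make the three-stage data flow in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the released MingTok code: the class name `ToyThreeStageTokenizer`, the random linear maps, and all dimensions (48-dimensional patches, 8-dimensional latents, 64-dimensional semantic features) are hypothetical and chosen only for illustration. It also assumes that reconstruction consumes the expanded semantic features, which is one reading of "sequential". The only point it shows is that the latents stay continuous (no codebook lookup or quantization) and that a compact code is expanded before being used for understanding or reconstruction.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (out_dim x in_dim) stored as nested lists."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, tokens):
    """Apply the same linear map to every token vector."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weights]
            for tok in tokens]


class ToyThreeStageTokenizer:
    """Hypothetical stand-in for a continuous three-stage tokenizer."""

    def __init__(self, patch_dim=48, latent_dim=8, semantic_dim=64, seed=0):
        rng = random.Random(seed)
        # Stage 1: low-level encoding into a compact continuous code.
        self.enc = make_linear(patch_dim, latent_dim, rng)
        # Stage 2: semantic expansion into high-dimensional features.
        self.expand = make_linear(latent_dim, semantic_dim, rng)
        # Stage 3: visual reconstruction back to patch space
        # (assumed here to read the expanded features).
        self.dec = make_linear(semantic_dim, patch_dim, rng)

    def encode(self, patches):
        # No quantization step: the outputs remain real-valued vectors.
        return apply_linear(self.enc, patches)

    def expand_semantics(self, latents):
        return apply_linear(self.expand, latents)

    def reconstruct(self, features):
        return apply_linear(self.dec, features)


# Four fake image patches, each flattened to 48 values.
data_rng = random.Random(1)
patches = [[data_rng.random() for _ in range(48)] for _ in range(4)]

tokenizer = ToyThreeStageTokenizer()
latents = tokenizer.encode(patches)             # compact codes for generation
features = tokenizer.expand_semantics(latents)  # rich features for understanding
recon = tokenizer.reconstruct(features)         # back to patch space

print(len(latents), len(latents[0]), len(features[0]), len(recon[0]))
```

In this sketch an autoregressive model would predict the next compact latent vector directly, rather than the index of a discrete codebook entry. The real system uses learned networks for each stage rather than random linear maps.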

