a month ago

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Abstract

Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where quantization errors can limit semantic expressiveness and degrade vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. To reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation, and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements that the understanding and generation tasks place on the tokenizer, thereby leading to state-of-the-art-level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
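
The three-stage design described in the abstract can be illustrated at the level of tensor shapes. The following is only a minimal sketch under stated assumptions, not the released MingTok implementation: the random linear maps stand in for learned networks, every name and dimension (`PATCH_DIM`, `LATENT_DIM`, `SEMANTIC_DIM`, `NUM_TOKENS`, `tokenize`) is hypothetical, and chaining each stage into the next is one reading of "sequential" that the abstract does not spell out.

```python
import random

# Hypothetical sizes, chosen only to make the shapes visible.
PATCH_DIM = 48      # flattened pixel patch, e.g. 4 x 4 x 3
LATENT_DIM = 8      # compact continuous code (assumed to suit generation)
SEMANTIC_DIM = 64   # high-dimensional features (assumed to suit understanding)
NUM_TOKENS = 16     # patches per image in this toy example


def linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, tokens):
    """Apply one linear map to every token vector in a sequence."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weights] for tok in tokens]


rng = random.Random(0)
W_enc = linear(PATCH_DIM, LATENT_DIM, rng)     # stand-in for low-level encoding
W_sem = linear(LATENT_DIM, SEMANTIC_DIM, rng)  # stand-in for semantic expansion
W_rec = linear(SEMANTIC_DIM, PATCH_DIM, rng)   # stand-in for visual reconstruction


def tokenize(patches):
    """Run the three stages in sequence and return every intermediate result."""
    latents = apply(W_enc, patches)    # stage 1: compact continuous codes
    semantic = apply(W_sem, latents)   # stage 2: expanded high-dimensional features
    recon = apply(W_rec, semantic)     # stage 3: back to pixel-patch space
    return latents, semantic, recon


patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(NUM_TOKENS)]
latents, semantic, recon = tokenize(patches)
print(len(latents[0]), len(semantic[0]), len(recon[0]))
```

In this toy setup no quantization step appears anywhere: the compact codes stay real-valued, which is the property the abstract contrasts with discrete tokenizers.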
