MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Abstract
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
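The hybrid-tokenizer idea in the abstract can be sketched compactly. The PyTorch snippet below is a minimal illustration, not the paper's implementation: the module names, dimensions, and the nearest-neighbor codebook quantization are assumptions chosen only to show how one shared encoder can feed both a continuous adapter (for understanding) and a discrete adapter (for generation) in a common space.

```python
import torch
import torch.nn as nn

class HybridVisionTokenizer(nn.Module):
    """Illustrative sketch of a hybrid vision tokenizer: one shared encoder
    feeding two lightweight adapters. All names and sizes are hypothetical."""

    def __init__(self, embed_dim=768, llm_dim=1024, codebook_size=8192):
        super().__init__()
        # Shared encoder (stand-in for a pretrained vision backbone).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Continuous adapter: embeddings for image-to-text understanding.
        self.cont_adapter = nn.Linear(embed_dim, llm_dim)
        # Discrete adapter: projection followed by codebook quantization,
        # yielding image token ids for text-to-image generation.
        self.disc_adapter = nn.Linear(embed_dim, llm_dim)
        self.codebook = nn.Embedding(codebook_size, llm_dim)

    def forward(self, patches):
        feats = self.encoder(patches)       # (B, N, embed_dim), shared view
        cont = self.cont_adapter(feats)     # (B, N, llm_dim), continuous
        proj = self.disc_adapter(feats)     # (B, N, llm_dim)
        # Nearest-codebook-entry lookup -> discrete image token ids.
        dists = torch.cdist(proj, self.codebook.weight.unsqueeze(0))
        ids = dists.argmin(dim=-1)          # (B, N) integer tokens
        return cont, ids

tok = HybridVisionTokenizer()
patches = torch.randn(2, 16, 768)          # 2 images, 16 patch features each
cont_emb, token_ids = tok(patches)
print(cont_emb.shape, token_ids.shape)     # [2, 16, 1024] and [2, 16]
```

Because both adapters consume the same encoder features, the continuous and discrete views share one semantic space, which is the property the abstract credits with reducing the understanding/generation trade-off.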