MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Abstract
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
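The hybrid-tokenizer idea in the abstract can be sketched compactly. The PyTorch snippet below is a minimal illustration, not the paper's implementation: the module names, dimensions, and the nearest-neighbor codebook quantization are assumptions chosen only to show how one shared encoder can feed both a continuous adapter (for understanding) and a discrete adapter (for generation) in a common space.

```python
import torch
import torch.nn as nn

class HybridVisionTokenizer(nn.Module):
    """Illustrative sketch of a hybrid vision tokenizer: one shared encoder
    feeding two lightweight adapters. All names and sizes are hypothetical."""

    def __init__(self, embed_dim=768, llm_dim=1024, codebook_size=8192):
        super().__init__()
        # Shared encoder (stand-in for a pretrained vision backbone).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Continuous adapter: embeddings for image-to-text understanding.
        self.cont_adapter = nn.Linear(embed_dim, llm_dim)
        # Discrete adapter: projection followed by codebook quantization,
        # yielding image token ids for text-to-image generation.
        self.disc_adapter = nn.Linear(embed_dim, llm_dim)
        self.codebook = nn.Embedding(codebook_size, llm_dim)

    def forward(self, patches):
        feats = self.encoder(patches)       # (B, N, embed_dim), shared view
        cont = self.cont_adapter(feats)     # (B, N, llm_dim), continuous
        proj = self.disc_adapter(feats)     # (B, N, llm_dim)
        # Nearest-codebook-entry lookup -> discrete image token ids.
        dists = torch.cdist(proj, self.codebook.weight.unsqueeze(0))
        ids = dists.argmin(dim=-1)          # (B, N) integer tokens
        return cont, ids

tok = HybridVisionTokenizer()
patches = torch.randn(2, 16, 768)          # 2 images, 16 patch features each
cont_emb, token_ids = tok(patches)
print(cont_emb.shape, token_ids.shape)     # [2, 16, 1024] and [2, 16]
```

Because both adapters consume the same encoder features, the continuous and discrete views share one semantic space, which is the property the abstract credits with reducing the understanding/generation trade-off.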