MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer


Abstract

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
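The hybrid-tokenizer data flow described above can be sketched in a few lines: a shared encoder produces per-patch features, one adapter projects them to continuous embeddings for understanding, and the other quantizes them into discrete tokens for generation. This is a minimal illustrative sketch, not the paper's implementation; all dimensions are made up, the adapters are plain linear maps, and discrete tokenization is shown as nearest-neighbour vector quantization, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration (not from the paper).
N_PATCHES, ENC_DIM, LLM_DIM, CODEBOOK_SIZE = 16, 32, 64, 256

# Shared vision encoder output: one feature vector per image patch.
patch_feats = rng.normal(size=(N_PATCHES, ENC_DIM))

# Adapter 1: continuous embeddings fed to the LLM for image-to-text understanding.
W_cont = rng.normal(size=(ENC_DIM, LLM_DIM)) / np.sqrt(ENC_DIM)
cont_embeds = patch_feats @ W_cont            # shape: (N_PATCHES, LLM_DIM)

# Adapter 2: discrete tokens for text-to-image generation, sketched here as
# nearest-neighbour lookup against a learned codebook (an assumption).
codebook = rng.normal(size=(CODEBOOK_SIZE, ENC_DIM))
dists = ((patch_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
discrete_tokens = dists.argmin(axis=1)        # shape: (N_PATCHES,)

print(cont_embeds.shape)                      # (16, 64)
print(discrete_tokens.shape)                  # (16,)
```

Because both adapters read the same encoder features, the continuous and discrete representations live in a common semantic space, which is the property the paper credits with reducing the understanding/generation trade-off.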
