From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu

Abstract
The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
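The abstract's notion of a native VLM primitive, which encodes pixels and words inside one dense, monolithic model rather than attaching a separate vision encoder to a language model, can be illustrated with a minimal sketch. The sketch below is not the NEO implementation; every name (NativeVLM, NativeVLMBlock, patch_embed, word_embed) and every dimension is a hypothetical assumption, showing one standard pre-norm Transformer in PyTorch that jointly attends over image patches and text tokens in a shared embedding space.

# Illustrative sketch only: a minimal "native" vision-language stack in PyTorch.
# Names and hyperparameters are hypothetical, not taken from the NEO codebase.
import torch
import torch.nn as nn

class NativeVLMBlock(nn.Module):
    """One shared block that attends over image patches and text tokens jointly."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # joint cross-modal attention
        x = x + self.mlp(self.norm2(x))
        return x

class NativeVLM(nn.Module):
    """Projects pixels and words into one embedding space, then runs shared blocks."""
    def __init__(self, vocab_size: int = 32000, dim: int = 768, patch: int = 16, depth: int = 4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # pixels -> tokens
        self.word_embed = nn.Embedding(vocab_size, dim)                        # words  -> tokens
        self.blocks = nn.ModuleList([NativeVLMBlock(dim) for _ in range(depth)])

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        txt = self.word_embed(text_ids)                           # (B, num_tokens, dim)
        x = torch.cat([vis, txt], dim=1)                          # one shared sequence
        for blk in self.blocks:
            x = blk(x)
        return x

# Usage: a 224x224 image plus an 8-token prompt flow through the same dense stack.
model = NativeVLM()
out = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(out.shape)  # (1, 196 + 8, 768)

In this sketch, the only modality-specific parameters are the two input projections; every subsequent block is shared across pixels and words, which is the structural contrast with modular VLMs that keep a separately pretrained vision tower and language model connected by an adapter.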