4 months ago

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, and 9 more

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
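
To make the dynamic-resolution idea concrete, the sketch below estimates how the number of visual tokens could grow with image size. The abstract does not give the numbers involved, so the patch size of 14 pixels and the 2×2 merging of adjacent patches into one token are assumptions, as is the rounding rule; the function name `visual_token_count` is hypothetical, and any special delimiter tokens around the image are ignored.

```python
# Rough sketch: visual tokens as a function of image resolution.
# PATCH and MERGE are assumed values, not taken from the abstract.
PATCH = 14   # assumed ViT patch side, in pixels
MERGE = 2    # assumed: each MERGE x MERGE group of patches becomes one token


def visual_token_count(height: int, width: int) -> int:
    """Estimate the visual token count for one image under the assumptions above."""
    unit = PATCH * MERGE  # side of the pixel region covered by one merged token
    # Round each side to the nearest whole number of units, with at least one.
    rows = max(1, round(height / unit))
    cols = max(1, round(width / unit))
    return rows * cols


if __name__ == "__main__":
    for h, w in [(224, 224), (448, 448), (1080, 1920)]:
        print(f"{h}x{w}: {visual_token_count(h, w)} visual tokens")
```

Under these assumptions a small image costs a few dozen tokens while a full-HD frame costs thousands, which is the trade-off a fixed-resolution pipeline hides by resizing every input to the same shape.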

Code Repositories

Repository	Framework	Notes
baichuan-inc/Baichuan-Omni-1.5	pytorch	Mentioned in GitHub
qwenlm/qwen2.5-vl	pytorch	Mentioned in GitHub
juruobenruo/DexVLA	pytorch	Mentioned in GitHub
yangyucheng000/University/tree/main/model-3/qwen2_vl	mindspore	-
qwenlm/qwen2-vl	pytorch	Official; Mentioned in GitHub
MindCode-4/code-4/tree/main/qwen2_vl	mindspore	-
MindCode-4/code-4/tree/main/qwen2_moe	mindspore	-
tutujingyugang1/ChatVLA_public	pytorch	Mentioned in GitHub

Benchmarks

Benchmark	Model	Metrics
natural-language-visual-grounding-on	Qwen2-VL-7B	Accuracy (%): 42.1
temporal-relation-extraction-on-vinoground	Qwen2-VL-7B	Group Score: 15.2; Text Score: 40.2; Video Score: 32.4
temporal-relation-extraction-on-vinoground	Qwen2-VL-72B	Group Score: 17.4; Text Score: 50.4; Video Score: 32.6
video-question-answering-on-next-qa	Qwen2-VL(7B)	Accuracy: 81.2
video-question-answering-on-tvbench	Qwen2-VL-72B	Average Accuracy: 52.7
video-question-answering-on-tvbench	Qwen2-VL-7B	Average Accuracy: 43.8
visual-question-answering-on-mm-vet	Qwen2-VL-2B	GPT-4 score: 49.5
visual-question-answering-on-mm-vet	Qwen2-VL-72B	GPT-4 score: 74.0
visual-question-answering-on-mm-vet	Qwen2-VL-7B	GPT-4 score: 62.0
visual-question-answering-on-mm-vet-v2	Qwen2-VL-72B (qwen-vl-max-0809)	GPT-4 score: 66.9±0.3; Params: 72B
visual-question-answering-vqa-on-vlm2-bench	Qwen2-VL-7B	Average Score on VLM2-bench (9 subtasks): 42.37; GC-mat: 27.80; GC-trk: 19.18; OC-cnt: 45.99; OC-cpr: 68.06; OC-grp: 35.00; PC-VID: 16.25; PC-cnt: 58.59; PC-cpr: 61.50; PC-grp: 49.00
