Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Abstract
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size, with versions at 2B, 8B, and 72B parameters, and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
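The core of Naive Dynamic Resolution is that the visual token count is a function of the input resolution rather than a fixed constant: the image is snapped to a patch-aligned size and tokenized over the resulting grid. The sketch below illustrates that idea; the patch size of 14 and the 2x2 token merge match the released Qwen2-VL code, but the rounding scheme and the `visual_token_count` helper are illustrative assumptions, not the model's exact preprocessing.

```python
# Illustrative sketch of dynamic-resolution token counting (assumed, simplified).
PATCH = 14  # ViT patch edge in pixels (per the released Qwen2-VL code)
MERGE = 2   # 2x2 adjacent patches are merged into one visual token


def visual_token_count(height: int, width: int) -> int:
    """Snap each side to a multiple of PATCH * MERGE and return the
    number of merged visual tokens the image would yield."""
    unit = PATCH * MERGE  # 28 px per merged-token edge
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    return (h // unit) * (w // unit)


# A 224x224 image maps to an 8x8 grid of merged tokens:
print(visual_token_count(224, 224))  # -> 64
# Doubling the width doubles the token budget:
print(visual_token_count(448, 224))  # -> 128
```

This is the contrast with fixed-resolution pipelines, where every image is resized to one canonical shape and thus always produces the same number of tokens regardless of its original detail.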
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| natural-language-visual-grounding-on | Qwen2-VL-7B | Accuracy (%): 42.1 |
| temporal-relation-extraction-on-vinoground | Qwen2-VL-7B | Group Score: 15.2 Text Score: 40.2 Video Score: 32.4 |
| temporal-relation-extraction-on-vinoground | Qwen2-VL-72B | Group Score: 17.4 Text Score: 50.4 Video Score: 32.6 |
| video-question-answering-on-next-qa | Qwen2-VL(7B) | Accuracy: 81.2 |
| video-question-answering-on-tvbench | Qwen2-VL-72B | Average Accuracy: 52.7 |
| video-question-answering-on-tvbench | Qwen2-VL-7B | Average Accuracy: 43.8 |
| visual-question-answering-on-mm-vet | Qwen2-VL-2B | GPT-4 score: 49.5 |
| visual-question-answering-on-mm-vet | Qwen2-VL-72B | GPT-4 score: 74.0 |
| visual-question-answering-on-mm-vet | Qwen2-VL-7B | GPT-4 score: 62.0 |
| visual-question-answering-on-mm-vet-v2 | Qwen2-VL-72B (qwen-vl-max-0809) | GPT-4 score: 66.9±0.3 Params: 72B |
| visual-question-answering-vqa-on-vlm2-bench | Qwen2-VL-7B | Average Score on VLM2-bench (9 subtasks): 42.37 GC-mat: 27.80 GC-trk: 19.18 OC-cnt: 45.99 OC-cpr: 68.06 OC-grp: 35.00 PC-VID: 16.25 PC-cnt: 58.59 PC-cpr: 61.50 PC-grp: 49.00 |