4 months ago

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
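As a rough illustration of how a dynamic-resolution scheme maps images of different sizes to different numbers of visual tokens, the sketch below estimates a token count from image dimensions. The patch size of 14 pixels, the 2x2 merging of adjacent patches into one token, and the function name `visual_token_count` are assumptions made for illustration; none of them is stated in the abstract, and the official repository should be consulted for the actual preprocessing.

```python
def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Estimate the number of visual tokens for an image of the given size.

    Assumptions (hypothetical, not taken from the abstract):
    - the image is split into square patches of `patch` pixels per side;
    - each `merge` x `merge` group of patches is fused into one token.
    """
    unit = patch * merge  # pixels covered by one token along each side
    # Round each side to the nearest multiple of `unit`, keeping at least one unit.
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    return (h // unit) * (w // unit)


if __name__ == "__main__":
    # Larger images yield more tokens instead of being resized to a fixed shape.
    for size in [(224, 224), (448, 896), (1080, 1920)]:
        print(size, visual_token_count(*size))
```

Under these assumptions the token count grows with image area, which is the contrast with a fixed-resolution pipeline that would map every input to the same number of tokens.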

Code Repositories

baichuan-inc/Baichuan-Omni-1.5 (pytorch; mentioned in GitHub)
qwenlm/qwen2.5-vl (pytorch; mentioned in GitHub)
juruobenruo/DexVLA (pytorch; mentioned in GitHub)
qwenlm/qwen2-vl (official; pytorch; mentioned in GitHub)
tutujingyugang1/ChatVLA_public (pytorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
natural-language-visual-grounding-on | Qwen2-VL-7B | Accuracy (%): 42.1
temporal-relation-extraction-on-vinoground | Qwen2-VL-7B | Group Score: 15.2; Text Score: 40.2; Video Score: 32.4
temporal-relation-extraction-on-vinoground | Qwen2-VL-72B | Group Score: 17.4; Text Score: 50.4; Video Score: 32.6
video-question-answering-on-next-qa | Qwen2-VL(7B) | Accuracy: 81.2
video-question-answering-on-tvbench | Qwen2-VL-72B | Average Accuracy: 52.7
video-question-answering-on-tvbench | Qwen2-VL-7B | Average Accuracy: 43.8
visual-question-answering-on-mm-vet | Qwen2-VL-2B | GPT-4 score: 49.5
visual-question-answering-on-mm-vet | Qwen2-VL-72B | GPT-4 score: 74.0
visual-question-answering-on-mm-vet | Qwen2-VL-7B | GPT-4 score: 62.0
visual-question-answering-on-mm-vet-v2 | Qwen2-VL-72B (qwen-vl-max-0809) | GPT-4 score: 66.9±0.3; Params: 72B
visual-question-answering-vqa-on-vlm2-bench | Qwen2-VL-7B | Average Score on VLM2-bench (9 subtasks): 42.37; GC-mat: 27.80; GC-trk: 19.18; OC-cnt: 45.99; OC-cpr: 68.06; OC-grp: 35.00; PC-VID: 16.25; PC-cnt: 58.59; PC-cpr: 61.50; PC-grp: 49.00
