3 months ago

Qwen2.5-VL Technical Report

Abstract

We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

Code Repositories

qwenlm/qwen2.5-vl (PyTorch; mentioned in GitHub)
princeton-nlp/CharXiv (PyTorch; mentioned in GitHub)
qwenlm/qwen2-vl (PyTorch; mentioned in GitHub)

Benchmarks

Benchmark: visual-question-answering-vqa-on-vlm2-bench
Methodology: Qwen2.5-VL-7B
Metrics:
Average Score on VLM2-bench (9 subtasks): 54.82
GC-mat: 35.91
GC-trk: 43.38
OC-cnt: 41.72
OC-cpr: 71.39
OC-grp: 47.50
PC-VID: 46.50
PC-cnt: 57.98
PC-cpr: 80.00
PC-grp: 69.00
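
As a quick consistency check on the figures above, the following minimal sketch recomputes the reported average from the nine subtask scores. It assumes the average is a plain unweighted mean of the subtasks, which the listing does not state explicitly; the variable names are illustrative only.

```python
# Subtask scores for Qwen2.5-VL-7B on VLM2-bench, copied from the listing above.
subtask_scores = {
    "GC-mat": 35.91,
    "GC-trk": 43.38,
    "OC-cnt": 41.72,
    "OC-cpr": 71.39,
    "OC-grp": 47.50,
    "PC-VID": 46.50,
    "PC-cnt": 57.98,
    "PC-cpr": 80.00,
    "PC-grp": 69.00,
}

# Assumption: the reported average is an unweighted mean over the 9 subtasks.
average = sum(subtask_scores.values()) / len(subtask_scores)

print(f"Average over {len(subtask_scores)} subtasks: {average:.2f}")
# prints: Average over 9 subtasks: 54.82
```

Under that assumption the recomputed value agrees with the reported average of 54.82.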
