Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Bai Jinze ; Bai Shuai ; Yang Shusheng ; Wang Shijie ; Tan Sinan ; Wang Peng ; Lin Junyang ; Zhou Chang ; Zhou Jingren

Abstract
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| chart-question-answering-on-chartqa | Qwen-VL | 1:1 Accuracy: 65.7 |
| chart-question-answering-on-chartqa | Qwen-VL-Chat | 1:1 Accuracy: 66.3 |
| fs-mevqa-on-sme | Qwen-VL-Max | #Learning Samples (N): 16; ACC: 40.33; BLEU-4: 24.30; CIDEr: 201.47; Detection: 1.05; METEOR: 23.40; ROUGE-L: 34.52; SPICE: 26.13 |
| mmr-total-on-mrr-benchmark | Qwen-VL-Max | Total Column Score: 366 |
| mmr-total-on-mrr-benchmark | Qwen-VL-Plus | Total Column Score: 310 |
| natural-language-visual-grounding-on | Qwen-VL | Accuracy (%): 5.2 |
| spatial-reasoning-on-embspatial-bench | Qwen-VL-Max | Generation: 49.11 |
| visual-question-answering-on-docvqa-test | Qwen-VL | ANLS: 0.651 |
| visual-question-answering-on-docvqa-test | Qwen-VL-Plus | ANLS: 0.9024 |
| visual-question-answering-on-docvqa-test | Qwen-VL-Chat | ANLS: 0.626 |
| visual-question-answering-on-mm-vet | Qwen-VL-Max | GPT-4 score: 66.6±0.5 |
| visual-question-answering-on-mm-vet | Qwen-VL-Plus | GPT-4 score: 61.1±0.2 |
| visual-question-answering-on-mm-vet-v2 | Qwen-VL-Max | GPT-4 score: 55.8±0.2 |
| visual-question-answering-on-vip-bench | Qwen-VL-Chat (Coordinates) | GPT-4 score (bbox): 45.3 |
| visual-question-answering-on-vip-bench | Qwen-VL-Chat (Visual Prompt) | GPT-4 score (bbox): 39.2; GPT-4 score (human): 41.7 |
| visual-question-answering-vqa-on-core-mm | Qwen-VL-Chat | Abductive: 44.39; Analogical: 30.42; Deductive: 37.55; Overall score: 37.39; Params: 16B |
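
The DocVQA rows above report ANLS (Average Normalized Levenshtein Similarity), which this page does not define. The following is a minimal sketch of how such a score is commonly computed, assuming the usual convention: a case-insensitive comparison, a per-answer score of 1 minus the normalized edit distance, and a cutoff of 0.5 above which the score is zeroed. The function names `levenshtein` and `anls` and the default threshold are illustrative assumptions, not the benchmark's official evaluation script.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute; cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best per-answer similarity.

    Assumption: a similarity is kept only when the normalized edit
    distance is below `threshold`; otherwise it counts as 0.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            longest = max(len(p), len(g))
            distance = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition a score lies in [0, 1], so a value such as 0.651 means the predicted answers were, on average, about 65% similar to the closest accepted answer after thresholding.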