How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Abstract
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model, InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into tiles of 448×448 pixels, ranging from 1 to 40 tiles according to the aspect ratio and resolution of the input image, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, and annotated it with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
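The dynamic high-resolution scheme described above can be sketched as follows: pick a grid of 448×448 tiles (between 1 and 40 tiles in total) whose aspect ratio best matches the input image, resize the image to that grid, and split it into tiles. The snippet below is a minimal illustration of this idea, not the authors' released preprocessing code; the function names, the tie-breaking rule, and the resizing details are assumptions for the sake of the example.

```python
# Illustrative sketch of dynamic high-resolution tiling (not the official
# InternVL implementation): choose a (cols, rows) grid of 448x448 tiles,
# capped at 40 tiles, whose aspect ratio best matches the input image.
from PIL import Image

TILE = 448        # tile side length in pixels (per the abstract)
MAX_TILES = 40    # upper bound on tiles per image (per the abstract)

def best_grid(width, height, max_tiles=MAX_TILES):
    """Pick the (cols, rows) grid whose aspect ratio is closest to the image's."""
    image_ratio = width / height
    candidates = [(c, r)
                  for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1)
                  if c * r <= max_tiles]
    # Closest aspect ratio wins; ties are broken (as an assumption here)
    # in favor of more tiles, i.e. higher effective resolution.
    return min(candidates,
               key=lambda cr: (abs(cr[0] / cr[1] - image_ratio), -(cr[0] * cr[1])))

def dynamic_tiles(image: Image.Image):
    """Resize the image to the chosen grid and split it into 448x448 tiles."""
    cols, rows = best_grid(*image.size)
    resized = image.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]

# Example: a 4000x3000 photo has a 4:3 aspect ratio, so a 4x3 grid
# (12 tiles of 448x448) would be selected under the 40-tile cap.
```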
Benchmarks
| Benchmark | Model | GPT-4 Score | Params |
|---|---|---|---|
| visual-question-answering-on-mm-vet | InternVL 1.2 | 48.9 | 40B |
| visual-question-answering-on-mm-vet | InternVL 1.5 | 62.8 | 26B |
| visual-question-answering-on-mm-vet-v2 | InternVL-Chat-V1-5 | 51.5±0.2 | – |