InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Abstract
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level or pixel-level recognition, and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval; it can also be linked with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research contributes to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
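The zero-shot classification mentioned in the abstract follows the CLIP-style contrastive recipe: embed the image and one text prompt per class into a shared space, then rank classes by cosine similarity. The sketch below illustrates only that mechanism with toy NumPy vectors standing in for encoder outputs; it is not InternVL's actual inference code, and the embeddings here are synthetic placeholders.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """CLIP-style zero-shot classification: cosine similarity between one
    image embedding and one text embedding per class prompt, softmaxed."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature       # scaled cosine similarities
    logits -= logits.max()                 # numerical stability for exp
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

# Toy embeddings standing in for InternVL-C encoder outputs (hypothetical).
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))                    # 3 class prompts
image_emb = text_embs[1] + 0.01 * rng.normal(size=8)   # closest to class 1
probs = zero_shot_classify(image_emb, text_embs)
print(int(probs.argmax()))  # index of the best-matching class prompt
```

In practice the encoders would be the model's vision and text towers; only the similarity-and-softmax step is shown faithfully here.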
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| image-retrieval-on-flickr30k-cn | InternVL-G-FT | R@1: 85.9 R@5: 97.1 R@10: 98.7 |
| image-retrieval-on-flickr30k-cn | InternVL-C-FT | R@1: 85.2 R@5: 97.0 R@10: 98.5 |
| image-to-text-retrieval-on-flickr30k | InternVL-G-FT (finetuned, w/o ranking) | Recall@1: 97.9 Recall@5: 100 Recall@10: 100 |
| image-to-text-retrieval-on-flickr30k | InternVL-C-FT (finetuned, w/o ranking) | Recall@1: 97.2 Recall@5: 100 Recall@10: 100 |
| mmr-total-on-mrr-benchmark | InternVL2-8B | Total Column Score: 368 |
| mmr-total-on-mrr-benchmark | InternVL2-1B | Total Column Score: 237 |
| visual-question-answering-on-vqa-v2-test-dev | InternVL-C | Accuracy: 81.2 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | InternVL-C | Image-to-text R@1: 70.6 R@5: 89.0 R@10: 93.5 Text-to-image R@1: 54.1 R@5: 77.3 R@10: 84.6 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | InternVL-G | Image-to-text R@1: 74.9 R@5: 91.3 R@10: 95.2 Text-to-image R@1: 58.6 R@5: 81.3 R@10: 88.0 |
| zero-shot-cross-modal-retrieval-on-flickr30k | InternVL-G | Image-to-text R@1: 95.7 R@5: 99.7 R@10: 99.9 Text-to-image R@1: 85.0 R@5: 97.0 R@10: 98.6 |
| zero-shot-cross-modal-retrieval-on-flickr30k | InternVL-C | Image-to-text R@1: 94.7 R@5: 99.6 R@10: 99.9 Text-to-image R@1: 81.7 R@5: 96.0 R@10: 98.2 |
| zero-shot-transfer-image-classification-on-1 | InternVL-C | Accuracy (Private): 83.2 |
| zero-shot-transfer-image-classification-on-17 | InternVL-C | Top 1 Accuracy: 95.3 |
| zero-shot-transfer-image-classification-on-3 | InternVL-C | Accuracy (Private): 77.3 |
| zero-shot-transfer-image-classification-on-5 | InternVL-C | Accuracy (Private): 83.8 |
| zero-shot-transfer-image-classification-on-6 | InternVL-C | Accuracy (Private): 80.6 |
| zero-shot-transfer-image-classification-on-8 | InternVL-C | Accuracy (Private): 73.9 |
| zero-shot-transfer-image-classification-on-cn | InternVL-C | Accuracy (Private): 64.5 |
| zero-shot-video-retrieval-on-msr-vtt-full | InternVL-C | text-to-video R@1: 44.7 R@5: 68.2 R@10: 78.4 video-to-text R@1: 40.2 R@5: 63.1 R@10: 74.1 |
| zero-shot-video-retrieval-on-msr-vtt-full | InternVL-G | text-to-video R@1: 46.3 R@5: 70.5 R@10: 79.6 video-to-text R@1: 42.4 R@5: 65.9 R@10: 75.4 |
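The R@K (Recall@K) numbers above measure the fraction of queries whose ground-truth match appears among the top-K retrieved items. A minimal sketch of that metric, assuming a similarity matrix in which row i's correct match is column i (the actual benchmarks use larger candidate pools and multiple captions per image):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries (rows) whose correct match (column i for row i)
    appears among the top-k highest-similarity items."""
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of k best items per row
    hits = [i in topk[i] for i in range(sim.shape[0])]
    return sum(hits) / len(hits)

# Toy 3x3 query-vs-candidate similarity matrix (illustrative values).
sim = np.array([
    [0.9, 0.1, 0.2],   # query 0: correct item ranked 1st
    [0.8, 0.3, 0.5],   # query 1: correct item ranked 3rd
    [0.1, 0.2, 0.7],   # query 2: correct item ranked 1st
])
print(recall_at_k(sim, 1))  # 2 of 3 queries hit at rank 1
print(recall_at_k(sim, 3))  # all queries hit within the top 3
```

Since recall can only grow as K increases, R@1 <= R@5 <= R@10 always holds, which is how the metric order in the table can be sanity-checked.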