InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Abstract
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
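The abstract credits the 24K-to-96K context extension to RoPE extrapolation but does not spell out the recipe in this excerpt. The sketch below illustrates one common approach (NTK-aware rescaling of the rotary base); the head dimension, scaling formula, and all constants are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Per-pair rotary frequencies: theta_i = base^(-2i / head_dim)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def rope_angles(positions: np.ndarray, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angles (num_positions x head_dim/2) applied to query/key pairs."""
    return np.outer(positions, rope_frequencies(head_dim, base))

# Training context vs. target context: 24K -> 96K, the 4x extension stated in the abstract.
train_len, target_len, head_dim = 24_000, 96_000, 128
scale = target_len / train_len  # = 4

# NTK-aware base rescaling (an assumption here, not necessarily IXC-2.5's exact method):
# enlarging the rotary base keeps low-frequency angles at 96K in the range seen at 24K.
ntk_base = 10000.0 * scale ** (head_dim / (head_dim - 2))

angles_train = rope_angles(np.array([train_len - 1]), head_dim)               # original base, last trained position
angles_extrap = rope_angles(np.array([target_len - 1]), head_dim, ntk_base)   # rescaled base, extrapolated position

# The lowest-frequency angle at position 96K with the rescaled base stays close
# to the largest one the model saw at position 24K during training.
print(angles_train[0, -1], angles_extrap[0, -1])
```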
Code Repositories
https://github.com/InternLM/InternLM-XComposer
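For quick experimentation, released checkpoints of this kind can typically be loaded through Hugging Face Transformers. The model ID and dtype below are assumptions inferred from the project name rather than details confirmed on this page; consult the linked repository for the authoritative usage instructions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; verify against the repository's README.
model_id = "internlm/internlm-xcomposer2d5-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed dtype; adjust to your hardware
    trust_remote_code=True,       # the checkpoint ships custom modeling code
).eval()
```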
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Temporal Relation Extraction on Vinoground | InternLM-XC-2.5 | Group Score: 9.6; Text Score: 28.8; Video Score: 27.8 |
| Temporal Relation Extraction on Vinoground | InternLM-XC-2.5 (CoT) | Group Score: 9; Text Score: 30.8; Video Score: 28.4 |
| Video Question Answering on TVBench | IXC-2.5 7B | Average Accuracy: 51.6 |
| Visual Question Answering on MM-Vet | IXC-2.5-7B | GPT-4 score: 51.7 |