Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Abstract
Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.
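The paper's filtering pipeline is not reproduced on this page, but the deduplication step described above might look roughly like the following minimal sketch. The sample fields (`image`, `instruction`, `response`), the length thresholds, and the hash-based fingerprinting are illustrative assumptions, not Infinity-MM's actual procedure.

```python
import hashlib
from typing import Iterable, Iterator

def _fingerprint(image_bytes: bytes, text: str) -> str:
    # Hash normalized text together with the raw image bytes so that
    # exact duplicate (image, instruction) pairs collide.
    norm = " ".join(text.lower().split())
    h = hashlib.sha1()
    h.update(image_bytes)
    h.update(norm.encode("utf-8"))
    return h.hexdigest()

def dedup_and_filter(samples: Iterable[dict],
                     min_len: int = 8,
                     max_len: int = 4096) -> Iterator[dict]:
    """Yield samples passing simple quality heuristics, dropping duplicates.

    Each sample is assumed (for illustration) to be a dict of the form
    {"image": bytes, "instruction": str, "response": str}.
    """
    seen: set[str] = set()
    for s in samples:
        text = s["instruction"] + s["response"]
        if not (min_len <= len(text) <= max_len):
            continue  # crude length-based quality filter
        fp = _fingerprint(s["image"], text)
        if fp in seen:
            continue  # exact duplicate under the fingerprint
        seen.add(fp)
        yield s
```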
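Likewise, the synthetic instruction generation method (detailed image annotations plus diverse question generation with open-source VLMs) can be sketched as below. The question-type list, the prompt template, and the `generate` callable are hypothetical stand-ins; the paper's exact prompts and models are not shown here.

```python
import random

# Illustrative question types used to diversify generated instructions.
QUESTION_TYPES = [
    "object recognition",
    "spatial relations",
    "counting",
    "text in the image (OCR)",
    "commonsense reasoning",
]

def build_prompt(annotation: str, question_type: str) -> str:
    # Prompt an instruction-following model with the image annotation
    # and ask for one question/answer pair of the requested type.
    return (
        "You are given a detailed description of an image:\n"
        f"{annotation}\n\n"
        f"Write one {question_type} question about the image, "
        "then answer it concisely.\n"
        "Format:\nQ: ...\nA: ..."
    )

def synthesize(annotations: list[str], generate) -> list[dict]:
    """Produce (question, answer) instruction pairs from image annotations.

    `generate(prompt) -> str` is any text-generation callable, e.g. a
    locally hosted open-source VLM or LLM.
    """
    pairs = []
    for ann in annotations:
        qtype = random.choice(QUESTION_TYPES)  # vary the question type
        out = generate(build_prompt(ann, qtype))
        if "Q:" in out and "A:" in out:        # keep only well-formed outputs
            q, a = out.split("A:", 1)
            pairs.append({"question": q.replace("Q:", "").strip(),
                          "answer": a.strip(),
                          "type": qtype})
    return pairs
```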
Code Repositories
Benchmarks
| Benchmark | Model | Subset | CLIP Score | FID | OCR Accuracy | OCR CER | OCR F1 |
|---|---|---|---|---|---|---|---|
| image-generation-on-textatlaseval | Infinity-2B | StyledTextSynth | 0.2727 | 84.95 | 0.80 | 0.93 | 1.42 |
| image-generation-on-textatlaseval | Infinity-2B | TextScenesHQ | 0.2346 | 71.59 | 1.06 | 0.88 | 1.74 |
| image-generation-on-textatlaseval | Infinity-2B | TextVisionBlend | 0.1979 | 95.69 | 2.98 | 0.83 | 3.44 |