Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Abstract
Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.
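The paper's filtering pipeline is not reproduced on this page, but the deduplication step described above might look roughly like the following minimal sketch. The sample fields (`image`, `instruction`, `response`), the length thresholds, and the hash-based fingerprinting are illustrative assumptions, not Infinity-MM's actual procedure.

```python
import hashlib
from typing import Iterable, Iterator

def _fingerprint(image_bytes: bytes, text: str) -> str:
    # Hash normalized text together with the raw image bytes so that
    # exact duplicate (image, instruction) pairs collide.
    norm = " ".join(text.lower().split())
    h = hashlib.sha1()
    h.update(image_bytes)
    h.update(norm.encode("utf-8"))
    return h.hexdigest()

def dedup_and_filter(samples: Iterable[dict],
                     min_len: int = 8,
                     max_len: int = 4096) -> Iterator[dict]:
    """Yield samples passing simple quality heuristics, dropping duplicates.

    Each sample is assumed (for illustration) to be a dict of the form
    {"image": bytes, "instruction": str, "response": str}.
    """
    seen: set[str] = set()
    for s in samples:
        text = s["instruction"] + s["response"]
        if not (min_len <= len(text) <= max_len):
            continue  # crude length-based quality filter
        fp = _fingerprint(s["image"], text)
        if fp in seen:
            continue  # exact duplicate under the fingerprint
        seen.add(fp)
        yield s
```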
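Likewise, the synthetic instruction generation method (detailed image annotations plus diverse question generation with open-source VLMs) can be sketched as below. The question-type list, the prompt template, and the `generate` callable are hypothetical stand-ins; the paper's exact prompts and models are not shown here.

```python
import random

# Illustrative question types used to diversify generated instructions.
QUESTION_TYPES = [
    "object recognition",
    "spatial relations",
    "counting",
    "text in the image (OCR)",
    "commonsense reasoning",
]

def build_prompt(annotation: str, question_type: str) -> str:
    # Prompt an instruction-following model with the image annotation
    # and ask for one question/answer pair of the requested type.
    return (
        "You are given a detailed description of an image:\n"
        f"{annotation}\n\n"
        f"Write one {question_type} question about the image, "
        "then answer it concisely.\n"
        "Format:\nQ: ...\nA: ..."
    )

def synthesize(annotations: list[str], generate) -> list[dict]:
    """Produce (question, answer) instruction pairs from image annotations.

    `generate(prompt) -> str` is any text-generation callable, e.g. a
    locally hosted open-source VLM or LLM.
    """
    pairs = []
    for ann in annotations:
        qtype = random.choice(QUESTION_TYPES)  # vary the question type
        out = generate(build_prompt(ann, qtype))
        if "Q:" in out and "A:" in out:        # keep only well-formed outputs
            q, a = out.split("A:", 1)
            pairs.append({"question": q.replace("Q:", "").strip(),
                          "answer": a.strip(),
                          "type": qtype})
    return pairs
```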
Code Repositories
Benchmarks
| Benchmark | Model | Subset | CLIP Score | FID | OCR Accuracy | OCR CER | OCR F1 |
|---|---|---|---|---|---|---|---|
| image-generation-on-textatlaseval | Infinity-2B | StyledTextSynth | 0.2727 | 84.95 | 0.80 | 0.93 | 1.42 |
| image-generation-on-textatlaseval | Infinity-2B | TextScenesHQ | 0.2346 | 71.59 | 1.06 | 0.88 | 1.74 |
| image-generation-on-textatlaseval | Infinity-2B | TextVisionBlend | 0.1979 | 95.69 | 2.98 | 0.83 | 3.44 |