5 months ago

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Abstract

Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.

Benchmarks

Benchmark: image-generation-on-textatlaseval
Methodology: Infinity-2B
Metrics:
StyledTextSynth Clip Score: 0.2727
StyledTextSynth FID: 84.95
StyledTextSynth OCR (Accuracy): 0.80
StyledTextSynth OCR (Cer): 0.93
StyledTextSynth OCR (F1 Score): 1.42
TextScenesHQ Clip Score: 0.2346
TextScenesHQ FID: 71.59
TextScenesHQ OCR (Accuracy): 1.06
TextScenesHQ OCR (Cer): 0.88
TextScenesHQ OCR (F1 Score): 1.74
TextVisionBlend Clip Score: 0.1979
TextVisionBlend FID: 95.69
TextVisionBlend OCR (Accuracy): 2.98
TextVisionBlend OCR (Cer): 0.83
TextVisionBlend OCR (F1 Score): 3.44
