Text-to-Image-2M: A Text-to-Image Training Dataset
Text-to-Image-2M is a high-quality text-image pair dataset designed for fine-tuning text-to-image models. Existing public datasets often fall short: many target image understanding rather than generation, are informally collected or task-specific, or are simply too small. To address these issues, the team combined and enhanced existing high-quality datasets using advanced text-to-image and captioning models to create Text-to-Image-2M.
The dataset contains about 2 million samples, divided into two core subsets: data_512_2M (2 million 512×512-resolution images with captions) and data_1024_10K (10,000 1024×1024 high-resolution images with captions), providing flexible options for training models with different resolution requirements.
Data composition:
- data_512_2M:
  - LLaVA-next fine-tuning dataset (about 700,000 samples; captions regenerated with Qwen2-VL for higher accuracy)
  - LLaVA pre-training dataset (about 500,000 samples; images generated by the Flux-dev model, original captions retained)
  - ProGamerGov synthetic dataset (about 900,000 samples; center-cropped and filtered for validity)
  - GPT-4o generated dataset (about 100,000 samples; captions written by GPT-4o, images generated by Flux-dev)
- data_1024_10K:
  - 10,000 high-resolution images, with captions generated by GPT-4o and images rendered by the Flux-dev model, focusing on detail-rich, complex scenes
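The center-crop step mentioned for the ProGamerGov subset can be sketched as below. This is a minimal illustration, not the team's actual pipeline: the helper name `center_crop_box` is hypothetical, and it assumes the common convention of taking the largest centered square (which can then be resized to 512×512, e.g. with `PIL.Image.crop` and `resize`).

```python
def center_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return the (left, upper, right, lower) box of the largest
    centered square crop, in PIL.Image.crop coordinate order."""
    side = min(width, height)          # square side = shorter edge
    left = (width - side) // 2         # horizontal margin split evenly
    upper = (height - side) // 2       # vertical margin split evenly
    return (left, upper, left + side, upper + side)

# Example: a 640x480 landscape image keeps the full height and
# trims 80 px from each horizontal side.
print(center_crop_box(640, 480))  # (80, 0, 560, 480)
```

The resulting box would typically be passed to `image.crop(box).resize((512, 512))` before validity filtering.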