CoSyn-400K Multimodal Synthetic Question Answering Dataset
CoSyn-400K is a multimodal synthetic question answering dataset jointly released by the University of Pennsylvania and the Allen Institute for Artificial Intelligence in 2025. It accompanies the paper "Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation" and aims to provide high-quality, scalable synthetic data resources for multimodal model training.
The dataset contains more than 400,000 image-text question-answering pairs spanning 10 fields, including chemistry, mathematics, nutrition, and music. It covers 9 types of text-rich images (charts, documents, math problems, tables, diagrams, vector graphics, music scores, circuit diagrams, and chemical structures) and provides 2.7 million rows of instruction-tuning data, with accompanying information such as image type, theme, and the code used for generation. The data is intended to support visual question answering tasks.