CoSyn-400K Multimodal Synthetic Question Answering Dataset

Date

3 months ago

Size

59.4 GB

Organization

Allen Institute for Artificial Intelligence
University of Pennsylvania

Paper URL

arxiv.org

CoSyn-400K is a multimodal synthetic question answering dataset jointly released by the University of Pennsylvania and the Allen Institute for Artificial Intelligence in 2025. It accompanies the paper "Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation" and aims to provide high-quality, scalable synthetic data for multimodal model training.

The dataset contains more than 400,000 image-text question-answering pairs spanning 10 fields, such as chemistry, mathematics, nutrition, and music, and 9 types of text-rich images (charts, documents, math problems, tables, diagrams, vector graphics, music scores, circuit diagrams, and chemical structures). Together with these images it provides 2.7 million rows of instruction-tuning data (including information such as image type, theme, and code generation), supporting visual question answering tasks.

CoSyn-400K.torrent
  • CoSyn-400K/
    • README.md (1.56 KB)
    • README.txt (3.11 KB)
    • data/
      • CoSyn-400K.zip (59.4 GB)
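
Because the archive is large, it can help to inspect it before extracting anything. The following is a minimal sketch using only the Python standard library; the helper name `summarize_archive` is hypothetical, and nothing is assumed about the internal layout of `CoSyn-400K.zip` beyond it being a regular zip file. It counts files and uncompressed bytes per top-level entry without unpacking the archive.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Return {top_level_name: (file_count, uncompressed_bytes)} for a zip file.

    Hypothetical helper for illustration; it only reads the zip's central
    directory, so no data is extracted to disk.
    """
    counts = Counter()
    sizes = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, count files only
            # Zip member names always use forward slashes.
            top = PurePosixPath(info.filename).parts[0]
            counts[top] += 1
            sizes[top] += info.file_size
    return {name: (counts[name], sizes[name]) for name in counts}
```

Assuming the torrent was downloaded into the current directory, a call such as `summarize_archive("CoSyn-400K/data/CoSyn-400K.zip")` would report how the files are grouped at the top level of the archive.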
