InfiMM-WebMath-40B Multimodal Mathematical Reasoning Dataset

Date: a year ago

Size: 73.61 GB

Organization: Chinese Academy of Sciences

The InfiMM-WebMath-40B dataset was released in 2024 by a research team from ByteDance and the Chinese Academy of Sciences. The accompanying paper is titled "InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning".

InfiMM-WebMath-40B is a large open-source multimodal dataset built specifically for mathematical reasoning. It contains 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all extracted and filtered from CommonCrawl data (2019-2023). The dataset gives the open-source community a resource for improving the mathematical reasoning of multimodal large language models (MLLMs).

The dataset was constructed in several steps: text extraction, language filtering, high-quality content filtering, deduplication, and extraction of image URLs. Together, these steps are meant to ensure the quality and relevance of the data. For model training, the dataset is used for continued pre-training, so that the model acquires mathematical knowledge in a multimodal setting. Instruction fine-tuning is then applied to further improve model performance.
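The construction steps listed above can be illustrated with a toy pipeline. This is only a minimal sketch under stated assumptions, not the authors' implementation: the ASCII-ratio language check, the keyword-based math score, the exact-match SHA-256 deduplication, and every function name and threshold below are hypothetical stand-ins chosen to keep the example self-contained.

```python
import hashlib
import re
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects visible text and <img> URLs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_urls = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Steps 1 and 5: text extraction and image URL extraction."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.image_urls


def passes_language_filter(text, min_ascii_ratio=0.9):
    """Step 2 (hypothetical heuristic): keep mostly-ASCII text."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= min_ascii_ratio


# Hypothetical signal for mathematical content: a few keywords and symbols.
MATH_PATTERN = re.compile(
    r"\b(?:theorem|equation|integral|derivative|proof)\b|[=+^]|\\frac",
    re.IGNORECASE,
)


def math_score(text):
    """Step 3 (hypothetical heuristic): math matches per whitespace token."""
    words = text.split()
    if not words:
        return 0.0
    return len(MATH_PATTERN.findall(text)) / len(words)


def build_corpus(pages, min_math_score=0.05):
    """Run all steps and return the surviving documents."""
    seen_digests = set()
    corpus = []
    for html in pages:
        text, image_urls = extract(html)
        if not passes_language_filter(text):
            continue
        if math_score(text) < min_math_score:
            continue
        # Step 4: exact deduplication on the extracted text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        corpus.append({"text": text, "image_urls": image_urls})
    return corpus
```

Calling `build_corpus` on a list of raw HTML strings returns one dictionary per retained page, with the extracted text and the image URLs found on that page. A production pipeline at CommonCrawl scale would more plausibly rely on trained language and quality classifiers and on near-duplicate detection instead of these simple heuristics.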

InfiMM-WebMath-40B.torrent
  • InfiMM-WebMath-40B/
    • README.md (1.83 KB)
    • README.txt (3.67 KB)
    • data/
      • InfiMM-WebMath-40B.zip (73.61 GB)
