OpenMathInstruct-2 Math Instruction Tuning Dataset
OpenMathInstruct-2 is a large-scale open-source math instruction tuning dataset released by NVIDIA in 2024 to accelerate progress in AI for mathematics. It accompanies the paper "OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data". The dataset contains 14 million question-answer pairs (about 600,000 unique questions), making it nearly 8 times larger than the previous largest dataset of its kind. A Llama-3.1-8B-Base model fine-tuned on OpenMathInstruct-2 outperforms Llama-3.1-8B-Instruct on the MATH benchmark by 15.9 percentage points (67.8% versus 51.9%).

The OpenMathInstruct-2 dataset contains the following fields:
- problem: The problem text, either an original problem from the GSM8K or MATH training sets or a problem augmented from those training sets.
- generated_solution: The synthetically generated solution.
- expected_answer: For problems taken from the original training sets, the ground-truth answer provided in the dataset. For augmented problems, the answer obtained by majority vote.
- problem_source: Indicates whether the problem comes directly from GSM8K or MATH, or is an augmented version derived from one of these datasets.
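
The four fields above can be sketched as a typed record. This is a minimal illustration, not the dataset's official loader. The sample problem, the candidate answers, and the `problem_source` value `"augmented_math"` are made up for the example. The `majority_vote` helper is hypothetical and assumes the vote is a plain most-frequent-answer count over several sampled solutions.

```python
from collections import Counter
from typing import TypedDict


class Record(TypedDict):
    """One row of the dataset, using the four field names listed above."""
    problem: str
    generated_solution: str
    expected_answer: str
    problem_source: str


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first.

    Hypothetical helper: assumes a simple frequency count is how the
    expected answer of an augmented problem is chosen.
    """
    if not answers:
        raise ValueError("need at least one candidate answer")
    return Counter(answers).most_common(1)[0][0]


# Illustrative final answers from several sampled solutions to one
# augmented problem (values invented for this sketch).
candidate_answers = ["12", "15", "12", "12"]

record: Record = {
    "problem": "A box holds 3 rows of 4 pencils. How many pencils are in the box?",
    "generated_solution": "There are 3 * 4 = 12 pencils, so the answer is 12.",
    "expected_answer": majority_vote(candidate_answers),
    "problem_source": "augmented_math",  # assumed label, for illustration only
}

# 14 million pairs over about 600,000 unique questions gives the rough
# average number of solutions per question.
solutions_per_question = 14_000_000 / 600_000
```

Here `record["expected_answer"]` holds the voted answer, and `solutions_per_question` comes out at roughly 23, which follows directly from the two totals quoted above.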

Example of dataset structure