MMLU-Pro Large-Scale Multi-Task Understanding Dataset

MMLU-Pro is a more challenging large-scale multi-task understanding dataset designed to benchmark the capabilities of large language models more rigorously. It contains 12K complex questions spanning multiple disciplines. The dataset was released in 2024 by researchers from the University of Waterloo, the University of Toronto, and Carnegie Mellon University, and is described in the paper "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark".

  • Questions and Options: Each question typically has 10 multiple-choice options; during manual review, some questions had options removed to eliminate unreasonable choices. The original MMLU questions had only 4 options each; the expanded option set is intended to increase complexity and robustness, requiring deeper reasoning to identify the correct answer from a larger pool of distractors.
  • Source: This dataset integrates questions from multiple sources:
    • Original MMLU: a subset of the original MMLU dataset, with trivial and ambiguous questions removed.
    • STEM websites: carefully selected, high-quality STEM questions from the internet.
    • TheoremQA: high-quality, manually annotated questions that require applying theorems to solve.
    • SciBench: science questions from university-level exams.
  • The newly added questions from STEM websites, TheoremQA, and SciBench cover biology, business, chemistry, computer science, economics, engineering, mathematics, physics, and psychology. Compared with the original MMLU, there are three main differences:
  • The original MMLU dataset offers only 4 options per question; MMLU-Pro increases this to 10. The additional options make the evaluation more realistic and challenging, and random guessing yields a much lower score.
  • The original MMLU dataset consists mainly of knowledge-driven questions that require little reasoning, so perplexity-based (PPL) evaluation usually outperforms chain-of-thought (CoT) prompting. By raising question difficulty and integrating more reasoning-focused questions, MMLU-Pro reverses this: CoT can score up to 20% higher than PPL.
  • By increasing the number of distractors, MMLU-Pro significantly reduces the probability of answering correctly by chance, improving benchmark robustness. Specifically, across 24 different prompt styles tested, the sensitivity of model scores to prompt changes drops from 4–5% on MMLU to 2% on MMLU-Pro.
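The effect of moving from 4 to 10 options on chance-level accuracy can be illustrated with a short simulation (a minimal sketch for intuition, not code from the paper's evaluation harness):

```python
import random

def random_guess_accuracy(num_options: int, num_questions: int = 100_000,
                          seed: int = 0) -> float:
    """Estimate the expected score of a model that guesses uniformly at random."""
    rng = random.Random(seed)
    # By symmetry, fix the correct answer at index 0 for every question.
    correct = sum(rng.randrange(num_options) == 0 for _ in range(num_questions))
    return correct / num_questions

# Chance accuracy: ~25% with 4 options (MMLU) vs. ~10% with 10 options (MMLU-Pro).
print(f"4 options:  ~{random_guess_accuracy(4):.1%}")
print(f"10 options: ~{random_guess_accuracy(10):.1%}")
```

The simulated values converge to 1/num_options, which is why the larger option pool makes lucky-guess noise a much smaller component of a model's reported score.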
MMLU-Pro.torrent
Seeding 2 · Downloading 0 · Completed 290 · Total Downloads 611
  • MMLU-Pro/
    • README.md
      2.88 KB
    • README.txt
      5.75 KB
    • data/
      • MMLU-Pro.zip
        3.48 MB
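A single MMLU-Pro record can be represented as sketched below. The field names (`question`, `options`, `answer_index`, `category`) are assumptions modeled on common multiple-choice dataset layouts, not a guaranteed schema of the files listed above:

```python
from dataclasses import dataclass

@dataclass
class MMLUProQuestion:
    # Hypothetical record layout; the actual files may use different field names.
    question: str
    options: list       # up to 10 answer choices
    answer_index: int   # index of the correct option
    category: str       # e.g. "physics", "economics"

def grade(record: MMLUProQuestion, predicted_index: int) -> bool:
    """Return True if the predicted option index matches the gold answer."""
    return predicted_index == record.answer_index

# An invented example question with the full 10-option format.
sample = MMLUProQuestion(
    question="Which law relates force, mass, and acceleration?",
    options=["F = ma", "E = mc^2", "V = IR", "PV = nRT", "F = kx",
             "p = mv", "W = Fd", "P = F/A", "a = v/t", "F = Gm1m2/r^2"],
    answer_index=0,
    category="physics",
)
print(grade(sample, 0))  # a correct prediction
```

Grading by option index rather than option text keeps scoring unambiguous when distractors are similar, which matters more as the option pool grows to 10.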
