MMLU-Pro Large-Scale Multi-Task Understanding Dataset

MMLU-Pro is a more challenging large-scale multi-task understanding dataset designed to benchmark the capabilities of large language models more rigorously. It contains 12K complex questions spanning multiple disciplines. The dataset was released in 2024 by researchers from the University of Waterloo, the University of Toronto, and Carnegie Mellon University, and is described in the paper "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark".

  • Questions and Options: Each question typically has 10 multiple-choice options; during manual review, some questions had options removed to eliminate unreasonable choices. The original MMLU questions had only 4 options each; the expanded option set is intended to increase complexity and robustness, requiring deeper reasoning to identify the correct answer from a larger pool of distractors.
  • Source: This dataset integrates questions from multiple sources:
    • Original MMLU: a subset of the original MMLU dataset, with trivial and ambiguous questions removed.
    • STEM websites: carefully selected, high-quality STEM questions from the internet.
    • TheoremQA: high-quality, manually annotated questions that require applying theorems to solve.
    • SciBench: science questions from university-level exams.
  • The newly added questions from STEM websites, TheoremQA, and SciBench cover biology, business, chemistry, computer science, economics, engineering, mathematics, physics, and psychology. Compared with the original MMLU, there are three main differences:
  • The original MMLU dataset offers only 4 options per question; MMLU-Pro increases this to 10. The additional options make the evaluation more realistic and challenging, and random guessing yields a much lower score.
  • The original MMLU dataset consists mainly of knowledge-driven questions that require little reasoning, so perplexity-based (PPL) evaluation usually outperforms chain-of-thought (CoT) prompting. By raising question difficulty and integrating more reasoning-focused questions, MMLU-Pro reverses this: CoT can score up to 20% higher than PPL.
  • By increasing the number of distractors, MMLU-Pro significantly reduces the probability of answering correctly by chance, improving benchmark robustness. Specifically, across 24 different prompt styles tested, the sensitivity of model scores to prompt changes drops from 4–5% on MMLU to 2% on MMLU-Pro.
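The effect of moving from 4 to 10 options on chance-level accuracy can be illustrated with a short simulation (a minimal sketch for intuition, not code from the paper's evaluation harness):

```python
import random

def random_guess_accuracy(num_options: int, num_questions: int = 100_000,
                          seed: int = 0) -> float:
    """Estimate the expected score of a model that guesses uniformly at random."""
    rng = random.Random(seed)
    # By symmetry, fix the correct answer at index 0 for every question.
    correct = sum(rng.randrange(num_options) == 0 for _ in range(num_questions))
    return correct / num_questions

# Chance accuracy: ~25% with 4 options (MMLU) vs. ~10% with 10 options (MMLU-Pro).
print(f"4 options:  ~{random_guess_accuracy(4):.1%}")
print(f"10 options: ~{random_guess_accuracy(10):.1%}")
```

The simulated values converge to 1/num_options, which is why the larger option pool makes lucky-guess noise a much smaller component of a model's reported score.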
MMLU-Pro.torrent
Seeding 2 · Downloading 0 · Completed 290 · Total Downloads 611
  • MMLU-Pro/
    • README.md
      2.88 KB
    • README.txt
      5.75 KB
    • data/
      • MMLU-Pro.zip
        3.48 MB
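A single MMLU-Pro record can be represented as sketched below. The field names (`question`, `options`, `answer_index`, `category`) are assumptions modeled on common multiple-choice dataset layouts, not a guaranteed schema of the files listed above:

```python
from dataclasses import dataclass

@dataclass
class MMLUProQuestion:
    # Hypothetical record layout; the actual files may use different field names.
    question: str
    options: list       # up to 10 answer choices
    answer_index: int   # index of the correct option
    category: str       # e.g. "physics", "economics"

def grade(record: MMLUProQuestion, predicted_index: int) -> bool:
    """Return True if the predicted option index matches the gold answer."""
    return predicted_index == record.answer_index

# An invented example question with the full 10-option format.
sample = MMLUProQuestion(
    question="Which law relates force, mass, and acceleration?",
    options=["F = ma", "E = mc^2", "V = IR", "PV = nRT", "F = kx",
             "p = mv", "W = Fd", "P = F/A", "a = v/t", "F = Gm1m2/r^2"],
    answer_index=0,
    category="physics",
)
print(grade(sample, 0))  # a correct prediction
```

Grading by option index rather than option text keeps scoring unambiguous when distractors are similar, which matters more as the option pool grows to 10.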
