InfinityInstruct-3M Launches Ten Million Instruction Fine-tuning Dataset

Date

a year ago

Size

2.79 GB

Organization

InfinityInstruct is a large-scale, high-quality, open-source instruction fine-tuning dataset project launched by the Beijing Academy of Artificial Intelligence (BAAI). The project aims to build a dataset containing millions of instructions that strengthens the instruction-following capabilities of large language models and thereby improves model performance.

This version is the InfinityInstruct-3M instruction dataset, and the final version is expected to be released at the end of June.

Features of InfinityInstruct include:

  1. Large-scale dataset: The project plans to release tens of millions of instructions; the first phase has released 3 million Chinese and English instructions.
  2. High-quality screening: BAAI (the Zhiyuan Research Institute) performs domain analysis and quality screening on existing open-source data to keep only high-value data, and augments the data in domains where coverage is lacking.
  3. Open-source community contributions: During dataset construction, the open-source community provided a large amount of instruction data drawn from multiple sources, such as OpenHermes-2.5, UltraInteract_sft, and CodeBagel.
  4. Risk assessment and data generation: The project team is currently conducting risk assessment and data generation, and expects to release the final version containing 10 million instructions by the end of June.
  5. Performance improvements: According to the project, the 3 million instructions released so far already show SFT (Supervised Fine-Tuning) data capabilities that surpass existing baselines such as Mistral and OpenHermes.
  6. Future outlook: Once the data volume grows to tens of millions of instructions, the project expects dialogue models trained on this instruction fine-tuning dataset to reach the level of GPT-4.

The development and release of the InfinityInstruct dataset is of great significance for promoting the research and application of large language models. It provides rich instruction data for large models, which helps improve the model's ability to understand and execute instructions. At the same time, its open source nature also promotes collaboration and knowledge sharing in the AI community.

InfinityInstruct-3M.torrent
Seeding 1 | Downloading 0 | Completed 216 | Total Downloads 267
  • InfinityInstruct-3M/
    • README.md
      2.44 KB
    • README.txt
      4.88 KB
    • data/
      • Infinity-Instruct.zip
        2.79 GB
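
Before unpacking a 2.79 GB archive, it can help to see what it contains. The following is a minimal sketch that lists the files inside the archive without extracting them. It assumes the torrent was downloaded into the current directory with the layout shown above; the helper name `list_archive` is hypothetical, and the file names and formats inside the archive are not documented on this page, so the script only reports whatever it finds.

```python
import zipfile
from pathlib import Path


def list_archive(zip_path):
    """Return (name, size_in_bytes) for every file in the archive, without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        ]


if __name__ == "__main__":
    # Assumed location, following the torrent layout; adjust to your download directory.
    archive_path = Path("InfinityInstruct-3M/data/Infinity-Instruct.zip")
    if archive_path.exists():
        for name, size in list_archive(archive_path):
            print(f"{size / 1e6:10.2f} MB  {name}")
    else:
        print(f"{archive_path} not found")
```

Reading the README files first is still advisable, since they are the authoritative description of the data format.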
