MLDR Multilingual Document Retrieval Dataset

Date: 6 months ago

Size: 9.3 GB

MLDR (Multilingual Long-Document Retrieval) is a multilingual long-document retrieval dataset built from Wikipedia, Wudao, and the mC4 multilingual corpus. It is intended to support research and development on cross-language long-text retrieval. It covers 13 typologically diverse languages: Arabic (ar), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Portuguese (pt), Russian (ru), Thai (th), and Chinese (zh).
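For scripts that process the data per language, the language codes listed above can be kept in a small lookup table. The following is a minimal sketch; the names `MLDR_LANGUAGES` and `is_supported` are illustrative and are not part of the dataset or of any official tooling.

```python
# Language codes and names as listed in the dataset description.
# The variable and function names here are hypothetical helpers.
MLDR_LANGUAGES = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "hi": "Hindi",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pt": "Portuguese",
    "ru": "Russian",
    "th": "Thai",
    "zh": "Chinese",
}


def is_supported(code: str) -> bool:
    """Return True if the language code is one of the 13 listed above."""
    return code.strip().lower() in MLDR_LANGUAGES
```

A check such as `is_supported("zh")` can guard a per-language loop before any files are read.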

Features and advantages:

  • Wide multilingual coverage: It includes 13 languages spanning multiple language families (such as Indo-European, Sino-Tibetan, and Afro-Asiatic).
  • Long documents: The average document length is 4,737 words, which matches the long-text processing needs of real-world scenarios.
  • Standardized construction: Queries are generated with GPT-3.5 so that each query is strongly relevant to the content of its document.
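The average-length figure above can be checked on any sample of documents with a few lines of code. This is a rough sketch, not the method used to produce the reported number: it assumes whitespace-delimited words, which undercounts badly for languages written without spaces (Chinese, Japanese, Thai), and the function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def word_count(text: str) -> int:
    """Count words by splitting on whitespace (an assumption; see note above)."""
    return len(text.split())


def mean_document_length(documents: Sequence[str]) -> float:
    """Return the mean whitespace-delimited word count over the documents."""
    if not documents:
        raise ValueError("documents must not be empty")
    return mean(word_count(doc) for doc in documents)
```

For the unsegmented languages, a language-specific tokenizer would have to replace `word_count` before the result is comparable to the reported average.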
MLDR.torrent
  • MLDR/
    • README.md (1.62 KB)
    • README.txt (3.24 KB)
    • data/
      • MLDR.zip (9.3 GB)
