Learning Word Vectors for 157 Languages

Edouard Grave; Piotr Bojanowski; Prakhar Gupta; Armand Joulin; Tomas Mikolov

Abstract

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient in the successful application of these representations is to train them on very large corpora and to use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
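The abstract mentions word analogy datasets but does not say how an analogy question is scored. A minimal sketch follows, assuming the common vector-offset approach (answer "a is to b as c is to ?" with the word closest in cosine similarity to b - a + c). The toy 3-dimensional vectors, the question list, and the function names `solve_analogy` and `analogy_accuracy` are all hypothetical illustrations, not the paper's data or code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Returns the vocabulary word (other than a, b, c) whose vector has the
    highest cosine similarity to b - a + c.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in questions)
    return correct / len(questions)

# Hypothetical toy embeddings: axis 0 and 1 encode the country,
# axis 2 encodes "is a capital city". Real vectors are learned from text.
TOY_VECTORS = {
    "paris":  [1.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0],
    "warsaw": [0.0, 1.0, 1.0],
    "poland": [0.0, 1.0, 0.0],
    "delhi":  [1.0, 1.0, 1.0],
    "india":  [1.0, 1.0, 0.0],
}

# Hypothetical analogy questions in (a, b, c, expected) form.
QUESTIONS = [
    ("paris", "france", "warsaw", "poland"),
    ("paris", "france", "delhi", "india"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_VECTORS, "paris", "france", "warsaw"))
    print(analogy_accuracy(TOY_VECTORS, QUESTIONS))
```

Excluding the three question words from the candidate set is a common convention in this kind of evaluation; without it, the nearest neighbour of the offset vector is often one of the input words.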

Code Repositories

dzieciou/lemmatizer-pl (tf, mentioned in GitHub)
KMicha/MachineLearning (mentioned in GitHub)

Benchmarks

Benchmark: task-1-grouping-on-ocw

| Metric | FastText (News) | FastText (Crawl) |
|---|---|---|
| Wasserstein Distance (WD) | 85.5 ± 0.5 | 84.2 ± 0.5 |
| # Correct Groups | 62 ± 3 | 80 ± 4 |
| # Solved Walls | 0 ± 0 | 0 ± 0 |
| Adjusted Mutual Information (AMI) | 15.8 ± 0.3 | 18.4 ± 0.4 |
| Adjusted Rand Index (ARI) | 13.0 ± 0.2 | 15.2 ± 0.3 |
| Fowlkes Mallows Score (FMS) | 30.4 ± 0.2 | 32.1 ± 0.3 |
