Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Mikel Artetxe; Holger Schwenk

Abstract

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder and the multilingual test set are available at https://github.com/facebookresearch/LASER
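
The zero-shot transfer idea in the abstract (fit a classifier on English embeddings only, then apply it unchanged to other languages) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `encode` is a hypothetical stand-in for a shared multilingual encoder, the 3-dimensional vectors in `TOY_EMBEDDINGS` are invented rather than produced by LASER, and the nearest-centroid classifier is chosen for brevity, not because it is the classifier used in the paper.

```python
import math

# Invented 3-d vectors standing in for a shared multilingual embedding space.
# These are NOT real LASER embeddings; they only mimic the property that
# sentences with similar meaning land close together regardless of language.
TOY_EMBEDDINGS = {
    ("en", "the match ended in a draw"): [0.9, 0.1, 0.0],
    ("en", "the striker scored twice"): [0.8, 0.2, 0.1],
    ("en", "shares fell on weak earnings"): [0.1, 0.9, 0.1],
    ("en", "the central bank raised rates"): [0.0, 0.8, 0.2],
    ("de", "das Spiel endete unentschieden"): [0.85, 0.15, 0.05],
    ("fr", "la banque centrale a relevé ses taux"): [0.05, 0.85, 0.15],
}


def encode(lang, sentence):
    """Hypothetical encoder: looks up a toy vector instead of running a model."""
    return TOY_EMBEDDINGS[(lang, sentence)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_centroids(examples):
    """Average the vectors of each label; examples is a list of (vector, label)."""
    sums, counts = {}, {}
    for vec, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Train on English annotated data only.
english_train = [
    (encode("en", "the match ended in a draw"), "sports"),
    (encode("en", "the striker scored twice"), "sports"),
    (encode("en", "shares fell on weak earnings"), "finance"),
    (encode("en", "the central bank raised rates"), "finance"),
]
centroids = fit_centroids(english_train)

# Apply the same classifier, without modification, to other languages.
pred_de = predict(centroids, encode("de", "das Spiel endete unentschieden"))
pred_fr = predict(centroids, encode("fr", "la banque centrale a relevé ses taux"))
print(pred_de, pred_fr)
```

The sketch only works because the toy vectors for German and French sit in the same space as the English ones. With a real encoder, `encode` would be replaced by a call to the pre-trained model, and the quality of the transfer would depend on how well that shared space is learned.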

Code Repositories

facebookresearch/LASER (official, pytorch, mentioned in GitHub)
transducens/LASERtrain (pytorch, mentioned in GitHub)
yannvgn/laserembeddings (pytorch, mentioned in GitHub)
Unbabel/COMET (pytorch, mentioned in GitHub)
jiamingkong/infoxlm_paddle (paddle, mentioned in GitHub)
jeongukjae/smaller-labse (tf, mentioned in GitHub)
Tony4469/laser-agir (pytorch, mentioned in GitHub)
facebookresearch/vizseq (mentioned in GitHub)
raymondhs/fairseq-laser (pytorch, mentioned in GitHub)
imamathcat/LASER_Dependencies (pytorch, mentioned in GitHub)
kmkwon94/ainize-laser (pytorch, mentioned in GitHub)
prabhakar267/LASER-improved (pytorch, mentioned in GitHub)
LawrenceDuan/myLASER (pytorch, mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| cross-lingual-bitext-mining-on-bucc-chinese | Massively Multilingual Sentence Embeddings | F1 score: 92.27 |
| cross-lingual-bitext-mining-on-bucc-french-to | Massively Multilingual Sentence Embeddings | F1 score: 93.91 |
| cross-lingual-bitext-mining-on-bucc-german-to | Massively Multilingual Sentence Embeddings | F1 score: 96.19 |
| cross-lingual-bitext-mining-on-bucc-russian | Massively Multilingual Sentence Embeddings | F1 score: 93.3 |
| cross-lingual-document-classification-on | Massively Multilingual Sentence Embeddings | Accuracy: 84.78 |
| cross-lingual-document-classification-on-1 | Massively Multilingual Sentence Embeddings | Accuracy: 77.33 |
| cross-lingual-document-classification-on-10 | Massively Multilingual Sentence Embeddings | Accuracy: 69.43 |
| cross-lingual-document-classification-on-11 | Massively Multilingual Sentence Embeddings | Accuracy: 60.3 |
| cross-lingual-document-classification-on-2 | Massively Multilingual Sentence Embeddings | Accuracy: 77.95 |
| cross-lingual-document-classification-on-8 | Massively Multilingual Sentence Embeddings | Accuracy: 71.93 |
| cross-lingual-document-classification-on-9 | Massively Multilingual Sentence Embeddings | Accuracy: 67.78 |
