Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

Mikel Artetxe; Holger Schwenk

Abstract

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC mining task and the UN reconstruction task by more than 10 F1 and 30 precision points, respectively. Filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.
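
The abstract does not spell out the scoring function, so the following is only a minimal sketch of the idea it describes: instead of thresholding raw cosine similarity, each candidate pair is scored relative to the average similarity of its closest candidates. The sketch assumes a ratio form of the margin, averaged over the k nearest neighbors in both directions; the function names (`cosine`, `mean_topk_sim`, `margin_scores`, `mine`), the value of `k`, and the threshold are illustrative assumptions rather than the authors' implementation. It also assumes the neighborhood averages are positive, which holds for the toy vectors used here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_topk_sim(vec, candidates, k):
    """Average cosine similarity between vec and its k most similar candidates."""
    sims = sorted((cosine(vec, c) for c in candidates), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def margin_scores(src, tgt, k=2):
    """Ratio-style margin (assumed form): cos(x, y) divided by the mean of the
    two neighborhood averages, one taken around x and one around y."""
    src_avg = [mean_topk_sim(x, tgt, k) for x in src]
    tgt_avg = [mean_topk_sim(y, src, k) for y in tgt]
    return [
        [cosine(x, y) / ((src_avg[i] + tgt_avg[j]) / 2) for j, y in enumerate(tgt)]
        for i, x in enumerate(src)
    ]


def mine(src, tgt, k=2, threshold=1.0):
    """For each source vector, keep its best-scoring target if the margin
    score reaches the threshold (threshold value is an arbitrary example)."""
    scores = margin_scores(src, tgt, k)
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy 2-d "embeddings"; real sentence embeddings would come from an encoder.
toy_src = [[1.0, 0.0], [0.0, 1.0]]
toy_tgt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
mined = mine(toy_src, toy_tgt, k=2, threshold=1.0)
```

A score above 1 means the pair is more similar than its neighborhood average, so the cutoff is relative to local density rather than a fixed cosine value. This exhaustive version compares every source vector with every target vector; mining at corpus scale would need an approximate nearest-neighbor index instead.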

Code Repositories

GitHub: facebookresearch/LASER (official, PyTorch)
GitHub: transducens/LASERtrain (PyTorch)
GitHub: thompsonb/prism_bitext_filter (PyTorch)
GitHub: Tony4469/laser-agir (PyTorch)
GitHub: raymondhs/fairseq-laser (PyTorch)
GitHub: imamathcat/LASER_Dependencies (PyTorch)
GitHub: kmkwon94/ainize-laser (PyTorch)
GitHub: prabhakar267/LASER-improved (PyTorch)
GitHub: LawrenceDuan/myLASER (PyTorch)

Benchmarks

Benchmark                                        Methodology                        Metrics
cross-lingual-bitext-mining-on-bucc-french-to    Multilingual Sentence Embeddings   F1 score: 92.89
cross-lingual-bitext-mining-on-bucc-german-to    Multilingual Sentence Embeddings   F1 score: 95.58
