Text and Code Embeddings by Contrastive Pre-Training

Abstract

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over previous best unsupervised and supervised text embedding models respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.
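The abstract does not spell out the training objective, so the following is only a minimal sketch of one common form of contrastive learning on paired data: a symmetric cross-entropy loss with in-batch negatives. The function names, the temperature value, and the use of cosine similarity are assumptions made for illustration, not details taken from the abstract. Each pair (x_i, y_i) is treated as a positive, and every other y_j in the batch serves as a negative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def in_batch_contrastive_loss(x_emb, y_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.

    x_emb[i] and y_emb[i] form a positive pair; all other combinations
    in the batch act as negatives. The temperature is an assumed value.
    """
    x = [l2_normalize(v) for v in x_emb]
    y = [l2_normalize(v) for v in y_emb]
    # logits[i][j] = cosine similarity of x_i and y_j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(xi, yj)) / temperature for yj in y]
        for xi in x
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the x-to-y and y-to-x directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy usage: correctly paired embeddings give a much lower loss than swapped ones.
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each x_i matches its own y_i and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.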

Code Repositories

openmatch/coco-dr
pytorch
Mentioned in GitHub

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| code-search-on-codesearchnet | cpt-code S | Go: 97.7, JS: 86.0, Java: 94.0, PHP: 96.7, Python: 99.8, Ruby: 86.3, Overall: 93.4 |
| code-search-on-codesearchnet | cpt-code M | Go: 97.5, JS: 86.5, Java: 94.4, PHP: 97.2, Python: 99.9, Ruby: 85.5, Overall: 93.5 |
| passage-ranking-on-ms-marco | cpt-text XL | MRR@10: 22.7 |
| passage-ranking-on-ms-marco | Fine-tuned SOTA | MRR@10: 44.3 |
| passage-ranking-on-ms-marco | cpt-text L | MRR@10: 21.5 |
| passage-ranking-on-ms-marco | BM25 | MRR@10: 18.4 |
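For readers unfamiliar with the MRR@10 metric reported above, here is a small sketch of how mean reciprocal rank at a cutoff is typically computed. The function name and the data layout (one ranked list and one set of relevant IDs per query) are hypothetical choices for illustration. The figures in the table appear to be expressed on a 0 to 100 scale, whereas this sketch returns a value between 0 and 1.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant result within the top k.

    A query contributes 1/rank for its first relevant hit in the top k,
    or 0 if no relevant document appears there.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)

# Toy usage with three queries: first relevant hit at rank 2, at rank 1, and no hit.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant = [{"b"}, {"x"}, {"missing"}]
score = mrr_at_k(rankings, relevant)  # (1/2 + 1 + 0) / 3 = 0.5
```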
