Generating Datasets with Pretrained Language Models

Timo Schick Hinrich Schütze

Abstract

To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.
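
The generation step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the prompts or the generation procedure, so everything below is an assumption made for illustration: the instruction templates, the three similarity labels, the helper names `build_prompt` and `generate_pairs`, and the `toy_generator` stand-in that takes the place of a large PLM.

```python
from typing import Callable, List, Tuple

# Hypothetical instruction templates, one per similarity label.
# These are illustrative assumptions, not taken from the abstract.
INSTRUCTIONS = {
    1.0: "Task: Write two sentences that mean the same thing.",
    0.5: "Task: Write two sentences that are somewhat similar.",
    0.0: "Task: Write two sentences that are on completely different topics.",
}


def build_prompt(instruction: str, sentence: str) -> str:
    """Build a prompt that the generator continues with a second sentence."""
    return f'{instruction}\nSentence 1: "{sentence}"\nSentence 2: "'


def generate_pairs(
    sentences: List[str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str, float]]:
    """Create (sentence 1, sentence 2, label) triples without human labels."""
    pairs = []
    for sentence in sentences:
        for label, instruction in INSTRUCTIONS.items():
            continuation = generate(build_prompt(instruction, sentence))
            # Keep only the text up to the closing quotation mark.
            second = continuation.split('"')[0].strip()
            if second:
                pairs.append((sentence, second, label))
    return pairs


def toy_generator(prompt: str) -> str:
    """Stand-in for a large PLM; it always returns the same continuation."""
    return 'A placeholder sentence." extra text'


pairs = generate_pairs(["A man plays guitar."], toy_generator)
for first, second, label in pairs:
    print(label, "|", first, "|", second)
```

In a real setup, `toy_generator` would be replaced by a call to a large generative PLM, and the resulting triples would serve as training data for finetuning a smaller sentence-embedding model.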

Code Repositories

yipingnus/scratchplot-story-generation (PyTorch; mentioned in GitHub)
timoschick/dino (official; PyTorch; mentioned in GitHub)

Benchmarks

Benchmark                                     Methodology   Metrics
semantic-textual-similarity-on-sick           Dino (STSb/)  Spearman Correlation: 0.6809
semantic-textual-similarity-on-sick           Dino (STS/)   Spearman Correlation: 0.7426
semantic-textual-similarity-on-sts-benchmark  Dino (STSb/)  Spearman Correlation: 0.7782
semantic-textual-similarity-on-sts-benchmark  Dino (STS/)   Spearman Correlation: 0.7651
semantic-textual-similarity-on-sts12          Dino (STSb/)  Spearman Correlation: 0.7027
semantic-textual-similarity-on-sts13          Dino (STSb/)  Spearman Correlation: 0.8126
semantic-textual-similarity-on-sts14          Dino (STSb/)  Spearman Correlation: 0.7125
semantic-textual-similarity-on-sts15          Dino (STSb/)  Spearman Correlation: 0.8049
semantic-textual-similarity-on-sts16          Dino (STSb/)  Spearman Correlation: 0.7718
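
The benchmark metric above is the Spearman correlation, which on semantic textual similarity tasks is commonly computed between a model's predicted similarity scores and human gold scores. The following is a minimal pure-Python sketch of that computation, assuming tied values receive their average rank; the function names and the example scores are illustrative and not taken from the paper or the benchmark code.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Illustrative values only: predicted similarities vs. gold scores.
predicted = [0.10, 0.35, 0.50, 0.90, 0.70]
gold = [0.0, 1.5, 2.0, 5.0, 4.0]
print(round(spearman(predicted, gold), 4))
```

Because the metric depends only on ranks, it rewards a model for ordering sentence pairs correctly rather than for matching the gold scores' absolute scale.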
