
Efficient Vector Representation for Documents through Corruption

Minmin Chen


Abstract

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings, and ensures that a representation formed this way captures the semantic meaning of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative words toward zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture of Doc2VecC matches or outperforms the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
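To make the representation concrete, the sketch below shows how a Doc2VecC-style document vector can be formed: the document is represented as the average of its word embeddings, and at training time the corruption model randomly removes words. This is a minimal illustration, not the author's released code; the embedding matrix U, the corruption rate q, the helper names, and the 1/(1-q) rescaling used here to keep the corrupted average unbiased are assumptions made for the example.

```python
import numpy as np

# Minimal sketch of a Doc2VecC-style document vector (illustrative, not the released code).
# Assumptions: U is a (vocab_size x dim) word-embedding matrix, a document is a list of
# word ids, q is the corruption rate, and surviving words are rescaled by 1/(1-q).

def corrupt(word_ids, q, rng):
    """Corruption model: drop each word of the document independently with probability q."""
    keep = rng.random(len(word_ids)) >= q
    return [w for w, kept in zip(word_ids, keep) if kept]

def doc_vector(word_ids, U, q=0.0, rng=None):
    """Document representation: simple average of the embeddings of the document's words.

    With q > 0 (training time), words are first removed by the corruption model and the
    surviving embeddings are rescaled so the corrupted average stays unbiased in expectation.
    """
    T = len(word_ids)
    if T == 0:
        return np.zeros(U.shape[1])
    if q > 0.0 and rng is not None:
        kept = corrupt(word_ids, q, rng)
        return U[kept].sum(axis=0) / ((1.0 - q) * T)
    return U[word_ids].mean(axis=0)

# Toy usage: 10-word vocabulary, 5-dimensional embeddings.
rng = np.random.default_rng(0)
U = rng.normal(size=(10, 5))
doc = [1, 3, 3, 7, 9]
print(doc_vector(doc, U))                  # test time: plain average, no corruption
print(doc_vector(doc, U, q=0.5, rng=rng))  # training time: corrupted average
```

At test time no corruption is applied, which is why generating a representation for an unseen document reduces to a single averaging pass over its word embeddings.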


Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| semantic-similarity-on-sick | Doc2VecC | MSE: 0.3053, Pearson Correlation: 0.8381, Spearman Correlation: 0.7621 |
| sentiment-analysis-on-imdb | Doc2VecC | Accuracy: 88.3 |
