Distributed Representations of Sentences and Documents
Quoc V. Le; Tomas Mikolov

Abstract
Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
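The abstract's core idea — a dense per-document vector trained jointly with word vectors to predict the words occurring in that document — is implemented in gensim's Doc2Vec. A minimal sketch follows; the toy corpus and hyperparameters are illustrative assumptions, not values from the paper:

```python
# Paragraph Vector (PV-DM) sketch via gensim's Doc2Vec.
# Corpus and hyperparameters are toy examples, not the paper's setup.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "powerful machine learning algorithms",
    "strong models learn useful representations",
    "paris is the capital of france",
]

# Each document gets a tag; its vector is trained to predict its words.
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(corpus)]

model = Doc2Vec(
    documents,
    dm=1,            # PV-DM: document vector combined with context word vectors
    vector_size=50,  # fixed-length feature vector per document
    window=2,
    min_count=1,
    epochs=40,
)

# Infer a fixed-length vector for an unseen, variable-length text.
vec = model.infer_vector("strong learning algorithms".split())
print(vec.shape)  # (50,)
```

Unlike bag-of-words, the inferred vector is dense and low-dimensional, and semantically related documents end up closer in the vector space.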
Benchmarks
| Benchmark | Methodology | MAP | MRR |
|---|---|---|---|
| Question Answering on QASent | Paragraph Vector | 0.5213 | 0.6023 |
| Question Answering on QASent | Paragraph Vector (lexical overlap + dist output) | 0.6762 | 0.7514 |
| Question Answering on WikiQA | Paragraph Vector | 0.5110 | 0.5160 |
| Question Answering on WikiQA | Paragraph Vector (lexical overlap + dist output) | 0.5976 | 0.6058 |
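For reference, MAP (mean average precision) and MRR (mean reciprocal rank) summarize how highly the correct answers are ranked per question. A short illustrative computation, with made-up relevance labels rather than data from these benchmarks:

```python
# Illustrative MAP/MRR computation for a toy QA ranking task.
# Relevance labels below are invented, not from QASent or WikiQA.

def mean_reciprocal_rank(first_correct_ranks):
    """MRR: average of 1/rank of the first correct answer per question."""
    return sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)

def average_precision(relevance):
    """AP for one question: precision averaged over relevant positions."""
    hits, precisions = 0, []
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

# Two questions, each with a ranked candidate list (1 = correct answer).
runs = [[0, 1, 0, 1], [1, 0, 0, 0]]
print("MAP:", sum(average_precision(r) for r in runs) / len(runs))  # 0.75
print("MRR:", mean_reciprocal_rank([2, 1]))  # first hits at ranks 2 and 1 -> 0.75
```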