Mikhail Khodak; Nikunj Saunshi; Kiran Vodrahalli

Abstract
We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated -- sarcasm is labeled by the author, not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| sarcasm-detection-on-sarc-all-bal | Bag-of-Bigrams | Accuracy: 75.8 |
| sarcasm-detection-on-sarc-pol-bal | Bag-of-Bigrams | Accuracy: 76.5 |
| sarcasm-detection-on-sarc-pol-unbal | Bag-of-Words | Avg F1: 27.0 |
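The benchmark rows above pair bag-of-n-gram features with accuracy on the balanced tasks and F1 on the unbalanced one. The following is a minimal sketch of those ingredients in plain Python, not the authors' implementation. The whitespace tokenizer, the function names, and the choice to compute F1 for the sarcastic class only are assumptions made for illustration; the page does not specify how the reported "Avg F1" is averaged.

```python
from collections import Counter


def bag_of_ngrams(text, n=2):
    """Count word n-grams in a comment (assumed: lowercase + whitespace split)."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list yields consecutive n-grams
    return Counter(zip(*(tokens[i:] for i in range(n))))


def accuracy(y_true, y_pred):
    """Fraction of correct predictions; informative when classes are balanced."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (sarcastic) class; better suited to unbalanced labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example comment, not taken from the corpus
features = bag_of_ngrams("Yeah that will totally work", n=2)
```

Passing `n=1` gives bag-of-words counts instead of bigrams. On an unbalanced task, a classifier that always predicts "not sarcastic" can reach high accuracy while scoring an F1 of 0 for the sarcastic class, which is one reason to report F1 there.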