UCPhrase: Unsupervised Context-aware Quality Phrase Tagging

Xiaotao Gu; Zihan Wang; Zhenyu Bi; Yu Meng; Liyuan Liu; Jiawei Han; Jingbo Shang

Abstract

Identifying and understanding quality phrases from context is a fundamental task in text mining. The most challenging part of this task arguably lies in uncommon, emerging, and domain-specific phrases. The infrequent nature of these phrases significantly hurts the performance of phrase mining methods that rely on sufficient phrase occurrences in the input corpus. Context-aware tagging models, though not restricted by frequency, heavily rely on domain experts for either massive sentence-level gold labels or handcrafted gazetteers. In this work, we propose UCPhrase, a novel unsupervised context-aware quality phrase tagger. Specifically, we induce high-quality phrase spans as silver labels from consistently co-occurring word sequences within each document. Compared with typical context-agnostic distant supervision based on existing knowledge bases (KBs), our silver labels root deeply in the input domain and context, thus having unique advantages in preserving contextual completeness and capturing emerging, out-of-KB phrases. Training a conventional neural tagger based on silver labels usually faces the risk of overfitting phrase surface names. Alternatively, we observe that the contextualized attention maps generated from a transformer-based neural language model effectively reveal the connections between words in a surface-agnostic way. Therefore, we pair such attention maps with the silver labels to train a lightweight span prediction model, which can be applied to new input to recognize (unseen) quality phrases regardless of their surface names or frequency. Thorough experiments on various tasks and datasets, including corpus-level phrase ranking, document-level keyphrase extraction, and sentence-level phrase tagging, demonstrate the superiority of our design over state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
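The silver-label step described in the abstract (treating word sequences that consistently co-occur within a single document as phrase spans) can be illustrated with a minimal sketch. The function name `mine_silver_spans`, the whitespace tokenization, the threshold of two occurrences, the length cap of five words, and the "keep only maximal spans" filter are all assumptions made for illustration; they are not taken from the paper or its official repositories.

```python
from collections import defaultdict


def mine_silver_spans(tokens, min_count=2, max_len=5):
    """Return sorted (start, end) spans, end exclusive, of repeated word sequences.

    Hypothetical sketch: a word sequence of 2..max_len tokens counts as a
    silver phrase if it occurs at least `min_count` times in this one document.
    """
    # Record every start position of every n-gram in the document.
    positions = defaultdict(list)
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)

    # Collect the spans of all n-grams that repeat within the document.
    spans = set()
    for gram, starts in positions.items():
        if len(starts) >= min_count:
            for s in starts:
                spans.add((s, s + len(gram)))

    # Keep only maximal spans: drop any span contained in a longer kept span
    # (assumed here as a stand-in for "contextual completeness").
    maximal = [
        (s, e)
        for (s, e) in spans
        if not any(
            s2 <= s and e <= e2 and (s2, e2) != (s, e) for (s2, e2) in spans
        )
    ]
    return sorted(maximal)


doc = (
    "we study heat island effect in cities . "
    "the heat island effect grows with density"
).split()
spans = mine_silver_spans(doc)
print(spans)                                    # [(2, 5), (9, 12)]
print([" ".join(doc[s:e]) for s, e in spans])   # ['heat island effect', 'heat island effect']
```

Because the spans come from the document itself rather than from a knowledge base, a phrase only needs to repeat locally to be labeled, which is how infrequent or out-of-KB phrases can still receive silver labels.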

Code Repositories

xgeric/UCPhrase-exp (official, PyTorch)
xgeric/UCPhrase-reproduce (official, PyTorch)

Benchmarks

Keyphrase extraction

| Benchmark | Methodology | F1@10 | Recall |
|---|---|---|---|
| keyphrase-extraction-on-kp20k | StanfordNLP | 13.9 | 51.7 |
| keyphrase-extraction-on-kp20k | Wiki+RoBERTa | 19.2 | 73.0 |
| keyphrase-extraction-on-kp20k | Spacy | 15.3 | 59.5 |
| keyphrase-extraction-on-kp20k | AutoPhrase | 18.2 | 62.9 |
| keyphrase-extraction-on-kp20k | UCPhrase | 19.7 | 72.9 |
| keyphrase-extraction-on-kp20k | PKE | 12.6 | 57.1 |
| keyphrase-extraction-on-kp20k | TopMine | 15.0 | 53.3 |
| keyphrase-extraction-on-kptimes | AutoPhrase | 10.3 | 77.8 |
| keyphrase-extraction-on-kptimes | Wiki+RoBERTa | 9.4 | 64.5 |
| keyphrase-extraction-on-kptimes | UCPhrase | 10.9 | 83.4 |
| keyphrase-extraction-on-kptimes | TopMine | 8.5 | 63.4 |

Phrase ranking

| Benchmark | Methodology | P@50K | P@5K |
|---|---|---|---|
| phrase-ranking-on-kp20k | TopMine | 78.0 | 81.5 |
| phrase-ranking-on-kp20k | Wiki+RoBERTa | 98.5 | 100.0 |
| phrase-ranking-on-kp20k | UCPhrase | 96.5 | 96.5 |
| phrase-ranking-on-kptimes | UCPhrase | 95.5 | 96.5 |
| phrase-ranking-on-kptimes | Wiki+RoBERTa | 96.5 | 99.0 |
| phrase-ranking-on-kptimes | AutoPhrase | 95.5 | 96.5 |
| phrase-ranking-on-kptimes | TopMine | 71.0 | 85.5 |

Phrase tagging

| Benchmark | Methodology | F1 | Precision | Recall |
|---|---|---|---|---|
| phrase-tagging-on-kp20k | AutoPhrase | 49.7 | 55.2 | 45.2 |
| phrase-tagging-on-kp20k | Wiki+RoBERTa | 61.0 | 58.1 | 64.2 |
| phrase-tagging-on-kp20k | TopMine | 40.6 | 39.8 | 41.4 |
| phrase-tagging-on-kp20k | UCPhrase | 73.9 | 69.9 | 78.3 |
| phrase-tagging-on-kptimes | AutoPhrase | 45.9 | 44.2 | 47.7 |
| phrase-tagging-on-kptimes | Wiki+RoBERTa | 63.2 | 60.9 | 65.6 |
| phrase-tagging-on-kptimes | UCPhrase | 73.5 | 69.1 | 78.9 |
| phrase-tagging-on-kptimes | TopMine | 34.0 | 32.0 | 36.3 |
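
The phrase-tagging results list F1 alongside precision and recall. As a reading aid, the sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall and applies it to the UCPhrase row for phrase-tagging-on-kp20k. The helper name `f1_score` is hypothetical, and the page does not state how its scores were aggregated, so this shows the relationship between the three columns rather than the benchmark's actual evaluation procedure.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# UCPhrase on phrase-tagging-on-kp20k: Precision 69.9, Recall 78.3
print(round(f1_score(69.9, 78.3), 1))  # 73.9
```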
