
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler


Abstract

State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par with, and sometimes outperforming, subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
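To make the GBST idea concrete, below is a minimal PyTorch sketch of the core mechanism described in the abstract: for each candidate block size, byte embeddings are mean-pooled into non-overlapping blocks, each block is scored position-wise by a scoring network, and a softmax over block sizes forms a soft, differentiable "tokenization" that is then downsampled. The class name `GBSTSketch` and all hyperparameters are illustrative, and the paper's offset enumeration and score-calibration steps are omitted; this is a simplified sketch of the technique, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBSTSketch(nn.Module):
    """Minimal sketch of gradient-based subword tokenization (GBST).

    Simplified: no block-offset enumeration, no score calibration,
    and no pre-pooling convolution (all present in the paper).
    """

    def __init__(self, d_model=64, max_block_size=4, downsample=2):
        super().__init__()
        self.block_sizes = list(range(1, max_block_size + 1))
        self.scorer = nn.Linear(d_model, 1)  # block scoring network
        self.downsample = downsample

    def forward(self, x):  # x: (batch, length, d_model) byte embeddings
        B, L, D = x.shape
        block_reprs, block_scores = [], []
        for b in self.block_sizes:
            pad = (b - L % b) % b                 # pad so b divides length
            xp = F.pad(x, (0, 0, 0, pad))
            # non-overlapping mean pooling into candidate blocks of size b
            blocks = xp.view(B, -1, b, D).mean(dim=2)
            # upsample back to per-position resolution by repetition
            up = blocks.repeat_interleave(b, dim=1)[:, :L]
            block_reprs.append(up)
            block_scores.append(self.scorer(up))  # (B, L, 1)
        reprs = torch.stack(block_reprs, dim=2)   # (B, L, n_blocks, D)
        # position-wise softmax over block sizes: the soft tokenization
        scores = torch.softmax(torch.cat(block_scores, dim=2), dim=2)
        latent = (reprs * scores.unsqueeze(-1)).sum(dim=2)  # (B, L, D)
        # downsample the sequence before the Transformer stack
        pad = (self.downsample - L % self.downsample) % self.downsample
        latent = F.pad(latent, (0, 0, 0, pad))
        return latent.view(B, -1, self.downsample, D).mean(dim=2)

# Usage: a batch of 10 byte positions is softly tokenized, then halved.
x = torch.randn(2, 10, 64)
out = GBSTSketch()(x)  # shape (2, 5, 64)
```

Because every step (pooling, scoring, softmax mixing) is differentiable, the block scores are learned jointly with the downstream model, which is what allows the tokenization to be trained end-to-end rather than fixed in advance.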


Benchmarks

Benchmark | Methodology | Metrics
linguistic-acceptability-on-cola | Charformer-Tall | Accuracy: 51.8%
natural-language-inference-on-multinli | Charformer-Tall | Matched: 83.7; Mismatched: 84.4
natural-language-inference-on-qnli | Charformer-Tall | Accuracy: 91.0%
paraphrase-identification-on-quora-question | Charformer-Tall | Accuracy: 91.4; F1: 88.5
semantic-textual-similarity-on-mrpc | Charformer-Tall | Accuracy: 87.5%; F1: 91.4
semantic-textual-similarity-on-sts-benchmark | Charformer-Tall | Pearson Correlation: 0.873
sentiment-analysis-on-sst-2-binary | Charformer-Base | Accuracy: 91.6
