
Abstract
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this problem, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistically oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
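
As a rough illustration of the robustness analysis described above, the sketch below applies a naive spelling-based perturbation (random adjacent-character swaps) to an input string. This is a hypothetical example rather than the benchmark's actual attack implementation; the function name and parameters are assumptions made for illustration.

```python
import random


def spelling_perturbation(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Simulate typos by swapping one adjacent character pair per word.

    Hypothetical illustration only; TAPE's spelling attacks are more
    linguistically informed than a random swap.
    """
    rng = random.Random(seed)
    perturbed = []
    for word in text.split():
        chars = list(word)
        # Perturb longer words only, each with probability `rate`.
        if len(chars) > 3 and rng.random() < rate:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        perturbed.append("".join(chars))
    return " ".join(perturbed)


print(spelling_perturbation("zero-shot evaluation of autoregressive language models"))
```

Scoring a model on both the original and the perturbed inputs and comparing the results gives a simple measure of robustness to spelling noise.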
Benchmarks
| Benchmark | Model | Accuracy |
|---|---|---|
| ethics-on-ethics | RuGPT-3 Small | 55.5 |
| ethics-on-ethics | RuGPT-3 Medium | 68.3 |
| ethics-on-ethics | RuGPT-3 Large | 68.6 |
| ethics-on-ethics | Human benchmark | 52.9 |
| ethics-on-ethics-2 | RuGPT-3 Small | 60.9 |
| ethics-on-ethics-2 | RuGPT-3 Medium | 44.1 |
| ethics-on-ethics-2 | RuGPT-3 Large | 44.9 |
| ethics-on-ethics-2 | Human benchmark | 67.6 |
| logical-reasoning-on-ruworldtree | RuGPT-3 Small | 34.0 |
| logical-reasoning-on-ruworldtree | RuGPT-3 Medium | 38.0 |
| logical-reasoning-on-ruworldtree | RuGPT-3 Large | 40.7 |
| logical-reasoning-on-ruworldtree | Human benchmark | 83.7 |
| logical-reasoning-on-winograd-automatic | RuGPT-3 Small | 57.9 |
| logical-reasoning-on-winograd-automatic | RuGPT-3 Medium | 57.2 |
| logical-reasoning-on-winograd-automatic | RuGPT-3 Large | 55.5 |
| logical-reasoning-on-winograd-automatic | Human benchmark | 87.0 |
| question-answering-on-chegeka | RuGPT-3 Small | 0.0 |
| question-answering-on-chegeka | RuGPT-3 Medium | 0.0 |
| question-answering-on-chegeka | RuGPT-3 Large | 0.0 |
| question-answering-on-chegeka | Human benchmark | 64.5 |
| question-answering-on-multiq | RuGPT-3 Small | 0.0 |
| question-answering-on-multiq | RuGPT-3 Medium | 0.0 |
| question-answering-on-multiq | RuGPT-3 Large | 0.0 |
| question-answering-on-multiq | Human benchmark | 91.0 |
| question-answering-on-ruopenbookqa | RuGPT-3 Small | 57.9 |
| question-answering-on-ruopenbookqa | RuGPT-3 Medium | 57.2 |
| question-answering-on-ruopenbookqa | RuGPT-3 Large | 55.5 |
| question-answering-on-ruopenbookqa | Human benchmark | 86.5 |
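
The accuracy figures above come from zero- and few-shot evaluation of the RuGPT-3 baselines against the human benchmark. A minimal sketch of how such a zero-shot multiple-choice score could be obtained is shown below: each answer option is ranked by its average token log-likelihood under the model. The Hugging Face hub id is an assumption (RuGPT-3 checkpoints have been published under several names), and this is not the benchmark's official evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub id is an assumption; RuGPT-3 Small has been published under names such
# as "ai-forever/rugpt3small_based_on_gpt2".
MODEL_ID = "ai-forever/rugpt3small_based_on_gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()


def option_score(question: str, option: str) -> float:
    """Average log-likelihood of the option tokens given the question prefix."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift so position t predicts token t+1, then keep only the tail that
    # (approximately) corresponds to the option tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    option_len = max(full_ids.shape[1] - prompt_ids.shape[1], 1)
    return token_lp[:, -option_len:].mean().item()


def predict(question: str, options: list[str]) -> int:
    """Zero-shot prediction: pick the most likely option index."""
    return max(range(len(options)), key=lambda i: option_score(question, options[i]))


# Accuracy over a (hypothetical) list of (question, options, gold_index) items:
# accuracy = sum(predict(q, opts) == gold for q, opts, gold in data) / len(data)
```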