
Abstract
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this problem, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistically oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
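
As a rough illustration of the robustness analysis described above, the sketch below applies a naive spelling-based perturbation (random adjacent-character swaps) to an input string. This is a hypothetical example rather than the benchmark's actual attack implementation; the function name and parameters are assumptions made for illustration.

```python
import random


def spelling_perturbation(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Simulate typos by swapping one adjacent character pair per word.

    Hypothetical illustration only; TAPE's spelling attacks are more
    linguistically informed than a random swap.
    """
    rng = random.Random(seed)
    perturbed = []
    for word in text.split():
        chars = list(word)
        # Perturb longer words only, each with probability `rate`.
        if len(chars) > 3 and rng.random() < rate:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        perturbed.append("".join(chars))
    return " ".join(perturbed)


print(spelling_perturbation("zero-shot evaluation of autoregressive language models"))
```

Scoring a model on both the original and the perturbed inputs and comparing the results gives a simple measure of robustness to spelling noise.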
Benchmarks
| Benchmark | Model | Accuracy |
|---|---|---|
| ethics-on-ethics | RuGPT-3 Small | 55.5 |
| ethics-on-ethics | RuGPT-3 Medium | 68.3 |
| ethics-on-ethics | RuGPT-3 Large | 68.6 |
| ethics-on-ethics | Human benchmark | 52.9 |
| ethics-on-ethics-2 | RuGPT-3 Small | 60.9 |
| ethics-on-ethics-2 | RuGPT-3 Medium | 44.1 |
| ethics-on-ethics-2 | RuGPT-3 Large | 44.9 |
| ethics-on-ethics-2 | Human benchmark | 67.6 |
| logical-reasoning-on-ruworldtree | RuGPT-3 Small | 34.0 |
| logical-reasoning-on-ruworldtree | RuGPT-3 Medium | 38.0 |
| logical-reasoning-on-ruworldtree | RuGPT-3 Large | 40.7 |
| logical-reasoning-on-ruworldtree | Human benchmark | 83.7 |
| logical-reasoning-on-winograd-automatic | RuGPT-3 Small | 57.9 |
| logical-reasoning-on-winograd-automatic | RuGPT-3 Medium | 57.2 |
| logical-reasoning-on-winograd-automatic | RuGPT-3 Large | 55.5 |
| logical-reasoning-on-winograd-automatic | Human benchmark | 87.0 |
| question-answering-on-chegeka | RuGPT-3 Small | 0.0 |
| question-answering-on-chegeka | RuGPT-3 Medium | 0.0 |
| question-answering-on-chegeka | RuGPT-3 Large | 0.0 |
| question-answering-on-chegeka | Human benchmark | 64.5 |
| question-answering-on-multiq | RuGPT-3 Small | 0.0 |
| question-answering-on-multiq | RuGPT-3 Medium | 0.0 |
| question-answering-on-multiq | RuGPT-3 Large | 0.0 |
| question-answering-on-multiq | Human benchmark | 91.0 |
| question-answering-on-ruopenbookqa | RuGPT-3 Small | 57.9 |
| question-answering-on-ruopenbookqa | RuGPT-3 Medium | 57.2 |
| question-answering-on-ruopenbookqa | RuGPT-3 Large | 55.5 |
| question-answering-on-ruopenbookqa | Human benchmark | 86.5 |
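
The accuracy figures above come from zero- and few-shot evaluation of the RuGPT-3 baselines against the human benchmark. A minimal sketch of how such a zero-shot multiple-choice score could be obtained is shown below: each answer option is ranked by its average token log-likelihood under the model. The Hugging Face hub id is an assumption (RuGPT-3 checkpoints have been published under several names), and this is not the benchmark's official evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub id is an assumption; RuGPT-3 Small has been published under names such
# as "ai-forever/rugpt3small_based_on_gpt2".
MODEL_ID = "ai-forever/rugpt3small_based_on_gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()


def option_score(question: str, option: str) -> float:
    """Average log-likelihood of the option tokens given the question prefix."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift so position t predicts token t+1, then keep only the tail that
    # (approximately) corresponds to the option tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    option_len = max(full_ids.shape[1] - prompt_ids.shape[1], 1)
    return token_lp[:, -option_len:].mean().item()


def predict(question: str, options: list[str]) -> int:
    """Zero-shot prediction: pick the most likely option index."""
    return max(range(len(options)), key=lambda i: option_score(question, options[i]))


# Accuracy over a (hypothetical) list of (question, options, gold_index) items:
# accuracy = sum(predict(q, opts) == gold for q, opts, gold in data) / len(data)
```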