Abstract
Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains. The code, trained model and synthetic data are available at https://github.com/clovaai/donut.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | Donut | Accuracy: 95.3% |
| key-value-pair-extraction-on-rfund-en | Donut | key-value pair F1: 24.54 |
| key-value-pair-extraction-on-sibr | Donut | key-value pair F1: 17.26 |
| visual-question-answering-on-docvqa-test | Donut | ANLS: 0.675 |
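
The ANLS value reported for DocVQA is the Average Normalized Levenshtein Similarity between predicted and reference answers. The sketch below shows one common way to compute it, assuming the usual threshold of 0.5 and case-insensitive, whitespace-trimmed comparison. The function names `levenshtein` and `anls` are illustrative and are not taken from the Donut repository; the benchmark's official scorer may differ in normalization details.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    Each prediction is scored against every accepted answer; the best
    similarity is kept. Similarities whose normalized distance reaches
    the threshold are counted as 0 (assumed threshold: 0.5).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            g = answer.strip().lower()
            longest = max(len(p), len(g))
            distance = levenshtein(p, g) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)
```

An exact match scores 1, a near miss scores between 0.5 and 1, and an answer that differs in half or more of its characters scores 0, so the metric tolerates small recognition errors without rewarding unrelated output.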