OCR-free Document Understanding Transformer
Abstract
Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of documents; 3) OCR error propagation to the subsequent process. To address these issues, in this paper we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., a Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show that a simple OCR-free VDU model, Donut, achieves state-of-the-art performance on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible across various languages and domains. The code, trained model, and synthetic data are available at https://github.com/clovaai/donut.
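
As a minimal illustration of the OCR-free pipeline described above, the sketch below feeds a document image directly to an image-to-sequence Transformer and decodes a structured output, with no OCR engine in the loop. It is not the authors' reference code: it assumes the released Donut weights are loaded through Hugging Face Transformers' DonutProcessor / VisionEncoderDecoderModel interface, and the checkpoint name, task prompt, and input file are illustrative placeholders rather than details stated in the abstract.

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load a released Donut checkpoint (name is an illustrative example).
ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
model.eval()

# Preprocess a document image into pixel tensors; reading the text is
# left entirely to the model, so no OCR output is needed here.
image = Image.open("receipt.png").convert("RGB")  # hypothetical input file
pixel_values = processor(image, return_tensors="pt").pixel_values

# Prompt the decoder with a task token and generate the output sequence
# autoregressively (the model is trained with a plain cross-entropy objective).
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Strip special tokens and convert the generated sequence into a
# JSON-like dictionary of extracted fields.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
print(processor.token2json(sequence))
```

Because the decoder emits a structured token sequence directly from pixels, the same interface can, in principle, cover document classification, information extraction, and visual question answering simply by changing the task prompt.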