OCR-free Document Understanding Transformer

Abstract

Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains. The code, trained model and synthetic data are available at https://github.com/clovaai/donut.
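
The abstract names the pre-training objective only as a cross-entropy loss. As a rough illustration of what such an objective computes, the sketch below evaluates the mean token-level cross-entropy between per-step output scores and target token ids. It is a minimal pure-Python sketch, not Donut's actual training code: the function names `softmax` and `token_cross_entropy`, the toy vocabulary size of 4, and the toy inputs are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cross_entropy(logits_per_step, target_ids):
    """Mean negative log-likelihood of the target token at each step.

    logits_per_step: list of score lists, one per output position
                     (hypothetical decoder outputs).
    target_ids:      list of integer token ids, one per position.
    """
    if len(logits_per_step) != len(target_ids):
        raise ValueError("need one target id per output step")
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example with an assumed vocabulary of 4 tokens and 3 output steps.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
confident_logits = [
    [0.0, 5.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]
targets = [1, 0, 3]

uniform_loss = token_cross_entropy(uniform_logits, targets)
confident_loss = token_cross_entropy(confident_logits, targets)
```

With uniform scores the loss equals the log of the vocabulary size, and it drops as the scores concentrate on the correct tokens, which is the behaviour a cross-entropy objective rewards during training.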

