Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka

Abstract
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
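The core mechanism described above, representing layout as a bias on attention on top of an encoder-decoder Transformer, can be illustrated with a minimal sketch. The PyTorch module below is not the authors' implementation; the names (`Layout2DAttentionBias`, `relative_bucket`), the bucket count, and the coordinate quantization are illustrative assumptions. It shows one way to turn relative horizontal and vertical distances between token bounding boxes into a per-head bias added to attention logits, in the spirit of T5-style relative position biases extended to 2D.

```python
# Hedged sketch (not the paper's code): layout injected into attention as a
# per-head bias computed from relative x/y distances between token boxes.
import torch
import torch.nn as nn


def relative_bucket(rel: torch.Tensor, num_buckets: int = 32, max_distance: int = 128) -> torch.Tensor:
    """Map signed relative distances to a small set of buckets (T5-style, log-scaled for large gaps)."""
    num_buckets //= 2
    buckets = (rel > 0).long() * num_buckets  # separate buckets for positive vs. negative offsets
    rel = rel.abs()
    max_exact = num_buckets // 2
    is_small = rel < max_exact
    large = max_exact + (
        torch.log(rel.float() / max_exact + 1e-6)
        / torch.log(torch.tensor(max_distance / max_exact))
        * (num_buckets - max_exact)
    ).long()
    large = torch.clamp(large, max=num_buckets - 1)
    return buckets + torch.where(is_small, rel, large)


class Layout2DAttentionBias(nn.Module):
    """Per-head additive bias for attention logits, derived from 2D token-box geometry."""

    def __init__(self, num_heads: int, num_buckets: int = 32):
        super().__init__()
        self.num_buckets = num_buckets
        self.x_bias = nn.Embedding(num_buckets, num_heads)
        self.y_bias = nn.Embedding(num_buckets, num_heads)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (batch, seq, 4) with normalized (x0, y0, x1, y1); use box centers.
        cx = ((boxes[..., 0] + boxes[..., 2]) / 2 * 1000).long()
        cy = ((boxes[..., 1] + boxes[..., 3]) / 2 * 1000).long()
        rel_x = cx[:, None, :] - cx[:, :, None]   # (batch, seq, seq) signed horizontal offsets
        rel_y = cy[:, None, :] - cy[:, :, None]   # (batch, seq, seq) signed vertical offsets
        bias = self.x_bias(relative_bucket(rel_x, self.num_buckets)) \
             + self.y_bias(relative_bucket(rel_y, self.num_buckets))
        return bias.permute(0, 3, 1, 2)           # (batch, heads, seq, seq)
```

In use, such a bias would be added to the encoder's self-attention logits before the softmax, alongside the usual sequential relative-position bias, so that spatially close tokens can attend to each other more easily regardless of their order in the flattened text.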
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | TILT-Base | Accuracy: 95.25% |
| document-image-classification-on-rvl-cdip | TILT-Large | Accuracy: 95.52% |
| visual-question-answering-on-docvqa-test | TILT-Large | ANLS: 0.8705 |
| visual-question-answering-on-docvqa-test | TILT-Base | ANLS: 0.8392 |
| visual-question-answering-vqa-on | TILT-Large | ANLS: 61.20 |
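The DocVQA rows above report ANLS (Average Normalized Levenshtein Similarity). A hedged reference sketch of that metric, with illustrative function names, is shown below: each prediction is scored against its best-matching reference answer, and similarities below the 0.5 threshold are zeroed before averaging.

```python
# Hedged sketch of ANLS as defined for DocVQA; function names are illustrative.
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions: list[str], references: list[list[str]], threshold: float = 0.5) -> float:
    """Average over questions of the best normalized similarity against any reference answer."""
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            best = max(best, 1.0 - nl)
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / max(len(scores), 1)
```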