Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka

Abstract
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
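The core mechanism described above, representing layout as a bias on attention on top of an encoder-decoder Transformer, can be illustrated with a minimal sketch. The PyTorch module below is not the authors' implementation; the names (`Layout2DAttentionBias`, `relative_bucket`), the bucket count, and the coordinate quantization are illustrative assumptions. It shows one way to turn relative horizontal and vertical distances between token bounding boxes into a per-head bias added to attention logits, in the spirit of T5-style relative position biases extended to 2D.

```python
# Hedged sketch (not the paper's code): layout injected into attention as a
# per-head bias computed from relative x/y distances between token boxes.
import torch
import torch.nn as nn


def relative_bucket(rel: torch.Tensor, num_buckets: int = 32, max_distance: int = 128) -> torch.Tensor:
    """Map signed relative distances to a small set of buckets (T5-style, log-scaled for large gaps)."""
    num_buckets //= 2
    buckets = (rel > 0).long() * num_buckets  # separate buckets for positive vs. negative offsets
    rel = rel.abs()
    max_exact = num_buckets // 2
    is_small = rel < max_exact
    large = max_exact + (
        torch.log(rel.float() / max_exact + 1e-6)
        / torch.log(torch.tensor(max_distance / max_exact))
        * (num_buckets - max_exact)
    ).long()
    large = torch.clamp(large, max=num_buckets - 1)
    return buckets + torch.where(is_small, rel, large)


class Layout2DAttentionBias(nn.Module):
    """Per-head additive bias for attention logits, derived from 2D token-box geometry."""

    def __init__(self, num_heads: int, num_buckets: int = 32):
        super().__init__()
        self.num_buckets = num_buckets
        self.x_bias = nn.Embedding(num_buckets, num_heads)
        self.y_bias = nn.Embedding(num_buckets, num_heads)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (batch, seq, 4) with normalized (x0, y0, x1, y1); use box centers.
        cx = ((boxes[..., 0] + boxes[..., 2]) / 2 * 1000).long()
        cy = ((boxes[..., 1] + boxes[..., 3]) / 2 * 1000).long()
        rel_x = cx[:, None, :] - cx[:, :, None]   # (batch, seq, seq) signed horizontal offsets
        rel_y = cy[:, None, :] - cy[:, :, None]   # (batch, seq, seq) signed vertical offsets
        bias = self.x_bias(relative_bucket(rel_x, self.num_buckets)) \
             + self.y_bias(relative_bucket(rel_y, self.num_buckets))
        return bias.permute(0, 3, 1, 2)           # (batch, heads, seq, seq)
```

In use, such a bias would be added to the encoder's self-attention logits before the softmax, alongside the usual sequential relative-position bias, so that spatially close tokens can attend to each other more easily regardless of their order in the flattened text.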
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | TILT-Base | Accuracy: 95.25% |
| document-image-classification-on-rvl-cdip | TILT-Large | Accuracy: 95.52% |
| visual-question-answering-on-docvqa-test | TILT-Large | ANLS: 0.8705 |
| visual-question-answering-on-docvqa-test | TILT-Base | ANLS: 0.8392 |
| visual-question-answering-vqa-on | TILT-Large | ANLS: 61.20 |
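The DocVQA rows above report ANLS (Average Normalized Levenshtein Similarity). A hedged reference sketch of that metric, with illustrative function names, is shown below: each prediction is scored against its best-matching reference answer, and similarities below the 0.5 threshold are zeroed before averaging.

```python
# Hedged sketch of ANLS as defined for DocVQA; function names are illustrative.
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions: list[str], references: list[list[str]], threshold: float = 0.5) -> float:
    """Average over questions of the best normalized similarity against any reference answer."""
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            best = max(best, 1.0 - nl)
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / max(len(scores), 1)
```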