
Unifying Vision, Text, and Layout for Universal Document Processing

Zineng Tang; Ziyi Yang; Guoxin Wang; Yuwei Fang; Yang Liu; Chenguang Zhu; Michael Zeng; Cha Zhang; Mohit Bansal


Abstract

We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark.

Code Repositories

DS4SD/MarkushGrapher (pytorch; mentioned in GitHub)
microsoft/i-code (official; jax; mentioned in GitHub)
microsoft/udop (official; mentioned in GitHub)

Benchmarks

Benchmark                                   Methodology   Metrics
visual-question-answering-on-docvqa-test    UDOP (aux)    ANLS: 0.878
visual-question-answering-on-docvqa-test    UDOP          ANLS: 0.847
visual-question-answering-vqa-on            UDOP          ANLS: 47.4
visual-question-answering-vqa-on            UDOP (aux)    ANLS: 63.0
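
The scores above are reported as ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of how such a score is typically computed, the sketch below follows the commonly used definition: for each question, take the best similarity against any reference answer, zero it out when the normalized edit distance reaches a threshold, and average over questions. The function names, the 0.5 threshold, and the lowercase/strip normalization are assumptions made for illustration; this is not the official evaluation script for these benchmarks. The result is on a 0-1 scale, whereas some rows above appear to use a 0-100 scale.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity on a 0-1 scale.

    predictions: list of predicted answer strings, one per question.
    references:  list of lists of acceptable answers, one list per question.
    threshold:   assumed cutoff; similarities are zeroed when the normalized
                 distance is at or above this value.
    """
    scores = []
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            # Assumed normalization: case-insensitive, surrounding whitespace ignored.
            p, g = pred.strip().lower(), gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            sim = 1.0 - nl if nl < threshold else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    preds = ["hello", "2019"]
    refs = [["hallo"], ["2019", "in 2019"]]
    print(anls(preds, refs))
```

Taking the maximum over reference answers means a prediction only needs to match one accepted phrasing, and the threshold keeps largely wrong answers from earning partial credit.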
