5 months ago

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Huang Yupan ; Lv Tengchao ; Cui Lei ; Lu Yutong ; Wei Furu

Abstract

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
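
As a rough illustration of the word-patch alignment objective described above, the following minimal sketch builds the binary targets such an objective would predict. It is not the authors' implementation: the function name `wpa_labels`, the input layout, the choice to skip masked text tokens, and the rule that a word counts as unaligned when any of its patches is masked are all assumptions made for this example.

```python
from typing import List, Optional

ALIGNED, UNALIGNED = 1, 0  # hypothetical label ids for this sketch


def wpa_labels(
    text_masked: List[bool],        # True if the text token itself was masked
    word_patches: List[List[int]],  # image patch indices covered by each token
    patch_masked: List[bool],       # True if the image patch was masked
) -> List[Optional[int]]:
    """Build word-patch alignment targets, one per text token.

    Assumptions (not stated in the abstract): masked text tokens are
    excluded from the objective (None), and a token is labeled UNALIGNED
    if any of its corresponding patches is masked.
    """
    labels: List[Optional[int]] = []
    for is_masked, patches in zip(text_masked, word_patches):
        if is_masked:
            labels.append(None)  # skipped when computing the loss
        elif any(patch_masked[p] for p in patches):
            labels.append(UNALIGNED)
        else:
            labels.append(ALIGNED)
    return labels


# Toy input: four text tokens over four image patches.
example = wpa_labels(
    text_masked=[False, True, False, False],
    word_patches=[[0], [1], [2, 3], [3]],
    patch_masked=[False, True, True, False],
)
print(example)
```

In this toy input, the second token is skipped because it is masked in the text, and the third is labeled unaligned because one of its two patches is masked.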

Benchmarks

Benchmark | Methodology | Metrics
document-ai-on-ephoie | LayoutLMv3 | Average F1: 99.21
document-image-classification-on-rvl-cdip | LayoutLMV3Large | Accuracy: 95.93%; Parameters: 368M
document-image-classification-on-rvl-cdip | LayoutLMv3BASE | Accuracy: 95.44%; Parameters: 133M
document-layout-analysis-on-publaynet-val | LayoutLMv3-B | Figure: 0.970; List: 0.955; Overall: 0.951; Table: 0.979; Text: 0.945; Title: 0.906
key-information-extraction-on-cord | LayoutLMv3 Large | F1: 97.46
key-information-extraction-on-ephoie | LayoutLMv3 | Average F1: 99.21
key-value-pair-extraction-on-rfund-en | LayoutLMv3 | key-value pair F1: 57.66
key-value-pair-extraction-on-sibr | LayoutLMv3_base_chinese | key-value pair F1: 73.51
named-entity-recognition-ner-on-cord-r | LayoutLMv3 | F1: 82.72
named-entity-recognition-ner-on-funsd-r | LayoutLMv3 | F1: 78.77
relation-extraction-on-funsd | LayoutLMv3 large | F1: 80.35
semantic-entity-labeling-on-funsd | LayoutLMv3 Large | F1: 92.08
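
The Overall score listed for document-layout-analysis-on-publaynet-val appears consistent with an unweighted mean of the five per-category scores. The short check below reproduces that arithmetic from the listed values. Reading Overall as a simple average is an assumption made here, not something the listing states.

```python
# Per-category scores as listed for LayoutLMv3-B on PubLayNet (val).
per_category = {
    "Figure": 0.970,
    "List": 0.955,
    "Table": 0.979,
    "Text": 0.945,
    "Title": 0.906,
}
reported_overall = 0.951

# Assumption: Overall is the unweighted mean across categories.
mean_score = sum(per_category.values()) / len(per_category)

# Allow for rounding to three decimals in the reported figure.
consistent = abs(mean_score - reported_overall) < 5e-4
print(f"{mean_score:.3f}", consistent)
```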
