LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Huang Yupan; Lv Tengchao; Cui Lei; Lu Yutong; Wei Furu

Abstract
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
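
The word-patch alignment objective described in the abstract can be illustrated with a small sketch of how its binary targets might be built. This is not the authors' implementation. The helper names `patch_index` and `wpa_labels`, the 224-pixel image with 16-pixel patches, and the rule that a word belongs to the patch containing the center of its bounding box are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (x0, y0, x1, y1) word bounding box in pixels; hypothetical input format.
Box = Tuple[float, float, float, float]


def patch_index(box: Box, image_size: int = 224, patch_size: int = 16) -> int:
    """Return the row-major index of the patch containing the box center.

    Assigning a word to the patch under its center is an assumed rule,
    chosen only to keep the sketch short.
    """
    per_side = image_size // patch_size
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so that boxes touching the right or bottom edge stay in range.
    col = min(int(cx // patch_size), per_side - 1)
    row = min(int(cy // patch_size), per_side - 1)
    return row * per_side + col


def wpa_labels(
    word_boxes: List[Box],
    masked_patches: Set[int],
    image_size: int = 224,
    patch_size: int = 16,
) -> List[int]:
    """Label each word 1 if its image patch is masked, otherwise 0."""
    return [
        int(patch_index(box, image_size, patch_size) in masked_patches)
        for box in word_boxes
    ]


# Three words; the patches under the second and third are masked.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (100, 100, 120, 110)]
labels = wpa_labels(boxes, masked_patches={1, 90})
```

A classifier head on top of each text token would then be trained to predict these labels, which is the sense in which the model learns whether "the corresponding image patch of a text word is masked".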
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-ai-on-ephoie | LayoutLMv3 | Average F1: 99.21 |
| document-image-classification-on-rvl-cdip | LayoutLMV3Large | Accuracy: 95.93%; Parameters: 368M |
| document-image-classification-on-rvl-cdip | LayoutLMv3BASE | Accuracy: 95.44%; Parameters: 133M |
| document-layout-analysis-on-publaynet-val | LayoutLMv3-B | Figure: 0.970; List: 0.955; Overall: 0.951; Table: 0.979; Text: 0.945; Title: 0.906 |
| key-information-extraction-on-cord | LayoutLMv3 Large | F1: 97.46 |
| key-information-extraction-on-ephoie | LayoutLMv3 | Average F1: 99.21 |
| key-value-pair-extraction-on-rfund-en | LayoutLMv3 | key-value pair F1: 57.66 |
| key-value-pair-extraction-on-sibr | LayoutLMv3_base_chinese | key-value pair F1: 73.51 |
| named-entity-recognition-ner-on-cord-r | LayoutLMv3 | F1: 82.72 |
| named-entity-recognition-ner-on-funsd-r | LayoutLMv3 | F1: 78.77 |
| relation-extraction-on-funsd | LayoutLMv3 large | F1: 80.35 |
| semantic-entity-labeling-on-funsd | LayoutLMv3 Large | F1: 92.08 |