LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We made our model and code publicly available at https://aka.ms/layoutlmv2.
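The spatial-aware self-attention mentioned in the abstract augments ordinary scaled dot-product attention scores with learned relative-position biases: one for the 1D token order and one each for the 2D x/y layout coordinates of the text blocks. The sketch below is a minimal PyTorch illustration of that idea, not the released implementation; the class name, the bucket sizes (max_rel_1d, max_rel_2d), and the assumption that each token's box is summarized by integer x/y center coordinates are all illustrative choices.

```python
# Minimal sketch (assumptions noted above) of spatial-aware self-attention:
# attention scores get additive, learned biases indexed by relative 1D token
# distance and relative 2D x/y distance between text blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareSelfAttention(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, max_rel_1d=128, max_rel_2d=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)
        # Learned per-head bias tables over clipped relative distances.
        self.rel_1d_bias = nn.Embedding(2 * max_rel_1d + 1, num_heads)
        self.rel_x_bias = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.rel_y_bias = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.max_rel_1d = max_rel_1d
        self.max_rel_2d = max_rel_2d

    def _relative_index(self, positions, max_rel):
        # positions: (batch, seq_len) LongTensor. Returns pairwise differences
        # clipped to [-max_rel, max_rel] and shifted into [0, 2*max_rel].
        diff = positions[:, None, :] - positions[:, :, None]
        return diff.clamp(-max_rel, max_rel) + max_rel

    def forward(self, hidden_states, token_positions, x_centers, y_centers):
        b, n, _ = hidden_states.shape
        q, k, v = self.qkv(hidden_states).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Standard scaled dot-product scores: (batch, heads, n, n).
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5

        # Add relative-position biases; each lookup is (b, n, n, heads),
        # permuted to (b, heads, n, n) to match the score tensor.
        scores = scores + self.rel_1d_bias(
            self._relative_index(token_positions, self.max_rel_1d)).permute(0, 3, 1, 2)
        scores = scores + self.rel_x_bias(
            self._relative_index(x_centers, self.max_rel_2d)).permute(0, 3, 1, 2)
        scores = scores + self.rel_y_bias(
            self._relative_index(y_centers, self.max_rel_2d)).permute(0, 3, 1, 2)

        attn = F.softmax(scores, dim=-1)
        context = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(context)
```

In practice the x/y inputs would come from the (quantized) bounding-box coordinates produced by an OCR engine; because the biases depend only on relative distances, attention can reflect how far apart two text blocks are on the page rather than their absolute positions.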

