LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding


Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We made our model and code publicly available at https://aka.ms/layoutlmv2.
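The spatial-aware self-attention mechanism mentioned in the abstract adds relative-position bias terms to ordinary content-based attention scores, so that attention between two tokens also depends on their relative 1D position and relative x/y box coordinates. The following is a minimal single-head sketch of that idea, not the paper's implementation: the bias matrices are passed in precomputed, and all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_aware_attention(q, k, v, rel_1d, rel_x, rel_y):
    """Single-head attention with additive spatial biases.

    q, k, v: (seq, dim) query/key/value projections.
    rel_1d, rel_x, rel_y: (seq, seq) bias terms, in practice looked up
    from learnable tables indexed by relative token position and by
    relative x/y bounding-box coordinates (precomputed here for brevity).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)             # content-based attention
    scores = scores + rel_1d + rel_x + rel_y  # add spatial-aware biases
    return softmax(scores, axis=-1) @ v
```

The key design point is that the biases are additive on the attention logits, so spatial layout can reshape the attention distribution without changing the content projections.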

Benchmarks

Benchmark | Methodology | Metrics
document-image-classification-on-rvl-cdip | LayoutLMv2 LARGE | Accuracy: 95.64%
document-image-classification-on-rvl-cdip | LayoutLMv2 BASE | Accuracy: 95.25%; Parameters: 200M
key-information-extraction-on-cord | LayoutLMv2 BASE | F1: 94.95
key-information-extraction-on-cord | LayoutLMv2 LARGE | F1: 96.01
key-information-extraction-on-kleister-nda | LayoutLMv2 BASE | F1: 83.3
key-information-extraction-on-kleister-nda | LayoutLMv2 LARGE | F1: 85.2
key-information-extraction-on-sroie | LayoutLMv2 LARGE | F1: 96.61
key-information-extraction-on-sroie | LayoutLMv2 LARGE (excluding OCR mismatch) | F1: 97.81
key-information-extraction-on-sroie | LayoutLMv2 BASE | F1: 96.25
key-value-pair-extraction-on-rfund-en | LayoutLMv2 BASE | Key-value pair F1: 49.06
relation-extraction-on-funsd | LayoutLMv2 LARGE | F1: 70.57
semantic-entity-labeling-on-funsd | LayoutLMv2 LARGE | F1: 84.2
semantic-entity-labeling-on-funsd | LayoutLMv2 BASE | F1: 82.76
visual-question-answering-on-docvqa-test | LayoutLMv2 LARGE | ANLS: 0.8672
visual-question-answering-on-docvqa-test | LayoutLMv2 BASE | ANLS: 0.7808
