LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We made our model and code publicly available at https://aka.ms/layoutlmv2.
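The spatial-aware self-attention mentioned in the abstract augments ordinary scaled dot-product attention scores with learned relative-position biases: one for the 1D token order and one each for the 2D x/y layout coordinates of the text blocks. The sketch below is a minimal PyTorch illustration of that idea, not the released implementation; the class name, the bucket sizes (max_rel_1d, max_rel_2d), and the assumption that each token's box is summarized by integer x/y center coordinates are all illustrative choices.

```python
# Minimal sketch (assumptions noted above) of spatial-aware self-attention:
# attention scores get additive, learned biases indexed by relative 1D token
# distance and relative 2D x/y distance between text blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareSelfAttention(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, max_rel_1d=128, max_rel_2d=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)
        # Learned per-head bias tables over clipped relative distances.
        self.rel_1d_bias = nn.Embedding(2 * max_rel_1d + 1, num_heads)
        self.rel_x_bias = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.rel_y_bias = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.max_rel_1d = max_rel_1d
        self.max_rel_2d = max_rel_2d

    def _relative_index(self, positions, max_rel):
        # positions: (batch, seq_len) LongTensor. Returns pairwise differences
        # clipped to [-max_rel, max_rel] and shifted into [0, 2*max_rel].
        diff = positions[:, None, :] - positions[:, :, None]
        return diff.clamp(-max_rel, max_rel) + max_rel

    def forward(self, hidden_states, token_positions, x_centers, y_centers):
        b, n, _ = hidden_states.shape
        q, k, v = self.qkv(hidden_states).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Standard scaled dot-product scores: (batch, heads, n, n).
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5

        # Add relative-position biases; each lookup is (b, n, n, heads),
        # permuted to (b, heads, n, n) to match the score tensor.
        scores = scores + self.rel_1d_bias(
            self._relative_index(token_positions, self.max_rel_1d)).permute(0, 3, 1, 2)
        scores = scores + self.rel_x_bias(
            self._relative_index(x_centers, self.max_rel_2d)).permute(0, 3, 1, 2)
        scores = scores + self.rel_y_bias(
            self._relative_index(y_centers, self.max_rel_2d)).permute(0, 3, 1, 2)

        attn = F.softmax(scores, dim=-1)
        context = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(context)
```

In practice the x/y inputs would come from the (quantized) bounding-box coordinates produced by an OCR engine; because the biases depend only on relative distances, attention can reflect how far apart two text blocks are on the page rather than their absolute positions.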

