5 months ago

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Yi Tu; Ya Guo; Huan Chen; Jinyang Tang

Abstract

Visually-rich Document Understanding (VrDU) has attracted much research attention over the past years. Models with transformer-based backbones, pre-trained on a large number of document images, have led to significant performance gains in this field. The major challenge is how to fuse the different modalities (text, layout, and image) of the documents in a unified model with different pre-training tasks. This paper focuses on improving text-layout interactions and proposes a novel multi-modal pre-training model, LayoutMask. LayoutMask uses local 1D position, instead of global 1D position, as layout input and has two pre-training objectives: (1) Masked Language Modeling: predicting masked tokens with two novel masking strategies; (2) Masked Position Modeling: predicting masked 2D positions to improve layout representation learning. LayoutMask can enhance the interactions between text and layout modalities in a unified model and produce adaptive and robust multi-modal representations for downstream tasks. Experimental results show that the proposed method achieves state-of-the-art results on a wide variety of VrDU problems, including form understanding, receipt understanding, and document image classification.
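
To make the two layout ideas in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes that a "local" 1D position restarts at 1 inside each text segment while a "global" 1D position counts tokens across the whole document, and that Masked Position Modeling hides selected 2D boxes behind a placeholder and keeps the originals as prediction targets. The function names, the `(x0, y0, x1, y1)` box format, and the all-zero placeholder are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) format


def position_ids(segments: Sequence[Sequence[str]]) -> Tuple[List[int], List[int]]:
    """Return (global_ids, local_ids) for tokens grouped into segments.

    Global ids count tokens across the whole document.
    Local ids restart at 1 in every segment (assumption, see lead-in).
    """
    global_ids: List[int] = []
    local_ids: List[int] = []
    next_global = 1
    for segment in segments:
        for local_index, _token in enumerate(segment, start=1):
            global_ids.append(next_global)
            local_ids.append(local_index)
            next_global += 1
    return global_ids, local_ids


def mask_boxes(
    boxes: Sequence[Box],
    masked_indices: Sequence[int],
    placeholder: Box = (0, 0, 0, 0),  # hypothetical "masked" box value
) -> Tuple[List[Box], List[Box]]:
    """Replace the chosen boxes with a placeholder and return the targets.

    The returned targets are the original boxes a model would be asked
    to recover, in the order given by masked_indices.
    """
    chosen = set(masked_indices)
    masked = [placeholder if i in chosen else box for i, box in enumerate(boxes)]
    targets = [boxes[i] for i in masked_indices]
    return masked, targets


# Two segments, e.g. a field label and its value on a receipt.
segments = [["Total", "amount"], ["12", ".", "50"]]
global_ids, local_ids = position_ids(segments)

boxes = [(10, 10, 50, 20), (55, 10, 110, 20), (200, 10, 215, 20)]
masked_boxes, targets = mask_boxes(boxes, masked_indices=[1])
```

With local ids, the position input no longer encodes a document-wide reading order, so any ordering across segments has to be inferred from the 2D layout and the text rather than read off the 1D index.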

Benchmarks

Benchmark                                Methodology          Metrics
key-information-extraction-on-cord       LayoutMask (base)    F1: 96.99
key-information-extraction-on-cord       LayoutMask (large)   F1: 97.19
named-entity-recognition-ner-on-cord-r   LayoutMask           F1: 81.84
named-entity-recognition-ner-on-funsd-r  LayoutMask           F1: 77.10
semantic-entity-labeling-on-funsd        LayoutMask (large)   F1: 93.20
semantic-entity-labeling-on-funsd        LayoutMask (base)    F1: 92.91
