LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei

Abstract

Multimodal pre-training with text, layout, and image has recently achieved SOTA performance for visually-rich document understanding tasks, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.
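For readers who want to try the released checkpoint, below is a minimal sketch of loading it through the Hugging Face transformers library. It assumes the "microsoft/layoutxlm-base" checkpoint name and the LayoutLMv2 architecture that LayoutXLM builds on; the example words and bounding boxes are illustrative, not from the paper.

```python
# Minimal sketch (not from the paper): loading the released LayoutXLM
# checkpoint with Hugging Face transformers. Assumes the
# "microsoft/layoutxlm-base" checkpoint; the words/boxes below are made up.
# Note: the LayoutLMv2 visual backbone additionally requires detectron2.
from transformers import LayoutLMv2Model, LayoutXLMTokenizer

tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")

# LayoutXLM consumes OCR tokens plus one bounding box per word, with
# coordinates normalized to a 0-1000 page grid (the "layout" modality).
words = ["Nombre:", "Alicia"]                     # OCR output (e.g. a Spanish form)
boxes = [[100, 50, 180, 70], [190, 50, 260, 70]]  # [x0, y0, x1, y1] per word

encoding = tokenizer(words, boxes=boxes, return_tensors="pt")
print(encoding["input_ids"].shape, encoding["bbox"].shape)

# The full model also expects the page image for its visual stream;
# LayoutXLMProcessor bundles the image preprocessing with this tokenizer.
```

For downstream tasks such as XFUND form understanding, this backbone is fine-tuned with task heads, e.g. token classification for semantic entity recognition and a relation-extraction head for key-value pairing.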

Benchmarks

Benchmark | Methodology | Metrics
document-image-classification-on-rvl-cdip | LayoutXLM | Accuracy: 95.21%
key-value-pair-extraction-on-rfund-en | LayoutXLM_base | key-value pair F1: 53.98
key-value-pair-extraction-on-sibr | LayoutXLM | key-value pair F1: 70.45
