LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei

Abstract
Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.
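Since the pre-trained checkpoint is public, a minimal loading sketch is shown below. It assumes the Hugging Face `transformers` release of the weights under `microsoft/layoutxlm-base`, which plugs into the LayoutLMv2 model classes (the visual backbone additionally requires detectron2 and torchvision); the image file, OCR words, and bounding boxes are illustrative.

```python
# Minimal sketch, assuming the "microsoft/layoutxlm-base" checkpoint on the
# Hugging Face Hub; LayoutXLM reuses the LayoutLMv2 architecture, so the
# LayoutLMv2 model/image-processor classes are used with the XLM tokenizer.
from PIL import Image
from transformers import (
    AutoModel,
    LayoutLMv2ImageProcessor,
    LayoutXLMProcessor,
    LayoutXLMTokenizer,
)

# apply_ocr=False: the caller supplies OCR words and bounding boxes
# (boxes are expected on a 0-1000 normalized coordinate scale).
image_processor = LayoutLMv2ImageProcessor(apply_ocr=False)
tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
processor = LayoutXLMProcessor(image_processor, tokenizer)
model = AutoModel.from_pretrained("microsoft/layoutxlm-base")

image = Image.open("form.png").convert("RGB")   # illustrative scanned form page
words = ["Name:", "山田", "太郎"]                # illustrative OCR tokens
boxes = [[60, 40, 150, 70], [160, 40, 220, 70], [230, 40, 290, 70]]

encoding = processor(
    image, words, boxes=boxes,
    truncation=True, padding="max_length", max_length=512,
    return_tensors="pt",
)
outputs = model(**encoding)
# Joint text + image-region embeddings, ready for downstream task heads.
print(outputs.last_hidden_state.shape)
```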
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | LayoutXLM | Accuracy: 95.21% |
| key-value-pair-extraction-on-rfund-en | LayoutXLM_base | key-value pair F1: 53.98 |
| key-value-pair-extraction-on-sibr | LayoutXLM | key-value pair F1: 70.45 |
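These results come from attaching task-specific heads to the same pre-trained encoder. A hedged sketch of that setup with the Hugging Face LayoutLMv2 task classes follows; the label counts are illustrative (16 document classes for RVL-CDIP, BIO question/answer/header tags for form-style entity labeling, which is only the first stage of key-value pair extraction).

```python
# Hedged sketch: task heads on top of the same checkpoint, assuming the
# Hugging Face LayoutLMv2 task classes accept the LayoutXLM weights.
from transformers import (
    LayoutLMv2ForSequenceClassification,  # page-level label, e.g. document image classification
    LayoutLMv2ForTokenClassification,     # word-level BIO labels, e.g. form entity labeling
)

# 16 document classes in RVL-CDIP; 7 BIO tags (question/answer/header + O)
# as used for XFUND-style semantic entity recognition -- both illustrative.
doc_classifier = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=16
)
entity_labeler = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=7
)
```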