8 months ago

Multimodal Representation

Method/Architecture

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei

Abstract

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
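
The word-patch alignment objective described in the abstract can be illustrated with a small sketch of how per-word targets might be built. This is not the authors' implementation: the function names, the 224-pixel image with 16-pixel patches, the rule that a word maps to the patch containing the center of its bounding box, and the choice to skip words whose text is itself masked are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; hypothetical layout


def patch_index(box: Box, image_size: int, patch_size: int) -> int:
    """Return the index of the patch that contains the center of a word box.

    Assumption: a word corresponds to exactly one patch, chosen by box center.
    """
    per_row = image_size // patch_size
    cx = (box[0] + box[2]) // 2
    cy = (box[1] + box[3]) // 2
    # Clamp so boxes touching the right or bottom edge stay inside the grid.
    col = min(cx // patch_size, per_row - 1)
    row = min(cy // patch_size, per_row - 1)
    return row * per_row + col


def wpa_labels(
    boxes: List[Box],
    masked_patches: Set[int],
    masked_words: Set[int],
    image_size: int = 224,  # assumed input resolution
    patch_size: int = 16,   # assumed patch size
) -> List[Optional[int]]:
    """Build one binary target per word.

    1    -> the word's image patch is masked
    0    -> the word's image patch is visible
    None -> the word's text is masked, so it is skipped (an assumption here)
    """
    labels: List[Optional[int]] = []
    for i, box in enumerate(boxes):
        if i in masked_words:
            labels.append(None)
            continue
        p = patch_index(box, image_size, patch_size)
        labels.append(1 if p in masked_patches else 0)
    return labels


boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (100, 100, 120, 120)]
labels = wpa_labels(boxes, masked_patches={1}, masked_words={2})
print(labels)
```

In a training setup, targets like these would feed a binary classification head on top of the text token representations, with the skipped positions excluded from the loss.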



LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking