DiT: Self-supervised Pre-training for Document Image Transformer
Junlong Li; Yiheng Xu; Tengchao Lv; Lei Cui; Cha Zhang; Furu Wei

Abstract
Image Transformer has recently achieved significant progress for natural image understanding, either using supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose **DiT**, a self-supervised pre-trained **D**ocument **I**mage **T**ransformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection as well as text detection for OCR. Experiment results have illustrated that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55) and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.
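As a concrete illustration of using DiT as a backbone for a vision-based Document AI task, the sketch below runs document image classification with a pre-trained DiT checkpoint through the Hugging Face `transformers` library. This is a minimal sketch, not code from the paper; the checkpoint name and the sample image path are assumptions for illustration.

```python
# Minimal sketch: document image classification with a DiT backbone via
# Hugging Face transformers. The checkpoint name below is an assumption
# (a publicly released DiT model fine-tuned on RVL-CDIP's 16 classes).
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification

checkpoint = "microsoft/dit-base-finetuned-rvlcdip"  # assumed checkpoint id

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

# "sample_document.png" is a placeholder path to any scanned document image.
image = Image.open("sample_document.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to a document class label.
predicted_class = model.config.id2label[logits.argmax(-1).item()]
print(predicted_class)  # e.g. "invoice", "letter", "resume"
```

The same pre-trained DiT weights can, in principle, be plugged into detection frameworks (e.g. Cascade R-CNN) for the layout analysis, table detection, and text detection tasks reported below.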
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | DiT-B | Accuracy: 92.11%, Parameters: 87M |
| document-image-classification-on-rvl-cdip | DiT-L | Accuracy: 92.69%, Parameters: 304M |
| document-layout-analysis-on-publaynet-val | DiT-L | Figure: 0.972, List: 0.960, Overall: 0.949, Table: 0.978, Text: 0.944, Title: 0.893 |
| table-detection-on-ctdar | DiT-B (Cascade) | Weighted Average F1-score: 96.14 |
| table-detection-on-ctdar | DiT-L (Cascade) | Weighted Average F1-score: 96.55 |