StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training
Yuechen Yu; Yulin Li; Chengquan Zhang; Xiaoqiang Zhang; Zengyuan Guo; Xiameng Qin; Kun Yao; Junyu Han; Errui Ding; Jingdong Wang

Abstract
In this paper, we present StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking. The proposed method randomly masks image regions according to the bounding box coordinates of words in the text. The pre-training objectives are to reconstruct the pixels of the masked image regions and to predict the corresponding masked tokens simultaneously. As a result, the pre-trained encoder captures more textual semantics than conventional masked image modeling, which typically predicts only masked image patches. Unlike masked multi-modal modeling methods for document image understanding, which rely on both the image and text modalities, StrucTexTv2 takes image-only input and can potentially handle more application scenarios because it needs no OCR pre-processing. Extensive experiments on mainstream document image understanding benchmarks demonstrate the effectiveness of StrucTexTv2. It achieves competitive or new state-of-the-art performance on various downstream tasks, such as image classification, layout analysis, table structure recognition, document OCR, and end-to-end information extraction.
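The text region-level masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_text_regions`, the grayscale list-of-rows image layout, the `(x0, y0, x1, y1)` box convention with exclusive upper bounds, the zero fill value, and the default `mask_ratio=0.3` are all assumptions made for illustration.

```python
import random


def mask_text_regions(image, word_boxes, mask_ratio=0.3, fill_value=0, seed=None):
    """Mask a random subset of word regions in a grayscale image.

    Hypothetical helper, not from the paper's code.
    image:      list of rows, each a list of pixel values (H x W).
    word_boxes: list of (x0, y0, x1, y1) word bounding boxes; x1 and y1 are
                exclusive (an assumed convention).
    Returns (masked_image, masked_indices); the input image is not modified.
    """
    rng = random.Random(seed)
    # Number of words to mask, rounded from the requested ratio.
    num_masked = int(round(len(word_boxes) * mask_ratio))
    # Choose distinct word indices uniformly at random.
    masked_indices = sorted(rng.sample(range(len(word_boxes)), num_masked))

    # Copy each row so the original pixels stay available as the
    # reconstruction target.
    masked_image = [row[:] for row in image]
    for i in masked_indices:
        x0, y0, x1, y1 = word_boxes[i]
        for y in range(y0, y1):
            for x in range(x0, x1):
                masked_image[y][x] = fill_value
    return masked_image, masked_indices


if __name__ == "__main__":
    page = [[255] * 64 for _ in range(32)]  # blank white "page"
    boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10), (0, 20, 10, 30)]
    masked, chosen = mask_text_regions(page, boxes, mask_ratio=0.5, seed=0)
    print("masked word indices:", chosen)
```

In a pre-training setup along the lines the abstract describes, the original pixels inside the chosen boxes would serve as the masked image modeling target and the tokens of the chosen words as the masked language modeling target, while the encoder sees only the masked image.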
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | StrucTexTv2 (small) | Accuracy: 93.4%; Parameters: 28M |
| document-image-classification-on-rvl-cdip | StrucTexTv2 (large) | Accuracy: 94.62%; Parameters: 238M |
| semantic-entity-labeling-on-funsd | StrucTexTv2 (large) | F1: 91.82 |
| semantic-entity-labeling-on-funsd | StrucTexTv2 (small) | F1: 89.23 |
| table-recognition-on-wtw | StrucTexTv2 (small) | F1: 78.9% |