SR Nikitha, Menta Tarun Ram, Sarkar Mausoom

Abstract
The advent of multimodal learning has brought a significant improvement in document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, works in this space are often focused on the textual aspect, using the visual space as auxiliary information. While some works have explored pure vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed to leverage the textual information in document images to improve performance on visual tasks. Our document encoder model DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
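The abstract does not specify the form of the image-text alignment objective. As an illustrative assumption only (not the authors' actual method), alignment between matched document-patch embeddings and OCR-token embeddings is often trained with a symmetric contrastive (InfoNCE-style) loss, which could be sketched as:

```python
import numpy as np

def alignment_loss(patch_emb, text_emb, temperature=0.07):
    """Illustrative symmetric contrastive loss between N matched
    patch/text embedding pairs (row i of each array is a positive pair).
    This is a generic InfoNCE sketch, not DoPTA's published objective."""
    # L2-normalize so the dot product is cosine similarity
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = (p @ t.T) / temperature  # (N, N) similarity matrix

    def xent_diag(l):
        # cross-entropy with targets on the diagonal, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the patch->text and text->patch directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

In such a setup, OCR text is consumed only during pre-training to shape the visual encoder; at inference the encoder operates on pixels alone, consistent with the OCR-free inference claimed in the abstract.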
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | DoPTA | Accuracy: 94.12% Parameters: 85M |
| document-layout-analysis-on-d4la | DoPTA | mAP: 70.72 Parameters: 85M |
| document-layout-analysis-on-publaynet-val | DoPTA-HR | Figure: 0.970 List: 0.957 Overall: 0.949 Table: 0.977 Text: 0.944 Title: 0.895 |