4 months ago

DoPTA: Improving Document Layout Analysis using Patch-Text Alignment

SR Nikitha ; Menta Tarun Ram ; Sarkar Mausoom

Abstract

The advent of multimodal learning has brought a significant improvement in document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, works in this space are often focused on the textual aspect, using the visual space as auxiliary information. While some works have explored purely vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed for leveraging the textual information in document images to improve performance on visual tasks. Our document encoder model, DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
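
The abstract names a patch-text alignment technique but does not give its formulation. The sketch below shows one plausible shape such an objective could take, not the authors' implementation. It assumes that OCR word boxes are available at pre-training time only, that each image patch is softly matched to the words it spatially overlaps, and that a temperature-scaled cosine similarity feeds a soft cross-entropy. All function names, the overlap-based targets, and the temperature value are illustrative assumptions.

```python
import math

# Hypothetical sketch of a patch-text alignment objective.
# Assumption: boxes are (x0, y0, x1, y1) in pixels; embeddings are non-zero vectors.

def patch_boxes(img_w, img_h, patch):
    """Boxes of a regular patch grid, in row-major order."""
    return [(x, y, x + patch, y + patch)
            for y in range(0, img_h, patch)
            for x in range(0, img_w, patch)]

def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def alignment_targets(patches, word_boxes):
    """targets[i][j] = overlap of patch i with word j, divided by patch i's
    total overlap with all words. Rows sum to 1, or are all zeros when the
    patch touches no word."""
    targets = []
    for p in patches:
        areas = [overlap_area(p, w) for w in word_boxes]
        total = sum(areas)
        targets.append([a / total if total else 0.0 for a in areas])
    return targets

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(patch_emb, word_emb, targets, temperature=0.1):
    """Mean soft cross-entropy between each text-bearing patch's similarity
    distribution over words and its overlap-based target distribution."""
    losses = []
    for p, t in zip(patch_emb, targets):
        if sum(t) == 0:
            continue  # patches without text do not contribute
        logits = [cosine(p, w) / temperature for w in word_emb]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(-sum(tj * (lj - log_z) for tj, lj in zip(t, logits)))
    return sum(losses) / len(losses) if losses else 0.0

# Toy usage: a 4x4 image split into 2x2 patches, with two word boxes on the top row.
patches = patch_boxes(4, 4, 2)
words = [(0, 0, 3, 2), (3, 0, 4, 2)]
targets = alignment_targets(patches, words)
# targets == [[1.0, 0.0], [0.5, 0.5], [0.0, 0.0], [0.0, 0.0]]
```

In this toy setup the first patch lies entirely under the first word, the second patch is split evenly between both words, and the bottom row carries no text, so it is skipped by the loss. Because the word boxes enter only through the targets, an encoder trained this way would need no OCR input at inference, which is consistent with the claim in the abstract.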

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| document-image-classification-on-rvl-cdip | DoPTA | Accuracy: 94.12%; Parameters: 85M |
| document-layout-analysis-on-d4la | DoPTA | mAP: 70.72; Model Parameters: 85M |
| document-layout-analysis-on-publaynet-val | DoPTA-HR | Figure: 0.970; List: 0.957; Overall: 0.949; Table: 0.977; Text: 0.944; Title: 0.895 |
