SR Nikitha, Menta Tarun Ram, Sarkar Mausoom

Abstract
The advent of multimodal learning has brought a significant improvement in document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, works in this space are often focused on the textual aspect, using the visual space as auxiliary information. While some works have explored pure vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed to leverage the textual information in document images to improve performance on visual tasks. Our document encoder model DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
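The abstract does not specify the form of the image-text alignment objective. As an illustrative assumption only (not the authors' actual method), alignment between matched document-patch embeddings and OCR-token embeddings is often trained with a symmetric contrastive (InfoNCE-style) loss, which could be sketched as:

```python
import numpy as np

def alignment_loss(patch_emb, text_emb, temperature=0.07):
    """Illustrative symmetric contrastive loss between N matched
    patch/text embedding pairs (row i of each array is a positive pair).
    This is a generic InfoNCE sketch, not DoPTA's published objective."""
    # L2-normalize so the dot product is cosine similarity
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = (p @ t.T) / temperature  # (N, N) similarity matrix

    def xent_diag(l):
        # cross-entropy with targets on the diagonal, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the patch->text and text->patch directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

In such a setup, OCR text is consumed only during pre-training to shape the visual encoder; at inference the encoder operates on pixels alone, consistent with the OCR-free inference claimed in the abstract.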
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | DoPTA | Accuracy: 94.12% Parameters: 85M |
| document-layout-analysis-on-d4la | DoPTA | mAP: 70.72 Parameters: 85M |
| document-layout-analysis-on-publaynet-val | DoPTA-HR | Figure: 0.970 List: 0.957 Overall: 0.949 Table: 0.977 Text: 0.944 Title: 0.895 |